0 for all i. Then

P(B_j | A) = P(A | B_j)P(B_j) / Σ_{i=1}^{n} P(A | B_i)P(B_i)

The proof of Bayes' rule follows exactly as in the preceding discussion. ■

EXAMPLE E  Diamond and Forrester (1979) applied Bayes' rule to the diagnosis of coronary artery disease. A procedure called cardiac fluoroscopy is used to determine whether there is calcification of coronary arteries and thereby to diagnose coronary artery disease. From the test, it can be determined if 0, 1, 2, or 3 coronary arteries are calcified. Let T0, T1, T2, T3 denote these events. Let D+ or D− denote the event that disease is present or absent, respectively. Diamond and Forrester presented the following table, based on medical studies:

i    P(Ti | D+)    P(Ti | D−)
0    .42           .96
1    .24           .02
2    .20           .02
3    .15           .00

According to Bayes' rule,

P(D+ | Ti) = P(Ti | D+)P(D+) / [P(Ti | D+)P(D+) + P(Ti | D−)P(D−)]

Thus, if the initial probabilities P(D+) and P(D−) are known, the probability that a patient has coronary artery disease can be calculated. Let us consider two specific cases. For the first, suppose that a male between the ages of 30 and 39 suffers from nonanginal chest pain. For such a patient, it is known from medical statistics that P(D+) ≈ .05. Suppose that the test shows that no arteries are calcified. From the preceding equation,

P(D+ | T0) = (.42 × .05) / (.42 × .05 + .96 × .95) = .02

It is unlikely that the patient has coronary artery disease. On the other hand, suppose that the test shows that one artery is calcified. Then

P(D+ | T1) = (.24 × .05) / (.24 × .05 + .02 × .95) = .39

Now it is more likely that this patient has coronary artery disease, but by no means certain. As a second case, suppose that the patient is a male between ages 50 and 59 who suffers typical angina. For such a patient, P(D+) = .92. For him, we find that

P(D+ | T0) = (.42 × .92) / (.42 × .92 + .96 × .08) = .83
P(D+ | T1) = (.24 × .92) / (.24 × .92 + .02 × .08) = .99

Comparing the two patients, we see the strong influence of the prior probability, P(D+).
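The posterior probabilities above can be reproduced with a short calculation. The following is a minimal sketch of the two-hypothesis form of Bayes' rule; the function name `posterior` and its argument names are ours, not notation from the text.

```python
def posterior(p_e_given_h, p_e_given_not_h, prior):
    """Two-hypothesis Bayes' rule: P(H | E) from P(E | H), P(E | not H), P(H)."""
    numerator = p_e_given_h * prior
    return numerator / (numerator + p_e_given_not_h * (1 - prior))

# Low-risk patient, P(D+) ≈ .05
print(round(posterior(0.42, 0.96, 0.05), 2))  # T0 observed -> 0.02
print(round(posterior(0.24, 0.02, 0.05), 2))  # T1 observed -> 0.39

# High-risk patient, P(D+) = .92
print(round(posterior(0.42, 0.96, 0.92), 2))  # T0 observed -> 0.83
print(round(posterior(0.24, 0.02, 0.92), 2))  # T1 observed -> 0.99
```

The same function applies to any problem of this two-hypothesis form; for instance, with a hit rate of .14, a false-positive rate of .88, and a prior of .99, it returns .94, the polygraph screening figure discussed below.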
■

EXAMPLE F  Polygraph tests (lie-detector tests) are often routinely administered to employees or prospective employees in sensitive positions. Let + denote the event that the polygraph reading is positive, indicating that the subject is lying; let T denote the event that the subject is telling the truth; and let L denote the event that the subject is lying. According to studies of polygraph reliability (Gastwirth 1987),

P(+ | L) = .88, from which it follows that P(− | L) = .12

and also

P(− | T) = .86, from which it follows that P(+ | T) = .14

In words, if a person is lying, the probability that this is detected by the polygraph is .88, whereas if he is telling the truth, the polygraph indicates that he is telling the truth with probability .86. Now suppose that polygraphs are routinely administered to screen employees for security reasons, and that on a particular question the vast majority of subjects have no reason to lie, so that P(T) = .99, whereas P(L) = .01. A subject produces a positive response on the polygraph. What is the probability that the polygraph is incorrect and that she is in fact telling the truth? We can evaluate this probability with Bayes' rule:

P(T | +) = P(+ | T)P(T) / [P(+ | T)P(T) + P(+ | L)P(L)]
         = (.14)(.99) / [(.14)(.99) + (.88)(.01)]
         = .94

Thus, in screening this population of largely innocent people, 94% of the positive polygraph readings will be in error. Most of those placed under suspicion because of the polygraph result will, in fact, be innocent. This example illustrates some of the dangers in using screening procedures on large populations. ■

Bayes' rule is the fundamental mathematical ingredient of a subjective, or "Bayesian," approach to epistemology, theories of evidence, and theories of learning. According to this point of view, an individual's beliefs about the world can be coded in probabilities. For example, an individual's belief that it will hail tomorrow can be represented by a probability P(H).
This probability varies from individual to individual. In principle, each individual's probability can be ascertained, or elicited, by offering him or her a series of bets at different odds. According to Bayesian theory, our beliefs are modified as we are confronted with evidence. If, initially, my probability for a hypothesis is P(H), after seeing evidence E (e.g., a weather forecast), my probability becomes P(H | E). P(E | H) is often easier to evaluate than P(H | E). In this case, the application of Bayes' rule gives

P(H | E) = P(E | H)P(H) / [P(E | H)P(H) + P(E | H̄)P(H̄)]

where H̄ is the event that H does not hold. This point can be illustrated by the preceding polygraph example. Suppose an investigator is questioning a particular suspect and that the investigator's prior opinion that the suspect is telling the truth is P(T). Then, upon observing a positive polygraph reading, his opinion becomes P(T | +). Note that different investigators will have different prior probabilities P(T) for different suspects, and thus different posterior probabilities.

As appealing as this formulation might be, a long line of research has demonstrated that humans are actually not very good at doing probability calculations in evaluating evidence. For example, Tversky and Kahneman (1974) presented subjects with the following question: "If Linda is a 31-year-old single woman who is outspoken on social issues such as disarmament and equal rights, which of the following statements is more likely to be true?

• Linda is a bank teller.
• Linda is a bank teller and active in the feminist movement."

More than 80% of those questioned chose the second statement, despite Property C of Section 1.3. Even highly trained professionals are not good at doing probability calculations, as illustrated by the following example of Eddy (1982), regarding interpreting the results from mammogram screening.
One hundred physicians were presented with the following information:

• In the absence of any special information, the probability that a woman (of the age and health status of this patient) has breast cancer is 1%.
• If the patient has breast cancer, the probability that the radiologist will correctly diagnose it is 80%.
• If the patient has a benign lesion (no breast cancer), the probability that the radiologist will incorrectly diagnose it as cancer is 10%.

They were then asked, "What is the probability that a patient with a positive mammogram actually has breast cancer?" Ninety-five of the 100 physicians estimated the probability to be about 75%. The correct probability, as given by Bayes' rule, is 7.5%. (You should check this.) So even experts radically overestimate the strength of the evidence provided by a positive outcome on the screening test. Thus the Bayesian probability calculus does not describe the way people actually assimilate evidence. Advocates for Bayesian learning theory might assert that the theory describes the way people "should think." A softer point of view is that Bayesian learning theory is a model for learning, and it has the merit of being a simple model that can be programmed on computers. Probability theory in general, and Bayesian learning theory in particular, are part of the core of artificial intelligence.

1.6 Independence

Intuitively, we would say that two events, A and B, are independent if knowing that one had occurred gave us no information about whether the other had occurred; that is, P(A | B) = P(A) and P(B | A) = P(B). Now, if

P(A) = P(A | B) = P(A ∩ B) / P(B)

then

P(A ∩ B) = P(A)P(B)

We will use this last relation as the definition of independence. Note that it is symmetric in A and in B, and does not require the existence of a conditional probability; that is, P(B) can be 0.

DEFINITION  A and B are said to be independent events if P(A ∩ B) = P(A)P(B). ■

EXAMPLE A  A card is selected randomly from a deck.
Let A denote the event that it is an ace and D the event that it is a diamond. Knowing that the card is an ace gives no information about its suit. Checking formally that the events are independent, we have P(A) = 4/52 = 1/13 and P(D) = 1/4. Also, A ∩ D is the event that the card is the ace of diamonds, and P(A ∩ D) = 1/52. Since P(A)P(D) = (1/13) × (1/4) = 1/52, the events are in fact independent. ■

EXAMPLE B  A system is designed so that it fails only if a unit and a backup unit both fail. Assuming that these failures are independent and that each unit fails with probability p, the system fails with probability p². If, for example, the probability that any unit fails during a given year is .1, then the probability that the system fails is .01, which represents a considerable improvement in reliability. ■

Things become more complicated when we consider more than two events. For example, suppose we know that events A, B, and C are pairwise independent (any two are independent). We would like to be able to say that they are all independent based on the assumption that knowing something about two of the events does not tell us anything about the third; for example, P(C | A ∩ B) = P(C). But as the following example shows, pairwise independence does not guarantee mutual independence.

EXAMPLE C  A fair coin is tossed twice. Let A denote the event of heads on the first toss, B the event of heads on the second toss, and C the event that exactly one head is thrown. A and B are clearly independent, and P(A) = P(B) = P(C) = .5. To see that A and C are independent, we observe that P(C | A) = .5. But

P(A ∩ B ∩ C) = 0 ≠ P(A)P(B)P(C) ■

To encompass situations such as that in Example C, we define a collection of events, A1, A2, ..., An, to be mutually independent if for any subcollection, Ai1, ..., Aim,

P(Ai1 ∩ ··· ∩ Aim) = P(Ai1) ··· P(Aim)

EXAMPLE D  We return to Example B of Section 1.3 (infectivity of AIDS).
Suppose that virus transmissions in 500 acts of intercourse are mutually independent events and that the probability of transmission in any one act is 1/500. Under this model, what is the probability of infection? It is easier to first find the probability of the complement of this event. Let C1, C2, ..., C500 denote the events that virus transmission does not occur during encounters 1, 2, ..., 500. Then the probability of no infection is

P(C1 ∩ C2 ∩ ··· ∩ C500) = (1 − 1/500)^500 = .37

so the probability of infection is 1 − .37 = .63, not 1, which is the answer produced by incorrectly adding probabilities. ■

EXAMPLE E  Consider a circuit with three relays (Figure 1.4). Let Ai denote the event that the ith relay works, and assume that P(Ai) = p and that the relays are mutually independent. If F denotes the event that current flows through the circuit, then F = A3 ∪ (A1 ∩ A2) and, from the addition law and the assumption of independence,

P(F) = P(A3) + P(A1 ∩ A2) − P(A1 ∩ A2 ∩ A3) = p + p² − p³ ■

[FIGURE 1.4  Circuit with three relays: relays 1 and 2 in series, in parallel with relay 3.]

EXAMPLE F  Suppose that a system consists of components connected in a series, so the system fails if any one component fails. If there are n mutually independent components and each fails with probability p, what is the probability that the system will fail? It is easier to find the probability of the complement of this event; the system works if and only if all the components work, and this situation has probability (1 − p)^n. The probability that the system fails is then 1 − (1 − p)^n. For example, if n = 10 and p = .05, the probability that the system works is only .95^10 = .60, and the probability that the system fails is .40. Suppose, instead, that the components are connected in parallel, so the system fails only when all components fail. In this case, the probability that the system fails is only .05^10 = 9.8 × 10⁻¹⁴.
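These reliability calculations can be checked numerically. The following is a minimal sketch; the variable names are ours, and the relay-reliability value q = 0.9 is an illustrative choice, not a number from the text.

```python
p, n = 0.05, 10

# Series system: works only if all n components work, so it fails
# with probability 1 - (1 - p)^n.
p_series_fail = 1 - (1 - p) ** n
print(round(p_series_fail, 2))  # 0.4

# Parallel system: fails only if all n components fail.
p_parallel_fail = p ** n
print(p_parallel_fail)  # ≈ 9.8e-14

# Three-relay circuit of Example E: P(F) = p + p^2 - p^3.
q = 0.9  # probability that a single relay works (illustrative)
print(q + q**2 - q**3)  # ≈ 0.981
```

Note how dramatic the difference is: with the same components, the series system fails 40% of the time while the parallel system essentially never fails, which is why redundancy is the standard tool for improving reliability.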
■

Calculations like those in Example F are made in reliability studies for systems consisting of quite complicated networks of components. The absolutely crucial assumption is that the components are independent of one another. Theoretical studies of the reliability of nuclear power plants have been criticized on the grounds that they incorrectly assume independence of the components.

EXAMPLE G  Matching DNA Fragments
Fragments of DNA are often compared for similarity, for example, across species. A simple way to make a comparison is to count the number of locations, or sites, at which these fragments agree. For example, consider these two sequences, which agree at three sites: fragment 1: AGATCAGT; and fragment 2: TGGATACT. Many such comparisons are made, and to sort the wheat from the chaff, a probability model is often used. A comparison is deemed interesting if the number of matches is much larger than would be expected by chance alone. This requires a chance model; a simple one stipulates that the nucleotide at each site of fragment 1 occurs randomly with probabilities pA1, pG1, pC1, pT1, and that the second fragment is similarly composed with probabilities pA2, ..., pT2. What is the chance that the fragments match at a particular site if in fact the identity of the nucleotide on fragment 1 is independent of that on fragment 2? The match probability can be calculated using the law of total probability:

P(match) = P(match | A on fragment 1)P(A on fragment 1) + ··· + P(match | T on fragment 1)P(T on fragment 1)
         = pA2 pA1 + pG2 pG1 + pC2 pC1 + pT2 pT1

The problem of determining the probability that they match at k out of a total of n sites is discussed later. ■

1.7 Concluding Remarks

This chapter provides a simple axiomatic development of the mathematical theory of probability. Some subtle issues that arise in a careful analysis of infinite sample spaces have been neglected.
Such issues are typically addressed in graduate-level courses in measure theory and probability theory. Certain philosophical questions have also been avoided. One might ask what is meant by the statement "The probability that this coin will land heads up is 1/2." Two commonly advocated views are the frequentist approach and the Bayesian approach. According to the frequentist approach, the statement means that if the experiment were repeated many times, the long-run average number of heads would tend to 1/2. According to the Bayesian approach, the statement is a quantification of the speaker's uncertainty about the outcome of the experiment and thus is a personal or subjective notion; the probability that the coin will land heads up may be different for different speakers, depending on their experience and knowledge of the situation. There has been vigorous and occasionally acrimonious debate among proponents of various versions of these points of view.

In this and ensuing chapters, there are many examples of the use of probability as a model for various phenomena. In any such modeling endeavor, an idealized mathematical theory is hoped to provide an adequate match to characteristics of the phenomenon under study. The standard of adequacy is relative to the field of study and the modeler's goals.

1.8 Problems

1. A coin is tossed three times and the sequence of heads and tails is recorded.
a. List the sample space.
b. List the elements that make up the following events: (1) A = at least two heads, (2) B = the first two tosses are heads, (3) C = the last toss is a tail.
c. List the elements of the following events: (1) Ac, (2) A ∩ B, (3) A ∪ C.

2. Two six-sided dice are thrown sequentially, and the face values that come up are recorded.
a. List the sample space.
b.
List the elements that make up the following events: (1) A = the sum of the two values is at least 5, (2) B = the value of the first die is higher than the value of the second, (3) C = the first value is 4.
c. List the elements of the following events: (1) A ∩ C, (2) B ∪ C, (3) A ∩ (B ∪ C).

3. An urn contains three red balls, two green balls, and one white ball. Three balls are drawn without replacement from the urn, and the colors are noted in sequence. List the sample space. Define events A, B, and C as you wish and find their unions and intersections.

4. Draw Venn diagrams to illustrate De Morgan's laws:
(A ∪ B)c = Ac ∩ Bc
(A ∩ B)c = Ac ∪ Bc

5. Let A and B be arbitrary events. Let C be the event that either A occurs or B occurs, but not both. Express C in terms of A and B using any of the basic operations of union, intersection, and complement.

6. Verify the following extension of the addition rule (a) by an appropriate Venn diagram and (b) by a formal argument using the axioms of probability and the propositions in this chapter.
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C)

7. Prove Bonferroni's inequality:
P(A ∩ B) ≥ P(A) + P(B) − 1

8. Prove that
P(∪_{i=1}^{n} Ai) ≤ Σ_{i=1}^{n} P(Ai)

9. The weather forecaster says that the probability of rain on Saturday is 25% and that the probability of rain on Sunday is 25%. Is the probability of rain during the weekend 50%? Why or why not?

10. Make up another example of Simpson's paradox by changing the numbers in Example B of Section 1.4.

11. The first three digits of a university telephone exchange are 452. If all the sequences of the remaining four digits are equally likely, what is the probability that a randomly selected university phone number contains seven distinct digits?

12. In a game of poker, five players are each dealt 5 cards from a 52-card deck. How many ways are there to deal the cards?

13.
In a game of poker, what is the probability that a five-card hand will contain (a) a straight (five cards in unbroken numerical sequence), (b) four of a kind, and (c) a full house (three cards of one value and two cards of another value)?

14. The four players in a bridge game are each dealt 13 cards. How many ways are there to do this?

15. How many different meals can be made from four kinds of meat, six vegetables, and three starches if a meal consists of one selection from each group?

16. How many different letter arrangements can be obtained from the letters of the word statistically, using all the letters?

17. In acceptance sampling, a purchaser samples 4 items from a lot of 100 and rejects the lot if 1 or more are defective. Graph the probability that the lot is accepted as a function of the percentage of defective items in the lot.

18. A lot of n items contains k defectives, and m are selected randomly and inspected. How should the value of m be chosen so that the probability that at least one defective item turns up is .90? Apply your answer to (a) n = 1000, k = 10, and (b) n = 10,000, k = 100.

19. A committee consists of five Chicanos, two Asians, three African Americans, and two Caucasians.
a. A subcommittee of four is chosen at random. What is the probability that all the ethnic groups are represented on the subcommittee?
b. Answer the question for part (a) if a subcommittee of five is chosen.

20. A deck of 52 cards is shuffled thoroughly. What is the probability that the four aces are all next to each other?

21. A fair coin is tossed five times. What is the probability of getting a sequence of three heads?

22. A standard deck of 52 cards is shuffled thoroughly, and n cards are turned up. What is the probability that a face card turns up? For what value of n is this probability about .5?

23. How many ways are there to place n indistinguishable balls in n urns so that exactly one urn is empty?

24.
If n balls are distributed randomly into k urns, what is the probability that the last urn contains j balls?

25. A woman getting dressed up for a night out is asked by her significant other to wear a red dress, high-heeled sneakers, and a wig. In how many orders can she put on these objects?

26. The game of Mastermind starts in the following way: One player selects four pegs, each peg having six possible colors, and places them in a line. The second player then tries to guess the sequence of colors. What is the probability of guessing correctly?

27. If a five-letter word is formed at random (meaning that all sequences of five letters are equally likely), what is the probability that no letter occurs more than once?

28. How many ways are there to encode the 26-letter English alphabet into 8-bit binary words (sequences of eight 0s and 1s)?

29. A poker player is dealt three spades and two hearts. He discards the two hearts and draws two more cards. What is the probability that he draws two more spades?

30. A group of 60 second graders is to be randomly assigned to two classes of 30 each. (The random assignment is ordered by the school district to ensure against any bias.) Five of the second graders, Marcelle, Sarah, Michelle, Katy, and Camerin, are close friends. What is the probability that they will all be in the same class? What is the probability that exactly four of them will be? What is the probability that Marcelle will be in one class and her friends in the other?

31. Six male and six female dancers perform the Virginia reel. This dance requires that they form a line consisting of six male/female pairs. How many such arrangements are there?

32. A wine taster claims that she can distinguish four vintages of a particular Cabernet. What is the probability that she can do this by merely guessing? (She is confronted with four unlabeled glasses.)

33. An elevator containing five people can stop at any of seven floors.
What is the probability that no two people get off at the same floor? Assume that the occupants act independently and that all floors are equally likely for each occupant.

34. Prove the following identity:
Σ_{k=0}^{n} (n choose k)(m − n choose n − k) = (m choose n)
(Hint: How can each of the summands be interpreted?)

35. Prove the following two identities both algebraically and by interpreting their meaning combinatorially.
a. (n choose r) = (n choose n − r)
b. (n choose r) = (n − 1 choose r − 1) + (n − 1 choose r)

36. What is the coefficient of x³y⁴ in the expansion of (x + y)⁷?

37. What is the coefficient of x²y²z³ in the expansion of (x + y + z)⁷?

38. A child has six blocks, three of which are red and three of which are green. How many patterns can she make by placing them all in a line? If she is given three white blocks, how many total patterns can she make by placing all nine blocks in a line?

39. A monkey at a typewriter types each of the 26 letters of the alphabet exactly once, the order being random.
a. What is the probability that the word Hamlet appears somewhere in the string of letters?
b. How many independent monkey typists would you need in order that the probability that the word appears is at least .90?

40. In how many ways can two octopi shake hands? (There are a number of ways to interpret this question—choose one.)

41. A drawer of socks contains seven black socks, eight blue socks, and nine green socks. Two socks are chosen in the dark.
a. What is the probability that they match?
b. What is the probability that a black pair is chosen?

42. How many ways can 11 boys on a soccer team be grouped into 4 forwards, 3 midfielders, 3 defenders, and 1 goalie?

43. A software development company has three jobs to do. Two of the jobs require three programmers, and the other requires four. If the company employs ten programmers, how many different ways are there to assign them to the jobs?

44. In how many ways can 12 people be divided into three groups of 4 for an evening of bridge?
In how many ways can this be done if the 12 consist of six pairs of partners?

45. Show that if the conditional probabilities exist, then
P(A1 ∩ A2 ∩ ··· ∩ An) = P(A1)P(A2 | A1)P(A3 | A1 ∩ A2) ··· P(An | A1 ∩ A2 ∩ ··· ∩ An−1)

46. Urn A has three red balls and two white balls, and urn B has two red balls and five white balls. A fair coin is tossed. If it lands heads up, a ball is drawn from urn A; otherwise, a ball is drawn from urn B.
a. What is the probability that a red ball is drawn?
b. If a red ball is drawn, what is the probability that the coin landed heads up?

47. Urn A has four red, three blue, and two green balls. Urn B has two red, three blue, and four green balls. A ball is drawn from urn A and put into urn B, and then a ball is drawn from urn B.
a. What is the probability that a red ball is drawn from urn B?
b. If a red ball is drawn from urn B, what is the probability that a red ball was drawn from urn A?

48. An urn contains three red and two white balls. A ball is drawn, and then it and another ball of the same color are placed back in the urn. Finally, a second ball is drawn.
a. What is the probability that the second ball drawn is white?
b. If the second ball drawn is white, what is the probability that the first ball drawn was red?

49. A fair coin is tossed three times.
a. What is the probability of two or more heads given that there was at least one head?
b. What is the probability given that there was at least one tail?

50. Two dice are rolled, and the sum of the face values is six. What is the probability that at least one of the dice came up a three?

51. Answer Problem 50 again given that the sum is less than six.

52. Suppose that 5 cards are dealt from a 52-card deck and the first one is a king. What is the probability of at least one more king?

53. A fire insurance company has high-risk, medium-risk, and low-risk clients, who have, respectively, probabilities .02, .01, and .0025 of filing claims within a given year.
The proportions of the numbers of clients in the three categories are .10, .20, and .70, respectively. What proportion of the claims filed each year come from high-risk clients?

54. This problem introduces a simple meteorological model, more complicated versions of which have been proposed in the meteorological literature. Consider a sequence of days and let Ri denote the event that it rains on day i. Suppose that P(Ri | Ri−1) = α and P(Ri^c | Ri−1^c) = β. Suppose further that only today's weather is relevant to predicting tomorrow's; that is, P(Ri | Ri−1 ∩ Ri−2 ∩ ··· ∩ R0) = P(Ri | Ri−1).
a. If the probability of rain today is p, what is the probability of rain tomorrow?
b. What is the probability of rain the day after tomorrow?
c. What is the probability of rain n days from now? What happens as n approaches infinity?

55. This problem continues Example D of Section 1.5 and concerns occupational mobility.
a. Find P(M1 | M2) and P(L1 | L2).
b. Find the proportions that will be in the three occupational levels in the third generation. To do this, assume that a son's occupational status depends on his father's status, but that given his father's status, it does not depend on his grandfather's.

56. A couple has two children. What is the probability that both are girls given that the oldest is a girl? What is the probability that both are girls given that one of them is a girl?

57. There are three cabinets, A, B, and C, each of which has two drawers. Each drawer contains one coin; A has two gold coins, B has two silver coins, and C has one gold and one silver coin. A cabinet is chosen at random, one drawer is opened, and a silver coin is found. What is the probability that the other drawer in that cabinet contains a silver coin?

58. A teacher tells three boys, Drew, Chris, and Jason, that two of them will have to stay after school to help her clean erasers and that one of them will be able to leave.
She further says that she has made the decision as to who will leave and who will stay at random by rolling a special three-sided Dungeons and Dragons die. Drew wants to leave to play soccer and has a clever idea about how to increase his chances of doing so. He figures that one of Jason and Chris will certainly stay and asks the teacher to tell him the name of one of the two who will stay. Drew's idea is that if, for example, Jason is named, then he and Chris are left and they each have a probability .5 of leaving; similarly, if Chris is named, Drew's probability of leaving is still .5. Thus, by merely asking the teacher a question, Drew will increase his probability of leaving from 1/3 to 1/2. What do you think of this scheme?

59. A box has three coins. One has two heads, one has two tails, and the other is a fair coin with one head and one tail. A coin is chosen at random, is flipped, and comes up heads.
a. What is the probability that the coin chosen is the two-headed coin?
b. What is the probability that if it is thrown another time it will come up heads?
c. Answer part (a) again, supposing that the coin is thrown a second time and comes up heads again.

60. A factory runs three shifts. In a given day, 1% of the items produced by the first shift are defective, 2% of the second shift's items are defective, and 5% of the third shift's items are defective. If the shifts all have the same productivity, what percentage of the items produced in a day are defective? If an item is defective, what is the probability that it was produced by the third shift?

61. Suppose that chips for an integrated circuit are tested and that the probability that they are detected if they are defective is .95, and the probability that they are declared sound if in fact they are sound is .97. If .5% of the chips are faulty, what is the probability that a chip that is declared faulty is sound?

62.
Show that if P(A | E) ≥ P(B | E) and P(A | Ec) ≥ P(B | Ec), then P(A) ≥ P(B).

63. Suppose that the probability of living to be older than 70 is .6 and the probability of living to be older than 80 is .2. If a person reaches her 70th birthday, what is the probability that she will celebrate her 80th?

64. If B is an event, with P(B) > 0, show that the set function Q(A) = P(A | B) satisfies the axioms for a probability measure. Thus, for example,
P(A ∪ C | B) = P(A | B) + P(C | B) − P(A ∩ C | B)

65. Show that if A and B are independent, then A and Bc as well as Ac and Bc are independent.

66. Show that ∅ is independent of A for any A.

67. Show that if A and B are independent, then
P(A ∪ B) = P(A) + P(B) − P(A)P(B)

68. If A is independent of B and B is independent of C, then A is independent of C. Prove this statement or give a counterexample if it is false.

69. If A and B are disjoint, can they be independent?

70. If A ⊂ B, can A and B be independent?

71. Show that if A, B, and C are mutually independent, then A ∩ B and C are independent and A ∪ B and C are independent.

72. Suppose that n components are connected in series. For each unit, there is a backup unit, and the system fails if and only if both a unit and its backup fail. Assuming that all the units are independent and fail with probability p, what is the probability that the system works? For n = 10 and p = .05, compare these results with those of Example F in Section 1.6.

73. A system has n independent units, each of which fails with probability p. The system fails only if k or more of the units fail. What is the probability that the system fails?

74. What is the probability that the following system works if each unit fails independently with probability p (see Figure 1.5)?

[FIGURE 1.5]

75. This problem deals with an elementary aspect of a simple branching process. A population starts with one member; at time t = 1, it either divides with probability p or dies with probability 1 − p.
If it divides, then both of its children behave independently with the same two alternatives at time t = 2. What is the probability that there are no members in the third generation? For what value of p is this probability equal to .5?

76. Here is a simple model of a queue. The queue runs in discrete time (t = 0, 1, 2, ...), and at each unit of time the first person in the queue is served with probability p and, independently, a new person arrives with probability q. At time t = 0, there is one person in the queue. Find the probabilities that there are 0, 1, 2, 3 people in line at time t = 2.

77. A player throws darts at a target. On each trial, independently of the other trials, he hits the bull's-eye with probability .05. How many times should he throw so that his probability of hitting the bull's-eye at least once is .5?

78. This problem introduces some aspects of a simple genetic model. Assume that genes in an organism occur in pairs and that each member of the pair can be either of the types a or A. The possible genotypes of an organism are then AA, Aa, and aa (Aa and aA are equivalent). When two organisms mate, each independently contributes one of its two genes; either one of the pair is transmitted with probability .5.
a. Suppose that the genotypes of the parents are AA and Aa. Find the possible genotypes of their offspring and the corresponding probabilities.
b. Suppose that the probabilities of the genotypes AA, Aa, and aa are p, 2q, and r, respectively, in the first generation. Find the probabilities in the second and third generations, and show that these are the same. This result is called the Hardy-Weinberg Law.
c. Compute the probabilities for the second and third generations as in part (b) but under the additional assumption that the probabilities that an individual of type AA, Aa, or aa survives to mate are u, v, and w, respectively.

79.
Many human diseases are genetically transmitted (for example, hemophilia or Tay-Sachs disease). Here is a simple model for such a disease. The genotype aa is diseased and dies before it mates. The genotype Aa is a carrier but is not diseased. The genotype AA is not a carrier and is not diseased.

a. If two carriers mate, what are the probabilities that their offspring are of each of the three genotypes?

b. If the male offspring of two carriers is not diseased, what is the probability that he is a carrier?

c. Suppose that the nondiseased offspring of part (b) mates with a member of the population for whom no family history is available and who is thus assumed to have probability p of being a carrier (p is a very small number). What are the probabilities that their first offspring has the genotypes AA, Aa, and aa?

d. Suppose that the first offspring of part (c) is not diseased. What is the probability that the father is a carrier in light of this evidence?

80. If a parent has genotype Aa, he transmits either A or a to an offspring (each with a 1/2 chance). The gene he transmits to one offspring is independent of the one he transmits to another. Consider a parent with three children and the following events: A = {children 1 and 2 have the same gene}, B = {children 1 and 3 have the same gene}, C = {children 2 and 3 have the same gene}. Show that these events are pairwise independent but not mutually independent.

CHAPTER 2 Random Variables

2.1 Discrete Random Variables

A random variable is essentially a random number. As motivation for a definition, let us consider an example. A coin is thrown three times, and the sequence of heads and tails is observed; thus,

Ω = {hhh, hht, htt, hth, ttt, tth, thh, tht}

Examples of random variables defined on Ω are (1) the total number of heads, (2) the total number of tails, and (3) the number of heads minus the number of tails.
Each of these is a real-valued function defined on Ω; that is, each is a rule that assigns a real number to every point ω ∈ Ω. Since the outcome in Ω is random, the corresponding number is random as well. In general, a random variable is a function from Ω to the real numbers. Because the outcome of the experiment with sample space Ω is random, the number produced by the function is random as well. It is conventional to denote random variables by italic uppercase letters from the end of the alphabet. For example, we might define X to be the total number of heads in the experiment described above.

A discrete random variable is a random variable that can take on only a finite or at most a countably infinite number of values. The random variable X just defined is a discrete random variable since it can take on only the values 0, 1, 2, and 3. For an example of a random variable that can take on a countably infinite number of values, consider an experiment that consists of tossing a coin until a head turns up and defining Y to be the total number of tosses. The possible values of Y are 1, 2, 3, .... In general, a countably infinite set is one that can be put into one-to-one correspondence with the integers.

If the coin is fair, then each of the outcomes in Ω above has probability 1/8, from which the probabilities that X takes on the values 0, 1, 2, and 3 can be easily computed:

P(X = 0) = 1/8
P(X = 1) = 3/8
P(X = 2) = 3/8
P(X = 3) = 1/8

Generally, the probability measure on the sample space determines the probabilities of the various values of X; if those values are denoted by x_1, x_2, ..., then there is a function p such that p(x_i) = P(X = x_i) and Σ_i p(x_i) = 1. This function is called the probability mass function, or the frequency function, of the random variable X. Figure 2.1 shows a graph of p(x) for the coin tossing experiment. The frequency function describes completely the probability properties of the random variable.
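As a quick check, the frequency function of X can be reproduced by brute-force enumeration of the sample space. This is a sketch in Python, not part of the text; the names are illustrative.

```python
from itertools import product
from fractions import Fraction

# Enumerate the sample space of three fair coin tosses; each of the
# 8 outcomes has probability 1/8. X counts the heads in an outcome.
outcomes = list(product("ht", repeat=3))
p = {k: Fraction(0) for k in range(4)}
for omega in outcomes:
    x = omega.count("h")        # X(omega) = number of heads
    p[x] += Fraction(1, 8)
# p[k] now equals P(X = k): 1/8, 3/8, 3/8, 1/8 for k = 0, 1, 2, 3.
```

Using exact fractions makes the agreement with the displayed probabilities exact rather than approximate.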
FIGURE 2.1 A probability mass function.

In addition to the frequency function, it is sometimes convenient to use the cumulative distribution function (cdf) of a random variable, which is defined to be

F(x) = P(X ≤ x), −∞ < x < ∞

Cumulative distribution functions are usually denoted by uppercase letters and frequency functions by lowercase letters. Figure 2.2 is a graph of the cumulative distribution function of the random variable X of the preceding paragraph. Note that the cdf jumps wherever p(x) > 0 and that the jump at x_i is p(x_i). For example, if 0 < x < 1, F(x) = 1/8; at x = 1, F(x) jumps to F(1) = 4/8 = 1/2. The jump at x = 1 is p(1) = 3/8. The cumulative distribution function is non-decreasing and satisfies lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.

FIGURE 2.2 The cumulative distribution function corresponding to Figure 2.1.

Chapter 3 will cover in detail the joint frequency functions of several random variables defined on the same sample space, but it is useful to define here the concept of independence of random variables. In the case of two discrete random variables X and Y, taking on possible values x_1, x_2, ..., and y_1, y_2, ..., X and Y are said to be independent if, for all i and j,

P(X = x_i and Y = y_j) = P(X = x_i)P(Y = y_j)

The definition is extended to collections of more than two discrete random variables in the obvious way; for example, X, Y, and Z are said to be mutually independent if, for all i, j, and k,

P(X = x_i, Y = y_j, Z = z_k) = P(X = x_i)P(Y = y_j)P(Z = z_k)

We next discuss some common discrete distributions that arise in applications.

2.1.1 Bernoulli Random Variables

A Bernoulli random variable takes on only two values: 0 and 1, with probabilities 1 − p and p, respectively.
Its frequency function is thus

p(1) = p
p(0) = 1 − p
p(x) = 0, if x ≠ 0 and x ≠ 1

An alternative and sometimes useful representation of this function is

p(x) = p^x (1 − p)^(1−x), if x = 0 or x = 1
p(x) = 0, otherwise

If A is an event, then the indicator random variable, I_A, takes on the value 1 if A occurs and the value 0 if A does not occur:

I_A(ω) = 1, if ω ∈ A
I_A(ω) = 0, otherwise

I_A is a Bernoulli random variable. In applications, Bernoulli random variables often occur as indicators. A Bernoulli random variable might take on the value 1 or 0 according to whether a guess was a success or a failure.

2.1.2 The Binomial Distribution

Suppose that n independent experiments, or trials, are performed, where n is a fixed number, and that each experiment results in a "success" with probability p and a "failure" with probability 1 − p. The total number of successes, X, is a binomial random variable with parameters n and p. For example, a coin is tossed 10 times and the total number of heads is counted ("head" is identified with "success").

The probability that X = k, or p(k), can be found in the following way: Any particular sequence of k successes occurs with probability p^k (1 − p)^(n−k), from the multiplication principle. The total number of such sequences is C(n, k), since there are C(n, k) ways to assign k successes to n trials. P(X = k) is thus the probability of any particular sequence times the number of such sequences:

p(k) = C(n, k) p^k (1 − p)^(n−k)

Two binomial frequency functions are shown in Figure 2.3. Note how the shape varies as a function of p.

FIGURE 2.3 Binomial frequency functions, (a) n = 10 and p = .1 and (b) n = 10 and p = .5.

EXAMPLE A Tay-Sachs disease is a rare but fatal disease of genetic origin occurring chiefly in infants and children, especially those of Jewish or eastern European extraction.
If a couple are both carriers of Tay-Sachs disease, a child of theirs has probability .25 of being born with the disease. If such a couple has four children, what is the frequency function for the number of children who will have the disease?

We assume that the four outcomes are independent of each other, so, if X denotes the number of children with the disease, its frequency function is

p(k) = C(4, k) (.25)^k (.75)^(4−k), k = 0, 1, 2, 3, 4

These probabilities are given in the following table:

k    p(k)
0    .316
1    .422
2    .211
3    .047
4    .004  ■

EXAMPLE B If a single bit (0 or 1) is transmitted over a noisy communications channel, it has probability p of being incorrectly transmitted. To improve the reliability of the transmission, the bit is transmitted n times, where n is odd. A decoder at the receiving end, called a majority decoder, decides that the correct message is that carried by a majority of the received bits. Under a simple noise model, each bit is independently subject to being corrupted with the same probability p. The number of bits that is in error, X, is thus a binomial random variable with n trials and probability p of success on each trial (in this case, and frequently elsewhere, the word success is used in a generic sense; here a success is an error).

Suppose, for example, that n = 5 and p = .1. The probability that the message is correctly received is the probability of two or fewer errors, which is

Σ_{k=0}^{2} C(n, k) p^k (1 − p)^(n−k) = (1 − p)^5 + 5p(1 − p)^4 + 10p^2 (1 − p)^3 = .9914

The result is a considerable improvement in reliability. ■

EXAMPLE C DNA Matching
We continue Example G of Section 1.6. There we derived the probability p that two fragments agree at a particular site under the assumption that the nucleotide probabilities were the same at every site and the identities on fragment 1 were independent of those on fragment 2. To find the probability of the total number of matches, further assumptions must be made.
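The calculations in Examples A and B are easy to verify directly from the binomial frequency function. The following is a small Python sketch (the helper name `binom_pmf` is ours, not from the text):

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial frequency function: C(n, k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example A: four children of carrier parents, p = .25 per child.
tay_sachs = [round(binom_pmf(k, 4, 0.25), 3) for k in range(5)]
# -> matches the table: .316, .422, .211, .047, .004

# Example B: majority decoder with n = 5, p = .1; the message is
# received correctly iff there are two or fewer bit errors.
p_correct = sum(binom_pmf(k, 5, 0.1) for k in range(3))
```

Rounding `p_correct` to four places reproduces the .9914 quoted in Example B.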
Suppose that the fragments are each of length n and that the nucleotide identities are independent from site to site as well as between fragments. Thus, the identity of the nucleotide at site 1 of fragment 1 is independent of the identity at site 2, etc. We did not make this assumption in Example G in Section 1.6; in that case, the identity at site 2 could have depended on the identity at site 1, for example. Now, under the current assumption, the two fragments agree at each site with probability p as calculated in Example G of Section 1.6, and agreement is independent from site to site. So, the total number of agreements is a binomial random variable with n trials and probability p of success. ■

A random variable with a binomial distribution can be expressed in terms of independent Bernoulli random variables, a fact that will be quite useful for analyzing some properties of binomial random variables in later chapters of this book. Specifically, let X_1, X_2, ..., X_n be independent Bernoulli random variables with P(X_i = 1) = p. Then Y = X_1 + X_2 + ··· + X_n is a binomial random variable.

2.1.3 The Geometric and Negative Binomial Distributions

The geometric distribution is also constructed from independent Bernoulli trials, but from an infinite sequence. On each trial, a success occurs with probability p, and X is the total number of trials up to and including the first success. So that X = k, there must be k − 1 failures followed by a success. From the independence of the trials, this occurs with probability

p(k) = P(X = k) = (1 − p)^(k−1) p, k = 1, 2, 3, ...

Note that these probabilities sum to 1:

Σ_{k=1}^{∞} (1 − p)^(k−1) p = p Σ_{j=0}^{∞} (1 − p)^j = 1

EXAMPLE A The probability of winning in a certain state lottery is said to be about 1/9. If it is exactly 1/9, the distribution of the number of tickets a person must purchase up to and including the first winning ticket is a geometric random variable with p = 1/9. Figure 2.4 shows the frequency function.
■

FIGURE 2.4 The probability mass function of a geometric random variable with p = 1/9.

The negative binomial distribution arises as a generalization of the geometric distribution. Suppose that a sequence of independent trials, each with probability of success p, is performed until there are r successes in all; let X denote the total number of trials. To find P(X = k), we can argue in the following way: Any particular such sequence has probability p^r (1 − p)^(k−r), from the independence assumption. The last trial is a success, and the remaining r − 1 successes can be assigned to the remaining k − 1 trials in C(k − 1, r − 1) ways. Thus,

P(X = k) = C(k − 1, r − 1) p^r (1 − p)^(k−r)

It is sometimes helpful in analyzing properties of the negative binomial distribution to note that a negative binomial random variable can be expressed as the sum of r independent geometric random variables: the number of trials up to and including the first success plus the number of trials after the first success up to and including the second success, . . . plus the number of trials from the (r − 1)st success up to and including the rth success.

EXAMPLE B Continuing Example A, the distribution of the number of tickets purchased up to and including the second winning ticket is negative binomial:

p(k) = (k − 1) p^2 (1 − p)^(k−2)

This frequency function is shown in Figure 2.5. ■

FIGURE 2.5 The probability mass function of a negative binomial random variable with p = 1/9 and r = 2.

The definitions of the geometric and negative binomial distributions vary slightly from one textbook to another; for example, instead of X being the total number of trials in the definition of the geometric distribution, X is sometimes defined as the total number of failures.
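The sum-of-geometrics representation can be checked numerically: for r = 2, the negative binomial pmf should equal the convolution of two independent geometric pmfs. A Python sketch, with illustrative helper names:

```python
from math import comb

def geom_pmf(k, p):
    """P(X = k): first success occurs on trial k (k = 1, 2, ...)."""
    return (1 - p) ** (k - 1) * p

def neg_binom_pmf(k, r, p):
    """P(X = k): the r-th success occurs on trial k (k = r, r + 1, ...)."""
    return comb(k - 1, r - 1) * p**r * (1 - p) ** (k - r)

p = 1 / 9  # the lottery example
k = 10
# Convolution of two geometric pmfs: P(G1 + G2 = k).
conv = sum(geom_pmf(j, p) * geom_pmf(k - j, p) for j in range(1, k))
# conv agrees with neg_binom_pmf(10, 2, p) up to rounding error.
```

The convolution sum runs over j = 1, ..., k − 1 because each geometric variable is at least 1.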
2.1.4 The Hypergeometric Distribution

The hypergeometric distribution was introduced in Chapter 1 but was not named there. Suppose that an urn contains n balls, of which r are black and n − r are white. Let X denote the number of black balls drawn when taking m balls without replacement. Following the line of reasoning of Examples H and I of Section 1.4.2,

P(X = k) = C(r, k) C(n − r, m − k) / C(n, m)

X is a hypergeometric random variable with parameters r, n, and m.

EXAMPLE A As explained in Example G of Section 1.4.2, a player in the California lottery chooses 6 numbers from 53 and the lottery officials later choose 6 numbers at random. Let X equal the number of matches. Then

P(X = k) = C(6, k) C(47, 6 − k) / C(53, 6)

The probability mass function of X is displayed in the following table:

k      0      1      2      3      4           5           6
p(k)   .468   .401   .117   .014   7.06×10⁻⁴   1.22×10⁻⁵   4.36×10⁻⁸  ■

2.1.5 The Poisson Distribution

The Poisson frequency function with parameter λ (λ > 0) is

P(X = k) = (λ^k / k!) e^(−λ), k = 0, 1, 2, ...

Since e^λ = Σ_{k=0}^{∞} λ^k/k!, it follows that the frequency function sums to 1. Figure 2.6 shows four Poisson frequency functions. Note how the shape varies as a function of λ.

FIGURE 2.6 Poisson frequency functions, (a) λ = .1, (b) λ = 1, (c) λ = 5, (d) λ = 10.

The Poisson distribution can be derived as the limit of a binomial distribution as the number of trials, n, approaches infinity and the probability of success on each trial, p, approaches zero in such a way that np = λ. The binomial frequency function is

p(k) = [n! / (k! (n − k)!)] p^k (1 − p)^(n−k)

Setting np = λ, this expression becomes

p(k) = [n! / (k! (n − k)!)] (λ/n)^k (1 − λ/n)^(n−k)
     = (λ^k / k!) [n! / ((n − k)! n^k)] (1 − λ/n)^n (1 − λ/n)^(−k)

As n → ∞,

λ/n → 0
n! / ((n − k)! n^k) → 1
(1 − λ/n)^n → e^(−λ)

and

(1 − λ/n)^(−k) → 1

We thus have

p(k) → (λ^k e^(−λ)) / k!

which is the Poisson frequency function.

EXAMPLE A Two dice are rolled 100 times, and the number of double sixes, X, is counted. The distribution of X is binomial with n = 100 and p = 1/36 = .0278. Since n is large and p is small, we can approximate the binomial probabilities by Poisson probabilities with λ = np = 2.78. The exact binomial probabilities and the Poisson approximations are shown in the following table:

k     Binomial Probability   Poisson Approximation
0     .0596                  .0620
1     .1705                  .1725
2     .2414                  .2397
3     .2255                  .2221
4     .1564                  .1544
5     .0858                  .0858
6     .0389                  .0398
7     .0149                  .0158
8     .0050                  .0055
9     .0015                  .0017
10    .0004                  .0005
11    .0001                  .0001

The approximation is quite good. ■

The Poisson frequency function can be used to approximate binomial probabilities for large n and small p. This suggests how Poisson distributions can arise in practice. Suppose that X is a random variable that equals the number of times some event occurs in a given interval of time. Heuristically, let us think of dividing the interval into a very large number of small subintervals of equal length, and let us assume that the subintervals are so small that the probability of more than one event in a subinterval is negligible relative to the probability of one event, which is itself very small. Let us also assume that the probability of an event is the same in each subinterval and that whether an event occurs in one subinterval is independent of what happens in the other subintervals. The random variable X is thus nearly a binomial random variable, with the subintervals constituting the trials, and, from the limiting result above, X has nearly a Poisson distribution. The preceding argument is not formal, of course, but merely suggestive. But, in fact, it can be made rigorous.
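The quality of the Poisson approximation in Example A is easy to reproduce numerically. A Python sketch (function names are ours):

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """Binomial probability C(n, k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability lam^k e^(-lam) / k!."""
    return lam**k * exp(-lam) / factorial(k)

# Double sixes in 100 rolls of a pair of dice: n large, p small.
n, p = 100, 1 / 36
lam = n * p  # approximately 2.78, as in Example A

exact = [binom_pmf(k, n, p) for k in range(12)]
approx = [poisson_pmf(k, lam) for k in range(12)]
max_gap = max(abs(e - a) for e, a in zip(exact, approx))
# max_gap is only a few thousandths: the approximation is quite good.
```

The largest discrepancy occurs near the mode, consistent with the table.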
The important assumptions underlying it are (1) what happens in one subinterval is independent of what happens in any other subinterval, (2) the probability of an event is the same in each subinterval, and (3) events do not happen simultaneously. The same kind of argument can be made if we are concerned with an area or a volume of space rather than with an interval on the real line.

The Poisson distribution is of fundamental theoretical and practical importance. It has been used in many areas, including the following:

• The Poisson distribution has been used in the analysis of telephone systems. The number of calls coming into an exchange during a unit of time might be modeled as a Poisson variable if the exchange services a large number of customers who act more or less independently.

• One of the earliest uses of the Poisson distribution was to model the number of alpha particles emitted from a radioactive source during a given period of time.

• The Poisson distribution has been used as a model by insurance companies. For example, the number of freak accidents, such as falls in the shower, for a large population of people in a given time period might be modeled as a Poisson distribution, because the accidents would presumably be rare and independent (provided there was only one person in the shower).

• The Poisson distribution has been used by traffic engineers as a model for light traffic. The number of vehicles that pass a marker on a roadway during a unit of time can be counted. If traffic is light, the individual vehicles act independently of each other. In heavy traffic, however, one vehicle's movement may influence another's, so the approximation might not be good.

EXAMPLE B This amusing classical example is from von Bortkiewicz (1898). The number of fatalities that resulted from being kicked by a horse was recorded for 10 corps of Prussian cavalry over a period of 20 years, giving 200 corps-years worth of data.
These data and the probabilities from a Poisson model with λ = .61 are displayed in the following table. The first column of the table gives the number of deaths per year, ranging from 0 to 4. The second column lists how many times that number of deaths was observed. Thus, for example, in 65 of the 200 corps-years, there was one death. In the third column of the table, the observed numbers are converted to relative frequencies by dividing them by 200. The fourth column of the table gives Poisson probabilities with the parameter λ = .61. In Chapters 8 and 9, we discuss how to choose a parameter value to fit a theoretical probability model to observed frequencies and methods for testing goodness of fit. For now, we will just remark that the value λ = .61 was chosen to match the average number of deaths per year.

Number of Deaths per Year   Observed   Relative Frequency   Poisson Probability
0                           109        .545                 .543
1                           65         .325                 .331
2                           22         .110                 .101
3                           3          .015                 .021
4                           1          .005                 .003  ■

The Poisson distribution often arises from a model called a Poisson process for the distribution of random events in a set S, which is typically one-, two-, or three-dimensional, corresponding to time, a plane, or a volume of space. Basically, this model states that if S_1, S_2, ..., S_n are disjoint subsets of S, then the numbers of events in these subsets, N_1, N_2, ..., N_n, are independent random variables that follow Poisson distributions with parameters λ|S_1|, λ|S_2|, ..., λ|S_n|, where |S_i| denotes the measure of S_i (length, area, or volume, for example). The crucial assumptions here are that events in disjoint subsets are independent of each other and that the Poisson parameter for a subset is proportional to the subset's size. Later, we will see that this latter assumption implies that the average number of events in a subset is proportional to its size.

EXAMPLE C Suppose that an office receives telephone calls as a Poisson process with λ = .5 per min. The number of calls in a 5-min.
interval follows a Poisson distribution with parameter 5λ = 2.5. Thus, the probability of no calls in a 5-min. interval is e^(−2.5) = .082. The probability of exactly one call is 2.5e^(−2.5) = .205. ■

EXAMPLE D Figure 2.7 shows four realizations of a Poisson process with λ = 25 in the unit square, 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. It is interesting that the eye tends to perceive patterns, such as clusters of points and large blank spaces. But by the nature of a Poisson process, the locations of the points have no relationship to one another, and these patterns are entirely a result of chance. ■

FIGURE 2.7 Four realizations of a Poisson process with λ = 25.

2.2 Continuous Random Variables

In applications, we are often interested in random variables that can take on a continuum of values rather than a finite or countably infinite number. For example, a model for the lifetime of an electronic component might be that it is random and can be any positive real number.

For a continuous random variable, the role of the frequency function is taken by a density function, f(x), which has the properties that f(x) ≥ 0, f is piecewise continuous, and ∫_{−∞}^{∞} f(x) dx = 1. If X is a random variable with a density function f, then for any a < b, the probability that X falls in the interval (a, b) is the area under the density function between a and b:

P(a < X < b) = ∫_a^b f(x) dx

EXAMPLE A A uniform random variable on the interval [0, 1] is a model for what we mean when we say "choose a number at random between 0 and 1." Any real number in the interval is a possible outcome, and the probability model should have the property that the probability that X is in any subinterval of length h is equal to h. The following density function does the job:

f(x) = 1 for 0 ≤ x ≤ 1, and f(x) = 0 for x < 0 or x > 1

This is called the uniform density on [0, 1].
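The defining property of the uniform density, that the probability of a subinterval equals its length, can be illustrated by integrating the density numerically. A Python sketch with illustrative names; a real computation would integrate in closed form.

```python
def uniform_density(x):
    """Density of the uniform distribution on [0, 1]."""
    return 1.0 if 0 <= x <= 1 else 0.0

def prob_interval(a, b, n=100_000):
    """P(a < X < b) via midpoint-rule integration of the density."""
    h = (b - a) / n
    return sum(uniform_density(a + (i + 0.5) * h) * h for i in range(n))

# For a subinterval of [0, 1], the probability equals its length:
p = prob_interval(0.2, 0.5)  # close to 0.3
```

Integrating over a range that extends beyond [0, 1] still gives total probability 1, since the density vanishes outside the interval.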
The uniform density on a general interval [a, b] is

f(x) = 1/(b − a) for a ≤ x ≤ b, and f(x) = 0 for x < a or x > b  ■

One consequence of this definition is that the probability that a continuous random variable X takes on any particular value is 0:

P(X = c) = ∫_c^c f(x) dx = 0

Although this may seem strange initially, it is really quite natural. If the uniform random variable of Example A had a positive probability of being any particular number, it should have the same probability for any number in [0, 1], in which case the sum of the probabilities of any countably infinite subset of [0, 1] (for example, the rational numbers) would be infinite. If X is a continuous random variable, then

P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b)

Note that this is not true for a discrete random variable.

For small δ, if f is continuous at x,

P(x − δ/2 ≤ X ≤ x + δ/2) = ∫_{x−δ/2}^{x+δ/2} f(u) du ≈ δ f(x)

Therefore, the probability of a small interval around x is proportional to f(x). It is sometimes useful to employ differential notation: P(x ≤ X ≤ x + dx) = f(x) dx.

The cumulative distribution function of a continuous random variable X is defined in the same way as for a discrete random variable:

F(x) = P(X ≤ x)

F(x) can be expressed in terms of the density function:

F(x) = ∫_{−∞}^x f(u) du

From the fundamental theorem of calculus, if f is continuous at x, f(x) = F′(x). The cdf can be used to evaluate the probability that X falls in an interval:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx = F(b) − F(a)

EXAMPLE B From this definition, we see that the cdf of a uniform random variable on [0, 1] (Example A) is

F(x) = 0 for x ≤ 0; F(x) = x for 0 ≤ x ≤ 1; F(x) = 1 for x ≥ 1  ■

Suppose that F is the cdf of a continuous random variable and is strictly increasing on some interval I, and that F = 0 to the left of I and F = 1 to the right of I; I may be unbounded. Under this assumption, the inverse function F⁻¹ is well defined; x = F⁻¹(y) if y = F(x).
The pth quantile of the distribution F is defined to be that value x_p such that F(x_p) = p, or P(X ≤ x_p) = p. Under the preceding assumption, x_p is uniquely defined as x_p = F⁻¹(p); see Figure 2.8. Special cases are p = 1/2, which corresponds to the median of F; and p = 1/4 and p = 3/4, which correspond to the lower and upper quartiles of F.

FIGURE 2.8 A cdf F and F⁻¹.

EXAMPLE C Suppose that F(x) = x² for 0 ≤ x ≤ 1. This statement is shorthand for the more explicit statement

F(x) = 0 for x ≤ 0; F(x) = x² for 0 ≤ x ≤ 1; F(x) = 1 for x ≥ 1

To find F⁻¹, we solve y = F(x) = x² for x, obtaining x = F⁻¹(y) = √y. The median is F⁻¹(.5) = .707, the lower quartile is F⁻¹(.25) = .50, and the upper quartile is F⁻¹(.75) = .866. ■

EXAMPLE D Value at Risk
Financial firms need to quantify and monitor the risk of their investments. Value at Risk (VaR) is a widely used measure of potential losses. It involves two parameters: a time horizon and a level of confidence. For example, if the VaR of an institution is $10 million with a one-day horizon and a level of confidence of 95%, the interpretation is that there is a 5% chance of losses exceeding $10 million. Such a loss should be anticipated about once in 20 days.

To see how VaR is computed, suppose the current value of the investment is V_0 and the future value is V_1. The return on the investment is R = (V_1 − V_0)/V_0, which is modeled as a continuous random variable with cdf F_R(r). Let the desired level of confidence be denoted by 1 − α. We want to find v*, the VaR. Then

α = P(V_0 − V_1 ≥ v*) = P((V_1 − V_0)/V_0 ≤ −v*/V_0) = F_R(−v*/V_0)

Thus, −v*/V_0 is the α quantile, r_α; and v* = −V_0 r_α. The VaR is minus the current value times the α quantile of the return distribution. ■

We next discuss some density functions that commonly arise in practice.
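Examples C and D lend themselves to a short numerical check. The quantiles below follow directly from F⁻¹(y) = √y; the VaR numbers (V_0 and the α quantile of the return) are hypothetical, chosen only to illustrate the formula v* = −V_0 r_α.

```python
from math import sqrt

# Example C: F(x) = x^2 on [0, 1], so F^{-1}(y) = sqrt(y).
def F_inv(y):
    return sqrt(y)

median = F_inv(0.5)     # about .707
lower_q = F_inv(0.25)   # exactly .50
upper_q = F_inv(0.75)   # about .866

# Example D sketch with hypothetical numbers: V0 is the current value,
# r_alpha an assumed 5% quantile of the return distribution.
V0 = 10_000_000
r_alpha = -0.03
var = -V0 * r_alpha     # VaR = minus current value times the alpha quantile
```

With these assumed numbers, a 5% chance of losing more than $300,000 corresponds to a VaR of $300,000.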
2.2.1 The Exponential Density

The exponential density function is

f(x) = λe^(−λx) for x ≥ 0, and f(x) = 0 for x < 0

Like the Poisson distribution, the exponential density depends on a single parameter, λ > 0, and it would therefore be more accurate to refer to it as the family of exponential densities. Several exponential densities are shown in Figure 2.9. Note that as λ becomes larger, the density drops off more rapidly.

FIGURE 2.9 Exponential densities with λ = .5 (solid), λ = 1 (dotted), and λ = 2 (dashed).

The cumulative distribution function is easily found:

F(x) = ∫_{−∞}^x f(u) du = 1 − e^(−λx) for x ≥ 0, and F(x) = 0 for x < 0

The median of an exponential distribution, η, say, is readily found from the cdf. We solve F(η) = 1/2:

1 − e^(−λη) = 1/2

from which we have

η = (log 2)/λ

The exponential distribution is often used to model lifetimes or waiting times, in which context it is conventional to replace x by t. Suppose that we consider modeling the lifetime of an electronic component as an exponential random variable, that the component has lasted a length of time s, and that we wish to calculate the probability that it will last at least t more time units; that is, we wish to find P(T > t + s | T > s):

P(T > t + s | T > s) = P(T > t + s and T > s) / P(T > s)
                     = P(T > t + s) / P(T > s)
                     = e^(−λ(t+s)) / e^(−λs)
                     = e^(−λt)

We see that the probability that the unit will last t more time units does not depend on s. The exponential distribution is consequently said to be memoryless; it is clearly not a good model for human lifetimes, since the probability that a 16-year-old will live at least 10 more years is not the same as the probability that an 80-year-old will live at least 10 more years. It can be shown that the exponential distribution is characterized by this memoryless property; that is, the memorylessness implies that the distribution is exponential.
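The memoryless property and the formula for the median are easy to confirm numerically. A Python sketch with an arbitrary rate λ = 2:

```python
from math import exp, log

lam = 2.0

def surv(t):
    """Survival function P(T > t) = e^(-lam t) for an exponential T."""
    return exp(-lam * t)

# Memorylessness: P(T > t + s | T > s) = P(T > t) for any s, t >= 0.
s, t = 1.3, 0.7
cond = surv(t + s) / surv(s)   # equals surv(t)

# The median eta satisfies F(eta) = 1/2, i.e. eta = log(2) / lam.
median = log(2) / lam          # so surv(median) = 1/2
```

The same check works for any positive s, t, and λ, which is exactly what memorylessness asserts.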
It may be somewhat surprising that a qualitative characterization, the property of memorylessness, actually determines the form of this density function. The memoryless character of the exponential distribution follows directly from its relation to a Poisson process. Suppose that events occur in time as a Poisson process with parameter λ and that an event occurs at time t0. Let T denote the length of time until the next event. The density of T can be found as follows: P(T > t) = P(no events in (t0, t0 + t)) Since the number of events in the interval (t0, t0 + t), which is of length t, follows a Poisson distribution with parameter λt, this probability is e−λt , and thus T follows an exponential distribution with parameter λ. We can continue in this fashion. Suppose that the next event occurs at time t1; the distribution of time until the third event is again exponential by the same analysis and, from the independence property of the Poisson process, is independent of the length of time between the ﬁrst two events. Generally, the times between events of a Poisson process are independent, identically distributed, exponential random variables. Proteins and other biologically important molecules are regulated in various ways. Some undergo aging and are thus more likely to degrade when they are old than when they are young. If a molecule was not subject to aging, but its chance of degradation was the same at any age, its lifetime would follow an exponential distribution. EXAMPLE A Muscle and nerve cell membranes contain large numbers of channels through which selected ions can pass when the channels are open. Using sophisticated experimental techniques, neurophysiologists can measure the resulting current that ﬂows through a single channel, and experimental records often indicate that a channel opens and closes at seemingly random times. In some cases, simple kinetic models predict that the duration of the open time should be exponentially distributed. 
FIGURE 2.10 Histograms of open times at varying concentrations of suxamethonium and fitted exponential densities. Panels (A) through (F) correspond to concentrations of 10, 20, 50, 100, 200, and 500 μM suxamethonium, with fitted mean open times of 1.18 ms, 873 μs, 573 μs, 307 μs, 117 μs, and 38 μs, respectively.

Marshall et al. (1990) studied the action of a channel-blocking agent (suxamethonium) on a channel (the nicotinic receptor of frog muscle). Figure 2.10 displays histograms of open times and fitted exponential distributions at a range of concentrations of suxamethonium. In this example, the exponential distribution is parametrized as f(t) = (1/τ) exp(−t/τ). τ is thus in units of time, whereas λ is in units of the reciprocal of time. From the figure, we see that the intervals become shorter and that the parameter τ decreases with increasing concentrations of the blocker. It can also be seen, especially at higher concentrations, that very short intervals are not recorded because of limitations of the instrumentation. ■

2.2.2 The Gamma Density

The gamma density function depends on two parameters, α and λ:

g(t) = (λ^α / Γ(α)) t^(α−1) e^(−λt), t ≥ 0

For t < 0, g(t) = 0. So that the density be well defined and integrate to 1, α > 0 and λ > 0. The gamma function, Γ(x), is defined as

Γ(x) = ∫_0^∞ u^(x−1) e^(−u) du, x > 0

Some properties of the gamma function are developed in the problems at the end of this chapter. Note that if α = 1, the gamma density coincides with the exponential density.
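Two properties of the gamma density can be checked directly: it reduces to the exponential density when α = 1, and it integrates to 1. A Python sketch using the standard library's gamma function; the crude midpoint-rule integration is for illustration only.

```python
from math import gamma, exp

def gamma_density(t, alpha, lam):
    """g(t) = lam^alpha / Gamma(alpha) * t^(alpha - 1) * e^(-lam t)."""
    return lam**alpha / gamma(alpha) * t ** (alpha - 1) * exp(-lam * t)

# With alpha = 1 the density is lam * e^(-lam t), the exponential density.
exp_check = gamma_density(1.0, 1.0, 2.0)  # equals 2 e^(-2)

# Crude numerical check that the density integrates to 1 (midpoint rule
# over (0, 40); the tail beyond 40 is negligible for these parameters).
alpha, lam = 5.0, 1.0
dt = 0.001
total = sum(gamma_density((i + 0.5) * dt, alpha, lam) * dt
            for i in range(40_000))
```

For α = 5, λ = 1 the integral over (0, 40) captures essentially all of the mass.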
The parameter α is called a shape parameter for the gamma density, and λ is called a scale parameter. Varying α changes the shape of the density, whereas varying λ corresponds to changing the units of measurement (say, from seconds to minutes) and does not affect the shape of the density. Figure 2.11 shows several gamma densities. Gamma densities provide a fairly flexible class for modeling nonnegative random variables.

[FIGURE 2.11  Gamma densities, (a) α = .5 (solid) and α = 1 (dotted) and (b) α = 5 (solid) and α = 10 (dotted); λ = 1 in all cases.]

EXAMPLE A  The patterns of occurrence of earthquakes in terms of time, space, and magnitude are very erratic, and attempts are sometimes made to construct probabilistic models for these events. The models may be used in a purely descriptive manner or, more ambitiously, for purposes of predicting future occurrences and consequent damage. Figure 2.12 shows the fit of a gamma density and an exponential density to the observed times separating a sequence of small earthquakes (Udias and Rice, 1975). The gamma density clearly gives a better fit (α = .509 and λ = .00115).

[FIGURE 2.12  Fit of gamma density (triangles) and of exponential density (circles) to times between microearthquakes.]

Note that an exponential model for interoccurrence times would be memoryless; that is, knowing that an earthquake had not occurred in the last t time units would tell us nothing about the probability of occurrence during the next s time units. The gamma model does not have this property.
In fact, although we will not show this, the gamma model with these parameter values has the character that there is a large likelihood that the next earthquake will immediately follow any given one, and this likelihood decreases monotonically with time. ■

2.2.3 The Normal Distribution

The normal distribution plays a central role in probability and statistics, for reasons that will become apparent in later chapters of this book. This distribution is also called the Gaussian distribution after Carl Friedrich Gauss, who proposed it as a model for measurement errors. The central limit theorem, which will be discussed in Chapter 6, justifies the use of the normal distribution in many applications. Roughly, the central limit theorem says that if a random variable is the sum of a large number of independent random variables, it is approximately normally distributed. The normal distribution has been used as a model for such diverse phenomena as a person's height, the distribution of IQ scores, and the velocity of a gas molecule.

The density function of the normal distribution depends on two parameters, μ and σ (where −∞ < μ < ∞ and σ > 0):

f(x) = (1 / (σ√(2π))) e^{−(x−μ)² / (2σ²)},  −∞ < x < ∞

The parameters μ and σ are called the mean and standard deviation of the normal density. The cdf cannot be evaluated in closed form from this density function; the integral that defines the cdf cannot be evaluated by an explicit formula but must be found numerically. A problem at the end of this chapter asks you to show that the normal density just given integrates to 1. As shorthand for the statement "X follows a normal distribution with parameters μ and σ," it is convenient to write X ∼ N(μ, σ²). From the form of the density function, we see that the density is symmetric about μ, f(μ − x) = f(μ + x), where it has a maximum, and that the rate at which it falls off is determined by σ. Figure 2.13 shows several normal densities.
Normal densities are sometimes referred to as bell-shaped curves. The special case for which μ = 0 and σ = 1 is called the standard normal density. Its cdf is denoted by Φ and its density by φ (not to be confused with the empty set, ∅). The relationship between the general normal density and the standard normal density will be developed in the next section.

[FIGURE 2.13  Normal densities, μ = 0 and σ = .5 (solid), μ = 0 and σ = 1 (dotted), and μ = 0 and σ = 2 (dashed).]

EXAMPLE A  Acoustic recordings made in the ocean contain substantial background noise. To detect sonar signals of interest, it is useful to characterize this noise as accurately as possible. In the Arctic, much of the background noise is produced by the cracking and straining of ice. Veitch and Wilks (1985) studied recordings of Arctic undersea noise and characterized the noise as a mixture of a Gaussian component and occasional large-amplitude bursts. Figure 2.14 is a trace of one recording that includes a burst. Figure 2.15 shows a Gaussian distribution fit to observations from a "quiet" (nonbursty) period of this noise. ■

[FIGURE 2.14  A record of undersea noise containing a large burst.]

[FIGURE 2.15  A histogram from a "quiet" period of undersea noise with a fitted normal density.]

EXAMPLE B  Turbulent air flow is sometimes modeled as a random process. Since the velocity of the flow at any point is subject to the influence of a large number of random eddies in the neighborhood of that point, one might expect from the central limit theorem that the velocity would be normally distributed. Van Atta and Chen (1968) analyzed data gathered in a wind tunnel. Figure 2.16, taken from their paper, shows a normal distribution fit to 409,600 observations of one component of the velocity; the fit is remarkably good.
■

[FIGURE 2.16  A normal density (solid line) fit to 409,600 measurements of one component of the velocity of a turbulent wind flow. The dots show the values from a histogram.]

EXAMPLE C  S&P 500
The Standard and Poor's 500 is an index of important U.S. stocks; each stock's weight in the index is proportional to its market value. Individuals can invest in mutual funds that track the index. The top panel of Figure 2.17 shows the sequential values of the returns during 2003. The average return during this period was 0.1% per day, and we can see from the figure that daily fluctuations were as large as 3% or 4%. The lower panel of the figure shows a histogram of the returns and a fitted normal density with μ = 0.001 and σ = 0.01.

[FIGURE 2.17  Returns on the S&P 500 during 2003 (top panel) and a normal curve fitted to their histogram (bottom panel). The area to the left of the 0.05 quantile is shaded.]

A financial company could use the fitted normal density in calculating its Value at Risk (see Example D of Section 2.2). Using a time horizon of one day and a confidence level of 95%, the VaR is the current investment in the index, V_0, multiplied by the negative of the 0.05 quantile of the distribution of returns. In this case, the quantile can be calculated to be −0.0165, so the VaR is .0165 V_0. Thus, if V_0 is $10 million, the VaR is $165,000. The company can have 95% "confidence" that its losses will not exceed that amount on a given day. However, it should not be surprised if that amount is exceeded about once in every 20 trading days.
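The VaR calculation above can be sketched with the standard library's `statistics.NormalDist`. Note that with the rounded parameters μ = 0.001 and σ = 0.01 the normal 0.05 quantile comes out near −0.0154; the text's −0.0165 presumably reflects the unrounded fitted values, so treat this as a sketch of the calculation rather than a reproduction of the exact figure.

```python
from statistics import NormalDist

# Fitted daily-return model from the example (rounded parameters).
mu, sigma = 0.001, 0.01
returns = NormalDist(mu, sigma)

# 0.05 quantile of the return distribution.
q05 = returns.inv_cdf(0.05)     # about -0.0154 with these rounded parameters

# VaR at 95% over one day: current investment times the negative quantile.
V0 = 10_000_000
VaR = -q05 * V0
print(round(q05, 4), round(VaR))
```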
■

2.2.4 The Beta Density

The beta density is useful for modeling random variables that are restricted to the interval [0, 1]:

f(u) = (Γ(a + b) / (Γ(a)Γ(b))) u^{a−1} (1 − u)^{b−1},  0 ≤ u ≤ 1

Figure 2.18 shows beta densities for various values of a and b. Note that the case a = b = 1 is the uniform distribution. The beta distribution is important in Bayesian statistics, as you will see later.

[FIGURE 2.18  Beta density functions for various values of a and b: (a) a = 2, b = 2; (b) a = 6, b = 2; (c) a = 6, b = 6; and (d) a = .5, b = 4.]

2.3 Functions of a Random Variable

Suppose that a random variable X has a density function f(x). We often need to find the density function of Y = g(X) for some given function g. For example, X might be the velocity of a particle of mass m, and we might be interested in the probability density function of the particle's kinetic energy, Y = (1/2)mX². Often, the density and cdf of X are denoted by f_X and F_X, and those of Y by f_Y and F_Y.

To illustrate techniques for solving such a problem, we first develop some useful facts about the normal distribution. Suppose that X ∼ N(μ, σ²) and that Y = aX + b, where a > 0. The cumulative distribution function of Y is

F_Y(y) = P(Y ≤ y) = P(aX + b ≤ y) = P(X ≤ (y − b)/a) = F_X((y − b)/a)

Thus,

f_Y(y) = (d/dy) F_X((y − b)/a) = (1/a) f_X((y − b)/a)

Up to this point, we have not used the assumption of normality at all, so this result holds for a general continuous random variable, provided that F_X is appropriately differentiable. If f_X is a normal density function with parameters μ and σ, we find that, after substitution,

f_Y(y) = (1 / (aσ√(2π))) exp( −(1/2) [(y − b − aμ)/(aσ)]² )

From this, we see that Y follows a normal distribution with parameters aμ + b and aσ.
The case for which a < 0 can be analyzed similarly (see Problem 57 in the end-of-chapter problems), yielding the following proposition.

PROPOSITION A
If X ∼ N(μ, σ²) and Y = aX + b, then Y ∼ N(aμ + b, a²σ²). ■

This proposition is quite useful for finding probabilities from the normal distribution. Suppose that X ∼ N(μ, σ²) and we wish to find P(x_0 < X < x_1) for some numbers x_0 and x_1. Consider the random variable

Z = (X − μ)/σ = X/σ − μ/σ

Applying Proposition A with a = 1/σ and b = −μ/σ, we see that Z ∼ N(0, 1); that is, Z follows a standard normal distribution. Therefore,

F_X(x) = P(X ≤ x) = P((X − μ)/σ ≤ (x − μ)/σ) = P(Z ≤ (x − μ)/σ) = Φ((x − μ)/σ)

We thus have

P(x_0 < X < x_1) = F_X(x_1) − F_X(x_0) = Φ((x_1 − μ)/σ) − Φ((x_0 − μ)/σ)

Thus, probabilities for general normal random variables can be evaluated in terms of probabilities for standard normal random variables. This is quite useful, since tables need to be made up only for the standard normal distribution rather than separately for every μ and σ.

EXAMPLE A  Scores on a certain standardized test, IQ scores, are approximately normally distributed with mean μ = 100 and standard deviation σ = 15. Here we are referring to the distribution of scores over a very large population, and we approximate that discrete cumulative distribution function by a normal continuous cumulative distribution function. An individual is selected at random. What is the probability that his score X satisfies 120 < X < 130? We can calculate this probability by using the standard normal distribution as follows:

P(120 < X < 130) = P((120 − 100)/15 < (X − 100)/15 < (130 − 100)/15) = P(1.33 < Z < 2)

where Z follows a standard normal distribution. Using a table of the standard normal distribution (Table 2 of Appendix B), this probability is

P(1.33 < Z < 2) = Φ(2) − Φ(1.33) = .9772 − .9082 = .069

Thus, approximately 7% of the population will have scores in this range.
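The table lookup in Example A can be reproduced in code; `statistics.NormalDist` plays the role of the Φ table. Using the exact standardized endpoint 20/15 rather than the rounded 1.33 shifts the answer slightly:

```python
from statistics import NormalDist

std = NormalDist(0, 1)

# P(120 < X < 130) for X ~ N(100, 15^2), via standardization.
p = std.cdf((130 - 100) / 15) - std.cdf((120 - 100) / 15)
print(round(p, 3))  # 0.068; the table value .069 rounds 20/15 to 1.33

# Equivalently, without standardizing by hand:
X = NormalDist(100, 15)
print(round(X.cdf(130) - X.cdf(120), 3))
```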
■

EXAMPLE B  Let X ∼ N(μ, σ²), and find the probability that X is less than σ away from μ; that is, find P(|X − μ| < σ). This probability is

P(−σ < X − μ < σ) = P(−1 < Z < 1) = Φ(1) − Φ(−1) ≈ .68

P(X > Y) can be found by integrating f over the set {(x, y) | 0 ≤ y ≤ x ≤ 1}:

P(X > Y) = (12/7) ∫_0^1 ∫_0^x (x² + xy) dy dx = 9/14 ■

[FIGURE 3.4  The density function f(x, y) = (12/7)(x² + xy), 0 ≤ x ≤ 1, 0 ≤ y ≤ 1.]

The marginal cdf of X, or F_X, is

F_X(x) = P(X ≤ x) = lim_{y→∞} F(x, y) = ∫_{−∞}^x ∫_{−∞}^∞ f(u, y) dy du

From this, it follows that the density function of X alone, known as the marginal density of X, is

f_X(x) = F′_X(x) = ∫_{−∞}^∞ f(x, y) dy

In the discrete case, the marginal frequency function was found by summing the joint frequency function over the other variable; in the continuous case, it is found by integration.

EXAMPLE B  Continuing Example A, the marginal density of X is

f_X(x) = (12/7) ∫_0^1 (x² + xy) dy = (12/7)(x² + x/2)

A similar calculation shows that the marginal density of Y is f_Y(y) = (12/7)(1/3 + y/2). ■

For several jointly continuous random variables, we can make the obvious generalizations. The joint density function is a function of several variables, and the marginal density functions are found by integration. There are marginal density functions of various dimensions. Suppose that X, Y, and Z are jointly continuous random variables with density function f(x, y, z). The one-dimensional marginal density of X is

f_X(x) = ∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y, z) dy dz

and the two-dimensional marginal density of X and Y is

f_XY(x, y) = ∫_{−∞}^∞ f(x, y, z) dz

EXAMPLE C  Farlie-Morgenstern Family
If F(x) and G(y) are one-dimensional cdfs, it can be shown that, for any α for which |α| ≤ 1,

H(x, y) = F(x)G(y){1 + α[1 − F(x)][1 − G(y)]}

is a bivariate cumulative distribution function.
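The computations with the density f(x, y) = (12/7)(x² + xy) can be checked numerically: the density integrates to 1, P(X > Y) comes out near 9/14, and the marginal formula gives f_X(1/2) = 6/7. A quick midpoint-rule sketch:

```python
from itertools import product

# Joint density from the example: f(x, y) = (12/7)(x^2 + x*y) on the unit square.
def f(x, y):
    return 12 / 7 * (x * x + x * y)

n = 400
h = 1.0 / n
pts = [(i + 0.5) * h for i in range(n)]   # midpoint grid

total = sum(f(x, y) for x, y in product(pts, pts)) * h * h
p_x_gt_y = sum(f(x, y) for x, y in product(pts, pts) if x > y) * h * h
marginal_half = sum(f(0.5, y) for y in pts) * h   # f_X(1/2) = (12/7)(1/4 + 1/4) = 6/7

print(round(total, 3))          # 1.0: the density integrates to 1
print(round(p_x_gt_y, 3))       # close to 9/14 ≈ 0.643
print(round(marginal_half, 3))  # close to 6/7 ≈ 0.857
```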
Because lim_{x→∞} F(x) = lim_{y→∞} G(y) = 1, the marginal distributions are

H(x, ∞) = F(x)
H(∞, y) = G(y)

In this way, an infinite number of different bivariate distributions with given marginals can be constructed. As an example, we will construct bivariate distributions with marginals that are uniform on [0, 1] [F(x) = x, 0 ≤ x ≤ 1, and G(y) = y, 0 ≤ y ≤ 1]. First, with α = −1, we have

H(x, y) = xy[1 − (1 − x)(1 − y)] = x²y + y²x − x²y²,  0 ≤ x ≤ 1, 0 ≤ y ≤ 1

The bivariate density is

h(x, y) = ∂²H(x, y)/∂x∂y = 2x + 2y − 4xy,  0 ≤ x ≤ 1, 0 ≤ y ≤ 1

The density is shown in Figure 3.5. Perhaps you can imagine integrating over y (pushing all the mass onto the x axis) to produce a marginal uniform density for x. Next, if α = 1,

H(x, y) = xy[1 + (1 − x)(1 − y)] = 2xy − x²y − y²x + x²y²,  0 ≤ x ≤ 1, 0 ≤ y ≤ 1

The density is

h(x, y) = 2 − 2x − 2y + 4xy,  0 ≤ x ≤ 1, 0 ≤ y ≤ 1

This density is shown in Figure 3.6. We have just constructed two different bivariate distributions, both of which have uniform marginals. ■

[FIGURE 3.5  The joint density h(x, y) = 2x + 2y − 4xy, where 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, which has uniform marginal densities.]

[FIGURE 3.6  The joint density h(x, y) = 2 − 2x − 2y + 4xy, where 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, which has uniform marginal densities.]

A copula is a joint cumulative distribution function of random variables that have uniform marginal distributions. The functions H(x, y) in the preceding example are copulas. Note that a copula C(u, v) is nondecreasing in each variable, because it is a cdf. Also, P(U ≤ u) = C(u, 1) = u and C(1, v) = v, since the marginal distributions are uniform.
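The claim that both constructed densities have uniform marginals can be verified directly: for each fixed x, the integral over y should equal 1. Since both densities are linear in y, a midpoint sum is essentially exact:

```python
# Check that both Farlie-Morgenstern densities have uniform marginals:
# integrating over y should give f_X(x) = 1 for every x in [0, 1].

def h_minus(x, y):   # alpha = -1
    return 2 * x + 2 * y - 4 * x * y

def h_plus(x, y):    # alpha = +1
    return 2 - 2 * x - 2 * y + 4 * x * y

n = 1000
step = 1.0 / n
ys = [(j + 0.5) * step for j in range(n)]

for density in (h_minus, h_plus):
    for x in (0.1, 0.5, 0.9):
        marginal = sum(density(x, y) for y in ys) * step
        assert abs(marginal - 1.0) < 1e-9   # midpoint rule is exact for linear integrands
print("uniform marginals confirmed")
```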
We will restrict ourselves to copulas that have densities, in which case the density is

c(u, v) = ∂²C(u, v)/∂u∂v ≥ 0

Now, suppose that X and Y are continuous random variables with cdfs F_X(x) and F_Y(y). Then U = F_X(X) and V = F_Y(Y) are uniform random variables (Proposition 2.3C). For a copula C(u, v), consider the joint distribution defined by

F_XY(x, y) = C(F_X(x), F_Y(y))

Since C(F_X(x), 1) = F_X(x), the marginal cdfs corresponding to F_XY are F_X(x) and F_Y(y). Using the chain rule, the corresponding density is

f_XY(x, y) = c(F_X(x), F_Y(y)) f_X(x) f_Y(y)

This construction points out that from the ingredients of two marginal distributions and any copula, a joint distribution with those marginals can be constructed. It is thus clear that the marginal distributions do not determine the joint distribution. The dependence between the random variables is captured in the copula. Copulas are not just academic curiosities; they have been used extensively in financial statistics in recent years to model dependencies in the returns of financial instruments.

EXAMPLE D  Consider the following joint density:

f(x, y) = λ²e^{−λy},  0 ≤ x ≤ y, λ > 0
f(x, y) = 0,  elsewhere

This joint density is plotted in Figure 3.7. To find the marginal densities, it is helpful to draw a picture showing where the density is nonzero to aid in determining the limits of integration (see Figure 3.8).

[FIGURE 3.7  The joint density of Example D.]

[FIGURE 3.8  The joint density of Example D is nonzero over the shaded region of the plane, 0 ≤ x ≤ y.]

First consider the marginal density f_X(x) = ∫_{−∞}^∞ f_XY(x, y) dy. Since f(x, y) = 0 for y < x,

f_X(x) = ∫_x^∞ λ²e^{−λy} dy = λe^{−λx},  x ≥ 0

and we see that the marginal distribution of X is exponential. Next, because f_XY(x, y) = 0 for x ≤ 0 and x > y,

f_Y(y) = ∫_0^y λ²e^{−λy} dx = λ²ye^{−λy},  y ≥ 0

The marginal distribution of Y is a gamma distribution.
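These marginals can be checked by simulation. Anticipating the conditional-density calculation in Section 3.5 (where it is shown that, given Y = y, X is uniform on [0, y], and Y is a sum of two independent exponentials), one simple way to draw from this joint density is:

```python
import random

random.seed(1)
lam = 1.5

# Draw from f(x, y) = lam^2 e^(-lam y), 0 <= x <= y:
# Y from its gamma(2, lam) marginal (sum of two exponentials), then X | Y ~ U[0, Y].
xs, ys = [], []
for _ in range(200_000):
    y = random.expovariate(lam) + random.expovariate(lam)
    x = random.uniform(0.0, y)
    xs.append(x)
    ys.append(y)

# Marginal of X should be exponential(lam): mean 1/lam.
print(round(sum(xs) / len(xs), 2))   # close to 1/lam ≈ 0.67
# Marginal of Y should be gamma(2, lam): mean 2/lam.
print(round(sum(ys) / len(ys), 2))   # close to 2/lam ≈ 1.33
```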
■

In some applications, it is useful to analyze distributions that are uniform over some region of space. For example, in the plane, the random point (X, Y) is uniform over a region R if, for any A ⊂ R,

P((X, Y) ∈ A) = |A| / |R|

where |·| denotes area.

EXAMPLE E  A point is chosen randomly in a disk of radius 1. Since the area of the disk is π,

f(x, y) = 1/π if x² + y² ≤ 1, and 0 otherwise

We can calculate the distribution of R, the distance of the point from the origin. R ≤ r if the point lies in a disk of radius r. Since this disk has area πr²,

F_R(r) = P(R ≤ r) = πr²/π = r²

The density function of R is thus f_R(r) = 2r, 0 ≤ r ≤ 1. Let us now find the marginal density of the x coordinate of the random point:

f_X(x) = ∫_{−∞}^∞ f(x, y) dy = (1/π) ∫_{−√(1−x²)}^{√(1−x²)} dy = (2/π)√(1 − x²),  −1 ≤ x ≤ 1

Note that we chose the limits of integration carefully; outside these limits the joint density is zero. (Draw a picture of the region over which f(x, y) > 0 and indicate the preceding limits of integration.) By symmetry, the marginal density of Y is

f_Y(y) = (2/π)√(1 − y²),  −1 ≤ y ≤ 1 ■

EXAMPLE F  Bivariate Normal Density
The bivariate normal density is given by the complicated expression

f(x, y) = (1 / (2πσ_X σ_Y √(1 − ρ²))) exp{ −(1/(2(1 − ρ²))) [ (x − μ_X)²/σ_X² + (y − μ_Y)²/σ_Y² − 2ρ(x − μ_X)(y − μ_Y)/(σ_X σ_Y) ] }

One of the earliest uses of this bivariate density was as a model for the joint distribution of the heights of fathers and sons. The density depends on five parameters:

−∞ < μ_X < ∞   −∞ < μ_Y < ∞   σ_X > 0   σ_Y > 0   −1 < ρ < 1

The contour lines of the density are the lines in the xy plane on which the joint density is constant. From the preceding equation, we see that f(x, y) is constant if

(x − μ_X)²/σ_X² + (y − μ_Y)²/σ_Y² − 2ρ(x − μ_X)(y − μ_Y)/(σ_X σ_Y) = constant

The locus of such points is an ellipse centered at (μ_X, μ_Y). If ρ = 0, the axes of the ellipse are parallel to the x and y axes, and if ρ ≠ 0, they are tilted.
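Example E lends itself to a quick simulation check (the helper name `disk_point` is ours): sampling uniformly from the disk, the distance from the origin should satisfy F_R(r) = r², and the empirical marginal density of X at 0 should be near 2/π.

```python
import math, random

random.seed(2)

# Draw uniformly from the unit disk by rejection from the enclosing square.
def disk_point():
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1.0:
            return x, y

pts = [disk_point() for _ in range(200_000)]

# F_R(r) = r^2: P(R <= 0.5) should be 0.25.
p_half = sum(1 for x, y in pts if math.hypot(x, y) <= 0.5) / len(pts)
print(round(p_half, 3))       # close to 0.25

# Marginal density of X at 0: f_X(0) = 2/pi ≈ 0.637,
# estimated by counting points in a thin vertical strip.
eps = 0.01
dens0 = sum(1 for x, _ in pts if abs(x) < eps) / len(pts) / (2 * eps)
print(round(dens0, 2))        # close to 0.64
```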
Figure 3.9 shows several bivariate normal densities, and Figure 3.10 shows the corresponding elliptical contours.

[FIGURE 3.9  Bivariate normal densities with μ_X = μ_Y = 0 and σ_X = σ_Y = 1 and (a) ρ = 0, (b) ρ = .3, (c) ρ = .6, (d) ρ = .9.]

[FIGURE 3.10  The elliptical contours of the bivariate normal densities of Figure 3.9.]

The marginal distributions of X and Y are N(μ_X, σ_X²) and N(μ_Y, σ_Y²), respectively, as we will now demonstrate. The marginal density of X is

f_X(x) = ∫_{−∞}^∞ f_XY(x, y) dy

Making the changes of variables u = (x − μ_X)/σ_X and v = (y − μ_Y)/σ_Y gives us

f_X(x) = (1 / (2πσ_X √(1 − ρ²))) ∫_{−∞}^∞ exp( −(1/(2(1 − ρ²))) (u² + v² − 2ρuv) ) dv

To evaluate this integral, we use the technique of completing the square. Using the identity

u² + v² − 2ρuv = (v − ρu)² + u²(1 − ρ²)

we have

f_X(x) = (1 / (2πσ_X √(1 − ρ²))) e^{−u²/2} ∫_{−∞}^∞ exp( −(1/(2(1 − ρ²))) (v − ρu)² ) dv

Finally, recognizing the integral as that of a normal density with mean ρu and variance (1 − ρ²), we obtain

f_X(x) = (1 / (σ_X √(2π))) e^{−(1/2)(x − μ_X)²/σ_X²}

which is a normal density, as was to be shown. Thus, for example, the marginal distributions of x and y in Figure 3.9 are all standard normal, even though the joint distributions of (a)–(d) are quite different from each other. ■

[FIGURE 3.11  A bivariate density that has normal marginals but is not bivariate normal. The contours of the density shown in the xy plane are not elliptical.]

We saw in our discussion of copulas earlier in this section that marginal densities do not determine joint densities. For example, we can take both marginal densities to be normal with parameters μ = 0 and σ = 1 and use the Farlie-Morgenstern
copula with density c(u, v) = 2 − 2u − 2v + 4uv. Denoting the standard normal density and cumulative distribution functions by φ(x) and Φ(x), the bivariate density is

f(x, y) = [2 − 2Φ(x) − 2Φ(y) + 4Φ(x)Φ(y)] φ(x)φ(y)

This density and its contours are shown in Figure 3.11. Note that the contours are not elliptical. This bivariate density has normal marginals, but it is not a bivariate normal density.

3.4 Independent Random Variables

DEFINITION
Random variables X_1, X_2, ..., X_n are said to be independent if their joint cdf factors into the product of their marginal cdfs:

F(x_1, x_2, ..., x_n) = F_{X_1}(x_1) F_{X_2}(x_2) ··· F_{X_n}(x_n)

for all x_1, x_2, ..., x_n. ■

The definition holds for both continuous and discrete random variables. For discrete random variables, it is equivalent to state that their joint frequency function factors; for continuous random variables, it is equivalent to state that their joint density function factors. To see why this is true, consider the case of two jointly continuous random variables, X and Y. If they are independent, then

F(x, y) = F_X(x)F_Y(y)

and taking the second mixed partial derivative makes it clear that the density function factors. On the other hand, if the density function factors, then the joint cdf can be expressed as a product:

F(x, y) = ∫_{−∞}^x ∫_{−∞}^y f_X(u) f_Y(v) dv du = ∫_{−∞}^x f_X(u) du ∫_{−∞}^y f_Y(v) dv = F_X(x)F_Y(y)

It can be shown that the definition implies that if X and Y are independent, then

P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B)

It can also be shown that if g and h are functions, then Z = g(X) and W = h(Y) are independent as well. A sketch of an argument goes like this (the details are beyond the level of this course): We wish to find P(Z ≤ z, W ≤ w). Let A(z) be the set of x such that g(x) ≤ z, and let B(w) be the set of y such that h(y) ≤ w.
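The claim that the Farlie-Morgenstern construction above has standard normal marginals can be checked numerically: integrating f(x, y) over y should recover φ(x). A sketch using only `math.erf` for Φ:

```python
import math

def phi(x):  # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):  # standard normal cdf
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def f(x, y):
    """Farlie-Morgenstern density built on standard normal marginals."""
    return (2 - 2 * Phi(x) - 2 * Phi(y) + 4 * Phi(x) * Phi(y)) * phi(x) * phi(y)

# Integrating out y recovers phi(x), even though f is not bivariate normal.
n, lim = 4000, 8.0
step = 2 * lim / n
for x in (-1.0, 0.0, 1.7):
    marginal = sum(f(x, -lim + (j + 0.5) * step) for j in range(n)) * step
    assert abs(marginal - phi(x)) < 1e-6
print("normal marginals confirmed")
```

Analytically the check reduces to ∫Φ(y)φ(y) dy = 1/2, so the bracketed factor integrates to 1 against φ(y).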
Then

P(Z ≤ z, W ≤ w) = P(X ∈ A(z), Y ∈ B(w)) = P(X ∈ A(z))P(Y ∈ B(w)) = P(Z ≤ z)P(W ≤ w)

EXAMPLE A  Suppose that the point (X, Y) is uniformly distributed on the square S = {(x, y) | −1/2 ≤ x ≤ 1/2, −1/2 ≤ y ≤ 1/2}: f_XY(x, y) = 1 for (x, y) in S and 0 elsewhere. Make a sketch of this square. You can visualize that the marginal distributions of X and Y are uniform on [−1/2, 1/2]. For example, the marginal density at a point x, −1/2 ≤ x ≤ 1/2, is found by integrating (summing) the joint density over the vertical line that meets the horizontal axis at x. Thus, f_X(x) = 1 for −1/2 ≤ x ≤ 1/2, and f_Y(y) = 1 for −1/2 ≤ y ≤ 1/2. The joint density is equal to the product of the marginal densities, so X and Y are independent. You should be able to see from our sketch that knowing the value of X gives no information about the possible values of Y. ■

EXAMPLE B  Now consider rotating the square of the previous example by 45° to form a diamond. Sketch this diamond. From the sketch, you can see that the marginal density of X is nonnegative for −1/2 ≤ x ≤ 1/2 as before, but it is not uniform, and similarly for the marginal density of Y. Thus, for example, f_X(.9) > 0 and f_Y(.9) > 0. But from the sketch you can also see that f_XY(.9, .9) = 0. Thus, X and Y are not independent. Finally, the sketch shows you that knowing the value of X, for example X = .9, constrains the possible values of Y. ■

EXAMPLE C  Farlie-Morgenstern Family
From Example C in Section 3.3, we see that X and Y are independent only if α = 0, since only in this case does the joint cdf H factor into the product of the marginals F and G. ■

EXAMPLE D  If X and Y follow a bivariate normal distribution (Example F from Section 3.3) and ρ = 0, their joint density factors into the product of two normal densities, and therefore X and Y are independent.
■

EXAMPLE E  Suppose that a node in a communications network has the property that if two packets of information arrive within time τ of each other, they "collide" and then have to be retransmitted. If the times of arrival of the two packets are independent and uniform on [0, T], what is the probability that they collide? The times of arrival of the two packets, T_1 and T_2, are independent and uniform on [0, T], so their joint density is the product of the marginals:

f(t_1, t_2) = 1/T²

for t_1 and t_2 in the square with sides [0, T]. Therefore, (T_1, T_2) is uniformly distributed over the square. The probability that the two packets collide is proportional to the area of the shaded strip in Figure 3.12. Each of the unshaded triangles of the figure has area (T − τ)²/2, and thus the area of the shaded strip is T² − (T − τ)². Integrating f(t_1, t_2) over this area gives the desired probability: 1 − (1 − τ/T)². ■

[FIGURE 3.12  The probability that the two packets collide is proportional to the area of the shaded region |t_1 − t_2| < τ.]

3.5 Conditional Distributions

3.5.1 The Discrete Case

If X and Y are jointly distributed discrete random variables, the conditional probability that X = x_i given that Y = y_j is, if p_Y(y_j) > 0,

P(X = x_i | Y = y_j) = P(X = x_i, Y = y_j) / P(Y = y_j) = p_XY(x_i, y_j) / p_Y(y_j)

This probability is defined to be zero if p_Y(y_j) = 0. We will denote this conditional probability by p_{X|Y}(x|y). Note that this function of x is a genuine frequency function, since it is nonnegative and sums to 1, and that p_{Y|X}(y|x) = p_Y(y) if X and Y are independent.
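The collision probability 1 − (1 − τ/T)² is easy to confirm by Monte Carlo: draw the two arrival times independently and count how often they land within τ of each other.

```python
import random

random.seed(3)
T, tau = 1.0, 0.2
exact = 1 - (1 - tau / T) ** 2        # 1 - 0.8^2 = 0.36

n = 500_000
hits = sum(
    1 for _ in range(n)
    if abs(random.uniform(0, T) - random.uniform(0, T)) < tau
)
print(round(hits / n, 3))             # close to the exact value 0.36
```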
EXAMPLE A  We return to the simple discrete distribution considered in Section 3.2, reproducing the table of values here for convenience:

            y
  x     0     1     2     3
  0    1/8   2/8   1/8    0
  1     0    1/8   2/8   1/8

The conditional frequency function of X given Y = 1 is

p_{X|Y}(0|1) = (2/8)/(3/8) = 2/3
p_{X|Y}(1|1) = (1/8)/(3/8) = 1/3 ■

The definition of the conditional frequency function just given can be reexpressed as

p_XY(x, y) = p_{X|Y}(x|y) p_Y(y)

(the multiplication law of Chapter 1). This useful equation gives a relationship between the joint and conditional frequency functions. Summing both sides over all values of y, we have an extremely useful application of the law of total probability:

p_X(x) = Σ_y p_{X|Y}(x|y) p_Y(y)

EXAMPLE B  Suppose that a particle counter is imperfect and independently detects each incoming particle with probability p. If the distribution of the number of incoming particles in a unit of time is a Poisson distribution with parameter λ, what is the distribution of the number of counted particles? Let N denote the true number of particles and X the counted number. From the statement of the problem, the conditional distribution of X given N = n is binomial, with n trials and probability p of success. By the law of total probability,

P(X = k) = Σ_{n=0}^∞ P(N = n) P(X = k | N = n)
         = Σ_{n=k}^∞ (λⁿ e^{−λ} / n!) (n choose k) pᵏ (1 − p)^{n−k}
         = ((λp)ᵏ / k!) e^{−λ} Σ_{n=k}^∞ λ^{n−k} (1 − p)^{n−k} / (n − k)!
         = ((λp)ᵏ / k!) e^{−λ} Σ_{j=0}^∞ λʲ (1 − p)ʲ / j!
         = ((λp)ᵏ / k!) e^{−λ} e^{λ(1−p)}
         = ((λp)ᵏ / k!) e^{−λp}

We see that the distribution of X is a Poisson distribution with parameter λp. This model arises in other applications as well. For example, N might denote the number of traffic accidents in a given time period, with each accident being fatal or nonfatal; X would then be the number of fatal accidents.
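The "thinning" result of Example B can be checked by simulation: generate a Poisson(λ) count, keep each particle with probability p, and compare the empirical distribution of the kept count with the Poisson(λp) formula. (The small `poisson` helper below, Knuth's multiplication method, is our own and is fine for modest λ.)

```python
import math, random

random.seed(4)
lam, p = 6.0, 0.3
n_trials = 200_000

def poisson(mu):
    # Knuth's method: multiply uniforms until the product drops below e^(-mu).
    L, k, prod = math.exp(-mu), 0, random.random()
    while prod > L:
        k += 1
        prod *= random.random()
    return k

counts = []
for _ in range(n_trials):
    n = poisson(lam)
    counts.append(sum(1 for _ in range(n) if random.random() < p))

# X should be Poisson(lam * p): compare P(X = 2) with the formula.
k = 2
emp = sum(1 for c in counts if c == k) / n_trials
theory = (lam * p) ** k / math.factorial(k) * math.exp(-lam * p)
print(round(emp, 3), round(theory, 3))  # the two should nearly agree
```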
■

3.5.2 The Continuous Case

In analogy with the definition in the preceding section, if X and Y are jointly continuous random variables, the conditional density of Y given X is defined to be

f_{Y|X}(y|x) = f_XY(x, y) / f_X(x)

if 0 < f_X(x) < ∞, and 0 otherwise. This definition is in accord with the result to which a differential argument would lead. We would define f_{Y|X}(y|x) dy as P(y ≤ Y ≤ y + dy | x ≤ X ≤ x + dx) and calculate

P(y ≤ Y ≤ y + dy | x ≤ X ≤ x + dx) = f_XY(x, y) dx dy / (f_X(x) dx) = (f_XY(x, y) / f_X(x)) dy

Note that the rightmost expression is interpreted as a function of y, x being fixed. The numerator is the joint density f_XY(x, y), viewed as a function of y for fixed x: you can visualize it as the curve formed by slicing through the joint density function perpendicular to the x axis. The denominator normalizes that curve to have unit area. The joint density can be expressed in terms of the marginal and conditional densities as follows:

f_XY(x, y) = f_{Y|X}(y|x) f_X(x)

Integrating both sides over x allows the marginal density of Y to be expressed as

f_Y(y) = ∫_{−∞}^∞ f_{Y|X}(y|x) f_X(x) dx

which is the law of total probability for the continuous case.

EXAMPLE A  In Example D in Section 3.3, we saw that

f_XY(x, y) = λ²e^{−λy},  0 ≤ x ≤ y
f_X(x) = λe^{−λx},  x ≥ 0
f_Y(y) = λ²ye^{−λy},  y ≥ 0

Let us find the conditional densities. Before doing the formal calculations, it is informative to examine the joint density for x and y, respectively, held constant. If x is constant, the joint density decays exponentially in y for y ≥ x; if y is constant, the joint density is constant for 0 ≤ x ≤ y. (See Figure 3.7.) Now let us find the conditional densities according to the preceding definition. First,

f_{Y|X}(y|x) = λ²e^{−λy} / (λe^{−λx}) = λe^{−λ(y−x)},  y ≥ x

The conditional density of Y given X = x is exponential on the interval [x, ∞).
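The factorization f_XY(x, y) = f_{Y|X}(y|x) f_X(x) gives a generative recipe that can be checked by simulation: draw X exponential, then add an independent exponential to get Y. The marginal of Y should then be gamma(2, λ), with mean 2/λ and variance 2/λ².

```python
import random

random.seed(5)
lam = 2.0
n = 300_000

# Conditional recipe: X exponential(lam); then Y - X exponential(lam),
# independent of X, so Y | X = x is exponential on [x, inf).
ys = []
for _ in range(n):
    x = random.expovariate(lam)
    ys.append(x + random.expovariate(lam))

# Y should be gamma(2, lam): mean 2/lam = 1.0 and variance 2/lam^2 = 0.5.
mean_y = sum(ys) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n
print(round(mean_y, 2))   # close to 1.0
print(round(var_y, 2))    # close to 0.5
```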
Expressing the joint density as

f_XY(x, y) = f_{Y|X}(y|x) f_X(x)

we see that we could generate X and Y according to f_XY in the following way: first generate X as an exponential random variable (f_X), and then generate Y as another exponential random variable (f_{Y|X}) on the interval [x, ∞). From this representation, we see that Y may be interpreted as the sum of two independent exponential random variables and that the distribution of this sum is gamma, a fact that we will derive later by a different method. Now,

f_{X|Y}(x|y) = λ²e^{−λy} / (λ²ye^{−λy}) = 1/y,  0 ≤ x ≤ y

The conditional density of X given Y = y is uniform on the interval [0, y]. Finally, expressing the joint density as

f_XY(x, y) = f_{X|Y}(x|y) f_Y(y)

we see that alternatively we could generate X and Y according to the density f_XY by first generating Y from a gamma density and then generating X uniformly on [0, y]. Another interpretation of this result is that, conditional on the sum of two independent exponential random variables, the first is uniformly distributed. ■

EXAMPLE B  Stereology
In metallography and other applications of quantitative microscopy, aspects of a three-dimensional structure are deduced from studying two-dimensional cross sections. Concepts of probability and statistics play an important role (DeHoff and Rhines 1968). In particular, the following problem arises. Spherical particles are dispersed in a medium (grains in a metal, for example); the density function of the radii of the spheres can be denoted as f_R(r). When the medium is sliced, two-dimensional circular cross sections of the spheres are observed; let the density function of the radii of these circles be denoted by f_X(x). How are these density functions related?

[FIGURE 3.13  A plane slices a sphere of radius r at a distance H from its center, producing a circle of radius x.]
To derive the relationship, we assume that the cross-sectioning plane is chosen at random, fix R = r, and find the conditional density f_{X|R}(x|r). As shown in Figure 3.13, let H denote the distance from the center of the sphere to the planar cross section. By our assumption, H is uniformly distributed on [0, r], and X = √(r² − H²). We can thus find the conditional distribution of X given R = r:

F_{X|R}(x|r) = P(X ≤ x) = P(√(r² − H²) ≤ x) = P(H ≥ √(r² − x²)) = 1 − √(r² − x²)/r,  0 ≤ x ≤ r

Differentiating, we find

f_{X|R}(x|r) = x / (r√(r² − x²)),  0 ≤ x ≤ r

The marginal density of X is, from the law of total probability,

f_X(x) = ∫_{−∞}^∞ f_{X|R}(x|r) f_R(r) dr = ∫_x^∞ (x / (r√(r² − x²))) f_R(r) dr

[The limits of integration are x and ∞, since for r ≤ x, f_{X|R}(x|r) = 0.] This equation is called Abel's equation. In practice, the marginal density f_X can be approximated by making measurements of the radii of cross-sectional circles. Then the problem becomes that of trying to solve for an approximation to f_R, since it is the distribution of spherical radii that is of real interest. ■

EXAMPLE C  Bivariate Normal Density
The conditional density of Y given X is the ratio of the bivariate normal density to a univariate normal density. After some messy algebra, this ratio simplifies to

f_{Y|X}(y|x) = (1 / (σ_Y √(2π(1 − ρ²)))) exp( −(1/2) [y − μ_Y − ρ(σ_Y/σ_X)(x − μ_X)]² / (σ_Y²(1 − ρ²)) )

This is a normal density with mean μ_Y + ρ(x − μ_X)σ_Y/σ_X and variance σ_Y²(1 − ρ²). The conditional distribution of Y given X is a univariate normal distribution. In Example B in Section 2.2.3, the distribution of the velocity of a turbulent wind flow was shown to be approximately normally distributed. Van Atta and Chen (1968) also measured the joint distribution of the velocity at a point at two different times, t and t + τ. Figure 3.14 shows the measured conditional density of the velocity, v_2, at time t + τ, given various values of v_1. There is a systematic departure from the normal distribution.
Therefore, it appears that, even though the velocity is normally distributed, the joint distribution of $v_1$ and $v_2$ is not bivariate normal. This should not be totally unexpected, since the relation of $v_1$ and $v_2$ must conform to equations of motion and continuity, which may not permit a joint normal distribution. ■

Example C illustrates that even when two random variables are marginally normally distributed, they need not be jointly normally distributed.

FIGURE 3.14 The conditional densities of $v_2$ given $v_1$ for selected values of $v_1$, where $v_1$ and $v_2$ are components of the velocity of a turbulent wind flow at different times. The solid lines are the conditional densities according to a normal fit, and the triangles and squares are empirical values determined from 409,600 observations.

EXAMPLE D Rejection Method
The rejection method is commonly used to generate random variables from a density function, especially when the inverse of the cdf cannot be found in closed form and therefore the inverse cdf method (Proposition D in Section 2.3) cannot be used. Suppose that $f$ is a density function that is nonzero on an interval $[a, b]$ and zero outside the interval ($a$ and $b$ may be infinite). Let $M(x)$ be a function such that $M(x) \ge f(x)$ on $[a, b]$, and let
$$m(x) = \frac{M(x)}{\int_a^b M(x)\, dx}$$
be a probability density function. As we will see, the idea is to choose $M$ so that it is easy to generate random variables from $m$. If $[a, b]$ is finite, $m$ can be chosen to be the uniform density on $[a, b]$. The algorithm is as follows:
Step 1: Generate $T$ with the density $m$.
Step 2: Generate $U$, uniform on $[0, 1]$ and independent of $T$. If $M(T) \times U \le f(T)$, then let $X = T$ (accept $T$). Otherwise, go to Step 1 (reject $T$). See Figure 3.15.
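The two steps above are mechanical to code. The following sketch is not part of the text; the target density and envelope are hypothetical choices made for illustration. The target is the beta density $f(x) = 6x(1 - x)$ on $[0, 1]$, and $M(x)$ is taken to be the constant $1.5$ (the maximum of $f$, attained at $x = 1/2$), so that $m$ is the uniform density on $[0, 1]$.

```python
import random

def rejection_sample(f, M_const, a, b, rng=random):
    """Rejection method with a constant envelope M(x) = M_const on [a, b],
    so that m is the uniform density on [a, b]."""
    while True:
        t = rng.uniform(a, b)        # Step 1: generate T with density m
        u = rng.uniform(0.0, 1.0)    # Step 2: U uniform on [0, 1], independent of T
        if M_const * u <= f(t):      # accept T if M(T) * U <= f(T)
            return t
        # otherwise loop back to Step 1 (reject T)

# Illustrative target: the beta density f(x) = 6x(1 - x) on [0, 1].
random.seed(1)
f = lambda x: 6.0 * x * (1.0 - x)
sample = [rejection_sample(f, 1.5, 0.0, 1.0) for _ in range(20000)]
mean = sum(sample) / len(sample)     # the beta(2, 2) density has mean 1/2
```

Note that the acceptance probability here is $1/\int_a^b M(x)\,dx = 1/1.5 = 2/3$, which foreshadows the efficiency remark below: the closer $M$ hugs $f$, the fewer rejections.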
From the figure, we can see that a geometrical interpretation of this algorithm is as follows: Throw a dart that lands uniformly in the rectangular region of the figure. If the dart lands below the curve $f(x)$, record its $x$ coordinate; otherwise, reject it.

FIGURE 3.15 Illustration of the rejection method.

We must check that the density function of the random variable $X$ thus obtained is in fact $f$:
$$P(x \le X \le x + dx) = P(x \le T \le x + dx \mid \text{accept}) = \frac{P(x \le T \le x + dx \text{ and accept})}{P(\text{accept})} = \frac{P(\text{accept} \mid x \le T \le x + dx)\, P(x \le T \le x + dx)}{P(\text{accept})}$$
First consider the numerator of this expression. We have
$$P(\text{accept} \mid x \le T \le x + dx) = P(U \le f(x)/M(x)) = \frac{f(x)}{M(x)}$$
so that the numerator is
$$m(x)\, dx\, \frac{f(x)}{M(x)} = \frac{f(x)\, dx}{\int_a^b M(x)\, dx}$$
From the law of total probability, the denominator is
$$P(\text{accept}) = P(U \le f(T)/M(T)) = \int_a^b \frac{f(t)}{M(t)}\, m(t)\, dt = \frac{1}{\int_a^b M(t)\, dt}$$
where the last two steps follow from the definition of $m$ and the fact that $f$ integrates to 1. Finally, we see that the numerator over the denominator is $f(x)\, dx$. ■

In order for the rejection method to be computationally efficient, the algorithm should lead to acceptance with high probability; otherwise, many rejection steps may have to be looped through for each acceptance.

EXAMPLE E Bayesian Inference
A freshly minted coin has a certain probability of coming up heads if it is spun on its edge, but that probability is not necessarily equal to $\frac{1}{2}$. Now suppose it is spun $n$ times and comes up heads $X$ times. What has been learned about the chance the coin comes up heads? We will go through a Bayesian treatment of this problem. Let $\Theta$ denote the probability that the coin will come up heads. We represent our knowledge about $\Theta$ before gathering any data by a probability density on $[0, 1]$, called the prior density. If we are totally ignorant about $\Theta$, we might represent our state of knowledge by a uniform density on $[0, 1]$: $f_\Theta(\theta) = 1$, $0 \le \theta \le 1$.
We will see how observing $X$ changes our knowledge about $\Theta$, transforming the prior distribution into a "posterior" distribution. Given a value $\theta$, $X$ follows a binomial distribution with $n$ trials and probability of success $\theta$:
$$f_{X|\Theta}(x|\theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x}, \quad x = 0, 1, \ldots, n$$
Now $\Theta$ is continuous and $X$ is discrete, and they have a joint probability distribution:
$$f_{\Theta, X}(\theta, x) = f_{X|\Theta}(x|\theta)\, f_\Theta(\theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x}, \quad x = 0, 1, \ldots, n, \quad 0 \le \theta \le 1$$
This is a density function in $\theta$ and a probability mass function in $x$, an object of a kind we have not seen before. We can calculate the marginal density of $X$ by integrating the joint over $\theta$:
$$f_X(x) = \int_0^1 \binom{n}{x} \theta^x (1 - \theta)^{n - x}\, d\theta$$
We can calculate this formidable-looking integral by a trick. First write
$$\binom{n}{x} = \frac{n!}{x!(n - x)!} = \frac{\Gamma(n + 1)}{\Gamma(x + 1)\Gamma(n - x + 1)}$$
(If $k$ is an integer, $\Gamma(k) = (k - 1)!$; see Problem 49 in Chapter 2.) Recall the beta density (Section 2.2.4),
$$g(u) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, u^{a - 1}(1 - u)^{b - 1}, \quad 0 \le u \le 1$$
The fact that this density integrates to 1 tells us that
$$\int_0^1 u^{a - 1}(1 - u)^{b - 1}\, du = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}$$
Thus, identifying $u$ with $\theta$, $a - 1$ with $x$, and $b - 1$ with $n - x$,
$$f_X(x) = \frac{\Gamma(n + 1)}{\Gamma(x + 1)\Gamma(n - x + 1)} \int_0^1 \theta^x (1 - \theta)^{n - x}\, d\theta = \frac{\Gamma(n + 1)}{\Gamma(x + 1)\Gamma(n - x + 1)} \cdot \frac{\Gamma(x + 1)\Gamma(n - x + 1)}{\Gamma(n + 2)} = \frac{1}{n + 1}, \quad x = 0, 1, \ldots, n$$
Thus, if our prior on $\theta$ is uniform, each outcome of $X$ is a priori equally likely. Our knowledge about $\Theta$ having observed $X = x$ is quantified in the conditional density of $\Theta$ given $X = x$:
$$f_{\Theta|X}(\theta|x) = \frac{f_{\Theta, X}(\theta, x)}{f_X(x)} = (n + 1)\binom{n}{x} \theta^x (1 - \theta)^{n - x} = \frac{(n + 1)\,\Gamma(n + 1)}{\Gamma(x + 1)\Gamma(n - x + 1)}\, \theta^x (1 - \theta)^{n - x} = \frac{\Gamma(n + 2)}{\Gamma(x + 1)\Gamma(n - x + 1)}\, \theta^x (1 - \theta)^{n - x}$$
The relationship $x\Gamma(x) = \Gamma(x + 1)$ has been used in the last step (see Problem 49, Chapter 2). Bear in mind that for each fixed $x$, this is a function of $\theta$, the posterior density of $\theta$ given $x$, which quantifies our opinion about $\Theta$ having observed $x$ heads in $n$ spins. The posterior density is a beta density with parameters $a = x + 1$, $b = n - x + 1$.
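As a computational sketch (not part of the text), posterior probabilities under this beta posterior can be evaluated exactly with integer arithmetic: for integer $a$ and $b$, a beta$(a, b)$ random variable is the $a$th order statistic of $a + b - 1$ uniforms, so its cdf is a binomial tail sum. The particular numbers $x = 13$, $n = 20$ below are illustrative.

```python
from math import comb

def beta_cdf_int(p, a, b):
    """P(Beta(a, b) <= p) for integer a, b >= 1.  Beta(a, b) is the a-th
    order statistic of n = a + b - 1 uniforms, so its cdf equals the
    probability of at least a successes in n Bernoulli(p) trials."""
    n = a + b - 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, n + 1))

# Posterior after observing x heads in n spins under a uniform prior:
# beta with a = x + 1, b = n - x + 1.  Illustrative values: x = 13, n = 20.
x, n = 13, 20
a, b = x + 1, n - x + 1
p_greater_half = 1 - beta_cdf_int(0.5, a, b)   # P(Theta > 1/2 | X = 13)
```

With these numbers the posterior probability that $\theta$ exceeds $\frac{1}{2}$ comes out to about 0.905, which matches the rounded value quoted in the coin-spinning example.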
A one-Euro coin has the number 1 on one face and a bird on the other face. I spun such a coin 20 times; the 1 came up 13 of the 20 times. Using the uniform prior $\Theta \sim U[0, 1]$, the posterior is beta with $a = x + 1 = 14$ and $b = n - x + 1 = 8$. Figure 3.16 shows this posterior, which represents my opinion if I was initially totally ignorant of $\theta$ and then observed thirteen 1s in 20 spins. From the figure, it is extremely unlikely that $\theta < 0.25$, for example. My probability, or belief, that $\theta$ is greater than $\frac{1}{2}$ is the area under the density to the right of $\frac{1}{2}$, which can be calculated to be 0.91. I can thus be 91% certain that $\theta$ is greater than $\frac{1}{2}$.

We need to distinguish between the steps of the preceding probability calculations, which are mathematically straightforward, and the interpretation of the results, which goes beyond the mathematics and requires a model that belief can be expressed in terms of probability and revised using the laws of probability. See Figure 3.16. ■

FIGURE 3.16 Beta density with parameters $a = 14$ and $b = 8$.

3.6 Functions of Jointly Distributed Random Variables
The distribution of a function of a single random variable was developed in Section 2.3. In this section, that development is extended to several random variables, but first some important special cases are considered.

3.6.1 Sums and Quotients
Suppose that $X$ and $Y$ are discrete random variables taking values on the integers and having the joint frequency function $p(x, y)$, and let $Z = X + Y$. To find the frequency function of $Z$, we note that $Z = z$ whenever $X = x$ and $Y = z - x$, where $x$ is an integer. The probability that $Z = z$ is thus the sum over all $x$ of these joint probabilities, or
$$p_Z(z) = \sum_{x = -\infty}^{\infty} p(x, z - x)$$
If $X$ and $Y$ are independent, so that $p(x, y) = p_X(x)p_Y(y)$, then
$$p_Z(z) = \sum_{x = -\infty}^{\infty} p_X(x)\, p_Y(z - x)$$
This sum is called the convolution of the sequences $p_X$ and $p_Y$.
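The convolution sum is mechanical to compute for any pair of frequency functions. A short sketch (not part of the text), using two fair dice as a hypothetical example:

```python
def convolve_pmf(pX, pY):
    """Convolution of two frequency functions given as {value: probability}
    dicts: p_Z(z) = sum over x of p_X(x) * p_Y(z - x)."""
    pZ = {}
    for x, px in pX.items():
        for y, py in pY.items():
            pZ[x + y] = pZ.get(x + y, 0.0) + px * py
    return pZ

die = {k: 1 / 6 for k in range(1, 7)}   # frequency function of a fair die
pZ = convolve_pmf(die, die)             # frequency function of the sum of two dice
```

For two dice this reproduces the familiar triangular distribution on $2, \ldots, 12$, with $p_Z(7) = 6/36$ the largest value.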
FIGURE 3.17 $X + Y \le z$ whenever $(X, Y)$ is in the shaded region $R_z$.

The continuous case is very similar. Supposing that $X$ and $Y$ are continuous random variables, we first find the cdf of $Z$ and then differentiate to find the density. Since $Z \le z$ whenever the point $(X, Y)$ is in the shaded region $R_z$ shown in Figure 3.17, we have
$$F_Z(z) = \iint_{R_z} f(x, y)\, dx\, dy = \int_{-\infty}^{\infty} \int_{-\infty}^{z - x} f(x, y)\, dy\, dx$$
In the inner integral, we make the change of variables $y = v - x$ to obtain
$$F_Z(z) = \int_{-\infty}^{\infty} \int_{-\infty}^{z} f(x, v - x)\, dv\, dx = \int_{-\infty}^{z} \int_{-\infty}^{\infty} f(x, v - x)\, dx\, dv$$
Differentiating, we have, if $\int_{-\infty}^{\infty} f(x, z - x)\, dx$ is continuous at $z$,
$$f_Z(z) = \int_{-\infty}^{\infty} f(x, z - x)\, dx$$
which is the obvious analogue of the result for the discrete case. If $X$ and $Y$ are independent,
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx$$
This integral is called the convolution of the functions $f_X$ and $f_Y$.

EXAMPLE A
Suppose that the lifetime of a component is exponentially distributed and that an identical and independent backup component is available. The system operates as long as one of the components is functional; therefore, the distribution of the life of the system is that of the sum of two independent exponential random variables. Let $T_1$ and $T_2$ be independent exponentials with parameter $\lambda$, and let $S = T_1 + T_2$. Then
$$f_S(s) = \int_0^s \lambda e^{-\lambda t}\, \lambda e^{-\lambda(s - t)}\, dt$$
It is important to note the limits of integration: beyond them, one of the two component densities is zero. When dealing with densities that are nonzero only on some subset of the real line, we must always be careful. Continuing, we have
$$f_S(s) = \lambda^2 \int_0^s e^{-\lambda s}\, dt = \lambda^2 s e^{-\lambda s}$$
This is a gamma distribution with parameters 2 and $\lambda$ (compare with Example A in Section 3.5.2). ■

Let us next consider the quotient of two continuous random variables. The derivation is very similar to that for the sum of such variables, given previously: We first find the cdf and then differentiate to find the density.
Suppose that $X$ and $Y$ are continuous with joint density function $f$ and that $Z = Y/X$. Then $F_Z(z) = P(Z \le z)$ is the probability of the set of $(x, y)$ such that $y/x \le z$. If $x > 0$, this is the set $y \le xz$; if $x < 0$, it is the set $y \ge xz$. Thus,
$$F_Z(z) = \int_{-\infty}^{0} \int_{xz}^{\infty} f(x, y)\, dy\, dx + \int_{0}^{\infty} \int_{-\infty}^{xz} f(x, y)\, dy\, dx$$
To remove the dependence of the inner integrals on $x$, we make the change of variables $y = xv$ in the inner integrals and obtain
$$F_Z(z) = \int_{-\infty}^{0} \int_{z}^{-\infty} x f(x, xv)\, dv\, dx + \int_{0}^{\infty} \int_{-\infty}^{z} x f(x, xv)\, dv\, dx = \int_{-\infty}^{0} \int_{-\infty}^{z} (-x) f(x, xv)\, dv\, dx + \int_{0}^{\infty} \int_{-\infty}^{z} x f(x, xv)\, dv\, dx = \int_{-\infty}^{z} \int_{-\infty}^{\infty} |x|\, f(x, xv)\, dx\, dv$$
Finally, differentiating (again under an assumption of continuity), we find
$$f_Z(z) = \int_{-\infty}^{\infty} |x|\, f(x, xz)\, dx$$
In particular, if $X$ and $Y$ are independent,
$$f_Z(z) = \int_{-\infty}^{\infty} |x|\, f_X(x)\, f_Y(xz)\, dx$$

EXAMPLE B
Suppose that $X$ and $Y$ are independent standard normal random variables and that $Z = Y/X$. We then have
$$f_Z(z) = \int_{-\infty}^{\infty} \frac{|x|}{2\pi}\, e^{-x^2/2}\, e^{-x^2 z^2/2}\, dx$$
From the symmetry of the integrand about zero,
$$f_Z(z) = \frac{1}{\pi} \int_0^{\infty} x e^{-x^2(z^2 + 1)/2}\, dx$$
To simplify this, we make the change of variables $u = x^2$ to obtain
$$f_Z(z) = \frac{1}{2\pi} \int_0^{\infty} e^{-u(z^2 + 1)/2}\, du$$
Next, using the fact that $\int_0^{\infty} \lambda \exp(-\lambda x)\, dx = 1$ with $\lambda = (z^2 + 1)/2$, we get
$$f_Z(z) = \frac{1}{\pi(z^2 + 1)}, \quad -\infty < z < \infty$$
This density is called the Cauchy density. Like the standard normal density, the Cauchy density is symmetric about zero and bell-shaped, but its tails tend to zero much more slowly than those of the normal. This can be attributed to the substantial probability that $X$ in the quotient $Y/X$ is near zero. ■

Example B indicates one method of generating Cauchy random variables: we can generate independent standard normal random variables and form their quotient. The next section shows how to generate standard normals.
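As a sketch (not part of the text), the quotient construction can be checked by simulation against the Cauchy cdf, $F_Z(z) = \frac{1}{2} + \frac{1}{\pi}\tan^{-1} z$, whose heavy tails give $P(|Z| > 10) \approx 0.0635$, a tail probability that would be astronomically small for a normal variable:

```python
import math, random

random.seed(2)

def cauchy_via_ratio(rng=random):
    """Generate a standard Cauchy variable as the ratio Y/X of two
    independent standard normals, as in Example B."""
    x = rng.gauss(0.0, 1.0)
    while x == 0.0:                    # guard against division by zero
        x = rng.gauss(0.0, 1.0)
    return rng.gauss(0.0, 1.0) / x

n = 200_000
tail = sum(abs(cauchy_via_ratio()) > 10 for _ in range(n)) / n
# Exact tail from the Cauchy cdf: P(|Z| > 10) = 1 - (2/pi) arctan(10)
exact = 1 - (2 / math.pi) * math.atan(10)
```

The empirical tail frequency agrees closely with the exact value, illustrating the slow tail decay discussed above.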
3.6.2 The General Case
The following example illustrates the concepts that are important to the general case of functions of several random variables and is also interesting in its own right.

EXAMPLE A
Suppose that $X$ and $Y$ are independent standard normal random variables, which means that their joint distribution is the standard bivariate normal distribution, or
$$f_{XY}(x, y) = \frac{1}{2\pi}\, e^{-(x^2/2) - (y^2/2)}$$
We change to polar coordinates and then reexpress the density in this new coordinate system ($R \ge 0$, $0 \le \Theta \le 2\pi$):
$$R = \sqrt{X^2 + Y^2}$$
$$\Theta = \begin{cases} \tan^{-1}(Y/X), & \text{if } X > 0 \\ \tan^{-1}(Y/X) + \pi, & \text{if } X < 0 \\ \frac{\pi}{2}\,\mathrm{sgn}(Y), & \text{if } X = 0,\ Y \ne 0 \\ 0, & \text{if } X = 0,\ Y = 0 \end{cases}$$
(The range of the inverse tangent function is taken to be $-\frac{\pi}{2} < \theta < \frac{\pi}{2}$.) The inverse transformation is
$$X = R\cos\Theta, \qquad Y = R\sin\Theta$$
The joint density of $R$ and $\Theta$ satisfies
$$f_{R\Theta}(r, \theta)\, dr\, d\theta = P(r \le R \le r + dr,\ \theta \le \Theta \le \theta + d\theta)$$
This probability is equal to the area of the shaded patch in Figure 3.18 times $f_{XY}[x(r, \theta), y(r, \theta)]$. The area in question is clearly $r\, dr\, d\theta$, so
$$P(r \le R \le r + dr,\ \theta \le \Theta \le \theta + d\theta) = f_{XY}(r\cos\theta, r\sin\theta)\, r\, dr\, d\theta$$
and
$$f_{R\Theta}(r, \theta) = r f_{XY}(r\cos\theta, r\sin\theta)$$
Thus,
$$f_{R\Theta}(r, \theta) = \frac{r}{2\pi}\, e^{-(r^2\cos^2\theta)/2 - (r^2\sin^2\theta)/2} = \frac{1}{2\pi}\, r e^{-r^2/2}$$
From this, we see that the joint density factors, implying that $R$ and $\Theta$ are independent random variables, that $\Theta$ is uniform on $[0, 2\pi]$, and that $R$ has the density
$$f_R(r) = r e^{-r^2/2}, \quad r \ge 0$$
which is called the Rayleigh density.

FIGURE 3.18 The area of the shaded patch is $r\, dr\, d\theta$.

An interesting relationship can be found by changing variables again, letting $T = R^2$. Using the standard techniques for finding the density of a function of a single random variable, we obtain
$$f_T(t) = \frac{1}{2}\, e^{-t/2}, \quad t \ge 0$$
This is an exponential distribution with parameter $\frac{1}{2}$.
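The claim that $T = R^2 = X^2 + Y^2$ is exponential with parameter $\frac{1}{2}$ can be checked by a quick simulation (a sketch, not part of the text): an exponential with parameter $\frac{1}{2}$ has mean 2 and survival function $P(T > t) = e^{-t/2}$.

```python
import math, random

random.seed(3)
n = 100_000
# T = X^2 + Y^2 for independent standard normals should be Exp(1/2):
# mean 1/lambda = 2, and P(T > 4) = e^{-2} ~ 0.135.
t_vals = [random.gauss(0, 1)**2 + random.gauss(0, 1)**2 for _ in range(n)]
mean_t = sum(t_vals) / n                      # should be close to 2
surv_4 = sum(t > 4.0 for t in t_vals) / n     # should be close to e^{-2}
```

Both the sample mean and the empirical survival probability agree with the exponential$(\frac{1}{2})$ predictions to within simulation error.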
Because $R$ and $\Theta$ are independent, so are $T$ and $\Theta$, and the joint density of the latter pair is
$$f_{T\Theta}(t, \theta) = \frac{1}{2\pi} \cdot \frac{1}{2}\, e^{-t/2}$$
We have thus arrived at a characterization of the standard bivariate normal distribution: $\Theta$ is uniform on $[0, 2\pi]$, and $R^2$ is exponential with parameter $\frac{1}{2}$. (Also, from Example B in Section 3.6.1, $\tan\Theta$ follows a Cauchy distribution.)

These relationships can be used to construct an algorithm for generating standard normal random variables, which is quite useful since $\Phi$, the cdf, and $\Phi^{-1}$ cannot be expressed in closed form. First, generate $U_1$ and $U_2$, which are independent and uniform on $[0, 1]$. Then $-2\log U_1$ is exponential with parameter $\frac{1}{2}$, and $2\pi U_2$ is uniform on $[0, 2\pi]$. It follows that
$$X = \sqrt{-2\log U_1}\, \cos(2\pi U_2) \qquad \text{and} \qquad Y = \sqrt{-2\log U_1}\, \sin(2\pi U_2)$$
are independent standard normal random variables. This method of generating normally distributed random variables is sometimes called the polar method. ■

For the general case, suppose that $X$ and $Y$ are jointly distributed continuous random variables, that $X$ and $Y$ are mapped onto $U$ and $V$ by the transformation
$$u = g_1(x, y), \qquad v = g_2(x, y)$$
and that the transformation can be inverted to obtain
$$x = h_1(u, v), \qquad y = h_2(u, v)$$
Assume that $g_1$ and $g_2$ have continuous partial derivatives and that the Jacobian
$$J(x, y) = \det \begin{bmatrix} \dfrac{\partial g_1}{\partial x} & \dfrac{\partial g_1}{\partial y} \\[1ex] \dfrac{\partial g_2}{\partial x} & \dfrac{\partial g_2}{\partial y} \end{bmatrix} = \frac{\partial g_1}{\partial x}\frac{\partial g_2}{\partial y} - \frac{\partial g_2}{\partial x}\frac{\partial g_1}{\partial y} \ne 0$$
for all $x$ and $y$. This leads directly to the following result.

PROPOSITION A
Under the assumptions just stated, the joint density of $U$ and $V$ is
$$f_{UV}(u, v) = f_{XY}(h_1(u, v), h_2(u, v))\, |J^{-1}(h_1(u, v), h_2(u, v))|$$
for $(u, v)$ such that $u = g_1(x, y)$ and $v = g_2(x, y)$ for some $(x, y)$, and 0 elsewhere. ■

We will not prove Proposition A here. It follows from the formula established in advanced calculus for a change of variables in multiple integrals. The essential elements of the proof follow the discussion in Example A.

EXAMPLE B
To illustrate the formalism, let us redo Example A.
The roles of $u$ and $v$ are played by $r$ and $\theta$:
$$r = \sqrt{x^2 + y^2}, \qquad \theta = \tan^{-1}\frac{y}{x}$$
The inverse transformation is
$$x = r\cos\theta, \qquad y = r\sin\theta$$
After some algebra, we obtain the partial derivatives:
$$\frac{\partial r}{\partial x} = \frac{x}{\sqrt{x^2 + y^2}}, \qquad \frac{\partial r}{\partial y} = \frac{y}{\sqrt{x^2 + y^2}}$$
$$\frac{\partial \theta}{\partial x} = \frac{-y}{x^2 + y^2}, \qquad \frac{\partial \theta}{\partial y} = \frac{x}{x^2 + y^2}$$
The Jacobian is the determinant of the matrix of these expressions, or
$$J(x, y) = \frac{1}{\sqrt{x^2 + y^2}} = \frac{1}{r}$$
Proposition A therefore says that
$$f_{R\Theta}(r, \theta) = r f_{XY}(r\cos\theta, r\sin\theta)$$
for $r \ge 0$, $0 \le \theta \le 2\pi$, and 0 elsewhere, which is the same as the result we obtained by a direct argument in Example A. ■

Proposition A extends readily to transformations of more than two random variables. If $X_1, \ldots, X_n$ have the joint density function $f_{X_1 \cdots X_n}$ and
$$Y_i = g_i(X_1, \ldots, X_n), \quad i = 1, \ldots, n$$
$$X_i = h_i(Y_1, \ldots, Y_n), \quad i = 1, \ldots, n$$
and if $J(x_1, \ldots, x_n)$ is the determinant of the matrix with $ij$ entry $\partial g_i/\partial x_j$, then the joint density of $Y_1, \ldots, Y_n$ is
$$f_{Y_1 \cdots Y_n}(y_1, \ldots, y_n) = f_{X_1 \cdots X_n}(x_1, \ldots, x_n)\, |J^{-1}(x_1, \ldots, x_n)|$$
wherein each $x_i$ is expressed in terms of the $y$'s: $x_i = h_i(y_1, \ldots, y_n)$.

EXAMPLE C
Suppose that $X_1$ and $X_2$ are independent standard normal random variables and that
$$Y_1 = X_1, \qquad Y_2 = X_1 + X_2$$
We will show that the joint distribution of $Y_1$ and $Y_2$ is bivariate normal. The Jacobian of the transformation is simply
$$J(x_1, x_2) = \det \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} = 1$$
Since the inverse transformation is $x_1 = y_1$ and $x_2 = y_2 - y_1$, from Proposition A the joint density of $Y_1$ and $Y_2$ is
$$f_{Y_1 Y_2}(y_1, y_2) = \frac{1}{2\pi} \exp\left(-\frac{1}{2}\left[y_1^2 + (y_2 - y_1)^2\right]\right) = \frac{1}{2\pi} \exp\left(-\frac{1}{2}\left[2y_1^2 + y_2^2 - 2y_1 y_2\right]\right)$$
This can be recognized to be a bivariate normal density, the parameters of which can be identified by comparing the constants in this expression with the general form of the bivariate normal (see Example F of Section 3.3). First, since the exponential contains only quadratic terms in $y_1$ and $y_2$, we have $\mu_{Y_1} = \mu_{Y_2} = 0$. (If $\mu_{Y_1}$ were nonzero, for example, examination of the equation for the bivariate density in Example F of Section 3.3 shows that there would be a term $y_1 \mu_{Y_1}$.)
Next, from the constant that occurs in front of the exponential, we have
$$\sigma_{Y_1} \sigma_{Y_2} \sqrt{1 - \rho^2} = 1$$
From the coefficient of $y_1^2$ we have
$$\sigma_{Y_1}^2 (1 - \rho^2) = \frac{1}{2}$$
Dividing the second relationship into the square of the first gives $\sigma_{Y_2}^2 = 2$. From the coefficient of $y_2^2$, we have
$$\sigma_{Y_2}^2 (1 - \rho^2) = 1$$
from which it follows that $\rho^2 = \frac{1}{2}$. From the sign of the cross-product term, we see that $\rho = 1/\sqrt{2}$. Finally, we have $\sigma_{Y_1}^2 = 1$. We thus see that this linear transformation of two independent standard normal random variables follows a bivariate normal distribution. This is a special case of a more general result: A nonsingular linear transformation of two random variables whose joint distribution is bivariate normal yields two random variables whose joint distribution is still bivariate normal, although with different parameters. (See Problem 58.) ■

3.7 Extrema and Order Statistics
This section is concerned with ordering a collection of independent continuous random variables. In particular, let us assume that $X_1, X_2, \ldots, X_n$ are independent random variables with common cdf $F$ and density $f$. Let $U$ denote the maximum of the $X_i$ and $V$ the minimum. The cdfs of $U$ and $V$, and therefore their densities, can be found by a simple trick. First, we note that $U \le u$ if and only if $X_i \le u$ for all $i$. Thus,
$$F_U(u) = P(U \le u) = P(X_1 \le u) P(X_2 \le u) \cdots P(X_n \le u) = [F(u)]^n$$
Differentiating, we find the density
$$f_U(u) = n f(u) [F(u)]^{n - 1}$$
Similarly, $V \ge v$ if and only if $X_i \ge v$ for all $i$. Thus,
$$1 - F_V(v) = [1 - F(v)]^n$$
and
$$F_V(v) = 1 - [1 - F(v)]^n$$
The density function of $V$ is therefore
$$f_V(v) = n f(v) [1 - F(v)]^{n - 1}$$

EXAMPLE A
Suppose that $n$ system components are connected in series, which means that the system fails if any one of them fails, and that the lifetimes of the components, $T_1, \ldots, T_n$, are independent random variables that are exponentially distributed with parameter $\lambda$: $F(t) = 1 - e^{-\lambda t}$.
The random variable that represents the length of time the system operates is $V$, the minimum of the $T_i$, which by the preceding result has the density
$$f_V(v) = n\lambda e^{-\lambda v}\left(e^{-\lambda v}\right)^{n - 1} = n\lambda e^{-n\lambda v}$$
We see that $V$ is exponentially distributed with parameter $n\lambda$. ■

EXAMPLE B
Suppose that a system has components as described in Example A but connected in parallel, which means that the system fails only when they all fail. The system's lifetime is thus the maximum of $n$ exponential random variables and has the density
$$f_U(u) = n\lambda e^{-\lambda u}\left(1 - e^{-\lambda u}\right)^{n - 1}$$
By expanding the last term using the binomial theorem, we see that this density is a weighted sum of exponential terms rather than a simple exponential density. ■

We will now derive the preceding results once more, by the differential technique, and generalize them. To find $f_U(u)$, we observe that $u \le U \le u + du$ if one of the $n$ variables $X_i$ falls in the interval $(u, u + du)$ and the other $n - 1$ of the $X_i$ fall to the left of $u$. The probability of any particular such arrangement is $[F(u)]^{n - 1} f(u)\, du$, and because there are $n$ such arrangements,
$$f_U(u) = n [F(u)]^{n - 1} f(u)$$
Now we again assume that $X_1, \ldots, X_n$ are independent continuous random variables with density $f(x)$. We sort the $X_i$ and denote by
$$X_{(1)} < X_{(2)} < \cdots < X_{(n)}$$
the order statistics. Note that $X_1$ is not necessarily equal to $X_{(1)}$. (In fact, this equality holds with probability $n^{-1}$.) Thus, $X_{(n)}$ is the maximum and $X_{(1)}$ is the minimum. If $n$ is odd, say $n = 2m + 1$, then $X_{(m + 1)}$ is called the median of the $X_i$.

THEOREM A
The density of $X_{(k)}$, the $k$th-order statistic, is
$$f_k(x) = \frac{n!}{(k - 1)!(n - k)!}\, f(x)\, F^{k - 1}(x)\, [1 - F(x)]^{n - k}$$
Proof
We will use a differential argument to derive this result heuristically. (The alternative approach of first deriving the cdf and then differentiating is developed in Problem 66 at the end of this chapter.)
The event $x \le X_{(k)} \le x + dx$ occurs if $k - 1$ observations are less than $x$, one observation is in the interval $[x, x + dx]$, and $n - k$ observations are greater than $x + dx$. The probability of any particular arrangement of this type is $f(x) F^{k - 1}(x)[1 - F(x)]^{n - k}\, dx$, and, by the multinomial theorem, there are $n!/[(k - 1)!\, 1!\, (n - k)!]$ such arrangements, which completes the argument. ■

EXAMPLE C
For the case where the $X_i$ are uniform on $[0, 1]$, the density of the $k$th-order statistic reduces to
$$f_k(x) = \frac{n!}{(k - 1)!(n - k)!}\, x^{k - 1}(1 - x)^{n - k}, \quad 0 \le x \le 1$$
This is a beta density. An interesting by-product of this result is that, since the density integrates to 1,
$$\int_0^1 x^{k - 1}(1 - x)^{n - k}\, dx = \frac{(k - 1)!(n - k)!}{n!}$$
■

Joint distributions of order statistics can also be worked out. For example, to find the joint density of the minimum and maximum, we note that $x \le X_{(1)} \le x + dx$ and $y \le X_{(n)} \le y + dy$ if one $X_i$ falls in $[x, x + dx]$, one falls in $[y, y + dy]$, and $n - 2$ fall in $[x, y]$. There are $n(n - 1)$ ways to choose the minimum and maximum, and thus $V = X_{(1)}$ and $U = X_{(n)}$ have the joint density
$$f(u, v) = n(n - 1)\, f(v)\, f(u)\, [F(u) - F(v)]^{n - 2}, \quad u \ge v$$
For example, for the uniform case,
$$f(u, v) = n(n - 1)(u - v)^{n - 2}, \quad 1 \ge u \ge v \ge 0$$
The range of $X_{(1)}, \ldots, X_{(n)}$ is $R = X_{(n)} - X_{(1)}$. Using the same kind of analysis we used in Section 3.6.1 to derive the distribution of a sum, we find
$$f_R(r) = \int_{-\infty}^{\infty} f(v + r, v)\, dv$$

EXAMPLE D
Find the distribution of the range, $U - V$, for the uniform $[0, 1]$ case. The integrand is $f(v + r, v) = n(n - 1) r^{n - 2}$ for $0 \le v \le v + r \le 1$ or, equivalently, $0 \le v \le 1 - r$.
Thus,
$$f_R(r) = \int_0^{1 - r} n(n - 1) r^{n - 2}\, dv = n(n - 1) r^{n - 2}(1 - r), \quad 0 \le r \le 1$$
The corresponding cdf is
$$F_R(r) = n r^{n - 1} - (n - 1) r^n, \quad 0 \le r \le 1$$
■

EXAMPLE E Tolerance Interval
If a large number of independent random variables having the common density function $f$ are observed, it seems intuitively likely that most of the probability mass of the density $f(x)$ is contained in the interval $(X_{(1)}, X_{(n)})$ and unlikely that a future observation will lie outside this interval. In fact, very precise statements can be made. For example, the amount of the probability mass in the interval is $F(X_{(n)}) - F(X_{(1)})$, a random variable that we will denote by $Q$. From Proposition C of Section 2.3, the distribution of $F(X_i)$ is uniform; therefore, the distribution of $Q$ is the distribution of $U_{(n)} - U_{(1)}$, which is the range of $n$ independent uniform random variables. Thus $P(Q > \alpha)$, the probability that more than $100\alpha\%$ of the probability mass is contained in the range, is, from Example D,
$$P(Q > \alpha) = 1 - n\alpha^{n - 1} + (n - 1)\alpha^n$$
For example, if $n = 100$ and $\alpha = .95$, this probability is .96. In words, this means that the probability is .96 that the range of 100 independent random variables covers 95% or more of the probability mass; or, with probability .96, 95% of all further observations from the same distribution will fall between the minimum and maximum. This statement does not depend on the actual form of the distribution. ■

3.8 Problems
1. The joint frequency function of two discrete random variables, $X$ and $Y$, is given in the following table:

       y\x    1     2     3     4
        1    .10   .05   .02   .02
        2    .05   .20   .05   .02
        3    .02   .05   .20   .04
        4    .02   .02   .04   .10

   a. Find the marginal frequency functions of $X$ and $Y$.
   b. Find the conditional frequency function of $X$ given $Y = 1$ and of $Y$ given $X = 1$.
2. An urn contains $p$ black balls, $q$ white balls, and $r$ red balls; and $n$ balls are chosen without replacement.
   a. Find the joint distribution of the numbers of black, white, and red balls in the sample.
   b.
Find the joint distribution of the numbers of black and white balls in the sample.
   c. Find the marginal distribution of the number of white balls in the sample.
3. Three players play 10 independent rounds of a game, and each player has probability $\frac{1}{3}$ of winning each round. Find the joint distribution of the numbers of games won by each of the three players.
4. A sieve is made of a square mesh of wires. Each wire has diameter $d$, and the holes in the mesh are squares whose side length is $w$. A spherical particle of radius $r$ is dropped on the mesh. What is the probability that it passes through? What is the probability that it fails to pass through if it is dropped $n$ times? (Calculations such as these are relevant to the theory of sieving for analyzing the size distribution of particulate matter.)
5. (Buffon's Needle Problem) A needle of length $L$ is dropped randomly on a plane ruled with parallel lines that are a distance $D$ apart, where $D \ge L$. Show that the probability that the needle comes to rest crossing a line is $2L/(\pi D)$. Explain how this gives a mechanical means of estimating the value of $\pi$.
6. A point is chosen randomly in the interior of an ellipse:
$$\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$$
Find the marginal densities of the $x$ and $y$ coordinates of the point.
7. Find the joint and marginal densities corresponding to the cdf
$$F(x, y) = (1 - e^{-\alpha x})(1 - e^{-\beta y}), \quad x \ge 0,\ y \ge 0,\ \alpha > 0,\ \beta > 0$$
8. Let $X$ and $Y$ have the joint density
$$f(x, y) = \frac{6}{7}(x + y)^2, \quad 0 \le x \le 1,\ 0 \le y \le 1$$
   a. By integrating over the appropriate regions, find (i) $P(X > Y)$, (ii) $P(X + Y \le 1)$, (iii) $P(X \le \frac{1}{2})$.
   b. Find the marginal densities of $X$ and $Y$.
   c. Find the two conditional densities.
9. Suppose that $(X, Y)$ is uniformly distributed over the region defined by $0 \le y \le 1 - x^2$ and $-1 \le x \le 1$.
   a. Find the marginal densities of $X$ and $Y$.
   b. Find the two conditional densities.
10. A point is uniformly distributed in a unit sphere in three dimensions.
   a.
Find the marginal densities of the $x$, $y$, and $z$ coordinates.
   b. Find the joint density of the $x$ and $y$ coordinates.
   c. Find the density of the $x$ and $y$ coordinates conditional on $Z = 0$.
11. Let $U_1$, $U_2$, and $U_3$ be independent random variables uniform on $[0, 1]$. Find the probability that the roots of the quadratic $U_1 x^2 + U_2 x + U_3$ are real.
12. Let
$$f(x, y) = c(x^2 - y^2)e^{-x}, \quad 0 \le x < \infty,\ -x \le y < x$$
   a. Find $c$.
   b. Find the marginal densities.
   c. Find the conditional densities.
13. A fair coin is thrown once; if it lands heads up, it is thrown a second time. Find the frequency function of the total number of heads.
14. Suppose that
$$f(x, y) = x e^{-x(y + 1)}, \quad 0 \le x < \infty,\ 0 \le y < \infty$$
   a. Find the marginal densities of $X$ and $Y$. Are $X$ and $Y$ independent?
   b. Find the conditional densities of $X$ and $Y$.
15. Suppose that $X$ and $Y$ have the joint density function
$$f(x, y) = c\sqrt{1 - x^2 - y^2}, \quad x^2 + y^2 \le 1$$
   a. Find $c$.
   b. Sketch the joint density.
   c. Find $P(X^2 + Y^2 \le \frac{1}{2})$.
   d. Find the marginal densities of $X$ and $Y$. Are $X$ and $Y$ independent random variables?
   e. Find the conditional densities.
16. What is the probability density of the time between the arrival of the two packets of Example E in Section 3.4?
17. Let $(X, Y)$ be a random point chosen uniformly on the region $R = \{(x, y) : |x| + |y| \le 1\}$.
   a. Sketch $R$.
   b. Find the marginal densities of $X$ and $Y$ using your sketch. Be careful of the range of integration.
   c. Find the conditional density of $Y$ given $X$.
18. Let $X$ and $Y$ have the joint density function
$$f(x, y) = k(x - y), \quad 0 \le y \le x \le 1$$
and 0 elsewhere.
   a. Sketch the region over which the density is positive and use it in determining limits of integration to answer the following questions.
   b. Find $k$.
   c. Find the marginal densities of $X$ and $Y$.
   d. Find the conditional densities of $Y$ given $X$ and $X$ given $Y$.
19. Suppose that two components have independent exponentially distributed lifetimes, $T_1$ and $T_2$, with parameters $\alpha$ and $\beta$, respectively. Find (a) $P(T_1 > T_2)$ and (b) $P(T_1 > 2T_2)$.
20.
If $X_1$ is uniform on $[0, 1]$, and, conditional on $X_1$, $X_2$ is uniform on $[0, X_1]$, find the joint and marginal distributions of $X_1$ and $X_2$.
21. An instrument is used to measure very small concentrations, $X$, of a certain chemical in soil samples. Suppose that the values of $X$ in those soils in which the chemical is present is modeled as a random variable with density function $f(x)$. The assay of a soil reports a concentration only if the chemical is first determined to be present. At very low concentrations, however, the chemical may fail to be detected even if it is present. This phenomenon is modeled by assuming that if the concentration is $x$, the chemical is detected with probability $R(x)$. Let $Y$ denote the concentration of a chemical in a soil in which it has been determined to be present. Show that the density function of $Y$ is
$$g(y) = \frac{R(y)\, f(y)}{\int_0^{\infty} R(x)\, f(x)\, dx}$$
22. Consider a Poisson process on the real line, and denote by $N(t_1, t_2)$ the number of events in the interval $(t_1, t_2)$. If $t_0 < t_1 < t_2$, find the conditional distribution of $N(t_0, t_1)$ given that $N(t_0, t_2) = n$. (Hint: Use the fact that the numbers of events in disjoint subsets are independent.)
23. Suppose that, conditional on $N$, $X$ has a binomial distribution with $N$ trials and probability $p$ of success, and that $N$ is a binomial random variable with $m$ trials and probability $r$ of success. Find the unconditional distribution of $X$.
24. Let $P$ have a uniform distribution on $[0, 1]$, and, conditional on $P = p$, let $X$ have a Bernoulli distribution with parameter $p$. Find the conditional distribution of $P$ given $X$.
25. Let $X$ have the density function $f$, and let $Y = X$ with probability $\frac{1}{2}$ and $Y = -X$ with probability $\frac{1}{2}$. Show that the density of $Y$ is symmetric about zero, that is, $f_Y(y) = f_Y(-y)$.
26. Spherical particles whose radii have the density function $f_R(r)$ are dropped on a mesh as in Problem 4. Find an expression for the density function of the particles that pass through.
27.
Prove that $X$ and $Y$ are independent if and only if $f_{X|Y}(x|y) = f_X(x)$ for all $x$ and $y$.
28. Show that $C(u, v) = uv$ is a copula. Why is it called "the independence copula"?
29. Use the Farlie-Morgenstern copula to construct a bivariate density whose marginal densities are exponential. Find an expression for the joint density.
30. For $0 \le \alpha \le 1$ and $0 \le \beta \le 1$, show that $C(u, v) = \min(u^{1 - \alpha} v,\ u v^{1 - \beta})$ is a copula (the Marshall-Olkin copula). What is the joint density?
31. Suppose that $(X, Y)$ is uniform on the disk of radius 1 as in Example E of Section 3.3. Without doing any calculations, argue that $X$ and $Y$ are not independent.
32. Continuing Example E of Section 3.5.2, suppose you had to guess a value of $\theta$. One plausible guess would be the value of $\theta$ that maximizes the posterior density. Find that value. Does the result make intuitive sense?
33. Suppose that, as in Example E of Section 3.5.2, your prior opinion that the coin will land with heads up is represented by a uniform density on $[0, 1]$. You now spin the coin repeatedly and record the number of times, $N$, until a heads comes up. So if heads comes up on the first spin, $N = 1$, etc.
   a. Find the posterior density of $\Theta$ given $N$.
   b. Do this with a newly minted penny and graph the posterior density.
34. This problem continues Example E of Section 3.5.2. In that example, the prior opinion for the value of $\Theta$ was represented by the uniform density. Suppose that the prior density had been a beta density with parameters $a = b = 3$, reflecting a stronger prior belief that the chance of a 1 was near $\frac{1}{2}$. Graph this prior density. Following the reasoning of the example, find the posterior density, plot it, and compare it to the posterior density shown in the example.
35. Find a newly minted penny. Place it on its edge and spin it 20 times. Following Example E of Section 3.5.2, calculate and graph the posterior distribution. Spin another 20 times, and calculate and graph the posterior based on all 40 spins.
What happens as you increase the number of spins? 3.8 Problems 111 36. Let f (x) = (1 + αx)/2, for −1 ≤ x ≤ 1 and −1 ≤ α ≤ 1. a. Describe an algorithm to generate random variables from this density using the rejection method. b. Write a computer program to do so, and test it out. 37. Let f (x) = 6x2(1 − x)2, for −1 ≤ x ≤ 1. a. Describe an algorithm to generate random variables from this density using the rejection method. In what proportion of the trials will the acceptance step be taken? b. Write a computer program to do so, and test it out. 38. Show that the number of iterations necessary to generate a random variable using the rejection method is a geometric random variable, and evaluate the parameter of the geometric frequency function. Show that in order to keep the number of iterations small, M(x) should be chosen to be close to f (x). 39. Show that the following method of generating discrete random variables works (D. R. Fredkin). Suppose, for concreteness, that X takes on values 0, 1, 2, . . . with probabilities p0, p1, p2,.... Let U be a uniform random variable. If U < p0, return X = 0. If not, replace U by U − p0, and if the new U is less than p1, return X = 1. If not, decrement U by p1, compare U to p2, etc. 40. Suppose that X and Y are discrete random variables with a joint probability mass function pXY(x, y). Show that the following procedure generates a random variable X ∼ pX|Y (x|y). a. Generate X ∼ pX (x). b. Accept X with probability p(y|X). c. If X is accepted, terminate and return X. Otherwise go to Step a. Now suppose that X is uniformly distributed on the integers 1, 2,...,100 and that given X = x, Y is uniform on the integers 1, 2,...,x. You observe Y = 44. What does this tell you about X? Simulate the distribution of X,givenY = 44, 1000 times and make a histogram of the value obtained. How would you estimate E(X|Y = 44)? 41. How could you extend the procedure of the previous problem in the case that X and Y are continuous random variables? 
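Problems 36–38 above concern the rejection method. As a concrete illustration, here is a minimal sketch in Python of rejection sampling with a constant envelope, applied to the density of Problem 36; the function name and the choice α = 0.5 are ours, not part of the problems.

```python
import random

def rejection_sample(f, a, b, fmax, rng):
    """Generate one draw from density f on [a, b] by rejection:
    propose X uniform on [a, b], accept X with probability f(X)/fmax."""
    while True:
        x = rng.uniform(a, b)
        if rng.uniform(0, fmax) <= f(x):
            return x

# Density of Problem 36, f(x) = (1 + alpha*x)/2 on [-1, 1], with alpha = 0.5
alpha = 0.5
f = lambda x: (1 + alpha * x) / 2
rng = random.Random(0)
draws = [rejection_sample(f, -1.0, 1.0, (1 + abs(alpha)) / 2, rng)
         for _ in range(20000)]

# The mean of this density is E(X) = alpha/3, so the sample mean should be near 1/6
print(round(sum(draws) / len(draws), 2))
```

In line with Problem 38, the expected number of proposals per accepted draw equals the area of the bounding box divided by the area under f, here fmax × (b − a) = 0.75 × 2 = 1.5.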
42. a. Let T be an exponential random variable with parameter λ; let W be a random variable independent of T , which is ±1 with probability 1 2 each; and let X = WT. Show that the density of X is fX (x) = λ 2e−λ|x| which is called the double exponential density. b. Show that for some constant c, 1√ 2π e−x2/2 ≤ ce−|x| 112 Chapter 3 Joint Distributions Use this result and that of part (a) to show how to use the rejection method to generate random variables from a standard normal density. 43. Let U1 and U2 be independent and uniform on [0, 1]. Find and sketch the density function of S = U1 + U2. 44. Let N1 and N2 be independent random variables following Poisson distributions with parameters λ1 and λ2. Show that the distribution of N = N1 + N2 is Poisson with parameter λ1 + λ2. 45. For a Poisson distribution, suppose that events are independently labeled A and B with probabilities pA + pB = 1. If the parameter of the Poisson distribution is λ, show that the number of events labeled A follows a Poisson distribution with parameter pAλ. 46. Let X and Y be jointly continuous random variables. Find an expression for the density of Z = X − Y. 47. Let X and Y be independent standard normal random variables. Find the density of Z = X + Y, and show that Z is normally distributed as well. (Hint: Use the technique of completing the square to help in evaluating the integral.) 48. Let T1 and T2 be independent exponentials with parameters λ1 and λ2. Find the density function of T1 + T2. 49. Find the density function of X + Y, where X and Y have a joint density as given in Example D in Section 3.3. 50. Suppose that X and Y are independent discrete random variables and each as- sumes the values 0, 1, and 2 with probability 1 3 each. Find the frequency function of X + Y. 51. Let X and Y have the joint density function f (x, y), and let Z = XY. Show that the density function of Z is fZ (z) = ∞ −∞ f y, z y 1 |y| dy 52. 
Find the density of the quotient of two independent uniform random variables. 53. Consider forming a random rectangle in two ways. Let U1, U2, and U3 be inde- pendent random variables uniform on [0, 1]. One rectangle has sides U1 and U2, and the other is a square with sides U3. Find the probability that the area of the square is greater than the area of the other rectangle. 54. Let X, Y, and Z be independent N(0,σ2). Let , , and R be the corresponding random variables that are the spherical coordinates of (X, Y, Z): x = r sin φ cos θ y = r sin φ sin θ z = r cos φ 0 ≤ φ ≤ π, 0 ≤ θ ≤ 2π 3.8 Problems 113 Find the joint and marginal densities of , , and R.(Hint: dx dy dz = r 2 sin φ dr dθ dφ.) 55. A point is generated on a unit disk in the following way: The radius, R, is uniform on [0, 1], and the angle is uniform on [0, 2π] and is independent of R. a. Find the joint density of X = R cos and Y = R sin . b. Find the marginal densities of X and Y. c. Is the density uniform over the disk? If not, modify the method to produce a uniform density. 56. If X and Y are independent exponential random variables, ﬁnd the joint density of the polar coordinates R and of the point (X, Y). Are R and independent? 57. Suppose that Y1 and Y2 follow a bivariate normal distribution with parameters μY1 = μY2 = 0,σ2 Y1 = 1,σ2 Y2 = 2, and ρ = 1/ √ 2. Find a linear transformation x1 = a11 y1 + a12 y2, x2 = a21 y1 + a22 y2 such that x1 and x2 are independent standard normal random variables. (Hint: See Example C of Section 3.6.2.) 58. Show that if the joint distribution of X1 and X2 is bivariate normal, then the joint distribution of Y1 = a1 X1 + b1 and Y2 = a2 X2 + b2 is bivariate normal. 59. Let X1 and X2 be independent standard normal random variables. Show that the joint distribution of Y1 = a11 X1 + a12 X2 + b1 Y2 = a21 X1 + a22 X2 + b2 is bivariate normal. 60. 
Using the results of the previous problem, describe a method for generating pseu- dorandom variables that have a bivariate normal distribution from independent pseudorandom uniform variables. 61. Let X and Y be jointly continuous random variables. Find an expression for the joint density of U = a + bX and V = c + dY. 62. If X and Y are independent standard normal random variables, ﬁnd P(X 2 + Y 2 ≤ 1). 63. Let X and Y be jointly continuous random variables. a. Develop an expression for the joint density of X + Y and X − Y. b. Develop an expression for the joint density of XY and Y/X. c. Specialize the expressions from parts (a) and (b) to the case where X and Y are independent. 64. Find the joint density of X + Y and X/Y, where X and Y are independent exponential random variables with parameter λ. Show that X + Y and X/Y are independent. 65. Suppose that a system’s components are connected in series and have lifetimes that are independent exponential random variables with parameters λi . Show that the lifetime of the system is exponential with parameter λi . 114 Chapter 3 Joint Distributions 66. Each component of the following system (Figure 3.19) has an independent ex- ponentially distributed lifetime with parameter λ. Find the cdf and the density of the system’s lifetime. FIGURE 3.19 67. A card contains n chips and has an error-correcting mechanism such that the card still functions if a single chip fails but does not function if two or more chips fail. If each chip has a lifetime that is an independent exponential with parameter λ, ﬁnd the density function of the card’s lifetime. 68. Suppose that a queue has n servers and that the length of time to complete a job is an exponential random variable. If a job is at the top of the queue and will be handled by the next available server, what is the distribution of the waiting time until service? What is the distribution of the waiting time until service of the next job in the queue? 69. 
Find the density of the minimum of n independent Weibull random variables, each of which has the density f (t) = βα−βtβ−1e−(t/α)β , t ≥ 0 70. If ﬁve numbers are chosen at random in the interval [0, 1], what is the probability that they all lie in the middle half of the interval? 71. Let X1,...,Xn be independent random variables, each with the density func- tion f. Find an expression for the probability that the interval (−∞, X(n)] encompasses at least 100ν% of the probability mass of f. 72. Let X1, X2,...,Xn be independent continuous random variables each with cu- mulative distribution function F. Show that the joint cdf of X(1) and X(n) is F(x, y) = Fn(y) − [F(y) − F(x)]n, x ≤ y 73. If X1,...,Xn are independent random variables, each with the density function f , show that the joint density of X(1),...,X(n) is n! f (x1) f (x2) ··· f (xn), x1 < x2 < ···< xn 74. Let U1, U2, and U3 be independent uniform random variables. a. Find the joint density of U(1), U(2), and U(3). b. The locations of three gas stations are independently and randomly placed along a mile of highway. What is the probability that no two gas stations are less than 1 3 mile apart? 3.8 Problems 115 75. Use the differential method to ﬁnd the joint density of X(i) and X( j), where i < j. 76. Prove Theorem A of Section 3.7 by ﬁnding the cdf of X(k) and differentiating. (Hint: X(k) ≤ x if and only if k or more of the Xi are less than or equal to x. The number of Xi less than or equal to x is a binomial random variable.) 77. Find the density of U(k) −U(k−1) if the Ui , i = 1,...,n are independent uniform random variables. This is the density of the spacing between adjacent points chosen uniformly in the interval [0, 1]. 78. Show that 1 0 y 0 (y − x)n dx dy = 1 (n + 1)(n + 2) 79. If T1 and T2 are independent exponential random variables, ﬁnd the density function of R = T(2) − T(1). 80. Let U1,...,Un be independent uniform random variables, and let V be uniform and independent of the Ui . a. 
Find P(V ≤ U(n)). b. Find P(U(1) < V < U(n)). 81. Do both parts of Problem 80 again, assuming that the Ui and V have the density function f and the cdf F, with F−1 uniquely deﬁned. Hint: F(Ui ) has a uniform distribution. CHAPTER 4 Expected Values 4.1 The Expected Value of a Random Variable The concept of the expected value of a random variable parallels the notion of a weighted average. The possible values of the random variable are weighted by their probabilities, as speciﬁed in the following deﬁnition. DEFINITION If X is a discrete random variable with frequency function p(x), the expected value of X, denoted by E(X),is E(X) = i xi p(xi ) provided that i |xi |p(xi )<∞. If the sum diverges, the expectation is unde- ﬁned. ■ E(X) is also referred to as the mean of X and is often denoted by μ or μX . It might be helpful to think of the expected value of X as the center of mass of the frequency function. Imagine placing the masses p(xi ) at the points xi on a beam; the balance point of the beam is the expected value of X. EXAMPLE A Roulette A roulette wheel has the numbers 1 through 36, as well as 0 and 00. If you bet $1 that an odd number comes up, you win or lose $1 according to whether that event occurs. If X denotes your net gain, X = 1 with probability 18 38 and X =−1 with 116 4.1 The Expected Value of a Random Variable 117 probability 20 38 . The expected value of X is E(X) = 1 × 18 38 + (−1) × 20 38 =−1 19 Thus, your expected loss is about $.05. In Chapter 5, it will be shown that this coincides in the limit with the actual average loss per game if you play a long sequence of independent games. ■ EXAMPLE B Expectation of a Geometric Random Variable Suppose that items produced in a plant are independently defective with probability p. Items are inspected one by one until a defective item is found. On the average, how many items must be inspected? The number of items inspected, X, is a geometric random variable, with P(X = k) = q k−1 p, where q = 1 − p. 
Therefore,

E(X) = Σ_{k=1}^∞ k p q^{k−1} = p Σ_{k=1}^∞ k q^{k−1}

We use a trick to calculate the sum. Since k q^{k−1} = (d/dq) q^k, we interchange the operations of summation and differentiation to obtain

E(X) = p (d/dq) Σ_{k=1}^∞ q^k = p (d/dq) [q / (1 − q)] = p / (1 − q)² = 1/p

It can be shown that the interchange of differentiation and summation is justified. Thus, for example, if 10% of the items are defective, an average of 10 items must be examined to find one that is defective, as might have been guessed. ■

EXAMPLE C  Poisson Distribution
The expected value of a Poisson random variable is

E(X) = Σ_{k=0}^∞ k (λ^k / k!) e^{−λ} = λ e^{−λ} Σ_{k=1}^∞ λ^{k−1} / (k − 1)! = λ e^{−λ} Σ_{j=0}^∞ λ^j / j!

Since Σ_{j=0}^∞ λ^j / j! = e^λ, we have E(X) = λ. The parameter λ of the Poisson distribution can thus be interpreted as the average count. ■

EXAMPLE D  St. Petersburg Paradox
A gambler has the following strategy for playing a sequence of games: He starts off betting $1; if he loses, he doubles his bet; and he continues to double his bet until he finally wins. To analyze this scheme, suppose that the game is fair and that he wins or loses the amount he bets. At trial 0, he bets $1; if he loses, he bets $2 at trial 1; and if he has not won by the kth trial, he bets $2^k. When he finally wins, he will be $1 ahead, which can be checked by going through the scheme for the first few values of k. This seems like a foolproof way to win $1. What could be wrong with it?

Let X denote the amount of money bet on the very last game (the game he wins). Because the probability that k losses are followed by one win is 2^{−(k+1)},

P(X = 2^k) = 1 / 2^{k+1}

and

E(X) = Σ_{k=0}^∞ 2^k P(X = 2^k) = Σ_{k=0}^∞ 2^k (1 / 2^{k+1}) = Σ_{k=0}^∞ 1/2 = ∞

Formally, E(X) is not defined. Practically, the analysis shows a flaw in this scheme, which is that it does not take into account the enormous amount of capital required. ■

The definition of expectation for a continuous random variable is a fairly obvious extension of the discrete case: summation is replaced by integration.
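The flaw in the doubling scheme can be seen empirically. The short simulation below (a sketch; the function name is ours) confirms that every completed run nets exactly $1, while the final bet X is occasionally enormous, reflecting the fact that E(X) is infinite.

```python
import random

def play_doubling(rng):
    """One run of the doubling scheme on a fair game: bet $1, double
    after every loss, stop at the first win.  Returns (net gain, final bet)."""
    bet, losses = 1, 0
    while rng.random() >= 0.5:   # lose this round with probability 1/2
        losses += bet
        bet *= 2
    return bet - losses, bet     # the win recovers all losses plus $1

rng = random.Random(0)
runs = [play_doubling(rng) for _ in range(100000)]
print(all(net == 1 for net, _ in runs))    # True: every completed run nets $1
print(max(bet for _, bet in runs))         # the largest bet placed, typically huge
```

Because E(X) is infinite, the sample mean of the final bets does not settle down as more runs are added, no matter how many runs are simulated.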
DEFINITION
If X is a continuous random variable with density f(x), then

E(X) = ∫_{−∞}^{∞} x f(x) dx

provided that ∫ |x| f(x) dx < ∞. If the integral diverges, the expectation is undefined. ■

Again E(X) can be regarded as the center of mass of the density. We next consider some examples.

EXAMPLE E  Gamma Density
If X follows a gamma density with parameters α and λ,

E(X) = ∫_0^∞ (λ^α / Γ(α)) x^α e^{−λx} dx

This integral is easy to evaluate once we realize that λ^{α+1} x^α e^{−λx} / Γ(α + 1) is a gamma density and therefore integrates to 1. We thus have

∫_0^∞ x^α e^{−λx} dx = Γ(α + 1) / λ^{α+1}

from which it follows that

E(X) = (λ^α / Γ(α)) × (Γ(α + 1) / λ^{α+1})

Finally, using the relation Γ(α + 1) = αΓ(α), we find

E(X) = α / λ

For the exponential density, α = 1, so E(X) = 1/λ. This may be contrasted to the median of the exponential density, which was found in Section 2.2.1 to be (log 2)/λ. The mean and the median can both be interpreted as "typical" values of X, but they measure different attributes of the probability distribution. ■

EXAMPLE F  Normal Distribution
From the definition of the expectation, we have

E(X) = (1 / (σ√(2π))) ∫_{−∞}^{∞} x e^{−(x−μ)² / (2σ²)} dx

Making the change of variables z = x − μ changes this equation to

E(X) = (1 / (σ√(2π))) ∫_{−∞}^{∞} z e^{−z² / (2σ²)} dz + (μ / (σ√(2π))) ∫_{−∞}^{∞} e^{−z² / (2σ²)} dz

The first integral is 0 since the contributions from z < 0 cancel those from z > 0, and the second integral is μ because the normal density integrates to 1. Thus,

E(X) = μ

The parameter μ of the normal density is the expectation, or mean value. We could have made the derivation much shorter by claiming that it was "obvious" that since the center of symmetry of the density is μ, the expectation must be μ. ■

EXAMPLE G  Cauchy Density
Recall that the Cauchy density is

f(x) = (1/π) · 1 / (1 + x²),  −∞ < x < ∞

The density is symmetric about zero, so it would seem that E(X) = 0. However,

∫_{−∞}^{∞} |x| / (1 + x²) dx = ∞

Therefore, the expectation does not exist.
The reason that it fails to exist is that the density decreases so slowly that very large values of X can occur with substantial probability. ■

The expected value can be interpreted as a long-run average. In Chapter 5, it will be shown that if E(X) exists and if X1, X2, ... is a sequence of independent random variables with the same distribution as X, and if S_n = Σ_{i=1}^n X_i, then, as n → ∞,

S_n / n → E(X)

This statement will be made more precise in Chapter 5. For now, a simple empirical demonstration will be sufficient.

EXAMPLE H
Using a pseudorandom number generator, a sequence X1, X2, ... of independent standard normal random variables was generated, as well as a sequence Y1, Y2, ... of independent Cauchy random variables. Figure 4.1 shows the graphs of

G(n) = (1/n) Σ_{i=1}^n X_i  and  C(n) = (1/n) Σ_{i=1}^n Y_i,  n = 1, 2, ..., 500

Note how G(n) appears to be tending to a limit, whereas C(n) does not. ■

FIGURE 4.1  The average of n independent random variables as a function of n for (a) normal random variables and (b) Cauchy random variables.

We conclude this section with a simple result that is of great utility in probability theory.

THEOREM A  Markov's Inequality
If X is a random variable with P(X ≥ 0) = 1 and for which E(X) exists, then P(X ≥ t) ≤ E(X)/t.

Proof
We will prove this for the discrete case; the continuous case is entirely analogous. Since X is nonnegative, dropping the terms with x < t can only decrease the sum:

E(X) = Σ_x x p(x) ≥ Σ_{x ≥ t} x p(x) ≥ t Σ_{x ≥ t} p(x) = t P(X ≥ t)

Dividing through by t gives the result. ■

For example, taking t = k E(X) gives P(X ≥ k E(X)) ≤ 1/k. This holds for any nonnegative random variable, regardless of its probability distribution.

4.1.1 Expectations of Functions of Random Variables
We often need to find E[g(X)], where X is a random variable and g is a fixed function.
For example, according to the kinetic theory of gases, the magnitude of the velocity of a gas molecule is random and its probability density is given by fX (x) = √ 2/π σ 3 x2e− 1 2 x2 σ2 (This is Maxwell’s distribution: the parameter σ depends on the temperature of the gas.) From this density, we can ﬁnd the average velocity, but suppose that we are interested in ﬁnding the average kinetic energy Y = 1 2 mX2, where m is the mass of the molecule. The straightforward way to do this would seem to be the following: Let Y = g(X); ﬁnd the density or frequency function of Y, say, fY ; and then compute E(Y) from the deﬁnition. It turns out, fortunately, that the process is much simpler than that. 122 Chapter 4 Expected Values THEOREM A Suppose that Y = g(X). a. If X is discrete with frequency function p(x), then E(Y) = x g(x)p(x) provided that |g(x)|p(x)<∞. b. If X is continuous with density function f (x), then E(Y) = ∞ −∞ g(x) f (x) dx provided that |g(x)| f (x) dx < ∞. Proof We will prove this result for the discrete case. The basic argument is the same for the continuous case, but making that proof rigorous requires some advanced theory of integration. By deﬁnition, E(Y) = i yi pY (yi ) Let Ai denote the set of x’s mapped to yi by g; that is, x ∈ Ai if g(x) = yi . Then pY (yi ) = x ∈ Ai p(x) and E(Y) = i yi x ∈ Ai p(x) = i x ∈ Ai yi p(x) = i x ∈ Ai g(x)p(x) = x g(x)p(x) This last step follows because the Ai are disjoint and every x belongs to some Ai . ■ It is worth pointing out explicitly that E[g(X)] = g[E(X)]; that is, the average value of the function is not equal to the function of the average value. Suppose, for example, that X takes on values 1 and 2, each with probability 1 2 ,soE(X) = 3 2 . Let Y = 1/X. Then E(Y) is clearly 1 × .5 + .5 × .5 = .75, but 1/E(X) = 2 3 . 4.1 The Expected Value of a Random Variable 123 EXAMPLE A Let us now apply Theorem A to ﬁnd the average kinetic energy of a gas molecule. 
E(Y) = ∞ 0 1 2mx2 fX (x) dx = m 2 √ 2/π σ 3 ∞ 0 x4e− 1 2 x2 σ2 dx To evaluate the integral, we make the change of variables u = x2/2σ 2 to reduce it to 2mσ 2 √π ∞ 0 u3/2e−u du = 2mσ 2 √π 5 2 Finally, using the facts (1 2 ) = √π and (α + 1) = α(α),wehave E(Y) = 3 2 mσ 2 ■ Now suppose that Y = g(X1,...,Xn), where Xi have a joint distribution, and that we want to ﬁnd E(Y). We do not have to ﬁnd the density or frequency function of Y, which again could be a formidable task. THEOREM B Suppose that X1,...,Xn are jointly distributed random variables and Y = g(X1,...,Xn). a. If the Xi are discrete with frequency function p(x1,...,xn), then E(Y) = x1,...,xn g(x1,...,xn)p(x1,...,xn) provided that x1,...,xn |g(x1,...,xn)|p(x1,...,xn)<∞. b. If the Xi are continuous with joint density function f (x1,...,xn), then E(Y) = ··· g(x1,...,xn) f (x1,...,xn)dx1 ···dxn provided that the integral with |g| in place of g converges. Proof The proof is similar to that of Theorem A. ■ 124 Chapter 4 Expected Values EXAMPLE B A stick of unit length is broken randomly in two places. What is the average length of the middle piece? We interpret this question to mean that the locations of the two break points are independent uniform random variables U1 and U2. Therefore, we need to compute E|U1 − U2|. Theorem B tells us that we do not need to ﬁnd the density function of |U1 − U2| and that we can just integrate |u1 − u2| against the joint density of U1 and U2, f (u1, u2) = 1, 0 ≤ u1 ≤ 1, 0 ≤ U2 ≤ 1. Thus, E|U1 − U2|= 1 0 1 0 |u1 − u2| du1 du2 = 1 0 u1 0 (u1 − u2) du2 du1 + 1 0 1 u1 (u2 − u1) du2 du1 With some care, we ﬁnd the expectation to be 1 3 . This is in accord with the intu- itive argument that the smaller of U1 and U2 should be 1 3 on the average and the larger should be 2 3 on the average, which means that the average difference should be 1 3 . ■ We note the following immediate consequence of Theorem B. 
COROLLARY A If X and Y are independent random variables and g and h are ﬁxed functions, then E[g(X)h(Y)] ={E[g(X)]}{E[h(Y)]}, provided that the expectations on the right-hand side exist. ■ In particular, if X and Y are independent, E(XY) = E(X)E(Y). The proof of this corollary is left to Problem 29 of the end-of-chapter problems. 4.1.2 Expectations of Linear Combinations of Random Variables One of the most useful properties of the expectation is that it is a linear operation. Suppose that you were told that the average temperature on July 1 in a certain location was 70◦F, and you were asked what the average temperature in degrees Celsius was. You can simply convert to degrees Celsius and obtain 5 9 × 70 − 17.7 = 21.2◦C. The notion of the average value of a random variable, which we have deﬁned as the expected value of a random variable, behaves in the same fashion. If Y = aX+b, then E(Y) = aE(X) + b. More generally, this property extends to linear combinations of random variables. 4.1 The Expected Value of a Random Variable 125 THEOREM A If X1, ..., Xn are jointly distributed random variables with expectations E(Xi ) and Y is a linear function of the Xi , Y = a + n i=1 bi Xi , then E(Y) = a + n i=1 bi E(Xi ) Proof We will prove this for the continuous case. The proof in the discrete case is parallel and is left to Problem 24 at the end of this chapter. For notational simplicity, we take n = 2. From Theorem B of Section 4.1.1, we have E(Y) = (a + b1x1 + b2x2) f (x1, x2) dx1 dx2 = a f (x1, x2) dx1 dx2 + b1 x1 f (x1, x2) dx1 dx2 + b2 x2 f (x1, x2) dx1 dx2 The ﬁrst double integral of the last expression is merely the integral of the bivariate density, which is equal to 1. 
The second double integral can be evaluated as follows: x1 f (x1, x2) dx1 dx2 = x1 f (x1, x2) dx2 dx1 = x1 fx1 (x1) dx1 = E(X1) A similar evaluation for the third double integral brings us to E(Y) = a + b1 E(X1) + b2 E(X2) This proves the theorem once we check that the expectation is well deﬁned, or that |a + b1x1 + b2x2| f (x1, x2) dx1 dx2 < ∞ This can be veriﬁed using the inequality |a + b1x1 + b2x2|≤|a|+|b1||x1|+|b2||x2| and the assumption that the E(Xi ) exist. ■ Theorem A is extremely useful. We will illustrate its utility with several examples. 126 Chapter 4 Expected Values EXAMPLE A Suppose that we wish to ﬁnd the expectation of a binomial random variable, Y. From the binomial frequency function, E(Y) = n k=0 n k kpk(1 − p)n−k It is not immediately obvious how to evaluate this sum. We can, however, represent Y as the sum of Bernoulli random variables, Xi , which equal 1 or 0 depending on whether there is success or failure on the ith trial, Y = n i=1 Xi Because E(Xi ) = 0 × (1 − p) + 1 × p = p, it follows immediately that E(Y) = np. An application of the binomial distribution and its expectation occurs in “shotgun sequencing” in genomics, a method of trying to ﬁgure out the sequence of letters that make up a long segment of DNA. It is technically too difﬁcult to sequence the entire segment at once if it is very long. The basic idea of shotgun sequencing is to chop the DNA randomly into many small fragments, sequence each fragment, and then somehow assemble the fragments into one long “contig.” The hope is that if there are many fragments, their overlaps can be used to assemble the contig. Suppose, then, that the length of the DNA sequence is G and that there are N fragments each of length L. G might be at least 100,000 and L about 500. Assume that the left end of each fragment is equally likely to be at positions 1, 2,...,G − L + 1. What is the probability that a particular location x ∈{L, L + 1,...,G} is covered by at least one fragment? 
How many fragments are expected to cover a particular location? (The positions {1, 2, ..., L − 1} are not included in this discussion because the boundary effect makes them a little different; for example, the only fragment that covers position 1 has its left end at position 1.)

To answer these questions, first consider a single fragment. The chance that it covers x equals the chance that its left end is in one of the L locations {x − L + 1, x − L + 2, ..., x}, and because the location of the left end is uniform, this probability is

p = L / (G − L + 1) ≈ L / G

where the approximation holds because G ≫ L. Thus, the distribution of W, the number of fragments that cover a particular location, is binomial with parameters N and p. From the binomial probability formula, the chance of coverage is

P(W > 0) = 1 − P(W = 0) = 1 − (1 − L/G)^N

Since N is large and p is small, the distribution of W is nearly Poisson with parameter λ = Np = NL/G. From the Poisson probability formula, P(W = 0) ≈ e^{−NL/G}, so the probability that a particular location is covered is approximately 1 − e^{−NL/G}. Observe that NL is the total length of all the fragments; the ratio NL/G is called the coverage. Calculations of this kind are thus useful in deciding how many fragments to use. If the coverage is 8, for example, the chance that a site is covered is .9997.

Overlap of fragments is important when trying to assemble them. Since W is a binomial random variable, the expected number of fragments that cover a given site is Np = NL/G, precisely the coverage. We can also now answer this closely related question: How many sites do we expect to be entirely missed? We will calculate this using indicator random variables: let I_x equal 1 if site x is missed and 0 otherwise. Then E(I_x) = 1 × P(I_x = 1) + 0 × P(I_x = 0) = e^{−NL/G}. The number of sites that are not covered is

V = Σ_{x=L}^{G} I_x

and from the linearity of expectation

E(V) = Σ_{x=L}^{G} E(I_x) ≈ G e^{−NL/G}
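These coverage calculations are easy to package as a short script. The sketch below (the function name is ours) evaluates the Poisson approximation just derived.

```python
import math

def coverage_stats(G, L, N):
    """Poisson-approximation quantities for shotgun sequencing:
    coverage c = NL/G, P(site covered) = 1 - e^(-c),
    and the expected number of missed sites, G * e^(-c)."""
    c = N * L / G
    p_miss = math.exp(-c)
    return c, 1.0 - p_miss, G * p_miss

# Eight-times coverage of a 3 * 10**9-base genome with 500-base fragments
G, L = 3 * 10**9, 500
N = 8 * G // L               # number of fragments giving NL/G = 8
c, p_covered, missed = coverage_stats(G, L, N)
print(round(p_covered, 4))   # 0.9997, as in the text
print(round(missed))         # roughly a million missed sites
```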
The length of the human genome is approximately G = 3 × 10^9, so with eight times coverage, we would expect about a million sites to be missed. ■

EXAMPLE B  Coupon Collection
Suppose that you collect coupons, that there are n distinct types of coupons, and that on each trial you are equally likely to get a coupon of any of the types. How many trials would you expect to go through until you had a complete set of coupons? (This might be a model for collecting baseball cards or for certain grocery store promotions.) The solution of this problem is greatly simplified by representing the number of trials as a sum. Let X_1 be the number of trials up to and including the trial on which the first coupon is collected: X_1 = 1. Let X_2 be the number of trials from that point up to and including the trial on which the next coupon different from the first is obtained; let X_3 be the number of trials from that point up to and including the trial on which the third distinct coupon is collected; and so on, up to X_n. Then the total number of trials, X, is the sum of the X_i, i = 1, 2, ..., n.

We now find the distribution of X_r. At this point, r − 1 of n coupons have been collected, so on each trial the probability of success is (n − r + 1)/n. Therefore, X_r is a geometric random variable, with E(X_r) = n/(n − r + 1). (See Example B of Section 4.1.) Thus,

E(X) = Σ_{r=1}^n E(X_r) = n/n + n/(n − 1) + n/(n − 2) + ··· + n/1 = n Σ_{r=1}^n 1/r

For example, if there are 10 types of coupons, the expected number of trials necessary to obtain at least one of each kind is 29.3.

Finally, we note the following famous approximation:

Σ_{r=1}^n 1/r = log n + γ + ε_n

where log is the natural log, or log_e (unless otherwise specified, log means natural log throughout this text), γ is Euler's constant, γ ≈ .577, and ε_n approaches zero as n approaches infinity. Using this approximation for n = 10, we find that the approximate expected number of trials is 28.8.
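The exact formula E(X) = n Σ 1/r and the harmonic approximation are easy to check numerically; the sketch below (function names ours) also verifies the expectation by simulation.

```python
import math, random

def expected_coupons(n):
    """Exact expectation: n * (1/1 + 1/2 + ... + 1/n)."""
    return n * sum(1.0 / r for r in range(1, n + 1))

def collect_all(n, rng):
    """Simulate drawing coupons uniformly at random until all n types are seen."""
    seen, trials = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        trials += 1
    return trials

print(round(expected_coupons(10), 1))            # 29.3, as in the text
print(round(10 * (math.log(10) + 0.5772), 1))    # harmonic approximation: 28.8

rng = random.Random(0)
sims = [collect_all(10, rng) for _ in range(5000)]
print(round(sum(sims) / len(sims), 1))           # simulated average, near 29.3
```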
Generally, we see that the expected number of trials grows at the rate n log n, or slightly faster than n. ■

EXAMPLE C  Group Testing
Suppose that a large number, n, of blood samples are to be screened for a relatively rare disease. If each sample is assayed individually, n tests will be required. On the other hand, if each sample is divided in half and one of the halves is put into a pool with all the other halves, the pooled lot can be tested. Then, provided that the test method is sensitive enough, if this test is negative, no further assays are necessary and only one test has to be performed. If the test on the pooled blood is positive, each reserved half-sample can be tested individually. In this case, a total of n + 1 tests will be required. It is therefore plausible, assuming that the disease is rare, that some savings can be achieved through this pooling procedure.

To analyze this more quantitatively, let us first generalize the scheme and suppose that the n samples are first grouped into m groups of k samples each, or n = mk. Each group is then tested; if a group tests positively, each individual in the group is tested. If X_i is the number of tests run on the ith group, the total number of tests run is N = Σ_{i=1}^m X_i, and the expected total number of tests is

E(N) = Σ_{i=1}^m E(X_i)

Let us find E(X_i). If the probability of a negative on any individual sample is p, then X_i takes on the value 1 with probability p^k or the value k + 1 with probability 1 − p^k. Thus,

E(X_i) = p^k + (k + 1)(1 − p^k) = k + 1 − k p^k

We now have

E(N) = m(k + 1) − mk p^k = n (1 + 1/k − p^k)

Recalling that n tests are necessary with no pooling, we see that the factor (1 + 1/k − p^k) gives the average number of tests used in group testing as a proportion of n. Figure 4.2 shows this proportion as a function of k for p = .99.
FIGURE 4.2  The proportion of n in the average number of samples tested using group testing, as a function of k.

From the figure, we see that for group testing with a group size of about 10, only 20% of the number of tests used with the straightforward method are needed on the average. ■

EXAMPLE D  Counting Word Occurrences in DNA Sequences
Here we consider another example from genomics, and one that again illustrates the power of using indicator random variables. In searching for patterns in DNA sequences, there might be reason to expect that a "word" such as TATA would occur more frequently than expected in a random sequence. Or suppose we want to identify regions of a DNA sequence in which the occurrence of the word is unusually large. To quantify these ideas, we need to specify the meaning of random. In this example, we will take it to mean that the sequence is randomly composed of the letters A, C, G, and T in the sense that the letters at sites are independent and, at every site, each letter has probability 1/4. We also need to be careful to specify how we count. Consider the following sequence:

ACTATATAGATATA

We will count overlaps, so in the preceding sequence, TATA occurs three times. Now suppose that the sequence is of length N and that the word is of length q. Let I_n be an indicator random variable taking on the value 1 if the word begins at position n and 0 otherwise:

P(I_n = 1) = (1/4)^q

from the assumption of independence, and E(I_n) = P(I_n = 1). Now the total number of times the word occurs is

W = Σ_{n=1}^{N−q+1} I_n

and

E(W) = Σ_{n=1}^{N−q+1} E(I_n) = (N − q + 1) (1/4)^q

Note that the I_n are not independent; for example, in the case of the word TATA, if I_1 = 1, then I_2 = 0. Thus W is not a binomial random variable. But despite the lack of independence, we can find E(W) by expressing W as a linear combination of indicator variables.
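The expected count E(W) = (N − q + 1)(1/4)^q, and the convention of counting overlaps, can be checked with a short simulation (a sketch; the function names are ours).

```python
import random

def count_overlapping(seq, word):
    """Number of occurrences of word in seq, counting overlaps."""
    q = len(word)
    return sum(seq[i:i + q] == word for i in range(len(seq) - q + 1))

print(count_overlapping("ACTATATAGATATA", "TATA"))   # 3, as in the text

def expected_count(N, q):
    """E(W) = (N - q + 1) * (1/4)**q for a uniform random sequence."""
    return (N - q + 1) * 0.25 ** q

N, word = 10000, "TATA"
print(round(expected_count(N, len(word)), 1))        # 9997/256, about 39.1

rng = random.Random(0)
counts = [count_overlapping("".join(rng.choice("ACGT") for _ in range(N)), word)
          for _ in range(200)]
print(round(sum(counts) / len(counts), 1))           # close to E(W) despite dependence
```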
■

EXAMPLE E Investment Portfolios
An investor plans to apportion an amount of capital, C_0, between two investments, placing a fraction π, 0 ≤ π ≤ 1, in one investment and a fraction 1 − π in the other for a fixed period of time. Denoting the returns (final value divided by initial value) on the investments by R_1 and R_2, her capital at the end of the period will be C_1 = πC_0R_1 + (1 − π)C_0R_2. Her return will then be

R = C_1/C_0 = πR_1 + (1 − π)R_2

Suppose that the returns are unknown, as would be the case if they were stocks, for example, and that they are hence modeled as random variables, with expected values E(R_1) and E(R_2). Then her expected return is

E(R) = πE(R_1) + (1 − π)E(R_2)

How should she choose π? A simple solution would apparently be to choose π = 1 if E(R_1) > E(R_2) and π = 0 otherwise. But there is more to the story, as we will see later. ■

4.2 Variance and Standard Deviation
The expected value of a random variable is its average value and can be viewed as an indication of the central value of the density or frequency function. The expected value is therefore sometimes referred to as a location parameter. The median of a distribution is also a location parameter, one that does not necessarily equal the mean. This section introduces another parameter, the standard deviation of a random variable, which is an indication of how dispersed the probability distribution is about its center; that is, of how spread out, on the average, the values of the random variable are about its expectation. We first define the variance of a random variable and then define the standard deviation in terms of the variance.

DEFINITION
If X is a random variable with expected value E(X), the variance of X is

Var(X) = E{[X − E(X)]^2}

provided that the expectation exists. The standard deviation of X is the square root of the variance.
■

If X is a discrete random variable with frequency function p(x) and expected value μ = E(X), then according to the definition and Theorem A of Section 4.1.1,

Var(X) = Σ_i (x_i − μ)^2 p(x_i)

whereas if X is a continuous random variable with density function f(x) and E(X) = μ,

Var(X) = ∫_{−∞}^{∞} (x − μ)^2 f(x) dx

The variance is often denoted by σ^2 and the standard deviation by σ. From the preceding definition, the variance of X is the average value of the squared deviation of X from its mean. If X has units of meters, for example, the variance has units of meters squared, and the standard deviation has units of meters. Although we are often interested ultimately in the standard deviation rather than the variance, it is usually easier to find the variance first and then take the square root.

The variance of a random variable changes in a simple way under linear transformations.

THEOREM A
If Var(X) exists and Y = a + bX, then Var(Y) = b^2 Var(X).

Proof
Since E(Y) = a + bE(X),

E[(Y − E(Y))^2] = E{[a + bX − a − bE(X)]^2}
= E{b^2[X − E(X)]^2}
= b^2 E{[X − E(X)]^2}
= b^2 Var(X) ■

This result seems reasonable once you realize that the addition of a constant does not affect the variance, since the variance is a measure of the spread around a center and the center has merely been shifted. The standard deviation transforms in a natural way: σ_Y = |b| σ_X. Thus, if the units of measurement are changed from meters to centimeters, for example, the standard deviation is simply multiplied by 100.

EXAMPLE A Bernoulli Distribution
If X has a Bernoulli distribution, that is, X takes on values 0 and 1 with probabilities 1 − p and p, respectively, then we have seen (Example A of Section 4.1.2) that E(X) = p.
By the definition of variance,

Var(X) = (0 − p)^2 (1 − p) + (1 − p)^2 p
= p^2 − p^3 + p − 2p^2 + p^3
= p(1 − p)

Note that the expression p(1 − p) is a quadratic with a maximum at p = 1/2. If p is 0 or 1, the variance is 0, which makes sense, since then the probability distribution is concentrated at a single point and the random variable is not variable at all. The distribution is most dispersed when p = 1/2. ■

EXAMPLE B Normal Distribution
We have seen that E(X) = μ. Then

Var(X) = E[(X − μ)^2] = (1/(σ√(2π))) ∫_{−∞}^{∞} (x − μ)^2 e^{−(x−μ)^2/(2σ^2)} dx

Making the change of variables z = (x − μ)/σ changes the right-hand side to

(σ^2/√(2π)) ∫_{−∞}^{∞} z^2 e^{−z^2/2} dz

Finally, making the change of variables u = z^2/2 reduces the integral to a gamma function, and we find that Var(X) = σ^2. ■

The following theorem gives an alternative way of calculating the variance.

THEOREM B
The variance of X, if it exists, may also be calculated as follows:

Var(X) = E(X^2) − [E(X)]^2

Proof
Denote E(X) by μ. Then

Var(X) = E[(X − μ)^2] = E(X^2 − 2μX + μ^2)

By the linearity of the expectation, this becomes

Var(X) = E(X^2) − 2μE(X) + μ^2 = E(X^2) − 2μ^2 + μ^2 = E(X^2) − μ^2

as was to be shown. ■

According to Theorem B, the variance of X can be found in two steps: first find E(X), and then find E(X^2).

EXAMPLE C Uniform Distribution
Let us apply Theorem B to find the variance of a random variable that is uniform on [0, 1]. We know that E(X) = 1/2; next we need to find E(X^2):

E(X^2) = ∫_0^1 x^2 dx = 1/3

We thus have

Var(X) = 1/3 − (1/2)^2 = 1/12 ■

It was stated earlier that the variance or standard deviation of a random variable gives an indication of how spread out its possible values are. A famous inequality, Chebyshev's inequality, lends a quantitative aspect to this indication.

THEOREM C Chebyshev's Inequality
Let X be a random variable with mean μ and variance σ^2. Then, for any t > 0,

P(|X − μ| > t) ≤ σ^2/t^2

Proof
Let Y = (X − μ)^2.
Then E(Y) = σ^2, and the result follows by applying Markov's inequality to Y. ■

Theorem C says that if σ^2 is very small, there is a high probability that X will not deviate much from μ. For another interpretation, we can set t = kσ, so that the inequality becomes

P(|X − μ| ≥ kσ) ≤ 1/k^2

For example, the probability that X is more than 4σ away from μ is less than or equal to 1/16. These results hold for any random variable with any distribution, provided the variance exists. In particular cases, the actual probabilities are often much smaller than the bound. For example, if X is normally distributed, we find from tables of the normal distribution that P(|X − μ| > 2σ) = .05 (compared to the bound of 1/4 obtained from Chebyshev's inequality).

Chebyshev's inequality has the following consequence.

COROLLARY A
If Var(X) = 0, then P(X = μ) = 1.

Proof
We give a proof by contradiction. Suppose that P(X = μ) < 1. Then, for some ε > 0, P(|X − μ| ≥ ε) > 0. However, by Chebyshev's inequality, for any ε > 0,

P(|X − μ| ≥ ε) ≤ σ^2/ε^2 = 0

which is a contradiction. ■

EXAMPLE D Investment Portfolios
We continue Example E of Section 4.1.2. Suppose that one of the two investments is risky and the other is risk free. The first might be a stock and the other an insured savings account. The stock has a return R_1, which is modeled as a random variable with expectation μ_1 = 0.10 and standard deviation σ_1 = 0.075. The standard deviation is a measure of risk: a large standard deviation means that the returns fluctuate a lot, so the investor might be lucky and get a large return, but might also be unlucky and lose a lot. Suppose that the savings account has a certain return R_2 = 0.03. The expected value of this return is μ_2 = 0.03 and its standard deviation is 0; it is risk free.
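Before working through the algebra that follows, the trade-off can be previewed numerically with the example's numbers (a sketch added here, not part of the text). By linearity, E(R) = π_1μ_1 + (1 − π_1)μ_2, and since R_2 is constant, only the stock contributes risk, so by Theorem A the standard deviation of R is π_1σ_1.

```python
mu1, sigma1 = 0.10, 0.075   # risky stock, values from the example
mu2 = 0.03                  # risk-free savings account return

for pi1 in (0.0, 0.25, 0.5, 0.75, 1.0):
    exp_ret = pi1 * mu1 + (1 - pi1) * mu2   # E(R) = pi1*mu1 + (1 - pi1)*mu2
    risk = pi1 * sigma1                     # sd of pi1*R1; the constant R2 adds none
    print(f"pi1={pi1:.2f}  E(R)={exp_ret:.4f}  sd={risk:.4f}")
```

Each increase in π_1 raises the expected return and the risk proportionally, which is the balancing act discussed next.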
If the investor places a fraction π_1 in the stock and a fraction π_2 = 1 − π_1 in the savings account, her return is

R = π_1 R_1 + (1 − π_1) R_2

and her expected return is

E(R) = π_1 μ_1 + (1 − π_1) μ_2

Since μ_1 > μ_2, her expected return is maximized by π_1 = 1, putting all her money in the stock. However, this point of view is too narrow; it does not take into account the risk of the stock. By Theorem A,

Var(R) = π_1^2 σ_1^2

and the standard deviation of the return is σ_R = π_1 σ_1. The larger π_1, the larger the expected return, but also the larger the risk. In choosing π_1, the investor has to balance the risk she is willing to take against the expected gain; the desired balance will be different for different investors. If she is risk averse, she will choose a small value of π_1, being leery of volatile investments. By tracing out the expected return and the standard deviation as functions of π_1, she can strike a balance with which she is comfortable. ■

4.2.1 A Model for Measurement Error
Values of physical constants are not precisely known but must be determined by experimental procedures. Such seemingly simple operations as weighing an object, determining a voltage, or measuring an interval of time are actually quite complicated when all the details and possible sources of error are taken into account. The National Institute of Standards and Technology (NIST) in the United States and similar agencies in other countries are charged with developing and maintaining measurement standards. Such agencies employ probabilists and statisticians as well as physical scientists in this endeavor.

A distinction is usually made between random and systematic measurement errors. A sequence of repeated independent measurements made with no deliberate change in the apparatus or experimental procedure may not yield identical values, and the uncontrollable fluctuations are often modeled as random.
At the same time, there may be errors that have the same effect on every measurement; equipment may be slightly out of calibration, for example, or there may be errors associated with the theory underlying the method of measurement. If the true value of the quantity being measured is denoted by x_0, the measurement, X, is modeled as

X = x_0 + β + ε

where β is the constant, or systematic, error and ε is the random component of the error; ε is a random variable with E(ε) = 0 and Var(ε) = σ^2. We then have

E(X) = x_0 + β and Var(X) = σ^2

β is often called the bias of the measurement procedure. The two factors affecting the size of the error are the bias and the size of the variance, σ^2. A perfect measurement would have β = 0 and σ^2 = 0.

EXAMPLE A Measurement of the Gravity Constant
This and the next example are taken from an interesting and readable paper by Youden (1972), a statistician at NIST. Measurement of the acceleration due to gravity at Ottawa was done 32 times with each of two different methods (Preston-Thomas et al. 1960). The results are displayed as histograms in Figure 4.3. [FIGURE 4.3 Histograms of two sets of measurements of the acceleration due to gravity (Rule No. 1, Aug 1958, 32 drops; Rule No. 2, Dec 1959, 32 drops). The reported means are 980.6124 and 980.6139 cm/sec^2, the standard deviations 0.6 and 0.9 mgal, and the maximum spreads 2.9 and 4.1 mgal.] There is clearly some systematic difference between the two methods as well as some variation within each method. It appears that the two biases are unequal. The results from Rule 1 are more scattered than those of Rule 2, and their standard deviation is larger.
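A quick simulation (added here; the numeric values are invented for illustration) confirms that under the model X = x_0 + β + ε, the mean of repeated measurements settles at x_0 + β and their variance at σ^2. Normally distributed random error is assumed for convenience; the model itself only requires E(ε) = 0 and Var(ε) = σ^2.

```python
import random

random.seed(3)
x0, beta, sigma = 10.0, 0.5, 2.0   # true value, bias, random-error sd (made-up numbers)
n = 200_000

# repeated measurements X = x0 + beta + eps, with eps ~ N(0, sigma^2)
xs = [x0 + beta + random.gauss(0, sigma) for _ in range(n)]

mean_x = sum(xs) / n
var_x = sum((x - mean_x)**2 for x in xs) / n
print(round(mean_x, 3), round(var_x, 3))   # near x0 + beta = 10.5 and sigma^2 = 4
```

Note that no amount of repetition drives the average toward x_0 itself; the bias β survives averaging, which is why systematic error is the harder problem in practice.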
■

An overall measure of the size of the measurement error that is often used is the mean squared error, which is defined as

MSE = E[(X − x_0)^2]

The mean squared error, which is the expected squared deviation of X from x_0, can be decomposed into contributions from the bias and the variance.

THEOREM A
MSE = β^2 + σ^2.

Proof
From Theorem B of Section 4.2,

E[(X − x_0)^2] = Var(X − x_0) + [E(X − x_0)]^2 = Var(X) + β^2 = σ^2 + β^2 ■

Measurements are often reported in the form 102 ± 1.6, for example. Although it is not always clear what precisely is meant by such notation, 102 is the experimentally determined value and 1.6 is some measure of the error. It is often claimed or hoped that β is negligible relative to σ, and in that case 1.6 represents σ or some multiple of σ. In the graphical presentation of experimentally obtained data, error bars, usually of width σ or some multiple of σ, are placed around measured values. In some cases, efforts are made to bound the magnitude of β, and the bound is incorporated into the error bars in some fashion.

EXAMPLE B Measurement of the Speed of Light
Figure 4.4, taken from McNish (1962) and discussed by Youden (1972), shows 24 independent determinations of the speed of light, c, with error bars. The right column of the figure contains codes for the experimental methods used to obtain the measurements; for example, G denotes a method called the geodimeter method. The range of values for c is about 3.5 km/sec, and many of the errors are less than .5 km/sec. Examination of the figure makes it clear that the error bars are too small and that the spread of values cannot be accounted for by different experimental techniques alone; the geodimeter method produced both the smallest and the next to largest value for c.
Youden remarks, "Surely the evidence suggests that individual investigators are unable to set realistic limits of error to their reported values." He goes on to suggest that the differences are largely a result of calibration errors for equipment. ■

[FIGURE 4.4 A plot of 24 independent determinations of the speed of light, with the reported error bars (axis ticks from roughly 299790 to 299796 km/sec). The investigator or country is listed in the left column, and the experimental method is coded in the right column.]

4.3 Covariance and Correlation
The variance of a random variable is a measure of its variability, and the covariance of two random variables is a measure of their joint variability, or their degree of association. After defining covariance, we will develop some of its properties and discuss a measure of association called correlation, which is defined in terms of covariance. You may find this material somewhat formal and abstract at first, but as you use them, covariance, correlation, and their properties will begin to seem natural and familiar.

DEFINITION
If X and Y are jointly distributed random variables with expectations μ_X and μ_Y, respectively, the covariance of X and Y is

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]

provided that the expectation exists. ■

The covariance is the average value of the product of the deviation of X from its mean and the deviation of Y from its mean. If the random variables are positively associated, that is, when X is larger than its mean, Y tends to be larger than its mean as well, the covariance will be positive. If the association is negative, that is, when X is larger than its mean, Y tends to be smaller than its mean, the covariance is negative. These statements will be expanded in the discussion of correlation.
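These sign statements are easy to see numerically (a sketch added here with constructed variables, not part of the text). Below, one variable is built to move with X and another to move against it, and the sample covariances come out positive and negative, respectively.

```python
import random

random.seed(4)
n = 100_000
xs = [random.gauss(0, 1) for _ in range(n)]
pos = [x + random.gauss(0, 1) for x in xs]    # tends to be large when X is large
neg = [-x + random.gauss(0, 1) for x in xs]   # tends to be small when X is large

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance: average product of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

print(round(cov(xs, pos), 3), round(cov(xs, neg), 3))  # near +1 and -1
```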
By expanding the product and using the linearity of the expectation, we obtain an alternative expression for the covariance, paralleling Theorem B of Section 4.2:

Cov(X, Y) = E(XY − Xμ_Y − Yμ_X + μ_X μ_Y)
= E(XY) − E(X)μ_Y − E(Y)μ_X + μ_X μ_Y
= E(XY) − E(X)E(Y)

In particular, if X and Y are independent, then E(XY) = E(X)E(Y) and Cov(X, Y) = 0 (but the converse is not true). See Problems 59 and 60 at the end of this chapter for examples.

EXAMPLE A
Let us return to the bivariate uniform distributions of Example C in Section 3.3. First, note that since the marginal distributions are uniform, E(X) = E(Y) = 1/2. For the case α = −1, the joint density of X and Y is f(x, y) = 2x + 2y − 4xy, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and

E(XY) = ∫∫ xy f(x, y) dx dy = ∫_0^1 ∫_0^1 xy(2x + 2y − 4xy) dx dy = 2/9

Thus,

Cov(X, Y) = 2/9 − (1/2)(1/2) = −1/36

The covariance is negative, indicating a negative relationship between X and Y. In fact, from Figure 3.5, we see that if X is less than its mean, 1/2, then Y tends to be larger than its mean, and vice versa. A similar analysis shows that when α = 1, Cov(X, Y) = 1/36. ■

We will now develop an expression for the covariance of linear combinations of random variables, proceeding in a number of small steps.
First, since E(a + X) = a + E(X),

Cov(a + X, Y) = E{[a + X − E(a + X)][Y − E(Y)]}
= E{[X − E(X)][Y − E(Y)]}
= Cov(X, Y)

Next, since E(aX) = aE(X),

Cov(aX, bY) = E{[aX − aE(X)][bY − bE(Y)]}
= E{ab[X − E(X)][Y − E(Y)]}
= ab E{[X − E(X)][Y − E(Y)]}
= ab Cov(X, Y)

Next, we consider Cov(X, Y + Z):

Cov(X, Y + Z) = E([X − E(X)]{[Y − E(Y)] + [Z − E(Z)]})
= E{[X − E(X)][Y − E(Y)] + [X − E(X)][Z − E(Z)]}
= E{[X − E(X)][Y − E(Y)]} + E{[X − E(X)][Z − E(Z)]}
= Cov(X, Y) + Cov(X, Z)

We can now put these results together to find Cov(aW + bX, cY + dZ):

Cov(aW + bX, cY + dZ) = Cov(aW + bX, cY) + Cov(aW + bX, dZ)
= Cov(aW, cY) + Cov(bX, cY) + Cov(aW, dZ) + Cov(bX, dZ)
= ac Cov(W, Y) + bc Cov(X, Y) + ad Cov(W, Z) + bd Cov(X, Z)

In general, the same kind of argument gives the following important bilinear property of covariance.

THEOREM A
Suppose that U = a + Σ_{i=1}^n b_i X_i and V = c + Σ_{j=1}^m d_j Y_j. Then

Cov(U, V) = Σ_{i=1}^n Σ_{j=1}^m b_i d_j Cov(X_i, Y_j) ■

This theorem has many applications. In particular, since Var(X) = Cov(X, X),

Var(X + Y) = Cov(X + Y, X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)

More generally, we have the following result for the variance of a linear combination of random variables.

COROLLARY A
Var(a + Σ_{i=1}^n b_i X_i) = Σ_{i=1}^n Σ_{j=1}^n b_i b_j Cov(X_i, X_j). ■

If the X_i are independent, then Cov(X_i, X_j) = 0 for i ≠ j, and we have another corollary.

COROLLARY B
Var(Σ_{i=1}^n X_i) = Σ_{i=1}^n Var(X_i), if the X_i are independent. ■

Corollary B is very useful. Note that E(Σ X_i) = Σ E(X_i) whether or not the X_i are independent, but it is generally not the case that Var(Σ X_i) = Σ Var(X_i).

EXAMPLE B
Finding the variance of a binomial random variable from the definition of variance and the frequency function of the binomial distribution is not easy (try it). But expressing a binomial random variable as a sum of independent Bernoulli random variables makes the computation of the variance trivial.
Specifically, if Y is a binomial random variable, it can be expressed as Y = X_1 + X_2 + ··· + X_n, where the X_i are independent Bernoulli random variables with P(X_i = 1) = p. We saw earlier (Example A in Section 4.2) that Var(X_i) = p(1 − p), from which it follows from Corollary B that Var(Y) = np(1 − p). ■

EXAMPLE C Random Walk
A drunken walker starts out at a point x_0 on the real line. He takes a step of length X_1, which is a random variable with expected value μ and variance σ^2, and his position at that time is S(1) = x_0 + X_1. He then takes another step of length X_2, which is independent of X_1 with the same mean and variance. His position after n such steps is S(n) = x_0 + Σ_{i=1}^n X_i. Then

E(S(n)) = x_0 + E(Σ_{i=1}^n X_i) = x_0 + nμ

Var(S(n)) = Var(Σ_{i=1}^n X_i) = nσ^2

He thus expects to be at the position x_0 + nμ, with an uncertainty, as measured by the standard deviation, of √n σ. Note that if μ > 0, for example, for large values of n he will be to the right of the point x_0 with very high probability (using Chebyshev's inequality).

Random walks have found applications in many areas of science. Brownian motion is a continuous-time version of a random walk in which the steps are normally distributed random variables. The name derives from observations by the biologist Robert Brown in 1827 of the apparently spontaneous motion of pollen grains suspended in water. This was later explained by Einstein to be due to the collisions of the grains with randomly moving water molecules. The theory of Brownian motion was developed by Louis Bachelier in 1900 in his PhD thesis, "The theory of speculation," which related random walks to the evolution of stock prices. If the value of a stock evolves through time as a random walk, its short-term behavior is unpredictable. The efficient market hypothesis states that stock prices already reflect all known information, so that the future price is random and unknowable.
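The mean and variance formulas for S(n) can be checked by simulation (a sketch added here with made-up parameters; normally distributed steps are assumed for convenience, though the formulas hold for any step distribution with the given mean and variance).

```python
import random

random.seed(5)
x0, mu, sigma, n = 0.0, 0.5, 1.0, 100   # start, step mean, step sd, number of steps
trials = 20_000

finals = []
for _ in range(trials):
    s = x0
    for _ in range(n):
        s += random.gauss(mu, sigma)    # one step of the walk
    finals.append(s)

m = sum(finals) / trials
v = sum((s - m)**2 for s in finals) / trials
print(round(m, 2), round(v, 1))   # near x0 + n*mu = 50 and n*sigma^2 = 100
```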
The solid line in Figure 4.5 shows the value of the S&P 500 during 2003. The average of the increments (steps) was 0.81 and the standard deviation was 9.82. The dashed lines in the figure are simulations of random walks with the same initial value and with increments that were normally distributed random variables with μ = 0.81 and σ = 9.82. [FIGURE 4.5 The solid line is the value of the S&P 500 during 2003. The dashed lines are simulations of random walks.] Notice the long stretches of upturns and downturns that occurred in the random walks, of the kind that would have been explained ex post facto by market analysts. See Malkiel (2004) for a popular exposition of the implications of random walk theory for stock market investors. ■

The correlation coefficient is defined in terms of the covariance.

DEFINITION
If X and Y are jointly distributed random variables, and the variances and covariance of X and Y exist and the variances are nonzero, then the correlation of X and Y, denoted by ρ, is

ρ = Cov(X, Y) / √(Var(X)Var(Y)) ■

Note that because of the way the ratio is formed, the correlation is a dimensionless quantity (it has no units, such as inches, since the units in the numerator and denominator cancel). From the properties of the variance and covariance that we have established, it follows easily that if X and Y are both subjected to linear transformations (such as changing their units from inches to meters), the correlation coefficient does not change. Since it does not depend on the units of measurement, ρ is in many cases a more useful measure of association than is the covariance.

EXAMPLE D
Let us return to the bivariate uniform distribution of Example A. Because X and Y are marginally uniform, Var(X) = Var(Y) = 1/12. In the one case (α = −1), we found Cov(X, Y) = −1/36, so

ρ = (−1/36) × 12 = −1/3

In the other case (α = 1), the covariance was 1/36, so the correlation is 1/3.
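The values 2/9, −1/36, and −1/3 from Examples A and D can be confirmed numerically (a sketch added here, not part of the text) by integrating the α = −1 density on a midpoint grid; the grid resolution is an arbitrary choice.

```python
def f(x, y):
    # joint density for the alpha = -1 case: f(x, y) = 2x + 2y - 4xy on the unit square
    return 2*x + 2*y - 4*x*y

m = 400                                # grid resolution for a midpoint-rule integral
h = 1.0 / m
pts = [(i + 0.5) * h for i in range(m)]

exy = sum(f(x, y) * x * y for x in pts for y in pts) * h * h
cov = exy - 0.25                       # Cov(X, Y) = E(XY) - (1/2)(1/2)
rho = cov / (1 / 12)                   # Var(X) = Var(Y) = 1/12, so rho = 12 * Cov
print(round(exy, 4), round(cov, 4), round(rho, 4))  # near 2/9, -1/36, -1/3
```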
■

The following notation and relationship are often useful. The standard deviations of X and Y are denoted by σ_X and σ_Y, and their covariance by σ_XY. We thus have

ρ = σ_XY/(σ_X σ_Y) and σ_XY = ρ σ_X σ_Y

The following theorem states some further properties of ρ.

THEOREM B
−1 ≤ ρ ≤ 1. Furthermore, ρ = ±1 if and only if P(Y = a + bX) = 1 for some constants a and b.

Proof
Since the variance of a random variable is nonnegative,

0 ≤ Var(X/σ_X + Y/σ_Y) = Var(X/σ_X) + Var(Y/σ_Y) + 2 Cov(X/σ_X, Y/σ_Y)
= Var(X)/σ_X^2 + Var(Y)/σ_Y^2 + 2 Cov(X, Y)/(σ_X σ_Y)
= 2(1 + ρ)

From this, we see that ρ ≥ −1. Similarly,

0 ≤ Var(X/σ_X − Y/σ_Y) = 2(1 − ρ)

implies that ρ ≤ 1. Suppose that ρ = 1. Then

Var(X/σ_X − Y/σ_Y) = 0

which, by Corollary A of Section 4.2, implies that

P(X/σ_X − Y/σ_Y = c) = 1

for some constant c. This is equivalent to P(Y = a + bX) = 1 for some a and b. A similar argument holds for ρ = −1. ■

EXAMPLE E Investment Portfolio
We are now in a position to further develop the investment theory discussed in Section 4.1.2, Example E, and Section 4.2, Example D. Please review those examples before continuing. We first consider the simple example of two securities, assuming that they have the same expected returns, μ_1 = μ_2 = μ, and that their returns are uncorrelated: σ_12 = Cov(R_1, R_2) = 0. For a portfolio (π, 1 − π), the expected return is

E(R(π)) = πμ + (1 − π)μ = μ

so that when considering expected return only, the choice of π makes no difference. However, taking risk into account,

Var(R(π)) = π^2 σ_1^2 + (1 − π)^2 σ_2^2

Minimizing this with respect to π gives the optimal portfolio

π_opt = σ_2^2/(σ_1^2 + σ_2^2)

For example, if the investments are equally risky, σ_1 = σ_2 = σ, then π_opt = 1/2, so the best strategy is to split her total investment equally between the two securities. If she does so, the variance of her return is, by Theorem A,

Var(R(1/2)) = σ^2/2

whereas if she put all her money in one security, the variance of her return would be σ^2.
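The formula for π_opt can be double-checked numerically (a sketch added here with illustrative volatilities, not part of the text) by scanning a grid of portfolio weights and confirming that the closed-form weight attains the smallest variance.

```python
def portfolio_var(pi, s1, s2):
    # Var(R(pi)) = pi^2 s1^2 + (1 - pi)^2 s2^2 for uncorrelated returns
    return pi**2 * s1**2 + (1 - pi)**2 * s2**2

s1, s2 = 0.10, 0.10                     # equally risky securities (illustrative values)
pi_opt = s2**2 / (s1**2 + s2**2)        # closed-form minimizer from the text

grid = [i / 1000 for i in range(1001)]
pi_best = min(grid, key=lambda pi: portfolio_var(pi, s1, s2))

# for equal risks: pi_opt = 1/2 and Var(R(1/2)) = sigma^2 / 2
print(pi_opt, pi_best, portfolio_var(pi_opt, s1, s2))
```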
The expected return is the same in both cases. This is a particularly simple example of the value of diversification of investments.

Suppose now that the two securities do not have the same expected returns, μ_1 < μ_2. Let the standard deviations of the returns be σ_1 and σ_2; usually less risky investments have lower expected returns, so σ_1 < σ_2. Furthermore, the two returns may be correlated: Cov(R_1, R_2) = ρσ_1σ_2. Corresponding to the portfolio (π, 1 − π), we have expected return

E(R(π)) = πμ_1 + (1 − π)μ_2

and the variance of the return is

Var(R(π)) = π^2 σ_1^2 + 2π(1 − π)ρσ_1σ_2 + (1 − π)^2 σ_2^2

Comparing this to the result when the returns were uncorrelated, we see that the risk is lower when the returns are uncorrelated than when they are positively correlated. It would thus be better to invest in two unrelated or weakly related market sectors than to make two investments in the same sector. In deciding the choice of the portfolio vector, the investor can study how the risk (the standard deviation of R(π)) changes as the expected return increases, and balance expected return against risk.

In actual investment decisions, many more than two possible investments are involved, but the basic idea remains the same. Suppose there are n possible investments. Let the portfolio weights be denoted by the vector π = (π_1, π_2, ..., π_n), and let E(R_i) = μ_i and Cov(R_i, R_j) = σ_ij (so, in particular, Var(R_i) is denoted by σ_ii). Then

E(R(π)) = Σ_i π_i μ_i and Var(R(π)) = Σ_{i=1}^n Σ_{j=1}^n π_i π_j σ_ij

The investment decision, the choice of the portfolio vector π, is often couched as that of maximizing expected return subject to the risk being less than some value the individual investor is willing to tolerate. Some investors are more risk averse than others, so the portfolio vectors will differ from investor to investor.
Equivalently, the decision may be phrased as that of finding the portfolio vector with the minimum risk subject to a desired return; there may well be many portfolio choices that give the same expected return, and the wise investor would choose the one among them that has the lowest risk. As a general rule, risk is reduced by diversification and can be decreased with only a small sacrifice of returns.

Figure 4.6, from Bernstein (1996, p. 254), illustrates this point empirically. [FIGURE 4.6 The benefit of diversification. The monthly average return from January 1992 to June 1994 of 13 stock markets (Argentina, Brazil, Chile, Greece, Indonesia, Korea, Malaysia, Mexico, the Philippines, Portugal, Taiwan, Thailand, and Turkey), plotted against their standard deviations. The performance of the Standard and Poor's 500 index of U.S. stocks is plotted for comparison.] The point labeled "Index" shows the monthly average return versus standard deviation for an investment that was equally weighted across all the markets. A reasonably high return with relatively little risk would thus have been obtained by spreading investments equally over the 13 stock markets. In fact, the risk is less than that of any of the individual markets. Note that these emerging markets were riskier than the U.S. market, but that they were more profitable. ■

EXAMPLE F Bivariate Normal Distribution
We will show that the covariance of X and Y when they follow a bivariate normal distribution is ρσ_Xσ_Y, which means that the parameter ρ is in fact the correlation coefficient.
The covariance is

Cov(X, Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − μ_X)(y − μ_Y) f(x, y) dx dy

Making the changes of variables u = (x − μ_X)/σ_X and v = (y − μ_Y)/σ_Y changes the right-hand side to

(σ_X σ_Y/(2π√(1 − ρ^2))) ∫_{−∞}^{∞} ∫_{−∞}^{∞} uv exp[−(u^2 + v^2 − 2ρuv)/(2(1 − ρ^2))] du dv

As in Example F in Section 3.3, we use the technique of completing the square to rewrite this expression as

(σ_X σ_Y/(2π√(1 − ρ^2))) ∫_{−∞}^{∞} v exp(−v^2/2) ∫_{−∞}^{∞} u exp[−(u − ρv)^2/(2(1 − ρ^2))] du dv

The inner integral is the mean of an N[ρv, (1 − ρ^2)] random variable, lacking only the normalizing constant [2π(1 − ρ^2)]^{−1/2}, and we thus have

Cov(X, Y) = (ρσ_X σ_Y/√(2π)) ∫_{−∞}^{∞} v^2 e^{−v^2/2} dv = ρσ_X σ_Y

as was to be shown. ■

The correlation coefficient ρ measures the strength of the linear relationship between X and Y (compare with Figure 3.9). Correlation also affects the appearance of a scatterplot, which is constructed by generating n independent pairs (X_i, Y_i), i = 1, ..., n, and plotting the points. Figure 4.7 shows scatterplots of 100 pairs of pseudorandom bivariate normal random variables for various values of ρ. [FIGURE 4.7 Scatterplots of 100 independent pairs of bivariate normal random variables: (a) ρ = 0, (b) ρ = .3, (c) ρ = .6, (d) ρ = .9.] Note that the clouds of points are roughly elliptical in shape.

4.4 Conditional Expectation and Prediction
4.4.1 Definitions and Examples
In Section 3.5, conditional frequency functions and density functions were defined. We noted that these have the properties of ordinary frequency and density functions. In particular, associated with a conditional distribution is a conditional mean. Suppose that Y and X are discrete random variables and that the conditional frequency function of Y given x is p_{Y|X}(y|x).
The conditional expectation of Y given X = x is

E(Y|X = x) = Σ_y y p_{Y|X}(y|x)

For the continuous case, we have

E(Y|X = x) = ∫ y f_{Y|X}(y|x) dy

More generally, the conditional expectation of a function h(Y) is, in the continuous case,

E[h(Y)|X = x] = ∫ h(y) f_{Y|X}(y|x) dy

A similar equation holds in the discrete case.

EXAMPLE A
Consider a Poisson process on [0, 1] with mean λ, and let N be the number of points in [0, 1]. For p < 1, let X be the number of points in [0, p]. Find the conditional distribution and conditional mean of X given N = n.

We first find the joint distribution: P(X = x, N = n) is the probability of x events in [0, p] and n − x events in [p, 1]. From the assumption of a Poisson process, the counts in the two intervals are independent Poisson random variables with parameters pλ and (1 − p)λ, so

p_{XN}(x, n) = [(pλ)^x e^{−pλ}/x!] × {[(1 − p)λ]^{n−x} e^{−(1−p)λ}/(n − x)!}

The marginal distribution of N is Poisson, so the conditional frequency function of X is, after some algebra,

p_{X|N}(x|n) = p_{XN}(x, n)/p_N(n) = [n!/(x!(n − x)!)] p^x (1 − p)^{n−x}

This is the binomial distribution with parameters n and p. By Example A of Section 4.1.2, the conditional expectation is thus np. ■

EXAMPLE B Bivariate Normal Distribution
From Example C in Section 3.5.2, if Y and X follow a bivariate normal distribution, the conditional density of Y given X is

f_{Y|X}(y|x) = (1/(σ_Y √(2π(1 − ρ^2)))) exp{−[y − μ_Y − ρ(σ_Y/σ_X)(x − μ_X)]^2/(2σ_Y^2(1 − ρ^2))}

This is a normal density with mean μ_Y + ρ(σ_Y/σ_X)(x − μ_X) and variance σ_Y^2(1 − ρ^2). The former is the conditional mean and the latter the conditional variance of Y given X = x. Note that the conditional mean is a linear function of x, and that as |ρ| increases, the conditional variance decreases; both of these facts are suggested by the elliptical contours of the joint density. To see this more exactly, consider the case in which σ_X = σ_Y = 1 and μ_X = μ_Y = 0.
The contours then are ellipses satisfying

x^2 − 2ρxy + y^2 = constant

The major and minor axes of such an ellipse lie along the 45° and 135° lines. The conditional expectation of Y given X = x is the line y = ρx; note that this line does not lie along the major axis of the ellipse. Figure 4.8 shows such a bivariate normal distribution with ρ = 0.5. [FIGURE 4.8 Bivariate normal density with correlation ρ = 0.5. The conditional expectation of Y given X = x is shown as the solid line in the xy plane.] The curves on the density surface correspond to the conditional densities of Y given various values of x, but they are not normalized to integrate to 1. The contours of the bivariate normal are the ellipses shown in the xy plane as dashed curves, with the major axis shown by the straight dashed line. The conditional expectation of Y given X = x is shown as a function of x by the solid line in the plane. Note that it is not the major axis of the ellipse.

This phenomenon was noted by Sir Francis Galton (1822–1911), who studied the relationship of the heights of sons to those of their fathers. He observed that sons of very tall fathers were on average shorter than their fathers and that sons of very short fathers were on average taller. The empirical relationship is shown in Figure 14.19. ■

Assuming that the conditional expectation of Y given X = x exists for every x in the range of X, it is a well-defined function of X and hence is a random variable, which we write as E(Y|X). For instance, in Example A we found that E(X|N = n) = np; thus, E(X|N) = Np is a random variable that is a function of N. Provided that the appropriate sums or integrals converge, this random variable has an expectation and a variance.
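The fact that E(X|N = n) = np in Example A can be verified by simulation (a sketch added here, not part of the text; the values of λ and p, and the use of Knuth's Poisson sampler, are choices made for illustration). Conditioning on N = n amounts to averaging X only over those trials in which N came out equal to n.

```python
import random
import math

random.seed(7)
lam, p = 5.0, 0.3   # process rate and subinterval length (made-up values)

def poisson(lam):
    """Sample a Poisson(lam) variate by Knuth's multiplication method."""
    limit, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod < limit:
            return k
        k += 1

tot, num = {}, {}
for _ in range(100_000):
    n = poisson(lam)                                  # N: points in [0, 1]
    x = sum(random.random() < p for _ in range(n))    # X: those landing in [0, p]
    tot[n] = tot.get(n, 0) + x
    num[n] = num.get(n, 0) + 1

for n in (4, 5, 6):
    print(n, round(tot[n] / num[n], 3), n * p)        # E(X | N = n) should be near n*p
```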
Its expectation is E[E(Y|X)]; for this expression, note that since E(Y|X) is a random variable that is a function of X, the outer expectation can be taken with respect to the distribution of X (Theorem A of Section 4.1.1). The following theorem says that the average (expected) value of Y can be found by ﬁrst conditioning on X, ﬁnding E(Y|X), and then averaging this quantity with respect to X. THEOREM A E(Y) = E[E(Y|X)]. Proof We will prove this for the discrete case. The continuous case is proved similarly. Using Theorem 4.1.1A we need to show that E(Y) = x E(Y|X = x)pX (x) where E(Y|X = x) = y ypY|X (y|x) Interchanging the order of summation gives us x E(Y|X = x)pX (x) = y y x pY|X (y|x)pX (x) (It can be shown that this interchange can be made.) From the law of total probability, we have pY (y) = x pY|X (y|x)pX (x) Therefore, y y x pY|X (y|x)pX (x) = y ypY (y) = E(Y) ■ Theorem A gives what might be called a law of total expectation: The ex- pectation of a random variable Y can be calculated by weighting the conditional expectations appropriately and summing or integrating. 150 Chapter 4 Expected Values EXAMPLE C Suppose that in a system, a component and a backup unit both have mean lifetimes equal to μ. If the component fails, the system automatically substitutes the backup unit, but there is probability p that something will go wrong and it will fail to do so. Let T be the total lifetime, and let X = 1 if the substitution of the backup takes place successfully, and X = 0 if it does not. Thus the total lifetime is the lifetime of the ﬁrst component if the backup fails and is the sum of the lifetimes of the original and the backup units if the backup is successfully made. 
Then

E(T|X = 1) = 2μ
E(T|X = 0) = μ

Thus,

E(T) = E(T|X = 1)P(X = 1) + E(T|X = 0)P(X = 0) = 2μ(1 − p) + μp = μ(2 − p) ■

EXAMPLE D Random Sums
This example introduces sums of the type

T = Σ_{i=1}^{N} X_i

where N is a random variable with a finite expectation and the X_i are random variables that are independent of N and have the common mean E(X). Such sums arise in a variety of applications. An insurance company might receive N claims in a given period of time, and the amounts of the individual claims might be modeled as random variables X₁, X₂, .... The random variable N could denote the number of customers entering a store and X_i the expenditure of the ith customer, or N could denote the number of jobs in a single-server queue and X_i the service time for the ith job. For this last case, T is the time to serve all the jobs in the queue. According to Theorem A,

E(T) = E[E(T|N)]

Since E(T|N = n) = nE(X), E(T|N) = NE(X), and thus

E(T) = E[NE(X)] = E(N)E(X)

This agrees with the intuitive guess that the average time to complete N jobs, where N is random, is the average value of N times the average amount of time to complete a job. ■

We have seen that the expectation of the random variable E(Y|X) is E(Y). We now find its variance.

THEOREM B
Var(Y) = Var[E(Y|X)] + E[Var(Y|X)].

Proof
We will explain what is meant by the notation in the course of the proof. First,

Var(Y|X = x) = E(Y²|X = x) − [E(Y|X = x)]²

which is defined for all values of x. Thus, just as we defined E(Y|X) to be a random variable by letting X be random, we can define Var(Y|X) as a random variable. In particular, Var(Y|X) has the expectation E[Var(Y|X)]. Since

Var(Y|X) = E(Y²|X) − [E(Y|X)]²

E[Var(Y|X)] = E[E(Y²|X)] − E{[E(Y|X)]²}

Also,

Var[E(Y|X)] = E{[E(Y|X)]²} − {E[E(Y|X)]}²

The final piece that we need is

Var(Y) = E(Y²) − [E(Y)]² = E[E(Y²|X)] − {E[E(Y|X)]}²

by the law of total expectation.
Now we can put all the pieces together: Var(Y) = E[E(Y 2|X)] −{E[E(Y|X)]}2 = E[E(Y 2|X)] − E{[E(Y|X)]2}+E{[E(Y|X)]2}−{E[E(Y|X)]}2 = E[Var(Y|X)] + Var[E(Y|X)] ■ EXAMPLE E Random Sums Let us continue Example D but with the additional assumptions that the Xi are inde- pendent random variables with the same mean, E(X), and the same variance, Var(X), and that Var(N)<∞. According to Theorem B, Var(T ) = E[Var(T |N)] + Var[E(T |N)] Because E(T |N) = NE(X), Var[E(T |N)] = [E(X)]2Var(N) Also, since Var(T |N = n) = Var(n i=1 Xi ) = n Var(X), Var(T |N) = N Var(X) and E[Var(T |N)] = E(N)Var(X) 152 Chapter 4 Expected Values We thus have Var(T ) = [E(X)]2Var(N) + E(N)Var(X) If N is ﬁxed, say, N = n, then Var(T ) = n Var(X). Thus, we see from the preceding equation that extra variability occurs in T because N is random. As a concrete example, suppose that the number of insurance claims in a certain time period has expected value equal to 900 and standard deviation equal to 30, as would be the case if the number were a Poisson random variable with expected value 900. Suppose that the average claim value is $1000 and the standard deviation is $500. Then the expected value of the total, T , of the claims is E(T ) = $900,000 and the variance of T is Var (T ) = 10002 × 900 + 900 × 5002 = 1.125 × 109 The standard deviation of T is the square root of the variance, $33,541. The insurance company could then plan on total claims of $900,000 plus or minus a few standard deviations (by Chebyshev’s inequality). Observe that if the total number of claims were not variable but were ﬁxed at N = 900, the variance of the total claims would be given by E(N)Var(X) in the preceding expression. The result would be a standard deviation equal to $15,000. The variability in the number of claims thus contributes substantially to the uncertainty in the total. ■ 4.4.2 Prediction This section treats the problem of predicting the value of one random variable from another. 
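Before taking up prediction, note that the insurance arithmetic of Example E is easy to reproduce; the short sketch below simply evaluates Var(T) = [E(X)]²Var(N) + E(N)Var(X) at the values stated in the example:

```python
import math

# Reproducing the arithmetic of Example E (random sums) with the numbers
# given in the text: E(N) = 900, SD(N) = 30, E(X) = $1000, SD(X) = $500.
EN, sdN = 900, 30
EX, sdX = 1000, 500

ET = EN * EX                              # E(T) = E(N)E(X)
varT = EX**2 * sdN**2 + EN * sdX**2       # Var(T) = E(X)^2 Var(N) + E(N) Var(X)
sdT = math.sqrt(varT)

print(ET, varT, round(sdT))               # 900000 1125000000 33541
```

The standard deviation of about $33,541 agrees with the value quoted in the example.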
We might wish, for example, to measure the value of some physical quan- tity, such as pressure, using an instrument. The actual pressures to be measured are unknown and variable, so we might model them as values of a random variable, Y. Assume that measurements are to be taken by some instrument that produces a re- sponse, X, related to Y in some fashion but corrupted by random noise as well; X might represent current ﬂow, for example. Y and X have some joint distribution, and we wish to predict the actual pressure, Y, from the instrument response, X. As another example, in forestry, the volume of a tree is sometimes estimated from its diameter, which is easily measured. For a whole forest, it is reasonable to model diameter (X) and volume (Y) as random variables with some joint distribution, and then attempt to predict Y from X. Let us ﬁrst consider a relatively trivial situation: the problem of predicting Y by means of a constant value, c. If we wish to choose the “best” value of c, we need some measure of the effectiveness of a prediction. One that is amenable to mathematical analysis and that is widely used is the mean squared error: MSE = E[(Y − c)2] This is the average squared error of prediction, the averaging being done with respect to the distribution of Y. The problem then becomes ﬁnding the value of c that min- imizes the mean squared error. To solve this problem, we denote E(Y) by μ and 4.4 Conditional Expectation and Prediction 153 observe that (see Theorem A of Section 4.2.1) E[(Y − c)2] = Var(Y − c) + [E(Y − c)]2 = Var(Y) + (μ − c)2 The ﬁrst term of the last expression does not depend on c, and the second term is minimized for c = μ, which is the optimal choice of c. Now let us consider predicting Y by some function h(X) in order to minimize MSE = E{[Y − h(X)]2}. From Theorem A of Section 4.4.1, the right-hand side can be expressed as E{[Y − h(X)]2}=E(E{[Y − h(X)]2|X}) The outer expectation is with respect to X. 
For every x, the inner expectation is minimized by setting h(x) equal to the constant E(Y|X = x), from the result of the preceding paragraph. We thus have that the minimizing function h(X) is h(X) = E(Y|X) EXAMPLE A For the bivariate normal distribution, we found that E(Y|X) = μY + ρ σY σX (X − μX ) This linear function of X is thus the minimum mean squared error predictor of Y from X. ■ A practical limitation of the optimal prediction scheme is that its implementation depends on knowing the joint distribution of Y and X in order to ﬁnd E(Y|X), and often this information is not available, not even approximately. For this reason, we can try to attain the more modest goal of ﬁnding the optimal linear predictor of Y. (In Example A, it turned out that the best predictor was linear, but this is not generally the case.) That is, rather than ﬁnding the best function h among all functions, we try to ﬁnd the best function of the form h(x) = α + βx. This merely requires optimizing over the two parameters α and β.Now E[(Y − α − βX)2] = Var(Y − α − βX) + [E(Y − α − βX)]2 = Var(Y − βX) + [E(Y − α − βX)]2 The ﬁrst term of the last expression does not depend on α,soα can be chosen so as to minimize the second term. To do this, note that E(Y − α − βX) = μY − α − βμX and that the right-hand side is zero, and hence its square is minimized, if α = μY − βμX As for the ﬁrst term, Var(Y − βX) = σ 2 Y + β2σ 2 X − 2βσXY 154 Chapter 4 Expected Values where σXY = Cov(X, Y). This is a quadratic function of β, and the minimum is found by setting the derivative with respect to β equal to zero, which yields β = σXY σ 2 X = ρ σY σX ρ is the correlation coefﬁcient. 
Substituting in these values of α and β, we ﬁnd that the minimum mean squared error predictor, which we denote by ˆY,is ˆY = α + βX = μY + σXY σ 2 X (X − μX ) The mean squared prediction error is then Var(Y − βX) = σ 2 Y + σ 2 XY σ 4 X σ 2 X − 2 σXY σ 2 X σXY = σ 2 Y − σ 2 XY σ 2 X = σ 2 Y − ρ2σ 2 Y = σ 2 Y (1 − ρ2) Note that the optimal linear predictor depends on the joint distribution of X and Y only through their means, variances, and covariance. Thus, in practice, it is generally easier to construct the optimal linear predictor or an approximation to it than to construct the general optimal predictor E(Y|X). Second, note that the form of the optimal linear predictor is the same as that of E(Y|X) for the bivariate normal distribution. Third, note that the mean squared prediction error depends only on σY and ρ and that it is small if ρ is close to +1or−1. Here we see again, from a different point of view, that the correlation coefﬁcient is a measure of the strength of the linear relationship between X and Y. EXAMPLE B Suppose that two examinations are given in a course. As a probability model, we regard the scores of a student on the ﬁrst and second examinations as jointly distributed random variables X and Y. Suppose for simplicity that the exams are scaled to have the same means μ = μX = μY and standard deviations σ = σX = σY . Then, the correlation between X and Y is ρ = σXY/σ 2 and the best linear predictor is ˆY = μ + ρ(X − μ),so ˆY − μ = ρ(X − μ) Notice that by this equation we predict the student’s score on the second examination to differ from the overall mean μ by less than did the score on the ﬁrst examination. If the correlation ρ is positive, this is encouraging for a student who scores below the mean on the ﬁrst exam, since our best prediction is that his score on the next exam will be closer to the mean. 
On the other hand, it’s bad news for the student who scored above the mean on the ﬁrst exam, since our best prediction is that she will score closer to the mean on the next exam. This phenomenon is often referred to as regression to the mean. ■ 4.5 The Moment-Generating Function 155 4.5 The Moment-Generating Function This section develops and applies some of the properties of the moment-generating function. It turns out, despite its unlikely appearance, to be a very useful tool that can dramatically simplify certain calculations. The moment-generating function (mgf) of a random variable X is M(t) = E(etX) if the expectation is deﬁned. In the discrete case, M(t) = x etx p(x) and in the continuous case, M(t) = ∞ −∞ etx f (x) dx The expectation, and hence the moment-generating function, may or may not exist for any particular value of t. In the continuous case, the existence of the expectation depends on how rapidly the tails of the density decrease; for example, because the tails of the Cauchy density die down at the rate x−2, the expectation does not exist for any t and the moment-generating function is undeﬁned. The tails of the normal density die down at the rate e−x2 , so the integral converges for all t. PROPERTY A If the moment-generating function exists for t in an open interval containing zero, it uniquely determines the probability distribution. ■ We cannot prove this important property here—its proof depends on properties of the Laplace transform. Note that Property A says that if two random variables have the same mgf in an open interval containing zero, they have the same distribution. For some problems, we can ﬁnd the mgf and then deduce the unique probability distribution corresponding to it. The rth moment of a random variable is E(Xr ) if the expectation exists. We have already encountered the ﬁrst and second moments earlier in this chapter, that is, E(X) and E(X 2). 
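As a concrete illustration of the definition, the sum Σₓ eᵗˣ p(x) can be evaluated directly and compared with a simulation estimate of E(eᵗˣ). The small frequency function below is invented purely for this check:

```python
import math
import random

# Checking the definition M(t) = E(e^{tX}) for a small, made-up discrete
# distribution (the pmf below is not from the text).
pmf = {0: 0.2, 1: 0.5, 3: 0.3}           # invented frequency function p(x)
t = 0.4

# Direct evaluation of the defining sum.
M_exact = sum(math.exp(t * x) * p for x, p in pmf.items())

# Monte Carlo estimate of E(e^{tX}).
rng = random.Random(0)
draws = rng.choices(list(pmf), weights=list(pmf.values()), k=200_000)
M_mc = sum(math.exp(t * x) for x in draws) / len(draws)

print(round(M_exact, 3), round(M_mc, 3))  # the two values should nearly agree
```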
Central moments rather than ordinary moments are often used: The rth central moment is E{[X − E(X)]r }. The variance is the second central moment and is a measure of dispersion about the mean. The third central moment, called the skewness, is used as a measure of the asymmetry of a density or a frequency function about its mean; if a density is symmetric about its mean, the skewness is zero (see Problem 78 at the end of this chapter). As its name implies, the moment-generating function has something to do with moments. To see this, consider the continuous case: M(t) = ∞ −∞ etx f (x) dx 156 Chapter 4 Expected Values The derivative of M(t) is M (t) = d dt ∞ −∞ etx f (x) dx It can be shown that differentiation and integration can be interchanged, so that M (t) = ∞ −∞ xetx f (x) dx and M (0) = ∞ −∞ xf(x) dx = E(X) Differentiating r times, we ﬁnd M(r)(0) = E(Xr ) It can further be argued that if the moment-generating function exists in an interval containing zero, then so do all the moments. We thus have the following property. PROPERTY B If the moment-generating function exists in an open interval containing zero, then M(r)(0) = E(Xr ). ■ To ﬁnd the moments of a random variable from the deﬁnition of expectation, we must sum a series or carry out an integration. The utility of Property B is that, if the mgf can be found, the process of integration or summation, which may be difﬁcult, can be replaced by the process of differentiation, which is mechanical. We now illustrate these concepts using some familiar distributions. EXAMPLE A Poisson Distribution By deﬁnition, M(t) = ∞ k=0 etk λk k! e−λ = ∞ k=0 (λet )k k! e−λ = e−λeλet = eλ(et −1) The sum converges for all t. 
Differentiating, we have M (t) = λet eλ(et −1) M (t) = λet eλ(et −1) + λ2e2t eλ(et −1) 4.5 The Moment-Generating Function 157 Evaluating these derivatives at t = 0, we ﬁnd E(X) = λ E(X 2) = λ2 + λ from which it follows that Var(X) = E(X 2) − [E(X)]2 = λ We have found that the mean and the variance of a Poisson distribution are equal. ■ EXAMPLE B Gamma Distribution The mgf of a gamma distribution is M(t) = ∞ 0 etx λα (α)xα−1e−λx dx = λα (α) ∞ 0 xα−1ex(t−λ) dx The latter integral converges for t <λand can be evaluated by relating it to the gamma density having parameters α and λ − t. We thus obtain M(t) = λα (α) (α) (λ − t)α = λ λ − t α Differentiating, we ﬁnd M (0) = E(X) = α λ M (0) = E(X 2) = α(α + 1) λ2 From these equations, we ﬁnd that Var(X) = E(X 2) − [E(X)]2 = α(α + 1) λ2 − α2 λ2 = α λ2 ■ EXAMPLE C Standard Normal Distribution For the standard normal distribution, we have M(t) = 1√ 2π ∞ −∞ etxe−x2/2 dx 158 Chapter 4 Expected Values The integral converges for all t and can be evaluated using the technique of completing the square. Since x2 2 − tx = 1 2 (x2 − 2tx + t2) − t2 2 = 1 2 (x − t)2 − t2 2 therefore, M(t) = et2/2 √ 2π ∞ −∞ e−(x−t)2/2 dx Making the change of variables u = x − t and using the fact that the standard normal density integrates to 1, we ﬁnd that M(t) = et2/2 From this result, we easily see that E(X) = 0 and Var(X) = 1. ■ Let us continue with the development of the properties of the moment-generating function. PROPERTY C If X has the mgf MX (t) and Y = a+bX, then Y has the mgf MY (t) = eat MX (bt). Proof MY (t) = E(etY) = E(eat+bt X ) = E(eatebt X ) = eat E(ebt X ) = eat MX (bt) ■ EXAMPLE D General Normal Distribution If Y follows a general normal distribution with parameters μ and σ, the distribution of Y is the same as that of μ + σ X, where X follows a standard normal distribution. 
Thus, from Example C and Property C, MY (t) = eμt MX (σt) = eμt eσ 2t2/2 ■ 4.5 The Moment-Generating Function 159 PROPERTY D If X and Y are independent random variables with mgf’s MX and MY and Z = X + Y, then MZ (t) = MX (t)MY (t) on the common interval where both mgf’s exist. Proof MZ (t) = E(etZ) = E(etX+tY) = E(etXetY) From the assumption of independence, MZ (t) = E(etX)E(etY) = MX (t)MY (t) ■ By induction, Property D can be extended to sums of several independent random variables. This is one of the most useful properties of the moment-generating function. The next three examples show how it can be used to easily derive results that would take a lot more work to achieve without recourse to the mgf. EXAMPLE E The sum of independent Poisson random variables is a Poisson random variable: If X is Poisson with parameter λ and Y is Poisson with parameter μ, then X + Y is Poisson with parameter λ + μ, since eλ(et −1)eμ(et −1) = e(λ+μ)(et −1) ■ EXAMPLE F If X follows a gamma distribution with parameters α1 and λ and Y follows a gamma distribution with parameters α2 and λ, the mgf of X + Y is λ λ − t α1 λ λ − t α2 = λ λ − t α1+ α2 where t <λ. The right-hand expression is the mgf of a gamma distribution with parametersλandα1+α2. It follows from this that the sum ofn independent exponential random variables with parameter λ follows a gamma distribution with parameters n andλ. Thus, the time betweenn consecutive events of a Poisson process in time follows a gamma distribution. Assuming that the service times in a queue are independent exponential random variables, the length of time to serve n customers follows a gamma distribution. ■ 160 Chapter 4 Expected Values EXAMPLE G If X ∼ N(μ, σ 2) and, independent of X, Y ∼ N(v, τ 2), then the mgf of X + Y is eμt et2σ 2/2evt et2τ 2/2 = e(μ+v)t et2(σ 2+τ 2)/2 which is the mgf of a normal distribution with mean μ + v and variance σ 2 + τ 2. The sum of independent normal random variables is thus normal. 
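Example G can also be checked by simulation. The sketch below, with arbitrarily chosen parameter values, compares a Monte Carlo estimate of E(e^{t(X+Y)}) with the product of the two normal mgf's:

```python
import math
import random

# Monte Carlo sketch of Property D for two independent normals
# (the parameter values here are arbitrary choices, not from the text).
rng = random.Random(3)
mu, sig, v, tau, t = 0.5, 1.0, -1.0, 0.5, 0.7

def normal_mgf(m, s):
    # mgf of N(m, s^2) evaluated at t: exp(mt + s^2 t^2 / 2)
    return math.exp(m * t + s**2 * t**2 / 2)

n, acc = 200_000, 0.0
for _ in range(n):
    acc += math.exp(t * (rng.gauss(mu, sig) + rng.gauss(v, tau)))
est = acc / n

print(round(est, 3), round(normal_mgf(mu, sig) * normal_mgf(v, tau), 3))
```

The product of the mgf's equals the mgf of a N(μ + v, σ² + τ²) random variable, which is what the simulated sum estimates.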
■ The preceding three examples are atypical. In general, if two independent ran- dom variables follow some type of distribution, it is not necessarily true that their sum follows the same type of distribution. For example, the sum of two gamma ran- dom variables having different values for the parameter λ does not follow a gamma distribution, as can be easily seen from the mgf. We now apply moment-generating functions to random sums of the type intro- duced in Section 4.4.1. Suppose that S = N i=1 Xi where the Xi are independent and have the same mgf, MX , and where N has the mgf MN and is independent of the Xi . By conditioning, we have MS(t) = E[E(etS|N)] Given N = n, MS(t) = [MX (t)]n from Property D. We thus have MS(t) = E[MX (t)N ] = E(eN log MX (t)) = MN [log MX (t)] (We must carefully note the values of t for which this is deﬁned.) EXAMPLE H Compound Poisson Distribution This example presents a model that occurs for certain chain reactions, or “cascade” processes. When a single primary electron, having been accelerated in an electrical ﬁeld, hits a plate, several secondary electrons are produced. In a multistage multiplying tube, each of these secondary electrons hits another plate and thereby produces a number of tertiary electrons. The process can continue through several stages in this manner. Woodward (1948) considered models of this type in which the number of electrons produced by the impact of a single electron on the plate is random and, in particular, in which the number of secondary electrons follows a Poisson distribution. The number of electrons produced at the third stage is described by a random sum of the type just described, where N is the number of secondary electrons and Xi is the number of electrons produced by the ith secondary electron. Suppose that the Xi are independent Poisson random variables with parameter λ and that N is a Poisson random variable with parameter μ. 
According to the preceding result, the mgf of S, 4.6 Approximate Methods 161 the total number of particles, is MS(t) = exp[μ(eλ(et −1) − 1)] ■ Example H illustrates the utility of the mgf. It would have been more difﬁcult to ﬁnd the probability mass function of the number of particles at the third stage. By differentiating the mgf, we can ﬁnd the moments of the probability mass function (see Problem 98 at the end of this chapter). If X and Y have a joint distribution, their joint moment-generating function is deﬁned as MXY(s, t) = E(esX+tY) which is a function of two variables, s and t. If the joint mgf is deﬁned on an open set containing the origin, it uniquely determines the joint distribution. The mgf of the marginal distribution of X alone is MX (s) = MXY(s, 0) and similarly for Y. It can be shown that X and Y are independent if and only if their joint mgf factors into the product of the mgf’s of the marginal distributions. E(XY) and other higher-order joint moments can be obtained from the joint mgf by differentiation. Analogous properties hold for the joint mgf of several random variables. The major limitation of the mgf is that it may not exist. The characteristic function of a random variable X is deﬁned to be φ(t) = E(eitX) where i = √−1. In the continuous case, φ(t) = ∞ −∞ eitx f (x) dx This integral converges for all values of t, since |eitx|≤1. The characteristic func- tion is thus deﬁned for all distributions. Its properties are similar to those of the mgf: Moments can be obtained by differentiation, the characteristic function changes simply under linear transformations, and the characteristic function of a sum of inde- pendent random variables is the product of their characteristic functions. But using the characteristic function requires some familiarity with the techniques of complex variables. 
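Before leaving the moment-generating function, the random-sum result of Example H can be checked numerically. The sketch below uses invented parameter values and a simple Poisson sampler, and compares a simulated E(e^{tS}) with the closed-form compound Poisson mgf:

```python
import math
import random

def poisson(m, rng):
    # Knuth's multiplicative method; adequate for small means.
    L, k, prod = math.exp(-m), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= L:
            return k
        k += 1

rng = random.Random(4)
mu, lam, t = 2.0, 1.5, 0.2   # invented values: N ~ Poisson(mu), X_i ~ Poisson(lam)
theory = math.exp(mu * (math.exp(lam * (math.exp(t) - 1)) - 1))

n, acc = 200_000, 0.0
for _ in range(n):
    s = sum(poisson(lam, rng) for _ in range(poisson(mu, rng)))
    acc += math.exp(t * s)
est = acc / n

print(round(est, 3), round(theory, 3))  # simulated vs. closed-form mgf at t
```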
4.6 Approximate Methods

In many applications, only the first two moments of a random variable, and not the entire probability distribution, are known, and even these may be known only approximately. We will see in Chapter 5 that repeated independent observations of a random variable allow reliable estimates to be made of its mean and variance. Suppose that we know the expectation and the variance of a random variable X but not the entire distribution, and that we are interested in the mean and variance of Y = g(X) for some fixed function g. For example, we might be able to measure X and determine its mean and variance, but really be interested in Y, which is related to X in a known way. We might want to know Var(Y), at least approximately, in order to assess the accuracy of the indirect measurement process. From the results given in this chapter, we cannot in general find E(Y) = μY and Var(Y) = σ²_Y from E(X) = μX and Var(X) = σ²_X, unless the function g is linear. However, if g is nearly linear in a range in which X has high probability, it can be approximated by a linear function and approximate moments of Y can be found.

In proceeding as just described, we follow a tack often taken in applied mathematics: When confronted with a nonlinear problem that we cannot solve, we linearize. In probability and statistics, this method is called propagation of error, or the δ method. Linearization is carried out through a Taylor series expansion of g about μX. To the first order,

Y = g(X) ≈ g(μX) + (X − μX)g′(μX)

We have expressed Y as approximately equal to a linear function of X. Recalling that if U = a + bV, then E(U) = a + bE(V) and Var(U) = b²Var(V), we find

μY ≈ g(μX)
σ²_Y ≈ σ²_X [g′(μX)]²

Note that in general E(Y) ≠ g(E(X)), the value given by the first-order approximation.
In fact, we can carry out the Taylor series expansion to the second order to get an improved approximation of μY : Y = g(X) ≈ g(μX ) + (X − μX )g (μX ) + 1 2 (X − μX )2g (μX ) Taking the expectation of the right-hand side, we have, since E(X − μX ) = 0, E(Y) ≈ g(μX ) + 1 2 σ 2 X g (μX ) How good such approximations are depends on how nonlinear g is in a neighbor- hood of μX and on the size of σX . From Chebyshev’s inequality, we know that X is unlikely to be many standard deviations away from μX ;ifg can be reasonably well approximated in this range by a linear function, the approximations for the moments will be reasonable as well. EXAMPLE A The relation of voltage, current, and resistance is V = IR. Suppose that the voltage is held constant at a value V0 across a medium whose resistance ﬂuctuates randomly as a result, say, of random ﬂuctuations at the molecular level. The current therefore also varies randomly. Suppose that it can be determined experimentally to have mean μI = 0 and variance σ 2 I . We wish to ﬁnd the mean and variance of the resistance, R, and since we do not know the distribution of I, we must resort to an approximation. 4.6 Approximate Methods 163 We have R = g(I) = V0 I g (μI ) =−V0 μ2 I g (μI ) = 2V0 μ3 I Thus, μR ≈ V0 μI + V0 μ3 I σ 2 I σ 2 R ≈ V 2 0 μ4 I σ 2 I We see that the variability of R depends on both the mean level of I and the variance of I. This makes sense, since if I is quite small, small variations in I will result in large variations in R = V0/I, whereas if I is large, small variations will not affect R as much. The second-order correction factor for μR also depends on μI and is large if μI is small. In fact, when I is near zero, the function g(I) = V0/I is quite nonlinear, and the linearization is not a good approximation. ■ EXAMPLE B This example examines the accuracy of the approximations using a simple test case. We choose the function g(x) = √ x and consider two cases: X uniform on [0, 1], and X uniform on [1, 2]. 
The graph of g(x) in Figure 4.9 shows that g is more nearly linear in the latter case, so we would expect the approximations to work better there.

FIGURE 4.9 The function g(x) = √x is more nearly linear over the interval [1, 2] than over the interval [0, 1].

Let Y = √X; because X is uniform on [0, 1],

E(Y) = ∫₀¹ √x dx = 2/3

and

E(Y²) = ∫₀¹ x dx = 1/2

so

Var(Y) = 1/2 − (2/3)² = 1/18

and σY = .236. These results are exact. Using the approximation method, we first calculate

g′(x) = (1/2)x^(−1/2)
g″(x) = −(1/4)x^(−3/2)

Since X is uniform on [0, 1], μX = 1/2, and evaluating the derivatives at this value gives us

g′(μX) = √2/2
g″(μX) = −√2/2

We know that Var(X) = 1/12 for a random variable uniform on [0, 1], so the approximations are

E(Y) ≈ g(μX) + (1/2)Var(X)g″(μX) = √2/2 − (1/2)(1/12)(√2/2) = .678
Var(Y) ≈ Var(X)[g′(μX)]² = (1/12)(1/2) = .042
σY ≈ .204

The approximation to the mean is .678, and compared to the actual value of .667, it is off by about 1.6%. The approximation to the standard deviation is .204, and compared to the actual value of .236, it is off by 13%.

Now let us consider the case in which X is uniform on [1, 2]. Proceeding as before, we find that Y = √X has mean 1.219. The variance and standard deviation are .0142 and .119, respectively. To compare these to the approximations, we note that μX = 3/2 and Var(X) = 1/12 (the random variable uniform on [1, 2] can be obtained by adding the constant 1 to a random variable uniform on [0, 1]; compare with Theorem A in Section 4.2). We find

g′(μX) = .408
g″(μX) = −.136

so the approximations are

E(Y) ≈ √(3/2) − (1/2)(1/12)(.136) = 1.219
Var(Y) ≈ (.408)²/12 = .0138
σY ≈ .118

These values are much closer to the actual values than are the approximations for the first case. ■

Suppose that we have Z = g(X, Y), a function of two variables. We can again carry out Taylor series expansions to approximate the mean and variance of Z.
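Before turning to functions of two variables, the delta-method arithmetic of Example B for X uniform on [0, 1] is easy to reproduce in a few lines (the variable names here are ours, not the text's):

```python
import math

# Reproducing Example B's delta-method arithmetic for g(x) = sqrt(x),
# X uniform on [0, 1]: exact E(Y) = 2/3 and Var(Y) = 1/18, versus the
# second-order mean and first-order variance approximations.
muX, varX = 0.5, 1 / 12
g0 = math.sqrt(muX)                 # g(muX)
g1 = 0.5 * muX ** -0.5              # g'(muX)
g2 = -0.25 * muX ** -1.5            # g''(muX)

EY_approx = g0 + 0.5 * varX * g2    # E(Y) ≈ g(muX) + (1/2) Var(X) g''(muX)
VarY_approx = varX * g1 ** 2        # Var(Y) ≈ Var(X) [g'(muX)]^2

print(round(EY_approx, 3), round(VarY_approx, 3))  # .678 and .042, as in the text
```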
To the ﬁrst order, letting μ denote the point (μX ,μY ), Z = g(X, Y) ≈ g(μ) + (X − μX )∂g(μ) ∂x + (Y − μY )∂g(μ) ∂y The notation ∂g(μ)/∂x means that the derivative is evaluated at the point μ. Here Z is expressed as approximately equal to a linear function of X and Y, and the mean and variance of this linear function are easily calculated to be E(Z) ≈ g(μ) and Var(Z) ≈ σ 2 X ∂g(μ) ∂x 2 + σ 2 Y ∂g(μ) ∂y 2 + 2σXY ∂g(μ) ∂x ∂g(μ) ∂y (For the latter calculation, see Corollary A in Section 4.3.) As is the case with a single variable, a second-order expansion can be used to obtain an improved estimate of E(Z): Z = g(X, Y) ≈ g(μ) + (X − μX )∂g(μ) ∂x + (Y − μY )∂g(μ) ∂y + 1 2 (X − μX )2 ∂2g(μ) ∂x2 + 1 2 (Y − μY )2 ∂2g(μ) ∂y2 + (X − μX )(Y − μY )∂2g(μ) ∂x∂y Taking expectations term by term on the right-hand side yields E(Z) ≈ g(μ) + 1 2 σ 2 X ∂2g(μ) ∂x2 + 1 2 σ 2 Y ∂2g(μ) ∂y2 + σXY ∂2g(μ) ∂x∂y The general case of a function of n variables can be worked out similarly; the basic concepts are illustrated by the two-variable case. EXAMPLE C Expectation and Variance of a Ratio Let us consider the case where Z = Y/X, which arises frequently in practice. For example, a chemist might measure the concentrations of two substances, both with some measurement error that is indicated by their standard deviations, and then report the relative concentrations in the form of a ratio. What is the approximate standard deviation of the ratio, Z? Using the method of propagation of error derived above, for g(x, y) = y/x,we have ∂g ∂x = −y x2 ∂g ∂y = 1 x ∂2g ∂x2 = 2y x3 ∂2g ∂y2 = 0 ∂2g ∂x∂y = −1 x2 166 Chapter 4 Expected Values Evaluating these derivatives at (μX ,μY ) and using the preceding result, we ﬁnd, if μX = 0, E(Z) ≈ μY μX + σ 2 X μY μ3 X − σXY μ2 X = μY μX + 1 μ2 X σ 2 X μY μX − ρσX σY From this equation, we see that the difference between E(Z) and μY /μX depends on several factors. 
If σX and σY are small—that is, if the two concentrations are measured quite accurately—the difference is small. If μX is small, the difference is relatively large. Finally, correlation between X and Y affects the difference. We now consider the variance. Again using the preceding result and evaluating the partial derivatives at (μX ,μY ),weﬁnd Var(Z) ≈ σ 2 X μ2 Y μ4 X + σ 2 Y μ2 X − 2σXY μY μ3 X = 1 μ2 X σ 2 X μ2 Y μ2 X + σ 2 Y − 2ρσX σY μY μX From this equation, we see that the ratio is quite variable when μX is small, paralleling the results in Example A, and that correlation between X and Y, if of the same sign as μY /μX , decreases Var(Z). ■ 4.7 Problems 1. Show that if a random variable is bounded—that is, |X| < M < ∞—then E(X) exists. 2. If X is a discrete uniform random variable—that is, P(X = k) = 1/n for k = 1,2,...,n—ﬁnd E(X) and Var(X). 3. Find E(X) and Var(X) for Problem 3 in Chapter 2. 4. Let X have the cdf F(x) = 1 − x−α, x ≥ 1. a. Find E(X) for those values of α for which E(X) exists. b. Find Var(X) for those values of α for which it exists. 5. Let X have the density f (x) = 1 + αx 2 , −1 ≤ x ≤ 1, −1 ≤ α ≤ 1 Find E(X) and Var(X). 4.7 Problems 167 6. Let X be a continuous random variable with probability density function f (x) = 2x,0≤ x ≤ 1. a. Find E(X). b. Let Y = X 2. Find the probability mass function of Y and use it to ﬁnd E(Y). c. Use Theorem A in Section 4.1.1 to ﬁnd E(X 2) and compare to your answer in part (b). d. Find Var(X) according to the deﬁnition of variance given in Section 4.2. Also ﬁnd Var(X) by using Theorem B of Section 4.2. 7. Let X be a discrete random variable that takes on values 0, 1, 2 with probabilities 1 2 , 3 8 , 1 8 , respectively. a. Find E(X). b. Let Y = X 2. Find the probability mass function of Y and use it to ﬁnd E(Y). c. Use Theorem A of Section 4.1.1 to ﬁnd E(X 2) and compare to your answer in part (b). d. Find Var(X) according to the deﬁnition of variance given in Section 4.2. 
Also ﬁnd Var(X) by using Theorem B in Section 4.2. 8. Show that if X is a discrete random variable, taking values on the positive integers, then E(X) = ∞ k=1 P(X ≥ k). Apply this result to ﬁnd the expected value of a geometric random variable. 9. This is a simpliﬁed inventory problem. Suppose that it costs c dollars to stock an item and that the item sells for s dollars. Suppose that the number of items that will be asked for by customers is a random variable with the frequency function p(k). Find a rule for the number of items that should be stocked in order to maximize the expected income. (Hint: Consider the difference of successive terms.) 10. A list of n items is arranged in random order; to ﬁnd a requested item, they are searched sequentially until the desired item is found. What is the expected number of items that must be searched through, assuming that each item is equally likely to be the one requested? (Questions of this nature arise in the design of computer algorithms.) 11. Referring to Problem 10, suppose that the items are not equally likely to be requested but have known probabilities p1, p2, ..., pn. Suggest an alternative searching procedure that will decrease the average number of items that must be searched through, and show that in fact it does so. 12. If X is a continuous random variable with a density that is symmetric about some point, ξ, show that E(X) = ξ, provided that E(X) exists. 13. If X is a nonnegative continuous random variable, show that E(X) = ∞ 0 [1 − F(x)] dx Apply this result to ﬁnd the mean of the exponential distribution. 168 Chapter 4 Expected Values 14. Let X be a continuous random variable with the density function f (x) = 2x, 0 ≤ x ≤ 1 a. Find E(X). b. Find E(X 2) and Var(X). 15. Suppose that two lotteries each have n possible numbers and the same payoff. In terms of expected gain, is it better to buy two tickets from one of the lotteries or one from each? 16. Suppose that E(X) = μ and Var(X) = σ 2. Let Z = (X − μ)/σ. 
Show that E(Z) = 0 and Var(Z) = 1.

17. Find (a) the expectation and (b) the variance of the kth-order statistic of a sample of n independent random variables uniform on [0, 1]. The density function is given in Example C in Section 3.7.

18. If U1, ..., Un are independent uniform random variables, find E(U(n) − U(1)).

19. Find E(U(k) − U(k−1)), where the U(i) are as in Problem 18.

20. A stick of unit length is broken into two pieces. Find the expected ratio of the length of the longer piece to the length of the shorter piece.

21. A random square has a side length that is a uniform [0, 1] random variable. Find the expected area of the square.

22. A random rectangle has sides the lengths of which are independent uniform random variables. Find the expected area of the rectangle, and compare this result to that of Problem 21.

23. Repeat Problems 21 and 22 assuming that the distribution of the lengths is exponential.

24. Prove Theorem A of Section 4.1.2 for the discrete case.

25. If X1 and X2 are independent random variables following a gamma distribution with parameters α and λ, find E(R²), where R² = X1² + X2².

26. Referring to Example B in Section 4.1.2, what is the expected number of coupons needed to collect r different types, where r < n?

27. If n men throw their hats into a pile and each man takes a hat at random, what is the expected number of matches? (Hint: Express the number as a sum.)

28. Suppose that n enemy aircraft are shot at simultaneously by m gunners, that each gunner selects an aircraft to shoot at independently of the other gunners, and that each gunner hits the selected aircraft with probability p. Find the expected number of aircraft hit by the gunners.

29. Prove Corollary A of Section 4.1.1.

30. Find E[1/(X + 1)], where X is a Poisson random variable.

31. Let X be uniformly distributed on the interval [1, 2]. Find E(1/X). Is E(1/X) = 1/E(X)?

32. Let X have a gamma distribution with parameters α and λ.
For those values of α and λ for which it is defined, find E(1/X).

33. Prove Chebyshev's inequality in the discrete case.

34. Let X be uniform on [0, 1], and let Y = √X. Find E(Y) by (a) finding the density of Y and then finding the expectation and (b) using Theorem A of Section 4.1.1.

35. Find the mean of a negative binomial random variable. (Hint: Express the random variable as a sum.)

36. Consider the following scheme for group testing. The original lot of samples is divided into two groups, and each of the subgroups is tested as a whole. If either subgroup tests positive, it is divided in two, and the procedure is repeated. If any of the groups thus obtained tests positive, test every member of that group. Find the expected number of tests performed, and compare it to the number performed with no grouping and with the scheme described in Example C in Section 4.1.2.

37. For what values of p is the group testing of Example C in Section 4.1.2 inferior to testing every individual?

38. This problem continues Example A of Section 4.1.2.
   a. What is the probability that a fragment is the leftmost member of a contig?
   b. What is the expected number of fragments that are leftmost members of contigs?
   c. What is the expected number of contigs?

39. Suppose that a segment of DNA of length 1,000,000 is to be shotgun sequenced with fragments of length 1000.
   a. How many fragments would be needed so that the chance of an individual site being covered is greater than 0.99?
   b. With this choice, how many sites would you expect to be missed?

40. A child types the letters Q, W, E, R, T, Y, randomly producing 1000 letters in all. What is the expected number of times that the sequence QQQQ appears, counting overlaps?

41. Continuing with the previous problem, how many times would we expect the word "TRY" to appear? Would we be surprised if it occurred 100 times? (Hint: Consider Markov's inequality.)

42. Let X be an exponential random variable with standard deviation σ.
Find P(|X − E(X)| > kσ) for k = 2, 3, 4, and compare the results to the bounds from Chebyshev's inequality.

43. Show that Var(X − Y) = Var(X) + Var(Y) − 2Cov(X, Y).

44. If X and Y are independent random variables with equal variances, find Cov(X + Y, X − Y).

45. Find the covariance and the correlation of Ni and Nj, where N1, N2, ..., Nr are multinomial random variables. (Hint: Express them as sums.)

46. If U = a + bX and V = c + dY, show that |ρUV| = |ρXY|.

47. If X and Y are independent random variables and Z = Y − X, find expressions for the covariance and the correlation of X and Z in terms of the variances of X and Y.

48. Let U and V be independent random variables with means μ and variances σ². Let Z = αU + √(1 − α²) V. Find E(Z) and ρUZ.

49. Two independent measurements, X and Y, are taken of a quantity μ. E(X) = E(Y) = μ, but σX and σY are unequal. The two measurements are combined by means of a weighted average to give
   Z = αX + (1 − α)Y
   where α is a scalar and 0 ≤ α ≤ 1.
   a. Show that E(Z) = μ.
   b. Find α in terms of σX and σY to minimize Var(Z).
   c. Under what circumstances is it better to use the average (X + Y)/2 than either X or Y alone?

50. Suppose that Xi, where i = 1, ..., n, are independent random variables with E(Xi) = μ and Var(Xi) = σ². Let X̄ = n⁻¹ Σ_{i=1}^n Xi. Show that E(X̄) = μ and Var(X̄) = σ²/n.

51. Continuing Example E in Section 4.3, suppose there are n securities, each with the same expected return, that all the returns have the same standard deviations, and that the returns are uncorrelated. What is the optimal portfolio vector? Plot the risk of the optimal portfolio versus n. How does this risk compare to that incurred by putting all your money in one security?

52. Consider two securities, the first having μ1 = 1 and σ1 = 0.1, and the second having μ2 = 0.8 and σ2 = 0.12. Suppose that they are negatively correlated, with ρ = −0.8.
   a. If you could only invest in one security, which one would you choose, and why?
   b.
Suppose you invest 50% of your money in each of the two. What is your expected return and what is your risk?
   c. If you invest 80% of your money in security 1 and 20% in security 2, what is your expected return and your risk?
   d. Denote the expected return and its standard deviation as functions of π by μ(π) and σ(π). The pair (μ(π), σ(π)) traces out a curve in the plane as π varies from 0 to 1. Plot this curve.
   e. Repeat b–d if the correlation is ρ = 0.1.

53. Show that Cov(X, Y) ≤ √(Var(X)Var(Y)).

54. Let X, Y, and Z be uncorrelated random variables with variances σX², σY², and σZ², respectively. Let
   U = Z + X
   V = Z + Y
   Find Cov(U, V) and ρUV.

55. Let T = Σ_{k=1}^n kXk, where the Xk are independent random variables with means μ and variances σ². Find E(T) and Var(T).

56. Let S = Σ_{k=1}^n Xk, where the Xk are as in Problem 55. Find the covariance and the correlation of S and T.

57. If X and Y are independent random variables, find Var(XY) in terms of the means and variances of X and Y.

58. A function is measured at two points with some error (for example, the position of an object is measured at two times). Let
   X1 = f(x) + ε1
   X2 = f(x + h) + ε2
   where ε1 and ε2 are independent random variables with mean zero and variance σ². Since the derivative of f is
   lim_{h→0} [f(x + h) − f(x)]/h
   it is estimated by
   Z = (X2 − X1)/h
   a. Find E(Z) and Var(Z). What is the effect of choosing a value of h that is very small, as is suggested by the definition of the derivative?
   b. Find an approximation to the mean squared error of Z as an estimate of f′(x) using a Taylor series expansion. Can you find the value of h that minimizes the mean squared error?
   c. Suppose that f is measured at three points with some error. How could you construct an estimate of the second derivative of f, and what are the mean and the variance of your estimate?

59. Let (X, Y) be a random point uniformly distributed on a unit disk. Show that Cov(X, Y) = 0, but that X and Y are not independent.
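The conclusion of Problem 59, zero covariance without independence, can be previewed numerically before attempting the proof. The following sketch (standard library only; the rejection-sampling approach, sample size, and band thresholds are illustrative choices, not part of the problem) shows that the sample covariance is near zero, while the spread of Y clearly depends on X:

```python
import random

random.seed(1)

# Sample points uniformly on the unit disk by rejecting points
# drawn uniformly from the enclosing square.
pts = []
while len(pts) < 100_000:
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    if x * x + y * y <= 1:
        pts.append((x, y))

n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
cov = sum((x - mx) * (y - my) for x, y in pts) / n
print(round(cov, 3))  # near 0

# Dependence: conditional on X, the range of Y is |Y| <= sqrt(1 - X^2),
# so the spread of Y shrinks as |X| grows.
inner = [y for x, y in pts if abs(x) < 0.2]
outer = [y for x, y in pts if abs(x) > 0.9]
var = lambda v: sum(t * t for t in v) / len(v) - (sum(v) / len(v)) ** 2
print(var(inner) > 4 * var(outer))  # the inner band is much more spread out
```

The simulation is not a substitute for the analytic argument, but it makes the geometric reason for the dependence visible.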
60. Let Y have a density that is symmetric about zero, and let X = SY, where S is an independent random variable taking on the values +1 and −1 with probability 1/2 each. Show that Cov(X, Y) = 0, but that X and Y are not independent.

61. In Section 3.7, the joint density of the minimum and maximum of n independent uniform random variables was found. In the case n = 2, this amounts to X and Y, the minimum and maximum, respectively, of two independent random variables uniform on [0, 1], having the joint density
   f(x, y) = 2, 0 ≤ x ≤ y ≤ 1
   a. Find the covariance and the correlation of X and Y. Does the sign of the correlation make sense intuitively?
   b. Find E(X|Y = y) and E(Y|X = x). Do these results make sense intuitively?
   c. Find the probability density functions of the random variables E(X|Y) and E(Y|X).
   d. What is the linear predictor of Y in terms of X (denoted by Ŷ = a + bX) that has minimal mean squared error? What is the mean square prediction error?
   e. What is the predictor of Y in terms of X [Ŷ = h(X)] that has minimal mean squared error? What is the mean square prediction error?

62. Let X and Y have the joint distribution given in Problem 1 of Chapter 3.
   a. Find the covariance and correlation of X and Y.
   b. Find E(Y|X = x) for x = 1, 2, 3, 4. Find the probability mass function of the random variable E(Y|X).

63. Let X and Y have the joint distribution given in Problem 8 of Chapter 3.
   a. Find the covariance and correlation of X and Y.
   b. Find E(Y|X = x) for 0 ≤ x ≤ 1.

64. Let X and Y be jointly distributed random variables with correlation ρXY; define the standardized random variables X̃ and Ỹ as X̃ = (X − E(X))/√Var(X) and Ỹ = (Y − E(Y))/√Var(Y). Show that Cov(X̃, Ỹ) = ρXY.

65. How has the assumption that N and the Xi are independent been used in Example D of Section 4.4.1?

66. A building contains two elevators, one fast and one slow. The average waiting time for the slow elevator is 3 min.
and the average waiting time for the fast elevator is 1 min. If a passenger chooses the fast elevator with probability 2/3 and the slow elevator with probability 1/3, what is the expected waiting time? (Use the law of total expectation, Theorem A of Section 4.4.1, defining appropriate random variables X and Y.)

67. A random rectangle is formed in the following way: The base, X, is chosen to be a uniform [0, 1] random variable and after having generated the base, the height is chosen to be uniform on [0, X]. Use the law of total expectation, Theorem A of Section 4.4.1, to find the expected circumference and area of the rectangle.

68. Show that E[Var(Y|X)] ≤ Var(Y).

69. Suppose that a bivariate normal distribution has μX = μY = 0 and σX = σY = 1. Sketch the contours of the density and the lines E(Y|X = x) and E(X|Y = y) for ρ = 0, .5, and .9.

70. If X and Y are independent, show that E(X|Y = y) = E(X).

71. Let X be a binomial random variable representing the number of successes in n independent Bernoulli trials. Let Y be the number of successes in the first m trials, where m < n. Find the conditional frequency function of Y given X = x and the conditional mean.

72. An item is present in a list of n items with probability p; if it is present, its position in the list is uniformly distributed. A computer program searches through the list sequentially. Find the expected number of items searched through before the program terminates.

73. A fair coin is tossed n times, and the number of heads, N, is counted. The coin is then tossed N more times. Find the expected total number of heads generated by this process.

74. The number of offspring of an organism is a discrete random variable with mean μ and variance σ². Each of its offspring reproduces in the same manner. Find the expected number of offspring in the third generation and its variance.

75. Let T be an exponential random variable, and conditional on T, let U be uniform on [0, T].
Find the unconditional mean and variance of U.

76. Let the point (X, Y) be uniformly distributed over the half disk x² + y² ≤ 1, where y ≥ 0. If you observe X, what is the best prediction for Y? If you observe Y, what is the best prediction for X? For both questions, "best" means having the minimum mean squared error.

77. Let X and Y have the joint density
   f(x, y) = e^{−y}, 0 ≤ x ≤ y
   a. Find Cov(X, Y) and the correlation of X and Y.
   b. Find E(X|Y = y) and E(Y|X = x).
   c. Find the density functions of the random variables E(X|Y) and E(Y|X).

78. Show that if a density is symmetric about zero, its skewness is zero.

79. Let X be a discrete random variable that takes on values 0, 1, 2 with probabilities 1/2, 3/8, 1/8, respectively. Find the moment-generating function of X, M(t), and verify that E(X) = M′(0) and that E(X²) = M″(0).

80. Let X be a continuous random variable with density function f(x) = 2x, 0 ≤ x ≤ 1. Find the moment-generating function of X, M(t), and verify that E(X) = M′(0) and that E(X²) = M″(0).

81. Find the moment-generating function of a Bernoulli random variable, and use it to find the mean, variance, and third moment.

82. Use the result of Problem 81 to find the mgf of a binomial random variable and its mean and variance.

83. Show that if Xi follows a binomial distribution with ni trials and probability of success pi = p, where i = 1, ..., n and the Xi are independent, then Σ_{i=1}^n Xi follows a binomial distribution.

84. Referring to Problem 83, show that if the pi are unequal, the sum does not follow a binomial distribution.

85. Find the mgf of a geometric random variable, and use it to find the mean and the variance.

86. Use the result of Problem 85 to find the mgf of a negative binomial random variable and its mean and variance.

87. Under what conditions is the sum of independent negative binomial random variables also negative binomial?

88. Let X and Y be independent random variables, and let α and β be scalars.
Find an expression for the mgf of Z = αX + βY in terms of the mgfs of X and Y.

89. Let X1, X2, ..., Xn be independent normal random variables with means μi and variances σi². Show that Y = Σ_{i=1}^n αi Xi, where the αi are scalars, is normally distributed, and find its mean and variance. (Hint: Use moment-generating functions.)

90. Assuming that X ∼ N(0, σ²), use the mgf to show that the odd moments are zero and the even moments are
   μ_{2n} = (2n)!σ^{2n} / (2^n n!)

91. Use the mgf to show that if X follows an exponential distribution, cX (c > 0) does also.

92. Suppose that Λ is a random variable that follows a gamma distribution with parameters λ and α, where α is an integer, and suppose that, conditional on Λ, X follows a Poisson distribution with parameter Λ. Find the unconditional distribution of α + X. (Hint: Find the mgf by using iterated conditional expectations.)

93. Find the distribution of a geometric sum of exponential random variables by using moment-generating functions.

94. If X is a nonnegative integer-valued random variable, the probability-generating function of X is defined to be
   G(s) = Σ_{k=0}^∞ s^k p_k
   where p_k = P(X = k).
   a. Show that p_k = (1/k!) (d^k/ds^k) G(s) |_{s=0}.
   b. Show that (dG/ds)|_{s=1} = E(X) and (d²G/ds²)|_{s=1} = E[X(X − 1)].
   c. Express the probability-generating function in terms of the moment-generating function.
   d. Find the probability-generating function of the Poisson distribution.

95. Show that if X and Y are independent, their joint moment-generating function factors.

96. Show how to find E(XY) from the joint moment-generating function of X and Y.

97. Use moment-generating functions to show that if X and Y are independent, then
   Var(aX + bY) = a²Var(X) + b²Var(Y)

98. Find the mean and variance of the compound Poisson distribution (Example H in Section 4.5).

99. Find expressions for the approximate mean and variance of Y = g(X) for (a) g(x) = √x, (b) g(x) = log x, and (c) g(x) = sin⁻¹ x.

100.
If X is uniform on [10, 20], find the approximate and exact mean and variance of Y = 1/X, and compare them.

101. Find the approximate mean and variance of Y = √X, where X is a random variable following a Poisson distribution.

102. Two sides, x0 and y0, of a right triangle are independently measured as X and Y, where E(X) = x0 and E(Y) = y0 and Var(X) = Var(Y) = σ². The angle between the two sides is then determined as
   Θ = tan⁻¹(Y/X)
   Find the approximate mean and variance of Θ.

103. The volume of a bubble is estimated by measuring its diameter and using the relationship V = (π/6)D³. Suppose that the true diameter is 2 mm and that the standard deviation of the measurement of the diameter is .01 mm. What is the approximate standard deviation of the estimated volume?

104. The position of an aircraft relative to an observer on the ground is estimated by measuring its distance r from the observer and the angle θ that the line of sight from the observer to the aircraft makes with the horizontal. Suppose that the measurements, denoted by R and Θ, are subject to random errors and are independent of each other. The altitude of the aircraft is then estimated to be Y = R sin Θ.
   a. Find an approximate expression for the variance of Y.
   b. For given r, at what value of θ is the estimated altitude most variable?

CHAPTER 5 Limit Theorems

5.1 Introduction

This chapter is principally concerned with the limiting behavior of the sum of independent random variables as the number of summands becomes large. The results presented here are both intrinsically interesting and useful in statistics, since many commonly computed statistical quantities, such as averages, can be represented as sums.

5.2 The Law of Large Numbers

It is commonly believed that if a fair coin is tossed many times and the proportion of heads is calculated, that proportion will be close to 1/2.
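This belief is easy to probe by simulation before formalizing it. The following sketch uses only Python's standard library; the seed and the sample sizes are arbitrary choices:

```python
import random

random.seed(0)

def proportion_heads(n):
    # Toss a fair coin n times and return the fraction of heads.
    return sum(random.random() < 0.5 for _ in range(n)) / n

for n in (100, 10_000, 1_000_000):
    print(n, proportion_heads(n))
# the proportion settles toward 0.5 as n grows
```

The fluctuation around 0.5 visibly shrinks as n increases, which is exactly what the theorem below quantifies.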
John Kerrich, a South African mathematician, tested this belief empirically while detained as a prisoner during World War II. He tossed a coin 10,000 times and observed 5067 heads. The law of large numbers is a mathematical formulation of this belief. The successive tosses of the coin are modeled as independent random trials. The random variable Xi takes on the value 0 or 1 according to whether the ith trial results in a tail or a head, and the proportion of heads in n trials is
X̄n = (1/n) Σ_{i=1}^n Xi
The law of large numbers states that X̄n approaches 1/2 in a sense that is specified by the following theorem.

THEOREM A Law of Large Numbers
Let X1, X2, ..., Xi, ... be a sequence of independent random variables with E(Xi) = μ and Var(Xi) = σ². Let X̄n = n⁻¹ Σ_{i=1}^n Xi. Then, for any ε > 0,
P(|X̄n − μ| > ε) → 0 as n → ∞

Proof
We first find E(X̄n) and Var(X̄n):
E(X̄n) = (1/n) Σ_{i=1}^n E(Xi) = μ
Since the Xi are independent,
Var(X̄n) = (1/n²) Σ_{i=1}^n Var(Xi) = σ²/n
The desired result now follows immediately from Chebyshev's inequality, which states that
P(|X̄n − μ| > ε) ≤ Var(X̄n)/ε² = σ²/(nε²) → 0, as n → ∞ ■

In the case of a fair coin toss, the Xi are Bernoulli random variables with p = 1/2, E(Xi) = 1/2, and Var(Xi) = 1/4. If the coin is tossed 10,000 times, Var(X̄_{10,000}) = 2.5 × 10⁻⁵, and the standard deviation of the average is the square root of the variance, 0.005. The proportion observed by Kerrich, 0.5067, is thus a little more than one standard deviation away from its expected value of 0.5, consistent with Chebyshev's inequality. (Recall from Section 4.2 that Chebyshev's inequality can be written in the form P(|X̄n − μ| > kσ) ≤ 1/k².)

If a sequence of random variables, {Zn}, is such that P(|Zn − α| > ε) approaches zero as n approaches infinity, for any ε > 0 and where α is some scalar, then Zn is said to converge in probability to α. There is another mode of convergence, called strong convergence or almost sure convergence, which asserts more.
Zn is said to converge almost surely to α if, for every ε > 0, |Zn − α| > ε only a finite number of times with probability 1; that is, beyond some point in the sequence, the difference is always less than ε, although where that point falls is random. The version of the law of large numbers stated and proved earlier asserts that X̄n converges to μ in probability. This version is usually called the weak law of large numbers. Under the same assumptions, a strong law of large numbers, which asserts that X̄n converges almost surely to μ, can also be proved, but we will not do so. We now consider some examples that illustrate the utility of the law of large numbers.

EXAMPLE A Monte Carlo Integration
Suppose that we wish to calculate
I(f) = ∫_0^1 f(x) dx
where the integration cannot be done by elementary means or evaluated using tables of integrals. The most common approach is to use a numerical method in which the integral is approximated by a sum; various schemes and computer packages exist for doing this. Another method, called the Monte Carlo method, works in the following way. Generate independent uniform random variables on [0, 1], that is, X1, X2, ..., Xn, and compute
Î(f) = (1/n) Σ_{i=1}^n f(Xi)
By the law of large numbers, this should be close to E[f(X)], which is simply
E[f(X)] = ∫_0^1 f(x) dx = I(f)
This simple scheme can be easily modified in order to change the range of integration and in other ways. Compared to the standard numerical methods, it is not especially efficient in one dimension, but becomes increasingly efficient as the dimensionality of the integral grows. As a concrete example, let us consider the evaluation of
I(f) = (1/√(2π)) ∫_0^1 e^{−x²/2} dx
The integral is that of the standard normal density, which cannot be evaluated in closed form. From the table of the normal distribution (Table 2 in Appendix B), an accurate numerical approximation is I(f) = .3413.
If 1000 points, X1, ..., X1000, uniformly distributed over the interval 0 ≤ x ≤ 1, are generated using a pseudorandom number generator, the integral is then approximated by
Î(f) = (1/1000) Σ_{i=1}^{1000} (1/√(2π)) e^{−Xi²/2}
which produced for one realization of the Xi the value .3417. ■

EXAMPLE B Repeated Measurements
Suppose that repeated independent unbiased measurements, X1, ..., Xn, of a quantity are made. If n is large, the law of large numbers says that X̄ will be close to the true value, μ, of the quantity, but how close X̄ is depends not only on n but on the variance of the measurement error, σ², as can be seen in the proof of Theorem A. Fortunately, σ² can be estimated, and therefore
Var(X̄) = σ²/n
can be estimated from the data to assess the precision of X̄. First, note that n⁻¹ Σ_{i=1}^n Xi² converges to E(X²), from the law of large numbers. Second, it can be shown that if Zn converges to α in probability and g is a continuous function, then
g(Zn) → g(α)
which implies that
X̄² → [E(X)]²
Finally, since n⁻¹ Σ_{i=1}^n Xi² converges to E(X²) and X̄² converges to [E(X)]², with a little additional argument it can be shown that
(1/n) Σ_{i=1}^n Xi² − X̄² → E(X²) − [E(X)]² = Var(X)
More generally, it follows from the law of large numbers that the sample moments, n⁻¹ Σ_{i=1}^n Xi^r, converge in probability to the moments of X, E(X^r). ■

EXAMPLE C
A muscle or nerve cell membrane contains a very large number of channels; when open, these channels allow ions to pass through. Individual channels seem to open and close randomly, and it is often assumed that in an equilibrium situation the channels open and close independently of each other and that only a very small fraction are open at any one time. Suppose then that the probability that a channel is open is p, a very small number, that there are m channels in all, and that the amount of current flowing through an individual channel is c.
The number of channels open at any particular time is N, a binomial random variable with m trials and probability p of success on each trial. The total amount of current is S = cN and can be measured. We then have
E(S) = cE(N) = cmp
Var(S) = c²mp(1 − p)
and
Var(S)/E(S) = c(1 − p) ≈ c
since p is small. Thus, through independent measurements, S1, ..., Sn, we can estimate E(S) and Var(S) and therefore c, the amount of current flowing through a single channel, without knowing how many channels there are. ■

5.3 Convergence in Distribution and the Central Limit Theorem

In applications, we often want to find P(a < X < b) when we do not know the cdf of X precisely; it is sometimes possible to do this by approximating FX. The approximation is often arrived at by some sort of limiting argument. The most famous limit theorem in probability theory is the central limit theorem, which is the main topic of this section. Before discussing the central limit theorem, we develop some introductory terminology, theory, and examples.

DEFINITION
Let X1, X2, ... be a sequence of random variables with cumulative distribution functions F1, F2, ..., and let X be a random variable with distribution function F. We say that Xn converges in distribution to X if lim_{n→∞} Fn(x) = F(x) at every point at which F is continuous. ■

Moment-generating functions are often useful for establishing the convergence of distribution functions. We know from Property A of Section 4.5 that a distribution function Fn is uniquely determined by its mgf, Mn. The following theorem, which we give without proof, states that this unique determination holds for limits as well.

THEOREM A Continuity Theorem
Let Fn be a sequence of cumulative distribution functions with the corresponding moment-generating functions Mn.
Let F be a cumulative distribution function with the moment-generating function M. If Mn(t) → M(t) for all t in an open interval containing zero, then Fn(x) → F(x) at all continuity points of F. ■

EXAMPLE A
We will show that the Poisson distribution can be approximated by the normal distribution for large values of λ. This is suggested by examining Figure 2.6, which shows that as λ increases, the probability mass function of the Poisson distribution becomes more symmetric and bell-shaped. Let λ1, λ2, ... be an increasing sequence with λn → ∞, and let {Xn} be a sequence of Poisson random variables with the corresponding parameters. We know that E(Xn) = Var(Xn) = λn. If we wish to approximate the Poisson distribution function by a normal distribution function, the normal must have the same mean and variance as the Poisson does. In addition, if we wish to prove a limiting result, we run into the difficulty that the mean and variance are tending to infinity. This difficulty is dealt with by standardizing the random variables, that is, by letting
Zn = (Xn − E(Xn))/√Var(Xn) = (Xn − λn)/√λn
We then have E(Zn) = 0 and Var(Zn) = 1, and we will show that the mgf of Zn converges to the mgf of the standard normal distribution. The mgf of Xn is
M_{Xn}(t) = e^{λn(e^t − 1)}
By Property C of Section 4.5, the mgf of Zn is
M_{Zn}(t) = e^{−t√λn} M_{Xn}(t/√λn) = e^{−t√λn} e^{λn(e^{t/√λn} − 1)}
It will be easier to work with the log of this expression:
log M_{Zn}(t) = −t√λn + λn(e^{t/√λn} − 1)
Using the power series expansion e^x = Σ_{k=0}^∞ x^k/k!, we see that
lim_{n→∞} log M_{Zn}(t) = t²/2
or
lim_{n→∞} M_{Zn}(t) = e^{t²/2}
The last expression is the mgf of the standard normal distribution. We have shown that a standardized Poisson random variable converges in distribution to a standard normal variable as λ approaches infinity. Practically, we wish to use this limiting result as a basis for an approximation for large but finite values of λn.
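One way to probe the quality of the approximation is to compute both sides directly. The sketch below (standard library only; λ = 900 and the cutoff 950 anticipate the next example) compares the exact Poisson tail probability with its normal approximation:

```python
import math

def poisson_tail(k, lam):
    # P(X > k) for X ~ Poisson(lam); the cdf is summed via the
    # log-pmf with lgamma to avoid overflow for large lam.
    cdf = sum(math.exp(j * math.log(lam) - lam - math.lgamma(j + 1))
              for j in range(k + 1))
    return 1.0 - cdf

def normal_tail(z):
    # P(Z > z) for standard normal Z, via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2))

lam = 900
exact = poisson_tail(950, lam)
approx = normal_tail((950 - lam) / math.sqrt(lam))
print(round(exact, 5), round(approx, 5))  # roughly .0471 versus .0478
```

The two tail probabilities agree to about one part in a hundred at λ = 900; rerunning with smaller λ shows the agreement degrading gradually.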
How adequate the approximation is for λ = 100, say, is a matter for theoretical and/or empirical investigation. It turns out that the approximation is increasingly good for large values of λ and that λ does not have to be all that large. (See Problem 8 at the end of this chapter.) ■

The next example shows how the approximation of the Poisson distribution can be applied in a specific case.

EXAMPLE B
A certain type of particle is emitted at a rate of 900 per hour. What is the probability that more than 950 particles will be emitted in a given hour if the counts form a Poisson process? Let X be a Poisson random variable with mean 900. We find P(X > 950) by standardizing:
P(X > 950) = P((X − 900)/√900 > (950 − 900)/√900) ≈ 1 − Φ(5/3) = .04779
where Φ is the standard normal cdf. For comparison, the exact probability is .04712. ■

We now turn to the central limit theorem, which is concerned with a limiting property of sums of random variables. If X1, X2, ... is a sequence of independent random variables with mean μ and variance σ², and if
Sn = Σ_{i=1}^n Xi
we know from the law of large numbers that Sn/n converges to μ in probability. This followed from the fact that
Var(Sn/n) = (1/n²) Var(Sn) = σ²/n → 0
The central limit theorem is concerned not with the fact that the ratio Sn/n converges to μ but with how it fluctuates around μ. To analyze these fluctuations, we standardize:
Zn = (Sn − nμ)/(σ√n)
You should verify that Zn has mean 0 and variance 1. The central limit theorem states that the distribution of Zn converges to the standard normal distribution.

THEOREM B Central Limit Theorem
Let X1, X2, ... be a sequence of independent random variables having mean 0 and variance σ² and the common distribution function F and moment-generating function M defined in a neighborhood of zero. Let
Sn = Σ_{i=1}^n Xi
Then
lim_{n→∞} P(Sn/(σ√n) ≤ x) = Φ(x), −∞ < x < ∞

Proof
Let Zn = Sn/(σ√n).
We will show that the mgf of Zn tends to the mgf of the standard normal distribution. Since Sn is a sum of independent random variables,
M_{Sn}(t) = [M(t)]^n
and
M_{Zn}(t) = [M(t/(σ√n))]^n
M(s) has a Taylor series expansion about zero:
M(s) = M(0) + sM′(0) + (1/2)s²M″(0) + ε_s
where ε_s/s² → 0 as s → 0. Since E(X) = 0, M′(0) = 0, and M″(0) = σ². As n → ∞, t/(σ√n) → 0, and
M(t/(σ√n)) = 1 + (1/2)σ²(t/(σ√n))² + ε_n
where ε_n/(t²/(nσ²)) → 0 as n → ∞. We thus have
M_{Zn}(t) = [1 + t²/(2n) + ε_n]^n
It can be shown that if a_n → a, then
lim_{n→∞} (1 + a_n/n)^n = e^a
From this result, it follows that
M_{Zn}(t) → e^{t²/2} as n → ∞
where exp(t²/2) is the mgf of the standard normal distribution, as was to be shown. ■

Theorem B is one of the simplest versions of the central limit theorem; there are many central limit theorems of various degrees of abstraction and generality. We have proved Theorem B under the assumption that the moment-generating functions exist, which is a rather strong assumption. By using characteristic functions instead, we could modify the proof so that it would only be necessary that first and second moments exist. Further generalizations weaken the assumption that the Xi have the same distribution and apply to linear combinations of independent random variables. Central limit theorems have also been proved that weaken the independence assumption and allow the Xi to be dependent but not "too" dependent. For practical purposes, especially for statistics, the limiting result in itself is not of primary interest. Statisticians are more interested in its use as an approximation with finite values of n. It is impossible to give a concise and definitive statement of how good the approximation is, but some general guidelines are available, and examining special cases can give insight. How fast the approximation becomes good depends on the distribution of the summands, the Xi.
If the distribution is fairly symmetric and has tails that die off rapidly, the approximation becomes good for relatively small values of n. If the distribution is very skewed or if the tails die down very slowly, a larger value of n is needed for a good approximation. The following examples deal with two special cases.

EXAMPLE C
Because the uniform distribution on [0, 1] has mean 1/2 and variance 1/12, the sum of 12 uniform random variables, minus 6, has mean 0 and variance 1. The distribution of this sum is quite close to normal; in fact, before better algorithms were developed, it was commonly used in computers for generating normal random variables from uniform ones. It is possible to compare the real and approximate distributions analytically, but we will content ourselves with a simple demonstration. Figure 5.1 shows a histogram of 1000 such sums with a superimposed normal density function. The fit is surprisingly good, especially considering that 12 is not usually regarded as a large value of n. ■

FIGURE 5.1 A histogram of 1000 values, each of which is the sum of 12 uniform [−1/2, 1/2] pseudorandom variables, with an approximating standard normal density.

EXAMPLE D
The sum of n independent exponential random variables with parameter λ = 1 follows a gamma distribution with λ = 1 and α = n (Example F in Section 4.5). The exponential density is quite skewed; therefore, a good approximation of a standardized gamma by a standardized normal would not be expected for small n. Figure 5.2 shows the cdfs of the standard normal and standardized gamma distributions for increasing values of n. Note how the approximation improves as n increases. ■

FIGURE 5.2 The standard normal cdf (solid line) and the cdfs of standardized gamma distributions with α = 5 (long dashes), α = 10 (short dashes), and α = 30 (dots).
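The classical generator of Example C is simple to reproduce. The sketch below (standard library only; the sample size of 100,000 and the seed are arbitrary choices) checks the mean, the variance, and the normal benchmark that about 68% of the mass lies within one standard deviation:

```python
import random

random.seed(2)

def approx_normal():
    # Sum of 12 uniform [0, 1] variables, minus 6: mean 0, variance 12/12 = 1.
    return sum(random.random() for _ in range(12)) - 6.0

samples = [approx_normal() for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
within_1sd = sum(abs(s) < 1 for s in samples) / len(samples)
print(round(mean, 2), round(var, 2), round(within_1sd, 2))
# mean near 0, variance near 1, within_1sd near the normal value 0.68
```

The agreement with the normal benchmark, despite n = 12 summands, mirrors the fit seen in Figure 5.1; repeating the experiment with exponential summands instead of uniform ones shows the slower convergence of Example D.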
Let us now consider some applications of the central limit theorem.

EXAMPLE E Measurement Error
Suppose that X1, ..., Xn are repeated, independent measurements of a quantity, μ, and that E(Xi) = μ and Var(Xi) = σ². The average of the measurements, X̄, is used as an estimate of μ. The law of large numbers tells us that X̄ converges to μ in probability, so we can hope that X̄ is close to μ if n is large. Chebyshev's inequality allows us to bound the probability of an error of a given size, but the central limit theorem gives a much sharper approximation to the actual error. Suppose that we wish to find P(|X̄ − μ| < c) for some constant c. To use the central limit theorem to approximate this probability, we first standardize, using E(X̄) = μ and Var(X̄) = σ²/n:
P(|X̄ − μ| < c) = P(−c < X̄ − μ__