36-217 Probability Theory and Random Processes
Lecture Notes
Mattia Ciollaro
June 24, 2015
NOTICE: text in red means that a change in the text has been
made during or after class (possibly because of a typo).
Lecture 1
Populations, Statistics and Random Processes
Broadly speaking, in Statistics we try to infer, learn, or estimate some features or parameters of a ‘population’ using observed data sampled from that population.
What do we mean by ‘population’? We can think of a population as
an entity which encompasses all possible data that can be observed in an
experiment. This can be a tangible collection of objects or a more abstract
entity.
Examples of tangible populations are:
• the collection of all pennies in the USA
• all human beings born in the year 2000.
Most often, we perform an experiment because we are interested in learning about a particular feature or parameter of a population. Examples of parameters related to the above populations are:
• the mean weight of all the pennies in the USA
• the proportion of people born in the year 2000 that are taller than
5.25 feet today
• the ‘variability’ in the weight of all the pennies in the USA.
In order to learn such features we usually proceed as follows:
1. specify a statistical model for the unknown true population (e.g. specify a class of functions that approximate well the unknown true distribution of the weight of all the pennies in the USA)
2. collect data X1 , . . . , Xn , possibly by performing a particular experiment (e.g. sample n pennies and measure their weight)
3. compute a statistic, i.e. a function of the data (and of the data exclusively!), to estimate the parameter of interest (e.g. compute the
average weight of the n sampled pennies)
4. perform further statistical analyses (see, for instance, 36-226!).
Probability Theory provides the rigorous mathematical framework needed
in the above process (and much more...). Probability Theory can be studied
at very different levels, depending on one’s purpose and needs. At a purely
mathematical level, Probability Theory is an application of Measure Theory to real-world problems. At a more introductory and conceptual level,
Probability Theory is a set of mathematical tools that allows us to carry
out and analyze the process described above in a scientific way (especially
for statisticians, computer scientists, scientists and friends).
As an example, some frequently used statistics or estimators are

• the sample mean X̄ = (1/n) ∑_{i=1}^n Xi
• the sample variance S² = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄)².
In particular, they estimate the mean µ of the distribution of interest (which is a measure of central tendency) and its variance σ² (which is a measure
of variability/dispersion). With respect to the examples above, we can use
the sample mean to estimate the mean weight µ of all the pennies in the
USA by collecting n pennies, measuring their weights X1 , . . . , Xn , and then
computing X̄. We expect that, as n gets larger, X̄ ≈ µ. If we are also interested in estimating the variance σ² of the weight of the pennies in the USA or its standard deviation σ = √σ², we might compute the sample variance S² or the sample standard deviation S = √S² (note that, as opposed to the sample variance, the sample standard deviation has the same unit of measure as the data X1, . . . , Xn). Again, as n gets larger, we expect that S² ≈ σ².
We can also use the sample mean to estimate the proportion of people born in the year 2000 who are taller than 5.25 feet today: first we measure the heights X1, . . . , Xn of n persons born in the year 2000, then we introduce an indicator variable

Zi = 1_{(5.25,∞)}(Xi) = { 1 if Xi > 5.25; 0 if Xi ≤ 5.25 }   (1)

and we finally compute Z̄ = (1/n) ∑_{i=1}^n Zi.
Again, as an example, we might imagine that a given probability distribution, for instance the Normal distribution, might be a good model to describe the weight of the pennies in the USA. The Normal distribution is completely characterized by two parameters: its mean µ and its variance σ².
How would you plan an experiment to estimate µ and σ² if you believe that the Normal distribution N(µ, σ²) is a good model for the weight of all the pennies in the USA?
We should always keep in mind that there is a difference between models and the truth. While the Normal distribution might provide a good model for the distribution of the weight of the pennies in the USA, the true distribution of the weight of the pennies in the USA might not be Normal (and most likely it is not Normal, in fact).
TRUTH → MODELS → STATISTICS
Why is the Normal distribution, for any µ and σ², most likely not the true distribution of the weight of the pennies in the USA?
There are situations in which the quantity that we want to study evolves
with time (or with space, or with respect to some other dimension). For
example, suppose that you are interested in the number of people walking
into a store on a given day. We can model that quantity as a random
quantity Xt varying with respect to time t. Most often there exists some kind of ‘dependence’ between the number of people in the store at time t and the number of people in the store at time t′ (especially if |t − t′| is small).
There is a branch of Probability Theory that is devoted to studying and modeling this type of problem. We usually refer to the collection {Xt : t ∈ T } as
a ‘random process’ or as a ‘stochastic process’. The set T can represent a
time interval, a spatial region, or some more elaborate domain.
Another example of a random process is observing the location of robberies occurring in a given city.
Another typical example of a random process is the evolution of a stock
price in Wall Street as a (random) function of time.
We will devote part of this course to studying (at an introductory level) some of the theory related to random processes.
Sample Spaces and Set Theory
What do we mean by probability? There are two major interpretations:
• objective probability: the long-run frequency of the occurrence of a
given event (e.g. seeing heads in a coin flip)
• subjective probability: a subjective belief in the rate of occurrence of
a given event.
Methods based on objective probability are usually called frequentist or
classical whereas those based on subjective probability are usually called
Bayesian, as they are associated with the Bayesian paradigm. In this class, we focus on frequentist methods (but we will discuss the basic idea behind Bayesian Statistics later in the course).
In Probability Theory, the notion of ‘event’ is usually described in terms of a set (of outcomes). Therefore, before beginning our discussion of Probability Theory, it is worthwhile to review some basic facts about set theory.
Throughout the course we will adopt the convention that Ω denotes the
universal set (the superset of all sets) and ∅ denotes the empty set (i.e. a
set that does not contain any element). Recall that a set A is a subset of
another set B (or A is contained in B) if for any element x of A we have
x ∈ A =⇒ x ∈ B. We denote the relation A is a subset of B by A ⊂ B
(similarly, A ⊃ B means A is a superset of B or A contains B). Clearly,
if A ⊂ B and B ⊂ A, then A = B. Let ∧ and ∨ stand for ‘and’ and ‘or’
respectively. Given two subsets A and B of Ω, recall the following basic set
operations:
• union of sets: A ∪ B = {x ∈ Ω : x ∈ A ∨ x ∈ B}
• intersection of sets: A ∩ B = {x ∈ Ω : x ∈ A ∧ x ∈ B}
• complement of a set: A^c = {x ∈ Ω : x ∉ A}
• set difference: A \ B = {x ∈ Ω : x ∈ A ∧ x ∉ B}
• symmetric set difference A∆B = (A \ B) ∪ (B \ A) = (A ∪ B) \ (A ∩ B).
VENN’S DIAGRAMS
Notice that, for any set A ⊂ Ω, A ∪ Ac = Ω and A ∩ Ac = ∅. Two sets
A, B ⊂ Ω for which A ∩ B = ∅ are said to be disjoint or mutually exclusive.
Also, recall the distributive property for unions and intersections. For
A, B, C ⊂ Ω, we have
• A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
• A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
Finally, we have De Morgan’s Laws:
• (A ∪ B)c = Ac ∩ B c
• (A ∩ B)c = Ac ∪ B c .
These can be extended to the union and the intersection of any n > 0 sets:
• (∪_{i=1}^n Ai)^c = ∩_{i=1}^n Ai^c
• (∩_{i=1}^n Ai)^c = ∪_{i=1}^n Ai^c.
ILLUSTRATION WITH VENN’S DIAGRAMS
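These identities are easy to check computationally. The sketch below (with arbitrary, made-up sets A, B, C) verifies De Morgan’s Laws, the distributive properties, and the symmetric-difference identity using Python’s built-in set type.

```python
# A quick check of the set identities above on a small universal set Omega.
Omega = set(range(10))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {2, 4, 6, 8}

def complement(S):
    return Omega - S

# De Morgan's Laws
assert complement(A | B) == complement(A) & complement(B)
assert complement(A & B) == complement(A) | complement(B)

# Distributive properties
assert A & (B | C) == (A & B) | (A & C)
assert A | (B & C) == (A | B) & (A | C)

# Symmetric difference: (A \ B) U (B \ A) = (A U B) \ (A intersect B)
assert (A - B) | (B - A) == (A | B) - (A & B) == A ^ B
print("all identities hold")
```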
We used before the word ‘experiment’ to describe the process of collecting data or observations. An experiment is, broadly speaking, any process by which an observation is made. This can be done actively, if you have control over the apparatus that collects the data, or passively, if you only get to see the data but have no control over the apparatus that collects them.
An experiment generates observations or outcomes. Consider the following
example: you toss a fair coin twice (your experiment). The possible outcomes (simple events) of the experiment are HH, HT, TH, TT (H: heads,
T: tails). The collection of all possible outcomes of an experiment forms the
‘sample space’ of that experiment. In this case, the sample space is the set
Ω = {HH, HT, T H, T T }. Any subset of the sample space of an experiment
is called an event. For instance, the event ‘you observe at least one tails in
your experiment’ is the set A = {HT, T H, T T } ⊂ Ω.
Homework 1 (due May 19th)
1. (25 pts) Consider the Normal distribution N(µ, σ²) that we mentioned in class and a sample X1, . . . , Xn of size n obtained from this distribution. Which of the following are statistics and why?
(a) X̄
(b) √n (X̄ − µ)/S
(c) ∏_{i=1}^n Xi
(d) max{X1 , . . . , Xn , σ 2 }
(e) 1{X3 >0} (X3 )
Solution:
(a), (c), and (e) are statistics since they are functions depending only on the data. (b) and (d) are not statistics since they also depend on the parameters µ and σ².
2. (25 pts) Show that, for any A, B1, . . . , Bn ⊂ Ω with ∪_{i=1}^n Bi = Ω, we have A = (A ∩ B1) ∪ · · · ∪ (A ∩ Bn).
Solution: We have A = A ∩ Ω and Ω = ∪_{i=1}^n Bi. Thus,

A = A ∩ Ω = A ∩ (∪_{i=1}^n Bi).

Using the distributive property of intersection over union, we get

A = ∪_{i=1}^n (A ∩ Bi).
3. (25 pts) Consider the following experiments:
• you have three keys and exactly one of them opens the door
that you want to open; at each attempt to open the door, you
pick one of the keys completely at random and try to open the
door (at each attempt you pick one of the three keys at random,
regardless of the fact that you may already have tried some in
previous attempts). You count the number of attempts needed
to open the door. What is the sample space for this experiment?
• you have an urn with 1 red ball, 1 blue ball, and 1 white ball. Two times, you draw a ball from the urn, look at its color, take note of the color on a piece of paper, and put the ball back in the urn. What is the sample space for this experiment? Write out the set corresponding to the event ‘you observe at least 1 white ball’.
Solution:
• The sample space is the set of positive integers Ω = {1, 2, 3, 4, 5, . . . }.
• Let R: red, B: blue, and W: white. The sample space is the set
of all possible pairs of observed colors
Ω = {RR, RB, RW, BR, BB, BW, W R, W B, W W }.
Let A be the event ‘you observe at least 1 white ball’. Then
A = {RW, BW, W R, W B, W W }.
4. (25 pts) Think of an example of a random process and describe what
the quantity of interest Xt is and what the variable t represents in
your example (time, space, . . . ?)
5. (EXTRA CREDIT, 10 pts) Show that the expression of the sample variance

S² = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄)²   (2)

is equivalent to

S² = (1/(n − 1)) [∑_{i=1}^n Xi² − (1/n) (∑_{i=1}^n Xi)²].   (3)
Solution:
We have

(n − 1)S² = ∑_{i=1}^n (Xi − X̄)² = ∑_{i=1}^n (Xi² − 2XiX̄ + X̄²)
          = ∑_{i=1}^n Xi² − 2X̄ ∑_{i=1}^n Xi + nX̄² = ∑_{i=1}^n Xi² − 2nX̄² + nX̄²
          = ∑_{i=1}^n Xi² − nX̄² = ∑_{i=1}^n Xi² − n (1/n²) (∑_{i=1}^n Xi)².

Thus,

S² = (1/(n − 1)) [∑_{i=1}^n Xi² − (1/n) (∑_{i=1}^n Xi)²].
Lecture 2
Recommended readings: WMS, sections 2.1 → 2.6
Unconditional Probability
Consider again the example from the previous lecture. We toss a fair coin
twice. The sample space for the experiment is Ω = {HH, HT, T H, T T }.
If we were to repeat the experiment over and over again, what
is the frequency of the occurrence of the sequence HT ? What
about the other sequences? What about the frequency of {HT } ∪
{T T }?
To make the intuition a little more formal, we can say that a probability
measure on Ω is a set function P satisfying the following axioms:
• for any A ⊂ Ω, P (A) ∈ [0, 1]
• P (Ω) = 1
• for any countable collection of disjoint events {Ai}_{i=1}^∞ ⊂ Ω, P(∪_{i=1}^∞ Ai) = ∑_{i=1}^∞ P(Ai).
An interesting question to ask is what the probability of the union of two
(possibly non-disjoint) events is. That is, given A, B ⊂ Ω, what is P (A∪B)?
Here is the answer. Notice first that A = (A ∩ B) ∪ (A ∩ B c ) and the two sets
are disjoint (why?). Using the third axiom of probability (the countable
additivity axiom), we have
P (A) = P (A ∩ B) + P (A ∩ B c )
P (B) = P (A ∩ B) + P (Ac ∩ B)
(4)
Now, A ∪ B = (A ∩ B c ) ∪ (Ac ∩ B) ∪ (A ∩ B) and again these sets are all
disjoint. Putting all together we get
P (A ∪ B) = P (A) − P (A ∩ B) + P (B) − P (A ∩ B) + P (A ∩ B) =
= P (A) + P (B) − P (A ∩ B)
(5)
How does this formula read when A and B are disjoint? Why
is P (A ∩ B) = 0 when A and B are disjoint?
This can be extended to more than 2 events. For instance, given A, B, C ⊂
Ω, we have
P (A ∪ B ∪ C) = P (A) + P (B) + P (C) − P (A ∩ B) − P (A ∩ C) − P (B ∩ C)
+ P (A ∩ B ∩ C)
(6)
and, in general, for A1, . . . , An ⊂ Ω we have the so-called inclusion-exclusion formula (note the alternating signs of the summands):

P(∪_{i=1}^n Ai) = ∑_{i=1}^n P(Ai) − ∑_{1≤i<j≤n} P(Ai ∩ Aj) + ∑_{1≤i<j<k≤n} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n−1} P(A1 ∩ · · · ∩ An).   (7)
Notice that in general we have the following ‘union bound’ for any A1, . . . , An ⊂ Ω:

P(∪_{i=1}^n Ai) ≤ ∑_{i=1}^n P(Ai).   (8)
For a given experiment with m possible outcomes, how can we compute
the probability of an event of interest? We can always do the following:
1. define the experiment and describe its simple events Ei , i ∈ {1, . . . , m}
2. define reasonable probabilities on the simple events, P (Ei ), i ∈ {1, . . . , m}
3. define the event of interest A in terms of the simple events Ei
4. compute P(A) = ∑_{i : Ei ∈ A} P(Ei).
Here is an example (and we will describe it in terms of the above scheme).
There are 5 candidates for two identical job positions: 3 females and 2 males.
What is the probability that a completely random selection process will
appear discriminatory? (i.e. exactly 2 males or exactly 2 females are chosen
for these job positions) We can approach this problem in the following way.
1. we introduce 5 labels for each of the 5 candidates: M1 , M2 , F1 , F2 ,
F3 . The sample space is then
Ω ={M1 M2 , M1 F1 , M1 F2 , M1 F3 , M2 M1 , M2 F1 , M2 F2 , M2 F3 , F1 M1 , F1 M2 ,
F1 F2 , F1 F3 , F2 M1 , F2 M2 , F2 F1 , F2 F3 , F3 M1 , F3 M2 , F3 F1 , F3 F2 }
2. because the selection process is completely random, each of the simple
events of the sample space is equally likely. Therefore the probability
of any simple event is just 1/|Ω|
3. the event of interest is A = {M1 M2 , M2 M1 , F1 F2 , F1 F3 , F2 F1 , F2 F3 , F3 F1 , F3 F2 }
4. P (A) = P (M1 M2 ) + P (M2 M1 ) + · · · + P (F3 F2 ) = |A|/|Ω| = 8/20 =
2/5 = 0.4.
Idea: if the simple events are equally likely to occur, then the
probability of a composite event A is just P (A) = |A|/|Ω|.
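The recipe above can be carried out by brute-force enumeration. The following sketch redoes the hiring example: it lists Ω with itertools and counts the outcomes that make up A.

```python
from itertools import permutations

# Enumerate all ordered selections of 2 candidates out of 5 and count the
# outcomes in which both selected candidates have the same sex.
candidates = ["M1", "M2", "F1", "F2", "F3"]
Omega = list(permutations(candidates, 2))        # |Omega| = 5 * 4 = 20
A = [w for w in Omega if w[0][0] == w[1][0]]     # both 'M' or both 'F'
print(len(A), len(Omega), len(A) / len(Omega))   # 8 20 0.4
```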
Questions to ask when you define the sample space and the probabilities
on the simple events:
• is the sampling done with or without replacement?
• does the order of the labels matter?
• can we efficiently compute the size of the sample space, |Ω|?
Basic techniques from combinatorial analysis come in handy for this type of question when the simple events forming the sample space are equally likely to occur. Let’s take a closer look at some relevant cases.
Sampling with replacement when order matters
Suppose you have a group of m elements a1, . . . , am and another group of n elements b1, . . . , bn. You can form mn pairs containing one element from each
group. Of course you can easily extend this reasoning to more than just two
groups. This is a useful fact when we are sampling with replacement and
the order matters.
Consider the following example. You toss a six-sided die twice. You have
m = 6 simple outcomes on the first roll and n = 6 possible outcomes in the
second roll. The size of the sample space consisting of all possible pairs of
the type (i, j) for i, j ∈ {1, . . . , 6} is |Ω| = mn = 62 = 36. Is the experiment
performed with replacement? Yes, if the first roll is a 2, nothing precludes
the fact that the second roll is a 2 again. Does the order matter? Yes, the
pair (2,5) corresponding to a 2 on the first roll and a 5 on the second roll is
not equal to the pair (5,2) corresponding to a 5 on the first roll and a 2 on
the second roll.
Sampling without replacement when order matters
An ordered arrangement of r distinct elements is called a permutation. The number of ways of ordering n distinct elements taken r at a time is denoted P^n_r, where

P^n_r = n(n − 1)(n − 2) · · · (n − r + 1) = n!/(n − r)!.   (9)
Consider the following example. There are 30 members in a student organization and they need to choose a president, a vice-president, and a treasurer. In how many ways can this be done? The number of distinct elements (persons) is n = 30 and the number of persons to be appointed is r = 3. The sampling is done without replacement (a president is chosen out of 30 people, then among the 29 people left a vice-president is chosen, and finally among the 28 people left a treasurer is chosen). Does the order matter? Yes, the president is elected first, then the vice-president, and finally the treasurer (think about ‘ranking’ the 30 candidates). The number of ways in which the three positions can be assigned is therefore P^30_3 = 30!/(30 − 3)! = 30 ∗ 29 ∗ 28 = 24360.
Sampling without replacement when order does not matter
The number of combinations of n elements taken r at a time corresponds to the number of subsets of size r that can be formed from the n objects. This is denoted C^n_r, where

C^n_r = P^n_r / r! = n! / ((n − r)! r!).   (10)
Here is an example. How many different subsets of three people can become officers of the organization, if chosen randomly? Order doesn’t matter here (we are not interested in the exact appointments for a given candidate). The answer is therefore C^30_3 = 30!/(27! ∗ 3!) = (30 ∗ 29 ∗ 28)/6 = 4060.
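For reference, both counts can be reproduced with the Python standard library:

```python
from math import comb, perm

# The counts from the two examples above.
print(perm(30, 3))   # ordered: president, vice-president, treasurer -> 24360
print(comb(30, 3))   # unordered: subsets of 3 officers -> 4060
```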
The binomial coefficient C^n_r can be extended to the multinomial case in a straightforward way. Suppose that we want to partition n distinct elements into k distinct groups of sizes n1, . . . , nk in such a way that each of the n elements appears in exactly one of the groups. This can be done in

(n choose n1, . . . , nk) = n! / (n1! n2! · · · nk!)   (11)

possible ways.
Exercises in class:
1. The final exam of 36-217 has 5 multiple choice questions. Questions 1 to 3 and question 5 each have 3 possible choices. Question 4 has 3 possible choices. Suppose that you haven’t studied at all for the exam and your answers on the final exam are given completely at random. What is the probability that you will get a full score on the exam?
2. You have a string of 10 letters. How many substrings of length 7 can
you form based on the original string?
3. Google is looking to hire 3 software engineers and asked CMU for potential candidates. CMU advised Google to hire 3 students from the summer 36-217 class. Because the positions must be filled as quickly
as possible, Google decided to skip the interview process and hire 3
of you completely at random. What is the probability that *you* will
become a software engineer at Google?
Lecture 3
Recommended readings: WMS, sections 2.7 → 2.13
Conditional Probability
The probability of an event A may vary as a function of the occurrence of
other events. Then, it becomes interesting to compute conditional probabilities, e.g. the probability that the event A occurs given the knowledge that
another event B occurred.
As an example of unconditional probability, think about the event A =‘my
final grade for the 36-217 summer class final exam is higher than 90/100’
(this event is not conditional on the occurrence of another event) and its
probability P (A). As an example of conditional probability, consider now
the event B = ‘my midterm grade for the 36-217 class was higher than
80/100’: how are P (A) and P (A given that B occurred) related? Intuitively, we might expect that P (A) < P (A given that B occurred)! (beware,
however, that as we will see the definition of conditional probability is not
intrinsically related to causality)
The conditional probability of an event A given another event B is usually denoted P (A|B).
INTUITIVE DESCRIPTION OF CONDITIONAL PROBABILITY VIA VENN’S DIAGRAMS
By definition, the conditional probability of the event A given the event B is

P(A|B) = P(A ∩ B) / P(B).   (12)
Observe that the quantity P (A|B) is well-defined only if P (B) > 0.
Consider the following table of joint probabilities:

        A      A^c
B      0.3    0.4
B^c    0.1    0.2
The unconditional probabilities of the events A and B are respectively P(A) = 0.3 + 0.1 = 0.4 and P(B) = 0.3 + 0.4 = 0.7. The conditional probability of the event A given the event B is P(A|B) = P(A ∩ B)/P(B) = 0.3/0.7 = 3/7 while the probability of the event B given the event A is P(B|A) = P(B ∩ A)/P(A) = 0.3/0.4 = 3/4. The conditional probability of the event A given that B^c occurs is P(A|B^c) = P(A ∩ B^c)/P(B^c) = 0.1/0.3 = 1/3. Does the probability of A vary as a function of the occurrence of the event B?
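The same computations, written as a short sketch over the joint table (stored here as a dictionary keyed by cell):

```python
# The joint table above; each key is an (A-part, B-part) cell.
joint = {("A", "B"): 0.3, ("Ac", "B"): 0.4,
         ("A", "Bc"): 0.1, ("Ac", "Bc"): 0.2}

P_A = joint[("A", "B")] + joint[("A", "Bc")]   # marginal of A: 0.4
P_B = joint[("A", "B")] + joint[("Ac", "B")]   # marginal of B: 0.7

print(joint[("A", "B")] / P_B)                 # P(A|B)  = 0.3/0.7 = 3/7
print(joint[("A", "B")] / P_A)                 # P(B|A)  = 0.3/0.4 = 3/4
print(joint[("A", "Bc")] / (1 - P_B))          # P(A|Bc) = 0.1/0.3 = 1/3
```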
The conditional probability P(·|B) with respect to an event B with P(B) > 0 is a proper probability measure, therefore the three axioms of probability that we discussed in Lecture 2 hold for P(·|B) as well. When it comes to standard computations, P(·|B) has the same properties as P(·): for instance, for disjoint events A1, A2 ⊂ Ω we have P(A1 ∪ A2|B) = P(A1|B) + P(A2|B).
Exercise in class
By using a certain prescription drug, there is a 10% chance of experiencing headaches, a 15% chance of experiencing nausea, and a 5% chance of experiencing both headaches and nausea. What is the probability that you will experience a headache given that you feel nauseous after taking the prescription drug?
Independence
The event A ⊂ Ω is said to be independent of the event B ⊂ Ω if P (A ∩
B) = P (A)P (B). This means that the occurrence of the event B does not
alter the chance that the event A happens. In fact, assuming that P (B) >
0, we easily see that this is equivalent to P (A|B) = P (A ∩ B)/P (B) =
P (A)P (B)/P (B) = P (A).
Furthermore, assuming that also P(A) > 0, we have that P(A|B) = P(A) is equivalent to P(B|A) = P(B) (independence is a symmetric relation!).
Exercise in class
Consider the following events and the corresponding table of probabilities:

        A      A^c
B       0     0.4
B^c    0.2    0.4
Are the events A and B independent?
Exercise in class
Consider the following events and the corresponding table of probabilities:

        A       A^c
B      1/4     1/2
B^c    1/12    1/6
Are the events A and B independent?
Exercise in class
Let A, B ⊂ Ω and P (B) > 0.
• What is P (A|Ω)?
• What is P (A|A)?
• Let B ⊂ A. What is P (A|B)?
• Let A ∩ B = ∅. What is P (A|B)?
Notice that from the definition of conditional probability, we have the following:

P(A ∩ B) = P(B)P(A|B) = P(A)P(B|A).   (13)
This can be generalized to more than two events. For instance, for three
events A, B, C ⊂ Ω, we have
P (A ∩ B ∩ C) = P (A)P (B|A)P (C|A ∩ B)
(14)
and for n events A1 , . . . , An
P (A1 ∩ · · · ∩ An ) = P (A1 )P (A2 |A1 )P (A3 |A1 ∩ A2 ) . . . P (An |A1 ∩ · · · ∩ An−1 ).
(15)
Law of Total Probability and Bayes’ Rule
Assume that {Bi}_{i=1}^∞ is a partition of Ω, i.e. for any i ≠ j we have Bi ∩ Bj = ∅ and ∪_{i=1}^∞ Bi = Ω. Assume also that P(Bi) > 0 ∀i. Then, for any A ⊂ Ω, we have the so-called law of total probability

P(A) = ∑_{i=1}^∞ P(A|Bi)P(Bi).   (16)

As you showed in Homework 1, A = ∪_{i=1}^∞ (A ∩ Bi) when ∪_{i=1}^∞ Bi = Ω. But this time the sets (A ∩ Bi) are also disjoint since {Bi}_{i=1}^∞ is a partition of Ω. Furthermore, by equation (13) we have P(A ∩ Bi) = P(A|Bi)P(Bi). Thus,

P(A) = ∑_{i=1}^∞ P(A ∩ Bi) = ∑_{i=1}^∞ P(A|Bi)P(Bi).
We can now use the law of total probability to derive the so-called Bayes’ rule. We have

P(Bi|A) = P(Bi ∩ A)/P(A) = P(A|Bi)P(Bi) / ∑_{j=1}^∞ P(A|Bj)P(Bj).   (17)

For two events A, B this reduces to

P(B|A) = P(A|B)P(B)/P(A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B^c)P(B^c)].   (18)
Exercise in class
You are diagnosed with a disease that has two types, A and B. In the population, the probability of having type A is 10% and the probability of having type B is 90%. You undergo a test that is 80% accurate, i.e. if you have the type A disease, the test will diagnose type A with probability 80% and type B with probability 20% (and vice versa). The test indicates that you have type A. What is the probability that you really have the type A disease?

Let A = ‘you have type A’, B = ‘you have type B’, TA = ‘the test diagnoses type A’, and TB = ‘the test diagnoses type B’. We know that P(A) = 0.1 and P(B) = 0.9. The test is 80% accurate, meaning that

P(TA|A) = P(TB|B) = 0.8
P(TB|A) = P(TA|B) = 0.2.
We want to compute P(A|TA). We have

P(A|TA) = P(A ∩ TA)/P(TA) = P(TA|A)P(A) / [P(TA|A)P(A) + P(TA|B)P(B)]
        = (0.8 ∗ 0.1) / (0.8 ∗ 0.1 + 0.2 ∗ 0.9) = 8/26 = 4/13.
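A small sketch reproducing this computation exactly (Fraction keeps the arithmetic in exact rational form):

```python
from fractions import Fraction

# Bayes' rule for the disease-type example: priors P(A) = 0.1, P(B) = 0.9,
# test accuracy P(TA|A) = P(TB|B) = 0.8.
P_A, P_B = Fraction(1, 10), Fraction(9, 10)
P_TA_given_A, P_TA_given_B = Fraction(8, 10), Fraction(2, 10)

P_TA = P_TA_given_A * P_A + P_TA_given_B * P_B   # law of total probability
print(P_TA_given_A * P_A / P_TA)                 # posterior P(A|TA) = 4/13
```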
Homework 2 (due May 21st)
1. (12 pts) A secretary types 4 letters to 4 people and addresses the 4
envelopes. If she inserts the letters at random, what is the probability
that exactly 2 letters will go in the correct envelopes?
Solution:
Call the letters A, B, C, D. What is the probability that exactly the letters A and B go in the correct envelopes? For that to happen, A and B must be matched to their envelopes and C and D must both be mismatched, which occurs in exactly one of the 4! possible orderings (the one where C and D are swapped). So this probability is 1/4!. Now, one has C^4_2 possible choices of 2 letters out of 4 to be addressed correctly. Thus, the probability that exactly 2 letters will go in the correct envelopes is C^4_2 ∗ 1/4! = 6/24 = 1/4.
2. (12 pts) A student prepares for an exam by studying a list of 10 problems. She can solve 6 of them. For the exam, the instructor chooses
5 problems among the 10 completely at random. What is the probability that the student can solve all of the problems on the day of the
exam?
Solution:
How many combinations of test questions are possible (i.e. how big is the sample space)? There are C^10_5 possible combinations of test questions. From the six questions that the student can solve, one can choose 5 of them in C^6_5 ways. Thus the probability that the student can solve all of the problems on the day of the exam is

C^6_5 / C^10_5 = [6!/(5!(6 − 5)!)] / [10!/(5!(10 − 5)!)] = 6/252 = 1/42.
3. (12 pts) A fair die is tossed six times, and the number on the uppermost
face is recorded each time. What is the probability that the numbers
recorded are 1, 2, 3, 4, 5, 6, in any order?
Solution:
The number of possible outcomes of the experiment is 6^6 (in the first roll there are 6 possible outcomes, in the second roll there are again 6 possible outcomes, and so on). The number of ways in which the numbers 1, 2, 3, 4, 5, 6 can appear in any order is 6!. Therefore, the probability that the numbers recorded after the experiment are 1, 2, 3, 4, 5, 6 in any order is 6!/6^6 = 20/6^4 ≈ 0.0154.
4. (12 pts) A population of voters contains 40% Republicans and 60%
Democrats. It is reported that 30% of the Republicans and 70% of the
Democrats favor an election issue. A person chosen at random from
this population is found to favor the issue in question. What is the
conditional probability that this person is a Democrat?
Solution:
Let D and R denote the events ‘the person is a Democrat’ and ‘the person is a Republican’, respectively. Let F denote the event ‘the person favors the issue’. We have P(D) = 0.6, P(R) = 0.4, P(F|D) = 0.7, and P(F|R) = 0.3. Then, using Bayes’ Rule,

P(D|F) = P(F|D)P(D)/P(F) = P(F|D)P(D) / [P(F|D)P(D) + P(F|R)P(R)]
       = (0.7 ∗ 0.6) / (0.7 ∗ 0.6 + 0.3 ∗ 0.4) = 7/9.
5. (4 pts) Use the axioms of probability to show that for any A ⊂ Ω we have P(A^c) = 1 − P(A).
Solution:
We have Ω = A∪Ac . Since A and Ac are disjoint, P (Ω) = P (A∪Ac ) =
P (A) + P (Ac ). Thus, P (Ac ) = P (Ω) − P (A). Since P (Ω) = 1, we then
have P (Ac ) = 1 − P (A).
6. (8 pts) Let B ⊂ A. Show that P (A) = P (B) + P (A ∩ B c ).
Solution:
Remember from Homework 1 that we have A = (A ∩ B) ∪ (A ∩ B c ).
Furthermore, these two sets are disjoint: A∩B is a subset of B, A∩B c
is a subset of B c , and obviously B and B c are disjoint. Therefore,
P (A) = P (A ∩ B) + P (A ∩ B c ). Since B ⊂ A, we have A ∩ B = B and
thus P (A) = P (B) + P (A ∩ B c ).
7. (10 pts) Show that for A, B ⊂ Ω, P (A∩B) ≥ max{0, P (A)+P (B)−1}.
Solution:
Because the probability of any event is a non-negative number, it is
clear that P (A ∩ B) ≥ 0. Furthermore, recall that in general for two
events A and B we have
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
from which it follows that
P (A ∩ B) = P (A) + P (B) − P (A ∪ B)
Since P (A ∪ B) ≤ 1, we have
P (A ∩ B) = P (A) + P (B) − P (A ∪ B) ≥ P (A) + P (B) − 1.
Thus, since P (A ∩ B) ≥ 0 and P (A ∩ B) ≥ P (A) + P (B) − 1, we
conclude that P (A ∩ B) ≥ max{0, P (A) + P (B) − 1}.
8. (10 pts) Let A, B ⊂ Ω be such that P (A) > 0 and P (B) > 0. Show
that
• if A and B are disjoint, then they are not independent
• if A and B are independent, then they are not disjoint.
Solution:
• Suppose that A and B are independent. Then, P (A ∩ B) =
P (A)P (B) > 0 (since P (A) > 0 and P (B) > 0 by assumption).
However, A and B are disjoint by assumption, i.e. A ∩ B = ∅
which implies that P (A ∩ B) = 0, and we reach a contradiction.
• Suppose that A and B are disjoint, i.e. A ∩ B = ∅. Then P (A ∩
B) = 0. However, A and B are independent by assumption, i.e.
P (A∩B) = P (A)P (B) > 0 (recall that P (A) > 0 and P (B) > 0),
and again we reach a contradiction.
9. (10 pts) Let A, B ⊂ Ω be such that P (A) > 0 and P (B) > 0. Prove
or disprove the following claim: in general, it is true that P (A|B) =
P (B|A). If you believe that the claim is true, clearly explain your
reasoning. If you believe that the claim is false in general, can you
find an additional condition under which the claim is true?
Solution:
The claim is false in general. A simple counter example: take A
arbitrary with 0 < P (A) < 1 and B = Ω. We have P (A|B) =
P (A|Ω) = P (A) while P (B|A) = P (Ω|A) = 1. Notice, however, that
if P (A) = P (B), then
P(A|B) = P(A ∩ B)/P(B) = P(A ∩ B)/P(A) = P(B|A).

Therefore, under this additional condition, the claim is true.
10. (10 pts) Let A, B ⊂ Ω be independent events. Show that the following
pairs of events are also independent:
• A, B c
• Ac , B
• Ac , B c .
Solution:
• We have A = (A ∩ B) ∪ (A ∩ B c ) and the two events are disjoint.
Therefore,
P (A ∩ B c ) = P (A) − P (A ∩ B) = P (A) − P (A)P (B)
= P (A)[1 − P (B)] = P (A)P (B c )
which implies that A and B c are independent.
• To show Ac and B are independent, just switch the roles of A
and B in the argument above.
• Once again we can write A^c = (A^c ∩ B^c) ∪ (A^c ∩ B) and the two events are disjoint. Hence, as in the first part,

P(A^c ∩ B^c) = P(A^c) − P(A^c ∩ B) = P(A^c) − P(A^c)P(B)
             = P(A^c)[1 − P(B)] = P(A^c)P(B^c)

implying that A^c and B^c are also independent (notice that here we used the previous result on the independence of A^c and B).
Lecture 4
Recommended readings: WMS, sections 3.1, 3.2, 4.1 → 4.3
Random Variables and Probability Distributions
Let’s start with an example. Suppose that you flip a fair coin three times. The sample space for this experiment is

Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.

Since the coin is fair, we have that

P({HHH}) = P({HHT}) = · · · = P({TTT}) = 1/|Ω| = 1/8.
Suppose that we are interested in a particular quantity associated to the experiment, say the number of tails that we observe. This is of course a random quantity. In particular, we can conveniently define a function X : Ω → R that counts the number of tails. Precisely,

X(HHH) = 0
X(HHT) = 1
X(HTH) = 1
X(HTT) = 2
...
X(TTT) = 3.

Since the outcome of the experiment is random, the value that the function X takes is also random. That’s why we call a function like X a random variable¹. P induces through X a probability distribution on R. In particular, we can easily see that the probability distribution of the random variable X is

P(X = 0) = P({HHH}) = 1/8
P(X = 1) = P({HHT, HTH, THH}) = 3/8
P(X = 2) = P({HTT, THT, TTH}) = 3/8
P(X = 3) = P({TTT}) = 1/8
P(X = any other number) = 0.

¹ For historical reasons X is called a random ‘variable’, even though X is clearly a function mapping Ω into R.
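The push-forward of P through X can be computed mechanically by enumerating Ω, as in this sketch:

```python
from collections import Counter
from itertools import product

# Enumerate the sample space of three fair coin flips and push the uniform
# probability on Omega through X = 'number of tails'.
Omega = list(product("HT", repeat=3))    # 8 equally likely outcomes
pmf = Counter(w.count("T") for w in Omega)
for x in sorted(pmf):
    print(x, pmf[x], "/", len(Omega))    # P(X=0)=1/8, P(X=1)=3/8, ...
```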
You might be wondering why we care about random variables. After all, based on the example, it seems that we still have to clearly define and specify the sample space in order to determine P(X = x) for x ∈ R. Actually, it turns out that when we model a random quantity of interest (such as the number of tails in the coin tossing example), most of the time one assumes (or knows) a distribution to use. As we shall see later, in the example above we would simply assume that X has a Binomial distribution without having to specify the sample space and a probability on it! Assuming that X has a Binomial distribution (or any other distribution) is equivalent to making an assumption for the probabilities that we are interested in, i.e. P(X = x) for x ∈ R.
Exercise in class
Let X and Y be random variables with the following distributions:

P(X = x) = { 0.4 if x = 0; 0.6 if x = 1; 0 if x ∉ {0, 1} }

and

P(Y = y) = { 0.7 if y = −1; 0.3 if y = 1; 0 if y ∉ {−1, 1} }.

Suppose that for any x, y ∈ R the events {X = x} and {Y = y} are independent. What is the probability distribution of the random variable Z = X + Y? We have

Z = { −1 if {X = 0} ∩ {Y = −1} occurs; 0 if {X = 1} ∩ {Y = −1} occurs; 1 if {X = 0} ∩ {Y = 1} occurs; 2 if {X = 1} ∩ {Y = 1} occurs }.

Thus,

P(Z = z) = { 0.4 ∗ 0.7 if z = −1; 0.6 ∗ 0.7 if z = 0; 0.4 ∗ 0.3 if z = 1; 0.6 ∗ 0.3 if z = 2; 0 if z ∉ {−1, 0, 1, 2} }
         = { 0.28 if z = −1; 0.42 if z = 0; 0.12 if z = 1; 0.18 if z = 2; 0 if z ∉ {−1, 0, 1, 2} }.
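The same distribution can be obtained programmatically by summing P(X = x)P(Y = y) over all pairs with x + y = z (a discrete convolution), as in this sketch:

```python
from collections import defaultdict

# Distribution of Z = X + Y for the independent X and Y above.
pX = {0: 0.4, 1: 0.6}
pY = {-1: 0.7, 1: 0.3}

pZ = defaultdict(float)
for x, px in pX.items():
    for y, py in pY.items():
        pZ[x + y] += px * py    # independence: P(X=x, Y=y) = P(X=x)P(Y=y)

print({z: round(p, 2) for z, p in sorted(pZ.items())})
# {-1: 0.28, 0: 0.42, 1: 0.12, 2: 0.18}
```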
At this point it is worthwhile making an important distinction between
two types of random variables, namely discrete random variables and continuous random variables.
We say that a random variable is discrete if the set of values that it can
take is at most countable. On the other hand, a random variable taking
values in an uncountably infinite set is called continuous.
Question
Consider the following:
• you draw a circle on a piece of paper and one diameter of the circle; at the center of the circle, you keep your pencil standing orthogonal to the plane where the circle lies. At some point you let go of the pencil. X is the random variable corresponding to the angle that the pencil forms with the diameter of the circle that you drew after the pencil fell on the piece of paper.
• you toss a coin. Y is the random variable corresponding to the number of tosses needed until you observe tails for the first time.
What are the possible values that X and Y can take? Are X and Y discrete
or continuous random variables?
Depending on whether a given random variable X is discrete or continuous, we use two special types of functions in order to describe its distribution.

• if X is discrete, let’s define its support as the set supp(X) = {x ∈ R : P(X = x) > 0} (if X is discrete, supp(X) is either a finite or countable set). We can describe the probability distribution of X in terms of its probability mass function (p.m.f.), i.e. the function

p(x) = P(X = x)   (19)

mapping R into [0, 1]. The function p satisfies the following properties:
1. p(x) ∈ [0, 1] ∀x ∈ R
2. ∑_{x∈supp(X)} p(x) = 1.
• if X is continuous, we can describe the probability distribution of X by means of the probability density function (p.d.f.) f : R → R+. Define in this case supp(X) = {x ∈ R : f(x) > 0}. The function f satisfies the following properties:
1. f(x) ≥ 0 ∀x ∈ R
2. ∫_R f(x) dx = ∫_{supp(X)} f(x) dx = 1.

We use f to compute the probability of events of the type {X ∈ (a, b]} for a < b. In particular, for a < b, we have

P(X ∈ (a, b]) = ∫_a^b f(x) dx.   (20)

Notice that this implies that, for any x ∈ R, P(X = x) = 0 if X is a continuous random variable! Also, if X is a continuous random variable, it is clear from above that

P(X ∈ (a, b]) = P(X ∈ [a, b]) = P(X ∈ [a, b)) = P(X ∈ (a, b)).   (21)

In general, for any set A ⊂ R, we have

P(X ∈ A) = ∫_A f(x) dx.
Exercise in class
A group of 4 components is known to contain 2 defectives. An inspector randomly tests the components one at a time until the two defectives are found. Let X denote the number of tests on which the second defective is found. What is the p.m.f. of X? What is the support of X? Graph the p.m.f. of X.

Let D stand for defective and N stand for non-defective. The sample space for this experiment is Ω = {DD, NDD, DND, DNND, NDND, NNDD}. The simple events in Ω are equally likely because the inspector samples the components completely at random. Thus, the probability of each simple event in Ω is just 1/|Ω| = 1/6. We have

P(X = 2) = P({DD}) = 1/6
P(X = 3) = P({NDD, DND}) = 2/6 = 1/3
P(X = 4) = P({DNND, NDND, NNDD}) = 3/6 = 1/2
P(X = any other number) = 0.

Thus, the p.m.f. of X is

p(x) = { 1/6 if x = 2; 1/3 if x = 3; 1/2 if x = 4; 0 if x ∉ {2, 3, 4} }
and supp(X) = {2, 3, 4}.
GRAPH
Exercise in class
Consider the following p.d.f. for the random variable X:

f(x) = e^{−x} 1_{[0,∞)}(x) = { e^{−x} if x ≥ 0; 0 if x < 0 }.

What is the support of X? Compute P(2 < X < 3). Graph the p.d.f. of X.
The support of X is clearly the set [0, ∞). We have

P(2 < X < 3) = P(X ∈ (2, 3)) = ∫_2^3 f(x) dx = [−e^{−x}]_2^3 = e^{−2} − e^{−3}.
GRAPH
Another way to describe the distribution of a random variable is by means of its cumulative distribution function (c.d.f.). Again we will separate the discrete case and the continuous case.

• If X is a discrete random variable, its c.d.f. is defined as the function

F(x) = ∑_{y ≤ x, y ∈ supp(X)} p(y).

Notice that for a discrete random variable, F is not a continuous function.

• If X is a continuous random variable, its c.d.f. is defined as the function

F(x) = ∫_{−∞}^x f(y) dy.

Notice that for a continuous random variable, F is a continuous function.
In both cases, the c.d.f. of X satisfies the following properties:
1. lim_{x→−∞} F(x) = 0
2. lim_{x→+∞} F(x) = 1
3. x ≤ y =⇒ F(x) ≤ F(y)
4. F is a right-continuous function, i.e. for any x ∈ R we have lim_{y→x⁺} F(y) = F(x).
Exercise in class
Compute the c.d.f. of the random variable X in the two examples above and draw its graph.

For the example in the discrete case, we have

F(x) = { 0 if x < 2; 1/6 if 2 ≤ x < 3; 1/6 + 1/3 = 1/2 if 3 ≤ x < 4; 1/2 + 1/2 = 1 if x ≥ 4 }.

For the example in the continuous case, we have

F(x) = { 0 if x < 0; ∫_0^x e^{−y} dy if x ≥ 0 } = { 0 if x < 0; [−e^{−y}]_0^x if x ≥ 0 } = { 0 if x < 0; 1 − e^{−x} if x ≥ 0 }.
GRAPHS
What is the relationship between the c.d.f. of a random variable and its p.m.f./p.d.f.?

• For a discrete random variable X, let x(i) denote the i-th smallest element in supp(X). Then,

p(x(i)) = { F(x(i)) − F(x(i−1)) if i ≥ 2; F(x(i)) if i = 1 }.

• For a continuous random variable X,

f(x) = dF(y)/dy |_{y=x}

for any x at which F is differentiable.
Furthermore, notice that for a continuous random variable X with c.d.f. F and for a < b one has

P(a < X ≤ b) = P(a ≤ X ≤ b) = P(a ≤ X < b) = P(a < X < b)
             = ∫_a^b f(x) dx = ∫_{−∞}^b f(x) dx − ∫_{−∞}^a f(x) dx = F(b) − F(a).
This is easy. However, for a discrete random variable one has to be careful
with the bounds. Using the same notation as before, for i < j one has
P (x(i) < X ≤ x(j) ) = F (x(j) ) − F (x(i) )
P (x(i) ≤ X ≤ x(j) ) = F (x(j) ) − F (x(i) ) + p(x(i) )
P (x(i) ≤ X < x(j) ) = F (x(j−1) ) − F (x(i) ) + p(x(i) )
P (x(i) < X < x(j) ) = F (x(j−1) ) − F (x(i) ).
Suppose that the c.d.f. F of a continuous random variable X is strictly increasing. Then F is invertible, meaning that there exists a function F^{−1} : (0, 1) → R ∪ {−∞, +∞} such that for any x ∈ R we have F^{−1}(F(x)) = x. Then, for any α ∈ (0, 1) we can define the α-quantile of X as the number

xα = F^{−1}(α)   (22)

with the property that P(X ≤ xα) = α. This can be extended to c.d.f.’s that are not strictly increasing and to p.m.f.’s, but for this class we will not use that extension.
Exercise in class
Consider again a random variable X with p.d.f.

f(x) = e^{−x} 1_{[0,∞)}(x) = { e^{−x} if x ≥ 0; 0 if x < 0 }.

Compute the α-quantile of X.

We know from above that

F(x) = { 0 if x < 0; 1 − e^{−x} if x ≥ 0 }.

For α ∈ (0, 1), set α = F(xα) = 1 − e^{−xα}. We then have that the α-quantile is

xα = − log(1 − α).
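A handy consequence, sketched below: if U ∼ Uniform(0, 1), then F^{−1}(U) = −log(1 − U) is a draw from this distribution (this is inverse-c.d.f. sampling, a standard use of quantile functions). A quick numerical check:

```python
import math
import random

# Draw from the p.d.f. e^{-x} on [0, inf) by inverting its c.d.f.
random.seed(0)
draws = [-math.log(1 - random.random()) for _ in range(100000)]
print(sum(draws) / len(draws))   # close to 1, the mean of this distribution
```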
Lecture 5
Recommended readings: WMS, sections 3.3, 4.3
Expectation and Variance
The expectation (or expected value or mean) is an important operator associated to a probability distribution. Given a random variable X with p.m.f. p (if X is discrete) or p.d.f. f (if X is continuous), its expectation E(X) is defined as

• E(X) = ∑_{x∈supp(X)} x p(x), if X is discrete
• E(X) = ∫_R x f(x) dx = ∫_{x∈supp(X)} x f(x) dx, if X is continuous.

Roughly speaking, E(X) is the ‘center’ of the distribution of X.
Exercise in class
Consider the random variable X and its p.m.f.

p(x) = { 0.2 if x = 0; 0.3 if x = 1; 0.1 if x = 2; 0.4 if x = 3; 0 if x ∉ {0, 1, 2, 3} }.

What is E(X)?

By definition we have

E(X) = ∑_{x∈supp(X)} x p(x) = 0 ∗ 0.2 + 1 ∗ 0.3 + 2 ∗ 0.1 + 3 ∗ 0.4 = 0.3 + 0.2 + 1.2 = 1.7.
Exercise in class
Consider the random variable X and its p.d.f.

f(x) = 3x² 1_{[0,1]}(x).

What is E(X)?

Again, by definition

E(X) = ∫_R x f(x) dx = ∫_{supp(X)} x f(x) dx = ∫_0^1 x ∗ 3x² dx = 3 ∫_0^1 x³ dx = (3/4) x⁴ |_0^1 = 3/4.
Consider a function g : R → R of the random variable X and the new random variable g(X). The expectation of g(X) is simply

• E(g(X)) = ∑_{x∈supp(X)} g(x) p(x), if X is discrete
• E(g(X)) = ∫_R g(x) f(x) dx = ∫_{x∈supp(X)} g(x) f(x) dx, if X is continuous.
Exercise in class
Consider once again the two random variables above and the function g(x) = x². What is E(X²)?

• In the discrete example, we have

E(X²) = ∑_{x∈supp(X)} x² p(x) = 0² ∗ 0.2 + 1² ∗ 0.3 + 2² ∗ 0.1 + 3² ∗ 0.4 = 0.3 + 0.4 + 3.6 = 4.3.

• In the continuous example, we have

E(X²) = ∫_R x² f(x) dx = ∫_{supp(X)} x² f(x) dx = ∫_0^1 x² ∗ 3x² dx = 3 ∫_0^1 x⁴ dx = (3/5) x⁵ |_0^1 = 3/5.
A technical note: for a random variable X, the expected value E(X) is a well-defined quantity whenever E(|X|) < ∞.

The expected value operator E is a linear operator: given two random variables X, Y and scalars a, b ∈ R we have

E(aX) = aE(X)
E(X + Y) = E(X) + E(Y).   (23)

For any scalar a ∈ R, we have E(a) = a. To see this, consider the random variable Y with p.m.f.

p(y) = 1_{a}(y) = { 1 if y = a; 0 if y ≠ a }.
From the definition of E(Y ) it is clear that E(Y ) = a.
Exercise in class
Consider again the continuous random variable X of the previous example. What is E(X + X²)?

We have E(X + X²) = E(X) + E(X²) = 3/4 + 3/5 = 27/20.
Beware that in general, for two random variables X, Y , it is not true
that E(XY ) = E(X)E(Y ). However, we shall see later in the class that
this is true when X and Y are independent.
The letter µ is often used to denote the expected value E(X) of a random
variable X, i.e. µ = E(X).
Another very important operator associated to a probability distribution is the variance. The variance measures how ‘spread out’ the distribution of a random variable is. The variance of a random variable X is defined as

V(X) = E[(X − µ)²] = E(X²) − µ².   (24)

Equivalently, one can write

• if X is discrete: V(X) = ∑_{x∈supp(X)} (x − µ)² p(x) = ∑_{x∈supp(X)} x² p(x) − µ²
• if X is continuous: V(X) = ∫_{x∈supp(X)} (x − µ)² f(x) dx = ∫_{x∈supp(X)} x² f(x) dx − µ².
Exercise in class
Consider the random variables of the previous examples. What is V(X)?

• In the discrete example, V(X) = E(X²) − [E(X)]² = 4.3 − (1.7)² = 4.3 − 2.89 = 1.41.
• In the continuous example, V(X) = E(X²) − [E(X)]² = 3/5 − (3/4)² = 3/5 − 9/16 = 48/80 − 45/80 = 3/80.
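These moments are easy to double-check numerically; the sketch below approximates the integrals for the continuous example with a midpoint Riemann sum:

```python
# Numerical check of E(X), E(X^2), and V(X) for f(x) = 3x^2 on [0, 1].
n = 1_000_000
dx = 1.0 / n

m1 = m2 = 0.0
for i in range(n):
    x = (i + 0.5) * dx    # midpoint of the i-th subinterval
    f = 3 * x * x
    m1 += x * f * dx      # accumulates E(X)
    m2 += x * x * f * dx  # accumulates E(X^2)

print(m1, m2, m2 - m1 ** 2)   # ~0.75, ~0.6, ~0.0375 = 3/80
```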
σ 2 is often used to denote the variance V (X) of a random variable X,
i.e. σ 2 = V (X). The variance of a random variable X is finite as soon as
E(X 2 ) < ∞.
Here are two important properties of the variance operator. For any
a ∈ R we have
• V (aX) = a2 V (X)
• V (a + X) = V (X).
Exercise in class
Consider the continuous random variable of the previous examples. What is V(√(80/3) X)?

We have

V(√(80/3) X) = (80/3) V(X) = (80/3)(3/80) = 1.
Beware that in general, for two random variables X, Y , it is not true
that V (X + Y ) = V (X) + V (Y ). However, we shall see later in the course
that this is true when X and Y are independent.
Markov’s Inequality and Tchebysheff’s Inequality
Sometimes we may want to compute or approximate the probability of certain events involving a random variable X even if we don’t know its distribution (but given that we know its expectation or its variance). Let a > 0. The following inequalities are useful in these cases:

P(|X| ≥ a) ≤ E(|X|)/a   (Markov’s inequality)
P(|X| ≥ a) ≤ E(X²)/a²   (Tchebysheff’s inequality)   (25)

Let σ² = V(X) as usual, so that σ is the standard deviation of X. Note that we can conveniently apply the second inequality to X − µ with a = kσ; it then reads

P(|X − µ| ≥ kσ) ≤ σ²/(kσ)² = 1/k²   (26)

where µ = E(X). Thus, we can easily bound the probability that X deviates from its expectation by more than k times its standard deviation.
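A quick empirical sketch of inequality (26), simulating (just for illustration) a Normal random variable: the observed frequency of {|X − µ| ≥ kσ} indeed stays below 1/k².

```python
import random

# Compare the empirical frequency of |X - mu| >= k*sigma with the bound 1/k^2.
random.seed(0)
mu, sigma, n = 0.0, 1.0, 200000
xs = [random.gauss(mu, sigma) for _ in range(n)]

for k in [1, 2, 3]:
    freq = sum(abs(x - mu) >= k * sigma for x in xs) / n
    print(k, round(freq, 4), "<=", round(1 / k**2, 4))
```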
Homework 3 (due May 26th)
1. (8 pts) Consider the random variable X with p.m.f.

p(x) = { 0.1 if x = 0; 0.9 if x = 1; 0 if x ∉ {0, 1} }.

Draw the p.m.f. of X. Compute the c.d.f. of X and draw it.

Solution:
The c.d.f. of X is

F(x) = { 0 if x < 0; p(0) if 0 ≤ x < 1; p(0) + p(1) if x ≥ 1 }
     = { 0 if x < 0; 0.1 if 0 ≤ x < 1; 0.1 + 0.9 = 1 if x ≥ 1 }.
[GRAPHS: the p.m.f. p(x) and the c.d.f. F(x) of X]
2. (10 pts) Let X be a continuous random variable with p.d.f.

f(x) = 3x² 1_{[0,1]}(x).

Compute the c.d.f. of X and draw it. What is P(X ∈ (0.2, 0.4))? What is P(X ∈ [−3, 0.2))?
Solution:
The c.d.f. of X is

F(x) = ∫_{−∞}^x f(y) dy = { 0 if x < 0; ∫_0^x 3y² dy if 0 ≤ x < 1; 1 if x ≥ 1 }
     = { 0 if x < 0; y³ |_0^x if 0 ≤ x < 1; 1 if x ≥ 1 }
     = { 0 if x < 0; x³ if 0 ≤ x < 1; 1 if x ≥ 1 }.
[GRAPH: the c.d.f. F(x) of X]
We have

P(X ∈ (0.2, 0.4)) = F(0.4) − F(0.2) = 0.4³ − 0.2³ = 0.056

and

P(X ∈ [−3, 0.2)) = F(0.2) − F(−3) = 0.2³ − 0 = 0.008.
3. (10 pts) Consider again the exercise on the defective components from Lecture 4. Compute the c.d.f. of X and use it (possibly together with the p.m.f. of X) to compute the following probabilities:
• P(2 ≤ X ≤ 5)
• P(X > 3)
• P(0 < X < 4).

Solution:
The c.d.f. of X is

F(x) = { 0 if x < 2; p(2) if 2 ≤ x < 3; p(2) + p(3) if 3 ≤ x < 4; p(2) + p(3) + p(4) if x ≥ 4 }
     = { 0 if x < 2; 1/6 if 2 ≤ x < 3; 1/6 + 1/3 = 1/2 if 3 ≤ x < 4; 1/6 + 1/3 + 1/2 = 1 if x ≥ 4 }.

Thus, we have

P(2 ≤ X ≤ 5) = F(5) − F(2) + p(2) = 1 − 1/6 + 1/6 = 1
P(X > 3) = 1 − P(X ≤ 3) = 1 − F(3) = 1 − 1/2 = 1/2
P(0 < X < 4) = F(3) − F(0) = 1/2 − 0 = 1/2.
4. (10 pts) Consider the following c.d.f. for the random variable X:

F(x) = P(X ≤ x) = { 0 if x < 0; (1/8)√x if 0 ≤ x < 1; −3/8 + (1/2)x if 1 ≤ x < 3/2; −3/2 + (5/4)x if 3/2 ≤ x < 2; 1 if x ≥ 2 }.

What is the p.d.f. of X?
Solution:
Notice that F is piecewise differentiable. Therefore, the p.d.f. of X is

f(x) = dF(y)/dy |_{y=x} = { 0 if x ≤ 0; 1/(16√x) if 0 < x < 1; 1/2 if 1 ≤ x < 3/2; 5/4 if 3/2 ≤ x < 2; 0 if x ≥ 2 }.
5. (10 pts) Let α ∈ (0, 1). Derive the α-quantile associated to the following c.d.f.:

F(x) = 1/(1 + e^{−x}).

Solution:
We need to solve F(xα) = α with respect to xα. We have

1/(1 + e^{−xα}) = α
1/α = 1 + e^{−xα}
e^{−xα} = (1 − α)/α
xα = − log((1 − α)/α) = log(α/(1 − α)).
6. (10 pts) Consider the continuous random variable X with p.d.f. f and its transformation 1_A(X), where A ⊂ R. Show that E(1_A(X)) = P(X ∈ A).

Solution:
Let g(x) = 1_A(x). By definition, the expected value of the random variable g(X) is

E(g(X)) = ∫_R g(x) f(x) dx.

In this case, we have

E(g(X)) = E(1_A(X)) = ∫_R 1_A(x) f(x) dx = ∫_A f(x) dx = P(X ∈ A).
7. (6 pts) Consider a random variable X and a scalar a ∈ R. Using the
definition of variance, show that V (X + a) = V (X).
Solution:
Consider the random variable Y = X + a. We have that E(Y ) =
E(X + a) = E(X) + a. Now, by definition, V (Y ) = E[(Y − E(Y ))2 ] =
E[(X + a − (E(X) + a))2 ] = E[(X − E(X))2 ] = V (X).
8. (6 pts) Let X be a random variable with expectation µ. Show that
E[(X − µ)2 ] = E(X 2 ) − µ2 .
Solution:
We have
E[(X − µ)2 ] = E(X 2 + µ2 − 2µX) = E(X 2 ) + µ2 + E(−2µX) =
= E(X 2 ) + µ2 − 2µE(X) = E(X 2 ) + µ2 − 2µ ∗ µ
= E(X 2 ) + µ2 − 2µ2 = E(X 2 ) − µ2 .
9. (10 pts) Let X be a random variable with p.d.f. f(x) = 1_{[0,1]}(x).
• Draw the p.d.f. of X.
• Compute the c.d.f. of X and draw it.
• Compute E(3X).
• Compute V(2X + 100).

Solution:
The c.d.f. of X is

F(x) = ∫_{−∞}^x f(y) dy = { 0 if x < 0; ∫_0^x dy if 0 ≤ x < 1; 1 if x ≥ 1 }
     = { 0 if x < 0; x if 0 ≤ x < 1; 1 if x ≥ 1 }.

[GRAPHS: the p.d.f. f(x) and the c.d.f. F(x) of X]

We have that E(X) = ∫_R x f(x) dx = ∫_0^1 x ∗ 1 dx = (1/2) x² |_0^1 = 1/2. Therefore, E(3X) = 3E(X) = 3/2. Furthermore,

E(X²) = ∫_R x² f(x) dx = ∫_0^1 x² ∗ 1 dx = (1/3) x³ |_0^1 = 1/3,

hence V(X) = E(X²) − [E(X)]² = 1/3 − 1/4 = 1/12. Thus, V(2X + 100) = V(2X) = 4V(X) = 4/12 = 1/3.
10. (10 pts) A secretary is given n > 0 passwords to access a computer file. In order to open the file, she starts entering the passwords in a random order, and after each attempt she discards the password that she used if that password was not the correct one. Let X be the random variable corresponding to the number of passwords that the secretary needs to enter before she can open the file.
• Argue that P(X = 1) = P(X = 2) = · · · = P(X = n) = 1/n.
• Find the expectation and the variance of X.

Potentially useful facts:
• ∑_{x=1}^n x = n(n + 1)/2
• ∑_{x=1}^n x² = n(n + 1)(2n + 1)/6.
Solution:
Let Ri denote the event ‘the secretary chooses the right password on the i-th attempt’. We have P(X = 1) = P(R1) = 1/n since the secretary is picking one password at random from the n available passwords. The probability that she chooses the right password on the second attempt is

P(X = 2) = P(R1^c ∩ R2) = P(R1^c)P(R2|R1^c) = ((n − 1)/n) ∗ (1/(n − 1)) = 1/n,

i.e. it equals the probability that she chooses the wrong password on the first attempt and then chooses the right one on the second attempt. Similarly,

P(X = 3) = P(R1^c ∩ R2^c ∩ R3) = P(R1^c)P(R2^c|R1^c)P(R3|R1^c ∩ R2^c) = ((n − 1)/n) ∗ ((n − 2)/(n − 1)) ∗ (1/(n − 2)) = 1/n,
and then it is clear that P(X = i) = 1/n for all i ∈ {1, . . . , n}. The expectation of X is therefore

E(X) = ∑_{x=1}^n x P(X = x) = (1/n) ∑_{x=1}^n x = (1/n) ∗ n(n + 1)/2 = (n + 1)/2.

To compute the variance of X we also need

E(X²) = ∑_{x=1}^n x² P(X = x) = (1/n) ∑_{x=1}^n x² = (1/n) ∗ n(n + 1)(2n + 1)/6 = (n + 1)(2n + 1)/6.

Thus,

V(X) = E(X²) − [E(X)]² = (n + 1)(2n + 1)/6 − (n + 1)²/4
     = (2n² + 3n + 1)/6 − (n² + 2n + 1)/4
     = (1/3 − 1/4)n² + (1/2 − 1/2)n + (1/6 − 1/4)
     = n²/12 − 1/12 = (n² − 1)/12.
11. (10 pts) Consider the positive random variable X (i.e. P(X > 0) = 1) with expectation E(X) = 1 and variance V(X) = 0.002. Use Markov’s inequality and Tchebysheff’s inequality to bound the probability of the event {X < 0.1}. Are both bounds useful in this case? Comment on the bounds.

Solution:
Markov’s inequality yields

P(|X| ≥ 0.1) ≤ E|X|/0.1.

Hence, P(|X| < 0.1) ≥ 1 − E|X|/0.1. Since X is positive, this is equivalent to

P(X < 0.1) ≥ 1 − E(X)/0.1 = 1 − 10 = −9.

The bound is valid and correct! The probability of any event is a non-negative number, so it is true that P(X < 0.1) ≥ −9. However, this bound is uninformative. It gives us no additional information about the distribution of X.

Tchebysheff’s inequality, applied to the deviation of X from its mean as in (26), is more informative. Since E(X) = 1, the event {X < 0.1} implies {|X − 1| ≥ 0.9}, so

P(X < 0.1) ≤ P(|X − 1| ≥ 0.9) ≤ V(X)/0.9² = 0.002/0.81 ≈ 0.0025.

This time, because the variance of X is relatively small, we obtain a non-trivial bound for the probability of interest: the event {X < 0.1} is very unlikely.
Lecture 6
Recommended readings: WMS, sections 3.4 → 3.8
Independent Random Variables
Consider a collection of n random variables X1, . . . , Xn and their joint probability distribution. Their joint probability distribution can be described in terms of the joint c.d.f.

F_{X1,...,Xn}(x1, x2, . . . , xn) = P(X1 ≤ x1 ∩ X2 ≤ x2 ∩ · · · ∩ Xn ≤ xn),   (27)

in terms of the joint p.m.f. (if the random variables are all discrete)

p_{X1,...,Xn}(x1, x2, . . . , xn) = P(X1 = x1 ∩ X2 = x2 ∩ · · · ∩ Xn = xn),   (28)

or in terms of the joint p.d.f. f_{X1,...,Xn} (if the random variables are all continuous).

The random variables X1, . . . , Xn are said to be independent if any of the following holds:

• F_{X1,...,Xn}(x1, x2, . . . , xn) = ∏_{i=1}^n F_{Xi}(xi)
• p_{X1,...,Xn}(x1, x2, . . . , xn) = ∏_{i=1}^n p_{Xi}(xi)
• f_{X1,...,Xn}(x1, x2, . . . , xn) = ∏_{i=1}^n f_{Xi}(xi).

Then, if we consider an arbitrary collection of events {X1 ∈ A1}, {X2 ∈ A2}, . . . , {Xn ∈ An}, we have that

P(X1 ∈ A1 ∩ X2 ∈ A2 ∩ · · · ∩ Xn ∈ An) = ∏_{i=1}^n P(Xi ∈ Ai).
If the random variables also share the same marginal distribution, i.e. we have

• F_{Xi} = F ∀i ∈ {1, . . . , n}
• p_{Xi} = p ∀i ∈ {1, . . . , n} (if the random variables are all discrete)
• f_{Xi} = f ∀i ∈ {1, . . . , n} (if the random variables are all continuous)

then the random variables X1, . . . , Xn are said to be independent and identically distributed, usually shortened to i.i.d..
Frequently Used Discrete Distributions
The Binomial Distribution
Consider again tossing a coin (not necessarily fair) n times in such a way that each coin flip is independent of the other coin flips. By this, we mean that if Hi denotes the event ‘observing heads on the i-th toss’, then P(Hi ∩ Hj) = P(Hi)P(Hj) for all i ≠ j. Suppose that the probability of seeing heads on each flip is p ∈ [0, 1] (and let’s call the event ‘seeing heads’ a success). Introduce the random variables

Xi = { 1 if the i-th flip is a success; 0 otherwise }

for i ∈ {1, 2, . . . , n}. The number of heads that we observe (or the number of successes in the experiment) is

X = ∑_{i=1}^n Xi.
Under the conditions described above, the random variable X is distributed according to the Binomial distribution with parameters n ∈ Z+ (number of trials) and p ∈ [0, 1] (probability of success in each trial). We denote this by X ∼ Binomial(n, p). The p.m.f. of X is

p(x) = { C^n_x p^x (1 − p)^{n−x} if x ∈ {0, 1, 2, . . . , n}; 0 if x ∉ {0, 1, 2, . . . , n} }.   (29)

Its expectation is E(X) = np and its variance is V(X) = np(1 − p).

In particular, when n = 1, we usually say that X is distributed according to the Bernoulli distribution with parameter p, denoted X ∼ Bernoulli(p). In this case E(X) = p and V(X) = p(1 − p).
Exercise in class:
Let X denote the number of 6s observed in 4 rolls of a fair die. Then X ∼ Binomial(n, p) with n = 4 and p = 1/6. What is the probability that we observe the number 6 at least 3 times in the 4 rolls? What is the expected number of times that the number 6 is observed in 4 rolls?
We have

P(X ≥ 3) = P(X = 3) + P(X = 4) = p(3) + p(4)
         = C^4_3 (1/6)³ (1 − 1/6)^{4−3} + C^4_4 (1/6)⁴ (1 − 1/6)^{4−4}
         = 4 ∗ (1/6³)(5/6) + 1/6⁴ = 21/6⁴ ≈ 0.016

and E(X) = np = 4/6 = 2/3.
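The same numbers, computed directly from the p.m.f. (29):

```python
from math import comb

# Binomial(n=4, p=1/6): the number of 6s in four rolls of a fair die.
n, p = 4, 1 / 6

def pmf(x):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

print(pmf(3) + pmf(4))   # P(X >= 3) = 21/6^4 ~ 0.0162
print(n * p)             # E(X) = 2/3
```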
The Geometric Distribution
Consider now counting the number of coin flips needed until the first success is observed in the setting described above, and let X denote the corresponding random variable. Then X has the Geometric distribution with parameter p ∈ [0, 1], denoted X ∼ Geometric(p). Its p.m.f. is given by

p(x) = { (1 − p)^{x−1} p if x ∈ {1, 2, 3, . . . }; 0 if x ∉ {1, 2, 3, . . . } }.   (30)
INTUITION
The expected value of X is E(X) = 1/p, while its variance is V(X) = (1 − p)/p².

The Geometric distribution is one of the distributions that have the so-called ‘memoryless property’. This means that if X ∼ Geometric(p), then

P(X > x + y | X > x) = P(X > y)

for any 0 < x ≤ y. To see this, let’s first compute the c.d.f. of X. We have

F(x) = P(X ≤ x) = { 0 if x < 1; p ∑_{y=1}^{⌊x⌋} (1 − p)^{y−1} if x ≥ 1 }
     = { 0 if x < 1; p ∑_{y=0}^{⌊x⌋−1} (1 − p)^y if x ≥ 1 }
     = { 0 if x < 1; p (1 − (1 − p)^{⌊x⌋})/(1 − (1 − p)) if x ≥ 1 }
     = { 0 if x < 1; 1 − (1 − p)^{⌊x⌋} if x ≥ 1 }.

Let’s look for example at the case 1 < x < y (with x and y integers, so that ⌊x + y⌋ = ⌊x⌋ + ⌊y⌋). We have

P(X > x + y | X > x) = P(X > x + y)/P(X > x) = (1 − F(x + y))/(1 − F(x)) = (1 − p)^{⌊x+y⌋}/(1 − p)^{⌊x⌋} = (1 − p)^{⌊y⌋} = P(X > y).
Exercise in class
Consider again rolling a fair die. What is the probability that the first 6 is
rolled on the 4-th roll?
Let X denote the random variable counting the number of rolls needed
to observe the first 6. We have X ∼ Geometric(p) with p = 1/6. Then,
P(X = 4) = p(4) = (1 − p)^{4−1} p = (5/6)^3 (1/6) = 5^3/6^4 ≈ 0.096.
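A short Python sketch (again with our own helper names) can confirm both this computation and the memoryless property, using P(X > x) = (1 − p)^x for non-negative integers x:

    p = 1/6

    def geom_pmf(x, p):
        # p.m.f. of Geometric(p) from equation (30), x = 1, 2, 3, ...
        return (1 - p)**(x - 1) * p

    def geom_sf(x, p):
        # P(X > x) for a non-negative integer x
        return (1 - p)**x

    print(geom_pmf(4, p))                     # 5**3/6**4 ~ 0.0965
    x, y = 3, 5
    print(geom_sf(x + y, p) / geom_sf(x, p))  # P(X > x + y | X > x)
    print(geom_sf(y, p))                      # equals P(X > y)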
The Negative Binomial Distribution
Suppose that we are interested in counting the number of coin flips needed in
order to observe the r-th success. If X denotes the corresponding random
variable, then X has the Negative Binomial distribution with parameters
r ∈ Z+ (number of successes) and p ∈ [0, 1] (probability of success on each
trial). We use the notation X ∼ NBinomial(r, p). Its p.m.f. is then
p(x) = C(x − 1, r − 1) p^r (1 − p)^{x−r} if x ∈ {r, r + 1, . . . }, and p(x) = 0 otherwise.   (31)
INTUITION
The expected value of X is E(X) = r/p and its variance is V(X) = r(1 − p)/p^2.
Exercise in class:
Consider again rolling a fair die. What is the probability that the 3rd 6 is
observed on the 5-th roll?
We have X ∼ NBinomial(r, p) with r = 3 and p = 1/6, and

P(X = 5) = C(5 − 1, 3 − 1) (1/6)^3 (5/6)^{5−3} = 6 · 5^2/6^5 ≈ 0.019.
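The p.m.f. in equation (31) is easy to evaluate directly; the sketch below does so with math.comb. One caution if you later use a library instead: some implementations (e.g. scipy.stats.nbinom) count failures rather than trials, so they are parameterized differently from these notes.

    from math import comb

    def nbinom_pmf(x, r, p):
        # p.m.f. of NBinomial(r, p) from equation (31), x = r, r + 1, ...
        return comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

    print(nbinom_pmf(5, 3, 1/6))   # 6 * 5**2/6**5 ~ 0.0193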
Question:
How do the Geometric distribution and the Negative Binomial distribution relate?
The Hypergeometric Distribution
Suppose that you have a population of N elements that can be divided into
2 subpopulations of size r and N − r (say the population of ‘successes’ and
‘failures’, respectively). Imagine that you sample n ≤ N elements from this population without replacement. What is the probability that your
sample contains x successes? To answer this question we can introduce the
random variable X with the Hypergeometric distribution with parameters r ∈ Z+ (number of successes in the population), N ∈ Z+ (population size), and n ∈ Z+ (number of trials). We write X ∼ HGeometric(r, N, n). Its
p.m.f. is

p(x) = C(r, x) C(N − r, n − x)/C(N, n) if x ∈ {0, 1, . . . , r}, and p(x) = 0 otherwise   (32)

(with the convention that a binomial coefficient C(a, b) is zero when b < 0 or b > a, which automatically takes care of the constraints x ≤ n and n − x ≤ N − r).
The expectation and the variance of X are E(X) = nr/N and

V(X) = n (r/N) ((N − r)/N) ((N − n)/(N − 1)).
Exercise in class:
There are 10 candidates for 4 software engineering positions at Google. 7 of
them are female candidates and the remaining 3 are male candidates. If the
selection process at Google is completely random, what is the probability
that only 1 male candidate will be eventually hired?
The number of male candidates hired by Google can be described by
means of the random variable X ∼ Hypergeometric(r, N, n) with r = 3,
N = 10, and n = 4. Then,
P(X = 1) = C(3, 1) C(7, 3)/C(10, 4) = (3 · 35)/210 = 1/2.
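Here too the arithmetic is quick to verify with a few lines of Python (helper name ours):

    from math import comb

    def hgeom_pmf(x, r, N, n):
        # p.m.f. of HGeometric(r, N, n) from equation (32)
        return comb(r, x) * comb(N - r, n - x) / comb(N, n)

    print(hgeom_pmf(1, r=3, N=10, n=4))   # 3 * 35/210 = 0.5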
The Poisson Distribution
The Poisson distribution can be thought of as a limiting case of the Binomial distribution. Consider the following example: you own a store and you want to model the number of people who enter your store on a given day. In any time interval of that day, the number of people walking in the store is certainly discrete, but in principle that number can be any non-negative integer. We could try and divide the day into n smaller subperiods in such a way that, as n grows, at most one person can walk into the store in any given subperiod. If we let n → ∞, it is clear however that the probability p that a person will walk in the store in an infinitesimally small subperiod of time is such that p → 0. The Poisson distribution arises as a limiting case of the Binomial distribution when n → ∞, p → 0, and np → λ ∈ (0, ∞).
The p.m.f. of a random variable X that has the Poisson distribution
with parameter λ > 0, denoted X ∼ Poisson(λ) is
p(x) = e^{−λ} λ^x/x! if x ∈ {0, 1, 2, 3, . . . }, and p(x) = 0 otherwise.   (33)
Both the expected value and the variance of X are equal to λ, E(X) =
V (X) = λ.
Exercise in class:
At the CMU USPS office, the expected number of students waiting in line
between 1PM and 2PM is 3. What is the probability that you will see more
than 2 students already in line in front of you, if you go to the USPS office
in that period of time?
Let X ∼ Poisson(λ) with λ = 3 be the number of students in line when you arrive at the office. We want to compute P(X > 2). We have
P(X > 2) = 1 − P(X ≤ 2) = 1 − ∑_{x=0}^{2} p(x) = 1 − e^{−3} ∑_{x=0}^{2} 3^x/x!
= 1 − e^{−3} (1 + 3 + 9/2) = 1 − e^{−3} (17/2) ≈ 0.577.
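As a sanity check, the same tail probability can be computed numerically (a sketch; poisson_pmf is our own helper):

    from math import exp, factorial

    def poisson_pmf(x, lam):
        # p.m.f. of Poisson(lam) from equation (33)
        return exp(-lam) * lam**x / factorial(x)

    lam = 3
    print(1 - sum(poisson_pmf(x, lam) for x in range(3)))   # ~ 0.5768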
Lecture 7
Recommended readings: WMS, sections 4.4 → 4.8
Frequently Used Continuous Distributions
In this section, we will describe some of the most commonly used continuous distributions. In particular, we will focus on three important families which
are often used in the statistical modeling of physical phenomena:
1. the Normal family: this is a class of distributions that is commonly
used to describe physical phenomena where the quantity of interest
takes values (at least in principle) in the range (−∞, ∞)
2. the Gamma family: this is a class of distributions which are frequently
used to describe physical phenomena where the quantity of interest
takes non-negative values, i.e. it takes values in the interval [0, ∞)
3. the Beta family: this is a class of distributions that is commonly used
to describe physical phenomena where the quantity of interest takes
values in some interval [a, b] of the real line.
We will also focus on some relevant subfamilies of the families mentioned
above.
The Uniform Distribution
The Uniform distribution is the simplest among the continuous distributions. We say that a random variable X has the Uniform distribution with
parameters a, b ∈ R, a < b, if its p.d.f. is
f(x) = (1/(b − a)) 1_{[a,b]}(x), i.e. f(x) = 1/(b − a) if x ∈ [a, b] and f(x) = 0 if x ∉ [a, b].   (34)

In this case, we use the notation X ∼ Uniform(a, b). We have E(X) = (a + b)/2 and V(X) = (b − a)^2/12.

GRAPH
The Normal Distribution
The Normal distribution is one of the most relevant distributions for applications
in Statistics. We say that a random variable X has a Normal distribution
with parameters µ ∈ R and σ 2 ∈ R+ if its p.d.f. is
f(x) = (1/√(2πσ^2)) e^{−(x−µ)^2/(2σ^2)}.

In this case, we write X ∼ N(µ, σ^2). The parameters of X correspond to its expectation µ = E(X) and its variance σ^2 = V(X). The c.d.f. of X ∼ N(µ, σ^2) can be expressed in terms of the error function erf(·). However, for our purposes, we will not need to investigate this further.
GRAPH
The Standardized Normal Distribution Given X ∼ N (µ, σ 2 ), we can
always obtain a ‘standardized’ version Z of X such that Z still has a Normal
distribution, E(Z) = 0, and V (Z) = 1 (i.e. Z ∼ N (0, 1)). This can be done
by means of the transformation

Z = (X − µ)/σ.
The random variable Z is said to be standardized. Of course, one can also standardize random variables that have other distributions (as long as they have finite variance), but unlike the Normal case, the resulting standardized variable may no longer belong to the same family as the original random variable X.
The Normal family is closed with respect to translation and scaling: if
X ∼ N (µ, σ 2 ), then aX + b ∼ N (aµ + b, a2 σ 2 ) for a 6= 0 and b ∈ R.
Furthermore, if X ∼ N (µ, σ 2 ) and Y ∼ N (ν, τ 2 ) and X and Y are
independent, then X + Y ∼ N (µ + ν, σ 2 + τ 2 ).
We finally mention that it is common notation to indicate the c.d.f. of
Z ∼ N (0, 1) by Φ(·). Notice that Φ(−x) = 1 − Φ(x) for any x ∈ R.
GRAPH
The Gamma Distribution
We say that the random variable X has a Gamma distribution with parameters α, β > 0 if its p.d.f. is
f(x) = (1/(β^α Γ(α))) x^{α−1} e^{−x/β} 1_{[0,∞)}(x).   (35)
GRAPH
Notice that the p.d.f. of the Gamma distribution includes the Gamma
function
Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx   (36)
for α > 0. Furthermore, for α ∈ Z+ , we have that Γ(α) = (α − 1)!.
Make sure not to confuse the Gamma distribution, described by the p.d.f.
of equation (35), and the Gamma function of equation (36)!
The following recursive property of the Gamma function is often handy
for computations that involve Gamma-distributed random variables: for any
α > 0, we have Γ(α + 1) = αΓ(α).
The expectation and the variance of X are respectively E(X) = αβ and
V (X) = αβ 2 .
The c.d.f. of a Gamma-distributed random variable can be expressed
explicitly in terms of the incomplete Gamma function. Once again, for our
purposes, we don’t need to investigate this further.
We saw that the Normal family is closed with respect to translation
and scaling. The Gamma family is closed with respect to positive scaling only. If X ∼ Gamma(α, β), then cX ∼ Gamma(α, cβ), provided that
c > 0. Furthermore, if X1 ∼ Gamma(α1, β), X2 ∼ Gamma(α2, β), . . . , Xn ∼ Gamma(αn, β) are independent random variables, then ∑_{i=1}^{n} Xi ∼ Gamma(∑_{i=1}^{n} αi, β).
The Exponential Distribution
The Exponential distribution constitutes a subfamily of the Gamma distribution. In particular, X is said to have an Exponential distribution
with parameter β > 0 if X ∼ Gamma(1, β). In that case, we write X ∼
Exponential(β).
The p.d.f. of X is therefore
f(x) = (1/β) e^{−x/β} 1_{[0,∞)}(x).
Because the Exponential distribution is a subfamily of the Gamma distribution, we have E(X) = αβ = β and V(X) = αβ^2 = β^2.
Exercise in class
Compute the c.d.f. of X ∼ Exponential(β).
We have

F(x) = P(X ≤ x) = 0 for x < 0, while for x ≥ 0

F(x) = ∫_0^x (1/β) e^{−y/β} dy = [−e^{−y/β}]_0^x = 1 − e^{−x/β}.
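This closed form is easy to check by simulation, a minimal sketch of which follows (note that the standard library's random.expovariate takes the rate 1/β, not the mean β):

    import random
    from math import exp

    beta, x = 2.0, 1.5
    random.seed(0)
    draws = [random.expovariate(1/beta) for _ in range(100_000)]
    print(sum(d <= x for d in draws) / len(draws))   # empirical P(X <= x)
    print(1 - exp(-x/beta))                          # F(x) ~ 0.5276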
The Chi-Square Distribution
The Chi-Square distribution is another relevant subfamily of the Gamma family of distributions. It frequently arises in statistical model-fitting procedures. We say that X has a Chi-Square distribution with ν > 0 ‘degrees of freedom’ if X ∼ Gamma(ν/2, 2). In this case, we write X ∼ χ2(ν). We
have E(X) = ν and V (X) = 2ν.
The Beta distribution
Suppose that you are interested in studying a quantity Y that can only take
values in the interval [a, b] ⊂ R. We can easily transform Y in such a way
that it can only take values in the standard unit interval [0, 1]: it is enough
to consider the normalized version of Y:

X = (Y − a)/(b − a).
Then, a flexible family of distributions that can be used to model Y is the
Beta family. We say that a random variable X has a Beta distribution with
parameters α, β > 0, denoted X ∼ Beta(α, β), if its p.d.f. is of the form
f(x) = (x^{α−1} (1 − x)^{β−1}/B(α, β)) 1_{[0,1]}(x),   (37)

where

B(α, β) = ∫_0^1 x^{α−1} (1 − x)^{β−1} dx = Γ(α)Γ(β)/Γ(α + β).
GRAPH
The c.d.f. of X can be expressed explicitly in terms of the incomplete
Beta function
B(x; α, β) = ∫_0^x y^{α−1} (1 − y)^{β−1} dy,
but we don’t need to investigate this further for our purposes.
The expected value of X is E(X) = α/(α + β) and its variance is V(X) = αβ/((α + β)^2 (α + β + 1)).
A Note on the Normalizing Constant of a Probability Density
Function
Most frequently, a given p.d.f. f takes the form
f (x) = cg(x)
where c is a positive constant and g is a function. The part of f depending on
x, i.e. the function g, is usually called the kernel of the p.d.f. f . Very often
one can guess whether f belongs to a certain family by simply inspecting
g. Then, if f is indeed a density, c > 0 is exactly the ‘right’ constant which
makes f integrate to 1. Therefore, if for any reason c is unknown, but
you can guess that X ∼ f belongs to a certain family of distributions for
a particular value of its parameters, in order to figure out c one does not
necessarily have to compute

c = ( ∫_{supp(X)} g(x) dx )^{−1}.
Let us illustrate this by means of two examples:
• Let f(x) = c e^{−x/2} 1_{[0,∞)}(x), and suppose we want to figure out what c is. Here g(x) = e^{−x/2} 1_{[0,∞)}(x) is the kernel of an Exponential p.d.f. with parameter β = 2. We know therefore that c must be equal to c = 1/β = 1/2.
• Let f(x) = c x^4 (1 − x)^5 1_{[0,1]}(x), and again suppose that we want to figure out the value of c. Here g(x) = x^4 (1 − x)^5 1_{[0,1]}(x), which is the kernel of a Beta p.d.f. with parameters α = 5 and β = 6. We know therefore that c must be the number

c = Γ(α + β)/(Γ(α)Γ(β)) = Γ(5 + 6)/(Γ(5)Γ(6)) = 10!/(4! 5!) = 1260.
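Both constants can be double-checked numerically with math.gamma (a small sketch under the parameter choices above):

    from math import gamma

    # Beta kernel x**4 * (1 - x)**5: c = Gamma(11)/(Gamma(5) * Gamma(6))
    print(gamma(11) / (gamma(5) * gamma(6)))   # 1260.0

    # Exponential kernel exp(-x/2): c = 1/beta with beta = 2
    print(1 / 2)                               # 0.5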
Homework 4 (due May 28th)
1. (6 pts) An oil exploration firm is formed with enough capital to finance
ten explorations. The probability of a particular exploration being
successful is 0.1. Assume the explorations are independent. Find the
mean and variance of the number of successful explorations.
Solution:
Let n ∈ Z+ and p ∈ [0, 1]. For any i ∈ {1, . . . , n}, let Xi be the
Bernoulli(p) random variable describing the success or failure of the
i-th exploration of the oil exploration firm:
Xi = 1 if the i-th exploration is successful, and Xi = 0 otherwise.
The number of successful explorations is
Y = ∑_{i=1}^{n} Xi.
Since the explorations are assumed to be independent, it follows that
Y ∼ Binomial(n, p) (recall that the sum of n independent Bernoulli(p)
random variables is a random variable itself and its distribution is
Binomial(n, p)). In this exercise n = 10 and p = 0.1 so we have
E(Y ) = np = 10 ∗ 0.1 = 1 and V (Y ) = np(1 − p) = 1 ∗ 0.9 = 0.9.
2. (8 pts) Ten motors are packaged for sale in a certain warehouse. The
motors sell for USD 100 each, but a double-your-money-back guarantee
is in effect for any defectives the purchaser may receive (i.e. if the
purchaser receives one defective motor, the seller has to pay USD 200
back to the purchaser). Find the expected net gain for the seller if the
probability of any one motor being defective is 0.08 (assuming that
the quality of any one motor is independent of that of the others).
Solution:
Let n be the number of motors that are packaged for sale in the warehouse and c be the price of each motor. For any i ∈ {1, . . . , n} let Di
denote the Bernoulli(p) random variable describing whether the i-th
motor is defective or not:
Di = 1 if the i-th motor is defective, and Di = 0 otherwise.
Every time a motor is sold the seller gains c dollars, but if the motor turns out to be defective the seller has to pay 2c dollars back to the buyer because of the double-your-money-back guarantee. Assuming that the quality of each motor is independent of the quality of every other motor in stock, the net gain of the seller is the random variable

G = cn − 2c ∑_{i=1}^{n} Di,

where ∑_{i=1}^{n} Di ∼ Binomial(n, p), and the expected net gain is the expectation of G:

E(G) = E(cn − 2c ∑_{i=1}^{n} Di) = cn − 2c E(∑_{i=1}^{n} Di) = cn − 2cnp = cn(1 − 2p).

Since c = USD 100, n = 10 and p = 0.08, we have that

E(G) = USD 100 · 10 · (1 − 2 · 0.08) = USD 1000 · 0.84 = USD 840.
3. (6 pts) Consider again Exercise 10 of Homework 3. Suppose this time
that the secretary does not discard a password once she tried it and
discovered that the password is wrong. What is the distribution of the
random variable X corresponding to the number of attempts needed
until she enters the correct password? What are the mean and the
variance of X?
Solution:
We can model the sequence of attempts using a sequence of i.i.d. random variables X1, X2, . . . ∼ Bernoulli(p) where p = 1/n. The number of
attempts needed to enter the correct password for the first time then
corresponds to a random variable Y ∼ Geometric(p). Its expectation
is E(Y ) = 1/p = n and its variance is V (Y ) = (1 − p)/p2 = n(n − 1).
4. (8 pts) A geological study indicates that an exploratory oil well should
strike oil with probability 0.2.
• What is the probability that the first strike comes on the third
well drilled?
• What is the probability that the third strike comes on the seventh
well drilled?
• What assumptions did you make to obtain the answers to parts
(a) and (b)?
• Find the mean and variance of the number of wells that must be
drilled if the company wants to set up three producing wells.
Solution:
Let X be the random variable describing the number of the trial on
which the r-th oil well is observed to strike oil. X is distributed according to NBinomial(r, p) and its probability mass function is
P(X = x) = C(x − 1, r − 1) p^r (1 − p)^{x−r} if x = r, r + 1, . . . , and P(X = x) = 0 otherwise.
• The probability that the first strike comes on the third well drilled
is
P(X = 3) = C(3 − 1, 1 − 1) p^1 (1 − p)^{3−1} = p(1 − p)^2,
which is equal to the probability of observing the first success on
the third trial (notice that this is really the same as doing the
computation with a Geometric(p) distribution since here r = 1).
p = 0.2, therefore we have
P(X = 3) = 0.2 · (0.8)^2 = 0.128.
• Here r = 3, so the probability that the third strike comes on the
seventh well drilled is
P(X = 7) = C(7 − 1, 3 − 1) p^3 (1 − p)^{7−3} = C(6, 2) · 0.2^3 · 0.8^4 ≈ 0.049.
• In order to apply the Negative Binomial model we need to make
some assumptions. In particular, we have to assume that all the
trials are identical and independent Bernoulli trials (i.e. every
time a well is drilled the fact that it strikes oil or not does not
depend on what happens at other wells and the probability that
a well strikes oil is the same for all wells).
• Here we want to compute E(X) and V(X) when r = 3, namely the expectation and the variance of the number of the trial on which the third well is observed to strike oil. We have

E(X) = r/p = 3/0.2 = 15 and V(X) = r(1 − p)/p^2 = (3 · 0.8)/0.2^2 = 60.
5. (6 pts) Cards are dealt at random and without replacement from a
standard 52 card deck. What is the probability that the second king
is dealt on the fifth card?
Solution:
The probability of the event that the second king is dealt on the fifth
card is equivalent to the probability of the intersection of the two
events A = “exactly one king is dealt in the first four cards” and B = “a king is dealt on the fifth card”. We have
P(A) = C(4, 1) C(48, 3)/C(52, 4) and P(B|A) = C(3, 1) C(45, 0)/C(48, 1) = 3/48, so

P(A ∩ B) = P(A) P(B|A) = [C(4, 1) C(48, 3)/C(52, 4)] · (3/48) = (4 · 17296/270725) · (3/48) ≈ 0.016.
We can also interpret the problem in the following way. Let K1 ∼
HGeometric(r, N, n) be the random variable describing the number of
kings that are dealt in the first n cards from a deck of N cards containing exactly r kings and let K2 ∼ Hypergeometric(s, M, m) be a
random variable describing the number of kings that are dealt in the
first m cards from a deck of M cards containing exactly s kings. Assume moreover that K1 and K2 are independent. Then the probability
that the second king is dealt on the fifth hand can be expressed as
P(K1 = 1, K2 = 1) = P(K1 = 1) P(K2 = 1) = [C(r, 1) C(N − r, n − 1)/C(N, n)] · [C(s, 1) C(M − s, m − 1)/C(M, m)]
for a particular choice of the parameters. For this question, we know
that N = 52, n = 4 and r = 4 and we know how the deck changes after
exactly one king and three other cards are dealt: indeed, we know that
M = N − n, m = 5 − n and s = r − 1. Moreover, the independence
assumption is appropriate since the cards are always dealt randomly,
no matter the composition of the deck. It’s easy to check that plugging these parameter values into the expression for P(K1 = 1, K2 = 1) yields
exactly
P(K1 = 1, K2 = 1) = [C(r, 1) C(N − r, n − 1)/C(N, n)] · [C(r − 1, 1) C(N − n − (r − 1), 5 − n − 1)/C(N − n, 5 − n)]
= [C(4, 1) C(48, 3)/C(52, 4)] · [C(3, 1) C(45, 0)/C(48, 1)] ≈ 0.016.
6. (8 pts) A parking lot has two entrances. Cars arrive at entrance I
according to a Poisson distribution at an average of three per hour
and at entrance II according to a Poisson distribution at an average of
four per hour. What is the probability that a total of three cars will
arrive at the parking lot in a given hour? Assume that the numbers
of cars arriving at the two entrances are independent.
Solution:
Let C1 ∼ Poisson(λ) and C2 ∼ Poisson(µ) denote the number of cars
arriving in an hour at entrance I and entrance II of the parking lot
respectively. We know that λ = 3 and µ = 4 and we assume that C1
and C2 are independent. At a given hour
C = C1 + C2
is the total number of cars arriving at the parking lot. We are interested in computing P (C = 3). The event C = 3 can be written as the
following disjoint union of events
{C = 3} = ({C1 = 0} ∩ {C2 = 3}) ∪ ({C1 = 1} ∩ {C2 = 2})∪
∪ ({C1 = 2} ∩ {C2 = 1}) ∪ ({C1 = 3} ∩ {C2 = 0})
whose probability is (here we use the fact that the union is a disjoint
union and we exploit the independence of C1 and C2 )
P(C = 3) = P(C1 = 0 ∩ C2 = 3) + P(C1 = 1 ∩ C2 = 2) + P(C1 = 2 ∩ C2 = 1) + P(C1 = 3 ∩ C2 = 0)
= P(C1 = 0) P(C2 = 3) + P(C1 = 1) P(C2 = 2) + P(C1 = 2) P(C2 = 1) + P(C1 = 3) P(C2 = 0)
= e^{−λ} e^{−µ} µ^3/3! + e^{−λ} λ e^{−µ} µ^2/2 + e^{−λ} (λ^2/2) e^{−µ} µ + e^{−λ} (λ^3/3!) e^{−µ}
= e^{−(λ+µ)} (µ^3/3! + λ µ^2/2 + λ^2 µ/2 + λ^3/3!) ≈ 0.052.
7. (6 pts) Derive the c.d.f. of X ∼ Uniform(a, b).
Solution:
Recall that

f(x) = (1/(b − a)) 1_{[a,b]}(x).

We have

F(x) = P(X ≤ x) = 0 if x < a, F(x) = ∫_a^x 1/(b − a) dy = (x − a)/(b − a) if a ≤ x < b, and F(x) = 1 if x ≥ b.
8. (4 pts) Let X ∼ Uniform(0, 1). Is X ∼ Beta(α, β) for some α, β > 0?
Solution:
Recall the p.d.f. corresponding to the Beta distribution with parameters α, β > 0:
f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1} 1_{[0,1]}(x).
By comparing this to the p.d.f. of X,

fX(x) = 1_{[0,1]}(x),

it is easily seen that X ∼ Uniform(0, 1) ≡ Beta(1, 1). Notice that in this case Γ(α) = Γ(β) = Γ(1) = 0! = 1 and Γ(α + β) = Γ(2) = 1! = 1.
9. (8 pts) The number of defective circuit boards coming off a soldering
machine follows a Poisson distribution. During a specific eight-hour
day, one defective circuit board was found. Conditional on the event
that a single defective circuit was produced, the time at which it was
produced has a Uniform distribution.
• Find the probability that the defective circuit was produced during the first hour of operation during that day.
• Find the probability that it was produced during the last hour of
operation during that day.
• Given that no defective circuit boards were produced during the
first four hours of operation, find the probability that the defective board was manufactured during the fifth hour.
Solution:
Let T be the random variable denoting the time at which the defective
circuit board was produced. T is uniformly distributed in the interval [0, 8] (hours), that is T ∼ Uniform(0, 8). Its probability density
function is therefore

f(t) = 1/8 if t ∈ [0, 8], and f(t) = 0 otherwise,

and its cumulative distribution function is

F(t) = P(T ≤ t) = 0 if t < 0, F(t) = t/8 if t ∈ [0, 8), and F(t) = 1 if t ≥ 8.
• The probability that the defective circuit board was produced during the first hour of operation during the day is

P(T ∈ (0, 1]) = P(0 < T ≤ 1) = F(1) − F(0) = 1/8 − 0 = 1/8.

• The probability that the defective circuit board was produced during the last hour of operation during the day is

P(T ∈ (7, 8]) = P(7 < T ≤ 8) = F(8) − F(7) = 1 − 7/8 = 1/8.
• Given that no defective circuit boards were produced in the first four hours (that is, we know that the event T > 4 occurred), the probability that the defective board was manufactured during the fifth hour, i.e. in the interval (4, 5], is

P(T ∈ (4, 5] | T > 4) = P(T ∈ (4, 5] ∩ T > 4)/P(T > 4) = P(T ∈ (4, 5])/P(T > 4)
= (F(5) − F(4))/(1 − F(4)) = (5/8 − 4/8)/(1 − 4/8) = (1/8)/(1/2) = 1/4.
10. (6 pts) Let Z ∼ N (0, 1). Show that, for any z > 0, P (|Z| > z) =
2(1 − Φ(z)).
Solution:
We have
P (|Z| > z) = P (Z > z ∪ Z < −z) = P (Z > z) + P (Z < −z)
= 1 − P (Z ≤ z) + P (Z < −z) = 1 − Φ(z) + Φ(−z)
= 1 − Φ(z) + 1 − Φ(z) = 2(1 − Φ(z)).
11. (6 pts) Let Z ∼ N (0, 1). Express in terms of Φ, the c.d.f. of Z, the
following probabilities
• P (Z 2 < 1)
• P (3Z/5 + 3 > 2).
Compute the exact probability of the event Φ^{−1}(α/3) ≤ Z ≤ Φ^{−1}(4α/3), where Φ^{−1}(α) is the inverse c.d.f. computed at α (i.e. the α-quantile of Z).
Solution:
We have
• P(Z^2 < 1) = 1 − P(Z^2 > 1) = 1 − P(|Z| > 1). Based on the result from the previous exercise, P(|Z| > 1) = 2(1 − Φ(1)). Thus, P(Z^2 < 1) = 1 − 2(1 − Φ(1)) = 2Φ(1) − 1.
• P (3Z/5 + 3 > 2) = P (3Z/5 > −1) = P (Z > −5/3) = 1 −
Φ(−5/3) = Φ(5/3).
Regarding the last question, we have
P(Φ^{−1}(α/3) ≤ Z ≤ Φ^{−1}(4α/3)) = Φ(Φ^{−1}(4α/3)) − Φ(Φ^{−1}(α/3)) = 4α/3 − α/3 = α.
12. (8 pts) The SAT and ACT college entrance exams are taken by thousands of students each year. The mathematics portions of each of these
exams produce scores that are approximately normally distributed. In
recent years, SAT mathematics exam scores have averaged 480 with
standard deviation 100. The average and standard deviation for ACT
mathematics scores are 18 and 6, respectively.
• An engineering school sets 550 as the minimum SAT math score
for new students. What percentage of students will score below
550 in a typical year? Express your answer in terms of Φ.
• What score should the engineering school set as a comparable
standard on the ACT math test? You can leave your answer in
terms of Φ−1 .
Solution:
Let S denote the random variable describing the SAT mathematics
scores and let A denote the random variable describing the ACT mathematics scores. Let also Z ∼ N(0, 1). S is distributed approximately according to N(µs, σs^2) with µs = 480 and σs = 100, and A is distributed approximately according to N(µa, σa^2) with µa = 18 and σa = 6.
• Based on the approximations above, the percentage of students scoring below 550 on the SAT math exam in a typical year is

P(S ≤ 550) = P((S − µs)/σs ≤ (550 − µs)/σs) ≈ P(Z ≤ (550 − 480)/100) = P(Z ≤ 0.7) = Φ(0.7) ≈ 0.758.
• The minimum score that the engineering school should require from new students on the ACT math test, in order for the two minimum scores to be comparable standards, is the number a that satisfies

P(A ≤ a) ≈ P(S ≤ 550) ≈ Φ(0.7).

We have to solve for a the following equation:

P(A ≤ a) = P((A − µa)/σa ≤ (a − µa)/σa) ≈ P(Z ≤ (a − µa)/σa) = Φ(0.7),

namely we need to solve (a − µa)/σa = 0.7, whose solution is a = 0.7 σa + µa = 0.7 · 6 + 18 = 22.2.
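If a numerical value is wanted, Φ can be evaluated with the standard library via the identity Φ(z) = (1 + erf(z/√2))/2; the sketch below checks both parts of the exercise:

    from math import erf, sqrt

    def Phi(z):
        # standard Normal c.d.f. via the error function
        return (1 + erf(z / sqrt(2))) / 2

    print(Phi(0.7))       # ~ 0.758, the fraction scoring below 550
    print(0.7 * 6 + 18)   # comparable ACT cutoff a = 22.2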
13. (6 pts) Suppose that a random variable X has a probability density
function given by
f(x) = c x^3 e^{−x/2} 1_{[0,∞)}(x).
• Find the value of c that makes f a valid p.d.f. (hint: if you think
about this carefully, no calculations are needed!)
• Does X have a Chi-Square distribution? If so, with how many
degrees of freedom?
• What are E(X) and V (X)?
Solution:
• Given that f is a bona fide p.d.f., c has to be exactly the positive constant that makes the kernel of the p.d.f.,

x^3 e^{−x/2} 1_{[0,∞)}(x),

integrate to 1. We recognize this as the kernel of a Gamma p.d.f. with parameters α = 4 and β = 2. It follows that c must be the number

c = 1/(β^α Γ(α)) = 1/(2^4 Γ(4)) = 1/(16 · 3!) = 1/96.
• Recall that Gamma(ν/2, 2) ≡ χ2 (ν). Therefore, in this case,
ν = 8 and the above p.d.f. corresponds to a Chi-Square p.d.f.
with ν = 8 degrees of freedom.
• We have E(X) = ν = 8 and V (X) = 2ν = 16.
14. (8 pts) Consider the random variable X ∼ Exponential(β). Show that
the distribution of X has the memoryless property.
Solution:
Let s, t ≥ 0. We have that

P(X > s + t | X > s) = P(X > s + t)/P(X > s) = (1 − F(s + t))/(1 − F(s)) = e^{−(s+t)/β}/e^{−s/β} = e^{−t/β} = P(X > t).
Therefore, X has the memoryless property.
15. (6 pts) An interesting property of the Exponential distribution is that
its associated hazard function is constant on [0, ∞). The hazard function associated to a density f is defined as the function
h(x) = f(x)/(1 − F(x)).   (38)
One way to interpret the hazard function is in terms of the failure rate of a certain component. In this sense, h can be thought of as the ratio between the probability density of failure at time x, f(x), and the probability that the component is still working at that time, which is P(X > x) = 1 − F(x).
Show that if X ∼ Exponential(β) is a random variable describing the
time at which a particular component fails to work, then the failure
rate of X is constant (i.e. h does not depend on x).
Solution:
For x > 0 we have

h(x) = f(x)/(1 − F(x)) = ((1/β) e^{−x/β})/(1 − (1 − e^{−x/β})) = 1/β.
16. (EXTRA CREDIT, 10 pts) Consider the Binomial p.m.f.
p(x) = C(n, x) p^x (1 − p)^{n−x} 1_{{0,1,...,n}}(x)

with parameters n and p. Show that

lim_{n→∞} p(x) = e^{−λ} (λ^x/x!) 1_{{0,1,...}}(x),
where the limit is taken as n → ∞, p → 0, and np → λ > 0.
Solution:
Take any x ∈ {0, 1, . . . , n}. We have

p(x) = C(n, x) p^x (1 − p)^{n−x} = (n!/(x!(n − x)!)) (np/n)^x (1 − np/n)^{n−x}
= (n(n − 1) · · · (n − x + 1)/n^x) (1/x!) (np)^x (1 − np/n)^n (1 − np/n)^{−x}.

Now,

n(n − 1) · · · (n − x + 1)/n^x → 1 as n → ∞,
(np)^x → λ^x as np → λ,
(1 − np/n)^n → (1 − λ/n)^n as np → λ, and (1 − λ/n)^n → e^{−λ} as n → ∞,
(1 − np/n)^{−x} → (1 − λ/n)^{−x} as np → λ, and (1 − λ/n)^{−x} → 1 as n → ∞.

Thus,

p(x) → 1 · (1/x!) · λ^x · e^{−λ} · 1 = e^{−λ} λ^x/x!,

which is the p.m.f. of a Poisson distribution with parameter λ.
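A small numerical illustration of this limit (our own sketch, not part of the exercise): holding λ = np fixed and letting n grow, the Binomial p.m.f. values approach the Poisson p.m.f.

    from math import comb, exp, factorial

    lam, x = 2.0, 3
    for n in (10, 100, 10_000):
        p = lam / n
        # Binomial(n, lam/n) p.m.f. at x
        print(n, comb(n, x) * p**x * (1 - p)**(n - x))
    print("Poisson limit:", exp(-lam) * lam**x / factorial(x))   # ~ 0.1804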
17. (EXTRA CREDIT, 10 pts) Consider the Normal p.d.f.

f(x) = (1/√(2πσ^2)) e^{−(x−µ)^2/(2σ^2)}

with parameters µ ∈ R and σ^2 > 0. Show that µ − σ and µ + σ correspond to the inflection points of f.
Solution:
First, we need to compute the second derivative of f. We have

f′(x) = −(1/√(2πσ^2)) ((x − µ)/σ^2) e^{−(x−µ)^2/(2σ^2)}

and therefore

f′′(x) = −(1/√(2πσ^2)) (1/σ^2) [e^{−(x−µ)^2/(2σ^2)} − ((x − µ)^2/σ^2) e^{−(x−µ)^2/(2σ^2)}]
= −(1/√(2πσ^2)) (1/σ^2) e^{−(x−µ)^2/(2σ^2)} [1 − (x − µ)^2/σ^2].

Notice that

−(1/√(2πσ^2)) (1/σ^2) e^{−(x−µ)^2/(2σ^2)} < 0

for any x ∈ R, therefore the sign of f′′ depends on

1 − (x − µ)^2/σ^2 = (σ^2 − (x − µ)^2)/σ^2.

Clearly, (σ^2 − (x − µ)^2)/σ^2 ≥ 0 for |x − µ| ≤ σ ⇐⇒ µ − σ ≤ x ≤ µ + σ, (σ^2 − (x − µ)^2)/σ^2 ≤ 0 for |x − µ| ≥ σ ⇐⇒ x ≥ µ + σ ∨ x ≤ µ − σ, and (σ^2 − (x − µ)^2)/σ^2 = 0 for x = µ ± σ. Therefore x = µ ± σ are the two inflection points of f.
18. (EXTRA CREDIT, 10 pts) Consider the random variable X ∼ Beta(α, β) with α, β ∈ Z+. Show that, for n ∈ Z+,

E(X^{n+1})/E(X^n) = (α + n)/(α + β + n).

Use this result to show that

E(X^n) = ∏_{i=0}^{n−1} (α + i)/(α + β + i).
Solution:
For the first claim, fix an arbitrary n ∈ Z+. Then

E(X^n) = (Γ(α + β)/(Γ(α)Γ(β))) ∫_0^1 x^n x^{α−1} (1 − x)^{β−1} dx = (Γ(α + β)/(Γ(α)Γ(β))) ∫_0^1 x^{α+n−1} (1 − x)^{β−1} dx.

Now,

x^{α+n−1} (1 − x)^{β−1} 1_{[0,1]}(x)

is the kernel of a Beta(α + n, β) p.d.f., hence its integral on [0, 1] must be equal to

∫_0^1 x^{α+n−1} (1 − x)^{β−1} dx = Γ(α + n)Γ(β)/Γ(α + β + n).

Therefore,

E(X^n) = (Γ(α + β)/(Γ(α)Γ(β))) (Γ(α + n)Γ(β)/Γ(α + β + n)) = Γ(α + β)Γ(α + n)/(Γ(α)Γ(α + β + n)).

Then,

E(X^{n+1})/E(X^n) = (Γ(α + β)Γ(α + n + 1)/(Γ(α)Γ(α + β + n + 1))) · (Γ(α)Γ(α + β + n)/(Γ(α + β)Γ(α + n)))
= (Γ(α + n + 1)/Γ(α + n)) · (Γ(α + β + n)/Γ(α + β + n + 1))
= ((α + n)Γ(α + n)/Γ(α + n)) · (Γ(α + β + n)/((α + β + n)Γ(α + β + n)))
= (α + n)/(α + β + n).
For the second claim, it is enough to notice that, since E(X^0) = 1, the product telescopes:

E(X^n) = ∏_{i=0}^{n−1} E(X^{i+1})/E(X^i).

From the previous result, we have that

E(X^{i+1})/E(X^i) = (α + i)/(α + β + i),

thus

E(X^n) = ∏_{i=0}^{n−1} E(X^{i+1})/E(X^i) = ∏_{i=0}^{n−1} (α + i)/(α + β + i).
Lecture 8
Recommended readings: WMS, sections 5.1 → 5.4
Multivariate Probability Distributions
So far we have focused on univariate probability distributions, i.e. probability distributions for a single random variable. However, when we discussed
independence of random variables in Lecture 6, we introduced the notion of
joint c.d.f., joint p.m.f., and joint p.d.f. for a collection of n random variables
X1 , . . . , Xn . In this lecture we will elaborate more on these objects.
Let X1 , . . . , Xn be a collection of n random variables.
• Regardless of whether they are discrete or continuous, we denote by
FX1 ,...,Xn their joint c.d.f., i.e. the function
FX1 ,...,Xn (x1 , . . . , xn ) = P (X1 ≤ x1 ∩ · · · ∩ Xn ≤ xn ).
• If they are all discrete, we denote by pX1 ,...,Xn their joint p.m.f., i.e.
the function
pX1 ,...,Xn (x1 , . . . , xn ) = P (X1 = x1 ∩ · · · ∩ Xn = xn ).
• If they are all continuous, we denote by fX1 ,...,Xn their joint p.d.f..
The above functions satisfy properties that are similar to those satisfied by
their univariate counterparts.
• The joint c.d.f. FX1 ,...,Xn satisfies:
– FX1 ,...,Xn (x1 , . . . , xn ) ∈ [0, 1] for any x1 , . . . , xn ∈ R.
– FX1 ,...,Xn is monotonically non-decreasing in each of its variables
– FX1 ,...,Xn is right-continuous with respect to each of its variables
– limxi →−∞ FX1 ,...,Xn (x1 , . . . , xi , . . . , xn ) = 0 for any i ∈ {1, . . . , n}
– limx1 →+∞,...,xn →+∞ FX1 ,...,Xn (x1 , . . . , xn ) = 1
• The joint p.m.f. satisfies:
– pX1 ,...,Xn (x1 , . . . , xn ) ∈ [0, 1] for any x1 , . . . , xn ∈ R.
– ∑_{x1 ∈ supp(X1)} · · · ∑_{xn ∈ supp(Xn)} pX1,...,Xn(x1, . . . , xn) = 1.
• The joint p.d.f. satisfies:
– fX1,...,Xn(x1, . . . , xn) ≥ 0 for any x1, . . . , xn ∈ R.
– ∫_R · · · ∫_R fX1,...,Xn(x1, . . . , xn) dx1 . . . dxn = ∫_{supp(X1)} · · · ∫_{supp(Xn)} fX1,...,Xn(x1, . . . , xn) dxn . . . dx1 = 1.
Furthermore we have

FX1,...,Xn(x1, . . . , xn) = ∑_{y1 ≤ x1, y1 ∈ supp(X1)} · · · ∑_{yn ≤ xn, yn ∈ supp(Xn)} p(y1, . . . , yn)

and

FX1,...,Xn(x1, . . . , xn) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xn} fX1,...,Xn(y1, . . . , yn) dyn . . . dy1.
Exercise in class:
You are given the following bivariate p.m.f.:

pX1,X2(x1, x2) = 1/8 if (x1, x2) = (0, −1); 1/4 if (x1, x2) = (0, 0); 1/8 if (x1, x2) = (0, 1); 1/4 if (x1, x2) = (2, −1); 1/4 if (x1, x2) = (2, 0); and 0 otherwise.
What is P (X1 ≤ 1 ∩ X2 ≤ 0) = FX1 ,X2 (1, 0)? We have
FX1,X2(1, 0) = pX1,X2(0, −1) + pX1,X2(0, 0) = 1/8 + 1/4 = 3/8.
GRAPH
Exercise in class:
You are given the following bivariate p.d.f.:
fX1,X2(x1, x2) = e^{−(x1+x2)} 1_{[0,∞)×[0,∞)}(x1, x2)
• What is P (X1 ≤ 1 ∩ X2 > 5)?
• What is P (X1 + X2 ≤ 3)?
We have
P(X1 ≤ 1 ∩ X2 > 5) = ∫_0^1 ∫_5^∞ fX1,X2(x1, x2) dx2 dx1 = ∫_0^1 ∫_5^∞ e^{−(x1+x2)} dx2 dx1
= ∫_0^1 e^{−x1} dx1 · ∫_5^∞ e^{−x2} dx2 = [−e^{−x1}]_0^1 · [−e^{−x2}]_5^∞ = (1 − e^{−1}) e^{−5},
and
GRAPH
P(X1 + X2 ≤ 3) = ∫_0^3 ∫_0^{3−x1} e^{−(x1+x2)} dx2 dx1 = ∫_0^3 e^{−x1} ∫_0^{3−x1} e^{−x2} dx2 dx1
= ∫_0^3 e^{−x1} [−e^{−x2}]_0^{3−x1} dx1 = ∫_0^3 e^{−x1} (1 − e^{x1−3}) dx1
= ∫_0^3 (e^{−x1} − e^{−3}) dx1 = [−e^{−x1}]_0^3 − 3e^{−3} = 1 − 4e^{−3}.
Marginal Distributions
Given a collection of random variables X1 , . . . , Xn and their joint distribution, how can we derive the marginal distribution of only one of them, say
Xi? The idea is to sum or integrate the joint distribution over all the variables except Xi.
GRAPH
Thus, given pX1,...,Xn we have that

pXi(xi) = ∑_{y1 ∈ supp(X1)} · · · ∑_{yi−1 ∈ supp(Xi−1)} ∑_{yi+1 ∈ supp(Xi+1)} · · · ∑_{yn ∈ supp(Xn)} pX1,...,Xn(y1, . . . , yi−1, xi, yi+1, . . . , yn),
and given fX1,...,Xn we have

fXi(xi) = ∫_R · · · ∫_R fX1,...,Xn(y1, . . . , yi−1, xi, yi+1, . . . , yn) dyn . . . dyi+1 dyi−1 . . . dy1
= ∫_{supp(X1)} · · · ∫_{supp(Xi−1)} ∫_{supp(Xi+1)} · · · ∫_{supp(Xn)} fX1,...,Xn(y1, . . . , yi−1, xi, yi+1, . . . , yn) dyn . . . dyi+1 dyi−1 . . . dy1.
Exercise in class:
Consider again the bivariate p.m.f.

pX1,X2(x1, x2) = 1/8 if (x1, x2) = (0, −1); 1/4 if (x1, x2) = (0, 0); 1/8 if (x1, x2) = (0, 1); 1/4 if (x1, x2) = (2, −1); 1/4 if (x1, x2) = (2, 0); and 0 otherwise.
Derive the marginal p.m.f. of X2 .
Notice first that supp(X2) = {−1, 0, 1}. We have

pX2(−1) = pX1,X2(0, −1) + pX1,X2(2, −1) = 1/8 + 1/4 = 3/8,
pX2(0) = pX1,X2(0, 0) + pX1,X2(2, 0) = 1/4 + 1/4 = 1/2,
pX2(1) = pX1,X2(0, 1) = 1/8.

Thus, pX2(x2) = 3/8 if x2 = −1; 1/2 if x2 = 0; 1/8 if x2 = 1; and 0 if x2 ∉ {−1, 0, 1}.
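Computationally, marginalizing a discrete joint p.m.f. is just a sum over the other variable; here is a minimal sketch for the table above (using a plain Python dictionary of our own making):

    joint = {(0, -1): 1/8, (0, 0): 1/4, (0, 1): 1/8,
             (2, -1): 1/4, (2, 0): 1/4}

    p_X2 = {}
    for (x1, x2), p in joint.items():
        # sum the joint p.m.f. over x1 for each value of x2
        p_X2[x2] = p_X2.get(x2, 0) + p

    print(p_X2)   # {-1: 0.375, 0: 0.5, 1: 0.125}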
Exercise in class:
Consider again the bivariate p.d.f.
fX1,X2(x1, x2) = e^{−(x1+x2)} 1_{[0,∞)×[0,∞)}(x1, x2).
Derive the marginal p.d.f. of X1 .
Notice first that supp(X1 ) = [0, ∞). For x1 ∈ [0, ∞) we have
fX1(x1) = ∫_R fX1,X2(x1, x2) dx2 = e^{−x1} ∫_0^∞ e^{−x2} dx2 = e^{−x1}.

Thus,

fX1(x1) = e^{−x1} 1_{[0,∞)}(x1).
Conditional Distributions
We will limit our discussion here at the bivariate case, but all that follows
is easily extended to more than two random variables.
Suppose that we are given a pair of random variables X1 , X2 and we
want to compute probabilities of the type
P (X1 ∈ A1 |X2 = x2 )
for a particular fixed value x2 of X2 . To do this, we need either the conditional p.m.f. (if X1 is discrete) or the conditional p.d.f. (if X1 is continuous)
of X1 given X2 = x2 . By definition, we have
pX1|X2=x2(x1) = pX1,X2(x1, x2)/pX2(x2),
fX1|X2=x2(x1) = fX1,X2(x1, x2)/fX2(x2).
Notice the two following important facts:

1. pX1|X2=x2 and fX1|X2=x2 are not well-defined if x2 is such that pX2(x2) = 0 or fX2(x2) = 0, respectively (i.e. x2 ∉ supp(X2))

2. given x2 ∈ supp(X2), pX1|X2=x2 and fX1|X2=x2 are null whenever pX1,X2(x1, x2) = 0 and fX1,X2(x1, x2) = 0, respectively.
So whenever you are computing a conditional distribution, it is good practice to 1) determine the support of the conditioning variable X2 and clearly
state that the conditional distribution that you are about to compute is only
well-defined for x2 ∈ supp(X2 ) and 2) given x2 ∈ supp(X2 ), clarify for which
values of x1 ∈ R the conditional distribution pX1|X2=x2 or fX1|X2=x2 is null.
Exercise in class:
Consider again the bivariate p.m.f.

pX1,X2(x1, x2) = 1/8 if (x1, x2) = (0, −1); 1/4 if (x1, x2) = (0, 0); 1/8 if (x1, x2) = (0, 1); 1/4 if (x1, x2) = (2, −1); 1/4 if (x1, x2) = (2, 0); and 0 otherwise.
Derive the conditional distribution of X1 given X2 .
First of all notice that the conditional p.m.f. of X1 given X2 = x2 is
only well-defined for x2 ∈ supp(X2 ) = {−1, 0, 1}. Then we have
pX1|X2=−1(x1) = pX1,X2(x1, −1)/pX2(−1) = (1/8)/(3/8) = 1/3 if x1 = 0; (1/4)/(3/8) = 2/3 if x1 = 2; and 0 if x1 ∉ {0, 2}.

pX1|X2=0(x1) = pX1,X2(x1, 0)/pX2(0) = (1/4)/(1/2) = 1/2 if x1 = 0; (1/4)/(1/2) = 1/2 if x1 = 2; and 0 if x1 ∉ {0, 2}.

pX1|X2=1(x1) = pX1,X2(x1, 1)/pX2(1) = (1/8)/(1/8) = 1 if x1 = 0; and 0 if x1 ≠ 0.
Exercise in class:
Consider again the bivariate p.d.f.
fX1,X2(x1, x2) = e^{−(x1+x2)} 1_{[0,∞)×[0,∞)}(x1, x2).
Derive the conditional p.d.f. of X2 given X1 .
First of all notice that the conditional p.d.f. of X2 given X1 = x1 is only well-defined for x1 ∈ supp(X1) = [0, ∞). For x1 ∈ [0, ∞), we have
fX2|X1=x1(x2) = fX1,X2(x1, x2)/fX1(x1) = e^{−(x1+x2)} 1_{[0,∞)×[0,∞)}(x1, x2)/(e^{−x1} 1_{[0,∞)}(x1)) = e^{−(x1+x2)} 1_{[0,∞)}(x2)/e^{−x1} = e^{−x2} 1_{[0,∞)}(x2).
Question: what family of distributions do X1 and X2 belong to?
What are the parameter values?
Question: can we say something about whether or not X1 and X2
are independent in this case?
A Test for the Independence of Two Random Variables
Suppose that you are given (X1 , X2 ) ∼ fX1 ,X2 . If
1. the support of fX1 ,X2 is a ‘rectangular’ region and
2. fX1 ,X2 (x1 , x2 ) = fX1 (x1 )fX2 (x2 ) for all x1 , x2 ∈ R
then X1 and X2 are independent. The same is true for the case where
(X1 , X2 ) is discrete, after obvious changes to 1). Notice that 1) can be conveniently captured by appropriately using indicator functions.
Exercise in class:
Consider the following bivariate p.d.f.:
fX1 ,X2 (x1 , x2 ) = 6(1 − x2 )1{0≤x1 ≤x2 ≤1} (x1 , x2 ).
Are X1 and X2 independent?
A Note on Independence and Normalizing Constants
Frequently, the p.d.f. (and the same holds true for a p.m.f.) of a pair of
random variables takes the form
fX1,X2(x1, x2) = c g(x1) h(x2)   (39)
where c > 0 is a constant, g is a function depending only on x1 and h is a
function depending only on x2 . If this is the case, the two random variables
X1 and X2 are independent and g and h are the kernels of the p.d.f. of
X1 and X2 respectively. Thus, by simply inspecting g and h we can guess
to which family of distributions X1 and X2 belong to. We do not need to
worry about the constant c, as we know that if fX1 ,X2 is a bona fide p.d.f.,
then c = c1 c2 is exactly equal to the product of the normalizing constants
c1 and c2 that satisfy
c1 ∫_R g(x) dx = c2 ∫_R h(x) dx = 1.   (40)
This easily generalizes to a collection of n > 2 random variables.
Lecture 9
Recommended readings: WMS, sections 5.5 → 5.8
Expected Value in the Context of Bivariate Distributions (part
1)
In Lecture 5, we introduced the concept of expectation or expected value of
a random variable. We will now introduce other operators that are based on
the expected value operator and we will investigate how they act on bivariate
distributions. While we focus on bivariate distributions, it is worthwhile to
keep in mind that all that we discuss in this Lecture can be extended to
collections of random variables X1 , . . . , Xn with n > 2.
For a function g : R^2 → R, (x1, x2) ↦ g(x1, x2), and a pair of random variables (X1, X2) with joint p.m.f. pX1,X2 (if discrete) or joint p.d.f. fX1,X2 (if continuous), the expected value of g(X1, X2) is defined as

E(g(X1, X2)) = ∑_{x1 ∈ supp(X1)} ∑_{x2 ∈ supp(X2)} g(x1, x2) pX1,X2(x1, x2) (discrete case),

E(g(X1, X2)) = ∫_R ∫_R g(x1, x2) fX1,X2(x1, x2) dx1 dx2 (continuous case).
Expectation, Variance and Independence
In Lecture 5, we pointed out that in general, given two random variables X1
and X2 , it is not true that E(X1 X2 ) = E(X1 )E(X2 ). We said, however, that
it is true as soon as X1 and X2 are independent. Let’s see why. Without loss
of generality, assume that X1 and X2 are continuous (for the discrete case,
just change integration into summation as usual). If they are independent,
fX1 ,X2 = fX1 fX2 . Take g(x1 , x2 ) = x1 x2 . Then, we have
E(X1 X2) = ∫_R ∫_R x1 x2 fX1,X2(x1, x2) dx1 dx2 = ∫_R ∫_R x1 x2 fX1(x1) fX2(x2) dx1 dx2
= (∫_R x1 fX1(x1) dx1)(∫_R x2 fX2(x2) dx2) = E(X1) E(X2).
The above argument is easily extended to show that for any two functions g1, g2 : R → R, if X1 and X2 are independent, then E(g1(X1) g2(X2)) = E(g1(X1)) E(g2(X2)).
Exercise in class:
Consider the following bivariate p.d.f.:
fX1,X2(x1, x2) = (3/2) e^{−3x1} 1_{[0,∞)×[0,2]}(x1, x2).
Are X1 and X2 independent? Why? What are their marginal distributions? Compute E(3X1X2 + 3).
Exercise in class:
Consider the pair of random variables X1 and X2 with joint p.d.f.
fX1,X2(x1, x2) = (1/8) x1 e^{−(x1+x2)/2} 1_{[0,∞)×[0,∞)}(x1, x2)
and consider the function g(x, y) = y/x. What is E(g(X1 , X2 ))?
First of all, notice that X1 and X2 are independent. Therefore, we know
already that E(g(X1 , X2 )) = E(X2 /X1 ) = E(X2 )E(1/X1 ). Moreover, the
kernel of the p.d.f. of X1 is

x1 e^{−x1/2} 1_{[0,∞)}(x1),

while that of the p.d.f. of X2 is

e^{−x2/2} 1_{[0,∞)}(x2).
Thus, X1 and X2 are distributed according to Gamma(2, 2) and Exponential(2)
respectively. It follows that E(X2 ) = β = 2. We only need to compute
E(1/X1 ). This is equal to
E(1/X1) = ∫_0^∞ (1/x1) fX1(x1) dx1 = ∫_0^∞ (1/x1) (1/4) x1 e^{−x1/2} dx1 = (1/4) ∫_0^∞ e^{−x1/2} dx1 = 2/4 = 1/2.
It follows that E(X2 /X1 ) = E(X2 )E(1/X1 ) = 2(1/2) = 1.
Back to Lecture 5 again. There we mentioned that for two random
variables X and Y, V(X + Y) ≠ V(X) + V(Y) usually, but that the result is true if X and Y are independent. Now, we can easily see that if X and Y are
independent and we take the functions g1 = g2 = g with g(X) = X − E(X),
by our previous results we have
V (X + Y ) = E[(X + Y − (E(X) + E(Y )))2 ] = E[((X − E(X)) + (Y − E(Y )))2 ]
= E[(X − E(X))2 + (Y − E(Y ))2 + 2(X − E(X))(Y − E(Y ))]
= E[(X − E(X))2 ] + E[(Y − E(Y ))2 ] + 2E(X − E(X))E(Y − E(Y ))
= V (X) + V (Y ) + 2 ∗ 0 ∗ 0 = V (X) + V (Y ).
(41)
Notice that we exploited the independence of X and Y together with the results above to deal with the expectation of the cross-product 2(X−E(X))(Y −
E(Y )) with g1 = g2 = g.
Question: what about V (X − Y )?
Covariance
The covariance of a pair of random variables X and Y is a measure of their
linear dependence. By definition, the covariance between X and Y is
Cov(X, Y ) = E[(X − E(X))(Y − E(Y ))] = E(XY ) − E(X)E(Y ).
(42)
It is easy to check that the covariance operator satisfies, for a, b, c, d ∈ R
Cov(a + bX, c + dY ) = bdCov(X, Y ).
Also, notice that Cov(X, Y ) = Cov(Y, X) and Cov(X, X) = V (X).
GRAPH
Question: suppose that X and Y are independent. What is Cov(X, Y )?
There exists a scaled version of the covariance which takes values in
[−1, 1]. This is the correlation between X and Y . The correlation between
X and Y is defined as
Cor(X, Y) = Cov(X, Y)/√(V(X)V(Y)).   (43)
It is easy to check that for a, b, c, d ∈ R with bd > 0, we have Cor(a + bX, c + dY) = Cor(X, Y) (while Cor(a + bX, c + dY) = −Cor(X, Y) if bd < 0). So, up to a sign, and unlike the covariance operator, the correlation operator is not affected by affine transformations of X and Y.
Beware that while we saw that X and Y are independent =⇒ Cov(X, Y ) =
0, the converse is not true (i.e. independence is stronger than uncorrelation). Consider the following example. Let X be a random variable with
E(X) = E(X 3 ) = 0. Consider Y = X 2 . Clearly, X and Y are not independent (in fact, Y is a deterministic function of X!). However,
Cov(X, Y ) = E(XY ) − E(X)E(Y ) = E(X ∗ X 2 ) − 0 = E(X 3 ) = 0.
So X and Y are uncorrelated, because Cor(X, Y ) = 0, but they are not
independent!
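A quick simulation makes this concrete: Z ∼ N(0, 1) satisfies E(Z) = E(Z^3) = 0, so Z and Z^2 fit the example above (a sketch using only the standard library):

    import random

    random.seed(0)
    z = [random.gauss(0, 1) for _ in range(100_000)]
    y = [zi**2 for zi in z]
    mz, my = sum(z)/len(z), sum(y)/len(y)
    # sample covariance of Z and Y = Z**2
    cov = sum((zi - mz)*(yi - my) for zi, yi in zip(z, y)) / len(z)
    print(cov)   # close to 0, even though Y is a function of Z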
Exercise in class:
Look back at equation (41). Without assuming that X and Y are independent, but rather only making the weaker assumption that X and Y are
uncorrelated (i.e. Cov(X, Y ) = 0), show that
V (X + Y ) = V (X − Y ) = V (X) + V (Y ).
From the discussion above it is clear that in general
V (X + Y ) = V (X) + V (Y ) + 2Cov(X, Y )
V (X − Y ) = V (X) + V (Y ) − 2Cov(X, Y ).
This generalizes as follows. Given a1, . . . , an ∈ R and X1, . . . , Xn,

V(∑_{i=1}^{n} ai Xi) = ∑_{i=1}^{n} ai^2 V(Xi) + ∑_{i=1}^{n} ∑_{j≠i} ai aj Cov(Xi, Xj) = ∑_{i=1}^{n} ai^2 V(Xi) + ∑_{1≤i<j≤n} 2 ai aj Cov(Xi, Xj).

If the random variables X1, . . . , Xn are all pairwise uncorrelated, then obviously the second summand above is null.
Exercise in class:
You are given three random variables X, Y, Z with V (X) = 1, V (Y ) = 4,
V (Z) = 3, Cov(X, Y ) = 0, Cov(X, Z) = 1, and Cov(Y, Z) = 1. Compute
V (3X + Y − 2Z).
We have
V (3X + Y − 2Z) = V (3X) + V (Y ) + V (−2Z) + 2Cov(3X, Y ) + 2Cov(3X, −2Z) + 2Cov(Y, −2Z)
= 9V (X) + V (Y ) + 4V (Z) + 6Cov(X, Y ) − 12Cov(X, Z) − 4Cov(Y, Z)
= 9 + 4 + 12 − 12 − 4 = 9.
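The general formula above translates directly into code; the sketch below (function name ours) reproduces the result of the exercise:

    def var_lincomb(a, V, C):
        # V(sum_i a_i X_i) with V[i] = V(X_i) and C[(i, j)] = Cov(X_i, X_j)
        n = len(a)
        return sum(a[i] * a[j] * (V[i] if i == j else C[(min(i, j), max(i, j))])
                   for i in range(n) for j in range(n))

    V = [1, 4, 3]                          # V(X), V(Y), V(Z)
    C = {(0, 1): 0, (0, 2): 1, (1, 2): 1}  # the given covariances
    print(var_lincomb([3, 1, -2], V, C))   # 9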
Homework 5 (due June 2nd)
1. (14 pts) Consider the following function:
f(x1, x2) = 6 x1^2 x2 if 0 ≤ x1 ≤ x2 ∧ x1 + x2 ≤ 2, and f(x1, x2) = 0 otherwise.
• Verify that f is a valid bivariate p.d.f.
• Suppose (X1 , X2 ) ∼ f . Are X1 and X2 independent random
variables?
• Compute P(X1 + X2 < 1)
• Compute the marginal p.d.f. of X1 . What family does this distribution belong to? What are the values of its parameters?
• Compute the marginal p.d.f. of X2
• Compute the conditional p.d.f. of X2 given X1
• Compute P (X2 < 1.1|X1 = 0.6).
Solution:
• It is clear that fX1,X2(x1, x2) ≥ 0 for any x1, x2 ∈ R. We just need to check that fX1,X2 integrates to 1:
∫_R ∫_R fX1,X2(x1, x2) dx1 dx2 = ∫_0^1 ∫_{x1}^{2−x1} 6 x1^2 x2 dx2 dx1 = 6 ∫_0^1 x1^2 ∫_{x1}^{2−x1} x2 dx2 dx1
= 3 ∫_0^1 x1^2 [x2^2]_{x1}^{2−x1} dx1 = 3 ∫_0^1 x1^2 [(2 − x1)^2 − x1^2] dx1 = 12 ∫_0^1 x1^2 (1 − x1) dx1
= 12 Γ(3)Γ(2)/Γ(5) = 12 · 2/4! = 24/24 = 1.
• X1 and X2 are not independent since the support of fX1 ,X2 (the
union of the green region and the red region in the figure) is not
a rectangular region.
• The region corresponding to the event X1 + X2 < 1 is the red
region in the figure. We have
P(X1 + X2 < 1) = ∫_0^{1/2} ∫_{x1}^{1−x1} 6 x1^2 x2 dx2 dx1 = 6 ∫_0^{1/2} x1^2 ∫_{x1}^{1−x1} x2 dx2 dx1
= 3 ∫_0^{1/2} x1^2 [(1 − x1)^2 − x1^2] dx1 = 3 ∫_0^{1/2} x1^2 (1 − 2x1) dx1
= [x1^3 − (3/2) x1^4]_0^{1/2} = 1/8 − 3/32 = 1/32.
• Notice that supp(X1 ) = [0, 1]. For x1 ∈ [0, 1] we thus have
fX1(x1) = ∫_{x1}^{2−x1} 6 x1^2 x2 dx2 = 3 x1^2 [x2^2]_{x1}^{2−x1} = 3 x1^2 [(2 − x1)^2 − x1^2] = 12 x1^2 (1 − x1),

which we recognize as the p.d.f. of a Beta(3, 2) distribution.
• Notice that supp(X2 ) = [0, 2]. For 0 ≤ x2 < 1 we have
fX2(x2) = ∫_0^{x2} 6 x1^2 x2 dx1 = 2 x2 [x1^3]_0^{x2} = 2 x2^4.
For 1 ≤ x2 ≤ 2 we have
fX2(x2) = ∫_0^{2−x2} 6 x1^2 x2 dx1 = 2 x2 [x1^3]_0^{2−x2} = 2 x2 (2 − x2)^3.
• The conditional p.d.f. of X2 given X1 = x1 is only well-defined
for x1 ∈ supp(X1 ) = [0, 1]. For x1 ∈ (0, 1),
fX2|X1=x1(x2) = fX1,X2(x1, x2)/fX1(x1) = 6 x1^2 x2 1_{[x1,2−x1]}(x2)/(12 x1^2 (1 − x1)) = (x2/(2(1 − x1))) 1_{[x1,2−x1]}(x2).
• We have
P(X2 < 1.1 | X1 = 0.6) = ∫_{−∞}^{1.1} fX2|X1=0.6(x2) dx2 = ∫_{−∞}^{1.1} (x2/(2(1 − 0.6))) 1_{[0.6,1.4]}(x2) dx2
= (1/1.6) [x2^2]_{0.6}^{1.1} = (1.21 − 0.36)/1.6 = 0.53125.
2. (12 pts) Consider the following bivariate p.m.f.:

pX1,X2(x1, x2) = 0.38 if (x1, x2) = (0, 0); 0.17 if (x1, x2) = (1, 0); 0.14 if (x1, x2) = (0, 1); 0.02 if (x1, x2) = (1, 1); 0.24 if (x1, x2) = (0, 2); and 0.05 if (x1, x2) = (1, 2).
• Compute the marginal p.m.f. of X1 and the marginal p.m.f. of
X2
• Compute pX2 |X1 =0 , the conditional p.m.f. of X2 given X1 = 0
• Compute P (X1 = 0|X2 = 2)?
Solution:
• Notice first that supp(X1) = {0, 1}. Thus, we have

pX1(0) = pX1,X2(0, 0) + pX1,X2(0, 1) + pX1,X2(0, 2) = 0.38 + 0.14 + 0.24 = 0.76,
pX1(1) = pX1,X2(1, 0) + pX1,X2(1, 1) + pX1,X2(1, 2) = 0.17 + 0.02 + 0.05 = 0.24,

and pX1(x1) = 0 if x1 ∉ {0, 1}.
Now notice that supp(X2) = {0, 1, 2}. We have

pX2(0) = pX1,X2(0, 0) + pX1,X2(1, 0) = 0.38 + 0.17 = 0.55,
pX2(1) = pX1,X2(0, 1) + pX1,X2(1, 1) = 0.14 + 0.02 = 0.16,
pX2(2) = pX1,X2(0, 2) + pX1,X2(1, 2) = 0.24 + 0.05 = 0.29,

and pX2(x2) = 0 if x2 ∉ {0, 1, 2}.

• We have

pX2|X1=0(x2) = pX1,X2(0, x2)/pX1(0) = 0.38/0.76 = 0.5 if x2 = 0; 0.14/0.76 ≈ 0.184 if x2 = 1; 0.24/0.76 ≈ 0.316 if x2 = 2; and 0 if x2 ∉ {0, 1, 2}.
• We have
P(X1 = 0 | X2 = 2) = P(X1 = 0 ∩ X2 = 2)/P(X2 = 2) = pX1,X2(0, 2)/pX2(2) = 0.24/0.29 ≈ 0.828.
3. (8 pts) Let X1 ∼ Uniform(0, 1) and, for 0 < x1 ≤ 1,
fX2|X1=x1(x2) = (1/x1) 1_{(0,x1]}(x2).
• What family does fX2 |X1 =x1 belong to? What are the values of
its parameters?
• Compute the joint p.d.f. of (X1 , X2 )
• Compute the marginal p.d.f. of X2 .
Solution:
• It is clear that fX2 |X1 =x1 for 0 < x1 ≤ 1 is a Uniform p.d.f. over
the interval (0, x1 ].
• We have

fX1,X2(x1, x2) = fX1(x1) fX2|X1=x1(x2) = 1_{(0,1]}(x1) (1/x1) 1_{(0,x1]}(x2) = (1/x1) 1_{{0<x2≤x1≤1}}(x1, x2).
• We have, for x2 ∈ (0, 1],

fX2(x2) = ∫_{x2}^1 (1/x1) dx1 = [log(x1)]_{x2}^1 = −log(x2),

and fX2(x2) = 0 for x2 ∉ (0, 1].
4. (10 pts) Let X1 ∼ Binomial(n1 , p) and X2 ∼ Binomial(n2 , p). Assume
that X1 and X2 are independent. Consider W = X1 + X2 . One
can show that W ∼ Binomial(n1 + n2 , p). Use this result to show
that pX1 |W =w is a Hypergeometric p.m.f. with parameters r = n1 ,
N = n1 + n2 , and n = w.
Solution:
We have, for x1 ∈ {0, 1, . . . , w},

pX1|W=w(x1) = P(X1 = x1 | W = w) = P(X1 = x1 ∩ W = w)/P(W = w) = P(X1 = x1 ∩ X2 = w − x1)/P(W = w)
= P(X1 = x1) P(X2 = w − x1)/P(W = w)
= [C(n1, x1) p^{x1} (1 − p)^{n1−x1} · C(n2, w − x1) p^{w−x1} (1 − p)^{n2−w+x1}]/[C(n1 + n2, w) p^w (1 − p)^{n1+n2−w}]
= C(n1, x1) C(n2, w − x1)/C(n1 + n2, w),

and clearly pX1|W=w(x1) = 0 if x1 ∉ {0, 1, . . . , w}.
5. (10 pts) Suppose that you have a coin such that, if tossed, head appears
with probability p ∈ [0, 1] and tail occurs with probability 1−p. Person
A tosses the coin. The number of tosses until the first head appears for
person A is X1 . Then person B tosses the coin. The number of tosses
until the first head appears for person B is X2 . Assume that X1 and
X2 are independent. What is the probability of the event X1 = X2 ?
Solution:
Obviously, X1 ∼ Geometric(p) and X2 ∼ Geometric(p). Because X1 and X2 are independent, we have

pX1,X2(x1, x2) = pX1(x1) pX2(x2) = (1 − p)^{x1−1} p 1_{{1,2,...}}(x1) · (1 − p)^{x2−1} p 1_{{1,2,...}}(x2) = (1 − p)^{x1+x2−2} p^2 1_{{1,2,...}}(x1) 1_{{1,2,...}}(x2).
Then,

P(X1 = X2) = ∑_{x ∈ {1,2,...}} pX1,X2(x, x) = p^2 ∑_{x ∈ {1,2,...}} (1 − p)^{2x−2} = p^2 ∑_{x ∈ {1,2,...}} [(1 − p)^2]^{x−1} = p^2/(1 − (1 − p)^2) = p/(2 − p).
6. (8 pts) Let (X1 , X2 ) ∼ fX1 ,X2 with
fX1,X2(x1, x2) = (1/x1) 1_{{0<x2≤x1≤1}}(x1, x2).
Compute E(X1 − X2 ).
Solution:
We have E(X1 − X2) = E(X1) − E(X2). First we compute the marginal p.d.f.’s. They are

fX1(x1) = ∫_R fX1,X2(x1, x2) dx2 = ∫_R (1/x1) 1_{{0<x2≤x1≤1}}(x1, x2) dx2 = (1/x1) 1_{(0,1]}(x1) ∫_0^{x1} 1_{(0,x1]}(x2) dx2 = 1_{(0,1]}(x1)

and

fX2(x2) = ∫_R fX1,X2(x1, x2) dx1 = ∫_R (1/x1) 1_{{0<x2≤x1≤1}}(x1, x2) dx1 = 1_{(0,1]}(x2) ∫_{x2}^1 (1/x1) dx1 = −log(x2) 1_{(0,1]}(x2).

Notice that fX1 is the p.d.f. of Uniform(0, 1), therefore E(X1) = 1/2. On the other hand, integrating by parts,

E(X2) = −∫_0^1 x2 log(x2) dx2 = −([x2^2/2 · log(x2)]_0^1 − ∫_0^1 (x2/2) dx2) = [x2^2/4]_0^1 = 1/4.
Therefore E(X1 − X2 ) = E(X1 ) − E(X2 ) = 1/2 − 1/4 = 1/4.
7. (10 pts) Consider
fX1 ,X2 (x1 , x2 ) = 6(1 − x2 )1{0≤x1 ≤x2 ≤1} (x1 , x2 ).
• Are X1 and X2 independent?
• Are X1 and X2 correlated (positively or negatively) or uncorrelated?
Solution:
• X1 and X2 are not independent. We can see this in two ways:
1) we draw the support of their joint p.d.f. (which is the set
indicated by the indicator function; the red triangle in the figure)
and we easily see that it does not correspond to a rectangular
region; 2) we rewrite the indicator function as
1{0≤x1 ≤x2 ≤1} (x1 , x2 ) = 1[0,x2 ] (x1 )1[x1 ,1] (x2 )
and then we easily see that there is no way to factorize the joint
p.d.f. into two functions g and h such that g only depends on x1
and h only depends on x2 .
• To answer the second question, we need to compute Cov(X1 , X2 ).
For that we need E(X1 ), E(X2 ), and E(X1 X2 ). We have
E(X1 X2) = ∫_R ∫_R x1 x2 fX1,X2(x1, x2) dx1 dx2 = ∫_0^1 ∫_0^{x2} x1 x2 · 6(1 − x2) dx1 dx2
= 6 ∫_0^1 x2 (1 − x2) ∫_0^{x2} x1 dx1 dx2 = 3 ∫_0^1 x2 (1 − x2) [x1^2]_0^{x2} dx2
= 3 ∫_0^1 x2^3 (1 − x2) dx2 = 3 Γ(4)Γ(2)/Γ(6) = 3 · 3!/5! = 18/120 = 3/20.
Then,
E(X1) = ∫_0^1 x1 ∫_{x1}^1 6(1 − x2) dx2 dx1 = 6 ∫_0^1 x1 [x2 − x2^2/2]_{x1}^1 dx1
= 6 ∫_0^1 (x1/2 − x1^2 + x1^3/2) dx1 = 6 (1/4 − 1/3 + 1/8) = 1/4
and
E(X2) = ∫_0^1 6 x2 (1 − x2) ∫_0^{x2} dx1 dx2 = 6 ∫_0^1 x2^2 (1 − x2) dx2 = 6 Γ(3)Γ(2)/Γ(5) = 6 · 2/4! = 12/24 = 1/2.

Thus, Cov(X1, X2) = E(X1 X2) − E(X1) E(X2) = 3/20 − (1/4)(1/2) = 3/20 − 1/8 = 1/40. Since the covariance is positive, X1 and X2 are positively correlated and therefore not independent.
8. (10 pts) Let X1 and X2 be uncorrelated random variables. Consider
Y1 = X1 + X2 and Y2 = X1 − X2 .
• Express Cov(Y1 , Y2 ) in terms of V (X1 ) and V (X2 )
• Give an expression for Cor(Y1 , Y2 )
• Is it possible that Cor(Y1 , Y2 ) = 0? If so, when does this happen?
Solution:
• We have
Cov(Y1 , Y2 ) = Cov(X1 + X2 , X1 − X2 )
= Cov(X1 , X1 ) − Cov(X1 , X2 ) + Cov(X2 , X1 ) − Cov(X2 , X2 )
= V (X1 ) − V (X2 ).
• By definition
Cor(Y1, Y2) = Cov(Y1, Y2)/√(V(Y1)V(Y2)).
We have
V(Y1) = V(X1 + X2) = V(X1) + V(X2) + 2Cov(X1, X2) = V(X1) + V(X2) + 0 = V(X1) + V(X2)

and

V(Y2) = V(X1 − X2) = V(X1) + V(X2) − 2Cov(X1, X2) = V(X1) + V(X2) − 0 = V(X1) + V(X2),

thus

Cor(Y1, Y2) = Cov(Y1, Y2)/√(V(Y1)V(Y2)) = (V(X1) − V(X2))/(V(X1) + V(X2)).
• If V (X1 ) > V (X2 ), then Y1 and Y2 are positively correlated.
If V (X1 ) < V (X2 ), then Y1 and Y2 are negatively correlated.
Finally, If V (X1 ) = V (X2 ), then Y1 and Y2 are uncorrelated.
9. (10 pts) Let
fX1(x1) = (1/6) x1^3 e^{−x1} 1_{[0,∞)}(x1)

and

fX2(x2) = (1/2) e^{−x2/2} 1_{[0,∞)}(x2).
Let Y = X1 − X2 .
• Compute E(Y )
• Let Cov(X1 , X2 ) = γ. Compute V (Y )
• Suppose that X1 and X2 are uncorrelated. What is V (Y ) in this
case?
Solution:
• It is clear from the inspection of the two p.d.f.’s that X1 ∼
Gamma(4, 1) and X2 ∼ Exponential(2). Thus, E(Y ) = E(X1 −
X2 ) = E(X1 ) − E(X2 ) = 4 − 2 = 2.
• V (Y ) = V (X1 − X2 ) = V (X1 ) + V (X2 ) − 2Cov(X1 , X2 ) = 4 +
4 − 2γ = 2(4 − γ).
• If X1 and X2 are uncorrelated, then γ = 0 and V (Y ) = 8.
10. (8 pts) Use the definition of variance to show that V (X ±Y ) = V (X)+
V (Y ) ± 2Cov(X, Y ).
Solution: We have
V(X ± Y) = E[(X ± Y)^2] − [E(X ± Y)]^2
= E(X^2 + Y^2 ± 2XY) − [E(X)]^2 − [E(Y)]^2 ∓ 2E(X)E(Y)
= E(X^2) − [E(X)]^2 + E(Y^2) − [E(Y)]^2 ± 2E(XY) ∓ 2E(X)E(Y)
= V(X) + V(Y) ± 2[E(XY) − E(X)E(Y)] = V(X) + V(Y) ± 2Cov(X, Y).
11. (EXTRA CREDIT, 12 pts) Let Z ∼ N(0, 1). Show that for any odd n ≥ 1, E(Z^n) = 0.
Solution: Let f denote the p.d.f. associated to N(0, 1). Notice that f is an even function, i.e. f(z) = f(−z) for any z ∈ R. We have

E(Z^n) = ∫_R z^n f(z) dz = ∫_{−∞}^0 z^n f(z) dz + ∫_0^{+∞} z^n f(z) dz
= ∫_{−∞}^0 z^n f(−z) dz + ∫_0^{+∞} z^n f(z) dz
= ∫_{−∞}^0 −((−z)^n) f(−z) dz + ∫_0^{+∞} z^n f(z) dz,

using that z^n = −(−z)^n for odd n. Make the change of variable x = −z in the first integral. We get

∫_{−∞}^0 −((−z)^n) f(−z) dz + ∫_0^{+∞} z^n f(z) dz = −∫_0^{+∞} x^n f(x) dx + ∫_0^{+∞} z^n f(z) dz = 0.
12. (EXTRA CREDIT, 15 pts) Show that for two random variables X
and Y , Cor(X, Y ) ∈ [−1, 1].
Hint: consider X′ = X − E(X) and Y′ = Y − E(Y) and study the quadratic function h(t, X′, Y′) = (tX′ − Y′)^2 as a function of t. Notice that h(t, X′, Y′) ≥ 0, therefore E(h(t, X′, Y′)) ≥ 0, which implies that the discriminant of E(h(t, X′, Y′)) as a quadratic function of t must be non-positive! Try to use this fact to prove the claim that Cor(X, Y) ∈ [−1, 1].
Solution: We have E(h(t, X′, Y′)) ≥ 0. This means

E[(tX′ − Y′)^2] = E(t^2 X′^2 + Y′^2 − 2t X′Y′) = t^2 E(X′^2) − 2t E(X′Y′) + E(Y′^2) ≥ 0.

This can only happen if the discriminant is non-positive, i.e. if

4[E(X′Y′)]^2 − 4 E(X′^2) E(Y′^2) ≤ 0,

i.e.

[E(X′Y′)]^2 ≤ E(X′^2) E(Y′^2),

or, equivalently,

[E(X′Y′)]^2/(E(X′^2) E(Y′^2)) ≤ 1.

Notice that

[E(X′Y′)]^2/(E(X′^2) E(Y′^2)) = [E((X − E(X))(Y − E(Y)))]^2/(E((X − E(X))^2) E((Y − E(Y))^2)) = [Cov(X, Y)]^2/(V(X)V(Y)) = [Cor(X, Y)]^2.

Thus, it follows that |Cor(X, Y)| ≤ 1 or, equivalently, −1 ≤ Cor(X, Y) ≤ 1.
Lecture 10
Recommended readings: WMS, section 5.11
Expected Value in the Context of Bivariate Distributions (part 2)
Using conditional distributions, we can define the notion of conditional expectation, conditional variance, and conditional covariance.
Conditional Expectation
Given a conditional p.m.f. pX1|X2=x2 or a conditional p.d.f. fX1|X2=x2, the
conditional expectation of X1 given X2 = x2 is defined as
E(X1 | X2 = x2) = Σ_{x1∈supp(X1)} x1 pX1|X2=x2(x1)    (44)
in the discrete case and as
E(X1 | X2 = x2) = ∫_{x1∈supp(X1)} x1 fX1|X2=x2(x1) dx1    (45)
in the continuous case.
It is clear from the definition that the conditional expectation is a function of the value of X2 . Since X2 is a random variable, this means that
the conditional expectation (despite its name) is itself a random variable!
This is a fundamental difference with respect to the standard concept of
expectation for a random variable (for instance E(X1 )) which is just a constant depending on the distribution of X1 . In terms of notation, we often
denote the random variable corresponding to the conditional expectation of
X1 given X2 simply as E(X1 |X2 ).
It is easy to see from the definition of conditional expectation that
E(E(X1 |X2 )) = E(X1 ), where the outer expectation is taken with respect
to the probability distribution of X2 .
Note that if X1 and X2 are independent, then E(X1 |X2 ) = E(X1 ).
Conditional Variance
We can also define the conditional variance of X1 given X2 . We have
V(X1 | X2) = E[(X1 − E(X1|X2))² | X2] = E(X1² | X2) − [E(X1|X2)]².    (46)
Obtaining the unconditional variance from the conditional variance is a little
‘harder’ than obtaining the unconditional expectation from the conditional
expectation:
V (X1 ) = E[V (X1 |X2 )] + V [E(X1 |X2 )].
Note that if X1 and X2 are independent, then V (X1 |X2 ) = V (X1 ).
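Both identities above are easy to check numerically. Below is a minimal Monte Carlo sketch (our own addition, not part of the original notes); the toy model X2 ∼ Uniform(0, 1) with X1 | X2 = x2 ∼ N(x2, 1) is an arbitrary choice, made so that E(X1|X2) = X2 and V(X1|X2) = 1 are known in closed form.

```python
# A Monte Carlo sketch (assumed toy model, not from the notes) illustrating
# E(E(X1|X2)) = E(X1) and V(X1) = E[V(X1|X2)] + V[E(X1|X2)].
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
x2 = rng.uniform(0.0, 1.0, size=n)      # X2 ~ Uniform(0, 1)
x1 = rng.normal(loc=x2, scale=1.0)      # X1 | X2 = x2 ~ N(x2, 1)

# Here E(X1|X2) = X2 and V(X1|X2) = 1, so the two formulas predict:
print(x1.mean(), x2.mean())             # both ≈ E(X1) = 1/2
print(x1.var(), 1.0 + x2.var())         # both ≈ V(X1) = 1 + 1/12
```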
Conditional Covariance
It is worthwhile mentioning that we can also define the conditional covariance between X1 and X2 given a third random variable X3. We have
Cov(X1 , X2 |X3 ) = E[(X1 − E(X1 |X3 ))(X2 − E(X2 |X3 ))|X3 ]
= E(X1 X2 |X3 ) − E(X1 |X3 )E(X2 |X3 ).
(47)
To get the unconditional covariance we have the following formula:
Cov(X1 , X2 ) = E[Cov(X1 , X2 |X3 )] + Cov[E(X1 |X3 ), E(X2 |X3 )].
(48)
Note that if X1 and X3 are independent, and X2 and X3 are independent,
then Cov(X1 , X2 |X3 ) = Cov(X1 , X2 ).
In terms of computation, everything is essentially unchanged. The only
difference is that we sum or integrate against the conditional p.m.f./conditional
p.d.f. rather than the marginal p.m.f./marginal p.d.f..
Exercise in class:
Consider again the p.d.f. of exercise 7 in Homework 5:
fX1 ,X2 (x1 , x2 ) = 6(1 − x2 )1{0≤x1 ≤x2 ≤1} (x1 , x2 ).
What is E(X1 |X2 )? Compute E(X1 ).
We first need to compute fX1|X2=x2 for x2 ∈ supp(X2). For x2 ∈ [0, 1]
we have
fX2(x2) = ∫_0^{x2} 6(1 − x2) dx1 = 6x2(1 − x2) 1[0,1](x2).
Notice that from the marginal p.d.f. of X2 we can see that X2 ∼ Beta(2, 2).
For x2 ∈ (0, 1) we have
fX1|X2=x2(x1) = fX1,X2(x1, x2) / fX2(x2) = 6(1 − x2)1[0,x2](x1) / (6x2(1 − x2)) = (1/x2) 1[0,x2](x1).
Notice that fX1|X2=x2 is the p.d.f. of a Uniform(0, x2) distribution. It follows
that E(X1|X2 = x2) = x2/2. We can write this more concisely (and in a
way that stresses more the fact that the conditional expectation is a random
variable!) as E(X1|X2) = X2/2.
We have E(X1) = E[E(X1|X2)] = E(X2/2) = E(X2)/2 = [2/(2 + 2)]/2 = 1/4.
Exercise in class:
Let N ∼ Poisson(λ) and let Y1, . . . , YN iid ∼ Gamma(α, β) with N
independent of each of the Yi's. Let T = Σ_{i=1}^N Yi. Compute E(T|N), E(T),
V(T|N), and V(T).
We have that
E(T | N = n) = E(Σ_{i=1}^N Yi | N = n) = Σ_{i=1}^n E(Yi | N = n) = Σ_{i=1}^n E(Yi) = nαβ.
Thus, E(T|N) = Nαβ. Now,
E(T) = E[E(T|N)] = E(Nαβ) = αβE(N) = αβλ.
The conditional variance is
V(T | N = n) = V(Σ_{i=1}^N Yi | N = n) = V(Σ_{i=1}^n Yi | N = n) = Σ_{i=1}^n V(Yi | N = n) = Σ_{i=1}^n V(Yi) = nαβ².
Thus, V(T|N) = Nαβ². The unconditional variance of T is then
V(T) = V[E(T|N)] + E[V(T|N)] = V(Nαβ) + E(Nαβ²)
= α²β²V(N) + αβ²E(N)
= α²β²λ + αβ²λ
= αβ²λ(1 + α).
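A quick simulation sketch of this exercise (our own addition; the parameter values are arbitrary). Note that numpy's gamma sampler uses a shape/scale convention that matches the Gamma(α, β) parameterization of these notes (mean αβ, variance αβ²).

```python
# Check E(T) = αβλ and V(T) = αβ²λ(1 + α) for T = Σ_{i=1}^N Yi,
# with N ~ Poisson(λ) and Yi iid ~ Gamma(α, β) independent of N.
import numpy as np

rng = np.random.default_rng(1)
lam, alpha, beta = 3.0, 2.0, 1.5          # arbitrary illustrative values
reps = 50_000
N = rng.poisson(lam, size=reps)
T = np.array([rng.gamma(alpha, beta, size=k).sum() for k in N])

print(T.mean(), alpha * beta * lam)                   # ≈ αβλ
print(T.var(), alpha * beta**2 * lam * (1 + alpha))   # ≈ αβ²λ(1+α)
```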
Exercise in class:
Let Q ∼ Uniform(0, 1) and Y |Q ∼ Binomial(n, Q). Compute E(Y ) and
V (Y ).
We have
E(Y) = E[E(Y|Q)] = E(nQ) = nE(Q) = n/2
and
V(Y) = V[E(Y|Q)] + E[V(Y|Q)] = V(nQ) + E(nQ(1 − Q))
= n²V(Q) + nE(Q − Q²) = n²/12 + nE(Q) − nE(Q²)
= n²/12 + n/2 − n(V(Q) + [E(Q)]²)
= n²/12 + n/2 − n/12 − n/4
= n²/12 + n/6 = (n/6)(n/2 + 1).
Lecture 11
Recommended readings: WMS, sections 6.1 → 6.4
Functions of Random Variables and Their Distributions
Suppose that you are given a random variable X ∼ FX and a function
g : R → R. Consider the new random variable Y = g(X). How can we find
the distribution of Y ? This is the focus of this lecture.
There are two approaches that one can typically use to find the distribution of a function of a random variable Y = g(X): the first approach is
based on the c.d.f., the other approach is based on the change of variable
technique. The former is general and works for any function g, the latter
requires some assumptions on g, but it is generally faster than the approach
based on the c.d.f.
The method of the cumulative distribution function
The idea is pretty simple. You know that X ∼ FX and you want to find
FY . The process is as follows:
1. FY (y) = P (Y ≤ y) by definition
2. P (Y ≤ y) = P (g(X) ≤ y), since Y = g(X)
3. now you want to express the event {g(X) ≤ y} in terms of X, since
you know the distribution of X only
4. let A = {x ∈ R : g(x) ≤ y}; then {g(X) ≤ y} = {X ∈ A}
5. it follows that P (g(X) ≤ y) = P (X ∈ A) which can typically be
expressed in terms of FX
6. (EXTRA) once you have FX , the p.m.f. or the p.d.f. of X can be
easily derived from it.
Exercise in class:
Let Z ∼ N (0, 1) and g(x) = x2 . Consider Y = g(Z) = Z 2 . What is the
probability distribution of Y ?
Following the steps above, we have
FY(y) = P(Y ≤ y) = P(Z² ≤ y) = { 0 if y < 0; P(|Z| ≤ √y) if y ≥ 0 }
= { 0 if y < 0; P(Z ∈ [−√y, √y]) if y ≥ 0 }.
Notice that in this case A = [−√y, √y]. It follows that
FY(y) = { 0 if y < 0; Φ(√y) − Φ(−√y) if y ≥ 0 } = { 0 if y < 0; 2Φ(√y) − 1 if y ≥ 0 }.
Let φ denote the standard normal p.d.f.
φ(x) = (1/√(2π)) e^{−x²/2}.
Differentiating FY, the p.d.f. of Y = Z² is therefore
fY(y) = { 0 if y ≤ 0; (1/√y) φ(√y) if y > 0 } = { 0 if y ≤ 0; (1/√(2π)) y^{−1/2} e^{−y/2} if y > 0 }.
Notice that this is the p.d.f. of a Gamma(1/2, 2) ≡ χ²(1) distribution.
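As a sanity check (our own addition), one can compare the empirical distribution of Z² against the c.d.f. 2Φ(√y) − 1 derived above, and against the χ²(1) c.d.f.; scipy provides both reference curves.

```python
# Numerical check (a sketch, not from the notes): Z² for Z ~ N(0,1)
# matches 2Φ(√y) − 1, which is the χ²(1) c.d.f.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.standard_normal(10**6) ** 2

for q in (0.5, 1.0, 2.0, 4.0):
    empirical = (y <= q).mean()
    derived = 2 * stats.norm.cdf(np.sqrt(q)) - 1   # 2Φ(√q) − 1
    print(q, empirical, derived, stats.chi2.cdf(q, df=1))
```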
Exercise in class: the probability integral transform
Consider a continuous random variable X ∼ FX where FX, the c.d.f.
of X, is strictly increasing. Consider the random variable Y = FX(X). What
is the probability distribution of Y?
The transformation Y = FX(X) is usually called the probability integral
transform. To see why, notice that
Y = FX(X) = ∫_{−∞}^X fX(y) dy.
We have
FY(y) = P(Y ≤ y) = P(FX(X) ≤ y) = { 0 if y < 0; P(X ≤ FX^{−1}(y)) if y ∈ [0, 1); 1 if y ≥ 1 }.
In this case, therefore, A = (−∞, FX^{−1}(y)]. Then,
FY(y) = { 0 if y < 0; FX(FX^{−1}(y)) if y ∈ [0, 1); 1 if y ≥ 1 } = { 0 if y < 0; y if y ∈ [0, 1); 1 if y ≥ 1 }.
The p.d.f. of Y is therefore
fY(y) = 1[0,1](y),
i.e. Y ∼ Uniform(0, 1).
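A short numerical sketch of the probability integral transform (our own addition; the Exponential example is an arbitrary choice of a continuous, strictly increasing FX):

```python
# If X ~ FX with FX continuous and strictly increasing, FX(X) ~ Uniform(0,1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=10**6)   # X ~ Exponential with mean 2
u = stats.expon.cdf(x, scale=2.0)            # Y = FX(X)

# The empirical c.d.f. of Y should be close to the identity on [0, 1]
for t in (0.1, 0.25, 0.5, 0.9):
    print(t, (u <= t).mean())
```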
The method of the change of variable
We will not focus on the mathematical motivation behind this technique.
If you are curious and want to learn more about it, I strongly recommend
taking a look at this document. For our purposes, we will take this procedure
as a black-box.
In order to apply the method of the change of variable, the function g
must admit an inverse g −1 (this is not required by the method based on the
c.d.f.). A sufficient condition is that g is strictly monotone (increasing or
decreasing).
Suppose that you are given X ∼ FX and you want to compute the
probability distribution of Y = g(X). First of all, determine the support of
Y. Then, use this formula to obtain fY on the support of Y from fX and g:
fY(y) = fX(g^{−1}(y)) |(d/dz) g^{−1}(z)|_{z=y}.
Exercise in class:
Consider X ∼ Beta(1, 2) and g(x) = 2x−1. What is the p.d.f. of Y = g(X)?
First of all, notice that (with a little abuse of notation) supp(Y) =
g(supp(X)). Since supp(X) = [0, 1], it follows that supp(Y) = [−1, 1].
We have
fX(x) = 2(1 − x)1[0,1](x),
g^{−1}(x) = (x + 1)/2,
and
(d/dx) g^{−1}(x) = 1/2.
Thus, for y ∈ [−1, 1],
fY(y) = fX(g^{−1}(y)) |(d/dx) g^{−1}(x)|_{x=y} = 2(1 − (y + 1)/2) · (1/2) = 1 − (y + 1)/2 = (1 − y)/2.
The complete description of the p.d.f. of Y is therefore
fY(y) = ((1 − y)/2) 1[−1,1](y).
Homework 6 (due June 4th)
1. (8 pts) Let X|Q = q ∼ Binomial(n, q) and Q ∼ Beta(α, β). Compute
E(X).
Solution:
We have
E(X) = E[E(X|Q)] = E(nQ) = nE(Q) = nα/(α + β).
2. (12 pts) Let X|Q = q ∼ Binomial(n, q) and Q ∼ Beta(3, 2). Find the
marginal p.d.f. of X. (Notice that here we are mixing discrete and
continuous random variables, but the usual definitions still apply!)
Solution:
The support of X is supp(X) = {0, 1, . . . , n}. For x ∈ supp(X) we have
pX(x) = ∫_0^1 pX,Q(x, q) dq = ∫_0^1 pX|Q=q(x) pQ(q) dq
= ∫_0^1 (n choose x) q^x (1 − q)^{n−x} · (Γ(3 + 2)/(Γ(3)Γ(2))) q²(1 − q) dq
= 12 (n choose x) ∫_0^1 q^{2+x} (1 − q)^{n−x+1} dq
= 12 (n choose x) Γ(3 + x)Γ(n − x + 2)/Γ(5 + n).
For any x ∉ supp(X), of course pX(x) = 0.
3. (10 pts) Let X ∼ Exponential(4). Consider Y = 3X + 1. Find the
p.d.f. of Y and E(Y ).
Solution:
We have g(x) = 3x + 1, thus the support of Y is [1, ∞), g^{−1}(y) = (y − 1)/3,
and
(d/dy) g^{−1}(y) = 1/3.
It follows that, for y ∈ [1, ∞),
fY(y) = fX(g^{−1}(y)) |(d/dy) g^{−1}(y)| = (1/4) e^{−(y−1)/12} · (1/3) = (1/12) e^{−(y−1)/12},
otherwise fY(y) = 0. Now, with the change of variable x = (y − 1)/12,
E(Y) = ∫_1^∞ y (1/12) e^{−(y−1)/12} dy = ∫_0^∞ (12x + 1) e^{−x} dx = 12 ∫_0^∞ x e^{−x} dx + ∫_0^∞ e^{−x} dx = 12 + 1 = 13.
4. (10 pts) We mentioned in class that if X ∼ Gamma(α, β) then, for
c > 0, Y = cX ∼ Gamma(α, cβ). Prove it.
Solution:
We have g(x) = cx, g^{−1}(x) = x/c, and
(d/dx) g^{−1}(x) = 1/c.
The image of the support of X through g is still [0, ∞). Then, for
y ∈ [0, ∞),
fY(y) = fX(g^{−1}(y)) |(d/dx) g^{−1}(x)|_{x=y} = (1/(β^α Γ(α))) (y/c)^{α−1} e^{−y/(cβ)} · (1/c) = (1/((cβ)^α Γ(α))) y^{α−1} e^{−y/(cβ)},
otherwise fY(y) = 0. This is clearly the p.d.f. of a Gamma(α, cβ)
distribution.
5. (12 pts) Let F be a given continuous and strictly increasing c.d.f.. Let
U ∼ Uniform(0, 1). Find a function g such that g(U ) ∼ F . You can
assume that g is strictly increasing.
Solution:
Let Y = g(U). We have
FY(y) = P(Y ≤ y) = P(g(U) ≤ y) = P(U ≤ g^{−1}(y)) = { 0 if g^{−1}(y) < 0; g^{−1}(y) if g^{−1}(y) ∈ [0, 1); 1 if g^{−1}(y) ≥ 1 }.
Thus, it is enough to set g^{−1} = F, i.e. g = F^{−1}, and then Y = g(U) ∼ F.
6. (12 pts) Let X ∼ fX with
fX(x) = (b/x²) 1[b,∞)(x)
and let U ∼ Uniform(0, 1). Find a function g such that g(U) has the
same distribution of X.
Solution:
Using the result of exercise 5, we need to set g = FX^{−1}. We have
FX(x) = { 0 if x < b; ∫_b^x (b/y²) dy if x ≥ b } = { 0 if x < b; 1 − b/x if x ≥ b }.
It follows that
g(x) = FX^{−1}(x) = b/(1 − x).
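Exercises 5 and 6 together describe inverse transform sampling; a short sketch (our own addition, with b = 2 an arbitrary choice):

```python
# Inverse transform sampling: with g(u) = b/(1 − u) and U ~ Uniform(0,1),
# g(U) should have c.d.f. FX(x) = 1 − b/x on [b, ∞).
import numpy as np

rng = np.random.default_rng(4)
b = 2.0
u = rng.uniform(size=10**6)
x = b / (1.0 - u)

# Compare the empirical c.d.f. with 1 − b/x at a few points
for t in (2.5, 4.0, 10.0):
    print(t, (x <= t).mean(), 1.0 - b / t)
```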
7. (10 pts) Let θ > 0 and X ∼ fX with
fX(x) = (2x/θ) e^{−x²/θ} 1[0,∞)(x).
Consider Y = X². What is fY, the p.d.f. of Y? What are E(Y) and
V(Y)?
Solution:
Notice that the function g(x) = x² is invertible over the support of
X, [0, ∞). Furthermore, g is such that supp(Y) = supp(X). We have
g^{−1}(x) = √x and
(d/dx) g^{−1}(x) = 1/(2√x).
Therefore
fY(y) = fX(g^{−1}(y)) |(d/dx) g^{−1}(x)|_{x=y} = (2√y/θ) e^{−y/θ} 1[0,∞)(√y) · 1/(2√y) = (1/θ) e^{−y/θ} 1[0,∞)(y).
It follows that Y ∼ Exponential(θ). Thus, E(Y) = θ and V(Y) = θ².
8. (10 pts) Let X ∼ Beta(α, β). Consider Y = 1 − X. What is the
probability distribution of Y ? Now, let α = β = 1. Claim that X and
Y have the same distribution. Are X and Y also independent?
Solution:
Here g(x) = 1 − x, so that g^{−1}(x) = g(x) and (d/dx) g^{−1}(x) = −1. Notice
that, for this g, supp(Y) = supp(X) = [0, 1]. Then, for y ∈ [0, 1] we
have
fY(y) = fX(g^{−1}(y)) |(d/dx) g^{−1}(x)|_{x=y} = (Γ(α + β)/(Γ(α)Γ(β))) (1 − y)^{α−1} y^{β−1} |−1| = (Γ(α + β)/(Γ(α)Γ(β))) (1 − y)^{α−1} y^{β−1},
otherwise fY(y) = 0. It is clear then that Y ∼ Beta(β, α). If α =
β = 1, then X and Y share the exact same distribution, which is
Uniform(0, 1). However, since Y is a deterministic function of X (in
particular, Y = 1 − X), it is obviously false that X and Y are independent random variables.
9. (6 pts) Let X ∼ FX . What is E(X|X)?
Solution:
For x ∈ supp(X), we have E(X|X = x) = E(x|X = x) = E(x) = x,
i.e. E(X|X) = X.
10. (10 pts) Let X, Y, Z be three random variables such that E(X|Z) =
3Z, E(Y |Z) = 1, and E(XY |Z) = 4Z. Assume furthermore that
E(Z) = 0. Are X and Y uncorrelated?
Solution:
We have
Cov(X, Y ) = E[Cov(X, Y |Z)] + Cov[E(X|Z), E(Y |Z)]
= E[E(XY |Z) − E(X|Z)E(Y |Z)] + Cov(3Z, 1)
= E(4Z − 3Z · 1) + 3 · Cov(Z, 1) = E(Z) + 3 · 0 = 0 + 0 = 0.
It follows that X and Y are uncorrelated.
Lecture 12
Recommended readings: WMS, sections 7.1 → 7.4
The Empirical Rule, Sampling Distributions, the Central Limit
Theorem, and the Delta Method
In this lecture we will focus on some basic statistical concepts at the basis
of statistical inference. Before starting, let’s quickly discuss a useful rule of
thumb associated to the Normal distribution which is often mentioned and
used in practice.
The Empirical Rule Based on the Normal Distribution
Suppose that you collect data X1, . . . , Xn ∼ f where f is an unknown probability density function which, however, you know to be 'bell-shaped' (or you
expect to be 'bell-shaped' given the information you have on the particular
phenomenon you are observing, or you can show to be 'bell-shaped' using, for
example, a histogram).
INTRODUCTION TO HISTOGRAM AND DENSITY ESTIMATION
We know that we can estimate the mean µf and the standard deviation
σf of f by means of the sample mean and the sample standard deviation
X̄ and S, respectively. On the basis of these two statistics, you may be
interested in approximately quantifying the probability content of intervals
of the form [µf − kσf , µf + kσf ] where k is a positive integer. Because for
X ∼ N (µ, σ 2 ) one has
P (µ − σ ≤ X ≤ µ + σ) ≈ 68%
P (µ − 2σ ≤ X ≤ µ + 2σ) ≈ 95%
P (µ − 3σ ≤ X ≤ µ + 3σ) ≈ 99%,
one would expect that, based on X̄ and S,
P ([X̄ − S, X̄ + S]) ≈ P ([µf − σf , µf + σf ]) ≈ 68%
P ([X̄ − 2S, X̄ + 2S]) ≈ P ([µf − 2σf , µf + 2σf ]) ≈ 95%
P([X̄ − 3S, X̄ + 3S]) ≈ P([µf − 3σf, µf + 3σf]) ≈ 99%    (49)
if the probability density f is 'bell-shaped'. The approximations of equation
(49) are frequently referred to as the empirical rules based on the Normal
distribution.
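A quick numerical sketch of the empirical rule (our own addition; the Normal sample with mean 10 and standard deviation 3 is an arbitrary 'bell-shaped' example):

```python
# For a bell-shaped sample, [X̄ ± kS] should contain ≈ 68/95/99% of the data.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=3.0, size=100_000)
xbar, s = x.mean(), x.std(ddof=1)   # sample mean and sample standard deviation

for k in (1, 2, 3):
    frac = ((x >= xbar - k * s) & (x <= xbar + k * s)).mean()
    print(k, round(frac, 4))
```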
Sampling Distributions
To illustrate the concept of sampling distribution, we will consider the Normal model, which assumes that the data X1, . . . , Xn are i.i.d.
with a Normal distribution N(µ, σ²), with parameters µ and σ² that are usually assumed to be unknown. In order to estimate µ and σ², we saw that we can
use the two statistics X̄ and S 2 (the sample mean and the sample variance).
These statistics, or estimators, are random variables too, since they depend
on the data X1 , . . . , Xn . If they are random variables, then they must have
their own probability distributions! The probability distribution of a statistic (or an appropriate stabilizing transformation of it) is called the sampling
distribution of that statistic. We have the following results:
1. √n(X̄ − µ)/σ ∼ N(0, 1)
2. (n − 1)S²/σ² ∼ χ²(n − 1)
3. √n(X̄ − µ)/S ∼ t(n − 1).
Some comments: 1. is easy to understand and prove (it’s just standardization of a Normal random variable!). 1. is useful when we are interested in
making inferences about µ and we know σ 2 . 2. is a little harder to prove.
We use this result when we are interested in making inferences about σ 2 and
µ is unknown. 3. is a classical result. It is useful when we want to make
inferences on µ and σ 2 is unknown. We won’t study in detail the t distribution. However, this distribution arises when we take the ratio of a standard
Normal distribution and the square root of a χ2 distribution divided by its
degrees of freedom (under the assumption that the Normal and the χ2 distributed random variables are also independent). The t distribution looks
like a Normal distribution with ‘fatter’ tails. As the number of degrees of
freedom of the t distribution goes to ∞, the t distribution ‘converges in
distribution’ to a standard Normal distribution.
For the purposes of this course, we say that a sequence of random variables X1 , X2 , . . . Xn , . . . converges in distribution to a certain distribution
F if their c.d.f.’s F1 , F2 , . . . , Fn , . . . are such that
lim_{n→∞} Fn(x) = F(x)
for each x ∈ R at which F is continuous. We will use the notation →d
to denote convergence in distribution. The idea here is that the limiting
c.d.f. F (often called the limiting distribution of the X’s) can be used to
approximate probability statements about the random variables in the sequence. This idea will be key to the next result, the Central Limit Theorem.
BASIC INFERENCE EXAMPLE AND p-VALUES (IMPORTANT:
CHI-SQUARE EXAMPLE)
The Central Limit Theorem
The Central Limit Theorem is a remarkable result. Simply put, suppose that
you have a sequence of random variables X1, X2, . . . , Xn, . . . iid ∼ F distributed
according to some c.d.f. F , such that their expectation µ and their variance
σ 2 exist and are finite. Then, the standardized version of their average
converges in distribution to a standard Normal distribution.
In other words,
√n(X̄n − µ)/σ →d Z    (50)
as n → ∞, where Z ∼ N(0, 1). Another way to express the Central Limit
Theorem is to say that, for any x ∈ R,
lim_{n→∞} P(√n(X̄n − µ)/σ ≤ x) = Φ(x).    (51)
This result is extremely frequently used and invoked to approximate probability statements regarding the average of a collection of i.i.d. random
variables when n is 'large'. However, the population variance σ² is rarely
known. The Central Limit Theorem can be extended to accommodate this
case. We have
√n(X̄n − µ)/S →d Z    (52)
where Z ∼ N(0, 1).
Exercise in class:
The number of errors per computer program has a Poisson distribution with
mean λ = 5. We receive n = 125 programs written by n = 125 different
programmers. Let X1 , . . . , Xn denote the number of errors in each of the
n programs. Then X1, . . . , Xn iid ∼ Poisson(λ). We want to approximate the
probability that the average number of errors per program is not larger than
5.5.
We have µ = E(X1) = λ = 5 and σ = √V(X1) = √λ = √5. Furthermore,
P(X̄n ≤ 5.5) = P(√n(X̄n − µ)/σ ≤ √n(5.5 − µ)/σ)
≈ P(Z ≤ √125 (5.5 − 5)/√5) = P(Z ≤ 2.5)
= Φ(2.5)
where Z ∼ N(0, 1).
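This approximation is easy to check by simulation (our own addition; a sketch only):

```python
# Monte Carlo check of the exercise above: the exact P(X̄n ≤ 5.5)
# should be close to the CLT approximation Φ(2.5) ≈ 0.9938.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
lam, n, reps = 5.0, 125, 50_000
xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)

print((xbar <= 5.5).mean())    # Monte Carlo estimate of P(X̄n ≤ 5.5)
print(stats.norm.cdf(2.5))     # CLT approximation Φ(2.5)
```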
The Delta Method
If we have a sequence of random variables X1, X2, . . . , Xn, . . . whose standardized average converges in distribution to a standard Normal, and a differentiable function
g : R → R, then the Delta Method allows us to find the limiting distribution
of the sequence g(X̄1), g(X̄2), . . . , g(X̄n), . . . . Assume that
√n(X̄n − µ)/σ →d N(0, 1)
and that g : R → R is differentiable with g′(µ) ≠ 0. Then,
√n(g(X̄n) − g(µ)) / (|g′(µ)|σ) →d N(0, 1).
Exercise in class:
Let X1, . . . , Xn iid ∼ F where F is some c.d.f. such that both the expectation
µ and the variance σ² of the X's exist and are finite. By the Central Limit
Theorem,
√n(X̄n − µ)/σ →d N(0, 1).
Consider the function g(x) = e^x. Find the limiting distribution of g(X̄n).
We have g′(x) = g(x) = e^x > 0 for any x ∈ R; thus, g′(µ) ≠ 0 necessarily.
By applying the Delta Method, we have that
√n(e^{X̄n} − e^µ) / (e^µ σ) = √n[g(X̄n) − g(µ)] / (|g′(µ)|σ) →d N(0, 1).
Thus, we can use the distribution
N(e^µ, e^{2µ}σ²/n)
to approximate probability statements about g(X̄n) = e^{X̄n} when n is 'large'.
Homework 7 (due June 9th)
1. (30 pts) Let X1, . . . , X10 iid ∼ N(µ, σ²). We don't know the value of
σ 2 , but we want to decide whether, based on the data X1 , . . . , X10 ,
σ 2 ≤ σ02 = 3 or not. We observe from the data that S 2 = s2 = 10.
• Use the sampling distribution of S 2 to compute the probability of
the event {S 2 ≥ s2 } under the assumption that the true unknown
variance is σ 2 = σ02 (you will need to use a statistical software
and use the c.d.f. of a χ2 with a certain number of degrees of
freedom to compute this probability)2 .
• Based on the probability that you computed, explain why the
data X1, . . . , X10 suggest that it is likely not true that σ² ≤ σ0².
Solution:
• We know that
(n − 1)S²/σ² ∼ χ²(n − 1).
Here, σ² is unknown. However, let's assume that σ² is at least as
large as σ0² and let's therefore plug σ² = σ0² in the above result
about the sampling distribution of S². It then follows that
P(S² ≥ s²) = P((n − 1)S²/σ0² ≥ (n − 1)s²/σ0²) = P(χ²(9) ≥ 30) ≈ 0.0004
(with a little abuse of notation).
• We thus discovered that, if the true unknown variance was not
larger than σ0² = 3 as we conjectured, then the probability of
observing a value of the sample variance s² as extreme as the one
we observed from the data (s² = 10) is extremely small. Based on
the probability that we computed, it looks therefore very unlikely
that the true unknown variance σ² is not larger than σ0² = 3.
2 We can safely plug in σ0² because
sup_{0<σ²≤σ0²} P_{σ²}(S² > s²) = P_{σ0²}(S² > s²).
Here the notation P_{σ0²}(A) should be interpreted as the 'probability of the event A when
we assume that the true unknown variance is σ0²'.
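For reference, the probability P(χ²(9) ≥ 30) can be computed with a statistical software, as the problem suggests; one possible choice (our own addition) is scipy:

```python
# P(χ²(9) ≥ 30) = 1 − F(30), where F is the c.d.f. of a χ² with 9 d.o.f.
from scipy import stats

print(stats.chi2.sf(30, df=9))   # ≈ 0.0004
```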
2. (20 pts) The times a cashier spends processing individual customer’s
orders are independent random variables with mean 2.5 minutes and
standard deviation 2 minutes. What is the approximate probability
that it will take more than 4 hours to process the orders of 100 people?
You can leave your answer in terms of Φ.
Solution:
Let Z ∼ N(0, 1) and let X1, . . . , Xn denote the times spent by the
cashier to process the n = 100 individual customers' orders. Without any
other assumption on the distribution of the X's other than independence, finite mean and finite variance, we can only rely on the Central
Limit Theorem. In this case, n = 100 is likely large enough to give a
good approximation to the event of interest, which is
{Σ_{i=1}^n Xi > 240}
(notice that there are exactly 240 minutes in 4 hours). Thus, we have
P(Σ_{i=1}^n Xi > 240) = P(X̄n > 240/n)
= P(√n(X̄n − µ)/σ > √n(240/n − µ)/σ)
≈ P(Z > √n(240/n − µ)/σ) = 1 − P(Z ≤ √n(240/n − µ)/σ)
= 1 − Φ(√n(240/n − µ)/σ)
= 1 − Φ(10(240/100 − 2.5)/2) = 1 − Φ(−1/2) = Φ(1/2).
3. (25 pts) Let X1 , . . . , Xn be iid random variables with c.d.f. F , finite
expectation µ and finite variance σ 2 . Consider the random variable
Y = aX̄n + b for a, b ∈ R and a > 0. Assume that n is ‘large’. Give
an approximate expression for the α-quantile of Y . Your answer will
be in terms of Φ−1 .
Solution:
Let Z ∼ N(0, 1). Without further information about F, we can only
use the Central Limit Theorem. By definition, the α-quantile of Y is
the value q ∈ R such that P(Y ≤ q) = α. We can use the Central
Limit Theorem to approximate this number when n is large. We have
α = P(Y ≤ q) = P(aX̄n + b ≤ q) = P(X̄n ≤ (q − b)/a)
= P(√n(X̄n − µ)/σ ≤ √n((q − b)/a − µ)/σ)
≈ P(Z ≤ √n((q − b)/a − µ)/σ) = Φ(√n((q − b)/a − µ)/σ).
Thus,
√n((q − b)/a − µ)/σ ≈ Φ^{−1}(α),
i.e.
q ≈ b + a(µ + σΦ^{−1}(α)/√n).
4. (25 pts) Let X1, . . . , Xn iid ∼ Bernoulli(p) be a sequence of binary random variables where
Xi = { 1, if the Pittsburgh Pirates won the i-th game against the Cincinnati Reds; 0, if the Pittsburgh Pirates lost the i-th game against the Cincinnati Reds }
and p ∈ (0, 1). Suppose that we are interested in the odds that the
Pittsburgh Pirates win a game against the Cincinnati Reds. To do that,
we need to consider the function
g(p) = p/(1 − p)
representing the odds of the Pirates winning against the Reds. It is
natural to estimate p by means of X̄n. Provide a distribution that can
be used to approximate probability statements about g(X̄n).
Solution:
We have E(Xi) = p ∈ (0, 1) and V(Xi) = p(1 − p) ∈ (0, 1) for any
i ∈ {1, . . . , n}, so both the expectation and the variance of the X's exist
and are finite. Consider the function g(x) = x/(1 − x) defined for x ∈ (0, 1).
We have
g′(x) = 1/(1 − x)².
Notice that g′(x) > 0 for any x in (0, 1). Thus, g′(E(Xi)) ≠ 0. The
Delta Method allows us to write
√n(X̄n/(1 − X̄n) − p/(1 − p)) / ((1/(1 − p)²)√(p(1 − p))) = √n[g(X̄n) − g(E(Xi))] / (|g′(E(Xi))|√(V(Xi))) →d N(0, 1).
Hence, we can use the distribution
N(p/(1 − p), p/(n(1 − p)³))
to approximate probability statements about g(X̄n) = X̄n/(1 − X̄n).
5. (EXTRA CREDIT, 15 pts) Let U1, U2, . . . , Un, . . . be a sequence of
random variables such that Un →d Uniform(0, 1). Consider the function
g(x) = − log x and the new sequence of random variables X1, X2, . . . , Xn, . . .
with Xi = − log Ui. Prove that Xn →d Exponential(1). Hint: use the
method of the c.d.f. that we discussed in lecture 11.
Solution:
We first want to compute the c.d.f. of the n-th transformed variable
Xn in the sequence. We have
FXn(x) = P(Xn ≤ x) = P(− log Un ≤ x) = P(Un ≥ e^{−x}).
Because Un →d Uniform(0, 1), it follows that
P(Un ≥ e^{−x}) → { 0 if x < 0; 1 − e^{−x} if x ≥ 0 } = F(x).
We recognize F as the c.d.f. of an Exponential(1) distribution.
Thus, we conclude that Xn →d Exponential(1).
Lecture 13
Recommended readings: N/A
The Law of Large Numbers, Convergence of Random Variables, Convergence Results
The Law of Large Numbers
In the previous set of lecture notes we studied the Central Limit Theorem
and a particular mode of convergence, the convergence in distribution. Another important result on the convergence of the average of a sequence of
random variables is the (weak) law of large numbers 3 . Simply put, the law
of large numbers says that the sample mean of a sequence of iid random
variables converges to the common expectation of the random variables in
the sequence. More precisely, let X1 , . . . , Xn , . . . be a sequence of iid random
variables with common mean µ. Then, for any ε > 0,
lim_{n→∞} P(|X̄n − µ| > ε) = 0.    (53)
In this case, we use the notation X̄n →p µ.
Example in class:
Suppose that you repeatedly flip a coin which has probability p ∈ (0, 1) of
showing head. Introduce the random variables
Xi = { 1, if the i-th flip is head; 0, if the i-th flip is tails }.
Consider the random variable X̄n describing the observed proportion of
heads. Then, as n → ∞, X̄n →p p. In English, this means that for any
arbitrarily small ε > 0, the probability of the event 'the proportion of heads
differs from p by more than ε' converges to 0 as n → ∞.
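A quick simulation sketch of this example (our own addition; p = 0.3 is an arbitrary choice):

```python
# The running proportion of heads settles near p as n grows,
# in line with the (weak) law of large numbers.
import numpy as np

rng = np.random.default_rng(7)
p = 0.3
flips = rng.binomial(1, p, size=100_000)
running_mean = flips.cumsum() / np.arange(1, flips.size + 1)

for n in (10, 100, 1_000, 100_000):
    print(n, running_mean[n - 1])
```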
Convergence of Random Variables
So far we introduced two notions of convergence for random variables: convergence in distribution (which we associated to the Central Limit Theorem)
and convergence in probability (which we associated to the weak law of large
numbers). There are at least two other important notions of convergence
3 We will discuss a strong version in the next section.
Figure 1: Relationships between different modes of convergence.
that are frequently used and mentioned: convergence in quadratic mean (or
L2 -convergence) and almost sure convergence.
We say that the sequence of random variables X1, X2, . . . , Xn, . . . converges in quadratic mean to the random variable X if
lim_{n→∞} E[(Xn − X)²] = 0.    (54)
We use the notation Xn →L2 X in this case. This type of convergence is thus
related to the convergence of expectations.
We say that the sequence of random variables X1, X2, . . . , Xn, . . . (defined over some sample space Ω) converges almost surely to the random
variable X if
P({ω ∈ Ω : lim_{n→∞} Xn(ω) = X(ω)}) = 1.
This is a very strong notion of convergence (essentially the strongest in probability theory, except for 'sure' convergence!). It states that the set of elements ω in the sample space Ω for which the sequence X1(ω), X2(ω), . . . , Xn(ω), . . .
converges to X(ω) is a set that has probability 1. We denote almost sure
convergence as Xn →a.s. X.
Almost sure convergence is associated to the strong law of large numbers.
Under the same assumptions of the weak law of large numbers, we have
X̄n →a.s. µ.
Convergence Results
Figure 1 illustrates some relationships between the different modes of convergence that we discussed. Furthermore, even if it is not included in the
Figure, almost sure convergence implies convergence in probability.
Here is a list of interesting results. Let X1 , . . . , Xn , . . . and Y1 , . . . , Yn , . . .
be two sequences of random variables, X and Y be two random variables,
and let g : R → R be a continuous function. We have
• Xn →a.s. X and Yn →a.s. Y =⇒ Xn + Yn →a.s. X + Y
• Xn →p X and Yn →p Y =⇒ Xn + Yn →p X + Y
• Xn →L2 X and Yn →L2 Y =⇒ Xn + Yn →L2 X + Y
• Xn →d c ∈ R and Yn →d Y =⇒ Xn + Yn →d c + Y
• Xn →a.s. X and Yn →a.s. Y =⇒ XnYn →a.s. XY
• Xn →p X and Yn →p Y =⇒ XnYn →p XY
• Xn →d c ∈ R and Yn →d Y =⇒ XnYn →d cY
• Xn →a.s. X =⇒ g(Xn) →a.s. g(X)
• Xn →p X =⇒ g(Xn) →p g(X)
• Xn →d X =⇒ g(Xn) →d g(X).
Lecture 14
Recommended readings: N/A
Probability Inequalities and Expectation Inequalities
Probability Inequalities
We already studied Markov’s inequality and Tchebysheff’s inequality in lecture 5. Here we discuss two other important inequalities that give exponential upper bounds on the probability content of the tails of a distribution.
We will focus on Hoeffding’s inequality and McDiarmid’s inequality.
Hoeffding’s Inequality
Let X1, . . . , Xn be independent random variables with E(Xi) = 0 and
P(ai ≤ Xi ≤ bi) = 1 for some ai < bi and for all i ∈ {1, . . . , n}. Then
for any ε > 0 we have
P(Σ_{i=1}^n Xi > ε) ≤ e^{−tε} Π_{i=1}^n e^{t²(bi−ai)²/8}
for any t > 0.
Exercise in class:
Suppose that ai = a and bi = b for all i ∈ {1, . . . , n}.
• How does Hoeffding's inequality read in this case?
• For any ε > 0, the bound is valid for any t > 0. Derive the tightest
bound when ai = a and bi = b for all i ∈ {1, . . . , n}.
It is easy to see that when ai = a and bi = b for all i ∈ {1, . . . , n} the
bound reads
P(Σ_{i=1}^n Xi > ε) ≤ e^{−tε} e^{t²n(b−a)²/8} = e^{−tε + n(b−a)²t²/8}.
The tightest bound is the one obtained by minimizing with respect to t the
quadratic exponent. The exponent is minimized for t = 4ε/(n(b − a)²). By
plugging back this value of t we obtain
P(Σ_{i=1}^n Xi > ε) ≤ e^{−2ε²/(n(b−a)²)}.
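A numerical sketch of the bound (our own addition): the summands Xi = Ui − 1/2 with Ui ∼ Uniform(0, 1) are an arbitrary choice of bounded, mean-zero variables with a = −1/2 and b = 1/2.

```python
# Compare the optimized Hoeffding bound exp(−2ε²/(n(b−a)²))
# with the actual tail probability P(Σ Xi > ε).
import numpy as np

rng = np.random.default_rng(8)
n, reps, eps = 50, 200_000, 5.0
x = rng.uniform(size=(reps, n)) - 0.5        # E(Xi) = 0, Xi ∈ [−1/2, 1/2]
tail = (x.sum(axis=1) > eps).mean()

bound = np.exp(-2 * eps**2 / (n * 1.0**2))   # here b − a = 1
print(tail, bound)                           # tail probability ≤ bound
```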
McDiarmid’s Inequality
Let X1 , . . . , Xn be independent random variables. Suppose that the function
g is such that
sup g(x1 , . . . , xi−1 , xi , xi+1 , . . . , xn ) − g(x1 , . . . , xi−1 , x0i , xi+1 , . . . , xn ) ≤ ci
x1 ,...,xn ,x0i
for i ∈ {1, . . . , n}. Then
P (g(X1 , . . . , Xn ) − E[g(X1 , . . . , Xn )] ≥ ) ≤ e
2
c2
i=1 i
− Pn2
.
Expectation Inequalities
We will only discuss two basic expectation inequalities: the Cauchy-Schwarz
inequality and Jensen's inequality. Although simple, they are useful in a
variety of situations.
Cauchy-Schwarz Inequality
Let X and Y be two random variables such that E(X²) and E(Y²) exist
and are finite. Then,
E(|XY|) ≤ √(E(X²)E(Y²)).
Exercise in class:
Use the Cauchy-Schwarz inequality to show that for two random variables X
and Y such that E(X²) and E(Y²) exist and are finite, |Cor(X, Y)| ≤ 1.
Jensen’s Inequality
Let g : R → R be a convex function and let X be a random variable. Then,
provided E(X) exists,
g[E(X)] ≤ E[g(X)].
Exercise in class:
Use Jensen’s inequality to show that for any random variable X, V (X) ≥ 0.
Homework 8 (due June 11th)
1. (20 pts) Prove the weak law of large numbers under the additional
assumption that the variance σ 2 of the iid sequence X1 , X2 , . . . , Xn , . . .
exists and is finite. Hint: use Tchebysheff’s inequality.
Solution:
Take any ε > 0. By Tchebysheff's inequality we have that
P(|X̄n − µ| > ε) ≤ V(X̄n)/ε² = σ²/(nε²).
Clearly,
0 ≤ lim_{n→∞} P(|X̄n − µ| > ε) ≤ lim_{n→∞} σ²/(nε²) = 0,
hence the result.
2. (20 pts) Consider the sequence of random variables X1, X2, . . . , Xn, . . .
with Xn ∼ N(µ, 3/(1 + log n)). Show that Xn →p µ.
Solution:
Take any arbitrary ε > 0. Tchebysheff's inequality implies
P(|Xn − µ| > ε) ≤ V(Xn)/ε² = 3/((1 + log n)ε²).
Clearly,
0 ≤ lim_{n→∞} P(|Xn − µ| > ε) ≤ lim_{n→∞} 3/((1 + log n)ε²) = 0,
and thus Xn →p µ.
3. (20 pts) Prove that for a sequence of random variables X1, X2, . . . , Xn, . . .
and a random variable X we have Xn →L2 X =⇒ Xn →p X. Hint:
once again, use Tchebysheff's inequality.
Solution:
Take any arbitrary ε > 0. Tchebysheff's inequality implies
P(|Xn − X| > ε) = P(|Xn − X|² > ε²) ≤ E[(Xn − X)²]/ε².
By assumption lim_{n→∞} E[(Xn − X)²] = 0, therefore
0 ≤ lim_{n→∞} P(|Xn − X| > ε) ≤ lim_{n→∞} E[(Xn − X)²]/ε² = 0,
and therefore Xn →p X.
4. (20 pts) Let X1, . . . , Xn iid ∼ Bernoulli(p). Let ε > 0. Use Hoeffding's
inequality to derive a tight upper bound for the probability of the
event {|X̄n − p| > ε}.
Solution:
{|X̄n − p| > ε} = {|Σ_{i=1}^n (Xi − p)| > nε}
= {Σ_{i=1}^n (Xi − p) > nε} ∪ {Σ_{i=1}^n (p − Xi) > nε}.
Notice that Xi − p ∈ [−1, 1] with probability 1, p − Xi ∈ [−1, 1] with
probability 1, and E(Xi − p) = E(p − Xi) = 0 for any i ∈ {1, . . . , n}.
We can therefore apply Hoeffding's inequality, which yields
P(|X̄n − p| > ε) = P(Σ_{i=1}^n (Xi − p) > nε) + P(Σ_{i=1}^n (p − Xi) > nε)
≤ e^{−2(nε)²/(n(1−(−1))²)} + e^{−2(nε)²/(n(1−(−1))²)} = 2e^{−nε²/2}.
5. (20 pts) Consider McDiarmid’s inequality. Prove that McDiarmid’s
inequality coincides withPHoeffding’s inequality in the particular case
where g(x1 , . . . xn ) = n1 ni=1 xi , P (a ≤ Xi ≤ b)= 1, and E(Xi ) = 0
for i ∈ {1, . . . , n}.
Solution:
Notice that
g(x1 , . . . , xi−1 , xi , xi+1 , . . . , xn ) − g(x1 , . . . , xi−1 , x0i , xi+1 , . . . , xn )
sup
x1 ,...,x0i ,...,xn
1
0 =
sup
(x
−
x
)
i
i
.
x1 ,...,x0i ,...,xn n
Because a ≤ Xi ≤ b with probability 1, it follows that
g(X1 , . . . , Xi−1 , Xi , Xi+1 , . . . , Xn ) − g(X1 , . . . , Xi−1 , Xi0 , Xi+1 , . . . , Xn )
sup
X1 ,...,Xi0 ,...,Xn
1
b−a
0 =
sup
(X
−
X
)
i
i
≤ n .
X1 ,...,Xi0 ,...,Xn n
116
with probability 1. It is easily seen that E[g(X1 , . . . , Xn )] = 0, so
McDiarmid’s inequality can be applied and yields
P (g(X1 , . . . , Xn ) − E[g(X1 , . . . , Xn )] > )
= P X̄n > −P
≤e
−
=e
22
b−a 2
n
n
i=1
(
22
(b−a)2
n
)
−
=e
2n2
(b−a)2
which is the same bound that we obtain by applying Hoeffding’s inequality.
6. (EXTRA CREDIT, 15 pts) Let Xn ∼ Poisson(λn) with n ∈ {1, 2, . . . }
and λn = 1/n. Show that Yn = nXn →p 0. Hint: remember from the
convergence results that we studied that if Yn →d c for some c ∈ R,
then the implication Yn →d c =⇒ Yn →p c holds true.
Solution:
Consider FYn, the c.d.f. of Yn. We have that
FYn(t) = P(nXn ≤ t) = P(Xn ≤ t/n).
Clearly, if t < 0 then FYn(t) = 0. Take now t ∈ [0, 1). Since t/n < 1
and Xn takes values in {0, 1, . . . },
FYn(t) = P(Xn ≤ t/n) = P(Xn = 0) = pXn(0) = e^{−λn} = e^{−1/n}
(and, for t ≥ 1, FYn(t) ≥ e^{−1/n} as well).
Now, since e^{−1/n} → 1 as n → ∞, it is clear that the limiting c.d.f. F is
the function
F(t) = { 0 if t < 0; 1 if t ≥ 0 }
which corresponds to the c.d.f. of a random variable that assigns
probability mass 1 to the point t = 0. This shows that Yn →d 0, which
implies that Yn →p 0.
Lecture 15
Recommended readings: WMS, sections 3.9, 4.9, 6.5
Moment-generating Functions
In this lecture, we introduce a particular function associated to a probability
distribution F which uniquely characterizes F . This function is called the
moment-generating function (henceforth shortened to m.g.f.) of F because,
if it exists, it allows us to easily compute any moment of F, i.e. any expectation
of the type E(X^k) with k ∈ {0, 1, . . . } for X ∼ F.
There are other functions that are similar to the m.g.f.s which we will not
discuss in this course: these include the probability-generating function for
discrete probability distributions (which is a compact power series representation of the p.m.f.), the characteristic function (which is the inverse Fourier
transform of a p.m.f./p.d.f. and always exists) and the cumulant-generating
function (which is the logarithm of the m.g.f.).
The moments {E(X^k)}_{k=1}^∞ associated to a distribution X ∼ F completely characterize F (if they exist). They can all be encapsulated in the
m.g.f.
mX(t) = E(e^{tX}) = { ∫_{supp(X)} e^{tx} fX(x) dx if X is continuous; Σ_{supp(X)} e^{tx} pX(x) if X is discrete }.
If two random variables X and Y are such that mX = mY then their c.d.f.’s
FX and FY are equal at almost all points (i.e. they can differ at at most
countably many points).
We say that the moment generating function mX of X ∼ F exists if
there exists an open neighborhood around t = 0 in which mX (t) is finite.
Note that it is always true that, for X ∼ F with an arbitrary distribution
F , mX (0) = 1.
The name of this function comes from the following feature: suppose
that mX exists, then for any k ∈ {0, 1, . . . }
d^k/dt^k mX(t)|_{t=0} = E(X^k).    (55)
This means that we can 'generate' the moments of X ∼ F from mX by
differentiating mX and evaluating its derivatives at t = 0.
Exercise in class:
Show that the m.g.f. of X ∼ Binomial(n, p) is
mX(t) = [pe^t + 1 − p]^n
and use it to compute V (X).
We have
mX(t) = Σ_{x=0}^n e^{tx} pX(x) = Σ_{x=0}^n e^{tx} (n choose x) p^x (1 − p)^{n−x} = Σ_{x=0}^n (n choose x) (pe^t)^x (1 − p)^{n−x} = [pe^t + (1 − p)]^n
where we used the binomial theorem
(a + b)^n = Σ_{i=0}^n (n choose i) a^i b^{n−i}.
Notice that the function above is well-defined and finite for any t ∈ R. Then,
E(X) = d/dt mX(t)|_{t=0} = npe^t[pe^t + (1 − p)]^{n−1}|_{t=0} = np
and
E(X²) = d²/dt² mX(t)|_{t=0} = (n(n − 1)p²e^{2t}[pe^t + (1 − p)]^{n−2} + npe^t[pe^t + (1 − p)]^{n−1})|_{t=0} = n(n − 1)p² + np.
Thus,
V(X) = E(X²) − [E(X)]² = n(n − 1)p² + np − n²p² = −np² + np = np(1 − p).
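The 'generate the moments by differentiating at t = 0' step can also be carried out symbolically; below is a sketch with sympy (our own addition; n = 10 is an arbitrary choice):

```python
# Generate the Binomial moments from its m.g.f., as in the derivation above.
import sympy as sp

t, p = sp.symbols('t p')
n = 10
m = (p * sp.exp(t) + 1 - p) ** n      # m.g.f. of Binomial(n, p)

EX = sp.diff(m, t, 1).subs(t, 0)      # E(X), should simplify to n p
EX2 = sp.diff(m, t, 2).subs(t, 0)     # E(X²)
V = sp.simplify(EX2 - EX**2)          # should equal n p (1 − p)
print(sp.simplify(EX))
print(sp.factor(V))
```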
Exercise in class:
Show that the m.g.f. of X ∼ Uniform(0, 1) is
mX(t) = (e^t − 1)/t.
We have
mX(t) = ∫_0^1 e^{tx} dx = (1/t) e^{tx}|_0^1 = (e^t − 1)/t,
which is well-defined and finite for any t ∈ R (recall that mX(0) = 1 for any
X).
It is easy to verify that a m.g.f. mX satisfies
m_{aX+b}(t) = e^{bt} mX(at).    (56)
M.g.f.'s also provide a tool which is often useful to identify the distribution of a linear combination of random variables. In the m.g.f. world sums
become products! In particular, consider a collection of n independent random variables X1, . . . , Xn with m.g.f.'s mX1, . . . , mXn. It is easy to check
from the definition of m.g.f. that, if we consider Y = Σ_{i=1}^n (aiXi + bi), then
mY(t) = e^{(Σ_{i=1}^n bi)t} Π_{i=1}^n mXi(ait).    (57)
Exercise in class:
The m.g.f. of X ∼ Poisson(λ) is
mX(t) = e^{λ(e^t − 1)}
for t ∈ R. Consider Y ∼ Poisson(µ) with X and Y independent. What is
the distribution of X + Y?
Since X and Y are independent, we have
m_{X+Y}(t) = mX(t)mY(t) = e^{λ(e^t−1)} e^{µ(e^t−1)} = e^{(λ+µ)(e^t−1)}
and we recognize this as the m.g.f. of a Poisson(λ + µ) distribution.
Homework 9 (due June 16th)
1. (30 pts) Show that if Xi ∼ Gamma(αi, β) for i ∈ {1, . . . , n} and X1, . . . , Xn
are independent, then X = Σ_{i=1}^n Xi ∼ Gamma(Σ_{i=1}^n αi, β). Hint: the
m.g.f. of Y ∼ Gamma(α, β) is
mY(t) = (1 − βt)^{−α}
which is well-defined and finite for t < 1/β.
Solution:
For t < 1/β, we have
mX(t) = Π_{i=1}^n mXi(t) = Π_{i=1}^n (1 − βt)^{−αi} = (1 − βt)^{−Σ_{i=1}^n αi}.
We easily see that the m.g.f. of X thus corresponds to the m.g.f. of a
Gamma(Σ_{i=1}^n αi, β) distribution.
2. (20 pts) We showed in class that if X ∼ N(0, 1), then X² ∼ χ²(1). We
mentioned that if X1, . . . , Xn iid ∼ N(0, 1), then Y = Σ_{i=1}^n Xi² ∼ χ²(n).
Use the result that you obtained from the previous exercise to prove
that Y ∼ χ²(n).
Solution:
Since Xi² ∼ χ²(1) ≡ Gamma(1/2, 2), we have
m_{Xi²}(t) = (1 − 2t)^{−1/2}
for any i ∈ {1, . . . , n}. By applying the result from the previous exercise we obtain
mY(t) = (1 − 2t)^{−Σ_{i=1}^n 1/2} = (1 − 2t)^{−n/2}
for t < 1/2, which corresponds to the m.g.f. of a Gamma(n/2, 2) ≡
χ²(n) distribution.
3. (25 pts) Let X ∼ Binomial(n1 , p) and Y ∼ Binomial(n2 , p) with X
and Y independent random variables. What is the distribution of
Z = X + Y?
Solution:
We have
mZ(t) = mX(t)mY(t) = [pe^t + 1 − p]^{n1} [pe^t + 1 − p]^{n2} = [pe^t + 1 − p]^{n1+n2},
which clearly corresponds to the m.g.f. of a Binomial(n1 + n2, p) distribution.
4. (25 pts) Derive the m.g.f. of X ∼ Uniform(a, b).
Solution:
We have
mX(t) = E(e^{tX}) = ∫_a^b (1/(b − a)) e^{tx} dx = (1/((b − a)t)) e^{tx}|_a^b = (e^{bt} − e^{at})/((b − a)t),
which is well-defined and finite for any t ∈ R.
5. (EXTRA CREDIT, 20 pts) Consider the discrete random variable X ∼
pX with mX well-defined and finite in an open neighborhood of 0, say
(−ε, ε) for some ε > 0. Show that
d^k/dt^k mX(t)|_{t=0} = E(X^k)
for k ∈ {0, 1, . . . }. Hint: use the Taylor series representation of the
function t ↦ e^{tx} around t = 0.
Solution:
Consider the Taylor series representation of the function g(t) = e^{tx}
around t = 0. We have
g(t) = Σ_{i=0}^∞ (g^{(i)}(0)/i!) t^i.
Notice that for i ∈ {0, 1, . . . },
g^{(i)}(t) = x^i g(t) = x^i e^{tx},
thus
g^{(i)}(0) = x^i g(0) = x^i.
It follows that
mX(t) = E(e^{tX}) = Σ_{x∈supp(X)} e^{tx} pX(x) = Σ_{x∈supp(X)} pX(x) Σ_{i=0}^∞ (x^i/i!) t^i,
hence
d^k/dt^k mX(t)|_{t=0} = Σ_{x∈supp(X)} pX(x) Σ_{i=0}^∞ (x^i/i!) (d^k/dt^k t^i)|_{t=0}
= Σ_{x∈supp(X)} pX(x) Σ_{i=0}^∞ (x^i/i!) i(i − 1) · · · (i − k + 1) t^{i−k}|_{t=0}
= Σ_{x∈supp(X)} x^k pX(x) = E(X^k).
The last equality follows because
• for 0 ≤ i < k, one of the factors in i(i − 1) · · · (i − k + 1) is 0
• for i > k, t^{i−k}|_{t=0} = 0
• for i = k, (x^i/i!) i(i − 1) · · · (i − k + 1) t^{i−k}|_{t=0} = (x^k/k!) k! = x^k.
Lecture 16
Recommended readings: N/A
Introduction to Random Processes
Informally: a random process is a mathematical model of a probabilistic
experiment that evolves in ‘time’ and generates a sequence of (random)
values.
A little more formally: consider a collection of random variables X =
{Xt }t∈T defined on a sample space Ω endowed with a probability measure
P , where the collection is indexed by a ‘time’ parameter t taking values in
the set T . Such collection is commonly referred to as a random process or
as a stochastic process. You might have noticed that once again, the word
‘time’ has quotation marks: this is because, while we often think about
the parameter t as time, we can also think of spatial random processes, i.e.
random processes that evolve with respect to space rather than time. In
other more complicated cases, the index set T in which the parameter t
takes its values can be an even more general domain that is not necessarily
related to time or space.
Recall that a random variable is a function mapping the sample space Ω
into a subset S of the real numbers. In the jargon of random processes the
set S is usually called the state space of the random process X. In fact, if
at the time t ∈ T the random variable Xt takes the value Xt = xt ∈ S, we
say that the state of the random process X at time t is xt .
Based on the features of the sets T and the sets S we say that X is a
• discrete-time random process, if T is a discrete set (i.e. T is finite or
countably infinite)
• continuous-time random process, if T is not a discrete set
• discrete-state random process, if S is a discrete set
• continuous-state random process, if S is a not a discrete set.
It is clear from above that X = {Xt }t∈T is a function of both the elements
of Ω and of the time parameter t. We can therefore think of X in at least
two ways:
• as a collection of random variables Xt taking values in S; i.e. for any
fixed time t, the random process X corresponds to a random variable
Xt valued in S
• as the collection of random functions of time (‘trajectories’); i.e. for
any fixed ω ∈ Ω, we can view X as the sample path or ‘trajectory’
t 7→ Xt (ω) which is a (random) function of time.
More generally, a random process is a mapping of the type
X:
Ω×T →S
(ω, t) 7→ Xt (ω)
GRAPHICAL INTERPRETATION
In contrast to the study of a single random variable or to the study
of a collection of independent random variables, the analysis of a random
process puts much more focus on the dependence structure between the
random variables in the process at different times and on the implications
of that dependence structure.
Because there are many real-world phenomena that can be interpreted
and modeled as random processes, there is a wide variety of random processes. In this class, however, we will focus only on the following processes:
• the Bernoulli process
• the Poisson process
• Brownian motion (the Wiener process)
• discrete-time Markov chains.
With respect to the processes listed above, the Bernoulli process and discrete-time Markov chains are discrete-time and discrete-state processes, whereas
the Poisson process is a continuous-time and discrete-state process. Brownian motion is instead a continuous-time and continuous-state random process.
We can further characterize the Bernoulli process and the Poisson process
as arrival-type processes: in these processes, we are interested in occurrences
that have the character of an ‘arrival’, such as customers walking into a
store, messages receptions at a receiver, incoming telephone calls, etc. In
the Bernoulli process arrivals occur in discrete-time while in the Poisson
process arrivals occur in continuous-time.
In the following, we provide a very basic and intuitive interpretation of
the aforementioned random processes.
The Bernoulli Process
The Bernoulli process is one of the most elementary random processes. It
can be visualized as an infinite sequence of independent coin tosses where
the probability of heads in each toss is a fixed number p ∈ [0, 1]. Thus, a
Bernoulli process is essentially a sequence of iid Bernoulli random variables.
Question: what is the state space of a Bernoulli process?
The Poisson Process
The Poisson process is a ‘counting’ random process in continuous time. The
basic idea is that we are interested in counting the number of events (occurring at a certain rate λ > 0) that occurred up to a certain time t. For
example, imagine that you own a store and you are counting the number
of customers that entered into your store up to a certain time t. A Poisson
process is a probabilistic model that can be used to model this phenomenon.
In a Poisson process, the number of events occurred up to a certain time t
is a random variable that has a Poisson distribution with parameter λt. We
will give a more precise definition of the Poisson process in the next lecture,
and we will study some interesting properties of this process.
Question: what is the state space of a Poisson process?
Brownian Motion
Brownian motion is one of the most important and well-studied continuous
time stochastic processes. It has almost surely continuous sample paths
which are almost everywhere not differentiable. Brownian motion is part of
a class of random processes called diffusion processes; these processes usually
arise as the solution of a stochastic differential equation. Brownian motion
is the key process for the definition of stochastic integrals, i.e. integrals of
the type
∫_0^T Xt(ω) dBt(ω)    (58)
where {Bt}_{t=0}^T is a Brownian motion and {Xt}_{t=0}^T is another random process.
The branch of mathematics devoted to studying the properties of these
integrals is called stochastic calculus.
From a more practical point of view, the study of Brownian motion is
historically associated to modeling the motion of a particle suspended in a
fluid, where the motion results from the collision of the particle with the
atoms and the molecules forming gas or the liquid in which the particle is
immersed.
Question: what is the state space of a Brownian motion?
Discrete-Time Markov Chains
Informally, a Markov chain is a random process with the property that,
conditional on the present state of the process, the future states of the
process are independent of the past states. Mathematically, this condition
is described by the Markov property
P(Xn = xn | Xn−1 = xn−1, Xn−2 = xn−2, . . . , X0 = x0) = P(Xn = xn | Xn−1 = xn−1).    (59)
How would you compactly summarize the Markov chains in the
pictures below?
Lecture 17
Recommended readings: Ross, sections 5.3.1 → 5.3.3
The Bernoulli Process and the Poisson Process
The Bernoulli Process
Let X = {Xi}_{i=1}^∞ be a sequence of iid Bernoulli(p) random variables. The
random process X is called a Bernoulli process.
Question: describe the sample paths of a Bernoulli process.
The Bernoulli process can be thought of as an arrival process: at each
time i either there is an arrival (i.e. Xi = 1, or equivalently stated, we
observe a success) or there is not an arrival (i.e. Xi = 0, or equivalently
stated, we observe a failure).
Many of the properties of a Bernoulli process are already known to us
from previous discussions.
Question: consider n distinct times; what is the probability distribution of the number of arrivals in those n times?
Question: what is the probability distribution of the first arrival?
Question: consider the time needed until the r-th arrival (r ∈
{1, 2, . . . }); what is the probability distribution associated to the
time of the r-th arrival?
The independence of the random variables in the Bernoulli process has
an important implication about the process. Consider a Bernoulli process
X = {Xi}_{i=1}^∞ and the process X−n = {Xi}_{i=n+1}^∞. Because the random variables in X−n are independent of the random variables in X \ X−n, it follows
that X−n is itself a Bernoulli process (starting at time n) which does not
depend on the initial n random variables of X. We say that the Bernoulli
process thus satisfies the fresh-start property.
Question: recall the memoryless property of the Geometric distribution. How do you interpret this property in light of the fresh-start property of the Bernoulli process?
The Poisson Process
A random process X = {Xt }t∈T is called a counting process if Xt represents
the number of events that occur by time t. More properly, if X is a counting
process, it must also satisfy the following intuitive properties:
• for any t ∈ T , Xt ≥ 0
• for any t ∈ T , Xt ∈ Z
• if s ≤ t, then Xs ≤ Xt
• for s ≤ t, Xt −Xs is the number of events occurring in the time interval
(s, t].
A counting process X is said to have independent increments if the number of events occurring in disjoint time intervals are independent. This
means that the random variables Xt − Xs and Xv − Xu are independent
whenever (s, t] ∩ (u, v] = ∅.
Furthermore, a counting process X is said to have stationary increments
if the distribution of the number of events that occur in a time interval only
depends on the length of the time interval. This means that the random
variables Xt+s − Xs have the same distribution for all s ∈ T .
Example:
Consider a Bernoulli process X = {Xi}_{i=1}^∞ and the process Y = {Yi}_{i=1}^∞ with
Yi = Σ_{j≤i} Xj. Then Y is a discrete-time counting process with independent
and stationary increments.
and stationary increments.
We are now ready to give rigoroulsy define what a Poisson process is.
A counting process X = {Xt }t≥0 is said to be a Poisson process with rate
λ > 0 if
• X0 = 0
• X has independent increments
• the number of events in any time interval of length t > 0 is Poisson
distributed with expectation λt, i.e. for any s, t ≥ 0 we have X(t +
s) − X(s) ∼ Poisson(λt).
Question: does a Poisson process have stationary increments?
There is another (equivalent) definition of Poisson process: a counting
process X = {Xt }t≥0 is said to be a Poisson process with rate λ > 0 if
130
1. X0 = 0
2. X has stationary and independent increments
3. P (Xh = 1) = λh + o(h)
4. P (Xh ≥ 2) = o(h)
Condition 3. says that the probability of observing an arrival in a small
amount of time is roughly proportional to h. Condition 4. says that it is
very unlikely to observe more than an arrival in a short amount of time. Try
and verify that these conditions are consistent with a Poisson distribution!
So, we know about the probability distribution of the count of arrivals
up to any time in a Poisson process. But what about the probability distribution of the inter-arrival times {Ti }∞
i=1 ? Notice that the event {T1 > t} is
equivalent to the event ‘there are no arrivals in the time interval [0, t], i.e.
{Xt = 0}. Thus,
P (T1 > t) = P (Xt = 0) = e−λt
from which we see that T1 ∼ Exponential(1/λ). Now, what is the distribution of T2 ? We have
P (T2 > t|T1 = s) = P (Xt+s − Xs = 0|T1 = s) = P (Xt+s − Xs = 0|Xs − X0 = 1)
= P (Xt+s − Xs = 0) = e−λt
which implies that T2 ∼ Exponential(1/λ) as well. By using the same argument, it is easy to see that all the inter-arrival times of a Poisson process
with rate λ are iid Exponential(1/λ) random variables, i.e. with mean 1/λ.
Question: what is the distribution of the waiting time until the
n-th arrival?
We can also describe a Poisson process in terms of inter-arrival times.
Suppose we start with {Ti}_{i=1}^∞, a sequence of iid Exponential random
variables with expectation 1/λ for some λ > 0. Define the random variables Si = Σ_{j≤i} Tj for
i ∈ {1, 2, . . . } and S0 = 0, corresponding to the waiting time until the i-th
arrival. Finally, let Xt = max{i : Si ≤ t}. Then the process X = {Xt}_{t≥0} is
a Poisson process with rate λ. From the definition of Poisson process, it is then easy to
see that the following hold true:
• Xt ≥ n ⇐⇒ Sn ≤ t
• Xt = n ⇐⇒ Sn ≤ t ∧ Sn+1 > t.
A similar argument can be made for the Bernoulli process using iid Geometric(p)
inter-arrival times.
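A simulation sketch of this construction (our own addition; λ = 2 and t = 3 are arbitrary): build the arrival times Si from iid Exponential inter-arrival times with mean 1/λ and check that the number of arrivals in [0, t] looks Poisson(λt).

```python
# Construct a Poisson process from exponential inter-arrival times and
# verify that Xt = max{i : Si ≤ t} has mean and variance ≈ λt.
import numpy as np

rng = np.random.default_rng(9)
lam, t, reps = 2.0, 3.0, 20_000
counts = np.empty(reps, dtype=int)
for r in range(reps):
    # generate far more inter-arrival times than can plausibly fit in [0, t]
    gaps = rng.exponential(scale=1.0 / lam, size=64)
    arrivals = gaps.cumsum()                                 # S1, S2, ...
    counts[r] = np.searchsorted(arrivals, t, side='right')   # Xt

print(counts.mean(), lam * t)    # ≈ λt
print(counts.var(), lam * t)     # Poisson: variance ≈ mean
```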
The Poisson Process as a Continuous-time Version of the Bernoulli
Process
Fix an arrival rate λ > 0, some time t > 0, and divide the time interval (0, t]
into n subintervals of length h = t/n. Consider now a Bernoulli process X =
{Xi }ni=1 defined over these time subintervals, where each Xi is a Bernoulli
random variable recording whether there was an arrival in the time interval
((i − 1)h, ih]. Impose now the conditions 3. and 4. of the definition of
Poisson process that we described above. We have that the probability p of
observing at least one arrival in any of these subintervals is
p = λh + o(h) = λ(t/n) + o(1/n).
Thus, the number of subintervals in which we record an arrival has a Binomial(n, p)
distribution with p.m.f.
p(x) = (n choose x) (λt/n + o(1/n))^x (1 − λt/n + o(1/n))^{n−x} 1{0,1,...,n}(x).
Now, let n → ∞ or equivalently h → 0 so that the partition of (0, t] becomes
finer and finer, and we approach the continuous-time regime. Following the
same limit calculations that we did in Homework 4, as n → ∞ we have
p(x) → e^{−λt} ((λt)^x/x!) 1{0,1,...}(x),
which is the p.m.f. of a Poisson(λt) distribution. Thus, the Poisson process
can be thought of as a continuous-time version of the Bernoulli process.
Lecture 18
Recommended readings: Ross, sections 5.3.4, 5.3.5, 5.4.1
Splitting, Merging, Further Properties of the Poisson Process, the Nonhomogeneous Poisson Process, and the Spatial
Poisson Process
Splitting a Bernoulli Process
Consider a Bernoulli process of parameter p ∈ [0, 1]. Suppose that we keep
any arrival with probability q ∈ [0, 1] or otherwise discard it with probability 1 − q. The new process thus obtained is still a Bernoulli process with
parameter pq. This is an example of thinned Bernoulli process. Similarly,
the process obtained by the discarded arrivals is a thinned Bernoulli process
as well (with parameter p(1 − q)).
This can be easily generalized to the case where we split the original
process in more than two subprocesses.
Merging a Bernoulli Process
Consider two independent Bernoulli processes, one with parameter p and
the other with parameter q. Consider the new process obtained by
Xi = { 1 if there was an arrival at time i in either process; 0 otherwise }.
The process thus obtained is a Bernoulli process with parameter p + q − pq.
This is easily generalized to the case where we merge more than two
independent Bernoulli processes.
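A quick simulation sketch of splitting and merging (our own addition; the values of p and q are arbitrary):

```python
# Thinning a Bernoulli(p) process with keep-probability q gives Bernoulli(pq);
# merging two independent processes gives Bernoulli(p + q − pq).
import numpy as np

rng = np.random.default_rng(10)
n, p, q = 10**6, 0.4, 0.3
x = rng.binomial(1, p, size=n)       # original Bernoulli(p) process
keep = rng.binomial(1, q, size=n)    # independent keep/discard decisions

kept = x * keep                      # thinned process
print(kept.mean(), p * q)            # ≈ pq

y = rng.binomial(1, q, size=n)       # an independent Bernoulli(q) process
merged = np.maximum(x, y)            # arrival in either process
print(merged.mean(), p + q - p * q)  # ≈ p + q − pq
```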
Splitting a Poisson Process
Consider a Poisson process X with rate λ > 0. Suppose that there can be
two different types of arrivals in the process, say type A and type B, and that
each arrival is classified as an arrival of type A with probability p ∈ [0, 1] and
as an arrival of type B with probability 1 − p and that the classification is
done independently from the rest of the process. Let XA and XB denote the
counting processes associated to arrivals of type A and type B respectively.
Then XA and XB are Poisson processes with parameters λp and λ(1 − p)
respectively. Furthermore, XA and XB are independent processes. The two
processes XA and XB are examples of thinned Poisson processes.
This can be easily generalized to the case where we consider n ≥ 3
different types of arrivals.
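A minimal Monte Carlo sketch of the splitting property (Python/NumPy; the values of λ, p, t, and the number of replications are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
lam, p, t, reps = 2.0, 0.3, 10.0, 50_000

N = rng.poisson(lam * t, size=reps)   # total arrivals in (0, t]
NA = rng.binomial(N, p)               # type-A arrivals: each kept with prob. p
NB = N - NA                           # type-B arrivals

print(NA.mean(), NA.var())            # both close to lam*p*t = 6
print(NB.mean(), NB.var())            # both close to lam*(1-p)*t = 14
print(np.corrcoef(NA, NB)[0, 1])      # close to 0, consistent with independence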
Merging two Poisson Processes
Consider two Poisson processes X1 and X2 with rates λ > 0 and µ > 0 respectively. Consider the new Poisson process X = X1 + X2 . The merged Poisson process X is a Poisson process with rate λ + µ.
Furthermore, any arrival in the process X has probability λ/(λ + µ) of being originated from X1 and probability µ/(λ + µ) of being originated from X2 .
This is easily generalized to the case where more than two Poisson processes are merged.
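The merging property can also be checked numerically. A minimal sketch (arbitrary rates; it simulates the two processes via their Exponential inter-arrival times):

import numpy as np

rng = np.random.default_rng(0)
lam, mu, t = 1.0, 3.0, 1000.0

# arrival times of the two processes on (0, t]
T1 = np.cumsum(rng.exponential(1 / lam, size=int(2 * lam * t)))
T2 = np.cumsum(rng.exponential(1 / mu, size=int(2 * mu * t)))
T1, T2 = T1[T1 <= t], T2[T2 <= t]

merged = np.sort(np.concatenate([T1, T2]))
gaps = np.diff(merged)
print(gaps.mean())            # close to 1/(lam + mu) = 0.25
print(len(T1) / len(merged))  # close to lam/(lam + mu) = 0.25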
Conditional Distribution of the Arrival Times
Suppose that we know that the number of arrivals up to time t > 0 in a
Poisson process with rate λ is n > 0. What is the conditional distribution
of the n arrival times S1 , . . . , Sn given that there were exactly n arrivals?
We will not prove this result (although in homework 10 you will prove the simple case where n = 1), but it turns out that the arrival times are uniformly scattered in the time interval [0, t]: more precisely, given that there were n arrivals, (S1 , . . . , Sn ) is distributed as the vector of order statistics of n iid Uniform(0, t) random variables. Because t can be arbitrarily
large, this gives us a new perspective on the Poisson process: conditional on
the number of arrivals, the Poisson process is the random process that gets
closest to generating values that are uniformly scattered over an interval of
arbitrarily large length (in the limit, [0, ∞)).
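This result also gives a simple recipe for simulating a Poisson process on a fixed window [0, t]: first draw the number of arrivals, then scatter them uniformly. A minimal sketch (arbitrary parameter values):

import numpy as np

rng = np.random.default_rng(0)
lam, t = 2.0, 5.0

n = rng.poisson(lam * t)                  # number of arrivals in [0, t]
arrivals = np.sort(rng.uniform(0, t, n))  # given n, times are sorted uniforms
print(arrivals)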
The Nonhomogeneous Poisson Process
So far we fixed the rate of a Poisson process to be a fixed scalar λ > 0.
However, we can generalize the definition of Poisson process and allow the
arrival rate to vary as a function of time.
We say that a random process X is a nonhomogeneous Poisson process
with intensity function λ(t) for t ≥ 0 if
• X is a counting process
• X0 = 0
• X has independent increments
• P (Xt+h − Xt = 1) = λ(t)h + o(h)
• P (Xt+h − Xt ≥ 2) = o(h)
In this case, for any 0 ≤ s < t, we have

Xt − Xs ∼ Poisson( ∫_s^t λ(u) du ).
Question: is the nonhomogeneous Poisson process described above
a random process with stationary increments?
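One standard way to simulate such a process is by thinning a homogeneous Poisson process: simulate candidate arrivals at a constant rate λmax that dominates λ(t), and keep a candidate at time s with probability λ(s)/λmax. A minimal sketch, where the intensity λ(t) = 1 + sin(t) and the horizon are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)

def lam(t):
    return 1.0 + np.sin(t)   # example intensity, bounded above by lam_max = 2

def nonhomogeneous_poisson(T, lam, lam_max):
    # candidate arrivals come from a homogeneous Poisson process of rate lam_max
    times, s = [], 0.0
    while True:
        s += rng.exponential(1 / lam_max)
        if s > T:
            break
        if rng.random() < lam(s) / lam_max:   # keep with probability lam(s)/lam_max
            times.append(s)
    return np.array(times)

print(nonhomogeneous_poisson(10.0, lam, 2.0))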
The Spatial Poisson Process
We saw that the Poisson process is a probabilistic model for random scattering of values across the positive real numbers. What if we want to randomly
scatter points over a spatial domain?
We say that a counting process N = {N (A)}A⊂X on a set X is a spatial
Poisson process with rate λ > 0 if
• N (∅) = 0
• N has independent increments, i.e. if A1 , . . . , An are disjoint sets then
N (A1 ), . . . , N (An ) are independent random variables
• N (A) ∼ Poisson(λ vol(A)).
The spatial Poisson process is a basic example of a point process. If we
denote Y1 , Y2 , . . . the ‘points’ of the process, the spatial Poisson random
process induces a random measure (the Poisson random measure) over the
subsets of X by means of
N(A) = Σ_{i=1}^{N(X)} 1_A (Yi ).
ILLUSTRATION: ASTEROIDS HITTING A PLANET
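A minimal sketch of simulating a spatial Poisson process on a rectangle (conditional on the count, the points are uniform on the region; λ and the rectangle are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
lam = 5.0                     # expected number of points per unit area
a, b = 4.0, 3.0               # rectangle [0, a] x [0, b]

n = rng.poisson(lam * a * b)  # N(A) ~ Poisson(lam * vol(A))
points = rng.uniform([0, 0], [a, b], size=(n, 2))  # uniform given the count
print(n, points[:5])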
Homework 10 (due June 23rd)
1. (25 pts) Consider a random walk process on the integers X = {Xi}∞_{i=0} with P (X0 = x0 ) = 1 for some x0 ∈ Z. The process has state space S = Z, the set of integer numbers. Furthermore, at any time i ∈ {0, 1, . . . } we have

P (Xi+1 = xi + 1|Xi = xi ) = p
P (Xi+1 = xi − 1|Xi = xi ) = 1 − p.

• Does the random process X satisfy the Markov property?
• Consider the new process Y = {Yi}∞_{i=0} with Yi = Σ_{j≤i} Xj . Does the random process Y satisfy the Markov property?
Solution:
• Yes, X satisfies the Markov property: for any i ∈ {0, 1, . . . }, the
state of the process X at time i + 1 only depends on the state of
the process at time i.
• Let Yi = Σ_{j≤i} Xj . Notice that Yi+1 = Yi + Xi+1 . We have

P (Yi+1 = yi+1 |Yi = yi , Yi−1 = yi−1 , . . . , Y0 = y0 )
= P (Yi + Xi+1 = yi+1 |Yi = yi , Yi−1 = yi−1 , . . . , Y0 = y0 )
= P (yi + Xi+1 = yi+1 |Yi = yi , Yi−1 = yi−1 , . . . , Y0 = y0 )
= P (Xi+1 = yi+1 − yi |Yi = yi , Yi−1 = yi−1 )
= P (Xi+1 = yi+1 − yi |Xi = yi − yi−1 )
= p if yi+1 − yi = yi − yi−1 + 1, and 1 − p if yi+1 − yi = yi − yi−1 − 1
(equivalently, p if yi+1 = 1 + 2yi − yi−1 and 1 − p if yi+1 = −1 + 2yi − yi−1 ).

Thus,

P (Yi+1 = yi+1 |Yi = yi , Yi−1 = yi−1 , . . . , Y0 = y0 ) = P (Yi+1 = yi+1 |Yi = yi , Yi−1 = yi−1 ),

which depends on yi−1 and not only on yi , so Y does not satisfy the Markov property.
2. (15 pts) Consider a Bernoulli process X and its inter-arrival times {Ti}∞_{i=1} . Show that Tr = Σ_{i=1}^r Ti with r ∈ {1, 2, . . . } is a Negative Binomial random variable with parameters r and p, where p is the probability of success of each of the random variables {Xi}∞_{i=1} . Hint: you may want to read again through lecture 15, where we illustrated that moment-generating functions are a useful tool to derive the distribution of a sum of independent random variables.
Solution:
The inter-arrival times of a Bernoulli process are iid Geometric(p) random variables. The m.g.f. of a Geometric distribution of parameter p is

m(t) = p e^t / (1 − (1 − p) e^t)

for t < − log(1 − p). Consider the random variable Tr = Σ_{i=1}^r Ti . Since the Ti are independent, we have

m_{Tr}(t) = m^r(t) = [p e^t / (1 − (1 − p) e^t)]^r

for t < − log(1 − p), which is the m.g.f. of the Negative Binomial distribution with parameters r and p. Thus, Tr ∼ NBinomial(r, p).
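A quick Monte Carlo check of this result (Python/NumPy sketch with arbitrary r and p; note that numpy's geometric sampler counts the trials up to and including the first success, which matches the convention used for the inter-arrival times here):

import numpy as np

rng = np.random.default_rng(0)
r, p, reps = 5, 0.3, 200_000

Tr = rng.geometric(p, size=(reps, r)).sum(axis=1)  # sum of r iid Geometric(p)

print(Tr.mean(), r / p)              # NBinomial(r, p) mean: r/p
print(Tr.var(), r * (1 - p) / p**2)  # NBinomial(r, p) variance: r(1-p)/p^2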
3. (30 pts) Consider a Poisson process X = {Xt }t≥0 and fix a time t0 ≥ 0. Show that the process Y = {Yt }t≥0 with Yt = Xt+t0 − Xt0 satisfies the fresh-start property, i.e. show that Y is also a Poisson process and that it is independent of {Xt }0≤t<t0 . Hint: show that Y satisfies
the properties that describe a Poisson process according to the first
definition of Poisson process that we discussed in class.
Solution:
We need to check that the process Y satisfies the properties that define
a Poisson process. We have
• Y0 = Xt0+0 − Xt0 = 0
• for any two time intervals (s, t] and (u, v] such that (s, t]∩(u, v] =
∅, we have Yt − Ys = Xt0 +t − Xt0 +s and Yv − Yu = Xt0 +v − Xt0 +u ;
since X is a Poisson process and (t0 +s, t0 +t]∩(t0 +u, t0 +v] = ∅ if
(s, t]∩(u, v] = ∅, it follows that Yt −Ys and Yv −Yu are independent
• for any s, t ≥ 0 we have Yt+s − Ys = Xt+s+t0 − Xs+t0 ∼ Poisson(λ(t + s + t0 − (s + t0 ))) ≡ Poisson(λt).
Notice that, because X is a Poisson process, for any t ≥ 0, Yt =
Xt+t0 − Xt0 is independent of any increment Xv − Xu with 0 ≤ u <
v < t0 . Thus, Y is independent from {Xt }0≤t<t0 . We thus conclude
that Poisson processes satisfy the fresh-start property.
4. (30 pts) Consider a Poisson process X = {Xt }t≥0 and the time of
the first arrival T1 . Show that, conditional on the event {Xt = 1}
(i.e. there was only one arrival until time t > 0), T1 is uniformly
distributed on [0, t]. Use the definition of conditional probability and express the event {T1 ≤ s ∩ Xt = 1} in terms of Poisson process count increments to show that the function s ↦ P (T1 ≤ s|Xt = 1) corresponds to a Uniform(0, t) c.d.f.
Solution:
We want to compute P (T1 ≤ s|Xt = 1). We have

P (T1 ≤ s|Xt = 1) = P (T1 ≤ s ∩ Xt = 1)/P (Xt = 1)

and therefore

P (T1 ≤ s|Xt = 1) = 0 if s < 0
P (T1 ≤ s|Xt = 1) = P (Xs − X0 = 1 ∩ Xt − Xs = 0)/P (Xt = 1) if 0 ≤ s < t
P (T1 ≤ s|Xt = 1) = 1 if s ≥ t

since, when 0 ≤ s < t, the intersection of the events ‘the first arrival was before time s’ and ‘there was exactly one arrival up to time t’ is ‘there was an arrival in the time interval [0, s] and there were no other arrivals in the time interval (s, t]’.
For 0 ≤ s < t, the independence and the stationarity of the increments of a Poisson process imply that

P (Xs − X0 = 1 ∩ Xt − Xs = 0)/P (Xt = 1) = P (Xs − X0 = 1)P (Xt − Xs = 0)/P (Xt = 1)
= [e^(−λs) λs · e^(−λ(t−s))] / [e^(−λt) λt] = s/t.
It follows that

P (T1 ≤ s|Xt = 1) = 0 if s < 0, s/t if 0 ≤ s < t, and 1 if s ≥ t,

which is the c.d.f. of a Uniform(0, t) distribution.
Lecture 19
Recommended readings: Ross, sections 10.1, 10.3, 10.5, 10.6
Brownian Motion
Multivariate Normal Distribution
Before introducing the Brownian motion, it is worthwhile to recall that an n-dimensional random vector (X1 , . . . , Xn ) is distributed according to a (nondegenerate) multivariate Normal distribution with mean µ ∈ Rn and
covariance matrix Σ ∈ Rn×n if its joint density function is
f (x1 , . . . , xn ) = f (x) = (1/√((2π)^n det(Σ))) e^(−(1/2)(x−µ)^T Σ^(−1) (x−µ))
and the covariance matrix Σ is a positive definite matrix. If the covariance
matrix is only positive semi-definite, the random vector (X1 , . . . , Xn ) is said
to have a degenerate Normal distribution. Two key properties of a Normal
random vector are
1. for i ≠ j, Σi,j = Σj,i = 0 ⇐⇒ Xi and Xj are independent (zero correlation is equivalent to independence for jointly Normal random variables)

2. for any α = (α1 , . . . , αn ) ∈ Rn , the linear combination Σ_{i=1}^n αi Xi has a Normal distribution with expectation α^T µ and variance α^T Σα.
Brownian Motion
A random process B = {Bt }t≥0 valued in R is called Brownian motion, if it
satisfies:
• B0 = 0
• B has independent increments
• for any s, t ≥ 0, Bt − Bs ∼ N (0, |t − s|)
The following properties follow from the definition of Brownian motion.
1. for any s, t ≥ 0, Cov(Bs , Bt ) = E(Bs Bt ) = min{s, t} = s ∧ t
2. for any 0 ≤ t1 < t2 < · · · < tn , the random vector (Bt1 , Bt2 , . . . , Btn ) has a multivariate Normal distribution with null expectation and covariance matrix Σ with entries Σj,k = min{tj , tk }, i.e.

[t1  t1  t1  . . .  t1]
[t1  t2  t2  . . .  t2]
[t1  t2  t3  . . .  t3]
[ .   .   .   . .   . ]
[t1  t2  t3  . . .  tn]
3. Brownian motion is a Gaussian process, i.e. for any 0 ≤ t1 < t2 <
· · · < tn and for any α1 , α2 , . . . , αn ∈ R, the random variable α1 Bt1 +
· · · + αn Btn is a Normal random variable
Conversely, if a random process B = {Bt }t≥0 satisfies properties 1., 2., and
3., and B0 = 0, then B is a Brownian motion.
Note that for any t ≥ 0 we have E(Bt ) = 0 and V (Bt ) = V (Bt − B0 ) = t
so the variance of the process increases linearly with time.
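A minimal sketch of simulating one Brownian motion sample path on [0, T] via independent Normal(0, dt) increments (T and the grid size are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
T, n = 1.0, 1000
dt = T / n

# independent N(0, dt) increments, cumulatively summed
increments = rng.normal(0.0, np.sqrt(dt), size=n)
B = np.concatenate([[0.0], np.cumsum(increments)])  # enforces B_0 = 0

print(B[-1])   # one draw of B_T ~ N(0, T)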
Properties of the Brownian Motion Sample Paths
Let us recall that a function f : R → R is said to be γ-Hölder continuous
(for some γ > 0) on an interval [a, b] if
sup_{x,y∈[a,b], x≠y} |f (x) − f (y)| / |x − y|^γ < ∞.
The sample paths t 7→ Bt (ω) of a Brownian motion satisfy the following
properties:
• they are almost surely continuous, i.e. P ({ω ∈ Ω : t ↦ Bt (ω) is a continuous function}) = 1

• they are almost surely γ-Hölder continuous for γ ∈ (0, 1/2) on the interval [0, T ] for any T > 0, i.e. P ({ω ∈ Ω : t ↦ Bt (ω) is a γ-Hölder-continuous function on [0, T ]}) = 1 for any γ ∈ (0, 1/2) and for any T > 0; furthermore, with probability 1 they are not γ-Hölder continuous on any [a, b] ⊂ R+ if γ > 1/2
• they are almost surely nowhere differentiable.
Brownian Motion with Drift
We say that X = {Xt }t≥0 is a Brownian motion with drift coefficient µ ∈ R
and variance σ 2 > 0 if for any t ≥ 0
Xt = σBt + µt
and B = {Bt }t≥0 is a Brownian motion. In this case we have, for any t ≥ 0,
E(Xt ) = µt and V (Xt ) = tσ 2 , thus both the expected value and the variance
increase linearly with time.
Geometric Brownian Motion
Consider the price of a stock X = {Xt }t≥0 evolving as a function of time.
Often, it is reasonable to assume that the percentage changes in the price
of the stock are independent and identically distributed random variables.
In discrete time, this means that we assume that the random variables {Xn+1/Xn}∞_{n=0} are iid.
Let Yn+1 = Xn+1 /Xn for n ∈ {0, 1, . . . } so that Xn+1 = Yn+1 Xn . By
iteration, it follows that
Xn+1 = Yn+1 Yn Yn−1 . . . Y1 X0
thus

log(Xn+1 ) = Σ_{i=1}^{n+1} log(Yi ) + log(X0 ).
After suitable normalization (and moving to the continuous-time regime)
we can approximate the random variables {log(Xn )}∞
n=0 with a Brownian
motion (possibly with a drift) — question for you: why does this make
sense? In this case, the continuous-time random process X is called geometric Brownian motion.
More compactly, let B be a Brownian motion with drift coefficient µ ∈ R and variance σ^2 > 0. The random process X = e^B is a geometric Brownian motion.
In this case, for any t ≥ 0 we have

E(Xt ) = e^((µ + σ^2/2) t)
V (Xt ) = e^(2(µ + σ^2/2) t) (e^(σ^2 t) − 1).
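A quick Monte Carlo check of these two formulas (Python/NumPy sketch with arbitrary µ, σ, and t; at a fixed time t, Bt ∼ N(µt, σ^2 t), so no full path is needed):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, t, reps = 0.1, 0.4, 1.0, 100_000

# B_t for a Brownian motion with drift mu and variance sigma^2
Bt = mu * t + sigma * rng.normal(0.0, np.sqrt(t), size=reps)
Xt = np.exp(Bt)

print(Xt.mean(), np.exp((mu + sigma**2 / 2) * t))                              # E(X_t)
print(Xt.var(), np.exp(2 * (mu + sigma**2 / 2) * t) * (np.exp(sigma**2 * t) - 1))  # V(X_t)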
Brownian Bridge
Let B = {Bt }t∈[0,1] be a Brownian motion. Consider the conditional random
process B̄ = B|{B1 = 0}. The random process B̄ is called a Brownian bridge;
the name is suggested by the fact that the sample paths of B̄ are almost
surely ‘suspended’ between B̄0 = 0 and B̄1 = 0, thus forming a bridge
between the points (0, 0) and (1, 0).
We have that, for any t ∈ [0, 1], E(B̄t ) = 0 and V (B̄t ) = t(1 − t) —
question for you: at which time does the Brownian bridge have
the highest variability? Furthermore, for any 0 < s < t < 1, we have
that Cov(B̄s , B̄t ) = s(1 − t).
A Brownian bridge can be also expressed as
B̄t = Bt − tB1
where B = {Bt }t∈[0,1] is a Brownian motion.
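A minimal numerical check of the representation Bt − tB1 and of the variance t(1 − t) (arbitrary t and sample size):

import numpy as np

rng = np.random.default_rng(0)
reps, t = 100_000, 0.3

B_t = rng.normal(0.0, np.sqrt(t), size=reps)
B_1 = B_t + rng.normal(0.0, np.sqrt(1 - t), size=reps)  # B_1 = B_t + independent increment
bridge = B_t - t * B_1                                  # bridge value at time t

print(bridge.mean())               # close to 0
print(bridge.var(), t * (1 - t))   # close to t(1-t) = 0.21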
Lecture 20
Recommended readings: Ross, sections 4.1 → 4.2
Markov Chains (part 1)
The Markov Property
Let X = {Xn}∞_{n=0} be a discrete-time random process with discrete state space S. X is said to be a Markov chain if it satisfies the Markov property

P (Xn+1 = xn+1 |Xn = xn , Xn−1 = xn−1 , . . . , X0 = x0 ) = P (Xn+1 = xn+1 |Xn = xn )

for any n ≥ 0.
The Markov property can be stated in the following equivalent way. Let
h be any bounded function mapping S ∞ → R. Then X satisfies the Markov
property if
E(h(Xn+1 , Xn+2 , . . . )|Xn , Xn−1 , . . . , X0 )
= E(h(Xn+1 , Xn+2 , . . . )|Xn ).
How can we interpret the above condition in an algorithmic fashion? Think
about h being any algorithm you may use on the future states of the process (i.e. h is any feature of the future states of the process that you may
be interested to consider). Then your ‘best’ prediction of the algorithm’s
output on the process’s future states given the entire history of the process
up to the current state is a function only of the current state.
Example in class:
Let X = {Xn}∞_{n=0} be a random process where {Xn}∞_{n=0} is a collection of
n=0 be a random process where {Xn }n=0 is a collection of
independent random variables. Then, for all n ≥ 0,
E(h(Xn+1 , Xn+2 , . . . )|Xn , Xn−1 , . . . , X0 )
= E(h(Xn+1 , Xn+2 , . . . )) = E(h(Xn+1 , Xn+2 , . . . )|Xn ).
Thus X satisfies the Markov property.
We are usually interested in two kinds of questions about Markov chains:
• how will the system evolve in the short term? In this case, we use
conditional probabilities to compute the likelihood of various events
and evolutionary paths.
• how will the system evolve in the long term? In this case, we use
conditional probabilities to identify the ‘limiting behavior’ over vast
stretches of time, or the ‘equilibrium’ of the system.
Transition Probabilities
From now on, to simplify the notation, we will assume that the state space
S is the set (or a subset) of the positive integers {0, 1, 2, . . . }. The evolution
of a Markov chain with respect to time is described in terms of the transition
probabilities
P (Xn+1 = j|Xn = i) = Pn;i,j
Notice that in general the transition probability from state i to state j can
vary with time. It is very convenient, especially for computational reasons,
to arrange the transition probabilities into a transition probability matrix
(t.p.m.) P n .
When the transition probabilities Pn;i,j do not change over time, we
say that the Markov chain is time-homogeneous. In this case, we drop the
subscript n and we simply write
P (Xn+1 = j|Xn = i) = Pi,j .
Similarly, the t.p.m. of a time-homogeneous Markov chain is denoted P . In this class we will mainly focus on time-homogeneous Markov chains.
Exercise in class:
Write the t.p.m. of the time-homogeneous Markov chain in the figure below (the three-state example from last time).

[Figure: transition diagram on the states 0, 1, 2 with transition probabilities 0.6 (0 → 0), 0.4 (0 → 1), 0.7 (1 → 0), 0.3 (1 → 2), and 1.0 (2 → 1).]
Exercise in class:
Consider a simple random walk starting from 0. At each time n, the process increments by 1 with probability p ∈ [0, 1] or decrements by 1 with probability 1 − p. Describe the t.p.m. of this process.
Stochastic Matrices
Consider a matrix A. The matrix A is said to be a stochastic matrix if it
satisfies the two following properties:
1. all the entries of A are non-negative, i.e. Ai,j ≥ 0 for all i, j

2. each row of A sums to 1: Σ_j Ai,j = 1 for all i.

Notice that we call the matrix ‘stochastic’, but such a matrix is not random! If the matrix A is such that 1. and 2. are satisfied and, furthermore, also each column of A sums to 1 (Σ_i Ai,j = 1 for all j), then A is said to be a doubly stochastic matrix.
Question: is a t.p.m. a stochastic matrix? Why?
Question: is the t.p.m. of the 3-state Markov chain above a doubly stochastic matrix?
Question: is the t.p.m. of the simple random walk above a doubly
stochastic matrix?
Using Transition Probabilities
Let’s start with an example. Consider again the 3-state Markov chain with its t.p.m.

P =
[0.6  0.4  0  ]
[0.7  0    0.3]
[0    1    0  ]
Suppose that the initial distribution of the Markov chain at time 0 is pX0 =
(P (X0 = 0), P (X0 = 1), P (X0 = 2)) = (0.2, 0, 0.8). What is the probability
distribution of X1 ? What is the probability distribution of Xn , for any n ≥ 1?
Let P (n) denote the n-step t.p.m.. Based on the example above, we see
that P (n) = P P P . . . P , i.e. P (n) is the n-th power of P , P n , with respect
to the usual matrix multiplication.
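A minimal NumPy sketch of these computations for the 3-state example above (the distribution of Xn is the row vector pX0 multiplied by the n-th matrix power of P; n = 10 is an arbitrary choice):

import numpy as np

P = np.array([[0.6, 0.4, 0.0],
              [0.7, 0.0, 0.3],
              [0.0, 1.0, 0.0]])
p0 = np.array([0.2, 0.0, 0.8])

assert np.allclose(P.sum(axis=1), 1.0)   # P is a stochastic matrix

p1 = p0 @ P                              # distribution of X_1
pn = p0 @ np.linalg.matrix_power(P, 10)  # distribution of X_10
print(p1, pn)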
This is made precise in terms of the Chapman-Kolmogorov equations,
which state that for all positive integers n, m we have
P (n+m) = P (n) P (m)
with P (0) = I.
Therefore, we conclude that a discrete-state time-homogeneous Markov chain is completely described by the initial probability distribution pX0 and the t.p.m. P .
Lecture 21
Recommended readings: Ross, section 4.3
Markov Chains (part 2)
Probability of Sample Paths
Consider a sample path (i0 , i1 , . . . , in ). Note that the Markov property
makes it very easy to compute the probability of observing this sample
path. In fact we have
P (X0 = i0 ∩ X1 = i1 ∩ · · · ∩ Xn = in )
= P (X0 = i0 )P (X1 = i1 |X0 = i0 )P (X2 = i2 |X1 = i1 ∩ X0 = i0 ) × . . .
× P (Xn = in |Xn−1 = in−1 ∩ · · · ∩ X1 = i1 ∩ X0 = i0 )
= P (X0 = i0 )P (X1 = i1 |X0 = i0 )P (X2 = i2 |X1 = i1 ) . . . P (Xn = in |Xn−1 = in−1 ).
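A minimal sketch of this computation (Python/NumPy; the function name and the example path are illustrative choices, and the t.p.m. is the 3-state example from the previous lecture):

import numpy as np

def path_probability(p0, P, path):
    # P(X0 = i0, X1 = i1, ..., Xn = in) for a time-homogeneous Markov chain
    prob = p0[path[0]]
    for i, j in zip(path[:-1], path[1:]):
        prob *= P[i, j]   # one transition probability per step, by the Markov property
    return prob

P = np.array([[0.6, 0.4, 0.0],
              [0.7, 0.0, 0.3],
              [0.0, 1.0, 0.0]])
p0 = np.array([0.2, 0.0, 0.8])
print(path_probability(p0, P, [2, 1, 0, 0]))  # 0.8 * 1.0 * 0.7 * 0.6 = 0.336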
Communicating Classes
We now turn to the questions of which states of the process can be reached
from which. This type of analysis will help us understand both the short
term and the long term behavior of a Markov chain.
Let i, j ∈ S be two states (possibly with i = j). We say that j is accessible from i if, for some n ≥ 0, P^(n)_{i,j} > 0. This is often denoted i → j.
Let again i, j ∈ S be two states (possibly with i = j). We say that i and
j communicate if i → j and j → i. This is often denoted i ↔ j.
The relation of communication satisfies the following properties for each
i, j, k ∈ S:
1. i ↔ i
2. i ↔ j =⇒ j ↔ i
3. i ↔ j ∧ j ↔ k =⇒ i ↔ k.
Communication is therefore an equivalence relation. This means that
the state space S can be partitioned into disjoint subsets, where all the
states that communicate with each other belong to the same subset. These
subsets are called communicating classes. Each state of the state space S
lies in exactly one communicating class.
Exercise in class:
Consider the following t.p.m. of a Markov chain:
1 1

2
2 0 0
 1 1 0 0
2
2

P =
1 1 1 1
4
4
4
4
0 0 0 1
What are the communicating classes of this Markov chain?
Exercise in class:
Consider the following t.p.m. of a Markov chain:

P =
[1    0    0    0    0    0     0   ]
[0.1  0    0.4  0    0.4  0     0.1 ]
[0    0.3  0    0.7  0    0     0   ]
[0    0    0.3  0    0.5  0     0.2 ]
[0    0.3  0    0.3  0    0.4   0   ]
[0    0    0    0    0    0.25  0.75]
[0    0    0    0    0    0.6   0.4 ]
What are the communicating classes of this Markov chain?
Absorbing Classes
A communicating class is called absorbing if the Markov chain never leaves
that class once it enters.
Question: what communicating classes are absorbing in the above
examples?
Irreducibility
A Markov chain is said to be irreducible if it has only one communicating
class (trivial partition of the state space).
Exercise in class:
Consider again the random walk on the pentagon. Is this Markov chain
irreducible?
If a Markov chain is not irreducible, then we can partition its state space
S into disjoint sets

S = D ∪ (∪i Ci )    (60)
where each Ci is an absorbing communicating class and D is a union of
non-absorbing communicating classes.
Note the following key implication: eventually, the Markov chain will
leave D and enter exactly one of the Ci ’s. At that point the Markov chain
will act like an irreducible Markov chain on the reduced state space Ci .
Recurrent and Transient States
Informally, we say that a state i ∈ S is recurrent if the probability that, starting from state i, the Markov chain will ever visit state i again is 1. On the other hand, if the probability that, starting from state i, the Markov chain will ever visit state i again is strictly less than 1, then we say that i is a transient state.
Note that, based on the above definition,
• if i ∈ S is recurrent then, if the Markov chain starts from state i, state
i will be visited infinitely often
• if i ∈ S is transient then, each time the Markov chain visits state i,
there is a positive probability p ∈ (0, 1] that state i will never be visited
again. Therefore, the probability that starting from state i the Markov
chain will visit state i exactly n times is (1 − p)n−1 p; this means that
the amount of time that the Markov chain spends in state i (starting
from state i) is a Geometric random variable with parameter p. This
implies that, if the Markov chain starts from state i, the expected
amount of time spent in state i is 1/p.
From the above, it follows that a state i ∈ S is recurrent if and only
if, starting from state i, the expected number of times that the Markov
chain visits state i is infinite. This allows us to characterize recurrent and
transient states in terms of transition probabilities. Let In = 1{Xn =i} (Xn )
be the random variable indicating whether the Markov chain is visiting state i at time n. Then Σ_{n=0}^∞ In is the amount of time spent by the Markov chain in state i. We have

E( Σ_{n=0}^∞ In | X0 = i ) = Σ_{n=0}^∞ E(In |X0 = i) = Σ_{n=0}^∞ P (Xn = i|X0 = i) = Σ_{n=0}^∞ P^(n)_{i,i}
and therefore

Σ_{n=0}^∞ P^(n)_{i,i} = ∞ ⇐⇒ i is recurrent
Σ_{n=0}^∞ P^(n)_{i,i} < ∞ ⇐⇒ i is transient.
Notice also that if i ∈ S is recurrent then every state in the same communicating class is also recurrent. The property of being recurrent or transient
is a class property.
Example in class:
Consider again the random walk on the integers. Since this Markov chain is irreducible (all the states communicate), either all the states are recurrent or they are all transient. It turns out that the random walk on the integers is recurrent if and only if the probability of a positive/negative increment is exactly 1/2. Otherwise, the random walk on the integers is transient (see Ross, pages 208-209 for a proof).
Exercise in class: consider again the previous two Markov chains.
Which classes are recurrent/transient?
Lecture 22
Recommended readings: Ross, section 4.4
Markov Chains (part 3)
Periodicity
Let’s start with a simple example. Consider the t.p.m.

P =
[0  1  0]
[0  0  1]
[1  0  0].

We have that

P^(2) =
[0  0  1]
[1  0  0]
[0  1  0]

and

P^(3) =
[1  0  0]
[0  1  0]
[0  0  1].
Starting from any of the states 0, 1, 2, how often does the Markov chain
return to the initial state?
The above example suggests that this simple Markov chain has a built-in
periodicity of order 3. How can we make this idea more formal and generalize
it to more complicated Markov chains?
For any state s ∈ S of a Markov chain, we define the period of s as

d(s) = gcd{n ≥ 1 : P^(n)_{s,s} > 0}

where gcd stands for ‘greatest common divisor’. This implies that P^(n)_{s,s} = 0 unless n = m d(s) for some m ∈ Z.
An irreducible Markov chain is said to be aperiodic if d(s) = 1 for all
states s ∈ S. It is called strongly aperiodic if Ps,s > 0 for some s ∈ S.
Exercise in class:
Consider

P =
[0    1/2  1/2]
[1/4  0    3/4]
[1/8  7/8  0  ].
Is this Markov chain periodic, aperiodic, or strongly aperiodic?
Exercise in class:
Consider

P =
[1/4  1/2  1/4]
[1/4  0    3/4]
[1/8  7/8  0  ].
Is this Markov chain periodic, aperiodic, or strongly aperiodic?
Periodicity is another class property. This means that the period is
constant over any fixed communicating class of a Markov chain.
Cyclic Class Decomposition
Let X be an irreducible Markov chain with period d. There exists a partition of S

S = ∪_{k=1}^d Uk

such that

P (Xn+1 ∈ Uk+1 |Xn = s) = 1

for s ∈ Uk and k = 1, . . . , d, with the convention Ud+1 = U1 .
The sets U1 , . . . , Ud are called cyclic classes of X because X cycles through them successively. It thus follows that the ‘accelerated’ Markov chain Xd = {Xid}∞_{i=1} has transition probability matrix P^d and each Uk is an absorbing, irreducible, aperiodic class.
In light of this result, we can rewrite the state space decomposition of equation (60) as

S = D ∪ (∪i Ci ) = D ∪ (∪i ∪_{j=1}^{di} Ui,j )

where each absorbing class Ci is an irreducible Markov chain with period di whose state space can be partitioned into the di cyclic classes Ui,1 , . . . , Ui,di .
Positive and Null Recurrence
We already defined the notion of recurrence in the previous lecture. Here,
we introduce a somewhat technical (but important) point. Suppose that
state s ∈ S is recurrent. Then, based on the previous lecture, this means
that if the Markov chain starts from state s, the chain will visit state s
in the future infinitely often. We now make the following distinction for a
recurrent state s (or, equivalently, for the communicating class of s, since
recurrence is a class property):
• s is positive recurrent if, starting from s, the expected amount of time
until the Markov chain visits again state s is finite
• s is null recurrent if, starting from s, the expected amount of time
until the Markov chain visits again state s is infinite.
While there is indeed a difference between positive and null recurrent states,
it can be shown that in a Markov chain with finite state space all recurrent
states are positive recurrent.
Ergodicity, Equilibrium and Limiting Probabilities
Consider a Markov chain with t.p.m. P . Suppose there exists a probability
distribution π over the state space such that πP = π. This implies that
π = πP (n) for all integers n ≥ 0. The probability distribution π is called the
equilibrium distribution or the stationary distribution of the Markov chain.
In equilibrium, πj is the proportion of time spent by the Markov chain in state j.
A communicating class is said to be ergodic if it is positive recurrent and
aperiodic. We have the following important result: for an irreducible and
ergodic Markov chain the limiting probabilities
πj = lim_{n→∞} P^(n)_{i,j} ,   j ∈ S

exist and the limit does not depend on i. Furthermore, the vector π of the above limiting probabilities is the unique solution to π = πP and Σ_j πj = 1 (hence it corresponds to the stationary distribution of the Markov chain).
The limiting probabilities can also be shown to correspond to the long-term proportion of time that the Markov chain spends in each state. More precisely, we have that

πj = lim_{n→∞} vij (n)/n
where vij (n) denotes the expected number of visits to state j, starting from
state i, within the first n steps.
Equilibrium When the State Space is Finite
Let U be a |S| × |S| matrix with entries all equal to 1, let u be a |S|-dimensional vector with entries all equal to 1, and let o be a |S|-dimensional vector with entries all equal to 0. Let X be an irreducible, aperiodic, and finite state (and thus ergodic) Markov chain. If we want to find the stationary distribution of X we must solve the system

π(I − P ) = o
πU = u.
Exercise for home:
Consider the t.p.m.

0.6 0.4 0
P = 0.7 0 0.3 .
0
1
0

Apply the above formula to verify that π ≈ (0.5738, 0.3279, 0.0984).
It can be shown that if the matrix P is doubly stochastic, we have that
πj = 1/|S| for all j ∈ S.
Exercise in class:
Consider again the random walk on the pentagon. What is the stationary
distribution of this Markov chain?
156
Lecture 23
Recommended readings: Ross, section 4.6
Markov Chains (part 4)
More on Equilibrium Distribution and Limiting Probabilities
In the previous lecture, we studied that if an irreducible Markov chain is
ergodic (i.e. positive recurrent and aperiodic), the limiting probabilities
lim Pijn
n→∞
j∈S
(61)
exist and these limits do not depend on i or on the initial distribution pX0 .
Furthermore, we said that, whenever a stationary distribution π exists, the
limiting probabilities above correspond exactly to the entries of π.
Let us know clarify a slightly technical point. While it is true that in
order for the Markov chain to admit the above limiting probabilities we
need the Markov chain to be irreducible, positive recurrent and aperiodic
(i.e. irreducible and ergodic), the existence of a stationary distribution does
not require the aperiodicity assumption.
Hitting Probabilities (part 1)
Recall the decomposition of the state space of equation (60). Suppose that
the Markov chain starts at a state s ∈ D which is not absorbing. At some
point in time, the Markov chain will eventually enter one of the absorbing
classes Ci of the decomposition (60) and will be ‘stuck’ in that class from
that time on. What is the probability that, starting from s ∈ D, the chain
will hit the absorbing class Ci ?
Define hi (s) as the probability that, starting from a state s ∈ S, the
chain will hit the absorbing class Ci . We have
X
hi (s) = Σ_{s′∈Ci} P_{s,s′} + Σ_{u∈D} P_{s,u} hi (u)
P Define P D to be the submatrix of P associated to D and bi (s) =
s0 ∈Ci Pss0 for s ∈ D. When S is finite, we can rewrite the above equation in vector/matrix form as
(I − P D )hi = bi =⇒ hi = (I − P D )−1 bi .
157
Example in class:
Let X be a Markov chain with t.p.m.

1 0
1 0
2 1
P =
 0 12
0
3
0 0
0
0
1
3
1
6
1
2
0
1
3
0
0
0

0
0

0

1
3
1
defined on S = {0, 1, 2, 3, 4}. We have that C1 = {0}, C2 = {4}, and
D = {1, 2, 3}. Then,


0 13 61
P D =  21 0 12  ,
1
1
3
3 0
1
2
b1 =  0  ,
0
 
0
b2 =  0  ,
1
3
and
(I − P D )−1


30 14 12
1 
24 34 21 .
=
19
18 16 30
Thus,


15
1  
12
h1 =
19
9
and


4
1  
7 .
h2 =
19
10
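A minimal NumPy check of these computations (it reproduces (I − P_D)^(−1), h1, and h2, up to the common factor 1/19):

import numpy as np

P_D = np.array([[0, 1/3, 1/6],
                [1/2, 0, 1/2],
                [1/3, 1/3, 0]])
b1 = np.array([1/2, 0, 0])
b2 = np.array([0, 0, 1/3])

M = np.linalg.inv(np.eye(3) - P_D)
print(19 * M)       # approx. [[30, 14, 12], [24, 34, 21], [18, 16, 30]]
print(19 * M @ b1)  # approx. (15, 12, 9)
print(19 * M @ b2)  # approx. (4, 7, 10)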
Block Decomposition of the T.P.M.
From now on, we consider a finite state Markov chain with a set A of |A| = a absorbing states and a set T of |T | = t transient states. We can rearrange the t.p.m. of the Markov chain as
P =
[I  0]
[S  T]

where I is an a × a identity matrix, T is a t × t matrix, S is a t × a matrix, and 0 is an a × t matrix of zeroes. Notice that the values of S and T are not unique: they depend on how you order the states.
Hitting Probabilities (part 2), Expected Hitting Times, Expected
Number of Visits
In terms of the block decomposition of the t.p.m., it can be shown that we
can obtain the hitting probabilities matrix by simply computing
(I − T )−1 S.
On the other hand, the matrix (I − T )−1 gives us the expected number
of visits to state s′ starting from state s, with s, s′ ∈ T .
Finally, if we let u denote a vector of ones, it can be shown that m =
(I − T )−1 u returns the vector of the expected time until, starting from a
transient state s ∈ T , the Markov chain hits an absorbing state.
Example in class (the rat maze):
A rat is placed in the maze depicted below. If the rat is in the rooms 2, 3,
4, or 5, it chooses one of the doors at random with equal probability. The
rat stays put once it reaches either the food or the shock.
The t.p.m. for the chain is (ordering the states as F, 2, 3, 4, 5, S)

P =
[1    0    0    0    0    0  ]
[1/2  0    0    1/2  0    0  ]
[1/3  0    0    1/3  1/3  0  ]
[0    1/3  1/3  0    0    1/3]
[0    0    1/2  0    0    1/2]
[0    0    0    0    0    1  ]
Using the formulae above, we have that the matrix of the expected number
of visits is
 26 2
5
2 
21
4
 21
 10
21
2
21
7
10
7
4
7
5
7
7
4
7
10
7
2
7
while the matrix of hitting probabilities is
5 2
7
4
 73

7
2
7
7
3
7
4
7
5
7
159
21
10 
21  ,
4 
21
26
21
and the expected times until absorption are

m = (49/21, 56/21, 56/21, 49/21)^T .

How do we interpret these numbers? For instance, starting from room 2 the rat needs on average 49/21 ≈ 2.33 steps to reach either the food or the shock and, from the hitting probabilities above, it reaches the food before the shock with probability 5/7.

[Figure: the rat maze, with the Food (F) room, the rooms 2, 3, 4, and 5, and the Shock (S) room.]
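A minimal NumPy sketch that reproduces the rat-maze computations via the block decomposition (the transient states are ordered 2, 3, 4, 5 and the absorbing states F, S, following the t.p.m. above):

import numpy as np

# T: transitions among the transient rooms; S: transitions into F and S
T = np.array([[0, 0, 1/2, 0],
              [0, 0, 1/3, 1/3],
              [1/3, 1/3, 0, 0],
              [0, 1/2, 0, 0]])
S = np.array([[1/2, 0],
              [1/3, 0],
              [0, 1/3],
              [0, 1/2]])

M = np.linalg.inv(np.eye(4) - T)  # expected numbers of visits among transient states
print(M @ S)            # hitting probabilities: rows 2, 3, 4, 5; columns F, S
print(M @ np.ones(4))   # expected times until absorption: (49/21, 56/21, 56/21, 49/21)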