Moments and Discrete Random Variables
OPRE 7310 Lecture Notes by Metin Çakanyıldırım
Compiled at 13:03 on Thursday 13th October, 2016

1 Random Variables

In a probability space (Ω, F, P), the element ω ∈ Ω is not necessarily a numerical value. In the experiment of tossing a coin once, ω is either H or T. It is useful to convert the outcomes of an experiment to numerical values, which can then be used in the usual algebraic operations of addition, subtraction, etc. To convert each ω to a numerical value (a real number), we use a function X(ω): Ω → ℜ. That is, X(ω) is a function that maps the sample space to the real numbers.

Example: Let Ω_n = {H, T} be the sample space for the nth toss of a coin. We can define a random variable as the indicator function X_n(ω_n) = 1I_{ω_n = H}. Then we can count the number of heads in N experiments as ∑_{n=1}^N X_n(ω_n), or we can consider an average (1/N) ∑_{n=1}^N X_n(ω_n). We can also consider the difference between the number of heads obtained in even-numbered and odd-numbered experiments among the first 2N experiments: ∑_{n=1}^N X_{2n}(ω_{2n}) − ∑_{n=1}^N X_{2n−1}(ω_{2n−1}). ⋄

Example: A truck will arrive within the next hour at a distribution center; the arrival time ω, measured in minutes, belongs to Ω = [0, 60]. In this case the outcomes are real numbers and we can use them directly as random variables: X(ω) = ω. We can consider another truck arriving in the same time window with arrival time Y ∈ [0, 60]. The earliness of the second truck with respect to the first is (X − Y)^+ = max{X − Y, 0}; the earliness is then another random variable. ⋄

As illustrated above, a function of random variable(s) is another random variable. Since random variables are real numbers, we can average them to obtain means and variances.

Example: A truck arrives at a distribution center over 7 days, and each day it arrives within a one-hour time window [0, 1]. Then X_n(ω_n) denotes the arrival time on the nth day for n = 1, ..., 7.
The average arrival time is (1/7) ∑_{n=1}^7 X_n. ⋄

A random variable can be conditioned on another random variable or on an event. When such conditioning happens, the convention is to work with the functions characterizing the conditioned random variable rather than writing X|Y or X|A for random variables X, Y and event A. One such function is the cumulative distribution function (cdf) F_X of a random variable X. The cdf maps the real line to the interval [0, 1] and is defined through the probability measure: F_X(a) := P(X(ω) ∈ (−∞, a]). Since (−∞, a] ⊆ (−∞, b] for a ≤ b, we have F_X(a) ≤ F_X(b), so F_X is a non-decreasing function. Recalling also P(Ω) = 1 and P(∅) = 0, we obtain three properties of the cdf:
i) F_X is non-decreasing.
ii) lim_{a→∞} F_X(a) = 1.
iii) lim_{a→−∞} F_X(a) = 0.

Example: Consider the experiment of tossing a fair coin twice with the sample space Ω = {HT, TH, HH, TT}. Let X(ω) be the number of heads in two tosses, so that X(ω) ∈ {0, 1, 2}. The cdf F_X is

F_X(a) = 0/4 if a < 0,
         1/4 if 0 ≤ a < 1,
         3/4 if 1 ≤ a < 2,
         4/4 if 2 ≤ a. ⋄

We can consider the right-hand limits of F_X(a) and use the continuity of the probability measure to obtain

lim_{x→a+} F_X(x) = lim_{x→a+} P(X(ω) ∈ (−∞, x]) = P(X(ω) ∈ ∩_{x→a+} (−∞, x]) = P(X(ω) ∈ (−∞, a]) = F_X(a).

Note that (−∞, x] as x → a+ is a decreasing sequence of closed sets; their limit is the intersection of these sets, and the intersection of infinitely many closed sets is closed. In particular, a ∈ ∩_{n=1}^∞ (−∞, a + 1/n] = ∩_{x→a+} (−∞, x]. Hence, cdfs are right-continuous for all random variables. For the left-hand limits, we obtain

lim_{x→a−} F_X(x) = lim_{x→a−} P(X(ω) ∈ (−∞, x]) = P(X(ω) ∈ ∪_{x→a−} (−∞, x]) = P(X(ω) ∈ (−∞, a)) ≤ F_X(a).

Note that (−∞, x] as x → a− is an increasing sequence of closed sets; their limit is the union of these sets, and the union of infinitely many closed sets is not necessarily closed.
In particular, a ∉ (−∞, a − 1/n] for any n, and so a ∉ ∪_{n=1}^∞ (−∞, a − 1/n] = ∪_{x→a−} (−∞, x]. So cdfs are not necessarily left-continuous, as illustrated by the F_X in the last example.

Example: 4 people including you and a friend line up at random. Let X be the number of people between you and your friend; find F_X. Write Y for you, F for your friend, and ◦, ⋆ for the other two people. X = 0 means that you are standing together. Treating the two of you as a single pair, we have 2 people and a pair to order in 3! ways; by reordering you and your friend within the pair we obtain 2 × 3! ways. The total number of ways of ordering 4 people is 4!, so P(X = 0) = 2 × 3!/4! = 1/2. X = 1 means that there is exactly one person between you two. If either of you is at the beginning of the line, we can depict the permutation as Y ◦ F ⋆. By exchanging the positions of Y with F and ◦ with ⋆, we can create 4 permutations: Y ◦ F ⋆, F ◦ Y ⋆, Y ⋆ F ◦, F ⋆ Y ◦. Another 4 permutations can be created by placing either of you at the second place in the line: ◦ Y ⋆ F, ◦ F ⋆ Y, ⋆ Y ◦ F, ⋆ F ◦ Y. The total number of permutations with X = 1 is 8, so P(X = 1) = 8/4! = 1/3. X = 2 means exactly two people between you, and all four such permutations are Y ◦ ⋆ F, F ◦ ⋆ Y, Y ⋆ ◦ F, F ⋆ ◦ Y. The total number of permutations with X = 2 is 4, so P(X = 2) = 4/4! = 1/6. Hence

F_X(a) = 0   if a < 0,
         3/6 if 0 ≤ a < 1,
         5/6 if 1 ≤ a < 2,
         6/6 if 2 ≤ a. ⋄

Example: Suppose a fair coin is tossed 4 times and we examine the head runs. A head run is a consecutive appearance of heads. Let X be the length of the longest head run; find P(X = x) for 0 ≤ x ≤ 4. The sample space is Ω = {HHHH, HHHT, HHTH, HTHH, THHH, HHTT, HTHT, THHT, HTTH, THTH, TTHH, HTTT, THTT, TTHT, TTTH, TTTT}. X(HHHH) = 4; X(HHHT) = 3, X(THHH) = 3; X(HHTH) = 2, X(HTHH) = 2, X(HHTT) = 2, X(THHT) = 2, X(TTHH) = 2; X(TTTT) = 0; and all the remaining 7 outcomes yield X = 1. So P(X = 0) = 1/16, P(X = 1) = 7/16, P(X = 2) = 5/16, P(X = 3) = 2/16, P(X = 4) = 1/16.
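These head-run probabilities can be verified by enumerating all 16 equally likely outcomes. A small Python sketch (the function name is illustrative):

```python
from itertools import product

def longest_head_run(tosses):
    """Length of the longest consecutive run of 'H' in a toss sequence."""
    best = run = 0
    for toss in tosses:
        run = run + 1 if toss == "H" else 0
        best = max(best, run)
    return best

# Tally outcomes by longest head run over all 2^4 equally likely sequences.
counts = {x: 0 for x in range(5)}
for outcome in product("HT", repeat=4):
    counts[longest_head_run(outcome)] += 1

print(counts)  # {0: 1, 1: 7, 2: 5, 3: 2, 4: 1}
```

Dividing each count by 16 recovers the probabilities computed above.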
⋄

2 Discrete Random Variables

Most of the examples above had countable sample spaces. A discrete random variable has a countable sample space. As the sample space is countable, we can let it be Ω = {x_1, x_2, x_3, ...}. The probability measure is nonzero only at the elements of Ω and zero elsewhere. At the elements of Ω, we can define the probability mass function (pmf) p_X as p_X(a) = P(X = a). Then we have F_X(a) = ∑_{x_i ≤ a} p_X(x_i).

In real life, averages are used to summarize empirical observations: the average fuel consumption of my car is 32 miles per gallon; the average August temperature in Dallas is 88° Fahrenheit. Expected values, on the other hand, summarize theoretical probability distributions. The expectation of a discrete random variable X with pmf p_X is given by

E(X) = ∑_{x: p_X(x)>0} x p_X(x).

Expectation is a weighted average where the weights are p_X(x). Note that a function of a random variable is another random variable, so we can consider the expected value of this function. That is, let us define a new random variable Y = g(X) starting with the random variable X. Then

E(Y) = ∑_{x: p_X(x)>0} g(x) p_X(x).

For a given random variable X, the function can in particular be g(x) = (x − E(X))^2, which defines the new random variable Y = g(X) = (X − E(X))^2. The expected value of this new random variable is the variance of X:

V(X) = ∑_{x: p_X(x)>0} (x − E(X))^2 p_X(x).

The variance of a random variable is the expected value of the squared deviations from its mean. Another particular g function is g(x) = x^k for an integer k; it defines the kth moment of the random variable X:

E(X^k) = ∑_{x: p_X(x)>0} x^k p_X(x).

Example: Consider the example of 4 people lining up and X, the number of people between you and your friend. The pmf of X is p(0) = 3/6, p(1) = 2/6 and p(2) = 1/6. Then the expected number of people between you is E(X) = 0(3/6) + 1(2/6) + 2(1/6) = 2/3.
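This expectation can be checked by brute-force enumeration of all 4! orderings. A Python sketch, where the labels 0 and 1 stand for you and your friend:

```python
from itertools import permutations
from fractions import Fraction

# All 4! equally likely orderings of 4 people; persons 0 and 1 are you and your friend.
orderings = list(permutations(range(4)))
gaps = [abs(p.index(0) - p.index(1)) - 1 for p in orderings]  # people strictly between

pmf = {x: Fraction(gaps.count(x), len(gaps)) for x in sorted(set(gaps))}
mean = sum(x * q for x, q in pmf.items())
print(pmf, mean)  # pmf {0: 1/2, 1: 1/3, 2: 1/6}, mean 2/3
```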
The variance of the number of people between you is V(X) = (0 − 2/3)^2 (3/6) + (1 − 2/3)^2 (2/6) + (2 − 2/3)^2 (1/6) = (4/9)(3/6) + (1/9)(2/6) + (16/9)(1/6) = 30/54 = 5/9. ⋄

Example: Consider the example of 4 people lining up and X, the number of people between you and your friend. Suppose you want to pass a glass of water to your friend, but 1/4 of the water spills as the glass moves from one person to another. What is the expected percentage of water remaining in the glass when your friend receives it? Let g(x) be the percentage of water left in the glass after x + 1 passes; note that x people between you causes x + 1 passes. Then g(0) = 3/4, g(1) = (3/4)^2, g(2) = (3/4)^3. So E(g(X)) = (3/4)(3/6) + (3/4)^2(2/6) + (3/4)^3(1/6) = 81/128 = 0.6328. That is, 63.28% of the water remains on average when your friend receives the glass. This example can be extended to model the amount of information loss over a telecommunication network as information is sent from one node to another. ⋄

Expectations and variances have some properties that are straightforward to show:
i) Linearity of expectation: For a constant c, E(cg(X)) = cE(g(X)). For functions g_1, g_2, ..., g_N, E(g_1(X) + g_2(X) + ... + g_N(X)) = E(g_1(X)) + E(g_2(X)) + ... + E(g_N(X)).
ii) For a constant c, E(c) = c and V(c) = 0.
iii) Nonlinearity of variance: For a constant c, V(cg(X)) = c^2 V(g(X)).
iv) V(X) = E(X^2) − (E(X))^2.

3 Independent Discrete Random Variables

Let X and Y be two random variables and consider the two events A := [X ≤ a] and B := [Y ≤ b] for real numbers a, b. Suppose that these two events are independent, so that we have the middle equality below:

P(X ≤ a, Y ≤ b) = P(A ∩ B) = P(A) P(B) = P(X ≤ a) P(Y ≤ b).

We extend this equality to all numbers a, b to define the independence of random variables: if P(X ≤ a, Y ≤ b) = P(X ≤ a) P(Y ≤ b) for all a, b ∈ ℜ, the random variables X and Y are said to be independent.
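As a quick sanity check of this definition, the joint cdf of two independent fair four-sided dice factors into the product of the marginal cdfs. A Python sketch (the dice are an illustrative choice):

```python
from itertools import product
from fractions import Fraction

# Two independent fair four-sided dice; the 16 joint outcomes are equally likely.
outcomes = list(product(range(1, 5), repeat=2))

def joint_cdf(a, b):
    """P(X <= a, Y <= b) computed from the joint outcomes."""
    return Fraction(sum(1 for x, y in outcomes if x <= a and y <= b), len(outcomes))

def cdf_x(a):
    return Fraction(sum(1 for x, _ in outcomes if x <= a), len(outcomes))

def cdf_y(b):
    return Fraction(sum(1 for _, y in outcomes if y <= b), len(outcomes))

# The factorization P(X <= a, Y <= b) = P(X <= a) P(Y <= b) holds for every a, b.
for a in range(1, 5):
    for b in range(1, 5):
        assert joint_cdf(a, b) == cdf_x(a) * cdf_y(b)
```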
To indicate independence, we write X ⊥ Y. Starting from P(X ≤ a, Y ≤ b) = P(X ≤ a) P(Y ≤ b), we can obtain P(X = a, Y = b) for integer-valued random variables X, Y in terms of their pmfs:

P(X = a, Y = b) = P(X ≤ a, Y ≤ b) − P(X ≤ a, Y ≤ b − 1) − (P(X ≤ a − 1, Y ≤ b) − P(X ≤ a − 1, Y ≤ b − 1))
= F_X(a) F_Y(b) − F_X(a) F_Y(b − 1) − (F_X(a − 1) F_Y(b) − F_X(a − 1) F_Y(b − 1))
= F_X(a) p_Y(b) − F_X(a − 1) p_Y(b)
= p_X(a) p_Y(b).

This can be extended to non-integer-valued random variables, but a new notation is needed to express the largest value of X smaller than a. If you adopt some notation for this purpose, the steps above can be repeated to obtain P(X = a, Y = b) = p_X(a) p_Y(b) for two independent discrete random variables X and Y.

Example: For two independent random variables, we check the equality E(X_1 X_2) = E(X_1) E(X_2):

E(X_1 X_2) = ∑_{x_1,x_2: P(X_1=x_1, X_2=x_2)>0} (x_1 x_2) P(X_1 = x_1, X_2 = x_2) = ∑_{x_1,x_2: p_{X_1}(x_1), p_{X_2}(x_2)>0} x_1 x_2 p_{X_1}(x_1) p_{X_2}(x_2)
= ( ∑_{x_1: p_{X_1}(x_1)>0} x_1 p_{X_1}(x_1) ) ( ∑_{x_2: p_{X_2}(x_2)>0} x_2 p_{X_2}(x_2) ) = E(X_1) E(X_2).

The first equality above uses the fact that the probability of the event [X_1 = x_1, X_2 = x_2] is p_{X_1}(x_1) p_{X_2}(x_2). ⋄

Example: For two independent random variables, we check the equality V(X_1 + X_2) = V(X_1) + V(X_2):

V(X_1 + X_2) = E(X_1^2 + 2 X_1 X_2 + X_2^2) − E(X_1 + X_2) E(X_1 + X_2)
= E(X_1^2) − E(X_1)^2 + E(X_2^2) − E(X_2)^2 + 2 E(X_1 X_2) − 2 E(X_1) E(X_2)
= V(X_1) + V(X_2). ⋄

The equalities established in the last two examples are summarized below:
v) For two independent random variables X_1 and X_2, E(X_1 X_2) = E(X_1) E(X_2).
vi) Additivity of variance: for two independent random variables X_1 and X_2, V(X_1 + X_2) = V(X_1) + V(X_2).

4 Common Discrete Random Variables

Some commonly used discrete random variables are introduced below.
4.1 Bernoulli Random Variable

An experiment is said to consist of Bernoulli trials if
• the outcome of each trial can take two values, which can be interpreted as, say, success and failure,
• each trial is independent of the other trials,
• each trial is identical in terms of probability to the other trials, so for all trials P(success) = p = 1 − P(failure).

If an experiment has a single Bernoulli trial, the number of successes is called a Bernoulli random variable.

Example: Let us consider the experiment of tossing a coin once. Let "success" ≡ "head", so P(Success) = P(Head) = p, where p is the probability that the coin yields a head, and P(Failure) = P(Tail) = 1 − p. Let X be the number of heads on a single toss. Then E(X) = 1p + 0(1 − p) = p and V(X) = E(X^2) − p^2 = 1^2 p + 0^2 (1 − p) − p^2 = p(1 − p). ⋄

Example: Let us consider the price of a stock, which can increase or decrease over a day. Let "success" ≡ "price increase", so P(Success) = P(Price increase) and P(Failure) = P(Price decrease). Suppose that the price either increases by $1 or decreases by $1. Then the expected price change over a day is

Today's price + 1 P(Price increase) − 1 P(Price decrease) − Today's price = P(Price increase) − P(Price decrease) = 2 P(Price increase) − 1.

The price increases on average if P(Price increase) > 1/2. This probability of a price increase can be estimated statistically for stocks. ⋄

4.2 Binomial Random Variable

The number of successes in n independent Bernoulli trials, each with success probability p, is a Binomial random variable with parameters n and p.

Example: Let us consider the experiment of tossing a coin many times. On each toss, P(Success) = P(Head) = p and P(Failure) = P(Tail) = 1 − p. Let X_i be the number of heads on the ith toss; X_i is a Bernoulli random variable. Consider the total number X of heads in n tosses. Then X is a Binomial random variable and X = ∑_{i=1}^n X_i. ⋄

To denote a Binomial random variable, we write X ∼ B(n, p).
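The representation X = ∑_{i=1}^n X_i can be turned into a computation: convolving n Bernoulli(p) pmfs produces the distribution of X without any closed-form formula. A Python sketch (the function name is illustrative):

```python
# Build the pmf of X = X1 + ... + Xn for independent Bernoulli(p) trials
# by repeated convolution.
def bernoulli_sum_pmf(n, p):
    pmf = {0: 1.0}  # before any trial, the count is 0 with certainty
    for _ in range(n):
        nxt = {}
        for x, q in pmf.items():
            nxt[x] = nxt.get(x, 0.0) + q * (1 - p)     # current trial fails
            nxt[x + 1] = nxt.get(x + 1, 0.0) + q * p   # current trial succeeds
        pmf = nxt
    return pmf

pmf = bernoulli_sum_pmf(4, 0.25)
mean = sum(x * q for x, q in pmf.items())
print(pmf, mean)  # mean equals np = 1.0
```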
To obtain the probability mass function of X, consider the event [X = m] of having m successes among n trials for m ≤ n. If the first m trials are successes, we get the string

S S ... S S F F ... F F (m successes followed by n − m failures).

The probability of this string is p^m (1 − p)^{n−m}. If we take a permutation of this string, the probability remains p^m (1 − p)^{n−m}. The number of permutations that we can make up from m Ss and n − m Fs is C_m^n. Hence,

P(X = m) = C_m^n p^m (1 − p)^{n−m} for m = 0, 1, 2, ..., n.

Example: For a Bernoulli random variable X_i with P(Success) = p, we have E(X_i) = p and V(X_i) = p(1 − p). By using X = ∑_{i=1}^n X_i, the linearity of expectation and the additivity of variance for independent random variables, E(X) = np and V(X) = np(1 − p). ⋄

Example: An urn contains 6 white balls and 2 black balls. A ball is drawn and put back into the urn. The process is repeated 4 times. Let X be the number of black balls drawn out of 4 draws. Find P(X = 2), E(X) and V(X). Convince yourself that X ∼ B(4, 2/8), so P(X = 2) = C_2^4 (2/8)^2 (1 − 2/8)^2, E(X) = 4(2/8) and V(X) = 4(2/8)(6/8). ⋄

Suppose that B(n_j, p) is the number of successes in n_j trials. Then ∑_j B(n_j, p) is the number of successes in ∑_j n_j Bernoulli trials with success probability p. This in turn leads to ∑_j B(n_j, p) ∼ B(∑_j n_j, p), where ∼ indicates having the same distribution. For example, B(n_1, p) + B(n_2, p) is another Binomial random variable with n_1 + n_2 trials and success probability p, i.e., B(n_1, p) + B(n_2, p) ∼ B(n_1 + n_2, p).

4.3 Geometric Random Variable

The number of Bernoulli trials until the first success is a Geometric random variable. When counting the number of trials, we include the trial that resulted in the first success. For example, the first success in the string FFFFSFFSFS is at the fifth trial.
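With this counting convention, the trial of the first success can be read directly off a trial string; a one-function Python sketch:

```python
def first_success_trial(trials):
    """1-based index of the trial containing the first success 'S'."""
    return trials.index("S") + 1

print(first_success_trial("FFFFSFFSFS"))  # 5
```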
Some books do not include the trial in which the success happens when counting the number of trials until the first success, and would say that it takes four trials to get a success in FFFFSFFSFS. This alternative convention changes the pmf, cdf, mean and variance formulas, so one has to be careful when reading other books. With our convention of counting, a Geometric random variable X takes values in {1, 2, 3, ...}. For the Geometric random variable X to be equal to n, we need n − 1 failures first and a success afterwards:

F F ... F F S (n − 1 failures, then a success; the outcomes of later trials are irrelevant).

Recalling that p is the success probability, the pmf of the Geometric random variable is

P(X = n) = (1 − p)^{n−1} p.

Unlike the Binomial pmf, the Geometric pmf does not have a combinatorial term because the sequence of failures followed by the success cannot be reordered.

Example: Consider a die-rolling experiment that continues until a 6 comes up. The experiment can potentially have infinitely many trials and so a countable but infinite sample space. Let "success" = "6 appears" and "failure" = "1, 2, 3, 4, or 5 appears", so P(Success) = 1/6 and P(Failure) = 5/6. Then the probability of having a 6 for the first time on the nth trial is P(X = n) = (5/6)^{n−1}(1/6). ⋄

The cumulative distribution function of the Geometric random variable has a simple form:

F_X(n) = P(X ≤ n) = ∑_{i=1}^n (1 − p)^{i−1} p = p ∑_{i=0}^{n−1} (1 − p)^i = p (1 − (1 − p)^n)/(1 − (1 − p)) = 1 − (1 − p)^n.

Hence the complementary cdf F̄ := 1 − F is also simple: F̄_X(n) = 1 − F_X(n) = P(X > n) = (1 − p)^n.

Example: A graduate student applies to jobs one by one. Within a few days after each job interview, he is told whether or not he gets the position. Each interview can be thought of as a Bernoulli trial where success is obtaining a position. The chance of getting a position after an interview is estimated to be 0.2. What is the probability that the student cannot get a job after the first five interviews?
We want P(X > 5) for a Geometric random variable X with success probability 0.2. So P(X > 5) = (1 − 0.2)^5. ⋄

If the first m trials result in failures, what is the probability that n trials result in failures as well, for n ≥ m? In particular, we want to see whether the probability that the remaining n − m trials turn out to be failures depends on the failures of the first m trials. In other words, is the outcome of the last n − m trials independent of the first m trials? This can be answered formally by examining the probability of the event [X > n | X > m]:

P(X > n | X > m) = P(X > n, X > m)/P(X > m) = P(X > n)/P(X > m) = (1 − p)^n/(1 − p)^m = (1 − p)^{n−m}.

On the other hand, (1 − p)^{n−m} is the probability of failures in n − m trials, so (1 − p)^{n−m} = P(X > n − m). Hence, the outcome of the last n − m trials does not depend on the knowledge (memory) of the first m trials. This is the memoryless property of the Geometric random variable:

P(X > n | X > m) = P(X > n − m).

Example: What other discrete random variable ranging over {1, 2, ...} has the memoryless property? The memoryless property can be written as P(X > n + m) = P(X > m) P(X > n) for nonnegative integers m, n. In terms of the complementary cdf, we have F̄(m + n) = F̄(m) F̄(n). Also F̄(0) = P(X > 0) = 1. We can set q = F̄(1) for 0 ≤ q ≤ 1. Then we recursively obtain F̄(n) = q^n. Interpreting q as the failure probability and defining p = 1 − q as the success probability, we obtain F(n) = 1 − (1 − p)^n, which is the cdf of the Geometric random variable. As a result, the only discrete random variable ranging over {1, 2, ...} and satisfying the memoryless property is the Geometric random variable. ⋄

4.4 Hypergeometric Random Variable

One way to obtain a Binomial random variable is to consider an urn with s success balls and f failure balls; we pick a ball from the urn, record it and return it to the urn. Each ball can be either a success or a failure, with success probability s/(s + f).
The probability of m successes after n repetitions of this experiment is given by the Binomial probability C_m^n (s/(s + f))^m (f/(s + f))^{n−m}.

In the urn experiment above, each ball is returned to the urn. If we do not return the balls to the urn, the success and failure probabilities are no longer constant. Suppose we again pick n balls, now without replacement, and wonder about the probability of m balls being successes (and equivalently n − m balls being failures). For this event, we need m ≤ s and n − m ≤ f (so n ≤ s + f). The number of ways of choosing m success balls out of s success balls is C_m^s, the number of ways of choosing n − m failure balls out of f failure balls is C_{n−m}^f, and the number of ways of choosing n balls out of s + f balls is C_n^{s+f}. Let X be the number of successes in this experiment. Then the probability of m success balls is

P(X = m) = C_m^s C_{n−m}^f / C_n^{s+f}.

This probability, obtained under no replacement, is different from the Binomial probability obtained under replacement. So the random variable X deserves a new name: the Hypergeometric random variable.

The difference between the Binomial and Hypergeometric random variables is caused by the "replacement" of the balls. Replacement changes the success probability significantly if n is close to s + f. Otherwise, success probabilities do not change much by taking a ball out of the urn. For example, with s = 300 and f = 700, the success probability for the first ball is 300/1000; for the second it is either 300/999 or 299/999, both of which are very close to 300/1000. This inspires us to examine the Hypergeometric random variable as N := s + f → ∞. We also define p = s/N, so s = Np and f = N(1 − p). With this reparametrization, we can show that the Hypergeometric pmf converges to the Binomial pmf as N increases:

C_m^{Np} C_{n−m}^{N(1−p)} / C_n^N
= C_m^n (Np)! (N(1 − p))! n! (N − n)! / [(Np − m)! m! (N(1 − p) − n + m)! (n − m)! N!] · m!(n − m)!/n! ... more transparently,
= C_m^n [ (Np)(Np − 1)...(Np − m + 1) · N(1 − p)(N(1 − p) − 1)...(N(1 − p) − n + m + 1) ] / [ N(N − 1)...(N − n + 1) ]
= C_m^n [ p(p − 1/N)...(p − (m − 1)/N) · (1 − p)((1 − p) − 1/N)...((1 − p) − (n − m − 1)/N) ] / [ 1(1 − 1/N)...(1 − (n − 1)/N) ]
→ C_m^n p^m (1 − p)^{n−m} as N → ∞.

Example: A package of light bulbs has 6 bulbs; 4 of these are non-defective and 2 are defective. 3 bulbs are taken from the package for inspection. Let X be the number of defective bulbs observed during inspection. Find P(X = m) for m ∈ {0, 1, 2}:

P(X = 0) = C_0^2 C_3^4 / C_3^6, P(X = 1) = C_1^2 C_2^4 / C_3^6, P(X = 2) = C_2^2 C_1^4 / C_3^6. ⋄

Example: The Hypergeometric pmf must sum to 1. Let us check this sum:

∑_{m=0}^s P(X = m) = ∑_{m=0}^s C_m^s C_{n−m}^f / C_n^{s+f}.

We claim that C_n^{s+f} = ∑_{m=0}^s C_m^s C_{n−m}^f. To convince yourself of this equality, consider C_n^{s+f}, the number of n-element subsets of an (s + f)-element set. An indirect way to obtain these subsets is to split the (s + f)-element set into two sets with s and f elements. To obtain n-element subsets, we can pick m elements from the s-element set and n − m elements from the f-element set. The indirect way also gives us all the n-element subsets, in ∑_{m=0}^s C_m^s C_{n−m}^f ways. So C_n^{s+f} = ∑_{m=0}^s C_m^s C_{n−m}^f. ⋄

Example: Find the expected value of the Hypergeometric random variable X.

E(X) = ∑_{m=0}^s m C_m^s C_{n−m}^f / C_n^{s+f} = s ∑_{m=1}^s C_{m−1}^{s−1} C_{n−m}^f / C_n^{s+f} = (sn/(s + f)) ∑_{m=1}^s C_{m−1}^{s−1} C_{n−m}^f / C_{n−1}^{s+f−1} = (sn/(s + f)) ∑_{m=0}^{s−1} C_m^{s−1} C_{n−m−1}^f / C_{n−1}^{s+f−1} = sn/(s + f). ⋄

4.5 Poisson Random Variable

Starting with a Binomial random variable B(n, p), what happens to B(n, p) as n → ∞ and p → 0 while np remains constant? These limits can be considered by setting λ = np; then n → ∞ implies p = λ/n → 0.

lim_{n→∞} P(B(n, λ/n) = m) = lim_{n→∞} C_m^n (λ/n)^m (1 − λ/n)^{n−m}
= (λ^m/m!) lim_{n→∞} [ n(n − 1)...(n − m + 1)/n^m ] [ (1 − λ/n)^n ] [ (1 − λ/n)^{−m} ]
= (λ^m/m!) · 1 · exp(−λ) · 1 = exp(−λ) λ^m/m! for m = 0, 1, 2, ....
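This convergence can also be observed numerically: holding λ = np fixed and letting n grow, the Binomial probability approaches the Poisson probability. A Python sketch with illustrative values λ = 2 and m = 3:

```python
from math import comb, exp, factorial

def binom_pmf(m, n, p):
    """P(B(n, p) = m)."""
    return comb(n, m) * p**m * (1 - p) ** (n - m)

def poisson_pmf(m, lam):
    """P(Po(lam) = m) = exp(-lam) lam^m / m!."""
    return exp(-lam) * lam**m / factorial(m)

# With np = lam fixed, the Binomial pmf approaches the Poisson pmf as n grows.
lam, m = 2.0, 3
errors = [abs(binom_pmf(m, n, lam / n) - poisson_pmf(m, lam))
          for n in (10, 100, 10_000)]
print(errors)  # decreasing toward 0
```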
The limiting random variable has the pmf p(m) = exp(−λ) λ^m/m!; it is called the Poisson distribution and is denoted Po(λ).

Example: Classical FM plays about 1 Led Zeppelin song every 10 hours. Suppose that the number of LZ songs played every 10 hours has a Po(1) distribution. Then
P(No LZ song played in the last 10 hours) = P(Po(1) = 0) = exp(−1) 1^0/0! = exp(−1) = 0.3679.
P(1 LZ song played in the last 10 hours) = P(Po(1) = 1) = exp(−1) 1^1/1! = exp(−1) = 0.3679.
P(2 or more LZ songs played in the last 10 hours) = P(Po(1) ≥ 2) = 1 − P(Po(1) ≤ 1) = 1 − exp(−1) − exp(−1) = 0.2642. ⋄

The expected value of Po(λ) is λ:

E(Po(λ)) = ∑_{i=0}^∞ i exp(−λ) λ^i/i! = λ ∑_{i=1}^∞ exp(−λ) λ^{i−1}/(i − 1)! = λ ∑_{j=0}^∞ exp(−λ) λ^j/j! = λ.

Example: A customer enters a store in a given second with probability p = 0.01; no customer enters the store in that second with probability 0.99. Suppose that we consider n = 100 seconds. What is the probability of having exactly 1 customer in 100 seconds? This probability is P(B(100, 0.01) = 1) = C_1^{100} (0.01)^1 (0.99)^{99} = (0.99)^{99} = 0.3697. We can also use B(100, 0.01) ≈ Po(1) and obtain P(Po(1) = 1) = exp(−1) = 0.3679. ⋄

A Poisson random variable Po(λ) can be interpreted as the number of arrivals in a time interval. Suppose that Po(λ_1) and Po(λ_2) are the numbers of type 1 and type 2 customers arriving independently of each other in an interval. The total number of customers arriving in the interval is Po(λ_1) + Po(λ_2). This is the limiting distribution of B(n_1, p) + B(n_2, p) for λ_1 = n_1 p and λ_2 = n_2 p. Since B(n_1, p) + B(n_2, p) ∼ B(n_1 + n_2, p) → Po(λ_1 + λ_2), we obtain Po(λ_1) + Po(λ_2) ∼ Po(λ_1 + λ_2). That is, the sum of Poisson random variables is another Poisson random variable. Formally, for an integer m,

P(B(n_1, p) + B(n_2, p) = m) = P(B(n_1 + n_2, p) = m),
P(B(n_1, p) + B(n_2, p) = m) → P(Po(λ_1) + Po(λ_2) = m),
P(B(n_1 + n_2, p) = m) → P(Po(λ_1 + λ_2) = m).
If two sequences are equal and converge, they must converge to the same number: P(Po(λ_1) + Po(λ_2) = m) = P(Po(λ_1 + λ_2) = m). Note that the convergence used here is the convergence of sequences of real numbers.

5 Solved Examples

1. A broadcasting company runs 4 different commercials for a car company and 96 commercials for other companies. The 100 commercials are shown on TV one by one in random order.
a) What is the expected number of commercials to see until (and including) the first commercial of the car company?
ANSWER The probability that a given commercial is a car commercial is 4/100, so it takes an expected 1/(4/100) = 25 commercials until a car commercial is seen.
b) What is the expected number of commercials to see until (and including) all of the commercials of the car company have been seen?
ANSWER The probability that a given commercial is a car commercial is 4/100, so it takes an expected 1/(4/100) = 25 commercials until the first car commercial is seen. From then on, the probability of seeing a new (distinct) car commercial is 3/100, so it takes an expected 1/(3/100) = 100/3 commercials until the second car commercial is seen. Then the probability of seeing a new car commercial is 2/100, so it takes an expected 100/2 commercials until the third car commercial is seen. Finally, the probability of seeing the last car commercial is 1/100, so it takes an expected 100/1 commercials until the fourth car commercial is seen. In total, it takes an expected 100/4 + 100/3 + 100/2 + 100/1 commercials to see all the car commercials.

6 Exercises

1. 5 people including you and a friend line up at random. Let X be the number of people between you and your friend; find F_X.
2. N people including you and a friend line up at random. Let X be the number of people between you and your friend; find P(X = 0) and P(X = N).
3. Suppose a fair coin is tossed 5 times. Let X be the length of the longest head run; find P(X = x) for 0 ≤ x ≤ 5.
4.
In the case of the Geometric random variable, we have Bernoulli trials with success probability p and we are interested in the number of the trial in which the first success happens. Let X(i, p) be the number of the trial in which the ith success happens in a series of Bernoulli trials with success probability p. Clearly, X(1, p) is the Geometric random variable. Find the pmf of X(i, p).
5. Let X(i, p) be the number of the trial in which the ith success happens in a series of Bernoulli trials with success probability p. When X(1, p) is the Geometric random variable, Y(1, p) := X(1, p) − 1 is the number of failures until the first success. Find the pmf of Y(1, p) and F̄_{Y(1,p)}(n). Does Y(1, p) have the memoryless property?
ANSWER P(Y(1, p) = m) = p(1 − p)^m. F̄_{Y(1,p)}(n) = P(Y(1, p) > n) = P(X(1, p) > n + 1) = (1 − p)^{n+1}. P(Y(1, p) > n)/P(Y(1, p) > m) = (1 − p)^{n−m} ≠ (1 − p)^{n−m+1} = P(Y(1, p) > n − m), so Y(1, p) does not have the memoryless property.
6. Consider a series of Bernoulli trials with success probability p and let X(i, p) be the number of the trial in which the ith success happens. Recall that B(n, p) is the number of successes in n trials. For an integer a, we have P(X(i, p) > n) = P(B(n, p) < i + a). What should a be to make this equality valid? After finding this a, check that P(X(i, p) > n) = P(B(n, p) < i + a) holds for i > n.
7. Let X(i_j, p) be the number of the trial in which the i_j-th success happens in a series of Bernoulli trials with success probability p. Suppose that the X(i_j, p)s are independent. Establish that

∑_j X(i_j, p) ∼ X(∑_j i_j, p),

where ∼ indicates having the same distribution.
8. For a random variable X parametrized by (r, p) we have the pmf P(X(r, p) = k) = C_k^{k+r−1} p^k (1 − p)^r for k = 0, 1, 2, .... Show that the mean of this distribution is rp/(1 − p).
9.
For a random variable X parametrized by (r, p) we have the pmf

P(X(r, p) = k) = (Γ(k + r)/(k! Γ(r))) p^k (1 − p)^r.

What happens to this pmf as r → ∞ and p → 0 while the mean rp/(1 − p) remains constant? Identify the limiting pmf and determine which common random variable it corresponds to.
ANSWER Let λ := rp/(1 − p), so that p = λ/(r + λ). Then

P(X(r, p) = k) = (Γ(k + r)/(k! Γ(r))) p^k (1 − p)^r = [λ^k/k!] [(1 + λ/r)^{−r}] [Γ(k + r)/(Γ(r)(r + λ)^k)].

The first term is constant in r, the second term converges to e^{−λ} as r → ∞, and the third term converges to 1 as r → ∞. Note that the third term can be written as Π_{i=1}^k (k/r + 1 − i/r)/(1 + λ/r). Hence, P(X(r, p) = k) → [λ^k/k!] e^{−λ}. This is the Poisson distribution with parameter λ = rp/(1 − p).
10. Let X(i, p) be the number of the trial in which the ith success happens in a series of Bernoulli trials with success probability p. Let Y(i, p) be the number of Bernoulli trials resulting in failure until the ith success happens. What is the relation between X(i, p) and Y(i, p)? Find the pmf of Y(i, p).
11. Show that the variance of the Hypergeometric random variable is (ns/(s + f))(1 − s/(s + f))((s + f − n)/(s + f − 1)).
12. Use the parametrization N = s + f, s = Np and f = N(1 − p) in the Hypergeometric variance (ns/(s + f))(1 − s/(s + f))((s + f − n)/(s + f − 1)) and simplify this variance as N → ∞. Relate this limiting variance to other variances that you may know.
13. For fixed parameters n ≥ 1 and 0 < q < 1, the pmf of a random variable X whose range is {n, n + 1, n + 2, ...} satisfies

p(n) = (1 − q)^n and p(m)/p(m − 1) = q (m − 1)/(m − n) for m > n.

Find the expected value of X. You may use ∑_{i=0}^∞ C_i^{n+i} q^i = (1 − q)^{−n−1}.
ANSWER We can check that p(m) = C_{n−1}^{m−1} (1 − q)^n q^{m−n} for m ≥ n. The mean is

∑_{m=n}^∞ m C_{n−1}^{m−1} (1 − q)^n q^{m−n} = ∑_{m=n}^∞ n C_n^m (1 − q)^n q^{m−n} = n(1 − q)^n ∑_{i=0}^∞ C_n^{n+i} q^i = n(1 − q)^n (1 − q)^{−n−1} = n/(1 − q).
You may realize that X(n, q) is the number of the Bernoulli trial in which the nth success happens in a series of Bernoulli trials with success probability 1 − q.
14. In a social experiment to measure the honesty of several cities, Reader's Digest reporters¹ intentionally dropped 12 wallets one by one in a central area in each of the following cities and counted the number of wallets returned to their owners by the people who found them.

City       Index  # returned | City       Index  # returned
Lisbon       1        1      | Ljubljana    9        6
Madrid       2        2      | Berlin      10        6
Prague       3        3      | Amsterdam   11        7
Zurich       4        4      | Moscow      12        7
Rio          5        4      | New York    13        8
Bucharest    6        4      | Budapest    14        8
Warsaw       7        5      | Mumbai      15        9
London       8        5      | Helsinki    16       11

¹ Most Honest Cities: The Reader's Digest Lost Wallet Test at http://www.rd.com/culture/most-honest-cities-lost-wallet-test/

a) Let X_{i,j} = 1 if the jth dropped wallet is returned in city i; otherwise, it is zero. What are the ranges of values for i and j in the Reader's Digest experiment? For distinct integers i, j, k from the appropriate ranges, consider the pairs of random variables (X_{i,j}, X_{i,j+1}), (X_{i,j}, X_{i,j+2}), (X_{i,j}, X_{i+1,j}) and (X_{i,j}, X_{i+2,j}). Which of these pairs include random variables that are more likely to be independent within the pair?
b) Consider the random variables Y_j = ∑_i X_{i,j} and Z_i = ∑_j X_{i,j}, where the sums are taken over all possible index values of i and j. Express these sums in English.
c) If possible, list the assumptions needed to conclude that Y_3 is a Binomial random variable and specify the parameters of the associated Binomial distribution. Otherwise, explain why this is not possible.
d) If possible, list the assumptions needed to conclude that Z_5 is a Binomial random variable and specify the parameters of the associated Binomial distribution. Otherwise, explain why this is not possible.
e) If possible, list the assumptions needed to conclude that Y_3 + Z_5 is a Binomial random variable and specify the parameters of the associated Binomial distribution.
Otherwise, explain why this is not possible.

15. Five people take an elevator from the bottom floor of a building to the higher floors: first, second, third and fourth. Each person on the elevator independently goes to one of the floors with equal probability. Let i = 0 denote the bottom floor, and let $S_i$ be 1 if the elevator stops on floor i = 1, 2, 3, 4; otherwise it is 0.

a) What is P(Elevator does not stop on floor i) for i = 1, 2, 3, 4? What is the distribution of $S_i$?

ANSWER: The elevator does not stop on floor i if all 5 people go to a floor different from i. This happens with probability P(Elevator does not stop on floor i) = $(3/4)^5$. $S_i$ has a Binomial distribution with parameters 1 and $1 - (3/4)^5$.

b) When the elevator arrives at the first floor, two people get off at the first floor and 3 remain. Subject to this information, find the range of $S_2 + S_3 + S_4$ and $P(S_2 + S_3 + S_4 = k)$ for appropriate values of k. Is the distribution of $S_2 + S_3 + S_4$ Binomial or not? Do you expect it to be Binomial? Why? Hint: Part c) is shorter, so you may solve that first.

ANSWER: $S_2 + S_3 + S_4 = k \in \{1, 2, 3\}$. The number of ways of distributing 3 people over 3 floors is the number of non-negative integer solutions to $x_2 + x_3 + x_4 = 3$; there are $C_3^{3+3-1} = 10$ such solutions. Let $X_i$ be the number of people getting off on floor i. Then

\[ P(S_2+S_3+S_4 = 1) = P(X_2=3, X_3=0, X_4=0) + P(X_2=0, X_3=3, X_4=0) + P(X_2=0, X_3=0, X_4=3) = 3 \frac{3!}{3!0!0!} (1/3)^3 = 1/9, \]

\[ P(S_2+S_3+S_4 = 2) = P(X_2=2, X_3=1, X_4=0) + P(X_2=1, X_3=2, X_4=0) + P(X_2=2, X_3=0, X_4=1) + P(X_2=1, X_3=0, X_4=2) + P(X_2=0, X_3=2, X_4=1) + P(X_2=0, X_3=1, X_4=2) = 6 \frac{3!}{2!1!0!} (1/3)^3 = 6/9, \]

\[ P(S_2+S_3+S_4 = 3) = P(X_2=1, X_3=1, X_4=1) = \frac{3!}{1!1!1!} (1/3)^3 = 2/9. \]

The distribution is not Binomial. We do not expect it to be Binomial because $S_2, S_3, S_4$ are not independent. For example, the constraint $S_2 + S_3 + S_4 \ge 1$ breaks the independence.
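The part b) answer can also be double-checked by brute force: after the conditioning, the 3 remaining riders independently and uniformly choose among floors 2, 3 and 4, giving 27 equally likely outcomes, and $S_2 + S_3 + S_4$ is the number of distinct floors chosen. A short Python sketch (my addition, not part of the original notes):

```python
from fractions import Fraction
from itertools import product

# Enumerate all 3^3 = 27 equally likely floor choices of the 3 remaining
# riders; the number of distinct floors chosen equals S2 + S3 + S4.
dist = {1: Fraction(0), 2: Fraction(0), 3: Fraction(0)}
for floors in product((2, 3, 4), repeat=3):
    dist[len(set(floors))] += Fraction(1, 27)
print(dist)  # 1/9, 2/3 (= 6/9), 2/9 for k = 1, 2, 3
```

Using exact fractions avoids any floating-point doubt; the enumeration reproduces the 1/9, 6/9 and 2/9 computed above.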
c) When the elevator arrives at the second floor, one person gets off at the second floor and 2 remain. Subject to this information, find the range of $S_3 + S_4$ and $P(S_3 + S_4 = k)$ for appropriate values of k.

ANSWER: $S_3 + S_4 = k \in \{1, 2\}$. The number of ways of distributing 2 people over 2 floors is the number of non-negative integer solutions to $x_3 + x_4 = 2$; there are $C_2^{2+2-1} = 3$ such solutions. Then

\[ P(S_3+S_4 = 1) = P(X_3=0, X_4=2) + P(X_3=2, X_4=0) = 2 \frac{2!}{2!0!} (1/2)^2 = 1/2, \]

\[ P(S_3+S_4 = 2) = P(X_3=1, X_4=1) = \frac{2!}{1!1!} (1/2)^2 = 1/2. \]

d) When the elevator arrives at the second floor, one person gets off at the second floor and 2 remain. But these two are talking about a joint project and are going to the same floor. Subject to this information, find the range of $S_3 + S_4$ and $P(S_3 + S_4 = k)$ for appropriate values of k.

ANSWER: $S_3 + S_4 = k \in \{1\}$, so $P(S_3 + S_4 = 1) = 1$.

16. An urn contains 100 marbles, one of which is defective.

a) A marble is picked from the urn. If it is the defective one, we stop. Otherwise, we return the picked marble to the urn and we pick another marble. We continue to pick marbles from the urn until the defective one is found. Let M be the number of picks; find the range of M and P(M = k) for appropriate values of k.

b) A marble is picked from the urn. If it is the defective one, we stop. Otherwise, we put the picked marble aside (outside the urn) and we pick another marble. We continue to pick marbles from the urn until the defective one is found. Let N be the number of picks; find the range of N and P(N = k) for appropriate values of k.

17. An urn contains 100 marbles, two of which are defective.

a) Two marbles are picked at once from the urn. If these are the defective ones, we stop. Otherwise, we return the picked marbles to the urn and we pick two other marbles. We continue to pick two marbles from the urn until the defectives are found. Let M be the number of picks of two marbles; find the range of M and P(M = k) for appropriate values of k.
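For Exercise 16 b), the sequential description can be turned directly into a probability computation, which is handy for checking a conjectured closed form for P(N = k). The sketch below is my addition, not part of the notes, and the helper name p_found_on_pick is made up; it multiplies the chances that the first k − 1 picks (without replacement) each miss the defective and that the k-th pick finds it:

```python
from fractions import Fraction

def p_found_on_pick(k, n=100):
    """P(N = k): the first k-1 picks (without replacement) miss the single
    defective marble among n, then the k-th pick finds it."""
    p = Fraction(1)
    for j in range(k - 1):
        p *= Fraction(n - 1 - j, n - j)   # another non-defective is drawn
    return p * Fraction(1, n - (k - 1))   # the defective is drawn

probs = [p_found_on_pick(k) for k in range(1, 101)]
assert sum(probs) == 1                    # the probabilities form a distribution
print(probs[0], probs[49], probs[99])     # compare with your closed form for P(N = k)
```

Exact rational arithmetic via Fraction keeps the telescoping products free of rounding, so the comparison with a hand-derived formula is clean.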
b) A marble is picked from the urn. Then another marble is picked. If these are the two defective ones, we stop, recording 2 picks. Otherwise, 1 or 2 of the picked marbles are non-defective, and we put the non-defective marbles aside (outside the urn). If 1 marble was put aside, we pick 1 marble from the urn to again have two marbles in hand. If 2 marbles were put aside, we pick 2 marbles at once from the urn to again have two marbles in hand. If both marbles in hand are defective, we stop. Otherwise, we continue to pick marbles from the urn until the defectives are found. Let N be the number of picks; find the range of N and P(N = k) for appropriate values of k.

7 Appendix: Negatives in Factorials and in Binomial Coefficients

For a nonnegative integer n, $n! = n(n-1)(n-2) \cdots 1$ and $0! = 1$. For positive non-integers as well as negative real numbers, this definition does not help. Instead, the factorial is defined through the Gamma function Γ:

\[ n! = \Gamma(n+1) = \int_0^\infty x^n e^{-x} dx. \]

Note that $0! = \Gamma(1) = \int_0^\infty e^{-x} dx = 1$. Also, integration by parts with $u(x) = x^n$ (or $du = n x^{n-1} dx$) and $v(x) = -e^{-x}$ (or $dv = e^{-x} dx$) confirms

\[ \Gamma(n+1) = \int_0^\infty x^n e^{-x} dx = u(\infty)v(\infty) - u(0)v(0) - \int_0^\infty n x^{n-1} (-e^{-x}) dx = n \int_0^\infty x^{n-1} e^{-x} dx = n \Gamma(n). \]

Because of $\Gamma(n+1) = n\Gamma(n)$, knowing Γ over (0, 1) is sufficient to find Γ for any positive real number. For example, $\Gamma(7/2) = (5/2)(3/2)(1/2)\Gamma(1/2)$. We can use $u := \sqrt{x}$ to obtain

\[ \Gamma(1/2) = \int_0^\infty x^{-1/2} e^{-x} dx = \int_0^\infty e^{-u^2} \frac{2u}{u} du = 2 \int_0^\infty e^{-u^2} du. \]

Then

\[ \Gamma(1/2)^2 = 4 \int_0^\infty e^{-u^2} du \int_0^\infty e^{-v^2} dv = 4 \int_0^\infty \int_0^\infty e^{-(u^2 + v^2)} du\, dv. \]

Using polar coordinates,

\[ \Gamma(1/2)^2 = 4 \int_0^{\pi/2} \int_0^\infty e^{-r^2} r\, dr\, d\theta = 2\pi \int_0^\infty e^{-r^2} r\, dr = \pi \int_0^\infty e^{-t} dt = \pi. \]

Hence, $\Gamma(1/2) = \sqrt{\pi}$. We also obtain

\[ \Gamma(0) = \int_0^\infty x^{-1} e^{-x} dx \ge \lim_{n \to 0} \int_n^1 x^{-1} e^{-x} dx \ge e^{-1} \lim_{n \to 0} \int_n^1 x^{-1} dx = e^{-1} \lim_{n \to 0} (\ln(1) - \ln(n)) = \infty, \]

so Γ is undefined at 0. By $\Gamma(n) = \Gamma(n+1)/n$, Γ is undefined for all negative integers as well.
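The numerical facts above are easy to verify with Python's standard library, whose math.gamma evaluates Γ(x). A small check (my addition, not part of the original notes):

```python
import math

# math.gamma(x) evaluates the Gamma function, so x! = math.gamma(x + 1).
assert math.isclose(math.gamma(1.0), 1.0)                    # 0! = Gamma(1) = 1
assert math.isclose(math.gamma(0.5) ** 2, math.pi)           # Gamma(1/2) = sqrt(pi)
assert math.isclose(math.gamma(4.5), 3.5 * math.gamma(3.5))  # Gamma(n+1) = n*Gamma(n)

# Gamma(7/2) unwound down to Gamma(1/2) by the recursion:
g72 = (5 / 2) * (3 / 2) * (1 / 2) * math.sqrt(math.pi)
assert math.isclose(math.gamma(7 / 2), g72)
print(math.gamma(0.5), math.gamma(7 / 2))
```

Note that math.gamma raises a ValueError at 0 and the negative integers, consistent with Γ being undefined there.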
Γ takes a +∞ or −∞ value depending on whether we approach 0 or a negative integer from the left or from the right. But in general, we should remember that $|n!| = |\Gamma(n+1)| = \infty$ for negative integer n. The Γ function is plotted in Figure 1; note that the function diverges at the nonpositive integers 0, −1, −2, −3, −4.

Figure 1: $\Gamma(n)$ for $n \in \{-4.975, -4.925, \ldots, 4.975\}$, produced by the R commands x <- -5.025 + (1:200)/20; y <- gamma(x); plot(x, y).

Having presented negative integer factorials, we consider the binomial coefficients $C_m^n$ for nonnegative integers n, m with n < m:

\[ C_m^n = \frac{n(n-1) \cdots (m+1)}{(n-m)!} = 0 \quad \text{for } 0 \le n < m, \]

because $|(n-m)!| \to \infty$ for $n - m < 0$. With this understanding, we can write the binomial formula as

\[ (1+x)^n = \sum_{i=0}^n C_i^n x^i = \sum_{i=0}^\infty C_i^n x^i. \]

Would this binomial formula hold when n < 0? For integers n < 0 and m > 0, we need to define $C_m^n$:

\[ C_m^n := \frac{n(n-1) \cdots (n-m+2)(n-m+1)}{m!} \quad \text{for } n < 0 \le m, \]

where the factorial notation is applied only to the nonnegative m. This binomial coefficient can also be written as

\[ C_m^n = (-1)^m \frac{(-n+m-1)(-n+m-2) \cdots (-n+1)(-n)}{m!} = (-1)^m C_m^{-n+m-1} \quad \text{for } n < 0 \le m. \]

Then the negative binomial expansion for n < 0 and |x| < 1 (a technical condition for convergence) is

\[ (1+x)^n = \sum_{i=0}^\infty C_i^n x^i = \sum_{i=0}^\infty (-1)^i C_i^{-n+i-1} x^i. \]

Interestingly, the formula $(1+x)^n = \sum_i C_i^n x^i$ applies both for n ≥ 0 and n < 0 with the appropriate interpretation of the binomial coefficients for 0 ≤ n < m and n < 0 ≤ m.

Example: Find $(1-x)^{-n-1}$ for n ≥ 0 and 0 < x < 1.

\[ (1-x)^{-n-1} = \sum_{i=0}^\infty C_i^{-n-1} (-x)^i = \sum_{i=0}^\infty C_i^{n+1+i-1} (-1)^i (-1)^i x^i = \sum_{i=0}^\infty C_i^{n+i} x^i. \]

We then obtain the following useful identity:

\[ \sum_{i=0}^\infty C_i^{n+i} x^i = \frac{1}{(1-x)^{n+1}}. \; \diamond \]

8 Some Identities Involving Combinations

$C_m^n + C_{m+1}^n = C_{m+1}^{n+1}$. When making up (m+1)-element subsets from n + 1 elements, you can fix one element among the n + 1 elements.
If this fixed element is in your subset, you can choose the remaining m elements from the other n elements in $C_m^n$ ways. Otherwise, you choose all m + 1 elements from the other n elements in $C_{m+1}^n$ ways.

$\sum_{i=0}^k C_i^m C_{k-i}^n = C_k^{m+n}$. A combination of k objects from a group of m + n objects must have $0 \le i \le k$ objects from the group of m objects and the remaining k − i objects from the group of n objects.

$\sum_{i=m}^n C_m^i = C_{m+1}^{n+1}$ for integers m ≤ n. We use

\[ \sum_{i=m}^n C_m^i = C_m^m + C_m^{m+1} + C_m^{m+2} + \cdots + C_m^n = C_{m+1}^{m+1} + C_m^{m+1} + C_m^{m+2} + \cdots + C_m^n, \]

where $C_m^m = C_{m+1}^{m+1}$. Then

\[ C_{m+1}^{m+1} + C_m^{m+1} + C_m^{m+2} + \cdots + C_m^n = C_{m+1}^{m+2} + C_m^{m+2} + C_m^{m+3} + \cdots + C_m^n, \]

where we use $C_{m+1}^{m+2} = C_{m+1}^{m+1} + C_m^{m+1}$. Continuing this way, we eventually obtain $\sum_{i=m}^n C_m^i = C_{m+1}^n + C_m^n = C_{m+1}^{n+1}$.

If you know of some other useful identities, you can ask me to list them here.
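Because the notes' $C_m^n$ is "n choose m", i.e. math.comb(n, m) in Python, both the Appendix's negative-index coefficients and the identities of this section can be spot-checked numerically. The sketch below is my addition; the helper binom and the sample values (n = −3, m = 4, and so on) are arbitrary illustrative choices:

```python
import math

def binom(n, m):
    """C_m^n = n(n-1)...(n-m+1)/m! for any integer n and m >= 0."""
    num = 1
    for j in range(m):
        num *= n - j
    return num // math.factorial(m)   # exact: m consecutive integers divide by m!

# Negative upper index: C_m^n = (-1)^m C_m^{-n+m-1} for n < 0 <= m.
n, m = -3, 4
assert binom(n, m) == (-1) ** m * math.comb(-n + m - 1, m)

# Series identity sum_i C_i^{n+i} x^i = 1/(1-x)^(n+1), checked via partial sums.
x, nn = 0.5, 2
partial = sum(math.comb(nn + i, i) * x ** i for i in range(80))
assert math.isclose(partial, 1 / (1 - x) ** (nn + 1))

# Pascal, Vandermonde, and hockey-stick identities for sample values:
N, M, K = 10, 4, 6
assert math.comb(N, M) + math.comb(N, M + 1) == math.comb(N + 1, M + 1)
assert sum(math.comb(M, i) * math.comb(N, K - i) for i in range(K + 1)) == math.comb(M + N, K)
assert sum(math.comb(i, M) for i in range(M, N + 1)) == math.comb(N + 1, M + 1)
print("all checks passed")
```

Conveniently, math.comb(n, m) already returns 0 when m > n ≥ 0, matching the convention $C_m^n = 0$ for 0 ≤ n < m, which is why the Vandermonde sum needs no special casing.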