Section 7.1 Properties of Expectations: Introduction

Recall that we have
E[X] = \sum_x x\,p(x)
in the discrete case and
E[X] = \int_{-\infty}^{\infty} x f(x)\,dx
in the continuous case.

COROLLARY: If P\{a \le X \le b\} = 1, then a \le E[X] \le b.

Proof: E[X] = \sum_{x:\,p(x)>0} x\,p(x) \ge a \sum_{x:\,p(x)>0} p(x) = a. Likewise for the upper bound and for the continuous case.

Section 7.2 Expectation of Sums of Random Variables

PROPOSITION: If X and Y have joint p.m.f. p(x, y), then
E[g(X, Y)] = \sum_y \sum_x g(x, y)\,p(x, y).
If X and Y have joint p.d.f. f(x, y), then
E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f(x, y)\,dx\,dy.

EXAMPLE: An accident occurs at a point X that is randomly (i.e. uniformly) selected on a road of length L. An ambulance is located at a point Y on the same road, independent of X (also uniform). Find the expected distance between the accident and the ambulance.

Solution: We need to find E[|X - Y|]. We are given that X and Y are both uniform(0, L) and independent. Therefore,
f(x, y) = \frac{1}{L^2}, \quad 0 < x < L,\ 0 < y < L,
and zero otherwise. Thus,
E[|X - Y|] = \frac{1}{L^2} \int_0^L \int_0^L |x - y|\,dx\,dy.
First we do the inside integral:
\int_0^L |x - y|\,dx = \int_0^y (y - x)\,dx + \int_y^L (x - y)\,dx
= \left[ yx - \tfrac{1}{2}x^2 \right]_{x=0}^{y} + \left[ \tfrac{1}{2}x^2 - yx \right]_{x=y}^{L}
= \tfrac{1}{2}y^2 + \tfrac{1}{2}L^2 - Ly + \tfrac{1}{2}y^2
= y^2 + \tfrac{1}{2}L^2 - Ly.
Thus,
E[|X - Y|] = \frac{1}{L^2} \int_0^L \left( y^2 + \tfrac{1}{2}L^2 - Ly \right) dy = \frac{1}{L^2}\left( \tfrac{1}{3}L^3 + \tfrac{1}{2}L^3 - \tfrac{1}{2}L^3 \right) = \frac{L}{3}.

We now generalize what we did in Chapter 4 by taking g(x, y) = x + y. We have
E[X + Y] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x + y) f(x, y)\,dx\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x f(x, y)\,dx\,dy + \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y f(x, y)\,dx\,dy = E[X] + E[Y].
Also, suppose that X \ge Y. Then X - Y \ge 0, and we know that the expected value of a non-negative random variable is non-negative.
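The L/3 answer in the ambulance example is easy to check numerically. Below is a minimal Monte Carlo sketch (Python is assumed; the function name, seed, and trial count are illustrative choices, not part of the text):

```python
import random

def mean_abs_distance(L=3.0, trials=200_000, seed=1):
    """Estimate E|X - Y| for independent uniform(0, L) points X and Y."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += abs(rng.uniform(0, L) - rng.uniform(0, L))
    return total / trials

print(mean_abs_distance())  # should be close to L/3 = 1.0 for L = 3
```

With a couple hundred thousand trials the estimate typically agrees with L/3 to about two decimal places.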
Therefore, in this case, E[X - Y] \ge 0 and so E[X] \ge E[Y].

Finally, it is trivial to generalize the above:
E[a_1 X_1 + \ldots + a_n X_n] = a_1 E[X_1] + \ldots + a_n E[X_n],
where a_i \in \mathbb{R}.

EXAMPLE (The sample mean): Let X_i be independent, identically distributed random variables with expected value \mu and distribution function F. We say such a sequence is a sample from F. The quantity
\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i
is called the sample mean. We have
E[\bar{X}] = E\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] = \frac{1}{n} \sum_{i=1}^n E[X_i] = \mu.
When \mu is unknown, we use the sample mean as an estimate.

EXAMPLE (Boole's inequality): Show that for any subsets A_1, A_2, \ldots, A_n \subset S we have
P\left( \bigcup_{i=1}^n A_i \right) \le \sum_{i=1}^n P(A_i).

Solution: Define the indicator functions X_i by X_i = 1 if A_i occurs and zero otherwise. Let
X = \sum_{i=1}^n X_i,
so X gives the number of the A_i that have occurred. Finally, let Y = 1 if X \ge 1 and zero otherwise. It is clear that X \ge Y. Thus, E[Y] \le E[X]. First note that
E[X] = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n P(A_i).
Also,
E[Y] = P\{\text{at least one of the } A_i \text{ occurred}\} = P\left( \bigcup_{i=1}^n A_i \right),
showing the result.

EXAMPLE: Suppose that there are N different types of coupons, and each time one obtains a coupon, it is equally likely to be any one of the N types. Find the expected number of coupons one must amass before obtaining a complete set of at least one of each type.

Solution: Let X be the number needed and let X_i, i = 0, \ldots, N-1, be the number of additional coupons that must be obtained, after i distinct types have been collected, in order to obtain another distinct type. Note that
X = X_0 + X_1 + \ldots + X_{N-1}.
When i distinct types have already been collected, there are N - i distinct types left.
Therefore, the probability that any given coupon yields a new type is (N - i)/N. Hence X_i is geometric: for k \ge 1,
P\{X_i = k\} = \left( \frac{i}{N} \right)^{k-1} \frac{N - i}{N}.
That is, X_i is geometric with parameter (N - i)/N, and so
E[X_i] = \frac{N}{N - i}.
Thus,
E[X] = 1 + \frac{N}{N-1} + \frac{N}{N-2} + \ldots + \frac{N}{1} = N\left( 1 + \frac{1}{2} + \ldots + \frac{1}{N} \right).

EXAMPLE: The sequence of coin flips HHTHHHTTTHTHHHT contains two "runs" of heads of length three, one run of length two and one run of length one. If a coin is flipped 100 times, what is the expected number of runs of length 3?

Solution: Let T be the number of runs of length 3. Let I_i be one if a run of length 3 begins at i \in \{1, \ldots, 98\} and zero otherwise. Then T = \sum_{i=1}^{98} I_i and
E[T] = \sum_{i=1}^{98} E[I_i] = \sum_{i=1}^{98} P\{I_i = 1\}.
A run of length 3 begins at i only if there are heads on flips i, i+1, i+2 and tails on flip i+3 (required only if i \ne 98) and on flip i-1 (required only if i \ne 1). Thus,
E[I_1] = P\{H, H, H, T\} = (1/2)^4,
E[I_{98}] = P\{T, H, H, H\} = (1/2)^4,
E[I_i] = P\{T, H, H, H, T\} = (1/2)^5, \quad i \notin \{1, 98\}.
Therefore,
E[T] = \sum_{i=1}^{98} E[I_i] = 2 \cdot \frac{1}{2^4} + 96 \cdot \frac{1}{2^5} = \frac{25}{8} = 3.125.

EXAMPLE: Each night for 30 nights an ecologist sets a trap for a rabbit. The trap is very attractive, so there is always a rabbit in the trap the next morning. (Only one, because the trap isn't large enough for two.) The rabbits that are trapped are marked and released. If there are 100 rabbits in the area and each night the 100 are equally likely to be trapped, what is the expected number of distinct rabbits that are trapped during the 30-night period?
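Before turning to the rabbit problem, note that the E[T] = 3.125 run-count answer above can be checked by simulation (a Python sketch; the trial count and seed are arbitrary):

```python
import random

def avg_runs_of_3(flips=100, trials=20_000, seed=2):
    """Average number of head-runs of exactly length 3 in `flips` fair flips."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        run = 0
        for j in range(flips + 1):            # one extra step flushes a final run
            heads = j < flips and rng.random() < 0.5
            if heads:
                run += 1
            else:
                if run == 3:
                    total += 1
                run = 0
    return total / trials

print(avg_runs_of_3())  # should be near 25/8 = 3.125
```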
Solution: Let X be the number of distinct rabbits trapped during the 30-night period. For i = 1, \ldots, 100, let A_i be one if rabbit i is never caught in the 30 nights and zero otherwise. Then
X = 100 - \sum_{i=1}^{100} A_i
and
E[X] = 100 - \sum_{i=1}^{100} E[A_i] = 100 - \sum_{i=1}^{100} P\{A_i = 1\}.
Now, the probability that a given rabbit is never caught is (99/100)^{30}. Therefore,
E[X] = 100 - \sum_{i=1}^{100} \left( \frac{99}{100} \right)^{30} = 100 - 100\left( \frac{99}{100} \right)^{30} = 100\left( 1 - \left( \frac{99}{100} \right)^{30} \right) \approx 26.0299.

Section 7.3 Moments of Number of Events

Consider a collection of events A_i, and let X_i be one if A_i occurs. Suppose that X = \sum_{i=1}^n X_i is of interest to us. Now ask: how many pairs of events took place? Noting that X_i X_j equals one if and only if both X_i and X_j do, it follows that the number of pairs is
\sum_{i<j} X_i X_j.
Also, because the number of events that do occur is X, the number of pairs of events that occur is exactly \binom{X}{2}. Combining these observations yields
\binom{X}{2} = \sum_{i<j} X_i X_j,
where the sum is over \binom{n}{2} terms. Taking expected values gives us
E\left[ \binom{X}{2} \right] = \sum_{i<j} E[X_i X_j] = \sum_{i<j} P(A_i A_j),
or
E\left[ \frac{X(X-1)}{2} \right] = \sum_{i<j} P(A_i A_j),
which yields
E[X^2] - E[X] = 2 \sum_{i<j} P(A_i A_j).
This gives us E[X^2] and then Var(X) = E[X^2] - (E[X])^2.

EXAMPLE (Moments of a binomial random variable): We have n independent trials, each with probability of success p. Let A_i be the event of success on the ith trial. When i \ne j, P(A_i A_j) = p^2. Thus,
E\left[ \frac{X(X-1)}{2} \right] = \sum_{i<j} P(A_i A_j) = \sum_{i<j} p^2 = \binom{n}{2} p^2 = \frac{n(n-1)}{2} p^2,
and so E[X^2] - E[X] = n(n-1)p^2. Because E[X] = np, we have
Var(X) = E[X^2] - (E[X])^2 = n(n-1)p^2 + np - (np)^2 = np(1-p),
agreeing with the previous calculations.
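The closed form from the rabbit example can be compared against a simulation (a sketch; the constants come from the example, the trial count and seed are arbitrary):

```python
import random

N_RABBITS, NIGHTS = 100, 30
exact = N_RABBITS * (1 - (1 - 1 / N_RABBITS) ** NIGHTS)   # 100(1 - (99/100)^30)

def avg_distinct(trials=20_000, seed=3):
    """Average number of distinct rabbits seen over NIGHTS uniform draws."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len({rng.randrange(N_RABBITS) for _ in range(NIGHTS)})
    return total / trials

print(exact)          # ≈ 26.0299
print(avg_distinct()) # should agree with the exact value closely
```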
EXAMPLE: Suppose there are N distinct coupon types and that, independently of past types collected, each new one obtained is type j with probability p_j, where \sum_{j=1}^N p_j = 1. Find the expected value and variance of the number of different types of coupons that appear among the first n collected.

Solution: Let Y denote the number of different types of collected coupons and let X = N - Y denote the number of uncollected types. Let A_i be the event that no type-i coupon appears among the first n, and let X_i = 1 if A_i occurs. Thus,
X = \sum_{i=1}^N X_i.
In each round the coupon fails to be of type i with probability 1 - p_i. Therefore, the probability that type i is not collected in the first n rounds is
P(A_i) = P\{X_i = 1\} = (1 - p_i)^n.
Therefore,
E[X] = \sum_{i=1}^N (1 - p_i)^n, \quad \text{and} \quad E[Y] = N - E[X] = N - \sum_{i=1}^N (1 - p_i)^n.
The probability that neither type i nor type j is collected in a single round is 1 - p_i - p_j. Therefore,
P(A_i A_j) = (1 - p_i - p_j)^n, \quad i \ne j.
Therefore,
E[X(X-1)] = 2 \sum_{i<j} P(A_i A_j) = 2 \sum_{i<j} (1 - p_i - p_j)^n
and
E[X^2] = 2 \sum_{i<j} (1 - p_i - p_j)^n + E[X].
Hence,
Var(Y) = Var(X) = E[X^2] - (E[X])^2 = 2 \sum_{i<j} (1 - p_i - p_j)^n + \sum_{i=1}^N (1 - p_i)^n - \left( \sum_{i=1}^N (1 - p_i)^n \right)^2.
Note that when p_i = 1/N for all i, we get
E[Y] = N - N\left( 1 - \frac{1}{N} \right)^n = N\left( 1 - \left( 1 - \frac{1}{N} \right)^n \right)
and
Var(Y) = N(N-1)\left( 1 - \frac{2}{N} \right)^n + N\left( 1 - \frac{1}{N} \right)^n - N^2\left( 1 - \frac{1}{N} \right)^{2n}.

Section 7.4 Covariance, Variance of Sums, and Correlations

THEOREM: If X and Y are independent, then for any functions h and g we have
E[g(X)h(Y)] = E[g(X)]\,E[h(Y)].

Proof: Let f(x, y) be the joint density (the discrete case is analogous).
Then,
E[g(X)h(Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x)h(y) f(x, y)\,dx\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x)h(y) f_X(x) f_Y(y)\,dx\,dy
= \int_{-\infty}^{\infty} h(y) f_Y(y)\,dy \int_{-\infty}^{\infty} g(x) f_X(x)\,dx = E[h(Y)]\,E[g(X)].

Variance of linear combinations: Suppose that X = X_1 + X_2 + \ldots + X_n. We already know that E[X] = \sum_{i=1}^n E[X_i], and this doesn't need independence. What about variance? Consider n = 2. We have
Var(X_1 + X_2) = E[(X - E[X])^2] = E[(X_1 + X_2 - \mu_1 - \mu_2)^2] = E[((X_1 - \mu_1) + (X_2 - \mu_2))^2]
= E[(X_1 - \mu_1)^2] + E[(X_2 - \mu_2)^2] + 2E[(X_1 - \mu_1)(X_2 - \mu_2)]
= Var(X_1) + Var(X_2) + 2\,Cov(X_1, X_2),
where the last term is a definition. So,
Var(X_1 + X_2) = Var(X_1) + Var(X_2) + 2\,Cov(X_1, X_2).

It is easy to show (see book) that
1. Cov(X, Y) = Cov(Y, X).
2. Cov(X, X) = Var(X).
3. Cov(aX, Y) = a\,Cov(X, Y).
4. Cov\left( \sum_{i=1}^n X_i, \sum_{j=1}^m Y_j \right) = \sum_{i=1}^n \sum_{j=1}^m Cov(X_i, Y_j).

We also have
Cov(X_1, X_2) = E[(X_1 - \mu_1)(X_2 - \mu_2)] = E[X_1 X_2] - 2\mu_1\mu_2 + \mu_1\mu_2 = E[X_1 X_2] - \mu_1\mu_2.
Note that if X_1 and X_2 are independent, then
Cov(X_1, X_2) = E[X_1 X_2] - \mu_1\mu_2 = E[X_1]E[X_2] - \mu_1\mu_2 = 0.

EXAMPLE: Two random variables may be dependent, but uncorrelated. Let R(X) = \{-1, 0, 1\} with P\{X = i\} = 1/3 for each i, and let Y = X^2. Then,
Cov(X, Y) = E[X(Y - E[Y])] = E[X(X^2 - E[X^2])] = E[X^3] - E[X]E[X^2] = 0 - 0 \cdot E[X^2] = 0,
but X and Y are not independent, since
P\{X = 1, Y = 0\} = 0 \ne P\{X = 1\}P\{Y = 0\} = (1/3)(1/3) = 1/9.
So these are just uncorrelated.

In general, we have
Var(X_1 + X_2 + \ldots + X_n) = E[(X - E[X])^2] = E\left[ \left( \sum_{i=1}^n (X_i - \mu_i) \right)^2 \right] = \sum_{i=1}^n \sum_{j=1}^n Cov(X_i, X_j) = \sum_{i=1}^n Var(X_i) + 2 \sum_{i<j} Cov(X_i, X_j).
Now, if the X_i's are pairwise independent, then
Var\left( \sum_{i=1}^n X_i \right) = \sum_{i=1}^n Var(X_i).

EXAMPLE: Let X be a binomial random variable with parameters n and p, that is, X is the number of successes in n independent trials. Therefore, X = X_1 + \ldots + X_n, where X_i is 1 if the ith trial was a success and zero otherwise.
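The uncorrelated-but-dependent example above can be verified exactly by summing over the three-point sample space (a Python sketch using exact rational arithmetic):

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X^2.
support = [(-1, Fraction(1, 3)), (0, Fraction(1, 3)), (1, Fraction(1, 3))]
EX  = sum(x * p for x, p in support)
EY  = sum(x * x * p for x, p in support)
EXY = sum(x * (x * x) * p for x, p in support)   # E[XY] = E[X^3]
cov = EXY - EX * EY
print(cov)                                        # 0, so X and Y are uncorrelated
# Dependence: P{X=1, Y=0} = 0, but P{X=1} P{Y=0} = 1/9.
p_joint = Fraction(0)
p_product = Fraction(1, 3) * Fraction(1, 3)
print(p_joint == p_product)                       # False, so not independent
```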
The X_i's are independent Bernoulli random variables with E[X_i] = P\{X_i = 1\} = p. Thus (not using independence),
E[X] = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n p = np.
Variance of the binomial: the X_i's are independent. Therefore,
Var(X) = \sum_{i=1}^n Var(X_i) = \sum_{i=1}^n (p - p^2) = np(1 - p).
So much easier!

In general, we have
Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab\,Cov(X, Y)
Var(X + Y) = Var(X) + Var(Y) + 2\,Cov(X, Y)
Var(X - Y) = Var(X) + Var(Y) - 2\,Cov(X, Y)
Var\left( \sum_{i=1}^n a_i X_i \right) = \sum_{i=1}^n a_i^2 Var(X_i) + 2 \sum_{i<j} a_i a_j Cov(X_i, X_j).
Independence implies
Var\left( \sum_{i=1}^n a_i X_i \right) = \sum_{i=1}^n a_i^2 Var(X_i).

EXAMPLE: Suppose there are 100 cards numbered 1 through 100. We draw a card at random. Let X be the number of digits (R(X) = \{1, 2, 3\}) and Y be the number of zeros (R(Y) = \{0, 1, 2\}). The chart giving the probabilities is:

p_{XY}(x, y)    y = 0     y = 1    y = 2   | p_X(x)
x = 1           9/100     0        0       | 9/100
x = 2           81/100    9/100    0       | 90/100
x = 3           0         0        1/100   | 1/100
p_Y(y)          90/100    9/100    1/100   |

E[X] = 1 \cdot \frac{9}{100} + 2 \cdot \frac{90}{100} + 3 \cdot \frac{1}{100} = 1.92
E[Y] = 0 + 1 \cdot \frac{9}{100} + 2 \cdot \frac{1}{100} = 0.11
E[XY] = 2 \cdot 1 \cdot \frac{9}{100} + 3 \cdot 2 \cdot \frac{1}{100} = 0.24
Cov(X, Y) = E[XY] - E[X]E[Y] = 0.24 - 1.92 \cdot 0.11 = 0.0288
(Var(X) = 0.0936, Var(Y) = 0.1179, \rho = Cov(X, Y)/(\sigma_X \sigma_Y) = 0.2742)

LEMMA: Let X_1, \ldots, X_n be a random sample of size n from a distribution F with mean \mu and variance \sigma^2 (just n experiments). Let \bar{X} = (X_1 + \ldots + X_n)/n be the mean of the random sample. Then
E[\bar{X}] = \mu \quad \text{and} \quad Var(\bar{X}) = \frac{\sigma^2}{n}.

Proof: We have
E[\bar{X}] = E\left[ \frac{1}{n}(X_1 + \ldots + X_n) \right] = \frac{1}{n} \cdot n\mu = \mu
and
Var(\bar{X}) = Var\left( \frac{1}{n}(X_1 + \ldots + X_n) \right) = \frac{1}{n^2} Var(X_1 + \ldots
+ X_n) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}.

For real numbers a, b, c, d and random variables X, Y,
Cov(aX + b, cY + d) = E[(aX + b)(cY + d)] - E[aX + b]E[cY + d]
= E[acXY + bcY + adX + bd] - (aE[X] + b)(cE[Y] + d)
= ac(E[XY] - E[X]E[Y]) = ac\,Cov(X, Y).
So Cov(aX + b, cY + d) = ac\,Cov(X, Y).

Correlation

Suppose that X_1 = 10X and Y_1 = 5Y. Then
Cov(X_1, Y_1) = E[(X_1 - E[X_1])(Y_1 - E[Y_1])] = E[(10X - E[10X])(5Y - E[5Y])] = 50\,Cov(X, Y).
So knowing whether the covariance between X and Y is positive or negative is fine, but it doesn't tell you "how correlated" they are. Solution: think in terms of unit-less parameters (standardized random variables). Put
X^* = \frac{X - E[X]}{\sigma_X}, \quad Y^* = \frac{Y - E[Y]}{\sigma_Y}.
Then
Cov(X^*, Y^*) = Cov\left( \frac{X - E[X]}{\sigma_X}, \frac{Y - E[Y]}{\sigma_Y} \right) = \frac{1}{\sigma_X \sigma_Y} Cov(X, Y).
Thus, we define the correlation coefficient between X and Y as
\rho = \rho(X, Y) = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}.
So Cov(X, Y) > 0 iff \rho > 0, and vice versa.

Properties:
1. -1 \le \rho(X, Y) \le 1.
2. \rho(X, Y) = 1 if and only if Y = aX + b for some a > 0.
3. \rho(X, Y) = -1 if and only if Y = aX + b for some a < 0.

We will just prove item 1:
0 \le Var\left( \frac{X}{\sigma_X} + \frac{Y}{\sigma_Y} \right) = \frac{Var(X)}{\sigma_X^2} + \frac{Var(Y)}{\sigma_Y^2} + 2\frac{Cov(X, Y)}{\sigma_X \sigma_Y} = 2[1 + \rho(X, Y)],
thus \rho(X, Y) \ge -1. Similarly,
0 \le Var\left( \frac{X}{\sigma_X} - \frac{Y}{\sigma_Y} \right) = \frac{Var(X)}{\sigma_X^2} + \frac{Var(Y)}{\sigma_Y^2} - 2\frac{Cov(X, Y)}{\sigma_X \sigma_Y} = 2[1 - \rho(X, Y)],
hence \rho(X, Y) \le 1.

EXAMPLE (Investment problem): Suppose someone invests in two financial assets with annual rates of return r_1 and r_2. Let r be the annual rate of return for the overall investment. Let \sigma^2 = Var(r), \sigma_1^2 = Var(r_1) and \sigma_2^2 = Var(r_2). Let w_1 and w_2 be the fractions invested in assets one and two (so w_1 + w_2 = 1), that is, r = w_1 r_1 + w_2 r_2. Then, if w_1, w_2 > 0 and \sigma_1, \sigma_2 \ne 0, we have
\sigma^2 \le \max(\sigma_1^2, \sigma_2^2).

Proof: Assume that \sigma_1 \le \sigma_2 (if not, the argument goes the other way). Thus the max is \sigma_2^2.
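The covariance and correlation figures quoted in the card example above can be reproduced exactly (a Python sketch; exact fractions throughout, with \rho computed in floating point at the end):

```python
from fractions import Fraction

# Joint distribution of (number of digits, number of zeros) for a card 1..100.
joint = {}
for card in range(1, 101):
    s = str(card)
    key = (len(s), s.count("0"))
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 100)

EX   = sum(x * p for (x, _), p in joint.items())
EY   = sum(y * p for (_, y), p in joint.items())
EXY  = sum(x * y * p for (x, y), p in joint.items())
cov  = EXY - EX * EY
varX = sum(x * x * p for (x, _), p in joint.items()) - EX * EX
varY = sum(y * y * p for (_, y), p in joint.items()) - EY * EY
rho  = float(cov) / (float(varX) ** 0.5 * float(varY) ** 0.5)
print(float(EX), float(EY), float(cov), round(rho, 4))  # 1.92 0.11 0.0288 0.2742
```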
Therefore,
\sigma^2 = Var(r) = Var(w_1 r_1 + w_2 r_2) = w_1^2 Var(r_1) + w_2^2 Var(r_2) + 2w_1 w_2 Cov(r_1, r_2)
= w_1^2 \sigma_1^2 + w_2^2 \sigma_2^2 + 2w_1 w_2 \rho \sigma_1 \sigma_2
\le w_1^2 \sigma_2^2 + w_2^2 \sigma_2^2 + 2w_1 w_2 \sigma_2^2 \quad (\text{using } \sigma_1 \le \sigma_2 \text{ and } \rho \le 1)
= (w_1 + w_2)^2 \sigma_2^2 = \sigma_2^2 = \max(\sigma_1^2, \sigma_2^2).

Section 7.5 Conditional Expectation

Recall that if X and Y are jointly discrete, then the conditional probability mass function of X given Y = y is
p_{X|Y}(x|y) = P\{X = x \mid Y = y\} = \frac{p(x, y)}{p_Y(y)}.
We therefore define the conditional expectation of X given Y = y, for all y such that p_Y(y) > 0, to be
E[X \mid Y = y] = \sum_x x P\{X = x \mid Y = y\} = \sum_x x\,p_{X|Y}(x|y).
Therefore, if X and Y are independent, then E[X \mid Y = y] = E[X].

EXAMPLE: If X and Y are independent binomial(n, p) random variables (same n and p), calculate the conditional expectation of X given that X + Y = m.

Solution: We first need the conditional probability mass function of X given X + Y = m. Note that the range of k is k \le \min(n, m), since
P\{X = k \mid X + Y = m\} = \frac{P\{X = k, X + Y = m\}}{P\{X + Y = m\}} = \frac{P\{X = k, Y = m - k\}}{P\{X + Y = m\}}.
We know that the sum of two binomials (with the same p) is binomial. Thus X + Y is binomial(2n, p), so
P\{X + Y = m\} = \binom{2n}{m} p^m (1 - p)^{2n - m}.
Therefore,
P\{X = k \mid X + Y = m\} = \frac{P\{X = k\} P\{Y = m - k\}}{P\{X + Y = m\}} = \frac{\binom{n}{k} p^k (1-p)^{n-k} \binom{n}{m-k} p^{m-k} (1-p)^{n-m+k}}{\binom{2n}{m} p^m (1-p)^{2n-m}} = \frac{\binom{n}{k}\binom{n}{m-k}}{\binom{2n}{m}}.
This is a hypergeometric random variable with parameters m, 2n, and n. Therefore, the mean is
E[X \mid X + Y = m] = \frac{m \cdot n}{2n} = \frac{m}{2}.

Similarly to the discrete case, the conditional density of X given Y = y is
f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)}.
Thus, in this case we define
E[X \mid Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}(x|y)\,dx,
provided that f_Y(y) > 0. Therefore, if X and Y are independent, then E[X \mid Y = y] = E[X].

EXAMPLE: Suppose
f(x, y) = \frac{1}{y} e^{-x/y} e^{-y}, \quad 0 < x < \infty,\ 0 < y < \infty.
Compute E[X \mid Y = y].
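Before the continuous example, the E[X | X + Y = m] = m/2 conclusion above can be confirmed by brute-force enumeration for small n (a Python sketch; the chosen n, m, and p values are arbitrary):

```python
from fractions import Fraction
from math import comb

def cond_mean(n, p, m):
    """E[X | X + Y = m] for independent X, Y ~ binomial(n, p), by enumeration."""
    pmf = lambda k: comb(n, k) * p**k * (1 - p)**(n - k)
    ks = range(max(0, m - n), min(n, m) + 1)
    num = sum(k * pmf(k) * pmf(m - k) for k in ks)
    den = sum(pmf(k) * pmf(m - k) for k in ks)
    return num / den

for n, m in [(4, 3), (5, 6), (7, 7)]:
    print(n, m, cond_mean(n, Fraction(1, 3), m))  # always m/2, whatever p is
```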
Solution: We have
f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)} = \frac{f(x, y)}{\int_{-\infty}^{\infty} f(x, y)\,dx} = \frac{\frac{1}{y} e^{-x/y} e^{-y}}{\int_0^{\infty} \frac{1}{y} e^{-x/y} e^{-y}\,dx} = \frac{1}{y} e^{-x/y},
which is exponential with parameter 1/y. Therefore, the mean is y. That is,
E[X \mid Y = y] = \int_0^{\infty} x \cdot \frac{1}{y} e^{-x/y}\,dx = y.

Important: Conditional expectations satisfy all the properties of ordinary expectations. This follows because they are just ordinary expectations, as conditional probabilities or densities are themselves probabilities and densities. Therefore,
E[g(X) \mid Y = y] = \sum_x g(x)\,p_{X|Y}(x|y) \quad \text{in the discrete case},
E[g(X) \mid Y = y] = \int_{-\infty}^{\infty} g(x) f_{X|Y}(x|y)\,dx \quad \text{in the continuous case},
and
E\left[ \sum_i X_i \,\Big|\, Y = y \right] = \sum_i E[X_i \mid Y = y].

Computing Expectations by Conditioning

NOTATION: We will denote by E[X|Y] the function of the random variable Y whose value at Y = y is E[X \mid Y = y]. Note that E[X|Y] is itself a random variable.

THEOREM: We have
E[X] = E[E[X|Y]].  (1)
If Y is discrete, then one can deduce from the above theorem that
E[X] = \sum_y E[X \mid Y = y] P\{Y = y\},  (2)
and if Y is continuous with density f_Y(y), then
E[X] = \int_{-\infty}^{\infty} E[X \mid Y = y] f_Y(y)\,dy.  (3)

Proof (discrete case): We have
\sum_y E[X \mid Y = y] P\{Y = y\} = \sum_y \sum_x x P\{X = x \mid Y = y\} P\{Y = y\}
= \sum_y \sum_x x \frac{P\{X = x, Y = y\}}{P\{Y = y\}} P\{Y = y\}
= \sum_y \sum_x x P\{X = x, Y = y\}
= \sum_x x \sum_y P\{X = x, Y = y\}
= \sum_x x P\{X = x\} = E[X].

Proof (continuous case): We have
\int_{-\infty}^{\infty} E[X \mid Y = y] f_Y(y)\,dy = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x f_{X|Y}(x|y) f_Y(y)\,dx\,dy
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x \frac{f(x, y)}{f_Y(y)} f_Y(y)\,dx\,dy
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x f(x, y)\,dx\,dy
= \int_{-\infty}^{\infty} x \int_{-\infty}^{\infty} f(x, y)\,dy\,dx
= \int_{-\infty}^{\infty} x f_X(x)\,dx = E[X].

EXAMPLE: A miner is trapped in a mine containing 3 doors. The first door leads to a tunnel that will take him to safety after 3 hours of travel. The second door leads to a tunnel that will return him to the mine after 5 hours of travel. The third door leads to a tunnel that will return him to the mine after 7 hours.
If we assume that the miner is at all times equally likely to choose any one of the doors, what is the expected length of time until he reaches safety?

Solution: Let X denote the amount of time (in hours) until the miner reaches safety, and let Y denote the door he initially chooses. Now,
E[X] = E[X \mid Y = 1] P\{Y = 1\} + E[X \mid Y = 2] P\{Y = 2\} + E[X \mid Y = 3] P\{Y = 3\}
= \frac{1}{3}\left( E[X \mid Y = 1] + E[X \mid Y = 2] + E[X \mid Y = 3] \right).
However,
E[X \mid Y = 1] = 3, \quad E[X \mid Y = 2] = 5 + E[X], \quad E[X \mid Y = 3] = 7 + E[X].
Therefore,
E[X] = \frac{1}{3}(3 + 5 + E[X] + 7 + E[X]),
hence E[X] = 15.

EXAMPLE: Consider independent trials, each with a probability p of success. Let N be the time of the first success. Find Var(N). [Hint: Think about the geometric distribution.]

Solution: We will condition on the first trial being a success or failure. Let Y = 1 if the first trial is a success and Y = 0 if it is a failure. We know that E[N] = 1/p and that
Var(N) = E[N^2] - (E[N])^2.
Thus, we need the second moment.
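The 15-hour answer in the miner example can be checked by simulating the escape directly (a Python sketch; the trial count and seed are arbitrary):

```python
import random

def avg_escape_time(trials=50_000, seed=5):
    """Average time to safety: door 1 escapes in 3h, doors 2 and 3 loop back."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        t = 0
        while True:
            door = rng.randrange(3)
            if door == 0:
                t += 3
                break
            t += 5 if door == 1 else 7
        total += t
    return total / trials

print(avg_escape_time())  # should be near 15
```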
We will calculate it by conditioning on Y:
E[N^2] = E[E[N^2 | Y]].
Therefore,
E[N^2] = E[N^2 \mid Y = 0] P\{Y = 0\} + E[N^2 \mid Y = 1] P\{Y = 1\} = (1 - p) E[N^2 \mid Y = 0] + p E[N^2 \mid Y = 1].  (4)
Clearly, E[N^2 \mid Y = 1] = 1. We now show that E[N^2 \mid Y = 0] = E[(1 + N)^2]. In fact, we have
E[N^2 \mid Y = 0] = \sum_{n=1}^{\infty} n^2 P\{N = n \mid Y = 0\} = \sum_{n=1}^{\infty} n^2 \frac{P\{N = n, Y = 0\}}{1 - p}.
Note that P\{N = n, Y = 0\} = 0 if n = 1 and (1 - p)^{n-1} p otherwise. Therefore,
E[N^2 \mid Y = 0] = \sum_{n=2}^{\infty} n^2 (1 - p)^{n-2} p = \sum_{n=1}^{\infty} (n + 1)^2 (1 - p)^{n-1} p = E[(N + 1)^2].
Going back to equation (4), we have
E[N^2] = p + (1 - p) E[N^2 + 2N + 1] = 1 + 2 \cdot \frac{1 - p}{p} + (1 - p) E[N^2].
Solving this equation for E[N^2], we get
E[N^2] = \frac{2 - p}{p^2}.
Finally,
Var(N) = \frac{2 - p}{p^2} - \frac{1}{p^2} = \frac{1 - p}{p^2}.

EXAMPLE: Let U_i be a sequence of uniform(0, 1) independent random variables and let
N = \min\left\{ n : \sum_{i=1}^n U_i > 1 \right\}.
Find E[N].

Solution: We will solve a more general problem. For x \in [0, 1], let
N(x) = \min\left\{ n : \sum_{i=1}^n U_i > x \right\}
and set m(x) = E[N(x)]. We derive an expression by conditioning on U_1. For any x we have
m(x) = E[N(x)] = E[E[N(x) | U_1]] = \int_{-\infty}^{\infty} E[N(x) \mid U_1 = y] f_{U_1}(y)\,dy = \int_0^1 E[N(x) \mid U_1 = y]\,dy.
Clearly, if y > x, then E[N(x) \mid U_1 = y] = 1. What about the other case? One can argue that it should be
E[N(x) \mid U_1 = y] = 1 + m(x - y) \quad \text{if } y \le x.
In fact, we first note that
E[N(x) \mid U_1 = y] = \sum_{n=2}^{\infty} n P\{N(x) = n \mid U_1 = y\}.
Now, assuming y \le x, we obtain
P\{N(x) = n \mid U_1 = y\} = P\{U_2 + \ldots + U_n > x - y,\ U_2 + \ldots
+ U_{n-1} < x - y\} = P\{N(x - y) = n - 1\}.
Thus,
E[N(x) \mid U_1 = y] = \sum_{n=2}^{\infty} n P\{N(x - y) = n - 1\} = \sum_{n=1}^{\infty} (n + 1) P\{N(x - y) = n\} = m(x - y) + 1,
therefore
m(x) = \int_0^1 E[N(x) \mid U_1 = y]\,dy = \int_0^x (1 + m(x - y))\,dy + \int_x^1 1\,dy = 1 - x + x + \int_0^x m(u)\,du \quad (\text{set } u = x - y),
which is
m(x) = 1 + \int_0^x m(u)\,du.
Differentiating both sides, we get
m'(x) = m(x), \quad m(0) = 1.
This is equivalent to
\frac{m'(x)}{m(x)} = 1 \implies \frac{d}{dx} \ln(m(x)) = 1 \implies \ln(m(x)) = x + C,
or m(x) = e^C e^x. We use the initial condition m(0) = 1 to see that m(x) = e^x. Thus, m(1) = e.

Calculating Probabilities by Conditioning

Let E be an event and set X = 1 if E occurs and 0 if E does not occur. Then
E[X] = P(E), \quad E[X \mid Y = y] = P(E \mid Y = y) \text{ for any } Y.
Using this in (2) and (3), we get
P(E) = \sum_y P(E \mid Y = y) P(Y = y) \quad \text{if } Y \text{ is discrete},
P(E) = \int_{-\infty}^{\infty} P(E \mid Y = y) f_Y(y)\,dy \quad \text{if } Y \text{ is continuous}.  (5)

EXAMPLE (The best-prize problem): Suppose that we are to be presented with n distinct prizes, in sequence. After being presented with a prize, we must immediately decide whether to accept it or to reject it and consider the next prize. The only information we are given when deciding whether to accept a prize is the relative rank of that prize compared to ones already seen.
That is, for instance, when the fifth prize is presented, we learn how it compares with the four prizes we've already seen. Suppose that once a prize is rejected, it is lost, and that our objective is to maximize the probability of obtaining the best prize. Assuming that all n! orderings of the prizes are equally likely, how well can we do?

Solution: Fix k < n. Strategy: Let the first k prizes go by, then accept the first prize that is better than all the prizes seen before it. Let Z_k be the event that we choose the best prize with this strategy. We want to find P(Z_k) and then maximize it with respect to k. We will condition on the position, X, of the best prize. Thus,
P(Z_k) = \sum_{i=1}^n P(Z_k \mid X = i) P(X = i) = \frac{1}{n} \sum_{i=1}^n P(Z_k \mid X = i).
Note that P(Z_k \mid X = i) = 0 if i \le k. Thus,
P(Z_k) = \frac{1}{n} \sum_{i=k+1}^n P(Z_k \mid X = i).
Now suppose i > k. We show that
P(Z_k \mid X = i) = \frac{k}{i - 1} \quad \text{if } i > k.
Indeed, if X = k + 1, then we definitely get the prize and we have P(Z_k \mid X = k + 1) = k/k = 1. Similarly, if X = k + 2, we get
P(Z_k \mid X = k + 2) = 1 - \frac{1}{k + 1} = \frac{k}{k + 1},
where 1/(k+1) represents the probability that the largest of the first k + 1 prizes was in the (k+1)st slot. In general, if X = i, then
P(Z_k \mid X = i) = 1 - \frac{i - k - 1}{i - 1} = \frac{k}{i - 1},
where (i - k - 1)/(i - 1) represents the probability that the largest of the first i - 1 prizes was in one of the slots k + 1, \ldots, i - 1. It follows that
P(Z_k) = \frac{1}{n} \sum_{i=k+1}^n \frac{k}{i - 1} = \frac{k}{n} \sum_{i=k}^{n-1} \frac{1}{i} \approx \frac{k}{n} (\ln(n - 1) - \ln(k - 1)) \approx \frac{k}{n} \ln\left( \frac{n}{k} \right).
Now let g(x) = \frac{x}{n} \ln\left( \frac{n}{x} \right). We want to maximize g with respect to x over the interval [0, n]. Differentiating,
g'(x) = \frac{1}{n} \ln\left( \frac{n}{x} \right) - \frac{1}{n}.
Therefore,
g'(x) = 0 \implies \ln\left( \frac{n}{x} \right) = 1 \implies x = \frac{n}{e}.
One can see that g'(x) > 0 if 0 < x < n/e and g'(x) < 0 if x > n/e. Therefore, the value found is a maximum. Plugging n/e into g yields
g(n/e) = \frac{1}{e} \ln e = \frac{1}{e}.

EXAMPLE: Let U be uniform(0, 1). Suppose that the conditional distribution of X, given U = p, is binomial(n, p).
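The roughly-1/e performance of the best-prize strategy above can be checked by simulating the let-k-go-by rule with k ≈ n/e (a Python sketch; n, the trial count, and the seed are arbitrary):

```python
import math
import random

def win_probability(n=100, trials=20_000, seed=7):
    """Estimate P(best prize selected): skip k ≈ n/e prizes, then take the
    first prize that beats everything seen so far."""
    rng = random.Random(seed)
    k = round(n / math.e)
    wins = 0
    for _ in range(trials):
        order = list(range(n))          # value n-1 is the best prize
        rng.shuffle(order)
        best_seen = max(order[:k])
        chosen = next((v for v in order[k:] if v > best_seen), None)
        wins += chosen == n - 1
    return wins / trials

print(win_probability(), 1 / math.e)  # estimate should be near 0.368
```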
Find the probability mass function of X; i.e., if we flip a coin n times with an unknown, uniformly chosen p, what is P\{X = i\} for i \in \{0, 1, \ldots, n\}?

Solution: We will condition on U and use the fact that
P\{X = i\} = \int_0^1 P\{X = i \mid U = p\} f_U(p)\,dp.
Thus,
P\{X = i\} = \int_0^1 P\{X = i \mid U = p\}\,dp = \int_0^1 \binom{n}{i} p^i (1 - p)^{n-i}\,dp.
It can be shown that
\int_0^1 p^i (1 - p)^{n-i}\,dp = \frac{i!\,(n - i)!}{(n + 1)!}.
Thus,
P\{X = i\} = \binom{n}{i} \cdot \frac{i!\,(n - i)!}{(n + 1)!} = \frac{1}{n + 1}.
That is, we obtain the surprising result that if a coin whose probability of coming up heads is uniformly distributed over (0, 1) is flipped n times, then the number of heads occurring is equally likely to be any of the values 0, \ldots, n.

Section 7.7 Moment Generating Functions

DEFINITION: The moment generating function M(t) of the random variable X is defined for all real values of t by
M(t) = E[e^{tX}] = \sum_x e^{tx} p(x) \quad \text{if } X \text{ is discrete with mass function } p(x),
M(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx \quad \text{if } X \text{ is continuous with density } f(x).
We call M(t) the moment generating function because all the moments of X can be obtained by successively differentiating M(t) and then evaluating the result at t = 0. For example,
M'(t) = \frac{d}{dt} E[e^{tX}] = E\left[ \frac{d}{dt} e^{tX} \right] = E[X e^{tX}],
where we have assumed that the interchange of the differentiation and expectation operators is legitimate.
Hence,
M'(0) = E[X].
Similarly,
M''(t) = \frac{d}{dt} M'(t) = \frac{d}{dt} E[X e^{tX}] = E\left[ \frac{d}{dt} X e^{tX} \right] = E[X^2 e^{tX}].
Thus,
M''(0) = E[X^2].
In general, the nth derivative of M(t) is given by
M^{(n)}(t) = E[X^n e^{tX}], \quad n \ge 1,
implying that
M^{(n)}(0) = E[X^n], \quad n \ge 1.

EXAMPLE: If X is a binomial random variable with parameters n and p, then
M(t) = E[e^{tX}] = \sum_{k=0}^n e^{tk} \binom{n}{k} p^k (1 - p)^{n-k} = \sum_{k=0}^n \binom{n}{k} (p e^t)^k (1 - p)^{n-k} = (p e^t + 1 - p)^n.
Therefore,
M'(t) = n(p e^t + 1 - p)^{n-1} p e^t.
Thus, E[X] = M'(0) = np. Similarly,
M''(t) = n(n - 1)(p e^t + 1 - p)^{n-2} (p e^t)^2 + n(p e^t + 1 - p)^{n-1} p e^t,
so E[X^2] = M''(0) = n(n - 1)p^2 + np. The variance of X is given by
Var(X) = E[X^2] - (E[X])^2 = n(n - 1)p^2 + np - n^2 p^2 = np(1 - p),
verifying the result obtained previously.

EXAMPLE: If X is a Poisson random variable with parameter \lambda, then
M(t) = E[e^{tX}] = \sum_{n=0}^{\infty} e^{tn} \frac{e^{-\lambda} \lambda^n}{n!} = e^{-\lambda} \sum_{n=0}^{\infty} \frac{(\lambda e^t)^n}{n!} = e^{-\lambda} e^{\lambda e^t} = \exp\{\lambda(e^t - 1)\}.
Therefore,
M'(t) = \lambda e^t \exp\{\lambda(e^t - 1)\},
M''(t) = (\lambda e^t)^2 \exp\{\lambda(e^t - 1)\} + \lambda e^t \exp\{\lambda(e^t - 1)\}.
Thus E[X] = M'(0) = \lambda and E[X^2] = M''(0) = \lambda^2 + \lambda, so
Var(X) = E[X^2] - (E[X])^2 = \lambda.

EXAMPLE: If X is an exponential random variable with parameter \lambda, then
M(t) = E[e^{tX}] = \int_0^{\infty} e^{tx} \lambda e^{-\lambda x}\,dx = \lambda \int_0^{\infty} e^{-(\lambda - t)x}\,dx = \frac{\lambda}{\lambda - t} \quad \text{for } t < \lambda.
We note from this derivation that, for the exponential distribution, M(t) is defined only for values of t less than \lambda. Differentiation of M(t) yields
M'(t) = \frac{\lambda}{(\lambda - t)^2}, \quad M''(t) = \frac{2\lambda}{(\lambda - t)^3}.
Hence E[X] = M'(0) = 1/\lambda and E[X^2] = M''(0) = 2/\lambda^2, so
Var(X) = E[X^2] - (E[X])^2 = \frac{1}{\lambda^2}.

EXAMPLE: Let X be a normal random variable with parameters \mu and \sigma^2. We first compute the moment generating function of a standard normal random variable, with parameters 0 and 1.
Letting Z be such a random variable, we have
M_Z(t) = E[e^{tZ}] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tx} e^{-x^2/2}\,dx
= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left\{ -\frac{x^2 - 2tx}{2} \right\} dx
= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left\{ -\frac{(x - t)^2}{2} + \frac{t^2}{2} \right\} dx
= e^{t^2/2} \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(x - t)^2/2}\,dx = e^{t^2/2}.
Hence, the moment generating function of the standard normal random variable Z is M_Z(t) = e^{t^2/2}. We now compute the moment generating function of a normal random variable X = \mu + \sigma Z with parameters \mu and \sigma^2. We have
M_X(t) = E[e^{tX}] = E[e^{t(\mu + \sigma Z)}] = e^{t\mu} E[e^{t\sigma Z}] = e^{t\mu} M_Z(t\sigma) = e^{t\mu} e^{(t\sigma)^2/2} = \exp\left\{ \frac{\sigma^2 t^2}{2} + \mu t \right\}.
By differentiating, we obtain
M_X'(t) = (\mu + t\sigma^2) \exp\left\{ \frac{\sigma^2 t^2}{2} + \mu t \right\},
M_X''(t) = (\mu + t\sigma^2)^2 \exp\left\{ \frac{\sigma^2 t^2}{2} + \mu t \right\} + \sigma^2 \exp\left\{ \frac{\sigma^2 t^2}{2} + \mu t \right\}.
Thus E[X] = M'(0) = \mu and E[X^2] = M''(0) = \mu^2 + \sigma^2, so
Var(X) = E[X^2] - (E[X])^2 = \sigma^2.

An important property of moment generating functions is that the moment generating function of the sum of independent random variables equals the product of the individual moment generating functions. To prove this, suppose that X and Y are independent and have moment generating functions M_X(t) and M_Y(t), respectively. Then M_{X+Y}(t), the moment generating function of X + Y, is given by
M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = M_X(t) M_Y(t).
Another important result is that the moment generating function uniquely determines the distribution. That is, if M_X(t) exists and is finite in some region about t = 0, then the distribution of X is uniquely determined.

EXAMPLE: Let X and Y be independent binomial random variables with parameters (n, p) and (m, p), respectively. What is the distribution of X + Y?

Solution: The moment generating function of X + Y is given by
M_{X+Y}(t) = M_X(t) M_Y(t) = (p e^t + 1 - p)^n (p e^t + 1 - p)^m = (p e^t + 1 - p)^{m+n}.
However, (p e^t + 1 - p)^{m+n} is the moment generating function of a binomial random variable having parameters m + n and p. Thus, this must be the distribution of X + Y.
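The MGF argument identifies X + Y as binomial(m + n, p); the same conclusion can be verified by convolving the two probability mass functions directly (a Python sketch with exact arithmetic; the parameter values are arbitrary):

```python
from fractions import Fraction
from math import comb

def binom_pmf(n, p, k):
    """P{X = k} for X ~ binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, m, p = 4, 6, Fraction(2, 5)
conv = [sum(binom_pmf(n, p, j) * binom_pmf(m, p, k - j)
            for j in range(max(0, k - m), min(n, k) + 1))
        for k in range(n + m + 1)]
direct = [binom_pmf(n + m, p, k) for k in range(n + m + 1)]
print(conv == direct)  # True: the convolution is exactly binomial(n + m, p)
```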
EXAMPLE: Let X and Y be independent Poisson random variables with respective means \lambda_1 and \lambda_2. What is the distribution of X + Y?

Solution: The moment generating function of X + Y is given by
M_{X+Y}(t) = M_X(t) M_Y(t) = \exp\{\lambda_1(e^t - 1)\} \exp\{\lambda_2(e^t - 1)\} = \exp\{(\lambda_1 + \lambda_2)(e^t - 1)\}.
Hence, X + Y is Poisson distributed with mean \lambda_1 + \lambda_2.

EXAMPLE: Let X and Y be independent normal random variables with respective parameters (\mu_1, \sigma_1^2) and (\mu_2, \sigma_2^2). What is the distribution of X + Y?

Solution: The moment generating function of X + Y is given by
M_{X+Y}(t) = M_X(t) M_Y(t) = \exp\left\{ \frac{\sigma_1^2 t^2}{2} + \mu_1 t \right\} \exp\left\{ \frac{\sigma_2^2 t^2}{2} + \mu_2 t \right\} = \exp\left\{ \frac{(\sigma_1^2 + \sigma_2^2) t^2}{2} + (\mu_1 + \mu_2) t \right\},
which is the moment generating function of a normal random variable with mean \mu_1 + \mu_2 and variance \sigma_1^2 + \sigma_2^2.

Discrete Probability Distributions

Binomial with parameters n, p; 0 \le p \le 1:
  p(x) = \binom{n}{x} p^x (1 - p)^{n-x}, \quad x = 0, 1, \ldots, n
  M(t) = (p e^t + 1 - p)^n;  mean np;  variance np(1 - p)

Poisson with parameter \lambda > 0:
  p(x) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots
  M(t) = \exp\{\lambda(e^t - 1)\};  mean \lambda;  variance \lambda

Geometric with parameter 0 \le p \le 1:
  p(x) = p(1 - p)^{x-1}, \quad x = 1, 2, \ldots
  M(t) = \frac{p e^t}{1 - (1 - p)e^t};  mean \frac{1}{p};  variance \frac{1 - p}{p^2}

Negative binomial with parameters r, p; 0 \le p \le 1:
  p(n) = \binom{n - 1}{r - 1} p^r (1 - p)^{n-r}, \quad n = r, r + 1, \ldots
  M(t) = \left( \frac{p e^t}{1 - (1 - p)e^t} \right)^r;  mean \frac{r}{p};  variance \frac{r(1 - p)}{p^2}

Continuous Probability Distributions

Uniform over (a, b):
  f(x) = \frac{1}{b - a} for a < x < b, and 0 otherwise
  M(t) = \frac{e^{tb} - e^{ta}}{t(b - a)};  mean \frac{a + b}{2};  variance \frac{(b - a)^2}{12}

Exponential with parameter \lambda > 0:
  f(x) = \lambda e^{-\lambda x} for x \ge 0, and 0 for x < 0
  M(t) = \frac{\lambda}{\lambda - t} \ (t < \lambda);  mean \frac{1}{\lambda};  variance \frac{1}{\lambda^2}

Gamma with parameters (s, \lambda), \lambda > 0:
  f(x) = \frac{\lambda e^{-\lambda x} (\lambda x)^{s-1}}{\Gamma(s)} for x \ge 0, and 0 for x < 0
  M(t) = \left( \frac{\lambda}{\lambda - t} \right)^s;  mean \frac{s}{\lambda};  variance \frac{s}{\lambda^2}

Normal with parameters (\mu, \sigma^2):
  f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x - \mu)^2 / 2\sigma^2}, \quad -\infty < x < \infty
  M(t) = \exp\left\{ \mu t + \frac{\sigma^2 t^2}{2} \right\};  mean \mu;  variance \sigma^2
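As a closing check, the means and variances in the tables can be recovered numerically from the listed moment generating functions by central-difference differentiation at t = 0 (a Python sketch; the step size and parameter values are arbitrary):

```python
import math

def moments_from_mgf(M, h=1e-5):
    """Approximate (E[X], Var(X)) from an MGF via central differences at t = 0."""
    m1 = (M(h) - M(-h)) / (2 * h)                 # ≈ M'(0) = E[X]
    m2 = (M(h) - 2 * M(0.0) + M(-h)) / (h * h)    # ≈ M''(0) = E[X^2]
    return m1, m2 - m1 * m1

n, p, lam = 10, 0.3, 2.0
checks = [
    (lambda t: (p * math.exp(t) + 1 - p) ** n, n * p, n * p * (1 - p)),  # binomial
    (lambda t: math.exp(lam * (math.exp(t) - 1)), lam, lam),             # Poisson
    (lambda t: lam / (lam - t), 1 / lam, 1 / lam ** 2),                  # exponential, t < lam
]
for M, mean, var in checks:
    est_mean, est_var = moments_from_mgf(M)
    print(round(est_mean, 4), mean, round(est_var, 4), var)
```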