Section 7.1 Properties of Expectations: Introduction
Recall that we have
E[X] = \sum_x x p(x)
in the discrete case and
E[X] = \int_{-\infty}^{\infty} x f(x) \, dx
in the continuous case.
COROLLARY: If P{a ≤ X ≤ b} = 1, then
a ≤ E[X] ≤ b
Proof:
E[X] = \sum_{x : p(x) > 0} x p(x) ≥ a \sum_{x : p(x) > 0} p(x) = a
Likewise for the upper bound and the continuous case.
Section 7.2 Expectation of Sums of Random Variables
PROPOSITION: If X and Y have joint p.m.f. p(x, y), then
E[g(X, Y)] = \sum_y \sum_x g(x, y) p(x, y)
If X and Y have joint p.d.f. f(x, y), then
E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f(x, y) \, dx \, dy
EXAMPLE: An accident occurs at a point X that is randomly (i.e. uniformly) selected on a road of length
L. An ambulance is located at a point Y on the same road independent of X (also uniform). Find the
expected distance between the accident and ambulance.
Solution: We need to find E[|X − Y |]. We are given that both X and Y are uniform(0, L) and that they
are independent. Therefore,
f(x, y) = \frac{1}{L^2},   0 < x < L, 0 < y < L
and zero otherwise. Thus,
E[|X − Y|] = \frac{1}{L^2} \int_0^L \int_0^L |x − y| \, dx \, dy
First we do the inside integral:
\int_0^L |x − y| \, dx = \int_0^y (y − x) \, dx + \int_y^L (x − y) \, dx
= \left( yx − \frac{1}{2} x^2 \right)\Big|_{x=0}^{y} + \left( \frac{1}{2} x^2 − yx \right)\Big|_{x=y}^{L}
= \frac{1}{2} y^2 + \frac{1}{2} L^2 − Ly + \frac{1}{2} y^2
= y^2 + \frac{1}{2} L^2 − Ly
Thus,
E[|X − Y|] = \frac{1}{L^2} \int_0^L \left( y^2 + \frac{1}{2} L^2 − Ly \right) dy = \frac{1}{L^2} \left( \frac{1}{3} L^3 + \frac{1}{2} L^3 − \frac{1}{2} L^3 \right) = \frac{1}{3} L
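A quick Monte Carlo check of this result (a minimal sketch, assuming Python with the standard library; the road length L = 10 and the trial count are arbitrary choices):

# Monte Carlo check that E[|X - Y|] = L/3 for independent uniform(0, L) points.
import random

L = 10.0          # arbitrary road length for the check
trials = 200_000
total = 0.0
for _ in range(trials):
    x = random.uniform(0, L)   # accident location
    y = random.uniform(0, L)   # ambulance location
    total += abs(x - y)

print(total / trials, "vs", L / 3)   # both should be close to 3.33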
We now generalize what we did in chapter 4 by taking g(x, y) = x + y. We have
E[X + Y] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x + y) f(x, y) \, dx \, dy
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x f(x, y) \, dx \, dy + \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} y f(x, y) \, dx \, dy
= E[X] + E[Y]
Also, suppose that X ≥ Y . Then, X − Y ≥ 0. We know that the expected value of a non-negative random
variable is non-negative. Therefore, in this case, E[X − Y ] ≥ 0 and so
E[X] ≥ E[Y ]
Finally, it is trivial to generalize the above: That is,
E[a_1 X_1 + \ldots + a_n X_n] = a_1 E[X_1] + \ldots + a_n E[X_n]
where a_i ∈ R.
EXAMPLE (The sample mean): Let Xi be independent, identically distributed random variables with
expected value µ and distribution function F . We say such a sequence is a sample from F . The quantity
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
is called the sample mean. We have
E[\bar{X}] = E\left[ \frac{1}{n} \sum_{i=1}^{n} X_i \right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = E[X_i] = µ
When µ is unknown, we use the sample mean as an estimate.
EXAMPLE (Boole's inequality): Show that for any subsets A_1, A_2, \ldots, A_n ⊂ S we have
P\left( \bigcup_{i=1}^{n} A_i \right) ≤ \sum_{i=1}^{n} P(A_i)
Solution: Define the indicator functions X_i by
X_i = 1 if A_i occurs
and zero otherwise. Let
X = \sum_{i=1}^{n} X_i
and so X gives the number of the A_i that have occurred. Finally, let
Y = 1 if X ≥ 1
and zero otherwise. It is clear that X ≥ Y. Thus, E[Y] ≤ E[X]. First note that
E[X] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} P(A_i)
Also,
E[Y] = P{at least one of the A_i occurred} = P\left( \bigcup_{i=1}^{n} A_i \right)
showing the result.
EXAMPLE: Suppose that there are N different types of coupons, and each time one obtains a coupon,
it is equally likely to be any one of the N types. Find the expected number of coupons one need amass
before obtaining a complete set of at least one of each type.
Solution: Let X be the number needed and let Xi , i = 0, . . . , N − 1, be the number of additional coupons
that need be obtained after i distinct types have been collected in order to obtain another distinct type.
Note that
X = X_0 + X_1 + \ldots + X_{N−1}
When i distinct types have already been collected, there are (N − i) distinct types left. Therefore, the probability that any choice will yield a new type is (N − i)/N. Hence, as X_i is a geometric random variable, for k ≥ 1 we have
P{X_i = k} = \left( \frac{i}{N} \right)^{k−1} \frac{N − i}{N}
Therefore, X_i is geometric with parameter (N − i)/N and so
E[X_i] = \frac{N}{N − i}
Thus,
E[X] = 1 + \frac{N}{N − 1} + \frac{N}{N − 2} + \ldots + \frac{N}{1} = N \left( 1 + \frac{1}{2} + \ldots + \frac{1}{N} \right)
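A short simulation of the coupon-collector answer against N times the harmonic sum (a sketch, assuming Python; N = 10 types and the trial count are arbitrary choices):

# Simulate collecting coupons until all N types are seen; compare to N * H_N.
import random

N = 10
trials = 20_000
total = 0
for _ in range(trials):
    seen = set()
    draws = 0
    while len(seen) < N:
        seen.add(random.randrange(N))   # each draw is uniform over the N types
        draws += 1
    total += draws

harmonic = sum(1 / k for k in range(1, N + 1))
print(total / trials, "vs", N * harmonic)   # both approximately 29.29 for N = 10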
EXAMPLE: The sequence of coin flips HHTHHHTTTHTHHHT contains two “runs” of heads of length
three, one run of length two and one run of length one. If a coin is flipped 100 times, what is the expected
number of runs of length 3?
Solution: Let T be the number of runs of length 3. Let I_i be one if a run of length 3 begins at i ∈ {1, . . . , 98} and zero otherwise. Then T = \sum_{i=1}^{98} I_i and
E[T] = \sum_{i=1}^{98} E[I_i] = \sum_{i=1}^{98} P{I_i = 1}
A run of length 3 begins at i only if there are heads on flips i, i + 1, i + 2 and tails on flip i + 3 (required only if i ≠ 98) and on flip i − 1 (required only if i ≠ 1). Thus,
E[I_1] = P{H, H, H, T} = (1/2)^4
E[I_{98}] = P{T, H, H, H} = (1/2)^4
E[I_i] = P{T, H, H, H, T} = (1/2)^5,   i ∉ {1, 98}
Therefore,
E[T] = \sum_{i=1}^{98} E[I_i] = 2 \cdot \frac{1}{2^4} + 96 \cdot \frac{1}{2^5} = \frac{25}{8} = 3.125
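The indicator computation can be checked by simulation (a sketch, assuming a fair coin and Python's random module; the trial count is arbitrary):

# Count maximal runs of heads of length exactly 3 in 100 fair coin flips.
import random

def runs_of_length_3(flips):
    count = 0
    run = 0
    for f in flips + "T":          # sentinel tail so a final run is counted
        if f == "H":
            run += 1
        else:
            if run == 3:
                count += 1
            run = 0
    return count

trials = 100_000
total = 0
for _ in range(trials):
    flips = "".join(random.choice("HT") for _ in range(100))
    total += runs_of_length_3(flips)

print(total / trials, "vs", 25 / 8)   # expected value 3.125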
EXAMPLE: Each night for 30 nights an ecologist sets a trap for a rabbit. The trap is very attractive so
there is always a rabbit in the trap the next morning. (Only one because the trap isn’t large enough for
two.) The rabbits that are trapped are marked and released. If there are 100 rabbits in the area and each
night the 100 are equally likely to be trapped, what is the expected number of distinct rabbits that are
trapped during the 30 night period?
Solution: Let X be the number of distinct rabbits that are trapped during the 30 night period. For
i = 1, . . . , 100, let Ai be one if rabbit i is never caught in the 30 nights and zero otherwise. Then
X = 100 − \sum_{i=1}^{100} A_i
and
E[X] = 100 − \sum_{i=1}^{100} E[A_i] = 100 − \sum_{i=1}^{100} P{A_i = 1}
Now, the probability that a given rabbit is never caught is (99/100)^{30}. Therefore,
E[X] = 100 − \sum_{i=1}^{100} \left( \frac{99}{100} \right)^{30} = 100 − 100 \left( \frac{99}{100} \right)^{30} = 100 \left( 1 − \left( \frac{99}{100} \right)^{30} \right) ≈ 26.0299
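A simulation matches the ≈ 26.03 figure (a sketch, assuming Python; each night one of the 100 rabbits is trapped uniformly at random, and the trial count is arbitrary):

# Expected number of distinct rabbits trapped over 30 nights out of 100 rabbits.
import random

trials = 50_000
total = 0
for _ in range(trials):
    trapped = {random.randrange(100) for _ in range(30)}   # one rabbit per night
    total += len(trapped)

print(total / trials, "vs", 100 * (1 - (99 / 100) ** 30))  # both about 26.03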
Section 7.3 Moments of Number of Events
Consider a collection of events A_i. Let X_i be one if A_i occurs. Suppose that X = \sum_{i=1}^{n} X_i is of interest to us.
Now we ask: how many pairs of events took place? Noting that X_i X_j equals one if and only if both X_i and X_j do, it follows that the number of pairs is
\sum_{i<j} X_i X_j
Also, because the number of events that do occur is X, the number of pairs of events that occur is exactly
\binom{X}{2}
Combining these observations yields
\binom{X}{2} = \sum_{i<j} X_i X_j
where the sum is over the \binom{n}{2} terms. Taking expected values gives us
E\left[ \binom{X}{2} \right] = \sum_{i<j} E[X_i X_j] = \sum_{i<j} P(A_i A_j)
or
E\left[ \frac{X(X − 1)}{2} \right] = \sum_{i<j} P(A_i A_j)
which yields
E[X^2] − E[X] = 2 \sum_{i<j} P(A_i A_j)
which gives us E[X^2] and then Var(X) = E[X^2] − (E[X])^2.
EXAMPLE (Moments of a binomial random variable): We have n independent trials, each with probability of success p. Let A_i be the event of success on the ith trial. When i ≠ j, P(A_i A_j) = p^2. Thus,
E\left[ \frac{X(X − 1)}{2} \right] = \sum_{i<j} P(A_i A_j) = \sum_{i<j} p^2 = \frac{n(n − 1)}{2} p^2
and so
E[X^2] − E[X] = n(n − 1)p^2
Because E[X] = np, we have
Var(X) = E[X^2] − (E[X])^2 = n(n − 1)p^2 + np − (np)^2 = np(1 − p)
agreeing with the previous calculations.
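The factorial-moment identity E[X(X − 1)] = n(n − 1)p^2 can also be checked numerically (a sketch, assuming Python; n = 20 and p = 0.3 are arbitrary choices):

# Check E[X(X-1)] = n(n-1)p^2 for a binomial(n, p) random variable.
import random

n, p = 20, 0.3
trials = 200_000
total = 0
for _ in range(trials):
    x = sum(random.random() < p for _ in range(n))   # one binomial(n, p) draw
    total += x * (x - 1)

print(total / trials, "vs", n * (n - 1) * p ** 2)    # both about 34.2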
EXAMPLE: Suppose there are N distinct coupon types and that, independently of past types collected, each new one obtained is type j with probability p_j, obviously with \sum_{j=1}^{N} p_j = 1. Find the expected value and variance of the number of different types of coupons that appear among the first n collected.
Solution: Let Y denote the number of different types of collected coupons and let X = N − Y denote the
number of uncollected types. Let Ai be the event that the ith coupon is not collected, and Xi = 1 if Ai
occurs. Thus,
X = \sum_{i=1}^{N} X_i
With probability (1 − p_i), the ith coupon type is not collected in a given round. Therefore, the probability that the ith type is not collected in the first n rounds is
P(A_i) = P{X_i = 1} = (1 − p_i)^n
Therefore, E[X] = \sum_{i=1}^{N} (1 − p_i)^n, and
E[Y] = N − E[X] = N − \sum_{i=1}^{N} (1 − p_i)^n
The probability that neither type i nor j is collected in a single round is (1 − p_i − p_j). Therefore,
P(A_i A_j) = (1 − p_i − p_j)^n,   i ≠ j
Therefore,
E[X(X − 1)] = 2 \sum_{i<j} P(A_i A_j) = 2 \sum_{i<j} (1 − p_i − p_j)^n
and
E[X^2] = 2 \sum_{i<j} (1 − p_i − p_j)^n + E[X]
Hence,
Var(Y) = Var(X) = E[X^2] − (E[X])^2 = 2 \sum_{i<j} (1 − p_i − p_j)^n + \sum_{i=1}^{N} (1 − p_i)^n − \left( \sum_{i=1}^{N} (1 − p_i)^n \right)^2
Note that when p_i = 1/N for all i, we get
E[Y] = N − N \left( 1 − \frac{1}{N} \right)^n = N \left( 1 − \left( 1 − \frac{1}{N} \right)^n \right)
and
Var(Y) = N(N − 1) \left( 1 − \frac{2}{N} \right)^n + N \left( 1 − \frac{1}{N} \right)^n − N^2 \left( 1 − \frac{1}{N} \right)^{2n}
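A simulation of the unequal-probability case against these formulas (a sketch, assuming Python; the type probabilities p_j, the number of draws n, and the trial count below are arbitrary choices):

# Number of distinct coupon types among the first n draws, unequal probabilities.
import random
from itertools import combinations

p = [0.4, 0.3, 0.2, 0.1]   # arbitrary type probabilities, summing to 1
N, n = len(p), 6

trials = 100_000
samples = []
for _ in range(trials):
    types = {random.choices(range(N), weights=p)[0] for _ in range(n)}
    samples.append(len(types))

mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials

EY = N - sum((1 - pi) ** n for pi in p)           # E[Y] from the formula
EX = N - EY                                       # E[X], the uncollected types
VarY = (2 * sum((1 - p[i] - p[j]) ** n for i, j in combinations(range(N), 2))
        + EX - EX ** 2)                           # Var(Y) from the formula
print(mean, "vs", EY)
print(var, "vs", VarY)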
Section 7.4 Covariance, Variance of Sums, and Correlations
THEOREM: If X and Y are independent, then for any functions h and g we have
E[g(X)h(Y )] = E[g(X)]E[h(Y )]
Proof: Let f(x, y) be the joint p.d.f. (the discrete case is similar). Then,
E[g(X)h(Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x) h(y) f(x, y) \, dx \, dy = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x) h(y) f_X(x) f_Y(y) \, dx \, dy
= \int_{-\infty}^{\infty} h(y) f_Y(y) \, dy \int_{-\infty}^{\infty} g(x) f_X(x) \, dx = E[h(Y)] E[g(X)]
Variance of linear combinations: Suppose that
X = X1 + X2 + . . . + Xn
We already know that
E[X] = \sum_{i=1}^{n} E[X_i]
and this doesn't need independence. What about variance? Consider n = 2. We have
Var(X1 + X2 ) = E[(X − E[X])2 ] = E[(X1 + X2 − µ1 − µ2 )2 ] = E[((X1 − µ1 ) + (X2 − µ2 ))2 ]
= E[(X1 − µ1 )2 ] + E[(X2 − µ2 )2 ] + 2E[(X1 − µ1 )(X2 − µ2 )] = Var(X1 ) + Var(X2 ) + 2Cov(X1 , X2 )
where the last equality defines the covariance Cov(X_1, X_2). So,
Var(X1 + X2 ) = E[(X − E[X])2 ] = Var(X1 ) + Var(X2 ) + 2Cov(X1 , X2 )
It is easy to show (see book) that
1. Cov(X, Y ) = Cov(Y, X).
2. Cov(X, X) = Var(X).
3. Cov(aX, Y ) = aCov(X, Y ).
4. Cov\left( \sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{m} Cov(X_i, Y_j).
We also have
Cov(X1 , X2 ) = E[(X1 − µ1 )(X2 − µ2 )] = E[X1 X2 ] − 2µ1 µ2 + µ1 µ2 = E[X1 X2 ] − µ1 µ2
Note that if X1 and X2 are independent, then
Cov(X1 , X2 ) = E[X1 X2 ] − µ1 µ2 = E[X1 ]E[X2 ] − µ1 µ2 = 0
EXAMPLE: Two random variables may be dependent, but uncorrelated. Let R(X) = {−1, 0, 1} with
P {X = i} = 1/3 for each i. Let Y = X 2 . Then,
Cov(X, Y ) = E[X(Y − E[Y ])] = E[X(X 2 − E[X 2 ])] = E[X 3 ] − E[X]E[X 2 ] = 0 − 0 · E[X 2 ] = 0
but not independent, since
P{X = 1, Y = 0} = 0 ≠ P{X = 1}P{Y = 0} = (1/3)(1/3) = 1/9
So these are just uncorrelated.
We have
Var(X_1 + X_2 + \ldots + X_n) = Var(X) = E[(X − E[X])^2]
= E\left[ \left( \sum_{i=1}^{n} X_i − \sum_{i=1}^{n} E[X_i] \right)^2 \right] = E\left[ \left( \sum_{i=1}^{n} (X_i − µ_i) \right)^2 \right]
= \sum_{i=1}^{n} \sum_{j=1}^{n} Cov(X_i, X_j)
= \sum_{i=1}^{n} Var(X_i) + 2 \sum_{i<j} Cov(X_i, X_j)
Now, if the X_i's are pairwise independent, then
Var\left( \sum_{i=1}^{n} X_i \right) = \sum_{i=1}^{n} Var(X_i)
EXAMPLE: Let X be a binomial random variable with parameters n and p, that is, X is the number of
successes in n independent trials. Therefore,
X = X1 + . . . + Xn
where X_i is 1 if the ith trial was a success and zero otherwise. Therefore, the X_i's are independent Bernoulli random variables and E[X_i] = P{X_i = 1} = p. Thus (not using independence),
E[X] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} p = np
Variance of binomial: The X_i's are independent. Therefore,
Var(X) = \sum_{i=1}^{n} Var(X_i) = \sum_{i=1}^{n} (p − p^2) = np(1 − p)
So much easier! In general, we have
Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y)
Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y)
Var\left( \sum_{i=1}^{n} a_i X_i \right) = \sum_{i=1}^{n} a_i^2 Var(X_i) + 2 \sum_{i<j} a_i a_j Cov(X_i, X_j)
Independence implies
Var\left( \sum_{i=1}^{n} a_i X_i \right) = \sum_{i=1}^{n} a_i^2 Var(X_i)
EXAMPLE: Suppose there are 100 cards numbered 1 through 100. We draw a card at random. Let X
be the number of digits (R(X) = {1, 2, 3}) and Y be the number of zeros (R(Y ) = {0, 1, 2}). The chart
giving the probabilities is:
p_{XY}(x, y)    y = 0     y = 1     y = 2     p_X(x)
x = 1           9/100     0         0         9/100
x = 2           81/100    9/100     0         90/100
x = 3           0         0         1/100     1/100
p_Y(y)          90/100    9/100     1/100
Thus,
E[X] = 1 · \frac{9}{100} + 2 · \frac{90}{100} + 3 · \frac{1}{100} = 1.92
E[Y] = 0 + 1 · \frac{9}{100} + 2 · \frac{1}{100} = 0.11
E[XY] = 2 · 1 · \frac{9}{100} + 3 · 2 · \frac{1}{100} = 0.24
Cov(X, Y) = E[XY] − E[X]E[Y] = 0.24 − 1.92 · 0.11 = 0.0288
(Var(X) = 0.0936, Var(Y) = 0.1179, ρ = Cov(X, Y)/(σ_X σ_Y) = 0.2742)
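Since the sample space is just the 100 equally likely cards, these quantities can be checked by direct enumeration (a sketch in Python):

# Enumerate the 100 equally likely cards and compute Cov(X, Y) and rho exactly.
from math import sqrt

cards = range(1, 101)
X = [len(str(c)) for c in cards]            # number of digits
Y = [str(c).count("0") for c in cards]      # number of zeros

n = len(X)
EX = sum(X) / n
EY = sum(Y) / n
EXY = sum(x * y for x, y in zip(X, Y)) / n
cov = EXY - EX * EY
var_x = sum(x * x for x in X) / n - EX ** 2
var_y = sum(y * y for y in Y) / n - EY ** 2
print(cov, cov / sqrt(var_x * var_y))       # 0.0288 and about 0.2742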
LEMMA: Let X_1, . . . , X_n be a random sample of size n from a distribution F with mean µ and variance σ^2 (just n experiments). Let \bar{X} = (X_1 + . . . + X_n)/n be the mean of the random sample. Then
E[\bar{X}] = µ,    Var(\bar{X}) = \frac{σ^2}{n}
Proof: We have
E[\bar{X}] = E\left[ \frac{X_1 + . . . + X_n}{n} \right] = \frac{1}{n} nµ = µ
and
Var(\bar{X}) = Var\left( \frac{1}{n}(X_1 + . . . + X_n) \right) = \frac{1}{n^2} Var(X_1 + . . . + X_n) = \frac{1}{n^2} nσ^2 = \frac{σ^2}{n}
For real numbers a, b, c, d and random variables X, Y
Cov(aX + b, cY + d) = E[(aX + b)(cY + d)] − E[aX + b]E[cY + d]
= E[acXY + bcY + adX + bd] − [aE[X] + b][cE[Y ] + d]
= ac(E[XY ] − E[X]E[Y ])
= acCov(X, Y )
So
Cov(aX + b, cY + d) = acCov(X, Y )
Correlation
Suppose that X1 = 10X and Y1 = 5Y , then
Cov(X1 , Y1 ) = E[(X1 − E(X1 ))(Y1 − E(Y1 ))] = E[(10X − E(10X))(5Y − E(5Y ))] = 50Cov(X, Y )
So knowing if the covariance between X and Y is positive or negative is fine, but doesn’t tell you “how
correlated.” Solution: think in terms of unit-less parameters (standardized random variables). Put
X^* = \frac{X − E[X]}{σ_X},    Y^* = \frac{Y − E[Y]}{σ_Y}
then
Cov(X^*, Y^*) = Cov\left( \frac{X − E[X]}{σ_X}, \frac{Y − E[Y]}{σ_Y} \right) = \frac{1}{σ_X σ_Y} Cov(X, Y)
Thus, we define the correlation coefficient between X and Y as
ρ = ρ(X, Y) = \frac{Cov(X, Y)}{σ_X σ_Y}
So Cov(X, Y) > 0 if and only if ρ > 0, and likewise for negative values.
Properties:
1. −1 ≤ ρ(X, Y ) ≤ 1.
2. ρ(X, Y ) = 1 if and only if Y = aX + B for some a > 0.
3. ρ(X, Y ) = −1 if and only if Y = aX + B for some a < 0.
We will just prove item 1:
0 ≤ Var\left( \frac{X}{σ_X} + \frac{Y}{σ_Y} \right) = \frac{Var(X)}{σ_X^2} + \frac{Var(Y)}{σ_Y^2} + 2 \frac{Cov(X, Y)}{σ_X σ_Y} = 2[1 + ρ(X, Y)]
thus, ρ(X, Y) ≥ −1. Similarly,
0 ≤ Var\left( \frac{X}{σ_X} − \frac{Y}{σ_Y} \right) = \frac{Var(X)}{σ_X^2} + \frac{Var(Y)}{σ_Y^2} − 2 \frac{Cov(X, Y)}{σ_X σ_Y} = 2[1 − ρ(X, Y)]
hence ρ(X, Y) ≤ 1.
EXAMPLE (Investment problem): Suppose someone invests in two financial assets with annual rates of return r_1 and r_2. Let r be the annual rate of return for the overall investment. Let σ^2 = Var(r), σ_1^2 = Var(r_1) and σ_2^2 = Var(r_2). Let w_1 and w_2 be the fractions of the investment in assets one and two, that is, r = w_1 r_1 + w_2 r_2. Then, if w_1, w_2 > 0 and σ_1, σ_2 ≠ 0 we have
σ^2 < \max(σ_1^2, σ_2^2)
Proof: Assume that σ_1 ≤ σ_2 (if not, go the other way). Thus the max is σ_2^2. Therefore
σ^2 = Var(r) = Var(w_1 r_1 + w_2 r_2) = w_1^2 Var(r_1) + w_2^2 Var(r_2) + 2 w_1 w_2 Cov(r_1, r_2)
= w_1^2 σ_1^2 + w_2^2 σ_2^2 + 2 w_1 w_2 ρ σ_1 σ_2   (now use that σ_1 ≤ σ_2 and ρ ≤ 1)
≤ w_1^2 σ_2^2 + w_2^2 σ_2^2 + 2 w_1 w_2 σ_2^2 = (w_1 + w_2)^2 σ_2^2 = σ_2^2 = \max(σ_1^2, σ_2^2)
Section 7.5 Conditional Expectation
Recall that if X and Y are jointly discrete, then the conditional probability mass function of X given Y = y is
p_{X|Y}(x|y) = P{X = x|Y = y} = \frac{p(x, y)}{p_Y(y)}
We therefore define the conditional expectation of X given Y = y, for all y such that p_Y(y) > 0, to be
E[X|Y = y] = \sum_x x P{X = x|Y = y} = \sum_x x p_{X|Y}(x|y)
Therefore, if X and Y are independent, then E[X|Y = y] = E[X].
EXAMPLE: If X and Y are both binomial(n, p) independent random variables (same n and p), calculate
the conditional expectation of X given X + Y = m.
Solution: We first need the conditional probability mass function of X, given X + Y = m. Note that the range of values k for X satisfies k ≤ min(n, m), since
P{X = k|X + Y = m} = \frac{P{X = k, X + Y = m}}{P{X + Y = m}} = \frac{P{X = k, Y = m − k}}{P{X + Y = m}}
We know that the sum of two binomials (with the same p) is binomial. Thus X + Y is binomial(2n, p). So,
P{X + Y = m} = \binom{2n}{m} p^m (1 − p)^{2n−m}
Therefore,
P{X = k|X + Y = m} = \frac{P{X = k, Y = m − k}}{P{X + Y = m}} = \frac{P{X = k} P{Y = m − k}}{P{X + Y = m}}
= \frac{ \binom{n}{k} p^k (1 − p)^{n−k} \binom{n}{m−k} p^{m−k} (1 − p)^{n−m+k} }{ \binom{2n}{m} p^m (1 − p)^{2n−m} } = \frac{ \binom{n}{k} \binom{n}{m−k} }{ \binom{2n}{m} }
This is a hypergeometric random variable with parameters m, 2n, and n. Therefore, the mean is
E[X|X + Y = m] = \frac{m \cdot n}{2n} = \frac{m}{2}
Similarly to the discrete case, the conditional density of X given Y = y is
f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)}
Thus, in this case we define
E[X|Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}(x|y) \, dx
provided that f_Y(y) > 0. Therefore, if X and Y are independent, then E[X|Y = y] = E[X].
EXAMPLE: Suppose
f(x, y) = \frac{1}{y} e^{−x/y} e^{−y},    0 < x < ∞, 0 < y < ∞
Compute E[X|Y = y].
Solution: We have
f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)} = \frac{f(x, y)}{\int_{-\infty}^{\infty} f(x, y) \, dx} = \frac{ \frac{1}{y} e^{−x/y} e^{−y} }{ \int_0^{\infty} \frac{1}{y} e^{−x/y} e^{−y} \, dx } = \frac{1}{y} e^{−x/y}
which is exponential with parameter 1/y. Therefore, the mean is y. That is,
E[X|Y = y] = \int_0^{\infty} x \cdot \frac{1}{y} e^{−x/y} \, dx = y
Important: Conditional expectations satisfy all the properties of ordinary expectations. This follows
because they are just ordinary expectations, as conditional probabilities or densities are themselves probabilities and densities. Therefore,
E[g(X)|Y = y] = \begin{cases} \sum_x g(x) p_{X|Y}(x|y) & \text{in the discrete case} \\ \int_{-\infty}^{\infty} g(x) f_{X|Y}(x|y) \, dx & \text{in the continuous case} \end{cases}
and
E\left[ \sum_i X_i \,\Big|\, Y = y \right] = \sum_i E[X_i | Y = y]
Computing Expectations by Conditioning
NOTATION: We will denote by E[X|Y ] the function of the random variable Y whose value at Y = y is
E[X|Y = y]. Note that E[X|Y ] is itself a random variable.
THEOREM: We have
E[X] = E[E[X|Y]]        (1)
If Y is discrete, then one can deduce from the above Theorem that
E[X] = \sum_y E[X|Y = y] P{Y = y}        (2)
and if Y is continuous with density f_Y(y), then
E[X] = \int_{-\infty}^{\infty} E[X|Y = y] f_Y(y) \, dy        (3)
Proof (Discrete Case): We have
\sum_y E[X|Y = y] P{Y = y} = \sum_y \sum_x x P{X = x|Y = y} P{Y = y}
= \sum_y \sum_x x \frac{P{X = x, Y = y}}{P{Y = y}} P{Y = y}
= \sum_y \sum_x x P{X = x, Y = y}
= \sum_x x \sum_y P{X = x, Y = y}
= \sum_x x P{X = x}
= E[X]
Proof (Continuous Case): We have
\int_{-\infty}^{\infty} E[X|Y = y] f_Y(y) \, dy = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x f_{X|Y}(x|y) f_Y(y) \, dx \, dy
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x \frac{f(x, y)}{f_Y(y)} f_Y(y) \, dx \, dy
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x f(x, y) \, dx \, dy
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x f(x, y) \, dy \, dx
= \int_{-\infty}^{\infty} x \int_{-\infty}^{\infty} f(x, y) \, dy \, dx
= \int_{-\infty}^{\infty} x f_X(x) \, dx
= E[X]
EXAMPLE: A miner is trapped in a mine containing 3 doors. The first door leads to a tunnel that will
take him to safety after 3 hours of travel. The second door leads to a tunnel that will return him to the
mine after 5 hours of travel. The third door leads to a tunnel that will return him to the mine after 7
hours. If we assume that the miner is at all times equally likely to choose any one of the doors, what is
the expected length of time until he reaches safety?
Solution: Let X denote the amount of time (in hours) until the miner reaches safety, and let Y denote the
door he initially chooses. Now,
E[X] = E[X|Y = 1]P {Y = 1} + E[X|Y = 2]P {Y = 2} + E[X|Y = 3]P {Y = 3}
= \frac{1}{3} (E[X|Y = 1] + E[X|Y = 2] + E[X|Y = 3])
However,
E[X|Y = 1] = 3
E[X|Y = 2] = 5 + E[X]
E[X|Y = 3] = 7 + E[X]
Therefore,
E[X] = \frac{1}{3} (3 + 5 + E[X] + 7 + E[X])
hence
E[X] = 15
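The conditioning argument can be sanity-checked by simulating the miner's walk (a sketch, assuming Python; the trial count is arbitrary):

# Simulate the trapped miner; the average escape time should be close to 15 hours.
import random

trials = 100_000
total = 0.0
for _ in range(trials):
    time = 0.0
    while True:
        door = random.randrange(3)
        if door == 0:
            time += 3
            break            # reaches safety
        elif door == 1:
            time += 5        # returns to the mine
        else:
            time += 7        # returns to the mine
    total += time

print(total / trials)        # approximately 15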
EXAMPLE: Consider independent trials, each with a probability p of success. Let N be the time of the
first success. Find Var(N). [Hint: Think about geometric distribution]
Solution: We will condition on the first trial being a success or failure. Let
Y = 1 if the first step is a success
Y = 0 if the first step is a failure
We know that E[N] = 1/p and that
Var(N) = E[N^2] − (E[N])^2
Thus, we need the second moment. We will calculate it by conditioning on Y:
E[N^2] = E[E[N^2|Y]]
Therefore,
E[N^2] = E[N^2|Y = 0] P{Y = 0} + E[N^2|Y = 1] P{Y = 1} = (1 − p) E[N^2|Y = 0] + p E[N^2|Y = 1]        (4)
Clearly,
E[N^2|Y = 1] = 1
We now show that
E[N^2|Y = 0] = E[(1 + N)^2]
In fact, we have
E[N^2|Y = 0] = \sum_{n=1}^{\infty} n^2 P{N = n|Y = 0} = \sum_{n=1}^{\infty} n^2 \frac{P{N = n, Y = 0}}{1 − p}
Note that P{N = n, Y = 0} = 0 if n = 1 and (1 − p)^{n−1} p otherwise. Therefore,
E[N^2|Y = 0] = \sum_{n=2}^{\infty} n^2 (1 − p)^{n−2} p = \sum_{n=1}^{\infty} (n + 1)^2 (1 − p)^{n−1} p = E[(N + 1)^2]
Going back to equation (4), we have
E[N^2] = p + (1 − p) E[N^2 + 2N + 1] = 1 + 2 \cdot \frac{1 − p}{p} + (1 − p) E[N^2]
Solving this equation for E[N^2], we get
E[N^2] = \frac{2 − p}{p^2}
Finally,
Var(N) = \frac{2 − p}{p^2} − \frac{1}{p^2} = \frac{1 − p}{p^2}
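A direct simulation of the first-success time confirms the variance formula (a sketch, assuming Python; p = 0.3 and the trial count are arbitrary choices):

# Variance of the time N of the first success in Bernoulli(p) trials.
import random

p = 0.3
trials = 200_000
samples = []
for _ in range(trials):
    n = 1
    while random.random() >= p:   # keep going until the first success
        n += 1
    samples.append(n)

mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials
print(var, "vs", (1 - p) / p ** 2)   # both about 7.78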
EXAMPLE: Let U_i be a sequence of uniform(0, 1) independent random variables and let
N = \min\left\{ n : \sum_{i=1}^{n} U_i > 1 \right\}
Find E[N].
Solution: We will solve a more general problem. For x ∈ [0, 1], let
N(x) = \min\left\{ n : \sum_{i=1}^{n} U_i > x \right\}
Set
m(x) = E[N(x)]
We derive an expression by conditioning on U_1. For any x we have
m(x) = E[N(x)] = E[E[N(x)|U_1]] = \int_{-\infty}^{\infty} E[N(x)|U_1 = y] f_{U_1}(y) \, dy = \int_0^1 E[N(x)|U_1 = y] \, dy
Clearly, if y > x, then E[N(x)|U_1 = y] = 1. What about the other case? One can argue that it should be
E[N(x)|U_1 = y] = 1 + m(x − y)   if y ≤ x
In fact, we first note that
E[N(x)|U_1 = y] = \sum_{n=2}^{\infty} n P{N(x) = n|U_1 = y}
Now, assuming y ≤ x, we obtain
P{N(x) = n|U_1 = y} = P{U_2 + \ldots + U_n > x − y, \; U_2 + \ldots + U_{n−1} < x − y} = P{N(x − y) = n − 1}
Thus,
E[N(x)|U_1 = y] = \sum_{n=2}^{\infty} n P{N(x − y) = n − 1} = \sum_{n=1}^{\infty} (n + 1) P{N(x − y) = n} = m(x − y) + 1
therefore
m(x) = \int_0^1 E[N(x)|U_1 = y] \, dy = \int_0^x (1 + m(x − y)) \, dy + \int_x^1 1 \, dy = 1 − x + x + \int_0^x m(x − y) \, dy
which is
= 1 + \int_0^x m(u) \, du   (set u = x − y)
Differentiating both sides, we get
m′(x) = m(x),    m(0) = 1
This is equivalent to
\frac{m′(x)}{m(x)} = 1 \implies \frac{d}{dx} \ln(m(x)) = 1 \implies \ln(m(x)) = x + C
or
m(x) = e^C e^x
We use the initial condition m(0) = 1 to see that m(x) = ex . Thus, m(1) = e.
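The answer E[N] = e is easy to check by simulation (a sketch, assuming Python; the trial count is arbitrary):

# Average number of uniform(0,1) draws needed for the running sum to exceed 1.
import random, math

trials = 200_000
total = 0
for _ in range(trials):
    s, n = 0.0, 0
    while s <= 1.0:
        s += random.random()
        n += 1
    total += n

print(total / trials, "vs", math.e)   # both about 2.718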
Calculating Probabilities by Conditioning
Let E be an event and set X = 1 if E occurs and 0 if E does not occur. Then
E[X] = P (E),
E[X|Y = y] = P (E|Y = y)
for any Y. Using this in (2) and (3), we get
P(E) = \begin{cases} \sum_y P(E|Y = y) P(Y = y) & \text{if Y is discrete} \\ \int_{-\infty}^{\infty} P(E|Y = y) f_Y(y) \, dy & \text{if Y is continuous} \end{cases}        (5)
EXAMPLE (The best-prize problem): Suppose that we are to be presented with n distinct prizes, in
sequence. After being presented with a prize, we must immediately decide whether to accept it or to reject
it and consider the next prize. The only information we are given when deciding whether to accept a prize
is the relative rank of that prize compared to ones already seen. That is, for instance, when the fifth prize
is presented, we learn how it compares with the four prizes we’ve already seen. Suppose that once a prize
is rejected, it is lost, and that our objective is to maximize the probability of obtaining the best prize.
Assuming that all n! orderings of the prizes are equally likely, how well can we do?
Solution: Fix k < n. Strategy: Let the first k prizes go by, and then accept the first prize that is better than all of the previous ones. Let Z_k be the event that we choose the best prize with this strategy. We want to find P(Z_k) and then maximize it with respect to k. We will condition on the position, X, of the best prize. Thus,
P(Z_k) = \sum_{i=1}^{n} P(Z_k|X = i) P(X = i) = \frac{1}{n} \sum_{i=1}^{n} P(Z_k|X = i)
Note that P(Z_k|X = i) = 0 if i ≤ k. Thus,
P(Z_k) = \frac{1}{n} \sum_{i=k+1}^{n} P(Z_k|X = i)
Now suppose i > k. We show that
P(Z_k|X = i) = \frac{k}{i − 1}   if i > k
Indeed, if X = k + 1, then we definitely get the prize and we have P(Z_k|X = k + 1) = \frac{k}{k} = 1. Similarly, if X = k + 2, we get
P(Z_k|X = k + 2) = 1 − \frac{1}{k + 1} = \frac{k}{k + 1}
where \frac{1}{k + 1} represents the probability that the largest of the first k + 1 prizes was in the (k + 1)st slot. In general, if X = i, then
P(Z_k|X = i) = 1 − \frac{i − k − 1}{i − 1} = \frac{k}{i − 1}
where \frac{i − k − 1}{i − 1} represents the probability that the largest of the first i − 1 prizes was in one of the slots k + 1, . . . , i − 1.
It follows that
P(Z_k) = \frac{1}{n} \sum_{i=k+1}^{n} \frac{k}{i − 1} = \frac{k}{n} \sum_{i=k}^{n−1} \frac{1}{i} = \frac{k}{n} \left( \sum_{i=1}^{n−1} \frac{1}{i} − \sum_{i=1}^{k−1} \frac{1}{i} \right) ≈ \frac{k}{n} (\ln(n − 1) − \ln(k − 1)) ≈ \frac{k}{n} \ln\left( \frac{n}{k} \right)
Now let g(x) = \frac{x}{n} \ln\left( \frac{n}{x} \right). We want to maximize g with respect to x over the interval [0, n]. Differentiating, we get
g′(x) = \frac{1}{n} \ln\left( \frac{n}{x} \right) − \frac{1}{n}
Therefore,
g′(x) = 0 \implies \ln\left( \frac{n}{x} \right) = 1 \implies x = \frac{n}{e}
One can see that g′(x) > 0 if 0 < x < n/e and g′(x) < 0 if x > n/e. Therefore, the value found is a maximum. Plugging n/e into g yields
g(n/e) = \frac{1}{e} \ln e = \frac{1}{e}
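A simulation of the "let roughly n/e prizes go by" strategy (a sketch, assuming Python; n = 100 prizes and the trial count are arbitrary choices):

# Best-prize strategy: skip the first k prizes, then take the first
# prize better than everything seen so far; estimate P(best prize chosen).
import random, math

n = 100
k = round(n / math.e)
trials = 100_000
wins = 0
for _ in range(trials):
    prizes = list(range(n))          # larger value = better prize
    random.shuffle(prizes)
    threshold = max(prizes[:k])
    chosen = None
    for p in prizes[k:]:
        if p > threshold:
            chosen = p
            break
    if chosen == n - 1:              # best prize has value n - 1
        wins += 1

print(wins / trials, "vs", 1 / math.e)   # both about 0.37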
EXAMPLE: Let U be uniform(0, 1). Suppose that the conditional distribution of X, given U = p, is binomial(n, p). Find the probability mass function of X, i.e. if we flip a coin n times with unknown and uniformly chosen p, what is P{X = i} for i ∈ {0, 1, . . . , n}?
Solution: We will condition on U and use the fact that
P{X = i} = \int_0^1 P{X = i|U = p} f_U(p) \, dp
Thus,
P{X = i} = \int_0^1 P{X = i|U = p} f_U(p) \, dp = \int_0^1 \binom{n}{i} p^i (1 − p)^{n−i} \, dp
It can be shown that
\int_0^1 p^i (1 − p)^{n−i} \, dp = \frac{i!(n − i)!}{(n + 1)!}
Thus,
P{X = i} = \binom{n}{i} \frac{i!(n − i)!}{(n + 1)!} = \frac{n!}{i!(n − i)!} \cdot \frac{i!(n − i)!}{(n + 1)!} = \frac{1}{n + 1}
That is, we obtain the surprising result that if a coin whose probability of coming up heads is uniformly
distributed over (0, 1) is flipped n times, then the number of heads occurring is equally likely to be any of
the values 0, . . . , n.
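This uniform distribution over {0, . . . , n} is visible in a quick simulation (a sketch, assuming Python; n = 5 flips and the trial count are arbitrary choices):

# Draw p ~ uniform(0,1), flip n coins with bias p, and tabulate the head counts.
import random
from collections import Counter

n = 5
trials = 300_000
counts = Counter()
for _ in range(trials):
    p = random.random()
    heads = sum(random.random() < p for _ in range(n))
    counts[heads] += 1

for i in range(n + 1):
    print(i, counts[i] / trials)   # each close to 1/(n+1) = 0.1667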
Section 7.7 Moment Generating Functions
DEFINITION: The moment generating function M(t) of the random variable X is defined for all real
values of t by
M(t) = E[e^{tX}] = \begin{cases} \sum_x e^{tx} p(x) & \text{if X is discrete with mass function p(x)} \\ \int_{-\infty}^{\infty} e^{tx} f(x) \, dx & \text{if X is continuous with density f(x)} \end{cases}
We call M(t) the moment generating function because all the moments of X can be obtained by successively
differentiating M(t) and then evaluating the result at t = 0. For example,
M′(t) = \frac{d}{dt} E[e^{tX}] = E\left[ \frac{d}{dt} e^{tX} \right] = E[X e^{tX}]
where we have assumed that the interchange of the differentiation and expectation operators is legitimate.
Hence
M ′ (0) = E[X]
Similarly,
M″(t) = \frac{d}{dt} M′(t) = \frac{d}{dt} E[X e^{tX}] = E\left[ \frac{d}{dt} X e^{tX} \right] = E[X^2 e^{tX}]
Thus,
M″(0) = E[X^2]
In general, the nth derivative of M(t) is given by
M^{(n)}(t) = E[X^n e^{tX}],   n ≥ 1
implying that
M^{(n)}(0) = E[X^n],   n ≥ 1
EXAMPLE: If X is a binomial random variable with parameters n and p, then
M(t) = E[e^{tX}] = \sum_{k=0}^{n} e^{tk} \binom{n}{k} p^k (1 − p)^{n−k} = \sum_{k=0}^{n} \binom{n}{k} (p e^t)^k (1 − p)^{n−k} = (p e^t + 1 − p)^n
Therefore,
M′(t) = n (p e^t + 1 − p)^{n−1} p e^t
Thus,
E[X] = M′(0) = np
Similarly,
M″(t) = n(n − 1)(p e^t + 1 − p)^{n−2} (p e^t)^2 + n (p e^t + 1 − p)^{n−1} p e^t
so
E[X^2] = M″(0) = n(n − 1) p^2 + np
The variance of X is given by
Var(X) = E[X^2] − (E[X])^2 = n(n − 1) p^2 + np − n^2 p^2 = np(1 − p)
verifying the result obtained previously.
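The moment computations can be double-checked with finite-difference derivatives of M(t) at t = 0 (a sketch, assuming Python; n = 10, p = 0.4, and the step size h are arbitrary choices):

# Numerically differentiate the binomial m.g.f. M(t) = (p e^t + 1 - p)^n at t = 0.
import math

n, p = 10, 0.4
M = lambda t: (p * math.exp(t) + 1 - p) ** n

h = 1e-5
M1 = (M(h) - M(-h)) / (2 * h)            # central difference for M'(0)
M2 = (M(h) - 2 * M(0) + M(-h)) / h ** 2  # central difference for M''(0)

print(M1, "vs", n * p)                       # E[X] = 4
print(M2 - M1 ** 2, "vs", n * p * (1 - p))   # Var(X) = 2.4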
EXAMPLE: If X is a Poisson random variable with parameter λ, then
M(t) = E[e^{tX}] = \sum_{n=0}^{\infty} e^{tn} \frac{e^{−λ} λ^n}{n!} = e^{−λ} \sum_{n=0}^{\infty} \frac{(λ e^t)^n}{n!} = e^{−λ} e^{λ e^t} = \exp\{λ(e^t − 1)\}
Therefore,
M′(t) = λ e^t \exp\{λ(e^t − 1)\}
M″(t) = (λ e^t)^2 \exp\{λ(e^t − 1)\} + λ e^t \exp\{λ(e^t − 1)\}
Thus,
E[X] = M′(0) = λ
E[X^2] = M″(0) = λ^2 + λ
so
Var(X) = E[X^2] − (E[X])^2 = λ
EXAMPLE: If X is an exponential random variable with parameter λ, then
M(t) = E[e^{tX}] = \int_0^{\infty} e^{tx} λ e^{−λx} \, dx = λ \int_0^{\infty} e^{−(λ−t)x} \, dx = \frac{λ}{λ − t}   for t < λ
We note from this derivation that, for the exponential distribution, M(t) is defined only for values of t less than λ. Differentiation of M(t) yields
M′(t) = \frac{λ}{(λ − t)^2},    M″(t) = \frac{2λ}{(λ − t)^3}
Hence,
E[X] = M′(0) = \frac{1}{λ},    E[X^2] = M″(0) = \frac{2}{λ^2}
so
Var(X) = E[X^2] − (E[X])^2 = \frac{1}{λ^2}
EXAMPLE: Let X be a normal random variable with parameters µ and σ^2. We first compute the moment generating function of a standard normal random variable with parameters 0 and 1. Letting Z be such a random variable, we have
M_Z(t) = E[e^{tZ}] = \frac{1}{\sqrt{2π}} \int_{-\infty}^{\infty} e^{tx} e^{−x^2/2} \, dx
= \frac{1}{\sqrt{2π}} \int_{-\infty}^{\infty} \exp\left\{ −\frac{x^2 − 2tx}{2} \right\} dx
= \frac{1}{\sqrt{2π}} \int_{-\infty}^{\infty} \exp\left\{ −\frac{(x − t)^2}{2} + \frac{t^2}{2} \right\} dx
= e^{t^2/2} \frac{1}{\sqrt{2π}} \int_{-\infty}^{\infty} e^{−(x−t)^2/2} \, dx = e^{t^2/2}
Hence, the moment generating function of the standard normal random variable Z is M_Z(t) = e^{t^2/2}. We now compute the moment generating function of a normal random variable X = µ + σZ with parameters µ and σ^2. We have
M_X(t) = E[e^{tX}] = E[e^{t(µ+σZ)}] = E[e^{tµ} e^{tσZ}] = e^{tµ} E[e^{tσZ}] = e^{tµ} M_Z(tσ) = e^{tµ} e^{(tσ)^2/2} = \exp\left\{ \frac{σ^2 t^2}{2} + µt \right\}
By differentiating, we obtain
M′_X(t) = (µ + tσ^2) \exp\left\{ \frac{σ^2 t^2}{2} + µt \right\}
M″_X(t) = (µ + tσ^2)^2 \exp\left\{ \frac{σ^2 t^2}{2} + µt \right\} + σ^2 \exp\left\{ \frac{σ^2 t^2}{2} + µt \right\}
Thus,
E[X] = M′_X(0) = µ
E[X^2] = M″_X(0) = µ^2 + σ^2
so
Var(X) = E[X^2] − (E[X])^2 = σ^2
An important property of moment generating functions is that the moment generating function of the
sum of independent random variables equals the product of the individual moment generating functions.
To prove this, suppose that X and Y are independent and have moment generating functions MX (t) and
MY (t), respectively. Then MX+Y (t), the moment generating function of X + Y, is given by
M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = M_X(t) M_Y(t)
Another important result is that the moment generating function uniquely determines the distribution.
That is, if MX (t) exists and is finite in some region about t = 0, then the distribution of X is uniquely
determined.
EXAMPLE: Let X and Y be independent binomial random variables with parameters (n, p) and (m, p),
respectively. What is the distribution of X + Y ?
Solution: The moment generating function of X + Y is given by
M_{X+Y}(t) = M_X(t) M_Y(t) = (p e^t + 1 − p)^n (p e^t + 1 − p)^m = (p e^t + 1 − p)^{m+n}
However, (p e^t + 1 − p)^{m+n} is the moment generating function of a binomial random variable having parameters m + n and p. Thus, this must be the distribution of X + Y.
EXAMPLE: Let X and Y be independent Poisson random variables with respective means λ_1 and λ_2.
What is the distribution of X + Y ?
Solution: The moment generating function of X + Y is given by
M_{X+Y}(t) = M_X(t) M_Y(t) = \exp\{λ_1(e^t − 1)\} \exp\{λ_2(e^t − 1)\} = \exp\{(λ_1 + λ_2)(e^t − 1)\}
Hence, X + Y is Poisson distributed with mean λ_1 + λ_2.
EXAMPLE: Let X and Y be independent normal random variables with respective parameters (µ_1, σ_1^2) and (µ_2, σ_2^2). What is the distribution of X + Y?
Solution: The moment generating function of X + Y is given by
M_{X+Y}(t) = M_X(t) M_Y(t) = \exp\left\{ \frac{σ_1^2 t^2}{2} + µ_1 t \right\} \exp\left\{ \frac{σ_2^2 t^2}{2} + µ_2 t \right\} = \exp\left\{ \frac{(σ_1^2 + σ_2^2) t^2}{2} + (µ_1 + µ_2) t \right\}
which is the moment generating function of a normal random variable with mean µ_1 + µ_2 and variance σ_1^2 + σ_2^2.
Discrete Probability Distributions

Binomial with parameters n, p; 0 ≤ p ≤ 1:
  p(x) = \binom{n}{x} p^x (1 − p)^{n−x},   x = 0, 1, . . . , n
  M(t) = (p e^t + 1 − p)^n,   Mean = np,   Variance = np(1 − p)

Poisson with parameter λ > 0:
  p(x) = e^{−λ} \frac{λ^x}{x!},   x = 0, 1, 2, . . .
  M(t) = \exp\{λ(e^t − 1)\},   Mean = λ,   Variance = λ

Geometric with parameter 0 ≤ p ≤ 1:
  p(x) = p(1 − p)^{x−1},   x = 1, 2, . . .
  M(t) = \frac{p e^t}{1 − (1 − p) e^t},   Mean = \frac{1}{p},   Variance = \frac{1 − p}{p^2}

Negative binomial with parameters r, p; 0 ≤ p ≤ 1:
  p(n) = \binom{n − 1}{r − 1} p^r (1 − p)^{n−r},   n = r, r + 1, . . .
  M(t) = \left( \frac{p e^t}{1 − (1 − p) e^t} \right)^r,   Mean = \frac{r}{p},   Variance = \frac{r(1 − p)}{p^2}

Continuous Probability Distributions

Uniform over (a, b):
  f(x) = \frac{1}{b − a} for a < x < b, and 0 otherwise
  M(t) = \frac{e^{tb} − e^{ta}}{t(b − a)},   Mean = \frac{a + b}{2},   Variance = \frac{(b − a)^2}{12}

Exponential with parameter λ > 0:
  f(x) = λ e^{−λx} for x ≥ 0, and 0 for x < 0
  M(t) = \frac{λ}{λ − t},   Mean = \frac{1}{λ},   Variance = \frac{1}{λ^2}

Gamma with parameters (s, λ), λ > 0:
  f(x) = \frac{λ e^{−λx} (λx)^{s−1}}{Γ(s)} for x ≥ 0, and 0 for x < 0
  M(t) = \left( \frac{λ}{λ − t} \right)^s,   Mean = \frac{s}{λ},   Variance = \frac{s}{λ^2}

Normal with parameters (µ, σ^2):
  f(x) = \frac{1}{\sqrt{2π}\,σ} e^{−(x−µ)^2/(2σ^2)},   −∞ < x < ∞
  M(t) = \exp\left\{ µt + \frac{σ^2 t^2}{2} \right\},   Mean = µ,   Variance = σ^2