STAT 421 Lecture Notes

4.6 Conditional Expectation
Definition 4.7.1. Let X and Y be random variables with finite means µX and µY . The
conditional expectation of Y given X = x is denoted E(Y |X = x). In this context, we have
observed a realization x of the random variable X, and so for the purpose of computing the
conditional expectation, X is fixed at x and not random.
If the conditional distribution of Y given X = x is continuous, then

E(Y|X = x) = ∫_{−∞}^{∞} y g2(y|x) dy.
In the case that the conditional distribution of Y given X = x is discrete, then

E(Y|X = x) = ∑_y y g2(y|x).
The following table shows a data set from which a joint distribution can be formed. We’ll
assume that the 2000 individuals from which the data were collected constitute a population.
Table 1: Fingerprints of the right hand classified by the number of whorls and small loops.
From: Waite, H. 1915. Association of fingerprints. Biometrika, 10, 421-478.
                     Small loops (y)
Whorls (x)     0     1     2     3     4     5   Total   Pr(X = x)
    0         78   144   204   211   179    45     861      .430
    1        106   153   126    80    32     0     497      .248
    2        130    92    55    15     0     0     292      .146
    3        125    38     7     0     0     0     170      .085
    4        104    26     0     0     0     0     130      .065
    5         50     0     0     0     0     0      50      .025
  Total      593   453   392   306   211    45    2000     1
The conditional distributions of small loops (Y ) given the number of whorls (X) are shown
below along with the conditional expectations.
Table 2: Conditional distributions of Y and conditional means.

                             y
                       0     1     2     3     4     5   E(Y|x) = ∑_y y Pr(y|x)
Pr(Y = y|X = 0)     .091  .167  .237  .245  .208  .052           2.469
Pr(Y = y|X = 1)     .213  .308  .254  .161  .064     0           1.555
Pr(Y = y|X = 2)     .445  .315  .188  .051     0     0            .844
Pr(Y = y|X = 3)     .735  .224  .041     0     0     0            .306
Pr(Y = y|X = 4)     .800  .200     0     0     0     0            .200
Pr(Y = y|X = 5)        1     0     0     0     0     0              0
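As a check on these calculations, the following Python sketch (not part of the original notes) reproduces the conditional distributions and conditional means of Table 2 directly from the counts in Table 1.

```python
# A sketch (not from the notes): reproduce Table 2 from the Table 1 counts.
counts = [
    [78, 144, 204, 211, 179, 45],   # whorls x = 0
    [106, 153, 126, 80, 32, 0],     # x = 1
    [130, 92, 55, 15, 0, 0],        # x = 2
    [125, 38, 7, 0, 0, 0],          # x = 3
    [104, 26, 0, 0, 0, 0],          # x = 4
    [50, 0, 0, 0, 0, 0],            # x = 5
]

for x, row in enumerate(counts):
    n_x = sum(row)                                       # row total for X = x
    cond = [n_xy / n_x for n_xy in row]                  # Pr(Y = y | X = x)
    cond_mean = sum(y * p for y, p in enumerate(cond))   # E(Y | X = x)
    print(f"x={x}:", " ".join(f"{p:.3f}" for p in cond), f"E(Y|x)={cond_mean:.3f}")
```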
If we change our perspective and examine all of the conditional means E(Y|x) at once, it's apparent that the conditional mean is a function of x; evaluating that function at the random variable X yields E(Y|X). In this case, the probability of occurrence of each possible value of E(Y|X) is inherited from the distribution of X.
Conditional Means as Random Variables
Let h(x) = E(Y |x). Since X is a random variable, h(X) = E(Y |X) is also a random
variable, and the distribution of h(X) = E(Y |X) is shown below.
Table 3: Distribution of E(Y|X).

     x            0      1      2      3      4      5
E(Y|X = x)      2.469  1.555   .844   .306   .200     0
Pr(X = x)        .430   .248   .146   .085   .065   .025
From Table 3, it's possible to compute the expected value of E(Y|X), i.e.,

E[E(Y|X)] = ∑_{x=0}^{5} E(Y|X = x) Pr(X = x)
          = 2.469 × .430 + · · · + 0 × .025
          = 1.609.
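The same figure can be recovered from the rounded entries of Table 3; the short Python sketch below is illustrative only, and small discrepancies with 1.609 reflect the rounding already present in the table.

```python
# A sketch: E[E(Y|X)] from the rounded values in Table 3.
cond_means = [2.469, 1.555, 0.844, 0.306, 0.200, 0.0]    # E(Y | X = x), x = 0..5
pr_x       = [0.430, 0.248, 0.146, 0.085, 0.065, 0.025]  # Pr(X = x)

e_of_e = sum(m * p for m, p in zip(cond_means, pr_x))
print(e_of_e)   # about 1.61; agrees with the 1.609 in the text up to rounding
```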
Theorem 4.7.1 (Law of Total Probability for Expectations). Let X and Y be random variables
such that E(Y ) < ∞. Then,
E(Y ) = E[E(Y |X)].
The proof can be understood easily for the case of continuous random variables:

E[E(Y|X)] = ∫_{−∞}^{∞} E(Y|x) f1(x) dx
          = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y g2(y|x) f1(x) dy dx.

Note that f(x, y) = g2(y|x) f1(x) is the joint p.d.f. of (X, Y); hence,

E[E(Y|X)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y g2(y|x) f1(x) dy dx
          = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f(x, y) dy dx
          = E(Y).
Returning to the fingerprint example and using the marginal totals in Table 1, E(Y) = (0 × 593 + · · · + 5 × 45)/2000 = 1.612, whereas the previous calculation of E[E(Y|X)] yielded 1.609. The difference is attributable to rounding error.
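To confirm that the discrepancy is only rounding, both calculations can be redone in exact arithmetic. The Python sketch below (illustrative, not part of the notes) uses the standard fractions module and shows that both routes give exactly 3224/2000 = 1.612.

```python
# A sketch using exact rational arithmetic: both E(Y) and E[E(Y|X)]
# equal 403/250 = 1.612 exactly when no intermediate rounding is done.
from fractions import Fraction

counts = [
    [78, 144, 204, 211, 179, 45],
    [106, 153, 126, 80, 32, 0],
    [130, 92, 55, 15, 0, 0],
    [125, 38, 7, 0, 0, 0],
    [104, 26, 0, 0, 0, 0],
    [50, 0, 0, 0, 0, 0],
]
n = sum(sum(row) for row in counts)   # 2000

# E(Y) computed from the joint distribution (equivalently, the marginal of Y)
e_y = sum(Fraction(y * counts[x][y], n) for x in range(6) for y in range(6))

# E[E(Y|X)] with exact conditional means
e_of_e = Fraction(0)
for row in counts:
    n_x = sum(row)
    cond_mean = sum(Fraction(y * row[y], n_x) for y in range(6))
    e_of_e += cond_mean * Fraction(n_x, n)

print(e_y, float(e_y))         # 403/250 1.612
print(e_of_e, float(e_of_e))   # 403/250 1.612
```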
Example 4.7.6. Suppose that X ∼ Unif(0, 1). After observing a realization x, a second observation Y is obtained from the Unif(x, 1) distribution; hence g2(y|x) = (1 − x)⁻¹ I_{[x,1]}(y).
The unconditional expectation of Y is determined as follows.
Note that g2(y|x) is symmetric about the midpoint of the interval [x, 1]; thus E(Y|x) = (1 + x)/2 and, as a random variable, E(Y|X) = (1 + X)/2. Then,

E(Y) = E[E(Y|X)]
     = ∫_0^1 (1 + x)/2 · f1(x) dx
     = [x/2 + x²/4]_0^1
     = 3/4.
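A quick Monte Carlo check of this result: the sketch below (not part of the notes) draws X from Unif(0, 1), then Y from Unif(x, 1), and the sample mean of Y settles near 3/4.

```python
# Monte Carlo sketch for Example 4.7.6: X ~ Unif(0,1), then Y | X = x ~ Unif(x,1).
import random

random.seed(0)
n = 1_000_000
total = 0.0
for _ in range(n):
    x = random.random()               # X ~ Unif(0, 1)
    total += random.uniform(x, 1.0)   # a draw of Y given X = x
print(total / n)                      # close to E(Y) = 3/4
```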
Theorem 4.7.2. Suppose that X and Y are random variables and that Z = r(X, Y ) for some
function r. Then, the expected value of Z with respect to the conditional distribution of Y
given X = x is
E(Z|X = x) = E[r(x, Y )|X = x].
That is, since we are conditioning on the event {X = x}, we may replace the random variable
X in r(X, Y ) with the realized value x ∈ R in the calculation of E(Z|X = x). Consequently,
the conditional expected value of Z is computed by integrating over the conditional distribution g2(y|x):

E(Z|X = x) = ∫_{−∞}^{∞} r(x, y) g2(y|x) dy.
Example 4.7.7. Suppose that E(Y|X) = aX + b. E(XY) can be determined using

E(XY) = E_X[E_{Y|X}(XY|X)],

where the notation E_X indicates that the expectation is taken with respect to the distribution of X and E_{Y|X} indicates that the expectation is taken with respect to the conditional distribution of Y given X = x. Hence,

E(XY) = E_X[E_{Y|X}(XY|X)]
      = E_X[X E_{Y|X}(Y|X)]
      = E_X[X(aX + b)]
      = a E_X(X²) + b E_X(X).

It is more direct, however, to compute E(XY) = E[X(aX + b)] = aE(X²) + bE(X).
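For a concrete check, one model that satisfies E(Y|X) = aX + b is Y = aX + b + ε with ε independent, mean-zero noise; this model is an assumption introduced only for the simulation sketch below, not part of the example.

```python
# Simulation sketch for Example 4.7.7 under an assumed model (not from the notes):
# X ~ Unif(0,1) and Y = a*X + b + eps, with eps independent and mean zero,
# so that E(Y | X) = aX + b.  Then E(XY) should match a*E(X^2) + b*E(X).
import random

random.seed(1)
a, b = 2.0, 0.5
n = 1_000_000
sum_xy = sum_x2 = sum_x = 0.0
for _ in range(n):
    x = random.random()
    y = a * x + b + random.gauss(0.0, 1.0)
    sum_xy += x * y
    sum_x2 += x * x
    sum_x += x
print(sum_xy / n)                       # sample estimate of E(XY)
print(a * sum_x2 / n + b * sum_x / n)   # a*E(X^2) + b*E(X); the two nearly agree
```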
Definition 4.7.3. The conditional variance of Y given X = x is defined to be
Var(Y|x) = E_{Y|X}{[Y − E(Y|x)]²}.
We sometimes view Var(Y|x) = v(x) as a random variable, i.e., v(X) = Var(Y|X). Assuming X to have a continuous distribution, the expectation of this random variable is

E[Var(Y|X)] = ∫_{−∞}^{∞} v(x) f1(x) dx.

By itself, E[Var(Y|X)] is not the unconditional variance of Y. The unconditional variance can, however, be computed from Var(Y|X) and E(Y|X) together; the justification, given in Theorem 4.7.4 below, is based on Theorem 4.7.1, the law of total probability for expectations.
The simple linear regression model specifies

E(Y|x) = ax + b.

When E(Y|x) is used to predict a value of y associated with x, the regression predictor E(Y|X) = aX + b minimizes the unconditional mean squared error E{[Y − d(X)]²} over all predictors d(X). The following theorem proves this claim.
Theorem 4.7.3. The predictor d(X) that minimizes E{[Y − d(X)]²} is d(X) = E(Y|X).

For simplicity, assume that X has a continuous distribution. Let d(X) = E(Y|X) and let d*(X) denote some other predictor. The criterion to be minimized is the expected squared error, where the expectation is taken over the joint distribution of (X, Y). We use

E{[Y − d(X)]²} = E_X(E_{Y|X}{[Y − d(X)]²}).

Theorem 4.5.2 proves that the mean of a distribution minimizes the mean squared error, i.e., E{[Y − d]²} is minimized by choosing d = E(Y).

In this situation, d(x) = E(Y|x) is the mean of the conditional distribution of Y given X = x, and so d(x) minimizes E_{Y|X}{[Y − d(x)]²}. This statement holds for every x in the support of X, and thus

E_{Y|X}{[Y − d(x)]²} ≤ E_{Y|X}{[Y − d*(x)]²} for all x.
Thus,

E_X(E_{Y|X}{[Y − d(X)]²}) ≤ E_X(E_{Y|X}{[Y − d*(X)]²}),

and

E{[Y − d(X)]²} ≤ E{[Y − d*(X)]²}.
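The theorem can be illustrated numerically in the setting of Example 4.7.6, where d(X) = E(Y|X) = (1 + X)/2. The sketch below (not from the notes) compares the mean squared error of this predictor with that of an arbitrary competitor d*(x) = x, chosen purely for illustration.

```python
# Sketch illustrating Theorem 4.7.3 in the setting of Example 4.7.6:
# d(X) = E(Y|X) = (1 + X)/2 versus an arbitrary competitor d*(x) = x.
import random

random.seed(2)
n = 1_000_000
mse_d = mse_dstar = 0.0
for _ in range(n):
    x = random.random()
    y = random.uniform(x, 1.0)
    mse_d += (y - (1 + x) / 2) ** 2    # conditional-mean predictor
    mse_dstar += (y - x) ** 2          # competitor d*(x) = x
print(mse_d / n)       # about 1/36 ~ 0.028
print(mse_dstar / n)   # about 1/9  ~ 0.111, larger, as the theorem guarantees
```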
Theorem 4.7.4 (Law of Total Probability for Variances). If X and Y are random variables with finite variances, then

Var(Y) = E_X[Var_{Y|X}(Y|X)] + Var_X(E_{Y|X}[Y|X]).

The proof is based on expanding

Var(Y) = E[(Y − µY)²] = E{[Y − E_{Y|X}(Y|X) + E_{Y|X}(Y|X) − µY]²}.
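In the setting of Example 4.7.6 the two terms can be computed explicitly: Var(Y|x) = (1 − x)²/12 and E(Y|x) = (1 + x)/2, so E[Var(Y|X)] = 1/36 and Var[E(Y|X)] = Var(X)/4 = 1/48, giving Var(Y) = 1/36 + 1/48 = 7/144. The simulation sketch below (not part of the notes) checks this value.

```python
# Sketch checking the law of total variance in the setting of Example 4.7.6:
# Var(Y) should be close to E[Var(Y|X)] + Var[E(Y|X)] = 1/36 + 1/48 = 7/144.
import random
import statistics

random.seed(3)
ys = []
for _ in range(200_000):
    x = random.random()
    ys.append(random.uniform(x, 1.0))
print(statistics.pvariance(ys))   # variance of the simulated Y values
print(7 / 144)                    # theoretical value, about 0.0486
```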