4.7 Conditional Expectation

Definition 4.7.1. Let X and Y be random variables with finite means µX and µY. The conditional expectation of Y given X = x is denoted E(Y|X = x). In this context, we have observed a realization x of the random variable X, and so for the purpose of computing the conditional expectation, X is fixed at x and not random. If the conditional distribution of Y given X = x is continuous, then

    E(Y|X = x) = ∫_{−∞}^{∞} y g2(y|x) dy.

In the case that the conditional distribution of Y given X = x is discrete,

    E(Y|X = x) = ∑_y y g2(y|x).

The following table shows a data set from which a joint distribution can be formed. We'll assume that the 2000 individuals from whom the data were collected constitute a population.

Table 1: Fingerprints of the right hand classified by the number of whorls and small loops. From: Waite, H. 1915. Association of fingerprints. Biometrika, 10, 421-478.

                          Small loops (y)
  Whorls (x)     0     1     2     3     4     5   Total   Pr(X = x)
  0             78   144   204   211   179    45     861       .430
  1            106   153   126    80    32     0     497       .248
  2            130    92    55    15     0     0     292       .146
  3            125    38     7     0     0     0     170       .085
  4            104    26     0     0     0     0     130       .065
  5             50     0     0     0     0     0      50       .025
  Total        593   453   392   306   211    45    2000      1

The conditional distributions of small loops (Y) given the number of whorls (X) are shown below, along with the conditional expectations.

Table 2: Conditional distributions of Y and conditional means.

                           y
                      0      1      2      3      4      5    E(Y|x) = ∑_y y Pr(y|x)
  Pr(Y = y|X = 0)   .091   .167   .237   .245   .208   .052         2.469
  Pr(Y = y|X = 1)   .213   .308   .254   .161   .064   0            1.555
  Pr(Y = y|X = 2)   .445   .315   .188   .051   0      0             .844
  Pr(Y = y|X = 3)   .735   .224   .041   0      0      0             .306
  Pr(Y = y|X = 4)   .800   .200   0      0      0      0             .200
  Pr(Y = y|X = 5)  1       0      0      0      0      0            0

If we change our perspective and examine all of the conditional means E(Y|x), it's apparent that the conditional means can be viewed as a function of the random variable X. In this case, the probability of occurrence of each possible value of E(Y|X) is inherited from the distribution of X.

Conditional Means as Random Variables

Let h(x) = E(Y|x). Since X is a random variable, h(X) = E(Y|X) is also a random variable, and the distribution of h(X) = E(Y|X) is shown below.

Table 3: Distribution of E(Y|X).

  x              0       1       2       3       4       5
  E(Y|X = x)   2.469   1.555    .844    .306    .200    0
  Pr(X = x)     .430    .248    .146    .085    .065    .025

From Table 3, it's possible to compute the expected value of E(Y|X), i.e.,

    E[E(Y|X)] = ∑_{x=0}^{5} E(Y|X = x) Pr(X = x)
              = 2.469 × .430 + ··· + 0 × .025
              = 1.609.

Theorem 4.7.1. Law of Total Probability for Expectations. Let X and Y be random variables such that E(Y) < ∞. Then,

    E(Y) = E[E(Y|X)].

The proof can be understood easily for the case of continuous random variables:

    E[E(Y|X)] = ∫_{−∞}^{∞} E(Y|x) f1(x) dx
              = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y g2(y|x) f1(x) dy dx.

Note that f(x, y) = g2(y|x) f1(x) is the joint p.d.f. of (X, Y); hence,

    E[E(Y|X)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y g2(y|x) f1(x) dy dx
              = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f(x, y) dy dx
              = E(Y).

Returning to the fingerprint example and using the marginal totals in Table 1,

    E(Y) = (0 × 593 + ··· + 5 × 45)/2000 = 1.612,

whereas the previous calculation of E(Y) yielded 1.609. The difference is attributable to rounding error.
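As a numerical companion to Theorem 4.7.1, the following minimal Python sketch (not part of the original notes; variable names such as cond_mean are illustrative) recomputes the conditional means and E[E(Y|X)] directly from the counts in Table 1. Working with exact counts rather than the rounded probabilities of Table 2, the two computations of E(Y) agree exactly, confirming that the 1.609 versus 1.612 discrepancy above is purely rounding.

    # A minimal sketch (not part of the notes) verifying Theorem 4.7.1 on the
    # fingerprint counts of Table 1, using exact counts rather than the
    # rounded probabilities of Table 2.
    counts = [
        [78, 144, 204, 211, 179, 45],   # x = 0
        [106, 153, 126, 80, 32, 0],     # x = 1
        [130, 92, 55, 15, 0, 0],        # x = 2
        [125, 38, 7, 0, 0, 0],          # x = 3
        [104, 26, 0, 0, 0, 0],          # x = 4
        [50, 0, 0, 0, 0, 0],            # x = 5
    ]
    n = sum(sum(row) for row in counts)                  # 2000

    # Conditional means E(Y|X = x) and marginal probabilities Pr(X = x)
    cond_mean = [sum(y * c for y, c in enumerate(row)) / sum(row) for row in counts]
    pr_x = [sum(row) / n for row in counts]

    # E[E(Y|X)], the expectation taken over the distribution of X
    e_of_cond_mean = sum(m * p for m, p in zip(cond_mean, pr_x))

    # E(Y) computed directly from the marginal (column) totals
    col_totals = [sum(row[y] for row in counts) for y in range(6)]
    e_y = sum(y * c for y, c in enumerate(col_totals)) / n

    print(e_of_cond_mean, e_y)   # both are 1.612; the 1.609 above is rounding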
Example 4.7.6. Suppose that X ∼ Unif(0, 1). After observing a realization x, a second observation Y is obtained from the Unif(x, 1) distribution; hence

    g2(y|x) = (1 − x)^{−1} I_{[x,1]}(y).

The unconditional expectation of Y is determined as follows. Note that g2(y|x) is symmetric about the midpoint of the interval [x, 1], and hence E(Y|x) is that midpoint; thus, E(Y|x) = (1 + x)/2, and as a random variable, E(Y|X) = (1 + X)/2. Then,

    E(Y) = E[E(Y|X)]
         = ∫_0^1 (1 + x)/2 · f1(x) dx
         = [x/2 + x²/4]_0^1
         = 3/4.

Theorem 4.7.2. Suppose that X and Y are random variables and that Z = r(X, Y) for some function r. Then, the expected value of Z with respect to the conditional distribution of Y given X = x is

    E(Z|X = x) = E[r(x, Y)|X = x].

That is, since we are conditioning on the event {X = x}, we may replace the random variable X in r(X, Y) with the realized value x ∈ R in the calculation of E(Z|X = x). Consequently, the conditional expected value of Z is computed by integrating over the conditional distribution g2(y|x):

    E(Z|X = x) = ∫_{−∞}^{∞} r(x, y) g2(y|x) dy.

Example 4.7.7. Suppose that E(Y|X) = aX + b. E(XY) can be determined using

    E(XY) = E_X[E_{Y|X}(XY|X)],

where the notation E_X indicates that the expectation is with respect to the distribution of X and E_{Y|X} indicates that the expectation is with respect to the conditional distribution of Y given X = x. Hence,

    E(XY) = E_X[E_{Y|X}(XY|X)]
          = E_X[X E_{Y|X}(Y|X)]
          = E_X[X(aX + b)]
          = a E_X(X²) + b E_X(X).

It is more direct, however, to compute E(XY) = E[X(aX + b)] = aE(X²) + bE(X).

Definition 4.7.3. The conditional variance of Y given X = x is defined to be

    Var(Y|x) = E_{Y|x}{[Y − E(Y|x)]²}.

We sometimes view Var(Y|x) = v(x) as a random variable, i.e., v(X) = Var(Y|X). Assuming X to have a continuous distribution, the expected conditional variance can be computed from Var(Y|x) according to

    E[Var(Y|X)] = ∫_{−∞}^{∞} v(x) f1(x) dx.

The justification for this expression is based on Theorem 4.7.1, the law of total probability for expectations. (The unconditional variance of Y is recovered from v(X) by Theorem 4.7.4 below.)

The simple linear regression model specifies E(Y|x) = ax + b. When E(Y|x) is used to predict a value of y associated with x, the regression predictor E(Y|X) = aX + b minimizes the unconditional mean squared error E{[Y − d(X)]²}. The following theorem provides a proof.

Theorem 4.7.3. The predictor d(X) that minimizes E{[Y − d(X)]²} is d(X) = E(Y|X).

For simplicity, assume that X has a continuous distribution. Let d(X) = E(Y|X) and let d*(X) denote some other predictor. The criterion to be minimized is expected squared error, where the expectation is taken over the joint distribution of (X, Y). We use

    E{[Y − d(X)]²} = E_X(E_{Y|X}{[Y − d(X)]²}).

Theorem 4.5.2 proves that the mean of a distribution minimizes mean squared error, i.e., E{[Y − d]²} is minimized by choosing d = E(Y). In this situation, d(x) = E(Y|x) is the mean of the conditional distribution of Y given X = x, and so d(x) minimizes E_{Y|x}{[Y − d(x)]²}. This statement holds for every x in the support of X, and thus

    E_{Y|x}{[Y − d(x)]²} ≤ E_{Y|x}{[Y − d*(x)]²}  for all x.

Taking expectations over X preserves the inequality:

    E_X(E_{Y|X}{[Y − d(X)]²}) ≤ E_X(E_{Y|X}{[Y − d*(X)]²}),

and hence

    E{[Y − d(X)]²} ≤ E{[Y − d*(X)]²}.

Theorem 4.7.4. Law of Total Probability for Variances. If X and Y are random variables with finite variances, then

    Var(Y) = E_X[Var_{Y|X}(Y|X)] + Var_X(E_{Y|X}[Y|X]).

The proof is based on expanding

    Var(Y) = E[(Y − µY)²] = E{[Y − E_{Y|X}(Y|X) + E_{Y|X}(Y|X) − µY]²}.
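The results above can be checked numerically on Example 4.7.6. The following minimal simulation sketch (not from the notes; the seed and sample size are arbitrary choices) draws pairs (X, Y) as in the example, verifies that the empirical mean of Y is near 3/4, and confirms the claim of Theorem 4.7.3 that the conditional-mean predictor d(X) = (1 + X)/2 has smaller mean squared error than the best constant predictor d*(X) = E(Y) = 3/4.

    # A minimal simulation sketch (not from the notes; seed and sample size
    # are arbitrary) for Example 4.7.6: X ~ Unif(0, 1), then Y|X = x ~ Unif(x, 1).
    import random

    random.seed(421)
    n = 10**5
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [random.uniform(x, 1) for x in xs]

    # Theorem 4.7.1: the empirical mean of Y should be near E(Y) = 3/4
    print(sum(ys) / n)

    # Theorem 4.7.3: the conditional-mean predictor d(X) = (1 + X)/2 should
    # have smaller MSE than the best constant predictor d*(X) = E(Y) = 3/4
    mse_cond = sum((y - (1 + x) / 2) ** 2 for x, y in zip(xs, ys)) / n
    mse_const = sum((y - 0.75) ** 2 for y in ys) / n
    print(mse_cond, mse_const)   # approx 1/36 ≈ .028 versus 7/144 ≈ .049

For this example the theoretical values are E[Var(Y|X)] = E[(1 − X)²]/12 = 1/36 and Var(Y) = 1/36 + Var((1 + X)/2) = 1/36 + 1/48 = 7/144, a concrete instance of the decomposition in Theorem 4.7.4.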
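Theorem 4.7.4 can likewise be verified exactly on the fingerprint population. The self-contained sketch below (again illustrative, not part of the notes) computes both sides of the variance decomposition from the Table 1 counts.

    # A minimal self-contained sketch (not part of the notes) computing both
    # sides of Theorem 4.7.4 for the fingerprint population of Table 1.
    counts = [
        [78, 144, 204, 211, 179, 45],
        [106, 153, 126, 80, 32, 0],
        [130, 92, 55, 15, 0, 0],
        [125, 38, 7, 0, 0, 0],
        [104, 26, 0, 0, 0, 0],
        [50, 0, 0, 0, 0, 0],
    ]
    n = sum(map(sum, counts))
    pr_x = [sum(row) / n for row in counts]
    cond_mean = [sum(y * c for y, c in enumerate(row)) / sum(row) for row in counts]
    cond_var = [sum(c * (y - m) ** 2 for y, c in enumerate(row)) / sum(row)
                for row, m in zip(counts, cond_mean)]

    # Right-hand side: E[Var(Y|X)] + Var(E[Y|X])
    e_y = sum(m * p for m, p in zip(cond_mean, pr_x))
    e_cond_var = sum(v * p for v, p in zip(cond_var, pr_x))
    var_cond_mean = sum(p * (m - e_y) ** 2 for m, p in zip(cond_mean, pr_x))

    # Left-hand side: Var(Y) from the marginal distribution of Y
    col_totals = [sum(row[y] for row in counts) for y in range(6)]
    var_y = sum(c * (y - e_y) ** 2 for y, c in enumerate(col_totals)) / n

    print(var_y, e_cond_var + var_cond_mean)   # equal up to floating-point error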