Chapter 7 Properties of Expectations and Central Limit Theorem

7.1 Introduction ---
In this chapter, additional properties of the expected values of random variables will be exploited. Also discussed are some limit theorems, especially the central limit theorem, which is probably the most important and surprising result in probability.

7.2 Expectation of Sums of Random Variables

Additional properties about jointly distributed random variables ---

Proposition 7.1 (computing the mean of a function of two jointly distributed random variables) --
If X and Y have a joint pmf p(x, y), then
  E[g(X, Y)] = Σ_y Σ_x g(x, y)p(x, y).
If X and Y have a joint pdf f(x, y), then
  E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y)f(x, y) dx dy.
Proof: similar to the proof for the case of a single random variable; left as an exercise.

Example 7.1 --
An accident occurs at a location X that is uniformly distributed on a road of length L, and at the time of the accident an ambulance is at a location Y that is also uniformly distributed on the same road. Assume that X and Y are independent. Find the expected distance between the ambulance and the location of the accident.
Solution: The pdf of X is
  f_X(x) = 1/L for 0 < x < L; 0 otherwise.
Similarly, the pdf of Y is
  f_Y(y) = 1/L for 0 < y < L; 0 otherwise.
By Fact 6.9 of the last chapter, since X and Y are independent, the joint pdf f of X and Y is
  f(x, y) = f_X(x)f_Y(y) = (1/L)(1/L) = 1/L² for 0 < x < L, 0 < y < L; 0 otherwise.
The distance between the ambulance and the location of the accident is just a function of X and Y: g(X, Y) = |X − Y|. The expected distance, by Proposition 7.1, is
  E[|X − Y|] = E[g(X, Y)] = ∫∫ |x − y| f(x, y) dx dy = ∫_0^L ∫_0^L |x − y| (1/L²) dy dx.
Now, we want to compute the value of ∫_0^L |x − y| dy first, in which the range of integration for y is [0, L] and x is a fixed value. If y ∈ [0, x), then x > y, and so |x − y| = x − y. On the other hand, if y ∈ [x, L], then x ≤ y, and so |x − y| = −(x − y) = y − x.
Consequently,
  ∫_0^L |x − y| dy = ∫_0^x (x − y) dy + ∫_x^L (y − x) dy
    = (xy − y²/2)|_0^x + (y²/2 − xy)|_x^L
    = (x² − x²/2) + [(L²/2 − xL) − (x²/2 − x²)]
    = L²/2 + x² − xL.
Now, we can compute E[|X − Y|] as:
  E[|X − Y|] = ∫_0^L ∫_0^L |x − y| (1/L²) dy dx
    = (1/L²) ∫_0^L (L²/2 + x² − xL) dx
    = (1/L²)(L²x/2 + x³/3 − x²L/2)|_0^L
    = (1/L²)(L³/2 + L³/3 − L³/2)
    = L/3.

Expectations of sums of random variables ---

Fact 7.1 --
Given two random variables X and Y, the following equality is true:
  E[X + Y] = E[X] + E[Y].
Proof: Regarding X + Y as a function of two random variables g(X, Y), we can apply Proposition 7.1 to get
  E[X + Y] = ∫∫ (x + y)f(x, y) dx dy
    = ∫∫ x f(x, y) dy dx + ∫∫ y f(x, y) dx dy
    = ∫ x (∫ f(x, y) dy) dx + ∫ y (∫ f(x, y) dx) dy
    = ∫ x f_X(x) dx + ∫ y f_Y(y) dy
    = E[X] + E[Y].

Fact 7.2 --
Given n random variables X1, X2, …, Xn, the following equality is true:
  E[X1 + X2 + … + Xn] = E[X1] + E[X2] + … + E[Xn].
Proof: easy to derive using Fact 7.1 and the principle of induction.

Example 7.2 (the expected number of matches) --
A group of N people throw their hats into the center of a room. The hats are mixed up, and each person randomly selects one. Find the expected number of people that select their own hats.
Solution: Let the number of persons that select their own hats be denoted as X. Obviously, X may be computed as X = X1 + X2 + … + XN, where the random variable Xi, 1 ≤ i ≤ N, is defined as
  Xi = 1 if the ith person selects his own hat; 0 otherwise.
Now, the probability for a person to select his/her own hat among the N hats is just 1/N, i.e.,
  P{person i selects his/her own hat} = P{Xi = 1} = 1/N.
Therefore, by definition we get the expected value of Xi as
  E[Xi] = 1·P{Xi = 1} + 0·P{Xi = 0} = 1·(1/N) + 0·(1 − 1/N) = 1/N, i = 1, 2, …, N.
And so, by Fact 7.2 we get
  E[X] = E[X1 + X2 + … + XN] = E[X1] + E[X2] + … + E[XN] = (1/N)·N = 1.
That is, exactly one person selects his/her own hat on average.
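Both results above (E[|X − Y|] = L/3 in Example 7.1, and an expected hat-match count of exactly 1 in Example 7.2) are easy to check numerically. The following is a Monte Carlo sketch added for illustration, not part of the original notes; the road length L = 9, the group size N = 20, and the trial count are arbitrary choices.

```python
import random

random.seed(0)
trials = 100_000

# Example 7.1: expected distance between two independent Uniform(0, L) points.
L = 9.0
dist = sum(abs(random.uniform(0, L) - random.uniform(0, L))
           for _ in range(trials)) / trials
print(dist)  # should be close to L/3 = 3.0

# Example 7.2: expected number of people who get their own hat back.
N = 20
matches = 0
for _ in range(trials):
    hats = list(range(N))
    random.shuffle(hats)                               # random hat assignment
    matches += sum(1 for i, h in enumerate(hats) if i == h)
print(matches / trials)  # should be close to 1, regardless of N
```

Note that the hat-match average stays near 1 no matter what N is used, exactly as the linearity argument (Fact 7.2) predicts.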
7.3 Random Samples and Their Properties

Concept ---
Every observed data item of the distribution of a certain random variable is itself random in nature, and so may be regarded as a random variable, too. Therefore, we have the following definitions.

Random samples ---

Definition 7.1 (random sample) --
A random sample of size n arising from a certain random variable X is a collection of n independent random variables X1, X2, …, Xn such that for all i = 1, 2, …, n, Xi is identically distributed with X, meaning that every Xi has the same pmf or pdf as that of X. Each Xi is called a sample variable; X is called the population random variable; and the mean and variance of X are called the population mean and variance, respectively.

Definition 7.2 (random sampling and sample value) --
The process of obtaining a set of observed data of a random sample is called random sampling; the set of the observed data is also called a random sample; and each observed data item is called a sample value.

A note --- the term random sample thus has two meanings: a set of random variables all independently and identically distributed with a population random variable X; or a set of observed data of such random variables.

An abbreviation --- we use iid to mean independently and identically distributed.

Example 7.3 --
In a resistor manufacturing company, it is known from past experience that the resistance of a resistor produced by the company is a normal random variable X ~ N(μ, σ²), where μ and σ² are unknown. A random sampling process is conducted to take a set of sample values of resistances from every 5 resistors produced. One such sample-value set is (8.04, 8.02, 8.07, 7.99, 8.03), another is (8.01, 7.89, 8.10, 8.02, 8.00), and a third is (7.98, 8.01, 8.05, 7.90, 8.15), and so on. These sets are random samples. The first sample values in the three sets, namely, 8.04, 8.01, 7.98, …, are all random in nature, and so may be described by a random variable X1, as mentioned previously.
Similarly, the second sample values in the sets, namely, 8.02, 7.89, 8.01, …, may be described by a second random variable X2, and so on, yielding additionally three random variables X3, X4, and X5. The set of these five random variables X1, X2, …, X5 constitutes a random sample arising from X.

Properties of random samples and sample means ---

Fact 7.3 (the joint pdf of the random variables of a random sample) --
The sample variables Xi in a random sample X1, X2, …, Xn of size n arising from a population random variable X with pmf or pdf f(x) have a joint pmf or pdf of the following form:
  f(x1, x2, …, xn) = f(x1)f(x2)…f(xn).
Proof: easy to derive by applying Fact 6.8 or 6.9 according to the property of independence and the principle of induction.

Definition 7.3 (sample mean) --
Given a random sample X1, X2, …, Xn arising from a population random variable X, the following two functions of the sample variables
  X̄ = (X1 + X2 + … + Xn)/n = (Σ_{i=1}^n Xi)/n; (7.1)
  To = X1 + X2 + … + Xn = Σ_{i=1}^n Xi (7.2)
are called the sample mean and the sample total of the random sample, respectively.
A note: the sample mean is itself a random variable, as mentioned previously, and so is the sample total.

Fact 7.4 (the mean of the sample mean) --
The mean E[X̄] of the sample mean X̄ of a random sample X1, X2, …, Xn of size n arising from a population random variable X with mean μ is
  E[X̄] = μ.
Proof: By Definition 7.1 and Fact 7.2, we have
  E[X̄] = E[(X1 + X2 + … + Xn)/n] = (E[X1] + E[X2] + … + E[Xn])/n = (nμ)/n = μ.
A note: the sample mean is useful as an estimator of the mean μ of the population random variable (to be discussed in the next chapter).

Example 7.4 (continued from Example 7.3) --
Suppose that we want to estimate the mean μ of the normal distribution of the resistance values of the resistors produced by the factory mentioned in Example 7.3. As mentioned above, the sample mean may be used for this purpose.
The sample mean value x̄1 computed in terms of the first set of sample values is x̄1 = (8.04 + 8.02 + 8.07 + 7.99 + 8.03)/5 = 8.03; that computed in terms of the second set is x̄2 = (8.01 + 7.89 + 8.10 + 8.02 + 8.00)/5 ≈ 8.00; and that computed in terms of the third set is x̄3 = (7.98 + 8.01 + 8.05 + 7.90 + 8.15)/5 ≈ 8.02, and so on. These estimates, 8.03, 8.00, 8.02, etc., are random in nature, and are values of the sample mean X̄ = (X1 + X2 + … + X5)/5. Fact 7.4 says that the mean of these values, E[X̄], is just the real mean μ of the population random variable X. Note that there are ways other than the sample mean for estimating μ (to be discussed in the next chapter).

7.4 Covariance, Variance of Sums, and Correlations

Expectation of a product of independent random variables ---

Proposition 7.2 --
If X and Y are two independent random variables, then for any functions h and g, the following equality holds:
  E[g(X)h(Y)] = E[g(X)]E[h(Y)].
Proof: Suppose that X and Y are jointly continuous with joint pdf f(x, y). By the independence of X and Y and Fact 6.9, we have f(x, y) = f_X(x)f_Y(y). Then
  E[g(X)h(Y)] = ∫∫ g(x)h(y)f(x, y) dx dy
    = ∫∫ g(x)h(y)f_X(x)f_Y(y) dx dy
    = ∫ g(x)f_X(x) dx · ∫ h(y)f_Y(y) dy
    = E[g(X)]E[h(Y)].
The proof for the discrete case is similar.

Covariance ---

Definition 7.4 (the covariance of two random variables) --
The covariance between two random variables X and Y, denoted by Cov(X, Y), is defined by
  Cov(X, Y) = E[(X − μX)(Y − μY)]
where μX and μY are the means of X and Y, respectively.
Notes:
It is easy to see from the above definition that Cov(X, Y) = Cov(Y, X), i.e., the operator Cov is commutative.
Comparing the above definition with that of Var(X), which is Var(X) = E[(X − μ)²] where μ is the mean of X, we see that the former becomes the latter if Y is taken to be X, as described by the following fact.

Fact 7.5 --
  Var(X) = Cov(X, X).
Proof: easy and left as an exercise.

Fact 7.6 --
  Cov(X, Y) = E[XY] − E[X]E[Y].
Proof: easy and left as an exercise.
A note: compare the above formula with that for computing Var(X) from Proposition 5.2, which is Var(X) = E[X²] − (E[X])²; the latter is just a special case of the former.

Fact 7.7 --
If two random variables X and Y are independent, then Cov(X, Y) = 0, but the reverse is not always true.
Proof: X and Y are independent, so by Proposition 7.2,
  Cov(X, Y) = E[(X − μX)(Y − μY)] = E[X − μX]E[Y − μY] = (E[X] − μX)(E[Y] − μY) = 0.
To prove the second statement, let X and Y be defined respectively as:
  P{X = 0} = P{X = 1} = P{X = −1} = 1/3; and
  Y = 0 if X ≠ 0; Y = 1 otherwise.
From the definition of random variable Y above, the value of the product random variable XY is always zero, and so we have E[XY] = 0. Also, it is easy to compute E[X] = (1/3)(0 + 1 − 1) = 0. Therefore, by Fact 7.6, we get
  Cov(X, Y) = E[XY] − E[X]E[Y] = 0.
However, X and Y are clearly not independent, as can be seen easily from the definition of Y. Done.

Fact 7.8 --
  Cov(aX, Y) = Cov(X, aY) = a·Cov(X, Y); Cov(aX, bY) = ab·Cov(X, Y).
Proof: easy from the definition of the covariance and left as exercises.

Proposition 7.3 --
  Cov(Σ_{i=1}^n Xi, Σ_{j=1}^m Yj) = Σ_{i=1}^n Σ_{j=1}^m Cov(Xi, Yj).
(Note: here the Xi need not be iid; neither need the Yj.)
Proof: Let μi = E[Xi] and νj = E[Yj]. Then E[Σ_{i=1}^n Xi] = Σ_{i=1}^n μi and E[Σ_{j=1}^m Yj] = Σ_{j=1}^m νj. So,
  Cov(Σ_{i=1}^n Xi, Σ_{j=1}^m Yj) = E[(Σ_{i=1}^n Xi − E[Σ_{i=1}^n Xi])(Σ_{j=1}^m Yj − E[Σ_{j=1}^m Yj])]
    = E[(Σ_{i=1}^n (Xi − μi))(Σ_{j=1}^m (Yj − νj))]
    = E[Σ_{i=1}^n Σ_{j=1}^m (Xi − μi)(Yj − νj)]
    = Σ_{i=1}^n Σ_{j=1}^m E[(Xi − μi)(Yj − νj)] (by Fact 7.2)
    = Σ_{i=1}^n Σ_{j=1}^m Cov(Xi, Yj). (by definition of Cov)

Fact 7.9 --
  Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi) + 2 Σ_{i<j} Cov(Xi, Xj).
Proof:
  Var(Σ_{i=1}^n Xi) = Cov(Σ_{i=1}^n Xi, Σ_{j=1}^n Xj) (by Fact 7.5)
    = Σ_{i=1}^n Σ_{j=1}^n Cov(Xi, Xj) (by Proposition 7.3)
    = Σ_{i=1}^n Cov(Xi, Xi) + Σ_{i≠j} Cov(Xi, Xj)
    = Σ_{i=1}^n Var(Xi) + 2 Σ_{i<j} Cov(Xi, Xj). (by Fact 7.5 and the symmetry of Cov)
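The counterexample in the proof of Fact 7.7 is small enough to verify by exact enumeration rather than by simulation. The following sketch, added for illustration and not part of the original notes, just tabulates the three equally likely outcomes.

```python
from fractions import Fraction

# Outcomes of X with probability 1/3 each; Y = 0 if X != 0, Y = 1 otherwise.
outcomes = [(-1, Fraction(1, 3)), (0, Fraction(1, 3)), (1, Fraction(1, 3))]
pairs = [(x, 0 if x != 0 else 1, p) for x, p in outcomes]

EX = sum(x * p for x, y, p in pairs)
EY = sum(y * p for x, y, p in pairs)
EXY = sum(x * y * p for x, y, p in pairs)
cov = EXY - EX * EY
print(cov)  # 0: X and Y are uncorrelated (Fact 7.6)

# ...and yet not independent: P{X=0, Y=1} differs from P{X=0} P{Y=1}.
p_joint = sum(p for x, y, p in pairs if x == 0 and y == 1)
p_prod = (sum(p for x, y, p in pairs if x == 0)
          * sum(p for x, y, p in pairs if y == 1))
print(p_joint, p_prod)  # 1/3 vs 1/9
```

Using `Fraction` keeps the arithmetic exact, so the zero covariance is not a rounding artifact.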
Fact 7.10 --
If X1, X2, …, Xn are pairwise independent, in that Xi and Xj are independent for all i ≠ j, then
  Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi).
Proof: If Xi and Xj are independent for all i ≠ j, then by Fact 7.7 we have Cov(Xi, Xj) = 0, and so by Fact 7.9 the result is derived.

Fact 7.11 (the variance of the sample mean) --
If X1, X2, …, Xn constitutes a random sample arising from a population random variable X with variance σ², then the variance of the sample mean X̄ is
  Var(X̄) = σ²/n.
Proof: By the definition of the sample mean, X̄ = (X1 + X2 + … + Xn)/n, where the Xi are all iid, we get
  Var(X̄) = Var((1/n) Σ_{i=1}^n Xi)
    = (1/n)² Var(Σ_{i=1}^n Xi) (by Proposition 5.3)
    = (1/n²) Σ_{i=1}^n Var(Xi) (by the iid property and Fact 7.10)
    = (1/n²)(nσ²) (∵ Var(Xi) = σ², i = 1, 2, …, n)
    = σ²/n.
Notes:
By Facts 7.4 and 7.11 above, we have E[X̄] = μ and Var(X̄) = σ²/n, respectively, if the population random variable X has mean μ and variance σ². It is also easy to see the following fact about the sample total To.

Fact 7.12 (the mean and variance of the sample total) --
  E[To] = nμ; Var(To) = nσ².
Proof: easy by Facts 7.4 and 7.11; left as an exercise.

Example 7.5 --
Prove Fact 4.7, which says that the mean and variance of a binomial random variable with parameters n and p are E[X] = np and Var(X) = np(1 − p), respectively.
Proof: By definition, a binomial random variable with parameters n and p may be considered as the sample total To of a random sample of size n, X1, X2, …, Xn, arising from a Bernoulli random variable X defined as: X = 1 with probability p and X = 0 with probability 1 − p. The mean of the Bernoulli random variable X by definition is E[X] = 1·p + 0·(1 − p) = p. Also, the second moment of X is E[X²] = 1²·p + 0²·(1 − p) = p. And the variance of it is thus Var(X) = E[X²] − (E[X])² = p − p² = p(1 − p). Now, by Fact 7.12, the mean and variance of the sample total To may be computed to be
  E[To] = nE[X] = np; Var(To) = nVar(X) = np(1 − p).
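Example 7.5's reduction of a binomial random variable to the sample total of Bernoulli variables can be checked exactly by enumerating all 2^n outcome vectors. This sketch is added for illustration and is not part of the original notes; the parameters n = 8 and p = 3/10 are arbitrary choices.

```python
from fractions import Fraction
from itertools import product

n, p = 8, Fraction(3, 10)  # arbitrary illustrative parameters

# Enumerate all 2^n Bernoulli outcome vectors; the sample total is their sum.
E_To = Fraction(0)
E_To2 = Fraction(0)
for bits in product([0, 1], repeat=n):
    prob = Fraction(1)
    for b in bits:
        prob *= p if b == 1 else (1 - p)
    t = sum(bits)          # the sample total To for this outcome
    E_To += t * prob
    E_To2 += t * t * prob

var_To = E_To2 - E_To ** 2
print(E_To == n * p)              # True: E[To] = np
print(var_To == n * p * (1 - p))  # True: Var(To) = np(1 - p)
```

Exact rational arithmetic makes the two equalities of Fact 7.12 hold identically, not just approximately.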
Sample variance ---

Definition 7.5 (sample variance) --
Given a random sample X1, X2, …, Xn arising from a population random variable X, the function of the sample variables
  S² = [(X1 − X̄)² + (X2 − X̄)² + … + (Xn − X̄)²]/(n − 1)
     = [Σ_{i=1}^n (Xi − X̄)²]/(n − 1) (7.3)
is called the sample variance, where X̄ is the sample mean of the random sample.
A note: the value n − 1 instead of n is used as the divisor in the above definition. The reason will become clear in the next fact.

Fact 7.13 (the mean of the sample variance) --
If X1, X2, …, Xn constitutes a random sample arising from a population random variable X with mean μ and variance σ², then the mean E[S²] of the sample variance S² is just σ².
Proof: First, by definition we have
  (n − 1)S² = Σ_{i=1}^n (Xi − X̄)²
    = Σ_{i=1}^n [(Xi − μ) − (X̄ − μ)]²
    = Σ_{i=1}^n (Xi − μ)² − 2(X̄ − μ) Σ_{i=1}^n (Xi − μ) + Σ_{i=1}^n (X̄ − μ)²
    = Σ_{i=1}^n (Xi − μ)² − 2(X̄ − μ)·n(X̄ − μ) + n(X̄ − μ)²
    = Σ_{i=1}^n (Xi − μ)² − n(X̄ − μ)².
Taking expectations of the two sides of the above equality and using the linearity of the expectation operator E[·] (Fact 7.2), we get
  (n − 1)E[S²] = E[Σ_{i=1}^n (Xi − μ)² − n(X̄ − μ)²]
    = Σ_{i=1}^n E[(Xi − μ)²] − nE[(X̄ − μ)²]. (A)
But E[(X̄ − μ)²] = E[(X̄ − E[X̄])²] = Var(X̄) (by Fact 7.4: E[X̄] = μ, and by the definition of variance). Therefore, (A) above becomes
  (n − 1)E[S²] = Σ_{i=1}^n E[(Xi − μ)²] − nVar(X̄)
    = nσ² − n(σ²/n) (by the definition of σ² and Fact 7.11)
    = (n − 1)σ².
Therefore, E[S²] = σ². Done.

A summary --
Now, according to Facts 7.4 and 7.13, we have the means of the sample mean X̄ and the sample variance S², respectively, as
  E[X̄] = μ; E[S²] = σ²,
where μ and σ² are the mean and the variance of the population random variable X. Also, by Fact 7.11 we have Var(X̄) = σ²/n. How about the value Var(S²)? This value is difficult to derive except for some special cases, like the population random variable being normal, as described by the following fact.
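The role of the n − 1 divisor in Definition 7.5 shows up clearly in a quick simulation, sketched below for illustration (not from the original notes; the Uniform(0, 1) population, sample size 5, and repetition count are arbitrary choices): averaging S² over many samples approaches σ², while the same statistic with divisor n lands systematically below it.

```python
import random

random.seed(1)
n, reps = 5, 50_000
sigma2 = 1.0 / 12.0  # variance of Uniform(0, 1), the chosen population

s2_unbiased, s2_biased = 0.0, 0.0
for _ in range(reps):
    xs = [random.random() for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    s2_unbiased += ss / (n - 1)  # Definition 7.5: divisor n - 1
    s2_biased += ss / n          # divisor n instead: biased low

print(s2_unbiased / reps)  # close to 1/12 ≈ 0.0833 (Fact 7.13)
print(s2_biased / reps)    # close to (n-1)/n * 1/12 ≈ 0.0667
```

The divisor-n average converges to ((n − 1)/n)σ², which is exactly the deficit that the proof of Fact 7.13 attributes to the term nE[(X̄ − μ)²] = σ².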
Fact 7.14 (the variance of the sample variance of a random sample arising from a normal population distribution) --
If X1, X2, …, Xn constitutes a random sample arising from a normal population random variable X with mean μ and variance σ², then the variance Var(S²) of the sample variance S² of the random sample is
  Var(S²) = 2σ⁴/(n − 1).
Proof: see the reference book.
A note: additional facts about the sample mean and the sample variance are described in the following fact.

Fact 7.15 (relations between the sample mean and the sample variance) --
If X1, X2, …, Xn constitutes a random sample arising from a normal population random variable X with mean μ and variance σ², then the sample mean X̄ and the sample variance S² of the random sample have the following properties:
(a) X̄ and S² are independent;
(b) X̄ is a normal random variable with mean μ and variance σ²/n; and
(c) (n − 1)S²/σ² is a χ² random variable with n − 1 degrees of freedom.
Proof: see the reference book.

Example 7.6 --
Use the result of (c) in Fact 7.15 to prove Fact 7.14.
Solution: First, by Fact 6.13, the mean and variance of a gamma distribution with parameters (t, λ) are t/λ and t/λ², respectively. From (c) of Fact 7.15, (n − 1)S²/σ² is a χ² random variable with n − 1 degrees of freedom; and according to Definition 6.11, this random variable is just a gamma random variable with parameters (t, λ) where t = (n − 1)/2 and λ = 1/2. Therefore, the variance of (n − 1)S²/σ² may be computed as
  Var((n − 1)S²/σ²) = t/λ² = [(n − 1)/2]/(1/2)² = 2(n − 1). (C)
On the other hand, by Proposition 5.3, which says Var(aX + b) = a²Var(X) for any random variable X, we get
  Var((n − 1)S²/σ²) = [(n − 1)/σ²]²Var(S²). (D)
Equating the above two results (C) and (D), we get
  [(n − 1)/σ²]²Var(S²) = 2(n − 1),
or equivalently,
  Var(S²) = 2σ⁴/(n − 1),
which is just the result of Fact 7.14. Done.
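Fact 7.14 can likewise be checked by simulation for a normal population. This is a sketch added for illustration, not from the original notes; the choices μ = 0, σ = 1, n = 10, and the repetition count are arbitrary.

```python
import random
import statistics

random.seed(2)
mu, sigma, n, reps = 0.0, 1.0, 10, 40_000

s2_values = []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    s2_values.append(statistics.variance(xs))  # divisor n-1, i.e. S^2

# Var(S^2) should be close to 2*sigma^4/(n-1) = 2/9 ≈ 0.2222 (Fact 7.14)
print(statistics.variance(s2_values))
```

Here `statistics.variance` computes the divisor-(n − 1) sample variance, matching Definition 7.5, so the outer variance over the 40 000 S² values estimates Var(S²) directly.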
Sample covariance ---

Definition 7.6 (sample covariance) --
Given two random samples X1, X2, …, Xn and Y1, Y2, …, Yn arising from two random variables X and Y, respectively, the sample covariance of X and Y is defined as
  S_XY = [(X1 − X̄)(Y1 − Ȳ) + (X2 − X̄)(Y2 − Ȳ) + … + (Xn − X̄)(Yn − Ȳ)]/(n − 1)
       = [Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ)]/(n − 1) (7.4)
where X̄ and Ȳ are the sample means of the two random samples arising from X and Y, respectively.

Correlation ---

Definition 7.7 (correlation of two random variables) --
The correlation of two random variables X and Y, denoted by ρ(X, Y), is defined, as long as the product Var(X)Var(Y) is positive, by
  ρ(X, Y) = Cov(X, Y)/√(Var(X)Var(Y)). (7.5)

Fact 7.16 (limits of correlation values) --
The correlation of two random variables X and Y is limited within the interval [−1, 1], i.e., −1 ≤ ρ(X, Y) ≤ 1.
Proof: Suppose random variables X and Y have variances σX² and σY², respectively. Then
  0 ≤ Var(X/σX + Y/σY) (by the nonnegativity of variance)
    = Var(X)/σX² + Var(Y)/σY² + 2Cov(X, Y)/(σXσY) (by Fact 7.9, Proposition 5.3, and Fact 7.8)
    = 2[1 + ρ(X, Y)]. (∵ Var(X) = σX², Var(Y) = σY²)
That is, 0 ≤ 2[1 + ρ(X, Y)], which implies −1 ≤ ρ(X, Y). (E)
On the other hand,
  0 ≤ Var(X/σX − Y/σY) (by the nonnegativity of variance)
    = Var(X)/σX² + Var(Y)/σY² − 2Cov(X, Y)/(σXσY) (by Facts 7.8 and 7.9, and Proposition 5.3)
    = 2[1 − ρ(X, Y)]. (∵ Var(X) = σX², Var(Y) = σY²)
That is, 0 ≤ 2[1 − ρ(X, Y)], which implies ρ(X, Y) ≤ 1. (F)
By (E) and (F), we get −1 ≤ ρ(X, Y) ≤ 1. Done.

Fact 7.17 (the extreme linearity cases of correlation values) --
ρ(X, Y) = 1 implies that Y = a + bX where b = σY/σX > 0; and
ρ(X, Y) = −1 implies that Y = a + bX where b = −σY/σX < 0.
Proof: For ρ(X, Y) = 1 to be true, we see from the last proof that 0 = Var(X/σX − Y/σY) must be true. This, by the definition of variance, implies that X/σX − Y/σY must be a constant (say c), i.e., X/σX − Y/σY = c. So, Y = a + bX where a = −cσY and b = σY/σX > 0.
The second case may be proved similarly with a = cσY and b = −σY/σX < 0. Done.

Fact 7.18 (the inverse cases of linearity of Fact 7.17) --
Y = a + bX implies ρ(X, Y) = 1 or −1, depending on the sign of b.
Proof: easy and left as an exercise.

Comments and illustrations about the property of correlation --
The correlation ρ(X, Y) is a measure of the linearity between the two random variables X and Y. A value of ρ(X, Y) close to +1 or −1 indicates a high degree of linearity between X and Y, whereas a value close to 0 indicates a lack of such linearity (see Fig. 7.1).

Fig. 7.1 Illustration of linearity of extreme correlation values: (a) ρ(X, Y) = 1; (b) ρ(X, Y) = −1.

A positive value of ρ(X, Y) indicates that Y tends to increase when X does, whereas a negative value indicates that Y tends to decrease when X increases (see Fig. 7.2).

Fig. 7.2 Illustration of mutual tendency of X and Y for positive and negative correlation values: (a) ρ(X, Y) > 0; (b) ρ(X, Y) < 0.

When ρ(X, Y) = 0, there is no relation between the tendencies of the values of X and Y (see Fig. 7.3).

Fig. 7.3 Illustration of mutual tendency of X and Y for a correlation value of zero.

Definition 7.8 --
If ρ(X, Y) = 0, then the random variables X and Y are said to be uncorrelated.

Example 7.7 --
A group of 10 ladies spend money on cosmetics and clothes, and the sample values in a recent year are shown in Table 7.1. Consider the expenses for cosmetics and clothes as two random variables X and Y. Find the correlation ρ(X, Y) using the data of Table 7.1 to see whether or not spending more money on cosmetics implies spending more on clothes in the group. Consider the data in the table as sample values of random samples of X and Y, and compute the sample mean values and the sample variance values for use as the mean and variance values of X and Y in computing the value of ρ.

Table 7.1 Sample values of yearly cosmetics and clothes fees of a lady group.
Lady No.   Cosmetics Fee   Clothes Fee
1          3000            7000
2          5000            8000
3          12000           25000
4          2000            5000
5          7000            12000
6          15000           30000
7          5000            10000
8          6000            15000
9          8000            20000
10         10000           18000

Solution: The sample mean values x̄ and ȳ may be computed to be
  x̄ = (Σ_{i=1}^{10} xi)/10 = 73000/10 = 7300 and ȳ = (Σ_{i=1}^{10} yi)/10 = 150000/10 = 15000,
respectively, for use as estimates of the means E[X] and E[Y] of X and Y, respectively (details omitted). The sample variance values sx² and sy², for use as estimates of the variances Var(X) and Var(Y) of X and Y, may be computed to be
  sx² = 148100000/(10 − 1) and sy² = 606000000/(10 − 1),
respectively (details omitted). The sample covariance value, for use as an estimate of the covariance Cov(X, Y) of X and Y, may be computed to be
  sxy = 290000000/(10 − 1) (details omitted).
Therefore, the correlation ρ(X, Y) = Cov(X, Y)/√(Var(X)Var(Y)) may be computed, using the estimates, to be
  ρ(X, Y) ≈ sxy/(sx²sy²)^{1/2}
    = [290000000/(10 − 1)]/{[148100000/(10 − 1)][606000000/(10 − 1)]}^{1/2}
    = 290000000/(148100000·606000000)^{1/2}
    ≈ 0.9680,
which is quite close to 1, meaning that X and Y are very positively correlated (Y increases as X increases, and vice versa), as can be seen from Fig. 7.4, where the data distribution is quite close to that of Fig. 7.1(a).

Fig. 7.4 Data distribution of Example 7.7.

Comments --
The sample mean, variance, and covariance values computed in the last example are respectively estimates of the means, variances, and covariance of the population random variables. More about parameter estimation will be discussed in the next chapter.

Bivariate normal distribution ---

Notes --
The bivariate normal distribution is important in many applications; it is also a starting point for studying multivariate normal distributions. This joint normal distribution has many interesting properties which are useful in various applications.
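The arithmetic of Example 7.7 can be reproduced directly from the Table 7.1 data. This short sketch is added for illustration and is not part of the original notes.

```python
import math

cosmetics = [3000, 5000, 12000, 2000, 7000, 15000, 5000, 6000, 8000, 10000]
clothes   = [7000, 8000, 25000, 5000, 12000, 30000, 10000, 15000, 20000, 18000]
n = len(cosmetics)

xbar = sum(cosmetics) / n  # 7300
ybar = sum(clothes) / n    # 15000
sxx = sum((x - xbar) ** 2 for x in cosmetics)  # 148100000
syy = sum((y - ybar) ** 2 for y in clothes)    # 606000000
sxy = sum((x - xbar) * (y - ybar)
          for x, y in zip(cosmetics, clothes))  # 290000000

# The common 1/(n-1) factors cancel in the correlation estimate.
rho = sxy / math.sqrt(sxx * syy)
print(round(rho, 4))  # 0.968
```

Note that the divisor n − 1 never has to appear: it cancels between the numerator and the denominator of (7.5).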
Definition 7.9 (bivariate normal distribution) --
Two jointly distributed random variables X and Y are said to have a bivariate normal distribution if their joint pdf is given by
  f(x, y) = [1/(2πσxσy√(1 − ρ²))] exp{−[((x − μx)/σx)² + ((y − μy)/σy)² − 2ρ(x − μx)(y − μy)/(σxσy)]/[2(1 − ρ²)]}. (7.6)

Illustration of a bivariate normal distribution --
A diagram of the shape of the pdf of a bivariate normal distribution is shown in Fig. 7.5.

Fig. 7.5 Shape of the pdf of a bivariate normal distribution.

Fact 7.19 --
Given the bivariate normal distribution of two jointly distributed random variables X and Y as described in Definition 7.9, the following facts are true:
(a) both of the marginal random variables X and Y are normal, with parameters (μx, σx²) and (μy, σy²), respectively;
(b) the ρ in the pdf is the correlation of X and Y;
(c) X and Y are independent when ρ = 0.
Proof: see the reference book and the following example.

Example 7.8 --
Prove (c) of Fact 7.19 above.
Proof: When ρ = 0 in (7.6) of Definition 7.9, f(x, y) there becomes
  f(x, y) = [1/(2πσxσy)] exp{−(1/2)[((x − μx)/σx)² + ((y − μy)/σy)²]}
which can be factored into
  f(x, y) = [1/(√(2π)σx)] exp{−((x − μx)/σx)²/2} · [1/(√(2π)σy)] exp{−((y − μy)/σy)²/2}
    = fX(x)fY(y)
with fX(x) and fY(y) being the pdfs of the two normal random variables X and Y, respectively. Therefore, by Proposition 6.1 of the last chapter, we conclude that X and Y are independent.

7.5 Conditional Expectation

Definitions ---
Recall --
(Definition 6.12) The conditional pmf of X, given that Y = y, is defined for all y such that P{Y = y} > 0, by
  pX|Y(x|y) = P{X = x | Y = y} = p(x, y)/pY(y).
(Definition 6.14) The conditional pdf of X, given that Y = y, is defined for all y such that fY(y) > 0, by
  fX|Y(x|y) = f(x, y)/fY(y).

Definition 7.10 (for the discrete case) --
The conditional expectation of X, given that Y = y, is defined for all y such that pY(y) > 0, by
  E[X|Y = y] = Σx x·P{X = x | Y = y} = Σx x·pX|Y(x|y).
Definition 7.11 (for the continuous case) --
The conditional expectation of X, given that Y = y, is defined for all y such that fY(y) > 0, by
  E[X|Y = y] = ∫_{−∞}^{∞} x·fX|Y(x|y) dx.

An interpretation of the conditional expectation --
The conditional expectation given Y = y can be thought of as an ordinary expectation on a reduced sample space consisting only of outcomes for which Y = y.

Example 7.9 --
Suppose that the joint pdf of random variables X and Y is given by
  f(x, y) = (1/y)e^{−x/y}e^{−y} for 0 < x < ∞, 0 < y < ∞.
Compute E[X|Y = y].
Solution: First compute
  fX|Y(x|y) = f(x, y)/fY(y) = f(x, y)/∫ f(x, y) dx
    = [(1/y)e^{−x/y}e^{−y}]/[∫_0^∞ (1/y)e^{−x/y}e^{−y} dx]
    = [(1/y)e^{−x/y}]/[∫_0^∞ e^{−x/y} d(x/y)]
    = [(1/y)e^{−x/y}]/[−e^{−x/y}|_0^∞]
    = [(1/y)e^{−x/y}]/[−(0 − 1)]
    = (1/y)e^{−x/y}.
So,
  E[X|Y = y] = ∫ x·fX|Y(x|y) dx = ∫_0^∞ (x/y)e^{−x/y} dx
    = ∫_0^∞ x[(1/y)e^{−x/y}] dx = −∫_0^∞ x d(e^{−x/y})
    = −xe^{−x/y}|_0^∞ + ∫_0^∞ e^{−x/y} dx (using integration by parts ∫u dv = uv − ∫v du with v = e^{−x/y} and u = x)
    = (0 − 0) + (−y)e^{−x/y}|_0^∞
    = 0 − (−y)e^0
    = y.

Comments --
The above example, with a result of y, shows that E[X|Y = y] may be thought of as a function g(y) of y, i.e., E[X|Y = y] = EX[X|Y = y] = g(y). Therefore, E[X|Y] may be thought of as a function g(Y) of the random variable Y, whose value at Y = y is just E[X|Y = y]. And E[E[X|Y]] in more detail is just EY[EX[X|Y]].
Just as conditional probabilities satisfy all of the properties of ordinary probabilities, so do conditional expectations satisfy all of the properties of ordinary expectations. Some facts resulting from this viewpoint are as follows.

Fact 7.20 --
  E[g(X)|Y = y] = Σx g(x)pX|Y(x|y) in the discrete case;
  E[g(X)|Y = y] = ∫_{−∞}^{∞} g(x)fX|Y(x|y) dx in the continuous case;
  E[Σ_{i=1}^n Xi | Y = y] = Σ_{i=1}^n E[Xi | Y = y].
Proof: left as exercises.

Proposition 7.4 (computing expectation by conditioning) --
  E[X] = E[E[X|Y]]
where for the discrete case of Y,
  E[E[X|Y]] = EY[EX[X|Y]] = Σy EX[X|Y = y]P{Y = y};
and for the continuous case of Y,
  E[E[X|Y]] = EY[EX[X|Y]] = ∫_{−∞}^{∞} EX[X|Y = y]fY(y) dy.
Proof: (for the discrete case only; the continuous case is left as an exercise)
  Σy E[X|Y = y]P{Y = y} = Σy (Σx x·P{X = x|Y = y})P{Y = y}
    = Σy Σx x[P{X = x, Y = y}/P{Y = y}]P{Y = y}
    = Σy Σx x·P{X = x, Y = y}
    = Σx x(Σy P{X = x, Y = y})
    = Σx x·P{X = x}
    = E[X].

Example 7.10 --
A miner is trapped in a mine containing 3 doors. The 1st door leads to a tunnel that will take him to safety after 3 hours of travel. The 2nd door leads to a tunnel that will return him to the mine after 5 hours of travel. The 3rd door leads to a tunnel that will return him to the mine after 7 hours. If we assume that the miner is at all times equally likely to choose any of the doors, what is the expected length of time until he reaches safety?
Solution: Let X be the amount of time (in hours) until the miner reaches safety, and let Y denote the door he initially chooses. Now, according to Proposition 7.4, we have
  E[X] = E[X|Y = 1]P{Y = 1} + E[X|Y = 2]P{Y = 2} + E[X|Y = 3]P{Y = 3}
       = (1/3)(E[X|Y = 1] + E[X|Y = 2] + E[X|Y = 3]).
However,
  E[X|Y = 1] = 3;
  E[X|Y = 2] = 5 + E[X]; (why? because he returns to his original status)
  E[X|Y = 3] = 7 + E[X].
Therefore, E[X] = (1/3)(3 + 5 + E[X] + 7 + E[X]), which can be solved to get E[X] = 15.

7.6 Central Limit Theorem and Related Laws

Introduction ---
The central limit theorem provides insight into why many random variables have probability distributions that are approximately normal. For example, the measurement error in a scientific experiment, often normally distributed, may be thought of as a sum of a number of underlying perturbations and errors of small magnitude, which are not necessarily normal. Laws related to the central limit theorem will also be discussed subsequently.

The weak law of large numbers ---

Theorem 7.1 (the weak law of large numbers) --
If X̄ is the sample mean of a random sample X1, X2, …, Xn of size n arising from a population random variable X with mean μ and variance σ², then for any ε > 0, the following equality is true:
  lim_{n→∞} P{|X̄ − μ| < ε} = 1. (7.7)
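Example 7.10's answer E[X] = 15 can be sanity-checked by simulating the miner's walk. This is an illustrative sketch, not part of the original notes; the trial count is an arbitrary choice.

```python
import random

random.seed(3)
trials = 100_000
total = 0.0
for _ in range(trials):
    t = 0.0
    while True:
        door = random.randint(1, 3)  # the three doors are equally likely
        if door == 1:
            t += 3
            break      # safety reached
        elif door == 2:
            t += 5     # back to the mine, choose again
        else:
            t += 7     # back to the mine, choose again
    total += t

print(total / trials)  # close to 15
```

The `while` loop mirrors the conditioning argument: choosing door 2 or 3 returns the miner to his original status, so the process restarts with time already accumulated.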
That is, as n → ∞, the probability that X̄ and μ are arbitrarily close approaches 1.
Proof: we prove this theorem in three stages.
Stage 1 --- proof of Markov's inequality: P{Y ≥ a} ≤ E[Y]/a for all a > 0, where Y is a nonnegative random variable.
For a > 0, define a new random variable
  I = 1 if Y ≥ a; 0 otherwise.
Then, by definition,
  E[I] = 1·P{Y ≥ a} + 0·P{Y < a} = P{Y ≥ a}. (G)
Also, since Y ≥ 0 (Y is a nonnegative random variable), for all values of Y it is always true that I ≤ Y/a, because when Y ≥ a, Y/a ≥ 1 = I; and when Y < a, Y/a ≥ 0 = I. Taking the expectations of the two sides of the inequality I ≤ Y/a leads to E[I] ≤ E[Y]/a, or equivalently, P{Y ≥ a} ≤ E[Y]/a according to (G).
Stage 2 --- proof of Chebyshev's inequality: P{|W − μ| ≥ k} ≤ σ²/k² for all k > 0, where W is a random variable with mean μ and variance σ².
Define the random variable Y mentioned in the proof of Stage 1 above as Y = (W − μ)². Then Y is nonnegative. Also, take the value a there to be a = k². Then Markov's inequality proven above can be applied here to get
  P{Y ≥ a} = P{(W − μ)² ≥ k²} ≤ E[Y]/a = E[(W − μ)²]/k² = σ²/k² (H)
where the equality E[(W − μ)²] = σ² has been used. Since the inequality (W − μ)² ≥ k² means |W − μ| ≥ k, we get from (H) the desired result P{|W − μ| ≥ k} ≤ σ²/k².
Stage 3 --- proof of the theorem.
Assume that the variance σ² of the population random variable X is finite. (The proof for the infinite case is omitted.) The sample mean X̄ has mean μ and variance σ²/n according to Facts 7.4 and 7.11. Take the W and k mentioned in Stage 2 to be X̄ and ε, respectively. Then Chebyshev's inequality proven there says that
  P{|X̄ − μ| ≥ ε} ≤ (σ²/n)/ε² = σ²/(nε²),
or equivalently, that
  P{|X̄ − μ| < ε} = 1 − P{|X̄ − μ| ≥ ε} ≥ 1 − σ²/(nε²).
As n → ∞, σ²/(nε²) → 0 since σ² is finite. Therefore, P{|X̄ − μ| < ε} → 1 as n → ∞. That is, the desired result lim_{n→∞} P{|X̄ − μ| < ε} = 1 is obtained. Done.
Comments --
With lim_{n→∞} P{|X̄ − μ| < ε} = 1, we say that X̄ converges to μ in
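The chain of inequalities in the proof can be watched at work numerically. The sketch below is added for illustration, not from the original notes; the exponential population (μ = σ² = 1) and ε = 0.1 are arbitrary choices. The observed frequency of {|X̄ − μ| ≥ ε} stays below the Chebyshev bound σ²/(nε²) from Stage 3 and shrinks toward 0 as n grows.

```python
import random

random.seed(4)
mu, eps, reps = 1.0, 0.1, 5_000

results = []
for n in [50, 200, 800]:
    bad = 0
    for _ in range(reps):
        xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
        if abs(xbar - mu) >= eps:
            bad += 1
    bound = min(1.0 / (n * eps * eps), 1.0)  # sigma^2/(n eps^2), capped at 1
    results.append((n, bad / reps, bound))
    print(n, bad / reps, "<=", bound)
```

The bound is loose (Chebyshev uses only two moments), but it is enough to force the limit in (7.7).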
probability.

The weak law of large numbers will be used to define a good property, "consistency," of estimators of parameters of various probability distributions, as will be discussed in the next chapter.

Central limit theorem ---

Theorem 7.2 (the central limit theorem) --
If X̄ is the sample mean of a random sample X1, X2, …, Xn of size n arising from a population random variable X with mean μ and variance σ², then for any constant a, the following equality is true:
  lim_{n→∞} P{(X̄ − μ)/(σ/√n) < a} = Φ(a) (7.8)
where Φ(·) is the cdf of the standard normal random variable (called the error function hereafter). That is, as n → ∞, the random variable Y = (X̄ − μ)/(σ/√n) is approximately a unit normal random variable.
Proof: see the reference book.
Notes --
The above theorem is valid both for continuous random variables and discrete ones. Also, the population random variable need not be normal.
Another form of the theorem may be easily derived (left as an exercise) for the sample total To (instead of for the sample mean X̄):
  lim_{n→∞} P{(To − nμ)/(σ√n) < a} = Φ(a) (7.9)
which means again that as n → ∞, the random variable Z = (To − nμ)/(σ√n) is approximately a unit normal random variable.
Significance of the central limit theorem --
Providing a simple method for computing approximate probabilities for the sample total To or sample mean X̄ of a random sample X1, X2, …, Xn (i.e., for the sum of iid random variables or their mean);
Helping explain the remarkable fact that the empirical frequencies of so many natural populations exhibit bell-shaped (i.e., normal) distributions (see the illustration discussed next).
An illustration of the effect of the central limit theorem as n → ∞ --
See Fig.
7.6, in which each diagram illustrates the histogram of the sample total of a random sample of size n arising from a discrete population random variable X specifying the outcome of a die toss (X = 1, 2, …, 6 with probability 1/6 for each value), based on 10000 tossings of n dice, where n is taken to be 1, 2, 3, 5, 10, 50 for the six diagrams, respectively (i.e., sample size = n, and # of random samples = 10000). As can be seen, as n becomes larger and larger, the histogram becomes closer and closer to a bell shape --- the shape of a normal distribution.
A rule of thumb about the magnitude of n for using the central limit theorem --
How large should n be for the approximation using the central limit theorem to be good enough? A rule of thumb is n > 30.
Example 7.11 (use of the central limit theorem) --
When a batch of a certain chemical product is prepared, the amount of a particular impurity in the batch is a random variable with mean 4.0g and variance 2.25g². If 50 batches are independently prepared, what is the approximate probability that the sample mean amount of impurity, X̄, is larger than 3.8g?
Fig. 7.6 Illustration of the central limit theorem showing the histogram of the sample total becoming bell-shaped as n → ∞ (downloaded 05/04/2010 from http://gaussianwaves.blogspot.com/2010/01/central-limit-theorem.html).
Solution:
Here the size of the random sample is 50 > 30, so the central limit theorem can be applied to the sample mean X̄ here. The population random variable X here has mean μ = 4.0g and σ = (2.25)^(1/2) = 1.5g.
Since by (7.8) we have P{ (X̄ − μ)/(σ/√n) < a } ≈ Φ(a) as n → ∞, we can compute the desired approximate probability as
P{X̄ > 3.8} = P{ (X̄ − 4.0)/(1.5/√50) > (3.8 − 4.0)/(1.5/√50) }
           ≈ P{Z > −0.94}   (by (7.8))
           = 1 − P{Z ≤ −0.94}
           = 1 − Φ(−0.94)
           = 1 − (1 − Φ(0.94))
           = Φ(0.94) ≈ 0.8264.
Example 7.12 (use of the central limit theorem) --
If 10 fair dice are rolled, find the approximate probability that the sum obtained is between 30 and 40 using the central limit theorem.
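As a numerical aside (not part of the original example), the exact probability asked for in Example 7.12 can be computed by convolving the pmf of a single die ten times, and compared against the CLT estimate derived in the solution that follows; the helper phi below is the standard normal cdf expressed through the error function, Φ(x) = (1 + erf(x/√2))/2.

```python
import math

def dice_sum_pmf(n_dice):
    """Exact pmf of the sum of n_dice fair six-sided dice, by repeated convolution."""
    pmf = {0: 1.0}
    for _ in range(n_dice):
        new = {}
        for total, p in pmf.items():
            for face in range(1, 7):
                new[total + face] = new.get(total + face, 0.0) + p / 6.0
        pmf = new
    return pmf

def phi(x):
    """Standard normal cdf, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

exact = sum(p for s, p in dice_sum_pmf(10).items() if 30 <= s <= 40)
sd = math.sqrt(350.0 / 12.0)  # standard deviation of the sum of 10 dice
approx = phi((40 - 35) / sd) - phi((30 - 35) / sd)  # plain CLT, no continuity correction
# The plain CLT estimate (about 0.65) undershoots the exact value somewhat,
# consistent with the note below that n = 10 < 30 makes the approximation rough.
```

Applying a continuity correction (using the interval [29.5, 40.5] instead of [30, 40]) brings the CLT estimate much closer to the exact value, since the sum of dice is an integer-valued random variable.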
Solution:
Let the random variable X denote the outcome of a die. The sum of the 10 outcomes of the 10 fair dice may be regarded as a sample total To = Σ_{i=1}^{10} Xi of a random sample X1, X2, …, X10 of size 10 arising from the population random variable X.
The mean of X is E[X] = (1 + 2 + … + 6)/6 = 7/2, and the variance of X is Var(X) = E[X²] − (E[X])² = (1² + 2² + … + 6²)/6 − (7/2)² = 35/12.
And so the sample total To has mean nμ = 10·(7/2) = 35 and standard deviation σ√n = √(35/12)·√10 = √(350/12).
By the central limit theorem described by (7.9), we get
P{30 ≤ To ≤ 40} ≈ P{ (30 − 35)/√(350/12) ≤ Z ≤ (40 − 35)/√(350/12) }
              = P{ −√(6/7) ≤ Z ≤ √(6/7) }
              = Φ(√(6/7)) − Φ(−√(6/7))
              = Φ(√(6/7)) − (1 − Φ(√(6/7)))
              = 2Φ(√(6/7)) − 1   (√(6/7) ≈ 0.9258)
              ≈ 2Φ(0.93) − 1 ≈ 2·0.8238 − 1 = 0.6476.
Note that the rule of thumb for n is not satisfied here, so the approximation of the probability is not very accurate.
Example 7.13 (the DeMoivre-Laplace Limit Theorem as a special case of the central limit theorem) --
Show that the DeMoivre-Laplace Limit Theorem mentioned in Chapter 5 and repeated below is a special case of the central limit theorem:
If Sn denotes the number of successes that occur when n independent trials, each with a success probability p, are performed, then for any a < b, it is true that
P{ a ≤ (Sn − np)/√(np(1 − p)) ≤ b } → Φ(b) − Φ(a)
as n → ∞ (note: Sn is a random variable here).
Proof:
Sn obviously is the sample total To of a random sample of size n arising from a Bernoulli random variable X with success probability p, which has mean p and variance p(1 − p) as seen from the result of Example 7.5. Applying (7.9) of the central limit theorem above with nμ = np and σ√n = √(p(1 − p))·√n = √(np(1 − p)), we get the desired result.
A generalized central limit theorem ---
Theorem 7.2 (the central limit theorem for independent random variables) --
Let X1, X2, … be a sequence of independent random variables with respective means and variances
μi = E[Xi] and σi² = Var(Xi).
If (a) the Xi are uniformly bounded, i.e., for some M, P{|Xi| < M} = 1 for all i; and (b) Σ_{i=1}^∞ σi² = ∞, then the following equality is true:
lim_{n→∞} P{ Σ_{i=1}^n (Xi − μi) / √(Σ_{i=1}^n σi²) < a } = Φ(a).   (7.10)
Proof: omitted.
A comment --- the above generalized central limit theorem is useful for estimating the probability distribution of the sum of an unlimited number of independently, but not necessarily identically, distributed random variables.
The strong law of large numbers ---
Theorem 7.3 (the strong law of large numbers) --
If X̄ is the sample mean of a random sample X1, X2, …, Xn of size n arising from a population random variable X with a finite mean μ, then the following equality is true:
P{ lim_{n→∞} X̄ = μ } = 1.   (7.11)
Proof: omitted.
A comment --- with P{ lim_{n→∞} X̄ = μ } = 1, we say that X̄ converges to μ with probability 1.
Comparison of the strong and weak laws of large numbers --
In notation, the two laws are --
Weak: lim_{n→∞} P{|X̄ − μ| < ε} = 1.   (7.7)
Strong: P{ lim_{n→∞} X̄ = μ } = 1.   (7.11)
In words, the two laws may be described as --
Weak: the sample mean X̄ converges in probability towards the population mean μ (as n approaches ∞).
Strong: the sample mean X̄ converges with probability 1 (or almost surely) to the population mean μ (as n approaches ∞).
Mathematically, the difference between the two laws is --
Weak: for a specified large n*, the sample mean X̄ is likely to be near μ. Thus, it leaves open the possibility that |X̄ − μ| > ε happens an infinite number of times for n > n*, although at infrequent intervals.
Strong: the above open possibility almost surely will not occur. In particular, it implies that with probability 1, for any ε > 0 the inequality |X̄ − μ| < ε holds for all large enough n.
Example 7.14 (an application of the strong law of large numbers) --
The strong law of large numbers may be used to estimate the probability p of success of a Bernoulli random variable X (which is a binomial random variable with a single trial) or other similar cases.
Recall the definition of a Bernoulli random variable X:
X = 1 if a success, with probability p, occurs;
  = 0, otherwise.
The mean of X is
μ = E[X] = 1·P{success} + 0·P{failure} = 1·p + 0·(1 − p) = p.
Let X1, X2, …, Xn be a random sample arising from X with X̄ as the sample mean (note: the sample total To = X1 + X2 + … + Xn here is just the binomial random variable consisting of n trials, all identical with the Bernoulli distribution). Then the law above says that as n → ∞, with probability 1 the sample mean X̄ → μ = p. That is, the mean of the n sample values approaches almost surely the probability p as n approaches ∞. The validity of this way of estimating p was mentioned in Chapter 2 (at the end of Section 2.2).
Historical notes --
The central limit theorem was first proved by the French mathematician Pierre-Simon, marquis de Laplace, who observed that errors of measurement, which can usually be regarded as being the sum of a large number of tiny forces, tend to be normally distributed. The central limit theorem was regarded as an important contribution to science.
The strong law of large numbers is probably the best-known result in probability theory.
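The estimation scheme of Example 7.14 is easy to simulate. In the sketch below (the true p = 0.3, the number of trials n, and the seed are arbitrary illustrative choices), the sample mean of n simulated Bernoulli trials serves as the estimate of p, and it lands close to the true value, as the strong law predicts:

```python
import random

def estimate_p(p_true=0.3, n=200_000, seed=7):
    """Estimate a Bernoulli success probability p by the sample mean
    of n simulated trials, as in Example 7.14."""
    rng = random.Random(seed)
    successes = sum(1 for _ in range(n) if rng.random() < p_true)
    return successes / n  # sample mean X-bar of the n Bernoulli outcomes

p_hat = estimate_p()
assert abs(p_hat - 0.3) < 0.01  # the estimate is within 0.01 of the true p here
```

By the central limit theorem, the error of this estimate is approximately normal with standard deviation √(p(1 − p)/n), so larger n gives a proportionally (in √n) tighter estimate.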