Random variables: variance
March 15, 2009

1 Definition

We defined expectation, a stand-in for the average, in the last handout. The expectation of X is analogous to the center of mass: in mechanics, we model objects as point masses located at their centers of mass. In probability, we hope that the expectation somehow captures the behavior of the random variable, though as of now we have not been able to say exactly how. But the center of mass does not capture all behavior of the object; it cannot help you figure out how the object will rotate. To model rotation there is a different concept, the moment of inertia. Similarly, in probability, expectation cannot tell you how the variable is spread around the expectation. For example, a Binomial(1000, .1), a Geometric(.01), and a constant distribution that assigns probability 1 to the value 100 (and 0 elsewhere) all have expectation 100. Yet the constant distribution gives non-zero probability to only 1 value, the binomial random variable can be one of 1001 values, and the geometric random variable can take on infinitely many values.

We introduce the notion of variance to specify how well the expectation captures the behavior of the random variable, and to describe the spread of the random variable. Specifically, we refer to (??). Recall that for notational convenience, we denote the set of all values a random variable X can take by X(Ω), and if we consider a real-valued function g of X,

    Eg(X) = \sum_{x \in X(\Omega)} g(x) P(X = x).

The variance of X is the expectation of g(x) = (x − EX)^2. Therefore,

    var(X) = Eg(X) = E(X − EX)^2 = \sum_{x \in X(\Omega)} (x − EX)^2 P(X = x).

HW 1 Show that for all real values of c, var(X + c) = var(X). Hint: first show that E(X + c) = EX + c.

HW 2 Show that var(X) = EX^2 − (EX)^2.

Notation: EX^2 is the expectation of X^2; the square of the expectation of X will always be written as (EX)^2.

2 Bernoulli and Binomial random variables

If X is a Bernoulli p random variable, then EX^2 = 1^2 · p = p, therefore

    var(X) = EX^2 − (EX)^2 = p − p^2 = p(1 − p).    (1)
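As a quick numerical check of the identity in HW 2, here is a short Python sketch (the course uses MATLAB, but the check is the same; the three-point distribution below is a made-up example, not from the handout):

```python
# Check var(X) = E[(X - EX)^2] = E[X^2] - (EX)^2 on a small distribution.
# Hypothetical example: X takes values 1, 2, 3, each with probability 1/3.
values = [1, 2, 3]
probs = [1/3, 1/3, 1/3]

EX = sum(x * p for x, p in zip(values, probs))                 # expectation
EX2 = sum(x**2 * p for x, p in zip(values, probs))             # E[X^2]
var_direct = sum((x - EX)**2 * p for x, p in zip(values, probs))
var_identity = EX2 - EX**2

print(EX, var_direct, var_identity)   # both variance forms agree: 2/3
```

The same two computations can be run on any finite distribution; HW 2 asks you to prove they always agree.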
Suppose we have n independent Bernoulli p trials, the i-th trial being Xi. As we saw before, the random variable

    Y = X1 + X2 + ... + Xn

is a Binomial random variable with parameters n and p, and EY = np. To compute the variance of Y, we use the identity

    (X1 + X2 + ... + Xn)^2 = \sum_{i=1}^n Xi^2 + 2 \sum_{i<j} Xi Xj,

where

    \sum_{i<j} Xi Xj = \sum_{i=1}^n \sum_{j=i+1}^n Xi Xj.

Note that EXi^2 = 1^2 · p = p, while EXi Xj = P(Xi = 1, Xj = 1) = p^2. Therefore,

    EY^2 = E(X1 + ... + Xn)^2 = E \sum_{i=1}^n Xi^2 + E 2\sum_{i<j} Xi Xj = \sum_{i=1}^n EXi^2 + 2 \sum_{i<j} EXi Xj = np + 2 \binom{n}{2} p^2.

It follows that

    var(Y) = EY^2 − (EY)^2 = np + n(n − 1)p^2 − (np)^2 = np − np^2 = np(1 − p).    (2)

Observe from (2) and (1) that the variance of Y is the sum of the variances of the Xi in this particular case.

HW 3 Since

    (X1 + ... + Xn)^2 = X1(X1 + ... + Xn) + X2(X1 + ... + Xn) + ... + Xn(X1 + ... + Xn),

show that

    (X1 + ... + Xn)^2 = \sum_{i=1}^n \sum_{j=1}^n Xi Xj.

Therefore, show that

    (X1 + ... + Xn)^2 = \sum_{i=1}^n Xi^2 + \sum_{i=1}^n \sum_{j \ne i} Xi Xj.

Why is

    \sum_{i=1}^n \sum_{j \ne i} Xi Xj = 2 \sum_{i<j} Xi Xj?

HW 4 You are now ready for a real-life scenario. If there is only one thing you remember from your 4 years of engineering undergrad, I think it should be this principle: with most real problems, there is a lot of value in modeling the situation appropriately, even if the model is not completely accurate. We will encounter such a situation here.

Suppose you want to estimate the proportion of defective chips produced at a particular chip fabrication unit. You know N, the total number of chips produced at this fabrication unit. You do not have the resources to test all N chips. So you pick n ≪ N chips without replacement. Each choice of chip is random among the chips remaining at that point and is independent of the test results on previous chips. You test these n chips. From these test results you have to figure out what proportion of the N chips are defective.
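The conclusion var(Y) = np(1 − p) can be checked exactly, without simulation, by computing EY and EY^2 straight from the binomial pmf. A Python sketch (the parameters n and p below are arbitrary choices of ours):

```python
from math import comb

# Exact check that var(Y) = np(1-p) for Y ~ Binomial(n, p), using
# the pmf P(Y = k) = C(n, k) p^k (1-p)^(n-k) directly.
n, p = 10, 0.3   # hypothetical parameters
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

EY = sum(k * pk for k, pk in enumerate(pmf))       # should be np = 3.0
EY2 = sum(k * k * pk for k, pk in enumerate(pmf))  # np + n(n-1)p^2
var_Y = EY2 - EY**2

print(var_Y, n * p * (1 - p))   # both equal np(1-p) = 2.1
```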
Let m be the (unknown) number of defective chips, so you want to estimate m/N. Let Xi denote whether your i-th chip is defective: if the i-th chip is defective, Xi = 1, else Xi = 0. Since the first chip is picked at random among the N chips, P(X1 = 1) = m/N. The second chip is now chosen randomly among the remaining N − 1 chips.

1. If X1 = 1, what is the probability X2 = 1? In other words, what is P(X2 = 1 | X1 = 1)?
2. If X1 = 0, what is the probability X2 = 1? Namely, what is P(X2 = 1 | X1 = 0)?
3. What is P(X2 = 1)?
4. Are X1 and X2 independent?

If N is very large, note that P(X2 = 1 | X1 = 0) ≈ P(X2 = 1 | X1 = 1) ≈ P(X2 = 1). Furthermore, convince yourself that if n ≪ N, then it is a good approximation to consider X1, ..., Xn to be independent Bernoulli m/N variables. Optional: how can you prove this is a good approximation? The starting point is to prove that if n is sufficiently smaller than m, then with high confidence, even if you chose the n chips with replacement, you would not see repeats (as if you had sampled without replacement). How would you continue? See also the optional question at the end of Subproblem 10.

Your estimate p̂ of m/N is

    p̂ = (1/n) \sum_{i=1}^n Xi,

namely, you estimate what proportion of the n chips you sampled are defective. You need to estimate m/N to an accuracy of ±.03 and you want to be 90% confident of your answer. The next 5 parts help you do this:

5. What is E p̂?
6. Show that the variance of your estimate p̂ is m(N − m)/(nN^2).
7. For all 0 ≤ x ≤ 1, 0 ≤ x(1 − x) ≤ 1/4. Therefore show that var(p̂) ≤ 1/(4n).
8. Therefore, the standard deviation of your estimate is at most 1/(2√n). Recall that you need to estimate m/N to an accuracy of ±.03 and you want to be 90% confident of your answer. What should n, the number of chips that you must test, be? Hint: use (??) to first find k from the confidence probability .9. Then use the accuracy value .03 and k to find n.
9. You will need to verify at this point that m ≫ n.
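A sketch of the arithmetic in part 8, assuming the unnamed inequality (??) is a Chebyshev-style bound P(|p̂ − E p̂| ≥ kσ) ≤ 1/k^2 (that assumption is ours; if (??) is a different bound, the value of k, and hence n, changes):

```python
from math import sqrt, ceil

# Hypothetical worked version of part 8, assuming a Chebyshev-style
# bound: P(|p_hat - E p_hat| >= k * sigma) <= 1/k^2.
confidence = 0.90   # want 90% confidence
accuracy = 0.03     # want p_hat within +/- 0.03 of m/N

# Failure probability 1/k^2 = 1 - confidence gives k:
k = sqrt(1 / (1 - confidence))          # k = sqrt(10)

# sigma <= 1/(2*sqrt(n)) (part 7), so require k/(2*sqrt(n)) <= accuracy:
n = ceil((k / (2 * accuracy))**2)
print(n)   # 2778 chips suffice under this bound
```

Note how conservative the Chebyshev-style bound is; sharper bounds would give a much smaller n for the same confidence.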
Since you know N and you have the estimate p̂, what is your estimate of m? You are definitely in trouble if the estimate of m so obtained is not much greater than n. But suppose the estimate of m is sufficiently ≫ n.

10. Are you done? Can anything else go wrong?

(Optional) The confidence interval obtained doesn't take into account that we approximated independence. How would you take that into account? Suppose we choose the chips with replacement and let Xi = 1 denote the event that the i-th choice is defective. Are all the Xi independent now? If yes, how would you proceed? If not, what (precise) conditions on N, n, m make them approximately independent, and how would you derive confidence intervals?

The two optional parts can bump you up by up to a grade, and they are not as hard as the optional problem in Handout 7.

3 Correlation

When computing the variance of a Binomial variable, Y = X1 + ... + Xn, we observed that the variance of Y was the sum of the variances of the Xi, since the Xi were independent. Here we will examine a general sufficient condition for the variances to add up.

The correlation between two random variables X1 and X2 is defined as

    corr(X1, X2) = EX1X2 − (EX1)(EX2).    (3)

We are using a definition here that is different from the usual definition in statistics and probability. The quantity in (3) is called covariance in those fields. They define a quantity ρ between X1 and X2 as

    ρ(X1, X2) = (EX1X2 − (EX1)(EX2)) / \sqrt{var(X1) var(X2)}.    (4)

We will call ρ the correlation coefficient; note that it differs from (3) only by a normalization constant. Because of the normalization, it can be shown that −1 ≤ ρ ≤ 1, while correlation as we defined it can take any real value. The reason we use the definition (3) is to coincide with the notion of autocorrelation (defined as the correlation between samples of a signal at two different instants) used in signal processing and communication.
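Definitions (3) and (4) are mechanical to compute from a joint distribution. A Python sketch on a small made-up joint pmf of our own (not from the handout):

```python
# Compute corr (3) and the correlation coefficient rho (4) from a
# hypothetical joint distribution: keys are (x1, x2), values are P(X1=x1, X2=x2).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

E1 = sum(x1 * p for (x1, x2), p in joint.items())
E2 = sum(x2 * p for (x1, x2), p in joint.items())
E12 = sum(x1 * x2 * p for (x1, x2), p in joint.items())
E1sq = sum(x1**2 * p for (x1, x2), p in joint.items())
E2sq = sum(x2**2 * p for (x1, x2), p in joint.items())

corr = E12 - E1 * E2                                        # definition (3)
rho = corr / ((E1sq - E1**2) * (E2sq - E2**2)) ** 0.5       # definition (4)
print(corr, rho)   # 0.15 and 0.6: positively correlated, rho in [-1, 1]
```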
Autocorrelation determines how much power a signal has at any given frequency, so an estimate of autocorrelation is a very important design parameter for a communication engineer. In this course, we will always use (3) to mean correlation, and ρ to mean correlation coefficient (instead of covariance for (3) and correlation for (4), as is common in the statistics and math literature). But always be clear about what is meant (with or without the normalization).

If the correlation between X1 and X2 is 0, then the variables are said to be uncorrelated. When we have multiple random variables X1, ..., Xn, if we say the variables are uncorrelated, we mean that for all i ≠ j, Xi and Xj are uncorrelated.

In this section X1, ..., Xn are not necessarily independent, and Y = X1 + X2 + ... + Xn. Observe that

    EY^2 = E(X1 + X2 + ... + Xn)^2 = E \sum_{i=1}^n Xi^2 + E \sum_{i \ne j} Xi Xj = \sum_{i=1}^n EXi^2 + \sum_{i \ne j} EXi Xj.

If for all i and j, EXi Xj = EXi EXj, then

    EY^2 = \sum_{i=1}^n EXi^2 + \sum_{i \ne j} EXi EXj,    (5)

while

    (EY)^2 = (E \sum_{i=1}^n Xi)^2 = (\sum_{i=1}^n EXi)^2 = \sum_{i=1}^n (EXi)^2 + \sum_{i \ne j} EXi EXj.    (6)

Therefore, from (5) and (6),

    var(Y) = EY^2 − (EY)^2 = \sum_{i=1}^n (EXi^2 − (EXi)^2) = \sum_{i=1}^n var(Xi).

Therefore, if Xi and Xj are uncorrelated for all pairs i and j with i ≠ j, then

    var(Y) = \sum_{i=1}^n var(Xi).

4 Correlation and Independence

Independence, or even pairwise independence between variables, implies that the variables are uncorrelated. The reverse is, however, not true: uncorrelated variables need not be independent or even pairwise independent. To see that independent or pairwise independent variables X1, . . .
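The identity var(Y) = \sum var(Xi) for uncorrelated variables can be verified by exact enumeration, with no simulation error. A Python sketch using independent (hence uncorrelated) Bernoulli variables with parameters of our choosing:

```python
from itertools import product

# Check var(X1+...+Xn) = sum of var(Xi) for independent Bernoulli
# variables, by enumerating all 2^n joint outcomes exactly.
ps = [0.2, 0.5, 0.7]          # hypothetical Bernoulli parameters

outcomes = []                 # (probability, sum of the Xi) per outcome
for bits in product([0, 1], repeat=len(ps)):
    prob = 1.0
    for b, p in zip(bits, ps):
        prob *= p if b == 1 else 1 - p   # independence: probabilities multiply
    outcomes.append((prob, sum(bits)))

EY = sum(pr * s for pr, s in outcomes)
EY2 = sum(pr * s * s for pr, s in outcomes)
var_Y = EY2 - EY**2
sum_of_vars = sum(p * (1 - p) for p in ps)   # each var(Xi) = p(1-p)
print(var_Y, sum_of_vars)                    # equal: 0.62
```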
, Xn are also uncorrelated, we simplify EXi Xj (for i ≠ j) as follows:

    EXi Xj = \sum_{x_i \in Xi(\Omega)} \sum_{x_j \in Xj(\Omega)} x_i x_j P(Xi = x_i, Xj = x_j)
           = \sum_{x_i} \sum_{x_j} x_i x_j P(Xi = x_i) P(Xj = x_j | Xi = x_i)
           = \sum_{x_i} \sum_{x_j} x_i x_j P(Xi = x_i) P(Xj = x_j)        (a)
           = (\sum_{x_i \in Xi(\Omega)} x_i P(Xi = x_i)) (\sum_{x_j \in Xj(\Omega)} x_j P(Xj = x_j))
           = EXi EXj,

where the equality (a) holds when Xi and Xj are independent. Therefore, if for all i ≠ j, Xi and Xj are independent, namely X1, ..., Xn are pairwise independent, then the variables are also uncorrelated. If the stronger condition of independence holds, then the variables are also pairwise independent and therefore also uncorrelated.

The following probability distribution illustrates a case where the variables X1 and X2 are uncorrelated, but not independent; since there are only two variables, they are not pairwise independent either.

    X1   X2   P(X1, X2)
    -1   -1   1/12
    -1    0   1/6
    -1    1   1/12
     0   -1   1/9
     0    0   1/9
     0    1   1/9
     1   -1   1/15
     1    0   1/5
     1    1   1/15

Now P(X1 = −1) = P(X1 = 0) = P(X1 = 1) = 1/3. Observe that X1 and X2 are not independent, since

    P(X2 = 1 | X1 = 1) = P(X1 = 1, X2 = 1)/P(X1 = 1) = (1/15)/(1/3) = 1/5,
    P(X2 = 1 | X1 = 0) = P(X1 = 0, X2 = 1)/P(X1 = 0) = (1/9)/(1/3) = 1/3.

On the other hand, EX1X2 = 0 and EX1 = 0, which implies that

    corr(X1, X2) = EX1X2 − EX1 EX2 = 0,

namely X1 and X2 are uncorrelated.

HW 5 This is a computer assignment. If you are unfamiliar with MATLAB, please come and see me.

1. How would you generate X1, a Bernoulli variable with probability p = 1/4?
2. Generate 100 independent Bernoulli 1/4 variables in MATLAB, X1, ..., X100. That is, each variable Xi must be 1 with probability 1/4 no matter what the values of the other random variables are.

Suppose you have 100 Bernoulli 1/4 trials. Define Yk to be the sum of the first k trials and Y_{n−k+1} to be the sum of the last k trials. What is the expectation of Yk and of Y_{n−k+1}? The variance of Yk and of Y_{n−k+1}?
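The claims made about this table can be verified mechanically with exact rational arithmetic. A Python sketch (the handout does MATLAB; exact fractions avoid any floating-point doubt):

```python
from fractions import Fraction as F

# The joint distribution from the table: uncorrelated but not independent.
joint = {(-1, -1): F(1, 12), (-1, 0): F(1, 6), (-1, 1): F(1, 12),
         ( 0, -1): F(1,  9), ( 0, 0): F(1, 9), ( 0, 1): F(1,  9),
         ( 1, -1): F(1, 15), ( 1, 0): F(1, 5), ( 1, 1): F(1, 15)}

E1 = sum(x1 * p for (x1, x2), p in joint.items())
E2 = sum(x2 * p for (x1, x2), p in joint.items())
E12 = sum(x1 * x2 * p for (x1, x2), p in joint.items())
corr = E12 - E1 * E2
print(corr)                               # exactly 0: uncorrelated

# Marginals, to test independence at one entry of the table:
P_X1_1 = sum(p for (x1, _), p in joint.items() if x1 == 1)   # = 1/3
P_X2_1 = sum(p for (_, x2), p in joint.items() if x2 == 1)
print(joint[(1, 1)], P_X1_1 * P_X2_1)     # 1/15 vs the product: dependent
```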
Now generate 50 binary 100-length sequences X(i) = X1, ..., X100 as follows. Each sequence X(i) is the 100-bit sequence you get as a result of 100 independent Bernoulli 1/4 trials. So you do a total of 50 × 100 independent Bernoulli 1/4 trials in all. For the i-th sequence, let Yk(i) be the sum of the first k trials, while Y_{n−k+1}(i) is the sum of the last k trials. Fix k = 75.

1. Plot the values of Yk(i) against Y_{n−k+1}(i).
2. Use MATLAB to find the best (minimum least square error) linear fit between Yk(i) and Y_{n−k+1}(i). What is the slope of this line?
3. Find the sample expectation of Yk(i) and of Y_{n−k+1}(i), namely

       \bar{Y}_k = (1/50) \sum_{i=1}^{50} Yk(i),

   the sample variance of Yk(i) and of Y_{n−k+1}(i),

       (1/50) \sum_{i=1}^{50} (Yk(i) − \bar{Y}_k)^2,

   as well as the sample correlation coefficient (define it yourself). Note that the sample mean, the sample variance and the sample correlation coefficient are not necessarily the true values of the mean, variance and correlation coefficient; rather, they are our approximations from the observations.
4. Why do we compute the sample expectation as above? Note that the random variables do not have uniform distributions; does it matter?
5. How do you think the slope of the linear fit above, the correlation coefficient and the variances are related? Try ρ(Yk, Y_{n−k+1}) × \sqrt{var(Yk)} / \sqrt{var(Y_{n−k+1})} if Yk is on the y-axis.

Instead of 50 sequences X(i), generate 100 sequences. Then 1000 sequences. What happens as you increase the number of sequences?
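A Python sketch of this experiment (the assignment itself asks for MATLAB; the seed and the choice of 1000 sequences are ours). Since k = 75 and n = 100, the first 75 and last 75 trials share 2k − n = 50 trials, which is why the two sums come out positively correlated:

```python
import random

random.seed(0)   # reproducible; a sketch of the MATLAB assignment

n, k, p, num_seq = 100, 75, 0.25, 1000
Yk_samples, Ylast_samples = [], []
for _ in range(num_seq):
    x = [1 if random.random() < p else 0 for _ in range(n)]
    Yk_samples.append(sum(x[:k]))        # sum of the first k trials
    Ylast_samples.append(sum(x[-k:]))    # sum of the last k trials

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m)**2 for x in v) / len(v)   # sample variance

mY, mZ = mean(Yk_samples), mean(Ylast_samples)
cov = mean([a * b for a, b in zip(Yk_samples, Ylast_samples)]) - mY * mZ
rho = cov / (var(Yk_samples) * var(Ylast_samples)) ** 0.5
print(mY, mZ, rho)   # sample means near kp = 18.75; rho clearly positive
```

The 50 shared trials contribute all of the covariance, so the true ρ here is (50 p(1 − p)) / (75 p(1 − p)) = 2/3; with more sequences, the sample ρ settles toward that value.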