Lecture Notes: Variance, Law of Large Numbers, Central Limit Theorem
CS244 – Randomness and Computation
March 24, 2015

1 Variance – Definition, Basic Examples

The variance of a random variable is a measure of how much the value of the random variable differs from its expected value. Let X be a random variable, and let µ = E(X) be its expected value. Then the variance is defined by

    Var(X) = E((X − µ)^2).

A related quantity, which you can think of as the average deviation of X from its mean, is the standard deviation of X, denoted σ_X, defined by

    σ_X = √Var(X).

You might wonder why we don't use the more obvious E(|X − µ|) as a measure of the average deviation from the mean. The answer, in part, is given in the last section of these notes, on the Central Limit Theorem.

Example. Bernoulli Random Variable. Let X be a Bernoulli random variable that has the value 1 with probability p and 0 with probability q = 1 − p. As we've seen, E(X) = p. So (X − µ)^2 has the value (1 − p)^2 = q^2 with probability p, and the value (0 − p)^2 = p^2 with probability q. Thus

    Var(X) = pq^2 + qp^2 = pq(p + q) = pq,

and σ_X = √(pq). For instance, if p = 1/2 then Var(X) = 1/4 and σ_X = 1/2. This represents a kind of extreme case, at least for Bernoulli random variables, of deviation from the mean. At the other extreme, if p = 1 then X never varies (it always has the value 1), and Var(X) = σ_X = 0.

Figure 1: PMFs for a fair die and two differently loaded dice

Example. Dice, loaded and unloaded. Figure 1 shows the PMFs for three different distributions of the outcome of a single die roll. The diagram at left shows the standard uniform random variable, where each of the six outcomes has probability 1/6. The diagram in the center shows a loaded die that always results in 1 or 6, each with probability 1/2, and the diagram at right is the case where the die always results in 3 or 4, again each with probability 1/2. In all three instances, the expected value of the random variable is 3.5.
For the fair die, the variance is

    Σ_{j=1}^{6} (1/6)(j − 3.5)^2 = 2.9167,

so the standard deviation is the square root of this, about 1.71. For the die loaded to come out 1 or 6, the variance is just

    (1/2)((1 − 3.5)^2 + (6 − 3.5)^2) = 6.25,

so the standard deviation is 2.5. Similarly, for the die loaded to come out 3 or 4, the variance is 0.25 and the standard deviation 0.5. In other words, the center diagram is the most 'spread out', because its value is always quite far from the mean, and the right diagram the least spread out. The uniform distribution has values both far from the mean and close to the mean, giving a variance between those of the two loaded dice.

A (usually) easier way to compute the variance. We can write (X − µ)^2 = X^2 − 2µX + µ^2, so by linearity of expectation, and the fact that µ is constant,

    Var(X) = E(X^2 − 2µX + µ^2)
           = E(X^2) − E(2µX) + µ^2
           = E(X^2) − 2µE(X) + µ^2
           = E(X^2) − µ^2.

Let's repeat the computation of the variance of the Bernoulli variable, using this simpler formula. Since X has values 0 and 1, X^2 = X, so E(X^2) = E(X) = p. Thus Var(X) = E(X^2) − µ^2 = p − p^2 = p(1 − p) = pq, just as we found above.

For the fair die, E(X^2) = (1/6)(1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2) = 15.1667, so the variance is 15.1667 − 3.5^2 = 2.9167, which agrees with the previous example.

As another example, if we want to find E(X^2) for a continuous random variable X, we compute

    ∫_{−∞}^{∞} x^2 p(x) dx,

where p(x) is the probability density function. If we take X to be the outcome of a spinner with values between 0 and 1, then p is the uniform density that is 1 between 0 and 1 and 0 elsewhere. Thus

    E(X^2) = ∫_0^1 x^2 dx = 1/3.

Since E(X) = 1/2, we have Var(X) = E(X^2) − E(X)^2 = 1/3 − (1/2)^2 = 1/12.

As was the case with the expected value, the variance of an infinite discrete or continuous random variable might not be defined, because the underlying infinite series or improper integral does not converge.
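These computations are easy to check numerically. Below is a minimal sketch (plain Python, no external libraries; the function name `variance` is our own choice) that computes the exact variance of any finite discrete distribution using the shortcut formula Var(X) = E(X^2) − µ^2:

```python
def variance(pmf):
    """Exact variance of a finite discrete distribution.

    pmf: dict mapping each value x to its probability P[X = x].
    Uses the shortcut formula Var(X) = E(X^2) - E(X)^2.
    """
    mu = sum(x * p for x, p in pmf.items())
    ex2 = sum(x * x * p for x, p in pmf.items())
    return ex2 - mu * mu

fair = {j: 1/6 for j in range(1, 7)}   # fair die
outer = {1: 0.5, 6: 0.5}               # loaded toward 1 and 6
inner = {3: 0.5, 4: 0.5}               # loaded toward 3 and 4
bern = {0: 0.5, 1: 0.5}                # Bernoulli with p = 1/2

print(round(variance(fair), 4))   # 2.9167
print(variance(outer))            # 6.25
print(variance(inner))            # 0.25
print(variance(bern))             # 0.25
```

The same function reproduces all three die variances above, as well as pq = 1/4 for the fair-coin Bernoulli variable.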
2 Additivity for Independent Random Variables; Binomial Distribution

For a constant c and random variable X we have, by linearity of expectation,

    Var(cX) = E((cX)^2) − E(cX)^2
            = E(c^2 X^2) − (cE(X))^2
            = c^2 E(X^2) − c^2 E(X)^2
            = c^2 (E(X^2) − E(X)^2)
            = c^2 Var(X).

What is the variance of the sum of two random variables? Again, by repeatedly applying the linearity of expectation,

    Var(X + Y) = E((X + Y)^2) − E(X + Y)^2
               = E(X^2 + 2XY + Y^2) − (E(X) + E(Y))^2
               = E(X^2) + 2E(XY) + E(Y^2) − (E(X)^2 + 2E(X)E(Y) + E(Y)^2)
               = (E(X^2) − E(X)^2) + (E(Y^2) − E(Y)^2) + 2(E(XY) − E(X)E(Y))
               = Var(X) + Var(Y) + 2(E(XY) − E(X)E(Y)).

The expression E(XY) − E(X)E(Y) in the right-hand summand is called the covariance of X and Y, and we will see it again later. If X and Y are independent then this expression is 0, so for independent random variables,

    Var(X + Y) = Var(X) + Var(Y).

This does not hold in general if the random variables are not independent. In a homework problem you were asked to show that if X and Y represent, respectively, the smaller and larger values of the dice in a roll of two dice, then E(X)E(Y) ≠ E(XY), so in this case the sum of the variances is not equal to the variance of the sum.

Example. Binomial Random Variable. The binomial random variable S_n gives the number of heads on n coin tosses; as we've seen before,

    P[S_n = k] = (n choose k) p^k (1 − p)^{n−k}

if p is the probability of heads. S_n is itself the sum of n pairwise independent copies of a Bernoulli random variable X, each with probability p. As a result, using the summation result above,

    Var(S_n) = n · Var(X) = np(1 − p),   and   σ_{S_n} = √(np(1 − p)).

Further, let Y = S_n/n be the average number of heads on n tosses of a coin. Then Var(Y) = Var(S_n)/n^2 = p(1 − p)/n.

3 Chebyshev Inequality

The notion of variance allows us to obtain a rough estimate of the probability that a random variable differs by a certain amount from its mean.
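The additivity formula is easy to verify against the binomial PMF directly. A sketch (plain Python; `binomial_variance` is our own name for the helper):

```python
from math import comb

def binomial_variance(n, p):
    """Exact Var(S_n), computed directly from the binomial PMF."""
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    mu = sum(k * q for k, q in enumerate(pmf))
    return sum((k - mu) ** 2 * q for k, q in enumerate(pmf))

# Additivity for independent summands predicts Var(S_n) = n p (1 - p).
print(binomial_variance(100, 0.5), 100 * 0.5 * 0.5)   # both 25, up to rounding
print(binomial_variance(20, 0.4), 20 * 0.4 * 0.6)     # both 4.8, up to rounding
```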
Before we state and prove this, we'll establish a simpler property called the Markov inequality: if X is a positive-valued random variable with expected value µ, and c > 0, then

    P[X ≥ c] ≤ E(X)/c.

Let's see why this is. We'll assume that X is a discrete random variable with finitely many outcomes 0 < c_1 < c_2 < · · · < c_n, with probabilities p_1, p_2, . . . , p_n, respectively. Let c_i be the smallest of these values that is greater than or equal to c. Then

    P[X ≥ c] = p_i + p_{i+1} + · · · + p_n
             = (1/c)(c p_i + c p_{i+1} + · · · + c p_n)
             ≤ (1/c)(c_i p_i + c_{i+1} p_{i+1} + · · · + c_n p_n)
             ≤ (1/c)(c_1 p_1 + c_2 p_2 + · · · + c_n p_n)
             = E(X)/c.

Essentially the same argument works for infinite discrete and continuous random variables.

The Markov inequality by itself does not provide a great deal of information, but we will use it to show something important. Let X be any random variable with expected value µ, and let ε be a positive number. (Think of ε as small.) When we apply the Markov inequality to the random variable (X − µ)^2, we get

    P[(X − µ)^2 ≥ ε^2] ≤ E((X − µ)^2)/ε^2 = Var(X)/ε^2.

Since the left-hand side is the same thing as P[|X − µ| ≥ ε], we can write this as

    P[|X − µ| ≥ ε] ≤ Var(X)/ε^2.

This is called Chebyshev's inequality. It provides a bound on how likely it is for a random variable to deviate a certain amount from its expected value.

Example. Let S_n be the number of heads on n tosses of a fair coin. Let's estimate the probability that the number of heads for n = 100 is between 40 and 60. Since E(S_100) = 50, we are asking for the complement of the probability that the number of heads is at least 61 or at most 39; in other words, that it differs by at least 11 from its mean. Since the variance of S_100 is 100/4 = 25, Chebyshev's inequality gives

    P[|S_100 − 50| ≥ 11] ≤ 25/11^2 ≈ 0.207,

so the probability of the number of heads being between 40 and 60 is at least 0.793. As we shall see later, the probability is actually much closer to 1 than this.
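To see just how loose the Chebyshev bound is here, one can compare it with the exact tail probability. A sketch using only the standard library:

```python
from math import comb

# Exact P[|S_100 - 50| >= 11] for a fair coin, versus the Chebyshev bound.
exact = sum(comb(100, k) for k in range(101) if abs(k - 50) >= 11) / 2**100
bound = 25 / 11**2
print(exact, bound)   # the exact tail is far below the ~0.207 bound
```

The exact tail probability is a few percent, so the bound 0.207, while valid, overstates it severalfold.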
Chebyshev's inequality only gives a rough upper bound, not a close approximation. The advantage is that it applies to absolutely any random variable. If we take ε = tσ_X, then we can write the inequality as

    P[|X − µ| ≥ tσ_X] ≤ 1/t^2.

For instance, the probability of being at least two standard deviations from the mean is at most 1/4.

4 Law of Large Numbers

The numerical outcome of an experiment is a random variable X. Let us suppose X has expected value µ and variance σ^2. If we perform n independent trials of the experiment, then the random variable A_n = (X_1 + · · · + X_n)/n, giving the average of the outcomes of these trials, still has expected value µ, but now the variance is σ^2/n. Thus by Chebyshev's inequality, for any ε > 0,

    P[|A_n − µ| > ε] ≤ σ^2/(nε^2).

What does this mean? Imagine that ε is rather small, say 0.01. Then the right-hand side is 10^4 σ^2/n. If we choose n to be large enough, then we can make this right-hand side as small as we like, which means that we can guarantee, with probability as large as we like, that the average A_n is within 0.01 of its expected value. Put more simply, we can get as close to the mean as we like (with probability as high as we like) by repeating the experiment often enough. (Of course, we cannot make the probability exactly 1 no matter how many times we repeat the experiment, nor can we guarantee that the average will be exactly equal to the mean.) This is the precise statement of what is often described colloquially as 'the law of averages'. In probability theory, it is called the Law of Large Numbers.

5 Normal Approximation to Binomial Distribution

Figure 2 shows the PMFs of the random variables S_n for n = 20, 60, 100, where S_n denotes the number of heads on n tosses of a coin with heads probability p = 0.4. These PMFs are given by the binomial probability distribution

    P[S_n = k] = (n choose k) p^k (1 − p)^{n−k}.
The three random variables of course all have different expected values (8, 24, and 40, respectively), so the PMFs are nonzero on different parts of the number line. By our previous calculations, the standard deviation of S_n grows proportionally to the square root of n. As a result, the PMFs get more spread out as n increases.

In Figure 3, S_n is replaced by S_n − np, so that all three random variables have expected value 0. In Figure 4, we further change this to the random variables

    (S_n − np)/√(np(1 − p)),

so that all three have variance 1. The apparent result is that all three graphs seem to have the same basic shape, but just differ in the vertical scale. In Figure 5, the vertical scale is adjusted so that all three have maximum value 1. All the points lie on the same smooth curve. What is this shape?

The smooth curve was drawn by plotting the graph of y = e^{−x^2/2}, and the crucial result illustrated by these pictures is that this shape closely approximates the binomial distribution. In other words, this famous 'bell curve' represents a continuous probability density that is a kind of limiting case of the binomial distributions as n grows large.

Let's be a little more precise about this. The function e^{−x^2/2} is not itself a probability density function, because the area under the curve is not 1, but it becomes a density function when we divide by the area under the curve. (This area is √(2π), a fact that is far from obvious.) The resulting function

    φ(x) = (1/√(2π)) e^{−x^2/2}

is called the standard normal density. 'Standard' here means that it has mean 0 and standard deviation 1. The corresponding cumulative distribution function is

    Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t^2/2} dt.

Since we cannot evaluate Φ(x) analytically, it has to be approximated numerically.
You can compute Φ(x) in Python to high accuracy using the closely related built-in function math.erf, as

    0.5 + 0.5*math.erf(x/math.sqrt(2))

Our observations above illustrate an important fact: the binomial distribution, adjusted to have mean 0 and standard deviation 1, is closely approximated by the normal distribution, especially as n gets larger. Here are a few examples.

Figure 2: PMFs of binomial distribution with n = 20, 60, 100 and p = 0.4
Figure 3: The same distributions shifted to all have mean 0...
Figure 4: ...and scaled horizontally to have standard deviation 1
Figure 5: The previous figure stretched vertically so that all three PMFs appear with the same height, superimposed on the graph of e^{−x^2/2}
Figure 6: The standard normal density φ(x): the shaded area is Φ(1), the probability that the standard normal random variable has value less than 1.
Figure 7: The cumulative normal distribution function Φ(x).

Example. Let us revisit the coin problem from the last section, this time estimating the probability that on one hundred tosses of a fair coin, the number of heads is between 45 and 55. Let X be the random variable representing the number of heads, so we are asking for P[45 ≤ X ≤ 55]. We make the same modification as above, subtracting the expected value 50 and dividing by the standard deviation √(100 · 1/4) = 5. Thus we are looking for

    P[−1 ≤ (X − 50)/5 ≤ 1].

Approximating this by the standard normal distribution suggests that this probability is about Φ(1) − Φ(−1) = 0.6827. The exact value, of course, is

    Σ_{j=45}^{55} (100 choose j) 2^{−100} = 0.72875,

so our approximation is not very impressive, accurate to only one decimal digit of precision. Part of the reason can be seen in the fact that the probability we are looking for is also equal to P[44 < X < 56]. One of the pitfalls of approximating a discrete distribution with a continuous one is that we don't necessarily know exactly where we should draw the lines between values of the random variable.
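The numbers in this example are easy to reproduce with the erf formula above. A sketch comparing the exact binomial sum with the normal estimate using the endpoints 45 and 55, and with the endpoints shifted outward by 1/2:

```python
import math
from math import comb

def Phi(x):
    """Standard normal CDF, computed from the error function."""
    return 0.5 + 0.5 * math.erf(x / math.sqrt(2))

# Exact P[45 <= X <= 55] for 100 fair-coin tosses, versus the normal
# estimate with endpoints 45, 55 and with endpoints shifted to 44.5, 55.5.
exact = sum(comb(100, j) for j in range(45, 56)) / 2**100
naive = Phi(1) - Phi(-1)
shifted = Phi(1.1) - Phi(-1.1)
print(exact, naive, shifted)
```

The shifted endpoints give a far better match to the exact value, which is the point of the half-integer trick discussed next.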
It turns out that what works well in this situation is to use values for the continuous distribution that are halfway between the relevant values for the discrete distribution: in this case, that means we should view the problem as one of calculating P[44.5 < X < 55.5]. This gives an estimate of Φ(1.1) − Φ(−1.1) = 0.72867, accurate to four decimal digits of precision.

6 Central Limit Theorem

The last section illustrated the fact that the sum of independent identically distributed Bernoulli random variables is approximately normally distributed. This is an instance of a much more general phenomenon: every random variable has this property!

To be more precise, let X be a random variable with µ = E(X) and σ^2 = Var(X) defined. Let X_1, . . . , X_n be pairwise independent random variables, each with the same distribution as X. Think of this as making n independent repetitions of an experiment whose outcome is modeled by the random variable X. Our claim is that the sum of the X_i is approximately normally distributed. Again we adjust the mean and standard deviation to be 0 and 1; then the precise statement is

    lim_{n→∞} P[a < (X_1 + · · · + X_n − nµ)/(σ√n) < b] = Φ(b) − Φ(a).

This is called the Central Limit Theorem. Earlier we saw, with the Law of Large Numbers, that the deviation of the average of n independent identical random variables from its mean approaches 0 as n grows larger. The Central Limit Theorem says more: it tells us how that deviation is distributed.

Example. Let's look at an experiment that was the subject of a question on the midterm: spin two spinners, each giving a value uniformly distributed between 0 and 1, and let X be the larger of the two values. We saw that the cdf of X was given by y = x^2 between 0 and 1, and thus the pdf of X is y = 2x between 0 and 1, and 0 elsewhere. (See the posted exam solutions for the details.)
We can then compute

    µ = E(X) = ∫_0^1 x · 2x dx = 2/3,
    E(X^2) = ∫_0^1 x^2 · 2x dx = 1/2,

so

    σ^2 = Var(X) = 1/2 − (2/3)^2 = 1/18,

and σ = 1/√18.

Suppose we perform this experiment 100 times. How likely is it that the sum is greater than 65? The expected value of the sum is 66 2/3. Since the distribution of the sum is approximately normal, and thus symmetric about the mean, we should expect a probability greater than one-half. How likely is it that the sum is greater than 70? Here we should expect a probability less than one-half.

We apply the Central Limit Theorem to obtain an estimate. We first try to compute P[0 ≤ X_1 + · · · + X_100 < 65]; making our usual transformation with µ and σ (here σ√n = 10/√18), this is

    P[−66 2/3 / (10/√18) < (X_1 + · · · + X_100 − 66 2/3)/(10/√18) < −1 2/3 / (10/√18)].

The Central Limit Theorem says that this is approximately

    Φ(−1 2/3 / (10/√18)) − Φ(−66 2/3 / (10/√18)).

The right-hand term is a very tiny number, which we can treat as 0. (Alternatively, we could just as well have used −∞ in place of 0 in our computation.) So this gives the approximation

    Φ(−1 2/3 / (10/√18)) = 0.23975,

and thus the probability that the sum is greater than 65 is 1 − 0.23975 = 0.76025, which is greater than one-half, as we expected. If we replace 65 by 70, then an identical calculation gives

    1 − Φ(−3 1/3 / (10/√18)) = 1 − 0.92135 = 0.07865.

I simulated the experiment: 100 spinners spun twice each, with the maxima of the results summed. In 10,000 repetitions, I found the number of times the sum was greater than 65 to be 7594 (as compared to the predicted 7602), and the number of times the sum was greater than 70 to be 780 (as compared to the predicted 787).
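The simulation reported above is easy to reproduce. A sketch in plain Python (the seed and repetition count are arbitrary choices, so the counts will differ slightly from run to run and from the figures quoted above):

```python
import random

def simulate(reps=10_000, n=100, seed=2):
    """Fraction of repetitions in which the sum of n maxima of two
    uniform spins exceeds 65, and in which it exceeds 70."""
    rng = random.Random(seed)
    over65 = over70 = 0
    for _ in range(reps):
        total = sum(max(rng.random(), rng.random()) for _ in range(n))
        over65 += total > 65
        over70 += total > 70
    return over65 / reps, over70 / reps

p65, p70 = simulate()
print(p65, p70)   # near the CLT predictions 0.76025 and 0.07865
```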