Section 10.1 Sample Mean: Expected Value and Variance

Definition 10.1 Sample Mean
For iid random variables $X_1, \ldots, X_n$ with PDF $f_X(x)$, the sample mean of $X$ is the random variable
$$M_n(X) = \frac{X_1 + \cdots + X_n}{n}.$$

Sample Mean ≠ Expected Value
• The first thing to notice is that $M_n(X)$ is a function of the random variables $X_1, \ldots, X_n$ and is therefore a random variable itself.
• It is important to distinguish the sample mean, $M_n(X)$, from $E[X]$, which we sometimes refer to as the mean value of random variable $X$.
• While $M_n(X)$ is a random variable, $E[X]$ is a number.
• To avoid confusion when studying the sample mean, it is advisable to refer to $E[X]$ as the expected value of $X$, rather than the mean of $X$.
• The sample mean of $X$ and the expected value of $X$ are closely related.
• A major purpose of this chapter is to explore the fact that as $n$ increases without bound, $M_n(X)$ predictably approaches $E[X]$.
• In everyday conversation, this phenomenon is often called the law of averages.

Theorem 10.1
The sample mean $M_n(X)$ has expected value and variance
$$E[M_n(X)] = E[X], \qquad \mathrm{Var}[M_n(X)] = \frac{\mathrm{Var}[X]}{n}.$$

Proof: Theorem 10.1
From Definition 10.1, Theorem 9.1, and the fact that $E[X_i] = E[X]$ for all $i$,
$$E[M_n(X)] = \frac{1}{n}\left(E[X_1] + \cdots + E[X_n]\right) = \frac{1}{n}\left(E[X] + \cdots + E[X]\right) = E[X]. \tag{10.1}$$
Because $\mathrm{Var}[aY] = a^2\,\mathrm{Var}[Y]$ for any random variable $Y$ (Theorem 3.15), $\mathrm{Var}[M_n(X)] = \mathrm{Var}[X_1 + \cdots + X_n]/n^2$. Since the $X_i$ are iid, we can use Theorem 9.3 to show
$$\mathrm{Var}[X_1 + \cdots + X_n] = \mathrm{Var}[X_1] + \cdots + \mathrm{Var}[X_n] = n\,\mathrm{Var}[X].$$
Thus
$$\mathrm{Var}[M_n(X)] = \frac{n\,\mathrm{Var}[X]}{n^2} = \frac{\mathrm{Var}[X]}{n}. \tag{10.2}$$

Comment: Theorem 10.1
• Theorem 10.1 demonstrates that as $n$ increases without bound, the variance of $M_n(X)$ goes to zero.
• When we first met the variance, and its square root the standard deviation, we said that they indicate how far a random variable is likely to be from its expected value.
• Theorem 10.1 suggests that as $n$ approaches infinity, it becomes highly likely that $M_n(X)$ is arbitrarily close to its expected value, $E[X]$.
• In other words, the sample mean $M_n(X)$ converges to the expected value $E[X]$ as the number of samples $n$ goes to infinity.

Quiz 10.1
$X$ is the exponential (1) random variable; $M_n(X)$ is the sample mean of $n$ independent samples of $X$. How many samples $n$ are needed to guarantee that the variance of the sample mean $M_n(X)$ is no more than 0.01?

Quiz 10.1 Solution
An exponential random variable with expected value 1 also has variance 1. By Theorem 10.1, $M_n(X)$ has variance $\mathrm{Var}[M_n(X)] = 1/n$. Hence, we need $n = 100$ samples.

Section 10.2 Deviation of a Random Variable from the Expected Value

Theorem 10.2 Markov Inequality
For a random variable $X$ such that $P[X < 0] = 0$ and a constant $c$,
$$P\left[X \ge c^2\right] \le \frac{E[X]}{c^2}.$$

Proof: Theorem 10.2
Since $X$ is nonnegative, $f_X(x) = 0$ for $x < 0$ and
$$E[X] = \int_0^{c^2} x f_X(x)\,dx + \int_{c^2}^{\infty} x f_X(x)\,dx \ge \int_{c^2}^{\infty} x f_X(x)\,dx. \tag{10.3}$$
Since $x \ge c^2$ in the remaining integral,
$$E[X] \ge c^2 \int_{c^2}^{\infty} f_X(x)\,dx = c^2\,P\left[X \ge c^2\right]. \tag{10.4}$$

Example 10.1
Let $X$ represent the height (in feet) of a storm surge following a hurricane. If the expected height is $E[X] = 5.5$, then the Markov inequality states that an upper bound on the probability of a storm surge at least 11 feet high is
$$P[X \ge 11] \le 5.5/11 = 1/2. \tag{10.5}$$

Example 10.2
Suppose random variable $Y$ takes on the value $c^2$ with probability $p$ and the value 0 otherwise. In this case, $E[Y] = pc^2$, and the Markov inequality states
$$P\left[Y \ge c^2\right] \le E[Y]/c^2 = p. \tag{10.6}$$
Since $P[Y \ge c^2] = p$, we observe that the Markov inequality is in fact an equality in this instance.
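The looseness of the Markov bound in Example 10.1 is easy to check numerically. The short Matlab sketch below compares the bound with a simulated estimate of $P[X \ge 11]$ under an assumed exponential model with $E[X] = 5.5$; the exponential model (and the use of exprnd from the Statistics Toolbox) is purely an illustrative assumption, since the Markov inequality itself uses only the expected value.

% Illustrative check of the Markov bound of Example 10.1.
% The exponential model for the surge height X is a hypothetical choice;
% the bound P[X >= 11] <= E[X]/11 requires only E[X] = 5.5.
mu = 5.5;                           % expected surge height E[X]
c  = 11;                            % threshold of interest
x  = exprnd(mu, 1, 1e6);            % 10^6 simulated surge heights
p_sim    = mean(x >= c);            % empirical estimate of P[X >= 11]
p_markov = mu/c;                    % Markov upper bound = 1/2
fprintf('simulated P[X>=11] = %.4f, Markov bound = %.4f\n', p_sim, p_markov);
% Under this model P[X>=11] = exp(-2), about 0.135, well below the bound of 0.5.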
Theorem 10.3 Chebyshev Inequality
For an arbitrary random variable $Y$ and constant $c > 0$,
$$P\left[|Y - \mu_Y| \ge c\right] \le \frac{\mathrm{Var}[Y]}{c^2}.$$

Proof: Theorem 10.3
In the Markov inequality, Theorem 10.2, let $X = (Y - \mu_Y)^2$. The inequality states
$$P\left[X \ge c^2\right] = P\left[(Y - \mu_Y)^2 \ge c^2\right] \le \frac{E\left[(Y - \mu_Y)^2\right]}{c^2} = \frac{\mathrm{Var}[Y]}{c^2}. \tag{10.7}$$
The theorem follows from the fact that $\{(Y - \mu_Y)^2 \ge c^2\} = \{|Y - \mu_Y| \ge c\}$.

Example 10.3 Problem
If the height $X$ of a storm surge following a hurricane has expected value $E[X] = 5.5$ feet and standard deviation $\sigma_X = 1$ foot, use the Chebyshev inequality to find an upper bound on $P[X \ge 11]$.

Example 10.3 Solution
Since a height $X$ is nonnegative, the probability that $X \ge 11$ can be written as
$$P[X \ge 11] = P[X - \mu_X \ge 11 - \mu_X] = P[|X - \mu_X| \ge 5.5]. \tag{10.8}$$
Now we use the Chebyshev inequality to obtain
$$P[X \ge 11] = P[|X - \mu_X| \ge 5.5] \le \frac{\mathrm{Var}[X]}{(5.5)^2} = 0.033 \approx 1/30. \tag{10.9}$$
Although this bound is better than the Markov bound, it is also loose. $P[X \ge 11]$ is seven orders of magnitude lower than 1/30.

Theorem 10.4 Chernoff Bound
For an arbitrary random variable $X$ and a constant $c$,
$$P[X \ge c] \le \min_{s \ge 0} e^{-sc}\,\phi_X(s).$$

Proof: Theorem 10.4
In terms of the unit step function, $u(x)$, we observe that
$$P[X \ge c] = \int_c^{\infty} f_X(x)\,dx = \int_{-\infty}^{\infty} u(x - c)\,f_X(x)\,dx. \tag{10.10}$$
For all $s \ge 0$, $u(x - c) \le e^{s(x-c)}$. This implies
$$P[X \ge c] \le \int_{-\infty}^{\infty} e^{s(x-c)} f_X(x)\,dx = e^{-sc} \int_{-\infty}^{\infty} e^{sx} f_X(x)\,dx = e^{-sc}\,\phi_X(s). \tag{10.11}$$
This inequality is true for any $s \ge 0$. Hence the upper bound must hold when we choose $s$ to minimize $e^{-sc}\,\phi_X(s)$.

Example 10.4 Problem
If the probability model of the height $X$, measured in feet, of a storm surge following a hurricane at a certain location is the Gaussian (5.5, 1) random variable, use the Chernoff bound to find an upper bound on $P[X \ge 11]$.

Example 10.4 Solution
In Table 9.1 the MGF of $X$ is
$$\phi_X(s) = e^{(11s + s^2)/2}. \tag{10.12}$$
Thus the Chernoff bound is
$$P[X \ge 11] \le \min_{s \ge 0} e^{-11s} e^{(11s + s^2)/2} = \min_{s \ge 0} e^{(s^2 - 11s)/2}. \tag{10.13}$$
To find the minimizing $s$, it is sufficient to choose $s$ to minimize $h(s) = s^2 - 11s$. Setting the derivative $dh(s)/ds = 2s - 11 = 0$ yields $s = 5.5$. Applying $s = 5.5$ to the bound yields
$$P[X \ge 11] \le \left. e^{(s^2 - 11s)/2}\right|_{s = 5.5} = e^{-(5.5)^2/2} = 2.7 \times 10^{-7}. \tag{10.14}$$

Quiz 10.2
In a subway station, there are exactly enough customers on the platform to fill three trains. The arrival time of the $n$th train is $X_1 + \cdots + X_n$ where $X_1, X_2, \ldots$ are iid exponential random variables with $E[X_i] = 2$ minutes. Let $W$ equal the time required to serve the waiting customers. For $P[W > 20]$, the probability that $W$ is over twenty minutes,
(a) Use the central limit theorem to find an estimate.
(b) Use the Markov inequality to find an upper bound.
(c) Use the Chebyshev inequality to find an upper bound.
(d) Use the Chernoff bound to find an upper bound.
(e) Use Theorem 4.11 for an exact calculation.

Quiz 10.2 Solution
The train interarrival times $X_1, X_2, X_3$ are iid exponential ($\lambda$) random variables. The arrival time of the third train is
$$W = X_1 + X_2 + X_3. \tag{1}$$
In Theorem 9.9, we found that the sum of three iid exponential ($\lambda$) random variables is an Erlang ($n = 3, \lambda$) random variable. From Appendix A, we find that $W$ has expected value and variance
$$E[W] = 3/\lambda = 6, \tag{2}$$
$$\mathrm{Var}[W] = 3/\lambda^2 = 12. \tag{3}$$
(a) By the Central Limit Theorem,
$$P[W > 20] = P\left[\frac{W - 6}{\sqrt{12}} > \frac{20 - 6}{\sqrt{12}}\right] \approx Q\!\left(\frac{7}{\sqrt{3}}\right) = 2.66 \times 10^{-5}.$$
(b) From the Markov inequality, we know that
$$P[W > 20] \le \frac{E[W]}{20} = \frac{6}{20} = 0.3. \tag{4}$$
Quiz 10.2 Solution (Continued 2)
(c) To use the Chebyshev inequality, we observe that $E[W] = 6$ and $W$ nonnegative imply
$$P\left[|W - E[W]| \ge 14\right] = P[W - 6 \ge 14] + \underbrace{P[W - 6 \le -14]}_{=0} = P[W \ge 20]. \tag{5}$$
Thus
$$P[W \ge 20] = P\left[|W - E[W]| \ge 14\right] \le \frac{\mathrm{Var}[W]}{14^2} = \frac{3}{49} = 0.061. \tag{6}$$
(d) For the Chernoff bound, we note that the MGF of $W$ is
$$\phi_W(s) = \left(\frac{\lambda}{\lambda - s}\right)^3 = \frac{1}{(1 - 2s)^3}. \tag{7}$$
The Chernoff bound states that
$$P[W > 20] \le \min_{s \ge 0} e^{-20s}\,\phi_W(s) = \min_{s \ge 0} \frac{e^{-20s}}{(1 - 2s)^3}. \tag{8}$$

Quiz 10.2 Solution (Continued 3)
To minimize $h(s) = e^{-20s}/(1 - 2s)^3$, we set the derivative of $h(s)$ to zero:
$$\frac{dh(s)}{ds} = \frac{e^{-20s}(40s - 14)}{(1 - 2s)^4} = 0. \tag{9}$$
This implies $s = 7/20$. Applying $s = 7/20$ to the Chernoff bound yields
$$P[W > 20] \le \left. \frac{e^{-20s}}{(1 - 2s)^3} \right|_{s = 7/20} = 0.0338. \tag{10}$$
(e) Theorem 4.11 says that for any $w > 0$, the CDF of the Erlang ($3, \lambda$) random variable $W$ satisfies
$$F_W(w) = 1 - \sum_{k=0}^{2} \frac{(\lambda w)^k e^{-\lambda w}}{k!}. \tag{11}$$
Equivalently, for $\lambda = 1/2$ and $w = 20$,
$$P[W > 20] = 1 - F_W(20) = e^{-10}\left(1 + \frac{10}{1!} + \frac{10^2}{2!}\right) = 61\,e^{-10} = 0.0028. \tag{12}$$
Although the Chernoff bound is weak in that it overestimates the probability by a factor of 12, it is a valid bound. By contrast, the Central Limit Theorem approximation grossly underestimates the true probability.

Section 10.3 Laws of Large Numbers

Theorem 10.5 Weak Law of Large Numbers (Finite Samples)
For any constant $c > 0$,
(a) $P[|M_n(X) - \mu_X| \ge c] \le \dfrac{\mathrm{Var}[X]}{nc^2}$,
(b) $P[|M_n(X) - \mu_X| < c] \ge 1 - \dfrac{\mathrm{Var}[X]}{nc^2}$.

Proof: Theorem 10.5
Let $Y = M_n(X)$. Theorem 10.1 states that
$$E[Y] = E[M_n(X)] = \mu_X, \qquad \mathrm{Var}[Y] = \mathrm{Var}[M_n(X)] = \mathrm{Var}[X]/n. \tag{10.15}$$
Theorem 10.5(a) follows by applying the Chebyshev inequality (Theorem 10.3) to $Y = M_n(X)$. Theorem 10.5(b) is just a restatement of Theorem 10.5(a), since
$$P[|M_n(X) - \mu_X| \ge c] = 1 - P[|M_n(X) - \mu_X| < c]. \tag{10.16}$$

Theorem 10.6 Weak Law of Large Numbers (Infinite Samples)
If $X$ has finite variance, then for any constant $c > 0$,
(a) $\lim_{n \to \infty} P[|M_n(X) - \mu_X| \ge c] = 0$,
(b) $\lim_{n \to \infty} P[|M_n(X) - \mu_X| < c] = 1$.

The Weak Law
• In words, Theorem 10.6(b) says that the probability that the sample mean is within $\pm c$ units of $E[X]$ goes to one as the number of samples approaches infinity.
• Since $c$ can be arbitrarily small (e.g., $10^{-2000}$), both Theorem 10.5(a) and Theorem 10.6(b) can be interpreted as saying that the sample mean converges to $E[X]$ as the number of samples increases without bound.
• The weak law of large numbers is a very general result because it holds for all random variables $X$ with finite variance.
• Moreover, we do not need to know any of the parameters, such as the expected value or variance, of random variable $X$.

Definition 10.2 Convergence in Probability
The random sequence $Y_n$ converges in probability to a constant $y$ if for any $\epsilon > 0$,
$$\lim_{n \to \infty} P[|Y_n - y| \ge \epsilon] = 0.$$

Theorem 10.7
As $n \to \infty$, the relative frequency $\hat{P}_n(A)$ converges to $P[A]$; for any constant $c > 0$,
$$\lim_{n \to \infty} P\left[\left|\hat{P}_n(A) - P[A]\right| \ge c\right] = 0.$$

Proof: Theorem 10.7
The proof follows from Theorem 10.6 since $\hat{P}_n(A) = M_n(X_A)$ is the sample mean of the indicator $X_A$, which has expected value $E[X_A] = P[A]$ and variance $\mathrm{Var}[X_A] = P[A](1 - P[A])$.
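Theorem 10.7 is easy to watch in simulation. The following Matlab sketch (not from the text) estimates a hypothetical event probability $P[A] = 0.25$ by relative frequency for increasing $n$ and prints the Theorem 10.5(a) bound alongside it; the value of $P[A]$, the tolerance $c$, and the sample sizes are arbitrary choices for illustration.

% Illustrative sketch of Theorem 10.7: the relative frequency Phat_n(A)
% converges in probability to P[A]. P[A] = 0.25, c = 0.01, and the sample
% sizes are hypothetical choices.
pA = 0.25; c = 0.01;
for n = [1e2 1e4 1e6]
    xA    = (rand(1,n) < pA);          % indicator X_A for n iid trials
    Phat  = mean(xA);                  % relative frequency Phat_n(A)
    bound = pA*(1-pA)/(n*c^2);         % Theorem 10.5(a) bound on P[|Phat - P[A]| >= c]
    fprintf('n = %7d: Phat = %.4f, Chebyshev bound = %.4f\n', n, Phat, min(bound,1));
end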
Quiz 10.3
$X_1, \ldots, X_n$ are $n$ iid samples of the Bernoulli ($p = 0.8$) random variable $X$.
(a) Find $E[X]$ and $\mathrm{Var}[X]$.
(b) What is $\mathrm{Var}[M_{100}(X)]$?
(c) Use Theorem 10.5 to find $\alpha$ such that $P[|M_{100}(X) - p| \ge 0.05] \le \alpha$.
(d) How many samples $n$ are needed to guarantee $P[|M_n(X) - p| \ge 0.1] \le 0.05$?

Quiz 10.3 Solution
(a) Since $X$ is a Bernoulli random variable with parameter $p = 0.8$, we can look up in Appendix A to find that $E[X] = p = 0.8$ and variance
$$\mathrm{Var}[X] = p(1 - p) = (0.8)(0.2) = 0.16. \tag{1}$$
(b) By Theorem 10.1,
$$\mathrm{Var}[M_{100}(X)] = \frac{\mathrm{Var}[X]}{100} = 0.0016. \tag{2}$$
(c) Theorem 10.5 uses the Chebyshev inequality to show that the sample mean satisfies
$$P\left[|M_n(X) - E[X]| \ge c\right] \le \frac{\mathrm{Var}[X]}{nc^2}. \tag{3}$$
Note that $E[X] = P_X(1) = p$. To meet the specified requirement, we choose $c = 0.05$ and $n = 100$. Since $\mathrm{Var}[X] = 0.16$, we must have
$$\frac{0.16}{100(0.05)^2} = \alpha. \tag{4}$$
This reduces to $\alpha = 16/25 = 0.64$.
(d) Again we use Equation (3). To meet the specified requirement, we choose $c = 0.1$. Since $\mathrm{Var}[X] = 0.16$, we must have
$$\frac{0.16}{n(0.1)^2} \le 0.05. \tag{5}$$
The smallest value that meets the requirement is $n = 320$.

Section 10.4 Point Estimates of Model Parameters

Model Parameters
• The general problem is estimation of a parameter of a probability model.
• A parameter is any number that can be calculated from the probability model.
• For example, for an arbitrary event $A$, $P[A]$ is a model parameter.

Estimates of Model Parameters
• Consider an experiment that produces observations of sample values of the random variable $X$.
• The observations are sample values of the random variables $X_1, X_2, \ldots$, all with the same probability model as $X$.
• Assume that $r$ is a parameter of the probability model.
• We use the observations $X_1, X_2, \ldots$ to produce a sequence of estimates of $r$.
• The estimates $\hat{R}_1, \hat{R}_2, \ldots$ are all random variables.
• $\hat{R}_1$ is a function of $X_1$.
• $\hat{R}_2$ is a function of $X_1$ and $X_2$, and in general $\hat{R}_n$ is a function of $X_1, X_2, \ldots, X_n$.

Definition 10.3 Consistent Estimator
The sequence of estimates $\hat{R}_1, \hat{R}_2, \ldots$ of parameter $r$ is consistent if for any $\epsilon > 0$,
$$\lim_{n \to \infty} P\left[\left|\hat{R}_n - r\right| \ge \epsilon\right] = 0.$$

Definition 10.4 Unbiased Estimator
An estimate, $\hat{R}$, of parameter $r$ is unbiased if $E[\hat{R}] = r$; otherwise, $\hat{R}$ is biased.

Definition 10.5 Asymptotically Unbiased Estimator
The sequence of estimators $\hat{R}_n$ of parameter $r$ is asymptotically unbiased if
$$\lim_{n \to \infty} E[\hat{R}_n] = r.$$

Definition 10.6 Mean Square Error
The mean square error of estimator $\hat{R}$ of parameter $r$ is
$$e = E\left[(\hat{R} - r)^2\right].$$

Theorem 10.8
If a sequence of unbiased estimates $\hat{R}_1, \hat{R}_2, \ldots$ of parameter $r$ has mean square error $e_n = \mathrm{Var}[\hat{R}_n]$ satisfying $\lim_{n \to \infty} e_n = 0$, then the sequence $\hat{R}_n$ is consistent.

Proof: Theorem 10.8
Since $E[\hat{R}_n] = r$, we apply the Chebyshev inequality to $\hat{R}_n$. For any constant $\epsilon > 0$,
$$P\left[\left|\hat{R}_n - r\right| \ge \epsilon\right] \le \frac{\mathrm{Var}[\hat{R}_n]}{\epsilon^2}. \tag{10.20}$$
In the limit of large $n$, we have
$$\lim_{n \to \infty} P\left[\left|\hat{R}_n - r\right| \ge \epsilon\right] \le \lim_{n \to \infty} \frac{\mathrm{Var}[\hat{R}_n]}{\epsilon^2} = 0. \tag{10.21}$$

Example 10.5 Problem
In any interval of $k$ seconds, the number $N_k$ of packets passing through an Internet router is a Poisson random variable with expected value $E[N_k] = kr$ packets. Let $\hat{R}_k = N_k/k$ denote an estimate of the parameter $r$ packets/second. Is each estimate $\hat{R}_k$ an unbiased estimate of $r$? What is the mean square error $e_k$ of the estimate $\hat{R}_k$? Is the sequence of estimates $\hat{R}_1, \hat{R}_2, \ldots$ consistent?

Example 10.5 Solution
First, we observe that $\hat{R}_k$ is an unbiased estimator since
$$E[\hat{R}_k] = E[N_k/k] = E[N_k]/k = r. \tag{10.22}$$
Next, we recall that since $N_k$ is Poisson, $\mathrm{Var}[N_k] = kr$. This implies
$$\mathrm{Var}[\hat{R}_k] = \mathrm{Var}\left[\frac{N_k}{k}\right] = \frac{\mathrm{Var}[N_k]}{k^2} = \frac{r}{k}. \tag{10.23}$$
Because $\hat{R}_k$ is unbiased, the mean square error of the estimate is the same as its variance: $e_k = r/k$. In addition, since $\lim_{k \to \infty} \mathrm{Var}[\hat{R}_k] = 0$, the sequence of estimators $\hat{R}_k$ is consistent by Theorem 10.8.

Theorem 10.9
The sample mean $M_n(X)$ is an unbiased estimate of $E[X]$.
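The estimator of Example 10.5 can be checked in a few lines of Matlab. The sketch below (not from the text) draws a single Poisson count $N_k$ for increasingly long intervals and reports $\hat{R}_k = N_k/k$ together with the mean square error $e_k = r/k$ from Equation (10.23); the rate $r = 3$ packets/second and the interval lengths are hypothetical choices, and poissrnd is from the Statistics Toolbox.

% Illustrative sketch for Example 10.5: Rhat_k = N_k/k is an unbiased and
% consistent estimate of the packet rate r. The rate r = 3 and the interval
% lengths k are hypothetical choices.
r = 3;
for k = [10 100 1000 10000]
    Nk   = poissrnd(k*r);     % Poisson count with E[N_k] = k*r
    Rhat = Nk/k;              % estimate of r from one observation interval
    ek   = r/k;               % mean square error, Equation (10.23)
    fprintf('k = %5d: Rhat = %.3f, e_k = r/k = %.4f\n', k, Rhat, ek);
end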
Theorem 10.10
The sample mean estimator $M_n(X)$ has mean square error
$$e_n = E\left[(M_n(X) - E[X])^2\right] = \mathrm{Var}[M_n(X)] = \frac{\mathrm{Var}[X]}{n}.$$

Standard Error
• In the terminology of statistical inference, $\sqrt{e_n}$, the standard deviation of the sample mean, is referred to as the standard error of the estimate.
• The standard error gives an indication of how far we should expect the sample mean to deviate from the expected value.
• In particular, when $X$ is a Gaussian random variable (and $M_n(X)$ is also Gaussian), Problem 10.4.1 asks you to show that
$$P\left[E[X] - \sqrt{e_n} \le M_n(X) \le E[X] + \sqrt{e_n}\right] = 2\Phi(1) - 1 \approx 0.68. \tag{10.24}$$
In words, Equation (10.24) says there is roughly a two-thirds probability that the sample mean is within one standard error of the expected value.
• This same conclusion is approximately true when $n$ is large and the central limit theorem says that $M_n(X)$ is approximately Gaussian.

Example 10.6 Problem
How many independent trials $n$ are needed to guarantee that $\hat{P}_n(A)$, the relative frequency estimate of $P[A]$, has standard error $\le 0.1$?

Example 10.6 Solution
Since the indicator $X_A$ has variance $\mathrm{Var}[X_A] = P[A](1 - P[A])$, Theorem 10.10 implies that the mean square error of $M_n(X_A)$ is
$$e_n = \frac{\mathrm{Var}[X_A]}{n} = \frac{P[A](1 - P[A])}{n}. \tag{10.25}$$
We need to choose $n$ large enough to guarantee $\sqrt{e_n} \le 0.1$ (that is, $e_n \le 0.01$) even though we don't know $P[A]$. We use the fact that $p(1 - p) \le 0.25$ for all $0 \le p \le 1$. Thus, $e_n \le 0.25/n$. To guarantee $e_n \le 0.01$, we choose $n = 0.25/0.01 = 25$ trials.

Theorem 10.11
If $X$ has finite variance, then the sample mean $M_n(X)$ is a sequence of consistent estimates of $E[X]$.

Proof: Theorem 10.11
By Theorem 10.10, the mean square error of $M_n(X)$ satisfies
$$\lim_{n \to \infty} \mathrm{Var}[M_n(X)] = \lim_{n \to \infty} \frac{\mathrm{Var}[X]}{n} = 0. \tag{10.26}$$
By Theorem 10.8, the sequence $M_n(X)$ is consistent.

Estimating the Variance
• When $E[X]$ is a known quantity $\mu_X$, we know $\mathrm{Var}[X] = E[(X - \mu_X)^2]$.
• In this case, we can use the sample mean of $W = (X - \mu_X)^2$ to estimate $\mathrm{Var}[X]$:
$$M_n(W) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_X)^2. \tag{10.28}$$
If $\mathrm{Var}[W]$ exists, $M_n(W)$ is a consistent, unbiased estimate of $\mathrm{Var}[X]$.
• When the expected value $\mu_X$ is unknown, the situation is more complicated because the variance of $X$ depends on $\mu_X$.
• We cannot use Equation (10.28) if $\mu_X$ is unknown.
• In this case, we replace the expected value $\mu_X$ by the sample mean $M_n(X)$.

Definition 10.7 Sample Variance
The sample variance of $n$ independent observations of random variable $X$ is
$$V_n(X) = \frac{1}{n} \sum_{i=1}^{n} \left(X_i - M_n(X)\right)^2.$$

Theorem 10.12
$$E[V_n(X)] = \frac{n-1}{n}\,\mathrm{Var}[X].$$

Proof: Theorem 10.12
Substituting Definition 10.1 of the sample mean $M_n(X)$ into Definition 10.7 of sample variance and expanding the sums, we derive
$$V_n = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} X_i X_j. \tag{10.29}$$
Because the $X_i$ are iid, $E[X_i^2] = E[X^2]$ for all $i$, and $E[X_i]\,E[X_j] = \mu_X^2$. By Theorem 5.16(a),
$$E[X_i X_j] = \mathrm{Cov}[X_i, X_j] + E[X_i]\,E[X_j] = \mathrm{Cov}[X_i, X_j] + \mu_X^2. \tag{10.30}$$
Combining these facts, the expected value of $V_n$ in Equation (10.29) is
$$E[V_n] = E[X^2] - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left(\mathrm{Cov}[X_i, X_j] + \mu_X^2\right) = \mathrm{Var}[X] - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{Cov}[X_i, X_j]. \tag{10.31}$$
Since the double sum has $n^2$ terms, $\sum_{i=1}^{n} \sum_{j=1}^{n} \mu_X^2 = n^2 \mu_X^2$. Of the $n^2$ covariance terms, there are $n$ terms of the form $\mathrm{Cov}[X_i, X_i] = \mathrm{Var}[X]$, while the remaining covariance terms are all 0 because $X_i$ and $X_j$ are independent for $i \ne j$. This implies
$$E[V_n] = \mathrm{Var}[X] - \frac{1}{n^2}\left(n\,\mathrm{Var}[X]\right) = \frac{n-1}{n}\,\mathrm{Var}[X]. \tag{10.32}$$
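The bias factor $(n-1)/n$ in Theorem 10.12 is easy to observe numerically. The following Matlab sketch (not from the text) averages $V_n(X)$ over many simulated data sets drawn from a hypothetical uniform $(0,1)$ model with $\mathrm{Var}[X] = 1/12$; the model, the choice $n = 10$, and the number of repetitions are arbitrary.

% Illustrative check of Theorem 10.12: E[V_n(X)] = ((n-1)/n) Var[X].
% The uniform (0,1) model (Var[X] = 1/12), n = 10, and 10^5 repetitions
% are hypothetical choices for demonstration.
n = 10; trials = 1e5;
X  = rand(n, trials);                              % each column is one set of n samples
Vn = mean((X - repmat(mean(X,1), n, 1)).^2, 1);    % biased sample variance of each column
fprintf('average V_n      = %.5f\n', mean(Vn));
fprintf('(n-1)/n * Var[X] = %.5f\n', (n-1)/n * (1/12));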
Theorem 10.13
The estimate
$$V_n'(X) = \frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - M_n(X)\right)^2$$
is an unbiased estimate of $\mathrm{Var}[X]$.

Proof: Theorem 10.13
Using Definition 10.7, we have
$$V_n'(X) = \frac{n}{n-1}\,V_n(X), \tag{10.34}$$
and
$$E[V_n'(X)] = \frac{n}{n-1}\,E[V_n(X)] = \mathrm{Var}[X]. \tag{10.35}$$

Quiz 10.4
$X$ is the continuous uniform $(-1, 1)$ random variable. Find the mean square error, $E[(\mathrm{Var}[X] - V_{100}(X))^2]$, of the sample variance estimate of $\mathrm{Var}[X]$, based on 100 independent observations of $X$.

Quiz 10.4 Solution
Define the random variable $W = (X - \mu_X)^2$. Observe that $V_{100}(X) = M_{100}(W)$. By Theorem 10.10, the mean square error is
$$E\left[(M_{100}(W) - \mu_W)^2\right] = \frac{\mathrm{Var}[W]}{100}. \tag{1}$$
Observe that $\mu_X = 0$ so that $W = X^2$. Thus,
$$\mu_W = E[X^2] = \int_{-1}^{1} x^2 f_X(x)\,dx = 1/3, \tag{2}$$
$$E[W^2] = E[X^4] = \int_{-1}^{1} x^4 f_X(x)\,dx = 1/5. \tag{3}$$
Therefore $\mathrm{Var}[W] = E[W^2] - \mu_W^2 = 1/5 - (1/3)^2 = 4/45$, and the mean square error is $4/4500 = 0.0009$.

Section 10.5 Matlab

Sample Mean Traces
• In particular, for a random variable $X$, we can view a set of iid samples $X_1, \ldots, X_n$ as a random vector $\mathbf{X} = \begin{bmatrix} X_1 & \cdots & X_n \end{bmatrix}'$.
• This vector of iid samples yields a vector of sample mean values $\mathbf{M}(X) = \begin{bmatrix} M_1(X) & M_2(X) & \cdots & M_n(X) \end{bmatrix}'$ where
$$M_k(X) = \frac{X_1 + \cdots + X_k}{k}. \tag{10.36}$$
• We call a graph of the sequence $M_k(X)$ versus $k$ a sample mean trace.
• By graphing the sample mean trace as a function of $n$ we can observe the convergence of the point estimate $M_k(X)$ to $E[X]$.

Example 10.7 Problem
Write a Matlab function bernoullitraces(n,m,p) to generate m sample mean traces, each of length n, for the sample mean of a Bernoulli ($p$) random variable.

Example 10.7 Solution

function MN=bernoullitraces(n,m,p);
x=reshape(bernoullirv(p,m*n),n,m);     % each column: n iid Bernoulli (p) samples
nn=(1:n)'*ones(1,m);                   % nn(k,j) = k, the number of trials so far
MN=cumsum(x)./nn;                      % each column: sample mean trace M_1(X),...,M_n(X)
stderr=sqrt(p*(1-p))./sqrt((1:n)');    % standard error sqrt(e_k) at each step k
plot(1:n,p+stderr,1:n,p-stderr,1:n,MN);

In bernoullitraces, each column of x is an instance of a random vector $\mathbf{X}$ with iid Bernoulli ($p$) components. Similarly, each column of MN is an instance of the vector $\mathbf{M}(X)$. The output graphs each column of MN as a function of the number of trials $n$. In addition, we calculate the standard error $\sqrt{e_k}$ and overlay graphs of $p - \sqrt{e_k}$ and $p + \sqrt{e_k}$. Equation (10.24) says that at each step $k$, we should expect to see roughly two-thirds of the sample mean traces in the range
$$p - \sqrt{e_k} \le M_k(X) \le p + \sqrt{e_k}. \tag{10.37}$$
A sample graph of bernoullitraces(50,40,0.5) is shown in Figure 10.1. The figure shows how at any given step, approximately two thirds of the sample mean traces are within one standard error of the expected value.

Figure 10.1
[Figure: sample mean traces $M_n(X)$ (vertical axis, 0 to 1) versus the number of trials $n$ (horizontal axis, 0 to 50).] Sample output of bernoullitraces.m, including the deterministic standard error graphs. The graph shows how at any given step, about two thirds of the sample means are within one standard error of the true mean.

Quiz 10.5
Generate m = 1000 traces (each of length n = 100) of the sample mean of a Bernoulli ($p$) random variable. At each step $k$, calculate the number of traces whose sample mean $M_k(X)$ is within one standard error of the expected value $p$, and graph $T_k$, this number divided by $m$, as a function of $k$. Explain your results.

Quiz 10.5 Solution
Following the bernoullitraces.m approach, we generate m = 1000 sample paths, each sample path consisting of n = 100 Bernoulli trials. At step $k$, OK(k) is the fraction of sample paths whose sample mean is within one standard error of $p$. The program bernoullisample.m graphs this fraction of traces as a function of the time, i.e., the number of trials in each trace.
function OK=bernoullisample(n,m,p);
x=reshape(bernoullirv(p,m*n),n,m);     % each column: n iid Bernoulli (p) samples
nn=(1:n)'*ones(1,m);
MN=cumsum(x)./nn;                      % sample mean traces, one per column
stderr=sqrt(p*(1-p))./sqrt((1:n)');    % standard error sqrt(e_k) at each step k
stderrmat=stderr*ones(1,m);
OK=sum(abs(MN-p)<stderrmat,2)/m;       % fraction of traces within one standard error
plot(1:n,OK);

Quiz 10.5 Solution (Continued 2)
The following graph was generated by bernoullisample(50,5000,0.5):
[Figure: the fraction OK(k) of traces within one standard error (vertical axis, 0 to 1) versus the number of trials k (horizontal axis, 0 to 50).]
As we would expect, as m gets large, the fraction of traces within one standard error approaches $2\Phi(1) - 1 \approx 0.68$. The unusual sawtooth pattern, though perhaps unexpected, is examined in Problem 10.5.1.
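For reference, the 0.68 level quoted above is just Equation (10.24) applied at step $k$: treating $M_k(X)$ as (approximately) Gaussian with expected value $p$ and standard deviation equal to the standard error $\sqrt{e_k}$,
$$P\left[\,|M_k(X) - p| \le \sqrt{e_k}\,\right] \approx \Phi(1) - \Phi(-1) = 2\Phi(1) - 1 \approx 0.6827,$$
which is the value the simulated fraction settles around when $m$ is large.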