STATISTICAL LABORATORY, April 27th, 2011
CENTRAL LIMIT THEOREM ILLUSTRATION
Mario Romanazzi

1 BINOMIAL DISTRIBUTION

The binomial distribution Bi(n, p), being the sum of n independent Bernoulli distributions, falls within the framework of the CLT. When the parameter n (sample size) is large, it can be approximated by a normal distribution with mean µ = np and standard deviation σ = (np(1 − p))^(1/2). If p = 1/2, the binomial distribution is symmetric and the rate of convergence is high: the approximation is already good for n equal to 30-40. The result is illustrated below with n = 50 and p = 1/2.

n <- 50
p <- 1/2
dmax <- max(dbinom(0:n, n, p))
plot(0:n, dbinom(0:n, n, p), type = "h", lwd = 5, ylim = c(0, dmax),
     xlab = "x", ylab = "PDF", main = "Binomial Distribution Bi(n=50, p=1/2)")
plot(function(x) dnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p))),
     0, n, lwd = 2, col = "red", add = TRUE)
plot(0:n, pbinom(0:n, n, p), type = "s", lwd = 2, xlab = "x",
     ylab = "CDF", main = "Binomial Distribution Bi(n=50, p=1/2)")
plot(function(x) pnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p))),
     0, n, lwd = 2, col = "red", add = TRUE)

[Figure: PDF and CDF of Bi(n=50, p=1/2), with the normal approximation overlaid in red.]

2 POISSON DISTRIBUTION

The Poisson distribution, when the parameter λ is large, can be approximated by a normal distribution with mean µ = λ and standard deviation σ = √λ. The result is not surprising if we recall the relationship of the Poisson distribution with the binomial. The illustration below assumes λ = 20.
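As a numerical complement to the plots, one can measure directly how close the normal approximation is to the Poisson CDF. A minimal sketch, using the same λ = 20:

```r
# Maximum absolute distance between the Poisson CDF and its
# normal approximation, over a grid covering the bulk of the mass.
lambda <- 20
x <- 0:60
err <- max(abs(ppois(x, lambda) - pnorm(x, mean = lambda, sd = sqrt(lambda))))
err
```

The discrepancy is largest near the mean, where the step function of the discrete CDF crosses the smooth normal curve.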
lambda <- 20
a1 <- round(lambda - 3.5 * sqrt(lambda))
a2 <- round(lambda + 3.5 * sqrt(lambda))
dmax <- max(dpois(a1:a2, lambda))
plot(a1:a2, dpois(a1:a2, lambda), type = "h", lwd = 5, ylim = c(0, dmax),
     xlab = "x", ylab = "PDF", main = "Poisson Distribution (lambda = 20)")
plot(function(x) dnorm(x, mean = lambda, sd = sqrt(lambda)),
     a1, a2, lwd = 2, col = "red", add = TRUE)
plot(a1:a2, ppois(a1:a2, lambda), type = "s", lwd = 2, xlab = "x",
     ylab = "CDF", main = "Poisson Distribution (lambda = 20)")
plot(function(x) pnorm(x, mean = lambda, sd = sqrt(lambda)),
     a1, a2, lwd = 2, col = "red", add = TRUE)

[Figure: PDF and CDF of the Poisson distribution (λ = 20), with the normal approximation overlaid in red.]

3 EXPONENTIAL DISTRIBUTION

The sum of n independent exponential distributions with parameter λ, when n is large, can be approximated by a normal distribution with mean µ = n/λ and standard deviation σ = √n/λ. This is easily checked because the sum of n independent exponential distributions is a gamma random variable. The illustration below assumes λ = 1 and n = 40.
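Since the exact distribution of the sum is the gamma with shape n and rate λ, the quality of the approximation can also be measured directly. A minimal sketch, with the same λ = 1 and n = 40:

```r
# Maximum absolute distance between the exact gamma CDF of the sum
# and its normal approximation, evaluated on a fine grid.
lambda <- 1
n <- 40
x <- seq(0, 100, by = 0.1)
err <- max(abs(pgamma(x, shape = n, rate = lambda) -
               pnorm(x, mean = n/lambda, sd = sqrt(n)/lambda)))
err
```

The residual error reflects the skewness of the gamma distribution, which vanishes only as n grows.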
lambda <- 1
n <- 40
mn <- n/lambda
sn <- sqrt(n)/lambda
a1 <- mn - 4.5 * sn
a2 <- mn + 4.5 * sn
plot(function(x) dgamma(x, shape = n, rate = lambda), a1, a2, lwd = 2,
     xlab = "x", ylab = "PDF", main = "Sum of 40 Exponential Distributions (lambda = 1)")
plot(function(x) dnorm(x, mean = mn, sd = sn), a1, a2, lty = "dashed",
     lwd = 2, col = "red", add = TRUE)
plot(function(x) pgamma(x, shape = n, rate = lambda), a1, a2, lwd = 2,
     xlab = "x", ylab = "CDF", main = "Sum of 40 Exponential Distributions (lambda = 1)")
plot(function(x) pnorm(x, mean = mn, sd = sn), a1, a2, lty = "dashed",
     lwd = 2, col = "red", add = TRUE)

[Figure: PDF and CDF of the sum of 40 Exp(1) distributions (a gamma distribution), with the normal approximation dashed in red.]

4 UNIFORM DISTRIBUTION

The sum of n independent uniform distributions R(a, b), when n is large, can be approximated by a normal distribution with mean µ = n(a + b)/2 and standard deviation σ = √(n/12) (b − a). A fairly good approximation is already apparent with n = 12. The illustration below deals with the R(0, 1) distribution and n = 12. Since neither the pdf nor the cdf of the sum of (independent) uniform distributions admits a simple explicit representation, here we simulate a very large sample from X1 + X2 + ... + X12, where the Xi are independent R(0, 1) random variables.
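The parameters of the approximating normal follow from µ = n(a + b)/2 and σ = √(n/12)(b − a): with n = 12, a = 0 and b = 1 these reduce to µ = 6 and σ = 1, which is why N(6, 1) is overlaid on the simulation. A minimal check:

```r
# Mean and standard deviation of the sum of n independent R(a, b)
# variables, evaluated for n = 12, a = 0, b = 1.
n <- 12; a <- 0; b <- 1
mu <- n * (a + b) / 2            # 12 * 1/2 = 6
sigma <- sqrt(n / 12) * (b - a)  # sqrt(1) * 1 = 1
c(mu, sigma)
```

The choice n = 12 is traditional precisely because it makes σ = 1 exactly.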
set.seed(1805014)
matr <- matrix(runif(24000), 2000, 12)
vett <- rowSums(matr)
hist(vett, freq = FALSE, breaks = 23, xlab = "x", ylab = "Simulated PDF",
     lwd = 2, main = "Sum of 12 R(0, 1) Distributions")
plot(function(x) dnorm(x, mean = 6, sd = 1), 0, 10, lty = "dashed",
     lwd = 2, col = "red", add = TRUE)
plot.ecdf(vett, pch = 20, xlab = "x", ylab = "Simulated CDF", lwd = 2,
     main = "Sum of 12 R(0, 1) Distributions")
plot(function(x) pnorm(x, mean = 6, sd = 1), 0, 10, lty = "dashed",
     lwd = 2, col = "red", add = TRUE)

[Figure: simulated PDF (histogram) and CDF (empirical CDF) of the sum of 12 R(0, 1) distributions, with the N(6, 1) approximation dashed in red.]

5 OTHER APPLICATIONS

Ex1 A fair coin is tossed 120 times. What is the probability that the percentage of heads is greater than 55%?

Solution. Let Xi ∼ Bi(1, 1/2) be the Bernoulli distribution corresponding to the i-th trial, i = 1, ..., 120. Then the total number of heads is S120 = Σ_{i=1}^{120} Xi ∼ Bi(120, 1/2) and the relative frequency of heads is X̄120 = S120/120. Therefore X̄120 > 0.55 ⇔ S120 > 120 · 0.55 = 66 and the required probability is

P(X̄120 > 0.55) = P(S120 > 66) = Σ_{x=67}^{120} C(120, x) (1/2)^120.

The previous probability can be obtained through the R function pbinom.

> 1 - pbinom(66, 120, 1/2)
[1] 0.1176017

An approximate solution is given by the CLT. The first step is to derive the expectation and standard deviation of S120:

E(S120) = np = 120 · 1/2 = 60;  SD(S120) = √(np(1 − p)) = √30 ≈ 5.477.

The second step is to derive the standardized value corresponding to 66:

(66 − E(S120))/SD(S120) = (66 − 60)/√30 ≈ 1.187.

The final step is to derive the normal approximation of the binomial CDF:

P(S120 > 66) = 1 − P(S120 ≤ 66) = 1 − P(S120,ST ≤ 1.187) ≈ 1 − F_N(0,1)(1.187).

The result can be computed through the R function pnorm.

> 1 - pnorm(1.187)
[1] 0.1176138

The accuracy of the approximation is good.

Ex2 The time (minutes) spent by a teacher to check a written examination is a random variable X whose distribution is exponential with rate parameter λ = 1/5 per minute.
1) What is the probability that more than 10 minutes are needed to check an examination?
2) Suppose that 80 written examinations have to be checked. What is the expected total time to perform the job?
3) What is the probability that the total time is between 6 and 7 hours?

Solution.
1. P(X > 10) = 1 − F_X(10) = exp(−2) ≈ 0.135.
2. The total time S80 is the sum of the times needed to check each examination. Writing Xi for the time required by the i-th examination, i = 1, ..., 80, S80 = Σ_{i=1}^{80} Xi. The standard model is assumed, that is, the random variables Xi are independent and have the same exponential distribution Exp(λ = 1/5). It follows

E(S80) = nE(X) = 80 · 5 = 400;  SD(S80) = √n SD(X) = 20√5 ≈ 44.72.

3. We use the normal approximation allowed by the CLT, even though we cannot expect very good accuracy here (the exponential distribution is far from the normal shape). Six hours correspond to 360 minutes and seven hours correspond to 420 minutes. The standardized values are

(360 − 400)/(20√5) ≈ −0.894,  (420 − 400)/(20√5) ≈ 0.447.
Finally,

P(360 < S80 < 420) = F_S80(420) − F_S80(360) = F_S80,ST(0.447) − F_S80,ST(−0.894) ≈ F_N(0,1)(0.447) − F_N(0,1)(−0.894).

We use R to derive the approximate probability and its true value (from a gamma distribution with shape parameter n equal to 80 and rate λ = 1/5).

> pnorm(420, mean = 400, sd = 20 * sqrt(5)) - pnorm(360, mean = 400,
+     sd = 20 * sqrt(5))
[1] 0.4870929
> pgamma(420, shape = 80, rate = 1/5) - pgamma(360, shape = 80,
+     rate = 1/5)
[1] 0.4962206

The CLT approximation shows an error of about 1%.

Ex3 A drunkard executes a random walk (see Rice, p. 140, Ex. C, for details) in the following way. Each minute he takes a step north or south, with probability 1/2 each, and his successive directions are independent. His step length is 50 cm.
1) Use the CLT to approximate the probability distribution of his location after 1 hour. Where is he most likely to be?
2) What is the probability that after 1 hour the distance from the starting point is greater than 10 metres? (Rice, 5.13)

Solution. Suppose, for the sake of simplicity, that the initial location is the origin of the real line. The drunkard's location after n minutes can be represented as the sum Sn = Σ_{i=1}^{n} Xi, where Xi is the random variable describing the i-th movement. For i = 1, ..., n, the Xi can assume the values ±0.5 metres with equal probabilities and they are independent. It is easily checked that E(Xi) = 0 and SD(Xi) = 1/2.

1. The previous discussion implies

E(Sn) = nE(Xi) = 0;  SD(Sn) = √n SD(Xi) = (1/2)√n.

By the CLT, the location after 1 hour, S60 = Σ_{i=1}^{60} Xi, is approximately a normal random variable, centered at the origin with standard deviation SD(S60) = √60/2 ≈ 3.873. Hence, the most probable location is (a neighbourhood of) the origin!

2. The required probability is

P(S60 < −10) + P(S60 > 10) = 2F_S60(−10) = 2F_S60,ST(−2.582) ≈ 2F_N(0,1)(−2.582).

Observe that, if Zi ∼ Bi(1, 1/2), then Xi = Zi − 1/2 and therefore Sn ∼ Bi(n, 1/2) − n/2.
The approximate and exact probabilities are computed below.

> 2 * pnorm(-10, mean = 0, sd = sqrt(60)/2)
[1] 0.009823275
> 2 * pbinom(19, 60, 1/2)
[1] 0.006217603

Ex4 Suppose that a measurement has mean µ and standard deviation σ = 5. Let X̄n be the average of n such independent measurements. How large should n be so that P(|X̄n − µ| < 1) = 0.95? (Rice, 5.17)

Solution. Observe that, according to the WLLN, the event An = {|X̄n − µ| < 1} is almost sure when n → ∞, because X̄n has mean µ and standard deviation 5/√n. Hence the problem has a solution n* sufficiently high to allow application of the CLT. In other words, we can assume the existence of a given (high) n* such that X̄n* is approximately normally distributed. Therefore, we have to solve a quantile problem about a normal distribution. Let x(ST)_{0.975} ≈ 1.96 denote the quantile of the N(0, 1) distribution of order 0.975. The following equation must hold:

(1/5)√(n*) = x(ST)_{0.975} ≈ 1.96

and the solution is (about) n* = 96.
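The solution of Ex4 can be checked numerically with the R function qnorm; a minimal sketch:

```r
# Solve (1/5) * sqrt(n) = z_{0.975} for n: the sample size giving
# P(|Xbar_n - mu| < 1) ~ 0.95 when the measurement has sigma = 5.
z <- qnorm(0.975)    # about 1.959964
n_star <- (5 * z)^2  # about 96.04
n_star
```

In practice one would round n* up to the next integer, n = 97, to guarantee at least the required coverage.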