STATISTICAL LABORATORY, April 27th, 2011
CENTRAL LIMIT THEOREM ILLUSTRATION
Mario Romanazzi
1 BINOMIAL DISTRIBUTION
The binomial distribution Bi(n, p), being the sum of n independent Bernoulli random variables, falls within
the framework of the CLT. When the parameter n (sample size) is large, it can be approximated by a
normal distribution with mean µ = np and standard deviation σ = (np(1 − p))^{1/2}. If p = 1/2,
the binomial distribution is symmetric and the rate of convergence is high: the approximation is already
good for n around 30 to 40. The result is illustrated below with n = 50 and p = 1/2.
n <- 50
p <- 1/2
dmax <- max(dbinom(0:n, n, p))
plot(0:n, dbinom(0:n, n, p), type = "h", lwd = 5, ylim = c(0, dmax),
    xlab = "x", ylab = "PDF", main = "Binomial Distribution Bi(n = 50, p = 1/2)")
plot(function(x) dnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p))),
    0, n, lwd = 2, col = "red", add = TRUE)
plot(0:n, pbinom(0:n, n, p), type = "s", lwd = 2, xlab = "x",
    ylab = "CDF", main = "Binomial Distribution Bi(n = 50, p = 1/2)")
plot(function(x) pnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p))),
    0, n, lwd = 2, col = "red", add = TRUE)
[Figure: PDF (bars) and CDF (step function) of the Bi(n = 50, p = 1/2) distribution over x = 0, ..., 50, with the approximating normal density and CDF overlaid in red.]
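A quick numerical complement to the plots is to look at the largest discrepancy between the binomial CDF and its normal counterpart over the whole support; a minimal sketch, reusing the objects n and p defined above:

x <- 0:n
# largest absolute difference between the exact and the approximate CDF
max(abs(pbinom(x, n, p) - pnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p)))))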
2 POISSON DISTRIBUTION
The Poisson distribution, when the parameter λ is large, can be approximated by a normal distribution
with mean µ = λ and standard deviation σ = √λ. The result is not surprising if we recall the
relationship of the Poisson distribution with the binomial. The illustration below assumes λ = 20.
lambda <- 20
a1 <- round(lambda - 3.5 * sqrt(lambda))
a2 <- round(lambda + 3.5 * sqrt(lambda))
dmax <- max(dpois(a1:a2, lambda))
plot(a1:a2, dpois(a1:a2, lambda), type = "h", lwd = 5, ylim = c(0, dmax),
    xlab = "x", ylab = "PDF", main = "Poisson Distribution (lambda = 20)")
plot(function(x) dnorm(x, mean = lambda, sd = sqrt(lambda)),
    a1, a2, lwd = 2, col = "red", add = TRUE)
plot(a1:a2, ppois(a1:a2, lambda), type = "s", lwd = 2, xlab = "x",
    ylab = "CDF", main = "Poisson Distribution (lambda = 20)")
plot(function(x) pnorm(x, mean = lambda, sd = sqrt(lambda)),
    a1, a2, lwd = 2, col = "red", add = TRUE)
[Figure: PDF and CDF of the Poisson distribution (λ = 20), with the approximating normal density and CDF overlaid in red.]
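The same kind of numerical check used for the binomial case applies here; a minimal sketch, reusing lambda, a1 and a2 from above:

x <- a1:a2
# largest absolute difference between the Poisson CDF and the normal approximation
max(abs(ppois(x, lambda) - pnorm(x, mean = lambda, sd = sqrt(lambda))))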
3 EXPONENTIAL DISTRIBUTION
The sum of n independent exponential random variables with parameter λ, when n is large, can be
approximated by a normal distribution with mean µ = n/λ and standard deviation σ = √n/λ. This is
easily checked because the sum of n independent exponential random variables is a gamma random variable.
The illustration below assumes λ = 1 and n = 40.
lambda <- 1
n <- 40
mn <- n/lambda
sn <- sqrt(n)/lambda
a1 <- mn - 4.5 * sn
a2 <- mn + 4.5 * sn
plot(function(x) dgamma(x, shape = n, rate = lambda), a1, a2,
lwd = 2, xlab = "x", ylab = "PDF", main = "Sum of 40 Exponential Distributions (lambda = 1)")
plot(function(x) dnorm(x, mean = mn, sd = sn), a1, a2, lty = "dashed",
lwd = 2, col = "red", add = TRUE)
plot(function(x) pgamma(x, shape = n, rate = lambda), a1, a2,
lwd = 2, xlab = "x", ylab = "CDF", main = "Sum of 40 Exponential Distributions (lambda = 1)")
plot(function(x) pnorm(x, mean = mn, sd = sn), a1, a2, lty = "dashed",
lwd = 2, col = "red", add = TRUE)
[Figure: PDF and CDF of the sum of 40 exponential distributions (λ = 1), i.e. the Gamma(shape = 40, rate = 1) distribution, with the approximating normal density and CDF overlaid as dashed red curves.]
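Once more, a numerical counterpart of the plots; a minimal sketch, reusing n, lambda, mn, sn, a1 and a2 from above:

x <- seq(a1, a2, length.out = 200)
# largest absolute difference between the gamma CDF and the normal approximation
max(abs(pgamma(x, shape = n, rate = lambda) - pnorm(x, mean = mn, sd = sn)))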
4 UNIFORM DISTRIBUTION
The sum of n independent uniform distributions R(a, b), when n is large, can be approximated by a normal
distribution with mean µ = n(a + b)/2 and standard deviation σ = (n/12)^{1/2}(b − a). A fairly good
approximation is already apparent with n = 12. The illustration below deals with the R(0, 1) distribution
and n = 12. Since the pdf and cdf of the sum of independent uniform random variables do not have a
simple closed form, here we simulate a very large sample from X1 + X2 + ... + X12, where the Xi are
independent R(0, 1) random variables.
set.seed(1805014)
matr <- matrix(runif(24000), 2000, 12)
vett <- rowSums(matr)
hist(vett, freq = FALSE, breaks = 23, xlab = "x", ylab = "Simulated PDF",
lwd = 2, main = "Sum of 12 R(0, 1) Distributions")
plot(function(x) dnorm(x, mean = 6, sd = 1), 0, 10, lty = "dashed",
lwd = 2, col = "red", add = TRUE)
plot.ecdf(vett, pch = 20, xlab = "x", ylab = "Simulated CDF",
lwd = 2, main = "Sum of 12 R(0, 1) Distributions")
plot(function(x) pnorm(x, mean = 6, sd = 1), 0, 10, lty = "dashed",
lwd = 2, col = "red", add = TRUE)
[Figure: histogram of the 2000 simulated values of the sum of 12 R(0, 1) variables (Simulated PDF) and their empirical CDF (Simulated CDF), with the N(6, 1) density and CDF overlaid as dashed red curves.]
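A numerical complement, reusing the simulated vector vett: compare the sample moments with the theoretical values µ = 6 and σ = 1, and the empirical CDF with the normal CDF at a few points; a minimal sketch:

c(mean(vett), sd(vett))   # theoretical values: 6 and 1
Fhat <- ecdf(vett)
q <- c(4, 5, 6, 7, 8)
# empirical versus approximating normal CDF
round(cbind(empirical = Fhat(q), normal = pnorm(q, mean = 6, sd = 1)), 3)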
5 OTHER APPLICATIONS
Ex1 A fair coin is tossed 120 times. What is the probability that the percentage of heads is greater than 55%?
Solution. Let Xi ∼ Bi(1, 1/2) be the Bernoulli variable corresponding to the i-th trial, i = 1, ..., 120. Then the total number of heads is S120 = Σ_{i=1}^{120} Xi ∼ Bi(120, 1/2) and the relative frequency of heads is X̄120 = S120/120. Therefore

X̄120 > 0.55 ⇔ S120 > 120 · 0.55 = 66

and the required probability is

P(X̄120 > 0.55) = P(S120 > 66) = Σ_{x=67}^{120} (120 choose x) (1/2)^120.
The previous probability can be obtained through the R function pbinom.
> 1 - pbinom(66, 120, 1/2)
[1] 0.1176017
An approximate solution is given by the CLT. The first step is to derive expectation and standard
deviation of S120 .
E(S120) = np = 120 · (1/2) = 60;  SD(S120) = √(np(1 − p)) = √30 ≈ 5.477.
The second step is to derive the standard value corresponding to 66; since S120 is integer-valued, the half-unit continuity correction is applied, so the boundary 66 is replaced by 66.5:

(66.5 − E(S120))/SD(S120) = (66.5 − 60)/√30 ≈ 1.187.

The final step is to derive the normal approximation of the binomial CDF:

P(S120 > 66) = 1 − P(S120 ≤ 66) = 1 − P(S120 ≤ 66.5) ≈ 1 − F_{N(0,1)}(1.187).
The result can be computed through the R function pnorm.
> 1 - pnorm(1.187)
[1] 0.1176138
The accuracy of the approximation is good.
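A possible cross-check, computing the approximation directly from the moments; without the half-unit correction the approximate value is noticeably less accurate:

> 1 - pnorm(66.5, mean = 60, sd = sqrt(30))   # with continuity correction, about 0.118
> 1 - pnorm(66, mean = 60, sd = sqrt(30))     # without correction, about 0.137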
Ex2 The time (minutes) spent by a teacher to check a written examination is a random variable X whose distribution is exponential with rate parameter λ = 1/5 per minute. 1) What is the probability that more than 10 minutes are needed to check an examination? 2) Suppose that 80 written examinations have to be checked. What is the expected total time to perform the job? 3) What is the probability that the total time is between 6 and 7 hours?
Solution.
1. P(X > 10) = 1 − FX(10) = exp(−2) ≈ 0.135.
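The same value can also be obtained with the R function pexp; the result equals exp(−2) ≈ 0.135:

> 1 - pexp(10, rate = 1/5)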
2. The total time S80 is the sum of the times needed to check each examination. Writing Xi for the time required by the i-th examination, i = 1, ..., 80, S80 = Σ_{i=1}^{80} Xi. The standard model is assumed, that is, the random variables Xi are independent and have the same exponential distribution Exp(λ = 1/5). It follows

E(S80) = nE(X) = 80 · 5 = 400;  SD(S80) = √n · SD(X) = 20√5 ≈ 44.72.

The expected total time is therefore 400 minutes, that is, 6 hours and 40 minutes.
3. We use the normal approximation allowed by the CLT, even though we cannot expect very good accuracy here (the exponential distribution is far from the normal shape). Six hours correspond to 360 minutes and seven hours correspond to 420 minutes. The standardized values are

(360 − 400)/(20√5) ≈ −0.894,  (420 − 400)/(20√5) ≈ 0.447.

Finally,

P(360 < S80 < 420) = F_{S80}(420) − F_{S80}(360) = F_{S80,ST}(0.447) − F_{S80,ST}(−0.894) ≈ F_{N(0,1)}(0.447) − F_{N(0,1)}(−0.894).

We use R to derive the approximate probability and its true value (from a gamma distribution with shape parameter n = 80 and rate λ = 1/5).

> pnorm(420, mean = 400, sd = 20 * sqrt(5)) - pnorm(360, mean = 400,
+     sd = 20 * sqrt(5))
[1] 0.4870929
> pgamma(420, shape = 80, rate = 1/5) - pgamma(360, shape = 80,
+     rate = 1/5)
[1] 0.4962206
The CLT approximation shows an error of about 1%.
Ex3 A drunkard executes a random walk (see Rice, p. 140, Ex. C, for details) in the following way. Each
minute he takes a step north or south, with probability 1/2 each, and his successive directions are
independent. His step length is 50 cm. 1) Use the CLT to approximate the probability distribution
of his location after 1 hour. Where is he most likely to be? 2) What is the probability that after 1
hour the distance from the starting point is greater than 10 metres? (Rice, 5.13)
Solution. Suppose, for the sake of simplicity, that the initial location is the origin of the real line. The drunkard's location after n minutes can be represented as the sum Sn = Σ_{i=1}^{n} Xi, where Xi is the random variable describing the i-th movement. For i = 1, ..., n, Xi can assume the values ±0.5 metres with equal probabilities and the Xi are independent. It is easily checked that E(Xi) = 0 and SD(Xi) = 1/2.
1. The previous discussion implies

E(Sn) = nE(Xi) = 0;  SD(Sn) = √n · SD(Xi) = (1/2)√n.

By the CLT, the location after 1 hour, S60 = Σ_{i=1}^{60} Xi, is approximately a normal random variable, centered at the origin with standard deviation SD(S60) = √60/2 ≈ 3.873. Hence, the most probable location is (a neighbourhood of) the origin!
2. The required probability is

P(S60 < −10) + P(S60 > 10) = 2F_{S60}(−10) = 2F_{S60,ST}(−2.582) ≈ 2F_{N(0,1)}(−2.582).

Observe that, if Zi ∼ Bi(1, 1/2), then Xi = Zi − 1/2 and therefore Sn ∼ Bi(n, 1/2) − n/2. In particular, S60 < −10 is equivalent to Σ_{i=1}^{60} Zi ≤ 19, which gives the exact computation below.
The approximate and exact probabilities are computed below.
> 2 * pnorm(-10, mean = 0, sd = sqrt(60)/2)
[1] 0.009823275
> 2 * pbinom(19, 60, 1/2)
[1] 0.006217603
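The probability can also be estimated by simulating the random walk directly; a minimal sketch (the Monte Carlo estimate should be close to the exact value 0.0062):

set.seed(1)
nsim <- 100000
# each row holds the 60 independent steps of +/- 0.5 metres
steps <- matrix(sample(c(-0.5, 0.5), nsim * 60, replace = TRUE), nsim, 60)
S60 <- rowSums(steps)
mean(abs(S60) > 10)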
Ex4 Suppose that a measurement has mean µ and standard deviation σ = 5. Let X̄n be the average of n such independent measurements. How large should n be so that P(|X̄n − µ| < 1) = 0.95? (Rice, 5.17)
Solution. Observe that, according to the WLLN, the probability of the event An = {|X̄n − µ| < 1} tends to 1 as n → ∞, because X̄n has mean µ and standard deviation 5/√n. Hence the problem has a solution n* large enough to allow application of the CLT. In other words, we can assume the existence of a (large) n* such that X̄n* is approximately normally distributed. Therefore, we have to solve a quantile problem about a normal distribution. Let x_{0.975}^{(ST)} ≈ 1.96 denote the quantile of order 0.975 of the N(0, 1) distribution. The following equation must hold

(1/5)√n* = x_{0.975}^{(ST)} ≈ 1.96

and the solution is n* = (1.96 · 5)^2 ≈ 96.
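The same calculation in R, as a quick cross-check:

> (qnorm(0.975) * 5 / 1)^2   # n* = (quantile * sigma / tolerance)^2, about 96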