THE CENTRAL LIMIT THEOREM
DANIEL RÜDT
UNIVERSITY OF TORONTO
MARCH, 2010
Contents

1 Introduction
2 Mathematical Background
3 The Central Limit Theorem
4 Examples
  4.1 Roulette
  4.2 Cauchy Distribution
5 Historical Background
6 Proof
  6.1 Outline of the Proof
  6.2 Lévy Continuity Theorem
  6.3 Lemmas
  6.4 Proof of the Central Limit Theorem
1 Introduction
The Central Limit Theorem (CLT) is one of the most remarkable results in probability theory: it is not only easy to state, but it also has very useful applications. The CLT can tell us about the distribution of large sums of random variables even when the distribution of the individual random variables is largely unknown. With this result we are able to approximate how likely it is that the arithmetic mean deviates from its expected value. I give such an example in Section 4.

The CLT provides answers to many statistical problems. Using the CLT we can test hypotheses and make statistical decisions, because we are able to determine the asymptotic distribution of certain test statistics.
As a warm-up, we attempt to understand what happens to the distribution of random variables if we sum them.

Suppose $X$ and $Y$ are continuous and independent random variables with densities $f_X$ and $f_Y$. If we define $\mathbb{1}_A(x)$ to be the indicator function of a set $A$, then we want to recall that for independent random variables
$$P(X \in A,\, Y \in B) = \int_{\mathbb{R}} \int_{\mathbb{R}} \mathbb{1}_A(x)\, \mathbb{1}_B(y)\, f_X(x)\, f_Y(y)\; dx\; dy .$$
Now we can see that the density of $X + Y$ is given by the convolution of $f_X$ and $f_Y$:
$$
\begin{aligned}
P(X + Y \le z) &= \int_{\mathbb{R}} \int_{\mathbb{R}} \mathbb{1}_{[x+y \le z]}(x, y)\, f_X(x)\, f_Y(y)\; dx\; dy
= \int_{\mathbb{R}} \int_{-\infty}^{z-y} f_X(x)\; dx\; f_Y(y)\; dy \\
&= \int_{\mathbb{R}} \int_{-\infty}^{z} f_X(x - y)\; dx\; f_Y(y)\; dy
\overset{\text{Fubini}}{=} \int_{-\infty}^{z} \int_{\mathbb{R}} f_X(x - y)\, f_Y(y)\; dy\; dx \\
&= \int_{-\infty}^{z} (f_X * f_Y)(x)\; dx .
\end{aligned}
$$
In order to visualize this result I did some calculations: I determined the density of the sum of independent, uniformly distributed random variables. The following pictures show the density of $\sum_{i=1}^{n} X_i$ for $X_i \sim U[-0.5, 0.5]$.
[Figures: densities of $\sum_{i=1}^{n} X_i$ for $n = 1$, $n = 2$, $n = 3$, $n = 10$]
If we compare these graphs to the density of a standard normally distributed random
variable, we can see remarkable similarities even for small n.
[Figure: density of a standard normally distributed random variable]
This result leads us to suspect that sums of random variables somehow behave normally. The CLT makes this observation precise.
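These pictures can be reproduced approximately with a short simulation. The sketch below (assuming NumPy and Matplotlib, which are not part of the original computation; the paper used exact convolutions) estimates the densities with Monte Carlo histograms:

```python
# Illustrative sketch: approximate the density of sum_{i=1}^n X_i for
# X_i ~ U[-0.5, 0.5] by Monte Carlo histograms and compare with N(0, 1).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
reps = 100_000

for n in (1, 2, 3, 10):
    sums = rng.uniform(-0.5, 0.5, size=(reps, n)).sum(axis=1)
    plt.hist(sums, bins=100, density=True, histtype="step", label=f"n = {n}")

# Standard normal density; Var(sum) = n/12, so the n = 10 case already has
# spread close to 1 and the shapes can be compared directly.
xs = np.linspace(-4, 4, 400)
plt.plot(xs, np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi), "k--", label="N(0, 1)")
plt.legend()
plt.show()
```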
2 Mathematical Background
In this section I just want to recall some of the basic definitions and theorems used in this
paper. Elementary definitions of probability theory are assumed to be well known. To
keep things simple, we just consider the sample space Ω = R.
Definition. A random variable $X$ is called normally distributed with parameters $\mu$ and $\sigma$ ($X \sim N(\mu, \sigma)$) if the density of the random variable is given by
$$\phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} .$$
Definition. If a random variable $X$ with probability measure $P_X$ has density $f$, then we define the distribution function $F_X$ as follows:
$$F_X(x) = P(X \le x) = P_X((-\infty, x]) = \int_{-\infty}^{x} dP_X = \int_{-\infty}^{x} f(t)\; dt .$$
Definition. A sequence of random variables $X_1, X_2, \dots$ converges in distribution to a random variable $X$ if
$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$$
for every $x \in \mathbb{R}$ at which $F_X$ is continuous. We write this as $X_n \overset{d}{\Rightarrow} X$.
The following theorem can also be used to define convergence in distribution and will come in handy when proving the Central Limit Theorem:

Theorem 2.1. Suppose $X_1, X_2, \dots$ is a sequence of random variables. Then $X_n \overset{d}{\Rightarrow} X$ if and only if
$$\lim_{n \to \infty} E[f(X_n)] = E[f(X)]$$
for every bounded and continuous function $f$.
Definition. The characteristic function of a random variable $X$ is defined by
$$\varphi(t) = E\!\left[e^{itX}\right] = \int_{\mathbb{R}} e^{itx}\; dP_X(x) .$$

Proposition. Every characteristic function $\varphi$ is continuous, $\varphi(0) = 1$, and $|\varphi(t)| \le 1$.
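As a quick numerical illustration of this definition (a sketch assuming NumPy; the closed form $e^{-t^2/2}$ for the standard normal characteristic function is the one that reappears in the proof of the CLT below):

```python
# Illustrative sketch: Monte Carlo estimate of a characteristic function
# phi(t) = E[exp(itX)], checked against phi(t) = exp(-t^2/2) for X ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)

for t in (0.5, 1.0, 2.0):
    estimate = np.mean(np.exp(1j * t * x))   # sample average of e^{itX}
    exact = np.exp(-t**2 / 2)                # known characteristic function
    print(f"t = {t}: estimate = {estimate:.4f}, exact = {exact:.4f}")
```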
Theorem 2.2. If $X, Y$ are random variables and $\varphi_X(t) = \varphi_Y(t)$ for all $t \in \mathbb{R}$, then $X \overset{d}{=} Y$, i.e. $X$ and $Y$ have the same distribution.
Theorem 2.3. Suppose $X$ and $Y$ are independent random variables. Then
$$\varphi_{X+Y}(t) = \varphi_X(t) \cdot \varphi_Y(t) \qquad \forall t \in \mathbb{R} .$$

3 The Central Limit Theorem
The Central Limit Theorem. If $\{X_n\}$ is a sequence of independent and identically distributed random variables, each having finite expectation $\mu$ and finite positive variance $\sigma^2$, then
$$\frac{X_1 + X_2 + \dots + X_n - n\mu}{\sigma \cdot \sqrt{n}} \overset{d}{\Rightarrow} N(0, 1) ,$$
i.e. a centered and normalized sum of independent and identically distributed (i.i.d.) random variables converges in distribution to the standard normal distribution as $n$ goes to infinity.
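To see the theorem in action, here is a small simulation sketch (assuming NumPy and SciPy; the choice of the Exp(1) distribution, for which $\mu = \sigma^2 = 1$, is mine for illustration):

```python
# Illustrative sketch: standardized sums of i.i.d. Exp(1) variables
# (mu = 1, sigma = 1) compared with the standard normal distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 1000, 10_000
sums = rng.exponential(1.0, size=(reps, n)).sum(axis=1)
z = (sums - n) / np.sqrt(n)              # (S_n - n*mu) / (sigma * sqrt(n))

for c in (0.0, 1.0, 2.0):
    print(f"P(Z > {c}): empirical = {np.mean(z > c):.4f}, "
          f"N(0,1) = {stats.norm.sf(c):.4f}")
```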
4 Examples

4.1 Roulette
It is nothing new that on average you should lose when playing roulette. Despite this, it is still interesting to examine the chances of winning. The CLT gives an answer to this question.
A roulette wheel has 37 numbers in total: 18 are black, 18 are red and 1 is green. Players are allowed to bet on black or red. Assume a player is always betting \$1 on black. We define $X_i$ to be the winnings of the $i$th spin. $X_1, X_2, \dots$ are clearly independent and
$$P(X_i = 1) = \frac{18}{37} , \qquad P(X_i = -1) = \frac{19}{37} ,$$
$$E(X_i) = -\frac{1}{37} , \qquad \mathrm{Var}(X_i) = E(X_i^2) - [E(X_i)]^2 \approx 1 .$$
We want to approximate the probability that $S_n = X_1 + \dots + X_n$ is bigger than $0$:
$$P(S_n > 0) = P\!\left( \frac{S_n - n\mu}{\sqrt{n}\,\sigma} > \frac{-n\mu}{\sqrt{n}\,\sigma} \right) .$$
Let's say we want to play $n = 100$ times. Then
$$\frac{-n\mu}{\sqrt{n}\,\sigma} = \frac{100 \cdot (1/37)}{\sqrt{100}} = \frac{10}{37} .$$
Now the CLT states that
$$P(S_n > 0) \approx P\!\left( X > \frac{10}{37} \right)$$
for a standard normally distributed random variable $X$. Since $P\!\left(X > \frac{10}{37}\right) \approx 0.39$, we can conclude that the chance to win money by playing roulette 100 times is about 39%.
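This figure is easy to check numerically. The following sketch (assuming NumPy and SciPy) evaluates the normal tail probability and also simulates the 100 spins directly; the simulated value comes out slightly below the CLT approximation because $S_{100}$ can only take even values:

```python
# Illustrative sketch: the CLT approximation for P(S_100 > 0) vs. a direct
# simulation of 100 roulette spins (betting $1 on black each time).
import numpy as np
from scipy import stats

print("CLT approximation:", stats.norm.sf(10 / 37))      # ~0.39

rng = np.random.default_rng(0)
spins = rng.choice([1, -1], p=[18 / 37, 19 / 37], size=(50_000, 100))
print("Simulated:", np.mean(spins.sum(axis=1) > 0))
```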
4.2 Cauchy Distribution

The Cauchy distribution shows that the conditions of finite variance and finite expectation cannot be dropped.
Definition. A random variable $X$ is called Cauchy distributed if the density of $X$ is given by
$$f(x) = \frac{1}{\pi(1 + x^2)} .$$
Proposition. If $X$ is a Cauchy distributed random variable, then $E[|X|] = \infty$; in particular, $X$ has no finite expectation and no finite variance.
Lemma. If $X$ is Cauchy distributed, then
$$\varphi_X(t) = e^{-|t|} .$$
Proposition. If $\{X_n\}$ is a sequence of independent Cauchy distributed random variables, then $Y_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ also has a Cauchy distribution.
Proof. To prove this statement we want to compute the characteristic function of $Y_n$ and compare it to the characteristic function of a Cauchy distributed random variable. If they are the same, then the claim follows by Theorem 2.2.
$$\varphi_{Y_n}(t) = \prod_{i=1}^{n} \varphi_{X_i/n}(t) = \prod_{i=1}^{n} \varphi_{X_i}\!\left(\frac{t}{n}\right) = \left[ \varphi_{X_1}\!\left(\frac{t}{n}\right) \right]^{n} = \left( e^{-\frac{|t|}{n}} \right)^{n} = e^{-|t|} .$$
The first step is true because of the theorem about sums of independent random variables and their characteristic functions (Theorem 2.3). The third step follows from Theorem 2.2, since all random variables are identically distributed.
So, as we can see, the arithmetic mean of Cauchy distributed random variables is itself Cauchy distributed, and therefore the CLT does not hold.
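A short simulation illustrates this failure of the CLT (a sketch assuming NumPy; the fact that a standard Cauchy distribution has interquartile range exactly 2, used here as the yardstick, is not stated in the paper):

```python
# Illustrative sketch: the sample mean of n Cauchy variables is as spread
# out as a single Cauchy variable, no matter how large n is.
import numpy as np

rng = np.random.default_rng(0)
for n in (1, 10, 1000):
    means = rng.standard_cauchy(size=(10_000, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    # A standard Cauchy has interquartile range exactly 2; the sample means
    # keep this spread instead of shrinking like 1/sqrt(n).
    print(f"n = {n:5d}: IQR of sample means = {q75 - q25:.3f}")
```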
5 Historical Background
The CLT has a long and vivid history. It developed over time and there are many different
versions and proofs of the CLT.
1st Chapter (1776 - 1829)
In 1776 Laplace published a paper about the inclination angles of meteors. In this paper he tried to calculate the probability that the actual data collected differed from the theoretical mean he had computed. This was the first attempt to study sums of random variables. From
this it is clear that the CLT was motivated by statistics. His work was continued by Poisson,
who published two papers in 1824 and 1829. In these papers he tried to generalize the work
of Laplace and to make it more rigorous. At this time probability theory was still not considered a branch of real mathematics. For most mathematicians it was sufficient that the CLT worked in practice, so they did not put much effort into giving rigorous proofs.
2nd Chapter (1870 - 1913)
This mindset changed during the 19th century. Bessel, Dirichlet and especially Cauchy
turned probability theory into a respected branch of pure mathematics. They succeeded
in giving rigorous proofs, but there were still some issues. They had problems in dealing
with distributions with infinite support and with quantifying the rate of convergence. Moreover, the conditions for the CLT were not satisfactory.
Between 1870 and 1913 the famous Russian mathematicians Markov, Chebyshev and Lyapunov did a lot of research on the CLT and are considered to be its most important contributors. To prove the CLT they worked in two different directions: Markov and Chebyshev attempted to prove the CLT using the "Method of Moments", whereas Lyapunov used characteristic functions.
3rd Chapter (1920 - 1937)
During that period Lindeberg, Feller and Lévy studied the CLT. Lindeberg was able to
apply the CLT to random vectors and he quantified the rate of convergence. His proof was
a big step since he was able to give all sufficient conditions of the CLT. Later, Feller and Lévy succeeded in giving all necessary conditions as well, which could be proven using the work of Cramér. The CLT, as it is known today, was born.
The CLT today
People have continued to improve the CLT. There has been a variety of research on theorems for dependent random variables, but the basic principles of Lindeberg, Feller and Lévy are still up to date.
6 Proof

6.1 Outline of the Proof
The idea of the proof is to use nice properties of characteristic functions. The Lévy continuity theorem states that the distribution of the limit of a sequence of random variables is uniquely determined by the limit of the corresponding characteristic functions. So all we have to do is understand the limit of the characteristic functions of our summed random variables. We will see that the characteristic function of a sum of i.i.d. random variables behaves very well.

The first step is to understand the Lévy continuity theorem. The second step will deal with the evaluation of the characteristic function of summed i.i.d. random variables. The final step will put everything together into a short and smooth proof.
6.2 Lévy Continuity Theorem
The actual proof of the CLT is straightforward. The difficulty lies in understanding all the contributing theorems and lemmas. Since the most important one is the Lévy continuity theorem, I want to take a close look at this result.
Lévy Continuity Theorem. Suppose $X_1, X_2, \dots$ and $X$ are random variables and $\varphi_1(t), \varphi_2(t), \dots$ and $\varphi_X(t)$ are the corresponding characteristic functions. Then
$$X_n \overset{d}{\Rightarrow} X \iff \lim_{n \to \infty} \varphi_n(t) = \varphi_X(t) \qquad \forall t \in \mathbb{R} .$$
To understand how the proof works we need some more tools:

Bounded Convergence Theorem. If $X, X_1, X_2, \dots$ are random variables, $X_n \overset{d}{\Rightarrow} X$, $C \in \mathbb{R}$, and $|X_n| \le C$ for all $n \in \mathbb{N}$, then
$$\lim_{n \to \infty} E[X_n] = E[X] .$$
Definition. A sequence of random variables $X_n$ is called tight if for every $\epsilon > 0$ there exists an $M \in \mathbb{R}$ such that $P(|X_n| > M) \le \epsilon$ for all $n \in \mathbb{N}$.
Lemma 6.2.1. If $X_n$ is tight, then there exist a subsequence $X_{n_k}$ and a random variable $X$ such that
$$X_{n_k} \overset{d}{\Rightarrow} X .$$
Lemma 6.2.2. If $X_n$ is tight and every subsequence of $X_n$ that converges at all converges to the same random variable $Y$, then also
$$X_n \overset{d}{\Rightarrow} Y .$$
Proof of the Lévy Continuity Theorem.

"⇒": Since $\cos(tX_n)$ and $\sin(tX_n)$ are continuous and bounded functions, we can see by Theorem 2.1 that
$$\varphi_n(t) = E\!\left[e^{itX_n}\right] = E[\cos(tX_n)] + i\,E[\sin(tX_n)] \xrightarrow{n \to \infty} E[\cos(tX)] + i\,E[\sin(tX)] = \varphi_X(t) .$$
"⇐": This part of the proof proceeds in two steps. First we want to show that pointwise convergence of the characteristic functions implies tightness. After this, we are able to use the nice properties of tight sequences of random variables to prove the claim.

We will show tightness by estimating the following term, which will turn out to be a nice upper bound for the probability that $|X_n|$ is large. For arbitrary $\delta > 0$:
$$
\begin{aligned}
\delta^{-1} \int_{-\delta}^{\delta} (1 - \varphi_n(t))\; dt
&= \delta^{-1} \int_{-\delta}^{\delta} \left(1 - E\!\left[e^{itX_n}\right]\right) dt
= \delta^{-1} \int_{-\delta}^{\delta} E\!\left[1 - e^{itX_n}\right] dt \\
&= \delta^{-1} \int_{-\delta}^{\delta} \int_{\mathbb{R}} \left(1 - e^{itx}\right) dP_n(x)\; dt
\overset{\text{Fubini}}{=} \delta^{-1} \int_{\mathbb{R}} \int_{-\delta}^{\delta} \left(1 - e^{itx}\right) dt\; dP_n(x) \\
&= \delta^{-1} \int_{\mathbb{R}} \left( 2\delta - \int_{-\delta}^{\delta} \cos(tx) + i \sin(tx)\; dt \right) dP_n(x)
= \delta^{-1} \int_{\mathbb{R}} \left( 2\delta - \int_{-\delta}^{\delta} \cos(tx)\; dt \right) dP_n(x) \\
&= \delta^{-1} \int_{\mathbb{R}} \left( 2\delta - \frac{2 \sin(\delta x)}{x} \right) dP_n(x)
= 2 \int_{\mathbb{R}} \left( 1 - \frac{\sin(\delta x)}{\delta x} \right) dP_n(x) .
\end{aligned}
$$
Now that this term has a nice shape, we want to find a lower bound for it. We know that $1 - \frac{\sin(ux)}{ux} \ge 0$; this is true because $|\sin x| = \left| \int_0^x \cos(y)\; dy \right| \le |x|$. So the integral only gets smaller if we discard an interval:
$$
\begin{aligned}
2 \int_{\mathbb{R}} \left( 1 - \frac{\sin(\delta x)}{\delta x} \right) dP_n(x)
&\ge 2 \int_{|x| \ge 2/\delta} \left( 1 - \frac{\sin(\delta x)}{\delta x} \right) dP_n(x)
\ge 2 \int_{|x| \ge 2/\delta} \underbrace{\left( 1 - \frac{1}{|\delta x|} \right)}_{\ge 1/2} dP_n(x) \\
&\ge \int_{|x| \ge 2/\delta} dP_n(x) = P_n(\{x : |x| \ge 2/\delta\}) = P(|X_n| \ge 2/\delta) .
\end{aligned}
$$
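As a quick numerical sanity check of this bound (an illustrative aside, assuming NumPy and SciPy; not part of the proof), one can evaluate both sides for $X_n \sim N(0, 1)$, whose characteristic function is $e^{-t^2/2}$:

```python
# Illustrative sketch: numeric check of
#   P(|X| >= 2/delta) <= delta^{-1} * int_{-delta}^{delta} (1 - phi(t)) dt
# for X ~ N(0, 1), where phi(t) = exp(-t^2 / 2).
import numpy as np
from scipy import stats
from scipy.integrate import quad

for delta in (0.5, 1.0, 2.0):
    bound, _ = quad(lambda t: 1 - np.exp(-t**2 / 2), -delta, delta)
    bound /= delta
    tail = 2 * stats.norm.sf(2 / delta)    # P(|X| >= 2/delta)
    print(f"delta = {delta}: tail = {tail:.4f} <= bound = {bound:.4f}")
```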
Pick $\epsilon > 0$. Because $\varphi$ is continuous at $0$ and $\varphi(0) = 1$, we can find a $\delta > 0$ such that
$$|1 - \varphi(t)| \le \frac{\epsilon}{4} \qquad \forall\, |t| \le \delta .$$
We can use this to estimate the following term:
$$\left| \delta^{-1} \int_{-\delta}^{\delta} (1 - \varphi(t))\; dt \right| \le \delta^{-1} \cdot \frac{\epsilon}{4} \cdot 2\delta = \frac{\epsilon}{2} .$$
Since $|\varphi_n(t)| \le 1$, the bounded convergence theorem implies
$$\int_{-\delta}^{\delta} (1 - \varphi_n(t))\; dt \xrightarrow{n \to \infty} \int_{-\delta}^{\delta} (1 - \varphi(t))\; dt .$$
Because of that there exists an $N \in \mathbb{N}$ such that for all $n > N$
$$\left| \delta^{-1} \int_{-\delta}^{\delta} (1 - \varphi_n(t))\; dt - \delta^{-1} \int_{-\delta}^{\delta} (1 - \varphi(t))\; dt \right| \le \frac{\epsilon}{2} .$$
If we put the three bounds together, we get for all $n > N$
$$
\begin{aligned}
P(|X_n| \ge 2/\delta) &\le \delta^{-1} \int_{-\delta}^{\delta} (1 - \varphi_n(t))\; dt
= \delta^{-1} \int_{-\delta}^{\delta} \big( (1 - \varphi_n(t)) - (1 - \varphi(t)) \big)\; dt + \delta^{-1} \int_{-\delta}^{\delta} (1 - \varphi(t))\; dt \\
&\le \left| \delta^{-1} \int_{-\delta}^{\delta} (1 - \varphi_n(t))\; dt - \delta^{-1} \int_{-\delta}^{\delta} (1 - \varphi(t))\; dt \right| + \left| \delta^{-1} \int_{-\delta}^{\delta} (1 - \varphi(t))\; dt \right| \le \epsilon .
\end{aligned}
$$
We just proved that the point $2/\delta$ has the property that the probability of $|X_n|$ exceeding this value is small for all $n > N$. To show tightness we just need to find a bound for the finitely many remaining cases. Because $P_n([-m, m]) \xrightarrow{m \to \infty} 1$, we know that for every $n \in \{1, \dots, N\}$ there exists an $m_n \in \mathbb{R}$ such that
$$P(|X_n| \ge m_n) \le \epsilon .$$
Now define $M = \max\{m_1, \dots, m_N, 2/\delta\}$. Because of the monotonicity of distribution functions, we have shown that $P(|X_n| > M) \le \epsilon$ for all $n \in \mathbb{N}$, i.e. $X_n$ is tight.
Lemma 6.2.1 tells us that $X_n$ has a convergent subsequence. Because of Lemma 6.2.2 we just need to show that every converging subsequence converges to $X$. So suppose $X_{n_k}$ converges to some random variable $Y$ in distribution. Then we know, by the direction "⇒" proved above, that $Y$ has the characteristic function $\varphi_X(t)$. Now $Y \overset{d}{=} X$ because of Theorem 2.2. Since this holds for any converging subsequence, we have shown
$$X_n \overset{d}{\Rightarrow} X .$$
6.3 Lemmas
To apply the Lévy Continuity Theorem to the characteristic function of summed i.i.d.
random variables we need two more lemmas.
Lemma 6.3.1. Suppose $X$ is a random variable with $E\!\left[|X|^2\right] < \infty$. Then $\varphi_X(t)$ can be written as the following Taylor expansion:
$$\varphi_X(t) = 1 + itE[X] - \frac{t^2}{2} E\!\left[X^2\right] + o(t^2) .$$
Recall that $o(t^2)$ means that $\frac{o(t^2)}{t^2} \to 0$ as $t \to 0$. The lemma can be proven by using the estimate
$$\left| e^{ix} - \left( 1 + ix - \frac{x^2}{2} \right) \right| \le \min\!\left( \frac{|x|^3}{3!}, \frac{2|x|^2}{2!} \right)$$
for the error term.
Lemma 6.3.2. Suppose $c_n$ is a sequence of complex numbers with $c_n \xrightarrow{n \to \infty} c$. Then
$$\lim_{n \to \infty} \left( 1 + \frac{c_n}{n} \right)^{\!n} = e^{c} .$$
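A quick numerical check of this lemma (illustrative only, assuming NumPy; the complex sequence $c_n = c + i/\sqrt{n}$ is an arbitrary choice):

```python
# Illustrative sketch: check (1 + c_n/n)^n -> e^c for complex c_n -> c,
# using c = -t^2/2 with t = 1.5, the shape that appears in the CLT proof.
import numpy as np

c = -1.5**2 / 2
for n in (10, 1000, 100_000):
    cn = c + 1j / np.sqrt(n)               # a complex sequence with c_n -> c
    print(f"n = {n:6d}: (1 + c_n/n)^n = {(1 + cn / n)**n:.6f}, "
          f"e^c = {np.exp(c):.6f}")
```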
6.4 Proof of the Central Limit Theorem

Without loss of generality we can assume that $\mu = 0$ and $\sigma^2 = 1$, because $E\!\left[\frac{X_n - \mu}{\sigma}\right] = 0$ and $\mathrm{Var}\!\left[\frac{X_n - \mu}{\sigma}\right] = 1$.
The Lévy Continuity Theorem tells us that it is sufficient to show that the characteristic function of our sum converges to the characteristic function of a standard normally distributed random variable, i.e.
$$\varphi_{S_n/\sqrt{n}}(t) = \varphi_{S_n}\!\left(\frac{t}{\sqrt{n}}\right) \to e^{-t^2/2} \qquad \text{as } n \to \infty ,$$
where $S_n = X_1 + X_2 + \dots + X_n$.
Now we want to use independence by applying Theorem 2.3 to the sum. Fix $t \in \mathbb{R}$:
$$\varphi_{S_n}\!\left(\frac{t}{\sqrt{n}}\right) = \prod_{k=1}^{n} \varphi_{X_k}\!\left(\frac{t}{\sqrt{n}}\right) = \left[ \varphi_{X_1}\!\left(\frac{t}{\sqrt{n}}\right) \right]^{n} .$$
The second equality is true because all random variables are identically distributed and therefore have the same characteristic function (Theorem 2.2). Lemma 6.3.1 and the basic fact that $\mathrm{Var}[Y] = E[Y^2] - E[Y]^2$ for any random variable $Y$ yield
$$
\begin{aligned}
\left[ \varphi_{X_1}\!\left(\frac{t}{\sqrt{n}}\right) \right]^{n}
&= \left( 1 + i \frac{t}{\sqrt{n}} \underbrace{E[X_1]}_{=0} - \frac{t^2}{2n} \underbrace{E\!\left[X_1^2\right]}_{=1} + o\!\left(\frac{t^2}{n}\right) \right)^{\!n} \\
&= \left( 1 - \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right) \right)^{\!n}
= \left( 1 + \frac{-t^2/2 + n \cdot o(1/n)}{n} \right)^{\!n} .
\end{aligned}
$$
By using Lemma 6.3.2, and because $o(n^{-1})/n^{-1} \to 0$ as $n \to \infty$, we have identified the limit:
$$\lim_{n \to \infty} \varphi_{S_n}\!\left(\frac{t}{\sqrt{n}}\right) = e^{-t^2/2} .$$