CENTRAL LIMIT THEOREM

FREDERICK VU

Abstract. This expository paper provides a short introduction to probability theory before proving a central theorem of the subject, the central limit theorem. The theorem concerns the convergence to a normal distribution of suitably rescaled averages of samples of independent, identically distributed random variables with common mean and variance. The paper uses Lévy's continuity theorem to prove the central limit theorem.

Contents
1. Introduction
2. Convergence
3. Variance Matrices
4. Multivariate Normal Distribution
5. Characteristic Functions and the Lévy Continuity Theorem
Acknowledgments
References

1. Introduction

Before we state the central limit theorem, we must first define several terms. An understanding of the terms relies on basic functional analysis fitted with new probability terminology.

Definition 1.1. A probability space is a triple $(\Omega, \mathcal{F}, P)$ where $\Omega$ is a non-empty set, $\mathcal{F}$ is a $\sigma$-algebra of subsets of $\Omega$ (a collection of subsets containing $\Omega$ and closed under complementation and countable unions), and $P$ is a finite measure on the measurable space $(\Omega, \mathcal{F})$ with $P(\Omega) = 1$. $P$ is referred to as a probability.

Definition 1.2. A random variable $X$ is a measurable function from a probability space $(\Omega, \mathcal{F}, P)$ to a measurable space $(S, \mathcal{S})$, where $\mathcal{S}$ is a $\sigma$-algebra of measurable subsets of $S$. Normally $(S, \mathcal{S})$ is the real numbers with the Borel $\sigma$-algebra. We will maintain this notation, but conform to the norm throughout the paper. A random vector is a column vector whose components are real-valued random variables defined on the same probability space. In many places in this paper, a statement concerning random variables will presume the existence of some general probability space.

Definition 1.3. The expected value of a real-valued random variable $X$ is defined as the Lebesgue integral of $X$ with respect to the measure $P$,
\[ E(X) \equiv \int_\Omega X \, dP. \]
For a random vector $\mathbf{X}$, the expected value $E(\mathbf{X})$ is the vector whose components are $E(X_i)$.

Definition 1.4. Because independence is such a central notion in probability, it is best to define it early. First, define the distribution of a random variable as $Q \equiv P \circ X^{-1}$, defined on $(S, \mathcal{S})$ by
\[ Q(B) := P(X^{-1}(B)) \equiv P(X \in B) \equiv P(\{\omega \in \Omega : X(\omega) \in B\}), \qquad B \in \mathcal{S}. \]
This possibly confusing notation can be understood as the pushforward measure of $P$ to $(S, \mathcal{S})$.

Definition 1.5. A set of random variables $X_1, \ldots, X_n$, with $X_i$ a map from $(\Omega, \mathcal{F}, P)$ to $(S_i, \mathcal{S}_i)$, is called independent if the distribution $Q$ of $X := (X_1, \ldots, X_n)$ on the product space $(S = S_1 \times \cdots \times S_n, \ \mathcal{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_n)$ is the product measure $Q = Q_1 \times \cdots \times Q_n$, where $Q_i$ is the distribution of $X_i$; more compactly,
\[ Q(B_1 \times \cdots \times B_n) = \prod_{i=1}^n Q_i(B_i). \]
Two random vectors are said to be independent if their components are pairwise independent as above.

Since the (multivariate) central limit theorem will not be stated until much further along, owing to the required definitions of normal distributions and many lemmas along the way, we pause here to give an informal statement of the central theorem before continuing with a few basic lemmas from probability theory. The central limit theorem says, roughly, that if one repeatedly and independently samples from a fixed distribution, the average of the samples approaches the expected value of the corresponding random variable, and the fluctuations of the average around that value, rescaled by $\sqrt{n}$, produce a bell-shaped curve if one plots a histogram.
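To make this informal statement concrete, the following sketch (my own illustration, not part of the original paper; it assumes NumPy is available) repeatedly draws independent samples from a fixed distribution, forms the standardized sums $(X_1 + \cdots + X_n - n\mu)/(\sigma\sqrt{n})$, and checks that their empirical distribution looks increasingly like the standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)

def standardized_sums(n, trials=20_000):
    """Draw `trials` independent samples of size n from an Exp(1) distribution
    (mean mu = 1, variance sigma^2 = 1) and return the values
    (X_1 + ... + X_n - n*mu) / (sigma * sqrt(n))."""
    samples = rng.exponential(scale=1.0, size=(trials, n))
    return (samples.sum(axis=1) - n) / np.sqrt(n)

# For a standard normal, P(|Z| <= 1) is about 0.6827; the empirical frequency
# should approach that value as n grows, and a histogram of z would look bell-shaped.
for n in (1, 5, 50, 200):
    z = standardized_sums(n)
    print(n, round(float(np.mean(np.abs(z) <= 1.0)), 4))
```

The exponential distribution here is only an example; any fixed distribution with finite variance exhibits the same behaviour, which is what Theorem 5.8 below makes rigorous.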
The following are simple inequalities used often in the paper.

Lemma 1.6 (Markov's Inequality). If $X$ is a nonnegative random variable and $a > 0$, then
\[ P(X \geq a) \leq \frac{E(X)}{a}. \]

Proof. Denote by $I_U$ the indicator function of $U \subseteq \Omega$. Then by monotonicity and linearity of the integral,
\[ E(X) \geq E(a I_{X \geq a}) = a E(I_{X \geq a}) = a P(X \geq a). \]

Corollary 1.7 (Chebyshev's Inequality). For any random variable $X$ and $a > 0$,
\[ P(|X - E(X)| \geq a) \leq \frac{E((X - E(X))^2)}{a^2}. \]

Proof. Consider the random variable $(X - E(X))^2$ and apply Markov's inequality with $a^2$ in place of $a$.

There are many ways to understand probability measures, and it is from these different points of view and their interrelations that one can derive the multitude of theorems that follow.

Definitions 1.8. The cumulative distribution function (cdf) of a random vector $\mathbf{X} = (X_1, \ldots, X_n)$ is the function $F_{\mathbf{X}} : \mathbb{R}^n \to \mathbb{R}$,
\[ F_{\mathbf{X}}(x) = P(X_1 \leq x_1, \ldots, X_n \leq x_n). \]
For a continuous random vector $\mathbf{X}$, define the probability density function as
\[ f_{\mathbf{X}}(x) = \frac{\partial^n}{\partial x_1 \cdots \partial x_n} F_{\mathbf{X}}(x_1, \ldots, x_n). \]
This provides us with another way to write the distribution of a random vector $\mathbf{X}$. For $A \subseteq \mathbb{R}^n$,
\[ P(\mathbf{X} \in A) = \int_A f_{\mathbf{X}}(x) \, dx. \]

Remark 1.9. For a continuous random variable $X$, there is also another way to express the expected value of powers of $X$:
\[ E(X^n) = \int_{\mathbb{R}} x^n f_X(x) \, dx. \tag{1.10} \]
This is just a specific case of
\[ E(g(X)) = \int_{\mathbb{R}} g(x) f_X(x) \, dx, \tag{1.11} \]
where $g$ is a measurable function.

2. Convergence

Definition 2.1. A sequence of cumulative distribution functions $\{F_n\}$ is said to converge in distribution, or converge weakly, to the cumulative distribution function $F$, denoted $F_n \Rightarrow F$, if
\[ \lim_n F_n(x) = F(x) \tag{2.2} \]
for every continuity point $x$ of $F$. If $Q_n$ and $Q$ are the corresponding distributions, we may equivalently define $Q_n \Rightarrow Q$ if for every $A = (-\infty, x)$ for which $Q(\{x\}) = 0$, $\lim_n Q_n(A) = Q(A)$. Similarly, if $X_n$ and $X$ are the respective random variables corresponding to $F_n$ and $F$, we write $X_n \Rightarrow X$, defined equivalently. Since distributions are just measures on some measurable space $(S, \mathcal{S})$, which again is generally the reals, we have a similar understanding of convergence of measures rather than just distributions.

The following theorem allows the representation of weakly convergent measures as the distributions of random variables defined on a common probability space.

Theorem 2.3. Suppose that $\mu_n$ and $\mu$ are probability measures on $(\mathbb{R}, \mathcal{R})$ and $\mu_n \Rightarrow \mu$. Then there exist random variables $X_n$ and $X$ on some $(\Omega, \mathcal{F}, P)$ such that $X_n$, $X$ have respective distributions $\mu_n$, $\mu$, and $X_n(\omega) \to X(\omega)$ for each $\omega \in \Omega$.

Proof. Take $(\Omega, \mathcal{F}, P)$ to be the set $(0, 1)$ with the Borel subsets of $(0, 1)$ and the Lebesgue measure. Denote the cumulative distribution functions associated with $\mu_n$, $\mu$ by $F_n$, $F$, and put
\[ X_n(\omega) = \inf\{x : \omega \leq F_n(x)\} \quad \text{and} \quad X(\omega) = \inf\{x : \omega \leq F(x)\}. \]
The set $\{x : \omega \leq F(x)\}$ is closed on the left, since $F$ is right-continuous as are all cumulative distribution functions, and therefore it is the set $[X(\omega), \infty)$. Hence $\omega \leq F(x)$ if and only if $X(\omega) \leq x$, and
\[ P[\omega : X(\omega) \leq x] = P[\omega : \omega \leq F(x)] = F(x). \]
Thus $X$ has cumulative distribution function $F$; similarly, $X_n$ has cumulative distribution function $F_n$.

To prove pointwise convergence, given $\varepsilon > 0$ choose $x$ so that $X(\omega) - \varepsilon < x < X(\omega)$ and $\mu(\{x\}) = 0$. Then $F(x) < \omega$, and $F_n(x) \to F(x)$ implies that for large enough $n$, $F_n(x) < \omega$, and therefore $X(\omega) - \varepsilon < x < X_n(\omega)$. Thus
\[ \liminf_n X_n(\omega) \geq X(\omega). \]
Now for $\omega' > \omega$, we may similarly choose a $y$ with $X(\omega') < y < X(\omega') + \varepsilon$ and $\mu(\{y\}) = 0$; then $\omega < \omega' \leq F(X(\omega')) \leq F(y)$, so for large enough $n$, $\omega \leq F_n(y)$ and hence $X_n(\omega) \leq y < X(\omega') + \varepsilon$. Thus
\[ \limsup_n X_n(\omega) \leq X(\omega'). \]
Therefore, letting $\omega' \downarrow \omega$, if $X$ is continuous at $\omega$ then $X_n(\omega) \to X(\omega)$. Since $X$ is increasing on $(0, 1)$, it has at most countably many discontinuities. For any point of discontinuity $\omega$, redefine $X_n(\omega) = X(\omega) = 0$. Since the set of discontinuities has Lebesgue measure $0$, the distributions remain unchanged.
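The map $\omega \mapsto \inf\{x : \omega \leq F(x)\}$ used in this proof is the usual quantile (inverse-cdf) transform, and it is also how one simulates from a prescribed distribution in practice. A minimal sketch of that idea (mine, not the paper's; it assumes NumPy and uses the exponential distribution, whose quantile function is explicit):

```python
import numpy as np

rng = np.random.default_rng(1)

def quantile_exp(omega, lam=1.0):
    """Quantile transform X(omega) = inf{x : omega <= F(x)} for the
    exponential cdf F(x) = 1 - exp(-lam * x), x >= 0."""
    return -np.log(1.0 - omega) / lam

omega = rng.uniform(size=200_000)   # points of the probability space Omega = (0, 1)
x = quantile_exp(omega)             # a random variable on (0, 1) with distribution F

# The empirical cdf of x should agree with F at every point; check a few.
for t in (0.5, 1.0, 2.0):
    print(t, round(float(np.mean(x <= t)), 4), round(1 - np.exp(-t), 4))
```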
At the heart of many theorems in probability are the properties of convergence of distribution functions. We now come to several fundamental convergence theorems in probability, though in essence they are rehashings of conventional proofs from functional analysis. The first theorem essentially says that measurable maps preserve limits.

Theorem 2.4. Let $h : \mathbb{R} \to \mathbb{R}$ be measurable and let the set $D_h$ of its discontinuities be measurable. If $\mu_n \Rightarrow \mu$ as before and $\mu(D_h) = 0$, then $\mu_n \circ h^{-1} \Rightarrow \mu \circ h^{-1}$.

Proof. Using the random variables $X_n$, $X$ defined in the previous proof, we see that $h(X_n(\omega)) \to h(X(\omega))$ almost everywhere. Therefore $h(X_n) \Rightarrow h(X)$, where such notation means the composition $h \circ X$. For $A \subseteq \mathbb{R}$, since
\[ P[h(X) \in A] = P[X \in h^{-1}(A)] = \mu(h^{-1}(A)), \]
$h \circ X$ has distribution $\mu h^{-1}$; similarly, $h \circ X_n$ has distribution $\mu_n h^{-1}$, again abusing the notation of composition. Thus $h(X_n) \Rightarrow h(X)$ is equivalent to $\mu_n h^{-1} \Rightarrow \mu h^{-1}$.

Corollary 2.5. If $X_n \Rightarrow X$ and $P[X \in D_h] = 0$, then $h(X_n) \Rightarrow h(X)$.

Lemma 2.6. $\mu_n \Rightarrow \mu$ if and only if $\int f \, d\mu_n \to \int f \, d\mu$ for every bounded, continuous function $f$.

Proof. For the forward direction, by the same construction as in the proof of Theorem 2.3, we have $f(X_n) \to f(X)$ almost everywhere. By change of variables and the dominated convergence theorem,
\[ \int f \, d\mu_n = E(f(X_n)) \to E(f(X)) = \int f \, d\mu. \]
Conversely, consider the cumulative distribution functions $F_n$, $F$ associated with $\mu_n$, $\mu$ and suppose $x < y$. Define the function $f$ by $f(t) = 1$ for $t \leq x$, $f(t) = 0$ for $t \geq y$, and $f(t) = (y - t)/(y - x)$ for $x \leq t \leq y$. Since
\[ F_n(x) \leq \int f \, d\mu_n \quad \text{and} \quad \int f \, d\mu \leq F(y), \]
letting $y \downarrow x$ it follows from the assumption that $\limsup_n F_n(x) \leq F(x)$. If we consider $u < x$ and define a function $g$ similar to $f$ ($1$ up to $u$, $0$ after $x$, and linear in between), we have
\[ F(x-) \leq \liminf_n F_n(x), \]
which implies convergence at continuity points.

Theorem 2.7 (Helly's Selection Theorem). For every sequence of cdf's $F_n$, there exists a subsequence $F_{n_k}$ and a nondecreasing, right-continuous function $F$ such that $\lim_k F_{n_k}(x) = F(x)$ at continuity points $x$ of $F$.

Proof. Enumerate the rationals as $t_1, t_2, \ldots$. Since these are cumulative distribution functions, the sequence $F_n(t_1)$ is bounded and contains a convergent subsequence; denote it by $F_{n_k^{(1)}}$. Similarly, we may find a subsequence of this subsequence, denoted $F_{n_k^{(2)}}$, such that $n_1^{(2)} > n_1^{(1)}$ and $F_{n_k^{(2)}}(t_2)$ converges. Let $n_k = n_1^{(k)}$, the first element of the $k$-th sub-subsequence, so that $F_{n_k}$ is convergent at every rational. Denote by $G(t_m)$ the limit at the rational $t_m$, and define
\[ F(x) = \inf\{G(t_m) : x < t_m\}, \]
which is clearly nondecreasing. For any given $x$ and $\varepsilon > 0$, there exists a rational $r > x$ so that $G(r) < F(x) + \varepsilon$. If $x < y < r$, then $F(y) \leq G(r) < F(x) + \varepsilon$. Hence $F$ is right-continuous. If $F$ is continuous at $x$, choose $y < x$ so that $F(x) - \varepsilon < F(y)$, and choose rationals $r$, $s$ so that $y < r < x < s$ and $G(s) < F(x) + \varepsilon$. From
\[ F(x) - \varepsilon < G(r) \leq G(s) < F(x) + \varepsilon \]
and the monotonicity of the $F_n$, it follows that as $k \to \infty$, $F_{n_k}(x)$ has $\limsup$ and $\liminf$ within $\varepsilon$ of $F(x)$.

Note that the function $F$ described above does not have to be a cdf; consider $F_n$ to be the unit jump at $n$: then $F \equiv 0$. To make the theorem useful, we need a condition that ensures $F$ is a cdf.
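The unit-jump example is worth seeing concretely. The short sketch below (my own illustration, assuming NumPy; not part of the paper) evaluates $F_n(x) = \mathbf{1}\{x \geq n\}$ on a fixed grid and shows every value eventually dropping to $0$, so the pointwise limit is identically $0$ and all of the mass has escaped to $+\infty$:

```python
import numpy as np

def F(n, x):
    """cdf of the unit point mass at n: F_n(x) = 1 if x >= n, else 0."""
    return np.where(x >= n, 1.0, 0.0)

x = np.array([-10.0, 0.0, 10.0, 100.0, 1000.0])
for n in (1, 10, 100, 10_000):
    print(n, F(n, x))
# For every fixed x the value is eventually 0, so lim_n F_n(x) = 0 for all x:
# the limit is nondecreasing and right-continuous, but it is not a cdf.
```

Tightness, defined next, is exactly the condition that rules this escape of mass out.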
Definition 2.8. A sequence of distributions $\mu_n$ on $(\mathbb{R}, \mathcal{R})$ is tight if for each positive $\varepsilon$ there exists a finite interval $(a, b]$ such that $\mu_n((a, b]) > 1 - \varepsilon$ for all $n$.

Theorem 2.9. A sequence of distributions $\mu_n$ is tight if and only if for every subsequence $\mu_{n_k}$ there is a further subsequence $\mu_{n_{k_j}}$ and a probability measure $\mu$ such that $\mu_{n_{k_j}} \Rightarrow \mu$.

Proof. For the forward direction, apply Helly's theorem to the subsequence $F_{n_k}$ of corresponding cdf's to obtain a further subsequence with $\lim_j F_{n_{k_j}}(x) = F(x)$ at continuity points of $F$. From the random variables constructed in the proof of Theorem 2.3, a measure $\mu$ on $(\mathbb{R}, \mathcal{R})$ may be defined so that $\mu(a, b] = F(b) - F(a)$. As a consequence of tightness, given $\varepsilon > 0$ we may choose $a$, $b$ so that $\mu_n(a, b] > 1 - \varepsilon$ for all $n$. We may also decrease $a$ and increase $b$ so that they are points of continuity of $F$. Then $\mu(a, b] \geq 1 - \varepsilon$, so that $\mu$ is a probability measure and $\mu_{n_{k_j}} \Rightarrow \mu$.

Conversely, assume $\mu_n$ is not tight; then there exists $\varepsilon > 0$ such that for any finite interval $(a, b]$, $\mu_n(a, b] \leq 1 - \varepsilon$ for some $n$. Choose the subsequence $n_k$ so that $\mu_{n_k}(-k, k] \leq 1 - \varepsilon$. Now suppose there is a subsequence $\mu_{n_{k_j}}$ of $\mu_{n_k}$ that converges weakly to some probability measure $\mu$. Choose $(a, b]$ so that $\mu(\{a\}) = \mu(\{b\}) = 0$ and $\mu(a, b] > 1 - \varepsilon$. Then for large enough $j$, $(a, b] \subset (-k_j, k_j]$, and so
\[ 1 - \varepsilon \geq \mu_{n_{k_j}}(-k_j, k_j] \geq \mu_{n_{k_j}}(a, b] \to \mu(a, b]. \]
Thus $\mu(a, b] \leq 1 - \varepsilon$, a contradiction.

Corollary 2.10. If $\mu_n$ is a tight sequence of probability measures, and if each weakly convergent subsequence converges to the probability measure $\mu$, then $\mu_n \Rightarrow \mu$.

Proof. By the theorem, every subsequence $\mu_{n_k}$ contains a further subsequence $\mu_{n_{k_j}}$ that converges weakly to $\mu$. Suppose that $\mu_n \Rightarrow \mu$ is false. Then there exists $x$ such that $\mu(\{x\}) = 0$ but $\mu_n(-\infty, x] \to \mu(-\infty, x]$ fails, so there is an $\varepsilon > 0$ such that $|\mu_{n_k}(-\infty, x] - \mu(-\infty, x]| \geq \varepsilon$ along some sequence $n_k$, from which no subsequence may converge weakly to $\mu$, a contradiction.

Like many fundamental theorems in analysis, the following results concern the interaction between limits and integration. This is very important when we are dealing with sequences of random variables and their expected values.

Definition 2.11. The random variables $X_n$ are uniformly integrable if
\[ \lim_{a \to \infty} \sup_n \int_{|X_n| \geq a} |X_n| \, dP = 0, \]
which implies that $\sup_n E(|X_n|) < \infty$.

Theorem 2.12. If $Y_n \Rightarrow Y$ and the $Y_n$ are uniformly integrable, then $Y$ is integrable and $E(Y_n) \to E(Y)$.

Proof. The integrability of $Y$ follows from Fatou's lemma. From the distributions associated with $Y_n$ and $Y$, construct as in the proof of Theorem 2.3 the random variables $X_n$, $X$. Since they have the same distributions and $X_n \to X$ almost everywhere,
\[ E(Y_n) = E(X_n) \to E(X) = E(Y) \]
by Vitali's convergence theorem.

Corollary 2.13. For a positive integer $r$, if $X_n \Rightarrow X$ and $\sup_n E(|X_n|^{r+\varepsilon}) < \infty$ for some $\varepsilon > 0$, then $E(|X|^r) < \infty$ and $E(X_n^r) \to E(X^r)$.

Proof. The $X_n^r$ are uniformly integrable because
\[ \int_{|X_n^r| \geq a} |X_n^r| \, dP \leq \frac{1}{a^{\varepsilon/r}} E(|X_n|^{r+\varepsilon}). \]
By Corollary 2.5, $X_n \Rightarrow X$ implies $X_n^r \Rightarrow X^r$, and the result follows from Theorem 2.12.

3. Variance Matrices

Definitions 3.1. The covariance of two random variables $X$, $Y$ is
\[ \mathrm{Cov}(X, Y) \equiv E[(X - \mu_X)(Y - \mu_Y)], \]
where $\mu_X = E(X)$. The covariance matrix of two random vectors $\mathbf{X} = (X_1, \ldots, X_n)$, $\mathbf{Y} = (Y_1, \ldots, Y_m)$ is the $n \times m$ matrix defined by
\[ [\mathrm{Cov}(\mathbf{X}, \mathbf{Y})]_{ij} = \mathrm{Cov}(X_i, Y_j). \]
The variance matrix of a random vector $\mathbf{X}$ is the square matrix $\mathbf{M}_{\mathbf{X}}$ defined by
\[ [\mathbf{M}_{\mathbf{X}}]_{ij} = [\mathrm{var}(\mathbf{X})]_{ij} = \mathrm{Cov}(X_i, X_j). \]
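As a quick numerical illustration of Definitions 3.1 (my own sketch, assuming NumPy; not part of the paper), the vector $\mathbf{X} = (U, U + V)$, with $U$, $V$ independent and uniform on $(0, 1)$, has $\mathrm{Var}(X_1) = 1/12$, $\mathrm{Cov}(X_1, X_2) = 1/12$, and $\mathrm{Var}(X_2) = 2/12$, and a Monte Carlo estimate of $E[(\mathbf{X} - \mu)(\mathbf{X} - \mu)']$ recovers these values:

```python
import numpy as np

rng = np.random.default_rng(2)

# X = (U, U + V) with U, V independent Uniform(0, 1); Var(U) = Var(V) = 1/12.
n = 1_000_000
u = rng.uniform(size=n)
v = rng.uniform(size=n)
X = np.stack([u, u + v])            # shape (2, n): each column is one sample of X

mu = X.mean(axis=1, keepdims=True)
M = (X - mu) @ (X - mu).T / n       # [M]_{ij} = estimate of Cov(X_i, X_j)

print(M)   # approximately [[0.0833, 0.0833], [0.0833, 0.1667]]
```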
Let us examine some properties of the expected value (mean) and variance matrix of random vectors a little more closely.

Theorem 3.2. Let $\mathbf{Y} = \mathbf{a} + \mathbf{B}\mathbf{X}$, where $\mathbf{a}$ is any fixed vector, $\mathbf{B}$ is any fixed matrix, and $\mathbf{X}$ is a random vector. Then
\[ E(\mathbf{Y}) = \mathbf{a} + \mathbf{B}E(\mathbf{X}), \tag{3.3} \]
\[ \mathrm{var}(\mathbf{Y}) = \mathbf{B}\,\mathrm{var}(\mathbf{X})\,\mathbf{B}'. \tag{3.4} \]

Proof. To prove (3.3), it is enough to note the linearity of the expectation operator. To prove (3.4), we note that the variance matrix may be written
\[ \mathrm{var}(\mathbf{Y}) = E[(\mathbf{Y} - \mu_{\mathbf{Y}})(\mathbf{Y} - \mu_{\mathbf{Y}})']. \]
Thus, evaluating the variance of $\mathbf{Y}$, we get
\begin{align*}
\mathrm{var}(\mathbf{a} + \mathbf{B}\mathbf{X}) &= E[(\mathbf{a} + \mathbf{B}\mathbf{X} - \mu_{\mathbf{Y}})(\mathbf{a} + \mathbf{B}\mathbf{X} - \mu_{\mathbf{Y}})'] \\
&= E[(\mathbf{B}\mathbf{X} - \mathbf{B}\mu_{\mathbf{X}})(\mathbf{B}\mathbf{X} - \mathbf{B}\mu_{\mathbf{X}})'] \\
&= E[\mathbf{B}(\mathbf{X} - \mu_{\mathbf{X}})(\mathbf{X} - \mu_{\mathbf{X}})'\mathbf{B}'] \\
&= \mathbf{B}\,E[(\mathbf{X} - \mu_{\mathbf{X}})(\mathbf{X} - \mu_{\mathbf{X}})']\,\mathbf{B}' \\
&= \mathbf{B}\,\mathrm{var}(\mathbf{X})\,\mathbf{B}'.
\end{align*}

Definition 3.5. An $n \times n$ matrix $\mathbf{A}$ is positive semi-definite if for every vector $\mathbf{c} \in \mathbb{R}^n$,
\[ \mathbf{c}'\mathbf{A}\mathbf{c} \geq 0, \]
and positive definite if for every nonzero vector $\mathbf{c} \in \mathbb{R}^n$,
\[ \mathbf{c}'\mathbf{A}\mathbf{c} > 0. \]

Theorem 3.6. The variance matrix of a random vector $\mathbf{X}$ is symmetric and positive semi-definite.

Proof. Symmetry is immediate, since $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$. Define a scalar random variable $Y = a + \mathbf{c}'\mathbf{X}$, where $a$ is a constant scalar and $\mathbf{c}$ is a constant vector. By Theorem 3.2,
\[ \mathrm{var}(Y) = \mathbf{c}'\mathbf{M}_{\mathbf{X}}\mathbf{c}, \]
and since the variance of a random variable is non-negative by definition, we see that $\mathbf{M}_{\mathbf{X}}$ is positive semi-definite.

4. Multivariate Normal Distribution

The standard multivariate normal distribution is the distribution of a random vector $\mathbf{Z} = (Z_1, \ldots, Z_n)$ whose components are independent and identically distributed standard normal random variables; for notation, we write $\mathbf{Z} \sim N_n(\mathbf{0}, \mathbf{I})$. The distribution is defined by its probability density function
\[ f_{\mathbf{Z}}(x) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} e^{-x_i^2/2} = \frac{1}{(2\pi)^{n/2}} e^{-x'x/2}. \]

Remark 4.1. The value of a Gaussian integral is
\[ I(a) = \int_{\mathbb{R}} e^{-ax^2} \, dx = \sqrt{\frac{\pi}{a}}, \qquad \frac{dI(a)}{da} = -\int_{\mathbb{R}} x^2 e^{-ax^2} \, dx = -\frac{\sqrt{\pi}}{2a^{3/2}}. \]
Taking $a = 1/2$ gives $\int_{\mathbb{R}} x^2 (2\pi)^{-1/2} e^{-x^2/2} \, dx = 1$, so by Definitions 3.1 the variance matrix of $\mathbf{Z}$ is the $n$-dimensional identity matrix, and the mean of $\mathbf{Z}$ is $\mathbf{0}$.

Corollary 4.2. For a random vector $\mathbf{X} = \mathbf{a} + \mathbf{B}\mathbf{Z}$,
\[ E(\mathbf{X}) = \mathbf{a} + \mathbf{B}E(\mathbf{Z}) = \mathbf{a}, \qquad \mathrm{var}(\mathbf{X}) = \mathbf{B}\,\mathrm{var}(\mathbf{Z})\,\mathbf{B}' = \mathbf{B}\mathbf{B}'. \]
We say that $\mathbf{X}$ has the multivariate normal distribution with mean $\mathbf{a}$ and variance $\mathbf{B}\mathbf{B}' = \mathbf{M}$. For notation, we write $\mathbf{X} \sim N_n(\mathbf{a}, \mathbf{M})$, dropping the $n$ if the dimension is clear from context.

It turns out that every symmetric, positive semi-definite matrix is the variance matrix of some normal random vector. Some more properties of matrices will have to be introduced before this can be proven.

Definition 4.3. For a given symmetric, positive semi-definite matrix $\mathbf{A}$, we know by the spectral theorem that it may be written as $\mathbf{A} = \mathbf{O}\mathbf{D}\mathbf{O}'$, where $\mathbf{O}$ is orthogonal and $\mathbf{D}$ is the diagonal matrix of eigenvalues of $\mathbf{A}$. Define the symmetric square root of $\mathbf{A}$ by
\[ \mathbf{A}^{1/2} = \mathbf{O}\mathbf{D}^{1/2}\mathbf{O}'. \]

Theorem 4.4. For a given symmetric, positive semi-definite matrix $\mathbf{M}$ and vector $\mu$, there is a normal random vector $\mathbf{X}$ such that $\mathbf{X} \sim N(\mu, \mathbf{M})$.

Proof. Define $\mathbf{X} = \mu + \mathbf{M}^{1/2}\mathbf{Z}$, where $\mathbf{Z}$ is multivariate standard normal. Using Corollary 4.2 and the fact that $\mathbf{M}^{1/2}(\mathbf{M}^{1/2})' = \mathbf{M}$, this finishes the proof.
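A small numerical sketch of this construction (mine, not the paper's; it assumes NumPy) forms $\mathbf{M}^{1/2}$ from the spectral decomposition, sets $\mathbf{X} = \mu + \mathbf{M}^{1/2}\mathbf{Z}$, and checks that the empirical mean and variance matrix of the samples match $\mu$ and $\mathbf{M}$:

```python
import numpy as np

rng = np.random.default_rng(3)

M = np.array([[2.0, 0.6],
              [0.6, 1.0]])     # symmetric, positive semi-definite target variance matrix
mu = np.array([1.0, -2.0])

# Symmetric square root M^{1/2} = O D^{1/2} O' from the spectral decomposition M = O D O'.
eigvals, O = np.linalg.eigh(M)
M_half = O @ np.diag(np.sqrt(eigvals)) @ O.T

Z = rng.standard_normal(size=(200_000, 2))   # rows are independent N(0, I) vectors
X = mu + Z @ M_half                          # X = mu + M^{1/2} Z  (M_half is symmetric)

print(X.mean(axis=0))                        # approximately mu
print(np.cov(X, rowvar=False))               # approximately M
```

The Cholesky factor of $\mathbf{M}$ would serve equally well as $\mathbf{B}$ in Corollary 4.2; the symmetric square root is used here only because it is the one defined in Definition 4.3.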
5. Characteristic Functions and the Lévy Continuity Theorem

The power of the following function stems from its relation to the probability density function of a random variable: the two are essentially Fourier transforms of one another. It turns out that this is just the tool needed to prove stronger convergence theorems about distributions.

Definition 5.1. The characteristic function of a random vector $\mathbf{X}$ is the function $\phi : \mathbb{R}^n \to \mathbb{C}$,
\[ \phi_{\mathbf{X}}(t) = E(e^{it'\mathbf{X}}), \]
where the expectation is taken with respect to the distribution of $\mathbf{X}$. The characteristic function is sometimes written $\phi(t)$ without the index $\mathbf{X}$.

Some basic properties of $\phi(t)$ follow from the absolute continuity of the integral and the fact that the exponential is a homomorphism from the additive group to the multiplicative group (it carries sums to products):
1. $\phi(0) = 1$;
2. $|\phi(t)| \leq 1$;
3. $\phi_{a + bX}(t) = e^{ita}\phi_X(bt)$;
4. $\phi_{\sum_{n=1}^N X_n}(t) = \prod_{n=1}^N \phi_{X_n}(t)$ for independent $X_1, \ldots, X_N$.

Characteristic functions provide us with a new way of determining convergence of distributions.

Theorem 5.2. A distribution is uniquely determined by its characteristic function.

Proof. We shall prove this by giving an inversion formula: whenever $\mu(\{a\}) = \mu(\{b\}) = 0$,
\[ \mu(a, b] = \lim_{M \to \infty} \frac{1}{2\pi} \int_{-M}^{M} \frac{e^{-ita} - e^{-itb}}{it}\,\phi(t) \, dt. \tag{5.3} \]
We expand the characteristic function and apply Fubini's theorem to get
\[ I_M = \frac{1}{2\pi} \int_{\mathbb{R}} \left[ \int_{-M}^{M} \frac{e^{it(x-a)} - e^{it(x-b)}}{it} \, dt \right] d\mu. \]
Since $\sin x$ is odd and $\cos x$ is even, this simplifies to
\[ I_M = \frac{1}{\pi} \int_{\mathbb{R}} \left[ \int_0^M \frac{\sin t(x-a)}{t} - \frac{\sin t(x-b)}{t} \, dt \right] d\mu. \]
For fixed $x$,
\[ \lim_{M \to \infty} \frac{1}{\pi} \int_0^M \frac{\sin t(x-a)}{t} - \frac{\sin t(x-b)}{t} \, dt = \begin{cases} 0 & \text{for } x < a \text{ or } x > b, \\ 1/2 & \text{for } x = a \text{ or } x = b, \\ 1 & \text{for } a < x < b. \end{cases} \]
By the bounded convergence theorem the limit may be taken inside the outer integral, and since $\mu(\{a\}) = \mu(\{b\}) = 0$, equation (5.3) holds.

Theorem 5.4. The characteristic function of a random vector $\mathbf{X} \sim N(\mu, \mathbf{M})$ is
\[ \phi_{\mathbf{X}}(t) = e^{it'\mu - t'\mathbf{M}t/2}. \]

Proof. First we prove the theorem in the univariate case for $X \sim N(0, 1)$, then use property 4 above to generalize. We have
\[ \phi(t) = E(e^{itX}) = E(\cos(tX)) + iE(\sin(tX)) = E(\cos(tX)) = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} \cos(tx)\, e^{-x^2/2} \, dx, \]
where the third equality comes from $\sin(tx)$ being an odd function. Now we differentiate with respect to $t$ to get
\begin{align*}
\phi'(t) &= -\frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} x\sin(tx)\, e^{-x^2/2} \, dx \\
&= \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} \sin(tx) \, d\!\left(e^{-x^2/2}\right) \\
&= \frac{1}{\sqrt{2\pi}} \left[ \sin(tx)\, e^{-x^2/2} \right]_{-\infty}^{\infty} - \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} t\cos(tx)\, e^{-x^2/2} \, dx \\
&= -t\phi(t).
\end{align*}
With the initial condition $\phi(0) = 1$, this differential equation has the unique solution $\phi(t) = e^{-t^2/2}$. Now if $\mathbf{Z} \sim N(\mathbf{0}, \mathbf{I})$, property 4 gives
\[ \phi_{\mathbf{Z}}(t) = e^{-t't/2}. \]
For an arbitrary $\mathbf{X} \sim N(\mu, \mathbf{M})$, by Theorem 4.4 we may write $\mathbf{X} = \mu + \mathbf{M}^{1/2}\mathbf{Z}$, where $\mathbf{Z}$ is multivariate standard normal. Now the characteristic function of $\mathbf{X}$ is
\begin{align*}
E(e^{it'\mathbf{X}}) &= E(e^{it'(\mathbf{M}^{1/2}\mathbf{Z} + \mu)}) \\
&= e^{it'\mu}\,E(e^{i(\mathbf{M}^{1/2}t)'\mathbf{Z}}) \\
&= e^{it'\mu}\,e^{-(\mathbf{M}^{1/2}t)'(\mathbf{M}^{1/2}t)/2} \\
&= e^{it'\mu - t'\mathbf{M}t/2}.
\end{align*}

The following lemma provides bounds for the error in the Taylor approximation of the exponential function. While this may seem quite far from the goal of the paper, it is one of many lemmas needed to prove Lévy's theorem, from which the central limit theorem follows more easily.

Lemma 5.5. Suppose $X$ is a random variable such that $E(|X|^m) < \infty$. Then
\[ \left| E(e^{itX}) - \sum_{k=0}^m \frac{(it)^k}{k!} E(X^k) \right| \leq E\!\left[ \min\!\left( \frac{|tX|^{m+1}}{(m+1)!}, \ \frac{2|tX|^m}{m!} \right) \right]. \]

Proof. Let $f_m(x) = e^{ix} - \sum_{k=0}^m \frac{(ix)^k}{k!}$, and note that $f_m(x) = i\int_0^x f_{m-1}(y)\, dy$. Iterating this reduction yields an $m$-fold iterated integral whose innermost integrand $e^{iy} - 1$ has modulus at most $2$; thus $|f_m(x)| \leq 2|x|^m/m!$. For the other bound, consider the following identity, obtained using integration by parts:
\[ \int_0^x (x-y)^m e^{iy} \, dy = \frac{x^{m+1}}{m+1} + \frac{i}{m+1} \int_0^x (x-y)^{m+1} e^{iy} \, dy. \]
This recursion leads by induction to the formula
\[ e^{ix} = \sum_{k=0}^m \frac{(ix)^k}{k!} + \frac{i^{m+1}}{m!} \int_0^x (x-y)^m e^{iy} \, dy. \]
Since $|x - y| \leq |x|$ for $y$ between $0$ and $x$, the modulus of the last integral is at most $|x|^{m+1}/(m+1)$, and we obtain $|f_m(x)| \leq |x|^{m+1}/(m+1)!$. Replace $x$ with $tX$, take expectations, and bound the modulus of the expectation by the expectation of the (pointwise smaller of the two) bounds.

Before we prove the Lévy continuity theorem, we need one more algebraic lemma concerning Gaussian integrals.

Lemma 5.6. For $a > 0$,
\[ \int_{\mathbb{R}} e^{-ax^2 + bx} \, dx = \sqrt{\frac{\pi}{a}}\, e^{b^2/4a}. \]

Proof. Rewrite $-ax^2 + bx = -a\!\left(x - \frac{b}{2a}\right)^2 + \frac{b^2}{4a}$. Then
\[ \int_{\mathbb{R}} e^{-ax^2 + bx} \, dx = e^{b^2/4a} \int_{\mathbb{R}} e^{-a(x - b/2a)^2} \, dx = e^{b^2/4a}\sqrt{\frac{\pi}{a}}. \]
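Before turning to the continuity theorem, it may help to see Theorem 5.4 numerically. The sketch below (my own, assuming NumPy; not part of the paper) estimates $E(e^{itX})$ for $X \sim N(\mu, \sigma^2)$ by averaging over samples and compares it with $e^{it\mu - t^2\sigma^2/2}$:

```python
import numpy as np

rng = np.random.default_rng(4)

mu, sigma = 0.5, 1.0
x = rng.normal(loc=mu, scale=sigma, size=500_000)   # samples of X ~ N(mu, sigma^2)

for t in (0.5, 1.0, 2.0):
    empirical = np.mean(np.exp(1j * t * x))          # Monte Carlo estimate of E[e^{itX}]
    exact = np.exp(1j * t * mu - 0.5 * t**2 * sigma**2)
    print(t, np.round(empirical, 4), np.round(exact, 4))
```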
Theorem 5.7 (Lévy's Continuity Theorem). $\mathbf{X}_n \Rightarrow \mathbf{X}$ if and only if $\phi_{\mathbf{X}_n}(t) \to \phi_{\mathbf{X}}(t)$ for all $t \in \mathbb{R}^m$.

Proof. For the forward direction, since $e^{it'\mathbf{X}}$ is bounded and continuous in $\mathbf{X}$, the result follows from Lemma 2.6.

Conversely, assume $\phi_{\mathbf{X}_n}(t) \to \phi_{\mathbf{X}}(t)$ for all $t \in \mathbb{R}^m$. We first show $E(g(\mathbf{X}_n)) \to E(g(\mathbf{X}))$ for continuous $g$ with compact support, and we shall later show that this implies convergence for bounded, continuous functions on all of $\mathbb{R}^m$. Since $g$ is uniformly continuous, for any $\varepsilon > 0$ we can find a $\delta > 0$ such that $\|x - y\| < \delta$ implies $|g(x) - g(y)| < \varepsilon$. Let $\mathbf{Z}$ be a $N(\mathbf{0}, \sigma^2\mathbf{I})$ random vector independent of $\mathbf{X}$ and $\{\mathbf{X}_n\}$. Then
\begin{align*}
|E(g(\mathbf{X}_n)) - E(g(\mathbf{X}))| &= |E(g(\mathbf{X}_n)) - E(g(\mathbf{X}_n + \mathbf{Z})) + E(g(\mathbf{X}_n + \mathbf{Z})) - E(g(\mathbf{X} + \mathbf{Z})) + E(g(\mathbf{X} + \mathbf{Z})) - E(g(\mathbf{X}))| \\
&\leq |E(g(\mathbf{X}_n)) - E(g(\mathbf{X}_n + \mathbf{Z}))| + |E(g(\mathbf{X}_n + \mathbf{Z})) - E(g(\mathbf{X} + \mathbf{Z}))| + |E(g(\mathbf{X} + \mathbf{Z})) - E(g(\mathbf{X}))|.
\end{align*}
The first term above is bounded by $2\varepsilon$ for sufficiently small $\sigma$, because
\begin{align*}
|E(g(\mathbf{X}_n)) - E(g(\mathbf{X}_n + \mathbf{Z}))| &\leq E\big(|g(\mathbf{X}_n) - g(\mathbf{X}_n + \mathbf{Z})|\, I_{\|\mathbf{Z}\| \leq \delta}\big) + E\big(|g(\mathbf{X}_n) - g(\mathbf{X}_n + \mathbf{Z})|\, I_{\|\mathbf{Z}\| > \delta}\big) \\
&\leq \varepsilon + 2\sup_w |g(w)|\, P(\|\mathbf{Z}\| > \delta) \\
&\leq 2\varepsilon,
\end{align*}
where the last line follows from Chebyshev's inequality. Similarly, the third term above is bounded by $2\varepsilon$. We wish to show that the second term goes to $0$, i.e. $E(g(\mathbf{X}_n + \mathbf{Z})) \to E(g(\mathbf{X} + \mathbf{Z}))$. We have
\begin{align*}
E(g(\mathbf{X}_n + \mathbf{Z})) &= \frac{1}{(\sqrt{2\pi}\sigma)^m} \iint g(x + z)\, e^{-z'z/(2\sigma^2)} \, dz \, dF_{\mathbf{X}_n}(x) \\
&= \frac{1}{(\sqrt{2\pi}\sigma)^m} \iint g(u)\, e^{-(u-x)'(u-x)/(2\sigma^2)} \, du \, dF_{\mathbf{X}_n}(x) \\
&= \frac{1}{(\sqrt{2\pi}\sigma)^m} \iint g(u) \prod_{j=1}^m e^{-(u_j - x_j)^2/(2\sigma^2)} \, du \, dF_{\mathbf{X}_n}(x) \\
&= \frac{1}{(\sqrt{2\pi}\sigma)^m} \iint g(u) \prod_{j=1}^m \frac{\sigma}{\sqrt{2\pi}} \int e^{it_j(u_j - x_j) - \sigma^2 t_j^2/2} \, dt_j \, du \, dF_{\mathbf{X}_n}(x) \\
&= \frac{1}{(2\pi)^m} \iiint g(u)\, e^{it'(u - x) - \sigma^2 t't/2} \, dt \, du \, dF_{\mathbf{X}_n}(x) \\
&= \frac{1}{(2\pi)^m} \iint g(u)\, e^{it'u - \sigma^2 t't/2}\, \phi_{\mathbf{X}_n}(-t) \, dt \, du.
\end{align*}
The first equality comes from a multivariate form of equation (1.11), the fourth equality comes from Lemma 5.6, and the last equality comes from the definition of the characteristic function. Since $g$ is continuous with compact support, we can add a constant and rescale $g$ so that it is a density. Furthermore, we may consider the above expression as an expectation with respect to two random vectors, one having a normal density and the other the density $s(g(u) + r)$ for constants $r$, $s$ chosen to make this a permissible density. The expectation is then of the argument $e^{it'u}\phi_{\mathbf{X}_n}(-t)$, which is bounded in modulus by $1$, and thus by the dominated convergence theorem,
\[ \frac{1}{(2\pi)^m} \iint g(u)\, e^{it'u - \sigma^2 t't/2}\, \phi_{\mathbf{X}_n}(-t) \, dt \, du \ \to \ \frac{1}{(2\pi)^m} \iint g(u)\, e^{it'u - \sigma^2 t't/2}\, \phi_{\mathbf{X}}(-t) \, dt \, du. \]
Repeating the above derivation with $\mathbf{X}$ in place of $\mathbf{X}_n$, we have $E(g(\mathbf{X}_n + \mathbf{Z})) \to E(g(\mathbf{X} + \mathbf{Z}))$. Since $\varepsilon$ was arbitrary, this proves $E(g(\mathbf{X}_n)) \to E(g(\mathbf{X}))$ for every continuous $g$ with compact support.

Now it only remains to extend this to bounded, continuous functions defined on all of $\mathbb{R}^m$. Take $g : \mathbb{R}^m \to \mathbb{R}$ such that $|g(x)| \leq A$ for some $A \in \mathbb{R}$. For any $\varepsilon > 0$, we shall show that $\limsup_n |E(g(\mathbf{X}_n)) - E(g(\mathbf{X}))| \leq \varepsilon$. We may find $c \in \mathbb{R}$ such that $P(\|\mathbf{X}\| \geq c) < \varepsilon/(2A)$, and a continuous function $0 \leq g'(x) \leq 1$ such that $g'(x) = 0$ if $\|x\| \geq c + 1$ and $g'(x) = 1$ if $\|x\| \leq c$. It follows that $E(g'(\mathbf{X})) \geq 1 - \varepsilon/(2A)$ and
\begin{align*}
|E(g(\mathbf{X}_n)) - E(g(\mathbf{X}))| &\leq |E(g(\mathbf{X}_n)) - E(g(\mathbf{X}_n)g'(\mathbf{X}_n))| + |E(g(\mathbf{X}_n)g'(\mathbf{X}_n)) - E(g(\mathbf{X})g'(\mathbf{X}))| \\
&\qquad + |E(g(\mathbf{X})g'(\mathbf{X})) - E(g(\mathbf{X}))|,
\end{align*}
where, as $n \to \infty$, the middle term converges to $0$ and the first and third terms are each asymptotically at most $\varepsilon/2$, so that $\limsup_n |E(g(\mathbf{X}_n)) - E(g(\mathbf{X}))| \leq \varepsilon$. The first convergence follows from the first half of the proof and the fact that $g \cdot g'$ is continuous with compact support.
The second convergence, giving the bound on the first term, follows from
\begin{align*}
|E(g(\mathbf{X}_n)) - E(g(\mathbf{X}_n)g'(\mathbf{X}_n))| &\leq E\big(|g(\mathbf{X}_n)| \cdot |1 - g'(\mathbf{X}_n)|\big) \\
&\leq A\,E(1 - g'(\mathbf{X}_n)) \\
&= A(1 - E(g'(\mathbf{X}_n))) \\
&\to A(1 - E(g'(\mathbf{X}))) \leq A \cdot \frac{\varepsilon}{2A} = \frac{\varepsilon}{2},
\end{align*}
and a bound for $|E(g(\mathbf{X})) - E(g(\mathbf{X})g'(\mathbf{X}))|$ is found in the same fashion. Since $\varepsilon$ was arbitrary, $E(g(\mathbf{X}_n)) \to E(g(\mathbf{X}))$ for every bounded, continuous $g$, and the multivariate analogue of Lemma 2.6 then gives $\mathbf{X}_n \Rightarrow \mathbf{X}$.

The following theorem, along with the law of large numbers, is the basis for much of the (subjective) beauty in statistics.

Theorem 5.8 (The Classical Central Limit Theorem). Let $\{\mathbf{X}_n\}$ be a sequence of independent and identically distributed $m$-dimensional random vectors with mean $\mu$ and finite covariance matrix $\mathbf{M}$. Then, denoting by $\Phi_{\mathbf{M}}$ the $N_m(\mathbf{0}, \mathbf{M})$ distribution,
\[ \frac{\mathbf{X}_1 + \cdots + \mathbf{X}_n - n\mu}{\sqrt{n}} \Rightarrow \Phi_{\mathbf{M}}. \]

Proof. We shall prove the theorem for $\mu = \mathbf{0}$ and $\mathbf{M} = \mathbf{I}$, since the general result follows by a linear transformation. Consider first the case $m = 1$, and let $Y_n = (X_1 + \cdots + X_n)/\sqrt{n}$. By Taylor's theorem, in the form of Lemma 5.5 with $E(X) = 0$ and $E(X^2) = 1$,
\[ \phi_{Y_n}(t) = \left( \phi_X\!\left(t/\sqrt{n}\right) \right)^n = \left( 1 - \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right) \right)^n, \]
where $\lim_{n \to \infty} n \cdot o(1/n) = 0$. This converges to $e^{-t^2/2}$, the characteristic function of $N(0, 1)$, and Lévy's continuity theorem proves the theorem for $m = 1$.

For $m > 1$, still with mean $\mathbf{0}$ and variance matrix $\mathbf{I}$, define for a fixed $t \in \mathbb{R}^m$ the sequence of random variables $Y_n = t'\mathbf{X}_n$. Then the $Y_n$ are independent and identically distributed with mean $0$ and variance $t't$. From the preceding, the random variable $Z_n := (Y_1 + \cdots + Y_n)/\sqrt{n}$ converges in distribution to the normal distribution with mean $0$ and variance $t't$. By Lévy's continuity theorem,
\[ \phi_{Z_n}(\xi) \to e^{-(t't)\xi^2/2}, \qquad \xi \in \mathbb{R}. \]
Evaluating the expression at $\xi = 1$, so that $\phi_{(\mathbf{X}_1 + \cdots + \mathbf{X}_n)/\sqrt{n}}(t) = \phi_{Z_n}(1) \to e^{-t't/2}$, and applying Lévy's continuity theorem once again, the proof is complete.

Acknowledgments. It is a pleasure to thank my mentor, Mohammad Rezaei, for leading me to this time-consuming, though beautiful, topic.

References
[1] Rabi Bhattacharya and Edward Waymire. A Basic Course in Probability Theory. Springer, 2007.
[2] Guy Lebanon. The Analysis of Data. http://theanalysisofdata.com/