Notes on Asymptotic Theory: Convergence in Probability and Distribution

Introduction to Econometric Theory, Econ. 770
Jonathan B. Hill
Dept. of Economics, University of North Carolina - Chapel Hill
November 19, 2011

1 Introduction

Let $(\Omega, \mathcal{F}, P)$ be a probability space. Throughout $\theta$ is a parameter of interest like the mean, variance, correlation, or distribution parameters like the Poisson $\lambda$, Binomial $p$, or exponential $\lambda$. Throughout $\{\hat\theta_n\}_{n \ge 1}$ is a sequence of estimators of $\theta$ based on a sample of data $\{X_i\}_{i=1}^n$ with sample size $n \ge 1$. Assume $\hat\theta_n$ is $\mathcal{F}$-measurable for any $n$. Unless otherwise noted, assume the $X_i$ have the same mean and variance: $X_i \sim (\mu, \sigma^2)$. If appropriate, we may have a bivariate sample $\{X_i, Y_i\}_{i=1}^n$ where $X_i \sim (\mu_x, \sigma_x^2)$ and $Y_i \sim (\mu_y, \sigma_y^2)$.

Examples include the sample mean, variance, or correlation:

Sample Mean: $\bar X_n := \frac{1}{n} \sum_{i=1}^n X_i$

Sample Variance #1: $s_n^2 := \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X_n)^2$

Sample Variance #2: $\hat\sigma_n^2 := \frac{1}{n} \sum_{i=1}^n (X_i - \bar X_n)^2$

Sample Correlation: $\hat\rho_n := \frac{\frac{1}{n} \sum_{i=1}^n (X_i - \bar X_n)(Y_i - \bar Y_n)}{\hat\sigma_{x,n} \hat\sigma_{y,n}}$

Similarly, we may estimate a probability by using a sample relative frequency:

$\hat P_n(x) := \frac{1}{n} \sum_{i=1}^n I(X_i \le x)$, the sample percentage of $X_i \le x$.

Notice $\hat P_n(x)$ estimates $P(X_i \le x)$.

We will look at estimator properties: what $\hat\theta_n$ is on average for any sample size $n$, and what $\hat\theta_n$ becomes as the sample size grows. In every case above the estimator is a variant of a straight average (e.g. $\frac{1}{n} \sum_{i=1}^n (X_i - \bar X_n)(Y_i - \bar Y_n)$ is a straight average of $(X_i - \bar X_n)(Y_i - \bar Y_n)$), or a function of a straight average (e.g. $\hat\sigma_n := (\frac{1}{n} \sum_{i=1}^n (X_i - \bar X_n)^2)^{1/2}$, the square root of the average $(X_i - \bar X_n)^2$). We therefore pay particular attention to the sample mean.

2 Unbiasedness

Defn. We say $\hat\theta_n$ is an unbiased estimator of $\theta$ if $E[\hat\theta_n] = \theta$. Define bias as

$B(\hat\theta_n) := E[\hat\theta_n] - \theta$

An unbiased estimator has zero bias: $B(\hat\theta_n) = 0$. If we had an infinite number of samples of size $n$, then the average estimate $\hat\theta_n$ across all samples would be $\theta$. An asymptotically unbiased estimator satisfies $B(\hat\theta_n) \to 0$ as $n \to \infty$.

Claim (Weighted Average): Let $X_i$ have a common mean $\mu := E[X_i]$. Then the weighted average $\hat\mu_n := \sum_{i=1}^n \omega_i X_i$ is an unbiased estimator of $\mu$ if $\sum_{i=1}^n \omega_i = 1$.

Proof:

$E\left[\sum_{i=1}^n \omega_i X_i\right] = \sum_{i=1}^n \omega_i E[X_i] = \mu \sum_{i=1}^n \omega_i = \mu$  QED.

Corollary (Straight Average): The sample mean $\bar X_n := \frac{1}{n} \sum_{i=1}^n X_i$ is a weighted average with flat or uniform weights $\omega_i = 1/n$, hence trivially $\sum_{i=1}^n \omega_i = 1$, hence $E[\bar X_n] = E[X_i] = \mu$.

The problem then arises as to which weighted average $\sum_{i=1}^n \omega_i X_i$ may be preferred in practice, since any with unit summed weights is unbiased. We will discuss the concept of efficiency below, but the minimum mean-squared-error unbiased estimator has uniform weights if $X_i \sim iid(\mu, \sigma^2)$. That is:

Claim (Sample Mean is Best): Let $X_i \sim iid(\mu, \sigma^2)$. Then $\bar X_n$ is the best linear unbiased estimator of $\mu$ (i.e. it is BLUE).

Proof: We want to solve

$\min_{\omega} E\left(\sum_{i=1}^n \omega_i (X_i - \mu)\right)^2$ subject to $\sum_{i=1}^n \omega_i = 1$

The Lagrangian is

$\mathcal{L}(\omega, \lambda) := E\left(\sum_{i=1}^n \omega_i (X_i - \mu)\right)^2 + \lambda\left(1 - \sum_{i=1}^n \omega_i\right)$

where by independence $E(\sum_{i=1}^n \omega_i (X_i - \mu))^2 = \sigma^2 \sum_{i=1}^n \omega_i^2$, hence

$\mathcal{L}(\omega, \lambda) := \sigma^2 \sum_{i=1}^n \omega_i^2 + \lambda\left(1 - \sum_{i=1}^n \omega_i\right)$

The first order conditions are

$\frac{\partial}{\partial \omega_i} \mathcal{L}(\omega, \lambda) = 2\sigma^2 \omega_i - \lambda = 0$ and $\frac{\partial}{\partial \lambda} \mathcal{L}(\omega, \lambda) = 1 - \sum_{i=1}^n \omega_i = 0$

Therefore $\omega_i = \lambda/(2\sigma^2)$ is a constant that sums to $\sum_{i=1}^n \omega_i = 1$. Write $\omega_i = \lambda/(2\sigma^2) =: c$. Since $\sum_{i=1}^n \omega_i = \sum_{i=1}^n c = nc = 1$ it follows $\omega_i = c = 1/n$. QED.

Remark: As in many cases here and below, independence can be substituted with uncorrelatedness since the same proof applies: $E[X_i X_j] = E[X_i]E[X_j]$ for all $i \ne j$. We can also substitute uncorrelatedness with a condition that restricts the total correlation across all $X_i$ and $X_j$ for $i \ne j$, but such generality is typically only exploited in time series settings (where $X_j$ is at a different time period).

Claim (Sample Variance): Let $X_i \sim iid(\mu, \sigma^2)$. The estimator $s_n^2$ is unbiased and $\hat\sigma_n^2$ is negatively biased but asymptotically unbiased.

Proof: Notice

$\frac{n-1}{n} s_n^2 = \hat\sigma_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X_n)^2 = \frac{1}{n} \sum_{i=1}^n \left((X_i - \mu) - (\bar X_n - \mu)\right)^2$
$= \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 + (\bar X_n - \mu)^2 - 2(\bar X_n - \mu) \frac{1}{n} \sum_{i=1}^n (X_i - \mu)$
$= \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 + (\bar X_n - \mu)^2 - 2(\bar X_n - \mu)^2$
$= \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 - (\bar X_n - \mu)^2$

By the iid assumption and the fact that $\bar X_n$ is unbiased,

$E(\bar X_n - \mu)^2 = Var[\bar X_n] = \frac{1}{n^2} \sum_{i=1}^n Var[X_i] = \frac{1}{n^2} n \sigma^2 = \frac{\sigma^2}{n}$

Further, by definition $\sigma^2 := E[(X_i - \mu)^2]$, hence

$E\left[\frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2\right] = \frac{1}{n} \sum_{i=1}^n E[(X_i - \mu)^2] = \frac{1}{n} \sum_{i=1}^n \sigma^2 = \sigma^2$

Therefore

$E[\hat\sigma_n^2] = \sigma^2 - \frac{1}{n}\sigma^2 = \frac{n-1}{n}\sigma^2$

This implies each claim: $E[s_n^2] = \frac{n}{n-1} E[\hat\sigma_n^2] = \sigma^2$ ($s_n^2$ is unbiased), $E[\hat\sigma_n^2] = \frac{n-1}{n}\sigma^2 < \sigma^2$ ($\hat\sigma_n^2$ is negatively biased), and $E[\hat\sigma_n^2] = \frac{n-1}{n}\sigma^2 \to \sigma^2$ ($\hat\sigma_n^2$ is asymptotically unbiased). QED.

Example: We simulate 100 samples of $X_i \sim N(75, 4)$ with sample size $n = 20$. In Figure 1 we plot $\bar X_n$ for each sample. The simulation average of all $\bar X_n$ is 74.983941 and the simulation variance of all $\bar X_n$ is .21615195. In Figure 2 we plot $\hat\mu_n = \sum_{i=1}^n \omega_i X_i$ for each sample, with weights $\omega_i = i / \sum_{j=1}^n j$. The simulation average of all $\hat\mu_n$ is 74.982795 and the simulation variance of all $\hat\mu_n$ is .30940776. Thus, both display the same property of unbiasedness, but $\bar X_n$ exhibits less dispersion across samples.

[Figure 1: $\bar X_n$ across samples]  [Figure 2: $\hat\mu_n$ across samples]

3 Convergence in Mean-Square or $L_p$-Convergence

Defn. We say $\hat\theta_n \in \mathbb{R}$ converges to $\theta$ in mean-square if

$MSE(\hat\theta_n) := E(\hat\theta_n - \theta)^2 \to 0$

We also write $\hat\theta_n \overset{ms}{\to} \theta$ and say $\hat\theta_n \to \theta$ in mean-square.

If $\hat\theta_n$ is unbiased for $\theta$ then

$MSE(\hat\theta_n) = E\left(\hat\theta_n - E[\hat\theta_n]\right)^2 = Var[\hat\theta_n]$

Convergence in mean-square certainly does not require unbiasedness. In general, MSE is

$MSE(\hat\theta_n) = E(\hat\theta_n - \theta)^2 = E\left(\hat\theta_n - E[\hat\theta_n] + E[\hat\theta_n] - \theta\right)^2$
$= E\left(\hat\theta_n - E[\hat\theta_n]\right)^2 + \left(E[\hat\theta_n] - \theta\right)^2 + 2E\left(\hat\theta_n - E[\hat\theta_n]\right)\left(E[\hat\theta_n] - \theta\right)$
$= E\left(\hat\theta_n - E[\hat\theta_n]\right)^2 + \left(E[\hat\theta_n] - \theta\right)^2$

since $E[\hat\theta_n] - \theta$ is just a constant and $E(\hat\theta_n - E[\hat\theta_n]) = E[\hat\theta_n] - E[\hat\theta_n] = 0$. Hence MSE is the variance plus bias squared:

$MSE(\hat\theta_n) = Var[\hat\theta_n] + \left(B(\hat\theta_n)\right)^2$

If $\hat\theta_n \in \mathbb{R}^k$ then we write

$MSE(\hat\theta_n) := E\left[(\hat\theta_n - \theta)(\hat\theta_n - \theta)'\right] \to 0$

hence component-wise convergence. We may similarly write convergence in $L_2$-norm

$E\left\|(\hat\theta_n - \theta)(\hat\theta_n - \theta)'\right\|_2 \to 0$ where $\|A\|_2 := \left(\sum_{i=1}^k \sum_{j=1}^k a_{ij}^2\right)^{1/2}$

or convergence in matrix (spectral) norm:

$E\left\|(\hat\theta_n - \theta)(\hat\theta_n - \theta)'\right\| \to 0$ where $\|A\|$ is the largest eigenvalue of $A$.

Both imply convergence with respect to each element: $E(\hat\theta_{i,n} - \theta_i)^2 \to 0$.

Defn. We say $\hat\theta_n \in \mathbb{R}$ has the property of $L_p$-convergence, or convergence in $L_p$-norm, to $\theta$ if for $p > 0$

$E\left|\hat\theta_n - \theta\right|^p \to 0$
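The bias and MSE algebra above can be checked by simulation in the spirit of the Figure 1-2 experiment. The sketch below (an illustration, not part of the notes; the replication count R is an arbitrary choice) estimates the bias of $\hat\sigma_n^2$ under $X_i \sim N(75,4)$, $n = 20$, and verifies the decomposition MSE = variance + bias squared:

```python
import random
import statistics

# Monte Carlo check that sigma_hat^2 has bias about -sigma^2/n = -0.2
# under the N(75, 4), n = 20 design, and that MSE = variance + bias^2.
random.seed(0)
mu, sigma2, n, R = 75.0, 4.0, 20, 20000

sig_hats = []
for _ in range(R):
    x = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(x) / n
    sig_hats.append(sum((xi - xbar) ** 2 for xi in x) / n)  # sigma_hat^2

bias = statistics.fmean(sig_hats) - sigma2                  # approx -0.2
var = statistics.pvariance(sig_hats)
mse = statistics.fmean((s - sigma2) ** 2 for s in sig_hats)
# mse equals var + bias**2 up to floating-point rounding, since the
# decomposition is an algebraic identity for the simulated sample.
```

Multiplying each $\hat\sigma_n^2$ draw by $n/(n-1)$ gives the corresponding $s_n^2$ draws, whose average bias is near zero.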
Clearly $L_2$-convergence and mean-square convergence are equivalent.

Claim (Sample Mean): Let $X_i \sim iid(\mu, \sigma^2)$. Then $\bar X_n \to \mu$ in mean square.

Proof: $E(\bar X_n - \mu)^2 = Var[\bar X_n] = \sigma^2/n \to 0$. QED.

We only require uncorrelatedness since $Var[\bar X_n] = \sigma^2/n$ still holds.

Claim (Sample Mean): Let $X_i \sim (\mu, \sigma^2)$ be uncorrelated. Then $\bar X_n \to \mu$ in mean square.

Proof: $E(\bar X_n - \mu)^2 = Var[\bar X_n] = \sigma^2/n \to 0$. QED.

In fact, we only need all cross covariances to not be too large as the sample size grows.

Claim (Sample Mean): Let $X_i \sim (\mu, \sigma^2)$ satisfy $\frac{2}{n^2} \sum_{i < j} Cov(X_i, X_j) \to 0$. Then $\bar X_n \to \mu$ in mean square.

Proof: $E(\bar X_n - \mu)^2 = Var[\bar X_n] = \frac{\sigma^2}{n} + \frac{2}{n^2} \sum_{i < j} Cov(X_i, X_j) \to 0$. QED.

Remark: In micro-economic contexts involving cross-sectional data this type of correlatedness is evidently rarely or never entertained. Typically we assume the $X_i$ are uncorrelated. It is, however, profoundly popular in macroeconomic and finance contexts where data are time series. A very large class of time series random variables satisfies both $Cov(X_i, X_j) \ne 0$ $\forall i \ne j$ and $\frac{2}{n^2} \sum_{i < j} Cov(X_i, X_j) \to 0$, and $\bar X_n$ therefore exhibits $\bar X_n \to \mu$ in mean square.

If $X_i \sim iid(\mu, \sigma^2)$ then $\bar X_n \to \mu$ in $L_p$-norm for any $p \in (1, 2]$, but proving the result for non-integer $p \in (1, 2)$ is quite a bit more difficult. There are many types of "maximal inequalities", however, that can be used to prove

$E\left|\sum_{i=1}^n (X_i - \mu)\right|^p \le Kn$ for $p \in (1, 2)$

where $K > 0$ is a finite constant.

Claim (Sample Mean): Let $X_i \sim iid(\mu, \sigma^2)$. Then $\bar X_n \to \mu$ in $L_p$-norm for any $p \in (1, 2)$.

Proof:

$E\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right|^p = \frac{1}{n^p} E\left|\sum_{i=1}^n (X_i - \mu)\right|^p \le \frac{Kn}{n^p} = K n^{1-p} \to 0$

since $p > 1$. QED.

Example: We simulate $X_i \sim N(7, 400)$ with sample sizes $n = 5, 15, 25, \dots, 1000$. In Figure 3 we plot $\bar X_n$ and $Var[\bar X_n] = 400/n$ over sample size $n$. Notice the high volatility for small $n$.

[Figure 3: $\bar X_n$ and $Var[\bar X_n]$ over sample size $n$]

4 Convergence in Probability : WLLN

Defn. We say $\hat\theta_n$ converges in probability to $\theta$ if

$\lim_{n \to \infty} P\left(\left|\hat\theta_n - \theta\right| > \varepsilon\right) = 0$ $\forall \varepsilon > 0$  (1)

We variously write $\hat\theta_n \overset{p}{\to} \theta$ and $plim\, \hat\theta_n = \theta$, and we say $\hat\theta_n$ is a consistent estimator of $\theta$.

Since probability convergence is convergence in the number sequence $\{P(|\hat\theta_n - \theta| > \varepsilon)\}_{n=1}^\infty$, by the definition of a limit it follows that for every $\delta > 0$ there exists $N \ge 1$ such that

$P\left(\left|\hat\theta_n - \theta\right| \le \varepsilon\right) > 1 - \delta$ $\forall n \ge N$

That is, for a large enough sample size $\hat\theta_n$ is guaranteed to be as close to $\theta$ as we choose (i.e. the $\varepsilon$) with as great a probability as we choose (i.e. $1 - \delta$).

Claim (Law of Large Numbers = LLN): If $X_i \sim iid(\mu, \sigma^2)$ then $\bar X_n \overset{p}{\to} \mu$.

Proof: By Chebyshev's inequality and independence, for any $\varepsilon > 0$

$P\left(\left|\bar X_n - \mu\right| > \varepsilon\right) \le \varepsilon^{-2} E\left(\bar X_n - \mu\right)^2 = \varepsilon^{-2} \frac{\sigma^2}{n} \to 0$  QED

Remark 1: We call this a Weak Law of Large Numbers [WLLN] since convergence is in probability. A Strong LLN based on a stronger form of convergence is given below.

Remark 2: We only need uncorrelatedness to get $E(\bar X_n - \mu)^2 = \sigma^2/n \to 0$. The WLLN, however, extends to many forms of dependent random variables.

Remark 3: In the iid case we only need $E|X_i| < \infty$, although the proof is substantially more complicated. Even for non-iid data we typically only need $E|X_i|^{1+\iota} < \infty$ for infinitesimal $\iota > 0$ (pay close attention to scholarly articles you read, and to your own assumptions: usually far stronger assumptions are imposed than are actually required).

The weighted average $\sum_{i=1}^n \omega_{i,n} X_i$ is also consistent as long as the weights decay with the sample size. Thus we write the weight as $\omega_{i,n}$.

Claim: If $X_i \sim iid(\mu, \sigma^2)$ then $\sum_{i=1}^n \omega_{i,n} X_i \overset{p}{\to} \mu$ if $\sum_{i=1}^n \omega_{i,n} = 1$ and $\sum_{i=1}^n \omega_{i,n}^2 \to 0$.

Proof: By Chebyshev's inequality, independence and $\sum_{i=1}^n \omega_{i,n} = 1$, for any $\varepsilon > 0$

$P\left(\left|\sum_{i=1}^n \omega_{i,n} X_i - \mu\right| > \varepsilon\right) \le \varepsilon^{-2} E\left(\sum_{i=1}^n \omega_{i,n} X_i - \mu\right)^2 = \varepsilon^{-2} E\left(\sum_{i=1}^n \omega_{i,n} \{X_i - \mu\}\right)^2$
$= \varepsilon^{-2} \sum_{i=1}^n \omega_{i,n}^2 E\left[(X_i - \mu)^2\right] = \varepsilon^{-2} \sigma^2 \sum_{i=1}^n \omega_{i,n}^2 \to 0$

which proves the claim. QED.

An example is $\bar X_n$ with $\omega_{i,n} = 1/n$, but also the weights $\omega_{i,n} = i / \sum_{j=1}^n j$ used in Figure 2.

Example: We simulate $X_i \sim N(75, 20)$ with sample sizes $n = 5, 15, 25, \dots, 10000$. In Figures 4 and 5 we plot $\bar X_n$ and $\hat\mu_n = \sum_{i=1}^n \omega_{i,n} X_i$ over sample size $n$. Notice the high volatility for small $n$.
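The Chebyshev argument behind the WLLN can be made concrete: the tail probability $P(|\bar X_n - \mu| > \varepsilon)$ shrinks with $n$ and sits below the bound $\sigma^2/(n\varepsilon^2)$ once that bound is non-trivial. The sketch below (an illustration; $\varepsilon$ and the replication count R are arbitrary choices) uses the $N(75, 20)$ design from the example above:

```python
import random

# Monte Carlo estimate of P(|Xbar_n - mu| > eps) for growing n,
# compared with the Chebyshev bound sigma^2 / (n eps^2).
random.seed(1)
mu, sigma2, eps, R = 75.0, 20.0, 1.0, 2000

def tail_prob(n):
    hits = 0
    for _ in range(R):
        xbar = sum(random.gauss(mu, sigma2 ** 0.5) for _ in range(n)) / n
        hits += abs(xbar - mu) > eps
    return hits / R

probs = {n: tail_prob(n) for n in (5, 25, 100)}          # shrinks in n
bounds = {n: sigma2 / (n * eps ** 2) for n in (5, 25, 100)}
```

For small $n$ the Chebyshev bound can exceed 1 and is uninformative; the estimated probabilities themselves still decrease monotonically toward 0.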
[Figure 4: $\bar X_n$ over sample size $n$]  [Figure 5: $\hat\mu_n$ over sample size $n$]

Claim (Slutsky Theorem): Let $\hat\theta_n \in \mathbb{R}^k$. If $\hat\theta_n \overset{p}{\to} \theta$ and $g : \mathbb{R}^k \to \mathbb{R}$ is continuous (except possibly with countably many discontinuity points) then $g(\hat\theta_n) \overset{p}{\to} g(\theta)$.

Corollary: Let $\hat\theta_{i,n} \overset{p}{\to} \theta_i$, $i = 1, 2$. Then $\hat\theta_{1,n} \pm \hat\theta_{2,n} \overset{p}{\to} \theta_1 \pm \theta_2$, $\hat\theta_{1,n} \times \hat\theta_{2,n} \overset{p}{\to} \theta_1 \times \theta_2$, and if $\theta_2 \ne 0$ and $\liminf_{n \to \infty} |\hat\theta_{2,n}| > 0$ then $\hat\theta_{1,n}/\hat\theta_{2,n} \overset{p}{\to} \theta_1/\theta_2$.

Claim: If $X_i \sim iid(\mu, \sigma^2)$ and $E[X_i^4] < \infty$ then $s_n^2 \overset{p}{\to} \sigma^2$.

Proof: Note

$\frac{n-1}{n} s_n^2 = \hat\sigma_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X_n)^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 - (\bar X_n - \mu)^2$

By the LLN $\bar X_n \overset{p}{\to} \mu$, therefore by the Slutsky Theorem $(\bar X_n - \mu)^2 \overset{p}{\to} 0$. By $E[X_i^4] < \infty$ it follows $(X_i - \mu)^2$ is iid with a finite variance, hence it satisfies the LLN: $\frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \overset{p}{\to} E[(X_i - \mu)^2] = \sigma^2$. QED.

Claim: If $(X_i, Y_i) \sim iid(\mu_x, \mu_y, \sigma_x^2, \sigma_y^2)$ and $E[X_i^2 Y_i^2] < \infty$ then the sample correlation $\hat\rho_n \overset{p}{\to} \rho$, the population correlation.

Example: We simulate $X_i \sim N(7, 400)$ and $\epsilon_i \sim N(0, 900)$ and construct $Y_i = -4.3 + 2X_i + \epsilon_i$. The true correlation is

$\rho = \frac{E[X_i Y_i] - E[X_i]E[Y_i]}{\sigma_x \sigma_y} = \frac{-4.3 E[X_i] + 2E[X_i^2] - 7 \times (-4.3 + 2 \times 7)}{20 \times \sqrt{4 \times 400 + 900}}$
$= \frac{-4.3 \times 7 + 2(400 + 7^2) - 7(-4.3 + 2 \times 7)}{20 \times \sqrt{4 \times 400 + 900}} = \frac{800}{1000} = .8$

We estimate the correlation for samples with size $n = 5, 15, 25, \dots, 10000$. Figure 6 demonstrates consistency and therefore the Slutsky Theorem.

[Figure 6: sample correlation $\hat\rho_n$ over sample size $n$]

5 Almost Sure Convergence : SLLN

Defn. We say $\hat\theta_n$ converges almost surely to $\theta$ if

$P\left(\lim_{n \to \infty} \hat\theta_n = \theta\right) = 1$

This is identical to

$\lim_{n \to \infty} P\left(\sup_{m \ge n} \left|\hat\theta_m - \theta\right| > \varepsilon\right) = 0$ $\forall \varepsilon > 0$

We variously write $\hat\theta_n \overset{a.s.}{\to} \theta$ and $\hat\theta_n \to \theta$ a.s., and we say $\hat\theta_n$ is strongly consistent for $\theta$.

We have the following relationships.

Claim: $\hat\theta_n \overset{ms}{\to} \theta$ implies $\hat\theta_n \overset{p}{\to} \theta$; $\hat\theta_n \overset{a.s.}{\to} \theta$ implies $\hat\theta_n \overset{p}{\to} \theta$.

Proof: $P(|\hat\theta_n - \theta| > \varepsilon) \le \varepsilon^{-2} E(\hat\theta_n - \theta)^2$ by Chebyshev's inequality. If $E(\hat\theta_n - \theta)^2 \to 0$ (i.e. $\hat\theta_n \overset{ms}{\to} \theta$) then $P(|\hat\theta_n - \theta| > \varepsilon) \to 0$ where $\varepsilon > 0$ is arbitrary. Therefore $\hat\theta_n \overset{p}{\to} \theta$. Next, $P(|\hat\theta_n - \theta| > \varepsilon) \le P(\sup_{m \ge n} |\hat\theta_m - \theta| > \varepsilon)$ since $\sup_{m \ge n} |\hat\theta_m - \theta| \ge |\hat\theta_n - \theta|$. Therefore if $P(\sup_{m \ge n} |\hat\theta_m - \theta| > \varepsilon) \to 0$ $\forall \varepsilon > 0$ (i.e. $\hat\theta_n \overset{a.s.}{\to} \theta$) then $P(|\hat\theta_n - \theta| > \varepsilon) \to 0$ $\forall \varepsilon > 0$ (i.e. $\hat\theta_n \overset{p}{\to} \theta$). QED.

If $\hat\theta_n$ is bounded wp1 then $\hat\theta_n \overset{p}{\to} \theta$ if and only if $E[\hat\theta_n] \to \theta$, which is asymptotic unbiasedness (see Bierens). By the Slutsky Theorem $\hat\theta_n \overset{p}{\to} \theta$ implies $(\hat\theta_n - \theta)^2 \overset{p}{\to} 0$, hence, again using boundedness, $E[(\hat\theta_n - \theta)^2] \to 0$: convergence in probability implies convergence in mean-square. This proves the following (and gives almost sure convergence as the "strongest" form: the one that implies all the rest).

Claim (a.s. ⇒ i.p. ⇒ m.s.): Let $\hat\theta_n$ be bounded wp1: $P(|\hat\theta_n| \le K) = 1$ for finite $K > 0$. Then $\hat\theta_n \overset{a.s.}{\to} \theta$ implies $\hat\theta_n \overset{p}{\to} \theta$, which implies asymptotic unbiasedness and $\hat\theta_n \overset{ms}{\to} \theta$.

Claim (Strong Law of Large Numbers = SLLN): If $X_i \sim iid(\mu, \sigma^2)$ then $\bar X_n \overset{a.s.}{\to} \mu$.

Remark: The Slutsky Theorem carries over to strong convergence.

Example: Let $X_i \sim iid(\mu, \sigma^2)$ and define

$\hat\theta_n := \frac{1}{1 + \bar X_n^2}$

Then $P(|\hat\theta_n| \le 1) = 1$. Moreover, under the iid assumption $\bar X_n \overset{a.s.}{\to} \mu$ by the SLLN, hence by the Slutsky Theorem

$\hat\theta_n \overset{a.s.}{\to} \frac{1}{1 + \mu^2}$

Therefore

$\hat\theta_n \overset{p}{\to} \frac{1}{1 + \mu^2}$ and $E[\hat\theta_n] \to \theta = \frac{1}{1 + \mu^2}$ and $E\left[\left(\hat\theta_n - \theta\right)^2\right] \to 0$

6 Convergence in Distribution : CLT

Defn. We say $\hat\theta_n$ converges in distribution to a distribution $F$, or to a random variable $Z$ with distribution $F$, if

$\lim_{n \to \infty} P\left(\hat\theta_n \le z\right) = F(z)$ for every $z$ on the support of $F$.

Thus, while $\hat\theta_n$ may itself not be distributed $F$, asymptotically it is. We write $\hat\theta_n \overset{d}{\to} F$ or $\hat\theta_n \overset{d}{\to} Z$ where $Z \sim F$. The notation $\hat\theta_n \overset{d}{\to} Z$ is a bit awkward, because $F$ characterizes infinitely many random variables. We are therefore saying there is some random draw $Z$ from $F$ that $\hat\theta_n$ is becoming. Which random draw is not specified.

6.1 Central Limit Theorem

By far the most famous result concerns the sample mean $\bar X_n$. Convergence of some estimator $\hat\theta_n$ in a monumentally large number of cases reduces to convergence of a sample mean of something, call it $\frac{1}{n}\sum_{i=1}^n Y_i$. This carries over to the sample correlation, regression model estimation methods like Ordinary Least Squares, GMM, and Maximum Likelihood, as well as non-parametric estimation, and on and on. As usual, we limit ourselves to the iid case.
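The bounded-estimator example above lends itself to a quick numerical check. The sketch below (an illustration; the values of $\mu$, $\sigma$ and the grid of sample sizes are arbitrary choices) simulates $\hat\theta_n = 1/(1 + \bar X_n^2)$ and watches it settle near $\theta = 1/(1 + \mu^2)$, as the SLLN plus the Slutsky Theorem predict:

```python
import random

# Strong consistency via Slutsky: theta_hat = 1/(1 + Xbar^2) should
# approach 1/(1 + mu^2) as n grows, and it is bounded in [0, 1] wp1.
random.seed(2)
mu, sigma = 3.0, 2.0
theta = 1.0 / (1.0 + mu ** 2)            # limit value, 0.1 here

def theta_hat(n):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    return 1.0 / (1.0 + xbar ** 2)

errors = {n: abs(theta_hat(n) - theta) for n in (10, 1000, 100000)}
```

Because $\hat\theta_n$ is bounded by 1, the convergence in probability seen here also delivers asymptotic unbiasedness and mean-square convergence, per the claim above.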
The following substantially carries over to non-iid data, and based on a rarely cited obscure fact does not even require a finite variance (I challenge you to find a proof of this, or to ever discover any econometrics textbook that accurately states this).

Claim (Central Limit Theorem = CLT): If $X_i \sim iid(\mu, \sigma^2)$ then

$Z_n := \frac{\sqrt{n}\left(\bar X_n - \mu\right)}{\sigma} \overset{d}{\to} N(0, 1)$

Remark 1: This is famously cited as the Lindeberg-Lévy CLT. Historically, however, the proof arose in different camps sometime between 1910-1930 (covering Lindeberg, Lévy, Chebyshev, Markov and Lyapunov).

Remark 2: Notice by construction $Z_n := \sqrt{n}(\bar X_n - \mu)/\sigma$ is a standardized sample mean, because $E[\bar X_n] = \mu$ by identical distributedness and $Var[\bar X_n] = \sigma^2/n$ by independence and identical distributedness. Thus

$Z_n := \frac{\sqrt{n}\left(\bar X_n - \mu\right)}{\sigma} = \frac{\bar X_n - \mu}{\sigma/\sqrt{n}} = \frac{\bar X_n - E[\bar X_n]}{\sqrt{Var[\bar X_n]}}$

Therefore $Z_n = \sqrt{n}(\bar X_n - \mu)/\sigma$ has mean 0 and variance 1:

$E[Z_n] = \frac{\sqrt{n}}{\sigma}\left(E[\bar X_n] - \mu\right) = 0$
$Var[Z_n] = \frac{n}{\sigma^2} Var[\bar X_n] = \frac{n}{\sigma^2} \frac{\sigma^2}{n} = 1$

Thus, even as $n \to \infty$ the random variable $Z_n \sim (0, 1)$. Although this is a long way from proving $Z_n$ has a definable distribution, even in the limit, it does help to point out that the term $\sqrt{n} \to \infty$ is necessary to stabilize $\bar X_n - \mu$, for otherwise we simply have $\bar X_n - \mu \overset{p}{\to} 0$.

Remark 3: Asymptotically $Z_n := \sqrt{n}(\bar X_n - \mu)/\sigma$ has a standard normal density $(2\pi)^{-1/2} \exp\{-z^2/2\}$.

Proof: Define $Z_i := (X_i - \mu)/\sigma$, hence

$Z_n := \frac{\sqrt{n}\left(\bar X_n - \mu\right)}{\sigma} = \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i$  (2)

We will show the characteristic function $E[e^{\mathrm{i}tZ_n}] \to e^{-t^2/2}$, where $\mathrm{i} = \sqrt{-1}$. The latter is the characteristic function of a standard normal, while characteristic functions and distributions have a unique correspondence: only standard normals have a characteristic function like $e^{-t^2/2}$.

By independence and identical distributedness

$E\left[e^{\mathrm{i}tZ_n}\right] = E\left[\prod_{j=1}^n e^{\mathrm{i}tn^{-1/2}Z_j}\right] = \prod_{j=1}^n E\left[e^{\mathrm{i}tn^{-1/2}Z_j}\right] = \left(E\left[e^{\mathrm{i}tn^{-1/2}Z_1}\right]\right)^n$

Now expand $e^{\mathrm{i}tn^{-1/2}Z_1}$ by a second order Taylor expansion around $tn^{-1/2} = 0$:

$e^{\mathrm{i}tn^{-1/2}Z_1} = 1 + \mathrm{i}\frac{tn^{-1/2}}{1!}Z_1 + \mathrm{i}^2 \frac{t^2 n^{-1}}{2!}Z_1^2 + r_n = 1 + \mathrm{i}\frac{tn^{-1/2}}{1!}Z_1 - \frac{t^2 n^{-1}}{2!}Z_1^2 + r_n$

where $r_n$ is a remainder term that is a function of $tn^{-1/2}$. Now take the expectations as in (2), and note $E[Z_1] = E[(X_1 - \mu)]/\sigma = 0$ and $E[Z_1^2] = E[(X_1 - \mu)^2]/\sigma^2 = \sigma^2/\sigma^2 = 1$:

$E\left[e^{\mathrm{i}tn^{-1/2}Z_1}\right] = 1 + \mathrm{i}\frac{tn^{-1/2}}{1!}E[Z_1] - \frac{1}{n}\frac{t^2}{2!} + E[r_n] = 1 - \frac{1}{n}\frac{t^2}{2} + \frac{1}{n}\delta_n$ where $\delta_n := nE[r_n]$

It is easy to prove $e^{\mathrm{i}tn^{-1/2}Z_1}$ is a bounded random variable, in particular $|e^{\mathrm{i}tn^{-1/2}Z_1}| \le 1$ wp1 (see Bierens), so even if $Z_1$ does not have higher moments we know $|E[r_n]| < \infty$. Further $\delta_n \to 0$ because $E[e^{\mathrm{i}tn^{-1/2}Z_1}] \to 1$.

Now take the $n$-power in (2): by the Binomial expansion

$\left(E\left[e^{\mathrm{i}tn^{-1/2}Z_1}\right]\right)^n = \left(1 - \frac{1}{n}\frac{t^2}{2} + \frac{\delta_n}{n}\right)^n = \left(1 - \frac{1}{n}\frac{t^2}{2}\right)^n + \sum_{k=1}^n \binom{n}{k}\left(\frac{\delta_n}{n}\right)^k \left(1 - \frac{1}{n}\frac{t^2}{2}\right)^{n-k}$

The first term satisfies

$\left(1 - \frac{1}{n}\frac{t^2}{2}\right)^n \to e^{-t^2/2}$

because the sequence $\{(1 + a/n)^n\}_{n \ge 1}$ converges: $(1 + a/n)^n \to e^a$ (simply put $a = -t^2/2$). For the second term notice for large enough $n$ we have $|1 - n^{-1}t^2/2|^{n-k} \le 1$, hence

$\left|\sum_{k=1}^n \binom{n}{k}\left(\frac{\delta_n}{n}\right)^k \left(1 - \frac{1}{n}\frac{t^2}{2}\right)^{n-k}\right| \le \sum_{k=1}^n \binom{n}{k}\left|\frac{\delta_n}{n}\right|^k = \left(1 + \left|\frac{\delta_n}{n}\right|\right)^n - 1$

See Bierens for details that verify $(1 + |\delta_n/n|)^n - 1 \to 0$. QED.

Example (Bernoulli): The most striking way to demonstrate the CLT is to begin with the least normal of data, a Bernoulli random variable, which is discrete and takes only two finite values, and show $\sqrt{n}(\bar X_n - \mu)/\sigma \overset{d}{\to} N(0, 1)$, a continuous random variable with infinite support. We simulate $X_i \sim Bernoulli(.2)$ for $n = 5, 50, 500, 10000$ and compute

$Z_n := \frac{\sqrt{n}\left(\bar X_n - \mu\right)}{\sigma} = \frac{\sqrt{n}\left(\bar X_n - .2\right)}{\sqrt{.2 \times .8}} = \frac{\sqrt{n}\left(\bar X_n - .2\right)}{.4}$

In order to show the small sample distribution of $Z_n$ we need a sample of $Z_n$'s, so we repeat the simulation 1000 times. We plot the relative frequencies of the sample of $Z_n$'s for each $n$. Let $\{Z_{n,j}\}_{j=1}^{1000}$ be the simulated sample of $Z_n$'s. The relative frequencies are the percentages $\frac{1}{1000}\sum_{j=1}^{1000} I(c_k \le Z_{n,j} < c_{k+1})$ for interval endpoints $c_k = [-5, -4.9, -4.8, \dots, 4.9, 5.0]$. See Figure 7. For the sake of comparison, in Figure 8 we plot the relative frequencies for one sample of 1000 iid standard normal random variables $Z_i \sim N(0, 1)$.

Another way to see how $Z_n$ becomes a standard normal random variable is to compute the quantile $\hat q_n$ such that $\hat P(Z_n \le \hat q_n) = .975$. A standard normal satisfies $P(Z \le 1.96) = .975$. We call $\hat q_n$ an empirical quantile since it is based on a simulated set of samples. We simulate 10,000 samples for each size $n = 5, 105, 205, \dots, 5005$ and compute $\hat q_n$. See Figure 9. As $n$ increases $\hat q_n \to 1.96$.
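The empirical-quantile experiment can be reproduced in a few lines. The sketch below (an illustration; the replication count R and the two sample sizes are smaller than the 10,000 replications used in the notes, and the order-statistic quantile is a simple choice among several conventions) simulates standardized Bernoulli(.2) means and checks that the empirical 97.5% quantile moves toward the standard normal value 1.96:

```python
import random

# Empirical 97.5% quantile of Z_n = sqrt(n)(Xbar_n - p)/sqrt(p(1-p))
# for Bernoulli(p) data: far from 1.96 for tiny n, close for large n.
random.seed(3)
p, R = 0.2, 2000
sd = (p * (1 - p)) ** 0.5                # = .4

def empirical_q975(n):
    zs = []
    for _ in range(R):
        xbar = sum(random.random() < p for _ in range(n)) / n
        zs.append(n ** 0.5 * (xbar - p) / sd)
    zs.sort()
    return zs[int(0.975 * R) - 1]        # simple order-statistic quantile

q_small, q_large = empirical_q975(5), empirical_q975(1000)
```

At $n = 5$ the statistic $Z_n$ sits on a coarse six-point lattice and its .975 quantile overshoots 1.96 considerably; at $n = 1000$ the lattice is fine and the empirical quantile lands near 1.96, up to Monte Carlo noise.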
[Figure 7: standardized means $Z_n$ for Bernoulli data, relative frequencies of 1000 draws of $Z_n$, with panels $n = 5$, $n = 50$, $n = 500$, $n = 5000$]

[Figure 8: relative frequencies of 1000 standard normal draws]

[Figure 9: empirical quantiles $\hat q_n$ over sample size $n = 5, 505, 1005, \dots, 5005$]