Notes on Asymptotic Theory: Convergence in
Probability and Distribution
Introduction to Econometric Theory
Econ. 770
Jonathan B. Hill
Dept. of Economics
University of North Carolina - Chapel Hill
November 19, 2011
1 Introduction
Let $(\Omega, \mathcal{F}, P)$ be a probability space. Throughout $\theta$ is a parameter of interest like the mean, variance, correlation, or distribution parameters like Poisson $\lambda$, Binomial $p$, or exponential $\lambda$. Throughout $\{\hat\theta_n\}_{n \ge 1}$ is a sequence of estimators of $\theta$ based on a sample of data $\{X_i\}_{i=1}^n$ with sample size $n \ge 1$. Assume $\hat\theta_n$ is $\mathcal{F}$-measurable for any $n$. Unless otherwise noted, assume the $X_i$'s have the same mean and variance: $X_i \sim (\mu, \sigma^2)$. If appropriate, we may have a bivariate sample $\{X_i, Y_i\}_{i=1}^n$ where $X_i \sim (\mu_x, \sigma_x^2)$ and $Y_i \sim (\mu_y, \sigma_y^2)$.
Examples include the sample mean, variance, or correlation:

Sample Mean: $\bar X_n := \frac{1}{n}\sum_{i=1}^n X_i$

Sample Variance #1: $s_n^2 := \frac{1}{n-1}\sum_{i=1}^n \left(X_i - \bar X_n\right)^2$

Sample Variance #2: $\hat\sigma_n^2 := \frac{1}{n}\sum_{i=1}^n \left(X_i - \bar X_n\right)^2$

Sample Correlation: $\hat\rho_n := \dfrac{\frac{1}{n}\sum_{i=1}^n \left(X_i - \bar X_n\right)\left(Y_i - \bar Y_n\right)}{\hat\sigma_x \hat\sigma_y}$

Similarly, we may estimate a probability by using a sample relative frequency:
$$\hat F_n(x) = \frac{1}{n}\sum_{i=1}^n I\left(X_i \le x\right), \quad \text{the sample percentage of } X_i \le x.$$
Notice $\hat F_n(x)$ estimates $P(X \le x)$.
We will look at estimator properties: what $\hat\theta_n$ is on average for any sample size, and what $\hat\theta_n$ becomes as the sample size grows. In every case above the estimator is a variant of a straight average (e.g. $\frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)(Y_i - \bar Y_n)$ is a straight average of $(X_i - \bar X_n)(Y_i - \bar Y_n)$), or a function of a straight average (e.g. $\hat\sigma_n := (\frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)^2)^{1/2}$, the square root of the average $(X_i - \bar X_n)^2$). We therefore pay particular attention to the sample mean.
2 Unbiasedness
Defn. We say $\hat\theta_n$ is an unbiased estimator of $\theta$ if $E[\hat\theta_n] = \theta$. Define bias as
$$B\left(\hat\theta_n\right) := E[\hat\theta_n] - \theta.$$
An unbiased estimator has zero bias: $B(\hat\theta_n) = 0$. If we had an infinite number of samples of size $n$, then the average estimate $\hat\theta_n$ across all samples would be $\theta$. An asymptotically unbiased estimator satisfies $B(\hat\theta_n) \to 0$ as $n \to \infty$.
Claim (Weighted Average): Let $X_i$ have a common mean $\mu := E[X_i]$. Then the weighted average $\hat\mu_n := \sum_{i=1}^n \omega_i X_i$ is an unbiased estimator of $\mu$ if $\sum_{i=1}^n \omega_i = 1$.

Proof:
$$E\left[\sum_{i=1}^n \omega_i X_i\right] = \sum_{i=1}^n \omega_i E[X_i] = \mu \sum_{i=1}^n \omega_i = \mu. \quad QED.$$
Corollary (Straight Average): The sample mean $\bar X_n := \frac{1}{n}\sum_{i=1}^n X_i$ is a weighted average with flat or uniform weights $\omega_i = 1/n$, hence trivially $\sum_{i=1}^n \omega_i = 1$, hence
$$E\left[\bar X_n\right] = \mu.$$
The problem then arises as to which weighted average $\sum_{i=1}^n \omega_i X_i$ may be preferred in practice, since any with unit summed weights is unbiased. We will discuss the concept of efficiency below, but the minimum mean-squared-error unbiased estimator has uniform weights if $X_i \sim iid(\mu, \sigma^2)$. That is:

Claim (Sample Mean is Best): Let $X_i \sim iid(\mu, \sigma^2)$. Then $\bar X_n$ is the best linear unbiased estimator of $\mu$ (i.e. it is BLUE).
Proof: We want to solve
$$\min_{\omega} E\left(\sum_{i=1}^n \omega_i X_i - \mu\right)^2 \quad \text{subject to} \quad \sum_{i=1}^n \omega_i = 1.$$
The Lagrangian is
$$\mathcal{L}(\omega, \lambda) := E\left(\sum_{i=1}^n \omega_i X_i - \mu\right)^2 + \lambda\left(1 - \sum_{i=1}^n \omega_i\right),$$
where by independence and $\sum_{i=1}^n \omega_i = 1$ we have $E(\sum_{i=1}^n \omega_i X_i - \mu)^2 = E(\sum_{i=1}^n \omega_i \{X_i - \mu\})^2 = \sigma^2 \sum_{i=1}^n \omega_i^2$, hence
$$\mathcal{L}(\omega, \lambda) := \sigma^2 \sum_{i=1}^n \omega_i^2 + \lambda\left(1 - \sum_{i=1}^n \omega_i\right).$$
The first order conditions are
$$\frac{\partial}{\partial \omega_i}\mathcal{L}(\omega, \lambda) = 2\omega_i \sigma^2 - \lambda = 0 \quad \text{and} \quad \frac{\partial}{\partial \lambda}\mathcal{L}(\omega, \lambda) = 1 - \sum_{i=1}^n \omega_i = 0.$$
Therefore $\omega_i = \lambda/(2\sigma^2)$ is a constant that sums to $\sum_{i=1}^n \omega_i = 1$. Write $\omega_i = \lambda/(2\sigma^2) =: \omega$. Since $\sum_{i=1}^n \omega_i = \sum_{i=1}^n \omega = n\omega = 1$ it follows $\omega_i = \omega = 1/n$. QED.
Remark: As in many cases here and below, independence can be substituted for uncorrelatedness since the same proof applies: $E[X_i X_j] = E[X_i]E[X_j]$ for all $i \ne j$. We can also substitute uncorrelatedness with a condition that restricts the total correlation across all $X_i$ and $X_j$ for $i \ne j$, but such generality is typically only exploited in time series settings (where $X_j$ is $X_i$ at a different time period).
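Since any unit-sum weighting is unbiased and has variance $\sigma^2 \sum_{i=1}^n \omega_i^2$, ranking these estimators reduces to ranking $\sum_{i=1}^n \omega_i^2$. A quick numerical check (a sketch assuming NumPy; the competing weightings are illustrative, not from the notes):

    import numpy as np

    n = 20
    uniform = np.full(n, 1.0 / n)                             # omega_i = 1/n
    linear = np.arange(1, n + 1) / np.arange(1, n + 1).sum()  # omega_i = i / sum_j j
    lopsided = np.r_[0.5, np.full(n - 1, 0.5 / (n - 1))]      # half the weight on X_1

    for name, w in [("uniform", uniform), ("linear", linear), ("lopsided", lopsided)]:
        # every weighting sums to one (unbiased); Var = sigma^2 * sum(w^2),
        # which is minimized by the uniform weights (sum(w^2) = 1/n)
        print(name, w.sum(), (w ** 2).sum())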

Claim (Sample Variance): Let $X_i \sim iid(\mu, \sigma^2)$. The estimator $s_n^2$ is unbiased and $\hat\sigma_n^2$ is negatively biased but asymptotically unbiased.
Proof: Notice
$$\hat\sigma_n^2 = \frac{n-1}{n} s_n^2 = \frac{1}{n}\sum_{i=1}^n \left(X_i - \bar X_n\right)^2 = \frac{1}{n}\sum_{i=1}^n \left(X_i - \mu + \mu - \bar X_n\right)^2$$
$$= \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2 + \frac{1}{n}\sum_{i=1}^n \left(\bar X_n - \mu\right)^2 - \frac{2}{n}\sum_{i=1}^n (X_i - \mu)\left(\bar X_n - \mu\right)$$
$$= \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2 + \left(\bar X_n - \mu\right)^2 - 2\left(\bar X_n - \mu\right)^2$$
$$= \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2 - \left(\bar X_n - \mu\right)^2,$$
using $\frac{1}{n}\sum_{i=1}^n (X_i - \mu) = \bar X_n - \mu$. By the iid assumption and the fact that $\bar X_n$ is unbiased,
$$E\left(\bar X_n - \mu\right)^2 = Var\left(\bar X_n\right) = \frac{1}{n^2}\sum_{i=1}^n Var(X_i) = \frac{1}{n^2} \times n\sigma^2 = \frac{\sigma^2}{n}.$$
Further, by definition $\sigma^2 := E[(X_i - \mu)^2]$, hence
$$E\left[\frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2\right] = \frac{1}{n}\sum_{i=1}^n E\left[(X_i - \mu)^2\right] = \frac{1}{n}\sum_{i=1}^n \sigma^2 = \sigma^2.$$
Therefore
$$\frac{n-1}{n} E\left[s_n^2\right] = E\left[\hat\sigma_n^2\right] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2.$$
This implies each claim: $E[s_n^2] = \sigma^2$ ($s_n^2$ is unbiased), $E[\hat\sigma_n^2] = \sigma^2(n-1)/n < \sigma^2$ ($\hat\sigma_n^2$ is negatively biased), and $E[\hat\sigma_n^2] = \sigma^2(n-1)/n \to \sigma^2$ ($\hat\sigma_n^2$ is asymptotically unbiased). QED.

Example: We simulate 100 samples of $X_i \sim N(75, 4)$ with sample size $n = 20$. In Figure 1 we plot $\bar X_n$ for each sample. The simulation average of all $\bar X_n$ is 74.983941 and the simulation variance of all $\bar X_n$ is .21615195. In Figure 2 we plot $\hat\mu_n = \sum_{i=1}^n \omega_i X_i$ for each sample with weights $\omega_i = i/\sum_{j=1}^n j$. The simulation average of all $\hat\mu_n$ is 74.982795 and the simulation variance of all $\hat\mu_n$ is .30940776. Thus, both display the same property of unbiasedness, but $\bar X_n$ exhibits less dispersion across samples.
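A sketch of this simulation (assuming NumPy; the seed is arbitrary, so the exact averages will differ from the figures):

    import numpy as np

    rng = np.random.default_rng(1)
    n, reps = 20, 100
    w = np.arange(1, n + 1) / np.arange(1, n + 1).sum()  # omega_i = i / sum_j j

    x_bars, mu_hats = np.empty(reps), np.empty(reps)
    for r in range(reps):
        x = rng.normal(75.0, 2.0, size=n)                # X_i ~ N(75, 4)
        x_bars[r], mu_hats[r] = x.mean(), w @ x

    # both simulation averages sit near 75 (unbiasedness), but the straight
    # average is less dispersed: Var near 4/20 = .2 versus 4 * sum(w^2) = .26
    print(x_bars.mean(), x_bars.var())
    print(mu_hats.mean(), mu_hats.var())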
Figure 1: $\bar X_n$ across the 100 samples. Figure 2: $\hat\mu_n$ across the 100 samples. [plots omitted]

3 Convergence in Mean-Square or $L_2$-Convergence
Defn. We say $\hat\theta_n \in \mathbb{R}$ converges to $\theta$ in mean-square if
$$MSE\left(\hat\theta_n\right) := E\left(\hat\theta_n - \theta\right)^2 \to 0.$$
We also write $\hat\theta_n \xrightarrow{ms} \theta$ and $\hat\theta_n \to \theta$ in mean-square.

If $\hat\theta_n$ is unbiased for $\theta$ then
$$MSE\left(\hat\theta_n\right) = E\left(\hat\theta_n - E[\hat\theta_n]\right)^2 = Var\left[\hat\theta_n\right].$$
Convergence in mean-square certainly does not require unbiasedness. In general, MSE is
$$MSE\left(\hat\theta_n\right) = E\left(\hat\theta_n - \theta\right)^2 = E\left(\hat\theta_n - E[\hat\theta_n] + E[\hat\theta_n] - \theta\right)^2$$
$$= E\left(\hat\theta_n - E[\hat\theta_n]\right)^2 + \left(E[\hat\theta_n] - \theta\right)^2 + 2E\left(\hat\theta_n - E[\hat\theta_n]\right)\left(E[\hat\theta_n] - \theta\right)$$
$$= E\left(\hat\theta_n - E[\hat\theta_n]\right)^2 + \left(E[\hat\theta_n] - \theta\right)^2,$$
since $E[\hat\theta_n] - \theta$ is just a constant and $E(\hat\theta_n - E[\hat\theta_n]) = E[\hat\theta_n] - E[\hat\theta_n] = 0$. Hence MSE is the variance plus bias squared:
$$MSE\left(\hat\theta_n\right) = E\left(\hat\theta_n - E[\hat\theta_n]\right)^2 + \left(E[\hat\theta_n] - \theta\right)^2 = Var\left[\hat\theta_n\right] + \left(B\left(\hat\theta_n\right)\right)^2.$$

If ^ 2 R then we write
³
´³
´0
MSE(^ ) :=  ^ ¡  ^ ¡  ! 0
hence component wise convergence. We may similarly write convergence in 2 -norm
0
112
° ³
 X

´³
´0 °
X
°
°
° ^  ¡  ^ ¡  ° ! 0 where kk := @
2 A
2
°
°
2
=1 =1
or convergence in matrix (spectral) norm:
° ³
´³
´0 °
°
°
° ^ ¡  ^  ¡  ° ! 0 where kk is the largest eigenvalue of .
°
°
³
´2
Both imply convergence with respect to each element  ^ ¡  ! 0.
Defn. We say $\hat\theta_n \in \mathbb{R}$ has the property of $L_p$-convergence, or convergence in $L_p$-norm, to $\theta$ if for $p > 0$
$$E\left|\hat\theta_n - \theta\right|^p \to 0.$$
Clearly $L_2$-convergence and mean-square convergence are equivalent.
Claim (Sample Mean): Let $X_i \sim iid(\mu, \sigma^2)$. Then $\bar X_n \to \mu$ in mean square.

Proof: $E(\bar X_n - \mu)^2 = Var[\bar X_n] = \sigma^2/n \to 0.$ QED.

We only require uncorrelatedness since $Var[\bar X_n] = \sigma^2/n$ still holds.
Claim (Sample Mean): Let $X_i \sim (\mu, \sigma^2)$ be uncorrelated. Then $\bar X_n \to \mu$ in mean square.

Proof: $E(\bar X_n - \mu)^2 = Var[\bar X_n] = \sigma^2/n \to 0.$ QED.
In fact, we only need all cross covariances to not be too large as the sample size grows.

Claim (Sample Mean): Let $X_i \sim (\mu, \sigma^2)$ satisfy $\frac{1}{n^2}\sum_{i \ne j} Cov(X_i, X_j) \to 0$. Then $\bar X_n \to \mu$ in mean square.

Proof: $E(\bar X_n - \mu)^2 = Var[\bar X_n] = \sigma^2/n + \frac{2}{n^2}\sum_{i < j} Cov(X_i, X_j) \to 0.$ QED.
Remark: In micro-economic contexts involving cross-sectional data this type of correlatedness is evidently rarely or never entertained. Typically we assume the $X_i$'s are uncorrelated. It is, however, profoundly popular in macroeconomic and finance contexts where data are time series. A very large class of time series random variables satisfies both $Cov(X_i, X_j) \ne 0$ $\forall i \ne j$ and $\frac{1}{n^2}\sum_{i \ne j} Cov(X_i, X_j) \to 0$, and therefore exhibits $\bar X_n \to \mu$ in mean square.

If $X_i \sim iid(\mu, \sigma^2)$ then $\bar X_n \to \mu$ in $L_p$-norm for any $p \in (1, 2]$, but proving the result for non-integer $p \in (1, 2)$ is quite a bit more difficult. There are many types of "maximal inequalities", however, that can be used to prove
$$E\left|\sum_{i=1}^n \{X_i - \mu\}\right|^p \le Kn \quad \text{for } p \in (1, 2), \text{ where } K > 0 \text{ is a finite constant.}$$

Claim (Sample Mean): Let $X_i \sim iid(\mu, \sigma^2)$. Then $\bar X_n \to \mu$ in $L_p$-norm for any $p \in (1, 2)$.

Proof:
$$E\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right|^p = \frac{1}{n^p} E\left|\sum_{i=1}^n \{X_i - \mu\}\right|^p \le \frac{1}{n^p} \times Kn = Kn^{1-p} \to 0$$
since $p > 1$. QED.

Example: We simulate $X_i \sim N(7, 400)$ with sample sizes $n = 5, 15, 25, \ldots, 1000$. In Figure 3 we plot $\bar X_n$ and $Var[\bar X_n] = 400/n$ over sample size $n$. Notice the high volatility for small $n$.
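A sketch of this experiment, replicated so the Monte Carlo MSE can be compared with the exact value $\sigma^2/n = 400/n$ (assuming NumPy; the grid of sample sizes is thinned for brevity):

    import numpy as np

    rng = np.random.default_rng(2)
    reps = 2000
    for n in [5, 25, 105, 505, 1005]:
        x = rng.normal(7.0, 20.0, size=(reps, n))    # X_i ~ N(7, 400)
        mse = ((x.mean(axis=1) - 7.0) ** 2).mean()   # Monte Carlo E(Xbar_n - mu)^2
        print(n, mse, 400.0 / n)                     # empirical MSE tracks sigma^2/n -> 0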
Figure 3: $\bar X_n$ and $Var[\bar X_n]$ over sample size $n$. [plot omitted]

4 Convergence in Probability: WLLN
Defn. We say $\hat\theta_n$ converges in probability to $\theta$ if
$$\lim_{n\to\infty} P\left(\left|\hat\theta_n - \theta\right| > \epsilon\right) = 0 \quad \forall \epsilon > 0.$$
We variously write
$$\hat\theta_n \xrightarrow{p} \theta \quad \text{and} \quad \hat\theta_n \to \theta \text{ in probability}, \quad (1)$$
and we say $\hat\theta_n$ is a consistent estimator of $\theta$.

Since probability convergence is convergence in the sequence $\{P(|\hat\theta_n - \theta| > \epsilon)\}_{n=1}^\infty$, by the definition of a limit it follows for every $\epsilon, \delta > 0$ there exists $N \ge 1$ such that
$$P\left(\left|\hat\theta_n - \theta\right| \le \epsilon\right) > 1 - \delta \quad \forall n \ge N.$$
That is, for a large enough sample size $\hat\theta_n$ is guaranteed to be as close to $\theta$ as we choose (i.e. the $\epsilon$) with as great a probability as we choose (i.e. $1 - \delta$).
Claim (Law of Large Numbers = LLN): If $X_i \sim iid(\mu, \sigma^2)$ then $\bar X_n \xrightarrow{p} \mu$.

Proof: By Chebyshev's inequality and independence, for any $\epsilon > 0$
$$P\left(\left|\bar X_n - \mu\right| > \epsilon\right) \le \epsilon^{-2} E\left(\bar X_n - \mu\right)^2 = \epsilon^{-2}\frac{\sigma^2}{n} \to 0. \quad QED.$$

Remark 1: We call this a Weak Law of Large Numbers [WLLN] since convergence is in probability. A Strong LLN based on a stronger form of convergence is given below.

Remark 2: We only need uncorrelatedness to get $E(\bar X_n - \mu)^2 = \sigma^2/n \to 0$. The WLLN, however, extends to many forms of dependent random variables.

Remark 3: In the iid case we only need $E|X_i| < \infty$, although the proof is substantially more complicated. Even for non-iid data we typically only need $E|X_i|^{1+\delta} < \infty$ for infinitesimal $\delta > 0$ (pay close attention to scholarly articles you read, and to your own assumptions: usually far stronger assumptions are imposed than are actually required).
The weighted average $\sum_{i=1}^n \omega_{i,n} X_i$ is also consistent as long as the weights decay with the sample size. Thus we write the weight as $\omega_{i,n}$.

Claim: If $X_i \sim iid(\mu, \sigma^2)$ then $\sum_{i=1}^n \omega_{i,n} X_i \xrightarrow{p} \mu$ if $\sum_{i=1}^n \omega_{i,n} = 1$ and $\sum_{i=1}^n \omega_{i,n}^2 \to 0$.

Proof: By Chebyshev's inequality, independence and $\sum_{i=1}^n \omega_{i,n} = 1$, for any $\epsilon > 0$
$$P\left(\left|\sum_{i=1}^n \omega_{i,n} X_i - \mu\right| > \epsilon\right) \le \epsilon^{-2} E\left(\sum_{i=1}^n \omega_{i,n} X_i - \mu\right)^2 = \epsilon^{-2} E\left(\sum_{i=1}^n \omega_{i,n} \{X_i - \mu\}\right)^2$$
$$= \epsilon^{-2} \sum_{i=1}^n \omega_{i,n}^2 E\left[(X_i - \mu)^2\right] = \epsilon^{-2} \sigma^2 \sum_{i=1}^n \omega_{i,n}^2 \to 0,$$
which proves the claim. QED.

An example is $\bar X_n$ with $\omega_{i,n} = 1/n$, but also the weights $\omega_{i,n} = i/\sum_{j=1}^n j$ used in Figure 2.

Example: We simulate $X_i \sim N(75, 20)$ with sample sizes $n = 5, 15, 25, \ldots, 10000$. In Figures 4 and 5 we plot $\bar X_n$ and $\hat\mu_n = \sum_{i=1}^n \omega_{i,n} X_i$ over sample size $n$. Notice the high volatility for small $n$.
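The WLLN probability itself can be tracked by Monte Carlo. A sketch estimating $P(|\bar X_n - \mu| > \epsilon)$ alongside the Chebyshev bound $\sigma^2/(n\epsilon^2)$ (assuming NumPy; $\epsilon$ and the grid of sample sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    eps, reps = 1.0, 5000
    for n in [5, 25, 105, 505, 2005]:
        x = rng.normal(75.0, np.sqrt(20.0), size=(reps, n))  # X_i ~ N(75, 20)
        p_exceed = (np.abs(x.mean(axis=1) - 75.0) > eps).mean()
        cheby = min(1.0, 20.0 / (n * eps ** 2))              # sigma^2 / (n eps^2)
        print(n, p_exceed, cheby)                            # both shrink to 0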
Figure 4: $\bar X_n$ over sample size $n$. Figure 5: $\hat\mu_n$ over sample size $n$. [plots omitted; vertical axes 70 to 79, sample size $n$ from 5 to 10000]

Claim (Slutsky Theorem): Let $\hat\theta_n \in \mathbb{R}^k$. If $\hat\theta_n \xrightarrow{p} \theta$ and $g : \mathbb{R}^k \to \mathbb{R}^l$ is continuous (except possibly with countably many discontinuity points) then $g(\hat\theta_n) \xrightarrow{p} g(\theta)$.

Corollary: Let $\hat\theta_{i,n} \xrightarrow{p} \theta_i$, $i = 1, 2$. Then $\hat\theta_{1,n} \pm \hat\theta_{2,n} \xrightarrow{p} \theta_1 \pm \theta_2$, $\hat\theta_{1,n} \times \hat\theta_{2,n} \xrightarrow{p} \theta_1 \times \theta_2$, and if $\theta_2 \ne 0$ and $\liminf_{n\to\infty} |\hat\theta_{2,n}| > 0$ then $\hat\theta_{1,n}/\hat\theta_{2,n} \xrightarrow{p} \theta_1/\theta_2$.
Claim: If $X_i \sim iid(\mu, \sigma^2)$ and $E[X_i^4] < \infty$ then $s_n^2 \xrightarrow{p} \sigma^2$.

Proof: Note
$$\frac{n-1}{n} s_n^2 = \frac{1}{n}\sum_{i=1}^n \left(X_i - \bar X_n\right)^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2 - \left(\bar X_n - \mu\right)^2.$$
By the LLN $\bar X_n \xrightarrow{p} \mu$, therefore by the Slutsky Theorem $(\bar X_n - \mu)^2 \xrightarrow{p} 0$. By $E[X_i^4] < \infty$ it follows $(X_i - \mu)^2$ is iid with a finite variance, hence it satisfies the LLN:
$$\frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2 \xrightarrow{p} E\left[(X_i - \mu)^2\right] = \sigma^2. \quad QED.$$
Claim: If $X_i, Y_i \sim iid(\mu_x, \mu_y, \sigma_x^2, \sigma_y^2)$ and $E[X_i^2 Y_i^2] < \infty$ then the sample correlation $\hat\rho_n \xrightarrow{p} \rho$, the population correlation.


Example: We simulate $X_i \sim N(7, 400)$ and $\epsilon_i \sim N(0, 900)$ and construct $Y_i = -43 + 2X_i + \epsilon_i$. The true correlation is
$$\rho = \frac{E[XY] - E[X]E[Y]}{\sigma_x \sigma_y} = \frac{-43 E[X] + 2E[X^2] - 7 \times (-43 + 2 \times 7)}{20 \times \sqrt{4 \times 400 + 900}}$$
$$= \frac{-43 \times 7 + 2\left(400 + 7^2\right) - 7(-43 + 2 \times 7)}{20 \times \sqrt{4 \times 400 + 900}} = .8.$$
We estimate the correlation for samples with size $n = 5, 15, 25, \ldots, 10000$. Figure 6 demonstrates consistency and therefore the Slutsky Theorem.
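A sketch of this experiment (assuming NumPy; the grid of sample sizes is thinned relative to the figure):

    import numpy as np

    rng = np.random.default_rng(4)
    for n in [5, 55, 505, 5005, 50005]:
        x = rng.normal(7.0, 20.0, size=n)                    # X ~ N(7, 400)
        y = -43.0 + 2.0 * x + rng.normal(0.0, 30.0, size=n)  # eps ~ N(0, 900)
        xc, yc = x - x.mean(), y - y.mean()
        rho_hat = (xc * yc).mean() / np.sqrt((xc ** 2).mean() * (yc ** 2).mean())
        print(n, rho_hat)                                    # settles near rho = .8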
Figure 6: Correlation $\hat\rho_n$ over sample size $n$. [plot omitted; vertical axis .50 to 1.00, sample size $n$ from 5 to 10000]
5 Almost Sure Convergence: SLLN
Defn. We say $\hat\theta_n$ converges almost surely to $\theta$ if
$$P\left(\lim_{n\to\infty} \hat\theta_n = \theta\right) = 1.$$
This is identical to
$$\lim_{n\to\infty} P\left(\sup_{m \ge n}\left|\hat\theta_m - \theta\right| > \epsilon\right) = 0 \quad \forall \epsilon > 0.$$
We variously write
$$\hat\theta_n \xrightarrow{a.s.} \theta \quad \text{and} \quad \hat\theta_n \to \theta \ wp1,$$
and we say $\hat\theta_n$ is strongly consistent for $\theta$.
We have the following relationships.

Claim: (a) $\hat\theta_n \xrightarrow{ms} \theta$ implies $\hat\theta_n \xrightarrow{p} \theta$; (b) $\hat\theta_n \xrightarrow{a.s.} \theta$ implies $\hat\theta_n \xrightarrow{p} \theta$.

Proof: (a) $P(|\hat\theta_n - \theta| > \epsilon) \le \epsilon^{-2} E(\hat\theta_n - \theta)^2$ by Chebyshev's inequality. If $E(\hat\theta_n - \theta)^2 \to 0$ (i.e. $\hat\theta_n \xrightarrow{ms} \theta$) then $P(|\hat\theta_n - \theta| > \epsilon) \to 0$ where $\epsilon > 0$ is arbitrary. Therefore $\hat\theta_n \xrightarrow{p} \theta$.

(b) $P(|\hat\theta_n - \theta| > \epsilon) \le P(\sup_{m \ge n} |\hat\theta_m - \theta| > \epsilon)$ since $\sup_{m \ge n} |\hat\theta_m - \theta| \ge |\hat\theta_n - \theta|$. Therefore if $P(\sup_{m \ge n} |\hat\theta_m - \theta| > \epsilon) \to 0$ $\forall \epsilon > 0$ (i.e. $\hat\theta_n \xrightarrow{a.s.} \theta$) then $P(|\hat\theta_n - \theta| > \epsilon) \to 0$ $\forall \epsilon > 0$ (i.e. $\hat\theta_n \xrightarrow{p} \theta$). QED.

If ^ is bounded wp1 then ^ !  if and only if [^ ] !  which is asymptotic un

biasedness (see Bierens). By the Slutsky Theorem ^ !  implies (^  ¡)2 ! 0 hence
[(^
 ¡ )2 ] ! 0: convergence in probability implies convergence in mean-square.
This proves the following (and gives almost sure convergence as the "strongest" form:
the one that implies all the rest).
Claim (a.s. =) i.p. =) m.s.):
Let ^ be bounded wp1:  (j^ j · ) = 1
9


for …nite   0. Then ^  !  implies ^ !  implies asymptotic unbiasedness and

^
 ! .
Claim (Strong Law of Large Numbers = SLLN): If $X_i \sim iid(\mu, \sigma^2)$ then $\bar X_n \xrightarrow{a.s.} \mu$.

Remark: The Slutsky Theorem carries over to strong convergence.

Example: Let $X_i \sim iid(\mu, \sigma^2)$ and define
$$\hat\theta_n := \frac{1}{1 + \bar X_n^2}.$$
Then $P(|\hat\theta_n| \le 1) = 1$. Moreover, under the iid assumption $\bar X_n \xrightarrow{a.s.} \mu$ by the SLLN, hence by the Slutsky Theorem
$$\hat\theta_n \xrightarrow{a.s.} \frac{1}{1 + \mu^2}.$$
Therefore
$$\hat\theta_n \xrightarrow{p} \frac{1}{1 + \mu^2},$$
and $E[\hat\theta_n] \to \theta = 1/(1 + \mu^2)$ and
$$E\left[\left(\hat\theta_n - \theta\right)^2\right] \to 0.$$
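Almost sure convergence is a pathwise statement, so this example can be illustrated by following running means along one long simulated sequence. A sketch (assuming NumPy; $\mu = 2$ is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(5)
    mu = 2.0
    x = rng.normal(mu, 1.0, size=100000)            # one path of X_i ~ iid(mu, 1)
    x_bar = np.cumsum(x) / np.arange(1, x.size + 1) # running means Xbar_n
    theta_hat = 1.0 / (1.0 + x_bar ** 2)            # bounded by 1 wp1

    theta = 1.0 / (1.0 + mu ** 2)                   # the limit 1/(1 + mu^2) = .2
    for n in [10, 100, 1000, 10000, 100000]:
        print(n, theta_hat[n - 1], theta)           # the path settles on theta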
6 Convergence in Distribution: CLT
Defn. We say $\hat\theta_n$ converges in distribution to a distribution $F$, or to a random variable $Z$ with distribution $F$, if
$$\lim_{n\to\infty} P\left(\hat\theta_n \le z\right) = F(z) \quad \text{for every } z \text{ at which } F \text{ is continuous.}$$
Thus, while $\hat\theta_n$ may itself not be distributed $F$, asymptotically it is. We write
$$\hat\theta_n \xrightarrow{d} F \quad \text{or} \quad \hat\theta_n \xrightarrow{d} Z \text{ where } Z \sim F.$$
The notation $\hat\theta_n \xrightarrow{d} Z$ is a bit awkward, because $F$ characterizes infinitely many random variables. We are therefore saying there is some random draw $Z$ from $F$ that $\hat\theta_n$ is becoming. Which random draw is not specified.
6.1 Central Limit Theorem
By far the most famous result concerns the sample mean $\bar X_n$. Convergence of some estimator $\hat\theta_n$ in a monumentally large number of cases reduces to convergence of a sample mean of something, call it $Y_i$. This carries over to the sample correlation, regression model estimation methods like Ordinary Least Squares, GMM, and Maximum Likelihood, as well as non-parametric estimation, and on and on.

As usual, we limit ourselves to the iid case. The following substantially carries over to non-iid data, and based on a rarely cited obscure fact does not even require a finite variance (I challenge you to find a proof of this, or to ever discover any econometrics textbook that accurately states this).
Claim (Central Limit Theorem = CLT): If $X_i \sim iid(\mu, \sigma^2)$ then
$$Z_n := \frac{\sqrt{n}\left(\bar X_n - \mu\right)}{\sigma} \xrightarrow{d} N(0, 1).$$
Remark 1: This is famously cited as the Lindeberg-Lévy CLT. Historically, however, the proof arose in different camps sometime between 1910-1930 (covering Lindeberg, Lévy, Chebyshev, Markov and Lyapunov).

Remark 2: Notice by construction $Z_n := \sqrt{n}(\bar X_n - \mu)/\sigma$ is a standardized sample mean, because $E[\bar X_n] = \mu$ by identical distributedness and $Var[\bar X_n] = \sigma^2/n$ by independence and identical distributedness. Thus
$$Z_n := \frac{\sqrt{n}\left(\bar X_n - \mu\right)}{\sigma} = \frac{\bar X_n - \mu}{\sqrt{\sigma^2/n}} = \frac{\bar X_n - E[\bar X_n]}{\sqrt{Var[\bar X_n]}}.$$
Therefore $\sqrt{n}(\bar X_n - \mu)/\sigma$ has mean 0 and variance 1:
$$E\left[\frac{\sqrt{n}\left(\bar X_n - \mu\right)}{\sigma}\right] = \frac{\sqrt{n}}{\sigma}\left(E\left[\bar X_n\right] - \mu\right) = 0$$
$$E\left[\left(\frac{\sqrt{n}\left(\bar X_n - \mu\right)}{\sigma}\right)^2\right] = \frac{n}{\sigma^2} E\left(\bar X_n - \mu\right)^2 = \frac{n}{\sigma^2} \times \frac{\sigma^2}{n} = 1.$$
Thus, even as $n \to \infty$ the random variable $Z_n \sim (0, 1)$. Although this is a long way from proving $Z_n$ has a definable distribution, even in the limit, it does help to point out that the term $\sqrt{n} \to \infty$ is necessary to stabilize $\bar X_n$, for otherwise we simply have $(\bar X_n - \mu)/\sigma \xrightarrow{p} 0$.

Remark 3: Asymptotically $Z_n := \sqrt{n}(\bar X_n - \mu)/\sigma$ has a standard normal density $(2\pi)^{-1/2}\exp\{-z^2/2\}$.
Proof: Define $Z_i := (X_i - \mu)/\sigma$, hence
$$Z_n := \frac{\sqrt{n}\left(\bar X_n - \mu\right)}{\sigma} = \frac{1}{\sqrt{n}}\sum_{i=1}^n Z_i.$$
We will show the characteristic function $E[e^{i\lambda Z_n}] \to e^{-\lambda^2/2}$. The latter is the characteristic function of a standard normal, while characteristic functions and distributions have a unique correspondence: only standard normals have a characteristic function like $e^{-\lambda^2/2}$.

By independence and identical distributedness
$$E\left[e^{i\lambda Z_n}\right] = E\left[\prod_{j=1}^n e^{i\lambda n^{-1/2} Z_j}\right] = \prod_{j=1}^n E\left[e^{i\lambda n^{-1/2} Z_j}\right] = \left(E\left[e^{i\lambda n^{-1/2} Z_1}\right]\right)^n. \quad (2)$$
Now expand $e^{i\lambda n^{-1/2} Z_j}$ around $\lambda = 0$ by a second order Taylor expansion:
$$e^{i\lambda n^{-1/2} Z_j} = 1 + \frac{i\lambda n^{-1/2} Z_j}{1!} + \frac{(i\lambda)^2 Z_j^2}{n \cdot 2!} + R_n = 1 + \frac{i\lambda n^{-1/2} Z_j}{1!} - \frac{\lambda^2 Z_j^2}{n \cdot 2!} + R_n,$$
where $R_n$ is a remainder term that is a function of $\lambda/n^{1/2}$. Now take expectations as in (2), and note $E[Z_j] = E[(X_j - \mu)/\sigma] = 0$ and $E[Z_j^2] = E[(X_j - \mu)^2]/\sigma^2 = \sigma^2/\sigma^2 = 1$:
$$E\left[e^{i\lambda n^{-1/2} Z_j}\right] = 1 + \frac{i\lambda n^{-1/2}}{1!} E[Z_j] - \frac{\lambda^2}{n \cdot 2!} E\left[Z_j^2\right] + E[R_n] = 1 - \frac{1}{n}\frac{\lambda^2}{2} + r_n, \quad \text{where } r_n := E[R_n].$$
It is easy to prove $e^{i\lambda n^{-1/2} Z_j}$ is a bounded random variable, in particular $|e^{i\lambda n^{-1/2} Z_j}| \le 1$ wp1 (see Bierens), so even if $Z_j$ does not have higher moments we know $|r_n| < \infty$. Further $n r_n \to 0$ because $E[e^{i\lambda n^{-1/2} Z_j}] \to 1$.

Now take the $n$-power in (2): by the Binomial expansion
$$\left(E\left[e^{i\lambda n^{-1/2} Z_j}\right]\right)^n = \left(1 - \frac{1}{n}\frac{\lambda^2}{2} + r_n\right)^n = \sum_{k=0}^n \binom{n}{k} r_n^k \left(1 - \frac{1}{n}\frac{\lambda^2}{2}\right)^{n-k}$$
$$= \left(1 - \frac{1}{n}\frac{\lambda^2}{2}\right)^n + \sum_{k=1}^n \binom{n}{k} r_n^k \left(1 - \frac{1}{n}\frac{\lambda^2}{2}\right)^{n-k}.$$
The first term satisfies
$$\left(1 - \frac{1}{n}\frac{\lambda^2}{2}\right)^n \to e^{-\lambda^2/2}$$
because the sequence $\{(1 + a/n)^n\}_{n \ge 1}$ converges: $(1 + a/n)^n \to e^a$ (simply put $a = -\lambda^2/2$). For the second term notice for large enough $n$ we have $|1 - n^{-1}\lambda^2/2|^{n-k} \le 1$, hence
$$\left|\sum_{k=1}^n \binom{n}{k} r_n^k \left(1 - \frac{1}{n}\frac{\lambda^2}{2}\right)^{n-k}\right| \le \sum_{k=1}^n \binom{n}{k} |r_n|^k \le \sum_{k=0}^n \binom{n}{k} |r_n|^k - 1 = \left(1 + |r_n|\right)^n - 1.$$
See Bierens for details that verify $(1 + |r_n|)^n - 1 \to 0$. QED.
Example (Bernoulli): The most striking way to demonstrate the CLT is to begin with the least normal of data, a Bernoulli random variable, which is discrete and takes only two finite values, and show $\sqrt{n}(\bar X_n - \mu)/\sigma \xrightarrow{d} N(0, 1)$, a continuous random variable with infinite support.

We simulate $X_i \sim Bernoulli(.2)$ for $n = 5, 50, 500, 10000$ and compute
$$Z_n := \frac{\sqrt{n}\left(\bar X_n - \mu\right)}{\sigma} = \frac{\sqrt{n}\left(\bar X_n - .2\right)}{\sqrt{.2 \times .8}} = \frac{\sqrt{n}\left(\bar X_n - .2\right)}{.4}.$$
In order to show the small sample distribution of $Z_n$ we need a sample of $Z_n$'s, so we repeat the simulation 1000 times. We plot the relative frequencies of the sample of $Z_n$'s for each $n$. Let $\{Z_{n,j}\}_{j=1}^{1000}$ be the simulated sample of $Z_n$'s. The relative frequencies are the percentage $\frac{1}{1000}\sum_{j=1}^{1000} I(z_k \le Z_{n,j} < z_{k+1})$ for interval endpoints $z_k = [-5, -4.9, -4.8, \ldots, 4.9, 5.0]$. See Figure 7. For the sake of comparison, in Figure 8 we plot the relative frequencies for one sample of 1000 iid standard normal random variables $Z_i \sim N(0, 1)$.

Another way to see how $Z_n$ becomes a standard normal random variable is to compute the quantile $q_n$ such that $P(Z_n \le q_n) = .975$. A standard normal $Z$ satisfies $P(Z \le 1.96) = .975$. We call $q_n$ an empirical quantile since it is based on a simulated set of samples. We simulate 10,000 samples for each size $n = 5, 105, 205, \ldots, 5005$ and compute $q_n$. See Figure 9. As $n$ increases $q_n \to 1.96$.
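A sketch of the empirical-quantile experiment (assuming NumPy; the grid of sample sizes is thinned relative to Figure 9):

    import numpy as np

    rng = np.random.default_rng(6)
    p, reps = 0.2, 10000
    sigma = np.sqrt(p * (1 - p))                # = .4 for Bernoulli(.2)
    for n in [5, 105, 1005, 5005]:
        x = rng.binomial(1, p, size=(reps, n))  # 10,000 Bernoulli samples of size n
        z = np.sqrt(n) * (x.mean(axis=1) - p) / sigma
        print(n, np.quantile(z, 0.975))         # empirical q_n approaches 1.96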
Figure 7: Standardized Means for Bernoulli; relative frequencies of 1000 $Z_n$'s for $n = 5, 50, 500, 5000$. [plots omitted]

Figure 8: Standard Normal; relative frequencies for one sample of 1000 iid $N(0, 1)$ draws. [plot omitted]

Figure 9: Empirical Quantiles $q_n$ over sample size $n$. [plot omitted; vertical axis 1.7 to 2.3, sample size $n$ from 5 to 5005]