Theoretical Statistics. Lecture 1.
Peter Bartlett
1. Organizational issues.
2. Overview.
3. Stochastic convergence.
Organizational Issues
• Lectures: Tue/Thu 11am–12:30pm, 332 Evans.
• Peter Bartlett. bartlett@stat. Office hours: Tue 1-2pm, Wed
1:30-2:30pm (Evans 399).
• GSI: Siqi Wu. siqi@stat. Office hours: Mon 3:30-4:30pm, Tue
3:30-4:30pm (Evans 307).
• http://www.stat.berkeley.edu/~bartlett/courses/210b-spring2013/
Check it for announcements, homework assignments, ...
• Texts:
Asymptotic Statistics, Aad van der Vaart. Cambridge. 1998.
Convergence of Stochastic Processes, David Pollard. Springer. 1984.
Available on-line at
http://www.stat.yale.edu/~pollard/1984book/
Organizational Issues
• Assessment:
Homework Assignments (60%): posted on the website.
Final Exam (40%): scheduled for Thursday, 5/16/13, 8-11am.
• Required background:
Stat 210A, and either Stat 205A or Stat 204.
Asymptotics: Why?
Example: We have a sample of size n from a
density pθ . Some estimator gives θ̂n .
• Consistent? i.e., θ̂n → θ? Stochastic convergence.
• Rate? Is it optimal? Often no finite sample optimality results.
Asymptotically optimal?
• Variance of estimate? Optimal? Asymptotically?
• Distribution of estimate? Confidence region. Asymptotically?
Asymptotics: Approximate confidence regions
Example: We have a sample of size n from
a density pθ . Maximum likelihood estimator
gives θ̂n .
Under mild conditions, √n (θ̂n − θ) is asymptotically N(0, I_θ⁻¹). Thus
√n I_θ^{1/2} (θ̂n − θ) ∼ N(0, I), and n (θ̂n − θ)ᵀ I_θ (θ̂n − θ) ∼ χ²(k).
So we have an approximate 1 − α confidence region for θ:
{ θ : (θ − θ̂n)ᵀ I_{θ̂n} (θ − θ̂n) ≤ χ²_{k,α}/n },
where χ²_{k,α} is the upper-α quantile of χ²(k).
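As a quick numerical companion (my addition, not from the slides): a minimal sketch of this region in one dimension, assuming an i.i.d. N(θ, 1) sample, so the MLE θ̂n is the sample mean and the Fisher information is I_θ = 1; the quantile comes from scipy.

```python
# A minimal sketch (my addition): approximate 1 - alpha confidence region
# for theta from an i.i.d. N(theta, 1) sample, where the MLE is the sample
# mean and the Fisher information is 1, so the region is an interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta, n, alpha = 2.0, 200, 0.05
sample = rng.normal(theta, 1.0, size=n)

theta_hat = sample.mean()                    # MLE of the normal mean
quantile = stats.chi2.ppf(1 - alpha, df=1)   # chi^2_{k,alpha} with k = 1

# Region {theta : (theta - theta_hat)^2 <= chi^2_{1,alpha} / n}.
half_width = np.sqrt(quantile / n)
print(f"approximate 95% region: "
      f"[{theta_hat - half_width:.3f}, {theta_hat + half_width:.3f}]")
```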
Overview of the Course
1. Tools for consistency, rates, asymptotic distributions:
• Stochastic convergence.
• Concentration inequalities.
• Projections.
• U-statistics.
• Delta method.
2. Tools for richer settings (e.g., function spaces rather than Rᵏ):
• Uniform laws of large numbers.
• Empirical process theory.
• Metric entropy.
• Functional delta method.
3. Tools for asymptotics of likelihood ratios:
• Contiguity.
• Local asymptotic normality.
4. Asymptotic optimality:
• Efficiency of estimators.
• Efficiency of tests.
5. Applications:
• Nonparametric regression.
• Nonparametric density estimation.
• M-estimators.
• Bootstrap estimators.
Convergence in Distribution
X1, X2, . . . , X are random vectors.
Definition: Xn converges in distribution (or weakly converges) to X (written Xn ⇝ X) means that their distribution functions satisfy Fn(x) → F(x) at all continuity points x of F.
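A toy check of the definition (my addition; the shift example is my own choice): if Xn = X + 1/n with X ∼ N(0, 1), then Fn(x) = Φ(x − 1/n) → Φ(x) = F(x) at every x, so Xn ⇝ X.

```python
# A toy check (my addition): X_n = X + 1/n with X ~ N(0, 1) has
# distribution function F_n(x) = Phi(x - 1/n), which converges to
# Phi(x) = F(x), so X_n converges in distribution to X.
from scipy import stats

x = 0.5
for n in (1, 10, 100, 1000):
    print(n, stats.norm.cdf(x - 1.0 / n))   # F_n(x)
print("limit F(x):", stats.norm.cdf(x))
```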
Review: Other Types of Convergence
d is a distance on Rᵏ (for which the Borel σ-algebra is the usual one).
Definition: Xn converges almost surely to X (written Xn →ᵃˢ X) means that d(Xn, X) → 0 a.s.
Definition: Xn converges in probability to X (written Xn →ᴾ X) means that, for all ε > 0,
P(d(Xn, X) > ε) → 0.
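A toy check of convergence in probability (my addition; the construction is my own choice): for Xn = X + Zn/√n with independent standard normals Zn, P(|Xn − X| > ε) → 0.

```python
# A toy check (my addition): X_n = X + Z_n / sqrt(n) converges in
# probability to X, since P(|X_n - X| > eps) = P(|Z_n| > eps*sqrt(n)) -> 0.
import numpy as np

rng = np.random.default_rng(0)
reps, eps = 100_000, 0.1
x = rng.normal(size=reps)
for n in (10, 100, 1000, 10_000):
    xn = x + rng.normal(size=reps) / np.sqrt(n)
    print(n, (np.abs(xn - x) > eps).mean())   # estimates P(d(X_n, X) > eps)
```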
Review: Other Types of Convergence
Theorem:
Xn →ᵃˢ X =⇒ Xn →ᴾ X =⇒ Xn ⇝ X,
Xn →ᴾ c ⇐⇒ Xn ⇝ c.
NB: For Xn →ᵃˢ X and Xn →ᴾ X, Xn and X must be functions on the sample space of the same probability space. This is not required for convergence in distribution.
Convergence in Distribution: Equivalent Definitions
Theorem: [Portmanteau] The following are equivalent:
1. P(Xn ≤ x) → P(X ≤ x) for all continuity points x of P(X ≤ ·).
2. Ef(Xn) → Ef(X) for all bounded, continuous f.
3. Ef(Xn) → Ef(X) for all bounded, Lipschitz f.
4. E e^{itᵀXn} → E e^{itᵀX} for all t ∈ Rᵏ. (Lévy’s Continuity Theorem)
5. tᵀXn ⇝ tᵀX for all t ∈ Rᵏ. (Cramér–Wold Device)
6. lim inf Ef(Xn) ≥ Ef(X) for all nonnegative, continuous f.
7. lim inf P(Xn ∈ U) ≥ P(X ∈ U) for all open U.
8. lim sup P(Xn ∈ F) ≤ P(X ∈ F) for all closed F.
9. P(Xn ∈ B) → P(X ∈ B) for all continuity sets B (i.e., P(X ∈ ∂B) = 0).
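A Monte Carlo sketch of item 3 (my addition; the choices of f and Xn are arbitrary): take Xn a standardized mean of n uniforms, so Xn ⇝ X ∼ N(0, 1), and the bounded Lipschitz f(x) = min(1, |x|); then Ef(Xn) should approach Ef(X).

```python
# A Monte Carlo sketch of Portmanteau item 3 (my addition): X_n is the
# standardized mean of n uniforms, so X_n converges in distribution to
# X ~ N(0, 1); check E f(X_n) -> E f(X) for the bounded Lipschitz
# function f(x) = min(1, |x|).
import numpy as np

rng = np.random.default_rng(0)
f = lambda v: np.minimum(1.0, np.abs(v))
reps = 100_000

for n in (1, 5, 50, 500):
    u = rng.uniform(size=(reps, n))
    xn = (u.mean(axis=1) - 0.5) * np.sqrt(12 * n)   # Var(U) = 1/12
    print(n, f(xn).mean())                          # estimates E f(X_n)

print("E f(X):", f(rng.normal(size=reps)).mean())
```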
Convergence in Distribution: Equivalent Definitions
Example: [Why do we need continuity?]
Consider f(x) = 1[x > 0], Xn = 1/n. Then Xn → 0 and f(Xn) = 1 → 1, but f(0) = 0.
[Why do we need boundedness?]
Consider f(x) = x and
Xn = n with probability 1/n, Xn = 0 with probability 1 − 1/n.
Then Xn ⇝ 0 and Ef(Xn) = 1 for all n, but f(0) = 0.
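A numeric check of the boundedness counterexample (my addition):

```python
# My addition: X_n = n w.p. 1/n, else 0. X_n converges in distribution
# to 0 (P(X_n = 0) -> 1), yet E X_n = 1 for every n, not f(0) = 0.
import numpy as np

rng = np.random.default_rng(0)
reps = 200_000
for n in (10, 100, 1000):
    xn = np.where(rng.uniform(size=reps) < 1.0 / n, n, 0)
    print(n, "P(X_n = 0) =", (xn == 0).mean(), " E X_n =", xn.mean())
```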
Relating Convergence Properties
Theorem:
Xn ⇝ X and d(Xn, Yn) →ᴾ 0 =⇒ Yn ⇝ X,
Xn ⇝ X and Yn →ᴾ c =⇒ (Xn, Yn) ⇝ (X, c),
Xn →ᴾ X and Yn →ᴾ Y =⇒ (Xn, Yn) →ᴾ (X, Y).
Relating Convergence Properties
Example: NB: Xn ⇝ X and Yn ⇝ Y does NOT imply (Xn, Yn) ⇝ (X, Y).
(joint convergence versus marginal convergence in distribution)
Consider X, Y independent N(0, 1), Xn ∼ N(0, 1), Yn = −Xn. Then Xn ⇝ X and Yn ⇝ Y, but (Xn, Yn) ⇝ (X, −X), which has a very different distribution from that of (X, Y).
Relating Convergence Properties: Continuous Mapping
Suppose f : Rᵏ → Rᵐ is “almost surely continuous”
(i.e., for some S with P(X ∈ S) = 1, f is continuous on S).
Theorem: [Continuous mapping]
Xn ⇝ X =⇒ f(Xn) ⇝ f(X),
Xn →ᴾ X =⇒ f(Xn) →ᴾ f(X),
Xn →ᵃˢ X =⇒ f(Xn) →ᵃˢ f(X).
Relating Convergence Properties: Continuous Mapping
Example: For X1, . . . , Xn i.i.d. with mean µ and variance σ², we have
√n (X̄n − µ)/σ ⇝ N(0, 1).
So
n (X̄n − µ)²/σ² ⇝ (N(0, 1))² = χ²₁.
Example: We also have X̄n − µ ⇝ 0, hence (X̄n − µ)² ⇝ 0. Consider f(x) = 1[x > 0]. Then f((X̄n − µ)²) ⇝ 1 ≠ f(0).
(The problem is that f is not continuous at 0, and P(X = 0) > 0 for the X satisfying (X̄n − µ)² ⇝ X.)
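A simulation sketch of the first example (my addition; Exponential(1) data is an arbitrary choice with µ = σ² = 1): n (X̄n − µ)²/σ² should look approximately χ²₁ for large n.

```python
# My addition: with i.i.d. Exponential(1) data (mu = 1, sigma^2 = 1),
# n * (Xbar_n - mu)^2 should be approximately chi^2_1 by the CLT and
# the continuous mapping theorem.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 500, 100_000
x = rng.exponential(1.0, size=(reps, n))
t = n * (x.mean(axis=1) - 1.0) ** 2

for q in (0.5, 0.9, 0.95):                 # compare a few quantiles
    print(q, np.quantile(t, q), stats.chi2.ppf(q, df=1))
```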
Relating Convergence Properties: Slutsky’s Lemma
Theorem: Xn ⇝ X and Yn ⇝ c imply
Xn + Yn ⇝ X + c,
Yn Xn ⇝ cX,
Yn⁻¹ Xn ⇝ c⁻¹ X.
(Why does Xn ⇝ X and Yn ⇝ Y not imply Xn + Yn ⇝ X + Y? Compare the contrast in the sketch below.)
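A quick illustration (my addition; the distributions are arbitrary choices): Xn is approximately N(0, 1), Yn →ᴾ 2, so Yn Xn is approximately N(0, 4); and the parenthetical question is answered by Yn = −Xn, where both marginals converge but the sum is identically 0.

```python
# My addition: Slutsky in action. X_n ~ approx N(0, 1) by the CLT and
# Y_n -> 2 in probability, so Y_n * X_n should be approximately N(0, 4).
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 100_000
u = rng.uniform(size=(reps, n))
xn = (u.mean(axis=1) - 0.5) * np.sqrt(12 * n)   # approx N(0, 1)
yn = 2.0 + rng.normal(size=reps) / np.sqrt(n)   # -> 2 in probability

print("var of Y_n * X_n:", np.var(yn * xn))     # near 4

# Contrast: Y_n = -X_n is also approx N(0, 1) marginally, yet
# X_n + Y_n = 0 rather than N(0, 2); marginal convergence is not enough.
print("var of X_n + (-X_n):", np.var(xn - xn))  # exactly 0
```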
Relating Convergence Properties: Examples
Theorem: For i.i.d. Yᵢ with EY1 = µ and Var(Y1) = σ² < ∞,
√n (Ȳn − µ)/Sn ⇝ N(0, 1),
where
Ȳn = n⁻¹ Σ_{i=1}^n Yi,    Sn² = (n − 1)⁻¹ Σ_{i=1}^n (Yi − Ȳn)².
Proof:
Sn² = (n/(n − 1)) · ( n⁻¹ Σ_{i=1}^n Yi² − Ȳn² ),
where n/(n − 1) → 1, n⁻¹ Σ_{i=1}^n Yi² →ᴾ EY1², and Ȳn →ᴾ EY1 (weak law of large numbers). So, by the continuous mapping theorem and Slutsky’s Lemma,
Sn² →ᴾ EY1² − (EY1)² = σ².
Also,
√n (Ȳn − µ)/Sn = √n (Ȳn − µ) · (1/Sn),
where √n (Ȳn − µ) ⇝ N(0, σ²) (central limit theorem) and 1/Sn →ᴾ 1/σ (continuous mapping theorem). So, by Slutsky’s Lemma,
√n (Ȳn − µ)/Sn ⇝ N(0, 1).
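The theorem is easy to check by simulation; a sketch (my addition, with Exponential(1) data as an arbitrary non-normal choice):

```python
# My addition: the studentized mean sqrt(n) * (Ybar_n - mu) / S_n for
# i.i.d. Exponential(1) data (mu = 1) should be approximately N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 100_000
y = rng.exponential(1.0, size=(reps, n))
ybar = y.mean(axis=1)
s = y.std(axis=1, ddof=1)                    # S_n with the (n - 1) divisor
t = np.sqrt(n) * (ybar - 1.0) / s

print("mean:", t.mean(), " var:", t.var())   # should be near 0 and 1
```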
Showing Convergence in Distribution
Recall that the characteristic function characterizes weak convergence:
Xn ⇝ X ⇐⇒ E e^{itᵀXn} → E e^{itᵀX} for all t ∈ Rᵏ.
Theorem: [Lévy’s Continuity Theorem]
If E e^{itᵀXn} → φ(t) for all t ∈ Rᵏ, and φ : Rᵏ → C is continuous at 0, then Xn ⇝ X, where E e^{itᵀX} = φ(t).
Special case: Xn = Y for all n. So X and Y have the same distribution iff φ_X = φ_Y.
Showing Convergence in Distribution
Theorem: [Weak law of large numbers]
Suppose X1, . . . , Xn are i.i.d. Then X̄n →ᴾ µ iff φ′_{X1}(0) = iµ.
Proof:
We’ll show that φ′_{X1}(0) = iµ implies X̄n →ᴾ µ. Indeed,
E e^{itX̄n} = φ_{X1}(t/n)ⁿ = (1 + itµ/n + o(1/n))ⁿ → e^{itµ} = φ_µ(t),
where φ_µ is the characteristic function of the point mass at µ. Lévy’s Theorem implies X̄n ⇝ µ, hence X̄n →ᴾ µ.
Showing Convergence in Distribution
e.g., X ∼ N(µ, Σ) has characteristic function
φ_X(t) = E e^{itᵀX} = e^{itᵀµ − tᵀΣt/2}.
Theorem: [Central limit theorem]
Suppose X1, . . . , Xn are i.i.d., EX1 = 0, EX1² = 1. Then √n X̄n ⇝ N(0, 1).
Proof: φ_{X1}(0) = 1, φ′_{X1}(0) = iEX1 = 0, φ′′_{X1}(0) = i²EX1² = −1. So
E e^{it√n X̄n} = φ_{X1}(t/√n)ⁿ = (1 + 0 − t²EX1²/(2n) + o(1/n))ⁿ → e^{−t²/2} = φ_{N(0,1)}(t).
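A numerical companion to the proof (my addition; centered Exponential data is an arbitrary choice with EX1 = 0, EX1² = 1): estimate E e^{it√n X̄n} by Monte Carlo and compare with e^{−t²/2}.

```python
# My addition: Monte Carlo estimate of the characteristic function of
# sqrt(n) * Xbar_n, compared with exp(-t^2 / 2), for centered
# Exponential(1) data (mean 0, variance 1).
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 200_000
x = rng.exponential(1.0, size=(reps, n)) - 1.0   # EX = 0, EX^2 = 1
z = np.sqrt(n) * x.mean(axis=1)

for t in (0.5, 1.0, 2.0):
    phi_hat = np.exp(1j * t * z).mean()          # estimates E e^{itZ}
    print(t, round(phi_hat.real, 4), round(np.exp(-t**2 / 2), 4))
```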
Uniformly tight
Definition:
X is tight means that for all ε > 0 there is an M for which
P(‖X‖ > M) < ε.
{Xn} is uniformly tight (or bounded in probability) means that for all ε > 0 there is an M for which
sup_n P(‖Xn‖ > M) < ε.
(So there is a compact set that contains each Xn with high probability.)
Notation: Uniformly tight
Theorem: [Prohorov’s Theorem]
1. Xn ⇝ X implies {Xn} is uniformly tight.
2. {Xn} uniformly tight implies that for some X and some subsequence, Xn_j ⇝ X.
Notation for rates: oP , OP
Definition:
Xn = oP(1) ⇐⇒ Xn →ᴾ 0,
Xn = oP(Rn) ⇐⇒ Xn = Yn Rn and Yn = oP(1),
Xn = OP(1) ⇐⇒ {Xn} is uniformly tight,
Xn = OP(Rn) ⇐⇒ Xn = Yn Rn and Yn = OP(1).
(i.e., oP and OP specify rates of growth of a sequence: oP(Rn) means strictly slower than Rn (the sequence Yn converges in probability to zero), and OP(Rn) means within a constant factor of Rn (the sequence Yn lies in a ball with high probability).)
Relations between rates
oP(1) + oP(1) = oP(1).
oP(1) + OP(1) = OP(1).
oP(1) OP(1) = oP(1).
(1 + oP(1))⁻¹ = OP(1).
oP(OP(1)) = oP(1).
Xn →ᴾ 0, R(h) = o(‖h‖ᵖ) =⇒ R(Xn) = oP(‖Xn‖ᵖ).
Xn →ᴾ 0, R(h) = O(‖h‖ᵖ) =⇒ R(Xn) = OP(‖Xn‖ᵖ).
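These follow directly from the definitions. For instance, a proof sketch of the third relation (my addition): suppose Xn = oP(1) and Yn = OP(1), and fix ε, δ > 0. Choose M with sup_n P(|Yn| > M) < δ. Then

```latex
\[
P(|X_n Y_n| > \epsilon)
  \le P(|Y_n| > M) + P\bigl(|X_n| > \epsilon / M\bigr)
  \le \delta + P\bigl(|X_n| > \epsilon / M\bigr),
\]
```

and the last term tends to zero, so lim sup_n P(|Xn Yn| > ε) ≤ δ for every δ > 0; that is, Xn Yn →ᴾ 0.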