3. Limit Results for Inference
3.1. Convergence Concepts
(1) Probability space (S, 𝓕, P); X: S → ℝ and Xn: S → ℝ are rv's for n = 1, 2, .... Usual (pointwise) convergence: Xn → X, if Xn(ω) → X(ω) for every ω ∈ S, as n → ∞.
(2) Almost sure convergence: Xn → X a.s.,
if P({ω ∈ S | Xn(ω) → X(ω)}) = 1,
or equivalently: for all ε > 0, P(|Xn − X| > ε i.o.) = 0
("i.o." = infinitely often), or equivalently
P(sup_{n≥N} |Xn − X| > ε) → 0, as N → ∞.
(3) Convergence in probability: Xn → X in probability,
if for all ε > 0, P(|Xn − X| > ε) → 0.
(4) Thm. If Xn → X a.s., then Xn → X in probability.
Proof. P(|XN − X| > ε) ≤ P(sup_{n≥N} |Xn − X| > ε) → 0 as N → ∞, by (2).
(5) Remark: if Xn → X in probability, it is not necessarily true that Xn → X a.s.; see the sketch below.
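A standard counterexample is the "sliding blocks" sequence on S = [0, 1] with uniform P: Xn is the indicator of a subinterval whose length shrinks but which sweeps across [0, 1] forever, so P(Xn = 1) → 0 while every ω is hit infinitely often. A minimal numerical sketch (Python; the sample point ω = 0.37 is an arbitrary choice):

```python
import numpy as np

def block_interval(n):
    # Write n = 2^k + j with 0 <= j < 2^k; X_n is the indicator of
    # the interval [j/2^k, (j+1)/2^k), which has length 2^(-k).
    k = int(np.floor(np.log2(n)))
    j = n - 2**k
    return j / 2**k, (j + 1) / 2**k

omega = 0.37                      # a fixed sample point in S = [0, 1]
hits, probs = [], []
for n in range(1, 5000):
    a, b = block_interval(n)
    probs.append(b - a)           # P(|X_n| > eps) = P(X_n = 1) = 2^(-k) -> 0
    if a <= omega < b:
        hits.append(n)            # X_n(omega) = 1 at this n

print("P(X_n = 1) near n = 5000:", probs[-1])       # small: conv. in probability
print("recent n with X_n(omega) = 1:", hits[-3:])   # keeps happening: no a.s. limit
```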
(6) The distribution of X is F, and the distribution of Xn is Fn.
Convergence in distribution: Xn → X in distribution, if
Fn(x) → F(x) at all continuity points of F.
(7) Thm. If Xn → X in probability, then Xn → X in distribution.
(8) Remark: if Xn → X in distribution, then it is not necessarily true that Xn → X in probability. However, if X = c, a constant, then Xn → c in probability.
Proof. P(|Xn − c| ≤ ε) ≥ Fn(c + ε) − Fn(c − ε) → F(c + ε) − F(c − ε) = 1, since c ± ε are continuity points of F.
3.2. Characteristic Functions
(1) For z ∈ ℝ, e^{iz} = cos(z) + i sin(z) (Euler's formula).
(2) X is a rv with distribution F. Then, the characteristic function (cf) of X is
φ(t) = E[e^{itX}] = ∫ e^{itx} dF(x).
(3) φ(0) = 1, |φ(t)| ≤ 1.
(4) φ_{aX+b}(t) = e^{itb} φ_X(at).
(5) If E[|X|^k] < +∞ for some k = 1, 2, ..., then
φ^{(p)}(t) = E[(iX)^p e^{itX}] for p = 1, ..., k.
(6) If X and Y are independent, then
φ_{X+Y}(t) = φ_X(t) φ_Y(t).
(7) If X ~ N(0,1), then φ(t) = exp(−t²/2).
(8) If P(X = c) = 1, then φ(t) = exp(itc).
(9) If E[|X|] < +∞, then φ(t) = 1 + itE[X] + o(|t|) in some neighborhood of the origin. (o(h) denotes a term with o(h)/h → 0, as h → 0.)
(10) If E[|X|^k] < +∞, then φ(t) = 1 + itE[X] + ... + (it)^k E[X^k]/k! + o(|t|^k).
(11) Thm. (Chung 1974, Thm. 6.3.2, p. 161)
X1, X2, ... are rv's with distributions F1, F2, ... and cf's φ1, φ2, .... If φn(t) → φ(t) for all t, and φ is continuous at t = 0, then Fn → F for some distribution function F, and φ is its cf.
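As a numerical sanity check of (2) and (7), one can approximate E[e^{itX}] by the sample average (1/m) Σ e^{itX_j} over simulated draws. A sketch (Python; the sample size m is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)           # m draws from N(0,1)
for t in (0.5, 1.0, 2.0):
    emp = np.mean(np.exp(1j * t * x))      # empirical cf: (1/m) sum e^{itX_j}
    print(t, emp.real, np.exp(-t**2 / 2))  # compare with phi(t) = exp(-t^2/2)
```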
3.3. Laws of Large Numbers
(1) Lemma. If cn → c (complex) as n → ∞, then
(1 + cn/n)^n → e^c.
(2) WLLN. X1, X2, ... are i.i.d. If E[Xn] = μ is finite, then
X̄n = (X1 + ... + Xn)/n → μ in probability.
Proof. The cf of the Xi's is φ(t). Then, the cf of X̄n is φ(t/n)^n → e^{itμ}, by (1) and (9) of 3.2. The result follows from (11) and (8) of 3.2, and (8) of 3.1.
(3) SLLN. (Chung 1974, Thm. 5.4.2, p. 126) Under the same assumptions, X̄n → μ a.s.
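A minimal simulation of (2)-(3) (Python; the exponential distribution with μ = 2 is an arbitrary choice): the running mean settles near μ.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100_000)    # i.i.d. with mu = 2
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
for n in (10, 1000, 100_000):
    print(n, running_mean[n - 1])               # approaches mu = 2
```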
(4) Ex. "Universal learning" (!?) X1, X2, ..., Xn are i.i.d. with distribution F. The empirical distribution of the first n: fix x and define Yi = 1{Xi ≤ x}, so
Fn(x) = (# of Xi's ≤ x)/n = (Y1 + ... + Yn)/n.
Since E[Yi] = F(x), the SLLN implies that Fn(x) → F(x) a.s. for each fixed x.
(5) (Glivenko–Cantelli; Chung 1974, Thm. 5.5.1, p. 133) The convergence in (4) is uniform in x, a.s.
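The uniform convergence in (5) can be watched numerically via the distance sup_x |Fn(x) − F(x)|, which for a sorted sample is attained at the sample points. A sketch (Python, assuming SciPy is available and F = N(0,1)):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
for n in (100, 10_000, 1_000_000):
    x = np.sort(rng.standard_normal(n))
    i = np.arange(1, n + 1)
    # sup_x |F_n(x) - F(x)|, checking both sides of each jump of F_n
    d = np.max(np.maximum(i / n - norm.cdf(x), norm.cdf(x) - (i - 1) / n))
    print(n, d)     # shrinks as n grows
```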
(6) Frequency definition of probabilities: Xi = 1 if "heads", Xi = 0 if "tails", E[Xi] = p. Sometimes the probability p is defined as the limiting frequency p = lim X̄n based on the SLLN, but this already assumes the existence of P! The interpretation is correct, however.
3.4. Central Limit Theorems
(1) (Lindeberg–Lévy CLT) X1, X2, ... are i.i.d. with E[Xn] = μ and Var(Xn) = σ² > 0. Define Zn = (Xn − μ)/σ, so E[Zn] = 0 and Var(Zn) = 1, with cf φ(t). Let Z = (Z1 + ... + Zn)/√n, with φ_Z(t) = φ(t/√n)^n → exp(−t²/2), by (10) of 3.2 and (1) of 3.3. Therefore, Z → N(0,1) in distribution.
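A simulation of (1) (Python; the exponential distribution, with μ = σ = 1, is an arbitrary non-normal choice): the standardized sums behave like N(0,1).

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 1000, 50_000
x = rng.exponential(size=(reps, n))        # i.i.d. with mu = 1, sigma = 1
z = (x.mean(axis=1) - 1.0) * np.sqrt(n)    # = (Z_1 + ... + Z_n)/sqrt(n)
print(z.mean(), z.std())                   # close to 0 and 1
print(np.mean(z <= 1.96))                  # close to Phi(1.96) = 0.975
```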
(2) (Lindeberg 1922, and Feller 1935; Chung 1974, Chapter 7.2) Central Limit Theorem (CLT). X1, X2, ... are independent with E[Xn] = 0, Var(Xn) = σn², and distributions Fn. Write Sn² = σ1² + ... + σn², and Zn = (X1 + ... + Xn)/Sn. The Lindeberg condition: for all t > 0,
(*) (1/Sn²) Σ_{j=1}^{n} E[Xj² 1{|Xj| > t Sn}] → 0, as n → ∞,
is necessary and sufficient for Zn → N(0,1) in distribution.
(3) In (1), (*) is satisfied with Sn² = nσ² (applied to the centered variables Xi − μ).
(4) If σn² = c^{n−1} for some 0 < c < 1, then Sn² stays bounded and (*) does not hold when t > 0 is small enough.
(5) Multivariate CLT. Xn = (X_{1n}, ..., X_{mn})ᵀ are independent random vectors with second moments. If for all a = (a1, ..., am)ᵀ ∈ ℝᵐ, a ≠ 0, the univariate CLT holds for the linear combinations aᵀXn (the Cramér–Wold device), then Xn has an asymptotic multivariate normal distribution.
3.5. δ-method
(1) δ-method (Bishop, Fienberg, Holland, 1975, Thm. 14.6-1, p. 487). Suppose √n(Xn − μ)/σ → N(0,1) in distribution, and g is continuously differentiable. Then,
√n(g(Xn) − g(μ))/σ → N(0, g′(μ)²) in distribution.
(2) Zn ~ Bin(n, p), Xn = Zn/n, so that E[Xn] = μ = p and Var(Xn) = σ²/n = p(1 − p)/n, and the CLT applies.
(3) If g(x) = a + bx, then g′(x) = b.
(4) δ-method in practice: suppose X has distribution F with mean μ and variance σ². Then, approximately, Var(g(X)) ≈ g′(μ)² Var(X).
(5) Ex. X ~ N(0, σ²), g(x) = exp(x), so g′(0) = 1 and Var(g(X)) ≈ σ². However, the true value is Var(g(X)) = exp(2σ²) − exp(σ²) = σ² + 3σ⁴/2 + 7σ⁶/3! + ....
For small σ the approximation works OK, but it is always an underestimate.
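A Monte Carlo check of (5) (Python; the sample size is arbitrary), comparing the δ-method value σ² with the simulated and exact Var(exp(X)):

```python
import numpy as np

rng = np.random.default_rng(4)
for sigma in (0.1, 0.5, 1.0):
    x = rng.normal(0.0, sigma, size=1_000_000)
    mc = np.var(np.exp(x))                           # Monte Carlo estimate
    exact = np.exp(2 * sigma**2) - np.exp(sigma**2)  # true Var(exp(X))
    print(sigma, sigma**2, mc, exact)                # sigma^2 underestimates
```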
(6) Stabilizing the variance for Zn ~ Bin(n, p), Xn = Zn/n: here μ = p, but σ² = p(1 − p) depends on p. Find a transformation that removes the dependency: g′(p)²σ² = 1. Solving gives g(p) = 2 arcsin(√p). Hence, asymptotically,
arcsin(√Xn) ~ N(arcsin(√p), 1/(4n)).
(7) Use: get approximate confidence intervals for p. (The principal value arcsin(√p) is monotone increasing.)
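A sketch of (6) (Python; the values of n and p are arbitrary): Var(Xn) = p(1 − p)/n moves with p, while Var(arcsin(√Xn)) stays near 1/(4n).

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 100_000
for p in (0.1, 0.3, 0.5):
    xbar = rng.binomial(n, p, size=reps) / n        # replicated X_n = Z_n/n
    print(p, np.var(xbar),                          # depends on p
          np.var(np.arcsin(np.sqrt(xbar))),         # roughly constant
          1 / (4 * n))                              # the stabilized variance
```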
(8) Another approximate method. If Zn ~ Bin(n, p), then X = (Zn − np)/(np(1 − p))^{1/2} ~ N(0,1) asymptotically, so X² ~ χ₁² asymptotically. Then, a 95% confidence interval consists of those values of p for which X² ≤ 1.96² = 3.8416.
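Setting X² = 1.96² and solving the resulting quadratic in p gives the interval endpoints in closed form (this is the Wilson score interval). A sketch (Python; k = 15 successes out of n = 50 is a made-up example):

```python
import numpy as np

def score_interval(k, n, z=1.96):
    # (k - np)^2 <= z^2 n p(1-p) rearranges to
    # (n + z^2) p^2 - (2k + z^2) p + k^2/n <= 0; the roots are the endpoints.
    a = n + z**2
    b = -(2 * k + z**2)
    c = k**2 / n
    disc = np.sqrt(b**2 - 4 * a * c)
    return ((-b - disc) / (2 * a), (-b + disc) / (2 * a))

print(score_interval(k=15, n=50))   # roughly (0.19, 0.44) for 15/50 successes
```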
3.6. Convergence of Markov Chains
(1) Consider an irreducible aperiodic Markov chain with a countable state space. All states are positive recurrent if and only if πᵀP = πᵀ has a solution with πᵀ1 = 1. If such a solution exists, it is strictly positive and unique. (Proof in Stochastic Processes.)
(2) Interpretation: π is the invariant distribution of the chain: in the long run, the effect of the initial state wears out, and the chain is in state i with probability π(i). This is also the proportion of time the chain spends in state i.
(3) Heuristic extension. Consider a random vector (X, Y) with density f(x, y), marginals f1(x) and f2(y), and conditional densities f1(x|y) = f(x, y)/f2(y) and f2(y|x) = f(x, y)/f1(x). Define a "transition probability" from (x, y) to (u, v) as f1(u|y) f2(v|u) du dv. Then,
∬ f(x, y) f1(u|y) f2(v|u) dx dy = f1(u) f2(v|u) = f(u, v),
so that f(x, y) is the invariant distribution of the chain.
(4) Intuition: run the chain to get samples from f(x,y).
(5) Extension to n dimensions. X = (X1, ..., Xn) has conditional densities (full conditionals) fi(xi|x_{(−i)}) = f(x1, ..., xn)/f_{(−i)}(x_{(−i)}), where x_{(−i)} = (x1, ..., x_{i−1}, x_{i+1}, ..., xn), and the transition probability from (x1, ..., xn) to (u1, ..., un) is given by
∏_{i=1}^{n} fi(ui | u1, ..., u_{i−1}, x_{i+1}, ..., xn) du1 ⋯ dun.
This is called Gibbs sampling.
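A minimal sketch of (3)-(5) (Python), assuming the target f(x, y) is a bivariate normal with correlation ρ, for which both full conditionals are known normals: f1(x|y) = N(ρy, 1 − ρ²) and f2(y|x) = N(ρx, 1 − ρ²).

```python
import numpy as np

rng = np.random.default_rng(6)
rho, n_iter, burn = 0.8, 50_000, 1000
s = np.sqrt(1 - rho**2)            # conditional standard deviation
x = y = 0.0                        # arbitrary initial state
samples = []
for i in range(n_iter):
    x = rng.normal(rho * y, s)     # draw from f1(x | y)
    y = rng.normal(rho * x, s)     # draw from f2(y | x)
    if i >= burn:                  # discard the early, initialization-biased draws
        samples.append((x, y))

xs, ys = np.array(samples).T
print(np.corrcoef(xs, ys)[0, 1])   # close to rho = 0.8
print(xs.std(), ys.std())          # close to 1, the marginal std of f
```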