3. Limit Results for Inference

3.1. Convergence Concepts

(1) Probability space $(S, \mathcal{F}, P)$; $X\colon S \to \mathbb{R}$ and $X_n\colon S \to \mathbb{R}$ are rv's for $n = 1, 2, \dots$. Usual (pointwise) convergence: $X_n \to X$ if $X_n(\omega) \to X(\omega)$ for every $\omega \in S$, as $n \to \infty$.

(2) Almost sure convergence: $X_n \to X$ a.s. if $P(\{\omega \in S : X_n(\omega) \to X(\omega)\}) = 1$; equivalently, for all $\varepsilon > 0$, $P(|X_n - X| > \varepsilon \text{ i.o.}) = 0$ ("i.o." = infinitely often); equivalently, $P(\sup_{n \ge N} |X_n - X| > \varepsilon) \to 0$ as $N \to \infty$.

(3) Convergence in probability: $X_n \to X$ in probability if for all $\varepsilon > 0$, $P(|X_n - X| > \varepsilon) \to 0$.

(4) Thm. If $X_n \to X$ a.s., then $X_n \to X$ in probability.
Proof. $P(|X_n - X| > \varepsilon) \le P(\sup_{m \ge n} |X_m - X| > \varepsilon) \to 0$ by (2).

(5) Remark: if $X_n \to X$ in probability, it is not necessarily true that $X_n \to X$ a.s.

(6) Distribution of $X$ is $F$, and distribution of $X_n$ is $F_n$. Convergence in distribution: $X_n \to X$ in distribution if $F_n(x) \to F(x)$ at all continuity points of $F$.

(7) Thm. If $X_n \to X$ in probability, then $X_n \to X$ in distribution.

(8) Remark: if $X_n \to X$ in distribution, it is not necessarily true that $X_n \to X$ in probability. However, if $X = c$, a constant, then $X_n \to c$ in probability.
Proof. $P(|X_n - c| \le \varepsilon) \ge F_n(c + \varepsilon) - F_n(c - \varepsilon) \to F(c + \varepsilon) - F(c - \varepsilon) = 1$.

3.2. Characteristic Functions

(1) For $z \in \mathbb{R}$, $e^{iz} = \cos(z) + i \sin(z)$ (Euler's formula).

(2) $X$ is a rv with distribution $F$. Then the characteristic function (cf) of $X$ is $\varphi(t) = E[e^{itX}] = \int_{-\infty}^{\infty} e^{itx}\, dF(x)$.

(3) $\varphi(0) = 1$, $|\varphi(t)| \le 1$.

(4) $\varphi_{aX+b}(t) = e^{itb}\, \varphi_X(at)$.

(5) If $E[|X|^k] < \infty$ for some $k = 1, 2, \dots$, then $\varphi^{(p)}(t) = E[(iX)^p e^{itX}]$ for $p = 1, \dots, k$.

(6) If $X$ and $Y$ are independent, then $\varphi_{X+Y}(t) = \varphi_X(t)\varphi_Y(t)$.

(7) If $X \sim N(0,1)$, then $\varphi(t) = \exp(-t^2/2)$.

(8) If $P(X = c) = 1$, then $\varphi(t) = \exp(itc)$.

(9) If $E[|X|] < \infty$, then $\varphi(t) = 1 + itE[X] + o(|t|)$ in some neighborhood of the origin. ($o(h)/h \to 0$ as $h \to 0$.)

(10) If $E[|X|^k] < \infty$, then $\varphi(t) = 1 + itE[X] + \dots + (it)^k E[X^k]/k! + o(|t|^k)$.

(11) Thm. (Chung 1974, Thm. 6.3.2, p. 161) $X_1, X_2, \dots$ rv's with distributions $F_1, F_2, \dots$ and cf's $\varphi_1, \varphi_2, \dots$. If $\varphi_n(t) \to \varphi(t)$ for all $t$, and $\varphi$ is continuous at $t = 0$, then $F_n \to F$ for some distribution function $F$, and $\varphi$ is its cf.

3.3. Laws of Large Numbers

(1) Lemma. If $c_n \to c$ (complex) as $n \to \infty$, then $(1 + c_n/n)^n \to e^c$.

(2) WLLN. $X_1, X_2, \dots$ are i.i.d. If $E[X_n] = \mu$ is finite, then $\bar{X}_n = (X_1 + \dots + X_n)/n \to \mu$ in probability.
Proof. The cf of the $X_i$'s is $\varphi(t)$. Then the cf of $\bar{X}_n$ is $\varphi(t/n)^n \to e^{it\mu}$ by (1) and (9) of 3.2. The result follows from (11) and (8) of 3.2, and (8) of 3.1.

(3) SLLN. (Chung 1974, Thm. 5.4.2, p. 126) Under the same assumptions, $\bar{X}_n \to \mu$ a.s.

(4) Ex. "Universal learning" (!?) $X_1, X_2, \dots, X_n$ are i.i.d. with distribution $F$. The empirical distribution of the first $n$: define $Y_i = 1\{X_i \le x\}$, so $F_n(x) = (\#\text{ of } X_i\text{'s} \le x)/n = \frac{1}{n}\sum_{i=1}^n Y_i$. Since $E[Y_i] = F(x)$, the SLLN implies that $F_n(x) \to F(x)$ a.s. for all $x$.

(5) (Glivenko-Cantelli; Chung 1974, Thm. 5.5.1, p. 133) The convergence in (4) is uniform in $x$, a.s. (See the numerical sketch following this subsection.)

(6) Frequency definition of probabilities: $X_i = 1$ if "heads", $X_i = 0$ if "tails", $E[X_i] = p$. Sometimes the probability $p$ is defined as a limiting frequency $p = \lim_{n\to\infty} (X_1 + \dots + X_n)/n$ based on the SLLN, but this already assumes the existence of $P$! The interpretation is correct, however.
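The SLLN in (2)-(3) and the Glivenko-Cantelli theorem in (5) are easy to illustrate numerically. A minimal sketch in Python (the exponential target, sample size, and seed are illustrative choices, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# i.i.d. sample from Exp(1), so mu = E[X_i] = 1 and F(x) = 1 - exp(-x).
n = 100_000
x = rng.exponential(scale=1.0, size=n)

# SLLN: the running sample mean converges to mu almost surely.
running_mean = np.cumsum(x) / np.arange(1, n + 1)
print(running_mean[[99, 9_999, 99_999]])   # drifts toward 1.0

# Glivenko-Cantelli: sup_x |F_n(x) - F(x)| -> 0 a.s.  The supremum is
# attained at the order statistics, where F_n jumps from (i-1)/n to i/n
# (this is the Kolmogorov-Smirnov statistic).
xs = np.sort(x)
F = 1.0 - np.exp(-xs)                      # true CDF at the sorted points
i = np.arange(1, n + 1)
d = max(np.max(i / n - F), np.max(F - (i - 1) / n))
print(d)                                   # small for large n
```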
3.4. Central Limit Theorems

(1) (Lindeberg-Lévy CLT) $X_1, X_2, \dots$ are i.i.d. with $E[X_n] = \mu$ and $\mathrm{Var}(X_n) = \sigma^2 > 0$. Define $Z_n = (X_n - \mu)/\sigma$, so $E[Z_n] = 0$ and $\mathrm{Var}(Z_n) = 1$, with cf $\varphi(t)$. Let $Z = (Z_1 + \dots + Z_n)/n^{1/2}$, with $\varphi_Z(t) = \varphi(t/n^{1/2})^n \to \exp(-t^2/2)$. Therefore $Z \to N(0,1)$ in distribution.

(2) (Lindeberg 1922 and Feller 1935; Chung 1974, Chapter 7.2) Central Limit Theorem (CLT). $X_1, X_2, \dots$ are independent with $E[X_n] = 0$, $\mathrm{Var}(X_n) = \sigma_n^2$, and distributions $F_n$. Write $s_n^2 = \sigma_1^2 + \dots + \sigma_n^2$ and $Z_n = (X_1 + \dots + X_n)/s_n$. Lindeberg condition: for all $\varepsilon > 0$,
(*) $\frac{1}{s_n^2} \sum_{j=1}^n \int_{|x| > \varepsilon s_n} x^2 \, dF_j(x) \to 0$.
(*) is sufficient for $Z_n \to N(0,1)$ in distribution, and it is also necessary under the negligibility condition $\max_{j \le n} \sigma_j^2/s_n^2 \to 0$ (the necessity is Feller's contribution).

(3) In (1), (*) is satisfied with $s_n^2 = n\sigma^2$.

(4) If $\sigma_n^2 = c^{n-1}$ for some $0 < c < 1$, then (*) does not hold when $\varepsilon > 0$ is small enough.

(5) Multivariate CLT. $X_n = (X_{1n}, \dots, X_{mn})^T$, independent random vectors with second moments. If for all $a = (a_1, \dots, a_m)^T \in \mathbb{R}^m$, $a \ne 0$, the linear combinations $a^T X_n$ satisfy a univariate CLT, then $X_n$ has an asymptotic multivariate normal distribution (Cramér-Wold device).

3.5. δ-method

(1) δ-method (Bishop, Fienberg, Holland 1975, Thm. 14.6-1, p. 487). Suppose $n^{1/2}(X_n - \mu)/\sigma \to N(0,1)$ in distribution, and $g$ is continuously differentiable. Then $n^{1/2}(g(X_n) - g(\mu))/\sigma \to N(0, g'(\mu)^2)$ in distribution.

(2) $Z_n \sim \mathrm{Bin}(n, p)$, $X_n = Z_n/n$, so that $E[X_n] = \mu = p$ and $\mathrm{Var}(X_n) = \sigma^2/n = p(1-p)/n$, and the CLT applies.

(3) If $g(x) = a + bx$, then $g'(x) = b$.

(4) δ-method in practice: suppose $X \sim F(\mu, \sigma^2)$; then, approximately, $\mathrm{Var}(g(X)) \approx g'(\mu)^2 \mathrm{Var}(X)$.

(5) Ex. $X \sim N(0, \sigma^2)$, $g(x) = \exp(x)$, so $g'(0) = 1$ and $\mathrm{Var}(g(X)) \approx \sigma^2$. However, the true value is $\mathrm{Var}(g(X)) = \exp(2\sigma^2) - \exp(\sigma^2) = \sigma^2 + 3\sigma^4/2 + 7\sigma^6/3! + \dots$ For small $\sigma$ the approximation works OK, but it is always an underestimate.

(6) Stabilizing the variance for $Z_n \sim \mathrm{Bin}(n, p)$, $X_n = Z_n/n$: here $\mu = p$, but $\sigma^2 = p(1-p)$ depends on $p$. Find a transformation that removes the dependency: $g'(p)^2 \sigma^2 = 1$. Solving gives $g(p) = 2\arcsin(p^{1/2})$. Hence, asymptotically, $\arcsin(X_n^{1/2}) \sim N(\arcsin(p^{1/2}), 1/(4n))$.

(7) Use: approximate confidence intervals for $p$. (The principal value $\arcsin(p^{1/2})$ is monotone increasing.)

(8) Another approximate method. If $Z_n \sim \mathrm{Bin}(n, p)$, then $X = (Z_n - np)/(np(1-p))^{1/2} \sim N(0,1)$ asymptotically, so $X^2 \sim \chi^2_1$ asymptotically. Then a 95% confidence interval consists of those values of $p$ for which $X^2 \le 1.96^2 = 3.8416$. (See the binomial interval sketch at the end of the notes.)

3.6. Convergence of Markov Chains

(1) Consider an irreducible aperiodic Markov chain with a countable state space. All states are positive recurrent if and only if $\pi^T P = \pi^T$ has a solution with $\pi^T \mathbf{1} = 1$. If such a solution exists, it is strictly positive and unique. (Proof in Stochastic Processes.)

(2) Interpretation: $\pi$ is the invariant distribution of the chain: in the long run, the effect of the initial state of the chain wears out, and the chain is in state $i$ with probability $\pi(i)$. This is also the proportion of time the chain spends in state $i$.

(3) Heuristic extension. Consider a random vector $(X, Y)$ with density $f(x, y)$, marginals $f_1(x)$ and $f_2(y)$, and conditional densities $f_1(x \mid y) = f(x, y)/f_2(y)$ and $f_2(y \mid x) = f(x, y)/f_1(x)$. Define a "transition probability" from $(x, y)$ to $(u, v)$ as $f_1(u \mid y)\, f_2(v \mid u)\, du\, dv$. Then
$\iint f(x, y)\, f_1(u \mid y)\, f_2(v \mid u)\, dx\, dy = f(u, v)$,
so that $f(x, y)$ is the invariant distribution of the chain.

(4) Intuition: run the chain to get samples from $f(x, y)$.

(5) Extension to $n$ dimensions. $X = (X_1, \dots, X_n)$ has full conditionals $f_i(x_i \mid x_{(-i)}) = f(x_1, \dots, x_n)/f_{(-i)}(x_{(-i)})$, where $x_{(-i)} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$, and the transition probability from $(x_1, \dots, x_n)$ to $(u_1, \dots, u_n)$ is given by
$f_1(u_1 \mid x_2, \dots, x_n)\, f_2(u_2 \mid u_1, x_3, \dots, x_n) \cdots f_n(u_n \mid u_1, \dots, u_{n-1})\, du_1 \cdots du_n$.
This is called Gibbs sampling. (See the Gibbs sampler sketch at the end of the notes.)
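The two interval constructions of 3.5 (6)-(8) can be compared numerically. A minimal sketch in Python (the observed count, sample size, and grid resolution are illustrative assumptions, not from the notes; in general the arcsine endpoints should also be clipped to [0, 1]):

```python
import numpy as np

# Observed data (illustrative): z successes out of n trials.
z, n = 37, 100
p_hat = z / n

# 3.5 (6)-(7): variance-stabilized interval via g(p) = 2*arcsin(sqrt(p)),
# using arcsin(sqrt(X_n)) ~ N(arcsin(sqrt(p)), 1/(4n)) asymptotically.
half = 1.96 / (2 * np.sqrt(n))
lo = np.sin(np.arcsin(np.sqrt(p_hat)) - half) ** 2
hi = np.sin(np.arcsin(np.sqrt(p_hat)) + half) ** 2
print("arcsine interval:", (lo, hi))

# 3.5 (8): the set of p with X^2 = (z - n*p)^2 / (n*p*(1-p)) <= 1.96^2,
# found here by brute force on a grid (it is also a quadratic in p, so a
# closed-form solution exists).
grid = np.linspace(1e-6, 1 - 1e-6, 100_000)
x2 = (z - n * grid) ** 2 / (n * grid * (1 - grid))
inside = grid[x2 <= 1.96 ** 2]
print("chi-square interval:", (inside.min(), inside.max()))
```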
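For 3.6 (3)-(5), a minimal sketch of a two-component Gibbs sampler, assuming a standard bivariate normal target with correlation rho (an illustrative choice, not from the notes; the full conditionals used below follow from that assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.8            # correlation of the assumed bivariate normal target
n_iter, burn_in = 20_000, 1_000

x, y = 0.0, 0.0      # arbitrary initial state; its effect wears out, cf. (2)
samples = np.empty((n_iter, 2))
for t in range(n_iter):
    # Alternate draws from the full conditionals f1(x | y) and f2(y | x).
    # For this target: X | Y = y ~ N(rho*y, 1 - rho^2)  (variance),
    # and symmetrically for Y | X = x.
    x = rng.normal(rho * y, np.sqrt(1.0 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1.0 - rho ** 2))
    samples[t] = (x, y)

kept = samples[burn_in:]
print(kept.mean(axis=0))            # both components ~ 0
print(np.corrcoef(kept.T)[0, 1])    # ~ rho
```

The chain's invariant distribution is the joint density f(x, y), as in (3), so after discarding a burn-in the retained draws behave like (correlated) samples from the target.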