Parameter estimation in multivariate models

Let X1, …, Xn be an i.i.d. sample from the Pθ distribution, where θ ∈ Θ and Θ ⊂ R^k is the parameter space. The unknown parameter θ is estimated by means of a statistic T = T(X1, …, Xn) = T(X) ∈ R^k, which depends on the sample condensed column-wise into the p × n matrix X; here X^T is the data matrix. For example, if our sample is from the Np(m, C) distribution and p is given, then θ = (m, C) and k = p + p(p + 1)/2. It is estimated with the statistic T = (X̄, Ĉ) or (X̄, Ĉ∗).

Definition 1 The statistic T is an unbiased estimator of θ if Eθ(T) = θ, ∀θ ∈ Θ.

Definition 2 The sequence of statistics Tn = T(X1, …, Xn) is an asymptotically unbiased estimator of θ if lim_{n→∞} Eθ(Tn) = θ, ∀θ ∈ Θ.

For example, if our sample is from the Np(m, C) distribution and p is given, then X̄ is an unbiased estimator of m, whereas Ĉ is an asymptotically unbiased and Ĉ∗ an unbiased estimator of C.

Definition 3 Let T1 and T2 be two unbiased estimators of the parameter θ, based on the same sample. We say that T1 is at least as efficient as T2 if their covariance matrices satisfy Varθ(T1) ≤ Varθ(T2), ∀θ ∈ Θ, which means that Varθ(T2) − Varθ(T1) ≥ 0 (positive semidefinite). An unbiased estimator is efficient if it is at least as efficient as any other unbiased estimator.

• An efficient estimator does not always exist, but if it does, it is unique with probability 1.
• If the covariance matrix of an unbiased estimator attains the Cramér–Rao information (matrix) limit (see the forthcoming definition), then it is the efficient estimator.
• Even if the information limit cannot be attained by any unbiased estimator, there may still exist an efficient estimator. As a consequence of the Rao–Blackwell–Kolmogorov theorem, an unbiased estimator is efficient if it is also a sufficient and complete statistic.
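The contrast between the biased Ĉ (divisor n) and the unbiased Ĉ∗ (divisor n − 1) can be checked by simulation. A minimal sketch with NumPy, assuming hypothetical example values for p, n, m, C: averaging both estimators over many Np(m, C) samples, the 1/n estimator concentrates near ((n − 1)/n)·C, while the 1/(n − 1) estimator concentrates near C.

```python
import numpy as np

# Simulation check (hypothetical example values for p, n, m, C):
# the 1/n covariance estimator C-hat has mean ((n-1)/n) * C (biased),
# while the 1/(n-1) estimator C-hat-star has mean C (unbiased).
rng = np.random.default_rng(0)
p, n, reps = 2, 5, 200_000
m = np.zeros(p)
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])

X = rng.multivariate_normal(m, C, size=(reps, n))   # reps samples, each n x p
Xbar = X.mean(axis=1, keepdims=True)                # sample mean per replicate
D = X - Xbar
S = np.einsum('rni,rnj->rij', D, D)                 # scatter matrix per replicate

E_hat = (S / n).mean(axis=0)         # approximates E[C-hat]   = ((n-1)/n) C
E_star = (S / (n - 1)).mean(axis=0)  # approximates E[C-hat*]  = C
print(np.round(E_hat, 2))   # close to 0.8 * C for n = 5
print(np.round(E_star, 2))  # close to C
```

With n = 5 the bias factor (n − 1)/n = 0.8 is large enough to be clearly visible in the averaged output, even though both estimators are strongly consistent as n → ∞.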
Definition 4 The statistic T is sufficient for θ if the distribution of the sample, conditioned on a given value of T, does not depend on θ any more. In fact, a sufficient statistic contains all the information that can be retrieved from the sample about the parameter. We usually find sufficient statistics with the help of the following theorem.

Theorem 1 (Neyman–Fisher factorization) The statistic T is sufficient for θ if and only if the likelihood function (joint p.m.f. or p.d.f. of the sample) can be factorized as

Lθ(X) = gθ(T(X)) · h(X), ∀θ ∈ Θ.

Definition 5 The statistic T is complete if Eθ[g(T)] = 0 (∀θ ∈ Θ) implies that g = 0 with probability 1. (Here g : R^k → R^k is a measurable function.) A complete and sufficient statistic is also minimal sufficient (it is a function of any other sufficient statistic).

Definition 6 The sequence of statistics Tn = T(X1, …, Xn) is a strongly consistent estimator of θ if Tn → θ as n → ∞ almost surely (with probability 1), ∀θ ∈ Θ.

For example, if our sample is from the Np(m, C) distribution and p is given, then in view of the Strong Law of Large Numbers, X̄ is a strongly consistent estimator of m, whereas Ĉ and Ĉ∗ are both strongly consistent estimators of C.

Now the multivariate counterpart of the Cramér–Rao inequality will be formulated. First we generalize the notion of the Fisher information.

Definition 7 The Fisher information matrix of the n-element, p-dimensional sample X1, …, Xn, taken from the Pθ distribution (θ ∈ Θ ⊂ R^k), is

Inf_n(θ) = Varθ[∇θ ln Lθ(X1, …, Xn)],

where ∇θ denotes the k-dimensional gradient vector, and Inf_n(θ) is a k × k positive semidefinite matrix.

Proposition 1 Under some regularity conditions (e.g., the support of the p.m.f. or p.d.f. does not depend on the parameter, which always holds in exponential families, and the multivariate normal distribution belongs here),

Eθ[∇θ ln fθ(X1)] = 0,

and so

Inf_n(θ) = Eθ[(∇θ ln Lθ(X1, …, Xn))(∇θ ln Lθ(X1, …, Xn))^T].

Further, Inf_n(θ) = n · Inf_1(θ), where

Inf_1(θ) = Varθ[∇θ ln fθ(X1)] = Eθ[(∇θ ln fθ(X1))(∇θ ln fθ(X1))^T],

with f denoting the p.d.f. of X1 (or of any sample entry) if it comes from an absolutely continuous distribution (e.g., from a p-variate normal distribution); otherwise the p.m.f. of the discrete distribution is to be used.

Theorem 2 (Cramér–Rao inequality) Let T be an unbiased estimator of θ, based on the i.i.d. sample X1, …, Xn, whose covariance matrix exists. Then, under some regularity conditions (imposed on the distribution itself),

Varθ(T) ≥ Inf_n^{−1}(θ) = (1/n) Inf_1^{−1}(θ), ∀θ ∈ Θ,

where the inequality again means that the difference of the left- and right-hand side matrices is positive semidefinite. Note that equality holds (the information limit is attained) only if this difference is the zero matrix.

Applying the theorem to a multivariate normal sample with T = (X̄, Ĉ∗), the equality cannot be attained (even in the p = 1, k = 2 case). However, T is an efficient estimator, as a consequence of the Rao–Blackwell–Kolmogorov theorem. In fact, its covariance matrix asymptotically attains the information bound, akin to the covariance matrix of the ML estimator (X̄, Ĉ) to be introduced in the next lesson.

Finally, let us find a sufficient statistic based on an i.i.d. X1, …, Xn ∼ Np(m, C) sample. We apply the Neyman–Fisher factorization theorem to the likelihood function (joint density of the sample):

Lm,C(X1, …, Xn) = (1/((2π)^{np/2} |C|^{n/2})) · e^{−(1/2) Σ_{k=1}^{n} (Xk − m)^T C^{−1} (Xk − m)}
= (1/((2π)^{np/2} |C|^{n/2})) · e^{−(1/2)[tr(C^{−1}S) + n(X̄ − m)^T C^{−1} (X̄ − m)]}.

Here, in the exponent, we used the multivariate Steiner equality and the fact that the tr operator is cyclically commutative:

Σ_{i=1}^{n} (Xi − m)^T C^{−1} (Xi − m) = Σ_{i=1}^{n} tr[(Xi − m)^T C^{−1} (Xi − m)]
= Σ_{i=1}^{n} tr[C^{−1} (Xi − m)(Xi − m)^T]
= tr[C^{−1} Σ_{i=1}^{n} (Xi − m)(Xi − m)^T]
= tr[C^{−1} (Σ_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T + n(X̄ − m)(X̄ − m)^T)]
= tr(C^{−1}S) + n tr[(X̄ − m)^T C^{−1} (X̄ − m)]
= tr(C^{−1}S) + n(X̄ − m)^T C^{−1} (X̄ − m).

As the likelihood function depends on the sample only through the statistics X̄ and S, these provide the sufficient statistics (the other factor h is identically 1). Equivalently, the pair (X̄, Ĉ) or (X̄, Ĉ∗) is also sufficient.

Only for the Multivariate Statistics course: c.d.f. of the multivariate normal distribution

Numerical approximations for

F(x1, …, xp) = ∫_{−∞}^{x1} … ∫_{−∞}^{xp} f(t1, …, tp) dtp … dt1.

1. Monte Carlo method. Approximate the probability F(x1, …, xp) = P(X1 < x1, …, Xp < xp) with the corresponding relative frequency based on an X1, …, Xn ∼ Np(m, C) i.i.d. sample. How to do it?

2. Expansion by means of Hermite polynomials.

Definition 8 The correlation matrix R corresponding to the covariance matrix C is R = D^{−1/2} C D^{−1/2}, where the diagonal matrix D contains the positive diagonal entries of the covariance matrix C in its main diagonal.

Definition 9 The lth orthogonal Hermite polynomial:

Hl(x) = (−1)^l e^{x²/2} (d^l/dx^l) e^{−x²/2}, (l = 0, 1, 2, …).

Proposition 2 Let X = (X1, …, Xp) ∼ Np(0, R), and suppose that the eigenvalues of the matrix R − I are less than 1 in absolute value (equivalently, the spectrum of the correlation matrix R lies in the (0, 2) interval). Then

P(X1 ≥ x1, …, Xp ≥ xp) = Π_{m=1}^{p} (1 − Φ(xm)) + Π_{m=1}^{p} φ(xm) · Σ_{k=1}^{∞} Σ_{(kij)} [Π_{i=1}^{p−1} Π_{j=i+1}^{p} r_ij^{k_ij}/k_ij!] · Π_{q=1}^{p} H_{l_q − 1}(xq),

where the summation is over the kij's such that kij ≥ 0 is an integer with Σ_{i=1}^{p−1} Σ_{j=i+1}^{p} kij = k, the rij's are entries of R, further l_q = Σ_{i=1}^{q−1} k_iq + Σ_{j=q+1}^{p} k_qj, Hl is the lth Hermite polynomial, and H_{−1}(x) := 1 − Φ(x). This series is absolutely and uniformly convergent over R^p.
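To make Theorem 2 concrete in the simplest case: for a univariate N(m, σ²) sample with σ known, ∇m ln f_m(X1) = (X1 − m)/σ², so Inf_1(m) = 1/σ² and the Cramér–Rao bound for unbiased estimators of m is σ²/n, which X̄ attains since Var(X̄) = σ²/n. The sketch below, with hypothetical example values for m, σ, n, verifies this equality by simulation.

```python
import numpy as np

# Univariate illustration of the Cramér–Rao bound (hypothetical example values):
# X1,...,Xn ~ N(m, sigma^2) with sigma known; Inf_1(m) = 1/sigma^2, so the
# bound for unbiased estimators of m is sigma^2/n, attained by the sample mean.
rng = np.random.default_rng(1)
m, sigma, n, reps = 3.0, 2.0, 10, 200_000

xbar = rng.normal(m, sigma, size=(reps, n)).mean(axis=1)  # X-bar per replicate
var_xbar = xbar.var()        # empirical Var(X-bar)
cr_bound = sigma**2 / n      # (1/n) * Inf_1(m)^{-1}
print(var_xbar, cr_bound)    # nearly equal: the bound is attained
```

For the full multivariate parameter θ = (m, C), by contrast, the bound is only attained asymptotically, as noted after Theorem 2.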
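As for the "how to do it" of the Monte Carlo method in point 1: simulate N i.i.d. Np(m, C) vectors and take the proportion that falls coordinate-wise below (x1, …, xp); by the Strong Law of Large Numbers this relative frequency converges to F(x1, …, xp) almost surely. A minimal sketch, with assumed example values for m, C, x:

```python
import numpy as np

# Monte Carlo approximation of the N_p(m, C) c.d.f. at a point x
# (hypothetical example values for m, C, x): the relative frequency of
# simulated vectors lying coordinate-wise below x estimates F(x1,...,xp).
rng = np.random.default_rng(2)
m = np.array([0.0, 0.0])
C = np.array([[1.0, 0.5],
              [0.5, 1.0]])
x = np.array([0.0, 0.0])
N = 500_000

sample = rng.multivariate_normal(m, C, size=N)  # N i.i.d. N_p(m, C) vectors
F_hat = np.mean(np.all(sample < x, axis=1))     # relative frequency of {X1<x1,...,Xp<xp}
print(F_hat)
```

For this bivariate example with zero means and correlation 1/2, Sheppard's formula gives the exact value P(X1 < 0, X2 < 0) = 1/4 + arcsin(1/2)/(2π) = 1/3, so the printed estimate should be close to 0.3333, which makes the example a convenient self-check.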