Parameter estimation in multivariate models
Let X1 , . . . , Xn be an i.i.d. sample from the distribution Pθ , where θ ∈ Θ and
Θ ⊂ R^k is the parameter space. The unknown parameter θ is estimated by
means of a statistic T = T(X1 , . . . , Xn ) = T(X) ∈ R^k , where the sample is
condensed column-wise into the p × n matrix X; its transpose X^T is the data
matrix.
For example, if our sample is from the Np (m, C) distribution and p is given,
then θ = (m, C) and k = p + p(p + 1)/2. The parameter is estimated by the
statistic T = (X̄, Ĉ) or T = (X̄, Ĉ∗ ).
Definition 1 The statistic T is an unbiased estimator of θ if Eθ (T) = θ, ∀θ ∈
Θ.
Definition 2 The sequence of statistics Tn = T(X1 , . . . , Xn ) is an asymptotically
unbiased estimator of θ if lim_{n→∞} Eθ (Tn ) = θ, ∀θ ∈ Θ.
For example, if our sample is from the Np (m, C) distribution and p is given,
then X̄ is an unbiased estimator of m, whereas Ĉ is only an asymptotically
unbiased estimator of C, and Ĉ∗ is an unbiased estimator of C.
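For illustration, a minimal simulation sketch (assuming NumPy, and assuming Ĉ = S/n and Ĉ∗ = S/(n − 1), where S is the scatter matrix around X̄): averaging the estimators over many independent samples shows the O(1/n) bias of Ĉ and the unbiasedness of X̄ and Ĉ∗.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 3, 10, 20000
m = np.zeros(p)
C = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])

mean_avg = np.zeros(p)            # average of X̄ over the replications
C_hat_avg = np.zeros((p, p))      # average of Ĉ  = S/n
C_star_avg = np.zeros((p, p))     # average of Ĉ* = S/(n-1)
for _ in range(reps):
    X = rng.multivariate_normal(m, C, size=n)   # n × p sample (rows = observations)
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)               # scatter matrix around X̄
    mean_avg += xbar / reps
    C_hat_avg += S / n / reps
    C_star_avg += S / (n - 1) / reps

print(np.round(mean_avg, 3))            # ≈ m:    X̄ is unbiased
print(np.round(C_hat_avg - C, 3))       # ≈ -C/n: Ĉ underestimates C at fixed n
print(np.round(C_star_avg - C, 3))      # ≈ 0:    Ĉ* is unbiased
```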
Definition 3 Let T1 and T2 be two unbiased estimators of the parameter θ,
based on the same sample. We say that T1 is at least as efficient as T2 if for
their covariance matrices
Varθ (T1 ) ≤ Varθ (T2 ),   ∀θ ∈ Θ
holds, which means that Varθ (T2 ) − Varθ (T1 ) ≥ 0 (positive semidefinite).
An unbiased estimator is efficient if it is at least as efficient as any other
unbiased estimator.
• An efficient estimator does not always exist, but if it does, it is unique with
probability 1.
• If the covariance matrix of an unbiased estimator attains the Cramér–Rao
information (matrix) limit (see the forthcoming definition), then it is the
efficient estimator.
• Even if the information limit cannot be attained with any unbiased estimator, there may exist an efficient estimator. As a consequence of the
Rao–Blackwell–Kolmogorov theorem, an unbiased estimator is efficient if
it is also a sufficient and complete statistic.
Definition 4 The statistic T is sufficient for θ if the distribution of the sample,
conditioned on a given value of T, does not depend on θ any more.
In fact, a sufficient statistic contains all the information that can be retrieved
from the sample for the parameter. We usually find sufficient statistics with the
help of the following theorem.
Theorem 1 (Neyman–Fisher factorization) The statistic T is sufficient
for θ if and only if the likelihood function (joint p.m.f. or p.d.f. of the sample)
can be factorized as
Lθ (X) = gθ (T(X)) · h(X),   ∀θ ∈ Θ.
Definition 5 The statistic T is complete if Eθ [g(T)] = 0 (∀θ ∈ Θ) implies that
g = 0 with probability 1. (Here g : Rk → Rk is a measurable function.)
A complete and sufficient statistic is also minimal sufficient (it is a function
of any other sufficient statistic).
Definition 6 The sequence of statistics Tn = T(X1 , . . . , Xn ) is a strongly consistent estimator of θ if Tn → θ as n → ∞ almost surely (with probability 1),
∀θ ∈ Θ.
For example, if our sample is from the Np (m, C) distribution, and p is given,
then in view of the Strong Law of Large Numbers, X̄ is a strongly consistent
estimator of m, whereas Ĉ and Ĉ∗ are both strongly consistent estimators of
C.
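A similar sketch (again assuming NumPy) illustrates consistency by letting n grow; independent samples of increasing size are used here simply to show the shrinking estimation error.

```python
import numpy as np

rng = np.random.default_rng(1)
m = np.array([1.0, -2.0])
C = np.array([[1.0, 0.4],
              [0.4, 2.0]])

for n in (10, 100, 1_000, 10_000, 100_000):
    X = rng.multivariate_normal(m, C, size=n)
    xbar = X.mean(axis=0)
    C_hat = (X - xbar).T @ (X - xbar) / n        # Ĉ = S/n
    err_mean = np.max(np.abs(xbar - m))
    err_cov = np.max(np.abs(C_hat - C))
    print(f"n={n:6d}  max|xbar-m|={err_mean:.4f}  max|C_hat-C|={err_cov:.4f}")
# both maximal errors shrink towards 0 as n grows (consistency)
```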
Now, the multivariate counterpart of the Cramér–Rao inequality will be
formulated. First we generalize the notion of the Fisher-information.
Definition 7 The Fisher-information matrix of the n-element p-dimensional
sample X1 , . . . , Xn , taken from the Pθ distribution (θ ∈ Θ ⊂ Rk ) is
Inf_n(θ) = Varθ (∇θ ln Lθ (X1 , . . . , Xn )),
where ∇θ denotes the k-dimensional gradient vector, and Inf_n(θ) is a k × k
positive semidefinite matrix.
Proposition 1 Under some regularity conditions (e.g., the support of the p.m.f.
or p.d.f. does not depend on the parameter, which always holds in exponential
families, to which the multivariate normal distribution belongs),
Eθ [∇θ ln fθ (X1 )] = 0,
and so
Inf_n(θ) = Eθ [ ∇θ ln Lθ (X1 , . . . , Xn ) (∇θ ln Lθ (X1 , . . . , Xn ))^T ].
Further,
Inf_n(θ) = n · Inf_1(θ),
where
Inf_1(θ) = Varθ (∇θ ln fθ (X1 )) = Eθ [ ∇θ ln fθ (X1 ) (∇θ ln fθ (X1 ))^T ],
with f denoting the p.d.f. of X1 (or of any sample entry) if it comes from an
absolutely continuous distribution (e.g., from a p-variate normal distribution),
otherwise the p.m.f. of a discrete distribution is to be used.
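As a concrete illustration, consider the simpler model of a p-variate normal with unknown mean m and known covariance C (an illustrative assumption, not the (m, C) model above), so that θ = m, the score is ∇θ ln fθ (X1 ) = C^{-1}(X1 − m), and Inf_1(m) = C^{-1}. A minimal sketch assuming NumPy approximates Inf_1 by averaging score outer products over simulated observations:

```python
import numpy as np

rng = np.random.default_rng(2)
m = np.array([0.5, -1.0])
C = np.array([[1.0, 0.3],
              [0.3, 2.0]])
C_inv = np.linalg.inv(C)

# Monte Carlo estimate of Inf_1(m) = E[ score score^T ],
# where score = C^{-1}(X - m) in the known-covariance normal model.
N = 200_000
X = rng.multivariate_normal(m, C, size=N)
scores = (X - m) @ C_inv            # each row is one score vector
info_hat = scores.T @ scores / N

print(np.round(info_hat, 3))        # ≈ C^{-1}
print(np.round(C_inv, 3))
```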
Theorem 2 (Cramér–Rao inequality) Let T be an unbiased estimator of θ
with existing covariance matrix, based on the i.i.d. sample X1 , . . . , Xn . Then,
under some regularity conditions (imposed on the distribution itself),
Varθ (T) ≥ Inf_n^{-1}(θ) = (1/n) Inf_1^{-1}(θ),   ∀θ ∈ Θ,
where the inequality again means that the difference of the left- and right-hand
side matrices is positive semidefinite.
Note that equality holds (the information limit is attained) only if this difference
is the zero matrix. Applying the theorem to a multivariate normal sample with
T = (X̄, Ĉ∗ ), equality cannot be attained (even in the p = 1, k = 2 case).
However, T is an efficient estimator, as a consequence of the Rao–Blackwell–
Kolmogorov theorem. In fact, its covariance matrix asymptotically attains the
information bound, akin to the covariance matrix of the ML-estimator (X̄, Ĉ)
to be introduced in the next lesson.
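In the simpler known-covariance model of the previous sketch (θ = m, C known; again an illustrative assumption, not the (m, C) case above), the bound Inf_n^{-1}(m) = C/n is actually attained by X̄, since Var(X̄) = C/n. A minimal check assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)
m = np.array([0.0, 1.0])
C = np.array([[1.5, 0.4],
              [0.4, 0.8]])
n, reps = 25, 50_000

# empirical covariance matrix of the estimator X̄ over many repeated samples
xbars = np.array([rng.multivariate_normal(m, C, size=n).mean(axis=0)
                  for _ in range(reps)])
var_xbar = np.cov(xbars, rowvar=False)

cramer_rao_bound = C / n             # (n C^{-1})^{-1} = C/n when C is known
print(np.round(var_xbar, 4))
print(np.round(cramer_rao_bound, 4)) # the two matrices agree (bound attained)
```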
Finally, let us find a sufficient statistic based on an i.i.d. sample X1 , . . . , Xn ∼
Np (m, C). We apply the Neyman–Fisher factorization theorem to the likelihood
function (the joint density of the sample):
L_{m,C}(X1 , . . . , Xn ) = (2π)^{-np/2} |C|^{-n/2} exp{ −(1/2) Σ_{k=1}^{n} (Xk − m)^T C^{-1} (Xk − m) }
                      = (2π)^{-np/2} |C|^{-n/2} exp{ −(1/2) [ tr(C^{-1} S) + n (X̄ − m)^T C^{-1} (X̄ − m) ] }.
Here, in the exponent, we used the multivariate Steiner equality and the fact
that the trace operator is invariant under cyclic permutations:
Σ_{i=1}^{n} (Xi − m)^T C^{-1} (Xi − m) = Σ_{i=1}^{n} tr[(Xi − m)^T C^{-1} (Xi − m)]
  = Σ_{i=1}^{n} tr[C^{-1} (Xi − m)(Xi − m)^T]
  = tr[ C^{-1} Σ_{i=1}^{n} (Xi − m)(Xi − m)^T ]
  = tr[ C^{-1} ( Σ_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T + n (X̄ − m)(X̄ − m)^T ) ]
  = tr(C^{-1} S) + n tr[(X̄ − m)^T C^{-1} (X̄ − m)] = tr(C^{-1} S) + n (X̄ − m)^T C^{-1} (X̄ − m).
As the likelihood function depends on the sample only through the statistics X̄
and S, the pair (X̄, S) is sufficient (the other factor is h(X) ≡ 1). Equivalently,
the pair (X̄, Ĉ) or (X̄, Ĉ∗ ) is also sufficient.
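The multivariate Steiner equality used above is a purely algebraic identity, so it can be checked on arbitrary data; a minimal sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 8, 3
X = rng.normal(size=(n, p))          # any data works; the identity is algebraic
m = rng.normal(size=p)               # an arbitrary centre m

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar)        # scatter matrix around the sample mean

lhs = sum(np.outer(x - m, x - m) for x in X)    # Σ (X_i - m)(X_i - m)^T
rhs = S + n * np.outer(xbar - m, xbar - m)      # S + n(X̄ - m)(X̄ - m)^T
print(np.allclose(lhs, rhs))         # True: multivariate Steiner equality
```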
Only for the Multivariate Statistics course: c.d.f. of the multivariate normal distribution
Numerical approximations for
F (x1 , . . . , xp ) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xp} f (t1 , . . . , tp ) dt1 · · · dtp .
1. Monte Carlo method. Approximate the probability
F (x1 , . . . , xp ) = P(X1 < x1 , . . . , Xp < xp )
with the corresponding relative frequency based on an i.i.d. sample X1 , . . . , Xn ∼
Np (m, C). How to do it? (A sketch is given after this list.)
2. Expansion by means of Hermite polynomials
Definition 8 The correlation matrix R corresponding to the covariance
matrix C is R = D^{−1/2} C D^{−1/2}, where the diagonal matrix D contains the
positive diagonal entries of the covariance matrix C in its main diagonal.
Definition 9 The lth orthogonal Hermite polynomial is
H_l(x) = (−1)^l e^{x^2/2} (d^l/dx^l) e^{−x^2/2},   (l = 0, 1, 2, . . . ).
Proposition 2 Let X = (X1 , . . . , Xp ) ∼ Np (0, R), and suppose that the
eigenvalues of the matrix R − I are less than 1 in absolute value (equivalently,
the spectrum of the correlation matrix R lies in the interval (0, 2)). Then

P(X1 ≥ x1 , . . . , Xp ≥ xp ) = Π_{m=1}^{p} (1 − Φ(xm ))
  + Π_{m=1}^{p} φ(xm ) · Σ_{k=1}^{∞} Σ_{k_{ij}} [ Π_{i=1}^{p−1} Π_{j=i+1}^{p} r_{ij}^{k_{ij}} / k_{ij}! ] Π_{q=1}^{p} H_{l_q − 1}(xq ),

where the inner summation is over nonnegative integers k_{ij} such that
Σ_{i=1}^{p−1} Σ_{j=i+1}^{p} k_{ij} = k, the r_{ij}'s are the entries of R, further
l_q = Σ_{i=1}^{q−1} k_{iq} + Σ_{j=q+1}^{p} k_{qj}, H_l is the lth Hermite polynomial,
and H_{−1}(x) := 1 − Φ(x). This series is absolutely and uniformly convergent
over R^p.
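Concerning the Monte Carlo method of item 1, a minimal sketch (assuming NumPy; the parameters m, C, the evaluation point x and the sample size n are illustrative choices): generate an i.i.d. Np(m, C) sample and use the relative frequency of the event {X1 < x1 , . . . , Xp < xp }.

```python
import numpy as np

rng = np.random.default_rng(5)
m = np.array([0.0, 1.0, -0.5])
C = np.array([[1.0, 0.2, 0.0],
              [0.2, 2.0, 0.3],
              [0.0, 0.3, 1.5]])
x = np.array([0.5, 1.5, 0.0])        # point where F(x_1, ..., x_p) is approximated

n = 1_000_000                        # Monte Carlo sample size
X = rng.multivariate_normal(m, C, size=n)
indicator = np.all(X < x, axis=1)    # 1 if X_1 < x_1, ..., X_p < x_p
F_hat = indicator.mean()             # relative frequency ≈ F(x_1, ..., x_p)
se = np.sqrt(F_hat * (1 - F_hat) / n)
print(F_hat, se)                     # estimate and its standard error
```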
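For p = 2 the inner sum in Proposition 2 has a single index k_{12} = k, so l_1 = l_2 = k and the expansion reduces to
P(X1 ≥ x1 , X2 ≥ x2 ) = (1 − Φ(x1 ))(1 − Φ(x2 )) + φ(x1 )φ(x2 ) Σ_{k≥1} (r^k/k!) H_{k−1}(x1 ) H_{k−1}(x2 ).
Below is a minimal sketch (assuming NumPy and SciPy; scipy.stats.multivariate_normal.cdf is used only as an independent reference value) that evaluates a truncated version of this series via the recursion H_{m+1}(x) = x H_m(x) − m H_{m−1}(x):

```python
import math
import numpy as np
from scipy.stats import norm, multivariate_normal

def hermite(l, x):
    # probabilists' Hermite polynomials: H_0 = 1, H_1 = x,
    # H_{m+1}(x) = x*H_m(x) - m*H_{m-1}(x)   (matches Definition 9)
    if l == 0:
        return 1.0
    h_prev, h = 1.0, x
    for m in range(1, l):
        h_prev, h = h, x * h - m * h_prev
    return h

def upper_orthant_p2(x1, x2, r, K=40):
    # truncated series of Proposition 2 for p = 2: the only index is k12 = k,
    # so l1 = l2 = k and the inner sum has the single term r^k / k!
    head = (1 - norm.cdf(x1)) * (1 - norm.cdf(x2))
    tail = sum(r**k / math.factorial(k) * hermite(k - 1, x1) * hermite(k - 1, x2)
               for k in range(1, K + 1))
    return head + norm.pdf(x1) * norm.pdf(x2) * tail

r, x1, x2 = 0.6, 0.3, -0.5           # eigenvalues of R - I are ±0.6, so the condition holds
series_val = upper_orthant_p2(x1, x2, r)
# reference value: P(X1 >= x1, X2 >= x2) = F(-x1, -x2) by the symmetry of N_2(0, R)
ref_val = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, r], [r, 1.0]]).cdf([-x1, -x2])
print(series_val, ref_val)           # the two values agree to several decimals
```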