2. Multivariate Distributions
• Random vectors: mean, covariance matrix, linear transformations, dependence measures (a short introduction to the probability tools for multivariate statistics).
• Multidimensional normal distribution, mixture models (some well-known examples of multivariate probability distributions).
2.1. Random vectors
Multivariate data are the result of observing a random vector, a
vector X = (X1, . . . , Xp)′ whose components Xj, j = 1, . . . , p, are
random variables (r.v.) on the same probability space (Ω, A, P).
Similarly, a random matrix is a matrix whose elements are r.v.
The probability distribution of a random vector or matrix is
characterized by the joint distribution of its components. In
particular, the distribution function of a random vector X is
F (x1 , . . . , xp ) = P{X1 ≤ x1 , . . . , Xp ≤ xp },
for (x1 , . . . , xp ) ∈ Rp .
In general, we will only work with continuous random vectors, whose probability distribution is characterized by the density function f = f(x1, . . . , xp), satisfying
1. f(x1, . . . , xp) ≥ 0 for all (x1, . . . , xp) ∈ Rp;
2. ∫_{Rp} f(x1, . . . , xp) dx1 . . . dxp = 1;
3. f(x1, . . . , xp) = ∂^p F(x1, . . . , xp) / (∂x1 . . . ∂xp).
The marginal distribution of each component Xj , j = 1, . . . , p, is
its probability distribution as an individual random variable. Its
density function is
fj(xj) = ∫_{R^{p−1}} f(x1, . . . , xp) dx1 . . . dxj−1 dxj+1 . . . dxp,   for xj ∈ R.
More generally, given the partition X = (X(1)′, X(2)′)′, with X(1) = (X1, . . . , Xr)′ and X(2) = (Xr+1, . . . , Xp)′, the marginal density of X(1) is
fX(1)(x1, . . . , xr) = ∫_{R^{p−r}} f(x1, . . . , xp) dxr+1 . . . dxp.
Two random matrices X1 and X2 are independent if the elements
of X1 (as a collection of r.v.) are independent of the elements of
X2 . (The elements within X1 or X2 need not be independent.)
In particular, given the partition X = (X(1)′, X(2)′)′, the vectors
X(1) and X(2) are independent if
F (x1 , . . . , xp ) = FX(1) (x1 , . . . , xr ) FX(2) (xr +1 , . . . , xp ), for all x1 , . . . , xp ,
or, equivalently, if
f (x1 , . . . , xp ) = fX(1) (x1 , . . . , xr ) fX(2) (xr +1 , . . . , xp ), for all x1 , . . . , xp .
2.1.1 Expectation
The expected value of a random vector (resp. matrix) is the vector
(resp. matrix) of expected values of each of its components (the
marginal expectations). For the random vector X = (X1 , . . . , Xp )0 ,
µ := E(X) = (E(X1), . . . , E(Xp))′ = (µ1, . . . , µp)′,
where µj := E(Xj) = ∫_R x fj(x) dx.
The expectation is a linear function:
1. If A is a q × p constant matrix, X is a p-dimensional random vector and b is a q-dimensional constant vector, then E(AX + b) = AE(X) + b.
2. If X and Y are random matrices of the same dimension, then E(X + Y) = E(X) + E(Y).
3. If X is a q × p random matrix and A, B are constant matrices of adequate dimensions, then E(AXB) = AE(X)B.
Moreover, if X1 and X2 are conformable independent random matrices, then E(X1 X2) = E(X1)E(X2).
2.1.2 Covariance matrix
The variance-covariance matrix (or simply covariance matrix) of a random vector X = (X1, . . . , Xp)′ with expectation µ is

Σ = V(X) := E((X − µ)(X − µ)′) = E(XX′) − µµ′
  = ( σ11  σ12  . . .  σ1p
      σ21  σ22  . . .  σ2p
      . . .
      σp1  σp2  . . .  σpp ),

where σjj = V(Xj) is the variance of the r.v. Xj and σjk = Cov(Xj, Xk) is the covariance of Xj and Xk, j, k = 1, . . . , p. Then Σ is a symmetric matrix.
Some properties of the covariance matrix:
1. If A is a q × p constant matrix, X is a p-dimensional random vector and b is a q-dimensional constant vector, then V(AX + b) = AV(X)A′.
2. Σ = V(X) is always nonnegative definite.
2.1.3 Correlation matrix
Let X = (X1, . . . , Xp)′ be a random vector with covariance matrix Σ and with 0 < σjj = V(Xj) < ∞, j = 1, . . . , p. Define D := diag(σ11, . . . , σpp).
Then the correlation matrix of X is

ρ = ( 1    ρ12  . . .  ρ1p
      ρ21  1    . . .  ρ2p
      . . .
      ρp1  ρp2  . . .  1 )  =  D^{−1/2} Σ D^{−1/2},

where ρjk is the correlation of Xj and Xk, j, k = 1, . . . , p, and D^{−1/2} := diag(σ11^{−1/2}, . . . , σpp^{−1/2}).
Observe that, if Z := D−1/2 (X − µ), where µ = E(X), then
V(Z) = ρ.
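As a quick numerical check of the identity ρ = D^{−1/2} Σ D^{−1/2}, the following R sketch compares the explicit computation with the built-in cov2cor(); the covariance matrix used is an assumption chosen only for illustration.

## Sketch: correlation matrix as D^{-1/2} Sigma D^{-1/2} (Sigma is an assumed example)
Sigma <- matrix(c(4, 2,   1,
                  2, 3,   0.5,
                  1, 0.5, 2), ncol = 3)
Dinv12 <- diag(1 / sqrt(diag(Sigma)))   # D^{-1/2}
rho <- Dinv12 %*% Sigma %*% Dinv12
rho
cov2cor(Sigma)                          # same matrix via the built-in function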
2.1.4 Dependence measures
More generally, the (cross-)covariance between the p-dimensional
random vector X1 and the q-dimensional random vector X2 , with
means µ1 and µ2 respectively, is the p × q matrix given by
Cov(X1, X2) = E((X1 − µ1)(X2 − µ2)′).
Some properties of the cross-covariance:
1. If A and B are constant matrices and c and d are constant vectors, then Cov(AX1 + c, BX2 + d) = A Cov(X1, X2) B′.
2. If X1, X2 and X3 are random vectors, then Cov(X1 + X2, X3) = Cov(X1, X3) + Cov(X2, X3).
3. If X1 and X2 are independent, then Cov(X1, X2) = 0p×q.
Pearson’s product-moment covariance measures linear dependence
and, for the multivariate normal distribution, diagonal covariance
matrix implies independence of the random vector components. In
general, however, Pearson’s correlation matrix does not
characterize independence.
Székely et al. (2007) introduced two dependence coefficients,
distance covariance and distance correlation, that measure all types
of dependence between random vectors X and Y of arbitrary (and
possibly different) dimensions.
Suppose that X in Rp and Y in Rq are random vectors. The
characteristic function of X is
fˆX(t) := E(e^{i⟨t,X⟩}) = ∫_{Rp} e^{i⟨t,x⟩} dFX(x).
Let fˆY be the characteristic function of Y, and denote the joint
characteristic function of (X0 , Y0 )0 by fˆX,Y . Then X and Y are
independent if and only if fˆX,Y = fˆX fˆY .
Distance covariance is defined as a measure of the discrepancy between fˆX,Y and fˆX fˆY:
‖fˆX,Y(t, s) − fˆX(t) fˆY(s)‖²_w = ∫_{R^{p+q}} |fˆX,Y(t, s) − fˆX(t) fˆY(s)|² w(t, s) dt ds.
The only integrable weight function w that makes this definition scale and rotation invariant is proportional to the reciprocal of |t|_p^{1+p} |s|_q^{1+q}, where | · |_p denotes the Euclidean norm in Rp.
The distance covariance between random vectors X and Y with E|X|_p < ∞ and E|Y|_q < ∞ is the square root of

V²(X, Y) = [1/(cp cq)] ∫_{R^{p+q}} |fˆX,Y(t, s) − fˆX(t) fˆY(s)|² / (|t|_p^{1+p} |s|_q^{1+q}) dt ds,    (1)

with cp := π^{(p+1)/2} / Γ((p+1)/2).
Similarly, distance variance is defined as the square root of
V 2 (X) = V 2 (X, X).
The distance correlation between random vectors X and Y with E|X|_p < ∞ and E|Y|_q < ∞ is the square root of

R²(X, Y) := V²(X, Y) / √(V²(X) V²(Y))   if V²(X) V²(Y) > 0,   and   R²(X, Y) := 0   if V²(X) V²(Y) = 0.    (2)
Theorem 3 in Székely et al. (2007): If E|X|p < ∞ and
E|Y|q < ∞, then 0 ≤ R ≤ 1, and R(X, Y) = 0 if and only if X
and Y are independent.
For an observed random sample {(Xi, Yi), i = 1, . . . , n} of (X, Y), natural estimators of the unknown characteristic functions are

fˆ_X^n(t) := ∫_{Rp} e^{i⟨t,x⟩} dF_X^n(x) = (1/n) ∑_{i=1}^n e^{i⟨t,Xi⟩},    fˆ_Y^n(s) := (1/n) ∑_{i=1}^n e^{i⟨s,Yi⟩},

and

fˆ_{X,Y}^n(t, s) := (1/n) ∑_{i=1}^n e^{i⟨t,Xi⟩ + i⟨s,Yi⟩},

where F_X^n denotes the empirical distribution function of X1, . . . , Xn.
The empirical distance covariance is defined as the square root of

V_n²(X, Y) := ‖fˆ_{X,Y}^n(t, s) − fˆ_X^n(t) fˆ_Y^n(s)‖²_w.
Székely et al. (2007) used the asymptotic properties of the
empirical distance covariance to test the independence of X and Y:
H0 : X and Y are independent
H1 : X and Y are dependent
Corollary 2 of Székely et al. (2007): If E|X|p < ∞ and
E|Y|q < ∞ and X and Y are independent, then
n V_n²(X, Y) / S2  →d  Q   as n → ∞,    (3)

where Q is a certain, known quadratic form of centered Gaussian random variables with E(Q) = 1 and

S2 := [ (1/n²) ∑_{i,k=1}^n |Xi − Xk|_p ] · [ (1/n²) ∑_{i,k=1}^n |Yi − Yk|_q ].
The test statistic (3) is a particular case of the so-called energy
statistics, functions of distances between statistical observations
(see Székely and Rizzo 2013).
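These quantities and the test based on (3) are implemented in the R package energy of Rizzo and Székely. A short sketch on simulated data (the data-generating mechanism is an assumption chosen so that Pearson correlation misses the dependence):

library(energy)             # assumed installed
set.seed(10)
n <- 100
x <- rnorm(n)
y <- x^2 + 0.5 * rnorm(n)   # dependent on x, but nearly uncorrelated with it
cor(x, y)                   # Pearson correlation close to 0
dcor(x, y)                  # distance correlation detects the dependence
dcov.test(x, y, R = 199)    # permutation test of independence

A small p-value in dcov.test leads to rejecting H0, even though the Pearson correlation is essentially zero here.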
2.2. Examples of multidimensional distributions
2.2.1 Multidimensional normal distribution
The random vector X = (X1 , . . . , Xp )0 follows a p-dimensional
normal distribution with mean µ and covariance matrix Σ, and we
denote it by X ∼ Np (µ, Σ), if its density function is
f(x; µ, Σ) = [1/((2π)^{p/2} |Σ|^{1/2})] e^{−(x−µ)′Σ⁻¹(x−µ)/2},    (4)
where x = (x1 , . . . , xp )0 and −∞ < xi < ∞, i = 1, . . . , p.
Example (Bivariate normal density):
We evaluate the bivariate (p = 2) normal density in terms of the individual parameters µ1 = E(X1), µ2 = E(X2), σ11 = V(X1), σ22 = V(X2) and ρ12 = Cor(X1, X2) = σ12/√(σ11 σ22).
The determinant and inverse of the matrix

Σ = ( σ11  σ12
      σ12  σ22 )  =  ( σ11              √(σ11 σ22) ρ12
                       √(σ11 σ22) ρ12   σ22 )

are respectively |Σ| = σ11 σ22 (1 − ρ12²) and

Σ⁻¹ = [1/(σ11 σ22 (1 − ρ12²))] (  σ22              −√(σ11 σ22) ρ12
                                 −√(σ11 σ22) ρ12    σ11 ).

Thus

(x − µ)′Σ⁻¹(x − µ)
  = (x1 − µ1 , x2 − µ2) Σ⁻¹ (x1 − µ1 , x2 − µ2)′
  = [σ22 (x1 − µ1)² + σ11 (x2 − µ2)² − 2ρ12 √(σ11 σ22)(x1 − µ1)(x2 − µ2)] / [σ11 σ22 (1 − ρ12²)]
  = [1/(1 − ρ12²)] [ (x1 − µ1)²/σ11 + (x2 − µ2)²/σ22 − 2ρ12 (x1 − µ1)(x2 − µ2)/√(σ11 σ22) ].
Consequently, the bivariate normal density is

f(x1, x2) = [1/(2π √(σ11 σ22 (1 − ρ12²)))] ·
  exp{ −[1/(2(1 − ρ12²))] [ (x1 − µ1)²/σ11 + (x2 − µ2)²/σ22 − 2ρ12 (x1 − µ1)(x2 − µ2)/√(σ11 σ22) ] }.

Observe that, if ρ12 = 0 (X1 and X2 are uncorrelated), then

f(x1, x2) = [1/(2π √σ11 √σ22)] exp{ −(1/2) [ (x1 − µ1)²/σ11 + (x2 − µ2)²/σ22 ] }
          = [1/√(2π σ11)] exp{ −(1/2)(x1 − µ1)²/σ11 } · [1/√(2π σ22)] exp{ −(1/2)(x2 − µ2)²/σ22 }
          = f1(x1) · f2(x2).
Since the joint density f (x1 , x2 ) can be expressed as the product of
the marginal densities, we conclude that X1 and X2 are actually
independent r.v.
split.screen(c(2,3))

screen(1)
## bivariate normal pdf
library(mvtnorm)
x = y = seq(-5, 5, length = 50)
f = function(x, y) { dmvnorm(cbind(x, y)) }
z = outer(x, y, f)
par(mai = c(0.1, 0.1, 0.1, 0.1))
persp(x, y, z, theta = 5, phi = 50, expand = 0.5, col = "lightblue")

screen(2)
## contours of the bivariate normal pdf
x = y = seq(-5, 5, length = 150)
z = outer(x, y, f)
par(mai = c(0.5, 0.5, 0.5, 0.5))
contour(x, y, z, nlevels = 20, col = rainbow(20))

screen(3)
## normal data
X = rmvnorm(n = 100, sigma = matrix(c(1, 0, 0, 1), ncol = 2))
par(mai = c(0.5, 0.5, 0.5, 0.5))
plot(X[,1], X[,2], pch = 19, xlab = expression(x[1]), ylab = expression(x[2]))

screen(4)
x = y = seq(-5, 5, length = 50)
Sigma = matrix(c(1, 0.7, 0.7, 1), ncol = 2)
f = function(x, y) { dmvnorm(cbind(x, y), sigma = Sigma) }
z = outer(x, y, f)
par(mai = c(0.1, 0.1, 0.1, 0.1))
persp(x, y, z, theta = 5, phi = 50, expand = 0.5, col = "lightblue")

screen(5)
## contours of the bivariate normal pdf
x = y = seq(-5, 5, length = 150)
z = outer(x, y, f)
par(mai = c(0.5, 0.5, 0.5, 0.5))
contour(x, y, z, nlevels = 20, col = rainbow(20))

screen(6)
## normal data
X = rmvnorm(n = 100, sigma = Sigma)
par(mai = c(0.5, 0.5, 0.5, 0.5))
plot(X[,1], X[,2], pch = 19, xlab = expression(x[1]), ylab = expression(x[2]))
[Figure: 2 × 3 panel of plots produced by the code above — perspective plot of the density, contour plot and scatterplot of n = 100 simulated points, for the standard bivariate normal (top row) and for the bivariate normal with correlation 0.7 (bottom row).]
Properties of the multivariate normal distribution
Let X ∼ Np(µ, Σ).
1. The normal density has a global maximum at µ and is symmetric with respect to µ in the sense that f(µ + a) = f(µ − a) for all a ∈ Rp.
2. Linear combinations of a multivariate normal are also normally distributed: if A is a (q × p) constant matrix and d is a (q × 1) constant vector, then AX + d ∼ Nq(Aµ + d, AΣA′). Consequently, all subsets of the components of X are normally distributed.
3. Zero correlation between normal vectors is equivalent to independence: if X = (X1′, X2′)′, then X1 and X2 are independent if and only if Cov(X1, X2) = 0.
4. If |Σ| > 0, there exists a linear transformation of X with mean 0 and covariance matrix equal to the identity.
5. Contours of constant density for the multivariate normal distribution are ellipsoids centered at the population mean. If |Σ| > 0, then
   (a) the level sets of the probability density f are the ellipsoids given by {x ∈ Rp : (x − µ)′Σ⁻¹(x − µ) = c²}. These ellipsoids are centered at µ and have axes ±c√λi ei, where (λi, ei), i = 1, . . . , p, are the eigenvalue-eigenvector pairs of Σ.
   (b) (X − µ)′Σ⁻¹(X − µ) follows a χ²_p distribution. Thus, P{(X − µ)′Σ⁻¹(X − µ) ≤ χ²_{p;α}} = 1 − α, for any 0 < α < 1.
The Mahalanobis distance dM of a point x ∈ Rp to the mean µ of
a p-dimensional distribution with covariance matrix Σ is defined by
d²_M(x) = (x − µ)′Σ⁻¹(x − µ).
It is a statistical distance in the sense that it takes into account
the variability of the distribution (unlike the Euclidean distance).
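In R the squared Mahalanobis distance is available through the base function mahalanobis(); the mean and covariance below are assumptions chosen to illustrate how the statistical distance differs from the Euclidean one.

mu <- c(0, 0)
Sigma <- matrix(c(1, 0.7, 0.7, 1), ncol = 2)
x <- rbind(c(1, 1), c(1, -1))              # two points at the same Euclidean distance from mu
mahalanobis(x, center = mu, cov = Sigma)   # their squared Mahalanobis distances differ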
6. If X ∼ Np(µ, Σ), then any linear combination of variables a′X = a1X1 + a2X2 + · · · + apXp is distributed as N(a′µ, a′Σa). Also, if a′X is distributed as N(a′µ, a′Σa) for every a ∈ Rp, then X must follow a Np(µ, Σ) distribution.
7. Let X1, . . . , Xn be mutually independent Np(µj, Σ) random vectors and let c1, . . . , cn be real constants. Then V = c1X1 + · · · + cnXn follows a Np( ∑_{j=1}^n cj µj , (∑_{j=1}^n cj²) Σ ) distribution.
8. Given X1, . . . , Xn a random sample from X ∼ Np(µ, Σ), the maximum likelihood estimators (m.l.e.) of µ and Σ are respectively
   µ̂ = X̄ = (1/n) ∑_{i=1}^n Xi    and    Σ̂ = Sn = (1/n) ∑_{i=1}^n (Xi − X̄)(Xi − X̄)′.
9. The Central Limit Theorem: Let X1, . . . , Xn be independent observations from a population with mean µ and nonsingular covariance matrix Σ. Then
   √n (X̄ − µ) →d Np(0, Σ)   as n → ∞,
   and
   n (X̄ − µ)′S⁻¹(X̄ − µ) →d χ²_p   as n → ∞.
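A small simulation sketch of the second convergence; the sample size, number of replicates and the (non-normal) population below are assumptions for illustration only.

## Sketch: n (Xbar - mu)' S^{-1} (Xbar - mu) is approximately chi^2_p for large n
set.seed(6)
p <- 3; n <- 200; mu <- rep(1, p)               # exponential(1) components, mean 1
stat <- replicate(2000, {
  X <- matrix(rexp(n * p), ncol = p)            # non-normal population
  xbar <- colMeans(X)
  c(n * t(xbar - mu) %*% solve(cov(X)) %*% (xbar - mu))
})
qqplot(qchisq(ppoints(2000), df = p), stat,
       xlab = "chi^2_3 quantiles", ylab = "simulated statistic")
abline(0, 1)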
The normality assumption on a sample from X can be assessed by
• examining the univariate marginal distributions of the components
of X, which should be Gaussian;
• examining the bivariate scatterplots of the pairs of components of
X, which should have an elliptical appearance;
• checking if the Mahalanobis distances di² = (xi − x̄)′Sn⁻¹(xi − x̄) follow a χ²_p distribution (see the sketch below).
If the data are clearly non-normal, we can consider the possibility
of taking nonlinear transformations of the variables.
There are multiple proposals in the literature to test the
multivariate normality assumption (see Székely and Rizzo 2005;
McAssey 2013 and references therein).
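A sketch of the third check (chi-square plot of the Mahalanobis distances) for a simulated data matrix; the data and the use of the unbiased sample covariance here are assumptions for illustration.

library(mvtnorm)
set.seed(7)
n <- 100; p <- 2
X <- rmvnorm(n, mean = c(0, 0), sigma = matrix(c(1, 0.5, 0.5, 2), ncol = 2))
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))   # squared distances d_i^2
qqplot(qchisq(ppoints(n), df = p), sort(d2),
       xlab = "chi^2_2 quantiles", ylab = "ordered Mahalanobis distances")
abline(0, 1)   # points near this line are compatible with multivariate normality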
Example: Mass, snout-vent length and hind limb span of 25 lizards
[Figure: scatterplot matrix of the three lizard variables (var 1, var 2, var 3).]
Example: Concentration of Selenium in the teeth and liver of 20
whales (Delphinapterus leucas) at Mackenzie Delta, Northwest
Territories, in 1996.
[Figure: scatterplot matrix of the two selenium concentrations (var 1, var 2).]
2.2.2 Distributions associated to the multivariate normal
Correspondences between the univariate and the multivariate situations:

Univariate case    Multivariate case
N(µ, σ)            Np(µ, Σ)
χ²_n               Wp(Σ, n)
F(m, n)            Λ(p, a, b)
t                  T²
Wishart distribution
Given a random sample of independent random vectors X1 , . . . , Xn
from a Np (0, Σ) distribution, the Wishart distribution Wp (Σ, n) is
that of the random p × p matrix
Q = ∑_{i=1}^n Xi Xi′.
Properties:
1. If Q1 ∼ Wp(Σ, n1) and Q2 ∼ Wp(Σ, n2) are independent, then Q1 + Q2 ∼ Wp(Σ, n1 + n2).
2. Fisher's Theorem: If X1, . . . , Xn are independent Np(µ, Σ) random vectors, then
   i) the sample mean vector X̄ and the sample covariance matrix Sn are independent;
   ii) X̄ ∼ Np(µ, (1/n)Σ);
   iii) nSn ∼ Wp(Σ, n − 1).
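Base R can draw from the Wishart distribution with rWishart(); the sketch below also builds one draw directly from the definition above as a sum of outer products (the Σ used is an assumption for illustration).

library(mvtnorm)
set.seed(8)
p <- 2; n <- 10
Sigma <- matrix(c(2, 0.5, 0.5, 1), ncol = 2)
X <- rmvnorm(n, mean = rep(0, p), sigma = Sigma)
Q <- crossprod(X)                            # sum_i X_i X_i', one draw from W_p(Sigma, n)
Q
rWishart(1, df = n, Sigma = Sigma)[, , 1]    # another draw, directly with base R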
Wilks’ Lambda
This is the distribution of the following ratio of determinants:
Λ = |A| / |A + B| = 1 / |I + A⁻¹B| ∼ Λ(p, a, b),
where A ∼ Wp(Σ, a) and B ∼ Wp(Σ, b) are independent, Σ is nonsingular and a ≥ p.
Properties:
1. Bartlett's approximation: for large a,
   −( a + b − (p + b + 1)/2 ) log Λ(p, a, b) is approximately distributed as a χ²_{pb}.
Hotelling’s T 2
It is the distribution of the r.v.
T² = n Z′Q⁻¹Z ∼ T²(p, n),
where Z ∼ Np (0, I) and Q ∼ Wp (I, n) are independent.
Properties:
1. If p = 1, then T²(1, n) is the square of a Student t distribution with n degrees of freedom.
2. ((n − p + 1)/(np)) T²(p, n) = F(p, n − p + 1).
3. Hotelling's distribution is invariant under affine transformations; that is, if X ∼ Np(µ, Σ) and R ∼ Wp(Σ, n) are independent, then n(X − µ)′R⁻¹(X − µ) ∼ T²(p, n).
4. Given a random sample of independent random vectors X1, . . . , Xn from a Np(µ, Σ) distribution,
   n(X̄ − µ)′Sn⁻¹(X̄ − µ) ∼ T²(p, n − 1).
5. Let X1, . . . , Xn1 and Y1, . . . , Yn2 be two random samples of independent random vectors from a Np(µ1, Σ) and a Np(µ2, Σ) respectively. If µ1 = µ2, then
   (n1n2/(n1 + n2)) (X̄ − Ȳ)′Sp⁻¹(X̄ − Ȳ) ∼ T²(p, n1 + n2 − 2),
   where
   Sp = (n1 Sx,n1 + n2 Sy,n2)/(n1 + n2)    (5)
   is the pooled covariance matrix.
These two properties will be used in hypothesis tests about mean
vectors of Gaussian distributions.
Inferences about the mean
Case 1. Let X1 , . . . , Xn be a sample of independent random
vectors from a Np (µ, Σ). Fix µ0 ∈ Rp and consider the test
H0 : µ = µ0.    (6)
Under H0 the test statistic satisfies
n(X̄ − µ0)′Sn⁻¹(X̄ − µ0) ∼ T²(p, n − 1),
or, equivalently,
((n − p)/p) (X̄ − µ0)′Sn⁻¹(X̄ − µ0) ∼ F(p, n − p).
This provides us with a rejection region for the test (6).
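A sketch of the one-sample test in R on simulated data; it uses the unbiased sample covariance and the standard F scaling, so the constants may differ from the Sn convention used on the slide, and the data and µ0 are assumptions for illustration.

library(mvtnorm)
set.seed(1)
n <- 40; p <- 3
X <- rmvnorm(n, mean = c(0, 0, 0))          # simulated sample
mu0 <- c(0, 0, 0)                           # hypothesized mean
xbar <- colMeans(X)
S <- cov(X)                                 # unbiased sample covariance
T2 <- c(n * t(xbar - mu0) %*% solve(S) %*% (xbar - mu0))
Fstat <- (n - p) / (p * (n - 1)) * T2       # ~ F(p, n - p) under H0
pval <- 1 - pf(Fstat, p, n - p)
c(T2 = T2, F = Fstat, p.value = pval)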
Case 2. Let X1 , . . . , Xn1 and Y1 , . . . , Yn2 be two independent
samples of independent random vectors from a Np (µ1 , Σ) and a
Np (µ2 , Σ) respectively. Consider the test
H0 : µ1 = µ2.    (7)
Under H0, the test statistic satisfies
(n1n2/(n1 + n2)) (X̄ − Ȳ)′Sp⁻¹(X̄ − Ȳ) ∼ T²(p, n1 + n2 − 2),
where Sp is given in (5). This is equivalent to
((n1 + n2 − p − 1)/(p(n1 + n2 − 2))) (n1n2/(n1 + n2)) (X̄ − Ȳ)′Sp⁻¹(X̄ − Ȳ) ∼ F(p, n1 + n2 − p − 1).
This provides us with a rejection region for the test (7).
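An analogous sketch for the two-sample test; again the pooled unbiased covariance and the standard F scaling are used, which may differ by constants from the Sp convention in (5), and the simulated samples are assumptions.

library(mvtnorm)
set.seed(2)
n1 <- 30; n2 <- 35; p <- 2
X <- rmvnorm(n1, mean = c(0, 0))
Y <- rmvnorm(n2, mean = c(0, 0))
d  <- colMeans(X) - colMeans(Y)
Sp <- ((n1 - 1) * cov(X) + (n2 - 1) * cov(Y)) / (n1 + n2 - 2)   # pooled covariance
T2 <- c((n1 * n2 / (n1 + n2)) * t(d) %*% solve(Sp) %*% d)
Fstat <- (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * T2
pval  <- 1 - pf(Fstat, p, n1 + n2 - p - 1)
c(T2 = T2, F = Fstat, p.value = pval)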
Case 3. Assume we have g data matrices from g independent
multivariate normal populations
Sample   Size     Mean   Covariance   Distribution
X1       n1 × p   x̄1     S1           Np(µ1, Σ)
X2       n2 × p   x̄2     S2           Np(µ2, Σ)
...      ...      ...    ...          ...
Xg       ng × p   x̄g     Sg           Np(µg, Σ)
The global sample mean vector and sample covariance matrix are
x̄ = (1/n) ∑_{i=1}^g ni x̄i    and    S = (1/(n − g)) ∑_{i=1}^g ni Si,    with n = ∑_{i=1}^g ni.
Consider the test
H0 : µ1 = µ2 = . . . = µg.    (8)
Let us introduce the following matrices:
B = ∑_{i=1}^g ni (x̄i − x̄)(x̄i − x̄)′    (between-groups dispersion),
W = ∑_{i=1}^g ∑_{k=1}^{ni} (xik − x̄i)(xik − x̄i)′ = ∑_{i=1}^g ni Si    (intra-groups dispersion).
Under H0, B ∼ Wp(Σ, g − 1) and W ∼ Wp(Σ, n − g) are independent. The test statistic
Λ = |W| / |W + B| ∼ Λ(p, n − g, g − 1)
can be approximated by the F distribution via Rao's asymptotic approximation: if Λ ∼ Λ(p, a, b), then
[(1 − Λ^{1/β}) / Λ^{1/β}] · (αβ − 2γ)/(pb) ∼ F(pb, αβ − 2γ),
where α = a + b − (p + b + 1)/2, β² = (p²b² − 4)/(p² + b² − 5) and γ = (pb − 2)/4.
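In R the test (8) with Wilks' Lambda can be carried out with manova(); the group structure, sample sizes and mean shift below are assumptions for illustration.

set.seed(3)
g <- 3; ni <- 40
group <- factor(rep(1:g, each = ni))
Y <- matrix(rnorm(g * ni * 2), ncol = 2)      # two response variables
Y[group == 2, 1] <- Y[group == 2, 1] + 0.5    # shift the mean of group 2
fit <- manova(Y ~ group)
summary(fit, test = "Wilks")                  # Wilks' Lambda and its F approximation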
2.2.3 Mixture models
Let k > 0 be an integer. A p-dimensional random vector X has a
k-component finite mixture distribution if its probability density (or
mass) function is given by
f(x) = ∑_{j=1}^k πj fj(x),    (9)
where fj, j = 1, . . . , k, are probability densities (or mass functions) and 0 ≤ πj ≤ 1, j = 1, . . . , k, are constants such that ∑_{j=1}^k πj = 1.
The fj are the component densities of the mixture and the πj are
the mixing proportions or weights.
In the definition of a mixture model, the number k of components
is considered fixed, but in many applications the value of k is
unknown and has to be inferred from the data.
The key to generate random vectors with density (9) is as follows.
We define the discrete r.v. Z taking values 1, 2, . . . , k, with
probabilities π1 , π2 , . . . , πk respectively. We suppose that the
conditional density of X given Z = j is given by fj . Then the
unconditional density of X is (9).
Equivalently, we can define the discrete random vector Z = (Z1, . . . , Zk)′, with the Zj's taking value 0 or 1, ∑_{j=1}^k Zj = 1, and πj equal to the probability that component Zj in Z is 1. Then
Z follows a multinomial distribution with parameters (π1 , . . . , πk )
and we suppose that fj is the conditional density of X given that
the j-th component of Z is 1.
In many applications the component densities fj are specified to
belong to some parametric family. The resulting model is called a
parametric mixture. In particular, frequently the component
densities are assumed to belong to the same parametric family,
such as the mixtures of Gaussian densities.
Parametric mixture models can be viewed as a semiparametric
compromise between a single parametric family (case k = 1) and a
nonparametric model such as kernel density estimation (case
k = n).
Example: Mixture of three bivariate normal distributions
n = 200                     # Sample size
Size = 1                    # Number of non-zero values of the multinomial components
Prob = c(0.5, 0.3, 0.2)     # Mixing weights
NumComp = length(Prob)      # Number of components in mixture
C = rmultinom(n, Size, Prob)      # Sample from the multinomial
SizeComp = apply(C, 1, sum)       # Sample size for each component
X = matrix(rep(0, n*2), nrow = n)
library(mvtnorm)
X[C[1,]==1,] = rmvnorm(n = SizeComp[1], sigma = matrix(c(1, 0, 0, 1), ncol = 2))
X[C[2,]==1,] = rmvnorm(n = SizeComp[2], mean = c(3, 5), sigma = matrix(c(3, 1, 1, 1), ncol = 2))
X[C[3,]==1,] = rmvnorm(n = SizeComp[3], mean = c(4, -3), sigma = matrix(c(0.5, 0.1, 0.1, 2), ncol = 2))

panel.hist = function(x, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
pairs(X, cex = 1.5, pch = 20, bg = "light blue",
      diag.panel = panel.hist, cex.labels = 2, font.labels = 2)
[Figure: pairs plot of the simulated mixture sample (var 1, var 2), with histograms on the diagonal.]
Thus, a mixture is a candidate distribution to model a population
with several subpopulations.
Example: Times between Old Faithful eruptions (Y , var 2) and
duration of eruptions (X , var 1).
[Figure: pairs plot of the Old Faithful data (var 1, var 2).]
2.3. Maximum likelihood estimation
Let x1 , . . . , xn denote a sample of a multivariate parametric model
with density (or mass) function f (x; ψ), where ψ = (ψ1 , . . . , ψk )0
denotes the vector of unknown parameters. The maximum
likelihood estimator (m.l.e.) of ψ is ψ̂, the maximizer of the
likelihood function
L(ψ; x1, . . . , xn) = ∏_{i=1}^n f(xi; ψ).
MLE for the Gaussian distribution
Let X1 , . . . , Xn be a random sample from a normal population with
mean µ and covariance Σ. Then the m.l.e. of µ and Σ are
respectively
µ̂ = X̄    and    Σ̂ = Sn = (1/n) ∑_{i=1}^n (Xi − X̄)(Xi − X̄)′.
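A short sketch computing these estimators for a simulated Gaussian sample (the true µ and Σ below are assumptions); note that Σ̂ divides by n, not n − 1.

library(mvtnorm)
set.seed(4)
n <- 200
X <- rmvnorm(n, mean = c(1, 2), sigma = matrix(c(2, 0.5, 0.5, 1), ncol = 2))
mu.hat <- colMeans(X)                # m.l.e. of mu
Sigma.hat <- cov(X) * (n - 1) / n    # m.l.e. of Sigma (divides by n)
mu.hat
Sigma.hat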
Proof (Johnson and Wichern 2007): The likelihood is

L(µ, Σ) = ∏_{i=1}^n [1/((2π)^{p/2} |Σ|^{1/2})] e^{−(xi−µ)′Σ⁻¹(xi−µ)/2}
        = [1/((2π)^{np/2} |Σ|^{n/2})] e^{−∑_{i=1}^n (xi−µ)′Σ⁻¹(xi−µ)/2}.

The m.l.e. of µ is the minimizer of

∑_{i=1}^n (xi − µ)′Σ⁻¹(xi − µ) = ∑_{i=1}^n tr[Σ⁻¹(xi − µ)(xi − µ)′]
  = ∑_{i=1}^n tr[Σ⁻¹(xi − x̄)(xi − x̄)′] + n tr[Σ⁻¹(x̄ − µ)(x̄ − µ)′]
  = ∑_{i=1}^n tr[Σ⁻¹(xi − x̄)(xi − x̄)′] + n (x̄ − µ)′Σ⁻¹(x̄ − µ).
Since Σ⁻¹ is positive definite, the distance (x̄ − µ)′Σ⁻¹(x̄ − µ) > 0 unless µ = x̄. Thus the m.l.e. of µ is µ̂ = x̄. It remains to maximize (over Σ)

L(µ̂, Σ) = [1/((2π)^{np/2} |Σ|^{n/2})] e^{−tr[Σ⁻¹ ∑_{i=1}^n (xi − x̄)(xi − x̄)′]/2}.

Auxiliary result: Given a p × p symmetric positive definite matrix B and a scalar b > 0, it holds that, for all positive definite (p × p) matrices Σ,

(1/|Σ|^b) e^{−tr(Σ⁻¹B)/2} ≤ (1/|B|^b) (2b)^{pb} e^{−bp}.

Equality holds if Σ = (1/(2b)) B.
We apply this auxiliary result with b = n/2 and B = ∑_{i=1}^n (xi − x̄)(xi − x̄)′ and conclude that the maximum occurs at Σ̂ = Sn.
MLE for a parametric mixture model
Consider a parametric mixture model
f(x; ψ) = ∑_{j=1}^k πj fj(x; θj),    (10)
where ψ = (π1, . . . , πk−1, ξ′)′ and ξ is the vector containing all the parameters in θ1, . . . , θk known a priori to be different.
We want to obtain the m.l.e. of the parameters in model (10) based on a sample x1, . . . , xn from f. The log-likelihood for ψ is
log L(ψ; x1, . . . , xn) = ∑_{i=1}^n log ( ∑_{j=1}^k πj fj(xi; θj) ).
Computing the m.l.e. would require solving the likelihood equation
∂ log L(ψ) / ∂ψ = 0,
not an easy task (see Section 2.8 in McLachlan and Peel 2000).
The Expectation-Maximization (EM) algorithm of Dempster et al.
(1977) provides an iterative scheme to be followed for computing
the m.l.e. of the parameters ψ in a parametric mixture.
The EM algorithm is designed for “incomplete data”, so the key is
to consider the mixture data x1 , . . . , xn as incomplete, since the
associated component label vectors, z1 , . . . , zn , are not available.
Here zi = (zi1 , . . . , zik )0 is a k-dimensional vector with zij = 1 or 0
according to whether xi did or did not arise from the j-th
component of the mixture, i = 1, . . . , n, j = 1, . . . , k.
The complete data sample is therefore declared to be xc1, . . . , xcn, where xci = (xi, zi). Then the complete-data log-likelihood for ψ is given by
log Lc(ψ) = ∑_{i=1}^n ∑_{j=1}^k zij ( log πj + log fj(xi; θj) ).
E-Step:
The algorithm starts with an initial guess ψ (0) for ψ. In general,
denote by ψ (g ) the approximated value of ψ after the g -th
iteration of the algorithm. The E-step requires computing the
conditional expectation of log Lc (ψ) given the sample x1 , . . . , xn
and under the current approximation for ψ:
Q(ψ; ψ(g)) = Eψ(g)( log Lc(ψ) | x1, . . . , xn )
           = ∑_{i=1}^n ∑_{j=1}^k Eψ(g)(Zij | x1, . . . , xn) ( log πj + log fj(xi; θj) ).
It can be proved that, for i = 1, . . . , n, j = 1, . . . , k,
Eψ(g)(Zij | x1, . . . , xn) = Pψ(g){Zij = 1 | x1, . . . , xn}
  = πj(g) fj(xi; θj(g)) / ∑_{l=1}^k πl(g) fl(xi; θl(g)) =: τij(g).
This is the posterior probability that the i-th member of the
sample, Xi , belongs to the j-th component of the mixture.
Then
Q(ψ; ψ(g)) = ∑_{i=1}^n ∑_{j=1}^k τij(g) ( log πj + log fj(xi; θj) ).
M-Step:
The updated estimate ψ(g+1) is obtained as the global maximizer of Q(ψ; ψ(g)) with respect to ψ. Specifically,
πj(g+1) = (1/n) ∑_{i=1}^n τij(g)
and ξ(g+1) is obtained as an appropriate root of
∑_{i=1}^n ∑_{j=1}^k τij(g) ∂ log fj(xi; θj)/∂ξ = 0.
The E- and M-steps are repeatedly alternated until the difference L(ψ(g+1)) − L(ψ(g)) (≥ 0) is small enough.
In the case of a normal mixture with heteroscedastic components
f(x; ψ) = ∑_{j=1}^k πj f(x; µj, Σj),
the M-step update ξ(g+1) has a closed form:
µj(g+1) = ∑_{i=1}^n τij(g) xi / ∑_{i=1}^n τij(g)
and
Σj(g+1) = ∑_{i=1}^n τij(g) (xi − µj(g+1))(xi − µj(g+1))′ / ∑_{i=1}^n τij(g).
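The following is a minimal sketch of these E- and M-steps for a two-component bivariate normal mixture; the simulated data, starting values and number of iterations are assumptions for illustration (in practice one would use the mclust package, as in the example below).

library(mvtnorm)
set.seed(5)
X <- rbind(rmvnorm(150, mean = c(0, 0)), rmvnorm(100, mean = c(4, 3)))
n <- nrow(X); k <- 2
pi.g  <- rep(1/k, k)                           # crude starting values
mu.g  <- X[sample(n, k), , drop = FALSE]
Sig.g <- replicate(k, cov(X), simplify = FALSE)
for (iter in 1:100) {
  ## E-step: posterior probabilities tau_ij
  dens <- sapply(1:k, function(j) pi.g[j] * dmvnorm(X, mu.g[j, ], Sig.g[[j]]))
  tau  <- dens / rowSums(dens)
  ## M-step: closed-form updates of the weights, means and covariances
  for (j in 1:k) {
    nj <- sum(tau[, j])
    pi.g[j]   <- nj / n
    mu.g[j, ] <- colSums(tau[, j] * X) / nj
    Xc <- sweep(X, 2, mu.g[j, ])
    Sig.g[[j]] <- crossprod(sqrt(tau[, j]) * Xc) / nj
  }
}
pi.g; mu.g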
Remark: We have assumed that the number k of components in
the mixture fitted to the sample is known or fixed in advance.
There are techniques for choosing the “optimal” number of
components (see, e.g., Chapter 6 in McLachlan and Peel 2000;
Claeskens and Hjort 2008).
We can use the R package mclust for normal mixture fitting to a
data set.
Example: Times between Old Faithful eruptions (Y ) and duration
of eruptions (X ).
library(mclust)
Datos = read.table('Datos-geyser.txt', header = TRUE)
XY = cbind(Datos$X, Datos$Y)
# Normal mixture fitting with 2 components
faithfulDens = densityMclust(XY, G = 2, modelNames = "VVV")
summary(faithfulDens, parameters = TRUE)
-------------------------------------------------------
Density estimation via Gaussian finite mixture modeling
-------------------------------------------------------

Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model
with 2 components:

 log.likelihood   n df       BIC       ICL
       -455.468 107 11 -962.3372 -962.4458
Clustering table:
 1  2
80 27

Mixing probabilities:
        1         2
0.7479006 0.2520994

Means:
          [,1]      [,2]
[1,]  3.994438  1.877456
[2,] 77.314712 52.266216

Variances:
[,,1]
         [,1]      [,2]
[1,] 0.284642  1.927692
[2,] 1.927692 57.832533
[,,2]
            [,1]        [,2]
[1,]  0.02383614 -0.05018129
[2,] -0.05018129 19.87038892
plot(faithfulDens, XY, xlab = "X", ylab = "Y")

[Figure: contours of the fitted two-component mixture density, with the Old Faithful data points.]
plot(faithfulDens, type = "persp", col = grey(0.8))
[Figure: perspective plot of the fitted mixture density.]
References
Claeskens, G. and Hjort, N.L. (2008). Model Selection and Model Averaging.
Cambridge University Press.
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society.
Series B, 39, 1–38.
Johnson, R.A. and Wichern, D.W. (2007). Applied Multivariate Statistical Analysis.
Prentice Hall.
McAssey, M.P. (2013). An empirical goodness-of-fit test for multivariate distributions.
Journal of Applied Statistics, 40, 1120–1131.
McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley.
Peña, D. (2002). Análisis de datos multivariantes. McGraw-Hill.
Székely, G.J. and Rizzo, M.L. (2005). A new test for multivariate normality. Journal of
Multivariate Analysis, 93, 58–80.
Székely, G.J. and Rizzo, M.L. (2013). Energy statistics: a class of statistics based on
distances. Journal of Statistical Planning and Inference, 143, 1249–1272.
Székely, G.J., Rizzo, M.L. and Bakirov, N.K. (2007). Measuring and testing
independence by correlation of distances. Annals of Statistics, 35, 2769–2794.