2. Multivariate Distributions

- Random vectors: mean, covariance matrix, linear transformations, dependence measures (a short introduction to the probability tools for multivariate statistics).
- Multidimensional normal distribution, mixture models (some well-known examples of multivariate probability distributions).

2.1. Random vectors

Multivariate data are the result of observing a random vector, a vector $X = (X_1, \ldots, X_p)'$ whose components $X_j$, $j = 1, \ldots, p$, are random variables (r.v.) on the same probability space $(\Omega, \mathcal{A}, P)$. Similarly, a random matrix is a matrix whose elements are r.v. The probability distribution of a random vector or matrix is characterized by the joint distribution of its components. In particular, the distribution function of a random vector $X$ is
\[
F(x_1, \ldots, x_p) = P\{X_1 \le x_1, \ldots, X_p \le x_p\}, \qquad (x_1, \ldots, x_p) \in \mathbb{R}^p.
\]
In general, we will only work with continuous random vectors, whose probability distribution is characterized by the density function $f = f(x_1, \ldots, x_p)$, satisfying
1. $f(x_1, \ldots, x_p) \ge 0$ for all $(x_1, \ldots, x_p) \in \mathbb{R}^p$;
2. $\int_{\mathbb{R}^p} f(x_1, \ldots, x_p)\, dx_1 \cdots dx_p = 1$;
3. $f(x_1, \ldots, x_p) = \dfrac{\partial^p F(x_1, \ldots, x_p)}{\partial x_1 \cdots \partial x_p}$.

The marginal distribution of each component $X_j$, $j = 1, \ldots, p$, is its probability distribution as an individual random variable. Its density function is
\[
f_j(x_j) = \int_{\mathbb{R}^{p-1}} f(x_1, \ldots, x_p)\, dx_1 \cdots dx_{j-1}\, dx_{j+1} \cdots dx_p, \qquad x_j \in \mathbb{R}.
\]
More generally, given the partition
\[
X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix},
\]
with $X^{(1)} = (X_1, \ldots, X_r)'$ and $X^{(2)} = (X_{r+1}, \ldots, X_p)'$, the marginal density of $X^{(1)}$ is
\[
f_{X^{(1)}}(x_1, \ldots, x_r) = \int_{\mathbb{R}^{p-r}} f(x_1, \ldots, x_p)\, dx_{r+1} \cdots dx_p.
\]

Two random matrices $X_1$ and $X_2$ are independent if the elements of $X_1$ (as a collection of r.v.) are independent of the elements of $X_2$. (The elements within $X_1$ or $X_2$ need not be independent.) In particular, given the partition $X = [X^{(1)'}, X^{(2)'}]'$, the vectors $X^{(1)}$ and $X^{(2)}$ are independent if
\[
F(x_1, \ldots, x_p) = F_{X^{(1)}}(x_1, \ldots, x_r)\, F_{X^{(2)}}(x_{r+1}, \ldots, x_p) \quad \text{for all } x_1, \ldots, x_p,
\]
or, equivalently, if
\[
f(x_1, \ldots, x_p) = f_{X^{(1)}}(x_1, \ldots, x_r)\, f_{X^{(2)}}(x_{r+1}, \ldots, x_p) \quad \text{for all } x_1, \ldots, x_p.
\]

2.1.1 Expectation

The expected value of a random vector (resp. matrix) is the vector (resp. matrix) of expected values of each of its components (the marginal expectations). For the random vector $X = (X_1, \ldots, X_p)'$,
\[
\mu := E(X) = (E(X_1), \ldots, E(X_p))' = (\mu_1, \ldots, \mu_p)',
\]
where $\mu_j := E(X_j) = \int_{\mathbb{R}} x f_j(x)\, dx$.

The expectation is a linear function:
1. If $A$ is a $q \times p$ constant matrix, $X$ is a $p$-dimensional random vector and $b$ is a $q$-dimensional constant vector, then $E(AX + b) = A E(X) + b$.
2. If $X$ and $Y$ are random matrices of the same dimension, then $E(X + Y) = E(X) + E(Y)$.
3. If $X$ is a $q \times p$ random matrix and $A$, $B$ are constant matrices of adequate dimensions, then $E(AXB) = A E(X) B$.

If $X_1$ and $X_2$ are conformable independent matrices, then $E(X_1 X_2) = E(X_1) E(X_2)$.

2.1.2 Covariance matrix

The variance-covariance matrix (or simply covariance matrix) of a random vector $X = (X_1, \ldots, X_p)'$ with expectation $\mu$ is
\[
\Sigma = V(X) := E\big((X - \mu)(X - \mu)'\big) = E(XX') - \mu\mu' =
\begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp}
\end{pmatrix},
\]
where $\sigma_{jj} = V(X_j)$ is the variance of the r.v. $X_j$ and $\sigma_{jk} = \mathrm{Cov}(X_j, X_k)$ is the covariance of $X_j$ and $X_k$, $j, k = 1, \ldots, p$. Then $\Sigma$ is a symmetric matrix.

Some properties of the covariance matrix:
1. If $A$ is a $q \times p$ constant matrix, $X$ is a $p$-dimensional random vector and $b$ is a $q$-dimensional constant vector, then $V(AX + b) = A V(X) A'$.
2. $\Sigma = V(X)$ is always nonnegative definite.

2.1.3 Correlation matrix

Let $X = (X_1, \ldots, X_p)'$ be a random vector with covariance matrix $\Sigma$ and with $0 < \sigma_{jj} = V(X_j) < \infty$, $j = 1, \ldots, p$. Define $D := \mathrm{diag}(\sigma_{11}, \ldots, \sigma_{pp})$. Then the correlation matrix of $X$ is
\[
\rho = \begin{pmatrix}
1 & \rho_{12} & \cdots & \rho_{1p} \\
\rho_{21} & 1 & \cdots & \rho_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{p1} & \rho_{p2} & \cdots & 1
\end{pmatrix}
= D^{-1/2} \Sigma D^{-1/2},
\]
where $\rho_{jk}$ is the correlation of $X_j$ and $X_k$, $j, k = 1, \ldots, p$, and $D^{-1/2} := \mathrm{diag}(\sigma_{11}^{-1/2}, \ldots, \sigma_{pp}^{-1/2})$. Observe that, if $Z := D^{-1/2}(X - \mu)$, where $\mu = E(X)$, then $V(Z) = \rho$.
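As a quick numerical illustration of the transformation rules $E(AX + b) = A E(X) + b$ and $V(AX + b) = A V(X) A'$, and of the identity $\rho = D^{-1/2}\Sigma D^{-1/2}$, we can simulate a large sample and compare empirical and theoretical moments. This is only an illustrative sketch: the matrix $A$, the vector $b$, the parameters and the sample size are arbitrary choices.

## Illustrative sketch: empirical check of E(AX + b) = A E(X) + b,
## V(AX + b) = A V(X) A' and rho = D^{-1/2} Sigma D^{-1/2}.
## A, b, mu, Sigma and n are arbitrary choices for the illustration.
library(mvtnorm)   # for rmvnorm

set.seed(1)
n <- 1e5
mu <- c(1, -2, 0)
Sigma <- matrix(c(4, 1, 0.5,
                  1, 2, 0.3,
                  0.5, 0.3, 1), ncol = 3)
X <- rmvnorm(n, mean = mu, sigma = Sigma)

A <- matrix(c(1, 0, 2,
              0, 1, -1), nrow = 2, byrow = TRUE)
b <- c(3, -1)
Y <- t(A %*% t(X) + b)           # each row of Y is A x_i + b

colMeans(Y)                      # approx. equal to A mu + b
A %*% mu + b
cov(Y)                           # approx. equal to A Sigma A'
A %*% Sigma %*% t(A)

Dinv <- diag(1 / sqrt(diag(Sigma)))
Dinv %*% Sigma %*% Dinv          # population correlation matrix
cov2cor(Sigma)                   # same result with the base R utility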
2.1.4 Dependence measures

More generally, the (cross-)covariance between the $p$-dimensional random vector $X_1$ and the $q$-dimensional random vector $X_2$, with means $\mu_1$ and $\mu_2$ respectively, is the $p \times q$ matrix given by
\[
\mathrm{Cov}(X_1, X_2) = E\big((X_1 - \mu_1)(X_2 - \mu_2)'\big).
\]
Some properties of the cross-covariance:
1. If $A$ and $B$ are constant matrices and $c$ and $d$ are constant vectors, then $\mathrm{Cov}(AX_1 + c, BX_2 + d) = A\,\mathrm{Cov}(X_1, X_2)\,B'$.
2. If $X_1$, $X_2$ and $X_3$ are random vectors, then $\mathrm{Cov}(X_1 + X_2, X_3) = \mathrm{Cov}(X_1, X_3) + \mathrm{Cov}(X_2, X_3)$.
3. If $X_1$ and $X_2$ are independent, then $\mathrm{Cov}(X_1, X_2) = 0_{p \times q}$.

Pearson's product-moment covariance measures linear dependence and, for the multivariate normal distribution, a diagonal covariance matrix implies independence of the components of the random vector. In general, however, Pearson's correlation matrix does not characterize independence. Székely et al. (2007) introduced two dependence coefficients, distance covariance and distance correlation, that measure all types of dependence between random vectors $X$ and $Y$ of arbitrary (and possibly different) dimensions.

Suppose that $X$ in $\mathbb{R}^p$ and $Y$ in $\mathbb{R}^q$ are random vectors. The characteristic function of $X$ is
\[
\hat f_X(t) := E\, e^{i\langle t, X\rangle} = \int_{\mathbb{R}^p} e^{i\langle t, x\rangle}\, dF_X(x).
\]
Let $\hat f_Y$ be the characteristic function of $Y$, and denote the joint characteristic function of $(X', Y')'$ by $\hat f_{X,Y}$. Then $X$ and $Y$ are independent if and only if $\hat f_{X,Y} = \hat f_X \hat f_Y$.

Distance covariance is defined as a measure of the discrepancy between $\hat f_{X,Y}$ and $\hat f_X \hat f_Y$:
\[
\|\hat f_{X,Y}(t,s) - \hat f_X(t)\hat f_Y(s)\|_w^2 = \int_{\mathbb{R}^{p+q}} |\hat f_{X,Y}(t,s) - \hat f_X(t)\hat f_Y(s)|^2\, w(t,s)\, dt\, ds.
\]
The only integrable weight function $w$ that makes this definition scale and rotation invariant is proportional to the reciprocal of $|t|_p^{1+p}\, |s|_q^{1+q}$, where $|\cdot|_p$ here denotes the Euclidean norm in $\mathbb{R}^p$. The distance covariance between random vectors $X$ and $Y$ with $E|X|_p < \infty$ and $E|Y|_q < \infty$ is the square root of
\[
V^2(X, Y) = \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}} \frac{|\hat f_{X,Y}(t,s) - \hat f_X(t)\hat f_Y(s)|^2}{|t|_p^{1+p}\, |s|_q^{1+q}}\, dt\, ds, \qquad (1)
\]
with $c_p := \dfrac{\pi^{(p+1)/2}}{\Gamma\big(\frac{p+1}{2}\big)}$. Similarly, distance variance is defined as the square root of $V^2(X) = V^2(X, X)$.

The distance correlation between random vectors $X$ and $Y$ with $E|X|_p < \infty$ and $E|Y|_q < \infty$ is the square root of
\[
R^2(X, Y) := \begin{cases} \dfrac{V^2(X, Y)}{\sqrt{V^2(X)\, V^2(Y)}}, & V^2(X)\, V^2(Y) > 0, \\[6pt] 0, & V^2(X)\, V^2(Y) = 0. \end{cases} \qquad (2)
\]
Theorem 3 in Székely et al. (2007): If $E|X|_p < \infty$ and $E|Y|_q < \infty$, then $0 \le R \le 1$, and $R(X, Y) = 0$ if and only if $X$ and $Y$ are independent.
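The following sketch contrasts Pearson correlation with distance correlation on a classical example of uncorrelated but dependent variables, $Y = X^2$ with $X$ standard normal. It assumes the energy package of Rizzo and Székely, whose dcor function computes the empirical distance correlation; the sample size is arbitrary.

## Sketch: Pearson correlation misses the dependence of Y = X^2 on X,
## while distance correlation detects it. Assumes the 'energy' package
## (Rizzo and Szekely) is installed; dcor() is its empirical distance
## correlation.
library(energy)

set.seed(123)
x <- rnorm(500)
y <- x^2            # a function of x, hence dependent, but uncorrelated

cor(x, y)           # Pearson correlation: close to 0
dcor(x, y)          # distance correlation: clearly positive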
For an observed random sample $\{(X_i, Y_i),\ i = 1, \ldots, n\}$ of $(X, Y)$, natural estimators of the unknown characteristic functions are
\[
\hat f_X^n(t) := \int_{\mathbb{R}^p} e^{i\langle t, x\rangle}\, dF_X^n(x) = \frac{1}{n}\sum_{i=1}^n e^{i\langle t, X_i\rangle}, \qquad \hat f_Y^n(s) := \frac{1}{n}\sum_{i=1}^n e^{i\langle s, Y_i\rangle}
\]
and
\[
\hat f_{X,Y}^n(t,s) := \frac{1}{n}\sum_{i=1}^n e^{i\langle t, X_i\rangle + i\langle s, Y_i\rangle},
\]
where $F_X^n$ denotes the empirical distribution function of $X_1, \ldots, X_n$. The empirical distance covariance is defined as the square root of
\[
V_n^2(X, Y) := \|\hat f_{X,Y}^n(t,s) - \hat f_X^n(t)\hat f_Y^n(s)\|_w^2.
\]

Székely et al. (2007) used the asymptotic properties of the empirical distance covariance to test the independence of $X$ and $Y$:
\[
H_0: X \text{ and } Y \text{ are independent} \qquad \text{vs.} \qquad H_1: X \text{ and } Y \text{ are dependent.}
\]
Corollary 2 of Székely et al. (2007): If $E|X|_p < \infty$ and $E|Y|_q < \infty$ and $X$ and $Y$ are independent, then
\[
\frac{n\, V_n^2(X, Y)}{S_2} \xrightarrow[n \to \infty]{d} Q, \qquad (3)
\]
where $Q$ is a certain, known quadratic form of centered Gaussian random variables with $E(Q) = 1$ and
\[
S_2 := \frac{1}{n^2}\sum_{i,k=1}^n |X_i - X_k|_p \cdot \frac{1}{n^2}\sum_{i,k=1}^n |Y_i - Y_k|_q.
\]
The test statistic (3) is a particular case of the so-called energy statistics, functions of distances between statistical observations (see Székely and Rizzo 2013).
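In practice, an independence test based on the empirical distance covariance can be run with the permutation implementation available in the energy package. The sketch below assumes that package and reuses the $x$, $y$ pair from the previous snippet; the number of replicates is an arbitrary illustrative choice.

## Sketch of the distance-covariance independence test of Szekely et al.
## (2007), using the permutation implementation energy::dcov.test.
## The number of replicates R = 199 is an arbitrary illustrative choice.
library(energy)

set.seed(123)
x <- rnorm(500)
y <- x^2                         # dependent on x, but uncorrelated

dcov.test(x, y, R = 199)         # small p-value: reject independence

y2 <- rnorm(500)                 # independent of x
dcov.test(x, y2, R = 199)        # large p-value: no evidence against H0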
2.2. Examples of multidimensional distributions

2.2.1 Multidimensional normal distribution

The random vector $X = (X_1, \ldots, X_p)'$ follows a $p$-dimensional normal distribution with mean $\mu$ and covariance matrix $\Sigma$, and we denote it by $X \sim N_p(\mu, \Sigma)$, if its density function is
\[
f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-(x - \mu)' \Sigma^{-1} (x - \mu)/2}, \qquad (4)
\]
where $x = (x_1, \ldots, x_p)'$ and $-\infty < x_i < \infty$, $i = 1, \ldots, p$.

Example (Bivariate normal density): We evaluate the bivariate ($p = 2$) normal density in terms of the individual parameters $\mu_1 = E(X_1)$, $\mu_2 = E(X_2)$, $\sigma_{11} = V(X_1)$, $\sigma_{22} = V(X_2)$ and $\rho_{12} = \mathrm{Cor}(X_1, X_2) = \sigma_{12}/\sqrt{\sigma_{11}\sigma_{22}}$.

The determinant and inverse of the matrix
\[
\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix} = \begin{pmatrix} \sigma_{11} & \sqrt{\sigma_{11}\sigma_{22}}\,\rho_{12} \\ \sqrt{\sigma_{11}\sigma_{22}}\,\rho_{12} & \sigma_{22} \end{pmatrix}
\]
are, respectively, $|\Sigma| = \sigma_{11}\sigma_{22}(1 - \rho_{12}^2)$ and
\[
\Sigma^{-1} = \frac{1}{\sigma_{11}\sigma_{22}(1 - \rho_{12}^2)} \begin{pmatrix} \sigma_{22} & -\sqrt{\sigma_{11}\sigma_{22}}\,\rho_{12} \\ -\sqrt{\sigma_{11}\sigma_{22}}\,\rho_{12} & \sigma_{11} \end{pmatrix}.
\]
Thus
\begin{align*}
(x - \mu)'\Sigma^{-1}(x - \mu) &= (x_1 - \mu_1,\ x_2 - \mu_2)\, \frac{1}{\sigma_{11}\sigma_{22}(1 - \rho_{12}^2)} \begin{pmatrix} \sigma_{22} & -\sqrt{\sigma_{11}\sigma_{22}}\,\rho_{12} \\ -\sqrt{\sigma_{11}\sigma_{22}}\,\rho_{12} & \sigma_{11} \end{pmatrix} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \\
&= \frac{\sigma_{22}(x_1 - \mu_1)^2 + \sigma_{11}(x_2 - \mu_2)^2 - 2\rho_{12}\sqrt{\sigma_{11}\sigma_{22}}\,(x_1 - \mu_1)(x_2 - \mu_2)}{\sigma_{11}\sigma_{22}(1 - \rho_{12}^2)} \\
&= \frac{1}{1 - \rho_{12}^2}\left[\frac{(x_1 - \mu_1)^2}{\sigma_{11}} + \frac{(x_2 - \mu_2)^2}{\sigma_{22}} - 2\rho_{12}\,\frac{x_1 - \mu_1}{\sqrt{\sigma_{11}}}\,\frac{x_2 - \mu_2}{\sqrt{\sigma_{22}}}\right].
\end{align*}
Consequently, the bivariate normal density is
\[
f(x_1, x_2) = \frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}(1 - \rho_{12}^2)}} \exp\left\{-\frac{1}{2(1 - \rho_{12}^2)}\left[\frac{(x_1 - \mu_1)^2}{\sigma_{11}} + \frac{(x_2 - \mu_2)^2}{\sigma_{22}} - 2\rho_{12}\,\frac{x_1 - \mu_1}{\sqrt{\sigma_{11}}}\,\frac{x_2 - \mu_2}{\sqrt{\sigma_{22}}}\right]\right\}.
\]
Observe that, if $\rho_{12} = 0$ ($X_1$ and $X_2$ are uncorrelated), then
\[
f(x_1, x_2) = \frac{1}{2\pi\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}} \exp\left\{-\frac{1}{2}\left[\frac{(x_1 - \mu_1)^2}{\sigma_{11}} + \frac{(x_2 - \mu_2)^2}{\sigma_{22}}\right]\right\} = f_1(x_1)\, f_2(x_2).
\]
Since the joint density $f(x_1, x_2)$ can be expressed as the product of the marginal densities, we conclude that $X_1$ and $X_2$ are actually independent r.v.

## Bivariate normal: density surface, contours and simulated data,
## for an uncorrelated case (top row) and a correlated case (bottom row)
library(mvtnorm)

split.screen(c(2, 3))

screen(1)  ## bivariate normal pdf, Sigma = identity
x = y = seq(-5, 5, length = 50)
f = function(x, y) { dmvnorm(cbind(x, y)) }
z = outer(x, y, f)
par(mai = c(0.1, 0.1, 0.1, 0.1))
persp(x, y, z, theta = 5, phi = 50, expand = 0.5, col = "lightblue")

screen(2)  ## contours of the bivariate normal pdf
x = y = seq(-5, 5, length = 150)
z = outer(x, y, f)
par(mai = c(0.5, 0.5, 0.5, 0.5))
contour(x, y, z, nlevels = 20, col = rainbow(20))

screen(3)  ## simulated normal data
X = rmvnorm(n = 100, sigma = matrix(c(1, 0, 0, 1), ncol = 2))
par(mai = c(0.5, 0.5, 0.5, 0.5))
plot(X[,1], X[,2], pch = 19, xlab = expression(x[1]), ylab = expression(x[2]))

screen(4)  ## bivariate normal pdf, correlated components
x = y = seq(-5, 5, length = 50)
Sigma = matrix(c(1, 0.7, 0.7, 1), ncol = 2)
f = function(x, y) { dmvnorm(cbind(x, y), sigma = Sigma) }
z = outer(x, y, f)
par(mai = c(0.1, 0.1, 0.1, 0.1))
persp(x, y, z, theta = 5, phi = 50, expand = 0.5, col = "lightblue")

screen(5)  ## contours of the bivariate normal pdf
x = y = seq(-5, 5, length = 150)
z = outer(x, y, f)
par(mai = c(0.5, 0.5, 0.5, 0.5))
contour(x, y, z, nlevels = 20, col = rainbow(20))

screen(6)  ## simulated normal data
X = rmvnorm(n = 100, sigma = Sigma)
par(mai = c(0.5, 0.5, 0.5, 0.5))
plot(X[,1], X[,2], pch = 19, xlab = expression(x[1]), ylab = expression(x[2]))

[Figure: density surfaces, contour plots and scatterplots of the two simulated bivariate normal samples (uncorrelated and correlated).]

Properties of the multivariate normal distribution

Let $X \sim N_p(\mu, \Sigma)$.
1. The normal density has a global maximum at $\mu$ and is symmetric with respect to $\mu$ in the sense that $f(\mu + a) = f(\mu - a)$ for all $a \in \mathbb{R}^p$.
2. Linear combinations of a multivariate normal are also normally distributed: if $A$ is a $q \times p$ constant matrix and $d$ is a $q \times 1$ constant vector, then $AX + d \sim N_q(A\mu + d, A\Sigma A')$. Consequently, all subsets of the components of $X$ are normally distributed.
3. Zero correlation between normal vectors is equivalent to independence: if $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$, then $X_1$ and $X_2$ are independent if and only if $\mathrm{Cov}(X_1, X_2) = 0$.
4. If $|\Sigma| > 0$, there exists a linear transformation of $X$ with mean $0$ and covariance matrix equal to the identity.
5. Contours of constant density for the multivariate normal distribution are ellipsoids centered at the population mean. If $|\Sigma| > 0$, then
(a) the level sets of the probability density $f$ are the ellipsoids $\{x \in \mathbb{R}^p : (x - \mu)'\Sigma^{-1}(x - \mu) = c^2\}$. These ellipsoids are centered at $\mu$ and have axes $\pm c\sqrt{\lambda_i}\, e_i$, where $(\lambda_i, e_i)$, $i = 1, \ldots, p$, are the eigenvalue-eigenvector pairs of $\Sigma$;
(b) $(X - \mu)'\Sigma^{-1}(X - \mu)$ follows a $\chi_p^2$ distribution. Thus $P\{(X - \mu)'\Sigma^{-1}(X - \mu) \le \chi_{p;\alpha}^2\} = 1 - \alpha$, for any $0 < \alpha < 1$.

The Mahalanobis distance $d_M$ of a point $x \in \mathbb{R}^p$ to the mean $\mu$ of a $p$-dimensional distribution with covariance matrix $\Sigma$ is defined by
\[
d_M^2(x) = (x - \mu)'\Sigma^{-1}(x - \mu).
\]
It is a statistical distance in the sense that it takes into account the variability of the distribution (unlike the Euclidean distance).

6. If $X \sim N_p(\mu, \Sigma)$, then any linear combination of its components, $a'X = a_1 X_1 + a_2 X_2 + \cdots + a_p X_p$, is distributed as $N(a'\mu, a'\Sigma a)$. Conversely, if $a'X$ is distributed as $N(a'\mu, a'\Sigma a)$ for every $a \in \mathbb{R}^p$, then $X$ must follow a $N_p(\mu, \Sigma)$ distribution.
7. Let $X_1, \ldots, X_n$ be mutually independent $N_p(\mu_j, \Sigma)$ random vectors and let $c_1, \ldots, c_n$ be real constants. Then
\[
V = c_1 X_1 + \cdots + c_n X_n \sim N_p\Big(\sum_{j=1}^n c_j \mu_j,\ \sum_{j=1}^n c_j^2\, \Sigma\Big).
\]
8. Given a random sample $X_1, \ldots, X_n$ from $X \sim N_p(\mu, \Sigma)$, the maximum likelihood estimators (m.l.e.) of $\mu$ and $\Sigma$ are, respectively,
\[
\hat\mu = \bar X = \frac{1}{n}\sum_{i=1}^n X_i \qquad \text{and} \qquad \hat\Sigma = S_n = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)'.
\]
9. The Central Limit Theorem: Let $X_1, \ldots, X_n$ be independent observations from a population with mean $\mu$ and nonsingular covariance matrix $\Sigma$. Then
\[
\sqrt{n}\,(\bar X - \mu) \xrightarrow[n \to \infty]{d} N_p(0, \Sigma) \qquad \text{and} \qquad n(\bar X - \mu)' S^{-1} (\bar X - \mu) \xrightarrow[n \to \infty]{d} \chi_p^2.
\]

The normality assumption on a sample from $X$ can be assessed by
- examining the univariate marginal distributions of the components of $X$, which should be Gaussian;
- examining the bivariate scatterplots of the pairs of components of $X$, which should have an elliptical appearance;
- checking whether the Mahalanobis distances $d_i^2 = (x_i - \bar x)' S_n^{-1} (x_i - \bar x)$ follow a $\chi_p^2$ distribution.

If the data are clearly non-normal, we can consider the possibility of taking nonlinear transformations of the variables. There are multiple proposals in the literature to test the multivariate normality assumption (see Székely and Rizzo 2005; McAssey 2013 and references therein).
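The third diagnostic above can be checked directly by simulation. The sketch below, for Gaussian data with known parameters, computes squared Mahalanobis distances with the base R function mahalanobis and compares them with $\chi_p^2$ quantiles; the dimension and parameters are arbitrary illustrative choices.

## Sketch: squared Mahalanobis distances of Gaussian data to the mean
## follow a chi-squared distribution with p degrees of freedom.
## mu and Sigma below are arbitrary illustrative choices.
library(mvtnorm)

set.seed(7)
p <- 3
mu <- c(0, 1, -1)
Sigma <- diag(p) + 0.5           # a positive definite covariance matrix
X <- rmvnorm(2000, mean = mu, sigma = Sigma)

d2 <- mahalanobis(X, center = mu, cov = Sigma)  # squared distances

mean(d2 <= qchisq(0.95, df = p))  # should be close to 0.95
qqplot(qchisq(ppoints(2000), df = p), d2,
       xlab = "Chi-squared quantiles",
       ylab = "Squared Mahalanobis distances")
abline(0, 1)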
Example: Mass, snout-vent length and hind limb span of 25 lizards.

[Figure: scatterplot matrix of the three lizard variables.]

Example: Concentration of selenium in the teeth and liver of 20 whales (Delphinapterus leucas) at Mackenzie Delta, Northwest Territories, in 1996.

[Figure: scatterplot of selenium concentration in teeth versus liver.]

2.2.2 Distributions associated to the multivariate normal

Correspondences between the univariate and the multivariate situations:

Univariate case     Multivariate case
$N(\mu, \sigma)$    $N_p(\mu, \Sigma)$
$\chi_n^2$          $W_p(\Sigma, n)$
$F(m, n)$           $\Lambda(p, a, b)$
$t$                 $T^2$

Wishart distribution

Given a random sample of independent random vectors $X_1, \ldots, X_n$ from a $N_p(0, \Sigma)$ distribution, the Wishart distribution $W_p(\Sigma, n)$ is that of the random $p \times p$ matrix
\[
Q = \sum_{i=1}^n X_i X_i'.
\]
Properties:
1. If $Q_1 \sim W_p(\Sigma, n_1)$ and $Q_2 \sim W_p(\Sigma, n_2)$ are independent, then $Q_1 + Q_2 \sim W_p(\Sigma, n_1 + n_2)$.
2. Fisher's Theorem: If $X_1, \ldots, X_n$ are independent $N_p(\mu, \Sigma)$ random vectors, then
   i) the sample mean vector $\bar X$ and the sample covariance matrix $S_n$ are independent;
   ii) $\bar X \sim N_p(\mu, \frac{1}{n}\Sigma)$;
   iii) $nS_n \sim W_p(\Sigma, n - 1)$.
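The Wishart definition can be checked by simulation. The sketch below compares the Monte Carlo mean of matrices $Q = \sum_i X_i X_i'$ with $n\Sigma$, the mean of the $W_p(\Sigma, n)$ distribution, and draws from the same law with the base R generator rWishart; all sizes and parameters are arbitrary illustrative choices.

## Sketch: Monte Carlo check that Q = sum_i X_i X_i', with X_i ~ N_p(0, Sigma),
## has mean n * Sigma, the mean of the Wishart W_p(Sigma, n) distribution.
## The base R generator rWishart draws from the same law.
library(mvtnorm)

set.seed(42)
p <- 2; n <- 10
Sigma <- matrix(c(2, 0.5, 0.5, 1), ncol = 2)

one_Q <- function() {
  X <- rmvnorm(n, sigma = Sigma)   # n rows, each row is an X_i'
  t(X) %*% X                       # equals sum_i X_i X_i'
}
Qs <- replicate(5000, one_Q())
apply(Qs, c(1, 2), mean)           # approx. equal to n * Sigma
n * Sigma

W <- rWishart(5000, df = n, Sigma = Sigma)  # direct Wishart draws
apply(W, c(1, 2), mean)            # also approx. equal to n * Sigma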
Wilks' Lambda

This is the distribution of the following ratio of determinants:
\[
\Lambda = \frac{|A|}{|A + B|} = \frac{1}{|I + A^{-1}B|} \sim \Lambda(p, a, b),
\]
where $A \sim W_p(\Sigma, a)$ and $B \sim W_p(\Sigma, b)$ are independent, $\Sigma$ is nonsingular and $a \ge p$.

Properties:
1. Bartlett's approximation: For large $a$,
\[
-\Big(a + b - \frac{p + b + 1}{2}\Big)\log \Lambda(p, a, b) \simeq \chi_{pb}^2.
\]

Hotelling's $T^2$

It is the distribution of the r.v.
\[
T^2 = n\, Z' Q^{-1} Z \sim T^2(p, n),
\]
where $Z \sim N_p(0, I)$ and $Q \sim W_p(I, n)$ are independent.

Properties:
1. If $p = 1$, then $T^2(1, n)$ is the square of a Student $t$ distribution with $n$ degrees of freedom.
2. $\dfrac{n - p + 1}{np}\, T^2(p, n) = F(p, n - p + 1)$.
3. Hotelling's distribution is invariant under affine transformations, that is, if $X \sim N_p(\mu, \Sigma)$ and $R \sim W_p(\Sigma, n)$ are independent, then $n(X - \mu)' R^{-1} (X - \mu) \sim T^2(p, n)$.
4. Given a random sample of independent random vectors $X_1, \ldots, X_n$ from a $N_p(\mu, \Sigma)$ distribution,
\[
n(\bar X - \mu)' S_n^{-1} (\bar X - \mu) \sim T^2(p, n - 1).
\]
5. Let $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ be two random samples of independent random vectors from a $N_p(\mu_1, \Sigma)$ and a $N_p(\mu_2, \Sigma)$ distribution, respectively. If $\mu_1 = \mu_2$, then
\[
\frac{n_1 n_2}{n_1 + n_2}\, (\bar X - \bar Y)' S_p^{-1} (\bar X - \bar Y) \sim T^2(p, n_1 + n_2 - 2),
\]
where
\[
S_p = \frac{n_1 S_{x,n_1} + n_2 S_{y,n_2}}{n_1 + n_2} \qquad (5)
\]
is the pooled covariance matrix.

These last two properties are used in hypothesis tests about mean vectors of Gaussian distributions.

Inferences about the mean

Case 1. Let $X_1, \ldots, X_n$ be a sample of independent random vectors from a $N_p(\mu, \Sigma)$. Fix $\mu_0 \in \mathbb{R}^p$ and consider the test
\[
H_0: \mu = \mu_0. \qquad (6)
\]
Under $H_0$, the test statistic satisfies
\[
n(\bar X - \mu_0)' S_n^{-1} (\bar X - \mu_0) \sim T^2(p, n - 1),
\]
or, equivalently,
\[
\frac{n - p}{p}\, (\bar X - \mu_0)' S_n^{-1} (\bar X - \mu_0) \sim F(p, n - p).
\]
This provides us with a rejection region for the test (6).

Case 2. Let $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ be two independent samples of independent random vectors from a $N_p(\mu_1, \Sigma)$ and a $N_p(\mu_2, \Sigma)$ distribution, respectively. Consider the test
\[
H_0: \mu_1 = \mu_2. \qquad (7)
\]
Under $H_0$, the test statistic satisfies
\[
\frac{n_1 n_2}{n_1 + n_2}\, (\bar X - \bar Y)' S_p^{-1} (\bar X - \bar Y) \sim T^2(p, n_1 + n_2 - 2),
\]
where $S_p$ is given in (5). This is equivalent to
\[
\frac{n_1 + n_2 - p - 1}{p(n_1 + n_2 - 2)}\, \frac{n_1 n_2}{n_1 + n_2}\, (\bar X - \bar Y)' S_p^{-1} (\bar X - \bar Y) \sim F(p, n_1 + n_2 - p - 1).
\]
This provides us with a rejection region for the test (7).
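Before turning to the several-sample case, here is a sketch of the one-sample test (6) on simulated data. It computes the statistic in the equivalent $F$ form; note that with the unbiased sample covariance cov(X) (divisor $n - 1$) the statistic $\frac{n - p}{p(n - 1)}\, n(\bar x - \mu_0)' S^{-1} (\bar x - \mu_0)$ coincides with $\frac{n - p}{p}(\bar x - \mu_0)' S_n^{-1}(\bar x - \mu_0)$ above. The data and $\mu_0$ are arbitrary illustrative choices.

## Sketch of the one-sample test H0: mu = mu0 (Case 1), computed with
## the unbiased sample covariance cov(X) (divisor n - 1):
##   T2 = n (xbar - mu0)' S^{-1} (xbar - mu0),
##   (n - p) / (p (n - 1)) * T2 ~ F(p, n - p) under H0.
## Data and mu0 are arbitrary illustrative choices.
library(mvtnorm)

set.seed(11)
n <- 40; p <- 3
mu_true <- c(0.2, 0, -0.1)
X <- rmvnorm(n, mean = mu_true, sigma = diag(p))

mu0 <- c(0, 0, 0)
xbar <- colMeans(X)
S <- cov(X)                       # unbiased estimator, divisor n - 1
T2 <- drop(n * t(xbar - mu0) %*% solve(S) %*% (xbar - mu0))
Fstat <- (n - p) / (p * (n - 1)) * T2
pval <- 1 - pf(Fstat, df1 = p, df2 = n - p)
c(T2 = T2, F = Fstat, p.value = pval)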
Case 3. Assume we have $g$ data matrices from $g$ independent multivariate normal populations:

Sample   Size         Mean        Covariance   Distribution
$X_1$    $n_1 \times p$   $\bar x_1$   $S_1$        $N_p(\mu_1, \Sigma)$
$X_2$    $n_2 \times p$   $\bar x_2$   $S_2$        $N_p(\mu_2, \Sigma)$
...      ...          ...         ...          ...
$X_g$    $n_g \times p$   $\bar x_g$   $S_g$        $N_p(\mu_g, \Sigma)$

The global sample mean vector and sample covariance matrix are
\[
\bar x = \frac{1}{n}\sum_{i=1}^g n_i \bar x_i, \qquad S = \frac{1}{n - g}\sum_{i=1}^g n_i S_i, \qquad \text{with } n = \sum_{i=1}^g n_i.
\]
Consider the test
\[
H_0: \mu_1 = \mu_2 = \cdots = \mu_g. \qquad (8)
\]
Let us introduce the following matrices:
\[
B = \sum_{i=1}^g n_i (\bar x_i - \bar x)(\bar x_i - \bar x)' \quad \text{(between-groups dispersion)},
\]
\[
W = \sum_{i=1}^g \sum_{k=1}^{n_i} (x_{ik} - \bar x_i)(x_{ik} - \bar x_i)' = \sum_{i=1}^g n_i S_i \quad \text{(intra-groups dispersion)}.
\]
Under $H_0$, $B \sim W_p(\Sigma, g - 1)$ and $W \sim W_p(\Sigma, n - g)$ are independent. The test statistic
\[
\Lambda = \frac{|W|}{|W + B|} \sim \Lambda(p, n - g, g - 1)
\]
can be approximated by the $F$ distribution via Rao's asymptotic approximation: if $\Lambda \sim \Lambda(p, a, b)$, then
\[
\frac{1 - \Lambda^{1/\beta}}{\Lambda^{1/\beta}}\, \frac{\alpha\beta - 2\gamma}{pb} \sim F(pb,\ \alpha\beta - 2\gamma),
\]
where
\[
\alpha = a + b - \frac{p + b + 1}{2}, \qquad \beta = \sqrt{\frac{p^2 b^2 - 4}{p^2 + b^2 - 5}}, \qquad \gamma = \frac{pb - 2}{4}.
\]
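In R, the Wilks' Lambda test of (8) is available through the base function manova, whose summary method uses an $F$ approximation of this type. The sketch below, with simulated groups, is purely illustrative; the group structure and parameters are arbitrary choices.

## Sketch: testing equality of g = 3 Gaussian mean vectors (8) with
## Wilks' Lambda via the base R manova() function. The simulated groups
## and their parameters are arbitrary illustrative choices.
library(mvtnorm)

set.seed(5)
g1 <- rmvnorm(30, mean = c(0, 0))
g2 <- rmvnorm(30, mean = c(0.5, 0))
g3 <- rmvnorm(30, mean = c(0, 0.8))
X <- rbind(g1, g2, g3)
group <- factor(rep(1:3, each = 30))

fit <- manova(X ~ group)
summary(fit, test = "Wilks")   # Wilks' Lambda and its F approximation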
2.2.3 Mixture models

Let $k > 0$ be an integer. A $p$-dimensional random vector $X$ has a $k$-component finite mixture distribution if its probability density (or mass) function is given by
\[
f(x) = \sum_{j=1}^k \pi_j f_j(x), \qquad (9)
\]
where $f_j$, $j = 1, \ldots, k$, are probability densities (or mass functions) and $0 \le \pi_j \le 1$, $j = 1, \ldots, k$, are constants such that $\sum_{j=1}^k \pi_j = 1$. The $f_j$ are the component densities of the mixture and the $\pi_j$ are the mixing proportions or weights. In the definition of a mixture model, the number $k$ of components is considered fixed, but in many applications the value of $k$ is unknown and has to be inferred from the data.

The key to generating random vectors with density (9) is as follows. We define the discrete r.v. $Z$ taking values $1, 2, \ldots, k$, with probabilities $\pi_1, \pi_2, \ldots, \pi_k$ respectively. We suppose that the conditional density of $X$ given $Z = j$ is $f_j$. Then the unconditional density of $X$ is (9). Equivalently, we can define the discrete random vector $Z = (Z_1, \ldots, Z_k)'$, with the $Z_j$'s taking the value 0 or 1, $\sum_{j=1}^k Z_j = 1$ and $\pi_j$ equal to the probability that component $Z_j$ of $Z$ is 1. Then $Z$ follows a multinomial distribution with parameters $(\pi_1, \ldots, \pi_k)$ and we suppose that $f_j$ is the conditional density of $X$ given that the $j$-th component of $Z$ is 1.

In many applications the component densities $f_j$ are specified to belong to some parametric family. The resulting model is called a parametric mixture. In particular, the component densities are frequently assumed to belong to the same parametric family, as in mixtures of Gaussian densities. Parametric mixture models can be viewed as a semiparametric compromise between a single parametric family (case $k = 1$) and a nonparametric model such as kernel density estimation (case $k = n$).

Example: Mixture of three bivariate normal distributions

## Simulating a sample from a mixture of three bivariate normal
## distributions and plotting it with marginal histograms
library(mvtnorm)

n = 200                       # Sample size
Size = 1                      # Number of non-zero values of the multinomial components
Prob = c(0.5, 0.3, 0.2)       # Mixing weights
NumComp = length(Prob)        # Number of components in the mixture
C = rmultinom(n, Size, Prob)  # Sample from the multinomial
SizeComp = apply(C, 1, sum)   # Sample size for each component

X = matrix(rep(0, n * 2), nrow = n)
X[C[1,] == 1, ] = rmvnorm(n = SizeComp[1],
                          sigma = matrix(c(1, 0, 0, 1), ncol = 2))
X[C[2,] == 1, ] = rmvnorm(n = SizeComp[2], mean = c(3, 5),
                          sigma = matrix(c(3, 1, 1, 1), ncol = 2))
X[C[3,] == 1, ] = rmvnorm(n = SizeComp[3], mean = c(4, -3),
                          sigma = matrix(c(0.5, 0.1, 0.1, 2), ncol = 2))

panel.hist = function(x, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y / max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
pairs(X, cex = 1.5, pch = 20, bg = "light blue",
      diag.panel = panel.hist, cex.labels = 2, font.labels = 2)

[Figure: scatterplot matrix, with marginal histograms, of the simulated three-component mixture sample.]

Thus, a mixture is a candidate distribution to model a population with several subpopulations.

Example: Times between Old Faithful eruptions ($Y$, var 2) and duration of eruptions ($X$, var 1).

[Figure: scatterplot matrix of eruption duration and waiting time, showing two clear groups.]

2.3. Maximum likelihood estimation

Let $x_1, \ldots, x_n$ denote a sample of a multivariate parametric model with density (or mass) function $f(x; \psi)$, where $\psi = (\psi_1, \ldots, \psi_k)'$ denotes the vector of unknown parameters. The maximum likelihood estimator (m.l.e.) of $\psi$ is $\hat\psi$, the maximizer of the likelihood function
\[
L(\psi; x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i; \psi).
\]

MLE for the Gaussian distribution

Let $X_1, \ldots, X_n$ be a random sample from a normal population with mean $\mu$ and covariance $\Sigma$. Then the m.l.e. of $\mu$ and $\Sigma$ are, respectively,
\[
\hat\mu = \bar X \qquad \text{and} \qquad \hat\Sigma = S_n = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)'.
\]

Proof (Johnson and Wichern 2007): The likelihood is
\[
L(\mu, \Sigma) = \prod_{i=1}^n \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\, e^{-(x_i - \mu)'\Sigma^{-1}(x_i - \mu)/2} = \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}}\, e^{-\sum_{i=1}^n (x_i - \mu)'\Sigma^{-1}(x_i - \mu)/2}.
\]
The m.l.e. of $\mu$ is the minimizer of
\begin{align*}
\sum_{i=1}^n (x_i - \mu)'\Sigma^{-1}(x_i - \mu) &= \sum_{i=1}^n \mathrm{tr}\big(\Sigma^{-1}(x_i - \mu)(x_i - \mu)'\big) \\
&= \mathrm{tr}\Big(\Sigma^{-1}\sum_{i=1}^n (x_i - \bar x)(x_i - \bar x)'\Big) + n(\bar x - \mu)'\Sigma^{-1}(\bar x - \mu).
\end{align*}
Since $\Sigma^{-1}$ is positive definite, the distance $(\bar x - \mu)'\Sigma^{-1}(\bar x - \mu) > 0$ unless $\mu = \bar x$. Thus the m.l.e. of $\mu$ is $\hat\mu = \bar x$. It remains to maximize over $\Sigma$
\[
L(\hat\mu, \Sigma) = \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}}\, e^{-\mathrm{tr}\big[\Sigma^{-1}\sum_{i=1}^n (x_i - \bar x)(x_i - \bar x)'\big]/2}.
\]
Auxiliary result: Given a $p \times p$ symmetric positive definite matrix $B$ and a scalar $b > 0$, it holds that, for all positive definite $\Sigma$ ($p \times p$),
\[
\frac{1}{|\Sigma|^b}\, e^{-\mathrm{tr}(\Sigma^{-1}B)/2} \le \frac{1}{|B|^b}\, (2b)^{pb}\, e^{-bp},
\]
with equality if $\Sigma = (1/2b)B$. We apply this auxiliary result with $b = n/2$ and $B = \sum_{i=1}^n (x_i - \bar x)(x_i - \bar x)'$ and conclude that the maximum occurs at $\hat\Sigma = S_n$.
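A quick numerical check of these formulas: the m.l.e. $S_n$ uses divisor $n$, while R's cov uses $n - 1$, so the two differ by the factor $(n - 1)/n$. Sketch with arbitrary simulated data:

## Sketch: the Gaussian m.l.e. of Sigma uses divisor n, while cov() uses
## n - 1; the two estimates differ by the factor (n - 1)/n.
library(mvtnorm)

set.seed(2)
n <- 200
X <- rmvnorm(n, mean = c(1, -1), sigma = matrix(c(2, 0.6, 0.6, 1), 2))

mu_hat <- colMeans(X)                 # m.l.e. of mu
Xc <- sweep(X, 2, mu_hat)             # centered data
Sn <- t(Xc) %*% Xc / n                # m.l.e. of Sigma (divisor n)
all.equal(Sn, (n - 1) / n * cov(X))   # TRUE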
MLE for a parametric mixture model

Consider a parametric mixture model
\[
f(x; \psi) = \sum_{j=1}^k \pi_j f_j(x; \theta_j), \qquad (10)
\]
where $\psi = (\pi_1, \ldots, \pi_{k-1}, \xi')'$ and $\xi$ is the vector containing all the parameters in $\theta_1, \ldots, \theta_k$ known a priori to be different. We want to obtain the m.l.e. of the parameters in model (10) based on a sample $x_1, \ldots, x_n$ from $f$. The log-likelihood for $\psi$ is
\[
\log L(\psi; x_1, \ldots, x_n) = \sum_{i=1}^n \log\Big(\sum_{j=1}^k \pi_j f_j(x_i; \theta_j)\Big).
\]
Computing the m.l.e. would require solving the likelihood equation
\[
\frac{\partial \log L(\psi)}{\partial \psi} = 0,
\]
not an easy task (see Section 2.8 in McLachlan and Peel 2000).

The Expectation-Maximization (EM) algorithm of Dempster et al. (1977) provides an iterative scheme for computing the m.l.e. of the parameters $\psi$ in a parametric mixture. The EM algorithm is designed for "incomplete data", so the key is to consider the mixture data $x_1, \ldots, x_n$ as incomplete, since the associated component label vectors $z_1, \ldots, z_n$ are not available. Here $z_i = (z_{i1}, \ldots, z_{ik})'$ is a $k$-dimensional vector with $z_{ij} = 1$ or $0$ according to whether $x_i$ did or did not arise from the $j$-th component of the mixture, $i = 1, \ldots, n$, $j = 1, \ldots, k$. The complete data sample is therefore declared to be $x_1^c, \ldots, x_n^c$, where $x_i^c = (x_i, z_i)$. Then the complete-data log-likelihood for $\psi$ is given by
\[
\log L_c(\psi) = \sum_{i=1}^n \sum_{j=1}^k z_{ij}\big(\log \pi_j + \log f_j(x_i; \theta_j)\big).
\]

E-Step: The algorithm starts with an initial guess $\psi^{(0)}$ for $\psi$. In general, denote by $\psi^{(g)}$ the approximated value of $\psi$ after the $g$-th iteration of the algorithm. The E-step requires computing the conditional expectation of $\log L_c(\psi)$ given the sample $x_1, \ldots, x_n$ and under the current approximation for $\psi$:
\[
Q(\psi; \psi^{(g)}) = E_{\psi^{(g)}}\big(\log L_c(\psi) \mid x_1, \ldots, x_n\big) = \sum_{i=1}^n \sum_{j=1}^k E_{\psi^{(g)}}(Z_{ij} \mid x_1, \ldots, x_n)\big(\log \pi_j + \log f_j(x_i; \theta_j)\big).
\]
It can be proved that, for $i = 1, \ldots, n$ and $j = 1, \ldots, k$,
\[
E_{\psi^{(g)}}(Z_{ij} \mid x_1, \ldots, x_n) = P_{\psi^{(g)}}\{Z_{ij} = 1 \mid x_1, \ldots, x_n\} = \frac{\pi_j^{(g)} f_j(x_i; \theta_j^{(g)})}{\sum_{l=1}^k \pi_l^{(g)} f_l(x_i; \theta_l^{(g)})} =: \tau_{ij}^{(g)}.
\]
This is the posterior probability that the $i$-th member of the sample, $X_i$, belongs to the $j$-th component of the mixture. Then
\[
Q(\psi; \psi^{(g)}) = \sum_{i=1}^n \sum_{j=1}^k \tau_{ij}^{(g)}\big(\log \pi_j + \log f_j(x_i; \theta_j)\big).
\]

M-Step: The updated estimate $\psi^{(g+1)}$ is obtained as the global maximizer of $Q(\psi; \psi^{(g)})$ with respect to $\psi$. Specifically,
\[
\pi_j^{(g+1)} = \frac{1}{n}\sum_{i=1}^n \tau_{ij}^{(g)},
\]
and $\xi^{(g+1)}$ is obtained as an appropriate root of
\[
\sum_{i=1}^n \sum_{j=1}^k \tau_{ij}^{(g)}\, \frac{\partial \log f_j(x_i; \theta_j)}{\partial \xi} = 0.
\]
The E- and M-steps are alternated repeatedly until the difference $L(\psi^{(g+1)}) - L(\psi^{(g)})\ (\ge 0)$ is small enough.

In the case of a normal mixture with heteroscedastic components,
\[
f(x; \psi) = \sum_{j=1}^k \pi_j f(x; \mu_j, \Sigma_j),
\]
the M-step update $\xi^{(g+1)}$ has a closed form:
\[
\mu_j^{(g+1)} = \frac{\sum_{i=1}^n \tau_{ij}^{(g)} x_i}{\sum_{i=1}^n \tau_{ij}^{(g)}} \qquad \text{and} \qquad \Sigma_j^{(g+1)} = \frac{\sum_{i=1}^n \tau_{ij}^{(g)} (x_i - \mu_j^{(g+1)})(x_i - \mu_j^{(g+1)})'}{\sum_{i=1}^n \tau_{ij}^{(g)}}.
\]
Remark: We have assumed that the number $k$ of components in the mixture fitted to the sample is known or fixed in advance. There are techniques for choosing the "optimal" number of components (see, e.g., Chapter 6 in McLachlan and Peel 2000; Claeskens and Hjort 2008).
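To make the E- and M-steps concrete, here is a minimal EM sketch for a $k$-component heteroscedastic multivariate normal mixture, directly implementing the $\tau$, $\pi$, $\mu$ and $\Sigma$ updates above. It assumes the mvtnorm package for dmvnorm; the initialization, tolerance and test data are arbitrary choices, and no safeguards (multiple starts, protection against degenerate covariances) are included.

## Minimal EM sketch for a k-component multivariate normal mixture,
## implementing the tau/pi/mu/Sigma updates above. Arbitrary choices:
## random initialization via responsibilities, fixed tolerance, no
## safeguards against degenerate components or local maxima.
library(mvtnorm)

em_norm_mix <- function(x, k, max_iter = 500, tol = 1e-8) {
  n <- nrow(x)
  # Initialization: random responsibilities (rows sum to 1)
  tau <- matrix(runif(n * k), n, k)
  tau <- tau / rowSums(tau)
  loglik_old <- -Inf
  for (g in seq_len(max_iter)) {
    # M-step: closed-form updates from the current tau
    pi_j <- colMeans(tau)
    mu <- lapply(1:k, function(j) colSums(tau[, j] * x) / sum(tau[, j]))
    Sigma <- lapply(1:k, function(j) {
      xc <- sweep(x, 2, mu[[j]])                 # center at mu_j
      t(xc * tau[, j]) %*% xc / sum(tau[, j])    # weighted scatter
    })
    # E-step: posterior probabilities tau_ij
    dens <- sapply(1:k, function(j)
      pi_j[j] * dmvnorm(x, mean = mu[[j]], sigma = Sigma[[j]]))
    tau <- dens / rowSums(dens)
    loglik <- sum(log(rowSums(dens)))            # observed log-likelihood
    if (loglik - loglik_old < tol) break
    loglik_old <- loglik
  }
  list(pi = pi_j, mu = mu, Sigma = Sigma, tau = tau, loglik = loglik)
}

## Small test on a simulated two-component mixture
set.seed(9)
x <- rbind(rmvnorm(150, mean = c(0, 0)),
           rmvnorm(100, mean = c(4, 3)))
fit <- em_norm_mix(x, k = 2)
fit$pi   # close to (0.6, 0.4), up to label switching
fit$mu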
We can use the R package mclust for normal mixture fitting to a data set.

Example: Times between Old Faithful eruptions ($Y$) and duration of eruptions ($X$).

library(mclust)
Datos = read.table('Datos-geyser.txt', header = TRUE)
XY = cbind(Datos$X, Datos$Y)
# Normal mixture fitting with 2 components
faithfulDens = densityMclust(XY, G = 2, modelNames = "VVV")
summary(faithfulDens, parameters = TRUE)

-------------------------------------------------------
Density estimation via Gaussian finite mixture modeling
-------------------------------------------------------

Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model
with 2 components:

 log.likelihood   n df       BIC       ICL
       -455.468 107 11 -962.3372 -962.4458

Clustering table:
 1  2
80 27

Mixing probabilities:
        1         2
0.7479006 0.2520994

Means:
          [,1]      [,2]
[1,]  3.994438  1.877456
[2,] 77.314712 52.266216

Variances:
[,,1]
         [,1]      [,2]
[1,] 0.284642  1.927692
[2,] 1.927692 57.832533
[,,2]
            [,1]        [,2]
[1,]  0.02383614 -0.05018129
[2,] -0.05018129 19.87038892

plot(faithfulDens, XY, xlab = "X", ylab = "Y")

[Figure: scatterplot of the Old Faithful data with the fitted mixture density contours.]

plot(faithfulDens, type = "persp", col = grey(0.8))

[Figure: perspective plot of the fitted mixture density.]

References

Claeskens, G. and Hjort, N.L. (2008). Model Selection and Model Averaging. Cambridge University Press.

Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

Johnson, R.A. and Wichern, D.W. (2007). Applied Multivariate Statistical Analysis. Prentice Hall.

McAssey, M.P. (2013). An empirical goodness-of-fit test for multivariate distributions. Journal of Applied Statistics, 40, 1120–1131.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley.

Peña, D. (2002). Análisis de datos multivariantes. McGraw-Hill.

Székely, G.J. and Rizzo, M.L. (2005). A new test for multivariate normality. Journal of Multivariate Analysis, 93, 58–80.

Székely, G.J. and Rizzo, M.L. (2013). Energy statistics: a class of statistics based on distances. Journal of Statistical Planning and Inference, 143, 1249–1272.

Székely, G.J., Rizzo, M.L. and Bakirov, N.K. (2007). Measuring and testing independence by correlation of distances. Annals of Statistics, 35, 2769–2794.