Stat 511 Outline
Spring 2004
Steve Vardeman
Iowa State University
May 5, 2004
Abstract
This outline summarizes the main points of lectures based on Ken Koehler’s class notes and Rencher’s book.
1 Linear Models
The basic linear model structure is

Y = Xβ + ε    (1)

for Y an n × 1 vector of observables, X an n × k matrix of known constants, β a k × 1 vector of (unknown) constants (parameters), and ε an n × 1 vector of unobservable random errors. Almost always one assumes that Eε = 0. Often one also assumes that for an unknown constant (a parameter) σ² > 0, Var ε = σ²I (these are the Gauss-Markov model assumptions) or somewhat more generally assumes that Var ε = σ²V (these are the Aitken model assumptions). These assumptions can be phrased as “the mean vector EY is in the column space of the matrix X (EY ∈ C(X)) and the variance-covariance matrix Var Y is known up to a multiplicative constant.”
1.1 Ordinary Least Squares
The ordinary least squares estimate for EY = Xβ is made by minimizing

(Y − Ŷ)′(Y − Ŷ)

over choices of Ŷ ∈ C(X). Ŷ is then “the (perpendicular) projection of Y onto C(X).” This is minimization of the squared distance between Y and Ŷ belonging to C(X). Computation of this projection can be accomplished using a (unique) “projection matrix” P_X as

Ŷ = P_X Y
There are various ways of constructing P_X. One is as

P_X = X(X′X)⁻X′

for (X′X)⁻ any generalized inverse of X′X. As it turns out, P_X is both symmetric and idempotent. It is sometimes called the “hat matrix” and written as H rather than P_X. (It is used to compute the “y hats.”)
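A minimal numpy sketch of this construction, using a small made-up design matrix X that is deliberately not of full rank (so that a genuine generalized inverse is needed), and checking the stated properties of P_X:

```python
import numpy as np

# Hypothetical design matrix; column 3 = column 1 + column 2, so rank(X) = 2
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 5.0]])

# P_X = X (X'X)^- X', with the Moore-Penrose pseudoinverse as the generalized inverse
P_X = X @ np.linalg.pinv(X.T @ X) @ X.T

# P_X is symmetric, idempotent, and trace(P_X) = rank(X)
assert np.allclose(P_X, P_X.T)
assert np.allclose(P_X @ P_X, P_X)
assert np.isclose(np.trace(P_X), np.linalg.matrix_rank(X))
```

Any generalized inverse of X′X yields the same P_X; the pseudoinverse is simply a convenient choice.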
The vector

e = Y − Ŷ = (I − P_X)Y

is the vector of residuals. As it turns out, the matrix I − P_X is also a perpendicular projection matrix. It projects onto the subspace of Rⁿ consisting of all vectors perpendicular to the elements of C(X). That is, I − P_X projects onto

C(X)⊥ ≡ {u ∈ Rⁿ | u′v = 0 ∀ v ∈ C(X)}

It is the case that C(X)⊥ = C(I − P_X) and

rank(X) = rank(X′X) = rank(P_X) = dimension of C(X) = trace(P_X)

and

rank(I − P_X) = dimension of C(X)⊥ = trace(I − P_X)

and

n = rank(I) = rank(P_X) + rank(I − P_X)

Further, there is the Pythagorean Theorem/ANOVA identity

Y′Y = (P_X Y)′(P_X Y) + ((I − P_X)Y)′((I − P_X)Y) = Ŷ′Ŷ + e′e
When rank(X) = k (one has a “full rank” X) every w ∈ C(X) has a unique representation as a linear combination of the columns of X. In this case there is a unique b that solves

Xb = P_X Y = Ŷ    (2)

We can call this solution of equation (2) the ordinary least squares estimate of β. Notice that we then have

X b_OLS = P_X Y = X(X′X)⁻X′Y

so that

X′X b_OLS = X′X(X′X)⁻X′Y = X′Y

These are the so called “normal equations.” X′X is k × k with the same rank as X (namely k) and is thus non-singular. So (X′X)⁻ = (X′X)⁻¹ and the normal equations can be solved to give

b_OLS = (X′X)⁻¹X′Y

When X is not of full rank, there are multiple b’s that will solve equation (2) and multiple β’s that could be used to represent EY ∈ C(X). There is thus no sensible “least squares estimate of β.”
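In the full rank case, the closed form b_OLS = (X′X)⁻¹X′Y can be checked numerically. A sketch with simulated data (the design, coefficients, and sample size here are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical full-rank regression: n = 50 cases, k = 3 (intercept + 2 predictors)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 2.0, -0.5])
Y = X @ beta + 0.1 * rng.normal(size=n)

# Solve the normal equations X'X b = X'Y directly ...
b_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# ... which agrees with the numerically preferred least squares routine
b_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(b_normal, b_lstsq)
```

In practice one solves the normal equations (or better, uses a QR-based routine such as `lstsq`) rather than explicitly inverting X′X.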
1.2 Estimability and Testability
In full rank cases, for any c ∈ Rᵏ it makes sense to define the least squares estimate of the linear combination of parameters c′β as

c′β̂_OLS = c′b_OLS = c′(X′X)⁻¹X′Y    (3)

When X is not of full rank, the above expression doesn’t make sense. But even in such cases, for some c ∈ Rᵏ the linear combination of parameters c′β can be unambiguous in the sense that every β that produces a given EY = Xβ produces the same value of c′β. As it turns out, the c’s that have this property can be characterized in several ways.
Theorem 1 The following 3 conditions on c ∈ Rᵏ are equivalent:
a) ∃ a ∈ Rⁿ such that a′Xβ = c′β ∀ β ∈ Rᵏ
b) c ∈ C(X′)
c) Xβ₁ = Xβ₂ implies that c′β₁ = c′β₂

Condition c) says that c′β is unambiguous in the sense mentioned before the statement of the result, condition a) says that there is a linear combination of the entries of Y that is unbiased for c′β, and condition b) says that c′ is a linear combination of the rows of X. When c ∈ Rᵏ satisfies the characterizations of Theorem 1 it makes sense to try to estimate c′β. Thus, one is led to the following definition.

Definition 2 If c ∈ Rᵏ satisfies the characterizations of Theorem 1 the parametric function c′β is said to be estimable.
If c′β is estimable, ∃ a ∈ Rⁿ such that

c′β = a′Xβ = a′EY ∀ β ∈ Rᵏ

and so it makes sense to invent an ordinary least squares estimate of c′β as

c′β̂_OLS = a′Ŷ = a′P_X Y = a′X(X′X)⁻X′Y = c′(X′X)⁻X′Y

which is a generalization of the full rank formula (3).
It is often of interest to simultaneously estimate several parametric functions

c₁′β, c₂′β, …, c_l′β

If each cᵢ′β is estimable, it makes sense to assemble the l × k matrix

C = [ c₁′ ]
    [ c₂′ ]
    [  ⋮  ]
    [ c_l′ ]

and talk about estimating the vector

Cβ = [ c₁′β ]
     [ c₂′β ]
     [  ⋮   ]
     [ c_l′β ]

An ordinary least squares estimator of Cβ is then

Cβ̂_OLS = C(X′X)⁻X′Y
Related to the notion of the estimability of Cβ is the concept of “testability” of hypotheses. Roughly speaking, several hypotheses like H₀: cᵢ′β = # are (simultaneously) testable if each cᵢ′β can be estimated and the hypotheses are not internally inconsistent. To be more precise, suppose that (as above) C is an l × k matrix of constants.

Definition 3 The hypothesis H₀: Cβ = d is testable provided each cᵢ′β is estimable and rank(C) = l.

Cases of testing H₀: Cβ = 0 are of particular interest. If such a hypothesis is testable, ∃ aᵢ ∈ Rⁿ such that cᵢ′ = aᵢ′X for each i, and thus with

A = [ a₁′ ]
    [ a₂′ ]
    [  ⋮  ]
    [ a_l′ ]

one can write

C = AX

Then the basic linear model says that EY ∈ C(X) while the hypothesis says that EY ∈ C(A′)⊥. C(X) ∩ C(A′)⊥ is a subspace of C(X) of dimension

rank(X) − rank(A′) = rank(X) − l

and the hypothesis can thus be thought of in terms of specifying that the mean vector is in a subspace of C(X).
1.3 Means and Variances for Ordinary Least Squares (Under the Gauss-Markov Assumptions)
Elementary rules about how means and variances of linear combinations of random variables are computed can be applied to find means and variances for OLS estimators. Under the Gauss-Markov model some of these are

EŶ = Xβ and Var Ŷ = σ²P_X

Ee = 0 and Var e = σ²(I − P_X)

Further, for l estimable functions c₁′β, c₂′β, …, c_l′β and C the l × k matrix with rows c₁′, c₂′, …, c_l′, the estimator Cβ̂_OLS = C(X′X)⁻X′Y has mean and covariance matrix

E Cβ̂_OLS = Cβ and Var Cβ̂_OLS = σ²C(X′X)⁻C′

Notice that in the case that X is full rank and β = Iβ is estimable, the above says that

E b_OLS = β and Var b_OLS = σ²(X′X)⁻¹

It is possible to use Theorem 5.2A of Rencher about the mean of a quadratic form to also argue that

E e′e = E(Y − Ŷ)′(Y − Ŷ) = σ²(n − rank(X))

This fact suggests the ratio

MSE ≡ e′e / (n − rank(X))

as an obvious estimate of σ².
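A short numpy sketch of this estimate, on made-up Gauss-Markov data with true σ² = 1 (the design and coefficients are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data from a Gauss-Markov linear model with sigma^2 = 1
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([2.0, 1.0]) + rng.normal(size=n)

P_X = X @ np.linalg.pinv(X.T @ X) @ X.T
e = Y - P_X @ Y                      # residual vector e = (I - P_X) Y
rank_X = np.linalg.matrix_rank(X)
MSE = (e @ e) / (n - rank_X)         # e'e / (n - rank(X)), estimates sigma^2
```

With true σ² = 1 and n = 40, the computed MSE should land near 1, though any single realization will of course vary.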
In the Gauss-Markov model, ordinary least squares estimation has some optimality properties. Foremost there is the guarantee provided by the Gauss-Markov Theorem. This says that under the linear model assumptions with Var ε = σ²I, for estimable c′β the ordinary least squares estimator c′β̂_OLS is the Best (in the sense of minimizing variance) Linear (in the entries of Y) Unbiased (having mean c′β for all β) Estimator of c′β.
1.4 Generalized Least Squares
For V positive definite, suppose that Var ε = σ²V. There exists a symmetric positive definite square root matrix for V⁻¹, call it V^(−1/2). Then

U = V^(−1/2)Y

satisfies the Gauss-Markov model assumptions with model matrix

W = V^(−1/2)X

It then makes sense to do ordinary least squares estimation of EU ∈ C(W) (with P_W U = Û = Wβ̂). Note that for c ∈ C(W′) the BLUE of the parametric function c′β is

c′(W′W)⁻W′U = c′(W′W)⁻W′V^(−1/2)Y

(any linear function of the elements of U is a linear function of elements of Y and vice versa). So, for example, in full rank cases, the best linear unbiased estimator of the β vector is

b_OLS(U) = (W′W)⁻¹W′U = (W′W)⁻¹W′V^(−1/2)Y    (4)

Notice that ordinary least squares estimation of EU is minimization of

(V^(−1/2)Y − Û)′(V^(−1/2)Y − Û) = (Y − V^(1/2)Û)′V⁻¹(Y − V^(1/2)Û)

over choices of Û ∈ C(W) = C(V^(−1/2)X). This is minimization of

(Y − Ŷ*)′V⁻¹(Y − Ŷ*)

over choices of Ŷ* ∈ C(V^(1/2)W) = C(X). This is so-called “generalized least squares” estimation in the original variables Y.
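The whitening argument above can be checked numerically: OLS on the transformed data (U, W) agrees with the direct GLS formula (X′V⁻¹X)⁻¹X′V⁻¹Y. A sketch with a made-up diagonal (heteroscedastic) V:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical Aitken model: Var(eps) = sigma^2 V with known positive definite V
n = 30
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
V = np.diag(1.0 + 0.5 * np.arange(n))          # heteroscedastic diagonal V
Y = X @ np.array([1.0, 0.3]) + np.sqrt(np.diag(V)) * rng.normal(size=n)

# Symmetric square root of V^{-1} via an eigendecomposition
w, Q = np.linalg.eigh(V)
V_inv_half = Q @ np.diag(w ** -0.5) @ Q.T

# Whitened data satisfy the Gauss-Markov assumptions; OLS there is GLS in Y
U = V_inv_half @ Y
W = V_inv_half @ X
b_gls = np.linalg.solve(W.T @ W, W.T @ U)

# Same answer as the textbook GLS formula (X'V^{-1}X)^{-1} X'V^{-1}Y
Vi = np.linalg.inv(V)
b_direct = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ Y)
assert np.allclose(b_gls, b_direct)
```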
1.5 Reparameterizations and Restrictions
When two superficially different linear models

Y = Xβ + ε and Y = Wγ + ε

(with the same assumptions on ε) have C(X) = C(W), they are really fundamentally the same. Ŷ and Y − Ŷ are the same in the two models. Further,

c′β is estimable ⟺ ∃ a ∈ Rⁿ with a′Xβ = c′β ∀β ⟺ ∃ a ∈ Rⁿ with a′EY = c′β ∀β

That is, c′β is estimable exactly when it is a linear combination of the entries of EY. Though this is expressed differently in the two models, it must be the case that the two equivalent models produce the same set of estimable functions, i.e. that

{c′β | c′β is estimable} = {d′γ | d′γ is estimable}

The correspondence between estimable functions in the two model formulations is as follows. Since every column of W is a linear combination of the columns of X, there must be a matrix F such that

W = XF

Then for an estimable c′β, ∃ a ∈ Rⁿ such that c′ = a′X. But then

c′β = a′Xβ = a′Wγ = a′XFγ = (c′F)γ

So, for example, in the Gauss-Markov model, if c ∈ C(X′), the BLUE of c′β = (c′F)γ is c′β̂_OLS.
What then is there to choose between two fundamentally equivalent linear
models? There are two issues. Computational/formula simplicity pushes one
in the direction of using full rank versions. Sometimes, scientific interpretability
of parameters pushes one in the opposite direction. It must be understood that
the set of inferences one can make can ONLY depend on the column space of
the model matrix, NOT on how that column space is represented.
1.6 Normal Distribution Theory and Inference
If one adds to the basic Gauss-Markov (or Aitken) linear model assumptions an assumption that ε (and therefore Y) is multivariate normal, inference formulas (for making confidence intervals, tests and predictions) follow. These are based primarily on two basic results.

Theorem 4 (Koehler 4.7, panel 309. See also Rencher Theorem 5.5.A.) Suppose that A is n × n and symmetric with rank(A) = k, and Y ∼ MVNₙ(μ, Σ) for Σ positive definite. If AΣ is idempotent, then

Y′AY ∼ χ²_k(μ′Aμ)

(So, if in addition Aμ = 0, then Y′AY ∼ χ²_k.)

Theorem 5 (Theorem 1.3.7 of Christensen) Suppose that Y ∼ MVN(μ, σ²I) and BA = 0.
a) If A is symmetric, Y′AY and BY are independent, and
b) if both A and B are symmetric, then Y′AY and Y′BY are independent.

(Part b) is Koehler’s 4.8. Part a) is a weaker form of Corollary 1 to Rencher’s Theorem 5.6.A and part b) is a weaker form of Corollary 1 to Rencher’s Theorem 5.6.B.) Here are some implications of these theorems.
Example 6 In the normal Gauss-Markov model

(1/σ²)(Y − Ŷ)′(Y − Ŷ) = SSE/σ² ∼ χ²_{n−rank(X)}

This leads, for example, to 1 − α level confidence limits for σ²

( SSE / (upper α/2 point of χ²_{n−rank(X)}) , SSE / (lower α/2 point of χ²_{n−rank(X)}) )
Example 7 (Estimation and testing for an estimable function) In the normal Gauss-Markov model, if c′β is estimable,

(c′β̂_OLS − c′β) / √(MSE · c′(X′X)⁻c) ∼ t_{n−rank(X)}

This implies that H₀: c′β = # can be tested using the statistic

T = (c′β̂_OLS − #) / √(MSE · c′(X′X)⁻c)

and a t_{n−rank(X)} reference distribution. Further, if t is the upper α/2 point of the t_{n−rank(X)} distribution, 1 − α level two-sided confidence limits for c′β are

c′β̂_OLS ± t √(MSE · c′(X′X)⁻c)
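A numpy sketch of these confidence limits, on made-up full-rank data with c chosen to pick off the slope (the design, coefficients, and the tabled t quantile 2.069 for 23 degrees of freedom are the only external inputs):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical full-rank normal Gauss-Markov data (n = 25, k = 2)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
e = Y - X @ b
MSE = (e @ e) / (n - 2)                 # rank(X) = 2 here

c = np.array([0.0, 1.0])                # c'beta is the slope beta_1
est = c @ b
se = np.sqrt(MSE * (c @ XtX_inv @ c))   # sqrt(MSE * c'(X'X)^{-1} c)
t = 2.069                               # upper .025 point of t_23, from a t table
ci = (est - t * se, est + t * se)
```

With a statistics library available, the hard-coded quantile would be replaced by the appropriate t percentile for n − rank(X) degrees of freedom.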
Example 8 (Prediction) In the normal Gauss-Markov model, suppose that c′β is estimable and y* ∼ N(c′β, γσ²) independent of Y is to be observed. (We assume that γ is known.) Then

(c′β̂_OLS − y*) / √(MSE · (γ + c′(X′X)⁻c)) ∼ t_{n−rank(X)}

This means that if t is the upper α/2 point of the t_{n−rank(X)} distribution, 1 − α level two-sided prediction limits for y* are

c′β̂_OLS ± t √(MSE · (γ + c′(X′X)⁻c))
Example 9 (Testing) In the normal Gauss-Markov model, suppose that the hypothesis H₀: Cβ = d is testable. Then with

SSH₀ = (Cβ̂_OLS − d)′(C(X′X)⁻C′)⁻¹(Cβ̂_OLS − d)

it’s easy to see that

SSH₀/σ² ∼ χ²_l(δ²)

for

δ² = (1/σ²)(Cβ − d)′(C(X′X)⁻C′)⁻¹(Cβ − d)

This in turn implies that

F = (SSH₀/l)/MSE ∼ F_{l,n−rank(X)}(δ²)

So with f the upper α point of the F_{l,n−rank(X)} distribution, an α level test of H₀: Cβ = d can be made by rejecting if (SSH₀/l)/MSE > f. The power of this test is

power(δ²) = P[an F_{l,n−rank(X)}(δ²) random variable > f]

Or taking a significance testing point of view, a p-value for testing this hypothesis is

P[an F_{l,n−rank(X)} random variable exceeds the observed value of (SSH₀/l)/MSE]
1.7 Normal Theory “Maximum Likelihood” and Least Squares
One way to justify/motivate the use of least squares in the linear model is through appeal to the statistical principle of “maximum likelihood.” That is, in the normal Gauss-Markov model, the joint pdf for the n observations is

f(Y | Xβ, σ²) = (2π)^(−n/2) |det(σ²I)|^(−1/2) exp(−(1/2)(Y − Xβ)′(σ²I)⁻¹(Y − Xβ))
             = (2π)^(−n/2) (1/σⁿ) exp(−(1/2σ²)(Y − Xβ)′(Y − Xβ))

Clearly, for fixed σ this is maximized as a function of Xβ ∈ C(X) by minimizing

(Y − Xβ)′(Y − Xβ)

i.e. with Ŷ = Xβ̂_OLS. Then consider

ln f(Y | Ŷ, σ²) = −(n/2) ln 2π − (n/2) ln σ² − SSE/(2σ²)

Setting the derivative with respect to σ² equal to 0 and solving shows this function of σ² to be maximized when σ² = SSE/n. That is, the (joint) maximum likelihood estimator of the parameter vector (Xβ, σ²) is (Ŷ, SSE/n). It is worth noting that

SSE/n = ((n − rank(X))/n) MSE

so that

E(SSE/n) = ((n − rank(X))/n) σ² < σ²

and the MLE of σ² is biased low.
1.8 Linear Models and Regression
A most important application of the general framework of the linear model is to (multiple linear) regression analysis, usually represented in the form

yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + ⋯ + β_r x_rᵢ + εᵢ

for “predictor variables” x₁, x₂, …, x_r. This can, of course, be written in the usual linear model format (1) for k = r + 1 and

β = [ β₀ ]       X = [ 1 x₁₁ x₂₁ ⋯ x_r1 ]
    [ β₁ ]           [ 1 x₁₂ x₂₂ ⋯ x_r2 ]  = (1 | x₁ | x₂ | ⋯ | x_r)
    [ ⋮  ]           [ ⋮  ⋮   ⋮   ⋱  ⋮  ]
    [ β_r ]          [ 1 x₁ₙ x₂ₙ ⋯ x_rn ]

Unless n is small or one is very unlucky, in regression contexts X is of full rank (i.e. is of rank r + 1). A few specifics of what has gone before that are of particular interest in the regression context are as follows.
As always, Ŷ = P_X Y and in the regression context it is especially common to call P_X = H the hat matrix. It is n × n and its diagonal entries hᵢᵢ are sometimes used as indices of “influence” or “leverage” of a particular case on the regression fit. It is the case that each hᵢᵢ ≥ 0 and

Σ hᵢᵢ = trace(H) = rank(H) = r + 1

so that the “hats” hᵢᵢ average to (r + 1)/n. In light of this, a case with hᵢᵢ > 2(r + 1)/n is sometimes flagged as an “influential” case. Further

Var Ŷ = σ²P_X = σ²H

so that Var ŷᵢ = hᵢᵢσ² and an estimated standard deviation of ŷᵢ is √hᵢᵢ √MSE. (This is useful, for example, in making confidence intervals for Eyᵢ.)
Also as always, e = (I − P_X)Y and Var e = σ²(I − P_X) = σ²(I − H). So Var eᵢ = (1 − hᵢᵢ)σ² and it is typical to compute and plot standardized versions of the residuals

eᵢ* = eᵢ / (√MSE √(1 − hᵢᵢ))
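The leverages and standardized residuals above are straightforward to compute. A sketch on made-up regression data (the 2(r + 1)/n flagging rule is the one quoted in the text):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical regression data: n = 30 cases, r = 2 predictors
n, r = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
h = np.diag(H)                                   # leverages h_ii
flagged = np.where(h > 2 * (r + 1) / n)[0]       # cases with h_ii > 2(r+1)/n

e = Y - H @ Y
MSE = (e @ e) / (n - (r + 1))
e_std = e / np.sqrt(MSE * (1 - h))               # standardized residuals

# The leverages average to exactly (r+1)/n, since trace(H) = r + 1
assert np.isclose(h.mean(), (r + 1) / n)
```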
The general testing of hypotheses framework discussed in Section 1.6 has a particularly important specialization in regression contexts. That is, it is common in regression contexts (for p < r) to test

H₀: β_{p+1} = β_{p+2} = ⋯ = β_r = 0    (5)

and in first methods courses this is done using the “full model/reduced model” paradigm. With

Xᵢ = (1 | x₁ | x₂ | ⋯ | xᵢ)

this is the hypothesis

H₀: EY ∈ C(X_p)

It is also possible to write this hypothesis in the standard form H₀: Cβ = 0 using the (r − p) × (r + 1) matrix

C = ( 0_{(r−p)×(p+1)} | I_{(r−p)×(r−p)} )

So from Section 1.6 the hypothesis can be tested using an F test with numerator sum of squares

SSH₀ = (C b_OLS)′(C(X′X)⁻¹C′)⁻¹(C b_OLS)

What is interesting and perhaps not initially obvious is that

SSH₀ = Y′(P_X − P_{X_p})Y    (6)
and that this kind of sum of squares is the elementary SSR_full − SSR_reduced. (A proof of the equivalence (6) is on a handout posted on the course web page.)

Further, the sum of squares in display (6) can be made part of any number of interesting partitions of the (uncorrected) overall sum of squares Y′Y. For example, it is clear that

Y′Y = Y′(P₁ + (P_{X_p} − P₁) + (P_X − P_{X_p}) + (I − P_X))Y    (7)

so that

Y′Y − Y′P₁Y = Y′(P_{X_p} − P₁)Y + Y′(P_X − P_{X_p})Y + Y′(I − P_X)Y

In elementary regression analysis notation

Y′Y − Y′P₁Y = SSTot(corrected)
Y′(P_{X_p} − P₁)Y = SSR_reduced
Y′(P_X − P_{X_p})Y = SSR_full − SSR_reduced
Y′(I − P_X)Y = SSE_full

(and then of course Y′(P_X − P₁)Y = SSR_full). These four sums of squares are often arranged in an ANOVA table for testing the hypothesis (5).
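The equivalence (6) between the Cβ-form numerator sum of squares and the full/reduced projection difference can be verified numerically. A sketch on made-up data with r = 3 predictors, testing H₀: β₂ = β₃ = 0 (so p = 1):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical regression with r = 3 predictors; test H0: beta_2 = beta_3 = 0
n, r, p = 40, 3, 1
X = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(r)])
Y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=n)

def proj(M):
    """Perpendicular projection matrix onto C(M)."""
    return M @ np.linalg.pinv(M.T @ M) @ M.T

Xp = X[:, : p + 1]                               # reduced model matrix (1 | x_1)
C = np.hstack([np.zeros((r - p, p + 1)), np.eye(r - p)])

b = np.linalg.solve(X.T @ X, X.T @ Y)
# SSH0 = (Cb)'(C(X'X)^{-1}C')^{-1}(Cb)
SSH0_Cform = (C @ b) @ np.linalg.solve(C @ np.linalg.inv(X.T @ X) @ C.T, C @ b)
# SSH0 = Y'(P_X - P_Xp)Y, display (6)
SSH0_proj = Y @ (proj(X) - proj(Xp)) @ Y

assert np.isclose(SSH0_Cform, SSH0_proj)
```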
It is common in regression analysis to use “reduction in sums of squares” notation and write

R(β₀) = Y′P₁Y
R(β₁, …, β_p | β₀) = Y′(P_{X_p} − P₁)Y
R(β_{p+1}, …, β_r | β₀, β₁, …, β_p) = Y′(P_X − P_{X_p})Y

so that in this notation, identity (7) becomes

Y′Y = R(β₀) + R(β₁, …, β_p | β₀) + R(β_{p+1}, …, β_r | β₀, β₁, …, β_p) + SSE

And in fact, even more elaborate breakdowns of the overall sum of squares are possible. For example,

R(β₀) = Y′P₁Y
R(β₁ | β₀) = Y′(P_{X₁} − P₁)Y
R(β₂ | β₀, β₁) = Y′(P_{X₂} − P_{X₁})Y
⋮
R(β_r | β₀, β₁, …, β_{r−1}) = Y′(P_X − P_{X_{r−1}})Y

represents a “Type I” or “Sequential” sum of squares breakdown of Y′Y − SSE. (Note that these sums of squares are appropriate numerator sums of squares for testing significance of individual β’s in models that include terms only up to the one in question.)
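The sequential breakdown is easy to compute directly from nested projection matrices. A sketch on made-up data, confirming that the Type I sums of squares partition Y′Y − SSE:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical regression: intercept plus r = 3 predictors
n, r = 35, 3
cols = [np.ones(n)] + [rng.normal(size=n) for _ in range(r)]
X = np.column_stack(cols)
Y = X @ np.array([1.0, 0.8, -0.3, 0.1]) + rng.normal(size=n)

def proj(M):
    """Perpendicular projection matrix onto C(M)."""
    return M @ np.linalg.pinv(M.T @ M) @ M.T

# Nested projections onto C(1), C(X_1), ..., C(X_r) = C(X)
P = [proj(np.column_stack(cols[: i + 1])) for i in range(r + 1)]

# R(beta_0), R(beta_1|beta_0), ..., R(beta_r|beta_0,...,beta_{r-1})
seq_SS = [Y @ P[0] @ Y] + [Y @ (P[i] - P[i - 1]) @ Y for i in range(1, r + 1)]

SSE = Y @ (np.eye(n) - P[r]) @ Y
# The sequential sums of squares add up to Y'Y - SSE
assert np.isclose(sum(seq_SS), Y @ Y - SSE)
```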
The enterprise of trying to assign a sum of squares to a predictor variable strikes Vardeman as of little real interest, but is nevertheless a common one. Rather than think of

R(βᵢ | β₀, β₁, …, β_{i−1}) = Y′(P_{Xᵢ} − P_{X_{i−1}})Y

(which depends on the usually essentially arbitrary ordering of the columns of X) as “due to xᵢ” it is possible to invent other assignments of sums of squares. One is the “SAS Type II Sum of Squares” assignment. With

X₍ᵢ₎ = (1 | x₁ | ⋯ | x_{i−1} | x_{i+1} | ⋯ | x_r)

(the original model matrix with the column xᵢ deleted), these are

R(βᵢ | β₀, β₁, …, β_{i−1}, β_{i+1}, …, β_r) = Y′(P_X − P_{X₍ᵢ₎})Y

(appropriate numerator sums of squares for testing significance of individual β’s in the full model).
It is worth noting that Theorem B.47 of Christensen guarantees that any matrix like P_X − P_{Xᵢ} is a perpendicular projection matrix and that Theorem B.48 implies that C(P_X − P_{Xᵢ}) = C(X) ∩ C(Xᵢ)⊥ (the orthogonal complement of C(Xᵢ) with respect to C(X) defined on page 395 of Christensen). Further, it is easy enough to argue using Christensen’s Theorem 1.3.7b (Koehler’s 4.7) that any set of sequential sums of squares has pairwise independent elements. And one can apply Cochran’s Theorem (Koehler’s 4.9 on panel 333) to conclude that the whole set (including SSE) are mutually independent.
1.9 Linear Models and Two-Way Factorial Analyses
As a second application/specialization of the general linear model framework we consider the two-way factorial context. That is, for I levels of Factor A and J levels of Factor B, we consider situations where one can make observations under every combination of a level of A and a level of B. For i = 1, 2, …, I and j = 1, 2, …, J

y_{ijk} = the kth observation at level i of A and level j of B

The most straightforward way to think about such a situation is in terms of the “cell means model”

y_{ijk} = μᵢⱼ + ε_{ijk}    (8)

In this full rank model, provided every “within cell” sample size nᵢⱼ is positive, each μᵢⱼ is estimable and thus so too is any linear combination of these means. It is standard to think of the IJ means μᵢⱼ as laid out in a two-way table as

μ₁₁  μ₁₂  ⋯  μ₁J
μ₂₁  μ₂₂  ⋯  μ₂J
 ⋮    ⋮   ⋱   ⋮
μ_I1 μ_I2 ⋯  μ_IJ
Particularly interesting parametric functions (c′β’s) in model (8) are built on row, column and grand average means

μᵢ. = (1/J) Σⱼ μᵢⱼ and μ.ⱼ = (1/I) Σᵢ μᵢⱼ and μ.. = (1/IJ) Σᵢ,ⱼ μᵢⱼ

Each of these is a linear combination of the I × J means μᵢⱼ and is thus estimable. So are the linear combinations of them

αᵢ = μᵢ. − μ.., βⱼ = μ.ⱼ − μ.., and αβᵢⱼ = μᵢⱼ − (μ.. + αᵢ + βⱼ)    (9)

The “factorial effects” (9) here are particular (estimable) linear combinations of the cell means. It is a consequence of how these are defined that

Σᵢ αᵢ = 0, Σⱼ βⱼ = 0, Σᵢ αβᵢⱼ = 0 ∀j, and Σⱼ αβᵢⱼ = 0 ∀i    (10)
An issue of particular interest in two-way factorials is whether the hypothesis

H₀: αβᵢⱼ = 0 ∀i and j    (11)

is tenable. (If it is, great simplification of interpretation is possible ... changing levels of one factor has the same impact on mean response regardless of which level of the second factor is considered.) This hypothesis can be equivalently written as

μᵢⱼ = μ.. + αᵢ + βⱼ ∀i and j

or as

(μᵢⱼ − μᵢⱼ′) − (μᵢ′ⱼ − μᵢ′ⱼ′) = 0 ∀i, i′, j and j′

and is a statement of “parallelism” on “interaction plots” of means. To test this, one could write the hypothesis in terms of (I − 1)(J − 1) statements

μᵢⱼ − μᵢ. − μ.ⱼ + μ.. = 0

about the cell means and use the machinery for testing H₀: Cβ = d from Example 9. In this case, d = 0 and the test is about EY falling in some subspace of C(X). For thinking about the nature of this subspace and issues related to the hypothesis (11), it is probably best to back up and consider an alternative to the cell means model approach.
Rather than begin with the cell means model, one might instead begin with the non-full-rank “effects model”

y_{ijk} = μ* + αᵢ* + βⱼ* + αβᵢⱼ* + ε_{ijk}    (12)

I have put stars on the parameters to make clear that this is something different from beginning with cell means and defining effects as linear combinations of them. Here there are k = 1 + I + J + IJ parameters for the means and only IJ different means. A model including all of these parameters can not be of full rank. To get simple computations/formulas, one must impose some restrictions. There are several possibilities.

In the first place, the facts (10) suggest the so called “sum restrictions” in the effects model (12)

Σᵢ αᵢ* = 0, Σⱼ βⱼ* = 0, Σᵢ αβᵢⱼ* = 0 ∀j, and Σⱼ αβᵢⱼ* = 0 ∀i

Alternative restrictions are so-called “baseline restrictions.” SAS uses the baseline restrictions

α_I* = 0, β_J* = 0, αβ_{Ij}* = 0 ∀j, and αβ_{iJ}* = 0 ∀i

while R and Splus use the baseline restrictions

α₁* = 0, β₁* = 0, αβ_{1j}* = 0 ∀j, and αβ_{i1}* = 0 ∀i
Under any of these sets of restrictions one may write a full rank n × IJ model matrix as

X = ( 1 | X_{α*} | X_{β*} | X_{αβ*} )

(with blocks of dimensions n × 1, n × (I − 1), n × (J − 1), and n × (I − 1)(J − 1)) and the no interaction hypothesis (11) is the hypothesis H₀: EY ∈ C((1 | X_{α*} | X_{β*})). So using the full model/reduced model paradigm from the regression discussion, one then has an appropriate numerator sum of squares

SSH₀ = Y′(P_X − P_{(1|X_{α*}|X_{β*})})Y

and numerator degrees of freedom (I − 1)(J − 1) (in complete factorials where every nᵢⱼ > 0).
Other hypotheses sometimes of interest are

H₀: αᵢ = 0 ∀i or H₀: βⱼ = 0 ∀j    (13)

These are the hypotheses that all row averages of cell means are the same and that all column averages of cell means are the same. That is, these hypotheses could be written as

H₀: μᵢ. − μᵢ′. = 0 ∀i, i′ or H₀: μ.ⱼ − μ.ⱼ′ = 0 ∀j, j′

It is possible to write the first of these in the cell means model as H₀: Cβ = 0 for C that is (I − 1) × k and each row of C specifying αᵢ = 0 for one of i = 1, 2, …, (I − 1) (or equality of two row average means). Similarly, the second can be written in the cell means model as H₀: Cβ = 0 for C that is (J − 1) × k and each row of C specifying βⱼ = 0 for one of j = 1, 2, …, (J − 1) (or equality of two column average means). Appropriate numerator sums of squares and degrees of freedom for testing these hypotheses are then obvious using the material of Example 9. These sums of squares are often referred to as “Type III” sums of squares.
How to interpret standard partitions of sums of squares and to relate them to tests of hypotheses (11) and (13) is problematic unless all “cell” sample sizes are the same (all nᵢⱼ = m, the data are “balanced”). That is, depending upon what kind of partition one asks for in a call of a standard two-way ANOVA routine, the program produces the following breakdowns

Source  Type I SS (in the order A, B, A×B)  Type II SS                  Type III SS
A       R(α*’s | μ*)                        R(α*’s | μ*, β*’s)          SSH₀ for type (13) hypothesis in model (8)
B       R(β*’s | μ*, α*’s)                  R(β*’s | μ*, α*’s)          SSH₀ for type (13) hypothesis in model (8)
A×B     R(αβ*’s | μ*, α*’s, β*’s)           R(αβ*’s | μ*, α*’s, β*’s)   SSH₀ for type (11) hypothesis in model (8)
The “A×B” sums of squares of all three types are the same. When the data are
balanced, the “A” and “B” sums of squares of all three types are also the same.
But when the sample sizes are not the same, the “A” and “B” sums of squares of
different “types” are generally not the same, only the Type III sums of squares
are appropriate for testing hypotheses (13), and exactly what hypotheses could
be tested using the Type I or Type II sums of squares is both hard to figure
out and actually quite bizarre when one does figure it out. (They correspond
to certain hypotheses about weighted averages of means that use sample sizes
as weights. See Koehler’s notes for more on this point.) From Vardeman’s
perspective, the Type III sums of squares “have a reason to be” in this context,
while the Type I and Type II sums of squares do not. (This is in spite of
the fact that the Type I breakdown is an honest partition of an overall sum of
squares, while the Type III breakdown is not.)
An issue related to lack of balance in a two-way factorial is the issue of “empty cells.” That is, suppose that there are data for only k < IJ combinations of levels of A and B. “Full rank” in this context is “rank k” and the cell means model (8) can have only k < IJ parameters for the means. However, if the I × J table of data isn’t “too sparse” (if the number of empty cells isn’t too large and those that are empty are not in a nasty configuration) it is still possible to fit both a cell means model and a no-interaction effects model to the data. This allows the testing of

H₀: the μᵢⱼ for which one has data show no interactions

i.e.

H₀: (μᵢⱼ − μᵢⱼ′) − (μᵢ′ⱼ − μᵢ′ⱼ′) = 0 ∀ quadruples (i, j), (i, j′), (i′, j) and (i′, j′) complete in the data set    (14)

and the estimation of all cell means under an assumption that a no-interaction effects model appropriate to the cells where one has data extends to all I × J cells in the table.
That is, let X be the n × k cell means model matrix (for k “full” cells) and

X* = ( 1 | X*_{α*} | X*_{β*} )

(with blocks of dimensions n × 1, n × (I − 1), and n × (J − 1)) be an appropriate restricted version of an effects model model matrix (with no interaction terms). If the pattern of empty cells is such that X* is full rank (has rank I + J − 1), the hypothesis (14) can be tested using

F = [Y′(P_X − P_{X*})Y / (k − (I + J − 1))] / [Y′(I − P_X)Y / (n − k)]

and an F_{(k−(I+J−1)),(n−k)} reference distribution. Further, every

μ* + αᵢ* + βⱼ*

is estimable in the no interaction effects model. Provided this model extends to all I × J combinations of levels of A and B, this provides estimates of mean responses for all cells. (Note that this is essentially the same kind of extrapolation one does in a regression context to sets of predictors not in the original data set. However, on an intuitive basis, the link supporting extrapolation is probably stronger with quantitative regressors than it is with the qualitative predictors of the present context.)
2 Nonlinear Models
A generalization of the linear model is the (potentially) “nonlinear” model that for β a k × 1 vector of (unknown) constants (parameters) and for some function

f(x, β)

that is smooth (differentiable) in the elements of β, says that what is observed can be represented as

yᵢ = f(xᵢ, β) + εᵢ    (15)

for each xᵢ a known vector of constants. (The dimension of x is fixed but basically irrelevant for what follows. In particular, it need not be k.) As is typical in the linear model, one usually assumes that Eεᵢ = 0 ∀i, and it is also common to assume that for an unknown constant (a parameter) σ² > 0, Var ε = σ²I.
2.1 Ordinary Least Squares in the Nonlinear Model
In general (unlike the case when f(xᵢ, β) = xᵢ′β and the model (15) is a linear model) there are typically no explicit formulas for least squares estimation of β. That is, minimization of

g(b) = Σᵢ₌₁ⁿ (yᵢ − f(xᵢ, b))²    (16)

is a problem in numerical analysis. There are a variety of standard algorithms used for this purpose. They are all based on the fact that a necessary condition for b_OLS to be a minimizer of g(b) is that

∂g/∂bⱼ |_{b=b_OLS} = 0 ∀j

so that in the search for an ordinary least squares estimator, one might try to find a simultaneous solution to these k “estimating” equations. A bit of calculus and algebra shows that b_OLS must then solve the matrix equation

0 = D′(Y − f(X, b))    (17)

where we use the notations

D = (∂f(xᵢ, b)/∂bⱼ)  (an n × k matrix)  and  f(X, b) = (f(x₁, b), f(x₂, b), …, f(xₙ, b))′  (an n × 1 vector)

In the case of the linear model

D = (∂(xᵢ′b)/∂bⱼ) = (xᵢⱼ) = X and f(X, b) = Xb

so that equation (17) is 0 = X′(Y − Xb), i.e. is the set of normal equations X′Y = X′Xb.
One of many iterative algorithms for searching for a solution to the equation
(17) is the Gauss-Newton algorithm. It proceeds as follows. For
 r 
b1
 br2 


br =  . 
 .. 
brk
the approximate solution produced by the rth iteration of the algorithm (b0 is
some vector of starting values that must be supplied by the user), let
¯
µ
¶
∂f (xi , b) ¯¯
Dr =
¯
∂bj
b=br
The first order Taylor (linear) approximation to f (X, β) at br is
f (X, β) ≈ f (X, br ) + Dr (β − br )
So the nonlinear model Y = f (X, β) + ² can be written as
Y ≈ f (X, br ) + Dr (β − br ) + ²
17
which can be written in linear model form (for the “response vector” Y∗ =
(Y − f (X, br ))) as
(Y − f (X, br )) ≈ Dr (β − br ) + ²
In this context, ordinary least squares would say that β − br could be approximated as
−1
(β\
− br )OLS = (Dr0 Dr ) Dr0 (Y − f (X, br ))
which in turn suggests that the (r + 1)st iterate of b be taken as
−1
br+1 = br + (Dr0 Dr )
Dr0 (Y − f (X, br ))
These iterations are continued until some convergence criterion is satisfied. One
type of convergence criterion is based on the “deviance” (what in a linear model
would be called the error sum of squares). At the rth iteration this is
SSE r = g (br ) =
n
X
i=1
(yi − f (xi , br ))
2
and in rough terms, the convergence criterion is to stop when this ceases to
decrease. Other criteria have to do with (relative) changes in the entries of br
(one stops when br ceases to change much). The R function nls implements this
algorithm (both in versions where the user must supply formulas for derivatives
and where the algorithm itself computes approximate/numerical derivatives).
Koehler’s note discuss two other algorithms, the Newton-Raphson algorithm
and the Fisher scoring algorithm, as they are applied to the (iid normal errors
version of the) loglikelihood function associated with the nonlinear model (15).
Our fundamental interest here is not in the numerical analysis of how one finds a
minimizer of the function (16), bOLS , but rather how that estimator can be used
in making inferences. There are two routes to inference based on complementary
large sample theories connected with bOLS . We outline those in the next two
subsections.
2.2 Inference Methods Based on Large n Theory for the Distribution of MLE’s
Just as in the linear model case discussed in Section 1.7, bOLS and SSE/n
are (joint) maximum likelihood estimators of β and σ 2 under the iid normal
errors nonlinear model. General theory about the consistency and asymptotic
normality of maximum likelihood estimators (outlined, for example, in Section
7.3.1 of the Appendix) suggests the following.
Claim 10 Precise statements of the approximations below can be made in some large n circumstances.

1) bOLS ∼ MVNk(β, σ²(D′D)⁻¹) where D = ( ∂f(xi, b)/∂bj |b=β ). Further, (D′D)⁻¹ “typically gets small” with increasing sample size.

2) MSE = SSE/(n − k) ≈ σ².

3) (D′D)⁻¹ ≈ (D̂′D̂)⁻¹ where D̂ = ( ∂f(xi, b)/∂bj |b=bOLS ).

4) For a smooth (differentiable) function h that maps ℜk → ℜq, claim 1) and the “delta method” (Taylor’s Theorem of Section 7.2 of the Appendix) imply that

h(bOLS) ∼ MVNq(h(β), σ²G(D′D)⁻¹G′) for the q × k matrix G = ( ∂hi(b)/∂bj |b=β )

5) G ≈ Ĝ = ( ∂hi(b)/∂bj |b=bOLS ).
Using this set of approximations, essentially exactly as in Section 1.6, one
can develop inference methods. Some of these are outlined below.
Example 11 (Inference for a single βj) From part 1) of Claim 10 we get the approximation

(bOLSj − βj)/(σ√ηj) ∼ N(0, 1)

for ηj the jth diagonal entry of (D′D)⁻¹. But then from parts 2) and 3) of the claim,

(bOLSj − βj)/(σ√ηj) ≈ (bOLSj − βj)/(√MSE √η̂j)

for η̂j the jth diagonal entry of (D̂′D̂)⁻¹. In the (normal Gauss-Markov) linear model context, this last random variable is in fact t distributed for any n. Then, both so that the nonlinear model formulas reduce to the linear model formulas and as a means of making the already very approximate inference formulas somewhat more conservative, it is standard to say

(bOLSj − βj)/(√MSE √η̂j) ∼ tn−k

and thus to test H0: βj = # using

T = (bOLSj − #)/(√MSE √η̂j)

and a tn−k reference distribution, and to use the values

bOLSj ± t √MSE √η̂j

as confidence limits for βj.
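As a concrete (hypothetical) illustration of these limits, the sketch below simulates data from an assumed model f(x, β) = β1 exp(−β2 x), forms a crude stand-in for bOLS by log-linearization, and then computes √MSE √η̂j standard errors and t intervals; the model, the data, and the tabled t point are all assumptions for illustration.

```python
import numpy as np

# simulated data from an assumed mean function f(x, b) = b1 * exp(-b2 * x)
rng = np.random.default_rng(1)
x = np.linspace(0.1, 4.0, 30)
beta_true = np.array([2.0, 0.7])
y = beta_true[0] * np.exp(-beta_true[1] * x) + rng.normal(0, 0.02, x.size)

# crude b_OLS stand-in via log-linearization (adequate only for illustration)
slope, intercept = np.polyfit(x, np.log(y), 1)
b = np.array([np.exp(intercept), -slope])

# D-hat: n x k matrix of partials of f, here available analytically
D = np.column_stack([np.exp(-b[1] * x), -b[0] * x * np.exp(-b[1] * x)])
resid = y - b[0] * np.exp(-b[1] * x)
n, k = x.size, 2
MSE = np.sum(resid ** 2) / (n - k)
eta = np.diag(np.linalg.inv(D.T @ D))   # diagonal entries eta-hat_j of (D'D)^{-1}
se = np.sqrt(MSE * eta)                 # standard errors for the entries of b

t = 2.048                               # upper 2.5% point of t_{28}, from tables
ci = [(b[j] - t * se[j], b[j] + t * se[j]) for j in range(k)]
```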
Example 12 (Inference for a univariate function of β, including a single mean response) For h that maps ℜk → ℜ1 consider inference for h(β) (with application to f(x, β) for a given set of predictor variables x). Facts 4) and 5) of Claim 10 suggest that

(h(bOLS) − h(β)) / (√MSE √(Ĝ(D̂′D̂)⁻¹Ĝ′)) ∼ tn−k

This leads (as in the previous application/example) to testing H0: h(β) = # using

T = (h(bOLS) − #) / (√MSE √(Ĝ(D̂′D̂)⁻¹Ĝ′))

and a tn−k reference distribution, and to use of the values

h(bOLS) ± t √MSE √(Ĝ(D̂′D̂)⁻¹Ĝ′)

as confidence limits for h(β).
For a set of predictor variables x, this can then be applied to h(β) = f(x, β) to produce inferences for the mean response at x. That is, with the 1 × k matrices

G = ( ∂f(x, b)/∂bj |b=β ) and, as expected, Ĝ = ( ∂f(x, b)/∂bj |b=bOLS )

one may test H0: f(x, β) = # using

T = (f(x, bOLS) − #) / (√MSE √(Ĝ(D̂′D̂)⁻¹Ĝ′))

and a tn−k reference distribution, and use the values

f(x, bOLS) ± t √MSE √(Ĝ(D̂′D̂)⁻¹Ĝ′)

as confidence limits for f(x, β).
Example 13 (Prediction) Suppose that in the future, y∗ normal with mean h(β) and variance γσ², independent of Y, will be observed. (The constant γ is assumed to be known.) Approximate prediction limits for y∗ are then

h(bOLS) ± t √(MSE (γ + Ĝ(D̂′D̂)⁻¹Ĝ′))
2.3 Inference Methods Based on Large n Theory for Likelihood Ratio Tests/Shape of the Likelihood Function
Exactly as in Section 1.7 for the normal Gauss-Markov linear model, the (normal) “likelihood function” in the (iid errors) nonlinear model is

L(β, σ²|Y) = (2π)^(−n/2) (σ²)^(−n/2) exp( −(1/(2σ²)) Σi=1…n (yi − f(xi, β))² )

Suppressing display of the fact that this is a function of Y (i.e. this is a “random function” of the parameters β and σ²) it is common to take a natural logarithm and consider the “log-likelihood function”

l(β, σ²) = −(n/2) ln 2π − (n/2) ln σ² − (1/(2σ²)) Σi=1…n (yi − f(xi, β))²
There is general large n theory concerning the use of this kind of random function in testing (and the inversion of tests to get confidence sets) that leads to
inference methods alternative to those presented in Section 2.2. That theory
is summarized in Section 7.3.2 of the Appendix. Here we make applications of
it in the nonlinear model. In the notation of the Appendix, our applications
to the nonlinear model will be to the case where r = k + 1 and θ stands for a vector including both β and σ².
Example 14 (Inference for σ²) Here let θ1 in the general theory be σ². For any σ², l(β, σ²) is maximized as a function of β by bOLS. So then

“l∗(θ1)” = l(bOLS, σ²) = −(n/2) ln 2π − (n/2) ln σ² − SSE/(2σ²)

and

“l(θ̂MLE)” = l(bOLS, SSE/n) = −(n/2) ln 2π − (n/2) ln(SSE/n) − n/2

so that applying display (60) a large sample approximate confidence interval for σ² is

{ σ² | −(n/2) ln σ² − SSE/(2σ²) > −(n/2) ln(SSE/n) − n/2 − (1/2)χ²1 }

This is an alternative to simply borrowing from the linear model context the approximation

SSE/σ² ∼ χ²n−k

and the confidence limits in Example 6.
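The likelihood-ratio interval above is easy to evaluate numerically on a grid. The sketch below does so for hypothetical values of n and SSE; the χ²1 point is hard-coded from tables, and all numbers are assumptions for illustration.

```python
import numpy as np

# Example 14's set {sigma^2 : -(n/2) ln sigma^2 - SSE/(2 sigma^2) > cutoff},
# found on a grid; n and SSE are hypothetical inputs
n, SSE = 30, 12.0
chi2_1 = 3.841    # upper 5% point of chi-square with 1 df, from tables

def profile_loglik(sig2):
    # the profiled loglikelihood terms that involve sigma^2
    return -(n / 2) * np.log(sig2) - SSE / (2 * sig2)

cutoff = -(n / 2) * np.log(SSE / n) - n / 2 - chi2_1 / 2
grid = np.linspace(0.05, 3.0, 4000)
inside = grid[profile_loglik(grid) > cutoff]
ci = (inside.min(), inside.max())   # approximate interval for sigma^2
```

Note that the maximizer SSE/n (here 0.4) always lies inside the reported interval, as it must.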
Example 15 (Inference for the whole vector β) Here we let θ1 in the general theory be β. For any β, l(β, σ²) is maximized as a function of σ² by

σ̂²(β) = (1/n) Σi=1…n (yi − f(xi, β))²

So then

“l∗(θ1)” = l(β, σ̂²(β)) = −(n/2) ln 2π − (n/2) ln σ̂²(β) − n/2

and (since “l(θ̂MLE)” is as before) applying display (60) a large sample approximate confidence set for β is

{ β | −(n/2) ln σ̂²(β) > −(n/2) ln(SSE/n) − (1/2)χ²k }
= { β | ln σ̂²(β) < ln(SSE/n) + (1/n)χ²k }
= { β | Σi=1…n (yi − f(xi, β))² < SSE exp((1/n)χ²k) }

Now as it turns out, in the linear model an exact confidence region for β is of the form

{ β | Σi=1…n (yi − f(xi, β))² < SSE (1 + (k/(n − k)) Fk,n−k) }   (18)

for Fk,n−k the upper α point. So it is common to carry over formula (18) to the nonlinear model setting. When this is done, the region prescribed by formula (18) is called the “Beale” region for β.
Example 16 (Inference for a single βj) What is usually a better alternative to the methods in Example 11 (in terms of holding a nominal coverage probability) is the following application of the general likelihood analysis. θ1 from the general theory will now be βj. Let β∗j stand for the vector β with its jth entry deleted and let β̂∗j(βj) minimize Σi=1…n (yi − f(xi, βj, β∗j))² over choices of β∗j. Reasoning like that in Example 15 says that a large sample approximate confidence set for βj is

{ βj | Σi=1…n (yi − f(xi, βj, β̂∗j(βj)))² < SSE exp((1/n)χ²1) }

This is the set of βj for which there is a β∗j that makes Σi=1…n (yi − f(xi, βj, β∗j))² not too much larger than SSE. It is common to again appeal to exact theory for the linear model and replace exp((1/n)χ²1) with something related to an F percentage point, namely

1 + (1/(n − k))F1,n−k = 1 + t²n−k/(n − k)

(here the upper α point of the F distribution and the upper α/2 point of the t distribution are under discussion).
3 Mixed Models
A second generalization of the linear model is the model

Y = Xβ + Zu + ε

for Y, X, β, and ε as in the linear model (1), Z an n × q model matrix of known constants, and u a q × 1 random vector. It is common to assume that

E(u′, ε′)′ = 0 and Var(u′, ε′)′ = [ G 0 ; 0 R ]

(block diagonal, for G a q × q matrix and R an n × n matrix). This implies that

EY = Xβ and VarY = ZGZ′ + R

The covariance matrix V = ZGZ′ + R is typically a function of several variances (called “variance components”). In simple standard models with balanced data, it is often a patterned matrix. In this model, some objects of interest are the entries of β (the so-called “fixed effects”), the variance components, and sometimes (prediction of) the entries of u (the “random effects”).
3.1 Maximum Likelihood in Mixed Models
We will acknowledge that the covariance matrix V = ZGZ′ + R is typically a function of several (say p) variances by writing V(σ²) (thinking that σ² = (σ²1, σ²2, . . . , σ²p)′). The normal likelihood function here is

L(Xβ, σ²) = f(Y|Xβ, V(σ²)) = (2π)^(−n/2) |det V(σ²)|^(−1/2) exp( −(1/2)(Y − Xβ)′V(σ²)⁻¹(Y − Xβ) )

Notice that for fixed σ² (and therefore fixed V(σ²)), to maximize this as a function of Xβ, one wants to use

X̂β(σ²) = Ŷ∗(σ²) = the generalized least squares estimate of Xβ based on V(σ²)

This is (after a bit of algebra beginning with the generalized least squares formula (4))

Ŷ∗(σ²) = X(X′V(σ²)⁻¹X)⁻¹X′V(σ²)⁻¹Y

Plugging this into the likelihood, one produces a profile likelihood for the vector σ²,

L∗(σ²) = L(Ŷ∗(σ²), σ²) = (2π)^(−n/2) |det V(σ²)|^(−1/2) exp( −(1/2)(Y − Ŷ∗(σ²))′V(σ²)⁻¹(Y − Ŷ∗(σ²)) )

At least in theory, this can be maximized to find MLE’s for the variance components. And with σ̂²ML a maximizer of L∗(σ²), the maximum likelihood estimate of the whole set of parameters is (X̂β(σ̂²ML), σ̂²ML).
While maximum likelihood is supported by large sample theory, the folklore
is that in small samples, it tends to underestimate variance components. This
is generally attributed to “a failure to account for a reduction in degrees of
freedom associated with estimation of the mean vector.” Restricted Maximum
Likelihood or “REML” estimation is a modification of maximum likelihood that
for the estimation of variance components essentially replaces use of a likelihood
based on Y with one based on a vector of residuals that is known to have mean
0, and therefore requires no estimation of a mean vector.
More precisely, suppose that B is m × n of rank m = n − rank(X) and BX = 0. Define

r = BY

A REML estimate of σ² is a maximizer of a likelihood based on r,

Lr(σ²) = f(r|σ²) = (2π)^(−m/2) |det BV(σ²)B′|^(−1/2) exp( −(1/2) r′(BV(σ²)B′)⁻¹r )

Typically the entries of a REML estimate of σ² are larger than those of a maximum likelihood estimate based on Y. And (happily) every m × n matrix B of rank m = n − rank(X) with BX = 0 leads to the same (REML) estimate.
3.2 Estimation of an Estimable Vector of Parametric Functions Cβ
Estimability of a linear combination of the entries of β means the same thing here as it always does (estimability has only to do with mean structure, not variance-covariance structure). For C an l × k matrix with C = AX for some l × n matrix A, the BLUE of Cβ is

Ĉβ = AŶ∗(σ²)

This has covariance matrix

Var Ĉβ = C(X′V(σ²)⁻¹X)⁻C′

The BLUE of Cβ depends on the variance components, and is typically unavailable. If one estimates σ² with σ̂², say via REML or maximum likelihood, then one may estimate V(σ²) as V̂ = V(σ̂²), which makes available the approximation to the BLUE

Ĉβ̂ = AŶ∗(σ̂²)   (19)

It is then perhaps possible to estimate the variance-covariance matrix of the approximate BLUE (19) as

V̂ar Ĉβ = C(X′V(σ̂²)⁻¹X)⁻C′

(My intuition is that in small to moderate samples this will tend to produce standard errors that are better than nothing, but probably tend to be too small.)
3.3 Best Linear Unbiased Prediction and Related Inference in the Mixed Model
We now consider problems of prediction related to the random effects contained in u. Under the MVN assumption

E[u|Y] = GZ′V⁻¹(Y − Xβ)   (20)

which, obviously, depends on the fixed effect vector β. (For what it is worth,

Var(u′, Y′)′ = [ G GZ′ ; ZG ZGZ′ + R ]

and the GZ′ appearing in (20) is the covariance between u and Y.) Something that is close to the conditional mean (20), but that does not depend on fixed effects, is

û = GZ′V⁻¹(Y − Ŷ∗)   (21)

where Ŷ∗ is the generalized least squares (best linear unbiased) estimate of the mean of Y,

Ŷ∗ = X(X′V⁻¹X)⁻X′V⁻¹Y

The predictor (21) turns out to be the Best Linear Unbiased Predictor of u, and if we temporarily abbreviate

B = X(X′V⁻¹X)⁻X′V⁻¹ and P = V⁻¹(I − B)   (22)

this can be written as

û = GZ′V⁻¹(Y − BY) = GZ′V⁻¹(I − B)Y = GZ′PY   (23)

We consider here predictions based on û and the problems of quoting appropriate “precision” measures for them.
To begin with the u vector itself, û is an obvious approximation and a precision of prediction should be related to the variability in the difference

u − û = u − GZ′PY = u − GZ′P(Xβ + Zu + ε)

This random vector has mean 0 and covariance matrix

Var(u − û) = (I − GZ′PZ)G(I − GZ′PZ)′ + GZ′PRPZG = G − GZ′PZG   (24)

(This last equality is not obvious to me, but is what McCulloch and Searle promise on page 170 of their book.)
Now û in (23) is not available unless one knows the covariance matrices G and V. If one has estimated variance components and hence has estimates of G and V (and for that matter, R, B, and P) the approximate BLUP

û̂ = ĜZ′P̂Y

may be used. A way of making a crude approximation to a measure of precision of the approximate BLUP (as a predictor of u) is to plug estimates into the relationship (24) to produce

V̂ar(u − û̂) = Ĝ − ĜZ′P̂ZĜ
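For concreteness, here is a small numerical sketch of the BLUP formula û = GZ′V⁻¹(Y − Ŷ∗) in a balanced one-way random effects model with the variance components treated as known; the model and all numbers are illustrative assumptions, not anything from the notes.

```python
import numpy as np

# one-way random effects model: 3 groups of 4 observations,
# Y = X mu + Z u + e with known variance components (an illustration)
rng = np.random.default_rng(2)
q, m = 3, 4
sig2_u, sig2_e = 4.0, 1.0
X = np.ones((q * m, 1))                    # intercept-only fixed effects
Z = np.kron(np.eye(q), np.ones((m, 1)))    # group-indicator random effects
u = rng.normal(0, np.sqrt(sig2_u), q)
Y = X @ [5.0] + Z @ u + rng.normal(0, np.sqrt(sig2_e), q * m)

G = sig2_u * np.eye(q)
R = sig2_e * np.eye(q * m)
V = Z @ G @ Z.T + R
Vinv = np.linalg.inv(V)
# generalized least squares estimate of the mean vector Y-hat*
Yhat = X @ np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ Y)
u_hat = G @ Z.T @ Vinv @ (Y - Yhat)        # the BLUP of u
```

In this balanced case the BLUP reduces to the familiar shrinkage of group-mean deviations toward 0 by the factor mσ²u/(mσ²u + σ²e), which is one way to check the matrix arithmetic.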
Consider now the prediction of a quantity

l = c′β + s′u

for an estimable c′β (estimability has nothing to do with the covariance structure of Y and so “estimability” means here what it always does). As it turns out, if c′ = a′X, the BLUP of l is

l̂ = a′Ŷ∗ + s′û = a′BY + s′GZ′PY = (a′B + s′GZ′P)Y   (25)

To quantify the precision of this as a predictor of l we must consider the random variable l̂ − l. The variance of this is the unpleasant (but not impossible) quantity

Var(l̂ − l) = a′X(X′V⁻¹X)⁻X′a + s′(G − GZ′PZG)s − 2a′BZGs   (26)

(This variance is from page 256 of McCulloch and Searle.)
Now l̂ of (25) is not available unless one knows covariance matrices. But with estimates of variance components and corresponding matrices, what is available as an approximation to l̂ is

l̂̂ = (a′B̂ + s′ĜZ′P̂)Y

A way of making a crude approximation to a measure of precision of the approximate BLUP (as a predictor of l) is to plug estimates into the relationship (26) to produce

V̂ar(l̂ − l) = a′X(X′V̂⁻¹X)⁻X′a + s′(Ĝ − ĜZ′P̂ZĜ)s − 2a′B̂ZĜs
3.4 Confidence Intervals and Tests for Variance Components
Section 2.4.3 of Pinheiro and Bates indicates that the following is the state of the art for interval estimation of the entries of σ² = (σ²1, σ²2, . . . , σ²p). (This is an application of the general theory outlined in Appendix 7.3.1.) Let

γ = (γ1, γ2, . . . , γp) = (log σ²1, log σ²2, . . . , log σ²p)

be the vector of log variance components. (This one-to-one transformation is used because for finite samples, likelihoods for γ tend to be “better behaved/more nearly quadratic” than likelihoods for σ².) Let σ̂² be a ML or REML estimate of σ², and define the corresponding vector of estimated log variance components

γ̂ = (γ̂1, γ̂2, . . . , γ̂p) = (log σ̂²1, log σ̂²2, . . . , log σ̂²p)

Consider a full rank version of the fixed effects part of the mixed model and let

l(β, σ²) = log L(β, σ²)

be the loglikelihood and

lr(σ²) = log Lr(σ²)

be the restricted loglikelihood. Define

l∗(β, γ) = l(β, (exp γ1, exp γ2, . . . , exp γp))

and

lr∗(γ) = lr((exp γ1, exp γ2, . . . , exp γp))

These are the loglikelihood and restricted loglikelihood for the γ parameterization.
To first lay out confidence limits based on maximum likelihood, define a matrix of second partials

M = [ M11 M12 ; M21 M22 ]

(M11 is p × p and M22 is k × k) where

M11 = ( ∂²l∗(β, γ)/∂γi∂γj ), M22 = ( ∂²l∗(β, γ)/∂βi∂βj ), and M12 = ( ∂²l∗(β, γ)/∂γi∂βj ) = M21′

all evaluated at the maximum likelihood estimate (β̃, γ̃)ML. Then the matrix

Q = −M⁻¹

functions as an estimated variance-covariance matrix for (γ̂, β̂)ML, which is approximately normal in large samples. So approximate confidence limits for the ith entry of γ are

γ̂i ± z √qii   (27)

(where qii is the ith diagonal element of Q). Exponentiating, approximate confidence limits for σ²i based on maximum likelihood are

( σ̂²i exp(−z√qii), σ̂²i exp(z√qii) )   (28)

A similar story can be told based on REML estimates. If one defines the p × p matrix

Mr = ( ∂²lr∗(γ)/∂γi∂γj |γ=γ̃REML )

and takes

Qr = −Mr⁻¹

the formula (28) may be used, where qii is the ith diagonal element of Qr.
Rather than working through approximate normality of γ̂ and quadratic shape for l∗(β, γ) or lr∗(γ) near its maximum, it is possible to work directly with σ̂² and get a different formula parallel to (27) for intervals centered at σ̂²i. Although the two approaches are equivalent in the limit, in finite samples method (28) does a better job of holding its nominal confidence level, so it is the one commonly used.
As discussed in Section 2.4.1 of Pinheiro and Bates, it is common to want to compare two random effects structures for a given Y and fixed effects structure, where the second is more general than the first (the models are “nested”). If λ1 and λ2 are the corresponding maximized log likelihoods (or maximized restricted log likelihoods), the test statistic

2(λ2 − λ1)

can be used to test the hypothesis that the smaller/simpler/less general model is adequate. If there are p1 variance components in the first model and p2 in the second, large sample theory (see Appendix 7.3.2) suggests a χ²p2−p1 reference distribution for testing the adequacy of the smaller model.
3.5 Linear Combinations of Mean Squares and the Cochran-Satterthwaite Approximation
It is common in simple mixed models to consider linear combinations of certain independent “mean squares” as estimates of variances. It is thus useful to have some theory for such linear combinations.
Suppose that

MS1, MS2, . . . , MSl

are independent random variables and for constants df1, df2, . . . , dfl each

(dfi)MSi/EMSi ∼ χ²dfi   (29)

Consider the random variable

S² = a1MS1 + a2MS2 + · · · + alMSl   (30)

Clearly,

ES² = a1EMS1 + a2EMS2 + · · · + alEMSl

Further, it is possible to show that

VarS² = 2 Σ a²i(EMSi)²/dfi

and one might estimate this by plugging in estimates of the quantities (EMSi)². The most obvious way of estimating (EMSi)² is using (MSi)², which leads to the estimator

V̂arS² = 2 Σ a²i(MSi)²/dfi

A somewhat more refined (but less conservative) estimator follows from the observation that E(MSi)² = ((dfi + 2)/dfi)(EMSi)², which suggests

V̂ar∗S² = 2 Σ a²i(MSi)²/(dfi + 2)

Clearly, either √V̂arS² or √V̂ar∗S² could function as a standard error for S².
Beyond producing a standard error for S², it is common to make a distributional approximation for S². The famous Cochran-Satterthwaite approximation is most often used. This treats S² as approximately a multiple of a chi-square random variable. That is, while S² does not exactly have any standard distribution, for an appropriate ν one might hope that

νS²/ES² ∼ χ²ν   (31)

Notice that the variable νS²/ES² has mean ν. Setting Var(νS²/ES²) = 2ν (the variance of a χ²ν random variable) and solving for ν produces

ν = (ES²)² / Σ (aiEMSi)²/dfi   (32)

and approximation (31) with degrees of freedom (32) is the Cochran-Satterthwaite approximation.
This approximation leads to (unusable) confidence limits for ES² of the form

( νS²/(upper α/2 point of χ²ν), νS²/(lower α/2 point of χ²ν) )   (33)

These are unusable because ν depends on the unknown expected mean squares. But the degrees of freedom parameter may be estimated as

ν̂ = (S²)² / Σ (aiMSi)²/dfi   (34)

and plugged into the limits (33) to get even more approximate (but usable) confidence limits for ES² of the form

( ν̂S²/(upper α/2 point of χ²ν̂), ν̂S²/(lower α/2 point of χ²ν̂) )   (35)

(Approximation (31) is also used to provide approximate t intervals for various kinds of estimable functions in particular mixed models.)
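A short sketch of the recipe above: given coefficients ai, mean squares, and their degrees of freedom, compute S² and the estimated degrees of freedom ν̂ of (34). The numbers below are hypothetical, and the χ² points needed for the limits (35) are left to be looked up from tables.

```python
import numpy as np

def satterthwaite(a, ms, df):
    """Return S^2 = sum a_i MS_i and the estimated Cochran-Satterthwaite
    degrees of freedom nu-hat = (S^2)^2 / sum((a_i MS_i)^2 / df_i)."""
    a, ms, df = map(np.asarray, (a, ms, df))
    s2 = np.sum(a * ms)
    nu_hat = s2 ** 2 / np.sum((a * ms) ** 2 / df)
    return s2, nu_hat

# hypothetical example: sigma-hat^2_beta = (MSB(A) - MSE)/c as in Section 3.6
c = 2
s2, nu_hat = satterthwaite(a=[1 / c, -1 / c], ms=[8.0, 2.0], df=[6, 16])
# with upper/lower alpha/2 chi-square points for nu_hat df (from tables),
# the usable limits (35) are (nu_hat*s2/upper, nu_hat*s2/lower)
```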
3.6 Mixed Models and the Analysis of Balanced Two-Factor Nested Data
There are a variety of standard (balanced data) analyses of classical “ANOVA” that are based on particular mixed models. Koehler’s mixed models notes treat several of these. One is the two-factor nested data analysis of Sections 28.1-28.5 of Neter et al. that we proceed to consider.
We assume that Factor A has a levels, within each one of which Factor B has b levels, and that there are c observations from each of the a × b instances of a level of A and a level of B within that level of A. Then a possible mixed effects model is that for i = 1, 2, . . . , a, j = 1, 2, . . . , b, and k = 1, 2, . . . , c

yijk = µ + αi + βij + εijk

where µ is an unknown constant (the model’s only fixed effect) and the a random effects αi are iid N(0, σ²α), independent of the ab random effects βij that are iid N(0, σ²β), and the set of αi and βij is independent of the set of εijk that are iid N(0, σ²). This can be written in the basic mixed model form Y = Xβ + Zu + ε, where X = 1 (an abc × 1 column of 1’s), β = µ, and

u = (α1, . . . , αa, β11, . . . , β1b, . . . , βa1, . . . , βab)′

For the abc × 1 vector Y with the yijk written down in lexicographical (dictionary) order (as regards the triple numerical subscripts) the covariance matrix here is

V = σ²α (I(a×a) ⊗ J(bc×bc)) + σ²β (I(ab×ab) ⊗ J(c×c)) + σ² I(abc×abc)

For purposes of defining mean squares of interest (and computing them) we may think of doing linear model (all fixed effects) computations under the sum restrictions

Σi αi = 0 and Σj βij = 0 ∀i

Then for a full rank linear model with (a − 1) parameters αi and a(b − 1) parameters βij define

SSA = R(α’s|µ)
SSB(A) = R(β’s|µ, α’s) = R(β’s|µ)
SSE = e′e

and associate with these sums of squares respective degrees of freedom (a − 1), a(b − 1), and ab(c − 1). The corresponding mean squares

MSA = SSA/(a − 1)
MSB(A) = SSB(A)/(a(b − 1))
MSE = SSE/(ab(c − 1))

are independent, and after normalization have chi-squared distributions as in (29).
The expected values of the mean squares are

EMSA = σ² + cσ²β + bcσ²α
EMSB(A) = σ² + cσ²β
EMSE = σ²

and these suggest estimators of variance components

σ̂²α = (1/bc)(MSA − MSB(A))
σ̂²β = (1/c)(MSB(A) − MSE)
σ̂² = MSE

Further, since these estimators of variance components are of the form (30), the Cochran-Satterthwaite material of Section 3.5 (in particular formula (35)) can be used to make approximate confidence limits for the variance components. (The interval for σ² is exact, as always.)
Regarding inference for the fixed effect µ, the BLUE is

ȳ... = µ + ᾱ. + β̄.. + ε̄...

It is then easy to see that

Var ȳ... = σ²α/a + σ²β/ab + σ²/abc = (1/abc)EMSA

This, in turn, correctly suggests that confidence limits for µ are

ȳ... ± t √(MSA/abc)

for t a percentage point of the t(a−1) distribution.
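The sums of squares and variance component estimators of this section can be computed directly from a balanced data array. The sketch below is one possible implementation for simulated data; the simulation settings are assumptions for illustration only.

```python
import numpy as np

def nested_anova(y):
    """Variance component estimators for balanced two-factor nested data
    y[i, j, k]: a levels of A, b levels of B within A, c replicates."""
    a, b, c = y.shape
    ybar = y.mean()
    yi = y.mean(axis=(1, 2))             # A-level means
    yij = y.mean(axis=2)                 # B-within-A cell means
    SSA = b * c * np.sum((yi - ybar) ** 2)
    SSBA = c * np.sum((yij - yi[:, None]) ** 2)
    SSE = np.sum((y - yij[:, :, None]) ** 2)
    MSA = SSA / (a - 1)
    MSBA = SSBA / (a * (b - 1))
    MSE = SSE / (a * b * (c - 1))
    return {"sig2_alpha": (MSA - MSBA) / (b * c),
            "sig2_beta": (MSBA - MSE) / c,
            "sig2": MSE}

# simulated balanced nested data (all settings hypothetical)
rng = np.random.default_rng(3)
a, b, c = 6, 5, 4
alpha = rng.normal(0, 2.0, a)
beta = rng.normal(0, 1.0, (a, b))
y = 10 + alpha[:, None, None] + beta[:, :, None] + rng.normal(0, 0.5, (a, b, c))
est = nested_anova(y)
```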
3.7 Mixed Models and the Analysis of Unreplicated Balanced One-Way Data in Complete Random Blocks
Koehler’s Example 10.1 and Neter et al.’s Sections 27.12 and 29.2 consider the analysis of a set of single observations from a × b combinations of a level of Factor A and a level of a random (“blocking”) Factor B. (In Section 29.2 of Neter et al., the second factor is “subjects” and the discussion is in terms of “repeated measures” on subjects under the a different treatments.) That is, for i = 1, 2, . . . , a and j = 1, 2, . . . , b in this section we suppose that

yij = µ + αi + βj + εij   (36)

where the fixed effects are µ and the αi, the b values βj are iid N(0, σ²β), and the set of βj is independent of the set of εij that are iid N(0, σ²). With the ab × 1 vector of observations ordered as

Y = (y11, . . . , ya1, y12, . . . , ya2, . . . , y1b, . . . , yab)′

the model can be written in the standard mixed model form with fixed effects part Xβ for X made of b stacked copies of (1(a×1), I(a×a)) and fixed effect vector (µ, α1, . . . , αa)′, random effects part Zu for Z = I(b×b) ⊗ 1(a×1) and u = (β1, β2, . . . , βb)′, plus ε, and the variance-covariance matrix for Y is easily seen to be

V = σ²β (I(b×b) ⊗ J(a×a)) + σ² I(ab×ab)

For purposes of defining mean squares of interest (and computing them) we may think of doing linear model (all fixed effects) computations under the sum restrictions

Σ αi = 0 and Σ βj = 0

Then for a full rank linear model with (a − 1) parameters αi and (b − 1) parameters βj define

SSA = R(α’s|µ)
SSB = R(β’s|µ, α’s) = R(β’s|µ)
SSE = e′e

and associate with these sums of squares respective degrees of freedom (a − 1), (b − 1), and (a − 1)(b − 1). The mean squares

MSB = SSB/(b − 1)
MSE = SSE/((a − 1)(b − 1))

are independent, and after normalization have chi-squared distributions as in (29).
The expected values of the two mean squares of most interest are

EMSB = σ² + aσ²β
EMSE = σ²

and these suggest estimators of variance components

σ̂²β = (1/a)(MSB − MSE)
σ̂² = MSE

Note that just as in Section 3.6, the Cochran-Satterthwaite material of Section 3.5 (in particular formula (35)) can be used to make approximate confidence limits for the variance components. (The interval for σ² is exact, as always.)
What are usually of more interest in this model are intervals for estimable functions of the fixed effects. In this direction, consider first the estimable function

µ + αi

It is the case that the BLUE of this is

ȳi. = µ + αi + β̄. + ε̄i.

This has

Var ȳi. = σ²β/b + σ²/b

A possible estimated variance for the BLUE of µ + αi is then

S²ȳi. = (1/b)(σ̂²β + σ̂²) = (1/(ab(b − 1)))(SSB + SSE) = (1/ab)MSB + ((a − 1)/ab)MSE

which is clearly of form (30). Applying the Cochran-Satterthwaite approximation, unusable (approximate) confidence limits for µ + αi are then easily seen to be

ȳi. ± t √S²ȳi.

where t is a tν percentage point. And further approximation suggests that usable approximate confidence limits are

ȳi. ± t̂ √S²ȳi.

for t̂ a tν̂ percentage point, with ν̂ as given in general in (34) and here specifically as

ν̂ = (S²ȳi.)² / [ ((1/ab)MSB)²/(b − 1) + (((a − 1)/ab)MSE)²/((a − 1)(b − 1)) ]

What is perhaps initially slightly surprising is that some estimable functions have obvious exact confidence limits (ones that don’t require application of the Cochran-Satterthwaite approximation). Consider, for example, αi − αi′. The BLUE of this is

ȳi. − ȳi′. = (µ + αi + β̄. + ε̄i.) − (µ + αi′ + β̄. + ε̄i′.) = (αi − αi′) + (ε̄i. − ε̄i′.)

This has variance

Var(ȳi. − ȳi′.) = 2σ²/b

which can be estimated as a multiple of MSE alone. So exact confidence limits for αi − αi′ are

ȳi. − ȳi′. ± t √(2MSE/b)   (37)

for t a percentage point of the t(a−1)(b−1) distribution.
It is perfectly possible to restate the model yij = µ + αi + βj + εij in the form yij = µi + βj + εij, or in the case that the a treatments themselves have their own factorial structure, even replace the µi with something like αl + γk + αγlk (producing a model with a factorial treatment structure in random blocks). This line of thinking (represented, for example, in Section 29.3 of Neter et al.) suggests that linear combinations of the µi beyond µi − µi′ = αi − αi′ can be of practical interest in the present context. Many of these will be of the form

Σ ciµi for constants ci with Σ ci = 0   (38)

The BLUE of quantity (38) is

Σ ciȳi. = Σ ciµi + Σ ciε̄i.

which has variance

Var Σ ciȳi. = (σ²/b) Σ c²i

(Notice that as for estimating αi − αi′, the condition Σ ci = 0 guarantees that the random block effects do not enter this analysis.) So confidence limits for (38) generalizing the ones in (37) are

Σ ciȳi. ± t √(Σ c²i/b) √MSE

for t a percentage point of the t(a−1)(b−1) distribution.
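A small sketch of these contrast limits for simulated unreplicated data in random blocks follows; the treatment means, error standard deviations, and the tabled t15 point are illustrative assumptions.

```python
import numpy as np

def contrast_ci(y, coefs, t):
    """Limits sum(c_i ybar_i.) +/- t sqrt(sum c_i^2 / b) sqrt(MSE) for
    unreplicated data y[i, j] (a treatments in b random blocks)."""
    a, b = y.shape
    ybar_i = y.mean(axis=1)
    # MSE from the additive (treatment + block) fit; its residuals carry
    # (a-1)(b-1) degrees of freedom
    fit = ybar_i[:, None] + y.mean(axis=0)[None, :] - y.mean()
    MSE = np.sum((y - fit) ** 2) / ((a - 1) * (b - 1))
    est = np.sum(coefs * ybar_i)
    half = t * np.sqrt(np.sum(np.square(coefs)) / b) * np.sqrt(MSE)
    return est - half, est + half

# hypothetical data: 4 treatments in 6 random blocks
rng = np.random.default_rng(5)
mu_i = np.array([10.0, 12.0, 11.0, 9.0])
y = mu_i[:, None] + rng.normal(0, 1.0, 6)[None, :] + rng.normal(0, 0.3, (4, 6))
# contrast mu_1 - mu_2; 2.131 is the upper 2.5% point of t_15, from tables
lo, hi = contrast_ci(y, coefs=np.array([1.0, -1.0, 0.0, 0.0]), t=2.131)
```

Note how the large block-to-block variation (sd 1.0) does not inflate the interval, exactly because Σ ci = 0.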
3.8 Mixed Models and the Analysis of Balanced Split-Plot/Repeated Measures Data
Mixed Models and the Analysis of Balanced SplitPlot/Repeated Measures Data
Something fundamentally different from the imposition of a factorial structure
on the µi of model (36) is represented by the “split plot model”
yijk = µ + αi + β j + αβ ij + γ jk +
35
ijk
(39)
for
yijk
= the kth observation at the ith level of A (the split plot treatment)
and the jth level of B (the whole plot treatment)
i = 1, 2, . . . , a, j = 1, 2, . . . , b, k = 1, 2, . . . , n, where the effects µ, αi , β j , and
αβ ij are fixed (and, say, subject to the sum restriction so that the fixed effects
¡
¢
part of the model is full rank) and the γ jk are iid N 0, σ 2γ independent of the
¢
¡
iid N 0, σ 2 errors ijk . If we write


y1jk
 y2jk 


yjk =  .


 ..
yajk
and let


y11
 ..

 .



 y1n 


 y21 


 ..

 .



Y=

y
 2n 
 .

 ..



 yb1 


 .

.
 .

ybn
it is the case that
VarY =σ 2γ
I
bn×bn
⊗ J +σ2
a×a
I
abn×abn
For purposes of defining mean squares of interest (and computing them) we
may think of doing linear model (all fixed effects) computations under the sum
restrictions
X
X
X
X
X
αi = 0,
β j = 0,
αβ ij = 0 ∀j,
αβ ij = 0 ∀i,
γ jk = 0 ∀j
i
j
k
For a full rank linear model with (a − 1) parameters αi , (b − 1) parameters β i ,
(a − 1)(b − 1) parameters αβ ij , and b(n − 1) parameters γ jk define
SSA
SSA
SSAB
SSC(B)
SSE
=
=
=
=
=
R (α’s|µ)
R (β’s|µ)
R (αβ’s|µ)
R (γ’s|µ, β’s)
e0 e
36
and associate with these sums of squares respective degrees of freedom (a − 1) ,
(b − 1) , (a − 1)(b − 1), b(n − 1), and (a − 1)b (n − 1). The mean squares
SSC(B)
b(n − 1)
SSE
(a − 1)b(n − 1)
M SC(B) =
M SE
=
are independent, and after normalization have chi-squared distributions as in
(29). The expected values are
EM SC(B) = σ 2 + aσ 2γ
= σ2
EM SE
These suggest estimators of variance components
σ
b2γ
σ
b2
1
(M SC(B) − M SE)
a
= M SE
=
and the Cochran-Satterthwaite approximation produces confidence intervals for
these.
Regarding inference for the fixed effects, notice that
y i.. − y i0 .. = (αi − αi0 ) + (
so that confidence limits for αi − αi0 are
r
y i.. − y i0 .. ± t
i..
−
i0 .. )
2M SE
bn
Then note that (for estimating β j − β j 0 )
¡
¢
y .j. − y .j 0 . = β j − β j 0 + (γ j. − γ j 0 . ) + (
(40)
.j.
−
.j 0 . )
has variance
¡
¢ 2σ 2γ
2σ 2
2
Var y .j. − y .j 0 . =
+
=
EM SC(B)
n
an
an
So confidence limits for β j − β j 0 are
r
y .j. − y .j 0 . ± t
2M SC(B)
an
(41)
Notice from the formulas (40) and (41) that the “error terms” appropriate
for comparing (or judging differences in) main effects are different for the two
factors. The standard jargon (corresponding to formula (40)) is that the “split
plot error” is used in judging the “split plot treatment” and (corresponding to
37
formula (41)) that the “whole plot error” is used in judging the “whole plot
treatment.” This is consistent with standard ANOVA analyses where
H0
H0
H0
: αi = 0 ∀i is tested using F = M SA/M SE
: β j = 0 ∀j is tested using F = M SB/M SC(B)
: αβ ij = 0 ∀i, j is tested using F = M SAB/M SE
The plausibility of the last of these derives from the fact that
¢ ¡
¢
¡
¢ ¡
¢
¡
y ij. − y ij 0 . − y i0 j. − y i0 j 0 .
= αβ ij − αβ ij 0 − αβ i0 j − αβ i0 j 0 +
( ij. − ij 0 . ) − ( i0 j. − i0 j 0 . )
4 Bootstrap Methods
Much of what has been outlined here (except for the Linear Models material where exact distribution theory exists) has relied on large sample approximations to the distributions of estimators and hopes that when we plug estimates of parameters into formulas for approximate standard deviations we get reliable standard errors. Modern computing has provided an alternative (simulation-based) way of assessing the precision of complicated estimators. This is the “bootstrap” methodology. This section provides a brief introduction.
4.1 Bootstrapping in the iid (Single Sample) Case
Suppose that (possibly vector-valued) random quantities

Y1, Y2, . . . , Yn

are independent, each with marginal distribution F. F here is some (possibly multivariate) distribution. It may be (but need not necessarily be) parameterized by (a possibly vector-valued) θ. (In this case, we might write Fθ.) Suppose that

Tn = some function of Y1, Y2, . . . , Yn
G_n^F = the (F) distribution of Tn
H(G_n^F) = some characteristic of G_n^F of interest
If F is known, evaluation of H(G_n^F) is a probability problem. In simple cases, this may yield to analytical calculations (after the style of Stat 542). When the probability problem is analytically intractable, simulation can often be used. One simulates B samples

Y1*, Y2*, . . . , Yn*        (42)

from F. Each of these is used to compute a simulated value of Tn, say

Tn*        (43)
The “histogram”/empirical distribution of the B simulated values

T*n1, T*n2, . . . , T*nB        (44)

is used to approximate G_n^F. Finally,

H(empirical distribution of {T*n1, T*n2, . . . , T*nB})        (45)

is an approximation for H(G_n^F).
If F is unknown (the problem is one of statistics rather than probability), “bootstrap” approximations to the simulation substitute for probability calculations may be made in at least two different ways. First, in parametric contexts, one might estimate θ with θ̂ and then F with F̂ = F_θ̂. Simulated values (42) are generated not from F, but from F̂ = F_θ̂, leading to a bootstrapped version of Tn, (43). This can be done B times to produce values (44) and the approximation (45) for H(G_n^F). This way of operating is (not surprisingly) called a “parametric bootstrap.”
Second, in nonparametric contexts, one might estimate F with

F̂ = the empirical distribution of Y1, Y2, . . . , Yn

Then, as in the parametric case, simulated values (42) can be generated from F̂. Here, this amounts to random sampling with replacement from {Y1, Y2, . . . , Yn}. A single bootstrap sample (42) then leads to a simulated value of Tn. B such values (44) in turn lead to the approximation (45) for H(G_n^F). Not surprisingly, this methodology is called a “nonparametric bootstrap.”
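The nonparametric bootstrap loop just described can be sketched in a few lines of Python. This is a minimal illustration, not part of the notes: the data values, the choice of the sample median as Tn, and B = 2000 are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_distribution(data, stat, B=2000, seed=0):
    """Approximate the bootstrap distribution of stat(Y1,...,Yn): draw B
    samples of size n with replacement from the data, compute stat on each."""
    rng = random.Random(seed)
    n = len(data)
    return [stat([data[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]

# illustrative data; Tn = sample median
data = [2.1, 3.4, 1.7, 4.0, 2.8, 3.1, 2.5, 3.9, 1.9, 2.2]
boot_vals = bootstrap_distribution(data, statistics.median)
# the empirical distribution of boot_vals approximates the sampling
# distribution of the median, and any characteristic H(.) can be read off it
```

Each element of `boot_vals` is one T*n value; the examples that follow are all functions H(·) applied to such a list.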
There are two basic issues of concern in this kind of “bootstrap” approximation for H(G_n^F). These are:

1. The Monte Carlo error in approximating H(G_n^F̂) with

H(empirical distribution of T*n1, T*n2, . . . , T*nB)

Presumably, this is often handled by making B large, so that

G_n^F̂ ≈ empirical distribution of T*n1, T*n2, . . . , T*nB

and relying on some kind of “continuity” of H(·) (so that if the empirical distribution of T*n1, T*n2, . . . , T*nB is nearly G_n^F̂, then

H(G_n^F̂) ≈ H(empirical distribution of T*n1, T*n2, . . . , T*nB))

2. The estimation error in approximating F with F̂ and then approximating G_n^F with G_n^F̂. Here we have to rely on an adequately large original sample size, n, and hope that

G_n^F̂ ≈ G_n^F        (46)

so that (again assuming that H(·) is continuous)

H(G_n^F̂) ≈ H(G_n^F)

Notice that only the first of these issues can be handled by the (usually cheap) device of increasing B. One remains at the mercy of the fixed original sample to produce (46).
Several applications/examples now follow.
Example 17 (Bootstrap standard error for Tn) There is, of course, interest in producing standard errors for complicated estimators. That is, one often wants to estimate

H(G_n^F) = √(Var_F Tn)        (47)

Based on a set of B bootstrapped values T*n1, T*n2, . . . , T*nB, an estimate of (47) is

√( (1/(B − 1)) Σ_i (T*ni − T̄*n·)² )
Example 18 (Bootstrap estimate of bias of Tn) Suppose that it is θ = η(F) that one hopes to estimate with Tn. The bias of Tn is

Bias_F(Tn) = E_F Tn − η(F)

So a sensible estimate of this bias is

T̄*n· − η(F̂)

And a possible bias-corrected version of Tn is

Tn − (T̄*n· − η(F̂))

Notice that in the case that F̂ is the empirical distribution of Y1, Y2, . . . , Yn and Tn = η(F̂), this bias correction for Tn amounts to using 2Tn − T̄*n· to estimate θ.
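As a concrete (hypothetical) illustration of this bias correction, take Tn to be the plug-in variance with divisor n, whose downward bias the bootstrap should detect; the data and B are assumptions of the example.

```python
import random

def plugin_var(y):
    """Plug-in (divisor n) variance -- a deliberately biased choice of Tn."""
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y) / len(y)

def bias_corrected(data, stat, B=2000, seed=0):
    """Return Tn minus the bootstrap bias estimate Tbar* - eta(Fhat);
    in this plug-in case eta(Fhat) = Tn, so the result is 2*Tn - Tbar*."""
    rng = random.Random(seed)
    n = len(data)
    tn = stat(data)
    boot = [stat([data[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]
    tbar = sum(boot) / B
    return 2 * tn - tbar

data = [2.1, 3.4, 1.7, 4.0, 2.8, 3.1, 2.5, 3.9]
corrected = bias_corrected(data, plugin_var)  # pushed up toward the unbiased version
```

Since the bootstrap mean of the plug-in variance is roughly ((n − 1)/n) Tn, the correction moves the estimate in the right direction.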
Example 19 (Percentile bootstrap confidence intervals) Suppose that a quantity θ = η(F) is of interest and that

Tn = η(the empirical distribution of Y1, Y2, . . . , Yn)
Based on B bootstrapped values T*n1, T*n2, . . . , T*nB, define ordered values

T*n(1) ≤ T*n(2) ≤ · · · ≤ T*n(B)

Adopt the following convention (to locate lower and upper α/2 points for the histogram/empirical distribution of the B bootstrapped values). For

kL = ⌊(α/2)(B + 1)⌋ and kU = (B + 1) − kL

(⌊x⌋ is the largest integer less than or equal to x) the interval

[T*n(kL), T*n(kU)]        (48)

contains (roughly) the “middle (1 − α) fraction of the histogram of bootstrapped values.” This interval is called the (uncorrected) “(1 − α) level bootstrap percentile confidence interval” for θ.
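The order-statistic convention above translates directly into code. In this sketch the ⌊·⌋/ranking convention is the one in the notes; flooring kL at 1 (so the indices stay legal for very small B) is an added safeguard, not part of the notes.

```python
import math

def percentile_interval(boot_vals, alpha=0.05):
    """Uncorrected (1 - alpha) bootstrap percentile interval (48):
    kL = floor((alpha/2)(B+1)), kU = (B+1) - kL, read off the ordered
    bootstrap values using 1-based ranks."""
    B = len(boot_vals)
    srt = sorted(boot_vals)
    kL = max(1, math.floor((alpha / 2) * (B + 1)))
    kU = (B + 1) - kL
    return srt[kL - 1], srt[kU - 1]   # convert 1-based ranks to 0-based indices

# with B = 999 and alpha = 0.05: kL = 25, kU = 975
lo, hi = percentile_interval(list(range(1, 1000)))
```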
The standard argument for why interval (48) might function as a confidence interval for θ is as follows. Suppose that there is an increasing function m(·) such that with

φ = m(θ) = m(η(F))

and

φ̂ = m(Tn) = m(η(the empirical distribution of Y1, Y2, . . . , Yn))

for large n, φ̂ is approximately

N(φ, w²)

Then a confidence interval for φ is

[φ̂ − zw, φ̂ + zw]

and a corresponding confidence interval for θ is

[m⁻¹(φ̂ − zw), m⁻¹(φ̂ + zw)]        (49)

The argument is then that the bootstrap percentile interval (48) for large n and large B approximates this interval (49). The plausibility of an approximate correspondence between (48) and (49) might be argued as follows. Interval (49) is approximately

[m⁻¹(φ − zw), m⁻¹(φ + zw)]
= [m⁻¹(lower α/2 point of the dsn of φ̂), m⁻¹(upper α/2 point of the dsn of φ̂)]
= [m⁻¹(m(lower α/2 point of the Tn dsn)), m⁻¹(m(upper α/2 point of the Tn dsn))]
= [lower α/2 point of the dsn of Tn, upper α/2 point of the dsn of Tn]

and one may hope that interval (48) approximates this last interval. The beauty of the bootstrap argument in this context is that one doesn’t need to know the correct transformation m (or the standard deviation w) in order to apply it.
Example 20 (Bias corrected and accelerated bootstrap confidence intervals (BCa and ABC intervals)) In the set-up of Example 19, the statistical folklore is that for small to moderate samples, the intervals (48) may not have actual coverage probabilities close to nominal. Various modifications/improvements on the basic percentile interval have been suggested. Two such are the “bias corrected and accelerated” (BCa) intervals and the “approximate bias corrected” (ABC) intervals.

The basic idea in the first of these cases is to modify the percentage points of the histogram of bootstrapped values one uses. That is, in place of the intervals (48), the BCa idea employs

[T*n(α1(B+1)), T*n(α2(B+1))]        (50)

for appropriate fractions α1 and α2. (Round α1(B + 1) down and α2(B + 1) up in order to get integers.) For Φ the standard normal cdf, define

z = the upper α/2 point of the standard normal distribution
ẑ0 = Φ⁻¹(the fraction of the T*ni that are smaller than Tn)
Tnǰ = Tn computed dropping the jth observation from the data set
    = η(the empirical distribution of Y1, Y2, . . . , Yj−1, Yj+1, . . . , Yn)
T̄n·̌ = (1/n) Σ_{j=1}^n Tnǰ

and

â = Σ_{j=1}^n (T̄n·̌ − Tnǰ)³ / ( 6 [ Σ_{j=1}^n (T̄n·̌ − Tnǰ)² ]^{3/2} )

The BCa method (50) then uses

α1 = Φ( ẑ0 + (ẑ0 − z)/(1 − â(ẑ0 − z)) ) and α2 = Φ( ẑ0 + (ẑ0 + z)/(1 − â(ẑ0 + z)) )

The quantity ẑ0 is commonly called “a measure of median bias of Tn in normal units” and is 0 when half of the bootstrapped values are smaller than Tn. The quantity â is commonly called an “acceleration factor” and is intended to correct the percentile intervals for the fact that the standard deviation of Tn may not be constant in θ.

Evaluation of the BCa intervals may be computationally burdensome when n is large. There is an analytical approximation to these, known as the ABC intervals, that requires less computation time. Both the BCa and ABC intervals are generally expected to do a better job of holding their nominal confidence levels than the uncorrected percentile bootstrap intervals.
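A sketch of the BCa fractions α1 and α2 in Python. Here `boot_vals` are the T*ni and `jack_vals` are the leave-one-out Tnǰ, which the caller must supply; the α = 0.05 default is an assumption of the example.

```python
import math
from statistics import NormalDist

def bca_fractions(tn, boot_vals, jack_vals, alpha=0.05):
    """Compute (alpha1, alpha2) for the BCa interval (50)."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)                 # upper alpha/2 normal point
    # z0hat: median-bias measure in normal units
    frac = sum(t < tn for t in boot_vals) / len(boot_vals)
    z0 = nd.inv_cdf(frac)
    # ahat: acceleration factor built from the jackknife values
    jbar = sum(jack_vals) / len(jack_vals)
    num = sum((jbar - t) ** 3 for t in jack_vals)
    den = 6.0 * sum((jbar - t) ** 2 for t in jack_vals) ** 1.5
    a = num / den
    a1 = nd.cdf(z0 + (z0 - z) / (1.0 - a * (z0 - z)))
    a2 = nd.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))
    return a1, a2
```

When ẑ0 = 0 and â = 0 (no median bias, no acceleration) these reduce to α/2 and 1 − α/2, i.e. the uncorrected percentile interval.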
4.2 Bootstrapping in Nonlinear Models
It is possible to extend the bootstrap idea much beyond the simple context of iid (single sample) models. Here we consider its application in the nonlinear models context of Section 2. That is, suppose that as in equation (15)

yi = f(xi, β) + εi for i = 1, 2, . . . , n

and Tn is some statistic of interest computed for the (xi, yi) data. (Tn might, for example, be the jth coordinate of bOLS.) There are at least two commonly suggested methods for using the bootstrap idea in this context. These are bootstrapping (x, y) pairs and bootstrapping residuals.
When bootstrapping (x, y) pairs, one makes up a bootstrap sample

(x, y)*1, (x, y)*2, . . . , (x, y)*n

by sampling with replacement from

{(x1, y1), (x2, y2), . . . , (xn, yn)}

For each of B such bootstrap samples, one computes a value T*n and proceeds as in the iid/single sample context. Indeed, the clearest rationale for this method is in those cases where it makes sense to think of (x1, y1), (x2, y2), . . . , (xn, yn) as iid from some joint distribution F for (x, y).
To bootstrap residuals, one computes

ei = yi − f(xi, β̂) for i = 1, 2, . . . , n

and makes up bootstrap samples

(x1, f(x1, β̂) + e*1), (x2, f(x2, β̂) + e*2), . . . , (xn, f(xn, β̂) + e*n)        (51)

by sampling from

{e1, e2, . . . , en}

with replacement to get the e*i values.
One might improve this second plan somewhat by taking account of the fact that residuals don’t have constant variance. For example, in the Gauss-Markov linear model one has Var ei = (1 − hii)σ² and one might think of sampling not residuals, but rather analogues of the values

e1/√(1 − h11), e2/√(1 − h22), . . . , en/√(1 − hnn)

for addition to the f(xi, β̂). (Or, even going a step further, one might take account of the fact that analogues of the ei/√(1 − hii) don’t (sum or) average to zero, and sample from values created by subtracting from the residuals adjusted for non-constant variance the arithmetic mean of those.)
43
The statistical folklore says that one must be very careful that the constant
variance assumption makes sense before putting much faith in the method (51).
Where there is a trend in the error variance, the method of bootstrapping residuals will produce synthetic data sets that are substantially unlike the original
one. In such cases, it is unreasonable to expect the bootstrap methodology to
produce reliable inferences.
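For a case where the constant variance assumption is plausible, here is a sketch of bootstrapping residuals for a simple linear (rather than nonlinear) model. The centering step reflects the mean-adjustment suggested above; the √(1 − hii) rescaling is omitted for brevity, and the data are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_line(x, y):
    """Intercept and slope by ordinary least squares."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

def bootstrap_residuals(x, y, B=1000):
    """Build B synthetic data sets (x_i, fitted_i + e*_i) by resampling
    centered residuals, refitting each; returns a B x 2 coefficient array."""
    b = ols_line(x, y)
    fitted = b[0] + b[1] * x
    e = y - fitted
    e = e - e.mean()                    # center the residuals
    out = []
    for _ in range(B):
        estar = rng.choice(e, size=len(e), replace=True)
        out.append(ols_line(x, fitted + estar))
    return np.array(out)

x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=20)
boot = bootstrap_residuals(x, y)
se_slope = boot[:, 1].std(ddof=1)       # bootstrap standard error for the slope
```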
5 Generalized Linear Models
Another kind of generalization of the Linear Model is meant to allow “regression type” analysis for even discrete response data. It is built on the assumption that an individual response/observation, y, has pdf or pmf

f(y|θ, φ) = exp[ (yθ − b(θ))/a(φ) − c(y, φ) ]        (52)

for some functions a(·), b(·), c(·, ·), a “canonical parameter” θ, and a “dispersion parameter” φ. (We will soon let θ depend upon a k-vector of explanatory variables x. But for the moment, we need to review some properties of the exponential family (52).) Notable distributions that can be written in the form (52) include the normal, Poisson, binomial, gamma and inverse normal distributions.
General statistical theory (whose validity extends beyond the exponential family) implies that

E_{θ0,φ0} [ (∂/∂θ) log f(y|θ, φ) |_(θ0,φ0) ] = 0

and that

Var_{θ0,φ0}( (∂/∂θ) log f(y|θ, φ) |_(θ0,φ0) ) = −E_{θ0,φ0} [ (∂²/∂θ²) log f(y|θ, φ) |_(θ0,φ0) ]

Here these facts imply that

µ = Ey = b′(θ)        (53)

and that

Vary = a(φ) b″(θ)        (54)

From (53) one has

θ = b′⁻¹(µ)

If we write

v(µ) = b″(b′⁻¹(µ))

we have

Vary = a(φ) v(µ)

and unless v is constant, Vary varies with µ.
The heart of a generalized linear model is then the decision to model some function of µ = Ey as linear in x. That is, for some so-called “link function” h(·) and some k-vector of parameters β, we suppose that

h(µ) = x′β        (55)

Immediately from (55) one has then assumed that

µ = h⁻¹(x′β)        (56)

Then for n fixed vectors of predictors xi, we suppose that y1, y2, . . . , yn are independent with Eyi = µi satisfying

h(µi) = x′iβ

(Probably to avoid ambiguities, we should also assume that the n × k matrix X with rows x′1, x′2, . . . , x′n is of full rank.) The likelihood function for this model is then

f(Y|β, φ) = Π_{i=1}^n exp[ (yiθi − b(θi))/a(φ) − c(yi, φ) ]

where

θi = b′⁻¹(µi) = b′⁻¹(h⁻¹(x′iβ))

and inference proceeds based on large sample theory for maximum likelihood estimation and likelihood ratio testing.
5.1 Inference for the Generalized Linear Model Based on Large Sample Theory for MLEs
If (β̂, φ̂) is a maximum likelihood estimate of the parameter vector (β, φ) in the generalized linear model, it must satisfy

(∂/∂βj) log f(Y|β, φ) |_(β̂,φ̂) = 0 ∀j

As it turns out, these k equations can be written as

0 = Σ_{i=1}^n [ xij / ( v(h⁻¹(x′iβ)) h′(h⁻¹(x′iβ)) ) ] [ yi − h⁻¹(x′iβ) ] ∀j        (57)

which doesn’t involve the dispersion parameter or the form of the function a. Solving the equations (57) for an estimate β̂ is a numerical problem handled by various clever algorithms.
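One such algorithm is Newton-Raphson/Fisher scoring. As an illustrative sketch (not the notes’ own implementation), here it is for the Bernoulli/logit case, where the canonical link makes v(µ)h′(µ) = 1, so (57) reduces to X′(y − µ) = 0. The data set is a made-up example.

```python
import numpy as np

def fit_logistic(X, y, steps=25):
    """Fisher scoring for the logistic-GLM score equations
    sum_i x_ij (y_i - mu_i) = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))      # mu_i = h^{-1}(x_i' beta)
        W = mu * (1.0 - mu)                        # v(mu_i) for Bernoulli
        score = X.T @ (y - mu)
        info = X.T @ (X * W[:, None])              # X' W X
        beta = beta + np.linalg.solve(info, score)
    return beta

# tiny illustrative data set (not separated, so the MLE is finite)
X = np.column_stack([np.ones(6), np.array([-2.0, -1.0, 0.0, 0.0, 1.0, 2.0])])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
beta_hat = fit_logistic(X, y)
```

At convergence the score vector is numerically zero, i.e. the equations (57) are solved.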
If β̂ solves the equations (57), standard large sample theory suggests that for large n, β̂ is approximately

MVNk( β, a(φ)(X′W(β)X)⁻¹ )

for

W(β) = diag( 1/[v(h⁻¹(x′1β))(h′(h⁻¹(x′1β)))²], . . . , 1/[v(h⁻¹(x′nβ))(h′(h⁻¹(x′nβ)))²] )

So, for example, if

Q = â(φ) (X′W(β̂)X)⁻¹

and qj is the jth diagonal entry of Q, approximate confidence limits for βj are

β̂j ± z√qj

and more generally, approximate confidence limits for c′β are

c′β̂ ± z√(c′Qc)

(which, for example, gives limits for the linear functions x′iβ, which can then be transformed to limits for the µi = h⁻¹(x′iβ)).
In the event that one is working in a model where a(·) is not constant, it is possible to do maximum likelihood for φ, that is, to try to solve

(∂/∂φ) log f(Y|β̂, φ) = 0

for φ. This is

Σ_{i=1}^n [ (a′(φ)/a(φ)²)(yiθ̂i − b(θ̂i)) + (∂/∂φ) c(yi, φ) ] = 0

for θ̂i = b′⁻¹(µ̂i) = b′⁻¹(h⁻¹(x′iβ̂)). However, it appears that it is more common to simply estimate a(φ) directly as

â(φ) = (1/(n − k)) Σ_{i=1}^n (yi − µ̂i)²/v(µ̂i)

for µ̂i = h⁻¹(x′iβ̂).
5.2 Inference for the Generalized Linear Model Based on Large Sample Theory for Likelihood Ratio Tests/Shape of the Likelihood Function
The standard large sample theory for likelihood ratio tests/shape of the likelihood function can be applied here. Let

l(β, φ) = log f(Y|β, φ)

With φ̂(β) a maximizer of l(β, ·), the set

{ β | l(β̂, φ̂) − l(β, φ̂(β)) ≤ (1/2)χ²k }

is an approximate confidence set for β. Similarly, with βǰ the (k − 1)-vector with βj deleted, and (β̂ǰ(βj), φ̂(βj)) a maximizer of l subject to the given value of βj, the set

{ βj | l(β̂, φ̂) − l(β̂ǰ(βj), φ̂(βj)) ≤ (1/2)χ²1 }

functions as an approximate confidence set for βj. And in those cases where a(·) is not constant and one needs an inference method for φ, the set

{ φ | l(β̂, φ̂) − l(β̂(φ), φ) ≤ (1/2)χ²1 }

is an approximate confidence set for φ.
5.3 Common GLMs
In Generalized Linear Model parlance, the link function that makes θ = x′β is called the “canonical link function.” This is h = b′⁻¹. Usually, statistical software for GLMs defaults to the canonical link function unless a user specifies some other option. Here we simply list the three most common GLMs, the canonical link function and common alternatives.

The Gauss-Markov normal linear model is the most famous GLM. Here θ = µ = Ey, the canonical link function is the identity, h(µ) = µ, and this makes µ = x′β.

Poisson responses make a second standard GLM. Here, µ = Ey = b′(θ) = exp(θ), so that θ = log µ. The canonical link is the log function, h(µ) = log µ, and

µ = exp(x′β)

This is the “log-linear model.” Other possibilities for links are the identity h(µ) = µ (giving µ = x′β) and the square root link, h(µ) = √µ (giving µ = (x′β)²).
Bernoulli or binomial responses constitute a third important case. Here, binomial (m, p) responses have µ = Ey = mp. For Bernoulli (binomial with m = 1) cases the canonical link is the logit function

h(µ) = log(µ/(1 − µ)) = log(p/(1 − p))

from which

p = exp(x′β)/(1 + exp(x′β))

This is so-called “logistic regression.” And if one wants to think not about individual Bernoulli outcomes, but rather the numbers of “successes” in m trials (binomial observations), the corresponding canonical link is h(µ) = log((µ/m)/(1 − (µ/m))).

Other possible links for the Bernoulli/binomial case are the “probit” and “complementary log log” links. The first of these is

h(µ) = Φ⁻¹(p)

which gives p = Φ(x′β) and the case of “probit analysis.” The second is

h(µ) = log(−log(1 − p))

which gives p = 1 − exp(−exp(x′β)).
6 Smoothing Methods
There are a variety of methods for fitting “curves” through n data pairs
(xi , yi )
that are more flexible than, for example, polynomial regression based on the
linear models material of Section 1 or the fitting of some parametric nonlinear
function y = f (x, β) via nonlinear least squares as in Section 2. Here we briefly
describe a few of these methods. (A good source for material on this topic is
the book Generalized Additive Models by Hastie and Tibshirani.)
Bin smoothers partition the part of the x axis of interest into disjoint intervals (bins) and use, as a “smoothed” y value at x,

ŷ(x) = “operation O” applied to those yi whose xi are in the same bin as x

Here “operation O” can be “median,” “mean,” “linear regression,” or yet something else. A choice of “median” or “mean” will give the same value across an entire bin, and therefore produce a step function for ŷ(x).
Running smoothers, rather than applying a fixed finite set of bins across the interval of values of x of interest, “use a different bin for each x,” with bin defined in terms of “k nearest neighbors.” (So the bin length changes with x, but the number of points considered in producing ŷ(x) does not.) Symmetric versions apply “operation O” to the k/2 values yi whose xi are less than or equal to x and closest to x, together with the k/2 values yi whose xi are greater than or equal to x and closest to x. Possibly non-symmetric versions simply use the k values yi whose xi are closest to x. Where “operation O” is (simple linear) regression, this is a kind of local regression that treats all points in a neighborhood of x equally. By virtue of the way least squares works, this gives neighbors furthest from x the greatest influence on ŷ(x). And like bin smoothers, these running smoothers will typically produce fits that have discontinuities.
Locally weighted running (polynomial regression) smoothers can be thought of as addressing deficiencies of the running regression smoother. For a given x, one

1. identifies the k values xi closest to x (call this set Nk(x))

2. computes the distance from x to the element of Nk(x) furthest from x,

Δk(x) = max_{xi ∈ Nk(x)} |x − xi|

3. assigns weights to elements of Nk(x) as

wi = weight on xi = W(|x − xi|/Δk(x))

for some appropriate “weight function” W: [0, 1] → [0, ∞)

4. fits a polynomial regression using weighted least squares (For example, in the linear case one chooses b0,x and b1,x to minimize

Σ_{i s.t. xi ∈ Nk(x)} wi (yi − (b0,x + b1,x xi))²

There are simple closed forms for such b0,x and b1,x in terms of the xi, yi and wi.)

5. uses

ŷ(x) = the value at x of the polynomial fit at x

The size of k makes a difference in how smooth such a ŷ(x) will be (the larger is k, the smoother the fit). It is sometimes useful to write k = λn, so that the parameter λ stands for the fraction of the data set used to do the fitting at any x. And a standard choice of weight function is the “tricube” function

W(u) = (1 − u³)³ for 0 ≤ u ≤ 1, and W(u) = 0 otherwise

When this is applied with simple linear regression, the smoother is typically called the loess smoother.
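Steps 1-5 can be sketched directly. This is an illustrative local linear version with tricube weights; the choice of k and the test data in any use of it are up to the caller.

```python
import numpy as np

def tricube(u):
    """W(u) = (1 - u^3)^3 on [0, 1], 0 outside."""
    u = np.abs(u)
    return np.where(u <= 1.0, (1.0 - u ** 3) ** 3, 0.0)

def loess_point(x0, x, y, k):
    """Locally weighted linear fit at x0 using the k nearest x's."""
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]            # N_k(x0)
    delta = d[idx].max()               # Delta_k(x0)
    w = tricube(d[idx] / delta)
    X = np.column_stack([np.ones(k), x[idx]])
    # weighted least squares: minimize sum_i w_i (y_i - b0 - b1 x_i)^2
    WX = X * w[:, None]
    b0, b1 = np.linalg.solve(X.T @ WX, WX.T @ y[idx])
    return b0 + b1 * x0
```

Evaluating `loess_point` over a grid of x0 values traces out ŷ(x).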
Kernel smoothers are locally weighted means with weights coming from a “kernel” function. A kernel K(t) ≥ 0 is smooth, possibly symmetric about 0, and decreases as t moves away from 0. For example, one might use the standard normal density, φ(t), as a kernel function. One defines weights at x by

wi(x) = K((xi − x)/b)

for b > 0 a “bandwidth” parameter. The kernel smoother is then

ŷ(x) = Σ yi wi(x) / Σ wi(x)

The choice of the Gaussian kernel is popular, as are the choices

K(t) = (3/4)(1 − t²) for −1 ≤ t ≤ 1, and K(t) = 0 otherwise

and

K(t) = (3/8)(3 − 5t²) for −1 ≤ t ≤ 1, and K(t) = 0 otherwise

(Note that this last choice can produce negative weights.)
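A Gaussian-kernel version takes a couple of lines (the normalizing constant of the normal density cancels in the ratio, so it is dropped; the default bandwidth b is an assumption of the example):

```python
import numpy as np

def kernel_smooth(x0, x, y, b=0.5):
    """Locally weighted mean yhat(x0) = sum y_i w_i(x0) / sum w_i(x0)
    with Gaussian weights w_i(x0) = K((x_i - x0)/b)."""
    w = np.exp(-0.5 * ((x - x0) / b) ** 2)
    return float(np.sum(y * w) / np.sum(w))
```

Small b lets ŷ(x) track the data closely; large b gives a flatter, smoother fit.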
The idea behind polynomial regression splines is to fit a function via least squares that is piecewise a polynomial and is smooth (has a couple of continuous derivatives). That is, if smoothing over the interval [a, b] is desired, one picks positions

a ≤ ξ1 ≤ ξ2 ≤ · · · ≤ ξk ≤ b

for k “knots” (points at which k + 1 polynomials will be “tied together”). Then, for example, in the cubic case one fits by OLS the relationship

y ≈ β0 + β1x + β2x² + β3x³ + Σ_{j=1}^k θj (x − ξj)³₊

where

(x − ξj)³₊ = (x − ξj)³ if x ≥ ξj, and 0 otherwise

This can be done using a regression program (there are k + 4 regression parameters to fit). The fitted function is cubic on each interval [a, ξ1], [ξ1, ξ2], . . . , [ξk−1, ξk], [ξk, b] with 2 continuous derivatives (and therefore smoothness) at the knots.
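The truncated power basis above is easy to build, after which any OLS routine fits the spline. The knot positions and test function below are assumptions of the example.

```python
import numpy as np

def spline_basis(x, knots):
    """Columns 1, x, x^2, x^3 and (x - xi_j)_+^3: k + 4 parameters."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    for xi in knots:
        cols.append(np.where(x >= xi, (x - xi) ** 3, 0.0))
    return np.column_stack(cols)

def fit_spline(x, y, knots):
    """OLS fit of the cubic regression spline; returns the k + 4 coefficients."""
    return np.linalg.lstsq(spline_basis(x, knots), y, rcond=None)[0]

x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x)
knots = [0.25, 0.5, 0.75]
coef = fit_spline(x, y, knots)              # length k + 4 = 7
yhat = spline_basis(x, knots) @ coef        # fitted values
```

Because the basis contains all cubics, any globally cubic y is reproduced exactly, knots or not.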
Finally (in this short list of possible methods) are the cubic smoothing splines. It turns out to be possible to solve the optimization problem of minimizing

Σ_i (yi − f(xi))² + λ ∫_a^b (f″(t))² dt        (58)

over choices of function f with 2 continuous derivatives. In criterion (58), the sum is a penalty for “missing” the data values with the function, the integral is a penalty for f wiggling (a linear function would have second derivative 0 and incur no penalty here), and λ weights the two criteria against each other (and is sometimes called a “stiffness” parameter). Even such widely available and user-friendly software as JMP provides such a smoother.
Theory for choosing amongst the possible smoothers and making good choices
of parameters (like bandwidth, stiffness, knots, kernel, etc.), and doing approximate inference is a whole additional course.
7 Appendix
Here we collect some general results that are used repeatedly in Stat 511.
7.1 Some Useful Facts About Multivariate Distributions (in Particular Multivariate Normal Distributions)
Here are some important facts about multivariate distributions in general and
multivariate normal distributions specifically.
1. If a random vector X (k × 1) has mean vector µ (k × 1) and covariance matrix Σ (k × k), then Y = BX + d (for B l × k and d l × 1) has mean vector EY = Bµ + d and covariance matrix VarY = BΣB′.

2. The MVN distribution is most usefully defined as the distribution of X (k × 1) = AZ + µ, for A a k × p matrix of constants and Z a p × 1 vector of independent standard normal random variables. Such a random vector has mean vector µ and covariance matrix Σ = AA′. (This definition turns out to be unambiguous. Any dimension p and any matrix A giving a particular Σ end up producing the same k-dimensional joint distribution.)

3. If X (k × 1) is multivariate normal, so is Y = BX + d (l × 1).
4. If X is MVNk (µ, Σ), its individual marginal distributions are univariate
normal. Further, any sub-vector of dimension l < k is multivariate normal
(with mean vector the appropriate sub-vector of µ and covariance matrix
the appropriate sub-matrix of Σ).
5. If X is MVNk(µ1, Σ11) and independent of Y, which is MVNl(µ2, Σ22), then the stacked vector (X′, Y′)′ is MVNk+l with mean vector (µ1′, µ2′)′ and block diagonal covariance matrix

[ Σ11  0  ]
[ 0   Σ22 ]

(the 0 blocks being k × l and l × k matrices of zeros).
6. For non-singular Σ, the MVNk(µ, Σ) distribution has a (joint) pdf on k-dimensional space given by

fX(x) = (2π)^(−k/2) |det Σ|^(−1/2) exp( −(1/2)(x − µ)′Σ⁻¹(x − µ) )
7. The joint pdf given in 6 above can be studied and conditional distributions (given values for part of the X vector) identified. For X (k × 1) partitioned into X1 (l × 1) and X2 ((k − l) × 1), with µ and Σ partitioned conformably as

X = (X1′, X2′)′ ∼ MVNk( (µ1′, µ2′)′ , [ Σ11 Σ12 ; Σ21 Σ22 ] )

the conditional distribution of X1 given that X2 = x2 is

X1 | X2 = x2 ∼ MVNl( µ1 + Σ12Σ22⁻¹(x2 − µ2), Σ11 − Σ12Σ22⁻¹Σ21 )

8. If all correlations between two parts of a MVN vector are equal to 0, then those parts of the vector are independent.
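Fact 7 in code form, with a small numerical check (the bivariate numbers are illustrative, and Σ21 = Σ12′ is used):

```python
import numpy as np

def mvn_conditional(mu1, mu2, S11, S12, S22, x2):
    """Mean and covariance of X1 | X2 = x2 for a partitioned MVN (fact 7)."""
    A = S12 @ np.linalg.inv(S22)            # Sigma12 Sigma22^{-1}
    return mu1 + A @ (x2 - mu2), S11 - A @ S12.T

# bivariate example with Var(X1) = Var(X2) = 1, Cov = 0.5
m, C = mvn_conditional(np.zeros(1), np.zeros(1),
                       np.array([[1.0]]), np.array([[0.5]]),
                       np.array([[1.0]]), np.array([2.0]))
# conditional mean 0.5 * 2 = 1, conditional variance 1 - 0.25 = 0.75
```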
7.2 The Multivariate Delta Method
Here we present the general “delta method” or “propagation of error” approximation that stands behind large sample standard deviations used throughout these notes. Suppose that X is k-variate with mean µ and variance-covariance matrix Σ. If I concern myself with p (possibly nonlinear) functions g1, g2, ..., gp (each mapping Rk to R) and define

Y = (g1(X), g2(X), . . . , gp(X))′

a multivariate Taylor’s Theorem argument and fact 1 in Section 7.1 provide an approximate mean vector and an approximate covariance matrix for Y. That is, if the functions gi are differentiable, let D be the p × k matrix of partial derivatives

D = ( ∂gi/∂xj evaluated at (µ1, µ2, . . . , µk) )
A multivariate Taylor approximation says that for x near µ

y = (g1(x), g2(x), . . . , gp(x))′ ≈ (g1(µ), g2(µ), . . . , gp(µ))′ + D(x − µ)

So if the variances of the Xi are small (so that with high probability X is near µ, that is, that the linear approximation above is usually valid) it is plausible that Y has mean vector

(EY1, EY2, . . . , EYp)′ ≈ (g1(µ), g2(µ), . . . , gp(µ))′

and variance-covariance matrix

VarY ≈ DΣD′
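A numerical sketch of the approximation (the ratio g(x1, x2) = x1/x2 and all the numbers are illustrative assumptions):

```python
import numpy as np

def delta_method(g, jac, mu, Sigma):
    """Propagation of error: EY ~ g(mu), VarY ~ D Sigma D' with D the
    p x k Jacobian of the g's at mu (supplied by the caller as jac)."""
    D = np.atleast_2d(jac(mu))
    return g(mu), D @ Sigma @ D.T

# p = 1 function of k = 2 variables: g(x1, x2) = x1 / x2
g = lambda m: m[0] / m[1]
jac = lambda m: np.array([1.0 / m[1], -m[0] / m[1] ** 2])
mu = np.array([2.0, 4.0])
Sigma = np.diag([0.01, 0.04])
mean_approx, var_approx = delta_method(g, jac, mu, Sigma)
# mean ~ 0.5; variance ~ (0.25^2)(0.01) + (0.125^2)(0.04) = 0.00125
```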
7.3 Large Sample Inference Methods
Here is a summary of two variants of large sample theory that are used extensively in Stat 511 to produce approximate inference methods where no exact
methods are available. These are theories of the behavior of maximum likelihood estimators (actually, of solutions to the likelihood equation(s)) and of the
shape of the likelihood function/behavior of likelihood ratio tests.
7.3.1 Large n Theory for Maximum Likelihood Estimation
Let θ be a generic r × 1 parameter and let l(θ) stand for the natural log of the likelihood function. The likelihood equations are the r equations

(∂/∂θi) l(θ) = 0

Suppose that θ̂ is an estimator of θ that is “consistent for θ” (for large samples “is guaranteed to be close to θ”) and solves the likelihood equations, i.e. has

(∂/∂θi) l(θ) |_(θ=θ̂) = 0 ∀i

Typically, maximum likelihood estimators have these properties (they solve the likelihood equations and are consistent). There is general statistical theory that then suggests that for large samples

1. θ̂ is approximately MVNr

2. Eθ̂ ≈ θ

3. an estimated variance-covariance matrix for θ̂ may be obtained from second partials of l(θ) evaluated at θ̂. That is, one may use the r × r matrix

Vâr(θ̂) = ( − (∂²/∂θi∂θj) l(θ) |_(θ=θ̂) )⁻¹        (59)
These are very important facts, and for example allow one to make approximate confidence limits for the entries of θ. If di is the ith diagonal entry of the estimated variance-covariance matrix (59), θi is the ith entry of θ, and θ̂i is the ith entry of θ̂, approximate confidence limits for θi are

θ̂i ± z√di

for z an appropriate standard normal quantile.
7.3.2 Large n Theory for Inference Based on the Shape of the Likelihood Function/Likelihood Ratio Testing
There is a second general approach (beyond maximum likelihood estimation) that is applied to produce (large sample) inferences in Stat 511 (in Sections 2, 3 and 5, where no exact theory is available). That is the large sample theory of likelihood ratio testing and its inversion to produce confidence sets.

Again let θ be a generic r × 1 parameter vector that we now partition as

θ = (θ1′, θ2′)′ with θ1 of dimension p × 1 and θ2 of dimension (r − p) × 1

(We may take p = r and θ = θ1 here.) Further, again let l(θ) stand for the natural log of the likelihood function. There is a general large sample theory of likelihood ratio testing/shape of l(θ) that suggests that if θ1 = θ1⁰, then

2( max_θ l(θ) − max_{θ with θ1 = θ1⁰} l(θ) ) is approximately χ²p

Now if for each θ1, θ̂2(θ1) maximizes l(θ1, θ2) over choices of θ2, this may be rewritten as

2( l(θ̂MLE) − l(θ1⁰, θ̂2(θ1⁰)) ) is approximately χ²p

This leads to the testing of H0: θ1 = θ1⁰ using the statistic

2( l(θ̂MLE) − l(θ1⁰, θ̂2(θ1⁰)) )

and a χ²p reference distribution.

A (1 − α) level confidence set for θ1 can be made by assembling all of those θ1⁰’s for which the corresponding α level test of H0: θ1 = θ1⁰ does not reject. This amounts to

{ θ1 | l(θ1, θ̂2(θ1)) > l(θ̂MLE) − (1/2)χ²p }        (60)

where χ²p is the upper α point. The function

l*(θ1) = l(θ1, θ̂2(θ1))

is the loglikelihood maximized out over θ2 and is called the “profile loglikelihood function” for θ1. The formula (60) is a very important one and says that an approximate confidence set for θ1 consists of all those vectors whose values of the profile loglikelihood are within (1/2)χ²p of the maximum loglikelihood. Statistical folklore says that such confidence sets typically do better in terms of holding their nominal confidence levels than the kind of methods outlined in the previous section.
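A toy illustration of (60) for p = r = 1 (so there is no nuisance parameter to profile out): an exponential rate λ, a grid search for the set, and the upper 5% point of χ²1 (≈ 3.8415) hard-coded. All of the numbers here are assumptions of the example, not part of the notes.

```python
import math

def profile_ci_exponential(y, chi2_pt=3.8415):
    """{lambda : l(lambda) >= l(lambda_hat) - chi2_pt/2} by grid search,
    for the exponential log likelihood l(lam) = n log(lam) - lam * sum(y)."""
    n, s = len(y), sum(y)
    loglik = lambda lam: n * math.log(lam) - lam * s
    lam_hat = n / s                                   # the MLE
    cutoff = loglik(lam_hat) - 0.5 * chi2_pt
    grid = [lam_hat * (0.02 + 0.01 * i) for i in range(500)]
    inside = [lam for lam in grid if loglik(lam) >= cutoff]
    return min(inside), max(inside)

lo, hi = profile_ci_exponential([1.0] * 10)   # MLE is 1; note the asymmetry of the set
```

Unlike the Wald limits θ̂i ± z√di of the previous subsection, this set need not be symmetric about the MLE.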