Chapter 7
Estimation
Criteria for Estimators
• Main problem in statistics: estimation of population parameters, say θ.
• Recall that an estimator is a statistic; that is, any function of the
observations {xi} with values in the parameter space.
• There are many estimators of θ. Question: Which is better?
• Criteria for Estimators
(1) Unbiasedness
(2) Efficiency
(3) Sufficiency
(4) Consistency
Unbiasedness
Definition: Unbiasedness
An unbiased estimator, say θ^, has an expected value that is equal to the
value of the population parameter being estimated, say θ. That is,
E[θ^] = θ
Example:
E[x̄] = µ
E[s2] = σ2
Efficiency
Definition: Efficiency / Mean squared error
An estimator is efficient if it estimates the parameter of interest in
some best way. The notion of "best way" relies upon the choice of a
loss function. The usual choice of loss function is the quadratic, ℓ(e) = e2,
resulting in the mean squared error (MSE) criterion of optimality:
MSE = E[(θ^ − θ)2] = E[(θ^ − E(θ^)) + (E(θ^) − θ)]2 = Var(θ^) + [b(θ)]2
where b(θ) = E[θ^] − θ is the bias of θ^.
The MSE is the sum of the variance and the square of the bias
=> trade-off: a biased estimator can have a lower MSE than an
unbiased estimator.
Note: The most efficient estimator among a group of unbiased
estimators is the one with the smallest variance => BUE (best
unbiased estimator).
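The bias-variance trade-off is easy to see by simulation. A minimal sketch (not from the slides; the seed, sample size, and true σ2 are arbitrary illustrative choices) compares the unbiased sample variance (divisor n−1) with the biased divisor-n estimator, which trades a small bias for less variance and ends up with the lower MSE here:

```python
import numpy as np

# Compare MSE of the unbiased variance estimator (divisor n-1) with the
# biased divisor-n estimator on simulated normal data.
rng = np.random.default_rng(42)
sigma2, n, reps = 4.0, 20, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2_unbiased = x.var(axis=1, ddof=1)   # E[s2] = sigma2
s2_biased = x.var(axis=1, ddof=0)     # E[.] = (n-1)/n * sigma2

def mse(est):
    return np.mean((est - sigma2) ** 2)

print("MSE unbiased:", mse(s2_unbiased))
print("MSE biased  :", mse(s2_biased))   # smaller, despite the bias
```
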
Efficiency
Now we can compare estimators and select the "best" one.
Example: Three different estimators' distributions, 1, 2, 3, based on
samples of the same size.
[Figure: sampling distributions of estimators 1, 2, and 3, plotted
against the value of the estimator, with θ marked on the horizontal axis.]
- 1 and 2: expected value = population parameter (unbiased)
- 3: positively biased
- Variance decreases from 1, to 2, to 3 (3 is the smallest)
- 3 can have the smallest MSE. 2 is more efficient than 1.
Relative Efficiency
It is difficult to prove that an estimator is the best among all estimators,
so a relative concept is usually used.
Definition: Relative efficiency
Relative Efficiency = Variance of first estimator / Variance of second estimator
Example: Sample mean vs. sample median
Variance of sample mean = σ2/n
Variance of sample median = πσ2/2n
Var[median]/Var[mean] = (πσ2/2n) / (σ2/n) = π/2 ≈ 1.57
The sample median is 1.57 times less efficient than the sample mean.
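A quick Monte Carlo check of the π/2 ratio (an added sketch; the seed and sizes are arbitrary choices, and the finite-sample ratio is only approximately π/2):

```python
import numpy as np

# Relative efficiency of the median vs. the mean for normal samples.
rng = np.random.default_rng(0)
n, reps = 100, 50_000

x = rng.normal(0.0, 1.0, size=(reps, n))
var_mean = x.mean(axis=1).var()
var_median = np.median(x, axis=1).var()

ratio = var_median / var_mean
print("Var[median]/Var[mean] =", ratio)   # close to pi/2 = 1.57
```
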
Asymptotic Efficiency
• We compare two sample statistics in terms of their variances. The
statistic with the smallest variance is called efficient.
• When we look at asymptotic efficiency, we look at the asymptotic
variance of two statistics as n grows. Note that if we compare two
consistent estimators, both variances eventually go to zero.
Example: Random sampling from the normal distribution
• Sample mean is asymptotically normal[μ,σ2/n]
• Median is asymptotically normal [μ,(π/2)σ2/n]
• Mean is asymptotically more efficient
Sufficiency
• Definition: Sufficiency
A statistic is sufficient when no other statistic, which can be calculated
from the same sample, provides any additional information as to the
value of the parameter of interest.
Equivalently, we say that conditional on the value of a sufficient
statistic for a parameter, the joint probability distribution of the data
does not depend on that parameter. That is, if
P(X=x|T(X)=t, θ) = P(X=x|T(X)=t)
we say that T is a sufficient statistic.
• The sufficient statistic contains all the information needed to
estimate the population parameter. It is OK to ‘get rid’ of the
original data, while keeping only the value of the sufficient statistic.
Sufficiency
• Visualize sufficiency: Consider a Markov chain θ → T(X1, . . . ,Xn) →
{X1, . . . ,Xn} (although in classical statistics θ is not a RV). Conditioned
on the middle part of the chain, the front and back are independent.
Theorem
Let p(x,θ) be the pdf of X and q(t,θ) be the pdf of T(X). Then, T(X) is
a sufficient statistic for θ if, for every x in the sample space, the ratio
p(x|θ) / q(t|θ)
is constant as a function of θ.
Example: Normal sufficient statistic:
Let X1, X2, … Xn be iid N(μ,σ2) where the variance is known. The
sample mean, x , is the sufficient statistic for μ.
Proof: Let's start with the joint distribution function:
f(x|µ) = ∏i=1n (1/√(2πσ2)) exp[−(xi − µ)2/(2σ2)]
       = (2πσ2)-n/2 exp[−Σi=1n (xi − µ)2/(2σ2)]
• Next, add and subtract the sample mean:
f(x|µ) = (2πσ2)-n/2 exp[−Σi=1n (xi − x̄ + x̄ − µ)2/(2σ2)]
       = (2πσ2)-n/2 exp[−(Σi=1n (xi − x̄)2 + n(x̄ − µ)2)/(2σ2)]
• Recall that the distribution of the sample mean is
q(T(X)|µ) = (2πσ2/n)-1/2 exp[−n(x̄ − µ)2/(2σ2)]
• The ratio of the information in the sample to the information
in the statistic becomes independent of µ:
f(x|µ)/q(T(x)|µ) = {(2πσ2)-n/2 exp[−(Σi=1n (xi − x̄)2 + n(x̄ − µ)2)/(2σ2)]}
                   / {(2πσ2/n)-1/2 exp[−n(x̄ − µ)2/(2σ2)]}
                 = n-1/2 (2πσ2)-(n-1)/2 exp[−Σi=1n (xi − x̄)2/(2σ2)]
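The ratio's independence of µ can be checked numerically. A small sketch (an added illustration; the data, seed, and the three µ values are arbitrary):

```python
import numpy as np

# Check that f(x|mu) / q(xbar|mu) does not depend on mu for normal data
# with known variance: the ratio is identical across candidate mu values.
rng = np.random.default_rng(1)
sigma2 = 2.0
x = rng.normal(5.0, np.sqrt(sigma2), size=10)
n, xbar = x.size, x.mean()

def joint_pdf(mu):
    # f(x|mu) = (2*pi*sigma2)^(-n/2) * exp[-sum((xi - mu)^2) / (2*sigma2)]
    return (2 * np.pi * sigma2) ** (-n / 2) * np.exp(-np.sum((x - mu) ** 2) / (2 * sigma2))

def mean_pdf(mu):
    # q(xbar|mu): the N(mu, sigma2/n) density of the sample mean
    v = sigma2 / n
    return np.exp(-(xbar - mu) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

ratios = [joint_pdf(mu) / mean_pdf(mu) for mu in (0.0, 5.0, 9.0)]
print(ratios)   # the three ratios coincide
```
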
Sufficiency
Theorem: Factorization Theorem
Let f(x|θ) denote the joint pdf or pmf of a sample X. A statistic T(X)
is a sufficient statistic for θ if and only if there exist functions g(t|θ)
and h(x) such that, for all sample points x and all parameter points θ,
f(x|θ) = g(T(x)|θ) h(x)
• Sufficient statistics are not unique. From the factorization theorem
it is easy to see that (i) the identity function T(X) = X is a sufficient
statistic vector and (ii) if T is a sufficient statistic for θ, then so is any
1-1 function of T. This motivates minimal sufficient statistics.
Definition: Minimal sufficiency
A sufficient statistic T(X) is called a minimal sufficient statistic if, for any
other sufficient statistic T′(X), T(X) is a function of T′(X).
Consistency
Definition: Consistency
The estimator converges in probability to the population parameter
being estimated as n (the sample size) becomes larger. That is,
θ^n →p θ.
We say that θ^n is a consistent estimator of θ.
Example:
x̄ is a consistent estimator of µ (the population mean).
• Q: Does unbiasedness imply consistency?
No. The first observation of {xn}, x1, is an unbiased estimator of µ. That
is, E[x1] = µ. But letting n grow is not going to cause x1 to converge in
probability to µ.
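The contrast between the two estimators shows up immediately in simulation. A minimal sketch (added here; µ, the seed, and the sample sizes are arbitrary):

```python
import numpy as np

# x1 is unbiased for mu but never improves with n; xbar is consistent.
rng = np.random.default_rng(7)
mu = 3.0

for n in (10, 1_000, 100_000):
    x = rng.normal(mu, 1.0, size=n)
    print(f"n={n:>6}  xbar={x.mean():.4f}  x1={x[0]:.4f}")
# xbar settles near mu = 3.0 as n grows; x1 stays a single noisy draw
```
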
Squared-Error Consistency
Definition: Squared Error Consistency
The sequence {θ^n} is a squared-error consistent estimator of θ if
limn→∞ E[(θ^n − θ)2] = 0
That is, θ^n →m.s. θ.
• Squared-error consistency implies that both the bias and the variance
of an estimator approach zero. Thus, squared-error consistency implies
consistency.
Order of a Sequence: Big O and Little o
• "Little o" o(.).
A sequence {xn} is o(nδ) (order less than nδ) if |n-δ xn| → 0 as n → ∞.
Example: xn = n3 is o(n4) since |n-4 xn| = 1/n → 0 as n → ∞.
• "Big O" O(.).
A sequence {xn} is O(nδ) (at most of order nδ) if n-δ xn → ψ as n → ∞
(ψ ≠ 0, constant).
Example: f(z) = 6z4 – 2z3 + 5 is O(z4) and o(z4+δ) for every δ > 0.
Special case: O(1): constant.
• Order of a sequence of RVs
The order of the variance gives the order of the sequence.
Example: What is the order of the sequence {x̄}?
Var[x̄] = σ2/n, which is O(1/n) –or O(n-1).
Root n-Consistency
• Q: Let xn be a consistent estimator of θ. How fast does xn
converge to θ?
The sample mean, x̄, has variance σ2/n, which is O(1/n). That is,
the convergence is at the rate of n-½. This is called "root n-consistency."
Note: n½ x̄ has variance of O(1).
• Definition: nδ convergence
If an estimator has an O(1/n2δ) variance, then we say the estimator is
nδ-convergent.
Example: Suppose var(xn) is O(1/n2). Then, xn is n-convergent.
The usual convergence is root n. If an estimator has a faster (higher
degree of) convergence, it is called super-consistent.
Estimation
• Two philosophies regarding models (assumptions) in statistics:
(1) Parametric statistics.
It assumes data come from a type of probability distribution and makes
inferences about the parameters of the distribution. Models are
parameterized before collecting the data.
Example: Maximum likelihood estimation.
(2) Non-parametric statistics.
It assumes no probability distribution –i.e., it is "distribution free."
Models are not imposed a priori, but determined by the data.
Examples: histograms, kernel density estimation.
• In general, parametric statistics makes more assumptions.
Least Squares Estimation
• Long history: Gauss (1795, 1801) used it in astronomy.
• Idea: There is a functional form relating Y and k variables X. This
function depends on unknown parameters, θ. The relation between Y
and X is not exact. There is an error, ε. We will estimate the
parameters θ by minimizing the sum of squared errors.
(1) Functional form known
yi = f(xi, θ) + εi
(2) Typical Assumptions
- f(x, θ) is correctly specified. For example, f(x, θ) = Xβ.
- X are numbers with full rank –or E(ε|X) = 0 (ε ⊥ X).
- ε ~ iid D(0, σ2 I)
Least Squares Estimation
• Objective function: S(xi, θ) = Σi εi2
• We want to minimize w.r.t. θ. That is,
minθ {S(xi, θ) = Σi εi2 = Σi [yi − f(xi, θ)]2}
=> dS(xi, θ)/dθ = −2 Σi [yi − f(xi, θ)] f′(xi, θ)
f.o.c. => −2 Σi [yi − f(xi, θLS)] f′(xi, θLS) = 0
Note: The f.o.c. deliver the normal equations.
The solution to the normal equations, θLS, is the LS estimator. The
estimator θLS is a function of the data (yi, xi).
Least Squares Estimation
Suppose we assume a linear functional form. That is, f(x, θ) = Xβ.
Using linear algebra, the objective function becomes
S(xi, θ) =Σi εi2 = ε’ε = (y- X β)’ (y- X β)
The f.o.c.:
−2 Σi [yi − f(xi, θLS)] f′(xi, θLS) = −2 (y − Xb)′X = 0
where b = βOLS. (Ordinary LS. Ordinary = linear.)
Solving for b
=> b = (X’ X)-1 X’ y
Note: b is a (linear) function of the data (yi ,xi).
Least Squares Estimation
The LS estimator of β when f(x, θ) = Xβ is linear is
b = (X′X)-1 X′ y
Note: b is a (linear) function of the data (yi ,xi). Moreover,
b = (X′X)-1 X′ y = (X′X)-1 X′ (Xβ + ε) = β +(X′X)-1 X′ε
Under the typical assumptions, we can establish properties for b.
1) E[b|X]= β
2) Var[b|X] = E[(b-β) (b-β)′|X] =(X′X)-1 X’E[ε ε′|X] X(X′X)-1
= σ2 (X′X)-1
Under the typical assumptions, Gauss established that b is BLUE.
3) If ε|X ~ iid N(0, σ2In)
=> b|X ~ N(β, σ2 (X′X)-1)
4) With some additional assumptions, we can use the CLT to get
b|X →a N(β, (σ2/n)(X′X/n)-1)
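The closed form above is straightforward to compute. A minimal sketch on simulated data (added here; the design, coefficients, noise level, and seed are arbitrary choices):

```python
import numpy as np

# OLS closed form b = (X'X)^{-1} X'y, plus the estimated Var[b|X].
rng = np.random.default_rng(3)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.5, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations
e = y - X @ b
s2 = e @ e / (n - k)                    # unbiased estimate of sigma^2
var_b = s2 * np.linalg.inv(X.T @ X)     # estimate of Var[b|X]

print("b     =", b)
print("se(b) =", np.sqrt(np.diag(var_b)))
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse, which is numerically preferable to `inv(X.T @ X) @ X.T @ y`.
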
Maximum Likelihood Estimation
• Idea: Assume a particular distribution with unknown parameters.
Maximum likelihood (ML) estimation chooses the set of parameters
that maximize the likelihood of drawing a particular sample.
• Consider a sample (X1, ..., Xn) which is drawn from a pdf f(X|θ),
where θ are parameters. If the Xi's are independent with pdf f(Xi|θ),
the joint probability of the whole sample is:
L(X|θ) = f(X1, ..., Xn|θ) = ∏i=1n f(Xi|θ)
The function L(X|θ) –also written as L(X; θ)– is called the likelihood
function. This function can be maximized with respect to θ to
produce maximum likelihood estimates (θ^MLE).
Maximum Likelihood Estimation
• It is often convenient to work with the Log of the likelihood
function. That is,
ln L(X|θ) = Σi ln f(Xi| θ).
• The ML estimation approach is very general. However, if the model is
not correctly specified, the estimates are sensitive to the
misspecification.
Ronald Fisher (1890 – 1962)
Maximum Likelihood: Example I
Let the sample be X = {5, 6, 7, 8, 9, 10}, drawn from a Normal(µ,1).
The probability of each of these points, as a function of the unknown
mean µ, can be written as:
f(5|µ) = (1/√(2π)) exp[−(5 − µ)2/2]
f(6|µ) = (1/√(2π)) exp[−(6 − µ)2/2]
...
f(10|µ) = (1/√(2π)) exp[−(10 − µ)2/2]
Assume that the sample is independent.
Maximum Likelihood: Example I
Then, the joint pdf function can be written as:
L(X|µ) = (2π)-6/2 exp[−(5 − µ)2/2 − (6 − µ)2/2 − ... − (10 − µ)2/2]
The value of µ that maximizes the likelihood function of the sample
can then be defined by maxµ L(X|µ).
It is easier, however, to maximize ln L(X|µ). That is,
maxµ ln L(X|µ) ⇒ ∂/∂µ [K − (5 − µ)2/2 − (6 − µ)2/2 − ... − (10 − µ)2/2] = 0
⇒ (5 − µ) + (6 − µ) + ... + (10 − µ) = 0
⇒ µ^MLE = (5 + 6 + 7 + 8 + 9 + 10)/6 = 7.5 = x̄
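The same answer comes out of a direct numerical maximization. A minimal sketch (added here; the grid search stands in for a proper optimizer and its range and step are arbitrary choices):

```python
import numpy as np

# Grid-search maximization of ln L(X|mu) for the sample {5,...,10}
# under Normal(mu, 1).
x = np.array([5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

def log_lik(mu):
    return -0.5 * np.sum((x - mu) ** 2)   # dropping the constant term

grid = np.linspace(0.0, 15.0, 15_001)     # step 0.001
mu_hat = grid[np.argmax([log_lik(m) for m in grid])]
print(mu_hat, x.mean())   # both 7.5
```
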
Maximum Likelihood: Example I
• Let's generalize this example to an i.i.d. sample X = {X1, X2, ...,
XT} drawn from a Normal(µ,σ2). Then, the joint pdf function is:
L = ∏i=1T (1/√(2πσ2)) exp[−(Xi − µ)2/(2σ2)]
  = (2πσ2)-T/2 exp[−Σi=1T (Xi − µ)2/(2σ2)]
Then, taking logs, we have:
ln L = −(T/2) ln 2πσ2 − (1/(2σ2)) Σi=1T (Xi − µ)2
     = −(T/2) ln 2π − (T/2) ln σ2 − (1/(2σ2)) (X − µ)′(X − µ)
We take first derivatives:
∂ ln L/∂µ = −(1/(2σ2)) Σi=1T 2(Xi − µ)(−1) = (1/σ2) Σi=1T (Xi − µ)
∂ ln L/∂σ2 = −T/(2σ2) + (1/(2σ4)) Σi=1T (Xi − µ)2
Maximum Likelihood: Example I
• Then, we set the f.o.c. and jointly solve for the ML estimators:
(1) ∂ ln L/∂µ = (1/σ^2MLE) Σi=1T (Xi − µ^MLE) = 0
⇒ µ^MLE = (1/T) Σi=1T Xi = X̄
Note: The MLE of µ is the sample mean. Therefore, it is unbiased.
(2) ∂ ln L/∂σ2 = −T/(2σ^2MLE) + (1/(2σ^4MLE)) Σi=1T (Xi − µ^MLE)2 = 0
⇒ σ^2MLE = (1/T) Σi=1T (Xi − X̄)2
Note: The MLE of σ2 is not s2. Therefore, it is biased!
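The bias of σ^2MLE is visible in a short simulation. A sketch we add here (true µ, σ2, T, and the seed are arbitrary; for T = 10, E[σ^2MLE] = (T−1)/T · σ2 = 0.9σ2):

```python
import numpy as np

# Bias check: E[s2] = sigma2, but E[sigma2_mle] = (T-1)/T * sigma2.
rng = np.random.default_rng(11)
sigma2, T, reps = 9.0, 10, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, T))
s2 = x.var(axis=1, ddof=1)          # divisor T-1 (unbiased)
sigma2_mle = x.var(axis=1, ddof=0)  # divisor T (the MLE)

print("E[s2]         ~", s2.mean())          # ~ 9.0
print("E[sigma2_mle] ~", sigma2_mle.mean())  # ~ 0.9 * 9.0 = 8.1
```
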
Maximum Likelihood: Example II
• We will work the previous example with matrix notation. Suppose
we assume:
yi = Xiβ + εi,  εi ~ N(0, σ2)
or  y = Xβ + ε,  ε ~ N(0, σ2 IT)
where Xi is a 1xk vector of exogenous numbers and β is a kx1 vector
of unknown parameters. Then, the joint likelihood function becomes:
L = ∏i=1T (1/√(2πσ2)) exp[−εi2/(2σ2)] = (2πσ2)-T/2 exp[−Σi=1T εi2/(2σ2)]
• Then, taking logs, we have the log likelihood function:
ln L = −(T/2) ln 2πσ2 − (1/(2σ2)) Σi=1T εi2
     = −(T/2) ln 2πσ2 − (1/(2σ2)) (y − Xβ)′(y − Xβ)
Maximum Likelihood: Example II
• The log likelihood function is:
ln L = −(T/2) ln 2πσ2 − (1/(2σ2)) Σi=1T εi2
     = −(T/2) ln 2π − (T/2) ln σ2 − (1/(2σ2)) (y − Xβ)′(y − Xβ)
• We take first derivatives of the log likelihood w.r.t. β and σ2:
∂ ln L/∂β = (1/σ2) Σi=1T εi xi′ = (1/σ2) X′ε
∂ ln L/∂σ2 = −T/(2σ2) + (1/(2σ4)) Σi=1T εi2 = (1/(2σ2))[ε′ε/σ2 − T]
• Using the f.o.c., we jointly estimate β and σ2:
∂ ln L/∂β = (1/σ2) X′(y − Xβ^MLE) = 0 ⇒ β^MLE = (X′X)-1 X′y
∂ ln L/∂σ2 = (1/(2σ^2MLE))[e′e/σ^2MLE − T] = 0
⇒ σ^2MLE = e′e/T = (1/T) Σi=1T (yi − Xiβ^MLE)2
ML: Score and Information Matrix
Definition: Score (or efficient score)
S(X; θ) = ∂ log L(X|θ)/∂θ = Σi=1n ∂ log f(xi|θ)/∂θ
S(X; θ) is called the score of the sample. It is the vector of partial
derivatives (the gradient), with respect to the parameter θ. If we have
k parameters, the score will have dimension kx1.
Definition: Fisher information for a single observation:
I(θ) = E[(∂ log f(X|θ)/∂θ)2]
I(θ) is sometimes just called information. It measures the shape of
log f(X|θ).
ML: Score and Information Matrix
• The concept of information can be generalized to the k-parameter
case. In this case:
I(θ) = E[(∂ log L/∂θ)(∂ log L/∂θ)′]
This is a kxk matrix.
If L is twice differentiable with respect to θ, and under certain
regularity conditions, then the information may also be written as
I(θ) = E[(∂ log L/∂θ)(∂ log L/∂θ)′] = E[−∂2 log L(X|θ)/∂θ∂θ′]
I(θ) is called the information matrix (the negative expected Hessian). It
measures the shape of the likelihood function.
ML: Score and Information Matrix
• Properties of S(X; θ):
S(X; θ) = ∂ log L(X|θ)/∂θ = Σi=1n ∂ log f(xi|θ)/∂θ
(1) E[S(X; θ)] = 0.
∫ f(x; θ) dx = 1 ⇒ ∫ ∂f(x; θ)/∂θ dx = 0
⇒ ∫ [1/f(x; θ)] [∂f(x; θ)/∂θ] f(x; θ) dx = 0
⇒ ∫ [∂ log f(x; θ)/∂θ] f(x; θ) dx = 0 ⇒ E[S(x; θ)] = 0
ML: Score and Information Matrix
(2) Var[S(X; θ)] = n I(θ)
∫ [∂ log f(x; θ)/∂θ] f(x; θ) dx = 0
Let's differentiate the above integral once more:
∫ [∂2 log f(x; θ)/∂θ∂θ′] f(x; θ) dx + ∫ [∂ log f(x; θ)/∂θ] [∂f(x; θ)/∂θ] dx = 0
∫ [∂2 log f(x; θ)/∂θ∂θ′] f(x; θ) dx
  + ∫ [∂ log f(x; θ)/∂θ] [(1/f(x; θ)) ∂f(x; θ)/∂θ] f(x; θ) dx = 0
⇒ ∫ [∂ log f(x; θ)/∂θ]2 f(x; θ) dx + ∫ [∂2 log f(x; θ)/∂θ∂θ′] f(x; θ) dx = 0
⇒ E[(∂ log f(x; θ)/∂θ)2] = −E[∂2 log f(x; θ)/∂θ∂θ′] = I(θ)
Var[S(X; θ)] = n Var[∂ log f(x; θ)/∂θ] = n I(θ)
ML: Score and Information Matrix
(3) If the S(xi; θ) are i.i.d. (with finite first and second moments), then we
can apply the CLT to get:
Sn(X; θ) = Σi S(xi; θ) →a N(0, n I(θ)).
Note: This is an important result. It will drive the distribution of ML
estimators.
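Properties (1) and (2) are easy to verify by simulation. A sketch we add here, for N(µ, σ2) with σ2 known, where the score is S = Σ(xi − µ)/σ2 and I(µ) = 1/σ2 per observation (parameter values and seed are arbitrary):

```python
import numpy as np

# Check E[S] = 0 and Var[S] = n * I(mu) = n / sigma2 for the normal score.
rng = np.random.default_rng(5)
mu, sigma2, n, reps = 2.0, 4.0, 25, 100_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
S = (x - mu).sum(axis=1) / sigma2   # score of N(mu, sigma2), sigma2 known

print("E[S]   ~", S.mean())   # ~ 0
print("Var[S] ~", S.var())    # ~ 25 / 4 = 6.25
```
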
ML: Score and Information Matrix – Example
• Again, we assume:
yi = Xiβ + εi,  εi ~ N(0, σ2)
or  y = Xβ + ε,  ε ~ N(0, σ2 IT)
• Taking logs, we have the log likelihood function:
ln L = −(T/2) ln 2πσ2 − (1/(2σ2)) Σi=1T εi2
     = −(T/2) ln 2π − (T/2) ln σ2 − (1/(2σ2)) (y − Xβ)′(y − Xβ)
• The score function is the vector of first derivatives of ln L w.r.t. θ = (β, σ2):
∂ ln L/∂β = (1/σ2) Σi=1T εi xi′ = (1/σ2) X′ε
∂ ln L/∂σ2 = −T/(2σ2) + (1/(2σ4)) Σi=1T εi2 = (1/(2σ2))[ε′ε/σ2 − T]
ML: Score and Information Matrix – Example
• Then, we take second derivatives to calculate I(θ):
∂2 ln L/∂β∂β′ = −Σi=1T xi xi′/σ2 = −(1/σ2) X′X
∂2 ln L/∂β∂σ2 = −(1/σ4) Σi=1T εi xi′
∂2 ln L/∂σ2∂σ2 = T/(2σ4) − (1/σ6) ε′ε = −(1/(2σ4))[2 ε′ε/σ2 − T]
• Then,
I(θ) = E[−∂2 ln L/∂θ∂θ′] = [ (1/σ2) X′X    0
                             0             T/(2σ4) ]
ML: Score and Information Matrix
In deriving properties (1) and (2), we have made some implicit
assumptions, which are called regularity conditions:
(i) θ lies in an open interval of the parameter space, Ω.
(ii) The 1st and 2nd derivatives of f(X; θ) w.r.t. θ exist.
(iii) L(X; θ) can be differentiated w.r.t. θ under the integral sign.
(iv) E[S(X; θ)2] > 0, for all θ in Ω.
(v) T(X) L(X; θ) can be differentiated w.r.t. θ under the integral sign.
Recall: If the S(xi; θ) are i.i.d. and the regularity conditions apply, then we
can apply the CLT to get:
S(X; θ) →a N(0, n I(θ))
ML: Cramer-Rao inequality
Theorem: Cramer-Rao inequality
Let the random sample (X1, ... , Xn) be drawn from a pdf f(X|θ) and let
T=T(X1, ... , Xn) be a statistic such that E[T]=u(θ), differentiable in θ.
Let b(θ)= u(θ) - θ, the bias in T. Assume regularity conditions. Then,
Var(T) ≥ [u′(θ)]2 / (n I(θ)) = [1 + b′(θ)]2 / (n I(θ))
Regularity conditions:
(1) θ lies in an open interval Ω of the real line.
(2) For all θ in Ω, δf(X|θ)/δθ is well defined.
(3) ∫L(X|θ)dx can be differentiated wrt. θ under the integral sign
(4) E[S(X;θ)2]>0, for all θ in Ω
(5) ∫T(X) L(X|θ)dx can be differentiated wrt. θ under the integral sign
ML: Cramer-Rao inequality
Var(T) ≥ [u′(θ)]2 / (n I(θ)) = [1 + b′(θ)]2 / (n I(θ))
The lower bound for Var(T) is called the Cramer-Rao (CR) lower bound.
Corollary: If T(X) is an unbiased estimator of θ, then
Var(T) ≥ (nI (θ )) −1
Note: This theorem establishes the (asymptotic) superiority of the ML
estimate over all others. The CR lower bound is the smallest theoretical
variance. It can be shown that ML estimates achieve this bound
asymptotically; therefore, any other estimation technique can at best only
equal it.
ML: Cramer-Rao inequality
Proof: For any T(X) and S(X;θ) we have
[Cov(T,S)]2 ≤ Var(T) Var(S)
(Cauchy-Schwarz inequality)
Since E[S] = 0, Cov(T,S) = E[TS].
Also, u(θ) = E[T] = ∫ T L(X;θ) dx. Differentiating both sides:
u′(θ) = ∫ T ∂L(X;θ)/∂θ dx = ∫ T [(1/L) ∂L(X;θ)/∂θ] L dx
      = ∫ T S L dx = E[TS] = Cov(T,S)
Substituting into the Cauchy-Schwarz inequality, with Var(S) = n I(θ):
[u′(θ)]2 ≤ Var(T) n I(θ)
=> Var(T) ≥ [u′(θ)]2/[n I(θ)] ■
ML: Cramer-Rao inequality
Note: For an estimator to achieve the CR lower bound, we need
[Cov(T,S)]2 = Var(T) Var(S).
This is possible only if T is a linear function of S. That is,
T(X) = α(θ) S(X;θ) + β(θ)
Since E[T] = α(θ) E[S(X;θ)] + β(θ) = β(θ), then
S(X;θ) = ∂ log L(X;θ)/∂θ = [T(X) − β(θ)]/α(θ).
Integrating both sides w.r.t. θ:
log L(X;θ) = U(X) − T(X) A(θ) + B(θ)
That is,
L(X;θ) = exp{Σi U(Xi) − A(θ) Σi T(Xi) + n B(θ)}
Or,
f(X;θ) = exp{U(X) − T(X) A(θ) + B(θ)}
ML: Cramer-Rao inequality
f(X;θ) = exp{U(X) – T(X) A(θ)+ B(θ)}
That is, the exponential (Pitman-Koopman-Darmois) family of
distributions attain the CR lower bound.
• Most of the distributions we have seen belong to this family: normal,
exponential, gamma, chi-square, beta, Weibull (if the shape parameter is
known), Dirichlet, Bernoulli, binomial, multinomial, Poisson, negative
binomial (with known parameter r), and geometric.
• Note: The Chapman–Robbins bound is a lower bound on the variance
of estimators of θ. It generalizes the Cramér–Rao bound. It is tighter
and can be applied to more situations –for example, when I(θ) does
not exist. However, it is usually more difficult to compute.
Cramer-Rao inequality: Multivariate Case
• When we have k parameters, the covariance matrix of the estimator
T(X) has a CR lower bound given by:
Covar(T(X)) ≥ [∂u(θ)/∂θ]′ I(θ)-1 [∂u(θ)/∂θ]
Note: In matrix notation, the inequality A ≥ B means the matrix A − B is
positive semidefinite.
If T(X) is unbiased, then
Covar(T(X)) ≥ I(θ)-1
C. R. Rao (1920, India) & Harald Cramer (1893-1985, Sweden)
Cramer-Rao inequality: Example
We want to check whether the sample mean and s2 for an i.i.d. sample
X = {X1, X2, ..., Xn} drawn from N(µ,σ2) achieve the CR lower bound.
Recall:
I(θ) = E[−∂2 ln L/∂θ∂θ′] = [ n/σ2    0
                             0       n/(2σ4) ]
Since the sample mean and s2 are unbiased, the CR lower bound is
given by:
Covar(T) ≥ I(θ)-1
Then,
Var(X̄) ≥ σ2/n  &  Var(s2) ≥ 2σ4/n
We have already derived that Var(X̄) = σ2/n and Var(s2) = 2σ4/(n−1).
Then, the sample mean achieves its CR bound, but s2 does not.
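A simulation sketch of this comparison (added here; σ2 = 1, n = 15, and the seed are arbitrary illustrative choices):

```python
import numpy as np

# Simulated variances of xbar and s2 vs. their CR lower bounds for N(0,1).
rng = np.random.default_rng(9)
sigma2, n, reps = 1.0, 15, 300_000

x = rng.normal(0.0, 1.0, size=(reps, n))
var_xbar = x.mean(axis=1).var()
var_s2 = x.var(axis=1, ddof=1).var()

print(var_xbar, "vs bound", sigma2 / n)        # the mean attains its bound
print(var_s2, "vs bound", 2 * sigma2**2 / n)   # s2 stays above its bound
print("theory for s2: 2*sigma^4/(n-1) =", 2 * sigma2**2 / (n - 1))
```
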
Concentrated ML
• We split the parameter vector θ into two vectors:
L( θ ) = L( θ 1 ,θ 2 )
Sometimes, we can derive a formula for the ML estimate of θ2, say:
θ 2 = g( θ 1 )
If this is possible, we can write the Likelihood function as
L( θ 1 ,θ 2 ) = L( θ 1 , g( θ 1 )) = L* ( θ 1 )
This is the concentrated likelihood function.
• This process is often useful as it reduces the number of parameters
needed to be estimated.
Concentrated ML: Example
• The normal log likelihood function can be written (up to a constant) as:
ln L(µ, σ2) = −(n/2) ln σ2 − (1/(2σ2)) Σi=1n (Xi − µ)2
• This expression can be solved for the optimal choice of σ2 by
differentiating with respect to σ2:
∂ ln L(µ, σ2)/∂σ2 = −n/(2σ2) + (1/(2(σ2)2)) Σi=1n (Xi − µ)2 = 0
⇒ −nσ2 + Σi=1n (Xi − µ)2 = 0
⇒ σ^2MLE = (1/n) Σi=1n (Xi − µ)2
Concentrated ML: Example
• Substituting this result into the original log likelihood produces:
ln L*(µ) = −(n/2) ln[(1/n) Σi=1n (Xi − µ)2]
           − [Σi=1n (Xi − µ)2] / [2 (1/n) Σj=1n (Xj − µ)2]
         = −(n/2) ln[(1/n) Σi=1n (Xi − µ)2] − n/2
• Intuitively, the ML estimator of µ is the value that minimizes the
sum of squared deviations Σi (Xi − µ)2. Thus, the least squares estimate
of the mean of a normal distribution is the same as the ML estimator
under the assumption that the sample is i.i.d.
Properties of ML Estimators
(1) Efficiency. Under general conditions, we have that
Var(θ^MLE) ≥ (n I(θ))-1
The right-hand side is the Cramer-Rao lower bound (CR-LB). If an
estimator can achieve this bound, ML will produce it.
(2) Consistency. We know that E[S(xi; θ)] = 0 and Var[S(xi; θ)] = I(θ).
The consistency of ML can be shown by applying Khinchine's LLN
to S(xi; θ) and then to Sn(X; θ) = Σi S(xi; θ).
Then, do a 1st-order Taylor expansion of Sn(X; θ) around θ^MLE:
Sn(X; θ) = Sn(X; θ^MLE) + Sn′(X; θn*)(θ − θ^MLE),  |θ − θn*| ≤ |θ − θ^MLE| < ε
Since Sn(X; θ^MLE) = 0 at the maximum:
Sn(X; θ) = Sn′(X; θn*)(θ − θ^MLE)
Sn(X; θ) and (θ^MLE − θ) converge together to zero (i.e., in expectation).
Properties of ML Estimators
(3) Theorem: Asymptotic Normality
Let the likelihood function be L(X1, X2, ..., Xn|θ). Under general
conditions, the MLE of θ is asymptotically distributed as
θ^MLE →a N(θ, (n I(θ))-1)
Sketch of a proof. Using the CLT, we've already established
Sn(X; θ) →a N(0, n I(θ)).
Then, using a first-order Taylor expansion as before, we get
(1/n1/2) Sn(X; θ) = (1/n1/2) Sn′(X; θn*)(θ − θ^MLE)
Notice that E[S′(xi; θ)] = −I(θ). Then, apply the LLN to get
Sn′(X; θn*)/n →p −I(θ)  (using θn* →p θ).
Now, algebra and Slutsky's theorem for RVs give the final result.
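The asymptotic normality is easy to see by simulation. A sketch we add here for the Exponential rate, whose MLE is 1/x̄ and whose information is I(λ) = 1/λ2, so the asymptotic variance is λ2/n (true λ, n, and the seed are arbitrary):

```python
import numpy as np

# Sampling distribution of the Exponential-rate MLE, standardized by the
# asymptotic variance (n I(lambda))^{-1} = lambda^2 / n.
rng = np.random.default_rng(2)
lam, n, reps = 2.0, 400, 50_000

x = rng.exponential(scale=1.0 / lam, size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)   # MLE of the rate

z = (lam_hat - lam) / np.sqrt(lam**2 / n)
print("mean(z) ~", z.mean(), "  sd(z) ~", z.std())   # roughly N(0,1)
```
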
Properties of ML Estimators
(4) Sufficiency. If a single sufficient statistic exists for θ, the MLE of θ
must be a function of it. That is, θ^MLE depends on the sample
observations only through the value of a sufficient statistic.
(5) Invariance. The ML estimate is invariant under functional
transformations. That is, if θ^MLE is the MLE of θ and g(θ) is a
function of θ, then g(θ^MLE) is the MLE of g(θ).
Quasi Maximum Likelihood
• ML rests on the assumption that the errors follow a particular
distribution (OLS is only ML if the errors are normal).
•Q: What happens if we make the wrong assumption?
White (Econometrica, 1982) shows that, under broad assumptions
about the misspecification of the error process, θ^MLE is still a
consistent estimator. The estimation is called Quasi ML.
• But the covariance matrix is no longer I(θ)-1; instead, it is given by
the sandwich form
Var[θ^] = I(θ^)-1 [S(θ^)′S(θ^)] I(θ^)-1
• In general, Wald and LM tests are valid when this corrected
covariance matrix is used. But LR tests are invalid, since they work
directly with the value of the likelihood function.
ML Estimation: Numerical Optimization
• In simple cases like OLS, we can calculate the ML estimates from the
f.o.c.’s –i.e., analytically . But in most situations we cannot.
• We resort to numerical optimisation of the likelihood function.
• Think of hill climbing in parameter space. There are many algorithms
to do this.
• General steps:
(1) Set an arbitrary initial set of parameters –i.e., starting values.
(2) Determine a direction of movement (for example, by dL/dθ).
(3) Determine a step length to move (for example, by d2L/dθ2).
(4) Check convergence criteria and either stop or go back to (2).
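The steps above can be sketched with Newton-Raphson, where the score supplies the direction and the Hessian the step length. An illustrative example we add here for the Exponential rate (whose MLE has the closed form 1/x̄, so the numerical answer can be checked; seed and starting value are arbitrary):

```python
import numpy as np

# Newton-Raphson for the Exponential(lambda) log likelihood
# ln L = n*log(lambda) - lambda*sum(x).
rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=1_000)   # true rate = 0.5
n, sx = x.size, x.sum()

score = lambda lam: n / lam - sx        # d ln L / d lambda
hessian = lambda lam: -n / lam**2       # d2 ln L / d lambda2

lam = 0.1                               # (1) starting value
for _ in range(50):
    step = -score(lam) / hessian(lam)   # (2)-(3) Newton direction and length
    lam += step
    if abs(step) < 1e-12:               # (4) convergence check
        break

print("numerical MLE:", lam, " closed form 1/xbar:", n / sx)
```
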
ML Estimation: Numerical Optimization
L
Lu
β β
1
2
β
*
Method of Moments (MM) Estimation
• Simple idea:
Suppose the first moment (the mean) is generated by the distribution
f(X, θ). The observed first moment from a sample of n observations is
m1 = (1/n) Σi=1n xi
Hence, writing the population mean as a function of θ, E[X] = µ(θ),
we can retrieve the parameter θ by inversion:
µ(θ) = m1  => θ^MM = µ-1(m1)
• Example: Mean of a Poisson
pdf: f(x) = exp(−λ) λx/x!
E[X] = λ  => plim (1/n) Σi xi = λ.
Then, the MM estimator of λ is the sample mean of X => λMM = x̄
Method of Moments (MM) Estimation
• Example: Mean of an Exponential
pdf: f(x, λ) = λ e-λx
E[X] = 1/λ  => plim (1/n) Σi xi = 1/λ
Then, λMM = 1/x̄.
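Both examples take one line each in code. A minimal sketch (added here; true parameter values, sample size, and seed are arbitrary):

```python
import numpy as np

# MM estimates recover the parameters from the sample mean alone.
rng = np.random.default_rng(6)

xp = rng.poisson(lam=3.0, size=100_000)
lam_poisson_mm = xp.mean()            # E[X] = lambda
print("Poisson lambda_MM     =", lam_poisson_mm)   # ~ 3.0

xe = rng.exponential(scale=1.0 / 0.8, size=100_000)
lam_exp_mm = 1.0 / xe.mean()          # E[X] = 1/lambda
print("Exponential lambda_MM =", lam_exp_mm)       # ~ 0.8
```
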
• Let’s complicate the MM idea:
Now, suppose we have a model. This model implies certain knowledge
about the moments of the distribution.
Then, we invert the model to give us estimates of the unknown
parameters of the model, which match the theoretical moments for a
given sample.
MM Estimation
• We have a model Y = h (X,θ), where θ are k parameters. Under this
model, we know what some moments of the distribution should be.
That is, the model provides us with k conditions (or moments) that
should be met:
E[g(Y, X | θ)] = 0
• In this case, the (population) first moment of g(Y, X, θ) equals 0.
Then, we approximate the k moments –i.e., E(g)– with their sample
analogues and invert g to get an estimate of θ:
θ^MM = g-1(Y, X, 0)
θˆMM is the Method of Moment estimator of θ.
Note: In this example we have as many moments (k) as unknown
parameters (k). Thus, θ is uniquely and exactly determined.
MM Estimation: Example
We start with a model Y = Xβ + ε. In OLS estimation, we make the
assumption that the X's are orthogonal to the errors. Thus,
E(X′ε) = 0
The sample moment analogue for each xi is
(1/n) Σt=1n xit et = 0  –or (1/n) X′e = 0.
And, thus,
(1/n) X′e = (1/n) X′(Y − XβMM) = 0
=> X′Y = X′XβMM
Therefore, the method of moments estimator, βMM, solves the normal
equations. That is, βMM will be identical to the OLS estimator, b.
Generalized Method of Moments (GMM)
• So far, we have assumed that there are as many moment conditions (l)
as unknown parameters (k). The parameters are uniquely and exactly
determined.
• If l < k –i.e., fewer moment conditions than parameters– we would
not be able to solve them for a unique set of parameters (the model
would be under-identified).
• If l > k –i.e., more moment conditions than parameters– then all
the conditions cannot be met at the same time; the model is over-
identified and we have GMM estimation.
If we cannot satisfy all the conditions at the same time, we want to
make them all as close to zero as possible simultaneously. We have
to figure out a way to weight them.
Generalized Method of Moments (GMM)
• Now, we have k parameters but l moment conditions, l > k. Thus,
E[mj(θ)] = 0,  j = 1, ..., l   (l population moments)
m̄j(θ) = (1/n) Σt=1n mj(yt, xt; θ) = 0,  j = 1, ..., l   (l sample moments)
• Then, we need to make all l sample moments as small as possible,
simultaneously. Let's use a weighted least squares criterion:
minθ q = m̄(θ)′ W m̄(θ)
That is, the weighted squared sum of the moments. The weighting
matrix is the lxl matrix W. (Note that we have a quadratic form.)
• First order condition: 2 [∂m̄(θ)′/∂θ] W m̄(θGMM) = 0, evaluated at
θ = θGMM.
Generalized Method of Moments (GMM)
• The GMM estimator, θGMM, solves this kx1 system of equations.
There is typically no closed-form solution for θGMM. It must be
obtained through numerical optimization methods.
• If plim m̄(θ) = 0, and W (not a function of θ) is a positive definite
matrix, then θGMM is a consistent estimator of θ.
• The optimal W
Any such weighting matrix produces a consistent estimator of θ. We can
select the most efficient one –i.e., the optimal W.
The optimal W is the inverse of the covariance matrix of the moment
conditions. Thus,
Optimal W = W* = [Asy Var(m̄)]-1
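An over-identified toy example makes the criterion concrete. A sketch we add here: one Poisson parameter, two moment conditions (E[X] − λ = 0 and E[X2] − λ − λ2 = 0), an identity weighting matrix in place of the optimal W, and a grid search in place of a proper optimizer; all numbers are illustrative choices:

```python
import numpy as np

# Toy GMM: l = 2 moment conditions, k = 1 Poisson parameter,
# q(lam) = mbar(lam)' W mbar(lam) minimized over a grid.
rng = np.random.default_rng(8)
x = rng.poisson(lam=4.0, size=5_000).astype(float)
m1, m2 = x.mean(), (x**2).mean()   # sample moments, computed once

W = np.eye(2)                      # identity weighting matrix

def q(lam):
    m = np.array([m1 - lam, m2 - lam - lam**2])
    return m @ W @ m

grid = np.linspace(2.0, 6.0, 4_001)
lam_gmm = grid[np.argmin([q(l) for l in grid])]
print("lam_GMM =", lam_gmm)   # near 4, a compromise between the two moments
```
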
Properties of the GMM estimator
• Properties of the GMM estimator:
(1) Consistency.
If plim m̄(θ) = 0, and W (not a function of θ) is a pd matrix, then
under some conditions, θGMM →p θ.
(2) Asymptotic Normality.
Under some general conditions, θGMM →a N(θ, VGMM), where
VGMM = (1/n)[G′V-1G]-1,
G is the matrix of derivatives of the moments with respect to the
parameters, and V = Var(n1/2 m̄(θ)).
Lars Peter Hansen (1952)
Bayesian Estimation: Bayes’ Theorem
• Recall Bayes’ Theorem:
Prob(θ|X) = Prob(X|θ) Prob(θ) / Prob(X)
- P(θ): Prior probability about parameter θ.
- P(X|θ): Probability of observing the data, X, conditional on θ.
This conditional probability is called the likelihood –i.e., the probability
that X is the outcome of the experiment, which depends on θ.
- P(θ |X): Posterior probability -i.e., probability assigned to θ, after X is
observed.
- P(X): Marginal probability of X. This is the prior probability of
witnessing the data X under all possible scenarios for θ; it depends
on the prior probabilities assigned to each θ.
Bayesian Estimation: Bayes’ Theorem
• Example: Courtroom – Guilty vs. Non-guilty
G: Event that the defendant is guilty.
E: Event that the defendant's DNA matches DNA found at the
crime scene.
The jurors, after initial questions, form a personal belief about the
defendant’s guilt. This initial belief is the prior.
The jurors, after seeing the DNA evidence (event E), will update their
prior beliefs. This update is the posterior.
Bayesian Estimation: Bayes’ Theorem
• Example: Courtroom – Guilty vs. Non-guilty
- P(G): Juror’s personal estimate of the probability that the defendant
is guilty, based on evidence other than the DNA match. (Say, .30).
- P(E|G): Probability of seeing event E if the defendant is actually
guilty. (In our case, it should be near 1.)
- P(E): E can happen in two ways: the defendant is guilty and the DNA
match is correct, or the defendant is not guilty and the DNA matches
incorrectly (a one-in-a-million chance).
- P(G|E): Probability that defendant is guilty given a DNA match.
Prob(G|E) = Prob(E|G) Prob(G) / Prob(E) = 1 × (.3) / (.3 × 1 + .7 × 10⁻⁶) = .999998
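The courtroom numbers are easy to check directly; a minimal sketch using the slide's figures (P(G) = .30, P(E|G) = 1, a one-in-a-million false match):

```python
p_G = 0.30              # juror's prior probability of guilt
p_E_given_G = 1.0       # DNA match is (nearly) certain if guilty
p_E_given_notG = 1e-6   # one-in-a-million false match if not guilty

# Law of total probability for P(E), then Bayes' theorem for P(G|E)
p_E = p_E_given_G * p_G + p_E_given_notG * (1 - p_G)
p_G_given_E = p_E_given_G * p_G / p_E
print(round(p_G_given_E, 6))  # → 0.999998
```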
Bayesian Estimation: Viewpoints
• Implicitly, in our previous discussions about estimation (MLE), we
adopted a classical viewpoint.
– We had some process generating random observations.
– This random process was a function of fixed, but unknown
parameters.
– Then, we designed procedures to estimate these unknown
parameters based on observed data.
• For example, we assume a random process such as CEO
compensation. This CEO compensation process can be
characterized by a normal distribution.
– We can estimate the parameters of this distribution using
maximum likelihood.
Bayesian Estimation: Viewpoints
– The likelihood of a particular sample can be expressed as
L(X₁, X₂, …, Xₙ | µ, σ²) = [1 / ((2π)^(n/2) σ^n)] exp[ −(1/(2σ²)) Σ_{i=1}^{n} (X_i − µ)² ]
– Our estimates of µ and σ² are then based on the value of
each parameter that maximizes the likelihood of drawing
that sample.
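To illustrate, a small sketch (with hypothetical simulated data standing in for CEO compensation) evaluates this log-likelihood on a grid of candidate µ values and confirms that the sample mean maximizes it:

```python
import math
import random
import statistics

random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(500)]  # hypothetical sample

def loglik(mu, sigma2, xs):
    # Log of the normal likelihood given above
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((xi - mu) ** 2 for xi in xs) / (2 * sigma2))

sigma2_hat = statistics.pvariance(data)  # MLE of sigma^2 (divides by n)
grid = [9.0 + 0.001 * i for i in range(2001)]
mu_hat = max(grid, key=lambda m: loglik(m, sigma2_hat, data))
print(abs(mu_hat - statistics.mean(data)) < 0.001)  # → True
```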
Thomas Bayes (1702–April 17, 1761)
Bayesian Estimation: Viewpoints
• Turning the classical process around slightly, a
Bayesian viewpoint starts with some kind of
probability statement about the parameters
(a prior). Then, the data, X, are used to update our prior beliefs (a
posterior).
– First, assume that our prior beliefs about the distribution
function can be expressed as a probability density function π(θ),
where θ is the parameter we are interested in estimating.
– Based on a sample -the likelihood function, L(X,θ)- we can
update our knowledge of the distribution using Bayes’ theorem:
π(θ|X) = Prob(X|θ) π(θ) / Prob(X) = L(X|θ) π(θ) / ∫_{−∞}^{∞} L(X|θ) π(θ) dθ
Bayesian Estimation: Example
• Assume the data are Bernoulli with success probability P, and that
our prior for P is the beta distribution Β(α, β):

π(P) = f(P; α, β) = P^(α−1) (1 − P)^(β−1) / B(α, β),

where the beta function is

B(α, β) = ∫₀¹ x^(α−1) (1 − x)^(β−1) dx = Γ(α) Γ(β) / Γ(α + β).

That is,

π(P) = [Γ(α + β) / (Γ(α) Γ(β))] P^(α−1) (1 − P)^(β−1)
Bayesian Estimation: Example
• Assume that we are interested in forming the posterior distribution
after a single draw, X:
π(P|X) = P^X (1 − P)^(1−X) [Γ(α+β)/(Γ(α)Γ(β))] P^(α−1) (1 − P)^(β−1)
/ ∫₀¹ P^X (1 − P)^(1−X) [Γ(α+β)/(Γ(α)Γ(β))] P^(α−1) (1 − P)^(β−1) dP

= P^(X+α−1) (1 − P)^(β−X) / ∫₀¹ P^(X+α−1) (1 − P)^(β−X) dP
Bayesian Estimation: Example
• Following the original specification of the beta function,

∫₀¹ P^(X+α−1) (1 − P)^(β−X) dP = ∫₀¹ P^(α*−1) (1 − P)^(β*−1) dP,

where α* = X + α and β* = β − X + 1, so that

∫₀¹ P^(X+α−1) (1 − P)^(β−X) dP = Γ(X + α) Γ(β − X + 1) / Γ(α + β + 1).

• The posterior distribution –the distribution of P after the
observation– is then

π(P|X) = [Γ(α + β + 1) / (Γ(X + α) Γ(β − X + 1))] P^(X+α−1) (1 − P)^(β−X)
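As a quick sanity check on the algebra, the posterior should be a proper density –i.e., the Β(X+α, β−X+1) density should integrate to one. A sketch with hypothetical values α = 2, β = 3, X = 1:

```python
from math import gamma

alpha, beta_par, X = 2.0, 3.0, 1              # hypothetical prior and a single draw
a_post, b_post = X + alpha, beta_par - X + 1  # posterior Beta parameters

def beta_pdf(p, a, b):
    # Beta density with the Gamma-function normalizing constant
    return gamma(a + b) / (gamma(a) * gamma(b)) * p ** (a - 1) * (1 - p) ** (b - 1)

# Midpoint-rule integration of the posterior density over (0, 1)
N = 100_000
total = sum(beta_pdf((i + 0.5) / N, a_post, b_post) for i in range(N)) / N
print(round(total, 4))  # → 1.0
```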
Bayesian Estimation: Example
• The Bayesian estimate of P is then the value that minimizes a loss
function. Several loss functions can be used, but we will focus on
the quadratic loss function, consistent with mean squared error:

min_P̂ E[(P̂ − P)²]

∂E[(P̂ − P)²]/∂P̂ = 2 E[P̂ − P] = 0 ⇒ P̂ = E[P]

• Taking the expectation of the posterior distribution yields

E[P] = ∫₀¹ [Γ(α+β+1)/(Γ(X+α)Γ(β−X+1))] P^(X+α) (1 − P)^(β−X) dP
= [Γ(α+β+1)/(Γ(X+α)Γ(β−X+1))] ∫₀¹ P^(X+α) (1 − P)^(β−X) dP
Bayesian Estimation: Example
• As before, we solve the integral by defining α* = α + X + 1 and
β* = β − X + 1. The integral then becomes

∫₀¹ P^(α*−1) (1 − P)^(β*−1) dP = Γ(α*) Γ(β*) / Γ(α* + β*)
= Γ(α + X + 1) Γ(β − X + 1) / Γ(α + β + 2)

so that

E[P] = [Γ(α + β + 1) / (Γ(α + X) Γ(β − X + 1))] × [Γ(α + X + 1) Γ(β − X + 1) / Γ(α + β + 2)]
= Γ(α + β + 1) Γ(α + X + 1) / [Γ(α + β + 2) Γ(α + X)]

• This can be simplified using the fact Γ(α + 1) = α Γ(α):

E[P] = [Γ(α + β + 1) (α + X) Γ(α + X)] / [(α + β + 1) Γ(α + β + 1) Γ(α + X)]
= (α + X) / (α + β + 1)
Bayesian Estimation: Example
• To make this estimation process operational, assume a prior with
parameters α = β = 1.4968, which yields a beta distribution for P
with a mean of 0.5 and a variance of 0.0625.
• Extending the results to n Bernoulli trials yields
π(P|X) = [Γ(α + β + n) / (Γ(α + Y) Γ(β − Y + n))] P^(Y+α−1) (1 − P)^(β−Y+n−1)

where Y is the sum of the individual Xs –i.e., the number of heads in
the sample. The estimated value of P then becomes:

P̂ = (Y + α) / (α + β + n)
Bayesian Estimation: Example
• Suppose in the first sample Y = 15 and n = 50. This yields an
estimated value of P of 0.31129, which compares with the
maximum likelihood estimate of 0.30000. Since the maximum
likelihood estimator in this case is unbiased, the results imply that
the Bayesian estimator is biased: the prior pulls the estimate toward
the prior mean of 0.5.
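These figures are easy to reproduce; a short sketch computes the Bayes estimate, compares it with the MLE, and checks the posterior mean by numeric integration:

```python
from math import gamma

alpha = beta_par = 1.4968   # prior parameters from the slide
Y, n = 15, 50               # number of heads and number of trials

p_bayes = (Y + alpha) / (alpha + beta_par + n)  # posterior mean (quadratic loss)
p_mle = Y / n                                   # maximum likelihood estimate

# Numeric check: the posterior is Beta(Y+alpha, beta-Y+n); integrate P against it
a_post, b_post = Y + alpha, beta_par - Y + n
const = gamma(a_post + b_post) / (gamma(a_post) * gamma(b_post))
N = 100_000
mean_num = sum(const * ((i + 0.5) / N) ** a_post
               * (1 - (i + 0.5) / N) ** (b_post - 1)
               for i in range(N)) / N

print(round(p_bayes, 4), p_mle)  # → 0.3113 0.3
```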