Review of Bayesian and Frequentist Statistics
Bertrand Clarke
Department of Medicine, University of Miami
NDU 2011
Outline
1. Basic Definitions
2. Models of Variability: Survey Sampling, Frequentist, Bayesian, Other Schools
3. The Normal Example
4. Sufficiency and Exponential Families
5. Main Frequentist Estimators: Method of Moments, MLE's, UMVUE's, Testing
6. Bayesian Estimation: Conjugate Priors, Decision Theory, Testing
Basic Definitions
There are several schools of thought in Statistics. They
differ primarily in how they treat variability.
To explain them we need some definitions.
Population: The collection of all outcomes, real or
imagined, to which one wants conclusions to apply. May be
natural or artificial; must be precise.
Examples: All people born on a Tuesday since 1950. All
left handed vegetarians employed at NDU (provided
vegetarian is precisely defined).
All motorized vehicles registered in Lebanon. All Lebanese
residents over age 65 as of 1 July 2011. All runs of a
specific experiment that might be performed. All strings of
zeros and ones.
Population vs sample
A lot of work goes into defining a population accurately.
A sample is a subset of size n from the population.
Samples let us make inferences about the population they
represent.
We want the population we sampled to be the same as the
population we want to study.
A frame (if it exists) is the set from which we draw our
samples.
If we take a sample of businesses that have webpages this
will not be the same as the population of all businesses.
Even if a sample is drawn from the correct population, it
doesn’t mean it is representative.
Ideal: Representative sample of the population of interest.
Representative samples
Not always possible....
Ideal case: The selection from the population is found by
‘simple random sampling’.
A sample of size n is taken and (i) each element of the
population has the same chance of inclusion in the sample
(ii) the selection of one element for the sample is
unaffected by the selection of any other element.
This means all individuals and all samples are
probabilistically equivalent in the sense that they are the
output of the same sampling process.
Abbreviated: IID (independent and identically distributed).
Sometimes samples are dependent or not identically distributed and we must model this.
Random Variables
Random variable: The process by which a measurement is
generated, X .
Outcome: The measurement generated. X = x
The process of obtaining a measurement of, say, the size
of a tumor is the random variable X . The measurement is
X = .5 cm.
A random variable has an associated probability, P. Thus:
P(X ∈ A) is the probability that the random variable gives
us a value in the set A.
We might consider the probability that a tumor of diameter at least .5 cm grows. This is P(X ≥ .5).
A specific outcome does not have a probability.
Technical Point
The way I defined a RV is informal. Here's the real definition:
X : (X, F) → (IR, B(IR)) where X is measurable, i.e., ∀B ∈ B(IR), X⁻¹(B) ∈ F.
We assign a probability PR to the observation space (the range) and pull it back to give a probability on the underlying measure space.
Thus PD(X⁻¹(A)) = PR(A) for A ∈ B(IR) and
PR(A) = PD(X ∈ A) = PD(X⁻¹(A)) = PD(ω ∈ X | X(ω) ∈ A)
(neglecting {, }'s), and we usually drop the subscripts on P for convenience.
We never see (X, F) and we don't know much about it... we just more or less assume it works out OK, and there are theorems guaranteeing it has the properties we want.
Parameters
We often get n outcomes x1 , . . . , xn of a random variable
X . We may denote the n draws of X by X1 , . . . , Xn .
The collection of outcomes/measurements is the sample.
The population is fixed, but we consider different
descriptions of it.
A description of a population is given by a probability on it.
We don’t know the correct P but we may have a collection
of probabilities P that we are sure contains the true one.
Often P = {Pθ |θ ∈ Θ}, Θ ⊂ IR d . θ is a parameter.
We use a function of x1 , . . . , xn to estimate θ. Write
θ̂ = θ̂(x n ) where x n = (x1 , . . . , xn ).
We write θ̂(X n ), where X n = (X1 , . . . , Xn ) when we want to
emphasize that θ̂ can be regarded as a RV in its own right.
Inference
The basic problem of inference is to use the data, i.e., the sample, to get an estimate θ̂ of θtrue, i.e., to identify the correct description of the population.
Sometimes the parameter means something, e.g., height of people. Sometimes a parameter is just an index for a collection of probabilities.
Here, we won't usually make a distinction between a probability, a density, and a distribution function since the parameter would identify any of them.
It is not enough to announce θ̂... we want a description of how θ̂ varies.
So, we must understand models of variability.
There are 3 major ones and several minor ones.
Survey Sampling
The population is finite, size N.
The only randomness is in which sample is chosen. An individual, once chosen, generates a measurement with no ambiguity.
The X is the selection of an individual from the population.
There are (N choose n) possible samples of size n, and when we get one we use it to get a point estimate, i.e., a θ̂.
If we take, say, a mean, then X̄ has a distribution generated by considering the possible samples of size n.
So, E(X̄) is the sum of possible values of X̄ weighted by the probability of choosing a sample that gives that value.
Variability
We can also find Var(X̄) = E(X̄ − E(X̄))².
Usually must have n << N for decent inference.
In this case we might get a confidence interval of the form x̄ ± z1−α/2 FPC s/√n.
This means that 100(1 − α)% of the samples of size n that we might get will give an interval of the form x̄ ± z1−α/2 FPC s/√n that contains θ = E(X).
The FPC is called the finite population correction.
Note that the variability is in the sample chosen, and we imagine the result of choosing all samples of size n.
This is Frequentist in that we consider the effect of repeated samples of size n and invoke the Frequency interpretation of probability.
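A minimal sketch of this interval, assuming simple random sampling without replacement and the usual correction FPC = √((N − n)/(N − 1)); the population values and the sizes below are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

N = 5000                                                 # finite population size
population = rng.gamma(shape=2.0, scale=10.0, size=N)    # hypothetical measurements

n = 100                                                  # sample size, n << N
sample = rng.choice(population, size=n, replace=False)   # simple random sample

alpha = 0.05
z = stats.norm.ppf(1 - alpha / 2)
fpc = np.sqrt((N - n) / (N - 1))                         # finite population correction
x_bar, s = sample.mean(), sample.std(ddof=1)
half_width = z * fpc * s / np.sqrt(n)

print(f"x_bar = {x_bar:.2f}, CI = ({x_bar - half_width:.2f}, {x_bar + half_width:.2f})")
print(f"population mean = {population.mean():.2f}")
```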
Frequentist
Rests on the Frequentist interpretation of probability. That is, the probability of an event A (such as tossing a coin and getting tails) is the limit
P(A) = lim_{n→∞} (# times A was observed)/(# times we looked).
Not a formal limit (given an ϵ you can't find an n).
Given Pθ and n copies of X we form confidence regions.
A 1 − α confidence region is a random set R(X^n) with the property that
Pθ(θ ∈ R(X^n)) = 1 − α.
Note that we have one Pθ for each Xi and another Pθ^n for X^n formed from n copies of Pθ, but we don't bother to distinguish between them.
Confidence Intervals
The question is how to find CR's. The usual approach is to form an interval.
Suppose θ is a mean, θ = E(X).
Then, if σ is known, as n → ∞, we can show
Pθ(zα/2 σ ≤ √n(X̄ − θ) ≤ z1−α/2 σ) → 1 − α.
That is, R(X^n) = {θ : zα/2 σ ≤ √n(X̄ − θ) ≤ z1−α/2 σ} is an asymptotic 1 − α CI, and we have one outcome of it (from the n outcomes of X).
Frequentist prediction comes from Pθ̂(·), i.e., A is a 1 − α prediction region if Pθ̂(Xn+1 ∈ A) = 1 − α. (Not exact because it neglects variation in θ̂.)
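A small simulation sketch of the confidence interpretation: repeat the sampling many times and count how often the interval x̄ ± z1−α/2 σ/√n covers the true mean. The Gaussian population and the specific numbers are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n, alpha = 10.0, 2.0, 50, 0.05    # "true" mean/sd assumed for the demo
z = stats.norm.ppf(1 - alpha / 2)

covered = 0
reps = 10_000
for _ in range(reps):
    x = rng.normal(mu, sigma, size=n)        # one sample of size n
    half = z * sigma / np.sqrt(n)            # sigma treated as known
    covered += (x.mean() - half <= mu <= x.mean() + half)

print(f"empirical coverage ≈ {covered / reps:.3f} (nominal {1 - alpha})")
```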
Confidence
Interpretation: x̄ ± zα/2 σ/√n is an interval produced by a technique that ensures 100(1 − α)% of intervals so formed will contain θ.
Confidence is a property of the process of producing the interval, not of the numerical interval itself.
The distribution of θ̂ = X̄ is called the sampling distribution. It is the central object for Frequentist inference.
Statements about where a parameter lies retain the randomness of the data generating mechanism. It's as if we never forget that the sample we got came from a RV.
The outcome x is what we see. The X is like the process by which we got the outcome.
Standard Error
A Frequentist distinguishes between the SD and the SE.
An SD is the σ for a single outcome of a RV X. This is a property of the population distribution.
An SE is the σ for a function of n outcomes of a RV X. This is a property of the sampling distribution.
For one X, Var(X) = σ²: the X has a distribution with a density curve and we find the SD.
For IID X1, ..., Xn, Var(X̄) = σ²/n, and this is taken in the sampling distribution for X̄, which is derived from the distribution of X but will be much more peaked around µ.
For INID X1, ..., Xn, Var(X̄) = (1/n²) Σ_{i=1}^n σi², and for dependent variables, all bets are off.
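A quick simulation sketch of the SD vs. SE distinction: the spread of single outcomes versus the spread of the sample mean over repeated samples. The population and the sizes are made-up illustrations.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n, reps = 3.0, 25, 20_000

x = rng.normal(0.0, sigma, size=(reps, n))   # reps samples of size n
single_outcomes = x[:, 0]                    # one outcome per replication
sample_means = x.mean(axis=1)                # X-bar per replication

print(f"SD of single outcomes ≈ {single_outcomes.std(ddof=1):.3f}  (theory: {sigma})")
print(f"SE of the sample mean ≈ {sample_means.std(ddof=1):.3f}  (theory: {sigma/np.sqrt(n):.3f})")
```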
In Practice
A Frequentist chooses a class of densities f(x|θ) integrating to 1 for each θ; f(x^n|θ) = f(x1|θ) · · · f(xn|θ).
The MLE is a standard estimator: θ̂ = arg max_θ f(x^n|θ).
Many choices of f(·|θ) have a sufficient statistic: Poisson, Binomial, normal, exponential...
Definition of sufficiency: T(X) is sufficient for θ in f(x|θ) ⇐⇒ inference on θ only depends on T.
Sufficient statistics contain all the information about θ in the data, so functions of them are good estimators.
Other Frequentist Techniques
A statistic T is unbiased for θ if and only if Eθ T(X^n) = θ.
CRLB for unbiased statistics: Varθ(T(X^n)) ≥ 1/In(θ).
UMVUE: Any statistic that achieves the CRLB for all θ in an interval.
It turns out that UMVUE's are unique and can be given as functions of sufficient statistics.
Given X1, ..., Xn, put them in order from smallest to largest: X(1), ..., X(n).
L-Statistics: Linear combinations of order statistics.
Decision theoretic statistics... Covariates...
These are all ways to find statistics to generate a point estimate and a sampling distribution and hence CIs.
Bayesian
The Frequentist assumes θ is fixed and the data retain their stochastic character (via the Frequency interpretation).
Bayesians reverse this: θ is a random variable Θ and the data, once obtained, are treated as fixed. So, you condition on them.
Where does the distribution on Θ come from? We make it up. Call it the density w(θ). We still have the conditional density for X given Θ = θ that we write as f(·|θ).
The joint density for (Θ, X^n) is
w(θ)f(x^n|θ) = w(θ|x^n)m(x^n),
where m(x^n) = ∫ w(θ)f(x^n|θ)dθ is called the mixture distribution or the marginal for the data.
Bayesian Inference
Bayesians make inferences from the posterior w(θ|x n ).
A 1 − α credible set R = Rα (x n ) is any set of parameter
values satisfying
W (R|x n ) = 1 − α.
This does not require the Frequency interpretation.
The interpretation of a credibility region R is that
conditional on the data, we have a set that contains 1 − α
posterior probability.
The posterior density is the Bayesian’s analog to the
sampling distribution.
No repeated sampling assumption (which might not be
satisfied) just a direct statement about where θ is –
conditional on the data.
Choosing the prior
There are two approaches to prior selection.
Subjective: (snide) The investigator consults his/her feelings and impressions about where θ might be and draws curves to represent this, trying to find a mathematical form that they fit.
Subjective: (fair) The investigator reflects carefully on the relevant information about θ that might be available and tries to formulate a prior density that summarises this.
Objective: The prior is chosen by some kind of auxiliary principle, e.g., noninformativity, usually an optimization or invariance criterion.
Or: choose a class of priors and evaluate the stability of inferences over the class.
Types of Bayes Estimator
Decision theoretic: Choose a loss function L(·, ·) and find
δB(x^n) = arg min_δ ∫ L(θ, δ(x^n)) w(θ|x^n) dθ.
The integral is called the posterior risk of δ.
Posterior mode: Analogous to the MLE, choose
θ̂PM = arg max_θ w(θ|x^n).
Actually, |θ̂MLE − θ̂PM| = OP(1/n).
Conventionally, the Bayesian wants to see the whole posterior because the shape of the curve explains the variability better than Var(Θ|X = x) can.
Variability
As noted, given w(θ|x^n), the Bayesian might use the posterior variance
∫ (θ − E(Θ|X^n = x^n))² w(θ|x^n) dθ,
where E(Θ|x^n) = ∫ θ w(θ|x^n) dθ is the posterior mean.
Just as X̄ → µ and (X̄ − µ)/(σ/√n) → N(0, 1), posterior quantities have analogous properties.
Bayesian LLN: E(Θ|X^n) → µ.
Bayesian CLT: the posterior distribution of √(n I(θ̂)) (θ − θ̂) tends to N(0, 1). (Note θ̂ = E(Θ|X^n) is one choice.)
Where is the variability?
Importance of variability: 1 m ± 1 cm vs. 1 m ± 1 km.
The Bayesian says the data, once obtained, are no longer stochastic. They are the fixed outcomes of a RV, and so you condition on them.
The Bayesian says the variability is transmuted from the data to the parameter by way of the posterior distribution for the parameter that is conditional on the data.
Thus, the Survey Sampler thinks in terms of subsets of a specific population; a Frequentist thinks in terms of repeated sampling; a Bayesian thinks of what the resulting posterior says about the parameter given the data.
Outcomes remain stochastic for the Frequentist, not the Bayesian.
Special Cases
Conjugate priors: Choose a prior from a class so that the posterior is in the same class. Depends on the likelihood.
Non-parametric Bayes has exactly the same structure as parametric Bayes.
Bayesian prediction comes from the predictive distribution:
m(xn+1|x^n) = ∫ f(xn+1|θ) w(θ|x^n) dθ.
That is, A is a 1 − α prediction region if Mn+1(Xn+1 ∈ A|x^n) = 1 − α, still conditional on x^n, like the Frequentist case.
In IID cases, Bayes and Frequentist methods for estimation are often asymptotically equivalent.
Testing
Bigger differences arise in hypothesis testing: the p-value is obtained from the sampling distribution and has a very different meaning than a Bayes factor.
Bayes testing is decision-theoretic, using 0-1 loss.
Bayes testing of H0 : θ ∈ S vs H1 : θ ∈ S^c is based on W(S|x^n) or, equivalently, on the Bayes factor
BF(1; 2) = [W(S|x^n)/W(S)] / [W(S^c|x^n)/W(S^c)].
This is the ratio of the posterior odds to the prior odds.
Contrast: Frequentist testing uses the Neyman-Pearson Lemma, which is an optimization of P(reject H0 | H1 true) subject to P(reject H0 | H0 true) ≤ α; both probabilities are in the sampling distribution of the test statistic.
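A small numeric sketch of the Bayes factor as posterior odds over prior odds, using a normal mean with known σ and a conjugate normal prior (the prior hyperparameters, hypotheses and simulated data below are illustrative assumptions).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sigma, n = 1.0, 30
x = rng.normal(0.3, sigma, size=n)          # data generated under an assumed true mean

# Conjugate prior: theta ~ N(m0, tau^2); the posterior is normal (cf. the normal example later).
m0, tau = 0.0, 1.0
post_var = 1.0 / (n / sigma**2 + 1 / tau**2)
post_mean = post_var * (n * x.mean() / sigma**2 + m0 / tau**2)

# Hypotheses: S = {theta <= 0} vs S^c = {theta > 0}
prior_S = stats.norm.cdf(0.0, loc=m0, scale=tau)
post_S = stats.norm.cdf(0.0, loc=post_mean, scale=np.sqrt(post_var))

bf = (post_S / (1 - post_S)) / (prior_S / (1 - prior_S))   # posterior odds / prior odds
print(f"W(S|x^n) = {post_S:.3f},  BF for S vs S^c = {bf:.3f}")
```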
Likelihood; Information-Theoretic
Likelihood = the conditional density for X given θ, regarded as a function of θ for fixed X = x: L(θ|x) = f(x|θ).
LP: All inferences should come only from the likelihood.
Get intervals like {θ | L(θ̂|x^n) − L(θ|x^n) ≤ t} for thresholds t.
No notion of confidence or credibility.
Information-theoretic: The idea is that the central features of models and data are expressible in terms of measures of complexity (Kolmogorov, VC-dimension, codelength).
E.g., choose a model p̂ by
p̂ = arg min_{p∈P} [L(p) + L(x^n|p)],
where L(·) is a codelength: L(x^n|p) is the Shannon codelength for x^n given p, and L(p) is a codelength for p.
Includes maxent, relative entropy criteria, MML, MDL, etc.
Predictive
Prequential Principle, Dawid (1984).
'The method of evaluation of a predictor should be disjoint from its method of construction, e.g., depend only on the predictions it makes and the future data.'
Typically look at something like
CPE = Σ_{i=n1}^{n2} (Ŷi+1(xi+1; x^i) − Yi+1(xi+1))²
as a way to evaluate a predictor Ŷi+1.
Inference comes from prediction errors.
Other principles: variance/bias, robustness, complexity, etc.
Predictive criteria are most important with complex and high dimensional data where modeling is impossible.
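A minimal sketch of a cumulative prediction error (CPE) evaluation, assuming a simple running-mean predictor for an IID stream; the data and the n1, n2 window are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(5.0, 2.0, size=200)       # hypothetical data stream

n1, n2 = 50, 198                          # evaluation window
cpe = 0.0
for i in range(n1, n2 + 1):
    y_hat = y[: i + 1].mean()             # predictor built from the past only
    cpe += (y_hat - y[i + 1]) ** 2        # squared error on the next observation

print(f"CPE over i = {n1}..{n2}: {cpe:.2f}")
```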
Another perspective...
Fiducialist: Wang, Hannig, Iyer (2011). This was a weird idea due to Fisher that never worked, but from time to time people try to make it work.
Regard X as X = G(θ, U), U = error distribution. Define Q(x, u) = {θ | G(θ, u) = x}; Q is like a G⁻¹ in θ for fixed u.
The fiducial distribution for θ is Q(x, U*) | (Q(x, U*) ≠ ∅), where U* is an independent copy of U.
Still fairly complicated and under development, but an interesting alternative.
Frequentist
X ∼ N(µ, σ²) has density
f(x|µ, σ) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}.
n IID outcomes X^n = x^n satisfy:
X̄ ∼ N(µ, σ²/n) and (n − 1)S²/σ² = (1/σ²) Σ_{i=1}^n (xi − x̄)² ∼ χ²_{n−1}.
X̄ and S² are independent. So, √n(X̄ − µ)/σ ∼ N(0, 1) and √n(X̄ − µ)/S ∼ t_{n−1}, giving CIs for µ and σ.
t_{n−1} = N(0, 1)/√(χ²_{n−1}/(n − 1)); it tends to N(0, 1) as n → ∞ but otherwise has heavier tails; t_{k+1} has moments of order at most k (its density decays like |x|^{−(k+2)}).
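A sketch of the resulting intervals: a t-based CI for µ when σ is unknown, and a χ²-based CI for σ². The simulated data are an illustrative assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
mu, sigma, n = 10.0, 2.0, 20
x = rng.normal(mu, sigma, size=n)

alpha = 0.05
x_bar, s = x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
half = t_crit * s / np.sqrt(n)
print(f"95% CI for mu: ({x_bar - half:.2f}, {x_bar + half:.2f})")

# CI for sigma^2 from (n-1)S^2/sigma^2 ~ chi^2_{n-1}
lo = (n - 1) * s**2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1)
hi = (n - 1) * s**2 / stats.chi2.ppf(alpha / 2, df=n - 1)
print(f"95% CI for sigma^2: ({lo:.2f}, {hi:.2f})")
```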
Bayes version
X IID N(µ, σ²)... assume σ fixed, estimate µ. (If we use X̄, set σ² = σ′²/n.)
Use a prior w(µ) on µ: µ ∼ N(θ, τ²), θ, τ known.
Joint distribution:
h(µ, x) = w(µ)f(x|µ) = (1/(2πστ)) e^{−(1/2)[(µ−θ)²/τ² + (x−µ)²/σ²]}.
Bayes rule: w(µ)f(x|µ) = w(µ|x)m(x)... so to find w(µ|x), let ρ = (τ² + σ²)/(τ²σ²) and complete the square:
(µ − θ)²/τ² + (x − µ)²/σ² = ρ[µ − (1/ρ)(θ/τ² + x/σ²)]² + (θ − x)²/(σ² + τ²).
This gives a useful form for h(µ, x).
Bayes continued
To find w(µ|x), divide h(x, µ) by m(x):
m(x) = ∫ h(x, µ) dµ = (1/√(2π(σ² + τ²))) e^{−(x−θ)²/(2(σ²+τ²))}, i.e., the marginal for X is N(θ, σ² + τ²).
Now, w(µ|x) = h(x, µ)/m(x) equals
√(ρ/(2π)) e^{−(ρ/2)(µ − (1/ρ)(θ/τ² + x/σ²))²}.
Thus, (µ|x) has a N(µ(x), 1/ρ) distribution where
µ(x) = (σ²/(σ² + τ²)) θ + (τ²/(σ² + τ²)) x.
We can now get credible sets from N(µ(x), 1/ρ) for any x.
If σ is not known, we still get the t_{n−1} distribution...
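A sketch of this normal-normal update for a single observation x (the same formula applies to x̄ with σ² replaced by σ²/n); the prior hyperparameters and the observation are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Prior: mu ~ N(theta0, tau^2); likelihood: X | mu ~ N(mu, sigma^2), sigma known.
theta0, tau = 0.0, 2.0
sigma = 1.0
x = 1.7                                    # one observed outcome (illustrative)

rho = (tau**2 + sigma**2) / (tau**2 * sigma**2)     # posterior precision
post_mean = (sigma**2 / (sigma**2 + tau**2)) * theta0 \
          + (tau**2 / (sigma**2 + tau**2)) * x
post_sd = np.sqrt(1.0 / rho)

# 95% credible set from N(post_mean, 1/rho)
lo, hi = stats.norm.ppf([0.025, 0.975], loc=post_mean, scale=post_sd)
print(f"posterior: N({post_mean:.3f}, {post_sd**2:.3f}),  95% credible set: ({lo:.3f}, {hi:.3f})")
```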
Bayesian, unknown σ
Let Xi ∼ N(µ, σ²) IID for i = 1, ..., n and assume w(µ, σ²) ∝ 1/σ².
It can be shown that w(µ|σ², x^n) ∼ N(x̄, σ²/n).
To get w(µ|x^n), marginalize:
w(µ|x^n) = ∫ w(µ|σ², x^n) w(σ²|x^n) dσ².
It can be shown that w(µ|x^n) ∼ t_{n−1}(x̄, s²/n). That is, w((µ − x̄)/(s/√n)|x^n) ∼ t_{n−1}.
From w(µ, σ²|x^n) ∝ w(µ, σ²)f(x^n|µ, σ²) we can get
w(σ²|x^n) ∝ ∫ (1/σ²)^{n/2+1} e^{−(1/(2σ²))[(n−1)s² + n(x̄−µ)²]} dµ,
i.e., an Inv-χ²(n − 1, s²) = Inv-Gamma((n − 1)/2, (n − 1)s²/2).
Information-theory
The entropy of a RV is H(X) = −∫ p(x) log p(x) dx; the entropy is the minimal noiseless codelength.
The normal family can be derived from the following. Consider k functions f1, ..., fk and numbers a1, ..., ak and suppose E(fj(X)) = aj for j = 1, ..., k. The maximum entropy distribution for X, if it exists, is
p(x) = c e^{Σ_{j=1}^k λj fj(x)},
where c and the λj's are found to make ∫ p(x) dx = 1 and match the constraints.
Given the first two moments, solving this gives the normal. That is, if k = 2, let f1(x) = x and f2(x) = x².
Another coding argument gives estimates for µ and σ by two-stage coding, Barron and Cover (1990).
Predictive
Use the first n outcomes to predict Yn+1 – the analog of estimating µ.
Consider the sample mean Ȳ. If no model is assumed, set µ = E(Yi) and σ² = Var(Yi).
Use standard inequalities (Markov, triangle, Cauchy-Schwarz) to obtain, for given τ > 0,
P(|Ȳ − Yn+1| ≥ σ(1 + 1/√n)/τ) ≤ (τ/(σ(1 + 1/√n))) [(E|Ȳ − µ|²)^{1/2} + (E|µ − Yn+1|²)^{1/2}] ≤ τ.
For τ < 1, the Frequentist PI for known σ is
Ȳ ± σ(1 + 1/√n)/τ.
Normal Case
If Xi ∼ N(µ, σ²), X̄ − Xn+1 ∼ N(0, σ²(1 + 1/n)) and the prediction interval becomes X̄ ± z1−α σ(1 + 1/n)^{1/2}, where z1−α is the 100(1 − α) percentile of N(0, 1).
So, if τ = 1/z1−α, the only difference between the general case and the normal case is (1 + 1/n)^{1/2} versus (1 + 1/√n), which is asymptotically negligible.
If σ is unknown, the PI's become
Ȳ ± (σ̂/τ) √(n − 1) √(1 + 1/n) / √(n − 3).
An exact form for the normal can be found in this case too. See Geisser (1995), Chap. 2.
Bayes Prediction
As above, the posterior mean is E(Θ|x^n) = [τ²/(σ²/n + τ²)] x̄ + [(σ²/n)/(σ²/n + τ²)] µ and the posterior variance is 1/ρ = τ²σ²/(nτ² + σ²) = O(1/n). So,
m(xn+1|x^n) = N(E(Xn+1|X^n = x^n), Var(Xn+1|X^n = x^n)),
E(Xn+1|X^n = x^n) = E(Θ|X^n = x^n),
and
Var(Xn+1|X^n = x^n) = σ² + Var(Θ|X^n = x^n).
The PI is now E(Θ|X^n = x^n) ± zα/2 √Var(Xn+1|X^n = x^n).
As before, this can be extended to the case that σ is not known by putting a prior on it.
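A sketch of the Bayesian predictive interval under this normal-normal setup; the prior hyperparameters (the prior mean is written m0 here) and the simulated data are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
sigma, n = 2.0, 40
x = rng.normal(3.0, sigma, size=n)

m0, tau = 0.0, 5.0                                           # prior: Theta ~ N(m0, tau^2)
post_var = (tau**2 * sigma**2) / (n * tau**2 + sigma**2)     # 1/rho
post_mean = (tau**2 / (sigma**2 / n + tau**2)) * x.mean() \
          + ((sigma**2 / n) / (sigma**2 / n + tau**2)) * m0

pred_mean = post_mean                                        # E(X_{n+1} | x^n)
pred_var = sigma**2 + post_var                               # Var(X_{n+1} | x^n)

lo, hi = stats.norm.ppf([0.025, 0.975], loc=pred_mean, scale=np.sqrt(pred_var))
print(f"95% predictive interval for X_(n+1): ({lo:.2f}, {hi:.2f})")
```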
Exponential Families
A parametric family f(x|θ) is of exponential form ⇐⇒ ∃K such that f(x|θ) can be written as
f(x|θ) = h(x) c(θ) e^{Σ_{k=1}^K wk(θ) tk(x)}.
The support of X is independent of θ.
The K functions tk are sufficient for θ in X, even as n increases.
Convenient with independence: exponents add.
Natural form: replace wk(θ) by ηk.
c(θ) is the normalizing constant; h(x) is independent of θ.
Examples: Normal, Poisson, Binomial, Exponential, Gamma, Dirichlet, χ²k, ... etc.
Non-exponential families: Unif[0, θ], Cauchy, Laplace, tn, ...
Sufficiency
T(X) is sufficient for θ in f(x|θ) ⇐⇒ the conditional distribution for X given T(X) is independent of θ.
Formally, T is sufficient ⇐⇒ P(X = x|T(X) = t, θ) = P(X = x|T(X) = t).
Fisher's Factorization criterion for sufficiency:
T is sufficient for θ ⇐⇒ f(x^n|θ) = g(T(x^n)|θ) h(x^n).
T partitions the sample space into sets τt = {x^n | T(x^n) = t}, so two x^n's in the same τt lead to the same inferences.
If T is sufficient, Frequentists like to use it for inference (esp. if minimal).
If T is sufficient then w(θ|x^n) = w(θ|t(x^n)).
Examples of Sufficiency
If X^n are IID from a full rank K natural parameter exponential family, then
T(x^n) = (Σ_{i=1}^n t1(xi), ..., Σ_{i=1}^n tK(xi)) is (minimal) sufficient for θ.
Sufficient statistics are not unique – any one-one function of a sufficient statistic is also sufficient.
A sufficient statistic T is minimal ⇐⇒ for any other sufficient statistic S, ∃g so that T = g(S).
Test for minimality: Suppose
f(x^n|θ)/f(y^n|θ) is constant as a function of θ ⇐⇒ T(x^n) = T(y^n).
Then T is minimal sufficient.
An Unusual Example
Let Xi ∼ Unif[θ, θ + 1] be IID. The joint pdf is
f(x^n|θ) = 1 if x(n) − 1 < θ < x(1), and 0 otherwise.
Check the condition of the theorem:
f(x^n|θ)/f(y^n|θ) is constant in θ ⇐⇒ x(n) = y(n) and x(1) = y(1).
So, T(X^n) = (X(1), X(n)) is minimal sufficient.
T is two-dimensional but θ is unidimensional; f is not an exponential family.
In a full rank exponential family of order K, there are K sufficient statistics.
Sufficiency and Ancillarity
One extreme: The ordered data X(1), ..., X(n) are always a sufficient statistic.
The other extreme is that exponential families are almost characterized by having a finite dimensional sufficient statistic. Suppose X^n are IID f(·|θ).
Then: fθ has support independent of θ and there is a K dimensional sufficient statistic ⇐⇒ f(·|θ) is of exponential form.
What about functions of the data that are not sufficient? The question is whether they depend on the parameter... or have other information about the parameter.
A statistic S(x^n) is ancillary ⇐⇒ the distribution of S(x^n) does not depend on θ.
Properties of Ancillarity
Example: Location-scale family: Suppose we have a RV Z with density f. Write X = σZ + µ. Now, X has density f(x|µ, σ) = (1/σ)f((x − µ)/σ).
Assume IID data from f(x|µ, σ) and let R = X(n) − X(1) be the range. The DF of R is
FR,µ(r) = Pµ(R ≤ r) = Pµ(X(n) − X(1) ≤ r) = Pµ((X(n) − µ) − (X(1) − µ) ≤ r) = Pf(Z(n) − Z(1) ≤ r),
which is constant in µ. So, R is ancillary.
In a scale family, any statistic that is a function of Xi/Xn will be ancillary. (Think of the distribution function of (1/Xn)(X1, ..., Xn−1).)
Completeness
A family of densities f(T|θ) for a statistic T is complete ⇐⇒ ∀θ, Eθ g(T) = 0 implies g = 0 a.e.
X̄ is complete for N(µ, 1).
In a full rank exponential family, T(X^n) = (Σ_{i=1}^n T1(Xi), ..., Σ_{i=1}^n TK(Xi)) is complete.
Basu's theorem: If T is complete and minimal sufficient then it is independent of every ancillary statistic.
Curious fact: The distribution of an ancillary statistic does not depend on θ, but there can be information in the ancillary about θ (Fraser-Monette example). Basu's theorem eliminates this possibility.
A minimal sufficient statistic therefore has all the information in the data about the parameter.
Natural Sufficient Statistic
Let X^n be an IID sample from an exponential family and write Tk(X^n) = Σ_{i=1}^n tk(Xi).
Suppose (w1(θ), ..., wK(θ)) contains an open set in IR^K and (T1(X^n), ..., TK(X^n)) contains an open set in IR^K.
Then: the distribution of (T1, ..., TK) as RV's is
fT(u1, ..., uK|θ) = H(u1, ..., uK) c^n(θ) e^{Σ_{k=1}^K wk(θ) uk}.
The sampling distribution of the complete, minimal sufficient summary statistics of an exponential family is of exponential form.
MOM
Suppose X^n is IID f(x|θ1, ..., θK).
Equate the first K sample moments to the first K population moments and solve for estimates of the θk's.
Write m1 = (1/n) Σ_{i=1}^n Xi, ..., mK = (1/n) Σ_{i=1}^n Xi^K.
Write µ1(θ1^K) = E(X), ..., µK(θ1^K) = E(X^K).
Solve the K equations mk = µk(θ1^K) for the K unknowns θ1^K.
N(µ, σ²): We get µ̂ = X̄ and the second moment equation (1/n) Σ_{i=1}^n Xi² = σ̂² + µ̂². So, σ̂² = (1/n) Σ_{i=1}^n Xi² − µ̂².
Xi ∼ Bin(n, p): equate X̄ = E(X) = np and m2 = E(X²) = np(1 − p) + (np)². Solving gives p̂ = X̄/n̂ and
n̂ = X̄² / (X̄ − (1/n) Σ_{i=1}^n (Xi − X̄)²)   ← can be < 0 !
Can use CLT's to get asymptotic normality and CI's.
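A sketch of the method-of-moments estimates for Bin(n, p) from the formulas above (the true n, p and the sample are illustrative assumptions); with small samples the denominator can indeed go negative.

```python
import numpy as np

rng = np.random.default_rng(7)
true_n, true_p = 12, 0.4
x = rng.binomial(true_n, true_p, size=30)      # observed counts

x_bar = x.mean()
m2_centered = np.mean((x - x_bar) ** 2)        # (1/n) sum (X_i - X_bar)^2

n_hat = x_bar**2 / (x_bar - m2_centered)       # negative if m2_centered > x_bar
p_hat = x_bar / n_hat

print(f"n_hat = {n_hat:.2f}, p_hat = {p_hat:.3f} (true: n = {true_n}, p = {true_p})")
```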
MLE's
Recall the MLE is θ̂ = arg max_θ f(x^n|θ).
Properties: Not always unique. X ∼ Unif[θ, θ + 1] with observed x = 1:
L(θ|x) = 1 for θ < 1 < θ + 1, and 0 otherwise,
so any θ ∈ (0, 1) is an MLE.
Discrete case: Xi ∼ Bin(n, p). MLE for n? Then
L(n, p|x^n) = Π_{i=1}^n C(n, xi) p^{xi} (1 − p)^{n−xi}.
Observe L(n|x^n) = 0 for n < maxi xi, so n̂ ≥ maxi xi.
Want the least n satisfying L(n|x^n) ≥ L(n − 1|x^n) and L(n + 1|x^n) ≤ L(n|x^n), then argue uniqueness.
Uniqueness Result for MLE's
May need tricks to find MLE's... can't always solve (log L(θ|x))′ = 0. Discrete parameters and boundary problems of the parameter space may be problems.
Often helpful to take logs...
In exponential families we have MLE's: Let
f(x|θ) = e^{Σ_{k=1}^K wk(θ)Tk(x) + log c(θ) + log h(x)}.
If C is the interior of the range of (w1(θ), ..., wK(θ)) ⊂ IR^K for θ ∈ Θ, and the equations Eθ Tk(X) = Tk(x^n) have a solution, say θ̂ = (θ̂1(x^n), ..., θ̂K(x^n)), for which (w1(θ̂), ..., wK(θ̂)) ∈ C,
then the MLE is unique.
Invariance
If θ̂ = MLE(θ), then for any τ(θ), MLE(τ) = τ(θ̂).
Proof: If τ is one-one and η = τ(θ), then take sup's on both sides of L*(η|x^n) = L(τ⁻¹(η)|x^n) = L(θ|x^n).
If τ is not one-one, let L*(η|x^n) = sup_{θ: τ(θ)=η} L(θ|x^n) and proceed as before.
Example: Suppose Yi = β0 + β1 Xi + ϵi, ϵi ∼ N(0, σ²).
For fixed σ, we can find MLE's for β0 and β1 by minimizing
Σ_{i=1}^n (yi − β0 − β1 xi)².
This is the exponent in the normal, so MLE(β0, β1) is the LSE. Putting β̂0 and β̂1 into L′(β̂0, β̂1, σ²|Y1^n) = 0 gives
σ̂² = (1/n) Σ_{i=1}^n (yi − β̂0 − β̂1 xi)².
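A minimal sketch of this: the least-squares fit is the MLE for (β0, β1) under normal errors, and the plug-in σ̂² is the mean squared residual. The simulated data are an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 1.5, size=n)   # true beta0=1, beta1=0.5, sigma=1.5

# Least squares = MLE for (beta0, beta1) under normal errors
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
sigma2_hat = np.mean(residuals**2)               # MLE of sigma^2 (divides by n, not n-2)

print(f"beta_hat = {beta_hat.round(3)}, sigma2_hat = {sigma2_hat:.3f}")
```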
Asymptotics of the MLE I
A sequence of estimators Wn = Wn(X^n) is consistent for τ(θ) ⇐⇒ ∀θ, Wn → τ(θ) in probability.
A sequence of estimators Wn is asymptotically efficient for τ(θ) ⇐⇒ Wn achieves the CRLB in the limit of large n:
lim_{n→∞} Varθ(Wn) / (τ′(θ)²/(nI(θ))) = 1,
where I(θ) = Eθ((∂/∂θ) log f(X|θ))².
Asymptotic normality means ∃V > 0 such that √n(θ̂ − θ) → N(0, V) in distribution.
MLE's are consistent, asymptotically normal and efficient.
Asymptotics of the MLE II
Under 'regularity' conditions we have that 1) θ̂ → θ in Pθ-probability and
2) √n(θ̂ − θ) → N(0, 1/I(θ)) in distribution.
Regularity conditions: (i) (∂/∂θ)^i f(·|θ) exists for i = 1, 2, 3; (ii) (∂/∂θ)^i f(·|θ) is bounded by integrable functions of x locally in θ for i = 1, 2, 3; (iii) E[(∂/∂θ) ln f(X|θ)]² < ∞.
h(θ̂) is also AN(h(θ), h′(θ)²/(nI(θ))).
The same result holds with I(θ) replaced by Î(θ̂) = −(1/n) Σ_{i=1}^n (∂²/∂θ²) ln f(Xi|θ̂), which converges to I(θ) in Pθ-probability.
Multivariate version too: under regularity conditions,
(θ̂1, ..., θ̂K) → (θ1, ..., θK) in Pθ-probability and
√n(θ̂ − θ) → N(0, I(θ)⁻¹) in distribution.
Best Unbiased
The MSE of an estimator θ̂ of θ is
Eθ(θ̂ − θ)² = E(θ̂ − E(θ̂))² + (E θ̂ − θ)²,
i.e., MSE = variance plus bias-squared.
Let's search the collection of unbiased estimators for the one with the smallest variance.
Definition: θ̂ is the best unbiased estimator of θ ⇐⇒ Eθ θ̂ = θ for all θ and, for every unbiased θ*, Varθ(θ̂) ≤ Varθ(θ*) for all θ.
In Poisson(λ), λ̂ = X̄ and S² both estimate Varλ(X) = λ. Can show X̄ has a lower variance than S².
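A quick simulation sketch comparing the two unbiased estimators of λ in the Poisson case: both are close to λ on average, but X̄ is visibly less variable. The λ and the sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
lam, n, reps = 4.0, 30, 20_000

x = rng.poisson(lam, size=(reps, n))
x_bar = x.mean(axis=1)                 # estimator 1: sample mean
s2 = x.var(axis=1, ddof=1)             # estimator 2: sample variance (also unbiased for lambda)

print(f"mean of X_bar = {x_bar.mean():.3f}, Var(X_bar) = {x_bar.var():.4f}")
print(f"mean of S^2   = {s2.mean():.3f}, Var(S^2)   = {s2.var():.4f}")
```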
For IID data, the smallest variance is given by the CRLB:
Varθ(Wn) ≥ ((Eθ Wn)′)² / (nI(θ)).
If X ∼ fθ is independent of Y ∼ gθ then IX,Y(θ) = IX(θ) + IY(θ).
The Fisher information for Poisson(λ) is I(λ) = 1/λ. Since λ′ = 1 and Varλ(X̄) = λ/n, we see that X̄ attains the CRLB and so is best unbiased, or UMVU.
Roughly, if the Xi's are IID f(·|θ) and W(X^n) is unbiased for τ(θ), then Wn attains the CRLB ⇐⇒ ∃a(θ) so that
a(θ)(W(x^n) − τ(θ)) = (∂/∂θ) ln f(x^n|θ).
Strange fact: The best unbiased estimator for σ² in N(µ, σ²) is (1/n) Σ(Xi − µ)², so, if µ is unknown, you can't attain the CRLB.
UMVU and Sufficiency
Rao-Blackwell Theorem: Let W be unbiased for τ(θ) and let T be sufficient for θ. Let ϕ(T) = E(W|T). Then: (1) Eϕ(T) = τ(θ) and (2) ∀θ we have Varθ(ϕ(T)) ≤ Varθ(W).
UMVUE's are unique.
Let W be unbiased for τ(θ). Then W is UMVU ⇐⇒ W is uncorrelated with every unbiased estimator of 0.
Theorem: Let T be complete and sufficient for θ and let ϕ(T) be an estimator. Then, ϕ(T) is the unique UMVUE for Eθ ϕ(T).
Lehmann-Scheffe Theorem: Let T be complete and sufficient for θ and let h(X) be unbiased for τ(θ). Then W = ϕ(T) = E(h(X)|T) is UMVUE for τ(θ).
Example
E.g., X1, ..., Xn IID Bernoulli(p), so T(X^n) = Σ Xi ∼ Bin(n, p). Let τ(p) = Pp(success). Then T is complete and sufficient, but not unbiased for τ(p).
Let h(X^n) = χ{X1 = 1}; then E(h(X^n)) = τ(p).
Theorem ⇒ ϕ(T) = E(h(X^n)|T) is UMVU for τ(p).
Work out what ϕ is:
ϕ(t) = E(h(X^n)|T = t) = P(X1 = 1|T = t).
Writing out the conditional probability and simplifying gives
ϕ(T) = C(n − 1, T − 1)/C(n, T) = T/n.
Neyman-Pearson Framework
Basic hypothesis testing problem: H : θ ∈ ΩH vs K : θ ∈ ΩK, ΩH ∩ ΩK = ∅.
Suppose we base our decision on an outcome of X.
A test is defined by a region S; x ∈ S means do not reject H while x ∈ S^c means reject H.
4 cases: θ ∈ ΩH or ΩK, and we choose H or K.
Want P(Type I error) = PH(reject H) low.
Want P(Type II error) = 1 − PK(reject H) low, i.e., the power PK(reject H) high.
The power function is Pθ(reject H), a function of θ. Want power low on ΩH and high on ΩK.
NPFL intuition
A test defined by a region S is level α ⇔ Pθ(S^c) ≤ α for all θ ∈ ΩH.
Suppose ΩH = {P0} and ΩK = {P1}. Then, we want to find points for a set S so that we can reject on S^c and
Σ_{x∈S^c} P0(x) ≤ α and Σ_{x∈S^c} P1(x) is maximal.
The most valuable points have a high value of P1(x)/P0(x). So, rank the points in decreasing order of P1(x)/P0(x) and then start putting them into S^c until you hit α in terms of P0.
This leads to the Most Powerful test in the simple-vs-simple case.
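A small sketch of this ranking construction for two hypothetical discrete distributions P0 and P1 on a few points; the pmfs and α are made up for illustration (no randomization, so the achieved size may be below α).

```python
import numpy as np

# Hypothetical pmfs on the points 0..4
p0 = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
p1 = np.array([0.05, 0.10, 0.15, 0.30, 0.40])
alpha = 0.15

# Rank points by decreasing likelihood ratio p1/p0 and add them to the rejection region
order = np.argsort(-(p1 / p0))
rejection, size = [], 0.0
for x in order:
    if size + p0[x] > alpha:           # stop: adding this point would exceed level alpha
        break
    rejection.append(x)
    size += p0[x]

power = p1[rejection].sum()
print(f"rejection region = {sorted(rejection)}, size = {size:.2f}, power = {power:.2f}")
```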
NPFL
Let P0 and P1 have pdf's p0 and p1.
For testing H : P0 vs K : P1 there is a critical function ϕ and a constant k so that (1) EP0(ϕ) = α and (2)
ϕ(x) = 1 if p1(x) > k p0(x), and ϕ(x) = 0 if p1(x) < k p0(x).
If a ϕ satisfies (1) and (2) for some k, it is MP level α for P0 vs P1.
If ϕ is an MP level α test for P0 vs P1, then ∃k so that (2) is satisfied.
MP and Sufficiency
Consider H : θ = θ0 vs K : θ = θ1. Suppose T is sufficient for θ with density g(t|θ). Then:
Any test based on T with rejection region S is an MP level α test if it satisfies
g(t|θ1) > k g(t|θ0) ⇒ t ∈ S and g(t|θ1) < k g(t|θ0) ⇒ t ∈ S^c
for some k > 0, where α = Pθ0(T ∈ S).
We rarely test point nulls against point nulls... so we want a concept of uniformly most powerful, i.e., tests that are good for composite hypotheses.
Uniform NPFL
Consider H : θ ∈ ΩH vs K : θ ∈ ΩK, with ΩH = ΩK^c, and suppose we have a test based on a sufficient statistic T with pdf g(t|θ) and rejection region S.
Suppose (1) the test is level α, (2) ∃θ0 ∈ ΩH with Pθ0(S) = α, and (3) for the same θ0 as in (2), ∀θ* ∈ ΩK there is a k > 0 so that
g(t|θ*) > k g(t|θ0) ⇒ t ∈ S and g(t|θ*) < k g(t|θ0) ⇒ t ∈ S^c.
Then: this test is UMP level α for H vs K.
This result works for many one-sided tests, in particular for one-dimensional exponential families.
Also... Monotone Likelihood Ratio, unbiasedness, etc.
Examples
We already saw that a normal prior with a normal likelihood gave a normal posterior for estimating the mean.
Let P be a class of prior densities and let F be a class of densities from a parametric family.
P is conjugate to F ⇐⇒ ∀f ∈ F, ∀w ∈ P, ∀x^n, w(θ|x^n) ∈ P. That is, the posterior is in the same class as the prior.
Suppose the Xi are IID Poisson(λ), so f(x^n|λ) = λ^{nx̄} e^{−nλ} / Π_{i=1}^n xi!. Let λ ∼ Gamma(α, β). Then the joint density is
h(x^n, λ) = λ^{nx̄+α−1} e^{−λ(n+1/β)} / (Γ(α) β^α Π xi!).
So, w(λ|x^n) is a Gamma(nx̄ + α, (n + 1/β)⁻¹).
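A sketch of this Poisson-Gamma update; the hyperparameters α, β (scale parameterization, as above) and the simulated counts are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
lam_true, n = 3.0, 50
x = rng.poisson(lam_true, size=n)

alpha0, beta0 = 2.0, 1.0                     # prior: lambda ~ Gamma(alpha0, scale=beta0)
alpha_post = alpha0 + x.sum()                # = alpha0 + n * x_bar
beta_post = 1.0 / (n + 1.0 / beta0)          # posterior scale

post = stats.gamma(a=alpha_post, scale=beta_post)
print(f"posterior mean = {post.mean():.3f}, 95% credible set = {post.interval(0.95)}")
```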
Conjugacy and Exponential Families
Recall the exponential family
f(x^n|θ) = e^{Σ_{k=1}^K ck(θ) Σ_{i=1}^n Tk(xi) + Σ_{i=1}^n S(xi) + n d(θ)}.
We obtain a conjugate prior by using the sufficient statistics and n.
Let tk = Σ_{i=1}^n Tk(xi) for k = 1, ..., K and tK+1 = n.
Let w(t1, ..., tK+1) = ∫ e^{Σ_{k=1}^K ck(θ)tk + tK+1 d(θ)} dθ.
Let Ω = {t1^{K+1} | w(t1^{K+1}) < ∞}.
The K + 1 parameter exponential family given by
w(θ|t1^{K+1}) = e^{Σ_{k=1}^K ck(θ)tk + tK+1 d(θ) − ln w(t1^{K+1})}
is conjugate to f(x^n|θ); w(t1^{K+1}) is the normalizing constant, and the tk's are held fixed as hyperparameters rather than computed from the data.
Basic Setup
If you have a good idea about the various costs of the ways to be wrong, you can make decisions that minimize the average cost of errors.
We have a model f(·|θ), a parameter space Θ, and data x from a sample space. When we get data we make a decision about where θ is by using a rule δ. The collection of actions we allow ourselves is A, the action space. We measure cost by a loss function L : Θ × A → IR.
A can be Θ (estimation) or {accept H, reject H} (testing).
L(θ, δ(x)) is the cost of δ(x) when θ is the state of nature.
How to choose a good δ?
Try averaging: R(θ, δ) = Eθ L(θ, δ(X)), the expected loss.
Bayes Optimality
R(θ, δ) is called the risk of δ.
Smaller risk is better, so we can compare curves gδ(θ) = R(θ, δ) for various δ. If the curve for a δ does not admit a uniform improvement then δ is called admissible. This is hard to work with.
Assume a prior w(θ) and form the Bayes risk:
R(w, δ) = Ew R(Θ, δ) = ∫ w(θ)R(θ, δ)dθ = ∫∫ m(x)w(θ|x)L(θ, δ(x))dxdθ = Em[Ew(·|x) L(Θ, δ(X))].
The posterior risk (in [ ]'s) gives the Bayes estimator:
δB = arg min_δ R(w, δ) = arg min_δ R(w, δ|x).
The Bayes risk or posterior risk (they are equivalent for choosing δ) depends on w, not θ.
An alternative is the minimax (mM) approach: minimize the maximum risk. That is, find
δ = arg min_δ max_θ R(θ, δ), with value RmM.
Or, maximize the minimum Bayes risk (maximin, Mm):
w = arg max_w min_δ R(w, δ), with value RMm.
Both are global criteria; the Mm risk is based on the Bayes estimator δB.
RMm ≤ RmM. The Game Theorem asserts they are equal.
Two Loss Functions
Given a loss function we can, in principle, work out the Bayes estimator (and the mM, Mm, admissible estimators, etc.).
Suppose L(θ, δ) = (θ − δ)². Then
arg min_δ ∫ w(θ|x)(θ − δ(x))² dθ = E(Θ|x).
Suppose L(θ, δ) = |θ − δ|. Then
arg min_δ ∫ w(θ|x)|θ − δ(x)| dθ = median(w(θ|x)).
Other loss functions give percentiles (asymmetric absolute value) or m(x) (relative entropy), LINEX loss, etc.
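A small numeric sketch of these two Bayes estimators on a posterior sample: the posterior mean minimizes posterior expected squared error and the posterior median minimizes posterior expected absolute error. The skewed stand-in "posterior" draws are an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(11)
theta_draws = rng.gamma(shape=2.0, scale=1.5, size=100_000)   # stand-in posterior sample (skewed)

post_mean = theta_draws.mean()
post_median = np.median(theta_draws)

# Check numerically over a grid of candidate estimates delta
deltas = np.linspace(0.5, 6.0, 400)
sq_risk = [np.mean((theta_draws - d) ** 2) for d in deltas]
abs_risk = [np.mean(np.abs(theta_draws - d)) for d in deltas]

print(f"posterior mean   = {post_mean:.3f},  argmin squared-error risk ≈ {deltas[np.argmin(sq_risk)]:.3f}")
print(f"posterior median = {post_median:.3f},  argmin absolute-error risk ≈ {deltas[np.argmin(abs_risk)]:.3f}")
```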
Generalized 0-1 Loss
Bayes testing is the Bayes action under generalized 0-1 loss, i.e., it has the smallest Bayes risk.
Consider H : θ ∈ ΩH vs. K : θ ∈ ΩK. Let a0 (a1) be the action that we choose H (K). So, given a rule δ, the acceptance region for H is {x | δ(x) = a0}.
So, define the loss
L(θ, a) = 0 if θ ∈ ΩH and a = a0; 0 if θ ∈ ΩK and a = a1; c1 if θ ∈ ΩH and a = a1; c2 if θ ∈ ΩK and a = a0.
c1 (c2) is the cost of a Type I (II) error.
B. Clarke
.
.
.
.
Review of Bayesian and Frequentist Statistics
.
.
Basic Definitions
Models of Variability
The Normal Example
Suffiency and Exponential Families
Main Frequentist Estimators
Bayesian Estimation
Conjugate Priors
Decision Theory
Testing
Risk of a Test
Let R = {x | δ(x) = a1} and let β(θ) = Pθ(X ∈ R). Then
R(θ, δ) = c1 β(θ) for θ ∈ ΩH and c2(1 − β(θ)) for θ ∈ ΩK.
So, if c1 = c2 = 1 we have the usual power function of a Frequentist test.
Theorem: Under generalized 0-1 loss, any test of the form 'reject H : θ ∈ ΩH when W(ΩH|x) < c2/(c1 + c2)' is Bayes optimal.
That is, δB(x) = χ(W(ΩH|x) < c2/(c1 + c2)) is the Bayes optimal test and achieves min_δ R(w, δ|x).
A Final Example
Suppose X ∼ N(θ, σ²) and w is N(µ, τ²) with µ, σ and τ known. Let η = σ²/(σ² + τ²).
(Θ|X̄) has distribution N(E(Θ|X̄), Var(Θ|X̄)) = N((1 − η)x̄ + ηµ, τ²η).
Test H : θ ≥ θ0 vs K : θ < θ0.
W(H|x) = W(Θ ≥ θ0|x) = P(N(0, 1) > (θ0 − (1 − η)x̄ − ηµ)/(τ√η) | x̄).
Now
W(H|x) ≤ c2/(c1 + c2) = α ∈ (0, 1)
⇐⇒ (θ0 − (1 − η)x̄ − ηµ)/(τ√η) ≥ z1−α
⇐⇒ x̄ ≤ (θ0 − ηµ − z1−α τ√η)/(1 − η).
Thus, rejecting H for small values of x̄ is Bayes optimal, where the threshold depends on c1, c2 and the prior.