Estimation
COMP 245 STATISTICS
Dr N A Heard
Contents

1 Parameter Estimation
  1.1 Introduction
  1.2 Estimators

2 Point Estimates
  2.1 Introduction
  2.2 Bias, Efficiency and Consistency
  2.3 Maximum Likelihood Estimation

3 Confidence Intervals
  3.1 Introduction
  3.2 Normal Distribution with Known Variance
  3.3 Normal Distribution with Unknown Variance
1 Parameter Estimation

1.1 Introduction
In statistics we typically analyse a set of data by considering it as a random sample from a
larger, underlying population about which we wish to make inference.
1. The chapter on numerical summaries considered various summary sample statistics for describing a particular sample of data. We defined quantities such as the sample mean x̄ and sample variance s².
2. The chapters on random variables, on the other hand, were concerned with characterising the underlying population. We defined corresponding population parameters such as
the population mean E( X ), and population variance Var( X ).
We noticed a duality between the two sets of definitions of statistics and parameters.
In particular, we saw that they were equivalent in the extreme circumstance that our sample
exactly represented the entire population (so, for example, the cdf of a new randomly drawn
member of the population is precisely the empirical cdf of our sample).
Away from this extreme circumstance, the sample statistics can be seen to give approximate
values for the corresponding population parameters. We can use them as estimates.
For convenient modelling of populations (point 2), we met several simple parameterised probability distributions (e.g. Poi(λ), Exp(λ), U(a, b), N(µ, σ²)). There, population parameters such as mean and variance are functions of the distribution parameters. So more generally, we may wish to use the data, or just their sample statistics, to estimate distribution parameters.
For a sample of data x = ( x1 , . . . , xn ), we can consider these observed values as realisations
of corresponding random variables X = ( X1 , . . . , Xn ).
If the underlying population (from which the sample has been drawn) is such that the
distribution of a single random draw X has probability distribution PX |θ (·|θ ), where θ is a
generic parameter or vector of parameters, we typically then assume that our n data point
random variables X are i.i.d. PX |θ (·|θ ).
Note that PX |θ (·|θ ) is the conditional distribution for draws from our model for the population
given the true (but unknown) parameter values θ.
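As a small illustration of this setup (not part of the original notes), the following Python sketch draws an i.i.d. sample from a parameterised population model, echoing the exponential example used below, and compares sample statistics with the corresponding population parameters. It assumes numpy is available; lambda_true and the sample size are arbitrary choices for the demonstration.

# Illustrative sketch: sample statistics approximate population parameters.
import numpy as np

rng = np.random.default_rng(0)
lambda_true = 2.0                                     # population parameter theta
n = 1_000                                             # sample size
x = rng.exponential(scale=1 / lambda_true, size=n)    # i.i.d. draws X_1, ..., X_n

print("sample mean x_bar:", x.mean(), " vs population mean 1/lambda:", 1 / lambda_true)
print("sample variance s^2:", x.var(ddof=1), " vs population variance 1/lambda^2:", 1 / lambda_true**2)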
1.2 Estimators
Statistics, Estimators and Estimates
Consider a sequence of random variables X = ( X1 , . . . , Xn ) corresponding to n i.i.d. data
samples to be drawn from a population with distribution PX . Let x = ( x1 , . . . , xn ) be the
corresponding realised values we observe for these r.v.s.
A statistic is a function T : ℝⁿ → ℝᵖ applied to the random variable X. Note that T(X) = T(X1, . . . , Xn) is itself a random variable. For example, X̄ = ∑_{i=1}^n Xi / n is a statistic.
The corresponding realised value of a statistic T is written t = t( x ) (e.g. t = x̄).
If a statistic T ( X ) is to be used to approximate parameters of the distribution PX |θ (·|θ ), we
say T is an estimator for those parameters; we call the actual realised value of the estimator
for a particular data sample, t( x ), an estimate.
2 Point Estimates

2.1 Introduction
Definition
A point estimate is a statistic estimating a single parameter or characteristic of a distribution.
For a running example which we will return to, consider a sample of data (x1, . . . , xn) from an Exponential(λ) distribution with unknown λ; we might construct a point estimate for either λ itself, or perhaps for the mean of the distribution (= λ⁻¹), or the variance (= λ⁻²).
Concentrating on the mean of the distribution in this example, a natural estimator of this
could be the sample mean, X̄. But alternatively, we could propose simply the first data point
we observe, X1 as our point estimator; or, if the data had been given to us already ordered we
might (lazily) suggest the sample median, X({n+1}/2) .
How do we quantify which estimator is better?
Sampling Distribution
Suppose for a moment we actually knew the parameter values θ of our population distribution PX |θ (·|θ ) (so suppose we know λ in our exponential example).
Then since our sampled data are considered to be i.i.d. realisations from this distribution
(so each Xi ∼ Exp(λ) in that example), it follows that any statistic T = T ( X1 , . . . , Xn ) is also a
random variable with some distribution which also only depends on these parameters.
If we are able to (approximately) identify this sampling distribution of our statistic, call it
PT |θ , we can then find the conditional expectation, variance, etc of our statistic.
Sometimes P_{T|θ} will have a convenient closed-form expression which we can derive, but in other situations it will not.
In those other situations, for the special case where our statistic T is the sample mean, and provided that our sample size n is large, we can use the CLT to give us an approximate distribution for P_{T|θ}. Whatever the form of the population distribution P_{X|θ}, we know from the CLT that approximately X̄ ∼ N(E[X], Var[X]/n).
For our Xi ∼ Exp(λ) example, it can be shown that the statistic T = X̄ is a continuous
random variable with pdf
$$ f_{T|\lambda}(t \mid \lambda) = \frac{(n\lambda)^n t^{n-1} e^{-n\lambda t}}{(n-1)!}, \qquad t > 0. $$
This is actually the pdf of a Gamma(n, nλ) random variable, a well known continuous
distribution, so T ∼ Gamma(n, nλ).
So using the fact that Gamma(α, β) has expectation α/β, here we have

$$ E(\bar{X}) = E_{T|\lambda}(T \mid \lambda) = \frac{n}{n\lambda} = \frac{1}{\lambda} = E(X). $$

So the expected value of X̄ is the true population mean.
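This Gamma(n, nλ) sampling distribution for X̄ can also be checked empirically. A possible Python sketch, assuming numpy and scipy are available (the constants are arbitrary choices):

# Sketch: compare simulated X_bar values with the Gamma(n, n*lambda) theory.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lam, n, reps = 2.0, 20, 50_000
xbars = rng.exponential(scale=1 / lam, size=(reps, n)).mean(axis=1)   # many realised X_bar values

gamma_dist = stats.gamma(a=n, scale=1 / (n * lam))    # Gamma(n, n*lambda), i.e. rate n*lambda
print("simulated mean of X_bar:", xbars.mean(), "  theory:", gamma_dist.mean())   # both approx 1/lambda
print("simulated var of X_bar: ", xbars.var(),  "  theory:", gamma_dist.var())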
2.2 Bias, Efficiency and Consistency
Bias
The previous result suggests that X̄ is, at least in one respect, a good statistic for estimating
the unknown mean of an exponential distribution.
Formally, we define the bias of an estimator T for a parameter θ,
bias( T ) = E[ T |θ ] − θ.
If an estimator has zero bias we say that estimator is unbiased. So in our example, X̄ gives an unbiased estimate of θ = λ⁻¹, the mean of an exponential distribution.
[In contrast, the sample median is a biased estimator of the mean of an exponential distribution. For example, if n = 3, it can be shown that E(X_((n+1)/2)) = 5/(6λ).]
In fact, the unbiasedness of X̄ is true for any distribution; the sample mean x̄ will always
be an unbiased estimate for the population mean µ:
$$ E(\bar{X}) = E\left( \frac{\sum_{i=1}^n X_i}{n} \right) = \frac{\sum_{i=1}^n E(X_i)}{n} = \frac{n\mu}{n} = \mu. $$
Similarly, there is an estimator for the population variance σ² which is unbiased, irrespective of the population distribution. Disappointingly, this is not the sample variance
$$ S^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2, $$

as this has one too many degrees of freedom. (Note that if we knew the population mean µ, then $\frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2$ would be unbiased for σ².)
Bias-Corrected Sample Variance
However, we can instead define the bias-corrected sample variance,
$$ S_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2, $$

which is then always an unbiased estimator of the population variance σ².
Warning: Because of its usefulness as an unbiased estimate of σ², many statistical textbooks and software packages (and indeed your formula sheet for the exam!) refer to $s_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$ as the sample variance.
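The effect of the bias correction is easy to see in simulation. A possible Python sketch, assuming numpy is available; µ, σ, the sample size and the number of replicates are arbitrary:

# Sketch: average the biased and bias-corrected sample variances over many small samples.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 0.0, 3.0, 5, 100_000
samples = rng.normal(mu, sigma, size=(reps, n))

s2_biased = samples.var(axis=1, ddof=0)       # divides by n       (S^2)
s2_unbiased = samples.var(axis=1, ddof=1)     # divides by n - 1   (S^2_{n-1})
print("true sigma^2:", sigma**2)
print("mean of S^2 (biased):        ", s2_biased.mean())     # approx sigma^2 * (n-1)/n
print("mean of S^2_{n-1} (unbiased):", s2_unbiased.mean())   # approx sigma^2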
Efficiency
Suppose we have two unbiased estimators for a parameter θ, which we will call Θ̂( X ) and
Θ̃( X ). And again suppose we have the corresponding sampling distributions for these estimators, PΘ̂|θ and PΘ̃|θ , and so can calculate their means, variances, etc.
Then we say Θ̂ is more efficient than Θ̃ if:
1. ∀θ, Var_{Θ̂|θ}(Θ̂ | θ) ≤ Var_{Θ̃|θ}(Θ̃ | θ);
2. ∃θ s.t. Var_{Θ̂|θ}(Θ̂ | θ) < Var_{Θ̃|θ}(Θ̃ | θ).
That is, the variance of Θ̂ is never higher than that of Θ̃, no matter what the true value of θ
is; and for some value of θ, Θ̂ has a strictly lower variance than Θ̃.
If Θ̂ is more efficient than any other possible estimator, we say Θ̂ is efficient.
Example
Suppose we have a population with mean µ and variance σ², from which we are to obtain a random sample X1, . . . , Xn.
Consider two estimators for µ, M̂ = X̄, the sample mean, and M̃ = X1 , the first observation
in the sample.
Well we have seen E( X̄ ) = µ always, and certainly E( X1 ) = µ, so both estimators are
unbiased.
We also know Var(X̄) = σ²/n, and of course Var(X1) = σ², independent of µ. So if n ≥ 2, M̂ is more efficient than M̃ as an estimator of µ.
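A short simulation makes the comparison concrete. The Python sketch below (not from the notes) assumes numpy; the normal population and the constants are arbitrary illustrative choices:

# Sketch: compare the two unbiased estimators M_hat = X_bar and M_tilde = X_1.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 10.0, 2.0, 25, 100_000
samples = rng.normal(mu, sigma, size=(reps, n))

m_hat = samples.mean(axis=1)      # sample mean of each replicate
m_tilde = samples[:, 0]           # first observation of each replicate
print("Var(M_hat)  ~", m_hat.var(),   "  (theory sigma^2/n =", sigma**2 / n, ")")
print("Var(M_tilde)~", m_tilde.var(), "  (theory sigma^2   =", sigma**2, ")")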
Consistency
In the previous example, the worst aspect of the estimate M̃ = X1 is that it does not change,
let alone improve, no matter how large a sample n of data is collected. In contrast, the variance
of M̂ = X̄ gets smaller and smaller as n increases.
Technically, we say an estimator Θ̂ is a consistent estimator for the parameter θ if Θ̂ converges in probability to θ. That is, ∀ε > 0, P(|Θ̂ − θ| > ε) → 0 as n → ∞.
This is hard to demonstrate, but if Θ̂ is unbiased we do have:

$$ \lim_{n \to \infty} \operatorname{Var}(\hat{\Theta}) = 0 \;\Rightarrow\; \hat{\Theta} \text{ is consistent.} $$
So returning to our example, we see X̄ is a consistent estimator of µ for any underlying population.
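Consistency can also be seen empirically: the proportion of samples whose mean lands further than ε from µ shrinks as n grows. A possible Python sketch, assuming numpy; λ, ε and the replicate counts are arbitrary:

# Sketch: estimate P(|X_bar - mu| > eps) for increasing sample sizes n.
import numpy as np

rng = np.random.default_rng(4)
lam, eps, reps = 2.0, 0.05, 2_000
mu = 1 / lam                                      # population mean of Exp(lambda)
for n in (10, 100, 1000, 5000):
    xbars = rng.exponential(scale=1 / lam, size=(reps, n)).mean(axis=1)
    print(n, np.mean(np.abs(xbars - mu) > eps))   # proportion of estimates further than eps from mu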
2.3 Maximum Likelihood Estimation
Deriving an Estimator
We have met different criteria for measuring the relative quality of rival estimators, but no
principled manner for deriving these estimates.
There are many ways of deriving estimators, but we shall concentrate on just one: maximum likelihood estimation.
Likelihood
If the underlying population is a discrete distribution with an unknown parameter θ, then the samples Xi are i.i.d. with probability mass function p_{X|θ}(xi).
Since the n data samples are independent, the joint probability of all of the data, x = (x1, . . . , xn), is

$$ L(\theta \mid x) = \prod_{i=1}^n p_{X|\theta}(x_i). $$
The function L(θ | x) is called the likelihood function and is considered as a function of the parameter θ for a fixed sample of data x. L(θ | x) is the probability of observing the data we have, x, if the true parameter were θ.
If, on the other hand, the underlying population is a continuous distribution with an unknown parameter θ, then the samples Xi are i.i.d. with probability density function f_{X|θ}(xi), and the likelihood function is defined by

$$ L(\theta \mid x) = \prod_{i=1}^n f_{X|\theta}(x_i). $$
Maximising the Likelihood
Clearly, for a fixed set of data, varying the population parameter θ would give different
probabilities of observing these data, and hence different likelihoods.
Maximum likelihood estimation seeks to find the parameter value θ̂ MLE which maximises
the likelihood function,
$$ \hat{\theta}_{\mathrm{MLE}} = \underset{\theta}{\operatorname{arg\,sup}}\, L(\theta \mid x). $$
This value θ̂ MLE is known as the maximum likelihood estimate (MLE).
Log-Likelihood Function
For maximising the likelihood function, it often proves more convenient to consider the log-likelihood, ℓ(θ | x) = log{L(θ | x)}. Since log(·) is a monotonic increasing function, the argument θ maximising ℓ maximises L.
The log-likelihood function can be written as
$$ \ell(\theta \mid x) = \sum_{i=1}^n \log\{ p_{X|\theta}(x_i) \} \qquad \text{or} \qquad \ell(\theta \mid x) = \sum_{i=1}^n \log\{ f_{X|\theta}(x_i) \}, $$
for discrete and continuous distributions respectively.
In either case, solving ∂ℓ(θ)/∂θ = 0 yields the MLE if ∂²ℓ(θ)/∂θ² < 0.
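For the running Exp(λ) example, this recipe gives the closed-form MLE λ̂ = 1/x̄. The Python sketch below (not part of the notes, assuming numpy and scipy and using arbitrary simulated data) maximises the exponential log-likelihood numerically and compares the answer with that closed form:

# Sketch: numerical MLE for lambda in an Exp(lambda) sample vs the closed form 1/x_bar.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
x = rng.exponential(scale=1 / 2.0, size=200)           # data with true lambda = 2

def neg_log_lik(lam):
    # ell(lambda | x) = n*log(lambda) - lambda * sum(x) for an Exp(lambda) sample
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100.0), method="bounded")
print("numerical MLE:", res.x, "  closed form 1/x_bar:", 1 / x.mean())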
MLE Example: Binomial(10, p)
Suppose we have a population which is Binomial(10, p), where the probability parameter
p is unknown.
Then suppose we draw a sample of size 100 from the population — that is, we observe 100
independent Binomial(10, p) random variables X1 , X2 , . . . , X100 — and obtain the data summarised in the table below.
x        0    1    2    3    4    5    6    7    8    9   10
Freq.    2   16   35   22   21    3    1    0    0    0    0

Each of our Binomial(10, p) samples Xi has pmf

$$ p_X(x_i) = \binom{10}{x_i} p^{x_i} (1-p)^{10-x_i}, \qquad i = 1, 2, \ldots, 100. $$
Since the n = 100 data samples are assumed independent, the likelihood function for p for
all of the data is
$$ L(p) = \prod_{i=1}^n p_X(x_i) = \prod_{i=1}^n \binom{10}{x_i} p^{x_i} (1-p)^{10-x_i} = \left\{ \prod_{i=1}^n \binom{10}{x_i} \right\} p^{\sum_{i=1}^n x_i} (1-p)^{10n - \sum_{i=1}^n x_i}. $$
So the log-likelihood is given by
$$ \ell(p) = \log\left\{ \prod_{i=1}^n \binom{10}{x_i} \right\} + \log(p) \sum_{i=1}^n x_i + \log(1-p) \left( 10n - \sum_{i=1}^n x_i \right). $$
Notice the first term in this equation is constant wrt p, so finding the value of p maximising ℓ(p) is equivalent to finding p maximising

$$ \log(p) \sum_{i=1}^n x_i + \log(1-p) \left( 10n - \sum_{i=1}^n x_i \right), $$

and this becomes apparent when we differentiate ℓ wrt p, with the constant term differentiating to zero:

$$ \frac{\partial}{\partial p} \ell(p) = 0 + \frac{\sum_{i=1}^n x_i}{p} - \frac{10n - \sum_{i=1}^n x_i}{1-p}. $$
Setting this derivative equal to zero, we get
$$ \frac{\sum_{i=1}^n x_i}{p} - \frac{10n - \sum_{i=1}^n x_i}{1-p} = 0 $$
$$ \Rightarrow (1-p) \sum_{i=1}^n x_i = p \left( 10n - \sum_{i=1}^n x_i \right) $$
$$ \Rightarrow \sum_{i=1}^n x_i = p \left( 10n - \sum_{i=1}^n x_i + \sum_{i=1}^n x_i \right) $$
$$ \Rightarrow p = \frac{\sum_{i=1}^n x_i}{10n} = \frac{\bar{x}}{10}. $$
To check this point is a maximum of ℓ, we require the second derivative

$$ \frac{\partial^2}{\partial p^2} \ell(p) = -\frac{\sum_{i=1}^n x_i}{p^2} - \frac{10n - \sum_{i=1}^n x_i}{(1-p)^2} = -\frac{n\bar{x}}{p^2} - \frac{10n - n\bar{x}}{(1-p)^2} = -n \left( \frac{\bar{x}}{p^2} + \frac{10 - \bar{x}}{(1-p)^2} \right) $$

(which is in fact < 0 ∀p; the likelihood is log-concave).

Substituting p = x̄/10, this gives

$$ -100n \left( \frac{1}{\bar{x}} + \frac{1}{10 - \bar{x}} \right) = -\frac{1000n}{(10 - \bar{x})\bar{x}}, $$

which is clearly < 0. So the MLE for p is x̄/10 = 0.257.
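The arithmetic can be reproduced directly from the frequency table. A small Python check, assuming numpy is available:

# Sketch: recover x_bar = 2.57 and p_hat = 0.257 from the tabulated frequencies.
import numpy as np

x_values = np.arange(11)                                   # 0, 1, ..., 10
freqs = np.array([2, 16, 35, 22, 21, 3, 1, 0, 0, 0, 0])    # observed frequencies, n = 100
n = freqs.sum()

x_bar = (x_values * freqs).sum() / n
p_hat = x_bar / 10
print("x_bar =", x_bar, "  p_hat =", p_hat)                # 2.57 and 0.257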
Second MLE Example: N(µ, σ²)
If X ∼ N(µ, σ²), then

$$ f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}. $$
So for an i.i.d. sample x = (x1, . . . , xn), the likelihood function for (µ, σ²) for all of the data is

$$ L(\mu, \sigma^2) = \prod_{i=1}^n f_X(x_i) = (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left\{ -\frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2} \right\}. $$
The log-likelihood is

$$ \ell(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi) - n\log(\sigma) - \frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2}. $$
If we are interested in the MLE for µ, we can take the partial derivative wrt µ and set this
equal to zero.
$$ 0 = \frac{\partial}{\partial \mu} \ell(\mu, \sigma^2) = \frac{\sum_{i=1}^n (x_i - \mu)}{\sigma^2} $$
$$ \iff 0 = \sum_{i=1}^n (x_i - \mu) = \sum_{i=1}^n x_i - n\mu \iff \mu = \frac{\sum_{i=1}^n x_i}{n} = \bar{x}. $$
To check this is a maximum, we look at the second derivative.
$$ \frac{\partial^2}{\partial \mu^2} \ell(\mu, \sigma^2) = \frac{-n}{\sigma^2}, $$

which is negative everywhere, so x̄ is the MLE for µ, independently of the value of σ².
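As an optional check (not in the original notes), a numerical maximum likelihood fit of a normal distribution returns the sample mean as its location estimate. The sketch below assumes numpy and scipy and uses arbitrary simulated data:

# Sketch: the MLE location from a numerical normal fit equals the sample mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(loc=5.0, scale=2.0, size=500)

mu_mle, sigma_mle = stats.norm.fit(x)       # maximum likelihood fit of N(mu, sigma^2)
print("MLE for mu:", mu_mle, "  sample mean x_bar:", x.mean())   # equal up to numerical precision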
CLT significance:
This result is particularly useful. We have already seen that X̄ is always an unbiased estimator for the population mean µ. And now, using the CLT, we also have that for large n, X̄ is approximately the MLE for the population mean µ, irrespective of the distribution of X.
Properties of the MLE
So how good an estimator of θ is the MLE?
- The MLE is not necessarily unbiased.
+ For large n, the MLE is approximately normally distributed with mean θ.
+ The MLE is consistent.
+ The MLE is always asymptotically efficient, and if an efficient estimator exists, it is the
MLE.
+ Because it is derived from the likelihood of the data, it is well-principled. This is the
“likelihood principle”, which asserts that all the information about a parameter from a set
of data is held in the likelihood.
3 Confidence Intervals

3.1 Introduction
Uncertainty of Estimates
In most circumstances, it will not be sufficient to report simply a point estimate θ̂ for an
unknown parameter θ of interest. We would usually want to also quantify our degree of uncertainty in this estimate. If we were to repeat the whole data collection and parameter estimation
procedure, how different an estimate might we have obtained?
If we were again to suppose we had knowledge of the true value of our unknown parameter θ, or at least had access to the (approximate) true sampling distribution of our statistic, PT |θ ,
then the variance of this distribution would give such a measure.
The solution we consider is to plug our estimated value θ̂ of θ into P_{T|θ} and hence use the (maybe further) approximated sampling distribution, P_{T|θ̂}.
Confidence Intervals
In particular, we know by the CLT that for any underlying distribution (mean µ, variance σ²) for our sample, the sample mean X̄ is approximately normally distributed with mean µ and variance σ²/n. We now further approximate this by imagining X̄ ∼ N(x̄, σ²/n).
Then, if we knew the true population variance σ², we would be able to say that had the true mean parameter µ been x̄, then for large n, from our standard normal tables, with 95% probability we would have observed our statistic X̄ within the interval

$$ \left( \bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\; \bar{x} + 1.96\frac{\sigma}{\sqrt{n}} \right). $$
This is known as the 95% confidence interval for µ.
More generally, for any desired coverage probability level 1 − α we can define the 100(1 − α)% confidence interval for µ by

$$ \left( \bar{x} - z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}},\; \bar{x} + z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right), $$

where z_α is the α-quantile of the standard normal (so before we used α = 0.05 and hence z_{0.975} to obtain our 95% C.I.).

Loose interpretation: Amongst all the possible intervals x̄ ± z_{1−α/2} σ/√n we might have observed (that is, from samples we might have drawn), 95% would have contained the unknown true parameter value µ.
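This interval is straightforward to compute in code. A possible Python helper, assuming scipy is available; the function name normal_ci and the example inputs are illustrative only:

# Sketch: 100(1 - alpha)% confidence interval for mu when sigma is treated as known.
from math import sqrt
from scipy.stats import norm

def normal_ci(x_bar, sigma, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)                  # z_{1 - alpha/2}
    half_width = z * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width

print(normal_ci(x_bar=0.0, sigma=1.0, n=100, alpha=0.05))    # about (-0.196, 0.196)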
Example
A corporation conducts a survey to investigate the proportion of employees who thought
the board was doing a good job. 1000 employees, randomly selected, were asked, and 732 said
they did. Find a 99% confidence interval for the value of the proportion in the population who
thought the board was doing a good job.
Clearly each observation Xi ∼ Bernoulli( p) for some unknown p, and we want to find a
C.I. for p, which is also the mean of Xi .
We have our estimate p̂ = x̄ = 0.732 for which we can use the CLT. Since the variance of
Bernoulli( p) is p(1 − p), we can use x̄ (1 − x̄ ) = 0.196 as an approximate variance.
So an approximate 99% C.I. is

$$ \left[ 0.732 - 2.576 \times \sqrt{\frac{0.196}{1000}},\; 0.732 + 2.576 \times \sqrt{\frac{0.196}{1000}} \right]. $$
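Evaluating this interval numerically, for instance with the short Python sketch below (assuming scipy is available), gives roughly [0.696, 0.768]:

# Sketch: numerical value of the approximate 99% C.I. for the survey proportion.
from math import sqrt
from scipy.stats import norm

p_hat, n = 0.732, 1000
var_hat = p_hat * (1 - p_hat)                   # approx 0.196, the plug-in Bernoulli variance
z = norm.ppf(0.995)                             # approx 2.576
half_width = z * sqrt(var_hat / n)
print(p_hat - half_width, p_hat + half_width)   # roughly (0.696, 0.768)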
3.2 Normal Distribution with Known Variance

Confidence Intervals for N(µ, σ²) Samples - σ² Known
The CI given in the Bernoulli(p) example was only an approximate interval, relying on the Central Limit Theorem, and also assuming the population variance σ² was known.
However, if we in fact know that X1, . . . , Xn are an i.i.d. sample from N(µ, σ²), then we have exactly

$$ \bar{X} \sim N\left( \mu, \frac{\sigma^2}{n} \right), \quad \text{or} \quad \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1). $$

In which case,

$$ \left( \bar{x} - z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}},\; \bar{x} + z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right) $$

is an exact confidence interval for µ, assuming we know σ².
3.3 Normal Distribution with Unknown Variance

Confidence Intervals for N(µ, σ²) Samples - σ² Unknown
However, in any applied example where we are aiming to fit a normal distribution model to real data, it will usually be the case that both µ and σ² are unknown.
If again we have X1, . . . , Xn as an i.i.d. sample from N(µ, σ²), but with σ² now unknown, then we have exactly

$$ \frac{\bar{X} - \mu}{s_{n-1}/\sqrt{n}} \sim t_{n-1}, $$

where $s_{n-1} = \sqrt{\frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}}$ is the bias-corrected sample standard deviation, and t_ν is the Student's t distribution with ν degrees of freedom.

Then it follows that an exact 100(1 − α)% confidence interval for µ is

$$ \left( \bar{x} - t_{n-1,1-\frac{\alpha}{2}}\frac{s_{n-1}}{\sqrt{n}},\; \bar{x} + t_{n-1,1-\frac{\alpha}{2}}\frac{s_{n-1}}{\sqrt{n}} \right), $$

where t_{ν,α} is the α-quantile of t_ν.
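A possible Python sketch of this t-based interval, assuming numpy and scipy are available; the function name t_ci and the simulated data are illustrative only:

# Sketch: exact 100(1 - alpha)% t interval for mu from a normal sample with unknown variance.
import numpy as np
from scipy.stats import t

def t_ci(x, alpha=0.05):
    n = len(x)
    x_bar = np.mean(x)
    s = np.std(x, ddof=1)                         # bias-corrected sample standard deviation s_{n-1}
    t_quantile = t.ppf(1 - alpha / 2, df=n - 1)   # t_{n-1, 1 - alpha/2}
    half_width = t_quantile * s / np.sqrt(n)
    return x_bar - half_width, x_bar + half_width

rng = np.random.default_rng(7)
print(t_ci(rng.normal(loc=10.0, scale=3.0, size=30)))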
[Figure: pdfs of t_ν for ν = 1, 5, 10, 30, together with the N(0, 1) pdf, plotted for x between −3 and 3.]
Notes:
• tν is heavier tailed than N(0, 1) for any number of degrees of freedom ν.
• Hence the t distribution CI will always be wider than the Normal distribution CI. So if we know σ², we should use it.
• limν→∞ tν ≡ N(0, 1).
• For ν > 40 the difference between tν and N(0, 1) is so insignificant that the t distribution
is not tabulated beyond this many degrees of freedom, and so there we can instead revert
to N(0, 1) tables.
Table of t_{ν,α}

  ν     α = 0.95   α = 0.975   α = 0.99   α = 0.995
  1       6.31       12.71       31.82       63.66
  2       2.92        4.30        6.96        9.92
  3       2.35        3.18        4.54        5.84
  4       2.13        2.78        3.75        4.60
  5       2.02        2.57        3.36        4.03
  6       1.94        2.45        3.14        3.71
  7       1.89        2.36        3.00        3.50
  8       1.86        2.31        2.90        3.36
  9       1.83        2.26        2.82        3.25
 10       1.81        2.23        2.76        3.17
 12       1.78        2.18        2.68        3.05
 15       1.75        2.13        2.60        2.95
 20       1.72        2.09        2.53        2.85
 25       1.71        2.06        2.48        2.78
 40       1.68        2.02        2.42        2.70
  ∞       1.645       1.96        2.326       2.576
Example
A random sample of 100 observations from a normally distributed population has sample
mean 83.2 and bias-corrected sample standard deviation of 6.4.
1. Find a 95% confidence interval for µ.
2. Give an interpretation for this interval.
Solution:
1. An exact 95% confidence interval would be given by x̄ ± t_{99,0.975} × s_{n−1}/√100.

   Since n = 100 is large, we can approximate this by

   $$ \bar{x} \pm z_{0.975}\frac{s_{n-1}}{\sqrt{100}} = 83.2 \pm 1.96 \times \frac{6.4}{10} = [81.95, 84.45]. $$

2. With 95% confidence, we can say that the population mean lies between 81.95 and 84.45.
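As a quick numerical check of this example (assuming scipy is available), the exact t_{99}-based interval can be compared with the normal approximation used above:

# Sketch: normal-approximation interval vs the exact t_99 interval for this example.
from math import sqrt
from scipy.stats import norm, t

x_bar, s, n = 83.2, 6.4, 100
z_half = norm.ppf(0.975) * s / sqrt(n)           # approx 1.96 * 0.64
t_half = t.ppf(0.975, df=n - 1) * s / sqrt(n)    # slightly wider, approx 1.984 * 0.64
print("normal approx:", (x_bar - z_half, x_bar + z_half))    # roughly (81.95, 84.45)
print("exact t_99:   ", (x_bar - t_half, x_bar + t_half))    # roughly (81.93, 84.47)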