Parameter Estimation
Estimation of the Mean
Suppose y1, ..., yn are independent and
identically distributed. The method of moments
estimator (and least squares estimator) of the
population mean μ is given by the sample mean
$$\hat{\mu} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
Also
$$\mathrm{S.E.}(\hat{\mu}) = \frac{\sigma}{\sqrt{n}}$$
where σ is the population standard deviation.
It can be shown that, approximately,
$$\frac{\hat{\mu} - \mu}{\mathrm{S.E.}(\hat{\mu})} \sim N(0,1)$$
The relation comes from the Central Limit
Theorem and usually holds well in practice for
all but the smallest values of n.
Confidence intervals for the population mean
can then be calculated, often as the
sample mean ± 2 standard errors.
However, σ usually needs to be estimated by the
sample standard deviation and this introduces an
additional degree of uncertainty which should
lead to wider confidence intervals.
When the population distribution is approximately
normal we can make an appropriate correction by
replacing the normal distribution with the t
distribution with n-1 degrees of freedom.
Otherwise a greater correction is ideally required.
Example: Failures Data
The numbers of operating hours between
successive failures of the air conditioning
equipment aboard an aircraft were as follows:
413 14 58 37 100 65 9 169 447 184 36 201
118 34 31 18 18 67 57 62 7 22 34
The data are also available as the R vector
failures. We have n = 23 observations.
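For readers who do not have the course data file to hand, the vector can simply be typed in. The following is a minimal sketch that enters the values listed above under the name failures and confirms the sample size.

```r
# Operating hours between successive failures, typed in from the listing above
failures <- c(413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201,
              118, 34, 31, 18, 18, 67, 57, 62, 7, 22, 34)

n <- length(failures)   # number of observations
n
summary(failures)       # quartiles, mean and extremes, showing the skewness
```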
The data are clearly very positively skewed so
an exponential Q-Q plot is carried out.
The graph suggests that they might
reasonably be modelled by an $\mathrm{Exp}(\mu^{-1})$
distribution (exponential with mean μ),
corresponding to a memoryless property in
the failure times.
From the plot, a resistant estimate of μ
would appear to be about 80, but it is
difficult to make any (graphical)
assessment of uncertainty.
[Figure: exponential Q-Q plot of the failures data; annotation: gradient = 80]
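One way such a plot might be drawn in R is sketched below. The plotting positions from ppoints() are an assumption, since the slides do not say which were used, and the dashed line of gradient 80 through the origin is added only as a visual reference for the estimate read off the plot.

```r
failures <- c(413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201,
              118, 34, 31, 18, 18, 67, 57, 62, 7, 22, 34)
n <- length(failures)

# Quantiles of the unit exponential Exp(1) at the plotting positions
# (ppoints() is an assumed choice of plotting positions)
theo <- qexp(ppoints(n))

# If the data are Exp(1/mu), the ordered data should lie close to a
# straight line through the origin with gradient mu
plot(theo, sort(failures),
     xlab = "Exp(1) quantiles", ylab = "Ordered failure times",
     main = "Exponential Q-Q plot of failures")
abline(a = 0, b = 80, lty = 2)   # reference line with gradient 80
```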
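We now wish to find an estimate of the
population mean, μ.
Let $\hat{\mu}$ be the sample mean, which here is
2201/23 ≈ 95.70.
We can also work out the standard error.
The S.E. is given by σ/√n; estimating σ by the
sample standard deviation, 119.2897, gives
119.2897/√23.
This calculates as 24.87.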
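The arithmetic above can be checked in R. This is a small sketch using only base functions; the names mu_hat, s and se are chosen here for illustration.

```r
failures <- c(413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201,
              118, 34, 31, 18, 18, 67, 57, 62, 7, 22, 34)
n <- length(failures)

mu_hat <- mean(failures)   # sample mean, the estimate of mu
s      <- sd(failures)     # sample standard deviation (divisor n - 1)
se     <- s / sqrt(n)      # estimated standard error of the sample mean

round(c(mean = mu_hat, sd = s, se = se), 4)
```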
A 95% confidence interval can be calculated by
the usual methods or obtained in R. Since the
population standard deviation has been
estimated from the sample and the sample
size is reasonably small, the t distribution is
appropriate.
So the 95% confidence interval is
[44.11,147.28].
This should really be widened a little
bit to allow for non-normality of the
population distribution.
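A sketch of the t-based calculation in R follows. It builds the interval by hand from the t quantile with n - 1 = 22 degrees of freedom; the built-in t.test() function should return the same interval.

```r
failures <- c(413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201,
              118, 34, 31, 18, 18, 67, 57, 62, 7, 22, 34)
n      <- length(failures)
mu_hat <- mean(failures)
se     <- sd(failures) / sqrt(n)

# 97.5% point of the t distribution with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

# 95% confidence interval: estimate +/- t_crit standard errors
ci <- mu_hat + c(-1, 1) * t_crit * se
round(ci, 2)

# The same interval from the built-in one-sample t procedure
t.test(failures)$conf.int
```

Note that the t multiplier here is a little over 2, which is why this interval is slightly wider than the rough "2 standard errors" rule would give.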
Estimation of the Median
Sometimes it can be more useful to look at
the population MEDIAN rather than mean. A
possible estimator of this is given by the
sample median, $\hat{m}$. Here, at least when n is
moderately large,
$$\mathrm{S.E.}(\hat{m}) \approx \frac{1}{2 f(m) \sqrt{n}}$$
where f(m) is the density of the underlying
distribution at the population median m.
For a normally distributed N(μ, σ) population,
the sample median has standard error
1.253σ/√n, and so is a less efficient estimator
of μ than the sample mean.
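The factor 1.253 can be checked numerically. The sketch below assumes the usual large-sample formula S.E. ≈ 1/(2 f(m) √n) for the sample median; median_se is a hypothetical helper written for this illustration, and the values of sigma and n are arbitrary.

```r
# Large-sample standard error of the sample median, given the density
# f_m at the population median and the sample size n (hypothetical helper)
median_se <- function(f_m, n) 1 / (2 * f_m * sqrt(n))

sigma <- 1     # arbitrary population standard deviation
n     <- 100   # arbitrary sample size

# For a normal population the median equals the mean, so evaluate the
# density at the mean (taken as 0 here)
se_median <- median_se(dnorm(0, mean = 0, sd = sigma), n)
se_mean   <- sigma / sqrt(n)

se_median / se_mean   # ratio of standard errors
sqrt(pi / 2)          # the same constant in closed form, about 1.253
```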
However, for longer-tailed distributions, the
sample median is a more efficient estimator of
location than the sample mean. This is
closely related to the fact that the sample
median is a resistant estimator.
We will use the median in the “failures”
example.
Example: Estimation of Median for
Failures Data
We can estimate the population median
from the sample median which has a
value of 57. We need to ask, though,
how accurate is this estimate and can we
use it to construct a confidence interval
for m?
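The sample median quoted above is easy to confirm in R. With n = 23 observations it is the 12th ordered value; m_hat is a name chosen here for illustration.

```r
failures <- c(413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201,
              118, 34, 31, 18, 18, 67, 57, 62, 7, 22, 34)

m_hat <- median(failures)   # sample median
m_hat

# With an odd sample size the median is the middle ordered value
sort(failures)[(length(failures) + 1) / 2]
```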
We could use the formula for the
standard error quoted earlier to calculate
confidence intervals but the sample size
is not very large.
We instead use bootstrap estimation to
answer these questions.
Bootstrap estimation is a fairly general
technique for making assessments of
uncertainty about estimators. It typically
requires the use of simulation.
What we would like is the sampling
distribution of $\hat{m} - m$, giving the variation
of the sample median about the
population median.
However, this requires knowledge of the
(unknown) underlying population
distribution.
We therefore substitute for the population
distribution by using instead the empirical
distribution of the data (the bootstrap).
Suppose this empirical distribution has
median $m^*$. Let the random variable $\hat{m}^*$
denote the sample median of a random
sample (independent identically
distributed observations) of size 23 from
this empirical distribution.
Then we would expect that the sampling
distribution of $\hat{m}^* - m^*$ should be very
close to that of $\hat{m} - m$.
Now let us study the distribution of
$\hat{m}^* - m^*$. Since we know the value of
$m^*$ (57), it is just a case of looking at
$\hat{m}^*$. We will use simulation and set up
an R vector called ms of size 1000
and use it to store the results of 1000
simulations of $\hat{m}^*$.
First consider the command sample.
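As an illustration of what sample does, the sketch below draws from failures with and without replacement. The seed is arbitrary and is set only so that the sketch is reproducible; resample is a name chosen here.

```r
failures <- c(413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201,
              118, 34, 31, 18, 18, 67, 57, 62, 7, 22, 34)

set.seed(1)   # arbitrary seed, for reproducibility only

# Default behaviour: a random permutation of the data (no replacement)
sample(failures)

# With replace = TRUE: a random sample of size 23 from the empirical
# distribution, in which some values repeat and others are left out
resample <- sample(failures, replace = TRUE)
resample
median(resample)   # one simulated value of the bootstrap sample median
```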
Now use a for loop to do the simulation.
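A loop of the following kind could be used. This is a sketch rather than the exact code from the slides; the seed is arbitrary.

```r
failures <- c(413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201,
              118, 34, 31, 18, 18, 67, 57, 62, 7, 22, 34)

set.seed(1)           # arbitrary seed, for reproducibility only
ms <- numeric(1000)   # storage for the 1000 simulated medians

for (i in 1:1000) {
  # Resample 23 values with replacement and record the sample median
  ms[i] <- median(sample(failures, replace = TRUE))
}

sd(ms)         # spread of the simulated medians
sd(failures)   # spread of the raw data, for comparison
```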
Recall that each component of ms is the
median of a random sample of size 23,
obtained by sampling with replacement
from failures. Hence the variability in ms
is much less than the variability in
failures itself.
Typing qqnorm(ms) produces the normal
Q-Q plot for the distribution of ms.
This distribution is not particularly normal,
so the earlier theory for the sampling
distribution of the median would not have
been very good here.
A reasonable 95% confidence interval,
more formally a 95% percentile interval,
for the original population median m is
given by [Qe(0.025), Qe(0.975)], where Qe
is the empirical quantile function of the
bootstrap simulations ms of $\hat{m}^*$.
For these simulations this gives (34, 67)
as a reasonable confidence interval for m.
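The percentile interval can be read off with quantile(). The sketch below regenerates ms so that it runs on its own; because the simulation is random, the endpoints will vary a little from run to run and need not match the slide's interval exactly.

```r
failures <- c(413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201,
              118, 34, 31, 18, 18, 67, 57, 62, 7, 22, 34)

set.seed(1)           # arbitrary seed, for reproducibility only
ms <- numeric(1000)
for (i in 1:1000) {
  ms[i] <- median(sample(failures, replace = TRUE))
}

# Empirical 2.5% and 97.5% quantiles of the simulated medians
# (quantile() uses its default interpolation rule here)
ci <- quantile(ms, probs = c(0.025, 0.975))
ci
```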
Again, this confidence interval should be
widened a little to allow for the
approximation involved in using the
empirical distribution of the data.
Failures data - further discussion.
If we assume that the population distribution
is $\mathrm{Exp}(\mu^{-1})$, then for the population median,
m, we have m = μ ln 2.
It follows that we can also obtain an
estimate of the population mean, μ, from an
estimate of m. In particular the 95%
confidence interval for m of (34, 67)
obtained above translates into a 95%
confidence interval for μ of (49.1, 96.7).
This should be compared with that
obtained earlier by estimation based on the
sample mean (44,147).
However, no allowance is made here for
the uncertainty involved in the exponential
assumption.
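The conversion from an interval for m to an interval for μ is just division by ln 2. The sketch below carries it out and also checks the relation m = μ ln 2 against R's exponential quantile function; the value μ = 80 is used only as an example.

```r
# 95% interval for the median m, as quoted above
ci_m <- c(34, 67)

# Under the exponential assumption m = mu * log(2), so mu = m / log(2)
ci_mu <- ci_m / log(2)
round(ci_mu, 1)   # 49.1 96.7

# Check of the relation: the median of Exp(1/mu) is mu * log(2)
mu <- 80                    # example value only
qexp(0.5, rate = 1 / mu)    # median from the quantile function
mu * log(2)                 # median from the formula
```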