Module 3: Estimation and Properties of Estimators
Math 4820/5320
Introduction
This section of the book will examine how to find estimators of unknown parameters. For example,
if the population mean is unknown and it is of interest, we can estimate the population mean
through a variety of methods. We will also explore interval estimation which gives an interval in
which we can have a certain degree of confidence that the true population parameter is contained.
Finally, we will explore properties of different types of estimators.
Reading
Chapters 8 – 9 of Mathematical Statistics with Applications by Wackerly, Mendenhall, and Scheaffer.
Estimation
Def: An estimator is a rule, often expressed as a formula, that tells how to calculate the value of
an estimate based on the measurements contained in the sample.
Def: A single number that estimates a population parameter is called a point estimate.
Def: An interval between two values that is intended to enclose a population parameter is a called
an interval estimate.
Example: Suppose we collect the following random sample of exam scores:
98, 100, 0, 58, 86, 77, 90, 83, 70
What are some potential point estimates and interval estimates?
Bias and Mean Squared Error of Point Estimators
Illustration of the distribution of estimates
Def: Let θ̂ be a point estimator for a parameter θ. Then θ̂ is an unbiased estimator if E[θ̂] = θ. If E[θ̂] ≠ θ, then θ̂ is said to be biased.
Def: The bias of a point estimator θ̂ is given by: B(θ̂) = E[θ̂] − θ.
Def: The mean square error (MSE) of a point estimator is given by: MSE(θ̂) = E[(θ̂ − θ)²].
Note: MSE(θ̂) = Var(θ̂) + [B(θ̂)]².
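This identity can be checked numerically. The sketch below is illustrative only (the shrunken estimator 0.9·X̄ and all constants are hypothetical choices): it simulates many samples and compares the Monte Carlo MSE of the estimator with its variance plus squared bias.

# Monte Carlo check of MSE(theta.hat) = Var(theta.hat) + [Bias(theta.hat)]^2
# Toy biased estimator of mu: theta.hat = 0.9 * Xbar (hypothetical choice)
set.seed(1)
mu <- 4; sigma <- 10; n <- 25; reps <- 100000
theta.hat <- replicate(reps, 0.9 * mean(rnorm(n, mean = mu, sd = sigma)))
mse    <- mean((theta.hat - mu)^2)                     # direct Monte Carlo MSE
decomp <- var(theta.hat) + (mean(theta.hat) - mu)^2    # Var + Bias^2
c(mse = mse, var.plus.bias.sq = decomp)                # the two values should nearly agree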
Sampling distributions of two unbiased estimators
Some Common Unbiased Point Estimators
Note: Some specific methods for finding point estimators are given later in this module (see Chapter
9). These are commonly used point estimators.
Def: The standard deviation of the sampling distribution of the estimator θ̂, σθ̂ , is called the
standard error of the estimator θ̂.
If the population mean, µ, is unknown, it is common to estimate it with the sample mean, X̄.
Example:
Note: If we are interested in a population proportion, we will denote it by p.
The table referred to here is Table 8.1 from the textbook, which lists the common unbiased point estimators Ȳ (for µ), p̂ (for p), Ȳ1 − Ȳ2 (for µ1 − µ2), and p̂1 − p̂2 (for p1 − p2), along with their standard errors σ/√n, √(p(1 − p)/n), √(σ1²/n1 + σ2²/n2), and √(p1(1 − p1)/n1 + p2(1 − p2)/n2).
Example:
If the population variance, σ², is unknown, it is typical to estimate it with the sample variance.
Note: S_X² = [1/(n − 1)] · Σ_{i=1}^n (Xi − X̄)² is an unbiased point estimator for σ².
Note: S_X′² = (1/n) · Σ_{i=1}^n (Xi − X̄)² is a biased point estimator for σ².
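A short simulation (an illustrative sketch, not part of the original notes) makes the difference concrete: averaged over many samples, the n − 1 version lands near σ², while the 1/n version comes in low by roughly the factor (n − 1)/n.

# Compare E[S^2] (divide by n-1) with E[S'^2] (divide by n) by simulation
set.seed(42)
sigma2 <- 9; n <- 10; reps <- 100000
s2.unbiased <- replicate(reps, var(rnorm(n, mean = 0, sd = sqrt(sigma2))))  # var() divides by n-1
s2.biased   <- s2.unbiased * (n - 1) / n                                    # rescaled to the 1/n version
c(mean.unbiased = mean(s2.unbiased),   # close to sigma2 = 9
  mean.biased   = mean(s2.biased),     # close to 9 * (n-1)/n = 8.1
  true.sigma2   = sigma2)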
Evaluating the Goodness of a Point Estimator
Def: The error of estimation, ε, is the distance between an estimator and its target parameter. That is, ε = |θ̂ − θ|.
Example: A sample of n = 1000 voters, randomly selected from a city, showed y = 560 in favor
of candidate Jones. Estimate p, the fraction of voters in the population favoring Jones, and
place a 2-standard-error bound on the error of estimation.
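One way to carry out this computation in R (a sketch of the standard calculation, not the notes' worked solution):

# Point estimate of p and a 2-standard-error bound on the error of estimation
y <- 560; n <- 1000
p.hat <- y / n                             # 0.56
se    <- sqrt(p.hat * (1 - p.hat) / n)     # estimated standard error of p.hat
c(p.hat = p.hat, bound = 2 * se)           # bound is about 0.031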
Confidence Intervals
Recall: An interval estimator is a rule specifying the method for using the sample measurements
to calculate two numbers that form the endpoints of the interval.
Interval estimators are commonly called confidence intervals.
Suppose that θ̂_L and θ̂_U are the (random) lower and upper confidence limits, respectively, for a parameter θ.
Then if P(θ̂_L ≤ θ ≤ θ̂_U) = 1 − α, (1 − α) is called the confidence coefficient and [θ̂_L, θ̂_U] is a two-sided confidence interval.
It is also possible to construct a one-sided confidence interval. For example, if P(θ̂_L ≤ θ) = 1 − α, then [θ̂_L, ∞) is a one-sided confidence interval. Similarly, if P(θ ≤ θ̂_U) = 1 − α, then (−∞, θ̂_U] is a one-sided confidence interval.
One common method for finding confidence intervals is called the pivotal method. This method
depends on finding a pivotal quantity that possesses two characteristics:
1. It is a function of the sample measurements and the unknown parameter θ, where θ is the
only unknown quantity.
2. Its probability distribution does not depend on the parameter θ.
The following facts are useful when using such a method:
1. If c > 0, then P (a ≤ Y ≤ b) = P (ca ≤ cY ≤ cb).
2. If d ∈ R, then P (a ≤ Y ≤ b) = P (a − d ≤ Y − d ≤ b − d).
Example:
Example:
Large-Sample Confidence Intervals
In the previous section, we found point estimators for µ, p, µ1 − µ2 , and p1 − p2 , along with their
corresponding standard errors.
Utilizing the Central Limit Theorem, we can argue that each of these point estimators has an approximately normal distribution. For instance, if we have a sufficiently large sample size, then Z = (θ̂ − θ)/σ_θ̂ follows an approximately standard normal distribution.
Using this approximation, we can construct 100(1 − α)% confidence intervals.
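The resulting two-sided interval is θ̂ ± z_{α/2}·σ_θ̂. A small helper like the one below computes it; this is only an illustrative sketch, and the function name and the numbers in the usage line are hypothetical.

# Generic large-sample 100(1 - alpha)% confidence interval: theta.hat +/- z * se
z.ci <- function(theta.hat, se, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  c(lower = theta.hat - z * se, upper = theta.hat + z * se)
}

# Hypothetical use: a point estimate of 10 with standard error 0.5
z.ci(theta.hat = 10, se = 0.5)   # roughly (9.02, 10.98)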
Example:
Example:
Similar to these two-sided confidence intervals, we can determine that the 100(1 − α)% one-sided confidence limits, often called lower and upper bounds, respectively, are given by:
The 100(1 − α)% lower bound for θ is θ̂ − z_α·σ_θ̂.
The 100(1 − α)% upper bound for θ is θ̂ + z_α·σ_θ̂.
Example:
Illustration of Confidence Intervals:
Interpretation of Confidence Intervals:
Note: If the sample size is reasonably large (n ≥ 30), we can use the sample variance as an estimate of the population variance when constructing confidence intervals.
Note: When constructing a confidence interval for population proportions, we can estimate the
standard error by plugging in our estimated proportions.
Example:
Selecting Sample Size
In most cases, increasing the sample size is not free. As a result, it is useful to plan ahead and choose the smallest sample size that achieves a desired margin of error or standard error.
Example: Suppose the weights of copper pipes are normally distributed with mean µ = 5 pounds and variance σ² = 4 pounds². Find the necessary sample size so that the margin of error for a 95% confidence interval is at most 0.2 pounds.
Example: Suppose we want to poll Americans on their preference of candidates for president.
It is estimated that 44% of Americans support Hillary Clinton currently. Estimate what sample
size is needed if we want to create a 95% confidence interval for Hillary Clinton’s support with
a margin of error of 3%.
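The standard calculations, sketched in R (assuming the usual margin-of-error formulas E = z_{α/2}·σ/√n for a mean and E = z_{α/2}·√(p(1 − p)/n) for a proportion):

z <- qnorm(0.975)   # about 1.96 for a 95% interval

# Copper pipes: sigma = 2 pounds, margin of error at most 0.2 pounds
n.pipes <- ceiling((z * 2 / 0.2)^2)              # 385

# Poll: anticipated proportion 0.44, margin of error 0.03
n.poll <- ceiling(z^2 * 0.44 * 0.56 / 0.03^2)    # about 1052

c(n.pipes = n.pipes, n.poll = n.poll)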
Small-Sample Confidence Intervals for µ and µ1 − µ2
In the previous section, we assumed that the data came from a normal distribution and either the
population standard deviation or population variance was known.
In fact, even if the population standard deviation or variance is unknown, as long as the sample size is sufficiently large we can treat s² as a good approximation to σ² and use the methods of the previous section to construct a confidence interval.
This section examines the cases where either the sample size is small (so s² is not necessarily a good approximation to σ²) or the distribution of our data is not normal.
Let’s examine the first case. If we are interested in the distribution of X̄ and our data come from a normal distribution, then we know from Chapter 6 that (X̄ − µ)/(S/√n) ∼ t(n − 1).
The second case is a bit trickier. If the distribution of our data is not too far from normal (in particular, not too skewed), then the Central Limit Theorem kicks in and we can still create an approximate confidence interval.
Example: Suppose a group of 30 students is given an exam, and further suppose that scores are normally distributed with an unknown mean. If the sample mean of the 30 students is 101 and the sample standard deviation is 14.9, create a 95% confidence interval for µ.
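A sketch of the computation in R (reading the example with n = 30):

# 95% t-based confidence interval for mu with unknown sigma
n <- 30; y.bar <- 101; s <- 14.9
t.crit <- qt(0.975, df = n - 1)               # about 2.045
y.bar + c(-1, 1) * t.crit * s / sqrt(n)       # roughly (95.4, 106.6)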
If we are interested in creating a confidence interval for the difference in means, there are two methods. If the variances of the two populations are assumed to be equal, we can “pool” variances to get a better estimate of σ. In particular, we can create a confidence interval as follows.

(Ȳ1 − Ȳ2) ± t_{α/2} · S_p · √(1/n1 + 1/n2),   df = n1 + n2 − 2

S_p² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2)
Example: Suppose on the exam mentioned in the previous example, we are given the following
breakdown by alma mater.
School   Sample Mean   Sample Variance   Sample Size
UCD      105.1         15.2              15
UCB      96.9          14.6              15
Compute a 95% confidence interval for the difference in means. Assume that the variances of
the two populations are approximately equal. Is there evidence that students from the two
schools have different average scores?
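A sketch of the pooled computation in R, using the table values (the object names are just for illustration):

# Pooled two-sample 95% CI for mu1 - mu2
y1.bar <- 105.1; s1.sq <- 15.2; n1 <- 15
y2.bar <- 96.9;  s2.sq <- 14.6; n2 <- 15
sp.sq  <- ((n1 - 1) * s1.sq + (n2 - 1) * s2.sq) / (n1 + n2 - 2)   # pooled variance, 14.9
t.crit <- qt(0.975, df = n1 + n2 - 2)
(y1.bar - y2.bar) + c(-1, 1) * t.crit * sqrt(sp.sq * (1/n1 + 1/n2))
# roughly (5.3, 11.1); since 0 is not in the interval, the data suggest a real difference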
If the variances of the two populations are not equal, then we cannot “pool” variances. Below is how to create a two-sided confidence interval when we don’t pool variances.

(Ȳ1 − Ȳ2) ± t_{α/2} · √(s1²/n1 + s2²/n2),   df = ν

ν = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
Example: Rework the previous example without making the assumption that the two
populations have the same variances.
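With equal sample sizes and nearly equal sample variances, the unpooled interval barely differs from the pooled one. A sketch in R (same table values, Welch–Satterthwaite degrees of freedom):

# Unpooled (Welch) two-sample 95% CI for mu1 - mu2
se1.sq <- 15.2 / 15; se2.sq <- 14.6 / 15
nu <- (se1.sq + se2.sq)^2 /
      (se1.sq^2 / (15 - 1) + se2.sq^2 / (15 - 1))     # approximate df, just under 28
(105.1 - 96.9) + c(-1, 1) * qt(0.975, df = nu) * sqrt(se1.sq + se2.sq)
# essentially the same interval as the pooled version here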
Confidence Intervals for σ²
Recall that if Y1, Y2, ..., Yn are normally distributed, then Σ_{i=1}^n (Yi − Ȳ)²/σ² = (n − 1)·S²/σ² has a χ² distribution with (n − 1) degrees of freedom. We can use this fact to create a confidence interval for σ².
In particular, if we want to create a two-sided confidence interval for σ² with confidence coefficient (1 − α), consider the following.

P( χ²_L ≤ (n − 1)·S²/σ² ≤ χ²_U ) = 1 − α

P( (n − 1)·S²/χ²_(α/2) ≤ σ² ≤ (n − 1)·S²/χ²_(1−α/2) ) = 1 − α
Example: Suppose a group of 30 students is given an exam, and further suppose that scores are normally distributed with an unknown mean. If the sample mean of the 30 students is 101 and the sample standard deviation is 14.9, create a 95% confidence interval for σ.
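A sketch of the computation in R (again reading the example with n = 30; qchisq returns lower-tail quantiles, so the larger quantile goes in the denominator of the lower limit):

# 95% confidence interval for sigma^2, then for sigma
n <- 30; s <- 14.9
ci.var <- (n - 1) * s^2 / qchisq(c(0.975, 0.025), df = n - 1)
ci.var          # interval for sigma^2
sqrt(ci.var)    # roughly (11.9, 20.0) for sigma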
Properties of Point Estimators and Methods of Estimation
Note: The following notes cover chapter 9 of the textbook.
In the previous section (Chapter 8), we considered some common point estimators (e.g. estimate
µ with X̄). In this chapter, we will examine some properties of point estimators, as well as how to
derive other point estimators.
Recall that an estimator θ̂ is an unbiased estimator for θ if E[θ̂] = θ. Typically, unbiased estimators
are preferred to biased estimators. Suppose we are given two unbiased estimators, θˆ1 and θˆ2 . Which
one is preferred? In this section, we will examine additional properties of estimators, such as efficiency, consistency, and sufficiency.
Relative Efficiency
Def: Given two unbiased estimators, θ̂1 and θ̂2, of a parameter θ, with variances Var(θ̂1) and Var(θ̂2), respectively, the efficiency of θ̂1 relative to θ̂2, denoted eff(θ̂1, θ̂2), is defined to be the ratio:

eff(θ̂1, θ̂2) = Var(θ̂2) / Var(θ̂1)

If θ̂1 and θ̂2 are both unbiased and eff(θ̂1, θ̂2) ≥ 1, then θ̂1 is preferred to θ̂2 since Var(θ̂1) ≤ Var(θ̂2). If eff(θ̂1, θ̂2) ≤ 1, then θ̂2 is preferred to θ̂1 since Var(θ̂1) ≥ Var(θ̂2). In other words, when choosing between two unbiased estimators, it is common to select the estimator with the smaller variance.
Example: Let Y1, Y2, ..., Yn be a random sample of (continuous) Uniform[0, θ] random variables. Two unbiased estimators of θ are:

θ̂1 = 2Ȳ   and   θ̂2 = ((n + 1)/n) · Y(n),   where Y(n) = max(Y1, Y2, ..., Yn).

Find the efficiency of θ̂1 relative to θ̂2.
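The exact answer comes from Var(Ȳ) and the distribution of Y(n); the quick simulation below (an illustrative sketch, not a derivation, with arbitrary θ and n) previews it by estimating both variances.

# Simulated variances of theta1.hat = 2*Ybar and theta2.hat = (n+1)/n * max(Y)
set.seed(7)
theta <- 10; n <- 20; reps <- 50000
theta1.hat <- replicate(reps, 2 * mean(runif(n, 0, theta)))
theta2.hat <- replicate(reps, (n + 1) / n * max(runif(n, 0, theta)))
c(var1 = var(theta1.hat),
  var2 = var(theta2.hat),
  eff  = var(theta2.hat) / var(theta1.hat))   # eff(theta1, theta2), well below 1 for this n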
Consistency
The law of large numbers essentially states that if an experiment is repeated many times, then the
sample mean will converge to the population mean. We have to be careful to define what we mean
by “converge” since we are discussing random quantities.
Def: The weak law of large numbers states that the sample mean converges in probability to the population mean (or expected value):

X̄n → µ in probability as n → ∞, i.e., lim_{n→∞} P(|X̄n − µ| > ε) = 0 for all ε > 0.

Def: The strong law of large numbers states that the sample mean converges almost surely to the population mean (or expected value):

X̄n → µ almost surely as n → ∞, i.e., P(lim_{n→∞} X̄n = µ) = 1.
Note: Almost sure convergence is a stronger condition than convergence in probability.
Def: The estimator θ̂n is said to be a (weakly) consistent estimator of θ if θ̂n → θ in probability, i.e., for all ε > 0, lim_{n→∞} P(|θ̂n − θ| > ε) = 0.
Note: We will focus on determining if an estimator is (weakly) consistent, but if the estimator
converges almost surely to the parameter of interest, it is said to be strongly consistent.
Figure 1: Sample mean of n N(µ = 4, σ² = 100) random variables
R code for previous figure:
set.seed(2016)
n <- 10000
idx <- 1:n
xn.bar <- rep(1, n)
# generate n Normal(mu = 4, sd = 10) random variables
x <- rnorm(n = n, mean = 4, sd = 10)
# running sample mean of the first i observations
for (i in 1:n) {
  xn.bar[i] <- mean(x[1:i])
}
plot(xn.bar ~ idx, type = "l", xlab = "Sample Size", ylab = "Sample Mean")
abline(h = 4, lty = 2)
Theorem: An unbiased estimator θ̂n for θ is a consistent estimator of θ if lim_{n→∞} Var(θ̂n) = 0.
Example: Show that 2Ȳ is a consistent estimator for θ if Y1, Y2, ..., Yn are iid Uniform(0, θ).
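A simulation does not prove consistency, but a sketch like the following (patterned on the figure code above, with an arbitrary θ) illustrates the running estimate 2Ȳn settling down around θ as n grows.

# Running values of 2 * Ybar_n for Uniform(0, theta) data, theta = 10
set.seed(2016)
theta <- 10; n <- 10000
y <- runif(n, min = 0, max = theta)
theta.hat <- 2 * cumsum(y) / (1:n)      # 2 * Ybar_n for each sample size
plot(theta.hat, type = "l", xlab = "Sample Size", ylab = "2 * Sample Mean")
abline(h = theta, lty = 2)              # the running estimates hug theta = 10 for large n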
Theorem: (Slutsky’s Theorem) Suppose that θ̂n → θ in probability and θ̂n′ → θ′ in probability. Then the following are true.
(a) θ̂n + θ̂n′ → θ + θ′ in probability.
(b) θ̂n · θ̂n′ → θ · θ′ in probability.
(c) If θ′ ≠ 0, then θ̂n / θ̂n′ → θ / θ′ in probability.
(d) If g(·) is a real-valued function that is continuous at θ, then g(θ̂n) → g(θ) in probability.
Sufficiency
Up to this point, we have considered estimators that “make sense” in some regard (such as estimating µ with X̄). One unresolved issue that we would like to address is data reduction. If we
simply report X̄ as an estimate for µ, have we “lost” any information about µ? In this section, we
will examine how to find statistics that in some sense summarize all of the information in a sample
about a target parameter. These statistics will be called sufficient statistics.
Def: Let Y1, Y2, ..., Yn denote a random sample from a probability distribution with an unknown parameter θ. Then the statistic U = g(Y1, Y2, ..., Yn) is said to be a sufficient statistic for θ if the conditional distribution of Y1, Y2, ..., Yn given U does not depend on θ.
Def: Let y1 , y2 , . . . , yn be sample observations taken on corresponding random variables Y1 , Y2 , . . . , Yn
whose distribution depends on a parameter θ. Then if Y1 , Y2 , . . . , Yn are discrete random variables,
the likelihood of the sample, L(y1 , y2 , . . . , yn |θ), is defined to be the joint probability of y1 , y2 , . . . , yn .
If Y1 , Y2 , . . . , Yn are continuous random variables, the likelihood L(y1 , y2 , . . . , yn |θ) is defined to be
the joint density evaluated at y1 , y2 , . . . , yn .
Theorem: Let U be a statistic based on the random sample Y1 , Y2 , . . . , Yn . Then U is a sufficient statistic
for the estimation of parameter θ if and only if the likelihood L(y1 , y2 , . . . , yn |θ) can be factored
into two non-negative functions,
L(y1, y2, ..., yn | θ) = g(u, θ) · h(y1, y2, ..., yn)
where g(u, θ) is a function only of u and θ and h(y1 , y2 , . . . , yn ) is not a function of θ.
Example: (9.39) Let Y1, Y2, ..., Yn be a random sample of Poisson(λ) random variables. Show by conditioning that Σ_{i=1}^n Yi is sufficient for λ.
Example: (9.38) Let Y1, Y2, ..., Yn be a random sample of normal random variables with mean µ and variance σ².
(a) Write the likelihood function.
(b) If µ is unknown and σ² is known, show that Ȳ is sufficient for µ.
(c) If µ is known and σ² is unknown, show that Σ_{i=1}^n (Yi − µ)² is sufficient for σ².
(d) If µ and σ² are both unknown, show that Σ_{i=1}^n Yi and Σ_{i=1}^n Yi² are jointly sufficient for µ and σ². (Note: This will imply that Ȳ and Σ_{i=1}^n (Yi − Ȳ)², or Ȳ and S², are jointly sufficient for µ and σ².)
The Rao-Blackwell Theorem and Minimum-Variance Unbiased Estimation
Theorem: (Rao-Blackwell) Let θ̂ be an unbiased estimator for θ such that Var(θ̂) < ∞. If U is a sufficient statistic for θ, define θ̂* = E[θ̂ | U]. Then for all θ, E[θ̂*] = θ and Var(θ̂*) ≤ Var(θ̂).
The Rao-Blackwell Theorem allows us to condition on a sufficient statistic to obtain unbiased estimators with smaller variance. It turns out that if we condition on a minimal sufficient statistic, we can obtain what is called the minimum-variance unbiased estimator (MVUE).
Def: A statistic S(X) is called a minimal sufficient statistic if
(i) S(X) is a sufficient statistic
(ii) If T(X) is a sufficient statistic, then S(X) = f(T(X)).
Def: θ̂ is a minimum-variance unbiased estimator (MVUE) for a parameter θ if θ̂ is unbiased and, for any other unbiased estimator θ̃, Var(θ̂) ≤ Var(θ̃).
Example: (9.59) The number of breakdowns Y per day for a certain machine is a Poisson
random variable with mean θ. The daily cost of repairing these breakdowns is given by C = 3Y².
If Y1 , Y2 , . . . , Yn denote the observed number of breakdowns for n independently selected days,
find an MVUE for E(C).
Example: Let Y1, Y2, ..., Yn denote a random sample from the uniform distribution over the interval (0, θ).
(a) (9.49) Show that Y(n) = max(Y1, Y2, ..., Yn) is sufficient for θ.
(b) (9.61) Use Y(n) to find an MVUE of θ.
Methods of Estimation
Method of Moments
The Method of Moments is a very simple procedure for finding point estimates of population parameters. The method is as follows:
Let µ′_k = E[Y^k] and m′_k = (1/n) Σ_{i=1}^n Yi^k. Choose as estimates those values of the parameters that are solutions of the equations µ′_k = m′_k for k = 1, 2, ..., t, where t is the number of parameters to be estimated.
Example: Let Y1 , Y2 , . . . , Yn be a random sample of Uniform[0, θ], where θ is unknown. Use the
method of moments to find an estimator θ̂ of θ. Is θ̂ an unbiased, consistent estimator of θ?
Example: (9.72) Let Y1, Y2, ..., Yn be a random sample from a normal distribution with mean µ and variance σ². Find the method of moments estimators for µ and σ².
Example: (9.74) Let Y1, Y2, ..., Yn be a random sample from the distribution with density

f(y | θ) = (2/θ²)(θ − y) for 0 ≤ y ≤ θ,   and   f(y | θ) = 0 elsewhere.
(a) Find an estimator for θ by using the method of moments.
(b) Is this estimator a sufficient statistic for θ?
Method of Maximum Likelihood
In the previous section, we learned how to derive method of moments estimators. While these are easy to compute, they don’t always lead to particularly good estimators. The Rao-Blackwell Theorem allows us to find a minimum-variance unbiased estimator; however, while it isn’t particularly difficult to find a sufficient statistic, determining the function of the minimal sufficient statistic that gives an unbiased estimator can be largely hit or miss. This section will focus on maximum likelihood estimators, which often lead to minimum-variance unbiased estimators.
Method of Maximum Likelihood: Suppose that the likelihood function depends on k parameters,
θ1 , . . . , θk . Choose as estimates those values of the parameters that maximize the likelihood,
L(y1 , y2 , . . . , yn |θ1 , . . . , θk ).
Note: One very nice property of maximum likelihood estimates is called the invariance property of
MLEs. This states that if f (θ) is a one-to-one function of θ, and if θ̂ is the MLE for θ, then the
MLE of f (θ) is given by f (θ̂).
Example: Let Y1, Y2, ..., Yn be a random sample from a normal distribution with mean µ and variance σ². Find the maximum likelihood estimators for µ and σ².
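The closed-form answers are µ̂ = Ȳ and σ̂² = (1/n) Σ_{i=1}^n (Yi − Ȳ)². The sketch below (illustrative only, using simulated data) checks them by maximizing the log-likelihood numerically with optim.

# Numerical MLE for (mu, sigma^2) from simulated normal data,
# compared against the closed-form answers
set.seed(1)
y <- rnorm(100, mean = 4, sd = 10)
negloglik <- function(par) {
  mu    <- par[1]
  sigma <- exp(par[2])          # optimize over log(sigma) so sigma stays positive
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
fit <- optim(par = c(0, 0), fn = negloglik)
c(mu.numerical     = fit$par[1],
  sigma2.numerical = exp(fit$par[2])^2,
  mu.closed.form   = mean(y),
  sigma2.closed    = mean((y - mean(y))^2))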
Example: (9.80) Let Y1 , Y2 , . . . , Yn be a random sample of Poisson random variables with mean λ.
(a) Find the MLE λ̂ for λ.
(b) Find the expected value and variance of λ̂.
(c) Is λ̂ a consistent estimator for λ? Explain.
(d) What is the MLE for P(Y = 0) = e^(−λ)?