Ch.7: Estimation

1 Introduction
The objective of data collection is to learn about the distribution, or aspects of the
distribution, of some characteristic of the units of a population of interest. In Chapter 6,
we saw how to estimate certain key descriptive parameters of a population distribution,
such as the mean, the variance, percentiles, and probabilities (or proportions), from the
corresponding sample quantities. In this chapter we will learn another approach to the
estimation of such population parameters. This approach is based on the assumption
that the population distribution belongs in a certain family of distribution models, and
on methods for fitting a particular family of distribution models to data. Different fitting
methods will be presented and discussed.
The estimators obtained from this other approach will occasionally differ from the estimators we saw in Chapter 6. For example, under the assumption that the population
distribution is normal, estimators of population percentiles and proportions depend only
on the sample mean and sample variance, and thus differ from the sample percentiles and
proportions; the assumption of a uniform distribution yields an estimator of the population mean value which is different from the sample mean; the assumption of Poisson
distribution yields an estimator of the population variance which is different from the
sample variance.
Another learning objective of this chapter is to develop criteria for selecting the best
among different estimators of the same quantity, or parameter. For example, should the
alternative estimators, which were mentioned in the preceding paragraph, be preferred
over those which were discussed in Chapter 6? The same criteria can also help us decide whether a stratified sample is preferable to a simple random sample for estimating
the population mean or a population proportion. Finally, in this chapter we will learn
how to report the uncertainty of estimators through their standard error and how that
leads to confidence intervals for estimators which have (or have approximately) a normal
distribution. The above estimation concepts will be developed here in the context of a
single sample, but will be applied in later chapters to samples from several populations.
2 Overview, Notation and Terminology
Many families of distribution models, including all we have discussed, depend on a small
number of parameters; for example, a Poisson distribution model is identified by the
single parameter λ, and normal models are identified by two parameters, µ and σ 2 . Such
families of distribution models are called parametric. An approach to extrapolating
sample information to the population is to assume that the population distribution is a
member of (or belongs in) a specific parametric family of distribution models, and then
fit the assumed family to the data, i.e. identify the member of the parametric family that
best fits the data.
There are several methods/criteria for fitting a parametric family of distribution models
to data. They all amount to estimating the model parameters, and taking as the fitted
model the one that corresponds to the estimated parameters.
Example 2.1. Car manufacturers often advertise damage results from low impact crash
experiments. In an experiment crashing n = 20 randomly selected cars of a certain
type against a wall at 5 mph, let X denote the number of cars that sustain no visible
damage. Here it is reasonable to assume that the distribution of X, which is the population
distribution in this case, is a member of the family of binomial probability models. A
binomial distribution is identified by the sample size used (here the sample size is n = 20),
and the parameter p, which is the probability that a randomly selected car will sustain no
visible damage when crashed at 5 mph. The best fitting model is the binomial distribution
that corresponds to the estimated value of p. For example, if X = 12 of the 20 cars in
the experiment sustain no visible damage, the estimate of p is p̂ = 12/20, and the best
fitting model is Bin(20, 0.6).
Example 2.2. The response time, X, of a robot to a certain malfunction of a certain production process (e.g. car manufacturing) is often the variable of interest. Let X1 , . . . , X36
denote 36 response times that are to be measured. Here it is not clear what the population distribution (i.e. the distribution of each Xi ) might be, but it might be assumed (at
least tentatively) that this distribution is a member of the normal family of distributions.
The model parameters that identify a normal distribution are its mean and variance.
Thus, the best fitting model is the normal distribution with mean and variance equal to
the sample mean and sample variance, respectively, obtained from the data (i.e. the 36
measured response times). For example, if the sample mean of the 36 response times is
X = 9.3, and the sample variance is S 2 = 4.9, the best fitting model is N (9.3, 4.9).
Example 2.3. The lifetime of electric components is often the variable of interest in
reliability studies. Let T1 , . . . , T25 denote the life times, in hours, of a random sample of
25 components. Here it is not clear what the population distribution might be, but it
might be assumed (at least tentatively) that it is a member of the exponential family of
models, which was introduced in Example 3.7, page 14 of Chapter 3. Thus, each Ti has
pdf fλ (t) = λ exp(−λt), for t ≥ 0, fλ (t) = 0, for t < 0, for some λ > 0. Here the single
model parameter λ identifies the exponential distribution. Since the model mean value
(i.e. the mean value of a population having the exponential distribution) is λ−1 , and since
the population mean value can be estimated by the sample mean, X, of the 25 life times,
the model parameter λ can be estimated by λ̂ = 1/X. Thus, the best fitting exponential
model is the exponential distribution with model parameter equal to λ̂. For example, if
the average of the 25 life times is 113.5 hours, the best fitting model is the exponential
distribution with λ = 113.5−1 .
Example 2.4. Suppose, as in the previous example, that interest lies in the distribution
of the life time of some type of electric component, and let T1 , . . . , T25 denote the life
times, in hours, of a random sample of 25 such components. If the assumption that the
population distribution belongs in the exponential family does not appear credible, it
might be assumed that it is a member of the gamma family of distribution models. This
is a richer family of models, which includes the exponential distributions as special cases.
The gamma distribution is identified by two parameters, α and β, and its pdf is of the
form
fα,β(x) = (1/(β^α Γ(α))) x^(α−1) e^(−x/β), x > 0,
where Γ(α) is the gamma function. The model parameters α and β are related to the
model mean and variance through
α = µ²/σ²,  β = σ²/µ.
Since the model mean and variance can be estimated by the sample mean and variance,
the parameters α and β can be estimated by
α̂ = X̄²/S²,  β̂ = S²/X̄,

where X̄, S² denote the sample mean and sample variance of the 25 life times. For example, if the average of the 25 life times is 113.5 hours, and their sample variance is 1205.55 hours², the best fitting model is the gamma distribution with α = 113.5²/1205.55 = 10.69,
and β = 1205.55/113.5 = 10.62.
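As a quick illustration (not part of the original text), the following Python sketch carries out the moment matching of Examples 2.3 and 2.4 with the summary numbers quoted above; the variable names are ours.

```python
# Sketch of the moment matching in Examples 2.3 and 2.4 (numbers from the text).
xbar, s2 = 113.5, 1205.55       # sample mean and sample variance of the 25 life times

lam_hat = 1 / xbar              # exponential fit: lambda estimated by 1/xbar
alpha_hat = xbar**2 / s2        # gamma fit: about 10.69
beta_hat = s2 / xbar            # gamma fit: about 10.62

print(lam_hat, alpha_hat, beta_hat)
```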
The method used in the above examples to fit parametric families of models to data,
namely the method of matching the nonparametric estimates of the population mean and
population variance to the model parameters, is called the method of moments. It will be
discussed further, later in this chapter, along with other methods of fitting.
Having a fitted distribution model means that the entire distribution has been estimated.
In particular, the fitted model provides immediate and direct estimation of any percentile
and probability that might be of interest.
Example 2.5. For example, the fitted model of Example 2.2 allows direct and immediate
estimation of the probability that the reaction time of the robot will exceed 50 nanoseconds, and of the 95th percentile of the robot reaction time as those of the fitted normal
distribution:
P̂(X > 50) = 1 − Φ((50 − X̄)/S), and x̂0.05 = X̄ + z0.05 S,    (2.1)
where a hat placed above a quantity denotes an estimator of the quantity, X̄ is the sample
mean and S is the sample standard deviation obtained from the 36 measured response
times.
Of course, the nonparametric estimates of proportions and percentiles that were discussed
in Chapter 6, can still be used. For example, the probability P (X > 50) in Example
2.5, can be estimated as the proportion of the 36 robot reaction times that exceed 50
nanoseconds, and x̂0.05 can be estimated by the 95th sample percentile. If the assumption
of normality is correct, i.e. if the population distribution of the robot reaction time is
normal, then the estimates in (2.1) are to be preferred over the nonparametric ones,
according to criteria that will be discussed in this chapter.
Keep in mind that, if the parametric assumption is not correct, i.e. if the population
distribution is not a member of the assumed parametric family of models, then the fitted
model is an estimator of the best-approximating model, which is the member of the
parametric family that best approximates the population distribution. Consequently,
estimators of probabilities and percentiles based on the fitted model (as was done in
Example 2.5) will not have desirable properties. The following example illustrates this
point.
Example 2.6. To demonstrate the effect that an incorrect modeling assumption can have
on the estimation of probabilities and percentiles, suppose, as in Example 2.4, that interest
lies in the distribution of the life time of some type of electric component, and the life times
of a random sample of 25 such components is observed. With the information given in that
example, the best fitting gamma model is the one with α = 113.52 /1205.55 = 10.69, and
β = 1205.55/113.5 = 10.62. Thus, using the assumption of a gamma distribution model,
the probability that a randomly selected component will last more than 140 hours, and
the 95th percentile of life times are estimated as those of the best fitting gamma model.
Using a statistical software package we obtain the estimates
P̂(X > 140) = 0.21, and x̂0.05 = 176.02.
However, if the assumption of an exponential model distribution is made, as in Example
2.3, the best fitting exponential model gives estimates
P̂(X > 140) = 0.29, and x̂0.05 = 340.02.
Fortunately, there are diagnostic tests that can help decide whether a parametric assumption
is not correct, or which of two parametric families provides a better fit to the data. A way
of gaining confidence that the population distribution belongs in the assumed parametric
family (ideal case), or at least it is well approximated by the best approximating model,
is the probability plot, which we saw in Chapter 6.
In all that follows, the Greek letter θ serves as a generic notation for any model or
population parameter(s) that we are interested in estimating. Thus, if we are interested
in the population mean value, then θ = µ, and, if we are interested in the population mean
value and variance then, θ = (µ, σ 2 ). If θ denotes a population parameter, then, by true
value of θ we mean the (unknown to us) population value of θ. For example, if θ denotes
the population mean or variance, then the true value of θ is the (unknown to us) value of
the population mean or variance. If θ denotes a model parameter then, by true value
of θ we mean the (unknown to us) value of θ that corresponds to the best-approximating
model. For example, if we use the exponential family to model the distribution of life
times of a certain component, then the unknown to us value of λ = 1/µ, where µ is the
population mean, is the true value of θ = λ.
3 Estimators and Estimates
An estimator, θ̂, of the true value of θ is a function of the random sample X1 , . . . , Xn ,
θ̂ = θ̂(X1 , . . . , Xn ).
Here we refer to the random sample in the planning stage of the experiment, when
X1 , . . . , Xn are random variables (hence the capital letters). Being a function of random
variables, i.e. a statistic, an estimator is a random variable, and thus it has a sampling
distribution. If θ is a population parameter, or, if it is a model parameter and the assumed
parametric model is correct, the sampling distribution of an estimator θ̂ depends on the
true value of θ. This dependence of the sampling distribution of θ̂ on θ is often indicated
by writing the expected value, E(θ̂), of θ̂ as

Eθ=θ0(θ̂),

and read "the expected value of θ̂ when the true value of θ is θ0".
Example 3.1. a) In Example 2.1, we used the proportion, p̂ = X/20, of cars that
sustain no visible damage as an estimator of the probability, p, of a car sustaining no
visible damage when crashed at 5 mph. Because X = 20p̂ ∼ Bin(20, p), we see that
the distribution of p̂ depends on the true value of p. For example, if p = 0.7, then
np̂ = X ∼ Bin(n, 0.7). This dependence of the distribution of X on the true value p
is often indicated in the formula for the expected value of a binomial random variable,
which is E(X) = 20p, by writing it as Ep (X) = 20p. In particular, if the true value of p
is p = 0.7, we write
Ep=0.7 (X) = 20 × 0.7 = 14, Ep=0.7 (p̂) = 0.7,
and read ”when the true value of p is 0.7, the expected value of X is 14 and the expected
value of p̂ is 0.7”.
b) In Example 2.2, we used the average, X, of the 36 response times as an estimator
of the expected value, µ, of the robot’s reaction time to a future malfunction. Because
X ∼ N (µ, σ 2 /36), we see that the distribution of X depends on the true value of µ (and
also on that of σ²). If the true value of µ is µ = 8.5, we write

Eµ=8.5(X̄) = 8.5,

and read "when the true value of µ is 8.5, the expected value of X̄ is 8.5".
Let X1 = x1 , . . . , Xn = xn be the observed values when the experiment has been carried
out. The value, θ̂(x1 , . . . , xn ), of the estimator evaluated at the observed sample values
will be called a point estimate or simply an estimate. Point estimates will also be
denoted by θ̂. Thus, an estimator is a random variable, while a (point) estimate is a
specific value that the estimator takes.
Example 3.2. a) In Example 2.1, the proportion p̂ = X/20 of cars that will sustain
no visible damage among the 20 that are to be crashed is an estimator of p, and, if the
experiment yields that X = 12 sustain no visible damage, p̂ = 12/20 is the point estimate
of p.
b) In Example 2.2, µ̂ = X = (X1 +· · ·+X36 )/36 is an estimator of µ, and, if the measured
response times X1 = x1 , . . . , X36 = x36 yield an average of (x1 +· · ·+x36 )/36 = 9.3, µ̂ = 9.3
is a point estimate of µ.
4 Biased and Unbiased Estimators
Being a random variable, θ̂ serves only as an approximation to the true value of θ. With
some samples, θ̂ will overestimate the true θ, whereas for others it will underestimate it.
Therefore, properties of estimators and criteria for deciding among competing estimators
are most meaningfully stated in terms of their distribution, or the distribution of
θ̂ − θ = error of estimation.
(4.1)
The first criterion we will discuss is that of unbiasedness.
Definition 4.1. The estimator θ̂ of θ is called unbiased for θ if
E(θ̂) = θ.
(More exactly, θ̂ is unbiased for θ if Eθ (θ̂) = θ.) The difference
E(θ̂) − θ
is called the bias of θ̂ and is denoted by bias(θ̂). (More exactly, biasθ (θ̂) = Eθ (θ̂) − θ.)
Example 4.1. a) For the estimator p̂ discussed in Example 2.1,
Ep (p̂) = p,
where p is the population proportion. Thus p̂ is an unbiased estimator of p.
b) For the estimator µ̂ = X discussed in Example 2.2,
Eµ (X) = µ,
where µ is the true population mean. Thus, X is an unbiased estimator of µ.
Unbiased estimators have zero bias. This means that, though with any given sample θ̂
may underestimate or overestimate the true value of θ, the estimation error θ̂ − θ averages
to zero. Thus, when using unbiased estimators, there is no tendency to overestimate or
underestimate the true value of θ.
Other commonly used unbiased estimators in statistics include the estimators of the regression coefficients, which are described in the next example.
Example 4.2. ESTIMATION OF THE REGRESSION COEFFICIENTS. Let (X, Y )
be a bivariate random variable and suppose that its regression function (or the regression
function of its underlying population) is of the form
E(Y |X = x) = α + βx.
Then, it can be shown that
β = Cov(X, Y)/Var(X), and α = E(Y) − βE(X).    (4.2)
Thus, if we have a sample (X1 , Y1 ), . . . , (Xn , Yn ) from the underlying population, the
method of moments estimator of the regression parameters is
β̂ = Ĉov(X, Y)/V̂ar(X), and α̂ = Ȳ − β̂X̄,    (4.3)

where

Ĉov(X, Y) = (1/(n − 1)) Σ(Xi − X̄)(Yi − Ȳ), and V̂ar(X) = (1/(n − 1)) Σ(Xi − X̄)²
are the sample covariance of the (Xi , Yi )s and the sample variance of the Xi s, respectively.
REMARK: The estimators of the regression coefficients in (4.3) can also be derived with
the method of least squares (see Section 10) and thus are commonly referred to as the least
squares estimators. Under the additional assumption of normality, i.e. if it is assumed
that the conditional distribution of Y given X = x is
Y |X = x ∼ N (α + βx, σ 2 ),
the estimators (4.3) can also be derived by the method of maximum likelihood (see Section
10).
Proposition 4.1. The least squares estimators of the regression parameters, given in
(4.3), are unbiased. Thus,
Eβ(β̂) = β, and Eα(α̂) = α.
We close this section with a proposition which shows that another common estimator, the
sample variance, is an unbiased estimator of population variance.
Proposition 4.2. Let X1, . . . , Xn be a sample from a population with variance σ², and let S² = (n − 1)⁻¹ Σi(Xi − X̄)² be the sample variance. Then,

E(S²) = σ², or Eσ(S²) = σ²,

i.e. S² is an unbiased estimator of σ².
5 The Standard Error and the Mean Square Error
Though unbiasedness is a desirable property, many commonly used estimators do have a
small, but non-zero, bias. Moreover, there are cases where there is more than one unbiased
estimator. The following examples illustrate both possibilities.
Example 5.1. a) Let X1 , . . . , Xn be a random sample from some population having a
continuous symmetric distribution. Thus, the mean and the median coincide. Then the
sample mean X̄, the sample median X̃, and any trimmed mean are unbiased estimators for µ, as are hybrid estimators such as

median{(Xi + Xj)/2 ; i, j = 1, . . . , n, i ≠ j}.
b) The sample standard deviation S is a biased estimator of the population standard
deviation σ.
c) Let X1 , . . . , Xn be the life times of a random sample of n valves, and assume that each
life time has the exponential distribution with parameter λ (so µ = 1/λ). Then 1/X is a
biased estimator of λ = 1/µ, and exp{−500/X} is a biased estimator of exp{−λ500} =
P (X > 500), the probability that the life of a randomly chosen valve exceeds 500 time
units of operation.
d) Let X1 , . . . , Xn be a sample from a normal distribution with parameters µ, σ 2 . Then,
Φ((17.8 − X̄)/S) − Φ((14.5 − X̄)/S)

is a biased estimator of P(14.5 < X < 17.8), where X ∼ N(µ, σ²), and

x̂α = X̄ + S zα
is a biased estimator of the (1 − α)100th percentile of X.
In this section we will describe a criterion, based on the concept of mean square error, for
comparing the performance, or quality, of estimators that are not necessarily unbiased.
As a first step, we will discuss a criterion for choosing between two unbiased estimators.
This criterion is based on the standard error, a term synonymous with standard deviation.
Definition 5.1. The standard error of an estimator θ̂ is its standard deviation σθ̂ = √(σθ̂²), and an estimator/estimate of the standard error, σ̂θ̂ = √(σ̂θ̂²), is called the estimated standard error.
Example 5.2. a) The standard error, and the estimated standard error, of the estimator
p̂ discussed in Example 2.1 are
σp̂ = √(p(1 − p)/n), and σ̂p̂ = √(p̂(1 − p̂)/n).
With the given information that 12 of 20 cars sustain no visible damage, we have p̂ =
12/20 = 0.6, so that the estimated standard error is
σ̂p̂ = √(0.6 × 0.4/20) = 0.11.
b) The standard error, and the estimated standard error of X are
σX̄ = σ/√n, and σ̂X̄ = S/√n,
where S is the sample standard deviation. If the sample standard deviation of the n = 36
robot reaction times mentioned in Example 2.2 is S = 1.3, the estimated standard error
of X in that example is
σ̂X̄ = 1.3/√36 = 0.22.
A comparison of unbiased estimators is possible on the basis of their standard errors:
Among two unbiased estimators of a parameter θ, choose the one with the smaller standard
error. This is the standard error selection criterion for unbiased estimators. The
rationale behind this criterion is that a smaller standard error implies that the distribution
is more concentrated about the true value of θ. This is illustrated in the following figure.
[Figure: p.d.f.s of two unbiased estimators θ̂1 and θ̂2 of θ; the estimator with the smaller standard error is more concentrated around the true value of θ and is therefore more reliable.]
The standard error selection criterion for unbiased estimators will be used in Section 6
to determine whether or not stratified random sampling yields a better estimator of the
population mean than simple random sampling.
Depending on the population distribution and on θ, it may be possible to find an unbiased estimator of θ that has the smallest variance among all unbiased estimators of θ.
When such estimators exist they are called minimum variance unbiased estimators
(MVUE). The following proposition gives some examples of MVUE.
Proposition 5.1. a) If X1 , . . . , Xn is a sample from a normal distribution then X is
MVUE for µ, and S 2 is MVUE for σ 2 .
b) If X ∼ Bin(n, p), then p̂ = X/n is MVUE for p.
REMARK 1: The fact that X̄ is MVUE for µ when sampling from a normal distribution does not imply that X̄ is always the preferred estimator for µ. For example, if the population distribution is logistic or Cauchy, then the sample median, X̃, is better than X̄. Estimators
such as the various trimmed means and the hybrid estimator discussed in Example 5.1
perform well over a wide range of underlying population distributions.
We now proceed with the definition of mean square error and the corresponding more
general selection criterion.
Definition 5.2. The mean square error (MSE) of an estimator θ̂ for the parameter θ
is defined to be
MSE(θ̂) = E(θ̂ − θ)².

(More exactly, MSEθ(θ̂) = Eθ(θ̂ − θ)².) The MSE selection criterion says that among
two estimators, the one with smaller MSE is preferred.
The mean square error of an unbiased estimator equals its variance, i.e. the square of its standard error.
Thus, the MSE selection criterion reduces to the standard error selection criterion when
the estimators are unbiased. The next proposition reiterates this, and shows how the
MSE criterion incorporates both the standard error and the bias in order to compare
estimators that are not necessarily unbiased.
Proposition 5.2. If θ̂ is unbiased for θ then
MSE(θ̂) = σθ̂².

In general,

MSE(θ̂) = σθ̂² + [bias(θ̂)]².
In the next example, the MSE selection criterion is used to decide the better of two
estimators for the mean value of a uniform population.
Example 5.3. Suppose that X1 , . . . , Xn is a sample from a uniform in (0, θ) population.
Since the mean value of such a uniform population is µ = θ/2, and since the sample mean, X̄, is an unbiased estimator of µ, it follows that

θ̂1 = 2X̄
is an unbiased estimator of θ. An alternative estimator of θ, the supremum of the population values, is the maximum of the sample values,
θ̂2 = X(n) .
Because the largest observation, X(n) , is always smaller than θ, it follows that θ̂2 always
underestimates θ, i.e. it is a biased estimator of θ.
To decide which of the two is better, we look at their MSE. Because the variance of a
uniform in (0, θ) population is σ 2 = θ2 /12, and θ̂1 is unbiased we have
MSE(θ̂1) = σ²/n = θ²/(12n).
To find M SE(θ̂2 ), we first find the pdf of θ̂2 . Because
Fθ̂2(y) = P(X(n) ≤ y) = P(X1 ≤ y, X2 ≤ y, . . . , Xn ≤ y) = (y/θ)ⁿ,

the pdf is

fθ̂2(y) = (d/dy) Fθ̂2(y) = n y^(n−1)/θⁿ,
for 0 < y < θ. Using this we obtain
Eθ(θ̂2) = (n/(n + 1)) θ,  Eθ(θ̂2²) = (n/(n + 2)) θ²,  so that  Var(θ̂2) = (n/((n + 1)²(n + 2))) θ².

Thus, the bias of θ̂2 is

Eθ(θ̂2) − θ = −θ/(n + 1).

Combining the above, we obtain

MSE(θ̂2) = (n/((n + 1)²(n + 2))) θ² + (1/(n + 1)²) θ² = (1/(n + 1)²)(n/(n + 2) + 1) θ².
It is easy to check that if the sample size n is 21 or larger, then

(1/(n + 1)²)(n/(n + 2) + 1) < 1/(12n),
which implies that, according to the MSE selection criterion, the biased estimator θ̂2 is
to be preferred over the unbiased θ̂1 , if the sample size is large enough.
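A small simulation can make the MSE comparison of Example 5.3 concrete. The sketch below is ours (NumPy is assumed); θ = 1 and n = 30 are illustrative choices, with n > 21 so that θ̂2 should win.

```python
# Simulation sketch for Example 5.3: empirical MSE of theta1 = 2*Xbar vs theta2 = max(X).
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 1.0, 30, 100_000          # n = 30 exceeds 21, so theta2 should win

x = rng.uniform(0, theta, size=(reps, n))
theta1 = 2 * x.mean(axis=1)                # unbiased estimator
theta2 = x.max(axis=1)                     # biased estimator

mse1 = np.mean((theta1 - theta) ** 2)      # close to theta**2 / (12 n)
mse2 = np.mean((theta2 - theta) ** 2)      # close to theta**2 * (n/(n+2) + 1) / (n+1)**2
print(mse1, mse2)
```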
In the section on maximum likelihood estimation, the MSE criterion suggests that, with
large enough sample sizes, the biased estimators of probabilities and percentiles, which
are given in parts c) and d) of Example 5.1, are to be preferred over the corresponding
nonparametric estimators, i.e. sample proportions and sample percentiles, provided that
the assumption of exponential distribution made in part c) and of normal distribution in
part d) hold true.
6* Simple Random Sampling or Stratified Sampling?
In this section we employ the MSE selection principle to decide if the nonparametric
estimators of the population mean and probabilities from stratified random samples are
to be preferred over those obtained from simple random sampling. Because the estimators
involved are unbiased, the MSE selection criterion here coincides with the standard error
selection criterion.
Suppose that a component is produced in two production facilities and the produced components are mixed before being packaged for shipment. One of the two facilities (facility
A) has modernized its equipment resulting in faster production of more reliable components than those of facility B. In particular, suppose that 60% of all components come
from facility A and 40% come from facility B. Let X denote the lifetime of a randomly
chosen component from the combined, or overall, population of such components. For example, the component can be randomly chosen from a randomly chosen package. We are
interested in estimating µ = E(X) and p = P (X > 500), from a sample of size n = 100.
We will consider estimation using two different sampling schemes. First, let X1 , . . . , X100
be the lifetimes of 100 components obtained by simple random sampling from the overall
population of components. Then,
X̄ = (X1 + · · · + X100)/100, and p̂ = (# of Xi > 500)/100
are the usual nonparametric estimators of µ and p.
The second sampling scheme is the stratified sampling introduced briefly in Chapter 1,
Section 1.4. A stratified sample consists of a simple random sample of n1 components from
facility A and n2 = 100 − n1 components from facility B. Since 60% of all components
come from facility A, it makes sense to take n1 = 60 and n2 = 40. Let XA1 , . . . , XA60
denote the simple random sample of 60 components from facility A, and XB1 , . . . , XB40
the corresponding sample from facility B. Let
µA = E(XA ),
pA = P (XA > 500),
µB = E(XB ),
pB = P (XB > 500),
be the mean values and population proportions for each of the two facilities. Then,
µ = 0.6µA + 0.4µB , p = 0.6pA + 0.4pB .
The stratified estimators of µ and p, called stratified sample mean and stratified
sample proportion, respectively, are
X̄s = 0.6X̄A + 0.4X̄B,  p̂s = 0.6p̂A + 0.4p̂B,

where

X̄A = (XA1 + · · · + XA60)/60,  p̂A = (# of XAi > 500)/60,

and similarly for X̄B and p̂B.
Using the linearity property of the expected value, it is easy to verify that the stratified sample mean and stratified sample proportion are unbiased estimators for µ and p,
respectively. Should they be preferred over the estimators of µ and p based on simple
random sampling? According to the MSE criterion the answer is yes. Indeed, consider
the variances of the two estimators of p:
Var(p̂s) = 0.6² Var(p̂A) + 0.4² Var(p̂B)
        = 0.6² pA(1 − pA)/60 + 0.4² pB(1 − pB)/40
        = 0.6 pA(1 − pA)/100 + 0.4 pB(1 − pB)/100,

Var(p̂) = p(1 − p)/100 = 0.6 pA(1 − p)/100 + 0.4 pB(1 − p)/100.

Their difference is

Var(p̂) − Var(p̂s) = 0.6 (pA/100)(pA − p) + 0.4 (pB/100)(pB − p)
                = 0.6 (pA/100)(0.4pA − 0.4pB) + 0.4 (pB/100)(0.6pB − 0.6pA)
                = ((0.6)(0.4)/100)(pA − pB)².
This shows that the variance of the stratified sample proportion is smaller than that of
the proportion based on a simple random sample. The two variances are the same only
if pA = pB . Thus, on the basis of the MSE selection criterion, the stratified sample
proportion is to be preferred over the simple random sampling proportion. Similar (but
slightly more complicated) calculations show that Var(X̄) ≥ Var(X̄s).
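The variance comparison above can also be checked by simulation. The sketch below is ours; pA = 0.8 and pB = 0.5 are illustrative values, not given in the text, and NumPy is assumed.

```python
# Simulation sketch for Section 6: Var(p_hat) - Var(p_hat_s) should be near 0.24*(pA-pB)^2/100.
import numpy as np

rng = np.random.default_rng(1)
pA, pB, reps = 0.8, 0.5, 200_000           # illustrative facility proportions
p = 0.6 * pA + 0.4 * pB

# stratified sampling: 60 components from facility A, 40 from facility B
p_strat = 0.6 * rng.binomial(60, pA, reps) / 60 + 0.4 * rng.binomial(40, pB, reps) / 40

# simple random sampling: each of the 100 components comes from A with probability 0.6
nA = rng.binomial(100, 0.6, reps)
p_srs = (rng.binomial(nA, pA) + rng.binomial(100 - nA, pB)) / 100

print(p_srs.var() - p_strat.var(), 0.6 * 0.4 * (pA - pB) ** 2 / 100)
```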
7 Confidence Intervals
By virtue of the Central Limit Theorem, if the sample size n is large enough, many
estimators, θ̂, are approximately normally distributed. Moreover, such estimators are
typically unbiased (or nearly unbiased), and their estimated standard error, σ̂θ̂, typically provides a reliable estimate of σθ̂. Thus, if n is large enough, many estimators θ̂ satisfy

θ̂ ∼ N(θ, σ̂θ̂²), approximately,    (7.1)
where θ is the true value of the parameter. For example, this is the case for the nonparametric estimators, moment estimators and many maximum likelihood estimators, which
are described in Section 10, and least squares estimators, which are described in Chapter
10. For such estimators, it is customary to report point estimates together with their
estimated standard errors. The estimated standard error helps assess the size of the estimation error through the 68-95-99.7% rule of the normal distribution. For example, (7.1)
implies that
|θ̂ − θ| ≤ 2σ̂θ̂

holds approximately 95% of the time. Alternatively, this can be written as

θ̂ − 2σ̂θ̂ ≤ θ ≤ θ̂ + 2σ̂θ̂
which gives an interval of plausible values for the true value of θ, with degree of plausibility
approximately 95%. Such intervals are called confidence intervals. The abbreviation
CI will be used for ”confidence interval”. Note that if θ̂ is unbiased and we believe that the
normal approximation to its distribution is quite accurate, the 95% CI uses z0.025 = 1.96
instead of 2, i.e.
θ̂ − 1.96σ̂θ̂ ≤ θ ≤ θ̂ + 1.96σ̂θ̂.
The general technique for constructing (1 − α)100% confidence intervals for a parameter
θ based on an approximately unbiased and normally distributed estimator, θ̂, consists of
two steps:
a) Obtain an error bound, which holds with probability 1 − α. This error bound is of the
form
|θ̂ − θ| ≤ zα/2 σ̂θ̂.
b) Convert the error bound into an interval of plausible values for θ of the form
θ̂ − zα/2 σ̂θ̂ ≤ θ ≤ θ̂ + zα/2 σ̂θ̂,

or, in short-hand notation,

θ̂ ± zα/2 σ̂θ̂.
The degree of plausibility, or confidence level, of the interval will be (1 − α)100%.
In this section we will discuss the interpretation of CIs, present nonparametric confidence
intervals for population means and proportions, as well as an alternative, normal based
CI for the population mean. The issue of precision in estimation will be considered, and
we will see how the sample size can be manipulated in order to increase the precision of
the estimation of a population mean and proportion. Furthermore, we will discuss the
related issue of constructing prediction intervals under the normality assumption, and we
will present CIs for other parameters such as percentiles and the variance. Finally, some
methods for fitting models to data will be discussed.
7.1 Interpreting Confidence Intervals
There are two ways to view any CI, depending on whether θ̂ is viewed as an estimator
(i.e. a random variable) or as a point estimate. When θ̂ is viewed as a random variable
the CI is a random interval. Hence we can say that
the true value of θ belongs in θ̂ ± zα/2 σ̂θ̂
holds with probability 1 − α. But when θ̂ is viewed as a point estimate, the CI is a fixed
interval which either contains the true (and unknown to us) value of θ or it does not. For
example, as we will see in the next subsection, the estimate p̂ = 0.6, based on n = 20
Bernoulli trials, leads to the fixed 95% confidence interval of
(0.38, 0.82)
for p. This either contains the true value of p or it does not.
So how is a computed (1 − α)100% CI to be interpreted? To answer this question, think
of the process of constructing a (1 − α)100% CI as performing a Bernoulli trial where
the outcome is ”success” if the true value of θ belongs in the CI. Thus, the probability of
”success” is 1 − α, but we are not able to observe the outcome of the trial. Not knowing
the outcome, the only thing we can say is that our degree of confidence that the outcome
was ”success” is measured by (1 − α)100%.
7.2 Nonparametric CIs for Means and Proportions
In this subsection we will describe a CI for the mean of a population, when no parametric
assumption about the population distribution is made. As a special case, we will consider
CIs for a population proportion.
Let X1 , . . . , Xn denote a simple random sample from a population with mean value µ and
variance σ 2 . If the sample size is large (n > 30) the Central Limit Theorem asserts that
X̄ ∼ N(µ, σ²/n), approximately.    (7.2)
Moreover, if n is large enough, S is a good estimator of σ. Thus,
(X̄ − µ)/(S/√n) ∼ N(0, 1), approximately.    (7.3)
Relation (7.3) implies that the bound on the error of estimation
|X̄ − µ| ≤ zα/2 S/√n    (7.4)
holds with approximate probability 1−α, and this leads to the (approximate) (1−α)100%
confidence interval
X̄ ± zα/2 S/√n,    (7.5)
for µ. (In the rare cases where the true value of σ is known, we can use it instead of its
estimate S.) This is a nonparametric CI for the population mean, because it does not use
any distributional assumptions (although it does assume, as the Central Limit Theorem
does, that the population mean and variance exist, and are finite).
Example 7.1. In Examples 2.2 and 5.2, a sample of size n = 36 measured reaction times
yielded a sample average of X = 5.4, and a sample standard deviation of S = 1.3. Since
n = 36 is large enough to apply the Central Limit Theorem, the available information
yields an approximate 68% CI for the true value of µ (the mean response time of the
robot in the conceptual population of all malfunctions of that type), of
5.4 − 1.3/√36 ≤ µ ≤ 5.4 + 1.3/√36, or 5.18 ≤ µ ≤ 5.62.
Example 7.2. A random sample of n = 56 cotton samples gave average percent elongation of 8.17 and a sample standard deviation of S = 1.42. Calculate a 95% CI for µ, the
true average percent elongation.
Solution. The sample size of n = 56 is large enough for the application of the Central Limit
Theorem, which asserts that the distribution of the sample mean, X, is well approximated
by the normal distribution. Thus, the error bound (7.4) and the resulting nonparametric
CI (7.5) for µ can be used. The information given yields the CI
X̄ ± zα/2 S/√n = 8.17 ± 1.96 (1.42/√56) = 8.17 ± .37 = (7.8, 8.54).
A special case of the nonparametric CI (7.5) arises when X1 , . . . , Xn is a sample from a
Bernoulli population. Thus, each Xi takes the value 1 or 0, the population mean is µ = p
(the probability of 1), and the population variance is σ 2 = p(1 − p). When sampling
from a Bernoulli population, we are typically given only the binomial random variable,
T = X1 + · · · + Xn , or the observed proportion of 1s (which is the sample average),
X = p̂. By the Central Limit Theorem, the normal distribution provides an adequate
approximation to the distribution of T, or that of p̂, whenever np ≥ 5 and n(1 − p) ≥ 5,
and thus (7.2) is replaced by
p̂ ∼ N(p, p(1 − p)/n), approximately.
The above approximate distribution of p̂ leads to the (approximate) (1 − α)100% CI
p̂ ± zα/2 √(p̂(1 − p̂)/n).    (7.6)
Note that, because p is unknown, the condition np ≥ 5 and n(1 − p) ≥ 5 for applying the
CLT is replaced in practice by np̂ ≥ 5 and n(1 − p̂) ≥ 5, i.e. at least five 1s and at least
five 0s.
Example 7.3. In the car crash experiment of Example 2.1, it was given that 12 of 20
cars sustained no visible damage. Thus, the number of those that did not sustain visible
damage and the number of those that did, exceeds five. With this information, we have
p̂ = 12/20 = 0.6, so that, an approximate 95% CI, of the form given in (7.6), for the true
value of p (the population proportion of cars that sustain no visible damage) is
0.6 − 1.96√(0.6 × 0.4/20) ≤ p ≤ 0.6 + 1.96√(0.6 × 0.4/20),
or 0.6 − 0.2191 ≤ p ≤ 0.6 + 0.2191 or 0.38 ≤ p ≤ 0.82.
Example 7.4. The point estimate for the probability that a certain component functions
properly for at least 1000 hours, based on a sample of n = 100 of such components, is
p̂ = 0.91. Give a 95% CI for p.
Solution. The sample size conditions for applying the aforementioned CI for a binomial
parameter p hold here. Thus the desired CI is:
p̂ ± z.025 √(p̂(1 − p̂)/n) = .91 ± 1.96 √(.91 × .09/100) = .91 ± .059.
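For completeness, here is a small Python sketch (ours, assuming SciPy) of the interval (7.6) with the Example 7.4 numbers; the half-width is zα/2 √(p̂(1 − p̂)/n).

```python
# Sketch of the proportion CI (7.6) with the Example 7.4 numbers.
import math
from scipy import stats

n, p_hat = 100, 0.91
z = stats.norm.ppf(0.975)                       # z_{0.025}, about 1.96
half = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - half, p_hat + half)
```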
7.3 Confidence Intervals for a Normal Mean
In this subsection we will assume that X1 , . . . , Xn is a sample from a population having
a normal distribution, and we will present an alternative confidence interval for µ, which
does not require the sample size to be large.
In Chapter 5 we saw that, if X1 , . . . , Xn is a random sample from a normal population,
then by subtracting the population mean from the sample mean, X̄, and dividing the difference by the estimated standard error, S/√n, we obtain a random variable which has a t-distribution with n − 1 degrees of freedom. That is, we have

(X̄ − µ)/(S/√n) ∼ tn−1.    (7.7)
Note that (7.7) gives the exact distribution of (X̄ − µ)/(S/√n), whereas (7.3) gives the approximate distribution when n is sufficiently large. Using relation (7.7) we obtain an alternative bound on the error of estimation of µ, namely

|X̄ − µ| ≤ tn−1,α/2 S/√n,    (7.8)
which holds with probability 1 − α. We note again that the probability for this error
bound is exact, whereas the error bound (7.4) holds with probability approximately 1 − α,
provided that n is sufficiently large. Of course, (7.8) requires the normality assumption,
whereas (7.4) does not.
The notation tn−1,α/2 is similar to the zα notation, i.e. it corresponds to the 100(1−α/2)th
percentile of the t-distribution with n − 1 degrees of freedom:
[Figure: p.d.f. of the t-distribution with ν degrees of freedom; the area to the right of tν,α equals α.]
Note that, as the degrees of freedom ν = n − 1 gets large, tν,α/2 approaches zα/2 ; for
example, the 95th percentile for the t-distributions with ν = 9, 19, 60 and 120, are
1.833, 1.729, 1.671, 1.658, respectively, while z0.05 = 1.645. Selected percentiles of t-distributions are given in the t-table.
The error bound (7.8) leads to the following 100(1 − α)% CI for the normal mean:
(X̄ − tn−1,α/2 S/√n, X̄ + tn−1,α/2 S/√n).    (7.9)
Example 7.5. The mean weight loss of n = 16 grinding balls after a certain length of
time in mill slurry is 3.42g with S = 0.68g. Construct a 99% CI for the true mean weight
loss.
Solution. Here α = .01 and tn−1,α/2 = t15,0.005 = 2.947. Thus a 99% CI for µ is
X̄ ± tn−1,α/2 (S/√n) = 3.42 ± 2.947(0.68/√16), or 2.92 < µ < 3.92.
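The t-interval (7.9) is equally short to compute. The sketch below is ours (SciPy assumed) and reproduces Example 7.5.

```python
# Sketch of the t-interval (7.9) with the Example 7.5 numbers.
import math
from scipy import stats

n, xbar, S = 16, 3.42, 0.68
t_crit = stats.t.ppf(0.995, df=n - 1)      # t_{15, 0.005}, about 2.947
half = t_crit * S / math.sqrt(n)
print(xbar - half, xbar + half)            # about (2.92, 3.92)
```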
7.4 The Issue of Precision
Precision in the estimation of a parameter θ is quantified by the size of the bound of
the error of estimation |θ̂ − θ|. Equivalently, it can be quantified by the length of the
corresponding CI, which is twice the size of the error bound. A shorter error bound, or
shorter CI, implies more precise estimation.
The bounds we have seen on the error of estimation of a population mean µ and population
proportion p, are of the form
|X̄ − µ| ≤ zα/2 S/√n    (nonparametric case, n > 30)

|X̄ − µ| ≤ tn−1,α/2 S/√n    (normal case, any n)

|p̂ − p| ≤ zα/2 √(p̂(1 − p̂)/n)    (np̂ ≥ 5, n(1 − p̂) ≥ 5).
The first and the third of these bounds hold with probability approximately 1 − α, while
the second holds with probability exactly 1 − α, if the underlying population has a normal
distribution. In the rare cases when σ can be considered known, the S of the first error
bound is replaced by σ.
The above expressions suggest that the size of the error bound (or length of CI) depends
on the sample size n. In particular, a larger sample size yields a smaller error bound, and
thus more precise estimation. Shortly we will see how to choose n in order to achieve a
prescribed degree of precision.
It can also be seen that the probability with which the error bound holds, i.e. 1 − α, also
affects the size of the error bound, or length of the CI. For example, a 90% CI (α = .1, so α/2 = .05) is narrower than a 95% CI (α = .05, so α/2 = .025), which in turn is narrower than a 99% CI (α = .01, so α/2 = .005). This is so because
z.05 = 1.645 < z.025 = 1.96 < z.005 = 2.575,
and similar inequalities for the t-critical values. The increase of the length of the CI with
the level of confidence is to be expected. Indeed, we are more confident that the wider
CI will contain the true value of the parameter. However, we rarely want to reduce the
length of the CI by decreasing the level of confidence.
We will now deal with the main learning objective of this section, which is how to choose
n in order to achieve a prescribed degree of precision in the estimation of µ and of p. As
we will see, the main obstacle to getting a completely satisfactory answer to this question
lies in the fact that the estimated standard error, which enters all expressions of error
bounds, is unknown prior to the data collection. For the two estimation problems, we will
discuss separately ways to bypass this difficulty.
7.4.1 Sample size determination for µ
Consider first the task of choosing n for precise estimation of µ, in the rare case that σ is
known. In that case, if normality is assumed, or if we know that the needed sample size
will be > 30, the required n for achieving a prescribed length L of the (1 − α)100% CI,
is found equating the expression for the length of the CI to L, and solving the resulting
equation for n. That is, we solve
2zα/2 σ/√n = L,

for n. The solution is

n = (2zα/2 σ/L)².    (7.10)
More likely than not, the solution will not be an integer, in which case the recommended
procedure is to round up. The practice of rounding up guarantees that the prescribed
objective will be more than met.
Example 7.6. The time to response (in milliseconds) to an editing command with a new
operating system is normally distributed with an unknown mean µ and σ = 25. We want
a 95% CI for µ of length L = 10 milliseconds. What sample size n should be used?
Solution. For 95% CI, α = .05, α/2 = .025 and z.025 = 1.96. Thus, from formula (7.10)
we obtain
n = (2 · 1.96 · 25/10)² = 96.04,

which is rounded up to n = 97.
Typically however, σ is unknown, and thus, sample size determinations must rely on some
preliminary approximation, Sprl , to it. Two commonly used methods for obtaining this
approximation are:
a) If the range of population values is known, then σ can be approximated by dividing the range by 3.5 or 4. That is, use

Sprl = range/3.5, or Sprl = range/4,

instead of σ in the formula (7.10). This approximation is obtained by considering the standard deviation of a uniform in (a, b) random variable, which is σ = (b − a)/√12 = (b − a)/3.464.
b) Alternatively, the approximation can be based on the sample standard deviation, Sprl ,
of a preliminary sample.
This is not a completely satisfactory solution because the standard deviation
of the final sample, upon which the CI will be calculated, will be different from that of the
preliminary approximation of it, regardless of how this approximation was obtained. Thus
the prescribed precision objective might not be met, and some trial-and-error iteration
might be involved. The trial-and-error process gets slightly more complicated with the
t-distribution CIs, because the t-percentiles change with the sample size.
7.4.2 Sample size determination for p
Consider next the selection of sample size for meeting a prescribed level of precision in
the estimation of the binomial parameter p.
As noted previously, the required n for achieving a prescribed length L of the (1 − α)100%
CI, is found by equating the expression for the length of the CI for p to L, and solving
the resulting equation for n. That is, we solve
2zα/2 √(p̂(1 − p̂)/n) = L

for n. The solution is

n = 4zα/2² p̂(1 − p̂)/L².    (7.11)
The problem is that the value of p̂ is not known before the sample is collected, and thus
sample size determinations must rely on some preliminary approximation, p̂prl , to it. Two
commonly used methods for obtaining this approximation are:
a) When preliminary information about p exists. This preliminary information may
come either from a small pilot sample or from expert opinion. In either case, the
preliminary p̂prl is entered in (7.11) instead of p̂ for sample size calculation. Thus,
n = 4zα/2² p̂prl(1 − p̂prl)/L²,    (7.12)
and we round up.
b) When no preliminary information about p exists. In this case, we replace p̂(1 − p̂) in
(7.11) by 0.25. The rationale for doing so is seen by noting that p̂(1 − p̂) ≤ 0.25;
thus, by using the larger 0.25 the calculated sample size will be at least as large as
needed for meeting the precision specification. This gives
n = zα/2²/L².    (7.13)
Example 7.7. A preliminary sample gave p̂prl = 0.9. How large should n be to estimate
the probability of interest to within 0.01 with 95% confidence?
Solution. “To within 0.01” is another way of saying that the 95% bound on the error of
estimation should be 0.01, or the desired CI should have a width of 0.02. Since we have
preliminary information, we use (7.12):
n = 4(1.96)²(.91)(.09)/(.02)² = 3146.27.
This is rounded up to 3147.
Example 7.8. A new method of pre-coating fittings used in oil, brake and other fluid
systems in heavy-duty trucks is being studied. How large n is needed to estimate the proportion of fittings that leak to within .02 with 90% confidence? (No prior info available).
Solution. Here we have no preliminary information about p. Thus, we apply the formula
(7.13) and we obtain
n = zα/2²/L² = (1.645)²/(.04)² = 1691.26.
This is rounded up to 1692.
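The sample-size formulas can be wrapped in small helper functions. The sketch below is ours (the function names are not from the text; SciPy is assumed) and reproduces Examples 7.6-7.8, up to the rounding of zα/2.

```python
# Sketch of the sample-size formulas (7.10), (7.12) and (7.13).
import math
from scipy import stats

def n_for_mean(sigma, L, conf=0.95):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    return math.ceil((2 * z * sigma / L) ** 2)              # formula (7.10)

def n_for_proportion(L, conf=0.95, p_prl=None):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    pq = 0.25 if p_prl is None else p_prl * (1 - p_prl)     # (7.13) vs (7.12)
    return math.ceil(4 * z ** 2 * pq / L ** 2)

print(n_for_mean(25, 10))                    # Example 7.6: 97
print(n_for_proportion(0.02, p_prl=0.91))    # Example 7.7: 3147
print(n_for_proportion(0.04, conf=0.90))     # Example 7.8: about 1691 (1692 with z = 1.645)
```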
8 Prediction Intervals
The meaning of the word prediction is related to, but distinct from, the word estimation.
The latter is used when we are interested in learning the value of a population or model
parameter, while the former is used when we want to learn about the value that a future
observation might take.
For a concrete example, suppose you contemplate eating a hot dog and you wonder about
the amount of fat in the hot dog which you will eat. This is different from the question
”what is the expected (mean) amount of fat in hot dogs?” To further emphasize the
difference between the two, suppose that the amount of fat in a randomly selected hot
dog is known to be N (20, 9). Thus there are no unknown parameters to be estimated. In
particular we know that expected amount of fat in hot dogs is 20 gr. Still the amount of
fat in the hot dog which you will eat is unknown, simply because it is a random variable.
How do we predict it? According to well-accepted criteria, the best point-predictor of a
normal random variable with mean µ is µ. A (1 − α)100% prediction interval, or PI, is
an interval that contains the random variable (which is being predicted) with probability
1 − α. If both µ and σ are known, then a (1 − α)100% PI is
µ ± zα/2 σ.
In our particular example, X ∼ N (20, 9), so the best point predictor of X is 20 and a
95% PI is 20 ± (1.96)3 = (14.12, 25.88).
When µ, σ are unknown (as is typically the case) we use a sample X1 , . . . , Xn to estimate
µ, σ by X, S, respectively. Then, as best point predictor of a future observation, we use
X. But now, the prediction interval (always assuming normality) must take into account
the variability X, S as estimators of µ, σ. Doing so yields the following (1 − α)100% PI
for the next observation X:
(X̄ − tα/2,n−1 S√(1 + 1/n), X̄ + tα/2,n−1 S√(1 + 1/n)).    (8.1)

(In the above formula, the variability of X̄ is accounted for by the extra 1/n under the square root, and the variability of S is accounted for by the use of the t-percentiles.)
Example 8.1. The fat content measurements from a sample of size n = 10 hot dogs,
gave sample mean and sample standard deviation of X = 21.9, and S = 4.134. Give a
95% PI for the fat content of the next hot dog to be sampled.
Solution. Using the given information in the formula (8.1), we obtain the PI
X̄ ± t.025,9 S√(1 + 1/n) = (12.09, 31.71).
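A sketch of the prediction interval (8.1) with the Example 8.1 numbers (ours, assuming SciPy):

```python
# Sketch of the prediction interval (8.1) with the Example 8.1 numbers.
import math
from scipy import stats

n, xbar, S = 10, 21.9, 4.134
t_crit = stats.t.ppf(0.975, df=n - 1)            # t_{0.025, 9}, about 2.262
half = t_crit * S * math.sqrt(1 + 1 / n)
print(xbar - half, xbar + half)                  # about (12.09, 31.71)
```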
9* Other Confidence Intervals
In this section we will present confidence intervals for other parameters of interest, such as
the median and other percentiles, the variance, and regression parameters. In addition,
we will give a general method for constructing CIs for functions of parameters. The
techniques for constructing some of these CIs differ from the one used in Section 7.
9.1 Nonparametric CIs for percentiles
Let X1 , . . . , Xn denote a sample from a population having a continuous distribution, and
let xp denote the (1 − p)100th percentile. The basic idea for constructing a nonparametric
CI for xp is to associate a Bernoulli trial Yi with each observation Xi :
Yi = 1 if Xi > xp, and Yi = 0 if Xi < xp.
Thus, the probability of a 1 (or success) in each Bernoulli trial is p. Let Y = Σi Yi be the
Binomial(n, p) random variable. Let also X(1) < · · · < X(n) denote the ordered sample
values. Then the events
X(k) < xp < X(k+1) , X(k) < xp , xp < X(k+1) ,
are equivalent to
Y = n − k, Y ≤ n − k, Y ≥ n − k,
respectively.
Nonparametric (1 − α)100% CIs for xp will be of the form
X(a) < xp < X(b) ,
where the indices a, b are found from the requirements that
P(xp < X(a)) = α/2 and P(X(b) < xp) = α/2.

These requirements can equivalently be expressed in terms of Y as

P(Y ≥ n − a + 1) = α/2    (9.1)

P(Y ≤ n − b) = α/2,    (9.2)
and thus, a and b can be found using either the binomial tables (since Y ∼ Binomial(n, p))
or the normal approximation to the binomial. Using the normal approximation with
continuity correction, a and b are found from:
(n − a − np + 0.5)/√(np(1 − p)) = zα/2,   (n − b − np + 0.5)/√(np(1 − p)) = −zα/2.
The special case of p = 0.5, which corresponds to the median, deserves separate consideration. In particular, the (1 − α)100% CIs for the median x0.5 will be of the form
X(a) < x0.5 < X(n−a+1) ,
(9.3)
so that only a must be found. To see that this is so, note that if p = 0.5 then Y , the number
of 1s, has the same distribution as n − Y , the number of 0s (both are Binomial(n, 0.5)).
Thus,
P (Y ≥ n − a + 1) = P (n − Y ≥ n − a + 1)
= P (Y ≤ a − 1) ,
which implies that if a, b satisfy (9.1), (9.2), respectively, then they are related by n − b =
a − 1 or b = n − a + 1. The a in relation (9.3) can be found from the requirement that
P (Y ≤ a − 1) = α/2.
(9.4)
Example 9.1. Let X1 , . . . , X25 be a sample from a continuous population. Find the
confidence level of the following CI for the median:
(X(8), X(18)).
Solution. First note that the CI (X(8) , X(18) ) is of the form (9.3) with a = 8. According
to the formula (9.4), and the binomial tables,
α = 2P (Y ≤ 7) = 2(0.022) = 0.044.
Thus, the confidence level of the CI (X(8), X(18)) is (1 − α)100% = 95.6%.
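The index a in (9.3) can be found directly from the Binomial(n, 0.5) distribution. The sketch below (ours, assuming SciPy) recovers the interval of Example 9.1.

```python
# Sketch: order-statistic CI for the median with n = 25, as in Example 9.1.
from scipy import stats

n = 25
Y = stats.binom(n, 0.5)

# largest a with P(Y <= a - 1) <= alpha/2 for a target 95% interval (see (9.4))
a = max(k for k in range(1, n + 1) if Y.cdf(k - 1) <= 0.025)
b = n - a + 1
conf = 1 - 2 * Y.cdf(a - 1)
print(a, b, conf)          # a = 8, b = 18, confidence about 0.957
```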
9.2 CIs for the variance

9.2.1 CIs for a normal variance
Let X1 , . . . , Xn be a random sample from a population whose distribution belongs in the
normal family. In Chapter 5, we saw that, for normal samples, the sampling distribution
of S² is a multiple of a χ²n−1 random variable. Namely,
(n − 1)S²/σ² ∼ χ²n−1.
This fact implies that
χ²1−α/2,n−1 < (n − 1)S²/σ² < χ²α/2,n−1

will be true (1 − α)100% of the time, where χ²α/2,n−1 and χ²1−α/2,n−1 denote percentiles of the χ²n−1 distribution as shown in the figure below.
[Figure: p.d.f. of the χ²n−1 distribution; an area of α/2 lies to the left of χ²1−α/2,n−1 and an area of α/2 lies to the right of χ²α/2,n−1.]
Note that the bounds on the error of estimation of σ 2 by S 2 are given in terms of the
ratio S 2 /σ 2 . After some algebraic manipulations, we obtain that
(n − 1)S²/χ²α/2,n−1 < σ² < (n − 1)S²/χ²1−α/2,n−1    (9.5)

is true (1 − α)100% of the time.

Selected percentiles of χ² distributions are given in the chi-square table.
Example 9.2. An optical firm purchases glass to be ground into lenses. As it is important
that the various pieces of glass have nearly the same index of refraction, interest lies in
the variability. A random sample of n = 20 measurements yields S² = 1.2 × 10⁻⁴.
Find a 95% CI for σ.
Solution. Here n − 1 = 19, χ².975,19 = 8.906, and χ².025,19 = 32.852. Thus, according to
(9.5),
(19)(1.2 × 10⁻⁴)/32.852 < σ² < (19)(1.2 × 10⁻⁴)/8.906.
It follows that a 95% CI for σ is
√((19)(1.2 × 10⁻⁴)/32.852) < σ < √((19)(1.2 × 10⁻⁴)/8.906),
or .0083 < σ < .0160.
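A sketch of the interval (9.5) with the Example 9.2 numbers (ours, assuming SciPy):

```python
# Sketch of the chi-square interval (9.5) for a normal variance, Example 9.2 numbers.
import math
from scipy import stats

n, S2 = 20, 1.2e-4
chi2_upper = stats.chi2.ppf(0.975, df=n - 1)     # chi2_{0.025, 19}, about 32.852
chi2_lower = stats.chi2.ppf(0.025, df=n - 1)     # chi2_{0.975, 19}, about 8.907

var_lo = (n - 1) * S2 / chi2_upper
var_hi = (n - 1) * S2 / chi2_lower
print(math.sqrt(var_lo), math.sqrt(var_hi))      # about (0.0083, 0.0160) for sigma
```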
9.2.2 Nonparametric CIs for the variance

9.3 Nonparametric CIs for functions of parameters

9.4 CIs for Regression Parameters
In this subsection we will describe confidence intervals for the slope parameter, β, of the
simple linear regression model for the regression function of Y on X,
µY |X (x) = E(Y |X = x) = α + βx.
As seen in Example 4.2, if (X1, Y1), . . . , (Xn, Yn) are observations to be made, the LSE of the slope is

β̂ = Σ(Xi − X̄)(Yi − Ȳ)/Σ(Xi − X̄)² = (n ΣXiYi − (ΣXi)(ΣYi))/(n ΣXi² − (ΣXi)²).
In Proposition 4.1 we saw that β̂ is unbiased for β. The following proposition gives
the standard error and the estimated standard error of β̂, and, under the additional
assumption of normality, it gives the distribution of β̂
Proposition 9.1. Assume that the conditional variance of Y , given that X = x, is σ 2 .
Let (X1 , Y1 ), . . . , (Xn , Yn ) be observations to be made, and let β̂ be the LSE of the slope
parameter β. Then,
1. The standard error, and the estimated standard error of β̂ are
σβ̂ = √(σ²/(ΣXi² − (1/n)(ΣXi)²)), and σ̂β̂ = √(S²/(ΣXi² − (1/n)(ΣXi)²)),

respectively, where

S² = (1/(n − 2)) Σ(Yi − α̂ − β̂Xi)²

is the estimator of σ².
2. Assume now, in addition, that the conditional distribution of Y given X = x is
normal. Thus,
Y |X = x ∼ N (α + βx, σ 2 ).
Then, β̂ has a normal distribution. Thus,
β̂ ∼ N(β, σβ̂²), or (β̂ − β)/σβ̂ ∼ N(0, 1).
3. Under the above assumption of normality,
(β̂ − β)/σ̂β̂ ∼ tn−2.
REMARK: The quantity Σ(Yi − α̂ − β̂Xi)², which is used in the definition of the estimator, S², of σ², is called the error sum of squares, and is denoted by SSE. A computational formula for SSE is

SSE = ΣYi² − α̂ ΣYi − β̂ ΣXiYi.    (9.6)
Part 3 of Proposition 9.1, can be used for constructing bounds for the estimation error,
β̂ − β, and CIs for β, just as relation (7.7) was used for such purposes in the estimation
of a normal mean. Thus, the bound
|β̂ − β| ≤ tn−2,α/2 σ̂β̂    (9.7)
on the error of estimation of β, holds with probability 1 − α. This bound leads to an
100(1 − α)% CI for the slope β of the true regression line of the form
β̂ ± tα/2,n−2 σ̂β̂ .
(9.8)
Example 9.3. The data in this example come from a study on the dependence of
Y =propagation of an ultrasonic stress wave through a substance, on X=tensile strength
of substance. The n = 14 observations are:
X:   12   30   36   40   45   57   62   67   71   78   93   94  100  105
Y:  3.3  3.2  3.4  3.0  2.8  2.9  2.7  2.6  2.5  2.6  2.2  2.0  2.3  2.1
Assuming that the regression function of Y on X is described by the simple linear regression model, compute the LSE α̂ and β̂ and construct a 95% CI for β.
Solution. With the given data, ΣXi = 890, ΣXi² = 67,182, ΣYi = 37.6, ΣYi² = 103.54 and ΣXiYi = 2234.30. From this we get
β̂ = −0.0147209, and α̂ = 3.6209072.
Moreover, the point estimate of σ 2 is
S² = SSE/(n − 2) = (103.54 − α̂(37.6) − β̂(2234.30))/12 = .2624532/12 = 0.02187.
Using the above, the estimated standard error of β̂ is
σ̂β̂ = √(S²/(ΣXi² − (1/n)(ΣXi)²)) = √(0.02187/(67,182 − 890²/14)) = 0.001414.
Finally, the 95% CI for β is
−0.0147209 ± t0.025,12 0.001414 = −0.0147209 ± 2.179 × 0.001414
= −0.0147209 ± 0.00308 = (−0.0178, −0.01164).
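The whole computation of Example 9.3 can be redone in a few lines from the summary statistics. The sketch below is ours (SciPy assumed); small differences from the text's interval come only from rounding.

```python
# Sketch of the LSE and the slope CI (9.8) from the Example 9.3 summary statistics.
import math
from scipy import stats

n = 14
Sx, Sxx, Sy, Syy, Sxy = 890.0, 67_182.0, 37.6, 103.54, 2_234.30

beta_hat = (n * Sxy - Sx * Sy) / (n * Sxx - Sx**2)
alpha_hat = Sy / n - beta_hat * Sx / n
SSE = Syy - alpha_hat * Sy - beta_hat * Sxy           # formula (9.6)
S2 = SSE / (n - 2)
se_beta = math.sqrt(S2 / (Sxx - Sx**2 / n))           # estimated standard error of beta_hat

t_crit = stats.t.ppf(0.975, df=n - 2)
print(beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta)
```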
10* Methods for Fitting Models to Data
We have already seen the method of moments, as it applies to fitting both distribution
models and the simple linear regression model. In this section we will present two other
popular methods for fitting models to data, the method of least squares, and the method
of maximum likelihood.
10.1 The Method of Least Squares
In this subsection we will present the method of least squares, which is the most common
method for fitting regression models. We will describe this method for the case of fitting
the simple linear regression model, namely the model which assumes that the regression
function of Y on X is
µY |X (x) = E(Y |X = x) = α + βx.
The estimators we will obtain, called the least squares estimators, are the same as the
moment estimators that were derived in Example 4.2.
To explain the LS method, consider the problem of deciding which of two lines fits the data better. A typical data set and two lines that might be thought of as fitting the data are
shown in the next figure.
[Figure: a scatterplot of data with two candidate lines, Line 1 and Line 2; the vertical distance of a point from Line 1 is indicated.]
To answer the question of which of the two lines fits the data better, one must first adopt a
principle, on the basis of which, to judge the quality of a fit. The principle we will use is
the principle of least squares. According to this principle, the quality of the fit of a line to
data (x1 , y1 ), . . . , (xn , yn ) is judged by the sum of the squared vertical distances of each
point (xi , yi ) from the line. The line for which this sum of squared vertical distances is
smaller, is said to provide a better fit to the data.
The least squares estimates of the intercept, α, and the slope β, are the intercept and
the slope, respectively, of the best fitting line, i.e. of the line with the smallest sum of
vertical square distances. The best fitting line is also called the estimated regression
line.
Since the vertical distance of (xi , yi ) from a line a + bx is yi − (a + bxi ), the method of
least squares finds the values α̂, β̂ which minimize
Σ(yi − a − bxi)²
with respect to a, b. This minimization problem has a simple closed-form solution:
β̂ = (n Σxiyi − (Σxi)(Σyi))/(n Σxi² − (Σxi)²),  α̂ = ȳ − β̂x̄.
Thus, the estimated regression line is
µ̂Y |X (x) = α̂ + β̂x.
Example 10.1. With n = 10 data points on X=stress applied and Y =time to failure we
have summary statistics Σxi = 200, Σxi² = 5412.5, Σyi = 484, Σxiyi = 8407.5. Thus the best fitting line has slope and intercept of

β̂ = (10(8407.5) − (200)(484))/(10(5412.5) − (200)²) = −.900885,

α̂ = (1/10)(484) − (−.900885)(200/10) = 66.4177,

respectively.
10.2 The Method of Maximum Likelihood (ML)
The method of ML estimates θ by addressing the question “what value of the parameter
is most likely to have generated the data?”
The answer to this question, which is the ML estimator (MLE), is obtained by maximizing
(with respect to the parameter) the so-called likelihood function which is simply the
joint p.d.f. (or p.m.f.) of X1 , . . . , Xn evaluated at the sample points:
lik(θ) = ∏i f(xi|θ).
Typically, it is more convenient to maximize the logarithm of the likelihood function,
which is called the log-likelihood function. Since the logarithm is a monotone function,
this is equivalent to maximizing the likelihood function.
Example 10.2. A Bernoulli experiment with outcomes 0 or 1 is repeated independently
20 times. If we observe 5 1s, find the MLE of the probability of 1, p.
Solution. Here the observed random variable X has a binomial(n = 20, p) distribution.
Thus,
P(X = 5) = (20 choose 5) p⁵(1 − p)¹⁵.
The value of the parameter which is most likely to have generated the data X = 5 is the
one that maximizes this probability, which, in this case, is the likelihood function. The
log-likelihood is
ln P(X = 5) = ln(20 choose 5) + 5 ln(p) + 15 ln(1 − p).
Setting the first derivative of it to zero yields the MLE p̂ = 5/20. In general, the MLE of the binomial probability p is p̂ = X/n.
Example 10.3. Let X1 = x1 , . . . , Xn = xn be a sample from a population having the
exponential distribution, i.e. f(x|λ) = λ exp(−λx), x ≥ 0. Find the MLE of λ.
Solution. The likelihood function here is
lik(λ) = λe^(−λx1) · · · λe^(−λxn) = λⁿ e^(−λΣxi),
and the first derivative of the log-likelihood function is
(∂/∂λ)[n ln(λ) − λ ΣXi] = n/λ − ΣXi.
Setting this to zero yields λ̂ = n/ΣXi = 1/X̄ as the MLE of λ.
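The closed-form MLE of Example 10.3 can be checked against a direct numerical maximization of the log-likelihood. The sketch below is ours; the simulated data and the true rate λ = 0.1 are illustrative, and NumPy and SciPy are assumed.

```python
# Sketch for Example 10.3: closed-form exponential MLE vs numerical maximization.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.exponential(scale=10.0, size=25)        # simulated life times (scale = 1/lambda)

lam_closed = 1 / x.mean()                        # lambda_hat = 1 / Xbar

neg_loglik = lambda lam: -(len(x) * np.log(lam) - lam * x.sum())
res = minimize_scalar(neg_loglik, bounds=(1e-6, 1.0), method="bounded")
print(lam_closed, res.x)                         # the two values agree
```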
Example 10.4. The lifetime of a certain component is assumed to have the exp(λ)
distribution. A sample of n components is tested. Due to time constraints, the test
is terminated at time L. So instead of observing the life times X1 , . . . , Xn we observe
Y1, . . . , Yn where

Yi = Xi if Xi ≤ L, and Yi = L if Xi > L.
We want to estimate λ.
Solution. For simplicity, set Y1 = X1 , . . . , Yk = Xk , Yk+1 = L, . . . , Yn = L. The likelihood
function is
λe^(−λY1) · · · λe^(−λYk) · e^(−λYk+1) · · · e^(−λYn),

and the log-likelihood is

(ln(λ) − λY1) + · · · + (ln(λ) − λYk) − λL − · · · − λL = k ln(λ) − λ(Y1 + · · · + Yk) − (n − k)λL.
Setting the derivative w.r.t. λ to zero and solving gives the MLE
λ̂ = k/(Y1 + · · · + Yk + (n − k)L) = k/(Y1 + · · · + Yn) = k/(nȲ).
Example 10.5. Let X1 = x1 , . . . , Xn = xn be a sample from a population having the
uniform distribution on (0, θ). Find the MLE of θ.
Solution. Here f(x|θ) = 1/θ, if 0 < x < θ, and 0 otherwise. Thus the likelihood function is 1/θⁿ provided 0 < Xi < θ for all i, and is 0 otherwise. This is maximized by taking θ as small as possible. However, if θ is smaller than max(X1, . . . , Xn), then the likelihood function is zero. Thus the MLE is θ̂ = max(X1, . . . , Xn).
Theorem 10.1. (Optimality of MLEs) Under smoothness conditions on f (x|θ), when n
is large, the MLE θ̂ has sampling distribution which is approximately normal with mean
value equal to (or approximately equal to) the true value of θ, and variance nearly as small
as that of any other estimator. Thus, the MLE θ̂ is approximately a MVUE of θ.
REMARK: Among the conditions needed for the validity of Theorem 1 is that the set of
x-values for which f (x|θ) > 0 should not depend on θ. Thus, the sampling distribution
of the MLE θ̂ = max(X1, . . . , Xn) of Example 10.5 is not approximately normal even for
large n. However, application of the MSE criterion yields that the biased estimator
max(X1 , . . . , Xn ) should be preferred over the unbiased estimator 2X if n is sufficiently
large. Moreover, in this case, it is not difficult to remove the bias of θ̂ = max(X1 , . . . , Xn ).
Theorem 10.2. (Invariance of MLEs) If θ̂ is the MLE of θ and we are interested in
estimating a function, ϑ = g(θ), of θ then
ϑ̂ = g(θ̂)
is the MLE of ϑ. Thus, ϑ̂ has the optimality of MLEs stated in Theorem 10.1.
According to Theorem 10.2, the estimators given in Example 5.1 c), d) are optimal. The following examples revisit some of them.
Example 10.6. Consider the setting of Example 10.3, but suppose we are interested in the mean lifetime. For the exponential distribution µ = 1/λ. (So here θ = λ, ϑ = µ and ϑ = g(θ) = 1/θ.) Thus µ̂ = 1/λ̂ is the MLE of µ.
Example 10.7. Let X1 , . . . , Xn be a sample from N (µ, σ 2 ). Estimate:
a) P (X ≤ 400), and b) x.1 .
Solution. a) ϑ = P(X ≤ 400) = Φ((400 − µ)/σ) = g(µ, σ²). Thus

ϑ̂ = g(µ̂, s²) = Φ((400 − X̄)/s).

b) ϑ̂ = x̂.1 = µ̂ + σ̂ z.1 = g(µ̂, σ̂²) = X̄ + s z.1.
Note: As remarked also in Section 6.2, Example 10.7 shows that the estimator we choose
depends on what assumptions we are willing to make. If we do not assume normality (or
any other distribution)
a) P (X ≤ 400) would be estimated by p̂=the proportion of Xi ’s that are ≤ 400,
b) x.1 would be estimated by the sample 90th percentile.
If the normality assumption is correct, the MLEs of Example 10.7 are to be preferred by Theorem 10.1.
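A sketch (ours) of the two estimation routes in Example 10.7 on simulated normal data; µ = 380 and σ = 20 are illustrative values, not from the text, and NumPy and SciPy are assumed.

```python
# Sketch for Example 10.7: normal plug-in (ML) estimates vs nonparametric estimates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(380, 20, size=200)                # illustrative data
xbar, s = x.mean(), x.std(ddof=1)

prob_ml = stats.norm.cdf((400 - xbar) / s)       # plug-in estimate of P(X <= 400)
x10_ml = xbar + s * stats.norm.ppf(0.9)          # plug-in estimate of x_.1

prob_np = np.mean(x <= 400)                      # nonparametric: proportion of Xi <= 400
x10_np = np.percentile(x, 90)                    # nonparametric: sample 90th percentile
print(prob_ml, prob_np, x10_ml, x10_np)
```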