Download Problems of Estimation

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Degrees of freedom (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Misuse of statistics wikipedia , lookup

German tank problem wikipedia , lookup

Student's t-test wikipedia , lookup

Transcript
Problems of Estimation
A common problem in statistics is to obtain information about the mean, µ, of a population.
For example, we might want to know
• the mean age of people in the civilian labor force,
• the mean cost of a wedding,
• the mean gas mileage of a new-model car, or
• the mean starting salary of liberal-arts graduates.
If the population is small, we can ordinarily determine µ exactly by first taking a census and
then computing µ from the population data. If the population is large, however, as it often
is in practice, taking a census is generally impractical, extremely expensive, or impossible.
Nonetheless, we can usually obtain sufficiently accurate information about µ by taking a sample
from the population.
Point Estimate
One way to obtain information about a population mean µ is to estimate it by a sample mean
x, as illustrated in the next example.
EXAMPLE: The U.S. Census Bureau publishes annual price figures for new mobile homes in
Manufactured Housing Statistics. The figures are obtained from sampling, not from a census.
A simple random sample of 36 new mobile homes yielded the prices, in thousands of dollars,
shown in the Table below. Use the data to estimate the population mean price, µ, of all new
mobile homes.
Solution: We estimate the population mean price, µ, of all new mobile homes by the sample
mean price, x, of the 36 new mobile homes sampled. From the Table above,
X
xi
2278
x=
=
= 63.28
n
36
Interpretation: Based on the sample data, we estimate the mean price, µ, of all new mobile
homes to be approximately $63.28 thousand, that is, $63,280.
An estimate of this kind is called a point estimate for µ because it consists of a single number,
or point.
1
Confidence-Interval Estimate
As we know, a sample mean is usually not equal to the population mean; generally, there is
sampling error. Therefore, we should accompany any point estimate of µ with information
that indicates the accuracy of that estimate. This information is called a confidence-interval
estimate for µ, which we introduce in the next example.
EXAMPLE: Consider again the problem of estimating the (population) mean price, µ, of all new
mobile homes by using the sample data in the Table above. Let’s assume that the population
standard deviation of all such prices is $7200.
(a) Identify the distribution of the variable x, that is, the sampling distribution of the sample
mean for samples of size 36.
(b) Use part (a) to show that 95% of all samples of 36 new mobile homes have the property
that the interval from x − 2.4 to x + 2.4 contains µ.
(c) Use part (b) and the sample data in the Table above to find a 95% confidence interval for
µ, that is, an interval of numbers that we can be 95% confident contains µ.
Solution:
(a) A histogram of the price data in the Table above shows that the prices of new mobile
homes are normally distributed. Because n = 36, σ = 7.2, and prices of new mobile homes are
normally distributed, it follows that
we √
don’t know)
• µx = µ (which
√
• σx = σ/ n = 7.2/ 36 = 1.2
• x is normally distributed
In other words, for samples of size 36, the variable x is normally distributed with mean µ and
standard deviation 1.2.
(b) The “95” part of the 68-95-99.7 rule states that, for a normally distributed variable, 95% of
all possible observations lie within two standard deviations to either side of the mean. Applying
this rule to the variable x and referring to part (a), we see that 95% of all samples of 36 new
mobile homes have mean prices within 2 · 1.2 = 2.4 of µ. Equivalently, 95% of all samples of
36 new mobile homes have the property that the interval from x − 2.4 to x + 2.4 contains µ.
(c) Because we are taking a simple random sample, each possible sample of size 36 is equally
likely to be the one obtained. From part (b), 95% of all such samples have the property that
the interval from x − 2.4 to x + 2.4 contains µ. Hence, chances are 95% that the sample we
obtain has that property. Consequently, we can be 95% confident that the sample of 36 new
mobile homes whose prices are shown in the Table above has the property that the interval
from x − 2.4 to x + 2.4 contains µ. For that sample, x = 63.28, so
x − 2.4 = 63.28 − 2.4 = 60.88 and x + 2.4 = 63.28 + 2.4 = 65.68
Interpretation: We can be 95% confident that the mean price, µ, of all new mobile homes is
somewhere between $60,880 and $65,680.
2
NOTE: A confidence interval for a population mean depends on the sample mean, x, which
in turn depends on the sample selected. For example, suppose that the prices of the 36 new
mobile homes sampled were as shown in the Table below instead of as in the Table above.
Then we would have x = 65.83 so that
x − 2.4 = 65.83 − 2.4 = 63.43 and x + 2.4 = 65.83 + 2.4 = 68.23
In this case, the 95% confidence interval for would be from 63.43 to 68.23. We could be 95%
confident that the mean price, µ, of all new mobile homes is somewhere between $63,430 and
$68,230.
The Estimation of Means (σ Known)
We now develop a procedure to obtain a confidence interval for a population mean when the
population standard deviation is known. In doing so, we assume that the variable under consideration is normally distributed. Because of the central limit theorem, however, the procedure
will also work to obtain an approximately correct confidence interval when the sample size is
large, regardless of the distribution of the variable.
The basis of our confidence-interval procedure was stated in Chapter 10: If x is a normally
distributed variable with mean µ and standard deviation σ, then, for samples of√size n, the
variable x is also normally distributed and has mean µ and standard deviation σ/ n. We can
use that fact and the “95%” part
√ of the 68-95-99.7 rule to conclude that 95% of all samples of
size n have means within 2 · σ/ n of µ, as depicted in Figure (a) below
√
More generally, we can say that 100(1−α)% of all samples of size n have means within zα/2 ·σ/ n
as depicted in Figure (b) above. Equivalently, we can say that 100(1 − α)% of all samples of
size n have the property that the interval from
σ
x − zα/2 · √
n
σ
to x + zα/2 · √
n
contains µ.
3
EXAMPLE: The Bureau of Labor Statistics collects information on the ages of people in the
civilian labor force and publishes the results in Employment and Earnings. Fifty people in the
civilian labor force are randomly selected; their ages are displayed in the Table below.
Find a 95% confidence interval for the mean age, µ, of all people in the civilian labor force.
Assume that the population standard deviation of the ages is 12.1 years.
4
EXAMPLE: The Bureau of Labor Statistics collects information on the ages of people in the
civilian labor force and publishes the results in Employment and Earnings. Fifty people in the
civilian labor force are randomly selected; their ages are displayed in the Table below.
Find a 95% confidence interval for the mean age, µ, of all people in the civilian labor force.
Assume that the population standard deviation of the ages is 12.1 years.
Solution: Because the sample size is 50, which is large, and the population standard deviation
is known, we can use the Procedure above to find the required confidence interval.
Step 1: For a confidence level of 1 − α, use Table I to find zα/2 .
We want a 95% confidence interval, so α = 1 − 0.95 = 0.05. From Table I, zα/2 = z0.05/2 =
z0.025 = 1.96.
Step 2: The confidence interval for µ is from
σ
x − zα/2 · √
n
to
σ
x + zα/2 · √
n
We know σ = 12.1, n = 50, and, from Step 1, zα/2 = 1.96. To compute x for the data in the
Table above, we apply the usual formula:
X
xi
1819
=
= 36.4
x=
n
50
to one decimal place. Consequently, a 95% confidence interval for µ is from
12.1
36.4 − 1.96 · √
50
12.1
to 36.4 + 1.96 · √
50
or 33.0 to 39.8.
Interpretation: We can be 95% confident that the mean age, µ, of all people in the civilian
labor force is somewhere between 33.0 years and 39.8 years.
5
Determining the Required Sample Size
If the margin of error and confidence level are given, then we must determine the sample size
needed to meet those specifications. To find
√ the formula for the required sample size, we solve
the margin-of-error formula, E = zα/2 · σ/ n, for n.
EXAMPLE: Consider again the problem of estimating the mean age, µ, of all people in the
civilian labor force.
(a) Determine the sample size needed in order to be 95% confident that µ is within 0.5 year of
the estimate, x. Recall that σ = 12.1 years.
(b) Find a 95% confidence interval for µ if a sample of the size determined in part (a) has a
mean age of 38.8 years.
Solution:
(a) To find the sample size, we use the Formula above. We know that σ = 12.1 and E = 0.5.
The confidence level is 0.95, which means that α = 0.05 and zα/2 = z0.025 = 1.96. Thus
n=
z
α/2
E
· σ 2
=
1.96 · 12.1
0.5
2
= 2249.79
which, rounded up to the nearest whole number, is 2250.
Interpretation: If 2250 people in the civilian labor force are randomly selected, we can be
95% confident that the mean age of all people in the civilian labor force is within 0.5 year of
the mean age of the people in the sample.
(b) Applying the above Procedure with α = 0.05, σ = 12.1, x = 38.8, and n = 2250, we get the
confidence interval
12.1
12.1
to 38.8 + 1.96 · √
38.8 − 1.96 · √
2250
2250
or 38.3 to 39.3.
Interpretation: We can be 95% confident that the mean age, µ, of all people in the civilian
labor force is somewhere between 38.3 years and 39.3 years.
6
The Estimation of Means (σ Unknown)
What if, as is usual in practice, the population standard deviation is unknown? Then we cannot
base our confidence-interval procedure on the standardized version of x. The best we can do is
estimate the population standard deviation, σ, by the sample standard deviation, s; in other
words, we replace σ by s in the equation
z=
x−µ
√
σ/ n
and base our confidence-interval procedure on the resulting variable
t=
x−µ
√
s/ n
which is a value of a random variable having the t-distribution. More specifically, this distribution is called the Student t-distribution or Student’s t-distribution, as it was first
developed by a statistician, W.S.Gosset, who published his work under the pen name “Student.”
There is a different t-distribution for each sample size. We identify a particular t-distribution
by its number of degrees of freedom (df). For the studentized version of x, the number of
degrees of freedom is 1 less than the sample size, which we indicate symbolically by df = n−1.
A variable with a t-distribution has an associated curve, called a t-curve. Although there is
a different t-curve for each number of degrees of freedom, all t-curves are similar and resemble
the standard normal curve, as illustrated in the Figure above (right).
Percentages (and probabilities) for a variable having a t-distribution equal areas under the
variable’s associated t-curve. For our purposes, one of which is obtaining confidence intervals
for a population mean, we don’t need a complete t-table for each t-curve; only certain areas
will be important. Table II is sufficient for our purposes.
EXAMPLE: For a t-curve with 13 degrees of freedom, determine t0.05 ; that is, find the t-value
having area 0.05 to its right, as shown the Figure below.
7
EXAMPLE: For a t-curve with 13 degrees of freedom, determine t0.05 ; that is, find the t-value
having area 0.05 to its right, as shown in Figure (a) below.
Solution: To find the t-value in question, we use Table II, a portion of which is given below:
The number of degrees of freedom is 13, so we first go down the outside columns, labeled df,
to “13.” Then, going across that row to the column labeled t0.05 , we reach 1.771. This number
is the t-value having area 0.05 to its right, as shown in Figure (b) above. In other words, for a
t-curve with df = 13, t0.05 = 1.771.
EXAMPLE: The Federal Bureau of Investigation (FBI) compiles data on robbery and property crimes and publishes the
information in Population-at-Risk Rates and Selected Crime
Indicators. A simple random sample of pickpocket offenses
yielded the losses, in dollars, shown in the Table on the right.
Use the data to find a 95% confidence interval for the mean
loss, µ, of all pickpocket offenses.
8
EXAMPLE: The Federal Bureau of Investigation (FBI) compiles data on robbery and property crimes and publishes the
information in Population-at-Risk Rates and Selected Crime
Indicators. A simple random sample of pickpocket offenses
yielded the losses, in dollars, shown in the Table on the right.
Use the data to find a 95% confidence interval for the mean
loss, µ, of all pickpocket offenses.
Solution: Because the sample size, n = 25, is moderate, we first need to consider questions of
normality. To do that, we constructed a histogram that reveals that indeed we have a roughly
normal population. So, we can apply the above Procedure to find the confidence interval.
Step 1: For a confidence level of 1 − α, use Table II to find tα/2 with df= n − 1,
where n is the sample size.
We want a 95% confidence interval, so α = 1−0.95 = 0.05. For n = 25, we have df= 25−1 = 24.
From Table II, tα/2 = t0.05/2 = t0.025 = 2.064.
Step 2: The confidence interval for µ is from
s
x − tα/2 · √
n
to
s
x + tα/2 · √
n
From Step 1, tα/2 = 2.064. Applying the usual formulas for x and s to the data in the Table
above gives x = 513.32 and s = 262.23. So a 95% confidence interval for µ is from
262.23
513.32 − 2.064 · √
25
262.23
to 513.32 + 2.064 · √
25
or 405.07 to 621.57.
Interpretation: We can be 95% confident that the mean loss of all pickpocket offenses is
somewhere between $405.07 and $621.57.
9
10
11