Chapter 7
Statistical Inference:
Confidence Intervals

Learn ….
How to Estimate a Population
Parameter Using Sample Data
Agresti/Franklin Statistics, 1 of 87
Section 7.1
What Are Point and Interval
Estimates of Population
Parameters?
Point Estimate

A point estimate is a single
number that is our “best guess” for
the parameter
Interval Estimate

An interval estimate is an interval
of numbers within which the
parameter value is believed to fall.
Point Estimate vs Interval
Estimate


A point estimate doesn’t tell us how
close the estimate is likely to be to
the parameter
An interval estimate is more useful
• It incorporates a margin of error which
helps us to gauge the accuracy of the
point estimate
Point Estimation: How Do We Make
a Best Guess for a Population
Parameter?

Use an appropriate sample statistic:
• For the population mean, use the sample mean
• For the population proportion, use the sample proportion
Point Estimation: How Do We Make
a Best Guess for a Population
Parameter?

Point estimates are the most common
form of inference reported by the
mass media
Properties of Point Estimators

Property 1: A good estimator has a
sampling distribution that is centered at
the parameter
• An estimator with this property is
unbiased
• The sample mean is an unbiased estimator
of the population mean
• The sample proportion is an unbiased
estimator of the population proportion
Properties of Point Estimators

Property 2: A good estimator has a
small standard error compared to
other estimators
• This means it tends to fall closer than
other estimates to the parameter
Interval Estimation: Constructing an
Interval that Contains the Parameter
(We Hope!)

Inference about a parameter should
provide not only a point estimate but
should also indicate its likely
precision
Confidence Interval


A confidence interval is an interval
containing the most believable values
for a parameter
The probability that this method
produces an interval that contains the
parameter is called the confidence
level
• This is a number chosen to be close to 1, most commonly 0.95
What is the Logic Behind
Constructing a Confidence Interval?

To construct a confidence interval for
a population proportion, start with the
sampling distribution of a sample
proportion
The Sampling Distribution of the
Sample Proportion




Gives the possible values for the sample
proportion and their probabilities
Is approximately a normal distribution for
large random samples
Has a mean equal to the population
proportion
Has a standard deviation called the
standard error
A 95% Confidence Interval for a
Population Proportion

Fact: Approximately 95% of a normal
distribution falls within 1.96 standard
deviations of the mean
• That means: with probability 0.95, the sample proportion falls
within about 1.96 standard errors of the population proportion
Margin of Error


The margin of error measures how
accurate the point estimate is likely to
be in estimating a parameter
The distance of 1.96 standard errors
is the margin of error for a 95%
confidence interval
Confidence Interval


A confidence interval is constructed
by adding and subtracting a margin of
error from a given point estimate
When the sampling distribution is
approximately normal, a 95%
confidence interval has margin of
error equal to 1.96 standard errors
Section 7.2
How Can We Construct a
Confidence Interval to Estimate a
Population Proportion?
Finding the 95% Confidence Interval
for a Population Proportion



We symbolize a population proportion by p
The point estimate of the population
proportion is the sample proportion
We symbolize the sample proportion by p̂
Finding the 95% Confidence Interval
for a Population Proportion


A 95% confidence interval uses a margin of
error = 1.96(standard errors)
[point estimate ± margin of error] = $\hat{p} \pm 1.96(\text{standard errors})$
Finding the 95% Confidence Interval
for a Population Proportion

The exact standard error of a sample proportion
equals:
$\sqrt{p(1-p)/n}$


This formula depends on the unknown population
proportion, p
In practice, we don’t know p, and we need to
estimate the standard error
Finding the 95% Confidence Interval
for a Population Proportion

In practice, we use an estimated standard
error:
$se = \sqrt{\hat{p}(1-\hat{p})/n}$
Finding the 95% Confidence Interval
for a Population Proportion

A 95% confidence interval for a population
proportion p is:
$\hat{p} \pm 1.96(se)$, with $se = \sqrt{\hat{p}(1-\hat{p})/n}$
Example: Would You Pay Higher
Prices to Protect the Environment?

In 2000, the GSS asked: “Are you
willing to pay much higher prices in
order to protect the environment?”
• Of n = 1154 respondents, 518 were
willing to do so
Example: Would You Pay Higher
Prices to Protect the Environment?

Find and interpret a 95% confidence
interval for the population proportion
of adult Americans willing to do so at
the time of the survey
Example: Would You Pay Higher
Prices to Protect the Environment?
$\hat{p} = 518/1154 = 0.45$
$se = \sqrt{(0.45)(0.55)/1154} = 0.015$
$\hat{p} \pm 1.96(se) = 0.45 \pm 1.96(0.015) = 0.45 \pm 0.03 = (0.42, 0.48)$
Sample Size Needed for Large-Sample
Confidence Interval for a Proportion

For the 95% confidence interval for a
proportion p to be valid, you should have at
least 15 successes and 15 failures:
$n\hat{p} \ge 15$ and $n(1-\hat{p}) \ge 15$
“95% Confidence”


With probability 0.95, a sample
proportion value occurs such that the
confidence interval contains the
population proportion, p
With probability 0.05, the method
produces a confidence interval that
misses p
How Can We Use Confidence
Levels Other than 95%?



In practice, the confidence level 0.95
is the most common choice
But, some applications require
greater confidence
To increase the chance of a correct
inference, we use a larger confidence
level, such as 0.99
A 99% Confidence Interval for p
$\hat{p} \pm 2.58(se)$
Different Confidence Levels

In using confidence intervals, we
must compromise between the
desired margin of error and the
desired confidence of a correct
inference
• As the desired confidence level
increases, the margin of error gets
larger
What is the Error Probability for
the Confidence Interval Method?

The general formula for the confidence
interval for a population proportion is:
Sample proportion ± (z-score)(std. error)
which in symbols is
$\hat{p} \pm z(se)$
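The z-score in this formula depends on the confidence level. As a rough illustration, the sketch below recovers it from the standard normal distribution using Python's standard library. The helper name `z_for_confidence` is an assumption made for this example; the error probability is taken to be 1 minus the confidence level, split equally between the two tails.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """z-score such that `level` of the standard normal lies between -z and +z."""
    error_probability = 1 - level
    # Half of the error probability sits in each tail
    return NormalDist().inv_cdf(1 - error_probability / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_for_confidence(level):.2f}")
```

For 95% and 99% confidence this gives the familiar multipliers 1.96 and 2.58.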
Summary: Confidence Interval
for a Population Proportion, p

A confidence interval for a population
proportion p is:
$\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$
Summary: Effects of Confidence
Level and Sample Size on Margin of
Error

The margin of error for a confidence
interval:
• Increases as the confidence level
increases
• Decreases as the sample size
increases
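Both effects can be checked numerically. The sketch below is only an illustration: it reuses the GSS counts (518 of 1154) from the earlier example, and the helper name `margin_of_error` is hypothetical.

```python
import math

def margin_of_error(p_hat, n, z):
    """Margin of error z * sqrt(p_hat * (1 - p_hat) / n) for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 518 / 1154

m_95 = margin_of_error(p_hat, 1154, z=1.96)          # 95% confidence
m_99 = margin_of_error(p_hat, 1154, z=2.58)          # 99% confidence: wider
m_95_4n = margin_of_error(p_hat, 4 * 1154, z=1.96)   # four times the sample size

print(f"95%: {m_95:.3f}  99%: {m_99:.3f}  95% with 4n: {m_95_4n:.3f}")
```

Because n sits under a square root, quadrupling the sample size only halves the margin of error.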
What Does It Mean to Say that
We Have “95% Confidence”?

If we used the 95% confidence
interval method to estimate many
population proportions, then in the
long run about 95% of those intervals
would give correct results, containing
the population proportion
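This long-run interpretation can be illustrated by simulation. The sketch below is not from the slides: it assumes a hypothetical true proportion of 0.45, draws many random samples, builds a 95% interval from each, and counts how often the interval contains the true value. The function name, sample size, number of trials, and seed are arbitrary choices.

```python
import math
import random

def coverage_rate(p, n, trials, z=1.96, seed=0):
    """Fraction of simulated confidence intervals that contain the true proportion p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw one random sample of size n and count the successes
        successes = sum(rng.random() < p for _ in range(n))
        p_hat = successes / n
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p <= p_hat + margin:
            hits += 1
    return hits / trials

rate = coverage_rate(p=0.45, n=1000, trials=2000)
print(f"Share of intervals containing p: {rate:.3f}")
```

The share should come out close to 0.95, though it varies a little from run to run when the seed changes.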
A recent survey asked: “During the
last year, did anyone take something
from you by force?”


Of 987 subjects, 17 answered “yes”
Find the point estimate of the proportion
of the population who were victims
a. .17
b. .017
c. .0017
Section 7.3
How Can We Construct a
Confidence Interval To Estimate a
Population Mean?
How to Construct a Confidence
Interval for a Population Mean




Point estimate ± margin of error
The sample mean is the point
estimate of the population mean
The exact standard error of the
sample mean is $\sigma/\sqrt{n}$
In practice, we estimate σ by the
sample standard deviation, s
How to Construct a Confidence
Interval for a Population Mean



For large n, and also for small n from an underlying population
that is normal, the confidence interval for the population
mean is:
$\bar{x} \pm z(\sigma/\sqrt{n})$
How to Construct a Confidence
Interval for a Population Mean



In practice, we don’t know the
population standard deviation
Substituting the sample standard
deviation s for σ to get $se = s/\sqrt{n}$
introduces extra error
To account for this increased error,
we replace the z-score by a slightly
larger score, the t-score
How to Construct a Confidence
Interval for a Population Mean


In practice, we estimate the standard
error of the sample mean by $se = s/\sqrt{n}$
Then, we multiply se by a t-score from
the t-distribution to get the margin of
error for a confidence interval for the
population mean
Properties of the t-distribution



The t-distribution is bell shaped and
symmetric about 0
The probabilities depend on the
degrees of freedom, df
The t-distribution has thicker tails and
is more spread out than the standard
normal distribution
t-Distribution
Summary: 95% Confidence
Interval for a Population Mean

A 95% confidence interval for the
population mean µ is:
$\bar{x} \pm t_{.025}(s/\sqrt{n})$; df = n - 1

To use this method, you need:
• Data obtained by randomization
• An approximately normal population distribution
Example: eBay Auctions of
Palm Handheld Computers

Do you tend to get a higher, or a
lower, price if you give bidders the
“buy-it-now” option?
Example: eBay Auctions of
Palm Handheld Computers


Consider some data from sales of the
Palm M515 PDA (personal digital
assistant)
During the first week of May 2003, 25
of these handheld computers were
auctioned off, 7 of which had the
“buy-it-now” option
Example: eBay Auctions of
Palm Handheld Computers

“Buy-it-now” option:
235 225 225 240 250 250 210

Bidding only:
250 249 255 200 199 240 228
255 232 246 210 178 246 240
245 225 246 225
Example: eBay Auctions of
Palm Handheld Computers

Summary of selling prices for the two
types of auctions:
buy_now   N    Mean   StDev  Minimum      Q1  Median      Q3  Maximum
no       18  231.61   21.94   178.00  221.25  240.00  246.75   255.00
yes       7  233.57   14.64   210.00  225.00  235.00  250.00   250.00
Example: eBay Auctions of
Palm Handheld Computers

To construct a confidence interval
using the t-distribution, we must
assume a random sample from an
approximately normal population of
selling prices
Example: eBay Auctions of
Palm Handheld Computers



Let µ denote the population mean for
the “buy-it-now” option
The estimate of µ is the sample mean:
x̄ = $233.57
The sample standard deviation is:
s = $14.64
Example: eBay Auctions of
Palm Handheld Computers

The 95% confidence interval for the “buy-it-now” option is:
$\bar{x} \pm t_{.025}(s/\sqrt{n}) = 233.57 \pm 2.447(14.64/\sqrt{7})$

which is 233.57 ± 13.54 or (220.03, 247.11)
Example: eBay Auctions of
Palm Handheld Computers

The 95% confidence interval for the
mean sales price for the bidding only
option is:
(220.70, 242.52)
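Both intervals can be recomputed from the raw selling prices listed earlier. The sketch below uses only the Python standard library, so the t-scores are passed in by hand: 2.447 for df = 6 and 2.110 for df = 17 are standard t-table values for 95% confidence. The function name `t_confidence_interval` is an illustrative choice.

```python
import math
from statistics import mean, stdev

def t_confidence_interval(data, t_score):
    """Confidence interval for a population mean: x_bar ± t * s / sqrt(n).

    t_score must be looked up for df = len(data) - 1 (t table or software).
    """
    n = len(data)
    x_bar = mean(data)
    se = stdev(data) / math.sqrt(n)  # stdev uses the n - 1 divisor
    margin = t_score * se
    return x_bar - margin, x_bar + margin

buy_it_now = [235, 225, 225, 240, 250, 250, 210]
bidding_only = [250, 249, 255, 200, 199, 240, 228, 255, 232,
                246, 210, 178, 246, 240, 245, 225, 246, 225]

buy_ci = t_confidence_interval(buy_it_now, t_score=2.447)    # df = 6
bid_ci = t_confidence_interval(bidding_only, t_score=2.110)  # df = 17
print("Buy-it-now:   (%.2f, %.2f)" % buy_ci)
print("Bidding only: (%.2f, %.2f)" % bid_ci)
```

With statistical software available, the t-scores could be computed rather than looked up; they are hard-coded here to keep the example dependency-free.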
Example: eBay Auctions of
Palm Handheld Computers

Notice that the two intervals overlap
a great deal:
• “Buy-it-now”: (220.03, 247.11)
• Bidding only: (220.70, 242.52)

There is not enough information for us to
conclude that one probability distribution
clearly has a higher mean than the other
How Do We Find a t- Confidence
Interval for Other Confidence
Levels?

The 95% confidence interval uses $t_{.025}$
since 95% of the probability falls
between $-t_{.025}$ and $t_{.025}$

For 99% confidence, the error
probability is 0.01 with 0.005 in each
tail and the appropriate t-score is $t_{.005}$
If the Population is Not Normal,
is the Method “Robust”?


A basic assumption of the confidence
interval using the t-distribution is that
the population distribution is normal
Many variables have distributions that
are far from normal
If the Population is Not Normal,
is the Method “Robust”?

How problematic is it if we use the t-confidence
interval even if the
population distribution is not normal?
If the Population is Not Normal,
is the Method “Robust”?


For large random samples, it’s not
problematic
The Central Limit Theorem applies:
for large n, the sampling distribution
is bell-shaped even when the
population is not
If the Population is Not Normal,
is the Method “Robust”?



What about a confidence interval using the
t-distribution when n is small?
Even if the population distribution is not
normal, confidence intervals using t-scores
usually work quite well
We say the t-distribution is a robust method
in terms of the normality assumption
Cases Where the t- Confidence
Interval Does Not Work

With binary data

With data that contain extreme
outliers
The Standard Normal Distribution is
the t-Distribution with df = ∞
The 2002 GSS asked: “What do you
think is the ideal number of children in
a family?”

The 497 females who responded had a median
of 2, mean of 3.02, and standard deviation of
1.81. What is the point estimate of the
population mean?
a. 497
b. 2
c. 3.02
d. 1.81
Section 7.4
How Do We Choose the Sample
Size for a Study?
How are the Sample Sizes
Determined in Polls?

It depends on how much precision is
needed as measured by the margin of
error

The smaller the margin of error, the
larger the sample size must be
Choosing the Sample Size for
Estimating a Population Proportion?



First, we must decide on the desired
margin of error
Second, we must choose the
confidence level for achieving that
margin of error
In practice, 95% confidence intervals
are most common
Example: What Sample Size Do
You Need For An Exit Poll?

A television network plans to predict
the outcome of an election between
two candidates – Levin and Sanchez

They will do this with an exit poll that
randomly samples voters on election
day
Example: What Sample Size Do
You Need For An Exit Poll?

The final poll a week before election
day estimated Levin to be well ahead,
58% to 42%
• So the outcome is not expected to be
close

The researchers decide to use a
sample size for which the margin of
error is 0.04
Example: What Sample Size Do
You Need For An Exit Poll?

What is the sample size for which a
95% confidence interval for the
population proportion has margin of
error equal to 0.04?
Example: What Sample Size Do
You Need For An Exit Poll?

The 95% confidence interval for a
population proportion p is:
$\hat{p} \pm 1.96(se)$

If the sample size is such that 1.96(se) =
0.04, then the margin of error will be
0.04
Example: What Sample Size Do
You Need For An Exit Poll?

Find the value of the sample size n for
which 0.04 = 1.96(se):
$0.04 = 1.96\sqrt{\hat{p}(1-\hat{p})/n}$
Solve algebraically for n:
$n = (1.96)^2\,\hat{p}(1-\hat{p})/(0.04)^2$
Example: What Sample Size Do
You Need For An Exit Poll?

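Substituting p̂ = 0.58, the estimate from the final
pre-election poll, gives n = (1.96)²(0.58)(0.42)/(0.04)² ≈ 585
A random sample of size n = 585
should give a margin of error of about
0.04 for a 95% confidence interval for
the population proportion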
How Can We Select a Sample Size
Without Guessing a Value for the
Sample Proportion


In the formula for determining n, setting
p̂ = 0.50 gives the largest value for n
out of all the possible values to
substitute for p̂
Doing this is the “safe” approach that
guarantees we’ll have enough data
Sample Size for Estimating a
Population Parameter

The random sample size n for which a
confidence interval for a population proportion p
has margin of error m (such as m = 0.04) is
$n = \hat{p}(1-\hat{p})\,z^2/m^2$
Sample Size for Estimating a
Population Parameter

The z-score is based on the confidence
level, such as z = 1.96 for 95%
confidence

You either guess the value you’d get for
the sample proportion based on other
information or take the safe approach of
setting p̂ = 0.50
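The sample size formula for a proportion is easy to turn into a small helper. This is a sketch under stated assumptions: the function name `sample_size_for_proportion` is hypothetical, and the result is rounded up to the next whole number so that the margin of error is no larger than requested. The exit poll example is evaluated with the guess p̂ = 0.58 from the pre-election poll and with the “safe” value 0.50.

```python
import math

def sample_size_for_proportion(margin, p_guess=0.50, z=1.96):
    """Sample size n = z^2 * p(1 - p) / m^2, rounded up to a whole number."""
    return math.ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)

n_exit_poll = sample_size_for_proportion(0.04, p_guess=0.58)  # educated guess
n_safe = sample_size_for_proportion(0.04)                     # safe approach
print(n_exit_poll, n_safe)
```

The guess of 0.58 reproduces n = 585, while the safe approach asks for a slightly larger sample of 601.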
Sample Size for Estimating a
Population Mean

The random sample size n for which a 95%
confidence interval for a population mean has
margin of error approximately equal to m is
$n = 4s^2/m^2$

To use this formula, you guess the value you’ll
get for the sample standard deviation, s
Sample Size for Estimating a
Population Mean


In practice, since you don’t yet have
the data, you don’t know the value of
the sample standard deviation, s
You must substitute an educated
guess for s
• You can use the sample standard
deviation from a similar study
Example: Finding n to Estimate
Mean Education in South Africa

A social scientist plans a study of
adult South Africans to investigate
educational attainment in the black
community

How large a sample size is needed so
that a 95% confidence interval for the
mean number of years of education
has margin of error equal to 1 year?
Example: Finding n to Estimate
Mean Education in South Africa

No prior information about the
standard deviation of educational
attainment is available

We might guess that the sample
education values fall within a range of
about 18 years
Example: Finding n to Estimate
Mean Education in South Africa




If the data distribution is bell-shaped,
the range from x̄ – 3s to x̄ + 3s will
contain nearly all the distribution
The distance from x̄ – 3s to x̄ + 3s equals 6s
Solving 18 = 6s for s yields s = 3
So ‘3’ is a crude estimate of s
Example: Finding n to Estimate
Mean Education in South Africa

The desired margin of error is m = 1 year

The required sample size is:
$n = 4s^2/m^2 = 4(3)^2/(1)^2 = 36$
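The arithmetic for the mean can be sketched the same way. The factor 4 in the formula corresponds to rounding the 95% z-score of 1.96 up to 2, so the hypothetical helper below takes z = 2 as its default; the function and variable names are illustrative, not from the slides.

```python
import math

def sample_size_for_mean(margin, s_guess, z=2.0):
    """Sample size n = (z * s / m)^2, rounded up; z = 2 gives n = 4 s^2 / m^2."""
    return math.ceil((z * s_guess / margin) ** 2)

s_guess = 18 / 6  # a range of about 18 years treated as roughly 6 standard deviations
n_needed = sample_size_for_mean(margin=1, s_guess=s_guess)
print(n_needed)
```

Because s = 3 is only a crude guess, the resulting n = 36 should be read as a rough planning figure rather than an exact requirement.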
What Factors Affect the Choice
of the Sample Size?


The first is the desired precision, as
measured by the margin of error, m
The second is the confidence level
What Other Factors Affect the
Choice of the Sample Size?

A third factor is the variability in the
data
• If subjects have little variation (that is, s
is small), we need fewer data than if
they have substantial variation

A fourth factor is financial
• Cost is often a major constraint
What if You Have to
Use a Small n?

The t- methods for a mean are valid
for any n
• However, you need to be extra cautious to
look for extreme outliers or great
departures from the normal population
assumption
What if You Have to
Use a Small n?

In the case of the confidence interval
for a population proportion, the
method works poorly for small
samples
Constructing a Small-Sample
Confidence Interval for a Proportion


Suppose a random sample does not have at least 15
successes and 15 failures
The confidence interval formula:
$\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$
• Is still valid if we use it after adding ‘2’ to the original
number of successes and ‘2’ to the original number of
failures
• This results in adding ‘4’ to the sample size n
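A minimal sketch of this small-sample adjustment follows. The data (3 successes in 10 trials) are hypothetical and chosen only to show the mechanics; the function name `plus_four_ci` is likewise an assumption for this example.

```python
import math

def plus_four_ci(successes, n, z=1.96):
    """Small-sample interval: add 2 successes and 2 failures, then apply the usual formula."""
    adj_successes = successes + 2
    adj_n = n + 4  # 2 extra successes + 2 extra failures
    p_adj = adj_successes / adj_n
    se = math.sqrt(p_adj * (1 - p_adj) / adj_n)
    return p_adj - z * se, p_adj + z * se

# Hypothetical sample: 3 successes out of 10, too few for the large-sample method
lower, upper = plus_four_ci(3, 10)
print(f"({lower:.3f}, {upper:.3f})")
```

The adjusted proportion here is 5/14 rather than 3/10, which pulls the interval's center toward 0.5.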