Confidence Interval for Proportions and Means
Dr. Tom Ilvento
BUAD 820

Proportions
• The Pepsi Challenge asked soda drinkers to compare Diet Coke and Diet Pepsi in a blind taste test.
• Pepsi claimed that more than ½ of Diet Coke drinkers said they preferred Diet Pepsi.
• Suppose we take a random sample of 100 Diet Coke drinkers and find that 56 preferred Diet Pepsi.

Proportions
• \( p_s \) = sample proportion
• If x represents the number of successes in our sample, then our estimator of p (the population parameter) from a sample is
  \[ p_s = x/n \]
• The variance of a proportion is given by
  \[ s^2 = p_s q_s \]
  \[ s = (p_s q_s)^{.5} \]
  where \( q_s = 1 - p_s \)
• The Standard Error of the Sampling Distribution of a proportion is
  \[ \text{SE for } p = (pq/n)^{.5} \]
• If we don't know p and q, we use the sample estimates, \( p_s \) and \( q_s \)
• Note: we will think there is a population proportion, p, with variance equal to \( \sigma^2 \)

Pepsi Challenge
• My sample
  • n = 100
• Calculate ps
• Calculate qs
• Calculate the Variance and Standard Deviation
• Calculate the Standard Error
• The sample provides an estimate
  • Point Estimate: a single value computed from a sample and used to estimate the value of the target population parameter.
  • The sample proportion and s are point estimates of the population proportion p and the population standard deviation σ, respectively.
• I would like to place a bound of error around the estimate: a Confidence Interval
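
The Pepsi Challenge calculations listed above (ps, qs, variance, standard deviation, standard error) can be worked through in a few lines. This is a minimal sketch using the sample from the slides (56 of 100); the variable names are illustrative and not part of the original material.

```python
import math

# Pepsi Challenge sample from the slides:
# 56 of 100 Diet Coke drinkers preferred Diet Pepsi
x, n = 56, 100

p_s = x / n                            # sample proportion, p_s = x/n
q_s = 1 - p_s                          # q_s = 1 - p_s
variance = p_s * q_s                   # s^2 = p_s * q_s
std_dev = math.sqrt(variance)          # s = (p_s * q_s)^.5
std_error = math.sqrt(p_s * q_s / n)   # SE = (p_s * q_s / n)^.5

print(f"p_s = {p_s:.2f}, q_s = {q_s:.2f}")
print(f"variance = {variance:.4f}, s = {std_dev:.4f}, SE = {std_error:.4f}")
```

The standard error comes out at about .0496, the value used in the interval calculations that follow.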

Pepsi Challenge
• I need to think of my sample as one of many possible samples
• I know from our work on the Normal curve that z-values within ± 1.96 cover 95 percent of the values
  • A z-value of 1.96 is associated with a probability of .475 on one side of the normal curve
  • 2 times that value yields 95% of the area under the normal curve

Pepsi Challenge
• If I think of my sample as part of the sampling distribution
• I can place a ± 1.96(standard error) around my estimate
• Like this:
  • .56 ± 1.96(.0496)
  • .56 ± .097
  • .463 to .657
• Notice that this interval has values less than .50, which are below Pepsi's claim!
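
The interval above can be reproduced with a small helper. This is a sketch under the slide's assumptions (normal approximation, z = 1.96); the function name `proportion_ci` is hypothetical.

```python
import math

def proportion_ci(p_s, n, z):
    """Return (lower, upper) for p_s ± z * sqrt(p_s * (1 - p_s) / n)."""
    se = math.sqrt(p_s * (1 - p_s) / n)   # standard error of the proportion
    margin = z * se                       # bound of error
    return p_s - margin, p_s + margin

lower, upper = proportion_ci(0.56, 100, 1.96)
print(f"{lower:.3f} to {upper:.3f}")
print("Interval includes values below .50:", lower < 0.50)
```

With these inputs the interval runs from roughly .463 to .657, so .50 lies inside it.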

Why did I use the standard error in my formula?
• I am asking the question about the proportion of Diet Coke drinkers who prefer Pepsi
• I want some sense of how well my sample estimates the population
• If it is drawn randomly it will represent the population, plus some sampling error
• A 95% confidence interval means that
  • If I had taken all possible samples
  • And calculated a confidence interval for each one
  • 95% of them would have contained the true population parameter

What is a Confidence Interval?
• It is an interval estimate of a population parameter
• The plus or minus part is also known as a Bound of Error or Margin of Error
• Placed in a probability framework

What is a Confidence Interval?
• We calculate the probability that the estimation process will result in an interval that contains the true value of the population proportion or mean
  • If we had repeated samples
  • Most of the C.I.s would contain the population parameter
  • But not all of them will

To construct a confidence interval we need
• A point estimator
• A sample and a sample estimate using the estimator
  • ps
• Knowledge of the Sampling Distribution of the point estimator
  • The Standard Error of the estimator
  • The form of the sampling distribution

To construct a confidence interval we need
• A probability level we are comfortable with: how much certainty. It's also called the "Confidence Coefficient"
• A level of Error, α
  • refers to the combined area to the right and left of our interval
  • A 95% C.I. has α = 1 - .95 = .05
  • α refers to the probability of being wrong in our confidence interval

Confidence Interval for a Population Proportion p
• Formula for C.I. for a Proportion
  \[ p_s \pm z_{\alpha/2}\,\sigma_p \approx p_s \pm z_{\alpha/2} \sqrt{\frac{p_s (1 - p_s)}{n}} \]
• It is approximate because we are using the Normal Approximation to the Binomial Distribution
• Assumption: A sufficiently large random sample of size n is selected from the population.

The C.I. formula
  \[ p_s \pm z_{\alpha/2} \sqrt{\frac{p_s (1 - p_s)}{n}} \]
• \( z_{\alpha/2} \) refers to the z-score that leaves a probability of α/2 in each tail
• α refers to the area in the tails of the distribution
• We divide by 2 because we divide α equally on both sides of the mean
• Which means the probability in the tails of both sides of the normal curve

The C.I. Formula
  \[ p_s \pm z_{\alpha/2} \sqrt{\frac{p_s (1 - p_s)}{n}} \]
• The larger the probability level for a C.I.
  • The smaller the value of α, and α/2
  • The larger the z value

CONFIDENCE LEVEL 100(1-α)    α      α/2     z(α/2)
90%                          .10    .05     1.645
95%                          .05    .025    1.96
99%                          .01    .005    2.575

For any given sample size, the width of the Confidence Interval depends on α
• For the Pepsi Challenge Example
  • 90% C.I.    .56 ± 1.645(.0496) = .56 ± .0816
  • 95% C.I.    .56 ± 1.96(.0496) = .56 ± .0972
  • 99% C.I.    .56 ± 2.575(.0496) = .56 ± .1277
• For any given sample size, if you want to be more certain (smaller α) you have to accept a wider interval
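
The z-values in the table can be obtained from the standard normal distribution instead of a printed table. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `z_critical` is hypothetical, and the Pepsi sample (ps = .56, n = 100) is reused from the slides.

```python
import math
from statistics import NormalDist

def z_critical(confidence):
    """z-value that leaves alpha/2 in each tail of the standard normal."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

p_s, n = 0.56, 100
se = math.sqrt(p_s * (1 - p_s) / n)   # standard error of the proportion

margins = {}
for confidence in (0.90, 0.95, 0.99):
    z = z_critical(confidence)
    margins[confidence] = z * se
    print(f"{confidence:.0%} C.I.: z = {z:.3f}, "
          f"{p_s:.2f} ± {margins[confidence]:.4f}")
```

Because this uses unrounded z-values and an unrounded standard error, the margins can differ from the slide figures in the last decimal place. The pattern is the same: a higher confidence level gives a larger z and a wider interval.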

Problem
• Survey questionnaire asking who you would vote for
• 1,052 adults were surveyed by a major newspaper
  • The percentage who indicated Candidate B was 35%
  • Construct a 95% C.I. for this proportion

Newspaper Confidence Interval Problem
• Standard Error = \( [(.35 \times .65)/1{,}052]^{.5} = .0147 \)
• C.I.
  • .35 ± 1.96(.0147)
  • .35 ± .0288

Newspaper C.I.
• The newspaper said "there is a ± 3.0% margin of error."
• Where did this figure come from?
  • It doesn't match our previous figure of 2.88%
• And what does it mean?

Newspaper C.I.
• They calculated a general C.I. for a proportion at .5
  • Standard Error = \( [(.5 \times .5)/1{,}052]^{.5} = .0154 \)
• C.I.
  • .5 ± 1.96(.0154)
  • .5 ± .0302
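
The two margins of error in the newspaper example (2.88% and roughly 3.0%) can be compared directly. This is a small sketch; `margin_of_error` is a hypothetical helper, and z = 1.96 is the 95% value used in the slides.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Bound of error z * sqrt(p * (1 - p) / n) for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

n = 1052
moe_sample = margin_of_error(0.35, n)   # using the reported 35%
moe_general = margin_of_error(0.50, n)  # using p = .5, the newspaper's approach

print(f"At p = .35: ± {moe_sample:.4f}")
print(f"At p = .50: ± {moe_general:.4f}")
```

The margin at p = .5 is the larger of the two, which is why a single "general" figure computed at .5 is safe to quote for every proportion in the survey.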

Variance is largest at .5
• For a proportion, the variance is largest at .5, or an equal split
  • At .5, \( s^2 = (.5)(.5) = .25 \)
  • At .7, \( s^2 = (.7)(.3) = .21 \)
  • At .3, \( s^2 = (.3)(.7) = .21 \)
• Which brings up another unique thing about proportions: once you specify a value of p for the population, the variance (\( \sigma^2 \)) is known.

Confidence Interval for the mean
• Suppose I am concerned about the quality of drinking water for people who use wells in a particular geographic area
• I will test for nitrogen, as Nitrate+Nitrite
• The U.S. EPA sets a MCL of 10 mg/l of Nitrate/Nitrite (MCL = Maximum Contaminant Level)
  • Below the threshold is considered safe

Water Quality Example
• Let's say there are 2,500 households in the area
• I could try to test them all, but at $50 a test it would cost $125,000 and weeks of work
• So, I decide to take 50 well water samples, and test for the presence of nitrogen

My sample
• n = 50
• Mean = 7 mg/l
• s = 3.003 mg/l
• Standard error = \( 3.003/(50)^{.5} = .425 \)

From Excel: Nitrate+Nitrite
Mean                          7.000
Standard Error                0.425
Median                        7.050
Mode                          7.100
Standard Deviation            3.003
Sample Variance               9.018
Kurtosis                     -0.723
Skewness                      0.101
Range                        11.600
Minimum                       1.600
Maximum                      13.200
Sum                         350.000
Count                            50
Confidence Level(95.0%)       0.853

[Figure: Stem-and-Leaf Display for Nitrate+Nitrite, stem unit: 1, stems 1 through 13]

Water Quality Example
• If I think of my sample as part of the sampling distribution I can place a Bound of Error around my estimate
• But I have one problem with this approach with the mean. I have two estimates
  • The estimate of the mean
  • The estimate of the standard deviation (s), which is used to estimate the standard error
• If σ is known, we don't have a problem and we would use a z-value for the confidence interval. But σ is rarely known!
• What can I do about this?

t-distribution
• Similar to the standard normal distribution
• The t-distribution varies with n (sample size) via degrees of freedom
  • df = n - 1
• As n gets larger, the t-distribution approximates the z distribution

Relax and have a beer!
• W.S. Gosset worked for Guinness Brewery in Ireland around 1900
• In quality control tests he noticed the problem of using the z-distribution with small samples
• His solution was the t-distribution

Comparison of a z and t-value as the Sample Size Gets Larger
The Value of a z-value or t-value for a 95% C.I.

Sample size    Z-value      t-value
10             too small    2.262
20             too small    2.093
30             1.960        2.045
50             1.960        2.010
100            1.960        1.984
500            1.960        1.965
1000           1.960        1.962
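
The t-values in this comparison can be approximated without a table. The sketch below is one possible approach, not the method used in the slides: it integrates the t density numerically (Simpson's rule) and finds the critical value by bisection, using only the standard library. The function names and the search bracket of 0 to 100 are assumptions made for illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0: .5 plus Simpson's-rule area over [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(confidence, df):
    """t-value with (1 - confidence)/2 in the right tail, by bisection."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 100.0   # assumed bracket for the critical value
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Sample sizes from the comparison; degrees of freedom are n - 1
for n in (10, 20, 30, 50, 100, 500, 1000):
    print(n, round(t_critical(0.95, n - 1), 3))
```

The printed values should agree with the t-value column to three decimals and shrink toward 1.960 as n grows. In practice a statistics package or the t-table would normally be used instead of hand-rolled integration.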

The formula for Confidence Interval for the Mean
• In the case of a small sample, n < 30, we cannot rely on the Central Limit Theorem.
• In order to do a C.I., a big assumption with a small sample is that the population is approximately normally distributed
  \[ \bar{x} \pm t_{\alpha/2,\ n-1\ \text{d.f.}} \left( \frac{s}{\sqrt{n}} \right) \]
• \( t_{\alpha/2} \) is based on (n - 1) degrees of freedom

The t-Table
• Organized with degrees of freedom as rows
• Probabilities in the right tail (α) are the columns
• We substitute the t-value from the table for a z-value in the C.I.
• For any probability level, as the degrees of freedom get larger, the t-value gets smaller

The meaning of the t-value
• The t-value is interpreted like the z-value from the standardized normal table
• NOTE: For a Confidence Interval, the t-value represents the corresponding value at α/2
• Which is out in the right tail of the curve
• So a t-value for 30 degrees of freedom at the .025 level is 2.042
  • This corresponds to a z-value of 1.96
  • And is used for a 95% C.I.

Degrees of Freedom    t.100    t.050    t.025     t.0005
1                     3.078    6.314    12.706    636.62
2                     1.886    2.920    4.303     31.598
3                     1.638    2.353    3.182     12.924
30                    1.310    1.697    2.042     3.646
∞                     1.282    1.645    1.960     3.291

Comparing z-distribution and t-distribution
• As the degrees of freedom gets to 30, the t-value approaches z

Formation of a Confidence Interval of the Mean
BASIC STEPS
• Set a probability that an interval estimator encloses the population parameter: p = .95
• Set an alpha level as 1 - p: .05
• Divide the alpha by 2: .025
• Calculate the degrees of freedom as n - 1
• Locate the α/2 probability column for your degrees of freedom in the t-Table
• Find the corresponding t value for the α/2 probability: 2.010
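
The basic steps can be traced in code as a table lookup. This is only a sketch: the dictionary holds the handful of t-values that appear in this transcript (keyed by right-tail probability, then degrees of freedom), and the function name `t_lookup_steps` is hypothetical. Any combination not listed raises a `KeyError`, just as a value missing from a printed table would have to be found elsewhere.

```python
# Right-tail t-values copied from this transcript:
# {tail probability: {degrees of freedom: t-value}}
T_TABLE = {
    0.05:  {1: 6.314, 2: 2.920, 3: 2.353, 30: 1.697, 49: 1.6766},
    0.025: {1: 12.706, 2: 4.303, 3: 3.182, 30: 2.042, 49: 2.010},
    0.005: {49: 2.680},
}

def t_lookup_steps(confidence, n):
    """Follow the basic steps: alpha, alpha/2, d.f., then the table lookup."""
    alpha = round(1 - confidence, 4)   # step 2: alpha = 1 - p
    tail = round(alpha / 2, 4)         # step 3: divide alpha by 2
    df = n - 1                         # step 4: degrees of freedom
    t_value = T_TABLE[tail][df]        # steps 5-6: column alpha/2, row d.f.
    return alpha, tail, df, t_value

print(t_lookup_steps(0.95, 50))
```

For a 95% interval with n = 50 this returns alpha = .05, a tail probability of .025, 49 degrees of freedom, and t = 2.010.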

Back to the Water Quality Example
• Took a sample estimate of the mean
• Treated it as one of many samples from a sampling distribution with a standard error
• Since σ is not known, we used the sample estimate of the standard deviation, s.
• And we will use a t-value.

Formula for C.I. for the mean
• Use the population parameter σ if it is known, and a z-value
  \[ \bar{x} \pm z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right) \]
• Use the sample estimate s, and a t-value, if σ is not known
  \[ \bar{x} \pm t_{\alpha/2,\ n-1\ \text{d.f.}} \left( \frac{s}{\sqrt{n}} \right) \]

Back to the Water Quality Example
• For a specified probability level, e.g. .95, we generate a t value
• That puts a bound around our estimate of the mean that represents 2.010 standard errors around the mean in a sampling distribution
• \( t_{.05/2,\ 49\ \text{d.f.}} = 2.010 \)
• mean = 7.0
• \( 3.003/(50)^{.5} = .425 \)
• Confidence Interval
  • 7.0 ± 2.010(.425)
  • 7.0 ± .854
  • 6.146 to 7.854
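
The water quality interval can be checked with a short helper. This is a sketch that takes the t-value as an input (2.010 for 49 degrees of freedom, as given in the slides); the function name `mean_ci` is hypothetical.

```python
import math

def mean_ci(mean, s, n, t_value):
    """Return (standard error, margin, (lower, upper)) for mean ± t * s/sqrt(n)."""
    se = s / math.sqrt(n)      # estimated standard error of the mean
    margin = t_value * se      # bound of error
    return se, margin, (mean - margin, mean + margin)

# Water quality sample from the slides: n = 50, mean = 7.0 mg/l, s = 3.003 mg/l
se, margin, (lower, upper) = mean_ci(7.0, 3.003, 50, 2.010)

print(f"SE = {se:.3f}, margin = {margin:.3f}")
print(f"95% C.I.: {lower:.3f} to {upper:.3f} mg/l")
print("Upper limit below the 10 mg/l MCL:", upper < 10)
```

The upper limit of the interval sits below the 10 mg/l MCL mentioned earlier, which is the practical question behind the example.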
• ±2.010 standard errors around the mean represents 95% of the values in a t-distribution with 49 degrees of freedom
• Remember, we only have one sample
• And thus one interval estimate
• If we could draw repeated samples
  • 95 percent of the Confidence Intervals calculated on the sample mean
  • Would contain the true population parameter
• Our one sample interval estimate may not contain the true population parameter
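
The repeated-sampling idea can be simulated. The sketch below draws 100 samples of n = 50 from a normal population with µ = 75 and σ = 10 (the setup of the sampling exercise in this deck) and counts how many intervals contain µ. It assumes σ is known and uses z-intervals, which may differ from how the original exercise was run; the seed and function name are arbitrary choices.

```python
import math
import random
from statistics import NormalDist

def coverage_counts(mu=75.0, sigma=10.0, n=50, reps=100,
                    levels=(0.90, 0.95, 0.99), seed=820):
    """Count, per confidence level, how many of `reps` intervals contain mu."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)   # standard error with sigma known
    z = {lvl: NormalDist().inv_cdf(1 - (1 - lvl) / 2) for lvl in levels}
    counts = {lvl: 0 for lvl in levels}
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        for lvl in levels:
            # the interval xbar ± z*se contains mu when |xbar - mu| <= z*se
            if abs(xbar - mu) <= z[lvl] * se:
                counts[lvl] += 1
    return counts

for level, hits in coverage_counts().items():
    print(f"{level:.0%} C.I.: {hits} of 100 intervals contain mu")
```

Most intervals contain µ but usually not all of them, and the higher confidence levels capture µ at least as often because their intervals are wider around the same sample mean.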

[Figure: 90% C.I. from Sampling Exercise from a Population with µ = 75 and σ = 10. 90% Confidence Intervals for 100 Samples, n = 50. Vertical axis from 65.0 to 85.0; horizontal axis is the sample number.]
[Figure: 95% C.I. from Sampling Exercise from a Population with µ = 75 and σ = 10. 95% Confidence Intervals for 100 Samples, n = 50. Vertical axis from 65.0 to 85.0; horizontal axis is the sample number.]
[Figure: 99% C.I. from Sampling Exercise from a Population with µ = 75 and σ = 10. 99% Confidence Intervals for 100 Samples, n = 50. Vertical axis from 65.0 to 85.0; horizontal axis is the sample number.]

Now you try it
• A furniture company wants to test a random sample of sofas to determine how long the cushions last
• They simulate people sitting on the sofas by dropping a heavy object on the cushions until they wear out; they count the number of drops it takes
• This test involves 9 sofas
  • Mean = 12,648.889
  • s = 1,898.673
• Assume it follows a normal distribution. Generate a 95% Confidence Interval for this problem

Sofa Test Answer
• Mean = 12,648.889
• s = 1,898.673
• SE = 632.891
• 12,648.889 ± 1,459.447

Excel Output for Sofa problem: Sofa Drops
Mean                         12,648.889
Standard Error                  632.891
Median                            12742
Mode                               #N/A
Standard Deviation             1898.673
Sample Variance             3604958.111
Kurtosis                         -0.676
Skewness                         -0.372
Range                             5,886
Minimum                           9,459
Maximum                          15,345
Sum                             113,840
Count                                 9
Confidence Level(95.0%)       1,459.450

Solving this problem with Excel
• I entered the data into a column in Excel
• I then used the following sequence
  • Tools
  • Data Analysis
  • Descriptive Statistics
• I then follow the options, including:
  • Identify the Input Range, marking that a label is in the first row
  • Output range
  • Descriptive statistics
  • A 95% Confidence Interval

PHStat and Confidence Intervals
• Excel's Data Analysis will only construct a Confidence Interval as part of the Descriptive Statistics
• PHStat will construct C.I. for
  • Mean, sigma known
  • Mean, sigma unknown
  • Proportion
  • Variance
  • Population Confidences
• You can enter a range of data, or just the mean, standard deviation, and sample size
• Try it for this problem!
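
The sofa answer can also be checked by hand in code. This sketch uses the summary statistics from the slides and a t-value of 2.306 (right-tail probability .025 with 8 degrees of freedom), which comes from a standard t-table and is not shown in the slides' excerpt.

```python
import math

# Summary statistics from the sofa problem
n, mean, s = 9, 12648.889, 1898.673

# Assumption: t at .025 in the right tail with n - 1 = 8 d.f.,
# taken from a standard t-table (not listed in the slides)
t_value = 2.306

se = s / math.sqrt(n)     # standard error of the mean
margin = t_value * se     # bound of error
lower, upper = mean - margin, mean + margin

print(f"SE = {se:.3f}")
print(f"{mean:,.3f} ± {margin:,.3f}")
print(f"95% C.I.: {lower:,.3f} to {upper:,.3f} drops")
```

The margin matches the 1,459.447 on the answer slide; Excel's 1,459.450 differs slightly because it uses an unrounded t-value.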

A few more points on small sample C.I.
• If we cannot assume a normal distribution
  • The probability associated with our interval is not (1 - α)
  • We really shouldn't construct a C.I.
  • Or we should get more data
• If σ is known, we can use the z instead of the t, but we still need to have an approximately normal distribution

What influences the width of a confidence interval?
• The sample size
• The level of α
• The level of the confidence coefficient (1 - α)
• The variability of the data, i.e., the standard deviation

What influences the width of a confidence interval?
• Sample Size or n
• The larger the sample size, the narrower the C.I.
  • For a 95% Confidence Interval when s = 25
  • n = 50     \( 2.010(25/(50)^{.5}) = 7.11 \)
  • n = 500    \( 1.9647(25/(500)^{.5}) = 2.20 \)

What influences the width of a confidence interval?
• The level of α
• The larger the level of α, the narrower the C.I.
  • For a 95% Confidence Interval when s = 25 and n = 50
  • α = .05    \( 2.010(25/(50)^{.5}) = 7.11 \)
  • α = .10    \( 1.6766(25/(50)^{.5}) = 5.93 \)

What influences the width of a confidence interval?
• The level of the confidence coefficient (1 - α)
• The larger the confidence coefficient, the wider the C.I.
  • When s = 25 and n = 50
  • 95% C.I.    \( 2.010(25/(50)^{.5}) = 7.11 \)
  • 99% C.I.    \( 2.680(25/(50)^{.5}) = 9.48 \)

Focus in on sample size (n)
• For a given (1 - α) C.I.
• and a given bound of error (B)
  • which is what we add or subtract to the sample estimate
• We can calculate the needed sample size as
  \[ n = \frac{(z_{\alpha/2})^2 \sigma^2}{B^2} \]
• PHStat will do this for you, under Sample Size
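
The width comparisons and the sample size formula above can be sketched together. The t-values are the ones quoted in the slides. The sample-size example (σ = 25, B = 5, z = 1.96) is hypothetical, chosen only to illustrate the formula, and the function names are not from the original.

```python
import math

def half_width(t_value, s, n):
    """Bound of error t * s / sqrt(n)."""
    return t_value * s / math.sqrt(n)

def required_sample_size(z, sigma, bound):
    """n = z^2 * sigma^2 / B^2, rounded up to a whole observation."""
    return math.ceil((z ** 2) * (sigma ** 2) / bound ** 2)

s = 25
print("95%, n = 50: ", round(half_width(2.010, s, 50), 2))
print("95%, n = 500:", round(half_width(1.9647, s, 500), 2))
print("90%, n = 50: ", round(half_width(1.6766, s, 50), 2))
print("99%, n = 50: ", round(half_width(2.680, s, 50), 2))

# Hypothetical target: bound of error B = 5 with sigma = 25 at 95% confidence
print("n needed:", required_sample_size(1.96, 25, 5))
```

Rounding the sample size up rather than to the nearest integer keeps the bound of error at or below the target B.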

Confidence Interval Summary
• Provides an interval estimate of a population parameter, built around a sample estimate
• Requires knowledge of the sampling distribution of the estimator
• We treat our estimate from a sample as one of many possible estimates from many possible samples

Confidence Interval Summary
• Figure a C.I. probability level as (1 - α)
  • where α/2 represents the probability in either tail of the sampling distribution
  • (1 - α) is referred to as the confidence coefficient
• For proportions, you can use a z-score provided the sample size is large enough (Normal approximation to the Binomial)
  \[ p_s \pm z_{\alpha/2} \sqrt{\frac{p_s q_s}{n}} \]

Confidence Interval Summary
• For the mean
  • If σ is known, use a z-value for the C.I., similar to proportions
  • If σ is unknown, use the t-table with n - 1 degrees of freedom
  • If the sample size is small (<30), and the distribution is approximately normal, use the t-table with n - 1 degrees of freedom
  \[ \bar{x} \pm t_{\alpha/2,\ n-1\ \text{d.f.}} \left( \frac{s}{\sqrt{n}} \right) \]