Chapter 10: Estimating With Confidence
10.1 – Confidence Intervals: The basics
Statistical Inference: Using sample data to draw
conclusions about a population
Note: Each sample may vary, but the population
parameter doesn’t!
Sampling Distribution:
• If the population is approximately normal, so is
the sampling distribution of $\bar{x}$
• If the population is skewed, the sampling distribution
is approximately normal if n ≥ 30, by the central
limit theorem
• If given sample data, look at the distribution to
assess normality if needed. (Normal Prob. Plot)
$\mu_{\bar{x}} = \mu$
$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
Confidence Interval:
• Uses the sampling distribution to estimate the
population parameter
• It is an interval of numbers above and below
the sample statistic
Confidence Level:
The probability the interval will capture the true
parameter value in repeated samples
Critical Value:
The value Z* with probability p lying to its right under the
standard Normal curve.
Margin of Error:
• How accurate our estimate is, based on the
variability of the sampling distribution. We add
and subtract this from our estimate.
estimate ± margin of error
Caution! Margin of error is only from random
sampling errors. This does not include errors in
collecting the data!
Most Common Critical Values
Upper tail prob. = $\dfrac{1 - C}{2}$
Confidence Level (C)   Upper tail prob.   Z* Value
90%                    0.05               1.645
95%                    0.025              1.96
99%                    0.005              2.576
Calculator Tip: Critical Values
2nd Dist – invNorm( (1 + C)/2 )
OR: Look at the T-Tables for the most common ones!
(You will learn more about them later)
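The invNorm tip can also be tried off the calculator. The sketch below assumes Python 3.8 or later and its standard-library `statistics.NormalDist`; the helper name `z_star` is made up for illustration and is not part of the slides.

```python
from statistics import NormalDist

def z_star(confidence):
    """Critical value Z* with area (1 + C)/2 to its left under the standard Normal curve."""
    return NormalDist().inv_cdf((1 + confidence) / 2)

# Print Z* for the three most common confidence levels.
for c in (0.90, 0.95, 0.99):
    print(f"{c:.0%}: Z* = {z_star(c):.3f}")
```

Rounded to three decimals, this should reproduce the common critical values 1.645, 1.960, and 2.576.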
Confidence Interval for a Population mean (σ known)
(Z-Interval)
estimate ± margin of error
estimate ± critical value × standard error
$\bar{x} \pm Z^* \left( \dfrac{\sigma}{\sqrt{n}} \right)$
Properties of Confidence Intervals
$\bar{x} \pm Z^* \left( \dfrac{\sigma}{\sqrt{n}} \right)$
• The interval is always centered around the statistic
• The higher the confidence level, the wider the
interval becomes
• If you increase n, then the margin of error decreases
Calculator Tip: Z-Interval
Stat – Tests – ZInterval
Data: If given actual values
Stats: If given summary of values
Interpreting a Confidence Interval:
What you will say:
I am C% confident that the true parameter is
captured in the interval
What it means:
If we took many, many SRSs from a population
and calculated a confidence interval for each
sample, C% of the confidence intervals would
contain the true mean
CAUTION!
Never Say:
This interval will capture the true mean C% of the time.
Once calculated, it either does or does not!
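The "many, many samples" meaning can be checked with a small simulation. This is only a sketch: the Normal population with μ = 50 and σ = 13.4, the sample size, the number of trials, and the function name `coverage` are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def coverage(mu=50.0, sigma=13.4, n=30, confidence=0.95, trials=2000, seed=1):
    """Fraction of simulated Z-intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one SRS of size n and compute its sample mean.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        # Count the interval as a "hit" if it captures mu.
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / trials

print(coverage())
```

The printed fraction should be close to 0.95. Each individual interval either captured μ or missed it; the 95% describes the method over repeated samples.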
Conditions for a Z-Interval:
1. SRS (should say)
2. Normality (CLT or population approx normal)
3. Independence (population at least 10× the sample size):
$N \ge 10n$
Steps to Construct ANY Confidence Interval:
PANIC
P: Parameter of Interest (what are you looking for?)
A: Assumptions (what are the conditions?)
N: Name the type of interval (what type of data do we have?)
I: Interval (Finally! You can calculate!)
C: Conclusion in context (I am ___% confident the true
parameter lies between ________ and _________)
Example #1
Serum Cholesterol-Dr. Paul Oswick wants to estimate the true
mean serum HDL cholesterol for all of his 20-29 year old female
patients. He randomly selects 30 patients and computes the
sample mean to be 50.67. Assume from past records, the
population standard deviation for the serum HDL cholesterol for
20-29 year old female patients is σ = 13.4.
a. Construct a 95% confidence interval for the mean
serum HDL cholesterol for all of Dr. Oswick’s 20-29
year old female patients.
P: The true mean serum HDL cholesterol for all of
Dr. Oswick’s 20-29 year old female patients.
A: SRS: Says randomly selected
Normality: Approximately normal by the
CLT (n ≥ 30)
Independence: I am assuming that Dr. Oswick
has 300 patients or more.
N: One sample Z-Interval
I: $\bar{x} \pm Z^* \left( \dfrac{\sigma}{\sqrt{n}} \right)$
$50.67 \pm 1.96 \left( \dfrac{13.4}{\sqrt{30}} \right)$
$50.67 \pm 4.795$
$(45.875,\ 55.465)$
C: I am 95% confident the true mean serum HDL
cholesterol for all of Dr. Oswick’s 20-29 year old
female patients is between 45.875 and 55.465
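The interval step for part a can be sketched in code. This assumes Python 3.8 or later; the function name `z_interval` and its parameters are hypothetical, chosen to mirror the calculator's ZInterval "Stats" input.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence):
    """Return (lower, upper) for xbar ± Z* · sigma / sqrt(n)."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin

# Summary values from the serum cholesterol example.
lower, upper = z_interval(xbar=50.67, sigma=13.4, n=30, confidence=0.95)
print(round(lower, 3), round(upper, 3))  # 45.875 55.465
```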
b. If the US National Center for Health Statistics reports the
mean serum HDL cholesterol for females between 20-29 years
old to be μ = 53, do Dr. Oswick’s patients appear to have a
different serum level compared to the general population?
Explain.
No, 53 is contained in the interval.
c. What two things could you do to decrease your margin of
error?
Increase n
Lower confidence level
Example #2
Suppose your class is investigating the weights of Snickers 1-ounce
Fun-Size candy bars to see if customers are getting full
value for their money. Assume that the weights are Normally
distributed with standard deviation σ = 0.005 ounces. Several
candy bars are randomly selected and weighed with sensitive
balances borrowed from the physics lab. The weights are
0.95 1.02 0.98 0.97 1.05 1.01 0.98 1.00
ounces. Determine a 90% confidence interval for the true mean,
µ. Can you say that the bars weigh 1oz on average?
P: The true mean weight of Snickers 1-oz Fun-size
candy bars
A: SRS: Says randomly selected
Normality: Approximately normal because the
population is approximately normal
Independence: I am assuming that Snickers
has 80 bars or more in the 1-oz
size
N: One sample Z-Interval
I: $\bar{x} \pm Z^* \left( \dfrac{\sigma}{\sqrt{n}} \right)$
$.995 \pm 1.645 \left( \dfrac{0.005}{\sqrt{8}} \right)$
$.995 \pm 0.0029$
$(.9921,\ .9979)$
C: I am 90% confident the true mean weight of
Snickers 1-oz Fun-size candy bars is between
.9921 and .9979 ounces. I am not confident that
the candy bars weigh as advertised at the 90%
level.
Choosing a Sample Size for a specific margin of error
$Z^* \left( \dfrac{\sigma}{\sqrt{n}} \right) \le m$
$n \ge \left( \dfrac{Z^* \sigma}{m} \right)^2$
Note: Always round up! You can’t have part of a
person! Ex: 163.2 rounds up to 164.
Example #3
A statistician calculates a 95% confidence interval for the mean
income of the depositors at Bank of America, located in a poverty
stricken area. The confidence interval is $18,201 to $21,799.
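a. What is the sample mean income?
The interval $\bar{x} \pm Z^* \left( \dfrac{\sigma}{\sqrt{n}} \right)$ is centered at $\bar{x}$, so
$\bar{x} = \dfrac{18{,}201 + 21{,}799}{2} = \$20{,}000$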
b. What is the margin of error?
$\bar{x} = 20{,}000$
$\bar{x} \pm m$, where $m = Z^* \left( \dfrac{\sigma}{\sqrt{n}} \right)$
m = 21,799 – 20,000
m = 1,799
Example #4
A researcher wishes to estimate the mean number of miles on
four-year-old Saturn SCI’s. How many cars should be in a
sample in order to estimate the mean number of miles within a
margin of error of ± 1000 miles with 99% confidence, assuming
σ = 19,700?
$n = \left( \dfrac{Z^* \sigma}{m} \right)^2$
$n = \left( \dfrac{2.576 (19{,}700)}{1000} \right)^2$
$n = (50.7472)^2$
$n = 2575.2783$
$n \approx 2576$ (round up)
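The sample-size calculation, including the "always round up" rule, can be sketched as follows. The function name `sample_size_for_mean` is hypothetical, and rounding to 8 decimals before rounding up is an added guard against floating-point noise, not something from the slides.

```python
from math import ceil

def sample_size_for_mean(z_star, sigma, margin):
    """Smallest whole n with Z* · sigma / sqrt(n) <= margin."""
    n_raw = (z_star * sigma / margin) ** 2
    # Round away tiny floating-point noise first, then always round up.
    return ceil(round(n_raw, 8))

# Saturn mileage example: 99% confidence, sigma = 19,700, m = 1000.
print(sample_size_for_mean(z_star=2.576, sigma=19_700, margin=1000))  # 2576
```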
10.2 – Estimating a Population Mean
In 10.1 we made the unrealistic assumption
that the population standard deviation was
known and could be used to calculate confidence
intervals.
Standard Error: The standard deviation of a
statistic when it is estimated from the data
$\dfrac{s}{\sqrt{n}}$
When we know σ we can use the Z-table to make a
confidence interval. But, when we don’t know it, then
we have to use something else!
Properties of the t-distribution:
• Used when σ is unknown
• Degrees of Freedom = n – 1
• More variable than the normal distribution (it has
fatter tails than the normal curve)
• Approaches the normal distribution when the
degrees of freedom are large (sample size is large).
• The t-table gives the area to the right of the t-value
Properties of the t-distribution:
• If n < 15, use t-procedures only if the population is approx
normal. If the data are clearly non-Normal or if
outliers are present, don’t use!
• If n ≥ 15, the sampling distribution is approx normal, except if
the population has outliers or strong skewness
• If n ≥ 30, the sampling distribution is approx normal, even if
the population has outliers or strong skewness
Example #1
Determine the degrees of freedom and use the t-table to find
probabilities for each of the following:
Probability            n        DF    Answer
P(t > 1.093)           n = 11   10    0.15
P(t < 1.093)           n = 11   10    1 – 0.15 = 0.85
P(t > 0.685)           n = 24   23    0.25
P(t < –0.685)          n = 24   23    0.25 (by symmetry)
P(0.70 < t < 1.093)    n = 11   10    .25 – .15 = 0.1
Calculator Tip: Finding P(t)
2nd – Dist – tcdf( lower bound, upper
bound, degrees of freedom)
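For readers without the calculator, the same kind of area can be approximated numerically. The sketch below is not how the calculator works internally; it simply integrates the t density with Simpson's rule. The function names `t_pdf` and `tcdf` and the `steps` parameter are assumptions for illustration.

```python
from math import gamma, pi, sqrt

def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    const = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return const * (1 + t * t / df) ** (-(df + 1) / 2)

def tcdf(lower, upper, df, steps=2000):
    """Approximate P(lower < t < upper) with Simpson's rule (steps must be even)."""
    h = (upper - lower) / steps
    total = t_pdf(lower, df) + t_pdf(upper, df)
    for i in range(1, steps):
        # Simpson weights alternate 4, 2, 4, ... on the interior points.
        total += (4 if i % 2 else 2) * t_pdf(lower + i * h, df)
    return total * h / 3

# Upper-tail area P(t > 1.093) with df = 10, using symmetry about 0.
print(round(0.5 - tcdf(0, 1.093, 10), 3))
# P(0.70 < t < 1.093) with df = 10.
print(round(tcdf(0.70, 1.093, 10), 3))
```

The two printed areas should be close to the t-table answers of 0.15 and 0.1; small differences come from the table's rounded t-values.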
One-Sample t-interval:
$\bar{x} \pm t^*_{n-1} \left( \dfrac{s}{\sqrt{n}} \right)$
Calculator Tip: One sample t-Interval
Stat – Tests – TInterval
Data: If given actual values
Stats: If given summary of values
Conditions for a t-interval:
1. SRS (should say)
2. Normality (population approx normal if n < 15, or
moderate size (15 ≤ n < 30) with no strong
skewness or outliers, or large sample size
n ≥ 30)
3. Independence (population at least 10× the sample size):
$N \ge 10n$
Robustness:
The probability calculations remain fairly accurate
when a condition for use of the procedure is
violated
The t-distribution is robust for large n values,
mostly because as n increases, the t-distribution
approaches the Z-distribution. And by the CLT, it
is approx normal.
Example #2
Practice finding t*
n        Degrees of Freedom   Confidence Interval   t*
n = 10   9                    99% CI                3.250
n = 20   19                   90% CI                1.729
n = 40   39                   95% CI                2.042 (table row df = 30)
n = 30   29                   99% CI                2.756
Example #3
As part of your work in an environmental awareness group, you
want to estimate the mean waste generated by American adults.
In a random sample of 20 American adults, you find that the
mean waste generated per person per day is 4.3 pounds with a
standard deviation of 1.2 pounds. Calculate a 99% confidence
interval for μ and explain its meaning to someone who doesn’t
know statistics.
P: The true mean waste generated per person per
day.
A: SRS: Says randomly selected
Normality: 15 ≤ n < 30. We must assume the
population doesn’t have strong
skewness or outliers. Proceeding with caution!
Independence: It is safe to assume that there
are more than 200 Americans
that create waste.
N: One Sample t-interval
I: $\bar{x} \pm t^*_{n-1} \left( \dfrac{s}{\sqrt{n}} \right)$
df = 20 – 1 = 19
$4.3 \pm 2.861 \left( \dfrac{1.2}{\sqrt{20}} \right)$
$4.3 \pm 0.7677$
$(3.5323,\ 5.0677)$
C: I am 99% confident the true mean waste
generated per person per day is between 3.5323
and 5.0677 pounds.
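The t-interval arithmetic can be sketched the same way. Python's standard library has no inverse t function, so this sketch takes t* straight from the t-table; the function name `t_interval` is hypothetical.

```python
from math import sqrt

def t_interval(xbar, s, n, t_star):
    """Return (lower, upper) for xbar ± t* · s / sqrt(n); t* comes from a t-table."""
    margin = t_star * s / sqrt(n)
    return xbar - margin, xbar + margin

# t* = 2.861 is the table value for df = 19 at 99% confidence.
lower, upper = t_interval(xbar=4.3, s=1.2, n=20, t_star=2.861)
print(round(lower, 4), round(upper, 4))  # 3.5323 5.0677
```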
Matched Pairs t-procedures:
Subjects are matched according to characteristics
that affect the response, and then one member is
randomly assigned to treatment 1 and the other to
treatment 2. Recall that twin studies provide a
natural pairing. Before and after studies are
examples of matched pairs designs, but they require
careful interpretation because random assignment is
not used.
Apply the one-sample t procedures to the differences
Confidence Intervals for Matched Pairs
$\bar{x}_d \pm t^*_{n-1} \left( \dfrac{s_d}{\sqrt{n}} \right)$
Example #4
Archaeologists use the chemical composition of clay found in
pottery artifacts to determine whether different sites were
populated by the same ancient people. They collected five
random samples from each of two sites in Great Britain and
measured the percentage of aluminum oxide in each. Based on
these data, do you think the same people used these two kiln
sites? Use a 95% confidence interval for the difference in
aluminum oxide content of pottery made at the sites and assume
the population distribution is approximately normal. Can you
say there is no difference between the sites?
New Forrest   Ashley Trails   Difference
20.8          19.1            1.7
18            14.8            3.2
18            16.7            1.3
15.8          18.3            -2.5
18.3          17.7            .6
P: μn = New Forrest percentage of aluminum oxide
μa = Ashley Trails percentage of aluminum oxide
μd = μn - μa = Difference in aluminum oxide levels
The true mean difference in aluminum oxide levels
between the New Forrest and Ashley Trails.
A: SRS: Says randomly selected
Normality: Says population is approx normal
Independence: It is safe to assume that there
are more than 50 samples
available
N: Matched Pairs t-interval
I: $\bar{x}_d \pm t^*_{n-1} \left( \dfrac{s_d}{\sqrt{n}} \right)$
df = 5 – 1 = 4
$.86 \pm 2.776 \left( \dfrac{2.105469069}{\sqrt{5}} \right)$
$.86 \pm 2.613866034$
$(-1.754,\ 3.4739)$
C: I am 95% confident the true mean difference in
aluminum oxide levels between the New Forrest
and Ashley Trails is between –1.754 and 3.4739.
Can you say there is no difference between the sites?
Yes, zero is in the confidence interval, so a mean difference
of zero is plausible; the data give no evidence of a difference.
Example #5
The National Endowment for the Humanities sponsors summer
institutes to improve the skills of high school language teachers.
One institute hosted 20 Spanish teachers for four weeks. At the
beginning of the period, the teachers took the Modern Language
Association’s listening test of understanding of spoken Spanish.
After four weeks of immersion in Spanish in and out of class,
they took the listening test again. (The actual spoken Spanish in
the two tests was different, so that simply taking the first test
should not improve the score on the second test.) Below is the
pretest and posttest scores. Give a 90% confidence interval for
the mean increase in listening score due to attending the
summer institute. Can you say the program was successful?
Subject   Pretest   Posttest      Subject   Pretest   Posttest
1         30        29            11        30        32
2         28        30            12        29        28
3         31        32            13        31        34
4         26        30            14        29        32
5         20        16            15        34        32
6         30        25            16        20        27
7         34        31            17        26        28
8         15        18            18        25        29
9         28        33            19        31        32
10        20        25            20        29        32
P: μB = Pretest score
μA = Posttest score
μd = μB - μA = Difference in test scores
The true mean difference in test scores between the
Pretest and Posttest
A: SRS: We must assume the 20 teachers are randomly
selected
Normality: 15<n<30 and distribution is
approximately normal, so safe to
assume
Independence: It is safe to assume that there
are more than 200 Spanish
teachers
N: Matched Pairs t-interval
I: $\bar{x}_d \pm t^*_{n-1} \left( \dfrac{s_d}{\sqrt{n}} \right)$
df = 20 – 1 = 19
$-1.45 \pm 1.729 \left( \dfrac{3.2032}{\sqrt{20}} \right)$
$-1.45 \pm 1.2384$
$(-2.689,\ -0.2115)$
C: I am 90% confident the true mean difference in
test scores between the Pretest and Posttest
is between –2.689 and –0.2115.
Can you say the program was successful?
Yes, zero is not in the confidence interval and the whole interval is
negative, so the mean pretest score is lower than the mean posttest score.
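The matched pairs interval can be reproduced from the raw scores by applying the one-sample t procedure to the differences. This is a sketch: the function name `matched_pairs_interval` is hypothetical, and t* is taken from the t-table rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# Scores for subjects 1 through 20, in order.
pretest  = [30, 28, 31, 26, 20, 30, 34, 15, 28, 20,
            30, 29, 31, 29, 34, 20, 26, 25, 31, 29]
posttest = [29, 30, 32, 30, 16, 25, 31, 18, 33, 25,
            32, 28, 34, 32, 32, 27, 28, 29, 32, 32]

def matched_pairs_interval(first, second, t_star):
    """One-sample t-interval applied to the differences first - second."""
    diffs = [a - b for a, b in zip(first, second)]
    d_bar = mean(diffs)
    s_d = stdev(diffs)  # sample standard deviation (divides by n - 1)
    margin = t_star * s_d / sqrt(len(diffs))
    return d_bar, s_d, (d_bar - margin, d_bar + margin)

# t* = 1.729 is the table value for df = 19 at 90% confidence.
d_bar, s_d, (lower, upper) = matched_pairs_interval(pretest, posttest, t_star=1.729)
print(d_bar, round(s_d, 4), round(lower, 3), round(upper, 3))
```

Because the differences are taken as pretest minus posttest, an improvement shows up as a negative mean difference. The endpoints should agree with the interval above to within rounding of t*.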
10.3 – Estimating a Population Proportion
Properties of $\hat{p}$:
$\hat{p} = \dfrac{\text{\# of successes in the sample}}{n} = \dfrac{x}{n}$
$\mu_{\hat{p}} = p$
$\sigma_{\hat{p}} = \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}$
Confidence Interval for a Population Proportion:
$\hat{p} \pm Z^* \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}$
Notice! We use Z* and not t*
Conditions for a p-interval:
1. SRS (should say)
2. Normality: $n\hat{p} \ge 10$ AND $n(1 - \hat{p}) \ge 10$
3. Independence (population at least 10× the sample size):
$N \ge 10n$
Calculator Tip: One sample p-Interval
Stat – Tests – 1–PropZInt
x = # of successes in the sample
Example #1
A news release by the IRS reported 90% of all Americans fill
out their tax forms correctly. A random sample of 1500
returns revealed that 1200 of them were correctly filled out.
Calculate a 92% confidence interval for the proportion of
Americans who correctly fill out their tax forms. Is the IRS
correct in their report?
P: The true percent of Americans who fill out their
tax forms correctly
A: SRS: Says randomly selected
Normality: $\hat{p} = \dfrac{1200}{1500} = 0.80$
$n\hat{p} \ge 10$: 1500(0.8) = 1200 ≥ 10
$n(1 - \hat{p}) \ge 10$: 1500(1 – 0.8) = 300 ≥ 10
Yes, safe to assume an approximately normal distribution
Independence: It is safe to assume that there are
more than 15,000 people who file
their taxes
N: One Sample Proportion Interval
I: $\hat{p} \pm Z^* \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}$
Critical value for C = 92%: upper tail prob. $\dfrac{1 - C}{2} = \dfrac{1 - 0.92}{2} = 0.04$, so Z* = 1.75
$0.80 \pm 1.75 \sqrt{\dfrac{0.8(1 - 0.8)}{1500}}$
$0.8 \pm 0.0181$
$(0.7819,\ 0.8181)$
C: I am 92% confident the true percent of
Americans who fill out their tax forms correctly
is between 78.19% and 81.81%
Is the IRS correct in their report?
No, 90% is not in the interval!
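The whole proportion interval, including the normality check, can be sketched as below. This assumes Python 3.8 or later; the function name `one_prop_z_interval` is hypothetical and only loosely mirrors the calculator's 1–PropZInt inputs.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence):
    """Return (p_hat, lower, upper) for p_hat ± Z* · sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = successes / n
    # Normality condition: both expected counts must be at least 10.
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("normality condition not met")
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin

# IRS example: 1200 correct returns out of 1500, 92% confidence.
p_hat, lower, upper = one_prop_z_interval(successes=1200, n=1500, confidence=0.92)
print(p_hat, round(lower, 4), round(upper, 4))  # 0.8 0.7819 0.8181
```

The reported 90% lies well above the upper endpoint, which matches the conclusion that the IRS figure is not supported.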
Sample size for a Desired Margin of Error
If we want the margin of error in a level C confidence interval for p
to be m, then we need n subjects in the sample, where:
p* = An estimate for $\hat{p}$
$m = Z^* \sqrt{\dfrac{p^*(1 - p^*)}{n}}$
$n = \dfrac{p^*(1 - p^*)}{\left( \dfrac{m}{Z^*} \right)^2}$
Note: If p is unknown use the most conservative value
of p = 0.5. Since n is the sample size, it must be a
whole number!!! Round up!
Example #2
You wish to estimate, with 95% confidence, the proportion of
computers that need repairs or have problems by the time the
product is three years old. Your estimate must be accurate
within 3.5% of the true proportion.
a. Find the sample size needed if a prior study found that 19%
of computers needed repairs or had problems by the time the
product was three years old.
$n = \dfrac{p^*(1 - p^*)}{\left( \dfrac{m}{Z^*} \right)^2} = \dfrac{0.19(1 - 0.19)}{\left( \dfrac{.035}{1.96} \right)^2} = \dfrac{.1539}{.000318877551} = 482.6304 \approx 483$
b. If no preliminary estimate is available, find the most
conservative sample size required.
$n = \dfrac{p^*(1 - p^*)}{\left( \dfrac{m}{Z^*} \right)^2} = \dfrac{0.5(1 - 0.5)}{\left( \dfrac{.035}{1.96} \right)^2} = \dfrac{.25}{.000318877551} = 784$
c. Compare the results from a and b.
a: n = 483
b: n = 784
Using p* = 0.5 gives the largest possible sample size, ensuring
that enough people will be surveyed whatever the true proportion is.
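Both sample sizes can be checked with a short sketch of the formula. The function name `sample_size_for_proportion` is hypothetical, and rounding to 8 decimals before rounding up is an added guard against floating-point noise, not something from the slides.

```python
from math import ceil

def sample_size_for_proportion(z_star, margin, p_guess=0.5):
    """Smallest whole n with Z* · sqrt(p*(1 - p*)/n) <= margin."""
    n_raw = p_guess * (1 - p_guess) / (margin / z_star) ** 2
    # Round away tiny floating-point noise first, then always round up.
    return ceil(round(n_raw, 8))

print(sample_size_for_proportion(1.96, 0.035, p_guess=0.19))  # part a: 483
print(sample_size_for_proportion(1.96, 0.035))                # part b: 784
```

The default p* = 0.5 maximizes p*(1 – p*), which is why it gives the most conservative sample size.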