CHAPTER 9: POINT AND INTERVAL ESTIMATION

Chapter 9--Estimation.Doc
STATISTICS 301—APPLIED STATISTICS, Statistics for Engineers and Scientists, Walpole, Myers, Myers, and Ye, Prentice Hall
Goal:
In this section we will investigate the concept of “Estimation” in which our goal is to use
sample information (assumed to be a random sample from the population of interest) to
arrive at a reasonable guess of a population parameter. Estimation is done in two ways—
point estimation (or single value) and interval estimation (an interval or range of likely
values).
INTERVAL ESTIMATION (aka CONFIDENCE INTERVALS)
The advantage of point estimation and point estimates is their simplicity—a single number.
However, this simplicity has a price. Consider the following.
In a follow-up to the Dean’s request about the proportion of MU undergrads who plan to
attend graduate school, he checks with another faculty member who also collects data.
This faculty member reports to the Dean that 78.45% of the students he has asked plan
to attend graduate school. Whom does the Dean believe, since the results differ:
me (recall my estimate was $\hat{p}$ = 60.2%) or the other faculty member? In
other words, what DON’T point estimates tell you about the estimate and sample?
This is the downside of point estimates—they provide no sense of how large the sample is
nor how variable the estimate is. We know that the spread of every sampling distribution is
dependent upon the sample size so that smaller sample sizes yield sampling distributions
with larger spread and for larger sample sizes, the sampling distribution is less variable.
What would be the SE of my estimate of p, the proportion of MU students going to grad
school?
$SE(0.602) = \sqrt{(0.602)(1-0.602)/123} = 0.044$
What would be the SE of the other faculty member's estimate of p? What
do you need to know? Suppose n = 30.
$SE(0.7845) = \sqrt{(0.7845)(1-0.7845)/30} = 0.075$
The SE of our estimate is a little more than ½ of the other (0.044/0.075 is about 0.59)!
Hence, if we are given only the estimate, the accuracy of a point estimate is not evident!
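The two standard errors above can be checked with a few lines of Python. This is only a sketch: the function name `se_proportion` is made up for illustration, and n = 30 for the second estimate is the supposition made in the text, not a reported sample size.

```python
import math

def se_proportion(p_hat, n):
    """Estimated standard error of a sample proportion: sqrt(p_hat(1 - p_hat)/n)."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

se_mine = se_proportion(0.602, 123)    # my estimate, n = 123
se_other = se_proportion(0.7845, 30)   # other estimate, SUPPOSING n = 30
ratio = se_mine / se_other             # how the two standard errors compare

print(round(se_mine, 3), round(se_other, 3), round(ratio, 2))
```

The same point estimate reported with a larger n would give a smaller SE, which is exactly the information a bare point estimate hides.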
D:\769815146.doc
4/30/2017
1
Just as we can obtain point estimates for every population parameter we have discussed
thus far, we can also obtain Confidence Intervals for these parameters. However, we will
only give the Confidence Interval (CI), or interval estimate, for the population mean and
proportion (and later for the difference between two means and two proportions).
Defn:
Most interval estimates for parameters are of the form:
Point Estimate of Parameter ± Multiplier * SE(Point Estimate)
PE ± Multiplier * SE(PE)
or [ PE - M * SE(PE), PE + M * SE(PE) ]
where the Multiplier is an upper percentile point from the sampling
distribution of the point estimator used.
Hence to form an interval estimate or confidence interval for a parameter we need:
1. A point estimate of the parameter,
2. the distribution of the point estimator,
3. and an estimate of the Standard Error of the point estimate.
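The general form PE ± Multiplier * SE(PE) translates directly into code. The sketch below is illustrative only: the function name `confidence_interval` is hypothetical, and the multiplier 1.96 (the usual Normal value for 95% confidence) is an assumption chosen for the usage line, applied to the earlier estimate of 0.602 with SE 0.044.

```python
def confidence_interval(point_estimate, multiplier, standard_error):
    """Return (lower, upper) for: point estimate +/- multiplier * standard error."""
    margin = multiplier * standard_error
    return point_estimate - margin, point_estimate + margin

# Usage: the earlier proportion estimate with an ASSUMED multiplier of 1.96.
lower, upper = confidence_interval(0.602, 1.96, 0.044)
print(round(lower, 3), round(upper, 3))
```

Everything specific to a given parameter (which point estimate, which SE, and whether the multiplier is a z or a t value) is supplied by the caller; the interval arithmetic itself never changes.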
CONFIDENCE INTERVAL FOR POPULATION MEAN (popln mean or $\mu$)
Using our basic confidence interval form for the population mean we know that:
1. Our point estimate of $\mu$ is $\bar{x}$.
2. The SE of $\bar{X}$ is $\sigma/\sqrt{n}$ ($\approx s/\sqrt{n}$ if $\sigma$ were unknown, which is the “usual” case!).
3. Lastly, the distribution of $\bar{X}$ depends on several things:
a. If the population is Normal and $\sigma$ is known, then the Multiplier is a z value.
b. If n is large, then $\bar{X}$ is approximately Normal and our Multiplier is a z value.
c. And if n is small, $\sigma$ is unknown, and the population is Normal, the Multiplier is a t
value with n – 1 degrees of freedom.
Thm: If X1, X2, …, Xn are a random sample from a population with mean = $\mu$, variance = $\sigma^2$,
then a $(1 - \alpha)100\%$ confidence interval for $\mu$ is:
i. $\bar{x} \pm z(\alpha/2)\,\dfrac{\sigma}{\sqrt{n}}$ if the population is Normally distributed and $\sigma$ is known
ii. $\bar{x} \pm z(\alpha/2)\,\dfrac{s}{\sqrt{n}}$ if n is large (n > 30)
iii. $\bar{x} \pm t(\alpha/2;\, n-1)\,\dfrac{s}{\sqrt{n}}$ if n is small, $\sigma$ is unknown, & the population is Normal.
EXAMPLE #1
Recall our Milky Way candy data in which we found that the average weight of the 40
candy bars was 59.97 grams with a standard deviation of 1.92 grams. Find a 95%
confidence interval for the mean weight of all Milky Way candy bars.
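Parameter: $\mu$ = mean weight of all Milky Way candy bars
Point Estimate: $\bar{x}$ = 59.97 grams
Standard Error of our Point Estimate: $\sigma/\sqrt{n}$, but since $\sigma$ is unknown we use $s/\sqrt{n} = 1.92/\sqrt{40} = 0.3036$
$\alpha$ value: Since 95% = $(1 - \alpha)100\%$, $\alpha$ = 0.05
Multiplier: Since n is large, our multiplier is $z(\alpha/2) = z(0.025) = 1.96$
Our 95% confidence interval for the mean weight of all Milky Way candy bars is
$59.97 \pm 1.96(0.3036) = 59.97 \pm 0.595$, or about (59.4, 60.6) grams.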
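The steps of this example can be sketched in Python as follows. The function name `z_interval_for_mean` is hypothetical; the z multiplier is taken from the standard library's `statistics.NormalDist` rather than from a printed table, and the large-sample case (ii) of the theorem is assumed.

```python
import math
from statistics import NormalDist

def z_interval_for_mean(x_bar, s, n, confidence):
    """Large-sample CI for a mean: x_bar +/- z(alpha/2) * s / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point of N(0, 1)
    se = s / math.sqrt(n)                     # estimated standard error
    return x_bar - z * se, x_bar + z * se

# Milky Way summary statistics from the example: n = 40, mean 59.97, sd 1.92.
lower, upper = z_interval_for_mean(59.97, 1.92, 40, 0.95)
print(round(lower, 1), round(upper, 1))
```

Changing the `confidence` argument is all that is needed to obtain, say, a 98% interval from the same summary statistics.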
Example #2: Exercise 9.6 from WMMY 8th page 286
A random sample of 50 college students yields a sample average height of 174.5 cm and
a standard deviation of 6.9 cm. Obtain a 98% CI for the mean height of college
students.
CONFIDENCE INTERVAL INTERPRETATION
We just found that a 95% CI for $\mu$ = mean weight of all Milky Way candy bars was (59.4, 60.6).
Now some True/False questions.
T F a. The probability that $\mu$ is in the CI is 95%.
T F b. The probability that $\bar{X}$ is in the CI is 95%.
T F c. The probability that $\mu$ is in the CI is either 0% or 100%.
T F d. 95% of all such CI’s contain $\mu$.
T F e. We can conclude that $\mu$ is closer to the center of the CI than the ends.
T F f. 95% of all candy bars weigh between (59.4, 60.6gms).
Before we answer these T/F, here are some more questions:
Is $\mu$ a constant or does it vary?
Is $\sigma^2$ known or unknown?
Based on our RS of n = 40, what is the distribution of $\bar{X}$ and what does it look like?
If we took a different sample of Milky Way candy bars, would we get the same 95% CI?
Would $\mu$ change?
Would $\bar{x}$ change?
Would s change?
Would $z(\alpha/2)$ change?
1. For each CI, is $\mu$ in the interval? So what’s the probability $\mu$ is in any ONE CI?
2. What % of ALL CI’s contain $\mu$?
3. What would the population distribution look like?
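A small simulation can make these questions concrete. The sketch below assumes, purely for illustration, a Normal population with $\mu$ = 60 and $\sigma$ = 1.92 (values chosen to resemble the candy data, not known population values), draws many random samples of n = 40, and counts how often the interval $\bar{x} \pm 1.96\, s/\sqrt{n}$ contains $\mu$. The function name `coverage` and the seed are arbitrary.

```python
import math
import random
import statistics

def coverage(mu, sigma, n, reps, z=1.96, seed=1):
    """Fraction of simulated intervals x_bar +/- z*s/sqrt(n) that contain mu."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.mean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)
        # Any ONE interval either contains mu or it does not.
        if x_bar - z * se <= mu <= x_bar + z * se:
            hits += 1
    return hits / reps

# ASSUMED population values, for illustration only.
print(coverage(mu=60.0, sigma=1.92, n=40, reps=2000))
```

In each repetition $\mu$ stays fixed while $\bar{x}$, s, and therefore the interval change; the long-run fraction of intervals that capture $\mu$ should come out close to the stated confidence level.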
We just found that a 95% CI for $\mu$ = mean weight of all Milky Way candy bars
was (59.4, 60.6gms).
True/False Answers:
T F a. The probability that $\mu$ is in the CI is 95%.
T F b. The probability that $\bar{x}$ is in the CI is 95%.
T F c. The probability that $\mu$ is in the CI is either 0% or 100%.
T F d. 95% of all such CI’s contain $\mu$.
T F e. We can conclude that $\mu$ is closer to the center of the CI than the ends.
T F f. 95% of all candy bars weigh between (59.4, 60.6gms).
What is(are) the population(s)?
What is(are) the parameter(s)?
1995:
2006:
CONFIDENCE INTERVAL FOR POPULATION PROPORTION,
DIFFERENCE BETWEEN TWO MEANS, & DIFFERENCE BETWEEN
TWO PROPORTIONS
We present the CI’s forms for the above three different cases in the following theorems,
then present several examples.
Thm: If X1, X2, …, Xn are a random sample from a population with proportion, p, then
a $(1 - \alpha)100\%$ confidence interval for p is
$\hat{p} \pm z(\alpha/2)\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
if np > 5 and n(1-p) > 5 OR $n\hat{p}$ > 5 and $n(1-\hat{p})$ > 5.
Thm: Let $\bar{x}_1$ and $s_1$ and $\bar{x}_2$ and $s_2$ be the sample average and sample standard deviation,
respectively, of two independent random samples of sizes n1 and n2, respectively,
from two populations with means $\mu_1$ and $\mu_2$, then a $(1 - \alpha)100\%$ confidence interval
for $(\mu_1 - \mu_2)$ is
$(\bar{x}_1 - \bar{x}_2) \pm t(\alpha/2,\, df)\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$, where $df = \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$.
Thm: If $\hat{p}_1$ and $\hat{p}_2$ are sample proportions from two independent random samples of size n1
and n2 from two populations with proportions p1 and p2, then a $(1 - \alpha)100\%$
confidence interval for (p1 - p2) is
$(\hat{p}_1 - \hat{p}_2) \pm z(\alpha/2)\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
if $n_1\hat{p}_1$ > 5, $n_1(1-\hat{p}_1)$ > 5, $n_2\hat{p}_2$ > 5, and $n_2(1-\hat{p}_2)$ > 5.
EXAMPLE #1 Underweight Milky Way Candy Bars
Let’s let p be the proportion of “vending-sized” Milky Way candy bars that are below
the stated Net Weight of 58.1 grams.
Candy Wgt (grams)
62.2  59.7  60.7  58.6  57.1  60.2  62.1  60.3
59.6  61.6  59.1  61.5  61.5  60.5  59.6  60.7
60.4  64.5  61.3  59.9  64.6  61.3  58.3  60.0
59.7  57.4  57.2  58.6  61.6  59.2  57.1  58.2
62.4  56.0  58.4  61.9  59.5  59.7  58.7  57.7
We find that 6 of the 40 candy bars weighed less than 58.1 grams. Our point
estimate of the proportion of underweight Milky Ways is $\hat{p} = 6/40 = 15.0\%$. Let’s also
obtain a 95% CI for p. Checking to ensure our sample size is large enough:
1. $n\hat{p}$ = 40(0.15) = 6 > 5
AND
2. $n(1-\hat{p})$ = 40(1-0.15) = 34 > 5!
So our 95% CI for p is
$\hat{p} \pm z(\alpha/2)\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} = 0.150 \pm z(0.025)\sqrt{\dfrac{0.150(1-0.150)}{40}} = 0.150 \pm 1.96(0.0565)$
$= 0.150 \pm 0.1107 = (0.0393, 0.2607)$
We can then conclude, with a very high degree of confidence (95% !) that between
4% and 26% of Milky Way candy bars are underweight.
Do you believe MW’s claim that no candy bar that is less than 58.1 gm goes out of
the assembly line? Why?
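The calculation in this example can be reproduced from the raw weights. The sketch below is one possible way to do it: the variable names are made up for illustration, and the multiplier 1.96 is the z(0.025) value used in the example.

```python
import math

# The 40 candy bar weights (grams) from the example.
weights = [
    62.2, 59.7, 60.7, 58.6, 57.1, 60.2, 62.1, 60.3,
    59.6, 61.6, 59.1, 61.5, 61.5, 60.5, 59.6, 60.7,
    60.4, 64.5, 61.3, 59.9, 64.6, 61.3, 58.3, 60.0,
    59.7, 57.4, 57.2, 58.6, 61.6, 59.2, 57.1, 58.2,
    62.4, 56.0, 58.4, 61.9, 59.5, 59.7, 58.7, 57.7,
]
stated_weight = 58.1

n = len(weights)
underweight = sum(1 for w in weights if w < stated_weight)
p_hat = underweight / n

# Large-sample check from the theorem: both counts must exceed 5.
if not (n * p_hat > 5 and n * (1 - p_hat) > 5):
    raise ValueError("sample too small for the Normal approximation")

z = 1.96                                   # z(0.025), as in the example
se = math.sqrt(p_hat * (1 - p_hat) / n)    # estimated SE of p_hat
lower, upper = p_hat - z * se, p_hat + z * se
print(underweight, round(p_hat, 3), round(lower, 4), round(upper, 4))
```

Since the whole interval lies above 0, these data are hard to reconcile with a claim that no underweight bar leaves the line.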
Example #2: Phone Battery Data
Lithium Ion Batteries (n = 45):
9.75   12.9   14.2   16.06  17.9
10.17  13.15  14.25  16.25
11.77  13.16  14.42  16.42
11.77  13.61  14.57  16.43
11.87  13.63  14.84  16.46
11.90  13.63  14.92  16.82
12.12  13.66  14.93  17.04
12.15  13.75  14.95  17.08
12.18  13.81  15.63  17.58
12.24  13.86  15.78  17.65
12.51  14.15  15.9   17.85
Nickel Metal Hydride (n = 53):
10.08  12.85  13.51  14.19  15.10
11.98  12.88  13.52  14.49  15.22
12.19  13.06  13.67  14.53  15.28
12.36  13.07  13.83  14.59  15.3
12.37  13.18  13.85  14.61  15.38
12.4   13.18  13.86  14.81  15.53
12.45  13.35  13.9   14.85  15.54
12.46  13.38  14.02  14.99  15.59
12.54  13.47  14.05  15.01  15.72
12.59  13.48  14.07  15.04
12.68  13.5   14.15  15.07
$n_{LI} = 45$, $\bar{x}_{LI} = 14.348$, $s_{LI} = 2.0693$, $se(\bar{x}_{LI}) = 2.0693/\sqrt{45} = 0.3085$
$n_{NIMH} = 53$, $\bar{x}_{NIMH} = 13.826$, $s_{NIMH} = 1.1819$, $se(\bar{x}_{NIMH}) = 1.1819/\sqrt{53} = 0.1623$
A $(1-\alpha)100\%$ CI for $\mu_1 - \mu_2$ is $(\bar{x}_1 - \bar{x}_2) \pm t(\alpha/2,\, df)\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$, where $df = \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$.
$se(\bar{x}_1 - \bar{x}_2) = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}} = \sqrt{\dfrac{2.0693^2}{45} + \dfrac{1.1819^2}{53}} = 0.3486$
and $df = \dfrac{\left(\dfrac{2.0693^2}{45} + \dfrac{1.1819^2}{53}\right)^2}{\dfrac{\left(2.0693^2/45\right)^2}{45 - 1} + \dfrac{\left(1.1819^2/53\right)^2}{53 - 1}} = 67.38$, so call it 68.
Obtain a 90% CI for $\mu_{NIMH} - \mu_{LI}$: (13.826 – 14.348) ± t(0.05, df)*0.3486
= (13.826 – 14.348) ± t(0.05, 68)*0.3486
= -0.522 ± 1.668*0.3486
= -0.522 ± 0.581
= [ -1.103, 0.059 ]

Interpretation?
Example #3: Hair Color & Pain Threshold Data
Light Blonde: 62 60 71 55 48
Dark Brunette: 32 39 51 30 35
$n_{LB} = 5$, $\bar{x}_{LB} = 59.2$, $s_{LB} = 8.5264$, $se(\bar{x}_{LB}) = 8.5264/\sqrt{5} = 3.8131$
$n_{DB} = 5$, $\bar{x}_{DB} = 37.4$, $s_{DB} = 8.3247$, $se(\bar{x}_{DB}) = 8.3247/\sqrt{5} = 3.7229$
A $(1-\alpha)100\%$ CI for $\mu_1 - \mu_2$ is $(\bar{x}_1 - \bar{x}_2) \pm t(\alpha/2,\, df)\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$, where $df = \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$.
$se(\bar{x}_1 - \bar{x}_2) = \sqrt{\dfrac{8.5264^2}{5} + \dfrac{8.3247^2}{5}} = 5.3292$
and $df = \dfrac{\left(\dfrac{8.5264^2}{5} + \dfrac{8.3247^2}{5}\right)^2}{\dfrac{\left(8.5264^2/5\right)^2}{5 - 1} + \dfrac{\left(8.3247^2/5\right)^2}{5 - 1}} = 7.9954$, call it 8.
Obtain a 99% CI for $\mu_{LB} - \mu_{DB}$: (59.2 – 37.4) ± t(0.005, df)*5.3292
= (59.2 – 37.4) ± t(0.005, 8)*5.3292
= 21.8 ± 3.355*5.3292
= 21.8 ± 17.8795
= [ 3.92, 39.68 ]
Interpretation?
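The standard error and the degrees of freedom for this example can be computed directly from the ten observations. The sketch below uses only Python's standard library; the function name `welch_se_and_df` is hypothetical, and because the standard library has no t quantile function, the multiplier 3.355 is simply the t(0.005, 8) value quoted in the example.

```python
import math
import statistics

light_blonde = [62, 60, 71, 55, 48]
dark_brunette = [32, 39, 51, 30, 35]

def welch_se_and_df(sample1, sample2):
    """SE of the difference in means and the approximate df for unequal variances."""
    n1, n2 = len(sample1), len(sample2)
    v1 = statistics.variance(sample1) / n1   # s1^2 / n1
    v2 = statistics.variance(sample2) / n2   # s2^2 / n2
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return se, df

se, df = welch_se_and_df(light_blonde, dark_brunette)
diff = statistics.mean(light_blonde) - statistics.mean(dark_brunette)

t_crit = 3.355   # t(0.005, 8) as quoted in the example, not computed here
lower, upper = diff - t_crit * se, diff + t_crit * se
print(round(se, 4), round(df, 4), round(lower, 2), round(upper, 2))
```

With a statistics package available, the t multiplier could be looked up for the unrounded df instead of being typed in from a table.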
Example #4: Example 9.6 from WMMY 8th page 289
Compare the gas mileage of two car types (compact and sub-compact). We have two
independent RS’s with summary information:
$n_{SC} = 75$, $\bar{x}_{SC} = 42$, $s_{SC} = 8$
$n_C = 50$, $\bar{x}_C = 36$, $s_C = 6$
Obtain a 96% CI for $\mu_C - \mu_{SC}$.
Example #5: Exercise 9.65 from WMMY 8th page 305
Compare the proportion of females and males with a certain minor blood disorder.
We have independent RS’s of size 1,000 and found 275 females with the disorder
and 250 males with the disorder.
Obtain a 95% confidence interval for the difference in proportions.
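One way to set up this calculation, following the two-proportion theorem, is sketched below. The function name `two_proportion_interval` is hypothetical, the z multiplier comes from the standard library's `statistics.NormalDist`, and the difference is taken as females minus males (the exercise does not fix the order).

```python
import math
from statistics import NormalDist

def two_proportion_interval(x1, n1, x2, n2, confidence):
    """CI for p1 - p2 from counts x1 of n1 and x2 of n2 (independent samples)."""
    p1, p2 = x1 / n1, x2 / n2
    # Large-sample conditions from the theorem.
    if min(n1 * p1, n1 * (1 - p1), n2 * p2, n2 * (1 - p2)) <= 5:
        raise ValueError("samples too small for the Normal approximation")
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Females: 275 of 1,000; males: 250 of 1,000 (difference = females - males).
lower, upper = two_proportion_interval(275, 1000, 250, 1000, 0.95)
print(round(lower, 4), round(upper, 4))
```

Whether the resulting interval contains 0 is the key thing to look at when interpreting it.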
Example #6:
Example #7:
Confidence Interval for Difference of Two Means
Non-Independent Samples—What’s the Effect?
(9.44 WMMY 8th) A taxi company is trying to decide whether to purchase Brand A or Brand B
tires for its fleet of taxis. A tire from each brand is assigned at random to the rear wheels of 8
taxis and the following distances, in km, recorded until a tire had only 1/8” of tread remaining.
Taxi      1       2       3       4       5       6       7       8       n   Average    ST Dev
Brand A   34,400  45,500  36,700  32,000  48,400  32,800  38,100  30,100  8   37,250     6546.7549
Brand B   36,700  46,800  37,700  31,100  47,800  36,400  38,900  31,500  8   38,362.5   6181.0627
Are these two samples independent?
Now let’s calculate the $se(\bar{x}_A - \bar{x}_B)$ assuming independent samples. Recall that $se(\bar{x}_1 - \bar{x}_2) = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$.
The problem is that since the samples are NOT independent, our $se(\bar{x}_A - \bar{x}_B)$ could either over-estimate or under-estimate the true standard error!
Confidence Intervals for Difference of Two Means
Paired Data Case
Thm: Assuming a sample of “n” paired observations (x1i, y2i), a $(1-\alpha)100\%$ CI for $\mu_1 - \mu_2$ is
$\bar{d} \pm t(\alpha/2,\, n-1)\, se(\bar{d})$,
where $d_i = (x_{1i} - y_{2i})$ and $se(\bar{d}) = \dfrac{s_d}{\sqrt{n}}$.
Taxi        1       2       3       4       5       6       7       8       n   Average    ST Dev
Brand A     34,400  45,500  36,700  32,000  48,400  32,800  38,100  30,100  8   37,250     6546.7549
Brand B     36,700  46,800  37,700  31,100  47,800  36,400  38,900  31,500  8   38,362.5   6181.0627
Difference  -2,300  -1,300  -1,000  900     600     -3,600  -800    -1,400  8   -1112.5    1454.4881
Hence, for our data, a 95% CI for the difference in mileage for the two Brands of tires (A - B)
is:
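One way to carry out this calculation is sketched below. The code also computes the standard error that would result from (incorrectly) treating the two brands as independent samples, so the two can be compared. The multiplier 2.365 is the standard table value of t(0.025, 7); it is typed in as an assumption because the standard library has no t quantile function.

```python
import math
import statistics

brand_a = [34400, 45500, 36700, 32000, 48400, 32800, 38100, 30100]
brand_b = [36700, 46800, 37700, 31100, 47800, 36400, 38900, 31500]
n = len(brand_a)

# Paired analysis: work with the within-taxi differences (A - B).
diffs = [a - b for a, b in zip(brand_a, brand_b)]
d_bar = statistics.mean(diffs)
se_paired = statistics.stdev(diffs) / math.sqrt(n)

# For comparison only: the SE if the samples were treated as independent.
se_indep = math.sqrt(statistics.variance(brand_a) / n
                     + statistics.variance(brand_b) / n)

t_crit = 2.365   # t(0.025, 7) from a t table (assumed, not computed)
lower, upper = d_bar - t_crit * se_paired, d_bar + t_crit * se_paired
print(d_bar, round(se_paired, 2), round(se_indep, 2))
print(round(lower, 1), round(upper, 1))
```

Here the independent-samples formula gives a standard error several times larger than the paired one, because most of the variation in distance is taxi-to-taxi variation that pairing removes.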
NOTES AND COMMENTS ON CONFIDENCE INTERVALS
1. While n > 30 will work well in most instances, larger sample sizes would be needed if
the population is known to be severely skewed. If the population is symmetric or
approximately so, then CI’s for the mean ($\mu$) based on samples of size 30 are
adequate.
Populations that are known to be severely skewed, in either direction, would require
a larger sample size.
2. INTERPRETATION OF CI’S: A 95% CI would be interpreted as follows:
We are 95% confident that the parameter of interest falls somewhere
within the stated interval.
Notice that we do NOT say, “The probability is 95% that the parameter of interest
falls somewhere within the stated interval” since this is not true. Hence avoid using
the term “probability” in the interpretation of CI’s.
3. The Degree of Confidence of the CI is a statement about how sure or confident we
are in our CI. The higher the degree of confidence, the more certain we are with
our statement; the lower the degree of confidence the less sure we are. While
higher confidence in general is better, the sacrifice is a wider CI and hence more
possible values for the parameter. One usually uses 90%, 95%, or 99% in most
cases.
What would a 100% confidence interval be? How informative is it?
4. A CI is a statement about a population parameter’s value. It does NOT say anything
about what percent or proportion of the population falls in the interval. Hence for a
95% CI, you can NOT conclude “95% of the population falls within the CI.” Rather
it is an interval in which the population parameter is likely to lie.
5. The Margin of Error of a confidence interval is the ½ width of the confidence
interval. So for our candy bar example, since the 95% confidence interval
was [ 59.97 ± 0.595 ], the Margin of Error would be 0.595.
The Margin of Error provides some evidence of how large an “error” is involved with
our estimate or how far away our estimate is from the true parameter.
6. If the degree of confidence is not stated, it’s assumed to be 95%. So if a Margin of
Error is given with no indication of the degree of confidence, assume it is 95%.
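The trade-off in Note 3 and the Margin of Error in Note 5 can be illustrated with the candy bar summary statistics (n = 40, s = 1.92). The sketch below assumes the large-sample z interval; the function name `margin_of_error` is made up for illustration.

```python
import math
from statistics import NormalDist

def margin_of_error(s, n, confidence):
    """Half-width of the large-sample CI for a mean: z(alpha/2) * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * s / math.sqrt(n)

# Candy bar summary statistics from the notes.
s, n = 1.92, 40
margins = {level: margin_of_error(s, n, level) for level in (0.90, 0.95, 0.99)}
for level, margin in margins.items():
    print(f"{level:.0%} confidence: margin of error = {margin:.3f} grams")
```

The margin grows as the confidence level rises, and the required multiplier grows without bound as the level approaches 100%, which is one way to think about the question posed in Note 3.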
USING SAS TO OBTAIN CONFIDENCE INTERVALS
Recall we found that a 95% CI for $\mu$ = mean weight of all Milky Way candy bars was (59.4, 60.6).
SAS CI for a Single Mean
OPTIONS LS=110 PS=60 PAGENO=1 NODATE FORMDLIM='+';
TITLE 'CI.SAS';
TITLE2 'EXAMPLE OF ONE AND TWO SAMPLE CI OF MEANS IN SAS';
TITLE3 'MILKY WAY WGT DATA FROM CLASS';
DATA MWDATA;
INPUT MW_WGT @@;
DATALINES;
62.2 59.7 60.7 58.6 57.1 60.2 62.1 60.3
59.6 61.6 59.1 61.5 61.5 60.5 59.6 60.7
60.4 64.5 61.3 59.9 64.6 61.3 58.3 60.0
59.7 57.4 57.2 58.6 61.6 59.2 57.1 58.2
62.4 56.0 58.4 61.9 59.5 59.7 58.7 57.7
;
PROC TTEST DATA= MWDATA ALPHA=0.05;
VAR MW_WGT;
RUN;
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CI.SAS
1
EXAMPLE OF ONE AND TWO SAMPLE CI OF MEANS IN SAS
MILKY WAY WGT DATA FROM CLASS
The TTEST Procedure
Statistics
                 Lower CL            Upper CL   Lower CL             Upper CL
Variable    N    Mean       Mean     Mean       Std Dev    Std Dev   Std Dev    Std Err   Minimum   Maximum
MW_WGT     40    59.351     59.965   60.579     1.573      1.9203    2.4657     0.3036    56        64.6
T-Tests
Variable    DF    t Value    Pr > |t|
MW_WGT      39    197.50     <.0001
SAS CI of the Difference of Two Means—Independent Samples
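/* Read the battery data from the Excel workbook into data set BATTERY */
PROC IMPORT DATAFILE='C:\MyDocs\Class\1 Winter 2007\STA 301\Data Sets\Battery.xls'
OUT=BATTERY;
RUN;
/* ALPHA=0.10 requests 90% confidence limits */
PROC TTEST DATA=BATTERY ALPHA=0.10;
TITLE3 'BATTERY DATA';
CLASS BATTERYTYPE;
VAR TIME;
RUN;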
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CI.SAS
2
EXAMPLE OF ONE AND TWO SAMPLE CI OF MEANS IN SAS
BATTERY DATA
The TTEST Procedure
Statistics
                                     Lower CL            Upper CL   Lower CL             Upper CL
Variable   BatteryType          N    Mean       Mean     Mean       Std Dev    Std Dev   Std Dev    Std Err
Time       LithiumIon          53    13.554     13.826   14.098     1.0199     1.1819    1.4119     0.1623
Time       NickelMetalHydride  45    13.83      14.348   14.867     1.765      2.0693    2.5149     0.3085
Time       Diff (1-2)                -1.078     -0.522   0.0328     1.4757     1.649     1.8731     0.3343
Recall the 90% CI for $\mu_{NIMH} - \mu_{LI}$ that we obtained by hand in Example #2 (Phone Battery Data). Why the difference?
T-Tests
Variable   Method          Variances   DF      t Value   Pr > |t|
Time       Pooled          Equal       96      -1.56     0.1214
Time       Satterthwaite   Unequal     67.4    -1.50     0.1387
Equality of Variances
Variable   Method     Num DF   Den DF   F Value   Pr > F
Time       Folded F   44       52       3.07      0.0001
SAS CI of the Difference of Two Means—Paired Data
DATA PAIRED;
TITLE3 'PAIRED TIRE DATA';
INPUT BRANDA BRANDB;
DIFF = BRANDA-BRANDB;
DATALINES;
34400 36700
45500 46800
36700 37700
32000 31100
48400 47800
32800 36400
38100 38900
30100 31500
;
PROC TTEST DATA=PAIRED;
VAR DIFF;
RUN;
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CI.SAS
3
EXAMPLE OF ONE AND TWO SAMPLE CI OF MEANS IN SAS
PAIRED TIRE DATA
The TTEST Procedure
Statistics
                Lower CL            Upper CL   Lower CL             Upper CL
Variable   N    Mean       Mean     Mean       Std Dev    Std Dev   Std Dev    Std Err   Minimum   Maximum
DIFF       8    -2328      -1113    103.48     961.67     1454.5    2960.3     514.24    -3600     900
T-Tests
Variable   DF   t Value   Pr > |t|
DIFF       7    -2.16     0.0673
Approximate 95% Margin of Error for proportions is $1/\sqrt{n}$ …
So MoE is
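A short sketch shows where a rule of thumb like this can come from. It assumes the usual reasoning that $\hat{p}(1-\hat{p})$ is never larger than 0.25, so the 95% margin $1.96\sqrt{\hat{p}(1-\hat{p})/n}$ is at most $1.96\sqrt{0.25/n} = 0.98/\sqrt{n}$, which is just under $1/\sqrt{n}$. The function names and the sample sizes in the loop are illustrative choices.

```python
import math

def conservative_margin(n, z=1.96):
    """Largest possible 95% margin for a proportion (attained at p_hat = 0.5)."""
    return z * math.sqrt(0.25 / n)

def quick_margin(n):
    """The 1/sqrt(n) rule of thumb."""
    return 1 / math.sqrt(n)

# Illustrative sample sizes only.
for n in (100, 400, 1000):
    print(n, round(conservative_margin(n), 4), round(quick_margin(n), 4))
```

For any sample size the rule of thumb slightly overstates the worst-case margin, so it errs on the cautious side.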