Confidence Intervals
[1]
Statistical Estimation
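sample statistic = parameter estimate
$$\bar{X} = \hat{\mu}, \qquad s = \hat{\sigma}$$
Example:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}$$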
[2]
Parameters and Statistics
• Process parameters, $\mu$ and $\sigma$
• (Model parameters, $\mu$ and $\sigma$)
• Sample statistics, $\bar{X}$ and $s$
• Statistical inference
– inferring knowledge of $\mu$ and $\sigma$, unknown,
from values of $\bar{X}$ and $s$, calculated from data
[3]
Clip gap measurements in twenty-five samples of
five measurements each

Sample   Clip gaps               Range
  1      65  70  65  65  85       20
  2      75  85  75  85  65       20
  3      75  80  80  70  75       10
  4      60  70  70  75  65       15
  5      70  75  65  85  80       20
  6      60  75  75  85  70       25
  7      75  80  65  75  70       15
  8      60  70  80  75  75       20
  9      65  80  85  85  75       20
 10      60  70  60  80  65       20
 11      80  75  90  50  80       40
 12      85  75  85  65  70       20
 13      70  70  75  75  70        5
 14      65  70  85  75  60       25
 15      90  80  80  75  85       15
 16      75  80  75  80  65       15
 17      75  85  70  80  70       15
 18      75  70  60  70  60       15
 19      65  65  85  65  70       20
 20      60  60  65  60  65        5
 21      50  55  65  80  80       30
 22      60  80  65  65  75       20
 23      80  65  75  65  65       15
 24      65  60  65  60  70       10
 25      65  70  70  60  65       10
[4]
Plot subgroup means
[Figure: subgroup means (X bar, axis from 55 to 90) plotted against Sample Number (1 to 25)]
$\bar{X}_{\text{Before}} = 73.8$
$\bar{X}_{\text{After}} = 66.75$
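The two means quoted above can be reproduced from the tabulated clip gap data. The sketch below, in R, assumes that "before" refers to samples 1 to 16 and "after" to samples 18 to 25. That split is an inference from the sample sizes of 80 and 40 used in the later examples, not something the slides state, and the object names are illustrative.

```r
# Clip gap data: one row per sample, five measurements per row.
gaps <- matrix(c(
  65, 70, 65, 65, 85,   75, 85, 75, 85, 65,   75, 80, 80, 70, 75,  # samples 1-3
  60, 70, 70, 75, 65,   70, 75, 65, 85, 80,   60, 75, 75, 85, 70,  # samples 4-6
  75, 80, 65, 75, 70,   60, 70, 80, 75, 75,   65, 80, 85, 85, 75,  # samples 7-9
  60, 70, 60, 80, 65,   80, 75, 90, 50, 80,   85, 75, 85, 65, 70,  # samples 10-12
  70, 70, 75, 75, 70,   65, 70, 85, 75, 60,   90, 80, 80, 75, 85,  # samples 13-15
  75, 80, 75, 80, 65,   75, 85, 70, 80, 70,   75, 70, 60, 70, 60,  # samples 16-18
  65, 65, 85, 65, 70,   60, 60, 65, 60, 65,   50, 55, 65, 80, 80,  # samples 19-21
  60, 80, 65, 65, 75,   80, 65, 75, 65, 65,   65, 60, 65, 60, 70,  # samples 22-24
  65, 70, 70, 60, 65                                               # sample 25
), ncol = 5, byrow = TRUE)

xbar <- rowMeans(gaps)                               # subgroup means
rng  <- apply(gaps, 1, function(x) max(x) - min(x))  # subgroup ranges

# Assumed split: samples 1-16 before the new batch, samples 18-25 after.
mean_before <- mean(gaps[1:16, ])
mean_after  <- mean(gaps[18:25, ])
c(before = mean_before, after = mean_after)   # 73.75 and 66.75
```

The "before" mean of 73.75 is the value the slides quote as 73.8.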
[5]
Estimation: how do we quantify
the implied uncertainty?
Based on the 16 × 5 = 80 values sampled from the
stable process before the new batch of raw
material, can we estimate the process mean?
$\bar{X}_{\text{Before}} = 73.8$
How do we represent the uncertainty associated
with this estimate?
Evaluating the estimate $\bar{X}_{\text{Before}} = 73.8$ in light
of its implied uncertainty, would we conclude that
the process is "on target"?
[6]
[7]
[8]
[9]
[10]
It is unlikely that two samples of the same size taken from the same
population would return exactly the same value for the sample mean.
The sample mean will vary from sample to sample.
The sample mean is itself a random variable, with
its own population mean,
its own standard deviation (called the standard error),
and its own distribution (the sampling distribution of the mean).
Properties of the sampling distribution of the mean
The sampling distribution of the mean turns out to be a normal
distribution. (see diagrams below).
This is always true if the underlying distribution of the variable is
itself normal; but even more importantly, it is approximately true as
long as the distribution of the original variables is not very skewed,
and the approximation improves as the sample size (n) increases.






The second result of concern relates to the mean of all the
sample means in the sampling distribution of the mean.
Reasonably enough, it turns out to be nothing more than the mean ($\mu$) of
the population from which the samples were chosen.
Thus, sample means are distributed normally about the unknown
population mean which is being estimated.
This justifies the intuitive notion that most of the possible sample
means should be fairly close to this population value.
The sample mean should be fairly near to the population mean. The
question arises of how near is fairly near, which, of course, relates to
the dispersion of the sample means around the population mean.
It can be shown that the standard deviation of the sampling
distribution of the mean (more usually called the standard error of the
mean, or, when there is no ambiguity, the standard error) is given by
$$SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$
where $\sigma$ is the standard deviation of the original population, and $n$ is
the sample size.
Thus, estimates based on a large sample size are more precise than
estimates associated with small samples.
- Why?
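One way to see why is to simulate. The sketch below draws repeated samples from a Normal population and compares the standard deviation of the simulated sample means with $\sigma/\sqrt{n}$. The values mu = 70 and sigma = 7 are assumptions chosen for the demonstration, not estimates from the data, and the helper name `sim_se` is hypothetical.

```r
set.seed(1)                      # for reproducibility
mu <- 70; sigma <- 7             # assumed population values (illustrative)

# Standard deviation of the means of `reps` simulated samples of size n
sim_se <- function(n, reps = 10000) {
  sd(replicate(reps, mean(rnorm(n, mean = mu, sd = sigma))))
}

ns <- c(5, 20, 80)
se_table <- data.frame(n = ns,
                       simulated   = sapply(ns, sim_se),
                       theoretical = sigma / sqrt(ns))
se_table
```

The simulated and theoretical columns should agree closely, and both shrink as n grows: quadrupling the sample size halves the standard error.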
The Normal model for X and for X-bar
[Figure: Normal curves for X, spanning $\mu \pm 3\sigma$, and for X-bar, spanning $\mu \pm 3\sigma/\sqrt{n}$]
[14]
Implications of the standard error formula
$\bar{X}$ is very likely to be within 2 standard errors of $\mu$ and
is even more likely to be within 3 standard errors of $\mu$.
This means that, having calculated a value of $\bar{X}$ from
sampled data, we can be reasonably confident that $\mu$
is within $2\sigma/\sqrt{n}$ of the calculated value and even more
confident that $\mu$ is within $3\sigma/\sqrt{n}$ of the calculated value.
[15]
Sampling distribution of X-bar
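[Figure: sampling distribution of X-bar with the central 95% area marked; Z scale from -3 to 3, X scale from $\mu - 2\sigma/\sqrt{n}$ to $\mu + 2\sigma/\sqrt{n}$]
95% chance that X-bar is within $2\sigma/\sqrt{n}$ of $\mu$,
therefore,
95% confident that $\mu$ is within $2\sigma/\sqrt{n}$ of X-bar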
[16]
Logic of confidence intervals
With repeated sampling from the process, n at a time,
and calculating a new value of $\bar{X}$ each time, expect
95% of the calculated values of $\bar{X}$ to be within two
standard errors of $\mu$.
Changing emphasis, expect that, in 95% of samples
from a stable process, $\mu$ will be within two standard
errors of the calculated value of $\bar{X}$.
Therefore, given a single sample from the process, we
are 95% confident that the value of $\mu$ will be within two
standard errors of the calculated value of $\bar{X}$.
[17]
95% confidence interval for $\mu$
$$\left(\bar{X} - 2\sigma/\sqrt{n},\ \bar{X} + 2\sigma/\sqrt{n}\right)$$
that is,
all values of $\mu$ within 2 standard errors of $\bar{X}$
[18]
Example
$\bar{X}_{\text{Before}} = 73.8$
s = 7.3
n = 80.
Confidence interval for $\mu_{\text{Before}}$ is:
$73.8 - 2 \times 7.3/\sqrt{80}$ to
$73.8 + 2 \times 7.3/\sqrt{80}$,
72.2 to 75.4.
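The arithmetic can be checked in R. This is a minimal sketch; the helper name `ci_2se` is hypothetical, and it uses s in place of $\sigma$, as the example does.

```r
# Interval of +/- 2 standard errors around a sample mean
ci_2se <- function(xbar, s, n) xbar + c(-1, 1) * 2 * s / sqrt(n)

ci_before <- ci_2se(xbar = 73.8, s = 7.3, n = 80)
round(ci_before, 1)   # 72.2 75.4
```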
[19]
Exercise
$\bar{X}_{\text{After}} = 66.75$
s = 7.3
n = 40.
Calculate a confidence interval for $\mu_{\text{After}}$
[20]
50 simulated confidence intervals
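A simulation of this kind can be sketched as follows. The population values (mu = 73.8, sigma = 7.3) and n = 80 are borrowed from the earlier example purely for illustration; sigma is treated as known, and the multiplier 2 follows the slides. On average about 95% of such intervals cover mu, so in any one run of 50 a few misses are to be expected.

```r
set.seed(42)
mu <- 73.8; sigma <- 7.3; n <- 80     # assumed "true" process values
se <- sigma / sqrt(n)

# 50 sample means, each from a fresh sample of size n
xbars <- replicate(50, mean(rnorm(n, mean = mu, sd = sigma)))
lower <- xbars - 2 * se
upper <- xbars + 2 * se

covers <- lower <= mu & mu <= upper   # does each interval contain mu?
sum(covers)                           # number of the 50 intervals covering mu

# Draw the intervals; those missing mu are shown in red (interactive use only)
if (interactive()) {
  plot(NA, xlim = range(lower, upper), ylim = c(1, 50),
       xlab = "X scale", ylab = "Simulated sample")
  segments(lower, 1:50, upper, 1:50, col = ifelse(covers, "black", "red"))
  abline(v = mu, lty = 2)
}
```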
[21]
[22]
[23]
[24]
[25]
[26]
The value 2 is an approximation to the value 1.96 from the
normal tables.
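The Normal model for $\bar{X}$
[Figure: stacked X marks forming a histogram of sample means, with the standard error (S.E.) marked]
95% of sample means lie in the range given by
$$\mu - 1.96\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu + 1.96\frac{\sigma}{\sqrt{n}}$$
which rearranges to
$$\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$$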
Problem Name: Cadmium Ion Concentration in Sludge
Application: Interval Estimation of a Population Mean
Problem Description: 70 determinations of the Cd2+ ion
concentration were made. The data showed a sample mean of 54.97
mg/ml and a standard deviation of 0.33 mg/ml.
Our best estimate of $\mu$ is 54.97 mg/ml, but what level
of confidence do we place in this figure?
What we require is an INTERVAL ESTIMATE.
[28]
Example: 95% CI for Mean Cadmium Ion Concentration
A 95% confidence interval for the true mean Cadmium ion
concentration is calculated as
$$\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$
$$\left(54.97 - 1.96\frac{0.33}{\sqrt{70}},\ 54.97 + 1.96\frac{0.33}{\sqrt{70}}\right)$$
$$\left(54.97 - 0.08,\ 54.97 + 0.08\right)$$
$$\left(54.89,\ 55.05\right)$$
Under repeated sampling, we would expect 95% of intervals constructed
in this fashion to contain the true mean Cadmium ion concentration.
[29]
General Procedure: Interval estimate of a population mean
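$$\bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
where $1 - \alpha$ is the confidence level.
[Figure: sampling distribution of $\bar{X}$; the confidence interval covers $(1 - \alpha) \times 100\%$ of all $\bar{X}$ values, centred on $\mu$]

Confidence level   $\alpha$   $Z_{\alpha/2}$
90%                0.10       1.645
95%                0.05       1.960
99%                0.01       2.576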
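The tabulated multipliers are quantiles of the standard Normal distribution, so they can be obtained from R's `qnorm`. The sketch below wraps the general formula in a helper and applies it to the cadmium figures, treating s = 0.33 as $\sigma$, as the slides do. The names `z_mult` and `z_ci` are hypothetical.

```r
# z multiplier for a given confidence level: upper alpha/2 point of N(0, 1)
z_mult <- function(conf) qnorm(1 - (1 - conf) / 2)
round(z_mult(c(0.90, 0.95, 0.99)), 3)   # 1.645 1.960 2.576

# Interval estimate: xbar +/- z * sigma / sqrt(n)
z_ci <- function(xbar, sigma, n, conf = 0.95) {
  xbar + c(-1, 1) * z_mult(conf) * sigma / sqrt(n)
}

round(z_ci(54.97, 0.33, 70, conf = 0.95), 2)   # 54.89 55.05
round(z_ci(54.97, 0.33, 70, conf = 0.99), 2)   # 54.87 55.07
```

Raising the confidence level widens the interval: more confidence is bought at the cost of less precision.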
[30]
Example: 99% CI for Mean Cadmium Ion Concentration
A 99% confidence interval for the true mean Cadmium ion
concentration is calculated as
$$\left(\bar{X} - 2.58\frac{\sigma}{\sqrt{n}},\ \bar{X} + 2.58\frac{\sigma}{\sqrt{n}}\right)$$
$$\left(54.97 - 2.58\frac{0.33}{\sqrt{70}},\ 54.97 + 2.58\frac{0.33}{\sqrt{70}}\right)$$
$$\left(54.97 - 0.10,\ 54.97 + 0.10\right)$$
$$\left(54.87,\ 55.07\right)$$
Under repeated sampling, we would expect 99% of intervals constructed
in this fashion to contain the true mean Cadmium ion concentration.
[31]
Example: Tablets require an average weight of 100 mg. An
inspector takes a sample of 200 tablets and finds that
$\bar{X} = 98.52$ mg and $s = 7.1$ mg.
A 95% CI is
$$\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$
$$\left(98.52 - 1.96\frac{7.1}{\sqrt{200}},\ 98.52 + 1.96\frac{7.1}{\sqrt{200}}\right)$$
$$\left(98.52 - 0.98,\ 98.52 + 0.98\right)$$
$$\left(97.54,\ 99.50\right)$$
Quality engineer says that
this interval is “too wide”!
[32]
Example: What sample size would be required to estimate the
mean weight of tablets to within ± 0.85 mg, using a 95% C.I.?
$$\bar{X} \pm 1.96\frac{\sigma}{\sqrt{n}} = \bar{X} \pm 0.85$$
$$1.96\frac{7.1}{\sqrt{n}} = 0.85$$
$$n = \left(\frac{1.96 \times 7.1}{0.85}\right)^2 \approx 268$$
Thus, in order to achieve the desired precision in our estimate of
the population mean we should use a sample of size 268.
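The same calculation in R, as a sketch (the function name `n_for_margin` is hypothetical). Note that the formula gives a value slightly above 268; the slide rounds to the nearest integer, whereas rounding up, the conservative choice, gives 269.

```r
# Sample size so that z * sigma / sqrt(n) equals the required margin
n_for_margin <- function(sigma, margin, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  (z * sigma / margin)^2
}

n_raw <- n_for_margin(sigma = 7.1, margin = 0.85)
n_raw            # a little over 268
ceiling(n_raw)   # 269, rounding up
```

Because the required n grows with the square of sigma/margin, halving the margin would roughly quadruple the sample size.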
[33]
Suppose a new sample gave
$\bar{X} = 98.32$ mg and $s = 7.0$ mg.
$$\left(98.32 - 1.96\frac{7.0}{\sqrt{268}},\ 98.32 + 1.96\frac{7.0}{\sqrt{268}}\right)$$
$$\left(98.32 - 0.84,\ 98.32 + 0.84\right)$$
$$\left(97.48,\ 99.16\right)$$
[34]
# The normal core body temperature of a healthy, resting adult human
# being is stated to be 98.6 degrees Fahrenheit. We will consider
# data reported by Mackowiak et al., JAMA 268:1578-1580, 1992. TRY...

# Read the data (columns temp and gender) and display it
temps = read.table("C:/Kev/MA4413/data/Mackowiak.txt", header=TRUE)
temps

# Boxplots of temperature by gender, with the stated value marked
boxplot(temp ~ gender, data = temps)
abline(h = 98.6, col = "green", lty = 2, lwd = 2)

# Mean, standard deviation and standard error of a sample
stats = function(x) c(mean(x), sd(x), sd(x)/sqrt(length(x)))
# Confidence interval: mean +/- w standard errors (w = 1.96 gives 95%)
CI = function(x, w = 1.96) mean(x) + c(-1,1) * w * sd(x) / sqrt(length(x))

# Summaries and intervals for each gender
with(temps, by(temp, gender, stats))
with(temps, by(temp, gender, CI))
means = with(temps, by(temp, gender, mean))
CIs = with(temps, by(temp, gender, CI))

# Overlay the intervals (red) and means (blue) on the boxplots;
# assumes gender is coded "female" and "male"
lines(x = c(1,1), y = CIs$female, col = "red", lwd = 3)
lines(x = c(2,2), y = CIs$male, col = "red", lwd = 3)
points(x = 1:2, y = means, pch = 16, col = "blue", cex = 1.5)
[35]
[36]
Example: Rental Costs
• A reporter for a student newspaper is writing an article
on the cost of off-campus housing.
• A sample of 10 one-bedroom units within a half-mile of
campus resulted in a sample mean of €550 per month
and a sample standard deviation of €30.
• Calculate a 95% confidence interval estimate of the
mean rent per month for the population of one-bedroom
units within a half-mile of campus.
• We’ll assume this population to be normally
distributed.
[37]
Interval Estimation of a Population Mean
Small-Sample Case (n < 30)
If the data have a normal probability
distribution and the sample standard
deviation $s$ is used to estimate the
population standard deviation $\sigma$,
the interval estimate is given by:
$$\bar{X} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}$$
where $t_{\alpha/2}$ is the value providing an
area of $\alpha/2$ in the upper tail of a
t distribution with $n - 1$ degrees of freedom.
Example: Apartment Rents
• t Value
At 95% confidence, $1 - \alpha = .95$, $\alpha = .05$, and $\alpha/2 = .025$.
$t_{.025}$ is based on n - 1 = 10 - 1 = 9 degrees of freedom.
In the t distribution table we see that $t_{.025} = 2.262$.

Degrees        Area in Upper Tail
of Freedom     .10      .05      .025     .01      .005
.              .        .        .        .        .
6              1.440    1.943    2.447    3.143    3.707
7              1.415    1.895    2.365    2.998    3.499
8              1.397    1.860    2.306    2.896    3.355
9              1.383    1.833    2.262    2.821    3.250
10             1.372    1.812    2.228    2.764    3.169
[39]
Example: Apartment Rents
• Interval Estimation of a Population Mean:
Small-Sample Case (n < 30) with $\sigma$ Unknown
$$\bar{x} \pm t_{.025}\frac{s}{\sqrt{n}}$$
$$550 \pm 2.262\frac{30}{\sqrt{10}}$$
$$550 \pm 21.46$$
or
€528.54 to €571.46
We are 95% confident that the mean rent per
month for the population of one-bedroom units
within a half-mile of campus is between €528.54
and €571.46.
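The interval can be reproduced in R, using `qt` for the t multiplier instead of the printed table. A sketch from the summary figures given in the example; the variable names are illustrative.

```r
xbar <- 550; s <- 30; n <- 10

t_mult <- qt(0.975, df = n - 1)       # upper 2.5% point of t on 9 df
round(t_mult, 3)                      # 2.262

margin <- t_mult * s / sqrt(n)
round(margin, 2)                      # 21.46
rent_ci <- xbar + c(-1, 1) * margin
round(rent_ci, 2)                     # 528.54 571.46
```

Using the Normal multiplier 1.96 here instead would give a narrower interval that understates the uncertainty from estimating $\sigma$ with only 10 observations.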
[40]
Percentage points of the t Distribution

Degrees        Area in Upper Tail
of Freedom     .10      .05      .025     .01      .005
.              .        .        .        .        .
29             1.311    1.699    2.045    2.462    2.756
30             1.310    1.697    2.042    2.457    2.750
.              .        .        .        .        .
40             1.303    1.684    2.021    2.423    2.704
.              .        .        .        .        .
60             1.296    1.671    2.000    2.390    2.660
.              .        .        .        .        .
120            1.289    1.658    1.980    2.358    2.617
.              .        .        .        .        .
infinity       1.282    1.645    1.960    2.326    2.576
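Percentage points like these can be regenerated with `qt` rather than read from a printed table. A sketch; the object names are illustrative, and the infinite-degrees-of-freedom row coincides with the standard Normal quantiles.

```r
tail_areas <- c(0.10, 0.05, 0.025, 0.01, 0.005)
dfs <- c(29, 30, 40, 60, 120, Inf)

# One row per df, one column per upper-tail area
t_table <- t(sapply(dfs, function(d) qt(1 - tail_areas, df = d)))
dimnames(t_table) <- list(df = dfs, area = tail_areas)
round(t_table, 3)
```

As the degrees of freedom increase, each column settles towards the corresponding Normal value, which is why the z multiplier is an adequate approximation for large samples.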
[41]
Problem Description: A quality control inspector weighs the
contents of 7 packets of breakfast cereal, all from the same filling
machine. The data recorded were
111g, 117g, 105g, 100g, 97g, 118g, 113g.
Use a 95% confidence interval estimate to determine if the machine
is filling to the a priori target value of 115 grams per pack.
At 95% confidence, $1 - \alpha = 0.95$ and $\alpha = 0.05$.
$$\bar{X} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}$$
[Figure: t distribution on 6 df, with critical values -2.447 and +2.447]
$$108.7 \pm 2.447 \times \frac{8.22}{\sqrt{7}}$$
$$108.7 \pm 7.6$$
or
101.1 to 116.3
The target value of 115 g lies inside this interval, so the data are
consistent with the machine filling to target.
TRY:
w = c(111, 117, 105, 100, 97, 118, 113)
n = length(w)
# t multiplier for 95% confidence on n - 1 = 6 df (two equivalent calls)
qt(0.975, df = n - 1)
qt(0.025, df = n - 1, lower.tail = FALSE)
# 95% confidence interval from the formula
mean(w) + c(-1,1) * qt(0.975, df = n - 1) * sd(w) / sqrt(n)
# the R function t.test does all this
t.test(w)$conf.int
# test the assumption of normal data!!
qqnorm(w)
[43]
N-Score Plots: Testing the assumption of normality
NSCORES are idealised values
we would expect if the data came
from a normal distribution.
Use Z values {Z1…Z7} that divide
the standard normal curve into 8
sections, with the area to the left
of each Z equal to (i - 1/2)/n of the
total area, where n = 7 and i runs
from 1 to 7 in this example.
The assumption of normality of
the Weight data is being tested.
If the points fall on a line then the
assumption of normality is not
called into question!
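The normal scores described above can be computed directly with `qnorm`. The sketch below uses the seven cereal weights from the earlier example; the variable names are illustrative. R's own `qqnorm` uses slightly different plotting positions for small samples, so its scores will not match these exactly.

```r
w <- c(111, 117, 105, 100, 97, 118, 113)   # cereal weights (g)
n <- length(w)

# Z values with area (i - 1/2)/n to the left, for i = 1, ..., n
nscores <- qnorm((seq_len(n) - 0.5) / n)

# Pair the i-th smallest weight with the i-th normal score
cbind(weight = sort(w), nscore = round(nscores, 2))

# A roughly straight-line pattern supports the Normal model (interactive use only)
if (interactive()) {
  plot(nscores, sort(w), xlab = "Normal score", ylab = "Weight (g)")
}
```

The scores are symmetric about zero, with the middle score exactly 0 because its left-hand area is 3.5/7 = 0.5.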
Normal scores and
the Normal diagnostic plot
[45]
Normal diagnostic plot
• If the sampled process follows the Normal model,
the similarity of the spacing patterns will lead to a
straight line scatter plot pattern, with some
chance variation.
• If the scatter plot pattern is not a straight line with
some chance variation, then the conclusion is
that the sampled process does not conform to the
Normal model.
[46]
Normal diagnostic plot, Presses 1-4
[47]
Reference plots
[48]
Normal plot, Presses 1-4, all data, with reference plots
[49]
A skew frequency curve
[50]
Return on Stocks
[51]
Assumption of Normality??
[52]
Sample Statistics and t-value from tables
• Sample mean: $\bar{X} = -0.00983$
• Standard deviation: $s = 0.055$
• Sample size: $n = 30$
• t Value
At 95% confidence, $1 - \alpha = .95$, $\alpha = .05$, and $\alpha/2 = .025$.
$t_{.025}$ is based on n - 1 = 30 - 1 = 29 degrees of freedom.
In the t distribution table we see that $t_{.025} = 2.045$.
Verify that the 95% CI estimate is -0.0304 % to 0.0107 %
[53]
Prediction Intervals
Confidence interval:
$$\bar{x} \pm t_{.025}\frac{s}{\sqrt{n}}$$
Prediction interval:
$$\bar{x} \pm t_{.025}\, s\sqrt{1 + \frac{1}{n}}$$
Verify that the 95% PI estimate is -0.125 % to 0.105 %
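Both intervals can be checked in R from the summary statistics on the previous slide. This is a sketch with illustrative variable names; small differences in the last digit against the quoted values can arise from rounding of s and of the t value.

```r
xbar <- -0.00983; s <- 0.055; n <- 30
t_mult <- qt(0.975, df = n - 1)               # approximately 2.045

# Confidence interval for the mean return
ci <- xbar + c(-1, 1) * t_mult * s / sqrt(n)

# Prediction interval for a single future return
pred <- xbar + c(-1, 1) * t_mult * s * sqrt(1 + 1 / n)

round(ci, 4)
round(pred, 3)
```

The prediction interval is much wider than the confidence interval because it must allow for the variability of an individual value as well as the uncertainty in the estimated mean.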
[54]
Confidence Interval For A Proportion
Suppose that 46 respondents of a sample of 140
students claim to attend lectures. The sample
proportion is $\hat{p} = 46/140 = 0.33$; a 95% confidence
interval for the population proportion $p$ is required.
$$SE(\hat{p}) = \sqrt{\frac{p(1-p)}{N}}$$
which is estimated by
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{N}}$$
In general a confidence interval is constructed as
Point Estimate ± Value × SE(Point Estimate)
$$\hat{p} \pm 1.96 \times SE(\hat{p})$$
$$0.33 \pm 1.96\sqrt{\frac{0.33(1 - 0.33)}{140}}$$
0.25 to 0.41
Sample Size
Suppose that the research team are unhappy about the
width of the interval and say that in future they would
like estimates in the form
X% ± 2%
To achieve this level of precision in the estimate, how
large must the sample be?
$$\hat{p} \pm Z_{\alpha/2}\sqrt{\frac{p(1-p)}{N}} = \hat{p} \pm 0.02$$
$$N = \left(\frac{Z_{\alpha/2}}{0.02}\right)^2 p(1-p)$$
Since p is unknown this expression cannot be evaluated
immediately. Consider the following table:

p          .1     .2     .3     .4     .5     .6
1-p        .9     .8     .7     .6     .5     .4
p(1-p)     .09    .16    .21    .24    .25    .24

p(1-p) has a maximum when p = 0.5; if we use this
value we do at least as well as required.
$$N = \left(\frac{Z_{\alpha/2}}{0.02}\right)^2 \times 0.25$$
For a 95% confidence interval we have
$$N = \left(\frac{1.96}{0.02}\right)^2 \times 0.25 = 2401$$