Lecture 6
Selected material from:
Ch. 9 Estimation using a single sample
Point estimate
A point estimate of a population characteristic is a single number that is based on sample data and represents a plausible value of the characteristic.
Example: A sample of 200 students at a large university is selected to estimate the proportion of students at the university that wear contact lenses (π). In this sample 47 wore contact lenses.
The statistic
$p = \frac{\text{number of successes in the sample}}{n}$
is a reasonable choice for a formula to obtain a point estimate for π.
Such a point estimate is $p = \frac{47}{200} = 0.235$.
© 2008 Brooks/Cole, a division of Thomson Learning, Inc.
Example: Weights
A sample of weights (pounds) of 34 US male university students was obtained.
185 202 197 188 166 148 161 139 214 170 231 180
174 177 283 207 176 194 175 170 184 180 184 176
202 151 189 167 179 178 176 168 177 155
To estimate the true mean of all male university students, you might use the sample mean as a point estimate for the true mean.
sample mean $\bar{x} = 182.44$ pounds (82.8 kg)
Example: Weights
After looking at a histogram and boxplot of the data you might notice that the data seem reasonably symmetric with an outlier, so you could also use either the sample median or sample trimmed mean as a point estimate.
sample mean $\bar{x} = 182.44$ (82.8 kg)
5% trimmed mean $= 180.07$ (81.7 kg)
sample median $= \frac{177 + 178}{2} = 177.5$ (80.5 kg)
[Figure: plot of the weights, axis ticks at 140, 180, 220, 260 (max 283 lbs = 128.4 kg). Photo: Pittsburgh Steelers]
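The three point estimates above can be recomputed from the listed weights. The following is a minimal sketch using Python's standard library; the helper `trimmed_mean` and the choice of dropping two values from each end (5% of 34 is 1.7, rounded up here) are assumptions made for illustration, chosen because they reproduce the 180.07 quoted on the slide.

```python
import statistics

# Weights (pounds) of the 34 students, as listed on the slide.
weights = [
    185, 202, 197, 188, 166, 148, 161, 139, 214, 170, 231, 180,
    174, 177, 283, 207, 176, 194, 175, 170, 184, 180, 184, 176,
    202, 151, 189, 167, 179, 178, 176, 168, 177, 155,
]


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


sample_mean = statistics.mean(weights)
sample_median = statistics.median(weights)
# Assumption: trim k = 2 observations per tail (5% of 34 rounded up).
trimmed = trimmed_mean(weights, 2)

print(round(sample_mean, 2), round(trimmed, 2), sample_median)
```

The trimmed mean and the median sit below the sample mean because the single large value (283) pulls the mean upward.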
Motivation for confidence interval in addition to point estimate.
Contact lens example:
A sample of 200 students at a large university is selected to estimate the proportion of students that wear contact lenses. In this sample 47 wore contact lenses. The point estimate is p = 47/200 = .235, i.e. an estimated 23.5% of students wear contact lenses.
A shortcoming of the point estimate is that it gives no idea of the uncertainty in it. The sample size here (200) is fairly large, so we can be confident that the point estimate is fairly accurate. But what if the sample size had only been 50 or 10?
The point estimate alone does not tell us how much less accurate it would then be.
Motivation for confidence interval in addition to point estimate.
Point estimate is p = 47/200 = .235
We could report the standard deviation:
$\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}} \approx \sqrt{\frac{p(1-p)}{n}} = .03$
where we substitute the sample proportion p = .235 for the population proportion π since we do not know π.
But a standard deviation can be hard to understand.
People understand a range of plausible values more easily, and that is what the confidence interval gives.
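To see how the uncertainty grows as the sample shrinks, the estimated standard deviation can be evaluated for the sample sizes mentioned earlier (200, 50 and 10). This is only a sketch: it assumes the same sample proportion p = .235 would be observed at every sample size, and `standard_error` is a name introduced here for illustration.

```python
import math


def standard_error(p, n):
    """Estimated standard deviation of a sample proportion p from n observations."""
    return math.sqrt(p * (1 - p) / n)


# Assumption: the same sample proportion is observed at each sample size.
for n in (200, 50, 10):
    print(n, round(standard_error(0.235, n), 3))
```

The value for n = 200 rounds to the .03 given above, and the values for the smaller samples are noticeably larger.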
Confidence interval
[Photo: Longhorn stadium, Austin, TX]
A confidence interval for a population characteristic is an interval of plausible values for the characteristic. It is constructed from a random sample so that, with a chosen degree of confidence, the population characteristic will be captured inside the interval.
The confidence level associated with a confidence interval estimate is the success rate of the method used to construct the interval.
Confidence interval (CI)
For example, if we repeatedly draw random samples and construct a 95% confidence interval from each one, about 95% of the resulting intervals should include the true population characteristic.
Sample 1 (n=50) → 95% CI 1
Sample 2 (n=50) → 95% CI 2
...................
Sample 100 (n=50) → 95% CI 100
≈ 95 of these 100 intervals (95%) should contain the true population characteristic.
Note that in practice we never know the true population characteristic, so we can never check this directly; the 95% success rate is instead established by advanced mathematical arguments.
95% Confidence interval for a population proportion π
When n is large, a 95% confidence interval for π is
$\left( p - 1.96\sqrt{\frac{p(1-p)}{n}},\ p + 1.96\sqrt{\frac{p(1-p)}{n}} \right)$
where p is the sample proportion based on a random sample of size n.
The endpoints of the interval are often abbreviated by
$p \pm 1.96\sqrt{\frac{p(1-p)}{n}}$
where - gives the lower endpoint and + the upper endpoint.
95% CI for a population proportion
For large n, the following interval constructed from a simple random sample with statistic p can be expected to include the true population proportion with probability .95:
$\left( p - 1.96\sqrt{\frac{p(1-p)}{n}},\ p + 1.96\sqrt{\frac{p(1-p)}{n}} \right)$
Practical test to see if n is large enough:
n is large enough if
$np \ge 10$ and $n(1-p) \ge 10$
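The claim that the interval captures the true proportion about 95% of the time can be checked by simulation, where (unlike in practice) the true proportion is known. The sketch below is illustrative only: the true proportion 0.3, the sample size 50, the number of repetitions and the random seed are all assumptions chosen for the demonstration, and `coverage` is a hypothetical helper name.

```python
import math
import random


def coverage(true_pi, n, repetitions, seed=1):
    """Fraction of simulated 95% intervals that contain true_pi."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(repetitions):
        # Draw one sample of size n and count the successes.
        successes = sum(rng.random() < true_pi for _ in range(n))
        p = successes / n
        half_width = 1.96 * math.sqrt(p * (1 - p) / n)
        if p - half_width <= true_pi <= p + half_width:
            hits += 1
    return hits / repetitions


print(coverage(true_pi=0.3, n=50, repetitions=2000))
```

The printed fraction should land near 0.95, though usually not exactly on it, because the formula is a large-sample approximation and the simulation itself is random.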
Example: Gym
A student asked a random sample of 182 students at a large university whether they want a new fitness club (gym), and found that 75 said they do want it. Provide a point estimate and 95% confidence interval for the proportion of all students at the university who want a new gym on campus.
$p = \frac{75}{182} = 0.4121$
np = 182(0.4121) = 75 > 10 and n(1‐p) = 182(0.5879) = 107 > 10, so we can use the formula given on the previous slide to find a 95% confidence interval for π.
$p \pm 1.96\sqrt{\frac{p(1-p)}{n}} = 0.4121 \pm 1.96\sqrt{\frac{0.4121(0.5879)}{182}} = 0.4121 \pm 0.07151$
95% confidence interval for π: (0.341, 0.484)
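The gym calculation can be reproduced in a few lines. This is a minimal sketch; the function name `proportion_ci_95` is introduced here for illustration, and raising an error when the large-sample check fails is a design choice of the sketch, not something the lecture prescribes.

```python
import math


def proportion_ci_95(successes, n):
    """Large-sample 95% confidence interval for a population proportion."""
    p = successes / n
    # Practical check from the lecture: np >= 10 and n(1-p) >= 10.
    if n * p < 10 or n * (1 - p) < 10:
        raise ValueError("sample too small for the large-sample interval")
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


low, high = proportion_ci_95(75, 182)
print(round(low, 3), round(high, 3))
```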
General confidence levels
Suppose instead of a 95% confidence interval, you want to be more confident, say 99%, or less confident, say 80%, of estimating the population proportion.
Question: How do you construct a C% confidence interval for a general confidence level C?
Answer: $p \pm (z \text{ critical value})\sqrt{\frac{p(1-p)}{n}}$
where the z critical value is chosen so that the area of the standard Normal curve between –z and z is C%.
Finding a z critical value
Example: a z critical value for a 98% confidence interval.
Looking up the cumulative area (0.99) in the standard Normal table we find z = 2.33. By symmetry the area less than –z=‐2.33 is 0.01.
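Instead of a printed table, the same lookup can be done numerically. The sketch below uses `statistics.NormalDist` from Python's standard library; the helper name `z_critical` is an assumption introduced for illustration. It finds the z value with cumulative area C + (1 − C)/2 to its left, which leaves central area C between −z and z.

```python
from statistics import NormalDist


def z_critical(confidence):
    """z value with central area `confidence` under the standard Normal curve."""
    # The cumulative area to the left of z is the central area plus one tail.
    return NormalDist().inv_cdf(confidence + (1 - confidence) / 2)


for level in (0.80, 0.90, 0.95, 0.98, 0.99, 0.998, 0.999):
    print(level, round(z_critical(level), 3))
```

For a 98% interval this looks up the cumulative area 0.99, matching the table lookup described above.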
Some common critical values
Confidence level    z critical value
80%                 1.28
90%                 1.645
95%                 1.96
98%                 2.33
99%                 2.58
99.8%               3.09
99.9%               3.29
As the confidence level increases, so does the z critical value, and therefore the confidence interval gets wider.
For the gym example, p=0.4121
80% CI (0.37,0.46)
95% CI (0.34,0.48)
99.9% CI (0.29,0.53)
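The widening of the interval with the confidence level can be verified for the gym example. This is a sketch under the assumption that the z critical value is computed numerically rather than read from the table; `proportion_ci` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, confidence):
    """Large-sample confidence interval for a proportion at a given level."""
    p = successes / n
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Gym example: 75 of 182 students want a new gym.
for level in (0.80, 0.95, 0.999):
    low, high = proportion_ci(75, 182, level)
    print(f"{level:.1%} CI ({low:.2f}, {high:.2f})")
```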
Estimated standard error
The estimated standard error of a statistic is the estimated standard deviation of the statistic.
It is sometimes simply referred to as the standard error.
For sample proportions, the standard deviation is
$\sqrt{\frac{\pi(1-\pi)}{n}}$
Recall we did not know π so used p instead.
This means that the standard error of the sample proportion is
$\sqrt{\frac{p(1-p)}{n}}$
Standard error
For proportions,
$\text{standard error} = \sqrt{\frac{p(1-p)}{n}}$
$95\%\ \text{CI} = \left( p - 1.96\sqrt{\frac{p(1-p)}{n}},\ p + 1.96\sqrt{\frac{p(1-p)}{n}} \right)$
$= \left( p - 1.96\,(\text{st error } p),\ p + 1.96\,(\text{st error } p) \right)$
The quantity 1.96 (st error p) bounds how far the interval ranges.
Bounds on errors
The bound on error of estimation, B, associated with a 95% confidence interval, is (1.96)∙(standard error of the statistic).
The bound on error of estimation, B, associated with a general confidence interval, is (z critical value)∙(standard error of the statistic).
Standard error
For proportions,
$95\%\ \text{CI} = \left( p - 1.96\sqrt{\frac{p(1-p)}{n}},\ p + 1.96\sqrt{\frac{p(1-p)}{n}} \right)$
$= \left( p - 1.96\,(\text{st error } p),\ p + 1.96\,(\text{st error } p) \right)$
$= (p - B,\ p + B)$
$\Rightarrow |p - \pi| \le B$, since π is in the interval.
Hence B tells you the bound on how far your sample estimate is from the "truth" for a given level of confidence.
Bounds on errors
The bound on error of estimation, B, associated with a general confidence interval is (z critical value)∙(standard error of the statistic).
As the formula for B shows, the bound increases with increasing z critical value and/or standard error of the statistic.
The z critical value increases as the confidence level increases, as shown before. So the higher the confidence you want, the higher the error bound you are stuck with.
Confidence level    z critical value
80%                 1.28
90%                 1.645
95%                 1.96
98%                 2.33
99%                 2.58
99.8%               3.09
99.9%               3.29
Bounds on errors
The bound on error of estimation, B, associated with a general confidence interval is (z critical value)∙(standard error of the statistic).
The bound increases with increasing z critical value and/or standard error of the statistic.
For a proportion,
$\text{st error of } p = \sqrt{\frac{p(1-p)}{n}}$
decreases as n increases.
By increasing the sample size n of your experiment you can obtain a lower error bound B for your point estimate.
Design of experiments
Suppose an investigator wants to conduct an experiment to estimate a proportion π. He wants to know what sample size (n) he should use. To determine the sample size he needs to answer three things:
a.) What is his educated guess of the true value of π?
b.) What degree of confidence does he want in his answer? 95%?
c.) What bound B is acceptable for the error?
Then you solve the equation B = (z critical value)∙(standard error of the statistic) for n.
Sample size for a proportion
For a specified π and error bound B, and for 95% confidence:
$B = 1.96\sqrt{\frac{\pi(1-\pi)}{n}}$
$\Rightarrow \sqrt{n}\,B = 1.96\sqrt{\pi(1-\pi)}$
$\Rightarrow \sqrt{n} = \frac{1.96\sqrt{\pi(1-\pi)}}{B}$
$\Rightarrow n = \frac{1.96^2\,\pi(1-\pi)}{B^2}$
Sample size for proportions
The sample size required to estimate a population proportion π to within an amount B with 95% confidence is
$n = \pi(1-\pi)\left(\frac{1.96}{B}\right)^2$
The value of π should be based on prior information. If no prior information is available, use π = 0.5 in the formula to obtain a conservatively large value for n. Round the result up to the nearest integer, since you cannot sample fractions of people or things.
Example: Germany’s Next Top Model
[Photo: Luisa, 2012]
ProSieben would like to find a 95% confidence interval estimate within 0.03 for the proportion of all households that watch Germany’s Next Top Model regularly. How large a sample is needed if a prior estimate for π is 0.15?
B = 0.03, prior estimate of π = 0.15
$n = \pi(1-\pi)\left(\frac{1.96}{B}\right)^2 = (0.15)(0.85)\left(\frac{1.96}{0.03}\right)^2 = 544.2$
A sample of 545 households would be needed.
Example: Germany’s Next Top Model
How large a sample is needed if we have no prior estimate of π, i.e. have no idea how many people watch that show?
We should use π = 0.5 for a conservative estimate.
$n = \pi(1-\pi)\left(\frac{1.96}{B}\right)^2 = (0.5)(0.5)\left(\frac{1.96}{0.03}\right)^2 = 1067.1$
The required sample size is now 1068.
A good prior estimate for π can lower the required sample size (n = 545 on the previous slide, where we had the prior estimate π = 0.15, is much lower).
Example: Germany’s Next Top Model
How large a sample is needed if we have no prior estimate of π and we want higher confidence at 99%?
We should use the z critical value 2.58 (for a 99% confidence interval) in place of 1.96:
$n = \pi(1-\pi)\left(\frac{2.58}{B}\right)^2 = (0.5)(0.5)\left(\frac{2.58}{0.03}\right)^2 = 1849$
The required sample size is now 1849.
The higher the confidence wanted, the larger the sample size needed.
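The three sample-size calculations for this example follow the same recipe, so they can be wrapped in one small function. This is a sketch: `sample_size` is a name introduced here, the z critical values are the rounded table values used on the slides (1.96 and 2.58), and the tiny tolerance before rounding up is an implementation detail added to guard against floating-point noise when the raw result is a whole number.

```python
import math


def sample_size(pi_guess, bound, z):
    """Smallest n with z * sqrt(pi(1-pi)/n) <= bound for the guessed pi."""
    raw = pi_guess * (1 - pi_guess) * (z / bound) ** 2
    # Round up; subtract a tiny tolerance so that a raw value that is an
    # exact integer up to floating-point error is not bumped to the next one.
    return math.ceil(raw - 1e-9)


print(sample_size(0.15, 0.03, 1.96))  # prior estimate 0.15, 95% confidence
print(sample_size(0.5, 0.03, 1.96))   # no prior estimate, 95% confidence
print(sample_size(0.5, 0.03, 2.58))   # no prior estimate, 99% confidence
```

Using π = 0.5 is conservative because π(1 − π) is largest at 0.5, so any other guess gives a smaller n.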
Confidence interval for a continuous population mean μ
If
1. $\bar{x}$ is the sample mean from a random sample,
2. the sample size n is large (generally n ≥ 30),
3. σ, the population standard deviation, is known,
then the general formula for a confidence interval for a population mean μ is given by
$\bar{x} \pm (z \text{ critical value})\frac{\sigma}{\sqrt{n}}$
Small sample sizes
$\bar{x} \pm (z \text{ critical value})\frac{\sigma}{\sqrt{n}}$
The term after ± is the bound for estimating the mean.
Actually this confidence interval is valid when σ is known and either
1. n is large (n ≥ 30) or
2. the population distribution is Normal (any sample size).
So if the sample size is small (n < 30) but a histogram or other graphical display indicates the data are normally distributed, then this same confidence interval can be used.
Example: Ketchup
A filling machine has a true population standard deviation σ = 0.228 ounces when used to fill ketchup bottles. A random sample of 36 “6 ounce” bottles of ketchup was selected from the output of this machine and the sample mean was 6.018 ounces. Find a 90% confidence interval estimate for the true mean fill of ketchup from this machine.
[Photo: Pittsburgh, PA]
$\bar{x} = 6.018,\ \sigma = 0.228,\ n = 36$
From the tables, the z critical value is 1.645.
90% confidence interval:
$\bar{x} \pm (z \text{ critical value})\frac{\sigma}{\sqrt{n}} = 6.018 \pm 1.645\,\frac{0.228}{\sqrt{36}} = 6.018 \pm 0.063$
(5.955, 6.081)
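The ketchup interval can be checked numerically. A minimal sketch, assuming the z critical value is computed with `statistics.NormalDist` instead of being read from a table; the function name `mean_ci_known_sigma` is introduced here for illustration.

```python
import math
from statistics import NormalDist


def mean_ci_known_sigma(xbar, sigma, n, confidence):
    """Confidence interval for a mean when the population sigma is known."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width


low, high = mean_ci_known_sigma(xbar=6.018, sigma=0.228, n=36, confidence=0.90)
print(round(low, 3), round(high, 3))
```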
Unknown σ
The statistician W. S. Gosset derived the Student’s t distribution, which describes the behavior of
$\frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
where s is the sample standard deviation (Ch. 4):
$s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}} = \sqrt{\frac{S_{xx}}{n-1}}$
[Photo: W. S. Gosset]
Gosset developed this while working at the Guinness brewery in the early 1900s and published it under the name "Student" rather than his own, so that Guinness would not think he was giving away company trade secrets.
t Distributions
The statistic
$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$
follows a t distribution with df = n ‐ 1 (degrees of freedom) if n ≥ 30 or, for smaller sample sizes, if the distribution of the data can be assumed to follow a Normal distribution.
t Distributions
[Figure: comparison of Normal and t distributions; curves for df = 2, df = 5, df = 10, df = 25 and the Normal, plotted from -4 to 4]
As the degrees of freedom (df) increase, the t‐distribution approaches the standard Normal distribution.
One‐sample t procedures
Suppose that a simple random sample of size n is drawn from a population having unknown mean μ. Then the confidence interval for μ is
$\left( \bar{x} - (t \text{ critical value})\frac{s}{\sqrt{n}},\ \bar{x} + (t \text{ critical value})\frac{s}{\sqrt{n}} \right)$
where the t critical value is the t value that gives central area C for the t distribution with df = n‐1, chosen as the quantile with area C + (1‐C)/2 to the left (e.g. the .975 quantile for a .95 CI).
Example: TV
Ten randomly selected students were locked up for a week in their houses and asked to list how many hours of television they watched. The results are
82 66 90 84 75 88 80 94 110 91
Find a 90% confidence interval estimate for the true mean number of hours of television they watched.
Since n = 10 is small, we need to assume viewing times are normally distributed in order to use the t interval.
We construct a histogram and it looks like the assumption of normality for these data is reasonable.
[Figure: histogram of x; frequency (0 to 4) against x from 60 to 110]
Example: TV
Calculating the sample mean and standard deviation we have n = 10, $\bar{x} = 86$, s = 11.842.
The 95% quantile of the t‐distribution with df = 9 is 1.833 (from the R statistical package). The 90% confidence interval for μ is given as
$\bar{x} \pm t^{*}\frac{s}{\sqrt{n}} = 86 \pm (1.833)\frac{11.842}{\sqrt{10}} = 86 \pm 6.86$
(79.14, 92.86)
Answer: We are 90% confident that the true mean number of hours of television watched in a week by students locked up in their houses is between 79 and 93.
That was clearly an American study. What would you expect from Germans?
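The TV calculation can be reproduced from the raw data. The sketch below uses only Python's standard library, which has no t distribution, so the t critical value 1.833 is taken from the slide (the value quoted there from R) rather than computed; `t_interval` is a hypothetical helper name.

```python
import math
import statistics

# Hours of television reported by the ten students.
hours = [82, 66, 90, 84, 75, 88, 80, 94, 110, 91]


def t_interval(data, t_critical):
    """One-sample t confidence interval for the mean, given the t critical value."""
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    half_width = t_critical * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Assumption: 1.833 is the .95 quantile of the t distribution with df = 9,
# as quoted on the slide.
low, high = t_interval(hours, 1.833)
print(round(low, 2), round(high, 2))
```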
Tutorial 6: Impact of working on finishing college
[Photo: University of California Santa Barbara]
An article in the Journal of College Student Development examined whether having a job while at the university was associated with a greater tendency to drop out (quit the university). They collected a sample of 44 students who dropped out and 257 students who did not drop out and asked them the number of hours worked per week. The data are summarized in the following table.
Group                    Mean     Standard deviation
Drop-out (n = 44)        25.62    14.41
Not drop-out (n = 257)   18.10    15.31
Why do you think the sample size for the drop-outs is so much smaller than the other one?
Tutorial 6: Impact of working on finishing college
Group                    Mean     Standard deviation
Drop-out (n = 44)        25.62    14.41
Not drop-out (n = 257)   18.10    15.31
1.) Consider the 44 drop-outs as a random sample of all drop-outs from the university and compute a 98% confidence interval (CI) for the mean number of hours worked by all students who drop out at that university.
2.) Now do similarly to compute a 98% confidence interval for the mean number of hours worked for students who do not drop out.
3.) Compare the two CIs. Which interval is wider? Comment why that is so.
In R, qt(x,df) gives the x quantile of the t‐distribution with df degrees of freedom. R output:
qt(.98,44)     2.116
qt(.98,43)     2.118
qt(.99,44)     2.414
qt(.99,43)     2.416
qt(.98,257)    2.064
qt(.98,256)    2.064
qt(.99,257)    2.342
qt(.99,256)    2.341
Tutorial 6: Impact of working on finishing college
Group                    Mean     Standard deviation
Drop-out (n = 44)        25.62    14.41
Not drop-out (n = 257)   18.10    15.31
4.) Based on the CI for drop-outs, would it be safe to say that drop-outs work on average more than 20 hours? Describe your conclusion.
5.) Now treat the estimated standard errors as if they were known, calculate the bounds on the error of estimation for a 98% CI for both groups, and compare. Use the z critical value of 2.33 (the .99 quantile).
6.) If you wanted to collect more drop-outs to get the same bound (98% CI) as for the non-drop-outs, how many more would you need?