Sociology 5811:
Lecture 8: CLT Applications:
Confidence Intervals, Examples
Copyright © 2005 by Evan Schofer
Do not copy or distribute without
permission
Announcements
• Problem Set 3 handed out
• On course website
Review: Sampling Distributions
• Q: What is the sampling distribution of the mean?
• Answer: Sampling Distribution: The distribution
of estimates created by taking all possible unique
samples (of a fixed size) from a population
• Q: What is the Standard Error?
• Answer: The standard deviation of the sampling
distribution
• Q: What does the Standard Error tell you?
• Answer: How “dispersed” estimates will be
around the true parameter value
Review: Central Limit Theorem
• Q: What does the CLT mean in plain language?
1. As N grows large, the sampling distribution of
the mean approaches normality
2. μ_Ȳ = μ_Y
3. σ_Ȳ = σ_Y / √N
Central Limit Theorem: Visually
[Figure: population distribution of Y (s.d. σ_Y) and the narrower sampling distribution of Ȳ, both centered on μ_Y]
Implications of the C.L.T
• Visually: Suppose we observe μ̂ = 16
[Figure: many possible locations of μ could have produced μ̂ = 16, but μ̂ always falls within the sampling distribution around the true μ]
Implications of the C.L.T
• What is the relation between the Standard Error
and the size of our sample (N)?
• Answer: It is an inverse relationship.
• The standard deviation of the sampling distribution shrinks
as N gets larger
• Formula: σ_Ȳ = σ_Y / √N
• Conclusion: Estimates of the mean based on
larger samples tend to cluster closer around the
true population mean.
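The inverse relationship can be illustrated with a short sketch (the population standard deviation of 200 is a hypothetical value chosen for illustration):

```python
import math

# Standard error of the mean: sigma_Y / sqrt(N)
# sigma_y = 200 is a hypothetical population standard deviation
sigma_y = 200

for n in (10, 50, 100, 1000):
    se = sigma_y / math.sqrt(n)
    print(f"N = {n:4d}  ->  standard error = {se:.2f}")
```

Note that because of the square root, quadrupling N only halves the standard error.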
Implications of the CLT
• The width of the sampling distribution is an
inverse function of N (sample size)
– The distribution of mean estimates based on N = 10
will be more dispersed. Mean estimates based on
N = 50 will cluster closer to μ.
[Figure: sampling distribution of μ̂ around μ for a smaller vs. a larger sample size]
Confidence Intervals
• Benefits of knowing the width of the sampling
distribution:
• 1. You can figure out the general range of error
that a given point estimate might miss by
• Based on the range around the true mean that the estimates
will fall
• 2. And, this defines the range around an estimate
that is likely to hold the population mean
• A “confidence interval”
• Note: These only work if N is large!
Confidence Interval
• Confidence Interval: “A range of values around a
point estimate that makes it possible to state the
probability that an interval contains the
population parameter between its lower and upper
bounds.” (Bohrnstedt & Knoke p. 90)
• It involves a range and a probability
• Examples:
• We are 95% confident that the mean number of CDs owned
by grad students is between 20 and 45
• We are 50% confident the mean rainfall this year will be
between 12 and 22 inches.
Confidence Interval
• Visually: It is probable that μ falls near μ̂
[Figure: curve around μ̂ = 16 showing probable values of μ nearby, and a range far from μ̂ where μ is unlikely to be. Q: Can μ be this far from μ̂? A: Yes, but it is very improbable]
Confidence Interval
• To figure out the range of “error” in our mean
estimate, we need to know the width of the
sampling distribution
• The Standard Error! (S.D. of the sampling dist of the mean)
• The Central Limit Theorem provides a formula:
σ_Ȳ = σ_Y / √N
• Problem: We do not know the exact value of
σ_Y, the population standard deviation!
Confidence Interval
• Question: How do we calculate the standard
error if we don’t know the population S.D.?
• Answer: We estimate it using the information we
have:
σ̂_Ȳ = s_Y / √N
• Where N is the sample size and s_Y is the
sample standard deviation.
95% Confidence Interval Example
• Suppose a sample of 100 students with mean SAT
score of 1020, standard deviation of 200
• How do we find the 95% Confidence Interval?
• If N is large, we know that:
• 1. The sampling distribution is roughly normal
• 2. Therefore 95% of samples will yield a mean estimate
within 2 standard deviations (of the sampling distribution)
of the population mean (μ)
• Thus, 95% of the time, our estimates of μ (Y-bar)
are within two “standard errors” of the actual
value of μ.
95% Confidence Interval
• Formula for 95% confidence interval:
95% CI: Ȳ ± 2(σ_Ȳ)
• Where Ȳ is the mean estimate and σ_Ȳ is the standard error
• Result: Two values – an upper and lower bound
• Adding our estimate of the standard error:
Ȳ ± 2(σ̂_Ȳ) = Ȳ ± 2(s_Y / √N)
95% Confidence Interval
• Suppose a sample of 100 students with mean SAT
score of 1020, standard deviation of 200
• Calculate:
95% CI: Ȳ ± 2(s / √N)
1020 ± (2)(200 / √100) = 1020 ± 2(200 / 10) = 1020 ± 2(20) = 1020 ± 40
• Thus, we are 95% confident that the population
mean falls between 980 and 1060
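The same arithmetic, as a minimal Python sketch of the slides' SAT example:

```python
import math

# SAT example from the slides: N = 100, Y-bar = 1020, s = 200
n, y_bar, s = 100, 1020, 200

se = s / math.sqrt(n)                # estimated standard error: 200 / 10 = 20
lower = y_bar - 2 * se               # 1020 - 40 = 980
upper = y_bar + 2 * se               # 1020 + 40 = 1060
print(f"95% CI: {lower:.0f} to {upper:.0f}")  # 95% CI: 980 to 1060
```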
Confidence Intervals
• Question: Suppose we want to know the
confidence interval for a value other than 95%?
• How can we find the C.I. for any confidence level?
• Answer #1: We know that 68% of cases fall
within 1 standard deviation, and over 99% within 3
• Q: What is 99% C.I.? (Y-bar = 1020, S.D. = 200)
99% CI: Ȳ ± 3(s / √N)
1020 ± (3)(200 / √100) = 1020 ± 60 = 960 to 1080
Confidence Intervals
• Question: Which was a larger range: the 95% CI
or 99% CI ?
• Answer: The 99% range was larger
• The larger the range, the more likely that the true
mean will fall in it
• It is a safe bet if you specify a very wide range
• If you want to bet that the mean will fall in a very narrow
range, you’ll lose more often.
Confidence Intervals
• Question: Suppose we want to know the
confidence interval for a value other than 95%?
• Answer #2: Look at the “Z-table”
• Z-table = Normal curve probability distribution
with mean 0, SD of 1
• Found on Knoke, p. 459
– It tells you the % of cases falling within a particular
number of S.D.’s of the mean
• Lists all values, not just 1, 2, and 3!
Confidence Intervals: Z-table
Question: What Z-value should we use for a 20%
confidence interval?
Answer: 10% of cases fall from 0 to Z = .26,
so 20% of cases fall from −.26 to +.26
Confidence Intervals
• General formula for Confidence Interval:
C.I.: Ȳ ± Z_α/2 (σ_Ȳ)
• Where:
– Y-bar is the sample mean
– σ_Ȳ is the standard error of the mean
– Z_α/2 is the Z-value for the level of confidence
– It can be looked up in a Z-table
– If you want 90%, look up p(0 to Z) of .45
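Instead of a printed Z-table, the standard-normal quantile can be computed directly: Python's standard library provides `statistics.NormalDist` (Python 3.8+). A sketch of the general formula, reusing the slides' SAT numbers:

```python
import math
from statistics import NormalDist

def z_confidence_interval(y_bar, s, n, confidence=0.95):
    """C.I. = Y-bar +/- Z_{alpha/2} * (s / sqrt(N)); valid when N is large."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # Z_{alpha/2}, replaces the Z-table lookup
    se = s / math.sqrt(n)
    return y_bar - z * se, y_bar + z * se

# SAT example: Y-bar = 1020, s = 200, N = 100
for conf in (0.90, 0.95, 0.99):
    lo, hi = z_confidence_interval(1020, 200, 100, conf)
    print(f"{conf:.0%} CI: {lo:.1f} to {hi:.1f}")
```

The exact Z_.025 is 1.96 rather than the rounded 2 used on the slides, so the 95% interval comes out 980.8 to 1059.2, slightly narrower than 980 to 1060.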
Small N Confidence Intervals
• If N is large, the C.L.T. assures us that the
sampling distribution is normal
• This allows us to construct confidence intervals
• Issue: What if N is not large?
• The sampling distribution may not be normal
• Z-distribution probabilities don’t apply…
• In short: If N is small our confidence interval
formula based on Z-scores doesn’t work.
Small N Confidence Intervals
• Solution: Find another curve that accurately
characterizes sampling distribution for small N
• The “T-distribution”
• An alternative that accurately approximates the shape of the
sampling distribution for small N
• The T distribution is actually a set of distributions
with known probabilities
• Again, we can look up values in a table to determine
probabilities associated with a # of standard deviations from
the mean.
Confidence Intervals for Small N
• Small N C. I. Formula:
• Yields accurate results, even if N is not large
C.I.: Ȳ ± t_α/2 (σ̂_Ȳ)
• Again, the standard error can be estimated by the
sample standard deviation:
C.I.: Ȳ ± t_α/2 (s / √N)
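A sketch of the small-N formula. The critical value t_.025 = 2.145 for df = 14 is the tabled value the slides use; the sample statistics (N = 15, Ȳ = 50, s = 8) are hypothetical:

```python
import math

# Small-N CI: Y-bar +/- t_{alpha/2} * (s / sqrt(N))
# t_crit = 2.145 is the tabled t-value for alpha/2 = .025, df = 14
# The sample statistics below are hypothetical, for illustration only
n, y_bar, s = 15, 50.0, 8.0
t_crit = 2.145

se = s / math.sqrt(n)
lower, upper = y_bar - t_crit * se, y_bar + t_crit * se
print(f"95% CI (df = {n - 1}): {lower:.2f} to {upper:.2f}")
```

If SciPy is available, `scipy.stats.t.ppf(0.975, 14)` returns the same 2.145 without a printed table.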
T-Distributions
• Issue: Which T-distribution do you use?
• The T-distribution is a “family” of distributions
• In a T-Distribution table, you’ll find many T-distributions to
choose from
• One t-distribution for each “degree of freedom”
– Also called “df” or “DofF”
• Which T-distribution should you use?
• For confidence intervals: Use T-distribution for
df = N - 1
• Ex: If N = 15, then look at T-distribution for df = 14.
Looking Up T-Tables
To look up a t-value: choose the desired probability
for α/2, choose the correct df (N − 1), and find the
t-value in the correct row and column.
Interpretation is just like a Z-score:
2.145 = number of standard errors for the C.I.!
Uses of Confidence Intervals
• What are some uses for confidence intervals?
• 1. Assessing the general quality of an estimate
– Ex: Mean level of happiness of graduate students
• Happiness scored on a measure from 1-10 (10=most)
– Suppose the 95% CI is: 6 +/- 4
• i.e., range = 2 to 10
– Question: Is this a “good” estimate?
– Answer: No, it is not very useful.
• Something like 6 +/- 1 is a more useful estimate.
Uses of Confidence Intervals
• 2. Comparing a mean estimate to a specific value
• Ex: Comparing a school’s test scores to a
national standard
• Suppose national standard on a math test is 47
• Suppose a sample of students scores 52. Did the school
population meet the national standard?
• If 99% CI is 50-54, then the answer is probably yes
– If 99% CI is 42-62, it isn’t certain.
• Ex: A factory makes bolts that must hold 10 kilos
• Confidence intervals let you verify that the bolts are strong
enough, without testing each one.
Uses of the Sampling Distribution
• Extended example:
• Let’s figure out what the sampling distribution
looks like for a specific population
• Since the sampling distribution is a probability
distribution….
• We can then calculate the probability of observing any
particular value of Y-bar (given a known μ)
• Note: Later we’ll use the converse logic to draw
conclusions about the actual value of μ, given an observed
Y-bar.
Probability of Y-bar, given μ
• Suppose we have a population with the following
characteristics:
•  = 23,  = 9
• What is the probability of picking a sample
(N=35) that has a mean of 27 or more?
• To determine this, we must first determine the
shape of the sampling distribution
• Then we can determine the probability of falling a given
distance from it…
Probability of Y-bar, given μ
• Q: According to the Central Limit Theorem,
what is the mean of the sampling distribution?
• A: Same as the population: μ Y  μ  23
• Second, we must determine the “width” of the
sampling distribution: the standard deviation
(referred to as Standard Error)
• The C.L.T says we can calculate it as:
σ_Ȳ = σ_Y / √N = 9 / √35 = 9 / 5.9 ≈ 1.52
Probability of Y-bar, given μ
• If we know μ and the Standard Error, we can
draw the sampling distribution of the mean for
this population:
[Figure: normal curve with μ_Ȳ = 23, σ_Ȳ = 1.5; axis from 19 to 27]
Probability of Y-bar, given μ
• We know that 95% of possible Y-bars fall within
two Standard Errors (i.e., ±3):
– between 20 and 26
[Figure: normal curve with μ_Ȳ = 23, σ_Ȳ = 1.5; shaded between 20 and 26]
Probability of Y-bar, given μ
• To determine the probability associated with a
particular value, convert to Z-scores
• p(−1<Z<1) is .68, p(−2<Z<2) is .95, etc.
• We use a slightly different Z-score formula than
we learned before
• But it is analogous
(Yi  Y )
(Y  μ)
Zi 

sY
σY
Probability of Y-bar, given μ
• Why use a different formula for Z-scores?
• Old formula calculates # standard deviations a
case falls from the sample mean
• From Y-sub-i to Y-bar
• New formula tells the number of standard errors a
mean estimate falls from the population mean μ
• From Y-bar to mu
Z_i = (Y_i − Ȳ) / s_Y   vs.   Z = (Ȳ − μ) / σ_Ȳ
Probability of Y-bar, given μ
• Back to the problem: What is the Z-score
associated with getting a sample mean of 27 or
greater from this population?
• Sampling distribution mean = 23
• Standard error = 1.5
(Y  μ) 27  23
Z

 2.66
σY
1.5
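The full calculation, sketched in Python; `statistics.NormalDist` supplies the normal CDF in place of the printed table. Using the unrounded standard error gives Z ≈ 2.63 rather than the slides' 2.66 (which rounds the SE to 1.5), but the tail probability is essentially the same:

```python
import math
from statistics import NormalDist

# P(Y-bar >= 27) given mu = 23, sigma = 9, N = 35
mu, sigma, n = 23, 9, 35

se = sigma / math.sqrt(n)        # standard error, about 1.52
z = (27 - mu) / se               # about 2.63 with the unrounded SE
p = 1 - NormalDist().cdf(z)      # "area beyond Z"
print(f"SE = {se:.2f}, Z = {z:.2f}, P(Y-bar >= 27) = {p:.4f}")
```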
Probability of Y-bar, given μ
• Finally, what is the probability of observing a Z-score
of 2.66 (or greater) in a standard normal
distribution?
• To convert Z-scores to probabilities, look it up in
a table, such as Knoke p. 463
• Area beyond Z=2.66 is .0039
• How do we interpret that?
• Lets look at it visually:
Probability of Y-bar, given μ
• The Z-distribution is a probability distribution
– Total area under curve = 1.0
– Area under half the curve is .5
– Red area (“Area beyond Z”) = .0039
Probability of Y-bar, given μ
Is the probability of Z > 2.66 very large?
No! Red area = probability of
Z > 2.66 = .004,
which is .4%
[Figure: standard normal curve from −3 to +3 with the area beyond Z = 2.66 shaded]
Probability of Y-bar, given μ
• Conclusion: Y-bar of 27 (or larger) should occur
only 4 out of 1000 times we sample from this
population
• Possible interpretations:
• 1. We just experienced an improbable sample
• 2. Our sample was biased, not representative
• 3. Maybe we begin to suspect that the population mean (μ)
isn’t really 23 after all…
• Idea: We could “cast doubt on” someone’s claim
that μ = 23, given this observed Y-bar and S.D.
• Hypothesis testing is based on this!
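The .004 figure can also be checked by brute force: repeatedly draw samples of N = 35 from a normal population with μ = 23, σ = 9, and count how often Ȳ ≥ 27. A minimal simulation sketch:

```python
import random
import statistics

random.seed(0)  # for reproducibility
mu, sigma, n, trials = 23, 9, 35, 50_000

hits = 0
for _ in range(trials):
    y_bar = statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    if y_bar >= 27:
        hits += 1

# The estimate should land close to the theoretical .004
print(f"Estimated P(Y-bar >= 27) = {hits / trials:.4f}")
```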
Conclusions About Means
• The previous example started out with the
assumption that μ = 23
– Typically, μ will be unknown; only Y-bar is known
– But, the same logic can be applied to “test” whether μ
is likely to equal 23
• If observed Y-bar is highly unlikely, we cast doubt on the
idea that μ is really 23
– Example: We can “test” whether a school’s math
scores are above national standard of 47
• If school sample is far above national average, it is
improbable that the school population is at or below 47
• Next Class: Hypothesis testing!