Download Confidence Intervals - McMaster University, Canada

Sociology 6Z03
Topic 13: Confidence Intervals
John Fox
McMaster University
Fall 2016
John Fox (McMaster University)
Soc 6Z03:Confidence Intervals
Fall 2016
1 / 30
Outline: Confidence Intervals
Introduction
Confidence Interval for the Population Mean
Varying the Level of Confidence
Choosing the Sample Size, n
Cautions Concerning Confidence Intervals
Introduction
From Population to Samples
Implicitly using results derived from probability theory, we have, thus far, been reasoning
deductively from characteristics of a known population to characteristics of samples
drawn at random from that population.
Introduction
From Population to Samples
Thought Question
Suppose that the population of single-parent families in Canada has an average annual
income of µ = $35,000 and a standard deviation of σ = $10,000 (both made up). The
means x̄ from repeated samples of size n = 100 drawn randomly from this population:
A are approximately normally distributed with a mean of $35,000 and a standard deviation of
SD(x̄) = σ = $10,000.
B are approximately normally distributed with a mean of $35,000 and a standard deviation of
\[ \mathrm{SD}(\bar{x}) = \frac{\sigma}{\sqrt{n}} = \frac{10{,}000}{\sqrt{100}} = \$1{,}000 \]
C I don’t know.
Introduction
From Sample to Population
When we draw a sample in a real application of statistics, of course, we do not know the
characteristics of the population.
If the characteristics of the population were known, then there would be no point in sampling!
Moreover, in real applications, the researcher draws a single sample of size n, not
repeated samples.
If a researcher had the resources to draw 1,000 samples each of size n = 100, then he or she
would treat these data as a single larger sample of n = 1,000 × 100 = 100,000 cases.
Introduction
From Sample to Population: Statistical Inference
The central issue in statistical inference is to draw conclusions inductively about the
population on the basis of a single sample of size n drawn from it. There are two common
modes of statistical inference:
1. Estimation: We want, on the basis of our data, to derive a “best guess” of the value of a
population parameter, such as the population mean income µ.
Such a best guess is called a point estimate. We know, however, that point estimates
(which are sample statistics, like the sample mean x̄) vary from sample to sample.
It is generally desirable to reflect the uncertainty due to sampling variation in an interval
estimate, also called a confidence interval.
Typically, a confidence interval takes the form of a point estimate ± a margin of error.
Introduction
From Sample to Population: Statistical Inference
2. Hypothesis Tests: Sometimes we are interested in establishing whether or not a parameter
is equal to a specific value.
For example, we might want to learn whether or not the population mean income of men
and women is the same; that is, whether the difference in their mean income,
µ_Men − µ_Women, is zero.
A statistical hypothesis test, also called a “significance” test, tells us the degree to which
the data support the hypothesis that a parameter is equal to a particular value.
Confidence Interval for the Population Mean
From Population to Samples
We will begin by assuming, unrealistically, that we know the population mean µ and
standard deviation σ and, consequently, that we know the sampling distribution of sample
means for samples of size n; that is,
\[ \bar{x} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \]
In our example, where µ = 35,000, σ = 10,000, and n = 100, recall that
x̄ ∼ N(35,000; 1,000).
Using the 68–95–99.7 rule for the normal distribution, we know that about 95 percent of
sample means x̄ will be within two SD(x̄) of the population mean µ, as shown in the
graph on the next slide.
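The "about 95 percent" figure can be checked numerically. The following is a minimal sketch in Python (chosen here for illustration; the slides contain no code) that uses the standard library's `statistics.NormalDist`. The helper name `sd_of_mean` is hypothetical, and the sampling distribution is assumed to be exactly normal.

```python
from math import sqrt
from statistics import NormalDist


def sd_of_mean(sigma: float, n: int) -> float:
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)


# Made-up population values from the running example.
mu, sigma, n = 35_000, 10_000, 100
se = sd_of_mean(sigma, n)

# Assumption: the sampling distribution of the mean is exactly N(mu, se).
sampling_dist = NormalDist(mu, se)
share_within_2sd = sampling_dist.cdf(mu + 2 * se) - sampling_dist.cdf(mu - 2 * se)

print(se, round(share_within_2sd, 4))
```

Under these assumptions the share within two SD(x̄) comes out at roughly 0.954, which is what the 68–95–99.7 rule rounds to 95 percent.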
Confidence Interval for the Population Mean
From Population to Samples
[Figure: Sampling distribution of the sample mean x̄. 95 percent of samples fall between µ − 2 × SD(x̄) = 33,000 and µ + 2 × SD(x̄) = 37,000, centred on µ = 35,000, with 2.5 percent of samples in each tail.]
Confidence Interval for the Population Mean
From Population to Samples
Thought Question
(A) True or (B) False? For this example:
95 percent of sample means are in the interval
µ ± 2 × SD(x̄) = 35,000 ± 2 × 1,000
= 35,000 ± 2,000
2.5 percent of sample means are below
µ − 2 × SD(x̄) = 35,000 − 2 × 1,000 = 33,000
and 2.5 percent of sample means are above
µ + 2 × SD(x̄) = 35,000 + 2 × 1,000 = 37,000
Confidence Interval for the Population Mean
Reversing the Interval
Put another way: In 95 percent of samples, the population mean is within two SD(x̄) of
the sample mean x̄.
If, therefore, we construct an interval of width ±2 × SD(x̄) around the sample mean,
x̄ ± 2 × SD(x̄) = x̄ ± 2 × 1,000
= x̄ ± 2,000
then this interval will include the population mean µ in 95 percent of repeated samples.
Confidence Interval for the Population Mean
For example, a researcher draws a sample of size n = 100 and calculates the sample mean
x̄ = 36,000.
The researcher does not know the population mean µ but, let us suppose, does
know that the population standard deviation is σ = 10,000. Then, he or she would
calculate the interval
x̄ ± 2 × SD(x̄) = x̄ ± 2,000
= 36,000 ± 2,000
Definition
This interval, which has the form
point estimate ± margin of error
is called a confidence interval for the unknown population mean µ.
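The "point estimate ± margin of error" form translates directly into a few lines of code. This is a minimal sketch, assuming σ is known as in the example; the function name `mean_ci` and its parameter names are hypothetical.

```python
from math import sqrt


def mean_ci(xbar: float, sigma: float, n: int, z_star: float = 2.0):
    """Return (lower, upper) for xbar ± z_star * sigma / sqrt(n), with sigma known."""
    margin = z_star * sigma / sqrt(n)
    return xbar - margin, xbar + margin


# The researcher's sample from the example: xbar = 36,000, sigma = 10,000, n = 100.
lower, upper = mean_ci(xbar=36_000, sigma=10_000, n=100)
print(lower, upper)
```

The multiplier defaults to 2 to match the 68–95–99.7 rule used so far; a different multiplier can be passed as `z_star`.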
Confidence Interval for the Population Mean
In this case, the population mean µ = 35,000 (which is known to us, but not to the
researcher) falls within the confidence interval:
[Figure: The confidence interval from x̄ − 2 × SD(x̄) = 34,000 to x̄ + 2 × SD(x̄) = 38,000, centred on x̄ = 36,000, with the true µ = 35,000 marked inside it.]
These conditions are unrealistic: In real applications, when µ is unknown, then so is σ.
But we want to keep things simple for now. Later on, we’ll learn how to handle the situation
where the population standard deviation σ is also unknown.
Confidence Interval for the Population Mean
Interpretation of Confidence Intervals: Behaviour with Repeated Sampling
[Figure: Top panel, known only to 'God': the sampling distribution of x̄, with 95% of samples between µ − 2 × SD(x̄) and µ + 2 × SD(x̄) and 2.5% of samples in each tail. Bottom panel: confidence intervals for repeated samples, of which 2.5% miss low and 2.5% miss high. The researcher has only one sample.]
Confidence Interval for the Population Mean
Interpretation of Confidence Intervals: Behaviour with Repeated Sampling
Thought Question
(A) True or (B) False?
In 95 percent of samples, the sample mean x̄ falls in the interval µ ± 2 × SD(x̄). When
this happens, the population mean µ also falls within the confidence interval
x̄ ± 2 × SD(x̄).
In 2.5 percent of samples, the sample mean x̄ exceeds µ + 2 × SD(x̄), and when this
happens, the population mean µ falls below the lower bound of the confidence interval.
In 2.5 percent of samples, the sample mean x̄ is below µ − 2 × SD(x̄), and therefore the
population mean µ is above the upper bound of the confidence interval.
Confidence Interval for the Population Mean
Interpretation of Confidence Intervals: Behaviour with Repeated Sampling
In summary, then, in 95 percent of samples, the confidence interval x̄ ± 2 × SD(x̄)
includes the unknown population mean µ, and in 5 percent of samples, the confidence
interval fails to include µ.
For this reason, x̄ ± 2 × SD(x̄) is called a 95 percent confidence interval.
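The repeated-sampling behaviour summarized here can be imitated with a small simulation. The sketch below is illustrative only: it assumes a normally distributed population (the slides do not say income is normal), uses a fixed random seed so the run is repeatable, and the function name `coverage` is hypothetical.

```python
import random
from math import sqrt
from statistics import fmean


def coverage(mu, sigma, n, reps, z_star=2.0, seed=0):
    """Share of simulated samples whose interval xbar ± z_star*sigma/sqrt(n) contains mu."""
    rng = random.Random(seed)
    margin = z_star * sigma / sqrt(n)
    hits = 0
    for _ in range(reps):
        # Assumption: a normal population, chosen only for convenience.
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / reps


print(coverage(mu=35_000, sigma=10_000, n=100, reps=2_000))
```

With a few thousand simulated samples the printed share should land close to 0.95, though it varies slightly with the seed and the number of repetitions.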
Confidence Interval for the Population Mean
Correct Interpretation of Confidence Intervals
Important Point
Here is the correct interpretation of the 95-percent confidence interval
x̄ ± 2 × SD(x̄) = 36,000 ± 2,000:
The researcher is 95 percent confident that the unknown population mean µ is somewhere
between $34,000 and $38,000, in the sense that he or she has employed a procedure that
produces the right answer 95 percent of the time with repeated sampling (and the wrong
answer 5 percent of the time). The researcher does not know whether or not µ is in the
interval that is constructed for this particular sample.
Confidence Interval for the Population Mean
Common Misinterpretations of Confidence Intervals
Misinterpretation: The probability is 95 percent that the unknown population mean µ is in the interval
x̄ ± 2 × SD(x̄) = 36,000 ± 2,000
Why it is wrong: The population mean µ is either in the interval (in which case the probability that it is in the
interval is 1), or it is not in the interval (in which case the probability that it is in the interval
is 0). In the example, we (because of our assumed omniscience) know that µ = 35,000 is in
the interval, but the researcher does not know this.
Misinterpretation: The probability is 95 percent that a family selected at random has an income in the
interval
x̄ ± 2 × SD(x̄) = 36,000 ± 2,000
Why it is wrong: The distribution of individual income scores x in the population has a mean of µ (not x̄), a
standard deviation of σ (not σ/√n), and is not necessarily a normal distribution.
Confidence Interval for the Population Mean
Common Misinterpretations of Confidence Intervals
Misinterpretation: The probability is 95 percent that the sample mean x̄ is in the interval
x̄ ± 2 × SD(x̄) = 36,000 ± 2,000
Why it is wrong: This is just nonsense: The mean x̄ = 36,000 for this particular sample is certainly in the
interval, since the interval is centred around it.
Misinterpretation: When we sample repeatedly, 95 percent of sample means x̄ are in the interval
x̄ ± 2 × SD(x̄) = 36,000 ± 2,000
Why it is wrong: The sampling distribution of x̄ is centred on the population mean µ = 35,000, not on the
sample mean x̄ = 36,000 for a particular sample. Thus 95 percent of sample means are in
the interval
µ ± 2 × SD(x̄) = 35,000 ± 2,000
Varying the Level of Confidence
So far, we have constructed a confidence interval at the 95 percent or .95 confidence level.
This is the confidence level that is most commonly used.
Other common levels are 90 percent (or .9) and 99 percent (or .99).
Varying the Level of Confidence
More generally, to construct a confidence interval for the mean at the confidence level C,
we take
\[ \bar{x} \pm z^{*} \times \mathrm{SD}(\bar{x}) = \bar{x} \pm z^{*} \frac{\sigma}{\sqrt{n}} \]
where the area between −z* and z* under the standard normal distribution is C:
[Figure: Standard normal density, with probability C between −z* and z* and probability (1 − C)/2 in each tail.]
Notice that the area in each “tail” of the distribution is (1 − C)/2.
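Because each tail holds (1 − C)/2 of the area, z* is the standard normal quantile at 1 − (1 − C)/2. A minimal sketch, using `statistics.NormalDist.inv_cdf` from the Python standard library; the function name `z_star` is hypothetical.

```python
from statistics import NormalDist


def z_star(confidence: float) -> float:
    """Critical value with area `confidence` between -z* and z* under N(0, 1)."""
    tail = (1 - confidence) / 2           # area in each tail
    return NormalDist().inv_cdf(1 - tail)


for c in (0.90, 0.95, 0.99):
    print(c, round(z_star(c), 3))
```

Rounded to three decimals, the three values are 1.645, 1.960, and 2.576.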
Varying the Level of Confidence
Critical Values of z
Here are the critical values z* corresponding to the common confidence levels:
Confidence level    One-tail area    z*
90%                 .05              1.645
95%                 .025             1.960 ≈ 2
99%                 .005             2.576
Varying the Level of Confidence
For the example, we get the following confidence intervals at the 90%, 95%, and 99%
levels of confidence:
90% CI:    36,000 ± 1.645 × 1,000 = 36,000 ± 1,645
95% CI:    36,000 ± 1.960 × 1,000 = 36,000 ± 1,960
99% CI:    36,000 ± 2.576 × 1,000 = 36,000 ± 2,576
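The three intervals can be reproduced in a short loop. This sketch takes the critical values 1.645, 1.960, and 2.576 as given and reuses the example's x̄ = 36,000, σ = 10,000, and n = 100; the variable names are illustrative.

```python
from math import sqrt

xbar, sigma, n = 36_000, 10_000, 100
z_values = {"90%": 1.645, "95%": 1.960, "99%": 2.576}  # critical values taken as given

intervals = {}
for level, z in z_values.items():
    margin = z * sigma / sqrt(n)
    intervals[level] = (xbar - margin, xbar + margin)
    print(f"{level} CI: {xbar:,} ± {margin:,.0f}")
```

Comparing the stored endpoints shows the interval widening as the confidence level rises.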
Varying the Level of Confidence
Thought Question
We would like the confidence interval to be as narrow as possible; that is, we want a
small margin of error. This example illustrates an important characteristic of
confidence intervals:
A If we want greater confidence that the parameter is included in the interval, then we need
to construct a wider interval.
B If we want greater confidence that the parameter is included in the interval, then we need
to construct a narrower interval.
C The width of the confidence interval is unrelated to the level of confidence.
D I don’t know.
Varying the Level of Confidence
Factors Affecting the Margin of Error
There are three factors that affect the margin of error
\[ z^{*} \frac{\sigma}{\sqrt{n}} \]
1. To make the confidence level larger, z* must get larger. This produces a larger margin of error.
2. The more variable the scores are in the population (that is, the larger the value of σ), the
larger the margin of error.
It is easier to estimate µ precisely in a homogeneous population than in a heterogeneous one.
Because the population standard deviation is not under our control, we cannot achieve
greater precision by making it smaller.
3. Because √n is in the denominator of the margin of error, we get greater precision from
large samples than from small ones.
Varying the Level of Confidence
Factors Affecting the Margin of Error
Thought Question
To cut the margin of error in half (i.e., to make the width of the confidence interval
half as large), we have to:
A double the sample size n.
B halve the sample size n.
C make the sample size n four times as large.
D I don’t know.
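The effect of each factor on the margin of error can be explored numerically. The sketch below varies z*, σ, and n one at a time around the example's values; the function name `margin_of_error` is hypothetical, and σ = 20,000 is an arbitrary comparison value, not one from the slides.

```python
from math import sqrt


def margin_of_error(z_star: float, sigma: float, n: int) -> float:
    """Margin of error z* * sigma / sqrt(n) for a mean with known sigma."""
    return z_star * sigma / sqrt(n)


base = margin_of_error(1.960, 10_000, 100)
print("baseline:           ", base)
print("higher confidence:  ", margin_of_error(2.576, 10_000, 100))
print("more variable pop.: ", margin_of_error(1.960, 20_000, 100))  # arbitrary sigma
print("n four times larger:", margin_of_error(1.960, 10_000, 400))
```

Because n enters through its square root, changing the sample size moves the margin more slowly than changing z* or σ by the same factor.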
Choosing the Sample Size, n
Say that we desire a particular margin of error m. We’ve decided to use confidence level
C, corresponding to the standard-normal value z*, and we know that the population
standard deviation is σ.
We want to figure out the sample size n that is required to achieve this margin of error.
The margin of error is
\[ m = z^{*} \frac{\sigma}{\sqrt{n}} \]
Solving for n produces
\[ n = \left( \frac{z^{*} \sigma}{m} \right)^{2} \]
If the computed value of n is not a whole number, then round up to the next whole number.
Choosing the Sample Size, n
Example
To illustrate, suppose that we want to construct a C = 95 percent confidence interval for
the mean income µ in a population in which the standard deviation of income is
σ = $10,000 (as in our previous example).
We want our confidence interval to have a margin of error of m = $200.
Then, the required sample size is
\[ n = \left( \frac{1.960 \times 10{,}000}{200} \right)^{2} = 98^{2} = 9{,}604 \]
or nearly 10,000 families.
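The sample-size calculation, including the round-up rule, can be sketched as follows. One design choice is an assumption of this sketch rather than anything in the slides: z* is passed as a decimal string and the arithmetic is done with exact fractions, so that floating-point error cannot push a whole-number result such as 9,604 up to the next integer. The function name `required_n` is hypothetical.

```python
from fractions import Fraction
from math import ceil


def required_n(z_star: str, sigma: int, m: int) -> int:
    """Smallest whole n whose margin z* * sigma / sqrt(n) is at most m."""
    exact = (Fraction(z_star) * sigma / m) ** 2   # exact rational arithmetic
    return ceil(exact)                            # round up to a whole number


print(required_n("1.960", 10_000, 200))
```

Calling `required_n("1.960", 10_000, 300)` gives a case where the exact value is not a whole number and the rounding up matters.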
Cautions Concerning Confidence Intervals
For the formula for the confidence interval
\[ \bar{x} \pm z^{*} \frac{\sigma}{\sqrt{n}} \]
to be accurate, the data must be a simple random sample (SRS) from a large population.
The formula is not correct for complex probability sampling designs such as stratified
samples.
There is no correct method for constructing confidence intervals for haphazardly
(nonrandomly) selected data.
Because the sample mean x̄ can be strongly affected by outliers, so can the confidence
interval.
If the population is not normal, and the sample size is very small, then the formula may
not be accurate.
Cautions Concerning Confidence Intervals
Even if the formula is accurate, it might not be sensible to use the mean as a summary —
for example, when the population distribution of x is very skewed.
To use the formula, you need to know the population standard deviation σ.
In a large sample you can safely substitute the sample standard deviation s to get an
approximate confidence interval
\[ \bar{x} \pm z^{*} \frac{s}{\sqrt{n}} \]
We’ll learn later what to do when the sample size is small.
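For the large-sample case, the substitution of s for σ might look like the sketch below. The data are synthetic, generated from a normal distribution purely so the function has something to run on; they are not survey results. The function name `approx_mean_ci` is hypothetical.

```python
import random
from math import sqrt
from statistics import fmean, stdev


def approx_mean_ci(data, z_star=1.960):
    """Large-sample interval xbar ± z* * s / sqrt(n), with s the sample SD."""
    n = len(data)
    xbar = fmean(data)
    s = stdev(data)                      # sample standard deviation (n - 1 divisor)
    margin = z_star * s / sqrt(n)
    return xbar - margin, xbar + margin


# Synthetic incomes (assumption: normal with mean 35,000 and SD 10,000).
rng = random.Random(1)
incomes = [rng.gauss(35_000, 10_000) for _ in range(1_000)]
print(approx_mean_ci(incomes))
```

This sketch is meant only for large n; it does not address the small-sample case.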
The margin of error covers only random sampling errors.
Other sources of error, such as undercoverage and nonresponse in surveys, are not included
in the margin of error.