Sampling (cont’d) and
Confidence Intervals
11.220
Lecture 9
8 March 2006
R. Ryznar
Census Surveys
• Decennial Census
– Every (over 11 million) household gets the short form
and 17% or 1/6 get the long form
– Miss approximately .12% of population overall
(including about 2.78% of black population)
– Why do it?
• Current Population Survey
– 60,000 households interviewed every month
• The American Community Survey
– Contacts 3 million households (including some from
every county) and will replace the long form in 2010
Gallup Polls
Many do not believe a survey of 1500-2000
respondents can represent the views of all
Americans.
Estimation
• Parameter
– A number that describes the population.
– We don’t know its value.
• Statistic
– A number that describes a sample. It can
change from sample to sample.
– If we take lots of samples the statistic follows
a predictable pattern.
Sampling Variability
Law of Large Numbers
As the number of trials increases the
average outcome approaches the mean of
the population (i.e., the expected outcome)
and the standard deviation of the average
outcome approaches zero.
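A small simulation can make this concrete. The sketch below is not part of the lecture: it assumes a fair six-sided die as the population (expected value 3.5), and the names `mean_of_rolls` and `results` are hypothetical. At each sample size it repeats the experiment 200 times and records the mean and standard deviation of the resulting averages.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def mean_of_rolls(n):
    """Average of n simulated rolls of a fair six-sided die (expected value 3.5)."""
    return statistics.fmean(random.randint(1, 6) for _ in range(n))


# Maps sample size -> (mean of the averages, standard deviation of the averages).
results = {}
for n in (10, 100, 2500):
    averages = [mean_of_rolls(n) for _ in range(200)]
    results[n] = (statistics.fmean(averages), statistics.stdev(averages))
    print(f"n = {n:>4}: mean of averages = {results[n][0]:.3f}, "
          f"sd of averages = {results[n][1]:.3f}")
```

As n grows, the averages should cluster more tightly around 3.5, with their spread shrinking roughly in proportion to 1/√n.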
• To reduce bias
– take a SRS.
• To reduce variability
– take larger samples.
• The “margin of error” is about sampling
variability.
We say, “The president’s approval rating is 40%,
plus or minus 3 percentage points. We are 95%
confident that the true population
proportion is between 37% and 43%.”
Central Limit Theorem
The distribution of an average tends to be
Normal, even when the distribution from
which the average is computed is
decidedly non-Normal.
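One way to see this is to average draws from a clearly skewed population and check whether the averages behave like a Normal variable. The sketch below is an illustration, not the lecture's own material: the exponential population, the sample size of 50, and the 2,000 replications are all assumptions chosen for the example.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50      # observations averaged in each sample (assumed)
REPLICATIONS = 2000   # number of simulated samples (assumed)

# Exponential population with rate 1: mean 1, standard deviation 1, strongly skewed.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(SAMPLE_SIZE))
    for _ in range(REPLICATIONS)
]

# Standard deviation of an average of SAMPLE_SIZE draws: sigma / sqrt(n).
standard_error = 1.0 / math.sqrt(SAMPLE_SIZE)

# For a Normal distribution, about 95% of values lie within 2 standard deviations.
within_two = sum(
    abs(m - 1.0) <= 2 * standard_error for m in sample_means
) / REPLICATIONS
print(f"share of sample means within 2 standard errors: {within_two:.3f}")
```

If the averages are close to Normal, the printed share should be near 95% even though individual draws are far from Normal.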
Quick method: the margin of error for a 95% confidence interval around
a sample proportion is approximately
$1/\sqrt{n}$
Margin of Error and Sample Size
$1/\sqrt{n} = 1/\sqrt{1600} = 1/40 = 0.025$, or 2.5%
$1/\sqrt{n} = 1/\sqrt{2527} = 1/50.27 \approx 0.020$, or 2.0%
$1/\sqrt{n} = 1/\sqrt{100} = 1/10 = 0.1$, or 10%
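The quick rule is easy to tabulate. The following is a minimal sketch of the 1/√n approximation for the three sample sizes used above; the function name `quick_margin_of_error` is a hypothetical label for illustration.

```python
import math


def quick_margin_of_error(n):
    """Approximate 95% margin of error for a sample proportion: 1 / sqrt(n)."""
    return 1 / math.sqrt(n)


for n in (100, 1600, 2527):
    print(f"n = {n:>4}: margin of error is about {quick_margin_of_error(n):.1%}")
```

Because the margin shrinks with the square root of n, cutting it in half takes roughly four times as many respondents.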
The size of the population has little influence on the behavior of
statistics from random samples. The population size does not
matter as long as it is at least 100 times larger than the sample.
Estimating a Population Proportion
We take a survey (SRS) to estimate the
percentage of overweight children aged 6 –
11 years in the general population.
$\hat{p} = \dfrac{\text{count of successes in the sample}}{n} = \dfrac{408}{2673} = 15.3\%$
Sampling Distribution of a Sample
Proportion
• If the sample is large enough, the sampling
distribution of p̂ is approximately normal.
• The mean of the sampling distribution is p.
• The standard deviation of the sampling
distribution is
$\sqrt{\dfrac{p(1-p)}{n}}$
The standard deviation from our sample is:
$\sqrt{\dfrac{p(1-p)}{n}} \approx \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\dfrac{.153(.847)}{2673}} = .006963$
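The same arithmetic can be checked in a few lines. This sketch uses the counts from the survey example (408 of 2,673); the variable names are illustrative. It carries the unrounded proportion through, so the standard error may differ from the slide's value in the last digit or two.

```python
import math

successes = 408   # overweight children counted in the sample (from the example)
n = 2673          # sample size (from the example)

# Sample proportion and its estimated standard error, sqrt(p_hat * (1 - p_hat) / n).
p_hat = successes / n
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)

print(f"p_hat = {p_hat:.3f}")
print(f"standard error = {standard_error:.5f}")
```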
The 95% Confidence Interval around our estimate is:
$\hat{p} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
$.153 \pm 1.96(.00696)$
$.153 \pm .0136$
(13.9%, 16.7%)
Rounding $z_{\alpha/2}$ from 1.96 to 2 gives nearly the same 95% Confidence Interval:
$\hat{p} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
$.153 \pm 2(.00696)$
$.153 \pm .0139$
(13.9%, 16.7%)
What if you wanted a 99% Confidence Interval around our
estimate?
$\hat{p} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
$.153 \pm 2.58(.00696)$
$.153 \pm .018$
(13.5%, 17.1%)
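The interval calculation can be wrapped in one function so that only the confidence level changes. This is a sketch rather than the lecture's own method of computation: `proportion_ci` is a hypothetical helper, and it takes the critical value from the standard library's `statistics.NormalDist` instead of a printed z table.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    # Two-sided critical value: about 1.96 for 95%, about 2.58 for 99%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin


ci_95 = proportion_ci(408, 2673, confidence=0.95)
ci_99 = proportion_ci(408, 2673, confidence=0.99)
print(f"95% CI: {ci_95[0]:.1%} to {ci_95[1]:.1%}")
print(f"99% CI: {ci_99[0]:.1%} to {ci_99[1]:.1%}")
```

Because the function uses the unrounded proportion 408/2673 rather than .153, its endpoints can differ from the hand calculation by about a tenth of a percentage point.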
Sampling distribution of a sample
mean
Choose an SRS of size n from a population in which
individuals have mean µ and standard deviation σ.
Let $\bar{x}$ be the mean of the sample. Then:
• The sampling distribution of $\bar{x}$ is approximately
normal when the sample size n is large.
• The mean of the sampling distribution is equal to µ.
• The standard deviation (standard error of the
estimate) of the sampling distribution is
$\sigma_{\bar{x}} = \text{s.e.} = \sigma/\sqrt{n}$
Confidence Interval for a
Population Mean (µ)
When n is large (>30) the sample standard deviation
s is close to σ and can be used to estimate it.
Confidence Interval for a population mean:
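$\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}}$
$\bar{x} \pm z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$
$\bar{x} \pm z_{\alpha/2}\,\dfrac{s}{\sqrt{n}}$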
Suppose a program director wants to estimate the average
length of time (in months) clients remain in a rehab clinic
program.
She takes a random sample of 100 clients’ records and uses
the sample’s mean, $\bar{x}$, to estimate µ, the population mean.
We start by calculating the mean and the sample standard
deviation. Assume that:
$\sum x = 465$
$\sum (x - \bar{x})^2 = 2{,}387$
Then,
$\bar{x} = \dfrac{\sum x}{n} = \dfrac{465}{100} = 4.65$
$s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1} = \dfrac{2{,}387}{99} = 24.11$
and $s = 4.9$
Since we have a large sample (n=100) we can
substitute s for σ. A 95% confidence interval for the
mean number of months spent in the program is
$\bar{x} \pm 2\dfrac{s}{\sqrt{100}} = 4.65 \pm 2\left(\dfrac{4.9}{10}\right) = 4.65 \pm .98$
Confidence Interval = (3.67, 5.63)
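The rehab clinic calculation can be reproduced from the two sums alone. The sketch below uses the figures given in the example (n = 100, Σx = 465, Σ(x − x̄)² = 2,387) and the rounded critical value of 2; the variable names are illustrative.

```python
import math

n = 100            # number of client records sampled
sum_x = 465        # sum of the observed lengths of stay, in months
sum_sq_dev = 2387  # sum of squared deviations from the sample mean

mean = sum_x / n
variance = sum_sq_dev / (n - 1)    # sample variance uses n - 1
s = math.sqrt(variance)

# Large-sample 95% interval with z rounded to 2, substituting s for sigma.
margin = 2 * s / math.sqrt(n)
ci = (mean - margin, mean + margin)

print(f"mean = {mean:.2f}, s^2 = {variance:.2f}, s = {s:.2f}")
print(f"95% CI: {ci[0]:.2f} to {ci[1]:.2f}")
```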
Small sample estimates of µ
$\bar{x} \pm t_{\alpha/2}\left(\dfrac{s}{\sqrt{n}}\right)$
where $t_{\alpha/2}$ is based on (n – 1) degrees of
freedom.
• Assumption: A random sample is selected
from a population with a relative frequency
distribution that is approximately normal.
Food prices have been going up rapidly. To periodically assess the
increase in prices you purchase the same items at twenty different
grocery stores. The mean and standard deviation of the costs at the
twenty supermarkets are:
x̄ = $26.84 and s = $2.63
If we assume that the distribution of costs for the grocery basket at
all supermarkets is approximately normal, we can use the t-statistic
to form the confidence interval. For a confidence level of 95%, we
need the tabulated value of t with df = n – 1 = 19.
From the t table we see that $t_{\alpha/2} = t_{.025} = 2.093$
$\bar{x} \pm t_{.025}\left(\dfrac{s}{\sqrt{n}}\right) = 26.84 \pm 2.093\left(\dfrac{2.63}{\sqrt{20}}\right) = 26.84 \pm 1.23 = (25.61, 28.07)$
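The grocery basket interval can be checked the same way. In this sketch the critical value 2.093 is entered by hand from the t table, as in the example, rather than computed; `t_interval` and `T_025_DF19` are hypothetical names.

```python
import math


def t_interval(mean, s, n, t_crit):
    """Small-sample confidence interval: mean +/- t_crit * s / sqrt(n)."""
    margin = t_crit * s / math.sqrt(n)
    return mean - margin, mean + margin


T_025_DF19 = 2.093  # tabulated t value for 95% confidence, 19 degrees of freedom

low, high = t_interval(mean=26.84, s=2.63, n=20, t_crit=T_025_DF19)
print(f"95% CI: ${low:.2f} to ${high:.2f}")
```

Passing the critical value in as an argument keeps the function usable for other sample sizes, where a different row of the t table applies.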
Thus, we are reasonably confident (95%) that the interval from $25.61 to
$28.07 contains the true mean cost µ of the grocery basket. This is
because if we were to employ our interval estimator on repeated
occasions, 95% of the intervals constructed would contain µ.
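The repeated-sampling interpretation can be illustrated by simulation. The sketch below is an assumption-laden illustration, not part of the lecture: it pretends the population of basket costs is Normal with µ = 26.84 and σ = 2.63 (values borrowed from the sample purely for convenience), draws 4,000 samples of 20 stores, and counts how often the t interval contains µ.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

MU, SIGMA = 26.84, 2.63  # hypothetical population values, assumed for illustration
N = 20                   # stores per sample
T_CRIT = 2.093           # tabulated t value for 95% confidence, 19 degrees of freedom
TRIALS = 4000            # number of simulated samples (assumed)

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)          # sample standard deviation (n - 1)
    margin = T_CRIT * s / math.sqrt(N)
    if x_bar - margin <= MU <= x_bar + margin:
        hits += 1

coverage = hits / TRIALS
print(f"share of intervals containing mu: {coverage:.3f}")
```

Each simulated interval is different, but the share that captures µ should settle near 95%.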
Determining Sample Size
How can the appropriate sample size be determined? First
determine how reliable you want the estimate to be.
Example:
Consider the rehab program example where we
estimated the mean length of time clients stayed in the
program. A sample of 100 clients’ records produced an
estimate, $\bar{x}$, that was within .98 month of µ with
probability equal to .95. What if we wanted to estimate
the true mean to within .5 month with a probability equal
to .95? How large a sample would be required?
For the sample size n = 100, we found an approximate
95% confidence interval to be
$\bar{x} \pm 2\sigma_{\bar{x}} \approx 4.65 \pm .98$
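If we now want our estimator to be within .5 month of µ, we
must have
$2\sigma_{\bar{x}} = .5$ or $\dfrac{2\sigma}{\sqrt{n}} = .5$
Substituting s = 4.9 for σ:
$\dfrac{2(4.9)}{\sqrt{n}} = .5$
$\sqrt{n} = \dfrac{2(4.9)}{.5} = 19.6$
$n = 19.6^2 = 384.16 \approx 384$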
You would have to sample approximately 384 clients’
records in order to estimate the mean length of stay in
the program, µ, to within .5 month with probability equal
to .95.
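Solving for n generalizes to any target margin. The sketch below is a hypothetical helper, `required_sample_size`, that rearranges z·s/√n = margin; it defaults to the rounded critical value z = 2 used in the example. The example rounds 384.16 to 384, while rounding up to the next whole record, as `math.ceil` does here, is a slightly more conservative choice because it keeps the margin at or below the target.

```python
import math


def required_sample_size(s, margin, z=2.0):
    """Solve z * s / sqrt(n) = margin for n (returned unrounded)."""
    return (z * s / margin) ** 2


n_exact = required_sample_size(s=4.9, margin=0.5)
n_rounded_up = math.ceil(n_exact)

print(f"unrounded n = {n_exact:.2f}")
print(f"rounded up  = {n_rounded_up}")
```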
Understanding Degrees of
Freedom
Statisticians use the term "degrees of freedom" to describe the
number of values in the final calculation of a statistic that are free to
vary. Consider, for example, the statistic $s^2$.
To calculate the $s^2$ of a random sample, we must first calculate the
mean of that sample and then compute the sum of the
squared deviations from that mean. While there will be n such
squared deviations, only (n - 1) of them are, in fact, free to assume
any value whatsoever. This is because the final squared deviation
from the mean must include the one value of X such that the sum of
all the Xs divided by n will equal the obtained mean of the sample.
All of the other (n - 1) squared deviations from the mean can,
theoretically, have any values whatsoever. For these reasons, the
statistic $s^2$ is said to have only (n - 1) degrees of freedom.
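A tiny numeric example shows the constraint at work. The four data values below are made up for illustration. Because deviations from the sample mean always sum to zero, the last deviation can be recovered from the other n − 1.

```python
import statistics

sample = [3.0, 7.0, 4.0, 10.0]   # made-up data for illustration
n = len(sample)
mean = sum(sample) / n           # 6.0

deviations = [x - mean for x in sample]   # -3, 1, -2, 4

# The deviations sum to zero, so the last one is fixed by the first n - 1.
implied_last = -sum(deviations[:-1])
print(f"last deviation: {deviations[-1]}, implied by the others: {implied_last}")

# s^2 divides the sum of squared deviations by n - 1, the degrees of freedom.
s_squared = sum(d ** 2 for d in deviations) / (n - 1)
print(f"s^2 = {s_squared}, statistics.variance = {statistics.variance(sample)}")
```

Here the first three deviations sum to −4, so the fourth must be +4; only three of the four are free to vary.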