Sampling Distribution Models – Chapter 18
How much faith should we put in our sample statistics?
From two lectures ago…
- Suppose all of us took a road trip to Las Vegas and I gave each of you $100 in $1 chips to play roulette.
- Suppose each one of us bet Red 100 times.
- Would each one of us walk away with a mean of $94.737 at the end of the betting? And what about that standard deviation of $9.9861?
- Think about that question, read pp. 436-439 of Chapter 17 (this is all we need), try Lab 4 (your TAs wrote it), play some online roulette.
Computer Simulation: 100 $1 bets made by 100 different people on Roulette
[Histogram of $ after 100 $1 bets, density scale]
- The mean was $95.90
- The median was $96
- The minimum was $68
- The maximum was $120
- Q3 was 102
- Q1 was 90
- The SD was $9.494
- The IQR was $12
Computer Simulation: 100 $1 bets made by 1000 different people playing Roulette
[Histogram of $ after 100 $1 bets, density scale]
- The mean was $94.70
- The median was $94
- The minimum was $68
- The maximum was $126
- Q3 was 102
- Q1 was 88
- The SD was $9.811
- The IQR was $14
Computer Simulation: 100 $1 bets made by 5000 different people playing Roulette
[Histogram of $ after 100 $1 bets, density scale]
- The mean was $94.71
- The median was $94
- The minimum was $58
- The maximum was $132
- Q3 was 102
- Q1 was 88
- The SD was $9.983
- The IQR was $14
Compare the 100 bets results

                                                            Mean         SD
Theoretical (from calculations) for 100 bets of $1          $94.74 (µ)   $9.99 (σ)
Empirical (observed) for 100 bets made by 100 people        $95.90       $9.49
Empirical (observed) for 100 bets made by 1000 people       $94.70       $9.81
Empirical (observed) for 100 bets made by 5000 people       $94.71       $9.98
  (this is very close to the theoretical values)
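These empirical results came from a computer simulation. A minimal sketch of such a simulation is below, assuming American roulette (a $1 bet on Red wins $1 with probability 18/38 and loses $1 otherwise); the function name and seed are illustrative, not taken from Lab 4.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, for reproducibility only

def final_bankroll(n_bets=100, start=100, p_win=18/38):
    """One player bets $1 on Red n_bets times, starting with $100."""
    outcomes = rng.choice([1, -1], size=n_bets, p=[p_win, 1 - p_win])
    return start + outcomes.sum()

for n_players in (100, 1000, 5000):
    results = np.array([final_bankroll() for _ in range(n_players)])
    print(n_players, "players: mean =", round(results.mean(), 2),
          " SD =", round(results.std(ddof=1), 2))

# Theoretical values for comparison:
#   mean = 100 - 100*(2/38) ≈ $94.74
#   SD   = sqrt(100*(1 - (2/38)**2)) ≈ $9.99
```

Each run gives slightly different empirical means and SDs, which is exactly the sample-to-sample variability that Chapter 18 formalizes.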
From last time, this situation:
- 1000 likely voters are surveyed about their vote on November 4th.
- What is the chance that 530 of them were women (suppose we were expecting 500 because it's typically 50% female, 50% male)?
- When dealing with a large number of trials in a Binomial situation, making direct calculations of the probabilities becomes tedious (or outright impossible).
- Fortunately, the Normal model comes to the rescue…
The Normal Model to the Rescue
- As long as the Success/Failure Condition holds (the basic unit is binomial – either a voter is a man or is a woman), we can use the Normal model to approximate Binomial probabilities.
  - Success/Failure Condition: a Binomial model is approximately Normal if we expect at least 10 successes and 10 failures: np ≥ 10 and nq ≥ 10.
Continuous Random Variables
- When we use the Normal model to approximate the Binomial model, we are using a continuous random variable (the Normal) to approximate a discrete random variable (the Binomial).
- So, when we use the Normal model, we no longer calculate the probability that the random variable equals an exact value (e.g. exactly 530 female voters – the Binomial can do it, but it is computationally intensive), but only that it lies between two values (e.g. at least 530 female voters – 530 to 1000).
So theory tells us to expect 500 but we get 530? Is something wrong?
- The underlying population (think infinitely large and not countable) here is voters, and we are categorizing them by gender (assume 2 groups).
- Theory tells us that in a sample of 1000 people we should get 500 males and 500 females.
- Suppose in our one sample we get 530 females (think about this in the context of the roulette games in the first slides).
- Instead of calculating the exact probability with the Binomial formula of Chapter 17, we can approximate the probability with the Normal because we've met the conditions.
Normal Approximation (p. 439)
- We might restate the question: what was the probability (chance) of getting at least 530 women in a sample of 1000 people?
- Formally (note E(X) = µ = np for a Binomial and σ = √(npq)):

  P(x ≥ 530) = P( z ≥ (x − np)/√(npq) ) = P( z ≥ (530 − 1000·0.50)/√(1000·0.50·0.50) ) ≈ P(z ≥ 1.90)

- What's left is to find P(z ≥ 1.90), which is 1 − 0.9713 = 0.0287.
- So the probability is .0287, or about a 2.87% chance of getting at least 530 women in a sample of size 1000 if you were expecting 500 women.
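As a quick check (not part of the original slides), the same probability can be computed both exactly with the Binomial model and with the Normal approximation; this sketch assumes SciPy is available.

```python
from math import sqrt
from scipy.stats import binom, norm

n, p = 1000, 0.50
mu, sigma = n * p, sqrt(n * p * (1 - p))   # np = 500, sqrt(npq) ≈ 15.81

# Check the Success/Failure Condition first: np and nq should both be at least 10.
assert n * p >= 10 and n * (1 - p) >= 10

exact = binom.sf(529, n, p)           # exact Binomial P(X ≥ 530)
approx = norm.sf((530 - mu) / sigma)  # Normal approximation, z ≈ 1.90
print(exact, approx)                  # both come out to roughly 0.03
```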
We enter Chapter 18 now and the question changes a little bit
- Here is a situation: 667 likely North Carolina voters are surveyed on October 23rd-28th, 2008 about their vote on November 4th.
- Question: What is the chance that 52% (that's a sample percentage, so it's not 52 voters) of the 667 said they would vote for Obama when he actually got 50% of the vote?
- We need to review first.
Recall Chapter 12
We sample because it is often too expensive or simply impossible to measure the whole population. Election surveys are samples, but today, Nov 5th 2008, we know the parameter everyone tried to estimate. So let's see what we can learn from it.

[Diagram: a Sample is drawn from the Population; the Sample Statistic is used for Inference about the unknown Population Parameter]
Really Quick Review
- Chapter 12 – Sample Surveys
  - Parameter (population characteristics)
    - µ (mean)
    - p (proportion) and np (count, number, or sum from the roulette example)
  - Statistic (sample characteristics)
    - ȳ (sample mean)
    - p̂ (sample proportion) and np̂ (sample count or sum)
- Chapter 12 – Statistics will be different for each sample (think about 100 plays of Roulette in Vegas)
- Chapter 14 – Taking a sample from a population is a random phenomenon. That means:
  - The outcome (statistic) is unknown before the sampling occurs
  - BUT their long-term (think infinite) behavior is predictable
Under the right conditions this is the long-term behavior of sample statistics (p. 459, Ch. 18)
[Histogram of $ after 100 $1 bets, density scale]
- This particular normal distribution has a special name – it is a SAMPLING DISTRIBUTION, or a distribution of all possible samples of size n.
The Central Limit Theorem (CLT) (pp. 458-459) for a sample proportion
- If a random sample of n observations is selected from a population (any population which can be characterized by Bernoulli trials), and x "successes" are observed, then when n is sufficiently large, the sampling distribution of the sample proportion p̂ will be approximately a normal distribution. (See the previous slide – this is the theory which supports the sampling distribution.)
The Central Limit Theorem (CLT) (pp. 458-459) for a sample proportion
- When we select simple random samples of size n, the sample proportions p̂ ("p-hat") that we obtain will vary from sample to sample (like all of the polls we see before an election). We can model the distribution of these sample proportions with a probability model that is

  N( p, √(pq/n) )   AKA   N( p, √(p(1 − p)/n) )
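A small simulation (not from the slides; the seed and the number of repeated samples are arbitrary, and p and n are chosen to match the upcoming poll example) shows that simulated sample proportions do line up with the N(p, √(pq/n)) model:

```python
import numpy as np

rng = np.random.default_rng(seed=2)   # arbitrary seed for reproducibility
p, n, n_samples = 0.50, 667, 10_000   # illustrative values

# Draw many samples of size n and record each sample proportion p-hat.
p_hats = rng.binomial(n, p, size=n_samples) / n

print("simulated mean of p-hat:", p_hats.mean())        # close to p = 0.50
print("simulated SD of p-hat:  ", p_hats.std(ddof=1))   # close to sqrt(pq/n) ≈ 0.019
print("model SD sqrt(pq/n):    ", np.sqrt(p * (1 - p) / n))
```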
Things to note
- As the sample size n gets larger, σ, the standard deviation of the sampling distribution, gets smaller.
- The formula tells us that larger samples are more accurate: as n grows larger, the standard deviation grows smaller.
- The assumption being made here is that the sampled values (observations) must be independent of each other.
- The sample size n must be large enough.
Sampling Distribution for p̂, the sample percentage
- Shape: Normal distribution
- Two assumptions must hold in order for us to be able to use the normal distribution:
  - The sampled values must be independent of each other
  - The sample size, n, must be large enough
- Please note:
  - As the sample size n gets larger, σ, the standard deviation, gets smaller
  - The formula for the standard deviation tells us that larger samples are more accurate. As n grows larger, the standard deviation grows smaller.
Sampling Distribution for p̂, the sample percentage (cont'd)
- It is hard to check that independence and sample size hold, so typically we will settle for checking the following conditions (a quick sketch of such a check follows below):
  - 10% Condition – the sample size, n, is less than 10% of the population size
  - Success/Failure Condition – np > 10, n(1 − p) > 10
- These conditions seem to contradict one another, but in practice, they don't. Usually populations are much larger than samples (many more than 10 times).
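For concreteness, here is a minimal sketch of checking these two conditions in code; the function name and the population size used in the example are hypothetical, not values from the slides.

```python
def conditions_ok(n, p, population_size):
    """Rough check of the 10% and Success/Failure conditions for using the Normal model."""
    ten_percent = n < 0.10 * population_size
    success_failure = n * p > 10 and n * (1 - p) > 10
    return ten_percent and success_failure

# Hypothetical example: a poll of 667 voters, p = 0.50, population of several million.
print(conditions_ok(n=667, p=0.50, population_size=4_000_000))   # True
```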
So, back to North Carolina
- Let's assume that 50% is the true percentage that voted for Obama in North Carolina.
- How likely was it for CNN to have a poll of 667 voters in which 52% said they would vote for Obama?
- To answer this, we use the normal approximation to the binomial; this tells us what the sampling distribution looks like:

  N( .50, √((.50)(.50)/667) ) ≈ N(.50, .02)

- The sampling distribution is centered on .50 (the true population proportion, not a sample statistic) with an SD of .02.
- We note that the original poll gave a margin of error of 4%, or .04 – that is 2 SD (or 2 sigma, or 2 Z).
- Recall the 68-95-99.7 Rule? For a normal population, 68% of the observations should be within +/- 1 Z, 95% within +/- 2 Z.
Now we can answer the question (pp. 464-466)
- We need a Z score (!)
- We need to phrase the question so that we are working with a range (e.g. at least 52% of the vote) instead of an exact value, because the normal is continuous and mathematically we can't find the area under an exact value.
- Formally:

  P(p̂ ≥ .52) = P( Z ≥ (.52 − .50)/√(.50 × .50/667) ) ≈ P(Z ≥ +1.03) ≈ .1515
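The arithmetic on this slide can be reproduced directly; here is a sketch assuming SciPy, using the slide's values p = .50, n = 667, and observed p̂ = .52:

```python
from math import sqrt
from scipy.stats import norm

p, n, p_hat = 0.50, 667, 0.52

sd = sqrt(p * (1 - p) / n)   # ≈ 0.019, the SD of the sampling distribution
z = (p_hat - p) / sd         # ≈ 1.03
prob = norm.sf(z)            # P(Z ≥ 1.03) ≈ 0.15
print(sd, z, prob)
```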
Understanding the Answer
- First, the formula (note 1 − p = q in your text):

  Z = (p̂ − p) / √( p(1 − p)/n )

- Then the Z score. Since we wanted P(Z ≥ +1.03), there is a .1515 chance, or about a 15.2% chance, of getting a single survey of size 667 with 52% saying that they would vote for Obama when in reality, only 50% voted.
More interpretation
- While the chance is low, it is not zero; sampling error suggests that CNN could have done everything correctly and still been off. The chance of being as far off as 2% was about 15.2%.
- If CNN had been farther off, they might want to take a look at their survey and the sampling procedure to see if anything needed adjustment.
Chapter 18 so far
While samples are drawn from populations, it is helpful to understand that a sample is just one possible realization of all the possible samples in the sampling distribution.

[Diagram: Population → Samples → Sampling Distribution]