Download Slides for Session #12

Statistics for Social
and Behavioral Sciences
Session #12:
Sampling Distribution
Central Limit Theorem
(Agresti and Finlay, Chapter 9)
Prof. Amine Ouazad
Statistics Course Outline
PART I. INTRODUCTION AND RESEARCH DESIGN
Week 1
PART II. DESCRIBING DATA
Weeks 2-4
PART III. DRAWING CONCLUSIONS FROM DATA:
INFERENTIAL STATISTICS
Weeks 5-9
Firenze or Lebanese Express is coming up next session
PART IV. CORRELATION AND CAUSATION:
REGRESSION ANALYSIS
This is where we talk
about ZMapp and Ebola!
Weeks 10-14
Last Session
• A random variable is a variable whose value has not been
realized.
• The expectation of a random variable Y is:
$E(Y) = \sum_k y_k \, P(Y = y_k)$
Also, $E(X+Y) = E(X) + E(Y)$,
$E(cX) = c\,E(X)$, and $E(E(X|Z)) = E(X)$.
• Typically the probability distribution P is not known, but we
approximate it:
– using the distribution of past values of Y (example: earnings of
previous graduates);
– using polls, for example to ask individuals how they will vote.
• The normal distribution is a ubiquitous distribution that is
symmetric and bell-shaped. It is characterized by its mean $\mu$ and its
standard deviation $\sigma$.
• The standard normal distribution has mean 0 and standard
deviation 1.
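As a quick numerical check of the expectation formula in the recap above, here is a minimal Python sketch. The earnings distribution is hypothetical and chosen only for illustration; `expectation` is a helper name introduced here, not something from the course.

```python
def expectation(distribution):
    """Return E(Y) = sum over k of y_k * P(Y = y_k) for a discrete distribution.

    `distribution` maps each possible value y_k to its probability.
    """
    return sum(value * prob for value, prob in distribution.items())

# Hypothetical distribution of graduate earnings (in thousands), for illustration only.
earnings = {20: 0.2, 30: 0.5, 50: 0.3}

e_y = expectation(earnings)  # 20*0.2 + 30*0.5 + 50*0.3, about 34

# Check E(cY) = c E(Y) with c = 2 by rescaling every possible value.
c = 2
doubled = {c * value: prob for value, prob in earnings.items()}
e_cy = expectation(doubled)

print(f"E(Y)  = {e_y:.2f}")
print(f"E(2Y) = {e_cy:.2f}")
```

With this made-up distribution the second number is twice the first, as the rule E(cX) = c E(X) predicts.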
Outline
1. The standard normal distribution
Z-Score
2. Polls and normal distributions
Sampling distribution of a statistic
A simulation
Central Limit Theorem
Variance of a yes/no (dummy) variable
Next time:
Probability Distributions (continued)
Chapter 4 of A&F
Comparing test scores across colleges
“Early paleontology in Indianapolis”:
Test scores have a normal
distribution with mean 3
and standard deviation 4.
“Hip hop in the Middle East”:
Test scores have a normal
distribution with mean 4
and standard deviation 1.
• Problem: how do I compare Marina’s test score of 3.6 in the paleontology course
with a test score of 4.1 in “Hip hop in the Middle East”?
Z-score!
• Take a student’s paleontology test score at the
end of the semester. This is a random variable.
– Its probability distribution has a mean of $\mu = 3$ and
a standard deviation of $\sigma = 4$.
– Now consider the “z-scored” paleontology test
score:
$\text{z-scored paleontology} = \dfrac{\text{paleontology test score} - \mu}{\sigma}$
– The z-scored paleontology test score has a mean
of 0 and a standard deviation of 1.
Standard Normal Distribution
• Is simply the normal distribution with mean 0
and standard deviation 1.
• A z-score of 3 means that the student’s score is three
standard deviations (of the original test
scores) above the mean.
So who has a better grade, Marina
or Slavoj?
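One way to work through this comparison is the short Python sketch below. It assumes that Marina’s 3.6 is her paleontology score and that the 4.1 in “Hip hop in the Middle East” is Slavoj’s; the slides do not state the second point explicitly. `z_score` is a helper name introduced here.

```python
def z_score(score, mean, sd):
    """Number of standard deviations by which `score` lies above the mean."""
    return (score - mean) / sd

# Course parameters from the slides: (mean, standard deviation).
paleontology = (3.0, 4.0)
hip_hop = (4.0, 1.0)

# Assumption: Marina took paleontology (3.6) and Slavoj took hip hop (4.1).
z_marina = z_score(3.6, *paleontology)  # (3.6 - 3) / 4, about 0.15
z_slavoj = z_score(4.1, *hip_hop)       # (4.1 - 4) / 1, about 0.10

print(f"Marina: z = {z_marina:.2f}")
print(f"Slavoj: z = {z_slavoj:.2f}")
```

Under these assumptions Marina’s raw score is lower, but relative to her course’s distribution it sits slightly higher (about 0.15 versus 0.10 standard deviations above the mean).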
Who will win the mid term
elections in the US?
• Midterm elections are held two years after
the presidential elections in the United States.
• They take place in early November 2014.
• A question: what fraction of the voters will
vote for Cory Gardner in Colorado?
Mario Zapata Encinas
Real Clear Politics: Gardner vs. Udall
- Who will win?
It would be logical to think that Gardner will win because,
according to the polls, he has a higher percentage of votes
(without taking into account the margin of error of
these statistics).
- What is MoE?
MoE stands for Margin of Error, a statistic
expressing the amount of random sampling error in survey results.
- What is the likely distribution of the fraction of
voters who will vote for Gardner?
According to RCP, 46.6% of voters will choose Gardner
over Udall.
Colorado Senate: Gardner (R) vs Udall (D)
The average MoE is 3.68%
Thus, the fraction of Gardner voters is likely
between 43.42% and 50.78%
LIKELY WINNER: I think Gardner will win the election.
Nick Chaubey
10-28
Conducting a poll
• Goal: estimating the fraction of individuals who
intend to vote for a candidate.
• Thinking like a statistician:
1. Ask an empirical question.
2. Design the study: population? Sample? Sampling
method? Response, nonresponse bias?
3. Describe the data: what is the mean of the sample?
4. Make inferences: in plain English, can we predict which
candidate will win?
Polling company methodology
• Select a sample of individuals by either simple
random sampling, cluster sampling, or
stratified random sampling.
• Sample size N.
• Ask each individual i = 1, 2, …, N:
– “Which candidate do you intend to vote for?”
– Let $\mathrm{VoteGardner}_i = 1$ if individual i intends to vote
for Gardner, and 0 otherwise.
• Report the mean of the sample:
$\mathrm{Mean}(\mathrm{VoteGardner}) = \dfrac{1}{N} \sum_{i=1}^{N} \mathrm{VoteGardner}_i$
But there is sampling error!
Groundhog Day: In expectation
• Take the i-th individual that will be contacted by
Rasmussen.
• The probability of the event “the i-th individual declares
voting for Gardner” is
– the true fraction of individuals who intend to vote for
Gardner in the population of eligible voters (here, Colorado).
• Write $X_i = \mathrm{VoteGardner}_i$ for the random variable equal to:
• 1 if the i-th individual declares intending to vote for Gardner,
• and 0 otherwise.
• Then:
$E(X_i)$ = true fraction of individuals who will vote for Gardner.
We write $E(X_i) = p$.
Groundhog Day: In Expectation
• Now what is the expected value of the fraction
declaring they will vote for Gardner?
$\mathrm{Mean}(\mathrm{VoteGardner}) = \dfrac{1}{N} \sum_{i=1}^{N} \mathrm{VoteGardner}_i$
• It is:
$E(\mathrm{Mean}(\mathrm{VoteGardner})) = E\left( \dfrac{1}{N} \sum_{i=1}^{N} \mathrm{VoteGardner}_i \right)$
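• Now remember that E(X+Y) = E(X) + E(Y) and E(cX) = c E(X), so…
$E(\mathrm{Mean}(\mathrm{VoteGardner})) = \dfrac{1}{N} \sum_{i=1}^{N} E(\mathrm{VoteGardner}_i) = \dfrac{1}{N} \cdot N p = p$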
• The polling company will get the true fraction of
voters for Gardner… in expectation!
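The “Groundhog Day” idea, repeating the same poll many times, can be sketched as a small simulation. This is an illustration under stated assumptions, not the course’s own code: the true fraction `p_true = 0.47` is hypothetical, the sample size 966 is the Rasmussen figure mentioned in these slides, and `run_poll` is a name introduced here.

```python
import random

def run_poll(p, n, rng):
    """Simulate one poll: the fraction of n respondents who say they vote Gardner."""
    votes = [1 if rng.random() < p else 0 for _ in range(n)]
    return sum(votes) / n

rng = random.Random(12)  # fixed seed so the sketch is reproducible
p_true = 0.47            # hypothetical true fraction, for illustration only
n = 966                  # sample size of the Rasmussen poll mentioned in the slides

# Repeat the same poll 2000 times, each time with a fresh random sample.
polls = [run_poll(p_true, n, rng) for _ in range(2000)]
average_of_polls = sum(polls) / len(polls)

print(f"true fraction:           {p_true:.3f}")
print(f"average over 2000 polls: {average_of_polls:.3f}")
print(f"lowest and highest poll: {min(polls):.3f}, {max(polls):.3f}")
```

Individual polls scatter around the true fraction, while their average lands very close to it, which is what “in expectation” means here.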
Sampling distribution of a statistic
• But there is some chance that the mean will be
far off the true fraction… with what probability?
• A statistic is a random variable.
• Indeed the % of respondents who say they intend
to vote for Gardner depends on the sample that
was drawn.
• This is random as the sample was collected by
simple random sampling.
• The mean is a random variable:
$E(\mathrm{Mean}(\mathrm{VoteGardner})) = E\left( \dfrac{1}{N} \sum_{i=1}^{N} \mathrm{VoteGardner}_i \right)$
Central Limit Theorem
[Figure: sampling distribution of Mean(VoteGardner). Horizontal axis: reported fraction m. Vertical axis: Probability(Mean(VoteGardner) = m), that is, the probability that the reported fraction equals m (e.g. 30%). The curve is centered on the true fraction of voters for Gardner.]
• With probability 95%, the estimated fraction of voters for
Gardner will be within the true fraction ± 2 standard deviations of
the distribution.
• With some (low) probability, the polling company will give a
number far from the true fraction of voters for Gardner
(in either tail of the distribution).
• Central Limit Theorem: With a large sample size, the sampling
distribution of Mean(VoteGardner) is approximately normal, and the empirical
rule applies.
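The empirical-rule claim can be checked with a rough simulation: draw many polls and count how often the reported fraction falls within two standard deviations of the true fraction. The sketch below assumes a hypothetical true fraction of 0.47 and uses the standard error formula for a proportion, sqrt(p(1-p)/N), given in these slides; the seed and the number of repetitions are arbitrary choices.

```python
import math
import random

rng = random.Random(2014)  # fixed seed for reproducibility
p_true = 0.47              # hypothetical true fraction, for illustration only
n = 966                    # sample size of each simulated poll
n_polls = 2000             # number of repeated polls

# Standard deviation of the sampling distribution of the mean (the standard error).
standard_error = math.sqrt(p_true * (1 - p_true) / n)

def run_poll():
    """Fraction of n simulated respondents who declare voting for Gardner."""
    return sum(rng.random() < p_true for _ in range(n)) / n

polls = [run_poll() for _ in range(n_polls)]

# Share of polls whose reported fraction is within 2 standard errors of the truth.
within_two_se = sum(abs(m - p_true) <= 2 * standard_error for m in polls) / n_polls

print(f"standard error: {standard_error:.4f}")
print(f"share of polls within 2 standard errors: {within_two_se:.1%}")
```

The share printed should be close to 95%, in line with the empirical rule for a normal distribution.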
Central Limit Theorem
• The last remaining element is the standard deviation of the
sampling distribution.
• Denoting by $\sigma_X$ the standard deviation of X, the sampling
distribution of the mean of X has standard deviation:
$\dfrac{\sigma_X}{\sqrt{N}}$
• The standard deviation of the sampling distribution is called
the standard error. It is a measure of sampling error.
• Finally, what is $\sigma_X$?
• For a proportion, $\sigma_X = \sqrt{p(1-p)}$, where p is the true
value.
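For readers who want to see where the formula for a yes/no (dummy) variable comes from, here is a short derivation. It uses the standard identity that the variance equals $E(X^2) - E(X)^2$, which is not stated in these slides.

```latex
\begin{align*}
E(X)       &= 1 \cdot p + 0 \cdot (1 - p) = p, \\
E(X^2)     &= 1^2 \cdot p + 0^2 \cdot (1 - p) = p, \\
\sigma_X^2 &= E(X^2) - E(X)^2 = p - p^2 = p(1 - p), \\
\sigma_X   &= \sqrt{p(1 - p)}, \qquad
\text{standard error} = \frac{\sigma_X}{\sqrt{N}} = \sqrt{\frac{p(1 - p)}{N}}.
\end{align*}
```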
Good news
• There is some probability that the reported mean will be far above
or far below the true mean. But:
• With a large sample size, the probability that the reported mean
is farther than 2 standard errors from the true mean is about 5%.
• The most likely outcome is the true mean.
– The mode of the sampling distribution is the true mean.
• The expected value of the reported mean is the true
mean.
• The larger the sample size, the smaller the standard
error.
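To make the last point concrete, the sketch below tabulates the standard error of a proportion for a few sample sizes. The value p = 0.5 and the sample sizes are arbitrary illustrations; `standard_error` is a helper name introduced here.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

p = 0.5  # assumed true fraction, for illustration only

# Each sample size is four times the previous one.
for n in (100, 400, 1600, 6400):
    print(f"N = {n:5d}  standard error = {standard_error(p, n):.4f}")
```

Because N enters through a square root, quadrupling the sample size only halves the standard error.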
Bad news
• We measure the reported mean, we know the
sample size….
• But we don’t know the true mean.
• Without the true mean we cannot know what
the sampling distribution is…
– we are missing both the mean (p) and the standard
deviation ($\sigma_X / \sqrt{N}$, aka the standard error) of that
statistic.
• If we knew p, the true mean, there would be
no need for a poll.
Next session: the solution to this conundrum
Exercise: Compute the
Margin of Error
• The Rasmussen Poll interviewed 966
individuals.
• Assuming that the true fraction of individuals
who will vote for Gardner is 50%, what is the
Margin of Error?
• Here the margin of error is two standard
errors of the sampling distribution.
• Is it close to the result reported by the
website?
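One possible way to work the exercise, using the slide’s own definitions (margin of error = 2 standard errors, standard error = sqrt(p(1-p)/N)) and the stated assumption p = 0.5. The variable names are introduced here for illustration.

```python
import math

n = 966          # individuals interviewed in the Rasmussen poll (from the exercise)
p_assumed = 0.5  # true fraction assumed by the exercise

standard_error = math.sqrt(p_assumed * (1 - p_assumed) / n)
margin_of_error = 2 * standard_error  # the slides define the MoE as two standard errors

print(f"standard error:  {standard_error:.4f}")
print(f"margin of error: {margin_of_error:.1%}")
```

Under these assumptions the margin of error comes out at roughly 3.2 percentage points, which can then be compared with the figure the polling website reports.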
Wrap up
• A statistic is a random variable.
• The distribution of a statistic is called its sampling
distribution.
• In particular the mean of a variable in a sample is a
statistic.
• The expected value of the sample mean is equal to
the true mean.
• The standard deviation of the sample mean is called
the standard error.
• Central Limit theorem: with a large sample size, the
sampling distribution of the mean of X is normal, and
the empirical rule applies.
The standard error is $\sigma_X / \sqrt{N}$, where $\sigma_X$ is the standard deviation of X.
Coming up:
Readings:
• Chapter 5 entirely – estimation, confidence intervals.
• Online quiz on Thursday.
For help:
• Amine Ouazad
Office 1135, Social Science building
[email protected]
Office hour: Tuesday from 5 to 6.30pm.
• GAF: Irene Paneda
[email protected]
Sunday recitations.
At the Academic Resource Center, Monday from 2 to 4pm.