Statistics for Social and Behavioral Sciences
Session #12: Sampling Distribution, Central Limit Theorem
(Agresti and Finlay, Chapter 9)
Prof. Amine Ouazad

Statistics Course Outline
PART I. INTRODUCTION AND RESEARCH DESIGN (Week 1)
PART II. DESCRIBING DATA (Weeks 2-4)
PART III. DRAWING CONCLUSIONS FROM DATA: INFERENTIAL STATISTICS (Weeks 5-9)
Firenze or Lebanese Express is coming up next session.
PART IV. CORRELATION AND CAUSATION: REGRESSION ANALYSIS (Weeks 10-14)
This is where we talk about ZMapp and Ebola!

Last Session
• A random variable is a variable whose value has not been realized.
• The expectation of a random variable Y is: $E(Y) = \sum_k y_k \, P(Y = y_k)$
Also, $E(X+Y) = E(X) + E(Y)$, $E(cX) = c\,E(X)$, and $E(E(X|Z)) = E(X)$.
• Typically the probability distribution P is not known, but we approximate it:
– Using the distribution of past values of Y (example: earnings of previous graduates).
– Using polls, for example to ask individuals how they will vote.
• The normal distribution is a ubiquitous distribution that is symmetric and bell-shaped. It is characterized by its mean $\mu$ and its standard deviation $\sigma$.
• The standard normal distribution has mean 0 and standard deviation 1.

Outline
1. The standard normal distribution
– Z-Score
2. Polls and normal distributions
– Sampling distribution of a statistic
– A simulation
– Central Limit Theorem
– Variance of a yes/no (dummy) variable
Next time: Probability Distributions (continued), Chapter 4 of A&F

Comparing test scores across colleges
• “Early paleontology in Indianapolis”: test scores have a normal distribution with mean 3 and standard deviation 4.
• “Hip hop in the Middle East”: test scores have a normal distribution with mean 4 and standard deviation 1.
• Problem: how do I compare Marina’s test score of 3.6 in the paleontology course with a test score of 4.1 in “Hip hop in the Middle East”?

Z-score!
• Take a student’s paleontology test score at the end of the semester. This is a random variable.
– Its probability distribution has mean $\mu = 3$ and standard deviation $\sigma = 4$.
– Now consider the “z-scored” paleontology test score:
$z\text{-scored paleontology} = \dfrac{\text{paleontology test score} - \mu}{\sigma}$
– The z-scored paleontology test score has a mean of 0 and a standard deviation of 1.

Standard Normal Distribution
• It is simply the normal distribution with mean 0 and standard deviation 1.
• A z-score of 3 means that the student is three standard deviations (of the original test scores) above the mean.
So who has a better grade, Marina or Slavoj?

Outline
1. The standard normal distribution
– Z-Score
2. Polls and normal distributions
– Sampling distribution of a statistic
– A simulation
– Central Limit Theorem
– Variance of a yes/no (dummy) variable
Next time: Probability Distributions (continued), Chapter 4 of A&F

Who will win the midterm elections in the US?
• Midterm elections are held two years after the presidential elections in the United States.
• They take place in early November 2014.
• A question: what fraction of the voters will vote for Cory Gardner in Colorado?

Mario Zapata Encinas
Real Clear Politics: Gardner vs. Udall
- Who will win? It would be logical to think that Gardner will win because, according to the statistics, he has a higher percentage of votes (without taking into consideration the margin of error of these statistics).
- What is MoE? MoE stands for Margin of Error, a statistic expressing the random sampling error in survey results.
- What is the likely distribution of the fraction of voters who will vote for Gardner? According to RCP, 46.6% of voters will choose Gardner over Udall.

Colorado Senate: Gardner (R) vs Udall (D)
The average MoE is 3.68%.
Thus, the likely fraction of Gardner voters is between 43.42% and 50.78%.
Likely winner: I think Gardner will win the election.
Nick Chaubey 10-28

Conducting a poll
• Goal: estimating the fraction of individuals who intend to vote for a candidate.
• Thinking like a statistician:
1. Ask an empirical question.
2. Design the study: population? Sample? Sampling method? Response and nonresponse bias?
3. Describe the data: what is the mean of the sample?
4. Make inferences: in plain English, can we predict which candidate will win?

Polling company methodology
• Select a sample of individuals by simple random sampling, cluster sampling, or stratified random sampling.
• Sample size N.
• Ask each individual i = 1, 2, …, N:
– “Which candidate do you intend to vote for?”
– Let $\text{VoteGardner}_i = 1$ if individual i intends to vote for Gardner, and 0 otherwise.
• Report the mean of the sample:
$\text{Mean}(\text{VoteGardner}) = \dfrac{1}{N} \sum_{i=1}^{N} \text{VoteGardner}_i$
But there is sampling error!

Groundhog Day: In Expectation
• Take the i-th individual who will be contacted by Rasmussen.
• The probability of the event “the i-th individual declares voting for Gardner” is:
– The true fraction of individuals who intend to vote for Gardner in the US population of eligible voters.
• Write $X_i = \text{VoteGardner}_i$ for the random variable equal to:
• 1 if the i-th individual declares intending to vote for Gardner,
• and 0 otherwise.
• Then $E(X_i)$ = true fraction of individuals who will vote for Gardner. We write it $E(X_i) = p$.

Groundhog Day: In Expectation
• Now what is the expected value of the fraction declaring they will vote for Gardner?
$\text{Mean}(\text{VoteGardner}) = \dfrac{1}{N} \sum_{i=1}^{N} \text{VoteGardner}_i$
• It is:
$E(\text{Mean}(\text{VoteGardner})) = E\left( \dfrac{1}{N} \sum_{i=1}^{N} \text{VoteGardner}_i \right)$
• Now remember that $E(X+Y) = E(X) + E(Y)$ and $E(cX) = c\,E(X)$, so:
$E(\text{Mean}(\text{VoteGardner})) = \dfrac{1}{N} \sum_{i=1}^{N} E(\text{VoteGardner}_i) = \dfrac{1}{N} \cdot N p = p$
• The polling company will get the true fraction of voters for Gardner… in expectation!

Sampling distribution of a statistic
• But there is some chance that the mean will be far off the true fraction. With what probability?
• A statistic is a random variable.
• Indeed, the percentage of respondents who say they intend to vote for Gardner depends on the sample that was drawn.
• This is random because the sample was collected by simple random sampling.
• The mean is a random variable:
$E(\text{Mean}(\text{VoteGardner})) = E\left( \dfrac{1}{N} \sum_{i=1}^{N} \text{VoteGardner}_i \right)$

Central Limit Theorem
[Figure: sampling distribution of Mean(VoteGardner), centered on the true fraction of voters for Gardner. Horizontal axis: reported fraction m. Vertical axis: Probability(Mean(VoteGardner) = m), that is, the probability that the reported fraction is equal to m, e.g. 30%.]
• The reported fraction could be anywhere under the curve, e.g. at 30%.
• With probability 95%, the estimated fraction of voters for Gardner will be within 2 standard deviations of the distribution around the true fraction.
• With some (low) probability, the polling company will give a number far above or far below the true fraction of voters for Gardner.
• Central Limit Theorem: With a large sample size, the sampling distribution of Mean(VoteGardner) is normal, and the empirical rule applies.

Central Limit Theorem
• The last remaining element is the standard deviation of the sampling distribution.
• Writing $\sigma_X$ for the standard deviation of X, the sampling distribution of the mean of X has standard deviation $\sigma_X / \sqrt{N}$.
• The standard deviation of the sampling distribution is called the standard error. It is a measure of sampling error.
• Finally, what is $\sigma_X$?
• For a proportion, $\sigma_X = \sqrt{p(1-p)}$, where p is the true value.

Good news
• There is some probability that the reported mean will be far above or far below the true mean. But:
• With a large sample size, the probability that the reported mean is further than 2 standard errors from the true mean is about 5%.
• The most likely outcome is the true mean.
– The mode of the sampling distribution is the true mean.
• The expected value of the reported mean is the true mean.
• The larger the sample size, the smaller the standard error.

Bad news
• We measure the reported mean, and we know the sample size.
• But we don’t know the true mean.
• Without the true mean we cannot know what the sampling distribution is:
– We are missing both the mean (p) and the standard deviation ($\sigma_X / \sqrt{N}$, aka the standard error) of that statistic.
• If we knew p, the true mean, there would be no need for a poll.
Next session: the solution to this conundrum.

Exercise: Compute the Margin of Error
• The Rasmussen Poll interviewed 966 individuals.
• Assuming that the true fraction of individuals who will vote for Gardner is 50%, what is the Margin of Error?
• The Margin of Error is here two standard errors of the distribution.
• Is it close to the result reported by the website?

Wrap up
• A statistic is a random variable.
• The distribution of a statistic is called its sampling distribution.
• In particular, the mean of a variable in a sample is a statistic.
• The expected value of the sample mean is equal to the true mean.
• The standard deviation of the sample mean is called the standard error.
• Central Limit Theorem: with a large sample size, the sampling distribution of the mean of X is normal, and the empirical rule applies. The standard error is $\sigma_X / \sqrt{N}$.

Coming up:
Readings:
• Chapter 5 entirely: estimation, confidence intervals.
• Online quiz on Thursday.
For help:
• Amine Ouazad, Office 1135, Social Science building, [email protected]. Office hour: Tuesday from 5 to 6:30pm.
• GAF: Irene Paneda, [email protected]. Sunday recitations. At the Academic Resource Center, Monday from 2 to 4pm.
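
Worked sketch for the margin-of-error exercise above. The arithmetic follows the slides' own definitions (standard error = sqrt(p(1-p)) / sqrt(N), margin of error = two standard errors), with p = 0.5 and N = 966 as stated in the exercise. The function names, the number of simulated polls (2000), and the random seed are assumptions chosen for illustration; the simulation is only a rough check of the "about 95% within two standard errors" claim, not the course's own simulation.

```python
import math
import random


def standard_error(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p)) / sqrt(n)."""
    return math.sqrt(p * (1 - p)) / math.sqrt(n)


def margin_of_error(p, n):
    """Margin of error, taken here as two standard errors (as in the exercise)."""
    return 2 * standard_error(p, n)


def simulate_polls(p, n, n_polls, seed=0):
    """Simulate n_polls polls of n respondents each.

    Each respondent says "Gardner" with probability p. Returns the list of
    reported fractions, one per simulated poll.
    """
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(n_polls)]


# Values from the exercise: assumed true fraction 50%, 966 respondents.
p_true = 0.5
n_respondents = 966

moe = margin_of_error(p_true, n_respondents)
print(f"Margin of error: {moe:.4f}")  # about 0.0322, i.e. roughly 3.2 points

# Illustrative check: repeat the poll many times and count how often the
# reported fraction lands within one margin of error of the true fraction.
fractions = simulate_polls(p_true, n_respondents, n_polls=2000)
share_within = sum(abs(f - p_true) <= moe for f in fractions) / len(fractions)
print(f"Share of simulated polls within the margin of error: {share_within:.3f}")
```

Each simulated poll plays the role of one "Groundhog Day" repetition: the reported fraction changes from poll to poll, but most repetitions land within two standard errors of the true fraction.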