Statistics for Social and Behavioral Sciences
Session #14: Estimation, Confidence Interval
(Agresti and Finlay, Chapter 5)
Prof. Amine Ouazad

Statistics Course Outline
PART I. INTRODUCTION AND RESEARCH DESIGN (Week 1)
PART II. DESCRIBING DATA (Weeks 2-4)
PART III. DRAWING CONCLUSIONS FROM DATA: INFERENTIAL STATISTICS (Weeks 5-9) (now: Firenze or Lebanese Express)
PART IV. CORRELATION AND CAUSATION: REGRESSION ANALYSIS (Weeks 10-14) (This is where we talk about Zmapp and Ebola!)

Last 2 Sessions
• A statistic is a random variable.
• The distribution of a statistic is called its sampling distribution.
• In particular the mean of a variable in a sample is a statistic.
• The expected value of the sample mean is equal to the true mean.
• The standard deviation of the sample mean is called the standard error.
• Central Limit Theorem: with a large sample size, the sampling distribution of the mean of X is normal, and the empirical rule applies. The standard error is $\sigma_X / \sqrt{N}$.

Last 2 Sessions
• For a proportion (X is 0,1): $\sigma_X = \sqrt{p(1-p)}$. As we typically do not observe the true proportion $p$ but only the sample proportion $\hat{p}$, we approximate $\sigma_X$ by $\sqrt{\hat{p}(1-\hat{p})}$.
• For other variables (X is not 0,1): as we do not observe the true standard deviation $\sigma_X$ but rather the sample standard deviation $s_X$, we approximate $\sigma_X$ by $s_X$ and thus approximate the standard error by $s_X / \sqrt{N}$.
• We are interested in estimating parameters, but we only observe statistics. Can we use statistics as estimators?

Outline
1. Back to Zomato
   Just applying the formulas we know
2. Estimators:
   Point Estimator
   Biased vs Unbiased Estimators
   Efficient vs Inefficient Estimators
   Interval Estimator
Next time: Estimation, Confidence Intervals (continued)
Chapter 5 of A&F

Back to Zomato
1. What statistical issue would preclude us from using the Central Limit Theorem?
2. Assuming we can use the CLT, what is the Margin of Error on Cafe Firenze and Lebanese Express’s ratings?

Think !!
• Questions:
1. When rating a restaurant, what are the possible choices for the user?
2. What is 3.4 on this rating?
3. What are we trying to estimate?
4. What is the formula for the standard error of ratings?
   • Is a rating X a 0,1 variable?
5. What is the standard deviation $\sigma_X$ of ratings?
6. Finally what is the standard error of the rating 3.4?
7. And what is the margin of error for the rating 3.4? (MoE = twice the standard error)

Recap: Central Limit Theorem
• Central Limit Theorem: with a large sample size, the distribution of the sample mean is normal, with mean equal to the true mean and with standard deviation (= standard error) equal to $\sigma_X / \sqrt{N}$.
• X is not 0,1 (Café Firenze’s case): approximate the true standard deviation $\sigma_X$ using the sample standard deviation $s_X$.
• X is 0,1: approximate $\sigma_X = \sqrt{p(1-p)}$, where $p$ is the true proportion, using the sample proportion $\hat{p}$ for $p$.

Back to Zomato
• If we had all the ratings of individual users:
  – John: 3, “Hated it, service is poor”
  – Abdullah: 4, “Great venue”
  – Anthony: 5, “Perfect, loved the al dente pasta”
  – Claire: 3, “Ok for a downtown lunch”
  – Al Bloom: 3, “The Italian restaurant of the world”
  – John Sexton: 3, “Can achieve more”
  – Ayesha: 3, “There are alternatives”
• The average is 3.4, and we would find $s_X$ = …………….

Zomato Problemo
• The website only reports the sample mean of ratings…
• We thus have to figure out a conservative value of $s_X$ (the largest possible).
• What is the highest possible $s_X$?

Outline
1. Back to Zomato
   Just applying the formulas we know
2. Estimators:
   Point Estimate
   Biased vs Unbiased Estimators
   Efficient vs Inefficient Estimators
   Interval Estimate
Next time: Estimation, Confidence Intervals (continued)
Chapter 5 of A&F

Parameters and their point estimates
Parameters (« True » values) | Point Estimate
Population mean $\mu$ (Example: population mean rating of Cafe Firenze) | Sample mean $\bar{x}$ (sample mean rating of Cafe Firenze)
Population median | Sample median
Population standard deviation $\sigma_X$ (Example: population standard deviation of ratings of Cafe Firenze) | Sample standard deviation $s_X$ (sample standard deviation of ratings of Cafe Firenze)
Population variance $\sigma_X^2$ | Sample variance $s_X^2$
Population p-th percentile | Sample p-th percentile
• This is called a “point estimate” because we give a single number (a “point” on the axis).

Biased vs Unbiased Estimator
• We have seen that to get the standard error of the sample mean, we need to have an estimate of $\sigma_X$.
• So far we have used: $\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2}$
• And the textbook has given: $\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2}$
• These are two different estimators of the same quantity $\sigma_X$.
• The textbook’s estimator of $\sigma_X$ is the unbiased one (strictly speaking, it is its square that is an unbiased estimator of the variance $\sigma_X^2$).
These two formulas are “point estimates”.

Efficient vs Inefficient Estimator
• Among all possible estimators, an estimator is efficient if it has the smallest standard error.
• The standard error of $\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2}$ is smaller than the standard error of $\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2}$.
• The slides’ version is efficient, while the textbook’s version is unbiased. There is a conundrum.
These two formulas are “point estimates”.

What do you actually need to remember?
• “Good” estimators are unbiased and efficient.
  – The sample mean is an unbiased and efficient estimator of the population mean.
• “Less good” estimators may be unbiased or efficient, but not both.
  – The sample standard deviation with denominator N-1 is unbiased but inefficient.
  – The sample standard deviation with denominator N is biased but efficient.
  – We keep using the formula we learnt…

Parameters and Interval Estimate
• An interval estimate is an interval of numbers around the point estimate, which includes the parameter with a chosen probability, typically 90%, 95%, or 99%.
• Example: “the interval estimate [156.2 cm - 0.49 cm ; 156.2 cm + 0.49 cm] includes the population average height with probability 95%.”

Parameters and Interval Estimate
• An interval estimate that includes the parameter with probability 95% is called a 95% confidence interval.
• The expression “95% confidence interval” is widely used.
• Example: “[156.2 cm - 0.49 cm ; 156.2 cm + 0.49 cm] is a 95% confidence interval for the population average height.”

How do we build a 95% confidence interval?
• Goal: estimate the population average $\mu$.
• From previous session: $[\mu - \mathrm{MoE} ; \mu + \mathrm{MoE}]$ includes the sample mean $\bar{x}$ with probability 95%.
• The sample mean is within one MoE of $\mu$ exactly when $\mu$ is within one MoE of the sample mean. We conclude: the interval $[\bar{x} - \mathrm{MoE} ; \bar{x} + \mathrm{MoE}]$ includes the population mean with probability 95%.
• $[\bar{x} - \mathrm{MoE} ; \bar{x} + \mathrm{MoE}]$ is a 95% confidence interval for $\mu$.
• MoE = 1.96 × Standard Error
• Standard Error = $\sigma_X / \sqrt{N}$

Wrap up
• Central Limit Theorem: with a large sample size, the sampling distribution of the sample mean of X is normal, and the empirical rule applies. The standard error is the standard deviation of the sampling distribution, $\sigma_X / \sqrt{N}$.
• For a proportion: $\sigma_X = \sqrt{p(1-p)}$. As we typically do not observe the true proportion $p$ but only the sample proportion $\hat{p}$, we approximate $\sigma_X$ by $\sqrt{\hat{p}(1-\hat{p})}$.
• For other variables: as we do not observe the true standard deviation $\sigma_X$ but rather the sample standard deviation $s_X$, we approximate the standard error by $s_X / \sqrt{N}$.
• We are interested in estimating parameters, but we only observe statistics. Can we use statistics as estimators? Estimators can be unbiased and efficient.

Coming up:
Readings:
• This week and next week:
  – Chapter 5 entirely: estimation, confidence intervals.
  – Understand the confidence interval, the point estimate.
• Online quiz on Thursday.
• Deadlines are sharp and attendance is monitored.
• Tonight is the midterm election!!
• Watch: http://www.msnbc.com/jose-diaz-balart/watch/is-2014-the-margin-of-error-midterms-349919811638

For help:
• Amine Ouazad
  Office 1135, Social Science building
  [email protected]
  Office hour: Tuesday from 5 to 6.30pm.
• GAF: Irene Paneda
  [email protected]
  Sunday recitations. At the Academic Resource Center, Monday from 2 to 4pm.
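
Worked illustration for the “Back to Zomato” slide: a minimal Python sketch that applies the session's formulas to the seven individual ratings listed there. The function names (`sample_mean`, `sample_sd`, `confidence_interval_95`) are illustrative choices, not something from the slides or the textbook. With only seven ratings the large-sample normal approximation is rough, so the interval below is a mechanical application of the formulas rather than a reliable estimate.

```python
import math

# The seven individual ratings listed on the "Back to Zomato" slide.
ratings = [3, 4, 5, 3, 3, 3, 3]


def sample_mean(xs):
    """Arithmetic mean of the observations."""
    return sum(xs) / len(xs)


def sample_sd(xs, ddof=0):
    """Sample standard deviation.

    ddof=0 divides by N (the slides' formula);
    ddof=1 divides by N-1 (the textbook's formula).
    """
    m = sample_mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - ddof))


def confidence_interval_95(xs, ddof=0):
    """Return (lower, upper) = mean -/+ 1.96 * s / sqrt(N)."""
    n = len(xs)
    m = sample_mean(xs)
    standard_error = sample_sd(xs, ddof) / math.sqrt(n)
    margin_of_error = 1.96 * standard_error
    return m - margin_of_error, m + margin_of_error


mean_rating = sample_mean(ratings)
sd_slides = sample_sd(ratings, ddof=0)    # denominator N
sd_textbook = sample_sd(ratings, ddof=1)  # denominator N-1
lower, upper = confidence_interval_95(ratings, ddof=0)

print("mean:", round(mean_rating, 2))
print("s_X with N:", round(sd_slides, 3))
print("s_X with N-1:", round(sd_textbook, 3))
print("95% interval:", (round(lower, 2), round(upper, 2)))
```

Rounded to one decimal, the mean of these seven ratings is the 3.4 shown on the slide. Switching `ddof` between 0 and 1 shows how much the choice of denominator matters in a sample this small.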
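
For the “Zomato Problemo” slide, where only the mean rating is reported, one way to get a conservative standard deviation is sketched below in Python. It relies on a general fact: a variable bounded between `lo` and `hi` with mean `m` has variance at most `(m - lo) * (hi - m)`. The 1-to-5 rating scale and the number of ratings `N_RATINGS = 100` are assumptions made purely for illustration; neither is stated on the slides. The function names are also hypothetical.

```python
import math


def conservative_sd(mean, lo, hi):
    """Largest possible standard deviation of a variable that lies in
    [lo, hi] and has the given mean."""
    if not lo <= mean <= hi:
        raise ValueError("mean must lie between lo and hi")
    return math.sqrt((mean - lo) * (hi - mean))


def margin_of_error(sd, n, z=1.96):
    """Margin of error = z * standard error = z * sd / sqrt(n)."""
    return z * sd / math.sqrt(n)


LOW, HIGH = 1, 5   # ASSUMPTION: ratings run from 1 to 5
N_RATINGS = 100    # HYPOTHETICAL number of ratings, for illustration only

sd_max = conservative_sd(3.4, LOW, HIGH)
moe = margin_of_error(sd_max, N_RATINGS)

print("conservative s_X:", round(sd_max, 3))
print("conservative margin of error:", round(moe, 3))
```

The bound is reached when every user gives either the lowest or the highest score. For a 0,1 variable it reduces to the familiar $\sqrt{p(1-p)}$, since `conservative_sd(p, 0, 1)` computes exactly that.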
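
The “biased vs unbiased” and “efficient vs inefficient” slides can be checked with a small simulation. The Python sketch below is illustrative only: it assumes a normal population with true variance 1, samples of size N = 5, and a fixed random seed, none of which come from the slides. It works with the variance (the squared formulas), because that is the scale on which the N-1 denominator is exactly unbiased.

```python
import math
import random


def variance(xs, ddof):
    """Average squared deviation from the sample mean, dividing by N - ddof."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)


def mean(xs):
    return sum(xs) / len(xs)


def spread(xs):
    """Standard deviation of a list of estimates, i.e. the simulated
    standard error of the estimator that produced them."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def simulate(n=5, reps=20000, seed=0):
    """Draw `reps` samples of size `n` from a normal population with
    variance 1 and apply both estimators to each sample."""
    rng = random.Random(seed)
    est_n, est_n1 = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        est_n.append(variance(sample, ddof=0))   # denominator N
        est_n1.append(variance(sample, ddof=1))  # denominator N-1
    return {
        "mean_N": mean(est_n),
        "mean_N1": mean(est_n1),
        "spread_N": spread(est_n),
        "spread_N1": spread(est_n1),
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In a run like this, the N-1 version should average close to the true variance of 1, while the N version averages lower (bias) but varies less from sample to sample (smaller standard error), which is the conundrum described on the slide.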