Statistics for Social and Behavioral Sciences
Session #15: Interval Estimation, Confidence Interval
(Agresti and Finlay, Chapter 5)
Prof. Amine Ouazad

Statistics Course Outline
PART I. INTRODUCTION AND RESEARCH DESIGN (Week 1)
PART II. DESCRIBING DATA (Weeks 2-4)
PART III. DRAWING CONCLUSIONS FROM DATA: INFERENTIAL STATISTICS (Weeks 5-9)
– Café Firenze's and Lebanese Express's ratings are within a MoE of each other!
PART IV. CORRELATION AND CAUSATION: REGRESSION ANALYSIS (Weeks 10-14)
– This is where we talk about Zmapp and Ebola!

Last Session: Inference
Central Limit Theorem:
• With a large sample size N, the sampling distribution of the sample mean is approximately normal.
• The mean of the sampling distribution is the population mean.
• The standard deviation of the sampling distribution is σX/√N, where σX is the standard deviation of X.
• A conservative Margin of Error (= 2 standard errors) for Café Firenze's restaurant rating is 1.1 with 14 votes.
• For any rating from 1 to 5, the largest possible Margin of Error is 4/√N, where N is the number of ratings.
• With TripAdvisor, we see the rating of each individual customer, and so we can calculate the sample standard deviation sX!

Today
• Use this margin of error to provide interval estimates:
– A 95% confidence interval for Café Firenze is [2.3, 4.5].
– “The true rating of Café Firenze is between 2.3 and 4.5 with probability 95%”.
– Note: average was 3.4 and MoE was 1.1.
– A 95% confidence interval for Cory Gardner's vote share in Colorado is [48 - 3.6, 48 + 3.6] = [44.4, 51.6].
– “The true vote share for Cory Gardner is between 44.4% of the vote and 51.6% of the vote with 95% probability”.
– Note: MoE was 3.6.

News: Last Tuesday
• We learned the population proportion p!
– Proportion of voters for Cory Gardner.
• The latest poll was giving us a sample proportion of the vote p̂ (N around 1000).

Outline
1. Interval Estimation, Confidence Interval
2. Choosing between 90, 95, 99% confidence
3. When distributions are normal: t-distribution
Next time: Estimation, Confidence Intervals (continued), Chapter 5 of A&F

Parameters and Interval Estimate
• An interval estimate is an interval of numbers around the point estimate, which includes the parameter with probability either 90%, 95%, or 99%.
• Example: “the interval estimate [156.2 cm - 0.49 cm ; 156.2 cm + 0.49 cm] includes the population average height with probability 95%.”
• Sample mean: 156.2 cm, MoE = 0.49 cm.

Parameters and Interval Estimate
• An interval estimate that includes the parameter with probability 95% is called a 95% confidence interval.
• The expression “95% confidence interval” is widely used.
• Example: “[156.2 cm - 0.49 cm ; 156.2 cm + 0.49 cm] is a 95% confidence interval for the population average height.”
• Sample mean: 156.2 cm, MoE = 0.49 cm.
• We use 1.96 instead of 2 from now on.

How do we build a 95% confidence interval?
• Goal: estimate the population average μ.
• From previous sessions: [μ - MoE ; μ + MoE] includes the sample mean with probability 95%.
• The sample mean is within one MoE of μ exactly when μ is within one MoE of the sample mean. We conclude: the interval [Sample Mean - MoE ; Sample Mean + MoE] includes the population mean with probability 95%.
• [Sample Mean - MoE ; Sample Mean + MoE] is a 95% confidence interval for μ.
• MoE = 1.96 * Standard Error
• Standard Error = sX/√N

2. Choosing between 90, 95, 99% confidence

Choosing between 90%, 95%, 99%
• The interval estimate [Sample Mean - MoE ; Sample Mean + MoE] includes the population mean (the parameter) with probability:
– 99% if MoE = 2.58 * Standard Error
– 95% if MoE = 1.96 * Standard Error
– 90% if MoE = 1.65 * Standard Error
• The width of a confidence interval:
1. Increases as the confidence level increases.
2. Decreases as the sample size increases.
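The multipliers and the two width properties above can be checked with a short Python sketch. It is only an illustration: the function names `z_multiplier` and `confidence_interval` are made up for this example, and the sample mean (100), sample standard deviation (15) and sample sizes are hypothetical numbers, not data from the course.

```python
from math import sqrt
from statistics import NormalDist


def z_multiplier(confidence):
    """Number of standard errors in the MoE for a given confidence level.

    For a two-sided interval, the multiplier is the standard normal
    quantile at 0.5 + confidence / 2 (e.g. 0.975 for 95%).
    """
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Large-sample interval [Sample Mean - MoE ; Sample Mean + MoE]."""
    standard_error = sample_sd / sqrt(n)
    moe = z_multiplier(confidence) * standard_error
    return sample_mean - moe, sample_mean + moe


# Multipliers: roughly 1.65, 1.96 and 2.58 once rounded to two decimals.
for level in (0.90, 0.95, 0.99):
    print(level, round(z_multiplier(level), 2))

# Hypothetical sample: mean 100, standard deviation 15.
# The interval widens with the confidence level (N fixed at 400) ...
for level in (0.90, 0.95, 0.99):
    print(level, confidence_interval(100, 15, 400, level))

# ... and narrows as the sample size grows (confidence fixed at 95%).
for n in (100, 400, 1600):
    print(n, confidence_interval(100, 15, n, 0.95))
```

Because the standard error is sX/√N, quadrupling N halves the width of the interval.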
Building 90%, 95%, 99% confidence intervals
Exercise:
• The sample mean weight (a sample of individuals in the US) is 60.0 kg, and the sample standard deviation is 29.9 kg.
• Find a 90% (resp., 95%, 99%) confidence interval for the population mean weight.

Why 90%, 95%, 99%?
• Confidence intervals were introduced by Jerzy Neyman in the 1930s.
• R.A. Fisher developed the theory of statistical testing.
• Sample sizes were small at the time (a few hundred), and 95% seemed a reasonable confidence level.
• The medical sciences adopted confidence intervals soon after their introduction.
• 95% became the standard.
(Picture: R.A. Fisher)

3. When distributions are normal: t-distribution

Central Limit Theorem
• Requires a large sample size N.
• This is because it applies to any distribution of X.
• Example #1:
– We had a sample of N songs, and the number of times Xi that song i had been played.
– The number of times Xi a song is played on Spotify does not have a normal distribution.
– But we can build a confidence interval for the average number of times a song is played (μ), provided we have a large enough number N of songs.
– MoE = 1.96 * sX/√N for a 95% confidence interval.
• We can use our formulas to find a 95% confidence interval for μ (sample mean: 360.63).
• N is large, so the formulas apply even though X does not have a normal distribution.

What if N is small?
• If N is “small”, the Central Limit Theorem does not apply.
– We cannot use our formulas.
• “Small”? Less than a few hundred (from experience).
• If N is very small (N=2, N=5): these sampling distributions are not normal.

If N is small
• The sample standard deviation sX is potentially very far from the population standard deviation σX.
• But we can still find confidence intervals if X is normal.
• The sampling distribution of the sample mean is Student's t distribution, with degrees of freedom (df) equal to N-1, and with standard error sX/√N.
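A small simulation can illustrate why the 1.96 multiplier fails for small N, even when X is normal. This is a sketch under stated assumptions, not part of the lecture: X is drawn from a standard normal distribution (true mean 0), N = 5, the helper `coverage` is a hypothetical name, and 2.776 is the standard t-table multiplier for 95% confidence with df = 4.

```python
import random
from math import sqrt


def coverage(multiplier, n=5, trials=20000, mu=0.0, sigma=1.0, seed=0):
    """Share of simulated intervals that contain the true mean mu.

    Each trial draws n normal observations and builds
    [Sample Mean - MoE ; Sample Mean + MoE] with
    MoE = multiplier * sX / sqrt(n), where sX is the sample
    standard deviation (denominator n - 1).
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        m = sum(sample) / n
        s = sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
        moe = multiplier * s / sqrt(n)
        if abs(m - mu) <= moe:
            hits += 1
    return hits / trials


coverage_z = coverage(1.96)   # large-sample multiplier
coverage_t = coverage(2.776)  # t multiplier for df = N - 1 = 4 (assumed from a t table)

print("coverage with 1.96: ", coverage_z)
print("coverage with 2.776:", coverage_t)
```

With N = 5, intervals built with 1.96 contain the true mean noticeably less often than 95% of the time, while the larger t multiplier brings coverage back to about 95%.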
If N is small
• A 95% confidence interval for the population mean is: [Sample Mean - MoE ; Sample Mean + MoE], with MoE = z * Standard Error.
• z = 1.96 when df = ∞.
• z > 1.96 when the df are small.
• See next table for the exact value of z.

t Table

Why is it called Student's t distribution?
• The t distribution was allegedly invented by a person called Student.
• That “Student” was an engineer at Guinness's Factories in Ireland: William Sealy Gosset.
• He was producing small samples of a drink, seeking guidance for industrial quality control:
– He was trying a small number of samples (N = 2, 4, perhaps 7).
– From these samples, he was trying to infer the quality of all containers of the product (the population).
Source: W.S. Gosset and Some Neglected Concepts in Experimental Statistics: Guinnessometrics II, Stephen T. Ziliak, 2011.

Wrap up
• Interval estimates for a population mean (a parameter) when N is large, for any distribution of X.
• Build a confidence interval for a parameter: the interval [Sample Mean - MoE ; Sample Mean + MoE] includes the parameter with probability:
– 99% if MoE = 2.58 * Standard Error
– 95% if MoE = 1.96 * Standard Error
– 90% if MoE = 1.65 * Standard Error
• The t-distribution gives confidence intervals when the sample size N is small and the distribution of X is normal.
• Use z given by Table 5.1 of Agresti and Finlay for degrees of freedom N-1.

Coming up:
Readings:
• This week and next week:
– Chapter 5 entirely: estimation, confidence intervals.
• Online quiz deadline Tuesday 9am.
• Deadlines are sharp and attendance is recorded.

For help:
• Amine Ouazad, Office 1135, Social Science building, [email protected]
Office hour: Tuesday from 5 to 6.30pm.
• GAF: Irene Paneda, [email protected]
Sunday recitations. At the Academic Resource Center, Monday from 2 to 4pm.