What is this “Margin of Error”?
On August 14 of this year, NBC News reported:
In Florida, incumbent Republican Sen. Marco Rubio is ahead of top Democratic
challenger Patrick Murphy, 49 percent to 43 percent. (It was Rubio 47 percent,
Murphy 44 percent a month ago.)
The NBC/WSJ/Marist polls were conducted Aug. 4-10 of 862 registered voters in
Florida (which has a margin of error of plus-minus 3.3 percentage points).
In this U.S. presidential election season, every day brings news concerning the latest political
preference polls. At the end of each news report, you’ll be told the “margin of error” in the reported
estimates.
Every poll is, of course, subject to what’s called “sampling error”: Even if properly run, there’s a
chance that the poll will, merely through bad luck, end up with a randomly chosen sample of voters
that is not perfectly representative of the overall electorate.
However, using the tools you learned in your “probability” course, we can measure the likelihood of
such bad luck. Assume that the poll was not subject to any type of systematic bias (a critical
assumption, unfortunately frequently not true in practice). Then there was a 95% chance, when the
poll was conducted, that the final estimates would differ from the true numbers which describe the
entire population by no more than the reported “margin of error”.
In simple terms: The pollster in the NBC study reported above, knowing only the results of his/her poll,
should be willing to bet $19 against your bet of $1 that Marco Rubio’s actual level of support amongst
registered voters in early August was in the 49% ± 3.3% range. (This range is called “a 95%-confidence
interval” for the true percentage.)
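As a quick arithmetic check (a sketch added here, not part of the original handout), the reported ±3.3 points can be reproduced from the standard formula for the standard error of a proportion, using the worst-case assumption p = 0.5:

    from math import sqrt

    n = 862    # sample size of the NBC/WSJ/Marist poll
    p = 0.5    # worst-case proportion (maximizes the standard error)
    z = 1.96   # normal multiplier for 95% confidence

    se = sqrt(p * (1 - p) / n)   # standard error of a sample proportion
    moe = z * se                 # margin of error
    print(f"margin of error: +/- {100 * moe:.1f} percentage points")
    # prints: margin of error: +/- 3.3 percentage points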
The next section extends this language to other types of estimates and to forecasts.
Optional note:
The reported margin of error in the news story refers to the estimated support levels of each of the
two candidates separately. How much can we trust the estimate that Rubio is currently ahead by 6
percentage points? Or that Rubio’s lead has grown by 3 percentage points over the past month?
On our second day of class, we’ll see that the margin of error in the estimated “gap” of 6% is roughly
6.6%, and that the margin of error in the 3% estimated change in the gap is roughly 9.2%! The insight is
simple: combining fuzzy numbers increases the resulting fuzz.
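One way to see where figures of this size come from (an illustrative sketch, not the handout’s day-two derivation): in a near two-way race the two support estimates move almost perfectly against each other, so the gap’s margin of error is roughly double the single-candidate margin, and independent polls combine in quadrature:

    from math import sqrt

    moe_single = 3.3            # reported single-candidate margin of error

    # One candidate's share rises almost exactly as the other's falls,
    # so the gap's margin of error is about twice the single margin.
    moe_gap = 2 * moe_single    # ~6.6 points

    # The two monthly polls are independent, so their uncertainties add
    # in quadrature; the exact figure depends on the earlier poll's
    # sample size, which is why the handout quotes roughly 9.2%.
    moe_change = sqrt(moe_gap**2 + moe_gap**2)
    print(round(moe_change, 1))  # ~9.3 points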
The Language of (Statistical) Estimation
Typically, when we’re estimating some unknown characteristic of a population, or predicting something, we’ll be
able to compute (from our sample data) one standard-deviation’s-worth of “noise” (i.e., potential error due to
the randomness of the sampling process that gave us our data) in our estimate or prediction. And typically, the
methods we’ll use will lead to normally-distributed uncertainty in our estimate or prediction. Therefore, 95% of
the time we’ll get an estimate or prediction that differs from the truth (i.e., the true value of that population
characteristic, or the actual value ultimately taken by whatever we’re predicting) by no more than
approximately two standard deviations’ worth of noise.
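Where the “approximately two” comes from, as a quick check (an added snippet, not from the original):

    from scipy.stats import norm

    # About 95% of a normal distribution lies within 2 standard
    # deviations of its mean; exactly 95% lies within 1.96.
    print(norm.cdf(2) - norm.cdf(-2))   # 0.9545
    print(norm.ppf(0.975))              # 1.9600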
One standard-deviation’s-worth of potential error in an estimate or prediction is called the standard error of the
{quantity being estimated or predicted}. As we examine the results of a regression analysis, we’ll encounter, for
example, the standard error of a mean, the standard error of a proportion, the standard error of a prediction,
the standard error of an estimated subgroup mean, the standard errors of the estimated coefficients in our
regression model, and the standard error of the regression itself.
The margin of error (at the 95%-confidence level) in our estimate or prediction will simply be (approximately
2)·(the standard error of that estimate or prediction). Both the appropriate “approximately 2” multiplier (which
comes from something called “the t-distribution” with some number of “degrees of freedom”) and the
appropriate standard error (again, this represents one standard-deviation’s-worth of “noise” in our estimation
or prediction process) will be computed for us by any modern statistical software.
Finally, a 95%-confidence interval for our estimate or prediction is
(the estimate or prediction itself) ± (~2)·(the standard error of the estimate or prediction), or
(the estimate or prediction itself) ± (the margin of error in the estimate or prediction).
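To make the multiplier concrete: the 15-car sample described below gives 14 degrees of freedom for a mean, and a one-line computation (an illustration added here) recovers the 2.1448 used in the table that follows:

    from scipy.stats import t

    n = 15                                 # cars in the sample
    multiplier = t.ppf(0.975, df=n - 1)    # two-sided 95% t-multiplier
    print(round(multiplier, 4))            # 2.1448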
Here’s an example from the first dataset we’ll analyze in class, consisting of 15 cars sampled from the fleet
owned and operated by a municipality. The variables for each car are maintenance Costs over the previous year,
Mileage (in 000s) driven over the previous year, Age (in years) at the start of the year, and Make (all the cars
were either Fords or Hondas; Make is encoded as Ford = 0, Honda = 1).
Each interval below has the form (estimate) ± (margin of error):

95%-confidence interval for mean annual cost per car (across the fleet):  $688.87 ± 2.1448·$28.84
95%-confidence interval for mean annual miles driven per car:  16,373 ± 2.1448·1,121
95%-confidence interval for mean age of cars in the fleet:  1.000 ± 2.1448·0.218
95%-confidence interval for fraction of fleet for which Make = Honda:  46.67% ± 2.1448·13.33%
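A brief sketch (added for illustration) turns the first row into explicit interval endpoints, and backs out the sample standard deviation from the reported standard error:

    from math import sqrt

    n = 15
    mean_cost, se_cost, mult = 688.87, 28.84, 2.1448

    moe = mult * se_cost                      # margin of error, ~$61.86
    print(mean_cost - moe, mean_cost + moe)   # roughly ($627.01, $750.73)

    # The standard error of a mean is s / sqrt(n), so the sample
    # standard deviation of annual cost is recoverable from the table:
    print(round(se_cost * sqrt(n), 2))        # roughly $111.70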
Optional: Connection with the Probability Module
If what appears below feels too technical, please feel free to skip it completely!
Every statistical study begins with a random sampling of data from a population of interest. Any number then
computed from the sample data can be thought of as one realization of a random variable which takes different
values for different samples.
For example, imagine repeatedly drawing random individuals (say, from the population of EMP alumni who
graduated between 2005 and 2010), where each draw is equally likely to yield any one of the individuals in that
population. (This is called “simple random sampling with replacement.”) We’ll then average the annual incomes
of all of the sampled individuals, and use that as an estimate of the true population mean, µ (the mean annual
income of EMP alumni 5 to 10 years after graduation). Call the computed sample mean x̄ (a number); this is a
realization of the random variable X̄, which varies from one sample to the next.
The individual random draws X₁, X₂, …, Xₙ are independent, identically-distributed random variables, and X̄ is
their average. Therefore, from your probability course:

(1) E[X̄] = µ,   (2) StdDev(X̄) = σ/√n,   and (3) X̄ is approximately normally distributed

(where σ is the population standard deviation, and approximate normality follows from the Central Limit
Theorem).
If X̄ were precisely normally distributed, there would be a 95% chance that its realized value x̄ differs from its
expected value (the true population mean µ) by no more than ±1.96 · σ/√n.
We don’t know σ, of course. But we can use the sample standard deviation s as an estimate of σ, and
therefore we can be 95%-confident that the interval

x̄ ± (~2) · s/√n

contains the true population mean µ. This is called a 95%-confidence interval for µ. (The “approximately 2”
multiplier is a bit larger than 1.96, to cover for our use of s instead of σ.)
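A simulation sketch (added here; not part of the original module) checks the claim directly: across many repeated samples, roughly 95% of the intervals x̄ ± (~2)·s/√n contain the true mean µ:

    import numpy as np
    from scipy.stats import t

    rng = np.random.default_rng(0)
    mu, sigma, n, trials = 100.0, 15.0, 25, 10_000
    mult = t.ppf(0.975, df=n - 1)    # the "approximately 2" multiplier

    hits = 0
    for _ in range(trials):
        sample = rng.normal(mu, sigma, size=n)
        xbar = sample.mean()
        s = sample.std(ddof=1)               # sample standard deviation
        moe = mult * s / np.sqrt(n)          # margin of error
        hits += (xbar - moe <= mu <= xbar + moe)

    print(hits / trials)    # roughly 0.95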
Similarly, when estimating or predicting anything statistically, a 95%-confidence interval will be
(the estimate or prediction)
± (~ 2) ⋅ (one standard-deviation’s-worth of random variability in the estimate or prediction).
That standard-deviation’s-worth of “noise” in our estimate or prediction is called the “standard error” of
whatever we’re estimating or predicting. In the car-fleet table above, note that each “standard error of the
mean” is simply the sample standard deviation divided by √15, the square root of the sample size.
In regression studies, computing standard errors typically requires much more work than using the simple
s/√n formula for the standard error of the mean. That’s why we leave the calculation to the computer.
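For a taste of what the computer does, here is a small sketch (with synthetic data, not the class dataset) in which statsmodels reports a standard error alongside each estimated regression coefficient:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 15
    mileage = rng.uniform(5, 30, size=n)              # synthetic mileage (000s)
    cost = 300 + 25 * mileage + rng.normal(0, 80, n)  # synthetic annual cost

    X = sm.add_constant(mileage)            # intercept + Mileage
    results = sm.OLS(cost, X).fit()
    print(results.params)                   # estimated coefficients
    print(results.bse)                      # their standard errors
    print(results.conf_int(alpha=0.05))     # 95%-confidence intervals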