STAT 101, Module 7: The Root-N Law, the Central Limit Theorem, Standard Errors, and Confidence Intervals
(Book: chapter 7)

Independent and Uncorrelated Random Variables

Definitions: Two random variables X and Y are called …
o … independent if the events Ai = (X=xi) and Bj = (Y=yj) are independent for all possible values xi of X and yj of Y.
o … uncorrelated if C(X,Y) = 0.

We say “uncorrelated” even though we use the covariance in the definition. Maybe that’s because we can’t say “uncovarianced”, or maybe because if σ(X) and σ(Y) are both > 0, then C(X,Y) = 0 ⇔ c(X,Y) = 0.

Theorem: If X and Y are independent, they are uncorrelated.

The theorem is important because it is often easier to recognize that two random variables are independent than uncorrelated, even though independence is a more stringent condition. For example, one recognizes coin flips immediately as independent.

o “Proof”: C(X,Y) = E((X – µX)(Y – µY)) = E(X – µX) · E(Y – µY) = 0 · 0 = 0
The second equality is where independence is used; it is some grinding algebra, but nothing deep.

o The converse is not true! Here is a counterexample of a pair of random variables that are uncorrelated but not independent:
P(X=1 and Y=2) = P(X=2 and Y=1) = P(X=2 and Y=3) = P(X=3 and Y=2) = ¼
This can be realized by a game where two dice are thrown repeatedly till they show a 1-2 pair or a 2-3 pair in any order. The outcome will be one of (1,2), (2,1), (2,3), (3,2), with equal probability.
To see that the two random variables are not independent, check the marginal (plain) probabilities:
P(X=1) = P(Y=1) = P(X=3) = P(Y=3) = ¼
P(X=2) = P(Y=2) = ½
=> P(X=1 and Y=2) = ¼ ≠ P(X=1) · P(Y=2) = ¼ · ½ = ⅛
Intuitively, the two random variables cannot be independent because if X=1 we know Y=2, for example.
To see that the two random variables are uncorrelated, calculate their covariance. Note that E(X) = E(Y) = 2, hence each summand (x – 2)(y – 2) · P(X=x and Y=y) in C(X,Y) has a factor 0, because each of the four pairs contains an outcome 2. Therefore C(X,Y) = 0.
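The counterexample above can be checked by brute force. The following is a minimal sketch in Python using exact fractions; the helper names (marginal, covariance, is_independent) are made up for this illustration and are not part of the course material.

```python
from fractions import Fraction

# Joint distribution of the counterexample: four equally likely pairs (x, y).
joint = {(1, 2): Fraction(1, 4), (2, 1): Fraction(1, 4),
         (2, 3): Fraction(1, 4), (3, 2): Fraction(1, 4)}

def marginal(joint, axis):
    """Marginal probabilities of X (axis=0) or Y (axis=1)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], Fraction(0)) + p
    return out

def covariance(joint):
    """C(X,Y) = E((X - E(X)) (Y - E(Y))), summed over all pairs."""
    ex = sum(x * p for (x, _), p in joint.items())
    ey = sum(y * p for (_, y), p in joint.items())
    return sum((x - ex) * (y - ey) * p for (x, y), p in joint.items())

def is_independent(joint):
    """True if every joint probability equals the product of the marginals."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return all(joint.get((x, y), Fraction(0)) == px[x] * py[y]
               for x in px for y in py)

cov = covariance(joint)
independent = is_independent(joint)
print("covariance:", cov, "independent:", independent)
```

Replacing the dictionary by nine pairs with probability 1/9 each gives the independent case, for which both checks succeed.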
o The following is an example of two independent variables:
P(X=1) = P(X=2) = P(X=3) = ⅓
P(Y=1) = P(Y=2) = P(Y=3) = ⅓
P(X=x and Y=y) = P(X=x) · P(Y=y)
Note that for independent variables we only need to specify the marginal probabilities, and the joint probabilities are obtained by multiplication. In this example the important thing is not that the probabilities of 1, 2, 3 are equal, but that they can be multiplied to obtain the joint probability of all pairs of values. This can be realized by a game where two dice are thrown till they both show a value of 3 or less.

o Both of the above examples are constructed by “conditioning”. This is often useful to go from a known situation to a slightly different one: simply single out the cases that you like and condition on them. In this case the instructor didn’t want to deal with 6 · 6 = 36 outcomes, which is why he scaled things down to outcomes 3 or less.

The Root-N Law and the Standard Error

We are finally able to determine the rate at which relative frequencies and means grow more precise.
It will be a disappointing result, because the precision gets better only very slooooowwwwwwly…

We consider a possibly long series of random variables with identical possible outcomes and identical probabilities for these outcomes (“identically distributed”), and we also assume the variables are uncorrelated:
X1, X2, X3, X4, X5, …, XN

As examples, keep in mind flipping a coin or rolling a die, but also daily stock market returns (which are surprisingly uncorrelated day to day), the monthly credit card bills of a randomly sampled series of households, measurements of blood glucose in a given patient (the measurements are slightly different even from the same blood sample due to measurement error), survival times of cancer patients treated with a new therapy,…

Note that X1 stands for the values of the first case across datasets, X2 for the values of the second case across datasets,… It therefore makes sense to talk about the probability distribution of the variables X1, X2,… The assumption of identical distribution has the consequence that not only are all the probabilities P(X1=x) = P(X2=x) = P(X3=x) = … the same for all possible values x, but so are the expected values and variances and SDs:
E(X1) = E(X2) = E(X3) = … = E(XN) = µ
V(X1) = V(X2) = V(X3) = … = V(XN) = σ²

We think of the whole series as repeatable: Over and over, we could
o flip another N coins,
o roll another N dice,
o look at another series of N daily stock returns,
o another sample of N households and their monthly credit card bills,
o another set of N blood glucose measurements from the same blood sample,
o another clinical trial with N treated patients and their survival times,…

We are now interested in the mean value of these outcomes:
X̄ = (X1 + X2 + X3 + X4 + X5 + … + XN) / N
Because of the assumed repeatability, X̄ is a random variable in its own right: every repetition would produce a slightly different mean. Its expected value is obviously E(X̄) = μ, but what is its SD?
Theorem: If X1, X2, …, XN are uncorrelated and identically distributed with the same variance σ², then
V(X̄) = σ² / N

o Proof:
V(X̄) = V(X1 + X2 + … + XN) / N²
= C(X1 + X2 + … + XN, X1 + X2 + … + XN) / N²
= ( V(X1) + V(X2) + … + V(XN) + … + C(Xi, Xj) + … ) / N²
= (N σ²) / N²
= σ² / N
The steps of the proof are as follows: 1) pull out the factor 1/N as 1/N²; 2) expand the variance of the sum into N variances and N(N–1) covariances; 3) use the fact that all covariances disappear; 4) use the fact that all variances are the same, σ².
(For those who enjoy math: This is really a giant application of a version of the theorem of Pythagoras. It is like taking a giant N-dimensional triangle, or N-angle, really, and doing something like this: hypotenuse² = (side 1)² + (side 2)² + (side 3)² + … + (side N)², where all sides are of equal length, so that hypotenuse² = N · (any side)². The quantity we are examining, though, is hypotenuse²/N² = (any side)²/N, which is σ²/N.)

o What is disappointing about this result? It becomes clear once we reformulate it in terms of standard deviations, which are the real measure of dispersion:
σ(X̄) = σ / √N

Definition: σ(X̄) is called the standard error of the mean. The standard error is a standard deviation, but only in a special case: when describing the variability of an estimate such as a mean across datasets.

Interpretation: The standard error of the mean is a measure of dispersion of the mean from dataset to dataset, assuming one could obtain datasets like the observed one over and over and over…

This mental exercise should give you something to think about. In any given data analysis, you are looking at one single dataset. You are calculating one number from a column, its mean (mean household income, say, and this could be something like $53,128.358). How come we are going to think that this number is “variable”? It’s one number, right? There are many households in the sample, but there is only one mean.
And how are we going to pretend we knew something about the “variability” of this one number? Well, the mental exercise starts from the realization of repeatability of the data collection. We could collect other datasets just like the one we have, at least in principle, and each time the mean would be slightly different.

The miracle of the root-N law for the standard error of the mean rests on the assumption that the cases/rows/records are uncorrelated, which is usually the case when the cases are obtained by independent sampling or can otherwise be thought of as arising independently of each other. This is where the math proof gives insight: the “Pythagorean miracle” happens only because we assume that the individual observations are uncorrelated (“orthogonal”) to each other.

Examples:
o Assume we are looking at household surveys of various sample sizes. To make things concrete, assume the observations are the household incomes, which may average around $50,000 with a SD of $30,000. Then:
N = 1: σ(X̄) = σ / 1 (= dispersion of the raw observations)
N = 100: σ(X̄) = σ / 10
N = 10,000: σ(X̄) = σ / 100
N = 1,000,000: σ(X̄) = σ / 1000
Thus the uncertainty in the mean household income drops to ±$3,000 (N = 100), to ±$300 (N = 10,000), to ±$30 (N = 1,000,000). We have a diminishing returns effect! Gaining 10-fold precision requires a 100-fold increase in sample size.

o Your employer conducted a survey of households on a shoestring budget, and the sample size was just N = 200. The manager is naturally dissatisfied with the precision of the estimates of product take-rates, average household income, average household spending, household preferences,… So he/she presses upper management for more money. He/she happily reports back to the group that conducted the survey, saying “I got sufficient funds to double the sample size, so we can slash the errors by a factor of two.” What should your response be?
“Apologies, but we’ll be able to reduce the errors only by about 30%, not 50%.” Why is this the correct response? The sample size grows from 200 to 400. The standard error decreases from σ/√200 to σ/√400. The ratio is
(σ/√400) / (σ/√200) = √(200/400) = 1/√2 = 0.7071068 ≈ 70%
Thus the reduction is not even quite 30%. To slash the standard error by half, one needs to quadruple the sample size!!!

Standard Error Estimate of the Mean

The root-N law and the standard error are theoretical so far because they rest on an unknown population quantity σ. While it is nice to have insights into how precision depends on the sample size N, it would be even nicer if the standard error could be estimated. This is indeed done and part of standard statistical practice: Although we don’t know σ, we can estimate it! The obvious estimate is the empirical standard deviation s of the observations:
\[ s = \left( \frac{1}{N-1} \left[ (X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_N - \bar{X})^2 \right] \right)^{1/2} \]
which in the limit N → ∞ goes to
\[ \sigma = \left( (x_1 - \mu)^2 \cdot P(X = x_1) + (x_2 - \mu)^2 \cdot P(X = x_2) + \dots \right)^{1/2} \]
In words:
o σ is the “true” or population SD “calculated” from infinitely many observations Xi.
o s is the estimated or sample SD calculated from the N observations X1, X2, …, XN of a single dataset.

Therefore, the natural estimate of the standard error is:
stderr(X̄) = s / √N

With this estimation step, we have achieved something remarkable: Based on one single dataset (the one we have in hand), we estimate how much the mean of a variable varies across datasets! Isn’t this stranger than strange? How is this possible? It is possible due to the math that goes into the root-N law. This math draws on the assumption that the cases/rows of the dataset are sampled independently. Such independence makes the N values of a variable uncorrelated if we could repeat data collections.
Having zero covariances between all N observations wipes out most terms in V(X̄), and the root-N miracle happens, leaving us with a population standard deviation that can be estimated from any single dataset… The full ramifications will become clear as we develop the notion of a confidence interval constructed from standard errors.

When polls around election time report a margin of error, it is based on the standard error of a proportion of voters. Recall that a proportion is just a mean of 0s and 1s, where 1 = ‘in favor of the incumbent’, 0 = ‘in favor of the challenger’.

As for terminology: the technically correct term “standard error estimate of the mean” is usually replaced with the shorter “standard error of the mean” or the even shorter “standard error”. This is technically not correct because the standard error is a theoretical population quantity, but the precise term is too much of a mouthful to bother.

Standard Errors in JMP: Take any dataset with quantitative variables and apply Distribution to them. For example, go to the dataset PennStudents.JMP and run the variables Height and Weight through Distribution. We focus on the bottom list labeled ‘Moments’:

HEIGHT:
Mean            67.754103
Std Dev         3.9749694
Std Err Mean    0.2012804
upper 95% Mean  68.149836
lower 95% Mean  67.358369
N               390

WEIGHT:
Mean            150.07821
Std Dev         30.051343
Std Err Mean    1.5217089
upper 95% Mean  153.07001
lower 95% Mean  147.0864
N               390

From Module 6 we know how to interpret the ‘Mean’ and the ‘Std Dev’ in conjunction with the bell curve. What is new is that we can make sense of the next three numbers, labeled ‘Std Err Mean’, ‘upper 95% Mean’ and ‘lower 95% Mean’:
o The ‘Std Err Mean’ is of course the standard error estimate of the mean.
We can confirm that it is obtained from the standard deviation (of the observations) by dividing by the root of N:
3.9749694 / √390 = 0.2012804
So: The mean, which is 67.75 for this dataset, would be different for other datasets, but it would vary around the population mean of Height (which we don’t know) with a standard deviation of about 0.2.
o The next two numbers, ‘upper 95% Mean’ and ‘lower 95% Mean’, are roughly the mean ± two standard errors. So why aren’t these two numbers exactly
67.754103 ± 2 · 0.2012804 = (67.35154, 68.15666)?
The reason is that the empirical rule as we formulated it with a nice factor 2 is not exact. JMP and all software packages calculate more exact numbers to achieve 95% coverage, but you see that JMP’s numbers are reasonably close to the rough-and-ready ± 2 stderr rule. When available, use JMP’s numbers; when not, use the empirical rule.

Wait a minute! How could JMP assume that the distribution of the means across datasets is approximately normally distributed? This is what seems to be going on when labeling these bounds as upper and lower bounds of a 95% coverage interval. Something is missing: the Central Limit Theorem.

The Central Limit Theorem

Theorem: If X1, X2, …, XN are mutually independent and identically distributed with the same population mean μ and the same population variance σ², then, as N → ∞, the variation of the sample means
X̄ = (X1 + X2 + … + XN) / N
from dataset to dataset resembles ever more a normal distribution with population mean μ and population variance σ²/N.

We knew the last part already: Whatever the distribution is, it must have population mean (expected value) μ and variance σ²/N, the latter due to the root-N law. The powerful part is that this distribution looks ever more like a bell curve. Unfortunately, we can’t indulge the intellectually curious with a proof or even a proof idea. The best we can do is to illustrate with a simulation, and this is what you are doing in Homework 5.
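For readers without JMP at hand, the idea of such a simulation can be sketched in a few lines of Python. This is not the homework or the in-class simulation; it is a minimal illustration under assumptions chosen here: the observations are drawn from a strongly skewed exponential distribution with μ = 1 and σ = 1, each dataset has N = 50 cases, and 4000 datasets are simulated with an arbitrary fixed seed.

```python
import random
import statistics

def sample_means(n, reps, rng):
    """Means of `reps` simulated datasets, each with n exponential(1) observations."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

rng = random.Random(7)        # arbitrary fixed seed so that the run is reproducible
n, reps = 50, 4000
means = sample_means(n, reps, rng)

mu, sigma = 1.0, 1.0          # population mean and SD of the exponential(1) distribution
se_theory = sigma / n ** 0.5  # root-N law: SD of the mean across datasets
sd_of_means = statistics.stdev(means)

# Empirical-rule checks for the means across datasets
within_1 = sum(abs(m - mu) <= se_theory for m in means) / reps
within_2 = sum(abs(m - mu) <= 2 * se_theory for m in means) / reps

print("theoretical standard error:", se_theory)
print("SD of simulated means:     ", sd_of_means)
print("fraction within 1 SE:", within_1, " within 2 SE:", within_2)
```

If the root-N law and the central limit theorem hold, the SD of the simulated means should be close to σ/√N, and the two fractions should be close to 2/3 and 19/20 even though the raw observations are far from normal.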
In class we will do another simulation using Sim 300x Uniform.JMP.

The powerful and counter-intuitive part of the central limit theorem (CLT) is that it does not matter what the distribution of the observations Xi of a variable is: means across datasets will look ever more normally distributed. In other words, even if your Distribution analysis of the variable/column with values (X1, X2, …, XN) looks skewed or discrete, the means of the same variable across datasets would look approximately normally distributed, and this approximation gets better as N → ∞. A rule of thumb is that for sample sizes as low as N = 50, the normal distribution is a good approximation to the distribution of means across datasets.

Reminder: We have been careful spelling out that the object of study is the distribution of the mean of a variable/column across datasets with N cases/rows. “Across datasets” means “across data collections”. Keep in mind: We are playing a mind game by examining hypothetically what means would look like if we could collect datasets over and over and over…

So we said this is a hypothetical mind game. In reality, if we are ever in the situation of collecting more than one dataset with the same variables, we will most likely not analyze the datasets separately. Instead, we will merge them into one larger dataset with many more cases and the same variables. If the two datasets were both of size N, the merged dataset will be of size 2N. (By how much can we hope to slash the standard error of the means of the variables?)

(A note on “meta-analysis” for the intellectually curious: There exists a situation in which one analyzes results from multiple datasets, namely, when one surveys research that has been going on for years and has produced multiple studies of roughly the same problem, resulting in datasets that all contain some of the same variables of interest.
This is typically the case in the medical field, where a disease is investigated over and over from various angles. Such studies will have some of the same variables and also some that are specific to them. When surveying such studies, one can use techniques from a statistical specialty called “meta-analysis”. Typically one has access only to the summary statistics such as means, standard deviations, and correlations of the variables as reported in papers published in scientific journals, but one does not have access to the multiple datasets themselves. By combining the estimates of multiple studies, meta-analytic techniques will then provide more accurate estimates than any of the individual studies.)

The Empirical Rule Based on the Central Limit Theorem

The upshot of the central limit theorem is that for moderate and large sample sizes (N ≥ 30), we can make approximate probability statements such as those of the empirical rule:
P( |X̄ – μ| ≤ 2σ/√N ) ≈ 19/20
P( |X̄ – μ| ≤ σ/√N ) ≈ 2/3
This is true but not directly useful, because σ is unknown. It becomes potentially useful once we try to estimate the unknown population standard deviation σ with a sample standard deviation s:
P( |X̄ – μ| ≤ 2s/√N ) ≈ 19/20
P( |X̄ – μ| ≤ s/√N ) ≈ 2/3
Are these still acceptable approximations? It turns out the answer is yes!

Here is why this is a non-trivial answer: By estimating σ with s, we incur dataset-to-dataset variability in s, just as in the sample mean X̄. Wouldn’t one expect this variation in s to destroy the nice empirical rule? Think about it: s undershoots σ as often as it overshoots, and when it overshoots, it makes the interval wider than necessary, so maybe the problem is not so bad. In fact, it isn’t. Here is what mathematical statistics found out: For small sample sizes we need to lift the factor 2 just a little bit, but for large sample sizes we can actually use a factor slightly below 2.
The following table lists factors for various sample sizes as suggested by the theory:

N:       10    15    20    30    40    50    60    75    100
Factor:  2.23  2.13  2.09  2.04  2.02  2.01  2.00  1.99  1.98

These factors used to be tabulated but are now computed by software such as JMP as needed. If we denote the factors by tN, the following probability statement is made exact, assuming the data themselves are normally and independently distributed:
P( |X̄ – μ| ≤ tN · s/√N ) = 0.95
Note the equal sign!

For all practical purposes, the factor 2 will be just fine if we only remember that it is a little too small for N less than about 50. For N ≥ 100, the factor may actually be conservative in many cases, which is not a problem. It only means the probability may be a tad greater than 0.95, such as 0.952 for N = 100. Now, these probabilities are computed assuming normal data. If the data are non-normal, such as skewed or discrete, the probability may be a touch below 0.95. In all,
P( |X̄ – μ| ≤ 2s/√N ) ≈ 0.95
is a pretty good rule, definitely for N ≥ 100 and, unless the data are crazy, even for N ≥ 50.

Insight into the problem discussed here developed in the early 1900s. Someone named Gosset did a mathematical investigation into the probability distribution of the quantity t = (X̄ – μ) / (s/√N), the so-called t-statistic, assuming that the observations X1, X2,… are all normally and independently distributed. He actually derived the density function for this statistic. What we denote as tN is the 97.5% quantile of this t-distribution. The t-distribution is indexed by its “degrees of freedom”, which for this statistic is strictly speaking N – 1 rather than N (a negligible difference except for very small samples), and you may encounter references to “a t-distribution with N – 1 degrees of freedom”.

Here are some trivia surrounding these discoveries, quoted from the Wikipedia (search “student’s t”): “The derivation of the t-distribution was first published in 1908 by William Sealy Gosset, while he worked at a Guinness Brewery in Dublin.
He was not allowed to publish under his own name, so the paper was written under the pseudonym Student. The t-test and the associated theory became well-known through the work of R.A. Fisher, who called the distribution "Student's distribution".”

Confidence Intervals

Above we looked at the probability of the statement |X̄ – μ| ≤ 2s/√N and came away with the message that it is close to 0.95 for N ≥ 50.

Preliminary observation for the next step:
o In words, the inequality |X̄ – μ| ≤ 2s/√N expresses the idea that the distance between X̄ and μ is no more than 2s/√N.
o There are two ways to express the same idea asymmetrically:
μ is no further away from X̄ than 2s/√N: X̄ – 2s/√N ≤ μ ≤ X̄ + 2s/√N
X̄ is no further away from μ than 2s/√N: μ – 2s/√N ≤ X̄ ≤ μ + 2s/√N
It is the first of these two asymmetric formulations we now consider.

The figure below illustrates the three ways of looking at the condition. The top of the figure shows the distance between X̄ and μ and compares it with 2s/√N, which here is larger. The middle of the figure shows an interval centered at X̄ of half-width 2s/√N catching the value μ. The bottom of the figure shows an interval centered at μ of half-width 2s/√N catching the value X̄.

We rewrite the probability statement in the following suggestive form:
P( X̄ – 2s/√N ≤ μ ≤ X̄ + 2s/√N ) ≈ 0.95
In words: The interval “X̄ ± 2s/√N” catches the true population mean μ for about 19 out of 20 datasets. This interval is called:
a 95% CONFIDENCE INTERVAL for the unknown population mean μ
A common abbreviation for “confidence interval” is “CI”, so we may say “the 95% CI for the mean is …” The number 95% or 0.95 refers to the “coverage probability”, where “coverage” refers to covering the true value μ in the interval.

Q: What would be the coverage probability of the CI X̄ ± s/√N?
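The coverage claim for the interval X̄ ± 2s/√N can be checked by simulation. The following Python sketch is an illustration under assumptions made here, not part of the course software: the data are normal with the household-income figures used earlier (μ = 50,000, σ = 30,000), each dataset has N = 50 cases, and the function names ci_covers and coverage are hypothetical.

```python
import random
import statistics

def ci_covers(sample, mu, factor=2.0):
    """True if the interval mean ± factor * s / sqrt(N) contains mu."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / n ** 0.5   # standard error estimate s / sqrt(N)
    return xbar - factor * stderr <= mu <= xbar + factor * stderr

def coverage(mu, sigma, n, reps, rng, factor=2.0):
    """Fraction of simulated datasets whose CI catches the fixed target mu."""
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        hits += ci_covers(sample, mu, factor)
    return hits / reps

rng = random.Random(2024)   # arbitrary fixed seed for reproducibility
cov95 = coverage(mu=50_000, sigma=30_000, n=50, reps=2000, rng=rng)
print("estimated coverage of mean ± 2 stderr:", cov95)
```

Each simulated dataset yields its own X̄ and its own s, so both the center and the width of the interval change from run to run while μ stays fixed. Passing a different value for factor shows how the coverage responds to the half-width.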
Two vexing aspects of confidence intervals:
o CIs are random intervals because they are constructed from datasets: each dataset produces one value for X̄ and one for s, and both vary from dataset to dataset.
o The target μ, by contrast, does not vary. In this mental game it is fixed but unknown across data collections.

You can compare the situation to blindly shooting an arrow with a wide suction cup as a tip at a bull’s eye, and 19 out of 20 times the arrow’s suction cup covers the very center point of the bull’s eye (μ). Note that not only is it random where the arrow hits (X̄), but random is also the radius of the suction cup (2s/√N). Of course this is a two-dimensional metaphor for something that is going on in one dimension only.

The figure shows 100 CIs from 100 simulated datasets, each of size N = 20. The dot shows the horizontal location of X̄, and the vertical centerline shows the fixed target μ. The 100 horizontal line segments represent the 100 CIs, vertically spread out for better comparison. Note that there are shorter and longer line segments, representing the variability in s. It appears that 6 or 7 intervals are missing the true value μ, in rough agreement with the approximate 5% missing rate expected in the long run as the number of datasets goes to infinity. In reality, you will see only one dataset and hence one CI for a variable, but you have to mentally embed this one CI in this picture.

CIs in practice: We return to the analysis of the variables Height and Weight in the dataset PennStudents.JMP:

HEIGHT:
Mean            67.754103
Std Dev         3.9749694
Std Err Mean    0.2012804
upper 95% Mean  68.149836
lower 95% Mean  67.358369
N               390

WEIGHT:
Mean            150.07821
Std Dev         30.051343
Std Err Mean    1.5217089
upper 95% Mean  153.07001
lower 95% Mean  147.0864
N               390

First, recall that Std Err Mean = Std Dev / √N. We now understand “upper 95% Mean” and “lower 95% Mean”: These represent the 95% CI = Mean ± tN · Std Err Mean, where tN ≈ 2.
The interpretation is:
o The interval (67.36 in, 68.15 in) has about a 95% chance of catching the population mean height of Penn students.
o The interval (147.1 lb, 153.1 lb) has about a 95% chance of catching the population mean weight of Penn students.

Again:
o The unknown “population mean height” is fixed while the CI varies from sample to sample. This particular CI (67.36, 68.15) has a 95% chance of containing the population mean height, and this is all we can know about the population mean height.
o The unknown “population mean weight” is fixed while the CI varies from sample to sample. The particular CI (147.1, 153.1) has a 95% chance of containing the population mean weight, and this is all we can know about the population mean weight.

In either case, are we ever going to know whether these particular two intervals contain the respective population means? No, we will never know, yet it’s the best game we can play.

The trade-off between precision and uncertainty: We will never know whether the interval actually contains the population mean μ, but we can play with the width of the interval, that is, with the precision requirement:
o We could lower the precision by widening the CI to three standard errors, which would reduce the uncertainty by raising the coverage probability from 0.95 to 0.997.
o We could raise the precision by narrowing the CI to one standard error, which would raise the uncertainty by lowering the coverage probability from 0.95 to 0.68.
There is no escaping the trade-off between precision and uncertainty. The conventional trade-off is to ask for 0.95 coverage probability, leaving a 5% uncertainty, implying a precision of ±2 stderr.

When would we want a wider CI? If the cost of acting on the CI is high. For example, if a clinical trial says average survival times increase by 2 years with a new treatment, wouldn’t you want to be quite certain before switching to the new treatment?
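The coverage probabilities quoted for one, two, and three standard errors come from the normal approximation. As a small sketch (the function name normal_coverage is chosen here for illustration), the probability that a normally distributed mean lands within k standard errors of μ can be computed with the error function from Python’s standard library, using P(|Z| ≤ k) = erf(k/√2) for a standard normal Z:

```python
import math

def normal_coverage(k):
    """P(|Z| <= k) for a standard normal Z, i.e. the coverage of a ± k stderr interval."""
    return math.erf(k / math.sqrt(2))

coverages = {k: normal_coverage(k) for k in (1, 2, 3)}
for k, c in coverages.items():
    print(f"± {k} stderr: coverage ≈ {c:.4f}")
```

The three results round to roughly 0.68, 0.95, and 0.997, the values used in the trade-off above. They rest on the normal approximation, so for small N the factors tN discussed earlier are the more careful choice.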
Standard Error Estimates and CIs for Proportions

The case of a random variable with 0/1 outcomes is so special for its simplicity and importance that we should examine it separately. It even has a special name: a Bernoulli random variable. When X=1 and X=0 are the only values, then
o the sample mean of N realizations X1, X2, X3, …, XN is just the relative frequency or proportion of 1’s, and
o the population mean is just the probability of observing 1.
For these reasons, one writes
p = P(X = 1) and p̂ = X̄ = #{Xi = 1}/N
This notation indicates that the proportion p̂ is an estimate of the probability p.

Terminology: X is called a Bernoulli random variable with parameter p.

Variance and standard deviation: We found in Module 6 that the population variance and standard deviation of X are
V(X) = p(1–p), σ(X) = √(p(1–p))
Now doesn’t this suggest that the sample variance and standard deviation should be the following?
s²(X) = p̂(1–p̂), s(X) = √(p̂(1–p̂))
Very close! In fact, if we calculate the sample variance according to the usual formula and make use of the fact that the values are only 1’s and 0’s, we get
\[ \frac{1}{N-1} \left( (X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_N - \bar{X})^2 \right) = \hat{p}\,(1 - \hat{p})\,\frac{N}{N-1} \]
Now, there is a reason to ignore the factor N/(N–1), which is close to 1 anyway, and we don’t need to know the details. Hence we take the formulas above, s²(X) = p̂(1–p̂) and s(X) = √(p̂(1–p̂)), as the final definitions.

Standard errors: Sample standard deviations of 0/1 outcomes have really no practical meaning because there is certainly no empirical rule that applies here. Instead, the purpose of the standard deviation is as an aid in calculating a standard error estimate of the proportion:
stderr(p̂) = √( p̂(1–p̂) / N )
We can restate the empirical rule for proportions as follows:
P( |p̂ – p| ≤ 2 stderr(p̂) ) ≈ 0.95
In words: The sample proportion p̂ has a chance of about 19 in 20 to catch the true probability p within two standard errors.
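The relation between the usual sample variance and p̂(1–p̂) can be verified numerically. The sketch below uses a made-up column of 30 ones and 70 zeros (not data from the course) and a hypothetical helper proportion_stderr that implements the standard error formula above.

```python
import math
import statistics

def proportion_stderr(p_hat, n):
    """Standard error estimate of a proportion: sqrt(p_hat * (1 - p_hat) / N)."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical 0/1 column: 30 ones and 70 zeros
data = [1.0] * 30 + [0.0] * 70
n = len(data)
p_hat = statistics.fmean(data)            # the proportion of 1's is the sample mean

sample_var = statistics.variance(data)    # usual formula with divisor N - 1
shortcut_var = p_hat * (1 - p_hat)        # simplified definition without N / (N - 1)

print("usual sample variance:       ", sample_var)
print("p_hat (1 - p_hat) N / (N - 1):", shortcut_var * n / (n - 1))
print("standard error of p_hat:     ", proportion_stderr(p_hat, n))
```

The first two printed numbers agree up to floating-point rounding, which is the identity stated above; dropping the factor N/(N–1) changes the variance by about 1% at N = 100.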
Application: Consider a poll of a candidate based on 1000 respondents (= people bothered by phone during dinner time, yet willing to volunteer an answer). Let’s say 465 were in favor of candidate Z. The proportion is p̂ = 0.465. The standard error (…estimate of the proportion) is
stderr = √(0.465 · 0.535 / 1000) = 0.01577.
Thus two standard errors is 0.03154, and the newspapers will report “candidate Z favored by 46.5% of likely voters, with a margin of error of 3%”.

Clarification: The newspapers invented the term “statistical dead heat”. They mean that based on the margin of error one cannot be sure that one candidate is ahead of the other. You should realize that this “dead heat” is less a property of what is going on in the population than a definition based on convention and sample size. It is assumed (and this is a convention) that the “margin of error” should be based on two standard errors, implying a coverage probability of the true proportion of 95%. Also, the pollsters seem to have taken sample sizes around 1000 as the standard (see for example the sample sizes quoted in BushJobRatingsGallup.JMP). These two facts combine to a definition of “statistical dead heat”. One would have fewer dead heats if one made the CIs narrower and allowed greater uncertainty in coverage, and/or if one used sample sizes greater than 1000. For your own thinking, you could switch to a confidence interval based on ±stderr, which leaves you with an uncertainty of 1/3 instead of 1/20, but lets you gamble that candidate Z is ahead or lagging.

Evidence for/against μ: Rejection and Significance Levels based on CIs

CIs are random intervals centered at the sample mean that, like a fishing net, try to catch something, here the population mean μ.
Now let us change the point of view: let us center things at the unknown population mean μ, as in the following two figures:

We drew the bell curve because based on the CLT it is a good approximation for the dataset-to-dataset distribution of sample means X̄ around the population mean μ. As the figures state, μ makes X̄ look more likely in the first case than in the second case. The farther X̄ is from μ, the lower the density function and the lower the probability of X̄ gets.

Turning things around, we now ask what evidence X̄ lends for the unknown μ. The following seems reasonable:
o When μ makes X̄ more probable, X̄ lends more evidence for μ.
o When μ makes X̄ less probable, X̄ lends less evidence for μ.
The principle at work here is:
o Population means μ assign probabilities to sample means X̄.
o Sample means X̄ assign evidence to population means μ.
Strictly speaking, of course, it is the pair of parameters (μ, σ) that define a normal population, and it is this population that assigns probabilities to intervals of values of X̄. Conversely, it is X̄ in conjunction with the standard error estimate s(X)/√N that together assign evidence to values of μ. Never mind, the two bullets above are more catchy and more memorable.

The evidence game has given rise to the following formulation:
Values of μ that are further away from X̄ than two standard errors are rejected at the 5% significance level.
This is language from the theory of “statistical tests”, which we’ll look into in the next module. For now it gives us another way of interpreting the values inside and outside the CI:
o The values inside the CI are possible population means μ for which there is no evidence to reject them.
o The values outside the CI are possible population means μ for which there is evidence to reject them.

Example: Above we looked at political candidate Z who had 46.5% of respondents favoring him/her. Two standard errors of this proportion turned out to be about 3.2%.
Thus the confidence interval CI is 46.5% ± 3.2% = (43.3%, 49.7%). This implies that the value μ = 50% is rejected at the 5% significance level because it falls outside the 95% CI. The evidence at the 5% significance level is that the candidate does not have a majority.

Please, do not confuse the various percentages: the proportion 46.5% and the CI refer to proportions of likely voters. So does the hypothetical boundary value 50% that divides majority from minority. The two values 95% and 5%, however, refer to strength of evidence in these voter proportions: 95% is the probability that the CI catches the true population proportion; 100% – 95% = 5% is the “unlikeliness” of finding the sample mean this far out in the tail.

Coverage probabilities of CIs and significance level of rejection: If the CI has another half-width, for example, ± 3 stderr, then the coverage probability is 0.997, and we will say that we reject the values outside this CI at the 0.3% significance level. If we wanted to play the game at the 1% significance level, we would have to use CIs with coverage probability 99%, requiring a half-width of about ± 2.6 stderr. In general, if the coverage probability of the CI is 1 – α, we say the values outside the CI are rejected at the α·100% significance level.
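The poll example can be worked through in code. The following Python sketch uses the numbers from the text (465 of 1000 respondents in favor); the function names proportion_ci and is_rejected are hypothetical helpers introduced here, and the intervals use the rough factors 2 and 3 rather than exact quantiles.

```python
import math

def proportion_ci(successes, n, k=2.0):
    """Interval p_hat ± k * stderr(p_hat) for a sample proportion."""
    p_hat = successes / n
    stderr = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - k * stderr, p_hat + k * stderr

def is_rejected(value, ci):
    """A hypothesized population value is rejected if it falls outside the CI."""
    lower, upper = ci
    return not (lower <= value <= upper)

ci_2se = proportion_ci(465, 1000, k=2.0)   # about 95% coverage, 5% significance level
ci_3se = proportion_ci(465, 1000, k=3.0)   # about 99.7% coverage, 0.3% significance level

print("± 2 stderr CI:", ci_2se, "rejects 50%:", is_rejected(0.5, ci_2se))
print("± 3 stderr CI:", ci_3se, "rejects 50%:", is_rejected(0.5, ci_3se))
```

With two standard errors the value 50% falls outside the interval and is rejected, as in the example. With three standard errors the interval widens enough to contain 50%, so the same data no longer reject a majority at the 0.3% significance level, which illustrates how the verdict depends on the chosen coverage probability.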