Sociology 6Z03 Topic 13: Confidence Intervals
John Fox
McMaster University
Fall 2016

Outline: Confidence Intervals

Introduction
Confidence Interval for the Population Mean
Varying the Level of Confidence
Choosing the Sample Size, n
Cautions Concerning Confidence Intervals

Introduction
From Population to Samples

Implicitly using results derived from probability theory, we have, thus far, been reasoning deductively from characteristics of a known population to characteristics of samples drawn at random from that population.

Thought Question

Suppose that the population of single-parent families in Canada has an average annual income of µ = $35,000 and a standard deviation of σ = $10,000 (both made up). The means x̄ from repeated samples of size n = 100 drawn randomly from this population:

A. are approximately normally distributed with a mean of $35,000 and a standard deviation of SD(x̄) = σ = $10,000.

B. are approximately normally distributed with a mean of $35,000 and a standard deviation of

\[ SD(\bar{x}) = \frac{\sigma}{\sqrt{n}} = \frac{10{,}000}{\sqrt{100}} = \$1{,}000 \]

C. I don’t know.

From Sample to Population

When we draw a sample in a real application of statistics, of course, we do not know the characteristics of the population. If the characteristics of the population were known, then there would be no point in sampling!

Moreover, in real applications, the researcher draws a single sample of size n, not repeated samples. If a researcher had the resources to draw 1,000 samples each of size n = 100, then he or she would treat these data as a single larger sample of n = 1,000 × 100 = 100,000 cases.
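Although a researcher would not draw repeated samples in practice, a computer simulation can, which offers one way to explore the thought question above. The following is a minimal Python sketch and is not part of the original slides. It assumes, purely for illustration, that incomes are normally distributed (the slides do not say so), and the function name `simulate_sample_means`, the number of replications, and the seed are arbitrary choices.

```python
import random
import statistics


def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` random samples of size n and return their sample means.

    Assumption for illustration only: the population is normal with
    mean `mu` and standard deviation `sigma`.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]


# Made-up population values from the thought question.
means = simulate_sample_means(mu=35_000, sigma=10_000, n=100, reps=2_000)

mean_of_means = statistics.fmean(means)  # centre of the simulated sampling distribution
sd_of_means = statistics.stdev(means)    # spread of the simulated sample means

print(round(mean_of_means), round(sd_of_means))
```

Under these assumptions, the mean of the simulated sample means should come out close to µ, and their standard deviation close to σ/√n rather than σ.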
From Sample to Population: Statistical Inference

The central issue in statistical inference is to draw conclusions inductively about the population on the basis of a single sample of size n drawn from it.

There are two common modes of statistical inference:

1. Estimation

We want, on the basis of our data, to derive a “best guess” of the value of a population parameter, such as the population mean income µ. Such a best guess is called a point estimate.

We know, however, that point estimates (which are sample statistics, like the sample mean x̄) vary from sample to sample. It is generally desirable to reflect the uncertainty due to sampling variation in an interval estimate, also called a confidence interval.

Typically, a confidence interval takes the form of a point estimate ± a margin of error.

2. Hypothesis Tests

Sometimes we are interested in establishing whether or not a parameter is equal to a specific value. For example, we might want to learn whether or not the population mean income of men and women is the same; that is, whether the difference in their mean income, µ_Men − µ_Women, is zero.

A statistical hypothesis test, also called a “significance” test, tells us the degree to which the data support the hypothesis that a parameter is equal to a particular value.

Confidence Interval for the Population Mean
From Population to Samples

We will begin by assuming, unrealistically, that we know the population mean µ and standard deviation σ, and, consequently, that we know the sampling distribution of sample means for samples of size n; that is,

\[ \bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \]

In our example, where µ = 35,000, σ = 10,000, and n = 100, recall that x̄ ∼ N(35,000; 1,000).
Using the 68–95–99.7 rule for the normal distribution, we know that about 95 percent of sample means x̄ will be within two SD(x̄) of the population mean µ, as shown in the graph on the next slide.

[Figure: Sampling distribution of the sample mean x̄, centred at µ = 35,000. 95 percent of samples fall between µ − 2 × SD(x̄) = 33,000 and µ + 2 × SD(x̄) = 37,000, with 2.5 percent of samples in each tail.]

Thought Question

(A) True or (B) False? For this example:

95 percent of sample means are in the interval
µ ± 2 × SD(x̄) = 35,000 ± 2 × 1,000 = 35,000 ± 2,000

2.5 percent of sample means are below
µ − 2 × SD(x̄) = 35,000 − 2 × 1,000 = 33,000

and 2.5 percent of sample means are above
µ + 2 × SD(x̄) = 35,000 + 2 × 1,000 = 37,000

Reversing the Interval

Put another way: In 95 percent of samples, the population mean is within two SD(x̄) of the sample mean x̄.

If, therefore, we construct an interval of width ±2 × SD(x̄) around the sample mean,

x̄ ± 2 × SD(x̄) = x̄ ± 2 × 1,000 = x̄ ± 2,000

then this interval will include the population mean µ in 95 percent of repeated samples.

For example, a researcher draws a sample of size n = 100 and calculates the sample mean x̄ = 36,000. The researcher does not know the population mean µ, but (let us suppose) does know that the population standard deviation is σ = 10,000.
Then, he or she would calculate the interval

x̄ ± 2 × SD(x̄) = x̄ ± 2,000 = 36,000 ± 2,000

Definition

This interval, which has the form

point estimate ± margin of error

is called a confidence interval for the unknown population mean µ.

In this case, the population mean µ = 35,000 (which is known to us, but not to the researcher) falls within the confidence interval:

[Figure: The confidence interval from x̄ − 2 × SD(x̄) = 34,000 to x̄ + 2 × SD(x̄) = 38,000, centred at x̄ = 36,000, with the true µ = 35,000 marked inside it.]

These conditions are unrealistic: In real applications, when µ is unknown, then so is σ. But we want to keep things simple for now. Later on, we’ll learn how to handle the situation where the population standard deviation σ is also unknown.

Interpretation of Confidence Intervals: Behaviour with Repeated Sampling

[Figure: Confidence intervals for repeated samples, drawn beneath the sampling distribution of x̄ (labelled “Known only to 'God'”). 95% of samples fall between µ − 2 × SD(x̄) and µ + 2 × SD(x̄), with 2.5% of samples in each tail; 2.5% of intervals miss low and 2.5% of intervals miss high. The researcher has only one sample.]

Thought Question

(A) True or (B) False?

In 95 percent of samples, the sample mean x̄ falls in the interval µ ± 2 × SD(x̄). When this happens, the population mean µ also falls within the confidence interval x̄ ± 2 × SD(x̄).

In 2.5 percent of samples, the sample mean x̄ exceeds µ + 2 × SD(x̄), and when this happens, the population mean µ falls below the lower bound of the confidence interval.

In 2.5 percent of samples, the sample mean x̄ is below µ − 2 × SD(x̄), and therefore the population mean µ is above the upper bound of the confidence interval.
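The behaviour of the interval under repeated sampling can also be examined by simulation. The sketch below is an illustration rather than anything from the slides: it assumes a normal population with the made-up values µ = 35,000 and σ = 10,000, treats σ as known, and uses a hypothetical helper named `coverage` with an arbitrary seed and number of replications.

```python
import math
import random
import statistics


def coverage(mu, sigma, n, reps, z=2.0, seed=1):
    """Estimate how often the interval xbar ± z * sigma / sqrt(n) contains mu.

    Assumption for illustration only: the population is normal, and
    sigma is known to the researcher (as in the slides' simplified setup).
    Returns the proportions of intervals that (contain mu, lie entirely
    below mu, lie entirely above mu).
    """
    rng = random.Random(seed)
    margin = z * sigma / math.sqrt(n)
    hits = below = above = 0
    for _ in range(reps):
        xbar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar + margin < mu:
            below += 1   # upper bound falls short of mu
        elif xbar - margin > mu:
            above += 1   # lower bound overshoots mu
        else:
            hits += 1    # interval contains mu
    return hits / reps, below / reps, above / reps


hit_rate, below_rate, above_rate = coverage(
    mu=35_000, sigma=10_000, n=100, reps=2_000
)
print(hit_rate, below_rate, above_rate)
```

Under these assumptions the hit rate should land near 95 percent, with the misses split roughly evenly between the two sides. Any single simulated interval either contains µ or does not; the 95 percent describes the procedure, not one interval.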
In summary, then, in 95 percent of samples, the confidence interval x̄ ± 2 × SD(x̄) includes the unknown population mean µ, and in 5 percent of samples, the confidence interval fails to include µ.

For this reason, x̄ ± 2 × SD(x̄) is called a 95 percent confidence interval.

Correct Interpretation of Confidence Intervals

Important Point

Here is the correct interpretation of the 95-percent confidence interval x̄ ± 2 × SD(x̄) = 36,000 ± 2,000:

The researcher is 95 percent confident that the unknown population mean µ is somewhere between $34,000 and $38,000, in the sense that he or she has employed a procedure that produces the right answer 95 percent of the time with repeated sampling (and the wrong answer 5 percent of the time).

The researcher does not know whether or not µ is in the interval that is constructed for this particular sample.

Common Misinterpretations of Confidence Intervals

“The probability is 95 percent that the unknown population mean µ is in the interval x̄ ± 2 × SD(x̄) = 36,000 ± 2,000.”

The population mean µ is either in the interval (in which case the probability that it is in the interval is 1), or it is not in the interval (in which case the probability that it is in the interval is 0). In the example, we (because of our assumed omniscience) know that µ = 35,000 is in the interval, but the researcher does not know this.
“The probability is 95 percent that a family selected at random has an income in the interval x̄ ± 2 × SD(x̄) = 36,000 ± 2,000.”

The distribution of individual income scores x in the population has a mean of µ (not x̄), a standard deviation of σ (not σ/√n), and is not necessarily a normal distribution.

“The probability is 95 percent that the sample mean x̄ is in the interval x̄ ± 2 × SD(x̄) = 36,000 ± 2,000.”

This is just nonsense: The mean x̄ = 36,000 for this particular sample is certainly in the interval, since the interval is centred around it.

“When we sample repeatedly, 95 percent of sample means x̄ are in the interval x̄ ± 2 × SD(x̄) = 36,000 ± 2,000.”

The sampling distribution of x̄ is centred on the population mean µ = 35,000, not on the sample mean x̄ = 36,000 for a particular sample. Thus 95 percent of sample means are in the interval µ ± 2 × SD(x̄) = 35,000 ± 2,000.

Varying the Level of Confidence

So far, we have constructed a confidence interval at the 95 percent or .95 confidence level. This is the confidence level that is most commonly used.

Other common levels are 90 percent (or .9) and 99 percent (or .99).

More generally, to construct a confidence interval for the mean at the confidence level C, we take

\[ \bar{x} \pm z^* \times SD(\bar{x}) = \bar{x} \pm z^* \frac{\sigma}{\sqrt{n}} \]

where the area between −z* and z* under the standard normal distribution is C:

[Figure: Standard normal density, with probability C between −z* and z* and probability (1 − C)/2 in each tail.]

Notice that the area in each “tail” of the distribution is (1 − C)/2.
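The relationship between C and z* described above can be turned into a small calculation. This is a sketch using Python's standard-library `statistics.NormalDist`; the function name `critical_value` is a hypothetical choice, not something from the slides.

```python
from statistics import NormalDist


def critical_value(confidence):
    """Return z* such that the area between -z* and z* is `confidence`."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    tail = (1 - confidence) / 2              # area in each tail
    return NormalDist().inv_cdf(1 - tail)    # point with area 1 - tail to its left


for level in (0.90, 0.95, 0.99):
    print(f"C = {level:.2f}: z* = {critical_value(level):.3f}")
```

Because each tail holds (1 − C)/2 of the area, z* is the point with cumulative probability 1 − (1 − C)/2 under the standard normal curve, which is what the inverse CDF returns.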
Critical Values of z

Here are the critical values z* corresponding to the common confidence levels:

Confidence level   One-tail area   z*
90%                .05             1.645
95%                .025            1.960 ≈ 2
99%                .005            2.576

For the example, we get the following confidence intervals at the 90%, 95%, and 99% levels of confidence:

90% CI: 36,000 ± 1.645 × 1,000 = 36,000 ± 1,645
95% CI: 36,000 ± 1.960 × 1,000 = 36,000 ± 1,960
99% CI: 36,000 ± 2.576 × 1,000 = 36,000 ± 2,576

Thought Question

We would like the confidence interval to be as narrow as possible; that is, we want a small margin of error. This example illustrates an important characteristic of confidence intervals:

A. If we want greater confidence that the parameter is included in the interval, then we need to construct a wider interval.

B. If we want greater confidence that the parameter is included in the interval, then we need to construct a narrower interval.

C. The width of the confidence interval is unrelated to the level of confidence.

D. I don’t know.

Factors Affecting the Margin of Error

There are three factors that affect the margin of error

\[ z^* \frac{\sigma}{\sqrt{n}} \]

1. To make the confidence level larger, z* must get larger, which produces a larger margin of error.

2. The more variable the scores are in the population (that is, the larger the value of σ), the larger the margin of error. It is easier to estimate µ precisely in a homogeneous population than in a heterogeneous one. Because the population standard deviation is not under our control, we cannot achieve greater precision by making it smaller.
3. Because √n is in the denominator of the margin of error, we get greater precision from large samples than from small ones.

Thought Question

To cut the margin of error in half (i.e., to make the width of the confidence interval half as large), we have to:

A. double the sample size n.

B. halve the sample size n.

C. make the sample size n four times as large.

D. I don’t know.

Choosing the Sample Size, n

Say that we desire a particular margin of error m. We’ve decided to use confidence level C, corresponding to the standard-normal value z*, and we know that the population standard deviation is σ.

We want to figure out the sample size n that is required to achieve this margin of error. The margin of error is

\[ m = z^* \frac{\sigma}{\sqrt{n}} \]

Solving for n produces

\[ n = \left( \frac{z^* \sigma}{m} \right)^2 \]

If the computed value of n is not a whole number, then round up to the next whole number.

Example

To illustrate, suppose that we want to construct a C = 95 percent confidence interval for the mean income µ in a population in which the standard deviation of income is σ = $10,000 (as in our previous example). We want our confidence interval to have a margin of error of m = $200. Then, the required sample size is

\[ n = \left( \frac{1.960 \times 10{,}000}{200} \right)^2 = 9{,}604 \]

or nearly 10,000 families.

Cautions Concerning Confidence Intervals

For the formula for the confidence interval

\[ \bar{x} \pm z^* \frac{\sigma}{\sqrt{n}} \]

to be accurate, the data must be an SRS (simple random sample) from a large population. The formula is not correct for complex probability sampling designs such as stratified samples.
There is no correct method for constructing confidence intervals for haphazardly (nonrandomly) selected data.

Because the sample mean x̄ can be strongly affected by outliers, so can the confidence interval.

If the population is not normal, and the sample size is very small, then the formula may not be accurate.

Even if the formula is accurate, it might not be sensible to use the mean as a summary, for example, when the population distribution of x is very skewed.

To use the formula, you need to know the population standard deviation σ. In a large sample you can safely substitute the sample standard deviation s to get an approximate confidence interval

\[ \bar{x} \pm z^* \frac{s}{\sqrt{n}} \]

We’ll learn later what to do when the sample size is small.

The margin of error covers only random sampling errors. Other sources of error, such as undercoverage and nonresponse in surveys, are not included in the margin of error.
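Two of the calculations in these slides, the required sample size n = (z*σ/m)² and the large-sample interval x̄ ± z* s/√n, can be sketched in a few lines of Python. This is an illustration under the slides' own assumptions (a simple random sample, and a large n when s stands in for σ). The function names `critical_value`, `required_sample_size`, and `large_sample_interval` are hypothetical, and z* is computed from the standard normal distribution rather than read from a table.

```python
import math
from statistics import NormalDist, fmean, stdev


def critical_value(confidence):
    """Return z* such that the area between -z* and z* is `confidence`."""
    return NormalDist().inv_cdf((1 + confidence) / 2)


def required_sample_size(margin, sigma, confidence=0.95):
    """Smallest whole n for which z* * sigma / sqrt(n) is at most `margin`."""
    z_star = critical_value(confidence)
    return math.ceil((z_star * sigma / margin) ** 2)  # round up, as the slides advise


def large_sample_interval(data, confidence=0.95):
    """Approximate interval for the mean, substituting s for sigma.

    Only appropriate for large simple random samples; it does not
    account for nonsampling errors such as nonresponse.
    """
    n = len(data)
    x_bar = fmean(data)
    s = stdev(data)  # sample standard deviation
    margin = critical_value(confidence) * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin


# Worked example from the slides: sigma = $10,000, m = $200, C = 95 percent.
n_needed = required_sample_size(margin=200, sigma=10_000, confidence=0.95)
print(n_needed)  # 9604
```

Rounding up with `math.ceil` reflects the slides' advice to go to the next whole number, since rounding down would leave the margin of error slightly larger than the target.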