Chapter 18: Sampling Distributions

All random variables have probability distributions, and since statistics are random variables, they too have distributions. The random phenomenon that produces the statistics of interest in this chapter, p̂ and ȳ, is random sampling from a population. Consequently, the distribution of the statistic (p̂ or ȳ) is called the sampling distribution. While the concepts and theory behind sampling distributions are sometimes difficult, the sampling distributions of both statistics usually can be approximated by normal distributions.

Example: Gallup conducted a poll of 2,475 likely voters between October 30th and November 1st, 2008 (a few days before the 2008 Presidential election). 50.6% of likely voters said that they would vote for Barack Obama. Assume for the moment that it is November 2nd, 2008.

• Let p denote the proportion of all voters that will vote for Obama.
• The statistic (an estimator of p) is p̂, and its realization was p̂ = 1253/2475 = .5063.
• It's highly unlikely that p̂ = p, though the difference p̂ − p is likely to be small if the sample size is large.

A measure of error is essential to properly interpret the estimate; most polling organizations report the margin of error. It's impossible to know the error before the election results are tabulated and p is computed. In fact, it's extremely rare to ever know the parameter. Consequently, computing p̂ − p as a measure of error is infeasible, and some other approach is needed to characterize the accuracy (or error) of a statistic. One characterization of the error is the standard deviation of the sampling distribution of p̂. In this case, the standard deviation is the average difference between p and a value of p̂ obtained from random sampling. The standard deviation of p̂ can be estimated from the sample.
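The last point can be made concrete with a short sketch. Anticipating the formula √(pq/n) developed later in the chapter, the standard deviation of p̂ is estimated by substituting p̂ for the unknown p. A minimal Python illustration using the Gallup poll numbers:

```python
import math

# Gallup poll: 1253 of 2475 likely voters favored Obama
n = 2475
p_hat = 1253 / n

# Estimated standard deviation (standard error) of p_hat,
# substituting p_hat for the unknown p in sqrt(p*q/n)
se = math.sqrt(p_hat * (1 - p_hat) / n)

print(round(p_hat, 4))  # 0.5063
print(round(se, 4))     # 0.0101
```

So the estimated average error of this poll's p̂ is about one percentage point, even though the actual error p̂ − p remains unknown until the election.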
The characterization of error using the standard deviation can be criticized because the average error is used, the realized value is but one of many possible realizations (each depending on the drawn sample), and the observed realization is highly unlikely to be exactly one standard deviation from p. Specifically, the estimated expected error is reported. The expected error is the average error if the method were to be used many times (the method being: collect a random sample of size n and compute p̂, many times).

Definition: The distribution of p̂ across all possible samples of a given size is called the sampling distribution of p̂.

• The standard deviation of the sampling distribution is the expected (or average) error of the estimator.
• Statistical theory provides a technique for using a single sample to approximate the sampling distribution.

First, recall that the distribution of a discrete random variable consists of

1. the possible realizations of the random variable and,
2. the associated probabilities of each realization.

The distribution can be approximated by repeatedly sampling the population at random and computing the statistic. The relative frequencies of occurrence of each realized value comprise the approximate sampling distribution.

Simulating the sampling distribution of p̂

Example: Polling before an election is a good example for simulation since the true p is usually known after the election. In the 2008 presidential election, 52.9% of Americans voted for Barack Obama. If p̂ is computed from a random sample of n voters drawn from the 131,257,328 voters, then we can simulate the distribution of p̂ using the random number table.

• Randomly draw n 3-digit numbers from 000, 001, 002, . . . , 999. If a number is in the set {000, 001, 002, . . . , 528}, then the simulated vote was for Obama. If the number is not in the set, then the vote is for someone else.

A random number table was used to simulate the polling of n = 6 voters.
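The random-number-table procedure translates directly into code. A minimal Python sketch (the function name is illustrative, not from the text): each draw is a 3-digit number from 000–999, and the 529 numbers 000–528 count as Obama votes, matching p = .529.

```python
import random

def simulate_phat(n, rng):
    """One repetition: draw n 3-digit numbers from 000-999;
    numbers in {000, ..., 528} count as votes for Obama."""
    votes = sum(1 for _ in range(n) if rng.randrange(1000) <= 528)
    return votes / n

rng = random.Random(1)
# Ten repetitions with n = 6, mirroring the design of Table 1
phats = [simulate_phat(6, rng) for _ in range(10)]
print(phats)
```

Each run produces a different list of ten p̂ values, which is the point: p̂ is a random variable, and repeating the procedure traces out its sampling distribution.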
The results from ten repetitions are shown below.

Table 1: Ten repetitions of sampling the 2008 presidential voters. p̂ is the proportion of voters for Obama. n = 6 voters were randomly sampled by simulation.

Sample   Random numbers                  p̂
1        03939 30763 06138 80062         5/6 = .833
2        75998 37203 07959 32846         2/6 = .333
3        94435 97441 90998 25104         3/6 = .5
4        04362 40989 69167 38894         3/6 = .5
5        89059 43528 10547 40115         2/6 = .333
6        87736 04666 75145 49175         4/6 = .667
7        76488 80199 15860 07323         2/6 = .333
8        36460 53722 66634 25045         4/6 = .667
9        13205 69237 21820 20952         3/6 = .5
10       11282 43632 49531 78988         3/6 = .5

• The histogram to the right shows the distribution. The mean value was p̂ = .515 and the standard deviation was SD(p̂) = .167.

[Histogram: the ten simulated values of p̂ for n = 6; horizontal axis: simulated proportion; vertical axis: frequency.]

• Using R, thousands of random samples can be simulated for n = 6 or any other size. Simulated sampling distributions of p̂ are shown for sample sizes of 6, 25, and 100 in the histograms below.

[Histograms: simulated sampling distributions of p̂ for n = 6, 25, and 100; horizontal axis: simulated proportion; vertical axis: frequency.]

• The distributions appear to be approximately normal.
• The center of each distribution is p = .529.
• The standard deviations of the distributions decrease as n increases.

The sampling distribution of p̂

Henceforth, it is assumed that p̂ is computed from a random sample of size n drawn from an infinitely large population with population proportion p.

1. The mean of the sampling distribution of p̂ is E(p̂) = p. De Veaux et al. write μ(p̂) = p.

2. The standard deviation of the sampling distribution of p̂ is σ(p̂) = SD(p̂) = √(pq/n), where q = 1 − p.

3. If n is sufficiently large, then the distribution of p̂ is approximately N(p, √(pq/n)). Shorthand notation: p̂ ∼· N(p, √(pq/n)), where ∼· means "approximately distributed as."
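Items 1 and 2 can be checked by brute-force simulation. The sketch below (plain Python rather than the R mentioned above, and assuming simple Bernoulli sampling) compares the simulated mean and standard deviation of p̂ against p and √(pq/n) for the three sample sizes in the histograms:

```python
import math
import random
import statistics

def sampling_distribution(n, p, reps, rng):
    """Simulate `reps` realizations of p-hat from samples of size n."""
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

rng = random.Random(18)
p = 0.529
for n in (6, 25, 100):
    phats = sampling_distribution(n, p, 20_000, rng)
    # simulated mean, simulated SD, and theoretical SD sqrt(pq/n)
    print(n, round(statistics.mean(phats), 3),
          round(statistics.stdev(phats), 3),
          round(math.sqrt(p * (1 - p) / n), 3))
```

For every n the simulated mean lands near .529, and the simulated standard deviation tracks √(pq/n), which shrinks as n grows, exactly the pattern in the histograms.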
If the population size is very large, then items 1–3 are approximately correct. Specifically,

• The expected value of p̂ is E(p̂) = p.
• The standard deviation of p̂ is well-approximated by √(pq/n), provided that the 10% condition is valid (the sample size is no more than 10% of the population size).¹
• The sampling distribution of p̂ is approximately normal provided that the success/failure condition holds. The success/failure condition states that np ≥ 10 and nq ≥ 10.
• Example: The sampling distribution of p̂ computed from a sample of n = 100 voters is determined by noting that the true proportion of voters who voted for Barack Obama is p = .529. The 10% and success/failure conditions are satisfied since n = 100 is much less than the population of 131 million voters. Thus, the normal distribution approximation is appropriate, and p̂ ∼· N(μ(p̂), σ(p̂)), where

    μ(p̂) = p = .529  and  σ(p̂) = √(pq/n) = √(.529 · .471/100) = .050.

• The model is now used to approximate the probability that a realized sample proportion will be less than .50. Since p̂ ∼· N(.53, .050),

    P(p̂ < .5) = P((p̂ − p)/σ(p̂) < (.5 − .53)/.050) = P(Z < −.6) = .2743.

Sampling distribution of the sample mean: The sample mean is the principal statistic for characterizing the distribution of a quantitative variable over a population.

• The population mean μ is estimated by the sample mean ȳ.
• Estimation error is characterized by the standard deviation of the sampling distribution of ȳ, as it is the expected difference between ȳ and μ.
• We'll begin with a sampling experiment that generates approximate sampling distributions before developing the main theoretical results.

¹ If the 10% condition is satisfied, then the sample can be viewed as independent realizations of n Bernoulli trials.

Example: The survival times (days) of 2843 Australian patients diagnosed with AIDS before 1 July 1991 are shown in a histogram to the right.²

[Histogram: survival times of the 2843 patients; horizontal axis: days (0–2500); vertical axis: frequency.]

1. These observations will be treated as a population which will be sampled to estimate its mean.

2. The population mean and standard deviation are μ = 405.93 and σ = 363.94 days, respectively.

3. The distributions of the sample mean computed from SRSs of sizes n = 3, 5, 10, and 25 are to be investigated by taking 10,000 repeated samples of each size. Each sample yields a sample mean, for a total of 10,000 ȳ's.

4. Histograms of sample means for each sample size are shown below. The centers are all approximately 406 days. The spread decreases as n increases, and the shapes of the distributions become more normal-like in appearance as n increases.

² Source: Dr P. J. Solomon and the Australian National Centre in HIV Epidemiology and Clinical Research.

[Histograms: simulated sampling distributions of ȳ for n = 3, 5, 10, and 25; horizontal axis: sample mean; vertical axis: frequency.]

The Central Limit Theorem explains why the sampling distribution of the sample mean appears to be increasingly more normal as the sample size increases.

Central Limit Theorem: The sampling distribution of the sample mean computed from a simple random sample converges to a normal distribution as the sample size n tends to infinity.

The Central Limit Theorem predicts the sampling distribution for (almost) all quantitative variables measured on any population. To which normal distribution is the sampling distribution of the mean converging?

1. The mean of the sampling distribution of ȳ is the population mean μ. Hence, μ(ȳ) = μ.

2. The standard deviation of the sampling distribution of ȳ is σ/√n. Hence, σ(ȳ) = SD(ȳ) = σ/√n.

In summary, as the sample size n tends to infinity, the sampling distribution of ȳ tends to a N(μ, σ/√n) distribution.
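The sampling experiment can be sketched in code. The actual survival times are not reproduced here, so as a stand-in the sketch draws from an exponential population (strongly right-skewed, like the survival-time histogram) with mean 406 days; for an exponential population σ = μ, so the theory predicts σ(ȳ) = 406/√n.

```python
import math
import random
import statistics

rng = random.Random(7)
mu = 406.0  # stand-in population mean; for an exponential, sigma = mu
for n in (3, 5, 10, 25):
    # 10,000 repeated samples of size n, one sample mean per sample
    means = [statistics.mean(rng.expovariate(1 / mu) for _ in range(n))
             for _ in range(10_000)]
    # simulated center, simulated spread, and theoretical sigma/sqrt(n)
    print(n, round(statistics.mean(means)), round(statistics.stdev(means)),
          round(mu / math.sqrt(n)))
```

The simulated centers stay near 406 for every n, while the simulated spreads shrink in step with σ/√n, the same two facts the Central Limit Theorem summary above asserts.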
It's necessary to know the sampling distribution of a statistic for most of the formal inferential methods to come. The Central Limit Theorem provides a simple approximation to the actual sampling distribution, namely, a normal distribution with appropriate mean and standard deviation. For ȳ, the approximation is ȳ ∼· N(μ, σ/√n). The approximation holds for a sufficiently large sample size regardless of the shape of the population distribution. For virtually any population shape, the sample mean from a reasonably large sample will have an approximately normal distribution. The normal approximation is accurate whenever

• The population distribution is itself normal. In this case, the sampling distribution of ȳ is exactly normal for any sample size.
• The sample size is at least 30. If there are any outliers, an even larger sample size may be needed.

Necessary conditions: To be confident that the normal approximation is accurate, the following conditions must hold:

1. The data are a random sample from the population.
2. The sample size is at least 30.

Example: The mean age of the AIDS patients was reported to be 37.4 years and the standard deviation reported to be 10.1 years. To verify this statement, suppose a random sample of n = 36 patients is obtained and their ages recorded.

1. What are μ, σ, ȳ, and s?³

2. What proportion of ages are less than or equal to 21? Let X denote the age of a randomly sampled AIDS patient; the proportion of ages less than or equal to 21 is the same as P(X ≤ 21). Without the full population, it is impossible to answer the question exactly. The question can be answered approximately by proceeding as if the distribution of age over the population were normal:

    P(X ≤ 21) = P(Z ≤ (21 − 37.4)/10.1) = P(Z < −1.623) = .052.

3. What does the Central Limit Theorem say about the sampling distribution of ȳ?⁴

³ Presumably, μ = 37.4 and σ = 10.1. ȳ and s are not known.

4.
What is the probability that the sample mean will be less than or equal to 35?

    P(ȳ ≤ 35) = P(Z < (35 − 37.4)/1.683) = P(Z < −1.425) = .077.

5. Suppose that the age distribution is skewed to the right. Is the calculation of the proportion of ages less than or equal to 21 accurate?⁵ Is the calculation involving the sample mean accurate?⁶

6. Suppose that the sample mean age was ȳ = 34. Is there reason to doubt the reported population mean value of μ = 37.4? A measure of statistical evidence against a population mean of 37.4 is obtained by computing the probability of obtaining a sample mean as small or smaller than 34. If this probability is very small, then the sample mean, and the data, contradict the supposition that 37.4 is μ. The computation is:

    P(ȳ ≤ 34) = P(Z ≤ (34 − 37.4)/1.683) = P(Z ≤ −2.020) = .0217.

The result is statistically significant in the sense that the sample data contradict the reported value: it is improbable (the probability is approximately .02) to observe a sample mean as small or smaller than 34 if the population mean truly were 37.4.

7. What assumptions underlie the previous calculation?⁷

8. Would a sample mean as small or smaller than 34 have been more or less likely to have occurred by chance if the sample size had been n = 72 (instead of n = 36)?⁸

⁴ ȳ ∼· N(μ, σ(ȳ)), where μ = 37.4 and σ(ȳ) = 10.1/√36 = 1.683.
⁵ No, because the normal distribution probably will not be an accurate approximation of the age distribution. In fact, the actual proportion is .0204.
⁶ Yes, because n ≥ 30 implies that the normal distribution is an accurate approximation of the distribution of the sample mean.
⁷ The sample mean was computed from a random sample.
⁸ Less likely, since σ(ȳ) = σ/√n, and P(ȳ ≤ 34) = P(Z ≤ √n(34 − μ)/σ), which decreases as n increases.
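The normal-model probabilities worked above can be reproduced with Python's statistics.NormalDist; a quick sketch (tiny differences from the text arise only from rounding z-scores to table precision):

```python
import math
from statistics import NormalDist

age = NormalDist(mu=37.4, sigma=10.1)                    # one patient's age
ybar = NormalDist(mu=37.4, sigma=10.1 / math.sqrt(36))   # ybar with n = 36

print(round(age.cdf(21), 3))   # P(X <= 21), ~ .052
print(round(ybar.cdf(35), 3))  # P(ybar <= 35), ~ .077
print(round(ybar.cdf(34), 3))  # P(ybar <= 34), ~ .022

# Question 8: with n = 72 the standard deviation of ybar shrinks,
# so a sample mean at or below 34 becomes even less likely
ybar72 = NormalDist(mu=37.4, sigma=10.1 / math.sqrt(72))
print(round(ybar72.cdf(34), 4))
```

The last line confirms footnote 8 numerically: doubling the sample size pushes P(ȳ ≤ 34) well below the .0217 obtained with n = 36.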