The Practice of Statistics, 2nd ed. Yates, Moore, and Starnes
Chapter 9 – Sampling Distributions

Introduction
• The reasoning of statistical inference rests on asking, “How often would this method give a correct result if I used it very many times?”
• Exploratory data analysis (calculating means, medians, standard deviations, etc.) makes sense for any data, but formal inference does not.
• Inference is most useful when we produce data by random sampling or randomized comparative experiments. It is under these conditions that the laws of probability answer the question posed above, namely, “What would happen if we did this many times?”

9.1 – Sampling Distributions
• A parameter is a number that describes the population. In practice, we do not know this value because we cannot examine the entire population. (If we DID know it, we wouldn’t NEED statistics!)
• A statistic is a number that describes a sample. The value of a statistic changes from sample to sample, but it is known (unlike a parameter). We often use a statistic to estimate a parameter.

We must have notation that distinguishes between population parameters and sample statistics. The sample mean x̄ can be used to approximate the population mean µ, and the sample proportion p̂ can be used to approximate the population proportion p.

Sampling variability
How can a sample statistic, which is based on a small percentage of the members of the population, be an accurate estimate of the population parameter? Wouldn’t a second sample produce a different sample statistic? This is the concept of sampling variability – the value of a statistic varies in repeated sampling. We need to know what happens if we take many samples.
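Sampling variability is easy to see in a quick simulation. A minimal sketch (the population values and sample size here are made up for illustration): two SRSs of the same size from the same population give two different values of the statistic.

```python
import random

# Hypothetical population of 10,000 values (illustrative numbers only).
random.seed(1)
population = [random.gauss(100, 15) for _ in range(10_000)]
mu = sum(population) / len(population)  # the parameter, knowable only because we built the population

# Two different SRSs of the same size n give two different sample means.
n = 50
xbar1 = sum(random.sample(population, n)) / n
xbar2 = sum(random.sample(population, n)) / n

print(round(mu, 1), round(xbar1, 1), round(xbar2, 1))
```

Both sample means sit near µ, but they do not equal it or each other; that variation is exactly what a sampling distribution describes.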
To find out, we do the following:
• Take a large number of samples of the same size from the same population
• Calculate the sample mean x̄ or sample proportion p̂ for each sample
• Examine the distribution of the values of x̄ or p̂ (SOCS)

If we were to do the above for all possible samples of a given size n from the same population, we would be describing the sampling distribution of the statistic. The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.

In practice, actually finding and calculating the sampling distribution is nearly impossible because of the huge number of samples involved. However, the laws of probability allow us to obtain sampling distributions without simulation. The same long-run approach is useful here – haphazard sampling does not give regular and predictable results, but when randomization is used in a design for producing data, statistics computed from the data have a definite pattern of behavior over many repetitions, even though the result of a single repetition is uncertain.

Unbiased statistics
A statistic used to estimate a parameter is unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated. This means that there is no “systematic tendency” to overestimate or underestimate the parameter, i.e., there is no “bias.” It can be shown mathematically that p̂ (the sample proportion) is an unbiased estimator of p (the population proportion) and x̄ (the sample mean) is an unbiased estimator of µ (the population mean).

Even though p̂ is an unbiased estimator of p, any given sample may produce an inaccurate estimate of p. This is because of the variability of the sample statistic. Larger samples have a clear advantage because larger samples have less variability.
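The three steps above (take many samples, calculate the statistic, examine its distribution) can be approximated by simulation. A sketch, using a made-up skewed population: the mean of the simulated x̄ values lands close to µ, consistent with x̄ being an unbiased estimator.

```python
import random
import statistics

random.seed(2)
# Hypothetical right-skewed population (exponential, mean about 20).
population = [random.expovariate(1 / 20) for _ in range(50_000)]
mu = statistics.mean(population)

n = 40        # sample size
reps = 2_000  # many samples -- a stand-in for "all possible samples"
xbars = [statistics.mean(random.sample(population, n)) for _ in range(reps)]

# Center and spread of the simulated sampling distribution of x-bar.
center = statistics.mean(xbars)
spread = statistics.stdev(xbars)
print(round(mu, 2), round(center, 2), round(spread, 2))
```

The center of the simulated distribution is close to µ (no systematic over- or underestimate), and its spread is much smaller than the spread of individual observations.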
If a statistic is unbiased, it’s more likely that a larger sample, because of its lower variability, will produce an estimate closer to the true value of the parameter.

One important and perhaps surprising result is that the variability (spread) of the sampling distribution does not depend very much on the size of the population! Instead, it depends on the size of the sample. As long as the population is much larger than the sample (say, 10 times as large), the spread of the sampling distribution is approximately the same for any population size.

Our goal is to have low bias and low variability. In a “target” analogy, bias is missing off to the same direction every time, while variability is missing in an evenly scattered way around the bull’s-eye. Properly chosen statistics computed from random samples of sufficient size will have low bias and low variability.

9.2 – Sample Proportions
The sample proportion p̂ is the statistic we use to gain information about the unknown population parameter p. Sample proportions arise most often when we are interested in categorical variables, such as “the proportion of adults who watch American Idol” or “the percent of adults who attended church or synagogue last week.”

Sampling distribution of a sample proportion
Suppose we choose an SRS of size n from a large population (at least 10 times the size of n) with population proportion p having some characteristic of interest. Let p̂ be the proportion of the sample having that characteristic. Then:
• The sampling distribution of p̂ is the distribution of the values of p̂ in all possible samples of the same size from the population.
• The mean µ_p̂ of the sampling distribution of p̂ is exactly p.
• The standard deviation σ_p̂ of the sampling distribution is √( p(1 − p) / n ).
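The facts above about p̂ can all be checked by simulation. A sketch with made-up numbers (p = 0.35, n = 100): the simulated p̂ values center on p, their spread matches √(p(1−p)/n), and the spread barely changes when the population is made 10 times larger.

```python
import math
import random

random.seed(3)
p, n, reps = 0.35, 100, 4_000

def simulate_phat(pop_size):
    """Draw repeated SRSs from a 0/1 population with proportion p of 1s;
    return the mean and standard deviation of the simulated p-hat values."""
    ones = int(p * pop_size)
    population = [1] * ones + [0] * (pop_size - ones)
    phats = [sum(random.sample(population, n)) / n for _ in range(reps)]
    mean = sum(phats) / reps
    sd = math.sqrt(sum((x - mean) ** 2 for x in phats) / (reps - 1))
    return mean, sd

mean_small, sd_small = simulate_phat(10_000)   # population 100x the sample
mean_large, sd_large = simulate_phat(100_000)  # population 1,000x the sample
formula_sd = math.sqrt(p * (1 - p) / n)        # sqrt( p(1-p)/n )

# Cutting the standard deviation in half takes a sample 4 times as large,
# because n sits under the square root:
half_sd = math.sqrt(p * (1 - p) / (4 * n))
print(round(formula_sd, 4), round(half_sd, 4))
```

Note that sd_small and sd_large come out essentially equal even though one population is ten times the other, illustrating that the spread depends on n, not on the population size.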
Using the normal approximation for p̂
For large enough values of n (the sample size), the sampling distribution of p̂ is approximately normal. Furthermore:
• The sampling distribution is closer to a normal distribution when the sample size n is large.
• The accuracy of the normal approximation improves as the sample size n increases.
• Given a fixed sample size n, the normal approximation is most accurate when p is close to ½ and least accurate when p is close to 0 or 1.
• Rule of Thumb: We will use the normal approximation to the sampling distribution of p̂ for values of n and p that satisfy np ≥ 10 and n(1 − p) ≥ 10 (because p̂ = X/n, where the count X is binomial).

Notes about sample proportions:
• p̂ is less variable in larger samples.
• We know exactly how quickly the standard deviation decreases as n increases. The sample size n is under the square root sign, so to cut the standard deviation in half, we need a sample four times as large.
• The formula for the standard deviation of p̂ does not apply when the sample is a large part of the population.

9.3 – Sample Means
We are often interested in quantitative measures of data such as household income, lifetime of a car brake pad, or blood pressure of a patient. Sometimes we are interested in the median or the standard deviation, but most often (at least in our course) we’ll be looking at the sample mean.

Two important ideas come up in the discussion of sample means:
• Averages are less variable than individual observations.
• Averages are more normally distributed than individual observations.

Sampling distribution of a sample mean
Suppose that x̄ is the mean of an SRS of size n drawn from a large population with mean µ and standard deviation σ. Then:
• The sampling distribution of x̄ is the distribution of the values of x̄ in all possible samples of the same size from the population.
• The mean µ_x̄ of the sampling distribution of x̄ is µ.
• The standard deviation σ_x̄ of the sampling distribution is σ/√n.

Notes about sample means:
• The values of x̄ are less spread out for larger samples.
• Again, we know exactly how quickly the standard deviation decreases as n increases. The sample size n is under the square root sign, so to cut the standard deviation in half, we need a sample four times as large.

The shape of the sampling distribution depends on the shape of the population distribution. In particular, if the population distribution is normal, then so is the distribution of the sample mean: If we draw an SRS from a population that has the normal distribution with mean µ and standard deviation σ, then the sample mean x̄ has the normal distribution N(µ, σ/√n). In other cases, particularly for larger sample sizes, we have what is known as the Central Limit Theorem to help us determine the shape of the sampling distribution (more on this later).

IMPORTANT NOTE: The formulas for the standard deviations of both sample means and sample proportions should only be used if the population size is at least 10 times the sample size. As the sample size gets closer and closer to the population size, the formulas for the standard deviation become less and less accurate.

Central Limit Theorem
Many populations have roughly normal distributions, but very few are exactly normal. What happens to x̄ when the population distribution is not normal? The central limit theorem is an important result of probability theory that answers this question: Draw an SRS of size n from any population whatsoever with mean µ and standard deviation σ. When n is large, the sampling distribution of the sample mean x̄ is close to the normal distribution N(µ, σ/√n), with mean µ and standard deviation σ/√n. The size of the sample required in order for x̄ to be close to normal depends on how “nonnormal” the population distribution is.
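The CLT can be illustrated with a short simulation. A sketch, using a made-up, strongly skewed population (exponential with rate 1, so µ = σ = 1): for n = 36 the simulated x̄ values behave like N(µ, σ/√n), with roughly 95% of them falling within two standard deviations of µ.

```python
import random
import statistics

random.seed(4)
mu, sigma = 1.0, 1.0   # Exp(1) population: mean 1, sd 1, far from normal
n, reps = 36, 3_000

# Many sample means from the skewed population.
xbars = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

se = sigma / n ** 0.5  # sigma / sqrt(n), the CLT standard deviation
within_2se = sum(abs(x - mu) < 2 * se for x in xbars) / reps

print(round(statistics.mean(xbars), 3),
      round(statistics.stdev(xbars), 3),
      round(within_2se, 3))
```

Even though individual observations are heavily skewed, the averages cluster symmetrically around µ with spread close to σ/√n.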
More observations are required (i.e., a larger sample size) if the population distribution is far from normal (see EXAMPLE 9.12).

Why is this important? The Central Limit Theorem (or CLT) allows us to use normal probability calculations to answer questions about sample means from many observations even when the original population distribution is not normal!

Summary
Keep the following figure in mind as we proceed through the rest of the course: We take lots of samples of the same size (size n) from the same population and calculate the sample mean x̄ for each of those samples. We then plot the distribution of all the different values of x̄. We also find the mean and standard deviation of x̄, which turn out to be µ and σ/√n, respectively. Because of the CLT, we can use this same mean and standard deviation for large enough samples regardless of the shape of the population distribution from which the samples were drawn.

Additionally, here is a side-by-side comparison of the sampling distributions of the sample mean (x̄) and the sample proportion (p̂):

Statistic                  Mean          Standard Deviation
x̄ (sample mean)           µ_x̄ = µ       σ_x̄ = σ/√n
p̂ (sample proportion)     µ_p̂ = p       σ_p̂ = √( p(1−p) / n )

The above formulas for the standard deviations are exact if the population is infinite in size; they are approximate if the sample makes up no more than 10% of the population.

Normality
The sampling distribution of the sample mean is exactly normal if the population distribution is normal, and it is approximately normal if n ≥ 30. The sampling distribution of the sample proportion is approximately normal if np ≥ 10 and n(1 − p) ≥ 10.
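The summary table and the normality rule of thumb translate directly into a few helper functions (a sketch; the function names are my own):

```python
import math

def xbar_mean_sd(mu, sigma, n):
    """Sampling distribution of the sample mean: mean mu, sd sigma/sqrt(n)."""
    return mu, sigma / math.sqrt(n)

def phat_mean_sd(p, n):
    """Sampling distribution of the sample proportion: mean p, sd sqrt(p(1-p)/n)."""
    return p, math.sqrt(p * (1 - p) / n)

def normal_approx_ok(n, p):
    """Rule of thumb for p-hat: np >= 10 and n(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10

print(xbar_mean_sd(100, 15, 25))   # (100, 3.0)
print(phat_mean_sd(0.5, 100))      # sd is about 0.05
print(normal_approx_ok(100, 0.05), normal_approx_ok(100, 0.5))
```

For example, with µ = 100, σ = 15, and n = 25, the sample mean varies around 100 with standard deviation 15/√25 = 3; and at n = 100 the normal approximation for p̂ is reasonable when p = 0.5 (both counts are 50) but not when p = 0.05 (np = 5 < 10).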