Section 7.1 Sampling Distributions

Introduction

Statistical inference: use information from a sample to draw conclusions about a wider population. Different random samples yield different statistics. We can think of a statistic as a random variable because it takes numerical values that describe the outcomes of the random sampling process.

[Diagram: Population -> collect data from a representative sample -> Sample -> make an inference about the population]

Parameter
– a number that describes the population
– a fixed number
– in practice, we do not know its value because we cannot examine the entire population

Statistic
– a number that describes a sample
– the value of a statistic is known when we have taken a sample, but it can change from sample to sample
– we often use a statistic to estimate an unknown parameter

Compare
                      Parameter    Statistic
mean                  μ            x̄ (x-bar)
standard deviation    σ            s
proportion            p            p̂ (p-hat)

• Sometimes we call the parameters “true”: true mean, true proportion, etc.
• Sometimes we call the statistics “sample”: sample mean, sample proportion, etc.

Identify the population, the parameter, the sample, and the statistic in each of the following settings.

1) A pediatrician wants to know the 75th percentile for the distribution of heights of 10-year-old boys, so she takes a sample of 50 patients and calculates that the 75th percentile in the sample is 56 inches.

The population is all 10-year-old boys. The parameter of interest is the 75th percentile of heights for all 10-year-old boys. The sample is the 50 10-year-old boys the pediatrician selected. The statistic is 56 inches, the 75th percentile of the heights in the sample.

2) A Pew Research Center Poll asked 1102 12- to 17-year-olds in the United States if they have a cell phone. Of the respondents, 71% said Yes.
<http://www.pewinternet.org/Reports/2009/14--Teensand-Mobile-Phones-Data-Memo.aspx>

The population is all 12- to 17-year-olds in the U.S. The parameter of interest is p, the proportion of all 12- to 17-year-olds with cell phones. The sample is the 1102 12- to 17-year-olds who were polled. The statistic is the sample proportion with a cell phone, p̂ = 0.71.

Sampling Variability

Sampling variability: the value of a statistic varies in repeated random sampling. To make sense of sampling variability, we ask, “What would happen if we took many samples?”

[Diagram: many samples drawn from the same population]

Why is sampling variability not fatal? Because we can study what happens across many samples:
– Take a large number of samples from the same population
– Calculate the sample mean x̄ or sample proportion p̂ for each sample
– Make a histogram of the values of x̄ or p̂
– Examine the distribution displayed in the histogram for shape, center, and spread, as well as outliers or other deviations

Sampling Distribution

The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.

In practice, it’s difficult to take all possible samples of size n to obtain the actual sampling distribution of a statistic. Instead, we can use simulation to imitate the process of taking many, many samples.

Sampling Distribution vs. Population Distribution

There are actually three distinct distributions involved when we sample repeatedly and measure a variable of interest.
1. The population distribution gives the values of the variable for all the individuals in the population.
2. The distribution of sample data shows the values of the variable for all the individuals in the sample.
3. The sampling distribution shows the statistic values from all the possible samples of the same size from the population.
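The simulation idea described above (take many samples, compute p̂ for each, then look at the distribution of those values) can be sketched in Python. This is a minimal illustration and not part of the original slides: the population proportion p = 0.5, the sample size 20, the 1000 repetitions, the seed, and all function names are assumptions chosen for the example, and each individual is modelled as having the trait independently with probability p (in effect, a very large population).

```python
import random
import statistics
from collections import Counter


def sample_proportion(rng, p, n):
    """Draw one random sample of size n and return p-hat, the sample proportion."""
    successes = sum(1 for _ in range(n) if rng.random() < p)
    return successes / n


def simulate_sampling_distribution(p=0.5, n=20, num_samples=1000, seed=1):
    """Approximate the sampling distribution of p-hat by repeated sampling.

    p, n, num_samples, and seed are illustrative assumptions, not values
    taken from the slides.
    """
    rng = random.Random(seed)
    return [sample_proportion(rng, p, n) for _ in range(num_samples)]


def text_histogram(values, width=50):
    """Return a crude text histogram: one row per distinct p-hat value."""
    counts = Counter(values)
    tallest = max(counts.values())
    rows = []
    for value in sorted(counts):
        bar = "*" * max(1, round(width * counts[value] / tallest))
        rows.append(f"{value:4.2f} {bar}")
    return "\n".join(rows)


p_hats = simulate_sampling_distribution()
print(text_histogram(p_hats))
print("center (mean of p-hat values):", round(statistics.mean(p_hats), 3))
print("spread (std. dev. of p-hat values):", round(statistics.stdev(p_hats), 3))
```

The printed histogram is only an approximation of the sampling distribution, since it is built from 1000 samples rather than all possible samples. It can be examined for shape, center, and spread in the same way as a histogram made by hand.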
http://onlinestatbook.com/stat_sim/sampling_dist/index.html

Choosing cards

We used Fathom to simulate choosing 100 SRSs of size 5 from a deck of cards. The graph below shows the distribution of the sample median for these 100 samples.

[Figure: dot plot of the sample median for the 100 samples; horizontal axis “Sample Median”, from 2 to 10]

Problem:
a) There is one dot on the graph at 2. Explain what this value represents.
b) Describe the distribution. Are there any obvious outliers?
c) Would it be surprising to get a sample median of 4 or smaller in an SRS of size 5 from this deck? Justify your answer.
d) Suppose that another student prepared a different deck of cards and claimed that it was exactly the same as the one used in the Activity. When you take an SRS of size 5, the median is 4. Does this provide convincing evidence that the student’s deck is different? Explain.

Solution:
a) In one SRS of size 5, the median of the sample was 2.
b) Shape: The graph is roughly symmetric with a single peak at 6. Center: The mean of the distribution is about 6. Spread: The values fall mostly between 4 and 8, although there are values as low as 2 and as high as 10. Outliers: There don’t seem to be any outliers.
c) No. Getting a sample median of 4 or smaller occurs fairly often just by chance when taking random samples of size 5 from a deck of cards with the aces and face cards removed. A median of 4 or smaller occurred in 22 of the 100 trials of the simulation.
d) Getting a sample median of 4 does not provide convincing evidence that the student’s deck is different. Because getting a median of 4 or smaller is somewhat likely to happen by chance alone, it is plausible that the deck is the same.

Describing Sampling Distributions

“How trustworthy is a statistic as an estimator of the parameter?”

Center: Biased and unbiased estimators

In the chips example, we collected many samples of size 20 and calculated the sample proportion of red chips.
How well does the sample proportion estimate the true proportion of red chips, p = 0.5? Note that the center of the approximate sampling distribution is close to 0.5. In fact, if we took ALL possible samples of size 20 and found the mean of those sample proportions, we’d get exactly 0.5.

A statistic used to estimate a parameter is an unbiased estimator if the mean of its sampling distribution is equal to the true value of the parameter being estimated.

Spread: Low variability is better!

Unfortunately, using an unbiased estimator doesn’t guarantee that the value of your statistic will be close to the actual parameter value.

[Figure: sampling distributions for n = 100 and n = 1000]

Larger samples have a clear advantage over smaller samples. They are much more likely to produce an estimate close to the true value of the parameter.

One important and surprising fact is that the variability of a statistic in repeated sampling does not depend very much on the size of the population.

Variability of a Statistic

The variability of a statistic is described by the spread of its sampling distribution. This spread is determined mainly by the size of the random sample. Larger samples give smaller spreads. The spread of the sampling distribution does not depend much on the size of the population, as long as the population is at least 10 times larger than the sample.

Bias, Variability, and Shape

We can think of the true value of the population parameter as the bull’s-eye on a target and of the sample statistic as an arrow fired at the target. Both bias and variability describe what happens when we take many shots at the target.

Bias means that our aim is off and we consistently miss the bull’s-eye in the same direction. Our sample values do not center on the population value.

High variability means that repeated shots are widely scattered on the target. Repeated samples do not give very similar results.
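The two claims about spread (larger samples give smaller spreads, and population size matters little) can be checked with a small simulation. The sketch below is not from the slides. It assumes a population in which exactly half of the individuals have the trait of interest (true p = 0.5), draws simple random samples without replacement, and uses hypothetical helper names, population sizes, and a fixed seed chosen only for illustration.

```python
import random
import statistics


def srs_proportion(rng, population_size, num_with_trait, n):
    """Take an SRS of size n (without replacement) and return p-hat.

    Individuals are labelled 0..population_size-1; those with a label below
    num_with_trait are treated as having the trait.
    """
    chosen = rng.sample(range(population_size), n)
    return sum(1 for label in chosen if label < num_with_trait) / n


def spread_of_p_hat(population_size, n, num_samples=1000, seed=1):
    """Estimate the standard deviation of p-hat over repeated SRSs.

    Assumes exactly half of the population has the trait (true p = 0.5).
    """
    rng = random.Random(seed)
    num_with_trait = population_size // 2
    p_hats = [
        srs_proportion(rng, population_size, num_with_trait, n)
        for _ in range(num_samples)
    ]
    return statistics.stdev(p_hats)


# Same population, different sample sizes: larger n gives a smaller spread.
spread_n100 = spread_of_p_hat(population_size=200_000, n=100)
spread_n1000 = spread_of_p_hat(population_size=200_000, n=1000)

# Same sample size, much smaller population: the spread barely changes.
spread_small_pop = spread_of_p_hat(population_size=10_000, n=100)

print(f"N = 200,000, n = 100:  spread of p-hat = {spread_n100:.4f}")
print(f"N = 200,000, n = 1000: spread of p-hat = {spread_n1000:.4f}")
print(f"N = 10,000,  n = 100:  spread of p-hat = {spread_small_pop:.4f}")
```

With these settings, the spread for n = 1000 should come out at roughly a third of the spread for n = 100, while shrinking the population from 200,000 to 10,000 should change the spread very little. Both populations are far more than 10 times larger than the sample.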