Chapter 7 Sampling Distributions

Statistical inference: We use data from a sample to draw conclusions about the population.

Ex: Any product quality control. It is impractical to test every single item, so we test a sample from each batch. However, each sample will likely be a little different from the population.

What would happen if we took lots of samples? What might they average to? Can we expect the sample to be truly representative of the population?

Activity: German Tank Problem

How many German tanks were there in WWII? The Allies captured several tanks but had no idea how many were really out there, so they sent the serial numbers of captured tanks to mathematicians in D.C. and asked them to estimate the total. (The idea being that the serial numbers were assigned in order of tank production...)

We will "capture" 4 tanks. Each team will have 10 minutes to estimate the total number of tanks based on the serial numbers, using any statistical methods known. (Note: this does NOT include internet access!)

Activity: German Tank Problem, Specific data

According to conventional Allied intelligence estimates, the Germans were producing around 1,400 tanks a month between June 1940 and September 1942. Applying the formula below to the serial numbers of captured German tanks (both serviceable and destroyed), the number was calculated to be 256 a month. After the war, captured German production figures from the ministry of Albert Speer showed the actual number to be 255.

Estimates for some specific months are given as:

Month          Statistical estimate   Intelligence estimate   German records
June 1940      169                    1,000                   122
June 1941      244                    1,550                   271
August 1942    327                    1,550                   342

The formula: Minimum-variance unbiased estimator

For point estimation (estimating a single value for the total), the minimum-variance unbiased estimator (MVUE, or UMVU estimator) is given by:

$$\hat{N} = m\left(1 + \frac{1}{k}\right) - 1$$

where m is the largest serial number observed (sample maximum) and k is the number of tanks observed (sample size).
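As a quick numerical sketch of this estimator, the snippet below computes m + m/k - 1, the standard form of the MVUE for the German tank problem. It assumes serial numbers run 1, 2, ..., N and that tanks are captured at random without replacement. The function name mvue_tank_estimate and the captured serial numbers are hypothetical, chosen only for illustration.

```python
def mvue_tank_estimate(serials):
    """Estimate the total number of tanks from observed serial numbers.

    Uses m + m/k - 1, where m is the sample maximum and k is the sample size.
    Assumes serial numbers start at 1 and were sampled without replacement.
    """
    if not serials:
        raise ValueError("need at least one observed serial number")
    m = max(serials)   # largest serial number observed
    k = len(serials)   # number of tanks observed
    return m + m / k - 1


# Hypothetical data: serial numbers of 4 "captured" tanks.
captured = [19, 40, 42, 60]
estimate = mvue_tank_estimate(captured)
print(estimate)  # 60 + 60/4 - 1 = 74.0
```

With only 4 tanks the estimate sits well above the sample maximum, because a small sample leaves a large expected gap between the largest observed serial number and the true total.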
The formula may be understood intuitively as "the sample maximum plus the average gap between observations in the sample", and written as

$$\hat{N} = m + \frac{m - k}{k} = m + \frac{m}{k} - 1 = m\left(1 + \frac{1}{k}\right) - 1$$

7.1 What is a Sampling Distribution?

Short answer: It is the distribution of a statistic over every possible sample of a particular size (n).

More vocab:
Parameter: a number describing a population, such as the mean (μ), standard deviation (σ), or proportion (p)
Statistic: a number describing a sample, such as the mean (x̄), standard deviation (s), or proportion (p̂)

Activity: Reaching for Chips

Population of chips: 200 total chips, of which 100 are red.
Parameter: p = 1/2 of the chips in the POPULATION are red.
Select (without looking) a sample of 20 chips. Record the proportion of reds; then return all chips to the bag, mix them up, and pass the bag on.
We will then plot the DISTRIBUTION of our SAMPLES in our SAMPLING DISTRIBUTION and observe the SOCS (shape, outliers, center, spread).

Example: Heights and Cell Phones

Problem: Identify the population, the parameter, the sample, and the statistic in each of the following settings.
(a) A pediatrician wants to know the 75th percentile for the distribution of heights of 10-year-old boys, so she takes a sample of 50 patients and calculates Q3 = 56 inches.
(b) A Pew Research Center poll asked 1102 12- to 17-year-olds in the United States if they have a cell phone. Of the respondents, 71% said yes.

Distribution: a list of all possible values of a variable and how often it takes those values
Population distribution: values of the variable for all individuals in the population
Sample distribution: values of the variable for all individuals in a particular sample
Sampling distribution: values of a statistic (mean, proportion, etc.) for all possible samples of the same size from the same population

Ex: Pretend we have a POPULATION of 4 numbers: 1, 2, 3, 4. We want a sample of size n = 2.
What are all the possible samples? 1,2; 1,3; 1,4; 2,3; 2,4; 3,4.
Let's say we're interested in the mean of the population (μ). List the mean of each sample (x̄): 1.5, 2, 2.5, 2.5, 3, 3.5.
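The small example above can be checked by brute force. The sketch below lists every sample of size 2 from the population 1, 2, 3, 4 (without replacement, order ignored), computes each sample mean, and tallies the results into a sampling distribution. Variable names are illustrative only.

```python
from collections import Counter
from itertools import combinations

population = [1, 2, 3, 4]
n = 2  # sample size

# Every possible sample of size n, drawn without replacement.
samples = list(combinations(population, n))

# The statistic of interest: the mean of each sample.
sample_means = [sum(s) / n for s in samples]

# Sampling distribution: each possible value of the statistic and its frequency.
sampling_distribution = Counter(sample_means)

mu = sum(population) / len(population)               # true population mean
mean_of_means = sum(sample_means) / len(sample_means)

print(samples)       # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(sample_means)  # [1.5, 2.0, 2.5, 2.5, 3.0, 3.5]
for value in sorted(sampling_distribution):
    print(value, "x" * sampling_distribution[value])  # crude text histogram
print(mu, mean_of_means)  # 2.5 2.5
```

The sample means average to the population mean here, which is what it means for the sample mean to be an unbiased estimator of μ.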
Look at a histogram of the distribution of these sample means. Note: The "true mean" (μ) is (1 + 2 + 3 + 4)/4 = 2.5, but usually the population is too large to calculate the true mean.

[Figure: histograms of frequency versus x (axis from 0 to 4.0) for the Population Distribution, a few of the Sample Distributions, and the Sampling Distribution]

Activity: Choosing Cards (Investigating Variability)

Use a deck of cards with aces and face cards removed, leaving only 2 through 10 in 4 suits = 36 cards. Take samples of size n. Find the mean, median, and range of each sample. Create dotplots of each sampling statistic for n = 2, n = 5, and n = 10.

The true mean, median, and range of the population are:
μ = 4(2 + 3 + ... + 9 + 10)/36 = 6
Median = 6 (the middle value of 2, 3, 4, 5, 6, 7, 8, 9, 10)
Range = 10 - 2 = 8

Compare our sampling distributions (such as they are) to the true population values. Are they accurate? Precise? Which statistics are biased? How does the sample size affect the variability?

Precise does NOT mean accurate! If bias is present in the sampling procedure, you will still have biased results!!

Tanks revisited...

5 methods for estimating the total number of tanks:
(1) partition = max × (5/4)
(2) max = max
(3) MeanMedian = mean + median
(4) SumQuartiles = Q1 + Q3
(5) TwiceIQR = 2 × IQR

The graph shows the approximate sampling distribution for each of these statistics when taking samples of size 4 from a population of 342 tanks.
(a) Which of these statistics appear to be biased estimators?
(b) Of the unbiased estimators, which is best?
(c) Why might a biased estimator be preferred to an unbiased estimator?
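The sampling distributions in the graph can be approximated by simulation. The sketch below repeatedly draws 4 serial numbers from 1 to 342 and records each of the five statistics, then reports the mean (center) and standard deviation (spread) of each. It rests on assumptions not stated in the notes: Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted sample, and the number of repetitions and the seed are arbitrary. All function names are hypothetical.

```python
import random
from statistics import mean, median, pstdev


def quartiles(sorted_sample):
    """Q1 and Q3 as medians of the lower and upper halves (assumed convention).

    Expects at least 2 values, already sorted.
    """
    half = len(sorted_sample) // 2
    return median(sorted_sample[:half]), median(sorted_sample[-half:])


def estimators(sample):
    """Compute the five candidate estimates of the total from one sample."""
    s = sorted(sample)
    q1, q3 = quartiles(s)
    return {
        "partition": max(s) * 5 / 4,
        "max": max(s),
        "MeanMedian": mean(s) + median(s),
        "SumQuartiles": q1 + q3,
        "TwiceIQR": 2 * (q3 - q1),
    }


def simulate(total_tanks=342, sample_size=4, repetitions=10_000, seed=0):
    """Approximate each estimator's sampling distribution by repeated sampling."""
    rng = random.Random(seed)
    serials = range(1, total_tanks + 1)
    results = {}
    for _ in range(repetitions):
        sample = rng.sample(serials, sample_size)  # without replacement
        for name, value in estimators(sample).items():
            results.setdefault(name, []).append(value)
    # Center (mean) and spread (standard deviation) of each simulated distribution.
    return {name: (mean(v), pstdev(v)) for name, v in results.items()}


summary = simulate()
for name, (center, spread) in summary.items():
    print(f"{name:>12}: mean = {center:7.1f}, sd = {spread:6.1f}")
```

Comparing each simulated mean with the true total of 342 suggests which statistics are biased, and comparing the standard deviations suggests which of the roughly unbiased ones varies least.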