Chapter 7 Sampling Distributions
Statistical inference: We use data from a sample to draw conclusions about the population.
Ex: Any product quality control - impractical to test every single item, so we test a sample from each batch.
However, each sample will likely be a little different from the population. What would happen if we took lots of samples? What might they average to? Can we expect the sample to be truly representative of the population?
Activity: German Tank Problem
How many German tanks were there in WWII? The Allies captured several tanks but had no idea how many were really out there, so they sent the serial numbers of captured tanks to mathematicians in D.C. and asked them to estimate how many were really out there. (The idea being that the serial numbers were assigned by order of tank creation...)
We will "capture" 4 tanks. Each team will have 10 minutes to estimate the total number of tanks based on the serial numbers using any statistical methods known. (Note: this does NOT include internet access!)
Activity: German Tank Problem
Specific data
According to conventional Allied intelligence estimates, the Germans were producing around 1,400 tanks a month between June 1940 and September 1942. Applying the formula given below to the serial numbers of captured German tanks (both serviceable and destroyed), the number was calculated to be 256 a month. After the war, captured German production figures from the ministry of Albert Speer showed the actual number to be 255.
Estimates for some specific months are given as:
Month          Statistical estimate   Intelligence estimate   German records
June 1940      169                    1000                    122
June 1941      244                    1550                    271
August 1942    327                    1550                    342
The formula: Minimum-variance unbiased estimator
For point estimation (estimating a single value for the total),
the minimum-variance unbiased estimator (MVUE, or UMVU
estimator) is given by:
$$\hat{N} = m\left(1 + \frac{1}{k}\right) - 1$$
where m is the largest serial number observed (sample
maximum) and k is the number of tanks observed (sample size).
The formula may be understood intuitively as:
"The sample maximum plus the average gap between
observations in the sample", and written as
$$\hat{N} = m + \frac{m - k}{k}$$
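A minimal sketch of the "sample maximum plus average gap" idea in Python. It assumes the standard form of the estimator, m + m/k - 1, and that serial numbers run 1, 2, ..., N with no gaps. The function name `estimate_total` and the four serial numbers are hypothetical, chosen only for illustration.

```python
def estimate_total(serials):
    """Estimate the total number of tanks from observed serial numbers.

    Assumes serials were assigned 1..N and sampled without replacement.
    """
    m = max(serials)  # sample maximum
    k = len(serials)  # sample size
    return m + m / k - 1  # same as m + (m - k) / k


captured = [19, 40, 42, 60]  # hypothetical serial numbers of 4 captured tanks
print(estimate_total(captured))  # 60 + 60/4 - 1 = 74.0
```

Note that the estimate is always at least as large as the biggest serial number seen, which makes sense: the total cannot be smaller than a tank we actually captured.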
7.1 What is a Sampling Distribution?
Short answer: It is the distribution of the values a statistic takes in every possible sample of a particular size (n).
More vocab:
Parameter - number describing a population
- mean (μ), standard deviation (σ), proportion (p)
Statistic - number describing a sample
- mean (x̄), standard deviation (s), proportion (p̂)
Activity: Reaching for Chips
Population of chips: 200 total chips, of which 100 are red.
Parameter: p = 1/2 of the chips in the POPULATION are red.
Select (without looking) a sample of 20 chips. Record the
proportion of reds; then return all chips to bag, mix up, and pass
on the bag.
We will then plot the proportions from all of our SAMPLES to build an
approximate SAMPLING DISTRIBUTION and observe the SOCS (shape,
outliers, center, spread).
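If the bag of chips is not handy, the activity can be imitated on a computer. This is only a sketch: the number of repetitions (1000), the seed, and the names `bag`, `sample_proportion`, and `p_hats` are assumptions made for illustration, not part of the activity.

```python
import random


def sample_proportion(rng, population, n):
    """Draw n chips without replacement and return the proportion that are red."""
    sample = rng.sample(population, n)
    return sample.count("red") / n


rng = random.Random(7)  # arbitrary fixed seed so the run is repeatable
bag = ["red"] * 100 + ["other"] * 100  # 200 chips, of which 100 are red

# Each value plays the role of one student's recorded proportion of reds.
p_hats = [sample_proportion(rng, bag, 20) for _ in range(1000)]

mean_p_hat = sum(p_hats) / len(p_hats)
print("center:", round(mean_p_hat, 3))  # expected to land near p = 0.5
print("smallest and largest:", min(p_hats), max(p_hats))
```

A dotplot or histogram of `p_hats` is the simulated version of the class plot, so the same SOCS questions apply to it.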
Example: Heights and Cell Phones
Problem: Identify the population, the parameter, the sample, and the statistic in each of the following settings.
(a) A pediatrician wants to know the 75th percentile for the distribution of heights of 10-year-old boys so she takes a sample of 50 patients and calculates Q3 = 56 inches.
(b) A Pew Research Center poll asked 1102 12- to 17-year-olds in the United States if they have a cell phone. Of the respondents, 71% said yes.
Distribution: a list of all possible values of a variable and
how often it takes those values
Population distribution: values of the variable for all
individuals in the population
Sample distribution: values of the variable for all individuals
in a particular sample
Sampling distribution: values of a statistic (mean,
proportion, etc.) of all possible samples of the same size
from the same population.
Ex: Pretend we have a POPULATION of 4 numbers: 1, 2, 3, 4. We
want a sample of size n = 2.
What are all the possible samples?
1,2; 1,3; 1,4; 2,3; 2,4; 3,4.
Let's say we're interested in the mean of the population (μ). List
the means of each sample (x̄): 1.5, 2, 2.5, 2.5, 3, 3.5.
Look at a histogram of the distribution.
Note: The "true mean" (μ) is (1 + 2 + 3 + 4)/4 = 2.5, but usually the population is too large to calculate the true mean.
[Figure: histograms of Frequency against x (axis from 1.0 to 4.0) showing the Population Distribution, a few of the Sample Distributions, and the Sampling Distribution.]
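The small example with the population 1, 2, 3, 4 can be checked by listing every sample with code. This sketch uses Python's `itertools.combinations`; the variable names are chosen here for illustration.

```python
from itertools import combinations

population = [1, 2, 3, 4]

# Every possible sample of size n = 2 (order does not matter, no repeats).
samples = list(combinations(population, 2))

# The sample mean of each sample; together these make up the sampling distribution.
sample_means = [sum(s) / len(s) for s in samples]

print(samples)
print(sample_means)

# Compare the center of the sampling distribution with the true mean.
print(sum(sample_means) / len(sample_means), sum(population) / len(population))
```

The average of the six sample means equals the population mean, 2.5, which is what it means for the sample mean to be an unbiased estimator of μ in this example.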
Activity: Choosing cards (Investigating Variability)
Deck of cards with aces and face cards removed.
Only have 2 - 10 in 4 suits = 36 cards.
Samples of size n. Find mean, median, and range of each sample.
Create dotplots of each sampling statistic.
n = 2
n = 5
n = 10
The true mean, median, and range of the population are...
μ = 4(2+3+...+9+10)/36 = 6
Median = 6 (the middle value of 2, 3, 4, 5, 6, 7, 8, 9, 10)
Range = 10 - 2 = 8
Compare our sampling distributions (such as they are) to the
true population values. Are they accurate? Precise?
Which statistics are biased?
How does the sample size affect the variability?
Precise does NOT mean accurate! If bias is present in the sampling procedure, you will still have biased results!!
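The card activity can also be simulated to see bias and variability side by side. This is a rough sketch: the 2000 repetitions per sample size, the seed, and the names `deck`, `simulate`, and `summary` are assumptions made for illustration.

```python
import random
from statistics import mean, median, pstdev

# 2 through 10 in 4 suits = 36 cards
deck = [value for value in range(2, 11) for _suit in range(4)]
rng = random.Random(1)  # arbitrary fixed seed so the run is repeatable


def simulate(n, reps=2000):
    """Return the sample means, medians, and ranges of reps samples of size n."""
    means, medians, ranges = [], [], []
    for _ in range(reps):
        hand = rng.sample(deck, n)  # deal n cards without replacement
        means.append(sum(hand) / n)
        medians.append(median(hand))
        ranges.append(max(hand) - min(hand))
    return means, medians, ranges


summary = {}
for n in (2, 5, 10):
    means, medians, ranges = simulate(n)
    summary[n] = {
        "mean of means": mean(means),      # compare with mu = 6
        "sd of means": pstdev(means),      # variability of the sample mean
        "mean of medians": mean(medians),  # compare with median = 6
        "mean of ranges": mean(ranges),    # compare with range = 8
    }
    print(n, {key: round(val, 2) for key, val in summary[n].items()})
```

Two things to look for in the output: whether each statistic centers on the true population value (bias), and how the spread of the sample means changes as n grows (variability). A sample range can never exceed 8, so its average has to fall below the true range.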
Tanks revisited...
5 methods for estimating the total
number of tanks:
(1) partition = max × (5/4),
(2) max = max,
(3) MeanMedian = mean + median,
(4) SumQuartiles = Q1 + Q3,
(5) TwiceIQR = 2 × IQR.
The graph shows the approximate
sampling distribution for each of these
statistics when taking samples of size 4
from a population of 342 tanks.
(a) Which of these statistics appear to be biased estimators?
(b) Of the unbiased estimators, which is best?
(c) Why might a biased estimator be preferred to an unbiased estimator?
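The approximate sampling distributions can be rebuilt by simulation, which gives a way to check answers to (a) and (b). This sketch makes several assumptions: serial numbers run 1 through 342, samples are drawn without replacement, Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted sample (other quartile conventions give different values), and 5000 repetitions are used. The helper names are hypothetical.

```python
import random
from statistics import mean, median, pstdev

N = 342  # true number of tanks in the population
rng = random.Random(0)  # arbitrary fixed seed so the run is repeatable


def quartiles(sorted_sample):
    """Q1 and Q3 as medians of the lower and upper halves (an assumed convention)."""
    half = len(sorted_sample) // 2
    return median(sorted_sample[:half]), median(sorted_sample[-half:])


def estimates(sample):
    """Apply the five estimators to one sample of serial numbers."""
    s = sorted(sample)
    q1, q3 = quartiles(s)
    return {
        "partition": max(s) * 5 / 4,
        "max": max(s),
        "MeanMedian": mean(s) + median(s),
        "SumQuartiles": q1 + q3,
        "TwiceIQR": 2 * (q3 - q1),
    }


results = {}
for _ in range(5000):  # 5000 simulated captures of 4 tanks each
    sample = rng.sample(range(1, N + 1), 4)
    for name, value in estimates(sample).items():
        results.setdefault(name, []).append(value)

for name, values in results.items():
    # Center (compare with N = 342) and spread of each simulated distribution.
    print(f"{name:13s} mean = {mean(values):6.1f}   sd = {pstdev(values):6.1f}")
```

An estimator whose simulated mean sits far from 342 appears biased; among those that center near 342, the one with the smallest sd is the most precise.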