Sampling Distributions
Central Limit Theorem
Objectives
• Investigate the variability in sample statistics from
sample to sample
• Find measures of central tendency for the distribution of
sample statistics
• Find measures of dispersion for the distribution of sample
statistics
• Find the pattern of variability for sample statistics
Overview
• A new sample mean can be calculated each time a new
sample is taken
• In this way, the sample mean can be analyzed as a
random variable
• Being able to calculate (approximately) the distribution of
the sample mean is a critical tool for inference
Distribution of Sample Mean
Learning Objectives
• Understand the concept of a sampling
distribution
• Describe the distribution of the sample mean
for samples obtained from normal populations
• Describe the distribution of the sample mean
for samples obtained from a population that is
not normal
Statistical Inference
• Often the population is too large to perform a
census … so we take a sample
• How do the results of the sample apply to the
population?
– What’s the relationship between the sample mean
and the population mean?
– What’s the relationship between the sample standard
deviation and the population standard deviation?
• This is statistical inference
Estimation
• We want to use the sample mean X̄ to estimate the population
mean µ
• If we want to estimate the heights of eight-year-old girls, we can
proceed as follows
– Randomly select 100 eight-year-old girls
– Compute the sample mean of the 100 heights
– Use that as our estimate
• This is using the sample mean to estimate the population mean
Sample Mean is a Variable
• Usually, we take just one sample to estimate a
population parameter.
• However, if we take a series of different random samples
from a target population
– Sample 1 – we compute sample mean x̄1
– Sample 2 – we compute sample mean x̄2
– Sample 3 – we compute sample mean x̄3
– Etc.
• Each time we sample, we may get a different result
• The sample mean X̄ is a random variable!
Distribution of Sample Mean
• Because the sample mean is a random variable
– The sample mean has a probability distribution
– We can obtain the center and spread of the
probability distribution of the sample mean
• This is called the distribution of the sample mean
• Because the sample mean is a sample statistic, and a
distribution of a sample statistic is often called a
sampling distribution, the distribution of the sample
mean is also called the sampling distribution of the mean.
Example 1
Consider a population of a uniformly distributed
variable X with all of the values in the set {1, 2, 3, 4}
equally likely:
1) Calculate the mean and standard deviation of the
population.
2) Make a list of all samples of size 2 that can be drawn
from this set (sample with replacement)
3) Construct the sampling distribution for the sample
mean for samples of size 2
4) Calculate the center and spread of the sampling
distribution for the sample mean
5) Compare 1) and 4)
Example 1 (continued)
• Mean of the population distribution of the variable X,
denoted by $\mu_x$ or simply $\mu$:
$\mu_x = \sum x \cdot P(x) = 1 \times \frac{1}{4} + 2 \times \frac{1}{4} + 3 \times \frac{1}{4} + 4 \times \frac{1}{4} = 2.5$
• Variance of the population distribution of the variable X,
denoted by $\sigma_x^2$ or simply $\sigma^2$:
$\sigma_x^2 = \sum x^2 \cdot P(x) - \mu_x^2 = \left(1^2 \times \frac{1}{4} + 2^2 \times \frac{1}{4} + 3^2 \times \frac{1}{4} + 4^2 \times \frac{1}{4}\right) - 2.5^2 = 1.25$
• Standard deviation of the population distribution of the
variable X, denoted by $\sigma_x$ or simply $\sigma$:
$\sigma_x = \sqrt{1.25} \approx 1.118$
Example 1 (continued)
This table lists all possible samples of size 2, the mean
for each sample, and the probability of each sample
occurring (all equally likely)
Sample    Sample Mean    Probability
{1,1}     1.0            1/16
{1,2}     1.5            1/16
{1,3}     2.0            1/16
{1,4}     2.5            1/16
{2,1}     1.5            1/16
{2,2}     2.0            1/16
{2,3}     2.5            1/16
{2,4}     3.0            1/16
{3,1}     2.0            1/16
{3,2}     2.5            1/16
{3,3}     3.0            1/16
{3,4}     3.5            1/16
{4,1}     2.5            1/16
{4,2}     3.0            1/16
{4,3}     3.5            1/16
{4,4}     4.0            1/16
Example 1 (continued)
• Summarize the information in the previous table to obtain the
sampling distribution of the sample mean:
Sampling Distribution of the Sample Mean
x̄       1.0    1.5    2.0    2.5    3.0    3.5    4.0
P(x̄)    1/16   2/16   3/16   4/16   3/16   2/16   1/16
• Mean of the sampling distribution of the sample mean X̄, denoted by
$\mu_{\bar{x}}$, is:
$\mu_{\bar{x}} = 1.0 \times \frac{1}{16} + 1.5 \times \frac{2}{16} + 2.0 \times \frac{3}{16} + 2.5 \times \frac{4}{16} + 3.0 \times \frac{3}{16} + 3.5 \times \frac{2}{16} + 4.0 \times \frac{1}{16} = 2.5$
• Variance of the sampling distribution of the sample mean X̄,
denoted by $\sigma_{\bar{x}}^2$, is:
$\sigma_{\bar{x}}^2 = \left(1.0^2 \times \frac{1}{16} + 1.5^2 \times \frac{2}{16} + 2.0^2 \times \frac{3}{16} + 2.5^2 \times \frac{4}{16} + 3.0^2 \times \frac{3}{16} + 3.5^2 \times \frac{2}{16} + 4.0^2 \times \frac{1}{16}\right) - 2.5^2 = 0.625$
• Standard deviation of the sampling distribution of the sample mean
X̄, denoted by $\sigma_{\bar{x}}$, is:
$\sigma_{\bar{x}} = \sqrt{0.625} \approx 0.791$
Example 1 (continued)
Histogram: Sampling Distribution of the Sample Mean
[Histogram: P(x̄) on the vertical axis (0.00 to 0.25) against x̄ on the horizontal axis (1.0 to 4.0)]
Notice that the sampling distribution of the sample mean is symmetric and
roughly bell-shaped (for n = 2 it is triangular rather than exactly normal).
Example 1 (continued)
From the above example, we conclude that
• The sampling distribution of the sample mean tends to be bell-shaped.
• The mean of the sampling distribution of the sample mean is the same as
the underlying population mean. That is,
$\mu_{\bar{x}} = \mu_x$
• The standard deviation of the sampling distribution of the sample mean is
less than the population standard deviation. In fact,
$\sigma_{\bar{x}} = \sigma_x / \sqrt{n}$
Check:
$0.791 = 1.118 / \sqrt{2}$
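The five steps of Example 1 can also be checked by brute force. The following is a minimal Python sketch, not part of the original slides: it enumerates every ordered sample of size 2 (with replacement), builds the sampling distribution of the sample mean, and compares its center and spread with the population's. The helper name `mean_and_variance` is made up for illustration.

```python
from fractions import Fraction
from itertools import product
from math import isclose, sqrt

population = [1, 2, 3, 4]   # equally likely values of X
n = 2                       # sample size


def mean_and_variance(dist):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mu = sum(x * p for x, p in dist.items())
    var = sum(x**2 * p for x, p in dist.items()) - mu**2
    return mu, var


# Population distribution: each value has probability 1/4.
pop_dist = {Fraction(x): Fraction(1, len(population)) for x in population}

# Sampling distribution: every ordered sample of size n is equally likely (1/16 each).
samples = list(product(population, repeat=n))
sampling_dist = {}
for s in samples:
    xbar = Fraction(sum(s), n)
    sampling_dist[xbar] = sampling_dist.get(xbar, 0) + Fraction(1, len(samples))

mu, var = mean_and_variance(pop_dist)
mu_xbar, var_xbar = mean_and_variance(sampling_dist)

print(float(mu), float(var))            # 2.5 1.25
print(float(mu_xbar), float(var_xbar))  # 2.5 0.625
# The spread of the sample mean equals the population spread divided by sqrt(n).
print(isclose(sqrt(var_xbar), sqrt(var) / sqrt(n)))
```

Exact fractions are used so that the probabilities match the 1/16, 2/16, ... entries in the table without rounding.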
Example 2
• We have the data
1, 7, 11, 12, 17, 17, 17, 21, 21, 21, 22, 22
and we want to take samples of size n = 3
• First, a histogram of the entire data set
Example 2 (continued)
• A histogram of the entire data set
• Definitely skewed left … not bell shaped
Example 2 (continued)
• Taking some samples of size 3
• The first sample, 17, 21, 12, has a mean of 16.7
• The second sample, 17, 7, 17, has a mean of 13.7
• The third sample, 22, 11, 21, has a mean of 18
Example 2 (continued)
• More sample means from more samples
• We calculate the mean for each sample as shown below:
Example 2 (continued)
• Finally, a histogram of 20 sample means
• The original data set was highly left skewed, but the set of sample
means is less skewed and closer to bell shaped
Example 2 (continued)
• Now take a sample of size 5, repeated 20 times.
• Here is a histogram of 20 sample means:
[Histogram: Sampling Distribution of Sample Mean (sample size = 5); vertical axis: Frequency (0 to 8); horizontal axis: Sample Mean, in bins from 8.1 - 10.0 to 20.1 - 22.0]
• Observe that the empirical distribution of sample means is
closer to bell shaped when the size of the sample increases.
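The repeated-sampling experiment in Example 2 can be imitated in code. The sketch below is only an illustration under stated assumptions: the slides do not say whether the samples were drawn with or without replacement, so it assumes sampling without replacement, and it uses 2000 repetitions instead of 20 so that the pattern is stable. The function name `sample_means` is hypothetical.

```python
import random
import statistics

data = [1, 7, 11, 12, 17, 17, 17, 21, 21, 21, 22, 22]


def sample_means(values, n, repetitions, rng):
    """Means of `repetitions` random samples of size n.

    Assumption: each sample is drawn without replacement from `values`.
    """
    return [statistics.mean(rng.sample(values, n)) for _ in range(repetitions)]


rng = random.Random(0)  # fixed seed so the run is repeatable
means_n3 = sample_means(data, 3, 2000, rng)
means_n5 = sample_means(data, 5, 2000, rng)

print("mean of the data:", statistics.mean(data))
for n, means in ((3, means_n3), (5, means_n5)):
    # Center stays near the data mean; spread shrinks as n grows.
    print(
        f"n = {n}: mean of sample means = {statistics.mean(means):.2f}, "
        f"standard deviation of sample means = {statistics.stdev(means):.2f}"
    )
```

Plotting a histogram of `means_n3` and `means_n5` (with any plotting tool) would show the same tendency as the slides: less skew and a tighter, more bell-like shape for the larger sample size.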
Distribution of Sample Mean
• In general, the closer the underlying population is
to bell shaped (normally distributed), the more bell
shaped the sampling distribution (i.e. the distribution
of the sample mean) will tend to be.
• In fact, if the population is normally distributed,
the sampling distribution
– Will be normally distributed
– Will have a mean equal to the mean of the population
– Will have a standard deviation less than the standard
deviation of the population
Distribution of Sample Mean
• Why does it have a smaller standard deviation?
• The population standard deviation $\sigma_x$
– Is a measure of the distance/deviation between an individual
value and the population mean
– Is the standard deviation of the sample mean for n = 1
• The standard deviation of the sample mean $\sigma_{\bar{x}}$
– Is a measure of the distance/deviation between the sample
mean and the mean of the sampling distribution (which is the
same as the population mean)
• It makes sense that the estimate of the population mean using a
sample mean is more accurate (closer to the population mean) if the
sample contains more values (a larger n) from the population.
Therefore, the larger the sample size, the smaller the deviation of the
sample mean from the true population mean. That is, the standard
deviation of the sample mean decreases as the sample size n increases
(it is inversely proportional to √n).
Sampling Distribution of Sample Mean
• If a simple random sample of size n is drawn
from a population, then the sampling distribution
has
– Mean $\mu_{\bar{x}} = \mu_x$ and
– Standard deviation $\sigma_{\bar{x}} = \sigma_x / \sqrt{n}$
• In addition, if the population is normally
distributed, then
– The sampling distribution is normally distributed
Standard Error
• The standard deviation of the sample mean $\sigma_{\bar{x}}$
is also called the standard error
• The formula for $\sigma_{\bar{x}}$ is
$\sigma_{\bar{x}} = \sigma_x / \sqrt{n}$
• This is an extremely important formula
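One way to build confidence in the standard error formula is to compare it with a simulation. The sketch below is an illustration, not the slides' own method: it assumes a normal population with mean 20 and standard deviation 12 (illustrative values), draws many samples of each size, and compares the observed standard deviation of the sample means with σ/√n. The function names are made up for this example.

```python
import random
import statistics
from math import sqrt


def standard_error(sigma, n):
    """Standard deviation of the sample mean for samples of size n."""
    return sigma / sqrt(n)


def simulated_sd_of_mean(mu, sigma, n, repetitions, rng):
    """Empirical standard deviation of sample means from a normal population."""
    means = [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(repetitions)
    ]
    return statistics.stdev(means)


rng = random.Random(1)  # fixed seed so the run is repeatable
mu, sigma = 20, 12      # assumed (illustrative) population parameters
for n in (1, 4, 9, 36):
    formula = standard_error(sigma, n)
    observed = simulated_sd_of_mean(mu, sigma, n, 5000, rng)
    print(f"n = {n:2d}: formula = {formula:.2f}, simulated = {observed:.2f}")
```

Quadrupling the sample size halves the standard error, which is why the formula divides by √n rather than by n.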
Example
• If the random variable X has a normal distribution with a
mean of 20 and a standard deviation of 12
– If we choose samples of size n = 4, then the sample
mean will have a normal distribution with a mean of
20 and a standard deviation of 6 (since 6 = 12/√4)
– If we choose samples of size n = 9, then the sample
mean will have a normal distribution with a mean of
20 and a standard deviation of 4 (since 4 = 12/√9)
Note: if the underlying population distribution is normal, the sampling
distribution of the sample mean will be normal regardless of whether the
sample size is large or small.
Sampling Distribution of Sample Mean
• This is great if our random variable X has a normal
distribution
• However … what if the underlying population distribution of
the random variable X does not have a normal
distribution?
• What can we say about the sampling distribution of the
sample mean?
– Wouldn’t it be very nice if the sampling distribution for
the sample mean also was normal?
– This is almost true …
Central Limit Theorem
• The Central Limit Theorem states
Regardless of the shape of the underlying population
distribution (provided its standard deviation is finite),
the sampling distribution of the sample mean
becomes approximately normal as the sample size n increases.
• Thus
– If the random variable X is normally distributed, then the
sampling distribution of the sample mean is also normally
distributed, regardless of the size of the sample.
– For all other random variables X, the sampling distributions are
approximately normally distributed if the size of the sample is
large enough.
Graphical Illustration of the Central Limit Theorem
[Figure: four panels showing the original population and the distribution of x̄ for n = 2, n = 10, and n = 30; horizontal axes run from 10 to 30]
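The effect pictured in the graphical illustration can be reproduced numerically. The following sketch is an assumption-laden illustration rather than the slides' own example: it uses an exponential population (strongly right skewed) as a stand-in for a non-normal population and measures the skewness of the simulated sample means, which should shrink toward 0 (the value for a symmetric, bell-shaped distribution) as n grows. The helper names are hypothetical.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric distribution)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


def sample_mean_skewness(n, repetitions, rng):
    """Skewness of the means of `repetitions` samples of size n.

    Assumption: the population is exponential with rate 1, chosen only
    because it is clearly not normal.
    """
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repetitions)
    ]
    return skewness(means)


rng = random.Random(42)  # fixed seed so the run is repeatable
results = {n: sample_mean_skewness(n, 20000, rng) for n in (1, 2, 10, 30)}
for n, skew in results.items():
    print(f"n = {n:2d}: skewness of sample means = {skew:.2f}")
```

The row for n = 1 is the population itself; each larger n gives a distribution of sample means that is visibly closer to symmetric.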
How large is the sample?
• This approximation, of the sampling
distribution being normal, is good for large
sample sizes … large values of n
• How large does n have to be?
• A rule of thumb – if n is 30 or higher, this
approximation is probably pretty good
Applications of the Central Limit Theorem
• When the sampling distribution of the sample
mean is (exactly) normally distributed, or
approximately normally distributed (by the
CLT), we can answer probability questions
using the standard normal distribution.
Example 1
• We’ve been told that the average weight of
giraffes is 2400 pounds with a standard
deviation of 300 pounds
• We randomly picked 50 giraffes, weighed
them, and found that the sample mean was 2600
pounds
• Is our data consistent with what we’ve been
told? (That is, does the observed sample mean of 2600
pounds support the claim that the population
average weight of giraffes is 2400 pounds?)
Example 1 (continued)
• We do not know the shape of the distribution of giraffe weights, but
the sample size of 50 is large enough to justify using the Central Limit
Theorem. The sampling distribution of the average weight of 50 giraffes is
expected to be approximately normal with mean 2400 pounds (the same as
the population) and a standard deviation of 300 / √50 = 42.4 pounds
• Using our calculations for the general normal distribution, 2600 is 200
pounds over 2400, and 200 pounds is 200 / 42.4 = 4.7 standard deviations.
That is, the Z-score for the average weight of 2600 is 4.7, which is a value
near the end of the right tail of a standard normal distribution.
• From our normal calculator, the probability of obtaining an average at
least this large by chance is less than 0.00001.
• Something is definitely strange … we’ll see what to do later in inferential
statistics
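The giraffe calculation can be reproduced with Python's standard library in place of a normal calculator. This is a minimal sketch under the same assumption the slides make (the CLT applies because n = 50); the function name `z_score_of_mean` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_score_of_mean(sample_mean, mu, sigma, n):
    """Z-score of a sample mean: (x_bar - mu) / (sigma / sqrt(n))."""
    return (sample_mean - mu) / (sigma / sqrt(n))


# Claimed population: mean 2400 lb, standard deviation 300 lb; sample of 50 giraffes.
z = z_score_of_mean(2600, 2400, 300, 50)

# Probability of a sample mean at least this large if the claim were true.
upper_tail = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}")
print(f"P(Z >= z) = {upper_tail:.2e}")
```

The tail probability comes out well below the 0.00001 bound quoted above, which is what makes the observed mean of 2600 pounds so surprising.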
Example 2
Consider a normal population with µ = 50 and σ = 15.
Suppose a sample of size 9 is selected at random. Find:
1) $P(45 \le \bar{x} \le 60)$
2) $P(\bar{x} \le 47.5)$
Solutions: Since the original population is normal, the
distribution of the sample mean is also (exactly) normal with
1) $\mu_{\bar{x}} = \mu = 50$
2) $\sigma_{\bar{x}} = \sigma / \sqrt{n} = 15 / \sqrt{9} = 15 / 3 = 5$
3) use TI calculator to find the probability:
normalcdf(45,60,50,5) = 0.8186
normalcdf(-E99,47.5,50,5) = 0.3085
Example 2 (continued)
Or, use Z-table to solve:
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
$P(45 \le \bar{x} \le 60) = P\left(\frac{45 - 50}{5} \le z \le \frac{60 - 50}{5}\right) = P(-1.00 \le z \le 2.00) = 0.3413 + 0.4772 = 0.8185$
[Figure: normal curve with the x̄ axis marked at 45, 50, 60 and the z axis marked at −1.00, 0, 2.00; shaded areas 0.3413 and 0.4772]
Example 2 (continued)
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
$P(\bar{x} \le 47.5) = P\left(\frac{\bar{x} - 50}{5} \le \frac{47.5 - 50}{5}\right) = P(z \le -0.50) = 0.5000 - 0.1915 = 0.3085$
[Figure: normal curve with the x̄ axis marked at 47.5 and 50 and the z axis marked at −0.50 and 0; area 0.1915 between the marks, area 0.3085 in the left tail]
Example 3
A recent report stated that the day-care cost per week in Boston is $109.
Suppose this figure is taken as the mean cost per week and that the
standard deviation is known to be $20.
1) Find the probability that a sample of 50 day-care centers would show
a mean cost of $105 or less per week.
2) Suppose the actual sample mean cost for the sample of 50 day-care
centers is $120. Is there any evidence to refute the claim of $109
presented in the report?
Solutions:
The shape of the original distribution is unknown, but the sample size,
n = 50, is large. The CLT applies.
The distribution of X̄ is approximately normal with
$\mu_{\bar{x}} = \mu = 109$; $\sigma_{\bar{x}} = \sigma / \sqrt{n} = 20 / \sqrt{50} \approx 2.83$
Example 3 (continued)
1) Use Z-table:
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
$P(\bar{x} \le 105) = P\left(z \le \frac{105 - 109}{2.83}\right) = P(z \le -1.41) = 0.5000 - 0.4207 = 0.0793$
[Figure: normal curve with the x̄ axis marked at 105 and 109 and the z axis marked at −1.41 and 0; area 0.4207 between the marks, area 0.0793 in the left tail]
• Or, use TI calculator: normalcdf(-E99,105,109,2.83) = 0.0787
2)
• To investigate the claim, we need to examine how likely an
observed sample mean of $120 is
Example 3 (continued)
Consider how far out in the tail of the distribution of the sample mean
$120 is.
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
$P(\bar{x} \ge 120) = P\left(z \ge \frac{120 - 109}{2.83}\right) = P(z \ge 3.89) = 0.5000 - 0.4999 = 0.0001$
Or using TI calculator: normalcdf(120,E99,109,2.83) = 5.08E-5
• Since the probability is so small, this suggests the observation of $120
is very rare (if the mean cost is really $109)
• There is evidence (the sample) to suggest the claim of µ = $109 is likely
wrong
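For readers without a TI calculator, the `normalcdf` calls in Examples 2 and 3 can be mimicked in Python. The helper below is a hypothetical stand-in written for this illustration (it is not a library function); it assumes the calculator's argument order of lower bound, upper bound, mean, standard deviation, and uses infinity where the slides use ±E99.

```python
from math import inf, sqrt
from statistics import NormalDist


def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under the Normal(mu, sigma) curve between lower and upper.

    Hypothetical helper imitating the TI calculator call of the same name.
    """
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)


# Example 2: normal population, mu = 50, sigma = 15, n = 9, so sigma_xbar = 5.
sigma_xbar_ex2 = 15 / sqrt(9)
p_between = normalcdf(45, 60, 50, sigma_xbar_ex2)      # about 0.8186
p_below = normalcdf(-inf, 47.5, 50, sigma_xbar_ex2)    # about 0.3085

# Example 3: mu = 109, sigma = 20, n = 50.
sigma_xbar_ex3 = 20 / sqrt(50)
p_105_or_less = normalcdf(-inf, 105, 109, sigma_xbar_ex3)
p_120_or_more = normalcdf(120, inf, 109, sigma_xbar_ex3)

print(round(p_between, 4), round(p_below, 4))
print(round(p_105_or_less, 4), p_120_or_more)
```

Small differences from the Z-table answers (for example 0.0787 versus 0.0793) come from rounding z to two decimals before looking it up in the table.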
Summary
• The sample mean is a random variable with a
distribution called the sampling distribution
– If the sample size n is sufficiently large (30 or more
is a good rule of thumb), then this distribution is
approximately normal
– The mean of the sampling distribution is equal to the
mean of the underlying population
– The standard error/deviation of the sampling
distribution is equal to $\sigma_x / \sqrt{n}$
Distribution of the Sample Proportion
Learning Objective
• Describe the sampling distribution of a sample
proportion
• Calculate probabilities of a sample proportion
Sampling Distribution of Sample
Proportion
• We have the same questions for the sample
proportion as we had for the sample mean
– What is the mean for the sampling distribution of the
sample proportion?
– What is the standard deviation for the sampling
distribution of the sample proportion?
– What is the distribution of the sample proportion?
– Can we apply the Central Limit Theorem to
approximate these with normal distributions?
• The answer is yes …
Example
• If a polling company polled 800 people to
see if they supported a certain issue and
475 did, then we have a sample proportion
problem with
– n = 800
– x = 475
– and a sample proportion of p̂ = 475/800 ≈ 0.59
Sample Proportion
• In an election, polling companies wish to
estimate the percent of people who will
vote for each candidate
• This clearly is a situation for sampling as it
is impractical to contact every single voter
• The desired results are proportions, for
example that 59% of the voters (a
proportion of 0.59) said that they will vote
for candidate A
Sample Proportions
• A random sample is taken
– Of size n
– Each individual either has or does not have a
certain characteristic (dichotomous outcomes)
– In total, there are x individuals that have this
characteristic
• Then the sample proportion p̂ (p hat), the
proportion of individuals with this characteristic, is
given by
$\hat{p} = \frac{x}{n}$
Sampling Distribution of Sample
proportion
• If the population proportion is p, then the
distribution of the sample proportion for a
sample of size n
– Is approximately normal if np(1-p) ≥ 10
– Has a mean of
µ p̂ = p
– Has a standard deviation of
$\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$
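The mean and standard deviation stated for the sample proportion can be checked by simulation. This sketch is an illustration under assumed values (p = 0.3 and n = 100 are hypothetical, chosen so that np(1 − p) = 21 ≥ 10); it draws many samples, computes p̂ for each, and compares the results with p and √(p(1 − p)/n). The function name is made up for this example.

```python
import random
import statistics
from math import sqrt


def simulate_sample_proportions(p, n, repetitions, rng):
    """Sample proportions p_hat = x / n from `repetitions` samples of size n."""
    return [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(repetitions)
    ]


p, n = 0.3, 100          # assumed (hypothetical) population proportion and sample size
rng = random.Random(7)   # fixed seed so the run is repeatable
p_hats = simulate_sample_proportions(p, n, 5000, rng)

print("np(1 - p) >= 10:", n * p * (1 - p) >= 10)
print(f"mean of p_hat: simulated = {statistics.fmean(p_hats):.4f}, formula = {p}")
print(
    f"sd of p_hat:   simulated = {statistics.stdev(p_hats):.4f}, "
    f"formula = {sqrt(p * (1 - p) / n):.4f}"
)
```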
Example
• Assume that 80% of the people taking
aerobics classes are female and a simple
random sample of n = 100 students is
taken
– What is the probability that at most 75% of the
sample students are female?
– If the sample had exactly 90 female students,
would that be unusual?
Example (continued)
• The sample proportion p̂ of aerobics students who are female
– Has an approximately normal distribution
– Has a mean of 0.80 and a standard deviation of
√(0.80 × 0.20 / 100) = 0.04
• What is the probability that p̂ is 0.90 or more?
Note: the probability that p̂ is exactly 0.90 is zero under the normal distribution, so to judge
whether 0.90 is unusual (too high a proportion) we evaluate the probability of a value at least as high as 0.90,
since 0.90 is larger than the expected value of 0.80.
– 0.90 is 0.10 more than the mean of 0.80
– 0.10 is 2.5 standard deviations more than the mean (i.e. the z-score is
2.5)
– The normal probability P(z ≥ 2.5) = 0.0062
– Thus P( p̂ ≥ 0.90) = 0.0062 … pretty unlikely
Note: To obtain P( p̂ ≥ 0.90) = 0.0062,
apply TI calculator with normalcdf(2.5,E99,0,1) = 0.0062 or
normalcdf(0.9,E99,0.8,0.04) = 0.0062
[or normalcdf(0.9,1,0.8,0.04), since a proportion can only go up to 1]
Example (continued)
•
The sample proportion p̂ of aerobics students who are female
– Has an approximately normal distribution
– Has a mean of 0.80 and a standard deviation of 0.04
•
What is the probability that p̂ is 0.75 or less?
– 0.75 is 0.05 less than the mean of 0.80
– 0.05 is 1.25 standard deviations less than the mean (i.e. the z-score is –1.25)
– The normal probability P(z ≤ –1.25) = 0.1056
– Thus P( p̂ ≤ 0.75) = 0.1056
Note: To obtain P( p̂ ≤ 0.75) = 0.1056,
apply TI graphing calculator with normalcdf(-E99,-1.25,0,1) or
normalcdf(-E99,0.75,0.80,0.04)
[or normalcdf(0,0.75,0.80,0.04), since a proportion can’t be less
than zero.]
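Both aerobics probabilities can be reproduced without a calculator. The following is a minimal sketch of the normal approximation used above, with p = 0.80 and n = 100 taken from the example; the variable names are chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.80, 100

# Condition for the normal approximation and the standard deviation of p_hat.
approximation_ok = n * p * (1 - p) >= 10
sigma_p_hat = sqrt(p * (1 - p) / n)

# Approximate sampling distribution of the sample proportion.
dist = NormalDist(p, sigma_p_hat)
prob_at_most_75 = dist.cdf(0.75)        # P(p_hat <= 0.75)
prob_at_least_90 = 1 - dist.cdf(0.90)   # P(p_hat >= 0.90)

print(approximation_ok, round(sigma_p_hat, 4))
print(round(prob_at_most_75, 4), round(prob_at_least_90, 4))
```

A sample with 90 or more females out of 100 would occur well under 1% of the time under this model, which is why the slides call it pretty unlikely.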
Summary
• The sample proportion, like the sample mean, is a
random variable
– If the sample size n is sufficiently large and the
population proportion p isn’t close to either 0 or 1,
then this distribution is approximately normal
– The mean of the sampling distribution is equal to the
population proportion p
– The standard deviation of the sampling distribution is
equal to $\sqrt{p(1 - p) / n}$
Summary of Sampling Distributions
• The sample mean and the sample proportion
can be considered as random variables
• The sample mean is approximately normal with
– A mean equal to the population mean $\mu_{\bar{x}} = \mu$
– A standard deviation equal to $\sigma_{\bar{x}} = \sigma / \sqrt{n}$
• The sample proportion is approximately normal
with
– A mean equal to the population proportion $\mu_{\hat{p}} = p$
– A standard deviation equal to $\sigma_{\hat{p}} = \sqrt{p(1 - p) / n}$