Statistics for Business
(Env)
Chapter 7
Sampling Distributions
1
Sampling Distributions
7.1 The Sampling Distribution of the Sample Mean
7.2 Central Limit Theorem
7.3 STANDARD ERROR AND STATISTICAL INFERENCE
2
The sampling process
A sample should be representative of the entire population, yet it is not
expected to be identical to the population.
3
Sampling distribution
• Suppose that we draw all possible samples
of size n from a given population.
• Suppose further that we compute a statistic
(e.g., a mean, IQR, standard deviation) for
each sample.
• The probability distribution of this statistic is
called a sampling distribution.
4
Sampling error is the discrepancy, or
amount of error, between a sample
statistic and its corresponding population
parameter.
The distribution of sample means is the
collection of sample means for all the
possible random samples of a particular
size (n) that can be obtained from a
population.
5
The sampling distribution
6
Two different questions
Data distribution: P(X > 70)
Distribution of sample means: P(X̄ > 70)
7
8
Variability of a Sampling Distribution
• The variability of a sampling distribution
depends on three factors:
– N: The number of objects in the population.
– n: The number of objects in the sample.
– The way that the random sample is chosen.
9
Sample without replacement
• If the population size is much larger than the
sample size, then the sampling distribution has
roughly the same sampling error whether we
sample with or without replacement (without
replacement, a population element can be selected
only one time).
• On the other hand, if the sample represents a
significant fraction (say, 1/10) of the population
size, the sampling error will be noticeably smaller
when we sample without replacement.
10
Methods of Probability Sampling
The sampling error is the difference
between a sample statistic (e.g., X̄) and its
corresponding population parameter (e.g., μ).
The sampling distribution of the sample
mean is the probability distribution of the
population of the sample means obtainable
from all possible samples of size n from a
population of size N.
11
Example 1:
A population that consists of only 4
scores: 2, 4, 6, 8.
Mean=5
12
All the possible samples of n = 2
TABLE 7.1
Notice that the table lists
random samples. This requires
sampling with replacement, so
it is possible to select the same
score twice.
13
FIGURE 7.2
The distribution of sample means for n = 2. The
distribution shows the 16 sample means from Table 7.1.
Mean of sample mean = 5
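The enumeration behind Table 7.1 and Figure 7.2 can be sketched in a few lines of Python. This is only an illustration, assuming ordered samples drawn with replacement as described above; the variable names are not from the slides.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [2, 4, 6, 8]
n = 2

# Sampling with replacement: every ordered pair of scores is a possible sample.
samples = list(product(population, repeat=n))
sample_means = [Fraction(sum(s), n) for s in samples]

# Frequency of each sample mean (the distribution plotted in Figure 7.2).
distribution = Counter(sample_means)
mean_of_sample_means = sum(sample_means) / len(sample_means)

print(len(samples))  # number of possible samples
for m in sorted(distribution):
    print(m, distribution[m])
print(mean_of_sample_means)
```

The mean of the 16 sample means comes out equal to the population mean of 5.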
14
Sampling without replacement:
\[ {}_4C_2 = \frac{4!}{2!\,2!} = 6 \]

Sample   #1   #2   Sample mean
1        2    4    3
2        2    6    4
3        2    8    5
4        4    6    5
5        4    8    6
6        6    8    7

Mean of sample mean = 5
[Figure: frequency (f) of each sample mean, X from 3 to 7]
15
Example 2:
A law firm has five partners.
At their weekly partners
meeting each reported the
number of hours they billed
clients for their services last
week.
Partner    Hours
Dunn       22
Hardy      26
Kiers      30
Malory     26
Tillman    22
If two partners are selected
randomly, how many different
samples are possible?
16
Sampling without replacement
Example 2 (cont.)
5 objects taken 2 at a time:
\[ {}_5C_2 = \frac{5!}{2!\,(5-2)!} = 10 \]
A total of 10 different samples

Partners   Total   Mean
1,2        48      24
1,3        52      26
1,4        48      24
1,5        44      22
2,3        56      28
2,4        52      26
2,5        48      24
3,4        56      28
3,5        52      26
4,5        48      24
17
Example 2 (cont.)
As a sampling distribution

Sample Mean   Frequency   Probability
22            1           1/10
24            4           4/10
26            3           3/10
28            2           2/10
18
Compute the mean of the sample means. Compare it
with the population mean.
The mean of the sample means:
\[ E(\bar{X}) = \mu_{\bar{X}} = \frac{22(1) + 24(4) + 26(3) + 28(2)}{10} = 25.2 \]
The population mean:
\[ \mu = \frac{22 + 26 + 30 + 26 + 22}{5} = 25.2 \]
Notice that the mean
of the sample means is
exactly equal to the
population mean.
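The law firm example can be reproduced by listing every pair of partners. The following is a minimal sketch, assuming unordered samples without replacement; the dictionary and variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

hours = {"Dunn": 22, "Hardy": 26, "Kiers": 30, "Malory": 26, "Tillman": 22}

# All samples of two partners, without replacement (5C2 of them).
pairs = list(combinations(hours, 2))
pair_means = [Fraction(hours[a] + hours[b], 2) for a, b in pairs]

# Sampling distribution: frequency and probability of each sample mean.
frequency = Counter(pair_means)
probability = {m: Fraction(c, len(pairs)) for m, c in frequency.items()}

mean_of_sample_means = sum(m * p for m, p in probability.items())
population_mean = Fraction(sum(hours.values()), len(hours))

print(len(pairs))
print(sorted(frequency.items()))
print(float(mean_of_sample_means), float(population_mean))
```

Both means evaluate to 25.2, matching the hand calculation.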
19
Example 3
• Take another population: 3, 6, 9, 12, 15
• Population size N=5, sample size n=2, mean=9,
variance=18, SD=4.2426
• The number of possible samples which can be
drawn without replacement is 5C2 =10
20
Variance of the sample means = 6.75
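The value 6.75 can be checked by enumerating all 10 samples. The sketch below also compares it with the standard finite population correction formula, σ²/n · (N − n)/(N − 1), which is not stated on the slide and is included here as an assumption about where the number comes from.

```python
from fractions import Fraction
from itertools import combinations
from statistics import pvariance

population = [3, 6, 9, 12, 15]
N, n = len(population), 2

# Means of all samples of size 2 drawn without replacement.
sample_means = [Fraction(sum(s), n) for s in combinations(population, n)]

var_of_means = pvariance(sample_means)   # variance of the 10 sample means
sigma2 = pvariance(population)           # population variance

# Finite population correction for sampling without replacement.
fpc_formula = Fraction(sigma2, n) * Fraction(N - n, N - 1)

print(len(sample_means), sigma2, float(var_of_means), float(fpc_formula))
```

With replacement the variance would be σ²/n = 9; sampling without replacement from this small population reduces it to 6.75.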
21
Example 4: Sampling All Stocks
• Population of returns of all 1,815 stocks listed on
NYSE for 1987
– See Figure on next slide
– The mean rate of return μ was –3.5% with a standard
deviation σ of 26%
• Draw all possible random samples of size n=5 and
calculate the sample mean return of each
– Sample with a computer
– See Figure on next slide
22
Example: Sampling All Stocks
23
Results from Sampling All Stocks
• Observations
– Both histograms appear to be bell-shaped and centered
over the same mean of –3.5%
– The histogram of the sample mean returns looks less
spread out than that of the individual returns
• Statistics
– Mean of all sample means: µx̄ = µ = -3.5%
– Standard deviation of all possible means:
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{26}{\sqrt{5}} = 11.63\% \]
24
The examples above demonstrate the construction of the distribution
of sample means for a relatively simple, specific situation. In most
cases, however, it will not be possible to list all the samples and
compute all the possible sample means. Therefore, it is necessary
to develop the general characteristics of the distribution of
sample means that can be applied in any situation. Fortunately,
these characteristics are specified in a mathematical proposition
known as the central limit theorem. This important
and useful theorem serves as a cornerstone for much of
inferential statistics.
25
General Conclusions
1. If the population of individual items is
normal, then the population of all sample
means is also normal
2. Even if the population of individual items is
not normal, there are circumstances when
the population of all sample means is
approximately normal (Central Limit Theorem)
26
Central Limit Theorem: For any population with
mean μ and standard deviation σ, the distribution of
sample means for sample size n will have a mean of μ
and a standard deviation of σ/√n and will
approach a normal distribution as n becomes
sufficiently large.
The value of this theorem comes from two simple facts. First, it
describes the distribution of sample means for any population,
no matter what shape, or mean, or standard deviation. Second,
the distribution of sample means “approaches” a normal
distribution very rapidly. By the time the sample size reaches about
n = 30, the distribution is close to normal for most populations.
27
Central Limit Theorem
If the sample size is large enough (n ≥ 30), then we can
consider that the sample mean approximately follows a
normal distribution:
\[ \bar{X} \sim N(\mu, \sigma^2 / n) \]
The variance of the sample mean is the population
variance divided by n. (This holds for any n when the
observations are independent; only the normal shape
requires large n.)
\[ \mathrm{Var}(\bar{X}) = \sigma_{\bar{X}}^2 = \frac{\sigma^2}{n} \]
Averages are less variable than individual observations.
28
Sample Means
Sample means follow
the normal distribution
under two conditions:
the population itself
follows the normal
distribution
OR
the sample size is large
enough (n ≥ 30).
29
[Figure: Distribution of all possible sample means, with standard deviation σ_x̄ = σ/√n, shown against the distribution of data (normal distribution) with standard deviation σ]
The distribution of
sample means is
less spread out.
30
The standard deviation of the distribution of sample means
is called the standard error of the mean, σ_X̄.
The standard error measures the standard amount of
difference between X̄ and μ that is reasonable to expect
simply by chance.
It should be intuitively reasonable that the size of a sample
should influence how accurately the sample represents its
population. Specifically, a large sample should be more accurate
than a small sample. In general, as the sample size increases, the
error between the sample mean and the population mean
should decrease. This rule is also known as the law of large
numbers.
31
The law of large numbers states that
the larger the sample size (n), the more
probable it is that the sample mean
will be close to the population mean.
The standard error provides a way to measure the “average”
or standard distance between a sample mean and the
population mean.
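A small sketch of how the standard error σ/√n shrinks as the sample size grows. The population standard deviation of 100 is a hypothetical value chosen for illustration, and `standard_error` is a helper name introduced here.

```python
from math import sqrt


def standard_error(sigma, n):
    """Standard error of the mean for independent samples of size n."""
    return sigma / sqrt(n)


sigma = 100  # hypothetical population standard deviation (an assumption)
for n in (1, 4, 25, 100):
    print(n, standard_error(sigma, n))
```

Quadrupling the sample size halves the standard error, so gains in accuracy become slower as n increases.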
32
The distribution of sample means for random samples as
the size n increases
33
Example 5:
The population of scores on the SAT forms a normal distribution
with mean = 500 and sd = 100. If you take a random sample of n
= 25 students, what is the probability that the sample mean
would be greater than X̄ = 540?
You can restate this probability question as: Out of all the possible sample
means, what proportion has values greater than 540?
Need to determine the distribution of the sample mean with
n = 25. We know:
1. The distribution is normal because the population of SAT
scores is normal.
2. The distribution has a mean of 500 because the population
mean is 500.
3. The distribution has a standard error of 100/sqrt(25) = 20.
34
[Figure: The distribution of sample means for n = 25. Samples were selected from a normal population with mean = 500 and sd = 100. The standard error is σ_x̄ = σ/√n.]
The next step is to use a z-score to locate the exact position of
X̄ = 540 in the distribution.
35
The value 540 is located above the mean by 40 points, which
is exactly 2 standard deviations (in this case, exactly 2
standard errors). Thus, the z-score for 540 is 2.00.
Because this distribution of sample means is normal, you can
use the unit normal table to find the probability associated
with z > 2.00. The table indicates that 0.0228 of the
distribution is located in the tail of the distribution beyond
z = 2.00.
Our conclusion is that it is very unlikely, p = 0.0228 (2.28%),
to obtain a random sample of n = 25 students with an
average SAT score greater than 540.
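The same result can be checked numerically. This is a minimal sketch that uses Python's standard-library `statistics.NormalDist` in place of the unit normal table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 500, 100, 25

standard_error = sigma / sqrt(n)      # standard deviation of the sample means
z = (540 - mu) / standard_error       # position of 540 in standard-error units

# Upper-tail area of the standard normal distribution beyond z.
p_greater = 1 - NormalDist().cdf(z)

print(standard_error, z, round(p_greater, 4))
```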
36
Example 6
Suppose the mean selling
price of a gallon of
gasoline in the U.S. is
$1.30. Further, assume
the population standard deviation σ is $0.28.
What is the probability
that the mean of a
sample of 35 gasoline
stations is between $1.22
and $1.38?
37
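Example 6 (cont.)
The z-values corresponding to $1.22 and
$1.38 are -1.69 and 1.69:
\[ z = \frac{1.22 - 1.30}{0.28/\sqrt{35}} = -1.69, \qquad z = \frac{1.38 - 1.30}{0.28/\sqrt{35}} = 1.69 \]
From the table for the standard normal distribution:
\[ P(-1.69 \le Z \le 1.69) = 2(0.4545) = 0.9090 \]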
We would expect about 91% of the sample means to be
within $0.08 of the population mean.
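As a rough check of the interval probability, the sampling distribution can be modelled directly as a normal distribution with standard deviation σ/√n. This sketch assumes the normal approximation holds for n = 35 and uses illustrative variable names.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 1.30, 0.28, 35

# Approximate sampling distribution of the mean price for 35 stations.
sampling_dist = NormalDist(mu, sigma / sqrt(n))

# Probability that the sample mean falls between $1.22 and $1.38.
p_between = sampling_dist.cdf(1.38) - sampling_dist.cdf(1.22)

print(round(p_between, 4))
```

The result agrees with the table-based value of about 0.909, up to the rounding of z to two decimals.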
38
Example 7
Assume that a school district has 10,000 sixth
graders. In this district, the average weight of
a sixth grader is 80 pounds, with a standard
deviation of 20 pounds. Suppose you draw a
random sample of 50 students. What is the
probability that the average weight of the
sampled students will be less than 75 pounds?
39
Example 7 cont.
• The standard deviation of the sampling
distribution can be computed using the following
formula.
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]
• σ_x̄ = 20 * sqrt(1/50) = 20 * 0.1414 = 2.828
• The sampling distribution of the mean is approximately
normal with a mean of 80 and a standard
deviation of 2.83.
• From the table:
P(z < (75 - 80)/2.83) = P(z < -1.77) = 0.038
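A short numerical check of this example. Because the 50 students are drawn from a finite population of 10,000, the sketch also applies the standard finite population correction sqrt((N − n)/(N − 1)), which the slide does not use; it is included only to show that the correction is negligible here.

```python
from math import sqrt
from statistics import NormalDist

N, n = 10_000, 50
mu, sigma = 80, 20

se = sigma / sqrt(n)                       # about 2.83, as on the slide
se_fpc = se * sqrt((N - n) / (N - 1))      # with finite population correction

# Lower-tail probability that the sample mean is below 75 pounds.
p = NormalDist(mu, se).cdf(75)
p_fpc = NormalDist(mu, se_fpc).cdf(75)

print(round(se, 3), round(se_fpc, 3), round(p, 4), round(p_fpc, 4))
```

Both probabilities are close to 0.038, since the sample is only 0.5% of the population.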
The Central Limit Theorem
Random sample (x1, x2, …, xn) with sample mean x̄
Population distribution (μ, σ) (right-skewed)
As n becomes large:
Sampling distribution of the sample mean (μ_x̄ = μ, σ_x̄ = σ/√n) (nearly normal)
41
Example: Central Limit Theorem
Simulation
42
Histogram of Population - Bimodal
Distribution: population = 16,000;
mean = 5.002; std dev = 4.242
Sampling Distribution (from a
bimodal population) n = 2: number
of samples = 4000; mean = 4.977;
std dev = 3.017
43
Sampling Distribution (from a
bimodal population) n = 3:
number of samples = 4000;
mean = 4.946; std dev = 2.425
Sampling Distribution n = 30:
number of samples = 4000;
mean = 5.032; std dev = 0.722
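A simulation along these lines can be sketched as follows. The population used on the slides is not available, so this example assumes a hypothetical bimodal population (two normal clusters centred at 1 and 9) chosen only so that its mean and standard deviation are roughly 5 and 4.24. Samples are drawn with replacement.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical bimodal population of 16,000 values (an assumption).
population = [
    random.gauss(1 if random.random() < 0.5 else 9, sqrt(2))
    for _ in range(16_000)
]
sigma = pstdev(population)

results = {}
for n in (2, 3, 30):
    # 4000 random samples of size n, drawn with replacement.
    means = [mean(random.choices(population, k=n)) for _ in range(4000)]
    results[n] = (mean(means), pstdev(means))
    # Compare the simulated spread with the theoretical sigma / sqrt(n).
    print(n, round(results[n][0], 3), round(results[n][1], 3),
          round(sigma / sqrt(n), 3))
```

The simulated standard deviations should track σ/√n, which is also the pattern in the slide figures: 4.242/√2 ≈ 3.0, 4.242/√3 ≈ 2.45, and 4.242/√30 ≈ 0.77.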
44
STANDARD ERROR AND STATISTICAL
INFERENCE: Standard error as a measure of
chance
Most inferential statistics are used in the context of a
research study. Typically, the researcher begins with
a general question about how a treatment will affect
the individuals in a population.
For example,
Will the drug affect blood pressure?
Will the hormone affect growth?
Will the special training affect students’ reading
scores?
45
46
The question for the researcher is how to interpret the 4-point
difference. Specifically, there are two possible explanations:
1. The treatment may have caused the scores in the sample to be
4 points higher.
2. The 4-point difference may be sampling error. Remember, a
sample mean is not expected to be exactly the same as the
population mean. Perhaps the treatment has no effect at all, and
the 4-point difference has occurred just by
chance.
The standard error can help the researcher decide between these two
alternatives. In particular, the standard error tells exactly how much difference
is reasonable to expect just by chance. For example, if the standard error is only
1 point, then the researcher could conclude that the observed difference (4
points) is much larger than would be expected by chance. In this case, it would
be reasonable to conclude that the treatment has caused the difference.
47
The standard error is reported in scientific journals in
two ways. It may be reported in a table along with the
sample means (see Table 7.2). Alternatively, the
standard error may be reported in graphs.
48
Figure 7.8 illustrates the use of a bar graph to display information about the sample mean
and the standard error. Note that the mean is represented by the height of the bar, and
the standard error is depicted on the graph by brackets at the top of each bar. Each
bracket extends 1 standard error above and 1 standard error below the sample mean.
49
Figure 7.9 shows how sample means and standard error are displayed on a line graph.
50
Summary: Sampling Methods and the Central
Limit Theorem
ONE
Explain why sometimes sampling is the only feasible way to
learn about a population.
TWO
Define and construct a sampling distribution of the sample mean.
THREE
Explain and apply the central limit theorem.
FOUR
STANDARD ERROR AND STATISTICAL INFERENCE
51