1 Examples of Sampling
A company has 50000 employees at 1000 equally-sized locations around the world. All employees have
identification numbers. For simplicity, we assume that the identification numbers are 1 through 50000. We
want to obtain some sort of information about the employees. To do this, we will collect data from a sample
of 2000 of the employees. To collect this data, we will need to meet with the employees in person.
To obtain a simple random sample, we have a computer generate 2000 random numbers between 1 and 50000.
We can set this up to allow replacement or forbid replacement. For our sample, we choose the employees
with these identification numbers.
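As a minimal sketch of the simple random sample described above, the following Python uses the standard library's `random` module. The variable names and the fixed seed are illustrative assumptions, not part of the original notes.

```python
import random

POPULATION_SIZE = 50000  # employee IDs run from 1 to 50000
SAMPLE_SIZE = 2000

rng = random.Random(0)  # fixed seed only so the sketch is reproducible (assumption)
ids = range(1, POPULATION_SIZE + 1)

# Forbid replacement: every chosen ID is distinct.
srs_without_replacement = rng.sample(ids, SAMPLE_SIZE)

# Allow replacement: the same ID may be drawn more than once.
srs_with_replacement = rng.choices(ids, k=SAMPLE_SIZE)

print(len(set(srs_without_replacement)), len(set(srs_with_replacement)))
```

The first printed count is always 2000; the second is usually a little smaller, because drawing with replacement can pick the same employee twice.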
To obtain a systematic random sample, we start by randomly choosing the first employee. Say the first
chosen employee has ID number 37. Now observe that, since there are 50000 total employees and our sample
is to have 2000 employees, 1 out of every 25 employees is chosen for inclusion in the sample. Therefore,
we systematically select the employees whose identification numbers are spaced 25 apart, counting in both
directions from 37. So the
employees in our sample have identification numbers 12, 37, 62, 87, 112, and so on, up to 49987.
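The systematic selection can be sketched in Python as follows. The function name `systematic_sample` is a hypothetical helper introduced for illustration; the spacing of 25 and the starting ID 37 come from the example above.

```python
import random

POPULATION_SIZE = 50000
SAMPLE_SIZE = 2000
STEP = POPULATION_SIZE // SAMPLE_SIZE  # 1 out of every 25 employees

def systematic_sample(first_id):
    """Return every ID in 1..POPULATION_SIZE that is a multiple of STEP away from first_id."""
    # Smallest ID in 1..STEP with the same remainder as first_id modulo STEP.
    start = (first_id - 1) % STEP + 1
    return list(range(start, POPULATION_SIZE + 1, STEP))

sample = systematic_sample(37)
print(sample[:5], sample[-1], len(sample))  # [12, 37, 62, 87, 112] 49987 2000

# In practice the first employee is chosen at random:
random_start_sample = systematic_sample(random.randint(1, POPULATION_SIZE))
```

Whatever the random starting ID, the sample always has exactly 2000 employees, because 50000 is a multiple of 25.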
To obtain a clustered sample, we visit only a few of the locations and include in our sample all the employees
at each of the chosen locations. We just need to know how many locations we must visit. Since there are
50000 total employees at 1000 locations, there are 50 employees at each location. Since we need a sample of
2000 employees and there are 50 employees at each location, we need to visit 40 different locations.
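A sketch of the clustered sample in Python is given below. The notes do not say how identification numbers are assigned to locations, so the helper `employees_at` assumes a hypothetical layout in which location j holds IDs 50(j − 1) + 1 through 50j.

```python
import random

N_EMPLOYEES = 50000
N_LOCATIONS = 1000
SAMPLE_SIZE = 2000
PER_LOCATION = N_EMPLOYEES // N_LOCATIONS       # 50 employees at each location
LOCATIONS_NEEDED = SAMPLE_SIZE // PER_LOCATION  # 40 locations to visit

def employees_at(location):
    """Hypothetical layout: location j (1-based) holds IDs 50(j-1)+1 .. 50j."""
    first = (location - 1) * PER_LOCATION + 1
    return list(range(first, first + PER_LOCATION))

rng = random.Random(0)  # fixed seed only for reproducibility (assumption)

# Choose the locations at random, then take everyone at each chosen location.
chosen_locations = rng.sample(range(1, N_LOCATIONS + 1), LOCATIONS_NEEDED)
cluster_sample = [emp for loc in chosen_locations for emp in employees_at(loc)]

print(LOCATIONS_NEEDED, len(cluster_sample))  # 40 2000
```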
To obtain a stratified sample, we visit every location and include the same number of employees from each
location in our sample. We just need to know how many employees at each location we need to pick. Since we
need a sample size of 2000 and there are 1000 locations, we need to choose 2 employees from each location.
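The stratified sample can be sketched the same way. As an assumption for illustration only, the helper `employees_at` places IDs 50(j − 1) + 1 through 50j at location j; the notes do not specify how IDs map to locations.

```python
import random

N_LOCATIONS = 1000
PER_LOCATION = 50
SAMPLE_SIZE = 2000
PICKS_PER_LOCATION = SAMPLE_SIZE // N_LOCATIONS  # 2 employees from each location

def employees_at(location):
    """Hypothetical layout: location j (1-based) holds IDs 50(j-1)+1 .. 50j."""
    first = (location - 1) * PER_LOCATION + 1
    return list(range(first, first + PER_LOCATION))

rng = random.Random(0)  # fixed seed only for reproducibility (assumption)

# Visit every location and pick the same number of employees at each one.
stratified_sample = []
for location in range(1, N_LOCATIONS + 1):
    stratified_sample.extend(rng.sample(employees_at(location), PICKS_PER_LOCATION))

print(PICKS_PER_LOCATION, len(stratified_sample))  # 2 2000
```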
Clustered sampling is best in the sense that it is the easiest to perform in practice, because we
only need to visit 40 of the 1000 locations. In this sense, stratified sampling is the worst, because we
must visit all 1000 locations. Simple random and systematic random sampling will likely
require that we visit most, if not all, of the 1000 locations.
2 Sample Mean
Consider quantitative data for a population of size N. As we discussed in Lecture 3, this population has a
mean µ and standard deviation σ. The mean is obtained by adding up all the observations and dividing by
N. The variance is obtained by adding up all the squares of the deviations from the mean and dividing by
N. The standard deviation is the square root of the variance.
Now suppose that a sample of size n is randomly selected, perhaps through one of our four methods of
sampling. Let x̄ denote the mean of the sample. The mean is obtained by adding up all the observations in
the sample and dividing by n. The sample mean x̄ may or may not equal the population mean µ. We define
the sampling error as |x̄ − µ|. For example, if the population mean is 7 and the sample mean is 5, then the
sampling error is |5 − 7| = |−2| = 2.
Since x̄ was obtained through a random process, x̄ is a random variable. Therefore, it has a set of possible
values, a probability distribution, an expected value or mean, a variance, and a standard deviation. We let
µx̄ denote the mean of x̄ and we let σx̄ denote the standard deviation of x̄.
It is important to emphasise the distinction between the population data and the sample mean. The population data has a distribution, a mean, and a standard deviation in the sense of a data set. The sample mean
has a distribution, a mean, and a standard deviation in the sense of a random variable.
It turns out that the mean and standard deviation of the sample mean are related to the mean and standard
deviation of the population data in the following way:
µx̄ = µ
That is, the mean or expected value of the sample mean is the same as the population mean. Notice that
this does not depend on the sample size or the population size.
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \underbrace{\sqrt{\frac{N-n}{N-1}}}_{\text{FPCF}}$$
The second factor here, labelled FPCF, is known as the finite population correction factor. There are three
cases when we can ignore it:
• when the population size is not given but assumed to be very large
• when the sample size is less than 5% of the population size
• when replacement is allowed in the sampling process
In these three cases, the formula for the standard deviation of the sample mean reduces to
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}.$$
Observe that, as the sample size n increases, the standard deviation of the sample mean gets smaller. That
is, as the sample size increases, the sample mean becomes more likely to be closer to the population mean.
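The two formulas for the standard deviation of the sample mean can be sketched as one Python function. The function name `sd_of_sample_mean` and the value σ = 10 are assumptions chosen for illustration; they do not come from the notes.

```python
import math

def sd_of_sample_mean(sigma, n, N=None):
    """Standard deviation of the sample mean for sample size n.

    If a population size N is given, the finite population correction
    factor sqrt((N - n) / (N - 1)) is applied; otherwise it is omitted.
    """
    sd = sigma / math.sqrt(n)
    if N is not None:
        sd *= math.sqrt((N - n) / (N - 1))
    return sd

# Without the correction, quadrupling n halves the standard deviation.
for n in (1, 4, 16, 64):
    print(n, sd_of_sample_mean(10.0, n))

# With n = 2000 out of N = 50000 (4% of the population), the correction
# factor is about 0.98, so ignoring it changes the answer very little.
fpcf = sd_of_sample_mean(10.0, 2000, N=50000) / sd_of_sample_mean(10.0, 2000)
print(fpcf)
```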
Notice that we have not said anything about the distribution of x̄ so far other than its mean and standard
deviation. For all we know at this point, it could follow a normal distribution, or a uniform distribution, or
any distribution really. We will give a more precise description of the distribution of x̄ later.
As an example, suppose that a family has five people, A, B, C, D, and E, who have heights 64, 65, 68, 72,
and 75 inches, respectively. This is our population data. Using techniques from Lecture 3, we can find that
the population mean is µ = 68.8 inches and the population standard deviation is 4.17 inches.
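These two population values can be checked with Python's standard `statistics` module. Note that `pstdev` divides by N, which matches the definition of the population standard deviation used in these notes; the dictionary name `heights` is an illustrative choice.

```python
import statistics

heights = {"A": 64, "B": 65, "C": 68, "D": 72, "E": 75}  # inches
values = list(heights.values())

mu = statistics.mean(values)
sigma = statistics.pstdev(values)  # population form: divides by N, not N - 1

print(round(mu, 2), round(sigma, 2))  # 68.8 4.17
```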
Now suppose that we obtain a simple random sample of 2 people from the family, without replacement. That
is, the sample must consist of 2 different people. From Lecture 7, we know that there are $_5C_2 = \frac{5 \cdot 4}{2 \cdot 1} = 10$
possible ways of doing this. Each pair of people is equally likely to occur, with probability 1/10. For each
different sample, we will get a (perhaps) different value for x̄. For example, if the sample consists of people
A and B, then x̄ is the average of 64 and 65, which is 64.5. We can then fill in the rest of the table below.
sample   x̄
A,B      64.5
A,C      66
A,D      68
A,E      69.5
B,C      66.5
B,D      68.5
B,E      70
C,D      70
C,E      71.5
D,E      73.5
In the second column, we see all the possible values of x̄. Notice that values of x̄ are less extreme than the
original population data. The most extreme heights in the population were 64 and 75. The most extreme
average heights of two individuals are 64.5 and 73.5. The value 70 occurs in two of the 10 groups. The other
values only occur in one of the 10 groups. Therefore, we can write down the probability distribution of x̄.
k        P(x̄ = k)
64.5     1/10
66       1/10
66.5     1/10
68       1/10
68.5     1/10
69.5     1/10
70       2/10
71.5     1/10
73.5     1/10
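The list of samples and the probability distribution of x̄ can be reproduced by brute-force enumeration. The sketch below uses `itertools.combinations` and exact fractions; the variable names are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

heights = {"A": 64, "B": 65, "C": 68, "D": 72, "E": 75}  # inches

# All 5C2 = 10 equally likely samples of 2 different people.
samples = list(combinations(heights, 2))
sample_means = {
    pair: Fraction(heights[pair[0]] + heights[pair[1]], 2) for pair in samples
}

# P(x-bar = k) is the number of samples with mean k, divided by 10.
counts = Counter(sample_means.values())
distribution = {k: Fraction(c, len(samples)) for k, c in sorted(counts.items())}

for k, p in distribution.items():
    print(float(k), p)
```

`Fraction` reduces to lowest terms, so the probability 2/10 for the value 70 is displayed as 1/5.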
From this, we can use techniques of Lectures 8 and 9 to compute the mean and standard deviation of x̄ using
a table. For example, the next step in computing the mean would be to compute the values of k · P(x̄ = k) for
all the possible values k. The mean would then be the sum of those values. We could compute the standard
deviation from the variance. Computing the variance requires several more columns.
Since x̄ is a sample mean, we don’t actually need to use these old techniques here. We can use formulas
to compute the mean and standard deviation of the sample mean. The mean of x̄ is simply the population
mean, so
µx̄ = µ = 68.8.
Since our population is small and the sample is 40% of the population and the sample did not allow replacement, we must include the FPCF in our computation of the standard deviation of the sample mean,
so
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}} = \frac{4.17}{\sqrt{2}} \sqrt{\frac{5-2}{5-1}} = 2.55.$$
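As a check, the following Python sketch computes the standard deviation of x̄ in two ways: with the formula including the FPCF, and directly from the 10 equally likely samples. The variable names are illustrative assumptions.

```python
import math
from itertools import combinations

heights = [64, 65, 68, 72, 75]  # inches
N, n = len(heights), 2

# Population mean and standard deviation (dividing by N).
mu = sum(heights) / N
sigma = math.sqrt(sum((h - mu) ** 2 for h in heights) / N)

# Formula: sigma / sqrt(n) times the finite population correction factor.
sd_formula = sigma / math.sqrt(n) * math.sqrt((N - n) / (N - 1))

# Direct computation: treat x-bar as a random variable over all 10 samples.
means = [sum(pair) / n for pair in combinations(heights, n)]
mean_of_means = sum(means) / len(means)
sd_direct = math.sqrt(sum((m - mean_of_means) ** 2 for m in means) / len(means))

print(round(mean_of_means, 2), round(sd_formula, 2), round(sd_direct, 2))  # 68.8 2.55 2.55
```

The two standard deviations agree, and the mean of x̄ equals the population mean of 68.8.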
3 Distribution of Sample Mean for Normal Data
If the population data is a normally distributed data set then the sample mean is always a normally distributed random variable. As always, the mean of the sample mean is equal to the population mean. However,
the standard deviation of the sample mean decreases as the sample size increases.
Below is a histogram for men’s heights. The data is normally distributed and has population mean µ = 177 cm
and population standard deviation σ = 7 cm. The huge numbers for the frequencies give you an idea of how
large the population is.
[Figure: histogram of men’s heights. Horizontal axis: height (cm), 155 to 200. Vertical axis: frequency (×10^5), 0 to 5.]
If the sample size is 1, the sample mean is simply the value of the single chosen observation. The shape of
the graph of the pdf for this randomly chosen observation is exactly the same as the shape of the histogram
for the population. The mean of the sample mean is µx̄ = µ = 177 and the standard deviation of the sample
mean is σx̄ = 7/√1 = 7, just like for the population data set. Unlike for the histogram, the area under the
graph of a pdf must be 1.
If the sample size is 16, the sample mean is once again a normally distributed random variable with mean
µx̄ = µ = 177 and standard deviation σx̄ = 7/√16 = 1.75. As the sample size increases, the pdf for the sample
mean remains normally distributed with mean equal to the population mean, but, since its standard
deviation decreases, the graph of its pdf becomes thinner and taller. The graph must become taller as it gets
thinner so that the area under it remains 1.
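The thinner-and-taller behaviour can be tabulated with a short Python sketch. It assumes no finite population correction (the population is very large) and uses a hand-written helper `normal_pdf`, which is an illustrative name rather than anything from the notes.

```python
import math

MU, SIGMA = 177.0, 7.0  # population mean and standard deviation in cm

def normal_pdf(x, mean, sd):
    """Density of the normal distribution with the given mean and sd at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

for n in (1, 2, 4, 8, 16, 32):
    sd = SIGMA / math.sqrt(n)      # standard deviation of the sample mean
    peak = normal_pdf(MU, MU, sd)  # height of the pdf at its centre
    print(f"n={n:2d}  sd={sd:.3f}  peak={peak:.3f}")
```

Each time n quadruples, the standard deviation halves and the peak of the pdf doubles, which keeps the area under the curve equal to 1.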
Below are graphs of the pdf for the sample mean for samples of different sizes.
[Figure: distribution of sample mean. Pdf curves for n = 1, 2, 4, 8, 16, 32. Horizontal axis: 155 to 200. Vertical axis: 0 to 0.35.]
4 Central Limit Theorem
Now suppose that the population data is not necessarily normally distributed.
Below is a histogram for a large data set which has mean µ = 9.11 and standard deviation σ = 4.31 but
which is not normally distributed.
[Figure: histogram. Horizontal axis: values, 0 to 16. Vertical axis: frequency, 0 to 2500.]
Below are graphs of the pdf for the sample mean for samples of different sizes.
[Figure: distribution of sample mean. Pdf curves for n = 1, 2, 4, 8, 16, 32. Horizontal axis: 0 to 15. Vertical axis: 0 to 0.7.]
Evidently, the effect of taking larger and larger sample sizes is to smooth the graph of the pdf for the sample
mean. Eventually, the graph of the pdf approaches a normal curve and becomes thinner and taller.
The central limit theorem says that, if the sample size is large (a common rule of thumb is at least 30), then
the sample mean is approximately normally distributed even if the original data set was not normally distributed.
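The central limit theorem can be illustrated with a small simulation. The data set behind the histogram above is not available, so this sketch substitutes an exponential distribution with mean 1 and standard deviation 1 as a stand-in for skewed, non-normal data; that choice, the seed, and all names are assumptions.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed only for reproducibility (assumption)

def sample_mean(n):
    """Mean of n draws from an exponential distribution with mean 1 (skewed, not normal)."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

N_SAMPLES = 5000
n = 30
means = [sample_mean(n) for _ in range(N_SAMPLES)]

observed_mean = statistics.mean(means)  # should be close to the population mean 1
observed_sd = statistics.pstdev(means)  # should be close to sigma / sqrt(n)
expected_sd = 1.0 / math.sqrt(n)

# For a normal distribution, about 68% of values lie within one sd of the mean.
within_one_sd = sum(abs(m - 1.0) <= expected_sd for m in means) / N_SAMPLES

print(observed_mean, observed_sd, expected_sd, within_one_sd)
```

If the sample means are approximately normal, the last printed value should be close to 0.68 even though the underlying data is strongly skewed.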