CHAPTER 7
STANDARD ERROR OF THE MEAN AND CONFIDENCE INTERVALS
Researchers rarely conduct statistical research with knowledge of the characteristics of
the entire population. Remember that a population is defined as a group of individuals or cases
which all share a common characteristic or set of characteristics. One example of a population
consists of all individuals living within the United States. This group represents a population
because they share the characteristic of living within the same geographic region. This
population is especially important to political science researchers and is probably the most
common subject of academic research in the contemporary field of political science. Despite the
prominence of this population as the subject of examination, it is virtually impossible for a
researcher to engage in a research project that considers every individual who is included in this
larger population. Financial constraints, time, and other factors make it impractical to include all
members of this population in most research projects.
Most researchers choose to focus on a sample rather than the entire population when
constructing their research projects. Using a sample allows the researcher to focus greater
attention on a relatively small number of individuals and enables the researcher to save both time
and money in collecting and analyzing the data obtained as a part of the study. In many cases,
samples may actually represent a better source of data than the entire population. When working
with extremely large populations such as the entire resident population of the United States, it
may take so long to conduct an analysis of the entire population that characteristics begin to
change before the research project can be completed.1 Pre-election surveys provide a potential
example of this effect. The dynamics of electoral campaigns often shift rapidly as candidates
gain or lose momentum that could not be accurately reflected in the lengthy process that would
be necessary to question all potential voters in the United States. Under such conditions, the use
of a sample of the larger population represents the best approach to the task of research.
The purpose of drawing a sample is to provide the researcher with a smaller number of
cases which are representative of the characteristics present in the larger population. Properly
drawn samples are scientifically reliable and extremely helpful in examining characteristics of
the population as a whole. They have advantages of speed, cost, and convenience compared to
efforts to analyze entire populations. The “ideal” form of a sample is the random sample, but a
variety of sampling techniques are available to researchers seeking to identify individuals for
participation in research projects. A few illustrations of samples are:
(1) A public opinion poll of 2500 voters to determine how the individuals
    polled intend to vote for President.
(2) A comparison of political party policies among state legislatures by
    selecting random samples of Democrat and Republican members.
(3) A survey of 1500 randomly selected individuals to determine their
    preference for a particular toothpaste.
(4) A quality control project at a water plant which calls for the selection of
    several water samples each hour to determine if the plant is meeting
    certain Federal standards.
(5) A random sample of 100 students at college A and a sample of 100 students
    from college B to determine why the students enrolled at their particular
    institution.
1
The U.S. Census is an attempt to report on the characteristics of the entire population, but it is a two or
three year exercise that is not accurate when it is finished.
Hundreds of different kinds of samples selected from a wide variety of populations could be
mentioned, each having particular relevance for the research projects that could be conducted.
The random sample is the most useful type of sample for statistical analyses. In a random
sample, all members of a population are given an equal chance of being selected.
The researcher with limited time and resources would be using good judgment in
selecting a random sample and studying the characteristics distributed in the population. If the
sample selected is a good representation of the entire population, the researcher could then
generalize from the sample to the entire population from which the sample was drawn. However,
it should be noted that there will always be some differences between the sample and the
population that occur for a variety of reasons. The differences that exist between the true
characteristics of the population and those of the sample are referred to as sampling error.
Since there is sampling error, the mean of the sample will not always be the same as the mean of
the population from which the sample was drawn. Likewise, the presence of sampling error
means that the standard deviation of the sample and the standard deviation of the population will
not always be the same. Sampling error makes it difficult to generalize or infer something about
the population from a sample with any degree of accuracy, but there are statistical conventions
which enable researchers to successfully deal with sampling error.
The statistical concept of standard error of the mean is one way to overcome the
apparent difficulties mentioned above. In order to understand the standard error of the mean, one
must first understand the statistical notion of a sampling distribution of means. A sampling
distribution of means is produced when a researcher draws repeated samples from the larger population,
calculating the mean of each sample individually. A frequency distribution is then constructed
using the mean values for each of the samples drawn from the population. For example, if a
researcher selected 1,000 samples of 1500 individuals from the population in the United States
and then calculated the mean incomes for each of those samples, 1,000 different mean values
would be identified. If a frequency distribution were constructed using these 1,000 obtained
means, that distribution would be referred to as a sampling distribution of means. In the field
of statistics, the characteristics of this sampling distribution of means are expected to
approximate all the characteristics of the normal curve. Even though 1,000 samples have been
selected, one can only say that a sampling distribution of means approximates the normal curve
because there is still the matter of sampling error.
The purpose of the sampling distribution of means is to provide a more accurate
assessment of the location for the true population mean. The mean for each of the individual
samples can be substantially different from that of the overall population due to the presence of
sampling error. On the other hand, researchers should be able to use the sampling distribution of
means to produce a much more accurate assessment of the true population mean. The laws of
probability suggest that properly drawn samples should produce sample means that tend to
cluster around the value of the true population mean. For this reason, determining the mean of
the sampling distribution of means should produce a value which closely approximates the true
mean of the population. This is true even though there is some sampling error involved. The
standard deviation of the sampling distribution of means is smaller than the standard deviation of
the population. This is true because, in the process of calculating the means of the 1,000 samples
that were drawn from the population of the United States, the influence of extreme values in each
sample has been reduced dramatically. The values of the sampling distribution are more
concentrated in the center of the distribution, rather than being dispersed along the horizontal
axis of the curve. Since the sampling distribution of means approximates the normal curve, z
scores and probability, which were discussed in the last chapter, can be applied to the sampling
distribution. All of the conclusions and generalizations related to these concepts are also
applicable to this distribution. This means that the probability of drawing a sample with a mean
falling within 1 standard deviation (measured on the sampling distribution) of the true population mean is 68.26%. Likewise, the
probabilities of drawing samples with means within 2 and 3 standard deviations of the true
population mean are 95.44% and 99.74% respectively.
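The behavior described above can be checked with a small simulation. The following is a minimal sketch, not part of the original chapter: it assumes a made-up, skewed population of 100,000 "income" values (the figures are hypothetical, not real data), draws 1,000 random samples of 1,500 cases each, and compares the resulting distribution of sample means with the population.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 skewed "income" values (not real data).
population = [random.expovariate(1 / 50_000) for _ in range(100_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

# Draw 1,000 random samples of 1,500 cases and record the mean of each one.
sample_means = [
    statistics.fmean(random.sample(population, 1500)) for _ in range(1000)
]

# Mean and standard deviation of the sampling distribution of means.
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)


def share_within(k):
    """Share of sample means lying within k standard deviations
    (of the sampling distribution) of the population mean."""
    hits = sum(abs(m - pop_mean) <= k * sd_of_means for m in sample_means)
    return hits / len(sample_means)


print(f"population mean:            {pop_mean:,.0f}")
print(f"mean of the sample means:   {mean_of_means:,.0f}")
print(f"population std. deviation:  {pop_sd:,.0f}")
print(f"std. deviation of means:    {sd_of_means:,.0f}")
print(f"share within 1 std. dev.:   {share_within(1):.1%}")
print(f"share within 2 std. dev.:   {share_within(2):.1%}")
```

Under these assumptions the mean of the sample means should sit very close to the population mean, the standard deviation of the means should be far smaller than the population standard deviation, and the shares within 1 and 2 standard deviations should land near the 68.26% and 95.44% figures, though sampling error keeps them from matching exactly.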
The problem that becomes evident is that one rarely has information about a sampling
distribution of means. If it became necessary to draw 1,000 samples from each population that
researchers wished to study, the entire purpose of sampling would be defeated. Researchers
generally collect data for only one or two samples that will be used to make generalizations about
the entire population. Even though the researcher does not have a sampling distribution of
means, a technique has been developed that uses information drawn from a single sample to
estimate the standard deviation that would be present within a sampling distribution of means if
one were constructed. This estimate of the standard
deviation is known as the standard error of the mean. The formula for the standard error of the
mean is as follows:
$$ s_{\bar{x}} = \frac{s}{\sqrt{n - 1}} $$
where $s_{\bar{x}}$ is the standard error of the mean, $s$ is the standard deviation of the sample, and
$n$ is the number of observations in the sample.
The standard error of the mean, which is an estimate of the standard deviation within the sampling
distribution of means, can be used to determine the range of values in which the true population mean is
likely to fall.2 The probability that the population mean actually falls within this range of mean values
can also be estimated based on the characteristics of a series of distributions, called t-distributions, which
are closely related to the normal curve. This characteristic makes it possible to create confidence
intervals based on the known characteristics of the t-distribution that applies in a particular case. A
statistic called degrees of freedom is used to determine the t-distribution that should be applied to a
particular context. This statistic is a simple measure based on the size of the sample under
consideration. The formula for determining degrees of freedom is: $df = n - 1$, where $n$ is the
number of observations in the sample. Information about the
appropriate t-distribution can then be used to produce an estimate of the range of values within which the
true population mean will be located. Appendix D contains a table of values for the most commonly used
t-distributions. The center of these intervals is the mean of the sample that has been drawn. They are
based on the laws of probability and the assumption that individual sample means will tend to cluster around
the true population mean. Most researchers rely primarily on two specific confidence intervals: the 95%
and 99% confidence intervals which are illustrated in Figure 7:1.
2
The reason the standard error of the mean is only an estimate of the standard deviation of a
sampling distribution of means is that the researcher will never have an actual distribution of means.
[FIGURE 7:1: SAMPLING DISTRIBUTION OF MEANS, with the 95% and 99% intervals marked]
This curve can be used to find the probability that a sample drawn from the population
will have a mean within some range of the true population mean. For example, a researcher can
be approximately 95% confident that the mean of a particular sample will fall within two
standard errors of the true population mean. Likewise, one can be over 99% confident that an
individual sample mean will fall within three standard errors of the true population mean. These ranges
represent the 95% and 99% confidence intervals. The formula used for the construction of
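confidence intervals using sample data is:
$$ \text{confidence interval} = \bar{x} \pm t \, s_{\bar{x}} $$
where $\bar{x}$ is the sample mean, $t$ is the value from the t-distribution for the chosen confidence
level and degrees of freedom, and $s_{\bar{x}}$ is the standard error of the mean.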
The wider the confidence interval, the more confident one can be that the mean of the population
falls within that range of sample means. The example shown above is based on a sample size of
31. By determining the degrees of freedom (31-1=30) and consulting the Table in Appendix D,
one finds the following:
95% confidence interval = sample mean ± 2.042 standard errors
99% confidence interval = sample mean ± 2.750 standard errors
To illustrate these points, suppose a researcher selected a sample of political TV ad ratings
for their cost efficiency, and the sample yielded the results shown in Solution Matrix 7:2.
SOLUTION MATRIX 7:2
POLITICAL TV AD EFFICIENCY RATINGS

Stations   Ratings   Deviation from mean   Squared deviation
 1         41          25.18               634.03
 2         29          13.18               173.71
 3         22           6.18                38.19
 4         18           2.18                 4.75
 5         13          -2.82                 7.95
 6         11          -4.82                23.23
 7         10          -5.82                33.87
 8          9          -6.82                46.51
 9          8          -7.82                61.16
10          7          -8.82                77.79
11          6          -9.82                96.43
Additional Steps to Find Standard Error of the Mean:
One can be 95% confident that the mean of the population from which this sample was drawn is
within an interval of values which range from 8.11 to 23.53.3 Likewise, one can be 99% confident
that the true population mean is within an interval of values which range from 4.86 to 26.78. The
confidence interval is wider at the 99% confidence level than at the 95% confidence level. At the
99% confidence level, only 1 sample mean in 100 is likely to fall outside the confidence interval.
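The arithmetic behind these two intervals can be sketched in a few lines. This is an illustrative sketch rather than the chapter's own solution: it uses the eleven ratings from Solution Matrix 7:2, computes the sample standard deviation with an n-1 divisor, and then follows the chapter's stated formula by dividing that standard deviation by the square root of n-1. The critical t values for 10 degrees of freedom (2.228 and 3.169) are taken from a standard t table and are assumed to match Appendix D, which is not reproduced here. The function names are hypothetical.

```python
import math

# Ratings from Solution Matrix 7:2.
ratings = [41, 29, 22, 18, 13, 11, 10, 9, 8, 7, 6]

# Two-tailed critical t values for df = 10 from a standard t table
# (assumed to match the chapter's Appendix D).
T_CRITICAL_DF10 = {"95%": 2.228, "99%": 3.169}


def standard_error_chapter(values):
    """Return (mean, standard deviation, standard error) using the chapter's
    stated formula: standard deviation divided by sqrt(n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sum_squared_dev = sum((x - mean) ** 2 for x in values)
    sd = math.sqrt(sum_squared_dev / (n - 1))  # sample standard deviation
    return mean, sd, sd / math.sqrt(n - 1)


def confidence_interval(mean, se, t):
    """Interval centered on the sample mean: mean +/- t * standard error."""
    return mean - t * se, mean + t * se


mean, sd, se = standard_error_chapter(ratings)
intervals = {
    level: confidence_interval(mean, se, t)
    for level, t in T_CRITICAL_DF10.items()
}

print(f"mean = {mean:.2f}, s = {sd:.2f}, standard error = {se:.2f}")
for level, (low, high) in intervals.items():
    print(f"{level} confidence interval: {low:.2f} to {high:.2f}")
```

Run this way, the sketch agrees with the intervals quoted above to within rounding (the text appears to round the mean and standard error before multiplying). Dividing the same standard deviation by the square root of n, as the step list later in the chapter does, would give a standard error of about 3.30 rather than about 3.46 and slightly narrower intervals.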
This chapter has been devoted to sampling, the normal curve, the standard error of the mean
and confidence intervals. The standard error of the mean is an estimate of the standard deviation of
a sampling distribution of means based on data collected for a single sample and is an important
concept for calculating subsequent statistics discussed in this text. With a knowledge of these
concepts firmly in mind, the student is now prepared to begin testing hypotheses.
3
The standard error of the mean for this sample was large. Therefore, the potential for error is
large and that means that the confidence interval will be wide. In many actual cases with large samples
the standard error of the mean is small and the confidence interval is correspondingly small.
A Major Idea:
Standard Error Provides a Means of Estimating How Closely a Sample Mean
Can Be Expected to Approximate the True Population Mean.
SEQUENTIAL STATISTICAL STEPS
STANDARD ERROR OF THE MEAN
Step 1  Organize Data Matrix: The first step in finding the standard error of the mean is to
        organize the data in a frequency distribution.
Step 2  Calculate the mean of the distribution.
Step 3  Find the deviation values by subtracting the mean from each individual value.
Step 4  Square each deviation value.
Step 5  If the distribution is a frequency distribution, multiply each squared deviation value by
        the number of times the raw value occurs in the distribution.
Step 6  Add the squared deviation values which have been multiplied by the frequencies.
Step 7  Find the variance of the distribution by dividing the sum of the squared deviation
        values by n-1.
Step 8  Calculate the standard deviation by taking the square root of the variance.
Step 9  Finally, calculate the standard error of the mean by dividing the standard
        deviation by the square root of n.
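The nine steps can be sketched as a short function. This is a minimal illustration under stated assumptions, not the chapter's own code: the function name, its `use_n_minus_one` flag, and the small example distribution are all hypothetical. By default it follows the step list as written and divides the standard deviation by the square root of n; setting the flag divides by the square root of n-1 instead, as in the formula given earlier in the chapter.

```python
import math


def standard_error_from_frequencies(freq, use_n_minus_one=False):
    """Standard error of the mean for a frequency distribution.

    freq maps each raw value to the number of times it occurs (Step 1).
    use_n_minus_one is a hypothetical switch: False divides by sqrt(n) as
    in the step list, True divides by sqrt(n - 1) as in the chapter formula.
    """
    n = sum(freq.values())

    # Step 2: mean of the distribution.
    mean = sum(value * count for value, count in freq.items()) / n

    # Steps 3-6: deviation, square, weight by frequency, then sum.
    sum_squared_dev = sum(
        count * (value - mean) ** 2 for value, count in freq.items()
    )

    # Step 7: variance with an n - 1 divisor.
    variance = sum_squared_dev / (n - 1)

    # Step 8: standard deviation.
    sd = math.sqrt(variance)

    # Step 9: standard error of the mean.
    divisor = math.sqrt(n - 1) if use_n_minus_one else math.sqrt(n)
    return sd / divisor


# Made-up example: the value 20 occurs twice, 10 and 30 once each.
example = {10: 1, 20: 2, 30: 1}
print(round(standard_error_from_frequencies(example), 4))
print(round(standard_error_from_frequencies(example, use_n_minus_one=True), 4))
```

The two divisors give noticeably different results for a sample this small; the gap narrows as n grows.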
EXERCISES - CHAPTER 7
(1) Define the following terms:
    (A) population
    (B) sampling distribution of means
    (C) confidence interval
    (D) standard error of the mean
    (E) standard deviation
    (F) sampling error
(2) Assume that the following data are sample data and find the standard error of the mean and
    the confidence intervals for the 95% and 99% confidence levels for each distribution. What
    is the median value for each distribution? Show all work.
    (A) X1 = 45, 51, 19, 23, 26, 27, 24, 65, 20, 21, 46, 41, 49, 36, 35
    (B) X2 = 49, 44, 23, 33, 26, 14, 21, 56, 18, 20, 31, 35, 38, 54, 53
    (C) X3 = 44, 53, 22, 29, 30, 27, 32, 55, 67, 21, 40, 35, 31, 51, 42
(3) If the mean of a distribution is 15.00 and the standard error of the mean is 0.50, what would be
    the range of values for the 95% and 99% confidence levels?
(4) Find the standard error of the mean for the following sample data and give the range of
    values for the 95% and 99% confidence levels. Show all work.
    (A) X1 = 73, 35, 25, 60, 55, 30, 29, 58
    (B) X2 = 100, 126, 89, 95, 64, 100, 100, 70, 70
    (C) X3 = 34, 36, 45, 48, 50, 22, 25, 25, 25, 40, 40, 24
(5) Assume that the mean of a distribution is 10.00 with a standard error of .75. What would be
    the range of values for the 95% and 99% confidence levels?
(6) Find the standard error of the mean and other statistics for the following data. Show all
    work.

    X        f
    71-80     5
    61-70    20
    51-60    30
    41-50    40
    31-40    55
    21-30    20
    11-20    30
    1-10     15

What is the median? What percentage of the values are above 50? What is the probability of a
value being above 70? What is the percentile rank of 65?
(7) Which of the following groups has the highest standard error of the mean? Which has the
    lowest standard error of the mean? What are the confidence intervals for each group at the
    95% confidence level?

    X1    f      X2    f      X3    f
    65    1      70    1      90    1
    50    2      60    1      75    2
    45    3      55    3      70    4
    30    4      45    5      60    2
    20    1      10    2      55    2
    10    1       5    1      40    1
(8) Answer the questions given below for these sample data. Show all work.

    X      f      Y      f
    130    1      500    2
    125    5      475    4
    120    6      400    6
    110    4      350    7
    100    2      250    3
     90    1      200    1

    For X:
    (1) mean
    (2) mode
    (3) median
    (4) s²
    (5) s
    (6) Standard error of the mean
    (7) z score for 105
    (8) percent of the values above 115
    (9) probability a value will be below 118
    (10) skew

    For Y:
    (1) mean
    (2) mode
    (3) median
    (4) s²
    (5) s
    (6) Standard error of the mean
    (7) z score for 375
    (8) percent of the values below 225
    (9) probability a value will be above 450
    (10) skew