CHAPTER 7
STANDARD ERROR OF THE MEAN AND CONFIDENCE INTERVALS

Researchers rarely conduct statistical research with knowledge of the characteristics of the entire population. Remember that a population is defined as a group of individuals or cases that share a common characteristic or set of characteristics. One example of a population consists of all individuals living within the United States. This group represents a population because its members share the characteristic of living within the same geographic region. This population is especially important to political science researchers and is probably the most common subject of academic research in the contemporary field of political science. Despite the prominence of this population as a subject of examination, it is virtually impossible for a researcher to carry out a project that considers every individual included in it. Financial constraints, time, and other factors make it impractical to include all members of this population in most research projects.

Most researchers choose to focus on a sample rather than the entire population when constructing their research projects. Using a sample allows the researcher to focus greater attention on a relatively small number of individuals and saves both time and money in collecting and analyzing the data obtained as part of the study.

In many cases, samples may actually represent a better source of data than the entire population. When working with extremely large populations, such as the entire resident population of the United States, it may take so long to analyze the entire population that its characteristics begin to change before the research project can be completed.¹ Pre-election surveys provide a potential example of this effect.
The dynamics of electoral campaigns often shift rapidly as candidates gain or lose momentum, and such shifts could not be accurately reflected in the lengthy process that would be necessary to question all potential voters in the United States. Under such conditions, the use of a sample of the larger population represents the best approach to the task of research.

The purpose of drawing a sample is to provide the researcher with a smaller number of cases that are representative of the characteristics present in the larger population. Properly drawn samples are scientifically reliable and extremely helpful in examining characteristics of the population as a whole. They have advantages of speed, cost, and convenience compared to efforts to analyze entire populations. The "ideal" form of a sample is the random sample, but a variety of sampling techniques are available to researchers seeking to identify individuals for participation in research projects. A few illustrations of samples are:

(1) A public opinion poll of 2,500 voters to determine how the individuals polled intend to vote for President.
(2) A comparison of political party policies among state legislatures by selecting random samples of Democratic and Republican members.
(3) A survey of 1,500 randomly selected individuals to determine their preference for a particular toothpaste.
(4) A quality control project at a water plant which calls for the selection of several water samples each hour to determine if the plant is meeting certain Federal standards.
(5) A random sample of 100 students at college A and a sample of 100 students from college B to determine why the students enrolled at their particular institution.

1 The U.S. Census is an attempt to report on the characteristics of the entire population, but it is a two- or three-year exercise whose results are no longer accurate by the time it is finished.
Hundreds of different kinds of samples selected from a wide variety of populations could be mentioned, each having particular relevance for the research projects that could be conducted.

The random sample is the most useful type of sample for statistical analyses. In a random sample, all members of a population are given an equal chance of being selected. The researcher with limited time and resources would be using good judgment in selecting a random sample and studying the characteristics distributed in the population. If the sample selected is a good representation of the entire population, the researcher could then generalize from the sample to the entire population from which the sample was drawn.

However, it should be noted that there will always be some differences between the sample and the population, and these occur for a variety of reasons. The differences that exist between the true characteristics of the population and those of the sample are referred to as sampling error. Because of sampling error, the mean of the sample will not always be the same as the mean of the population from which the sample was drawn. Likewise, the standard deviation of the sample and the standard deviation of the population will not always be the same. Sampling error makes it difficult to generalize or infer something about the population from a sample with any degree of accuracy, but there are statistical conventions which enable researchers to deal with sampling error successfully.

The statistical concept of the standard error of the mean is one way to overcome the apparent difficulties mentioned above. In order to understand the standard error of the mean, one must first understand the statistical notion of a sampling distribution of means. A sampling distribution of means is produced when a researcher draws repeated samples from the larger population and calculates the mean of each sample individually.
A frequency distribution is then constructed using the mean values for each of the samples drawn from the population. For example, if a researcher selected 1,000 samples of 1,500 individuals from the population of the United States and then calculated the mean income for each of those samples, 1,000 different mean values would be identified. If a frequency distribution were constructed using these 1,000 obtained means, that distribution would be referred to as a sampling distribution of means. In the field of statistics, the characteristics of this sampling distribution of means are expected to approximate all the characteristics of the normal curve. Even though 1,000 samples have been selected, one can only say that a sampling distribution of means approximates the normal curve because there is still the matter of sampling error.

The purpose of the sampling distribution of means is to provide a more accurate assessment of the location of the true population mean. The mean of any individual sample can be substantially different from that of the overall population due to the presence of sampling error. On the other hand, researchers should be able to use the sampling distribution of means to produce a much more accurate assessment of the true population mean. The laws of probability suggest that properly drawn samples should produce sample means that tend to cluster around the value of the true population mean. For this reason, determining the mean of the sampling distribution of means should produce a value which closely approximates the true mean of the population. This is true even though there is some sampling error involved.

The standard deviation of the sampling distribution of means is smaller than the standard deviation of the population.
This is true because, in the process of calculating the means of the 1,000 samples that were drawn from the population of the United States, the influence of extreme values in each sample has been reduced dramatically. The values of the sampling distribution are more concentrated in the center of the distribution, rather than being dispersed along the horizontal axis of the curve.

Since the sampling distribution of means approximates the normal curve, z scores and probability, which were discussed in the last chapter, can be applied to the sampling distribution. All of the conclusions and generalizations related to these concepts are also applicable to this distribution. This means that the probability of drawing a sample with a mean falling within 1 standard deviation (of the sampling distribution) of the true population mean is 68.26%. Likewise, the probabilities of drawing samples with means within 2 and 3 standard deviations of the true population mean are 95.44% and 99.74% respectively.

The problem that becomes evident is that one rarely has information about a sampling distribution of means. If it became necessary to draw 1,000 samples from each population that researchers wished to study, the entire purpose of sampling would be defeated. Researchers generally collect data for only one or two samples that will be used to make generalizations about the entire population. Even though the researcher does not have a sampling distribution of means, a technique has been developed that uses information drawn from a single sample to estimate the standard deviation that a sampling distribution of means would have if one were constructed. This estimate of the standard deviation is known as the standard error of the mean.
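The idea of a sampling distribution of means can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population is a hypothetical, artificially generated set of 100,000 skewed "incomes" (not real data), and the variable names are invented for this example. It draws 1,000 random samples of 1,500 cases, as in the illustration above, and compares the spread of the sample means with the spread of the population.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 skewed "incomes" (an assumption, not real data).
population = [random.expovariate(1 / 40_000) for _ in range(100_000)]

NUM_SAMPLES = 1_000   # number of repeated samples, as in the text's illustration
SAMPLE_SIZE = 1_500   # cases per sample, as in the text's illustration

# Draw repeated random samples and record the mean of each one.
sample_means = [
    statistics.fmean(random.sample(population, SAMPLE_SIZE))
    for _ in range(NUM_SAMPLES)
]

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Share of sample means within 1 and 2 standard deviations
# (of the sampling distribution) of the population mean.
within_1 = sum(abs(m - pop_mean) <= sd_of_means for m in sample_means) / NUM_SAMPLES
within_2 = sum(abs(m - pop_mean) <= 2 * sd_of_means for m in sample_means) / NUM_SAMPLES

print(f"population mean      : {pop_mean:,.0f}")
print(f"mean of sample means : {mean_of_means:,.0f}")
print(f"population SD        : {pop_sd:,.0f}")
print(f"SD of sample means   : {sd_of_means:,.0f}")
print(f"within 1 SD: {within_1:.1%}   within 2 SD: {within_2:.1%}")
```

In a run of this kind, the mean of the sample means should sit close to the population mean, the standard deviation of the sample means should be far smaller than the population standard deviation, and the shares within 1 and 2 standard deviations should land near the normal-curve percentages, even though the simulated population itself is skewed.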
The formula for the standard error of the mean is as follows:

$$\text{standard error of the mean} = \frac{\text{standard deviation of the sample}}{\sqrt{\text{number of observations in the sample} - 1}}$$

The standard error of the mean, which is an estimate of the standard deviation within the sampling distribution of means, can be used to determine the range of values in which the true population mean is likely to fall.² The probability that the population mean actually falls within this range of values can also be estimated based on the characteristics of a series of distributions, called t-distributions, which are closely related to the normal curve. This makes it possible to create confidence intervals based on the known characteristics of the t-distribution that applies in a particular case.

A statistic called degrees of freedom is used to determine the t-distribution that should be applied in a particular context. This statistic is a simple measure based on the size of the sample under consideration. The formula for determining degrees of freedom is:

$$df = n - 1$$

where n is the number of observations in the sample. Information about the appropriate t-distribution can then be used to produce an estimate of the range of values within which the true population mean will be located. Appendix D contains a table of values for the most commonly used t-distributions.

The center of a confidence interval is the mean of the sample that has been drawn. Confidence intervals are based on the laws of probability and the assumption that the means of individual samples will tend to cluster around the true population mean. Most researchers rely primarily on two specific confidence intervals: the 95% and 99% confidence intervals, which are illustrated in Figure 7:1.

2 The reason the standard error of the mean is only an estimate of the standard deviation of a sampling distribution of means is that the researcher will never have an actual distribution of means.
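The role of degrees of freedom in choosing a t-distribution can be sketched as a simple lookup. The table in this sketch is a short excerpt of standard two-tailed t values, used here as a hypothetical stand-in for Appendix D; the names `T_TABLE`, `degrees_of_freedom`, and `critical_t` are illustrative and do not come from the text.

```python
# Excerpt of two-tailed critical t values, keyed by degrees of freedom.
# This is an illustrative stand-in for Appendix D, not a full table.
T_TABLE = {
    5:   {95: 2.571, 99: 4.032},
    10:  {95: 2.228, 99: 3.169},
    20:  {95: 2.086, 99: 2.845},
    30:  {95: 2.042, 99: 2.750},
    60:  {95: 2.000, 99: 2.660},
    120: {95: 1.980, 99: 2.617},
}


def degrees_of_freedom(n):
    """Degrees of freedom for a single sample of n observations."""
    return n - 1


def critical_t(n, level):
    """Look up the t value for a sample of size n at the 95 or 99 level."""
    df = degrees_of_freedom(n)
    if df not in T_TABLE:
        raise KeyError(f"df={df} is not in this excerpt; consult a full t table")
    return T_TABLE[df][level]


# Show how the t value shrinks as the sample size grows.
for df in sorted(T_TABLE):
    n = df + 1
    print(f"n={n:>3}  df={df:>3}  95%: {critical_t(n, 95):.3f}  99%: {critical_t(n, 99):.3f}")
```

As the sample grows, the t values shrink toward the normal-curve values of about 1.96 and 2.58, which is the sense in which t-distributions are closely related to the normal curve. Smaller samples call for larger t values and therefore wider intervals.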
FIGURE 7:1 SAMPLING DISTRIBUTION OF MEANS
[Figure: curve with the 95% and 99% confidence intervals marked]

This curve can be used to find the probability that a sample drawn from the population will have a mean within some range of the true population mean. For example, a researcher can be approximately 95% confident that the mean of a particular sample will fall within two standard errors of the true population mean. Likewise, one can be over 99% confident that an individual sample mean will fall within three standard errors of the true population mean. These ranges represent the 95% and 99% confidence intervals. The formula used for the construction of confidence intervals using sample data is:

$$\text{confidence interval} = \text{sample mean} \pm (t \times \text{standard error of the mean})$$

The wider the confidence interval, the more confident one can be that the mean of the population falls within that range. The example shown above is based on a sample size of 31. By determining the degrees of freedom (31 - 1 = 30) and consulting the table in Appendix D, one finds the following:

$$\text{95\% confidence interval} = \text{sample mean} \pm (2.042 \times \text{standard error of the mean})$$

$$\text{99\% confidence interval} = \text{sample mean} \pm (2.750 \times \text{standard error of the mean})$$

To illustrate these points, suppose a researcher selected a sample of political TV ad ratings for their cost efficiency, and the sample yielded the results shown in Solution Matrix 7:2.

SOLUTION MATRIX 7:2
POLITICAL TV AD EFFICIENCY RATINGS

Stations   Ratings   Rating - mean   (Rating - mean) squared
   1         41          25.18             634.03
   2         29          13.18             173.71
   3         22           6.18              38.19
   4         18           2.18               4.75
   5         13          -2.82               7.95
   6         11          -4.82              23.23
   7         10          -5.82              33.87
   8          9          -6.82              46.51
   9          8          -7.82              61.15
  10          7          -8.82              77.79
  11          6          -9.82              96.43

Additional Steps to Find Standard Error of the Mean:

$$\text{mean} = \frac{174}{11} = 15.82$$

$$\text{variance} = \frac{1197.61}{11 - 1} = 119.76$$

$$\text{standard deviation} = \sqrt{119.76} = 10.94$$

$$\text{standard error of the mean} = \frac{10.94}{\sqrt{11 - 1}} = 3.46$$

$$\text{95\% confidence interval} = 15.82 \pm (2.228 \times 3.46) = 8.11 \text{ to } 23.53$$

$$\text{99\% confidence interval} = 15.82 \pm (3.169 \times 3.46) = 4.86 \text{ to } 26.78$$

One can be 95% confident that the mean of the population from which this sample was drawn is within an interval of values which range from 8.11 to 23.53.³ Likewise, one can be 99% confident that the true population mean is within an interval of values which range from 4.86 to 26.78. The confidence interval is wider at the 99% confidence level than at the 95% confidence level.
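As a check on the arithmetic for the TV ad ratings, the calculation can be sketched in a few lines of Python. This is a minimal sketch rather than a definitive implementation: the function names are invented for illustration, and the two t values for 10 degrees of freedom are hard-coded as an assumption in place of a table lookup. The sketch follows the chapter's formula, dividing the sample standard deviation by the square root of n - 1. Many texts divide by the square root of n instead when the standard deviation has already been computed with n - 1, which would give slightly narrower intervals.

```python
import math

# Ratings from Solution Matrix 7:2 (the only data taken from the chapter).
ratings = [41, 29, 22, 18, 13, 11, 10, 9, 8, 7, 6]

# Two-tailed t values for 10 degrees of freedom, hard-coded as an
# assumption standing in for the Appendix D lookup.
T_95 = 2.228
T_99 = 3.169


def standard_error(values):
    """Return (mean, standard deviation, standard error of the mean)."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    s = math.sqrt(sum_sq / (n - 1))   # sample standard deviation
    se = s / math.sqrt(n - 1)         # chapter's formula: s / sqrt(n - 1)
    return mean, s, se


def interval(mean, se, t):
    """Confidence interval: mean plus or minus t standard errors."""
    return (round(mean - t * se, 2), round(mean + t * se, 2))


mean, s, se = standard_error(ratings)

# Round intermediate results to two decimals, as a hand calculation would.
mean_r, se_r = round(mean, 2), round(se, 2)

ci_95 = interval(mean_r, se_r, T_95)
ci_99 = interval(mean_r, se_r, T_99)

print(f"mean = {mean_r}, s = {round(s, 2)}, standard error = {se_r}")
print(f"95% confidence interval: {ci_95[0]} to {ci_95[1]}")
print(f"99% confidence interval: {ci_99[0]} to {ci_99[1]}")
```

With the intermediate results rounded to two decimals, the sketch reproduces the intervals reported in the text (8.11 to 23.53 and 4.86 to 26.78). Carrying full precision through the calculation shifts the 99% limits by about 0.01.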
At the 99% confidence level, only 1 sample mean in 100 is likely to fall outside the confidence interval.

This chapter has been devoted to sampling, the normal curve, the standard error of the mean, and confidence intervals. The standard error of the mean is an estimate of the standard deviation of a sampling distribution of means based on data collected for a single sample, and it is an important concept for calculating subsequent statistics discussed in this text. With a knowledge of these concepts firmly in mind, the student is now prepared to begin testing hypotheses.

3 The standard error of the mean for this sample was large. Therefore, the potential for error is large, and that means that the confidence interval will be wide. In many actual cases with large samples, the standard error of the mean is small and the confidence interval is correspondingly narrow.

A Major Idea: Standard Error Provides a Means of Estimating How Closely a Sample Mean Can Be Expected to Approximate the True Population Mean.

SEQUENTIAL STATISTICAL STEPS
STANDARD ERROR OF THE MEAN

Step 1  Organize Data Matrix. The first step in finding the standard error of the mean is to organize the data in a frequency distribution.
Step 2  Calculate the mean of the distribution.
Step 3  Find the deviation values by subtracting the mean from each individual value.
Step 4  Square each deviation value.
Step 5  If the distribution is a frequency distribution, multiply each squared deviation value by the number of times the raw value occurs in the distribution.
Step 6  Add the squared deviation values which have been multiplied by the frequencies.
Step 7  Find the variance of the distribution by dividing the sum of the squared deviation values by n-1.
Step 8  Calculate the standard deviation by taking the square root of the variance.
Step 9  Finally, calculate the standard error of the mean by dividing the standard deviation by the square root of n.

EXERCISES - CHAPTER 7

(1) Define the following terms:
    (A) population
    (B) sampling distribution of means
    (C) confidence interval
    (D) standard error of the mean
    (E) standard deviation
    (F) sampling error

(2) Assume that the following data are sample data and find the standard error of the mean and the confidence intervals for the 95% and 99% confidence levels for each distribution. What is the median value for each distribution? Show all work.
    (A) X1 = 45, 51, 19, 23, 26, 27, 24, 65, 20, 21, 46, 41, 49, 36, 35
    (B) X2 = 49, 44, 23, 33, 26, 14, 21, 56, 18, 20, 31, 35, 38, 54, 53
    (C) X3 = 44, 53, 22, 29, 30, 27, 32, 55, 67, 21, 40, 35, 31, 51, 42

(3) If the mean of a distribution is 15.00 and the standard error of the mean is 0.50, what would be the range of values for the 95% and 99% confidence levels?

(4) Find the standard error of the mean for the following sample data and give the range of values for the 95% and 99% confidence levels. Show all work.
    (A) X1 = 73, 35, 25, 60, 55, 30, 29, 58
    (B) X2 = 100, 126, 89, 95, 64, 100, 100, 70, 70
    (C) X3 = 34, 36, 45, 48, 50, 22, 25, 25, 25, 40, 40, 24

(5) Assume that the mean of a distribution is 10.00 with a standard error of 0.75. What would be the range of values for the 95% and 99% confidence levels?

(6) Find the standard error of the mean and other statistics for the following data. Show all work.

      X        f
    71-80      5
    61-70     20
    51-60     30
    41-50     40
    31-40     55
    21-30     20
    11-20     30
     1-10     15

    What is the median?
    What percentage of the values are above 50?
    What is the probability of a value being above 70?
    What is the percentile rank of 65?

(7) Which of the following groups has the highest standard error of the mean? Which has the lowest standard error of the mean? What are the confidence intervals for each group at the 95% confidence level?
     X1    f      X2    f      X3    f
     65    1      70    1      90    1
     50    2      60    1      75    2
     45    3      55    3      70    4
     30    4      45    5      60    2
     20    1      10    2      55    2
     10    1       5    1      40    1

(8) Answer the questions given below for these sample data. Show all work.

      X    f       Y    f
    130    1     500    2
    125    5     475    4
    120    6     400    6
    110    4     350    7
    100    2     250    3
     90    1     200    1

    For X:
    (1) mean
    (2) mode
    (3) median
    (4) s²
    (5) s
    (6) Standard error of the mean
    (7) z score for 105
    (8) percent of the values above 115
    (9) probability a value will be below 118
    (10) skew

    For Y:
    (1) mean
    (2) mode
    (3) median
    (4) s²
    (5) s
    (6) Standard error of the mean
    (7) z score for 375
    (8) percent of the values below 225
    (9) probability a value will be above 450
    (10) skew