Chapter 6
Putting Statistics to Work: The Normal Distribution

Productive inference from sample to population requires that the appropriate statistic be used to characterize the probabilities associated with the distributions of interest. Because there are hypothetically an infinite number of means, and an infinite number of standard deviations, describing potential distributions, we face a problem: we do not have an infinite number of statistical procedures to deal with every possible distribution. Does this mean that we cannot use statistics to analyze the vast majority of our data? Thankfully, no. While each distribution is unique, most distributions can be grouped with other distributions that share important characteristics. Each group of similar distributions can in turn be characterized by an 'ideal' (i.e., theoretical) distribution that typifies those important characteristics. Statistics applicable to the entire group of similar distributions can then be developed from our knowledge of the ideal distribution. Perhaps the most important ideal distribution is the 'normal' distribution (Figure 6.1). Once one understands the characteristics of the normal distribution, knowledge of other distributions is easily obtained.

Figure 6.1. The normal distribution.

Most people are familiar with the normal distribution described as a "bell-shaped curve," perhaps as a scale for grading. The bell-shaped curve is nothing but a special case of the normal distribution; the words "bell-shaped" describe the general shape of the distribution, and the word "curve" is used as a synonym for distribution. While we generally refer to the normal distribution, there are really many different normal distributions. In fact, there are as many different normal distributions as there are possible combinations of means and standard deviations, both theoretically and in the real world. However, all of these normal distributions share five characteristics.

1. Symmetry. If divided into left and right halves, each half is a mirror image of the other.

2. The maximum height of the distribution is at the mean. A consequence of this stipulation, together with characteristic 1, is that the mean, the median, and the mode have identical values.

3. The area under a normal distribution sums to unity. This characteristic is simpler than it sounds, but it is important because of how we use the normal distribution. Areas within the theoretical distribution, treated as a geometric form, represent probabilities of events that range from 0 to 1 (i.e., 0% to 100%). The phrase 'sums to unity' means that all of the probabilities represented by the area under the normal distribution sum to 1 and thus represent all possible outcomes. Each half of the symmetrical distribution, of which the mean is the center, represents half (.5) of the probabilities.

4. Normal distributions are theoretically asymptotic at both ends, or tails, of the distribution. If we were to follow a point along the slope of the curve toward the tail to infinity, the point would come incrementally ever closer to zero without ever quite reaching it. This aspect of the normal distribution is necessary because we need to consider every possible variate out to infinity. Put another way, every single possible variate can be assigned some probability of occurring, even if it is astronomically small.

5. The distribution of means of multiple samples from a normal distribution will tend to be normally distributed. This commonality among normal distributions requires thinking about means somewhat differently. As you know, means characterize groupings of variates. In this special context we calculate individual means for repeated samples, and plot these means as variates that collectively create a new distribution composed of means. This new distribution tends to be normally distributed.
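Two of these characteristics can be checked numerically. The sketch below (our own illustration, not part of the original text) defines the normal density and confirms that the area under the curve sums to unity (characteristic 3) and that the curve is symmetric about the mean (characteristic 1):

```python
import math

def normal_pdf(y, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Characteristic 3: the area under the curve sums to unity.
# Numerically integrate the standard normal density from -8 to +8 standard
# deviations; the tails beyond that contribute a negligible amount.
step = 0.001
area = sum(normal_pdf(-8 + i * step) * step for i in range(int(16 / step)))
print(round(area, 4))                    # 1.0

# Characteristic 1: symmetry about the mean.
print(normal_pdf(-1.5) == normal_pdf(1.5))   # True
```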
This issue will be further discussed in Chapter 7.

With these commonalities in mind, let us further consider some of the differences among normal distributions. First, as Figures 3.17, 3.18, and 3.19 show, normal distributions may be conceptualized as leptokurtic, platykurtic, or mesokurtic. Additionally, any combination of means and standard deviations is possible, and there is no necessary relationship between the mean and the standard deviation for any given distribution. Normal distributions may have different means and the same standard deviation (Figure 6.2), or the same mean and different standard deviations (Figure 6.3).

Figure 6.2. Two normal distributions with different means and the same standard deviation.

Figure 6.3. Two normal distributions with the same mean and different standard deviations.

If σ is large, variates are generally far from the mean. If σ is small, most variates are relatively close to the mean. Regardless of the standard deviation, variates near the mean of a normal distribution are more common, and therefore more probable, than variates in the tails of the distribution. One of the most useful aspects of normal distributions is that regardless of the values of µ and σ (Figure 6.4):

µ ± 1σ contains 68.26% of all variates
µ ± 2σ contains 95.44% of all variates
µ ± 3σ contains 99.73% of all variates

Figure 6.4. Percentages of variates within 1, 2, and 3 standard deviations from µ.

It is also possible to express this relationship in terms of more commonly used percentages. For example:

50% of all variates fall between µ ± .674σ
95% of all variates fall between µ ± 1.96σ
99% of all variates fall between µ ± 2.58σ

Since µ ± 2σ contains 95.44% of all variates and µ ± 3σ contains 99.73% (Figure 6.4), we know that any value beyond µ ± 2σ is a rare event, expected fewer than 5 times in 100, and a value beyond µ ± 3σ is rarer still, expected less than 1 time in 100.
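The percentages above follow directly from the shape of the curve and can be reproduced with the error function from Python's standard library. A minimal sketch (our own illustration; note that the chapter's figures truncate rather than round the last digit):

```python
import math

def prob_within(k):
    """P(mu - k*sigma < Y < mu + k*sigma) for any normal distribution.

    For a standard normal variate Z, P(|Z| <= k) = erf(k / sqrt(2)),
    and the result is the same whatever mu and sigma happen to be.
    """
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"mu +/- {k} sigma: {prob_within(k):.4f}")
# mu +/- 1 sigma: 0.6827
# mu +/- 2 sigma: 0.9545
# mu +/- 3 sigma: 0.9973
```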
This characteristic of the normal distribution allows us to consider the probability of individual variates occurring within a geometric space under the distribution. Because the probability space (i.e., the sum of the area of probability we are considering) under the normal distribution equals 1.0, the percentages mentioned above may be converted directly to probabilities. When we consider the relationship between a distribution and an individual variate of that distribution, we know that the probability is .6826 that the variate is within µ ± 1σ, .9544 that it is within µ ± 2σ, and .9973 that it is within µ ± 3σ (Figure 6.5).

Figure 6.5. Standard deviations as areas under the normal curve expressed as probabilities.

The probabilities illustrated in Figure 6.5 are unchanging for all normal distributions regardless of their means or standard deviations. Furthermore, probabilities may be calculated for any area under the curve. For example, we might be interested in the area between two points on the axis, or between one point and the mean, or between one point and infinity. These areas under the curve do vary depending on the location and shape of the distribution as described by the mean and the standard deviation. In other words, there are as many relationships between any individual variate and the probabilities associated with normal distributions as there are possible means and standard deviations, and these are infinite in number.

To use the normal distribution most effectively to generate probabilities, statisticians have created the standard normal distribution, which has, by definition, µ = 0 and σ = 1. Rather than calculate probabilities of areas under the curve for every possible mean and standard deviation, it is easiest to convert any distribution to the standard normal.
This transformation occurs through the calculation of z, where:

Formula 6.1: z = (Yi − µ) / σ

The calculation of z establishes the difference between any variate and the mean (Yi − µ) and expresses that difference in standard deviation units (by dividing by σ). In other words, the result of the formula, called a z-score, is how many standard deviations Yi is from µ in the standard normal distribution. Appendix A is a table of areas under the curve of the standard normal distribution. Once we have a z-score, we can use Appendix A to determine exact probabilities under the curve.

To illustrate this point, consider the following example. Donald K. Grayson, in his analysis of the microfauna from Hidden Cave, Nevada, notes that only one species of pocket gopher, Thomomys bottae, occurs in the area today, although it is possible for other species to have been represented in the past. Grayson (1985:144) presents the following descriptive statistics on mandibular alveolar lengths in mm for modern Thomomys bottae: Ȳ = 5.7, s = .48, and n = 54. Specimen number HC-215 has a value Yi = 6.4. What is the probability of obtaining a value between the mean, Ȳ = 5.7, and Yi = 6.4 (Figure 6.6)?

Figure 6.6. Illustration of the relationship between the sample mean and the variate of 6.4.

Since we do not have the population parameters, we substitute the sample values for the mean and standard deviation:

z = (Yi − Ȳ) / s = (6.40 − 5.70) / .48 = 1.46

This value for z tells us that 6.4 is 1.46 standard deviations from the mean. Is this a common or rare event? We know in general it is common, as the value lies between one and two standard deviations from the mean. Yet we might be interested in the exact probability. These probabilities for areas under the standard normal distribution can be found in Appendix A. The values in Appendix A are the probabilities for the area between z and the mean.
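The z-score just computed can be reproduced in a few lines of Python (a sketch of Formula 6.1; the variable names are ours, not Grayson's):

```python
# Grayson's (1985:144) modern Thomomys bottae sample: mean 5.7 mm, s = .48, n = 54.
y_bar, s = 5.7, 0.48

def z_score(y_i, mean, sd):
    """Formula 6.1: how many standard deviations y_i lies from the mean."""
    return (y_i - mean) / sd

# Specimen HC-215 has an alveolar length of 6.4 mm.
z = z_score(6.4, y_bar, s)
print(round(z, 2))   # 1.46
```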
To find the probability for 1.46 standard deviation units, look down the left side of the table until the value 1.4 is located. Follow this row until it intersects the column for .06. At that intersection is the value .4279, which represents the probability of a variate falling between the mean and z = 1.46. A value in that interval is therefore a common event.

In addition to the probability found in the above example, we can also find the probability of a value less than or greater than 6.4 (z = 1.46). Since we know that the total probability represented by the curve equals 1.0, and that .50 lies on each side of the mean, we can determine that .5 + .4279 = .9279 is the probability of a value less than z = 1.46, and 1 − .9279 = .0721 is the probability of a value greater than z = 1.46. We could then conclude that a value larger than z = 1.46 approaches being a rare event, something we would expect only about 7 times out of 100.

The above example illustrates finding probabilities based on areas under the normal curve. Note that we cannot determine probabilities for exact values that represent points on the line, because points are infinitesimally small. This is why we did not determine the probability of the value z = 1.46 itself, only the probability of values greater or less than it, or of values between z = 1.46 and the mean. The probability of the point z = 1.46 cannot be measured. If it is absolutely necessary to find an area that closely corresponds to 1.46, one should find the area under the curve between 1.455 and 1.465.

Note that Appendix A only presents values for areas where Yi is greater than the mean. What happens if Yi is less than the mean? Another example will serve to illustrate this point. What is the probability of an alveolar length between 5.3 mm and 6.8 mm (Figure 6.7)?

Figure 6.7. Illustration of the relationship between the sample mean and the variates 5.3 and 6.8.
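Before working that example, it is worth noting that the Appendix A lookups for z = 1.46 can be checked numerically (or replaced, when no table is at hand) by evaluating the standard normal cumulative distribution function directly. A minimal sketch using only Python's standard library:

```python
import math

def phi(z):
    """Cumulative probability P(Z <= z) for the standard normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = 1.46
between_mean_and_z = phi(z) - 0.5   # the tabled Appendix A value
less_than = phi(z)                  # P(Z < 1.46)
greater_than = 1 - phi(z)           # P(Z > 1.46)

print(round(between_mean_and_z, 4))   # 0.4279
print(round(less_than, 4))            # 0.9279
print(round(greater_than, 4))         # 0.0721
```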
We can illustrate this probability in the following way:

Pr{5.3 < Yi < 6.8}

Pr{(Y1 − Ȳ)/s < z < (Y2 − Ȳ)/s}

Pr{(5.3 − 5.7)/.48 < z < (6.8 − 5.7)/.48}

Pr{−.83 < z < 2.29}

Since the normal curve is symmetrical, we can ignore the negative sign on −.83 and use Appendix A to find the area between this value and the mean. The tabled value for .83 is .2967. The tabled value for 2.29 is .4890. Since we are interested in the area under the curve between −.83 and 2.29, we can sum the two individual probabilities to determine that Pr{−.83 < z < 2.29} = .2967 + .4890 = .7857.

You will note that we used a new kind of notation in the preceding example. Unlike many of the symbols previously discussed, this notation does not provide instructions for computation. Instead, it describes the problem we wish to solve. Pr is the symbol indicating that we are determining a probability. The area inside the brackets {} is called the probability space. It indicates exactly what probability we wish to find: in this case, the probability of a variate with a value between 5.3 and 6.8. While it may seem tedious, it is important that you explicitly write out and sketch your probability space. It is a useful and easy way to keep track of the probability space you are after while ensuring you do not make a simple mistake.

The normal distribution is incredibly useful for a number of reasons. For example, we may now conclude that specimen HC-215 from Hidden Cave does not differ significantly from the modern population of Thomomys bottae. If it did, we might have suggested that another species of Thomomys was present at Hidden Cave in the past, a conclusion of considerable paleoenvironmental and archaeological significance. This and similar uses fall under the subject of hypothesis testing, the topic of the next chapter.

References Cited

Grayson, D.K. 1985. The paleontology of Hidden Cave: Birds and mammals. In The Archaeology of Hidden Cave, Nevada, edited by D. H.
Thomas, pp. 125-161. American Museum of Natural History Anthropological Papers 66(1).