JSS 18 (1) (1966) 2-15

WHY THE NORMAL DISTRIBUTION

by R. H. DAW

It has been said that everybody believes in the Law of Errors (i.e. the Normal distribution), the experimenters because they think it is a mathematical theorem and the mathematicians because they think it is an experimental fact. Text-books on statistics often seem to produce the Normal distribution as the conjurer produces a rabbit from a hat and the actuarial student, who should be something of both mathematician and experimenter, is frequently somewhat mystified as to why such an apparently complicated distribution function as

$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2}$$

(see Fig. 1) is suddenly introduced and given such prominence. This note is an attempt to explain the place of the Normal distribution in statistical theory and practice and it is hoped that it will encourage the reader to give some thought to basic assumptions.

FIG. 1. The Normal distribution.

FIRST DERIVATION OF THE NORMAL DISTRIBUTION

The Normal distribution is often referred to as the Gaussian Law or sometimes the Laplace-Gauss Law after the pioneers who derived it on various assumptions in the late 18th and early 19th centuries; in fact the discovery of this distribution was considerably earlier. In a rare pamphlet dated 12 November 1733 by Abraham de Moivre written in Latin and entitled Approximatio ad summam terminorum binomii $(a+b)^n$ in seriem expansi the Normal distribution is derived as an approximation to the binomial distribution. Only two copies of this pamphlet are known, one being in the library of University College, London, and the other in Berlin. Archibald (1926) included a facsimile of the pamphlet and a photo-copy of this facsimile is in the Institute library. In the second edition of his Doctrine of Chances (1738) de Moivre published an English translation of his pamphlet with some additions and the text of this section of the book is reproduced by Smith (1929), while Walker (1929) gives the essential parts.
De Moivre (1738) writes 'Altho' the Solution of Problems of Chance often require that several Terms of the Binomial $(a+b)^n$ be added together, nevertheless in very high Powers the thing appears so laborious, and of so great a difficulty, that few people have undertaken that Task'. In his solution he first finds that, for large values of $n$, the middle term in the expansion of $(\tfrac12+\tfrac12)^n$ is approximately equal to $2/\sqrt{nc}$, where $c = 2\pi$. In modern notation and the circumstances of the problem $\sigma^2 = n/4$, where $\sigma^2$ is the variance of the binomial distribution $(\tfrac12+\tfrac12)^n$. Thus in effect de Moivre's result is that the maximum ordinate $y_0$ of the binomial distribution is

$$y_0 = \frac{1}{\sigma\sqrt{2\pi}}.$$

He then finds that 'the Logarithm of the Ratio, which a Term distant from the middle by the interval $l$ has to the middle Term, is $-2ll/n$', or in modern notation

$$\log_e \frac{y}{y_0} = -\frac{2x^2}{n} = -\frac{x^2}{2\sigma^2},$$

whence the formula given above for the Normal distribution follows.

There follow a number of lemmas and corollaries, including an extension to the binomial $(p+q)^n$ where $p \neq q$, and to quote Smith (1929) 'This paper gives the first statement of the formula for the "normal curve", the first method of finding the probability of the occurrence of an error of a given size when that error is expressed in terms of the variability of the distribution as a unit, and the first recognition of that value later termed the probable error. It shows, also, that before Stirling, de Moivre had been approaching a solution of the value of factorial n.' All this is given in an obscure tract of 7 pages on abstract mathematics relating to games of chance written in Latin over two centuries ago.

THE NORMAL DISTRIBUTION AS A LAW OF ERROR

When using an instrument to make repeated measurements of a particular magnitude the measurements are subject to many different types of error.
For example, astronomical observations may be affected by errors in the measuring instrument used, by external conditions such as temperature, wind and varying density of the earth's atmosphere and by personal peculiarities or bias of the observer. Enough is known about some of these types of error to apply corrections to the observations but even when this has been done to a series of observations of the same quantity, the corrected observations are still found in practice not to be all equal; there will be other causes of error which are too complex to be investigated or whose laws of action are unknown. Thus in an individual corrected observation the total remaining error may be regarded as the sum of a number of small accidental errors, some in one direction and some in the other, arising from different causes.

The Normal distribution has from the time of Gauss, over a century ago, been generally adopted as giving the distribution of accidental errors of observations and is often referred to as the Law of Errors. On this basis mathematical theories on the treatment of observations have been built up.

The Normal Law of Errors may be derived from certain assumptions about the errors. Suppose that a quantity which is being measured has a true magnitude of $m$ but that there are various possible disturbances, $n$ in number, each of which has a chance of $\tfrac12$ of producing an error of either $+\varepsilon$ or $-\varepsilon$ in the observed measurement. The total error therefore has the binomial distribution $(\tfrac12+\tfrac12)^n$ and, if $n$ is large, this approximates to the Normal distribution.

The Normal distribution can also be arrived at from more general assumptions about the errors. For example, instead of each component error having only two possible values, each can have a probability distribution.
Then if the variances of the probability distribution of each component error are finite and all equal and the number of component errors is large, the probability distribution of the total error approximates to the Normal distribution (see, e.g. Brunt, 1931, p. 15; Jeffreys, 1939, p. 77). If the variances are not all equal it does not follow that the Normal distribution results. Whittaker & Robinson (1949, p. 177) give an example to the contrary.

While the assumptions regarding the constituent errors which give rise to the Normal distribution may seem reasonable and give more confidence in the use of that distribution and more understanding of how it can arise, they cannot be proved. The only way to justify the use of the Normal distribution is to compare it with the distribution of a long series of observations of the same quantity.

Many of the writers on the theory of errors from the time of Gauss onwards gave some, often meagre, comparison of actual observations with the theoretical Normal distribution designed to show how well theory and practice agreed. Airy (1879) concludes his treatise with a five-page appendix entitled 'Practical verification of the theoretical law for the frequencies of errors' in which he considered 636 observations of an astronomical magnitude. He prepared a frequency distribution of the observations according to their difference from the mean, added the frequencies of positive and negative deviations, applied two summations and then gave a graphical and tabular comparison of the adjusted observed deviations with half of a Normal distribution. The frequency distribution of the adjusted observed deviations shows pronounced fairly smooth oscillations about the line representing the Normal distribution such as would be expected when a moving average is applied to a time series containing random elements; this effect is now known as the Slutzky-Yule effect (see, e.g. Kendall, 1951, p. 381).
On the basis of this investigation Airy concludes 'the validity of every investigation in this Treatise is thereby established'.

However, in fairness to earlier writers, it must be realized that not until K. Pearson (1900) devised the $\chi^2$ test was it possible to make any proper test of agreement between theory and observation. After applying the $\chi^2$ test to Airy's data and to another set of similar data and finding poor agreement with the Normal distribution, K. Pearson (1900) writes 'Now it appears to me that, if the earlier writers on probability had not proceeded so entirely from the mathematical standpoint, but had endeavoured first to classify experience in deviations from the average and then to obtain some measure of the actual goodness of fit provided by the Normal curve, that curve would never have obtained its present position in the theory of errors.' De Morgan (1838) had already had similar thoughts when he wrote 'My own impression derived from this and many other circumstances connected with the analysis of probability is that mathematical results have outrun their interpretation'.

However, not all comparisons of observed and theoretical deviations show poor agreement with the Normal distribution. Brunt (1931) quotes data given by Gauss for which the $\chi^2$ test gives $P = 0.93$, i.e. surprisingly good, perhaps too good, agreement with the Normal distribution. This causes Jeffreys (1938) to give a warning against the possible selection of such data because of the good agreement shown and perhaps the suppression of other similar data which showed large divergencies from the Normal distribution.

It seems apposite to mention here the tendency for the results of experiments comparing, say, different treatments (e.g. fertilizers or drugs) to be published only when a significant result has been obtained.
If a significance level of 5% is adopted, it would be expected that in 5% of the experiments significant differences would be found when in fact none existed. There is therefore an appreciable danger that workers may be misled by the publication of one trial showing a significant difference whereas a number of other trials of the same treatment showing no significant difference remain unpublished.

Geary (1947) has suggested as one of the reasons for the prejudice in favour of the Normal distribution up to about the end of the last century, the fact that, to a close approximation, it applies to a wide range of mathematical conditions. As we have already seen, the Normal distribution is an approximation to the binomial distribution, a distribution which frequently occurs in practice. The Normal distribution is also an approximation to the Poisson distribution and to the hypergeometric distribution.

So far the Normal distribution has been considered as the error distribution when a number of measurements of the same quantity (e.g. the coefficient of expansion of copper) are made. At first sight it appears a considerable extension to think that the Normal distribution may arise in a very similar way when measurements are made on a number of different individuals, e.g. the height of a number of men. Nevertheless on further thought it does not seem unreasonable to regard the variability of measurements of the latter type, particularly those on biological material where many sources of variability would be expected, as the result of the superposition of a large number of small errors and hence to expect Normal distributions. This, and the fact that many observed frequency distributions were found to be of a shape something like the Normal distribution, led early workers to regard the Normal distribution as the ideal to which most frequency distributions should conform and to require an explanation of departures from this form.
However, in the latter part of the nineteenth century, as more data accumulated it came to be realised that the Normal distribution was but one of many different types.

In view of K. Pearson's comment in 1900 quoted above, it is interesting to find that six years earlier (K. Pearson, 1894), while realizing that homogeneous material could give rise to non-Normal frequency distributions, he was devising a method of splitting observed non-Normal distributions into two or more Normal distributions. He argued that the material giving rise to the observed distribution might not be homogeneous but a mixture of two homogeneous sets of material each of which would alone exhibit a Normal distribution. He even suggested that this method might be applicable to symmetrical distributions. E. S. Pearson (1965) gives some interesting history covering the years 1890-94 which he regards as the early part of one of the great formative periods of mathematical statistics.

Many text-books give examples of various types of observed frequency distributions, e.g. Yule and Kendall (1950, Chapter 4) and Kendall and Stuart (1963, Chapter 1). Many of these distributions are symmetrical and not unlike the Normal distribution (e.g. stature) but there are others showing various degrees of asymmetry and extremely asymmetrical distributions (e.g. J-shaped distribution of incomes) and U-shaped distributions (e.g. cloudiness of the sky). Sometimes a frequency distribution shows several points of maximum frequency, which may be an indication of heterogeneous data. Brunt (1931, p. 47) gives a frequency distribution of the glume-length of wheat plants which shows three well defined maxima; the plants measured were in fact the second generation from a cross between two strains of wheat, one having a long and one a short glume-length. This is one of the types of frequency distribution which K. Pearson had in mind in his 1894 paper.
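The way in which heterogeneous material can produce more than one maximum may be illustrated with a mixture of two Normal components. The following Python sketch is an illustration only and is not K. Pearson's method of dissection: the component means, the common unit standard deviation, the equal weights and the function names are all assumptions chosen for the example.

```python
import math

def normal_pdf(x, mean, sd):
    """Ordinate of the Normal distribution with the given mean and s.d."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, mean_a, mean_b, sd=1.0, weight_a=0.5):
    """Density of a mixture of two Normal components (illustrative values)."""
    return (weight_a * normal_pdf(x, mean_a, sd)
            + (1.0 - weight_a) * normal_pdf(x, mean_b, sd))

def count_maxima(mean_a, mean_b, lo=-8.0, hi=8.0, steps=1600):
    """Count local maxima of the mixture density on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [mixture_pdf(x, mean_a, mean_b) for x in xs]
    return sum(
        1 for i in range(1, steps)
        if ys[i] > ys[i - 1] and ys[i] >= ys[i + 1]
    )

print(count_maxima(-2.0, 2.0))   # well separated components: two maxima
print(count_maxima(-0.5, 0.5))   # components one s.d. apart: one maximum
```

With these assumed values the well separated mixture shows two maxima, while the closer pair gives a single symmetrical hump which is nevertheless not a Normal curve; this is the situation in which a symmetrical distribution might still be a mixture.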
JUSTIFICATION FOR ASSUMPTION OF NORMALITY

In view of the fact that many frequency distributions differ appreciably from the Normal distribution some justification is required for the extensive assumption of Normality in statistical theory and its practical application to observed data. One of the justifications sometimes put forward is that sampling theory is comparatively simple when the Normal distribution is assumed and has been extensively worked out on that basis and that divergencies of the observed data from Normality do not have a large effect on the results. The last part of this justification will be considered later but the first part is decidedly unconvincing: if the assumptions which it is convenient to make are incorrect, one cannot have much confidence in the conclusions reached.

Now most statistical analyses of observed data where Normality is assumed operate, not on single observations, but on the mean or total of a number of observations and the king-pin of these analyses is the very general theorem known as the Central Limit Theorem. One statement of this theorem is as follows:

If $\xi_1, \xi_2, \ldots$ are independent random variables all having the same probability distribution and if $\mu$ and $\sigma$ denote the mean and standard deviation of every $\xi_i$, then, provided $\sigma$ is finite, the distribution of the sum

$$\xi = \xi_1 + \xi_2 + \cdots + \xi_n$$

tends to the Normal form with mean $n\mu$ and standard deviation $\sigma\sqrt{n}$ as $n$ tends to infinity. It follows that the arithmetic mean

$$\bar{\xi} = \frac{1}{n}\sum_{i=1}^{n} \xi_i$$

is distributed asymptotically Normally with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.

The theorem can be stated in a more general form where the $\xi_i$ do not all have the same probability distribution and, subject to certain not very restrictive conditions, the distribution of $\xi$ is still asymptotically Normal (see, e.g. Cramer, 1946, p. 215).
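The statement of the theorem may be illustrated numerically. The following is a minimal Python sketch and not part of the original argument: the exponential parent distribution with mean and standard deviation both equal to 1, the sample size of 50, the number of trials, the seed and the function names are all assumptions made for illustration. It draws repeated sums of $n$ observations and compares their mean and standard deviation with $n\mu$ and $\sigma\sqrt{n}$, and the proportion of sums lying more than 1.96 standard deviations from $n\mu$ with the Normal value of 5%.

```python
import math
import random
import statistics

def simulate_sums(n, trials, seed=0):
    """Draw `trials` sums of n independent Exponential(1) variates."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) for _ in range(trials)]

def tail_fraction(sums, n, mu=1.0, sigma=1.0, z=1.96):
    """Fraction of sums lying more than z standard deviations from n*mu."""
    centre = n * mu
    spread = sigma * math.sqrt(n)
    return sum(abs(s - centre) > z * spread for s in sums) / len(sums)

n, trials = 50, 20000
sums = simulate_sums(n, trials)

print("mean of sums:", statistics.fmean(sums), "theory:", n * 1.0)
print("s.d. of sums:", statistics.stdev(sums), "theory:", math.sqrt(n))
print("fraction beyond 1.96 s.d.:", tail_fraction(sums, n), "Normal: 0.05")
```

The exponential parent is markedly skew, so agreement of the tail proportion with 5% is only approximate at this sample size; other parents or sample sizes can be tried by changing the assumed values.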
The Central Limit Theorem means that, within wide limits, whatever the distribution of the population from which the observations are randomly obtained, the distribution of the mean or total of a large number of observations will tend to be Normal. However, the theorem deals only with the limiting distribution as $n$ tends to infinity and gives no indication of the rapidity with which the Normal distribution is approached as $n$ increases. It is worth noting that, if $\gamma_1$ $(= \sqrt{\beta_1})$ and $\gamma_2$ $(= \beta_2 - 3)$ are the measures of skewness and kurtosis respectively of a distribution, then the corresponding measures for the distribution of the mean are $\gamma_1/\sqrt{n}$ and $\gamma_2/n$, which become closer and closer to the values of these measures (i.e. both zero) for the Normal distribution as $n$ increases.

As an example of the approach to Normality it is interesting to consider the distribution of the mean of samples of various sizes from the rectangular distribution; for samples of two the distribution is triangular, for samples of three it is made up of three parabolas and is already taking on something of the appearance of a Normal curve; for samples of four it is made up of four cubic curves (see Figure 2). For all sample sizes the distribution of the mean is symmetrical. In the case of a sample of 10 the measure of kurtosis, $\gamma_2$, which is zero for the Normal distribution, has the value $-0.12$ (Irwin, 1927). Thus even for quite small samples from the rectangular distribution, the distribution of the mean is approximately Normal.

FIG. 2. Distribution of mean of rectangular distribution $p(x) = 1$, $-0.5 < x < 0.5$.

Broadly it can be said that for a large proportion of the frequency distributions likely to be encountered in practice the distribution of the sum or the mean of a number of observations is unlikely to differ greatly from the Normal distribution provided that the number of observations is not very small.
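The kurtosis figure for samples of 10 from the rectangular distribution can be checked from the relation that the measure for the mean is $\gamma_2/n$. The following Python sketch is an illustration under that relation, using exact rational arithmetic; the function names are hypothetical and the rectangular distribution is taken on the interval $(-\tfrac12, \tfrac12)$ as in Figure 2.

```python
from fractions import Fraction

def rectangular_central_moment(k):
    """k-th central moment of the rectangular distribution on (-1/2, 1/2)."""
    if k % 2:                       # odd moments vanish by symmetry
        return Fraction(0)
    return Fraction(1, (k + 1) * 2 ** k)

def kurtosis_of_mean(n):
    """gamma_2 for the mean of n independent rectangular observations."""
    mu2 = rectangular_central_moment(2)      # variance, 1/12
    mu4 = rectangular_central_moment(4)      # fourth moment, 1/80
    kappa4 = mu4 - 3 * mu2 ** 2              # fourth cumulant of one observation
    gamma2 = kappa4 / mu2 ** 2               # beta_2 - 3 for one observation
    return gamma2 / n                        # cumulants add over independent terms

for n in (1, 2, 3, 4, 10):
    value = kurtosis_of_mean(n)
    print(n, value, float(value))            # n = 10 gives -3/25, i.e. -0.12
```

A single rectangular observation has $\gamma_2 = -6/5$, so the measure falls to $-0.12$ by $n = 10$ and to $-0.012$ by $n = 100$, which gives some idea of the rate of approach to the Normal value of zero.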
EFFECTS OF NON-NORMALITY

In recent years there has grown up quite a volume of literature concerned with the effect of non-Normality on various statistical tests when the sample size is not large. The term 'robustness' has been used to describe the extent to which a statistical test is affected by non-Normality of the data. A robust test is one which is affected comparatively little by non-Normality.

Gayen (1949), (1950a), (1950b) has conducted a series of investigations on the effect of non-Normality on the $F$ and $t$ tests. The parent distribution from which samples are assumed drawn at random is specified by the Edgeworth form of the Gram-Charlier Type A series

$$f(x) = \phi(x) - \frac{\gamma_1}{3!}\,\phi^{(3)}(x) + \frac{\gamma_2}{4!}\,\phi^{(4)}(x) + \frac{10\gamma_1^2}{6!}\,\phi^{(6)}(x),$$

where $\phi(x)$ is the standardized Normal function, $\phi^{(r)}(x)$ its $r$th derivative and $\gamma_1$ $(= \sqrt{\beta_1})$, $\gamma_2$ $(= \beta_2 - 3)$ are respectively the measures of skewness and kurtosis. The true levels of significance are then found when $F$ and $t$ tests at the nominal level of say 5% are applied to samples of given sizes drawn from parent distributions with specified values of $\gamma_1$ and $\gamma_2$.

It is found that the $F$ test (and the equivalent $t$ test for two means) when used on a one-way classification to compare two or more means is robust, i.e. it is little affected by either skewness or kurtosis of the parent distribution. Table 1 gives some examples of the results for various values of $\gamma_1^2$ and $\gamma_2$ assuming that all samples are drawn from the same parent distribution; it shows that even for a small value of $n$ the true significance level for the particular non-Normal distribution considered does not differ greatly from 5%.

If the $t$ test is used to compare the mean of a sample with a given fixed value, the effect of skewness is found to be rather serious but kurtosis does not have a large effect.
However, the most frequent use of this test is when the quantities to which the test is applied consist of the differences between pairs of observations and it is required to find whether the mean difference is significantly different from zero or some other given value. In such cases, if the parameters $\gamma_1$ and $\gamma_2$ are the same for each population, the corresponding parameters $\gamma_1'$ and $\gamma_2'$ for the distribution of the differences would be $\gamma_1' = 0$ and $\gamma_2' = \gamma_2/2$; thus the distribution of the differences would be symmetrical and hence the $t$ test for this situation is reasonably robust.

Table 1. True percentage probabilities associated with the 5% Normal theory significance point for the F test applied to a one-way classification of five samples each of five, i.e. degrees of freedom 4 and 20

$\gamma_2$    $\gamma_1^2 = 0$    $\gamma_1^2 = 1$    $\gamma_1^2 = 2$
-1            5.24                5.34
0             5.00                5.10                5.19
2             4.52                4.62                4.71

The $F$ test can also be used to compare two variances, and an extension of this known as Bartlett's test (see, e.g. Pearson and Hartley, 1954, p. 57) is available for comparing more than two variances. Box (1953) has investigated the behaviour of these tests under non-Normality and has found them to be seriously affected by kurtosis and the larger the number of variances being compared the larger is the discrepancy. Also it is not always the case that the discrepancies are smaller for large samples than for small samples. Table 2 gives some figures for large samples.

The effects of non-Normality on the $\chi^2$ test, when used to compare an observed variance with a given fixed value, are very similar to those just described for the $F$ test for comparing two variances.

It is frequently suggested that a test of equality of variances should be made before making an analysis of variance test for equality of means which involves the assumption of equality of variances.
However, when group sizes are equal or not very different, the analysis of variance test is comparatively little affected by inequalities of variances and, on the basis of his investigation, Box (1953) suggests that more wrong conclusions may be reached by first testing for variance inequalities than if this preliminary test were omitted unless the parent distribution is known to be effectively Normal.

Davies (1956) Appendix 2A gives a description of the effects of departures from Normality on certain tests of significance.

Table 2. True percentage probabilities in large samples associated with the 5% Normal theory significance points for the test for comparison of several variances

Number of variances being compared    $\gamma_2 = -1$    $\gamma_2 = 0$    $\gamma_2 = 2$
2                                     0.56               5.0               16.6
5                                     0.08               5.0               31.5
20                                    0.0004             5.0               71.8

TESTING FOR NORMALITY

There are two types of procedure commonly used in testing for Normality:

(i) Fit a Normal distribution to the sample data and apply the $\chi^2$ test of goodness of fit.
(ii) Calculate certain functions of the moments of the data and examine the significance of departures from the corresponding values for the Normal distribution.

While procedure (i) has certain advantages provided the data are plentiful, it may be somewhat insensitive because it takes no account of the sign or arrangement of the deviations and because of the necessity of grouping together the small frequencies in the tails of the distribution when applying the test.

The usual moment ratio tests for Normality consist of a comparison of $\sqrt{b_1} = m_3/m_2^{3/2}$ and $b_2 = m_4/m_2^2$, calculated from the sample, with the corresponding values of 0 and 3 respectively for the Normal distribution. However, in small samples the sampling distribution of $b_2$ is extremely skew and the criterion

$$a = \frac{\text{mean deviation}}{\text{standard deviation}}$$

has been recommended instead. The value of $a$ for large samples from a Normal distribution tends to $\sqrt{2/\pi}$, i.e. about 0.8.
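The moment ratios and the criterion $a$ are simple to compute from a sample. The following Python sketch is an illustration only: the function name, the simulated Normal sample of 20,000 values and the seed are assumptions, and the standard deviation in $a$ is taken here as $\sqrt{m_2}$, i.e. with divisor $n$. No significance points are computed; for those the published tables are needed.

```python
import math
import random

def normality_criteria(data):
    """Return (sqrt_b1, b2, a) for a sample, using moments about the mean."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    mean_dev = sum(abs(x - mean) for x in data) / n
    sqrt_b1 = m3 / m2 ** 1.5          # skewness ratio, 0 for the Normal
    b2 = m4 / m2 ** 2                 # kurtosis ratio, 3 for the Normal
    a = mean_dev / math.sqrt(m2)      # mean deviation / standard deviation
    return sqrt_b1, b2, a

rng = random.Random(1)                # fixed seed, for repeatability
sample = [rng.gauss(0.0, 1.0) for _ in range(20000)]
sqrt_b1, b2, a = normality_criteria(sample)

print("sqrt(b1):", sqrt_b1)
print("b2:", b2)
print("a:", a, "limit sqrt(2/pi):", math.sqrt(2.0 / math.pi))
```

For a large Normal sample the three values should lie close to 0, 3 and $\sqrt{2/\pi}$ respectively; applying the same function to markedly non-Normal data shows how far each criterion moves.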
These tests are described and tables of the 5% and 1% significance levels given in Pearson and Hartley (1954) Table 34.

However, testing for Normality is only a special case of the general problem of testing whether a sample can be regarded as having been drawn at random from a population with a specified probability distribution. Birnbaum (1953) reviews a number of tests which have been devised for this purpose and which can be used to test for Normality. In a recent paper Shapiro and Wilk (1965) describe a new overall test for Normality which, so far as their investigations go, compares favourably with the tests described above and by Birnbaum (1953).

TRANSFORMATIONS TO ACHIEVE NORMALITY

If the distribution of a set of observations is not Normal a transformation can sometimes be used to obtain a variable which is more nearly Normally distributed. Three transformations which may be useful for this purpose are to take the transformed variable as (i) the square root, (ii) the logarithm, (iii) the reciprocal, of the observed value. The square root transformation is likely to be appropriate in moderately skew distributions and the reciprocal for very skew distributions, with the logarithmic transformation for intermediate cases.

However, a great many statistical analyses of observed data use the method of Analysis of Variance which involves the assumption that the variances of the groups being investigated are all equal. Sometimes the nature of the data is such that this assumption is not true; for example if the observations consist of proportions rather than measurements, the group variances depend on the mean of the group. In such cases a transformation can be applied to equalize the variances. Fortunately it is found that a transformation which tends to equalize the group variances often has the effect of improving the Normality, particularly when the original distribution does not deviate greatly from Normal.
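The effect of the three transformations on skewness can be seen by computing the moment ratio $m_3/m_2^{3/2}$ before and after each. The following Python sketch is an illustration only: the simulated positively skew data (drawn from a lognormal distribution with assumed parameters 0 and 0.5), the sample size, the seed and the function name are all assumptions and do not come from any data discussed in this note.

```python
import math
import random

def skewness(data):
    """Moment ratio m3 / m2**1.5 taken about the sample mean."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

rng = random.Random(2)                          # fixed seed, for repeatability
raw = [rng.lognormvariate(0.0, 0.5) for _ in range(20000)]

transforms = {
    "none": lambda x: x,
    "square root": math.sqrt,
    "logarithm": math.log,
    "reciprocal": lambda x: 1.0 / x,
}

# Skewness of the data after each transformation has been applied.
skew = {name: skewness([f(x) for x in raw]) for name, f in transforms.items()}
for name, value in skew.items():
    print(f"{name:12s} {value:+.3f}")
```

For these assumed data the square root reduces the skewness only in part and the logarithm removes it almost entirely, while the reciprocal, being a stronger transformation than the data require, leaves the transformed values markedly skew. The appropriate choice therefore depends on the degree of skewness actually present.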
Quenouille (1950) gives in Chapter 8 an interesting discussion of transformations for both the purposes outlined above.

CONCLUSION

To sum up, many observed frequency distributions show considerable departures from the Normal distribution but nevertheless statistical tests where Normality is assumed may often be applied to the means or totals of a number of observations because the distribution of a mean or total is likely to be much closer to the Normal distribution than is the distribution of individual observations. However, the assumption of Normality should not be made blindly or without careful consideration.

The $F$ and $t$ tests for comparing means are comparatively robust tests and not likely to be grossly misleading even if departures from Normality in the parent distribution are appreciable, and the number of observations small. Some care is required in using the $t$ test to compare a mean with a given value unless the quantities operated on are differences between pairs of observations. The $\chi^2$ and $F$ tests for comparing variances must be used with great caution unless the parent distribution is known to be effectively Normal.

While the Normal distribution has a very important place in the theory and practice of statistics, it must be kept in its place and not allowed to assume a greater importance than is appropriate.

REFERENCES

AIRY, G. B. (1879). On the algebraic and numerical theory of errors of observation and the combination of observations (3rd edition). London: Macmillan & Co.
ARCHIBALD, R. C. (1926). A rare pamphlet of Moivre and some of his discoveries. Isis, 8, 671.
BIRNBAUM, Z. W. (1953). Distribution-free tests of fit for continuous distribution functions. Ann. Math. Statist., 24, 1.
BOX, G. E. P. (1953). Non-normality and tests on variances. Biometrika, 40, 318.
BRUNT, D. (1931). The combination of observations. Cambridge University Press.
CRAMER, H. (1946). Mathematical methods of statistics. Princeton University Press.
DAVIES, O. L. (Ed.) (1956). The design and analysis of industrial experiments. London: Oliver & Boyd.
DE MOIVRE, A. (1738). The Doctrine of Chances (2nd edition).
DE MORGAN, A. (1838). An essay on probabilities and on their application to life contingencies and insurance offices. London.
GAYEN, A. K. (1949). The distribution of Student's t in random samples of any size drawn from non-normal universes. Biometrika, 36, 353.
GAYEN, A. K. (1950a). The distribution of the variance ratio in random samples of any size drawn from non-normal universes. Biometrika, 37, 236.
GAYEN, A. K. (1950b). Significance of difference between the means of two non-normal samples. Biometrika, 37, 399.
GEARY, R. C. (1947). Testing for normality. Biometrika, 34, 209.
IRWIN, J. O. (1927). On the frequency distribution of the means of samples from a population having any law of frequency with finite moments with special reference to Pearson's Type II. Biometrika, 19, 225.
JEFFREYS, H. (1938). The law of error and the combination of observations. Phil. Trans. Roy. Soc. A, 237, 231.
JEFFREYS, H. (1939). Theory of Probability. Oxford University Press.
KENDALL, M. G. (1951). The advanced theory of statistics, 2. London: Griffin.
KENDALL, M. G. and STUART, A. (1963). The advanced theory of statistics, 1. London: Griffin.
PEARSON, E. S. (1965). Studies in the history of probability and statistics XIV. Some incidents in the early history of biometry and statistics, 1890-94. Biometrika, 52, 3.
PEARSON, E. S. and HARTLEY, H. O. (1954). Biometrika tables for statisticians. Cambridge University Press.
PEARSON, K. (1894).* Contributions to the mathematical theory of evolution. Phil. Trans. Roy. Soc. A, 185, 71.
PEARSON, K. (1900).* On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50, 157.
QUENOUILLE, M. H. (1950). Introductory statistics. London: Butterworths.
SHAPIRO, S. S. and WILK, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52, 591.
SMITH, D. E. (1929). A source book in mathematics. New York & London: McGraw Hill.
WALKER, H. M. (1929). Studies in the history of statistical method. Baltimore: Williams and Wilkins.
WHITTAKER, E. and ROBINSON, G. (1949). The calculus of observations. London: Blackie.
YULE, G. U. and KENDALL, M. G. (1950). An introduction to the theory of statistics. London: Griffin.

* These papers are reprinted in Karl Pearson's early statistical papers. Cambridge University Press, 1948.