JSS 18 (1) (1966) 2-15
WHY THE NORMAL DISTRIBUTION
by
R. H. DAW
IT has been said that everybody believes in the Law of Errors (i.e.
the Normal distribution), the experimenters because they think it is
a mathematical theorem and the mathematicians because they think
it is an experimental fact. Text-books on statistics often seem to
produce the Normal distribution as the conjurer produces a rabbit
from a hat and the actuarial student, who should be something of
both mathematician and experimenter, is frequently somewhat mystified as to why such an apparently complicated distribution function
as

$$y = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-\mu)^2/2\sigma^2}$$

[FIG. 1. The Normal distribution.]

(see Fig. 1) is suddenly introduced and given such prominence. This note is an attempt to explain the place of the Normal distribution in statistical theory and practice, and it is hoped that it will encourage the reader to give some thought to basic assumptions.
FIRST DERIVATION OF THE NORMAL DISTRIBUTION
The Normal distribution is often referred to as the Gaussian Law
or sometimes the Laplace-Gauss Law after the pioneers who derived
it on various assumptions in the late 18th and early 19th centuries;
in fact the discovery of this distribution was considerably earlier.
In a rare pamphlet dated 12 November 1733 by Abraham de Moivre
written in Latin and entitled Approximatio ad summam terminorum
binomii (a+b)^n in seriem expansi the Normal distribution is derived
as an approximation to the binomial distribution. Only two copies
of this pamphlet are known, one being in the library of University
College, London, and the other in Berlin. Archibald (1926) included
a facsimile of the pamphlet and a photo-copy of this facsimile is in
the Institute library.
In the second edition of his Doctrine of Chances (1738) de Moivre
published an English translation of his pamphlet with some additions and the text of this section of the book is reproduced by Smith
(1929), while Walker (1929) gives the essential parts.
De Moivre (1738) writes 'Altho' the Solution of Problems of
Chance often require that several Terms of the Binomial (a+b)^n
be added together, nevertheless in very high Powers the thing appears
so laborious, and of so great a difficulty, that few people have
undertaken that Task'. In his solution he first finds that, for large
values of n, the middle term in the expansion of (½+½)^n is approximately equal to 2/√(nc), where c = 2π. In modern notation and the circumstances of the problem σ² = n/4, where σ² is the variance of the binomial distribution (½+½)^n. Thus in effect de Moivre's result is that the maximum ordinate y0 of the binomial distribution is

$$y_0 = \frac{1}{\sigma\sqrt{2\pi}}.$$

He then finds that 'the Logarithm of the Ratio, which a Term distant from the middle by the interval ℓ has to the middle Term, is −2ℓ²/n', or in modern notation

$$\log_e\frac{y_\ell}{y_0} = -\frac{\ell^2}{2\sigma^2},$$

whence the formula given above for the Normal distribution
follows.
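De Moivre's result can be checked numerically. The following sketch (Python; the script and the chosen values of n are illustrative and form no part of the original paper) compares the exact middle term of (½+½)^n with 1/(σ√(2π)):

```python
import math

# De Moivre: the middle term of (1/2 + 1/2)^n is approximately
# 1/(sigma * sqrt(2*pi)), where sigma^2 = n/4.  (Illustrative check.)
for n in (10, 100, 1000):
    exact = math.comb(n, n // 2) / 2 ** n          # exact middle term (n even)
    sigma = math.sqrt(n / 4)
    approx = 1 / (sigma * math.sqrt(2 * math.pi))
    print(f"n = {n:5d}   exact = {exact:.6f}   approximation = {approx:.6f}")
```

Even for n = 10 the two values agree to within about 2½ per cent, and the agreement improves as n increases.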
There follow a number of lemmas and corollaries, including an extension to the binomial (p+q)^n where p ≠ q, and to quote Smith
(1929) 'This paper gives the first statement of the formula for the
"normal curve", the first method of finding the probability of the
occurrence of an error of a given size when that error is expressed
in terms of the variability of the distribution as a unit, and the first
recognition of that value later termed the probable error. It shows,
also, that before Stirling, de Moivre had been approaching a solution
of the value of factorial n.' All this is given in an obscure tract of
7 pages on abstract mathematics relating to games of chance written
in Latin over two centuries ago.
THE NORMAL DISTRIBUTION AS A LAW OF ERROR
When using an instrument to make repeated measurements of a
particular magnitude the measurements are subject to many different types of error. For example, astronomical observations may
be affected by errors in the measuring instrument used, by external
conditions such as temperature, wind and varying density of the
earth's atmosphere and by personal peculiarities or bias of the
observer. Enough is known about some of these types of error to apply
corrections to the observations but even when this has been done
to a series of observations of the same quantity, the corrected
observations are still found in practice not to be all equal; there
will be other causes of error which are too complex to be investigated or whose laws of action are unknown. Thus in an individual
corrected observation the total remaining error may be regarded as
the sum of a number of small accidental errors, some in one direction
and some in the other, arising from different causes. The Normal
distribution has from the time of Gauss, over a century ago, been
generally adopted as giving the distribution of accidental errors of
observations and is often referred to as the Law of Errors. On
this basis mathematical theories on the treatment of observations
have been built up.
The Normal Law of Errors may be derived from certain assumptions about the errors. Suppose that a quantity which is being
measured has a true magnitude of m but that there are various possible disturbances, n in number, each of which has a chance of ½ of
producing an error of either + or —ε in the observed measurement.
The probability distribution of the total error is therefore the binomial distribution (½+½)^n and, if n is large, this approximates to the Normal distribution.
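A minimal simulation of this error model (illustrative Python; the number of disturbances, the error size and the number of repetitions are arbitrary choices, not values from the text) shows the binomial total error behaving as the theory predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 400, 0.01, 100_000   # disturbances, error size, repetitions

# Each observation's total error is the sum of n independent errors of +/-eps.
steps = rng.choice([eps, -eps], size=(trials, n))
total_error = steps.sum(axis=1)

# Binomial theory: mean 0 and standard deviation eps * sqrt(n).
print("sample sd:", total_error.std(), "  theory:", eps * np.sqrt(n))
```

A histogram of `total_error` is, to the eye, indistinguishable from a Normal curve.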
The Normal distribution can also be arrived at from more general
assumptions about the errors. For example, instead of each component error having only two possible values, each can have a probability distribution of its own. Then if the variances of the probability distributions of the component errors are finite and all equal, and the
number of component errors is large, the probability distribution
of the total error approximates to the Normal distribution (see, e.g.
Brunt, 1931, p. 15, Jeffreys, 1939, p. 77). If the variances are not all
equal it does not follow that the Normal distribution results. Whittaker
& Robinson (1949, p. 177) give an example to the contrary.
While the assumptions regarding the constituent errors which give
rise to the Normal distribution may seem reasonable and give more
confidence in the use of that distribution and more understanding
of how it can arise, they cannot be proved. The only way to justify
the use of the Normal distribution is to compare it with the distribution of a long series of observations of the same quantity.
Many of the writers on the theory of errors from the time of
Gauss onwards gave some, often meagre, comparison of actual
observations with the theoretical Normal distribution designed to
show how well theory and practice agreed. Airy (1879) concludes
his treatise with a five-page appendix entitled Practical verification of
the theoretical law for the frequencies of errors in which he considered
636 observations of an astronomical magnitude. He prepared
a frequency distribution of the observations according to their
difference from the mean, added the frequencies of positive and negative deviations, applied two summations and then gave a graphical
and tabular comparison of the adjusted observed deviations with
half of a Normal distribution. The frequency distribution of the
adjusted observed deviations shows pronounced fairly smooth
oscillations about the line representing the Normal distribution
such as would be expected when a moving average is applied to a
time series containing random elements; this effect is now known as
the Slutzky-Yule effect (see, e.g. Kendall, 1951, p. 381). On the basis
of this investigation Airy concludes 'the validity of every investigation in this Treatise is thereby established'.
However, in fairness to earlier writers, it must be realized that not
until K. Pearson (1900) devised the χ² test was it possible to make any
proper test of agreement between theory and observation. After
applying the χ² test to Airy's data and to another set of similar
data and finding poor agreement with the Normal distribution,
K. Pearson (1900) writes 'Now it appears to me that, if the earlier
writers on probability had not proceeded so entirely from the
mathematical standpoint, but had endeavoured first to classify
experience in deviations from the average and then to obtain some
measure of the actual goodness of fit provided by the Normal curve,
that curve would never have obtained its present position in the
theory of errors.'
De Morgan (1838) had already had similar thoughts when he
wrote 'My own impression derived from this and many other
circumstances connected with the analysis of probability is that
mathematical results have outrun their interpretation'.
However, not all comparisons of observed and theoretical deviations show poor agreement with the Normal distribution. Brunt
(1931) quotes data given by Gauss for which the χ² test gives P = 0.93, i.e. surprisingly good, perhaps too good, agreement with
the Normal distribution. This causes Jeffreys (1938) to give a warning
against the possible selection of such data because of the good
agreement shown and perhaps the suppression of other similar
data which showed large divergencies from the Normal distribution.
It seems apposite to mention here the tendency for the results of
experiments comparing, say, different treatments (e.g. fertilizers or
drugs) to be published only when a significant result has been
obtained. If a significance level of 5% is adopted, it would be expected that in 5% of the experiments significant differences would be
found when in fact none existed. There is therefore an appreciable
danger that workers may be misled by the publication of one trial
showing a significant difference whereas a number of other trials of
the same treatment showing no significant difference remain unpublished.
Geary (1947) has suggested as one of the reasons for the prejudice
in favour of the Normal distribution up to about the end of the last
century, the fact that, to a close approximation, it applies to a wide
range of mathematical conditions. As we have already seen, the
Normal distribution is an approximation to the binomial distribution, a distribution which frequently occurs in practice. The Normal
distribution is also an approximation to the Poisson distribution
and to the hypergeometric distribution.
So far the Normal distribution has been considered as the error
distribution when a number of measurements of the same quantity
(e.g. the coefficient of expansion of copper) are made. At first
sight it appears a considerable extension to think that the Normal
distribution may arise in a very similar way when measurements are
made on a number of different individuals e.g. the height of a
number of men. Nevertheless on further thought it does not seem
unreasonable to regard the variability of measurements of the latter
type, particularly those on biological material where many sources
of variability would be expected, as the result of the superposition of
a large number of small errors and hence to expect Normal distributions. This, and the fact that many observed frequency distributions were found to be of a shape something like the Normal
distribution, led early workers to regard the Normal distribution as
the ideal to which most frequency distributions should conform and
to require an explanation of departures from this form. However,
in the latter part of the nineteenth century, as more data accumulated
it came to be realised that the Normal distribution was but one of
many different types.
In view of K. Pearson's comment in 1900 quoted above, it is
interesting to find that six years earlier (K. Pearson, 1894), while
realizing that homogeneous material could give rise to non-Normal
frequency distributions, he was devising a method of splitting
observed non-Normal distributions into two or more Normal
distributions. He argued that the material giving rise to the observed
distribution might not be homogeneous but a mixture of two
homogeneous sets of material each of which would alone exhibit a
Normal distribution. He even suggested that this method might be
applicable to symmetrical distributions. E. S. Pearson (1965) gives
some interesting history covering the years 1890-94 which he regards
as the early part of one of the great formative periods of mathematical statistics.
Many text-books give examples of various types of observed
frequency distributions, e.g. Yule and Kendall (1950, Chapter 4)
and Kendall and Stuart (1963, Chapter 1). Many of these distributions are symmetrical and not unlike the Normal distribution (e.g.
stature) but there are others showing various degrees of asymmetry
and extremely asymmetrical distributions (e.g. J-shaped distribution of
incomes) and U-shaped distributions (e.g. cloudiness of the sky). Sometimes a frequency distribution shows several points of maximum
frequency, which may be an indication of heterogeneous data.
Brunt (1931, p. 47) gives a frequency distribution of the glume-length of wheat plants which shows three well defined maxima—
the plants measured were in fact the second generation from a
cross between two strains of wheat, one having a long and one a
short glume-length. This is one of the types of frequency distribution which K. Pearson had in mind in his 1894 paper.
JUSTIFICATION FOR ASSUMPTION OF NORMALITY
In view of the fact that many frequency distributions differ
appreciably from the Normal distribution some justification is
required for the extensive assumption of Normality in statistical
theory and its practical application to observed data. One of the
justifications sometimes put forward is that sampling theory is
comparatively simple when the Normal distribution is assumed and
has been extensively worked out on that basis and that divergencies
of the observed data from Normality do not have a large effect on
the results. The last part of this justification will be considered
later but the first part is decidedly unconvincing—if the assumptions
which it is convenient to make are incorrect, one cannot have much
confidence in the conclusions reached.
Now most statistical analyses of observed data where Normality
is assumed operate, not on single observations, but on the mean or
total of a number of observations and the king-pin of these analyses
is the very general theorem known as the Central Limit Theorem.
One statement of this theorem is as follows:
If ξ1, ξ2, ... are independent random variables all having the same probability distribution and if µ and σ denote the mean and standard deviation of every ξi, then, provided σ is finite, the distribution of the sum

$$\xi = \xi_1 + \xi_2 + \cdots + \xi_n$$

tends to the Normal form with mean nµ and standard deviation σ√n as n tends to infinity. It follows that the arithmetic mean

$$\bar{\xi} = \xi/n$$

is distributed asymptotically Normally with mean µ and standard deviation σ/√n.
The theorem can be stated in a more general form where the ξi
do not all have the same probability distribution and, subject to
certain not very restrictive conditions, the distribution of ξ is still
asymptotically Normal (see, e.g. Cramer, 1946, p. 215).
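A small simulation makes the theorem concrete. The sketch below (illustrative Python; the exponential parent and the sample sizes are arbitrary choices, not from the text) standardizes the means of samples from a markedly skewed distribution and checks how quickly the Normal two-sided 5% tail figure is approached:

```python
import numpy as np

rng = np.random.default_rng(1)

# Parent: exponential distribution with mu = 1, sigma = 1 (far from Normal).
for n in (2, 10, 50):
    means = rng.exponential(scale=1.0, size=(100_000, n)).mean(axis=1)
    z = (means - 1.0) * np.sqrt(n)        # standardize: sd of the mean is 1/sqrt(n)
    print(n, np.mean(np.abs(z) > 1.96))   # should approach the Normal value 0.05
```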
The Central Limit Theorem means that, within wide limits,
whatever the distribution of the population from which the observations are randomly obtained, the distribution of the mean or total
of a large number of observations will tend to be Normal. However,
the theorem deals only with the limiting distribution as n tends to
infinity and gives no indication of the rapidity with which the
Normal distribution is approached as n increases. It is worth noting
that, if γ1 (= √β1) and γ2 (= β2 − 3) are the measures of skewness and kurtosis respectively of a distribution, then the corresponding measures for the distribution of the mean are γ1/√n and γ2/n, which become closer and closer to the values of these measures (i.e. both zero) for the Normal distribution as n increases.

[FIG. 2. Distribution of the mean of samples from the rectangular distribution p(x) = 1, −0.5 < x < 0.5.]

As an example of
the approach to Normality it is interesting to consider the distribution of the mean of samples of various sizes from the rectangular
distribution; for samples of two the distribution is triangular,
for samples of three it is made up of three parabolas and is
already taking on something of the appearance of a Normal
curve; for samples of four it is made up of four cubic curves (see
Figure 2). For all sample sizes the distribution of the mean is
symmetrical. In the case of a sample of 10 the measure of kurtosis, γ2, which is zero for the Normal distribution, has the value −0.12 (Irwin, 1927). Thus even for quite small samples from the rectangular distribution, the distribution of the mean is approximately Normal.
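Irwin's figure is easy to reproduce by simulation. A sketch (illustrative; the number of repetitions is arbitrary) estimating γ2 for the mean of samples from the rectangular distribution, to be compared with the value −1.2/n implied by the text:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Means of samples of size n from the rectangular distribution on (-0.5, 0.5).
# The text implies gamma_2 = -1.2/n for the mean (e.g. -0.12 when n = 10).
for n in (2, 4, 10):
    means = rng.uniform(-0.5, 0.5, size=(200_000, n)).mean(axis=1)
    print(n, stats.kurtosis(means))       # Fisher kurtosis: 0 for the Normal
```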
Broadly it can be said that for a large proportion of the frequency
distributions likely to be encountered in practice the distribution of
the sum or the mean of a number of observations is unlikely to differ
greatly from the Normal distribution provided that the number of
observations is not very small.
EFFECTS OF NON-NORMALITY
In recent years there has grown up quite a volume of literature
concerned with the effect of non-Normality on various statistical
tests when the sample size is not large. The term 'robustness' has
been used to describe the extent to which a statistical test is affected
by non-Normality of the data. A robust test is one which is affected
comparatively little by non-Normality.
Gayen (1949), (1950a), (1950b) has conducted a series of investigations on the effect of non-Normality on the F and t tests. The
parent distribution from which samples are assumed drawn at
random is specified by the Edgeworth form of the Gram-Charlier
Type A series

$$f(x) = \phi(x) - \frac{\gamma_1}{3!}\,\phi^{(3)}(x) + \frac{\gamma_2}{4!}\,\phi^{(4)}(x) + \frac{10\gamma_1^2}{6!}\,\phi^{(6)}(x),$$

where

$$\phi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$$

is the standardized Normal function, φ^(r)(x) its rth derivative, and γ1 (= √β1), γ2 (= β2 − 3) are respectively the measures of skewness and kurtosis. The true levels of significance are then found when F and t tests at the nominal level of, say, 5% are applied to samples
of given sizes drawn from parent distributions with specified values
of γ1 and γ2. It is found that the F test (and the equivalent t test
for two means) when used on a one-way classification to compare
two or more means is robust, i.e. it is little affected by either
skewness or kurtosis of the parent distribution. Table 1 gives some
examples of the results for various values of γ1² and γ2, assuming
that all samples are drawn from the same parent distribution; it
shows that even for a small value of n the true significance level for
the particular non-Normal distribution considered does not differ
greatly from 5%.
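The robustness claim can be checked crudely by Monte Carlo methods. The sketch below (illustrative; the exponential parent is even more skewed than the distributions covered by Table 1) estimates the true size of the nominal 5% F test for five samples of five:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# True size of the nominal-5% one-way F test, five samples of five,
# when the parent is a skewed exponential distribution (illustrative).
trials, rejections = 20_000, 0
for _ in range(trials):
    groups = [rng.exponential(size=5) for _ in range(5)]
    _, p = stats.f_oneway(*groups)
    rejections += p < 0.05
print("estimated true size:", rejections / trials)   # close to 0.05 if robust
```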
If the t test is used to compare the mean of a sample with a given
fixed value, the effect of skewness is found to be rather serious but
kurtosis does not have a large effect. However, the most frequent
use of this test is when the quantities to which the test is applied
consist of the differences between pairs of observations and it is
required to find whether the mean difference is significantly different
from zero or some other given value. In such cases, if the parameters γ1 and γ2 are the same for each population, the corresponding parameters γ1′ and γ2′ for the distribution of the differences would be γ1′ = 0 and γ2′ = γ2/2; thus the distribution of the differences
would be symmetrical and hence the t test for this situation is reasonably robust.

Table 1. True percentage probabilities associated with the 5% Normal theory significance point for the F test applied to a one-way classification of five samples each of five, i.e. degrees of freedom 4 and 20

             γ2 = −1   γ2 = 0   γ2 = 2
  γ1² = 0      5.24      5.00     4.52
  γ1² = 1      5.34      5.10     4.62
  γ1² = 2       …        5.19     4.71
The F test can also be used to compare two variances, and an extension of this known as Bartlett's test (see, e.g. Pearson and Hartley,
1954, p. 57) is available for comparing more than two variances.
Box (1953) has investigated the behaviour of these tests under non-Normality and has found them to be seriously affected by kurtosis
and the larger the number of variances being compared the larger
is the discrepancy. Also it is not always the case that the discrepancies are smaller for large samples than for small samples. Table 2
gives some figures for large samples.
The effects of non-Normality on the χ² test, when used to compare
an observed variance with a given fixed value, are very similar to
those just described for the F test for comparing two variances.
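Box's finding is readily reproduced. In the sketch below (illustrative; Student's t with 7 degrees of freedom is chosen because its γ2 is 2, matching a column of Table 2) every population variance is equal, yet Bartlett's test rejects far more often than the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# All five populations have equal variances, but the parent has positive
# kurtosis: Student's t with 7 d.f. has gamma_2 = 6/(7 - 4) = 2.
trials, rejections = 10_000, 0
for _ in range(trials):
    groups = [rng.standard_t(df=7, size=100) for _ in range(5)]
    _, p = stats.bartlett(*groups)
    rejections += p < 0.05
print("estimated true size:", rejections / trials)   # far above 0.05
```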
It is frequently suggested that a test of equality of variances should
be made before making an analysis of variance test for equality of
means which involves the assumption of equality of variances. However, when group sizes are equal or not very different, the analysis
of variance test is comparatively little affected by inequalities of
variances and, on the basis of his investigation, Box (1953) suggests
that more wrong conclusions may be reached by first testing for
variance inequalities than if this preliminary test were omitted
unless the parent distribution is known to be effectively Normal.
Davies (1956) Appendix 2A gives a description of the effects of
departures from Normality on certain tests of significance.
Table 2. True percentage probabilities in large samples associated with the 5% Normal theory significance points for the test for comparison of several variances

  Number of variances
   being compared      γ2 = −1   γ2 = 0   γ2 = 2
          2              0.56      5.0     16.6
          5              0.08      5.0     31.5
         20              0.0004    5.0     71.8
TESTING FOR NORMALITY
There are two types of procedure commonly used in testing for
Normality:
(i) Fit a Normal distribution to the sample data and apply
the χ² test of goodness of fit.
(ii) Calculate certain functions of the moments of the data and
examine the significance of departures from the corresponding values for the Normal distribution.
While procedure (i) has certain advantages provided the data are
plentiful, it may be somewhat insensitive because it takes no account
of the sign or arrangement of the deviations and because of the
necessity of grouping together the small frequencies in the tails of the
distribution when applying the test.
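Procedure (i) might be sketched as follows (illustrative Python; the sample, the use of ten equal-count classes and the pooling of the open tails are arbitrary choices, not prescriptions from the text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=2.0, size=500)      # data to be tested

# Ten classes with roughly equal counts; the two extreme classes are open-ended.
cuts = np.quantile(x, np.linspace(0.1, 0.9, 9))    # nine interior boundaries
observed = np.bincount(np.searchsorted(cuts, x), minlength=10)

# Expected frequencies under a Normal distribution fitted by mean and sd.
edges = np.concatenate(([-np.inf], cuts, [np.inf]))
probs = np.diff(stats.norm.cdf(edges, loc=x.mean(), scale=x.std(ddof=1)))
expected = len(x) * probs

# Two further degrees of freedom are lost for the two fitted parameters.
chi2, p = stats.chisquare(observed, expected, ddof=2)
print(f"chi-squared = {chi2:.2f}, p = {p:.3f}")
```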
The usual moment ratio tests for Normality consist of a comparison of √b1 = m3/m2^(3/2) and b2 = m4/m2², calculated from the sample, with the corresponding values of 0 and 3 respectively for the Normal distribution. However, in small samples the sampling distribution of b2 is extremely skew and the criterion

$$a = \frac{\text{mean deviation}}{\text{standard deviation}}$$

has been recommended instead. The value of a for large samples from a Normal distribution tends to √(2/π), i.e. about 0.8.
These tests are described and tables of the 5% and 1% significance
levels given in Pearson and Hartley (1954) Table 34.
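Procedure (ii) amounts to a few lines of arithmetic. A minimal sketch (the function name is illustrative, not from the text):

```python
import numpy as np

def moment_criteria(x):
    """Sample moment criteria to be compared with their Normal values."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2, m3, m4 = (d**2).mean(), (d**3).mean(), (d**4).mean()
    root_b1 = m3 / m2**1.5               # 0 for the Normal distribution
    b2 = m4 / m2**2                      # 3 for the Normal distribution
    a = np.abs(d).mean() / np.sqrt(m2)   # tends to sqrt(2/pi), about 0.8
    return root_b1, b2, a

rng = np.random.default_rng(6)
print(moment_criteria(rng.normal(size=10_000)))
```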
However, testing for Normality is only a special case of the general
problem of testing whether a sample can be regarded as having been
drawn at random from a population with a specified probability
distribution. Birnbaum (1953) reviews a number of tests which have
been devised for this purpose and which can be used to test for
Normality. In a recent paper Shapiro and Wilk (1965) describe a
new overall test for Normality which, so far as their investigations
go, compares favourably with the tests described above and by
Birnbaum (1953).
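The Shapiro-Wilk test is now available in standard statistical software; for instance (an illustrative call, not part of the original paper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
print(stats.shapiro(rng.normal(size=200)))        # large p: consistent with Normality
print(stats.shapiro(rng.exponential(size=200)))   # very small p: clearly non-Normal
```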
TRANSFORMATIONS TO ACHIEVE NORMALITY
If the distribution of a set of observations is not Normal a transformation can sometimes be used to obtain a variable which is more
nearly Normally distributed. Three transformations which may be
useful for this purpose are to take the transformed variable as
(i) the square root,
(ii) the logarithm,
(iii) the reciprocal,
of the observed value. The square root transformation is likely to be
appropriate in moderately skew distributions and the reciprocal for
very skew distributions, with the logarithmic transformation for
intermediate cases.
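A minimal sketch of this trial-and-error approach (illustrative; the gamma-distributed sample is an arbitrary stand-in for moderately skew data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.gamma(shape=4.0, size=10_000)   # positive, moderately skewed data

# Try each candidate and keep whichever brings the sample skewness
# nearest to the Normal value of zero.
for name, t in [("raw", x), ("square root", np.sqrt(x)),
                ("logarithm", np.log(x)), ("reciprocal", 1.0 / x)]:
    print(f"{name:12s} skewness = {stats.skew(t):7.3f}")
```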
However, a great many statistical analyses of observed data use
the method of Analysis of Variance which involves the assumption
that the variances of the groups being investigated are all equal.
Sometimes the nature of the data is such that this assumption is not
true; for example if the observations consist of proportions rather
than measurements, the group variances depend on the mean of the
group. In such cases a transformation can be applied to equalize the
variances. Fortunately it is found that a transformation which tends
to equalize the group variances often has the effect of improving the
Normality, particularly when the original distribution does not
deviate greatly from Normal. Quenouille (1950) gives in Chapter 8
an interesting discussion of transformations for both the purposes
outlined above.
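One classical example, not spelled out in the text, is the angular (arcsine square-root) transformation for proportions, whose variance then depends hardly at all on the group mean. A sketch under that assumption:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50   # binomial denominator behind each observed proportion (illustrative)

# Raw proportions have variance p(1-p)/n, which varies with the mean;
# y = arcsin(sqrt(p)) has variance close to 1/(4n) for every group.
for p_true in (0.1, 0.3, 0.5):
    p_hat = rng.binomial(n, p_true, size=20_000) / n
    y = np.arcsin(np.sqrt(p_hat))
    print(p_true, round(p_hat.var(), 5), round(y.var(), 5))
```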
CONCLUSION
To sum up, many observed frequency distributions show considerable departures from the Normal distribution but nevertheless
statistical tests where Normality is assumed may often be applied
to the means or totals of a number of observations because the distribution of a mean or total is likely to be much closer to the Normal
distribution than is the distribution of individual observations.
However, the assumption of Normality should not be made blindly
or without careful consideration.
The F and t tests for comparing means are comparatively robust
tests and not likely to be grossly misleading even if departures from
Normality in the parent distribution are appreciable, and the number
of observations small. Some care is required in using the t test to
compare a mean with a given value unless the quantities operated
on are differences between pairs of observations.
The χ² and F tests for comparing variances must be used with
great caution unless the parent distribution is known to be effectively
Normal.
While the Normal distribution has a very important place in the
theory and practice of statistics, it must be kept in its place and not
allowed to assume a greater importance than is appropriate.
REFERENCES
AIRY, G. B. (1879). On the algebraic and numerical theory of errors of observation
and the combination of observations (3rd edition). London: Macmillan & Co.
ARCHIBALD, R. C. (1926). A rare pamphlet of Moivre and some of his discoveries.
Isis, 8, 671.
BIRNBAUM, Z. W. (1953). Distribution-free tests of fit for continuous distribution
functions. Ann. Math. Statist., 24, 1.
BOX, G. E. P. (1953). Non-normality and tests on variances. Biometrika, 40, 318.
BRUNT, D. (1931). The combination of observations. Cambridge University Press.
CRAMER, H. (1946). Mathematical methods of statistics. Princeton University Press.
DAVIES, O. L. (Ed.) (1956). The design and analysis of industrial experiments.
London: Oliver & Boyd.
DE MOIVRE, A. (1738). The Doctrine of Chances (2nd edition).
DE MORGAN, A. (1838). An essay on probabilities and on their application to life
contingencies and insurance offices. London.
GAYEN, A. K. (1949). The distribution of Student's t in random samples of any
size drawn from non-normal universes. Biometrika, 36, 353.
GAYEN, A. K. (1950a). The distribution of the variance ratio in random samples
of any size drawn from non-normal universes. Biometrika, 37, 236.
GAYEN, A. K. (1950b). Significance of difference between the means of two non-normal samples. Biometrika, 37, 399.
GEARY, R. C. (1947). Testing for normality. Biometrika, 34, 209.
IRWIN, J. O. (1927). On the frequency distribution of the means of samples from
a population having any law of frequency with finite moments with special
reference to Pearson's Type II. Biometrika, 19, 225.
JEFFREYS, H. (1938). The law of error and the combination of observations.
Phil. Trans. Roy. Soc. A, 237, 231.
JEFFREYS, H. (1939). Theory of Probability. Oxford University Press.
KENDALL, M. G. (1951). The advanced theory of statistics, 2. London: Griffin.
KENDALL, M. G. and STUART, A. (1963). The advanced theory of statistics, 1.
London: Griffin.
PEARSON, E. S. (1965). Studies in the history of probability and statistics XIV.
Some incidents in the early history of biometry and statistics, 1890-94.
Biometrika, 52, 3.
PEARSON, E. S. and HARTLEY, H. O. (1954). Biometrika tables for statisticians.
Cambridge University Press.
PEARSON, K. (1894).* Contributions to the mathematical theory of evolution.
Phil. Trans. Roy. Soc. A, 185, 71.
PEARSON, K. (1900).* On the criterion that a given system of deviations from the
probable in the case of a correlated system of variables is such that it can
be reasonably supposed to have arisen from random sampling. Philosophical
Magazine, 50, 157.
QUENOUILLE, M. H. (1950). Introductory statistics. London: Butterworths.
SHAPIRO, S. S. and WILK, M. B. (1965). An analysis of variance test for normality
(complete samples). Biometrika, 52, 591.
SMITH, D. E. (1929). A source book in mathematics. New York & London:
McGraw Hill.
WALKER, H. M. (1929). Studies in the history of statistical method. Baltimore:
Williams and Wilkins.
WHITTAKER, E. and ROBINSON, G. (1949). The calculus of observations. London:
Blackie.
YULE, G. U. and KENDALL, M. G. (1950). An introduction to the theory of statistics.
London: Griffin.
* These papers are reprinted in Karl Pearson's early statistical papers.
Cambridge University Press, 1948.