Scope Nov05 Vol14Iss4 - The University of Sheffield

Tutorial
The Normal Distribution
Jenny V. Freeman and Steven A. Julious
Medical Statistics Group, School of Health and Related Research, University of Sheffield, Community
Sciences Centre, Northern General Hospital, Herries Road, Sheffield, UK
Introduction
The first two tutorials in this series have focused on
displaying data and simple methods for describing and
summarising data. There has been little discussion of
statistical theory. In this note, we will start to explore some
of the basic concepts underlying much statistical
methodology. We will describe the basic theory
underlying the Normal distribution and the link between
empirical frequency distributions (the observed
distribution of data in a sample) and theoretical
probability distributions (the theoretical distribution of
data in a population). In addition, we will introduce the
idea of a confidence interval.
Theoretical probability distributions
Since it is rarely possible to collect information on an
entire population, the aim of many statistical analyses is
to use information from a sample to draw conclusions (or
‘make inferences’) about the population of interest. These
inferences are facilitated by making assumptions about
the underlying distribution of the measurement of interest
in the population as a whole, and by applying an
appropriate theoretical model to describe how the
measurement behaves in the population. (Note that prior
to any analysis, it is usual to make assumptions about the
underlying distribution of the measurement being studied.
These assumptions can then be investigated through
various plots and figures for the observed data – e.g. a
histogram for continuous data. These investigations are
referred to as diagnostics and will be discussed
throughout subsequent notes.) In the context of this note,
the population is a theoretical concept used for
describing an entire group, and one way of describing the
distribution of a measurement in a population is by use of
a suitable theoretical probability distribution. Probability
distributions can be used to calculate the probability of
different values occurring, and they exist for both
continuous and categorical measurements.
In addition to the Normal distribution (described later in
this note), there are many other theoretical distributions,
including the chi-squared, binomial and Poisson
distributions (these will be discussed in later tutorials).
Each of these theoretical distributions is described by a
particular mathematical expression (formally referred to
as a model), and for each model there exist summary
measures, known as parameters, which completely
describe that particular distribution. In practice,
parameters are usually estimated by quantities calculated
from the sample, and these are known as statistics; that
is, a statistic is a quantity calculated from a sample in
order to estimate an unknown parameter in a population.
For example, the Normal distribution is completely
characterised by the population mean (µ) and population
standard deviation (σ), and these are estimated by the
sample mean (x̄) and sample standard deviation (s)
respectively.
The Normal distribution
The Normal, or Gaussian, distribution (named in honour
of the German mathematician C. F. Gauss, 1777–1855) is
the most important theoretical probability distribution in
statistics. At this point, it is important to stress that, in this
context, the word ‘normal’ is a statistical term and is not
used in the dictionary or clinical sense of conforming to
what would be expected. Thus, in order to distinguish
between the two, statistical and dictionary ‘normal’, it is
conventional to use a capital letter when referring to the
Normal distribution.
The basic properties of the Normal distribution are
outlined in table 1. The distribution curve of data that are
Normally distributed has a characteristic shape; it is
bell-shaped and symmetrical about a single peak (figure 1).
For any given value of the mean, populations with a small
standard deviation have a distribution clustered close to
the mean (µ), while those with a large standard deviation
have a distribution that is widely spread along the
measurement axis, with a flatter peak.

Table 1: Properties of the Normal distribution
1. It is bell-shaped and has a single peak (unimodal)
2. It is symmetrical about the mean
3. It is uniquely defined by two parameters, the mean (µ)
and the standard deviation (σ)
4. The mean, median and mode all coincide
5. The probability that a Normally distributed random
variable, x, with mean µ and standard deviation σ lies
between the limits (µ − 1.96σ) and (µ + 1.96σ) is 0.95,
i.e. 95% of the data for a Normally distributed random
variable will lie between these limits*
6. The probability that a Normally distributed random
variable, x, with mean µ and standard deviation σ lies
between the limits (µ − 2.58σ) and (µ + 2.58σ) is 0.99
7. Any position on the horizontal axis of figure 1 can be
expressed as a number of standard deviations away from
the mean value
*This fact is used for calculating the 95% confidence
interval for Normally distributed data.

Figure 1: The Normal distribution
As mentioned earlier, the Normal distribution is described
completely by two parameters, the mean (µ) and the
standard deviation (σ). This means that for any Normally
distributed variable, once the mean and variance (σ2) are
known (or estimated), it is possible to calculate the
probability distribution for that population.
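As a small illustrative sketch (not part of the original tutorial), these probabilities can be computed with Python's standard-library `statistics.NormalDist` class, which provides the cumulative distribution function of the Normal distribution:

```python
from statistics import NormalDist

# Standard Normal; any choice of mu and sigma gives the same coverage
nd = NormalDist(mu=0, sigma=1)

# Probability of lying within 1.96 standard deviations of the mean
p95 = nd.cdf(1.96) - nd.cdf(-1.96)
# ...and within 2.58 standard deviations
p99 = nd.cdf(2.58) - nd.cdf(-2.58)
print(f"within 1.96 sd: {p95:.3f}, within 2.58 sd: {p99:.3f}")
```

The two printed probabilities come out at approximately 0.95 and 0.99, matching the properties listed in table 1.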
An important feature of a Normal distribution is that 95%
of the data fall within 1.96 standard deviations of the
mean – the unshaded area in the middle of the curve in
figure 1. A summary measure for a sample often quoted
is the pair of values given by the mean ± 1.96 × the
standard deviation (x̄ ± 1.96s). These two values are
termed the Normal range, and represent the range within
which 95% of the data are expected to lie. Note that
68.3% of data lie within 1 standard deviation of the mean,
while virtually all of the data (99.7%) will lie within 3
standard deviations (95.5% will lie within 2). The Normal
distribution is important, as it underpins much of the
subsequent statistical theory outlined both in this and
later tutorials, such as the calculation of confidence
intervals and linear modelling techniques.
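As a sketch of the Normal range calculation described above, computed from a small hypothetical sample (the measurement values are invented purely for illustration):

```python
from statistics import mean, stdev

# Hypothetical sample of a continuous measurement
sample = [4.2, 5.1, 4.8, 5.6, 4.9, 5.3, 4.5, 5.0, 4.7, 5.2]
xbar = mean(sample)
s = stdev(sample)  # sample standard deviation (n - 1 denominator)

# Normal range: expected to contain about 95% of individual values
lower, upper = xbar - 1.96 * s, xbar + 1.96 * s
print(f"mean = {xbar:.2f}, Normal range {lower:.2f} to {upper:.2f}")
```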
The Central Limit Theorem
The Central Limit Theorem states that, given any series of
independent, identically distributed random variables, their
means will tend to a Normal distribution as the number of
variables increases. Put another way, the distribution of
sample means drawn from a population will be Normally
distributed whatever the distribution of the actual data in
the population as long as the samples are large enough.
In order to illustrate this, consider the random numbers
0–9. The distribution of these numbers in a random
numbers table would be uniform. That is, that each number
has an equal probability of being selected, and the shape
of the theoretical distribution is represented by a rectangle.
According to the Central Limit Theorem, if you were to
select repeated random samples of the same size from
this distribution, and then calculate the means of these
different samples, the distribution of these sample means
would be approximately Normal, and this approximation
would improve as the size of each sample increased.
Figure 2a represents the distribution of the sample means
for 500 samples of size 5. Even with such a small sample
size, the approximation to the Normal is remarkable;
repeating the experiment with samples of size 50 improves
the fit to the Normal distribution (figure 2b). The other
noteworthy feature of these two figures is that as the size
of the samples increases (from 5 to 50), the spread of the
means is decreased.
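The digit-sampling experiment described above can be reproduced with a short simulation; the number of samples matches the 500 used in figure 2, while the seed is an arbitrary choice for repeatability:

```python
import random
from statistics import mean, stdev

random.seed(1)  # arbitrary seed so the run is repeatable

def sample_means(sample_size, n_samples=500):
    """Means of repeated samples of uniformly distributed digits 0-9."""
    return [mean(random.randint(0, 9) for _ in range(sample_size))
            for _ in range(n_samples)]

means5 = sample_means(5)
means50 = sample_means(50)

# The means centre on 4.5, and their spread shrinks as sample size grows
print(f"n=5:  mean = {mean(means5):.2f}, sd = {stdev(means5):.2f}")
print(f"n=50: mean = {mean(means50):.2f}, sd = {stdev(means50):.2f}")
```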
Figure 2: Distribution of means from 500 samples
a: samples of size 5, mean = 4.64, sd = 1.29
b: samples of size 50, mean = 4.50, sd = 0.41

Each mean estimated from a sample is an unbiased
estimate of the true population mean, and by repeating the
sampling many times we can obtain a sample of plausible
values for the true population mean. Using the Central
Limit Theorem, we can infer that 95% of sample means will
lie within 1.96 standard errors of the population mean,
where the standard error is the standard deviation of the
distribution of the sample means. As we do not usually
know the true population mean, the more important
inference is that we can be 95% confident that the
population mean will fall within 1.96 standard errors of
the sample mean. In reality, as
we usually only take a single sample, we can use the
Central Limit Theorem to construct an interval within which
we are reasonably confident the true population mean will
lie. This range of plausible values is known as the
confidence interval, and the formula for the confidence
interval for the mean is given in table 2. Technically
speaking, the 95% confidence interval is the range of
values within which the true population mean would lie 95%
of the time if a study was repeated many times. Crudely
speaking, the confidence interval gives a range of
plausible values for the true population mean. We will
discuss confidence intervals further in subsequent notes
in the context of hypothesis tests and P-values.

Table 2: Formula for the confidence interval for a mean
x̄ − 1.96 × s/√n  to  x̄ + 1.96 × s/√n
where s = sample standard deviation and n = number
of individuals in the sample.
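The formula in table 2 can be sketched directly in code; the sample values below are hypothetical, chosen only to illustrate the calculation:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of n = 25 measurements
sample = [5.0, 4.6, 5.3, 4.9, 5.1, 4.7, 5.4, 5.0, 4.8, 5.2,
          4.5, 5.5, 4.9, 5.1, 5.0, 4.8, 5.2, 4.7, 5.3, 4.9,
          5.1, 5.0, 4.6, 5.4, 5.0]
n = len(sample)
xbar = mean(sample)
se = stdev(sample) / sqrt(n)  # standard error of the mean, s/sqrt(n)

# 95% confidence interval for the mean, as in table 2
ci = (xbar - 1.96 * se, xbar + 1.96 * se)
print(f"mean = {xbar:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```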
In order to calculate the confidence interval, we need to be
able to estimate the standard deviation of the sample
mean. It is defined as the sample standard deviation, s,
divided by the square root of the number of individuals in
the sample, s/√n, and is usually referred to as the standard
error. In order to avoid confusion, it is worth remembering
that with use of the standard deviation (of all individuals in
the sample), you can make inferences about the spread of
the measurement within the population for individuals,
while with use of the standard error, you can make
inferences about the spread of the means: the standard
deviation is for describing (the spread of data), while the
standard error is for estimating (how precisely the mean
has been pinpointed).
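A brief simulation can illustrate this distinction: as the sample size grows, the sample standard deviation settles near the population value, while the standard error keeps shrinking. The population mean of 100 and standard deviation of 15 below are arbitrary assumptions for the sketch:

```python
import random
from math import sqrt
from statistics import stdev

random.seed(2)  # arbitrary seed for repeatability

def sd_and_se(n, mu=100, sigma=15):
    """Sample sd (spread of individuals) and se (precision of the mean)."""
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    s = stdev(sample)
    return s, s / sqrt(n)

# The sd stays near the population value of 15, while the se shrinks
for n in (25, 2500):
    s, se = sd_and_se(n)
    print(f"n = {n}: sd = {s:.1f}, se = {se:.2f}")
```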
Summary
In this tutorial, we have outlined the basic properties of the
Normal distribution, discussed the Central Limit Theorem
and outlined its importance to statistical theory.
The Normal distribution is fundamental to many of the tests
of statistical significance outlined in subsequent tutorials,
while the principles of the Central Limit Theorem enable us
to calculate confidence intervals and make inferences
about the population from which the sample is taken.