Tutorial

The Normal Distribution

Jenny V. Freeman and Steven A. Julious
Medical Statistics Group, School of Health and Related Research, University of Sheffield, Community Sciences Centre, Northern General Hospital, Herries Road, Sheffield, UK

Introduction

The first two tutorials in this series have focused on displaying data and simple methods for describing and summarising data. There has been little discussion of statistical theory. In this note, we will start to explore some of the basic concepts underlying much statistical methodology. We will describe the basic theory underlying the Normal distribution and the link between empirical frequency distributions (the observed distribution of data in a sample) and theoretical probability distributions (the theoretical distribution of data in a population). In addition, we will introduce the idea of a confidence interval.

Theoretical probability distributions

Since it is rarely possible to collect information on an entire population, the aim of many statistical analyses is to use information from a sample to draw conclusions (or 'make inferences') about the population of interest. These inferences are facilitated by making assumptions about the underlying distribution of the measurement of interest in the population as a whole, and by applying an appropriate theoretical model to describe how the measurement behaves in the population. (Note that prior to any analysis, it is usual to make assumptions about the underlying distribution of the measurement being studied. These assumptions can then be investigated through various plots and figures for the observed data, e.g. a histogram for continuous data. These investigations are referred to as diagnostics and will be discussed throughout subsequent notes.)
In the context of this note, the population is a theoretical concept used for describing an entire group, and one way of describing the distribution of a measurement in a population is by use of a suitable theoretical probability distribution. Probability distributions can be used to calculate the probability of different values occurring, and they exist for both continuous and categorical measurements. In addition to the Normal distribution (described later in this note), there are many other theoretical distributions, including the chi-squared, binomial and Poisson distributions (these will be discussed in later tutorials). Each of these theoretical distributions is described by a particular mathematical expression (formally referred to as a model), and for each model there exist summary measures, known as parameters, which completely describe that particular distribution. In practice, parameters are usually estimated by quantities calculated from the sample, and these are known as statistics; that is, a statistic is a quantity calculated from a sample in order to estimate an unknown parameter in a population. For example, the Normal distribution is completely characterised by the population mean (µ) and population standard deviation (σ), and these are estimated by the sample mean (x̄) and sample standard deviation (s) respectively.

The Normal distribution

The Normal, or Gaussian, distribution (named in honour of the German mathematician C. F. Gauss, 1777–1855) is the most important theoretical probability distribution in statistics. At this point, it is important to stress that, in this context, the word 'normal' is a statistical term and is not used in the dictionary or clinical sense of conforming to what would be expected. Thus, in order to distinguish between the two, statistical and dictionary 'normal', it is conventional to use a capital letter when referring to the Normal distribution.
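The relationship between parameters and statistics can be made concrete with a short simulation. The sketch below, using only Python's standard library, draws a sample from a Normal population with known µ and σ and shows that the sample mean x̄ and sample standard deviation s estimate them; the seed and the values µ = 100, σ = 15 are illustrative choices, not taken from the text.

```python
import random
import statistics

# Draw a sample from a Normal population with known parameters.
# (mu = 100, sigma = 15 are illustrative values only.)
random.seed(42)
mu, sigma = 100, 15
sample = [random.gauss(mu, sigma) for _ in range(1000)]

# The sample mean (x-bar) and sample standard deviation (s) are the
# statistics that estimate the population parameters mu and sigma.
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)  # uses the n - 1 denominator

print(f"estimate of mu:    {x_bar:.2f}")
print(f"estimate of sigma: {s:.2f}")
```

With 1000 observations the two statistics land close to the parameters; a larger sample would tighten the estimates further.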
The basic properties of the Normal distribution are outlined in table 1. The distribution curve of data that are Normally distributed has a characteristic shape; it is bell-shaped, and symmetrical about a single peak (figure 1). For any given value of the mean, populations with a small standard deviation have a distribution clustered close to the mean (µ), while those with a large standard deviation have a distribution that is widely spread along the measurement axis, and the peak is more flattened.

Table 1: Properties of the Normal distribution
1. It is bell-shaped and has a single peak (unimodal)
2. It is symmetrical about the mean
3. It is uniquely defined by two parameters, the mean (µ) and standard deviation (σ)
4. The mean, median and mode all coincide
5. The probability that a Normally distributed random variable, x, with mean µ and standard deviation σ lies between the limits (µ − 1.96σ) and (µ + 1.96σ) is 0.95; i.e. 95% of the data for a Normally distributed random variable will lie between the limits (µ − 1.96σ) and (µ + 1.96σ)*
6. The probability that a Normally distributed random variable, x, with mean µ and standard deviation σ lies between the limits (µ − 2.58σ) and (µ + 2.58σ) is 0.99
7. Any position on the horizontal axis of figure 1 can be expressed as a number of standard deviations away from the mean value
*This fact is used for calculating the 95% confidence interval for Normally distributed data.

Figure 1: The Normal distribution

As mentioned earlier, the Normal distribution is described completely by two parameters, the mean (µ) and the standard deviation (σ). This means that for any Normally distributed variable, once the mean and variance (σ2) are known (or estimated), it is possible to calculate the probability distribution for that population. An important feature of a Normal distribution is that 95% of the data fall within 1.96 standard deviations of the mean, the unshaded area in the middle of the curve in figure 1.
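Property 5 of table 1 can be checked empirically. The sketch below, again using only the standard library, draws a large Normal sample and counts how many values fall within µ ± 1.96σ; the seed and the choice of a standard Normal (µ = 0, σ = 1) are illustrative.

```python
import random

# Empirically check property 5 of table 1: about 95% of Normally
# distributed values lie within 1.96 standard deviations of the mean.
# (mu = 0, sigma = 1 are arbitrary illustrative choices.)
random.seed(1)
mu, sigma = 0.0, 1.0
n = 100_000
values = [random.gauss(mu, sigma) for _ in range(n)]

within = sum(mu - 1.96 * sigma <= v <= mu + 1.96 * sigma for v in values)
print(f"proportion within mu +/- 1.96 sigma: {within / n:.3f}")
```

The printed proportion sits very close to 0.95; replacing 1.96 with 2.58 in the same check gives a proportion close to 0.99.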
A summary measure often quoted for a sample is the pair of values given by the mean ± 1.96 × standard deviation (x̄ ± 1.96s). These two values are termed the Normal range, and represent the range within which 95% of the data are expected to lie. Note that 68.3% of data lie within 1 standard deviation of the mean, while virtually all of the data (99.7%) will lie within 3 standard deviations (95.5% will lie within 2). The Normal distribution is important, as it underpins much of the subsequent statistical theory outlined both in this and later tutorials, such as the calculation of confidence intervals and linear modelling techniques.

The Central Limit Theorem

The Central Limit Theorem states that, given any series of independent, identically distributed random variables, their means will tend to a Normal distribution as the number of variables increases. Put another way, the distribution of sample means drawn from a population will be Normally distributed whatever the distribution of the actual data in the population, as long as the samples are large enough. In order to illustrate this, consider the random numbers 0–9. The distribution of these numbers in a random numbers table would be uniform. That is, each number has an equal probability of being selected, and the shape of the theoretical distribution is represented by a rectangle. According to the Central Limit Theorem, if you were to select repeated random samples of the same size from this distribution, and then calculate the means of these different samples, the distribution of these sample means would be approximately Normal, and this approximation would improve as the size of each sample increased. Figure 2a represents the distribution of the sample means for 500 samples of size 5. Even with such a small sample size, the approximation to the Normal is remarkable; repeating the experiment with samples of size 50 improves the fit to the Normal distribution (figure 2b).
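The random-digits experiment described above is easy to reproduce. The sketch below draws 500 samples of digits 0–9 for each of the two sample sizes and summarises the resulting distributions of sample means; the seed is arbitrary, and `sample_means` is a helper name introduced here for illustration.

```python
import random
import statistics

# Reproduce the experiment from the text: draw repeated samples of
# random digits 0-9 (a uniform distribution) and examine the spread
# of the sample means for sample sizes 5 and 50.
random.seed(7)

def sample_means(sample_size, n_samples=500):
    """Means of n_samples random samples of digits 0-9."""
    return [statistics.mean(random.choices(range(10), k=sample_size))
            for _ in range(n_samples)]

for size in (5, 50):
    means = sample_means(size)
    print(f"n={size:2d}: mean of means={statistics.mean(means):.2f}, "
          f"sd of means={statistics.stdev(means):.2f}")
```

The mean of the sample means stays near the population mean of 4.5 for both sample sizes, while the spread of the means shrinks markedly as the sample size grows from 5 to 50, in line with the figures quoted for figure 2.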
The other noteworthy feature of these two figures is that as the size of the samples increases (from 5 to 50), the spread of the means decreases.

Figure 2: Distribution of means from 500 samples. a: samples of size 5 (mean = 4.64, sd = 1.29); b: samples of size 50 (mean = 4.50, sd = 0.41)

Each mean estimated from a sample is an unbiased estimate of the true population mean, and by repeating the sampling many times we can obtain a sample of plausible values for the true population mean. Using the Central Limit Theorem, we can infer that 95% of sample means will lie within 1.96 standard errors of the population mean. As we do not usually know the true population mean, the more important inference is that we are 95% confident that the population mean will fall within 1.96 standard errors of the sample mean. In reality, as we usually only take a single sample, we can use the Central Limit Theorem to construct an interval within which we are reasonably confident the true population mean will lie. This range of plausible values is known as the confidence interval, and the formula for the confidence interval for the mean is given in table 2. Technically speaking, the 95% confidence interval is the range of values within which the true population mean would lie 95% of the time if a study were repeated many times. Crudely speaking, the confidence interval gives a range of plausible values for the true population mean. We will discuss confidence intervals further in subsequent notes, in the context of hypothesis tests and P-values.

Table 2: Formula for the confidence interval for a mean
x̄ − 1.96 × s/√n to x̄ + 1.96 × s/√n, where s = sample standard deviation and n = number of individuals in the sample.

In order to calculate the confidence interval, we need to be able to estimate the standard deviation of the sample mean. It is defined as the sample standard deviation, s, divided by the square root of the number of individuals in the sample, √n, and is usually referred to as the standard error.
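The formula in table 2 can be applied in a few lines. The sketch below computes a 95% confidence interval for the mean of a single sample; the sample itself is simulated here purely for illustration (the population values 4.5 and 2.87 echo the digits example, and the seed is arbitrary).

```python
import math
import random
import statistics

# Apply the confidence-interval formula from table 2:
# x_bar +/- 1.96 * s / sqrt(n).
# The sample is simulated purely for illustration.
random.seed(3)
sample = [random.gauss(4.5, 2.87) for _ in range(50)]

n = len(sample)
x_bar = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

lower, upper = x_bar - 1.96 * se, x_bar + 1.96 * se
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```

Note that the interval's width depends on the standard error, not the standard deviation: quadrupling the sample size would halve the width of the interval.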
In order to avoid confusion, it is worth remembering that the standard deviation (of all individuals in the sample) is used to make inferences about the spread of the measurement among individuals in the population, while the standard error is used to make inferences about the spread of the sample means: the standard deviation is for describing (the spread of the data), while the standard error is for estimating (how precisely the mean has been pinpointed).

Summary

In this tutorial, we have outlined the basic properties of the Normal distribution, discussed the Central Limit Theorem and explained its importance to statistical theory. The Normal distribution is fundamental to many of the tests of statistical significance outlined in subsequent tutorials, while the principles of the Central Limit Theorem enable us to calculate confidence intervals and make inferences about the population from which the sample is taken.