Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Applied Biostatistics Mean and Standard Deviation Martin Bland Professor of Health Statistics University of York http://www-users.york.ac.uk/~mb55/ The mean The arithmetic mean or average, usually referred to simply as the mean is found by taking the sum of the observations and dividing by their number. The mean is often denoted by a little bar over the symbol for the variable, e.g. x . The sample mean has much nicer mathematical properties than the median and is thus more useful for the comparison methods described later. The median is a very useful descriptive statistic, but not much used for other purposes. Median, mean and skewness: Mean FEV1 = 4.06. Median FEV1 = 4.1, so the median is within 1% of the mean. Mean triglyceride = 0.51. Median triglyceride = 0.46. The median is 10% away from the mean. If the distribution is symmetrical the sample mean and median will be about the same, but in a skew distribution they will usually be different. If the distribution is skew to the right, as for serum triglyceride, the mean will usually be greater, if it is skew to the left the median will usually be greater. This is because the values in the tails affect the mean but not the median. 1 Frequency 80 60 40 20 0 0 .5 1 Triglyceride Median 1.5 2 Mean Increasing the largest observation will pull the mean higher. It will not affect the median. Variability The mean and median are measures of the central tendency or position of the middle of the distribution. We shall also need a measure of the spread, dispersion or variability of the distribution. Variability For use in the analysis of data, range and IQR are not satisfactory. Instead we use two other measures of variability: variance and standard deviation. These both measure how far observations are from the mean of the distribution. Variance is the average squared difference from the mean. Standard deviation is the square root of the variance. 2 Variance Variance is an average squared difference from the mean. Note that if we have only one observation, we cannot do this. The mean is the observation and the difference is zero. We need at least two observations. The sum of the squared differences from the mean is proportional to the number of observations minus one, called the degrees of freedom. Variance is estimated as the sum of the squared differences from the mean divided by the degrees of freedom. Variance FEV1: variance = 0.449 litres2 Gestational age: variance = 5.24 weeks2 Variance is based on the squares of the observations and so is in squared units. This makes it difficult to interpret. Standard deviation The variance is calculated from the squares of the observations. This means that it is not in the same units as the observations. We take the square root, which will then have the same units as the observations and the mean. The square root of the variance is called the standard deviation, usually denoted by s. FEV1: s = 0.449 = 0.67 litres. 3 Standard deviation FEV1: s = 0.449 = 0.67 litres. Frequency 20 15 10 5 0 2 3 x-2s x-s 4 x 5 6 x+2s x+s FEV1 (litre) Majority of observations within one SD of mean (usually about 2/3). Almost all within about two SD of mean (usually about 95%). Standard deviation Triglyceride: s = 0.04802 = 0.22 mmol/litre. Frequency 80 60 40 20 0 .5 1 1.5 2 0 x-2s x x+2s Triglyceride x-s x+s Majority of observations within one SD of mean (usually about 2/3). Almost all within about two SD of mean (usually about 95%), but those outside may be all at one end. Standard deviation Frequency Gestational age: s = 5.242 = 2.29 weeks. 500 400 300 200 100 0 x-s x+s x-2s x x+2s 20 25 30 35 40 45 Gestational age (weeks) Majority of observations within one SD of mean (usually about 2/3). Almost all within about two SD of mean (usually about 95%), but those outside may be all at one end. 4 Spotting skewness If the mean is less than two standard deviations, two standard deviations below the mean will be negative. For any variable which cannot be negative, this tells us that the distribution must be positively skew. If the mean or the median is near to one end of the range or interquartile range, this tells us that the distribution must be skew. If the mean or median is near the lower limit it will be positively skew, if near the upper limit it will be negatively skew. Spotting skewness Triglyceride: median = 0.46, mean = 0.51, SD = 0.22, range = 0.15 to 1.66, IQR = 0.35 to 0.60 mmol/l. These rules of thumb only work one way, e.g. mean may exceed two SD and distribution may still be skew. Gestational age: median = 39, mean = 38.95, SD = 2.29, range = 21 to 44, IQR = 38 to 40 weeks. The Normal distribution 0 5 Frequency 10 15 20 Many statistical methods are only valid if we can assume that our data follow a distribution of a particular type, the Normal distribution. This is a continuous, symmetrical, unimodal distribution described by a mathematical equation, which we shall omit. 2 3 4 FEV1 (litres) 5 6 5 The Normal distribution Many statistical methods are only valid if we can assume that our data follow a distribution of a particular type, the Normal distribution. This is a continuous, symmetrical, unimodal distribution described by a mathematical equation, which we shall omit. Frequency 300 200 100 0 150 160 170 Height (cm) 180 190 20 140 Frequency 10 15 Mean = 4.06 litres Variance = 0.45 litres2 0 5 SD = 0.67 litres 2 3 4 FEV1 (litres) 5 6 300 Frequency Mean = 162.4 cm 200 Variance = 39.5 cm2 SD = 2.3 cm 100 0 140 150 160 170 Height (cm) 180 190 The Normal distribution is not just one distribution, but a family of distributions. The particular member of the family that we have is defined by two numbers, called parameters. Parameter is a mathematical term meaning a number which defines a member of a class of things. The parameters of a Normal distribution happen to be equal to the mean and variance. These two numbers tell us which member of the Normal family we have. 6 The parameters (mean and variance) of a Normal distribution happen to be equal to the mean and variance. Relative frequency density These two numbers tell us which member of the Normal family we have. .4 Mean=0, variance=1 is called the Standard Normal distribution. .3 .2 .1 0 -5-4-3-2-1 0 1 2 3 4 5 6 7 8 9 10 Normal variable Mn=0, Var=1 Mn=3, Var=4 Mn=3, Var=1 The parameters (mean and variance) of a Normal distribution happen to be equal to the mean and variance. Relative frequency density These two numbers tell us which member of the Normal family we have. .4 The distributions are the same in terms of standard deviations from the mean. .3 .2 .1 0 -5-4-3-2-1 0 1 2 3 4 5 6 7 8 9 10 Normal variable Mn=0, SD=1 Mn=3, SD=2 Mn=3, SD=1 The Normal distribution is important for two reasons. 1. Many natural variables follow it quite closely, certainly sufficiently closely for us to use statistical methods which require this. 2. Even when we have a variable which does not follow a Normal distribution, if we the take the mean of a sample of observations, such means will follow a Normal distribution. 7 An illustration of the Central Limit Theorem Single Uniform variable Two Uniform variables Frequency 1000 800 600 400 200 0 Frequency 800 600 400 200 0 -.2 0 .2 .4 .6 .8 1 1.2 Uniform variable -.2 0 .2 .4 .6 .8 1 1.2 Mean of two Four Uniform variables Ten Uniform variables 2000 1500 1000 500 0 Frequency Frequency 1500 1000 500 0 -.2 0 .2 .4 .6 .8 1 1.2 Mean of four -.2 0 .2 .4 .6 .8 1 1.2 Mean of ten There is no simple formula linking the variable and the area under the curve. Hence we cannot find a formula to calculate the frequency between two chosen values of the variable, nor the value which would be exceeded for a given proportion of observations. Numerical methods for calculating these things with acceptable accuracy were used to produce extensive tables of the Normal distribution. These numerical methods for calculating Normal frequencies have been built into statistical computer programs and computers can estimate them whenever they are needed. Two numbers from tables of the Normal distribution: 1. we expect 68% of observations to lie within one standard deviation from the mean, 2. we expect 95% of observations to lie within 1.96 standard deviations from the mean. This is true for all Normal distributions, whatever the mean, variance, and standard deviation. 8 1. We expect 68% of observations to lie within one standard deviation from the mean, 2. we expect 95% of observations to lie within 1.96 standard deviations from the mean. Relative frequency density .4 .3 .2 .1 0 2.5% 2.5% -4 -3 -2 -1 0 1 2 3 Standard Normal variable 4 9