Download Lecture theatre notes: Mean and standard deviation

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Degrees of freedom (statistics) wikipedia , lookup

Mean field particle methods wikipedia , lookup

History of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Regression toward the mean wikipedia , lookup

Transcript
Applied Biostatistics
Mean and Standard Deviation
Martin Bland
Professor of Health Statistics
University of York
http://www-users.york.ac.uk/~mb55/
The mean
The arithmetic mean or average, usually referred to
simply as the mean is found by taking the sum of the
observations and dividing by their number.
The mean is often denoted by a little bar over the
symbol for the variable, e.g. x .
The sample mean has much nicer mathematical
properties than the median and is thus more useful for
the comparison methods described later.
The median is a very useful descriptive statistic, but not
much used for other purposes.
Median, mean and skewness:
Mean FEV1 = 4.06. Median FEV1 = 4.1, so the median
is within 1% of the mean.
Mean triglyceride = 0.51. Median triglyceride = 0.46.
The median is 10% away from the mean.
If the distribution is symmetrical the sample mean and
median will be about the same, but in a skew
distribution they will usually be different.
If the distribution is skew to the right, as for serum
triglyceride, the mean will usually be greater, if it is skew
to the left the median will usually be greater.
This is because the values in the tails affect the mean
but not the median.
1
Frequency
80
60
40
20
0
0
.5
1
Triglyceride
Median
1.5
2
Mean
Increasing the largest observation will pull the mean higher.
It will not affect the median.
Variability
The mean and median are measures of the central
tendency or position of the middle of the distribution.
We shall also need a measure of the spread, dispersion
or variability of the distribution.
Variability
For use in the analysis of data, range and IQR are not
satisfactory. Instead we use two other measures of
variability: variance and standard deviation.
These both measure how far observations are from the
mean of the distribution.
Variance is the average squared difference from the
mean.
Standard deviation is the square root of the variance.
2
Variance
Variance is an average squared difference from the
mean.
Note that if we have only one observation, we cannot do
this. The mean is the observation and the difference is
zero. We need at least two observations.
The sum of the squared differences from the mean is
proportional to the number of observations minus one,
called the degrees of freedom.
Variance is estimated as the sum of the squared
differences from the mean divided by the degrees of
freedom.
Variance
FEV1: variance = 0.449 litres2
Gestational age: variance = 5.24 weeks2
Variance is based on the squares of the observations
and so is in squared units.
This makes it difficult to interpret.
Standard deviation
The variance is calculated from the squares of the
observations. This means that it is not in the same
units as the observations.
We take the square root, which will then have the same
units as the observations and the mean.
The square root of the variance is called the standard
deviation, usually denoted by s.
FEV1: s =
0.449 = 0.67 litres.
3
Standard deviation
FEV1: s =
0.449 = 0.67 litres.
Frequency
20
15
10
5
0
2
3
x-2s
x-s
4
x
5
6
x+2s
x+s FEV1 (litre)
Majority of observations within one SD of mean (usually
about 2/3). Almost all within about two SD of mean
(usually about 95%).
Standard deviation
Triglyceride: s = 0.04802 = 0.22 mmol/litre.
Frequency
80
60
40
20
0
.5
1
1.5
2
0
x-2s
x
x+2s Triglyceride
x-s
x+s
Majority of observations within one SD of mean (usually
about 2/3). Almost all within about two SD of mean
(usually about 95%), but those outside may be all at one
end.
Standard deviation
Frequency
Gestational age: s = 5.242 = 2.29 weeks.
500
400
300
200
100
0
x-s x+s
x-2s x x+2s
20
25
30
35
40
45
Gestational age (weeks)
Majority of observations within one SD of mean (usually
about 2/3). Almost all within about two SD of mean (usually
about 95%), but those outside may be all at one end.
4
Spotting skewness
If the mean is less than two standard deviations, two
standard deviations below the mean will be negative.
For any variable which cannot be negative, this tells us that
the distribution must be positively skew.
If the mean or the median is near to one end of the range or
interquartile range, this tells us that the distribution must be
skew. If the mean or median is near the lower limit it will be
positively skew, if near the upper limit it will be negatively
skew.
Spotting skewness
Triglyceride: median = 0.46, mean = 0.51, SD = 0.22,
range = 0.15 to 1.66, IQR = 0.35 to 0.60
mmol/l.
These rules of thumb only work one way, e.g. mean may
exceed two SD and distribution may still be skew.
Gestational age: median = 39, mean = 38.95, SD = 2.29,
range = 21 to 44, IQR = 38 to 40 weeks.
The Normal distribution
0
5
Frequency
10
15
20
Many statistical methods are only valid if we can assume
that our data follow a distribution of a particular type, the
Normal distribution. This is a continuous, symmetrical,
unimodal distribution described by a mathematical
equation, which we shall omit.
2
3
4
FEV1 (litres)
5
6
5
The Normal distribution
Many statistical methods are only valid if we can assume
that our data follow a distribution of a particular type, the
Normal distribution. This is a continuous, symmetrical,
unimodal distribution described by a mathematical
equation, which we shall omit.
Frequency
300
200
100
0
150
160
170
Height (cm)
180
190
20
140
Frequency
10
15
Mean = 4.06 litres
Variance = 0.45 litres2
0
5
SD = 0.67 litres
2
3
4
FEV1 (litres)
5
6
300
Frequency
Mean = 162.4 cm
200
Variance = 39.5 cm2
SD = 2.3 cm
100
0
140
150
160
170
Height (cm)
180
190
The Normal distribution is not just one distribution, but a
family of distributions.
The particular member of the family that we have is
defined by two numbers, called parameters.
Parameter is a mathematical term meaning a number
which defines a member of a class of things.
The parameters of a Normal distribution happen to be
equal to the mean and variance.
These two numbers tell us which member of the Normal
family we have.
6
The parameters (mean and variance) of a Normal
distribution happen to be equal to the mean and variance.
Relative frequency
density
These two numbers tell us which member of the Normal
family we have.
.4
Mean=0, variance=1
is called the
Standard Normal
distribution.
.3
.2
.1
0
-5-4-3-2-1 0 1 2 3 4 5 6 7 8 9 10
Normal variable
Mn=0, Var=1
Mn=3, Var=4
Mn=3, Var=1
The parameters (mean and variance) of a Normal
distribution happen to be equal to the mean and variance.
Relative frequency
density
These two numbers tell us which member of the Normal
family we have.
.4
The distributions are
the same in terms of
standard deviations
from the mean.
.3
.2
.1
0
-5-4-3-2-1 0 1 2 3 4 5 6 7 8 9 10
Normal variable
Mn=0, SD=1
Mn=3, SD=2
Mn=3, SD=1
The Normal distribution is important for two reasons.
1. Many natural variables follow it quite closely, certainly
sufficiently closely for us to use statistical methods
which require this.
2. Even when we have a variable which does not follow
a Normal distribution, if we the take the mean of a
sample of observations, such means will follow a
Normal distribution.
7
An illustration of the Central Limit Theorem
Single Uniform variable
Two Uniform variables
Frequency
1000
800
600
400
200
0
Frequency
800
600
400
200
0
-.2 0 .2 .4 .6 .8 1 1.2
Uniform variable
-.2 0 .2 .4 .6 .8 1 1.2
Mean of two
Four Uniform variables
Ten Uniform variables
2000
1500
1000
500
0
Frequency
Frequency
1500
1000
500
0
-.2 0 .2 .4 .6 .8 1 1.2
Mean of four
-.2 0 .2 .4 .6 .8 1 1.2
Mean of ten
There is no simple formula linking the variable and the
area under the curve.
Hence we cannot find a formula to calculate the frequency
between two chosen values of the variable, nor the value
which would be exceeded for a given proportion of
observations.
Numerical methods for calculating these things with
acceptable accuracy were used to produce extensive
tables of the Normal distribution.
These numerical methods for calculating Normal
frequencies have been built into statistical computer
programs and computers can estimate them whenever
they are needed.
Two numbers from tables of the Normal distribution:
1. we expect 68% of observations to lie within one
standard deviation from the mean,
2. we expect 95% of observations to lie within 1.96
standard deviations from the mean.
This is true for all Normal distributions, whatever the mean,
variance, and standard deviation.
8
1. We expect 68% of observations to lie within one
standard deviation from the mean,
2. we expect 95% of observations to lie within 1.96
standard deviations from the mean.
Relative frequency
density
.4
.3
.2
.1
0
2.5%
2.5%
-4 -3 -2 -1 0 1 2 3
Standard Normal variable
4
9