MATH 2441
Probability and Statistics for Biological Sciences
Measures of Variability
Measures of central tendency are designed to distinguish between two similar looking distributions of values
which are "centered" at different locations along the horizontal axis. In the figure on the left just below, the
two distributions of values have what appear to be identical or nearly identical shapes, but the locations
around which their values seem most tightly clustered are different. The data sets from which these two
distribution graphs were created would give different means, medians, and modes.
[Figure: two pairs of frequency distributions plotted on horizontal axes running from 0 to 8. In the left panel the two curves have nearly identical shapes but different centers; in the right panel both curves are centered near 4 but have different spreads.]
On the other hand, the two data sets giving rise to the two distributions shown in the figure to the right above
will give the same values for any of those three measures of central tendency, yet they are clearly not
identical distributions of data. The peak (the center, or region of highest frequency) is around the value 4 for both,
but they differ in how spread out (or how variable, or how dispersed) the data values are.
These schematic examples suggest that a second useful and distinctive numerical summary of a data
distribution is some measure of variability, or dispersion. We will
consider a few of the possible ways of quantifying the variability of data.
The Range
The simplest way to measure the spread of values in a set of data is to calculate the range:
sample range = largest value - smallest value        (VAR-1)
This has the advantage of being quite simple to do (though not necessarily very easy -- try scanning a list of
50,000 numbers by eye to find the largest and smallest values present). However, it suffers from the major
disadvantage that its value depends on the two most atypical numbers in the whole set of data. Not only is
the value of the range computed from just two observations, with all the other observations there just to tell
you which two to use in your calculation, but it is calculated from the two most unusual and unpredictable
values in the whole set of observations.
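If the data are already stored in a computer, scanning for the extremes is no longer a chore. As a minimal sketch in Python (the list `data` is simply a placeholder; here it holds the BiotinDry values used later in this document):

data = [58.70, 78.00, 80.90, 88.40, 96.10, 97.40, 104.80, 78.20, 91.40]
sample_range = max(data) - min(data)   # largest value - smallest value, formula (VAR-1)
print(sample_range)                    # 104.80 - 58.70 = 46.10, up to floating-point round-off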
The range fails as a useful statistic in one other way. Remember that our goal in calculating numerical
summaries from sample data is to be able to use them to say something about the population from which the
sample was selected. The question is: what population property might the sample range tell us about? For
the sample range to be a good estimate of the population range, it would have to happen that the sample
fortuitously contains both the smallest and the largest values occurring in the population. If the size of the
sample is small compared to the size of the population, then this is extremely unlikely.
Under some circumstances, the sample range can be used to get a rough estimate of the most commonly
used measure of dispersion, the standard deviation. See the document "Rough Cuts".
The Mean Deviation
In sets of data with very little variability or spread, the individual data values will all tend to be very similar to
the arithmetic mean. This suggests that if we want to find a single number reflecting the spread or variability
of the data which takes into account all observations, perhaps it would be useful to consider something
based on the set of deviations from the mean: x1 - x̄, x2 - x̄, x3 - x̄, …, xn - x̄. If most of the data
values, x1, x2, x3, …, xn, are near in value to the mean, x̄, then most of these deviations will be small values.
On the other hand, if there is quite a bit of variability in the data, at least some of these deviations will be quite
large values.
One might guess that simply finding the arithmetic mean of these deviations would be a good start.
Unfortunately, this doesn't work. To see why, note the following.
$$ \text{average deviation} = \frac{(x_1 - \bar{x}) + (x_2 - \bar{x}) + (x_3 - \bar{x}) + \cdots + (x_n - \bar{x})}{n} $$

that is, add up all n deviations, and divide by n. Now, examine the numerator more closely. If the brackets
are removed, we can rearrange the terms to give the sum of the data values, x1 + x2 + x3 + … + xn, minus the
sum of x̄ with itself n times, which totals nx̄. Thus, the right-hand side above can be rewritten as

$$ \text{average deviation} = \frac{x_1 + x_2 + x_3 + \cdots + x_n - n\bar{x}}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} - \frac{n\bar{x}}{n} = \bar{x} - \bar{x} = 0 $$
Thus, the average deviation computed in this way will always give the answer zero, and so as a measure of
something to distinguish between narrow and broad distributions, it is totally useless. Note that we have
obtained this result without considering any specific data values -- it is a property of the way this average
deviation is defined that its value is automatically zero. In fact, you could say that the mean value, x̄, is the
value which makes the average deviation zero. What's really happening here is that the mean value has the
property that every unit of deviation to its right is balanced by an equivalent deviation to its left, so that when
you sum over all deviations, they exactly cancel out. Deviations of data values to the left of the mean will be
negative numbers, deviations of data values to the right of the mean will be positive values, and the positive
and negative values exactly cancel out.
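This cancellation is easy to verify numerically. A short Python sketch (the list `data` is a placeholder; the values are the BiotinDry data used later in this document):

data = [58.70, 78.00, 80.90, 88.40, 96.10, 97.40, 104.80, 78.20, 91.40]
mean = sum(data) / len(data)
deviations = [x - mean for x in data]
print(sum(deviations) / len(deviations))   # 0.0, apart from tiny floating-point round-off

Whatever numbers you put in the list, the printed average deviation is zero (up to round-off), exactly as the algebra above predicts.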
The mean deviation was a good idea in that it took into account the values of all of the observations, but it
failed to be useful because of exact cancellation between positive and negative deviations. Some people
have suggested overcoming this difficulty by defining the mean absolute deviation -- take the absolute
value of each deviation before averaging. This would prevent the cancellation of positive and negative values in
the averaging process, because now the sum involves only positive values.
$$ \text{mean absolute deviation} = \frac{\sum_{k=1}^{n} \left| x_k - \bar{x} \right|}{n} \qquad \text{(VAR-2)} $$
Although this is used sometimes, the absolute value operation has some nasty mathematical properties (the
graph of the absolute value function has a kink at the origin) which limit its theoretical usefulness compared
to the next alternative we describe.
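For completeness, a sketch of formula (VAR-2) in Python, again assuming the observations sit in a placeholder list called `data`:

data = [58.70, 78.00, 80.90, 88.40, 96.10, 97.40, 104.80, 78.20, 91.40]
mean = sum(data) / len(data)
mad = sum(abs(x - mean) for x in data) / len(data)   # mean absolute deviation, (VAR-2)
print(mad)                                           # roughly 10.7 for these values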
The Variance and Standard Deviation
There is another way to get a sum of strictly non-negative values based on deviations from the mean:
square the deviations before summing. This gives the so-called variance, which leads to the most
commonly used measure of variability in statistics. Since this quantity is going to work for us, we need to
take care now to distinguish between sample variance, s², and population variance, σ². The notation is
conventional, and will make sense shortly. Here, σ is the lower case Greek character "sigma" (the Greek "s"), written
like an "o" with a short horizontal tail going rightwards from the top.
The defining formulas for these two quantities are:
$$ s^2 = \frac{\sum_{k=1}^{n} (x_k - \bar{x})^2}{n-1} \qquad \text{and} \qquad \sigma^2 = \frac{\sum_{k=1}^{N} (x_k - \mu)^2}{N} \qquad \text{(VAR-3)} $$
Here, lower case n is the size of the sample, upper case N is the size of the population, and x̄ and μ denote the sample and population means respectively. These two
formulas are similar in some respects: the numerator is the sum of the squares of the deviations from the
respective means. In calculating the sample variance, we sum the squares of the deviations of the data in
the sample from the sample mean. In computing the population variance, we would sum the squares of the
deviations of the values in the population from the population mean. To get an average of the squared deviations,
we should divide by the number of terms in the sum, as is done in the formula for σ².
The explanation of why the denominator in the formula for s² is n - 1 instead of n is a bit subtle. One reason is that
we would like to be able to prove that as the sample size n increases until eventually the sample becomes
identical to the original population, the formula for s² should smoothly transition into the formula for σ². This
will only happen if we start with an n - 1 in the denominator of the formula for s². Another reason is that since
x̄ has been calculated from the xk's in the sample, we really have only n - 1 independent pieces of
information left. (If I give you the value of all but one observation plus the mean of all observations, you can
easily work out what that unknown observation must have been -- hence once x̄ is known, one of the n
observations is redundant.) The value, n - 1, of the denominator of s² is called the degrees of freedom in
this context.
Look closely at the two formulas, (VAR-3). If the data values tend to be close to the mean, then the
deviations from the mean will be a set of small numbers. Squaring them will give a set of small numbers,
and so the sum in the numerator will be a small number. Thus, for tightly clustered sets of data, s² is
expected to be a relatively small value. On the other hand, if the data values are widely spread out (or at
least if a few of them are), then some of the deviations will be large numbers and their squares will be even
larger numbers, leading to a relatively large value for the numerator and hence for s². Hence small values of
s² indicate tightly clustered sets of data, whereas large values of s² indicate more spread in the distribution
of values.
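A direct, if naive, translation of the defining formulas (VAR-3) into Python might look like the following sketch. The function names are illustrative only; the point to notice is the n - 1 denominator in the sample version versus N in the population version:

def sample_variance(data):
    n = len(data)
    xbar = sum(data) / n                                   # sample mean
    return sum((x - xbar) ** 2 for x in data) / (n - 1)    # s^2: divide by n - 1

def population_variance(data):
    N = len(data)
    mu = sum(data) / N                                     # population mean
    return sum((x - mu) ** 2 for x in data) / N            # sigma^2: divide by N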
The formulas (VAR-3) make it easy to see what features of the data are reflected by s² and σ², but they are
rather awkward formulas to use for actually calculating these quantities. We can rearrange the numerators
algebraically into the somewhat more congenial form:
$$ s^2 = \frac{\sum_{k=1}^{n} x_k^2 - \left( \sum_{k=1}^{n} x_k \right)^2 / n}{n-1} = \frac{\sum_{k=1}^{n} x_k^2 - n\bar{x}^2}{n-1} \qquad \text{(VAR-4a)} $$

and

$$ \sigma^2 = \frac{\sum_{k=1}^{N} x_k^2 - \left( \sum_{k=1}^{N} x_k \right)^2 / N}{N} = \frac{\sum_{k=1}^{N} x_k^2 - N\mu^2}{N} \qquad \text{(VAR-4b)} $$
The pattern of computations in the numerators of the first form in each of these formulas arises so
commonly in statistical calculations that it is often denoted by the symbol SS (we'll use the subscript labels
'sample' and 'pop' here to distinguish between the SS for the sample data and the SS for the population
values):
$$ SS_{\text{sample}} = \sum_{k=1}^{n} x_k^2 - \left( \sum_{k=1}^{n} x_k \right)^2 / n \qquad \text{and} \qquad SS_{\text{pop}} = \sum_{k=1}^{N} x_k^2 - \left( \sum_{k=1}^{N} x_k \right)^2 / N \qquad \text{(VAR-5)} $$
Then,
$$ s^2 = \frac{SS_{\text{sample}}}{n-1} \qquad \text{and} \qquad \sigma^2 = \frac{SS_{\text{pop}}}{N} \qquad \text{(VAR-6)} $$
(Although we've given formulas for σ² throughout here, you will probably never calculate σ² directly, but will
use the value of s² for data from a sample as an estimate of the required value of σ².)
Don't jump to the conclusion that the formulas (VAR-5) and (VAR-6) for s² are hopelessly complicated. To compute
SS_sample, you just need the sum of the squares of the data values (the first term) and the sum of the data
values themselves (the second term). Even the simplest hand-held calculators with statistical functions will
automate the process of computing these sums for you. Note the distinction between the two terms in the
formula for SS. To compute the value of the first term you must square the values first, and then sum those
squares. In the second term, you first sum the values, and then square the sum. These are not the same
operations.
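The same calculation organized around the shortcut forms (VAR-5) and (VAR-6), again as an illustrative Python sketch; note that `sum_of_squares` squares first and then sums, whereas `square_of_sum` sums first and then squares:

def sample_variance_from_ss(data):
    n = len(data)
    sum_of_squares = sum(x ** 2 for x in data)       # square each value, then add
    square_of_sum = sum(data) ** 2                   # add the values, then square the total
    ss_sample = sum_of_squares - square_of_sum / n   # SS_sample, formula (VAR-5)
    return ss_sample / (n - 1)                       # s^2, formula (VAR-6)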
Before doing an example calculation, we define two more terms. The standard deviation is simply the
square root of the corresponding variance. Thus
$$ \text{sample standard deviation} = s = \sqrt{s^2} \qquad \text{(VAR-7a)} $$

and

$$ \text{population standard deviation} = \sigma = \sqrt{\sigma^2} \qquad \text{(VAR-7b)} $$
So, to compute the standard deviation, first compute the variance, and then take the square root. Obviously
the standard deviation reflects the same features of the data as does the variance. Large variances will
have comparatively large square roots, whereas small variances will have comparatively small square roots.
Thus a comparatively large value for the standard deviation is an indication of considerable variability in the
data, whereas a comparatively smaller value for the standard deviation is an indication of less variability in
the data.
Example: Biotin
To illustrate the use of the formulas, consider the 'BiotinDry' set of data:
BiotinDry:   58.70   91.40   78.00   80.90   88.40   96.10   97.40   104.80   78.20

consisting of a sample of n = 9 values. To compute s², start by computing
$$ \sum_{k=1}^{9} x_k^2 = 58.70^2 + 78.00^2 + 80.90^2 + 88.40^2 + 96.10^2 + 97.40^2 + 104.80^2 + 78.20^2 + 91.40^2 = 68063.27 $$

Also

$$ \sum_{k=1}^{9} x_k = 58.70 + 78.00 + 80.90 + 88.40 + 96.10 + 97.40 + 104.80 + 78.20 + 91.40 = 773.90 $$
Now, put it all together:
$$ s^2 = \frac{\sum_{k=1}^{n} x_k^2 - \left( \sum_{k=1}^{n} x_k \right)^2 / n}{n-1} = \frac{68063.27 - 773.90^2 / 9}{9-1} = 189.559 $$
rounded to three decimal places. (Since the data values have units of micrograms/100 g peanuts, the units
of s² will be the square of this.) Then,

$$ s = \sqrt{s^2} = \sqrt{189.559} = 13.768 \ \mu\text{g} / 100 \text{ g peanuts} $$
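The arithmetic of this example is easy to reproduce; a Python sketch applying (VAR-5) and (VAR-6) directly to the BiotinDry values recovers the same two numbers:

import math

biotin_dry = [58.70, 78.00, 80.90, 88.40, 96.10, 97.40, 104.80, 78.20, 91.40]
n = len(biotin_dry)                          # n = 9
sum_sq = sum(x ** 2 for x in biotin_dry)     # 68063.27
total = sum(biotin_dry)                      # 773.90
s2 = (sum_sq - total ** 2 / n) / (n - 1)     # about 189.559 (micrograms/100 g)^2
s = math.sqrt(s2)                            # about 13.768 micrograms/100 g peanuts
print(round(s2, 3), round(s, 3))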
If you applied the same procedure to the BiotinOil data, you would get s² ≈ 9.667, and so s ≈ 3.109,
indicating that the values in this second set of data are much more tightly clustered about the mean than in
the BiotinDry data.
In practice, you should try to avoid having to calculate s² in as detailed a fashion as the above example
illustrates, especially for large sets of data. Using the statistical facilities on your hand-held calculator can
automate the process a bit. Be careful to distinguish between the function keys which calculate s² or s and
those which calculate σ² or σ. In Excel, the function STDEV() can be used to calculate s, and the functions
SUMSQ() and SUM() can be used to calculate $\sum_{k=1}^{n} x_k^2$ and $\sum_{k=1}^{n} x_k$, respectively.
As measures of dispersion, the variance and standard deviation have one major advantage: they arise
automatically in the theory underlying the sampling process, and so will play an important role in the
methods of statistical inference that will occupy most of the time in this course. s² is an efficient and
unbiased estimator of σ² (though the same is not quite true of the relationship between s and σ). Their one
defect is shared with the arithmetic mean: the values of s² and s are rather sensitive to the presence of
unusual or atypical observations.
The Interquartile Range
The interquartile range is a measure of dispersion based on the calculation of percentiles for the data.
Refer to the document on "Measures of Relative Standing" for further details. The main value of the
interquartile range is that it provides a measure of dispersion which is insensitive to the presence of a small
number of unusual or atypical observations. In some respects, you could consider the interquartile range to
play a similar role with respect to the median as the standard deviation does with respect to the mean.
The Coefficient of Variation
The so-called coefficient of variation is defined as the ratio

$$ CV = \frac{s}{\bar{x}} \qquad \text{(VAR-8)} $$
expressed either as a fraction or in percentage form. You can view it as a measure of relative dispersion or
relative variability. Recall that both s and x̄ have the same units of measurement as the original data. If the
units of measurement are changed, the values of these quantities will change as well. Their ratio, CV, is
dimensionless, indicating how great the variability of the data is in relation to the value of the mean. CVs
only make sense for data measured on a ratio scale. In those instances, however, the CV reflects how
large the variation between values is relative to their typical value.
For instance, if we collected a sample of 50 apples, and found that their mean weight was 300 g, a standard
deviation of 1 g would indicate a very uniform sample of apples, weightwise. (See the document "Rough
Cuts" -- the standard deviation of 1 g would indicate that almost all of the apples would have weights
between 297 g and 303 g.) On the other hand, if we collected a sample of 50 blueberries and found that
their mean weight was 1 g, a standard deviation of 1 g would indicate quite a non-uniform set of
blueberries, with some probably four to six times as large as others. In the first case, the very uniform
set of apples would have a CV of 1/300 ≈ 0.003, and in the second case, the very diverse set of blueberries
would have a CV of 1/1 = 1, which is 300 times as big as the CV for the apples.
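A sketch of formula (VAR-8) in Python, using both the BiotinDry sample from earlier and the illustrative apple and blueberry figures above:

import statistics

biotin_dry = [58.70, 78.00, 80.90, 88.40, 96.10, 97.40, 104.80, 78.20, 91.40]
cv_biotin = statistics.stdev(biotin_dry) / statistics.mean(biotin_dry)
print(cv_biotin)                      # about 0.16, i.e. roughly 16% relative variability

cv_apples = 1 / 300                   # s = 1 g, mean weight = 300 g
cv_blueberries = 1 / 1                # s = 1 g, mean weight = 1 g
print(cv_blueberries / cv_apples)     # 300: the blueberries are far more variable, relatively speaking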