MEASURES OF LOCATION AND SPREAD
Frequency distributions and the other methods of data summarization and presentation
explained in the previous lectures provide a fairly detailed description of the data and
how it is distributed in the sample. For categorical variables this is usually
enough. For quantitative variables, however, we have more methods to summarize and
present the data. Since the values of quantitative variables are numbers (whether discrete or
continuous), we can order them and summarize them in terms of how they are clustered
and spread out in the sample. Quantitative variables can therefore be summarized in terms of
the location of their values (measures of location, or measures of central tendency) and
how they are spread in the sample (measures of spread, or variation).
MEASURES OF LOCATION (Measures of Central Tendency)
Measures of location tell us how different values of the variable are located when the data
is ordered. There are three measures of location which are the median, the mode and the
mean. Each of these measures has its own advantages and disadvantages which depend
on the type of data being summarized.
Median
When we order the values in ascending or descending order, the median is the value that
divides the distribution into two equal parts, so that there is the same number of
observations above and below the median.
For example: the ages of 15 women in a survey were as follows:
17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18
To calculate the median, we rearrange the values in ascending order (Table 1).
Observation number 8 (27 years) is the middle observation, i.e. there are 7 observations
on either side of 27, so the median age is 27 years.
When there is an even number of data values, there is no single middle value. In this
case the median is calculated as the average of the central pair of values, i.e. we add up
the two central values and divide the result by 2. For example, in Table 2 there are 16
observations, so there is no single middle value. The median of this data is calculated
from the two values in the middle of the data, i.e. observations 8 and 9 (27 and 28 years).
Table 1
ID    Age
1     17
2     18
3     19
4     22
5     22
6     23
7     25
8     27
9     28
10    30
11    33
12    36
13    39
14    42
15    44
Table 2
ID    Age
1     17
2     18
3     19
4     22
5     22
6     23
7     25
8     27
9     28
10    30
11    33
12    36
13    39
14    42
15    44
16    46
Median age = (27 + 28)/2 = 55/2 = 27.5 years
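The two cases (odd and even number of observations) can be checked with a short Python sketch. The function name `median_by_hand` is our own, introduced only for illustration; the standard library function `statistics.median` is used alongside it as a cross-check.

```python
from statistics import median

# Ages from Table 1 (15 women) and Table 2 (the same ages plus one more, 46).
ages_15 = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]
ages_16 = ages_15 + [46]


def median_by_hand(values):
    """Order the values, then take the middle one (odd n) or the
    average of the two middle ones (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


median_odd = median_by_hand(ages_15)   # 8th of the 15 ordered values
median_even = median_by_hand(ages_16)  # average of the 8th and 9th of 16

print(median_odd, median(ages_15))    # 27 27
print(median_even, median(ages_16))   # 27.5 27.5
```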
Median for Frequency Distributions
The median for a frequency distribution is simply the value at which the cumulative
relative frequency is 50%.
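As a rough sketch of this idea, we can build a frequency table from the 15 ages used above and walk through it until the cumulative frequency reaches half of the sample. The function name `median_from_frequencies` is hypothetical. Note one simplification: when n is even and the cumulative frequency hits exactly 50% at some value, this sketch returns that value rather than averaging it with the next one.

```python
from collections import Counter

ages = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]


def median_from_frequencies(values):
    """Return the first value at which the cumulative relative
    frequency reaches 50% (simplified, see the note above)."""
    if not values:
        raise ValueError("need at least one observation")
    counts = Counter(values)           # value -> frequency
    n = len(values)
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        if 2 * cumulative >= n:        # same as cumulative / n >= 0.5
            return value


freq_median = median_from_frequencies(ages)
print(freq_median)  # 27
```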
Biostatistics for medical students: written by Dr. Nasih Othman, Sulaimani Polytechnic University 2012

Mode
The mode of a distribution is simply the value that occurs most frequently. A distribution
may have more than one mode. In the example above, 22 occurs twice and every other value
occurs only once, so 22 is the mode.
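A minimal Python sketch of finding the mode uses `statistics.multimode`, which returns every value that shares the highest frequency. The second call adds one extra observation (27) that is not in the survey data; it is a made-up value, used only to show a distribution with two modes.

```python
from statistics import multimode

ages = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]

# 22 is the only value that occurs twice, so it is the single mode.
modes = multimode(ages)

# Hypothetical extra observation: a second woman aged 27.
# Now 22 and 27 both occur twice, so there are two modes.
two_modes = multimode(ages + [27])

print(modes)              # [22]
print(sorted(two_modes))  # [22, 27]
```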
Mean
The mean is the average of all values. It is calculated as the sum of all values
divided by the number of observations. If each of the n observations (n is
the sample size) has a value Xi (where i = 1 to n), then the mean, X¯,
will be:
X¯ = ∑Xi/n
Example:
Age of 15 women in a survey was as follows:
17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18
Mean age of the women = sum of all ages/n
= (17 + 25 + 36 + 23 + 44 + 39 + 19 + 22 + 30 + 33 + 42 + 28 + 27 + 22 + 18)/15
= 425/15 = 28.3 years
The mean age of the sample is 28.3 years.
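The same calculation can be sketched in a few lines of Python, following the definition directly (sum of all values divided by the number of observations). The variable names are our own.

```python
ages = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]

total = sum(ages)      # sum of all ages
n = len(ages)          # sample size
mean_age = total / n   # mean = sum / n

print(total, n, round(mean_age, 1))  # 425 15 28.3
```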
Mean for Frequency Distributions
If we have grouped data from a frequency table and we don't have individual values, we
can still calculate the mean from the grouped data by calculating the total for each
interval (frequency × midpoint), then adding up the totals for all intervals and dividing
the grand total by the sample size. If f is the frequency of each interval and n is the
sample size, the mean will be calculated in the following way:
X¯ = ∑(f × midpoint)/n
Table 3. Calculation of mean Hb of 50 women from a frequency distribution table
Hb (g/dL)      Frequency    Mid-point    Sum of interval
8-8.9          4            8.5          34
9-9.9          7            9.5          66.5
10-10.9        18           10.5         189
11-11.9        13           11.5         149.5
12-12.9        3            12.5         37.5
13-13.9        4            13.5         54
14 and over    1            14.5         14.5
Total          50                        545
Table 3 displays grouped data for the Hb of 50 women. To calculate the sum of each
interval we first calculate the midpoint of the interval (column 3), then multiply it by
the frequency (column 2) to obtain the sum of the values for each interval (column 4).
Mean Hb = [(4*8.5) + (7*9.5) + (18*10.5) + (13*11.5) + (3*12.5) + (4*13.5) + (1*14.5)]/50
Mean Hb = 545/50 = 10.9 g/dL
Therefore the mean Hb of the 50 women is 10.9 g/dL.
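The grouped calculation can be sketched in Python as follows. The list `hb_table` simply restates the frequency and midpoint columns of the Hb table; the variable names are our own.

```python
# Each pair is (frequency, midpoint) for one Hb interval,
# taken from the frequency distribution table above.
hb_table = [
    (4, 8.5),
    (7, 9.5),
    (18, 10.5),
    (13, 11.5),
    (3, 12.5),
    (4, 13.5),
    (1, 14.5),
]

n = sum(f for f, _ in hb_table)                   # sample size
interval_sums = [f * mid for f, mid in hb_table]  # frequency x midpoint
grand_total = sum(interval_sums)                  # total over all intervals
grouped_mean = grand_total / n

print(n, grand_total, round(grouped_mean, 1))  # 50 545.0 10.9
```

Because the midpoint stands in for every value in its interval, the result is an approximation of the mean we would get from the individual values.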

Properties of the Mean, Median & Mode
1. The mean, mode and median will be similar if the data is normally distributed
(symmetrically distributed around the mean). If the distribution is skewed,
the three measures will differ.
2. The mean is sensitive to outliers; the others are much less so. An outlier is an
extreme value, a value which is far from the rest of the values. If there are outliers
in the data, the mean will be pulled towards them, whereas the mode and the median are
largely unaffected.
3. The mode may be affected by small changes in the data, but the mean and
median change only slightly when the data changes slightly.
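The second property can be illustrated with a small Python sketch. Here we replace the oldest age in the survey (44) with a made-up outlier of 95 years; this value is purely hypothetical and is not part of the original data.

```python
from statistics import mean, median, multimode

ages = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]

# Hypothetical data set: the largest value (44) replaced by an outlier (95).
ages_with_outlier = sorted(ages)[:-1] + [95]

mean_before, mean_after = mean(ages), mean(ages_with_outlier)
median_before, median_after = median(ages), median(ages_with_outlier)
mode_before, mode_after = multimode(ages), multimode(ages_with_outlier)

print(round(mean_before, 1), round(mean_after, 1))  # 28.3 31.7
print(median_before, median_after)                  # 27 27
print(mode_before, mode_after)                      # [22] [22]
```

The single extreme value shifts the mean by more than three years, while the median and the mode stay where they were.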
Which measure should we use?
Generally, if the data distribution is not symmetrical (it is skewed or there are
outliers), the median is a better measure of location than the mean. When we want to
perform statistical analysis for inference, the mean is more flexible and useful. But if
the data is not symmetrically distributed (not normally distributed), the median is
usually the better choice even for statistical inference.

MEASURES OF SPREAD
If we look at a set of quantitative data displayed as a frequency distribution or a graph,
we can say whether the observations are widely spread out from the mean or clustered
around the mean. But this is not enough; it is usually necessary to describe this variability
of the observations as a numerical value. Such a value is called a measure of spread. A
measure of spread together with the mean provides a more informative summary
of a data set.
There are 3 main ways to summarize the variability of a set of data (three measures of
spread):
1. Range: gives the lowest and the highest values
2. Percentiles: report the values below which certain percentages of the
data fall
3. The standard deviation: calculates a single numerical measure of the
spread around the mean
Each measure has its own advantages, but the standard deviation is the most useful in
statistical calculations.
Range
The simplest way to describe the spread of a set of observations is to report the range
from the minimum value to the maximum. A range therefore tells us the lowest value and
the highest value, and hence the difference between them. The problem with this is that it
reports the most extreme values, which may not represent the majority of the data. The
actual distribution of all the values between these two extremes is not summarized in
any way.
Example:
Age of 15 women in a survey was as follows:
17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18
To calculate the range we first order the values from minimum to maximum, then we
identify the smallest and the biggest values and report them.
17, 18, 19, 22, 22, 23, 25, 27, 28, 30, 33, 36, 39, 42, 44
The range is 17-44 years (17 to 44 years). This means that the ages of the women are
spread out from 17 to 44 years, including 44. Sometimes when we report the range we also
report the interval (the difference between the maximum and the minimum). For example, the
difference between 44 and 17 (44 - 17) is 27 years. Then we say the range was 27 years (17-44).
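In Python the range can be obtained without sorting first, since `min` and `max` find the two extremes directly. This is only a sketch; the variable names are our own.

```python
ages = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]

lowest = min(ages)          # smallest value
highest = max(ages)         # biggest value
interval = highest - lowest # difference between maximum and minimum

print(f"Range: {interval} years ({lowest}-{highest})")  # Range: 27 years (17-44)
```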
Percentiles
A percentile (or centile) is the value below which a given percentage of the data
falls. For example, in the graph below of the height of a group of people, the 5th
percentile is 145 cm, meaning that 5% of the group had a height below 145 cm. The 95th
percentile is 165 cm, which means that 95% of the group had a height below 165 cm. By
specifying these two percentiles we give a range in which 90% of the data lies.

[Figure: distribution of height in cm; horizontal axis marked from 140 to 170 cm in steps of 5 cm]
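The individual heights behind the graph are not given, so as an illustration we reuse the 15 ages from the earlier examples. The sketch below uses the "nearest rank" convention, which is only one of several ways of calculating percentiles (statistical software often interpolates between values instead, and may give slightly different answers). The function name `percentile_nearest_rank` is our own.

```python
import math

ages = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]


def percentile_nearest_rank(values, p):
    """Smallest observed value such that at least p percent of the
    observations are at or below it (nearest-rank convention)."""
    if not values or not 0 < p <= 100:
        raise ValueError("need data and a percentage in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # position in the ordered data
    return ordered[rank - 1]


p5 = percentile_nearest_rank(ages, 5)
p50 = percentile_nearest_rank(ages, 50)   # the 50th percentile is the median
p95 = percentile_nearest_rank(ages, 95)

print(p5, p50, p95)  # 17 27 44
```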

Standard Deviation
The most common way of quantifying the variability of a distribution is to calculate its
standard deviation. This method uses all the observations, by accounting for all
deviations from the mean. By deviations we mean the differences between each
observation and the mean. The standard deviation is a sort of average of all the
deviations.
Mathematically, if each observation has a value Xi (where i = 1 to n), then its
deviation from the mean value, X¯, will be (Xi - X¯).
With n observations we will have n such deviations.
We could try to calculate the average of these deviations by summing them and
dividing by n.
Average Deviation = [∑ (Xi - X¯)]/n
However, simply calculating the average deviation is not sufficient. In fact this equation
will always give an average deviation of zero, because the positive deviations from the mean
always exactly balance the negative deviations. What we are interested in is the
magnitude of the deviations. If we square the deviations before summing them, every term
is non-negative, so the sum is positive unless all the observations are equal. Dividing
this sum by n - 1 then gives a measure of the average squared deviation from the mean,
known as the variance.
Variance, S² = [∑ (Xi - X¯)²]/(n - 1)
Note. In this equation we use n - 1, not n, as the denominator, because we are estimating
the population variance from a sample.
The problem with the variance is that it is in squared units, and so it is not in the same
unit as the original data. For example, the variance of heights measured in cm will be in
square cm, which is a unit of area, not height.
If we take the square root of the variance we get a measure of variability in the same units
as the raw data. This quantity is called the standard deviation and tells us, roughly, the
typical distance of the observations in a dataset from the mean.
Standard Deviation, S = √{[∑ (Xi - X¯)²]/(n - 1)}
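The steps above can be followed one at a time in a short Python sketch. We reuse the 15 ages from the earlier examples as illustrative data; the variable names are our own. On a computer the sum of the deviations may come out as a tiny number rather than exactly zero, because of rounding in floating-point arithmetic.

```python
import math

ages = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]

n = len(ages)
mean_age = sum(ages) / n

# Step 1: deviations (Xi - mean); they cancel out, so their sum is ~0.
deviations = [x - mean_age for x in ages]
sum_of_deviations = sum(deviations)

# Step 2: square the deviations so that none of them is negative.
squared_deviations = [d ** 2 for d in deviations]

# Step 3: divide by n - 1 (sample variance); dividing by n gives a smaller value.
variance_n = sum(squared_deviations) / n
variance_n_minus_1 = sum(squared_deviations) / (n - 1)

# Step 4: the square root brings us back to the original unit (years).
sd = math.sqrt(variance_n_minus_1)

print(abs(sum_of_deviations) < 1e-9)  # True
print(variance_n < variance_n_minus_1)  # True
```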
Example: calculate the variance and standard deviation for the following set of data on
the weight of 10 people in kg.
61, 75, 65, 58, 78, 82, 70, 72, 91, 77
To calculate the variance, first calculate the mean weight
X¯ = ∑Xi/n = (61 + 75 + 65 + 58 + 78 + 82 + 70 + 72 + 91 + 77)/10 = 729/10 = 72.9 kg
Then calculate the variance by the formula
Variance, S² = [∑ (Xi - X¯)²]/(n - 1)

Variance = [(58-72.9)² + (61-72.9)² + (65-72.9)² + (70-72.9)² + (72-72.9)² + (75-72.9)² + (77-72.9)² + (78-72.9)² + (82-72.9)² + (91-72.9)²]/9 = 892.9/9 = 99.2
Then calculate the standard deviation by taking the square root of the variance
S = √variance = √99.2 = 9.96
What does this mean? The standard deviation for the data was about 10 kg, meaning that,
roughly speaking, a typical observation was about 10 kg away from the mean (either more
or less than the mean).
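The worked example can be checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n - 1 denominator. This is only a verification sketch; the variable names are our own.

```python
from statistics import mean, stdev, variance

weights = [61, 75, 65, 58, 78, 82, 70, 72, 91, 77]  # weights in kg

mean_weight = mean(weights)          # 729 / 10
sample_variance = variance(weights)  # sum of squared deviations / (n - 1)
sample_sd = stdev(weights)           # square root of the sample variance

print(round(mean_weight, 1))      # 72.9
print(round(sample_variance, 1))  # 99.2
print(round(sample_sd, 2))        # 9.96
```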
How is normal data spread out in relation to the standard deviation?
For data that is normally distributed:
• About 68% of the data lies within 1 standard deviation of the mean
• About 95% of the data lies within 2 standard deviations of the mean
• About 99.7% of the data lies within 3 standard deviations of the mean
These proportions apply to all normal distributions, regardless of the total number of data
values or the width of the distribution. The standard deviation helps to summarize the
distribution of data and plays an important role in statistical data
analysis.
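These proportions can be reproduced with `statistics.NormalDist` from the Python standard library (available from Python 3.8). The sketch uses the standard normal distribution (mean 0, standard deviation 1); the helper name `share_within` is our own. The exact share within 3 standard deviations is about 99.7%.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution lying within k standard
    deviations of the mean (area between -k and +k)."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_1 = share_within(1)
within_2 = share_within(2)
within_3 = share_within(3)

print(f"{within_1:.1%} {within_2:.1%} {within_3:.1%}")  # 68.3% 95.4% 99.7%
```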