Measures of Dispersion
(Range, standard deviation, standard error)
Introduction
We have already learnt that a frequency distribution table gives a rough idea of the distribution
of the variables in a sample or population, while the mean, median and mode describe the central
tendency of the distribution. But none of these measures describes how the data are spread
around the central value. We stated in our earlier discussion that in a normal distribution all
the measures of central tendency (mean, median and mode) have the same value. The following
diagram shows what an ideal normal distribution looks like. The mean, median and mode occupy the
same position, and the curve declines gradually on either side of the mean. But the spread of data
is not always as gradual or smooth as in this diagram.
The above diagram gives an idea of how differently data can be spread even when the mean,
median and mode are the same. The three coloured lines (blue, green and red) represent the
distributions of three different data sets. In each distribution the mean, median and mode occupy
the central position, so each can be said to be normally distributed. But the degree of peakedness
(or kurtosis) of these distributions is not the same: the one with a flat top is called platykurtic
(blue line), the one with a medium top is called mesokurtic (green line) and the one with a narrow
top is called leptokurtic (red line). In the platykurtic curve the data are spread most widely on
both sides of the central value, suggesting more variation in the data set. The mesokurtic and
leptokurtic curves follow, with comparatively less spread. This indicates that the greater the
variation in the data, the greater the degree of dispersion.
Dispersion is the spread of the values of a variable on either side of the central value. There are
different measures of dispersion: (i) range, (ii) quartile deviation, (iii) standard deviation,
(iv) variance and (v) standard error.
Range
It is the difference between the highest and lowest values of a data set arranged in an array.
The larger the range, the greater the dispersion of the values. For example, in Table 1 the lowest
and highest values are 45 and 85 respectively, so the range is (85 – 45) = 40. Had the lowest and
highest values of the distribution been 45 and 62 respectively, the range would have been smaller
(62 – 45 = 17). But the range does not always give a satisfactory picture of dispersion, because it
is affected by extreme values. For example, if there were a single value of 10 in the distribution,
the range would be 85 – 10 = 75. Even though no other value lies between 10 and 45, that one
observation alone nearly doubles the range.
Table 1
45   45   47   51   51
52   52   52   52   52
53   53   53   54   54
55   55   55   57   57
58   58   58   58   59
59   59   59   61   62
62   63   64   64   64
65   65   66   67   67
68   69   70   71   72
73   73   75   75   75
75   76   76   77   78
79   79   80   80   80
82   82   83   84   85
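The range calculation described above can be sketched in a few lines of Python. The list below is transcribed from Table 1; the function name `data_range` is only an illustrative choice, not something defined in this module.

```python
# Values transcribed from Table 1 (65 observations, in ascending order).
table_1 = [
    45, 45, 47, 51, 51, 52, 52, 52, 52, 52, 53, 53, 53, 54, 54,
    55, 55, 55, 57, 57, 58, 58, 58, 58, 59, 59, 59, 59, 61, 62,
    62, 63, 64, 64, 64, 65, 65, 66, 67, 67, 68, 69, 70, 71, 72,
    73, 73, 75, 75, 75, 75, 76, 76, 77, 78, 79, 79, 80, 80, 80,
    82, 82, 83, 84, 85,
]


def data_range(values):
    """Return the range: highest value minus lowest value."""
    return max(values) - min(values)


print(data_range(table_1))          # 40
print(data_range(table_1 + [10]))   # 75: one extreme value inflates the range
```

The second call shows the weakness discussed in the text: adding a single extreme observation changes the range from 40 to 75.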
Semi-inter-quartile range or Quartile Deviation
The semi-inter-quartile range, or quartile deviation, is a better measure of dispersion than the
range because it leaves out the extreme values on both sides. For quartile deviation, the raw data
are arranged in ascending order of magnitude and then divided into four equal parts. Each part is
called a quartile, so altogether four quartiles (Q1, Q2, Q3, Q4) are formed. For example, if you
have 100 observations on height arranged in ascending order of magnitude, then the height of the
25th individual is the first quartile value, that of the 50th individual is the second quartile
(middle position) value, that of the 75th individual is the third quartile (three-fourths position)
value, and the fourth quartile is the last observation. The second quartile (Q2) is also the
median. Now, the total number of observations will not always be a hundred or a multiple of a
hundred. In such cases we calculate the quartile values using formulae. The formulae for the first
and third quartiles (Q1 and Q3) are almost like the formula for the median. In calculating the
median, we first locate the midpoint of the distribution by dividing the total number of
observations (N) by 2, i.e. N/2. But for quartiles, as the data are to be divided into four equal
parts, the positions of Q1, Q2 and Q3 are N/4, N/2 and ¾N respectively.
So,
$$Q_1 = L_i + \frac{N/4 - C}{f_i}\, h, \qquad Q_3 = L_i + \frac{\tfrac{3}{4}N - C}{f_i}\, h$$
where $L_i$ = lower limit (boundary) of the class interval containing the respective quartile,
$f_i$ = frequency of that class, $h$ = width of that class, $C$ = cumulative frequency of the
class preceding it and $N$ = total number of observations.
Table 2
Weight in kg   Class mark (x)   Frequency (f)   Cumulative frequency
44.5-50.5      47.5             3               3
50.5-56.5      53.5             15              18
56.5-62.5      59.5             13              31
62.5-68.5      65.5             10              41
68.5-74.5      71.5             6               47
74.5-80.5      77.5             13              60
80.5-86.5      83.5             5               65
In Table 2, the (N/4)th observation gives the value of Q1. Here, N/4 = 65 ÷ 4 = 16.25, so the
value of the 16.25th observation is Q1. The cumulative frequency column shows that the 16.25th
observation lies in the class interval 50.5-56.5, since the cumulative frequency reaches 18 in
that class. So, here L = 50.5, f = 15, h = 6 and C = 3.
$$Q_1 = 50.5 + \frac{16.25 - 3}{15} \times 6 = 55.80$$
The value of Q3 (the third quartile) is found in a similar way, but here the position of Q3 is
determined by multiplying N by ¾, i.e. ¾N = 48.75. This means that the value of the 48.75th
observation is Q3. Again, the cumulative frequency column shows that the 48.75th observation lies
in the class interval 74.5-80.5, since the cumulative frequency reaches 60 in that class. So, here
L = 74.5, f = 13, h = 6 and C = 47.
$$Q_3 = 74.5 + \frac{48.75 - 47}{13} \times 6 \approx 75.31$$
The semi-inter-quartile range is half of the difference between the third and first quartiles:
$$Q = \tfrac{1}{2}(Q_3 - Q_1)$$
Here, (Q3 – Q1) is the inter-quartile range, and multiplying it by ½ gives the
semi-inter-quartile range. Therefore,
$$Q = \tfrac{1}{2}(75.31 - 55.80) \approx 9.75$$
The main disadvantage in this measure of dispersion is the use of only two values (Q1 and Q3)
from the range of data.
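As a rough check of the quartile arithmetic, the grouped-data formula can be sketched in Python. The class boundaries and frequencies come from Table 2; the helper name `grouped_quartile` and its parameters are hypothetical, chosen only for this illustration. Small differences in the last decimal place come from rounding.

```python
# Lower class boundaries and frequencies transcribed from Table 2.
lower_bounds = [44.5, 50.5, 56.5, 62.5, 68.5, 74.5, 80.5]
freqs = [3, 15, 13, 10, 6, 13, 5]
width = 6.0  # h: every class in Table 2 is 6 kg wide


def grouped_quartile(k, lower_bounds, freqs, width):
    """k-th quartile (k = 1, 2, 3) of grouped data: L + ((k*N/4 - C) / f) * h."""
    n = sum(freqs)
    position = k * n / 4
    cumulative = 0  # C: cumulative frequency of the classes already passed
    for lower, f in zip(lower_bounds, freqs):
        if cumulative + f >= position:
            # The quartile falls in this class; interpolate inside it.
            return lower + (position - cumulative) / f * width
        cumulative += f
    raise ValueError("position lies beyond the last class")


q1 = grouped_quartile(1, lower_bounds, freqs, width)
q3 = grouped_quartile(3, lower_bounds, freqs, width)
semi_iqr = (q3 - q1) / 2

print(round(q1, 2), round(q3, 2), round(semi_iqr, 2))  # 55.8 75.31 9.75
```

Calling the same helper with k = 2 gives the median, since Q2 sits at position N/2.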
Mean Deviation: Here, the deviation (in absolute value) of each value from the mean is calculated,
and the arithmetic mean of these deviations is taken. The formula for calculating the mean
deviation from grouped data is as follows.
$$\text{Mean deviation} = \frac{\sum_i f_i\,|a_i - A|}{N}$$
where $f_i$ is the frequency of the ith class, $a_i$ is the class mark of the ith class interval,
$A$ is the arithmetic mean and $N$ is the total number of observations. The formula can be
expanded like this:
$$\text{Mean deviation} = \frac{f_1|a_1 - A| + f_2|a_2 - A| + f_3|a_3 - A| + \dots + f_n|a_n - A|}{N}$$
Table 3
Weight in kg   Class mark (x)   Frequency (f)
44.5-50.5      47.5             3
50.5-56.5      53.5             15
56.5-62.5      59.5             13
62.5-68.5      65.5             10
68.5-74.5      71.5             6
74.5-80.5      77.5             13
80.5-86.5      83.5             5
Here, f1, f2, …, fn are the frequencies of the 1st, 2nd, …, nth class intervals; a1, a2, …, an are
the class marks of those class intervals; A is the arithmetic mean of the observations; and N is
the total number of observations.
In this example, the mean weight is 65.0 kg. So the mean deviation is
$$\frac{3|47.5-65.0| + 15|53.5-65.0| + 13|59.5-65.0| + 10|65.5-65.0| + 6|71.5-65.0| + 13|77.5-65.0| + 5|83.5-65.0|}{65}$$
$$= \frac{52.5 + 172.5 + 71.5 + 5 + 39 + 162.5 + 92.5}{65} = \frac{595.5}{65} \approx 9.16$$
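The mean deviation worked out above can be reproduced with a short Python sketch. The class marks and frequencies are those of Table 3, and the mean of 65.0 kg is taken from the text; the function name `grouped_mean_deviation` is an assumption made for this illustration.

```python
# Class marks and frequencies transcribed from Table 3.
class_marks = [47.5, 53.5, 59.5, 65.5, 71.5, 77.5, 83.5]
freqs = [3, 15, 13, 10, 6, 13, 5]


def grouped_mean_deviation(class_marks, freqs, mean):
    """Mean deviation of grouped data: sum of f * |a - A|, divided by N."""
    n = sum(freqs)
    total = sum(f * abs(a - mean) for a, f in zip(class_marks, freqs))
    return total / n


# The text uses a mean weight of 65.0 kg.
md = grouped_mean_deviation(class_marks, freqs, mean=65.0)
print(round(md, 2))  # 9.16
```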
Standard Deviation or SD: Standard deviation is a very common measure of dispersion. This
measure of dispersion from the mean has an advantage over the preceding measures because it
considers all the values of the variable in estimating the dispersion, and the unit of standard
deviation is the same as that of the mean. It is defined as the square root of the mean squared
deviation. The formula is worked out in this way:
(i) first find the difference of each value from the mean,
(ii) then square each of the differences,
(iii) add the squared differences,
(iv) divide the sum of the squared differences by the total number of observations to get the mean
squared deviation, and
(v) finally take the square root of this expression to get the standard deviation.
Taking the square root returns the measure to the original unit of measurement. The formula for
standard deviation is written as
$$s = \sqrt{\frac{1}{n}\sum (x - \bar{x})^2}$$
This formula is applied to ungrouped data. If the sample size is small, the formula is slightly
modified:
$$s = \sqrt{\frac{1}{n-1}\sum (x - \bar{x})^2}$$
This (n – 1) is called the degrees of freedom (df). The degrees of freedom are the number of
values in a set of data that are unrestricted, independent and free to vary. Let me give an
example. Suppose the sum of x, y and z is 12. If x = 4 and y = 5, then z must be 3 so that
x + y + z = 12. Thus, when there are three numbers with a fixed sum, the degrees of freedom are 2.
Likewise, the df for five numbers is 4. We use the concept of df when the sample size is small.
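The five steps above, and the difference between dividing by n and by (n – 1), can be sketched as follows. The three values reuse x = 4, y = 5, z = 3 from the degrees-of-freedom example; the function `std_dev` and its `ddof` parameter are hypothetical names for this illustration. The result is cross-checked against `pstdev` and `stdev` from Python's standard `statistics` module.

```python
import math
import statistics

values = [4, 5, 3]  # x, y and z from the degrees-of-freedom example


def std_dev(values, ddof=0):
    """Standard deviation following steps (i) to (v).

    ddof=0 divides by n; ddof=1 divides by (n - 1), the degrees of freedom.
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]      # step (i)
    squared = [d ** 2 for d in deviations]       # step (ii)
    total = sum(squared)                         # step (iii)
    mean_squared = total / (n - ddof)            # step (iv)
    return math.sqrt(mean_squared)               # step (v)


print(round(std_dev(values), 4))          # divides by n
print(round(std_dev(values, ddof=1), 4))  # divides by n - 1

# Cross-check against the standard library.
print(round(statistics.pstdev(values), 4), round(statistics.stdev(values), 4))
```

With only three values the two results differ noticeably, which is why the (n – 1) form matters most for small samples.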
The formula for finding the SD of grouped data is slightly different:
$$s = \sqrt{\frac{1}{n}\sum f\,(x - \bar{x})^2}$$
An example of the application of this formula is presented below in Table 4.
Table 4
Height (cm)   Class mark (x)   f     x – x̄    (x – x̄)²   f(x – x̄)²
160-162       161              5     -6.45    41.60      208.00
163-165       164              18    -3.45    11.90      214.20
166-168       167              42    -0.45    0.20       8.40
169-171       170              27    +2.55    6.50       175.50
172-174       173              8     +5.55    30.80      246.40
Σf or n = 100, x̄ = 167.45, Σ f(x – x̄)² = 852.50
$$s = \sqrt{\tfrac{1}{100} \times 852.50} = \sqrt{8.525} \approx 2.92 \text{ cm}$$
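The grouped-data calculation in Table 4 can be checked with a minimal Python sketch. The class marks and frequencies are transcribed from the table; the variable names are illustrative. Because the code keeps the squared deviations unrounded, the sum of squares comes out near 852.75 rather than the table's 852.50, and the SD is about 2.92 cm.

```python
import math

# Class marks and frequencies transcribed from Table 4.
class_marks = [161, 164, 167, 170, 173]
freqs = [5, 18, 42, 27, 8]

n = sum(freqs)
mean = sum(f * x for x, f in zip(class_marks, freqs)) / n

# Sum of f * (x - mean)^2 over all classes, without intermediate rounding.
sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(class_marks, freqs))
sd = math.sqrt(sum_sq / n)

print(n, round(mean, 2), round(sum_sq, 2), round(sd, 2))
```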
SD can also be derived from another formula:
$$s = \sqrt{\frac{1}{n}\left[\sum f x^2 - \frac{\left(\sum f x\right)^2}{n}\right]}$$
Table 5
Height (cm)   Class mark (x)   f               fx          fx²
160-162       161              5               161 × 5     161² × 5
163-165       164              18              164 × 18    164² × 18
166-168       167              42              167 × 42    167² × 42
169-171       170              27              170 × 27    170² × 27
172-174       173              8               173 × 8     173² × 8
                               Σf or n = 100   Σfx         Σfx²
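Table 5 leaves the products unevaluated, so a short Python sketch may help to see the alternative formula at work. It fills in the fx and fx² columns for the same height data and applies s = √[(1/n)(Σfx² – (Σfx)²/n)]; the variable names are assumptions made for this illustration.

```python
import math

# Class marks and frequencies transcribed from Table 5.
class_marks = [161, 164, 167, 170, 173]
freqs = [5, 18, 42, 27, 8]

n = sum(freqs)
sum_fx = sum(f * x for x, f in zip(class_marks, freqs))        # Σfx
sum_fx2 = sum(f * x ** 2 for x, f in zip(class_marks, freqs))  # Σfx²

# Alternative formula: no need to compute each deviation from the mean.
sd_shortcut = math.sqrt((sum_fx2 - sum_fx ** 2 / n) / n)

print(sum_fx, sum_fx2, round(sd_shortcut, 2))  # 16745 2804803 2.92
```

This route gives the same standard deviation as the deviation-based formula, because both expressions equal the same sum of squared deviations.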
As the value is derived by taking a square root, the ± symbol is used before writing the value of
standard deviation. The symbols used for the standard deviation of a sample and of a population
are ‘s’ and ‘σ’ respectively. Generally, standard deviation is presented along with the mean in
the form (mean ± 1SD) or simply (mean ± SD). From the mean and standard deviation, one can get an
idea about the spread of the values of a variable on either side of the mean. The larger the
standard deviation, the greater the spread of the values around the mean, indicating greater
heterogeneity in the data. For example, suppose the mean and standard deviation of the height of a
group of individuals are 145.25 cm ± 2.5 cm. This means that if the data follow a normal
distribution, then 68.26% of the values (here, the heights of individuals) will fall within the
range (145.25 – 2.5) cm to (145.25 + 2.5) cm, i.e. between 142.75 cm and 147.75 cm. Again,
(mean ± 2SD) covers the range from (145.25 – 2 × 2.5) to (145.25 + 2 × 2.5), i.e. between
140.25 cm and 150.25 cm, and 95.44% of the values will fall within this range. Similarly,
(mean ± 3SD) will include 99.73% of the values of the variable.
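The percentages quoted for mean ± 1, 2 and 3 SD can be checked with `NormalDist` from Python's standard `statistics` module (available from Python 3.8). This is only a sketch under the text's assumption that heights are normally distributed with mean 145.25 cm and SD 2.5 cm; the helper name `coverage` is hypothetical. The computed shares may differ from the table-based figures in the text in the second decimal place.

```python
from statistics import NormalDist

# Assumed model from the text: heights ~ Normal(mean 145.25 cm, SD 2.5 cm).
heights = NormalDist(mu=145.25, sigma=2.5)


def coverage(dist, k):
    """Share of a normal distribution lying within k SD of its mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)


for k in (1, 2, 3):
    low = heights.mean - k * heights.stdev
    high = heights.mean + k * heights.stdev
    print(f"mean ± {k}SD: {low:.2f} to {high:.2f} cm, "
          f"{coverage(heights, k) * 100:.2f}% of values")
```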
However, if the distribution of data is skewed, the standard deviation will be affected by
outliers. In a skewed distribution, the values of mean, median and mode are not the same and do
not all occupy the central position. The diagrams below give some idea of skewed distributions.
In this diagram, the bulk of the data lies to the right side, with a longer tail to the left, and
hence the distribution is said to be negatively skewed.
In this diagram, the bulk of the data lies to the left side, with a longer tail to the right, and
hence the distribution is said to be positively skewed.
Variance
The variance of a population is defined as the average of the squared deviations from the mean.
The symbols used for variance are σ² for a population and s² for a sample. Variance can also be
calculated by squaring the value of the standard deviation. It is also a measure of dispersion.
$$\sigma^2 = \frac{\text{sum of squared deviations from the mean}}{N}$$
Or
$$\text{Variance} = \frac{1}{n}\sum (x - \bar{x})^2$$
Both standard deviation and variance contain similar information about the variation in the
population. So, if the variance is known, the SD can be calculated, and vice versa. The standard
deviation is generally used for describing the variation in the population because the unit of SD
is the same as that of the variable, whereas the unit of variance is the square of the unit of the
variable.
Table 6
Cephalic index (n = 7)   Mean (x̄)   (x – x̄)                  (x – x̄)²
70.5                     78.78      (70.5 – 78.78) = -8.28   68.55
83.2                                +4.42                    19.53
82.2                                +3.42                    11.69
83.0                                +4.22                    17.80
76.5                                -2.28                    5.19
78.6                                -0.18                    0.032
                                                             Σ(x – x̄)² = 122.79
So, according to the formula mentioned above,
$$s = \sqrt{\tfrac{1}{7} \times 122.79} = \sqrt{17.54} \approx 4.19$$
The mean and standard deviation of the cephalic index will be 78.78 ± 4.19. As the cephalic index
has no unit, the mean and SD have been expressed without a unit.
Variance or s² = 122.79 ÷ 7 ≈ 17.54, which is the square of the SD.
Standard Error or SE: This is a measure of the dispersion of the sample mean. It is the standard
deviation of the sampling distribution of the means. The formula for standard error is given
below.
$$SE = \frac{s}{\sqrt{n}}$$
where s is the standard deviation and n is the sample size. Standard error is useful when one
compares the dispersion of two data sets of unequal sample size drawn from the same or different
populations.
Suppose the mean sitting height vertex of 64 individuals is 68.2 cm and the standard deviation is
4.0 cm. Then SE = 4.0 ÷ √64 = 0.5 cm. The mean and SE are presented as 68.2 ± 0.5 cm.
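The standard error example can be written as a tiny Python sketch. The figures (SD 4.0 cm, 64 individuals) are from the text; the function name `standard_error` is an illustrative assumption.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)


print(standard_error(4.0, 64))   # 0.5
print(standard_error(4.0, 256))  # 0.25
```

The second call uses a hypothetical sample four times as large: quadrupling the sample size halves the standard error.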
Coefficient of Variation or CV: The coefficient of variation is used to compare the degree of
variability among populations. It is calculated by expressing the standard deviation as a
percentage of the mean. In other words, the coefficient of variation compares the size of the
standard deviation with the size of the mean. The higher the CV of a sample, the greater the
variability; the lower the CV, the lesser the variability. Since the units of the mean and SD are
the same, CV is unitless. The formula for CV is given below.
$$CV = \frac{\text{Standard deviation}}{\text{Mean}} \times 100$$
For example, the standard deviations of the heights of elephants and ants cannot be compared
directly, because the two means differ so much in size. But the variability in height of these two
animals can be compared through the CV.
Table 7
Animals    Mean height   SD         CV = SD ÷ Mean × 100   Remarks
Elephant   304 cm        30.48 cm   10.03                  Variability more
Ants       200 mm        2 mm       1.0                    Variability less
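The CV comparison in Table 7 can be sketched in Python as follows. The means and SDs are those given in the table; the function name `coefficient_of_variation` is an assumption for this illustration.

```python
def coefficient_of_variation(sd, mean):
    """CV: the standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100


# Mean and SD must share a unit within each animal; the CV itself is unitless.
cv_elephant = coefficient_of_variation(sd=30.48, mean=304)  # both in cm
cv_ant = coefficient_of_variation(sd=2, mean=200)           # both in mm

print(round(cv_elephant, 2), round(cv_ant, 2))  # 10.03 1.0
```

Because the unit cancels, the elephant figures in cm and the ant figures in mm can be compared directly.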
Standard normal deviate or Z score
Sometimes you may want to compare one observation with another. In order to do that you need to
standardize the data by calculating the Z score, or standard normal deviate. A Z score is the
number of standard deviations by which an observation lies away from the mean.
A Z score of +1 indicates that the observation is one SD above the mean and lies to the right
side. A value of –1 indicates that the observation is one SD below the mean and lies to the left
side. A Z score of 0 indicates that the observation and the mean are the same. In an examination,
Sunil scored 65 marks out of 100 in mathematics and Ravi scored 70 marks out of 100 in physics.
Now you want to know who is the better student. It has been found that the mean mark scored by the
students in mathematics is 60 with SD 2, and the mean mark scored in physics is 65 with SD 5.
Here, calculating the Z scores will show which of the students, Sunil or Ravi, did better relative
to his class. The formula for the Z score is
$$Z = \frac{X_i - \bar{X}}{SD}$$
where $X_i$ is the individual score, $\bar{X}$ is the mean score and SD is the standard deviation.
So, the Z score for Sunil is: Z = (65 – 60) ÷ 2 = 2.5
Again, the Z score for Ravi is: Z = (70 – 65) ÷ 5 = 1.0
Thus, Sunil did better than Ravi, as the Z score is higher for Sunil.
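The comparison of Sunil and Ravi can be reproduced with a minimal Python sketch. The marks, means and SDs are from the text; the function name `z_score` is illustrative.

```python
def z_score(value, mean, sd):
    """Number of standard deviations by which a value lies from the mean."""
    return (value - mean) / sd


z_sunil = z_score(65, mean=60, sd=2)  # mathematics
z_ravi = z_score(70, mean=65, sd=5)   # physics

print(z_sunil, z_ravi)  # 2.5 1.0
```

Although Ravi has the higher raw mark, Sunil's mark lies further above his class mean once each is measured in units of its own SD.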
The Z scores of Sunil and Ravi are called standardized variables. A standardized variable has
certain properties: the mean of the standardized values is 0 and the SD of these standardized
values is 1.
Suppose the weights (kg) of 5 students are 50, 55, 52, 59 and 56. The mean weight is 54.4 kg and
the SD is 3.50 kg. Table 8 gives the Z score of each weight.
Table 8
Weight (kg)   Z = (Xi – X̄) ÷ SD             (Zi – Z̄)²
50            (50 – 54.4) ÷ 3.5 = -1.26     (-1.26 – 0)² = 1.59
55            0.17                          0.03
52            -0.69                         0.48
59            1.31                          1.72
56            0.46                          0.21
Mean Z score = 0                            Σ(Zi – Z̄)² = 4.03
SD = √[Σ(Zi – Z̄)² ÷ (n – 1)] = √(4.03 ÷ 4) = 1.0
Most Z scores will lie within the range –2 to +2. Values more than 2 SD from the mean on either
side are considered outliers.
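The two properties of standardized values (mean 0, SD 1) can be verified for the five weights with Python's standard `statistics` module. This is a sketch that uses the sample SD, dividing by (n – 1) as Table 8 does. Because the code keeps the SD unrounded (about 3.507 rather than 3.5), individual Z scores may differ from the table in the second decimal place.

```python
import statistics

weights = [50, 55, 52, 59, 56]  # weights (kg) of the 5 students

mean = statistics.mean(weights)   # 54.4
sd = statistics.stdev(weights)    # sample SD, divides by n - 1

# Standardize every weight: subtract the mean, divide by the SD.
z_scores = [(w - mean) / sd for w in weights]

print([round(z, 2) for z in z_scores])
print(round(statistics.mean(z_scores), 6))   # very close to 0
print(round(statistics.stdev(z_scores), 6))  # very close to 1
```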
The standard normal distribution
The distribution of a standardized variable is known as the ‘standard normal distribution’. A Z
score of +1 indicates that the observation is one SD above the mean, and a value of –1 indicates
that it is one SD below the mean. As the range between +1 and –1 includes 68.26% of the
observations, a Z score of +1 means that the proportion of the area between the mean and Z is
0.3413. Similarly, a Z score of –1 means that the proportion of the area between the mean and Z is
0.3413. But how do we find the area under the normal curve if the Z score is 1.25? As 1.25 is a
positive value, the area lies to the right of the mean; in the case of a negative Z score, the
area lies to the left of the mean. Here one has to use the normal distribution table to find the
proportion of the area of the curve between the mean and the Z score. In the normal distribution
table, the first column gives the Z score up to one decimal place, and the top row gives the
second decimal place. For the Z score 1.25 (1.2 + 0.05), find 1.2 in the first column and then
move along that row to the column headed 0.05. The figure you read in the table is 0.3944. This
means that the proportion of the area lying between the mean and Z = 1.25 is 0.3944. Now, what can
we say from this proportion? We can say that (0.5 – 0.3944) = 0.1056, i.e. 10.56% of the
observations lie beyond Z = 1.25.
Let us understand this with the help of an example. Suppose the mean height of a large population
is 172.5 cm with an SD of 6.25 cm. We want to know (a) the proportion of the population whose
height exceeds 180 cm and (b) the proportion of the population whose height is below 185 cm.
In the first problem (a)
Z = (Xi – X̄) ÷ SD
So, Z = (180 – 172.5) ÷ 6.25
So, Z = 1.20
A Z value of 1.20 means that the area lies to the right of the mean. From the standard normal
distribution table, the proportion between the mean and Z = 1.20 is 0.3849. So, the proportion of
the population exceeding the height of 180 cm is (0.5 – 0.3849) = 0.1151 or 11.51%.
In the second problem (b)
Again, Z = (Xi – X̄) ÷ SD
So, Z = (185 – 172.5) ÷ 6.25
So, Z = 2.0
A Z value of 2.0 means that the area lies to the right of the mean. From the standard normal
distribution table, the proportion between the mean and Z = 2.0 is 0.4772. So, the proportion of
the population below the height of 185 cm is (0.5 + 0.4772) = 0.9772 or 97.72%.
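The table look-ups in problems (a) and (b) can be cross-checked with `NormalDist` from Python's standard `statistics` module (Python 3.8 or later), which evaluates the cumulative normal curve directly. This is a sketch under the text's assumption of a normal population with mean 172.5 cm and SD 6.25 cm; the variable names are illustrative.

```python
from statistics import NormalDist

# Assumed population model from the text.
population = NormalDist(mu=172.5, sigma=6.25)

# (a) Proportion taller than 180 cm: the area to the right of Z = 1.20.
z_a = (180 - population.mean) / population.stdev
above_180 = 1 - population.cdf(180)

# (b) Proportion shorter than 185 cm: the area to the left of Z = 2.0.
z_b = (185 - population.mean) / population.stdev
below_185 = population.cdf(185)

print(round(z_a, 2), round(above_180, 4))  # 1.2 0.1151
print(round(z_b, 2), round(below_185, 4))  # 2.0 0.9772
```

The `cdf` method returns the whole area to the left of a value, so the 0.5 for the half of the curve below the mean is already included and need not be added or subtracted by hand.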
CONCLUSION
We can conclude that this module deals with the dispersion of data in a population or in a
sample. Dispersion also helps researchers judge the degree of heterogeneity in the data. There are
various measures of dispersion, of which we discussed range, standard deviation, standard error
and variance. Each has its own merits and demerits. But standard deviation has an advantage over
the other measures of dispersion, since its unit is the same as that of the mean.