Measures of Dispersion (Range, standard deviation, standard error)

Introduction

We have already learnt that a frequency distribution table gives a rough idea of the distribution of the variables in a sample or population, while the mean, median and mode describe the central tendency of the distribution. But none of these measures describes how the data are spread around the central value. As stated in our earlier discussion, in a normal distribution the mean, median and mode all have the same value. The following diagram shows what an ideal normal distribution looks like.

[Diagram: an ideal normal distribution curve]

The mean, median and mode occupy the same position, and the curve declines gradually on either side of the mean. But the spread of data is not always as gradual or smooth as in this diagram.

[Diagram: three symmetric curves with the same centre but different peakedness, drawn in blue, green and red]

The above diagram gives an idea of how differently data can be spread even when the mean, median and mode are the same. The three coloured lines (blue, green and red) represent the distributions of three different data sets. In each of them the mean, median and mode occupy the central position, and the curve is symmetric and bell-shaped. But the degree of peakedness (kurtosis) of these distributions is not the same: the one with a flat top is called platykurtic (blue line), the one with a medium top is called mesokurtic (green line) and the one with a narrow top is called leptokurtic (red line). In the platykurtic curve the data are spread most widely on both sides of the central value, suggesting more variation in the data set. It is followed by the mesokurtic and then the leptokurtic curve, where the spread of data is comparatively less. Thus, the greater the variation in the data, the greater the degree of dispersion.
Dispersion is the spread of the values of a variable on either side of the central value. There are different measures of dispersion: (i) range, (ii) quartile deviation, (iii) standard deviation, (iv) variance and (v) standard error.

Range

The range is the difference between the highest and the lowest value of a data set arranged in an array. The larger the range, the greater the dispersion of the values. For example, in Table 1 the lowest and highest values are 45 and 85 respectively, so the range is (85 − 45) = 40. Had the lowest and highest values of the distribution been 45 and 62, the range would have been smaller (62 − 45 = 17). But the range does not always give a satisfactory picture of dispersion, because it depends only on the two extreme values. For example, if the distribution contained a single value of 10, the range would become 85 − 10 = 75, even though no other value lies between 10 and 45.

Table 1
45  52  53  55  58  59  62  65  68  73  75  79  82
45  52  53  55  58  59  63  65  69  73  76  79  82
47  52  53  55  58  59  64  66  70  75  76  80  83
51  52  54  57  58  61  64  67  71  75  77  80  84
51  52  54  57  59  62  64  67  72  75  78  80  85

Semi-inter-quartile range or Quartile Deviation

The semi-inter-quartile range, or quartile deviation, is a better measure of dispersion than the range because the extreme values on both sides are left out. In quartile deviation, the raw data are arranged in ascending order of magnitude and then divided into four equal parts. The values that mark the end of each part are called quartiles, so altogether four quartiles (Q1, Q2, Q3, Q4) are formed. For example, if you have 100 observations on height arranged in ascending order of magnitude, then the height of the 25th individual is the first quartile value, that of the 50th individual is the second quartile (middle position) value, that of the 75th individual is the third quartile (three-fourths position) value, and the fourth quartile is simply the last observation.
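The range example based on Table 1, and the way a single extreme value inflates it, can be checked with a short sketch. The function name `data_range` and the variable names are illustrative choices, not part of the module; the values are transcribed from Table 1.

```python
# Values transcribed from Table 1 (65 observations).
table_1 = [
    45, 52, 53, 55, 58, 59, 62, 65, 68, 73, 75, 79, 82,
    45, 52, 53, 55, 58, 59, 63, 65, 69, 73, 76, 79, 82,
    47, 52, 53, 55, 58, 59, 64, 66, 70, 75, 76, 80, 83,
    51, 52, 54, 57, 58, 61, 64, 67, 71, 75, 77, 80, 84,
    51, 52, 54, 57, 59, 62, 64, 67, 72, 75, 78, 80, 85,
]


def data_range(values):
    """Return the difference between the largest and the smallest value."""
    return max(values) - min(values)


# Range of the data as given: 85 - 45.
range_original = data_range(table_1)

# Range after adding the single extreme value 10 discussed in the text.
range_with_extreme = data_range(table_1 + [10])

print(range_original, range_with_extreme)
```

One added observation out of 66 nearly doubles the range, which is the weakness that motivates the quartile-based measure.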
The 2nd quartile (Q2) is also the median. Now, the total number of observations will not always be a hundred or a multiple of a hundred. In such cases the quartile values are calculated with formulae. The formulae for the first and third quartiles (Q1 and Q3) are very similar to the formula for the median. In calculating the median, we first locate the midpoint of the distribution by dividing the total number of observations (N) by 2, i.e. N/2. For quartiles, as the data are to be divided into 4 equal parts, the positions of Q1, Q2 and Q3 are N/4, N/2 and ¾N respectively. So,

$$Q_1 = L_i + \frac{N/4 - C}{f_i}\,h, \qquad Q_3 = L_i + \frac{\tfrac{3}{4}N - C}{f_i}\,h$$

where $L_i$ = lower limit (boundary) of the class interval containing the respective quartile, $f_i$ = frequency of that class, $h$ = width of that class, $C$ = cumulative frequency of the class preceding it and $N$ = total number of observations.

Table 2
Weight in kg.   Class mark (x)   Frequency (f)   Cumulative frequency
44.5-50.5       47.5              3               3
50.5-56.5       53.5             15              18
56.5-62.5       59.5             13              31
62.5-68.5       65.5             10              41
68.5-74.5       71.5              6              47
74.5-80.5       77.5             13              60
80.5-86.5       83.5              5              65

In Table 2, the (N/4)th observation corresponds to Q1. Here N/4 = 65 ÷ 4 = 16.25, so the value of the 16.25th observation is the Q1 value. The cumulative frequency column shows that the 16.25th observation lies in the class interval 50.5-56.5, since the cumulative frequency is 3 up to the preceding class and 18 up to this one. So here L = 50.5, f = 15, h = 6 and C = 3, and

$$Q_1 = 50.5 + \frac{16.25 - 3}{15} \times 6 = 55.80$$

The value of Q3 (the third quartile) is found in a similar way, but here the position is determined by multiplying N by ¾, i.e. ¾N. The value of ¾N is 48.75. This means that the value of the 48.75th observation is the Q3 value.
Again, the cumulative frequency column shows that the 48.75th observation lies in the class interval 74.5-80.5, since the cumulative frequency is 47 up to the preceding class and 60 up to this one. So here L = 74.5, f = 13, h = 6 and C = 47. Thus,

$$Q_3 = 74.5 + \frac{48.75 - 47}{13} \times 6 \approx 75.31$$

The semi-inter-quartile range is half of the difference between the third and the first quartile:

$$Q = \tfrac{1}{2}\,(Q_3 - Q_1)$$

Here (Q3 − Q1) is the inter-quartile range, and when it is multiplied by ½ it becomes the semi-inter-quartile range. Therefore, the semi-inter-quartile range is

$$Q = \tfrac{1}{2}\,(75.31 - 55.80) \approx 9.75$$

The main disadvantage of this measure of dispersion is that it uses only two values (Q1 and Q3) from the whole range of data.

Mean Deviation:

Here, the deviation of each value from the mean is taken as an absolute value, and the arithmetic mean of these deviations is calculated. The formula for the mean deviation from grouped data is

$$\text{Mean deviation} = \frac{\sum_i f_i\,\lvert a_i - A \rvert}{N}$$

where $f_i$ is the frequency of the ith class, $a_i$ is the class mark of the ith class interval, $A$ is the arithmetic mean and $N$ is the total number of observations. The formula can be expanded as

$$\text{Mean deviation} = \frac{f_1 \lvert a_1 - A \rvert + f_2 \lvert a_2 - A \rvert + f_3 \lvert a_3 - A \rvert + \dots + f_n \lvert a_n - A \rvert}{N}$$

where $f_1, f_2, \dots, f_n$ are the frequencies of the 1st, 2nd, ..., nth class intervals and $a_1, a_2, \dots, a_n$ are the class marks of the 1st, 2nd, ..., nth class intervals.

Table 3
Weight in kg.   Class mark (x)   Frequency (f)
44.5-50.5       47.5              3
50.5-56.5       53.5             15
56.5-62.5       59.5             13
62.5-68.5       65.5             10
68.5-74.5       71.5              6
74.5-80.5       77.5             13
80.5-86.5       83.5              5

In this example, take the mean weight to be 65.0 kg.
The mean deviation is then

$$\frac{3\lvert 47.5 - 65.0 \rvert + 15\lvert 53.5 - 65.0 \rvert + 13\lvert 59.5 - 65.0 \rvert + 10\lvert 65.5 - 65.0 \rvert + 6\lvert 71.5 - 65.0 \rvert + 13\lvert 77.5 - 65.0 \rvert + 5\lvert 83.5 - 65.0 \rvert}{65}$$

$$= \frac{52.5 + 172.5 + 71.5 + 5 + 39 + 162.5 + 92.5}{65} = \frac{595.5}{65} = 9.16$$

Standard Deviation or SD:

Standard deviation is a very common measure of dispersion. It has an advantage over the preceding measures because it uses all the values of the variable in estimating the dispersion, and its unit is the same as that of the mean. It is defined as the square root of the mean squared deviation. The formula is worked out in this way: (i) find the difference of each value from the mean, (ii) square each difference, (iii) add the squared differences, (iv) divide the sum of the squared differences by the total number of observations to get the mean squared deviation and (v) finally take the square root to get the standard deviation. Taking the square root restores the original unit of measurement. The formula for the standard deviation is

$$s = \sqrt{\frac{1}{n}\sum (x - \bar{x})^2}$$

This formula is applied to ungrouped data. If the sample size is small, the formula is slightly modified:

$$s = \sqrt{\frac{1}{n-1}\sum (x - \bar{x})^2}$$

This (n − 1) is called the degrees of freedom (df). The degrees of freedom are the number of values in a set of data that are unrestricted, independent and free to vary. For example, suppose the sum of x, y and z is 12. If x = 4 and y = 5, then z must be 3 so that x + y + z = 12. Thus, when there are three numbers with a fixed total, the degrees of freedom are 2. Likewise, the df for five numbers is 4. We use the concept of df when the sample size is small.

The formula for the SD of grouped data is slightly different:

$$s = \sqrt{\frac{1}{n}\sum f\,(x - \bar{x})^2}$$

An example of the application of this formula is presented below in Table 4.

Table 4
Height (cm.)   Class mark (x)   f     x − x̄    (x − x̄)²   f(x − x̄)²
160-162        161               5    −6.45    41.60      208.00
163-165        164              18    −3.45    11.90      214.20
166-168        167              42    −0.45     0.20        8.40
169-171        170              27    +2.55     6.50      175.50
172-174        173               8    +5.55    30.80      246.40
Σf or n = 100;  x̄ = 167.45;  Σ f(x − x̄)² = 852.50

$$s = \sqrt{\tfrac{1}{100} \times 852.50} = \sqrt{8.525} \approx 2.92 \text{ cm}$$

The SD can also be derived from another formula:

$$s = \sqrt{\frac{1}{n}\left[\sum f x^2 - \frac{\left(\sum f x\right)^2}{n}\right]}$$

Table 5
Height (cm.)   Class mark (x)   f     fx          fx²
160-162        161               5    161 × 5     161² × 5
163-165        164              18    164 × 18    164² × 18
166-168        167              42    167 × 42    167² × 42
169-171        170              27    170 × 27    170² × 27
172-174        173               8    173 × 8     173² × 8
Total                           Σf or n = 100    Σfx    Σfx²

As the value is obtained by taking a square root, a ± symbol is conventionally written before the value of the standard deviation. The symbols used for the standard deviation of a sample and of a population are 's' and 'σ' respectively. Generally, the standard deviation is presented along with the mean as (mean ± 1SD) or simply (mean ± SD).

From the mean and standard deviation, one can get an idea of the spread of the values of a variable on either side of the mean. The larger the standard deviation, the greater the spread of the values around the mean, indicating greater heterogeneity in the data. For example, suppose the mean and standard deviation of the height of a group of individuals are 145.25 cm ± 2.5 cm. If the data follow a normal distribution, then 68.26% of the values (here, the heights of individuals) will fall within the range (145.25 − 2.5) cm to (145.25 + 2.5) cm, i.e. between 142.75 cm and 147.75 cm. Again, (mean ± 2SD) means that the spread of the values around the mean is between (145.25 − 2 × 2.5) and (145.25 + 2 × 2.5), i.e. between 140.25 cm and 150.25 cm, and 95.44% of the values will fall within this range. Similarly, (mean ± 3SD) will include 99.73% of the values. However, if the distribution of the data is skewed, the standard deviation will be affected by outliers.
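The grouped-data calculation of Table 4 can be cross-checked with a short sketch that evaluates both the definitional formula and the computational formula based on Σfx and Σfx². The function names are illustrative assumptions, and the sketch divides by n, as the grouped formula in the text does. Because it works with unrounded squared deviations, the sum of squares comes out as about 852.75 rather than the 852.50 obtained from the rounded table entries; the SD rounds to about 2.92 cm either way.

```python
import math

# Class marks (x) and frequencies (f) from Table 4.
class_marks = [161, 164, 167, 170, 173]
frequencies = [5, 18, 42, 27, 8]


def grouped_sd(marks, freqs):
    """Return (mean, sum of f*(x - mean)^2, SD) for grouped data, dividing by n."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(marks, freqs)) / n
    sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(marks, freqs))
    return mean, sum_sq, math.sqrt(sum_sq / n)


def grouped_sd_shortcut(marks, freqs):
    """Same SD from the computational form: sqrt((sum f*x^2 - (sum f*x)^2 / n) / n)."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(marks, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(marks, freqs))
    return math.sqrt((sum_fx2 - sum_fx ** 2 / n) / n)


mean, sum_sq, sd = grouped_sd(class_marks, frequencies)
sd_shortcut = grouped_sd_shortcut(class_marks, frequencies)

print(round(mean, 2), round(sum_sq, 2), round(sd, 2), round(sd_shortcut, 2))
```

Both routes give the same SD, which is why the second formula is convenient for hand calculation: it needs only the column totals of Table 5.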
In a skewed distribution, the values of the mean, median and mode are not the same and do not occupy the central position. The diagrams below give an idea of skewed distributions.

[Diagram: negatively skewed distribution]

In this diagram, most of the data lie to the right side, and hence the distribution is said to be skewed negatively.

[Diagram: positively skewed distribution]

In this diagram, most of the data lie to the left side, and hence the distribution is said to be skewed positively.

Variance

The variance of a population is defined as the average of the squared deviations from the mean. The symbol used for variance is σ² for a population and s² for a sample. Variance can also be calculated by squaring the value of the standard deviation. It is also a measure of dispersion.

σ² = (sum of the squared deviations from the mean) ÷ N, or

$$\text{Variance} = \frac{1}{n}\sum (x - \bar{x})^2$$

Both standard deviation and variance contain similar information about the variation in the population. So, if the variance is known, the SD can be calculated, and vice versa. The standard deviation is generally used for describing the variation in the population because the unit of the SD is the same as that of the variable, whereas the unit of the variance is the squared unit of the variable.

Table 6: Cephalic index (n = 7), mean (x̄) = 78.78 (the table lists six of the observations)
Cephalic index (x)   x − x̄                     (x − x̄)²
70.5                 (70.5 − 78.78) = −8.28    68.55
83.2                 +4.42                     19.53
82.2                 +3.42                     11.69
83.0                 +4.22                     17.80
76.5                 −2.28                      5.19
78.6                 −0.18                      0.032
Σ(x − x̄)² = 122.79

So, according to the formula for the standard deviation,

$$s = \sqrt{\tfrac{1}{7} \times 122.79} = \sqrt{17.54} \approx 4.19$$

The mean and standard deviation of the cephalic index are therefore 78.78 ± 4.19. As the cephalic index has no unit, the mean and SD are expressed here without a unit. The variance is s² = 122.79 ÷ 7 ≈ 17.54.

Standard Error or SE:

This is a measure of the dispersion of the mean. It is the standard deviation of the sampling distribution of the means. The formula for the standard error is

SE = s ÷ √n, where s is the standard deviation and n is the sample size.
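The formula SE = s ÷ √n can be sketched as a small function. The function name and the input values below are illustrative assumptions chosen only to show how the standard error shrinks as the sample size grows while the SD stays the same.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean: SE = s / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return sd / math.sqrt(n)


# Hypothetical inputs: the same SD observed with three different sample sizes.
sd = 4.0
for n in (16, 64, 256):
    print(n, standard_error(sd, n))
```

Quadrupling the sample size halves the standard error, so two samples with the same SD but different sizes estimate their means with different precision.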
Standard error is useful when one compares the dispersion of two data sets of unequal sample size drawn from the same or different populations. Suppose the mean sitting height vertex of 64 individuals is 68.2 cm and the standard deviation is 4.0 cm. Then SE = 4.0 ÷ √64 = 0.5. The mean and SE are presented as 68.2 ± 0.5.

Coefficient of Variation or CV:

The coefficient of variation is used to compare the degree of variability among populations. It is calculated by expressing the standard deviation as a percentage of the mean. In other words, the coefficient of variation compares the size of the standard deviation with the size of the mean. The higher the CV of a sample, the greater the variability; the lower the CV, the lesser the variability. Since the units of the mean and the SD are the same, the CV is unitless. The formula for CV is

$$CV = \frac{\text{Standard deviation}}{\text{Mean}} \times 100$$

For example, the standard deviations of the heights of elephants and ants cannot be compared directly, because the two are measured on very different scales. But the relative variability in the height of these two animals can be compared through the CV.

Table 7
Animal     Mean height   SD          CV = (SD ÷ Mean) × 100   Remarks
Elephant   304 cm.       30.48 cm.   10.03                    Variability more
Ants       200 mm.       2 mm.       1.0                      Variability less

Standard normal deviate or Z score

Sometimes you may want to compare one observation with another. To do that, you need to standardize the data by calculating the Z score, or standard normal deviate. A Z score is the number of standard deviations by which an observation lies away from the mean. A Z score of +1 indicates that the value is one SD above the mean and lies to the right side. A value of −1 indicates that the value is one SD below the mean and lies to the left side. A Z score of 0 indicates that the observation and the mean are the same. In an examination, Sunil scored 65 marks out of 100 in mathematics and Ravi scored 70 marks out of 100 in physics.
Now you want to know who is the better student. It has been found that the mean mark scored by the students in mathematics is 60 with SD 2, and the mean mark scored in physics is 65 with SD 5. Here, the Z score will indicate which of the students, Sunil or Ravi, did better relative to his class. The formula for the Z score is

$$Z = \frac{X_i - \bar{X}}{SD}$$

where $X_i$ is the individual score, $\bar{X}$ is the mean score and SD is the standard deviation. So, the Z score for Sunil is Z = (65 − 60) ÷ 2 = 2.5, and the Z score for Ravi is Z = (70 − 65) ÷ 5 = 1.0. Thus, Sunil did better than Ravi, as his Z score is higher.

The Z score values of Sunil and Ravi are called standardized variables. A standardized variable has certain properties: the mean of the standardized values is 0 and the SD of the standardized values is 1. Suppose the weights (kg.) of 5 students are 50, 55, 52, 59 and 56. The mean weight is 54.4 kg and the SD is about 3.5 kg. Table 8 shows that the Z scores of these weights have a mean of 0 and an SD of 1.0.

Table 8
Weight (kg.)   Z = (Xi − X̄) ÷ SD              (Zi − Z̄)²
50             (50 − 54.4) ÷ 3.5 = −1.26     (−1.26 − 0)² = 1.59
55             0.17                           0.03
52             −0.69                          0.48
59             1.31                           1.72
56             0.46                           0.21
Mean Z score = 0;  Σ(Zi − Z̄)² = 4.03
SD = √[Σ(Zi − Z̄)² ÷ (n − 1)] = √(4.03 ÷ 4) = 1.0

Most Z scores will lie between −2 and +2. Values more than 2 SD from the mean on either side are often treated as outliers.

The standard normal distribution

The distribution of a standardized, normally distributed variable is known as the 'standard normal distribution'. A Z score of +1 indicates that the value is one SD above the mean, and a value of −1 indicates that it is one SD below the mean. As the range between −1 and +1 includes 68.26% of the observations, the proportion of the area between the mean and a Z score of +1 is 0.3413. Similarly, the proportion of the area between the mean and a Z score of −1 is 0.3413. But how do we find the area under the normal curve if the Z score is 1.25? As 1.25 is a positive value, the area in question lies to the right side of the mean.
Similarly, in the case of a negative Z score, the area lies on the left side of the mean. To find the proportion of the area of the curve between the mean and the value of the Z score, one has to take help from the normal distribution table. In the normal distribution table, the first column gives the Z score up to one decimal place, and the top row gives the second decimal place. For the Z score 1.25 (1.2 + 0.05), look down the first column to 1.2 and then along the top row to 0.05 (the second decimal place). The figure at their intersection is 0.3944. This means that the proportion of the area lying between the mean and the Z value 1.25 is 0.3944. What can we say from this proportion? We can say that (0.5 − 0.3944) = 0.1056, i.e. 10.56% of the observations lie beyond Z = 1.25 in the right-hand tail.

Let us understand this with the help of an example. Suppose the mean height of a large population is 172.5 cm with an SD of 6.25 cm. We want to know (a) the proportion of the population whose height exceeds 180 cm and (b) the proportion of the population whose height is below 185 cm.

In the first problem (a), Z = (Xi − X̄) ÷ SD = (180 − 172.5) ÷ 6.25 = 1.20. The positive value of Z means that the area lies on the right side of the mean. From the standard normal distribution table, the proportion between the mean and Z = 1.20 is 0.3849. So, the proportion of the population exceeding the height of 180 cm is (0.5 − 0.3849) = 0.1151, or 11.51%.

In the second problem (b), Z = (Xi − X̄) ÷ SD = (185 − 172.5) ÷ 6.25 = 2.0. Again, the positive value of Z means that the area lies on the right side of the mean. From the standard normal distribution table, the proportion between the mean and Z = 2.0 is 0.4772. So, the proportion of the population below the height of 185 cm is
(0.5 + 0.4772) = 0.9772, or 97.72%.

CONCLUSION

This module deals with the dispersion of data in a population or in a sample. Dispersion helps researchers assess the degree of heterogeneity in the data. The measures of dispersion discussed include the range, standard deviation, standard error and variance. Each has its own merits and demerits. But the standard deviation has an advantage over the other measures of dispersion, since it uses all the values of the variable and its unit is the same as that of the mean.
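The table lookups used in the standard normal distribution examples can be cross-checked numerically. The sketch below is an assumption-laden alternative to the printed table: it computes the area between the mean and a Z score from the error function in Python's standard library, and the function names are illustrative.

```python
import math


def z_score(x, mean, sd):
    """Number of standard deviations by which x lies from the mean."""
    return (x - mean) / sd


def area_mean_to_z(z):
    """Area under the standard normal curve between the mean (Z = 0) and z."""
    cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return cdf - 0.5


# Height example from the text: mean 172.5 cm, SD 6.25 cm.
mean_height, sd_height = 172.5, 6.25

# (a) proportion taller than 180 cm: right-hand tail beyond Z.
z_a = z_score(180, mean_height, sd_height)
above_180 = 0.5 - area_mean_to_z(z_a)

# (b) proportion shorter than 185 cm: left half plus the area up to Z.
z_b = z_score(185, mean_height, sd_height)
below_185 = 0.5 + area_mean_to_z(z_b)

print(round(area_mean_to_z(1.25), 4), round(above_180, 4), round(below_185, 4))
```

Rounded to four decimal places, these areas agree with the table values quoted in the examples (0.3944, 0.1151 and 0.9772).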