Measures of Dispersion
SOCY601, Alan Neustadtl

Measures of Dispersion
• Measures of central tendency estimate the numerical center of a distribution; these are measures of location.
• Measures of dispersion estimate the spread or variability of a distribution around the center:
  - Variation ratio
  - Range
  - Interquartile range
  - Quartile deviation
  - Mean deviation
  - Variance/standard deviation

Variation Ratio
• The variation ratio can be used with grouped data and is most useful for nominal-level data:

  V.R. = 1 - f_modal / n

• Basically, this is the proportion of cases that lie outside of the modal category.
• Frequency distribution of race in the 2000 General Social Survey:

  Race    f
  White   2,244
  Black   404
  Other   170
  Total   2,817

  V.R. = 1 - f_modal / n = 1 - 2,244/2,817 ≈ 0.2

• Advantages: The variation ratio can be used with data that do not contain a lot of information (i.e., nominal-level data), and it is easily interpretable.
• Disadvantages: The variation ratio depends on the categorization scheme used by the researcher (i.e., it is somewhat arbitrary) and does not reflect the distribution of data in the non-modal categories.

Range
• The range is simply the difference between the values of the largest and smallest observations:

  range = max - min

• In the year 2000 General Social Survey, the minimum and maximum respondent ages are 18 and 89, respectively, so the range is 89 - 18 = 71.
• Advantages: The range is an extremely simple measure to calculate and interpret. It is useful for spotting "out of range" values (i.e., values erroneously in a dataset).
• Disadvantages: The range depends entirely on just two values: the most extreme, and therefore most variable (sample-dependent), observations in a data set.

Interquartile Range/Quartile Deviation
• The interquartile range, or IQR, is the numerical difference or distance between the third and first quartiles in a distribution.
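The variation ratio and range described above are easy to compute directly. A minimal Python sketch (the slides include no code; the frequency table and age extremes are the ones given above):

```python
def variation_ratio(freqs):
    """V.R. = 1 - f_modal / n for a dict mapping category -> frequency."""
    n = sum(freqs.values())
    return 1 - max(freqs.values()) / n

# Race frequencies from the 2000 GSS table on the slide.
race = {"White": 2244, "Black": 404, "Other": 170}
vr = variation_ratio(race)
print(round(vr, 2))  # ≈ 0.2: about a fifth of cases lie outside the modal category

# Range of age: minimum 18, maximum 89 (from the slide).
ages_min, ages_max = 18, 89
age_range = ages_max - ages_min
print(age_range)  # 71
```

Note that `variation_ratio` takes the whole frequency table, so it reflects only the modal count and the total, exactly the limitation the slide describes.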
  IQR = Q3 - Q1

• A related measure is the quartile deviation, which numerically represents half the distance between the first and third quartiles in a distribution, i.e., the middle half of a distribution:

  Q = (Q3 - Q1) / 2

• Advantages: The IQR and quartile deviation are more stable estimators of spread because they use two values closer to the middle of the distribution, which vary less from sample to sample than more extreme values.
• Disadvantages: These measures depend entirely on just two values and ignore all other observations in a data set.
• In the 2000 General Social Survey, the third and first quartiles of age are 57 and 32, respectively. Therefore:

  IQR = Q3 - Q1 = 57 - 32 = 25
  Q = (Q3 - Q1) / 2 = (57 - 32) / 2 = 12.5

Mean Deviation
• With data that contain more information, we may calculate a measure that uses all of that information. We do this by calculating the deviation of each score from the mean and computing an average deviation:

  mean deviation = Σ|X_i - X̄| / n

• Note: Absolute values are taken as a theoretical convenience; unfortunately, they have poor mathematical properties.
• The mean deviation of age in the 2000 General Social Survey is equal to 14.5.
• Advantages: The mean deviation uses all the valid observations of a variable to produce this summary statistic (it is a "democratic" measure), and it may be interpreted intuitively.
• Disadvantages: Absolute values are not easily manipulated algebraically, and there is no ready-made metric to aid interpretation of this statistic as there is for the standard deviation.

Standard Deviation
• The most widely used measure of variability, usually paired with the mean, is the standard deviation.
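The IQR, quartile deviation, and mean deviation above can be sketched in a few lines of Python. The quartiles are the slide's GSS values; the `scores` list is illustrative data, not the GSS:

```python
def mean_deviation(xs):
    """Average absolute deviation from the mean: sum(|x - mean|) / n."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

# Quartiles of age from the slide (Q1 = 32, Q3 = 57).
q1, q3 = 32, 57
iqr = q3 - q1                  # 25
quartile_dev = (q3 - q1) / 2   # 12.5

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative data; mean = 5
print(iqr, quartile_dev, mean_deviation(scores))  # 25 12.5 1.5
```

Because `mean_deviation` touches every observation, it behaves as the "democratic" measure the slide describes, unlike the two-value IQR.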
  s = √( Σ(X_i - X̄)² / n )

• The standard deviation of age in the 2000 General Social Survey data is equal to 17.4.
• Advantages: Like the mean deviation, the standard deviation uses all the valid observations of a variable to produce this summary statistic; it also is a "democratic" measure. It may be interpreted by using the Gaussian normal distribution, and it varies from low to high with the spread of the distribution.
• Disadvantages: Squaring the differences gives greater weight to more extreme values.
• Sample and population formulas:

  Samples:      s² = Σ(X_i - X̄)² / n      s = √( Σ(X_i - X̄)² / n )
  Populations:  σ² = Σ(X_i - μ)² / N      σ = √( Σ(X_i - μ)² / N )

Summary
  Age (mean) = 46
  Age (median) = 43
  Age (mode) = 32
  Age (range) = 71.0
  Age (IQR) = 25.0
  Age (quartile dev.) = 12.5
  Age (mean dev.) = 14.5
  Age (variance) = 301.6
  Age (std. dev.) = 17.4

Heuristic vs. Computational Formulas
• Expanding the definitional ("heuristic") formula for the variance yields a computational form:

  s² = Σ(X_i - X̄)² / n
     = Σ(X_i² - 2X_i X̄ + X̄²) / n
     = ΣX_i²/n - 2X̄ ΣX_i/n + X̄²
     = ΣX_i²/n - 2X̄² + X̄²
     = ΣX_i²/n - X̄²

Interpreting Standard Deviations
• The Empirical Rule (applies to "normal"-shaped distributions):
  - Approximately 68% of all cases fall within 1 standard deviation of the mean (X̄ - s, X̄ + s).
  - Approximately 95% of all cases fall within 2 standard deviations of the mean (X̄ - 2s, X̄ + 2s).
  - Essentially all cases fall within 3 standard deviations of the mean (X̄ - 3s, X̄ + 3s).
• Chebyshev's Rule (applies to any sample regardless of shape):
  - It is possible that very few cases will fall within 1 standard deviation of the mean (X̄ - s, X̄ + s).
  - At least 3/4 of all cases will fall within 2 standard deviations of the mean (X̄ - 2s, X̄ + 2s).
  - At least 8/9 of the cases will fall within 3 standard deviations of the mean (X̄ - 3s, X̄ + 3s).
  - Generally, at least 1 - 1/k² of the cases will fall within k standard deviations of the mean (X̄ - ks, X̄ + ks) for any k greater than 1.

Understanding the Mean
• "Best Guess"
Interpretation: We have proven that the tendency for cases in a distribution to differ from the mean in one direction is exactly balanced by differences in the other direction. But there is another way to understand the mean. Suppose that a single observation were selected at random from a sample and you were asked to determine the value of that observation, i.e., to guess its value.
• If you guess the mean of the distribution, your guess might be too high, too low, or exactly right. The extent of your error is:

  d = X - X̄

• Over all possible cases that could be drawn from the distribution, the average signed error, or mean signed deviation, is:

  d̄ = Σd_i / n = 0

  We can make the following statement: if the mean is guessed as the score for any case drawn at random from a distribution, the amount of signed error will, on average, be zero.
• Suppose the rules are changed somewhat: for a single guess about the value of this case, you must be absolutely correct with the greatest probability. In this situation you would guess the mode, since it is the most frequently occurring score, and therefore the most probable value in the distribution; it has the greatest likelihood of being selected.
• The final situation requires that your guess have the smallest absolute error possible. Here the sign of the error is unimportant, but its sheer size is critical. In this situation you would guess the median, since it is closest, on average, to every other score in the distribution. Symbolically:

  Σ|X_i - md| = minimum

Final Thoughts
• An easy way to understand variance is by way of physical analogy.
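Before turning to the physical analogy, the three "best guess" rules above can be checked numerically. A minimal sketch on illustrative data (not the GSS):

```python
# Guessing the mean zeroes the average signed error, guessing the mode
# maximizes the chance of being exactly right, and guessing the median
# minimizes total absolute error.
from statistics import mean, median, mode

xs = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative; mode = 4, median = 4.5, mean = 5

# Mean: signed errors cancel out.
signed_error = sum(x - mean(xs) for x in xs) / len(xs)
print(signed_error)  # 0.0

# Mode: the most exact hits of any single guess.
exact_hits = sum(1 for x in xs if x == mode(xs))
print(exact_hits)  # 3

# Median: total absolute error no worse than guessing the mean.
def total_abs_error(guess):
    return sum(abs(x - guess) for x in xs)

print(total_abs_error(median(xs)) <= total_abs_error(mean(xs)))  # True
```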
• A deviation from the mean can be identified as having a certain amount of force away from the mean. We do not necessarily know what factors are at work to make an observation deviate from the mean, but we can measure the actual deviation for each case. Therefore, the value of any observation is composed of the mean plus a deviation from the mean:

  X_i = X̄ + d_i

• Consider two cases, with deviations d1 and d2, drawn from a distribution. Under normal random sampling techniques these two cases are unrelated, or independent of each other, and independent values may be represented geometrically as being at right angles to each other. The issue is to determine the net force away from the mean for these two cases, calculated from:

  Σd² = (X_1 - X̄)² + (X_2 - X̄)²

• This is, of course, the Pythagorean Theorem; dividing this sum by n gives the variance, and its square root looks suspiciously like the standard deviation.

  [Figure: right triangle with legs d1 and d2; the hypotenuse, √(d1² + d2²), is the resultant deviation from X̄.]

• What if we had three independent cases? The resultant force away from the mean is:

  √(d1² + d2² + d3²)

• This may be generalized to all n cases in a distribution, even though it is difficult to "visualize."
• The larger the standard deviation, the larger the total force away from the mean.
• Error is viewed as the resultant force away from homogeneity; the standard deviation reflects the net effect of such forces per observation.
• People with physics backgrounds will see that:
  - the mean is equal to the center of gravity of a distribution of mass,
  - the variance is equal to the moment of inertia of a distribution of mass, and
  - the standard deviation is equal to the radius of gyration of a distribution of mass.
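The analogy above can be verified numerically: the resultant "force" is the Euclidean length of the deviation vector, and its squared length divided by n is exactly the variance. A minimal sketch on illustrative data:

```python
import math

xs = [2, 4, 4, 4, 5, 5, 7, 9]          # illustrative data; mean = 5
m = sum(xs) / len(xs)
d = [x - m for x in xs]                 # deviations: X_i = mean + d_i

# Pythagorean length of the deviation vector (the "resultant force").
resultant = math.sqrt(sum(di ** 2 for di in d))

# Squared length / n is the variance; the standard deviation follows.
variance = resultant ** 2 / len(xs)
s = math.sqrt(variance)
print(resultant, variance, s)
```

Dividing the resultant by √n (rather than the squared length by n) gives the same standard deviation, which is the "per observation" reading of the slide.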