ECO 72 INTRODUCTION TO ECONOMIC STATISTICS
Topic 3: Measures of Dispersion

These slides are copyright © 2003 by Tavis Barr. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, v1.0 or later (the latest version is presently available at http://www.opencontent.org/openpub/).

Dispersion
● Measures of central tendency look at measuring a “typical” observation.
● This section measures how dispersed, or spread out, the data are.
● Helps us answer the question “How typical is typical?”
● For example, if most observations are close to the mean, then the mean is typical; otherwise it is not.

Knowing the mean is not enough
● Sample of five people's wages in South Dakota: $5, $7, $11, $12, and $15.
● Sample of five people's wages in North Dakota: $9, $9.50, $10, $9.50, and $12.
● Both samples have a mean wage of $10.
● But the wages in South Dakota are much more dispersed.
● How do we measure this?

Measures of Dispersion Available
● Range
● Mean Absolute Deviation
● Variance / Standard Deviation
● InterQuartile Range
Each of these measures, like the measures of central tendency, has strengths and weaknesses.

Range – definition and advantage
● The range simply tells us the highest value in the sample minus the lowest value.
  – Example above: North Dakota $12 − $9 = $3; South Dakota $15 − $5 = $10.
● Advantage: Easy to calculate, intuitive.

Range (cont'd)
● Problem: A huge data set includes outliers, observations that are unusual and extreme in value.
  – Top income earner in a data set may be Bill Gates
  – Bottom income earner may be a street vendor making $1 per hour
  – The range will be the difference between their incomes. Does this tell us much?
  – Moreover, outliers are often created by errors.
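As a minimal sketch of the range calculation, the following Python snippet recomputes the mean and the range for the two wage samples from the slides. The helper names `mean` and `value_range` are illustrative choices, not part of the original material.

```python
# Wage samples taken from the slides (dollars per hour).
south_dakota = [5, 7, 11, 12, 15]
north_dakota = [9, 9.50, 10, 9.50, 12]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def value_range(values):
    """Range: highest value in the sample minus the lowest value."""
    return max(values) - min(values)


for name, wages in [("South Dakota", south_dakota), ("North Dakota", north_dakota)]:
    print(name, "mean:", mean(wages), "range:", value_range(wages))
```

Both samples share the same mean, while the ranges differ, which is the point the slides make about the mean alone not being enough.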
Range – Usage
● One use of the range is quality control, where absolute minima may need to be set for safety standards.
  – Example: Engineers test five airbags to find out how long they take to inflate. Results are 0.7, 0.8, 0.85, 0.95, and 0.8 seconds.
  – Range is 0.95 − 0.7 = 0.25 seconds.

Mean Absolute Deviation
● Is calculated as the average distance from the mean
● Given a sample of size n with mean \( \bar{X} \):
\[ \mathrm{MAD} = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n} \]

Mean Absolute Deviation – Example
● Wages in South Dakota (Mean $10)

Wage       |X_i − X̄|
5          |5 − 10|  = 5
7          |7 − 10|  = 3
11         |11 − 10| = 1
12         |12 − 10| = 2
15         |15 − 10| = 5
Avg: 10    Avg: 3.2

[Figure: the five wages on a number line, with each wage's distance from the mean marked]

Mean Absolute Deviation – Example
● Wages in North Dakota (Mean $10)

Wage       |X_i − X̄|
9          |9 − 10|   = 1.0
9.50       |9.5 − 10| = 0.5
10         |10 − 10|  = 0.0
9.50       |9.5 − 10| = 0.5
12         |12 − 10|  = 2.0
Avg: 10    Avg: 0.8

So the North Dakota sample has a smaller MAD, as we might hope.

Mean Absolute Deviation – Advantages
● Uses information from all observations
● Not as affected by outliers as the range
  – Every observation affects the MAD equally
● Relatively intuitive (compared to what's coming....)

Mean Absolute Deviation – Disadvantage
Absolute value turns out to have a difficult property:
[Figure: plot of |X| against X]
It's not smooth at zero. Since “zero” in this case is the sample mean, the MAD can move around strangely if our estimated mean changes.

Variance
● Is calculated as the average squared distance from the mean
● Given a population of size N with mean \( \bar{X} \):
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N} \]

Sample vs. Population Variance
● Notice that sample and population variance are calculated differently!
● Population Variance:
\[ \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N} \]
● Sample Variance:
\[ \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]
● In a sample, we don't know the mean exactly, so we use up one degree of freedom calculating it. This is like losing an observation in our sample.
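The MAD and the two variance formulas above can be sketched in Python as follows. This is an illustrative sketch, not the author's code: the function names and the `sample` flag are assumptions introduced here, and the data are the two wage samples from the slides.

```python
# Wage samples taken from the slides (dollars per hour).
south_dakota = [5, 7, 11, 12, 15]
north_dakota = [9, 9.50, 10, 9.50, 12]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def mean_absolute_deviation(values):
    """Average absolute distance of each observation from the mean."""
    center = mean(values)
    return sum(abs(x - center) for x in values) / len(values)


def variance(values, sample=True):
    """Average squared distance from the mean.

    sample=True divides by n - 1 (sample variance);
    sample=False divides by N (population variance).
    """
    center = mean(values)
    squared_distances = sum((x - center) ** 2 for x in values)
    divisor = len(values) - 1 if sample else len(values)
    return squared_distances / divisor


for name, wages in [("South Dakota", south_dakota), ("North Dakota", north_dakota)]:
    print(name)
    print("  MAD:                ", mean_absolute_deviation(wages))
    print("  sample variance:    ", variance(wages, sample=True))
    print("  population variance:", variance(wages, sample=False))
```

Dividing by n − 1 rather than N always gives the larger of the two variances for the same data, and the gap shrinks as the sample grows.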
Variance – Example
● Wages in South Dakota (Mean $10)

Wage       (X_i − X̄)²
5          (5 − 10)²  = 25
7          (7 − 10)²  = 9
11         (11 − 10)² = 1
12         (12 − 10)² = 4
15         (15 − 10)² = 25
Avg: 10    Sum: 64

\( \sigma^2 \): 64/(5 − 1) = 16

[Figure: the five wages on a number line, with each wage's squared distance from the mean marked]

Variance – Example
● Wages in North Dakota (Mean $10)

Wage       (X_i − X̄)²
9          (9 − 10)²   = 1
9.50       (9.5 − 10)² = 0.25
10         (10 − 10)²  = 0
9.50       (9.5 − 10)² = 0.25
12         (12 − 10)²  = 4
Avg: 10    Sum: 5.5

\( \sigma^2 \): 5.5/(5 − 1) = 1.375

So the North Dakota sample has a smaller variance, too.

Variance – Advantage
Variance has an advantage over MAD:
[Figure: plots of X² and |X| against X]
The function X² is smooth at zero. This means that a slightly misestimated mean will not have serious consequences. (This is a difficult point to explain fully.)

Consider two samples:
Sample 1 = {-2, -1, 0, 1, 2}
Sample 2 = {-2, -1, 0.1, 1, 2}
The mean of Sample 1 is 0 and the mean of Sample 2 is 0.02. The MAD of Sample 1 is 1.2 and the MAD of Sample 2 is 1.216, so one could say that Sample 2 is more dispersed than Sample 1, but does this conclusion seem correct? The sample standard deviation is 1.581 for Sample 1 and 1.582 for Sample 2, so now the conclusion is that the two samples are almost equally dispersed, which is more obvious given the data.

Disadvantages of Variance
● It's a bit harder to calculate and not as intuitive as MAD, let alone Range
● It's slightly more affected by outliers than MAD

Standard Deviation
● The variance is on the scale of the variable squared, not on the scale of the variable itself
● To get a statistic that is on the scale of the original variable, we take the (positive) square root of the variance. This is called the standard deviation.
\[ \sigma = \sqrt{\sigma^2} \]

MAD, Variance, and Standard Deviation Compared

             SOUTH DAKOTA    NORTH DAKOTA
Range        10              3
MAD          3.2             0.8
Variance     16              1.375
Std. Dev.    4               1.173

The standard deviation and the MAD are usually of the same order of magnitude.

The Rule of Thumb
● We've seen that a larger standard deviation means more dispersed data
● But what does a standard deviation mean?
  – “The standard deviation is 1.173”
  – “The standard deviation is 4”
  – Do the actual numbers matter?

The Rule of Thumb
● The typical case:
  – 68% of observations lie between \( \mu - \sigma \) and \( \mu + \sigma \)
  – 95% of observations lie between \( \mu - 2\sigma \) and \( \mu + 2\sigma \)
● Example: North Dakota. \( \mu = 10 \), \( \sigma = 1.17 \)
  – 68% of sample lies between 10 − 1.17 and 10 + 1.17, i.e., 8.83 to 11.17
  – 95% of sample lies between 10 − 2.34 and 10 + 2.34, i.e., 7.66 to 12.34
[Figure: 68% of observations lie between 8.83 and 11.17, centered on 10]

Chebyshev's Theorem
● A use of the standard deviation: says that at least a fraction \( 1 - 1/k^2 \) of the data lies within k standard deviations of the mean
[Figure: at least fraction \( 1 - 1/k^2 \) of the data lies between \( \mu - k\sigma \) and \( \mu + k\sigma \); at most fraction \( 1/k^2 \), combined, lies in the two tails]

Chebyshev's Theorem (cont'd)
● Chebyshev's theorem is usually considered conservative: typically a lot more than fraction \( 1 - 1/k^2 \) of the data lies in this range.

Chebyshev's Theorem
● Example: South Dakota wages. \( \mu = 10 \), \( \sigma = 4 \).
● Consider the case of k = 2 standard deviations. This would be the range 10 − 2∙4 to 10 + 2∙4, i.e., 2 to 18.
● Chebyshev guarantees that at least fraction \( 1 - 1/k^2 = 1 - 1/2^2 = 3/4 \) of the data lies in this range. In fact all of it does.
[Figure: with k = 2, \( k\sigma = 2 \cdot 4 = 8 \) on either side of \( \mu = 10 \), from \( \mu - k\sigma = 2 \) to \( \mu + k\sigma = 18 \); at least fraction 3/4 of the data lies in between]

InterQuartile Range
1. Write the data out in order
2. Break it into four parts, each with an equal number of observations
3. Pick the top number in the first part, and the top number in the third part
4. Subtract the former from the latter

InterQuartile Range – Example
● Ages of 16 students:
  23 21 23 19 23 20 17 21 24 37 21 20 19 22 18 24
● In order, this is:
  17 18 19 19 20 20 21 21 21 22 23 23 23 24 24 37
● Broken into four parts of 4 observations each:
  First (bottom) Quartile:   17 18 19 19
  Second Quartile:           20 20 21 21
  Third Quartile:            21 22 23 23
  Fourth (top) Quartile:     23 24 24 37
● INTERQUARTILE RANGE: 23 – 19 = 4

IQR: Advantages and Disadvantages
● The interquartile range is the least affected by outliers of all of the measures above
● It uses less data than the variance or the MAD and may therefore not reflect changes in the distribution in the bottom or top quartile

Skewness
● The skewness looks at the cube of the difference from the mean:
\[ \text{Skewness} = \frac{N \sum_{i=1}^{N} (X_i - \bar{X})^3}{(N - 1)(N - 2)} \]
● It is used to measure how symmetric the data are around the mean. Zero skewness means the data are symmetric; skewness can be positive or negative.

Positive Vs. Negative Skewness

Negative Skew      Positive Skew
● Left Tail        ● Right Tail
● Median > Mean    ● Median < Mean

[Figure: two skewed distributions; under negative skew the mean lies to the left of the median, under positive skew the mean lies to the right of the median]
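The four-step interquartile range procedure and the Chebyshev example from the earlier slides can be sketched in Python as below. This is a sketch under stated assumptions: the function names are hypothetical, `slide_iqr` follows the slides' "top number of the first and third parts" rule rather than the interpolated quartiles most statistics packages use, and it assumes the number of observations is a multiple of four.

```python
def slide_iqr(values):
    """Interquartile range using the four-step method from the slides.

    Assumes len(values) is a positive multiple of 4 so the data split
    into four parts with an equal number of observations.
    """
    ordered = sorted(values)                   # step 1: write the data in order
    n = len(ordered)
    if n == 0 or n % 4 != 0:
        raise ValueError("number of observations must be a positive multiple of 4")
    part = n // 4                              # step 2: four equal parts
    top_of_first = ordered[part - 1]           # step 3: top of the first part...
    top_of_third = ordered[3 * part - 1]       # ...and top of the third part
    return top_of_third - top_of_first         # step 4: subtract


def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean."""
    return 1 - 1 / k ** 2


def fraction_within(values, center, spread, k):
    """Observed fraction of values in [center - k*spread, center + k*spread]."""
    low, high = center - k * spread, center + k * spread
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)


# Ages of the 16 students from the slides.
ages = [23, 21, 23, 19, 23, 20, 17, 21, 24, 37, 21, 20, 19, 22, 18, 24]
print("IQR of ages:", slide_iqr(ages))

# South Dakota wages, with the mean (10) and standard deviation (4) from the slides.
south_dakota = [5, 7, 11, 12, 15]
print("Chebyshev bound for k=2:", chebyshev_lower_bound(2))
print("Observed fraction:", fraction_within(south_dakota, 10, 4, 2))
```

The observed fraction for the South Dakota wages exceeds the Chebyshev bound, in line with the slides' remark that the theorem is conservative. Note also that the outlying age of 37 has no effect on the IQR.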