Chapter 7: Summarizing and Displaying Measurement (Quantitative) Data

Top Ten Athlete Earnings 2009

Athlete             2009 Earnings   In Millions
Tiger Woods         99737626        99*
Phil Mickelson      52950356        53
LeBron James        42410581        42
Alex Rodriguez      39000000        39
Shaquille O'Neal    35000000        35
Kevin Garnett       34750000        35
Kobe Bryant         31262500        31
Allen Iverson       28937500        29
Derek Jeter         28500000        29
Peyton Manning      27000000        27

* Woods should round to 100, but to keep all values at two digits I rounded down to 99.

Numbers to describe data:
- mode, which is the most frequently observed data value (here there are two modes: 29 and 35)
- median and quartiles
- mean

The position of the median is found by (n+1)/2, which for our data set is (10+1)/2 = 5.5. This means the median is located halfway between the 5th and 6th observations in the ordered data. To calculate the median we take the midpoint of the 5th and 6th observations, which here is the midpoint between 35 and 35, or again 35. The interpretation is that 50% of the top ten earnings were 35 million or less.

Similarly, the position of the quartiles can be found by calculating the "median of the median"! That is, the position of the first quartile, called Q1, is found by (n+1)/4 = (10+1)/4 = 2.75. For us this places the first quartile between the 2nd and 3rd ranked salaries. Both of these are 29, so Q1 = 29. The interpretation is that 25% of the top ten earnings were 29 million or less.

The position of the third quartile, called Q3, is found by 3*(n+1)/4 = 3*(10+1)/4 = 8.25. For us this places Q3 between the 8th and 9th observations, 42 and 53. Taking the point halfway between them gives Q3 = 47.5. The interpretation is that 75% of the top ten earnings were 47.5 million or less.

- mean (sum of all values divided by the total number summed): mean = 419/10 = 41.9
- variability measures:
  i. range (max - min = 99 - 27 = 72)
  ii. Interquartile Range or IQR = Q3 - Q1 = 47.5 - 29 = 18.5. The IQR represents how spread out the middle 50% of the observations are, since this measure is the difference between the third and first quartiles (i.e. 75% - 25% = 50%).
  iii. variance and standard deviation (or SD or Std. Dev.). These two are measures of “how spread out the values are from the mean” and are related: the standard deviation is the square root of the variance. It is not important to know how to calculate these, but it is important to know their meaning. A basic interpretation of the standard deviation is that it is roughly the average distance of the observed values from their mean. In our example the SD is 21.5, making the variance 21.5^2 = 462.25.
- Five number summary: min (27), Q1 (29), Median (35), Q3 (47.5) and max (99)

NOTE: Sometimes these values are not possible outcomes (e.g. the mean number of children in a US household is 2.2). In that case we do NOT round the number to a whole number (e.g. we would not round this to 2). The value is important as it tells us that the mean number of children is more than 2 but less than 3.

Graphing measurement data and shape:
- stem and leaf
- histogram
- boxplot
- symmetric (or bell shaped), skewed, and outliers

Stem-and-Leaf Display: In Millions
2 | 799
3 | 1559
4 | 2
5 | 3
6 |
7 |
8 |
9 | 9

[Figure: Histogram of In Millions (horizontal axis: In Millions; vertical axis: Frequency)]

This histogram would be interpreted as right-skewed or positively skewed, since the extreme observations are “pulling” or “stretching” the data to the right, or in a positive direction. For such distributions we would expect the mean to be more than, or to the right of, the median. This is the case for this example, as the mean of 41.9 is greater than the median of 35.

Building a “fence” around the data to determine extreme observations or outliers

We can use the quartiles and IQR to build a fence around our data in order to determine whether any observations in our data set can be considered extreme, or outliers.
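The quartile positions, IQR, mean, and SD worked out above can be checked with a short Python sketch. The helper names (`value_at_position`, `five_number_summary`) are made up for illustration. The sketch assumes the convention used in these notes, where a fractional position means "take the point halfway between the two neighbouring observations"; statistical software often interpolates differently and may report slightly different quartiles.

```python
from statistics import mean, stdev

# Top-ten 2009 earnings in millions, as rounded in the table above.
earnings = [99, 53, 42, 39, 35, 35, 31, 29, 29, 27]


def value_at_position(sorted_values, position):
    """Return the value at a 1-based position in already-sorted data.

    Assumption (the convention of these notes): a fractional position such
    as 5.5 or 8.25 means "between the two neighbouring observations", and
    the midpoint of those two observations is returned.
    """
    lower = int(position)  # rank of the observation at or just below
    if position == lower:
        return sorted_values[lower - 1]
    return (sorted_values[lower - 1] + sorted_values[lower]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the (n+1) position rules."""
    data = sorted(values)
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 observations for these positions")
    q1 = value_at_position(data, (n + 1) / 4)
    median = value_at_position(data, (n + 1) / 2)
    q3 = value_at_position(data, 3 * (n + 1) / 4)
    return data[0], q1, median, q3, data[-1]


minimum, q1, median, q3, maximum = five_number_summary(earnings)
iqr = q3 - q1  # spread of the middle 50% of the observations

print("five number summary:", minimum, q1, median, q3, maximum)
print("range:", maximum - minimum, "IQR:", iqr)
print("mean:", mean(earnings), "SD:", round(stdev(earnings), 1))
```

Run on the earnings data, this reproduces the values in the notes: a five number summary of 27, 29, 35, 47.5, 99, an IQR of 18.5, a mean of 41.9, and an SD of about 21.5. The quartiles and the IQR are the inputs needed for the fence.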
This fence is built by:
Lower Fence: Q1 - 1.5*IQR
Upper Fence: Q3 + 1.5*IQR

For this data set, with Q1 of 29, Q3 of 47.5, and IQR of 18.5, the fence is:
Lower: 29 - 1.5*18.5 = 29 - 27.75 = 1.25
Upper: 47.5 + 1.5*18.5 = 47.5 + 27.75 = 75.25

Looking at the data, we see only one value that does not lie within this “fence”: Tiger Woods with 99 million dollars. Therefore, for this data set 99 would be classified as an outlier.

Creating a Box Plot

A box plot is simply a picture of the median and the quartiles, which form a “box”, together with “whiskers” that represent the fence. This plot is sometimes referred to as a box-and-whisker plot. Below is a box plot of our data. The box itself runs from Q1 to Q3, with the line within the box marking the median (note how the box goes from 29 to 47.5, the Q1 and Q3, with the line within this box at 35, the median). The whiskers extend out from both sides of the box, but only as far as the observed value closest to the lower or upper fence without going past it. For example, the lower fence is at 1.25, but the closest observation to this value without going below it is 27; thus the whisker stops at 27. The other whisker goes to 53, since this is the largest observed value that does not exceed the upper fence of 75.25.

[Figure: Boxplot of In Millions (axis: In Millions)]

The effect of these extreme observations, also called outliers, is greatest on the mean and the standard deviation/variance, since the latter uses the mean in its calculation. This effect arises because the mean takes into account the values of all observations, meaning that Tiger Woods' earnings have a greater impact on the mean than they do on the median (and thus less of an impact on the quartiles and IQR). The range, too, is greatly affected by outliers.

Empirical Rule

When the data is symmetric or bell-shaped, the use of the SD is quite helpful.
For data shaped this way, you would find that roughly 68% of the observations fall within +/- one SD of the mean; 95% of the observations fall within +/- two standard deviations of the mean; and almost all (99.7%) of the observations fall within +/- three standard deviations of the mean. For example, the math portion of the SAT test is typically bell-shaped with a mean of 500 and a standard deviation of 100. Thus we would expect that 68% of all scores would be between 400 and 600; 95% would fall between 300 and 700; and almost all (99.7%) would fall between 200 and 800.
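The SAT intervals above can be reproduced with a minimal Python sketch. The function name `empirical_rule_intervals` is hypothetical, and the cross-check with `statistics.NormalDist` assumes the scores follow an exactly normal curve, which real data only approximates.

```python
from statistics import NormalDist


def empirical_rule_intervals(mean, sd):
    """Return {k: (low, high, share)} for k = 1, 2, 3 standard deviations.

    The shares are the rounded percentages quoted by the Empirical Rule.
    """
    shares = {1: 0.68, 2: 0.95, 3: 0.997}
    return {k: (mean - k * sd, mean + k * sd, share)
            for k, share in shares.items()}


# SAT math example from the notes: mean 500, standard deviation 100.
sat = empirical_rule_intervals(mean=500, sd=100)

# Assumption: an exactly normal distribution, used only as a cross-check.
normal = NormalDist(mu=500, sigma=100)

for k, (low, high, share) in sat.items():
    exact = normal.cdf(high) - normal.cdf(low)
    print(f"within {k} SD: {low} to {high}, "
          f"rule says about {share:.1%}, normal curve gives {exact:.2%}")
```

The rule's 68%, 95%, and 99.7% are rounded versions of the normal-curve shares, so the two columns should agree closely but not exactly. For skewed data such as the athlete earnings, these percentages should not be expected to hold.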