Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Demonstrations II Scott Stevens J. Arithmetic Mean (Average) [For the mean from grouped data, see M.] Problem: During the 1980 presidential campaign, Ronald Reagan repeatedly asked voters if they were better off in 1980 than they were 4 years before. Here are some data (in yellow) on the unemployment and inflation rates during the Carter Administration (’77 to ’80) and the Reagan Administration (’81 to ’88). Is there any difference between the average unemployment rate in the two administrations? Average inflation rate? This is quite easy to compute by hand. Assuming that by "average" the problem means "mean", we just take the numbers to be averaged, add them, then divide by the number of numbers added. We can also do this with the Excel =AVERAGE(range) function. A 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Year 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 B Unemployment 7.1 6.1 5.8 7.1 7.6 9.7 9.6 7.5 7.2 7 6.2 5.5 Carter Average Reagan Average 6.525 7.5375 C Inflation 6.7 9 13.3 12.5 8.9 3.8 9.6 7.5 7.2 7 6.2 5.5 10.375 6.9625 So the Carter unemployment average, for example, was computed as = AVERAGE(B2:B5), since the unemployment figures for the Carter years were in spreadsheet cells B2 through B5. The result show that unemployment was lower during the Carter years, but that inflation was considerably lower during the Reagan years. The question that remains is whether the differences in these figures are too large to be credited to random fluctuation alone. That question is one we take up later in the semester. It’s important to note that the average, a single number, cannot tell the whole story of a data set. Be careful what conclusions you draw from a “one number summary”! K. Median (and comparison with the mean) Problem: Below is a table of the rainfall recorded in the Los Angeles area in the last 10 years. (Unlike most of these demonstrations, I’ve made up this data for this one.) Compute the median rainfall based on these data. Without computing the mean, state whether the mean rainfall for this ten year period would be above, below, or equal to this median value. 1 rainfall in a year 3 median rainfall 4 5 5 5 6 6 14 17 5.5 The first step is to sort the data, smallest to largest –I’ve already done this with the data. Then, in a set of 10 observations, the median should be observation number (10+1)/2 = 5.5…that is, the number halfway between observation #5 and observation #6. Since #5 is 5” and #6 is 6” in this case, the median is 5.5”. You can also get this from Excel, by using the =MEDIAN(range) command. Now, how does this compare to the mean? If you imagine these bars sitting on a see-saw along the number line of the x-axis, the mean would be the balance point of the see-saw. (Think of where you’d have to put the fulcrum (or balance point) of the see-saw to make the “kids” balance. You should be able to see that it’s somewhere near where I put it below, at 8.5. The three little kids on the right side at positions 14, 17, and 20 will just balance the two small and two big kids on the left side (at positions 3, 4, 5, and 6).) 4 Big kid at the "5" mark weighs 3 units 3 2 Little kid way out at the "20" mark weighs 1 unit 1 0 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Balance point On the other hand, the median point for a distribution is the point at which half of the "stuff" is to the left and half of the "stuff" is to the right. If we stick with our see-saw imagery, "stuff" means "weight", 2 20 and I hope that you can see that 5 units of weight lie to the left of 5.5 (the median value), and 5 units of weight lie to the right of 5.5 This way of thinking about mean and median is quite useful, so take some time to lock it down. What you can learn from this example holds in general. Note, for example: If the last kid slides from position 20 to position 25 on the see-saw, the balance point would shift to the right (a higher mean), but the median ("half-weight") point would remain unchanged. In general, the mean is more affected by extreme values than then median. If the distribution has a long "tail" in one direction (like this one does to the right—it is "skewed right"), then the mean tends to be more than the median. In general, if a distribution is unimodal (one highest point) and skewed, then as you run up the long slope of the "hill", you'll encounter the mean, median, and mode, in that order. The mode, of course, will be at the top of the hill. If the distribution is symmetric ("mirror imaged, left to right"), then the mean and median are equal. If the symmetric distribution is also unimodal (one highest point), then the mode is also equal to the mean and the median. Don't memorize these results—understand them. And think about what the measures given in a given problem really tell you. For example, suppose you’re told that the average number of people in a US household is 2.4 persons. This is the mean (since it clearly isn't the median or mode!). It is the "balance point" of the population distribution. Saying it another way—if all of the people in all of the households were gathered together, and then distributed evenly among all of those households…well, we'd have a bloody mess, since each household would get 2.4 people. But that doesn't mean that 2.4 is "typical", or even close to "typical". For example, if 80% of all households have 1 person, and the remaining 20% all had 8 persons, the average would be 2.4. The mean is a measure of "middle", but we almost always need a measure of "spread", as well. The most common is standard deviation. L. Box Plots (Modified and Unmodified), Quartiles, and IQR Problem: In the yellow box below, find the annual inches of precipitation at the Los Angeles Civic Center for the years 1961 to 1990. Summarize this data with a boxplot and modified box plot. It's not hard to do this work by hand, but the goal of this course is to give you tools that you can use effectively and responsibly. To help you with this, I've written a number of spreadsheet templates that will perform common statistical tasks, to supplement those functions already a part of Excel. Sometimes, my templates duplicate functions already available in Excel. When I do this, it is for one of two reasons. Either 1) Excel's built-in function is restrictive about how the input data must be supplied, or 2) Excel's built-in function is not helpful in understanding how the answer is obtained. Since you must understand a tool clearly in order to use it effectively, my template often provide a "step-by-step" approach. I'll also be providing templates to perform some of the tasks that Excel can't do at all, or can only do incompletely. That's what I've done for this example. I've created a couple of templates (available on the website) to do box plots and modified box plots. . Before you start using my templates, though, there are some general things you should know about them. Please check the website post entitled Using Stevens’ Statistical Templates: Useful Information. It uses this problem as an example. 3 Data Outlier ? 4.56 5.83 6.49 6.54 7.58 7.98 8.9 8.92 9.11 9.26 9.98 10.7 10.92 11.01 12.31 12.91 14.41 14.97 15.37 16.54 16.69 17 17.45 18 23.66 26.32 26.33 26.81 30.57 outlier 34.04 outlier Statistics Values mean, x-bar standard deviation, s 2 deviations below mean 1 deviation below mean 1 deviation above mean 2 deviations above mean Formulas 14.70533 7.810534 -0.91574 6.894799 22.51587 30.3264 =AVERAGE(range) = STDEV(range) = x-bar - 2 * s = x-bar - 1 * s = x-bar + 1 * s = x-bar + 2 * s 12.61 8.9675 17.3375 8.37 4.56 29.8925 =MEDIAN(range) =QUARTILE(range, 1) =QUARTILE(range, 3) = 3rd quartile - 1st quartile = 1st quartile - 1.5 * IQR or smallest val = 3rd quartile + 1.5 * IQR or largest val median 1st quartile 3rd quartile IQR lower whisker upper whisker Frequency Table for Histogram Modified Boxplot category size 5 Values from to 4 9 9 14 14 19 19 24 24 29 29 34 34 39 0 10 20 30 frequency 8 8 8 1 3 1 1 40 Annual Rainfall (inches) We compute the numbers needed for a modified box plot and unmodified box plot. Let's start with the modified box plot. Note the commands that Excel uses to find median, 1 st quartile, and 3rd quartile. The interquartile range (IQR) is just the difference between the first and third quartile. Formulas for first and third quartile. Different sources compute the quartiles slightly differently. Excel computes the first quartile as observations number (n +3)/4 in the sorted list of n observations, while your text uses observation number (n+1)/4. The median is observation (n+1)/2, as in your book. The third quartile is computed in Excel as observation number (3n+1)/4, while your book uses observation number (3n+3)/4. We’ll be happy with either of these calculation rules. (Here, for example, the first quartile turns out to be observation (30 + 3)/4 = 33/4 = 8.25. What is observation number 8.25? It's ¼ of the way from observation #8 to observation #9 on the sorted list. #8 is 8.92 and #9 is 9.11. You can find the number that is ¼ of the way from A to B by computing (0.75 A) + (0.25 B). So, for our data, this is 0.75(8.92) + 0.25(9.11) = 8.9675, as reported.) The Excel command for the first quartile is, as you can see, = QUARTILE(range, 1). The third quartile replaces the "1" with a "3": =QUARTILE(range, 3). 4 In the modified box plot, the lower whisker extends down to 1.5 * IQR below the 1st quartile, and the upper whisker extends up to 1.5 * IQR above the 3rd quartile. Any data points beyond the whisker's ends are marked with dots, and identified as outliers. With the unmodified box plot, the whiskers extend all the way to the most extreme data values—the maximum and minimum observations. My spreadsheet here computes the "whisker's end" values for both cases, and I’ve provided two different spreadsheet templates on the website to create the two kinds of box plots. For this course, you’ll be responsible only for creating the unmodified box plots. Modified Boxplot 0 10 20 30 Unmodified Boxplot (same data) 40 0 Annual Rainfall (inches) 10 20 30 40 Annual Rainfall (inches) Let's take a look at the modified box plot, since we can use it to talk about both types of box plots. We can see that there are no outliers on the lower end; no observed rainfall is more than 1.5 IQRs from the 1st quartile, so the lower tail stops at the lowest observation. The central box shows the 1 st quartile, median, and 3rd quartile. Remember what this means. 25% of all observations represent rainfall below the "left wall" of this box (about 8.97"). 25% of all observations lie between the left wall and the median line. 25% more lie between this median line and the "right wall" of the box (about 17.34"). Finally, the highest 25% of the rainfalls fall to the right of the "right wall" of the box. To get a rough idea of what the histogram for this data would look like, you can imagine dumping the same amount of water in each of these four “compartments”. The “water” between the 1 st quartile and the median would be higher than any other “compartment”, indicating that the numbers are crunched together there more than anywhere else. Conversely, the numbers from the third quartile to the maximum value are “spread out”—they’re not packed in to their interval as densely. What else? We see that there are two observations that fall above the end of the upper whisker. We identify these as outliers—both correspond to more than 30 inches of rain. And the unmodified box plot? How does it differ? Only in that the upper whisker extends to include both of the pink outliers. While I expect you to be able to create an unmodified box plot without needing my spreadsheet, it's unlikely you'll be able to do it (without my help) in Excel. Excel doesn't support box plots, so I did a fair amount of work to make it draw them, anyway. When you use my spreadsheet on other data, be sure to change the axis name ("Annual Rainfall (inches)") so that it fits your problem. 5 M. The mean, from grouped data [For the mean of ungrouped data, see J.] Problem: In a random sample of 50 college students, 5 said that they sit “in the very front” of the class and 21 said that they sit “toward the front”. The GPA of the students in the very front was 10.94 (on a 12 point scale) while for the students who sat “toward the front”, the average GPA was 9.38. What is the average GPA of all 26 of these students? The answer has to come out exactly as if 5 students had GPAs of 10.94 (the average for the “front” group) and 21 students have GPAs of 9.38 (the “near front” group). Computing this is easy: Average 26 numbers, 5 of which are 10.94 and 21 of which are 9.38. If you think for a moment, you’ll realize that the math is going to look like [(5 10.94) + (21 9.38)]/(5 + 21). We can generalize this work to find the mean from any set of grouped data. Finding the Mean from Grouped Data in Excel Excel (as well as a number of software packages) expects that, when you want to do statistics, you'll type in every single data point. Sometimes, though, like in this problem, you don't really want to do that. You want to enter the different values observed, and how many times each value was observed. Happily, Excel can still easily compute the average of data presented in this way. Here's how you do it. 1. Enter your data in two columns, "value" and "frequency". 2. Compute the average of the data with the command =SUMPRODUCT(valuerange, frequencyrange)/SUM(frequencyrange) Here, valuerange refers to the cells containing the observed values (the numbers in the "value" column). frequencyrange refers to the cells containing the number of times each value is observed (the numbers in the "frequency" column). You can also find the mean of grouped data from a relative frequency distribution. The formula is even simpler: =SUMPRODUCT(valuerange, relfreqrange) where relfreqrange is the range of cells containing the relative frequencies of the observed values. We'll use this here. Location front toward front mean # of students Average GPA 5 10.94 21 9.38 9.68 =SUMPRODUCT(C2:C3,B2:B3)/SUM(B2:B3) You could, of course, have typed the 26 numbers in separately, then used the =AVERAGE command. 6