Demonstrations II: Measures of Central Tendency and Dispersion

Demonstrations II
Dr. Scott Stevens

N. Arithmetic mean (Average)
O. Median (and comparison with the mean)
P. Box plots (modified and unmodified), quartiles, and IQR
Q. The mean, from grouped data
R. Comparison of measures of central tendency: mean, median, mode, midrange, midhinge
S. Standard deviation and variance

N. Arithmetic Mean (Average)
[For the mean from grouped data, see Q.]

Problem: During the 1980 presidential campaign, Ronald Reagan repeatedly asked voters if they were better off in 1980 than they were 4 years before. Here are some data (in yellow) on the unemployment and inflation rates during the Carter Administration ('77 to '80) and the Reagan Administration ('81 to '88). Is there any difference between the average unemployment rate in the two administrations? Average inflation rate?

This is quite easy to compute by hand. Assuming that by "average" the problem means "mean", we just take the numbers to be averaged, add them, then divide by the number of numbers added. We can also do this with the Excel =AVERAGE(range) function.

        A                 B               C
  1     Year              Unemployment    Inflation
  2     1977              7.1             6.7
  3     1978              6.1             9
  4     1979              5.8             13.3
  5     1980              7.1             12.5
  6     1981              7.6             8.9
  7     1982              9.7             3.8
  8     1983              9.6             9.6
  9     1984              7.5             7.5
 10     1985              7.2             7.2
 11     1986              7               7
 12     1987              6.2             6.2
 13     1988              5.5             5.5

        Carter Average    6.525           10.375
        Reagan Average    7.5375          6.9625

So the Carter unemployment average, for example, was computed as =AVERAGE(B2:B5), since the unemployment figures for the Carter years were in spreadsheet cells B2 through B5. The results show that unemployment was lower during the Carter years, but that inflation was considerably lower during the Reagan years. The question that remains is whether the differences in these figures are too large to be credited to random fluctuation alone. That question is one we take up later in the semester.

It's important to note that the average, a single number, cannot tell the whole story of a data set.
Be careful what conclusions you draw from a "one number summary"!

O. Median (and comparison with the mean)

Problem: Below is a table of the rainfall recorded in the Los Angeles area in the last 10 years. (Unlike most of these demonstrations, I've made up the data for this one.) Compute the median rainfall based on these data. Without computing the mean, state whether the mean rainfall for this ten-year period would be above, below, or equal to this median value.

rainfall in a year (inches):   3   4   5   5   5   6   6   14   17   20
median rainfall:               5.5

The first step is to sort the data, smallest to largest. I've already done this with the data. Then, in a set of 10 observations, the median should be observation number (10+1)/2 = 5.5; that is, the number halfway between observation #5 and observation #6. Since #5 is 5" and #6 is 6" in this case, the median is 5.5". You can also get this from Excel, by using the =MEDIAN(range) command.

Now, how does this compare to the mean? If you imagine these bars sitting on a see-saw along the number line of the x-axis, the mean would be the balance point of the see-saw. (Think of where you'd have to put the fulcrum, or balance point, of the see-saw to make the "kids" balance. You should be able to see that it's somewhere near where I put it below, at 8.5. The three little kids on the right side at positions 14, 17, and 20 will just balance the two small and two big kids on the left side, at positions 3, 4, 5, and 6.)

[Figure: histogram of the rainfall data drawn as a see-saw. Horizontal axis: rainfall, 3 to 20. Vertical axis: frequency, 0 to 4. Labels: "Big kid at the '5' mark weighs 3 units"; "Little kid way out at the '20' mark weighs 1 unit"; "Balance point".]

On the other hand, the median point for a distribution is the point at which half of the "stuff" is to the left and half of the "stuff" is to the right.
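The see-saw picture can be checked numerically. The following is a minimal Python sketch, not part of the original handout (which uses Excel); the variable names are my own. It assumes the ten rainfall values listed above and uses the standard-library statistics module to confirm that the deviations from the mean cancel out (the "balance") and that the median splits the weight in half.

```python
from statistics import mean, median

# The ten (made-up) rainfall values from the problem, already sorted.
rainfall = [3, 4, 5, 5, 5, 6, 6, 14, 17, 20]

balance_point = mean(rainfall)      # the mean: where the see-saw balances
half_weight_point = median(rainfall)  # the median: half the weight on each side

# At the balance point, the "torques" (signed distances from the fulcrum)
# on the left and right cancel, so the deviations sum to zero.
net_torque = sum(x - balance_point for x in rainfall)

# At the median, equal weight lies on each side.
weight_left = sum(1 for x in rainfall if x < half_weight_point)
weight_right = sum(1 for x in rainfall if x > half_weight_point)

print("mean:", balance_point)
print("median:", half_weight_point)
print("net torque about the mean:", net_torque)
print("weight left / right of median:", weight_left, weight_right)
```

The same two checks work for any data set: deviations from the mean always sum to zero, while the median is defined by counts rather than distances.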
If we stick with our see-saw imagery, "stuff" means "weight", and I hope that you can see that 5 units of weight lie to the left of 5.5 (the median value), and 5 units of weight lie to the right of 5.5.

This way of thinking about mean and median is quite useful, so take some time to lock it down. What you can learn from this example holds in general. Note, for example:

- If the last kid slides from position 20 to position 25 on the see-saw, the balance point would shift to the right (a higher mean), but the median ("half-weight") point would remain unchanged. In general, the mean is more affected by extreme values than the median.
- If the distribution has a long "tail" in one direction (like this one does to the right; it is "skewed right"), then the mean tends to be more than the median. In general, if a distribution is unimodal (one highest point) and skewed, then as you run up the long slope of the "hill", you'll encounter the mean, median, and mode, in that order. The mode, of course, will be at the top of the hill.
- If the distribution is symmetric ("mirror imaged, left to right"), then the mean and median are equal. If the symmetric distribution is also unimodal (one highest point), then the mode is also equal to the mean and the median.

Don't memorize these results; understand them. And think about what the measures given in a particular problem really tell you. For example, suppose you're told that the average number of people in a US household is 2.4 persons. This is the mean (since it clearly isn't the median or mode!). It is the "balance point" of the population distribution. Saying it another way: if all of the people in all of the households were gathered together, and then distributed evenly among all of those households... well, we'd have a bloody mess, since each household would get 2.4 people. But that doesn't mean that 2.4 is "typical", or even close to "typical".
For example, if 80% of all households had 1 person, and the remaining 20% all had 8 persons, the average would be 2.4. The mean is a measure of "middle", but we almost always need a measure of "spread", as well. The most common is standard deviation.

P. Box Plots (Modified and Unmodified), Quartiles, and IQR

Problem: In the data table below (the yellow box), find the annual inches of precipitation at the Los Angeles Civic Center for the years 1961 to 1990. Summarize this data with a boxplot and modified box plot.

It's not hard to do this work by hand, but the goal of this course is to give you tools that you can use effectively and responsibly. To help you with this, I've written a number of spreadsheet templates that will perform common statistical tasks, to supplement those functions already a part of Excel. Sometimes, my templates duplicate functions already available in Excel. When I do this, it is for one of two reasons. Either 1) Excel's built-in function is restrictive about how the input data must be supplied, or 2) Excel's built-in function is not helpful in understanding how the answer is obtained. Since you must understand a tool clearly in order to use it effectively, my templates often provide a "step-by-step" approach. I'll also be providing templates to perform some of the tasks that Excel can't do at all, or can only do incompletely.

That's what I've done for this example. The template (available at my website) is Frequency Distribution, Histogram, and Box and Whisker Plot. Before you start using my templates, though, there are some general things you should know about them. Please check the website post entitled Using Stevens' Statistical Templates: Useful Information. It uses this problem as an example.

Data (sorted, in inches):
 4.56    5.83    6.49    6.54    7.58    7.98    8.9     8.92    9.11    9.26
 9.98   10.7    10.92   11.01   12.31   12.91   14.41   14.97   15.37   16.54
16.69   17      17.45   18      23.66   26.32   26.33   26.81   30.57   34.04
Outlier? The template flags the last two values, 30.57 and 34.04, as outliers.

Statistics                Values         Formulas
mean, x-bar               14.70533333    =AVERAGE(range)
smallest entry            4.56           =MIN(range)
largest entry             34.04          =MAX(range)
number of observations    30             =COUNT(range)
median                    12.61          =MEDIAN(range)
1st quartile              8.9675         =QUARTILE(range, 1)
3rd quartile              17.3375        =QUARTILE(range, 3)
IQR                       8.37           = 3rd quartile - 1st quartile
lower whisker             4.56           = 1st quartile - 1.5 * IQR or smallest val
upper whisker             29.8925        = 3rd quartile + 1.5 * IQR or largest val
suggested # of cats       5
category size             5

Frequency Table for Histogram
At least…    …but less than…    frequency
 4            9                 8
 9           14                 8
14           19                 8
19           24                 1
24           29                 3
29           34                 1
34           39                 1

We compute the numbers needed for a modified box plot and an unmodified box plot. Let's start with the modified box plot. Note the commands that Excel uses to find the median, 1st quartile, and 3rd quartile. The interquartile range (IQR) is just the third quartile minus the first quartile.

Formulas for first and third quartile. Different sources compute the quartiles slightly differently. Excel computes the first quartile as observation number (n+3)/4 in the sorted list of n observations, while your text uses observation number (n+1)/4. The median is observation (n+1)/2, as in your book. The third quartile is computed in Excel as observation number (3n+1)/4, while your book uses observation number (3n+3)/4. We'll be happy with either of these calculation rules.

(Here, for example, the first quartile turns out to be observation (30+3)/4 = 33/4 = 8.25. What is observation number 8.25? It's 1/4 of the way from observation #8 to observation #9 on the sorted list. #8 is 8.92 and #9 is 9.11. You can find the number that is 1/4 of the way from A to B by computing (0.75 × A) + (0.25 × B). So, for our data, this is 0.75(8.92) + 0.25(9.11) = 8.9675, as reported.)

The Excel command for the first quartile is, as you can see, =QUARTILE(range, 1). The third quartile replaces the "1" with a "3": =QUARTILE(range, 3).
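The "observation number" rules above can be sketched in code. This is a minimal Python illustration, not the author's template and not Excel's actual implementation: the helper names value_at_position and quartiles are hypothetical, and the code simply applies the position rules quoted above ((n+3)/4 and (3n+1)/4 for Excel, (n+1)/4 and (3n+3)/4 for the textbook) with the "fraction of the way from A to B" interpolation. Run on the rainfall data with the Excel rule, it should reproduce the quartiles, IQR, whisker ends, and outliers in the statistics table.

```python
# Annual rainfall (inches), Los Angeles Civic Center, 1961 to 1990, from the handout.
rainfall = [
    4.56, 5.83, 6.49, 6.54, 7.58, 7.98, 8.9, 8.92, 9.11, 9.26,
    9.98, 10.7, 10.92, 11.01, 12.31, 12.91, 14.41, 14.97, 15.37, 16.54,
    16.69, 17, 17.45, 18, 23.66, 26.32, 26.33, 26.81, 30.57, 34.04,
]


def value_at_position(sorted_data, position):
    """Value at a 1-based, possibly fractional, observation number."""
    whole = int(position)            # e.g. 8 for "observation 8.25"
    fraction = position - whole      # e.g. 0.25
    lower_value = sorted_data[whole - 1]
    if fraction == 0:
        return lower_value
    upper_value = sorted_data[whole]
    # "fraction of the way from A to B" = (1 - fraction) * A + fraction * B
    return (1 - fraction) * lower_value + fraction * upper_value


def quartiles(sorted_data, rule="excel"):
    """Return (Q1, Q3) using the position rules quoted in the text."""
    n = len(sorted_data)
    if rule == "excel":
        q1_position, q3_position = (n + 3) / 4, (3 * n + 1) / 4
    else:  # the textbook rule
        q1_position, q3_position = (n + 1) / 4, (3 * n + 3) / 4
    return (value_at_position(sorted_data, q1_position),
            value_at_position(sorted_data, q3_position))


data = sorted(rainfall)
q1, q3 = quartiles(data, rule="excel")
iqr = q3 - q1

# Modified box plot: whiskers reach at most 1.5 * IQR past the box,
# or stop at the most extreme observation if that comes first.
lower_whisker = max(q1 - 1.5 * iqr, data[0])
upper_whisker = min(q3 + 1.5 * iqr, data[-1])
outliers = [x for x in data if x < lower_whisker or x > upper_whisker]

print("Q1, Q3, IQR:", round(q1, 4), round(q3, 4), round(iqr, 4))
print("whisker ends:", round(lower_whisker, 4), round(upper_whisker, 4))
print("outliers:", outliers)
print("textbook-rule quartiles:", quartiles(data, rule="textbook"))
```

Switching rule="excel" to rule="textbook" changes the quartiles only slightly for a data set of this size, which is why either rule is acceptable here.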
In the modified box plot, the lower whisker extends down at most 1.5 * IQR below the 1st quartile, and the upper whisker extends up at most 1.5 * IQR above the 3rd quartile; if the smallest (or largest) observation is reached first, the whisker stops there. Any data points beyond the whiskers' ends are marked with dots and identified as outliers. With the unmodified box plot, the whiskers extend all the way to the most extreme data values: the maximum and minimum observations. My spreadsheet here computes the "whisker's end" values for both cases, and I've provided two different spreadsheet templates on the website to create the two kinds of box plots. For this course, you'll be responsible only for creating the unmodified box plots.

[Figure: two box plots of the same data, "Modified Boxplot" and "Unmodified Boxplot (same data)", each on a horizontal axis labeled "Annual Rainfall (inches)" running from 0 to 40.]

Let's take a look at the modified box plot, since we can use it to talk about both types of box plots. We can see that there are no outliers on the lower end; no observed rainfall is more than 1.5 IQRs below the 1st quartile, so the lower whisker stops at the lowest observation. The central box shows the 1st quartile, median, and 3rd quartile. Remember what this means. 25% of all observations represent rainfall below the "left wall" of this box (about 8.97"). 25% of all observations lie between the left wall and the median line. 25% more lie between this median line and the "right wall" of the box (about 17.34"). Finally, the highest 25% of the rainfalls fall to the right of the "right wall" of the box.

To get a rough idea of what the histogram for this data would look like, you can imagine dumping the same amount of water in each of these four "compartments". The "water" between the 1st quartile and the median would be higher than in any other "compartment", indicating that the numbers are crunched together there more than anywhere else. Conversely, the numbers from the third quartile to the maximum value are "spread out"; they're not packed in to their interval as densely. What else?
We see that there are two observations that fall above the end of the upper whisker. We identify these as outliers; both correspond to more than 30 inches of rain.

And the unmodified box plot? How does it differ? Only in that the upper whisker extends to include both of the outliers (shown in pink on the modified plot). This is the only kind of boxplot that your book uses. While I expect you to be able to create an unmodified box plot without needing my spreadsheet, it's unlikely you'll be able to do it (without my help) in Excel. The version of Excel these templates were written for doesn't support box plots, so I did a fair amount of work to make it draw them, anyway. When you use my spreadsheet on other data, be sure to change the axis name so that it fits your problem.

Q. The mean, from grouped data
[For the mean of ungrouped data, see N.]

Problem: In a random sample of 50 college students, 5 said that they sit "in the very front" of the class and 21 said that they sit "toward the front". The average GPA of the students in the very front was 10.94 (on a 12 point scale), while for the students who sat "toward the front", the average GPA was 9.38. What is the average GPA of all 26 of these students?

The answer has to come out exactly as if 5 students had GPAs of 10.94 (the average for the "front" group) and 21 students had GPAs of 9.38 (the "near front" group). Computing this is easy: average 26 numbers, 5 of which are 10.94 and 21 of which are 9.38. If you think for a moment, you'll realize that the math is going to look like [(5 × 10.94) + (21 × 9.38)]/(5 + 21) = 251.68/26 = 9.68. We can generalize this work to find the mean from any set of grouped data.

Finding the Mean from Grouped Data in Excel

Excel (as well as a number of software packages) expects that, when you want to do statistics, you'll type in every single data point. Sometimes, though, like in this problem, you don't really want to do that. You want to enter the different values observed, and how many times each value was observed.
Happily, Excel can still easily compute the average of data presented in this way. Here's how you do it.

1. Enter your data in two columns, "value" and "frequency".
2. Compute the average of the data with the command
   =SUMPRODUCT(valuerange, frequencyrange)/SUM(frequencyrange)

Here, valuerange refers to the cells containing the observed values (the numbers in the "value" column), and frequencyrange refers to the cells containing the number of times each value is observed (the numbers in the "frequency" column).

You can also find the mean of grouped data from a relative frequency distribution. The formula is even simpler: =SUMPRODUCT(valuerange, relfreqrange), where relfreqrange is the range of cells containing the relative frequencies of the observed values.

Here is the frequency version applied to our problem.

        A               B                C
  1     Location        # of students    Average GPA
  2     front           5                10.94
  3     toward front    21               9.38
        mean                             9.68      =SUMPRODUCT(C2:C3,B2:B3)/SUM(B2:B3)

You could, of course, have typed the 26 numbers in separately, then used the =AVERAGE command. The standard deviation and variance are computed from grouped data using the same idea used here: treat every observation in a class as though it fell at a single value for that class (typically the class midpoint).

R. Comparison of measures of central tendency: mean, median, mode, midrange, midhinge

See N for mean. See O for comparison of median and mean.

Mode

The mode is simply the most frequently occurring value in a data set: what single observed value occurs most often? Some data sets have no mode, since each observation occurs only once. Other data sets are "bimodal"; they have two different modes. The term "bimodal" is often used to describe frequency distributions of interval or ratio data that have two prominent, nonadjacent "peaks" of comparable size in their histograms.

The mode is of limited usefulness with ungrouped data. If more people got 67 points on the exam than any other single number, how much does that really tell you?
On the other hand, suppose we view soda consumption, and consider classes of 0 to 5 ounces, 5 to 10 ounces, and so on. It might be quite useful for the soda company to know that when test subjects were given soda to drink in one sitting, their modal consumption was between 10 and 15 ounces. They may decide to market 12 ounce cans rather than 8 ounce ones.

In Excel, you can find the mode of a data set with the command =MODE(range), where range is the range of cells containing the data.

Midrange (Not used in your book.)

The midrange is the value halfway between the smallest and largest observation in a data set. It can be computed in Excel as =(MAX(range) + MIN(range))/2. It's of limited usefulness, especially for skewed data.

How could it be useful? Suppose a policewoman is responsible for answering calls for help along a section of Route 81 that includes exits at mile markers 30.4, 31.4, 39.4, 39.8, 40.2, and 50.4. She wants to minimize the impact of a "worst case scenario": that she receives a call at one exit when she is far away from it. If this is her concern, where should she station herself along the highway? Well, since her segment of road runs from marker 30.4 to marker 50.4, she should station herself at the midrange, 40.4. In this way, she is at most 10.0 miles from the call, regardless of where it originates.

Midhinge (Not used in your book.)

The distinction between the midrange and midhinge is the same as the distinction between the range and IQR. (See L for information on IQR.) That is, the midhinge is the point halfway between the first and third quartile values. In Excel, it can be computed as =(QUARTILE(range,1) + QUARTILE(range,3))/2. It, too, is of limited usefulness. The feel of it is that it gives you the geometric middle of the "middle half" of the data.

To concoct an example, suppose that our policewoman of the midrange example, above, decides that it is foolish to treat the exit at 50.4 miles on an equal footing with the others.
Her midrange position of 40.4 is far from the "center" of things, due to the "outlier" at 50.4. She may instead decide to worry about the "middle 50% of the exits". With 6 exits, Excel's formula for the 1st quartile gives observation (6+3)/4 = "observation 2.25", which is 33.4. The Excel formula for the 3rd quartile gives observation (18+1)/4 = "observation 4.75", or 40.1. The midhinge, the point halfway between these values, is (33.4+40.1)/2 = 36.75. By stationing herself at mile marker 36.75, she'll be as close as she can be to the "middle 50%" of the exits. (Note: your book uses slightly different formulae.)

You can relate median, midhinge, and midrange rather nicely with the unmodified box plot. Take a look.

[Figure: unmodified box plot with the median marked M, the midrange marked R, and the midhinge marked H.]

The point marked M is, of course, the median. The midrange is at the location marked "R", in the middle of the range. The midhinge is at the location marked "H", in the middle of the "box" made by the first and third quartile values. (The arrows emanating from R show the range. The arrows emanating from H show the IQR.)

The locations of the mean and the mode (if any) aren't readily apparent from a boxplot. If the data is unimodal and skewed, though, then as you come in from the long tail, you'll typically encounter first the mean, then the median, and then the mode. (In this example, our data set is too small to sensibly talk about these things.)

S. Standard deviation and variance

We'll have much to say later about the meaning and usefulness of standard deviation and variance. For now, though, you should know that both are measures of dispersion, or spread, just like range and IQR. In many problems involving samples, we'll learn that standard deviation is THE most useful measure of dispersion. Why? For us, one of the big reasons is an additive property of variance. I'll return to this idea in a minute, after we do a bit of variance work.
For now, though, we're going to have to be happy with the definitions of these rather elusive quantities, and our ability to compute them in Excel. I'll refer you to your textbook for the formulas themselves. Here, we'll focus on meaning and Excel computation. But you have to be careful with these guys. Let me explain.

If you wanted to figure out the average number of credits taken per semester by a JMU student, you could take a random sample of, say, 100 students. You'd then add their credit loads, divide by the number of students, and get a number, say 16.8 credits. Now it's unlikely that if you had done the same calculation with all JMU students to get the exact answer, the result would have been 16.8; it probably would have been a little more or a little less. On the other hand, it should be close, and you'd figure your answer is as likely to be a little too small as a little too large. All of this means that you can estimate the mean of the population by taking the mean of a sample, and that (if the sample is large enough), the mean of the sample should be a good estimate of the thing that you really care about, which is the population mean.

The calculations that you do to find the average of a sample and the average of the population feel like the same calculation (add everything together and divide by how many you have). In a sense, though, they're different, because we don't really care about the 100 randomly selected students. We only want to use the average for that group to say something about the average for everybody. This is the only thing a sample is good for.

But suppose you wanted to get an idea of the "spread" of the number of credits that JMU students take, as measured by variance and standard deviation. The population of all JMU students would have a specific number for its variance and for its standard deviation, and we could compute it using Excel (with the VARP function), or the formula for σ² from your book.
But if we only had a sample, and we tried to use the same formula for its variance, we'd find that, more often than not, the value we'd get for the sample variance would be too small to be a good estimate of σ². Just why this is so is something we'll talk about later, but the amazing fact is that the estimate you get from the sample is a good estimate if you change the formula for sample variance (which we write s²) by just a little bit: instead of dividing by the number of observations in the sample, n (which would seem to make sense), you divide by one less, n-1. So Excel has two different functions that can be applied to a range of numbers to give a variance.

=VARP(range)    gives σ², the population variance
=VAR(range)     gives s², the sample variance

In real life, the data that you get is almost always a sample from a larger whole, so the VAR formula is generally more useful than the VARP formula.

The standard deviation, of course, is just the square root of the variance. We can compute these in Excel, too.

=STDEVP(range)  gives σ, the population standard deviation
=STDEV(range)   gives s, the sample standard deviation

We use standard deviation more often than variance because the standard deviation of a set of data is measured in the same units as the data itself. So if you have data in gallons of water, the standard deviation is measured in gallons, while the variance is measured in "gallons²", whatever that means.

So why use variance at all? And why use such a freaky idea as variance OR standard deviation, when something like interquartile range is a lot easier to understand and probably easier to compute? The answer is that, in a lot of stat, we're taking a sample and then finding the mean of that sample. The first thing that you do when you find a mean is to add together the observations. And it turns out that variance acts very nicely when you add observations together. Let me show you.
Suppose that you have five wooden blocks of various heights, and I have five wooden blocks of various heights. What we're going to do is this: You pick one of your blocks at random, and I'll pick one of my blocks at random, and we'll make a tower by stacking these blocks up. The question is: what can you say about the height of the resulting tower?

To lock this down, let's be specific. Let's say that your blocks have heights of 1", 2", 4", 4" and 4". My blocks have heights of 1", 1", 2", 3" and 6". Let's have Excel compute some values from this. First, we'll look at the 25 different towers we could build: 5 choices for you, 5 choices for me.

                      My choice
                      1     1     2     3     6
Your choice    1      2     2     3     4     7
               2      3     3     4     5     8
               4      5     5     6     7    10
               4      5     5     6     7    10
               4      5     5     6     7    10

Now I'll ask Excel to find a bunch of stats for your blocks, for my blocks, and for our tower. Here they are.

                       Your block      My block        Our tower
Mean                   3               2.6             5.6
Standard Error         0.632455532     0.92736185      0.458258
Median                 4               2               5
Mode                   4               1               5
Standard Deviation     1.414213562     2.073644135     2.291288
Sample Variance        2               4.3             5.25
Population Variance    1.6             3.44            5.04
Population Std Dev     1.264911064     1.854723699     2.244994
Range                  3               5               8
Minimum                1               1               2
Maximum                4               6               10
Sum                    15              13              140
Count                  5               5               25

Note that your mean plus my mean = the tower's mean. That makes sense. If your blocks are on average 3" high, and my blocks are on average 2.6" high, then our tower is on average 5.6" high. Note that the median does not have this additive property: 4 + 2 is not 5. The mode worked in this example, but won't usually.

How about the measures of dispersion? Nothing going on with (sample) standard deviation and sample variance... but look at population variance! 1.6 + 3.44 = 5.04. Our variances add! This property also holds for range (3 + 5 = 8), but range isn't really a very useful measure of spread, since it only depends on the most extreme data points (here, our shortest and tallest blocks). But variance takes all of the blocks into account, and it still adds.
It turns out that this property is incredibly useful, and we'll need it very soon, when we begin to draw conclusions from samples. Oh, by the way, this variance addition ONLY works because MY choice of block was independent of YOUR choice of block. If you picked a tall one whenever I picked a tall one, for example, the relation would have failed.
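The block example can be replayed outside of Excel. What follows is a minimal Python sketch, not part of the original handout, using the standard-library statistics module: pvariance divides by n (like VARP) and variance divides by n-1 (like VAR). The "dependent" pairing at the end, where the k-th shortest of your blocks is always stacked on the k-th shortest of mine, is my own assumed way of illustrating "you picked a tall one whenever I picked a tall one"; the handout does not specify one.

```python
from itertools import product
from math import isclose
from statistics import mean, pvariance, variance

your_blocks = [1, 2, 4, 4, 4]
my_blocks = [1, 1, 2, 3, 6]

# Independent choices: every one of the 25 (yours, mine) pairs is equally likely.
towers = [yours + mine for yours, mine in product(your_blocks, my_blocks)]

# Means add.
means_add = isclose(mean(your_blocks) + mean(my_blocks), mean(towers))

# Population variances (divide by n, like VARP) add.
pop_variances_add = isclose(
    pvariance(your_blocks) + pvariance(my_blocks), pvariance(towers)
)

# Sample variances (divide by n-1, like VAR) do not add here.
sample_variances_add = isclose(
    variance(your_blocks) + variance(my_blocks), variance(towers)
)

# Assumed dependent pairing: tall always goes with tall.
dependent_towers = [
    yours + mine for yours, mine in zip(sorted(your_blocks), sorted(my_blocks))
]
dependent_variances_add = isclose(
    pvariance(your_blocks) + pvariance(my_blocks), pvariance(dependent_towers)
)

print("means add:", means_add)
print("population variances add:", pop_variances_add)
print("sample variances add:", sample_variances_add)
print("population variances add when choices are dependent:", dependent_variances_add)
```

With the dependent pairing the tower heights are more spread out than the sum of the two variances would suggest, which is one way to see why independence matters.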
