CHAPTER 4 ANSWERS

Section 4.1

Statistical Literacy and Critical Thinking

1. An outlier in a data set is a value that is much higher or much lower than almost all other values. This definition is not exact enough to clearly and objectively determine whether a value is an outlier. Although there are statistical tools that can aid in such a determination, there is also some judgment required.
2. The median will do a better job of describing the income of a typical person in the class since the professor’s salary will just be the largest salary in the list of salaries, but will not affect the median. The mean, however, will be greatly affected by the professor’s outlier salary since the mean is obtained by summing all of the salaries and dividing by the number of salaries, 25.
3. It is not likely that the result will be a good estimate of the mean commuting time for all workers. This procedure treats all of the states equally, regardless of the number of commuters and the number of large cities. Those states with more commuters should be weighted more heavily in determining a mean for all commuters.
4. No. The numbers on the jerseys are just labels for the names of the players. They do not measure or count anything, so the mean would be a meaningless statistic.
5. This statement does not make sense. There is only one mean for a data set.
6. This statement is sensible. A set of data may have more than one mode. For example, the set of data 65.2, 65.2, 72.3, 75.0, 72.3, 81.4 has two numbers that occur more often than any other values: 65.2 and 72.3.
7. This statement is sensible. It is possible for a set of data to have the same values for the mean, median, and mode. For example, the data set consisting of 4, 6, 6, 6, 8 has mean, median, and mode all equal to 6.
8. The statement does not make sense. A mean calculated from the original raw data does not have to be equal to a mean calculated from a frequency table for the same data.
The reason is that when using a frequency table to compute a mean, all of the values in a bin are assumed to be equal when, in fact, they are usually not equal.
9. The median best describes the average income of adults in a large city since it is not affected by the very large incomes of a relatively small group of people. Half of the adults will have incomes below the median and half will have incomes above the median.
10. Since oranges packed in a large box are usually pre-sorted so that they are similar in size, either the mean or the median will provide a good average. If the box contains a random collection of oranges just picked, the mean would be a better (and quicker) average to use since only the total weight and total number of oranges are needed to find it.
11. The median would best describe the average number of times that people change jobs since it would not be influenced by the large numbers of changes made by a few people.
12. The mean would best describe the average number of pieces of lost luggage per flight. There will likely be very few large numbers to influence the value of the mean and the mean also reflects the total number of pieces lost.

Copyright © 2012 Pearson Education, Inc. Publishing as Addison-Wesley
CHAPTER 4, DESCRIBING DATA
SECTION 4.1, WHAT IS AVERAGE?

Concepts and Applications

13. For the mean, total the 8 numbers and divide by 8. Round the answer to one more decimal place than shown in the data.
    Mean = (53 + 52 + 75 + 62 + 68 + 58 + 49 + 49)/8 = 466/8 = 58.3
    For the median, first put the eight numbers in increasing order. Since there is an even number of data values, the median is the average of the middle two (fourth and fifth) numbers. Thus median = (53 + 58)/2 = 55.5. The mode is the number that occurs most often. Since 49 occurs twice and no other number occurs more than once, 49 is the mode.
14. For the mean, total the 10 numbers and divide by 10. Round the answer to one more decimal place than shown in the data.
    Mean = (98.6 + 98.6 + 98.0 + 98.0 + 99.0 + 98.4 + 98.4 + 98.4 + 98.4 + 98.6)/10 = 984.4/10 = 98.44
    For the median, first put the ten numbers in increasing order. Since there is an even number of data values, the median is the average of the middle two (fifth and sixth) numbers. Since the ordered list is 98.0, 98.0, 98.4, 98.4, 98.4, 98.4, 98.6, 98.6, 98.6, 99.0, the median = (98.4 + 98.4)/2 = 98.4. The mode is the number that occurs most often. Since 98.4 occurs four times and no other number occurs more than three times, 98.4 is the mode.
15. For the mean, total the 12 numbers and divide by 12. Round the answer to one more decimal place than shown in the data.
    Mean = (0.27 + 0.17 + 0.17 + 0.16 + 0.13 + 0.24 + 0.29 + 0.24 + 0.14 + 0.16 + 0.12 + 0.16)/12 = 2.25/12 = 0.188
    For the median, first put the twelve numbers in increasing order. Since there is an even number of data values, the median is the average of the middle two (sixth and seventh) numbers. Thus median = (0.16 + 0.17)/2 = 0.165. The mode is the number that occurs most often. Since 0.16 occurs three times and no other number occurs more than twice, 0.16 is the mode.
16. For the mean, total the 12 numbers and divide by 12. Round the answer to one more decimal place than shown in the data.
    Mean = (98 + 92 + 95 + 87 + 96 + 90 + 65 + 92 + 95 + 93 + 98 + 94)/12 = 1095/12 = 91.3 min.
    For the median, first put the twelve numbers in increasing order. Since there is an even number of data values, the median is the average of the middle two (sixth and seventh) numbers. Thus median = (93 + 94)/2 = 93.5 minutes. The mode is the number that occurs most often. Since 92, 95, and 98 each occur two times and no other number occurs more than once, 92, 95, and 98 minutes are all modes.
17. For the mean, total the 11 numbers and divide by 11. Round the answer to one more decimal place than shown in the data.
    Mean = (0.72 + 0.90 + 0.84 + 0.68 + 0.84 + 0.90 + 0.92 + 0.84 + 0.64 + 0.84 + 0.76)/11 = 8.88/11 = 0.807 mm.
    For the median, first put the eleven numbers in increasing order.
Since there is an odd number of data values, the median is the middle (sixth) number. Thus median = 0.84 mm. The mode is the number that occurs most often. Since 0.84 occurs four times and no other number occurs more than twice, 0.84 mm is the mode.
18. For the mean, total the 15 ages and divide by 15. Round the answer to one more decimal place than shown in the data.
    Mean = (57 + 61 + 57 + 57 + 58 + 57 + 61 + 54 + 68 + 51 + 49 + 64 + 50 + 48 + 65)/15 = 857/15 = 57.1.
    For the median, put the fifteen numbers in increasing order. Since there is an odd number of data values, the median is the middle (eighth) number. Thus the median = 57. The mode is the number that occurs most often. Since 57 occurs four times and no other number occurs more than twice, the mode is 57.
19. For the mean, total the 11 weights and divide by 11. Round the answer to one more decimal place than shown in the data.
    Mean = (0.957 + 0.912 + 0.842 + 0.925 + 0.939 + 0.886 + 0.914 + 0.913 + 0.958 + 0.947 + 0.920)/11 = 10.113/11 = 0.9194 g.
    For the median, first put the eleven numbers in increasing order. Since there is an odd number of data values, the median is the middle (sixth) number. Thus median = 0.920 g. The mode is the number that occurs most often. Since no value occurs more than once, there is no mode.
20. For the mean, total the 22 weights and divide by 22. Round the answer to one more decimal place than shown in the data. The sum of the 22 weights is 123.61 g, so the mean is 123.61/22 = 5.619 g. For the median, first put the 22 numbers in increasing order. Since there is an even number of data values, the median is the average of the middle two (eleventh and twelfth) numbers. Thus median = (5.59 + 5.60)/2 = 5.595 g. The mode is the number that occurs most often. Since 5.58 occurs three times and no other number occurs more than twice, 5.58 g is the mode.
21. a) For the mean, total the seven areas and divide by 7. The sum of the seven areas is 1,103,100 square miles, so the mean is 1,103,100/7 = 157,586 square miles.
For the median, first put the seven numbers in increasing order. Since there is an odd number of data values, the median is the middle (fourth) number. Thus median = 104,100 square miles.
    b) Alaska is an outlier on the high end. Without Alaska, the mean is 487,900/6 = 81,317 square miles. The median is the average of the third and fourth values = (53,200 + 104,100)/2 = 78,650 square miles.
    c) Connecticut is an outlier on the low end. Without Connecticut, the mean is 1,097,600/6 = 182,933 square miles. The median is the average of the third and fourth values = (104,100 + 114,000)/2 = 109,050 square miles.
22. a) The mean equals the total weight/7 cans = 5.6866/7 = 0.8124 pounds. The median is the fourth smallest number in the ordered list of seven weights, or 0.8161 pounds.
    b) 0.7901 is an outlier since it is considerably lower than all of the other six values.
    c) If the outlier is excluded, the mean becomes 4.8965/6 = 0.8161 pounds and the median becomes the average of the third and fourth numbers in the ordered list of six numbers, or (0.8161 + 0.8165)/2 = 0.8163 pounds.
23. a) Mean = (70 + 75 + 80 + 70)/4 = 295/4 = 73.75
    b) Since the mean will be the total divided by five, the total will need to be 5 x 75 = 375 in order for the mean to be 75 after the next quiz. Since the total is already 295 after the first four quizzes, the next quiz score will need to be 375 - 295 = 80.
    c) If you achieve a score of 100, your mean score will be (295 + 100)/5 = 395/5 = 79. Thus it’s not possible to have a mean score higher than 79 after the next quiz.
24. a) Mean = (60 + 70 + 65 + 85 + 85)/5 = 365/5 = 73.0
    b) Since the mean will be the total divided by six, the total will need to be 6 x 75 = 450 in order for the mean to be 75 after the next quiz. Since the total is already 365 after the first five quizzes, the next quiz score will need to be 450 - 365 = 85.
    c) If you achieve a score of 100, your mean score will be (365 + 100)/6 = 465/6 = 77.5. Thus 77.5 is the maximum mean score that you could have after the next quiz.
25. Since the mean equals the total divided by the number of quizzes (6), you must have a total of 480 in order to have a mean of 80. If you get 90 on the next quiz, you will have a total of 570 for seven quizzes for a mean of 570/7 = 81.4. The maximum mean score that you could have after the next quiz would result if you scored 100. This would make your total 580 and your mean would be 580/7 = 82.9. The minimum mean score that you could have after the next quiz would result if you scored zero. In that case, your new mean would be 480/7 = 68.6.
26. The number of hits that she has so far is 30 x .300 = 9. If she gets a hit in her next at-bat, she will have 10 hits in 31 at-bats. Her new batting average will be 10/31 = .323.
27. The mean score of your students is (55 + 60 + 68 + 70 + 87 + 88 + 95)/7 = 523/7 = 74.7. The median score is 70. Thus if the “average” score reported by the district is a mean, your fourth graders are above average; if it is a median, the fourth graders are below average.
28. The mean height (in inches) of your players is (77 + 78 + 78 + 84 + 86)/5 = 403/5 = 80.6" or 6' 8.6". The median height is 78" or 6' 6". The answer to the question depends on the meaning of “average.” If the “average” height reported by the league is a mean, your team is above average height; if it is a median, the team is below average height.
29. The mean weight of all of the peaches is the total weight divided by the total number of peaches, or (18 + 22 + 24) pounds/(50 + 55 + 60) peaches = 64/165 = 0.39 pounds per peach.
30. No. The classes are not of equal size. If we think of the two percentages as points out of 100, then the first class had a total number of points equal to 25 x 86 = 2150 while the second had 30 x 84 = 2520.
The mean for the two classes combined is the total number of points divided by the total number of students, or 4670/55 = 84.91.
31. Each student is taking three classes with enrollments of 20 each and one class with an enrollment of 100, so the mean size of each student’s classes is 160/4 = 40. There are three classes with 100 students each and 45 classes with 20 students each, making a total enrollment of 1200 students in 48 classes. Thus the mean enrollment per class is 1200/48 = 25. Both means are correct, but they describe different quantities. The principal’s mean provides the mean class size per class since it takes into account all classes taken by all students, while the parents’ mean provides the mean class size per student.
32. This requires a weighted mean of the grades where the weights are the percentages. Therefore,
    Mean = [(15)(75) + (20)(90) + (40)(85) + (25)(72)]/(15 + 20 + 40 + 25) = 8125/100 = 81.25
33. Batting Average = Total Number of Hits/Total Number of At Bats = (2 + 0 + 3)/(4 + 3 + 5) = 5/12 = 0.417
    This number gives the mean number of hits per time at bat.
34. No. Suppose that the player had 400 hits in 1000 at-bats (.400 average) followed by 2 hits in 4 at-bats. The player now has 402 hits in 1004 at-bats for an average of .4003 (which would still be reported as a .400 average).
35. No. The average would be 10% only if the two farms produced exactly the same number of eggs. To demonstrate that the average might not be 10%, suppose one farm had 8% of 1000 eggs (80 eggs) with salmonella, while the other had 12% of 3000 eggs (360 eggs) with salmonella. Altogether, there are 440 eggs out of 4000 with salmonella, giving a percentage of 440/4000 or 11%, not 10%.
36. a) Batting Average = Total Number of Hits/Total Number of At Bats = (3 + 2 + 2)/(5 + 4 + 5) = 7/14 = 0.500
    b) Slugging Average = Total Number of Bases/Total Number of At Bats = (3 + 4 + 6)/(5 + 4 + 5) = 13/14 = 0.929
    c) Yes.
For example, if a player has 2 home runs in 4 at-bats, the slugging percentage is 8/4 = 2.000.
37. Each share of stock gets one vote. If a Yes vote counts 1 point and a No vote counts 0 points, then the outcome of the vote is (400 x 1 + 600 x 0)/(400 + 600) = 400/1000 = 0.400. This represents the average number of Yes votes per vote. Since the average is less than 0.5, the vote fails. Alternatively, we can just say that there are 400 Yes votes and 600 No votes, so the item on which the vote is being taken does not pass.
38. This is a weighted mean with the course credits being the weights. Thus,
    GPA = [(5)(4) + (3)(3) + (3)(2) + (3)(1)]/(5 + 3 + 3 + 3) = 38/14 = 2.71
39. The data are at the nominal level of measurement, so the only measure of center that makes sense is the mode. The mode is 1, indicating that the smooth-yellow peas occur more than any other phenotype.
40. The population center has been moving westward with time, reflecting increased population in western states relative to eastern states.

Section 4.2, Shapes of Distributions

Statistical Literacy and Critical Thinking

1. A graph is symmetric if its left half is a mirror image of its right half.
2. This distribution is uniform (or rectangular). The distribution is symmetric, not skewed, and there are no modes.
3. The students in the statistics class have satisfied some prerequisites for college, and perhaps also for the class. It is therefore probable that people with lower IQ scores are not present in the class as they would be for the randomly selected adults. Thus there will be less variability in the IQ scores of the students than in the scores of the randomly selected adults. A graph of the distribution of the student IQ scores will be concentrated in a narrower range (less spread) than the graph for IQ scores of randomly selected adults.
4. Skewness refers to a lack of symmetry with the graph more spread out on one side than on the other.
5. This statement is not sensible. A distribution can have any number of modes and still be symmetric.
6. This statement is not sensible. With a symmetric distribution, the mean and median are always equal.
7. This statement does not make sense. If the distribution is uniform, the graph of the distribution is a horizontal straight line, so there cannot be a mode consisting of a single value.
8. This statement makes sense. A distribution can be left-skewed with a single mode.

Concepts and Applications

9. [Histogram: Times Between Eruptions of Old Faithful; horizontal axis: Minutes]
    This distribution has two modes (at 50 and 80 minutes), is left-skewed, and has wide variation.
10. [Histogram: Failure Time of Computer Chips; horizontal axis: Times (months)]
    This distribution has one mode (at 1 month), is right-skewed, and has moderate variation.
11. [Histogram: Weight of Rugby Players; horizontal axis: Weight (kg)]
    This distribution is single peaked, is nearly symmetric, and has moderate variation.
12. [Histogram: Weights of a Sample of Pennies; horizontal axis: Weights (grams)]
    The distribution is bimodal and is roughly symmetric. The gap between the left portion of the distribution and the right portion reflects the fact that this graph actually includes two different populations: pennies made before 1983 and pennies made in 1983 or later.
13. a) The distribution of incomes will have a shape similar to the one shown below, but its exact shape cannot be determined from the information given. Since the mean is greater than the median, it will be right-skewed.
    [Sketch: right-skewed frequency curve of Income, with the median marked at 35,000, the mean at 41,000, and the maximum at 250,000]
    b) About 50% or 150 (half of 300) of the families earned less than $35,000 since that is the value of the median.
    c) No. It depends on the precise distribution. All we can determine is that less than half of the families earned more than $41,000.
14. a) More than half of the days (183 or more) had no rainfall, so the minimum and the median are both zero.
    b) The distribution is right-skewed. There are many days with no rainfall (the mode is 0), probably quite a few with a little rainfall, and maybe a small number with greater rainfall. The mean is only 0.083 inches, so there was a total rainfall for the year of only about 30 inches.
    c) No. Since there was zero rainfall on more than half of the days, it rained on fewer than half of the days (182 or fewer).
15. a) We would expect one mode of $0 because an income of $0 is probably the most common value.
    b) Right-skewed. Most of the incomes will be at or near zero, with only a few that are much greater, including the instructor’s.
16. a) The distribution is likely to have one mode.
    b) The distribution is likely to be nearly symmetric, perhaps a little left-skewed since there is a possible range of 75 points below the mean and only 25 points above it.
17. a) The distribution is likely to have one mode.
    b) The distribution is likely to be nearly symmetric.
18. a) The distribution is likely to have one or two modes; there might be one mode for linemen, fullbacks, and linebackers, and a second mode for running backs, wide receivers, and defensive backs.
    b) The distribution will likely be nearly symmetric since there are nearly equal numbers of the two groups of players mentioned above.
19. a) The distribution is likely to have two modes since figure skaters tend to be smaller than hockey players.
    b) We can’t know the skewness for certain without knowing how many skaters are in each group. Assuming equal numbers in each group, the distribution will probably be right-skewed since no professional figure skaters are heavy, while some of the hockey players may be light.
20. a) There will be two modes since SUVs will be heavier than compacts.
    b) It is symmetric due to the equal number of cars in each group.
21. a) The distribution is likely to have one mode.
    b) The distribution is likely to be right-skewed since most flights will leave with little or no delay, but a few flights will have long delays. Delays shorter than zero are not possible.
22. a) The distribution will have one mode somewhere near the speed limit.
    b) It will be right-skewed since there will be a few people who exceed the speed limit, but even fewer who are much below the limit.
23. a) The distribution is likely to have one mode (there’s no reason to suspect that museum goers must be either old or young).
    b) It will be right-skewed since younger adults may also have children with them and retirees are more likely to be alone.
24. a) The distribution is likely to have one mode.
    b) The distribution is likely to be right-skewed. There will be a greater percentage of young people and families with children.
25. a) Since there will always be exactly 4 players, there is one mode.
    b) The distribution is likely to be symmetric.
26. a) The distribution is likely to have one mode.
    b) The distribution is likely to be symmetric.
27. a) The distribution is likely to have one mode.
    b) The distribution is likely to be right-skewed since this is the right tail of a distribution that is already skewed to the right.
28. a) The distribution is likely to have one mode since this is similar to the income distribution of all adults.
    b) The distribution is likely to be right-skewed.
29. a) The distribution is likely to have one mode.
    b) The distribution is likely to be right-skewed since there are fewer stars than there are “journeyman” ball players.
30. a) The distribution is likely to have one mode.
    b) The distribution is likely to be right-skewed since low average players tend to not make the team.

Section 4.3

Statistical Literacy and Critical Thinking

1. The standard deviation is based on how much values deviate from the mean.
2. The movie patrons are likely to have more variation in their IQs than the students in a physics class. The students are likely to be a more homogeneous group since they have been filtered by being in college and additionally by being able to satisfy the mathematical prerequisites for the class.
3. This statement is incorrect because it defines the standard deviation in terms of the minimum and the maximum values, but the standard deviation uses every value in its computation.
4. It means that about 25% of the values are at or below 93.2 and about 75% are above 93.2.
5. This statement does not make sense. The median is the 50th percentile, so it cannot be the 60th percentile.
6. This statement makes sense. Since annual incomes have a distribution that is right-skewed, the mean will be larger than the median.
7. This statement makes sense. The annual incomes of the instructors are likely to be in a smaller range than the incomes of physicians since the group of physicians may include those in general practice, pediatricians, brain surgeons, internists, etc.
8. This statement does not make sense. The standard deviation is a type of average, so it does not necessarily become larger as the sample size increases.

Concepts and Applications

9. Range = highest value - lowest value = 75 - 49 = 26 seconds
    The mean is 58.25 seconds.

    Time    Deviation = Time - Mean    Deviation²
    53       -5.25                      27.5625
    52       -6.25                      39.0625
    75       16.75                     280.5625
    62        3.75                      14.0625
    68        9.75                      95.0625
    58       -0.25                       0.0625
    49       -9.25                      85.5625
    49       -9.25                      85.5625
                              Sum =    627.5000

    Standard Deviation = √(Sum/(8 - 1)) = √(627.5/7) = 9.5 seconds
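The table-based calculation in answer 9 can be cross-checked with a short script. The following is a minimal sketch using only Python's standard library; the function name `sample_std_dev` is an illustrative choice, not something from the text. It follows the same steps as the table: subtract the mean, square each deviation, total the squares, divide by n - 1, and take the square root.

```python
import math

def sample_std_dev(values):
    """Sample standard deviation: sqrt(sum of squared deviations / (n - 1))."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))

times = [53, 52, 75, 62, 68, 58, 49, 49]   # data from answer 9, in seconds

print(max(times) - min(times))             # range: 26
print(round(sample_std_dev(times), 1))     # standard deviation: 9.5
```

The same function should reproduce the standard deviations in the answers that follow when given their data lists, since each of them divides the sum of squared deviations by n - 1.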
SECTION 4.3, MEASURES OF VARIATION

10. Range = highest value - lowest value = 99.0 - 98.0 = 1.0 degrees
    The mean is 98.44 degrees.

    Temperature    Deviation = Temperature - Mean    Deviation²
    98.6            0.16                             0.0256
    98.6            0.16                             0.0256
    98.0           -0.44                             0.1936
    98.0           -0.44                             0.1936
    99.0            0.56                             0.3136
    98.4           -0.04                             0.0016
    98.4           -0.04                             0.0016
    98.4           -0.04                             0.0016
    98.4           -0.04                             0.0016
    98.6            0.16                             0.0256
                                            Sum =    0.7840

    Standard Deviation = √(Sum/(10 - 1)) = √(0.7840/9) = 0.30 degrees

11. Range = highest value - lowest value = 0.29 - 0.12 = 0.17
    The mean is 0.1875.

    Concentration    Deviation = Concentration - Mean    Deviation²
    0.27              0.0825                             0.006806
    0.17             -0.0175                             0.000306
    0.17             -0.0175                             0.000306
    0.16             -0.0275                             0.000756
    0.13             -0.0575                             0.003306
    0.24              0.0525                             0.002756
    0.29              0.1025                             0.010506
    0.24              0.0525                             0.002756
    0.14             -0.0475                             0.002256
    0.16             -0.0275                             0.000756
    0.12             -0.0675                             0.004556
    0.16             -0.0275                             0.000756
                                                Sum =    0.035825

    Standard Deviation = √(Sum/(12 - 1)) = √(0.035825/11) = 0.057

12. Range = highest value - lowest value = 98 - 65 = 33 minutes
    The mean is 91.25 minutes.

    Time    Deviation = Time - Mean    Deviation²
    98        6.75                      45.5625
    92        0.75                       0.5625
    95        3.75                      14.0625
    87       -4.25                      18.0625
    96        4.75                      22.5625
    90       -1.25                       1.5625
    65      -26.25                     689.0625
    92        0.75                       0.5625
    95        3.75                      14.0625
    93        1.75                       3.0625
    98        6.75                      45.5625
    94        2.75                       7.5625
                              Sum =    862.2500

    Standard Deviation = √(Sum/(12 - 1)) = √(862.25/11) = 8.9 minutes

13. Range = highest value - lowest value = 0.92 - 0.64 = 0.28 mm
    The mean is 0.807273 mm.

    Length    Deviation = Length - Mean    Deviation²
    0.72      -0.087270                    0.007617
    0.90       0.092727                    0.008598
    0.84       0.032727                    0.001071
    0.68      -0.127270                    0.016198
    0.84       0.032727                    0.001071
    0.90       0.092727                    0.008598
    0.92       0.112727                    0.012707
    0.84       0.032727                    0.001071
    0.64      -0.167270                    0.027980
    0.84       0.032727                    0.001071
    0.76      -0.047270                    0.002235
                                  Sum =    0.088218

    Standard Deviation = √(Sum/(11 - 1)) = √(0.088218/10) = 0.094 mm

14. Range = highest value - lowest value = 68 - 48 = 20 years
    The mean is 57.13333 years.

    Age    Deviation = Age - Mean    Deviation²
    57      -0.13333                   0.01778
    61       3.86667                  14.95111
    57      -0.13333                   0.01778
    57      -0.13333                   0.01778
    58       0.86667                   0.75111
    57      -0.13333                   0.01778
    61       3.86667                  14.95111
    54      -3.13333                   9.81778
    68      10.86667                 118.08440
    51      -6.13333                  37.61778
    49      -8.13333                  66.15111
    64       6.86667                  47.15111
    50      -7.13333                  50.88444
    48      -9.13333                  83.41778
    65       7.86667                  61.88444
                            Sum =    505.73330

    Standard Deviation = √(Sum/(15 - 1)) = √(505.7333/14) = 6.0 years

15. Range = highest value - lowest value = 0.958 - 0.842 = 0.116 g
    The mean is 0.919364 g.

    Weight    Deviation = Weight - Mean    Deviation²
    0.957      0.037636                    0.001416
    0.912     -0.007360                    0.000054
    0.842     -0.077360                    0.005985
    0.925      0.005636                    0.000032
    0.939      0.019636                    0.000386
    0.886     -0.033360                    0.001113
    0.914     -0.005360                    0.000029
    0.913     -0.006360                    0.000041
    0.958      0.038636                    0.001493
    0.947      0.027636                    0.000764
    0.920      0.000636                    0.000000
                                  Sum =    0.011313

    Standard Deviation = √(Sum/(11 - 1)) = √(0.011313/10) = 0.0336 g

16. Range = highest value - lowest value = 5.84 - 5.52 = 0.32 g
    The mean is 5.618636 g.

    Weight    Deviation = Weight - Mean    Deviation²
    5.60      -0.01864                     0.000347
    5.63       0.01136                     0.000129
    5.58      -0.03864                     0.001493
    5.56      -0.05864                     0.003438
    5.66       0.04136                     0.001711
    5.58      -0.03864                     0.001493
    5.57      -0.04864                     0.002365
    5.59      -0.02864                     0.000820
    5.67       0.05136                     0.002638
    5.61      -0.00864                     0.000075
    5.84       0.22136                     0.049002
    5.73       0.11136                     0.012402
    5.53      -0.08864                     0.007856
    5.58      -0.03864                     0.001493
    5.52      -0.09864                     0.009729
    5.65       0.03136                     0.000984
    5.57      -0.04864                     0.002365
    5.71       0.09136                     0.008347
    5.59      -0.02864                     0.000820
    5.53      -0.08864                     0.007856
    5.63       0.01136                     0.000129
    5.68       0.06136                     0.003765
                                  Sum =    0.119259

    Standard Deviation = √(Sum/(22 - 1)) = √(0.119259/21) = 0.075 g

17. Word lengths (Dev. = Length - Mean):

    Cat on a Hot Tin Roof              The Cat in the Hat
    Length    Dev.    Deviation²       Length    Dev.    Deviation²
     2        -1.9      3.61            3        -0.1     0.01
     6         2.1      4.41            3        -0.1     0.01
     2        -1.9      3.61            3        -0.1     0.01
     2        -1.9      3.61            3        -0.1     0.01
     1        -2.9      8.41            5         1.9     3.61
     4         0.1      0.01            2        -1.1     1.21
     4         0.1      0.01            3        -0.1     0.01
     2        -1.9      3.61            3        -0.1     0.01
     4         0.1      0.01            3        -0.1     0.01
     2        -1.9      3.61            2        -1.1     1.21
     3        -0.9      0.81            4         0.9     0.81
     8         4.1     16.81            2        -1.1     1.21
     4         0.1      0.01            2        -1.1     1.21
     2        -1.9      3.61            3        -0.1     0.01
     2        -1.9      3.61            2        -1.1     1.21
     7         3.1      9.61            3        -0.1     0.01
     7         3.1      9.61            5         1.9     3.61
     2        -1.9      3.61            3        -0.1     0.01
     3        -0.9      0.81            4         0.9     0.81
    11         7.1     50.41            4         0.9     0.81
              Sum =   129.80                     Sum =   15.80

    Standard Deviation = √(Sum/(20 - 1)) = √(129.8/19) = 2.6 for Cat on a Hot Tin Roof
    Standard Deviation = √(Sum/(20 - 1)) = √(15.8/19) = 0.9 for The Cat in the Hat
    Cat on a Hot Tin Roof: Range = 11 - 1 = 10; Standard Deviation = 2.6.
    The Cat in the Hat: Range = 5 - 2 = 3; Standard Deviation = 0.9.
    There is much less variation among the word lengths in The Cat in the Hat.

18. Ages of stowaways (Dev. = Age - Mean):

    Eastbound                       Westbound
    Age    Dev.    Deviation²       Age    Dev.    Deviation²
    24      2.7      7.29           41     11.5    132.25
    24      2.7      7.29           24     -5.5     30.25
    34     12.7    161.29           32      2.5      6.25
    15     -6.3     39.69           26     -3.5     12.25
    19     -2.3      5.29           39      9.5     90.25
    22      0.7      0.49           45     15.5    240.25
    18     -3.3     10.89           24     -5.5     30.25
    20     -1.3      1.69           21     -8.5     72.25
    20     -1.3      1.69           22     -7.5     56.25
    17     -4.3     18.49           21     -8.5     72.25
           Sum =   254.10                  Sum =   742.50

    The means are 21.3 and 29.5, respectively, for eastbound and westbound.
    Standard Deviation = √(Sum/(10 - 1)) = √(254.1/9) = 5.3 for Eastbound stowaways
    Standard Deviation = √(Sum/(10 - 1)) = √(742.5/9) = 9.1 for Westbound stowaways
    Eastbound: Range = 34 - 15 = 19; St. Dev. = 5.3
    Westbound: Range = 45 - 21 = 24; St. Dev. = 9.1
    The variation in ages for the westbound stowaways appears to be substantially larger than the variation for eastbound stowaways.

19. Forecast errors (Dev. = Error - Mean):

    One Day                           Five Days
    Error    Dev.    Deviation²       Error    Dev.        Deviation²
     2        1.5      2.25            0        0.35714      0.12755
     2        1.5      2.25           -3       -2.64286      6.98469
     0       -0.5      0.25            2        2.35714      5.55612
     0       -0.5      0.25            5        5.35714     28.69898
    -3       -3.5     12.25           -6       -5.64286     31.84184
    -2       -2.5      6.25           -9       -8.64286     74.69898
     1        0.5      0.25            4        4.35714     18.98469
    -2       -2.5      6.25           -1       -0.64286      0.41327
     8        7.5     56.25            6        6.35714     40.41327
     1        0.5      0.25           -2       -1.64286      2.69898
     0       -0.5      0.25           -2       -1.64286      2.69898
    -1       -1.5      2.25           -1       -0.64286      0.41327
     0       -0.5      0.25            6        6.35714     40.41327
     1        0.5      0.25           -4       -3.64286     13.27041
             Sum =    89.50                    Sum =       267.21430

    Standard Deviation = √(Sum/(14 - 1)) = √(89.50/13) = 2.6 for one-day forecasts.
    Standard Deviation = √(Sum/(14 - 1)) = √(267.2143/13) = 4.5 for five-day forecasts.
    One Day: Range = 8 - (-3) = 11; St. Dev. = 2.6
    Five Days: Range = 6 - (-9) = 15; St. Dev. = 4.5
    The variation in errors for the five-day forecasts of the high temperature appears to be substantially larger than the variation for one-day forecasts of the high temperature.

20. Weights of trees (Dev. = Weight - Mean):

    No treatment                        Treatment
    Weight    Dev.      Deviation²      Weight    Dev.      Deviation²
    0.15      -0.034    0.001156        2.03       0.696    0.484416
    0.02      -0.164    0.026896        0.27      -1.064    1.132096
    0.16      -0.024    0.000576        0.92      -0.414    0.171396
    0.37       0.186    0.034596        1.07      -0.264    0.069696
    0.22       0.036    0.001296        2.38       1.046    1.094116
              Sum =     0.064520                  Sum =     2.951720

    Standard Deviation = √(Sum/(5 - 1)) = √(0.064520/4) = 0.127 for no treatment
    Standard Deviation = √(Sum/(5 - 1)) = √(2.95172/4) = 0.859 for treatment
    No treatment: Range = 0.37 - 0.02 = 0.35; St. Dev. = 0.127
    Treatment: Range = 2.38 - 0.27 = 2.11; St. Dev. = 0.859
    The variation in weights of trees with no treatment appears to be less than the variation in weights of the treated trees.

21. a) 25/465 = 0.054, so the M&M is in the 5th percentile.
    b) 322/465 = 0.692, so the M&M is in the 69th percentile.
    c) 224/465 = 0.482, so the M&M is in the 48th percentile.
22. a) 38/76 = 0.500, so age 38 is in the 50th percentile.
b) 20/76 = 0.263, so age 29 is in the 26th percentile. c) 71/76 = 0.934, so age 71 is in the 91rd percentile. a) 6 5 4 3 2 1 0 Fr eque nc y 21 CHAPTER 4, DESCRIBING DATA 8 7 6 5 4 3 2 1 0 8 Frequenc y 66 7 6 5 4 3 2 1 0 Copyright © 2012 Pearson Education, Inc. Publishing as Addison-Wesley less than SECTION 4.3, MEASURES OF VARIATION b) Set 1 9 9 9 9 9 Low value Lower quartile Median Upper quartile High value Set 2 8 8 9 10 10 Set 3 8 8 9 10 10 Boxplot of 1, 2, 3, 4 12 11 Data 10 9 8 7 6 1 c) 2 3 4 Set 1 Value Deviation = Value - Mean Deviation2 9 0 0 9 0 0 9 0 0 9 0 0 9 0 0 9 0 0 9 0 0 Sum = 0 Standard Deviation= Sum 7-1 0 6 0 Copyright © 2012 Pearson Education, Inc. Publishing as Addison-Wesley Set 4 6 6 9 12 12 67 68 CHAPTER 4, DESCRIBING DATA Set 2 Value Deviation = Value - Mean Deviation 8 -1 1 8 -1 1 9 0 0 9 0 0 9 0 0 10 1 1 10 1 1 2 Sum = 4 Standard Deviation= Sum 7-1 4 6 0.816 Set 3 Value Deviation = Value - Mean Deviation 8 -1 1 8 -1 1 8 -1 1 9 0 0 10 1 1 10 1 1 10 1 1 2 Sum = 6 Standard Deviation= Sum 7-1 6 6 1 Copyright © 2012 Pearson Education, Inc. Publishing as Addison-Wesley SECTION 4.3, MEASURES OF VARIATION 69 Set 4 Value Deviation = Value - Mean Deviation 6 -3 9 6 -3 9 6 -3 9 9 0 0 12 3 9 12 3 9 12 3 9 2 Sum = 54 Standard Deviation= 54 6 3 d) The standard deviation takes all of the data into account and increases as the data become more spread out around the mean. a) Set 1 Set 2 Fr eque nc y Fre quency 8 7 6 5 4 3 2 1 0 3 4 5 6 7 8 8 7 6 5 4 3 2 1 0 9 3 Set 3 5 6 7 8 9 8 7 Frequency 5 4 3 2 1 6 5 4 3 2 1 0 0 3 b) 4 Set 4 8 7 6 Fre quency 24 Sum 7-1 4 5 6 7 8 9 3 4 5 6 7 8 9 In each set, the median is the 4th value in the ordered list, the lower quartile is the middle value of the lowest three values (2 nd in the overall list), and the upper quartile is the middle value of the highest three values (6th in the overall list). Copyright © 2012 Pearson Education, Inc. 
                   Set 1   Set 2   Set 3   Set 4
Low value            6       5       5       3
Lower quartile       6       5       5       3
Median               6       6       6       6
Upper quartile       6       7       7       9
High value           6       7       7       9
[Boxplot of Set 1, Set 2, Set 3, Set 4]
c) Set 1
Value    Deviation = Value - Mean    Deviation²
  6              0                      0
  6              0                      0
  6              0                      0
  6              0                      0
  6              0                      0
  6              0                      0
  6              0                      0
Sum =                                   0
Standard Deviation = $\sqrt{\dfrac{\text{Sum}}{7-1}} = \sqrt{\dfrac{0}{6}} = 0$

Set 2
Value    Deviation = Value - Mean    Deviation²
  5             -1                      1
  5             -1                      1
  6              0                      0
  6              0                      0
  6              0                      0
  7              1                      1
  7              1                      1
Sum =                                   4
Standard Deviation = $\sqrt{\dfrac{\text{Sum}}{7-1}} = \sqrt{\dfrac{4}{6}} \approx 0.816$

Set 3
Value    Deviation = Value - Mean    Deviation²
  5             -1                      1
  5             -1                      1
  5             -1                      1
  6              0                      0
  7              1                      1
  7              1                      1
  7              1                      1
Sum =                                   6
Standard Deviation = $\sqrt{\dfrac{\text{Sum}}{7-1}} = \sqrt{\dfrac{6}{6}} = 1$

Set 4
Value    Deviation = Value - Mean    Deviation²
  3             -3                      9
  3             -3                      9
  3             -3                      9
  6              0                      0
  9              3                      9
  9              3                      9
  9              3                      9
Sum =                                  54
Standard Deviation = $\sqrt{\dfrac{\text{Sum}}{7-1}} = \sqrt{\dfrac{54}{6}} = 3$
d) The standard deviation takes all of the data into account and increases as the data become more spread out around the mean.

25 a) For the faculty, Mean = $\dfrac{2+3+1+0+1+2+4+3+3+2+1}{11} = \dfrac{22}{11} = 2.0$ years. Median equals the sixth number in the ordered list and is 2 years. Range = 4 - 0 = 4 years.
For the students, Mean = $\dfrac{5+6+8+2+7+10+1+4+6+10+9}{11} = \dfrac{68}{11} \approx 6.2$ years. Median equals the sixth number in the ordered list and is 6 years. Range = 10 - 1 = 9 years.
b) The lower quartile is the middle value of the lowest 5 values in each data set and the upper quartile is the middle value of the highest 5 values in each data set.
                   Faculty   Students
Low Value             0         1
Lower quartile        1         4
Median                2         6
Upper quartile        3         9
High Value            4        10
[Boxplot of Faculty, Students]
c) Computation of standard deviation
Faculty                                          Students
Value   Deviation = Value - Mean   Deviation²    Value   Deviation = Value - Mean   Deviation²
  2              0                    0             5            -1.18                1.3924
  3              1                    1             6            -0.18                0.0324
  1             -1                    1             8             1.82                3.3124
  0             -2                    4             2            -4.18               17.4724
  1             -1                    1             7             0.82                0.6724
  2              0                    0            10             3.82               14.5924
  4              2                    4             1            -5.18               26.8324
  3              1                    1             4            -2.18                4.7524
  3              1                    1             6            -0.18                0.0324
  2              0                    0            10             3.82               14.5924
  1             -1                    1             9             2.82                7.9524
Mean = 2                   Sum =     14          Mean = 6.18                Sum =    91.6364
The means for faculty and students are given at the bottoms of the first and fourth columns respectively. The deviations for faculty are obtained by subtracting the mean from each number in the first column. Similarly, the deviations for students are obtained by subtracting the mean from each number in the fourth column. The squared deviations are then placed in the third and sixth columns, and their totals are shown at the bottom of the columns. The standard deviations are then found by dividing the sum of the squared deviations by n-1 = 11-1 = 10 and taking the square root. Thus, the standard deviations are
Faculty: Standard Deviation = $\sqrt{\dfrac{\text{Sum}}{11-1}} = \sqrt{\dfrac{14}{10}} \approx 1.2$
Students: Standard Deviation = $\sqrt{\dfrac{\text{Sum}}{11-1}} = \sqrt{\dfrac{91.6364}{10}} \approx 3.03$
d) By the range rule, the standard deviation is approximately range/4, which for the faculty is 4/4 = 1 and for the students is 9/4 = 2.25. Both estimates are low, but are reasonably close.
e) The students have a higher mean age for their cars and a much greater variation in ages.

26 a) For the school zone, Mean = $\dfrac{20+18+23+21+19+18+17+24+25}{9} = \dfrac{185}{9} \approx 20.6$ mph. Median equals the fifth number in the ordered list and is 20 mph. Range = 25 - 17 = 8 mph.
For the downtown intersection, Mean = $\dfrac{29+31+35+24+31+26+36+31+28}{9} = \dfrac{271}{9} \approx 30.1$ mph. Median equals the fifth number in the ordered list and is 31 mph. Range = 36 - 24 = 12 mph.
b) The lower quartile is the average of the two middle values of the lowest 4 values in each data set.
The upper quartile is the average of the two middle values of the highest 4 values in each data set.
                   School   Downtown
Low Value            17        24
Lower quartile       18        27
Median               20        31
Upper quartile       23.5      33
High Value           25        36
[Boxplot of School, Downtown]
c) Computation of standard deviation
School Zone                                      Downtown
Value   Deviation = Value - Mean   Deviation²    Value   Deviation = Value - Mean   Deviation²
 20            -0.6                  0.36          29            -1.1                 1.21
 18            -2.6                  6.76          31             0.9                 0.81
 23             2.4                  5.76          35             4.9                24.01
 21             0.4                  0.16          24            -6.1                37.21
 19            -1.6                  2.56          31             0.9                 0.81
 18            -2.6                  6.76          26            -4.1                16.81
 17            -3.6                 12.96          36             5.9                34.81
 24             3.4                 11.56          31             0.9                 0.81
 25             4.4                 19.36          28            -2.1                 4.41
Mean = 20.6               Sum =     66.24        Mean = 30.1               Sum =    120.89
The means for the school zone and the downtown intersection are given at the bottoms of the first and fourth columns respectively. The deviations for the school zone are obtained by subtracting the mean from each number in the first column. For downtown, the deviations are obtained by subtracting the mean from each number in the fourth column. The squared deviations are then placed in the third and sixth columns, and their totals are shown at the bottom of the columns. The standard deviations are then found by dividing the sum of the squared deviations by n-1 = 9-1 = 8 and taking the square root. Thus, the standard deviations are
School Zone: Standard Deviation = $\sqrt{\dfrac{\text{Sum}}{9-1}} = \sqrt{\dfrac{66.24}{8}} \approx 2.88$
Downtown: Standard Deviation = $\sqrt{\dfrac{\text{Sum}}{9-1}} = \sqrt{\dfrac{120.89}{8}} \approx 3.89$
d) By the range rule, the standard deviation is approximately range/4, which for the school zone is 8/4 = 2 and for downtown is 12/4 = 3. Both estimates are low, but are reasonably close.
e) The average speed is higher downtown and the variation is slightly greater downtown.

27 a) For the first seven Presidents, Mean = $\dfrac{57+61+57+57+58+57+61}{7} = \dfrac{408}{7} \approx 58.3$ years. Median equals the fourth number in the ordered list and is 57 years.
Range = 61 - 57 = 4 years.
For the last seven Presidents, Mean = $\dfrac{56+61+52+69+64+46+54}{7} = \dfrac{402}{7} \approx 57.4$ years. Median equals the fourth number in the ordered list and is 56 years. Range = 69 - 46 = 23 years.
b) The lower quartile is the middle value of the lowest 3 values in each data set. The upper quartile is the middle value of the highest 3 values in each data set.
                   First 7   Last 7
Low Value            57        46
Lower quartile       57        52
Median               57        56
Upper quartile       61        64
High Value           61        69
[Boxplot of First 7, Last 7]
c) Computation of standard deviation
First 7                                          Last 7
Value   Deviation = Value - Mean   Deviation²    Value   Deviation = Value - Mean   Deviation²
 57            -1.3                  1.69          56            -1.4                 1.96
 61             2.7                  7.29          61             3.6                12.96
 57            -1.3                  1.69          52            -5.4                29.16
 57            -1.3                  1.69          69            11.6               134.56
 58            -0.3                  0.09          64             6.6                43.56
 57            -1.3                  1.69          46           -11.4               129.96
 61             2.7                  7.29          54            -3.4                11.56
Mean = 58.3               Sum =     21.43        Mean = 57.4               Sum =    363.72
The means for the first 7 and the last 7 presidents are given at the bottoms of the first and fourth columns respectively. The deviations for the first 7 are obtained by subtracting the mean from each number in the first column. Similarly, deviations for the last 7 are obtained by subtracting the mean from each number in the fourth column. The squared deviations are then placed in the third and sixth columns, and their totals are shown at the bottom of the columns. The standard deviations are then found by dividing the sum of the squared deviations by n-1 = 7-1 = 6 and taking the square root. Thus, the standard deviations are
First 7 Presidents: Standard Deviation = $\sqrt{\dfrac{\text{Sum}}{7-1}} = \sqrt{\dfrac{21.43}{6}} \approx 1.9$
Last 7 Presidents: Standard Deviation = $\sqrt{\dfrac{\text{Sum}}{7-1}} = \sqrt{\dfrac{363.72}{6}} = \sqrt{60.62} \approx 7.8$
d) By the range rule, the standard deviation is approximately range/4, which for the first 7 presidents is 4/4 = 1.00 and for the last 7 presidents is 23/4 = 5.75. Both estimates are low, but are reasonably close.
e) The average ages of the first 7 and last 7 presidents are about the same, but the variation is over three times greater among the last 7 presidents.

28 a) For Beethoven’s symphonies, Mean = $\dfrac{28+36+50+33+30+40+38+26+68}{9} = \dfrac{349}{9} \approx 38.8$ minutes. Median equals the fifth number in the ordered list and is 36 minutes. Range = 68 - 26 = 42 minutes.
For Mahler’s symphonies, Mean = $\dfrac{52+85+94+50+72+72+80+90+80}{9} = \dfrac{675}{9} = 75.0$ minutes. Median equals the fifth number in the ordered list and is 80 minutes. Range = 94 - 50 = 44 minutes.
b) The lower quartile is the average of the two middle values in the lowest 4 values of the data set. The upper quartile is the average of the two middle values in the highest 4 values of the data set.
                   Beethoven   Mahler
Low Value             26         50
Lower quartile        29         62
Median                36         80
Upper quartile        45         87.5
High Value            68         94
[Boxplot of Beethoven, Mahler]
c) Computation of standard deviation
Beethoven                                        Mahler
Value   Deviation = Value - Mean   Deviation²    Value   Deviation = Value - Mean   Deviation²
 28           -10.8                116.64          52            -23                 529
 36            -2.8                  7.84          85             10                 100
 50            11.2                125.44          94             19                 361
 33            -5.8                 33.64          50            -25                 625
 30            -8.8                 77.44          72             -3                   9
 40             1.2                  1.44          72             -3                   9
 38            -0.8                  0.64          80              5                  25
 26           -12.8                163.84          90             15                 225
 68            29.2                852.64          80              5                  25
Mean = 38.8               Sum =   1378.96        Mean = 75.0               Sum =   1908
The mean lengths for Beethoven’s and Mahler’s symphonies are given at the bottoms of the first and fourth columns respectively.
The deviations for Beethoven’s are obtained by subtracting the mean from each number in the first column. Similarly, the deviations for Mahler’s are obtained by subtracting the mean from each number in the fourth column. The squared deviations are then placed in the third and sixth columns, and their totals are shown at the bottom of the columns. The standard deviations are then found by dividing the sum of the squared deviations by n-1 = 9-1 = 8 and taking the square root. Thus, the standard deviations are
Beethoven: Standard Deviation = $\sqrt{\dfrac{\text{Sum}}{9-1}} = \sqrt{\dfrac{1378.96}{8}} \approx 13.13$
Mahler: Standard Deviation = $\sqrt{\dfrac{\text{Sum}}{9-1}} = \sqrt{\dfrac{1908}{8}} \approx 15.44$
d) By the range rule, the standard deviation is approximately range/4, which for Beethoven’s symphonies is 42/4 = 10.5 and for Mahler’s is 44/4 = 11.0. Both estimates are low, but are reasonably close.
e) The average length of Mahler’s symphonies is much greater than that of Beethoven’s, but the variation is about the same for both composers.

29 The second shop has a slightly lower average delivery time, but its standard deviation is so large that you risk the pizza being delivered 20 to 40 minutes late. It could, of course, arrive much earlier than you expected as well. If you need to know the arrival time quite closely, you should order from the first shop, particularly since the average delivery time is only three minutes longer.
30 Kevin, who has the larger standard deviation, is more likely to serve up more very small servings and is more likely to generate more complaints.
31 A lower standard deviation means more certainty in the value of the portfolio and less risk.
32 The batting averages are more closely bunched today than in the past. Since the overall average has remained at 0.260, averages above 0.350 should be less common today.

Section 4.4 Statistical Literacy and Critical Thinking
1 A false positive occurs when the test indicates drug use for someone who does not actually use drugs.
A false negative occurs when the test indicates that drugs are not used, but the subject actually does use drugs.
2 Cancer tests are not perfect. Some test results may be positive even though the patient does not have cancer.
3 A polygraph test can be positive (indicating that the subject is lying) even though the subject is telling the truth. If most people tested are not lying, a small percentage of false positive results can still be a fairly large number. And while a small number of liars may produce a high percentage of true positives, the actual number of true positives may be quite small. Thus, among the positives, there may be more false positives than true positives, resulting in a high proportion of false accusations.
4 Yes. For example, consider the table below.
Quarterback   Half 1          Half 2          Game
A             25/42 = 0.60    2/3 = 0.67      27/45 = 0.60
B             5/10 = 0.50     59/90 = 0.66    64/100 = 0.64
Quarterback B has the higher completion percentage for the entire game, even though Quarterback A has the higher completion percentage in each half. [Probably, no quarterback has thrown 100 passes in a game, but the example illustrates how this result can happen.]
5 This statement makes sense. When both people have the same number of scores in each category, if one person has a higher average in each category, that person will also have a higher average overall.
6 This statement is not true. It is similar to the quarterback example in Exercise 4 above. [Substitute Ann for A and Bret for B in Exercise 4 and you can see that it is possible for Ann to have the higher average in each half of the season, but Bret is higher overall for the whole season.]
7 This statement does not make sense. These are two entirely different probabilities.
8 This statement does not make sense. If the test is 90% accurate, it means that 90% of drug users will test positive and 90% of non-users will test negative. It does not mean that 90% of those who test positive are drug users. This situation is similar to the mammogram example in the text.
Even though that test was 85% accurate, only about 5% of patients with positive test results actually have cancer.

Concepts and Applications
9 Josh had the higher batting average in the first and second halves of the season, but Jude had 80 hits in 200 at bats (.400 average) for the entire season while Josh had 85 hits in 220 at bats (.386 average), so Jude had the higher overall batting average. This is an illustration of Simpson’s Paradox and it can happen because of the unequal numbers of at bats for the players in both halves of the season.
10 Allan had the higher completion percentage in both halves of the game, but Abner had 14 for 31 (45% completions) while Allan had 11 for 26 (42% completions) for the entire game, so Abner had the higher completion percentage for the entire game. This is an illustration of Simpson’s Paradox and it can happen because of the unequal numbers of passes thrown for the players in both halves of the game.
11 a) New Jersey had the higher scores in both racial categories, but Nebraska had the higher overall average across both racial categories.
b) This can happen because of the unequal percentages of whites and non-whites in the two states.
c) The overall average for New Jersey is a weighted average with the weights being the percentages of whites and non-whites in New Jersey. Thus
Mean = $\dfrac{66(283)+34(252)}{66+34} = \dfrac{27246}{100} = 272.46$, or 272.
12 a) The average SAT scores in all five grade categories went down from 1988 to 1998 (average scores were lower in 1998).
b) The overall average SAT scores went up from 1988 to 1998 (average scores were higher in 1998).
c) This is an illustration of Simpson’s Paradox and it can happen because of the unequal percentages of students in each of the grade categories during the two years.
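The weighted-average computation in Exercise 11 c) can be checked numerically. The following is a minimal Python sketch, not part of the original answer key: the function name `weighted_mean` is a hypothetical helper introduced for illustration, and the inputs are the New Jersey group means (283 and 252) and percentage weights (66 and 34) quoted above.

```python
def weighted_mean(values, weights):
    """Return the weighted mean of `values` using the matching `weights`."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    # Multiply each group mean by its weight, then divide by the total weight.
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# New Jersey figures from Exercise 11 c): group means 283 and 252,
# weighted by the group percentages 66 and 34 (hypothetical variable name).
nj_overall = weighted_mean([283, 252], [66, 34])
print(round(nj_overall, 2))  # 272.46
```

Because the weights differ between states, a state can lead in every group and still trail in the overall weighted mean, which is the mechanism behind Simpson’s Paradox in these exercises.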
13 a) The death rates in New York City were
Whites: 8400/4,675,000 = 0.001797
Non-whites: 500/92,000 = 0.005435
Overall: 8900/4,767,000 = 0.001867
b) The death rates in Richmond were
Whites: 130/81,000 = 0.001605
Non-whites: 160/47,000 = 0.003404
Overall: 290/128,000 = 0.002266
c) New York City had higher TB death rates than Richmond for both whites and non-whites in 1910, but Richmond had the higher overall TB death rate. This is an illustration of Simpson’s Paradox and it can happen because of the unequal proportions of whites and non-whites in the two cities, New York City being 98% white (and 2% non-white) while Richmond was 63% white (and 37% non-white).
14 The Gazelles had the higher mean improvement in both categories, but the Cheetahs had the higher overall mean improvement. This could happen if the percentages of the teams participating in weight training were different for the two teams. For the Gazelles, let x represent the proportion of the team that participated in weight training. Then 1 – x is the proportion that did not. The overall team average improvement is a weighted average of the two group averages with the weights being x and 1 – x. Thus x(10) + (1 – x)(2) = 6.0. Simplifying, this becomes 8x + 2 = 6.0 or 8x = 4. From this, we see that x must be 0.5, so 50% of the Gazelles participated in weight training. Similarly for the Cheetahs, we have x(9) + (1 – x)(1) = 6.2. Simplifying, 8x + 1 = 6.2 or 8x = 5.2. From this we see that x = 5.2/8 = 0.65, so 65% of the Cheetahs participated in weight training.
15 a) Spelman College has a home record of 10/29 = 0.345, while Morehouse College has a home record of 9/28 = 0.321. For away games, Spelman has a record of 12/16 = 0.750, while Morehouse has a record of 56/76 = 0.737. Thus Spelman has the better record both home and away.
b) Spelman’s overall record is 22/45 = 0.489, while Morehouse’s overall record is 65/104 = 0.625, so Morehouse has the better overall record.
c) At the end of the season, it makes no difference where games were won and lost, so the only record that should be used for comparisons is the overall record. Thus Morehouse College has the better team.
16 a) Among women, Drug B cured 101/900 = 0.112 while Drug A cured 5/100 = 0.050. Among men, Drug B cured 196/200 = 0.980, while Drug A cured 400/800 = 0.500. Thus, Drug B did better among women and among men.
b) Overall, Drug A cured 405/900 = 0.450, while Drug B cured 297/1100 = 0.270. Thus, Drug A did better overall.
c) In this case, you might want to look at the results from the patients’ viewpoints. A woman would probably prefer Drug B because its cure rate was almost twice that of Drug A for women. Similarly, a man would probably prefer Drug B because its cure rate was almost twice that of Drug A for men. The cure rates for both drugs are very different for men and women, so using the overall rate for comparison doesn’t make much sense.
17 a) Of the 2,000 employees, 1% or 20 use drugs. The polygraph test should detect 90% of those 20, or 18. The other 2 go undetected. These figures are shown in the first column. Among the 1,980 non-users, the test should be negative for 90% of them or 1,782. The test should find the remaining 198 to be lying. These figures are shown in the second column.
b) The number accused of lying is 18 + 198 = 216. Only 18 of these were actually lying while 198 were telling the truth. Thus, 198 out of 216, or 91.7%, were falsely accused.
c) The number found to be truthful is 2 + 1,782 = 1,784. Of these, 1,782, or 99.9%, really were truthful.
18 a) Out of 4,000 people, 1.5% or 60 people should have the disease. This is the total in the first column. Since the test is 80% accurate, it should detect 80% of those 60 people, or 48. This is the number of positive tests in the first column.
Of the remaining 3,940 who do not have the disease, the test should be negative for 80%, or 3,152.
b) This is the number of negative tests in the second column. The rest of the table follows automatically by addition and subtraction.
c) Of the 836 who test positive, 48, or 5.7%, actually have the disease. This is the proportion of those having the disease given that they have tested positive. This is not the same as the proportion of people who test positive (80%) given that they have the disease.
d) You should describe the patient’s chance of having the disease as about 6%, or 1 chance in 16. This is higher than the 1.5% incidence rate of the disease. If the test is going to be useful at all in diagnosing the disease, this is what one would hope for. If the rate of true positives is not higher than the incidence rate, then the test is not revealing anything.
19 a) A higher percentage of women applicants were hired for both the white-collar and blue-collar positions. This suggests that the company hires women preferentially. Overall, there were 300 female and 600 male applicants. Forty females were hired for white-collar positions (20% of 200) and 85 were hired for blue-collar positions (85% of 100). Therefore, 125 of the 300 female applicants (41.7%) were hired overall. Thirty males were hired for white-collar positions (15% of 200) and 300 were hired for blue-collar positions (75% of 400). Therefore, 330 of the 600 male applicants (55%) were hired overall. The female applicants had a higher success rate in both categories of jobs, but the male applicants had a higher success rate overall. This apparent paradox is a result of the fact that males and females did not apply for the two kinds of jobs in equal numbers. That is not something that the company can control. Two-thirds of the women applied for white-collar positions and two-thirds of the men applied for blue-collar positions.
20 Treatment A had the better success rate in both trials.
Overall, Treatment A was successful in 40 + 85 = 125 cases out of 300 cases (41.7%), while Treatment B was successful in 30 + 300 = 330 cases out of 600 cases (55%), so Treatment B was the more successful treatment overall. This apparent paradox is a result that can happen when the number of patients using the two treatments is quite different in the two trials. In this case, because the success rates are very different for both drugs in the two trials, one must assume that there was something very different about the subjects in the two trials. Perhaps only females were tested in the first trial and only males in the second, or maybe the subjects were all young in one trial and all old in the other, or possibly the subjects were all in late stages of the disease in the first trial and all in early stages of the disease in the second trial. If the circumstances of the trials were not similar in some way such as these, the results should not be combined and the results of the individual trials should be taken into account when prescribing treatment.
21 a) Of the 5,000 people in the at-risk sample, 475 + 25 = 500 are infected. This is 10% of the sample. Of the 20,000 people in the general population sample, 57 + 3 = 60 are infected. This is 0.003 or 0.3% of the sample. Thus the table reflects the estimated incidence rates. Of those infected in the at-risk population, 475 out of 500 tested positive (95%), and of those infected in the general population, 57 out of 60 tested positive (95%).
b) In the at-risk population, 95% (475 out of 500) of those with HIV test positive. Of those who test positive, 475 out of 700 (67.9%) have HIV. These are different percentages because the first is the proportion of those with HIV who test positive, while the second is the proportion of those who test positive who have HIV.
c) The chance of the patient having HIV is about 68%, which is considerably higher than the incidence rate of 10% for the at-risk population. If the test is any good, one would expect that the proportion of those who test positive who actually have HIV would be higher than the incidence rate. If it isn’t, the test is not revealing anything useful.
d) In the general population, 95% (57 out of 60) of those with HIV test positive. Of those who test positive, 57 out of 1054 (5.4%) have HIV. These are different percentages because the first is the proportion of those with HIV who test positive, while the second is the proportion of those who test positive who have HIV.
e) The chance of the patient having HIV is about 5.4%, which is considerably higher than the incidence rate of 0.3% for the general population. If the test is any good, one would expect that the proportion of those who test positive who actually have HIV would be higher than the incidence rate.

Chapter 4 Review Exercises
1 a) Red: Mean = (0.751 + 0.841 + ... + 0.905)/13 = 0.8635
Green: Mean = (0.925 + 0.914 + ... + 0.881)/19 = 0.8635
Red: The median is the seventh number in the ordered list of 13 data values, which is 0.8590.
Green: The median is the tenth number in the ordered list of 19 data values, which is 0.8650.
b) Red: Range = maximum – minimum = 0.966 – 0.751 = 0.2150
Green: Range = maximum – minimum = 1.015 – 0.778 = 0.2370
Red      Deviation   Deviation²     Green    Deviation   Deviation²
0.751    -0.11254    0.012665       0.925     0.061474   0.003779
0.841    -0.02254    0.000508       0.914     0.050474   0.002548
0.856    -0.00754    0.000057       0.881     0.017474   0.000305
0.799    -0.06454    0.004165       0.865     0.00147    0.000002
0.966     0.102462   0.010498       0.865     0.00147    0.000002
0.859    -0.00454    0.000021       1.015     0.15147    0.022944
0.857    -0.00654    0.000043       0.876     0.01247    0.000156
0.942     0.078462   0.006156       0.809    -0.05453    0.002973
0.873     0.009462   0.000090       0.865     0.00147    0.000002
0.809    -0.05454    0.002974       0.848    -0.01553    0.000241
0.890     0.026462   0.000700       0.940     0.07647    0.005848
0.878     0.014462   0.000209       0.833    -0.03053    0.000932
0.905     0.041462   0.001719       0.845    -0.01853    0.000343
                                    0.852    -0.01153    0.000133
                                    0.778    -0.08553    0.007315
                                    0.814    -0.04953    0.002453
                                    0.791    -0.07253    0.005260
                                    0.810    -0.05353    0.002865
                                    0.881     0.01747    0.000305
Sum =                0.039805       Sum =                0.058407
The deviations for Red M&Ms are obtained by subtracting the mean from each number in the first column. For Green M&Ms, the deviations are obtained by subtracting the mean from each number in the fourth column. The squared deviations are then placed in the third and sixth columns, and their totals are shown at the bottom of the columns. The standard deviation for the Red M&Ms is then found by dividing the sum of the squared deviations by n-1 = 13-1 = 12 and taking the square root. For the Green M&Ms, divide the total of the squared deviations by n-1 = 19-1 = 18 and take the square root. Thus, the standard deviations are
Red M&Ms: Standard Deviation = $\sqrt{\dfrac{\text{Sum}}{n-1}} = \sqrt{\dfrac{0.039805}{13-1}} \approx 0.0576$
Green M&Ms: Standard Deviation = $\sqrt{\dfrac{\text{Sum}}{n-1}} = \sqrt{\dfrac{0.058407}{19-1}} \approx 0.0570$
c) Five number summaries
                   Red       Green
Minimum            0.7510    0.7780
First Quartile     0.8250    0.8140
Median             0.8590    0.8650
Third Quartile     0.8975    0.8810
Maximum            0.9660    1.0150
[Boxplot of Red, Green]
For the Red, the first quartile is the average of the two middle values of the lowest six values of the data set, 0.809 and 0.841, while the third quartile is the average of the two middle values of the highest six values, 0.890 and 0.905. For the Green, the first quartile is the middle value of the lowest nine values of the data set, 0.814, while the third quartile is the middle value of the highest nine values, 0.881.
d) By the range rule, the standard deviation is approximately range/4, which for Red M&Ms is 0.215/4 = 0.054 and for Green M&Ms is 0.237/4 = 0.059. Both estimates are very close to the real values, in part because there are no extreme outliers and the sample sizes are reasonably large.
e) The means and medians are close for the Red and Green M&Ms. The ranges and standard deviations are also close. Therefore, there is not much difference in either the center or the variation in the distributions of the Red M&Ms and the Green ones.
2 a) There are 32 values in the combined sample. After ordering the values from smallest to largest, 0.845 is the eleventh value. There are 10 values smaller than 0.845. Since 10/32 = 0.3125, the value 0.845 is in the 31st percentile.
b) The mode is 0.865 g. It occurs three times in the combined list and no other value occurs more than twice.
3 a) Zero. The mean will be the same as each of the 50 values. That means that each deviation from the mean will be zero and their squares will all be zero. The sum of the squares of the deviations will be zero, and therefore the standard deviation will also be zero.
b) This is a toss-up.
While both batteries are equally likely to achieve a life length of 48 months, the batteries with a standard deviation of 2 months are likely to come closer to lasting exactly 48 months. Some of the batteries with a 6 month standard deviation will likely fail well before the 48 months is up, but an equal number of them will last somewhat beyond the 48 month period.
c) The outlier pulls the mean either up or down, depending on whether it is above or below the mean, respectively.
d) The outlier has no effect on the median since the median is found as the average of the two middle values in the ordered list (25th and 26th in a sample of size 50). The outlier would be either first or last in the list.
e) The outlier increases the range since one of the two numbers used to compute the range will be the outlier.
f) The outlier increases the standard deviation since one of the squared deviations will be larger than it would be if there were no outliers.

Chapter 4 Quiz
1 This value is the mean.
2 The standard deviation is the only statistic in the list that is a measure of variation. All of the others are measures of center.
3 No. It is an estimate based on only the largest and smallest values, whereas the actual standard deviation is based on all of the values in the sample.
4 Any one of the statements could be correct. If all of the values are different, there is no mode. If two values occur equally often, but more often than any of the other values, there are two modes. If three values occur equally often, but more often than any of the other values, there are three modes.
5 The 20th percentile must be less than the 30th percentile. The median is greater than the first quartile. The third quartile is greater than the first quartile. The mean could be equal to the median, but it doesn’t have to be.
Since all of the values are different, the maximum and minimum values cannot be the same. Therefore, the range cannot be zero.
6 The range rule of thumb says that the standard deviation is approximately 1/4 of the range. If the standard deviation is 10, the range is about 40. Assuming that the distribution of the values is symmetric, the high value will be greater than the mean by 20 and the low value will be less than the mean by 20. In that case the likely low value will be about 30 and the high value about 70.
7 The range is 10 – 2 = 8, so the standard deviation is estimated to be about 8/4 = 2.0.
8 Since all of the values are the same, the mean will also be 5.8, making all of the deviations from the mean equal to zero. When you square and sum the deviations, the result is zero, so the standard deviation is zero.
9 The range = maximum – minimum = 9.0 – 2.0 = 7.0.
10 The five number summary consists of the minimum value, the first quartile, the median, the third quartile, and the maximum value.
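The quiz answers on the range rule of thumb and on identical values can be checked with a short Python sketch. This is illustrative only and not part of the original answer key: the function names `sample_std` and `range_rule_std` are hypothetical, and the choice of four copies of 5.8 is an assumption, since the quiz answer does not say how many values the data set has.

```python
import math


def sample_std(values):
    """Sample standard deviation: sqrt(sum of squared deviations / (n - 1))."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


def range_rule_std(low, high):
    """Range rule of thumb: the standard deviation is roughly range / 4."""
    return (high - low) / 4


# Quiz answer 7: range 10 - 2 = 8, so the estimate is 8/4.
print(range_rule_std(2, 10))  # 2.0

# Quiz answer 8: identical values give zero deviations (count of 4 is assumed).
print(round(sample_std([5.8, 5.8, 5.8, 5.8]), 3))  # 0.0
```

The range rule uses only the two extreme values, so it is a rough estimate; `sample_std` uses every value, which is why the two can differ noticeably when outliers are present.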