Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
106 CHAPTER 6: PUTTING STATISTICS TO WORK UNIT 6A TIME OUT TO THINK Pg. 373. In this case, saying that the average contract offer is zero also seems a bit misleading (just as did saying it is $700,000 if we used the mean). Pg. 375. Assuming the administration uses 12 faculty members to teach the 12 classes, they could simply redistribute assignments so that each of the 3 courses uses 4 faculty teaching sections of 25 students each. The advantages and disadvantages of such a change are a matter of opinion, but a few factors to consider: redistributing means students will have fairly small classes all around, which is generally a good thing, but will not have the benefit of the extra small class that they now have in one case. Other factors may also be involved, such as having only one qualified faculty member for one of the courses. Pg. 377(1st). In either a right-skewed or left-skewed distribution, the mean is further away from the mode (center) than the median, which shows that the mean is affected by outliers and the median is not. Pg. 377(2nd). In statistics, it refers to a distortion away from symmetry. QUICK QUIZ 1. b. The mean of a data set is the sum of all the data values divided by the number of data points. 2. b. The median of a data set is the middle value when the data are placed in numerical order. 3. c. Outliers are data values that are much smaller or larger than the other values. 4. a. Outliers that are larger than the rest of the data set tend to pull the mean to the right of the median, and this is the case for all possible data sets that satisfy the conditions given. 5. c. Because the mean is significantly higher than the median, the distribution is skewed to the right, which implies some students drink considerably more than 12 sodas per week. Specifically, it can be shown that at least one student must drink more than 16 sodas per week. 6. c. The high salaries of a few superstars pull the tail of this distribution to the right. 7. a. High outliers tend to pull the mean to the right of the median, and thus you would hope to get a salary near the mean. There are data sets with high outliers that can be constructed where the median is higher than the mean, but these are rare. 8. a. In general, data sets with a narrow central peak typically have less variation than those with broad, spread out values (the presence of outliers can confound the issue). 9. c. Driving in rush hour can go quickly on some days, but is agonizingly slow on others, so the time it takes to get downtown has high variation. 10. b. The mayor would like to see a high median and low variation because that means there’s a tight, central peak in the distribution that corresponds to support for her. DOES IT MAKE SENSE? 7. Makes sense. The two highest grades may be large enough to balance the remaining seven lower grades so that the third highest is the mean. 8. Does not make sense. The median for this exam is, by definition, the mean of the two middle scores (the fifth and sixth scores), and thus it could not be the third highest score. 9. Makes sense. When very high outliers are present, they tend to pull the mean to the right of the median (right-skewed). 10. Does not make sense. The word “average” has several meanings in mathematics, and the management and employees may be using different measures of the average (the mean and the median, for example). 11. Does not make sense. When the mean, median, and mode are all the same, it’s a sign that you have a symmetric distribution of data. 12. Makes sense. Suppose the mean age of the general population is 32 years. There’s no reason why a college extension class (which typically attracts older students) could not share the same mean. BASIC SKILLS & CONCEPTS 13. The mean and median are 5.68 and 5.685, respectively. Because there is an even number of data points, the median is calculated by taking the mean of the middle two values (add 5.67 and 5.70, and divide by 2). Since no value occurs more than once, there is no mode. 14. The mean, median, and mode are 98.44, 98.4, and 98.4, respectively. Because there is an even number of data points, the median is calculated by taking the mean of the middle two values (add 98.4 and 98.4, and divide by 2). 15. The mean and median are 201.14 and 200, respectively. Since no value occurs more than once, there is no mode. 16. The mean and median are 78.4 and 79, respectively. This is a multi-modal data set, with modes of 70, 73, 76, and 81. 17. The mean and median are 1.18 and 1.05, respectively. Since no value occurs more than once, there is no mode. UNIT 6A: CHARACTERIZING DATA 107 18. The mean and median are 0.9194 and 0.920, respectively. There is no mode. 19. The mean and median are 0.81237 and 0.8161, respectively. The outlier is 0.7901, because it varies from the others by a couple hundredths of a pound, while the rest vary from each other only by a few thousandths of a pound. Without the outlier, the mean is 0.81608, and the median is 0.8163. 20. The mean is 11.125 years, and the median is 7 years. 27 is an outlier. Eliminating 27, the mean and median are 5.83 and 3, respectively. 21. The median would be a better representation of the average. A small number of households with high earnings skew the distribution to the right, affecting the mean. 22. This distribution is probably skewed to the right by a small percentage of men who marry (for the first time) very late in life. Therefore the median would do a better job representing the average age (the outliers affect the mean). 23. This distribution is probably skewed to the right by a small percentage of people who change jobs many times. Thus the median would do a better job of representing the average, as the outliers affect the mean. 24. More likely than not, this distribution is skewed to the right by a few flights where many pieces of luggage were lost. Because of this, the mean is probably higher than the median, and thus the airline would want to use the median as the best representation of the average. Customers of the airline might argue that the mean is a better descriptor for the average. 25. This is a classic example of a symmetric distribution, and thus the mean and the median are almost identical (in a perfectly symmetric distribution, they are identical) – either the mean or the median could be used. 26. This distribution is probably skewed to the right by those times when the wait is long, and therefore the median is a better representation of the average (certainly from the view of an advertisement for the bank – customers might claim the mean is a more honest way of describing the average wait time). 27. a. There is one peak on the far left due to all the people who scored low. b. The distribution is right-skewed because the scores tail off to the right. c. The variation is large. 28. a. There would be two peaks, one representing the figure skaters who use the rink in the morning, who would have lower weights, and the other representing the professional hockey players who would have higher weights. b. This distribution would be symmetric, as each peak would be close to the same shape, centered around a point in the middle of the data. c. The variation would be moderate, since the variation within the weights of hockey players and within the weights of the figure skaters would not be great. 29. a. There are likely two peaks in such a distribution, one for the women, and one for the men. b. The distribution is symmetric because one would find a tail on the left representing the lightest people on the team, and a tail on the right representing the heaviest people on the team. c. Relative to a single-gendered set of athletes, the variation would be large. 30. a. The distribution would not have a peak since all digits, 0–9, have nearly the same frequency in the case where the last digit is selected randomly. b. This is a symmetric distribution because it is uniform. c. The data values are spread out evenly across the distribution, so there is large variation. 31. a. Consider a typical data set for the monthly sales: we expect the sales to be high in the winter months, with very low sales (or even no sales) in the summer. The number of shovels sold each month, beginning in January, might look like this: {25, 20, 15, 3, 0, 0, 0, 0, 0, 3, 15, 20}. To picture this data set as a distribution, we create a frequency histogram. This histogram has two peaks, the second not as dramatic as the first. Realize that although the histogram would look different if we varied the bin sizes or the hypothetical sales data, its basic shape is captured in the diagram above. This is due to the fact that a majority of the monthly sales figures are expected to be low (corresponding to the left peak), and the winter monthly sales figures would be high (corresponding to the right peak). 108 CHAPTER 6: PUTTING STATISTICS TO WORK b. The histogram shown in part a is right-skewed. c. The variation is large because of the difference between low sales in the summer, and high sales in the winter. 32. a. There are likely two peaks in such a distribution, one for the compact cars, and one for the sports utility vehicles. b. The frequency histogram in part a would be symmetric. c. The variation in car weights would be moderate because the weights of compact cars and sports utility vehicles would differ greatly, but there would be little variation among the weights of compact cars and among the weights of sports utility vehicles. 33. a. There would be one peak in the distribution, corresponding to the average price of detergent. b. More likely than not, there would be no extreme outliers in this data set, and it would be symmetric. c. There would be moderate variation in the data set because we don’t expect outliers. (The detergent may not sell well next to 19 other brands if its price is radically different. Of course this begs the question, “But what if it’s much cheaper than the other brands?” That’s not too likely to happen for a commodity like detergent, because once one company begins selling its detergent at a low price, the others would be forced to lower their prices to compete). 34. a. There would be one peak to the left, corresponding to the ages of children, who would visit the amusement parks in larger numbers than adults. b. The distribution is right-skewed because the ages of adults visiting the amusement park would tail off to the right. c. The variation would be large, since there could be infants and senior citizens visiting the amusement park. FURTHER APPLICATIONS 35. The distribution has two peaks (bimodal), with no symmetry, and large variation. For tourists who want to watch the geyser erupt, it means they may have to wait around for a while (most likely between 25 and 40 minutes, though perhaps as long as 50 to 90 minutes, depending on when they arrive at the site). 36. The distribution has one peak, and is right-skewed. Most of the data is clustered around its peak, though it has considerable spread, so it has moderate variation. The histogram tells us that if a computer chip is going to fail, it will most likely fail soon, rather than late. 37. This distribution has one peak, is symmetric, and has moderate variation. The graph says that weights of rugby players are usually near the mean weight, but that there are some light and some heavy players. 38. a. As shown, the graph is not symmetric (though the idea it represents – uniform distribution of randomly chosen numbers – is symmetric). b. No, the peaks appear randomly (though, again, the smooth distribution one envisions when numbers are chosen randomly could be described as a distribution with no peaks). c. No, the distribution would look different because the numbers are chosen at random, and the ups and downs of the graph would appear at random (and different) locations. d. The histogram would get smoother, approaching a uniform distribution. 39. a. Sketches will vary. We cannot be sure of the exact shape of the distribution with the given parameters, as we don’t have all the raw data. However, because the mean is larger than the median, and because the data set has large outliers, it would likely be right-skewed, with a single peak at the mode. b. About 150 families (50%) earned less than $45,000 because that value is the median income. c. We don’t have enough information to be certain how many families earned more than $55,000, but it is likely a little less than half, simply because the mean is a measure of center for all distributions. (Note that it is a little less than half because the outliers tend to pull the mean right of the median). 40. a. Because the mode is 0 minutes, and the distribution is right skewed (its mean is right of the median, and there are high outliers), the graph would have a single peak at 0 and would decrease to the right. (Note that planes do not typically depart earlier than scheduled – it would be unwise to leave passengers behind – and thus there are no negative delays). b. 50% of the flights were delayed less than the median time of 6 minutes due to the definition of the median. c. 50% of the flights were delayed more than the median time of 6 minutes due to the definition of the median. TECHNOLOGY EXERCISES 48. a. The mean, median and mode for team A are 134, 125, and 115 respectively. The mean and median for team B are 131 and 135, respectively. There is no mode for Team B. b. The coach of team A can claim that his team has the greater mean weight. UNIT 6B: MEASURES OF VARIATION 109 c. The coach of team B can claim that his team has the greater median weight. d. The mean weight of the teams combined is 132.5, which is equal to the mean of the two mean weights of the teams. This is true since both teams have the same number of players. e. The median weight of the teams combined is 130, which is equal to the median of the two median weights of the teams. This result is not true in general. UNIT 6B 8. a. Using the range rule of thumb, we see that the range is about four times as large as the standard deviation. (The worst-case scenario is when the range and standard deviation are both zero). 9. b. Newborn infants and first grade boys have similar heights, whereas there is considerable variation in the heights of all elementary school children. 10. c. Because the standard deviation for Garcia is the largest, the data are more spread out, so Professor Garcia must have had very high grades from more students than the other two. DOES IT MAKE SENSE? TIME OUT TO THINK Pg. 382. Examples and opinions will vary. Big Bank’s three lines will have greater variation since customers entering the bank can choose the shortest line, change lines if the customer at the front of the line has a time consuming transaction to complete, and there is a chance that the person at the front of all three lines has a lengthy transaction. Pg. 387. Students should notice the sense in which the standard deviation is an average of the individual deviations. Also, notice that the larger deviations are more heavily weighted in the calculation as a result of taking the squares. QUICK QUIZ 1. a. The range is defined as the high value – low value. 2. c. The five-number summary includes the low value, the first quartile, median, and third quartile, and the high value. Unless the mean happens to be the same as the median, it would not be part of the five-number summary. 3. a. Roughly half of any data set is contained between the lower and upper quartiles (“roughly” because with an odd number of data values, one cannot break a data set into equal parts). 4. b. Consider the set {0, 0, 0, 0, 0, 0, 0, 0, 0, 100}. Its mean is 10, which is larger than the upper quartile of 0. 5. b. You need the high and low values to compute the range, and you need all of the values to compute the standard deviation, so the only thing you can compute is a single deviation. 6. b. The standard deviation is defined in such a way that it can be interpreted as the average distance of a random value from the mean. 7. c. The standard deviation is defined as the square root of the variance, and thus is always nonnegative. 7. Does not make sense. The range depends upon only the low and high values, not on the middle values. 8. Makes sense. The upper quartile includes the highest score. 9. Makes sense. Consider the case where the 15 highest scores were 80, the next score was 68, and the lowest 14 scores were 40. There are numerous such cases that satisfy the conditions given. 10. Makes sense. Using the range rule of thumb, the range is approximately four times as large as the standard deviation. The only instance where the range would not be larger than the standard deviation is the case where both were 0. 11. Makes sense. The standard deviation describes the spread of the data, and one would certainly expect the heights of 5-year old children to have less variation than the heights of children aged 3 to 15. 12. Does not make sense. The standard deviation carries the same units as the mean. BASIC SKILLS & CONCEPTS 13. The mean waiting time at Big Bank is (4.1 + 5.2 + 5.6 + 6.2 + 6.7 + 7.2 + 7.7 + 7.7 + 8.5 + 9.3 + 11.0) ÷ 11 = 79.2 ÷ 11 = 7.2. Because there are 11 values, the median is the sixth value = 7.2. 14. The mean waiting time at Best Bank is (6.6 + 6.7 + 6.7 + 6.9 + 7.1 + 7.2 + 7.3 + 7.4 + 7.7 + 7.8 + 7.8) ÷ 11 = 79.2 ÷ 11 = 7.2. Because there are 11 values, the median is the sixth value = 7.2. 15. a. For the East Coast, the mean, median, and range are 134.97, 123.45, and 117.8, respectively. For the West Coast, the mean, median, and range are 145.82, 150.3, and 69.2, respectively. b. For the East Coast, the five-number summary is (98.2, 108.7, 123.45, 140.0, 216.0). For the West Coast, it is (113.2, 122.7, 150.3, 156.0, 182.4). The boxplots are not shown. c. The standard deviation for the East Coast is 42.86. For the West Coast, it is 25.06. 110 CHAPTER 6: PUTTING STATISTICS TO WORK d. For the East Coast, the standard deviation is approximately 117.8 ÷ 4 = 29.45, which is a far cry from the actual value of 42.86. This is due, in part, to the outlier of New York City (216.0). For the West Coast, the standard deviation is approximately 69.2 ÷ 4 = 17.3, which is also well off the mark. e. The cost of living index is smaller, on average, for the East Coast cities shown, though the variation is higher (due in part to the outlier of New York City). 16. a. For the coastal cities, the mean, median, and range are 7943.6, 7832, and 5614, respectively. The mean, median, and range for the non-coastal cities are 6468.2, 6699, and 5358, respectively. b. For the coastal cities, the five-number summary is (4876, 7488, 7832, 8923, 10,490). For the noncoastal cities, it is (2899, 5683, 6699, 7434, 8257). The boxplots are not shown. c. The standard deviation for the coastal cities is 1593.4. It is 1533.9 for the non-coastal cities. d. The standard deviation for the coastal cities is approximately 5614 ÷ 4 = 1403.5, which underestimates the actual value of 1593.4. For the non-coastal cities, the standard deviation is approximately 5358 ÷ 4 = 1339.5, which also underestimates the actual value of 1533.9. e. The mean and median taxes for non-coastal cities are considerably smaller than the taxes paid in coastal cities, though the variation is slightly larger. 17. a. For the no treatment group, the mean, median, and range are 0.184, 0.16, and 0.35, respectively. For the treatment group, the mean, median, and range are 1.334, 1.07, and 2.11, respectively. b. The five-number summary for the no treatment group is (0.02, 0.085, 0.16, 0.295, 0.37). It is (0.27, 0.595, 1.07, 2.205, 2.38) for the treatment group. The boxplots are not shown. c. The standard deviation for the no treatment group is 0.127; for the treatment group, it is 0.859. d. The standard deviation for the no treatment group is approximately 0.35 ÷ 4 = 0.0875, which underestimates the actual value of 0.127. For the treatment group, the standard deviation is approximately 2.11 ÷ 4 = 0.5275, which is underestimates the actual value of .859. e. The average weight and standard deviation of trees in the treatment group are higher that those of the no treatment group. 18. a. The mean, median, and range for Beethoven’s symphonies are 38.8, 36, and 42 minutes, respectively. For Mahler, these numbers are 75, 80, and 44 minutes. b. Beethoven’s five-number summary is (26, 29, 36, 45, 68). Mahler’s is (50, 62, 80, 87.5, 94). The boxplots are not shown. c. The standard deviation for Beethoven’s symphonies is 13.13 minutes. It is 15.44 for Mahler’s symphonies. d. The standard deviation for Beethoven’s symphonies is approximately 42 ÷ 4 = 10.5, which underestimates the actual value of 13.13. For Mahler, the standard deviation is approximately 44 ÷ 4 = 11, which underestimates the actual value of 15.44. e. Mahler wrote symphonies that were significantly longer, on average, than Beethoven’s, and the variation is slightly larger for Mahler’s symphonies. FURTHER APPLICATIONS 19. a. UNIT 6B: MEASURES OF VARIATION 111 b. The five-number summaries for each of the sets shown are (in order): (9, 9, 9, 9, 9); (8, 8, 9, 10, 10); (8, 8, 9, 10, 10); and (6, 6, 9, 12, 12). The boxplots are not shown. c. The standard deviations for the sets are (in order): 0.000, 0.816, 1.000, and 3.000. d. Looking at just the answers to part c, we can tell that the variation gets larger from set to set. 20. a. 21. 22. 23. 24. 25. b. The five-number summaries for each of the sets shown are (in order): (6, 6, 6, 6, 6); (5, 5, 6, 7, 7); (5, 5, 6, 7, 7); and (3, 3, 6, 9, 9). The boxplots are not shown. c. The standard deviations for the sets are (in order): 0.000, 0.816, 1.000, and 3.000. d. Looking at just the answers to part c, we can tell that the variation gets larger from set to set. The first pizza shop has a slightly larger mean delivery time, but much less variation in delivery time when compared to the second shop. If you want a reliable delivery time, choose the first shop. If you don’t care when the pizza arrives, choose the shop that offers cheaper pizza (you like them equally well, so you may as well save some money). While Skyview Airlines runs a little ahead of schedule (on average), it can also have long delays (late arrivals). SkyHigh Airlines on average runs slightly behind schedule, but it tends to have smaller delays. If you want to avoid the possibility of long delays, SkyHigh might be the better choice. A lower standard deviation means more certainty in the return, and a lower risk. Factory A has the lower average defect rate per day, but on some days, it can be expected to produce more defective chips than Factory B. The process in Factory B is more uniform and consistent, and thus one might argue that it is more reliable. Batting averages are less varied today than they were in the past. Because the mean is unchanged, batting averages above .350 are less common today. TECHNOLOGY EXERCISES 30. a. The mean, median, and standard deviation of the data are 91.25, 93.5, and 8.85, respectively. b. The mean, median, and standard deviation of the data are 99.55, 94, and 31.3, respectively. The mean and standard deviation are both greater. c. The mean, median, and standard deviation of the new data are 91.25, 93, and 8.48, respectively. 112 CHAPTER 6: PUTTING STATISTICS TO WORK The mean remains the same. standard deviation are smaller. The median and UNIT 6C 6. TIME OUT TO THINK Pg. 396. No; the 100th percentile is defined as a score for which 100% of the data values are less than or equal to it. Thus it must be the highest score, so no one can score ABOVE the 100th percentile. Pg. 397. The most likely explanation is that a minimum height is thought necessary for the army. If there is any maximum height, it must be taller than most men (since men make up much of the army), which means it is much taller than most women. Opinions will vary on the issue of fairness. This would be a good topic for a discussion either during or outside of class. 7. 8. QUICK QUIZ 1. b. A normal distribution is symmetric with one peak, and this produces a bell-shaped curve. 2. a. The mean, median and mode are equal in a normal distribution. 3. a. Data values farther from the mean correspond to lower frequencies than those close to the mean, due to the bell-shaped distribution. 4. c. Since most of the workers earn minimum wage, the mode of the wage distribution is on the left, and the distribution is right-skewed. 5. a. Roughly 68% of data values fall within one standard deviation of the mean, and 68% is about 2/3. 6. c. In a normally distributed data set, 99.7% of all data fall within 3 standard deviations of the mean. 7. a. Note that 43 mpg is 1 standard deviation above the mean. Since 68% of the data lies within one standard deviation of the mean, the remaining 32% is farther than 1 standard deviation from the mean. The normal distribution is symmetric, so half of 32%, or 16%, lies above 1 standard deviation. 8. c. The z-score for 84 is (84 – 75)/6 = 1.5. 9. b. Table 6.3 shows that the percentile corresponding with a standard score of –0.60 is 27.43%. 10. c. If your friend said his IQ was in the 75th percentile, it would mean his IQ is larger than 75% of the population. It is impossible to have an IQ that is larger than 102% of the population. DOES IT MAKE SENSE? 5. Makes sense. Physical characteristics are often normally distributed, and college basketball 9. 10. players are tall, though not all of them, so it makes sense that the mean is 6 '3" , with a standard deviation of 3” (this would put approximately 95% of the players within 5'9" to 6 '9" ). Makes sense. Physical characteristics are often normally distributed, and these statistics say that approximately 95% of babies born at Belmont are between 4.4 and 9.2 pounds (which is reasonable). Does not make sense. With a mean of 6.8 pounds, and a standard deviation of 7 pounds, you would expect that a small percentage of babies would be born 2 standard deviations above the mean. But this would imply that some babies weigh more than 20.8 pounds, which is much too heavy for a newborn. Does not make sense. The standard score is computed for an individual data value, not for an entire set of exams scores. Furthermore, a standard score of 75 for a particular data value would mean that value is 75 standard deviations away from the mean, which is virtually impossible (almost all data in a normally distributed data set is within 3 standard deviations of the mean). Makes sense. A standard score of 2 or more corresponds to percentiles of 97.72% and higher, so while this could certainly happen, the teacher is giving out As (on the final exam) to only 2.28% of the class Makes sense. If Jack is in the 50th percentile, he is taller than half of the population (to which he is being compared), and by definition, he is at the median height. BASIC SKILLS & CONCEPTS 11. Diagrams a and c are normal, and diagram c has a larger standard deviation because its values are more spread out. 12. Diagrams b and c are normal, and diagram b has a larger standard deviation because its values are more spread out. 13. Most trains would probably leave near to their scheduled departure times, but a few could be considerably delayed, which would result in a right-skewed (and non-normal) distribution. 14. This distribution would be normal because a machine fills the bags with approximately 40 pounds of lawn fertilizer, and the variation in these weights would be random, which typically results in normal distributions. 15. Random factors would be responsible for the variation seen in the distances from the bull’s-eye, so this distribution would be normal. 16. Distributions that measure athletic ability tend to be normal. UNIT 6C: THE NORMAL DISTRIBUTION 113 17. Easy exams would have scores that are leftskewed, so the scores would not be normally distributed. 18. Performances in athletic events are often normally distributed. 19. a. Half of any normally distributed data set lies below the mean, so 50% of the scores are less than 100. b. Note that 120 is one standard deviation above the mean. Since 68% of the data lies within one standard deviation of the mean (between 80 and 120), half of 68%, or 34%, lies between 100 and 120. Add 34% to 50% (the answer to part a) to find that 84% of the scores are less than 120. c. Note that 140 is two standard deviations above the mean. Since 95% of the data lies within two standard deviations above the mean (between 60 and 140), half of 95%, or 47.5%, lies between 100 and 140. Add 47.5% to 50% (see part a) to find that 97.5% of the scores are below 140. d. Note that 60 is two standard deviations below the mean. Since 95% of the data lies within two standard deviations of the mean (between 60 and 140), 5% of the data lies outside this range. Because the normal distribution is symmetric, half of 5%, or 2.5%, lies in the lower tail below 60. e. A sketch of the normal distribution would show that the amount of data lying above 60 is equivalent to the amount of data lying below 140, so the answer is 97.5% (see part c). f. Note that 160 is three standard deviations above the mean. Since 99.7% of the data lies within three standard deviations of the mean (between 40 and 160), 0.3% of the data lies outside this range. Because the normal distribution is symmetric, half of 0.3%, or 0.15%, lies in the upper tail above 160. g. A sketch of the normal distribution shows that the percent of data lying above 80 is equivalent to the percent lying below 120, so the answer is 84% (see part b). h. The percent of data between 60 and 140 is 95%, because these are each two standard deviations from the mean. 20. a. A heart rate of 55 is one standard deviation below the mean. Since 68% of the data lies between 55 and 85, 32% lie outside that range, which implies 16% lies below 55 (due to symmetry). b. A heart rate of 40 is two standard deviations below the mean. Since 95% of the data lies between 40 and 100, 5% lie outside that range, which implies 2.5% lies below 40 (due to symmetry). c. Half of 68% (or 34%) lies between 70 and 85, 50% lies below the mean of 70, so 84% of the data is less than 85. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. d. Half of 95% (or 47.5%) lies between 70 and 100, and 50% lies below the mean of 70, so 97.5% of the data lies below 100. e. This portion of the distribution is equal to the portion less than 55 because of symmetry, so the answer is 16% (see part a). f. The portion of the distribution greater than 55 is equal to the portion less than 85 because of symmetry, so the answer is 84% (see part c). g. The portion of the distribution more than 40 is equal to the portion less than 100 because of symmetry, so the answer is 97.5% (see part d). h. Both 55 and 85 are one standard deviation from the mean, and thus 68% of the data lies between them. A score of 59 is one standard deviation below the mean. Since 68% of normally distributed data lies within one standard deviation of the mean, half of 68%, or 34%, lies between 59 and 67 due to symmetry. The percent of data lying above 67 is 50%, so we add these two results to find that 84% of the scores lie above 59. A score of 83 is two standard deviations above the mean. Since 95% of normally distributed data lies within two standard deviations of the mean (between 51 and 83), 5% lies outside this range. Half of 5%, or 2.5%, will be in the upper tail, above 83, so 100% – 2.5% = 97.5% of the scores lie below 83. Subtract two standard deviations from the mean to get the cutoff score of 67 – 16 = 51. Since 95% of the data lies within two standard deviations of the mean, 5% lies outside this range. Half of 5%, or 2.5%, will be below 51, and will fail the exam. A score of 75 is one standard deviation above the mean. Since 68% of the data lies within one standard deviation of the mean (between 59 and 75), 32% of the data is outside this range. Half of 32%, or 16%, will be in the lower tail of the distribution. Since 68% + 16% = 84%, 84% × 500 = 420 of the students will score below 75. The standard score for 67 is z = (67 – 67)/8 = 0. The standard score for 59 is z = (59 – 67)/8 = –1. The standard score for 55 is z = (55 – 67)/8 = –1.5 The standard score for 88 is z = (88 – 67)/8 = 2.625. a. z = 1. The corresponding percentile is 84.13%. b. z = 0.5, and the corresponding percentile is 69.15%. c. z = –1.5, and the corresponding percentile is 6.68%. a. z = –0.5, and the corresponding percentile is 30.85%. b. z = –2, and the corresponding percentile is 2.28%. 114 CHAPTER 6: PUTTING STATISTICS TO WORK c. z = 1.2, and the corresponding percentile is 88.48%. 31. a. The 20th percentile corresponds to a standard score of about z = –0.83, which means the data value is about 0.83 standard deviations below the mean. b. The 80th percentile corresponds to a standard score of about z = 0.84, which means the data value is 0.84 standard deviations above the mean. c. The 63rd percentile corresponds to a standard score of about z = 0.33, which means the data value is 0.33 standard deviations above the mean. 32. a. The 10th percentile corresponds to a standard score of about z = –1.28, which means the data value is 1.28 standard deviations below the mean. b. The 35th percentile corresponds to a standard score of about z = –0.38, which means the data value is 0.83 standard deviations below the mean. c. The 88th percentile corresponds to a standard score of about z = 1.18, which means the data value is 1.18 standard deviations above the mean. 39. 40. 41. 42. 43. FURTHER APPLICATIONS 33. About 68% of births occur within 15 days of the mean, because this range is one standard deviation from the mean. 34. Within one month of the due date corresponds to two standard deviations on either side of the mean, which implies 95% of the births occur within one month of the mean. 35. Since 68% of births occur within 15 days of the due date, 32% occur outside this range, and half of 32%, or 16%, occur more than 15 days after the due date. 36. Since 95% of births occur within 30 days of the due date, 5% occur outside this range, and half of 5%, or 2.5%, occur more than 30 days after the due date. 37. a. The standard score is z = (66 – 63.6)/2.5 = 0.96, which corresponds to an approximate percentile of 83%. b. z = (62 – 63.6)/2.5 = –0.64, which corresponds to an approximate percentile of 26%. c. z = (61.5 – 63.6)/2.5 = –0.84, which corresponds to an approximate percentile of 20%. d. z = (64.7 – 63.6)/2.5 = 0.44, which corresponds to an approximate percentile of 67%. 38. a. The standard score is z = (22 – 26.2)/4.7 = – 0.89, which corresponds to an approximate percentile of 19%. b. z = (28 – 26.2)/4.7 = 0.38, which corresponds to an approximate percentile of 65%. c. For a BMI of 25, z = (25 – 26.2)/4.7 = –0.26, which corresponds to an approximate percentile of 40%, so 60% of American men are overweight. For a BMI of 30, z = (30 – 26.2)/4.7 = 0.81, which 44. 45. 46. 47. corresponds to an approximate percentile of 79%, so 21% of American men are obese. It is not likely because heights of eighth-graders would be normally distributed, and a standard deviation of 40 inches would imply some of the eighth graders (about 16%) are taller than 95 inches, which is preposterous. Assuming the scores are out of 100, it is not likely. A standard deviation of 50 would imply some of the scores are higher than 100. The 90th percentile corresponds to a z-score of about 1.3, which means a data value is 1.3 standard deviations above the mean. For the verbal portion of the GRE, this translates into a score of 462 + 1.3(119) = 617. For the quantitative portion, it translates into a score of 584 + 1.3(151) = 780. The standard score for 600 is z = (600 – 462)/119 = 1.16, which corresponds to about the 87th percentile. This is the percent of students who score below 600, so the percent that score above 600 is 13%. The standard score for 600 is z = (600 – 584)/151 = 0.106, which corresponds to about the 54th percentile. This is the percent of students who score below 600, so the percent that score above 600 is 46%. The standard score for 700 is z = (700 – 584)/151 = 0.77, which is 0.77 standard deviations above the mean. Your verbal score would also need to be 0.77 standard deviations above the mean to be in the same percentile, which translates to a score of 462 + 0.77(198) = 554. The z-score for an 800 on the quantitative exam is z = (800 – 584)/151 = 1.43, or about 1.4. This corresponds to the 91.92 percentile, so about 92% score below 800, which means 8% score 800. The z-score for an 800 on the verbal exam is z = (800 – 462)/119 = 2.84, which corresponds to an approximate percentile of 99.7%. Since 99.7% score below 800, about 0.3% score 800. For the verbal exam, z = (400 – 462)/119 = –0.52. , which corresponds to an approximate percentile of 30%. Thus about 30% score below 400. For the quantitative exam, z = (400 – 584)/151 = –1.22, which corresponds to an approximate percentile of 11%. Thus about 11% score below 400. TECHNOLOGY EXERCISES 53. a. z = (72 – 69.7)/2.7 = 0.85, which, from table 6.3, corresponds to an approximate percentile of 80%. b. z = (65 – 69.7)/2.7 = –1.74; NORMDIST(64,69.7,2.7,TRUE) gives the percentile as 4.0865%, which corresponds to an approximate percentile of 4%. UNIT 6D: STATISTICAL INFERENCE 115 c. 6 '4" = 76"; NORMDIST(76,69.7,2.7,TRUE), gives the percentile as 99.0185%, so approximately 1% of American adult males are taller than 6 '4". d. 5' 4" = 64"; NORMDIST(64,69.7,2.7,TRUE), gives the percentile as 1.7381%, so approximately 1.7% of American adult males are shorter than 5' 4". e. 6 '7" = 79"; NORMDIST(79,69.7,2.7,TRUE), gives the percentile as 99.97139%, so approximately (100% – 99.97139%) × 115,000,000 = 33,000 of American adult males are as tall as Kobe Bryant. UNIT 6D TIME OUT TO THINK Pg. 403. The probability is 0.01, or 1 in 100, that the results occurred by chance. This is certainly strong evidence in support of the herbal remedy, but by no means does it constitute proof. Pg. 405. Opinions will vary, but 95% confidence seems reasonable for most surveys, which have no need to be perfect. However, for something of great importance, such as deciding an election, it is far too low. That is why elections must be decided by letting everyone vote, not through statistical sampling. Pg. 407. The null hypothesis is “the suspect is innocent.” A guilty verdict means rejecting the null hypothesis, concluding that the suspect is NOT innocent. A “not guilty” verdict means not rejecting the null hypothesis, in which case we continue to assume that the suspect is innocent. QUICK QUIZ 1. c. A divorce rate of 60% is statistically significant because such large variation from the national rate of 30% is not likely to occur by chance alone. 2. b. In order to determine whether a result is statistically significant, we must be able to compare it to a known quantity (such as the effect on a control group in this case). 3. a. Statistical significance at the 0.01 level means an observed difference between the effectiveness of the remedy and the placebo would occur by chance in fewer than 1 out of 100 trials of the experiment. 4. c. If the statistical significance is at the 0.05 level, we would expect that 1 in 20 experiments would show the pill working better than the placebo by chance alone. 5. b. The confidence interval is found by subtracting and adding 3% to the results of the poll (65%). 6. c. The central limit theorem tells us that the true proportion is within the margin of error 95% of the time. 1 . Set 7. c. The margin of error is approximately n this equal to 4% (0.04), and solve for n to find that the sample size was 625. If we quadruple the sample size to 2500, the margin of error will be 1 = 0.02 , or 2%. approximately 2500 8. a. The null hypothesis is often taken to be the assumption that there is no difference in the things being compared. 9. c. If you cannot reject the null hypothesis, the only thing you can be sure of is that you don’t have evidence to support the alternative hypothesis. 10. b. If the difference in gas prices can be explained by chance in only 1 out of 100 experiments, you’ve found a result that is statistically significant at the 0.01 level. DOES IT MAKE SENSE? 9. Makes sense. If the number of people cured by the new drug is not significantly larger than the number cured by the old drug, the difference could be explained by chance alone, which means the results are not statistically significant. 10. Does not make sense. While results significant at the 0.05 level offer some evidence that the Magic Diet Pill works, we cannot remove all doubt that the results came about by chance alone. 11. Does not make sense. The margin of error is based upon the number of people surveyed (the sample size), and thus both Agency A and B should have the same margin of error. 12. Makes sense. The margin of error is based upon the number of people surveyed (the sample size), 1 , it decreases and because it is approximately n as n increases. 13. Makes sense. The alternative hypothesis is the claim that is accepted if the null hypothesis is rejected. 14. Does not make sense. One cannot prove the null hypothesis – it can be rejected, in which case there is evidence for the alternative hypothesis, or we can not reject it, which only means we do not have evidence to support the alternative hypothesis. BASIC SKILLS & CONCEPTS 15. Most of the time you would expect around 10 sixes in 60 rolls of a standard die. It would be very rare to get only 2 sixes, and thus this is statistically significant. 116 CHAPTER 6: PUTTING STATISTICS TO WORK 16. Most of the time you would expect around 50 heads in 100 tosses of a coin, and since 48 is well within the bounds of what would be expected by chance alone, this is not statistically significant. 17. Note that 11/200 is about 5% – this is exactly what you would expect by chance alone from an airline that has 95% of its flights on time. 18. It is very unlikely that a basketball player that makes 92% of free throws would miss 18 in a row. This is statistically significant. (In chapter 7, you will learn that probability of this occurring is practically zero.) 19. There aren’t many winners in the Power Ball lottery, and yet there are a good number of 7-11 stores. It would be very unlikely that ten winners in a row purchased their tickets at the same store, so this is statistically significant. 20. The likelihood of 42 people in a group of 45 all sharing the same birthday is extremely small, so this is statistically significant. 21. The result is significant at far less than the 0.01 level (which means it is significant at both the 0.01 and 0.05 level), because the probability of finding the stated results is less than 1 in 1 million by chance alone. This gives us good evidence to support the alternative hypothesis that the accepted value for temperature is wrong. 22. This result is significant at the 1/10,000 = 0.0001 level, so it is significant at both the 0.01 and 0.05 levels. Because chance alone does not do a good job of explaining the difference found in length of hospital stays, it would be reasonable to conclude that seat belts reduce the severity of injuries (assuming that the severity of injury is directly correlated with length of hospital stay). 23. The results of the study are significant at neither the 0.05 level nor the 0.01 level, and thus the improvement is not statistically significant. 24. The difference found in the weights would be expected by chance alone in only 1 out of 100 such surveys. 25. The margin of error is 1/ 1012 = 0.03 , or 3%, and the 95% confidence interval is 29% to 35%. This means we can be 95% confident that the actual percentage is between 29% and 35%. 26. The margin of error is 1/ 65, 000 = 0.004 , or 0.4%, and the 95% confidence interval is 8.5% to 9.3%. This means we can be 95% confident that the actual percentage is between 8.5% and 9.3%. 27. The margin of error is 1/ 540 = 0.043 , or 4.3%, and the 95% confidence interval is 59.7% to 68.3% for the satisfied group, and 59.7% to 68.3% for the dissatisfied group. This means we can be 95% confident that the actual percentage is between and 59.7% and 68.3%. 28. The margin of error is 1/ 1110 = 0.03 , or 3%, and the 95% confidence interval is 56% to 62%. This means we can be 95% confident that the actual percentage is between and 56% and 62%. 29. The margin of error is 1/ 1003 = 0.03 , or 3%, and the 95% confidence interval is 45% to 51%. This means we can be 95% confident that the actual percentage is between and 45% and 51%. 30. The margin of error is 1/ 2818 = 0.02 , or 2%, and the 95% confidence intervals are 71% to 75% and 49% to 53%. This means we can be 95% confident that the actual percentage is between and 71% and 75%. 31. The margin of error is 1/ 1100 = 0.03 , or 3%, and the 95% confidence intervals are 38% to 44% and 12% to 16%. This means we can be 95% confident that the actual percentage is between and 38% and 44%. 32. The margin of error is 1/ 1229 = 0.03 , or 3%, and the 95% confidence intervals are 25% to 31% and 12% and 18%. 33. a. Null hypothesis: percentage of high school graduates = 85%. Alternative hypothesis: percentage of high school students > 85%. b. Rejecting the null hypothesis means there is evidence that the percentage of high school graduates exceeds 85%. Failing to reject the null hypothesis means there is insufficient evidence to conclude that the percentage of high school graduates exceeds 85%. 34. a. Null hypothesis: vitamin C in tablets = 500 mg. Alternate hypothesis: vitamin C in tablets < 500 mg. b. Rejecting the null hypothesis means that there is evidence that the tablets contain less than 500 mg. Failing to reject the null hypothesis means there is insufficient evidence to conclude that the tablets contain less than 500 mg. 35. a. Null hypothesis: mean teacher salary = $47,750. Alternative hypothesis: mean teacher salary > $47,750. b. Rejecting the null hypothesis means there is evidence that the mean teacher salary in the state exceeds $47,750. Failing to reject the null hypothesis means there is insufficient evidence to conclude that the mean teacher salary exceeds $47,750. 36. a. Null hypothesis: water usage = 1675 gallons per month. Alternative hypothesis: water usage > 1675 gallons per month. b. Rejecting the null hypothesis means there is evidence that the water usage exceeds 1675 gallons per month. Failing to reject the null hypothesis means there is insufficient evidence to UNIT 6D: STATISTICAL INFERENCE 117 37. 38. 39. 40. 41. 42. 43. 44. conclude that water usage exceeds 1675 gallons per month. a. Null hypothesis: percentage of underrepresented students = 20%. Alternative hypothesis: percentage of underrepresented students < 20%. b. Rejecting the null hypothesis means there is evidence that the percentage of underrepresented students is less than 20%. Failing to reject the null hypothesis means there is insufficient evidence to conclude that the percentage of underrepresented students is less than 20%. a. Null hypothesis: pollution levels = EPA minimum. Alternative hypothesis: pollution levels < EPA minimum. b. Rejecting the null hypothesis means there is evidence that the pollution levels are less than the EPA minimum. Failing to reject the null hypothesis means there is insufficient evidence to conclude that the pollution levels are less than the EPA minimum. Null hypothesis: mean annual mileage of cars in the fleet = 11,725 miles. Alternative hypothesis: mean annual mileage of cars in the fleet > 11,725. The result is significant at the 0.01 level, and provides good evidence for rejecting the null hypothesis. Null hypothesis: proportion of supporting voters = 0.5. Alternative hypothesis: proportion of supporting voters > 0.5. The result is not significant at the 0.05 level, and there are no grounds for rejecting the null hypothesis. Null hypothesis: mean stay = 2.1 days. Alternative hypothesis: mean stay > 2.1 days. The result is not significant at the 0.05 level, and there are no grounds for rejecting the null hypothesis. Null hypothesis: mean bounce height = 90.10 inches. Alternative hypothesis: mean bounce height < 90.10 inches. The result is significant at the 0.05 level, and provides evidence for rejecting the null hypothesis. Null hypothesis: mean income = $50,000. Alternative hypothesis: mean income > $50,000. The result is significant at the 0.01 level, and provides good evidence for rejecting the null hypothesis. Null hypothesis: mean ownership = 7.5 years. Alternative hypothesis: mean ownership < 7.5 years. The result is not significant at the 0.05 level, and there are no grounds for rejecting the null hypothesis. FURTHER APPLICATIONS 45. The margin of error is 1/ 13, 000 = 0.009 , or 0.9%, and the 95% confidence interval is 64.1% to 65.9%. This means we can be 95% confident that the actual percentage of viewers who watched American Idol was between 64.1% and 65.9%. 46. Null hypothesis: mean birth weight = 3.38 kg. Alternative hypothesis: mean birth weight > 3,38 kg. The result is significant at the 0.01 level, and provides good evidence for rejecting the null hypothesis. 47. The margin of error is approximately 1/ n . If we want to decrease the margin of error by a factor of 2, we must increase the sample size by a factor of 1 1 1 1 4, because = = ⋅ . 4n 2 n 2 n 48. The margin of error is approximately 1/ n . If we want to decrease the margin of error by a factor of 10, we must increase the sample size by a factor of 1 1 1 1 100, because = = ⋅ . 10 n 100n 10 n 49. The margin of error, 3%, is consistent with the sample size because 1/ 1019 = 0.03 = 3%.