Download Describing Location in a Distribution

Describing Location in a Distribution I ~ j 2.1 Measures of Relative Standing and Density Curves 2.2 Normal Distributions Chapter Review C A S E: ST' UJ DY ThenewSAT For many years, high school students across the United States have taken the SAT as part of the college admissions process . The SAT has undergone regular changes over time, but until recently, test takers have always received two scores : SAT Verbal and SAT Math . Motivated by the findings of a Blue Ribbon Panel in 1990 and threats by the mammoth University of California system to discontinue use of the SAT in its admissions process, the College Board decided to add a Writing section to the test. This section would include multiple-choice questions on grammar and usage from the old SAT II Writing Subject Test, as well as a t imed, argumentative essay. At the same time, the existing Verbal and Math sections of the test underwent major content revisions. Analogies were removed from the Verbal section, which was renamed Critical Reading . In the Math section, quantitative comparison questions were eliminated, and new questions testing advanced algebra content were added . In March 2005, the College Board administered the new SAT for the first time. Students, parents, teachers, high school counselors , and college admissions officers waited anxiously to hear about the results from this new exam. Would the scores on the new SAT be comparable to those from previqus years? How would students perform on the new Writing section (and particularly on the timed essay)? In the past, boys had earned higher average scores than girls on both the Verbal and Math sections of the SAT. Would similar gender differences emerge on the new SAT? By the end of this chapter, you will have developed the statistical tools you need to answer important questions about the new SAT. 113 -A~ ' I· • 't 114 CHAPTER 2 Describing Location in a Distribution Activity 2A A fine-grained distribution Materials: Sheet ofgrid paper; salt; can of spray paint; paint easel; newspapers 1. Place the grid paper on the easel with a horizontal fold as shown, at about a 45o angle to the horizontal. Provide a "lip" at the bottom to catch the salt. Place newspaper behind the grid paper and extending out on all sides so you will not get paint on the easel. 2. Pour a strean1 of salt slowly from a point near the n1iddle of the top edge of the grid paper. The grains of salt will hop and skip their way down the grid as they collide with one another and bounce left and right. They will accun1ulate at the bottom, piled against the grid, with the s1nooth profile of a bell-shaped curve, known as a Nonnal distribution. We will learn about the Normal distribution in this chapter. 3. Now carefully spray the grid-salt and all-with paint. Then discard the salt. You should be able to easily measure the height of the curve at different places by sin1ply counting lines on the grid, or you could approxitnate areas by counting small squares or portions of squares on the grid. How could you get a tall, narrow curve? How could you get a short, broad curve? What factors might affect the height and breadth of the curve? Frmn the members of the class, collect a set of Normal curves that differ from one another. Activity 28 Roll a Normal distribution Materials: Several marbles, all the same size; two metersticks for a "ramp"; a ruled sheet of paper; a flat table about 4 feet long; carbon paper; Scotch Tape or masking tape 1. At one end of the table prop up the two metersticks in a V shape to provide a ran1p for the marbles to roll down. Each marble will roll down t Activity 28 the chute, continue across the table, and fall off the table to the floor below. Make sure that the ramp is secure and that the tabletop does not have any grooves or obstructions. 2. Roll a n1arble down the ramp several tirnes to get a good idea of the area of the floor where it will fall. 3. Center the ruled sheet of paper (see Figure 2.1) over this area, face up, with the bottmn edge toward the table and parallel to the edge of the table. The ruled lines should go in the same direction as the marble's path. Tape the sheet securely to the floor. Place the sheet of carbon paper, carbon side down, over the ruled sheet. 4. Roll the rnarble for a class total of 200 tirnes. The spots where it hits the floor will be recorded on the ruled paper as black dots. When the marble hits the floor, it will probably bounce, so try to catch it in midair after the impact so that you don't get any extra marks. After the first 100 rolls, replace the sheet of paper. This will make it easier for you to count the spots. Make sure that the second sheet is in exactly the same position as the first one. 5. When the rnarble has been rolled 200 tin1es, make a histograrn of the distribution of the points as follows. First, count the number of dots in each colurnn. Then graph these numbers by drawing horizontal lines in the colun1ns at the appropriate levels. Use the scale on the left-hand side of the sheet. Figure 2. 1 Example of ruled sheet for Activity 28. -I- - 30 -I- - -I- 25 -I-I- 20 -I- 15 - 10 - 5 -- - - - - 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 CHAPTER 2 Describing Location in a Distribution Introduction Suppose that Jenny earns an 86 (out of 100) on her next statistics test. Should she be satisfied or disappointed with her perfonnance? That depends on how her score compares to those of the other students who took the test. If 86 is the highest score, Jenny might be very pleased. Perhaps her teacher will "curve" the grades so that Jenny's 86 becmnes an "A." But if Jenny's 86 falls below the a middle" of this set of test scores, she may not be so happy. Section 2.1 focuses on describing the location of an individual within a distribution. We begin by using raw data to calculate two useful measures oflocation. One is based on the mean and standard deviation. The other is connected to the n1edian and quartiles. Smnetitnes it is helpful to use graphical models called density curves to describe the location of individuals within a distribution, rather than relying on actual data values. This is especially true when data fall in a bell-shaped pattern called a Nonnal distribution. Section 2.2 exan1ines the properties of Norn1al distributions and shows you how to perforn1 useful calculations with them. 2.1 Measures of Relative Standing and Density Curves Here are the scores of all 2 5 students in Mr. Pryor's statistics class on their first test: 79 77 6 7 7 2334 7 8 8 9 Bm899 00123334 81 83 80 86 77 73 90 79 83 85 74 83 93 89 78 84 80 82 75 67 77 72 73 The bold score is Jenny's 86. How did she perforn1 on this test relative to her class1nates? The stem plot in the margin displays this distribution of test scores. Notice that the distribution is roughly symn1etric with no apparent outliers. Figure 2.2 provides Minitab output that includes summary statistics for the test score data. 569 03 Figure 2.2 Minitab output showing descriptive statistics for the test scores of Mr. Pryor's statistics class. Mini tab Descriptive Statistics: Test 1 scores Variable Test 1 scores N Variable Test 1 scores Minimum 67.00 25 Mean 80.00 Maximum 93.00 Median 80.00 Q1 76.00 TrMean 80.00 Q3 83.50 StDev 6.07 SE Mean 1. 21 2.1 Measures of Relative Standing and Density Curves Where does Jenny's 86 fall relative to the center of this distribution? Since the mean and n1edian are both 80, we can say that Jenny's result is "above average." But how much above average is it? Measuring Relative Standing: z-Scores standardizing One way to describe Jenny's position within the distribution of test scores is to tell how 1nany standard deviations above or below the mean her score is. Since the n1ean is 80 and the standard deviation is about 6, Jenny's score of 86 is about one standard deviation above the n1ean. Converting scores like this from original values to standard deviation units is known as standardizing. To standardize a value, subtract the mean of the distribution and then divide by the standard deviation. Standardized Values and z-Scores If xis an observation from a distribution that has known mean and standard deviation, the standardized value of x is x- mean z = standard deviation A standardized value is often called a z-score. A z-score tells us how n1any standard deviations away from the mean the original observation falls, and in which direction. Observations larger than the mean are positive when standardized, and observations smaller than the mean are negative. Example2.1 ~~ Pryorsfirsttest Standardizing scores Jenny's score on the test was x = 86. Her standardized test score is z= 80 = 86 - 80 = 0.99 6.07 6.07 X - Katie earned the highest score in the class, 93. Her corresponding z-score is z= 93- 80 = 2.14 6.07 In other words, Katie's result is 2.14 standard deviations above the mean score for this test. Norman got a 72 on this test. His standardized score is z= 72 - 80 = - 1.32 6.07 Norman's score is 1.32 standard deviations below the class mean of 80. 118 CHAPTER 2 Describing Location in a Distribution We can also use z-scores to compare the relative standing of individuals in different distributions, as the following exa1nple illustrates. Example2.2 Statistics or chemistry: which is easier? Using z-scores for comparison The day after receiving her statistics test result of 86 from Mr. Pryor, Jenny earned an 82 on Mr. Goldstone's chemistry test. At first, she was disappointed. Then Mr. Goldstone told the class that the distribution of scores was fairly symmetric with a mean of 76 and a standard deviation of 4. Jenny quickly calculated her z-score: z= 82 - 76 = 1.5 4 Her 82 in chemistry was 1. 5 standard deviations above the mean score for the class. Since she only scored 1 standard deviation above the mean on the statistics test, Jenny actually did better in chemistry relative to the class. We often standardize observations from symmetric distributions to express them on a common scale. We might, for example, compare the heights of two children of different ages by calculating their z-scores. The standardized heights tell us where each child stands in the distribution for his or her age group. Exercises 2.1 SAT versus ACT Eleanor scores 680 on the mathematics part of the SAT. The distribution of SAT scores in a reference population is symmetric and single-peaked with mean 500 and standard deviation 100. Gerald takes the American College Testing (ACT) mathematics test and scores 27. ACT scores also follow a symmetric, singlepeaked distribution but with mean 18 and standard deviation 6. Find the standardized scores for both students. Assuming that both tests measure the same kind of ability, who has the higher score? 2.2 Comparing batting averages Three landmarks of baseball achievement are Ty Cobb's batting average of .420 in 1911, Ted Williams's .406 in 1941, and George Brett's .390 in 1980. These batting averages cannot be compared directly because the distribution of major league batting averages has changed over the years. The distributions are quite symmetric, except for outliers such as Cobb, Williams, and Brett. While the mean batting average has been held roughly constant by rule changes and the balance between hitting and pitching, the standard deviation has dropped over time. Here are the facts: Decade Mean Std. dev. 1910s 1940s 1970s .266 .267 .261 .0371 .0326 .0317 Compute the standardized batting averages for Cobb, Williams, and Brett to compare how far each stood above his peers. 1 2.1 Measures of Relative Standing and Density Curves 2.3 Measuring bone density Individuals with low bone density (osteoporosis) have a high risk of broken bones (fractures). Physicians who are concerned about low bone density in patients can refer them for specialized testing. Currently, the most common method for testing bone density is dual-energy X-ray absorptiometry (DXA). A patient who undergoes a DXA test usually gets bone density results in grams per square centimeter (g/cm 2) and in standardized units. Judy, who is 25 years old, has her bone density measured using DXA. Her results indicate a bone density in the hip of948 g/cm 2 and a standardized score of z = -1.45. In the reference population 2 of 25-year-old women like Judy, the mean bone density in the hip is 956 g/cm 2 . (a) Judy has not taken a statistics class in a few years. Explain to her in simple language what the standardized score tells her about her bone density. (b) Use the information provided to calculate the standard deviation of bone density in the reference population. 2.4 Comparing bone density Refer to the previous exercise. One of Judy's friends, Mary, has the bone density in her hip measured using DXA. Mary is 35 years old. Her bone density is also reported as 948 g/cm 2, but her standardized score is z = 0.50. The mean bone density in the hip for the reference population of 35-year-old women is 944 g/cm 2 . (a) Whose bones are healthier: Judy's or Mary's? Justify your answer. (b) Calculate the standard deviation of the bone density in Mary's reference population. How does this compare with your answer to Exercise 2.3(b)? Are you surprised? Measuring Relative Standing: Percentiles We can also describe Jenny's performance on her first statistics test using percentiles. In Chapter 1, we defined the pth percentile of a distribution as the value with p percent of the observations less than or equal to it. Using the stemplot on page 116, we can see that Jenny's 86 is the 22nd highest score on this test, counting up from the lowest score. Since 22 of the 25 observations (88%) are at or below her score, Jenny scored at the 88th percentile. Example2.3 Mr. Pryor's first test, continued Finding percentiles For Norman, who got a 72, only 2 of the 25 test scores in the class are at or below his. Norman's percentile is computed as follows: 2/25=0.08, or 8%. So Norman scored at the 8th percentile on this test. Katie's 93 puts her by definition at the lOOth percentile, since 25 out of 25 test scores fall at or below her result. Note that some people define the pth percentile of a distribution as the value with p percent of observations below it. Using this alternative definition of percentile, it is never possible for an individual to fall at the 1OOth percentile. That is why you never see a standardized test score reported above the 99th percentile. 120 CHAPTER 2 Describing Location in a Distribution Two students scored an 80 on Mr. Pryor's first test. By our definition of percentile, 14 of the 25 scores in the class were at or below 80, so these two students are at the 56th percentile. If we used the alternative definition of percentile, these two students would fall at the 48th percentile ( 12 of 25 scores were below 80). Of course, since 80 is the median score, we prefer to think of it as being the 50th percentile. Calculating percentiles is not an exact science! ~· '1 can't speak for everyone in the top one percent, but I'm fine." You may have wondered by now whether there is a sitnple way to convert a z-score to a percentile. The percent of observations falling at or below a particular z-score depends on the shape of the distribution. An observation that is equal to the mean has a z-score of 0. In a heavily left-skewed distribution, where the tnean will be less than the median, this observation will be somewhere below the 50th percentile (the median). There is an interesting result that describes the percent of observations in any distribution that fall within a specified number of standard deviations of the mean. It is known as Chebyshev's inequality. Chebyshev's Inequality In any distribution, the percent of observations falling within k standard deviations of the mean is at least (I 00) (I If k (I 00) = kl' )- 2, for example, Chebyshev's inequality tells us that at least (I - kl') = 7 5% of the o bserva ti ons in any d istri bu tion must fall within 2 standard deviations of the tnean. The following table displays the percent of observations within k standard deviations of the mean for several values of k. k 2 1 12 at least (I 00) ( - k ) of observations within k standard deviations of mean (100)( I - /z) = 0% (100)(1 - 2;) = 75% 2.1 Measures of Relative Standing and Density Curves 3~) 89% (100)( I - ~) = 93.75% 4 (IOO)(I - ~) = 96% 5 3 (100)( I - 4 5 Going with 85% of the flow According to the Los Angeles Times, speed limits on California highways are set at the 85th percentile of vehicle speeds on those stretches of road. The idea, apparently, is that 85% of people behave in a safe and sane manner. The other 15% are reckless individuals who go too fast. To which group do you belong? = So it is fairly unusual to find an observation that is more than 5 standard deviations away frmn the mean in any distribution. Chebyshev's inequality gives us some insight into how observations are distributed within distributions. It does not help us detennine the percentile corresponding to a given z-score, however. For that, we need 111ore advanced statistical models known as density curves. Exercises 2.5 Unemployment in the states, I Each month the Bureau of Labor Statistics announces the unemployment rate for the previous month. Unemployment rates are economically important and politically sensitive. Unemployment may vary greatly among the states because types of work are unevenly distributed across the country. Table 2.1 presents the unemployment rates for each of the 50 states in May 2005. Table 2.1 State Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware Florida Georgia Hawaii Idaho Illinois Indiana Iowa Kansas Kentucky Unemployment rates by state, May 2005 Percent 4.4 6.4 4.8 5.0 5.3 5.3 5.3 4.1 4.0 5.2 2.7 3.9 5.8 4.8 4.8 5.3 5.7 State Louisiana Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri Montana Nebraska Nevada New Hampshire New Jersey New Mexico New York North Carolina North Dakota Source: Bureau of Labor Statistics Web site, www.bls.gov. Percent 5.4 5.0 4.2 4.8 7.1 4.3 7.1 5.6 4.5 4.0 4.0 3.6 3.9 6.0 5.0 5.1 3.5 State Ohio Oklahoma Oregon Pennsylvania Rhode Island South Carolina South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming Percent 6.1 4.5 6.5 4.8 4.5 6.3 4.0 6.2 5.5 4.9 3.1 3.6 5.7 4.5 4.7 4.0 122 CHAPTER 2 Describing Location in a Distribution (a) Make a histogram of these data. Be sure to label and scale your axes. (b) Calculate numerical summaries for this data set. Describe the shape, center, and spread of the distribution of unemployment rates. (c) Determine the percentile for Illinois. Explain in simple terms what this says about the unemployment rate in Illinois relative to the other states. (d) Which state is at the 30th percentile? Calculate the z-score for this state. (e) Compare the percent of state unemployment rates that fall within 1, 2, 3, 4, and 5 standard deviations of the mean with the percents guaranteed by Chebyshev's inequality (refer to the table on page 121). 2.6 Unemployment in the states, II Refer to the previous exercise. The December 2000 unemployment rates for the 50 states had a symmetric, single-peaked distribution with a mean of 3.47% and a standard deviation of about 1%. The unemployment rate for Illinois that month was 4.5%. There were 42 states with lower unemployment rates than Illinois. (a) Write a sentence comparing the actual rates of unemployment in Illinois in December 2000 and May 2005. (b) Compare the z-scores for the Illinois unemployment rate in these same two months in a sentence or two. (c) Compare the percentiles for the Illinois unemployment rate in these same two months in a sentence or two. 2.7 PSAT scores In October 2004, about 1.4 million college-bound high school juniors took the Preliminary SAT (PSAT). The mean score on the Critical Reading test was 46.9, and the standard deviation was 10.9. Nationally, 5.2 % of test takers earned a score of 65 or higher on the Critical Reading test's 20 to 80 scale. 3 Scott was one of 50 junior boys to take the PSAT at his school. He scored 64 on the Critical Reading test. This placed Scott at the 68th percentile within the group of boys. Looking at all 50 boys' Critical Reading scores, the mean was 58.2, and the standard deviation was 9.4. (a) Write a sentence or two comparing Scott's percentile among the national group of test takers and among the 50 boys at his school. (b) Calculate and compare Scott's z-score among these same two groups of test takers. (c) How well did the boys at Scott's school perform on the PSAT? Give appropriate evidence to support your answer. (d) What does Chebyshev's inequality tell you about the performance of students nationally? At Scott's school? 2.8 Blood pressure Larry came home very excited after a VISit to his doctor. He announced proudly to his wife, "My doctor says my blood pressure is at the 90th percentile among men like me. That means I'm better off than about 90% of similar men." How should his wife, who is a statistician, respond to Larry's statement? r 2.1 Measures of Relative Standing and Density Curves Density Curves In Chapter l, we developed a kit of graphical and numerical tools that we call a "Data Analysis Toolbox" for describing distributions. In this chapter, we have explored some methods for measuring the relative standing of individuals within distributions. We now have a clear strategy for exploring data fron1 a single quantitative variable. l. Always plot your data: make a graph, usually a histogratn or a stemplot. 2. Look for the overall pattern (shape, center, spread) and for striking deviations such as outliers. 3. Calculate a nun1erical sumtnary to briefly describe center and spread. Here is one n1ore step to add to this strategy: 4. Sometimes the overall pattern of a large nmnber of observations is so regular that we can describe it by a s1nooth curve. Doing so can help us describe the location of individual observations within a distribution. Figure 2. 3 is a histogram of the scores of all 94 7 seventh-grade students in Gary, Indiana, on the vocabulary part of the Iowa Test of Basic Skills. 4 Scores on this national test have a very regular distribution. The histogram is symmetric, and both tails fall off smoothly frmn a single center peak. There are no large gaps or obvious outliers. The s1nooth curve drawn through the tops of the histogram bars in Figure 2. 3 is a good description of the overall pattern of the data. Figure2.3 Histogram of the vocabulary scores of all seventh-grade students in Gary, Indiana. 2 4 6 8 Vocabulary scores 10 12 CHAPTER 2 124 mathematical model Example2.4 Describing Location in a Distribution The curve is a mathematical model for the distribution. A n1athematical model is an idealized description. It gives a compact picture of the overall pattern of the data but ignores minor irregularities as well as any outliers. We will see that it is easier to work with the smooth curve in Figure 2.3 than with the histogram. The reason is that the histogran1 depends on our choice of classes, while with a little care we can use a curve that does not depend on any choices we make. Here's how we do it. Seventh-grade vocabulary scores From histogran1 to density curve Our eyes respond to the areas of the bars in a histogram. The bar areas represent proportions of the observations. Figure 2.4(a) is a copy of Figure 2.3 with the leftmost bars shaded blue. The area of the blue bars in Figure 2.4(a) represents the students with vocabulary scores of 6.0 or lower. There are 287 such students, who make up the proportion 287/947 = 0.303 of all Gary seventh-graders. In other words, a score of 6.0 corresponds to about the 30th percentile. Figure 2.4 (a) The proportion of scores less than or equal to 6.0 from the histogram is 0.303. (b) The proportion of scores less than or equal to 6.0 from the density curve is 0.293. The blue bars represent scores The blue area represents scores ~6.0. ~6.0. 4 6 Vocabulary scores 10 12 2 4 6 8 10 12 Vocabulary scores Now concentrate on the curve drawn through the bars. In Figur~ 2.4(b ), the area under the curve to the left of 6.0 is shaded. Adjust the scale of the graph so that the total area under the curve is exactly 1. This area represents the proportion l, that is, all the observations. Areas under the curve then represent proportions of the observations. The curve is now a density curve. The blue area under the density curve in Figure 2.4(b) represents the proportion of students with score 6.0 or lower. T'his area is 0.293, only 0.010 away from the histogram result. So our estimate based on the density curve is that a score of 6.0 falls at about the 29th percentile. You can see that areas under the density curve give quite good approximations of areas given by the histogram. t 2.1 Measures of Relative Standing and Density Curves Density Curve A density curve is a curve that • is always on or above the horizontal axis, and • has area exactly 1 underneath it. A density curve describes the overall pattern of a distribution. The area under the curve and above any interval of values on the horizontal axis is the proportion of all observations that fall in that interval. Normal curve Example2.5 The density curve in Figures 2. 3 and 2.4 is a Normal curve. Density curves, like distributions, come in many shapes. In later chapters, we will encounter important density curves that are skewed to the left or right and curves that may look like Normal curves but are not. A left-skewed distribution Density curves and areas Figure 2.5 shows the density curve for a distribution that is skewed to the left. The smooth curve makes the overall shape of the distribution clearly visible. The shaded area under the curve covers the range of values from 7 to 8. This area is 0.12. This means that the proportion 0.12 ( 12 %) of all observations from this distribution have values between 7 and 8. Figure 2.5 The shaded area under this density curve is the proportion of observations taking values between 7 and B. Total area under 7 8 CHAPTER 2 Describing Location in a Distribution Figure 2.6 shows two density curves: a symmetric, Normal density curve and a right-skewed curve. The mean and median of each distribution, which will be discussed in more detail shortly, have been 1narked on the horizontal axis. Figure 2.6 (a) The median and mean of a symmetric density curve. (b) The median and mean of a right-skewed density curve. The long right tail pulls the mean to the right. t Median and mean llean Median A density curve of the appropriate shape is often an adequate description of the overall pattern of a distribution. Outliers, which are deviations from the overall pattern, are not described by the curve. Of course, no set of real data is exactly described by a density curve. The curve is an approximation that is easy to use and accurate enough for practical use. The Median and Mean of a Density Curve Our n1easures of center and spread apply to density curves as well as to actual sets of observations. The 1nedian and quartiles are easy. Areas under a density curve represent proportions of the total number of observations. The 1nedian is the point with half the observations on either side. So the median of a density curve is the "equal-areas point," the point with half the area under the curve to its left and the remaining half of the area to its right. The quartiles divide the area under the curve into quarters. One-fourth of the area under the curve is to the left of the first quartile, and three-fourths of the area is to the left of the third quartile. You can roughly locate the median and quartiles of any density curve by eye by dividing the area under the curve into four equal parts. Because density curves are idealized patterns, a symmetric density curve is exactly syn1n1etric. The median of a symmetric density curve is therefore at its center. Figure 2.6(a) shows the median of a symn1etric curve. It isn't so easy to spot the equal-areas point on a skewed curve. There are mathematical ways of finding the median for any density curve. We did that to 111ark the median on the skewed curve in Figure 2.6(b ). What about the mean? The mean of a set of observations is their arithmetic average. If we think of the observations as weights strung out along a thin rod, 2.1 Measures of Relative Standing and Density Curves 127 the mean is the point at which the rod would balance. This fact is also true of density curves. The mean of a density curve is the "balance point," the point at which the curve would balance if n1ade of solid material. Figure 2. 7 illustrates this fact about the 1nean. A symn1etric curve balances at its center because the two sides are identical. The mean and median of a symmetric density curve are equal, as in Figure 2.6(a). We know that the mean of a skewed distribution is pulled toward the long tail. Figure 2.6(b) shows how the mean of a skewed density curve is pulled toward the long tail 111ore than is the median. It's hard to locate the balance point by eye on a skewed curve. There are mathematical ways of calculating the n1ean for any density curve, so we are able to mark the mean as well as the n1edian in Figure 2.6(b ). Figure 2.7 The mean is the balance point of a density curve. ~ Median and Mean of a Density Curve The median of a density curve is the "equal-areas point," the point that divides the area under the curve in half. The mean of a density curve is the "balance point," at which the curve would balance if made of solid material. The 1nedian and .m ean are the same for a symmetric density curve. They both lie at the center of the curve. The mean of a skewed curve is pulled away from the 1nedian in the direction of the long tail. mean 1-L standilrd deviation u We can roughly locate the mean, median, and quartiles of any density curve by eye. This is not true of the standard deviation. When necessary, we can once again call on more advanced mathematics to learn the value of the standard deviation. The study of mathematical methods for doing calculations with density curves is part of theoretical statistics. Though we are concentrating on statistical practice, we often make use of the results of 1nathematical study. Because a density curve is an idealized description of the distribution of data, we need to distinguish between the mean and standard deviation of the density curve and the mean x and standard deviations computed from the actual observations. The usual notation for the mean of a density curve is JL (the Greek letter mu). We write the standard deviation of a density curve as u (the Greek letter sigma). CHAPTER 2 128 Describing Location in a Distribution Exercises 2.9 Density curves Sketch density curves that might describe distributions with the following shapes: (a) Symmetric, but with two peaks (b) Single peak and skewed to the left uniform distribution 2.10 A uniform distribution, I Figure 2.8 displays the density curve of a uniform distribution. The curve takes the constant value l over the interval from 0 to l and is 0 outside the range of values. This means that data described by this distribution take values that are uniformly spread between 0 and l. Figure 2.8 The density curve of a uniform distribution, for Exercise 2.10. "'' } height = 1 J 0 Use areas under this density curve to answer the following questions. (a) Why is the total area under this curve equal to l? (b) What percent of the observations lie above 0.8? (c) What percent of the observations lie below 0.6? (d) What percent of the observations lie between 0.25 and 0.75? (e) What is the mean J.-t for this distribution? 2.11 A uniform distribution, II Refer to the previous exercise. Can you construct a modified boxplot for the uniform distribution? If so, do it. If not, explain why not. 2.12 Finding means and medians Figure 2.9 displays three density curves, each with three points indicated. At which of these points on each curve do the mean and the median fall? Figure 2.9 Three density curves, for Exercise 2.12. ABC ABC (a) (b) ~ AB C (c) 2.1 Measures of Relative Standing and Density Curves 129 2.13 A weird density curve A line segment can be considered a density "curve," as shown in Exercise 2.1 0. A "broken-line" graph can also be considered a density curve. Figure 2.10 shows such a density curve. Figure 2 . 10 An unusual "broken-line" density curve, for Exercise 2.13. y 2~----r----+-----r----r X 0 0.2 0.4 0.6 0.8 (a) Verify that the graph in Figure 2.10 is a valid density curve. For each of the following, use areas under this density curve to find the proportion of observations within the given interval: (b) 0. 6 ~ X ~ 0. 8 (c)O~x~o.4 (d) 0 ~X~ 0.2 (e) The median of this density curve is a point between X= 0.2 and X= 0.4. Explain why. outcomes simulation 2.14 Roll a distribution In this exercise you will pretend to roll a regular, six-sided die 120 times. Each time you roll the die, you will record the result. The numbers 1, 2, 3, 4, 5, and 6 are called the outcomes of this chance phenomenon. In 120 rolls, how many of each number would you expect to roll? The TI-83/84 and TI-89 are useful devices for imitating chance behavior, especially in situations like this that involve performing many repetitions. Because you are only pretending to roll the die repeatedly, we call this process a simulation. There will be a more formal treatment of simulations in Chapter 6. • Begin by clearing L 1/list1 on your calculator. • Use your calculator's random integer generator to generate 120 random whole numbers between 1 and 6 (inclusive), and then store these numbers in L 1/listl. • On the TI-83/84, press IFJII,choose PRB, then 5: Randint. On the TI-89, press llt!iffllel~[i] and choose randint .... CHAPTER 2 • Describing Location in a Distribution On the TI-83/84, complete the command Randint ( 1 6 12 0) ~l. On the TI-89, complete the command tis tat. randint ( 1 6 12 0 ~ list1. I I I • • • • I Set the viewing window as follows: X[ l, 7] 1 by Y[- 5, 2 5]5. Specify a histogram using the data in 1 1/listl . Then graph. Are you surprised? Repeat the simulation several times. You can recall and reuse the previous command by pressing flfiJIM~IIMU. It's a good habit to clear L1/listl before you roll the die again. (a) In theory, each number should come up 20 times. But in practice, there is chance variation, so the bars in the histogram will probably have different heights. Theoretically, what should the distribution look like? (b) How is this distribution similar to the uniform distribution in Exercise 2.1 0? How is it different? Section 2.1 Summa!} Two ways of describing an individual's location within a distribution are z-scores and percentiles. To standardize any observation x, subtract the mean of the distribution and then divide by the standard deviation. The resulting z-score x- mean standard deviation z=------says how many standard deviations x lies fron1 the distribution n1ean. An observation's percentile is the percent of the distribution that is at or below the value of that observation. We can also use z-scores and percentiles to compare the relative standing of individuals in different distributions. Chebyshev's inequality gives us a useful rule of thumb for the percent of observations in any distribution that are within a specified number of standard deviations of the mean. We can sometin1es describe the overall pattern of a distribution by a density curve. A density curve always remains on or above the horizontal axis and has total area l underneath it. An area under a density curve gives the proportion of observations that fall in a range of values. A density curve is an idealized description of the overall pattern of a distribution that s1nooths out the irregularities in the actual data. Write the mean of a density curve as IL and the standard deviation of a density curve as O" to distinguish then1 from the mean x and the standard deviation s of the actual data. The mean, the n1edian, and the quartiles of a density curve can be located by eye. The mean IL is the balance point of the curve. The median divides the area under the curve in half. The quartiles, with the 1nedian, divide the area under the curve into quarters. The standard deviation O" cannot be located by eye on most density curves. The mean and median are equal for symtnetric density curves. The mean of a skewed curve is located farther toward the long tail than is the median. 2.1 Measures of Relative Standing and Density Curves Sectio11 2.1 Exercises 2.15 Men's and women's heights The heights of women aged 20 to 29 are symmetrically distributed with a mean of 64 inches and standard deviation of 2.7 inches. Men the same age have mean height 69.3 inches with a standard deviation of 2.8 inches. What are the zscores for a woman 6 feet tall and a man 6 feet tall? Describe in simple language what information the z-scores give that the actual heights do not. 2.16 Baseball salaries, I Here are the salaries for each member of the Boston Red Sox baseball team on the opening day of the 2005 season: 5 Player Ramirez, Manny Schilling, Curt Damon, Johnny Renteria, Edgar Varitek, Jason Foulke, Keith Nixon, Trot Clement, Matt Ortiz, David Wakefield, Tim Wells, David Millar, Kevin Payton, Jay Embree, Alan Bellhorn, Mark Timlin, Mike Mueller, Bill Arroyo, Bronson Miller, Wade Mirabelli, Doug Halama, John Mantei, Matt Vazquez, Ramon Myers, Mike McCarty, David Youkilis, Kevin Neal, Blaine Stern, Adam Salary $ 22,000,000 14,500,000 8,250,000 8,000,000 8,000,000 7,500,000 7,500,000 6,500,000 5,250,000 4,670,000 4,075,000 3,500,000 3,500,000 3,000,000 2,750,000 2,750,000 2,500,000 1,850,000 1,500,000 1,500,000 850,000 750,000 . 700,000 600,000 550,000 323,125 321,000 316,000 (a) Construct a histogram of the distribution of salaries. (b) Calculate numerical summaries for this distribution of salaries. Describe the shape, center, and spread of the distribution. Are there any outliers? (c) Describe David McCarty's position within the distribution using both his z-score and his percentile. CHAPTER 2 Describing Location in a Distribution (d) According to our definition of percentile, who are the players at the 25th and 75th percentiles of this distribution, and what are their salaries? Do these values match the first and third quartiles you calculated in part (b)? Why or why not? 2.17 Baseball salaries, II Refer to the previous exercise. On the opening day of the 2004 season, David McCarty had a $500,000 salary. Johnny Damon's salary was $8,000,000. For the 30 players on the roster that clay, the mean salary was $4,243,283.33 and the standard deviation was $5,324,827.26. Damon's salary was the fifth highest on the team, while McCarty's salary ranked 24th. Compare David McCarty's and Johnny Damon's openingday salaries for the 2004 and 2005 seasons using actual data values, z-scores, and percentiles. Write a few sentences describing your conclusions. 2.18 Statistics test scores revisited Refer to the scores of Mr. Pryor's statistics students on their first test (page 116). (a) Compare the percent of test scores that fall within l, 2, 3, 4, and 5 standard deviations of the mean with the percents guaranteed by Chebyshev's inequality (see the table on page 116). (b) Mr. Pryor decides that the test scores are a little low. One option he considers is adding 4 points to each student's score. How will this affect each student's z-score? Their percentile? (c) Rather than adding 4 points, Mr. Pryor also thinks about multiplying each student's score by 1.06. How will this affect each student's z-score? Their percentile? (d) The last idea Mr. Pryor contemplates is determining each student's final score as follows: 84 + (student's z-score)(4 points). Which of the three plans- (b), (c), or (d) -would you recommend? Explain your reasoning. 2.19 Comparing performance: percentiles Erik is a star runner on the track team, and Erica is one of the best sprinters on the swim team. Both athletes qualify for the state championship meet based on their performance during the regular season. (a) In the track playoffs, Erik records a time that would fall at the lOth percentile of all his race times that season. But his performance places him at the 40th percentile in the state championship• meet. Explain how this is possible. (b) Erica swims a bit lethargically for her in the state swim meet, recording a time that would fall at the 40th percentile of all her meet times that season. But her performance places Erica at the lOth percentile in this event at the state meet. Explain how this could happen. 2.20 Another weird density curve A certain density curve looks like an inverted letter V. The first segment goes from the point (0, 0.6) to the point (0.5, 1.4). The second segment goes from (0.5, 1.4) to (1, 0.6). (a) Sketch the curve. Verify that the area under the curve is l, so that it is a valid density curve. (b) Determine the median. Mark the median and the approximate locations of the quartiles Q1 and Q3 on your sketch. (c) What percent of the observations lie below 0.3? (d) What percent of the observations lie between 0.3 and 0.7? 2.2 Normal Distributions 2.21 Calculator-generated density curve Like Minitab and similar computer software, the TI-83/84 and TI-89 has a "random number generator" that produces decimal numbers between 0 and l. mJD,then choose PRB and 1 :Rand. • On the TI-83/84, press • On the Tl-89, press P.m!Jm (MATH) then choose 7: Probability and 4: rand (. Be sure to close the parentheses. Press !j~lljO several times to see the results. The command 2rand ( 2rand () on the TI-89) produces a random number between 0 and 2. The density curve of the outcomes has constant height between 0 and 2, and height 0 elsewhere. (a) Draw a graph of the density curve. (b) Use your graph from (a) and the fact that areas under the curve are relative frequencies of outcomes to find the proportion of outcomes that are less than l . (c) What is the median of the distribution? The mean? What are the quartiles? (d) Find the proportion of outcomes that lie between 0.5 and 1.3. 2.22 FLIP50 The program FLIP50 simulates flipping a fair coin 50 times and counts the number of times the coin comes up heads. It prints the number of heads on the screen. Then it repeats the simulation for a total of l 00 times, each time displaying the number of heads in 50 flips. When it finishes, it draws a histogram of the 100 results. (You have to set up the plot first on the TI-89.) (a) What outcomes are likely? What outcomes are the most likely? If you made a histogram of the results of the l 00 repetitions, what shape distribution would you expect? (b) Link the program from a classmate or your teacher. Run the program and observe the variations in the results of the l 00 repetitions. I t (c) When the histogram appears, TRACE to see the classes and frequencies . Record the results in a frequency table. (On the TI-89, set up Plot l to be a histogram of list2 with a bucket width of2 . Then press C[i) (GRAPH). (d) Describe the distribution: symmetric versus nonsymmetric; center; spread; number of peaks; gaps; suspected outliers. What shape density curve would best fit your distribution? 2.2 Normal Distributions Nomwl distributions One particularly in1portant class of density curves has already appeared in Figures 2.3, 2.4, and 2.6(a) and the ufine-grained distribution" of Activity 2A. These density curves are syn1n1etric, single-peaked, and bell-shaped. They are called Normal curves, and they describe Norma l distributions. Norn1al distributions play a large role in statistics, but they are rather special and not at all unormal" in the sense of being average or natural. We capitalize Normal to remind you that these curves are special. CHAPTER 2 Describing Location in a Distribution All Normal distributions have the same overall shape. The exact density curve for a particular Normal distribution is described by giving its mean J.L and its standard deviation a-. The n1ean is located at the center of the symmetric curve, and is the same as the median. Changing J.L without changing a- moves the Normal curve along the horizontal axis without changing its spread. The standard deviation a- controls the spread of a Normal curve. Figure 2.11 shows two Nonnal curves with different values of a-. The curve with the larger standard deviation is more spread out. Figure 2.11 Two Normal curves, showing the mean J..L and standard deviation cr. ~ J..L J..L The standard deviation a- is the natural measure of spread for Normal distributions. Not only do J.L and a- completely detennine the shape of a Nonnal curve, but we can locate a- by eye on the curve. Here's how. As we move out in either direction frmn the center J.L, the curve changes fron1 falling ever more steeply to falling ever less steeply. _/ 1\ ~ The points at which this change of curvature takes place are located at distance aon either side of the mean J.L. Advanced math students know these points as inflection points. Figure 2.11 shows a- for two different Normal curves. You can feel the change as you run a pencil along a Normal curve, and so find the standard deviation. Ren1ember that J.L and a- alone do not specify the shape of most distributions, and that the shape of density curves in general does not reveal a-. These are special properties of Nonnal distributions. Why are the Normal distributions i1nportant in statistics? Here are three reasons. First, Normal distributions are good descriptions for some distributions of real data. Distributions that are often close to Normal include • scores on tests taken by many people (such as SAT exams and many psychological tests), • repeated careful1neasurements of the same quantity, and • characteristics of biological populations (such as yields of corn and lengths of anin1al pregnancies). 2.2 Normal Distributions Second, Normal distributions are good approximations to the results of many kinds of chance outcomes, such as tossing a coin many times. (See the FLIP 50 simulation in Exercise 2.22, page 133.) Third, and 1nost in1portant, we will see that many statistical inference procedures based on Nonnal distributions work well for other roughly symmetric distributions. However, even though many sets of data follow a Normal distribution, many do not. Most inc01ne distributions, for exarnple, are skewed to the right and so are not Normal. Some distributions are symmetric but not Normal or even close to Normal. The uniform distribution of Exercise 2.10 (page 128) is one such example. Non-Normal data, like non-normal people, not only are common but are s01netimes 1nore interesting than their Normal counterparts. I ~ The 68-95-99.7 Rule Although there are many Nonnal curves, they all have common properties. In particular, all Normal distributions obey the following rule. The 68-95-99.7 Rule In the Nonnal distribution with mean JL and standard deviation a: • Approximately 68% of the observations fall within a of the 1nean J.L. • Approximately 95% of the observations fall within 2a of JL. • Approximately 99.7% of the observations fall within 3a of JL. Figure 2.12 illustrates the 68-95-99.7 rule. Some authors refer to it as the a empirical rule." By remembering these three numbers, you can think about Normal Figure 2.12 The 68-95-99.7 rule for Normal distributions. 68% of data -+ I I -3 -2 -1 95% of data I 99.7% of data I 0 \ •I I' 2 c •I 3 136 CHAPTER 2 Describing Location in a Distribution distributions without constantly making detailed calculations when rough approximations will suffice. Notice that the 68-95-99.7 rule gives us 1nuch more precise inforn1ation about how the observations fall in a Normal distribution than Chebyshev's inequality (page 120) did. Example2.6 Young women's heights Using the 68-95-99.7 rule The distribution of heights of young women agedl8 to 24 is approximately Normal with mean J.L = 64.5 inches and standard deviation cr = 2. 5 inches. Figure 2.13 shows the application of the 68-95-99.7 rule in this example. Figure 2.13 57 The 68-95-99.7 rule applied to the distribution of the heights of young women. Here, J.L = 64.5 and u = 2.5. 59.5 62 64.5 . 67 69.5 72 Height (in inches) Two standard deviations is 5 inches for this distribution. The 9 5 part of the 68-95-99.7 rule says that the middle 95 % of young women are between 64.5 - 5 and 64.5 + 5 inches tall, that is, between 59.5 and69.5 inches. This fact is exactly true for an exactly Normal distribution. It is approximately true for the heights of young women because the distribution of heights is approximately Normal. The other 5% of young women have heights outside the range from 59.5 to 69.5 inches. Because the Normal distributions are symmetric, half of these women are on the tall side. So the tallest 2.5% of young women are taller than 69 .5 inches. The 99.7 part of the 68-95-99.7 rule says that almost all young women (99.7% of them) have heights between J.L- 3crand J.L + 3cr. This range of heights is 57 to 72 inches. N(J.L, u) Because we will n1ention Normal distributions often, a short notation is helpful. We abbreviate the Normal distribution with n1ean JL and standard deviation a as N(JL,a). For example, the distribution of young women's heights is N(64.5, 2.5). 2.2 Normal Distributions Activity 2C The Normal Curve Applet The Normal Curve Applet allows you to investigate the properties of Normal distributions very easily. It is somewhat limited by the number of pixels available for use, so that it can't hit every value exactly. For the questions below, use the closest available values. l. How accurate is 68-95-99. 7? The 68-95-99. 7 rule for Norn1al distributions is a useful approximation. But how accurate is this rule? On the applet, drag one flag past the other so that the applet shows the area under the curve between the two flags . (a) Place the flags one standard deviation on either side of the mean. What is the area between these two values? What does the 68-95-99. 7 rule say this area is? (b) Repeat for locations two and three standard deviations on either side of the mean. Again compare the 68-9 5-99. 7 rule with the area given by the applet. 2. How 1nany standard deviations above or below the mean does the 40th percentile of the Nonnal distribution with n1ean 0 and standard deviation l fall? Describe how you used the applet to answer this question. Exercises 2.23 Estimating standard deviations Figure 2.14 on the next page shows two Normal curves, both with mean 0. Approximately what is the standard deviation of each of these curves? 2.24 Men·s heights T he distribution of h eights of adult Am erican men is approximately Normal with mean 69 inches and standard deviation 2.5 inch es. Draw a Normal curve on which this mean and standard deviation are correctly located. (Hint: Draw the curve first, locate the points wh ere the curvature changes, then mark the h orizontal axis.) 2.25 More on men's heights T h e distribution of heights of adult American men is approximately Normal with mean 69 inch es and standard deviation 2.5 inch es. Use th e 68-95-99.7 rule to answer the following questions. (a) W hat percent of men are taller than 74 inches? (b) Between what heights do the m iddle 95% of men fall? (c) W hat percent of men are shorter than 66.5 in ches? (d) A h eight of 71. 5 inch es corresponds to what perce ntile of adult male Am erican h eights? CHAPTER 2 Describing Location in a Distribution Figure 2.14 -1.6 Two Normal curves with the same mean but different standard deviations, for Exercise 2.23. -1.2 - 0.8 -0.4 0 0.4 0.8 1.2 1.6 2.26 Potato chips The distribution of weights of 9-ounce bags of a particular brand of potato chips is approximately Normal with mean J.L = 9.12 ounces and standard deviation u = 0.15 ounce. (a) Draw an accurate sketch of the distribution of potato chip bag weights. Be sure to label the mean, as well as the points one, two, and three standard deviations away from the mean on the horizontal axis. (b) A bag that weighs 8.97 ounces is at what percentile in this distribution? (c) What percent of 9-ounce bags of this brand of potato chips weigh between 8.67 ounces and 9.27 ounces? 2.27 FLIP50 Refer to Exercise 2.22 (page 133). Run the program FLIP50 again. (a) Use the Data Analysis Toolbox from Chapter 1 to describe your results. (b) What percent of your observations fall within one standard deviation of the mean? Within two standard deviations? Within three standard deviations? (c) How well do your data follow the 68-95-99.7 rule for Normal distributions? Explain. 2.28 A fine-grained distribution You can do this exercise if you spray-painted a Normal distribution in Activity 2A. On your "fine-grained distribution," first count the number of whole squares and parts of squares under the curve. Approximate as best you can. This represents the total area under the curve. (a) Mark vertical lines at J.L - 1u and J.L + 1u. Count the number of squares or parts of squares between these two vertical lines. Now divide the number of squares within one standard deviation of J.L by the total number of squares under the curve and express your answer as a percent. How does this compare with 68%? Why would you expect your answer to differ somewhat from 68%? 2.2 Normal Distributions (b) Count squares to determine the percent of area within 2u of J.L. How does your answer compare with 95 %? (c) Count squares to determ ine the percent of area within 3u of J.L. How does your answer compare with 99. 7%? The Standard Normal Distribution As the 68-95-99.7 rule suggests, all Normal distributions share many common properties. In fact, all Norn1al distributions are the same if we m easure in units of size u about the m ean J.L as center. Changing to these un its requires us to standardize, just as we did in Section 2.1: z= He said, she said The height and weight distributions in this chapter come from actual measurements in a government survey. Why not just ask people? When asked their weight, almost all women say they weigh less than they really do. Heavier men also underreport their weight, but lighter men claim to weigh more than the scale shows. We leave you to ponder the psychology of the two sexes. Just remember that "say so" is no replacement for measuring. x- J.L - (J If the variable we standardize has a Normal distribution, then so does the new variable z. (This is because standardizing is a linear transfon11ation, which, as we learned in C hapter 1, does not change the shape of a distribution. ) This new distribution is called the standard Normal distribution. Standard Normal Distribution T h e standard Normal distribution is the N ormal distribution N(O, l ) with m ean 0 and standard deviation 1 (Figure 2.15 ). Figure 2.15 Standard Normal distribution. - 3.0 - 2.0 - 1.0 1.0 0 2.0 3.0 If a variable x h as any Normal distribution N( J.L, u ) with m ean J.L and standard deviation u, then the standardized variable X- J.L z=--u has the standard Normal distribution. 140 CHAPTER 2 Describing Location in a Distribution Standard Normal Calculations An area under a density curve is a proportion of the observations in a distribution. Any question about what proportion of observations lie in some range of values can be answered by finding an area under the curve. Because all Normal distributions are the same when we standardize, we can find areas under any Normal curve fro1n a single table, a table that gives areas under the curve for the standard Normal distribution. Table A, the standard Normal table, gives areas under the standard Normal curve. You can find Table A inside the front cover. Ths Standard Norms/ Tsbls Table A is a table of areas under the standard Normal curve. The table entry for each value z is the area under the curve to the left of z. z The next two examples show how to use the table. Example2.7 Area to the left Using the standard Nonnal table Problem: Find the proportion of observations from the standard Normal distribution that are less than 2.22. Solution: To find the area to the left of 2.22, locate 2.2 in the left-hand column of Table A, then locate the remaining digit 2 as .02 in the top row. The entry opposite 2.2 and under .02 is 0.9868. This is the area we seek. z .00 .01 .02 .03 2.1 .9821 .9826 .9830 .9834 2.2 .9861 .9864 .9868 .9871 2.3 .9893 .9896 .9898 .9901 Figure 2.16 illustrates the relationship between the value z = 2.22 and the area 0.9868. 2.2 Normal Distributions Figure 2.16 The area under a standard Normal curve to the left of the point z = 2.22 is 0.9868. Table entry for z is always the area under the curve to the left of z . Z= Example2.8 2.22 Area to the right More on using the standard Normal table z .04 .05 .08 Problem: Find the proportion of observations fro m the standard Normal distribution that - 2.2 - 2.1 - 2.0 .0125 .0 162 .0207 .0 122 .0158 .0202 .0 11 9 .0 154 .0197 are greater than -2. 15. Solution: Enter Table A under z = -2. 15. T hat is, fi nd -2. 1 in the left-hand column and .05 in the top row. T he table entry is 0.01 58. T his is the area to the left of- 2.1 5. Because the total area under the curve is 1, the area lying to the right of - 2. 15 is 1 - 0. 0158 = 0.9842. Figure 2.17 illustrates these areas . Figure 2.17 Areas under the standard Normal curve to the right and left of z = - 2.15. Table A gives only areas to the left. Z= - 2.15 Caution! A common student mistake is to look up a z-value in Table A and report the entry corresponding to that z-value, regardless of whether the problem asks for the area to the left or to the right of that z-value. Always sketch the standard Normal curve, mark the z-value, and shade the area of interest. And before you finish, make sure your answer is reasonable in the context of the problem. 142 CHAPTER 2 Describing Location in a Distribution Exercises 2.29 Table A practice Use Table A to find the proportion of observations from a standard Normal distribution that satisfies each of the following statements. In each case, sketch a standard Normal curve and shade the area under the curve that is the answer to the question. Use the Normal Curve Applet to check your answers. (a) z < 2.85 (b) z > 2.85 (c) z > -1.66 (d) -1.66 < z < 2.85 2.30 More Table A practice Use Table A to find the proportion of observations from a standard Normal distribution that satisfies each of the following statements. In each case, sketch a standard Normal curve and shade the area under the curve that is the answer to the question. Use the Normal Curve Applet to check your answers. (a) z < -2.46 (b) z > 2.46 (c) 0.89 < z < 2.46 (d) -2.95 < z < -1.27 B.C. by johnny hart NORMALIZ.E WHAI YoiJ AAvt:: IF 'lOu Dot-lriJEt:D GI..-A~S~s. f'fl -<- - ::::.,.;;oo;· -- ~4-,.;--A- ~ • . 1m- Normal Distribution Calculations We can answer any question about proportions of observations in a Normal distribution by standardizing and then using the standard Normal table. Here is an outline of the method for finding the proportion of the distribution in any region. Solving Problems Involving Norms/ Distributions Step 1: State the problem in terms of the observed variable x. Draw a picture of the distribution and shade the area of interest under the curve. 2.2 Normal Distributions Step 2: Standardize and draw a picture. Standardize x to restate the problem in terms of a standard Normal variable z. Draw a picture to show the area of interest under the standard Norn1al curve. Step 3: Use the table. Find the required area under the standard Normal curve, using Table A and the fact that the total area under the curve is 1. Step 4: Conclusion. Write your conclusion in the context of the problem. Example2.9 Is cholesterol a problem for young boys? Normal calculations The level of cholesterol in the blood is important because high cholesterol levels may increase the risk of heart disease . The distribution of blood cholesterol levels in a large population of people of the same age and sex is roughly Normal. For 14-year-old boys, 6 the mean is JL = 170 milligrams of cholesterol per deciliter of blood (mg/dl) and the standard deviation is a= 30 mg/dl. Levels above 240 mg/dl may require medical attention. What percent of 14-year-old boys have more than 240 mg/dl of cholesterol? Step 1: State the problem. Call the level of cholesterol in the blood x. The variable x has theN( 170, 30) distribution. We want the proportion of boys with cholesterol level x > 240. Sketch the distribution, mark the important points on the horizontal axis, and shade the area of interest. See Figure 2.18(a). Figure 2. 18a (a) Cholesterol levels for 14-year-o/d boys who may require medical attention. 170 200 240 Step 2: Standardize and draw a picture. On both sides of the inequality, subtract the mean, then divide by the standard deviation, to turn x into a standard Normal z: X> 240 X - 170 > 240 - 170 30 30 z > 2.33 144 CHAPTER 2 Describing Location in a Distribution Sketch a standard Normal curve, and shade the area of interest. See Figure 2.18(b). Figure 2. 18b (b) Areas under the standard Normal curve. Area =0.9901 Area Z= =0.0099 2.33 Step 3: Use the table. From Table A, we see that the proportion of observations less than 2.33 is 0.9901. About 99% of boys have cholesterol levels less than 240. The area to the right of 2.33 is therefore 1 - 0.9901 = 0.0099. This is about 0.01 , or 1%. Step 4: Conclusion. (Write your conclusion in the context of the problem.) Only about 1% of 14-year-old boys have dangerously high cholesterol. In a Norn1al distribution, the proportion of observations with x > 240 is the same as the proportion with x ~ 240. There is no area under the curve exactly above the point 240 on the horizontal axis, so the areas under the curve with x > 240 and x ~ 240 are the sa1ne. This isn't true of the actual data. There n1ay be a boy with exactly 240 mg/dl of blood cholesterol. The Normal distribution is just an easy-to-use approximation, not a description of every detail in the actual data. The key to doing a Normal calculation is to sketch the area you want, then match that area with the area that the table gives you. Here is another example. Example 2. 10 Cholesterol in young boys (again) Working with an interval What percent of 14-year-old boys have blood cholesterol between 170 and 240 mg/dl? Step 1: State the problem. We want the proportion of boys with 170:::; x:::; 240. Step 2: Standardize and draw a picture. 170 ::S 170 - 170 30 < X - X ::S 240 170 < 240 - 170 30 30 0:::; z:::; 2.33 2.2 Normal Distributions Sketch a standard Normal curve, and shade the area of interest. See Figure 2.19. Figure 2.19 Areas under the standard Normal curve. Area =0.5 I Area =0.4901 0 Area 2.33 =0.9901 Step 3: Use the table. The area between 0 and 2.33 is the area to the left of 2.33 minus the area to the left of 0. Look at Figure 2.19 to check this. From Table A, area between 0 and 2. 3 3 = area to th e left of 2. 33 - area to the left of 0.00 = 0.9901 - 0.5000 = 0.4901 Step 4: Conclusion. (State your conclusion in context.) About 49% of 14-year-old boys have cholesterol levels between 170 and 240 mg/cll. What if we meet a z that falls outside the range covered by Table A? For exa1nple, the area to the left of z = -4 does not appear in the table. But since -4 is less than - 3.4, this area is sn1aller than the entry for z = - 3.40, which is 0.0003. There is very little area under the standard Normal curve outside the range covered by Table A. You can take this area to be zero with little loss of accuracy. Finding a Value, Given a Proportion Examples 2.9 and2.10 illustrate the use of Table A to find what proportion of the observations satisfies some condition, such as "blood cholesterol between 170 and 240 n1g/dl." We n1ay instead want to find the observed value with a given proportion of the observations above or below it. To do this, use Table A backwards. Find the given proportion in the body of the table , read the corresponding z from the left column and top row, then "unstandardize" to get the observed value. Here is an example. 146 CHAPTER 2 Describing Location in a Distribution Example 2. 11 SAT Verbal scores Inverse Normal calculations Scores on the SAT Verbal test in recent years follow approximately the N(505, 110) distribution. How high must a student score in order to place in the top 10% of all students taking the SAT? Step 1: State the problem and draw a picture. We want to find the SAT score x with area 0.1 to its right under the Normal curve with mean J-t = 505 and standard deviation u = 110. That's the same as finding the SAT score x with area 0.9 to its left. Figure 2.20 poses the question in graphical form. Figure 2.20 Locating the point on a Normal curve with area 0.10 to its right. X= 505 The bell curve? Does the distribution of human intelligence follow the "bell curve" of a Normal distribution? Scores on IQ tests do roughly follow a Normal distribution. That is because a test score is calculated from a person's answers in a way that is designed to produce a Normal distribution. To conclude that intelligence follows a bell curve, we must agree thatthetestscores directly measure intelligence. Many psychologists don't think there is one human characteristic that we can call "intelligence" and can measure by a single test score. Z= X=? 0 Z= 1.28 Because Table A gives the areas to the left of z-values, always state the problem in terms of the area to the left of x. Step 2: Use the table. Look in the body of Table A for the entry closest to 0.9. It is 0.8997. This is the entry corresponding to z 0.9 to its left. = 1.28. So z = 1.28 is the standardized value with area Step 3: Unstandardize to transform the solution from the z scale back to the original x scale. We know that the standardized value of the unknown xis z = 1.28. Sox itself satisfies 505 = 1.28 110 X- Solving this equation for x gives X= 505 + (1.28)(110) = 645.8 This equation should make sense: it finds the x that lies 1.28 standard deviations above the mean on this particular Normal curve. That is the "unstandardized" meaning of z = 1.28. Step 4: Conclusion. We see that a student must score at least 646 to place in the highest 10%. 2.2 Normal Distributions Exercises 2.31 How hard do locomotives pull? An important measure of the performance of a locomotive is its "adhesion," which is the locomotive's pulling force as a multiple of its weight. The adhesion of one 4400-horsepower diesel locomotive varies in actual use according to a Normal distribution with mean JL = 0.37 and standard deviation u = 0.04. For each part that follows, sketch and shade an appropriate Normal distribution. Then show your work. (a) What proportion of adhesions measured in use are higher than 0.40? (b) What proportion of adhesions are between 0.40 and 0.50? (c) Improvements in the locomotive's computer controls change the distribution of adhesion to a Normal distribution with mean JL = 0.41 and standard deviation u = 0.02. Find the proportions in (a) and (b) after this improvement. 2.32 Table A in reverse Use Table A to find the value z from a standard Normal distribution that satisfies each of the following conditions. (Use the value of z from Table A that comes closest to satisfying the condition.) In each case, sketch a standard N annal curve with your value of z marked on the axis. Use the Nomwl Curve Applet to check your answers. (a) The point z with 25% of the observations falling to the left of it. (b) The point z with 40% of the observations falling to the right of it. 2.33 Length of pregnancies The length of human pregnancies from conception to birth varies according to a distribution that is approximately Normal with mean 266 days and standard deviation 16 days. For each part that follows, sketch and shade an appropriate Normal distribution. Then show your work. (a) What percent of pregnancies last less than 240 days (that's about 8 months)? (b) What percent of pregnancies last between 240 and 270 days (roughly between 8 months and 9 months)? (c) How long do the longest 20% of pregnancies last? 2.34 IQ test scores Scores on the Wechsler Adult Intelligence Scale (a standard IQ test) for the 20 to 34 age group are approximately Normally distributed with JL = 110 and u = 25. For each part that follows, sketch and shade an appropriate Normal distribution. Then show your work. (a) What percent of people aged 20 to 34 have IQ scores above 100? (b) What percent have scores above 150? (c) MENSA is an elite organization that admits as members people who score in the top 2% on IQ tests. What score on the Wechsler Adult Intelligence Scale would an individual have to earn to qualify for MENSA membership? 2.35 Locating quartiles The quartiles of any density curve are the points with area 0.25 and 0.75 to their left under the curve. Use Table A or the Nomwl Curve Applet to answer the following questions. CHAPTER 2 Describing Location in a Distribution (a) What are the quartiles of a standard Normal distribution? (b) How many standard deviations away from the mean do the quartiles lie in any Normal distribution? What are the quartiles for the lengths of human pregnancies in Exercise 2.33? 2.36 Brush your teeth The amount of time Ricardo spends brushing his teeth follows a Normal distribution with unknown mean and standard deviation. Ricardo spends less than one minute brushing his teeth about 40% of the time. He spends more than two minutes brushing his teeth 2% of the time. Use this information to determine the mean and standard deviation of this distribution. Assessing Normality The Normal distributions provide good 1nodels for smne distributions of real data. Examples include the highway gas mileage of 2006 Corvette convertibles, statewide une1nployment rates, and weights of 9-ounce bags of potato chips. The distributions of smne other con1mon variables are usually skewed and therefore distinctly nonNorn1al. Exan1ples include economic variables such as personal income and gross sales of business firn1s, the survival times of cancer patients after treatment, and the service lifetime of mechanical or electronic components. While experience can suggest whether or not a Normal distribution is plausible in a particular case, it is risky to assume that a distribution is Nonnal without actually inspecting the data. In the latter part of this course we will want to use various statistical inference procedures to try to answer questions that are in1portant to us. These tests involve sampling people or objects and recording data to gain insights into the populations from which they come. Many of these procedures are based on the assumption that the host population is approximately Normally distributed. Consequently, we need to develop n1ethods for assessing Nonnality. Method I Construct a histogram or a sten1plot. See if the graph is approximately bell-shaped and syn1n1etric about the n1ean. A histogram or stemplot can reveal distinctly non-Normal features of a distribution, such as outliers, pronounced skewness, or gaps and clusters. You can improve the effectiveness of these plots for assessing whether a distribution is Normal by marking the points :X, :X ± s, and :X ± 2s on the horizontal axis. This gives the scale natural to Normal distributions. Then compare the count of observations in each interval with the 68-95-99.7 rule. Example 2. 12 Seventh-grade vocabulary scores (again) Assessing N onnality The histogram in Figure 2.3 (page 123) suggests thatthe distribution of the 947 Gary vocabulary scores is close to Normal. It is hard to assess by eye how close to Normal a histogram is. Let's use the 68-95-99.7 rule to check more closely. We enter the scores into a statistical software package and ask for the mean and standard deviation. The computer replies, MEAN= 6. 8585 STDEV = 1. 5952 2.2 Normal Distributions Now that we know that x = 6.8585 and s = 1.5952, we check the 68-95-99.7 rule by finding the actual counts of Gary vocabulary scores in intervals of length s about the mean x. The computer will also do this for us. Here are the counts: 21 2.07 3s x- 129 3.67 x- 2s 331 5.26 x-s 125 318 6.86 8.45 X x+s 21 10.05 x + 2s 11.64 x + 3s The distribution is very close to symmetric. It also follows the 68-95-99.7 rule closely: 68.5% of the scores (649 out of 94 7) are within one standard deviation of the mean, 95.4% (903 out of947) are within two standard deviations, and 99.8% (94 5 out of947) are within three standard deviations. These counts confirm that the Normal distribution with J.L = 6.86 and a = 1.595 fits these data well. Sn1aller data sets rarely fit the 68-95-99.7 rule as well as the Gary vocabulary scores. This is even true of observations taken from a larger population that really has a Normal distribution. There is 1nore chance variation in small data sets. Method 2 Construct a Normal probability plot. A Nonnal probability plot provides a good assess1nent of the adequacy of the Nonnal1nodel for a set of data. Most software packages, including Minitab, Fathom, and Crunchlt!, can construct Norn1al probability plots from entered data. The TI-83/84 and TI-89 will also make Normal probability plots. You will need to be able to produce a Normal probability plot (either with a calculator or with computer software) and interpret it. Here is the basic idea of a Norn1al probability plot. The graphs produced by technology use more sophisticated versions of this idea. l. Arrange the observed data values frmn smallest to largest. Record what percentile of the data each value occupies. For example, the smallest observation in a set of 20 is at the 5% point, the second smallest is at the 10% point, and so on. 2. Use the standard Normal distribution (Table A) to find the z-scores at these san1e percentiles. For example, z = -1.645 is the 5% point of the standard Norn1al distribution, and z = -1.28 is the 10% point. 3. Plot each data point x against the corresponding z. If the data distribution is close to Nonnal, the plotted points will lie close to some straight line. Note: some software plots the x-values on the horizontal axis and the z-scores on the vertical axis, while other software does just the reverse. You should be able to interpret the plot in either case. 150 CHAPTER 2 Describing Location in a Distribution Use of Normal Probability Plots If the points on a Normal probability plot lie close to a straight line, the plot indicates that the data are Normal. Systematic deviations from a straight line indicate a non-Normal distribution. Outliers appear as points that are far away from the overall pattern of the plot. The next three examples show you how to interpret Normal probability plots. Example 2. 13 Connecting wires and wafers Interpreting Normal probability plots Let's return to Example 1. 8 (page 54), which described the breaking strengths of connections between very fine wires and a semiconductor wafer. Figure 1. 7 showed that the distribution of breaking strengths is symmetric and single-peaked. Now we ask: do the breaking strengths follow a Normal distribution? Figure 2.21 is a Normal probability plot of the breaking strengths. Lay a transparent straightedge over the center of the plot to see that most points fall close to a straight line. A Normal distribution describes these points quite well. The only substantial deviations are short horizontal runs of points. Each run represents repeated observations having the same value-there are five measurements at 1150, for example. This phenomenon (called granularity) is a result of the limited precision of the measurements and does not represent an important deviation from Normality. Figure 2.21 Normal probability plot of the breaking strengths of wires bonded to a semiconductor wafer. This distribution has a Normal shape except for outliers in both tails. 3000 l • 2500 ! c:n "C 1000 I • Cl ~ 1500 CJ) Cl c:: ~ ~ 1000 05 • 500 • • ••• • ••• • •• 0 -3 -2 -1 0 z-score 1 2 3 151 2.2 Normal Distributions The high outlier at 3150 pounds lies above the line formed by the center of the datait is farther out in the high direction than we expect Normal data to be. The two low outliers at 0 lie below the line-they are suspiciously far out in the low direction. These outliers were explained in Example 1.8. Example 2. 14 Guinea pig survival Some non-Nonnal data Table 2.2 gives the survival times in days of 72 guinea pigs after they were injected with tubercle bacilli in a medical experiment. Table2.2 43 80 91 103 137 191 45 80 92 104 138 198 Survival times {days) of guinea pigs in a medical experiment 53 81 92 107 139 211 56 81 97 108 144 214 56 81 99 109 145 243 57 82 99 113 147 249 58 83 100 114 156 329 66 83 100 118 162 380 67 84 101 121 174 403 73 88 102 123 178 511 74 89 102 126 179 522 79 91 102 128 184 598 Source: T. Bjerkedal. "Acquisition of resistance in guinea pigs infected with different doses of virulent tubercle bacilli," American Journal of Hygiene. 72 (1960). pp. 130-148. Figure 2.22 is a Normal probability plot of the guinea pig survival times. Draw a line through the leftmost points, which correspond to the smaller observations. The larger observations fall systematically above this line. That is, the right-of-center observations have larger values than the Normal distribution. This Normal probability plot indicates that the guinea pig survival data are strongly right-skewed. In a right-skewed distribution, the largest observations fall distinctly above a line drawn through the main body of points. Similarly, left-skewness is evident when the smallest observations fall below the line. There appear to be no outliers in this set of data. Figure 2.22 Normal probability plot of the survival times of guinea pigs in a medical experiment. This distribution is skewed to the right. • 600 •• 500 (ij" ~ •• • 400 Q) E ·~ -ro > 300 -~ #' (I) 200 .... 100 -3 -2 / -1 z-score 152 CHAPTER 2 Describing Location in a Distribution Example 2. 15 Acidity of rainwater Some Normal data Researchers collected data on the acidity levels (measured by pH) in 105 samples of rainwater. Distilled water has pH 7.00. As water becomes more acidic, the pH goes down. The pH of rainwater is important to environmentalists because of the problem of acid rain. 7 Figure 2.23 is a Normal probability plot of the 105 acidity (pH) measurements. The Normal probability plot makes it clear that a Normal distribution is a good descriptionthere are only minor wiggles in a generally straight-line pattern. Figure2.23 Normal probability plot of the acidity (pH) values of 105 samples of rainwater. This distribution is roughly Normal. •• • • • • 6.5 • ~ 6.0 Q) :::1 -ro > :J: c.. ~ co 5.5 :5: .!::: co 0: 5.0 J' .. ,. • • 4.5 -3 -2 -1 0 2 3 z-score As Figure 2.23 indicates, real data ahnost always show some departure from the theoretical Nonnal1nodel. When you examine a Normal probability plot, look for shapes that show clear departures from Normality. Don't overreact to minor wiggles in the plot. When we discuss statistical methods that are based on the Norn1al model, we will pay attention to the sensitivity of each method to departures from Normality. Many co1nmon methods work well as long as the data are approximately Normal and outliers are not present. 2.2 Normal Distributions Technology Toolbox Normal probability plots on the Tl-83/84/89 If you ran the progran1 FLIP 50 in Exercise 2.22 (page 13 3), and you still have the data (I 00 numbers mostly in the 20s) in Ll/listl, then use these data. If you have not entered the program and run it, take a few 1ninutes to do that now. Duplicate this example with your data. Here is the histogram that was generated at the end of one run of this sin1ulation on each calculator. TI-83/84 TI-89 P1:L1 min=24 max <26 n=28 n= 29 . MAIN RAD AUTO FUNC Ask for one-variable statistics: • Press B , choose CALC, then 1: 1-Var Stats and iliJU (Ll) • In the Stats/List Editor, press; II] (Calc) and choose 1: 1-Var Stats for list!. This will give us the following: 1-Var Stats x=25.04 Ix=2504 I,x 2 =63732 Sx=3.228409246 OX=3.21 2226642 ,J,n=100 1 -Var Stats ... ~ x -E LX Ix2 3. Sx ax n I ~ ·I ~ ~~~X rfil ( Enter=OK ) MAIN 1-Var Stats tn=100 minX=18 Q1 =23 Med=25 Q3=27 maxX=33 J Fl Too l RAD AUTO ax n 1. =24.45 =2445. =60525. =3.40635624 14 2 =3.38532146602 =100. =14. =22 . MinX Q1X MedX Q3X MaxX I (x-x) 2 FUNC =3.3853216602 =100. =14. =22. =24. =27. =32. =1148. 75 l i CEn~~J::=OJ< :> MAIN RAD AUTO FUNC f----- f--- 4/4 f----- 4 /4 • Cmnparing the 1neans and medians (x = 25.04 versus M = 25 on the TI-83/84 and x = 24.45 versus M = 24 on the TI-89) suggests that the distributions are fairly symmetric. Boxplots confirm the roughly symmetric shape. ~3~ Pl ~ Med=25 Med: 2 4. USE HjJ, OR TYPE+ [ESC] =CANCEL CHAPTER 2 154 Describing Location in a Distribution Technology Toolbox Normal probability plots on the Tl-83/84/89 (continued) • To construct a Normal probability plot of the data with the x-values on the horizontal axis, define Plot 1 like this: ~ Plot2 gQff Type : L I~ Plot3 ~ 3 dlJ:b o{]t··>{l}-;· - Data List---:-11 Data Axis:f3 Y Mark: m + · _j No rm Prob Plot ... r r- Plot Number: List: Data Axis: Mark: Store Zscor es to: Plotl--7 llistl I X--7 Box-? lstatvarsSz I <: ESC-CANCEL::> =8 5 1 6 5 l <: Enter-OK > zscores [20] =-. 85961736186 ... ~ USE f- AND --7 TO OPEN CHOICES • Use ZoomStat (ZoomData on the TI-89) to see the finished graph. P1:L1 D i. ••• rl-' •••• 80 )::( X=18 Y=-2.575829 [] MAIN RAD AUTO FUNC Interpretation: The Nonnal probability plot is quite linear, so it is reasonable to believe that the data follow a Normal distribution. Comment: The cursor on the bottom TI-83/84 screen shot is on the point with an x-value of 18 and a z-score of- 2. 58. This is the point with the smallest x-value of the 100 data values generated by the program FLIP 50. By our definition, that puts the value of x = 18 at the 1% point. The alternative definition of percentile (see page 119) would place x = 18 at the 0% point, since no data values are below this one. The Normal probability plot on the TI-83/84 and TI-89 uses a compromise: it puts this point at the 0. 5% mark. If we look up an area to the left of 0.0050 (0. 5%) in Table A, we find that z = - 2.58. Exercises 2.37 Taking stock Figure 2.24 (on the next page) is a Normal probability plot of the monthly percent returns for U.S. common stocks from June 1950 to June 2000. Because there are 601 observations, the individual points in the middle of the plot run together. In what way do monthly returns on stocks deviate from a Normal distribution? 2.2 Normal Distributions Figure2.24 Normal probability plot of the percent returns on U.S. common stocks in 601 consecutive months, for Exercise 2.37. • 10 E ~ 0 e ..... c Q) (.) Q; c.. -10 • -20 J • -3 -2 -1 2 0 3 z-score 2.38 Carbon dioxide emissions Figure 2.25 is a Normal probability plot of the emissions of carbon dioxide per person in 48 countries. 8 In what ways is this distribution non-Normal? Figure2.25 Normal probability plot of C02 emissions in 48 countries, for Exercise 2.38. • 20 18 c 0 en • 16 Q; 0.. Q; • 14 0.. en c 12 ·s 10 .8 Q) , •••• .§. en c 8 I 0 ·u; -~ E 6 , Q) N 0 u 4 ....../ 2 0 -3 -2 -1 , • tJ/1" 0 z-score 2 3 156 CHAPTER 2 Describing Location in a Distribution 2.39 Great white sharks Here are the lengths in feet of 44 great white sharks: 18.7 16.4 13.2 19.1 12.3 16.7 15.8 16.2 18.6 17.8 14.3 22.8 16.4 16.2 16.6 16.8 15.7 12.6 9.4 13.6 18.3 17.8 18.2 13.2 14.6 13.8 13.2 15.7 15.8 12.2 13.6 19.7 14.9 15.2 15.3 18.7 17.6 14.7 16.1 13.2 12.1 12.4 13.5 16.8 (a) Use the Data Analysis Toolbox (page 93) to describe the distribution of these lengths. (b) Compare the mean with the median. Does this comparison support your assessment in (a) of the shape of the distribution? Explain. (c) Is the distribution approximately Normal? If you haven't done this already, enter the data into your calculator, and reorder them from smallest to largest. Then calculate the percent of the data that lies within one standard deviation of the mean. Within two standard deviations of the mean. Within three standard deviations of the mean. (d) Use your calculator to construct a Normal probability plot. Interpret this plot. (e) Having inspected the data from several different perspectives, do you think these data are approximately Normal? Write a brief summary of your assessment that combines your findings from (a) through (d). 2.40 Cavendish and the density of the earth Repeated careful measurements of the same physical quantity often have a distribution that is close to Normal. Look back at Cavendish's 29 measurements of the density of the earth from Exercise 1.60 (page 106): 5.50 5.57 5.42 5.61 5.53 5.47 4.88 5.62 5.63 5.07 5.29 5.34 5.26 5.44 5.46 5.55 5.34 5.30 5.36 5.79 5.75 5.29 5.10 5.68 5.58 5.27 5.85 5.65 5.39 (a) Construct a stemplot to show that the data are reasonably symmetric. (b) Now check how closely they follow the 68-95-99.7 rule. Find x and s, then count the number of observations that fall between x - sand x + s, between x - 2s and x + 2s, and between :X- 3s and x + 3s. Compare the percents of the 29 observations in each of these intervals with the 68-95-99.7 rule. We expect that, when we have only a few observations from a Normal distribution, the percents will show some deviation from 68, 95, and 99.7. Cavendish's measurements are in fact close to Normal. (c) Use your calculator to construct a Normal probability plot for Cavendish's density of the earth data, and write a brief statement about the Normality of the data. Does the Normal probability plot reinforce your findings in (a) and (b)? 2.41 Normal random numbers Use technology to generate 100 observations from the standard Normal distribution. (The TI-83/84 command randNorrn ( 0, 1, 10 0) --7L 1 or the TI-89 command tis tat. randNorrn ( 0, 1, 100) --7list1 will do this.) (a) Make a histogram of these observations. How does the shape of the histogram compare with a Normal density curve? (b) Make a Normal probability plot of the data. Does the plot suggest any important deviations from Normality? 2.2 Normal Distributions (c) Repeat the process of generating 100 observations and completing parts (a) and (b) several times. This should give you a better idea of how close to Normal the data will appear when we take a sample from a Normal population. 2.42 Uniform random numbers Use technology to generate 100 observations from the uniform distribution described in Exercise 2.10 (page 128). (The TI-83/84 command rand ( 100) ~L 1 or the TI-89 command rand ( 100) ~list1 will do this.) (a) Make a histogram of these observations. How does the histogram compare with the density curve in Figure 2.8 (page 128)? (b) Make a Normal probability plot of your data. According to this plot, how does the uniform distribution deviate from Normality? Section 2.2 Summarl The Normal distributions are described by a special family of bell-shaped, symmetric density curves, called Normal curves. The mean J.L and standard deviation a completely specify a Normal distribution N(J.L, a). The 1nean is the center of the curve, and a is the distance from J.L to the inflection points on either side. In particular, all Norn1al distributions satisfy the 68-95-99.7 rule, which describes what percent of observations lie within one, two, and three standard deviations of the mean. All Normal distributions are the same when 1neasuren1ents are standardized. If x has the N(J.L,a) distribution, then the standardized variable z = (x - J.L)Ia has the standard Normal distribution N(O, l) with mean 0 and standard deviation l. Table A gives the proportions of standard Normal observations that are less than z for many values of z. By standardizing, we can use Table A for any Normal distribution. The Normal Curve Applet allows you to do Normal calculations quickly. In order to perform certain inference procedures in later chapters, we will need to know that the data cmne from populations that are approximately Norn1ally distributed. To assess Normality, one can observe the shape of histograms, stemplots, and boxplots and see how well the data fit the 68-95-99.7 rule for Normal distributions. Another good 1nethod for assessing Normality is to construct a Normal probability plot. Section 2.2 Exercises For Exercises 2.43 to 2.47, use Table A or the Nomwl Curve Applet. 2.43 Finding areas from z-scores Find the proportion of observations from a standard Normal distribution that fall in each of the following regions. In each case, sketch a standard Normal curve and shade the area representing the region. (a) z < 1.28 (b) z > -0.42 CHAPTER 2 (c) -0.42 < Describing Location in a Distribution z < 1.28 (d) -1.28 < z < -0.42 2.44 Working backward: finding z-values (a) Find the number z such that the proportion of observations that are less than z in a standard Normal distribution is 0.98. (b) Find the number z such that 22% of all observations from a standard Normal distribution are greater than z. 2.45 A surprising calculation Changing the mean of a Normal distribution by a moderate amount (while keeping the standard deviation about the same) can greatly change the percent of observations in the tails. Suppose that a college is looking for applicants with SAT Math scores of750 and above. (a) In 2004, the scores of males on the SAT Math test followed the N(537, 116) distribution. What percent of males scored 750 or better? Show your work. (b) Females' SAT Math scores that year had the N(501, 110) distribution. What percent of females scored 750 or better? Show your work. You see that the percent of males above 750 is almost three times the percent of females with such scores. 2.46 The stock market The annual rate of return on stock indexes (which combine many individual stocks) is approximately Normal. Since 1945, the Standard & Poor's 500 index has had a mean yearly return of 12%, with a standard deviation of 16.5 %. Take this Normal distribution to be the distribution of yearly returns over a long period. (a) In what range do the middle 95 % of all yearly returns lie? Justify your answer. (b) The market is down for the year if the return on the index is less than zero. In what proportion of years is the market down? Show your work. (c) In what proportion of years does the index gain 2 5% or more? Show your work. 2.47 Deciles The deciles of any distribution are the points that mark off the lowest 10% and the highest 10%. The deciles of a density curve are therefore the points with area 0.1 and 0.9 to their left under the curve. (a) What are the deciles of the standard Normal distribution? (b) The heights of young women are approximately Normal with mean 64.5 inches and standard deviation 2.5 inches. What are the deciles of this distribution? Show your work. 2.48 Outliers in a Normal distribution The percent of the observations that are classified as outliers by the 1.5 X IQR rule is the same in any Normal distribution. What is this percent? Show your method clearly. 2.49 Runners' heart rates Figure 2.26 is a Normal probability plot of the heart rates of 200 male runners after six minutes of exercise on a treadmill. 9 The distribution is close to Normal. How can you see this? Describe the nature of the small deviations from Normality that are visible in the plot. 2.2 Normal Distributions Figure2.26 159 Normal probability plot of the heart rates of 200 male runners, for Exercise 2.49. ,.,.. 140~ 130 -i ~ ::I c .E Q; a. 120 I • • 110 ~ ro Q) :::::. ~ ~ t ro Q) 100 90 :I: ··-' BOj • 70 -3 2 3 z-score 2.50 Weights aren't Normal The heights of people of the same sex and similar ages follow Normal distributions reasonably closely. Weights, on the other hand, are not Normally distributed. The weights of women aged 20 to 29 have mean 141.7 pounds and median 13 3.2 pounds. The first and third quartiles are 118.3 pounds and 157.3 pounds. What can you say about the shape of the weight distribution? Why? .nna¥~ ... ... EJ.·,:;.,r;x ,_ ...... -==Qlllgll.t; . . . , j Ij j j i j 42 :Ji Many plu~.rs £1t the' nonnal OJ:t'V~ when 1t col!l61 to dlstrlbuUcm or we.liht. 160 CHAPTER 2 e Describing Location in a Distribution CLOS~Dl As~ ThenewSAT Using what you have learned in this chapter, you should now be ready to examine some results from the Spring 2005 administrations of the new SAT discussed in the chapter-opening Case Study (page 113). 1. Overall, the distribution of scores on the new Writing section followed a Normal distribution with mean 516 and standard deviation 115. (a) What percent of students earned scores between 600 and 700 on the Writing section? Show your work. (b) What score would place a student at the 65th percentile? 2. In past years, males have earned higher mean scores on the Verbal and Math sections of the SAT. On the new Writing section, male scores followed a N(491, 11 0) distribution. Female scores followed a N(502, 108) distribution . (a) What percent of male test takers earned scores below the mean score for female test takers? (b) What percent of female test takers earned scores above the mean score for male test takers? (c) What percent of male test takers earned scores above the 85th percentile of the female scores? 3. Here are the Writing scores on the new SAT for all test takers at a medium-sized private high school : Males: Females: 660 540 500 620 530 520 440 700 580 430 460 720 610 580 520 630 540 490 760 720 550 71 0 560 530 470 670 530 690 600 570 570 690 550 700 690 530 500 430 660 520 690 600 580 550 420 700 500 460 480 670 560 580 630 690 580 590 640 640 560 620 580 570 580 530 630 550 550 520 570 580 640 530 610 540 640 630 570 520 490 630 620 560 780 450 560 630 610 (a) Did males or females at this school perform better? Give appropriate graphical and numerical evidence to support your answer. (b) How do the scores of the males and females at this school compare with the national results? Write a few sentences discussing this issue. (c) Is the distribution of scores for the males at this school approximately Normal? How about the distribution of scores for the females at this school? Justify your answer with appropriate statistical evidence. Chapter Review Chapter Review Summary This chapter focused on two big issues: describing an individual value's location within a distribution of data and modeling distributions with density curves. Both z-scores and percentiles provide easily calculated Ineasures of relative standing for individuals. Density curves come in assorted shapes, but all share the property that the area beneath the curve is 1. We can use areas under density curves to estimate the proportion of individuals in a distribution whose values fall in a specified range. In the special case of Normally distributed data, we can use the standard Normal curve and Table A to calculate such areas. There are many real-world examples of Normal distributions. When you n1eet a new set of data, you can use the graphical and numerical tools discussed in this chapter to assess the Normality of the data. What You Should Have Learned Here is a review list of the most important skills you should have acquired from your study of this chapter. A. Measures of Relative Standing 1. Find the standardized value (z-score) of an observation. Interpret z-scores in context. 2. Use percentiles to locate individual values within distributions of data. 3. Apply Chebyshev's inequality to a given distribution of data. B. Density Curves 1. Know that areas under a density curve represent proportions of all observations and that the total area under a density curve is 1. 2. Approximately locate the median (equal-areas point) and the mean (balance point) on a density curve. 3. Know that the mean and median both lie at the center of a symmetric density curve and that the Inean 1noves farther toward the long tail of a skewed curve. C. Normal Distributions 1. Recognize the shape of Normal curves and be able to estimate both the n1ean and standard deviation from such a curve. 2. Use the 68-95-99.7 rule and sym1netry to state what percent of the observations from a Normal distribution fall between two points when the points lie at the mean or one, two, or three standard deviations on either side of the n1ean. 162 CHAPTER 2 Describing Location in a Distribution 3. Use the standard Normal distribution to calculate the proportion of values in a specified range and to determine a z-score from a percentile. 4. Given that a variable has the Normal distribution with mean p., and standard deviation 0', use Table A and your calculator to detern1ine the proportion of values in a specified range. 5. Given that a variable has the Normal distribution with mean p., and standard deviation 0', calculate the point having a stated proportion of all values to the left or to the right of it. D. Assessing Normality 1. Plot a histogram, stem plot, and/or boxplot to detennine if a distribution is bell-sha peel. 2. Determine the proportion of observations within one, two, and three standard deviations of the mean, and cmnpare with the 68-95-99.7 rule for Normal distributions. 3. Construct and interpret Normal probability plots. Web Links You can get more information about standardized tests (like the PSAT, SAT, and ACT) and intelligence tests (like the Wechsler Adult Intelligence Scale) online. College Board Web site www.collegeboard.com ACT Web site www.act.org MENSA Web site www.us.mensa.org Chapter Review Exercises 2.51 IQ scores for children The scores of a reference population on the Wechsler Intelligence Scale for Children (WISC) are Normally distributed with JL = 100 and (J' = 15. A school district classified children as "gifted" if their WISC score exceeded 13 5. There are 1300 sixth-graders in the school district. About how many of them are gifted? Show your work. 2.52 Understanding density curves Remember that it is areas under a density curve, not the height of the curve, that give proportions in a distribution. To illustrate this, sketch a density curve that has its peak at 0 on the horizontal axis but has greater area within 0.25 on either side of 1 than within 0.25 on either side of 0. 2.53 Are the data Normal? Scores on the ACT test for the 2004 high school graduating class had mean 20.9 and standard deviation 4.8. In all, 1,171,460 students in this class took the test, and 1,052,490 of them had scores of 27 or lower. 10 If the distribution of scores were Normal, what percent of scores would be 27 or lower? What percent of the actual scores were 27 or lower? Does the Normal distribution describe the actual data well? Chapter Review 2.54 Standardized test scores as percentiles Joey received a report that he scored in the 97th percentile on a national standardized reading test but in the 72nd percentile on the math portion of the test. (a) Explain to Joey's grandmother, who knows no statistics, what these numbers mean. (b) Can we determine Joey's z-scores for his reading and math performance? Why or why not? 2.55 Helmet sizes The army reports that the distribution of head circumference among soldiers is approximately Normal with mean 22.8 inches and standard deviation 1.1 inches. Helmets are mass-produced for all except the smallest 5% and the largest 5% of head sizes. Soldiers in the smallest or largest 5% get custom-made helmets. What head sizes get custom-made helmets? Show your work. 2.56 More weird density curves A certain density curve consists of a straight-line segment that begins at the origin, (0, 0), and has slope l. (a) Sketch the density curve. What are the coordinates of the right endpoint of the segment? (b) Determine the median, the first quartile (Q 1), and the third quartile (Q 3 ). (c) Relative to the median, where would you expect the mean of the distribution to be located? (d) What percent of the observations lie below 0.5? Above 1.5? 2.57 Are the data Normal? A government report looked at the amount borrowed for college by students who graduated in 2000 and had taken out student loans. 11 The mean amount was x = $ 17,776 and the standard deviation was s = $12,034. The quartiles were Ql = $9900, M = $ 15,532, and Q3 = $22,500. (a) Compare the mean :X and the median M. Also compare the distances of Q 1 and Q 3 from the median. Explain why both comparisons suggest that the distribution is rightskewed. Interpret this in the context of student loans. (b) The right-skew increases the standard deviation. So a Normal distribution with the same mean and standard deviation would have a third quartile larger than the actual Q3. Find the third quartile of the Normal distribution with /.1. = $17,776 and cr = $12,034 and compare it with Q3 = $22,500. 2.58 Osteoporosis Osteoporosis is a condition in which the bones become brittle due to loss of minerals. To diagnose osteoporosis, an elaborate apparatus measures bone mineral density (BMD). BMD is usually reported in standardized form. The standardization is based on a population of healthy young adults. The World Health Organization (WHO) criterion for osteoporosis is a BMD score that is 2.5 standard deviations below the mean for young adults. BMD measurements in a population of people similar in age and sex roughly follow a Normal distribution. (a) What percent of healthy young adults have osteoporosis by the WHO criterion? (b) Women aged 70 to 79 are of course not young adults. The mean BMD in this age group is about- 2 on the standard scale for young adults. Suppose that the standard deviation is the same as for young adults. What percent of this older population has osteoporosis? 164 CHAPTER 2 Describing Location in a Distribution 2.59 Normal probability plots Figure 2.27 displays Normal probability plots for four different sets of data. Describe what each plot tells you about the Normality of the given data set. Figure 2.27 Four Normal probability plots, for Exercise 2.59. .. • • • •-2 ••• -1 -3 , / • • •• ...... •• 2 0 z-score 3 -3 • -•2 ••·" -1 0 1 2 3 z-score (a) ,.... (b) • • • • I / I' • • .. -3 • • / • • If -2 • -1 0 z-score (c) 1 2 3 -3 • -2...1' -1 0 1 2 3 z-score (d) 2.60 Grading managers Many companies "grade on a bell curve" to compare the performance of their managers and professional workers. This forces the use of some low performance ratings, so that not all workers are listed as "above average." Ford Motor Company's "performance management process" for a time assigned 10% A grades, 80% B grades, and 10% C grades to the company's 18,000 managers. Suppose that Ford's performance scores really are Normally distributed. This year, managers with scores less than 25 received C's and those with scores above 475 received A's. What are the mean and standard deviation of the scores? Show your work. Chapter Review Technology Toolbox Finding areas with ShadeNorm The TI-83/84/89 can be used to find the area to the left or right of a point or above an interval in a Normal distribution without referring to a standard Nonnal table. Consider the WISC scores for children ofExercise 2.51. This distribution is N( 100, 15). Suppose we want to find the percent of children whose WISC scores are above 125. Begin by specifying a viewing window as follows: X[55, 145]1 2 and Y[ -0.008, 0.028J.oi· You will generally need to experiment with they settings to get a good graph. TI-83/84 TI-89 • Press Pll!l l!im (DISTR), then choose DRAW and 1 : ShadeNorm ( . • Press lih-1ri11el! (i (Flash Apps) and choose shadNorm(. • Cmnplete the command ShadeNorm ( 1251 1E99 1100115) and preSS • Complete the comtnand tis tat. shadNorm ( 125 1 1E99 100 1 15) and preSS am. am. 1 FlT Tools Area=.04779 low=125 /l Area =. 04779 low=125. UP=lE99 MAIN RAD APPROX up=l.E99 FUNC You must always specify an interval. An area in the right tail of the distribution would theoretically be the interval (125, oo). The calculator limitation dictates that we use a nutnber that is at least 5 or l 0 standard deviations to the right of the mean. To find the area to the left of 8 5, you would specify ShadeNorm ( -1E9 9 8 5 10 0 15) . Or we could specify Shade Norm ( 0 8 5 10 0 15) since WISC scores can't be negative. Both yield at least four-decimalplace accuracy. If you're using standard Normal values, then you need to specify only the endpoints of the interval; the 1nean 0 and standard deviation l will be understood. For example, use ShadeNorm ( 11 2) to find the area above the interval z = l to z = 2 to four decimal places. Compare this result to the 68-95-99.7 rule. I I I I I I 2.61 Made in the shade Use the calculator's ShadeNorm feature to find the following areas for the WISC scores of Exercise 2. 51 (page 162), to four-decimal-place accuracy. Then write your findings in a sentence. (a) The relative frequency of scores greater than 135 . (b) The relative frequency of scores lower than 7 5. (c) Show two ways to find the percent of scores within two standard deviations of the mean. CHAPTER 2 Describing Location in a Distribution Technology Toolbox Finding areas with normalcdf The normalcdf com1nand on the TI-83/84/89 can be used to find areas under a Normal curve. This method has the advantage over ShadeNorn1 of being quicker, and the disadvantage of not providing a picture of the area it is finding. Here are the keystrokes for the WISC scores of Exercise2.5l: Tl-83/84 TI-89 • Press mDJ ~ (DISTR) and choose 2: normalcdf (. • Press liti)fi111Iij [i] (Flash Apps) and choose normCdf (. • Complete • Complete the con1mand tis tat. normCdf ( 12 5 1 E 9 9 1 0 0 15 ) and press [llliil. command normalcdf ( 12 5 1 E 9 9 1 0 0 15 ) and press ll~lll;l I the I I normalcdf(125,1E99, 100,15) .0477903304 I I I I ~=..........=~....L...:..::.=-==:....::.<... • tistat.normcdf(l25,l. E99 .,_. .04779033036 ·•·~•eJt MAIN RAD APPRO X FUNC 1/ 30 We can say that about 5% of the WISC scores are above 125. If the Normal values have already been standardized, then you need to specify only the left and right endpoints of the interval. For example, normalcdf ( -1~ 1) returns 0.6827, meaning that the area from z = -l to z = l is approximately 0.6827, correct to four decimal places. Chapter Review 2.62 Areas by calculator Use the calculator's normalcdf function to verify your answer to Exercise 2. 53 (page 162). Technology Toolbox Finding values with invNorm The TI-83/84/89 invNorm function calculates the raw or standardized Normal value corresponding to a known area under a Normal curve or a relative frequency. The following example uses the WISC scores, which have a N(100, 15) distribution. Here are the keystrokes: TI-83/84 TI-89 • Press PIDJ ~ (DISTR), then choose 3 : invNorm ( . • Press Press llf.tif!11eln li) (Flash Apps) and choose invNorm ( . • Complete the command invNorm ( . 9 10 0 15) and press I@UII;J. Compare this with the command invNorm ( . 9) . • Complete the command tis tat. invNorm (. 9 100 15) and press Compare this with the command tis tat. inv Norm(. 9). I I invNorm(.9,100,15) 119.2232735 invNorm ( . 9) 1.281551567 I mm. I • tistat.invnorm( . 9,100,15) 119 223 • tistat.invnorm(.9) 1.28155 0 MAIN RAD APPROX FUNC 2/3 0 The first cmnmand finds that the WISC score that has 90% of the scores below it from the N(lOO, 15 ) distribution is x = 119. The second command finds that the standardized WISC score that has 90% of the scores below it is z = 1.28. 2.63 Inverse normal on the calculator Use the calculator's invNorm function to verify your answer to Exercise 2.55 (page 163).

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Describing Location in a Distribution