Statistics or Never attribute to malice that which is adequately explained by stupidity.

Facts are stubborn things, but statistics are more pliable. – Mark Twain

Numbers don't lie. And often perception is not reality. Case in point: people are always concerned, worried even, about the increasingly violent society in which they live. Statements like "what is this world coming to" are commonplace. People may feel increasingly unsafe, and many become more and more reluctant to go out at night. Some have been known to hide behind their TV or computer instead of venturing out. There are many who feel that when they were young, crime was not as bad as it is now.

People attribute arbitrary reasons to this new wave of perceived violence. "Exhaustive music videos glorify violence, causing a violent cycle to never end..." "No wonder there is so much crime these days, look at all the violence on TV and in the movies..." "Kids have no respect for their parents, teachers or elders these days, this contributes to more violence..." Or "the remote control teaches us to become impatient, and we are more likely to quickly pull the trigger..."

Images from the OJ murders, the Columbine shootings or 9/11 tend to fill our televisions, replaying the same isolated scenes over and over again. People are shot every night on reruns of Law and Order. So, it's natural for people to criticize the amount of violence in our society, but rarely do these same people think their statements through to their logical conclusion. Instead, many become angry about the rise of violent crime in this country and make matters worse by linking this acquired malice to other elements in society (music videos, teenagers, TV, OJ), fostering a wider net of hate.

More importantly, they never once pause to check the numbers. And in a matter of moments, anyone can do just that. Any of us can access the FBI's Index of Crime Statistics on the WWW.
So, we did. Below are the nationwide statistics from 1982 to 2001, showing, by year, the number of violent crimes nationwide. During this twenty-year span, while the nation's population grew from 231,664,458 in 1982 to 284,796,887 in 2001, the number of violent crimes (defined as murder, rape, robbery and assault) did not steadily increase, as one might have expected. In fact, there was a stunning decline in violent crimes over the last decade. Violent crime reached its peak in 1992, with 1,932,274 reported instances, and since then violent crime has dropped over 25 percent. (The homicides on September 11, 2001 were not included.)

[Figure: Violent Crime (murder, rape, robbery, assault), number of crimes by year, 1982 to 2001]

But, look at the age we live in. We have all seen the headlines:

"Arizona Kids Are Home Alone, A new survey says 30 percent in kindergarten through 12th grade take care of themselves"
Of the 85 prisoners executed in 2000, 43 white, 35 African American, 6 Hispanic, 1 American Indian
Vietnam - 58,168 deaths, total abortions since 1973, 44,670,812 as of April 22, 2004
Should juveniles be tried as adults? Kids are killing these days in record numbers

Statistics are tossed at us in such a deluge that the numbers alone seem almost controversial: 30 % of school-age children left alone, 35 out of 85 executed are African American, 44 million abortions in the last 30 years...

Certainly, each of these topics elicits emotion from within each of us. Do too many parents leave their children unsupervised? Is there not enough funding for day care? Are you for or against the death penalty, and is it racially biased? Are too many or too few juveniles tried as adults? And if you want to clear a room with angry combatants, start with the age-old question: woman's choice or murder of the unborn? No matter your stand on these topics, as you comb through the headlines, statistics besiege you.

Why is quantitative literacy important?
When confronted with numbers associated with hotly contested issues or highly controversial ethical or moral arguments, raw numbers themselves, such as the above stated 44,670,812 abortions in the last 30 years, need to be examined so they may be fully understood. As always, we begin by examining the number for credibility. Is it even viable? This particular number, or numbers similar to it, appears on various websites. We easily found such numbers quoted at http://www.americandaily.com/article/1806 and http://womensissues.about.com/cs/abortionstats/a/aaabortionstats.htm. Are they accurate? We simply have no way of knowing, but these are often-published statistics. Are they viable? Now, that is a different question altogether.

Following our pattern of analysis, if the number seems to be viable, then we continue. If it is viable, what implications are fair to draw? These 44 million aborted fetuses would be 30 years of age or younger, so for argument's sake, let us assume it is fair to say a large percentage would be alive today. If this assumption is reasonable, 44 million plus the 290 million US citizens comes to 334 million. We are talking about a population of roughly 334 million people, and 44/334 is slightly larger than 13 %, or slightly greater than 1/8.

What does this mean? Has society aborted 1 in 8? Don't questions abound in your mind? Is this correct? Were these all abortions performed out of necessity? How many were medical? Or moral? Or personal choice? Does the reason for the abortion matter to you? Does the reason for the abortion matter to you if you take into account this new "1 in 8" statistic as a measure of how often abortions do occur? Certainly, one may argue that 1 in 8 could be construed as an alarming rate. But the point of view and the emotions you feel are personal to you. The point is that 44 million is the statistic we are confronted with.
Our ability to perform math tells us 1 in 8 is a logical consequence of this statistic. What you do in the subsequent interpretation is your decision. But quantitative literacy will allow you to understand the statistic in context and make the interpretation.

Statistics themselves are numbers that stand alone. Honest. Raw. Naked numbers. The name of the game in statistics is to draw inferences about a population or topic. If we are using polls, we are basing inferences on a smaller random sample of the general population. When trying to then form a conclusion, we must be careful. Correlation is not causation: just because numbers correlate does not mean one causes the other. Inferring characteristics about a population based on the raw data is the immediate reaction as we scan the headlines, but should it be? Can graphs be misleading? How good are we at recognizing misleading information?

Causation and Correlation

There exists a relationship between attendance and grades. Research shows that students who attend class regularly have better grades than those who don't. Does this mean that attending class will cause a student to have a better grade, that is, will simply coming to class increase one's grade? What about the student who regularly comes to class because they can get 50 minutes of solid rest by laying their head on the desk? The nature of this question illustrates the need for a distinction between two words, causation and correlation.

Cities with more pornography have a higher crime rate. What is the relationship between these two variables? Are the social implications as obvious as is implied? Relationships between variables are not always cut and dried. Studies can show that children who come from economically advantaged homes perform better in high school. If anyone took such a study and concluded that, as a society, the smarter citizens tended to rise to the top of the economic food chain, the public outcry would be palpable.
This is because other factors need to be taken into consideration, such as the premise "advantages are just that, advantages". Other factors such as better access to tutors, better access to support systems, or not having to study while you're hungry or cold or working full time, certainly contribute to one's academic performance.

Correlation should never be used interchangeably with causation. Sometimes correlation indicates causation, sometimes not. Clearly, there exists a high correlation between a person's blood alcohol level and the likelihood they will get into an auto accident. We do not think any rational person would dispute the added inference that drinking alcohol can cause an auto accident. The data that support the relationship between the two factors, namely the higher number of drunk drivers compared to sober drivers who get into accidents, imply correlation. That drinking led to, or caused, the accident implies causation. It will be our task to determine whether one factor's data that correlates with another factor's data can be interpreted to mean that one factor influences the other.

Correlation
A correlation exists between two factors if a change in one factor's data is associated with a rise or decline in the other factor's data.

Causation
A causation exists between two factors if one factor causes, determines or results in the rise or decline of the other factor's data.

Correlation as a result of causation

As with drinking and auto accidents, we can often infer that a correlation is tied to causation. Another equally clear case can be made by considering tobacco use and lung cancer. The numbers correlate: one can relate the amount one smokes to the likelihood of succumbing to lung cancer. Those who smoke more have a higher percentage of their population afflicted with lung cancer. And, for years, the Surgeon General has been telling us that smoking causes lung cancer. The more you smoke, the higher the risk of developing lung cancer.
Correlation with no causation. Hidden factors

Just because two factors correlate does not mean one factor causes the other. One of the easiest ways to spotlight the difference is to look at a common correlation between divorce and death. In most states, there is a significant negative correlation between the two: the more divorces, the fewer deaths. Since the two correlate negatively, the natural question arises: does getting a divorce reduce the risk of dying? Does staying married increase the chance of dying? All joking aside about the obvious hidden implication, it is a third, unseen factor that causes the correlation. Death and divorce do not have a causal relationship. Age does.

The older the married couple, the lower the likelihood they will get a divorce. The older the married couple, the higher the likelihood they will pass away. There is a negative correlation between divorce and age and a positive correlation between age and death. The younger you are, the more likely you are to get divorced. The older you are, the more likely you are to pass away. Since the higher number of divorces occurs among younger people, and since younger people tend to have more years left to live, we have a transitive relationship linking the higher number of divorces to the longer life spans. Correlation. Causation. Very different. Yes, there is a correlation between divorce and death. No, neither causes the other. In plain English, getting a divorce will not increase or decrease the likelihood you will die.

Accidental Correlations

Sometimes there exist accidental correlations where there is no hidden other factor or unseen logical explanation. The winner of the Super Bowl and the party of the winner of the presidential race in the country correlate highly every four years, but do not think football predicts the presidential races, or vice versa. This is an accidental correlation.

Misleading Information

Breast cancer will afflict one in eleven women.
But this figure is misleading because it applies to all women to age eighty-five. Only a small minority of women live to that age. The incidence of breast cancer rises as a woman gets older. At age forty, one in a thousand women develops breast cancer. At age sixty, one in five hundred. Is the statistic one in eleven technically correct? Yes. Should a 40-year-old woman be concerned with getting breast cancer? Certainly. Should she worry that one in eleven of her peer group will be afflicted? No. And while one in a thousand in her peer group will be afflicted, this by no means minimizes the seriousness of the issue, but sheds a more realistic light on it.

A scatter plot is a graph of ordered pairs that allows us to examine the relation between two sets of data.

To draw a scatter plot:
Arrange the data in a table.
Decide which column represents the x-values (the data along the horizontal axis). Those values need to be the perceived cause, the independent variable.
Decide which column represents the y-values (the data along the vertical axis). These values need to appear to be affected by the perceived cause, the dependent variable.
Plot the data as points in the form of an ordered pair, (x, y).

Analysis: We can make predictions if the points show a correlation.
* if the points appear to increase while reading the scatter plot from left to right, this is a positive correlation.
* if the points appear to decrease while reading the scatter plot from left to right, this is a negative correlation.

Positive Correlation: We expect that if the values along the horizontal axis increase, so do the values associated with the vertical axis. That is, as we increase x, we increase y. The more we study, the higher we expect to score on Exam One.

[Figure: scatter plot, Grade on Exam One (vertical axis) against Hours Spent Studying (horizontal axis)]

Negative Correlation: We expect that if we increase x then we decrease y.
The higher the temperature, the fewer minutes we will jog.

[Figure: scatter plot, Minutes Spent Jogging (vertical axis) against Temperature in Fahrenheit (horizontal axis)]

Problem One
For each pair below, decide if there is a correlation between the two factors. If there is, is it a positive correlation or a negative correlation? Then decide if the two factors have a causal relationship. If they do not have a causal relationship, but they do correlate, determine if there are hidden factors that explain the correlation, if the correlation is accidental or if there is misleading information.
a. A child's shoe size, a child's ability to do math
b. Blood alcohol level and reaction time
c. A girl's body weight, the time the girl spends playing with dolls each day
d. Price of an airline ticket, the distance traveled

Solution
a. Positive correlation. A child's shoe size does correspond to a child's ability to do math. The larger a child's shoe size, the better in math they are. But the relationship is not causal; large feet do not cause a child to perform math better. There is a hidden factor. Age. The older the child, the larger the child's shoe size. As children become older, they take more math classes. The more math classes the child has participated in, the better the child performs in math.
b. Negative correlation. The higher the blood alcohol level, the slower the reaction time. The relationship is causal.
c. Negative correlation. As a girl's body weight increases, she plays with dolls less each day. No causal relationship here; again there is a hidden factor, and again it is age. The more a girl weighs, the older she is, and the less time she spends playing with dolls.
d. Positive correlation. The longer the distance, generally, the more expensive the ticket. Causation.

Problem Two
Let's examine the basic question, "Do students who do better on a placement exam perform better in a college algebra course?" Below is the data. Draw a scatter plot and answer the question.
Placement Score    Final Average, College Algebra
70                 91
68                 89
56                 71
40                 62
78                 95
59                 65
67                 85
45                 66
61                 70

Solution
We need to examine the data visually to see if there exists a positive or a negative correlation between higher placement test scores and performance in college algebra. Below is a scatter plot of the placement test data and average scores on a College Algebra Final.

[Figure: scatter plot, Final Average in College Algebra (vertical axis) against Placement Test Score (horizontal axis)]

Though not all points show the same trend, the general trend is that an increase in placement score translates to an increase in the final average grade.

Exercise Set

For problems 1-13, decide if there is a correlation between the two factors. If there is, is it a positive correlation or a negative correlation? Then decide if the two factors have a causal relationship. If they do not have a causal relationship, but they do correlate, determine if there are hidden factors that explain the correlation, if the correlation is accidental or if there is misleading information.
1. Altitude, air pressure
2. Number of homes sold, realtor's income
3. Number of abortions in US, number of people who are Pro-choice
4. Encouragement of cattle ranching, amount of rain forest
5. The length of time a couple is together, the similarity of their outlook on life
6. A senior citizen's age, clarity of vision for the senior citizen
7. Weight of an envelope, postage on the envelope
8. A boy's height, a boy's time spent watching cartoons each day
9. Minutes hot coffee sits on a desk, the temperature of the coffee
10. Rate of violence in a city, unemployment rate in the same city
11. Petroleum consumption, quality of air
12. Number of cars on the highway, quality of air
13. Efficiency of household appliance, size of the monthly electric bill

14. From a survey of 2000 people, the table below represents averages for the number of years in school and the associated average monthly salary. Make a scatter plot, labeling the x and y axes.

Number of Years in School      Average Monthly Salary
Less than 12 (approx. 10)      $ 1,500
12                             $ 1,750
14                             $ 2,100
16                             $ 2,550
18                             $ 3,000
20                             $ 4,200

15. Draw a line through the data which closely fits the scatter plot below of cumulative donations for a charity by year.

[Figure: scatter plot of cumulative donations for a charity by year]

16. From the scatter plot below, interpret the linear pattern and predict the percent of students who failed the math course in the year 2000.

[Figure: scatter plot, percent of students failing math course (vertical axis) by year, 1988 to 2000]

17. Which data below has the greatest negative correlation?

[Figure: four scatter plots, a) through d), of dollar amounts ($) by year, 1965 to 2000]

Construct and Draw Inferences

Constructing and drawing inferences are essential to critical thinking and problem solving. When faced with statements, problems and puzzles, we do more than use common sense. We use problem solving skills, try to find patterns and infer statements that follow logically from the statements given. We determine what is reasonable and what is not. We determine what should logically follow and what should not in order to make good decisions.

Circle Graphs

Taken directly from newspaper headlines: Should a juvenile be tried as an adult? To address this issue, we should ask ourselves many questions and look at this crucial problem from many perspectives. For many of us, the first question we ask may be "Do juveniles who murder pose a chronic problem in this country?" Well, what's chronic? If a large percentage of all murders were committed by juveniles, this could be called chronic. We return to the Crime Index as defined by the FBI from 2001.
Let us ask the question, "Does there exist a correlation between age and those who commit murder in this country?" As long as we have the information grouped by category, which in this case is by age, we can display large numbers as a percent of the whole in a pie chart or circle graph.

First, let's see how a circle graph or pie chart is made. We subdivide a circle into sectors, each represented by its central angle in degrees (out of 360 degrees) or by the percent of the circle that is to be shaded (out of 100 %).

[Figure: Common subdivisions of a circle]

So, for our question, "Is there a correlation between age and those who commit murder in this country?", we examine the data taken from the Crime Index. Of the 10,113 known murderers in the country in 2001, their age distribution was given as follows:

Age, in years    Number
1 to 4           0
5 to 8           0
9 to 12          14
13 to 16         454
17 to 19         1,695
20 to 24         2,767
25 to 29         1,571
30 to 34         992
35 to 39         855
40 to 44         645
45 to 49         455
50 to 54         272
55 to 59         158
60 to 64         85
65 to 69         59
70 to 74         37
75 and over      54
Total            10,113

Since the data is already organized, let's find the density of each age group. This means we will reconstruct the table and find the percent of murderers for each category, 1 to 4, 5 to 8, 9 to 12, 13 to 16 and so on. Note, not all categories are partitioned into equal time intervals.
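The calculation just described can be sketched in a few lines of Python. This is only an illustration, not part of the FBI data or the original text: the names age_counts and circle_graph_sectors are our own, and the counts are simply copied from the table above. Each relative frequency is the group's count divided by the total, and each central angle is that fraction of 360 degrees.

```python
# Known murder offenders by age group, 2001, copied from the table above.
age_counts = {
    "1 to 4": 0, "5 to 8": 0, "9 to 12": 14, "13 to 16": 454,
    "17 to 19": 1695, "20 to 24": 2767, "25 to 29": 1571,
    "30 to 34": 992, "35 to 39": 855, "40 to 44": 645,
    "45 to 49": 455, "50 to 54": 272, "55 to 59": 158,
    "60 to 64": 85, "65 to 69": 59, "70 to 74": 37, "75 and over": 54,
}


def circle_graph_sectors(counts):
    """Return {label: (relative_frequency, central_angle_in_degrees)}."""
    total = sum(counts.values())
    return {
        label: (n / total, n / total * 360)
        for label, n in counts.items()
    }


sectors = circle_graph_sectors(age_counts)
for label, (freq, angle) in sectors.items():
    # One row per age group: fraction of the whole, then degrees of the circle.
    print(f"{label:>11}: {freq:6.4f}  {angle:6.1f} degrees")
```

Because every angle is a fraction of 360, the angles for all groups add up to a whole circle, which is a quick check that no group was left out.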
Age, in years    Number    Relative frequency         Central angle
1 to 4           0         0                          0
5 to 8           0         0                          0
9 to 12          14        14/10,113 ≈ 0.0014         0.0014 x 360 ≈ 0.5 degrees
13 to 16         454       454/10,113 ≈ 0.0449        0.0449 x 360 ≈ 16.2 degrees
17 to 19         1,695     1,695/10,113 ≈ 0.1676      0.1676 x 360 ≈ 60.34 degrees, or about 1/6 of the circle
20 to 24         2,767     2,767/10,113 ≈ 0.2736      0.2736 x 360 ≈ 98.5 degrees
25 to 29         1,571     0.155                      55.9 degrees
30 to 34         992       0.098                      35.3 degrees
35 to 39         855       0.085                      30.4 degrees
40 to 44         645       0.064                      23 degrees
45 to 49         455       0.045                      16.2 degrees
50 to 54         272       0.027                      9.7 degrees
55 to 59         158       0.016                      5.62 degrees
60 to 64         85        0.008                      3 degrees
65 to 69         59        0.006                      2.1 degrees
70 to 74         37        0.004                      1.3 degrees
75 and over      54        0.005                      2 degrees
Total            10,113    1                          360 degrees, a whole circle

[Figure: pie chart, Murder Offenders by age, with one sector for each age group in the table]

The pie chart is illuminating. Very quickly, by glancing at the chart, we can tell that 20 to 24 year olds commit the most murders, but a close second seems to be 17 to 19 year olds, as well as 25 to 29 year olds. If a juvenile is defined to be under 18 years of age, then this appears to be a chronic problem because the second most dense population of murderers occurs in the 17 to 19 year old age group. Now when we factor in the 13 to 16 year olds (454), the problem of juvenile murder seems to be more acute. For murders committed by teenagers alone, the 13 to 19 year old age group accounts for 454 + 1,695 or 2,149 murders. This comes to 2,149/10,113 or just a little over 20 percent, and this doesn't include the children who are 12 or under.

Now, let's address this problem again. Numbers never lie. But rearranged, could they deceive? Could the very same numbers be used by the opposing side of the argument to make the opposing view more viable? As said, first, we re-arrange the numbers.
1 to 19        14 + 454 + 1,695 = 2,163
20 to 39       2,767 + 1,571 + 992 + 855 = 6,185
40 to 59       645 + 455 + 272 + 158 = 1,530
60 and over    85 + 59 + 37 + 54 = 235

We then construct a pie chart from these new subdivisions. Again, keep in mind we are only considering the murders where we know the age of the murderer. There were 10,113 of these murders.

[Figure: pie chart, Murderers by age, 2001, with sectors 1 to 19, 20 to 39, 40 to 59, 60 and over]

But we are trying to represent the opposing point of view, and we are trying to show that murder by juveniles is not a 'chronic problem'. In 2001, there were an additional 5,375 murders where the age of the perpetrator was unknown. Regrouping, our table looks like:

1 to 19        2,163
20 to 39       6,185
40 to 59       1,530
60 and over    235
Unknown        5,375

[Figure: pie chart, Murderers, by age, 2001, with sectors 1 to 19, 20 to 39, 40 to 59, 60 and over, unknown]

Let's examine the new pie chart. Notice how much smaller the piece of the pie for the 1 to 19 year old segment now is compared to the whole. This is a significant difference from the previous pie charts, where we did not factor in the murders committed by people of unknown ages.

To further enhance our argument, we may construct the slices of the pie with a 3-dimensional representation. We then shift the angle of the segment of the pie we are trying to hide so that it is less prominent. Our point that juvenile crime is not a chronic problem seems more justified to the viewer's eye.

[Figure: 3-dimensional pie chart, Murderers by age, 2001, with sectors 1 to 19, 20 to 39, 40 to 59, 60 and over, unknown]

To add a final touch in enhancing our argument, let's re-categorize and change two groupings: 1 to 19 and 20 to 39 become 1 to 16 and 17 to 39. If we keep the category of unknown murderers in the groupings, let's compare the original pie chart with the final one. To the naked eye, a quick glance reveals the juvenile's slice to be a mere sliver on the left chart compared to nearly a quarter of the pie on the right.
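How much each regrouping shrinks the youngest slice can be checked with a short Python sketch. This is an illustration under our own naming (percent_share, known, with_unknown and regrouped are hypothetical names, not from the original text); all counts are taken from the tables above.

```python
# Known offenders by regrouped age band, 2001 (from the regrouped table).
known = {"1 to 19": 2163, "20 to 39": 6185, "40 to 59": 1530, "60 and over": 235}
unknown_age = 5375  # murders where the offender's age was unknown


def percent_share(label, groups):
    """Percent of the whole pie taken up by the sector named `label`."""
    return 100 * groups[label] / sum(groups.values())


# Chart 1: only offenders of known age.
share_known_only = percent_share("1 to 19", known)

# Chart 2: same groups, plus an "unknown" sector that enlarges the whole.
with_unknown = dict(known, unknown=unknown_age)
share_with_unknown = percent_share("1 to 19", with_unknown)

# Chart 3: move the 17 to 19 year olds (1,695) into the next band.
regrouped = {
    "1 to 16": 14 + 454,
    "17 to 39": 1695 + 6185,
    "40 to 59": 1530,
    "60 and over": 235,
    "unknown": unknown_age,
}
share_regrouped = percent_share("1 to 16", regrouped)

print(round(share_known_only, 1))    # 21.4
print(round(share_with_unknown, 1))  # 14.0
print(round(share_regrouped, 1))     # 3.0
```

The underlying counts never change; only the choice of categories and of what counts as the whole does, which is why the same data can yield a slice of about 21 percent or about 3 percent.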
[Figure: two pie charts titled "Murderers by age, 2001": the final regrouped chart on the left and the original chart on the right]

Statistics don't lie; they can be rearranged, though, to show whatever is on one's agenda.

Problem Two
The graph below is shown and a TV anchorman states, "There was a sharp dramatic increase in drunk driving convictions between the year 1999 and the year 2000." Consider the statement and reply to its accuracy.

Solution
According to the figure, the actual increase in drunk driving convictions between 1999 and 2000 was 12, up to 732 from 720 the year before. Though this is an increase, it cannot be considered a "sharp dramatic increase". Evaluating the data in another way, the percent increase, 12/720 x 100 ≈ 1.7 %, is not significantly sharp or particularly dramatic. The anchorman was overdramatizing the report; the words may be deemed inflammatory, bordering on misleading.

Problem Three
Drawing Inferences
A bucket has small green balls, medium blue balls, large pink balls, and very large red balls. A child picks ten balls, selecting each randomly so each ball is equally likely to be selected. Four such trials were conducted. Which trial most closely resembles the theoretical probability that would occur if the balls were selected randomly ten times?

Number of Balls Selected
Balls             a)    b)    c)    d)
Small Green       2     3     2     3
Medium Blue       2     2     3     2
Large Pink        2     3     2     4
Very Large Red    4     2     3     1

Solution
First, we need to calculate the theoretical probability for each type of ball. Recall, the probability is the number of successful outcomes divided by the total number of outcomes. The total number of balls is 20. There are 6 small green ones, 4 medium blue ones, 7 large pink ones, and 3 very large red ones.
Below are the theoretical probabilities:

Balls             Prob.
Small Green       6/20
Medium Blue       4/20
Large Pink        7/20
Very Large Red    3/20

If ten balls were selected, we could anticipate 3 out of 10 balls to be small green ones, 2 out of 10 to be medium blue ones, 3.5 out of 10 to be large pink ones and 1.5 out of 10 to be very large red ones. This exact outcome is impossible, since we cannot select half a ball, so we look for the trial closest to these expected results. Choice b) matches the expected counts for green and blue and is within half a ball for pink and red, so choice b) is the closest trial to these expected results. (Note that choice d) is equally close by the same measure.)

Exercise Set

For problems 1 and 2, use the following data. In the year 2000, a state lottery distributes its $ 2.1 million proceeds in the following manner:

Proceeds      Beneficiary
$ 900,000     Education
$ 500,000     Cities
$ 200,000     Highways
$ 200,000     Senior Citizens
$ 160,000     Libraries
$ 140,000     Other

1. Draw two circle graphs. One should support the argument that too much money from the state lottery went toward education. The other should support the counter argument, that too little money from the state lottery went toward education.
2. Choose a side to the above argument. Pro or Con. Then write a paragraph defending your argument, citing social, political, ethical and/or religious factors.

For problems 3-4, use the following data. In 2000, the population of California was 33,871,648 and 134,227 Californians purchased 193,489 handguns. 103,743 people purchased one handgun; 28,453 people purchased two to five handguns, totaling 71,363 handguns; 1,855 people purchased 6 to 12 handguns, totaling 14,053 handguns; and 176 people purchased more than 12 handguns, for a total of 4,330 handguns. Source: http://www.ucdmc.ucdavis.edu/vprp/Section6,2000.pdf

3. Construct two circle graphs. One circle graph should support the argument that there is a need for more restrictions on handguns in the state of California. The other circle graph should refute the argument, that is, support the counter argument that there is no need for more restrictions on handguns in the state of California.
4. Choose a side to the argument that there is a need for more restrictions on handguns in the state of California. Pro or Con. Then write a paragraph defending your argument, citing social, political, ethical and/or religious factors.

For problems 5 and 6, use the following data from the US Census Bureau. In 1999, there were roughly 280,000,000 US citizens, and 35,000,000 lived in poverty. Of these 35 million, 12,100,000 were children, and 4,500,000 of these children lived in families who were under one-half of the poverty level. The poverty level was defined as $ 13,290 per family of three. For each problem, construct a circle graph as designated below.

5. Draw a circle graph whose population is the citizens of the United States. Section the circle graph into two sectors, one sector representing the US citizens who live above the poverty level, one sector representing the US citizens who live below the poverty level.
6. Draw a circle graph whose population is those citizens who live below the poverty level. Section the circle graph into three sectors, one sector representing the adults who live below the poverty level, one sector representing the children who lived in families who lived under one-half the poverty level and the third sector representing all of the other children who lived below the poverty level.

For problems 7-8, observe the tables below. For each, what is the greatest issue presented by these numbers? Then argue one side of the argument, using pie charts to visually sway your reader. Be certain to outline the issue and show the supporting table(s) and pie chart(s). Discuss the potential harm of such practices.

7. Murder Victims. By Race and Sex. 2001. Note: The murder and nonnegligent homicides as a result of the events of September 11th, 2001, were not included in the table below. 2001, taken from Tables 2.3-2.15. Special Report Section V. http://www.fbi.gov/ucr/01cius.htm.

Race of Victim    Total     Male      Female    Unknown
White             6,750     4,785     1,962     3
Black             6,446     5,350     1,095     1
Other race        368       245       123       0
Unknown race      188       123       34        31
Total             13,752    10,503    3,214     35

8. Hate Crime Statistics. By Bias. 2003. Source: FBI Crime List in 2003. http://www.fbi.gov/pressrel/pressrel04/pressrel112204.htm

Total Victims        9,100
Bias to race         4,754
  Anti-White         1,006
  Anti-Black         3,150
  Other              598
Bias to Religion     1,489
  Anti-Jewish        1,025
  Anti-Catholic      80
  Anti-Islamic       171
  Other              213
Bias to Other        2,857

9. The graph below shows the company's profits in its first four years of existence. What's wrong with this statement: "There was a substantial increase in the company's profit in its first 4 years of existence."

10. Poll your classmates as to the most important 'hot button' campaign issue. Create a table as you see below.

Topic               Frequency    Relative Frequency    Density
Terror
Racial Relations
Abortion
Death Penalty
Drugs
Education

Construct a histogram and a pie chart for the data.

11. Project: Circle graphs, drawing inferences
Sometimes we choose to see what we want to see. We all stretch the truth, exaggerate what we need, ignore what hurts us, and to what end? Personal wealth at the expense of personal worth? From the US Census Bureau, 2000: Child poverty in America dropped from 13.5 million children in 1998 to 12.1 million in 1999. With a whisper of optimism, we rationalize that this improvement was great. Was it?

Do you ever have trouble focusing on exams or concentrating on homework assignments? How hard do you think it would be to concentrate on your exams, homework, or even your instructor's lectures if your family didn't have enough money to feed you? What if you were in poor health and your family couldn't afford to take you to the doctor or provide the medicine you need? The bitter truth is that in 2000, 12,100,000 children in America were living in poverty and confronted these challenges every day.
If a family of three was living below the poverty line in 2000, they had an income below $13,290 a year. Living in poverty can translate to residing in crowded housing, having your utilities turned off, not owning a phone, refrigerator or car, not having enough food to feed your family, or not having enough medicine to heal your loved ones. And the heart-wrenching statistic is that 4.5 million children lived in families that existed below one-half of the official poverty level. Do we have your attention? Are you gasping in proper reverence? You should be, particularly because in 2000, America was experiencing one of its greatest flashes of economic prosperity. Business was skyrocketing, and people were spending. But was just a minute percentage of Americans benefiting from this new wealth? Ironically, in 2000, the unemployment rate in the U.S. was lower than it had been in years, but the percentage of poor children in working families was soaring. There were many possibilities to explain this phenomenon, but "Some economists (said) that if wages had kept pace with the cost of living since the 1960s, the minimum wage would (have been) between $12 and $14 dollars" (CNN.com). Instead, the minimum wage was $5.15.

Assignment

Go to the US Census Bureau. Find out how many children there were in the US in 2000. Construct a circle graph with the following categories: children who lived above the poverty level and children who lived below the poverty level. Draw separate sections of the circle graph for those children who lived above $6,645 a year (half of the poverty level of $13,290 a year) and those who lived below $6,645 per year. Also, include a section of the graph for those children who lived in the upper 1 % of the income bracket and determine what that income level was. Then tackle the following questions:

a. Do you think there is a positive, negative or no correlation between concentrating in high school and graduating from high school? Is it a causation relationship? Why?
b. Do you think there is a positive, negative or no correlation between concentrating in high school and getting into college? Is it a causation relationship? Why?

c. Do you think there is a positive, negative or no correlation between concentrating in high school and acquiring well-paying jobs? Is it a causation relationship? Why?

d. Do you think there is a positive, negative or no correlation between staying healthy and having access to doctors and medicine? Is it a causation relationship? Why?

e. Do you think there is a positive, negative or no correlation between poverty and crime? Is it a causation relationship? Why?

f. Do you think there is a positive, negative or no correlation between issues that politicians and lawmakers have as a top priority and issues that affect those under 18, who cannot vote? Is it a causation relationship? Why?

For problems 12-17, use the information provided: 5,000 years ago, forests covered nearly 50% of the earth's land surface. Since the advent of humans, forests now cover less than 20%. Forests serve as the lungs of our planet by providing the very oxygen we breathe. The rate of deforestation is increasing, and although extinction is nature's way of selectively re-aligning our living world, this extinction, the most acute since the dinosaurs, is not nature's way. Humans have caused it, by themselves. Source: According to RAN (Rainforest Action Network) and Myers (op. cit.).

In Central and South America, Bolivia, whose land area is 1,098,581 sq km, once had a forest cover of 90,000 sq km and now has a forest cover of 45,000 sq km. Brazil, whose land area is 8,511,960 sq km, once had a forest cover of 2,860,000 sq km and now has a forest cover of 1,800,000 sq km. Central America has a land area of 522,915 sq km, once had a forest cover of 500,000 sq km and now has a forest cover of 55,000 sq km. Colombia has a land area of 1,138,891 sq km, once had a forest cover of 700,000 sq km and now has a forest cover of 180,000 sq km. Ecuador's land area is 270,670 sq km; it once had a forest cover of 132,000 sq km and now has a forest cover of 44,000 sq km. Mexico's land area is 1,967,180 sq km; at one time its forest cover was 400,000 sq km and now its forest cover is 110,000 sq km.

12. For each country, construct a circle graph where the circle represents the land area of the country. Divide each circle into two sectors, one for the country's land area that was once covered by forests and one for the land area that was not covered by forests at that time.

13. For each country, construct a circle graph where the circle represents the land area of the country. Divide each circle into two sectors, one for the country's land area that is currently covered by forests and one for the land area that is currently not covered by forests.

14. For each country, construct a circle graph where the circle represents the original extent of forest cover. Divide the circle into two sectors, one for the existing land area covered by forests and one for the land area lost to deforestation.

15. Construct a circle graph that represents the total land area for Bolivia, Brazil, Central America, Colombia, Ecuador and Mexico. Divide the circle graph into twelve sectors, two for each country, where one sector represents the land area currently covered by forests and the other the land area currently not covered by forests.

16. Construct a circle graph that represents the total land area for Bolivia, Brazil, Central America, Colombia, Ecuador and Mexico. Divide the circle graph into twelve sectors, two for each country, where one sector represents the land area that was once covered by forests and the other represents the land area that was not covered by forests at that time.

17.
After assimilating the information and viewing the circle graphs from problems 12-16, provide an argument, either pro or con, with regard to the following statement: "Deforestation of the rain forests in Central and South America is threatening the local environment as well as the global environment. It should be a hot button issue in today's society."

Measure of Central Tendency

Mean, Median, and Mode

Finding a number that best represents a set of data is important to you right now, because your choice of the "representative" number that best indicates your grade can determine your course grade. Mathematicians say that the number that serves as the spokesperson for the data should reflect the center, or the middle, of the data. Usually we begin by averaging the numbers to find that representative number; we find the sum and then divide by the number of data points. But, if the data consists of exam scores and you earned a 95, 95, 95, 95, 95, and a 45, then your average is found with two calculations, 95 + 95 + 95 + 95 + 95 + 45 = 520 and 520/6 = 86.7. This means the center of your data, or the letter grade that best represents your data, is a B according to your average, and yet you never once earned a B. In fact, you earned only A's, except for one failing grade. You pause, because clearly you earned 5 grades of a 95 and just one grade of a 45. The five A's must count for something, right? The data value that appears the most, 95, is described as the mode, and it is simply another representation of the tendency of the data.

Now that we see there is more than one way to refer to the center of the data, let's begin with perhaps a more realistic example. Suppose we knew you had the following exam scores: 60, 80, 60, 70, 80, 80, 90, and 95. You're thinking perhaps you deserve an A because your last two grades were A's. Or at the very least, you deserve a B.
You begin by finding your average or mean, which is the sum of the scores divided by the number of scores; so you average your grades. First you add the scores: 60 + 80 + 60 + 70 + 80 + 80 + 90 + 95 = 615. You had 8 exams and the average is found by dividing 615 by 8; 615/8 = 76.9, or a C. Uh oh. You change your strategy. You argue that you scored an 80 three times, so you deserve a B. The mode is the data value that occurs most frequently, and your mode is an 80. Does this help your argument?

Well, one more indication of the middle of your data is the middle value when you align the numbers in order, either from top to bottom or from bottom to top. So, we arrange our data as 60, 60, 70, 80, 80, 80, 90, and 95. The data value that occurs in the middle is called the median, like the median of the highway. If there is an odd number of data points, the median will be a number found in the data set. If there is an even number of data points, there will be two numbers in the physical middle of the data, and when this occurs, you need to average the two middle numbers. For us, there are two 80's in the middle of the data, another indication you deserve a B.

Now, perhaps the last possible argument you may use to justify that you are a B student is that your last four exam grades, 80, 80, 90 and 95, showed you were more of a B student than a C student at the end of the course. So, despite having an average or mean of 76.9, your mode and median scores were an 80, and your grades in the second half of the semester were certainly not indicative of a C student. What grade should you get? What grade did you earn?

Real Estate

You meet with a real estate agent and carefully explain to the agent the price range of the homes you are interested in seeing. The agent taps away on their computer and tells you they completely understand what you want, that you are looking for homes in the $130,000 to $160,000 range. You nod your head in agreement.
The agent informs you they have found three neighborhoods where the mean (average) values of the houses are $128,571, $136,786 and $161,429. Each subdivision is small, just like you prefer, with 14 homes. They explain the need for you to sign an exclusive right to buy agreement before they take you out. Impressed with both the immediacy and the detail provided, you quickly sign the agreement. The real estate agent takes you out for the day. After cozying into the front seat of their car, you sit back and enthusiastically await what should prove to be a worthwhile day of house hunting. By the end of the day, you are shaking your head sideways, not nodding it up and down, and you are straining to think of ways to break the stupid exclusive right to buy agreement you signed earlier. What happened? Let's see.

The three subdivisions you saw:

House            Sleepy Brook    Vista View    Meadowlands
1                205,000         130,000       300,000
2                400,000         130,000       400,000
3                500,000         135,000       400,000
4                80,000          140,000       500,000
5                70,000          150,000       65,000
6                60,000          125,000       70,000
7                80,000          125,000       70,000
8                80,000          125,000       65,000
9                100,000         125,000       65,000
10               100,000         125,000       65,000
11               60,000          125,000       65,000
12               60,000          125,000       65,000
13               60,000          120,000       65,000
14               60,000          120,000       65,000
Average Value    136,786         128,571       161,429

Which subdivision was closest to your liking? Well, clearly Vista View is the only subdivision that even had homes in your price range, with 5 of the 14 homes within your price range. But, this was the least likely subdivision because its average home value was a little below your range. And visiting the other two subdivisions was useless; they had no homes in your range. The agent never checked the values of the homes in the three subdivisions, they only checked the average value of the homes. To cut the agent some slack, checking 3 subdivisions with 14 homes each would have been a lengthy endeavor, because each home would have needed to be accessed individually on the computer screen.
Remember, the agent wanted to impress you with their quick research. Still, the oversight was caused because you did not have enough information about the data. Measures of central tendency inform us as to the behavior of the middle of the data, without the need to see every tedious piece of data. Since pulling up each home would have been too time consuming (42 homes), what other pieces of information could have been pertinent so that you would have known that only Vista View was worth visiting? Range.

The mean or average values for these sets of data are:

For Sleepy Brook: [205,000 + 400,000 + 500,000 + 3(80,000) + 70,000 + 5(60,000) + 2(100,000)] / 14 = $136,786

For Vista View: [2(130,000) + 135,000 + 140,000 + 150,000 + 7(125,000) + 2(120,000)] / 14 = $128,571

For Meadowlands: [300,000 + 2(400,000) + 500,000 + 8(65,000) + 2(70,000)] / 14 = $161,429

But, clearly, this was not enough information about the middle of the data. What else could have helped us? Well, in the Meadowlands subdivision, 8 of the 14 homes were worth $65,000, one-half of the lower limit of our price range. This would have been helpful to know. The mode is the piece of data that shows up the most frequently. So, in the Meadowlands, the mode is 65,000, occurring 8 times. For Vista View, the mode is 125,000, occurring 7 times. This mode is close to our price range. And Sleepy Brook? Its mode was much lower, 60,000, occurring 5 times. What other tendency for the data would have been helpful? How many homes are not in our price range would be too easy of an answer, huh.
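The means and modes quoted above can be double-checked with a few lines of Python. This is only a sketch: the names `subdivisions` and `summarize` are invented for illustration, the home values are copied from the table of the three subdivisions, and the price range of $130,000 to $160,000 is the one from the story.

```python
from statistics import mean, mode

# Home values copied from the table of the three subdivisions (houses 1-14).
subdivisions = {
    "Sleepy Brook": [205000, 400000, 500000, 80000, 70000, 60000, 80000,
                     80000, 100000, 100000, 60000, 60000, 60000, 60000],
    "Vista View": [130000, 130000, 135000, 140000, 150000, 125000, 125000,
                   125000, 125000, 125000, 125000, 125000, 120000, 120000],
    "Meadowlands": [300000, 400000, 400000, 500000, 65000, 70000, 70000,
                    65000, 65000, 65000, 65000, 65000, 65000, 65000],
}


def summarize(values, low=130000, high=160000):
    """Return the rounded mean, the mode, how often the mode occurs,
    and how many homes fall inside the buyer's price range."""
    most_common = mode(values)
    return {
        "mean": round(mean(values)),
        "mode": most_common,
        "mode_count": values.count(most_common),
        "in_range": sum(low <= v <= high for v in values),
    }


summary = {name: summarize(values) for name, values in subdivisions.items()}
for name, stats in summary.items():
    print(name, stats)
```

The `in_range` count is the "too easy" answer mentioned above: it shows directly that only Vista View has any homes between $130,000 and $160,000, something the three means alone do not reveal.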
If we order our data, then the median value of these homes is also readily available:

House     Sleepy Brook    Vista View    Meadowlands
1         60,000          120,000       65,000
2         60,000          120,000       65,000
3         60,000          125,000       65,000
4         60,000          125,000       65,000
5         60,000          125,000       65,000
6         70,000          125,000       65,000
7         80,000          125,000       65,000
8         80,000          125,000       65,000
9         80,000          125,000       70,000
10        100,000         130,000       70,000
11        100,000         130,000       300,000
12        205,000         135,000       400,000
13        400,000         140,000       400,000
14        500,000         150,000       500,000
Median    80,000          125,000       65,000

How would knowing the median have been helpful? Well, if we knew the median for Meadowlands, then we would have known that one-half of the homes in the subdivision, that is, 7 of the homes, were $65,000 or below. To keep an average of $161,429, many of the other homes would have needed to be too expensive for us, leaving at best the possibility of a few homes in our range. It turned out there were no homes in our range. Which leads us to the dispersion of the data. Dispersion means spreading, scattering or distribution. We can address these different words with different measures of central tendency. The range is the difference between the largest and the smallest data point. For Sleepy Brook, $500,000 - $60,000 = $440,000, or more realistically, there is a large difference between the cheapest and the most expensive home in the subdivision. For Vista View, $150,000 - $120,000 = $30,000, and this tells us all the homes are at least close to our price range. Meadowlands has the problem Sleepy Brook had; the range is $500,000 - $65,000 = $435,000. To measure the scattering and the distribution of even larger samples of data, we will examine standard deviations a little later. But first, let's look at mean, median, mode and range a little longer.

Problem One

Below are the Traffic Fatalities per 100 million (10^8) vehicle miles in 2001. Source: U.S. National Highway Traffic Safety Administration. Rank the states and the District of Columbia in ascending order.
Then find the mode, median, mean and range. Discuss the relevance of the numbers. This means if any two correspond closely, look at the data and tell why. If any state is far from the middle of the data, call it an outlier.

Alabama                 1.75
Alaska                  1.80
Arizona                 2.06
Arkansas                2.08
California              1.27
Colorado                1.71
Connecticut             1.01
Delaware                1.58
District of Columbia    1.81
Florida                 1.93
Georgia                 1.50
Hawaii                  1.61
Idaho                   1.84
Illinois                1.37
Indiana                 1.27
Iowa                    1.49
Kansas                  1.75
Kentucky                1.83
Louisiana               2.32
Maine                   1.33
Maryland                1.27
Massachusetts           0.90
Michigan                1.34
Minnesota               1.06
Mississippi             2.18
Missouri                1.62
Montana                 2.30
Nebraska                1.36
Nevada                  1.71
New Hampshire           1.15
New Jersey              1.09
New Mexico              1.99
New York                1.18
North Carolina          1.67
North Dakota            1.45
Ohio                    1.29
Oklahoma                1.55
Oregon                  1.42
Pennsylvania            1.49
Rhode Island            1.01
South Carolina          2.27
South Dakota            2.00
Tennessee               1.85
Texas                   1.72
Utah                    1.25
Vermont                 0.96
Virginia                1.27
Washington              1.21
West Virginia           1.91
Wisconsin               1.33
Wyoming                 2.16

Solution

State                   Rate    Rank
Massachusetts           0.90    50
Vermont                 0.96    49
Connecticut             1.01    47
Rhode Island            1.01    47
Minnesota               1.06    46
New Jersey              1.09    45
New Hampshire           1.15    44
New York                1.18    43
Washington              1.21    42
Utah                    1.25    41
California              1.27    37
Indiana                 1.27    37
Maryland                1.27    37
Virginia                1.27    37
Ohio                    1.29    36
Maine                   1.33    34
Wisconsin               1.33    34
Michigan                1.34    33
Nebraska                1.36    32
Illinois                1.37    31
Oregon                  1.42    30
North Dakota            1.45    29
Iowa                    1.49    27
Pennsylvania            1.49    27
Georgia                 1.50    26
Oklahoma                1.55    25
Delaware                1.58    24
Hawaii                  1.61    23
Missouri                1.62    22
North Carolina          1.67    21
Colorado                1.71    19
Nevada                  1.71    19
Texas                   1.72    18
Alabama                 1.75    16
Kansas                  1.75    16
Alaska                  1.80    15
District of Columbia    1.81    (X)
Kentucky                1.83    14
Idaho                   1.84    13
Tennessee               1.85    12
West Virginia           1.91    11
Florida                 1.93    10
New Mexico              1.99    9
South Dakota            2.00    8
Arizona                 2.06    7
Arkansas                2.08    6
Wyoming                 2.16    5
Mississippi             2.18    4
South Carolina          2.27    3
Montana                 2.30    2
Louisiana               2.32    1

Mode      1.27
Median    1.55
Mean      1.57

The mean and median are close; this means the number in the middle of the data and the
average are close together. The number of states that rank above and below the average and the number of states that rank above and below the middle state, Oklahoma, are close to the same. So, the data is not top or bottom heavy. Yet, this doesn't mean the data is dispersed evenly. Why?

Exercise Set

For problems 1-6 below, find the mean, median and mode for the data.

1. 1, 3, 4, 4, 4, 5, 5, 6
2. 3, 3, 4, 4, 4, 5, 5, 6
3. 3, 3, 3, 4, 4, 5, 5, 6
4. 3, 3, 3, 4, 5, 5, 5, 6
5. 1, 1, 1, 1, 2, 2, 6, 6
6. 1, 1, 2, 2, 6, 6, 6, 6

7. What is the median time it took for the students to write the exam?

Student ID Number    Time to Take Exam
4025                 1:25
1026                 1:09
8790                 0:59
1029                 0:45
2943                 1:01
2020                 1:12
2084                 1:25
5091                 1:31
7812                 0:49
5103                 2:00
6092                 1:42

8. Below are the years and the percent of children under the age of 4 in a city who attended day care.

Year    Percent
1970    15
1972    17
1974    15
1976    16
1978    18
1980    17
1982    21
1984    31
1986    12
1988    15
1990    16
1992    17
1994    15
1996    12

What is the mode for this set of data?

9. From the US Census Bureau, 1999, below are the state rankings of the percent of elderly persons, 65 years and over, who live below the poverty level. Rank the states and the District of Columbia in ascending order. Then find the mode, median, mean and range. Discuss the relevance of the numbers. This means if any two correspond closely, look at the data and tell why. If any state is far from the middle of the data, call it an outlier.

State                   Percent
Alabama                 15.5
Alaska                  6.8
Arizona                 8.4
Arkansas                13.8
California              8.1
Colorado                7.4
Connecticut             7.0
Delaware                7.9
District of Columbia    16.4
Florida                 9.1
Georgia                 13.5
Hawaii                  7.4
Idaho                   8.3
Illinois                8.3
Indiana                 7.7
Iowa                    7.7
Kansas                  8.1
Kentucky                14.2
Louisiana               16.7
Maine                   10.2
Maryland                8.5
Massachusetts           8.9
Michigan                8.2
Minnesota               8.2
Mississippi             18.8
Missouri                9.9
Montana                 9.1
Nebraska                8.0
Nevada                  7.1
New Hampshire           7.2
New Jersey              7.8
New Mexico              12.8
New York                11.3
North Carolina          13.2
North Dakota            11.1
Ohio                    8.1
Oklahoma                11.1
Oregon                  7.6
Pennsylvania            9.1
Rhode Island            10.6
South Carolina          13.9
South Dakota            11.1
Tennessee               13.5
Texas                   12.8
Utah                    5.8
Vermont                 8.5
Virginia                9.5
Washington              7.5
West Virginia           11.9
Wisconsin               7.4
Wyoming                 8.9

Standard Deviation and the Normal Distribution

According to a study done by the National Center for Health Statistics, Mean Body Weight, Height and Body Mass Index, United States 1960-2002, American men (ages 20-74) were 25 pounds heavier in 2002 than they were some 42 years earlier in 1960. In 2002 the average American male weighed 191 pounds, up from his 1960 counterpart who weighed 166 pounds. American women from the same age group followed the same trend: the average American woman weighed 164 pounds in 2002, up 24 pounds from the average American woman from 1960, who weighed 140 pounds. This study caused quite a stir, as nutritionists and diet doctors clamored together to seek solutions. And as you can imagine, the dangers of obesity were revisited when this study was broadcast.

Not only had the average American's weight increased, but Americans grew taller as well. The average male height increased over the 42 year span, from 5' 8" in 1960 to 5' 9 ½" in 2002. And as expected, the average height for the American woman also increased, from 5' 3" to 5' 4". The study was done on a smaller representation of the true population; it was performed on thousands of people, and in reality, the population of American men and women totals in the hundreds of millions. Since these numbers are so large, we assume the data to be normally distributed around the mean, or average.
A normal distribution for a set of data means that there is more data close to the "average," and less data farther from the average, until finally relatively few data points tend to one extreme or the other. The data is symmetrically distributed away from the average. This is common sense, or mathematical intuition. Humans are, after all, close to being one and the same. Let's say you are writing a story about the height distribution of the American male in 2002 because you are trying to correlate it to ethnicity, diet or genes. First you take the population, in this case, those people who participated in the study, and tally up the number of people for each given height. Like most data, if the sample or population is large enough, the heights for the population turn out to be normally distributed. This means most people will be of average height or close to average height. In other words, the average height also will be the height to occur most frequently in our population and the height found in the middle of the data when it is ordered. Thus, the mean will be the mode and the median too.

Next, if a population is normally distributed, and you plot each height in increasing order, the number of men for a given height is symmetrically distributed around the average height. In other words, there will be more people close to average height than far from the average height. In 2002, the average height of the American male was 5' 9 ½". For our normally distributed society, which we aptly call the American male, the next most common heights occur from 5' 9" through 5' 10", both heights ½ inch away from the mean height. Next, the most common heights would be expected to occur between 5' 8" and 5' 11". And so on. We expect fewer and fewer men to have a designated height as we move further from the average height. Intuitively, this fits our preconceived notion of our society; we expect to see fewer men that are 6' 5" than men that are 5' 11", for instance.
And similarly, this means there will be fewer men that are five feet tall than 5' 7" and, on the other side of the mirror, fewer that are 6 ½ feet than 5' 11". Because height is a normally distributed trait, the heights are distributed symmetrically around the average height.

[Figure: bar graph of the number of adult American males at each height]

So, we estimated the number of adult American males for each given height and then grouped the heights into small intervals. We then drew a bar graph, as shown to the left. The x-axis represents a given height; the y-axis represents the number of adult males of that given height. Notice, the graph is centered at the average height of the adult American male.

[Figure: line graph of the same normally distributed data]

We then redrew the bar graph as a line graph from the normally distributed data, as shown on the left.

Often, we draw the normally distributed, bell-shaped curve free hand. Our approximation of a bell-shaped curve may look like the graph to the left. Note: The x-axis (the horizontal one) is the value in question, the population's height for example. The y-axis (the vertical one) is the number of data points for each value on the x-axis, the number of people that are that certain height.

The standard deviation is a measure of how widely values are dispersed from the mean (average value). For populations where the data points are tightly bunched together, the bell-shaped curve is steep and the standard deviation is small. For populations where the data points are spread further apart from the average, the bell curve is flatter and the standard deviation is larger.

68-95-99.7

To refine our understanding of a standard deviation, we turn our attention to a bell-shaped graph. In a moment we will show you the calculation for the standard deviation. Right now, we want to present a conceptual understanding for the term 'standard deviation.' Recall, in 2002, the American male had a mean height of 5' 9 ½". The standard deviation is 2 3/8".
For a normal distribution, one standard deviation (in red above) away from the mean in both directions on the horizontal axis will account for approximately 68 % of the population. There are two heights that are 2 3/8 inches from 5' 9 ½": the smaller, 5' 7 1/8" (5' 9 ½" - 2 3/8"), and the larger, 5' 11 7/8" (5' 9 ½" + 2 3/8"). Thus, 68 % of the American men in 2002 stood between 5' 7 1/8" and 5' 11 7/8".

All data found within two standard deviations (in red and green above) from the mean will account for roughly 95 % of the normally distributed population, or the adult American male population. The two heights two standard deviations away from the mean are found with two predictable calculations. We first subtract two standard deviations from the mean, giving us 5' 9 ½" - 2 3/8" - 2 3/8" = 5' 4 ¾". We then add two standard deviations to the mean, giving 5' 9 ½" + 2 3/8" + 2 3/8" = 6' 2 ¼". So, 95 % of American men in 2002 were somewhere between 5' 4 ¾" and 6' 2 ¼". Recall, the heights for this 95 % of the population are symmetrically distributed about the mean.

Data found within three standard deviations of the mean (the red, green and blue areas) account for about 99.7 % of normally distributed populations. So, in 2002, 99.7 % of the American men were between 5' 2 3/8" (5' 9 ½" - 2 3/8" - 2 3/8" - 2 3/8") and 6' 4 5/8" (5' 9 ½" + 2 3/8" + 2 3/8" + 2 3/8"). From a different perspective, one could infer that in 2002, those American men who were more than three standard deviations away from the mean, either shorter than 5' 2 3/8" or taller than 6' 4 5/8", represented 0.3 % of the adult American male population; they were considered short or tall by our population's standards.

If a curve were flatter, the standard deviation would have to be larger in order to account for those 68 percent, and if the curve were steeper, the standard deviation would have to be smaller to account for 68 percent of the population.
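The three height intervals above follow one pattern: mean minus k standard deviations to mean plus k standard deviations, for k = 1, 2, 3. A short Python sketch of that arithmetic is given here; the function names are made up for illustration, and the heights are converted to inches (5' 9 ½" = 69.5 inches, 2 3/8" = 2.375 inches) so they can be added and subtracted directly.

```python
# Mean and standard deviation of adult male height in 2002, in inches,
# as quoted in the text: 5' 9 1/2" and 2 3/8".
MEAN_IN = 69.5
SD_IN = 2.375

# Approximate share of a normal population within k standard deviations.
COVERAGE = {1: 68.0, 2: 95.0, 3: 99.7}


def empirical_intervals(mean, sd):
    """Return {k: (low, high)} for k = 1, 2, 3 standard deviations."""
    return {k: (mean - k * sd, mean + k * sd) for k in COVERAGE}


def feet_inches(total_inches):
    """Format a height in inches as feet and (decimal) inches."""
    feet, inches = divmod(total_inches, 12)
    return f"{int(feet)}' {inches:g}\""


intervals = empirical_intervals(MEAN_IN, SD_IN)
for k, (low, high) in intervals.items():
    print(f"about {COVERAGE[k]} % between {feet_inches(low)} and {feet_inches(high)}")
```

The inches come out as decimals (7.125 rather than 7 1/8), but they match the fractions in the text: 7 1/8, 11 7/8, 4 ¾, 2 ¼, 2 3/8 and 4 5/8.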
Standard deviation tells you "how spread out the data points in the population are from the mean." Why is this useful? Well, if you are comparing test scores for different schools, the standard deviation will tell you how diverse the test scores are for each school. Let's say Washington High School has a higher mean test score than Adams High School for the mathematics portion of the statewide AIMS test, administered in the state of Arizona to measure the students' understanding of high school mathematics. Our first reaction might be to deduce that the students at Washington are either smarter or better educated by the teachers. You analyze the data further. The standard deviation, you find out, is larger at Washington than at Adams. This means that at Washington there are relatively more kids scoring toward one extreme or the other. By asking a few follow-up questions, you might find that Washington's average was higher because the school district sent all of the gifted education kids to Washington. Or perhaps Adams's scores were dragged down, and thus appeared bunched together, because of all of the students who recently have been "mainstreamed" from special education classes. Perhaps the gifted classes were sent out of district. In this way, looking at the standard deviation can help point you in the right direction when asking why data is the way it is.

Problem One

You are trying to decide which teacher's class to enroll in for Mathematics. You go to a website that claims to have tracked the three teachers' success rates over the past five years. The final grades for Mr. Allen's students had a mean score of 76 with a standard deviation of 5, while Mrs. Bennett's students had a mean score of 74 with a standard deviation of 3, and Mrs. Clyde's students had a mean score of 79 for the final grade, with a standard deviation of 7. Whose class would you enroll in? How would you interpret the data on the web site?
Rank the teachers from first to third, so that if one's section is full, you would know whose class to register for next.

Solution

We must quantify the exam scores to interpret the data. For Mr. Allen's classes, 68 % of the students earned a final grade that was within 5 points of 76, so 68 % of the students earned between 71 and 81. About 95 % scored within two standard deviations of the mean, so 95 % of the students earned a final grade between 66 and 86. Finally, 99.7 % of the students earned a final grade between 61 and 91. Continuing with this thought process, Mrs. Bennett's students have a lower final grade average, 74. But, 68 % of her students earned a final grade between 71 and 77, 95 % earned a final grade between 68 and 80, and 99.7 % earned a final grade between 65 and 83. Mrs. Clyde's students earned the highest average, but she had the 68 %, 95 % and 99.7 % spread farther apart: 72-86, 65-93 and 58-100 respectively.

A table allows us to compare the success rates of the three teachers:

          Mr. A    Mrs. B    Mrs. C
68 %      71-81    71-77     72-86
95 %      66-86    68-80     65-93
99.7 %    61-91    65-83     58-100

So, to answer the question of which teacher you should take: if you are a good student, you have a better chance of securing an A with Mrs. Clyde first, Mr. Allen second and Mrs. Bennett third. If you struggle at math, you probably would choose Mrs. Bennett first because 99.7 % of her students earn above a 65. Mr. Allen would probably be your second choice, Mrs. Clyde your third choice.

Problem Two

In Typical City, USA, the number of hours a teen watches TV has become a concern for the town's elders. They research this and find the teens watch an average of 4 ½ hours of TV a day, with a standard deviation of ½ hour. What percent of the teens watch
a) more than 5 hours of TV per day?
b) more than 5 ½ hours of TV per day?
c) less than 5 ½ hours of TV per day?
d) less than 4 hours of TV per day?
e) less than 3 ½ hours of TV per day?
Solution

a) Since 5 hours is 1 standard deviation above the mean (4 ½ plus ½), then 68 % of the teens are distributed within 1 standard deviation, or between 4 and 5 hours. So, half of the teens will watch from 0 to 4 ½ hours, and another 34 % (half of the 68 %) will watch between 4 ½ and 5 hours. So, 84 % will watch less than 5 hours, thus 100 % - 84 % or 16 % will watch more than 5 hours per day.

b) Since 5 ½ hours is 2 standard deviations above the mean (4 ½ plus ½ plus ½), then 95 % of the teens are distributed within 2 standard deviations, or between 3 ½ and 5 ½ hours. So, half of the teens will watch from 0 to 4 ½ hours, and another 47 ½ % (half of the 95 %) will watch between 4 ½ and 5 ½ hours. So, 97 ½ % will watch less than 5 ½ hours, thus 100 % - 97 ½ % or 2 ½ % will watch more than 5 ½ hours per day.

c) From the above paragraph, we have 100 % - 2 ½ % = 97 ½ % of the teens will watch less than 5 ½ hours of TV per day.

d) Since 4 hours is 1 standard deviation below the mean (4 ½ minus ½), then 68 % of the teens are distributed within 1 standard deviation, or between 4 and 5 hours. So, half of the teens will watch from 0 to 4 ½ hours, and 34 % (half of the 68 %) will watch between 4 and 4 ½ hours. So, 50 % - 34 % or 16 % will watch less than 4 hours per day.

e) Since 3 ½ hours is 2 standard deviations below the mean (4 ½ minus ½ minus ½), then 95 % of the teens are distributed within 2 standard deviations, or between 3 ½ and 5 ½ hours. So, half of the teens will watch from 0 to 4 ½ hours, but 47 ½ % (half of the 95 %) will watch between 3 ½ and 4 ½ hours per day. So, 50 % - 47 ½ % or 2 ½ % of the teens will watch less than 3 ½ hours per day.

Standard score or z-score

If one is analyzing data within 1, 2 or 3 standard deviations from the mean, then you can expect 68 %, 95 % or 99.7 % respectively of the population to lie within these bounds. What happens if we know that 90 % of the data lies within two scores?
What would the standard deviation look like? Since data is rarely, if ever, presented to us with a mean of zero and a standard deviation of 1, we convert each data value to a z-score and use the standard normal curve to analyze any normally distributed data. A data value with a z-score of 0 is equal to the mean. A data value with a z-score of -1.3 is 1.3 standard deviations below the mean, and so forth. If you know the standard deviation and mean of your data, z-scores enable you to determine the percent of data between any two values in the range of your data. The formula used to find each z-score is

$$ z = \frac{\text{data value} - \text{mean}}{\text{standard deviation}} $$

Below is a table for the z-scores of the standard normal distribution. Each entry is the fraction of the data that lies between the mean and the given z-score. The row gives the z-score to one decimal place and the column gives its second decimal place.

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.2   0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
0.3   0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.4   0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
0.5   0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
0.6   0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
0.8   0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
0.9   0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.0   0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.1   0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
1.2   0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
1.3   0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
1.4   0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
1.5   0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
1.6   0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
1.7   0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
1.8   0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
1.9   0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767
2.0   0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
2.1   0.4821  0.4826  0.4830  0.4834  0.4838  0.4842  0.4846  0.4850  0.4854  0.4857
2.2   0.4861  0.4864  0.4868  0.4871  0.4875  0.4878  0.4881  0.4884  0.4887  0.4890
2.3   0.4893  0.4896  0.4898  0.4901  0.4904  0.4906  0.4909  0.4911  0.4913  0.4916
2.4   0.4918  0.4920  0.4922  0.4925  0.4927  0.4929  0.4931  0.4932  0.4934  0.4936
2.5   0.4938  0.4940  0.4941  0.4943  0.4945  0.4946  0.4948  0.4949  0.4951  0.4952
2.6   0.4953  0.4955  0.4956  0.4957  0.4959  0.4960  0.4961  0.4962  0.4963  0.4964
2.7   0.4965  0.4966  0.4967  0.4968  0.4969  0.4970  0.4971  0.4972  0.4973  0.4974
2.8   0.4974  0.4975  0.4976  0.4977  0.4977  0.4978  0.4979  0.4979  0.4980  0.4981
2.9   0.4981  0.4982  0.4982  0.4983  0.4984  0.4984  0.4985  0.4985  0.4986  0.4986
3.0   0.4987

Problem Three

We know that in 2002, the average height of the American male was 5' 9 ½" and the standard deviation was 2 3/8". What percent of American males in 2002 were
a) taller than 6'?
b) shorter than 5' 7 ½"?
c) between 5' 10" and 6' 1"?

Solution

a) First, we find the z-score associated with 6'. We have

$$ z = \frac{\text{data value} - \text{mean}}{\text{standard deviation}} = \frac{6' - 5'\,9\tfrac{1}{2}''}{2\tfrac{3}{8}''} = \frac{2.5}{2.375} \approx 1.05 $$

Notice the positive 1.05 corresponds to the fact that 6' is above the mean. Now, we glance at the table and see that a z-score of 1.05 has a value of 0.3531. This means that in 2002, 35.31 % of adult American males were between the average height of 5' 9 ½" and 6'. So, the percent of adult American males taller than 6' was 100 % - 50 % - 35.31 % or 14.69 %.

b) First, we find the z-score associated with 5' 7 ½". We have

$$ z = \frac{5'\,7.5'' - 5'\,9.5''}{2\tfrac{3}{8}''} = \frac{-2.0}{2.375} \approx -0.84 $$

The negative value corresponds to the fact that 5' 7 ½" is below the mean. Now, we glance at the table and see that a z-score of 0.84 has a value of 0.2995. This means that in 2002, 29.95 % of adult American males were between 5' 7 ½" and the average American male height of 5' 9 ½". The percent of American males shorter than 5' 7 ½" was 100 % - 50 % - 29.95 % or 20.05 %.

c) Calculating each z-score, we have

$$ z = \frac{5'\,10'' - 5'\,9.5''}{2\tfrac{3}{8}''} = \frac{0.5}{2.375} \approx 0.21 \quad\text{and}\quad z = \frac{6'\,1'' - 5'\,9.5''}{2\tfrac{3}{8}''} = \frac{3.5}{2.375} \approx 1.47 $$

Now, we use the table and see that the z-scores of 0.21 and 1.47 have the values 0.0832 and 0.4292 respectively. This means that in 2002, 8.32 % of American males were between the average American male height of 5' 9 ½" and 5' 10", and 42.92 % of American males were between the average height and 6' 1". So, the percent of adult American males between 5' 10" and 6' 1" would be 42.92 % - 8.32 % or 34.6 %.

Exercise Set

1. Two AP calculus classes were taught by Mr. Venette and Ms. Harper. The final grades for the course during the past five years indicated that Mr. Venette's classes had a mean of 80 and a standard deviation of 4, while Ms. Harper's classes had a mean of 82, but a standard deviation of 2.5. Interpret the results in terms of the 68-95-99.7 percentiles. Then give possible reasons for the differences you observe.

For questions 2 to 10, use the following data: The mean income in a city is $ 51,000, and the standard deviation is $ 4000. Find the percentage of people whose income is
2. $ 59,000 or above
3. $ 47,000 or below
4. between $ 43,000 and $ 55,000
5. $ 55,000 or below
6. between $ 39,000 and $ 55,400
7. $ 39,000 or below
8. $ 40,000 or above
9. $ 50,000 or below
10. between $ 50,000 and $ 60,000

For problems 11 to 16, use the following information.
In Japan in 2002, studies pertaining to the heights of adults separated by gender vary slightly, but a rough estimation of data compiled from various studies is as follows: For the adult male population, 97.5 % of the males were found to be between 5' 0 5/8" and 5' 9 7/8", with the average at 5' 5 1/8". For the adult female population, 97.5 % of the females were found to be between 4' 8 1/2" and 5' 4", with the average at 5' 0 1/4".
11. Find the standard deviation for both the males and the females and interpret both in a complete sentence.
12. Find the percent of Japanese males shorter than 5' 9 7/8".
13. Find the height of the Japanese female who is taller than 99.85 % of the population.
14. Find the height of the Japanese males who are shorter than 84 % of the population.
15. Find the percent of Japanese males shorter than 5' 2".
16. Find the percent of Japanese women taller than 5' 2".

For problems 17 to 24, use the following information. In the United States in 2002, studies pertaining to the weights of adults separated by gender vary slightly, but a rough estimation of data compiled from various studies is as follows: For the adult male population, 95 % of the males were found to be between 168 lbs and 214 lbs, with the average at 191 lbs. For the adult female population, 95 % of the females were found to be between 140 lbs and 188 lbs, with the average at 164 lbs.
17. Find the standard deviation for both the males and the females and interpret both in a complete sentence.
18. Find the percent of American females who weigh more than 145 pounds.
19. Find the weight of the American female who weighs less than 34 % of the population.
20. Find the weight of the American male who weighs more than 97.5 % of the population.
21. Find the percent of American females who weigh more than 150 pounds.
22. Find the percent of American males who weigh more than 200 pounds.
23. Find the percent of American females who weigh less than the weight of the Average Male in 2002.
24.
Find the percent of American males who weigh less than the weight of the Average Female in 2002.
25. On your route home, you have a choice of two bridges, each of the same length and with the same number of lanes. At the time you cross, Bridge One has an average of 420 cars on it with a standard deviation of 100 cars, and Bridge Two has an average of 460 cars on it with a standard deviation of 20 cars. Which bridge would you decide to cross? Would it matter if you were in a hurry?

For problems 26 to 32, use this fact: According to Robert Dvorchak of the Pittsburgh Post-Gazette, the average length of a National League baseball game in 2004 was 2:47:20. Compare this to the game's own past: in 1967 the average game lasted 2:30; in the 1940's the average game, according to the Sporting News, lasted exactly 2:00; and a century ago the average game lasted a mere 1:30. If we estimate a standard deviation of 20 minutes, what percent of the games
26. in 2004 lasted longer than two hours?
27. in 1967 lasted longer than two hours?
28. in 1940 lasted longer than two hours?
29. a century ago lasted longer than two hours?
30. in 2004 lasted longer than three hours?
31. in 1967 lasted longer than three hours?
32. in 2004 were shorter than 3 ½ hours?

Standard Deviation

A standard deviation, then, is loosely speaking the typical distance of the data from the mean. For each data value, we subtract the mean, and the result is zero, positive or negative. When we add these differences, the positive differences cancel the negative differences, and the sum is always zero. Since we are looking for a typical distance from the mean, this calculation would prove worthless. Try it and see for yourself. Instead, we square each difference, because these squared values are never negative and so cannot cancel each other out. Now, we add them all up. We then divide by the number of terms. Almost. Actually, we divide by n-1, because our data is usually a sample drawn from a larger population, and a sample tends to understate the spread of the whole population; dividing by n-1 instead of n corrects for this. We then take the square root of this quotient to undo the effect of squaring, giving us a measurable typical distance from the mean. The individual differences keep their signs: positive values lie above the mean, negative values lie below the mean.

A practical way to compute the standard deviation is to use a spreadsheet. In Microsoft Excel, type the following formula into the cell where you want the standard deviation to appear, using the "unbiased," or "n-1" method: =STDEV(A1:Z99) (substitute the cell name of the first value in your dataset for A1, and the cell name of the last value for Z99).

To calculate the standard deviation by hand, let $x$ be one value in your set of data and let $\bar{x}$ be the mean of all the values in your set of data. Let $n$ be the number of data points. For each value $x$, subtract the mean, that is, find $x - \bar{x}$, then square the result, $(x - \bar{x})^2$. Sum up all those squared values and then divide the sum by $(n-1)$. Finally, there's one more step: take the square root of this ratio. That's the standard deviation of your set of data, written as

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} $$

Introduction or who deserves the B?

Let's develop an intuitive feel for measures of central tendency. Below are five students and their seven grades for a course. The bottom row confirms that all five students have a 79 average.

        Allan   Bill   Cindy   Deanna   Eve
         74      73     59      69      68
         76      75     62      73      78
         77      77     78      78      79
         80      79     79      79      79
         81      83     80      82      83
         82      83     96      83      83
         83      83     99      89      83
Mean     79      79     79      79      79

All five students want a B. Allan argues that his middle grade, his median, is a B. Bill argues that his mode, the grade that occurs most frequently, is a B.
Cindy argues that she has shown great potential: two of her grades are solid A's. Deanna makes the same argument, but her grades are not quite as erratic as Cindy's; they are not as dispersed away from the average as Cindy's grades. Eve, like Bill, also argues that her mode is a B. And although Eve and Bill have the same mean, median and mode, Eve is the one with the 68. Uh oh….

Standard deviations measure just this: how far the data values deviate from the mean. In other words, a standard deviation is a numerical value that tells the reader how spread out the data is. Allan's grades are clumped together, so he should have a small standard deviation. Cindy's grade history is more erratic and her grades are spread farther apart, so she should have a larger standard deviation. Let's compare the standard deviations for three of the students, Bill, Cindy and Eve. We will find how much each data value deviates from the mean. But notice, if we try to sum up these deviations from the mean without squaring them, the sum is zero. Why?

So, first we subtract the mean from each data point (deviation). After we square the differences (deviation squared), we sum the squares of the differences and divide this sum by a number that is one less than the number of data points. Lastly, we take the square root of this ratio and we have the standard deviation.

Bill                  73    75    77    79    83    83    83    mean 79
deviation             -6    -4    -2     0     4     4     4    sum 0
deviation squared     36    16     4     0    16    16    16    sum 104

Cindy                 59    62    78    79    80    96    99    mean 79
deviation            -20   -17    -1     0     1    17    20    sum 0
deviation squared    400   289     1     0     1   289   400    sum 1380

Eve                   68    78    79    79    83    83    83    mean 79
deviation            -11    -1     0     0     4     4     4    sum 0
deviation squared    121     1     0     0    16    16    16    sum 170

For Bill, his standard deviation is $\sqrt{104/6} = \sqrt{17.33333} \approx 4.16$. For Cindy, her standard deviation is $\sqrt{1380/6} = \sqrt{230} \approx 15.17$. For Eve, her standard deviation is $\sqrt{170/6} = \sqrt{28.33333} \approx 5.32$. As the standard deviation is a measure of dispersion, the larger the standard deviation, the more dispersed the data.
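The three standard deviations above can be checked with a few lines of Python. This is only a sketch: the helper name sample_sd is made up for illustration, and it follows the same steps as the text (deviation, deviation squared, divide by n-1, square root). Allan and Deanna are included so that all five students can be compared.

```python
from math import sqrt
from statistics import mean, stdev

# Seven grades per student, copied from the grade table in the text.
grades = {
    "Allan":  [74, 76, 77, 80, 81, 82, 83],
    "Bill":   [73, 75, 77, 79, 83, 83, 83],
    "Cindy":  [59, 62, 78, 79, 80, 96, 99],
    "Deanna": [69, 73, 78, 79, 82, 83, 89],
    "Eve":    [68, 78, 79, 79, 83, 83, 83],
}

def sample_sd(values):
    """Standard deviation with the n - 1 divisor (hypothetical helper)."""
    m = mean(values)
    deviations = [v - m for v in values]      # these always sum to zero
    squared = [d ** 2 for d in deviations]    # squaring removes the signs
    return sqrt(sum(squared) / (len(values) - 1))

for name, scores in grades.items():
    sd = sample_sd(scores)
    # statistics.stdev also divides by n - 1, so the two should agree.
    assert abs(sd - stdev(scores)) < 1e-9
    print(f"{name:<7} mean = {mean(scores):.0f}   sd = {sd:.2f}")
```

The smaller the printed value, the more tightly that student's grades cluster around 79.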
We now have more information about each of the students' grades at our disposal; we know the mean, median, mode and the standard deviation. You decide: who deserves the B, and who does not?

Problem One

Baseball, said to be America's favorite pastime, is also fertile ground for honing basic statistical skills. From games won or lost to home and away records, from records against divisional foes to streaks, from batting averages to home runs, numbers abound. For this example, we will find the standard deviation and then use the z-score formula to determine how far each team's record is from the mean. We will see data in context, as it would appear in your morning newspaper. Who is the best and the worst in the American League on Labor Day, 2004? A standard deviation is one way to explore this age-old baseball question.

Updated: 9/5/2004 3:37:06 PM cnn.com

AMERICAN LEAGUE EAST
TEAM            WON  LOST  PCT   GB      HOME   ROAD   EAST   CENT   WEST   STREAK
NY YANKEES      83   52    .615  -       45-21  38-31  36-19  15-11  22-14  LOST 2
BOSTON          80   54    .597  2 1/2   48-22  32-32  36-20  19-13  16-12  LOST 1
BALTIMORE       63   71    .470  19 1/2  29-35  34-36  28-29  15-12  15-17  WON 6
TAMPA BAY       59   75    .440  23 1/2  36-34  23-41  21-38  13-12  10-22  LOST 7
TORONTO         56   80    .412  27 1/2  34-37  22-43  21-36  13-19  14-15  LOST 2

AMERICAN LEAGUE CENTRAL
TEAM            WON  LOST  PCT   GB      HOME   ROAD   EAST   CENT   WEST   STREAK
MINNESOTA       77   58    .570  -       43-28  34-30  16-11  34-24  16-16  WON 5
CHI WHITE SOX   67   67    .500  9 1/2   38-32  29-35  16-16  29-27  14-14  WON 2
CLEVELAND       67   70    .489  11      40-30  27-40  17-15  26-31  14-16  LOST 4
DETROIT         62   72    .463  14 1/2  32-32  30-40  10-14  28-28  15-21  WON 1
KANSAS CITY     47   87    .351  29 1/2  30-37  17-50  08-19  25-32  08-24  LOST 2

AMERICAN LEAGUE WEST
TEAM            WON  LOST  PCT   GB      HOME   ROAD   EAST   CENT   WEST   STREAK
OAKLAND         81   54    .600  -       45-19  36-35  23-16  25-15  23-15  WON 3
ANAHEIM         77   58    .570  4       38-28  39-30  24-16  25-14  21-17  WON 2
TEXAS           74   60    .552  6 1/2   42-22  32-38  22-17  22-17  20-18  WON 1
SEATTLE         51   84    .378  30      32-34  19-50  11-28  19-21  12-26  LOST 4

The three divisional winners and the second place team with the best record make the playoffs. But who is the best team? The worst team? How good is good and how bad is bad? Let's calculate the standard deviation with respect to the number of wins for each team. First, we find the mean number of wins by adding up all the wins and dividing by 14.

$$ \bar{x} = \frac{83+80+63+59+56+77+67+67+62+47+81+77+74+51}{14} = \frac{944}{14} \approx 67.4 $$

The average or mean number of wins is 67.4 for the American League teams on Labor Day, 2004. The table below has each team ranked by the number of wins, from most to least. We used a spreadsheet to construct the columns representing the differences from the mean, the squares of these differences, the standard deviation and the z-scores.

Statistics table for the teams in the American League on Labor Day, 2004.

TEAM            Wins   Wins - Mean          (Wins - Mean)^2
NY YANKEES      83     83 - 67.4 = 15.6     15.6^2 = 243.36
OAKLAND         81     81 - 67.4 = 13.6     13.6^2 = 184.96
BOSTON          80     12.6                 158.8
MINNESOTA       77     9.6                  92.2
ANAHEIM         77     9.6                  92.2
TEXAS           74     6.6                  43.6
CHI WHITE SOX   67     -0.4                 0.2
CLEVELAND       67     -0.4                 0.2
BALTIMORE       63     -4.4                 19.4
DETROIT         62     -5.4                 29.2
TAMPA BAY       59     -8.4                 70.6
TORONTO         56     -11.4                129.96
SEATTLE         51     -16.4                268.96
KANSAS CITY     47     -20.4                416.2
SUM             944    0                    1749.84

To calculate the standard deviation,

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{1749.84}{14-1}} \approx 11.6 $$

Recall, to find each z-score,

$$ z = \frac{\text{data value} - \text{mean}}{\text{standard deviation}} $$

TEAM            Wins   Wins - Mean          z-score
NY YANKEES      83     83 - 67.4 = 15.6     15.6/11.6 = 1.3
OAKLAND         81     81 - 67.4 = 13.6     13.6/11.6 = 1.2
BOSTON          80     12.6                 1.1
MINNESOTA       77     9.6                  0.8
ANAHEIM         77     9.6                  0.8
TEXAS           74     6.6                  0.6
CHI WHITE SOX   67     -0.4                 -0.03
CLEVELAND       67     -0.4                 -0.03
BALTIMORE       63     -4.4                 -0.4
DETROIT         62     -5.4                 -0.5
TAMPA BAY       59     -8.4                 -0.7
TORONTO         56     -11.4                -0.98
SEATTLE         51     -16.4                -1.4
KANSAS CITY     47     -20.4                -1.8

Look at the final column, and recall, as you glance at each team's z-score, that for a normal population, 68 % of the population falls within 1 standard deviation (a z-score between -1 and 1) of the mean, 95 % falls within 2 and 99.7 % falls within 3.
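The spreadsheet work for the wins table can also be sketched in Python. The team labels are just dictionary keys chosen here for illustration, and the script keeps the unrounded mean (944/14), so a z-score may differ in the second decimal place from one computed with the rounded mean of 67.4.

```python
from statistics import mean, stdev

# Wins on 9/5/2004, copied from the standings in the text.
wins = {
    "NY Yankees": 83, "Oakland": 81, "Boston": 80, "Minnesota": 77,
    "Anaheim": 77, "Texas": 74, "Chi White Sox": 67, "Cleveland": 67,
    "Baltimore": 63, "Detroit": 62, "Tampa Bay": 59, "Toronto": 56,
    "Seattle": 51, "Kansas City": 47,
}

counts = list(wins.values())
avg = mean(counts)     # mean number of wins
sd = stdev(counts)     # sample standard deviation (n - 1 divisor)

# z-score = (data value - mean) / standard deviation
z_scores = {team: (w - avg) / sd for team, w in wins.items()}

print(f"mean = {avg:.1f}, standard deviation = {sd:.1f}")
for team, z in z_scores.items():
    print(f"{team:<14} {wins[team]:>3} {z:+.2f}")
```

Every team lands within two standard deviations of the mean, which is one way to put the best and worst records into perspective.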
How good are the NY Yankees and how bad are the Kansas City Royals? You now have a more detailed frame of reference to answer that question.

Problem Two. Does money buy success in baseball?

Updated: 9/5/2004 3:37:06 PM cnn.com

The payrolls for the American League teams are listed below.

New York Yankees       $ 184,193,950
Boston Red Sox         $ 127,298,500
Anaheim Angels         $ 100,534,667
Seattle Mariners       $ 81,515,834
Chicago White Sox      $ 65,212,500
Oakland Athletics      $ 59,425,667
Texas Rangers          $ 55,050,417
Minnesota Twins        $ 53,585,000
Baltimore Orioles      $ 51,623,333
Toronto Blue Jays      $ 50,017,000
Kansas City Royals     $ 47,609,000
Detroit Tigers         $ 46,832,000
Cleveland Indians      $ 34,319,300
Tampa Bay Devil Rays   $ 29,556,667

What will the standard deviation tell us with respect to this payroll data? Will there be a correlation between salaries and success? Does money buy success? Once we have calculated the standard deviation and z-scores, we will compare these results with the true standings taken on Sept 5th, 2004. The sum of the 14 American League payrolls is $ 986,773,835, thus the average payroll is $ 70,483,845.36, which we will round to $ 70,483,845. To calculate the standard deviation, we construct the following table.

Team                   Payroll          Payroll - Mean   (Payroll - Mean)^2
New York Yankees       $ 184,193,950    113,710,105      12,929,987,979,111,025
Boston Red Sox         $ 127,298,500    56,814,655       3,227,905,022,769,025
Anaheim Angels         $ 100,534,667    30,050,822       903,051,902,875,684
Seattle Mariners       $ 81,515,834     11,031,989       121,704,781,296,121
Chicago White Sox      $ 65,212,500     -5,271,345       27,787,078,109,025
Oakland Athletics      $ 59,425,667     -11,058,178      122,283,300,679,684
Texas Rangers          $ 55,050,417
Minnesota Twins        $ 53,585,000
Baltimore Orioles      $ 51,623,333
Toronto Blue Jays      $ 50,017,000
Kansas City Royals     $ 47,609,000
Detroit Tigers         $ 46,832,000
Cleveland Indians      $ 34,319,300
Tampa Bay Devil Rays   $ 29,556,667

We leave it as an exercise for you to complete the table above.
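If you would rather not square eight-digit numbers by hand, the table can be completed mechanically. The sketch below is one possible approach, not the spreadsheet used in the text; it rounds the mean to the nearest dollar, as the text does, and the variable names are chosen here for illustration.

```python
from math import sqrt

# 2004 American League payrolls in dollars, copied from the list in the text.
payroll = {
    "New York Yankees": 184_193_950, "Boston Red Sox": 127_298_500,
    "Anaheim Angels": 100_534_667, "Seattle Mariners": 81_515_834,
    "Chicago White Sox": 65_212_500, "Oakland Athletics": 59_425_667,
    "Texas Rangers": 55_050_417, "Minnesota Twins": 53_585_000,
    "Baltimore Orioles": 51_623_333, "Toronto Blue Jays": 50_017_000,
    "Kansas City Royals": 47_609_000, "Detroit Tigers": 46_832_000,
    "Cleveland Indians": 34_319_300, "Tampa Bay Devil Rays": 29_556_667,
}

n = len(payroll)
mean_payroll = round(sum(payroll.values()) / n)   # rounded to whole dollars

# One row per team: (team, payroll, payroll - mean, (payroll - mean)^2).
# Python integers are exact, so the large squares are not rounded.
rows = [(team, p, p - mean_payroll, (p - mean_payroll) ** 2)
        for team, p in payroll.items()]

sd = sqrt(sum(row[3] for row in rows) / (n - 1))  # n - 1 divisor

for team, p, dev, dev_sq in rows:
    print(f"{team:<22} {p:>13,} {dev:>13,} {dev_sq:>26,}")
print(f"standard deviation is about $ {sd:,.0f}")
```

Dividing any team's "Payroll - Mean" entry by the printed standard deviation gives that team's z-score.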
Once done, it is quickly verified that the standard deviation is $ 41,783,940. So, let's reprint the table, with the number of standard deviations from the mean listed for each team along with its ranking in the American League.

Team                   Payroll          Standard Deviations from the Mean (z-score)   True American League Ranking
New York Yankees       $ 184,193,950    2.72                                          1
Boston Red Sox         $ 127,298,500    1.36                                          3
Anaheim Angels         $ 100,534,667    0.72                                          Tied for 4
Seattle Mariners       $ 81,515,834     0.26                                          13
Chicago White Sox      $ 65,212,500     -0.13                                         7
Oakland Athletics      $ 59,425,667     -0.26                                         2
Texas Rangers          $ 55,050,417     -0.37                                         6
Minnesota Twins        $ 53,585,000     -0.40                                         Tied for 4
Baltimore Orioles      $ 51,623,333     -0.45                                         9
Toronto Blue Jays      $ 50,017,000     -0.49                                         12
Kansas City Royals     $ 47,609,000     -0.55                                         14
Detroit Tigers         $ 46,832,000     -0.57                                         10
Cleveland Indians      $ 34,319,300     -0.87                                         8
Tampa Bay Devil Rays   $ 29,556,667     -0.98                                         11

Problem Three

Homeownership in the USA

Below are the percents of homeownership for each state in the United States (to include mobile homes) in 2002. Source: US Bureau of the Census.

Alabama                73.5
Alaska                 67.3
Arizona                65.9
Arkansas               70.2
California             58.0
Colorado               69.1
Connecticut            71.6
Delaware               75.6
District of Columbia   44.1
Florida                68.7
Georgia                71.7
Hawaii                 57.4
Idaho                  73.0
Illinois               70.2
Indiana                75.0
Iowa                   73.9
Kansas                 70.2
Kentucky               73.5
Louisiana              67.1
Maine                  73.9
Maryland               72.0
Massachusetts          62.7
Michigan               76.0
Minnesota              77.3
Mississippi            74.8
Missouri               74.6
Montana                69.3
Nebraska               68.4
Nevada                 65.5
New Hampshire          69.5
New Jersey             67.2
New Mexico             70.3
New York               55.0
North Carolina         70.0
North Dakota           69.5
Ohio                   72.0
Oklahoma               69.4
Oregon                 66.2
Pennsylvania           74.0
Rhode Island           59.6
South Carolina         77.3
South Dakota           71.5
Tennessee              70.1
Texas                  63.8
Utah                   72.7
Vermont                70.2
Virginia               74.3
Washington             67.0
West Virginia          77.0
Wisconsin              72.0
Wyoming                72.8

As before, we have a rather large sample. Let's begin with the statistical basics. We will find the mean, median, mode and range after first ranking the states in ascending order.
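The ranking and the four basic statistics can be produced with Python's standard statistics module. This is a minimal sketch; the dictionary name ownership and the other variable names are chosen here for illustration, and the values are the 2002 percents listed above.

```python
from statistics import mean, median, mode

# Percent of homeownership in 2002 for the 50 states and DC, from the text.
ownership = {
    "Alabama": 73.5, "Alaska": 67.3, "Arizona": 65.9, "Arkansas": 70.2,
    "California": 58.0, "Colorado": 69.1, "Connecticut": 71.6,
    "Delaware": 75.6, "District of Columbia": 44.1, "Florida": 68.7,
    "Georgia": 71.7, "Hawaii": 57.4, "Idaho": 73.0, "Illinois": 70.2,
    "Indiana": 75.0, "Iowa": 73.9, "Kansas": 70.2, "Kentucky": 73.5,
    "Louisiana": 67.1, "Maine": 73.9, "Maryland": 72.0,
    "Massachusetts": 62.7, "Michigan": 76.0, "Minnesota": 77.3,
    "Mississippi": 74.8, "Missouri": 74.6, "Montana": 69.3,
    "Nebraska": 68.4, "Nevada": 65.5, "New Hampshire": 69.5,
    "New Jersey": 67.2, "New Mexico": 70.3, "New York": 55.0,
    "North Carolina": 70.0, "North Dakota": 69.5, "Ohio": 72.0,
    "Oklahoma": 69.4, "Oregon": 66.2, "Pennsylvania": 74.0,
    "Rhode Island": 59.6, "South Carolina": 77.3, "South Dakota": 71.5,
    "Tennessee": 70.1, "Texas": 63.8, "Utah": 72.7, "Vermont": 70.2,
    "Virginia": 74.3, "Washington": 67.0, "West Virginia": 77.0,
    "Wisconsin": 72.0, "Wyoming": 72.8,
}

# Rank in ascending order of homeownership.
ranked = sorted(ownership.items(), key=lambda item: item[1])
values = [percent for _, percent in ranked]

data_mean = mean(values)
data_median = median(values)              # middle (26th) of the 51 values
data_mode = mode(values)                  # most frequent value
data_range = max(values) - min(values)

print(f"lowest: {ranked[0]}, highest: {ranked[-1]}")
print(f"mean = {data_mean:.1f}, median = {data_median}, "
      f"mode = {data_mode}, range = {data_range:.1f}")
```

Sorting first makes the median and the range easy to read straight off the list, which is why the text ranks the states before doing anything else.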
State            Percent   Rank
DC               44.1
New York         55.0      50
Hawaii           57.4      49
California       58.0      48
Rhode Island     59.6      47
Massachusetts    62.7      46
Texas            63.8      45
Nevada           65.5      44
Arizona          65.9      43
Oregon           66.2      42
Washington       67.0      41
Louisiana        67.1      40
New Jersey       67.2      39
Alaska           67.3      38
Nebraska         68.4      37
Florida          68.7      36
Colorado         69.1      35
Montana          69.3      34
Oklahoma         69.4      33
New Hampshire    69.5      31
North Dakota     69.5      31
North Carolina   70.0      30
Tennessee        70.1      29
Arkansas         70.2      25
Illinois         70.2      25
Kansas           70.2      25
Vermont          70.2      25
New Mexico       70.3      24
South Dakota     71.5      23
Connecticut      71.6      22
Georgia          71.7      21
Maryland         72.0      18
Ohio             72.0      18
Wisconsin        72.0      18
Utah             72.7      17
Wyoming          72.8      16
Idaho            73.0      15
Alabama          73.5      13
Kentucky         73.5      13
Iowa             73.9      11
Maine            73.9      11
Pennsylvania     74.0      10
Virginia         74.3      9
Missouri         74.6      8
Mississippi      74.8      7
Indiana          75.0      6
Delaware         75.6      5
Michigan         76.0      4
West Virginia    77.0      3
Minnesota        77.3      1
South Carolina   77.3      1

Mean     69.4
Mode     70.2
Median   70.2

Let's begin to interpret the data. First, we notice the mode and median are the same, and the mean (average) is below the two. On a positive note, we can say more than half of the states have a percentage of homeownership above the average. The next question is, "is the mean significantly below the other two?" This may be partially answered by observing the range. The range is 33.2 (77.3 - 44.1), which appears to be rather large. So, by comparison, 69.4 versus 70.2 appears to be no big deal.

Let's add to our repertoire for examining the dispersion of the data. For large sets of data, we do not want to obsess over each individual data point. We want to see if the data follows noticeable trends and then interpret any outliers that may appear. To do this, we observe a histogram, made from the data in ascending order. Notice we have so much data that our page is not wide enough to label each state. We labeled only a few for reference.
[Figure: Percent of Home Owners by state. Vertical axis: percent, from 0.0 to 100.0. Horizontal axis: the states in ascending order, with only a few labeled.]

The District of Columbia is an outlier; its percent of 44.1 is far from the mean of 69.4 %. But, in a manner of speaking, NY, HI, CA and RI, with respective percents of home ownership of 55, 57.4, 58 and 59.6, all seem far below the mean of 69.4 as well. This is what we mean by dispersion. We need to quantify how well grouped the data is, because we need to know where to draw the line between those states that are significantly below the mean or significantly above the mean. This can be done with what we have identified as a standard deviation: the measure of how the data deviates from its behavior around the middle of the data. Central tendency. In a perfect world, the mean is in the center of the data, thus it is the median too. And the mode, if we get greedy. Recall, the standard deviation, loosely speaking, measures how the data deviates from the mean, and remember, the mean is in the center of the data. The table shows each state's raw z-score followed by the state's percentage of homeownership. A table so constructed allows the reader to find a state and quickly identify how the state compares to the national average. Problems 20 - 22 in the Exercise Set require you to verify the table and then draw certain responses from the table.
State            z-score   Percent
DC               -4.21     44.1
New York         -2.45     55.0
Hawaii           -2.06     57.4
California       -1.97     58.0
Rhode Island     -1.71     59.6
Massachusetts    -1.21     62.7
Texas            -1.03     63.8
Nevada           -0.76     65.5
Arizona          -0.69     65.9
Oregon           -0.65     66.2
Washington       -0.52     67.0
Louisiana        -0.5      67.1
New Jersey       -0.48     67.2
Alaska           -0.47     67.3
Nebraska         -0.29     68.4
Florida          -0.24     68.7
Colorado         -0.18     69.1
Montana          -0.15     69.3
Oklahoma         -0.13     69.4
New Hampshire    -0.11     69.5
North Dakota     -0.11     69.5
North Carolina   -0.03     70.0
Tennessee        -0.02     70.1
Arkansas         0         70.2
Illinois         0         70.2
Kansas           0         70.2
Vermont          0         70.2
New Mexico       0.02      70.3
South Dakota     0.21      71.5
Connecticut      0.23      71.6
Georgia          0.24      71.7
Maryland         0.29      72.0
Ohio             0.29      72.0
Wisconsin        0.29      72.0
Utah             0.4       72.7
Wyoming          0.42      72.8
Idaho            0.45      73.0
Alabama          0.53      73.5
Kentucky         0.53      73.5
Iowa             0.6       73.9
Maine            0.6       73.9
Pennsylvania     0.61      74.0
Virginia         0.66      74.3
Missouri         0.71      74.6
Mississippi      0.74      74.8
Indiana          0.77      75.0
Delaware         0.87      75.6
Michigan         0.94      76.0
West Virginia    1.1       77.0
Minnesota        1.15      77.3
South Carolina   1.15      77.3

Why use standard deviation

The standard deviation can also help you evaluate the worth of all the so-called "studies" that seem to be released to the press every day. Standard deviation is commonly used in business as a measure to describe the risk of a security or portfolio of securities. If you read the history of investment performance, chances are that standard deviation will be used to gauge risk. The same is true for academic studies to determine the validity of exam results, or the effectiveness of educational tools. Standard deviation is also one of the most commonly used statistical tools in the sciences and social sciences. It provides a precise measure of the amount of variation in any group of numbers, be it the returns on a mutual fund, the yearly rainfall in Mexico City, or the hits per game for a major league baseball player. Lastly, look at the data below, taken directly from the morning newspaper. Does it take on a whole new look?
Could we analyze, say, whether the offense or the defense is a better predictor of success in professional football?

The 2003 Final Standings of the NFL teams. W = wins, L = losses, T = ties, % = percentage of games won, PF = Points For, that is, points the team scored, PA = Points Against, that is, points the team allowed.

AFC East                 W    L    T    %      PF    PA
New England Patriots     14   2    0    .875   348   238
Miami Dolphins           10   6    0    .625   311   261
Buffalo Bills            6    10   0    .375   243   279
New York Jets            6    10   0    .375   283   299

NFC East                 W    L    T    %      PF    PA
Philadelphia Eagles      12   4    0    .750   374   287
Dallas Cowboys           10   6    0    .625   289   260
Washington Redskins      5    11   0    .313   287   372
New York Giants          4    12   0    .250   243   387

AFC North                W    L    T    %      PF    PA
Baltimore Ravens         10   6    0    .625   391   281
Cincinnati Bengals       8    8    0    .500   346   384
Pittsburgh Steelers      6    10   0    .375   300   327
Cleveland Browns         5    11   0    .313   254   322

NFC North                W    L    T    %      PF    PA
Green Bay Packers        10   6    0    .625   442   307
Minnesota Vikings        9    7    0    .563   416   353
Chicago Bears            7    9    0    .438   283   346
Detroit Lions            5    11   0    .313   270   379

AFC South                W    L    T    %      PF    PA
Indianapolis Colts       12   4    0    .750   447   336
Tennessee Titans         12   4    0    .750   435   324
Jacksonville Jaguars     5    11   0    .313   276   331
Houston Texans           5    11   0    .313   255   380

NFC South                W    L    T    %      PF    PA
Carolina Panthers        11   5    0    .688   325   304
New Orleans Saints       8    8    0    .500   340   326
Tampa Bay Buccaneers     7    9    0    .438   301   264
Atlanta Falcons          5    11   0    .313   299   422

AFC West                 W    L    T    %      PF    PA
Kansas City Chiefs       13   3    0    .813   438   332
Denver Broncos           10   6    0    .625   381   301
Oakland Raiders          4    12   0    .250   270   379
San Diego Chargers       4    12   0    .250   313   441

NFC West                 W    L    T    %      PF    PA
St. Louis Rams           12   4    0    .750   447   328
Seattle Seahawks         10   6    0    .625   404   327
San Francisco 49ers      7    9    0    .438   384   337
Arizona Cardinals        4    12   0    .250   225   452

Exercise Set

For problems 1 to 8, use the 2003 Final Standings of the NFL teams, as previously indicated.
1. For the NFC teams, find the standard deviation for the number of wins and then find the z-score for each team.
2.
For the AFC teams, find the standard deviation for the number of wins and then find the z-score for each team.
3. For all the teams, find the standard deviation for the number of wins and then find the z-score for each team.
4. For the NFC teams, find the standard deviation for PF and then find the z-score for each team.
5. For the AFC teams, find the standard deviation for PF and then find the z-score for each team.
6. For the NFC teams, find the standard deviation for PA and then find the z-score for each team.
7. For the AFC teams, find the standard deviation for PA and then find the z-score for each team.
8. Look carefully at questions 1 to 7. Which is a better predictor of a team's success, the offense, as indicated by the points the team scored (PF), or the defense, as indicated by the points the team allowed (PA)? Why?
9. According to the 2005 World Almanac for Kids, below are the 25 largest countries in the world by population in mid-2004, in no particular order. Find the mean, median and the standard deviation.

China                      1,294,629,555
Germany                    82,424,609
India                      1,065,070,607
Egypt                      76,117,421
United States              293,027,571
Iran                       69,018,294
Indonesia                  238,452,952
Turkey                     68,893,918
Brazil                     184,101,109
Ethiopia                   67,851,281
Pakistan                   153,705,278
Thailand                   64,865,523
Russia                     144,112,353
France                     60,424,213
Bangladesh                 141,340,476
Great Britain              60,270,708
Nigeria                    137,253,133
Dem. Rep. of the Congo     58,317,930
Japan                      127,333,002
Italy                      58,057,477
Mexico                     104,959,594
South Korea                48,598,175
Philippines                86,241,697
Ukraine                    47,732,079
Vietnam                    82,689,518

10. According to the 2005 World Almanac for Kids, below are the ten largest cities in the world in 2004, each followed by its population, in no particular order. Tokyo, Japan 34,450,000; Kolkata (Calcutta), India 13,058,000; Mexico City, Mexico 18,066,000; Shanghai, China 12,887,000; New York City, U.S. 17,846,000; Buenos Aires, Argentina 12,583,000; São Paulo, Brazil 17,099,000; Delhi, India 12,441,000; Mumbai (Bombay), India 16,086,000; Los Angeles, U.S. 11,814,000. Find the mean and the standard deviation.
11. According to the 2005 World Almanac for Kids, below are the American League Pennant Winners since 1970, with the year they won preceding the name and their won-lost record following the name. Remove the shortened strike season of 1981 and the year where there was no World Series and find the mean and the standard deviation for wins.
1970 Baltimore 108-54, 1971 Baltimore 101-57, 1972 Oakland 93-62, 1973 Oakland 94-68, 1974 Oakland 90-72, 1975 Boston 95-65, 1976 New York 97-62, 1977 New York 100-62, 1978 New York 100-63, 1979 Baltimore 102-57,
1980 Kansas City 97-65, 1981 New York 59-48, 1982 Milwaukee 95-67, 1983 Baltimore 98-64, 1984 Detroit 104-58, 1985 Kansas City 91-71, 1986 Boston 95-66, 1987 Minnesota 85-77, 1988 Oakland 104-58, 1989 Oakland 99-63,
1990 Oakland 103-59, 1991 Minnesota 95-67, 1992 Toronto 96-66, 1993 Toronto 95-67, 1994 none, 1995 Cleveland 100-44, 1996 New York 92-70, 1997 Cleveland 86-75, 1998 New York 114-48, 1999 New York 98-64,
2000 New York 87-74, 2001 New York 95-65, 2002 Anaheim 99-63, 2003 New York 101-61
12. Look up the ages of the presidents of the United States when they took office. Find the mean and standard deviation of the presidents' ages. Then repeat the process twice, once for the presidents who were inaugurated before the Civil War and once for those who were inaugurated after the Civil War. What do you notice when you compare the pre and post Civil War presidents' ages?

For questions 13 to 19: M&M's project

Some years come and go, but other years live in the hearts and minds of men and women for all eternity. Such was the year of 1941, when Pearl Harbor was attacked, Joe DiMaggio hit safely in 56 straight games and M&M's were first introduced to the public.
Daughters everywhere love M&M’s; some, in particular, love the blue pieces the most. The original M&M’s of 1941 had violet candies and no blue ones. Then, in 1949, tan replaced violet, and in 1995 tan was replaced by blue. M&M’s were made round by taking milk chocolate centers and tumbling them to get their smooth, rounded shape. We all know that M&M stands for Mars and Murrie and that the different colors of M&M’s taste the same. According to the M&M’s website, http://www.mmmars.com/cai/mms/faq.html:
M&M’s Milk Chocolate candies are 30% brown, 20% each of yellow and red, and 10% each of orange, green and blue.
M&M’s Peanut Chocolate candies are 20% each of brown, yellow, red and blue, and 10% each of green and orange.
M&M’s Peanut Butter and Almond Chocolate candies are 20% each of brown, red, yellow, green and blue.
M&M’s Crispy Chocolate candies are 16.6% each of brown, red, yellow, green, orange and blue.
Let’s perform our own test and see whether the percent of each color we observe matches the website’s figures.
13. Buy a one-pound bag of M&M’s Milk Chocolate for each student in your class. As a class, for each bag, tally up the number of M&M’s of each color. Find the percent of each color for each bag.
14. Then tally up the colors for all the bags, and find the percent of each color for the classroom sample, which consists of all the bags.
15. Treating each bag as an individual trial, find the mean, median and mode for each color. Then find the percent of each color based on these findings.
16. How do the results in questions 14 and 15 compare? How do the results compare to the M&M’s website statistics?
17. Repeat the experiment for Peanut Chocolate, Peanut Butter and Almond Chocolate, and Crispy Chocolate.
18. Answer this question: how can you use standard deviations in this experiment to help you analyze your findings, so that you can decide on the reliability of the data on the M&M’s website?
19. Compute those standard deviations to determine the reliability of the data on the M&M’s website.
For problems 20 to 22, refer to Problem 3 (Home Ownership) from the text.
20. Compute the standard deviation for the data from Problem 3 and then verify the table presented.
21. Determine which states are the friendliest to home ownership and which are the least friendly.
22. Is there a cause-and-effect relationship that you can argue to explain why these states are at either end of this analysis?
23. Barry Bonds or Babe Ruth: who was the greatest baseball player of all time? To argue your point, quote statistics. Research their batting averages and compare them to the batting averages of their peers at the time. How many standard deviations from the mean were their batting averages? Do the same for home runs, RBIs and on-base percentage. Factor in that Barry Bonds played in night games and that Babe Ruth won 20 games as a pitcher. Best of luck…
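Most of the questions above come down to the same three steps: find the mean, find the standard deviation, and express each value as a number of standard deviations from the mean (its z-score). For readers who would rather check their arithmetic by computer, here is a minimal Python sketch of those steps using the city populations from question 10. The function name `z_scores` and its `sample` flag are illustrative choices, not anything from the text. Whether the text intends the population standard deviation (divide by n) or the sample standard deviation (divide by n - 1) is an assumption left to the reader; the sketch defaults to the population version and can compute either.

```python
from statistics import mean, pstdev, stdev

# City populations copied from question 10 (2005 World Almanac for Kids).
city_populations = {
    "Tokyo": 34_450_000,
    "Kolkata": 13_058_000,
    "Mexico City": 18_066_000,
    "Shanghai": 12_887_000,
    "New York City": 17_846_000,
    "Buenos Aires": 12_583_000,
    "Sao Paulo": 17_099_000,
    "Delhi": 12_441_000,
    "Mumbai": 16_086_000,
    "Los Angeles": 11_814_000,
}


def z_scores(data, sample=False):
    """Return {label: z-score} for a dict of numeric values.

    sample=False uses the population standard deviation (divide by n);
    sample=True uses the sample standard deviation (divide by n - 1).
    Which one applies is an assumption the reader must settle.
    """
    values = list(data.values())
    center = mean(values)
    spread = stdev(values) if sample else pstdev(values)
    return {label: (value - center) / spread for label, value in data.items()}


city_mean = mean(city_populations.values())
city_z = z_scores(city_populations)

print(f"mean population: {city_mean:,.0f}")
print(f"population standard deviation: {pstdev(city_populations.values()):,.0f}")
for city, z in sorted(city_z.items(), key=lambda item: item[1], reverse=True):
    print(f"{city:15s} z = {z:+.2f}")
```

The same function works for any of the data sets above once they are typed in as a dictionary, for example team names mapped to wins, or pennant-winning years mapped to wins with 1981 and 1994 left out.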