Download Ch3 - Arizona State University

Statistics or
Never attribute to malice that which is adequately explained by stupidity.
Numbers don’t lie. And often perception is not reality. Case in point, people are always
concerned, worried even, about the increasingly violent society in which they live.
Statements like “what is this world coming to” are commonplace. People may tend to
feel increasingly unsafe, and many become more and more reluctant to go out at night.
Some have been known to hide behind their TV or computer instead of venturing out. There
are many who feel that when they were young, crime was not as bad as it is now. People
attribute arbitrary reasons for this new wave of perceived violence. “Exhaustive music
videos glorify violence, causing a violent cycle to never end...” “No wonder there is so
much crime these days, look at all the violence on TV and in the movies…” “Kids have
no respect for their parents, teachers or elders these days, this contributes to more
violence…” Or “the remote control teaches us to become impatient, and we are more
likely to quickly pull the trigger…” Images from the OJ murders, Columbine shootings
or 9/11 tend to fill our televisions, replaying the same isolated scenes over and over
again. People are shot every night on reruns of Law and Order. So, it’s natural for
people to criticize the amount of violence in our society, but rarely do these same people
think their statements through to their logical conclusion. Instead,
many appear to become angry about the rise of violent crime in this country and tend to
make matters worse by linking this acquired malice to other elements in society (music
videos, teenagers, TV, OJ), fostering a wider net of hate. More importantly, they never
once pause to check the numbers. And in a matter of moments, anyone can do just
that. Any of us can access the FBI’s Index of Crime Statistics on the web.
So, we did.
Below are the nationwide statistics from 1982 to 2001, showing, by year, the number of
violent crimes. During this twenty-year span, while the nation’s population
grew from 231,664,458 in 1982 to 284,796,887 in 2001, the number of violent crimes
(defined as murder, rape, robbery and assault) did not steadily increase, as one might
have expected. In fact, there was a stunning decline in violent crimes over the last
decade. Violent crime reached its peak in 1992, with 1,932,274 reported instances, and
since then violent crime has dropped over 25 percent. (The homicides on September 11,
2001 were not included.)
[Figure: Violent Crime (murder, rape, robbery, assault) by year, 1982 to 2001; vertical axis from 500,000 to 2,500,000]
But, look at the age we live in. Ripped from the headlines:
• “Arizona Kids Are Home Alone: A new survey says 30 percent in kindergarten
through 12th grade take care of themselves”
• Of the 85 prisoners executed in 2000, 43 were white, 35 African American, 6 Hispanic,
1 American Indian
• Vietnam: 58,168 deaths; total abortions since 1973: 44,670,812 as of April 22,
2004
• Should juveniles be tried as adults? Kids are killing these days in record numbers
Statistics are tossed at us in such a deluge that the numbers alone seem almost
controversial: 30% of school-age children left alone, 35 out of 85 executed prisoners
African American, 44 million abortions in the last 30 years. Certainly, each of these
topics elicits emotion from within each of us. Do too many parents leave their children
unsupervised? Is there not enough funding for day care? Is the death penalty, pro or con,
racially biased? Are too many or too few juveniles tried as adults? And if you want to
clear a room with angry combatants, start with the age-old question: woman’s choice or
murder of the unborn? No matter your stand on these topics, as you comb through the
headlines, statistics besiege you.
Why is quantitative literacy important? When confronted with numbers associated with
hotly contested issues or highly controversial ethical or moral arguments, raw numbers
themselves, such as the above stated 44,670,812 abortions in the last 30 years, need to be
examined so they may be fully understood. As always, we begin by examining the
number for credibility. Is it even viable? This particular number, or numbers similar to
it, appears on various websites. We found the number we quoted and similar numbers
at http://www.americandaily.com/article/1806 and
http://womensissues.about.com/cs/abortionstats/a/aaabortionstats.htm. Are they accurate?
We simply have no way of knowing, but it is an oft-published statistic. Is it viable? Now,
that is a different question altogether. It is certainly quoted often enough.
So, following our pattern of analysis, if the number seems to be viable, then we continue.
If it is viable, what implications are fair to draw? These 44 million aborted fetuses
would be 30 or younger, so for argument’s sake, let us assume it is fair to say a large
percentage would be alive today. If this assumption is reasonable, roughly 40 million plus
the 290 million US citizens comes to 330 million. We are talking about a population of 330
million people, and 44/330 is slightly larger than 13% or slightly greater than 1/8. What
does this mean? Has society aborted 1 in 8? Don’t questions abound in your mind? Is
this correct? Were these all abortions performed out of necessity? How many were
medical? Or moral? Or personal choice? Does the reason for the abortion matter to
you? Does the reason for the abortion matter to you if you take into account this new “1
in 8” statistic as a measure of how often abortions do occur? Certainly, one may argue
that 1 in 8 could be construed as an alarming rate. But, the point of view and the
emotions you feel are personal for you. The point is that 44 million is the statistic we are
confronted with. Our ability to perform math tells us 1 in 8 is a logical consequence of
this statistic. What you do in the subsequent interpretation is your decision. But,
quantitative literacy will allow you to understand the statistic in context and make that
interpretation.
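The arithmetic behind the “1 in 8” figure can be checked in a few lines. This is a minimal sketch in Python; the variable names are ours, and the inputs are simply the rounded figures quoted above (44 million, and a combined population of 330 million), which we have not independently verified.

```python
from fractions import Fraction

# Rounded figures quoted in the text, in millions (assumed, not verified here).
abortions_millions = 44
combined_population_millions = 330

# Exact ratio, then the same value expressed as a percent.
share = Fraction(abortions_millions, combined_population_millions)
percent = float(share) * 100

print(f"share   = {share} (about {percent:.1f}%)")
print(f"1 in 8  = {100 / 8:.1f}%")
print("share is larger than 1/8:", share > Fraction(1, 8))
```

Working with an exact fraction first and converting to a percent only at the end avoids rounding twice.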
Statistics themselves are numbers that stand alone. Honest. Raw. Naked numbers. The
name of the game in statistics is to draw inferences about a population or topic. If we are
using polls, we are basing the inference on a smaller random sample. When trying to
then form a conclusion, we must be careful. Correlation is not causation: just because
numbers correlate does not mean one causes the other. Inferring characteristics about a
population based on the raw data is the immediate reaction as we scan the headlines, but
should it be? Can graphs be misleading? How good are we at recognizing misleading
information?
Causation and Correlation
Clearly, there exists a high correlation between a person’s blood alcohol level and
the likelihood they will get into an auto accident. I do not think any
rational person would dispute the added inference that drinking alcohol can cause an auto
accident. The data that support the relationship between the two factors (the higher
number of drunk drivers, compared to sober drivers, who get into accidents) imply
correlation. That drinking led to, or caused, the accident implies causation. It will be
our task to determine whether one factor’s data that correlates with another factor’s data
can be interpreted to mean that one factor influences the other.
Correlation A correlation exists between two factors if a change in one factor’s
data is associated with a rise or decline in the other factor’s data.
Causation Causation exists between two factors if one factor causes, determines or
results in a rise or decline in the other factor’s data.
Correlation as a result of causation As with drinking and auto accidents, we can often
infer that a correlation is tied to causation. Another equally clear case can be made by
considering tobacco use and lung cancer. The numbers correlate: one can relate the
amount one smokes to the likelihood of succumbing to lung cancer. Those who smoke
more have a higher percentage of their population afflicted with lung cancer. And, for
years, the Surgeon General has been telling us that smoking causes lung cancer. The
more you smoke, the higher the risk of developing lung cancer.
Correlation with no causation. Hidden factors Just because two factors correlate does
not mean one factor causes the other. One of the easiest examples to spotlight the
difference and to have it plainly explained is to look at a common correlation between
divorce and death. In most states, there is a significant negative correlation between the
two: the more divorces, the fewer deaths. Since the two correlate negatively, the natural
question arises: does getting a divorce reduce the risk of dying? Does staying married
increase the chance of dying? All joking aside about the obvious hidden implication, it is
a third, unseen factor that causes the correlation. Death and divorce do not have a causal
relationship with each other. Each is related to age. The older the married couple, the
lower the likelihood they will get a divorce. The older the married couple, the higher the
likelihood they will pass away.
There is a negative correlation between divorce and age and a positive correlation
between age and death. The younger you are, the more likely you are to get divorced.
The older you are, the more likely you are to pass away. Since the higher number of
divorces occur with younger people, and since younger people tend to live longer, we
have a transitive relationship implying the higher number of divorces relating to the
longer life spans. Correlation. Causation. Very different. Yes, there is a correlation
between divorce and death. No, neither causes the other. In plain English, getting a
divorce will not increase or decrease the likelihood you will die.
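The hidden-factor argument can be sketched numerically. The example below is an illustration only: the regions, the ages and the two rate formulas are all made up by us, chosen so that age alone drives both rates. Pearson's correlation coefficient r, which this chapter does not introduce, is used as one common numerical measure of correlation (close to +1 for a strong positive correlation, close to -1 for a strong negative one).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# HYPOTHETICAL data: median age in five made-up regions.
median_age = [28, 32, 36, 40, 44]

# Made-up rules (rates per 1,000 people) in which age is the only input.
# Divorce falls with age, death rises with age; neither formula uses the other rate.
divorce_rate = [10.0 - 0.15 * age for age in median_age]
death_rate = [2.0 + 0.20 * age for age in median_age]

r_age_divorce = pearson_r(median_age, divorce_rate)
r_age_death = pearson_r(median_age, death_rate)
r_divorce_death = pearson_r(divorce_rate, death_rate)

print(f"age vs divorce:   r = {r_age_divorce:.2f}")
print(f"age vs death:     r = {r_age_death:.2f}")
print(f"divorce vs death: r = {r_divorce_death:.2f}")
```

Divorce and death come out strongly negatively correlated even though, by construction, neither one influences the other. Real data would be far noisier than these straight-line rules.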
Accidental Correlations Sometimes there exist accidental correlations where there is
no hidden factor or unseen logical explanation. The winner of the Super Bowl and
the party of the winner of the presidential race correlate highly every four
years, but I do not think football predicts the presidential races, or vice versa. This is
an accidental correlation.
Misleading Information
Breast cancer will afflict one in eleven women. But this figure is misleading because it
applies to all women to age eighty-five. Only a small minority of women live to that age.
The incidence of breast cancer rises as the woman gets older. At age forty, one in a
thousand women develop breast cancer. At age sixty, one in five hundred. Is the statistic
one in eleven technically correct? Yes. Should a 40-year-old woman be concerned about
getting breast cancer? Certainly. Should she worry that one in eleven of her peer
group will be afflicted? No. The fact that closer to one in a thousand in her peer group
will be afflicted by no means minimizes the seriousness of the issue, but it sheds a more
realistic light on it.
A scatter plot is a graph of ordered pairs that allows us to
examine the relation between two sets of data.
To draw a scatter plot:
• Arrange the data in a table.
• Decide which column represents the x-values (the data along the horizontal axis).
These values should be the perceived cause, the independent variable. Decide which
column represents the y-values (the data along the vertical axis). These values should
appear to be affected by the perceived cause; they are the dependent variable.
• Plot the data as points in the form of ordered pairs, (x, y).
• Analysis: We can make predictions if the points show a correlation.
* If the points appear to increase to the right, this is a positive correlation.
* If the points appear to decrease to the right, this is a negative correlation.
Positive Correlation: We expect that if the values along the horizontal axis increase, so
do the values associated with the vertical axis. That is, as we increase x, we increase y.
The more we study, the higher we expect to score on Exam One.
[Figure: scatter plot of Grade on Exam One (vertical axis, 0 to 120) against Hours Spent Studying (horizontal axis, 0 to 7)]
Negative Correlation: We expect that if we increase x then we decrease y. The higher
the temperature, the fewer minutes we will jog.
[Figure: scatter plot of Minutes Spent Jogging (vertical axis, 0 to 90) against Temperature in degrees Fahrenheit (horizontal axis, 0 to 140)]
Problem 1
For each below, decide if there is a correlation between the two factors. If there is, is it a
positive correlation or negative correlation. Then decide if the two factors have a causal
relationship. If they don’t have a causal relationship, but they do correlate, determine if
there are hidden factors that explain the correlation, if the correlation is accidental or if
there is misleading information.
a. A child’s shoe size, a child’s ability to do math
b. Blood alcohol level and reaction time
c. A girl’s body weight, the time the girl spends playing with dolls each day
d. Price of an airline ticket, the distance traveled
Solution
a. Positive correlation. A child’s shoe size does correspond to a child’s ability to do
math. The larger a child’s shoe size, the better in math they are. But the relationship is
not causal; large feet do not cause a child to perform math better. There is a hidden
factor: age. The older the child, the larger the child’s shoe size. As a child gets older,
they take more math classes. The more math classes the child has taken, the better the
child performs in math.
b. Positive correlation, if reaction time is measured as the time taken to respond. The
higher the blood alcohol level, the longer (slower) the reaction time. (Measured as
reaction speed, the correlation would be negative.) The relationship is causal.
c. Negative correlation. As a girl’s body weight increases, she plays with dolls less each
day. No causal relationship here; again there is a hidden factor, and again it is age. The
more a girl weighs, the older she tends to be, and the less time she spends playing with
dolls.
d. Positive correlation. The longer the distance, generally, the more expensive the ticket.
Causation.
Problem 2
Let’s examine the basic question, “Do students who do better on a placement exam
perform better in a college algebra course?” Below is the data. Draw a scatter plot and
answer the question.
Placement Score    Final Average, College Algebra
70                 91
68                 89
56                 71
40                 62
78                 95
59                 65
67                 85
45                 66
61                 70
Solution:
We need to examine the data visually to see if there exists a positive or a negative
correlation between placement test scores and performance in college algebra.
Below we construct a scatter plot.
[Figure: scatter plot of Final Average in College Algebra (vertical axis, 0 to 100) against Placement Test Score (horizontal axis, 0 to 100)]
Though not all points follow the same trend, the general trend is that an increase in
placement score does translate to an increase in the final average grade.
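The visual impression can be backed up with a number. The sketch below computes Pearson's correlation coefficient r for the nine data pairs in the table; r is not defined in this chapter and is used here only as one common summary of how strongly two lists rise or fall together. The helper name pearson_r is ours.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Data from the Problem 2 table.
placement_score = [70, 68, 56, 40, 78, 59, 67, 45, 61]
final_average = [91, 89, 71, 62, 95, 65, 85, 66, 70]

r = pearson_r(placement_score, final_average)
direction = "positive" if r > 0 else "negative" if r < 0 else "none"

print(f"r = {r:.2f} ({direction} correlation)")
```

A value of r near +1 agrees with the upward trend seen in the scatter plot. It says nothing by itself about whether the placement score causes the final average.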
Problem Set
For problems 1 – 4, decide if there is a correlation between the two factors. If there is, is
it a positive correlation or negative correlation. Then decide if the two factors have a
causal relationship. If they don’t have a causal relationship, but they do correlate,
determine if there are hidden factors that explain the correlation, if the correlation is
accidental or if there is misleading information.
1. A boy’s height, a boy’s time spent watching cartoons each day
2. Altitude, air pressure
3. Number of homes sold, realtor’s income
4. Number of abortions in US, number of people who are Pro-choice
5. From a survey of 2000 people, the table below represents averages for the number of
years in school and the associated average monthly salary. Make a scatter plot, labeling
the x and y axes.
Number of Years in School      Average Monthly Salary
Less than 12 (approx. 10)      $1,500
12                             $1,750
14                             $2,100
16                             $2,550
18                             $3,000
20                             $4,200
6. Draw a line which closely fits the scatter plot of cumulative donations for a charity
by year below:
[Figure: scatter plot of cumulative donations for a charity by year]
7. From the scatter plot below, interpret the linear pattern and predict the percent of
students who failed the math course in the year 2000.
[Figure: scatter plot of percent of students failing math course (vertical axis, 0 to 30) against year (horizontal axis, 1988 to 2000)]
8. Which data below has the greatest negative correlation?
a) [Figure: scatter plot of $ (vertical axis, 0 to 35) against year (horizontal axis, 1965 to 2000)]
b) [Figure: scatter plot of $ (vertical axis, 0 to 40) against year (horizontal axis, 1965 to 2000)]
c) [Figure: scatter plot of $ (vertical axis, 0 to 40) against year (horizontal axis, 1965 to 2000)]
d) [Figure: scatter plot of $ (vertical axis, 0 to 40) against year (horizontal axis, 1965 to 2000)]
Construct and Draw Inferences
Constructing and drawing inferences are essential to critical thinking and problem
solving. When faced with statements, problems and puzzles, we do more than use
common sense. We use problem solving skills, try to fit patterns and infer statements
that follow logically from the statements given. We determine what is reasonable and
what is not. We determine what should logically follow and what should not in order to
make good decisions.
Circle Graphs
Ripped from the nation’s headlines: Should a juvenile be tried as an adult? To address
this issue, we should ask ourselves many questions and look at this crucial problem from
many perspectives. For many of us, the first question we ask may be “Do juveniles who
murder pose a chronic problem in this country?” Well, what’s chronic? If a large
percentage of all murders were done by juveniles, this could be called chronic.
So, we return to the Crime Index as defined by the FBI from 2001. Let us ask the
question, “does there exist a correlation between age and those who commit murder in
this country?” As long as we have the information grouped by category, which in this
case is by age, we can recognize large numbers displayed in data as a percent of the
whole in a pie chart or circle graph.
First, let’s see how a circle graph or pie chart is made. We tend to subdivide a circle into
sectors represented by their central angle in either degrees (out of 360 degrees) or the
percent of the circle that is to be shaded (out of 100 %).
Below are common subdivisions of a circle:
So, for our question, “is there a correlation between age and those who commit
murder in this country?”, we examine the data taken from the Crime Index. Of the
10,113 known murderers in the country in 2001, their age distribution was
given as follows:
Age, in years    Number
1 to 4                0
5 to 8                0
9 to 12              14
13 to 16            454
17 to 19          1,695
20 to 24          2,767
25 to 29          1,571
30 to 34            992
35 to 39            855
40 to 44            645
45 to 49            455
50 to 54            272
55 to 59            158
60 to 64             85
65 to 69             59
70 to 74             37
75 and over          54
Total            10,113
So, since the data is already organized, let’s find the density of each age group. This
means we will reconstruct the table and find the percent of murderers for each category, 1
to 4, 5 to 8, 9 to 12, 13 to 16 and so on. Note, not all categories are partitioned into equal
time intervals.
Age, in years    Number    Relative frequency           Central angle
1 to 4                0    0                            0 degrees
5 to 8                0    0                            0 degrees
9 to 12              14    14/10,113 ≈ 0.0014           0.0014 x 360 ≈ 0.5 degrees
13 to 16            454    454/10,113 ≈ 0.0449          0.0449 x 360 ≈ 16.2 degrees
17 to 19          1,695    1,695/10,113 ≈ 0.1676        0.1676 x 360 ≈ 60.34 degrees, or about 1/6 of the circle
20 to 24          2,767    2,767/10,113 ≈ 0.2736        0.2736 x 360 ≈ 98.5 degrees
25 to 29          1,571    0.155                        55.9 degrees
30 to 34            992    0.098                        35.3 degrees
35 to 39            855    0.085                        30.4 degrees
40 to 44            645    0.064                        23 degrees
45 to 49            455    0.045                        16.2 degrees
50 to 54            272    0.027                        9.7 degrees
55 to 59            158    0.016                        5.62 degrees
60 to 64             85    0.008                        3 degrees
65 to 69             59    0.006                        2.1 degrees
70 to 74             37    0.004                        1.3 degrees
75 and over          54    0.005                        2 degrees
Total            10,113    1                            360 degrees, a whole circle
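The relative frequency and central angle columns follow one rule: divide each count by the total, then multiply by 360. The sketch below applies that rule to every age group using the counts listed above; the variable names are ours. Computed this way, the 25 to 29 group (1,571 offenders) comes to about 0.155 of the total, a central angle of about 55.9 degrees.

```python
# Known murder offenders by age group, 2001, as listed in the table.
counts = {
    "1 to 4": 0, "5 to 8": 0, "9 to 12": 14, "13 to 16": 454,
    "17 to 19": 1695, "20 to 24": 2767, "25 to 29": 1571,
    "30 to 34": 992, "35 to 39": 855, "40 to 44": 645,
    "45 to 49": 455, "50 to 54": 272, "55 to 59": 158,
    "60 to 64": 85, "65 to 69": 59, "70 to 74": 37, "75 and over": 54,
}

total = sum(counts.values())

# One row per age group: (group, count, relative frequency, central angle).
rows = []
for group, n in counts.items():
    relative_frequency = n / total
    central_angle = relative_frequency * 360
    rows.append((group, n, relative_frequency, central_angle))

print(f"{'Age':<12}{'Number':>8}{'Rel. freq.':>12}{'Angle':>8}")
for group, n, rel, angle in rows:
    print(f"{group:<12}{n:>8,}{rel:>12.4f}{angle:>8.1f}")

total_angle = sum(angle for _, _, _, angle in rows)
print(f"{'Total':<12}{total:>8,}{1:>12.4f}{total_angle:>8.1f}")
```

Checking that the angles add up to 360 degrees is a quick way to catch a slip in any single row.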
The pie chart below is illuminating. Very quickly, by glancing at the chart, we can tell
that 20 to 24 year olds commit the most murders, but a close second seems to be 17 to 19
year olds, as well as 25 to 29 year olds. If a juvenile is defined to be under 18 years of
age, then this appears to be a chronic problem because the second most dense population
of murderers occurs in the 17 to 19 age group. Now when we factor in the 13
to 16 year olds (454), the problem of juvenile murder seems to be more acute. Counting
murders committed by teenagers alone, the 13 to 19 year old age group
accounts for 454 + 1,695 or 2,149 murders. This comes to
2,149/10,113, or just over 21 percent, and this doesn’t include the children who are
12 or under.
[Figure: pie chart, Murder Offenders by age, with one sector per age group]
Now, let’s continue to address this problem again. Numbers never lie. But rearranged,
could they deceive? Could the very same numbers be used by the opposing side of the
argument to make the opposing view more viable? As said, first, we re-arrange the
numbers.
Age, in years    Number
1 to 19          14 + 454 + 1,695 = 2,163
20 to 39         2,767 + 1,571 + 992 + 855 = 6,185
40 to 59         645 + 455 + 272 + 158 = 1,530
60 and over      85 + 59 + 37 + 54 = 235
If we consider only the murders where we know the age of the murderer (there were
10,113 of these murders), this gives us the following pie chart:
[Figure: pie chart, Murderers by age, 2001, with sectors 1 to 19, 20 to 39, 40 to 59, 60 and over]
But we are trying to represent the opposing point of view, and we are trying to show
that murder by juveniles is not “chronic.” So we note that, in 2001, there were an
additional 5,375 murders where the age of the perpetrator was unknown. Regrouping, our
table looks like:
Age, in years    Number
1 to 19           2,163
20 to 39          6,185
40 to 59          1,530
60 and over         235
Unknown           5,375
And the pie chart now looks like:
[Figure: pie chart, Murderers, by age, 2001, with sectors 1 to 19, 20 to 39, 40 to 59, 60 and over, unknown]
Notice how much smaller the piece of the pie for the 1 to 19 year old segment now is
compared to the whole. This is a significant difference from the above pie chart, where we
did not factor in the murders committed by people of unknown ages. To further enhance
our argument, we may construct the slices of the pie with a 3-dimensional representation.
We then angle the segment of the pie we are trying to hide so that it stands out
less. Now, our point that juvenile crime is not a chronic problem seems more justified to
the viewer’s eye.
[Figure: 3-dimensional pie chart, Murderers by age, 2001, with sectors 1 to 19, 20 to 39, 40 to 59, 60 and over, unknown]
To add the final touch to enhance our argument, let’s re-categorize, changing two
groupings: 1 to 19 and 20 to 39 become 1 to 16 and 17 to 39. If we keep the category of
unknown murderers in the groupings, let’s compare the original pie chart with the final
one. To the naked eye, a quick glance reveals a sliver on the left chart compared to
nearly a quarter of the pie on the right.
[Figure: two pie charts side by side. Left: Murderers by age, 2001, with sectors 1 to 16, 17 to 39, 40 to 59, 60 and over, unknown. Right: Murderers, by age, 2001, with sectors 1 to 19, 20 to 39, 40 to 59, 60 and over]
Statistics don’t lie, but they can be re-arranged to show whatever is on one’s agenda.
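How much each regrouping shrinks the youngest slice can be measured directly. The sketch below uses only the counts given above (the four known age groups, the 5,375 offenders of unknown age, and the 14 + 454 offenders aged 16 or under); the variable and function names are ours.

```python
def percent(part, whole):
    """Share of the whole, as a percent."""
    return 100 * part / whole

# Counts from the regrouped tables in the text.
known = {"1 to 19": 2163, "20 to 39": 6185, "40 to 59": 1530, "60 and over": 235}
unknown = 5375
ages_1_to_16 = 14 + 454  # the 9 to 12 and 13 to 16 groups; none were younger

known_total = sum(known.values())
grand_total = known_total + unknown

# Same offenders, three different presentations of the youngest slice.
share_known_only = percent(known["1 to 19"], known_total)
share_with_unknown = percent(known["1 to 19"], grand_total)
share_regrouped = percent(ages_1_to_16, grand_total)

print(f"1 to 19, known ages only:       {share_known_only:.1f}%")
print(f"1 to 19, unknown ages included: {share_with_unknown:.1f}%")
print(f"1 to 16, unknown ages included: {share_regrouped:.1f}%")
```

No count changes between the three lines of output. Only the denominator and the category boundary move, which is exactly how the same data can be made to support either side.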
Problem 2
The graph below is shown and a TV anchor man states, “There was a sharp dramatic
increase in drunk driving convictions between the year 1999 and the year 2000.”
Consider the statement and reply to its accuracy.
Solution
According to the figure, the actual increase in drunk driving convictions between 1999
and 2000 was 12, up to 732 from 720 the year before. Though this is an increase, it
cannot be considered a “sharp dramatic increase”. Evaluating the data in another way, the
percent increase, (12/720) × 100 ≈ 1.7%, is not significantly sharp or particularly
dramatic. The anchor man was over-dramatizing the report; the words were inflammatory,
bordering on misleading.
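The percent increase used in this solution is the change divided by the starting value, times 100. A minimal sketch, with a helper name of our own choosing:

```python
def percent_increase(old, new):
    """Change from old to new, as a percent of old."""
    return (new - old) / old * 100

# Drunk driving convictions read from the figure in the problem.
convictions_1999 = 720
convictions_2000 = 732

increase = percent_increase(convictions_1999, convictions_2000)
print(f"increase: {convictions_2000 - convictions_1999} convictions, "
      f"or {increase:.1f}%")
```

Reporting the raw change and the percent change side by side makes it harder for a truncated vertical axis to exaggerate the jump.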
Problem 3 Drawing Inferences
A bucket has small green balls, medium blue balls, large pink balls, and very large red
balls. A child picks ten balls, selecting each randomly so each ball is equally likely to be
selected. Four such trials were conducted. Which trial most closely resembles the
theoretical probability that would occur if the balls were selected randomly ten times?
a)
Balls             Number of Balls Selected
Small Green       2
Medium Blue       2
Large Pink        2
Very Large Red    4
b)
Balls             Number of Balls Selected
Small Green       3
Medium Blue       2
Large Pink        3
Very Large Red    2
c)
Balls             Number of Balls Selected
Small Green       2
Medium Blue       3
Large Pink        2
Very Large Red    3
d)
Balls             Number of Balls Selected
Small Green       3
Medium Blue       2
Large Pink        4
Very Large Red    1
Solution
First, we need to calculate the theoretical probability for each type of ball. Recall, the
probability is the number of successful outcomes divided by the total number of
outcomes. The total number of balls is 20. There are 6 small green ones, 4 medium blue
ones, 7 large pink ones, and 3 very large red ones. Below are the theoretical
probabilities:
Balls             Prob.
Small Green       6/20
Medium Blue       4/20
Large Pink        7/20
Very Large Red    3/20
If ten balls were selected, we could anticipate 3 out of 10 balls to be small green ones, 2
out of 10 to be medium blue ones, 3.5 out of 10 to be large pink ones and 1.5 out of 10 to
be very large red ones. This exact outcome is impossible, since balls cannot be selected
in halves, so we look for the trial closest to these expected results. Choice b) matches
the green and blue counts exactly and is off by only half a ball for pink and for red.
Note that choice d) is off by the same amounts in the opposite direction, so by this
measure it is equally close.
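The comparison of each trial with the expected counts can be made explicit. The sketch below scores each trial by its total absolute deviation from the expected counts; that scoring rule is our assumption, since the text does not name a measure. Under it, choices b) and d) both come out 1.0 away from the expected counts, so they tie.

```python
# Bucket contents and number of draws, as given in the solution.
bucket = {"small green": 6, "medium blue": 4, "large pink": 7, "very large red": 3}
draws = 10
total_balls = sum(bucket.values())

# Expected count of each type in ten draws: draws x (count / total).
expected = {color: draws * n / total_balls for color, n in bucket.items()}

# The four trials from the problem statement.
trials = {
    "a": {"small green": 2, "medium blue": 2, "large pink": 2, "very large red": 4},
    "b": {"small green": 3, "medium blue": 2, "large pink": 3, "very large red": 2},
    "c": {"small green": 2, "medium blue": 3, "large pink": 2, "very large red": 3},
    "d": {"small green": 3, "medium blue": 2, "large pink": 4, "very large red": 1},
}

def total_deviation(observed):
    """Sum of |observed - expected| over all ball types (our chosen measure)."""
    return sum(abs(observed[color] - expected[color]) for color in expected)

deviations = {label: total_deviation(observed) for label, observed in trials.items()}
smallest = min(deviations.values())
closest = sorted(label for label, d in deviations.items() if d == smallest)

print("expected:", expected)
print("deviations:", deviations)
print("closest trial(s):", closest)
```

Other measures, such as the sum of squared deviations, give the same tie here because b) and d) miss the pink and red counts by the same half ball in opposite directions.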
Problem Set
1. In the year 2000, a state lottery distributed its $2 million proceeds in the following
manner:
Beneficiary        Proceeds
Education          $900,000
Cities             $500,000
Highways           $200,000
Senior Citizens    $200,000
Libraries          $160,000
Other              $140,000
Draw a circle graph to represent this data.
2. The graph below shows the company’s profits in its first four years of existence.
What’s wrong with this statement: “There was a substantial increase in the company’s
profit in its first 4 years of existence”?
3. Poll your classmates as to the most important ‘hot button’ campaign issue. Create a
table as you see below.
Topic                    Frequency    Relative Frequency    Density
Terror
Racial Discrimination
Abortion
Death Penalty
Drugs
Education
Construct a histogram and a pie chart for the data.
For problems 4 and 5, use the following data from the US Census Bureau. In 1999, there
were roughly 280,000,000 US citizens, and 35,000,000 lived in poverty. Of these 35
million, 12,100,000 were children, where 4,500,000 of these children lived in families
who were under one-half of the poverty level. The poverty level was defined as $ 13,290
per family of three. For each problem, construct a circle graph as designated below.
4. Draw a circle graph whose population is the citizens of the United States. Section the
circle graph into two sectors, one sector representing the US citizens who live above the
poverty level, one sector representing the US citizens who live below the poverty level.
5. Draw a circle graph whose population is those citizens who live below the poverty
level. Section the circle graph into three sectors, one sector representing the adults who
live below the poverty level, one sector representing the children who lived in families
who lived under one-half the poverty level and the third sector is all of the other children
who lived below the poverty level.
6. Project: Circle graphs, drawing inferences
Sometimes we choose to see what we want to see. We all stretch the truth, exaggerate
what we need, ignore what hurts us and to what end, personal wealth at the expense of
personal worth? From the US Census Bureau, 2000: Child poverty in America dropped from
13.5 million children in 1998 to 12.1 million in 1999. With a whisper of optimism, we
rationalize that this improvement was great. Was it?
Do you ever have trouble focusing on exams or concentrating on homework
assignments? How hard do you think it would be to concentrate on your exams,
homework, or even your instructor's lectures if your family didn't have enough money to
feed you? What if you were in poor health and your family couldn't afford to take you to
the doctor or provide the medicine you need? The bitter truth is that in 2000, 12,100,000
children in America were living in poverty and confronted these challenges every day.
If a family of three were living below the poverty line in 2000, they had an income
below $13,290 a year. Living in poverty can translate to residing in crowded housing,
having your utilities turned off, not owning a phone, or refrigerator or car, not having
enough food to feed your family, not enough medicine to heal your loved ones. And the
heart-wrenching statistic is that 4.5 million children live in families that exist below
one-half of the official poverty level.
Do we have your attention? Are you gasping in proper reverence? You should be.
Particularly because in 2000, America was experiencing one of its greatest flashes of
economic prosperity. Business was skyrocketing, and people were spending. But, was
just a minute percentage of Americans benefiting from this new wealth? Ironically, in
2000, the unemployment rate in the U.S. was lower than it had been in years, but the
percentage of poor children in working families was soaring. There were many
possibilities to explain this phenomenon, but “Some economists (said) that if wages had
kept pace with the cost of living since the 1960s, the minimum wage would (have been)
between $12 and $14 dollars” (CNN.com). Instead, the minimum wage was $5.15.
Assignment Go to the US Census Bureau. Find out how many children there were in
the US in 2000. Construct a circle graph with the following categories: Children who
lived above the poverty level, children who lived below the poverty level. Draw separate
sections of the circle graph for those children who lived above $ 6,645 a year (half of the
poverty level of $13,290 a year) and those who lived below $ 6,645 per year. Also,
include a section of the graph for those children who lived in the upper 1 % of the income
bracket and determine what that income level was. Then tackle the following
questions:
a. Do you think there is a positive, negative or no correlation between concentrating in
high school and graduating from high school? Is it a causation relationship? Why?
b. Do you think there is a positive, negative or no correlation between concentrating in
high school and getting into college? Is it a causation relationship? Why?
c. Do you think there is a positive, negative or no correlation between concentrating in
high school and acquiring well-paying jobs? Is it a causation relationship? Why?
d. Do you think there is a positive, negative or no correlation between staying healthy
and having access to doctors and medicine? Is it a causation relationship? Why?
e. Do you think there is a positive, negative or no correlation between poverty and
crime? Is it a causation relationship? Why?
f. Do you think there is a positive, negative or no correlation between issues that
politicians and lawmakers have as a top priority and issues that affect those under 18,
who can not vote? Is it a causation relationship? Why?
For problems 7-9, use the following data. Source: http://www.ucdmc.ucdavis.edu/vprp/Section6,2000.pdf
In 2000, the population of California was 33,871,648, and 134,227 Californians purchased
193,489 handguns: 103,743 people purchased one handgun; 28,453 people purchased
two to five handguns, totaling 71,363 handguns; 1,855 people purchased 6 to 12
handguns, totaling 14,053 handguns; and 176 people purchased more than 12 handguns,
for a total of 4,330 handguns.
7. Construct a circle graph representing the number of people who purchased handguns
in California in the year 2000. Separate the sectors into those who bought one gun,
two to five guns, 6-12 guns and more than 12 guns.
8. Construct a circle graph representing the number of handguns purchased by
Californians in the year 2000. Separate the sectors into those who bought one gun,
two to five guns, 6-12 guns and more than 12 guns.
9. Based on your results in problems 7 and 8, construct an argument pertaining to gun
control and argue one side of the argument based on these results.
For problems 10-15, use the following information: 5,000 years ago, forests covered
nearly 50% of the earth's land surface. Since the advent of humans, forest cover has fallen
to less than 20%. Forests serve as the lungs of our planet by providing the very oxygen
we breathe. The rate of deforestation is increasing, and although extinction is
nature’s way of selectively re-aligning our living world, this extinction, the most acute
since the dinosaurs, is not nature’s way. Humans have caused it, by themselves.
Source: RAN (Rainforest Action Network) and Myers (op. cit.). In Central
and South America, Bolivia, whose land area is 1,098,581 sq km, once had a
forest cover of 90,000 sq km and now has a forest cover of 45,000 sq km. Brazil, whose land
area is 8,511,960 sq km, once had a forest cover of 2,860,000 sq km and now has a forest
cover of 1,800,000 sq km. Central America has a land area of 522,915 sq km, once had a
forest cover of 500,000 sq km and now has a forest cover of 55,000 sq km. Colombia has a
land area of 1,138,891 sq km, once had a forest cover of 700,000 sq km and now has a forest
cover of 180,000 sq km. Ecuador’s land area is 270,670 sq km; it once had a forest cover of
132,000 sq km and now has a forest cover of 44,000 sq km. Mexico’s land area is
1,967,180 sq km; at one time its forest cover was 400,000 sq km and now its forest cover is
110,000 sq km.
10. For each country, construct a circle graph where the circle represents the land area of
the country. Divide each circle into two sectors, one for the country’s land area that was
once covered by forests and one for the land area that was not covered by forests at
that time.
11. For each country, construct a circle graph where the circle represents the land area of
the country. Divide each circle into two sectors, one for the country’s land area that is
currently covered by forests and one for the land area that is currently not covered by
forests.
12. For each country, construct a circle graph where the circle represents the original
extent of forest cover. Divide the circle into two sectors, one for the existing land area
covered by forests and one for the land area lost to deforestation.
13. Construct a circle graph that represents the total land area for Bolivia, Brazil, Central
America, Colombia, Ecuador and Mexico. Divide the circle graph into twelve sectors,
two for each country, where one sector represents the land area currently covered by
forests and the other the land area currently not covered by forests.
14. Construct a circle graph that represents the total land area for Bolivia, Brazil, Central
America, Colombia, Ecuador and Mexico. Divide the circle graph into twelve sectors,
two for each country, where one sector represents the land area that was once covered by
forests and the other represents the land area that was not covered by forests at that time.
15. After assimilating the information and viewing the circle graphs from problems 10-14,
provide an argument, either pro or con, with regard to the following statement:
“Deforestation of the rain forests in Central and South America is threatening the local
environment as well as the global environment. It should be a hot-button issue in today’s
society.”
16. Project: Below is another table taken from the FBI Crime List in 2001. Using circle
charts (pie charts), take an issue you glean from the table provided and show both sides
of the argument, using pie charts to visually sway your reader. Outline the issue, show
the supporting table(s) and pie chart(s). Discuss the benefits and harm of such practices.
Murder Victims (1) by Race and Sex, 2001

                                        Sex of victims
Race of victim              Total      Male    Female   Unknown
Total white victims         6,750     4,785     1,962         3
Total black victims         6,446     5,350     1,095         1
Total other race victims      368       245       123         0
Total unknown race            188       123        34        31
Total victims (2)          13,752    10,503     3,214        35

(1) The murder and nonnegligent homicides that occurred as a result of the events of
September 11, 2001, were not included in any murder tables (Tables 2.3-2.15).
See special report, Section V.
(2) Total number of murder victims for whom supplemental homicide data were received.
Measures of Central Tendency
Finding a number that best represents a set of data is important to you right now,
because your choice of the “representative” number that best indicates your grade can
determine your course grade. Mathematicians say that the number that is going to
serve as the spokesperson for the data should reflect the center or the
middle of the data.
Usually we begin by averaging the numbers to find that representative number; we find
the sum and then divide by the number of data points. But, if the data consists of exam
scores and you earned a 95, 95, 95, 95, 95, and a 45, then your average is found with two
calculations, 95 + 95 + 95 + 95 + 95 + 45 = 520 and 520/6 = 86.7. This means the center
of your data, or the letter grade that best represents your data, is a B according to
your average, and yet you never once earned a B. In fact, you earned only A’s, except for
one failing grade. You pause, because clearly you earned 5 grades of a 95 and just one
grade of a 45. This must count for something, right? The data value that appears the most, the
95, is called the mode, and it is simply another representation of the tendency of the data.
Now that we see there is more than one way to refer to the center of the data, let’s begin
with perhaps a more realistic example. Suppose you had the following exam
scores: 60, 80, 60, 70, 80, 80, 90, and 95. You’re thinking perhaps you deserve an A
because your last two grades were A’s. Or at the very least, you deserve a B. You begin
by finding your average or mean, which is the sum of the scores divided by the number
of scores. First you add the scores: 60 + 80 + 60 + 70 + 80
+ 80 + 90 + 95 = 615. You had 8 exams, so the average is found by dividing 615 by 8;
615/8 = 76.9, or a C. Uh oh. You change your strategy. You argue that since you scored an
80 three times, you deserve a B. The mode is the data value that occurs most frequently, and
your mode is an 80. Does this help your argument? Well, one more indication of the
middle of your data is the middle value when you arrange the numbers in order, either from
top to bottom or from bottom to top. So, we arrange our data as 60, 60, 70, 80, 80, 80,
90, and 95. The data value that occurs in the middle is called the median, like the median of
the highway. If there is an odd number of data, the median will be the single value in the
middle. If there is an even number of data, there will be two numbers in the physical middle
of the data, and when this occurs, you need to average the two middle numbers. For us, there
are two 80’s smack in the middle of the data, another indication you deserve a B. Now, perhaps
the last possible argument you may use to justify that you are a B student is that your last four
exam grades, 80, 80, 90 and 95, showed you were more of a B student than a C student at
the end of the course. So, despite having an average or mean of 76.9, your mode and
median scores were an 80, and your grades in the second half of the semester were
certainly not indicative of a C student. What grade should you get? What grade did you
earn?
Real Estate You meet with a real estate agent and carefully explain to the agent the
price range of the homes you are interested in seeing. The agent taps away on their
computer and tells you they completely understand what you want, that you
are looking for homes in the $130,000 to $160,000 range. You nod your head in
agreement. The agent informs you they have found three neighborhoods where the mean
(average) values of the houses are $128,571, $136,786 and
$161,429. Each subdivision is small, just like you prefer, with 14 homes. They explain
the need for you to sign an exclusive right to buy agreement before they take you out.
Impressed with both the immediacy and the detail provided, you quickly sign the
agreement. The real estate agent takes you out for the day. After cozying into the front
seat of their car, you sit back and enthusiastically await what should prove to be a
worthwhile day of house hunting. By the end of the day, you are shaking your head
from side to side, not nodding, and you are straining to think of ways to break the stupid
exclusive right to buy agreement you signed earlier. What happened? Let’s see.
The three subdivisions you saw:
House            Sleepy Brook    Vista View    Meadowlands
1                     205,000       130,000        300,000
2                     400,000       130,000        400,000
3                     500,000       135,000        400,000
4                      80,000       140,000        500,000
5                      70,000       150,000         65,000
6                      60,000       125,000         70,000
7                      80,000       125,000         70,000
8                      80,000       125,000         65,000
9                     100,000       125,000         65,000
10                    100,000       125,000         65,000
11                     60,000       125,000         65,000
12                     60,000       125,000         65,000
13                     60,000       120,000         65,000
14                     60,000       120,000         65,000
Average Value         136,786       128,571        161,429
Which subdivision was closest to your liking? Well, clearly Vista View is the only
subdivision that even had homes in your price range, with 5 of the 14 homes within your
price range. But this seemed the least likely subdivision, because its average home value
was a little below your range. Visiting the other two subdivisions was useless; they
had no homes in your range. The agent never checked the values of the homes in the
three subdivisions; they only checked the average value of the homes. To cut the agent
some slack, checking 3 subdivisions with 14 homes each would have been a lengthy
endeavor, because each home would have needed to be accessed individually on the
computer screen. Remember, the agent wanted to impress you with their quick research.
Still, the oversight occurred because you did not have enough information about the
data. Measures of Central Tendency inform us as to the behavior of the middle of
the data, without the need to see every tedious piece of data. Since pulling up each
home would have been too time consuming (42 homes), what other pieces of information
could have been pertinent so that you would have known that only Vista View was worth
visiting?
Mean, Median, Mode and Range. The mean or average value for a set of data is the
average most of us are familiar with, where we take the sum of the data and divide the
sum by the number of pieces of data.
For Sleepy Brook:
$$\frac{205{,}000 + 400{,}000 + 500{,}000 + 3(80{,}000) + 70{,}000 + 5(60{,}000) + 2(100{,}000)}{14} \approx \$136{,}786$$
For Vista View:
$$\frac{2(130{,}000) + 135{,}000 + 140{,}000 + 150{,}000 + 7(125{,}000) + 2(120{,}000)}{14} \approx \$128{,}571$$
For Meadowlands:
$$\frac{300{,}000 + 2(400{,}000) + 500{,}000 + 8(65{,}000) + 2(70{,}000)}{14} \approx \$161{,}429$$
But, clearly, this was not enough information about the middle of the data. What else
could have helped us? Well, in the Meadowlands subdivision, 8 of the 14
homes were worth $65,000, one-half of the lower limit of our price range. This would
have been helpful to know. The mode is the piece of data that shows up the most
frequently. So, in the Meadowlands, the mode is 65,000, occurring 8 times. For Vista
View, the mode is 125,000, occurring 7 times. This mode is close to our price range. And
Sleepy Brook? Its mode was much lower, 60,000, occurring 5 times.
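The means and modes for the three subdivisions can be reproduced with a few lines of Python. This is only a sketch: the dictionary `home_values` is an illustrative name, and the lists are built from the home values given in the table of the three subdivisions.

```python
from statistics import mode

# Home values per subdivision, grouped by repeated price (14 homes each).
home_values = {
    "Sleepy Brook": [205_000, 400_000, 500_000] + 3 * [80_000] + [70_000]
                    + 5 * [60_000] + 2 * [100_000],
    "Vista View": 2 * [130_000] + [135_000, 140_000, 150_000]
                  + 7 * [125_000] + 2 * [120_000],
    "Meadowlands": [300_000] + 2 * [400_000] + [500_000]
                   + 8 * [65_000] + 2 * [70_000],
}

summary = {}
for name, values in home_values.items():
    average = sum(values) / len(values)   # mean: sum divided by count
    most_common = mode(values)            # mode: most frequent value
    summary[name] = (round(average), most_common)
    print(f"{name}: mean ${average:,.0f}, mode ${most_common:,}")
```

Putting the mean next to the mode makes the mismatch visible at a glance: in Sleepy Brook and Meadowlands the most common price sits far below the average.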
What other tendency of the data would have been helpful? How many homes are not in
our price range would be too easy of an answer, huh? Well, if we order the data from
smallest to largest (or largest to smallest), the middle of the data is called the median.
We use the image of a median on a highway to remember the name, because the median
on a highway physically divides the highway in half. Our median does the same thing. If
we order our data, then:
House     Sleepy Brook    Vista View    Meadowlands
1               60,000       120,000         65,000
2               60,000       120,000         65,000
3               60,000       125,000         65,000
4               60,000       125,000         65,000
5               60,000       125,000         65,000
6               70,000       125,000         65,000
7               80,000       125,000         65,000
8               80,000       125,000         65,000
9               80,000       125,000         70,000
10             100,000       130,000         70,000
11             100,000       130,000        300,000
12             205,000       135,000        400,000
13             400,000       140,000        400,000
14             500,000       150,000        500,000
Median          80,000       125,000         65,000
Note that if there is an odd number of data, then the median is a single piece of data. If
there is an even number of data, the median we are seeing is the average of the middle
two items. How would knowing the median have been helpful? Well, if we knew the
median for Meadowlands, then we would have known that one-half of the homes in the
subdivision, that is, 7 of the homes, were $65,000 or below. To keep an average of
$161,429, many of the other homes would have needed to be too expensive for us,
leaving, at best, a few homes in our range. It turned out there
were no homes in our range.
Which leads us to the dispersion of the data. Dispersion means spreading, scattering or
distribution. We can address these different words with different measures.
The range is the difference between the largest and the smallest data point.
For Sleepy Brook, $500,000 - $60,000 = $440,000 or, more realistically, there is a large
difference between the cheapest and the most expensive home in the subdivision. For
Vista View, $150,000 - $120,000 = $30,000, and this tells us all the homes are at least
close to our price range. Meadowlands has the problem Sleepy Brook had; the range is
$500,000 - $65,000 = $435,000.
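The medians and ranges above can be verified the same way. The sketch below assumes the same home values as the subdivision table; `home_values` and `spread` are illustrative names.

```python
from statistics import median

# Home values per subdivision, grouped by repeated price (14 homes each).
home_values = {
    "Sleepy Brook": [205_000, 400_000, 500_000] + 3 * [80_000] + [70_000]
                    + 5 * [60_000] + 2 * [100_000],
    "Vista View": 2 * [130_000] + [135_000, 140_000, 150_000]
                  + 7 * [125_000] + 2 * [120_000],
    "Meadowlands": [300_000] + 2 * [400_000] + [500_000]
                   + 8 * [65_000] + 2 * [70_000],
}

spread = {}
for name, values in home_values.items():
    middle = median(values)                    # average of the 7th and 8th sorted values
    value_range = max(values) - min(values)    # largest minus smallest
    spread[name] = (middle, value_range)
    print(f"{name}: median ${middle:,.0f}, range ${value_range:,}")
```

A small range together with a median inside the price range, as in Vista View, is the pattern that would have flagged a subdivision worth visiting.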
To measure the scattering and the distribution of even larger samples of data, we will
examine standard deviations a little later. But first, let’s look at mean, median, mode
and range a little longer.
Problem One
Below are the traffic fatalities per 100 million (10^8) vehicle miles in 2001. Source: U.S.
National Highway Traffic Safety Administration. Rank the states and the District of
Columbia in ascending order. Then find the mode, median, mean and range. Discuss the
relevance of the numbers. This means if any two correspond closely, look at the data and
tell why. If any state is far from the middle of the data, call it an outlier.
State                  Rate      State              Rate
Alabama                1.75      Montana            2.30
Alaska                 1.80      Nebraska           1.36
Arizona                2.06      Nevada             1.71
Arkansas               2.08      New Hampshire      1.15
California             1.27      New Jersey         1.09
Colorado               1.71      New Mexico         1.99
Connecticut            1.01      New York           1.18
Delaware               1.58      North Carolina     1.67
District of Columbia   1.81      North Dakota       1.45
Florida                1.93      Ohio               1.29
Georgia                1.50      Oklahoma           1.55
Hawaii                 1.61      Oregon             1.42
Idaho                  1.84      Pennsylvania       1.49
Illinois               1.37      Rhode Island       1.01
Indiana                1.27      South Carolina     2.27
Iowa                   1.49      South Dakota       2.00
Kansas                 1.75      Tennessee          1.85
Kentucky               1.83      Texas              1.72
Louisiana              2.32      Utah               1.25
Maine                  1.33      Vermont            0.96
Maryland               1.27      Virginia           1.27
Massachusetts          0.90      Washington         1.21
Michigan               1.34      West Virginia      1.91
Minnesota              1.06      Wisconsin          1.33
Mississippi            2.18      Wyoming            2.16
Missouri               1.62

Answer:

State            Rate   Rank      State                  Rate   Rank
Massachusetts    0.90   50        Delaware               1.58   24
Vermont          0.96   49        Hawaii                 1.61   23
Connecticut      1.01   47        Missouri               1.62   22
Rhode Island     1.01   47        North Carolina         1.67   21
Minnesota        1.06   46        Colorado               1.71   19
New Jersey       1.09   45        Nevada                 1.71   19
New Hampshire    1.15   44        Texas                  1.72   18
New York         1.18   43        Alabama                1.75   16
Washington       1.21   42        Kansas                 1.75   16
Utah             1.25   41        Alaska                 1.80   15
California       1.27   37        District of Columbia   1.81   (X)
Indiana          1.27   37        Kentucky               1.83   14
Maryland         1.27   37        Idaho                  1.84   13
Virginia         1.27   37        Tennessee              1.85   12
Ohio             1.29   36        West Virginia          1.91   11
Maine            1.33   34        Florida                1.93   10
Wisconsin        1.33   34        New Mexico             1.99   9
Michigan         1.34   33        South Dakota           2.00   8
Nebraska         1.36   32        Arizona                2.06   7
Illinois         1.37   31        Arkansas               2.08   6
Oregon           1.42   30        Wyoming                2.16   5
North Dakota     1.45   29        Mississippi            2.18   4
Iowa             1.49   27        South Carolina         2.27   3
Pennsylvania     1.49   27        Montana                2.30   2
Georgia          1.50   26        Louisiana              2.32   1
Oklahoma         1.55   25

Mode     1.27
Median   1.55
Mean     1.57
The mean and median are close; this means the number in the middle of the data and the
average are close together. The number of states that rank above and below the average
and the number of states that rank above and below the middle entry, Oklahoma (1.55), are
nearly the same. So, the data is not top or bottom heavy. Yet, this doesn’t mean the data is
dispersed evenly. Why?
Let’s see why.
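One way to check the answer to Problem One is to feed the 51 ranked rates to Python's `statistics` module. This is a sketch for verification only; the list `rates` is an illustrative name holding the values copied from the ranked answer above.

```python
from statistics import mean, median, mode

# Fatalities per 100 million vehicle miles, 2001, in ascending order
# (50 states plus the District of Columbia).
rates = [
    0.90, 0.96, 1.01, 1.01, 1.06, 1.09, 1.15, 1.18, 1.21, 1.25,
    1.27, 1.27, 1.27, 1.27, 1.29, 1.33, 1.33, 1.34, 1.36, 1.37,
    1.42, 1.45, 1.49, 1.49, 1.50, 1.55, 1.58, 1.61, 1.62, 1.67,
    1.71, 1.71, 1.72, 1.75, 1.75, 1.80, 1.81, 1.83, 1.84, 1.85,
    1.91, 1.93, 1.99, 2.00, 2.06, 2.08, 2.16, 2.18, 2.27, 2.30,
    2.32,
]

rate_mode = mode(rates)                # most frequent rate
rate_median = median(rates)            # 26th of the 51 sorted values
rate_mean = round(mean(rates), 2)      # average, rounded to two decimals
rate_range = round(max(rates) - min(rates), 2)

print(f"mode {rate_mode}, median {rate_median}, mean {rate_mean}, range {rate_range}")
```

The range, which the answer does not list, is the first hint that the rates are spread fairly widely even though the mean and median nearly agree.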
Problem Set
For problems 1-6 below, find the mean, median and mode for the data.
1. 1, 3, 4, 4, 4, 5, 5, 6
2. 3, 3, 4, 4, 4, 5, 5, 6
3. 3, 3, 3, 4, 4, 5, 5, 6
4. 3, 3, 3, 4, 5, 5, 5, 6
5. 1, 1, 1, 1, 2, 2, 6, 6
6. 1, 1, 2, 2, 6, 6, 6, 6
7. What is the median time it took for the students to write the exam?
Student ID Number    Time to Take Exam
4025                 1:25
1026                 1:09
8790                 0:59
1029                 0:45
2943                 1:01
2020                 1:12
2084                 1:25
5091                 1:31
7812                 0:49
5103                 2:00
6092                 1:42
8. Below is the year and the percent of children under the age of 4 in a city that attended
Day Care.
Year    Percent
1970    15
1972    17
1974    15
1976    16
1978    18
1980    17
1982    21
1984    31
1986    12
1988    15
1990    16
1992    17
1994    15
1996    12
What is the mode for this set of data?
9. From the US Census Bureau, 1999, below are the percents of elderly persons, 65 years
and over, that live below the poverty level in each state. Rank the states and
the District of Columbia in ascending order. Then find the mode, median, mean and
range. Discuss the relevance of the numbers. This means if any two correspond closely,
look at the data and tell why. If any state is far from the middle of the data, call it an
outlier.
State                  Percent      State              Percent
Alabama                15.5         Montana            9.1
Alaska                 6.8          Nebraska           8.0
Arizona                8.4          Nevada             7.1
Arkansas               13.8         New Hampshire      7.2
California             8.1          New Jersey         7.8
Colorado               7.4          New Mexico         12.8
Connecticut            7.0          New York           11.3
Delaware               7.9          North Carolina     13.2
District of Columbia   16.4         North Dakota       11.1
Florida                9.1          Ohio               8.1
Georgia                13.5         Oklahoma           11.1
Hawaii                 7.4          Oregon             7.6
Idaho                  8.3          Pennsylvania       9.1
Illinois               8.3          Rhode Island       10.6
Indiana                7.7          South Carolina     13.9
Iowa                   7.7          South Dakota       11.1
Kansas                 8.1          Tennessee          13.5
Kentucky               14.2         Texas              12.8
Louisiana              16.7         Utah               5.8
Maine                  10.2         Vermont            8.5
Maryland               8.5          Virginia           9.5
Massachusetts          8.9          Washington         7.5
Michigan               8.2          West Virginia      11.9
Minnesota              8.2          Wisconsin          7.4
Mississippi            18.8         Wyoming            8.9
Missouri               9.9
Standard Deviation and the Normal Distribution
According to a study done by the National Center for Health Statistics, Mean Body
Weight, Height and Body Mass Index, United States 1960-2002, American men (ages
20-74) were 25 pounds heavier in 2002 than they were some 42 years earlier in 1960. In
2002 the average American male weighed 191 pounds, up from his 1960 counterpart who
weighed 166 pounds. American women from the same age group followed the same
trend: the average American woman weighed 164 pounds in 2002, up 24 pounds from the
average American woman from 1960, who weighed 140 pounds. This study caused quite
a stir, as nutritionists and diet doctors clamored together to seek solutions. And as you
can imagine, the dangers of obesity were revisited when this study was broadcast. In
turn, even the average male height increased over the 42 year span, from 5’ 8” in 1960
to 5’ 9 ½” in 2002. And as expected, the average height for the American woman also
increased, from 5’ 3” to 5’ 4”. The study was done on a smaller representation of the true
population; it was performed on thousands of people, while in reality the population of
American men and women totals in the hundreds of millions. Since these numbers are so
large, we assume the data to be normally distributed around the mean, or average. A
normal distribution for a set of data means that there is more data close to the
"average," and less data farther from the average, until finally relatively few data
points tend to one extreme or the other. The data is symmetrically distributed around
the average.
This is common sense, or mathematical intuition. They are, after all, close to being one
and the same. Let’s say you are writing a story about the height distribution of the
American male in 2002 because you are trying to correlate it to ethnicity, diet or genes.
First you take the population, in this case, those people who participated in the study, and
tally up the number of people for each given height. Like most data, if the sample or
population is large enough, the heights for the population turn out to be normally
distributed. This means most people will be of average height or close to average height.
In other words, the average height also will be the height that occurs most frequently in our
population and the height smack in the middle of the data when it is ordered. Thus, the
mean will be the mode and the median too. Next, if a population is normally distributed and
you plot each height in increasing order, the number of men for a given height is
symmetrically distributed around the average height. In other words, there will be more
people close to average height than far from the average height. In 2002, the average
height of the American male was 5’ 9 ½”, and this height will occur most frequently.
Then, for our normally distributed society which we aptly call the American male, the
next most common heights are 5’ 9” and 5’ 10”, both heights ½ inch away
from the mean height. Next, the most common heights would be expected to be
5’ 8” and 5’ 11”. And so on. We expect fewer and fewer men to have a designated
height as we move further from the average height. Intuitively, this fits our preconceived
notion of our society; we expect to find fewer men who are 6’ 5” than men who
are 5’ 11”, for instance. And continuing, this means there will be fewer men who are five
foot than 5’ 7” and fewer who are 6 ½ feet than 5’ 11”. If the heights do fit a normal
distribution, then the heights are distributed symmetrically around the average height.
In short, more people will be closer to the average height than farther from it, and usually
the distribution is normally distributed around the average; hence the words normal
distribution.
If you looked at normally distributed data on a graph, it would look something like this:
The x-axis (the horizontal one) is the value in question, the population’s height for
example. The y-axis (the vertical one) is the number of data points for each value on the
x-axis, that is, the number of people that are that certain height.
The standard deviation is a measure of how widely values are dispersed from the mean
(average value). For populations where the data points are tightly bunched together, the
bell-shaped curve is steep and the standard deviation is small. For populations where the
data points spread further apart from the average, the bell curve is flatter and the standard
deviation is larger.
68-95-99.7 To refine our understanding of a standard deviation, we turn our attention to
a graph. In a moment we will show you the calculation for the standard deviation. Right
now, we want to present a conceptual understanding for the term ‘standard deviation.’
Recall, in 2002, the American male had a mean height of 5’ 9 ½”. The standard
deviation was 2 3/8”.
For a normal distribution, one standard deviation (in red above) away from the mean in
both directions on the horizontal axis will account for approximately 68 % of the
population. There are two heights that are 2 3/8 inches from 5’ 9 ½”: the smaller, 5’ 7
1/8” (5’ 9 ½” - 2 3/8”), and the larger, 5’ 11 7/8” (5’ 9 ½” + 2 3/8”). Thus, about 68 % of
American men in 2002 were between 5’ 7 1/8” and 5’ 11 7/8”.
All data found within two standard deviations (in red and green above) from the mean
will account for roughly 95 % of the normally distributed population, or the American
men. The two heights two standard deviations away from the mean are found with two
predictable calculations. We first subtract two standard deviations from the mean, giving
us 5’ 9 ½” - 2 3/8” - 2 3/8” = 5’ 4 ¾”. We then add two standard deviations to the
mean, giving 5’ 9 ½” + 2 3/8” + 2 3/8” = 6’ 2 ¼”. So, about 95 % of American men in
2002 were between 5’ 4 ¾” and 6’ 2 ¼”. Recall, the heights of these 95 % of the
population are evenly distributed around the mean.
Data found within three standard deviations of the mean (the red, green and blue areas)
account for about 99.7 % of normally distributed populations. So, in 2002, 99.7 % of
American men were between 5’ 2 3/8” (5’ 9 ½” - 2 3/8” - 2 3/8” - 2 3/8”) and 6’ 4
5/8” (5’ 9 ½” + 2 3/8” + 2 3/8” + 2 3/8”). From a different perspective, one could
infer that in 2002, those American men who were more than three standard deviations
away from the mean either were shorter than 5’ 2 3/8” or taller than 6’ 4 5/8”. Since they
represented only 0.3 % of American males, they were considered unusually short or tall by our
population’s standards.
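The height bands above are just the mean plus or minus one, two or three standard deviations. The sketch below redoes that arithmetic in inches; the constant and function names are illustrative, and inches are printed as decimals rather than fractions (7.125 is 7 1/8).

```python
MEAN_HEIGHT = 69.5   # 5' 9 1/2" expressed in inches
STD_DEV = 2.375      # 2 3/8" expressed in inches


def band(k):
    """Heights k standard deviations below and above the mean, in inches."""
    return MEAN_HEIGHT - k * STD_DEV, MEAN_HEIGHT + k * STD_DEV


def feet_and_inches(total_inches):
    """Format a height in inches as feet and decimal inches."""
    feet, inches = divmod(total_inches, 12)
    return f"{int(feet)}' {inches:g}\""


for k, share in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    low, high = band(k)
    print(f"about {share}: {feet_and_inches(low)} to {feet_and_inches(high)}")
```

Working in inches avoids mixing feet and inches inside the subtraction, which is where slips tend to creep in when the calculation is done by hand.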
If a curve was flatter, the standard deviation would have to be larger in order to account
for those 68 percent and if the curve was steeper, the standard deviation would have to be
smaller to account for 68 percent of the population. Standard deviation tells you “how
spread out the data points in the population are from the mean.”
Why is this useful? Well, if you are comparing test scores for different schools, the
standard deviation will tell you how diverse the test scores are for each school. Let's say
Washington High School has a higher mean test score than Adams High School for the
mathematics portion of the statewide AIMS test administered in the state of Arizona to
measure the students’ understanding of high school mathematics. Our first reaction might
be to deduce that the students at Washington are either smarter or better educated by their
teachers.
You analyze the data further. The standard deviation, you find out, is larger at Washington
than at Adams. This means that at Washington there are relatively more kids
scoring toward one extreme or the other. By asking a few follow-up questions, you might
find that Washington’s average was higher because the school district sent all of the
gifted education kids to Washington. Or perhaps Adams’ scores were dragged down and
thus appeared bunched together because of all the students who recently have been
"mainstreamed" from special education classes. Perhaps the gifted classes were sent out
of district. In this way, looking at the standard deviation can help point you in the right
direction when asking why data is the way it is.
Example 1 You are trying to decide which teacher’s class to enroll in for Mathematics.
You go to a website that claims to have tracked three teachers’ success rates over the
past five years. The final grades for Mr. Allen’s students had a mean score of 76 with a
standard deviation of 5, Mrs. Bennett’s students had a mean score of 74 with a
standard deviation of 3, and Mrs. Clyde’s students had a mean score of 79 for the final
grade, with a standard deviation of 7. Whose class would you enroll in; that is, how
would you interpret the data on the web site? Rank the teachers from first to third, so that
if one’s section is full, you would know whose class to enroll in next.
Solution: We must quantify the exam scores to interpret the data. For Mr. Allen’s classes,
68 % of the students earned a final grade that was within 5 points of 76, so 68 % of the
students earned between 71 and 81. About 95 % scored within two standard
deviations of the mean, so 95 % of the students earned a final grade between 66 and 86.
Finally, 99.7 % of the students earned a final grade between 61 and 91. Continuing with
this thought process, Mrs. Bennett’s students had a lower final grade average, 74. But 68
% of her students earned a final grade between 71 and 77, 95 % earned a final grade
between 68 and 80 and 99.7 % earned a final grade between 65 and 83. Mrs. Clyde’s
students earned the highest average, but she had the 68 %, 95 % and 99.7
% bands spread farther apart: 72-86, 65-93 and 58-100 respectively. A quick table allows us
to compare the success rates of the three teachers:
          Mr. A    Mrs. B   Mrs. C
68 %      71-81    71-77    72-86
95 %      66-86    68-80    65-93
99.7 %    61-91    65-83    58-100
So, which teacher should you take? If you are a good student,
you have a better chance of securing an A with Mrs. Clyde first, Mr. Allen second and
Mrs. Bennett third. If you struggle at math, you probably would choose Mrs. Bennett
first because at least 99.7 % of her students earn above a 65. Mr. Allen would probably be your
second choice, Mrs. Clyde your third choice.
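The comparison table for the three teachers can be generated from the means and standard deviations alone. This is a minimal sketch; the dictionary `teachers` and the helper `bands` are illustrative names.

```python
# (mean final grade, standard deviation) for each teacher, from Example 1.
teachers = {
    "Mr. Allen": (76, 5),
    "Mrs. Bennett": (74, 3),
    "Mrs. Clyde": (79, 7),
}


def bands(mean, std_dev):
    """Score ranges within 1, 2 and 3 standard deviations of the mean."""
    return [(mean - k * std_dev, mean + k * std_dev) for k in (1, 2, 3)]


for name, (mean, std_dev) in teachers.items():
    one, two, three = bands(mean, std_dev)
    print(f"{name}: 68% in {one}, 95% in {two}, 99.7% in {three}")
```

The lower end of the three-standard-deviation band is the number a struggling student cares about, and the upper end is the one a strong student cares about.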
Example 2 In Typical City, USA, the number of hours a teen watches TV has become a
concern for the town’s elders. They research this and find the teens watch an average of
4 ½ hours of TV a day, with a standard deviation of ½ hour. What percent of the teens
watch
a) more than 5 hours of TV per day?
b) more than 5 ½ hours of TV per day?
c) less than 5 ½ hours of TV per day?
d) less than 4 hours of TV per day?
e) less than 3 ½ hours of TV per day?
Solution:
a) Since 5 hours is 1 standard deviation above the mean (4 ½ plus ½), 68 % of
the teens are distributed within 1 standard deviation, or between 4 and 5 hours.
So, half of the teens will watch from 0 to 4 ½ hours, and another 34 % (half of
the 68 %) will watch between 4 ½ and 5 hours. So, 84 % will watch less than 5
hours, thus 100 % - 84 % or 16 % will watch more than 5 hours per day.
b) Since 5 ½ hours is 2 standard deviations above the mean (4 ½ plus ½ plus ½),
95 % of the teens are distributed within 2 standard deviations, or between 3 ½
and 5 ½ hours. So, half of the teens will watch from 0 to 4 ½ hours, and
another 47 ½ % (half of the 95 %) will watch between 4 ½ and 5 ½ hours. So, 97
½ % will watch less than 5 ½ hours, thus 100 % - 97 ½ % or 2 ½ % will watch
more than 5 ½ hours per day.
c) From the above paragraph, we have 100 % - 2 ½ % = 97 ½ % of the teens will
watch less than 5 ½ hours of TV per day.
d) Since 4 hours is 1 standard deviation below the mean (4 ½ minus ½), 68 %
of the teens are distributed within 1 standard deviation, or between 4 and 5 hours.
So, half of the teens will watch from 0 to 4 ½ hours, and 34 % (half of
the 68 %) will watch between 4 and 4 ½ hours. So, 50 % - 34 % or 16 % will watch less
than 4 hours per day.
e) Since 3 ½ hours is 2 standard deviations below the mean (4 ½ minus ½ minus ½),
95 % of the teens are distributed within 2 standard deviations, or between 3
½ and 5 ½ hours. So, half of the teens will watch from 0 to 4 ½ hours, but
47 ½ % (half of the 95 %) will watch between 3 ½ and 4 ½ hours per day.
So, 50 % - 47 ½ % or 2 ½ % of the teens will watch less than 3 ½ hours per day.
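All five answers follow one pattern: take the 68-95-99.7 figure for the right number of standard deviations and split what is left over equally between the two tails. The sketch below encodes that pattern; the names `WITHIN`, `percent_in_one_tail` and `deviations_from_mean` are illustrative, and the approach only works for whole numbers of standard deviations.

```python
# Percent of a normal population within k standard deviations of the mean.
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}

MEAN_HOURS = 4.5
STD_DEV = 0.5


def deviations_from_mean(hours):
    """Whole number of standard deviations between `hours` and the mean."""
    return round(abs(hours - MEAN_HOURS) / STD_DEV)


def percent_in_one_tail(hours):
    """Percent beyond `hours` on the far side from the mean (tails are symmetric)."""
    return (100.0 - WITHIN[deviations_from_mean(hours)]) / 2


answers = {
    "a": percent_in_one_tail(5),            # more than 5 hours
    "b": percent_in_one_tail(5.5),          # more than 5 1/2 hours
    "c": 100.0 - percent_in_one_tail(5.5),  # less than 5 1/2 hours
    "d": percent_in_one_tail(4),            # less than 4 hours
    "e": percent_in_one_tail(3.5),          # less than 3 1/2 hours
}

for part, percent in answers.items():
    print(f"{part}) {percent:g} %")
```

Parts a and d give the same percentage, as do b and e, because the normal curve is symmetric about the mean.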
Standard score or z-score. If one is analyzing data within 1, 2 or 3 standard deviations
from the mean, then you can expect 68 %, 95 % or 99.7 % respectively, of the population
to lie within these bounds. What happens if we know that 90 % of the data lies between
two scores? What would the standard deviation look like?
Since data rarely if ever is presented to us where the mean is zero and the standard
deviation is 1, we use the standard normal curve to help analyze any normally distributed
data. Traditionally, positions on this standard curve are referred to as standard scores or z-scores. In
other words, the number of standard deviations that data lies above or below a mean
is called the standard score or z-score. So, a data value with a z-score of 0 indicates the
data value is the mean. A data value with a z-score of -1.3 indicates the data value is 1.3
standard deviations below the mean, and so forth. If you know the standard deviation and
mean of your data, z-scores enable you to determine the percent of data between any two
values in the range of your data.
To find each z-score, use
$$z = \frac{\text{data value} - \text{mean}}{\text{standard deviation}}$$
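As a small illustration of the formula, the sketch below computes a z-score and then the area under the standard normal curve between the mean and that z-score, using `NormalDist` from Python's standard `statistics` module. The function names are illustrative, and the sample value (a final grade of 86 in a class with mean 76 and standard deviation 5, the figures from Mr. Allen's class) is chosen only for demonstration.

```python
from statistics import NormalDist


def z_score(value, mean, std_dev):
    """Number of standard deviations that `value` lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


def area_mean_to_z(z):
    """Area under the standard normal curve between the mean (z = 0) and |z|."""
    return NormalDist().cdf(abs(z)) - 0.5


z = z_score(86, mean=76, std_dev=5)
print(f"z = {z}, area between the mean and z = {area_mean_to_z(z):.4f}")
```

Doubling the area for z = 1 gives roughly 0.68 and doubling the area for z = 2 gives roughly 0.95, which is where the 68-95-99.7 rule comes from.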
Below is a table for the z-scores for the standard normal distribution.
z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0    0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1    0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.2    0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1028  0.1064  0.1103  0.1141
0.3    0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.4    0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
0.5    0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
0.6    0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
0.7    0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
0.8    0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
0.9    0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.0    0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.1    0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
1.2    0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
1.3    0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
1.4    0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
1.5    0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
1.6    0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
1.7    0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
1.8    0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
1.9    0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767
2.0    0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
2.1    0.4821  0.4826  0.4830  0.4834  0.4838  0.4842  0.4846  0.4850  0.4854  0.4857
2.2    0.4861  0.4864  0.4868  0.4871  0.4875  0.4878  0.4881  0.4884  0.4887  0.4890
2.3    0.4893  0.4896  0.4898  0.4901  0.4904  0.4906  0.4909  0.4911  0.4913  0.4916
2.4    0.4918  0.4920  0.4922  0.4925  0.4927  0.4929  0.4931  0.4932  0.4934  0.4936
2.5    0.4938  0.4940  0.4941  0.4943  0.4945  0.4946  0.4948  0.4949  0.4951  0.4952
2.6    0.4953  0.4955  0.4956  0.4957  0.4959  0.4960  0.4961  0.4962  0.4963  0.4964
2.7    0.4965  0.4966  0.4967  0.4968  0.4969  0.4970  0.4971  0.4972  0.4973  0.4974
2.8    0.4974  0.4975  0.4976  0.4977  0.4977  0.4978  0.4979  0.4979  0.4980  0.4981
2.9    0.4981  0.4982  0.4982  0.4983  0.4984  0.4984  0.4985  0.4985  0.4986  0.4986
3.0    0.4987

Example Three. We know that in 2002, the average height of the American male was 5’
9 ½” and the standard deviation was 2 3/8”. What percent of the American males in
2002 was …
a) taller than 6’?
b) shorter than 5’ 7 ½ ”?
c) between 5’ 10 “ and 6 ‘ 1”?
Solution
a) First, we find the z-score associated with 6’. We have
$$z = \frac{\text{data value} - \text{mean}}{\text{standard deviation}} = \frac{6' - 5'\,9\tfrac{1}{2}''}{2\tfrac{3}{8}''} = \frac{2.5}{2.375} \approx 1.05.$$
Notice the positive 1.05 corresponds to the fact that 6’ is above the mean. Now, we glance
at the table and see that a z-score of 1.05 has a value of 0.3531. This means that in 2002,
35.31 % of American males were between the average American male height of 5’ 9 ½” and
6’. Since 50 % of American males were taller than the mean, the percent of American males
taller than 6’ was 50 % - 35.31 % or 14.69 %.
b) First, we find the z-score associated with 5’ 7 ½”. We have
$$z = \frac{\text{data value} - \text{mean}}{\text{standard deviation}} = \frac{5'\,7.5'' - 5'\,9.5''}{2\tfrac{3}{8}''} = \frac{-2.0}{2.375} \approx -0.84.$$
The negative value corresponds to the fact that 5’ 7 ½” is below the mean. Now, we glance
at the table and see that a z-score of 0.84 has a value of 0.2995. This means that in 2002,
29.95 % of American males were between 5’ 7 ½” and the average American male height of
5’ 9 ½”. Since 50 % of American males were shorter than the mean, the percent of American
males shorter than 5’ 7 ½” was 50 % - 29.95 % or 20.05 %.
c) Calculating each z-score, we have:
$$z = \frac{\text{data value} - \text{mean}}{\text{standard deviation}} = \frac{5'\,10'' - 5'\,9.5''}{2\tfrac{3}{8}''} = \frac{0.5}{2.375} \approx 0.21$$
$$z = \frac{\text{data value} - \text{mean}}{\text{standard deviation}} = \frac{6'\,1'' - 5'\,9.5''}{2\tfrac{3}{8}''} = \frac{3.5}{2.375} \approx 1.47$$
Now, we glance at the table and see that the z-scores of 0.21 and 1.47 have the values of
0.0832 and 0.4292 respectively. This means that in 2002, 8.32 % of American males were
between the average American male height of 5’ 9 ½” and 5’ 10”, and 42.92 % of American
males were between the average height of 5’ 9 ½” and 6’ 1”. So, the percent of American
males between 5’ 10” and 6’ 1” would be 42.92 % - 8.32 % or 34.6 %.
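The three parts of Example Three can be checked with a few lines of Python. This is only a sketch: the helper names are invented for illustration, heights are converted to inches (5’ 9 ½” is 69.5 inches), and the area is computed with `math.erf` instead of being read from the table. Because the z-scores are not rounded to two places first, the results can differ from the table-based answers in the last digit.

```python
import math

MEAN_IN = 69.5   # 5' 9 1/2" expressed in inches
SD_IN = 2.375    # 2 3/8" expressed in inches

def z_score(height_in):
    """z-score of a height given in inches."""
    return (height_in - MEAN_IN) / SD_IN

def area_mean_to_z(z):
    """Fraction of a normal population between the mean and z."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2))

# a) taller than 6' (72 inches): upper half minus the slice from the mean to 6'
taller_than_6ft = 0.5 - area_mean_to_z(z_score(72.0))

# b) shorter than 5' 7 1/2" (67.5 inches): lower half minus the slice up to the mean
shorter_than_5ft7half = 0.5 - area_mean_to_z(z_score(67.5))

# c) between 5' 10" (70 inches) and 6' 1" (73 inches): both lie above the mean
between_5ft10_and_6ft1 = area_mean_to_z(z_score(73.0)) - area_mean_to_z(z_score(70.0))

for label, p in [("taller than 6'", taller_than_6ft),
                 ("shorter than 5' 7.5\"", shorter_than_5ft7half),
                 ("between 5' 10\" and 6' 1\"", between_5ft10_and_6ft1)]:
    print(f"{label}: {100 * p:.1f} %")
```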
Problem Set
1. Two AP calculus classes were taught by Mr. Venette and Ms. Harper. The final grades
for the course during the past five years indicated that Mr. Venette’s classes had a mean
of 80 and a standard deviation of 4, while Ms. Harper’s classes had a mean of 82, but a
standard deviation of 2.5. Interpret the results in terms of the 68-95-99.7 percentages.
Then give possible reasons for the differences you observe.
For questions 2 to 10, use the following data: The mean income in a city is $ 51,000, and
the standard deviation is $ 4000. Find the percentage of people whose income is
2. $ 59,000 or above
3. $ 47,000 or below
4. between $ 43,000 and $ 55,000
5. $ 55,000 or below
6. between $ 39,000 and $ 55,400
7. $ 39,000 or below
8. $ 40,000 or above
9. $ 50,000 or below
10. between $ 50,000 and $ 60,000
For problems 11 to 16, use the following table. In Japan, studies pertaining to the heights
for adults separated by gender vary slightly. Below is a rough estimation of data
compiled from various studies.
JAPAN
                    Shortest person   Average person   Tallest person
Males (97.5 %)      5’ 0 5/8”         5’ 5 1/8”        5’ 9 7/8”
Females (97.5 %)    4’ 8 1/2”         5’ 0 1/4”        5’ 4”
11. Find the standard deviation for both the males and the females and interpret both in a
complete sentence.
12. Find the percent of Japanese males shorter than 5’ 9 7/8”.
13. Find the height of the Japanese female who is taller than 99.85 % of the population.
14. Find the height of the Japanese males who are shorter than 84 % of the population.
15. Find the percent of Japanese males shorter than 5’ 2”.
16. Find the percent of Japanese women taller than 5’ 2”.
For problems 17 to 24, use the following table. In the United States, the weights
pertaining to adults separated by gender vary slightly. Below is a rough estimation of
data compiled from various studies.
USA
                  Lightest person   Average person   Heaviest person
Males (95 %)      168 pounds        191 pounds       214 pounds
Females (95 %)    140 pounds        164 pounds       188 pounds
17. Find the standard deviation for both the males and the females and interpret both in a
complete sentence.
18. Find the percent of American females who weigh more than 145 pounds.
19. Find the weight of the American female who weighs less than 34 % of the
population.
20. Find the weight of the American male who weighs more than 97.5 % of the
population.
21. Find the percent of American females who weigh more than 150 pounds.
22. Find the percent of American males who weigh more than 200 pounds.
23. Find the percent of American females who weigh less than the weight of the Average
Male in 2002.
24. Find the percent of American males who weigh less than the weight of the Average
Female in 2002.
25. On your route home, you have a choice of two bridges, each of the same length and
with the same number of lanes. At the time you cross, Bridge One has an average of 420
cars on it with a standard deviation of 100 cars, and Bridge Two has an average of 460
cars on it with a standard deviation of 20 cars. Which bridge would you decide to cross?
Would it matter if you were in a hurry?
For problems 26 to 32, use this fact: According to Robert Dvorchak of the Pittsburgh
Post-Gazette, the average length of a National League baseball game in 2004 was 2:47:20.
Compare this to the game’s own past: in 1967 the average game was 2:30, in the 1940’s the
average game, according to the Sporting News, was exactly 2:00, and a century ago the
average game was a mere 1:30. If we estimate a standard deviation of 20 minutes, what
percent of the games …
26. in 2004 lasted longer than two hours
27. in 1967 lasted longer than two hours
28. in 1940 lasted longer than two hours
29. a century ago lasted longer than two hours
30. in 2004 lasted longer than three hours
31. in 1967 lasted longer than three hours
32. in 2004 were shorter than 3 ½ hours
Calculating the Standard Deviation
A standard deviation is, loosely speaking, a typical distance from the mean. For each data
value, we subtract the mean, and the result is either zero, positive or negative. If we
simply add these differences, the positive ones cancel the negative ones and the sum is
always zero, so this calculation tells us nothing about distance from the mean. Try it.
Instead, we square each difference, because the squared values are never negative and so
cannot cancel each other out. Now, we add them all up. We then divide by the number of
terms. Almost. Actually, we divide by n-1, because our data is usually a sample from a
larger population, and deviations measured from the sample’s own mean tend to run slightly
smaller than deviations from the true population mean; dividing by n-1 instead of n
corrects for this. We then take the square root to undo the effect of squaring, giving us
a measurable typical distance from the mean. For an individual data value, a positive
deviation indicates a value above the mean and a negative deviation indicates a value
below the mean.
A practical way to compute a standard deviation is to use a spreadsheet. In Microsoft
Excel, type the following formula into the cell where you want the standard deviation
result, using the "unbiased," or "n-1" method:
=STDEV(A1:Z99) (substitute the cell name of the first value in your dataset for
A1, and the cell name of the last value for Z99.)
To find the standard deviation, let $x$ be one value in your set of data and let $\bar{x}$
be the mean of all values $x$ in your set of data. Let $n$ be the number of values $x$ in
your set of data. For each value $x$, subtract the overall mean $\bar{x}$ from $x$, that is
$x - \bar{x}$, then square the result, $(x - \bar{x})^2$. Sum up all those squared values
and then divide that sum by $(n-1)$. Finally, there's one more step: find the square root
of that last number. That's the standard deviation, written as $\sigma$, of your set of
data.
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
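The recipe above translates almost line for line into code. The sketch below is one possible Python version; the function name `sample_standard_deviation` and the small data set are invented for illustration. The result is compared against `statistics.stdev` from Python's standard library, which also divides by n-1.

```python
import math
import statistics

def sample_standard_deviation(values):
    """Standard deviation using the n-1 recipe described in the text."""
    n = len(values)
    mean = sum(values) / n                         # x-bar
    squared = [(x - mean) ** 2 for x in values]    # (x - x-bar)^2 for each value
    return math.sqrt(sum(squared) / (n - 1))       # divide by n-1, then take the root

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up example values
print(sample_standard_deviation(data))
print(statistics.stdev(data))     # library version, should agree
```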
Introduction or who deserves the B?
Let’s give you an intuitive feel for measures of central tendency. Below are five
students and their seven grades for a course. The bottom row confirms that all five
students have a 79 average.

           Allan   Bill   Cindy   Deanna   Eve
            74      73     59      69      68
            76      75     62      73      78
            77      77     78      78      79
            80      79     79      79      79
            81      83     80      82      83
            82      83     96      83      83
            83      83     99      89      83
Average     79      79     79      79      79
All five students want a B. Allan argues that his middle grade, his median, is a B. Bill
argues that his mode, the grade that occurs most frequently, is a B. Cindy argues that she
has shown great potential: two of her grades are solid A’s. Deanna makes the same
argument, but her grades are not quite as erratic as Cindy’s; they are not as dispersed
away from the average as Cindy’s grades. Eve, like Bill, also argues that her mode is a
B. And although Eve and Bill have the same mean, median and mode, Eve is the one
with the 68 on her record. Uh oh….
Standard deviations measure just this: how much the data values deviate from the mean. In
other words, a standard deviation is a numerical value that tells the reader how spread
out the data is. Allan’s grades are clumped together, so he should have a small standard
deviation. Cindy’s grade history is more erratic and her grades are spread farther apart,
so she should have a larger standard deviation.
Let’s compare the standard deviations for three of the students, Bill, Cindy and Eve. We
will find how much each data value deviates from the mean. But notice, when we do, if
we try to sum up these values to get some sort of average, the sum is zero because the
deviations above the mean cancel with those below the mean. So, we will square the
results, to keep the deviations positive, and then divide by one less than the number of
data points. Finally, we will undo the squaring by taking the square root, so that we get
a measure of how much the values deviate from the mean.
Bill   deviation   deviation    Cindy   deviation   deviation    Eve   deviation   deviation
                   squared                          squared                        squared
 73       -6          36          59       -20        400         68      -11        121
 75       -4          16          62       -17        289         78       -1          1
 77       -2           4          78        -1          1         79        0          0
 79        0           0          79         0          0         79        0          0
 83        4          16          80         1          1         83        4         16
 83        4          16          96        17        289         83        4         16
 83        4          16          99        20        400         83        4         16
 79        0         104          79         0       1380         79        0        170
The last row shows each student’s mean, the sum of the deviations, and the sum of the
squared deviations.
For Bill, his standard deviation is
$$\sqrt{\frac{104}{6}} = \sqrt{17.33333} \approx 4.16.$$
For Cindy, her standard deviation is
$$\sqrt{\frac{1380}{6}} = \sqrt{230} \approx 15.17.$$
For Eve, her standard deviation is
$$\sqrt{\frac{170}{6}} = \sqrt{28.33333} \approx 5.32.$$
The standard deviation is a measure of dispersion, and clearly the larger the standard
deviation, the more dispersed the data. Now, we have more information about each of the
students’ grades at our disposal; we know the mean, median, mode and the standard
deviation. You decide: who deserves the B, and who does not?
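To see all five students side by side, the same calculation can be run on every column of the grade table. The sketch below uses Python's standard `statistics` module; the dictionary name `grades` and the layout of the printout are choices made for this illustration only.

```python
import statistics

# The seven grades for each student, copied from the table in the text
grades = {
    "Allan":  [74, 76, 77, 80, 81, 82, 83],
    "Bill":   [73, 75, 77, 79, 83, 83, 83],
    "Cindy":  [59, 62, 78, 79, 80, 96, 99],
    "Deanna": [69, 73, 78, 79, 82, 83, 89],
    "Eve":    [68, 78, 79, 79, 83, 83, 83],
}

# statistics.stdev divides by n-1, matching the calculation in the text
spread = {name: statistics.stdev(g) for name, g in grades.items()}

for name, g in grades.items():
    print(f"{name:7s} mean={statistics.mean(g)} "
          f"median={statistics.median(g)} sd={spread[name]:.2f}")
```

Sorting the students by `spread` gives one way to rank how erratic each grade history is, which is the question the mean alone cannot answer.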
Example 1: Who is the best and the worst in the American League on Labor Day,
2004? A standard deviation way to explore this age old baseball question. Updated: 9/5/2004
3:37:06 PM cnn.com
AMERICAN LEAGUE EAST
~~~~~~~~~~~~~~~~~~~~
TEAM            WON  LOST   PCT   GB       HOME    ROAD    EAST    CENT    WEST    STREAK
NY YANKEES       83    52   .615  -        45-21   38-31   36-19   15-11   22-14   LOST 2
BOSTON           80    54   .597  2 1/2    48-22   32-32   36-20   19-13   16-12   LOST 1
BALTIMORE        63    71   .470  19 1/2   29-35   34-36   28-29   15-12   15-17   WON 6
TAMPA BAY        59    75   .440  23 1/2   36-34   23-41   21-38   13-12   10-22   LOST 7
TORONTO          56    80   .412  27 1/2   34-37   22-43   21-36   13-19   14-15   LOST 2

AMERICAN LEAGUE CENTRAL
~~~~~~~~~~~~~~~~~~~~~~~
TEAM            WON  LOST   PCT   GB       HOME    ROAD    EAST    CENT    WEST    STREAK
MINNESOTA        77    58   .570  -        43-28   34-30   16-11   34-24   16-16   WON 5
CHI WHITE SOX    67    67   .500  9 1/2    38-32   29-35   16-16   29-27   14-14   WON 2
CLEVELAND        67    70   .489  11       40-30   27-40   17-15   26-31   14-16   LOST 4
DETROIT          62    72   .463  14 1/2   32-32   30-40   10-14   28-28   15-21   WON 1
KANSAS CITY      47    87   .351  29 1/2   30-37   17-50   08-19   25-32   08-24   LOST 2

AMERICAN LEAGUE WEST
~~~~~~~~~~~~~~~~~~~~
TEAM            WON  LOST   PCT   GB       HOME    ROAD    EAST    CENT    WEST    STREAK
OAKLAND          81    54   .600  -        45-19   36-35   23-16   25-15   23-15   WON 3
ANAHEIM          77    58   .570  4        38-28   39-30   24-16   25-14   21-17   WON 2
TEXAS            74    60   .552  6 1/2    42-22   32-38   22-17   22-17   20-18   WON 1
SEATTLE          51    84   .378  30       32-34   19-50   11-28   19-21   12-26   LOST 4
The three divisional winners and the second place team with the best record make the
playoffs. But, who is the best team? The worst team? How good is good and how bad is
bad? Let’s calculate the standard deviation with respect to the number of wins for each
team.
First, we find the mean number of wins by adding up all the wins and dividing by 14.
$$\bar{x} = \frac{83 + 80 + 63 + 59 + 56 + 77 + 67 + 67 + 62 + 47 + 81 + 77 + 74 + 51}{14} = \frac{944}{14} \approx 67.4$$
So, the average or mean number of wins is 67.4 for the American League teams on Labor
Day, 2004. We then create a table. The table will have the teams listed in order by the
number of wins, from most to least. We then create a third column with the difference
between the number of wins for a team and the mean number of wins. The final column has
this number squared. We will find the sum of that final column.
Team            Wins   Wins - Mean          (Wins - Mean)^2
NY YANKEES       83    83-67.4 = 15.6       15.6^2 = 243.36
OAKLAND          81    81-67.4 = 13.6       13.6^2 = 184.96
BOSTON           80    12.6                 158.8
MINNESOTA        77    9.6                  92.2
ANAHEIM          77    9.6                  92.2
TEXAS            74    6.6                  43.6
CHI WHITE SOX    67    -0.4                 0.2
CLEVELAND        67    -0.4                 0.2
BALTIMORE        63    -4.4                 19.4
DETROIT          62    -5.4                 29.2
TAMPA BAY        59    -8.4                 70.6
TORONTO          56    -11.4                129.96
SEATTLE          51    -16.4                268.96
KANSAS CITY      47    -20.4                416.2
SUM             944    0                    1749.84
To find the standard deviation, we complete the last two steps.
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{1749.84}{14 - 1}} \approx 11.6$$
Below is the table of the American League teams on Labor Day, 2004. Look at the final
column, and recall, as you glance at each team’s z-score, that for a normal population,
68 % of the population falls within 1 standard deviation or z-score of the mean, 95 %
falls within 2 and 99.7 % falls within 3. How good are the NY Yankees and how bad are
the Kansas City Royals? You now have a frame of reference to answer that question.
Recall, to find each z-score, use
$$z = \frac{\text{data value} - \text{mean}}{\text{standard deviation}}.$$
Team            Wins   Wins - Mean          z-score
NY YANKEES       83    83-67.4 = 15.6       15.6/11.6 = 1.3
OAKLAND          81    81-67.4 = 13.6       13.6/11.6 = 1.2
BOSTON           80    12.6                 1.1
MINNESOTA        77    9.6                  0.8
ANAHEIM          77    9.6                  0.8
TEXAS            74    6.6                  0.6
CHI WHITE SOX    67    -0.4                 -0.03
CLEVELAND        67    -0.4                 -0.03
BALTIMORE        63    -4.4                 -0.4
DETROIT          62    -5.4                 -0.5
TAMPA BAY        59    -8.4                 -0.7
TORONTO          56    -11.4                -0.98
SEATTLE          51    -16.4                -1.4
KANSAS CITY      47    -20.4                -1.8
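The whole of Example 1 can be reproduced in a short script. What follows is a sketch rather than the method used to build the tables: the variable names are invented, and the mean is kept at full precision instead of being rounded to 67.4, so the last digit of a result may differ slightly from the hand calculation.

```python
import math

# Wins for the 14 American League teams, from the standings in the text
wins = {
    "NY Yankees": 83, "Oakland": 81, "Boston": 80, "Minnesota": 77,
    "Anaheim": 77, "Texas": 74, "Chi White Sox": 67, "Cleveland": 67,
    "Baltimore": 63, "Detroit": 62, "Tampa Bay": 59, "Toronto": 56,
    "Seattle": 51, "Kansas City": 47,
}

n = len(wins)
mean_wins = sum(wins.values()) / n
sum_of_squares = sum((w - mean_wins) ** 2 for w in wins.values())
sd_wins = math.sqrt(sum_of_squares / (n - 1))      # n-1 method

# z-score: how many standard deviations each team is from the mean
z_scores = {team: (w - mean_wins) / sd_wins for team, w in wins.items()}

print(f"mean = {mean_wins:.1f}, standard deviation = {sd_wins:.1f}")
for team, z in z_scores.items():
    print(f"{team:15s} {wins[team]:3d} {z:6.2f}")
```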
Example 2. Does money buy success in baseball? Updated: 9/5/2004 3:37:06 PM cnn.com
The payrolls for the American League teams are listed below.
New York Yankees        $ 184,193,950
Boston Red Sox          $ 127,298,500
Anaheim Angels          $ 100,534,667
Seattle Mariners        $  81,515,834
Chicago White Sox       $  65,212,500
Oakland Athletics       $  59,425,667
Texas Rangers           $  55,050,417
Minnesota Twins         $  53,585,000
Baltimore Orioles       $  51,623,333
Toronto Blue Jays       $  50,017,000
Kansas City Royals      $  47,609,000
Detroit Tigers          $  46,832,000
Cleveland Indians       $  34,319,300
Tampa Bay Devil Rays    $  29,556,667
Let’s find how many standard deviations each team’s payroll lies from the mean and then
compare the rankings to the true standings on September 5th. Does there seem to be a
correlation between salaries and success? Does money buy success?
The sum of the 14 American League payrolls is $ 986,773,835, thus the average payroll is
$ 70,483,845.36, which we will round to $ 70,483,845.
To calculate the standard deviation, we construct the following table.
Team                    Payroll          Payroll – Mean    (Payroll – Mean)^2
New York Yankees        $ 184,193,950      113,710,105     12,929,987,979,111,025
Boston Red Sox          $ 127,298,500       56,814,655      3,227,905,022,769,025
Anaheim Angels          $ 100,534,667       30,050,822        903,051,902,875,684
Seattle Mariners        $  81,515,834       11,031,989        121,704,781,296,121
Chicago White Sox       $  65,212,500      - 5,271,345         27,787,078,109,025
Oakland Athletics       $  59,425,667     - 11,058,178        122,283,300,679,684
Texas Rangers           $  55,050,417
Minnesota Twins         $  53,585,000
Baltimore Orioles       $  51,623,333
Toronto Blue Jays       $  50,017,000
Kansas City Royals      $  47,609,000
Detroit Tigers          $  46,832,000
Cleveland Indians       $  34,319,300
Tampa Bay Devil Rays    $  29,556,667
We leave it as an exercise for you to complete the table above.
Dividing the sum of the numbers in the last column by 14 gives 1.62119 x 10^15. As
before, though, we divide by n-1, which here is 13; the result, about 1.7459 x 10^15,
is called the variance.
One standard deviation from the mean is found by taking the square root of the
variance. The standard deviation is $ 41,783,940.
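Numbers this large are tedious to square by hand, so a short script is a reasonable way to check them. The sketch below (with invented variable names) computes the sum of squared deviations once and then divides it both by n = 14 and by n-1 = 13, so the two figures can be compared directly; the standard deviation quoted in the text corresponds to the n-1 version.

```python
import math

# Payrolls of the 14 American League teams, from the list in the text
payrolls = {
    "New York Yankees": 184_193_950, "Boston Red Sox": 127_298_500,
    "Anaheim Angels": 100_534_667, "Seattle Mariners": 81_515_834,
    "Chicago White Sox": 65_212_500, "Oakland Athletics": 59_425_667,
    "Texas Rangers": 55_050_417, "Minnesota Twins": 53_585_000,
    "Baltimore Orioles": 51_623_333, "Toronto Blue Jays": 50_017_000,
    "Kansas City Royals": 47_609_000, "Detroit Tigers": 46_832_000,
    "Cleveland Indians": 34_319_300, "Tampa Bay Devil Rays": 29_556_667,
}

values = list(payrolls.values())
n = len(values)
mean_payroll = sum(values) / n
sum_of_squares = sum((p - mean_payroll) ** 2 for p in values)

variance_n = sum_of_squares / n                  # dividing by 14
variance_n_minus_1 = sum_of_squares / (n - 1)    # dividing by 13 (n-1 method)
sd_payroll = math.sqrt(variance_n_minus_1)

z_scores = {team: (p - mean_payroll) / sd_payroll for team, p in payrolls.items()}

print(f"mean payroll       = {mean_payroll:,.0f}")
print(f"variance (n)       = {variance_n:.5e}")
print(f"variance (n-1)     = {variance_n_minus_1:.5e}")
print(f"standard deviation = {sd_payroll:,.0f}")
```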
So, let’s reprint the table, with the number of standard deviations from the mean listed
for each team along with its ranking in the American League.
Team                    Payroll          Standard Deviations    True Major
                                         from the Mean          League Ranking
                                         (z-score)
New York Yankees        $ 184,193,950      2.72                 1
Boston Red Sox          $ 127,298,500      1.36                 3
Anaheim Angels          $ 100,534,667      0.72                 Tied for 4
Seattle Mariners        $  81,515,834      0.26                 13
Chicago White Sox       $  65,212,500     -0.13                 7
Oakland Athletics       $  59,425,667     -0.26                 2
Texas Rangers           $  55,050,417     -0.37                 6
Minnesota Twins         $  53,585,000     -0.40                 Tied for 4
Baltimore Orioles       $  51,623,333     -0.45                 9
Toronto Blue Jays       $  50,017,000     -0.49                 12
Kansas City Royals      $  47,609,000     -0.55                 14
Detroit Tigers          $  46,832,000     -0.57                 10
Cleveland Indians       $  34,319,300     -0.87                 8
Tampa Bay Devil Rays    $  29,556,667     -0.98                 11
Below are the true standings of the teams in the American League.
Updated: 9/5/2004 3:37:06 PM cnn.com
AMERICAN LEAGUE EAST
~~~~~~~~~~~~~~~~~~~~
TEAM            WON  LOST   PCT   GB       HOME    ROAD    EAST    CENT    WEST    STREAK
NY YANKEES       83    52   .615  -        45-21   38-31   36-19   15-11   22-14   LOST 2
BOSTON           80    54   .597  2 1/2    48-22   32-32   36-20   19-13   16-12   LOST 1
BALTIMORE        63    71   .470  19 1/2   29-35   34-36   28-29   15-12   15-17   WON 6
TAMPA BAY        59    75   .440  23 1/2   36-34   23-41   21-38   13-12   10-22   LOST 7
TORONTO          56    80   .412  27 1/2   34-37   22-43   21-36   13-19   14-15   LOST 2

AMERICAN LEAGUE CENTRAL
~~~~~~~~~~~~~~~~~~~~~~~
TEAM            WON  LOST   PCT   GB       HOME    ROAD    EAST    CENT    WEST    STREAK
MINNESOTA        77    58   .570  -        43-28   34-30   16-11   34-24   16-16   WON 5
CHI WHITE SOX    67    67   .500  9 1/2    38-32   29-35   16-16   29-27   14-14   WON 2
CLEVELAND        67    70   .489  11       40-30   27-40   17-15   26-31   14-16   LOST 4
DETROIT          62    72   .463  14 1/2   32-32   30-40   10-14   28-28   15-21   WON 1
KANSAS CITY      47    87   .351  29 1/2   30-37   17-50   08-19   25-32   08-24   LOST 2

AMERICAN LEAGUE WEST
~~~~~~~~~~~~~~~~~~~~
TEAM            WON  LOST   PCT   GB       HOME    ROAD    EAST    CENT    WEST    STREAK
OAKLAND          81    54   .600  -        45-19   36-35   23-16   25-15   23-15   WON 3
ANAHEIM          77    58   .570  4        38-28   39-30   24-16   25-14   21-17   WON 2
TEXAS            74    60   .552  6 1/2    42-22   32-38   22-17   22-17   20-18   WON 1
SEATTLE          51    84   .378  30       32-34   19-50   11-28   19-21   12-26   LOST 4
Example 3: Homeownership in the USA We will calculate the standard deviation for
the percent of the population in this country that owns a home. Below are the state
rankings for the percent of homeownership in the United States (to include mobile
homes) in 2002. Source: US Bureau of the Census.
Alabama                73.5    Kentucky          73.5    North Dakota      69.5
Alaska                 67.3    Louisiana         67.1    Ohio              72.0
Arizona                65.9    Maine             73.9    Oklahoma          69.4
Arkansas               70.2    Maryland          72.0    Oregon            66.2
California             58.0    Massachusetts     62.7    Pennsylvania      74.0
Colorado               69.1    Michigan          76.0    Rhode Island      59.6
Connecticut            71.6    Minnesota         77.3    South Carolina    77.3
Delaware               75.6    Mississippi       74.8    South Dakota      71.5
District of Columbia   44.1    Missouri          74.6    Tennessee         70.1
Florida                68.7    Montana           69.3    Texas             63.8
Georgia                71.7    Nebraska          68.4    Utah              72.7
Hawaii                 57.4    Nevada            65.5    Vermont           70.2
Idaho                  73.0    New Hampshire     69.5    Virginia          74.3
Illinois               70.2    New Jersey        67.2    Washington        67.0
Indiana                75.0    New Mexico        70.3    West Virginia     77.0
Iowa                   73.9    New York          55.0    Wisconsin         72.0
Kansas                 70.2    North Carolina    70.0    Wyoming           72.8
As before, we have a rather large sample. Let’s begin with what we know: let’s find the
mean, median, mode and range, first ranking the states in ascending order. The last
column in each half of the table is the state’s rank, with 1 being the highest percent
of home ownership.

Mean 69.4      Mode 70.2      Median 70.2

State            Percent   Rank      State            Percent   Rank
DC                44.1      -        Vermont           70.2      25
New York          55.0      50       New Mexico        70.3      24
Hawaii            57.4      49       South Dakota      71.5      23
California        58.0      48       Connecticut       71.6      22
Rhode Island      59.6      47       Georgia           71.7      21
Massachusetts     62.7      46       Maryland          72.0      18
Texas             63.8      45       Ohio              72.0      18
Nevada            65.5      44       Wisconsin         72.0      18
Arizona           65.9      43       Utah              72.7      17
Oregon            66.2      42       Wyoming           72.8      16
Washington        67.0      41       Idaho             73.0      15
Louisiana         67.1      40       Alabama           73.5      13
New Jersey        67.2      39       Kentucky          73.5      13
Alaska            67.3      38       Iowa              73.9      11
Nebraska          68.4      37       Maine             73.9      11
Florida           68.7      36       Pennsylvania      74.0      10
Colorado          69.1      35       Virginia          74.3       9
Montana           69.3      34       Missouri          74.6       8
Oklahoma          69.4      33       Mississippi       74.8       7
New Hampshire     69.5      31       Indiana           75.0       6
North Dakota      69.5      31       Delaware          75.6       5
North Carolina    70.0      30       Michigan          76.0       4
Tennessee         70.1      29       West Virginia     77.0       3
Arkansas          70.2      25       Minnesota         77.3       1
Illinois          70.2      25       South Carolina    77.3       1
Kansas            70.2      25
Let’s begin to interpret the data. First, we notice the mode and median are the same, and
the mean (average) is below the two. The next question, “is the mean significantly below
the other two?”, may be partially answered by observing the range, for comparison’s sake.
The range of 33.2 (77.3 – 44.1) appears rather large, so by comparison the gap between
69.4 and 70.2 appears to be no big deal. Let’s add to our repertoire the ability to
examine the dispersion of the data. For large sets of data, we do not want to obsess over
each individual data point. We want to see if the data follows noticeable trends and then
interpret any outliers that may appear.
Let’s observe a histogram, made from the data in ascending order. Notice we have so
much data we cannot see each state (only a few are labeled for reference) or see each bar
representing each state.
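[Histogram: Percent of Home Owners by state. Vertical axis: percent, from 0.0 to 100.0.
Horizontal axis: the states in ascending order, with only a few labeled, among them the
District of Columbia, Massachusetts, Washington, Florida, Kansas, Georgia, Wyoming,
Maine and Indiana.]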
Notice how the District of Columbia is an outlier, its percent of 44.1 far from the mean
of 69.4 %. But, in a manner of speaking, NY, HI, CA and RI, with respective percents of
home ownership of 55, 57.4, 58 and 59.6, all seem far below the mean as well. This is
what we mean by dispersion. We need to quantify how well grouped the data is because
we need to know where to draw the line between those states that are significantly below
the mean or significantly above the mean. This can be done with what we have identified
as a standard deviation: the measure of how the data deviates from its behavior around
the middle of the data, its central tendency. In a perfect world, the mean is in the
center of the data, thus it is the median too. And the mode, if we get greedy. The
standard deviation, loosely speaking, measures how the data deviates from the mean, and
remember, the mean is in the center of the data.
The standard deviation is 6.2. Each score in the table below is computed as
(x - 70.2)/6.2. Note that these scores are measured from 70.2, the median and mode of
the data, rather than from the mean of 69.4.

State            z-score   Percent      State            z-score   Percent
DC                -4.21     44.1        Vermont            0.00     70.2
New York          -2.45     55.0        New Mexico         0.02     70.3
Hawaii            -2.06     57.4        South Dakota       0.21     71.5
California        -1.97     58.0        Connecticut        0.23     71.6
Rhode Island      -1.71     59.6        Georgia            0.24     71.7
Massachusetts     -1.21     62.7        Maryland           0.29     72.0
Texas             -1.03     63.8        Ohio               0.29     72.0
Nevada            -0.76     65.5        Wisconsin          0.29     72.0
Arizona           -0.69     65.9        Utah               0.40     72.7
Oregon            -0.65     66.2        Wyoming            0.42     72.8
Washington        -0.52     67.0        Idaho              0.45     73.0
Louisiana         -0.50     67.1        Alabama            0.53     73.5
New Jersey        -0.48     67.2        Kentucky           0.53     73.5
Alaska            -0.47     67.3        Iowa               0.60     73.9
Nebraska          -0.29     68.4        Maine              0.60     73.9
Florida           -0.24     68.7        Pennsylvania       0.61     74.0
Colorado          -0.18     69.1        Virginia           0.66     74.3
Montana           -0.15     69.3        Missouri           0.71     74.6
Oklahoma          -0.13     69.4        Mississippi        0.74     74.8
New Hampshire     -0.11     69.5        Indiana            0.77     75.0
North Dakota      -0.11     69.5        Delaware           0.87     75.6
North Carolina    -0.03     70.0        Michigan           0.94     76.0
Tennessee         -0.02     70.1        West Virginia      1.10     77.0
Arkansas           0.00     70.2        Minnesota          1.15     77.3
Illinois           0.00     70.2        South Carolina     1.15     77.3
Kansas             0.00     70.2

Center used for the z-scores   70.2
Standard deviation              6.2
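The summary figures for the home ownership data can be recomputed from the 51 percentages. The sketch below uses Python's standard `statistics` module; the list name `percents` and the choice to measure the District of Columbia's z-score from the mean are assumptions made for this illustration, so that score will not match a score measured from 70.2.

```python
import statistics

# Home ownership percentages for the 50 states and DC, in ascending order
percents = [
    44.1, 55.0, 57.4, 58.0, 59.6, 62.7, 63.8, 65.5, 65.9, 66.2,
    67.0, 67.1, 67.2, 67.3, 68.4, 68.7, 69.1, 69.3, 69.4, 69.5,
    69.5, 70.0, 70.1, 70.2, 70.2, 70.2, 70.2, 70.3, 71.5, 71.6,
    71.7, 72.0, 72.0, 72.0, 72.7, 72.8, 73.0, 73.5, 73.5, 73.9,
    73.9, 74.0, 74.3, 74.6, 74.8, 75.0, 75.6, 76.0, 77.0, 77.3,
    77.3,
]

mean_pct = statistics.mean(percents)
median_pct = statistics.median(percents)
mode_pct = statistics.mode(percents)
range_pct = max(percents) - min(percents)
sd_pct = statistics.stdev(percents)          # n-1 method

# z-score of the lowest value (DC), measured from the mean
z_dc = (percents[0] - mean_pct) / sd_pct

print(f"mean={mean_pct:.1f} median={median_pct} mode={mode_pct} "
      f"range={range_pct:.1f} sd={sd_pct:.1f} z(DC)={z_dc:.2f}")
```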
Why use standard deviation? The standard deviation can help you evaluate the
worth of all the so-called "studies" that seem to be released to the press every day.
Standard deviation is commonly used in business as a measure to describe the risk of a
security or portfolio of securities. If you read the history of investment performance,
chances are that standard deviation will be used to gauge risk. The same is true for
academic studies that determine the validity of exam results, or the effectiveness of
educational tools. Standard deviation is also one of the most commonly used statistical
tools in the sciences and social sciences. It provides a precise measure of the amount of
variation in any group of numbers, be it the returns on a mutual fund, the yearly rainfall
in Mexico City, or the hits per game for a major league baseball player.
Problem Set:
For problems 1 to 8, use the 2003 Final Standings of the NFL teams, as indicated below:
2003 NFL Standings. W = wins, L = losses, T = ties, % = percentage of games won, PF =
Points For, that is, points the team scored, PA = points the team allowed.

AFC East                 W    L    T    %      PF    PA
New England Patriots    14    2    0   .875   348   238
Miami Dolphins          10    6    0   .625   311   261
Buffalo Bills            6   10    0   .375   243   279
New York Jets            6   10    0   .375   283   299

NFC East                 W    L    T    %      PF    PA
Philadelphia Eagles     12    4    0   .750   374   287
Dallas Cowboys          10    6    0   .625   289   260
Washington Redskins      5   11    0   .313   287   372
New York Giants          4   12    0   .250   243   387

AFC North                W    L    T    %      PF    PA
Baltimore Ravens        10    6    0   .625   391   281
Cincinnati Bengals       8    8    0   .500   346   384
Pittsburgh Steelers      6   10    0   .375   300   327
Cleveland Browns         5   11    0   .313   254   322

NFC North                W    L    T    %      PF    PA
Green Bay Packers       10    6    0   .625   442   307
Minnesota Vikings        9    7    0   .563   416   353
Chicago Bears            7    9    0   .438   283   346
Detroit Lions            5   11    0   .313   270   379

AFC South                W    L    T    %      PF    PA
Indianapolis Colts      12    4    0   .750   447   336
Tennessee Titans        12    4    0   .750   435   324
Jacksonville Jaguars     5   11    0   .313   276   331
Houston Texans           5   11    0   .313   255   380

NFC South                W    L    T    %      PF    PA
Carolina Panthers       11    5    0   .688   325   304
New Orleans Saints       8    8    0   .500   340   326
Tampa Bay Buccaneers     7    9    0   .438   301   264
Atlanta Falcons          5   11    0   .313   299   422

AFC West                 W    L    T    %      PF    PA
Kansas City Chiefs      13    3    0   .813   438   332
Denver Broncos          10    6    0   .625   381   301
Oakland Raiders          4   12    0   .250   270   379
San Diego Chargers       4   12    0   .250   313   441

NFC West                 W    L    T    %      PF    PA
St. Louis Rams          12    4    0   .750   447   328
Seattle Seahawks        10    6    0   .625   404   327
San Francisco 49ers      7    9    0   .438   384   337
Arizona Cardinals        4   12    0   .250   225   452
1. For the NFC teams, find the standard deviation for the number of wins and then find
the z-score for each team.
2. For the AFC teams, find the standard deviation for the number of wins and then find
the z-score for each team.
3. For all the teams, find the standard deviation for the number of wins and then find the
z-score for each team.
4. For the NFC teams, find the standard deviation for PF and then find the z-score for
each team.
5. For the AFC teams, find the standard deviation for PF and then find the z-score for
each team.
6. For the NFC teams, find the standard deviation for PA and then find the z-score for
each team.
7. For the AFC teams, find the standard deviation for PA and then find the z-score for
each team.
8. Look carefully at questions 1 to 7. Which is a better predictor of a team’s success, the
offense as indicated by the points the team scored (PF) or the defense, as indicated by the
points that team allowed (PA)? Why?
9. According to the 2005 World Almanac for Kids, below are the populations of the 25
largest countries in the world in mid-2004, in no particular order. Find the mean, median
and the standard deviation.
1,294,629,555 China
82,424,609 Germany
1,065,070,607 India
76,117,421 Egypt
293,027,571 United States
69,018,294 Iran
238,452,952 Indonesia
68,893,918 Turkey
184,101,109 Brazil
67,851,281 Ethiopia
153,705,278 Pakistan
64,865,523 Thailand
144,112,353 Russia
60,424,213 France
141,340,476 Bangladesh
60,270,708 Great Britain
137,253,133 Nigeria
58,317,930 Dem. Rep. of the Congo
127,333,002 Japan
58,057,477 Italy
104,959,594 Mexico
48,598,175 South Korea
86,241,697 Philippines
47,732,079 Ukraine
82,689,518 Vietnam
10. According to the 2005 World Almanac for Kids, below are the ten largest cities
followed by the population in the world in 2004 in no particular order.
Tokyo, Japan 34,450,000; Kolkata (Calcutta), India 13,058,000; Mexico City, Mexico
18,066,000; Shanghai, China 12,887,000; New York City, U.S. 17,846,000; Buenos
Aires, Argentina 12,583,000; São Paulo, Brazil 17,099,000; Delhi, India 12,441,000;
Mumbai (Bombay), India 16,086,000; Los Angeles, U.S. 11,814,000. Find the mean and
the standard deviation.
11. According to the 2005 World Almanac for Kids, below are the American League
Pennant Winners since 1970, with the year they won preceding the name and their won-lost
record following their name. Remove the shortened strike season of 1981 and the
year where there was no World Series and find the mean and the standard deviation for
wins.
1970 Baltimore 108 54, 1971 Baltimore 101 57, 1972 Oakland 93 62, 1973 Oakland 94
68, 1974 Oakland 90 72, 1975 Boston 95 65, 1976 New York 97 62, 1977 New York 100
62, 1978 New York 100 63, 1979 Baltimore 102 57, 1980 Kansas City 97 65, 1981 New
York 59 48, 1982 Milwaukee 95 67, 1983 Baltimore 98 64, 1984 Detroit 104 58, 1985
Kansas City 91 71, 1986 Boston 95 66, 1987 Minnesota 85 77, 1988 Oakland 104 58,
1989 Oakland 99 63, 1990 Oakland 103 59, 1991 Minnesota 95 67, 1992 Toronto 96 66,
1993 Toronto 95 67, 1994 none, 1995 Cleveland 100 44, 1996 New York 92 70, 1997
Cleveland 86 75, 1998 New York 114 48, 1999 New York 98 64, 2000 New York 87 74,
2001 New York 95 65, 2002 Anaheim 99 63, 2003 New York 101 61
12. Look up the ages of the presidents of the United States when they took office. Find
the mean and standard deviation of the presidents ages. Then repeat the process, lumping
together those presidents who were inaugurated before the Civil War and those who were
inaugurated after the Civil War. What do you notice when you compare the pre and post
Civil War presidents’ ages?
For questions 13 to 19: M&M’s project - Some years come and go, but other years live
in the hearts and minds of men and women for all eternity. Such was the year of 1941,
when Pearl Harbor was attacked, Joe DiMaggio hit safely in 56 straight games and
M&M’s were first introduced to the public. Daughters everywhere love M&M’s; some, in
particular, love the blue pieces the most. The original M&M’s had violet candies
and no blue ones in 1941. Then, in 1949, tan replaced violet and in 1995, tan was
replaced by blue. M&M’s were made round by taking milk chocolate centers and
tumbling them to get their smooth rounded shape. We all know M&M stands for Mars
and Murrie and that the different-colored M&M’s taste the same. According to the M&M’s
website, http://www.mmmars.com/cai/mms/faq.html:
• M&M’s Milk Chocolate candies are 30% brown, 20% each of yellow and red,
and 10% each of orange, green and blue
• M&M’s Peanut Chocolate candies are 20% each of brown, yellow, red and blue,
and 10% each of green and orange
• M&M’s Peanut Butter and Almond Chocolate candies are 20% each of brown,
red, yellow, green and blue
• M&M’s Crispy Chocolate candies are 16.6% each of brown, red, yellow, green,
orange and blue.
Let’s perform our own test and see if our observation of the percent of each color
matches the website’s prediction.
13. Buy a one-pound bag of M&M’s Milk Chocolate for each student in your class. As a
class, for each bag, tally up the number of M&M’s of each color. Find the percent of each
color for each bag.
14. Then tally up the colors for all the bags, and find the percent of each color for the
classroom sample, which consists of all the bags.
15. Using each bag as individual trials, find the mean, median and mode for each color.
Then find the percent of colors based on these findings.
16. How do the results in questions 14 and 15 compare? How do the results compare to
the statistics on the M&M’s website?
17. Repeat the experiment for Peanut Chocolate, Peanut Butter and Almond Chocolate,
and Crispy Chocolate.
18. Answer this question: how can you use standard deviations in this experiment to
help you analyze your findings and decide on the reliability of the data on the
M&M’s website?
19. Compute those standard deviations to determine the reliability of the data on the M&M’s
website.
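One way to organize the arithmetic for questions 13 to 19 is sketched below in Python. It is an illustration under stated assumptions, not the text's method: the three bag tallies are invented purely so the code has something to run on and must be replaced with the class's real counts, the names bags, percents and summary are hypothetical, and measuring the gap between the claimed share and the per-bag mean in units of the sample standard deviation is just one reasonable reading of question 18.

```python
from statistics import mean, median, stdev

COLORS = ["brown", "yellow", "red", "orange", "green", "blue"]

# Shares the text quotes for M&M's Milk Chocolate, in percent.
CLAIMED = {"brown": 30, "yellow": 20, "red": 20,
           "orange": 10, "green": 10, "blue": 10}

# INVENTED tallies for three bags, only to exercise the code.
# Replace them with the counts your class actually records.
bags = [
    {"brown": 150, "yellow": 100, "red": 95, "orange": 55, "green": 50, "blue": 50},
    {"brown": 140, "yellow": 110, "red": 100, "orange": 45, "green": 55, "blue": 50},
    {"brown": 160, "yellow": 90, "red": 105, "orange": 50, "green": 45, "blue": 50},
]


def percents(counts):
    """Convert color counts into percentages of the total count."""
    total = sum(counts.values())
    return {c: 100 * counts[c] / total for c in COLORS}


# Question 13: percent of each color within each bag.
per_bag = [percents(bag) for bag in bags]

# Question 14: percent of each color in the pooled classroom sample.
pooled_counts = {c: sum(bag[c] for bag in bags) for c in COLORS}
pooled = percents(pooled_counts)

# Questions 15 to 19: treat each bag as one trial and summarize per color.
summary = {}
for c in COLORS:
    shares = [p[c] for p in per_bag]
    avg, sd = mean(shares), stdev(shares)
    # Distance of the claimed share from the observed mean, in standard
    # deviations; undefined when every bag gave the same share.
    gap = (CLAIMED[c] - avg) / sd if sd > 0 else None
    summary[c] = {"mean": avg, "median": median(shares), "sd": sd, "gap_in_sd": gap}
    gap_text = "n/a" if gap is None else f"{gap:+.2f}"
    print(f"{c:>6}: pooled {pooled[c]:.1f}%, mean {avg:.1f}%, "
          f"sd {sd:.2f}, claimed {CLAIMED[c]}% ({gap_text} sd from mean)")
```

A claimed share that sits many standard deviations away from the class's per-bag mean would be a reason to doubt the website's figure; a small gap is consistent with it.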
For problems 20 to 22, refer to Example 3 from the text.
20. Compute the standard deviation for the data from Example 3.
21. Determine which states are the friendliest to home ownership and which states are
the least friendly.
22. Is there a cause and effect relationship that you can argue to explain why these states
are at either end of this analysis?
23. Barry Bonds or Babe Ruth. Who was the greatest baseball player of all time? To
argue your point, quote statistics. Research their batting averages and compare them to the
batting averages of their peers at the time. How many standard deviations from the mean
were their batting averages? Do the same for home runs, RBIs and on-base percentage.
Factor in that Barry Bonds played in night games and that Babe Ruth won 20 games as a
pitcher. Best of luck…
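The phrase "how many standard deviations from the mean" can be turned into a small calculation. The Python sketch below shows one way to do it, assuming the population standard deviation of the peer group is the yardstick. The function name standard_deviations_from_mean is hypothetical, and the numbers are placeholders that belong to no real player; the researched figures for each player and his peers would go in their place.

```python
from statistics import mean, pstdev


def standard_deviations_from_mean(value, peer_values):
    """Return how many standard deviations `value` lies from the peer mean."""
    spread = pstdev(peer_values)  # population form: divides by n
    if spread == 0:
        raise ValueError("all peer values are identical; the ratio is undefined")
    return (value - mean(peer_values)) / spread


# PLACEHOLDER numbers for illustration only; they describe no real player.
peer_averages = [0.250, 0.260, 0.270, 0.280, 0.290]
player_average = 0.330

z = standard_deviations_from_mean(player_average, peer_averages)
print(f"{z:.2f} standard deviations above the peer mean")
```

Because the result is expressed relative to each player's own era, the same function can be reused for home runs, RBIs and on-base percentage, which makes players from different decades easier to compare.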