Statistics
Never attribute to malice that which is adequately explained by stupidity.
Facts are stubborn things, but statistics are more pliable. – Mark Twain
Numbers don't lie. And often perception is not reality. Case in point: people are always
concerned, worried even, about the increasingly violent society in which they live.
Statements like "What is this world coming to?" are commonplace. People tend to
feel increasingly unsafe, and many become more and more reluctant to go out at night.
Some have been known to hide behind their TV or computer instead of venturing out. There
are many who feel that when they were young, crime was not as bad as it is now. People
attribute arbitrary reasons for this new wave of perceived violence. “Exhaustive music
videos glorify violence, causing a violent cycle to never end...” “No wonder there is so
much crime these days, look at all the violence on TV and in the movies…” “Kids have
no respect for their parents, teachers or elders these days, this contributes to more
violence…” Or “the remote control teaches us to become impatient, and we are more
likely to quickly pull the trigger…” Images from the OJ murders, Columbine shootings
or 9/11 tend to fill our televisions, replaying the same isolated scenes over and over
again. People are shot every night on reruns of Law and Order. So, it's natural for
people to criticize the amount of violence in our society, but rarely do these same people
think their claims through to their logical conclusion. Instead,
many appear to become angry about the rise of violent crime in this country and tend to
make matters worse by linking this acquired malice to other elements of society (music
videos, teenagers, TV, OJ), fostering a wider net of hate. More importantly, they never
once pause to check out the numbers. And in a matter of moments, anyone can do just
that, check out the numbers. Any of us can access on the WWW the FBI’s Index of
Crime Statistics. So, we did.
Below are the nationwide statistics from 1982 to 2001, showing, by year, the number of
violent crimes. During this twenty-year span, while the nation's population
grew from 231,664,458 in 1982 to 284,796,887 in 2001, the number of violent crimes, as
defined by murder, rape, robbery, and assault, did not steadily increase as expected. In
fact, there was clearly a stunning decline in violent crimes over the last decade. Violent
crime reached its peak in 1992, with 1,932,274 reported instances, and since then violent
crime has dropped over 25 percent. (The homicides of September 11, 2001 were not
included.)
[Graph: Violent crime (murder, rape, robbery, and assault), by year, 1982 to 2001; counts range from 0 to about 2,500,000]
But, look at the age we live in. We have all seen the headlines:
• Arizona Kids Are Home Alone: a new survey says 30 percent of kids in kindergarten through 12th grade take care of themselves
• Of the 85 prisoners executed in 2000, 43 were white, 35 African American, 6 Hispanic, 1 American Indian
• Vietnam: 58,168 deaths; total abortions since 1973: 44,670,812 as of April 22, 2004
• Should juveniles be tried as adults? Kids are killing these days in record numbers
Statistics are tossed at us in such a deluge that the numbers alone seem almost controversial:
30% of school-age children left alone, 35 out of 85 executed are African American, 44
million abortions in the last 30 years. Certainly, each of these topics elicits emotion from
within each of us: too many parents leave their children unsupervised, there is not enough
funding for day care, the death penalty, pro or con, racially biased, too many or too few
juveniles tried as adults? And if you want to clear a room with angry combatants, start with the age-old
question, woman's choice or murder of the unborn? No matter your stand on these
topics, as you comb through the headlines, statistics besiege you.
Why is quantitative literacy important? When confronted with numbers associated with
hotly contested issues or highly controversial ethical or moral arguments, raw numbers
themselves, such as the above stated 44,670,812 abortions in the last 30 years, need to be
examined so they may be fully understood. As always, we begin by examining the
number for credibility. Is it even viable? This particular number, or numbers similar to it,
appear on various websites. We easily found these and similar numbers quoted at
http://www.americandaily.com/article/1806 and
http://womensissues.about.com/cs/abortionstats/a/aaabortionstats.htm. Are they accurate? Well, we
simply have no way of knowing, but these are often-published statistics. Are they viable?
Now, that is a different question altogether.
Following our pattern of analysis, if the number seems to be viable, then we continue. If
it is viable, what implications are fair to draw? These 44 million aborted fetuses
would be 30 years of age or younger, so for argument's sake, let us assume it is fair to say a
large percentage would be alive today. If this assumption is reasonable, roughly 40 million plus
the 290 million US citizens comes to about 330 million. We are talking about a population of
330 million people, and 44/330 is slightly larger than 13%, or slightly greater than 1/8.
What does this mean? Has society aborted 1 in 8? Don’t questions abound in your
mind? Is this correct? Were these all abortions performed out of necessity? How many
were medical? Or moral? Or personal choice? Does the reason for the abortion matter
to you? Does the reason for the abortion matter to you if you take into account this new
“1 in 8” statistic as a measure of how often abortions do occur? Certainly, one may argue
that 1 in 8 could be construed as an alarming rate. But, the point of view and the
emotions you feel are personal for you. The point is that 44 million is the statistic we are
confronted with. Our ability to perform math tells us 1 in 8 is a logical consequence of
this statistic. What you do in the subsequent interpretation is your decision. But,
quantitative literacy will allow you to understand the statistic in context and make the
interpretation.
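Restating the arithmetic of the paragraph above as a single line (an added illustration, not part of the original text):

$$\frac{44 \text{ million}}{330 \text{ million}} \approx 0.133 \approx 13\%, \qquad \text{slightly more than } \frac{1}{8} = 0.125.$$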
Statistics themselves are numbers that stand alone. Honest. Raw. Naked numbers. The
name of the game in statistics is to draw inferences about a population or topic. If we are
using polls, we are basing inferences on a smaller random sample of the general
population. When trying to then form a conclusion, we must be careful. Correlation is
not causation: just because numbers correlate does not mean one causes the other.
Inferring characteristics about a population based on the raw data is the immediate
reaction as we scan the headlines, but should it be? Can graphs be misleading? How
good are we at recognizing misleading information?
Causation and Correlation
There exists a relationship between attendance and grades. Research shows that students
who attend class regularly have better grades than those who don’t. Does this mean that
attending class will cause a student to have a better grade, that is, will simply coming to
class increase one’s grade? What about the student who regularly comes to class because
they can get 50 minutes of solid rest by laying their head on the desk? The nature of this
question illustrates the need for a distinction between two words, causation and
correlation. Cities with more pornography have a higher crime rate. What is the
relationship between these two variables; are the social implications as obvious as is
implied? Relationships between variables are not always cut and dried. Studies can show that
children who come from economically advantaged homes perform better in high school.
If anyone took this study and concluded that, as a society, the smarter citizens tended to
rise to the top of the economic food chain, the public outcry would be palpable. This is
because other factors need to be taken into consideration, such as the premise that
advantages are just that, advantages. Other factors, such as better access to tutors,
better access to support systems, or not having to study while you're hungry or cold or
working full time, certainly contribute to one's academic performance. Correlation
should never be used interchangeably with causation. Sometimes correlation indicates
causation, sometimes not.
Clearly, there exists a high correlation between the blood alcohol level in a
person's body and the likelihood they will get into an auto accident. We do not think any
rational person would dispute the added inference that drinking alcohol can cause an auto
accident. The data that support the relationship between the two factors (a higher
proportion of drunk drivers than of sober drivers get into accidents) imply correlation. That
drinking led to, or caused, the accident implies causation. It will be our task to determine
whether a factor's data that correlates to some other factor's data can be interpreted to
mean that one factor influences the other.
Correlation A correlation exists between two factors if a change in one of the factor’s
data is associated with a rise or decline in the other factor’s data.
Causation A causation exists between two factors if one factor causes, determines, or
results in a rise or decline in the other factor's data.
Correlation as a result of causation As with drinking and auto accidents, we can often
infer that a correlation is tied to causation. Another equally clear case can be made by
considering tobacco use and lung cancer. The numbers correlate: one can relate the
amount one smokes to the likelihood of succumbing to lung cancer. Those who smoke
more have a higher percentage of their population afflicted with lung cancer. And, for
years, the Surgeon General has been telling us that smoking causes lung cancer. The
more you smoke, the higher the risk of developing lung cancer.
Correlation with no causation. Hidden factors Just because two factors correlate does
not mean one factor causes the other. One of the easiest examples to spotlight the
difference and to have it plainly explained is to look at a common correlation between
divorce and death. In most states, there is a significant negative correlation between the
two: the more divorces, the fewer deaths. Since the two correlate negatively, the natural
question arises: does getting a divorce reduce the risk of dying; does staying married
increase the chance of dying? All joking aside about the obvious hidden implication, it is
a third unseen factor that causes the correlation. Death and divorce do not have a causal
relationship. Age does. The older the married couple, the less the likelihood they will
get a divorce. The older the married couple, the higher likelihood they will pass away.
There is a negative correlation between divorce and age and a positive correlation
between age and death. The younger you are, the more likely you are to get divorced.
The older you are, the more likely you are to pass away. Since the higher number of
divorces occurs among younger people, and since younger people tend to live longer, we
have a transitive relationship linking the higher number of divorces to the
longer life spans. Correlation. Causation. Very different. Yes, there is a correlation
between divorce and death. No, neither causes the other. In plain English, getting a
divorce will not increase or decrease the likelihood you will die.
Accidental Correlations Sometimes there exist accidental correlations, where there is
no hidden factor or unseen logical explanation. The winner of the Super Bowl and
the party of the winner of the presidential race correlate highly every four
years, but do not think football predicts the presidential races, or vice versa. This is an
accidental correlation.
Misleading Information
Breast cancer will afflict one in eleven women. But this figure is misleading because it
applies to all women to age eighty-five. Only a small minority of women live to that age.
The incidence of breast cancer rises as the woman gets older. At age forty, one in a
thousand women develop breast cancer. At age sixty, one in five hundred. Is the statistic
one in eleven technically correct? Yes. Should a 40-year-old woman be concerned with
getting breast cancer? Certainly. Should she worry that one in eleven of her peer
group will be afflicted? No. And while one in a thousand in her peer group will be
afflicted, this by no means minimizes the seriousness of the issue; it simply sheds a more
realistic light on it.
A scatter plot is a graph of ordered pairs that allows us to
examine the relation between two sets of data.
To draw a scatterplot (a short sketch of these steps follows the list):
• Arrange the data in a table.
• Decide which column represents the x-values (the data along the
horizontal axis). Those values need to be the perceived cause, the independent
variable. Decide which column represents the y-values (the data represented along the
vertical axis). These values need to appear to be affected by the perceived cause,
the dependent variable.
• Plot the data as points, each an ordered pair (x, y).
• Analysis: We can make predictions if the points show a correlation.
* If the points appear to increase while reading the scatter plot from left to right, this
is a positive correlation.
* If the points appear to decrease while reading the scatter plot from left to right, this
is a negative correlation.
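As an illustration only (not part of the original text), the same steps can be carried out in Python with matplotlib; the hours-studied and exam-score pairs below are made up for the sketch:

```python
import matplotlib.pyplot as plt

# Hypothetical data: hours spent studying (the perceived cause, x) and
# the grade earned on Exam One (the affected value, y)
hours = [1, 2, 2.5, 3, 4, 5, 6]
grades = [55, 62, 70, 68, 80, 88, 95]

plt.scatter(hours, grades)          # plot each ordered pair (x, y)
plt.xlabel("Hours Spent Studying")  # independent variable on the horizontal axis
plt.ylabel("Grade on Exam One")     # dependent variable on the vertical axis
plt.title("Points rising from left to right: a positive correlation")
plt.show()
```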
Positive Correlation: We expect that if the values along the horizontal axis increase, so
do the values associated with the vertical axis. That is, as we increase x, we increase y.
The more we study, the higher we expect to score on Exam One.
[Scatter plot: Hours Spent Studying (x) versus Grade on Exam One (y), showing a positive correlation]
Negative Correlation: We expect that if we increase x, then we decrease y. The higher
the temperature, the fewer minutes we will jog.
[Scatter plot: Temperature (Fahrenheit) (x) versus Minutes Spent Jogging (y), showing a negative correlation]
Problem One
For each pair below, decide if there is a correlation between the two factors. If there is, is it a
positive correlation or a negative correlation? Then decide if the two factors have a causal
relationship. If they do not have a causal relationship but they do correlate, determine if
there are hidden factors that explain the correlation, if the correlation is accidental, or if
there is misleading information.
a. A child's shoe size, a child's ability to do math
b. Blood alcohol level and reaction time
c. A girl's body weight, the time the girl spends playing with dolls each day
d. Price of an airline ticket, the distance traveled
Solution
a. Positive correlation. A child's shoe size does correspond to a child's ability to do
math: the larger a child's shoe size, the better at math they tend to be. But the relationship is
not causal; large feet do not cause a child to perform math better. There is a hidden
factor: age. The older the child, the larger the child's shoe size. As children become older,
they take more math classes. The more math classes the child has participated in, the
better the child performs in math.
b. Negative correlation. The higher the blood alcohol level, the slower the reaction time.
The relationship is causal.
c. Negative correlation. As a girl's body weight increases, she plays with dolls less each
day. There is no causal relationship here; again there is a hidden factor, and again it is age. The more a
girl weighs, the older she is, and the less time she spends playing with dolls.
d. Positive correlation. The longer the distance, generally, the more expensive the ticket.
Causation.
Problem Two
Let’s examine the basic question, “Do students who do better on a placement exam
perform better in a college algebra course?” Below is the data. Draw a scatter plot and
answer the question.
Placement Score    Final Average in College Algebra
70                 91
68                 89
56                 71
40                 62
78                 95
59                 65
67                 85
45                 66
61                 70
Solution
We need to examine the data visually to see if there exists a positive or a negative
correlation between higher placement test scores and performance in college algebra.
Below is a scatter plot of the placement test data and average scores on a College Algebra
Final.
[Scatter plot: Placement Test Score (x) versus Final Average in College Algebra (y)]
Though not all points show the same trend, the general trend is that an increase in placement
score does translate to an increase in the final average grade.
Exercise Set
For problems 1-13, decide if there is a correlation between the two factors. If there is, is it a positive correlation or a negative correlation? Then decide if the two factors have a causal relationship. If they do not have a causal relationship but they do correlate, determine if there are hidden factors that explain the correlation, if the correlation is accidental, or if there is misleading information.

1. Altitude, air pressure
2. Number of homes sold, realtor's income
3. Number of abortions in the US, number of people who are pro-choice
4. Encouragement of cattle ranching, amount of rain forest
5. The length of time a couple is together, the similarity of their outlook on life
6. A senior citizen's age, clarity of vision for the senior citizen
7. Weight of an envelope, postage on the envelope
8. A boy's height, a boy's time spent watching cartoons each day
9. Minutes hot coffee sits on a desk, the temperature of the coffee
10. Rate of violence in a city, unemployment rate in the same city
11. Petroleum consumption, quality of air
12. Number of cars on the highway, quality of air
13. Efficiency of a household appliance, size of the monthly electric bill

14. From a survey of 2000 people, the table below gives averages for the number of years in school and the associated average monthly salary. Make a scatter plot, labeling the x and y axes.

Number of Years in School    Average Monthly Salary
Less than 12 (approx. 10)    $1,500
12                           $1,750
14                           $2,100
16                           $2,550
18                           $3,000
20                           $4,200

15. Draw a line through the data that closely fits the scatter plot of cumulative donations to a charity by year, shown below.

[Scatter plot: cumulative donations to a charity, by year]

16. From the scatter plot below, interpret the linear pattern and predict the percent of students who failed the math course in the year 2000.

[Scatter plot: percent of students failing the math course (0 to 30), by year, 1988 to 2000]

17. Which data below has the greatest negative correlation?

[Four scatter plots, labeled a) through d), each showing dollar amounts ($) by year, 1965 to 2000]
Construct and Draw Inferences
Constructing and drawing inferences are essential to critical thinking and problem
solving. When faced with statements, problems and puzzles, we do more than use
common sense. We use problem solving skills, try to find patterns and infer statements
that follow logically from the statements given. We determine what is reasonable and
what is not. We determine what should logically follow and what should not in order to
make good decisions.
Circle Graphs
Taken directly from newspaper headlines: Should a juvenile be tried as an adult? To
address this issue, we should ask ourselves many questions and look at this crucial
problem from many perspectives. For many of us, the first question we ask may be “Do
juveniles who murder pose a chronic problem in this country?” Well, what’s chronic? If
a large percentage of all murders were done by juveniles, this could be called chronic.
We return to the Crime Index as defined by the FBI from 2001. Let us ask the question,
“does there exist a correlation between age and those who commit murder in this
country?" As long as we have the information grouped by category, which in this case is
by age, we can display large numbers as percents of the whole in a
pie chart, or circle graph.
First, let's see how a circle graph, or pie chart, is made. We subdivide a circle into
sectors, each represented by its central angle, measured either in degrees (out of 360 degrees)
or as the percent of the circle to be shaded (out of 100%). For example, a category containing
25% of the data gets a sector with a central angle of 0.25 × 360° = 90°.

[Figure: Common subdivisions of a circle]
So, for our question, "is there a correlation between age and those who commit
murder in this country?", we examine the data taken from the Crime Index. Of the
10,113 known murderers in the country in 2001, the age distribution was
given as follows:
Age, in years    Number
1 to 4           0
5 to 8           0
9 to 12          14
13 to 16         454
17 to 19         1,695
20 to 24         2,767
25 to 29         1,571
30 to 34         992
35 to 39         855
40 to 44         645
45 to 49         455
50 to 54         272
55 to 59         158
60 to 64         85
65 to 69         59
70 to 74         37
75 and over      54
Total            10,113
Since the data is already organized, let's find the density of each age group. This means
we will reconstruct the table and find the percent of murderers in each category: 1 to 4, 5
to 8, 9 to 12, 13 to 16, and so on. Note that not all categories are partitioned into equal age
intervals.
Age, in years    Number    Relative frequency         Central angle
1 to 4           0         0                          0°
5 to 8           0         0                          0°
9 to 12          14        14/10,113 ≈ 0.0013         0.0013 × 360° ≈ 0.5°
13 to 16         454       454/10,113 ≈ 0.0449        0.0449 × 360° ≈ 16.2°
17 to 19         1,695     1,695/10,113 ≈ 0.1676      0.1676 × 360° ≈ 60.3° (about 1/6 of the circle)
20 to 24         2,767     2,767/10,113 ≈ 0.2736      0.2736 × 360° ≈ 98.5°
25 to 29         1,571     1,571/10,113 ≈ 0.155       ≈ 55.9°
30 to 34         992       ≈ 0.098                    ≈ 35.3°
35 to 39         855       ≈ 0.085                    ≈ 30.4°
40 to 44         645       ≈ 0.064                    ≈ 23.0°
45 to 49         455       ≈ 0.045                    ≈ 16.2°
50 to 54         272       ≈ 0.027                    ≈ 9.7°
55 to 59         158       ≈ 0.016                    ≈ 5.6°
60 to 64         85        ≈ 0.008                    ≈ 3.0°
65 to 69         59        ≈ 0.006                    ≈ 2.1°
70 to 74         37        ≈ 0.004                    ≈ 1.3°
75 and over      54        ≈ 0.005                    ≈ 1.9°
Total            10,113    1                          360°, a whole circle
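The relative frequencies and central angles in the table can also be generated programmatically; as an illustration only (not part of the original text), here is a minimal Python/matplotlib sketch using the counts above:

```python
import matplotlib.pyplot as plt

# Known-offender counts by age group, 2001 (from the table above)
ages = ["1-4", "5-8", "9-12", "13-16", "17-19", "20-24", "25-29", "30-34",
        "35-39", "40-44", "45-49", "50-54", "55-59", "60-64", "65-69",
        "70-74", "75+"]
counts = [0, 0, 14, 454, 1695, 2767, 1571, 992, 855, 645, 455, 272,
          158, 85, 59, 37, 54]

total = sum(counts)                    # 10,113 murderers of known age
for age, n in zip(ages, counts):
    rel = n / total                    # relative frequency
    print(f"{age:>5}: rel. freq. {rel:.4f}, central angle {rel * 360:6.1f} deg")

# Draw the corresponding circle graph
plt.pie(counts, labels=ages)
plt.title("Murder Offenders by Age, 2001")
plt.show()
```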
[Pie chart: Murder Offenders by Age, 2001, one sector for each age group from 1 to 4 through 75 and over]
The pie chart is illuminating. Very quickly, by glancing at the chart, we can
tell that 20 to 24 year olds commit the most murders, but close seconds seem
to be the 17 to 19 year olds as well as the 25 to 29 year olds. If a juvenile is defined to
be under 18 years of age, then this appears to be a chronic problem, because
the second most dense population of murderers occurs in the 17 to 19 age group. When we factor in the
13 to 16 year olds (454), the problem of juvenile murder seems even more acute.
For murders committed by teenagers alone, the 13 to 19 year old age group accounts for 454 + 1,695,
or 2,149, murders. This comes to 2,149/10,113, or
just a little over 20 percent, and this doesn't include the children who are 12
or under.
Now, let's continue to examine this problem. Numbers never lie. But rearranged,
could they deceive? Could the very same numbers be used by the opposing side of the
argument to make the opposing view more viable? As stated, first we re-arrange the
numbers.

Age group      Number of known offenders
1 to 19        14 + 454 + 1,695 = 2,163
20 to 39       2,767 + 1,571 + 992 + 855 = 6,185
40 to 59       645 + 455 + 272 + 158 = 1,530
60 and over    85 + 59 + 37 + 54 = 235
We then construct a pie chart from these new subdivisions. Again, keep in mind we are only
considering the murders where we know the age of the murderer; there were 10,113 of
these murders.

[Pie chart: Murderers by age, 2001, with sectors 1 to 19, 20 to 39, 40 to 59, and 60 and over]
But, we are trying to represent the opposing point of view and we are trying to show
murder by juveniles is not a ‘chronic problem’. So, in 2001, there were an additional
5375 murders where the age of the perpetrator was unknown. Regrouping, our table
looks like:
Age group      Number
1 to 19        2,163
20 to 39       6,185
40 to 59       1,530
60 and over    235
Unknown        5,375
Let's examine the new pie chart. Notice how much smaller the piece of the pie
for the 1 to 19 year old segment now is compared to the whole. This is a
significant difference from the previous pie chart, where we did not factor in the
murders committed by people of unknown ages.

[Pie chart: Murderers, by age, 2001, with sectors 1 to 19, 20 to 39, 40 to 59, 60 and over, and unknown]
To further enhance our argument, we may construct the slices of the pie with a
3-dimensional representation. We then shift the angle of the segment of the pie
we are trying to ostensibly hide so that it is less prominent. Our point that
juvenile crime is not a chronic problem seems more justified to the viewer's eye.

[3-D pie chart: Murderers by age, 2001, with sectors 1 to 19, 20 to 39, 40 to 59, 60 and over, and unknown, drawn to de-emphasize the 1 to 19 slice]
To add a final touch in enhancing our argument, let's re-categorize and change two
groupings: 1 to 19 and 20 to 39 become 1 to 16 and 17 to 39. If we keep the category of
unknown murderers in the groupings, let's compare the original pie chart with the final
one. To the naked eye, a quick glance reveals the juvenile slice to be a mere sliver on the
left chart compared to nearly a quarter of the pie on the right.

[Two pie charts: Murderers by age, 2001. One uses the groupings 1 to 16, 17 to 39, 40 to 59, and 60 and over; the other uses 1 to 19, 20 to 39, 40 to 59, 60 and over, and unknown.]
Statistics don't lie, but they can be rearranged to show whatever suits one's agenda.
Problem Two
The graph below is shown, and a TV anchorman states, "There was a sharp dramatic
increase in drunk driving convictions between the year 1999 and the year 2000."
Consider the statement and comment on its accuracy.
Solution
According to the figure, the actual increase in drunk driving convictions between 1999
and 2000 was 12, up to 732 from 720 the year before. Though this is an increase, it
cannot be considered a "sharp dramatic increase." Evaluating the data another way, the
percent increase, $\frac{12}{720} \times 100 \approx 1.7\%$, is neither significantly sharp nor particularly dramatic. The
anchorman was overdramatizing the report; the words may be deemed inflammatory,
bordering on misleading.
Problem Three
Drawing Inferences A bucket has small green balls, medium blue balls, large pink
balls, and very large red balls. A child picks ten balls, selecting each randomly so that each
ball is equally likely to be selected. Four such trials were conducted. Which trial most
closely resembles the theoretical probability that would occur if the balls were selected
randomly ten times?
a)
Balls             Number of Balls Selected
Small Green       2
Medium Blue       2
Large Pink        2
Very Large Red    4

b)
Balls             Number of Balls Selected
Small Green       3
Medium Blue       2
Large Pink        3
Very Large Red    2

c)
Balls             Number of Balls Selected
Small Green       2
Medium Blue       3
Large Pink        2
Very Large Red    3

d)
Balls             Number of Balls Selected
Small Green       3
Medium Blue       2
Large Pink        4
Very Large Red    1
Solution
First, we need to calculate the theoretical probability for each type of ball. Recall, the
probability is the number of successful outcomes divided by the total number of
outcomes. The total number of balls is 20. There are 6 small green ones, 4 medium blue
ones, 7 large pink ones, and 3 very large red ones. Below are the theoretical
probabilities:
Balls             Probability
Small Green       6/20
Medium Blue       4/20
Large Pink        7/20
Very Large Red    3/20
If ten balls were selected, we could anticipate 3 out of 10 balls to be small green, 2
out of 10 to be medium blue, 3.5 out of 10 to be large pink, and 1.5 out of 10 to
be very large red. No single trial can produce this exact outcome, and choice b) is the closest
trial to these expected results.
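As an illustration only (not part of the original text), here is a minimal Python sketch that computes the theoretical probabilities and the expected counts for ten selections:

```python
from fractions import Fraction

# Contents of the bucket, from the problem statement
bucket = {"small green": 6, "medium blue": 4, "large pink": 7, "very large red": 3}
total = sum(bucket.values())     # 20 balls in all
draws = 10                       # ten balls are selected

for ball, count in bucket.items():
    p = Fraction(count, total)   # theoretical probability for this type of ball
    expected = draws * p         # expected number among the ten selected
    print(f"{ball}: probability {p}, expected in {draws} draws: {float(expected)}")
```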
Exercise Set
For problems 1 and 2, use the following
data. In the year 2000, a state lottery
distributes its $ 2.1 million proceeds in
the following manner:
Proceeds      Beneficiary
$900,000      Education
$500,000      Cities
$200,000      Highways
$200,000      Senior Citizens
$160,000      Libraries
$140,000      Other
1. Draw two circle graphs. One should
support the argument that too much
money from the state lottery went
toward education. The other should
support the counter argument, too little
money from the state lottery went
toward education.
2. Choose a side to the above argument.
Pro or Con. Then write a paragraph
defending your argument, citing social,
political, ethical and/or religious factors.
For problems 3-4, use the following data (source:
http://www.ucdmc.ucdavis.edu/vprp/Section6,2000.pdf). In
2000, the population of California was
33,871,648, and 134,227 Californians
purchased 193,489 handguns. 103,743
people purchased one handgun; 28,453
people purchased two to five handguns,
totaling 71,363 handguns; 1,855 people
purchased 6 to 12 handguns, totaling
14,053 handguns; and 176 people
purchased more than 12 handguns, for a
total of 4,330 handguns.
3. Construct two circle graphs. One
circle graph should support the argument
that there is a need for more restrictions
on handguns in the state of California.
The other circle graph should refute the
argument, that is, support the counter
argument that there is no need for more
restrictions on handguns in the state of
California.
4. Choose a side to the argument that
there is a need for more restrictions on
handguns in the state of California. Pro
or Con. Then write a paragraph
defending your argument, citing social,
political, ethical and/or religious factors.
For problems 5 and 6, use the following
data from the US Census Bureau. In 1999,
there were roughly 280,000,000 US
citizens, and 35,000,000 lived in
poverty. Of these 35 million,
12,100,000 were children, where
4,500,000 of these children lived in
families who were under one-half of the
poverty level. The poverty level was
defined as $ 13,290 per family of three.
For each problem, construct a circle
graph as designated below.
5. Draw a circle graph whose population
is the citizens of the United States.
Section the circle graph into two sectors,
one sector representing the US citizens
who live above the poverty level, one
sector representing the US citizens who
live below the poverty level.
6. Draw a circle graph whose population is those citizens who live below the poverty level. Section the circle graph into three sectors: one sector representing the adults who live below the poverty level, one sector representing the children who lived in families under one-half of the poverty level, and a third sector for all of the other children who lived below the poverty level.

For problems 7-8, observe the tables below. For each, decide what the greatest issue presented by these numbers is. Then argue one side of the issue, using pie charts to visually sway your reader. Be certain to outline the issue and show the supporting table(s) and pie chart(s). Discuss the potential harm of such practices.

7. Murder Victims, by Race and Sex, 2001. Note: The murders and nonnegligent homicides resulting from the events of September 11, 2001, were not included in the table below. Taken from Tables 2.3-2.15, Special Report Section V, http://www.fbi.gov/ucr/01cius.htm.

Race of Victim    Total     Male      Female    Unknown
White             6,750     4,785     1,962     3
Black             6,446     5,350     1,095     1
Other race        368       245       123       0
Unknown race      188       123       34        31
Total             13,752    10,503    3,214     35

8. Hate Crime Statistics, by Bias, 2003. Source: FBI Crime List in 2003, http://www.fbi.gov/pressrel/pressrel04/pressrel112204.htm.

Total victims: 9,100

Bias to race: 4,754
  Anti-White       1,006
  Anti-Black       3,150
  Other              598

Bias to Religion: 1,489
  Anti-Jewish      1,025
  Anti-Catholic       80
  Anti-Islamic       171
  Other              213

Bias to Other: 2,857
9. The graph below shows a
company's profits in its first four years
of existence.
What's wrong with this statement:
"There was a substantial increase in the
company's profit in its first 4 years of
existence"?
10. Poll your classmates as to the most
important ‘hot button’ campaign issue.
Create a table as you see below.
Topic               Frequency    Relative Frequency    Density
Terror
Racial Relations
Abortion
Death Penalty
Drugs
Education
Construct a histogram and a pie chart for
the data.
11. Project: Circle graphs, drawing
inferences
Sometimes we choose to see what we
want to see. We all stretch the truth,
exaggerate what we need, ignore what
hurts us and to what end, personal
wealth at the expense of personal worth?
From the US Census Bureau, 2000: Child
poverty in America dropped from 13.5
million children in 1998 to 12.1 million
in 1999. With a whisper of optimism, we
rationalize that this improvement was
great. Was it?
Do you ever have trouble focusing on
exams or concentrating on homework
assignments? How hard do you think it
would be to concentrate on your exams,
homework, or even your instructor's
lectures if your family didn't have
enough money to feed you? What if you
enough money to feed you? What if you
were in poor health and your family
couldn't afford to take you to the doctor
or provide the medicine you need? The
bitter truth is that in 2000, 12,100,000
children in America were living in
poverty and confronted these challenges
every day.
If a family of three were living below
the poverty line in 2000, they had an
income below $13,290 a year. Living in
poverty can translate to residing in
crowded housing, having your utilities
turned off, not owning a phone, or
refrigerator or car, not having enough
food to feed your family, not enough
medicine to heal your loved ones. And
the heart wrenching statistic is that 4.5
million children live in families that
exist below one-half of the official
poverty level.
Do we have your attention, are you
gasping in proper reverence? We
should. Particularly because in 2000,
America was experiencing one of its
greatest flashes of economic prosperity.
Business was skyrocketing, and people
were spending. But, was just a minute
percentage of Americans benefiting from
this new wealth? Ironically, in 2000,
the unemployment rate in the U.S. was
lower than it had been in years, but the
percentage of poor children in working
families was soaring. There were many
possibilities to explain this phenomenon,
but "Some economists (said) that if
wages had kept pace with the cost of
living since the 1960s, the minimum
wage would (have been) between $12
18
and $14 dollars" (CNN.com).” Instead,
the minimum wage was $5.15.
Assignment Go to the US Census
Bureau. Find out how many children
there were in the US in 2000.
Construct a circle graph with the
following categories: Children who
lived above the poverty level, children
who lived below the poverty level.
Draw separate sections of the circle
graph for those children who lived above
$ 6,645 a year (half of the poverty level
of $13,290 a year) and those who lived
below $ 6,645 per year. Also, include a
section of the graph for those children
who lived in the upper 1 % of the
income bracket and determine what that
income level was. Then tackle the
following questions:
a. Do you think there is a positive,
negative or no correlation between
concentrating in high school and
graduating from high school? Is it a
causation relationship? Why?
b. Do you think there is a positive,
negative, or no correlation between
concentrating in high school and getting into
college? Is it a causation relationship?
Why?
c. Do you think there is a positive,
negative or no correlation between
concentrating in high school and
acquiring well-paying jobs? Is it a
causation relationship? Why?
d. Do you think there is a positive,
negative or no correlation between
staying healthy and having access to
doctors and medicine? Is it a causation
relationship? Why?
e. Do you think there is a positive,
negative or no correlation between
poverty and crime? Is it a causation
relationship? Why?
f. Do you think there is a positive,
negative, or no correlation between
issues that politicians and lawmakers
have as a top priority and issues that
affect those under 18, who cannot vote?
Is it a causation relationship? Why?
For problems 12-17, use the information
provided: 5,000 years ago, forests
covered nearly 50% of the earth's land
surface. Since the advent of humans,
forests now cover less than 20%.
Forests serve as the lungs of our planet
by providing the very oxygen we
breathe. The rate of
deforestation is increasing, and although
extinction is nature's way of selectively
re-aligning our living world, this
extinction, the most acute since the
dinosaurs, is not nature's way. Humans
have caused it by themselves.
Source: According to RAN (Rainforest
Action Network) and Myers (op. cit.). In
Central and South America: Bolivia,
whose land mass is 1,098,581 square
kilometers, once had a forest cover of
90,000 sq km and now has a forest cover of
45,000 sq km. Brazil, whose land area is
8,511,960 sq km, once had a forest cover
of 2,860,000 sq km and now has a forest
cover of 1,800,000 sq km. Central
America has a land area of 522,915 sq
km, once had a forest cover of 500,000
sq km, and now has a forest cover of 55,000
sq km. Colombia has a land area of
1,138,891 sq km, once had a forest cover
of 700,000 sq km, and now has a forest cover
of 180,000 sq km. Ecuador's land area is
270,670 sq km; it once had a forest cover of
132,000 sq km and now has a forest
cover of 44,000 sq km. Mexico's land area
is 1,967,180 sq km; at one time its forest
cover was 400,000 sq km and now its
forest cover is 110,000 sq km.
12. For each country, construct a circle graph where the circle represents the land area of the country. Divide each circle into two sectors, one for the country's land area that was once covered by forests and one for the land area that was not covered by forests at that time.

13. For each country, construct a circle graph where the circle represents the land area of the country. Divide each circle into two sectors, one for the country's land area that is currently covered by forests and one for the land area that is currently not covered by forests.

14. For each country, construct a circle graph where the circle represents the original extent of forest cover. Divide the circle into two sectors, one for the existing land area covered by forests and one for the land area lost to deforestation.

15. Construct a circle graph that represents the total land area for Bolivia, Brazil, Central America, Colombia, Ecuador, and Mexico. Divide the circle graph into twelve sectors, two for each country, where one sector represents the land area currently covered by forests and the other the land area currently not covered by forests.

16. Construct a circle graph that represents the total land area for Bolivia, Brazil, Central America, Colombia, Ecuador, and Mexico. Divide the circle graph into twelve sectors, two for each country, where one sector represents the land area that was once covered by forests and the other represents the land area at that time that was not covered by forests.

17. After assimilating the information and viewing the circle graphs from problems 12-16, provide an argument, either pro or con, with regard to the following statement: "Deforestation of the rain forests in Central and South America is threatening the local environment as well as the global environment. It should be a hot button issue in today's society."
Measures of Central Tendency
Mean, Median, and Mode Finding a number that best represents a set of data is
important to you right now, because your choice of the "representative" number that
best indicates your grade can determine your course grade. Mathematicians say that the
number that is going to serve as the spokesperson for the data should reflect the
center, or the middle, of the data.
Usually we begin by averaging the numbers to find that representative number; we find
the sum and then divide by the number of data points. But if the data consists of exam
scores and you earned a 95, 95, 95, 95, 95, and a 45, then your average is found with two
calculations: 95 + 95 + 95 + 95 + 95 + 45 = 520 and 520/6 ≈ 86.7. This means the center
of your data, or the letter grade that best represents your data, is a B according to
your average, and yet you never once earned a B. In fact, you earned only A's, except for
one failing grade. You pause, because clearly you earned 5 grades of 95 and just one
grade of 45. The five A's must count for something, right? The data that appears the
most, 95, is described as the mode, and it is simply another representation of the tendency
of the data.
Now that we see there is more than one way to refer to the center of the data, let’s begin
with perhaps a more realistic example. Suppose we knew you had the following exam
scores: 60, 80, 60, 70, 80, 80, 90, and 95. You're thinking perhaps you deserve an A
because your last two grades were A's. Or at the very least, you deserve a B. You begin
by finding your average or mean, which is the sum of the scores divided by the number
of scores; so you average your grades. First you add the scores: 60 + 80 + 60 + 70 + 80
+ 80 + 90 + 95 = 615. You had 8 exams and the average is found by dividing 615 by 8;
615/8 = 76.9 or a C. Uh oh. You change your strategy. You argue that you scored an
80 three times, you deserve a B. The mode is the data that occurs most frequently, and
your mode is an 80. Does this help your argument? Well, one more indication of the
middle of your data is the middle value when you align the numbers in order, either from
top to bottom or from bottom to top. So, we arrange our data as 60, 60, 70, 80, 80, 80,
90, and 95. The data that occurs in the middle is called the median, like the median of
the highway. If there is an odd number of data points, the median will be a number found
in the data set. If there is an even number of data, there will be two numbers in the
physical middle of the data, and when this occurs, you need to average the two middle
numbers. For us, there are two 80’s in the middle of the data, another indication you
deserve a B. Now, perhaps the last possible argument you may use to justify you are a B
student is that your last four exam grades, 80, 80, 90 and 95 showed you were more of a
B student than a C student at the end of the course. So, despite having an average or
mean of a 76.9, your mode and median scores were an 80, and your grades in the
second half of the semester were certainly not indicative of a C student. What grade
should you get? What grade did you earn?
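As an added illustration (not part of the original text), Python's statistics module computes the same three measures for the exam scores in this example:

```python
from statistics import mean, median, mode

# The eight exam scores from the example above
scores = [60, 80, 60, 70, 80, 80, 90, 95]

print("mean  :", round(mean(scores), 1))  # 615 / 8, about 76.9 (a C)
print("median:", median(scores))          # average of the two middle 80s, so 80
print("mode  :", mode(scores))            # 80, the score earned most often
```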
Real Estate You meet with a real estate agent and carefully explain to the agent the
price range of the homes you are interested in seeing. The agent taps away on their
computer and tells you they completely understand what you want: you
are looking for homes in the $130,000 to $160,000 range. You nod your head in
agreement. The agent informs you they have found three neighborhoods where the mean
(average) values of the houses are $128,571, $136,786, and
$161,429. Each subdivision is small, just as you prefer, with 14 homes. They explain
the need for you to sign an exclusive right-to-buy agreement before they take you out.
Impressed with both the immediacy and the detail provided, you quickly sign the
agreement. The real estate agent takes you out for the day. After cozying into the front
seat of their car, you sit back and enthusiastically await what should prove to be a
worthwhile day of house hunting. By the end of the day, you are shaking your head
sideways, not nodding up and down, and you are straining to think of ways to break the stupid
exclusive right-to-buy agreement you signed earlier. What happened? Let's see.
The three subdivisions you saw:

House            Sleepy Brook    Vista View    Meadowlands
1                205,000         130,000       300,000
2                400,000         130,000       400,000
3                500,000         135,000       400,000
4                80,000          140,000       500,000
5                70,000          150,000       65,000
6                60,000          125,000       70,000
7                80,000          125,000       70,000
8                80,000          125,000       65,000
9                100,000         125,000       65,000
10               100,000         125,000       65,000
11               60,000          125,000       65,000
12               60,000          125,000       65,000
13               60,000          120,000       65,000
14               60,000          120,000       65,000
Average Value    136,786         128,571       161,429
Which subdivision was closest to your liking? Well, clearly Vista View is the only
subdivision that even had homes in your price range, with 5 of the 14 homes within it.
But this looked like the least likely subdivision, because its average home value
was a little below your range. Visiting the other two subdivisions, however, was useless; they
had no homes in your range. The agent never checked the values of the individual homes in the
three subdivisions; they only checked the average value of the homes. To cut the agent
some slack, checking 3 subdivisions with 14 homes each would have been a lengthy
endeavor, because each home would have needed to be accessed individually on the
computer screen. Remember, the agent wanted to impress you with their quick research.
Still, the oversight was caused because you did not have enough information about the
data. Measures of central tendency inform us about the behavior of the middle of
the data, without the need to see every tedious piece of data. Since pulling up each
home would have been too time consuming (42 homes), what other pieces of information
could have been pertinent, so that you would have known that only Vista View was worth
visiting?
Mean. The mean, or average, values for these sets of data are:

For Sleepy Brook:
$$\frac{205{,}000 + 400{,}000 + 500{,}000 + 3(80{,}000) + 70{,}000 + 5(60{,}000) + 2(100{,}000)}{14} \approx \$136{,}786$$

For Vista View:
$$\frac{2(130{,}000) + 135{,}000 + 140{,}000 + 150{,}000 + 7(125{,}000) + 2(120{,}000)}{14} \approx \$128{,}571$$

For Meadowlands:
$$\frac{300{,}000 + 2(400{,}000) + 500{,}000 + 8(65{,}000) + 2(70{,}000)}{14} \approx \$161{,}429$$
But clearly, this was not enough information about the middle of the data. What else
could have helped us? Well, in the Meadowlands subdivision, 8 of the 14
homes were worth $65,000, one-half of the lower limit of our price range. This would
have been helpful to know. The mode is the piece of data that shows up most
frequently. So, in the Meadowlands, the mode is 65,000, occurring 8 times. For Vista
View, the mode is 125,000, occurring 7 times; this mode is close to our price range. And
Sleepy Brook? Its mode was much lower, 60,000, occurring 5 times.
What other tendency of the data would have been helpful? How many homes are not in
our price range would be too easy an answer. If we order our data, then the
median value of these homes is also readily available:
House     Sleepy Brook    Vista View    Meadowlands
1         60,000          120,000       65,000
2         60,000          120,000       65,000
3         60,000          125,000       65,000
4         60,000          125,000       65,000
5         60,000          125,000       65,000
6         70,000          125,000       65,000
7         80,000          125,000       65,000
8         80,000          125,000       65,000
9         80,000          125,000       70,000
10        100,000         130,000       70,000
11        100,000         130,000       300,000
12        205,000         135,000       400,000
13        400,000         140,000       400,000
14        500,000         150,000       500,000
Median    80,000          125,000       65,000
How would knowing the median have been helpful? Well, if we knew the median for
Meadowlands, then we would have known that one-half of the homes in the subdivision,
that is, 7 of the homes, were $65,000 or below. To keep an average of $161,429, many of
the other homes would have needed to be too expensive for us, leaving at best the
possibility of at most a few homes in our range. As it turned out, there were no homes in our
range.
Which leads us to the dispersion of the data. Dispersion means spreading, scattering, or
distribution. We can address these different words with different measures.
The range is the difference between the largest and the smallest data point.
For Sleepy Brook, $500,000 - $60,000 = $440,000; or, more plainly, there is a large
difference between the cheapest and the most expensive home in the subdivision. For
Vista View, $150,000 - $120,000 = $30,000, and this tells us all the homes are at least
close to our price range. Meadowlands has the same problem Sleepy Brook had: the range is
$500,000 - $65,000 = $435,000.
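As an illustration only (not part of the original text), the same four summaries for the three subdivisions can be computed directly from the table of home values:

```python
from statistics import mean, median, mode

# Home values from the subdivision table above
subdivisions = {
    "Sleepy Brook": [205000, 400000, 500000, 80000, 70000, 60000, 80000,
                     80000, 100000, 100000, 60000, 60000, 60000, 60000],
    "Vista View":   [130000, 130000, 135000, 140000, 150000, 125000, 125000,
                     125000, 125000, 125000, 125000, 125000, 120000, 120000],
    "Meadowlands":  [300000, 400000, 400000, 500000, 65000, 70000, 70000,
                     65000, 65000, 65000, 65000, 65000, 65000, 65000],
}

for name, values in subdivisions.items():
    spread = max(values) - min(values)   # the range
    print(f"{name}: mean ${mean(values):,.0f}, median ${median(values):,.0f}, "
          f"mode ${mode(values):,.0f}, range ${spread:,}")
```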
To measure the scattering and the distribution of even larger samples of data, we will
examine standard deviations a little later. But first, let’s look at mean, median, mode
and range a little longer.
Problem One
Below are the traffic fatalities per 100 million ($10^8$) vehicle miles in 2001. Source: U.S.
National Highway Traffic Safety Administration. Rank the states and the District of
Columbia in ascending order. Then find the mode, median, mean, and range. Discuss the
relevance of the numbers. This means if any two correspond closely, look at the data and
tell why. If any state is far from the middle of the data, call it an outlier.
State                   Rate
Alabama                 1.75
Alaska                  1.80
Arizona                 2.06
Arkansas                2.08
California              1.27
Colorado                1.71
Connecticut             1.01
Delaware                1.58
District of Columbia    1.81
Florida                 1.93
Georgia                 1.50
Hawaii                  1.61
Idaho                   1.84
Illinois                1.37
Indiana                 1.27
Iowa                    1.49
Kansas                  1.75
Kentucky                1.83
Louisiana               2.32
Maine                   1.33
Maryland                1.27
Massachusetts           0.90
Michigan                1.34
Minnesota               1.06
Mississippi             2.18
Missouri                1.62
Montana                 2.30
Nebraska                1.36
Nevada                  1.71
New Hampshire           1.15
New Jersey              1.09
New Mexico              1.99
New York                1.18
North Carolina          1.67
North Dakota            1.45
Ohio                    1.29
Oklahoma                1.55
Oregon                  1.42
Pennsylvania            1.49
Rhode Island            1.01
South Carolina          2.27
South Dakota            2.00
Tennessee               1.85
Texas                   1.72
Utah                    1.25
Vermont                 0.96
Virginia                1.27
Washington              1.21
West Virginia           1.91
Wisconsin               1.33
Wyoming                 2.16
Solution
State                   Rate    Rank
Massachusetts           0.90    50
Vermont                 0.96    49
Connecticut             1.01    47
Rhode Island            1.01    47
Minnesota               1.06    46
New Jersey              1.09    45
New Hampshire           1.15    44
New York                1.18    43
Washington              1.21    42
Utah                    1.25    41
California              1.27    37
Indiana                 1.27    37
Maryland                1.27    37
Virginia                1.27    37
Ohio                    1.29    36
Maine                   1.33    34
Wisconsin               1.33    34
Michigan                1.34    33
Nebraska                1.36    32
Illinois                1.37    31
Oregon                  1.42    30
North Dakota            1.45    29
Iowa                    1.49    27
Pennsylvania            1.49    27
Georgia                 1.50    26
Oklahoma                1.55    25
Delaware                1.58    24
Hawaii                  1.61    23
Missouri                1.62    22
North Carolina          1.67    21
Colorado                1.71    19
Nevada                  1.71    19
Texas                   1.72    18
Alabama                 1.75    16
Kansas                  1.75    16
Alaska                  1.80    15
District of Columbia    1.81    (X)
Kentucky                1.83    14
Idaho                   1.84    13
Tennessee               1.85    12
West Virginia           1.91    11
Florida                 1.93    10
New Mexico              1.99    9
South Dakota            2.00    8
Arizona                 2.06    7
Arkansas                2.08    6
Wyoming                 2.16    5
Mississippi             2.18    4
South Carolina          2.27    3
Montana                 2.30    2
Louisiana               2.32    1

Mode      1.27
Median    1.55
Mean      1.57
The mean and median are close; this
means the number in the middle of the
data and the average are close together.
The number of states that rank above
and below the average and the number
of states that rank above and below the
middle value (1.55, belonging to Oklahoma) are close to the same.
So, the data is not top or bottom heavy.
Yet, this doesn't mean the data is
dispersed evenly. Why?
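As an illustration only (not part of the original text), the same summary numbers can be checked with Python's statistics module, using the 51 fatality rates listed above:

```python
from statistics import mean, median, mode

# Traffic fatalities per 100 million vehicle miles, 2001,
# for the 50 states and DC, in ascending order (from the table above).
rates = [0.90, 0.96, 1.01, 1.01, 1.06, 1.09, 1.15, 1.18, 1.21, 1.25,
         1.27, 1.27, 1.27, 1.27, 1.29, 1.33, 1.33, 1.34, 1.36, 1.37,
         1.42, 1.45, 1.49, 1.49, 1.50, 1.55, 1.58, 1.61, 1.62, 1.67,
         1.71, 1.71, 1.72, 1.75, 1.75, 1.80, 1.81, 1.83, 1.84, 1.85,
         1.91, 1.93, 1.99, 2.00, 2.06, 2.08, 2.16, 2.18, 2.27, 2.30, 2.32]

print("mode  :", mode(rates))                         # 1.27 (four states share it)
print("median:", median(rates))                       # 1.55, the 26th of 51 values
print("mean  :", round(mean(rates), 2))               # about 1.57
print("range :", round(max(rates) - min(rates), 2))   # 2.32 - 0.90 = 1.42
```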
Exercise Set
For problems 1-6 below, find the mean,
median and mode for the data.
1. 1, 3, 4, 4, 4, 5, 5, 6
2. 3, 3, 4, 4, 4, 5, 5, 6
3. 3, 3, 3, 4, 4, 5, 5, 6
4. 3, 3, 3, 4, 5, 5, 5, 6
5. 1, 1, 1, 1, 2, 2, 6, 6
6. 1, 1, 2, 2, 6, 6, 6, 6
7. What is the median time it took for
the students to write the exam?
Student ID Number    Time to Take Exam
4025                 1:25
1026                 1:09
8790                 0:59
1029                 0:45
2943                 1:01
2020                 1:12
2084                 1:25
5091                 1:31
7812                 0:49
5103                 2:00
6092                 1:42
8. Below are the year and the percent of children under the age of 4 in a city who attended day care.

Year    Percent
1970    15
1972    17
1974    15
1976    16
1978    18
1980    17
1982    21
1984    31
1986    12
1988    15
1990    16
1992    17
1994    15
1996    12

What is the mode for this set of data?

9. From the US Census Bureau, 1999, below are the state rankings of the percent of elderly persons, 65 years and over, who live below the poverty level. Rank the states and the District of Columbia in ascending order. Then find the mode, median, mean, and range. Discuss the relevance of the numbers. This means if any two correspond closely, look at the data and tell why. If any state is far from the middle of the data, call it an outlier.

State                   Percent
Alabama                 15.5
Alaska                  6.8
Arizona                 8.4
Arkansas                13.8
California              8.1
Colorado                7.4
Connecticut             7.0
Delaware                7.9
District of Columbia    16.4
Florida                 9.1
Georgia                 13.5
Hawaii                  7.4
Idaho                   8.3
Illinois                8.3
Indiana                 7.7
Iowa                    7.7
Kansas                  8.1
Kentucky                14.2
Louisiana               16.7
Maine                   10.2
Maryland                8.5
Massachusetts           8.9
Michigan                8.2
Minnesota               8.2
Mississippi             18.8
Missouri                9.9
Montana                 9.1
Nebraska                8.0
Nevada                  7.1
New Hampshire           7.2
New Jersey              7.8
New Mexico              12.8
New York                11.3
North Carolina          13.2
North Dakota            11.1
Ohio                    8.1
Oklahoma                11.1
Oregon                  7.6
Pennsylvania            9.1
Rhode Island            10.6
South Carolina          13.9
South Dakota            11.1
Tennessee               13.5
Texas                   12.8
Utah                    5.8
Vermont                 8.5
Virginia                9.5
Washington              7.5
West Virginia           11.9
Wisconsin               7.4
Wyoming                 8.9
Standard Deviation and the Normal Distribution
According to a study by the National Center for Health Statistics, Mean Body
Weight, Height, and Body Mass Index, United States 1960-2002, American men (ages
20-74) were 25 pounds heavier in 2002 than they were some 42 years earlier in 1960. In
2002 the average American male weighed 191 pounds, up from his 1960 counterpart, who
weighed 166 pounds. American women in the same age group followed the same
trend: the average American woman weighed 164 pounds in 2002, up 24 pounds from the
average American woman of 1960, who weighed 140 pounds. This study caused quite
a stir, as nutritionists and diet doctors clamored together to seek solutions. And as you
can imagine, the dangers of obesity were revisited when this study was broadcast.
Not only had the average American's weight increased, but they grew taller as well. The average
male height increased over the 42-year span, from 5' 8" in 1960 to 5' 9 1/2" in 2002. And
as expected, the average height for the American woman also increased, from 5' 3" to 5'
4". The study was done on a smaller representation of the true population: it was
performed on thousands of people, while in reality the population of American men and
women totals in the hundreds of millions. Since these numbers are so large, we assume
the data to be normally distributed around the mean, or average. A normal
distribution for a set of data means that there is more data close to the "average" and
less data farther from the average, until finally relatively few data points lie at one
extreme or the other. The data is symmetrically distributed about the average.
This is common sense, or mathematical intuition. Humans are, after all, close to being
one and the same. Let's say you are writing a story about the height distribution of the
American male in 2002 because you are trying to correlate it to ethnicity, diet, or genes.
First you take the population, in this case those people who participated in the study, and
tally up the number of people at each given height. Like most data, if the sample or
population is large enough, the heights turn out to be normally
distributed. This means most people will be of average height or close to average height.
In other words, the average height will also be the height that occurs most frequently in our
population and the height found in the middle of the data when it is ordered. Thus, the
mean will also be the mode and the median. Next, if a population is normally distributed
and you plot each height in increasing order, the number of men at a given height is
symmetrically distributed around the average height. In other words, there will be more
people close to average height than far from the average height. In 2002, the average
height of the American male was 5' 9 1/2". For our normally distributed society, which
we aptly call the American male, the next most common heights occur from 5' 9" through
5' 10", both heights 1/2 inch away from the mean height. Next, the most common heights
would be expected to occur between 5' 8" and 5' 11". And so on. We expect fewer and fewer
men at a designated height as we move further from the average height. Intuitively,
this fits our preconceived notion of our society: we expect to see fewer men who are 6' 5"
than men who are 5' 11", for instance. And similarly, this means there will be
fewer men who are five feet tall than 5' 7", and on the other side of the mirror, fewer who are
6 1/2 feet than 5' 11". Because height is a normally distributed trait, the heights are
distributed symmetrically around the average height.
So, we estimated the number of adult American males at each given height and then
grouped the heights into small intervals. We then drew a bar graph. The x-axis
represents a given height; the y-axis represents the number of adult males of that given
height. Notice that the graph is centered at the average height of the adult American male.

[Bar graph: number of adult American males (0 to 180,000) by height, centered at the average height]

We then redrew a line graph from the same normally distributed data.

[Line graph: number of adult American males (0 to 180,000) by height]
Often, we draw the normally distributed bell-shaped
curve freehand. Our approximation of a
bell-shaped curve may look like the graph shown here.
Note: The x-axis (the horizontal one) is the value in
question, the population’s height for example. The
y-axis (the vertical one) is the number of data points
for each value on the x-axis, the number of people
that are that certain height.
The standard deviation is a measure of how widely
values are dispersed from the mean (average value). For populations where the data
points are tightly bunched together, the bell-shaped curve is steep and the standard
deviation is small. For populations where the data points are spread further apart from
the average, the bell curve is flatter and the standard deviation is larger.
68-95-99.7 To refine our understanding of a standard deviation, we turn our attention to
a bell shaped graph. In a moment we will show you the calculation for the standard
deviation. Right now, we want to present a conceptual understanding for the term
‘standard deviation.’ Recall, in 2002, the American male had a mean height of 5 ‘ 9 ½ “.
The standard deviation is 2 3/8 “.
For a normal distribution, one standard deviation (in red above) away from the mean in both directions on the horizontal axis accounts for approximately 68 % of the population. There are two heights that are 2 3/8 inches from 5' 9 1/2": the smaller, 5' 7 1/8" (5' 9 1/2" - 2 3/8"), and the larger, 5' 11 7/8" (5' 9 1/2" + 2 3/8"). Thus, 68 % of the American men in 2002 stood between 5' 7 1/8" and 5' 11 7/8".

All data found within two standard deviations (in red and green above) of the mean accounts for roughly 95 % of a normally distributed population, here the adult American male population. The two heights two standard deviations from the mean are found with two predictable calculations. We first subtract two standard deviations from the mean, giving 5' 9 1/2" - 2 3/8" - 2 3/8" = 5' 4 3/4". We then add two standard deviations to the mean, giving 5' 9 1/2" + 2 3/8" + 2 3/8" = 6' 2 1/4". So, 95 % of American men in 2002 were somewhere between 5' 4 3/4" and 6' 2 1/4". Recall, the heights for this 95 % of the population are distributed symmetrically about the mean.

Data found within three standard deviations of the mean (the red, green and blue areas) accounts for about 99.7 % of a normally distributed population. So, in 2002, 99.7 % of American men were between 5' 2 3/8" (5' 9 1/2" - 2 3/8" - 2 3/8" - 2 3/8") and 6' 4 5/8" (5' 9 1/2" + 2 3/8" + 2 3/8" + 2 3/8"). From a different perspective, one could infer that the American men who were more than three standard deviations from the mean in 2002, those shorter than 5' 2 3/8" or taller than 6' 4 5/8", represented only 0.3 % of the adult American male population; they were the men considered short or tall by our population's standards.
If a curve were flatter, the standard deviation would have to be larger in order to account for that 68 percent; if the curve were steeper, the standard deviation would have to be smaller to account for 68 percent of the population. Standard deviation tells you "how spread out the data points in the population are from the mean."
Why is this useful? Well, if you are comparing test scores for different schools, the
standard deviation will tell you how diverse the test scores are for each school. Let's say
Washington High School has a higher mean test score than Adams High School for the
mathematics portion of the statewide AIMS test administered in the state of Arizona to measure students' understanding of high school mathematics. Our first reaction might
be to deduce that the students at Washington are either smarter or better educated by the
teachers.
You analyze the data further. The standard deviation, you find out, at Washington is
larger than at Adams. This means that at Washington there are relatively more kids
scoring toward one extreme or the other. By asking a few follow-up questions, you
might find that Washington’s average was higher because the school district sent all of
the gifted education kids to Washington. Or perhaps Adams' scores were dragged down and thus appeared bunched together because of all the students who have recently been "mainstreamed" from special education classes. Perhaps the gifted classes were sent out
of district. In this way, looking at the standard deviation can help point you in the right
direction when asking why data is the way it is.
Problem One
You are trying to decide which teacher’s class to enroll in for Mathematics. You go to a
website that claims to have tracked the three teachers' success rates over the past five years. The final grade for Mr. Allen's students had a mean score of 76 with a standard deviation of 5, while Mrs. Bennett's students had a mean score of 74, with a standard deviation of 3, and Mrs. Clyde's students had a mean score of 79, with a
standard deviation of 7. Whose class would you enroll in? How would you interpret the
data on the web site? Rank the teachers from first to third, so that if one’s section is full,
you would know whose class to register for next.
Solution
We must quantify the grades to interpret the data. For Mr. Allen's classes, 68 % of the students earned a final grade within 5 points of 76, so 68 % of the students earned between 71 and 81. About 95 % scored within two standard deviations of the mean, so 95 % of the students earned a final grade between 66 and 86. Finally, 99.7 % of the students earned a final grade between 61 and 91. Continuing with this thought process, Mrs. Bennett's students had a lower final grade average, 74, but 68 % of her students earned a final grade between 71 and 77, 95 % earned between 68 and 80, and 99.7 % earned between 65 and 83. Mrs. Clyde's students earned the highest average, but her 68 %, 95 % and 99.7 % ranges are spread farther apart: 72-86, 65-93 and 58-100 respectively.
A table allows us to compare the success rates of the three teachers:

             Mr. A     Mrs. B    Mrs. C
    68 %     71-81     71-77     72-86
    95 %     66-86     68-80     65-93
    99.7 %   61-91     65-83     58-100
So, to answer the question of whose class you should take: if you are a good student, you have the best chance of securing an A with Mrs. Clyde, then Mr. Allen, then Mrs. Bennett. If you struggle at math, you would probably choose Mrs. Bennett first, because 99.7 % of her students earn above a 65. Mr. Allen would probably be your second choice, Mrs. Clyde your third choice.
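The same arithmetic can be automated. The sketch below is illustrative only; the teacher names and numbers are those given in the problem:

    # The 68/95/99.7 ranges for each teacher, computed as mean +/- 1, 2 and 3
    # standard deviations.
    teachers = {"Mr. Allen": (76, 5), "Mrs. Bennett": (74, 3), "Mrs. Clyde": (79, 7)}
    for name, (mean, sd) in teachers.items():
        ranges = {pct: (mean - k * sd, mean + k * sd)
                  for k, pct in [(1, 68), (2, 95), (3, 99.7)]}
        print(name, ranges)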
Problem Two
In Typical City, USA, the number of hours a teen watches TV has become a concern for the town's elders. They research this and find that teens watch an average of 4 1/2 hours of TV a day, with a standard deviation of 1/2 hour. What percent of the teens watch
a) more than 5 hours of TV per day?
b) more than 5 ½ hours of TV per day?
c) less than 5 ½ hours of TV per day?
d) less than 4 hours of TV per day?
e) less than 3 ½ hours of TV per day?
Solution
a) Since 5 hours is 1 standard deviation above the mean (4 1/2 plus 1/2), 68 % of the teens are within 1 standard deviation of the mean, that is, between 4 and 5 hours. Half of the teens will watch from 0 to 4 1/2 hours, and another 34 % (half of the 68 %) will watch between 4 1/2 and 5 hours. So, 84 % will watch less than 5 hours, and thus 100 % - 84 % or 16 % will watch more than 5 hours per day.
b) Since 5 1/2 hours is 2 standard deviations above the mean (4 1/2 plus 1/2 plus 1/2), 95 % of the teens are within 2 standard deviations of the mean, between 3 1/2 and 5 1/2 hours. Half of the teens will watch from 0 to 4 1/2 hours, and another 47 1/2 % (half of the 95 %) will watch between 4 1/2 and 5 1/2 hours. So, 97 1/2 % will watch less than 5 1/2 hours, and thus 100 % - 97 1/2 % or 2 1/2 % will watch more than 5 1/2 hours per day.
c) From part b), 100 % - 2 1/2 % = 97 1/2 % of the teens will watch less than 5 1/2 hours of TV per day.
d) Since 4 hours is 1 standard deviation below the mean (4 1/2 minus 1/2), 68 % of the teens are within 1 standard deviation of the mean, between 4 and 5 hours. Half of the teens will watch from 0 to 4 1/2 hours, and 34 % (half of the 68 %) will watch between 4 and 4 1/2 hours. So, 50 % - 34 % or 16 % will watch less than 4 hours per day.
e) Since 3 1/2 hours is 2 standard deviations below the mean (4 1/2 minus 1/2 minus 1/2), 95 % of the teens are within 2 standard deviations of the mean, between 3 1/2 and 5 1/2 hours. Half of the teens will watch from 0 to 4 1/2 hours, and 47 1/2 % (half of the 95 %) will watch between 3 1/2 and 4 1/2 hours per day. So, 50 % - 47 1/2 % or 2 1/2 % of the teens will watch less than 3 1/2 hours per day.
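For readers who like to check by computer, here is a rough Python sketch of the reasoning above. It is only valid when the cutoff is exactly 1, 2 or 3 standard deviations from the mean, because those are the only percentages the 68-95-99.7 rule supplies:

    mean, sd = 4.5, 0.5
    half_widths = {1: 34.0, 2: 47.5, 3: 49.85}   # half of 68 %, 95 % and 99.7 %

    def percent_above(cutoff):
        """Percent of teens watching more than `cutoff` hours per day."""
        k = round((cutoff - mean) / sd)          # whole number of standard deviations
        half = half_widths[abs(k)]
        return 50 - half if k > 0 else 50 + half

    print(percent_above(5))            # 16.0  -> part a)
    print(percent_above(5.5))          # 2.5   -> part b)
    print(100 - percent_above(5.5))    # 97.5  -> part c)
    print(100 - percent_above(4))      # 16.0  -> part d)
    print(100 - percent_above(3.5))    # 2.5   -> part e)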
Standard score or z-score. If we are analyzing data within 1, 2 or 3 standard deviations of the mean, we can expect 68 %, 95 % or 99.7 %, respectively, of the population to lie within those bounds. But what happens if we want to know what percent of the data lies between two values that are not a whole number of standard deviations from the mean?

Since data is rarely, if ever, presented to us with a mean of zero and a standard deviation of 1, we use the standard normal curve to help analyze any normally distributed data. A data value with a z-score of 0 equals the mean. A data value with a z-score of -1.3 is 1.3 standard deviations below the mean, and so forth. If you know the standard deviation and mean of your data, z-scores enable you to determine the percent of data between any two values in the range of your data.
The formula used to find each z-score is

    z-score = (data value - mean) / standard deviation
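As a quick illustration (not part of the original text), the formula translates directly into a one-line helper:

    def z_score(value, mean, std_dev):
        """How many standard deviations `value` lies above (+) or below (-) the mean."""
        return (value - mean) / std_dev

    print(z_score(72.0, 69.5, 2.375))    # 6 feet     -> about  1.05
    print(z_score(67.5, 69.5, 2.375))    # 5' 7 1/2"  -> about -0.84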
Below is a table of areas under the standard normal curve between the mean and a given z-score. To read it, add the column heading to the row's z value; for example, the entry for z = 1.05 is in the row labeled 1.0 and the column labeled 0.05.
    z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
    0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
    0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
    0.2   0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1028  0.1064  0.1103  0.1141
    0.3   0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
    0.4   0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
    0.5   0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
    0.6   0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
    0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
    0.8   0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
    0.9   0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
    1.0   0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
    1.1   0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
    1.2   0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
    1.3   0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
    1.4   0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
    1.5   0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
    1.6   0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
    1.7   0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
    1.8   0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
    1.9   0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767
    2.0   0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
    2.1   0.4821  0.4826  0.4830  0.4834  0.4838  0.4842  0.4846  0.4850  0.4854  0.4857
    2.2   0.4861  0.4864  0.4868  0.4871  0.4875  0.4878  0.4881  0.4884  0.4887  0.4890
    2.3   0.4893  0.4896  0.4898  0.4901  0.4904  0.4906  0.4909  0.4911  0.4913  0.4916
    2.4   0.4918  0.4920  0.4922  0.4925  0.4927  0.4929  0.4931  0.4932  0.4934  0.4936
    2.5   0.4938  0.4940  0.4941  0.4943  0.4945  0.4946  0.4948  0.4949  0.4951  0.4952
    2.6   0.4953  0.4955  0.4956  0.4957  0.4959  0.4960  0.4961  0.4962  0.4963  0.4964
    2.7   0.4965  0.4966  0.4967  0.4968  0.4969  0.4970  0.4971  0.4972  0.4973  0.4974
    2.8   0.4974  0.4975  0.4976  0.4977  0.4977  0.4978  0.4979  0.4979  0.4980  0.4981
    2.9   0.4981  0.4982  0.4982  0.4983  0.4984  0.4984  0.4985  0.4985  0.4986  0.4986
    3.0   0.4987
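You do not have to carry the table around: each entry is the area between the mean and z, which can be reproduced with the standard normal cumulative distribution function. A hedged sketch, using Python's math.erf:

    from math import erf, sqrt

    def area_mean_to_z(z):
        """Area under the standard normal curve between 0 and z."""
        return 0.5 * (1 + erf(z / sqrt(2))) - 0.5

    print(round(area_mean_to_z(1.05), 4))    # 0.3531, used in Problem Three below
    print(round(area_mean_to_z(0.84), 4))    # 0.2995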
Problem Three
We know that in 2002, the average height of the American Male was 5’ 9 1/2 “ and the
standard deviation was 2 3/8”. What percent of the American males in 2002 were …
a) taller than 6’?
b) shorter than 5’ 7 ½ ”?
c) between 5’ 10 “ and 6 ‘ 1”?
Solution
a) First, we find the z-score associated with 6'. We have

    z-score = (data value - mean) / standard deviation = (6' - 5' 9 1/2") / 2 3/8" = 2.5 / 2.375 ≈ 1.05.

Notice that the positive 1.05
corresponds to the fact that 6' is above the mean. Now, we glance at the table and see that a z-score of 1.05 has a value of 0.3531. This means that in 2002, 35.31 % of the adult American males were between 6' and the average height of 5' 9 1/2". So, the percent of adult American males taller than 6' was 100 % - 50 % - 35.31 %, or 14.69 %.
b) First, we find the z-score associated with 5' 7 1/2". We have

    z-score = (5' 7.5" - 5' 9.5") / 2 3/8" = -2.0 / 2.375 ≈ -0.84.

The negative value corresponds to the fact
that 5' 7 1/2" is below the mean. Now, we glance at the table and see that a z-score of 0.84 has a value of 0.2995. This means that in 2002, 29.95 % of the adult American males were between 5' 7 1/2" and the average American male height of 5' 9 1/2". The percent of American males shorter than 5' 7 1/2" was 100 % - 50 % - 29.95 %, or 20.05 %.
c) Calculating each z-score, we have:

    z-score = (5' 10" - 5' 9.5") / 2 3/8" = 0.5 / 2.375 ≈ 0.21   and   z-score = (6' 1" - 5' 9.5") / 2 3/8" = 3.5 / 2.375 ≈ 1.47
Now, we use the table and see that the z-scores of 0.21 and 1.47 have the values 0.0832 and 0.4292 respectively. This means that in 2002, 8.32 % of American males were between 5' 10" and the average American male height of 5' 9 1/2", and 42.92 % of American males were between 6' 1" and the average height of 5' 9 1/2". So, the percent of adult American males between 5' 10" and 6' 1" would be 42.92 % - 8.32 %, or 34.6 %.
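As a check on Problem Three, here is a sketch using Python's statistics module; the mean and standard deviation are the 2002 figures quoted above, converted to inches:

    from statistics import NormalDist

    height = NormalDist(mu=69.5, sigma=2.375)
    print(1 - height.cdf(72))                 # a) taller than 6'           -> about 0.146
    print(height.cdf(67.5))                   # b) shorter than 5' 7 1/2"   -> about 0.200
    print(height.cdf(73) - height.cdf(70))    # c) between 5' 10" and 6' 1" -> about 0.346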
Exercise Set
1. Two AP calculus classes were taught
by Mr. Venette and Ms. Harper. The
final grade for course during the past
five years indicated that Mr. Venette’s
classes had a mean of 80 and a standard
deviation of 4, while Ms. Harper’s
classes had a mean of 82, but a standard
deviation of 2.5. Interpret the results in
terms of 68-95-99.7 percentiles. Then
give possible reasons for the differences
you observe.
For questions 2 to 10, use the following
data: The mean income in a city is $
51,000, and the standard deviation is $
4000. Find the percentage of people
whose income is
2. $ 59,000 or above
3. $ 47,000 or below
4. between $ 43,000 and $ 55,000
5. $ 55,000 or below
6. between $ 39,000 and $ 55,400
7. $ 39,000 or below
8. $ 40,000 or above
9. $ 50,000 or below
10. between $ 50,000 and $ 60,000
For problems 11 to 16, use the following
information. In Japan in 2002, studies
pertaining to the heights for adults
separated by gender vary slightly, but a
rough estimation of data compiled from
various studies is as follows: For the
adult male population, 97.5% of the males were found to be between 5' 0 5/8" and 5' 9 7/8", with the average at 5' 5 1/8". For the adult female population, 97.5% of the females were found to be between 4' 8 1/2" and 5' 4", with the average at 5' 0 1/4".
11. Find the standard deviation for both
the males and the females and interpret
both in a complete sentence.
12. Find the percent of Japanese males
shorter than 5' 9 7/8".
13. Find the height of the Japanese
female who is taller than 99.85 % of the
population.
14. Find the height of the Japanese
males who are shorter than 84 % of the
population.
15. Find the percent of Japanese males
shorter than 5’ 2”.
16. Find the percent of Japanese women
taller than 5’ 2”.
For problems 17 to 24, use the following
information. In the United States in
2002, the weights pertaining to adults
separated by gender vary slightly, but a
rough estimation of data compiled from
various studies is as follows: For the
adult male population, 95% of the males
were found to be between 168 lbs and
214 lbs, with the average at 191 lbs. For
the adult female population, 95% of the
females were found to be between 140
lbs and 188 lbs, with the average at 164
lbs.
17. Find the standard deviation for both
the males and the females and interpret
both in a complete sentence.
18. Find the percent of American
females who weigh more than 145
pounds.
19. Find the weight of the American
female who weighs less than 34 % of the
population.
20. Find the weight of the American
male who weighs more than 97.5 % of
the population.
21. Find the percent of American
females who weigh more than 150
pounds.
22. Find the percent of American males
who weigh more than 200 pounds.
23. Find the percent of American
females who weigh less than the weight
of the Average Male in 2002.
24. Find the percent of American males
who weigh less than the weight of the
Average Female in 2002.
25. On your route home, you have a
choice of taking two bridges, each of the
same length and same number of lanes.
At the time you cross each bridge,
Bridge One has
an average of 420 cars on it with a
standard deviation of 100, and Bridge
Two has an average of 460 cars on it with
a standard deviation of 20 cars. Which
bridge would you decide to cross?
Would it matter if you were in a hurry?
For problems 26 to 32, use this fact:
According to Robert Dvorchak of the Pittsburgh Post-Gazette, the average
length of a National League baseball
game was 2:47:20. Compare that to its own historic past: in 1967 the average game was 2:30; in the 1940's the average game, according to the Sporting News, was exactly 2:00; and a century ago, the average game was a mere 1:30. If we estimate a standard
deviation of 20 minutes, what percent of
the games …
26. in 2004 lasted longer than two hours
27. in 1967 lasted longer than two hours
28. in 1940 lasted longer than two hours
29. a century ago lasted longer than two
hours
30. in 2004 lasted longer than three
hours
31. in 1967 lasted longer than three
hours
32. in 2004 were shorter than 3 ½ hours
Standard Deviation
A standard deviation then is really nothing more than the average distance from the
mean. For each data point or value, we subtract the mean from each data and the result is
either zero, positive or negative. When we add these values, the sum of the positive
differences will cancel with the sum of the negative difference. Since we are looking to
find the average distance from the mean, this calculation would prove worthless. Try it
and see or yourself. Instead, we use a convenience where we square each difference
because these squared values are all positive. Thus, they won’t have the effect of
canceling each other out. Now, we add them all up. We then divide by the number of
terms. Almost. Actually, we divide by n-1 because statisticians have determined that
with large populations, since there is always an outlier (the really tall kid, the really
bright child that blows out the curve with IQ scores and so on … ), dividing by n-1 most
closely resembles the true behavior of the data. We then take the square root of the sum
of the squared differences to cancel out the effect of squaring, giving us this measurable
average distance from the mean. We designate positive values to indicate above the
mean, negative values to indicate below the mean.
A practical way to compute the standard deviation is to use a spreadsheet. In Microsoft Excel, type the following formula into the cell where you want
the Standard Deviation result, using the "unbiased," or "n-1" method:
=STDEV(A1:Z99) (substitute the cell name of the first value in your dataset for
A1, and the cell name of the last value for Z99.)
Calculating the standard deviation: let x be one value in your set of data and let x̄ be the mean of all the values in your set of data. Let n be the number of data points in your set of data. For each value x, subtract the mean from x, that is x - x̄, then square the result, (x - x̄)². Sum up all of those squared values and then divide the sum by (n - 1). Finally, there's one more step: take the square root of this ratio. That's the standard deviation of your set of data, written as σ:

    σ = √[ Σ (xᵢ - x̄)² / (n - 1) ],  where the sum runs over all n data points
Introduction or who deserves the B?
Let's develop an intrinsic feel for the measures of central tendency of data. Below are five students and their seven grades for a course. The bottom row confirms that all five students have a 79 average.
               Allan   Bill   Cindy   Deanna   Eve
               74      73     59      69       68
               76      75     62      73       78
               77      77     78      78       79
               80      79     79      79       79
               81      83     80      82       83
               82      83     96      83       83
               83      83     99      89       83
    Average    79      79     79      79       79
All five students want a B. Allan argues that his middle grade, his median, is a B. Bill argues that his mode, the grade that occurs most frequently, is a B. Cindy argues that she has shown great potential: two of her grades are solid A's. Deanna makes the same argument, but her grades are not quite as erratic as Cindy's; they are not as dispersed away from the average as Cindy's grades. Eve, like Bill, also argues that her mode is a B. And although Eve and Bill have the same mean, median and mode, Eve is the one with the 68. Uh oh….
The standard deviation measures just this: how the data values deviate from the mean. In other words, a standard deviation is a numerical value that tells the reader how spread out the data are. Allan's grades are clumped together, so he should have a small standard deviation. Cindy's grade history is more erratic; her grades are spread farther out, so she should have a larger standard deviation.
Let's compare the standard deviations for three of the students: Bill, Cindy and Eve. We will find how much each data value deviates from the mean. But notice, if we try to sum these deviations from the mean without squaring the differences, the sum is zero. Why?

So, first we subtract the mean from each data point (the deviation). After we square the differences (deviation squared), we sum the squares of the differences and divide this sum by a number that is one less than the number of data points. Lastly, we take the square root of this ratio and we have the standard deviation.
    Bill        deviation   deviation squared
    73          -6          36
    75          -4          16
    77          -2          4
    79           0          0
    83           4          16
    83           4          16
    83           4          16
    mean 79     sum 0       sum 104

    Cindy       deviation   deviation squared
    59          -20         400
    62          -17         289
    78          -1          1
    79           0          0
    80           1          1
    96           17         289
    99           20         400
    mean 79     sum 0       sum 1380

    Eve         deviation   deviation squared
    68          -11         121
    78          -1          1
    79           0          0
    79           0          0
    83           4          16
    83           4          16
    83           4          16
    mean 79     sum 0       sum 170

For Bill, his standard deviation is √(104/6) = √17.33333 ≈ 4.16. For Cindy, her standard deviation is √(1380/6) = √230 ≈ 15.17. For Eve, her standard deviation is √(170/6) = √28.33333 ≈ 5.32. As the standard deviation is a measure of dispersion, the larger the standard deviation, the more dispersed the data. We now have more information about each student's grades at our disposal; we know the mean, median, mode and the standard deviation. You decide who deserves the B and who does not.
Problem One
Baseball, said to be America’s favorite pastime, is also fertile ground for honing basic
statistical skills. From games won or lost to home and away records, from records
against divisional foes to streaks, from batting averages to home runs, numbers abound.
For this example, we will find the standard deviation and then incorporate the z-score
formula to determine how far each team’s record is from the mean. We will see data in
context, as it would appear in your morning newspaper.
Who is the best and the worst in the American League on Labor Day, 2004? A standard deviation is one way to explore this age-old baseball question. Updated: 9/5/2004 3:37:06 PM
cnn.com
AMERICAN LEAGUE EAST
    TEAM            WON  LOST  PCT   GB      HOME   ROAD   EAST   CENT   WEST   STREAK
    NY YANKEES      83   52    .615  -       45-21  38-31  36-19  15-11  22-14  Lost 2
    BOSTON          80   54    .597  2 1/2   48-22  32-32  36-20  19-13  16-12  Lost 1
    BALTIMORE       63   71    .470  19 1/2  29-35  34-36  28-29  15-12  15-17  Won 6
    TAMPA BAY       59   75    .440  23 1/2  36-34  23-41  21-38  13-12  10-22  Lost 7
    TORONTO         56   80    .412  27 1/2  34-37  22-43  21-36  13-19  14-15  Lost 2

AMERICAN LEAGUE CENTRAL
    TEAM            WON  LOST  PCT   GB      HOME   ROAD   EAST   CENT   WEST   STREAK
    MINNESOTA       77   58    .570  -       43-28  34-30  16-11  34-24  16-16  Won 5
    CHI WHITE SOX   67   67    .500  9 1/2   38-32  29-35  16-16  29-27  14-14  Won 2
    CLEVELAND       67   70    .489  11      40-30  27-40  17-15  26-31  14-16  Lost 4
    DETROIT         62   72    .463  14 1/2  32-32  30-40  10-14  28-28  15-21  Won 1
    KANSAS CITY     47   87    .351  29 1/2  30-37  17-50  08-19  25-32  08-24  Lost 2

AMERICAN LEAGUE WEST
    TEAM            WON  LOST  PCT   GB      HOME   ROAD   EAST   CENT   WEST   STREAK
    OAKLAND         81   54    .600  -       45-19  36-35  23-16  25-15  23-15  Won 3
    ANAHEIM         77   58    .570  4       38-28  39-30  24-16  25-14  21-17  Won 2
    TEXAS           74   60    .552  6 1/2   42-22  32-38  22-17  22-17  20-18  Won 1
    SEATTLE         51   84    .378  30      32-34  19-50  11-28  19-21  12-26  Lost 4
The three divisional winners and the second-place team with the best record make the playoffs. But who is the best team? The worst team? How good is good and how bad is
bad? Let’s calculate the standard deviation with respect to the number of wins for each
team.
First, we find the mean number of wins by adding up all the wins and dividing by 14.

    x̄ = (83 + 80 + 63 + 59 + 56 + 77 + 67 + 67 + 62 + 47 + 81 + 77 + 74 + 51) / 14 ≈ 67.4
The average or mean number of wins is 67.4 for the American League teams on Labor Day, 2004. The table below has each team ranked by the number of wins, from most to least. We used a spreadsheet to construct the columns representing the differences from the
mean, the square of these differences, the standard deviation and the z-scores.
Statistics table for the teams in the American League on Labor Day, 2004.

    TEAM            Wins   Wins - Mean        (Wins - Mean)²
    NY YANKEES      83     83 - 67.4 = 15.6   15.6² = 243.36
    OAKLAND         81     81 - 67.4 = 13.6   13.6² = 184.96
    BOSTON          80     12.6               158.8
    MINNESOTA       77     9.6                92.2
    ANAHEIM         77     9.6                92.2
    TEXAS           74     6.6                43.6
    CHI WHITE SOX   67     -0.4               0.2
    CLEVELAND       67     -0.4               0.2
    BALTIMORE       63     -4.4               19.4
    DETROIT         62     -5.4               29.2
    TAMPA BAY       59     -8.4               70.6
    TORONTO         56     -11.4              129.96
    SEATTLE         51     -16.4              268.96
    KANSAS CITY     47     -20.4              416.2
    SUM             944    0                  1749.84
To calculate the standard deviation,

    σ = √[ Σ (xᵢ - x̄)² / (n - 1) ] = √( 1749.84 / (14 - 1) ) ≈ 11.6.

Recall, to find each z-score,

    z-score = (data value - mean) / standard deviation.

    TEAM            Wins   Wins - Mean        z-score
    NY YANKEES      83     83 - 67.4 = 15.6   15.6 / 11.6 = 1.3
    OAKLAND         81     81 - 67.4 = 13.6   13.6 / 11.6 = 1.2
    BOSTON          80     12.6               1.1
    MINNESOTA       77     9.6                0.8
    ANAHEIM         77     9.6                0.8
    TEXAS           74     6.6                0.6
    CHI WHITE SOX   67     -0.4               -0.03
    CLEVELAND       67     -0.4               -0.03
    BALTIMORE       63     -4.4               -0.4
    DETROIT         62     -5.4               -0.5
    TAMPA BAY       59     -8.4               -0.7
    TORONTO         56     -11.4              -0.98
    SEATTLE         51     -16.4              -1.4
    KANSAS CITY     47     -20.4              -1.8
Look at the final column and recall, as you glance at each team's z-score, that for a normal population, 68 % of the population falls within 1 standard deviation (z-score) of the mean, 95 % falls within 2, and 99.7 % falls within 3. How good are the NY Yankees and how bad are the Kansas City Royals? You now have a more detailed frame of reference to answer that question.
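The entire wins table can be reproduced in a few lines; the sketch below (illustrative, not from the article) uses the 14 win totals as listed:

    from statistics import mean, stdev

    wins = [83, 81, 80, 77, 77, 74, 67, 67, 63, 62, 59, 56, 51, 47]
    avg, sd = mean(wins), stdev(wins)
    print(round(avg, 1), round(sd, 1))         # about 67.4 and 11.6
    for w in wins:
        print(w, round((w - avg) / sd, 2))     # each team's z-score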
Example 2. Does money buy success in baseball? Updated: 9/5/2004 3:37:06 PM cnn.com
The payroll for the American League teams are listed below.
New York Yankees
$ 184,193,950
Boston Red Sox
$ 127,298,500
Anaheim Angels
$ 100,534,667
Seattle Mariners
$ 81,515,834
Chicago White Sox
$ 65,212,500
Oakland Athletics
$ 59,425,667
Texas Rangers
$ 55,050,417
Minnesota Twins
$ 53,585,000
Baltimore Orioles
$ 51,623,333
Toronto Blue Jays
$ 50,017,000
Kansas City Royals
$ 47,609,000
Detroit Tigers
$ 46,832,000
Cleveland Indians
$ 34,319,300
Tampa Bay Devil Rays $ 29,556,667
What will the standard deviation tell us with respect to this payroll data? Will there be a
correlation between salaries and success? Does money buy success? Once we have
calculated the standard deviation and z-scores, we will compare these results with the true
standings taken on Sept 5th, 2004.
The sum of the 14 American League salaries is $ 986,773,835, thus the average salary is
$ 70,483,845.36, which we will round to $70,483,845.
To calculate the standard deviation, we construct the following table.

    Team                    Payroll Salary    Payroll - Mean   (Payroll - Mean)²
    New York Yankees        $ 184,193,950     113,710,105      12,929,987,979,111,025
    Boston Red Sox          $ 127,298,500     56,814,655       3,227,905,022,769,025
    Anaheim Angels          $ 100,534,667     30,050,822       903,051,902,875,684
    Seattle Mariners        $ 81,515,834      11,031,989       121,704,781,296,121
    Chicago White Sox       $ 65,212,500      -5,271,345       27,787,078,109,025
    Oakland Athletics       $ 59,425,667      -11,058,178      122,283,300,679,684
    Texas Rangers           $ 55,050,417
    Minnesota Twins         $ 53,585,000
    Baltimore Orioles       $ 51,623,333
    Toronto Blue Jays       $ 50,017,000
    Kansas City Royals      $ 47,609,000
    Detroit Tigers          $ 46,832,000
    Cleveland Indians       $ 34,319,300
    Tampa Bay Devil Rays    $ 29,556,667
We leave it as an exercise for you to complete the table above. Once done, it is quickly verified that the standard deviation is $ 41,783,940.
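One way to check your completed table (an illustrative sketch; the payroll figures are the ones listed above, in dollars):

    from statistics import mean, stdev

    payrolls = [184_193_950, 127_298_500, 100_534_667, 81_515_834, 65_212_500,
                59_425_667, 55_050_417, 53_585_000, 51_623_333, 50_017_000,
                47_609_000, 46_832_000, 34_319_300, 29_556_667]
    print(round(mean(payrolls)))     # about 70,483,845
    print(round(stdev(payrolls)))    # should agree with the $ 41,783,940 quoted above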
So, let's reprint the table, with the number of standard deviations from the mean (the z-score) listed for each team, along with its ranking in the American League.
    Team                    Payroll Salary    Standard Deviations        True Major
                                              from the Mean (z-score)    League Ranking
    New York Yankees        $ 184,193,950     2.72                       1
    Boston Red Sox          $ 127,298,500     1.36                       3
    Anaheim Angels          $ 100,534,667     0.72                       Tied for 4
    Seattle Mariners        $ 81,515,834      0.26                       13
    Chicago White Sox       $ 65,212,500      -0.13                      7
    Oakland Athletics       $ 59,425,667      -0.27                      2
    Texas Rangers           $ 55,050,417      -0.37                      6
    Minnesota Twins         $ 53,585,000      -0.40                      Tied for 4
    Baltimore Orioles       $ 51,623,333      -0.45                      9
    Toronto Blue Jays       $ 50,017,000      -0.49                      12
    Kansas City Royals      $ 47,609,000      -0.55                      14
    Detroit Tigers          $ 46,832,000      -0.57                      10
    Cleveland Indians       $ 34,319,300      -0.87                      8
    Tampa Bay Devil Rays    $ 29,556,667      -0.98                      11
Problem Three
Homeownership in the USA Below are the state rankings for the percent of
homeownership in the United States (to include mobile homes) in 2002. Source: US
Bureau of the Census.
    Alabama                 73.5        Montana                 69.3
    Alaska                  67.3        Nebraska                68.4
    Arizona                 65.9        Nevada                  65.5
    Arkansas                70.2        New Hampshire           69.5
    California              58.0        New Jersey              67.2
    Colorado                69.1        New Mexico              70.3
    Connecticut             71.6        New York                55.0
    Delaware                75.6        North Carolina          70.0
    District of Columbia    44.1        North Dakota            69.5
    Florida                 68.7        Ohio                    72.0
    Georgia                 71.7        Oklahoma                69.4
    Hawaii                  57.4        Oregon                  66.2
    Idaho                   73.0        Pennsylvania            74.0
    Illinois                70.2        Rhode Island            59.6
    Indiana                 75.0        South Carolina          77.3
    Iowa                    73.9        South Dakota            71.5
    Kansas                  70.2        Tennessee               70.1
    Kentucky                73.5        Texas                   63.8
    Louisiana               67.1        Utah                    72.7
    Maine                   73.9        Vermont                 70.2
    Maryland                72.0        Virginia                74.3
    Massachusetts           62.7        Washington              67.0
    Michigan                76.0        West Virginia           77.0
    Minnesota               77.3        Wisconsin               72.0
    Mississippi             74.8        Wyoming                 72.8
    Missouri                74.6
Like before, we have a rather large sample. Let’s begin with the statistical basics. We
will find the mean, median, mode and range after first ranking states in ascending order.
    State                 Percent  Rank      State                 Percent  Rank
    DC                    44.1     50        Vermont               70.2     25
    New York              55.0     49        New Mexico            70.3     24
    Hawaii                57.4     48        South Dakota          71.5     23
    California            58.0     47        Connecticut           71.6     22
    Rhode Island          59.6     46        Georgia               71.7     21
    Massachusetts         62.7     45        Maryland              72.0     18
    Texas                 63.8     44        Ohio                  72.0     18
    Nevada                65.5     43        Wisconsin             72.0     18
    Arizona               65.9     42        Utah                  72.7     17
    Oregon                66.2     41        Wyoming               72.8     16
    Washington            67.0     40        Idaho                 73.0     15
    Louisiana             67.1     39        Alabama               73.5     13
    New Jersey            67.2     38        Kentucky              73.5     13
    Alaska                67.3     37        Iowa                  73.9     11
    Nebraska              68.4     36        Maine                 73.9     11
    Florida               68.7     35        Pennsylvania          74.0     10
    Colorado              69.1     34        Virginia              74.3     9
    Montana               69.3     33        Missouri              74.6     8
    Oklahoma              69.4     -         Mississippi           74.8     7
    New Hampshire         69.5     31        Indiana               75.0     6
    North Dakota          69.5     31        Delaware              75.6     5
    North Carolina        70.0     30        Michigan              76.0     4
    Tennessee             70.1     29        West Virginia         77.0     3
    Arkansas              70.2     25        Minnesota             77.3     1
    Illinois              70.2     25        South Carolina        77.3     1
    Kansas                70.2     25
    Mean     69.4
    Mode     70.2
    Median   70.2
Let’s begin to interpret the data. First, we notice the mode and median are the same, and the
mean (average) is below the two. On a positive note, we can say more than half of the states
have a percentage of homeownership above the national average. The next question, “is the
mean significantly below the other two?” This may be partially answered by observing the range.
The range is 33.2 (77.3 – 44.1), which appears to be rather large. So, by comparison 69.4 versus
70.2 appears to be no big deal. Let’s add to our repertoire of examining the dispersion of the
data. For large sets of data, we do not want to obsess over each individual data point. We want
to see if the data follows noticeable trends and then interpret any outliers that may appear.
To do this, we observe a histogram, made from the data in ascending order. Notice we have so
much data our page is not wide enough to label each state. We labeled only a few for reference.
[Bar graph: "Percent of Home Owners by state" — the states along the x-axis in ascending order of homeownership (only a few labeled, from the District of Columbia up through South Carolina), with percent from 0 to 100 on the y-axis.]
The District of Columbia is an outlier; its percentage of 44.1 is far from the center of the data at about 70 %. But, in a manner of speaking, NY, HI, CA and RI, with respective percentages of home ownership of 55, 57.4, 58 and 59.6, all seem far below the center as well. This is what we mean by dispersion. We need to quantify how tightly grouped the data is, because we need to know where to draw the line between those states that are significantly below the mean and those that are significantly above it. This can be done with what we have identified as the standard deviation: the measure of how the data deviate from the middle of the data, its central tendency. In a perfect world, the mean is in the center of the data, thus it is the median too. And the mode, if we get greedy. Recall, the standard deviation, loosely speaking, measures how the data deviate from the mean, and remember, the mean is in the center of the data.
The table below shows each state's raw z-score followed by the state's percentage of homeownership. A table so constructed allows the reader to find a state and quickly identify how that state compares to the national average. Problems 20 - 22 in the Exercise Set require you to verify the table and then draw certain conclusions from it.
    State               z-score   %          State               z-score   %
    DC                  -4.21     44.1       Vermont              0        70.2
    New York            -2.45     55.0       New Mexico           0.02     70.3
    Hawaii              -2.06     57.4       South Dakota         0.21     71.5
    California          -1.97     58.0       Connecticut          0.23     71.6
    Rhode Island        -1.71     59.6       Georgia              0.24     71.7
    Massachusetts       -1.21     62.7       Maryland             0.29     72.0
    Texas               -1.03     63.8       Ohio                 0.29     72.0
    Nevada              -0.76     65.5       Wisconsin            0.29     72.0
    Arizona             -0.69     65.9       Utah                 0.40     72.7
    Oregon              -0.65     66.2       Wyoming              0.42     72.8
    Washington          -0.52     67.0       Idaho                0.45     73.0
    Louisiana           -0.50     67.1       Alabama              0.53     73.5
    New Jersey          -0.48     67.2       Kentucky             0.53     73.5
    Alaska              -0.47     67.3       Iowa                 0.60     73.9
    Nebraska            -0.29     68.4       Maine                0.60     73.9
    Florida             -0.24     68.7       Pennsylvania         0.61     74.0
    Colorado            -0.18     69.1       Virginia             0.66     74.3
    Montana             -0.15     69.3       Missouri             0.71     74.6
    Oklahoma            -0.13     69.4       Mississippi          0.74     74.8
    New Hampshire       -0.11     69.5       Indiana              0.77     75.0
    North Dakota        -0.11     69.5       Delaware             0.87     75.6
    North Carolina      -0.03     70.0       Michigan             0.94     76.0
    Tennessee           -0.02     70.1       West Virginia        1.10     77.0
    Arkansas             0        70.2       Minnesota            1.15     77.3
    Illinois             0        70.2       South Carolina       1.15     77.3
    Kansas               0        70.2
Why use standard deviation The standard deviation can also help you evaluate the
worth of all the so-called "studies" that seem to be released to the press every day. Standard
deviation is commonly used in business as a measure to describe the risk of a security or
portfolio of securities. If you read the history of investment performance, chances are
that standard deviation will be used to gauge risk. The same is true for academic studies
to determine the validity of exam results, or the effectiveness of educational tools.
Standard deviation is also one of the most commonly used statistical tools in the sciences
and social sciences. It provides a precise measure of the amount of variation in any
group of numbers, be it the returns on a mutual fund, the yearly rainfall in Mexico City,
or the hits per game for a major league baseball player.
Lastly, look at the data below, taken directly from the morning newspaper. Does it take on a whole new look? Could we analyze, say, whether the offense or the defense is a better predictor of success in professional football?

The 2003 Final Standings of the NFL teams. W = wins, L = losses, % = percentage of games won, PF = Points For, that is, points the team scored, PA = Points Against, points the team allowed.
    AFC East                W   L   T   %     PF   PA
    New England Patriots    14  2   0   .875  348  238
    Miami Dolphins          10  6   0   .625  311  261
    Buffalo Bills           6   10  0   .375  243  279
    New York Jets           6   10  0   .375  283  299

    NFC East                W   L   T   %     PF   PA
    Philadelphia Eagles     12  4   0   .750  374  287
    Dallas Cowboys          10  6   0   .625  289  260
    Washington Redskins     5   11  0   .313  287  372
    New York Giants         4   12  0   .250  243  387

    AFC North               W   L   T   %     PF   PA
    Baltimore Ravens        10  6   0   .625  391  281
    Cincinnati Bengals      8   8   0   .500  346  384
    Pittsburgh Steelers     6   10  0   .375  300  327
    Cleveland Browns        5   11  0   .313  254  322

    NFC North               W   L   T   %     PF   PA
    Green Bay Packers       10  6   0   .625  442  307
    Minnesota Vikings       9   7   0   .563  416  353
    Chicago Bears           7   9   0   .438  283  346
    Detroit Lions           5   11  0   .313  270  379

    AFC South               W   L   T   %     PF   PA
    Indianapolis Colts      12  4   0   .750  447  336
    Tennessee Titans        12  4   0   .750  435  324
    Jacksonville Jaguars    5   11  0   .313  276  331
    Houston Texans          5   11  0   .313  255  380

    NFC South               W   L   T   %     PF   PA
    Carolina Panthers       11  5   0   .688  325  304
    New Orleans Saints      8   8   0   .500  340  326
    Tampa Bay Buccaneers    7   9   0   .438  301  264
    Atlanta Falcons         5   11  0   .313  299  422

    AFC West                W   L   T   %     PF   PA
    Kansas City Chiefs      13  3   0   .813  438  332
    Denver Broncos          10  6   0   .625  381  301
    Oakland Raiders         4   12  0   .250  270  379
    San Diego Chargers      4   12  0   .250  313  441

    NFC West                W   L   T   %     PF   PA
    St. Louis Rams          12  4   0   .750  447  328
    Seattle Seahawks        10  6   0   .625  404  327
    San Francisco 49ers     7   9   0   .438  384  337
    Arizona Cardinals       4   12  0   .250  225  452
Exercise Set
For problems 1 to 8, use the 2003 Final
Standings of the NFL teams, as
previously indicated.
1. For the NFC teams, find the standard
deviation for the number of wins and
then find the z-score for each team.
2. For the AFC teams, find the standard
deviation for the number of wins and
then find the z-score for each team.
3. For the all teams, find the standard
deviation for the number of wins and
then find the z-score for each team.
4. For the NFC teams, find the standard
deviation for PF and then find the z-score for each team.
5. For the AFC teams, find the standard
deviation for PF and then find the z-score for each team.
6. For the NFC teams, find the standard
deviation for PA and then find the z-score for each team.
7. For the AFC teams, find the standard
deviation for PA and then find the z-score for each team.
8. Look carefully at questions 1 to 7.
Which is a better predictor of a team’s
success, the offense as indicated by the
points the team scored (PF) or the
defense, as indicated by the points that
team allowed (PA). Why?
9. According to the 2005 World Almanac for Kids, below are the 25 largest countries in the world in mid-2004, listed in no particular order with their populations. Find the mean, median and the standard deviation.
1,294,629,555 China
82,424,609 Germany
1,065,070,607 India
76,117,421 Egypt
293,027,571 United States
69,018,294 Iran
238,452,952 Indonesia
68,893,918 Turkey
184,101,109 Brazil
67,851,281 Ethiopia
153,705,278 Pakistan
64,865,523 Thailand
144,112,353 Russia
60,424,213 France
141,340,476 Bangladesh
60,270,708 Great Britain
137,253,133 Nigeria
58,317,930 Dem. Rep. of the Congo
127,333,002 Japan
58,057,477 Italy
104,959,594 Mexico
48,598,175 South Korea
86,241,697 Philippines
47,732,079 Ukraine
82,689,518 Vietnam
10. According to the 2005 World
Almanac for Kids, below are the ten
largest cities in the world in 2004, listed in no particular order, each followed by its population.
Tokyo, Japan 34,450,000; Kolkata
(Calcutta), India 13,058,000; Mexico
City, Mexico 18,066,000; Shanghai,
China 12,887,000; New York City, U.S.
17,846,000; Buenos Aires, Argentina
12,583,000; São Paulo, Brazil
17,099,000; Delhi, India 12,441,000;
Mumbai (Bombay), India 16,086,000;
Los Angeles, U.S. 11,814,000. Find the
mean and the standard deviation.
11. According to the 2005 World Almanac for Kids, below are the American League Pennant winners since 1970, with the year they won preceding the name and their won-lost record following the name. Remove the shortened strike season of 1981 and the year in which there was no World Series, then find the mean and the standard deviation for wins.
1970 Baltimore 108 54 , 1971 Baltimore
101 57, 1972 Oakland 93 62, 1973
Oakland 94 68, 1974 Oakland 90 72,
1975 Boston 95 65, 1976 New York 97
62, 1977 New York 100 62, 1978 New
York 100 63, 1979 Baltimore 102 57,
1980 Kansas City 97 65, 1981 New
York 59 48, 1982 Milwaukee 95 67
1983 Baltimore 98 64, 1984 Detroit 104
58, 1985 Kansas City 91 71, 1986
Boston 95 66, 1987 Minnesota 85 77,
1988 Oakland 104 58, 1989 Oakland 99
63, 1990 Oakland 103 59, 1991
Minnesota 95 67, 1992 Toronto 96 66,
1993 Toronto 95 67 1994 none, 1995
Cleveland 100 44, 1996 New York 92
70, 1997 Cleveland 86 75, 1998 New
York 114 48, 1999 New York 98 64,
2000 New York 87 74, 2001 New York
95 65, 2002 Anaheim 99 63, 2003 New
York 101 61
12. Look up the ages of the presidents
of the United States when they took
office. Find the mean and standard
deviation of the presidents' ages. Then
repeat the process, lumping together
those presidents who were inaugurated
before the Civil War and those who were
inaugurated after the Civil War. What
do you notice when you compare the pre
and post Civil War presidents’ ages?
For questions 13 to 19: M&M’s project
- Some years come and go, but other
years live in the hearts and minds of men
and women for all eternity. Such was
the year of 1941, when Pearl Harbor was
attacked, Joe DiMaggio hit safely in 56
straight games and M&M’s were first
introduced to the public. Daughters
everywhere love M&M’s, in particular,
some love the blue pieces the most. The
original M&M’s had violet candies and
no blue ones in 1941. Then, in 1949, tan
replaced violet and in 1995, tan was
replaced by blue. M&M’s were made
round by taking milk chocolate centers
and tumbling them to get their smooth
rounded shape. We all know M&M
stands for Mars and Murrie and that the
different color M&M’s taste the same.
According to the M&M's website
http://www.mmmars.com/cai/mms/faq.html :
• M&M's Milk Chocolate candies are 30 % brown, 20 % each of yellow and red, and 10 % each of orange, green and blue
• M&M's Peanut Chocolate candies are 20 % each of brown, yellow, red and blue, and 10 % each of green and orange
• M&M's Peanut Butter and Almond Chocolate candies are 20 % each of brown, red, yellow, green and blue
• M&M's Crispy Chocolate candies are 16.6 % each of brown, red, yellow, green, orange and blue.
Let’s perform our own test and see if our
observation of the percent of each color
matches the website’s prediction.
13. Buy one pound bags of M&M’s
Milk Chocolate for each student in your
class. As a class, for each bag, tally up
the number of each color M&M. Find
the percent of each color for each bag.
14. The tally up the colors for all the
bags, and find the percent of colors for
the class room sample, which consists of
all the bags.
15. Using each bag as individual trials,
find the mean, median and mode for
each color. Then find the percent of
colors based on these findings.
16. How do the results in problems 14 and 15 compare? How do the results compare to the M&M's website statistics?
17. Repeat the experiment for Peanut
Chocolate, Peanut Butter and Almond
Chocolate, and Crispy Chocolate.
18. Answer this question - how can you
run standard deviations in this
experiment to help you analyze your
findings so that you may decide on the
reliability of the data on the M&M’s
website?
19. Run those standard deviations to
determine the reliability of the data on
the M&M’s website.
For problems 20 to 22, refer to Problem
3 – Home Ownership - from the text.
20. Compute the standard deviation for
the data from Problem 3 and then verify
the table presented.
21. Determine which states are the friendliest to home ownership and which states are the least friendly.
22. Is there a cause and effect
relationship that you can argue to
explain why these states are at either end
of this analysis?
23. Barry Bonds or Babe Ruth. Who
was the greatest baseball player of all
time? To argue your point, quote
statistics. Research their batting average
and compare it to the batting averages of
their peers at the time. How many
standard deviations from the mean were
their batting averages? Do the same for
home runs, RBI’s and on base
percentage. Factor in that Barry Bonds
played in night games and that Babe
Ruth won 20 games as a pitcher. Best of
luck…