Download Exercise Set - Arizona State University

THREE
Statistics
Facts are stubborn things, but statistics are more pliable. – Mark Twain
Never attribute to malice that which is adequately explained by stupidity. Numbers don’t lie.
And often perception is not reality. Case in point, people are always concerned, worried even,
about the increasingly violent society in which they live. Statements like “what is this world
coming to,” are commonplace. People may tend to feel increasingly unsafe; many may become
more and more reluctant to go out at night. Some have been known to hide behind their TV or
computer, instead of venturing out. There are many who feel that when they were young, crime
was not as bad as it is now. People attribute arbitrary reasons for this new wave of perceived
violence. “Exhaustive music videos glorify violence, causing a violent cycle to never end...”
“No wonder there is so much crime these days, look at all the violence on TV and in the
movies…” “Kids have no respect for their parents, teachers or elders these days, this contributes
to more violence…” Or “the remote control teaches us to become impatient, and we are more
likely to quickly pull the trigger…” Images from the OJ murders, Columbine shootings or 9/11
tend to fill our televisions, replaying the same isolated scenes over and over again. People are
shot every night on reruns of Law and Order. So, it’s natural for people to criticize the amount
of violence in our society, but rarely do these same people think their statements through to
their logical conclusion. Instead, many appear to become angry about the
rise of violent crime in this country and tend to make matters worse by linking this acquired
malice to other elements in society (music videos, teenagers, TV, OJ), fostering a wider net of
hate. More importantly, they never once pause to check out the numbers. And in a matter of
moments, anyone can do just that: check out the numbers. Any of us can access
the FBI’s Index of Crime Statistics on the web. So, we did.
Below are the nationwide statistics from 1982 to 2001, showing, by year, the number of violent
crimes. During this twenty-year span, while the nation’s population grew more than
20% from 231,664,458 in 1982 to 284,796,887 in 2001, the number of violent crimes, defined
as murder, rape, robbery and assault, did not steadily increase as one might expect. In fact, there was a
stunning decline in violent crimes over the last decade. Violent crime reached its peak in 1992,
with 1,932,274 reported instances, and since then violent crime has dropped over 25 percent.
(The homicides on September 11, 2001 were not included.)
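The population growth quoted above is easy to check. The following is a minimal sketch in Python using only the figures from the text; the helper name `percent_change` is ours, and because the 2001 violent crime count is not quoted, the sketch only computes the ceiling that a drop of at least 25 percent from the 1992 peak would imply.

```python
# Figures quoted in the text (FBI Index of Crime, 1982 to 2001).
population_1982 = 231_664_458
population_2001 = 284_796_887
peak_violent_crimes_1992 = 1_932_274


def percent_change(old, new):
    """Percent change from old to new (a positive value means growth)."""
    return (new - old) / old * 100


population_growth = percent_change(population_1982, population_2001)

# The text says violent crime "dropped over 25 percent" from the 1992 peak.
# The 2001 count itself is not quoted, so we only compute the ceiling
# that a drop of at least 25 percent implies.
implied_ceiling_2001 = peak_violent_crimes_1992 * (1 - 0.25)

print(f"Population growth, 1982 to 2001: {population_growth:.1f}%")
print(f"A 25% drop from the peak means fewer than {implied_ceiling_2001:,.0f} violent crimes")
```

The growth figure lands a little under 23 percent, consistent with "more than 20%" in the text.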
[Figure: Violent Crime (murder, rape, robbery, assault) by year, 1982 to 2001; vertical axis from 0 to 2,500,000.]
Those of us who are prone to occasional worry about violence in our society should
consider other statistics too. In the United States, a little over 2.3 million people die each year,
over 6,000 per day. Annually in the US, roughly 44,000 people, over 100 per day, die in traffic
accidents. It is estimated that over 65% of adults in the United States are
overweight, leading to a loss of 300,000 people each year, over 800 per day. And annually in the US,
approximately 440,000 people, over 1,000 per day, die from tobacco use. This means that in this country,
nearly 1 in every 5 deaths is related to the use of tobacco. Cigarettes alone kill more Americans
than alcohol, traffic fatalities, AIDS, suicide, homicide, and illegal drugs combined.
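The per-day figures and the "1 in 5" claim follow from simple division. Here is a minimal Python sketch using only the annual totals quoted above; the dictionary keys are our own labels, and a 365-day year is assumed.

```python
# Annual US death totals as quoted in the text (approximate figures).
annual_deaths = {
    "all causes": 2_300_000,
    "traffic accidents": 44_000,
    "overweight-related": 300_000,
    "tobacco use": 440_000,
}

DAYS_PER_YEAR = 365  # assumption: ignore leap years

# Convert each annual total to an average daily count.
per_day = {cause: count / DAYS_PER_YEAR for cause, count in annual_deaths.items()}

# Fraction of all deaths attributed to tobacco use.
tobacco_share = annual_deaths["tobacco use"] / annual_deaths["all causes"]

for cause, daily in per_day.items():
    print(f"{cause}: about {daily:,.0f} per day")
print(f"Tobacco share of all deaths: {tobacco_share:.1%}")
```

Each daily figure clears the threshold stated in the text, and the tobacco share comes out just under one fifth.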
But, look at the age we live in. We have all seen the headlines:
• Of the 34,000 gun deaths each year in the United States, fewer than 300 are listed as
justifiable homicide.
• “Arizona Kids Are Home Alone: A new survey says 30 percent in kindergarten through
12th grade take care of themselves”
• Of the 85 prisoners executed in 2000: 43 white, 35 African American, 6 Hispanic, 1
American Indian
• Vietnam: 58,168 deaths. Total abortions since 1973: 44,670,812 as of April 22, 2004
• Should juveniles be tried as adults? Kids are killing these days in record numbers
Statistics are tossed at us in such a deluge that the numbers alone seem almost controversial: 30% of
school-age children left alone, 35 out of 85 executed are African American, 44 million abortions
in the last 30 years. Certainly, each of these topics elicits emotion from within each of us: too
many parents leave their children unsupervised; there is not enough funding for day care; the death
penalty, pro or con, racially biased? Too many or too few juveniles tried as adults? And if you want to clear a
room with angry combatants, start with the age-old question: woman’s choice or murder of the
unborn? No matter your stand on these topics, as you comb through the headlines, statistics
besiege you.
Why is quantitative literacy important? When confronted with numbers associated with hotly
contested issues or highly controversial ethical or moral arguments, raw numbers themselves,
such as the above stated 44,670,812 abortions in the last 30 years, need to be examined so they
may be fully understood. As always, we begin by examining the number for credibility. Is it
even viable? This particular number, or numbers similar to it, appears on various websites. We
easily found these and similar numbers quoted at http://www.americandaily.com/article/1806 and
http://womensissues.about.com/cs/abortionstats/a/aaabortionstats.htm. Are they accurate? Well,
we simply have no way of knowing, but these are often-published statistics. Are they viable? Now, that is a
different question altogether.
Following our pattern of analysis, if the number seems to be viable, then we continue. If it is
viable, what implications are fair to draw? These 44 million aborted fetuses would be 30
years of age or younger, so for argument’s sake, let us assume it is fair to say a large percentage
would be alive today. If this assumption is reasonable, roughly 40 million of them plus the 290 million US
citizens comes to 330 million. We are talking about a population of 330 million people, and
44/330 is slightly larger than 13%, or slightly greater than 1/8. What does this mean? Has
society aborted 1 in 8? Don’t questions abound in your mind? Is this correct? Were these all
abortions performed out of necessity? How many were medical? Or moral? Or personal
choice? Does the reason for the abortion matter to you? Does the reason for the abortion matter
to you if you take into account this new “1 in 8” statistic as a measure of how often abortions do
occur? Certainly, one may argue that 1 in 8 could be construed as an alarming rate. But, the
point of view and the emotions you feel are personal for you. The point is that 44 million is the
statistic we are confronted with. Our ability to perform math tells us 1 in 8 is a logical
consequence of this statistic. What you do in the subsequent interpretation is your decision. But,
quantitative literacy will allow you to understand the statistic in context and make the
interpretation.
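The "1 in 8" figure can be reproduced in a few lines. The following Python sketch takes the text's own numbers (44 million out of a hypothetical population of 330 million) at face value; the variable names are ours.

```python
from fractions import Fraction

# Numbers used in the text's argument, in millions.
abortions_millions = 44
hypothetical_population_millions = 330

# Exact ratio, reduced to lowest terms.
share = Fraction(abortions_millions, hypothetical_population_millions)

share_percent = float(share) * 100   # as a percentage
one_in_n = 1 / float(share)          # "1 in n" form

print(f"Share: {share} = {share_percent:.1f}%")
print(f"That is about 1 in {one_in_n:.1f}, slightly more than 1 in 8 ({float(Fraction(1, 8)):.1%})")
```

The exact ratio reduces to 2/15, which is why the text can describe it both as slightly over 13% and as slightly greater than 1/8.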
Statistics themselves are numbers that stand alone. Honest. Raw. Naked numbers. The name
of the game in statistics is to draw inferences about a population or topic. If we are using polls,
we are basing inferences on a smaller random sample of the general population. When trying to
then form a conclusion, we must be careful. Correlation is not causation: just because numbers
correlate does not mean one causes the other. Inferring characteristics about a population based
on the raw data is the immediate reaction as we scan the headlines, but should it be? Can graphs
be misleading? How good are we at recognizing misleading information?
Causation and Correlation
There exists a relationship between attendance and grades. Research shows that students who
attend class regularly have better grades than those who don’t. Does this mean that attending
class will cause a student to have a better grade, that is, will simply coming to class increase
one’s grade? What about the student who regularly comes to class because they can get 50
minutes of solid rest by laying their head on the desk? The nature of this question illustrates the
need for a distinction between two words, causation and correlation. Cities with more
pornography have a higher crime rate. What is the relationship between these two variables; are
the social implications as obvious as is implied? Relationships between variables are not always
cut and dried. Studies can show children who come from economically advantaged homes
perform better in high school. If anyone took this study and concluded that as a society, the
smarter citizens tended to rise to the top of the economic food chain, the public outcry would be
palpable. This is because other factors need to be taken into consideration, such as the premise
“advantages are just that, advantages”. Other factors such as better access to tutors, better access
to support systems, or not having to study while you’re hungry or cold or working full time,
certainly contribute to one’s academic performance. Correlation should never be used
interchangeably with causation. Sometimes correlation indicates causation, sometimes not.
Clearly, there exists a high correlation between a person’s blood alcohol level
and the likelihood they will get into an auto accident. We do not think any rational person
would dispute the added inference that drinking alcohol can cause an auto accident. The data
that support the two factors’ relationship, the higher number of drunk drivers compared to sober
drivers who get into these accidents, imply correlation. That drinking led to, or caused, the
accident implies causation. It will be our task to determine whether one factor’s data that correlate
with some other factor’s data can be interpreted to mean that one factor influences the other.
Correlation A correlation exists between two factors if a change in one factor’s data is
associated with a rise or decline in the other factor’s data.
Causation A causation exists between two factors if one factor causes, determines or results in
a rise or decline in the other factor’s data.
Correlation as a result of causation As with drinking and auto accidents, we can often infer
that a correlation is tied to causation. Another equally clear case can be made by considering
tobacco use and lung cancer. The numbers correlate: one can relate the amount one smokes
to the likelihood of succumbing to lung cancer. Those who smoke more have a higher
percentage of their population afflicted with lung cancer. And, for years, the Surgeon General
has been telling us that smoking causes lung cancer. The more you smoke, the higher the risk of
developing lung cancer.
Correlation with no causation. Hidden factors Just because two factors correlate does not
mean one factor causes the other. One of the easiest examples to spotlight the difference
is a common correlation between divorce and death. In most
states, there is a significant negative correlation between the two: the more divorces, the fewer
deaths. Since the two correlate negatively, the natural question arises: does getting a divorce
reduce the risk of dying; does staying married increase the chance of dying? All joking aside
about the obvious hidden implication, it is a third unseen factor that causes the correlation.
Death and divorce do not have a causal relationship with each other. Each is related to age. The older the
married couple, the lower the likelihood they will get a divorce. The older the married couple, the higher
the likelihood they will pass away. There is a negative correlation between divorce and age and a
positive correlation between age and death. The younger you are, the more likely you are to get
divorced. The older you are, the more likely you are to pass away. Since the higher number of
divorces occurs with younger people, and since younger people tend to live longer, we have a
transitive relationship linking the higher number of divorces to the longer life spans.
Correlation. Causation. Very different. Yes, there is a correlation between divorce and death.
No, neither causes the other. In plain English, getting a divorce will not increase or decrease the
likelihood you will die.
Accidental Correlations Sometimes there exist accidental correlations where there is no
hidden factor or unseen logical explanation. The winner of the Super Bowl and the party
of the winner of the presidential race in the country correlate highly every four years, but do not
think football predicts the presidential races, or vice versa. This is an accidental correlation.
Misleading Information
Breast cancer will afflict one in eleven women. But this figure is misleading because it applies
to all women up to age eighty-five. Only a small minority of women live to that age. The incidence
of breast cancer rises as a woman gets older. At age forty, one in a thousand women develops
breast cancer. At age sixty, one in five hundred. Is the statistic one in eleven technically
correct? Yes. Should a 40 year old woman be concerned with getting breast cancer? Certainly.
Should they worry that one in eleven of their peer group will be afflicted? No. And while one in
a thousand in their peer group will get afflicted, this by no means minimizes the seriousness of
the issue, but sheds a more realistic light on it.
A scatter plot is a graph of ordered pairs that allows us to
examine the relation between two sets of data.
To draw a scatter plot:
• Arrange the data in a table.
• Decide which column represents the x-values (the data along the
horizontal axis). Those values need to be the perceived cause, the independent variable.
• Decide which column represents the y-values (the data along the vertical axis).
These values need to appear to be affected by the perceived cause, the dependent
variable.
• Plot each pair of data values as a point, an ordered pair (x, y).
• Analysis: We can make predictions if the points show a correlation.
* If the points appear to rise while reading the scatter plot from left to right, this is a
positive correlation.
* If the points appear to fall while reading the scatter plot from left to right, this is a
negative correlation.
Positive Correlation: We expect that if the values along the horizontal axis increase, so do the
values associated with the vertical axis. That is, as we increase x, we increase y. The more we
study, the higher we expect to score on Exam One.
[Figure: scatter plot of Grade on Exam One (vertical axis, 0 to 120) against Hours Spent Studying (horizontal axis, 0 to 7).]
Negative Correlation: We expect that if we increase x then we decrease y. The higher the
temperature, the fewer minutes we will jog.
[Figure: scatter plot of Minutes Spent Jogging (vertical axis, 0 to 90) against Temperature in Fahrenheit (horizontal axis, 0 to 140).]
Example One
For each below, decide if there is a correlation between the two factors. If there is, is it a
positive correlation or negative correlation? Then decide if the two factors have a causal
relationship. If they do not have a causal relationship, but they do correlate, determine if there
are hidden factors that explain the correlation, if the correlation is accidental or if there is
misleading information.
a. A child’s shoe size, a child’s ability to do math
b. Blood alcohol level and reaction time
c. A girl’s body weight, the time the girl spends playing with dolls each day
d. Price on an airline ticket, the distance traveled
Solution
a. Positive correlation. A child’s shoe size does correspond to a child’s ability to do math. The
larger a child’s shoe size, the better in math they are. But, the relationship is not causal; large
feet do not cause a child to perform math better. There is a hidden factor. Age. The older the
child, the larger the child’s shoe size. As children become older they take more math classes.
The more math classes the child has participated in, the better the child performs in math.
b. Positive correlation. The higher the blood alcohol level, the slower the reaction time (the
more time required to react). The relationship is causal.
c. Negative correlation. As a girl’s body weight increases, she plays with dolls less each day.
No causal relationship here, again a hidden factor. And again it is age. The more a girl weighs,
the older she is, the less time she spends playing with dolls.
d. Positive correlation. The longer the distance, generally, the more expensive the ticket.
Causation.
Example Two
Let’s examine the basic question, “Do students who do better on a placement exam perform better
in a college algebra course?” Below is the data. Draw a scatter plot and answer the question.
Placement Score    Final Average, College Algebra
70                 91
68                 89
56                 71
40                 62
78                 95
59                 65
67                 85
45                 66
61                 70
Solution
We need to examine the data visually to see if there exists a positive or a negative
correlation between higher placement test scores and performance in college algebra.
Below is a scatter plot of the placement test data and average scores on a College
Algebra Final. Though not all points show the same trend, the general trend is that an
increase in placement score does translate to an increase in the final average grade.
[Figure: scatter plot of Final Average in College Algebra (vertical axis, 0 to 100) against Placement Test Score (horizontal axis, 0 to 100).]
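The visual impression from the scatter plot can also be expressed as a single number. The sketch below computes the Pearson correlation coefficient for the nine data pairs, assuming the placement scores and final averages are paired in the order listed. The coefficient is not used in the text; it is one common way to measure how closely points follow a straight line, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation). The function name `pearson_r` is ours.

```python
import math

# Data from Example Two, paired in the order given (an assumption).
placement_scores = [70, 68, 56, 40, 78, 59, 67, 45, 61]
final_averages = [91, 89, 71, 62, 95, 65, 85, 66, 70]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(dev_x * dev_y)


r = pearson_r(placement_scores, final_averages)
direction = "positive" if r > 0 else "negative"
print(f"r = {r:.2f}, a {direction} correlation")
```

For this data the coefficient comes out close to 0.9, which agrees with the upward trend seen in the plot. As the chapter stresses, a strong correlation by itself says nothing about causation.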
Exercise Set
For problems 1–13, decide if there is a correlation between the two factors. If there is, is it a
positive correlation or negative correlation? Then decide if the two factors have a causal
relationship. If they do not have a causal relationship, but they do correlate, determine if there
are hidden factors that explain the correlation, if the correlation is accidental or if there is
misleading information.
1. Altitude, air pressure
2. Number of homes sold, realtor’s income
3. Number of abortions in US, number of people who are Pro-choice
4. Encouragement of cattle ranching, amount of rain forest
5. The length of time a couple is together, the similarity of their outlook on life
6. A senior citizen’s age, clarity of vision for the senior citizen
7. Weight of an envelope, postage on the envelope
8. A boy’s height, a boy’s time spent watching cartoons each day
9. Minutes hot coffee sits on a desk, the temperature of the coffee
10. Rate of violence in a city, unemployment rate in the same city
11. Petroleum consumption, quality of air
12. Number of cars on the highway, quality of air
13. Efficiency of household appliance, size of the monthly electric bill
14. From a survey of 2000 people, the table below represents averages for the number of
years in school and the associated average monthly salary. Make a scatter plot, labeling
the x and y axes.
Number of Years in School      Average Monthly Salary
Less than 12 (approx. 10)      $1,500
12                             $1,750
14                             $2,100
16                             $2,550
18                             $3,000
20                             $4,200
15. Draw a line through the data which closely fits the scatter plot for cumulative
donations for a charity by year below:
[Figure: scatter plot of cumulative donations for a charity by year.]
16. From the scatter plot below, interpret the linear pattern and predict the percent of
students who failed the math course in the year 2000.
[Figure: scatter plot of percent of students failing math course (vertical axis, 0 to 30) against year (horizontal axis, 1988 to 2000).]
17. Which data below has the greatest negative correlation?
[Figure: four scatter plots labeled a), b), c) and d), each showing dollar amounts (vertical axis, 0 to 35 or 40) against year (horizontal axis, 1965 to 2000).]
Construct and Draw Inferences
Constructing and drawing inferences are essential to critical thinking and problem solving.
When faced with statements, problems and puzzles, we do more than use common sense. We
use problem solving skills, try to find patterns and infer statements that follow logically from the
statements given. We determine what is reasonable and what is not. We determine what should
logically follow and what should not in order to make good decisions.
Circle Graphs
Taken directly from newspaper headlines: Should a juvenile be tried as an adult? To address
this issue, we should ask ourselves many questions and look at this crucial problem from many
perspectives. For many of us, the first question we ask may be “Do juveniles who murder pose a
chronic problem in this country?” Well, what’s chronic? If a large percentage of all murders
were done by juveniles, this could be called chronic.
We return to the Crime Index as defined by the FBI from 2001. Let us ask the question, “Does
there exist a correlation between age and those who commit murder in this country?” As long as
we have the information grouped by category, which in this case is by age, we can display
large numbers in the data as a percent of the whole in a pie chart or circle graph.
First, let’s see how a circle graph or pie chart is made. We tend to subdivide a circle into
sectors represented by their central angle in either degrees (out of 360 degrees) or the percent
of the circle that is to be shaded (out of 100%).
[Figure: Common subdivisions of a circle.]
So, for our question, “Is there a correlation between age and those who commit murder in this
country?”, we examine the data taken from the Crime Index. Of the 10,113 known murderers in the
country in 2001, their age distribution was given as follows:
Age, in years    Number
1 to 4           0
5 to 8           0
9 to 12          14
13 to 16         454
17 to 19         1,695
20 to 24         2,767
25 to 29         1,571
30 to 34         992
35 to 39         855
40 to 44         645
45 to 49         455
50 to 54         272
55 to 59         158
60 to 64         85
65 to 69         59
70 to 74         37
75 and over      54
Total            10,113
Since the data is already organized, let’s find the density of each age group. This means we will
reconstruct the table and find the percent of murderers for each category, 1 to 4, 5 to 8, 9 to 12,
13 to 16 and so on. Note, not all categories are partitioned into equal time intervals.
Age, in years    Number    Relative frequency           Central Angle
1 to 4           0         0                            0 degrees
5 to 8           0         0                            0 degrees
9 to 12          14        14/10,113 ≈ 0.0013           0.0013 x 360 ≈ 0.468 degrees
13 to 16         454       454/10,113 ≈ 0.0449          0.0449 x 360 ≈ 16.2 degrees
17 to 19         1,695     1,695/10,113 ≈ 0.1676        0.1676 x 360 ≈ 60.34 degrees, or about 1/6 of the circle
20 to 24         2,767     2,767/10,113 ≈ 0.2736        0.2736 x 360 ≈ 98.5 degrees
25 to 29         1,571     0.155                        55.9 degrees
30 to 34         992       0.098                        35.3 degrees
35 to 39         855       0.085                        30.4 degrees
40 to 44         645       0.064                        23 degrees
45 to 49         455       0.045                        16.2 degrees
50 to 54         272       0.027                        9.7 degrees
55 to 59         158       0.016                        5.62 degrees
60 to 64         85        0.008                        3 degrees
65 to 69         59        0.006                        2.1 degrees
70 to 74         37        0.004                        1.3 degrees
75 and over      54        0.005                        2 degrees
Total            10,113    1                            360 degrees, a whole circle
[Figure: pie chart, “Murder Offenders by age”, with one sector for each age group from 1 to 4 through 75 and over.]
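The relative frequencies and central angles can be recomputed directly from the counts. The following Python sketch uses the age distribution given in the text; the variable names are ours, and values are left unrounded until printing.

```python
# Known murder offenders by age group, 2001, as listed in the text.
counts_by_age = {
    "1 to 4": 0, "5 to 8": 0, "9 to 12": 14, "13 to 16": 454,
    "17 to 19": 1695, "20 to 24": 2767, "25 to 29": 1571,
    "30 to 34": 992, "35 to 39": 855, "40 to 44": 645,
    "45 to 49": 455, "50 to 54": 272, "55 to 59": 158,
    "60 to 64": 85, "65 to 69": 59, "70 to 74": 37, "75 and over": 54,
}

total = sum(counts_by_age.values())

# Relative frequency: each group's share of the whole.
relative_frequency = {age: n / total for age, n in counts_by_age.items()}

# Central angle: the same share applied to the 360 degrees of a circle.
central_angle = {age: f * 360 for age, f in relative_frequency.items()}

for age in counts_by_age:
    print(f"{age:>12}  {counts_by_age[age]:>6,}  "
          f"{relative_frequency[age]:.4f}  {central_angle[age]:6.2f} degrees")
print(f"{'Total':>12}  {total:>6,}")
```

Computed this way, the 25 to 29 group comes to roughly 0.155 of the total, or about 55.9 degrees, and the angles sum to a full 360 degrees.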
The pie chart is illuminating. Very quickly, by glancing at the chart, we can tell that
20 to 24 year olds commit the most murders, but a close second seems to be 17 to 19 year olds, as
well as 25 to 29 year olds. If a juvenile is defined to be under 18 years of age, then this
appears to be a chronic problem because the second most dense population of murderers
occurs in the age group 17 to 19 year olds. Now when we factor in the 13 to 16 year olds (454),
the problem of juvenile murder seems to be more acute. For murders committed by teenagers
alone, the 13 to 19 year old age group accounts for 454 + 1,695 or 2,149 murders. This comes to
2,149/10,113 or just a little over 20 percent, and this doesn’t include the children who are 12 or
under.
Now, let’s continue to address this problem again. Numbers never lie. But rearranged, could
they deceive? Could the very same numbers be used by the opposing side of the argument to
make the opposing view more viable? As said, first, we re-arrange the numbers.
1 to 19        14 + 454 + 1,695 = 2,163
20 to 39       2,767 + 1,571 + 992 + 855 = 6,185
40 to 59       645 + 455 + 272 + 158 = 1,530
60 and over    85 + 59 + 37 + 54 = 235
We then construct a pie chart from these new subdivisions. Again, keep in mind we
are only considering the murders where we know the age of the murderer. There were
10,113 of these murders.
[Figure: pie chart, “Murderers by age, 2001”, with sectors 1 to 19, 20 to 39, 40 to 59, and 60 and over.]
But, we are trying to represent the opposing point of view and we are trying to show murder by
juveniles is not a ‘chronic problem’. So, in 2001, there were an additional 5375 murders where
the age of the perpetrator was unknown. Regrouping, our table looks like:
1 to 19        2,163
20 to 39       6,185
40 to 59       1,530
60 and over    235
Unknown        5,375
[Figure: pie chart, “Murderers, by age, 2001”, with sectors 1 to 19, 20 to 39, 40 to 59, 60 and over, and unknown.]
Let’s examine the new pie chart. Notice how much smaller the piece of the pie for the 1 to 19
year old segment now is compared to the whole. This is a significant difference from the previous
pie charts where we did not factor in the murders committed by people of unknown ages.
[Figure: pie chart, “Murderers by age, 2001”, with the same five sectors: 1 to 19, 20 to 39, 40 to 59, 60 and over, and unknown.]
To further enhance our argument, we may
construct the slices of the pie with a 3–
dimensional representation. We then shift the
angle of the segment of the pie we are trying to
ostensibly hide so that it is less prominent. Our
point that juvenile crime is not a chronic
problem seems more justified to the viewer’s
eye.
To add a final touch in enhancing our argument, let’s re-categorize and change two groupings: 1
to 19 and 20 to 39 become 1 to 16 and 17 to 39. If we keep the category of unknown murderers in the
groupings, let’s compare the original pie chart with the final one. To the naked eye, a quick
glance reveals the juvenile’s slice to be a mere sliver on the left chart compared to nearly a
quarter of the pie on the right.
[Figure: two pie charts. Left, “Murderers, by age, 2001”: 1 to 16, 17 to 39, 40 to 59, 60 and over, unknown. Right, “Murderers by age, 2001”: 1 to 19, 20 to 39, 40 to 59, 60 and over.]
Statistics don’t lie, but they can be re-arranged to show whatever is on one’s agenda.
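How much the youngest slice shrinks under each regrouping can be checked numerically. The Python sketch below uses the counts from the text; the helper `share` and the group lists are our own names, introduced only for illustration.

```python
# Known murder offenders by age group, 2001, plus the unknown-age count.
known_counts = {
    "1 to 4": 0, "5 to 8": 0, "9 to 12": 14, "13 to 16": 454,
    "17 to 19": 1695, "20 to 24": 2767, "25 to 29": 1571,
    "30 to 34": 992, "35 to 39": 855, "40 to 44": 645,
    "45 to 49": 455, "50 to 54": 272, "55 to 59": 158,
    "60 to 64": 85, "65 to 69": 59, "70 to 74": 37, "75 and over": 54,
}
unknown_age = 5375


def share(groups, include_unknown):
    """Fraction of the pie taken by the given age groups."""
    part = sum(known_counts[g] for g in groups)
    whole = sum(known_counts.values())
    if include_unknown:
        whole += unknown_age  # a bigger pie makes every known slice smaller
    return part / whole


ages_1_to_19 = ["1 to 4", "5 to 8", "9 to 12", "13 to 16", "17 to 19"]
ages_1_to_16 = ages_1_to_19[:-1]          # move 17 to 19 into the next group
teens_13_to_19 = ["13 to 16", "17 to 19"]

teen_share = share(teens_13_to_19, include_unknown=False)
original_slice = share(ages_1_to_19, include_unknown=False)
with_unknown_slice = share(ages_1_to_19, include_unknown=True)
final_slice = share(ages_1_to_16, include_unknown=True)

print(f"Teenagers, 13 to 19, known ages only: {teen_share:.1%}")
print(f"1 to 19, known ages only:             {original_slice:.1%}")
print(f"1 to 19, unknown ages included:       {with_unknown_slice:.1%}")
print(f"1 to 16, unknown ages included:       {final_slice:.1%}")
```

The same underlying counts give a youngest slice of more than a fifth of the pie in the original grouping and only a few percent in the final one, which is exactly the visual effect the passage describes.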
Example Two
A TV anchorman shows the graph below and states, “There was a sharp dramatic increase in
drunk driving convictions between the year 1999 and the year 2000.” Consider the statement
and reply to its accuracy.
Solution
According to the figure, the actual increase in drunk driving convictions between 1999 and 2000
was 12, up to 732 from 720 the year before. Though this is an increase, it cannot be considered
a “sharp dramatic increase”. Evaluating the data in another way, the percent increase,
$\frac{12}{720} \times 100 \approx 1.7\%$, is not significantly sharp or particularly dramatic. The
anchorman was over-dramatizing the report; the words may be deemed inflammatory, bordering on misleading.
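The percent increase in the solution is a one-line calculation. A minimal Python sketch, using the two conviction counts read from the figure (the variable names are ours):

```python
# Drunk driving convictions as read from the graph in the example.
convictions_1999 = 720
convictions_2000 = 732

# Absolute increase, then the increase relative to the starting value.
increase = convictions_2000 - convictions_1999
percent_increase = increase / convictions_1999 * 100

print(f"Increase: {increase} convictions, or {percent_increase:.1f}%")
```

Dividing by the earlier year's count, not the later one, is what makes this a percent increase from 1999.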
Example Three
Drawing Inferences A bucket has 6 small green balls, 4 medium blue balls, 7 large pink balls,
and 3 very large red balls. A child picks ten balls, selecting each randomly so each ball is
equally likely to be selected. Four such trials were conducted. Which trial most closely resembles
the theoretical probability that would occur if the balls were selected randomly ten times?
Number of balls selected in each trial:
Balls             a)    b)    c)    d)
Small Green       2     3     2     3
Medium Blue       2     2     3     2
Large Pink        2     3     2     4
Very Large Red    4     2     3     1
Solution
First, we need to calculate the theoretical probability for each type of ball. Recall, the
probability is the number of successful outcomes divided by the total number of outcomes. The
total number of balls is 20. There are 6 small green ones, 4 medium blue ones, 7 large pink
ones, and 3 very large red ones.
Balls             Prob.
Small Green       6/20
Medium Blue       4/20
Large Pink        7/20
Very Large Red    3/20
Given the theoretical probabilities, if ten balls were selected, we could anticipate 3 out of 10
balls to be small green ones, 2 out of 10 to be medium blue ones, 3.5 out of 10 to be large pink
ones and 1.5 out of 10 to be very large red ones. This exact outcome is impossible, since we
cannot select half a ball, and so choice b) is the closest trial to these expected results.
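The expected counts in the solution come from multiplying each probability by the number of picks. The Python sketch below reproduces them from the bucket contents given in the example; the dictionary keys and variable names are ours.

```python
from fractions import Fraction

# Contents of the bucket as described in Example Three.
bucket = {
    "small green": 6,
    "medium blue": 4,
    "large pink": 7,
    "very large red": 3,
}

total_balls = sum(bucket.values())
picks = 10

# Theoretical probability of each type: favorable outcomes over total outcomes.
probabilities = {kind: Fraction(n, total_balls) for kind, n in bucket.items()}

# Expected number of each type in ten random picks.
expected_counts = {kind: float(p * picks) for kind, p in probabilities.items()}

for kind in bucket:
    print(f"{kind}: probability {probabilities[kind]}, expected {expected_counts[kind]} of {picks}")
```

Note that `Fraction` reduces each probability to lowest terms, so 6/20 prints as 3/10. The fractional expected counts for pink and red are why no actual trial can match the theory exactly.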
Exercise Set
For problems 1 and 2, use the following data. In the year 2000, a state lottery
distributes its $2.1 million proceeds in the following manner:
Beneficiary        Proceeds
Education          $900,000
Cities             $500,000
Highways           $200,000
Senior Citizens    $200,000
Libraries          $160,000
Other              $140,000
1. Draw two circle graphs. One should support the argument that too much money
from the state lottery went toward education. The other should support the counter
argument, too little money from the state lottery went toward education.
2. Choose a side to the above argument. Pro or Con. Then write a paragraph
defending your argument, citing social, political, ethical and/or religious factors.
For problems 3-4, use the following data.
Source: http://www.ucdmc.ucdavis.edu/vprp/Section6,2000.pdf
In 2000, the population of California was 33,871,648 and 134,227 Californians
purchased 193,489 handguns. 103,743 people purchased one handgun; 28,453
people purchased two to five handguns, totaling 71,363 handguns; 1,855 people
purchased 6 to 12 handguns, totaling 14,053 handguns; and 176 people purchased more
than 12 handguns, for a total of 4,330 handguns.
3. Construct two circle graphs. One circle graph should support the argument that there
is a need for more restrictions on handguns in the state of California. The other circle
graph should refute the argument, that is, support the counter argument that there is no
need for more restrictions on handguns in the state of California.
4. Choose a side to the argument that there is a need for more restrictions on handguns
in the state of California. Pro or Con. Then write a paragraph defending your argument,
citing social, political, ethical and/or religious factors.
For problems 5 and 6, use the following data from the US Census Bureau. In 1999, there
were roughly 280,000,000 US citizens, and 35,000,000 lived in poverty. Of these 35
million, 12,100,000 were children, and 4,500,000 of these children lived in families
who were under one-half of the poverty level. The poverty level was defined as
$13,290 per family of three. For each problem, construct a circle graph as
designated below.
5. Draw a circle graph whose population is the citizens of the United States. Section the
circle graph into two sectors, one sector representing the US citizens who live above
the poverty level, one sector representing the US citizens who live below the poverty
level.
6. Draw a circle graph whose population is those citizens who live below the poverty
level. Section the circle graph into three sectors, one sector representing the adults
who live below the poverty level, one sector representing the children who lived in
families who lived under one-half the poverty level and the third sector is all of the
other children who lived below the poverty level.
For problems 7-8, observe the tables below. For each, what is the greatest issue
presented by these numbers? Then argue one side of the argument, using pie charts to
visually sway your reader. Be certain to outline the issue, show the supporting
table(s) and pie chart(s). Discuss the potential harm of such practices.
7. Murder Victims. By Race and Sex. 2001. Note: The murder and nonnegligent
homicides as a result of the events of September 11th, 2001, were not included in
the below table. 2001, taken from Tables 2.3-2.15.
Special Report Section V. http://www.fbi.gov/ucr/01cius.htm.
Race of Victim    Total     Male      Female    Unknown
White             6,750     4,785     1,962     3
Black             6,446     5,350     1,095     1
Other race        368       245       123       0
Unknown race      188       123       34        31
Total             13,752    10,503    3,214     35
8. Hate Crime Statistics. By Bias. 2003.
Source: FBI Crime List in 2003.
http://www.fbi.gov/pressrel/pressrel04/pressrel112204.htm
Bias                 Victims
Total                9,100
Bias to Race         4,754
  Anti-White         1,006
  Anti-Black         3,150
  Other              598
Bias to Religion     1,489
  Anti-Jewish        1,025
  Anti-Catholic      80
  Anti-Islamic       171
  Other              213
Bias to Other        2,857
9. The graph below shows the company’s profits in its first four years of existence. What’s
wrong with this statement, “There was a substantial increase in the company’s profit in its
first 4 years of existence.”
10. Poll your classmates as to the most important ‘hot button’ campaign issue. Create a
table as you see below.
Topic               Frequency    Relative Frequency    Density
Terror
Racial Relations
Abortion
Death Penalty
Drugs
Education
Construct a histogram and a pie chart for the data.
11. Project: Circle graphs, drawing inferences
Sometimes we choose to see what we want to see. We all stretch the truth, exaggerate what we
need, ignore what hurts us and to what end, personal wealth at the expense of personal worth?
From the US Census Bureau, 2000: Child poverty in America dropped from 13.5 million children
in 1998 to 12.1 million in 1999. With a whisper of optimism, we rationalize that this
improvement was great. Was it?
Do you ever have trouble focusing on exams or concentrating on homework assignments? How hard
do you think it would be to concentrate on your exams, homework, or even your instructor's
lectures if your family didn't have enough money to feed you? What if you were in poor health
and your family couldn't afford to take you to the doctor or provide the medicine you need?
The bitter truth is that in 2000, 12,100,000 children in America were living in poverty and
confronted these challenges every day.
If a family of three were living below the poverty line in 2000, they had an income below
$13,290 a year. Living in poverty can translate to residing in crowded housing, having your
utilities turned off, not owning a phone, refrigerator or car, not having enough food to feed
your family, not having enough medicine to heal your loved ones. And the heart-wrenching
statistic is that 4.5 million children live in families that exist below one-half of the
official poverty level.
Do we have your attention? Are you gasping in proper reverence? You should be. Particularly
because in 2000, America was experiencing one of its greatest flashes of economic prosperity.
Business was skyrocketing, and people were spending. But was just a minute percentage of
Americans benefiting from this new wealth? Ironically, in 2000, the unemployment rate in the
U.S. was lower than it had been in years, but the percentage of poor children in working
families was soaring. There were many possibilities to explain this phenomenon, but "Some
economists (said) that if wages had kept pace with the cost of living since the 1960s, the
minimum wage would (have been) between $12 and $14 dollars" (CNN.com). Instead, the minimum
wage was $5.15.
Assignment Go to the US Census Bureau. Find out how many children there were in the US in
2000. Construct a circle graph with the following categories: Children who lived above the
poverty level, children who lived below the poverty level. Draw separate sections of the
circle graph for those children who lived above $6,645 a year (half of the poverty level of
$13,290 a year) and those who lived below $6,645 per year. Also, include a section of the
graph for those children who lived in the upper 1 % of the income bracket and determine what
that income level was. Then tackle the following questions.
a. Do you think there is a positive, negative or no correlation between concentrating in high
school and graduating from high school? Is it a causation relationship? Why?
b. Do you think there is a positive, negative or no correlation between concentrating in high
school and getting into college? Is it a causation relationship? Why?
c. Do you think there is a positive, negative or no correlation between concentrating in high
school and acquiring well-paying jobs? Is it a causation relationship? Why?
d. Do you think there is a positive, negative or no correlation between staying healthy and
having access to doctors and medicine? Is it a causation relationship? Why?
e. Do you think there is a positive, negative or no correlation between poverty and crime? Is
it a causation relationship? Why?
f. Do you think there is a positive, negative or no correlation between issues that
politicians and lawmakers have as a top priority and issues that affect those under 18, who
cannot vote? Is it a causation relationship? Why?
For problems 12-17, use the information provided: 5,000 years ago, forests covered nearly 50%
of the earth's land surface. Since the advent of humans, forests now cover less than 20%.
Forests serve as the lungs of our planet by providing the very oxygen we breathe. The rate of
deforestation is increasing and although extinction is nature’s way of selectively
realigning our living world, this extinction, the most acute since the dinosaurs, is not
nature’s way. Humans have caused it, by themselves.
Source: According to RAN (Rainforest Action Network) and Myers (op. cit.). Land area and
forest cover for Central and South America, in square kilometers:
Country or region    Land area    Original forest cover    Current forest cover
Bolivia              1,098,581                   90,000                  45,000
Brazil               8,511,960                2,860,000               1,800,000
Central America        522,915                  500,000                  55,000
Colombia             1,138,891                  700,000                 180,000
Ecuador                270,670                  132,000                  44,000
Mexico               1,967,180                  400,000                 110,000
12. For each country, construct a circle graph where the circle represents the land area of
the country. Divide each circle into two sectors, one for the country’s land area that was
once covered by forests and one for the land area that was not covered by forests at that
time.
13. For each country, construct a circle graph where the circle represents the land area of
the country. Divide each circle into two sectors, one for the country’s land area that is
currently covered by forests and one for the land area that is currently not covered by
forests.
14. For each country, construct a circle graph where the circle represents the original
extent of forest cover. Divide the circle into two sectors, one for the existing land area
covered by forests and one for the land area lost to deforestation.
15. Construct a circle graph that represents the total land area for Bolivia, Brazil, Central
America, Colombia, Ecuador and Mexico. Divide the circle graph into twelve sectors, two for
each country, where one sector represents the land area currently covered by forests and the
other the land area currently not covered by forests.
16. Construct a circle graph that represents the total land area for Bolivia, Brazil, Central
America, Colombia, Ecuador and Mexico. Divide the circle graph into twelve sectors, two for
each country, where one sector represents the land area that was once covered by forests and
the other represents the land area that was not covered by forests at that time.
17. After assimilating the information and viewing the circle graphs from problems 12-16,
provide an argument, either pro or con, with regard to the following statement:
“Deforestation of the rain forests in Central and South America is threatening the local
environment as well as the global environment. It should be a hot-button issue in today’s
society.”
Measure of Central Tendency
Mean, Median, and Mode Finding a number that best represents a set of data is important to
you right now, because your choice of the “representative” number that best indicates your
grade can determine your course grade. Mathematicians say that the number that is going to
serve as the spokesperson for the data should reflect the center, or the middle, of the data.
Usually we begin by averaging the numbers to find that representative number; we find the sum
and then divide by the number of data points. But, if the data consists of exam scores and you
earned a 95, 95, 95, 95, 95, and a 45, then your average is found with two calculations: 95 +
95 + 95 + 95 + 95 + 45 = 520 and 520/6 ≈ 86.7. This means the center of your data, or the
letter grade that best represents your data, is a B according to your average, and yet you
never once earned a B. In fact, you earned only A’s, except for one failing grade. You pause,
because clearly you earned 5 grades of a 95 and just one grade of a 45. The five A’s must
count for something, right? The data value that appears the most, 95, is called the mode, and
it is simply another representation of the tendency of the data.
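The gap between the average and the most frequent score can be checked with a short sketch. This is only an illustration in Python using the standard `statistics` module; the variable names are our own and not part of the text.

```python
from statistics import mean, mode

# The six exam scores from the paragraph above.
scores = [95, 95, 95, 95, 95, 45]

average = mean(scores)       # sum of the scores (520) divided by 6
most_common = mode(scores)   # the score that appears most often

print(f"mean = {average:.1f}")   # one low score pulls the mean into the B range
print(f"mode = {most_common}")   # the mode stays at the score earned five times
```

A single low score moves the mean by several points but leaves the mode untouched, which is why the two numbers tell different stories here.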
Now that we see there is more than one way to refer to the center of the data, let’s begin with
perhaps a more realistic example. Suppose we knew you had the following exam scores: 60, 80,
60, 70, 80, 80, 90, and 95. You’re thinking perhaps you deserve an A because your last two
grades were A’s. Or at the very least, you deserve a B. You begin by finding your average or
mean, which is the sum of the scores divided by the number of scores.
First you add the scores: 60 + 80 + 60 + 70 + 80 + 80 + 90 + 95 = 615. You had 8 exams and
the average is found by dividing 615 by 8; 615/8 ≈ 76.9, or a C. Uh oh. You change your
strategy. You argue that since you scored an 80 three times, you deserve a B. The mode is the
data value that occurs most frequently, and your mode is an 80. Does this help your argument?
Well, one more indication of the middle of your data is the middle value when you arrange the
numbers in order, either from largest to smallest or from smallest to largest. So, we arrange
our data as 60, 60, 70, 80, 80, 80, 90, and 95. The data value that occurs in the middle is
called the median, like the median of the highway. If there is an odd number of data points,
the median will be a number found in the data set. If there is an even number of data points,
there will be two numbers in the physical middle of the data, and when this occurs, you need
to average the two middle numbers. For us, there are two 80’s in the middle of the data,
another indication you deserve a B. Now, perhaps the last possible argument you may use to
justify that you are a B student is that your last four exam grades, 80, 80, 90 and 95, showed
you were more of a B student than a C student at the end of the course. So, despite having an
average or mean of 76.9, your mode and median scores were an 80 and your grades in the second
half of the semester were certainly not indicative of a C student.
What grade should you get? What grade did you earn?
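The three measures for the eight exam scores can be reproduced with a brief sketch. The helper `median_by_hand` is a hypothetical name of our own; it simply follows the odd/even rule described above, and Python's `statistics` module supplies the mean and the mode.

```python
from statistics import mean, multimode

# The eight exam scores from the example above.
exam_scores = [60, 80, 60, 70, 80, 80, 90, 95]

def median_by_hand(values):
    """Order the data, then take the middle value (or average the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                         # odd count: one middle value
    return (ordered[middle - 1] + ordered[middle]) / 2  # even count: average the two

print("mean  :", mean(exam_scores))            # 615 / 8
print("median:", median_by_hand(exam_scores))  # average of the 4th and 5th ordered scores
print("mode  :", multimode(exam_scores))       # list of the most frequent score(s)
```

`multimode` returns a list because a data set can have more than one most frequent value; here it contains a single score.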
Real Estate You meet with a real estate agent and carefully explain to the agent the price range
of the homes you are interested in seeing. The agent taps away on their computer and tells you
they completely understand what you want, that you are looking for homes in the $130,000 to
$160,000 range. You nod your head in agreement. The agent informs you they have found three
neighborhoods where the mean (average) values of the houses are $128,571, $136,786 and
$161,429. Each subdivision is small, just like you prefer, with 14 homes. They explain the
need for you to sign an exclusive right to buy agreement before they take you out. Impressed
with both the immediacy and the detail provided, you quickly sign the agreement. The real
estate agent takes you out for the day. After cozying into the front seat of their car, you
sit back and enthusiastically await what should prove to be a worthwhile day of house hunting.
By the end of the day, you are shaking your head from side to side, not nodding it up and
down, and you are straining to think of ways to break the stupid exclusive right to buy
agreement you signed earlier. What happened? Let’s see.
The three subdivisions you saw:
House            Sleepy Brook    Vista View    Meadowlands
1                     205,000       130,000        300,000
2                     400,000       130,000        400,000
3                     500,000       135,000        400,000
4                      80,000       140,000        500,000
5                      70,000       150,000         65,000
6                      60,000       125,000         70,000
7                      80,000       125,000         70,000
8                      80,000       125,000         65,000
9                     100,000       125,000         65,000
10                    100,000       125,000         65,000
11                     60,000       125,000         65,000
12                     60,000       125,000         65,000
13                     60,000       120,000         65,000
14                     60,000       120,000         65,000
Average Value         136,786       128,571        161,429
Which subdivision was closest to your liking? Well, clearly Vista View is the only subdivision
that even had homes in your price range, with 5 of the 14 homes within your price range. But,
on paper, this was the least likely subdivision because its average home value was a little
below your range. Visiting the other two subdivisions was useless; they had no homes in your
range. The agent never checked the values of the individual homes in the three subdivisions;
they only checked the average value of the homes. To cut the agent some slack, checking 3
subdivisions with 14 homes each would have been a lengthy endeavor, because each home would
have needed to be accessed individually on the computer screen. Remember, the agent wanted to
impress you with their quick research.
Still, the oversight was caused because you did not have enough information about the data.
Measures of Central Tendency inform us as to the behavior of the middle of the data, without
the need to see every tedious piece of data. Since pulling up each home would have been too
time consuming (42 homes), what other pieces of information could have been pertinent so that
you would have known that only Vista View was worth visiting?
Range. The mean, or average, value for each of these sets of data is:
For Sleepy Brook:
\[ \frac{205{,}000 + 400{,}000 + 500{,}000 + 3(80{,}000) + 70{,}000 + 5(60{,}000) + 2(100{,}000)}{14} \approx \$136{,}786 \]
For Vista View:
\[ \frac{2(130{,}000) + 135{,}000 + 140{,}000 + 150{,}000 + 7(125{,}000) + 2(120{,}000)}{14} \approx \$128{,}571 \]
For Meadowlands:
\[ \frac{300{,}000 + 2(400{,}000) + 500{,}000 + 8(65{,}000) + 2(70{,}000)}{14} \approx \$161{,}429 \]
But, clearly, this was not enough information about the middle of the data. What else could
have helped us? Well, in the Meadowlands subdivision, 8 of the 14 homes were worth $65,000,
one-half of the lower limit of our price range. This would have been helpful to know.
The mode is the piece of data that shows up the most frequently. So, in the Meadowlands,
the mode is 65,000, occurring 8 times. For Vista View, the mode is 125,000, occurring 7 times.
This mode is close to our price range. And Sleepy Brook? Its mode was much lower, 60,000,
occurring 5 times.
What other tendency for the data would have been helpful? How many homes are not in our
price range would be too easy of an answer, huh? If we order our data, then the median value
of these homes is also readily available:
House            Sleepy Brook    Vista View    Meadowlands
1                      60,000       120,000         65,000
2                      60,000       120,000         65,000
3                      60,000       125,000         65,000
4                      60,000       125,000         65,000
5                      60,000       125,000         65,000
6                      70,000       125,000         65,000
7                      80,000       125,000         65,000
8                      80,000       125,000         65,000
9                      80,000       125,000         70,000
10                    100,000       130,000         70,000
11                    100,000       130,000        300,000
12                    205,000       135,000        400,000
13                    400,000       140,000        400,000
14                    500,000       150,000        500,000
Median                 80,000       125,000         65,000
How would knowing the median have been helpful? Well, if we knew the median for
Meadowlands, then we would have known that one-half of the homes in the subdivision, that is,
7 of the homes, were $65,000 or below. To keep an average of $161,429, many of the other homes
would have needed to be too expensive for us, leaving, at best, a few homes in our range. It
turned out, there were no homes in our range.
Which leads us to the dispersion of the data. Dispersion means spreading, scattering or
distribution. We can address these different words with different measures of spread.
The range is the difference between the largest and the smallest data point. For Sleepy Brook,
$500,000 - $60,000 = $440,000 or, more realistically, there is a large difference between the
cheapest and the most expensive home in the subdivision. For Vista View, $150,000 - $120,000 =
$30,000 and this tells us all the homes are at least close to our price range.
Meadowlands has the problem Sleepy Brook had; the range is $500,000 - $65,000 = $435,000.
To measure the scattering and the distribution of even larger samples of data, we will examine
standard deviations a little later. But first, let’s look at mean, median, mode and range a
little longer.
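As a small sketch of the range calculation, assuming the ordered home values from the tables above (the function name `value_range` is our own):

```python
# Ordered home values in dollars for each subdivision, as listed in the text.
homes = {
    "Sleepy Brook": [60_000] * 5 + [70_000] + [80_000] * 3 + [100_000] * 2
                    + [205_000, 400_000, 500_000],
    "Vista View": [120_000] * 2 + [125_000] * 7 + [130_000] * 2
                  + [135_000, 140_000, 150_000],
    "Meadowlands": [65_000] * 8 + [70_000] * 2
                   + [300_000, 400_000, 400_000, 500_000],
}

def value_range(values):
    """Range = largest data point minus smallest data point."""
    return max(values) - min(values)

for name, values in homes.items():
    print(f"{name}: range = ${value_range(values):,}")
```

A small range, as in Vista View, means every home is priced close to every other home, so the mean is a trustworthy summary there.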
Example One
Below are the Traffic Fatalities per 100 million (10^8) vehicle miles in 2001. Source: U.S.
National Highway Traffic Safety Administration. Rank the states and the District of Columbia
in ascending order. Then find the mode, median, mean and range. Discuss the relevance of the
numbers. This means if any two correspond closely, look at the data and tell why. If any state
is far from the middle of the data, call it an outlier.
Alabama                 1.75    Missouri          1.62
Alaska                  1.80    Montana           2.30
Arizona                 2.06    Nebraska          1.36
Arkansas                2.08    Nevada            1.71
California              1.27    New Hampshire     1.15
Colorado                1.71    New Jersey        1.09
Connecticut             1.01    New Mexico        1.99
Delaware                1.58    New York          1.18
District of Columbia    1.81    North Carolina    1.67
Florida                 1.93    North Dakota      1.45
Georgia                 1.50    Ohio              1.29
Hawaii                  1.61    Oklahoma          1.55
Idaho                   1.84    Oregon            1.42
Illinois                1.37    Pennsylvania      1.49
Indiana                 1.27    Rhode Island      1.01
Iowa                    1.49    South Carolina    2.27
Kansas                  1.75    South Dakota      2.00
Kentucky                1.83    Tennessee         1.85
Louisiana               2.32    Texas             1.72
Maine                   1.33    Utah              1.25
Maryland                1.27    Vermont           0.96
Massachusetts           0.90    Virginia          1.27
Michigan                1.34    Washington        1.21
Minnesota               1.06    West Virginia     1.91
Mississippi             2.18    Wisconsin         1.33
                                Wyoming           2.16
Solution
State            Rate   Rank    State                   Rate   Rank
Massachusetts    0.90   50      Missouri                1.62   22
Vermont          0.96   49      North Carolina          1.67   21
Connecticut      1.01   47      Colorado                1.71   19
Rhode Island     1.01   47      Nevada                  1.71   19
Minnesota        1.06   46      Texas                   1.72   18
New Jersey       1.09   45      Alabama                 1.75   16
New Hampshire    1.15   44      Kansas                  1.75   16
New York         1.18   43      Alaska                  1.80   15
Washington       1.21   42      District of Columbia    1.81   (X)
Utah             1.25   41      Kentucky                1.83   14
California       1.27   37      Idaho                   1.84   13
Indiana          1.27   37      Tennessee               1.85   12
Maryland         1.27   37      West Virginia           1.91   11
Virginia         1.27   37      Florida                 1.93   10
Ohio             1.29   36      New Mexico              1.99   9
Maine            1.33   34      South Dakota            2.00   8
Wisconsin        1.33   34      Arizona                 2.06   7
Michigan         1.34   33      Arkansas                2.08   6
Nebraska         1.36   32      Wyoming                 2.16   5
Illinois         1.37   31      Mississippi             2.18   4
Oregon           1.42   30      South Carolina          2.27   3
North Dakota     1.45   29      Montana                 2.30   2
Iowa             1.49   27      Louisiana               2.32   1
Pennsylvania     1.49   27
Georgia          1.50   26      Mode                    1.27
Oklahoma         1.55   25      Median                  1.55
Delaware         1.58   24      Mean                    1.57
Hawaii           1.61   23
The mean and median are close; this means the number in the middle of the data and the average
are close together. The number of states that rank above and below the average and the number
of states that rank above and below the middle state, Oklahoma, are close to the same. So, the
data is not top or bottom heavy. Yet, this doesn’t mean the data is dispersed evenly. Why?
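The summary values in the solution can be cross-checked with a short sketch. The rates are copied from the table in Example One; the dictionary name `rates` and the `summary` layout are our own choices, and the range (not listed in the solution table) is computed here as largest minus smallest.

```python
from statistics import mean, median, mode

# Traffic fatalities per 100 million vehicle miles, 2001, from the table in the text.
rates = {
    "Alabama": 1.75, "Alaska": 1.80, "Arizona": 2.06, "Arkansas": 2.08,
    "California": 1.27, "Colorado": 1.71, "Connecticut": 1.01, "Delaware": 1.58,
    "District of Columbia": 1.81, "Florida": 1.93, "Georgia": 1.50, "Hawaii": 1.61,
    "Idaho": 1.84, "Illinois": 1.37, "Indiana": 1.27, "Iowa": 1.49, "Kansas": 1.75,
    "Kentucky": 1.83, "Louisiana": 2.32, "Maine": 1.33, "Maryland": 1.27,
    "Massachusetts": 0.90, "Michigan": 1.34, "Minnesota": 1.06, "Mississippi": 2.18,
    "Missouri": 1.62, "Montana": 2.30, "Nebraska": 1.36, "Nevada": 1.71,
    "New Hampshire": 1.15, "New Jersey": 1.09, "New Mexico": 1.99, "New York": 1.18,
    "North Carolina": 1.67, "North Dakota": 1.45, "Ohio": 1.29, "Oklahoma": 1.55,
    "Oregon": 1.42, "Pennsylvania": 1.49, "Rhode Island": 1.01,
    "South Carolina": 2.27, "South Dakota": 2.00, "Tennessee": 1.85, "Texas": 1.72,
    "Utah": 1.25, "Vermont": 0.96, "Virginia": 1.27, "Washington": 1.21,
    "West Virginia": 1.91, "Wisconsin": 1.33, "Wyoming": 2.16,
}

values = sorted(rates.values())          # ascending order, as the example asks
summary = {
    "mode": mode(values),
    "median": median(values),            # 51 values, so the 26th ordered value
    "mean": round(mean(values), 2),
    "range": round(max(values) - min(values), 2),
}
lowest = min(rates, key=rates.get)       # state with the smallest rate
highest = max(rates, key=rates.get)      # state with the largest rate

print(summary)
print("lowest:", lowest, "| highest:", highest)
```

Because there are 51 data points, the median is an actual value in the data set rather than an average of two middle values.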
Exercise Set
For problems 1-6 below, find the mean, median and mode for the data.
1. 1, 3, 4, 4, 4, 5, 5, 6
2. 3, 3, 4, 4, 4, 5, 5, 6
3. 3, 3, 3, 4, 4, 5, 5, 6
4. 3, 3, 3, 4, 5, 5, 5, 6
5. 1, 1, 1, 1, 2, 2, 6, 6
6. 1, 1, 2, 2, 6, 6, 6, 6
7. What is the median time it took for the students to write the exam?
Student ID Number    Time to Take Exam
4025                 1:25
1026                 1:09
8790                 0:59
1029                 0:45
2943                 1:01
2020                 1:12
2084                 1:25
5091                 1:31
7812                 0:49
5103                 2:00
6092                 1:42
8. Below is the year and the percent of children under the age of four in a city that
attended daycare. What is the mode for this set of data?
Year    Percent    Year    Percent
1970    15         1972    17
1974    15         1976    16
1978    18         1980    17
1982    21         1984    31
1986    12         1988    15
1990    16         1992    17
1994    15         1996    12
9. From the US Census Bureau, 1999, below are the state rankings of the percent of elderly
persons, 65 years and over, that live below the poverty level. Rank the states and the
District of Columbia in ascending order. Then find the mode, median, mean and range. Discuss
the relevance of the numbers. This means if any two correspond closely, look at the data and
tell why. If any state is far from the middle of the data, call it an outlier.
Alabama                 15.5    Montana           9.1
Alaska                   6.8    Nebraska          8.0
Arizona                  8.4    Nevada            7.1
Arkansas                13.8    New Hampshire     7.2
California               8.1    New Jersey        7.8
Colorado                 7.4    New Mexico       12.8
Connecticut              7.0    New York         11.3
Delaware                 7.9    North Carolina   13.2
District of Columbia    16.4    North Dakota     11.1
Florida                  9.1    Ohio              8.1
Georgia                 13.5    Oklahoma         11.1
Hawaii                   7.4    Oregon            7.6
Idaho                    8.3    Pennsylvania      9.1
Illinois                 8.3    Rhode Island     10.6
Indiana                  7.7    South Carolina   13.9
Iowa                     7.7    South Dakota     11.1
Kansas                   8.1    Tennessee        13.5
Kentucky                14.2    Texas            12.8
Louisiana               16.7    Utah              5.8
Maine                   10.2    Vermont           8.5
Maryland                 8.5    Virginia          9.5
Massachusetts            8.9    Washington        7.5
Michigan                 8.2    West Virginia    11.9
Minnesota                8.2    Wisconsin         7.4
Mississippi             18.8    Wyoming           8.9
Missouri                 9.9
Standard Deviation and the Normal Distribution
According to a study done by the National Center for Health Statistics, Mean Body Weight,
Height and Body Mass Index, United States 1960-2002, American men (ages 20-74) were 25
pounds heavier in 2002 than they were some 42 years earlier in 1960. In 2002 the average
American male weighed 191 pounds, up from his 1960 counterpart who weighed 166 pounds.
American women from the same age group followed the same trend: the average American
woman weighed 164 pounds in 2002, up 24 pounds from the average American woman from
1960, who weighed 140 pounds. This study caused quite a stir, as nutritionists and diet doctors
clamored together to seek solutions. And as you can imagine, the dangers of obesity were
revisited when this study was broadcast.
Not only had the average American’s weight increased, but Americans had grown taller as well.
The average male height increased over the 42 year span, from 5' 8" in 1960 to 5' 9 ½" in
2002. And as expected, the average height for the American woman also increased, from 5' 3" to
5' 4". The study was done on a smaller representation of the true population; it was performed
on thousands of people and in reality, the populations of American men and women total in the
hundreds of millions. Since these numbers are so large, we assume the data to be normally
distributed around the mean, or average. A normal distribution for a set of data means that
there is more data close to the "average," and less data farther from the average until
finally relatively few data points tend to one extreme or the other. The data is symmetrically
distributed around the average.
This is common sense, or mathematical intuition. Humans are, after all, close to being one and
the same. Let’s say you are writing a story about the height distribution of the American male
in 2002 because you are trying to correlate it to ethnicity, diet or genes. First you take the
population, in this case, those people who participated in the study, and tally up the number
of people for each given height. As with many physical measurements, if the sample or
population is large enough, the heights for the population turn out to be approximately
normally distributed. This means most people will be of average height or close to average
height. In other words, the average height also will be the height to occur most frequently in
our population and the height found in the middle of the data when it is ordered. Thus, the
mean will be the mode and the median too. Next, if a population is normally distributed, and
you plot each height in increasing order, the number of men for a given height is
symmetrically distributed around the average height. In other words, there will be more people
close to average height than far from the average height. In 2002, the average height of the
American male was 5' 9 ½". For our normally distributed society, which we aptly call the
American male, the next most common heights occur from 5' 9" through 5' 10", both heights ½
inch away from the mean height. Next, the most common heights would be expected to occur
between 5' 8" and 5' 11". And so on. We expect fewer and fewer men to have a designated height
as we move further from the average height. Intuitively, this fits our preconceived notion of
our society; we expect to see fewer men who are 6' 5" than men who are 5' 11", for instance.
And similarly, this means there will be fewer men who are five feet tall than 5' 7" and, on
the other side of the mirror, fewer who are 6 ½ feet than 5' 11". Because height is a normally
distributed trait, the heights are distributed symmetrically around the average height.
So, we estimated the number of adult
American Males for each given height and
then grouped the heights into small
intervals. We then drew a bar graph, as
shown to the left. The x-axis represents a
given height, the y-axis represents the
number of adult males of that given
height. Notice, the graph is centered at the
average height of the adult American
male.
[Figure: bar graph of the number of adult males (y-axis, 0 to 180,000) at each height (x-axis)]
We then redrew the normally distributed data as a line graph, as shown on the left.
[Figure: line graph of the same data (y-axis, 0 to 180,000)]
Often, we draw the Normally Distributed Bell Shaped curve
free hand. Our approximation of a Bell Shaped curve may
look like the graph to the left. Note: The x-axis (the
horizontal one) is the value in question, the population’s
height for example. The y-axis (the vertical one) is the
number of data points for each value on the x-axis, the
number of people that are that certain height.
The standard deviation is a measure of how widely values are dispersed from the mean
(average value). For populations where the data points are tightly bunched together, the
bell-shaped curve is steep and the standard deviation is small. For populations where the data
points are spread further apart from the average, the bell curve is flatter and the standard
deviation is larger.
68-95-99.7 To refine our understanding of a standard deviation, we turn our attention to a bell
shaped graph. In a moment we will show you the calculation for the standard deviation. Right
now, we want to present a conceptual understanding for the term ‘standard deviation.’ Recall,
in 2002, the American male had a mean height of 5' 9 ½". The standard deviation is 2 3/8".
For a normal distribution, one standard deviation (in red above) away from the mean in both
directions on the horizontal axis will account for approximately 68 % of the population. There
are two heights that are 2 3/8 inches from 5' 9 ½": the smaller, 5' 7 1/8" (5' 9 ½" - 2 3/8"),
and the larger, 5' 11 7/8" (5' 9 ½" + 2 3/8"). Thus, about 68 % of the American men in 2002
stood between 5' 7 1/8" and 5' 11 7/8".
All data found within two standard deviations (in red and green above) from the mean will
account for roughly 95 % of the normally distributed population, or the adult American male
population. The two heights two standard deviations away from the mean are found with two
predictable calculations. We first subtract two standard deviations from the mean, giving us
5' 9 ½" - 2 3/8" - 2 3/8" = 5' 4 ¾". We then add two standard deviations to the mean, giving
5' 9 ½" + 2 3/8" + 2 3/8" = 6' 2 ¼". So, about 95 % of American men in 2002 were somewhere
between 5' 4 ¾" and 6' 2 ¼". Recall, the heights for this 95 % of the population are
symmetrically distributed about the mean.
Data found within three standard deviations of the mean (the red, green and blue areas)
account for about 99.7 % of normally distributed populations. So, in 2002, about 99.7 % of the
American men were between 5' 2 3/8" (5' 9 ½" - 2 3/8" - 2 3/8" - 2 3/8") and 6' 4 5/8"
(5' 9 ½" + 2 3/8" + 2 3/8" + 2 3/8"). From a different perspective, one could infer that in
2002, those American men who were more than three standard deviations away from the mean,
either shorter than 5' 2 3/8" or taller than 6' 4 5/8", represented 0.3 % of the adult
American male population; they were considered short or tall by our population’s standards.
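The height intervals above can be reproduced with exact fraction arithmetic. This is a minimal sketch, assuming the mean of 5' 9 ½" (69 ½ inches) and the standard deviation of 2 3/8 inches given in the text; the helper `feet_inches` is a hypothetical name of our own.

```python
from fractions import Fraction

mean_height = Fraction(139, 2)   # 69 1/2 inches, i.e. 5' 9 1/2"
std_dev = Fraction(19, 8)        # 2 3/8 inches

def feet_inches(total_inches):
    """Format a height in inches as feet, whole inches and a fractional inch."""
    feet, inches = divmod(total_inches, 12)
    whole, part = divmod(inches, 1)
    label = f"{feet}' {whole}"
    if part:
        label += f" {part}"
    return label + '"'

# Mean plus or minus 1, 2 and 3 standard deviations (the 68-95-99.7 rule).
for k, percent in ((1, "68"), (2, "95"), (3, "99.7")):
    low = mean_height - k * std_dev
    high = mean_height + k * std_dev
    print(f"about {percent} %: {feet_inches(low)} to {feet_inches(high)}")
```

Working in inches with exact fractions avoids mixing feet and inches in the subtraction, which is where slips in this kind of calculation usually creep in.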
If the curve were flatter, the standard deviation would have to be larger in order to account
for that 68 percent, and if the curve were steeper, the standard deviation would have to be
smaller to account for 68 percent of the population. Standard deviation tells you “how spread
out the data points in the population are from the mean.”
Why is this useful? Well, if you are comparing test scores for different schools, the standard
deviation will tell you how diverse the test scores are for each school. Let's say Washington
High School has a higher mean test score than Adams High School for the mathematics portion
of the statewide AIMS test administered in the state of Arizona to measure the students’
understanding of high school mathematics. Our first reaction might be to deduce that the
students at Washington are either smarter or better educated by the teachers.
You analyze the data further. The standard deviation, you find out, is larger at Washington
than at Adams. This means that at Washington there are relatively more kids scoring toward one
extreme or the other. By asking a few follow-up questions, you might find that Washington’s
average was higher because the school district sent all of the gifted education kids to
Washington. Or perhaps Adams’ scores were dragged down, and thus appeared bunched together,
because of all of the students who recently had been "mainstreamed" from special education
classes. Perhaps the gifted classes were sent out of district. In this way, looking at the
standard deviation can help point you in the right direction when asking why data is the way
it is.
Example One
You are trying to decide which teacher’s class to enroll in for Mathematics. You go to a
website that claims to have tracked the three teachers’ success rates over the past five
years. The final grades for Mr. Allen’s students had a mean score of 76 with a standard
deviation of 5, while Mrs. Bennett’s students had a mean score of 74 with a standard deviation
of 3, and Mrs. Clyde’s students had a mean score of 79 with a standard deviation of 7. Whose
class would you enroll in? How would you interpret the data on the web site? Rank the teachers
from first to third, so that if one’s section is full, you would know whose class to register
for next.
Solution
We must quantify the exam scores to interpret the data. Assuming each teacher’s final grades
are normally distributed, for Mr. Allen’s classes, 68 % of the students earned a final grade
that was within 5 points of 76, so 68 % of the students earned between 71 and 81. About 95 %
scored within two standard deviations of the mean, so 95 % of the students earned a final
grade between 66 and 86. Finally, 99.7 % of the students earned a final grade between 61 and
91. Continuing with this thought process, Mrs. Bennett’s students had a lower final grade
average, 74. But 68 % of her students earned a final grade between 71 and 77, 95 % earned a
final grade between 68 and 80 and 99.7 % earned a final grade between 65 and 83. Mrs. Clyde’s
students earned the highest average, but her 68 %, 95 % and 99.7 % ranges were spread farther
apart: 72-86, 65-93 and 58-100, respectively.
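The nine grade ranges in this solution all come from the same rule, mean plus or minus one, two or three standard deviations. A brief sketch, assuming normally distributed grades and using our own helper name `empirical_intervals`:

```python
# (mean, standard deviation) of final grades for each teacher, from the example.
teachers = {
    "Mr. Allen": (76, 5),
    "Mrs. Bennett": (74, 3),
    "Mrs. Clyde": (79, 7),
}

def empirical_intervals(mean, sd):
    """Return the 68 %, 95 % and 99.7 % grade ranges for a normal distribution."""
    return {
        label: (mean - k * sd, mean + k * sd)
        for k, label in ((1, "68 %"), (2, "95 %"), (3, "99.7 %"))
    }

for name, (mean, sd) in teachers.items():
    ranges = empirical_intervals(mean, sd)
    text = ", ".join(f"{label}: {low}-{high}" for label, (low, high) in ranges.items())
    print(f"{name}: {text}")
```

Laying the ranges out per teacher makes the trade-off visible: a higher mean with a larger standard deviation widens both the top and the bottom of the range.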
A table allows us to compare the success rates of the three teachers:

          Mr. A    Mrs. B   Mrs. C
68 %      71-81    71-77    72-86
95 %      66-86    68-80    65-93
99.7 %    61-91    65-83    58-100

So, which teacher should you take? If you are a good student, you have a better chance of
securing an A with Mrs. Clyde first, Mr. Allen second and Mrs. Bennett third. If you struggle
at math, you would probably choose Mrs. Bennett first, because 99.7 % of her students earn
above a 65. Mr. Allen would probably be your second choice and Mrs. Clyde your third choice.
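The three ranges for each teacher follow directly from the mean and the standard deviation. As a minimal sketch, assuming the grades are roughly normally distributed, the intervals could be computed like this (the function name `empirical_intervals` is only illustrative):

```python
# Empirical (68-95-99.7) rule intervals for each teacher's final grades.
# The means and standard deviations come from the example.
teachers = {
    "Mr. Allen": (76, 5),
    "Mrs. Bennett": (74, 3),
    "Mrs. Clyde": (79, 7),
}


def empirical_intervals(mean, sd):
    """Return the ranges holding about 68 %, 95 % and 99.7 % of the data."""
    return {
        "68 %": (mean - sd, mean + sd),
        "95 %": (mean - 2 * sd, mean + 2 * sd),
        "99.7 %": (mean - 3 * sd, mean + 3 * sd),
    }


for name, (mean, sd) in teachers.items():
    print(name, empirical_intervals(mean, sd))
```

A wide interval means more students far above the mean, but also more students far below it.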
Example Two
In Typical City, USA, the number of hours a teen watches TV has become a concern for the
town’s elders. They research this and find that teens watch an average of 4 ½ hours of TV a day,
with a standard deviation of ½ hour. What percent of the teens watch
a) more than 5 hours of TV per day?
b) more than 5 ½ hours of TV per day?
c) less than 5 ½ hours of TV per day?
d) less than 4 hours of TV per day?
e) less than 3 ½ hours of TV per day?
Solution
a) Since 5 hours is 1 standard deviation above the mean (4 ½ plus ½), and 68 % of the
teens fall within 1 standard deviation of the mean, 68 % watch between 4 and 5 hours. Half
of the teens watch less than 4 ½ hours, and another 34 % (half of the 68 %) watch
between 4 ½ and 5 hours. So, 84 % watch less than 5 hours, thus 100 % - 84 % or 16
% watch more than 5 hours per day.
b) Since 5 ½ hours is 2 standard deviations above the mean (4 ½ plus ½ plus ½), and 95 %
of the teens fall within 2 standard deviations of the mean, 95 % watch between 3 ½ and
5 ½ hours. Half of the teens watch less than 4 ½ hours, and another 47 ½ % (half of the
95 %) watch between 4 ½ and 5 ½ hours. So, 97 ½ % watch less than 5 ½ hours,
thus 100 % - 97 ½ % or 2 ½ % watch more than 5 ½ hours per day.
c) From part b), we have 100 % - 2 ½ % = 97 ½ % of the teens watch
less than 5 ½ hours of TV per day.
d) Since 4 hours is 1 standard deviation below the mean (4 ½ minus ½), and 68 % of the
teens fall within 1 standard deviation of the mean, 68 % watch between 4 and 5 hours. Half
of the teens watch less than 4 ½ hours, and 34 % (half of the 68 %) watch
between 4 and 4 ½ hours. So, 50 % - 34 % or 16 % watch less than 4 hours per day.
e) Since 3 ½ hours is 2 standard deviations below the mean (4 ½ minus ½ minus ½), and
95 % of the teens fall within 2 standard deviations of the mean, 95 % watch between
3 ½ and 5 ½ hours. Half of the teens watch less than 4 ½ hours, and 47 ½ %
(half of the 95 %) watch between 3 ½ and 4 ½ hours per day. So, 50 % - 47 ½ % or 2 ½ %
of the teens watch less than 3 ½ hours per day.
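The answers above rest on the rounded 68-95-99.7 figures. As a rough check, and assuming the viewing times follow a normal curve exactly, Python's standard library can compute the same percentages. The helper names `percent_above` and `percent_below` are only illustrative. The exact values land close to the rounded ones, for example about 2.3 % rather than 2 ½ % for part b).

```python
from statistics import NormalDist

# TV hours per day: mean 4.5, standard deviation 0.5 (from Example Two).
tv = NormalDist(mu=4.5, sigma=0.5)


def percent_above(hours):
    """Percent of teens watching more than the given number of hours."""
    return 100 * (1 - tv.cdf(hours))


def percent_below(hours):
    """Percent of teens watching less than the given number of hours."""
    return 100 * tv.cdf(hours)


answers = [
    ("a) more than 5 hours", percent_above(5)),
    ("b) more than 5.5 hours", percent_above(5.5)),
    ("c) less than 5.5 hours", percent_below(5.5)),
    ("d) less than 4 hours", percent_below(4)),
    ("e) less than 3.5 hours", percent_below(3.5)),
]
for label, value in answers:
    print(f"{label}: {value:.1f} %")
```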
Standard score or z-score. If one is analyzing data within 1, 2 or 3 standard deviations from the
mean, then you can expect 68 %, 95 % or 99.7 %, respectively, of the population to lie within
these bounds. But what happens if we know that 90 % of the data lies between two scores? What
would that look like in terms of standard deviations?
Since data rarely if ever is presented to us where the mean is zero and the standard deviation is 1,
we use the standard normal curve to help analyze any normally distributed data. A data value
with a z-score of 0 indicates the data is the mean. A data value with a z-score of –1.3 indicates
the data value is 1.3 standard deviations below the mean and so forth. If you know the standard
deviation and mean of your data, z-scores enable you to determine the percent of data between
any two values in the range of your data.
The formula used to find each z-score is
$$z = \frac{\text{data value} - \text{mean}}{\text{standard deviation}}.$$
Below is a table for the z-scores for the standard normal distribution.
z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.2   0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
0.3   0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.4   0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
0.5   0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
0.6   0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
0.8   0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
0.9   0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.0   0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.1   0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
1.2   0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
1.3   0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
1.4   0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
1.5   0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
1.6   0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
1.7   0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
1.8   0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
1.9   0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767
2.0   0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
2.1   0.4821  0.4826  0.4830  0.4834  0.4838  0.4842  0.4846  0.4850  0.4854  0.4857
2.2   0.4861  0.4864  0.4868  0.4871  0.4875  0.4878  0.4881  0.4884  0.4887  0.4890
2.3   0.4893  0.4896  0.4898  0.4901  0.4904  0.4906  0.4909  0.4911  0.4913  0.4916
2.4   0.4918  0.4920  0.4922  0.4925  0.4927  0.4929  0.4931  0.4932  0.4934  0.4936
2.5   0.4938  0.4940  0.4941  0.4943  0.4945  0.4946  0.4948  0.4949  0.4951  0.4952
2.6   0.4953  0.4955  0.4956  0.4957  0.4959  0.4960  0.4961  0.4962  0.4963  0.4964
2.7   0.4965  0.4966  0.4967  0.4968  0.4969  0.4970  0.4971  0.4972  0.4973  0.4974
2.8   0.4974  0.4975  0.4976  0.4977  0.4977  0.4978  0.4979  0.4979  0.4980  0.4981
2.9   0.4981  0.4982  0.4982  0.4983  0.4984  0.4984  0.4985  0.4985  0.4986  0.4986
3.0   0.4987
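Each entry in the table appears to be the area under the standard normal curve between the mean (z = 0) and the given z, which is consistent with entries such as 0.3413 at z = 1. Under that assumption, a short sketch using the error function from Python's `math` module reproduces the entries. The function name `area_mean_to_z` is only illustrative.

```python
import math


def area_mean_to_z(z):
    """Area under the standard normal curve between 0 and z."""
    return 0.5 * math.erf(z / math.sqrt(2))


# Print the first few rows in the same layout as the table:
# the row gives z to one decimal place, the column adds the second decimal.
print("z    " + "  ".join(f"{c / 100:.2f}" for c in range(10)))
for tenth in range(4):
    row = [area_mean_to_z(tenth / 10 + c / 100) for c in range(10)]
    print(f"{tenth / 10:.1f}  " + "  ".join(f"{a:.4f}" for a in row))
```

To look up z = 1.05, for instance, read the row for 1.0 and the column for 0.05.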
Example Three
We know that in 2002, the average height of the American male was 5’ 9 ½” and the standard
deviation was 2 ⅜”. What percent of American males in 2002 were …
a) taller than 6’?
b) shorter than 5’ 7 ½”?
c) between 5’ 10” and 6’ 1”?
Solution
a) First, we find the z-score associated with 6’. We have
$$z = \frac{\text{data value} - \text{mean}}{\text{standard deviation}} = \frac{6' - 5'\,9\tfrac{1}{2}''}{2\tfrac{3}{8}''} = \frac{2.5}{2.375} \approx 1.05.$$
Notice the positive 1.05 corresponds to the fact that 6’ is above the mean. Now, we glance at the
table and see that a z-score of 1.05 has a value of 0.3531. This means that in 2002, 35.31 % of
adult American males were between the average height of 5’ 9 ½” and 6’. So, the percent of
adult American males taller than 6’ was 100 % - 50 % - 35.31 % or 14.69 %.
b) First, we find the z-score associated with 5’ 7 ½”. We have
$$z = \frac{5'\,7.5'' - 5'\,9.5''}{2\tfrac{3}{8}''} = \frac{-2.0}{2.375} \approx -0.84.$$
The negative value corresponds to the fact that 5’ 7 ½” is below the mean. Now, we glance at
the table and see that a z-score of 0.84 has a value of 0.2995. This means that in 2002, 29.95 %
of adult American males were between 5’ 7 ½” and the average height of 5’ 9 ½”. The percent
of American males shorter than 5’ 7 ½” was 100 % - 50 % - 29.95 % or 20.05 %.
c) Calculating each z-score, we have:
$$z = \frac{5'\,10'' - 5'\,9.5''}{2\tfrac{3}{8}''} = \frac{0.5}{2.375} \approx 0.21 \quad\text{and}\quad z = \frac{6'\,1'' - 5'\,9.5''}{2\tfrac{3}{8}''} = \frac{3.5}{2.375} \approx 1.47.$$
Now, we use the table and see that the z-scores of 0.21 and 1.47 have the values 0.0832 and
0.4292 respectively. This means that in 2002, 8.32 % of American males were between the
average height of 5’ 9 ½” and 5’ 10”, and 42.92 % of American males were between the
average height of 5’ 9 ½” and 6’ 1”. So, the percent of adult American males between
5’ 10” and 6’ 1” would be 42.92 % - 8.32 % or 34.6 %.
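The three answers can be cross-checked without the table. The sketch below assumes heights are normally distributed and converts everything to inches (5’ 9 ½” = 69.5, 2 ⅜” = 2.375). Because it does not round the z-scores to two decimal places first, its percentages differ from the table-based answers by a few hundredths of a percent. The variable names are only illustrative.

```python
from statistics import NormalDist

# Heights in inches: mean 69.5, standard deviation 2.375 (from Example Three).
heights = NormalDist(mu=69.5, sigma=2.375)


def z_score(value):
    """Number of standard deviations a height lies from the mean."""
    return (value - heights.mean) / heights.stdev


taller_than_6ft = 100 * (1 - heights.cdf(72))          # a) taller than 6'
shorter_than_5ft7_5 = 100 * heights.cdf(67.5)          # b) shorter than 5' 7 1/2"
between_5ft10_and_6ft1 = 100 * (heights.cdf(73) - heights.cdf(70))  # c)

print(f"z for 6 ft: {z_score(72):.2f}")
print(f"a) {taller_than_6ft:.1f} %")
print(f"b) {shorter_than_5ft7_5:.1f} %")
print(f"c) {between_5ft10_and_6ft1:.1f} %")
```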
Exercise Set
1. Two AP calculus classes were taught by Mr. Venette and Ms. Harper. The final grades for
the course during the past five years indicated that Mr. Venette’s classes had a mean of 80
and a standard deviation of 4, while Ms. Harper’s classes had a mean of 82, but a standard
deviation of 2.5. Interpret the results in terms of the 68-95-99.7 percentages. Then give
possible reasons for the differences you observe.

For questions 2 to 10, use the following data: The mean income in a city is $ 51,000,
and the standard deviation is $ 4000. Find the percentage of people whose income is
2. $ 59,000 or above
3. $ 47,000 or below
4. between $ 43,000 and $ 55,000
5. $ 55,000 or below
6. between $ 39,000 and $ 55,400
7. $ 39,000 or below
8. $ 40,000 or above
9. $ 50,000 or below
10. between $ 50,000 and $ 60,000

For problems 11 to 16, use the following information. In Japan in 2002, studies
pertaining to the heights of adults separated by gender vary slightly, but a rough
estimation of data compiled from various studies is as follows: For the adult male
population, 95% of the males were found to be between 5' 0 5/8" and 5' 9 5/8", with the
average at 5' 5 1/8". For the adult female population, 95% of the females were found
to be between 4' 8 1/2" and 5' 4", with the average at 5' 0 1/4".
11. Find the standard deviation for both the males and the females and interpret both in a
complete sentence.
12. Find the percent of Japanese males shorter than 5' 9 5/8".
13. Find the height of the Japanese female who is taller than 99.7 % of the population.
14. Find the height of the Japanese males who are shorter than 84 % of the population.
15. Find the percent of Japanese males shorter than 5' 2".
16. Find the percent of Japanese women taller than 5' 2".
For problems 17 to 24, use the following information. In the United States in 2002,
studies pertaining to the weights of adults separated by gender vary slightly, but a rough
estimation of data compiled from various studies is as follows: For the adult male
population, 95% of the males were found to be between 168 lbs and 214 lbs, with the
average at 191 lbs. For the adult female population, 95% of the females were found to be
between 140 lbs and 188 lbs, with the average at 164 lbs.
17. Find the standard deviation for both the males and the females and interpret both in a
complete sentence.
18. Find the percent of American females who weigh more than 145 pounds.
19. Find the weight of the American female who weighs less than 34 % of the population.
20. Find the weight of the American male who weighs more than 90% of the population.
21. Find the percent of American females who weigh more than 150 pounds.
22. Find the percent of American males who weigh more than 200 pounds.
23. Find the percent of American females who weigh less than the weight of the
average male in 2002.
24. Find the percent of American males who weigh less than the weight of the
average female in 2002.

25. On your route home, you have a choice of taking two bridges, each of the same
length and same number of lanes. At the time you cross each bridge, Bridge One has
an average of 420 cars on it with a standard deviation of 100, and Bridge Two has an
average of 460 cars on it with a standard deviation of 20 cars. Which bridge would
you decide to cross? Would it matter if you were in a hurry?

For problems 26 to 32, use this fact: According to Robert Dvorchak, Pittsburgh
Post-Gazette, the average length of a National League baseball game in 2004 was
2:47:20. By comparison, the average game lasted 2:30 in 1967, exactly 2:00 in the
1940’s (according to the Sporting News), and a mere 1:30 a century ago. If we estimate
a standard deviation of 20 minutes, what percent of the games …
26. in 2004 lasted longer than two hours
27. in 1967 lasted longer than two hours
28. in 1940 lasted longer than two hours
29. a century ago lasted longer than two hours
30. in 2004 lasted longer than three hours
31. in 1967 lasted longer than three hours
32. in 2004 were shorter than 3 ½ hours
Standard Deviation
A standard deviation, then, is loosely speaking the typical distance of the data from the mean.
For each data value, we subtract the mean, and the result is either zero, positive or negative.
When we add these differences, the positive differences cancel the negative differences and the
sum is zero. Since we are looking for a typical distance from the mean, this calculation would
prove worthless. Try it and see for yourself. Instead, we square each difference, because the
squared values are never negative. Thus, they won’t have the effect of canceling each other out.
Now, we add them all up. We then divide by the number of terms. Almost. Actually, we divide by
n-1, because we are usually working with a sample and measuring each difference from the
sample’s own mean, which makes the squared differences slightly too small on average; dividing
by n-1 instead of n corrects for this. We then take the square root of this quotient to undo the
effect of squaring, giving us a measurable typical distance from the mean. For an individual data
value, a positive difference indicates the value is above the mean, and a negative difference
indicates it is below the mean.
A practical way to compute the standard deviation is to use a spreadsheet.
In Microsoft Excel, type the following formula into the cell where you want the standard deviation
result, using the "unbiased," or "n-1" method:
=STDEV(A1:Z99) (substitute the cell name of the first value in your dataset for A1, and
the cell name of the last value for Z99.)
To calculate the standard deviation, let $x_i$ be one value in your set of data and let $\bar{x}$ be
the mean of all the values in your set of data. Let n be the number of data points in your set of
data. For each value $x_i$, subtract the mean, that is $x_i - \bar{x}$, then square the result,
$(x_i - \bar{x})^2$. Sum up all those squared values and then divide the sum by (n-1). Finally,
there's one more step: take the square root of this ratio. That's the standard deviation of your
set of data, written as $\sigma$:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
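The steps just described translate almost line for line into code. The following is a minimal sketch, not a replacement for a spreadsheet or a statistics package. The function name `sample_standard_deviation` and the data set are made up for illustration, and the result is compared with the standard library's `statistics.stdev`, which also divides by n-1.

```python
import math
import statistics


def sample_standard_deviation(values):
    """Standard deviation using the n-1 method described in the text."""
    n = len(values)
    mean = sum(values) / n
    # Square each difference from the mean so the terms cannot cancel.
    squared_differences = [(x - mean) ** 2 for x in values]
    # Divide by n-1, then take the square root to undo the squaring.
    return math.sqrt(sum(squared_differences) / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, for illustration only
print(sample_standard_deviation(data))
print(statistics.stdev(data))  # should agree with the line above
```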
Introduction or who deserves the B?
Let’s develop an intrinsic feel for the measure of central tendency of data. Below are five
students, and their seven grades for a course. The bottom row reaffirms that all five students have
a 79 average.
           Allan   Bill   Cindy   Deanna   Eve
           74      73     59      69       68
           76      75     62      73       78
           77      77     78      78       79
           80      79     79      79       79
           81      83     80      82       83
           82      83     96      83       83
           83      83     99      89       83
Average    79      79     79      79       79
All five students want a B. Allan argues that his middle grade, his median, is a B. Bill argues that
his mode, the grade that occurs most frequently, is a B. Cindy argues that she has shown great
potential; two of her grades are solid A’s. Deanna makes the same argument, but her grades are
not quite as erratic as Cindy’s; they are not as dispersed away from the average as Cindy’s grades.
Eve, like Bill, also argues that her mode is a B. And although Eve and Bill have the same mean,
median and mode, Eve is the one with the 68. Uh oh….
Standard deviations measure just this: how far the data values deviate from the mean. In other
words, a standard deviation is a numerical value that tells the reader how spread out the data is.
Allan’s grades are clumped together, so he should have a small standard deviation. Cindy’s grade
history is more erratic, her grades are farther spread out, so she should have a larger standard
deviation.
Let’s compare the standard deviations for three of the students, Bill, Cindy and Eve. We will
find how much each data value deviates from the mean. But notice, if we try to sum up these
deviations from the mean without squaring the differences, the sum is zero. Why?
So, first we subtract the mean from each data point (deviation). Then we square the differences
(deviation squared), sum the squares of the differences, and divide this sum by a number that is
one less than the number of data points. Lastly, we take the square root of this ratio and we have
the standard deviation.
Bill    deviation   deviation squared
73      -6          36
75      -4          16
77      -2          4
79      0           0
83      4           16
83      4           16
83      4           16
Mean 79 Sum 0       Sum 104

Cindy   deviation   deviation squared
59      -20         400
62      -17         289
78      -1          1
79      0           0
80      1           1
96      17          289
99      20          400
Mean 79 Sum 0       Sum 1380

Eve     deviation   deviation squared
68      -11         121
78      -1          1
79      0           0
79      0           0
83      4           16
83      4           16
83      4           16
Mean 79 Sum 0       Sum 170

For Bill, his standard deviation is $\sqrt{\frac{104}{6}} = \sqrt{17.33333} \approx 4.16$. For Cindy, her
standard deviation is $\sqrt{\frac{1380}{6}} = \sqrt{230} \approx 15.17$. For Eve, her standard deviation is
$\sqrt{\frac{170}{6}} = \sqrt{28.33333} \approx 5.32$. As the standard deviation is a measure of dispersion, the
larger the standard deviation, the more dispersed the data. We now have more information about
each of the students’ grades at our disposal; we know the mean, median, mode and the standard
deviation. You decide who deserves the B and who does not.
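The same summary can be produced for all five students at once. This is a small sketch using Python's `statistics` module, with the grades copied from the table of five students. The helper `mode_or_none` is an illustrative name, and it returns `None` when no grade repeats, as is the case for Allan, Cindy and Deanna.

```python
import statistics
from collections import Counter

# Seven grades per student, copied from the grade table.
grades = {
    "Allan": [74, 76, 77, 80, 81, 82, 83],
    "Bill": [73, 75, 77, 79, 83, 83, 83],
    "Cindy": [59, 62, 78, 79, 80, 96, 99],
    "Deanna": [69, 73, 78, 79, 82, 83, 89],
    "Eve": [68, 78, 79, 79, 83, 83, 83],
}


def mode_or_none(scores):
    """Most frequent grade, or None if every grade occurs only once."""
    value, count = Counter(scores).most_common(1)[0]
    return value if count > 1 else None


for name, scores in grades.items():
    print(
        name,
        "mean:", statistics.mean(scores),
        "median:", statistics.median(scores),
        "mode:", mode_or_none(scores),
        "standard deviation:", round(statistics.stdev(scores), 2),
    )
```

The standard deviation column is what separates Bill from Eve, whose mean, median and mode are identical.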
Example One
Baseball, said to be America’s favorite pastime, is also fertile ground for honing basic statistical
skills. From games won or lost to home and away records, from records against divisional foes
to streaks, from batting averages to home runs, numbers abound. For this example, we will find
the standard deviation and then use the z-score formula to determine how far each team’s
record is from the mean. We will see data in context, as it would appear in your morning
newspaper.
Who is the best and the worst in the American League on Labor Day, 2004? The standard
deviation gives us a way to explore this age-old baseball question. Updated: 9/5/2004 3:37:06 PM cnn.com
AMERICAN LEAGUE EAST
~~~~~~~~~~~~~~~~~~~~
TEAM            WON  LOST  PCT    GB      HOME   ROAD   EAST   CENT   WEST   STREAK
NY YANKEES      83   52    .615   -       45-21  38-31  36-19  15-11  22-14  LOST 2
BOSTON          80   54    .597   2 1/2   48-22  32-32  36-20  19-13  16-12  LOST 1
BALTIMORE       63   71    .470   19 1/2  29-35  34-36  28-29  15-12  15-17  WON 6
TAMPA BAY       59   75    .440   23 1/2  36-34  23-41  21-38  13-12  10-22  LOST 7
TORONTO         56   80    .412   27 1/2  34-37  22-43  21-36  13-19  14-15  LOST 2

AMERICAN LEAGUE CENTRAL
~~~~~~~~~~~~~~~~~~~~~~~
TEAM            WON  LOST  PCT    GB      HOME   ROAD   EAST   CENT   WEST   STREAK
MINNESOTA       77   58    .570   -       43-28  34-30  16-11  34-24  16-16  WON 5
CHI WHITE SOX   67   67    .500   9 1/2   38-32  29-35  16-16  29-27  14-14  WON 2
CLEVELAND       67   70    .489   11      40-30  27-40  17-15  26-31  14-16  LOST 4
DETROIT         62   72    .463   14 1/2  32-32  30-40  10-14  28-28  15-21  WON 1
KANSAS CITY     47   87    .351   29 1/2  30-37  17-50  08-19  25-32  08-24  LOST 2

AMERICAN LEAGUE WEST
~~~~~~~~~~~~~~~~~~~~
TEAM            WON  LOST  PCT    GB      HOME   ROAD   EAST   CENT   WEST   STREAK
OAKLAND         81   54    .600   -       45-19  36-35  23-16  25-15  23-15  WON 3
ANAHEIM         77   58    .570   4       38-28  39-30  24-16  25-14  21-17  WON 2
TEXAS           74   60    .552   6 1/2   42-22  32-38  22-17  22-17  20-18  WON 1
SEATTLE         51   84    .378   30      32-34  19-50  11-28  19-21  12-26  LOST 4
The three divisional winners and the second-place team with the best record make the playoffs.
But who is the best team? The worst team? How good is good and how bad is bad? Let’s
calculate the standard deviation with respect to the number of wins for each team.
First, we find the mean number of wins by adding up all the wins and dividing by 14.
$$\bar{x} = \frac{83 + 80 + 63 + 59 + 56 + 77 + 67 + 67 + 62 + 47 + 81 + 77 + 74 + 51}{14} = \frac{944}{14} \approx 67.4$$
The average or mean number of wins is 67.4 for the American League teams on Labor Day, 2004.
The table below has each team ranked by the number of wins, from most to least. We used a
spreadsheet to construct the columns representing the differences from the mean, the square of
these differences, the standard deviation and the z-scores.
Statistics table for the teams in the American League on Labor Day, 2004.

TEAM            Wins   Wins - Mean         (Wins - Mean)^2
NY YANKEES      83     83-67.4 = 15.6      15.6^2 = 243.36
OAKLAND         81     81-67.4 = 13.6      13.6^2 = 184.96
BOSTON          80     12.6                158.8
MINNESOTA       77     9.6                 92.2
ANAHEIM         77     9.6                 92.2
TEXAS           74     6.6                 43.6
CHI WHITE SOX   67     -0.4                0.2
CLEVELAND       67     -0.4                0.2
BALTIMORE       63     -4.4                19.4
DETROIT         62     -5.4                29.2
TAMPA BAY       59     -8.4                70.6
TORONTO         56     -11.4               129.96
SEATTLE         51     -16.4               268.96
KANSAS CITY     47     -20.4               416.2
SUM             944    about 0             1749.84

To calculate the standard deviation,
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{1749.84}{14 - 1}} \approx 11.6.$$
Recall, to find each z-score,
$$z = \frac{\text{data value} - \text{mean}}{\text{standard deviation}}.$$

TEAM            Wins   Wins - Mean         z-score
NY YANKEES      83     83-67.4 = 15.6      15.6/11.6 = 1.3
OAKLAND         81     81-67.4 = 13.6      13.6/11.6 = 1.2
BOSTON          80     12.6                1.1
MINNESOTA       77     9.6                 0.8
ANAHEIM         77     9.6                 0.8
TEXAS           74     6.6                 0.6
CHI WHITE SOX   67     -0.4                -0.03
CLEVELAND       67     -0.4                -0.03
BALTIMORE       63     -4.4                -0.4
DETROIT         62     -5.4                -0.5
TAMPA BAY       59     -8.4                -0.7
TORONTO         56     -11.4               -0.98
SEATTLE         51     -16.4               -1.4
KANSAS CITY     47     -20.4               -1.8
Look at the final column and recall, as you glance at each team’s z-score, that for a normal
population, 68 % of the population falls within 1 standard deviation of the mean (a z-score
between -1 and 1), 95 % falls within 2 and 99.7 % falls within 3. How good are the NY Yankees
and how bad are the Kansas City Royals? You now have a more detailed frame of reference to
answer that question.
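The spreadsheet work in this example can also be sketched in a few lines of Python. The win totals are copied from the standings; the dictionary name `wins` and the other variable names are only illustrative. Small differences from the printed table can appear because the table rounds the mean to 67.4 before subtracting.

```python
import statistics

# Wins for each American League team on Labor Day, 2004 (from the standings).
wins = {
    "NY YANKEES": 83, "OAKLAND": 81, "BOSTON": 80, "MINNESOTA": 77,
    "ANAHEIM": 77, "TEXAS": 74, "CHI WHITE SOX": 67, "CLEVELAND": 67,
    "BALTIMORE": 63, "DETROIT": 62, "TAMPA BAY": 59, "TORONTO": 56,
    "SEATTLE": 51, "KANSAS CITY": 47,
}

win_totals = list(wins.values())
mean_wins = statistics.mean(win_totals)
sd_wins = statistics.stdev(win_totals)  # n-1 method, as in the text

# z-score = (data value - mean) / standard deviation
z_scores = {team: (w - mean_wins) / sd_wins for team, w in wins.items()}

print(f"mean = {mean_wins:.1f}, standard deviation = {sd_wins:.1f}")
for team, z in z_scores.items():
    print(f"{team:15s} {z:6.2f}")
```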
Example Two
Does money buy success in baseball? Updated: 9/5/2004 3:37:06 PM cnn.com
The payrolls for the American League teams are listed below.

New York Yankees        $ 184,193,950
Boston Red Sox          $ 127,298,500
Anaheim Angels          $ 100,534,667
Seattle Mariners        $ 81,515,834
Chicago White Sox       $ 65,212,500
Oakland Athletics       $ 59,425,667
Texas Rangers           $ 55,050,417
Minnesota Twins         $ 53,585,000
Baltimore Orioles       $ 51,623,333
Toronto Blue Jays       $ 50,017,000
Kansas City Royals      $ 47,609,000
Detroit Tigers          $ 46,832,000
Cleveland Indians       $ 34,319,300
Tampa Bay Devil Rays    $ 29,556,667

What will the standard deviation tell us with respect to this payroll data? Will there be a
correlation between salaries and success? Does money buy success? Once we have
calculated the standard deviation and z-scores, we will compare these results with
the true standings taken on Sept 5th, 2004.
The sum of the 14 American League payrolls is $ 986,773,835, thus the average payroll is
$ 70,483,845.36, which we will round to $ 70,483,845.
To calculate the standard deviation, we construct the following table.

Team                    Payroll Salary    Payroll - Mean    (Payroll - Mean)^2
New York Yankees        $ 184,193,950     113,710,105       12,929,987,979,111,025
Boston Red Sox          $ 127,298,500     56,814,655        3,227,905,022,769,025
Anaheim Angels          $ 100,534,667     30,050,822        903,051,902,875,684
Seattle Mariners        $ 81,515,834      11,031,989        121,704,781,296,121
Chicago White Sox       $ 65,212,500      - 5,271,345       27,787,078,109,025
Oakland Athletics       $ 59,425,667      - 11,058,178      122,283,300,679,684
Texas Rangers           $ 55,050,417
Minnesota Twins         $ 53,585,000
Baltimore Orioles       $ 51,623,333
Toronto Blue Jays       $ 50,017,000
Kansas City Royals      $ 47,609,000
Detroit Tigers          $ 46,832,000
Cleveland Indians       $ 34,319,300
Tampa Bay Devil Rays    $ 29,556,667
We leave it as an exercise for you to complete the table above. Once done, it is quickly
verified that the standard deviation is $ 41,783,940.
So, let’s reprint the table, with the number of standard deviations from the mean (z-score)
listed for each team along with its ranking in the American League.
Team                    Payroll Salary    Standard Deviations      True Major
                                          from the Mean (z-score)  League Ranking
New York Yankees        $ 184,193,950     2.72                     1
Boston Red Sox          $ 127,298,500     1.36                     3
Anaheim Angels          $ 100,534,667     0.72                     Tied for 4
Seattle Mariners        $ 81,515,834      0.26                     13
Chicago White Sox       $ 65,212,500      -0.13                    7
Oakland Athletics       $ 59,425,667      -0.27                    2
Texas Rangers           $ 55,050,417      -0.37                    6
Minnesota Twins         $ 53,585,000      -0.40                    Tied for 4
Baltimore Orioles       $ 51,623,333      -0.45                    9
Toronto Blue Jays       $ 50,017,000      -0.49                    12
Kansas City Royals      $ 47,609,000      -0.55                    14
Detroit Tigers          $ 46,832,000      -0.57                    10
Cleveland Indians       $ 34,319,300      -0.87                    8
Tampa Bay Devil Rays    $ 29,556,667      -0.98                    11
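One way to put a number on the question "does money buy success?" is a correlation coefficient between payroll and wins. The text does not compute one, so the sketch below is an added illustration rather than part of the original example. It pairs each payroll with the team's win total from Example One, computes both sets of z-scores, and averages their products (dividing by n-1), which is the Pearson correlation. All variable names are illustrative.

```python
import statistics

# (payroll in dollars, wins on Sept 5, 2004), copied from the two examples.
teams = {
    "New York Yankees": (184_193_950, 83),
    "Boston Red Sox": (127_298_500, 80),
    "Anaheim Angels": (100_534_667, 77),
    "Seattle Mariners": (81_515_834, 51),
    "Chicago White Sox": (65_212_500, 67),
    "Oakland Athletics": (59_425_667, 81),
    "Texas Rangers": (55_050_417, 74),
    "Minnesota Twins": (53_585_000, 77),
    "Baltimore Orioles": (51_623_333, 63),
    "Toronto Blue Jays": (50_017_000, 56),
    "Kansas City Royals": (47_609_000, 47),
    "Detroit Tigers": (46_832_000, 62),
    "Cleveland Indians": (34_319_300, 67),
    "Tampa Bay Devil Rays": (29_556_667, 59),
}
payrolls = [payroll for payroll, _ in teams.values()]
wins = [w for _, w in teams.values()]


def z_scores(values):
    """z-score of every value, using the n-1 standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


payroll_z = z_scores(payrolls)
wins_z = z_scores(wins)

# Pearson correlation: sum of paired z-score products divided by n-1.
r = sum(p * w for p, w in zip(payroll_z, wins_z)) / (len(teams) - 1)

print(f"payroll standard deviation: {statistics.stdev(payrolls):,.0f}")
print(f"correlation between payroll and wins: {r:.2f}")
```

A value of r near 1 would mean payroll and wins rise together almost in lockstep; a value near 0 would mean payroll tells us little about wins.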
Example Three
Homeownership in the USA. Below are the percentages of homeownership (including mobile
homes) for each state and the District of Columbia in 2002. Source: US Bureau of the Census.
Alabama                73.5    Kentucky          73.5    North Dakota      69.5
Alaska                 67.3    Louisiana         67.1    Ohio              72.0
Arizona                65.9    Maine             73.9    Oklahoma          69.4
Arkansas               70.2    Maryland          72.0    Oregon            66.2
California             58.0    Massachusetts     62.7    Pennsylvania      74.0
Colorado               69.1    Michigan          76.0    Rhode Island      59.6
Connecticut            71.6    Minnesota         77.3    South Carolina    77.3
Delaware               75.6    Mississippi       74.8    South Dakota      71.5
District of Columbia   44.1    Missouri          74.6    Tennessee         70.1
Florida                68.7    Montana           69.3    Texas             63.8
Georgia                71.7    Nebraska          68.4    Utah              72.7
Hawaii                 57.4    Nevada            65.5    Vermont           70.2
Idaho                  73.0    New Hampshire     69.5    Virginia          74.3
Illinois               70.2    New Jersey        67.2    Washington        67.0
Indiana                75.0    New Mexico        70.3    West Virginia     77.0
Iowa                   73.9    New York          55.0    Wisconsin         72.0
Kansas                 70.2    North Carolina    70.0    Wyoming           72.8
Like before, we have a rather large sample. Let’s begin with the statistical basics. We will find
the mean, median, mode and range after first ranking states in ascending order.
State            Percent  Rank      State            Percent  Rank
DC               44.1     -         Vermont          70.2     25
New York         55.0     50        New Mexico       70.3     24
Hawaii           57.4     49        South Dakota     71.5     23
California       58.0     48        Connecticut      71.6     22
Rhode Island     59.6     47        Georgia          71.7     21
Massachusetts    62.7     46        Maryland         72.0     18
Texas            63.8     45        Ohio             72.0     18
Nevada           65.5     44        Wisconsin        72.0     18
Arizona          65.9     43        Utah             72.7     17
Oregon           66.2     42        Wyoming          72.8     16
Washington       67.0     41        Idaho            73.0     15
Louisiana        67.1     40        Alabama          73.5     13
New Jersey       67.2     39        Kentucky         73.5     13
Alaska           67.3     38        Iowa             73.9     11
Nebraska         68.4     37        Maine            73.9     11
Florida          68.7     36        Pennsylvania     74.0     10
Colorado         69.1     35        Virginia         74.3     9
Montana          69.3     34        Missouri         74.6     8
Oklahoma         69.4     33        Mississippi      74.8     7
New Hampshire    69.5     31        Indiana          75.0     6
North Dakota     69.5     31        Delaware         75.6     5
North Carolina   70.0     30        Michigan         76.0     4
Tennessee        70.1     29        West Virginia    77.0     3
Arkansas         70.2     25        Minnesota        77.3     1
Illinois         70.2     25        South Carolina   77.3     1
Kansas           70.2     25

Mean     69.4
Mode     70.2
Median   70.2
Let’s begin to interpret the data. First, we notice the mode and median are the same, and the
mean (average) is below the two. On a positive note, we can say more than half of the states
have a percentage of homeownership above the national average. The next question, “is the
mean significantly below the other two?” This may be partially answered by observing the
range. The range is 33.2 (77.3 – 44.1), which appears to be rather large. So, by comparison
69.4 versus 70.2 appears to be no big deal. Let’s add to our repertoire of examining the
dispersion of the data. For large sets of data, we do not want to obsess over each individual
data point. We want to see if the data follows noticeable trends and then interpret any
outliers that may appear.
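The summary figures quoted above (mean, median, mode and range) can be checked with a short sketch. The list holds the 51 percentages in ascending order, copied from the ranked table; the variable names are only illustrative.

```python
import statistics

# Percent of homeownership for the 50 states and DC, in ascending order.
percents = [
    44.1, 55.0, 57.4, 58.0, 59.6, 62.7, 63.8, 65.5, 65.9, 66.2, 67.0,
    67.1, 67.2, 67.3, 68.4, 68.7, 69.1, 69.3, 69.4, 69.5, 69.5, 70.0,
    70.1, 70.2, 70.2, 70.2, 70.2, 70.3, 71.5, 71.6, 71.7, 72.0, 72.0,
    72.0, 72.7, 72.8, 73.0, 73.5, 73.5, 73.9, 73.9, 74.0, 74.3, 74.6,
    74.8, 75.0, 75.6, 76.0, 77.0, 77.3, 77.3,
]

mean_percent = statistics.mean(percents)
median_percent = statistics.median(percents)
mode_percent = statistics.mode(percents)
data_range = max(percents) - min(percents)

print(f"mean   = {mean_percent:.1f}")
print(f"median = {median_percent}")
print(f"mode   = {mode_percent}")
print(f"range  = {data_range:.1f}")
```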
To do this, we observe a histogram, made from the data in ascending order. Notice we have
so much data our page is not wide enough to label each state. We labeled only a few for
reference.
[Figure: Percent of Home Owners by state. Bar chart of the states in ascending order; vertical axis: percent, from 0.0 to 100.0. Only a few states are labeled.]
The District of Columbia is an outlier; its percent of 44.1 is far from the mean of 69.4 %. But,
in a manner of speaking, NY, HI, CA and RI, with respective percents of home ownership of
55, 57.4, 58 and 59.6, all seem far below the mean of 69.4 as well. This is what we mean by
dispersion. We need to quantify how well grouped the data is because we need to know
where to draw the line between those states that are significantly below the mean and those
significantly above the mean. This can be done with what we have identified as the standard
deviation, the measure of how the data deviates from its behavior around the middle of the data.
Central tendency. In a perfect world, the mean is in the center of the data, thus it is the
median too. And the mode, if we get greedy. Recall, the standard deviation, loosely
speaking, measures how the data deviates from the mean, and remember, the mean is in the
center of the data.
The table shows each state’s raw z-score followed by the state’s percentage of
homeownership. A table so constructed allows the reader to find a state and quickly identify
how the state compares to the national average. Problems 20 – 22 in the Exercise Set
require you to verify the table and then draw certain conclusions from it.
State            z-score  Percent      State            z-score  Percent
DC               -4.21    44.1         Vermont          0        70.2
New York         -2.45    55.0         New Mexico       0.02     70.3
Hawaii           -2.06    57.4         South Dakota     0.21     71.5
California       -1.97    58.0         Connecticut      0.23     71.6
Rhode Island     -1.71    59.6         Georgia          0.24     71.7
Massachusetts    -1.21    62.7         Maryland         0.29     72.0
Texas            -1.03    63.8         Ohio             0.29     72.0
Nevada           -0.76    65.5         Wisconsin        0.29     72.0
Arizona          -0.69    65.9         Utah             0.4      72.7
Oregon           -0.65    66.2         Wyoming          0.42     72.8
Washington       -0.52    67.0         Idaho            0.45     73.0
Louisiana        -0.5     67.1         Alabama          0.53     73.5
New Jersey       -0.48    67.2         Kentucky         0.53     73.5
Alaska           -0.47    67.3         Iowa             0.6      73.9
Nebraska         -0.29    68.4         Maine            0.6      73.9
Florida          -0.24    68.7         Pennsylvania     0.61     74.0
Colorado         -0.18    69.1         Virginia         0.66     74.3
Montana          -0.15    69.3         Missouri         0.71     74.6
Oklahoma         -0.13    69.4         Mississippi      0.74     74.8
New Hampshire    -0.11    69.5         Indiana          0.77     75.0
North Dakota     -0.11    69.5         Delaware         0.87     75.6
North Carolina   -0.03    70.0         Michigan         0.94     76.0
Tennessee        -0.02    70.1         West Virginia    1.1      77.0
Arkansas         0        70.2         Minnesota        1.15     77.3
Illinois         0        70.2         South Carolina   1.15     77.3
Kansas           0        70.2
Why use standard deviation
The standard deviation can also help you evaluate the worth of
the so-called "studies" that seem to be released to the press every day. Standard deviation is
commonly used in business as a measure to describe the risk of a security or portfolio of
securities. If you read the history of investment performance, chances are that standard deviation
will be used to gauge risk. The same is true for academic studies to determine the validity of
exam results, or the effectiveness of educational tools. Standard deviation is also one of the most
commonly used statistical tools in the sciences and social sciences. It provides a precise
measure of the amount of variation in any group of numbers, be it the returns on a mutual fund,
the yearly rainfall in Mexico City, or the hits per game for a major league baseball player.
Lastly, look at the data below, taken directly from the morning newspaper. Does it take on a
whole new look? Could we analyze, say, whether the offense or the defense is a better predictor
of success in professional football?
The 2003 Final Standings of the NFL teams. W = wins, L = losses, T = ties, % = percentage of
games won, PF = Points For, that is, points the team scored, PA = Points Against, that is,
points the team allowed.

Team                     W    L    T   %      PF    PA
AFC East
New England Patriots     14   2    0   .875   348   238
Miami Dolphins           10   6    0   .625   311   261
Buffalo Bills            6    10   0   .375   243   279
New York Jets            6    10   0   .375   283   299
NFC East
Philadelphia Eagles      12   4    0   .750   374   287
Dallas Cowboys           10   6    0   .625   289   260
Washington Redskins      5    11   0   .313   287   372
New York Giants          4    12   0   .250   243   387
AFC North
Baltimore Ravens         10   6    0   .625   391   281
Cincinnati Bengals       8    8    0   .500   346   384
Pittsburgh Steelers      6    10   0   .375   300   327
Cleveland Browns         5    11   0   .313   254   322
NFC North
Green Bay Packers        10   6    0   .625   442   307
Minnesota Vikings        9    7    0   .563   416   353
Chicago Bears            7    9    0   .438   283   346
Detroit Lions            5    11   0   .313   270   379
AFC South
Indianapolis Colts       12   4    0   .750   447   336
Tennessee Titans         12   4    0   .750   435   324
Jacksonville Jaguars     5    11   0   .313   276   331
Houston Texans           5    11   0   .313   255   380
NFC South
Carolina Panthers        11   5    0   .688   325   304
New Orleans Saints       8    8    0   .500   340   326
Tampa Bay Buccaneers     7    9    0   .438   301   264
Atlanta Falcons          5    11   0   .313   299   422
AFC West
Kansas City Chiefs       13   3    0   .813   438   332
Denver Broncos           10   6    0   .625   381   301
Oakland Raiders          4    12   0   .250   270   379
San Diego Chargers       4    12   0   .250   313   441
NFC West
St. Louis Rams           12   4    0   .750   447   328
Seattle Seahawks         10   6    0   .625   404   327
San Francisco 49ers      7    9    0   .438   384   337
Arizona Cardinals        4    12   0   .250   225   452
Exercise Set
For problems 1 to 8, use the 2003 Final Standings of the NFL teams, as previously
indicated.
1. For the NFC teams, find the standard deviation for the number of wins and then
find the z-score for each team.
2. For the AFC teams, find the standard deviation for the number of wins and then
find the z-score for each team.
3. For all teams, find the standard deviation for the number of wins and then
find the z-score for each team.
4. For the NFC teams, find the standard deviation for PF and then find the z-score for
each team.
5. For the AFC teams, find the standard deviation for PF and then find the z-score for
each team.
6. For the NFC teams, find the standard deviation for PA and then find the z-score
for each team.
7. For the AFC teams, find the standard deviation for PA and then find the z-score
for each team.
8. Look carefully at questions 1 to 7. Which is a better predictor of a team’s
success, the offense as indicated by the points the team scored (PF) or the defense,
as indicated by the points that team allowed (PA)? Why?
9. According to the 2005 World Almanac for Kids, below are the 25 most populous countries
in the world in mid-2004, in no particular order, with their populations. Find the mean,
median and the standard deviation.

China            1,294,629,555      Germany              82,424,609
India            1,065,070,607      Egypt                76,117,421
United States    293,027,571        Iran                 69,018,294
Indonesia        238,452,952        Turkey               68,893,918
Brazil           184,101,109        Ethiopia             67,851,281
Pakistan         153,705,278        Thailand             64,865,523
Russia           144,112,353        France               60,424,213
Bangladesh       141,340,476        Great Britain        60,270,708
Nigeria          137,253,133        Dem. Rep. of Congo   58,317,930
Japan            127,333,002        Italy                58,057,477
Mexico           104,959,594        South Korea          48,598,175
Philippines      86,241,697         Ukraine              47,732,079
Vietnam          82,689,518
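Problems 1 to 9 rest on the same few computations: a mean, a standard deviation and, for problems 1 to 7, a z-score for each team. The sketch below illustrates the pattern on the four NFC West teams only, so it is not the answer to any one problem. It assumes the population form of the standard deviation (dividing by n); if the text's formula divides by n − 1, `statistics.stdev` would take the place of `statistics.pstdev`. The function name `z_scores` is made up for this illustration.

```python
from statistics import mean, pstdev

# Wins for the four NFC West teams, taken from the 2003 standings in the text.
wins = {
    "St. Louis Rams": 12,
    "Seattle Seahawks": 10,
    "San Francisco 49ers": 7,
    "Arizona Cardinals": 4,
}


def z_scores(values_by_team):
    """Return (mean, standard deviation, {team: z-score})."""
    values = list(values_by_team.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population form, divides by n
    scores = {team: (v - mu) / sigma for team, v in values_by_team.items()}
    return mu, sigma, scores


mu, sigma, z = z_scores(wins)
print(f"mean = {mu:.2f}, standard deviation = {sigma:.3f}")
for team, score in z.items():
    print(f"{team:20s} z = {score:+.2f}")
```

Swapping the `wins` dictionary for the PF or PA column, or for all NFC or AFC teams, reuses the same function unchanged.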
10. According to the 2005 World Almanac for Kids, below are the ten largest cities in the world in 2004, each followed by its population, in no particular order. Find the mean and the standard deviation.
Tokyo, Japan                   34,450,000
Kolkata (Calcutta), India      13,058,000
Mexico City, Mexico            18,066,000
Shanghai, China                12,887,000
New York City, U.S.            17,846,000
Buenos Aires, Argentina        12,583,000
São Paulo, Brazil              17,099,000
Delhi, India                   12,441,000
Mumbai (Bombay), India         16,086,000
Los Angeles, U.S.              11,814,000
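For a list this short it can help to see the standard deviation built up one step at a time instead of calling a library routine. The sketch below does that for the ten city populations of problem 10. It assumes the population form (dividing by n); the last line cross-checks the result against Python's `statistics.pstdev`.

```python
from math import sqrt
from statistics import pstdev

# City populations from problem 10.
populations = {
    "Tokyo": 34_450_000,
    "Kolkata": 13_058_000,
    "Mexico City": 18_066_000,
    "Shanghai": 12_887_000,
    "New York City": 17_846_000,
    "Buenos Aires": 12_583_000,
    "Sao Paulo": 17_099_000,
    "Delhi": 12_441_000,
    "Mumbai": 16_086_000,
    "Los Angeles": 11_814_000,
}

values = list(populations.values())
n = len(values)

# Step 1: the mean is the sum divided by the number of cities.
mean = sum(values) / n

# Step 2: square each city's distance from the mean.
squared_deviations = [(x - mean) ** 2 for x in values]

# Step 3: average the squared distances (assumption: divide by n, not n - 1).
variance = sum(squared_deviations) / n

# Step 4: the standard deviation is the square root of that average.
std_dev = sqrt(variance)

print(f"mean = {mean:,.0f}")
print(f"standard deviation = {std_dev:,.0f}")
print(f"library check      = {pstdev(values):,.0f}")
```

Tokyo sits far above the other nine cities, so it is worth noting how much of the standard deviation comes from that single value.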
11. According to the 2005 World Almanac for Kids, below are the American League pennant winners since 1970, with the year they won preceding the name and their won-lost record following it. Remove the strike-shortened season of 1981 and the year in which there was no World Series, and find the mean, median, mode and the standard deviation for wins.
1970 Baltimore 108 54, 1971 Baltimore 101 57, 1972 Oakland 93 62, 1973 Oakland 94 68, 1974 Oakland 90 72, 1975 Boston 95 65, 1976 New York 97 62, 1977 New York 100 62, 1978 New York 100 63, 1979 Baltimore 102 57, 1980 Kansas City 97 65, 1981 New York 59 48, 1982 Milwaukee 95 67, 1983 Baltimore 98 64, 1984 Detroit 104 58, 1985 Kansas City 91 71, 1986 Boston 95 66, 1987 Minnesota 85 77, 1988 Oakland 104 58, 1989 Oakland 99 63, 1990 Oakland 103 59, 1991 Minnesota 95 67, 1992 Toronto 96 66, 1993 Toronto 95 67, 1994 none, 1995 Cleveland 100 44, 1996 New York 92 70, 1997 Cleveland 86 75, 1998 New York 114 48, 1999 New York 98 64, 2000 New York 87 74, 2001 New York 95 65, 2002 Anaheim 99 63, 2003 New York 101 61.
12. Presidential Inaugurations. On April 30, 1789, George Washington was inaugurated as president on the balcony of Federal Hall in New York City. On March 4, 1857, James Buchanan’s Inauguration was the first to be photographed. On March 4, 1921, Warren G. Harding was the first president to ride to his Inauguration in an automobile and the first to use loudspeakers at an Inauguration. On January 20, 1997, William J. Clinton’s Inauguration was the first to be broadcast live on the Internet. And on January 20, 2005, George W. Bush’s Inauguration, the first since the September 11th terrorist attacks, had the tightest security of any Inauguration. More than a hundred square blocks were closed off.
Look up the ages of the presidents of the United States when they took office, that is, on the day of their presidential Inauguration. Find the mean, median, mode and standard deviation of the presidents’ ages. If a president served more than one four-year term, count that president just once.
Then repeat the process separately for the presidents who were inaugurated before the Civil War and for those who were inaugurated after it. What do you notice when you compare the pre- and post-Civil War presidents’ ages?
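Problems 11 and 12 both ask for a mean, median, mode and standard deviation after some entries have been set aside. A minimal sketch of that pattern, using the pennant-winner wins from problem 11, might look like the following. It assumes the population standard deviation (`statistics.pstdev`), and the names `wins_by_year` and `EXCLUDED` are chosen only for this illustration. The same filter-then-summarize steps would apply to the presidents' ages once they have been looked up.

```python
from statistics import mean, median, multimode, pstdev

# Wins of the American League pennant winners, keyed by year (problem 11).
# 1994 has no entry because no pennant was awarded that year.
wins_by_year = {
    1970: 108, 1971: 101, 1972: 93, 1973: 94, 1974: 90, 1975: 95,
    1976: 97, 1977: 100, 1978: 100, 1979: 102, 1980: 97, 1981: 59,
    1982: 95, 1983: 98, 1984: 104, 1985: 91, 1986: 95, 1987: 85,
    1988: 104, 1989: 99, 1990: 103, 1991: 95, 1992: 96, 1993: 95,
    1995: 100, 1996: 92, 1997: 86, 1998: 114, 1999: 98, 2000: 87,
    2001: 95, 2002: 99, 2003: 101,
}

# The problem asks us to drop the strike-shortened 1981 season.
EXCLUDED = {1981}

kept = [w for year, w in wins_by_year.items() if year not in EXCLUDED]

print("seasons kept:", len(kept))
print("mean:", round(mean(kept), 2))
print("median:", median(kept))
print("mode(s):", multimode(kept))  # a list, since a data set can have ties
print("standard deviation:", round(pstdev(kept), 2))  # assumption: divide by n
```

Running it once with 1981 left in shows how strongly a single short season pulls the mean and the standard deviation.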
For questions 13 to 19: M&M’s project
Some years come and go, but other years live in the hearts and minds of men and women for all eternity. Such was the year 1941, when Pearl Harbor was attacked, Joe DiMaggio hit safely in 56 straight games and M&M’s were first introduced to the public. Daughters everywhere love M&M’s; in particular, some love the blue pieces the most. The original M&M’s of 1941 had violet candies and no blue ones. Then, in 1949, tan replaced violet, and in 1995, tan was replaced by blue. M&M’s are made round by taking milk chocolate centers and tumbling them to get their smooth rounded shape. We all know M&M stands for Mars and Murrie and that the different color M&M’s taste the same. According to the M&M’s website, http://www.mmmars.com/cai/mms/faq.html:
• M&M’s Milk Chocolate candies are 30% brown, 20% each of yellow and red, and 10% each of orange, green and blue.
• M&M’s Peanut Chocolate candies are 20% each of brown, yellow, red and blue, and 10% each of green and orange.
• M&M’s Peanut Butter and Almond Chocolate candies are 20% each of brown, red, yellow, green and blue.
• M&M’s Crispy Chocolate candies are 16.6% each of brown, red, yellow, green, orange and blue.
Let’s perform our own test and see if our observed percent of each color matches the website’s prediction.
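One possible way to organize the test is sketched below for the Milk Chocolate candies. The bag tallies are HYPOTHETICAL placeholders, not real counts, and the "gap" measure (how many standard deviations the website's figure lies from the average of the bags) is only one reasonable reading of what it means to run standard deviations on this experiment. The population form of the standard deviation is assumed.

```python
from statistics import mean, pstdev

# Percentages the website lists for Milk Chocolate (taken from the text above).
CLAIMED = {"brown": 30, "yellow": 20, "red": 20,
           "orange": 10, "green": 10, "blue": 10}

# HYPOTHETICAL tallies, one dict per bag. Replace with your class's real counts.
bags = [
    {"brown": 150, "yellow": 100, "red": 95, "orange": 55, "green": 50, "blue": 50},
    {"brown": 140, "yellow": 110, "red": 100, "orange": 50, "green": 45, "blue": 55},
    {"brown": 160, "yellow": 90, "red": 105, "orange": 45, "green": 55, "blue": 45},
]


def percents(tally):
    """Convert a {color: count} tally into {color: percent of the total}."""
    total = sum(tally.values())
    return {color: 100 * count / total for color, count in tally.items()}


# Percent of each color within each bag.
per_bag = [percents(bag) for bag in bags]

# Percent of each color in the whole classroom sample (all bags combined).
pooled = percents({color: sum(bag[color] for bag in bags) for color in CLAIMED})

for color, claim in CLAIMED.items():
    values = [p[color] for p in per_bag]
    avg, sd = mean(values), pstdev(values)
    # How many standard deviations the website's figure lies from the bag average.
    gap = (claim - avg) / sd if sd > 0 else float("nan")
    print(f"{color:7s} claimed {claim:5.1f}%  pooled {pooled[color]:5.1f}%  "
          f"bag mean {avg:5.1f}%  sd {sd:4.2f}  gap {gap:+.2f} sd")
```

A color whose claimed percentage sits several standard deviations away from the bag average is a candidate for doubting the website's figure. With only a few bags, any such conclusion should be treated cautiously.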
13. Buy a one-pound bag of M&M’s Milk Chocolate candies for each student in your class. As a class, for each bag, tally up the number of each color M&M. Find the percent of each color for each bag.
14. Then tally up the colors for all the bags, and find the percent of each color for the classroom sample, which consists of all the bags.
15. Using each bag as an individual trial, find the mean, median and mode for each color. Then find the percent of colors based on these findings.
16. How do the results in problems 14 and 15 compare? How do the results compare to the M&M’s website statistics?
17. Repeat the experiment for Peanut Chocolate, Peanut Butter and Almond Chocolate, and Crispy Chocolate.
18. How can you use standard deviations in this experiment to help you analyze your findings, so that you may decide on the reliability of the data on the M&M’s website?
19. Run those standard deviations to determine the reliability of the data on the M&M’s website.
For problems 20 to 22, refer to Problem 3, Home Ownership, from the text.
20. Compute the standard deviation for the data from Problem 3 and then verify the table presented.
21. Determine which states are the friendliest to home ownership and which states are the least friendly.
22. Is there a cause and effect relationship that you can argue to explain why these states are at either end of this analysis?
23. Barry Bonds or Babe Ruth. Who was the greatest baseball player of all time? To argue your point, quote statistics. Research their batting averages and compare them to the batting averages of their peers at the time. How many standard deviations from the mean were their batting averages? Do the same for home runs, RBIs and on-base percentage. Factor in that Barry Bonds played in night games and that Babe Ruth won 20 games as a pitcher. Best of luck…