Chapter 6: The Standard Deviation
as a Ruler and the Normal Model
Women’s Heptathlon
The women’s heptathlon in the Olympics
consists of seven track and field events: the 200-m and 800-m runs, 100-m high hurdles, shot
put, javelin, high jump, and long jump.
Somehow, the performances in all seven events
have to be combined into one score. How can
the performances in such different events be
compared? They don’t even have the same
units; the races are recorded in seconds and the
throwing and jumping events in meters.
More Heptathlon
In the 2000 Olympics, the best 800-m time, run by
Gertrud Bacher of Italy, was 8 seconds faster than
the mean. The winning long jump by the Russian
Yelena Prokhorova was 60 centimeters longer than
the mean. Which performance deserves more
points?
The trick to comparing very different-looking values
is to use standard deviations. The standard
deviation tells us how the collection of values
varies, so it’s a natural ruler for comparing an
individual value to the group.
More Heptathlon
How many standard deviations better than the mean is each
woman’s result?
Bacher’s winning 800-m time of 129 seconds was 8 seconds
faster than the mean of 137 seconds. The standard deviation
of qualifying times was 5 seconds, so her time was (129 –
137)/5 = –8/5 = –1.6, or 1.6 standard deviations better than
the mean.
Prokhorova’s winning long jump was 60 cm longer than the
average 6-m jump. The standard deviation was 30 cm, so her jump was (60/30) = 2
standard deviations better than the mean.
Prokhorova’s performance was better because it was a greater
improvement over the mean long jump than Bacher’s
improvement over the mean 800-m run.
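The comparison above can be sketched in a few lines of Python. This is only an illustration: the function name `z_score` is made up here, the 600 cm and 660 cm figures are read from "60 cm longer than the average 6-m jump", and the 30 cm standard deviation is taken from the (60/30) calculation.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / sd

# Bacher's 800-m run, in seconds (a lower time is better,
# so a negative z-score is the good direction here).
bacher_z = z_score(129, 137, 5)

# Prokhorova's long jump, in centimeters (a longer jump is better).
prokhorova_z = z_score(660, 600, 30)

print(bacher_z)      # -1.6
print(prokhorova_z)  # 2.0
```

Because z-scores have no units, the two results can be compared directly even though one event is timed in seconds and the other is measured in centimeters.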
Standardizing with z-scores
$z = \frac{y - \bar{y}}{s}$
where y is the observation, $\bar{y}$ is the mean, and s is the standard deviation.
We call the resulting values standardized values, and denote
them with the letter z. Usually, we just call them z-scores.
Z-Scores
• No units – measures the distance of each data
value from the mean in standard deviations
• A z-score of 2 tells us that a data value is 2
standard deviations above the mean.
• A z-score of -1.6 tells us that a data value is 1.6
standard deviations below the mean.
More about Z-Scores
When standardizing with z-scores, we do two
things:
1. Shift the data by subtracting the mean
2. Rescale the values by dividing by the
standard deviation
Some Questions…
• How does shifting or rescaling the data work?
• What happens to the grade distribution if
everyone gets a five-point bonus?
• If we switch from feet to meters, what
happens to the distribution of heights of
students in our class?
Shifting Data
Since the 1960s, the Centers for Disease Control’s
National Center for Health Statistics has been
collecting health and nutritional information on
people of all ages and backgrounds. A recent
survey, the National Health and Nutrition
Examination Survey (NHANES) 2001-2002,
measured a wide variety of variables, including
body measurements, cardiovascular fitness, blood
chemistry, and demographic information on more
than 11,000 individuals. Included in this group
were 80 men between 19 and 24 years old of
average height (between 5’8” and 5’10” tall).
[Figure: two histograms of # of Men. First panel: NHANES, Weight (kg). Second panel: NHANES Rescaled, Kg Above Recommended Weight.]
What Do We Notice?
• When adding/subtracting a constant to each
value, all measures of position (center,
percentiles, min, max) will increase/decrease
by the same constant
• Measures of spread (range, IQR, and standard
deviation) do not change
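A small numerical check of these two observations, as a sketch only. The weights below are invented for illustration (they are not NHANES values), and the 74 kg shift is an assumed recommended weight.

```python
import statistics

# Hypothetical weights in kg (illustrative only, not NHANES data).
weights_kg = [62.0, 68.5, 71.0, 74.0, 80.5, 90.0]
shift = 74.0  # assumed recommended weight, subtracted from every value

shifted = [w - shift for w in weights_kg]

# Measures of position move by the constant...
print(statistics.mean(weights_kg), statistics.mean(shifted))
print(statistics.median(weights_kg), statistics.median(shifted))

# ...while measures of spread stay the same.
print(statistics.stdev(weights_kg), statistics.stdev(shifted))
print(max(weights_kg) - min(weights_kg), max(shifted) - min(shifted))
```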
What about Pounds?
[Figure: histogram of # of Men by Weight (pounds)]
What Do We Notice?
• When we multiply/divide all the values by any
constant, all measures of position (such as
mean, median, and percentiles) are
multiplied/divided by that same constant.
• Measures of spread (range, IQR, and standard
deviation) are also multiplied/divided by that
same constant.
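The same kind of check works for rescaling. In this sketch the weights are again invented, and the kilograms-to-pounds factor is an approximate conversion constant.

```python
import statistics

KG_TO_LB = 2.20462  # approximate conversion factor

# Hypothetical weights in kg (illustrative only).
weights_kg = [62.0, 68.5, 71.0, 74.0, 80.5, 90.0]
weights_lb = [w * KG_TO_LB for w in weights_kg]

# Both the center and the spread are multiplied by the same constant,
# so each ratio should come out equal to KG_TO_LB.
mean_ratio = statistics.mean(weights_lb) / statistics.mean(weights_kg)
sd_ratio = statistics.stdev(weights_lb) / statistics.stdev(weights_kg)
print(mean_ratio, sd_ratio)
```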
Just Checking
1. Your statistics teacher has announced that the lower
of your two tests will be dropped. You got a 90 on
test 1 and an 80 on test 2. You’re all set to drop the
80 until she announces that she grades “on a curve.”
She standardized the scores in order to decide which
is the lower one. If the mean on the first test is 88
with a standard deviation of 4 and the mean on the
second is a 75 with a standard deviation of 5,
a) Which one will be dropped?
b) Does this seem “fair?”
Just Checking
2. In 1995 the Educational Testing Service (ETS) adjusted the scores
of SAT tests. Before ETS recentered the SAT Verbal test, the
mean of all test scores was 450.
a) How would adding 50 points to each score affect the mean?
b) The standard deviation was 100 points. What would the
standard deviation be after adding 50 points?
c) Suppose we drew boxplots of test takers’ scores a year
before and a year after the re-centering. How would
the boxplots of the two years differ?
Just Checking
3. A company manufactures wheels for roller
blades. The diameter of the wheels has a mean
of 3 inches and a standard deviation of 0.1
inches. Because so many of its customers use
the metric system, the company decided to
report their production statistics in millimeters
(1 inch = 25.4mm). They report that the
standard deviation is now 2.54 mm. A corporate
executive is worried about this increase in
variation. Should they be concerned? Explain.
Back to Z-Scores
• All we do when standardizing data into z-scores is shift
the data by the mean and rescale by the
standard deviation. It does not change the
shape of the distribution of a variable.
• It changes the center by making the mean 0
• It changes the spread by making the standard
deviation 1
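A quick way to convince yourself of the last two bullets is to standardize a small data set and recompute its summaries. The scores and the helper name `standardize` below are hypothetical.

```python
import statistics

def standardize(values):
    """Shift by the mean, then rescale by the standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

scores = [55, 70, 72, 80, 85, 93]  # hypothetical test scores
z = standardize(scores)

# The mean of the z-scores is 0 and their standard deviation is 1,
# up to floating-point rounding.
print(round(statistics.mean(z), 10), round(statistics.stdev(z), 10))
```

The ordering of the values and the relative gaps between them are unchanged, which is why the shape of the distribution stays the same.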
Think, Show, Tell
Many colleges and universities require applicants to
submit scores on standardized tests such as the SAT
Writing, Math, and Critical Reading tests. The college
your little sister wants to apply to says that while there is
no minimum score required, the middle 50% of their
students have combined SAT scores between 1530 and
1850. You’d feel confident if you knew her score was in
the top 25%, but unfortunately, she took the ACT test.
How high does her ACT need to be to make it into the top
quarter of equivalent SAT scores?
For college-bound seniors, the average combined SAT
score is about 1500 and the standard deviation is about
250 points. For the same group, the ACT average is 20.8
with a standard deviation of 4.8.
Think
I want to know what ACT score corresponds to
the upper quartile SAT score. I know the mean
and standard deviation for both the SAT and ACT
scores based on all test takers, but I have no
individual data values.
Quantitative Variable Condition – Scores for
both tests are quantitative, but have no
meaningful units other than points.
Show
The middle 50% of SAT scores at this college fall between 1530
and 1850 points. To be in the top quarter, my sister would
have to have a score of at least 1850. That’s a z-score of
$z = \frac{y - \bar{y}}{s} = \frac{1850 - 1500}{250} = 1.40$
So an SAT score of 1850 is 1.40 standard deviations above the
mean of all test takers.
For the ACT, 1.40 standard deviations above the mean is
$y = \bar{y} + sz = 20.8 + 4.8(1.40) = 27.52$
Note: the formula $z = \frac{y - \bar{y}}{s}$ can be rearranged to solve for a different variable.
Tell
To be in the top quarter of applicants in terms of
combined SAT score, she’d need to have an ACT
score of at least 27.52.
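The Show step is a standardize-then-unstandardize round trip, which might be sketched as follows. The helper names `z_score` and `from_z` are made up for this illustration.

```python
def z_score(y, mean, sd):
    """Standardize: how many standard deviations y is from the mean."""
    return (y - mean) / sd

def from_z(z, mean, sd):
    """Unstandardize: the value that sits z standard deviations from the mean."""
    return mean + z * sd

sat_z = z_score(1850, 1500, 250)      # SAT upper quartile at this college
act_equiv = from_z(sat_z, 20.8, 4.8)  # same z-score on the ACT scale

print(round(sat_z, 2), round(act_equiv, 2))  # 1.4 27.52
```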
When is a Z-Score BIG?
• How far from 0 does a z-score have to be to be
“interesting?”
• We need to reference a model, but carefully
– No model will ever be “perfect” for the data, but one is
usually “good enough”
• While there is no universal standard for z-scores,
there is a model that shows up frequently in
statistics.
Normal Models
• “bell-shaped curves”
• Appropriate for distributions whose shapes are
unimodal and roughly symmetric
• For the normal model, the mean is μ and the standard
deviation is σ
• We write N(μ, σ) to represent a normal model
• Why the Greek? These numbers are parameters and
are part of the model. They are NOT from numerical
summaries (“actual data”)
Z-Scores
Remember this formula? $z = \frac{y - \bar{y}}{s}$
What if we standardize values for the normal
model? Replace $\bar{y}$ with μ and s with σ:
$z = \frac{y - \mu}{\sigma}$
Standard Normal Model
• Usually it’s easier to standardize data first (using
its mean and standard deviation). Then we need
only the model N(0, 1)
• The normal model with mean 0 and standard
deviation 1 is called the standard Normal model
(or the standard Normal distribution)
Normality Assumption
• In order to use the Normal model, we must
assume the distribution is “Normal”
• The Nearly Normal Condition satisfies this
assumption. If the shape of the data’s
distribution is unimodal and symmetric, the
condition is met.
• ALWAYS check this condition first!!
68-95-99.7 Rule
In a Normal model, about 68% of all values fall
within 1 standard deviation of the mean, 95% of
values fall within 2 standard deviations of the
mean, and 99.7% of values fall within 3 standard
deviations of the mean.
Sometimes called
the “Empirical
Rule” in books
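The three percentages in the rule can be checked against the standard Normal model. This sketch uses `NormalDist` from Python's standard `statistics` module; the helper name `within` is made up here.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def within(k):
    """Proportion of a Normal model within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The rule rounds these to 68%, 95%, and 99.7%, which is why each figure is stated as "about".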
68-95-99.7 Rule
Another view
Just Checking
1. As a group, the Dutch are among the tallest
in the world. The average Dutch man is 184
cm tall – just over 6 feet. If a Normal model
is appropriate and the standard deviation for
men is about 8 cm, what percent of all Dutch
men will be over 2 meters (6’6”) tall?
Just Checking
2. Suppose it takes you 20 minutes, on average, to drive to
school, with a standard deviation of 2 minutes. Suppose a
Normal model is appropriate for the distribution of
driving times.
a) How often will you arrive at school in less
than 22 minutes?
b) How often will it take you more than 24
minutes?
c) Do you think the distribution of your driving times
is unimodal and symmetric?
d) What does this say about the accuracy of your
predictions? Explain.
Think, Show, Tell
The SAT Reasoning Test has three parts: Writing,
Math, and Critical Reading. Each part has a
distribution that is roughly unimodal and symmetric
and is designed to have an overall mean of about
500 and a standard deviation of 100 for all test
takers. In any one year, the mean and standard
deviation may differ from these targets by a small
amount, but they are a good overall
approximation. Suppose you earned a 600 on one
part of your SAT. Where do you stand among all
students who took that test?
Think
I want to see how my SAT score compares with
all other students. To do that, I’ll need to model
the distribution.
Nearly Normal Condition – This is satisfied because
we’re told the distribution is roughly symmetric and
unimodal (we’d check a histogram if we had
actual data points)
Show
We will model SAT score with a N(500, 100)
model.
[Figure: Normal curve for N(500, 100), horizontal axis marked from 200 to 800 in steps of 100]
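The score of 600 is 1 standard deviation above the mean. By the
68-95-99.7 Rule, about 68% of scores lie within 1 standard deviation
of the mean, so about 32% lie outside it, with 16% in each tail. About
50% + 34% = 84% of scores therefore fall below 600.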
Tell
The score of 600 is higher than about 84% of all
scores on this portion of the SAT.
Finding Normal Percentiles by Hand
An SAT score of 600 is easy to assess because it’s
exactly one standard deviation away from the
mean.
What if we wanted to see how a person stands
against the population if his score was 680?
Normal Percentiles by Hand
• Step 1: Calculate the z-score by hand.
$z = \frac{y - \mu}{\sigma} = \frac{680 - 500}{100} = 1.8$
• Step 2: Look up the z-score using a standard
Normal table (we have one on pg A-118 in the
back of the book)
The student scored higher than 96.41% of all
SAT test takers (equivalently, 3.59% scored higher)
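The table lookup can be cross-checked in software. This sketch uses `NormalDist` from Python's standard `statistics` module in place of the printed table; the N(500, 100) model is the one assumed on the slides.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)  # the N(500, 100) model

below = sat.cdf(680)  # proportion of scores below 680
above = 1 - below     # proportion of scores above 680

print(round(below, 4), round(above, 4))  # 0.9641 0.0359
```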
Finding Normal Percentiles on the Calculator
To access the Normal model on your calculator:
2nd DISTR to access various distributions and
models
We’re going to primarily use 2: normalcdf( and 3:
invNorm(
*We rarely (if ever) use 1: normalpdf( in this course!
normalcdf(
normalcdf( finds the area between two z-score
cut points, by specifying a lower bound and an
upper bound
normalcdf(zLeft, zRight)
normalcdf(-0.5, 1.0) = 0.5328
[Figure: standard Normal curve with cut points at z = -0.5 and z = 1.0]
So, 53.28% of the data lies
between ½ standard
deviation below the mean
and 1 standard deviation
above the mean
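For readers without the calculator at hand, the same area can be computed in Python. The function below is a stand-in written for this illustration and named after the calculator command; it is not a Python built-in.

```python
from statistics import NormalDist

def normalcdf(z_left, z_right, mu=0.0, sigma=1.0):
    """Area under the Normal curve between two cut points."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(z_right) - dist.cdf(z_left)

print(round(normalcdf(-0.5, 1.0), 4))  # 0.5328
```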
normalcdf(
What about our SAT question? Where we had a
score of 680…
(remember, z = 1.8)
[Figure: standard Normal curve, horizontal axis marked from -3 to 3]
normalcdf(-999999, 1.8) = 0.9641
normalcdf(1.8, 9999999) = 0.0359
What do we
notice here?
Think, Show, Tell
What proportion of SAT scores fall between 450
and 580? Assume a Normal model is
appropriate, with a mean of 500 and a standard
deviation of 100.
Think
We want to know the proportion of SAT
scores between 450 and 580.
Nearly Normal Condition – We are told that a Normal
model is appropriate, so the condition is satisfied.
Show
$z = \frac{y - \mu}{\sigma} = \frac{580 - 500}{100} = 0.8$
$z = \frac{y - \mu}{\sigma} = \frac{450 - 500}{100} = -0.5$
Area (z < 0.8) = 0.7881
Area (z < -0.5) = 0.3085
Area (-0.5 < z < 0.8) = 0.7881 - 0.3085 = 0.4796
normalcdf(-0.5, 0.8) = 0.4796
Tell
The Normal model estimates that about 47.96%
of SAT scores fall between 450 and 580.
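The whole Think, Show, Tell calculation fits in a few lines of Python. This is a sketch using `NormalDist` from the standard `statistics` module, under the N(500, 100) model stated in the problem.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)

# z-scores of the two cut points.
z_low = (450 - sat.mean) / sat.stdev
z_high = (580 - sat.mean) / sat.stdev

# Area between the cut points: area to the left of 580
# minus area to the left of 450.
area = sat.cdf(580) - sat.cdf(450)

print(z_low, z_high, round(area, 4))  # -0.5 0.8 0.4796
```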
From Percentiles to Z-Scores
• What if we know the areas/percentiles and
want to find the corresponding z-scores?
• A college says it admits only people with SAT
Verbal test scores among the top 10%. How
high a score does it take to be eligible?
In Reverse
• By hand: find the percent in the table and give
the corresponding z-score; give answer in
context
• By calculator: 2nd DISTR and choose
3: invNorm(
Enter the area to the left of the cut point as a decimal
Back to that 10%
A college says it admits only people with SAT Verbal
test scores among the top 10%. How high a score does
it take to be eligible?
What do we want to find on the table? Does .1000
make sense?
(z = -1.28; $y = z\sigma + \mu = -1.28(100) + 500 = 372$)
That score is below the mean, so it marks the bottom 10%, not the top 10%.
*ALWAYS read the area to the left of the critical point
The Correct Way
So, if we’re finding the top 10%, then these
students score better than the lower 90%.
→ .9000 is the area we’re looking for; z = 1.28
(the fact that this is symmetric with z = -1.28
is NOT coincidental!)
→ The necessary SAT score is 1.28(100) + 500 = 628
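Going from an area back to a score is what the calculator's invNorm( does. A rough Python equivalent is `NormalDist.inv_cdf` from the standard `statistics` module, which also takes the area to the left. The sketch below assumes the N(500, 100) model used for SAT scores on earlier slides.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)

# Top 10% means 90% of scores lie to the left of the cut point.
z_cut = NormalDist().inv_cdf(0.90)
score_cut = sat.inv_cdf(0.90)
print(round(z_cut, 2), round(score_cut))  # 1.28 628

# Entering 0.10 instead gives the cut point for the bottom 10%.
wrong_cut = sat.inv_cdf(0.10)
print(round(wrong_cut))  # 372
```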
Normal Probability Plots
If the distribution of a data set is roughly
Normal, a Normal probability plot will be
roughly a diagonal straight line. Deviations from
a straight line indicate the distribution is not
Normal.
In the calculator – plot using the
last plot option under “Type”
Always use the Y axis
How Does it Work?
The plot is really a scatter plot of the actual data
values against the z-scores they would have under a Normal
model. If the standardized data values match those z-scores, the points lie on the
line y = x. Therefore, the closer the points are to
forming a diagonal line, the closer
the data is to being Normal.
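To make the idea concrete, here is a sketch of the numbers behind such a plot. The data values are invented, and the plotting positions (i - 0.5)/n are one common convention assumed for this illustration; a calculator or software package may use a slightly different one.

```python
from statistics import NormalDist, mean, stdev

def normal_scores(n):
    """z-scores expected for n sorted values from a Normal model,
    using the plotting positions (i - 0.5) / n (an assumed convention)."""
    std = NormalDist()
    return [std.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

data = sorted([4.1, 5.0, 5.2, 5.9, 6.3, 7.4])  # hypothetical data values
m, s = mean(data), stdev(data)

# Each row: expected Normal z-score, data value, actual z-score.
# For nearly Normal data the first and last columns roughly agree.
for z_expected, y in zip(normal_scores(len(data)), data):
    z_actual = (y - m) / s
    print(round(z_expected, 2), y, round(z_actual, 2))
```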
What Can Go Wrong?
• Don’t use a Normal model when the
distribution is not unimodal and symmetric.
– ALWAYS check the data with a histogram/normal
probability plot first
• Don’t use the mean/standard deviation when
there are outliers
• Don’t round off too soon (or in the middle of a
calculation)
• Don’t worry about minor differences in results