4.3 Measures of Variation
LEARNING GOAL
Understand and interpret these common measures of
variation: range, the five-number summary, and standard
deviation.
Copyright © 2009 Pearson Education, Inc.
Why Variation Matters
Customers at Big Bank can enter any one of three
different lines leading to three different tellers.
Best Bank also has three tellers, but all customers wait in
a single line and are called to the next available teller.
Here is a sample of wait times (in minutes), arranged in ascending
order.
Big Bank (three lines): 4.1 5.2 5.6 6.2 6.7 7.2 7.7 7.7 8.5 9.3 11.0
Best Bank (one line): 6.6 6.7 6.7 6.9 7.1 7.2 7.3 7.4 7.7 7.8 7.8
Slide 4.3- 2
Why Variation Matters
You’ll probably find more unhappy customers at Big Bank
than at Best Bank, but this is not because the average wait
is any longer. In fact, the mean and median waiting times
are 7.2 minutes at both banks.
Big Bank (three lines): 4.1 5.2 5.6 6.2 6.7 7.2 7.7 7.7 8.5 9.3 11.0
Best Bank (one line): 6.6 6.7 6.7 6.9 7.1 7.2 7.3 7.4 7.7 7.8 7.8
The difference in customer satisfaction comes from the
variation at the two banks.
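The claim about equal averages can be checked directly. The following is a minimal sketch using Python's standard statistics module; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean, median

# Waiting times (minutes) as listed on the slide
big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

for name, times in [("Big Bank", big_bank), ("Best Bank", best_bank)]:
    spread = max(times) - min(times)  # distance between longest and shortest wait
    print(f"{name}: mean={mean(times):.1f}, median={median(times):.1f}, "
          f"high minus low={spread:.1f}")
```

Both banks report the same mean and median, while the gap between the longest and shortest wait is much larger at Big Bank.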
Slide 4.3- 3
Figure 4.13 Histograms for the waiting times at Big Bank and Best Bank,
shown with data binned to the nearest minute.
Slide 4.3- 4
TIME OUT TO THINK
Explain why Big Bank, with three separate lines,
should have a greater variation in waiting times than
Best Bank. Then consider several places where you
commonly wait in lines, such as a grocery store, a
bank, a concert ticket outlet, or a fast food restaurant.
Do these places use a single customer line that feeds
multiple clerks or multiple lines? If a place uses
multiple lines, do you think a single line would be
better? Explain.
Slide 4.3- 5
Range
Definition
The range of a set of data values is the difference
between its highest and lowest data values:
range = highest value (max) - lowest value (min)
Slide 4.3- 6
EXAMPLE 1 Misleading Range
Consider the following two sets of quiz scores for nine students.
Which set has the greater range? Would you also say that this set
has the greater variation?
Quiz 1: 1 10 10 10 10 10 10 10 10
Quiz 2: 2 3 4 5 6 7 8 9 10
Solution: The range for Quiz 1 is 10 – 1 = 9 points and
the range for Quiz 2 is 10 – 2 = 8 points. Thus, the range is greater
for Quiz 1. However, aside from a single low score (an outlier),
Quiz 1 has no variation at all because every other student got a 10.
In contrast, no two students got the same score on Quiz 2, and the
scores are spread throughout the list of possible scores. Quiz 2
therefore has greater variation even though Quiz 1 has greater
range.
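The range calculation in this example can be sketched in a few lines of Python. The helper name value_range is an assumption made for illustration; the last step, dropping the single low score, is added here only to show how strongly one outlier drives the range.

```python
def value_range(data):
    """Range = highest value (max) - lowest value (min)."""
    return max(data) - min(data)

quiz_1 = [1, 10, 10, 10, 10, 10, 10, 10, 10]
quiz_2 = [2, 3, 4, 5, 6, 7, 8, 9, 10]

print(value_range(quiz_1))  # 9
print(value_range(quiz_2))  # 8

# Remove the single low score from Quiz 1 and the range collapses
quiz_1_without_outlier = sorted(quiz_1)[1:]
print(value_range(quiz_1_without_outlier))  # 0
```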
Slide 4.3- 7
Quartiles and the Five-Number Summary
Quartiles are values that divide the data distribution
into quarters.
Lower quartile (Q1)   Median (Q2)   Upper quartile (Q3)
Big Bank: 4.1 5.2 5.6 6.2 6.7 7.2 7.7 7.7 8.5 9.3 11.0
Best Bank: 6.6 6.7 6.7 6.9 7.1 7.2 7.3 7.4 7.7 7.8 7.8
Slide 4.3- 8
Definitions
The lower quartile (or first quartile or Q1) divides the
lowest fourth of a data set from the upper three-fourths. It
is the median of the data values in the lower half of a data
set. (Exclude the middle value in the data set if the
number of data points is odd.)
The middle quartile (or second quartile or Q2) is the
overall median.
The upper quartile (or third quartile or Q3) divides the
lowest three-fourths of a data set from the upper fourth. It
is the median of the data values in the upper half of a
data set. (Exclude the middle value in the data set if the
number of data points is odd.)
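The definitions above can be sketched as a short Python function. This is one reading of the stated procedure (median of each half, with the middle value excluded when the count is odd); the function name quartiles is an assumption for illustration.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves procedure."""
    values = sorted(data)
    half = len(values) // 2      # size of each half; the middle value is skipped when n is odd
    lower_half = values[:half]
    upper_half = values[-half:]
    return median(lower_half), median(values), median(upper_half)

big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

print(quartiles(big_bank))   # (5.6, 7.2, 8.5)
print(quartiles(best_bank))  # (6.7, 7.2, 7.7)
```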
Slide 4.3- 9
TECHNICAL NOTE
Statisticians do not universally agree
on the procedure for calculating
quartiles, and different procedures
can result in slightly different values.
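As an illustration of this point, Python's statistics.quantiles function offers two procedures, "exclusive" and "inclusive". The sketch below applies both to the Big Bank waiting times; the choice of this library and data set is ours, not the book's.

```python
from statistics import quantiles

big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]

# n=4 asks for the three cut points that split the data into quarters
exclusive = quantiles(big_bank, n=4, method="exclusive")
inclusive = quantiles(big_bank, n=4, method="inclusive")

print([round(q, 2) for q in exclusive])  # [5.6, 7.2, 8.5]
print([round(q, 2) for q in inclusive])  # [5.9, 7.2, 8.1]
```

For this data set the "exclusive" result happens to match the median-of-halves procedure, while the "inclusive" result gives slightly different lower and upper quartiles. That agreement is not guaranteed for other sample sizes.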
Slide 4.3- 10
The Five-Number Summary
The five-number summary for a data distribution
consists of the following five numbers:
low value
lower quartile
median
upper quartile
high value
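A five-number summary for the two banks can be computed as follows. This is a minimal sketch that assumes the median-of-halves quartile procedure defined earlier in this section; the function name and dictionary keys are illustrative.

```python
from statistics import median

def five_number_summary(data):
    """Return low value, lower quartile, median, upper quartile, high value."""
    values = sorted(data)
    half = len(values) // 2  # middle value excluded from both halves when n is odd
    return {
        "low": values[0],
        "lower quartile": median(values[:half]),
        "median": median(values),
        "upper quartile": median(values[-half:]),
        "high": values[-1],
    }

big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

print(five_number_summary(big_bank))
print(five_number_summary(best_bank))
```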
Slide 4.3- 11
Drawing a Boxplot
Step 1. Draw a number line that spans all the values in the
data set.
Step 2. Enclose the values from the lower to the upper
quartile in a box. (The thickness of the box has no meaning.)
Step 3. Draw a line through the box at the median.
Step 4. Add “whiskers” extending to the low and high values.
Figure 4.14 Boxplots show that the variation of the waiting times is
greater at Big Bank than at Best Bank.
Slide 4.3- 12
TECHNICAL NOTE
The boxplots shown in this book are called
skeletal boxplots. Some boxplots are drawn
with outliers marked by an asterisk (*) and
the whiskers extending only to the smallest
and largest nonoutliers; these types of
boxplots are called modified boxplots.
Slide 4.3- 13
Percentiles
Definition
The nth percentile of a data set divides the bottom n% of
data values from the top (100 - n)%. A data value that lies
between two percentiles is often said to lie in the lower
percentile. You can approximate the percentile of any
data value with the following formula:
percentile of data value =
(number of values less than this data value / total number of values in data set) x 100
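The percentile formula translates directly into code. The sketch below applies it to the Big Bank waiting times from earlier in this section, purely as an illustration; the function name percentile_of is an assumption.

```python
def percentile_of(value, data):
    """Approximate percentile: share of values strictly below `value`, times 100."""
    below = sum(1 for x in data if x < value)
    return below / len(data) * 100

big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]

# Eight of the eleven waits are shorter than 8.5 minutes
print(round(percentile_of(8.5, big_bank)))  # 73
```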
Slide 4.3- 14
There are different procedures for finding a data
value corresponding to a given percentile, but one
approximate approach is to find the Lth value, where
L is the product of the percentile (in decimal form)
and the sample size.
For example, with 50 sample values, the 12th
percentile is around the 0.12 × 50 = 6th value.
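The Lth-value approach can be sketched as below. Two details are assumptions of this sketch rather than part of the slide: L is rounded to the nearest whole position when the product is not a whole number, and the sample is a placeholder list of the numbers 1 through 50 so that the Lth value is easy to read off.

```python
def value_at_percentile(data, percentile):
    """Return the Lth sorted value, where L = (percentile / 100) x sample size."""
    values = sorted(data)
    position = round(percentile / 100 * len(values))  # L, counted from 1
    position = min(max(position, 1), len(values))     # keep L inside the data set
    return values[position - 1]

sample = list(range(1, 51))  # placeholder data: 50 values, 1 through 50

print(value_at_percentile(sample, 12))  # 6, the 6th value
```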
Slide 4.3- 15
Slide 4.3- 16
EXAMPLE 3 Smoke Exposure Percentiles
Answer the following questions concerning the data in Table 4.4
(previous slide).
a. What is the percentile for the data value of 104.54 ng/ml for
smokers?
Solution: The following results are approximate.
a. The data value of 104.54 ng/ml for smokers is the 35th data
value in the set, which means that 34 data values lie below it.
Thus, its percentile is
(number of values less than 104.54 ng/ml / total number of values in data set) x 100
= (34 / 50) x 100 = 68
In other words, the 35th data value marks the 68th percentile.
Slide 4.3- 17
EXAMPLE 3 Smoke Exposure Percentiles
Answer the following questions concerning the data in Table 4.4
(slide 16).
b. What is the percentile for the data value of 61.33 ng/ml for
nonsmokers?
Solution: The following results are approximate.
b. The data value of 61.33 ng/ml for nonsmokers is the 50th and highest
data value in the set, which means that 49 data values lie below
it. Thus, its percentile is
(number of values less than 61.33 ng/ml / total number of values in data set) x 100
= (49 / 50) x 100 = 98
In other words, the highest data value marks the 98th percentile.
Slide 4.3- 18
EXAMPLE 3 Smoke Exposure Percentiles
Answer the following questions concerning the data in Table 4.4
(slide 16).
c. What data value marks the 36th percentile for the smokers? For
the nonsmokers?
Solution:
c. Because there are 50 data values in the set, the 36th percentile is
around the 0.36 x 50 = 18th value. For smokers this value is 20.16
ng/ml, and for nonsmokers it is 0.33 ng/ml.
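The arithmetic in parts a through c can be reproduced from positions alone, without the table. This sketch assumes no tied values, so that a data value in sorted position p has exactly p - 1 values below it; the function name is hypothetical.

```python
def percentile_from_position(position, n):
    """Percentile of the value at a 1-based sorted position, assuming no ties."""
    return (position - 1) * 100 / n

print(percentile_from_position(35, 50))  # 68.0, part a
print(percentile_from_position(50, 50))  # 98.0, part b

# Part c: the 36th percentile of 50 values is around the 18th value
print(round(36 * 50 / 100))  # 18
```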
Slide 4.3- 19
Standard Deviation
Statisticians often prefer to describe variation with a single
number. The single number most commonly used to describe
variation is called the standard deviation.
Slide 4.3- 20
Calculating the Standard Deviation
To calculate the standard deviation for any data set:
Step 1. Compute the mean of the data set. Then find the
deviation from the mean for every data value by
subtracting the mean from the data value. That
is, for every data value,
deviation from mean = data value – mean
Step 2. Find the squares (second power) of all the
deviations from the mean.
Step 3. Add all the squares of the deviations from the
mean.
Slide 4.3- 21
Calculating the Standard Deviation (cont.)
Step 4. Divide this sum by the total number of data
values minus 1.
Step 5. The standard deviation is the square root of this
quotient.
Overall, these steps produce the standard deviation
formula:
standard deviation = √[ sum of (deviations from the mean)^2 / (total number of data values - 1) ]
(This formula is shown in summation notation on slide 36.)
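The five steps can be followed one at a time in code. This is a minimal sketch; the function name is an assumption, and the result is cross-checked against statistics.stdev from Python's standard library, which also uses the sample formula.

```python
from math import sqrt
from statistics import stdev

def sample_standard_deviation(data):
    mean = sum(data) / len(data)              # Step 1: mean ...
    deviations = [x - mean for x in data]     # ... and deviation of every value
    squares = [d ** 2 for d in deviations]    # Step 2: square each deviation
    total = sum(squares)                      # Step 3: add the squares
    quotient = total / (len(data) - 1)        # Step 4: divide by (number of values - 1)
    return sqrt(quotient)                     # Step 5: take the square root

big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]

print(round(sample_standard_deviation(big_bank), 2))  # 1.96
print(round(stdev(big_bank), 2))                      # 1.96
```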
Slide 4.3- 22
TECHNICAL NOTE
In finding the standard deviation when
dealing with data from a sample, one part
of the calculation involves dividing the sum
of the squared deviations by the total
number of data values minus 1. When
dealing with an entire population, we do not
subtract the 1. In this book, we will use only
the formula for a sample.
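To see how much the choice matters, the sketch below compares the two formulas on the Big Bank waiting times using Python's statistics module: stdev divides by the number of values minus 1 (sample), and pstdev divides by the number of values (population). Treating the same 11 values both ways is done here only for illustration.

```python
from statistics import pstdev, stdev

big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]

print(round(stdev(big_bank), 2))   # 1.96, sample formula (divide by 10)
print(round(pstdev(big_bank), 2))  # 1.87, population formula (divide by 11)
```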
Slide 4.3- 23
TECHNICAL NOTE (2)
The result of Step 4 is called the variance
of the distribution. In other words, the
standard deviation is the square root of the
variance. Although the variance is used in
many advanced statistical computations,
we will not use it in this book.
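The relationship between the two quantities can be checked numerically. This sketch uses statistics.variance and statistics.stdev on the Big Bank waiting times; the data choice is ours.

```python
from math import isclose, sqrt
from statistics import stdev, variance

big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]

v = variance(big_bank)   # result of Step 4 (sample formula)
print(round(v, 3))       # 3.846
print(isclose(sqrt(v), stdev(big_bank)))  # True: standard deviation = square root of variance
```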
Slide 4.3- 24
EXAMPLE 4 Calculating Standard Deviation
Calculate the standard deviation for the waiting times at Big Bank.
Solution:
We follow the five steps to
calculate the standard
deviations. Table 4.5 shows
how to organize the work in
the first three steps.
The first column for each
bank lists the waiting times
(in minutes).
The second column lists the
deviations from the mean
(Step 1).
Slide 4.3- 27
EXAMPLE 4 Calculating Standard Deviation
Calculate the standard deviation for the waiting times at Big Bank.
Solution (cont.):
The third column lists the
squares of the deviations
(Step 2).
We add all the squared
deviations to find the sum at
the bottom of the third
column (Step 3).
Slide 4.3- 29
EXAMPLE 4 Calculating Standard Deviation
Calculate the standard deviation for the waiting times at Big Bank.
Solution (cont.):
For Step 4, we divide the
sum from Step 3 by the
total number of data values
minus 1. Because there are
11 data values, we divide by
10:
38.46 / 10 = 3.846
Finally, Step 5 tells us that
the standard deviation is the
square root of the number
from Step 4:
standard deviation = √3.846 = 1.96 minutes
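The three-column layout described for this example (waiting time, deviation, squared deviation) can be rebuilt for Big Bank with a short script. This is a sketch of the layout as described in the text, not a reproduction of Table 4.5 itself; the function and column names are assumptions.

```python
from statistics import mean

def deviation_table(times):
    """Return (mean, rows); each row is (time, deviation, squared deviation)."""
    m = mean(times)
    return m, [(x, x - m, (x - m) ** 2) for x in times]

big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
m, rows = deviation_table(big_bank)

print(f"{'time':>6} {'deviation':>10} {'squared':>9}")
for time, deviation, squared in rows:
    print(f"{time:6.1f} {deviation:10.1f} {squared:9.2f}")

sum_of_squares = sum(squared for _, _, squared in rows)  # Step 3
quotient = sum_of_squares / (len(big_bank) - 1)          # Step 4
standard_deviation = quotient ** 0.5                     # Step 5
print(f"sum = {sum_of_squares:.2f}, quotient = {quotient:.3f}, "
      f"standard deviation = {standard_deviation:.2f} minutes")
```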
Slide 4.3- 30
Interpreting the Standard Deviation
A good way to develop a deeper understanding of the standard
deviation is to consider an approximation called the range
rule of thumb.
The Range Rule of Thumb
The standard deviation is approximately related to the range of
a distribution by the range rule of thumb:
standard deviation ≈ range / 4
If we know the range of a distribution (range = high – low),
we can use this rule to estimate the standard deviation.
Slide 4.3- 31
The Range Rule of Thumb (cont.)
Alternatively, if we know the standard deviation, we can use
this rule to estimate the low and high values as follows:
low value ≈ mean – (2 x standard deviation)
high value ≈ mean + (2 x standard deviation)
The range rule of thumb does not work well when the high or
low values are outliers.
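Both directions of the range rule of thumb fit in two small functions. The function names are assumptions for illustration; the inputs are the Big Bank extremes (4.1 and 11.0 minutes) and the gas mileage figures (mean 22, standard deviation 3) used in the examples of this section.

```python
def estimate_std_from_range(low, high):
    """Range rule of thumb: standard deviation is roughly range / 4."""
    return (high - low) / 4

def estimate_low_high(mean, std):
    """Range rule of thumb in reverse: mean minus/plus 2 standard deviations."""
    return mean - 2 * std, mean + 2 * std

print(round(estimate_std_from_range(4.1, 11.0), 1))  # 1.7
print(estimate_low_high(22, 3))                      # (16, 28)
```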
Slide 4.3- 32
TECHNICAL NOTE
Another way of interpreting the standard
deviation uses a mathematical rule called
Chebyshev’s Theorem. It states that, for
any data distribution, at least 75% of all
data values lie within two standard
deviations of the mean, and at least 89% of
all data values lie within three standard deviations of
the mean. Although we will not use this
theorem in this book, you may encounter it
if you take another statistics course.
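The 75% and 89% figures come from the bound 1 - 1/k^2 for k standard deviations (k greater than 1), which is the standard statement of Chebyshev's Theorem. The sketch below computes the bound and checks it against the Big Bank waiting times; using the sample standard deviation here is an assumption of the sketch, and since it is slightly larger than the population value, the check is if anything easier to pass.

```python
from statistics import mean, stdev

def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k ** 2

def fraction_within(data, k):
    """Observed fraction of values within k standard deviations of the mean."""
    m, s = mean(data), stdev(data)
    return sum(1 for x in data if abs(x - m) <= k * s) / len(data)

big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]

print(chebyshev_lower_bound(2))             # 0.75
print(round(chebyshev_lower_bound(3), 2))   # 0.89
print(fraction_within(big_bank, 2))         # 1.0, comfortably above the 0.75 bound
```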
Slide 4.3- 33
EXAMPLE 5 Using the Range Rule of Thumb
Use the range rule of thumb to estimate the standard deviation for
the waiting times at Big Bank. Compare the estimate to the actual
value found in Example 4.
Solution: The waiting times for Big Bank vary from 4.1 to 11.0
minutes, which means a range of 11.0 – 4.1 = 6.9 minutes.
standard deviation ≈ 6.9 / 4 ≈ 1.7
The actual standard deviation calculated in Example 4 is 1.96.
For this case the estimate from the range rule of thumb slightly
underestimates the actual standard deviation. Nevertheless, the
estimate puts us in the right ballpark, showing that the rule is
useful.
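The comparison in this example can be run directly on the data. This short sketch computes the rule-of-thumb estimate and the actual sample standard deviation side by side; the variable names are illustrative.

```python
from statistics import stdev

big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]

estimate = (max(big_bank) - min(big_bank)) / 4  # range rule of thumb
actual = stdev(big_bank)                        # five-step sample formula

print(f"estimate = {estimate:.1f} minutes, actual = {actual:.2f} minutes")
print("underestimate" if estimate < actual else "overestimate")
```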
Slide 4.3- 34
EXAMPLE 6 Estimating a Range
Studies of the gas mileage of a BMW under varying driving
conditions show that it gets a mean of 22 miles per gallon with a
standard deviation of 3 miles per gallon. Estimate the minimum and
maximum typical gas mileage amounts that you can expect under
ordinary driving conditions.
Solution: From the range rule of thumb, the low and high
values for gas mileage are approximately
low value ≈ mean – (2 x standard deviation)
= 22 – (2 x 3) = 16
high value ≈ mean + (2 x standard deviation)
= 22 + (2 x 3) = 28
The range of gas mileage for the car is roughly from a minimum
of 16 miles per gallon to a maximum of 28 miles per gallon.
Slide 4.3- 35
Standard Deviation with Summation
Notation (Optional Section)
The summation notation introduced earlier makes it easy to
write the standard deviation formula in a compact form.
The symbol s is the conventional symbol for the standard
deviation of a sample.
For the standard deviation of a population, statisticians use the
Greek letter σ (sigma), and the term n - 1 in the formula is
replaced by n. Consequently, you will get slightly different
results for the standard deviation depending on whether you
assume the data represent a sample or a population.
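The formulas themselves did not survive in this transcript. Based on the word formula given earlier in this section and the description above, they can be written as follows. The symbols x (a data value), x-bar (sample mean), mu (population mean), and n (number of data values) follow common convention and are assumptions here, not taken from the slide.

```latex
% Sample standard deviation (divide by n - 1)
s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}

% Population standard deviation (divide by n)
\sigma = \sqrt{\frac{\sum (x - \mu)^2}{n}}
```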
Slide 4.3- 36
TECHNICAL NOTE
The formula for the variance is
s^2 = sum of (deviations from the mean)^2 / (total number of data values - 1)
The standard symbol for the variance, s^2,
reflects the fact that it is the square of the
standard deviation.
Slide 4.3- 37
The End
Slide 4.3- 38