CHAPTER 4 PART II
Measures of central tendency
• A value that represents a typical, or central, entry of a data set.
• Most common measures of central tendency:
– Mean
– Median
– Mode
Larson/Farber 4th ed.
Copyright © 2010, 2007, 2004 Pearson Education, Inc.
Median
• The value that lies in the middle of the data when the data set is ordered.
• Measures the center of an ordered data set by dividing it into two equal parts.
• If the data set has an
– odd number of entries: median is the middle data entry.
– even number of entries: median is the mean of the two middle data entries.
Example: Finding the Median
The prices (in dollars) for a sample of roundtrip
flights from Chicago, Illinois to Cancun, Mexico are
listed. Find the median of the flight prices.
872 432 397 427 388 782 397
Example: Finding the Median
The flight priced at $432 is no longer available.
What is the median price of the remaining flights?
872 397 427 388 782 397
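The two flight-price examples can be checked with a short Python sketch. The helper name `median_by_hand` is hypothetical and simply follows the odd/even rule stated on the Median slide; the standard-library `statistics.median` is printed alongside as a cross-check.

```python
from statistics import median

def median_by_hand(values):
    """Order the data, then take the middle entry (odd n)
    or the mean of the two middle entries (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

all_flights = [872, 432, 397, 427, 388, 782, 397]    # 7 entries (odd)
remaining_flights = [872, 397, 427, 388, 782, 397]   # 6 entries (even)

print(median_by_hand(all_flights))        # 427
print(median_by_hand(remaining_flights))  # 412.0
print(median(all_flights), median(remaining_flights))  # 427 412.0
```

Removing the $432 flight changes the number of entries from odd to even, so the median becomes the mean of the two middle prices, 397 and 427.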
Finding the Median in a Histogram

The median is the value with exactly half the data values
below it and half above it.
• It is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas.
• It has the same units as the data.
Mean or Median?

Regardless of the shape of the distribution, the mean is the point at which a histogram of the data would balance:
Summarizing Symmetric Distributions - The Mean

The mean feels like the center because it is the
point where the histogram balances:
HOW DOES THE MEAN COMPARE TO THE MEDIAN?
Mean or Median? (cont.)


In symmetric distributions, the mean and median
are approximately the same in value, so either
measure of center may be used.
For skewed data, though, it’s better to report the
median than the mean as a measure of center.
Comparing the Mean, Median, and Mode



All three measures describe a typical entry of a
data set.
Advantage of using the mean:
• The mean is a reliable measure because it takes into account every entry of a data set.
Disadvantage of using the mean:
• Greatly affected by outliers (data entries that are far removed from the other entries in the data set).
Example: Comparing the Mean, Median,
and Mode
Find the mean, median, and mode of the sample ages of
a class shown. Which measure of central tendency best
describes a typical entry of this data set?
Ages in a class
20 20 20 20 20 20 21 21 21 21
22 22 22 23 23 23 23 24 24 65
Solution: Comparing the Mean, Median,
and Mode
Ages in a class
20 20 20 20 20 20 21 21 21 21
22 22 22 23 23 23 23 24 24 65
Mean: \( \bar{x} = \frac{\sum x}{n} = \frac{20 + 20 + \cdots + 24 + 65}{20} \approx 23.8 \) years
Median: \( \frac{21 + 22}{2} = 21.5 \) years
Mode: 20 years (the entry occurring with the greatest frequency)
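As a quick check of the three values, here is a minimal Python sketch using the standard-library `statistics` module. The variable name `ages` is an illustrative choice; the data are the 20 ages from the slide.

```python
from statistics import mean, median, mode

# Ages in a class, as listed on the slide
ages = [20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
        22, 22, 22, 23, 23, 23, 23, 24, 24, 65]

print(mean(ages))    # 23.75, which the slide rounds to 23.8
print(median(ages))  # 21.5
print(mode(ages))    # 20
```

The single age of 65 pulls the mean above most of the entries, while the median stays near the bulk of the data.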
MEASURES OF SPREAD
AND THE BOXPLOT!
CONSIDER THE FOLLOWING 3 SAMPLE DATA SETS:
I: 20 40 50 30 60 70
II: 47 43 44 46 20 70
III: 44 43 40 50 46 47
COMPUTE THE RANGE, MEDIAN & MEAN FOR EACH DATA SET.
WHAT DO YOU NOTICE???
NOW TAKE A LOOK AT COMPARING THE DOT PLOTS
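One way to carry out the comparison is sketched below in Python. The helper `summarize` and the dictionary `data_sets` are hypothetical names introduced only for this illustration.

```python
from statistics import mean, median

data_sets = {
    "I":   [20, 40, 50, 30, 60, 70],
    "II":  [47, 43, 44, 46, 20, 70],
    "III": [44, 43, 40, 50, 46, 47],
}

def summarize(values):
    """Return the range, median, and mean of one data set."""
    return {
        "range": max(values) - min(values),
        "median": median(values),
        "mean": mean(values),
    }

summaries = {name: summarize(values) for name, values in data_sets.items()}
for name, summary in summaries.items():
    print(name, summary)
```

All three sets share a mean and median of 45, and sets I and II even share a range of 50, yet the values in set II sit much closer to the center than those in set I. Center and range alone do not capture how the data are spread, which is what the dot plots make visible.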
How Spread Out is the Distribution?
• Variation matters, and Statistics is about
variation.
• Are the values of the distribution tightly
clustered around the center or more spread
out?
• Always report a measure of spread along with
a measure of center when describing a
distribution numerically.
Measures of Variability
• range (max – min)
• interquartile range (Q3 – Q1)
• deviations \( x - \bar{x} \)
• variance \( \sigma^2 \) (σ is the lowercase Greek letter sigma)
• standard deviation \( \sigma \)
Spread: Home on the Range
• The range of the data is the difference
between the maximum and minimum values:
Range = max – min
• A disadvantage of the range is that a single
extreme value can make it very large and,
thus, not representative of the data overall.
Spread: The Interquartile Range
• The interquartile range (IQR) lets us ignore
extreme data values and concentrate on the
middle of the data.
• To find the IQR, we first need to know what
quartiles are…
Spread: The Interquartile Range (cont.)
• Quartiles divide the data into four equal sections.
– One quarter (25%) of the data lies below the lower
quartile, Q1
– One quarter of the data lies above the upper quartile, Q3; that is, Q3 is the 75% mark
– The quartiles border the middle half of the data.
• A simple way to find quartiles is to start by splitting the data
into 2 halves at the median. Q1 is the median of the lower
half and Q3 the median of the upper half.
Example: Finding Quartiles
The number of nuclear power plants in the top 15
nuclear power-producing countries in the world are
listed. Find the first, second, and third quartiles of the
data set.
7 18 11 6 59 17 18 54 104 20 31 8 10 15 19
Solution:
• The median divides the data set into two halves.
Ordered data: 6 7 8 10 11 15 17 18 18 19 20 31 54 59 104
Lower half: 6 7 8 10 11 15 17
Median: 18
Upper half: 18 19 20 31 54 59 104
© 2012 Pearson Education, Inc. All rights reserved.
Solution: Finding Quartiles
• The first and third quartiles are the medians of the
lower and upper halves of the data set.
Lower half: 6 7 8 10 11 15 17, so Q1 = 10
Median: Q2 = 18
Upper half: 18 19 20 31 54 59 104, so Q3 = 31
INTERPRETING THE QUARTILES: About one fourth of the
countries have 10 or fewer nuclear power plants; about one half
have 18 or fewer; and about three fourths have 31 or fewer.
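The split-at-the-median method can be sketched in Python as follows. The function name `quartiles` is hypothetical, and the sketch assumes the convention used on these slides: when the number of entries is odd, the median itself belongs to neither half.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) by splitting the ordered data at the median.

    Assumption: for an odd number of entries, the middle entry is
    left out of both the lower half and the upper half.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

plants = [7, 18, 11, 6, 59, 17, 18, 54, 104, 20, 31, 8, 10, 15, 19]
q1, q2, q3 = quartiles(plants)
print(q1, q2, q3)  # 10 18 31
```

Other textbooks, calculators, and software packages may use slightly different quartile conventions, so their answers can differ from this method on some data sets.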
FINDING QUARTILES
• Listed below are the lengths of the touchdown passes for the Green Bay Packers over the course of several games:
28, 18, 20, 30, 32, 27, 32, 20, 22, 31, 35, 39, 33, 19, 18
Find Q1, the median, and Q3 and explain what these
values tell about the distribution.
THE IQR
• The difference between the quartiles is the
interquartile range (IQR), so
IQR = upper quartile – lower quartile = Q3 – Q1
Find the IQR of the Green Bay data and write a
sentence explaining the meaning of this value.
FINDING THE IQR
• Find each of the Quartiles and compute the IQR of
the following data set of New York travel times:
5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
Larson/Farber 5th ed.
Definition:
The 1.5 x IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 x IQR above the third
quartile or below the first quartile.
In the New York travel time data, we found Q1=15 minutes,
Q3=42.5 minutes, and IQR=27.5 minutes.
For these data, 1.5 x IQR = 1.5(27.5) = 41.25
Q1 – 1.5 x IQR = 15 – 41.25 = –26.25
Q3 + 1.5 x IQR = 42.5 + 41.25 = 83.75
Any travel time shorter than -26.25 minutes or longer than
83.75 minutes is considered an outlier.
SO DO WE HAVE ANY OUTLIERS?
Stemplot of the New York travel times:
0 | 5
1 | 005555
2 | 0005
3 | 00
4 | 005
5 |
6 | 005
7 |
8 | 5
Key: 8 | 5 represents 85 minutes.
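The question can be answered by applying the 1.5 x IQR rule directly to the travel times. The Python sketch below is one way to do that; the helper names `quartiles` and `iqr_outliers` are hypothetical, and the quartiles are found by the split-at-the-median method used in this chapter.

```python
from statistics import median

def quartiles(values):
    """Q1, median, Q3 by splitting the ordered data at the median."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

def iqr_outliers(values):
    """Return the lower fence, the upper fence, and any outliers."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers

travel_times = [5, 10, 10, 15, 15, 15, 15, 20, 20, 20,
                25, 30, 30, 40, 40, 45, 60, 60, 65, 85]
print(iqr_outliers(travel_times))  # (-26.25, 83.75, [85])
```

Only the 85-minute travel time falls outside the fences, so it is the one outlier in this data set.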
Identifying Outliers
In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers.
To Determine Outliers
Find Quartile 1 & Quartile 3
Determine Interquartile Range: IQR = Q3 - Q1
Multiply: 1.5 x IQR
Set up “fences”: Q1 - (1.5 x IQR) and Q3 + (1.5 x IQR)
Observations “outside” the fences are outliers.
Why 1.5? According to John Tukey, 1 IQR seemed like too little and 2 IQRs seemed like too much...
IDENTIFYING OUTLIERS
• USE THE 1.5 IQR RULE TO DECIDE IF THERE ARE ANY OUTLIERS IN THE FOLLOWING DATA SET:
17 23 24 27 32 35 16 70 12 15 22 35 34 18 0
5-Number Summary
• The 5-number summary of a distribution reports its median, quartiles (Q1 & Q3), and extremes (maximum and minimum).
• The 5-number summary for the recent tsunami earthquake magnitudes looks like this:
Interpret these values.
USING THE CALCULATOR
• TO FIND THE 5-NUMBER SUMMARY ON THE CALCULATOR:
1. ENTER DATA INTO A LIST
2. PRESS STAT, SCROLL TO CALC, AND RUN 1-VAR STATS ON THE LIST
Find the 5-number summary for the data list:
7 18 11 6 59 17 18 54 104 20 31 8 10 15 19
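For readers working on a computer rather than a calculator, the same summary can be sketched in Python. The function name `five_number_summary` is hypothetical, and the quartiles are computed with the split-at-the-median method from earlier in this chapter, which is an assumption about the convention wanted here.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, leave the middle entry out of both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

plants = [7, 18, 11, 6, 59, 17, 18, 54, 104, 20, 31, 8, 10, 15, 19]
print(five_number_summary(plants))  # (6, 10, 18, 31, 104)
```

The quartiles agree with the values found by hand in the nuclear power plant example (Q1 = 10, Q2 = 18, Q3 = 31).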
FIVE NUMBER SUMMARIES
Of course, in real life, where data sets are often large and full of “messy” numbers, you’ll use a calculator to find the 5-number summary, but for now let’s try calculating one by hand!
Find the 5-number summary for the following data set, which lists the number of calories in 9 different candy bars:
280 250 290 240 210 220 190 220 230
Box-and-Whisker Plot
Box-and-whisker plot
• Exploratory data analysis tool.
• Highlights important features of a data set.
• Requires (five-number summary):
– Minimum entry
– First quartile Q1
– Median Q2
– Third quartile Q3
– Maximum entry
Drawing a Box-and-Whisker Plot
1. Find the five-number summary of the data set.
2. Construct a horizontal scale that spans the range of
the data.
3. Plot the five numbers above the horizontal scale.
4. Draw a box above the horizontal scale from Q1 to Q3 and draw a vertical line in the box at the median, Q2.
5. Draw whiskers from the box to the minimum and
maximum entries if there are no outliers.
[Figure: box-and-whisker plot with labels, from left to right: minimum entry, whisker, Q1, box, median (Q2), Q3, whisker, maximum entry]
Example: Drawing a Box-and-Whisker
Plot
Draw a box-and-whisker plot that represents the 15
test scores.
Recall Min = 5, Q1 = 10, Q2 = 15, Q3 = 18, Max = 37
Solution:
[Figure: box-and-whisker plot on a horizontal scale marked at 5, 10, 15, 18, and 37]
About half the scores are between 10 and 18. By looking
at the length of the right whisker, you can conclude 37 is
a possible outlier.
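The visual impression can be backed up with the 1.5 x IQR rule. The short Python sketch below uses only the five-number summary given in this example; the variable names are illustrative.

```python
# Five-number summary of the 15 test scores, as given in the example
minimum, q1, q2, q3, maximum = 5, 10, 15, 18, 37

iqr = q3 - q1                    # 18 - 10 = 8
lower_fence = q1 - 1.5 * iqr     # 10 - 12 = -2
upper_fence = q3 + 1.5 * iqr     # 18 + 12 = 30

print(lower_fence, upper_fence)  # -2.0 30.0
print(maximum > upper_fence)     # True
```

Because 37 lies above the upper fence of 30, the rule flags it as an outlier, which agrees with the long right whisker.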
Modified boxplots
• display outliers
• fences mark off mild & extreme outliers
• whiskers extend to the largest (smallest) data value inside the fence
ALWAYS use modified boxplots in this class!!!
Outlier Example
IQR = 45.72 – 19.06 = 26.66
1.5 IQR = 1.5(26.66) = 39.99
Lower fence: 19.06 – 39.99 = –20.93
Upper fence: 45.72 + 39.99 = 85.71
[Figure: modified boxplot of Spending ($) on a scale from 0 to 100, with the outliers marked]
A report from the U.S. Department of Justice gave the
following percent increase in federal prison populations
in 20 northeastern & mid-western states in 1999.
5.9 4.5 2.3 3.5 5.0
8.2 5.9 6.4 4.5 5.5
5.6 5.3 4.1 10.9 6.3
4.4 4.8 8.5 6.9 3.2
Construct a modified boxplot. Describe the distribution.
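The numbers needed to draw the modified boxplot can be worked out with a short Python sketch. The function name `modified_boxplot_stats` and its dictionary keys are hypothetical, and the quartiles use the split-at-the-median method from this chapter.

```python
from statistics import median

def modified_boxplot_stats(values):
    """Quartiles, fences, whisker ends, and outliers for a modified boxplot."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median(ordered),
        "q3": q3,
        "fences": (low_fence, high_fence),
        # Whiskers stop at the most extreme values inside the fences.
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }

increases = [5.9, 4.5, 2.3, 3.5, 5.0, 8.2, 5.9, 6.4, 4.5, 5.5,
             5.6, 5.3, 4.1, 10.9, 6.3, 4.4, 4.8, 8.5, 6.9, 3.2]
stats = modified_boxplot_stats(increases)
print(stats["whiskers"], stats["outliers"])  # (2.3, 8.5) [10.9]
```

With Q1 of about 4.45 and Q3 of about 6.35, the fences fall near 1.6 and 9.2, so the whiskers run from 2.3 to 8.5 and the 10.9% increase is plotted separately as an outlier.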
Why use boxplots?
• ease of construction
• convenient handling of outliers
• Used with medium or large size
data sets (n > 10)
• useful for comparative displays
More About Spread: The Standard
Deviation


A more powerful measure of spread than the IQR
is the standard deviation, which takes into
account how far each data value is from the
mean.
A deviation is the distance that a data value is
from the mean.
Example: Finding the Sample Standard Deviation
The starting salaries are for the Chicago branches
of a corporation. The corporation has several other
branches, and you plan to use the starting salaries
of the Chicago branches to estimate the starting
salaries for the larger population. Find the sample
standard deviation of the starting salaries.
Starting salaries (1000s of dollars)
41 38 39 45 47 41 44 41 37 42
Solution: Finding the Sample Standard Deviation




• First find the mean (here 41.5)
• Next calculate each deviation from the mean
• Sum these deviations
• What do you notice???

Salary, x    Deviation: x – x̄       Squares: (x – x̄)²
41           41 – 41.5 = –0.5        (–0.5)² = 0.25
38           38 – 41.5 = –3.5        (–3.5)² = 12.25
39           39 – 41.5 = –2.5        (–2.5)² = 6.25
45           45 – 41.5 = 3.5         (3.5)² = 12.25
47           47 – 41.5 = 5.5         (5.5)² = 30.25
41           41 – 41.5 = –0.5        (–0.5)² = 0.25
44           44 – 41.5 = 2.5         (2.5)² = 6.25
41           41 – 41.5 = –0.5        (–0.5)² = 0.25
37           37 – 41.5 = –4.5        (–4.5)² = 20.25
42           42 – 41.5 = 0.5         (0.5)² = 0.25
             Σ(x – x̄) = 0           SSx = 88.5

Since adding all deviations together would total
zero, we square each deviation and find an
average of sorts for the deviations.
Solution: Finding the Sample Standard Deviation
Sample Variance
• \( s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{88.5}{10 - 1} \approx 9.8 \)
Sample Standard Deviation
• \( s = \sqrt{s^2} = \sqrt{\frac{88.5}{9}} \approx 3.1 \)
The sample standard deviation is about 3.1, or $3100.
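The whole calculation, from the deviations to the sample standard deviation, can be reproduced with a short Python sketch. The variable names are illustrative, and the standard-library `statistics.stdev` (which also divides by n – 1) is printed at the end as a cross-check.

```python
from math import sqrt
from statistics import stdev

# Starting salaries in thousands of dollars
salaries = [41, 38, 39, 45, 47, 41, 44, 41, 37, 42]

n = len(salaries)
x_bar = sum(salaries) / n                       # sample mean
deviations = [x - x_bar for x in salaries]      # x minus the mean
sum_of_squares = sum(d ** 2 for d in deviations)
sample_variance = sum_of_squares / (n - 1)      # divide by n - 1 for a sample
sample_sd = sqrt(sample_variance)

print(x_bar)                      # 41.5
print(sum(deviations))            # 0.0
print(sum_of_squares)             # 88.5
print(round(sample_variance, 1))  # 9.8
print(round(sample_sd, 1))        # 3.1
print(round(stdev(salaries), 1))  # 3.1
```

The deviations always sum to zero, which is why they are squared before averaging.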
Example: Using Technology to Find the Standard Deviation
Sample office rental rates (in dollars per square foot per year) for Miami’s central business district are shown in the table. Use a calculator or a computer to find the mean rental rate and the sample standard deviation. (Adapted from: Cushman & Wakefield Inc.)
Office Rental Rates
35.00 33.50 37.00
23.75 26.50 31.25
36.50 40.00 32.00
39.25 37.50 34.75
37.75 37.25 36.75
27.00 35.75 26.00
37.00 29.00 40.50
24.50 33.00 38.00
Solution: Using Technology to Find the Standard Deviation
Sample Mean
Sample Standard Deviation
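If a computer is used instead of a calculator, one possible approach is the Python standard-library `statistics` module, sketched below. The variable name `rates` is illustrative; `stdev` computes the sample standard deviation (dividing by n – 1).

```python
from statistics import mean, stdev

# Office rental rates (dollars per square foot per year), from the table
rates = [35.00, 33.50, 37.00, 23.75, 26.50, 31.25,
         36.50, 40.00, 32.00, 39.25, 37.50, 34.75,
         37.75, 37.25, 36.75, 27.00, 35.75, 26.00,
         37.00, 29.00, 40.50, 24.50, 33.00, 38.00]

sample_mean = mean(rates)
sample_sd = stdev(rates)
print(round(sample_mean, 2), round(sample_sd, 2))  # 33.73 5.09
```

So the mean rental rate is about $33.73 per square foot per year, with a sample standard deviation of about $5.09.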
Interpreting Standard Deviation


Standard deviation is a measure of the typical
amount an entry deviates from the mean.
The more the entries are spread out, the greater
the standard deviation.
Thinking About Variation




Since Statistics is about variation, spread is an
important fundamental concept of Statistics.
Measures of spread help us talk about what we
don’t know.
When the data values are tightly clustered around
the center of the distribution, the IQR and
standard deviation will be small.
When the data values are scattered far from the
center, the IQR and standard deviation will be
large.
Tell -- Draw a Picture

When telling about quantitative variables, start by
making a histogram, boxplot, or stem-and-leaf
display and discuss the shape of the distribution.
Tell -- Shape, Center, and Spread

Next, always report the shape of its distribution,
along with a center and a spread.
• If the shape is skewed, report the median and IQR.
• If the shape is symmetric, report the mean and standard deviation and possibly the median and IQR as well.
Tell -- What About Unusual Features?

If there are multiple modes, try to understand
why. If you identify a reason for the separate
modes, it may be good to split the data into two
groups.

If there are any clear outliers and you are
reporting the mean and standard deviation, report
them with the outliers present and with the
outliers removed. The differences may be quite
revealing.
• Note: The median and IQR are not likely to be affected by the outliers.
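The class-ages data from earlier in the chapter give a quick illustration of this advice. The Python sketch below reports the summaries with and without the age of 65; treating 65 as the outlier is an assumption made here for illustration (it is the one entry far from the rest), and the variable names are hypothetical.

```python
from statistics import mean, median, stdev

ages = [20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
        22, 22, 22, 23, 23, 23, 23, 24, 24, 65]
ages_without_outlier = [a for a in ages if a != 65]

for label, data in [("with outlier", ages),
                    ("without outlier", ages_without_outlier)]:
    print(label,
          "mean:", round(mean(data), 2),
          "sd:", round(stdev(data), 2),
          "median:", median(data))
```

The mean drops from 23.75 to about 21.58 and the standard deviation shrinks considerably once 65 is removed, while the median moves only from 21.5 to 21.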
What have we learned?



We’ve learned how to make a picture for quantitative data
to help us see the story the data have to Tell.
We can display the distribution of quantitative data with a
histogram, stem-and-leaf display, or dotplot.
We’ve learned how to summarize distributions of
quantitative variables numerically.
• Measures of center for a distribution include the median and mean.
• Measures of spread include the range, IQR, and standard deviation.
• Use the median and IQR when the distribution is skewed. Use the mean and standard deviation if the distribution is symmetric.
What have we learned? (cont.)

We’ve learned to Think about the type of variable
we are summarizing.
• All methods of this chapter assume the data are quantitative.
• The Quantitative Data Condition serves as a check that the data are, in fact, quantitative.