Download Chapter 4 - AshLardizabal

Chapter 4
Displaying and Summarizing
Quantitative Data
Objectives
• Histogram
• Stem-and-leaf plot
• Dotplot
• Shape
• Center
• Spread
• Outliers
• Mean
• Median
• Range
• Interquartile range (IQR)
• Percentile
• 5-Number summary
• Resistant
• Variance
• Standard Deviation
Dealing With a Lot of Numbers…
• Summarizing the data will help us
when we look at large sets of
quantitative data.
• Without summaries of the data, it’s
hard to grasp what the data tell us.
• The best thing to do is to make a
picture…
• We can’t use bar charts or pie charts
for quantitative data, since those
displays are for categorical variables.
Reasons for Constructing
Quantitative Frequency Tables
1. Large data sets can be
summarized.
2. Can gain some insight into the
nature of data.
3. Have a basis for constructing a
histogram.
Ways to chart quantitative data
• Histograms and stemplots
These are summary graphs for a single variable. They
are very useful to understand the pattern of variability in
the data.
• Line graphs: time plots
Use when there is a meaningful sequence, like time. The
line connecting the points helps emphasize any change
over time.
• Other graphs to reflect numerical summaries are
Dotplots and Cumulative Frequency Curves (Ogive).
Quantitative Data
HISTOGRAM
Histogram
• To make a histogram we first need
to organize the data using a
quantitative frequency table.
• Two types of quantitative data
1. Discrete – use ungrouped frequency
table to organize.
2. Continuous – use grouped frequency
table to organize.
Quantitative Frequency
Tables – Ungrouped
• What is an ungrouped
frequency table? An
ungrouped frequency table
simply lists the data values with
the corresponding frequency
counts with which each value
occurs.
• Commonly used with discrete
quantitative data.
Quantitative Frequency
Tables – Ungrouped
• Example: The at-rest pulse
rates for 16 athletes at a meet
were 57, 57, 56, 57, 58, 56,
54, 64, 53, 54, 54, 55, 57, 55,
60, and 58. Summarize the
information with an
ungrouped frequency
distribution.
Quantitative Frequency
Tables – Ungrouped
• Example Continued
Note: The (ungrouped)
classes are the
observed values
themselves.
Quantitative Relative Frequency
Tables - Ungrouped
Note: The relative
frequency for a
class is obtained
by computing f/n.
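As a minimal sketch, the ungrouped frequency and relative frequency tables for the pulse-rate example can be tallied in Python. The variable names are illustrative assumptions, not part of the slides; only the data values come from the example.

```python
from collections import Counter

# At-rest pulse rates of the 16 athletes from the example.
pulse_rates = [57, 57, 56, 57, 58, 56, 54, 64, 53, 54, 54, 55, 57, 55, 60, 58]

n = len(pulse_rates)
frequency = Counter(pulse_rates)  # maps each observed value to its count f

print(f"{'Pulse':>5} {'f':>3} {'f/n':>7}")
for value in sorted(frequency):   # the classes are the observed values
    f = frequency[value]
    print(f"{value:>5} {f:>3} {f / n:>7.4f}")
```

Each row lists an observed value, its frequency f, and its relative frequency f/n.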
Quantitative Frequency
Tables – Grouped
• What is a grouped frequency table?
A grouped frequency table is
obtained by constructing classes (or
intervals) for the data, and then
listing the corresponding number of
values (frequency counts) in each
interval.
• Commonly used with continuous
quantitative data.
Quantitative Frequency
Tables – Grouped
• Later, we will encounter a
graphical display called the
histogram. We will see that
grouped frequency tables are
used to construct these
displays.
Quantitative Frequency
Tables – Grouped
• There are several procedures that
one can use to construct a grouped
frequency table.
• However, because of the many
statistical software packages
(MINITAB, SPSS etc.) and graphing
calculators (TI-83 etc.) available
today, it is not necessary to try to
construct such distributions using
pencil and paper.
Quantitative Frequency
Tables – Grouped
• A frequency table should have a
minimum of 5 classes and a
maximum of 20 classes.
• For small data sets, one can use
between 5 and 10 classes.
• For large data sets, one can use up
to 20 classes.
Quantitative Frequency
Tables – Grouped
• Example: The weights of 30 female
students majoring in Physical
Education on a college campus are
as follows: 143, 113, 107, 151, 90,
139, 136, 126, 122, 127, 123, 137,
132, 121, 112, 132, 133, 121, 126,
104, 140, 138, 99, 134, 119, 112,
133, 104, 129, and 123. Summarize
the data with a frequency
distribution using seven classes.
Quantitative Frequency Tables – Grouped
Example Continued
• NOTE: We will introduce the
histogram here to help us
explain a grouped frequency
distribution.
Quantitative Frequency Tables – Grouped
Example Continued
• What is a histogram? A
histogram is a graphical display of a
frequency or a relative frequency
table that uses classes and vertical
(horizontal) bars (rectangles) of
various heights to represent the
frequencies.
Histogram
• The most common graph used to
display one variable quantitative data.
Quantitative Frequency Tables – Grouped
Example Continued
• The MINITAB statistical software
was used to generate the histogram
in the next slide.
• The histogram has seven classes.
• Classes for the weights are along
the x-axis and frequencies are
along the y-axis.
• The number at the top of each
rectangular box represents the
frequency for the class.
Quantitative Frequency Tables – Grouped
Example Continued
Histogram
with 7 classes
for the
weights.
Quantitative Frequency Tables – Grouped
Example Continued
• Observations
• From the histogram, the
classes (intervals) are 85 –
95, 95 – 105,105 – 115 etc.
with corresponding
frequencies of 1, 3, 4, etc.
• We will use this information
to construct the grouped
frequency distribution.
Quantitative Frequency Tables – Grouped
Example Continued
• Observations (continued)
• Observe that the upper
class limit of 95 for the
class 85 – 95 is listed as
the lower class limit for the
class 95 – 105.
• Since the value of 95
cannot be included in both
classes, we will use the
convention that the upper
class limit is not included in
the class.
Quantitative Frequency Tables – Grouped
Example Continued
• Observations (continued)
• That is, the class 85 – 95
should be interpreted as
having the values 85 and
up to 95 but not including
the value of 95.
• Using these observations,
the grouped frequency
distribution is constructed
from the histogram and is
given on the next slide.
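The convention that the upper class limit is not included can be sketched in Python as a half-open test, lower <= value < upper. This is an illustrative sketch only: the slides use MINITAB, and the variable names here are assumptions.

```python
# Weights of the 30 students from the example.
weights = [143, 113, 107, 151, 90, 139, 136, 126, 122, 127,
           123, 137, 132, 121, 112, 132, 133, 121, 126, 104,
           140, 138, 99, 134, 119, 112, 133, 104, 129, 123]

width = 10
lower_limits = list(range(85, 155, width))  # 85, 95, ..., 145 (seven classes)

frequencies = []
for lower in lower_limits:
    upper = lower + width
    # The upper class limit is NOT included in the class.
    frequencies.append(sum(1 for w in weights if lower <= w < upper))

# Relative frequencies rounded to four decimal places.
relative = [round(f / len(weights), 4) for f in frequencies]

for lower, f, rf in zip(lower_limits, frequencies, relative):
    print(f"{lower:>3} - {lower + width:<3} {f:>2} {rf:.4f}")
```

The first three counts agree with the frequencies 1, 3, 4 read from the histogram.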
Quantitative Frequency Tables – Grouped
Example Continued
Quantitative Frequency Tables – Grouped
Example Continued
• Observations (continued)
• In the grouped frequency
distribution, the sum of
the relative frequencies
did not add up to 1. This
is due to rounding to four
decimal places.
• The same observation
should be noted for the
cumulative relative
frequency column.
Creating a Histogram
It is an iterative process—try and try again.
What bin size should you use?
• Not too many bins with either 0 or 1 counts
• Not overly summarized that you lose all the information
• Not so detailed that it is no longer summary
Rule of thumb: Start with 5 to 10 bins.
Look at the distribution and refine your bins.
(There isn’t a unique or “perfect” solution.)
Same data set
Not
summarized
enough
Too summarized
Histograms
Definitions
• Frequency Distributions
• Example
Lower Class Limits
are the smallest numbers that can actually belong to
different classes
Upper Class Limits
are the largest numbers that can actually belong to
different classes
Class Boundaries
are the numbers used to separate classes,
but without the gaps created by class
limits
Class Boundaries (example): -0.5, 99.5, 199.5, 299.5, 399.5, 499.5
Class Midpoints or Class Mark
midpoints of the classes
Class midpoints can be found by
adding the lower class limit to the
upper class limit and dividing the sum
by two.
Class Midpoints (example): 49.5, 149.5, 249.5, 349.5, 449.5
Class Width
is the difference between two consecutive lower class
limits or two consecutive lower class boundaries
Class Width (example): 100 for each of the five classes
Summary of Terminology
• Class - non-overlapping intervals the data is
divided into.
• Class Limits –The smallest and largest
observed values in a given class.
• Class Boundaries – Fall halfway between the
upper class limit for the smaller class and the
lower class limit for larger class. Used to close
the gap between classes.
• Class Width – The difference between the class
boundaries for a given class.
• Class mark – The midpoint of a class.
Constructing A Frequency
Table
1. Decide on the number of classes (should be between
5 and 20) .
2. Calculate the class width (round up):
$$\text{class width} \approx \frac{(\text{highest value}) - (\text{lowest value})}{\text{number of classes}}$$
3. Starting point: Begin by choosing a lower limit of the
first class.
4. Using the lower limit of the first class and class width,
proceed to list the lower class limits.
5. List the lower class limits in a vertical column and
proceed to enter the upper class limits.
6. Go through the data set putting a tally in the
appropriate class for each data value.
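The six steps above can be sketched in Python. This is a sketch under two assumptions not stated on the slide: the data are whole numbers, and the first lower class limit is the smallest data value. The function name `frequency_table` is hypothetical.

```python
def frequency_table(data, num_classes):
    """Return (class width, list of (lower limit, upper limit, frequency))."""
    low, high = min(data), max(data)
    # Step 2: class width, rounded up. Using floor + 1 also bumps an exact
    # quotient up by one, so the largest value still lands in the last class.
    width = (high - low) // num_classes + 1
    table = []
    for i in range(num_classes):
        lower = low + i * width        # steps 3 and 4: lower class limits
        upper = lower + width - 1      # step 5: upper class limits
        # Step 6: tally the values that fall in this class.
        freq = sum(1 for x in data if lower <= x <= upper)
        table.append((lower, upper, freq))
    return width, table


# The 30 student weights from the earlier grouped example, with seven classes.
weights = [143, 113, 107, 151, 90, 139, 136, 126, 122, 127,
           123, 137, 132, 121, 112, 132, 133, 121, 126, 104,
           140, 138, 99, 134, 119, 112, 133, 104, 129, 123]

width, table = frequency_table(weights, 7)
for lower, upper, freq in table:
    print(f"{lower:>3} - {upper:<3} {freq}")
```

Because the starting point differs, these classes (90 to 98, 99 to 107, and so on) are not the ones MINITAB chose in the earlier example. Both tables are valid summaries of the same data.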
Histogram
Then to complete the Histogram, graph the Frequency Table data.
Frequency Histogram vs Relative
Frequency Histogram
Frequency histogram: a bar graph in which the horizontal scale represents the classes of
data values and the vertical scale represents the frequencies.
Frequency Histogram vs Relative
Frequency Histogram
Relative frequency histogram: has the same shape and horizontal scale as a frequency histogram, but the
vertical scale is marked with relative frequencies.
Frequency Histogram vs Relative
Frequency Histogram
Histograms - Facts
• Histograms are useful when the
data values are quantitative.
• A histogram gives an estimate of the
shape of the distribution of the
population from which the sample
was taken.
• If the relative frequencies were
plotted along the vertical axis to
produce the histogram, the shape
will be the same as when the
frequencies are used.
Making Histograms on the
TI-83/84
Use of Stat Plots on the TI-83/84
Raw Data: 548, 405, 375, 400, 475, 450, 412
375, 364, 492, 482, 384, 490, 492
490, 435, 390, 500, 400, 491, 945
435, 848, 792, 700, 572, 739, 572
Frequency Table Data:
Class Limits     Frequency
350 to < 450     11
450 to < 550     10
550 to < 650     2
650 to < 750     2
750 to < 850     2
850 to < 950     1
Quantitative Data
STEM AND LEAF PLOT
Stem-and-Leaf Plots
• What is a stem-and-leaf plot? A stem-and-leaf plot is a data plot that uses part of a
data value as the stem to form groups or
classes and part of the data value as the
leaf.
• Most often used for small or medium sized
data sets. For larger data sets, histograms
do a better job.
• Note: A stem-and-leaf plot has an
advantage over a grouped frequency table
or histogram, since a stem-and-leaf plot
retains the actual data by showing them in
graphic form.
Stemplots
Include key – how to
read the stemplot.
How to make a stemplot:
1) Separate each observation into a stem,
consisting of all but the final (rightmost) digit,
and a leaf, which is that remaining final digit.
Stems may have as many digits as needed.
Use only one digit for each leaf—either round or
truncate the data values to one decimal place
after the stem.
2) Write the stems in a vertical column with the
smallest value at the top, and draw a vertical
line at the right of this column.
3) Write each leaf in the row to the right of its
stem, in increasing order out from the stem.
Original data: 9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70
0|9 = 9
STEM
LEAVES
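The stemplot steps can be sketched in Python for the original data listed above. The sketch assumes non-negative whole numbers, so that the leaf is the final digit and the stem is everything before it. The function name `stemplot` is hypothetical.

```python
from collections import defaultdict


def stemplot(values):
    """Return the rows of a stemplot, one string per stem."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):              # sorted, so leaves come out in order
        stem, leaf = divmod(v, 10)        # stem: all but the last digit
        leaves_by_stem[stem].append(leaf)
    rows = []
    # Include stems with no leaves so gaps in the data stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = " ".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"{stem:>2} | {leaves}".rstrip())
    return rows


data = [9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70]
rows = stemplot(data)
print("Key: 0|9 = 9")
print("\n".join(rows))
```

Stems 1 and 6 are printed with no leaves, because dropping empty stems would hide the gaps in the distribution.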
Stem-and-Leaf Plots
• Example: Consider the following values
– 96, 98, 107, 110, and 112. Construct
a stem-and-leaf plot by using the units
digits as the leaves.
Stem-and-Leaf Plot
Stems and leaves for the
data values.
Stem-and-leaf plot for the
data values.
Key: 09|6 = 96
Stem | Leaf
09 | 6 8
10 | 7
11 | 0 2
Your Turn: Stem-and-Leaf Plots
• A sample of the number of admissions to a
psychiatric ward at a local hospital during the
full phases of the moon is as follows: 22, 30,
21, 27, 31, 36, 20, 28, 25, 33, 21, 38, 32, 35,
26, 19, 43, 30, 30, 34, 27, and 41.
• Display the data in a stem-and-leaf plot with
the leaves represented by the unit digits.
Stem-and-Leaf Plot
Key: 1|9 = 19
Stem | Leaf
1 | 9
2 | 0 1 1 2 5 6 7 7 8
3 | 0 0 0 1 2 3 4 5 6 8
4 | 1 3
Variations of the StemPlot
• Splitting Stems – (too few stems or classes) Split
stems to double the number of stems when all the
leaves would otherwise fall on just a few stems.
• Each stem appears twice.
• Leaves 0-4 go on the 1st stem and leaves 5-9 go on
the 2nd stem.
• Example: data –
120,121,121,123,124,124,125,125,125,126,126,128,129,130,132,
132,133,134,134,134,135,137,138,138,138,139
StemPlot
12 | 0 1 1 3 4 4 5 5 5 6 6 8 9
13 | 0 2 2 3 4 4 4 5 7 8 8 8 9
StemPlot (splitting stems)
12 | 0 1 1 3 4 4
12 | 5 5 5 6 6 8 9
13 | 0 2 2 3 4 4 4
13 | 5 7 8 8 8 9
Stemplots versus Histograms
Stemplots are quick and dirty histograms that can easily be
done by hand, therefore, very convenient for back of the
envelope calculations. However, they are rarely found in
scientific or lay publications.
Stemplots versus Histograms
• Stem-and-leaf displays show the
distribution of a quantitative variable,
like histograms do, while preserving
the individual values.
• Stem-and-leaf displays contain all
the information found in a histogram
and, when carefully drawn, satisfy
the area principle and show the
distribution.
Stem-and-Leaf Example
• Compare the histogram and stem-and-leaf
display for the pulse rates of 24 women at
a health clinic. Which graphical display do
you prefer?
Key: 5|6 = 56
[Figure: histogram and stem-and-leaf display of the 24 pulse rates]
Quantitative Data
DOTPLOTS
Dot Plots
• What is a dot plot? A dot plot is a
plot that displays a dot for each
value in a data set along a
number line. If there are multiple
occurrences of a specific value,
then the dots will be stacked
vertically.
Dotplots
• A dotplot is a simple
display. It just places a
dot along an axis for
each case in the data.
• The dotplot to the right
shows Kentucky Derby
winning times, plotting
each race as its own
dot.
• You might see a dotplot
displayed horizontally
or vertically.
Dot Plot Example:
• The following data shows the length of 50 movies in
minutes. Construct a dot plot for the data.
• 64, 64, 69, 70, 71, 71, 71, 72, 73, 73, 74, 74, 74, 74, 75, 75,
75, 75, 75, 75, 76, 76, 76, 77, 77, 78, 78, 79, 79, 80, 80, 81,
81, 81, 82, 82, 82, 83, 83, 83, 84, 86, 88, 89, 89, 90, 90, 92,
94, 120
Figure 2-5
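A rough text version of the dot plot for the movie lengths can be sketched in Python. It stacks one dot per occurrence beside each value, so the plot runs sideways. The layout and names are illustrative assumptions, not the figure from the slide.

```python
from collections import Counter

# Lengths in minutes of the 50 movies from the example.
movie_lengths = [64, 64, 69, 70, 71, 71, 71, 72, 73, 73, 74, 74, 74, 74,
                 75, 75, 75, 75, 75, 75, 76, 76, 76, 77, 77, 78, 78, 79,
                 79, 80, 80, 81, 81, 81, 82, 82, 82, 83, 83, 83, 84, 86,
                 88, 89, 89, 90, 90, 92, 94, 120]

counts = Counter(movie_lengths)

# One row per minute on the number line; a dot for every occurrence.
for minutes in range(min(movie_lengths), max(movie_lengths) + 1):
    print(f"{minutes:>3} {'.' * counts[minutes]}")
```

The long run of empty rows before 120 makes that value stand out from the rest of the data.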
Dot Plots – Your Turn
The following frequency
distribution shows the
number of defectives
observed by a quality
control officer over a 30
day period. Construct a
dot plot for the data.
Dot Plots – Solution
Ogive - Cumulative
Frequency Curve
Cumulative Frequency and the Ogive
• Histogram displays the distribution of a quantitative variable.
It tells little about the relative standing (percentile, quartile,
etc.) of an individual observation.
• For this information, we use a Cumulative Frequency graph,
called an Ogive (pronounced O-JIVE).
• The Pth percentile of a distribution is a value such that P%
of the data fall at or below it.
Cumulative Frequency
• What is a cumulative
frequency for a class? The
cumulative frequency for a
specific class in a frequency
table is the sum of the
frequencies for all values at or
below the given class.
Cumulative Frequency
Constructing an Ogive
1. Make a frequency table and add a cumulative frequency
column.
2. To fill in the cumulative frequency column, add the counts
in the frequency column that fall in or below the current
class interval.
3. Label and scale the axes and title the graph. Horizontal
axis “classes” and vertical axis “cumulative frequency or
relative cumulative frequency”.
4. Begin the ogive at zero on the vertical axis and lower
boundary of the first class on the horizontal axis. Then
graph each additional Upper class boundary vs.
cumulative frequency for that class.
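The construction steps can be sketched in Python using the grouped frequency table from the TI-83/84 slide earlier in the chapter (classes 350 to < 450 through 850 to < 950). The sketch only computes the points to plot; drawing the curve is left out, and the variable names are assumptions.

```python
from itertools import accumulate

# Class boundaries and frequencies from the TI-83/84 frequency table.
boundaries = [350, 450, 550, 650, 750, 850, 950]
frequencies = [11, 10, 2, 2, 2, 1]

# Step 2: running total of the counts at or below each class.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]

# Step 4: start at zero on the lower boundary of the first class, then pair
# each upper class boundary with the cumulative frequency for that class.
ogive_points = [(boundaries[0], 0)] + list(zip(boundaries[1:], cumulative))

# The same points on a relative cumulative frequency scale.
relative_points = [(x, c / total) for x, c in ogive_points]

for x, c in ogive_points:
    print(f"{x:>4} {c:>3} {c / total:.3f}")
```

Joining these points in order gives the ogive. Reading across from a height of half the total estimates the median.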
Ogive
• A line graph that depicts cumulative
frequencies.
• Used to Find Quartiles and
Percentiles.
Example: Cumulative Frequency Curve
• The frequencies of the scores of 80 students in a test are
given in the following table. Complete the corresponding
cumulative frequency table.
• A suitable table is as follows:
Example continued
• The information provided by a cumulative frequency table
can be displayed in graphical form by plotting the cumulative
frequencies given in the table against the upper class
boundaries, and joining these points with a smooth curve.
• The cumulative frequency curve corresponding to the data
is as follows:
Your Turn:
• The results obtained by 200 students in a mathematics
test are given in the following table.
Draw a cumulative frequency curve and use it to estimate
a) The median mark
b) The number of students who scored less than 22 marks
c) The pass mark if 120 students passed the test
d) The min. mark required to obtain an A grade if 10% of the
students received an A grade.
Solution
• The required cumulative frequency curve is as follows:
a) The median mark: median mark is 26
b) The number of students who scored less than 22 marks: approximately 69 students
scored less than 22 marks
c) The pass mark if 120 students passed the test: pass mark is 28
d) The min. mark required to obtain an A grade if 10% of the students received an A
grade: min. mark required for an A is 38
Percentiles
• Explanation of the term –
percentiles: Percentiles are
numerical values that divide an
ordered data set into 100
groups of values with at most
1% of the data values in each
group.
• The kth percentile is the
number that falls above k% of
the data.
Percentiles
• Explanation of the term – kth
percentile: the kth percentile
for an ordered array of
numerical data is a numerical
value Pk (say) such that k% of
the data values are smaller
than or equal to Pk, and at most
(100 – k)% of the data values
are larger than Pk.
• The idea of the kth percentile
is illustrated on the next slide.
Percentile Corresponding to a Given
Data Value
• The percentile corresponding to a given
data value, say x, in a set is obtained by
using the following formula.
$$\text{Percentile} = \frac{\text{Number of values at or below } x}{\text{Number of values in data set}} \times 100\%$$
Percentile Corresponding to a Given
Data Value
• Example: The shoe sizes, in
whole numbers, for a sample
of 12 male students in a
statistics class were as
follows: 13, 11, 10, 13, 11, 10,
8, 12, 9, 9, 8, and 9.
• What is the percentile rank
for a shoe size of 12?
Percentile Corresponding to a Given
Data Value
• Solution: First, we need to
arrange the values from
smallest to largest.
• The ordered array is given
below: 8, 8, 9, 9, 9, 10, 10, 11,
11, 12, 13, 13.
• Observe that the number of
values at or below the value
of 12 is 10.
Percentile Corresponding to a Given
Data Value
• Solution (continued): The total number of
values in the data set is 12.
• Thus, using the formula, the
corresponding percentile is:
$$\text{Percentile} = \frac{10}{12} \times 100\% \approx 83.3\%$$
The value of 12
corresponds to
approximately the
83rd percentile.
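The shoe-size calculation can be checked with a short Python sketch of the percentile formula. The function name `percentile_rank` is a hypothetical helper, not something defined on the slides.

```python
def percentile_rank(data, x):
    """Percentage of the values in data that are at or below x."""
    at_or_below = sum(1 for value in data if value <= x)
    return at_or_below / len(data) * 100


shoe_sizes = [13, 11, 10, 13, 11, 10, 8, 12, 9, 9, 8, 9]
rank = percentile_rank(shoe_sizes, 12)
print(f"{rank:.1f}")  # 10 of the 12 values are at or below 12
```

Sorting is not needed here, because the count of values at or below x does not depend on their order.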
Procedure for Finding a Data Value
for a Given Percentile
• Assume that we want to determine what data
value falls at some general percentile Pk.
• The following steps will enable you to find a
general percentile Pk for a data set.
• Step 1: Order the data set from smallest to
largest.
• Step 2: Compute the position c of the
percentile. To compute the value of c, use
the following formula, where n is the number of data values:
$$c = \frac{n \cdot k}{100}$$
Procedure for Finding a Data Value
for a Given Percentile
• Step 3, if c is not a whole number: round
up to the next whole number.
• Locate this position in the ordered set.
• The value in this location is the required
percentile.
Procedure for Finding a Data Value
for a Given Percentile
• Step 3, if c is a whole number:
• Locate this position in the ordered set.
• The value in this location is the required
percentile.
Percentile Corresponding to a Given
Data Value
• Example: The data given below
represents the 19 countries with the
largest numbers of total Olympic medals
– excluding the United States, which
had 101 medals – for the 1996 Atlanta
games. Find the 65th percentile for the
data set.
• 63, 65, 50, 37, 35, 41, 25, 23, 27, 21, 17, 17,
20, 19, 22, 15, 15, 15, 15.
Percentile Corresponding to a Given
Data Value
• Solution: First, we need to arrange the
data set in order. The ordered set is:
• 15, 15, 15, 15, 17, 17, 19, 20, 21, 22, 23, 25,
27, 35, 37, 41, 50, 63, 65.
• Next, compute the position of the percentile.
• Here n = 19, k = 65.
• Thus, c = (19 × 65)/100 = 12.35.
• We need to round up to a value 13.
Percentile Corresponding to a Given
Data Value
• Solution (continued): Thus, the 13th value in
the ordered data set will correspond to
the 65th percentile.
• That is P65 = 27.
• Question: Why does a percentile measure
relative position?
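The procedure for finding P65 can be sketched in Python. The sketch follows the rule used on these slides (position c = n·k/100, rounded up when it is not a whole number); other textbooks and software use different percentile conventions. It assumes k is a whole number from 1 to 100, and `percentile_value` is a hypothetical name.

```python
def percentile_value(data, k):
    """Value at the k-th percentile using the round-up position rule."""
    ordered = sorted(data)               # Step 1: order the data
    n = len(ordered)
    # Step 2: position c = n*k/100, rounded up if it is not a whole number.
    # Integer ceiling division avoids floating-point rounding surprises.
    position = -(-n * k // 100)
    return ordered[position - 1]         # positions count from 1


medals = [63, 65, 50, 37, 35, 41, 25, 23, 27, 21, 17, 17,
          20, 19, 22, 15, 15, 15, 15]
p65 = percentile_value(medals, 65)
print(p65)  # 13th value in the ordered set
```

For n = 19 and k = 65 the position is 12.35, which rounds up to 13, matching the worked solution.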
Question: Why does a percentile
measure Relative Position?
Display of the 65th Percentile
along with the data values.
Question: Why does a percentile measure
Relative Position?
• Referring to the diagram, observe that
the value of 27 is such that at most 65%
of the data values are smaller than 27
and at most 35% of the values are larger
than 27.
• This shows that the percentile value of
27 is a measure of location.
• Thus, the percentile gives us an idea of
the relative position of a value in an
ordered data set.
Special Percentiles – Deciles and
Quartiles
• Deciles and quartiles are special
percentiles.
• Deciles divide an ordered data set
into 10 equal parts.
• Quartiles divide the ordered data
set into 4 equal parts.
• We usually denote the deciles by D1,
D2, D3, … , D9.
• We usually denote the quartiles by
Q1, Q2, and Q3.
Quick Tip:
• There are 9 deciles and 3
quartiles.
• Q1 = first quartile = P25
• Q2 = second quartile = P50
• Q3 = third quartile = P75
• D1 = first decile = P10
• D2 = second decile = P20 . . .
• D9 = ninth decile = P90
Think Before You Draw, Again
• Remember the “Make a picture” rule?
• Now that we have options for data
displays, you need to Think carefully about
which type of display to make.
• Before making a stem-and-leaf display, a
histogram, or a dotplot, check the
• Quantitative Data Condition: The data
are values of a quantitative variable
whose units are known.
Shape, Center, and Spread
• When describing a distribution,
make sure to always tell about three
things: shape, center, and spread…
• Actually you should comment on
four things when describing a
distribution. The three above and
any deviations from the shape.
• These deviations from the shape are
called ‘outliers’ and will be
discussed later.
What is the Shape of the
Distribution?
1. Does the histogram have a single,
central hump or several separated
humps?
2. Is the histogram symmetric?
3. Do any unusual features stick out?
Humps
1. Does the histogram have a single,
central hump or several separated
bumps?
• Humps in a histogram are called
modes or peaks.
• A histogram with one main peak is
dubbed unimodal; histograms with
two peaks are bimodal; histograms
with three or more peaks are called
multimodal.
Humps (cont.)
• A bimodal histogram has two apparent peaks:
Humps (cont.)
• A histogram that doesn’t appear to have any mode and
in which all the bars are approximately the same height
is called uniform:
Uniform or Rectangular
Distribution
• A distribution in which every
class has equal frequency. A
uniform distribution is
symmetrical with the added
property that the bars are the
same height.
Symmetry
2. Is the histogram symmetric?
• If you can fold the histogram along a vertical line
through the middle and have the edges match
pretty closely, the histogram is symmetric.
Symmetrical Distribution
• In a symmetrical distribution, the
data values are evenly distributed
on both sides of the mean.
• When the distribution is unimodal,
the mean, the median, and the
mode are all equal to one another
and are located at the center of
the distribution.
Symmetrical Distribution
Symmetry (cont.)
• The (usually) thinner ends of a distribution are called
the tails. If one tail stretches out farther than the other,
the histogram is said to be skewed to the side of the
longer tail.
• In the figure below, the histogram on the left is said to
be skewed left, while the histogram on the right is said
to be skewed right.
Skewed Right Distribution
• In a skewed right
distribution, most of the data
values fall to the left of the
mean, and the “tail” of the
distribution is to the right.
• The mean is to the right of
the median and the mode is to
the left of the median.
Skewed Right Distribution
Skewed Right
Skewed Left Distribution
• In a skewed left distribution,
most of the data values fall
to the right of the mean, and
the “tail” of the distribution
is to the left.
• The mean is to the left of the
median and the mode is to the
right of the median.
Skewed Left Distribution
Skewed Left
Anything Unusual?
3. Do any unusual features stick out?
• Sometimes it’s the unusual features
that tell us something interesting or
exciting about the data.
• You should always mention any
stragglers, or outliers, that stand off
away from the body of the distribution.
• Are there any gaps in the distribution?
If so, we might have data from more
than one group.
Anything Unusual? (cont.)
• The following histogram has outliers—
there are three cities in the leftmost bar:
Deviations from the Overall Pattern
• Outliers – An individual observation that falls outside the
overall pattern of the distribution. Extreme Values –
either high or low.
• Causes:
1. Data Mistake
2. Special nature of some observations
Outliers
An important kind of deviation is an outlier. Outliers are
observations that lie outside the overall pattern of a
distribution. Always look for outliers and try to explain them.
The overall pattern is fairly
symmetrical except for two
states clearly not belonging
to the main trend. Alaska
and Florida have unusual
representation of the
elderly in their population.
A large gap in the
distribution is typically a
sign of an outlier.
Alaska
Florida
Other Common Terms
• Peak – high bar
• Valley – between 2 peaks
• Gap – no data
Numerical Data Properties
Central Tendency
(center)
Variation
(spread)
Shape
Examples – Describing Distributions
It’s often a good idea to think about what the distribution of a data set might look like
before we collect the data. What do you think the distribution of each of the
following data sets will look like? Be sure to discuss its shape. Where do you
think the center might be? How spread out do you think the values will be?
1. Number of Miles run by Saturday morning joggers at a park.
• Roughly symmetric, slightly skewed right. Center around 3 miles. Few over 10
miles.
2. Hours spent by U.S. adults watching football on Thanksgiving Day.
• Bimodal. Center between 1 and 2 hours. Many people watch no football, others
watch most of one or more games. Probably only a few values over 5 hours.
3. Amount of winnings of all people playing a particular state’s lottery last week.
• Strongly skewed to the right, with almost everyone at $0, a few small prizes, with
the winner an outlier.
4. Ages of the faculty members at your school.
• Fairly symmetric, somewhat uniform, perhaps slightly skewed to the right. Center
in the 40’s. Few ages below 25 or above 70.
5. Last digit of phone numbers on your campus.
• Uniform, symmetric. Center near 5. Roughly equal counts for each digit 0-9.
Where is the Center of the
Distribution?
• If you had to pick a single number to
describe all the data what would you pick?
• It’s easy to find the center when a
histogram is unimodal and symmetric—it’s
right in the middle.
• On the other hand, it’s not so easy to find
the center of a skewed histogram or a
histogram with more than one mode.
Measures of Central
Tendency
• A measure of central tendency
for a collection of data values
is a number that is meant to
convey the idea of centralness
for the data set.
• The most commonly used
measures of central tendency
for sample data are the: mean,
median, and mode.
The Mean
• Explanation of the term – mean:
The mean of a set of numerical
(data) values is the (arithmetic)
average for the set of values.
• NOTE: When computing the value
of the mean, the data values can
be population values or sample
values.
• Hence we can compute either the
population mean or the sample
mean
The Mean
• Explanation of the term –
population mean: If the
numerical values are from an
entire population, then the
mean of these values is called
the population mean.
• NOTATION: The population
mean is usually denoted by the
Greek letter µ (read as “mu”).
The Mean
• Explanation of the term –
sample mean: If the
numerical values are from a
sample, then the mean of
these values is called the
sample mean.
• NOTATION: The sample
mean is usually denoted by x̄
(read as “x-bar”).
The Mean -- Example
• Example: What is the mean of the
following 11 sample values?
3, 8, 6, 14, 0, -4, 0, 12, -7, 0, -10
The Mean -- Example (Continued)
• Solution:
$$\bar{x} = \frac{3 + 8 + 6 + 14 + 0 + (-4) + 0 + 12 + (-7) + 0 + (-10)}{11} = \frac{22}{11} = 2$$
The Mean
• Nonresistant – The mean is sensitive to the influence of
extreme values and/or outliers. Skewed distributions pull
the mean away from the center towards the longer tail.
• The mean is located at the balancing point of the
histogram. For a skewed distribution, it is not a good
measure of center.
The Mean
• Nonresistant – Example
• Example – Data: {1,2,3,4,5,6,7}
• The mean is 4
• Add an outlier {1,2,3,4,5,6,7,50}
• New mean is 9.75 – large effect
Quick Tip:
• When a data set has a large
number of values, we sometimes
summarize it as a frequency table.
The frequencies represent the
number of times each value
occurs.
• When the mean is calculated from
a frequency table it is often an
approximation, because the raw
data is sometimes not known.
Calculating Means
• TI-83/84 1-Var Stats
• Using raw data
• Using Frequency table data
Calculating Means on TI-83/84
Raw Data: 548, 405, 375, 400, 475, 450, 412
375, 364, 492, 482, 384, 490, 492
490, 435, 390, 500, 400, 491, 945
435, 848, 792, 700, 572, 739, 572
Calculating Means on TI-83/84
Note: The (ungrouped)
classes are the
observed values
themselves.
Calculating Means on TI-83/84
• Grouped Frequency Table Data:
Class Limits     Frequency
350 to < 450     11
450 to < 550     10
550 to < 650     2
650 to < 750     2
750 to < 850     2
850 to < 950     1
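The earlier tip that a mean calculated from a frequency table is often an approximation can be illustrated in Python. The sketch compares the mean of the raw data with the mean computed from the grouped table, where each class is represented by its midpoint. Taking the midpoint of "350 to < 450" as 400 is an assumption of this sketch.

```python
# Raw data from the TI-83/84 example.
raw = [548, 405, 375, 400, 475, 450, 412,
       375, 364, 492, 482, 384, 490, 492,
       490, 435, 390, 500, 400, 491, 945,
       435, 848, 792, 700, 572, 739, 572]

# Grouped table: class midpoints (assumed) and class frequencies.
midpoints = [400, 500, 600, 700, 800, 900]
frequencies = [11, 10, 2, 2, 2, 1]

raw_mean = sum(raw) / len(raw)

# Weighted mean: each midpoint stands in for every value in its class.
grouped_mean = (sum(m * f for m, f in zip(midpoints, frequencies))
                / sum(frequencies))

print(f"mean from raw data:      {raw_mean:.2f}")
print(f"mean from grouped table: {grouped_mean:.2f}")
```

The two results are close but not equal, because the table no longer records where each value falls inside its class.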
The Median
• Explanation of the term – median:
The median of a set of numerical
(data) values is that numerical
value in the middle when the data
set is arranged in order.
• NOTE: When computing the
value of the median, the data
values can be population values or
sample values.
• Hence we can compute either the
population median or the sample
median.
Center of a Distribution -- Median
• The median is the value with exactly half the data values
below it and half above it.
• It is the middle data
value (once the data
values have been
ordered) that divides
the histogram into
two equal areas
• It has the same units
as the data
Quick Tip:
• When the number of values in
the data set is odd, the median
will be the middle value in the
ordered array.
• When the number of values in
the data set is even, the median
will be the average of the two
middle values in the ordered
array.
The Median -- Example
• Example: What is the
median for the following
sample values?
3, 8, 6, 2, 12, -7, 14, 0, -1, -10, -4
The Median -- Example (Continued)
• Solution: First of all, we need to
arrange the data set in order. The
ordered set is:
-10 -7 -4 -1 0 2 3 6 8 12 14
6th value
The Median -- Example (Continued)
• Solution (Continued): Since
the number of values is odd,
the median will be found in
the 6th position in the
ordered set (To find; data
number divided by 2 and
round up, 11/2 = 5.5⇒6).
• Thus, the value of the
median is 2.
The Median -- Example
• Example: Find the median
age for the following eight
college students.
23 19 32 25 26 22 24 20
The Median – Example (continued)
• Solution: First we have to
order the values as shown below.
19 20 22 23 24 25 26 32
The Median – Example (continued)
• Solution (continued): Since there is an even
number of ages, the median will
be the average of the two
middle values (To find; data
number divided by 2, that
number and the next are the
two middle numbers, 8/2 =
4⇒4th & 5th are the middle
numbers).
• Thus, median = (23 + 24)/2 = 23.5.
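The odd and even rules from the two median examples can be sketched in one Python function. The name `median` is chosen here for illustration; Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Middle value of the ordered data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: position n/2 rounded up, which is index n // 2 counting from 0.
        return ordered[middle]
    # Even n: average the values in positions n/2 and n/2 + 1.
    return (ordered[middle - 1] + ordered[middle]) / 2


sample = [3, 8, 6, 2, 12, -7, 14, 0, -1, -10, -4]   # 11 values
ages = [23, 19, 32, 25, 26, 22, 24, 20]             # 8 values

print(median(sample))  # 6th value of the ordered set
print(median(ages))    # mean of the 4th and 5th values
```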
The Median
The median is the midpoint of a distribution—the number such
that half of the observations are smaller and half are larger.
[Figure: numbered list of the 25 observations, as shown on the slide: 0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
1. Sort observations from smallest to largest.
n = number of observations
______________________________
2. If n is odd, the median is observation
n/2 (round up) down the list
 n = 25
n/2 = 25/2 = 12.5=13
Median = 3.4
3. If n is even, the median is the
mean of the two center observations
n = 24 
n/2 = 12 &13
Median = (3.3+3.4) /2 = 3.35
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
1
2
3
4
5
6
7
8
9
10
11
1
2
3
4
5
6
7
8
9
10
11
0.6
1.2
1.6
1.9
1.5
2.1
2.3
2.3
2.5
2.8
2.9
3.3
3.4
3.6
3.7
3.8
3.9
4.1
4.2
4.5
4.7
4.9
5.3
5.6
The Median
• Resistant – The median is said to
be resistant, because extreme
values and/or outliers have little
effect on the median.
• Example – Data: {1,2,3,4,5,6,7}
• The median is 4
• Add an outlier {1,2,3,4,5,6,7,50}
• New median is 4.5, so the outlier has very little effect
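A quick check of this example, as a small Python sketch using the standard `statistics` module. The comparison with the mean is added here for contrast and is not on the original slide.

```python
from statistics import mean, median

data = [1, 2, 3, 4, 5, 6, 7]
with_outlier = data + [50]      # same data plus one extreme value

# The median barely moves: 4 becomes 4.5
print(median(data), median(with_outlier))

# The mean is not resistant: 4 becomes 9.75
print(mean(data), mean(with_outlier))
```

One extreme value shifts the median by half a unit but more than doubles the mean.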
The Mode
• Explanation of the term –
mode: The mode of a set of
numerical (data) values is the
most frequently occurring
value in the data set.
Quick Tip:
• If all the elements in the data set have the same
frequency of occurrence, then the data set is said to
have no mode.
Example of data set with no mode.
Quick Tip:
• If the data set has one value that occurs
more frequently than the rest of the values,
then the data set is said to be unimodal.
Example of a unimodal data set.
Quick Tip:
• If two data values in the set are tied for the
highest frequency of occurrence, then the
data set is said to be bimodal.
Example of a bimodal set of data.
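The three Quick Tips can be illustrated with a short Python sketch. The helper `describe_modes` and the three small data sets are invented for illustration; `statistics.multimode` (Python 3.8+) returns every value tied for the highest frequency.

```python
from statistics import multimode

def describe_modes(values):
    """Classify a data set as having no mode, one mode, or several."""
    modes = multimode(values)            # all values tied for highest frequency
    if len(modes) == len(set(values)):
        # every distinct value occurs equally often
        return "no mode", []
    if len(modes) == 1:
        return "unimodal", modes
    if len(modes) == 2:
        return "bimodal", modes
    return "multimodal", modes

# Made-up data sets, one for each Quick Tip
print(describe_modes([2, 4, 6, 8]))       # ('no mode', [])
print(describe_modes([1, 2, 2, 3]))       # ('unimodal', [2])
print(describe_modes([1, 1, 2, 3, 3]))    # ('bimodal', [1, 3])
```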
Summary Measures of Center
How Spread Out is the
Distribution?
• Variation matters, and Statistics is about
variation.
• Are the values of the distribution tightly
clustered around the center or more
spread out?
• Always report a measure of spread along
with a measure of center when describing
a distribution numerically.
Measures of Spread
• A measure of variability for a
collection of data values is a
number that is meant to convey
the idea of spread for the data
set.
• The most commonly used
measures of variability for sample
data are the:
• range
• interquartile range
• variance or standard deviation
Spread: Home on the Range
• The range of the data is the difference
between the maximum and minimum
values:
Range = max – min
• A disadvantage of the range is that a
single extreme value can make it very
large and, thus, not representative of the
data overall.
Range
• The range is affected by
outliers (large or small values
relative to the rest of the data
set).
• The range does not use all
the information in the data set,
only the largest and smallest
values.
• Thus it is not a very useful
measure of spread or variation.
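To see how strongly a single extreme value drives the range, here is a small Python sketch. The data values are made up for illustration.

```python
def data_range(values):
    """Range = max - min."""
    return max(values) - min(values)

# Made-up data: the same values with and without one extreme observation
typical = [1, 2, 3, 4, 5, 6, 7]
with_outlier = typical + [50]

print(data_range(typical))        # 6
print(data_range(with_outlier))   # 49
```

Only the largest and smallest values enter the calculation, so the other observations could change freely without affecting the result.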
Spread: The Interquartile Range
• A better way to describe the spread of a
set of data might be to ignore the extremes
and concentrate on the middle of the data.
• The interquartile range (IQR) lets us ignore
extreme data values and concentrate on
the middle of the data.
• To find the IQR, we first need to know
what quartiles are…
Spread: The Interquartile Range
(cont.)
• Quartiles divide the data into four equal
sections.
• One quarter of the data lies below the
lower quartile, Q1
• One quarter of the data lies above the
upper quartile, Q3.
• The quartiles border the middle half of
the data.
• The difference between the quartiles is the
interquartile range (IQR), so
IQR = upper quartile – lower quartile
Finding Quartiles
1. Order the Data
2. Find the median. It divides the data into a lower and
upper half (the median itself is in neither half).
3. Q1 is then the median of the lower half.
4. Q3 is the median of the upper half.
5. Example
Even data
Q1=27, M=39, Q3=50.5
IQR = 50.5 – 27 = 23.5
Odd data
Q1=35, M=46, Q3=54
IQR = 54 – 35 = 19
The Interquartile Range
• The following depicts the idea of the
interquartile range.
IQR = Q3 - Q1
Spread: The Interquartile Range
(cont.)
• The lower and upper quartiles are the 25th and 75th
percentiles of the data, so…
• The IQR contains the middle 50% of the values of the
distribution, as shown in figure:
Example IQR
The first quartile, Q1, is the value in
the sample that has 25% of the data
at or below it.
The third quartile, Q3, is the value in
the sample that has 75% of the data
at or below it.
Data (n = 25 observations, sorted from smallest to largest):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
M = median = 3.4
Q1 = first quartile = 2.2
Q3 = third quartile = 4.35
IQR = Q3 - Q1
= 4.35 - 2.2
= 2.15
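The quartile values in this example can be reproduced with a short Python sketch of the median-of-halves method. The function name `quartiles` is hypothetical, and other software may use slightly different quartile rules that give slightly different answers.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # values below the median position
    upper = ordered[half + n % 2:]     # skip the median itself when n is odd
    return median(lower), median(ordered), median(upper)

data = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
        3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

q1, m, q3 = quartiles(data)
print(round(q1, 2), m, round(q3, 2))   # 2.2 3.4 4.35
print(round(q3 - q1, 2))               # 2.15
```

The results are rounded before printing because averaging decimal values in floating point can leave tiny rounding errors.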
Your Turn:
• The following scores for a 10-point
statistics quiz were reported. What is
the value of the interquartile range?
7 8 9 6 8 0 9 9 9
0 0 7 10 9 8 5 7 9
Solution: IQR = 3
Calculator - IQR
• TI-83 Solution: The following shows
the descriptive statistics output.
Interquartile range = Q3 – Q1 = 9 – 6 = 3.
5-Number Summary
• The 5-number summary of a distribution reports its median,
quartiles, and extremes (maximum and minimum)
• The 5-number summary for the recent tsunami earthquake
magnitudes looks like this:
• Obtain 5-number summary
from 1-Var Stats
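As a rough software counterpart to 1-Var Stats, the sketch below computes a 5-number summary in Python for the quiz scores from the Your Turn exercise. The function name `five_number_summary` is made up, and it assumes quartiles are found as medians of the lower and upper halves.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); quartiles are medians of the halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + n % 2:]     # drop the median itself when n is odd
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

scores = [7, 8, 9, 6, 8, 0, 9, 9, 9, 0, 0, 7, 10, 9, 8, 5, 7, 9]
low, q1, med, q3, high = five_number_summary(scores)

print(low, q1, med, q3, high)   # 0 6 8.0 9 10
print("IQR =", q3 - q1)         # IQR = 3
```

The quartiles agree with the calculator output above (Q1 = 6, Q3 = 9).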
What About Spread? The
Standard Deviation
• A more powerful measure of spread than
the IQR is the standard deviation, which
takes into account how far each data value
is from the mean.
• A deviation is the distance that a data
value is from the mean.
• Since adding all deviations together
would total zero, we square each
deviation and find an average of sorts
for the deviations.
What About Spread? The
Standard Deviation (cont.)
• The variance, notated by $s^2$, is found by
summing the squared deviations and
(almost) averaging them:
$$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$
• The variance is used to calculate the standard deviation.
• The variance will play a role later in our
study, but it is problematic as a measure of
spread: it is measured in squared units, a serious disadvantage!
What About Spread? The
Standard Deviation (cont.)
• The standard deviation, s, is just the
square root of the variance and is
measured in the same units as the original
data.
$$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$$
Procedure for Calculating the Standard
Deviation using Formula
1. Compute the mean $\bar{x}$.
2. Subtract the mean from each individual value to get a
list of the deviations from the mean, $(x - \bar{x})$.
3. Square each of the differences to produce the squares
of the deviations from the mean, $(x - \bar{x})^2$.
4. Add all of the squares of the deviations from the mean
to get $\sum (x - \bar{x})^2$.
5. Divide the sum $\sum (x - \bar{x})^2$ by $(n - 1)$. [variance]
6. Find the square root of the result.
Example:
• Find the standard deviation of the Mulberry
Bank customer waiting times. Those times
(in minutes) are 1, 3, 14.
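The six-step procedure can be followed line by line for these waiting times. The Python sketch below is an illustration only; the variable names are chosen here and do not come from the slides.

```python
import math

times = [1, 3, 14]                        # waiting times in minutes

n = len(times)
mean = sum(times) / n                     # step 1: mean = 6.0
deviations = [x - mean for x in times]    # step 2: [-5.0, -3.0, 8.0]
squared = [d ** 2 for d in deviations]    # step 3: [25.0, 9.0, 64.0]
total = sum(squared)                      # step 4: 98.0
variance = total / (n - 1)                # step 5: 98 / 2 = 49.0
std_dev = math.sqrt(variance)             # step 6: 7.0

print(variance, std_dev)                  # 49.0 7.0
```

The sample standard deviation is 7 minutes, in the same units as the data, while the variance is 49 squared minutes.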
Calculating Standard Deviation
on the TI-83/84
• Use 1-Var Stats
• Sx is the sample standard deviation
• σx is the population standard deviation
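The same distinction exists in most software. As a sketch, Python's standard `statistics` module offers `stdev` (divides by n - 1, like Sx) and `pstdev` (divides by n, like σx). The data reuse the waiting times 1, 3, 14 from the example.

```python
from statistics import pstdev, stdev

times = [1, 3, 14]

sample_sd = stdev(times)        # divides by n - 1, comparable to Sx
population_sd = pstdev(times)   # divides by n, comparable to σx

print(sample_sd)                    # 7.0
print(round(population_sd, 3))      # 5.715
```

For sample data, as in this chapter, the n - 1 version is the one to report.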
Properties of Standard Deviation
• Measures spread about the mean and should only be
used to describe the spread of a distribution when the
mean is used to describe the center (i.e., symmetric
distributions).
• The value of s is never negative. It is zero only when all of the
data values are the same number. Larger values of s
indicate greater amounts of variation.
• Nonresistant: s can increase dramatically due to extreme
values or outliers.
• The units of s are the same as the units of the original
data. This is one reason s is preferred to $s^2$.
Thinking About Variation
• Since Statistics is about variation, spread
is an important fundamental concept of
Statistics.
• Measures of spread help us talk about
what we don’t know.
• When the data values are tightly clustered
around the center of the distribution, the
IQR and standard deviation will be small.
• When the data values are scattered far
from the center, the IQR and standard
deviation will be large.
Summarizing Symmetric
Distributions -- The Mean
• When we have symmetric data, there is an
alternative other than the median.
• If we want to calculate a number, we can
average the data.
• We use the Greek letter sigma to mean
“sum” and write:
$$\bar{y} = \frac{\text{Total}}{n} = \frac{\sum y}{n}$$
The formula says that to find the mean, we
add up all the values of the variable and
divide by the number of data values, n.
Summarizing Symmetric
Distributions -- The Mean (cont.)
• The mean feels like the center because it
is the point where the histogram balances:
Mean or Median?
• Because the median considers only the
order of values, it is resistant to values that
are extraordinarily large or small; it simply
notes that they are one of the “big ones” or
“small ones” and ignores their distance
from center.
• To choose between the mean and median,
start by looking at the data. If the
histogram is symmetric and there are no
outliers, use the mean.
• However, if the histogram is skewed or
has outliers, you are better off with the
median.
Comparing the mean and the median
• The mean and the median are equal when the distribution is exactly symmetric.
• The median is a measure of center that is resistant to skew and outliers. The
mean is not.
[Figure: positions of the mean and median for a symmetric distribution, and for left-skewed and right-skewed distributions]
Mean and Median of a Distribution with Outliers
[Figure: histograms of percent of people dying, without the outliers ($\bar{x} = 3.4$) and with the outliers ($\bar{x} = 4.2$)]
The mean is pulled to the right a lot by the outliers (from 3.4 to 4.2).
The median, on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).
Example
• Observed mean =2.28,
median=3, mode=3.1
• What is the shape of the
distribution and why?
Example
Solution: Skewed left, because mean < median < mode (2.28 < 3 < 3.1).
Left-skewed: Mean < Median < Mode
Symmetric: Mean = Median = Mode
Right-skewed: Mode < Median < Mean
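The comparison used in this example can be written as a rule of thumb in Python. This is a hedged sketch: `likely_shape` and its `tolerance` parameter are invented for illustration, and the result is only a guess that should be confirmed by looking at a histogram.

```python
def likely_shape(mean, median, tolerance=0.0):
    """Rule-of-thumb guess at skew direction from the mean and median."""
    if mean < median - tolerance:
        return "left-skewed"       # mean pulled toward the low tail
    if mean > median + tolerance:
        return "right-skewed"      # mean pulled toward the high tail
    return "roughly symmetric"

print(likely_shape(2.28, 3))    # left-skewed
print(likely_shape(4.2, 3.6))   # right-skewed
```

The first call uses the mean and median from the example; the second uses the values from the distribution with outliers shown earlier.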
Conclusion – Mean or
Median?
• Mean – use with symmetrical
distributions (no outliers),
because it is nonresistant.
• Median – use with skewed
distribution or distribution with
outliers, because it is resistant.
Tell -- Draw a Picture
• When telling about quantitative variables,
start by making a histogram or stem-and-leaf
display and discuss the shape of the
distribution.
Tell -- Shape, Center, and Spread
• Next, always report the shape of its
distribution, along with a center and a
spread.
• If the shape is skewed, report the
median and IQR.
• If the shape is symmetric, report the
mean and standard deviation and
possibly the median and IQR as well.
Tell -- What About Unusual
Features?
• If there are multiple modes, try to
understand why. If you identify a reason
for the separate modes, it may be good to
split the data into two groups.
• If there are any clear outliers and you are
reporting the mean and standard
deviation, report them with the outliers
present and with the outliers removed. The
differences may be quite revealing.
• Note: The median and IQR are not
likely to be affected by the outliers.
What Can Go Wrong?
• Don’t make a histogram of a categorical variable—bar
charts or pie charts should be used for categorical data.
• Don’t look for shape,
center, and spread
of a bar chart.
What Can Go Wrong? (cont.)
• Don’t use bars in every display—save them for histograms
and bar charts.
• Below is a badly drawn plot and the proper histogram for
the number of juvenile bald eagles sighted in a collection
of weeks:
What Can Go Wrong? (cont.)
• Choose a bin width appropriate to the data.
• Changing the bin width changes the appearance of the
histogram:
What Can Go Wrong? (cont.)
• Don’t forget to do a reality check – don’t let the calculator
do the thinking for you.
• Don’t forget to sort the values before finding the median
or percentiles.
• Don’t worry about small differences when using different
methods.
• Don’t compute numerical summaries of a categorical
variable.
• Don’t report too many decimal places.
• Don’t round in the middle of a calculation.
• Watch out for multiple modes.
• Beware of outliers.
• Make a picture ... make a picture ... make a picture!!!
What have we learned?
• We’ve learned how to make a picture for quantitative data to
help us see the story the data have to Tell.
• We can display the distribution of quantitative data with a
histogram, stem-and-leaf display, or dotplot.
• We’ve learned how to summarize distributions of
quantitative variables numerically.
• Measures of center for a distribution include the median
and mean.
• Measures of spread include the range, IQR, and standard
deviation.
• Use the median and IQR when the distribution is skewed.
Use the mean and standard deviation if the distribution is
symmetric.
What have we learned? (cont.)
• We’ve learned to Think about the type of
variable we are summarizing.
• All methods of this chapter assume the
data are quantitative.
• The Quantitative Data Condition
serves as a check that the data are, in
fact, quantitative.
Assignment
• Exercises pg. 72 – 79: #5 - 18, 30 - 33, 43,
44, 48
• Read Ch-4, pg. 44 - 71