P2: FXS
9780521740494c22.xml
CUAU033-EVANS
September 12, 2008
9:40
CHAPTER 22
Describing the distribution of a single variable
Objectives
To introduce the two main types of data: categorical and numerical
To use bar charts to display frequency distributions of categorical data
To use histograms and frequency polygons to display frequency distributions of numerical data
To use cumulative frequency polygons and cumulative relative frequency polygons to display cumulative frequency distributions
To use the stem-and-leaf plot to display numerical data
To use the histogram to display numerical data
To use these plots to describe the distribution of a numerical variable in terms of symmetry, centre, spread and outliers
To define and calculate the summary statistics mean, median, range, interquartile range, variance and standard deviation
To understand the properties of these summary statistics and when each is appropriate
To construct and interpret boxplots, and use them to compare data sets
22.1 Types of variables
A characteristic about which information is recorded is called a variable, because its value is
not always the same. Several types of variable can be identified. Consider the following
situations.
Cambridge University Press • Uncorrected Sample Pages • 978-0-521-61252-4
2008 © Evans, Lipson, Jones, Avery, TI-Nspire & Casio ClassPad material prepared in collaboration with Jan Honnens & David Hibbard
Students answer a question by selecting ‘yes’, ‘no’ or ‘don’t know’.
Students say how they feel about a particular statement by ticking one of ‘strongly agree’,
‘agree’, ‘no opinion’, ‘disagree’ or ‘strongly disagree’.
Students write down the size shoe that they take.
Students write down their height.
These situations give rise to two different types of data. The data arising from the first two
situations are called categorical data, because the data can only be classified by the name of
the category from which they come; there is no quantity associated with each category. The
data arising from the third and fourth situations are called numerical data. These examples
differ slightly from each other in the type of numerical data they each generate. Shoe sizes are
of the form . . . , 6, 6.5, 7, 7.5, . . . . These are called discrete data, because the data can only
take particular values. Discrete data often arise in situations where counting is involved. The
other type of numerical data is continuous data where the variable may take any value
(sometimes within a specified interval). Such data arise when students measure height. In fact,
continuous data often arise when measuring is involved.
Exercise 22A
1 Classify the data which arise from the following situations as categorical or numerical.
a Kindergarten pupils bring along their favourite toy, and they are grouped together under
the headings: ‘dolls’, ‘soft toys’, ‘games’, ‘cars’, and ‘other’.
b The number of students on each of twenty school buses is counted.
c A group of people each write down their favourite colour.
d Each student in a class is weighed in kilograms.
e Each student in a class is weighed and then classified as ‘light’, ‘average’ or ‘heavy’.
f People rate their enthusiasm for a certain rock group as ‘low’, ‘medium’, or ‘high’.
2 Classify the data which arise from the following situations as categorical or numerical.
a The intelligence quotient (IQ) of a group of students is measured using a test.
b A group of people are asked to indicate their attitude to capital punishment by selecting
a number from 1 to 5 where 1 = strongly disagree, 2 = disagree, 3 = undecided,
4 = agree, and 5 = strongly agree.
3 Classify the following numerical data as either discrete or continuous.
a The number of pages in a book.
b The price paid to fill the tank of a car with petrol.
c The volume of petrol used to fill the tank of a car.
d The time between the arrival of successive customers at an autobank teller.
e The number of tosses of a die required before a six is thrown.
22.2 Displaying categorical data: the bar chart
Suppose a group of 130 students were asked to nominate their favourite kind of music under
the categories ‘hard rock’, ‘oldies’, ‘classical’, ‘rap’, ‘country’ or ‘other’. The table shows the
data for the first few students.
Student's name   Favourite music
Daniel           hard rock
Karina           classical
John             country
Jodie            hard rock
The table gives data for individual students. To consider the group as a whole the data
should be collected into a table called a frequency distribution by counting how many of each
of the different values of the variable have been observed.
Counting the number of students who responded to the question on favourite kinds of music
gave the following results in each category.
Type of music   Number of students
Hard rock       62
Other           27
Oldies          20
Classical       15
Rap             3
Country         3
While a clear indication of the group’s preferences can be seen from the table, a visual
display may be constructed to illustrate this. When the data are categorical, the appropriate
display is a bar chart. The categories are indicated on the horizontal axis and the
corresponding numbers in each category shown on the vertical axis.
[Bar chart: number of students (vertical axis, 0 to 70) by type of music (horizontal axis: hard rock, other, oldies, classical, rap, country)]
The order in which the categories are listed on the horizontal axis is not important, as no
order is inherent in the category labels. In this particular bar chart, the categories are listed in
decreasing order by number.
From the bar chart the music preferences for the group of students may be easily compared.
The value which occurs most frequently is called the mode of the variable. Here it can be seen
that the mode is hard rock.
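As a rough illustration, the frequency table for the music data can be turned into a text bar chart and the mode read off programmatically. This is a minimal Python sketch and not part of the original text: the names `music_counts`, `find_mode` and `text_bar_chart` are invented for the example, and each `#` is assumed to stand for two students.

```python
# Frequency distribution of favourite music, copied from the table in the text.
music_counts = {
    "hard rock": 62,
    "other": 27,
    "oldies": 20,
    "classical": 15,
    "rap": 3,
    "country": 3,
}


def find_mode(counts):
    """Return the category with the highest frequency."""
    return max(counts, key=counts.get)


def text_bar_chart(counts, scale=2):
    """Build one line per category; each '#' stands for `scale` students."""
    lines = []
    for category, frequency in counts.items():
        bar = "#" * (frequency // scale)
        lines.append(f"{category:<10} {bar} {frequency}")
    return lines


print("\n".join(text_bar_chart(music_counts)))
print("Mode:", find_mode(music_counts))
```

Because the categories have no inherent order, the dictionary could list them in any order without changing the mode.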
Exercise 22B
1 A group of students were asked to select their favourite type of fast food, with the
following results.
Food type        Number of students
hamburgers       23
chicken          7
fish and chips   6
Chinese          7
pizza            18
other            8
a Draw a bar chart for these data.
b Which is the most popular food type?
2 The following responses were received to a question regarding the return of capital punishment.
strongly agree      21
agree               11
don't know          42
disagree            53
strongly disagree   129
a Draw a bar chart for these data.
b How many respondents either agree or strongly agree?
3 A video shop proprietor took note of the type of films borrowed during a particular day with the following results.
comedy   53
drama    89
horror   42
music    15
other    33
a Construct a bar chart to illustrate these data.
b Which is the least popular film type?
4 A survey of secondary school students' preferred ways of spending their leisure time at home gave the following results.
watch TV          42%
read              13%
listen to music   23%
watch a video     12%
phone friends     4%
other             6%
a Construct a bar chart to illustrate these data.
b What is the most common leisure activity?
22.3 Displaying numerical data: the histogram
In previous studies you have been introduced to various ways of summarising and displaying
numerical data, including dotplots, stem-and-leaf plots, histograms and boxplots. Constructing
a histogram for discrete numerical data is demonstrated in Example 1.
Example 1
The number of siblings reported by each student in Year 11 at a local school is as follows:
2 0 2 3 2 3 4 1 4 0 1 1 3 4 1
2 5 0 3 3 9 0 2 0 4 5 1 1 6 1
0 1 1 0 1 1 1 1 1 2 0 0 3 2 1
Construct a frequency distribution of the number of siblings.
Solution
To construct the frequency distribution count the numbers of students corresponding
to each of the numbers of siblings, as shown.
Number   Frequency
0        9
1        15
2        7
3        6
4        4
5        2
6        1
7        0
8        0
9        1
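The counting step described in the solution can be sketched in Python. This is an illustrative sketch rather than the textbook's method: it uses `collections.Counter` from the standard library, and the variable names `siblings` and `frequency_table` are chosen for the example.

```python
from collections import Counter

# The 45 responses from Example 1.
siblings = [2, 0, 2, 3, 2, 3, 4, 1, 4, 0, 1, 1, 3, 4, 1,
            2, 5, 0, 3, 3, 9, 0, 2, 0, 4, 5, 1, 1, 6, 1,
            0, 1, 1, 0, 1, 1, 1, 1, 1, 2, 0, 0, 3, 2, 1]

counts = Counter(siblings)

# List every value from the minimum to the maximum so that values
# with zero frequency (7 and 8 here) still appear in the table.
frequency_table = [(value, counts[value])
                   for value in range(min(siblings), max(siblings) + 1)]

print("Number  Frequency")
for value, frequency in frequency_table:
    print(f"{value:>6}  {frequency:>9}")
```

Listing the zero-frequency values explicitly matters for the histogram, where the gap between 6 and 9 should remain visible.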
A histogram looks similar to a bar chart, but because the data are numeric there is a
natural order to the plot which may not occur with a bar chart. Usually for discrete data
the actual data values are located at the middle of the appropriate column, as shown.
[Histogram: frequency (vertical axis, 0 to 15) against number of siblings (horizontal axis, 0 to 9)]
An alternative display for a frequency distribution is a frequency polygon. It is formed by
plotting the values in the frequency histogram with points, which are then joined by straight
lines. A frequency polygon for the data in Example 1 is shown by the red line in this diagram.
[Histogram with frequency polygon: frequency (vertical axis, 0 to 15) against number of siblings (horizontal axis, 0 to 9)]
When the range of responses is large it is usual to gather the data together into sub-groups
or class intervals. The number of data values corresponding to each class interval is called the
class frequency.
Class intervals should be chosen according to the following principles:
Every data value should be in an interval
The intervals should not overlap
There should be no gaps between the intervals.
The choice of intervals can vary, but generally a division which results in about 5 to
15 groups is preferred. It is also usual to choose an interval width which is easy for the reader
to interpret, such as 10 units, 100 units, 1000 units etc (depending on the data). By convention,
the beginning of the interval is given the appropriate exact value, rather than the end. For
example, intervals of 0–49, 50–99, 100–149 would be preferred over the intervals 1–50,
51–100, 101–150 etc.
Example 2
A researcher asked a group of people to record how many cups of coffee they drank in a
particular week. Here are her results.
0 5 8 0 9 10 23 25 0 17 14 3 6 19 25
25 0 0 0 0 34 32 0 0 30 0 4 0 33 23
0 32 13 21 22 6 0 2 28 25 14 20 12 17 16
Construct a frequency distribution and hence a histogram of these data.
Solution
Because there are so many different results and they are spread over a wide range, the data are summarised into class intervals. As the minimum value is 0 and the maximum is 34, intervals of width 5 would be appropriate, giving the frequency distribution shown in the table.
Number of cups of coffee   Frequency
0–4                        16
5–9                        5
10–14                      5
15–19                      4
20–24                      5
25–29                      5
30–34                      5
The corresponding histogram may then be drawn.
[Histogram: frequency (vertical axis, 0 to 20) against number of cups of coffee (horizontal axis, 0 to 35)]
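Grouping whole-number data into class intervals of equal width can be sketched with integer division. The following Python sketch is an illustration only: the function name `class_frequencies` is invented here, and it assumes non-negative whole-number data with intervals starting at 0, as in Example 2.

```python
from collections import Counter

# The 45 responses from Example 2 (cups of coffee in a week).
cups = [0, 5, 8, 0, 9, 10, 23, 25, 0, 17, 14, 3, 6, 19, 25,
        25, 0, 0, 0, 0, 34, 32, 0, 0, 30, 0, 4, 0, 33, 23,
        0, 32, 13, 21, 22, 6, 0, 2, 28, 25, 14, 20, 12, 17, 16]

WIDTH = 5  # class interval width chosen in the solution


def class_frequencies(data, width):
    """Count values in the intervals 0 to width-1, width to 2*width-1, and so on."""
    counts = Counter(value // width for value in data)
    last = max(data) // width
    return [(k * width, k * width + width - 1, counts[k])
            for k in range(last + 1)]


table = class_frequencies(cups, WIDTH)
for low, high, frequency in table:
    print(f"{low:>2}-{high:<2}  {frequency}")
```

Integer division places every value in exactly one interval, so the three principles for class intervals (every value covered, no overlaps, no gaps) hold automatically.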
Example 2 was concerned with a discrete numerical variable. When constructing a frequency
distribution of continuous data, the data are again grouped, as shown in Example 3.
Example 3
The following are the heights (in cm) of the players in a basketball club, measured to the nearest
millimetre.
178.1 183.3 192.4 196.3 185.6 180.3 203.7 189.6
173.3 182.0 191.1 183.9 193.4 183.6 189.7 177.7
183.1 184.5 191.1 184.1 193.0 185.8 180.4 183.8
188.3 189.1 180.0 174.7 189.5 184.6 202.4 170.9
178.6 194.7 185.3 188.7 180.1 170.5 179.3 193.8
178.9
Construct a frequency distribution and hence a histogram of these data.
Solution
From the data it seems that intervals of width 5 will be suitable. All values of the variable which are 170 or more, but less than 175, have been included in the first interval. The second interval includes values from 175 to less than 180, and so on for the rest of the table.
Player heights   Frequency
170–             4
175–             5
180–             13
185–             9
190–             7
195–             1
200–             2
The histogram of these data is shown here.
[Histogram: frequency (vertical axis, 0 to 15) against player heights (horizontal axis, 170 to 205)]
The interval in a frequency distribution which has the highest class frequency is
called the modal class. Here the modal class is 180.0–184.9.
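For continuous data the intervals are half-open: a value belongs to an interval if it is at least the lower bound and strictly less than the upper bound. A minimal Python sketch of this grouping for the height data follows; the helper name `bin_start` and the constants `START` and `WIDTH` are assumptions made for the illustration.

```python
import math
from collections import Counter

# The 41 player heights (cm) from Example 3.
heights = [178.1, 183.3, 192.4, 196.3, 185.6, 180.3, 203.7, 189.6,
           173.3, 182.0, 191.1, 183.9, 193.4, 183.6, 189.7, 177.7,
           183.1, 184.5, 191.1, 184.1, 193.0, 185.8, 180.4, 183.8,
           188.3, 189.1, 180.0, 174.7, 189.5, 184.6, 202.4, 170.9,
           178.6, 194.7, 185.3, 188.7, 180.1, 170.5, 179.3, 193.8,
           178.9]

START, WIDTH = 170, 5  # first lower bound and interval width


def bin_start(value, start=START, width=WIDTH):
    """Lower bound of the interval [low, low + width) that contains value."""
    return start + width * math.floor((value - start) / width)


counts = Counter(bin_start(h) for h in heights)
table = [(low, counts[low])
         for low in range(START, int(max(heights)) + 1, WIDTH)]

# The modal class is the interval with the highest class frequency.
modal_low, modal_frequency = max(table, key=lambda row: row[1])

for low, frequency in table:
    print(f"{low}-  {frequency}")
print(f"Modal class: {modal_low} to less than {modal_low + WIDTH}")
```

Note that 180.0 falls in the 180 interval, not the 175 interval, because the lower bound is included and the upper bound is not.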
Using the TI-Nspire
The calculator can be used to construct a histogram for numerical data. This will be
illustrated using the basketball player height data from Example 3.
The data are most easily entered in a Lists & Spreadsheet application ( 3).
First, use the up/down arrows to name the first column height.
Then enter each of the 41 numbers as shown.
Open a Data & Statistics application ( 5) to graph the data. At first the data displays as shown.
Specify the x variable by selecting Add X Variable from the Plot Properties menu (b 2 4) and selecting height. The data now displays as shown.
(Note: It is also possible to use the NavPad to move down below the x-axis and click to add the x variable.)
Select Histogram from the Plot Type menu (b 1 3). The data now displays as shown.
Select Bin Settings from the Histogram Properties submenu of the Plot Properties menu (b 2 2 2).
Let width = 5 and Alignment = 170.
Finally, select Zoom, Data from the Window/Zoom menu (b 5 2) to display the data as shown.
Using the Casio ClassPad
The calculator can be used to construct a histogram for numerical data. This will be
illustrated using the basketball player height data from Example 3.
Enter the data into list1, tapping EXE to enter and move down the column.
Tap SetGraph, Setting . . . and the tab for Graph 1, enter the settings shown and tap SET.
Tap SetGraph, StatGraph1 and then tap the box to tick and select the graph.
Produce the graph, selecting HStart = 170 (the left bound of the histogram) and HStep = 5 (the desired interval width) when prompted. The histogram is produced as shown.
With the graph window selected (bold border) tap 6 to adjust the viewing window for the graph.
Tap Analysis, Trace and use the navigator key to move from column to column and display the count for that column.
Relative and percentage frequencies
When frequencies are expressed as a proportion of the total number they are called relative
frequencies. By expressing the frequencies as relative frequencies more information is
obtained about the data set. Multiplying the relative frequencies by 100 readily converts them
to percentage frequencies, which are easier to interpret.
An example of the calculation of relative and percentage frequencies is shown in
Example 4.
Example 4
Construct a relative frequency distribution and a percentage frequency distribution for the
player height data.
Solution
Player heights (cm)   Frequency   Relative frequency   Percentage frequency
170–                  4           4/41 = 0.10          10%
175–                  5           5/41 = 0.12          12%
180–                  13          13/41 = 0.32         32%
185–                  9           9/41 = 0.22          22%
190–                  7           7/41 = 0.17          17%
195–                  1           1/41 = 0.02          2%
200–                  2           2/41 = 0.05          5%
From this table it can be seen, for example, that nine out of forty-one, or 22% of players, have heights from 185 cm to less than 190 cm.
Both the relative frequency histogram and the percentage frequency histogram are identical to
the frequency histogram—only the vertical scale is changed. To construct either of these
histograms from a list of data use a graphics calculator to construct the frequency histogram,
and then convert the individual frequencies to either relative frequencies or percentage
frequencies one by one as required.
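The conversion from frequencies to relative and percentage frequencies is a single division per class. The short Python sketch below reproduces the table in Example 4; the dictionary `frequencies` (keyed by the lower bound of each interval) and the list `rows` are names chosen for this illustration.

```python
# Class frequencies from Example 4, keyed by the lower bound of each interval.
frequencies = {170: 4, 175: 5, 180: 13, 185: 9, 190: 7, 195: 1, 200: 2}

total = sum(frequencies.values())  # 41 players

rows = []
for low, frequency in frequencies.items():
    relative = frequency / total
    # Relative frequency to 2 decimal places, percentage to the nearest whole number.
    rows.append((low, frequency, round(relative, 2), round(100 * relative)))

for low, frequency, relative, percentage in rows:
    print(f"{low}-  {frequency:>2}  {relative:.2f}  {percentage}%")
```

Rounding is applied only when the values are reported, so each percentage is computed from the exact proportion rather than from an already rounded relative frequency.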
Cumulative frequency distribution
To answer questions concerning the number or proportion of the data values which are less
than a given value a cumulative frequency distribution, or a cumulative relative frequency
distribution can be constructed. In both a cumulative frequency distribution and a cumulative
relative frequency distribution, the numbers of observations in each class are accumulated from
low to high values of the variable.
Example 5
Construct a cumulative frequency distribution and a cumulative relative frequency distribution
for the data in Example 4.
Solution
Player heights (cm)   Frequency   Cumulative frequency   Cumulative relative frequency
<170                  0           0                      0
<175                  4           4                      0.10
<180                  5           9                      0.22
<185                  13          22                     0.54
<190                  9           31                     0.76
<195                  7           38                     0.93
<200                  1           39                     0.95
<205                  2           41                     1.00
Each cumulative frequency was obtained by adding preceding values of the frequency.
In the same way the cumulative relative frequencies were obtained by adding
preceding relative frequencies. Thus it can be said that a proportion of 0.54, or 54%,
of players are less than 185 cm tall.
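The running totals in Example 5 can be sketched with `itertools.accumulate` from the Python standard library. This is an illustrative sketch under the assumption that the class frequencies are already known; the list names are chosen for the example.

```python
from itertools import accumulate

# Upper bounds of the classes and their frequencies, from Example 5.
upper_bounds = [175, 180, 185, 190, 195, 200, 205]
frequencies = [4, 5, 13, 9, 7, 1, 2]

# Running total of the frequencies, from low to high values.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]

# Dividing each running total by the overall total gives the cumulative
# relative frequency, rounded here only for display.
cumulative_relative = [round(c / total, 2) for c in cumulative]

for bound, c, r in zip(upper_bounds, cumulative, cumulative_relative):
    print(f"<{bound}  {c:>2}  {r:.2f}")
```

Dividing the cumulative counts by the total, rather than adding rounded relative frequencies, avoids building up rounding error; for these data both approaches give the same two-decimal values.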
A graphical representation of a cumulative frequency distribution is called a cumulative frequency polygon and has a distinctive appearance, as it always starts at zero and is non-decreasing. This graph shows, on the vertical axis, the number of players shorter than any height given on the horizontal axis. The cumulative relative frequency distribution could also be plotted as a cumulative relative frequency polygon, which would differ from the cumulative frequency polygon only in the scale on the vertical axis, which would run from 0 to 1.
[Cumulative frequency polygon: cumulative frequency (vertical axis, 0 to 40) against player heights (horizontal axis, 170 to 205)]
Exercise 22C
Example 1
1 The number of pets reported by each student in a class is given in the following table:
2 0 3 2 4 1 0 1 3 4 2
5 3 3 0 2 4 5 1 6 0 1
Construct a frequency distribution of the numbers of pets reported by each student.
2 The number of children in the family for each student in a class is shown in this histogram.
[Histogram: number of students (vertical axis, 0 to 10) against size of family (horizontal axis, 1 to 10)]
a How many students are the only child in a family?
b What is the most common number of children in the family?
c How many students come from families with six or more children?
d How many students are there in the class?
3 The following histogram gives the scores on a general knowledge quiz for a class of Year 11 students.
[Histogram: number of students (vertical axis, 0 to 10) against marks (horizontal axis, 0 to 100)]
a How many students scored from 10–19 marks?
b How many students attempted the quiz?
c What is the modal class?
d If a mark of 50 or more is designated as a pass, how many students passed the quiz?
Examples 2, 4
4 The maximum temperatures for several capital cities around the world on a particular day, in degrees Celsius, were:
17 16 17 31 26 15 23 19
36 18 28 25 32 25 36 22
17 30 45 24 12 23 17 29
32 33 19 32 2 33 37 38
a Use a class interval of 5 to construct a frequency distribution for these data.
b Construct the corresponding relative frequency distribution.
c Draw a histogram from the frequency distribution.
d What percentage of cities had a maximum temperature of less than 25°C?
Examples 3, 5
5 A student purchases 21 new text books from a school book supplier with the following prices (in dollars).
21.65 7.80 8.90 14.95 3.50 17.15 12.80
7.99 4.55 7.95 42.98 21.95 32.50 18.50
7.60 23.99 19.95 5.99 23.99 3.20 14.50
a Draw a histogram of these data using appropriate class intervals.
b What is the modal class?
c Construct a cumulative frequency distribution for these data and draw the cumulative
frequency polygon.
6 A group of students were asked to draw a line which they estimated to be the same length
as a 30 cm ruler. The lines were then measured (in cm) with the following results.
30.3 32.2 32.1 30.9 30.1 31.2 31.2
31.6 30.7 32.3 32.1 32.1 31.3 31.4
30.8 30.7 31.8 29.7 32.8 32.9 30.1
31.0 31.9 28.9 33.3 29.4 30.7 31.6
a Construct a histogram of the frequency distribution.
b Construct a cumulative frequency distribution for these data and draw the cumulative
frequency polygon.
c Write a sentence to describe the students’ performance on this task.
7 The following are the marks obtained by a group of Year 11 Chemistry students on the end
of year exam.
21 33 47 49 52 52 58 59 63 68 68 71 72 82 92
31 47 48 49 52 53 59 59 65 68 70 71 72 91 99
a Using a graphics calculator, or otherwise, construct a histogram of the frequency
distribution.
b Construct a cumulative frequency distribution for these data and draw the cumulative
frequency polygon.
c Write a sentence to describe the students’ performance on this exam.
8 The following 50 values are the lengths (in metres) of some par 4 golf holes from
Melbourne golf courses.
302 371 376 366 398 272 334 332 361 407
311 369 338 299 337 351 334 320 321 371
338 320 321 361 266 325 374 364 312 354
314 364 317 305 331 307 353 362 408 409
336 366 310 245 385 310 260 280 279 260
a Construct a histogram of the frequency distribution.
b Construct a cumulative frequency distribution for these data and draw the cumulative
frequency polygon.
c Use the cumulative frequency polygon to estimate:
i the proportion of par 4 holes below 300 m in length
ii the proportion of par 4 holes 360 m or more in length
iii the length which is exceeded by 90% of the par 4 holes.
22.4 Characteristics of distributions of numerical variables
Distributions of numerical variables are characterised by their shapes and special features such
as centre and spread.
Two distributions are said to differ in centre if the values of the variable in one distribution
are generally larger than the values of the variable in the other distribution. Consider, for
example, the following histograms shown on the same scale.
[Histograms a and b, drawn on the same horizontal scale (0 to 15)]
It can be seen that plot b is identical to plot a but moved horizontally several units to the
right, indicating that these distributions differ in the location of their centres.
The next pair of histograms also differ, but not in the same way. While both histograms are
centred at about the same place, histogram d is more spread out. Two distributions are said to
differ in spread if the values of the variable in one distribution tend to be more spread out than
the values of the variable in the other distribution.
[Histograms c and d, drawn on the same horizontal scale (0 to 15)]
A distribution is said to be symmetric if it forms a mirror image of itself when folded in the
‘middle’ along a vertical axis; otherwise it is said to be skewed. Histogram e is perfectly
symmetrical, while f shows a distribution which is approximately symmetric.
[Histograms e and f, drawn on the same horizontal scale (0 to 15)]
If a histogram has a short tail to the left and a long tail pointing to the right it is said to be
positively skewed (because the long tail extends towards the positive end of the distribution) as
shown in histogram g.
If a histogram has a short tail to the right and a long tail pointing to the left it is said to be
negatively skewed (because the long tail extends towards the negative end of the distribution),
as shown in histogram h.
[Histogram g: positively skewed; histogram h: negatively skewed (horizontal scale 0 to 15)]
Knowing whether a distribution is skewed or symmetric is important as this gives
considerable information concerning the choice of appropriate summary statistics, as will be
seen in the next section.
Exercise 22D
1 Do the following pairs of distributions differ in centre, spread, both or neither?
[Three pairs of histograms, labelled a, b and c]
2 Describe the shape of each of the following histograms.
[Three histograms, labelled a, b and c]
3 What is the shape of the histogram drawn in Question 6 of Exercise 22C?
4 What is the shape of the histogram drawn in Question 7 of Exercise 22C?
5 What is the shape of the histogram drawn in Question 8 of Exercise 22C?
22.5 Stem-and-leaf plots
An informative data display for a small (fewer than 50 values) numerical data set is the
stem-and-leaf plot. The construction of the stem-and-leaf plot is illustrated in Example 6.
Example 6
By the end of 2004 the number of test matches played, as captain, by each of the Australian
cricket captains was:
3 16 2 1 8 3 6 4 8 21 2 15 10 6
10 11 2 5 25 5 24 1 24 2 17 1 5 28
1 39 2 25 1 30 48 7 28 93 50 57 9 6
Construct a stem-and-leaf plot of these data.
Solution
To make a stem-and-leaf plot find the smallest and the largest data values. From the table above, the smallest value is 1, which is given a 0 in the tens column, and the largest is 93, which has a 9 in the tens column. This means that the stems are chosen to be from 0–9. These are written in a column with a vertical line to their right, as shown.
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
The units for each data point are then entered to the right of the dividing line. They are
entered initially in the order in which they appear in the data. When all data points are
entered in the table, the stem-and-leaf plot looks like this.
0 | 3 2 1 8 3 6 4 8 2 6 2 5 5 1 2 1 5 1 2 1 7 9 6
1 | 6 5 0 0 1 7
2 | 1 5 4 4 8 5 8
3 | 9 0
4 | 8
5 | 0 7
6 |
7 |
8 |
9 | 3
To complete the plot the leaves are ordered, and a key added to specify the place
value of the stem and the leaves.
0 | 1 1 1 1 1 2 2 2 2 2 3 3 4 5 5 5 6 6 6 7 8 8 9
1 | 0 0 1 5 6 7
2 | 1 4 4 5 5 8 8
3 | 0 9
4 | 8
5 | 0 7
6 |
7 |
8 |
9 | 3
3 | 9 indicates 39 matches
It can be seen from this plot that one captain has led Australia in many more test matches than
any other (Allan Border, who captained Australia in 93 test matches). When a value sits away
from the main body of the data it is called an outlier.
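The construction steps in Example 6 (split each value into a tens stem and a units leaf, then order the leaves) can be sketched in Python. This is a minimal illustration, assuming whole-number data below 100; the function name `stem_and_leaf` is invented for the example.

```python
from collections import defaultdict

# Test matches played as captain, from Example 6.
matches = [3, 16, 2, 1, 8, 3, 6, 4, 8, 21, 2, 15, 10, 6,
           10, 11, 2, 5, 25, 5, 24, 1, 24, 2, 17, 1, 5, 28,
           1, 39, 2, 25, 1, 30, 48, 7, 28, 93, 50, 57, 9, 6]


def stem_and_leaf(data):
    """Return (stem, ordered leaves) rows with tens as stems and units as leaves."""
    leaves = defaultdict(list)
    for value in sorted(data):          # sorting first keeps each row ordered
        stem, leaf = divmod(value, 10)  # e.g. 39 -> stem 3, leaf 9
        leaves[stem].append(leaf)
    # Include empty stems so that gaps in the data stay visible.
    return [(stem, leaves[stem])
            for stem in range(min(leaves), max(leaves) + 1)]


plot = stem_and_leaf(matches)
for stem, row in plot:
    print(f"{stem} | {' '.join(str(leaf) for leaf in row)}")
print("3 | 9 indicates 39 matches")
```

Keeping the empty stems 6, 7 and 8 in the output is what makes the outlier at 93 stand apart from the main body of the data.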
Stem-and-leaf plots have the advantage of retaining all the information in the data set while
achieving a display not unlike that of a histogram (turned on its side). In addition, a
stem-and-leaf plot clearly shows:
the range of values
where the values are concentrated
the shape of the data set
whether there are any gaps in which no values are observed
any unusual values (outliers).
Grouping the leaves in tens is simplest—other convenient groupings are in fives or twos, as
shown in Example 7.
Example 7
The birth weights, in kilograms, of the first 30 babies born at a hospital in a selected month are
as follows.
2.9 3.7 2.8 2.7 3.6 3.5 3.5 3.2 3.3 3.6
2.9 3.1 2.8 3.2 3.0 3.6 2.5 4.2 3.7 2.6
3.2 3.6 3.8 2.4 3.6 3.0 4.3 2.9 4.2 3.2
Construct a stem-and-leaf plot of these data.
Solution
A stem-and-leaf plot of the birth weights, with the stem representing units and the
leaves representing one-tenth of a unit, may be constructed.
2 | 4 5 6 7 8 8 9 9 9
3 | 0 0 1 2 2 2 2 3 5 5 6 6 6 6 6 7 7 8
4 | 2 2 3
3 | 0 indicates 3.0 kilograms
The plot, which allows one row for each different stem, appears to be too compact.
These data may be better displayed by constructing a stem-and-leaf plot with two rows
for each stem. These rows correspond to the digits {0, 1, 2, 3, 4} in the first row and
{5, 6, 7, 8, 9} in the second row.
2 | 4
2 | 5 6 7 8 8 9 9 9
3 | 0 0 1 2 2 2 2 3
3 | 5 5 6 6 6 6 6 7 7 8
4 | 2 2 3
3 | 0 indicates 3.0 kilograms
The only other possibility for a stem-and-leaf plot is one which has five rows per
stem. These rows correspond to the digits {0, 1}, {2, 3}, {4, 5}, {6, 7} and {8, 9}.
2 | 4 5
2 | 6 7
2 | 8 8 9 9 9
3 | 0 0 1
3 | 2 2 2 2 3
3 | 5 5
3 | 6 6 6 6 6 7 7
3 | 8
4 |
4 | 2 2 3
3 | 0 indicates 3.0 kilograms
None of the stem-and-leaf displays shown are correct or incorrect. A stem-and-leaf plot is
used to explore data and more than one may need to be constructed before the most
informative one is obtained. Again, from 5 to 15 rows is generally the most helpful, but this
may vary in individual cases.
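The three displays in Example 7 can be checked with a short program. The following is a minimal sketch in Python, not part of the original text: the function name stem_and_leaf and its rows_per_stem parameter are illustrative assumptions, and the code assumes data recorded to one decimal place, as the birth weights are.

```python
from collections import defaultdict

# Birth weights (kg) from Example 7.
weights = [2.9, 3.7, 2.8, 2.7, 3.6, 3.5, 3.5, 3.2, 3.3, 3.6,
           2.9, 3.1, 2.8, 3.2, 3.0, 3.6, 2.5, 4.2, 3.7, 2.6,
           3.2, 3.6, 3.8, 2.4, 3.6, 3.0, 4.3, 2.9, 4.2, 3.2]


def stem_and_leaf(values, rows_per_stem=1):
    """Return the rows of a stem-and-leaf plot as (stem, leaves) pairs.

    Assumes one-decimal data: the stem is the units part and the leaf is
    the tenths digit. rows_per_stem must be 1, 2 or 5.
    """
    if rows_per_stem not in (1, 2, 5):
        raise ValueError("rows_per_stem must be 1, 2 or 5")
    width = 10 // rows_per_stem            # leaf digits covered by each row
    rows = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)             # 3.6 -> 36, avoids float error
        stem, leaf = divmod(tenths, 10)
        rows[(stem, leaf // width)].append(leaf)
    last = max(rows)
    stem, part = min(rows)
    plot = []
    while (stem, part) <= last:            # keep empty rows so gaps stay visible
        plot.append((stem, rows.get((stem, part), [])))
        part += 1
        if part == rows_per_stem:
            stem, part = stem + 1, 0
    return plot


for rows_per_stem in (1, 2, 5):
    print(f"{rows_per_stem} row(s) per stem")
    for stem, leaves in stem_and_leaf(weights, rows_per_stem):
        print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Printing all three versions side by side makes it easier to judge which number of rows gives the most informative display.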
When the data have too many digits for a convenient stem-and-leaf plot they should be
rounded or truncated. Truncating a number means simply dropping off the unwanted digits.
So, for example, a value of 149.99 would become 149 if truncated to three digits, but 150 if
rounded to three digits. Since the object of a stem-and-leaf display is to give a feeling for the
shape and patterns in the data set, the decision on whether to round or truncate is not very
important; however, generally when constructing a stem-and-leaf display the data is truncated,
as this is what commonly used data analysis computer packages will do.
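The difference between truncating and rounding can be seen with the value 149.99 used above. This is a small illustrative sketch in Python using only the standard library; the variable names are assumptions made for the example.

```python
import math

value = 149.99

truncated = math.trunc(value)   # drop the unwanted digits: 149
rounded = round(value)          # nearest whole number: 150

print(truncated, rounded)
```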
Some of the most interesting investigations in statistics involve comparing two or more data
sets. Stem-and-leaf plots are useful displays for the comparison of two data sets, as shown in
the following example.
Example 8
The following table gives the number of disposals by members of the Port Adelaide and Brisbane
football teams in the 2004 AFL Grand Final.
Port Adelaide
25 20 19 18 18 17 16 15 14 13 12
12 11 11 11 11 10 10 9 9 7 7
Brisbane
25 19 19 18 17 16 15 15 13 13 13
10 10 9 9 8 8 7 6 5 4 0
Construct back to back stem-and-leaf plots of these data.
Solution
To compare the two groups, the stem-and-leaf plots are drawn back to back, using two
rows per stem.
      Port Adelaide |   | Brisbane
                    | 0 | 0 4
            9 9 7 7 | 0 | 5 6 7 8 8 9 9
4 3 2 2 1 1 1 1 0 0 | 1 | 0 0 3 3 3
        9 8 8 7 6 5 | 1 | 5 5 6 7 8 9 9
                  0 | 2 |
                  5 | 2 | 5
0 | 2 represents 20 disposals
2 | 0 represents 20 disposals
The leaves on the left of the stem are centred slightly higher than the leaves on the
right, which suggests that, overall, Port Adelaide recorded more disposals. The spread
of disposals for Port Adelaide appears narrower than that of the Brisbane players.
Exercise 22E
Example
6
1 The monthly rainfall for Melbourne, in a particular year, is given in the following table
(in millimetres).
Month          J   F   M   A   M   J   J   A   S   O   N   D
Rainfall (mm)  48  57  52  57  58  49  49  50  59  67  60  59
a Construct a stem-and-leaf plot of the rainfall, using the following stems.
4
5
6
b In how many months is the rainfall 60 mm or more?
Example
7
2 An investigator recorded the amount of time 24 similar batteries lasted in a toy. Her results
in hours were:
25.5
4.2
39.7
25.6
29.9
16.9
23.6
18.9
26.9
46.0
31.3
33.8
21.4
36.8
27.4
27.5
19.5
25.1
29.8
31.3
33.4
41.2
21.8
32.9
a Make a stem-and-leaf plot of these times with two rows per stem.
b How many of the batteries lasted for more than 30 hours?
3 The amount of time (in minutes) that a class of students spent on homework on one
particular night was:
10
39
27
70
46
19
63
37
20
67
33
20
15
28
21
23
16
0
14
29
15
10
a Make a stem-and-leaf plot of these times.
b How many students spent more than 60 minutes on homework?
c What is the shape of the distribution?
4 The cost of various brands of track shoes at a retail outlet are as follows.
$49.99 $75.49 $68.99 $164.99
$75.99
$210.00 $84.99 $36.98
$95.49
$28.99
$46.99 $76.99 $82.99
$79.99 $149.99
a Construct a stem-and-leaf plot of these data.
b What is the shape of the distribution?
Example
8
$39.99
$25.49
$35.99
$78.99
52.99
$45.99
5 The students in a class were asked to write down the ages of their mothers and fathers.
Mother’s age
49
50
43
44
Father’s age
50
51
43
46
43
40
50
39
47
40
50
41
40
43
46
45
49
48
49
38
42
43
44
37
38
43
41
44
55
44
51
48
48
43
47
48
47
43
52
46
54
48
41
49
44
45
40
46
a Construct a back to back stem-and-leaf plot of these data sets.
b How do the ages of the students’ mothers and fathers compare in terms of shape, centre
and spread?
6 The results of a mathematics test for two different classes of students are given in the table.
Class A
22
19
85
79
48
45
39
82
68
81
47
80
58
91
77
99
76
55
89
65
85
79
82
71
Class B
12
13
74
76
80
80
81
81
83
82
98
84
70
84
70
88
71
69
72
73
72
88
73
91
a Construct a back to back stem-and-leaf plot to compare the data sets.
b How many students in each class scored less than 50%?
c Which class do you think performed better overall on the test? Give reasons for your
answer.
22.6
Summarising data
A statistic is a number that can be computed from data. Certain special statistics are called
summary statistics, because they numerically summarise special features of the data set under
consideration. Of course, whenever any set of numbers is summarised into just one or two
figures much information is lost, but if the summary statistics are well chosen they will also
help to reveal the message which may be hidden in the data set.
Summary statistics are generally either measures of centre or measures of spread. There
are many different examples for each of these measures and there are situations when one of
the measures is more appropriate than another.
Measures of centre
Mean
The most commonly used measure of centre of a distribution of a numerical variable is the
mean. This is calculated by summing all the data values and dividing by the number of values
in the data set.
Example 9
The following data set shows the number of premierships won by each of the current AFL
teams, up until the end of 2004. Find the mean of the number of premiership wins.
Team            Premierships
Carlton         16
Essendon        16
Collingwood     14
Melbourne       12
Fitzroy/Lions   11
Richmond        10
Hawthorn        9
Geelong         6
Kangaroos       4
Sydney          3
West Coast      2
Adelaide        2
Port Adelaide   1
W Bulldogs      1
St Kilda        1
Fremantle       0
Solution
\[ \text{mean} = \frac{16 + 16 + 14 + 12 + 11 + 10 + 9 + 6 + 4 + 3 + 2 + 2 + 1 + 1 + 1 + 0}{16} = \frac{108}{16} \approx 6.8 \]
The mean of a sample is always denoted by the symbol x̄, which is called ‘x bar’.
In general, if n observations are denoted by $x_1, x_2, \ldots, x_n$ the mean is
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
or, in a more compact version,
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
where the symbol $\Sigma$ is the upper case Greek sigma, which in mathematics means ‘the sum
of the terms’.
Note: The subscripts on the x’s are used to identify all of the n different values of x. They do not
mean that the x’s have to be written in any special order. The values of x in the example are in
order only because they were listed in that way in the table.
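The calculation of the mean in Example 9 can be reproduced directly from the definition. The following Python lines are a minimal sketch; the variable names are illustrative assumptions.

```python
# Premierships won by each team, as listed in Example 9.
premierships = [16, 16, 14, 12, 11, 10, 9, 6, 4, 3, 2, 2, 1, 1, 1, 0]

n = len(premierships)            # number of observations
total = sum(premierships)        # sum of all the data values
x_bar = total / n                # mean = sum divided by n

print(n, total, x_bar)
```

The order of the values in the list makes no difference to the result, which is the point made in the note above.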
Median
Another useful measure of the centre of a distribution of a numerical variable is the middle
value, or median. To find the value of the median, all the observations are listed in order and
the middle one is the median.
The median of
2 3 4 5 5 6 7 7 8 8 11
is 6, as there are five observations on either side of this value when the data are listed in order.
Example 10
Find the median number of premierships in the AFL ladder using the data in Example 9.
Solution
As the data are already given in order, it only remains to decide which is the middle
observation.
0 1 1 1 2 2 3 4 6 9 10 11 12 14 16 16
Since there are 16 entries in the table there is no actual middle observation, so the
median is chosen as the value halfway between the two middle observations, in this
case the eighth and ninth (6 and 4). Thus the median is equal to $\frac{1}{2}(6 + 4) = 5$. The
interpretation here is that of the teams currently playing in the AFL, half (or 50%)
have won the premiership 5 or more times and half (or 50%) have won the
premiership 5 or fewer times.
In general, to compute the median of a distribution:
Arrange all the observations in ascending order according to size.
If n, the number of observations, is odd, then the median is the $\left(\frac{n+1}{2}\right)$th
observation from the end of the list.
If n, the number of observations, is even, then the median is found by averaging the
two middle observations in the list. That is, to find the median the $\left(\frac{n}{2}\right)$th and the
$\left(\frac{n}{2} + 1\right)$th observations are added together, and divided by 2.
The median value is easily determined from a stem-and-leaf plot by counting to the required
observation or observations from either end.
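The rule above for odd and even n can be written as a short function. This is a sketch in Python under the assumption that the data are supplied as a list; the function name median is chosen for the example (Python's standard library also provides statistics.median).

```python
def median(values):
    """Median by the rule above: middle value, or mean of the two middle values."""
    ordered = sorted(values)             # arrange in ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]           # the (n + 1)/2-th observation
    # n even: average the (n/2)-th and (n/2 + 1)-th observations
    return (ordered[middle - 1] + ordered[middle]) / 2


eleven_values = [2, 3, 4, 5, 5, 6, 7, 7, 8, 8, 11]
premierships = [16, 16, 14, 12, 11, 10, 9, 6, 4, 3, 2, 2, 1, 1, 1, 0]

print(median(eleven_values))   # odd n
print(median(premierships))    # even n
```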
From Examples 9 and 10, the mean number of premierships won (6.8) and the
median number of premierships won (5) have already been determined. These values are
different and the interesting question is: why are they different, and which is the better measure
of centre for this example? To help answer this question consider a stem-and-leaf plot of these
data.
0 | 0 1 1 1 2 2 3 4
0 | 6 9
1 | 0 1 2 4
1 | 6 6
From the stem-and-leaf plot it can be seen that the distribution is positively skewed. This
example illustrates a property of the mean. When the distribution is skewed or if there are one
or two very extreme values, then the value of the mean may be quite significantly affected. The
median is not so affected by unusual observations. When a distribution is skewed or contains
extreme values, the median is therefore generally preferred as a measure of centre, as it will
give a better ‘typical’ value of the variable under consideration.
Mode
The mode is the observation which occurs most often. It is a useful summary statistic,
particularly for categorical data which do not lend themselves to some of the other numerical
summary methods. Many texts state that the mode is a third option for a measure of centre but
this is generally not true. Sometimes data sets do not have a mode, or they have several modes,
or they have a mode which is at one or other end of the range of values.
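The three situations just described can be illustrated with Python's statistics.multimode function, which returns every value that occurs most often. This is only a sketch: the two short lists other than the premiership data are made up for the illustration.

```python
from statistics import multimode

premierships = [16, 16, 14, 12, 11, 10, 9, 6, 4, 3, 2, 2, 1, 1, 1, 0]
two_modes = [1, 1, 2, 2, 3]      # hypothetical data with two modes
no_repeats = [2, 3, 4, 5]        # hypothetical data in which no value repeats

print(multimode(premierships))   # a single mode, near one end of the range
print(multimode(two_modes))      # several modes
print(multimode(no_repeats))     # every value ties, so there is no useful mode
```

For the premiership data the mode is 1, which sits close to the minimum and so says little about the centre of the distribution.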
Measures of spread
Range
A measure of spread is calculated in order to judge the variability of a data set. That is, are
most of the values clustered together, or are they rather spread out? The simplest measure of
spread can be determined by considering the difference between the smallest and the largest
observations. This is called the range.
Example 11
Consider the marks, for two different tasks, awarded to a group of students.
Task A
2   6   9   10  11  12  13  22  23  24  26  26  27  33  34
35  38  38  39  42  46  47  47  52  52  56  56  59  91  94
Task B
11  16  19  21  23  28  31  31  33  38  41  49  52  53  54
56  59  63  65  68  71  72  73  75  78  78  78  86  88  91
Find the range of each of these data sets.
Solution
For Task A, the minimum mark is 2 and the maximum mark is 94.
Range for Task A = 94 − 2 = 92
For Task B, the minimum mark is 11 and the maximum mark is 91.
Range for Task B = 91 − 11 = 80
The range for Task A is greater than the range for Task B. Is the range a useful summary
statistic for comparing the spread of the two distributions? To help make this decision,
consider the stem-and-leaf plots of the data sets:
     Task A |   | Task B
      9 6 2 | 0 |
    3 2 1 0 | 1 | 1 6 9
7 6 6 4 3 2 | 2 | 1 3 8
9 8 8 5 4 3 | 3 | 1 1 3 8
    7 7 6 2 | 4 | 1 9
  9 6 6 2 2 | 5 | 2 3 4 6 9
            | 6 | 3 5 8
            | 7 | 1 2 3 5 8 8 8
            | 8 | 6 8
        4 1 | 9 | 1
From the stem-and-leaf plots of the data it appears that the spread of marks for the two tasks is
not well described by the range. The marks for Task A are more concentrated than the marks for
Task B, except for the two unusual values for Task A. Another measure of spread is needed, one
which is not so influenced by these extreme values. For this the interquartile range is used.
Interquartile range
To find the interquartile range of a distribution:
Arrange all observations in order according to size.
Divide the observations into two equal-sized groups. If n, the number of
observations, is odd, then the median is omitted from both groups.
Locate Q 1 , the first quartile, which is the median of the lower half of the
observations, and Q 3 , the third quartile, which is the median of the upper half
of the observations.
The interquartile range IQR is defined as the difference between the quartiles.
That is
IQR = Q 3 − Q 1
Definitions of the quartiles of a distribution sometimes differ slightly from the one given here.
Using different definitions may result in slight differences in the values obtained, but these will
be minimal and should not be considered a difficulty.
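The procedure given above for the quartiles can be sketched as follows, using the Task A and Task B marks from Example 11. This is an illustrative Python sketch of the definition used in this section (the median is left out of both halves when n is odd); the function names are assumptions. Library routines such as Python's statistics.quantiles use a different definition by default and may return slightly different values.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # when n is odd the median itself is omitted from both halves
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(upper)


task_a = [2, 6, 9, 10, 11, 12, 13, 22, 23, 24, 26, 26, 27, 33, 34,
          35, 38, 38, 39, 42, 46, 47, 47, 52, 52, 56, 56, 59, 91, 94]
task_b = [11, 16, 19, 21, 23, 28, 31, 31, 33, 38, 41, 49, 52, 53, 54,
          56, 59, 63, 65, 68, 71, 72, 73, 75, 78, 78, 78, 86, 88, 91]

for name, marks in (("Task A", task_a), ("Task B", task_b)):
    q1, q3 = quartiles(marks)
    print(name, "Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
```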
Example 12
Find the interquartile ranges for Task A and Task B data given in Example 11.
Solution
For Task A the marks listed in order are:
2   6   9   10  11  12  13  22  23  24  26  26  27  33  34
35  38  38  39  42  46  47  47  52  52  56  56  59  91  94
Since there is an even number of observations, then the lower ‘half’ is:
2  6  9  10  11  12  13  22  23  24  26  26  27  33  34
The median of this lower group is the eighth observation, 22, so Q 1 = 22.
The upper half is:
35  38  38  39  42  46  47  47  52  52  56  56  59  91  94
The median of this upper group is 47, so Q3 = 47.
Thus, the interquartile range, IQR = 47 − 22 = 25.
Similarly, for Task B data,
the lower quartile = 31 and
the upper quartile = 73,
giving an interquartile range for this data set of 42.
Comparing the two values of interquartile range shows the spread of Task A marks to
be much smaller than the spread of Task B marks, which seems consistent with the
display.
The interquartile range is a measure of spread of a distribution which describes the range of
the middle 50% of the observations. Since the upper 25% and the lower 25% of the
observations are discarded, the interquartile range is generally not affected by the presence of
outliers in the data set, which makes it a reliable measure of spread.
The median and quartiles of a distribution may also be determined from a cumulative
relative frequency polygon. Since the median is the observation which divides the data set in
half, this is the data value which corresponds to a cumulative relative frequency of 0.5 or 50%.
Similarly, the first quartile corresponds to a cumulative relative frequency of 0.25 or 25%, and
the third quartile corresponds to a cumulative relative frequency of 0.75 or 75%.
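Reading a value from a cumulative relative frequency polygon amounts to linear interpolation between the plotted points. The sketch below shows the idea in Python. The points are hypothetical, made up for this illustration rather than taken from any figure in this chapter, and the function name value_at is an assumption.

```python
# Hypothetical polygon: (data value, cumulative relative frequency in %).
points = [(0, 0), (10, 20), (20, 60), (30, 90), (40, 100)]


def value_at(points, percent):
    """Data value at which the polygon reaches the given cumulative percentage."""
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if p1 > p0 and p0 <= percent <= p1:
            # move along the line segment in proportion to the percentage
            return x0 + (percent - p0) / (p1 - p0) * (x1 - x0)
    raise ValueError("percent must lie between 0 and 100")


q1 = value_at(points, 25)
med = value_at(points, 50)
q3 = value_at(points, 75)
print(q1, med, q3, q3 - q1)
```

Values read by eye from a printed graph will usually agree with such a calculation only to the accuracy of the axis scale.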
Example 13
Use the cumulative relative frequency polygon to find the median and the interquartile range
for the data set shown in the graph.
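[Figure: cumulative relative frequency polygon. Vertical axis: cumulative relative frequency (%),
marked at 0, 25, 50, 75 and 100. Horizontal axis: data values, marked from 2 to 18.]
Solution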
From the plot of the data it can be seen that the median is 10, the first quartile is 8, the
third quartile is 12 and hence the interquartile range is 12 − 8 = 4.
Standard deviation
Another extremely useful measure of spread is the standard deviation. It is derived by
considering the deviation of each observation from the sample mean, that is, the observation
minus the mean. If the average of these deviations is used as a measure of spread it will be
found that, as some of them are positive and some are negative, adding them together results
in a total of zero. A more useful measure will result if the deviations are squared (which makes
them all non-negative) and are then added together. The variance is defined as a kind of
average of these squared deviations. When the variance is calculated from a sample, rather than
the whole population, the average is calculated by dividing by n − 1, rather than n. For the
remainder of this discussion it will be assumed that the data under consideration are from a
sample.
Since the variance has been calculated by squaring the deviations it is sensible to find the
square root of the variance, so that the measure reverts to a scale comparable to the original
data. This results in a measure of spread which is called the standard deviation. Standard
deviation calculated from a sample is denoted s.
Formally the standard deviation may be defined as follows.
If a data set consists of n observations denoted $x_1, x_2, \ldots, x_n$, the standard deviation is
\[ s = \sqrt{\frac{1}{n-1}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right]} \]
or, in more compact notation,
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]
Example 14
Calculate the standard deviation of the following data set.
13  12  14  6  15  12  7  6  7  8
Solution
Construct a table as shown.
x_i     x_i − x̄     (x_i − x̄)²
13      3           9
12      2           4
14      4           16
6       −4          16
15      5           25
12      2           4
7       −3          9
6       −4          16
7       −3          9
8       −2          4
Σx_i = 100          Σ(x_i − x̄)² = 112
Since Σx_i = 100 and n = 10, the mean is x̄ = 100/10 = 10.
From the table, the standard deviation s is:
\[ s = \sqrt{\frac{112}{9}} = \sqrt{12.44} = 3.53 \]
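The table in Example 14 can be reproduced step by step. The following Python lines are a minimal sketch of the sample standard deviation formula (dividing by n − 1); the variable names are illustrative, and the result is compared with the standard library function statistics.stdev, which also uses n − 1.

```python
from math import sqrt
from statistics import stdev

data = [13, 12, 14, 6, 15, 12, 7, 6, 7, 8]

n = len(data)
x_bar = sum(data) / n                            # mean
deviations = [x - x_bar for x in data]           # x_i - x_bar
squared = [d ** 2 for d in deviations]           # (x_i - x_bar)^2
s = sqrt(sum(squared) / (n - 1))                 # sample standard deviation

print(x_bar, sum(deviations), sum(squared), round(s, 2))
print(round(stdev(data), 2))                     # library check
```

The deviations sum to zero, which is why they are squared before being averaged.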
Interpreting the standard deviation
The standard deviation can be made more meaningful by interpreting it in relation to the data
set. The interquartile range gives the spread of the middle 50% of the data. Can similar
statements be made about the standard deviation? It can be shown that, for most data sets,
about 95% of the observations lie within two standard deviations of the mean.
Example 15
The cost of a lettuce at a number of different shops on a particular day is given in the table:
$3.85
$3.81
$2.65
$1.69
$1.90
$3.66
$2.95
$2.60
$2.40
$2.70
$2.42
$3.10
$2.63
$2.80
$3.20
$1.80
$4.20
$2.88
$2.33
$1.40
$0.85
Calculate the mean cost, the standard deviation and the interval equivalent to two standard
deviations above and below the mean.
Solution
The mean cost is $2.66 and the standard deviation is $0.84.
The interval equivalent to two standard deviations above and below the mean is:
[2.66 − 2 × 0.84, 2.66 + 2 × 0.84] = [0.98, 4.34].
In this case, 20 of the 21 observations, or 95% of observations, have values within the
interval calculated.
Example 16
The prices of forty secondhand motorbikes listed in a newspaper are as follows:
$5442
$2220
$3457
$6469
$5294
$5439
$1356
$4689
$7148
$3847
$2523
$738
$8218
$10 884
$4219
$2358
$656
$11 091
$14 450
$4786
$2363
$715
$11 778
$15 731
$2280
$2244
$1000
$11 637
$13 153
$3019
$1963
$1214
$8770
$10 067
$7645
$2142
$1788
$8450
$9878
$8079
Determine the interval equivalent to two standard deviations above and below the mean.
Solution
The mean price is $5729 and the standard deviation is $4233 (to the nearest whole
dollar).
The interval equivalent to two standard deviations above and below the mean is:
[5729 − 2 × 4233, 5729 + 2 × 4233] = [−2737, 14 195].
The negative value does not give a sensible solution and should be replaced by 0.
38 of the 40 observations, or 95% of observations, have values within the interval.
The exact percentage of observations which lie within two standard deviations of the mean
varies from data set to data set, but in general it will be around 95%, particularly for symmetric
data sets.
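The check carried out in Example 15 can be sketched in Python as follows. The code is an illustration only: it uses the lettuce prices from that example, the variable names are assumptions, and it works with unrounded values of the mean and standard deviation, so the interval end points may differ in the last decimal place from those obtained with rounded values.

```python
from statistics import mean, stdev

# Lettuce prices (dollars) from Example 15.
prices = [3.85, 3.81, 2.65, 1.69, 1.90, 3.66, 2.95,
          2.60, 2.40, 2.70, 2.42, 3.10, 2.63, 2.80,
          3.20, 1.80, 4.20, 2.88, 2.33, 1.40, 0.85]

x_bar = mean(prices)
s = stdev(prices)                       # sample standard deviation (n - 1)
low, high = x_bar - 2 * s, x_bar + 2 * s

inside = [p for p in prices if low <= p <= high]
percentage = 100 * len(inside) / len(prices)

print(round(x_bar, 2), round(s, 2))
print(round(low, 2), round(high, 2))
print(len(inside), "of", len(prices), "observations:", round(percentage), "%")
```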
It was noted earlier that even a single outlier can have a very marked effect on the value of
the mean of a data set, while leaving the median largely unchanged. The same is true when the
effect of an outlier on the standard deviation is considered, in comparison to the interquartile
range. The median and interquartile range are called resistant measures, while the mean and
standard deviation are not resistant measures. When considering a data set it is necessary to do
more than just compute the mean and standard deviation. First it is necessary to examine the
data, using a histogram or stem-and-leaf plot, to determine which set of summary statistics is
more suitable.
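The idea of a resistant measure can be illustrated numerically. In the Python sketch below the first list is the eleven-value data set used earlier to introduce the median; the second list is hypothetical, obtained by replacing the largest value, 11, with an outlier of 110.

```python
from statistics import mean, median, stdev

original = [2, 3, 4, 5, 5, 6, 7, 7, 8, 8, 11]
with_outlier = original[:-1] + [110]    # hypothetical: 11 replaced by 110

for label, data in (("original", original), ("with outlier", with_outlier)):
    print(label,
          "mean =", mean(data),
          "median =", median(data),
          "s =", round(stdev(data), 2))
```

The median is the same for both lists, while the mean and the standard deviation both change substantially.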
Using the TI-Nspire
The calculator can be used to calculate the values of all of the summary statistics in this
section. Consider the data from Example 16.
The data is most easily entered in a Lists & Spreadsheet application ( 3).
Firstly, use the up/down arrows to name the first column bike.
Then enter each of the 40 numbers as shown.
Open a Calculator application ( 1) to calculate the summary statistics.
Select the One-Variable Statistics
command from the Stat Calculations
submenu of the Statistics menu (b 6
1 1), specify in the dialog box that
there is only one list, and then complete the
final dialog box as shown.
Press enter to calculate the values of the
summary statistics.
Use the up arrow ( ) to view the rest of the
summary statistics.
The calculator can also be used to determine the summary statistics when the data is given
in a frequency table such as:
x          1  2  3  4
Frequency  5  8  7  2
The data is most easily entered in a Lists & Spreadsheet application ( 3).
Firstly, use the up/down arrows to name the first column x and the second
column freq.
Then enter the data as shown.
Open a Calculator application ( 1) to calculate the summary statistics.
Select the One-Variable Statistics
command from the Stat Calculations
submenu of the Statistics menu (b 6
1 1), specify in the dialog box that
there is only one list, and then complete the
final dialog box as shown. Press enter to
calculate the values of the summary
statistics.
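As a cross-check on calculator output, summary statistics for a frequency table can also be found by expanding the table into a list of individual observations. The following is a sketch in Python, not a description of how the calculator works internally; it uses the frequency table shown above and illustrative variable names.

```python
from statistics import mean, median, stdev

# Frequency table from above: value -> frequency.
table = {1: 5, 2: 8, 3: 7, 4: 2}

# Repeat each value as many times as its frequency.
observations = [x for x, freq in table.items() for _ in range(freq)]

n = len(observations)
print(n, round(mean(observations), 2), median(observations),
      round(stdev(observations), 2))
```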
Using the Casio ClassPad
Consider the following heights in cm of a group of eight women.
176, 160, 163, 157, 168, 172, 173, 169
Enter the data into list1 in the
module. Tap Calc, One-Variable and when prompted
ensure that the XList is set to list1 and the Freq = 1 (since each score is entered
individually).
The calculator returns the results as shown and all univariate statistics can be viewed
by using the scroll bar. Note that the standard deviation is given by xσn−1.
Where data is grouped, the scores are entered in list1 and the frequencies in list2. In
this case, in Set Calculation use the drop-down arrow to select list2 as the location for
the frequencies.
Exercise 22F
1 Find the mean and the median of the following data sets.
Examples
9, 10
a 29 14 11 24 14 14 28 14 18 22 14
b 5
9
11
3
12
13
12
6
13
7
3
15
12
15
5
6
d 1.5
1.0
0.2
3.4
c 8.3 5.6 8.2 6.5 8.2 7.0 7.9 7.1 7.8 7.5
0.7
1.3
0.7
0.9
0.2
1.1
0.2
5.8
0.1
2.7
1.7
3.2
0.5
0.6
1.2
4.6
2.0
0.5
1.7
3.1
2 Find the mean and the median of the following data sets.
a  x          1   2   3   4   5
   Frequency  6   3   10  7   8
b  x          −2  −1  0   1   2
   Frequency  5   8   11  3   2
3 The price, in dollars, of houses sold in a particular suburb during a one-week period are
given in the following list.
$129 500
$135 500
$93 400
$140 000
$400 000
$186 000
$187 500
$133 500
$118 000
$140 000
$168 000
$204 000
$550 000
$122 000
Find the mean and the median of the prices. Which do you think is a better measure of
centre of the data set? Explain your answer.
4 Concerned with the level of absence from his classes a teacher decided to investigate the
number of days each student had been absent from the classes for the year to date. These
are his results.
No. of days missed
No. of students
0 1 2 3 4 5 6 9 21
4 2 14 10 16 18 10 2 1
Find the mean and the median number of days each student had been absent so far that
year. Which is the better measure of centre in this case?
Examples
11, 12
5 Find the range and the interquartile range for each of the following data sets.
a 718
630
1002
b 0.7
−1.6
c 8.56
8.51
d 20
19
16
715
−1.2
0.2
8.96
18
560
8.39
16
1085
−1.0
8.62
18
750
8.51
21
20
510
3.4
3.7
8.58
8.82
17
15
1112
1093
0.8
8.54
22
19
6 The serum cholesterol levels for a sample of twenty people are:
231
190
159
192
203
209
304
161
248
206
238
224
209
276
193
196
225
189
244
199
a Find the range of the serum cholesterol levels.
b Find the interquartile range of the serum cholesterol levels.
7 Twenty babies were born at a local hospital on one weekend. Their birth weights, in kg,
are given in the stem-and-leaf plot below.
2
2
3
3
4
4
1
5
1
5
1
5
7
3
6
2
9
3
7
2
9
4
7
3
4
9
3|6 represent 3.6 kg
a Find the range of the birth weights.
b Find the interquartile range of the birth weights.
Example
14
8 Find the standard deviation for the following data sets.
a 30
16
$4.38
$5.65
23
18
$3.60
$6.89
18
$2.30
$1.98
b $2.52
$4.32
22
c 200
300
950
200
200
14
56
$3.45
$4.60
300
13
$5.40
$5.12
840
26
9
$4.43
$3.79
350
31
$2.27
$4.99
200
$4.50
$3.02
200
d 86 74 75 77 79 82 81 75 78 79 80 75 78 78 81 80 76 77 82
9 For each of the following data sets
Example
15
a calculate the mean and the standard deviation
b determine the percentage of observations falling within two standard deviations of the
mean.
i 41 16 6 21 1 21 5 31 20 27 17 10 3 32 2 48 8 12
21 44 1 56 5 12 3 1 13 11 15 14 10 12 18 64 3 10
ii 141
152
Example
13
260
141
164
239
235
145
167
134
266
150
150
237
255
254
168
150
245
265
258
140
239
132
10 A group of university students was asked to write down their ages with the following
results.
17 17 17 17 17 17 17 18 18 18 18 18 18 18 18 18 18 18
18 18 18 18 18 19 19 19 20 20 20 21 24 25 31 41 44 45
a Construct a cumulative relative frequency polygon and use it to find the median and
the interquartile range of this data set.
b Find the mean and standard deviation of the ages.
c Find the percentage of students whose ages fall within two standard deviations of the
mean.
11 The results of a student’s chemistry experiment are as follows.
7.3
8.3
5.9
7.4
6.2
7.4
5.8
6.0
i Find the mean and the median of the results.
ii Find the interquartile range and the standard deviation of the results.
b Unfortunately when the student was transcribing his results into his chemistry book he
made a small error, and wrote:
a
8.3
5.9
7.4
6.2
7.4
5.8
60
7.3
i Find the mean and the median of these results.
ii Find the interquartile range and the standard deviation of these results.
c Describe the effect the error had on the summary statistics calculated in parts a
and b.
Example
17
12 A selection of shares traded on the stock exchange had a mean price of $50 with a
standard deviation of $3. Determine an interval which would include approximately 95%
of the share prices.
13 A store manager determined the store’s mean daily receipts as $550, with a standard
deviation of $200. On what proportion of days were the daily receipts between $150 and
$950?
22.7
The boxplot
Knowing the median and quartiles of a distribution means that quite a lot is known about the
central region of the data set. If something is known about the tails of the distribution then a
good picture of the whole data set can be obtained. This can be achieved by knowing the
maximum and minimum values of the data. These five important statistics can be derived from
a data set: the median, the two quartiles and the two extremes.
These values are called the five-figure summary and can be used to provide a succinct
pictorial representation of a data set called the box and whisker plot, or boxplot.
For this visual display, a box is drawn with the ends at the first and third quartiles. Lines are
drawn which join the ends of the box to the minimum and maximum observations. The median
is indicated by a vertical line in the box.
Example 17
Draw a boxplot to show the number of hours spent on a project by individual students in a
particular school.
24   59   9    4    102  3    166  13   48   147  108
27   97   2    264  90   71   86   36   102  9    92
147  40   226  56   146  37   181  19   111  35   76
Solution
First arrange the data in order.
2    3    4    9    9    13   19   24   27   35   36
37   40   48   56   59   71   76   86   90   92   97
102  102  108  111  146  147  147  166  181  226  264
From this ordered list prepare the five-figure summary.
median, m = 71
first quartile, Q1 = (24 + 27)/2 = 25.5
third quartile, Q3 = (108 + 111)/2 = 109.5
minimum = 2
maximum = 264
The boxplot can then be drawn.
[Figure: boxplot on an axis from 0 to 300, with min = 2, Q1 = 25.5, m = 71, Q3 = 109.5 and max = 264 marked]
In general, to draw a boxplot:
Arrange all the observations in order, according to size.
Determine the minimum value, the first quartile, the median, the third quartile, and
the maximum value for the data set.
Draw a horizontal box with the ends at the first and third quartiles. The height of the
box is not important.
Join the minimum value to the lower end of the box with a horizontal line.
Join the maximum value to the upper end of the box with a horizontal line.
Indicate the location of the median with a vertical line.
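The steps above can also be checked away from the calculator. The following is a minimal Python sketch, not part of the original text, that computes the five-figure summary for the Example 17 data. It assumes the quartile rule used in this chapter (the quartiles are the medians of the lower and upper halves, with the overall median left out when the number of observations is odd) and at least two observations; the function name five_figure_summary is an illustrative choice.

```python
from statistics import median


def five_figure_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers.

    Quartiles are the medians of the lower and upper halves of the
    ordered data; the overall median is left out of both halves when
    the number of observations is odd. Assumes at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]    # smallest `half` observations
    upper_half = ordered[-half:]   # largest `half` observations
    return (
        ordered[0],
        median(lower_half),
        median(ordered),
        median(upper_half),
        ordered[-1],
    )


# Hours spent on the project (Example 17)
hours = [
    24, 59, 9, 4, 102, 3, 166, 13, 48, 147, 108,
    27, 97, 2, 264, 90, 71, 86, 36, 102, 9, 92,
    147, 40, 226, 56, 146, 37, 181, 19, 111, 35, 76,
]

print(five_figure_summary(hours))  # (2, 25.5, 71, 109.5, 264)
```

Statistical software often uses a different quartile convention, so its quartiles may differ slightly from those found by hand with this rule.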
Using a graphics calculator
A graphics calculator can be used to construct a boxplot.
Consider the data from Example 17.
Enter the data into a list named HOURS. To draw the boxplot
press 2ND STAT PLOT and select and turn on Plot1, as
previously described.
Press the down arrow key and select from the Type menu
the boxplot icon as shown, then press ENTER .
Use the LIST menu to paste HOURS as the Xlist. Your
calculator screen should appear like this.
To bring up the boxplot, press ZOOM and then
9:ZoomStat. Your calculator screen should now look
like this. To find out values for the five-figure summary,
select TRACE .
The symmetry of a data set can be determined from a boxplot. If a data set is symmetric, then
the median will be located approximately in the centre of the box, and the tails will be of
similar length. This is illustrated in the following diagram, which shows the same data set
displayed as a histogram and a boxplot.
A median placed towards the left of the box, and/or a long tail to the right indicates a
positively skewed distribution, as shown in this plot.
A median placed towards the right of the box, and/or a long tail to the left indicates a
negatively skewed distribution, as illustrated here.
A more sophisticated version of a boxplot can be drawn with the outliers in the data set
identified. This is very informative, as one cannot tell from the previous boxplot if an
extremely long tail is caused by many observations in that region or just one.
Before drawing this boxplot the outliers in the data set must be identified. The term outlier
is used to indicate an observation which is rather different from other observations. Sometimes
it is difficult to decide whether or not an observation should be designated as an outlier. The
interquartile range can be used to give a very useful definition of an outlier.
An outlier is any number which is more than 1.5 interquartile ranges above the upper
quartile, or more than 1.5 interquartile ranges below the lower quartile.
When drawing a boxplot, any observation identified as an outlier is indicated by an asterisk,
and the whiskers are joined to the smallest and largest values which are not outliers.
Example 18
Use the data from Example 17 to draw a boxplot with outliers.
Solution
median = 71
interquartile range = Q 3 − Q 1
= 109.5 − 25.5
= 84
An outlier will be any observation which is less than 25.5 − 1.5 × 84 = −100.5,
which is impossible, or greater than 109.5 + 1.5 × 84 = 235.5. From the data it can
be seen that there is only one observation greater than this, 264, which would be
denoted with an asterisk.
The upper whisker is now drawn from the edge of the box to the largest observation
less than 235.5, which is 226.
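[Figure: boxplot on an axis from 0 to 300, with the upper whisker ending at 226 and the outlier 264 marked by an asterisk]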
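The outlier rule used in Example 18 can be sketched in a few lines of Python. This is an illustrative check rather than part of the original solution: it assumes the same median-of-halves quartiles as the rest of the chapter, and the function name outlier_summary and the dictionary keys are hypothetical.

```python
from statistics import median


def outlier_summary(values):
    """Apply the 1.5 x IQR rule and report fences, outliers and whisker ends."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])    # median of the lower half
    q3 = median(ordered[-half:])   # median of the upper half
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    return {
        "fences": (lower_fence, upper_fence),
        "outliers": outliers,
        # The whiskers end at the smallest and largest non-outliers.
        "whiskers": (inside[0], inside[-1]),
    }


# Hours spent on the project (Example 17)
hours = [
    24, 59, 9, 4, 102, 3, 166, 13, 48, 147, 108,
    27, 97, 2, 264, 90, 71, 86, 36, 102, 9, 92,
    147, 40, 226, 56, 146, 37, 181, 19, 111, 35, 76,
]

result = outlier_summary(hours)
print(result["fences"])    # (-100.5, 235.5)
print(result["outliers"])  # [264]
print(result["whiskers"])  # (2, 226)
```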
Using the TI-Nspire
The calculator can be used to construct a boxplot. Consider the data from Example 17.
The data is most easily entered in a Lists & Spreadsheet application (3).
First, use the up/down arrows to name the first column hours.
Then enter each of the 33 numbers as
shown.
Open a Data & Statistics application (5) to graph the data. At first the data
displays as shown.
Specify the x variable by selecting Add X
Variable from the Plot Properties (b 2
4) and selecting hours. The data now
displays as shown.
(Note: It is also possible to use the NavPad
to move down below the x-axis and click to
add the x variable.)
Select Box Plot from the Plot Type menu
(b 1 2). The data now displays as
shown.
Notice how the calculator, by default,
shows any outlier(s).
To not show the outlier(s), select Extend
Box Plot Whiskers from the Plot Properties
menu (b 2 3). The data now
displays as shown.
Note: It is possible to show the values of the
five-figure summary by moving the cursor
over the boxplot.
Using the Casio ClassPad
In the following consider the set of marks:
28 21 21 3 22 31 35 26 27 33 36 35 23 24 43 31 30 34 48
Enter the data into list1. Tap SetGraph, Setting . . . and the tab for Graph 2, enter the
settings shown, including the tick box, and tap SET. (Note that on
the ClassPad you can store settings for a number of different graphs and return to them
quickly.)
Tap SetGraph, StatGraph2 and tap the box to tick
and select the graph (de-select any other graphs).
Tap the graph icon to produce the graph. The boxplot is
produced as shown.
With the graph window selected (bold border),
tap 6 to adjust the viewing window for the
graph.
Tap Analysis, Trace and use the navigator key to
move between the outlier(s), Minimum, Q1,
Median, Q3 and Maximum scores.
Starting from the left of the plot, we see that the:
Minimum value is 3: min X = 3. It is also an outlier
Lower adjacent value is 21: X = 21
First quartile is 23: Q1 = 23
Median is 30: Med = 30
Third quartile is 35: Q3 = 35
Maximum value is 48: max X = 48.
Exercise 22G
Example
17
1 The heights (in centimetres) of a class of girls are
160 154 165 159 123 149 143
167 154 176 180 163 133 154
123 167 157 168 157 132 135
145 140 143 140 157 150 156
Example
18
a Determine the five-figure summary for this data set.
b Draw a boxplot of the data.
c Describe the pattern of heights in the class in terms of shape, centre and spread.
2 A researcher is interested in the number of books people borrow from a library. She
decided to select a sample of 38 cards and record the number of books each person has
borrowed in the previous year. Here are her results.
7 28 0 2 38 18 0 0 4 0 0 2 1
1 14 1 8 27 0 52 4 0 12 28 10 1
0 2 0 1 11 5 11 0 13 0 13 15
a Determine the five-figure summary for this data set.
b Determine if there are any outliers.
c Draw a boxplot of the data, showing any outliers.
d Describe the number of books borrowed in terms of shape, centre and spread.
3 The winnings of the top 25 male tennis players in 2004 are given in the following table.
Player                  Winnings
Roger Federer           6 357 547
Lleyton Hewitt          2 766 051
Andy Roddick            2 604 590
Marat Safin             2 273 283
Guillermo Coria         1 697 155
Gaston Gaudio           1 639 171
Tim Henman              1 508 177
Carlos Moya             1 448 209
Andre Agassi            1 177 254
David Nalbandian        1 045 985
Jonas Bjorkman            927 344
Tommy Robredo             861 357
Nicolas Massu             854 533
Joachim Johansson         828 744
Jiri Novak                813 792
Dominik Hrbaty            808 944
Guillermo Canas           780 701
Fernando Gonzalez         766 416
Sebastian Grosjean        755 795
Feliciano Lopez           748 662
Max Mirnyi                742 196
Juan Ignacio Chela        727 736
Mikhail Youzhny           725 948
Radek Stepanek            706 387
Vincent Spadea            704 105
a Draw a boxplot of the data, indicating any outliers.
b Describe the data in terms of shape, centre, spread and outliers.
4 The hourly rate of pay for a group of students engaged in part-time work was found to be:
$4.75   $8.50   $17.23  $9.00   $12.00
$11.69  $6.25   $7.50   $8.89   $6.75
$7.90   $12.46  $10.80  $8.40   $12.34
$10.90  $11.65  $10.00  $10.00  $13.00
a Draw a boxplot of the data, indicating any outliers.
b Describe the hourly pay rate for the students in terms of shape, centre, spread and
outliers.
5 The daily circulation of several newspapers in Australia is:
570 000    327 654    299 797    273 248    258 700    230 487
217 284    214 000    212 770    171 568    170 000    125 778
98 158     77 500     56 000     43 330     17 398
a Draw a boxplot of the data, indicating any outliers.
b Describe the daily newspaper circulation in terms of shape, centre, spread and outliers.
22.8
Using boxplots to compare distributions
Boxplots are extremely useful for comparing two or more sets of data collected on the same
variable, such as marks on the same assignment for two different groups of students. By
drawing boxplots on the same axis, both the centre and spread for the distributions are readily
identified and can be compared visually.
Example 19
The number of hours spent by individual students on the project referred to in Example 17 at
another school were:
152 106 226 80 82 14 17 54 30 18 9
16 16 173 156 106 53 57 136 24 136 102
19 6 21 86 107 38 11 227 24 3 1
48 42 55 12 21 128 45 176
Use boxplots to compare the time spent on the project by students at this school with those in
Example 17.
Solution
The five-figure summary for this data set is:
median, m = 48; first quartile, Q 1 = 17.5; third quartile, Q 3 = 106.5; minimum = 1;
maximum = 227
In order to compare the time spent on the project by the students at each school,
boxplots for both data sets are drawn on the same axis.
[Figure: parallel boxplots for School 1 and School 2 on a common axis from 0 to 300; an asterisk marks the outlier for School 1]
From the boxplots the distributions of time for the two schools can be compared in
terms of shape, centre, spread and outliers. Both distributions are clearly
positively skewed, indicating a larger range of values in the upper half of
each distribution. The centre for School 1 is higher than the centre for School 2
(71 hours compared to 48 hours). As can be seen by comparing the box widths, which
indicate the IQR, the spread of the data is comparable for both distributions
(an IQR of 84 hours for School 1 and 89 hours for School 2). There is
one outlier, a student who attended School 1 and spent 264 hours on the project.
The boxplot is useful for summarising large data sets and for comparing several sets of data. It
focuses attention on important features of the data and gives a picture of the data which is easy
to interpret. When a single data set is being investigated a stem-and-leaf plot is sometimes
better, as a boxplot may hide the local detail of the data set.
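The numerical side of the comparison in Example 19 can be sketched in Python. This is an illustrative check only, assuming the median-of-halves quartile rule used throughout the chapter; the helper name summarise and its dictionary keys are hypothetical, not taken from the text.

```python
from statistics import median


def summarise(values):
    """Five-figure summary plus IQR, using median-of-halves quartiles."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    return {
        "min": ordered[0],
        "Q1": q1,
        "median": median(ordered),
        "Q3": q3,
        "max": ordered[-1],
        "IQR": q3 - q1,
    }


# Example 17 data (School 1) and Example 19 data (School 2)
school1 = [
    24, 59, 9, 4, 102, 3, 166, 13, 48, 147, 108,
    27, 97, 2, 264, 90, 71, 86, 36, 102, 9, 92,
    147, 40, 226, 56, 146, 37, 181, 19, 111, 35, 76,
]
school2 = [
    152, 106, 226, 80, 82, 14, 17, 54, 30, 18, 9,
    16, 16, 173, 156, 106, 53, 57, 136, 24, 136, 102,
    19, 6, 21, 86, 107, 38, 11, 227, 24, 3, 1,
    48, 42, 55, 12, 21, 128, 45, 176,
]

summary1 = summarise(school1)
summary2 = summarise(school2)
for name, summary in (("School 1", summary1), ("School 2", summary2)):
    print(name, summary)
```

Comparing the two "median" entries gives the difference in centre, and comparing the two "IQR" entries gives the difference in spread.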
Exercise 22H
Example 19
1 To test the effect of a physical fitness course the number of sit-ups that a person could do
in 1 minute, both before and after the course, were recorded. Twenty randomly selected
participants scored as follows.
Before
29
23
After
28
25
22
22
25
26
29
26
26
30
24
12
31
17
46
21
34
20
28
30
26
24
25
30
35
34
33
30
36
15
32
29
54
21
50
19
43
34
a Construct boxplots of these two sets of data on the same axis.
b Describe the effect of the physical fitness course on the number of sit-ups achieved in
terms of shape, centre, spread and outliers.
2 The number of hours spent on homework per week by a group of students in Year 8 and a
group of students in Year 12 are shown in the tables.
Year 8
1 1 2 3 4 4 2 3 4 3 4 1 5 7
3 2 7 1 7 3 2 1 4 4 3 1 3 0
Year 12
1 2 2 3 3 1 5 1 6 4 7 7 7 8
6 9 7 6 8 7 7 8 5 7 4 2 1 3
Draw boxplots of these two sets of data on the same axis and use them to answer the
following questions.
a Which group does the most homework?
b Which group varies more in the number of hours of homework they do?
3 The ages of mothers at the birth of their first child were noted, for the first forty such
births, at a particular hospital in 1970 and again in 1990.
1970
21 37 24 16 29 22 21 21 25 26
22 25 32 31 36 26 37 26 22 34
30 27 25 27 24 19 31 18 36 21
20 39 23 33 18 24 19 17 20 21
1990
24 19 26 25 22 33 18 35 35 44
28 31 32 24 32 23 17 18 43 19
28 27 28 46 38 24 26 29 20 33
28 23 30 29 41 34 39 23 28 29
a Construct boxplots of these two sets of data on the same axis.
b Compare the ages of the mothers in 1970 and 1990 in terms of shape, centre, spread
and outliers.
Using a CAS calculator with statistics I
How to construct a histogram
Use a TI-89 graphics calculator to display the following set of marks in the form of a
histogram.
16 11 4 25 15 7 14 13 14 12 15 13 16 14
15 12 18 22 17 18 23 15 13 17 18 22 23
Enter the data into your calculator by pressing APPS , and
moving to the Stats/List Editor.
Type the data into list1. Your screen should look like the
one shown.
Next, set up the calculator to plot a statistical graph.
a Press F2 to access the Plots menu.
b Select 1:Plot Setup. This will take you to the Plot Setup
menu.
c With Plot 1: highlighted, press F1 . This will take you to
the Define Plot1 dialogue box.
d Complete the dialogue box as follows:
• For Plot Type: select 4:Histogram
• Leave Mark: as Box.
• For x type in list1.
• For Hist.Bucket Width, type in 4.
Note: ‘Bucket width’ means ‘interval width’. Choose a minimum of five intervals and
divide the range of values by the number to get the ‘bucket width’. If we choose five
intervals, the bucket width is \(\frac{25 - 4}{5} = 4.2\), or 4 as a more convenient number.
Pressing ENTER confirms your selection and returns
you to the Plot Setup menu.
Set the viewing window ( WINDOW ) with the
following entries. Remember the interval width is 4.
• xmin= 4
• xmax= 29 (4 greater than the highest data value)
• xscl= 0 (no tick marks will appear on the scale)
• ymin= –5 (to allow space below the histogram)
• ymax= 13 (a first guess at the maximum height of the histogram; half the number of data values)
• yscl= 0 (leave as is)
• xres= 2 (leave as is)
Pressing GRAPH
plots the histogram.
Pressing F3 places a marker at the top of the first
column of the histogram and tells us that the first class
interval contains all values ranging from 4 to less than 8.
For this interval, the count is two (n:2).
To find out the counts in the other intervals, use the
horizontal arrow key to move from interval to interval.
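The interval counts that the calculator reports can be cross-checked with a short Python sketch. It is not part of the calculator instructions: it assumes intervals of width 4 starting at 4, each including its lower boundary and excluding its upper boundary (matching "from 4 to less than 8"), and the function name bucket_counts is an illustrative choice.

```python
def bucket_counts(values, start, width):
    """Count the values in each interval [lower, lower + width).

    Returns a dictionary mapping the lower boundary of each non-empty
    interval to the number of values that fall in it.
    """
    counts = {}
    for x in values:
        index = (x - start) // width   # which interval x falls into
        lower = start + index * width  # lower boundary of that interval
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


marks = [
    16, 11, 4, 25, 15, 7, 14, 13, 14, 12, 15, 13, 16, 14,
    15, 12, 18, 22, 17, 18, 23, 15, 13, 17, 18, 22, 23,
]

counts = bucket_counts(marks, start=4, width=4)
print(counts)  # {4: 2, 8: 1, 12: 12, 16: 7, 20: 4, 24: 1}
```

The first entry agrees with the count of two read from the calculator for the interval from 4 to less than 8.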
How to construct a boxplot with outliers
Use a TI-89 graphics calculator to display the following set of marks in the form of a
boxplot with outliers.
28 21 21 3 22 31 35 26 27 33 36 35 23 24
43 31 30 34 48
Enter the data into your calculator by pressing APPS ,
moving to Stats/List Editor and pressing ENTER to
select.
Type the data into list1. Your screen should look like the
one shown.
Set up the calculator to plot a statistical graph.
a Press F2 to access the Plots menu.
b Press ENTER to select 1:Plot Setup.
c Press F1 to Define Plot1.
d Complete the dialogue box as follows:
• For Plot Type: select 5:Mod Box Plot
• Leave Mark: as Box.
• For x type in list1.
Pressing ENTER confirms your selection and returns you
to the Plot Setup menu.
Pressing F5 (Zoom Data) in Plot Setup automatically plots
the box plot in a properly scaled window.
Key values can be read from the boxplot by pressing F3 .
This places a marker on the boxplot. You can then use the
horizontal arrow keys to move from point to point
on the boxplot and read off the associated values.
Starting at the far left of the plot, we see that the
• minimum value is 3: minX=3. It is also an outlier.
• lower adjacent value is 21: X=21.
• first quartile is 23: Q1=23.
• median is 30: Med=30.
• third quartile is 35: Q3=35.
• maximum value is 48: maxX=48.
See the figure opposite.
How to calculate the mean and standard deviation
The following are all heights (in cm) of a group of women:
176 160 163 157 168 172 173 169
Calculate the mean and standard deviation.
Enter the data into your calculator using the Stats/List
Editor.
Type the data into list1. Your screen should look like the
one shown.
Press F4 to access the Calculate menu.
With 1:1-Var Stats highlighted, press ENTER to select.
This will take you to the 1-Var Stats dialogue box.
Complete the dialogue box:
• For List:, type in list1. This is not necessary if list1 is
already shown.
Press ENTER to obtain the results.
Write down your answers to the required degree of
accuracy (two decimal places).
Note: The value of the standard deviation is given by Sx.
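As a cross-check on the calculator output, the same two statistics can be computed with Python's standard statistics module. This sketch is an addition for illustration; it assumes, as the note above indicates, that the required standard deviation is the sample standard deviation Sx (divisor n − 1), which is what statistics.stdev computes.

```python
from statistics import mean, stdev

# Heights (in cm) of the group of women
heights = [176, 160, 163, 157, 168, 172, 173, 169]

x_bar = mean(heights)  # arithmetic mean
s = stdev(heights)     # sample standard deviation (divisor n - 1)

print(round(x_bar, 2))  # 167.25
print(round(s, 2))      # 6.67
```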
Chapter summary
Variables may be classified as categorical or numerical. Numerical data may be discrete
or continuous.
Examination of a data set should always begin with a visual display.
A bar chart is the appropriate visual display for categorical data.
When a data set is small, a stem-and-leaf plot is the most appropriate visual display for
numerical data.
When a data set is larger, a histogram, frequency polygon or boxplot is a more
appropriate visual display for numerical data.
Cumulative frequency distributions and cumulative relative frequency distributions are
useful for answering questions about the number or proportion of data values greater than
or less than a particular value. These are graphically represented in cumulative frequency
polygons or cumulative relative frequency polygons.
From a stem-and-leaf plot, histogram or boxplot, insight can be gained into the shape,
centre and spread of the distribution, and whether or not there are any outliers.
An outlier is a value which sits away from the main body of the data in a plot. It is formally
defined as a value more than 1.5IQR below Q1 , or more than 1.5IQR above Q3 .
For numerical data it is also very useful to calculate some summary statistics.
The mean is defined as \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\).
If n, the number of observations, is odd, then the median is the \(\left(\frac{n+1}{2}\right)\)th observation
from the end of the ordered list. If n is even, then the median is found by averaging the two
middle observations in the list, i.e., the \(\left(\frac{n}{2}\right)\)th and the \(\left(\frac{n}{2} + 1\right)\)th observations are added
together and divided by 2.
The mode is the most common observation in a group of data.
The most useful measures of centre are the median and the mean.
To find the interquartile range of a distribution:
• Arrange all observations in order according to size.
• Divide the observations into two equal sized groups. If n, the number of
observations, is odd, then the median is omitted from both groups.
• Locate \(Q_1\), the first quartile, which is the median of the lower half of the
observations, and \(Q_3\), the third quartile, which is the median of the upper half of the
observations.
• The interquartile range IQR is defined as the difference between the quartiles. That is
\(\text{IQR} = Q_3 - Q_1\)
The standard deviation is defined as \(s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}\).
The most useful measures of spread are the interquartile range and the standard
deviation.
The five-figure summary of a set of data consists of the minimum, Q1, median, Q3, and
the maximum. A boxplot is a diagrammatic representation of this, e.g.
[Figure: boxplot with min, Q1, median, Q3 and max labelled]
When the data set is symmetric any of the summary statistics are appropriate.
When the data set is not symmetric or when there are outliers the median and the
interquartile range are the preferred summary statistics.
In general, approximately 95% of the values of the data set will fall within two standard
deviations of the mean.
When comparing the distribution of two or more data sets the comparison should be made
in terms of the shape, centre, spread and outliers for each distribution.
Multiple-choice questions
1 In a survey a number of subjects were asked to indicate how much they exercise by
selecting one of the following options.
1 Never
2 Seldom
3 Occasionally
4 Regularly
The resulting variable was named Level of Exercise, and the level of measurement of this
variable is
A variable
B numerical
C constant
D categorical
E metric
Questions 2 and 3 relate to the following information.
The numbers of hours worked per week by employees in a large company are shown in this
percentage frequency histogram.
[Figure: percentage frequency histogram; horizontal axis: Hours worked weekly (0 to 80); vertical axis: Percentage frequency (0 to 40)]
2 The percentage of employees who work from 20 to less than 30 hours per week is closest to
A 1%
B 2%
C 6%
D 10%
E 33%
3 The median number of hours worked is in the interval
A 10 to less than 20
B 20 to less than 30
C 30 to less than 40
D 40 to less than 50
E 50 to less than 60
0
1
2
3
4
1
0
2
1
2
3
2
4
4
Questions 4 and 5 relate to the following information.
A group of 19 employees of a company was asked to record the number of meetings that they
attended in the last month. Their responses are summarised in the following stem-and-leaf plot.
3
4
4
6
5
5
4 The median number of meetings is
A 6
B 6.5
C 7
D 7.5
6
6
7
9
E 9
5 The interquartile range (IQR) of number of meetings is
A 0
B 4
C 9.5
D 10
E 14
6 The cumulative frequency polygon shown gives the
examination scores in Mathematics for a group of
200 students.
[Figure: cumulative frequency polygon; horizontal axis: Exam Score (40 to 90); vertical axis: Number of students (0 to 200)]
The number of students who scored less than 70 on the
examination is closest to
A 30
B 100
C 150
D 175
E 200
Questions 7 and 8 relate to the following information.
The number of years that a sample of people has lived at their current address is summarised
in this boxplot.
[Figure: boxplot of years lived at this address, on an axis from 0 to 50]
7 The shape of the distribution of years lived at this address is:
A positively skewed
B negatively skewed
C bimodal
D symmetric
E symmetric with outliers
8 The interquartile range of years lived at this address is approximately equal to:
A 5
B 8
C 17
D 12
E 50
Questions 9 and 10 relate to the following data.
The amounts paid per year to the employees of each of three large companies are shown in the
boxplots:
[Figure: parallel boxplots for Company 1, Company 2 and Company 3; horizontal axis: Yearly income (0 to 120 000)]
9 The company with the lowest typical wage is
A Company 1
B Company 2
C Company 3
D Company 1 and Company 2
E Company 2 and Company 3
10 The company with the largest variation in wage is
A Company 1
B Company 2
C Company 3
D Company 1 and Company 2
E Company 2 and Company 3
Short-answer questions (technology-free)
1 Classify the data which arise from the following situations as categorical or numerical.
a The number of phone calls a hotel receptionist receives each day.
b Interest in politics on a scale from 1 to 5 where 1 = very interested, 2 = quite interested,
3 = somewhat interested, 4 = not very interested, and 5 = uninterested.
2 This bar chart shows the percentage of people working who are
employed in private companies, work for the Government or are self-employed
in a certain town.
[Figure: bar chart; vertical axis: Percent (0 to 50); horizontal axis: Type of company worked for (Private, Government, Self-employed)]
a What kind of measurement is the ‘Type of company worked for’?
b Approximately what percentage of the people are self-employed?
3 A researcher asked a group of people to record how many
cigarettes they had smoked on a particular day. Here are her results:
0 5 0 0 9 17 10 14 23 3
25 6 0 0 0 33 34 23 32 0
0 32 0 13 30 21 0 22 4 6
Using an appropriate class interval, construct a histogram of these data.
4 A teacher recorded the time taken (in minutes) by each of a class of students to complete a
test.
56 54 57 52 47 69 68 72 52 65 51
45 43 44 22 55 59 56 51 49 39 50
a Make a stem-and-leaf plot of these times, using one row per stem.
b Use this stem-and-leaf-plot to find the median and quartiles for the time taken.
5 The weekly rentals, in dollars, for apartments in a particular suburb are given in the
following table.
285 265 185 300 210 210 215
270 320 190 680 245 280 315
Find the mean and the median of the weekly rental.
6 Geoff decided to record the time it takes him to complete his mail delivery round each
working day for four weeks. His data are recorded in the following table.
170 164 182 189 176 167 201
161 188 183 187 211 168 180
174 182 201 193 161 147 185
166 188 183 167 186 173 176
The mean of the time taken, x̄, is 179 and the standard deviation, s, is 14.
a Determine the percentage of observations falling within two standard deviations of the
mean.
b Is this what you would expect to find?
7 A group of students were asked to record the number of SMS messages that they sent in
one 24-hour period, and the following five-figure summary was obtained from the data set.
Use it to construct a simple boxplot of these data.
Min = 0, Q1 = 3, Median = 5, Q3 = 12, Max = 24
8 The following table gives the number of students absent each day from a large secondary
college on each of 36 randomly chosen school days.
7 7 15 22 3 16 12 21 13 15 30 21
21 13 10 16 2 16 23 7 11 23 12 4
17 18 3 23 14 0 8 14 31 16 0 44
Construct a boxplot of these data, with outliers.
Cambridge University Press • Uncorrected Sample Pages • 978-0-521-61252-4
2008 © Evans, Lipson, Jones, Avery, TI-Nspire & Casio ClassPad material prepared in collaboration with Jan Honnens & David Hibbard
P2: FXS
9780521740494c22-1.xml
CUAU033-EVANS
September 12, 2008
10:56
Back to Menu >>>
Chapter 22 — Describing the distribution of a single variable
551
1 The divorce rates (in percentages) of 19 countries are
27
26
18
8
14
14
25
5
28
15
6
32
32
6
44
19
53
9
0
What is the level of measurement of the variable, ‘divorce rate’?
Construct an ordered stem-and-leaf plot of divorce rates, with one row per stem.
What shape is the divorce rates?
What percentage of countries have divorce rates greater than 30?
Calculate the mean and median of the divorce rates for the 19 countries.
Construct a histogram of the data with class intervals of width 10.
i What is the shape of the histogram?
ii How many countries had divorce rates from 10% to less than 20%?
g Construct a cumulative percentage frequency polygon of divorce rates.
i What percentage of countries has divorce rates less than 20%?
ii Use the cumulative frequency distribution to estimate the median percentage
divorce rate.
a
b
c
d
e
f
SA
M
2 Hillside Trains have decided to improve their service on the Lilydale line. Trains were timed
on the run from Lilydale to Flinders Street, and their times recorded over a period of six
weeks at the same time each day. The time taken for each journey is shown below.
60
90
63
58
61
59
67
64
70
86
74
69
72
70
78
59
68
77
65
62
80
64
68
63
76
57
82
89
65
65
89
74
69
60
75
60
79
68
62
82
60
64
a Construct a histogram of the times taken for the journey from Lilydale to Flinders
Street, using class intervals 55–59, 60–64, 65–69 etc.
i On how many days did the trip take from 65–69 minutes?
ii What shape is the histogram?
iii What percentage of trains took less than 65 minutes to reach Flinders Street?
b Calculate the following summary statistics for the time taken (correct to two decimal
places).
x
s
Min
Q1
M
Q3
Max
c Use the summary statistics to complete the following report.
i The mean time taken from Lilydale to Flinders Street (in minutes) was . . .
ii 50% of the trains took more than . . . minutes to travel from Lilydale to Flinders
Street.
iii The range of travelling times was . . . minutes while the interquartile range was . . ...
minutes.
iv 25% of trains took more than . . . minutes to travel to Flinders Street.
v The standard deviation of travelling times was . . .
vi Approximately 95% of trains took between . . . and . . . minutes to travel to Flinders
St.
d Summary statistics for the year before Hillside Trains took over the Lilydale line from
the Met are indicated below:
Min = 55    Q1 = 65    Median = 70    Q3 = 89    Max = 99
Draw simple boxplots, on the same axis, for the last year the Met ran the line and for
the Hillside Trains data.
e Use the information from the boxplots to compare travelling times for the two transport
corporations in terms of shape, centre and spread.
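The summary statistics asked for in part b of question 2 can be checked with a short script. The following is a minimal Python sketch, not the method prescribed by the text. It assumes that Q1 and Q3 are the medians of the lower and upper halves of the ordered data, that s is the sample standard deviation (divisor n − 1), and that the 95% statement in part c vi refers to the interval x̄ ± 2s. The helper name `five_number_summary` is hypothetical. A calculator or other software may use a different quartile rule and give slightly different values for Q1 and Q3.

```python
from statistics import mean, median, stdev

# Journey times in minutes, copied from question 2 in the order listed.
times = [
    60, 90, 63, 58, 61, 59, 67, 64, 70, 86, 74, 69, 72, 70,
    78, 59, 68, 77, 65, 62, 80, 64, 68, 63, 76, 57, 82, 89,
    65, 65, 89, 74, 69, 60, 75, 60, 79, 68, 62, 82, 60, 64,
]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the ordered data, with the middle value left out when n is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]


x_bar = mean(times)
s = stdev(times)  # sample standard deviation (divisor n - 1)
minimum, q1, med, q3, maximum = five_number_summary(times)

print(f"mean = {x_bar:.2f}, s = {s:.2f}")
print(f"min = {minimum}, Q1 = {q1}, M = {med}, Q3 = {q3}, max = {maximum}")
print(f"range = {maximum - minimum}, IQR = {q3 - q1}")
# Assumption: the '95%' statement in part c vi refers to mean ± 2s.
print(f"mean ± 2s: {x_bar - 2 * s:.2f} to {x_bar + 2 * s:.2f}")
```

The same five values are what a simple boxplot displays, so the tuple returned by `five_number_summary` can be set beside the Met figures in part d when comparing the two operators.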
3 In a small company, upper management wants to know if there is a difference in the three
methods used to train its machine operators. One method uses a hands-on approach. A
second method uses a combination of classroom instruction and on-the-job training. The
third method is based completely on classroom training. Fifteen trainees are assigned
to each training technique. The following data are the results of a test undertaken by the
machine operators after completion of one of the different training methods.
Method 1:  98 100  89  90  81  85  97  95  87  70  69  75  91  92  93
Method 2:  79  62  61  89  69  99  87  62  65  88  98  79  73  96  83
Method 3:  70  74  60  72  65  49  71  75  55  65  70  59  77  67  80
a Draw boxplots of the data sets, on the same axis.
b Write a paragraph comparing the three training methods in terms of shape, centre,
spread and outliers.
c Which training method would you recommend?
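When boxplots for several groups are drawn on one axis, each group needs its quartiles and a check for outliers. The sketch below is one hedged way to tabulate these for the three training methods in question 3. It assumes the quartiles are the medians of the lower and upper halves of the ordered data, and that an outlier is any value more than 1.5 × IQR below Q1 or above Q3. Both conventions, and the helper names `quartiles` and `outliers`, are assumptions made for illustration.

```python
from statistics import median

# Test results from question 3, one list per training method.
scores = {
    "Method 1": [98, 100, 89, 90, 81, 85, 97, 95, 87, 70, 69, 75, 91, 92, 93],
    "Method 2": [79, 62, 61, 89, 69, 99, 87, 62, 65, 88, 98, 79, 73, 96, 83],
    "Method 3": [70, 74, 60, 72, 65, 49, 71, 75, 55, 65, 70, 59, 77, 67, 80],
}


def quartiles(values):
    """Return (Q1, median, Q3) using medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(data[:half]), median(data), median(upper)


def outliers(values):
    """Return values beyond the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR.

    Assumption: this 1.5 x IQR rule is the outlier convention in use.
    """
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


for name, values in scores.items():
    q1, med, q3 = quartiles(values)
    print(f"{name}: min = {min(values)}, Q1 = {q1}, M = {med}, "
          f"Q3 = {q3}, max = {max(values)}, outliers = {outliers(values)}")
```

Printing one line per method puts the centre (M) and spread (IQR, range) of the groups side by side, which is the comparison part b asks for.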
4 It has been argued that there is a relationship between a child’s level of independence and
the order in which they were born in the family. Suppose that the children in thirteen
three-child families are rated on a 50-point scale of independence. This is done when all
children are adults, thus eliminating age effects. The results are as follows.
Family   First-born   Second-born   Third-born
1        38           9             12
2        45           40            12
3        30           24            12
4        29           16            25
5        34           16            9
6        19           21            11
7        35           34            20
8        40           29            12
9        25           22            10
10       50           29            20
11       44           20            16
12       36           19            13
13       26           18            10
a Draw boxplots of the data sets on the same axis.
b Write a paragraph comparing the independence scores of first-, second- and third-born children.