Download Statistics - Haese Mathematics

8
Statistics
Contents:
A Data and sampling
B Analysis and representation
C Stemplots
D Measures of centre
E The variability (spread) of a distribution
F Box and whisker plots
G Extended investigations
H Normal distributions
I Correlation
J Linear regression
Y:\HAESE\SA_12MA-2ed\SA12MA-2_08\441SA12MA-2_08.CDR Thursday, 16 August 2007 4:44:00 PM DAVID3
SA_12MA-2
STATISTICS (Chapter 8)
INTRODUCTION
Decisions made by governments, businesses, education departments, sporting bodies, etc., are
often made after careful consideration of statistical evidence. Statistics play a vital role in
many areas of our society. Statistics are a tool for helping to make rational decisions about
variables described by data sets.
Amongst other things, governments use statistics to help formulate future policies.
Businesses often use statistics to aid decision making, for example, whether or not to enter
the market with an alternative to a product when there are already several of these products
on the market.
Statistical information about sport has increased dramatically in recent years. We only need
to watch a ‘Twenty20’ cricket match to observe the many statistical graphs and tables used
to help make the viewer more informed.
In advertising, ‘product superiority claims’ are frequent. Often statistical analysis can be used
to analyse such claims so that we may question their validity.
Following are some examples of the types of problems we may face, and where statistical
methods may help answer them:
• A young executive of a hotel chain claimed that lowering the room tariff by 10%
would increase the patronage by 25%. Would this be true?
• A manufacturer of AAA batteries claimed that her batteries outlasted all other
leading brands by at least 100 hours. Is she correct?
• In the AFL, the umpires give more free kicks to the home team than to the other team
due to the crowd’s influence. What evidence do we have, and is the claim justified?
• Should lights be placed at a particular intersection of two roads? What factors
should determine this?
• An employer claims that younger employees (< 30 years) have on average twice
as many sick days as the older ones (> 30 years). Is he correct?
• Which drug for helping to quit smoking has the greatest chance of success?
• Does the unemployment rate of a city affect the crime rate for that city?
DISCUSSION
Examine the following problems:
Problem 1: How much will it cost each week to rent a one-bedroom flat in the Eastern suburbs
of Adelaide compared with one in the Western suburbs?
Problem 2: Has the size of harvested crayfish changed from 1998 to 2008?
Problem 3: Do two different science text books have the same reading level, determined
by word length?
For each problem, discuss:
• how you could obtain appropriate data
• what random variable you need to consider
• how you would make sure the data is randomly selected.
OPENING PROBLEM 1
Kareline is looking to buy a house in the Adelaide suburb of Prospect.
She has collected the information presented in the table, part of which is
shown below. Click on the icon to expose all the data.
SPREADSHEET
For you to consider:
• What is the variable being considered?
• What is the price range of the houses?
• What is the price range of the middle 50% of the houses?
• What is the ‘average’ house price?
• Is it possible for this data to have two ‘averages’?
What would be the effect on the interpretation of data if:
• the extreme values were removed (for example, if Kareline was not prepared to spend
more than $275 000)
• one or more data values were incorrect
• additions were made to the set of data
• Kareline was only interested in 3-bedroom houses?
How reliable is Kareline’s data? How can that reliability be tested?
Statistical measures provide powerful tools for answering questions. Kareline may have
wondered, ‘What is the mean price of a house in Prospect?’.
Such a question provides a starting point for collecting and interpreting data.
A DATA AND SAMPLING
When information for a statistical investigation is collected and recorded, the information is
referred to as data.
WHAT IS A STATISTICAL INVESTIGATION?
The process that Kareline used to collect and interpret data for her house hunting exercise is
an example of a statistical investigation.
There are five processes involved in a statistical investigation:
Step 1: Stating the problem
For Kareline, the problem examined is to find a reliable ‘average’ cost of
a house in Prospect.
Step 2: Collection of data (information)
Data for a statistical investigation can be collected from records, from surveys
(either face-to-face, telephone, or postal), by direct observation or by measuring
or counting. Unless the correct data is collected, valid conclusions cannot be made.
Kareline has collected the data from rental advertisements in newspapers
and on the internet.
Step 3: Organisation and display of data
Data can be organised into tables and displayed on a graph. This allows us to
identify features of the data more easily.
Kareline has tabulated her data using a spreadsheet.
Step 4: Calculation of descriptive statistics
Some statistics used to describe a set of data are the centre and the spread of the
data. These give us a picture of the sample or population under investigation.
Kareline may calculate the mean (average) house price and range of house prices.
She may also look for outliers in the data and decide if the outliers should be
included in her investigation.
Step 5: Interpretation of statistics
This process involves explaining the meaning of the table, graph or descriptive
statistics in terms of the variable, or theory, being investigated.
Kareline may explain any graphs generated and interpret the statistics
calculated from the data.
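Steps 3 and 4 can be illustrated with a short Python sketch. The prices below are hypothetical and are not Kareline's data, and the function name `describe` is chosen here only for illustration. The sketch stores the data in a list and calculates two measures of centre and one measure of spread.

```python
from statistics import mean, median

# Hypothetical house prices in dollars (NOT Kareline's actual data).
prices = [245_000, 262_500, 271_000, 289_000, 310_000, 455_000]


def describe(values):
    """Return simple measures of centre and spread for a list of numbers."""
    return {
        "mean": mean(values),                # the 'average' value
        "median": median(values),            # the middle value
        "range": max(values) - min(values),  # highest minus lowest
    }


summary = describe(prices)
print(summary)
```

In this made-up list the single high price pulls the mean above the median, which is one reason an investigator might ask whether an extreme value should be kept.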
COLLECTION OF DATA
The variable is the subject that we are investigating.
The entire group of objects from which information is required is called the population.
Gathering statistical information properly is vitally important. If gathered incorrectly then any
resulting analysis of the data would almost certainly lead to incorrect conclusions about the
population.
The gathering of statistical data may take the form of:
• a census, where information is collected from the whole population, or
• a survey, where information is collected from a much smaller group of the
population, called a sample.
For example:
• The Australian Bureau of Statistics conducts a census of the whole
population of Australia every five years.
• In opinion polls before an election, a survey is conducted to see which
way a sample of the population will vote.
• The students in a school are to vote for a new school captain. If 20 students from the
school are asked how they will vote, then the population is all the students who attend
the school, and the 20 students form a sample.
Note: A population generally consists of a large number of items. Because of the expense
and time factors it is often easier to select a sample, rather than use the whole
population, and hope that the sample is truly representative of the population.
For accurate information when sampling, it is essential that:
• the number of individuals in the sample is large enough
• the individuals involved in the survey are randomly chosen
from the population. This means that every member of the
population has an equal chance of being chosen.
If the individuals are not randomly chosen or the sample is too small, the data collected may
be biased towards a particular outcome.
For example:
If the purpose of a survey is to investigate how the population of Adelaide will vote at the
next election, then surveying the residents of only one suburb would not provide information
that represents all of Adelaide.
TYPES OF DATA
Data are individual observations of a variable. A variable is a quantity that can have a value
recorded for it or to which we can assign an attribute or quality.
Two types of variable that we commonly deal with are categorical variables and numerical
variables.
CATEGORICAL VARIABLES
A quality or category is recorded for this type of variable. The information collected is
called categorical data.
Examples of categorical variables and their possible categories include:
Colour of eyes: blue, brown, hazel, green and violet
Continent of birth: Europe, Asia, North America, South America, Africa, Australia and
Antarctica
Gender: male or female
Type of car: General Motors, Toyota, Ford, Mazda, BMW, Subaru, etc.
We will not consider categorical data in this course.
NUMERICAL VARIABLES
A number is recorded for this type of variable. The information collected is called numerical
data.
There are two types of numerical variables:
Discrete numerical variables
A discrete variable can only take distinct values and these values are often obtained by
counting.
Examples of discrete numerical variables and their possible values include:
The number of children in a family: 0, 1, 2, 3, ...
The score on a test, out of 30 marks: 0, 1, 2, ..., 29, 30.
Continuous numerical variables
A continuous numerical variable can theoretically take any value on a part of the number
line. Its value often has to be measured.
Examples of continuous numerical variables and their possible values include:
The height of Year 12 students: any value from about 140 cm to 220 cm
The speed of cars on a stretch of highway: any value from 0 km/h to the fastest speed that
a car can travel, but most likely in the range 30 km/h to 120 km/h
The weight of newborn babies: any value from 0 kg to 10 kg but most likely in the
range 0.5 kg to 5 kg
The time taken to run 100 m: any value from 9 seconds to 30 seconds.
EXERCISE 8A.1
1 40 students, from a school with 820 students, are randomly selected to complete a survey
on their school uniform. In this situation:
a what is the population size
b what is the size of the sample?
2 A television station is conducting a viewer telephone-into-the-station poll on the question
‘Should Australia become a republic?’
a What is the population being surveyed in this situation?
b How is the data biased if it is used to represent the views of all Australians?
3 A new drug called Cobrasyl is approved for the treatment of high blood pressure in humans. The drug,
a derivative of cobra venom, is able to reduce blood
pressure to an acceptable level. Before its release, a
research team treated 127 high blood pressure patients
with the drug and in 119 cases it reduced their blood
pressure to an acceptable level.
a What is the sample of interest?
b What is the population of interest?
4 A polling agency is employed to survey the voting intention of residents of a particular
electorate in the next election. From the data collected they are to predict the election
result in that electorate.
Explain why each of the following situations would produce a biased sample.
a A random selection of people in the local large shopping complex is surveyed
between 1 pm and 3 pm on a weekday.
b All the members of the local golf club are surveyed.
c A random sample of people on the local train station between 7 am and 9 am are
surveyed.
d A doorknock is undertaken, surveying every voter in a particular street.
5 Classify the following numerical variables as continuous or discrete.
a The quantity of fat in a lamb chop.
b The mark out of 50 for a Geography test.
c The weight of a seventeen year old student.
d The volume of water in a cup of coffee.
e The number of trout in a lake.
f The number of hairs on a cat.
g The length of hairs on a horse.
h The height of a sky-scraper.
i The number of floors sky-scrapers have.
j The time taken for students to get from home to school.
6 A sample of public trees in a municipality was surveyed for the following data:
a the diameter of the tree (in centimetres) measured 1 metre above the ground
b the type of tree
c the location of the tree (nature strip, park, reserve, roundabout)
d the height of the tree, in metres
e the time (in months) since the last inspection
f the number of inspections since planting
g the condition of the tree (very good, good, fair, unsatisfactory).
Classify the data collected as categorical, discrete numerical or continuous numerical.
7 For each of the following:
i identify the random variable being considered
ii give possible values for the random variable
iii indicate whether the variable is continuous or discrete.
a To measure the rainfall over a 24-hour period at Mount Gambier the height of water
collected in a rain gauge (up to 200 mm) is used.
b To investigate the stopping distance for a tyre with a new tread pattern a braking
experiment is carried out.
c To check the reliability of a new type of light switch, switches are repeatedly turned
off and on until they fail.
d The publisher of a golfing magazine prints 20 000 copies and is concerned with the
number of copies sold.
RANDOM SAMPLES
When taking a sample it is hoped that the information gathered is representative of the entire
population. We must take certain steps to ensure that this is so. If the sample we choose is
too small, the data obtained is likely to be less reliable than that obtained from larger samples.
For accurate information when sampling, it is essential that:
• the individuals involved in the survey are randomly chosen from the population
• the number of individuals in the sample is large enough.
For example:
Measuring the heights of a group of three fifteen-year-olds would not give a very reliable
estimate of the height of fifteen-year-olds all over the world. We therefore need to choose a
random sample that is large enough to represent the population. Note that conclusions based
on a sample will never be as accurate as conclusions made from the whole population, but if
we choose our sample carefully, they will be a good representation.
Care should be taken not to make a sample too large as this is costly, time consuming and
often unnecessary. A balance needs to be struck so that the sample is large enough for there
to be confidence in the results but not so large that it is too costly and time consuming to
collect and analyse the data.
As we have said before, the sample selected from the population must exhibit the characteristics of the chosen population so that the sample is truly representative of the population.
Unless a sample properly represents the population, it would be foolish to draw conclusions
about the population based on the sample results.
For example, a survey on voters’ preferences prior to an election should include all socioeconomic classes and both male and female voters otherwise the survey may produce biased
results which could not be relied upon.
THE SIZE OF A SAMPLE
The size of a sample should be chosen to reliably reflect the information we want to find out
about the entire population.
Various methods exist to find the appropriate sample size.
Some businesses may choose less than the desired number in a sample because of the expense
incurred. For example, a medical research team in the UK always chooses a sample of size
80 for this reason.
Although there is no mathematical reason for doing so, some people may choose a sample of
size $\sqrt{n}$ when n is the population size. Others might choose $\sqrt{n}$ + 10% of n.
Often both $\sqrt{n}$ and $\sqrt{n}$ + 10% of n give sample sizes which are too small.
Another complication is that the population size n is often unknown.
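The two rules of thumb mentioned above can be compared with a small Python sketch. The function names and the population size of 10 000 are hypothetical. The second rule is read literally as the square root of n plus 10% of n, and rounding to the nearest whole number is also an assumption, since the text does not say how to round.

```python
import math


def sqrt_rule(n):
    """Sample size taken as the square root of the population size n."""
    return round(math.sqrt(n))


def sqrt_plus_ten_percent_rule(n):
    """Sample size read literally as sqrt(n) + 10% of n (an assumed reading)."""
    return round(math.sqrt(n) + 0.10 * n)


n = 10_000  # hypothetical population size
print(sqrt_rule(n))                   # 100
print(sqrt_plus_ten_percent_rule(n))  # 1100
```

Neither rule takes into account how accurate the results need to be, which is consistent with the remark that there is no mathematical reason for using them.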
EXERCISE 8A.2
1 Discuss how you would randomly select:
a first and second prize in a hockey club raffle
b 12 members of the public to stand for jury duty
c four numbers from 0 to 37 on a roulette wheel.
2 In order to estimate the mean of a population, samples of various sizes were taken and in
each case the sample mean was found. Alongside is a graph of the results obtained.
[Graph: mean of sample (vertical axis, marked 5 to 20) against sample size (horizontal axis, marked 500 to 2000)]
a From the graph, what sample size would be considered to be large enough?
b What is the best estimate of the population mean?
3
Sample size (n)      200    500    1000   1600   2500   3500   5000
% in favour (P)      82.0   56.4   69.7   62.9   61.8   62.0   61.7
The table above shows the results of asking the question: “Are you in favour of Australia
becoming a republic?”
a Plot the graph of P against n, with n on the horizontal axis.
b At what sample size do the results become reasonably consistent?
c What information can we see from this data?
4 Discuss: “In conducting a survey to find out the
percentage of people who believe the AFL grand
final should always be played at the MCG (Melbourne), it would be a good idea to ask a section
of the crowd at this year’s clash between the West
Coast Eagles and the Adelaide Crows.”
5 An alpine lake contains trout. On one particular day Rex the research scientist caught
600 trout. They were then tagged and released back into the lake. A fortnight later 350
trout were caught and of these 28 had tags.
a Estimate the number of trout in the alpine lake.
b In calculating your estimate, what assumptions have you made?
6 When examining the daily production of bottles of softdrink for quality control purposes,
industrial chemist Tomas takes a sample of $\sqrt{n}$ bottles (n is the daily production level).
a What sample size would Tomas choose if the daily production was 27 583 bottles?
b Tomas would choose at random about 1 bottle in every x. Find x.
c One day he calculated the sample size to be 143. What was the approximate
production level to the nearest 100?
d Tomas has just decided that the sample size is too small and will use $\sqrt{n}$ + 10% of n
bottles in future samples. What sample size would he choose if the daily production
was 24 978 bottles?
e Suggest why the management may be unhappy with Tomas’s decision in d.
7 Most often the population size is unknown. The following formulae are mathematically
correct for determining sample size for simple random sampling:
• For an extremely large population where the population size is unknown:
To be very confident that a sample accurately reflects the population within ±r%,
we take a sample of size n where $n = \dfrac{9600}{r^2}$
• For a population size known to be N:
To be very confident that a sample accurately reflects the population within ±r%,
we take a sample of size n where $n = \dfrac{9600N}{9600 + Nr^2}$
a To examine the proportion of successes of a new weight reduction drug, a sample of
users needs to be taken. How large a sample must be taken to be very confident that
the sample accurately reflects the population of users within ±3% if the population
size is unknown?
b An executive of Mitbushsui Motors wants to find out how much genuine interest
there is in their new series Manga, amongst the Australian community. In order to
get a reasonably accurate estimate (within 2%), but at a reasonable cost:
i what size sample should they include in their survey
ii how should they decide who should be in their survey
iii what questions should be asked in the survey?
c A reporter for the Port Adelaide Messenger was seeking answers to the question:
‘Who do you intend to vote for at the next Federal election?’.
How large a sample would he need if there are 47 621 voters on the electoral roll
and he wishes to be very confident of accuracy within ±2.5%?
d A local council sends a form to households of a suburb of 3578 houses, asking
their opinion of a new development in the area. If they expect 60% of recipients to
respond, how many forms should be sent out to be very sure the results are accurate
within 3%?
e To determine whether members of a local gym would be willing to pay higher fees
in order to fund the installation of a new swimming pool, a sample of the members
is surveyed. Given that there are 568 members at the gym, how large a sample
must be taken to be very confident that the sample accurately measures the views
of all the members within 3%?
f A researcher wishes to find out the proportion of high school students in Adelaide
who have part time jobs. She does not know the number of high school students
in Adelaide, and wants to be very confident that the sample she surveys accurately
reflects the population within 3.5%. If she surveys no more than 50 students from
any given high school to minimise bias, what is the least number of schools she
must visit?
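The two sample size formulae quoted in question 7 can be sketched in Python as follows. The function names and the example values (r = 4 and N = 10 000) are hypothetical and are not taken from the exercise. Rounding up to the next whole person is an assumption made here, on the basis that a sample cannot contain a fraction of an individual.

```python
import math


def sample_size_unknown_population(r):
    """n = 9600 / r^2, for accuracy within plus or minus r percent."""
    return math.ceil(9600 / r**2)


def sample_size_known_population(N, r):
    """n = 9600N / (9600 + N r^2), for a known population size N."""
    return math.ceil(9600 * N / (9600 + N * r**2))


# Hypothetical values, chosen only to illustrate the arithmetic.
print(sample_size_unknown_population(4))        # 600
print(sample_size_known_population(10_000, 4))  # 567
```

For the same accuracy, the known-population formula never asks for a larger sample than the unknown-population formula, because its denominator contains the extra 9600 term.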
SAMPLING METHODS
Possible methods are:
A. SIMPLE RANDOM SAMPLING
For a sample to have the best chance of being truly representative
of the population it should be chosen at random. That is, all
members of the population have an equal chance of being chosen
in the sample. This is a simple random sample.
Random samples can be chosen using coins, dice, numbered
tokens, random number tables, or random number generators on
computers or calculators.
In order to randomly select a sample, each member of the population is assigned a number.
If a member’s number appears, that member is part of the sample.
For example:
Suppose you wish to choose X-lotto numbers.
The population of numbers is the integers 1 to 45 inclusive and you are going to choose a
‘sample’ of six different numbers.
How could you choose these numbers randomly?
Method 1: Number forty five pieces of paper, place them in a container and select six
pieces of paper without looking.
Method 2: Use the random number generator on the calculator.
Using a Texas Instruments TI-83
Press MATH 5 to select 5:randInt( from the MATH PRB menu.
This will bring randInt( to the screen.
Now press 1 , 45 ) .
Pressing ENTER repeatedly will give random integers between 1 and 45.
Ignore repetitions.
Using a Casio fx9860-g
From the RUN menu, press OPTN F6 (▷) F4 (NUM) F2 (INT).
Then press ( 45 EXIT F3 (PROB) F4 (Ran#) ).
Then press + 1 EXE.
Now repeatedly press EXE to produce more random integers.
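A computer can play the same role as the calculator. The following is a minimal Python sketch, not part of the original text, which chooses six different integers from 1 to 45 using the standard `random` module. The function name `pick_numbers` and its optional `seed` parameter are hypothetical.

```python
import random


def pick_numbers(low, high, count, seed=None):
    """Choose `count` different integers from low to high inclusive."""
    rng = random.Random(seed)  # a seed makes the choice repeatable
    # sample() never repeats a value, so there are no repetitions to ignore.
    return sorted(rng.sample(range(low, high + 1), count))


numbers = pick_numbers(1, 45, 6)
print(numbers)
```

Because `sample` selects without replacement, this corresponds to ignoring repetitions on the calculator.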
Example 1
The table shown gives the monthly sales figures, in thousands of dollars, for a
shop over a six year period.
a Choose a year at random.
b Choose a month at random.
c Choose three consecutive years.

            2002   2003   2004   2005   2006   2007
January     43.1   48.7   45.7   44.0   48.6   46.3
February    38.2   35.3   36.4   38.3   37.7   40.2
March       38.6   36.0   36.2   34.8   35.3   33.3
April       40.2   40.9   42.4   42.5   43.8   35.7
May         43.2   44.2   47.0   48.7   50.3   52.4
June        27.8   32.3   33.5   34.1   32.2   35.8
July        26.4   27.2   23.5   27.2   27.7   28.1
August      23.8   24.9   24.8   27.6   26.1   28.2
September   27.4   30.8   32.7   33.6   34.9   35.1
October     40.4   39.3   38.7   41.3   42.4   44.9
November    68.3   67.4   67.3   69.8   70.4   72.6
December    81.2   83.9   84.6   85.5   88.3   87.2

a There are six years from which to choose. We could use a die to randomly choose one
of these years; the year 2002 would be represented by 1, 2003 by 2, ..., 2007 by 6.
Alternatively, we could use the random generator on a calculator.
The randomly chosen year is 2006.
b There are twelve months from which we need to choose one month. We use the
calculator, with 1 representing January, 2 representing February, etc.
The randomly chosen month is November.
c To choose three consecutive years, we need to establish the number of sets of three
consecutive years that are possible:
1 2002 - 2004
2 2003 - 2005
3 2004 - 2006
4 2005 - 2007
There are four possibilities, from which we have to choose one. Using the calculator,
the randomly chosen period is 3, that is, 2004 to 2006.
To choose a simple random sample:
1 Find the sample size needed.
2 State the number of possibilities from which you can choose, and number
them if necessary.
3 State the random number generator that you are using.
4 Explain what you will do if repeated random numbers are not applicable.
5 State the random number(s) chosen and the data that is now in your sample.
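The steps above can be sketched in Python for the task of choosing three consecutive years from 2002 to 2007. This is only an illustration: the variable names are hypothetical, and Python's `random` module stands in for the calculator's random number generator.

```python
import random

# Step 1: the sample needed is one block of three consecutive years.
years = list(range(2002, 2008))  # 2002, 2003, ..., 2007

# Step 2: list and number the possibilities (first year, last year).
blocks = [(years[i], years[i + 2]) for i in range(len(years) - 2)]
for number, (first, last) in enumerate(blocks, start=1):
    print(number, first, "-", last)

# Step 3: the random number generator used is random.randint.
# Step 4: only one number is needed, so repeats cannot occur.
choice = random.randint(1, len(blocks))

# Step 5: state the random number and the resulting sample.
first, last = blocks[choice - 1]
print("Random number:", choice, "gives", first, "to", last)
```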
EXERCISE 8A.3
1 Use your calculator to:
a select a random sample of six different numbers between 5 and 25 inclusive
b select a random sample of 10 different numbers between 1 and 25 inclusive
c select a random sample of six different numbers between 1 and 45 inclusive
d select a random sample of 5 different numbers between 100 and 499 inclusive.
2 Click on the icon to obtain a printable calendar for 2008 showing
the weeks of the year. Each of the days is numbered.
CALENDAR
Using a random number generator, choose a sample from the calendar of:
a five different dates
b a complete week starting with a Monday
c a month
d three different months
e three consecutive months
f a four week period starting on a Saturday
g a four week period starting on any day.
Explain your method of selection in each case.
[Calendar extract for January to May 2008: each date is shown with its day of the week, its day number within the year in brackets (for example, Tu 1 January is day (1) and Fr 1 February is day (32)), and its week number.]
B. SYSTEMATIC SAMPLING
Example 2
Management of a large city store wishes to find out how potential customers like
the look of a new product and whether they would buy it. They decide on a 5%
systematic sampling procedure. Explain what this means.
We notice that: $5\% = \frac{5}{100} = \frac{1}{20}$
So, 1 in 20 people passing by is asked to participate.
If we start with, say, the 3rd person who passes by, then we need to ask the 23rd,
43rd, 63rd, 83rd, 103rd, ..... and so on for a period until sufficient data is obtained.
To obtain a k% random sample, we need to choose a starting place and then choose
every $\left(\frac{100}{k}\right)$th one after that.
If an accountancy firm wishes to randomly survey
its 3217 clients using systematic sampling, they
may do it at a 10% level. Since their clients each
have files then they might select the 3rd, 13th,
23rd, 33rd, etc.
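The accountancy firm's selection can be sketched in Python. The function name `systematic_sample` is hypothetical, and the sketch assumes that 100 divided by k is a whole number and that the starting place has already been chosen.

```python
def systematic_sample(population_size, percent, start):
    """Return the positions selected by a `percent`% systematic sample."""
    step = round(100 / percent)  # every (100/k)th member after the start
    return list(range(start, population_size + 1, step))


# The accountancy firm: 3217 client files, 10% level, starting at the 3rd.
positions = systematic_sample(3217, 10, 3)
print(positions[:4])  # [3, 13, 23, 33]
```

With the store's 5% procedure and a start at the 3rd person, the same function gives 3, 23, 43, 63 and so on.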
C. STRATIFIED SAMPLING
Suppose you wish to know the opinions of the whole student body on possible changes to
the school uniform. Simple random sampling may not be appropriate, as due to chance a
disproportionate number of say year 11s may be chosen and their views may not be considered
to represent the views of all students. What we do is randomly sample each year level with
a sample size proportional to the number in that year level.
Example 3
In our school there are 137 year 8’s, 152 year 9’s, 174 year 10’s, 168 year 11’s and
121 year 12’s. A stratified sample of 50 students is needed. How many should be
randomly selected from each group?
Total number of students in the school is: 137 + 152 + 174 + 168 + 121 = 752
∴ number of year 8’s = $\frac{137}{752} \times 50 \approx 9$
number of year 9’s = $\frac{152}{752} \times 50 \approx 10$
number of year 10’s = $\frac{174}{752} \times 50 \approx 12$
number of year 11’s = $\frac{168}{752} \times 50 \approx 11$
number of year 12’s = $\frac{121}{752} \times 50 \approx 8$
We then have to randomly select 9 year 8’s, 10 year 9’s etc. in the same way.
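The proportional calculation in Example 3 can be checked with a short Python sketch. The function name `stratified_allocation` is hypothetical. Rounding each share to the nearest whole number happens to total 50 here, but that is not guaranteed for every set of strata.

```python
def stratified_allocation(strata_sizes, sample_size):
    """Share a sample among strata in proportion to their sizes."""
    total = sum(strata_sizes.values())
    return {
        name: round(size / total * sample_size)
        for name, size in strata_sizes.items()
    }


school = {"year 8": 137, "year 9": 152, "year 10": 174,
          "year 11": 168, "year 12": 121}
allocation = stratified_allocation(school, 50)
print(allocation)
# {'year 8': 9, 'year 9': 10, 'year 10': 12, 'year 11': 11, 'year 12': 8}
```

Each year level would then be sampled at random using the number allocated to it.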
To obtain a stratified random sample the population is divided into subgroups called strata
and random samples are proportionally selected from each subgroup.
[Diagram: the strata (Year 8s, Year 9s, Year 10s, Year 11s and Year 12s), with a random sample drawn from each]
Note: Other sampling techniques can be used, for example, cluster sampling.
We do not consider them in this course.
EXERCISE 8A.4
1 An NBL basketball club averages 3540 spectators per game. The catering manager
wants to conduct a survey to investigate the proportion of spectators who would spend
more than $20 on food and drinks at the game. He decides to survey the first 40 people
through the gate.
a Discuss any potential bias in the method chosen.
b How reliably would the proportion estimated from this sample reflect the population’s spending? Discuss the sample size in your answer.
c Suggest a better sampling method that includes a suitable sample size that would
better represent the population.
2 A golf club has 1800 members with ages in the following ranges:
Age range     No. of members
under 18      257
18 to < 40    421
40 to < 55    632
55 to < 70    356
over 70       134
A member survey is to be undertaken to determine the proportion who want changes to
dress regulations.
a Why wouldn’t the golf club survey all members on the proposed changes to dress
regulations?
b What minimum sample size should the golf club consider to be 95% confident of
accuracy within 5%?
c If a stratified sample size of 350 is to be used, how many of each age group above
should be surveyed?
3 A large retail store has the following staff: departmental managers - 10;
supervisors - 24; senior sales staff - 62; junior sales staff - 98; shelf packers - 28.
Management wishes to interview a sample of 30 staff to obtain an overall picture of the
staff view of operating procedures. How many of each group of staff members would
be selected for the sample to be representative of overall staff opinion?
4 A school has the following enrolments:
Year group   Boys   Girls
8            82     51
9            73     75
10           52     94
11           78     46
12           84     98
A financial planner wishes to survey the students to investigate the number of students
who receive more than $10 pocket money each week. She decides on a sample size of 30.
a Is a sample size of 30 likely to provide a reliable estimate of the proportion of the
population who receive more than $10 per week? Explain.
b If the survey is to be done using a stratified sampling procedure, calculate the
number to be included in the survey of:
i boys ii girls iii year 8 girls iv year 11 boys v year 12s
c Suggest a way of increasing the reliability of the sample results.
5 The 200 students in year 11 and 12 of a high school were asked whether they had ever smoked a cigarette (y) or not (n). The replies, as they were received, were:
nnnny nnnyn ynnnn yynyy ynyny ynnyn nyynn yynyn ynnyn yynyy
nnyyy yyyyy nnnyy nnnnn nnyny yynny nynnn ynyyn nnyny ynyyy
ynnnn yyyyn yynnn nynyn nynnn yynny nyynn yynyn ynynn nyyyn
ynnyy nyyny ynynn nyynn nnnyy ynyyn yyyny ynnyy nnyny ynnnn
a Why is this data considered in this case to be a population?
b Find the actual proportion of all students who said they had smoked.
c Examine the validity and usefulness of the following sampling techniques which
could have been used to estimate the proportions in b without actually counting
them:
i sampling the first five replies
ii sampling the first ten replies
iii sampling every second reply
iv sampling the fourth member of every group
of five
v randomly selecting 30 numbers from
001 to 200 and choosing the response
corresponding to that number.
(Note: The 96th response is coloured.)
d Are any of (simple random sample, systematic
sample, stratified sample) used in c i to v ?
6 Imagine you are an agricultural researcher with a trial plot of fodder grass on which you are testing a new fertiliser. The plot is 10 metres square. After the grass has been growing for one month, you need to harvest a sample to weigh. It is too time-consuming to collect every blade of grass so you need to collect a sample representative of the whole plot.
a Describe and explain how you would divide up the plot to select a sample of grass
to collect and weigh. You could use a set of random numbers in some way.
b Explain why you think it is necessary to select a random sample across the trial plot
and not just the corner.
SAMPLING ERRORS
Sampling errors are not errors if they are intentional.
We will briefly consider how unintentional sampling errors can occur. Errors in sampling
could arise from:
• bias caused by faults in the sampling process, sometimes called systematic errors.
For example, in sampling flat rent figures in a suburb one must not consider only the
large advertisements as these may more frequently be for classier, more expensive
accommodation. This sort of bias is often unintentional. Remember that the sample
must truly represent the population.
• statistical (or random) errors which are caused by natural variability. A sample
may not reflect the population due to these errors. However, in much larger samples
these errors tend to be smaller.
SAMPLE SIZE WHEN ESTIMATING A POPULATION MEAN
INVESTIGATION 1
HOW LARGE MUST A SAMPLE BE?
Click on the icon to view a population of known mean.
What to do:
1 Select a sample of size n = 2 and find its mean x̄.
2 Repeat several times. Comment on how x̄ compares with the population mean.
3 Now select samples of size n = 10 and in each case find x̄. Comment on how
these values of x̄ compare with the true population mean.
4 Repeat for samples of size n = 100.
5 Write a brief report on your findings.
From the investigation you should have observed that:
The larger the sample size, the closer the mean of the sample reflects the mean of
the population.
This is true for other population characteristics, for example, the standard deviation.
We examine the mean and standard deviation in greater detail later.
It is true to say that: “The greater the sample size, the more reliable will be our findings”.
However, we must strike a balance between the confidence in the reliability of our results
and the expense of carrying out a large sampling procedure.
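The investigation can also be imitated in a few lines of Python. This is a rough sketch under stated assumptions: the population below is invented (10 000 random whole numbers), and the helper `typical_error` is our own name for the average distance between a sample mean and the population mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population: 10 000 whole numbers between 0 and 100.
population = [random.randint(0, 100) for _ in range(10_000)]
population_mean = statistics.mean(population)

def typical_error(n, repeats=200):
    """Average distance between a sample mean and the population mean."""
    errors = []
    for _ in range(repeats):
        sample = random.sample(population, n)   # simple random sample of size n
        errors.append(abs(statistics.mean(sample) - population_mean))
    return statistics.mean(errors)

errors = {n: typical_error(n) for n in (2, 10, 100)}
for n, error in errors.items():
    print(f"n = {n:3d}: sample mean is typically {error:.1f} away from {population_mean:.1f}")
```

The typical error should shrink as n grows, but each increase in n costs more sampling effort, which is the balance described above.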
B ANALYSIS AND REPRESENTATION
Once data has been collected and organised (in table form) it is ready to be analysed and
represented in graphical form.
DISCRETE NUMERICAL DATA
Recall that a discrete numerical variable can take only distinct values.
The data is often obtained by counting.
For example, a farmer has a crop of peas and wishes to investigate the number of peas in
the pods. He takes a random sample of 50 pods and counts the number of peas in each pod,
obtaining the following data:
6 6 5 4 9 8 7 7 7 6 5 6 7 8 8 8 7 5 2 4 7 7 6 7 8
8 7 8 6 6 4 2 9 1 3 3 5 9 8 8 7 7 6 7 7 6 8 4 5 5
The variable in this situation is the discrete numerical variable ‘the number of peas in a pod’.
The data could only take the discrete numerical values 0, 1, 2, 3, 4, ....
TABLES AND GRAPHS
To organise his data the farmer could use
the tally and frequency table shown. A
barchart could be used to display the results.
No. peas in pod   Tally               Frequency
1                 |                   1
2                 ||                  2
3                 ||                  2
4                 ||||                4
5                 ||||/ |             6
6                 ||||/ ||||          9
7                 ||||/ ||||/ |||     13
8                 ||||/ ||||/         10
9                 |||                 3
Total                                 50
[Bar chart: frequency (0 to 14) against number of peas in pod (0 to 9).]
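The tally and frequency count can be checked with a short Python sketch. It assumes the 50 values are read from the data list exactly as printed earlier; `Counter` is part of the standard library.

```python
from collections import Counter

# The farmer's 50 pod counts, copied from the data above.
peas = [6, 6, 5, 4, 9, 8, 7, 7, 7, 6, 5, 6, 7, 8, 8, 8, 7, 5, 2, 4, 7, 7, 6, 7, 8,
        8, 7, 8, 6, 6, 4, 2, 9, 1, 3, 3, 5, 9, 8, 8, 7, 7, 6, 7, 7, 6, 8, 4, 5, 5]

frequency = Counter(peas)  # value -> number of pods with that many peas
for value in sorted(frequency):
    # One star per pod, so each printed row also acts as a sideways bar.
    print(f"{value}  {'*' * frequency[value]:<13}  {frequency[value]}")
print("Total", sum(frequency.values()))
```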
Alternatively, the farmer could use a dot plot
which is a convenient method of tallying the
data and at the same time displaying the
frequencies.
To draw a dot plot:
1 Draw a horizontal axis and mark it with the values that the variable can take. For this
example, the variable took values from 1 to 9, so we mark the axis from 0 to 10.
2 Label the axis with a description, in this case: number of peas in pod.
3 Systematically go through the data, placing a dot or cross above the appropriate position
on the axis.
The dot plot for this example is:
[Dot plot: one dot per pod above each value, on a horizontal axis marked from 0 to 10 and labelled ‘number of peas in pod’.]
Notice that the dots are evenly spaced so the final plot looks similar to the barchart.
From both the barchart and the dot plot it can be seen that:
• Seven was the most frequently occurring number of peas in a pod.
• 35/50 × 100 = 70% of the pods yielded six or more peas.
• 10% of the pods had fewer than 4 peas in them.
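These three observations follow directly from the frequencies. A minimal sketch, assuming the frequencies are typed in from the tally and frequency table:

```python
# Frequencies from the tally and frequency table: value -> number of pods.
frequency = {1: 1, 2: 2, 3: 2, 4: 4, 5: 6, 6: 9, 7: 13, 8: 10, 9: 3}
total = sum(frequency.values())

mode = max(frequency, key=frequency.get)   # value with the highest frequency
six_or_more = sum(f for value, f in frequency.items() if value >= 6)
fewer_than_four = sum(f for value, f in frequency.items() if value < 4)

print("most frequent value:", mode)
print("six or more peas:", 100 * six_or_more / total, "%")
print("fewer than four peas:", 100 * fewer_than_four / total, "%")
```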
DESCRIBING THE DISTRIBUTION OF A SET OF DATA
The distribution of a set of data is the pattern or shape of its graph.
For the example above, the graph has a general shape that is stretched to the left.
[Figure: distribution shape stretched to the left]
This distribution of the data is said to be negatively skewed because it is stretched to the left
(the negative direction).
A positively skewed distribution of data would have a shape that is stretched to the right.
[Figure: distribution shape stretched to the right]
A symmetrical distribution of data is neither positively nor negatively skewed, but
is symmetrical about a central value.
A set of data whose graph has two peaks is said to be bimodal.
Note that the horizontal axis is a number line with numbers in ascending order from left
to right.
Outliers are data values that are either much larger or much smaller than the general
body of data. Outliers appear separated from the body of data on a frequency graph.
For the example, if the farmer found one pod in his sample contained 13 peas then
the data value 13 would be considered an outlier. It is much larger than the other data
in the sample. On the column graph it appears separated.
[Column graph: frequency (0 to 12) against number of peas in pod (0 to 13), with the value 13 marked as an outlier.]
EXERCISE 8B.1
1 A randomly selected sample of households in both Australia and Thailand were asked,
“How many people live in your household?” Column graphs have been constructed for
the results.
[Column graphs: Size of households (Australia) and Size of households (Thailand); frequency (0 to 8) against number of people in the household (1 to 10).]
For each of Australia and Thailand, answer the following:
a How many households were surveyed?
b How many households had only one or two occupants?
c What percentage of the households had five or more occupants?
d Compare the distribution of the data for each survey.
2 A bowler recorded the number of wickets he took in the first 15 innings of the season
and the last 15 innings of the season.
1st half of season: 1 1 3 2 0 0 4 2 2 4 3 1 0 1 0
2nd half of season: 2 1 5 1 3 7 2 2 2 4 3 1 1 0 3
a Construct side by side dot plots for each set of data.
b Compare the distributions of the data sets, noting any
outliers.
c In which part of the season did the bowler have more
success? Give evidence.
3 For an investigation into the number of phone calls made by teenagers,
samples of 50 thirteen-year-olds and 50 fifteen-year-olds were asked the
question, “How many phone calls did you make yesterday?”
The given dot plot was constructed for the data.
[Dot plot: The no. of phone calls made in a day by teenagers; separate rows for 13 y.o. and 15 y.o. above an axis of number of phone calls from 0 to 11.]
a What is the variable in this investigation?
b Explain why the data is discrete numerical data.
c What percentage of each age group did not make any phone calls?
d What percentage of each age group made 5 or more phone calls?
e Describe and compare the distributions of the sets of data.
f How would you describe the data value ‘11’ for each set of data?
CONTINUOUS NUMERICAL DATA
The height of 14-year-old children is being investigated.
The variable ‘height of 14-year-old children’ is a continuous numerical variable because the
values recorded for the variable could, theoretically, be any value on the number line. They
are most likely to fall between 120 and 190 centimetres.
The heights of thirty children are measured in centimetres. The measurements are rounded to
one decimal place, and the values recorded below:
163.0 154.2 152.8 160.5 148.3 149.2 154.7 172.7 171.3 162.5
165.0 160.2 166.2 175.3 143.4 174.6 180.9 162.4 167.3 158.4
159.4 164.5 163.7 183.8 150.8 163.4 181.9 158.3 165.0 156.8
Note that these rounded values are actually discrete. However, when we tally them, we use
continuous class intervals as follows:
The smallest height is 143.4 cm and the largest is 183.8 cm so we will use class intervals 140
up to 150 (this does not include 150), 150 up to 160, 160 up to 170, 170 up to 180, 180 up
to 190. Note that we choose class intervals of the same width.
These class intervals are written as 140 - < 150, 150 - < 160, etc. in the frequency
table.
The final class interval is written as 180 - < 190 which means 180 cm up to a height that
is less than 190 cm.
A tally-frequency table for this example is:
Height (cm)    Tally              Frequency
140 - < 150    |||                3
150 - < 160    ||||/ |||          8
160 - < 170    ||||/ ||||/ ||     12
170 - < 180    ||||               4
180 - < 190    |||                3
Total                             30
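The grouping into class intervals can be sketched as follows. The variable names are our own, and the code assumes class intervals of equal width 10 cm that start at multiples of 10, as chosen above.

```python
heights = [163.0, 154.2, 152.8, 160.5, 148.3, 149.2, 154.7, 172.7, 171.3, 162.5,
           165.0, 160.2, 166.2, 175.3, 143.4, 174.6, 180.9, 162.4, 167.3, 158.4,
           159.4, 164.5, 163.7, 183.8, 150.8, 163.4, 181.9, 158.3, 165.0, 156.8]

width = 10  # every class interval has the same width
counts = {}
for h in heights:
    lower = int(h // width) * width   # e.g. 154.2 falls in 150 - < 160
    counts[lower] = counts.get(lower, 0) + 1

for lower in sorted(counts):
    print(f"{lower} - < {lower + width}: {counts[lower]}")
```

Because each value is assigned by its lower boundary, a height of exactly 150 would fall in 150 - < 160 and not in 140 - < 150, which matches the convention described above.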
A histogram is used to display continuous numerical data. This is similar to a barchart but
because of the continuous nature of the variable, the ‘bars’ are joined together. The frequency
is represented by the height of the ‘bars’.
A histogram for this example is shown opposite:
[Histogram: Heights of a sample of fourteen-year-old children; frequency (0 to 12) against height (cm), 140 to 190.]
RELATIVE FREQUENCY DISTRIBUTIONS
When we compare two distributions which come from different sample sizes, a relative
frequency distribution is used for each of them. Relative frequency tables show the proportion
(or percentage) for each class.
A relative frequency table and histogram can be drawn for the ‘height of 14-year-olds’ data.
Height (cm)    Frequency   Relative frequency %
140 - < 150    3           3/30 × 100 = 10%
150 - < 160    8           26.7%
160 - < 170    12          40%
170 - < 180    4           13.3%
180 - < 190    3           10%
Total          30          100%
[Relative frequency histogram: relative frequency % (0 to 40) against height (cm), 140 to 190.]
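The relative frequency column is each frequency divided by the total, written as a percentage. A small sketch, assuming the frequencies are copied from the table:

```python
frequency = {"140 - < 150": 3, "150 - < 160": 8, "160 - < 170": 12,
             "170 - < 180": 4, "180 - < 190": 3}
total = sum(frequency.values())

# Relative frequency as a percentage, rounded to 1 decimal place.
relative = {label: round(100 * f / total, 1) for label, f in frequency.items()}
for label, percent in relative.items():
    print(f"{label}: {percent}%")
```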
From the tables and graphs we can see:
• More children had a height in the class interval 160 up to 170 cm than any other
class interval. This class interval is called the modal class.
12/30 × 100 = 40% of the children had a height in this class.
• Three of the children (3/30 × 100 = 10%) had a height less than 150 cm.
• Three of the children (10%) were 180 cm or more tall.
• The distribution of heights was approximately symmetrical.
EXERCISE 8B.2
1 The speeds of cars and trucks travelling along a section of highway have been recorded
separately and displayed using the histograms below.
[Histograms: number of cars and number of trucks (each 0 to 200) against speed (km/h), 50 to 130.]
a How many vehicles were included in each survey?
b Compare the percentage of cars and trucks that were travelling at speeds equal to
or greater than 100 km/h.
c Compare the percentage of the cars and trucks that were travelling at a speed less
than 80 km/h.
d If the owners of the vehicles travelling at 110 km/h or more were fined $165 each,
what amount would be collected in fines?
e Compare the shapes of the two histograms.
2 The daily maximum temperature (°C) to the nearest degree, in Adelaide and Hobart, for
each day in January 2006, is recorded below:
Adelaide: 34 38 31 38 23 24 25 26 29 35 41 23 32 36 22 21
          24 26 35 36 25 32 27 30 34 30 27 25 26 23 25
Hobart:   29 31 25 23 18 24 19 20 21 22 28 25 22 17 18 21
          22 25 28 16 17 19 24 26 26 27 23 22 18 20 22
a Using class intervals of 5 degrees construct a tally and frequency table for each city.
b Construct histograms to display the data.
c Compare the distribution of Adelaide’s daily maximum temperatures in January
2006 with Hobart’s.
3 The height of each member of a basketball club has been measured and the results are displayed using the frequency table alongside.
Height (cm)    Male frequency   Female frequency
165 - < 170    1                1
170 - < 175    3                2
175 - < 180    5                12
180 - < 185    12               8
185 - < 190    7                6
190 - < 195    5                2
195 - < 200    2                1
200 - < 205    1                0
a Calculate the relative frequencies and construct a relative frequency histogram for each sex.
b Compare the distributions of the heights.
c Find the percentage of members of each sex whose height is:
i greater than 180 cm
ii less than 170 cm
iii between 175 and 190 cm.
C STEMPLOTS
Constructing a stem-and-leaf plot, commonly called a stemplot, is often a convenient method
to organise and display a set of numerical data.
A stemplot groups the data and shows the relative frequencies but has the added advantage
of retaining the actual data values.
CONSTRUCTING A STEMPLOT
Data values such as 25 36 38 49 23 46 47 15 28 38 34 are all two digit numbers, so
the first digit will be the ‘stem’ and the last digit the ‘leaf’ for each of the numbers.
The stems will be 1, 2, 3, 4 to allow for numbers from 10 to 49.
The stemplot for the data is shown alongside.
Stem | Leaf
1    | 5
2    | 3 5 8
3    | 4 6 8 8
4    | 6 7 9
2 | 3 means 23
Notice that:
• 1 | 5 represents 15
• 2 | 3 5 8 represents 23, 25 and 28
• the data in the leaves is evenly spaced with no commas
• the leaves are placed in increasing order, so this stemplot is ordered
• the scale (sometimes called the key) tells us the place value of each leaf.
If the scale was 2 | 3 means 2.3, then 4 | 6 7 9 would represent 4.6, 4.7 and 4.9.
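The construction of an ordered stemplot for two-digit data can be sketched in Python. The function `stemplot` is our own illustration and assumes whole-number data where the last digit is the leaf.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of an ordered stemplot for whole numbers.

    The stem is everything except the last digit; the leaf is the last digit.
    """
    leaves = defaultdict(list)
    for v in sorted(values):                 # sorting first gives ordered leaves
        leaves[v // 10].append(str(v % 10))
    # Walk through every stem from smallest to largest, keeping empty stems.
    return [f"{stem} | {' '.join(leaves.get(stem, []))}"
            for stem in range(min(leaves), max(leaves) + 1)]

data = [25, 36, 38, 49, 23, 46, 47, 15, 28, 38, 34]
print("\n".join(stemplot(data)))
print("2 | 3 means 23")
```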
If the stems are written with the least number at the top then the stemplot can be rotated so
that the values on the horizontal axis are in ascending order and you can see the shape of the
distribution.
For data values such as 195 199 207 183 201 .... the first two digits are the stem and
the last digit is the leaf.
Example 4
The score, out of 50, on a test was recorded for 36 students.
25 36 38 49 23 46 47 15 28 38 34 9
30 24 27 27 42 16 28 31 24 46 25 31
37 35 32 39 43 40 50 47 29 36 35 33
a Organise the data using a stemplot.
b Comment on the distribution of the data.
a Recording the data from the list gives an unordered stemplot:
Stem | Leaf
0    | 9
1    | 5 6
2    | 5 3 8 4 7 7 8 4 5 9
3    | 6 8 8 4 0 1 1 7 5 2 9 6 5 3
4    | 9 6 7 2 6 3 0 7
5    | 0
Ordering the data from smallest to largest produces an ordered stemplot:
Stem | Leaf
0    | 9
1    | 5 6
2    | 3 4 4 5 5 7 7 8 8 9
3    | 0 1 1 2 3 4 5 5 6 6 7 8 8 9
4    | 0 2 3 6 6 7 7 9
5    | 0
2 | 4 means 24 marks
b The shape of the distribution can be seen when the stemplot is rotated:
[Figure: the ordered stemplot rotated so that the stems lie along the horizontal axis.]
The data is slightly negatively skewed.
We also observe these important features:
• The minimum (smallest) test score is 9.
• The maximum (largest) test score is 50.
SPLIT STEMS
Consider the following example:
The residue that results when a cigarette is smoked
collects in the filter. This residue has been weighed
for twenty cigarettes, giving the following data, in mg.
1.62 1.55 1.59 1.56 1.56 1.55 1.63
1.59 1.56 1.69 1.61 1.57 1.56 1.55
1.62 1.61 1.52 1.58 1.63 1.58
Scanning the data reveals that there will be only two ‘stems’, i.e., 15 and 16. In cases like
this we will need to split the stems.
If we use the stem 15 to represent data with values 1.50 to 1.54 and 15* to represent data
with values 1.55 to 1.59 etc., we can construct a stemplot with four stems:
Stem | Leaf
15   | 2
15*  | 5 5 5 6 6 6 6 7 8 8 9 9
16   | 1 1 2 2 3 3
16*  | 9
15 | 2 means 1.52
If we split the stems five ways, where 150 represents data with values 1.50 and 1.51,
152 represents data with values 1.52 and 1.53 etc., the stemplot becomes:
Stem | Leaf
150  |
152  | 2
154  | 5 5 5
156  | 6 6 6 6 7
158  | 8 8 9 9
160  | 1 1
162  | 2 2 3 3
164  |
166  |
168  | 9
The stemplot with the stems split five ways clearly gives a better view of the distribution
of the data. The value 1.69 appears as an outlier in this graph.
The stemplot with the stems split two ways was not sensitive enough to show this.
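Splitting stems can be sketched in Python as well. The helper `split_stemplot` is a hypothetical name; it labels each row by the smallest value it can hold, in hundredths, so the two-way split prints 150, 155, 160, 165 where the text writes 15, 15*, 16, 16*. It also starts at the first stem that contains data.

```python
residue = [1.62, 1.55, 1.59, 1.56, 1.56, 1.55, 1.63, 1.59, 1.56, 1.69,
           1.61, 1.57, 1.56, 1.55, 1.62, 1.61, 1.52, 1.58, 1.63, 1.58]

def split_stemplot(values, ways):
    """Ordered stemplot for 2-decimal-place data, each stem split `ways` ways.

    Each row covers 10 // ways consecutive leaf digits (ways = 2 or 5).
    """
    step = 10 // ways
    # Work in whole hundredths so that floating-point error cannot move a value.
    hundredths = sorted(round(v * 100) for v in values)
    first = hundredths[0] // step * step
    last = hundredths[-1] // step * step
    rows = {start: [] for start in range(first, last + step, step)}
    for h in hundredths:
        rows[h // step * step].append(str(h % 10))   # leaf = last digit
    return {start: " ".join(leaves) for start, leaves in rows.items()}

for start, leaves in split_stemplot(residue, 5).items():
    print(f"{start} | {leaves}")
```

Calling `split_stemplot(residue, 2)` gives the four-row version, in which 1.69 sits alone on its row but is not visibly separated from the rest of the data.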
BACK-TO-BACK STEMPLOTS
A back-to-back stemplot is a visual display that enables easy analysis and comparison of
two sets of data.
Consider this example:
An office worker has the choice of travelling to work by tram or train. He has recorded the
travel times from recent journeys on both of these types of transport. He wishes to know
which type of transport is quicker and which is the more reliable.
Recent tram journey times (minutes):
21, 25, 18, 13, 33, 27, 28, 14, 18, 43, 19, 22, 30, 22, 24
Recent train journey times (minutes):
23, 18, 16, 16, 30, 20, 21, 18, 18, 17, 20, 21, 28, 17, 16
A back-to-back stemplot could be used to display the relationship between the categorical
variable type of transport which has two categories (or levels), and the numerical variable
travel time.
The type of transport is the independent variable and the travel time is the dependent variable,
because the travel time depends on the type of transport.
A back-to-back stemplot is constructed
with only one stem. The leaves are
grouped on either side of this central
stem. The ordered back-to-back stemplot for the data is shown alongside:
     Train leaf | Stem | Tram leaf
8 8 8 7 7 6 6 6 | 1    | 3 4 8 8 9
    8 3 1 1 0 0 | 2    | 1 2 2 4 5 7 8
              0 | 3    | 0 3
                | 4    | 3
The most frequently occurring travel times by train were between 10 and 20 minutes whereas
the most frequently occurring travel times by tram were between 20 and 30 minutes.
It seems as if it is generally quicker and the travel times are more reliable if the worker travels
by train to work.
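A back-to-back stemplot for the two sets of journey times can be sketched as follows. The helper `leaves_by_stem` and the column width of 15 characters are our own choices for this illustration.

```python
tram = [21, 25, 18, 13, 33, 27, 28, 14, 18, 43, 19, 22, 30, 22, 24]
train = [23, 18, 16, 16, 30, 20, 21, 18, 18, 17, 20, 21, 28, 17, 16]

def leaves_by_stem(values):
    """Map each stem (tens digit) to its ordered list of leaves (units digits)."""
    result = {}
    for v in sorted(values):
        result.setdefault(v // 10, []).append(str(v % 10))
    return result

left, right = leaves_by_stem(train), leaves_by_stem(tram)
stems = range(min(min(left), min(right)), max(max(left), max(right)) + 1)

rows = []
for stem in stems:
    # Left-hand leaves are reversed so they increase away from the central stem.
    left_leaves = " ".join(reversed(left.get(stem, [])))
    right_leaves = " ".join(right.get(stem, []))
    rows.append(f"{left_leaves:>15} | {stem} | {right_leaves}")

print(f"{'Train leaf':>15} | Stem | Tram leaf")
print("\n".join(rows))
```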
EXERCISE 8C
1 The heights (to the nearest centimetre) of Year 10 boys and girls in a school are being
investigated. The sample data are as follows:
Boys: 164 168 175 169 172 171 171 180 168 168 166 168 170 165 171 173
187 179 181 175 174 165 167 163 160 169 167 172 174 177 188 177
185 167 160
Girls: 165 170 158 166 168 163 170 171 177 169 168 165 156 159 165 164
154 170 171 172 166 152 169 170 163 162 165 163 168 155 175 176
170 166
a Construct a back-to-back stemplot for the data.
b Compare and comment on the distributions of the data, mentioning the shape.
c What percentage of each sex are 175 cm or taller?
2 A new cancer drug is being developed and is being tested on rats. Two groups of twenty
rats with cancer were formed; one group was given the drug while the other was not.
The survival time of each rat in the experiment was recorded up to a maximum of 192
days.
Survival times of rats that were given the drug:
64 78 106 106 106 127 127 134 148 186
192* 192* 192* 192* 192* 192* 64 78 106 106
Survival times of rats that were not given the drug:
37 38 42 43 43 43 43 43 48 49
51 51 55 57 59 62 66 69 86 37
* denotes that the rat was still alive at the end of the experiment
a Construct a back-to-back stemplot for the data.
b Compare and comment on the distributions of the data, mentioning the shape.
c What percentage of each group of rats survived for 70 days or more?
3 Peter and John are competing taxi-drivers who wish to know who earns more money.
They have recorded the amount of money (in dollars) collected per hour for five hours
over five days:
Peter: 17.3 11.3 15.7 18.9 9.6 13 19.1 18.3 22.8 16.7 11.7 15.8
12.8 24 15 13 12.3 21.1 18.6 18.9 13.9 11.7 15.5 15.2 18.6
John: 23.7 10.1 8.8 13.3 12.2 11.1 12.2 13.5 12.3 14.2 18.6 18.9
15.7 13.3 20.1 14 12.7 13.8 10.1 13.5 14.6 13.3 13.4 13.6 14.2
a Construct a back-to-back stemplot for the data.
b Compare and comment on the distributions of the data, mentioning the shape, and
any outliers.
c Who seems to collect more money per hour?
4 The residue that results when a cigarette is smoked collects in the filter. The residue
from twenty cigarettes from the two different brands was measured, giving the following
data, in milligrams:
Brand X: 1.62 1.55 1.59 1.56 1.56 1.55 1.63 1.59 1.56 1.69
1.61 1.57 1.56 1.55 1.62 1.61 1.52 1.58 1.63 1.58
Brand Y: 1.61 1.62 1.69 1.62 1.60 1.59 1.66 1.55 1.61 1.62
1.64 1.61 1.58 1.57 1.57 1.57 1.58 1.60 1.63 1.59
a Copy and complete the back-to-back stemplot for this data:
Brand Y | Stem | Brand X
        | 150  |
        | 152  | 2
        | 154  | 5 5 5
        | 156  | 6 6 6 6 7
        | 158  | 8 8 9 9
        | 160  | 1 1
        | 162  | 2 2 3 3
        | 164  |
        | 166  |
        | 168  | 9
156 includes values 1.56 and 1.57
b Comment on and compare the shape of the distributions.
D MEASURES OF CENTRE
A picture of a data set can be obtained if we have an indication of the centre of the data and
the spread of the data.
Two statistics that provide a measure of the centre of a set of data are:
• the mean
• the median.
THE MEAN
How a class performs in a mathematics test is quickly and probably best described by quoting
the arithmetic mean (often called the average) of the distribution of marks.
The mean of n numbers is obtained by summing the numbers and then dividing by n.
For the numbers $x_1, x_2, x_3, x_4, \dots, x_n$, the mean is
$$\bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + \dots + x_n}{n}.$$
Example 5
The results of a biology test (out of 50) are given below:
44 7 30 40 22 32 39 13 38 35 31 36 23 48
29 34 27 39 37 16 35 41 35 45 20 32 38 46
Find the mean of the test results.
$$\text{Mean, } \bar{x} = \frac{44 + 7 + 30 + \dots + 38 + 46}{28} = \frac{912}{28} \approx 32.6$$
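The calculation of the mean in Example 5 can be checked with a few lines of Python. The list below assumes the 28 results are read row by row from the example.

```python
biology = [44, 7, 30, 40, 22, 32, 39, 13, 38, 35, 31, 36, 23, 48,
           29, 34, 27, 39, 37, 16, 35, 41, 35, 45, 20, 32, 38, 46]

total = sum(biology)      # x1 + x2 + ... + xn
n = len(biology)          # number of results
mean = total / n
print(total, n, round(mean, 1))
```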
Note:
• The mean involves all the data values.
• If you are told that the mean mark for a test is 65% then there will be some
marks higher than 65% and some marks lower than 65%.
• The mean does not have to be one of the data values.
For example:
The mean number of children per family is 1.8 in Adelaide.
It is obvious that a family cannot have 1.8 children, but this statistic suggests that
families typically have about 1 or 2 children.
THE MEDIAN
When a set of data is written in order, the median is the middle value of the set.
For the biology test results, the ordered data set is:
7 13 16 20 22 23 27 29 30 31 32 32 34 35 35 35 36 37 38 38 39 39 40 41 44 45 46 48
There are two middle scores, so the median score = 35 {their average}.
For a sample of size n, the median is the $\left(\frac{n+1}{2}\right)$th score.
Note:
If n is odd, say 17, the median is the $\frac{17+1}{2}$ = 9th score.
If n is even, say 18, the median is the $\frac{18+1}{2}$ = 9.5th score,
indicating the average of the 9th and 10th scores.
Example 6
Find the median for the following data sets:
a 5 5 7 3 8 2 3 4 6 5 7 6 4
b 3 5 5 5 5 6 6 6 7 7 7 8 8 8 9 10
a The data set is ordered (arranged from smallest to largest):
2 3 3 4 4 5 5 5 6 6 7 7 8
The median is the $\frac{13+1}{2}$ = 7th value.
The median is 5.
b There are 16 data values so the median is the average of the 8th and 9th values:
3 5 5 5 5 6 6 6 7 7 7 8 8 8 9 10
The median is $\frac{6+7}{2}$ = 6.5. (Note: This is not one of the data values.)
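The rule for locating the median can be written as a short Python function. This is our own sketch of the rule stated above, not a calculator procedure; the standard library function `statistics.median` gives the same results.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                        # the (n + 1)/2 th score
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([5, 5, 7, 3, 8, 2, 3, 4, 6, 5, 7, 6, 4]))             # part a
print(median([3, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 10]))  # part b
```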
Note on symmetry
This distribution is symmetric. Data values are symmetrically spread about the centre.
For a symmetrical distribution the mean and median are equal (or approximately equal).
[Figure: symmetric distribution, with the mean and median at the centre.]
This distribution is negatively skewed (or skewed left) and the mean < the median.
[Figure: negatively skewed distribution, with the mean to the left of the median.]
This distribution is positively skewed (or skewed right) and the mean > the median.
[Figure: positively skewed distribution, with the mean to the right of the median.]
FINDING THE MEAN AND MEDIAN OF UNGROUPED DATA
Consider the data 2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9:
For TI-83
Data is entered in the STAT EDIT menu. Press STAT 1 to select 1:Edit.
In L1, delete all existing data. Enter the new data.
Press 2 ENTER then 3 ENTER etc, until all data is entered.
To obtain the descriptive statistics:
Press STAT then the right arrow key to select the STAT CALC menu. Press 1 to select 1:1–Var Stats.
Pressing 2nd 1 (L1) ENTER gives the mean x̄ = 4.87 (to 3 sf).
Scrolling down by pressing the down arrow key repeatedly gives the median = 5.
For Casio
From the Main Menu, select STAT. In List 1, delete all existing data and enter the new
data. Press 2 EXE then 3 EXE etc until all data is entered.
To obtain the descriptive statistics:
Press F6 (▷) if the GRPH icon is not in the bottom left corner of the screen.
Press F2 (CALC) F1 (1VAR) which gives the mean x̄ = 4.87 (to 3 sf).
Scrolling down by pressing the down arrow key repeatedly gives the median = 5.
MEAN AND MEDIAN FOR GROUPED DISCRETE DATA
Example 7
The frequency table alongside shows data collected from a random sample of 50 households
in a particular suburb, investigating the number of people in the household.
Use the calculator to find the mean and median of the number of people in a household for this
sample.
Number of people in the household   Frequency
1                                   5
2                                   8
3                                   13
4                                   14
5                                   7
6                                   3
For TI-83
Press STAT 1 to select 1:Edit. Key the variable values into L1 and the
frequency values into L2. Press STAT then the right arrow key, then 1, to select 1:1–Var Stats from the
STAT CALC menu.
Enter L1, L2 by pressing 2nd 1 (L1) , 2nd 2 (L2) ENTER.
The mean is 3.38. Scroll down, and the median is 3.
Note: If you do not include L2 you will get a screen of statistics for L1 only.
For Casio
From the Main Menu, select STAT. Key the variable values into List 1 and the
frequency values into List 2.
Press F6 (▷) if the GRPH icon is not in the bottom left corner of the screen.
Press F2 (CALC) F6 (SET) F3 (List2) to change the frequency variable to List 2.
Press EXIT F1 (1VAR).
The mean is 3.38. Scroll down, and the median is 3.
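The calculator results for Example 7 can be checked by hand or with a short Python sketch. It assumes the frequencies are copied from the table; the variable names are our own.

```python
# Example 7 frequency table: number of people -> number of households.
frequency = {1: 5, 2: 8, 3: 13, 4: 14, 5: 7, 6: 3}

n = sum(frequency.values())
# Each value is counted as many times as its frequency.
mean = sum(value * f for value, f in frequency.items()) / n

# Expand the table back into an ordered list to locate the middle values.
ordered = [value for value in sorted(frequency) for _ in range(frequency[value])]
median = (ordered[(n - 1) // 2] + ordered[n // 2]) / 2
print(n, mean, median)
```

With n = 50 the two middle values are the 25th and 26th, and both are 3.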
MEAN AND MEDIAN FOR GROUPED CONTINUOUS DATA
If continuous data is grouped using class intervals, we use the midpoints of the class intervals
as the variable values.
Example 8
The time taken by students to complete a mid-year examination in Economics for all
students participating is given in the table following (in minutes).
Time (minutes)   Students
60 - < 70        1
70 - < 80        2
80 - < 90        11
90 - < 100       24
100 - < 110      28
110 - < 120      13
a What do you suspect was the duration of the examination paper?
b What are the midpoints (x) of the time intervals?
c Calculate the mean and median time to complete the exam.
a The exam paper was for 120 minutes (2 hours).
b The midpoints are 65, 75, 85, 95, 105, 115 (minutes).
c For TI-83: we enter the midpoints into L1 and the frequencies into L2.
For Casio: we enter the midpoints into List 1 and the frequencies into List 2.
We then proceed using the instructions as in Example 7 to get:
The mean is 99.6 minutes.
The median is 105 minutes.
Note: The median is given here as one of the midpoints entered.
Why?
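The same calculation for grouped continuous data can be sketched in Python, using the class midpoints as the variable values. The dictionary layout is our own choice for this illustration.

```python
# Example 8 table: (lower bound, upper bound) of each class -> number of students.
classes = {(60, 70): 1, (70, 80): 2, (80, 90): 11,
           (90, 100): 24, (100, 110): 28, (110, 120): 13}

# Replace each class interval by its midpoint.
midpoints = {(low + high) / 2: f for (low, high), f in classes.items()}
n = sum(midpoints.values())
mean = sum(x * f for x, f in midpoints.items()) / n

# Every student in a class is treated as if their time were the class midpoint.
ordered = [x for x in sorted(midpoints) for _ in range(midpoints[x])]
median = (ordered[(n - 1) // 2] + ordered[n // 2]) / 2
print(round(mean, 1), median)
```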
EXERCISE 8D
1 Consider the following two data sets:
Data set A: 3, 4, 4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 10
Data set B: 3, 4, 4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 15
a Find the mean for both Data set A and Data set B.
b Find the median of both Data set A and Data set B.
c Explain why the mean of Data set A is less than the mean of Data set B.
d Explain why the median of Data set A is the same as the median of Data set B.
2 The back-to-back stemplot below shows the points per game scored by two basketballers,
Erin and Tracy:
Erin
Leaf
9
875
76411
8420
1
Stem
0
1
2
3
4
Leaf
4478
012689
359
11
Tracy
3j1 represents 31 points
a Find the mean for each player.
b Find the median for each player.
c Why is the median for Erin not one of
the points per game listed?
d Which player generally scored more
points per game?
3 The frequency table below records the number of phone calls made in a day by 50 thirteen-year-olds and 50 eighteen-year-olds.
Number of      13 year old   18 year old
phone calls    Frequency     Frequency
0              5             1
1              8             2
2              13            3
3              8             4
4              6             4
5              3             6
6              3             8
7              2             7
8              1             5
9              0             4
10             0             3
11             1             3
a For both sets of data, find the:
i mean
ii median.
b Why is the mean larger than the median for the thirteen-year-old data?
c Why are the mean and median approximately equal for the eighteen-year-old data?
4 The weights of a squad of AFL players are compared with those of NRL players.
Weight (kg)    Number of AFL players   Number of NRL players
70 - < 80      8                       0
80 - < 90      10                      3
90 - < 100     12                      9
100 - < 110    3                       11
110 - < 120    2                       7
a How many players were weighed in each squad?
b Calculate the mean weight for players in each squad.
c Find the median weight for players in each squad.
d Which squad generally has heavier players?
5 A tennis club has 450 members listed on its database. The population mean of members' ages has been calculated at 28.3. The marketing department wants to survey members on their preferred social activities at the club. The age breakdowns of members at the club are:
Age range     No. of members
under 18      62
18 - < 30     211
30 - < 50     103
over 50       74
a How many of each age range should be surveyed if a stratified sample of 20 members is used?
The marketing department noted the ages of the 20 members surveyed. The results were:
10, 72, 25, 44, 52, 15, 17, 62, 60, 32, 19, 23, 48, 37, 21, 27, 35, 25, 26, 29
b Find the sample mean age of the members surveyed.
c Why is this different to the mean age of the member population?
d Suggest how the sample mean age could better reflect the population mean age.
e The marketing department decided to conduct another survey of 40 members. Discuss the reliability of the sample mean age of this sample in comparison to the sample mean age of the first sample of 20 members.
6 A school has 820 families listed on the enrolment database. The income ranges of the families are listed below:
Income range             Number of families
$0 - < $30 000           56
$30 000 - < $60 000      214
$60 000 - < $90 000      445
$90 000 - < $120 000     73
$120 000 - < $150 000    32
a Calculate an estimate of the mean income of all families at the school.
The school bursar wanted to survey a sample of 30 families to determine their reaction to an increase in school fees. She selected 30 families at random for this purpose and at the same time she asked them to record their income for the last year.
The results were:
$45 000   $54 000   $38 000   $85 000   $75 000   $21 000
$123 000  $47 000   $121 000  $29 000   $145 000  $95 000
$52 000   $46 000   $55 000   $132 000  $112 000  $63 000
$134 000  $115 000  $94 000   $127 000  $78 000   $102 000
$89 000   $72 000   $29 000   $83 000   $62 000   $54 000
b Calculate the mean income of the sample of 30 families.
c Why is this different from the estimate of the population mean calculated in a?
d If a stratified sample of 30 families was used, calculate the number of families in
the $120 000 - < $150 000 income range that should be surveyed.
e How many families in the $120 000 - < $150 000 income range were actually
surveyed?
f Discuss the reliability of a stratified sample of 30 families as compared to a simple
random sample of 30 families as done in a.
g How could an even more reliable sample be obtained?
CHOOSING THE APPROPRIATE MEASURE OF THE CENTRE
The mean and median can be used to indicate the centre of a set of numbers. Which of
these values is the most appropriate measure to use will depend upon the type of data under
consideration.
In real estate values the median is used as a measure of the centre. Why is this?
When selecting which of the measures of central tendency to use as a representative figure for
a set of data, you should keep the following advantages and disadvantages of each measure
in mind.
Mean
• The mean's main advantage is that it is commonly used, easy to understand and easy to calculate.
• Its main disadvantage is that it is affected by extreme values within a set of data and so may give a distorted impression of the data.
For example, consider the following data: 4, 6, 7, 8, 19, 111. The total of these 6 numbers is 155, and so the mean is approximately 25.8. Is 25.8 a representative figure for the data? The extreme value (or outlier) of 111 has distorted the mean in this case.
Median
• The median's main advantage is that it is easily calculated and is the middle value of the data.
• Unlike the mean, it is not affected by extreme values.
• The main disadvantage is that it ignores all values outside the middle range and so its representativeness is questionable.
Note: Because the mean is unable to resist the influence of extreme values it is a non-resistant measure of the centre. The median is a resistant measure.
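The effect of the outlier 111 on the mean, and the resistance of the median, can be seen in a small Python sketch. This is an illustration only and is not part of the original text; the list `without_outlier` is a name introduced here.

```python
from statistics import mean, median

data = [4, 6, 7, 8, 19, 111]

# The same data with the extreme value 111 left out.
without_outlier = [x for x in data if x != 111]

# With the outlier: the mean is pulled far above most of the values.
print(round(mean(data), 1), median(data))

# Without the outlier: the mean changes a great deal, the median very little.
print(mean(without_outlier), median(without_outlier))
```

Removing one value moves the mean from about 25.8 to 8.8, while the median only moves from 7.5 to 7.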
E THE VARIABILITY (SPREAD) OF A DISTRIBUTION
We use two measures to describe a distribution. These are its centre and its variability (or spread).
The distributions shown have the same mean, but clearly they have a different spread. For example, the A distribution has most scores close to the mean whereas the C distribution has greater spread.
[Figure: three distributions A, B and C with the same mean but different spreads]
Three commonly used statistics that indicate the spread of a set of data are:
• the range
• the interquartile range
• the standard deviation.
THE RANGE AND INTERQUARTILE RANGE
The range is the difference between the maximum (largest) data value and the minimum
(smallest) data value.
Range = maximum data value − minimum data value
Example 9
Self Tutor
A greengrocer chain is to purchase apples from two different wholesalers. They take
six random samples of 50 apples to examine them for skin blemishes. The counts for
the number of blemished apples are:
Wholesaler Redapp:    5   17   15    3    9   11
Wholesaler Pureapp:  10   13   12   11   12   11
What is the range of blemished apples from each wholesaler?
Wholesaler Redapp: Range = 17 − 3 = 14
Wholesaler Pureapp: Range = 13 − 10 = 3
Note: This shows that Wholesaler Redapp has more variability in the number of skin blemished apples per sample of 50.
Note: The range is not considered to be a particularly reliable or resistant measure of spread as it uses only two data values.
THE INTERQUARTILE RANGE (Review)
The median divides the ordered data set into two halves and these halves are divided in half
again by the quartiles.
The middle value of the lower half is called the lower quartile (Q1). One-quarter, or 25%, of the data have a value less than or equal to the lower quartile. 75% of the data have values greater than or equal to the lower quartile.
The middle value of the upper half is called the upper quartile (Q3). One-quarter, or 25%, of the data have a value greater than or equal to the upper quartile. 75% of the data have values less than or equal to the upper quartile.
Interquartile range = upper quartile − lower quartile
The interquartile range is the range of the middle 50% of the data. The data set has been divided into quarters by the lower quartile (Q1), the median (Q2) and the upper quartile (Q3).
So, the interquartile range, IQR = Q3 − Q1.
Example 10
Self Tutor
For the data set 5 5 7 3 8 2 3 4 6 5 7 6 4 find the:
a median b lower quartile c upper quartile d interquartile range
The ordered data set is 2 3 3 4 4 5 5 5 6 6 7 7 8
a There are 13 data values so the median is the 7th value, which is 5.
There is an odd number of data and the median is one of the values so it divides the data into two halves of six values each.
Note: For an odd number of data the median data value is not included in the lower or upper half for the calculation of the quartiles.
b The middle value of the lower half is the average of the 3rd and 4th values.
Lower half (6 values): 2 3 3 4 4 5     Median: 5     Upper half (6 values): 5 6 6 7 7 8
Lower quartile = (3 + 4)/2 = 3.5
c Similarly, the middle value of the upper half is the average of the 10th and 11th values:
Upper quartile = (6 + 7)/2 = 6.5
d Interquartile range = upper quartile − lower quartile = 6.5 − 3.5 = 3
So, the middle half of the data has a spread of 3.
A summary for the set of data in Example 10 is:
2 3 3 4 4 5 5 5 6 6 7 7 8
Lower quartile = 3.5     Median = 5     Upper quartile = 6.5
Range = 8 − 2 = 6
Interquartile range = 3
The data has a spread of 6 (range = 6), centred around the value 5 (median = 5).
The middle half of the data has a spread of 3 (interquartile range = 3).
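The halving method of Example 10 can be sketched in Python. This is not part of the original text, and the function name `quartiles` is illustrative. Other software may use a different quartile convention and so give slightly different values.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the halving method of Example 10."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd number of data, the median value itself is left out of both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

data = [5, 5, 7, 3, 8, 2, 3, 4, 6, 5, 7, 6, 4]
q1, med, q3 = quartiles(data)
print(q1, med, q3, q3 - q1)
```

For this data set the sketch gives Q1 = 3.5, median = 5, Q3 = 6.5 and an interquartile range of 3, in agreement with the working above.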
Although they give useful information, the range and the interquartile range are not as useful
as the standard deviation as a measure of spread.
The range uses only two data values and the interquartile range ignores the lowest and highest
quarters of the data.
The standard deviation is an average variation from the mean of all data values.
For a set of n data values of x: x1, x2, x3, x4, ....., xn:
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$ is the standard deviation for a sample with mean $\bar{x}$.
Example 11
Self Tutor
Find the standard deviations for the apple samples of Example 9.
Wholesaler Redapp
        $x$     $x - \bar{x}$   $(x - \bar{x})^2$
        5       −5              25
        17      7               49
        15      5               25
        3       −7              49
        9       −1              1
        11      1               1
Total   60                      150
∴ $\bar{x} = \dfrac{60}{6} = 10$ and $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{150}{5}} \approx 5.48$
Wholesaler Pureapp
        $x$     $x - \bar{x}$   $(x - \bar{x})^2$
        10      −1.5            2.25
        13      1.5             2.25
        12      0.5             0.25
        11      −0.5            0.25
        12      0.5             0.25
        11      −0.5            0.25
Total   69                      5.5
∴ $\bar{x} = \dfrac{69}{6} = 11.5$ and $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{5.5}{5}} \approx 1.05$
Clearly, Wholesaler Pureapp supplied apples with more blemishes but with less variability (smaller standard deviation) than for those supplied by Redapp.
Note: The formula and examples above are included for completeness and to give you an idea of how the standard deviation is calculated. In this course, you should concentrate on using technology to find the standard deviation.
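As one way of "using technology", the table calculation of Example 11 can be reproduced in Python. This sketch is not part of the original text; `sample_sd` is an illustrative name, and it is compared with `stdev` from Python's standard `statistics` module, which also divides by n − 1.

```python
import math
from statistics import stdev

def sample_sd(values):
    """Sample standard deviation: sqrt( sum (x - mean)^2 / (n - 1) )."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))

redapp = [5, 17, 15, 3, 9, 11]
pureapp = [10, 13, 12, 11, 12, 11]

for name, sample in [("Redapp", redapp), ("Pureapp", pureapp)]:
    # The hand-written formula and the library function should agree.
    print(name, round(sample_sd(sample), 2), round(stdev(sample), 2))
```

Both methods give about 5.48 for Redapp and about 1.05 for Pureapp.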
USING THE CALCULATOR TO FIND THE MEASURES OF SPREAD
We will concentrate on using technology to find the measures of spread.
Example 12
Self Tutor
Find the three measures of spread for the number of goals thrown by a netballer in
18 games: 8, 4, 3, 9, 6, 5, 5, 10, 3, 6, 7, 9, 11, 14, 9, 8, 7, 12
We key the data into a list. The data does not have to be ordered.
For both the TI-83 and the Casio we obtain:
The range = maxX − minX = 14 − 3 = 11
The standard deviation is 3.05
The IQR = Q3 − Q1 = 9 − 5 = 4
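The same three measures can be obtained with a short Python sketch, given here only as an alternative form of technology and not as part of the original text. The quartiles are found by the halving method used earlier in this section; the variable names are illustrative.

```python
from statistics import median, stdev

goals = [8, 4, 3, 9, 6, 5, 5, 10, 3, 6, 7, 9, 11, 14, 9, 8, 7, 12]

# Split the ordered data into lower and upper halves
# (the median value is left out of both halves when n is odd).
ordered = sorted(goals)
n = len(ordered)
half = n // 2
lower_half = ordered[:half]
upper_half = ordered[half + 1:] if n % 2 == 1 else ordered[half:]

data_range = max(goals) - min(goals)
iqr = median(upper_half) - median(lower_half)
sd = stdev(goals)  # sample standard deviation (divides by n - 1)

print(data_range, iqr, round(sd, 2))
```

The data does not need to be ordered before it is entered, because the sketch sorts it itself.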
EXERCISE 8E
1 Netballers Sally and Joanne compare their goal throwing scores for the last 8 matches.
Goals by Sally:    23   17   31   25   25   19   28   32
Goals by Joanne:    9   29   41   26   14   44   38   43
a Find the mean and standard deviation for the number of
goals thrown by each goal shooter for these matches.
b Which measure is used to determine which of the goal
shooters is more consistent?
2 Two cricketers compare their bowling performances for the last ten test matches. The number of wickets per match was:
Glen:    0   10    1    9   11    0    8    5    6    7
Shane:   4    3    4    1    4   11    7    6   12    5
a Show that each bowler has the same mean and range.
b Which performance do you suspect is more variable, Glen’s bowling over the period
or Shane’s?
c Check your answer to b by finding the IQR and standard deviation for each distribution.
d Does the range, IQR or the standard deviation give a better indication of variability?
3 A manufacturer of softdrinks employs a statistician for quality control. Suppose that he
needs to check that 375 mL of drink goes into each can. The machine which fills the
cans may malfunction or slightly change its delivery due to constant vibration or other
factors.
a Would you expect the standard deviation for the whole production run to be the
same for one day as it is for one week? Explain.
b If samples of 125 cans are taken each day, what measure would be used to:
i check that 375 mL of drink goes into each can
ii check the variability of the volume of drink going into each can.
c What is the significance of a low standard deviation in this case?
4 Two groups of students are given pairs of shoes to wear to school. The first group (X)
have original rubber soled shoes and the second group (Y) have the new synthetic rubber
soled shoes. The data below shows the thickness of the soles of the shoes after six
months.
Group X: 3, 5, 6, 4, 5, 6, 2, 7, 3, 4, 4, 6, 5, 5, 5, 7, 6, 4, 4, 3, 6, 5, 4, 2
Group Y: 4, 6, 5, 4, 3, 5, 6, 6, 7, 6, 6, 4, 5, 7, 8, 6, 7, 5, 3, 6, 6, 7, 5
a Find the median, lower and upper quartiles for each distribution.
b Find the range and IQR of each distribution.
c Is the new synthetic rubber on the soles of shoes an improvement? Give evidence.
DISCUSSION
Consider the range, interquartile range and standard deviation.
Which of these measures is resistant and which is non-resistant as a
measure of spread?
F BOX AND WHISKER PLOTS
A box and whisker plot (or simply a boxplot) is a visual display of some of the descriptive statistics of a data set. It shows:
• the minimum value (Minx)
• the lower quartile (Q1)
• the median (med)
• the upper quartile (Q3)
• the maximum value (Maxx)
These five numbers form the five-number summary of a data set.
CONSTRUCTING A BOXPLOT
A boxplot (box-and-whisker plot) is constructed above a number line (labelled and scaled)
which is drawn so that it covers all the data values in the data set.
The boxplot is drawn with a rectangular ‘box’ representing the middle half of the data. The
‘box’ goes from the lower quartile to the upper quartile.
The ‘whiskers’ extend from the ‘box’ to the maximum value and to the minimum value.
A vertical line marks the position of the median in the ‘box’.
For example, for the data set 2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9:
The ordered data set is 1, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 7, 7, 8, 9 (15 data), with 7 values on each side of the median.
The minimum is 1.
The maximum is 9.
The median is the 8th value, 5.
The lower quartile is the 4th value, 3.
The upper quartile is the 12th value, 7.
These 5 statistics form the five-number summary.
[Figure: boxplot above a number line from 1 to 9, with the minimum, lower quartile, median, upper quartile, maximum and the two whiskers labelled]
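The five-number summary for this data set can be checked with a Python sketch. It is an illustration only, not part of the original text; `five_number_summary` is a name introduced here, and the quartiles use the halving method described earlier in the chapter.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # The median value is left out of both halves when n is odd.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

data = [2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9]
print(five_number_summary(data))
```

The box of the boxplot then runs from the second to the fourth of these numbers, and the whiskers run out to the first and the last.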
USING A GRAPHICS CALCULATOR TO CONSTRUCT A BOXPLOT
Consider the data: 2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9
For a TI-83
Press STAT 1 to select 1:Edit.
Enter the data from the example above into L1.
Statistical graphs are drawn using STAT PLOT, which is located above the Y= key. Press 2nd Y= to use it.
Press ENTER to use Plot1.
Turn the plot On by pressing ENTER, then use the arrow keys to choose the boxplot icon and press ENTER.
Press ZOOM 9 to select 9:ZoomStat and draw the boxplot.
TRACE can be used to locate the statistics of the five-number summary. The arrow keys move backwards and forwards between them.
[Screenshot: the boxplot with the cursor on the median]

For a Casio
From the Main Menu, select STAT.
Enter the data from the example above into List1.
Press F6 (▷) until the GRPH icon is in the bottom left corner of the screen.
Press F1 (GRPH) F6 (SET), then F6 (▷) F2 (BOX) to choose the boxplot.
Check that the XList variable is set to List1, then press EXIT F4 (SEL) F1 (On) to turn StatGraph1 on.
Press F6 (Draw) to draw the boxplot.
Pressing F1 (1VAR) gives the statistics of the data, including the five-number summary.
INTERPRETING A BOXPLOT
A set of data with a symmetric distribution will have a symmetric boxplot.
For example:
[Figure: frequency graph of a symmetric distribution for x from 10 to 20, with its boxplot drawn on the same scale]
The whiskers of the boxplot are the same length and the median line is in the centre of the box.
A set of data which is positively skewed will have a positively skewed boxplot.
For example:
[Figure: frequency graph of a positively skewed distribution for x from 1 to 8, with its boxplot drawn on the same scale]
The right whisker is longer than the left whisker and the median line is to the left of the centre of the box.
A set of data which is negatively skewed will have a boxplot that appears stretched to the
left.
For example:
[Figure: negatively skewed distribution for x from 1 to 9, with its boxplot drawn on the same scale]
The left whisker is longer than the right and the median line is to the right of the centre of the box.
Example 13
Self Tutor
[Figure: parallel boxplots of age (years), on a scale from 0 to 45, for female (n = 34) and male (n = 26) elephants]
A conservation park in Sri Lanka is home to 60 elephants, of which 34 are females
and 26 are males. The parallel boxplots above show the distribution of their ages
by sex.
a What sex was the youngest elephant?
b How old is the oldest female elephant?
c Compare the range of ages by sex and interpret.
d Compare the median age of each sex.
e The youngest 25% of female elephants are aged between ...... and ......
f 75% of male elephants are aged under ......
g Comment on the shape of each distribution.
a The youngest elephant was aged 2 and is male.
b The oldest female elephant is 36½ years old.
c Male: Range = 43 − 2 = 41
Female: Range = 36½ − 4 = 32½
∴ the range of male ages is larger, indicating greater variability.
d The median age is 23 for both males and females.
e The youngest 25% of female elephants are aged between 4 and 14.
f 75% of male elephants are aged under 30½.
g The distribution of male ages is roughly symmetrical, whilst the distribution of female ages is stretched to the left and hence is negatively skewed.
EXERCISE 8F.1
Box and whisker plots are often used to compare data.
1 The following box and whisker plots are for weekly motor vehicle sales at two large car
yards owned by the same business.
[Figure: boxplots of weekly sales for Yard A and Yard B, each on a scale from 0 to 12]
a Find the range of each distribution.
b Find the median of each distribution.
c Find the interquartile range of each distribution.
d Do the boxplots enable you to deduce the more effective sales yard?
2 The given side by side box and whisker plots compare the results of a science test and a re-test of the same topic.
[Figure: parallel boxplots for Test A and Re-test B on a scale from 3 to 9]
a Write a brief account comparing the medians, ranges and IQRs.
b Do you think that the group of students have improved their understanding of the topic due to the re-test?
3 A large hardware chain is examining 50 mm diameter PVC pipe from three different manufacturers. When the data is analysed, boxplots are constructed of measurements of the internal diameters of randomly selected pipes. The boxplots are shown alongside.
[Figure: boxplots for manufacturers A, B and C on a scale from 49.8 to 50.3]
Which manufacturer should the hardware chain use if:
a they want a consistent diameter (small variability)
b they want the largest diameter
c they want a diameter as close as possible to 50 mm?
4 The boxplots alongside compare the time students in years 10 and 12 spend on homework over a one week period.
[Figure: boxplots for Year 10 and Year 12 on a scale from 0 to 20 hours]
a Find the 5-number summaries for both the year 10 and year 12 students.
b Determine the i range ii interquartile range for each group.
5 Two classes have completed the same test. Boxplots have been drawn to summarise and display the results. They have been drawn on the same set of axes so that the results can be compared.
[Figure: parallel boxplots of test score, on a scale from 30 to 90, for Class A and Class B]
a In which class was:
i the highest mark
ii the lowest mark
iii there a larger spread of marks?
b Find:
i the range of marks in class B
ii the interquartile range for class A.
c If the pass mark was 50 for the test what percentage of students passed the test in:
i class A
ii class B?
d Describe the distribution of marks in:
i class A
ii class B.
e Copy and complete:
The students in class ....... generally scored higher marks.
The marks in class ...... were more varied.
TESTING FOR OUTLIERS
Outliers are extraordinary data that are either much larger or much smaller than
the main body of data.
One commonly used test for outliers involves the following calculation of ‘boundaries’:
The upper boundary = upper quartile + 1.5 × IQR.
Any data larger than this number is an outlier.
The lower boundary = lower quartile − 1.5 × IQR.
Any data smaller than this value is an outlier.
When outliers exist, the ‘whiskers’ of a boxplot extend to the last value that is not an outlier.
Outliers are marked by an asterisk. It is possible to have more than one outlier at either end.
Example 14
Self Tutor
Use technology to draw a boxplot for the following data, identifying any outliers.
1, 3, 7, 8, 8, 5, 9, 9, 12, 14, 7, 1, 4, 8, 16, 8, 7, 9, 10, 13, 7, 6, 8, 11, 17, 7
For TI-83
We enter the data in L1.
Use STAT PLOT. Press 2nd Y= .
Press ENTER to use Plot1.
Turn the plot On then use the arrow keys to choose the 'boxplot with outliers' icon. Then press ENTER .
Press ZOOM 9 to select 9:ZoomStat and draw the boxplot.
Press TRACE and use the arrow keys to move the cursor through the summary statistics. Note that both values at 1 are included as are 16 and 17.

For Casio
We enter the data in List 1.
Press F6 (▷) until the GRPH icon is in the bottom left corner of the screen.
Press F1 (GRPH) F6 (SET), and then F6 (▷) F2 (Box) to select boxplot. Also set Outliers to On.
Press EXIT F4 (SEL) F1 (On) to turn StatGraph1 on.
Press F6 (Draw) to draw the boxplot.

The data points at 1, 16 and 17 are highlighted as outliers. Note that only one of the outliers at 1 appears on the screen.
We now sketch the boxplot:
[Figure: boxplot with outliers above a number line from 0 to 17, labelled 'variable'. Annotations: two outliers of the same value are both shown; the whisker is drawn to the last value that is not an outlier.]
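The boundary test for outliers can be sketched in Python for the data of Example 14. This is an illustration rather than part of the original text; `outlier_report` is a name introduced here, and the quartiles are found by the halving method used earlier in the chapter, which may differ from the convention of other software.

```python
from statistics import median

def outlier_report(data):
    """Return (lower boundary, upper boundary, outliers, whisker ends)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    iqr = q3 - q1
    lower_boundary = q1 - 1.5 * iqr
    upper_boundary = q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < lower_boundary or x > upper_boundary]
    # The whiskers stop at the last values that are not outliers.
    inside = [x for x in ordered if lower_boundary <= x <= upper_boundary]
    return lower_boundary, upper_boundary, outliers, (inside[0], inside[-1])

data = [1, 3, 7, 8, 8, 5, 9, 9, 12, 14, 7, 1, 4, 8, 16,
        8, 7, 9, 10, 13, 7, 6, 8, 11, 17, 7]
print(outlier_report(data))
```

With these data the boundaries are 2.5 and 14.5, so both values of 1 and the values 16 and 17 are outliers, and the whiskers end at 3 and 14.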
CONSTRUCTING PARALLEL BOXPLOTS
A graphics calculator can be used to construct parallel boxplots which can then be interpreted
and compared.
Consider the office workers from page 464 who recorded travel times to work by train or tram.
Tram travel times (minutes): 21, 25, 18, 13, 33, 27, 28, 14, 18, 43, 19, 22, 30, 22, 24
Train travel times (minutes): 23, 18, 16, 16, 30, 20, 21, 18, 18, 17, 20, 21, 28, 17, 16
If, in addition, car travel time data is available to the office worker, we can use parallel
boxplots to compare the data. They help us decide which type of transport is the quickest to
get him to work and which is the most reliable.
Car travel times (minutes): 30, 21, 19, 17, 24, 28, 23, 25, 25, 16, 18, 19, 29, 22
PARALLEL BOXPLOTS FROM A CALCULATOR
For TI-83
Press STAT 1 to select 1:Edit.
The data for each of the boxplots is entered in a separate list.
Press 2nd Y= to select STAT PLOT.
Press ENTER to access Plot1.
Make sure that Plot1 is On, the "boxplot with outliers" icon is selected, and the XList is set to L1.
Use the arrow keys to return the cursor to the top of the screen, then press ENTER to access Plot2.
Adjust the settings to Plot2 so they match Plot1, except set the XList variable to L2, by pressing 2nd 2 (L2).
Return the cursor to the top of the screen, press the arrow key until Plot3 is highlighted, then press ENTER .
Again match the settings of Plot3 with those of Plot1, except set the XList variable to L3.
ZOOM 9 (9:ZoomStat) will bring the graphs to the screen.
TRACE , then the arrows, can be used to find '5-number summary' values on the screen.
For Casio
From the Main Menu, select STAT.
The data for each of the boxplots is entered in a separate list.
Press F6 (▷) until the GRPH icon is in the bottom left corner of the screen.
Press F1 (GRPH) F6 (SET) to access StatGraph1.
Press F6 (▷) F2 (Box) to select the boxplot, press F1 (List1) to set the XList variable to List1, then set the Outliers to On.
Use the arrow keys to return the cursor to the top of the screen, then press F2 (GPH2) to access StatGraph2.
Adjust the settings of StatGraph2 so they match StatGraph1, except set the XList variable to List2.
Return the cursor to the top of the screen, then press F6 (GPH3) to access StatGraph3.
Again match the settings of StatGraph3 with those of StatGraph1, except set the XList variable to List3.
Press EXIT F4 (SEL), and make sure all three graphs are set to DrawOn.
Press F6 (Draw) to draw the boxplots.
The three boxplots are drawn on the one axis:
[Figure: parallel boxplots for tram, train and car (a categorical variable with three categories) against travel time in minutes (a numerical variable), on a scale from 10 to 45]
The car travel times have almost the same spread (range = 14 mins, IQR = 6 mins) as the
train travel times (range = 14 mins, IQR = 4 mins), suggesting that the car travel time is as
reliable as the train travel time.
However, the train travel times include two outliers
which may be due to extraordinary events. If these
are ignored then the range of travel times for the train
would be 7 minutes, which is considerably less than
the ranges for the car and tram.
The median car travel time is 22.5 minutes, compared to 18 minutes for the train and 22 minutes for the tram, so it is still generally quicker to travel by train.
In conclusion: From the data given, it is generally quicker and more reliable to travel by
train than it is by either tram or car.
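The figures quoted in this comparison can be reproduced with a Python sketch. It is illustrative only and not part of the original text; the function `summary` and the dictionary `travel_times` are names introduced here, and the quartiles use the halving method from earlier in the chapter.

```python
from statistics import median

def summary(data):
    """Return (median, range, IQR) for one data set."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    return median(ordered), ordered[-1] - ordered[0], q3 - q1

travel_times = {
    "tram": [21, 25, 18, 13, 33, 27, 28, 14, 18, 43, 19, 22, 30, 22, 24],
    "train": [23, 18, 16, 16, 30, 20, 21, 18, 18, 17, 20, 21, 28, 17, 16],
    "car": [30, 21, 19, 17, 24, 28, 23, 25, 25, 16, 18, 19, 29, 22],
}

for mode, times in travel_times.items():
    med, data_range, iqr = summary(times)
    print(f"{mode}: median = {med}, range = {data_range}, IQR = {iqr}")
```

A smaller median suggests a quicker form of transport, and a smaller range or IQR suggests a more reliable one.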
EXERCISE 8F.2
1 The daily maximum temperatures in Melbourne for June 21st and December 21st (the solstices) are being compared. The data for the 20 years from 1987 to 2006 is given below:
June 21st: 13.6, 10.6, 19.1, 14.2, 12.2, 11.9, 18.3, 14.9, 14.6, 15.1,
17.4, 13.5, 16.7, 14.0, 11.1, 17.0, 15.4, 16.3, 15.6, 36.3
December 21st: 24.2, 19.4, 21.4, 22.7, 21.4, 20.0, 22.3, 21.1, 18.9, 23.5,
21.3, 23.0, 28.1, 20.3, 17.2, 35.0, 33.7, 21.9, 21.4, 38.6
a Construct parallel boxplots for the data.
b Are any outliers able to be identified?
c Copy and complete the given table:
                     June 21st   Dec. 21st
Mean
Median
Range
IQR
Stand. deviation
d Compare and comment on the two data sets.
e The outlier of 36.3° on June 21st is clearly a mistake in recording! It should have been 16.3°.
Complete the following table:
                     June 21st with 36.3°   June 21st with 16.3°
Mean
Median
Range
IQR
Standard deviation
f Discuss the effect on each of the measures of centre and spread above after the removal of the outlier.
2 The heights (to the nearest centimetre) of boys and girls in a school year are as follows:
Boys 164 168 175 169 172 171 171 180 168 168 166 168 170 165 171
173 187 179 181 175 174 165 167 163 160 169 167 172 174 177
188 177 185 167 160 123 205
Girls 165 170 158 166 168 163 170 171 177 169 168 165 156 159 165
164 154 170 171 172 166 152 169 170 163 162 165 163 168 155
175 176 170 166
a Construct parallel boxplots for the data.
b Are there any outliers present?
There are no boys in the year group with a height of
123 cm, but there is one giant of 205 cm! Remove the
123 cm from the set of data.
c Calculate the effect of removing the outlier on the:
i mean
ii median
iii range
iv IQR
v standard deviation.
d Compare and comment on the distribution of the two
data sets with the outlier removed.
3 Batting averages for Australian and Indian teams for the 2001 test series in India were:
Australia: 109.8, 48.6, 47.0, 33.2, 32.2, 29.8, 24.8, 20.0, 10.8, 10.0, 6.0, 3.4, 1.0
India: 83.83, 56.33, 50.67, 28.83, 27.00, 26.00, 21.00, 20.00, 17.67, 11.33, 10.00, 6.00, 4.00, 4.00, 1.00, 0.00
a Construct parallel boxplots for the data, displaying outliers.
b Compare and comment on the centres and spread of the data sets.
c Should any outliers be discarded and the data be reanalysed?
USING A STATISTICAL COMPUTER PROGRAM
Click on the icon to produce a computer program which will enable you to compare data, obtain statistics and draw graphs of comparison. You can then print it all.
G EXTENDED INVESTIGATIONS
EXERCISE 8G
1 Shane and Brett play in the same cricket team and are fierce but friendly rivals when
it comes to bowling. During a season the number of wickets per innings taken by each
bowler was recorded as:
Shane: 1 6 2 0 3 4 1 4 2 3 0 3 2 4 3 4 3 3
3 4 2 4 3 2 3 3 0 5 3 5 3 2 4 3 4 3
7 2 4 8 1 3 4 2 3 0 5 3 5 2
4 3 4 0 3 3 0 2 5 1 1 2 2 5
a Is the data discrete or continuous?
b Enter the data into a graphics calculator or statistics package.
c Produce side-by-side boxplots for the data.
d Are there any outliers? Should they be deleted before we start to analyse the data?
e Describe the shape of each distribution.
f Compare the measures of the centre of each distribution.
g Compare the spreads of each distribution.
h What conclusions, if any, can be drawn from the data?
Brett:
3
1
1
4
2
0
0
1
2 A manufacturer of light globes claims that the newly invented type has a life 20% longer
than the current globe type. Forty of each globe type are randomly selected and tested.
Here are the results to the nearest hour.
Old type: 103 96 113 111 126 100 122 110 84 117 111 87 90 121
99 114 105 121 93 109 87 127 117 131 115 116 82 130
113 95 103 113 104 104 87 118 75 111 108 112
New type: 146 131 132 160 128 119 133 117 139 123 191 117 132 107
141 136 146 142 123 144 133 124 153 129 118 130 134 151
145 131 109 129 109 131 145 125 164 125 133 135
a Is the data discrete or continuous?
b Enter the data into a graphics calculator or statistics package and obtain side-by-side
boxplots.
c Are there any outliers? Should they be deleted before we start to analyse the data?
d Compare the measures of centre and spread.
e Use b to describe the shape of each distribution.
f What conclusions, if any, can be drawn from the data?
3 Plant fertilisers come in many different brands, but there are essentially two types:
organic and inorganic. A student was interested to discover whether radish plants responded better to organic or inorganic fertiliser. He prepared three identical plots of
ground, named plots A, B and C, in his mother’s garden, and planted 40 radish seeds
in each plot. After planting, each plot was treated in an identical manner, except for the
way they were fertilised. Cost prevented him using a variety of fertilisers, so he chose
one organic and one inorganic fertiliser. Plot A received no fertiliser, plot B received the
organic fertiliser as prescribed on the packet, and plot C received the inorganic fertiliser
as prescribed on the packet. The student was interested in the weight of the root that
forms under the ground.
The data below is the weight of the root (measured to the nearest gram) of the individual
plants:
Data from plot A: 27 29 9 10 8 36 36 42 32 32 32 30 38 32 30
39 38 50 34 41 39 40 12 14 35 35 42 25 34 22
Data from plot B: 51 54 56 41 50 47 47 46 48 52 34 20 28 45 58
47 58 56 63 66 54 48 48 53 47 29 46 33 34
Data from plot C: 55 76 65 61 67 69 68 64 76 59 56 79 70 65 47
69 70 76 43 70 62 60 58 79 65 75 60 39 50 66
68 68 63 54 61 72 58 77
a Produce parallel boxplots for the data.
b Compare and comment on the distributions of the weights of the root for each
plot, mentioning the shape, centre and spread and quoting statistics to support your
statements.
INVESTIGATION 2
KARELINE’S REAL ESTATE DATA
Open the spreadsheet on Kareline’s real estate data.
What to do:
1 In F2 enter =COUNT(C:C)
In F3 enter =QUARTILE(C:C,0)
In F4 enter =QUARTILE(C:C,1)
In F5 enter =QUARTILE(C:C,2)
In F6 enter =QUARTILE(C:C,3)
In F7 enter =QUARTILE(C:C,4)
2 In F9 enter =AVERAGE(C:C) and in F10 enter =STDEV(C:C)
3 Draw a box and whisker plot of the real estate data.
4 Obtain either real estate data or weekly rental data of flats from two or more different
suburbs. Find 5-number summaries for each and appropriate boxplots using the given
spreadsheet.
5 Write a summary of your findings from 4 (no more than 100 words). The emphasis
is to be on comparing the two data sets.
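The spreadsheet formulas used in this investigation can be mirrored outside a spreadsheet. The following is a minimal Python sketch, assuming the data sits in a plain list. The `prices` values and the function name `five_number_summary` are made up for illustration and are not Kareline's data. It also assumes that the spreadsheet's QUARTILE function interpolates in the same way as the `inclusive` method of Python's `statistics.quantiles`.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # method="inclusive" treats the smallest and largest values as the
    # 0th and 100th percentiles and interpolates linearly in between.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical selling prices in $000s (illustrative only).
prices = [185, 192, 210, 224, 230, 245, 260, 275, 310]

print("count:", len(prices))                       # like =COUNT(C:C)
print("five-number summary:", five_number_summary(prices))
print("mean:", statistics.mean(prices))            # like =AVERAGE(C:C)
print("sample sd:", statistics.stdev(prices))      # like =STDEV(C:C)
```

The five values returned are the ones needed to draw a box and whisker plot by hand.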
INVESTIGATION 3
HOW DO YOU LIKE YOUR EGGS?
This investigation examines the weight and dimensions of eggs. Because
you will need to collect the data for at least 5 dozen eggs, it is suggested
that you work with at least one, and preferably three other people.
Eggs are sold in three categories: small, medium or large. Decide which category your
group will use.
Using a set of electronic scales, measure the
weight of at least five dozen eggs in the category of your choice.
Use a set of electronic calipers to measure the
length and maximum diameter of each egg.
Record your results in a spreadsheet.
Use the spreadsheet to organise the data in three separate ways.
By weight
• in classes of 0.1 g
• in classes of 0.2 g
• without classes
By length
• in classes of 0.1 mm
• in classes of 0.2 mm
• without classes
By width
• in classes of 0.1 mm
• in classes of 0.2 mm
• without classes
Present your results in a suitable table.
Draw side by side box and whisker plots for each set of data.
In a brief report, comment on your results. Discuss the characteristics of a typical egg in
the category you have studied.
By chance, you hear a suggestion that hens of different breeds produce eggs of different
sizes.
Discuss how you would set about examining that conjecture.
On the basis of your work in this investigation, discuss the characteristics you would expect
to find for a different category of eggs. How many eggs do you think it would take to test
your conjecture?
If someone brought an egg to you and asked if it was of the category you had measured,
how would you check?
INVESTIGATION 4
HEART STOPPER
A new drug that is claimed to lower the cholesterol level in humans has been
developed.
A heart specialist was interested to know if the claims made by the company
selling the drug were accurate. He enlisted the help of 50 of his patients.
They agreed to take part in an experiment in which 25 of them would be randomly allocated to take the new drug and the other 25 would take an identical looking pill that was
actually a placebo (a sugar pill that would have no effect at all).
All participants had their cholesterol level measured before starting the course of pills and
then at the end of two months of taking the drug, they had their cholesterol level measured
again. The data collected by the doctor is given below.
Cholesterol levels of all participants before the experiment:
7.1 6.7 6.2 6.0 6.3 8.2 7.3 7.0 5.0 6.2
8.4 8.9 8.1 8.3 8.5 6.5 6.2 8.4 7.9 5.0
6.5 6.3 6.4 6.7 6.6 7.1 7.1 7.6 7.3 8.1
7.2 8.4 8.6 6.0 6.8 7.1 7.4 7.5 7.4 7.5
6.1 7.6 7.9 7.4 6.5 6.0 7.5 6.2 8.6 7.6
Cholesterol levels of the 25 participants who took the drug:
4.8 4.4 4.7 5.6 4.7 4.7 4.7 4.9 5.1 4.2
6.2 4.6 4.8 4.7 5.2 4.6 4.7 4.8 4.4 5.2
5.6 4.8 4.2 5.0 4.4
Cholesterol levels of the 25 participants who took the placebo:
7.0 5.7 8.2 8.4 8.3 7.5 8.8 7.9 6.0 6.1
6.7 7.6 6.6 7.3 6.1 7.6 6.1 6.5 7.4 7.9
8.4 6.2 6.6 6.8 6.5
What to do:
1 Produce a single stemplot for the cholesterol levels of all participants after the experiment. Present the stemplot so that this data can be simply compared to all the
measurements before the experiment began.
2 Use technology to calculate the relevant statistical data.
3 Use the data to complete the table:
Cholesterol level | Before experiment | 25 participants taking the drug | 25 participants taking the placebo
4.0 - < 4.5 | | |
4.5 - < 5.0 | | |
5.0 - < 5.5 | | |
5.5 - < 6.0 | | |
... | | |
8.5 - < 9.0 | | |
4 Calculate the mean and standard deviation for each group in the table.
5 Write a report presenting your data and findings based on that data.
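The class table and the summary statistics asked for in this investigation can also be produced with a short program. The following is a minimal Python sketch for the drug group only, assuming classes of width 0.5 starting at 4.0; the function name `class_counts` is hypothetical, and the same call could be repeated for the other two groups.

```python
import statistics
from collections import Counter

def class_counts(levels, start=4.0, width=0.5):
    """Tally values into classes start - < start + width, and so on."""
    counts = Counter()
    for level in levels:
        # Work in whole tenths so that boundary values such as 4.5
        # always fall into the class that starts at that boundary.
        tenths = round(level * 10)
        index = (tenths - round(start * 10)) // round(width * 10)
        counts[start + index * width] += 1
    return dict(sorted(counts.items()))

# Cholesterol levels of the 25 participants who took the drug.
drug = [4.8, 4.4, 4.7, 5.6, 4.7, 4.7, 4.7, 4.9, 5.1, 4.2,
        6.2, 4.6, 4.8, 4.7, 5.2, 4.6, 4.7, 4.8, 4.4, 5.2,
        5.6, 4.8, 4.2, 5.0, 4.4]

for lower, count in class_counts(drug).items():
    print(f"{lower:.1f} - < {lower + 0.5:.1f}: {count}")
print("mean:", round(statistics.mean(drug), 2))
print("standard deviation:", round(statistics.stdev(drug), 2))
```

Note that `statistics.stdev` divides by n − 1; if the population formula is preferred, `statistics.pstdev` could be used instead.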
H
NORMAL DISTRIBUTIONS
Many data sets have frequency distributions that are ‘bell-shaped’ and symmetrical about the
mean.
For example, the histogram alongside exhibits this typical ‘bell-shape’. The data
represents the heights of a group of adult
women and has a mean of 165 cm and a
standard deviation of 8 cm.
The data is centred about the mean and
spreads from 140 cm to 190 cm. However,
most of the data have values between
155 cm and 170 cm and not many have
values more than 180 cm or less than
150 cm.
[Histogram: frequency (0 to 25) against height (cm), from 140 to 190 in classes of 5 cm]
THE NORMAL DISTRIBUTION CURVE
On the right we have the graph of the
normal distribution of scores.
Notice its symmetry.
[Figure: the normal distribution curve, relative frequency against score, with mean x̄ = median at the centre]
The normal distribution is a theoretical, or idealised model of many real life distributions.
In a normal distribution, data is symmetrically distributed about the mean.
The mean also coincides with the median of the data.
The normal distribution lies at the heart of statistics. Many naturally occurring phenomena
have a distribution that is normal, or approximately normal.
Some examples are:
• the chest sizes of Australian males
• the distribution of errors in many manufacturing processes
• the lengths of adult female tiger sharks
• the length of cilia on a cell
• scores on tests taken by a large population
• repeated measurements of the same quantity
• yields of corn, wheat, etc.
HOW THE NORMAL DISTRIBUTION ARISES
Example 1:
Consider the oranges stripped from an orange tree. They do not all have the same weight.
This variation may be due to several factors which could include:
• different genetic factors
• different times when the flowers were fertilised
• different amounts of sunlight reaching the leaves and fruit
• different weather conditions (some may be affected by the prevailing winds more than
others), etc.
The result is that much of the fruit could have weights centred about, for example, a mean
weight of 214 grams, and there are far fewer oranges that are much heavier or lighter.
Invariably, a bell-shaped distribution of weights would be observed and the normal distribution
model fits the data fairly closely.
Example 2:
In manufacturing nails of a given length, say 50 mm, the machines produce nails of average
length 50 mm but there is minor variation due to random errors in the manufacturing process.
A small standard deviation of 0.3 mm, say, may be observed, but once again a bell-shaped
distribution models the situation.
Once a normal model has been established we can use it to make predictions about the
distribution and to answer other relevant questions.
A TYPICAL NORMAL DISTRIBUTION
A large sample of cockle shells was collected
and the maximum distance across each shell
was measured. Click on the video clip icon
to see how a histogram of the data is built up.
Now click on the demo icon to observe the
effect of changing the class interval lengths
for normally distributed data.
PROPERTIES OF NORMAL DISTRIBUTIONS
INVESTIGATION 5
THE NORMAL CURVE’S PROPERTIES
Click on the icon to obtain a sample from a normal
distribution.
What to do:
1 Find for n = 300, the sample’s mean (x̄), median and standard deviation (s).
2 Find the proportion of the sample values which lie in the intervals x̄ ± s, x̄ ± 2s,
x̄ ± 3s.
3 Select another random sample for n = 200 and repeat 2.
4 Repeat again, each time recording your results.
5 Increase n to 1000 and obtain more data for proportions in the intervals described in
2. Repeat several times.
6 Write a brief report of your findings.
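If the sampling software is not available, the investigation can be imitated with a short simulation. The following is a minimal Python sketch, assuming a sample drawn from a normal distribution with mean 0 and standard deviation 1; the function name `proportions_within`, the seed and the sample size are illustrative choices only.

```python
import random
import statistics

def proportions_within(sample):
    """Proportion of values within 1, 2 and 3 sample standard deviations of the sample mean."""
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)
    result = {}
    for k in (1, 2, 3):
        inside = sum(1 for x in sample if abs(x - mean) <= k * s)
        result[k] = inside / len(sample)
    return result

random.seed(1)  # fixed seed so that repeated runs use the same sample
sample = [random.gauss(0, 1) for _ in range(1000)]

for k, proportion in proportions_within(sample).items():
    print(f"within {k} standard deviation(s) of the mean: {proportion:.3f}")
```

Changing the seed or the sample size and rerunning shows how much the three proportions vary from sample to sample.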
From the previous investigation you should have discovered that the normal distribution has
certain properties or characteristics that enable valid statistical inferences to be made. Some
of the properties are listed below.
For the Normal distribution it can be shown that:
• 68% of the data will have values within one standard deviation of the mean.
• 95% of the data will have values within two standard deviations of the mean.
• 99.7% of the data will have values within three standard deviations of the mean.
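Graphically this can be summarised:
[Figure: three normal curves showing 68% of data between x̄ − s and x̄ + s, 95% of data between x̄ − 2s and x̄ + 2s, and 99.7% of data between x̄ − 3s and x̄ + 3s]
These properties are illustrated on the normal distribution below:
[Figure: normal curve marked at x̄ − 3s, x̄ − 2s, x̄ − s, x̄, x̄ + s, x̄ + 2s and x̄ + 3s. From left to right the regions contain 0.15%, 2.35%, 13.5%, 34%, 34%, 13.5%, 2.35% and 0.15% of the data, with 50% on each side of the mean.]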
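The 68%, 95% and 99.7% figures are rounded values. As a check, the following minimal Python sketch computes the corresponding areas under a normal curve using `NormalDist` from the standard `statistics` module; a mean of 0 and standard deviation of 1 are assumed, since the proportions are the same for every normal distribution.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)

for k in (1, 2, 3):
    # Area between k standard deviations below and above the mean.
    share = standard.cdf(k) - standard.cdf(-k)
    print(f"within {k} standard deviation(s): {share:.4f}")
# prints 0.6827, 0.9545 and 0.9973
```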
Example 15
A company sells radios with a mean life of 18 months and a standard deviation of 3
months. The company will replace a radio if it is faulty within 12 months of sale.
If they sell 5000 radios, how many can they expect to replace if life expectancy is
normally distributed?
We draw a rough picture of the normal
distribution curve:
[Figure: normal curve with the horizontal axis marked at 12, 15, 18, 21 and 24 months, in steps of 3. 34% of the area lies between 15 and 18, 13.5% between 12 and 15, and 2.5% below 12.]
Thus 2.5% are expected to fail within 12 months, and
2.5% of 5000
= 0.025 × 5000
≈ 125
Example 16
The chest measurements of 18 year old male footballers are normally distributed with
a mean of 95 cm and a standard deviation of 8 cm.
a Find the percentage of footballers with chest measurements between:
i 87 cm and 103 cm
ii 103 cm and 111 cm
b Find the probability that the measurement of a randomly chosen footballer is
i more than 119 cm
ii less than 87 cm
c What chest measurement would put a footballer in the largest 2.5% of 18 year
olds?
We draw a rough sketch of the
normal distribution curve and
label with percentages:
[Figure: normal curve with the horizontal axis marked at 71, 79, 87, 95, 103, 111 and 119. From left to right the regions contain 0.15%, 2.35%, 13.5%, 34%, 34%, 13.5%, 2.35% and 0.15% of the data.]
Let X cm be the chest measurement of a footballer.
a i Pr(87 < X < 103) = 34% + 34%
= 68%
ii Pr(103 < X < 111) = 13.5%
b i Pr(X > 119) = 0.15%
= 0.0015
ii Pr(X < 87) = 13.5% + 2.35% + 0.15%
= 16%
= 0.16
c To be in the largest 2.5% of chest measurements, a footballer would need to
have a chest of at least 111 cm.
EXERCISE 8H.1
1 The following data are the heights, to the nearest centimetre, of thirty footballers that
belong to an AFL club.
192 185 189 183 189 191 190 192 198 187 191 194 198 181 189
191 190 187 189 194 198 191 187 196 181 193 187 196 192 178
a Find the i mean, x̄ ii standard deviation, s of the height of the footballers
in this club.
b i Calculate the interval [x̄ − s, x̄ + s].
ii What percentage of the heights would be expected to fall in this interval?
iii What percentage of the actual heights fall in this interval?
c What percentage of the actual heights fall in the interval [x̄ − 2s, x̄ + 2s]?
What percentage would you expect to fall in this interval?
2 The distribution of weights of 600 g loaves of bread is bell-shaped with a mean weight
of 605 g and a standard deviation of 8 g. What percentage of the loaves can be expected
to have a weight between 597 g and 613 g? (Use the Normal distribution as a model.)
3 A restaurateur found that the average time spent by diners was 2 hours, with a standard
deviation of 30 minutes. Assuming that the time spent by diners is normally distributed,
and that there are 200 diners each week, calculate:
a the number of diners who stay between 2 and 3 hours
b the number who stay longer than 3 hours
c the number who stay less than 1½ hours.
4 A clock manufacturer did a survey of 800 of its clocks to find out how accurate they
were. They found that the mean error was 6 minutes slow with a standard deviation of
2 minutes. Assuming that the error in time is normally distributed, find the expected
number of clocks that are:
a within 4 minutes of the mean error
b between 4 and 8 minutes slow
c more than 10 minutes slower than the correct time.
5 A bottle filling machine fills a mean of 20 000 bottles a day with a standard deviation of
2000. If we assume that production is normally distributed and the year comprises 260
working days, calculate to the nearest whole day, the number of working days that:
a over 20 000 bottles are filled
b over 16 000 bottles are filled
c between 18 000 and 24 000 bottles are filled.
6 A battery manufacturer finds that its batteries have a mean life of 28 months with a
standard deviation of 4 months. If the battery lives are normally distributed, and the
company manufactures 40 000 batteries per annum, calculate:
a the number of batteries that will last longer than 3 years.
b If the company guarantees their batteries for 2 years, what number could they expect
to replace under the guarantee?
c If the company wanted to limit the claims under the guarantee to no more than 2.5%
of production, what guarantee period would they need to put on their batteries?
7 The distribution of exam scores for 780 students who sat an exam is Normal with a mean
of 55 and a standard deviation of 15.
a Find the number of students who would be expected to obtain a score:
i greater than 70
ii less than 55
iii less than 25
iv between 70 and 85
b The bottom 16% of students will be given a ‘fail’. What is the cut-off mark for a
‘fail’?
NORMAL DISTRIBUTION PROBABILITIES USING A CALCULATOR
The previous questions were based around the standard 68% : 95% : 99.7% proportions.
We can find other probabilities for the normal distribution using a graphics calculator.
Suppose X is normally distributed with
mean 10 and standard deviation 2.
How do we find Pr(8 ≤ X ≤ 11)?
[Figure: normal curve with mean 10, shaded between 8 and 11]
For TI-83
Press 2nd VARS (DISTR) to bring
up the DISTR menu and then 2 to
select 2:normalcdf(
The syntax for this command is
normalcdf(lower bound, upper bound, x̄, s)
Enter 8 , 11 , 10 , 2 ) ENTER
For Casio
From the Main Menu, select STAT.
Press F5 (DIST) F1 (NORM)
F2 (Ncd)
Enter 8 for the lower bound, 11 for the
upper bound, 2 for the standard deviation
and 10 for the mean.
Then select Execute.
Thus the probability of X being between
8 and 11 is 0.533 (3 s.f.).
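The same probability can be found without a graphics calculator. The following is a minimal Python sketch using `NormalDist` from the standard `statistics` module; the helper name `normalcdf` is chosen here only to echo the calculator command and is not part of any library.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean, sd):
    """Pr(lower <= X <= upper) for X normally distributed with the given mean and sd."""
    dist = NormalDist(mu=mean, sigma=sd)
    # The area between two values is the difference of the cumulative areas.
    return dist.cdf(upper) - dist.cdf(lower)

print(round(normalcdf(8, 11, 10, 2), 3))  # 0.533
```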
FINDING QUANTILES (k-VALUES) USING A CALCULATOR
Suppose we want to find k such that Pr(X ≤ k) = 0.8
for the normal distribution alongside with mean 10 and
standard deviation 2.
[Figure: normal curve with mean 10; 80% of the area lies to the left of k]
For TI-83
k can be found using invNorm(0.8, 10, 2), where 0.8 is the probability, 10 is the
mean and 2 is the standard deviation.
Press 2nd VARS (DISTR) 3 to get
invNorm(
Enter 0.8 , 10 , 2 ) ENTER
k = invNorm(0.8, 10, 2)
≈ 11.7 (3 s.f.)
For Casio
From the Main Menu, select STAT.
Press F5 (DIST) F1 (NORM)
F3 (InvN)
Enter 0.8 for the area, 2 for the standard
deviation and 10 for the mean.
Then select Execute.
So, k ≈ 11.7 (3 s.f.)
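A quantile can be computed in the same spirit in software. The following is a minimal Python sketch using the `inv_cdf` method of `NormalDist` from the standard `statistics` module; the helper name `inv_norm` is hypothetical and simply mirrors the calculator command.

```python
from statistics import NormalDist

def inv_norm(area, mean, sd):
    """Value k such that Pr(X <= k) = area for a normal distribution."""
    return NormalDist(mu=mean, sigma=sd).inv_cdf(area)

k = inv_norm(0.8, 10, 2)
print(round(k, 1))  # 11.7

# Feeding k back into the cumulative distribution recovers the area.
print(round(NormalDist(mu=10, sigma=2).cdf(k), 3))  # 0.8
```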
Example 17
The length of King George Whiting caught in SA is normally distributed with mean
38 cm and standard deviation 3.5 cm. What percentage of whiting caught would be
expected to have a length of:
a more than 40 cm
b less than 33 cm
c between 35 and 45 cm?
d If the fisheries department wants to protect the smallest 30% of whiting from
fishing, what size limit should be set?
Let X = length of a whiting.
a Pr(X > 40) = normalcdf(40, E99, 38, 3.5)
= 0.284 i.e., 28.4%
(E99 is the largest number able to be entered into the calculator.)
b Pr(X < 33) = normalcdf(−E99, 33, 38, 3.5)
= 0.0766 i.e., 7.66%
c Pr(35 < X < 45) = normalcdf(35, 45, 38, 3.5)
≈ 0.782 i.e., 78.2%
d Size limit = invNorm(0.3, 38, 3.5)
≈ 36.2
i.e., fish should be greater than 36.2 cm
EXERCISE 8H.2
1 The lengths of metal bolts produced by a machine are found to be normally distributed
with a mean of 19.8 cm and a standard deviation of 0.3 cm. Find the probability that a
bolt selected at random from the machine will be between 19.7 cm and 20 cm.
2 A student hoping to pass an examination is told that the mean mark of the class was
63 with standard deviation 17. If 20% of the class failed, what minimum mark must he
have achieved, assuming the scores were normally distributed?
3 The IQ of secondary school students from a particular area is believed to be normally
distributed with a mean of 103 and a standard deviation of 15.1. A student from one of
the schools is randomly selected. Find the probability that this student will have an IQ:
a of at least 115
b that is less than 75
c between 95 and 105.
4 A machine fills bags with icing sugar. The mean net weight per bag is 2 kg and the
standard deviation is 0.1 kg. If 5% of the bags are rejected for being too heavy, and
10% of the bags are rejected for being too light, in what range must the weight of a bag
lie for it to be acceptable, assuming the weight per bag is normally distributed?
5 The heights of men at an army barracks are found to be normally distributed with a mean
of 181 cm and a standard deviation of 4 cm. A man is selected at random from this
population. Find the probability that this person is:
a at least 175 cm tall
b between 177 cm and 180 cm tall.
6 The average score for a Physics test was found to be 46 and the standard deviation of the
scores was 25. Assuming that the scores were normally distributed, the teacher decided
to award an A to the top 7% of the students in the class. What is the lowest score that
a student must obtain in order to achieve an A?
7 The average weekly earnings of the students at Hardtime High School are found to be
approximately normally distributed with a mean of $40 and a standard deviation of $6:
a What proportion of students would you expect to earn:
i between $30.00 and $50.00 per week
ii less than $50.00 per week?
b A student is classified as ‘rich’ if they are in the top 10% of weekly earners. What
weekly earnings are required to be classified as ‘rich’?
c Discuss the reasonableness of a student earning more than $60 per week.
d The average weekly earnings of the students at Comfy College have a mean of
$25 and a standard deviation of $4. Sketch graphs on the same axes to show the
earnings at the two schools.
e What assumption was made about the earnings of students at Comfy College when
drawing the graph in d? Is this reasonable?
f If a ‘rich’ student from Hardtime High School transferred to Comfy College, how
would the mean and standard deviation of earnings at Comfy College be affected?
8 The lengths of Murray cod caught in the River Murray are found to be normally distributed with a mean of 41 cm and a standard deviation of 3.3 cm.
a Find the probability that a randomly selected cod is at least 50 cm long.
b What proportion of cod measure between 40 cm and 50 cm?
c In a sample of 200, how many of them would you expect to be less than 45 cm?
The lengths of callop caught in the River Murray are also normally distributed with a
mean of 32 cm and standard deviation 2.5 cm.
d Jayco catches a Murray cod measuring 48 cm and a callop measuring 39 cm. Which
of the two fish was the biggest catch, relative to their own species?
e Liam claimed he caught a callop measuring 50 cm. Is this statistically reasonable?
9 Sam’s Maths mark is 83 in a class where the mean mark is 87 and the standard deviation
is 4.1. The same group of students are in a Chemistry class where Sam’s mark is 58.
The mean mark in Chemistry is 53 and the standard deviation is 7.3. In which subject
did Sam perform better relative to the other members of the class?
10 The mean time on a netball court for each member of the team is 24 minutes when played
outdoors and 27 minutes when played indoors. The standard deviations are 7.3 minutes
and 8.4 minutes respectively. If Jan’s time on the court was 23 minutes outdoors and 25
minutes indoors, in which environment did she receive the most relative court time?
I
CORRELATION
INTRODUCTION
Often we wish to know how two variables are associated or related.
To find such a relationship we construct and observe a scatterplot.
A scatterplot consists of points plotted on a set of axes where the independent variable is
placed on the horizontal axis and the dependent variable on the vertical axis.
A typical scatterplot could look like one of the following:
• for the swimming team where weight is
dependent on height.
[Scatterplot: weight (kg) against height (cm)]
• for profitability of a sports goods store
where the profit is often dependent on the
amount of advertising done.
[Scatterplot: profit ($) against advertising ($)]
• for the intelligence quotient (IQ) of an individual. A sociologist may be considering if a person’s IQ is dependent on their
weight.
[Scatterplot: IQ against weight]
OPENING PROBLEM 2
The relationship between weight and height of members of an AFL football team is being investigated.
We expect there to be a fairly strong association
between these variables as it is generally perceived that the
taller a person is, the more they will weigh.
The height and weight of each of the players in the
team is recorded and these values form a coordinate
pair for each of the players:
Player   Height (cm)   Weight (kg)
1        203           106
2        189           93
3        193           95
4        187           86
5        186           85
6        197           92
7        180           78
8        186           84
9        188           93
10       181           84
11       179           86
12       191           92
13       178           80
14       178           77
15       186           90
16       190           86
17       189           95
18       193           89
The scatterplot for the data is
given alongside. Height is the independent variable (horizontal
axis) and weight is the dependent
variable (vertical axis).
[Scatterplot: Weight versus Height; weight (kg) from 80 to 105 against height (cm) from 175 to 205]
Consider and possibly discuss:
• What are the variables in this problem and are they categorical or numerical?
• What is the dependent variable?
• Can you describe the appearance of the scatterplot? Are the points close to being
linear?
• Does an increase in the independent variable generally cause an increase (or a
decrease) in the dependent variable?
MATHEMATICAL MODELLING
A mathematical model consists of an equation which connects two or more variables. This
equation may be exact or approximate, depending on the circumstances in which it arises.
For example, assuming no abnormalities or loss due to accident, the total number of fingers
f, for x people is given by f = 10x, clearly an exact rule.
However, w ≈ 0.9h − 81 is a very approximate model for determining the
weight w kg of AFL footballers of height h cm.
This model is obtained by trying to fit a ‘line of best fit’ through the scatterplot
points.
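The text does not say how the line of best fit is chosen. One common choice, assumed here, is the least squares line. The following minimal Python sketch applies it to the heights and weights of the 18 AFL players from Opening Problem 2; the function name `least_squares_line` is hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Heights (cm) and weights (kg) of players 1 to 18.
heights = [203, 189, 193, 187, 186, 197, 180, 186, 188,
           181, 179, 191, 178, 178, 186, 190, 189, 193]
weights = [106, 93, 95, 86, 85, 92, 78, 84, 93,
           84, 86, 92, 80, 77, 90, 86, 95, 89]

slope, intercept = least_squares_line(heights, weights)
print(f"slope = {slope:.2f}, intercept = {intercept:.1f}")
```

Rounded to one decimal place and the nearest whole number respectively, the slope and intercept come out at about 0.9 and −81, in line with the approximate model quoted above.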
Questions which could be asked where mathematical modelling may be used, could be similar
to those following:
• Can tomorrow’s temperature be reasonably accurately predicted using today’s
temperatures from country centres west of us?
• Can a company predict its increase in sales due to increased spending on
advertising?
• Can a student’s success in a tertiary institution be predicted from his or her
Year 12 final results?
• Can an increase in a person’s intake of vitamins, particularly vitamin C, reduce
their susceptibility to colds and influenza?
• Is there a relationship between a person’s age and their systolic blood pressure?
In this section we will be concerned with trying to fit mathematical models to data obtained
by observation or experiment.
In particular, we will examine for variables x and y:
linear models having form y = ax + b {a, b are constants}
CORRELATION
Correlation refers to the relationship or association between two variables.
In looking at the correlation between two variables we should follow these steps:
Step 1: Look at the scatterplot for any pattern.
For a generally upward shape we say that the correlation is positive, and in this case an increase
in the independent variable means that the dependent variable generally increases.
For a generally downward shape we say that the
correlation is negative and in this case an increase in the independent variable means that the
dependent variable generally decreases.
For randomly scattered points (with no upward or
downward trend) there is usually no correlation.
Step 2: Look at the spread of points to make a judgement about the strength of the correlation. For positive relationships we would classify the following scatterplots as:
[Figure: three scatterplots showing strong, moderate and weak positive correlation]
Similarly there are strength classifications for negative relationships:
[Figure: three scatterplots showing strong, moderate and weak negative correlation]
Step 3: Look at the pattern of points to see whether or not it is linear.
[Figure: two scatterplots. In the first the points appear to be roughly linear; in the second the points do not appear to be linear.]
Step 4: Look for and investigate any outliers. These
appear as isolated points away from the main
body of data.
Outliers should be investigated as sometimes
they are mistakes made in recording the data
or plotting it.
Genuine extraordinary data should be included.
[Figure: scatterplot with one isolated point labelled ‘outlier’ and another point labelled ‘not an outlier’]
POSITIVE CORRELATION
An association between two variables is described as a positive correlation if an increase
in one variable is generally accompanied by an increase in the other in an approximately linear manner.
The association is best measured with the correlation coefficient (r) that ranges between 0
and 1 for positive correlation.
An r value of 0 suggests that there is no linear association present (or no correlation).
An r value of 1 suggests that there is a perfect linear association present (or perfect positive
correlation).
Only deterministic models will result in perfect correlation. An example is the association between the number of sides n of a polygon and its interior angle sum S, where
S = (n − 2) × 180°.
The correlation between the height and the weight of people is positive and lies between 0
and +1. It is not an example of perfect positive correlation because, for example, not all short
people are of light weight. However, taller people are generally heavier than shorter people.
The r values in between 0 and 1 represent varying degrees of linearity.
Scatter diagrams for positive correlation:
The scales on each of the four graphs are the same.
[Figure: four scatterplots of y against x with r = +1, r = +0.8, r = +0.5 and r = +0.2]
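The text does not give a formula for r. One standard choice, assumed here, is Pearson's product-moment correlation coefficient. The following minimal Python sketch computes it for the polygon example and for the heights and weights of the 18 AFL players from Opening Problem 2; the function name `pearson_r` is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Deterministic model: interior angle sum of a polygon with n sides.
sides = [3, 4, 5, 6, 7, 8]
angle_sums = [(n - 2) * 180 for n in sides]
print(round(pearson_r(sides, angle_sums), 3))  # 1.0

# Heights (cm) and weights (kg) of players 1 to 18.
heights = [203, 189, 193, 187, 186, 197, 180, 186, 188,
           181, 179, 191, 178, 178, 186, 190, 189, 193]
weights = [106, 93, 95, 86, 85, 92, 78, 84, 93,
           84, 86, 92, 80, 77, 90, 86, 95, 89]
print(round(pearson_r(heights, weights), 2))
```

The polygon data give r = 1 because every point lies exactly on a line, while the footballers' data give a value strictly between 0 and 1.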
NEGATIVE CORRELATION
An association between two variables is described as a negative correlation if an increase in one variable is generally accompanied by a decrease in the other in an approximately linear manner.
The strength of the association is best measured with the correlation coefficient (r) that
ranges between 0 and −1 for negative correlation.
An r value of −1 suggests that there is a perfect linear association present (or perfect negative
correlation).
Scatter diagrams for negative correlation:
[Figure: four scatterplots of y against x with r = −1, r = −0.8, r = −0.5 and r = −0.2]
We can interpret the Weight versus Height scatterplot from earlier as follows:
“There is a moderate positive association between the variables height and weight. This means that as height increases, weight increases. The relationship appears linear and there are no obvious outliers.”
[Figure: Weight versus Height scatterplot, weight (kg) against height (cm)]
CAUSATION
Correlation between two variables does not necessarily mean that one variable causes the
other. Consider the following:
1 The arm length and running speed of a sample of young children were measured and a
strong, positive correlation was found to exist between the variables.
Does this mean that short arms cause a reduction in running speed or that a high running
speed causes your arms to grow long?
These are obviously nonsense assumptions and the
strong positive correlation between the variables is attributed to the fact that both arm length and running
speed are closely related to a third variable, age. Arm
length increases with age as does running speed (up to
a certain age).
2 The number of television sets sold in Ballarat and
the number of stray dogs collected in Bendigo were
recorded over several years and a strong positive association was found between the variables.
Obviously the number of television sets sold in Ballarat
was not influencing the number of stray dogs collected
in Bendigo. Both variables have simply been increasing over the period of time that their numbers were
recorded.
If a change in one variable causes a change in the other variable then we say that a causal
relationship exists between them.
For example:
The age and height of a group of children is measured and there is a strong positive correlation
between these variables. This will be a causal relationship because an increase in age will
cause an increase in height.
In cases where this is not apparent, there is no justification, based on high correlation alone,
to conclude that changes in one variable cause the changes in the other.
EXERCISE 8I.1
1 For each of the following, state whether you would expect to find positive, negative,
or no association between the following variables. Indicate the strength (none, weak,
moderate or strong) of the association.
a Shoe size and height.
b Speed and time taken for a journey.
c The number of occupants in a household and the water consumption of the household.
d Maximum daily temperature and the number of newspapers sold.
e Age and hearing ability.
2 Copy and complete the following:
a If the variables x and y are positively associated then as x increases, y ..........
b If there is negative association between the variables m and n then as m increases,
n ..........
c If there is no association between two variables then the points on the scatterplot
appear to be .......... ..........
3 Describe, briefly, exactly what is meant by:
a a scatterplot
b a mathematical model
c correlation
d positive correlation
e negative correlation
f an outlier
4 a What is meant by the independent and dependent variables?
b When graphing, which variable is placed on the horizontal axis?
5 For the following scatterplots comment on:
i the existence of any pattern (positive, negative or no association)
ii the relationship strength (zero, weak, moderate or strong)
iii whether the relationship is linear or not
iv whether or not there are any outliers.
[Figure: six scatterplots, labelled a to f, each of y against x]
6 The following pairs of variables were measured and a strong positive correlation between
them was found. Discuss whether a causal relationship exists between the variables. If
not, suggest a third variable to which they may both be related.
a The lengths of one’s left and right feet.
b The damage caused by a fire and the number of firemen who attend it.
c Company expenditure on advertising, and sales.
d The height of parents and the height of their adult children.
e The number of hotels and the number of churches in rural towns.
MEASURING CORRELATION
Correlation is a technique that was devised to measure the strength and direction of the linear association between two variables.
The correlation between two numerical variables can be measured by a correlation coefficient. There are several correlation coefficients that can be used, but the most widely used coefficient is Pearson’s correlation coefficient, named after the statistician Karl Pearson who developed it. Its full name is Pearson’s product-moment correlation coefficient, and it is denoted r.
The correlation coefficient (r) lies between −1 and 1.
Constructing a scatterplot and finding Pearson’s correlation coefficient
We consider finding Pearson’s correlation coefficient for the data below:
x   1  2  3  4  5  6  7  8  9  10
y   2  1  4  3  5  6  5  5  7   8
Using a Texas Instruments TI-83
First we activate the diagnostic tools. Once turned on these will remain on, but if the memory is cleared or the battery changed then the calculator will revert back to the default functions that do not include r. To activate the diagnostic tools:
Locate the menu CATALOG using 2nd 0 .
Use the arrow keys to scroll down to DiagnosticOn and press ENTER .
DiagnosticOn will appear on the screen. Press ENTER and
you will have turned the diagnostic tools on.
Enter the data into lists, the x-data into L1 and the y-data into L2.
Press 2nd Y= (STAT PLOT), then press ENTER .
Turn Plot1 On and select the scatterplot icon.
The XList is for the independent variable L1 and the YList is for the dependent variable L2.
Press ZOOM 9 (9:ZoomStat) to view the scatterplot. You can press TRACE and use the arrow keys to identify the points.
We check the scatterplot at this stage as it will reveal any errors made in entering the data, and any outliers. It will also indicate whether the data is linear.
Press STAT ▶ 4 to select 4:LinReg(ax+b) from the STAT CALC menu.
(This means we are fitting a linear model or linear regression of the form y = ax + b to the data. Regression will be discussed in greater detail soon!)
LinReg(ax+b) appears on the screen. You need to tell the calculator where your data is:
Enter L1, L2 by pressing 2nd 1 (L1) , 2nd 2 (L2) ENTER .
The linear regression screen appears and the last figure, r = 0.9130..., is Pearson’s correlation coefficient for this data set.
The r value indicates a strong positive correlation, which
agrees with the scatterplot.
Using a Casio fx-9860g
Enter the data into lists, the x-data into List 1 and the y-data into List 2.
Press F6 (▷) until the GRPH icon is in the bottom left corner of the screen.
Press F1 (GRPH) F6 (SET), then press
F1 (Scat) to select the scatterplot.
Make sure the XList is set to List1 and the YList is set to List2, then press
EXIT F4 (SEL) F1 (On) to turn StatGraph1 on.
Press F6 (Draw) to draw the scatterplot.
Press F1 (X) to obtain a linear regression of the data.
The linear regression screen appears and the figure r = 0.9130... is Pearson’s correlation coefficient for this data set.
The r value indicates a strong positive correlation, which agrees with the scatterplot.
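The calculator result can be cross-checked by hand or in software. The following sketch assumes the standard product-moment formula, which the text does not state, and reads the ten data pairs as (1, 2), (2, 1), ..., (10, 8). The helper name pearson_r is illustrative only.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data (standard formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_data = [2, 1, 4, 3, 5, 6, 5, 5, 7, 8]
r = pearson_r(x_data, y_data)
print(round(r, 4))
```

Rounded to four decimal places this agrees with the value r = 0.9130 shown on both calculators.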
Notes about Pearson’s correlation coefficient:
• It is designed for linear data only.
• It should be used with caution if there are outliers.
For example, the data in each of the two scatterplots below has a correlation coefficient of r = 0.8. In the second graph, however, it is the presence of the outlier that has greatly reduced the r value; without this point, r would equal 1.
[Figure: two scatterplots of y against x, the second with one point labelled 'outlier']
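The effect of a single outlier can be sketched as follows. The data here is hypothetical (it is not the data behind the two graphs), and pearson_r is an illustrative helper using the standard product-moment formula.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data (standard formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data lying exactly on the line y = x + 1
xs = [2, 4, 6, 8, 10, 12]
ys = [3, 5, 7, 9, 11, 13]
r_without_outlier = pearson_r(xs, ys)

# The same data with one hypothetical outlier, (14, 4), added
r_with_outlier = pearson_r(xs + [14], ys + [4])

print(round(r_without_outlier, 3), round(r_with_outlier, 3))
```

One point far from the line is enough to pull r well below 1, which is why the scatterplot should be inspected before r is interpreted.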
Example 18
Self Tutor
In attempting to find if there is any association between average speed in the metropolitan area and age of drivers, a device was fitted to cars of drivers of different ages.
The results are shown in the scatterplot.
The r value for this association is +0.027.
Describe the association.
[Figure: scatterplot of average speed against age]
As r is close to zero, there is no correlation between the two variables.
Example 19
Self Tutor
Wydox have been trying out a new chemical to control the number of lawn beetles in the soil. Determine the extent of the correlation between the quantity of chemical used and the number of surviving lawn beetles per square metre of lawn.
Lawn                               A   B   C   D   E
Amount of chemical (g)             2   5   6   3   9
Number of surviving lawn beetles  11   6   4   6   3
We construct a scatterplot:
[Figure: scatterplot of number of lawn beetles against amount of chemical]
We now fit a linear model to the data:
[Figure: calculator linear regression screen]
From the scatterplot and r ≈ −0.859, we have a moderate negative association between the amount of chemical used and the number of lawn beetles surviving.
Generally, the more chemical used, the fewer beetles survive.
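The value of r quoted in this example can be reproduced from the five data pairs. This is a sketch using the standard product-moment formula (not stated in the text), with the illustrative helper name pearson_r.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data (standard formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

chemical_g = [2, 5, 6, 3, 9]         # lawns A to E
beetles = [11, 6, 4, 6, 3]           # surviving beetles per square metre
r_beetles = pearson_r(chemical_g, beetles)
print(round(r_beetles, 3))
```

The negative sign reflects the direction of the association: lawns given more chemical tend to have fewer surviving beetles.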
EXERCISE 8I.2
1 Mr Whippy thought that there may be a relationship between the temperature and the
number of ice-creams he sells. He collected the following data:
Max. daily temp. (°C)    29  40  35  30  34  34  27  27  19  37  22  19  25  36  23
No. of ice-creams sold  119 164 131 152 206 169 122 143  63 208 155  96 125 248 139
a Use your calculator to sketch a scatterplot and calculate Pearson’s correlation coefficient.
b Interpret the value of r in terms of strength and direction.
c Does the value of the correlation coefficient confirm your observations from the
scatterplot? Was it appropriate to find r for this data? Explain.
70
45
50
32
110
33
100
41
60
50
55
30
80
45
50
36
75
23
95
56
35
100
80
47
50
25
0
85
26
40
39
75
60
20
5
95
100
50
10
17
75
15
17
25
30
38
0
22
17
5
80
38
100
Min. spent
Score
95
110
55
50
65
38
75
35
30
25
30
31
0
75
25
5
95
Min. spent
Score
100
50
75
25
0
5
2 A class of 25 students was asked to record their times (in minutes) spent preparing for
a test. The data below was collected:
18
34
a Use your calculator to sketch a scatterplot and calculate Pearson’s correlation coefficient.
b Interpret the value of r in terms of strength and direction.
c Does the value of the correlation coefficient confirm your observations from the
scatterplot? Was it appropriate to find r for this data? Explain.
3 Which one of the following is true for Pearson’s correlation coefficient r?
A The addition of an outlier to a set of data would always result in a lesser value of r.
B An r value of 1 represents a stronger relationship between the variables than an r value of −1.
C A high value of r means that one variable is causing the other variable to change.
D An r value of −0.8 means that as the independent variable increases, the dependent variable will tend to decrease.
E It can take values between 0 and 1 inclusive.
THE COEFFICIENT OF DETERMINATION (r²)
To help describe the strength of association we calculate the coefficient of determination (r²). This is simply the square of the correlation coefficient (r) and as such the direction of association is eliminated.
Many texts vary on the advice they give. We suggest the rule of thumb given below when describing the strength of linear association.
value               strength of association
r² = 0              no correlation
0 < r² < 0.25       very weak correlation
0.25 ≤ r² < 0.50    weak correlation
0.50 ≤ r² < 0.75    moderate correlation
0.75 ≤ r² < 0.90    strong correlation
0.90 ≤ r² < 1       very strong correlation
r² = 1              perfect correlation
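The rule of thumb can be written as a small lookup. The sketch below simply encodes the table; the function name describe_strength is hypothetical, and the cut-offs are this text's suggestion rather than a universal standard.

```python
def describe_strength(r):
    """Describe linear association strength from r, using r squared."""
    r_squared = r ** 2              # squaring removes the direction
    if r_squared == 0:
        return "no correlation"
    if r_squared < 0.25:
        return "very weak correlation"
    if r_squared < 0.50:
        return "weak correlation"
    if r_squared < 0.75:
        return "moderate correlation"
    if r_squared < 0.90:
        return "strong correlation"
    if r_squared < 1:
        return "very strong correlation"
    return "perfect correlation"

for r in (0.9130, -0.859, -1):
    print(r, describe_strength(r))
```

Applied to earlier results, r = 0.9130 is described as strong and r = −0.859 as moderate, matching the descriptions given in those examples.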
CALCULATION OF THE COEFFICIENT OF DETERMINATION
r² is found on the linear regression screen of your calculator. Alternatively, if the value of r is known, then this can simply be squared.
INTERPRETATION OF THE COEFFICIENT OF DETERMINATION
r² indicates the strength of association between the dependent variable and the independent variable.
If there is a causal relationship then r² indicates the degree to which change in the independent variable explains change in the dependent variable.
For example:
An investigation into many different brands of muesli found that there is strong positive
correlation between the variables fat content and kilojoule content.
Pearson’s correlation coefficient, r, was found to be 0.8625.
The coefficient of determination for this study is (0.8625)² ≈ 0.744.
An interpretation of this r² value is “the proportion of variation in kilojoule content that can be explained by the variation in fat content of muesli is 0.744.”
It is usual to quote the coefficient of determination as a percentage. A proportion of 0.744 is equivalent to 0.744 × 100% = 74.4%.
The interpretation becomes:
74.4% of the variation in kilojoule content of muesli (the dependent variable) can be explained by the variation in fat content of muesli (the independent variable).
If 74.4% of the variation in kilojoule content of muesli can be explained by the fat content of muesli then we can assume that the other 100% − 74.4% = 25.6% of the variation in kilojoule content of muesli can be explained by other factors (which may or may not be known).
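The arithmetic in the muesli example is short enough to check directly. In this sketch the variable names are illustrative; only the value r = 0.8625 comes from the text.

```python
r_muesli = 0.8625                          # Pearson's r from the study

r_squared = r_muesli ** 2                  # coefficient of determination
explained_percent = 100 * r_squared        # variation explained by fat content
unexplained_percent = 100 - explained_percent

print(round(r_squared, 3))
print(round(explained_percent, 1), round(unexplained_percent, 1))
```

The two percentages always sum to 100, so quoting one of them fixes the other.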
Example 20
Self Tutor
A study has found that 45% of the variation in selling price can be explained by the variation in age of a used car.
If this statement was based on the coefficient of determination then what would be the value of Pearson’s correlation coefficient for this study?
We are told that r² = 0.45, so r is either the positive or the negative square root of 0.45 (√0.45 ≈ 0.6708).
At this point we need to consider the variables involved: selling price and age of a car.
We would assume that as the age of a car increases then the selling price of a car would decrease, i.e., there is negative correlation between the variables.
Hence we can conclude for this study that Pearson’s correlation coefficient, r, will be −0.6708.
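Recovering r from r² needs one extra piece of information, the direction of the association, because squaring discards the sign. A minimal sketch, in which the function name r_from_r_squared and the "positive"/"negative" labels are assumptions made for illustration:

```python
import math

def r_from_r_squared(r_squared, direction):
    """Return r given r squared and the direction judged from context."""
    magnitude = math.sqrt(r_squared)
    if direction == "positive":
        return magnitude
    if direction == "negative":
        return -magnitude
    raise ValueError("direction must be 'positive' or 'negative'")

# Older cars are expected to sell for less, so the direction is negative.
r_car = r_from_r_squared(0.45, "negative")
print(round(r_car, 4))
```

The direction has to come from the scatterplot or from knowledge of the variables; it cannot be deduced from r² alone.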
EXERCISE 8I.3
1 The scatterplot alongside shows the association between the number of car crashes in which a casualty occurred and the total number of car crashes in South Australia in each year from 1985 to 2007. Given that the r value is 0.49:
a find r²
b describe the association between these variables.
[Figure: ‘Casualty crashes’ v ‘All crashes’ scatterplot, casualty crashes against all crashes]
2 In an investigation to examine the association between the tread depth (y mm) and the number of kilometres travelled (x thousand), a sample of 8 tyres of the same brand was taken and the results are given below.
kilometres (x thousand)   14   17   24   34   35   37   38   39
tread depth (y mm)       5.7  6.5  4.0  3.0  1.9  2.7  1.9  2.3
[Figure: tyre cross-section showing depth of tread]
a Draw a scatterplot of the data.
b Calculate r and r² for the tabled data.
c Describe the association between tread depth and the number of kilometres travelled for this brand of tyre.
3 In an investigation the coefficient of determination for the variables preparation time and exam score is found to be 0.5624. Complete the following interpretation of the coefficient of determination:
...... % of the variation in .......... can be explained by the .......... in preparation time.
4 For each of the following find the value of the coefficient of determination correct to four decimal places, and interpret it in terms of the variables.
a An investigation has found the association between the variables time spent gambling and money lost has an r value of 0.4732.
b For a group of children a product-moment correlation coefficient of −0.365 is found between the variables heart rate and age.
c In a study of a sample of countries, Pearson’s correlation coefficient for the variables female literacy and gross domestic product is found to be 0.7723.
5 A rural school has investigated the relationship between the time spent travelling to
school (minutes) and a student’s year ten average (%) for a sample of students.
The results are given in the table below:
Travel time (mins)    10 33 18 43 34 30 24 47 44 41 17 45 39 31 23 11 14 25 16 17
Year 10 average (%)   51 78 97 56 90 70 64 67 37 46 95 67 31 57 43 99 98 82 40 67
a Construct a scatterplot of the data and interpret the scatterplot.
b Find Pearson’s correlation coefficient for the data and interpret.
c Calculate the coefficient of determination and interpret this in terms of the variables.
J LINEAR REGRESSION
Regression is a word that means fitting a line or curve to a set of data.
A curve that is fitted to a set of paired numerical data gives us an algebraic relationship
between the variables that can be used to predict values of one variable given values for the
other.
In this course, we only consider linear regression.
Linear regression is fitting a line to the set of data. If there is a linear relationship between
the variables, the regression line accurately models the relationship between them.
Let us revisit Opening Problem 2. We know that there is quite a strong positive correlation between the height and the weight of the players. Consequently, we should be able to find a linear equation which ‘best fits’ the data.
This line of best fit could be found by eye. However, different people will use different lines.
So, how do we find, mathematically, the line of best fit?
[Figure: Weight versus Height scatterplot, weight (kg) against height (cm)]
LEAST SQUARES REGRESSION
Statisticians invented a method where the best line results.
The least squares regression line is a line drawn so that the sum of the squares of the vertical distances from each point on the scatterplot to the line (the dotted lines) is a minimum.
[Figure: scatterplot with the line y = mx + c, y-intercept c, and dotted vertical segments joining four labelled data points to the line]
The least squares regression line has form y = ax + b, where
x is the variable on the horizontal axis
y is the variable on the vertical axis
a is the slope or gradient of the line
b is the y-intercept of the line.
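The minimising property can be illustrated numerically. The sketch below is not the text's method (the text uses a calculator): it assumes the standard closed-form least squares formulas, uses the ten-point data set from the correlation section, and the function names are illustrative.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx                    # slope
    b = mean_y - a * mean_x          # y-intercept
    return a, b

def sum_squared_vertical_distances(xs, ys, a, b):
    """Sum of squared vertical distances from the points to y = ax + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 1, 4, 3, 5, 6, 5, 5, 7, 8]
a, b = least_squares_line(xs, ys)
best = sum_squared_vertical_distances(xs, ys, a, b)

# Compare with lines whose slope or intercept is nudged slightly
nearby = [
    sum_squared_vertical_distances(xs, ys, a + da, b + db)
    for da in (-0.1, 0, 0.1)
    for db in (-0.5, 0, 0.5)
    if (da, db) != (0, 0)
]
print(round(a, 4), round(b, 4), all(best < s for s in nearby))
```

Every nudged line gives a larger sum of squares than the fitted line, which is what 'least squares' means.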
FINDING AND PLOTTING THE LEAST SQUARES REGRESSION LINE
Consider the data below:
x   55  20  27  33  73  18  37  51  79
y   72  37  53  74  73  44  59  55  84
For TI-83
Enter the data into lists L1 and L2
and check its scatterplot.
Press STAT ▶ 4 to select 4:LinReg(ax+b) from the STAT CALC menu.
Enter L1 and L2 by pressing 2nd 1 (L1) , 2nd 2 (L2) , then press VARS ▶ 1 to select 1:Function from the Y-VARS menu, then 1 to select 1:Y1 from the FUNCTION menu.
Press ENTER to view the equation of the regression line.
The gradient ‘a’ and the y-intercept ‘b’ of the least squares
regression line are given.
The equation of the regression line is y = 0.572x + 36.2.
Reminder:
If the values for r² and r are not shown, press 2nd 0 (CATALOG), choose DiagnosticOn, and press ENTER .
Now press GRAPH to display the regression line on the
scatterplot.
Note:
The equation of the regression line has been pasted into Y1 and can now be used to make
predictions if appropriate.
For Casio
Enter the x-data into List 1 and the y-data into List 2, and view its scatterplot.
Press F1 (X) to view the regression line.
The gradient ‘a’ and the y-intercept ‘b’ of the least squares regression line are given.
The equation of the regression line is y = 0.572x + 36.2.
Press F6 (DRAW) to display the regression line on the scatterplot.
Note:
Pressing F1 (X) F5 (COPY) EXE will paste the equation of the regression line
into Y1, where it can be used to make predictions if appropriate.
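The equation given by both calculators can be confirmed from the nine data pairs. This sketch assumes the standard closed-form least squares formulas, which the text does not state; least_squares_line is an illustrative name.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

x_data = [55, 20, 27, 33, 73, 18, 37, 51, 79]
y_data = [72, 37, 53, 74, 73, 44, 59, 55, 84]
a, b = least_squares_line(x_data, y_data)
print(f"y = {a:.3f}x + {b:.1f}")
```

To three significant figures this reproduces y = 0.572x + 36.2.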
INTERPRETING THE SLOPE AND INTERCEPT OF A REGRESSION LINE
The data below gives the fat content (grams) and the energy (kilojoules) of 17 different foods.
Fat (g)        15    55    18    45    17    24    30    30
Energy (kJ)  1255  3555  1800  1880  1670  2520  2300  2300
Fat (g)        16    11     9    30    24    24    30    32    30
Energy (kJ)  1340  1130  1150  2300  1670  1670  2510  1460  2090
A scatterplot of the data is shown alongside. The data shows moderate positive correlation between the variables energy and fat content, and there appears to be a linear relationship between the variables.
[Figure: scatterplot of energy (kJ) against fat (g)]
The regression equation is y = 41.9x + 834 (3 s.f.), i.e., energy = 41.9 × fat + 834, in which the gradient or slope is 41.9 and the y- or energy-intercept is 834.
The slope can be interpreted as:
“for every increase of one gram of fat there is an increase of 41.9 kilojoules of energy”.
In this sentence, the gram is the unit of the independent variable (fat), 41.9 is the gradient or slope, and the kilojoule is the unit of the dependent variable (energy).
The y-intercept has the value 834.
This can be interpreted as:
“when the fat content of a food is zero, the energy provided by the food is 834 kilojoules”.
This interpretation is reasonable because it is possible for food to have zero fat, and most
foods will still have energy content from carbohydrates such as sugars.
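The slope and intercept quoted above can be recovered from the 17 data pairs. The sketch assumes the standard closed-form least squares formulas (the text obtains the equation by calculator); least_squares_line is an illustrative name.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

fat_g = [15, 55, 18, 45, 17, 24, 30, 30, 16, 11, 9, 30, 24, 24, 30, 32, 30]
energy_kj = [1255, 3555, 1800, 1880, 1670, 2520, 2300, 2300,
             1340, 1130, 1150, 2300, 1670, 1670, 2510, 1460, 2090]

slope, intercept = least_squares_line(fat_g, energy_kj)
print(f"Each extra gram of fat goes with about {slope:.1f} kJ more energy.")
print(f"A food with no fat is predicted to provide about {intercept:.0f} kJ.")
```

The units of the slope are kilojoules per gram, that is, units of the dependent variable per unit of the independent variable.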
INTERPOLATION AND EXTRAPOLATION
Interpolation means predicting values from a regression model (equation) for values from
within the range of data from which the regression equation was based.
In the example above, the fat data ranged from 9 to 55. Using a value within this range to
predict an energy value would be interpolation. If we predict that the energy content of a
food containing 40 g of fat is 2509 kilojoules, this is interpolation.
Extrapolation means predicting values from a regression model (equation) for values from
outside the range of data from which the regression equation was based.
Using a value for fat outside the range 9 g to 55 g would be extrapolation. Using the equation to predict the energy content of a food containing 60 g of fat or 5 g of fat would both be cases of extrapolation.
[Figure: graph of the dependent variable against the independent variable showing the line of best fit, the lower pole and upper pole, the interpolation region between the poles, and the extrapolation regions beyond them]
The accuracy of an interpolation depends on how linear the original data was. This can be gauged by determining the correlation coefficient and ensuring that the data is randomly scattered around the line of best fit.
The accuracy of an extrapolation depends not only on ‘how linear’ the original data was, but
also on the assumption that the linear trend will continue past the poles.
The validity of this assumption depends greatly on the situation under investigation.
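Whether a prediction is an interpolation or an extrapolation depends only on where the input lies relative to the poles. A minimal sketch for the fat and energy example follows; the function names are hypothetical, and the default coefficients are the rounded (3 s.f.) values, so the 40 g prediction comes out as 2510 kJ rather than the 2509 kJ quoted earlier, which appears to use unrounded coefficients.

```python
def classify_prediction(x, lower_pole, upper_pole):
    """Label a prediction by where x lies relative to the data range."""
    if lower_pole <= x <= upper_pole:
        return "interpolation"
    return "extrapolation"

def predict_energy(fat_g, slope=41.9, intercept=834):
    """Predicted energy (kJ) from the rounded regression equation."""
    return slope * fat_g + intercept

# The fat data ranged from 9 g to 55 g
for fat in (5, 40, 60):
    kind = classify_prediction(fat, 9, 55)
    print(f"{fat} g: {predict_energy(fat):.0f} kJ ({kind})")
```

The equation returns a number for any input; the classification is a reminder that values outside the poles rest on the extra assumption that the linear trend continues.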
Example 21
Self Tutor
The table below shows the sales for Hancock’s Electronics, established in late 2000.
Year                 2001  2002  2003  2004  2005  2006
Sales ($ × 10 000)      5     9    14    18    21    27
a Draw a graph to illustrate this data.
b Find r².
c Find the equation of the line of best fit using the linear regression formula.
d Predict the sales figures for year 2008, giving your answer to the nearest $10 000. Comment on the reasonableness of this prediction.
a Let t be the time in years from 2000 and S be the sales in $10 000’s, i.e.,
t   1   2   3   4   5   6
S   5   9  14  18  21  27
[Figure: scatterplot of S against t]
b Using technology, r² = 0.9941.
c The line of best fit is S = 4.29t + 0.667.
d Using technology, in 2008, t = 8, so S ≈ 34.95, i.e., predicted year 2008 sales would be $350 000.
The r and r² values suggest that the linear relationship between sales and year is very strong and positive. However, since this prediction is an extrapolation, it will only be reasonable if the trend evident from 2001 to 2006 continues to the year 2008, and this may or may not occur.
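The 'using technology' steps in this example can be reproduced in a few lines. The sketch assumes the standard least squares and product-moment formulas, which the text leaves to the calculator; the function name fit_line is illustrative.

```python
def fit_line(xs, ys):
    """Return slope, intercept and r squared for the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

t = [1, 2, 3, 4, 5, 6]               # years since 2000
S = [5, 9, 14, 18, 21, 27]           # sales in $10 000's

slope, intercept, r_squared = fit_line(t, S)
S_2008 = slope * 8 + intercept       # t = 8 is outside 1..6: extrapolation
print(round(r_squared, 4), round(slope, 2), round(intercept, 3))
print(round(S_2008, 2))
```

Rounding S ≈ 34.95 to the nearest whole unit of $10 000 gives 35, that is, the $350 000 stated in the solution.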
EXERCISE 8J
1 Recall the tread depth data of car tyres after travelling thousands of kilometres:
kilometres (x thousand)   14   17   24   34   35   37   38   39
tread depth (y mm)       5.7  6.5  4.0  3.0  1.9  2.7  1.9  2.3
a Which is the dependent variable?
b On a scatterplot graph the least squares regression line and state its equation.
c Use the equation of the line of best fit to estimate the tread depth of a new tyre.
d Brock claims that his tyres have done 50 000 km. Is this claim reasonable? Give evidence.
2 Tomatoes are sprayed with a pesticide-fertiliser mix. The figures below give the yield of tomatoes per bush for various spray concentrations.
Spray concentration (x, mL/L)     3   5    6    8    9   11  15
Yield of tomatoes per bush (y)   67  90  103  120  124  150  82
a Define the role of each variable and produce an appropriate scatterplot.
b Determine the value of r and r² and interpret.
c Is there an outlier present that is contributing to the low correlation?
d Remove the outlier from the data set and recalculate r and r². Is it reasonable to now draw a line of best fit?
e Determine the equation of the line of best fit.
f Give an interpretation for the slope and vertical intercept of this line.
g Use the equation of the least squares line to predict the yield if the spray concentration was 7 mL/L. Comment on the reasonableness of this prediction.
h If a 50 mL/L spray concentration was used, would this ensure a large tomato yield? Explain.
3 The table below shows the concentration of chemical X in the blood of an accident
victim at various times after an injection was administered.
Time (minutes)                   10    20    30    40    50    60    70
Concentration (micrograms/mL)   105  38.0  13.1  4.75  1.42  0.63  0.12
a Sketch a scatterplot of the data.
b Calculate the r and r² values and interpret.
c Is the relationship between the variables strong enough to warrant drawing a least
squares regression line?
4 A restaurateur believes that during March the number of people wanting dinner (y) is related to the temperature at noon (x °C). Over a period of a fortnight the number of diners and the noon temperature were recorded.
Temperature (x °C)     23 25 28 30 30 27 25 28 32 31 33 29 27 26
Number of diners (y)   57 64 62 75 69 58 61 78 80 67 84 73 76 67
a What is the independent variable?
b Generate a scatterplot of the data.
c Calculate r and r² and interpret.
d How accurate would an interpolation using the line of best fit be? Explain.
e Are there any obvious outliers that could be removed to improve the correlation?
5 It has long been thought that frosty conditions are necessary to ‘set’ the fruit of cherries and apples. The following data shows the annual cherry yield and incidence of frosts for a cherry growing farm over a 7 year period.
Number of frosts (x)       27   23    7   37   32   14   16
Cherry yield (y tonnes)   5.6  4.8  3.1  7.2  6.1  3.7  3.8
a Draw a scatterplot for this data.
b Determine the r and r² value.
c Describe the association between cherry yield and the number of frosts.
d Determine the equation of the line of best fit.
e Give an interpretation for the slope and vertical intercept of this line.
f Use the equation of the least squares line to predict the cherry yield if 29 frosts were recorded. Comment on the reasonableness of this prediction.
g Use the equation of the least squares line to predict the cherry yield if 1 frost was recorded. Comment on the reasonableness of this prediction.
6 The rate of a chemical reaction in a certain plant depends on the number of frost-free
days experienced by the plant over a year which, in turn, depends on altitude. The higher
the altitude, the greater the chance of frost.
The following table shows the rate of the chemical reaction R, as a function of the number of frost-free days, n.
Frost-free days (n)     75   100   125   150   175   200
Rate of reaction (R)  44.6  42.1  39.4  57.0  34.1  31.2
a Produce a scatterplot for the data of R against n.
b Is it reasonable to draw a regression line? Give r² evidence.
Clearly, the data point (150, 57.0) is an outlier. Inspection of records reveals that it should be (150, 37.0).
c Change the outlier to its correct value and hence find the equation of the regression line which best fits the data. State the new value of r².
d Estimate the rate of the chemical reaction when the number of frost-free days is:
i 90
ii 215.
e Complete: “The higher the altitude, the ...... the rate of reaction.”
7 The following table gives peptic ulcer rates per 100 of population for differing family incomes in the year 2007.
Income (I thousand $)    10   15   20   25   30   40   50   60   80
Peptic ulcer rate R     8.3  7.7  6.9  7.3  5.9  4.7  3.6  2.6  1.2
a Obtain a scatterplot of the data.
b Find the line of best fit.
c What is the estimated peptic ulcer rate in families with $45 000 incomes?
d Explain why the model is inadequate for families with income in excess of $100 000.
8 The concentration of carbon dioxide (CO2 ) in the atmosphere at Port Adelaide has been
recorded over a 40 year period. CO2 concentration in the atmosphere has a large influence
over our weather. CO2 concentration is measured in parts per million. Consider the table
which follows:
Year
1965 1970 1975 1980 1985 1990 1995 2000 2005
CO2 concentration 313 316 320 326 329 335 340 338 334
Let t be the number of years since 1965 and C be the CO2 concentration.
a Sketch a scatterplot of the data.
b Does a linear model appear to be appropriate? Explain.
The data for 2000 and 2005 is checked and found to be accurate as CO2 levels have decreased due to environmental awareness. A researcher wishes to estimate the CO2 concentration for 1993.
c Delete the data for 2000 and 2005 and find the linear model that fits the 1965 to 1995 data. State the value of r².
d Use the model to predict the CO2 level in 1993.
e According to the model, what would the CO2 level have been in 2005 if a decrease
in levels had not occurred?
f Is it reasonable to use the 2000 and 2005 data to predict the CO2 level in 2020?
9 Safety authorities advise drivers to travel 3 seconds behind the car in front of them as this provides the driver with a greater chance of avoiding a collision if the car in front has to brake quickly or is itself involved in an accident.
A test was carried out to find out how long it would take a driver to bring a car to rest from the time a red light was flashed. (This is called stopping time, which includes reaction time and braking time.) The following results are for one driver in the same car under the same test conditions.
Speed (v km/h)            10    20    30    40    50    60    70    80    90
Stopping time (t secs)  1.23  1.54  1.88  2.20  2.52  2.83  3.15  3.45  3.83
a Produce a scatterplot of the data.
b Find the linear model which best fits the data.
c Is the linear model a good fit? Give evidence.
d Use the model to find the stopping time for a speed of:
i 55 km/h
ii 110 km/h
e What is the interpretation of the vertical intercept?
f Why does this simple rule apply at all speeds, with a good safety margin?