Inferential Statistics Unit
(Level IV Academic Math)
Draft
NSSAL
C. David Pilmer
©2010
(Last Updated: Dec 2011)
This resource is the intellectual property of the Adult Education Division of the Nova Scotia
Department of Labour and Advanced Education.
The following are permitted to use and reproduce this resource for classroom purposes.
• Nova Scotia instructors delivering the Nova Scotia Adult Learning Program
• Canadian public school teachers delivering public school curriculum
• Canadian nonprofit tuition-free adult basic education programs
The following are not permitted to use or reproduce this resource without the written
authorization of the Adult Education Division of the Nova Scotia Department of Labour and
Advanced Education.
• Upgrading programs at post-secondary institutions
• Core programs at post-secondary institutions
• Public or private schools outside of Canada
• Basic adult education programs outside of Canada
Individuals, not including teachers or instructors, are permitted to use this resource for their own
learning. They are not permitted to make multiple copies of the resource for distribution. Nor
are they permitted to use this resource under the direction of a teacher or instructor at a learning
institution.
Acknowledgments
The Adult Education Division would like to thank the following university professors for
reviewing this resource to ensure all mathematical concepts were presented correctly and in a
manner that supported our learners.
Dr. David Hamilton (Dalhousie University)
Dr. Genevieve Boulet (Mount Saint Vincent University)
Dr. Robert Dawson (Saint Mary’s University)
The Adult Education Division would also like to thank the following NSCC instructors for
piloting this resource and offering suggestions during its development.
Charles Bailey (IT Campus)
Elliott Churchill (Waterfront Campus)
Barbara Gillis (Burridge Campus)
Barbara Leck (Pictou Campus)
Suzette Lowe (Lunenburg Campus)
Floyd Porter (Strait Area Campus)
Brian Rhodenizer (Kingstec Campus)
Joan Ross (Annapolis Valley Campus)
Jeff Vroom (Truro Campus)
Table of Contents
Introduction
Tracking Your Progress
Negotiated Completion Date
Mathematics Multimedia Learning Objects
The Big Picture
Course Timelines
Introductory Material and Terminology
Bar Graphs and Histograms
Describing Data, Part 1
Describing Data, Part 2
Using Technology
Normal Distribution
Using the 68-95-99.7 Rule
Making Inferences
Collecting a Sample
Sampling Methods
Simulated Sampling
Sampling Distribution of the Sample Means
Central Limit Theorem
Point Estimates and Interval Estimators
Putting It Together
If You Have Time
Post-Unit Reflections
Terms, Symbol, and Formulas
TI-83/84 Statistics Information Sheet
Answers
NSSAL
©2010
i
Draft
C. D. Pilmer
Introduction
Statistics is the discipline concerned with the collection, the organization, and the analysis of
data to draw conclusions or make predictions. Statistics is widely employed in government,
business, and the natural and social sciences. In the first few sections of the unit we will focus
on descriptive statistics, the branch of statistics that deals with the description of data. In these
sections we will use terms such as mean, median, mode, and standard deviation. The latter
sections and the majority of this unit will focus on inferential statistics, the branch of statistics
in which one makes inferences about population characteristics based on evidence drawn from
samples. In these sections we will learn about confidence intervals based on a sample mean.
Statistics is used by numerous disciplines (e.g. psychology, education, business, medicine,
ecology, anthropology,…). This branch of mathematics impacts directly and indirectly on many
aspects of your life. As governments wrestle with social and economic matters, they rely heavily
on statistical information so that they can make informed decisions. For this reason, the federal
government has a branch solely dedicated to the collection of statistical information. That
branch is called Statistics Canada. When pharmaceutical companies are developing new drugs,
they use numerous statistical tools to analyze data collected from their nonhuman and human
trials. Without these tools they would be unable to assess the benefits and risks associated with
the new medication. Companies that are manufacturing goods use a variety of statistical tools to
monitor quality control. Even the coordination of traffic lights is based on the collection and
analysis of statistical information. Statistics is truly woven into every aspect of our lives.
Take a few minutes to view the following three-minute online video.
TED: Arthur Benjamin's formula for changing math education
http://www.ted.com/talks/lang/eng/arthur_benjamin_s_formula_for_changing_math_education.html
In this unit, we will not require you to master numerous statistical tools; rather, we will focus on
understanding the origins and uses of a few tools. It is important that we do not work blindly
through this material. Although the actual mechanics of using these statistical tools may seem
easy, understanding their origins and meanings is far more challenging and ultimately the
purpose of this unit. We need to think about the new concepts that we are exposed to and how
they relate to previous concepts.
Tracking Your Progress
This page allows you to keep track of your progress through this material.
Section                                       Date Started    Date Completed
Introductory Material and Terminology         ____________    ______________
Bar Graphs and Histograms                     ____________    ______________
Describing Data, Part 1                       ____________    ______________
Describing Data, Part 2                       ____________    ______________
Using Technology                              ____________    ______________
Normal Distribution                           ____________    ______________
Using the 68-95-99.7 Rule                     ____________    ______________
Collecting a Sample                           ____________    ______________
Sampling Methods                              ____________    ______________
Simulated Sampling                            ____________    ______________
Sampling Distribution of the Sample Means     ____________    ______________
Central Limit Theorem                         ____________    ______________
Point Estimates and Interval Estimators       ____________    ______________
Putting It Together                           ____________    ______________
Negotiated Completion Date
After working for a few days on this unit, sit down with your instructor and negotiate a
completion date for this unit.
Start Date: _________________
Completion Date: _________________
Instructor Signature: __________________________
Student Signature: __________________________
Mathematics Multimedia Learning Objects
In this resource you will find references to the online Mathematics Multimedia Learning Objects.
These online learner supports can be found at the following website and be accessed using the
following username and password.
http://www.cdli.ca/mlo/tutorials/index.php
Username: camet
Password: camet06
Province: Nova Scotia
Please do not view every learning object at this site. Only use those that are identified in this
resource.
The Big Picture
The following flow chart shows the optional bridging unit and the eight required units in Level
IV Academic Math. These have been presented in a suggested order.
Bridging Unit (Recommended)
• Solving Equations and Linear Functions
Describing Relations Unit
• Relations, Functions, Domain, Range, Intercepts, Symmetry
Systems of Equations Unit
• 2 by 2 Systems, Plane in 3-Space, 3 by 3 Systems
Trigonometry Unit
• Pythagorean Theorem, Trigonometric Ratios, Law of Sines,
Law of Cosines
Sinusoidal Functions Unit
• Periodic Functions, Sinusoidal Functions, Graphing Using
Transformations, Determining the Equation, Applications
Quadratic Functions Unit
• Graphing using Transformations, Determining the Equation,
Factoring, Solving Quadratic Equations, Vertex Formula,
Applications
Rational Expressions and Radicals Unit
• Operations with and Simplification of Radicals and Rational
Expressions
Exponential Functions and Logarithms Unit
• Graphing using Transformations, Determining the Equation,
Solving Exponential Equations, Laws of Logarithms, Solving
Logarithmic Equations, Applications
Inferential Statistics Unit
• Population, Sample, Standard Deviation, Normal Distribution,
Central Limit Theorem, Confidence Intervals
Course Timelines
Academic Level IV Math is a two credit course within the Adult Learning Program. As a two
credit course, learners are expected to complete 200 hours of course material. Since most ALP
math classes meet for 6 hours each week, the course should be completed within 35 weeks. The
curriculum developers have worked diligently to ensure that the course can be completed within
this time span. Below you will find a chart containing the unit names and suggested completion
times. The hours listed are classroom hours. In an academic course, there is an expectation that
some work will be completed outside of regular class time.
Unit Name                                     Minimum Completion    Maximum Completion
                                              Time in Hours         Time in Hours
Bridging Unit (optional)                      0                     20
Describing Relations Unit                     6                     8
Systems of Equations Unit                     18                    22
Trigonometry Unit                             18                    20
Sinusoidal Functions Unit                     20                    24
Quadratic Functions Unit                      36                    42
Rational Expressions and Radicals Unit        12                    16
Exponential Functions and Logarithms Unit     20                    24
Inferential Statistics Unit                   20                    24
Total                                         150 hours             200 hours
As one can see, this course covers numerous topics and for this reason may seem daunting. You
can complete this course in a timely manner if you manage your time wisely, remain focused,
and seek assistance from your instructor when needed.
Introductory Material and Terminology
As we learned in the introduction, descriptive statistics is concerned with the description of
data. This means that we look at methods that organize data and summarize data in an effective
presentation that ultimately increases our understanding of the data. Among the most common
tools used in descriptive statistics are pictorial representations such as graphs (e.g. bar graphs,
circle graphs, line graphs,…).
Example 1
This line graph shows how the fertility rate in Canada has changed since 1950. The fertility rate
is the average number of children born to women between the ages of 15 and 49. This information
was obtained from census data collected by Statistics Canada.
[Line graph: Fertility Rate (vertical axis, 0 to 4.5) versus Year (horizontal axis, 1950 to 2000)]
(a) In what year was the highest fertility rate?
(b) In what year was the lowest fertility rate?
(c) Approximate the fertility rate in 1985.
(d) Estimate the drop in fertility rate that occurred between 1965 and 1970.
Answers:
(a) 1960
(b) 2000
(c) 1.7
(d) 3.2 − 2.4 = 0.8
Example 2
This circle graph shows the leading causes of death of American women ages 65 years and older.
The graph was constructed using information collected regarding the deaths of 50 000 American
women 65 years of age or older.
[Circle graph: heart disease 47%, cancer 26%, stroke 13%, flu 7%, COPD 7%]
(a) How many women died from heart disease?
(b) How many women died from either the flu or a stroke?
(c) How many more women died from heart disease than a stroke?
Answers:
(a) 47% of 50 000
0.47 × 50000 = 23 500 women
(b) 7% + 13% = 20%
0.20 × 50000 = 10 000 women
(c) 47% − 13% = 34%
0.34 × 50000 = 17 000 women
Note:
Many statisticians discourage the use of pie charts because most people find it difficult to
compare the sizes of different pie slices. These statisticians contend that we would be better
served using bar graphs.
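The arithmetic in Example 2 can also be checked with a few lines of code. The following is only a sketch in Python; the dictionary `shares` and the helper `count_for` are names chosen for this illustration and are not part of the original resource.

```python
# Percentages read from the circle graph in Example 2.
shares = {"heart disease": 47, "cancer": 26, "stroke": 13, "flu": 7, "COPD": 7}
sample_size = 50_000  # number of deaths used to construct the graph


def count_for(*causes):
    """Number of women in the sample whose deaths fall under the given causes combined."""
    percent = sum(shares[cause] for cause in causes)
    return percent * sample_size // 100


heart = count_for("heart disease")                      # part (a)
flu_or_stroke = count_for("flu", "stroke")              # part (b)
heart_minus_stroke = heart - count_for("stroke")        # part (c)

print(heart, flu_or_stroke, heart_minus_stroke)
```

Working with whole-number percentages and integer division keeps the results exact, which avoids small rounding errors that can appear when multiplying by decimals such as 0.47.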
The two examples above were provided for a specific reason. These examples allow us to
differentiate between a population and a sample. A population is formally defined as the set
representing all measurements of interest to the investigator. In the first example, the
investigator, Statistics Canada, wanted to know fertility rates based on the births to every woman
between the ages of 15 and 49. That is why they used census data. Every person, including women
between 15 and 49, is required by law to complete a census. Based on this, Statistics Canada was
confident that its data represented all measurements of interest.
A sample is formally defined as a subset of measurements selected from a population of interest.
In the second example, the data was not collected for every death of women 65 years of age or
older. The data was from a sample of size 50 000. These 50 000 data points are a subset of the
population. It is usually more feasible and less expensive to obtain a sample than to obtain all
the measurements from the population.
As we learned in the introduction, inferential statistics is about making inferences about
population characteristics based on evidence drawn from samples. In other words, we try to use
a sample to understand a population. We will focus on inferential statistics later in the unit.
If you wish further clarification, go to the Mathematics Multimedia Learning Objects (see page
iv), access Unit 11-5 Statistics, and view MLO4 Differentiating Populations and Samples.
Example 3
The Testing and Evaluation Division of the Department of Education reported that the average
mark on the grade 12 provincial math exam was 68%. This average was obtained by randomly
selecting 500 exams from throughout the province. Are we dealing with a sample or a
population? Explain.
Answer:
The Testing and Evaluation Division randomly selected 500 exams, rather than every exam.
For this reason they were dealing with a sample (i.e. a subset of the population).
Types of Data
When data is collected from a sample or a population, the responses can be classified as a
categorical data set or a numerical data set. These two terms are most easily explained using
an example. Suppose we have an adult education class comprised of 10 learners who all have
cell phones. The instructor asks two questions and obtains the following responses.
Question 1: What cell phone provider do you use?
Responses to Question 1:
{Telus, Bell Aliant, Telus, Bell Aliant, Rogers, Rogers, Koodo, Rogers, Telus, Rogers}
Question 2: What was your cell phone bill for the previous month?
Responses to Question 2:
{$27.80, $33.50, $45.70, $32.00, $54.90, $29.00, $43.65, $67.40, $35.89, $39.67}
The collection of responses to the first question is called a categorical data set. Categorical data
is data that can be assigned to distinct non-overlapping categories. The responses to question 1
fit into four categories: Bell Aliant, Koodo, Rogers, and Telus. The collection of responses to the
second question is called a numerical data set. This is the case because the data is comprised of
numbers, specifically different amounts of money.
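The difference between the two data sets shows up in how each one is summarized. The sketch below, written in Python as an assumed illustration, counts the responses in each category for Question 1 and adds up the amounts for Question 2; the variable names are hypothetical.

```python
from collections import Counter

# Responses to Question 1 (categorical data).
providers = ["Telus", "Bell Aliant", "Telus", "Bell Aliant", "Rogers",
             "Rogers", "Koodo", "Rogers", "Telus", "Rogers"]

# Responses to Question 2 (numerical data, in dollars).
bills = [27.80, 33.50, 45.70, 32.00, 54.90, 29.00, 43.65, 67.40, 35.89, 39.67]

# Categorical data is summarized by counting how many responses fall in each category.
provider_counts = Counter(providers)

# Numerical data supports arithmetic, such as a total.
total_billed = sum(bills)

print(provider_counts)
print(round(total_billed, 2))
```

Adding up the provider names would be meaningless, while adding up the bills gives a useful figure. That is the practical distinction between categorical and numerical data.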
There are two types of numerical data: discrete and continuous. Numerical data is discrete if the
possible values are isolated points on a number line. For example, if survey participants were
asked how many phone calls they made today, their responses would be whole numbers like 0, 4
or 12. They would not respond with something like 7.8 phone calls. Because they can only report
isolated points, we end up with discrete numerical data. Numerical data is continuous if the
set of possible values forms an entire interval on the number line. For example, if soil samples
were tested for acidity, the pH could be reported with numbers like 4, 4.17, 4.173, or any other
number in the interval. Generally continuous data arises when observations involve making
measurements (e.g. weighing objects, recording temperatures, recording time to complete
tasks,…) while discrete data arises when observations involve counting.
Questions
1. The town’s mayor is interested in knowing what portion of her 4127 taxpayers supports the
development of a new recreational center in the community. Because it is too costly to
contact all the taxpayers, a survey of 300 randomly selected taxpayers is conducted.
Describe the population and sample for this problem.
2. A building contractor just purchased 6000 used bricks. He knows that a small portion of
these bricks are cracked and therefore unusable. He randomly selected 200 bricks and
discovered that 14 of them were unusable. Describe the population and sample for this
problem.
3. A company conducted a phone survey that involved 1200 randomly selected employed
workers from Nova Scotia. Each participant had to report their annual gross income. At the
time (2009) it was known that there were 453 000 employed workers in Nova Scotia. After
conducting the survey and analyzing the data, the company reported an average annual
income of $29 900 for the 1200 participants. Describe the population and sample for this
problem.
4. Between 2001 and 2009, 3730 adults obtained high school diplomas through the Nova Scotia
School for Adult Learning (NSSAL). The Nova Scotia government wanted to know how
many of these adults pursued further education after obtaining their diploma. After
interviewing 240 randomly selected graduates, it was discovered that 65% had pursued
post-secondary education primarily at the Nova Scotia Community College. Describe the
population and sample for this problem.
5. For each of the following, state whether the data collection would result in a categorical data
set or numerical data set. If the data is numerical, indicate whether we are dealing with
discrete or continuous data.
(a) Concentration in parts per million (ppm) of a particular
contaminant in water supplies
(b) Brand of personal computer purchased by customers
(c) The sex of children born at the IWK Hospital in December
(d) The height of male adult education learners at a specific campus
(e) The number of children in each household.
(f) The gross income of adult workers between the ages of 25 and 35
in Nova Scotia
(g) The races of people immigrating to Canada
(h) The time it takes for females between the ages of 20 and 30 to
complete the 100 m dash
(i) The sum of the numbers rolled on two dice
6. This bar graph shows the average annual snowfall in six major Canadian cities.
[Bar graph: Average Snowfall (cm), vertical axis from 0 to 400, for Vancouver, Calgary, Regina,
Toronto, Quebec City, and Halifax. Source: Statistics Canada]
(a) Of the six cities reported, which one has the greatest average annual snowfall? Approximate
that average for this city.
(b) Approximately how much more snow does Regina get compared to Vancouver?
(c) When the data was collected prior to creating this bar graph, would the snowfall data be
classified as a categorical data set or numerical data set?
7. The municipality wanted to understand how its citizens were commuting to and from work. It was
impractical to ask every citizen this question so they decided to conduct a survey where 1100
randomly selected citizens were asked, “What is your primary form of transportation to and from
work?” The data was collected and a circle graph was constructed.
[Circle graph: own vehicle 39%, public transit 28%, carpool 17%, walk 9%, bicycle 7%]
(a) How many people responded carpool?
(b) How many more people responded public transit than bicycle?
(c) How many people responded walk or bicycle?
(d) Are we dealing with a population or a sample? Explain.
(e) Would the collection of responses to this survey question be classified as a categorical or
numerical data set?
8. Statistics Canada has been using census data to track employment participation (part-time and
full-time) of Canadian females from 1976 to 2006. The graph below was constructed using this
data. The participation rates are reported as a percentage.
[Graph: Participation Rate (vertical axis, 0 to 70) versus Year (horizontal axis, 1976 to 2006 in
five-year steps)]
(a) Approximate the participation rate in 2006.
(b) Approximately how much did the participation rate increase between 1976 and 1991?
(c) Between what years was there a drop in the participation rate?
(d) Are we dealing with a population or sample? Explain.
Bar Graphs and Histograms
Bar graphs and histograms look very similar so learners often get them confused. Bar graphs
are used to display categorical data or discrete numerical data. The bars in bar graphs are
separated from one another. Examples of bar graphs are shown below.
Bar Graph #1
In this survey, 60 randomly selected Australian
students were asked to report in which month
they were born.
Bar Graph #2
In this survey, 200 randomly selected
international students were asked which hand
they write with.
Histograms are used to display continuous numerical data where the data is organized into
classes. The bars on a histogram are not separated from one another.
Histogram #1
In this survey, 100 randomly selected students from all over the world were asked to report
how long it took to travel from home to school. In this case the class width is 5. The first class
goes from 0 to 5, not including 5. The second class goes from 5 to 10, not including 10.
Histogram #2
Forty randomly selected secondary students from Canada were asked to report their heights
in centimeters. As with Histogram #1, the class width in this case is 5; however, the
intervals do not start and end on multiples of 5. For example, the first class showing a value is
centered at 120. That means that this class goes from 117.5 to 122.5, not including 122.5.
Example 1
Thirty-six randomly selected males between 20 and 29 years of age were weighed.
The weights in pounds are shown below.
210 143 194 174 203 181 224 171 178
186 182 186 188 215 192 182 194 174
166 177 192 188 191 167 207 189 155
178 162 202 160 193 181 188 181 196
(a) Construct a histogram with class widths of 10 starting at 140.
(b) What percentage of the randomly selected males weighed less than 180 pounds?
Answers:
(a) Construct a table to organize the data in terms of the classes. The first class, from 140
to 150, includes 140 but does not include 150.
Class         Tally             Frequency
140 to 150    |                 1
150 to 160    |                 1
160 to 170    ||||              4
170 to 180    |||| |            6
180 to 190    |||| |||| |       11
190 to 200    |||| ||           7
200 to 210    |||               3
210 to 220    ||                2
220 to 230    |                 1
Now construct the histogram.
(b) Out of the 36 participants, 12 weighed less than 180 pounds.
$\frac{12}{36} \times 100 = 33\tfrac{1}{3}\%$
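The tally in Example 1 can be reproduced with a short program. This is a minimal sketch in Python, not part of the original resource; the function name `frequency_table` and its parameters are assumptions made for the illustration. Each class includes its lower boundary but not its upper boundary, as in the example.

```python
weights = [210, 143, 194, 174, 203, 181, 224, 171, 178,
           186, 182, 186, 188, 215, 192, 182, 194, 174,
           166, 177, 192, 188, 191, 167, 207, 189, 155,
           178, 162, 202, 160, 193, 181, 188, 181, 196]


def frequency_table(data, start, width):
    """Count the data points in each class [lower, lower + width)."""
    counts = {}
    for x in data:
        # Lower boundary of the class that contains x.
        lower = start + ((x - start) // width) * width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


table = frequency_table(weights, start=140, width=10)
for lower, frequency in table.items():
    print(f"{lower} to {lower + 10}: {frequency}")

# Part (b): percentage of participants in classes that end at or below 180.
below_180 = sum(freq for lower, freq in table.items() if lower < 180)
percent_below_180 = below_180 / len(weights) * 100
print(round(percent_below_180, 1))
```

Note that this sketch only lists classes that contain at least one data point. An empty class would have to be added with a frequency of 0 before drawing the histogram.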
In Example 1, we encountered a histogram with a symmetrical shape.
That means that both sides of the histogram are more or less the same
when the graph is folded down the middle. The histogram to the right
has a similar configuration. This symmetrical bell-shaped distribution is
typical when data is collected from a population which follows a
normal distribution. For this course, most of our time will be spent
examining situations that follow normal distributions. However, it is
important to understand that other types of distributions exist. These other types are shown
below. A uniform distribution occurs when every class has equal frequency. A skewed
distribution occurs when one tail is much longer than the other tail. A bimodal distribution
occurs when two classes with the largest frequencies are separated by at least one class.
[Figures: Uniform Distribution, Skewed Left Distribution, Skewed Right Distribution, Bimodal Distribution]
If you wish further clarification go to the Mathematics Multimedia Learning Objects (see page
iv), access Unit 11-5 Statistics, and view MLO1 Reviewing Histograms and Frequency Polygons.
Questions
1. In each case state whether a bar graph or histogram would be used to visually represent the
data.
Bar Graph or Histogram
(a) Fifty randomly selected adults reported the brand of their
primary vehicle.
(b) Two hundred randomly selected bottles of a particular salsa
sauce were pulled off the shelves and tested for their salt
content.
(c) Seventy randomly selected Tim Hortons franchises reported
their profit for the month of November.
(d) Three hundred randomly selected adults between 30 and 45
years of age were asked to report the number of children
they have.
(e) One hundred and fifty randomly selected males reported
their favorite sport to view on television.
(f) Sixty cups of coffee from randomly selected coffee shops
had their serving temperatures recorded.
(g) A six-sided die was rolled two hundred times and the
number for each roll was recorded.
2. Thirty randomly selected families of four were asked how much they spent on their last
family meal at a restaurant. The following data was obtained.
70 68 62 86 78 67 94 82 75 74
66 103 65 97 64 68 80 83 67 71
77 72 69 64 90 72 78 66 64 86
(a) Construct a histogram with class widths of 5 starting at 60.
Class         Tally             Frequency
60 to 65
65 to 70
70 to 75
75 to 80
80 to 85
85 to 90
90 to 95
95 to 100
100 to 105
(b) What percentage of the families spent $90 or more on their meal?
(c) What type of distribution (normal, uniform, bimodal,…) are we dealing with?
(d) Why was a histogram, rather than a bar graph, used with this data?
(e) Are we dealing with a sample or population?
3. Every learner in the Adult Learning Program at one particular campus was asked how many
hours a week they spent working on school work. The following data was collected.
36 31 25 26 34 32 31 27 26 26 28 19 23 28
28 32 28 28 24 29 32 29 30 23 41 29 28 28
35 35 23 37 35 37 31 31 30 30 30 28 28 32
(a) Are we dealing with a sample or a population?
(b) Construct a histogram with class widths of 5 starting at 15.
Class         Tally             Frequency
15 to 20
(c) What type of distribution (normal, uniform, bimodal,…) are we dealing with?
4. If you were collecting a random sample in each situation, what type of distribution (normal,
uniform, bimodal,…) would you likely obtain?
Distribution Type
(a) You randomly select 100 students at an elementary school and
each must report their grade level. Each grade level occupies
two classrooms in the school. What would the distribution of
grade levels look like?
(b) Two groups of athletes are running the 100 m dash. One group
is comprised of males 12 years of age or younger, and the other
is comprised of males between 16 and 20 years of age. You
randomly select 150 athletes and ask them to report their time
for the 100 m dash. What would the distributions of times look
like?
(c) Mrs. Chopra teaches one of the three grade six classes.
Normally the administration tries to distribute the strongest math
students evenly between the three classes. That did not occur
this year and now Mrs. Chopra has a large portion of strong
math students in her class. If her class was asked to complete a
fair math test, what would the distribution of marks look like?
(d) You randomly select 100 females between the ages of 20 and 29
and record their heights. What would the distribution of heights
look like?
(e) A college instructor had what he described as an average class of
students. From his perspective there were a few weak students,
a few strong students but the majority of the students were of
average ability. He gave the class an extremely challenging test
where only the strongest students could maintain good marks.
What would the distribution of marks for this test look like?
Describing Data, Part 1
Charlie looks at the marks his Level IV Graduate Math learners earned in a particular unit over
the last year.
{82, 74, 91, 82, 79, 95, 77, 92, 86, 74, 78, 69, 84, 77, 88, 78, 71}
He wants to report how well his students performed on this particular unit without having to
supply all seventeen pieces of data. The data can be described using measures of central
tendency, such as the mean (arithmetic average) and median (middle).
Mean
The most common measure of central tendency is the arithmetic average, or mean. When
calculating a mean, statisticians differentiate between population means and sample means by
using different symbols. The procedure for calculating either of these means is identical. The
population mean and sample mean are calculated by adding all the data points and then
dividing by the number of data points.
$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$, where $\mu$ (mu) is the population mean
$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$, where $\bar{x}$ (x bar) is the sample mean
Return to Charlie’s math marks. Since he is looking at the marks of all of the learners who
completed the unit, he is dealing with a population. The population mean is calculated below.
$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$
$\mu = \frac{82 + 74 + 91 + 82 + 79 + 95 + 77 + 92 + 86 + 74 + 78 + 69 + 84 + 77 + 88 + 78 + 71}{17}$
$\mu = \frac{1377}{17}$
$\mu = 81$
The mean mark for Charlie’s learners on this unit is 81%.
Median
The mean is not the only way to describe the center. Another method is to use the “middle
value” of the data which is called the median. The median separates the higher half of the data
from the lower half. It can be calculated in the following manner.
1. Arrange the data points in order of size, from smallest to largest.
2. If the number of data points is odd, then the median is the data point in the middle of the
ordered list.
3. If the number of data points is even, then the median is the mean of the two data points
that share the middle of the ordered list.
Return to Charlie’s math marks. The median is calculated below.
Order the data points from smallest to largest.
69, 71, 74, 74, 77, 77, 78, 78, 79, 82, 82, 84, 86, 88, 91, 92, 95
Since we have an odd number of data points (n = 17), the median is the middle data point
(the 9th) of the ordered list.
The median is 79.
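The mean and median calculations for Charlie's marks can be sketched in code. The Python functions below are an illustration written for this purpose, and the names `mean` and `median` are simply chosen to match the terms above. The median function follows the three steps listed earlier: order the data, take the middle point when the count is odd, and average the two middle points when the count is even.

```python
def mean(data):
    """Add all the data points, then divide by the number of data points."""
    return sum(data) / len(data)


def median(data):
    """Middle value of the ordered data."""
    ordered = sorted(data)          # Step 1: order from smallest to largest.
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                  # Step 2: odd count, take the middle point.
        return ordered[middle]
    # Step 3: even count, average the two points that share the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2


charlie_marks = [82, 74, 91, 82, 79, 95, 77, 92, 86, 74, 78, 69, 84, 77, 88, 78, 71]

print(mean(charlie_marks))
print(median(charlie_marks))
```

Python's standard library also provides `statistics.mean` and `statistics.median`, which give the same results for this data.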
Suppose we had another instructor who had sixteen learners who completed the same unit. She
has recorded the marks that they made and worked out the mean and median.
{99, 94, 80, 63, 78, 99, 67, 62, 95, 78, 66, 93, 65, 64, 98, 95}
Mean:
$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$
$\mu = \frac{99 + 94 + 80 + 63 + 78 + 99 + 67 + 62 + 95 + 78 + 66 + 93 + 65 + 64 + 98 + 95}{16}$
$\mu = \frac{1296}{16}$
$\mu = 81$
The mean mark for these learners on this unit is 81%.
Median:
62, 63, 64, 65, 66, 67, 78, 78, 80, 93, 94, 95, 95, 98, 99, 99
$\text{Median} = \frac{78 + 80}{2} = 79$
Are the Mean and Median Enough?
These measures of central tendency often do not give us a complete understanding of the data set
because they do not give any indication how the data is spread out. This is especially evident
when we look at the means and medians for the two groups of math students discussed above.
Although the means and medians are identical for both of these groups, the marks earned by the
two groups are vastly different. In Charlie’s group, the majority of students earned marks
between 71 and 88. There was only one mark in the sixties and only three marks in the nineties.
The marks are clustered together. The marks for the other instructor’s learners could be largely
divided into two groups: learners who earned sixties and learners who earned nineties. There
were six learners who earned sixties, seven who earned nineties, and very few in between. It is
important to note that our two measures of central tendency did not reveal this important
difference between the two data sets. We will address this issue in the next section of this unit.
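The comparison between the two groups can be made concrete with a small sketch. The Python code below is an assumed illustration: it uses the standard `statistics` module for the mean and median, and a hypothetical helper `marks_by_decade` that counts how many marks fall in the sixties, seventies, eighties, and nineties.

```python
from collections import Counter
from statistics import mean, median

charlie_marks = [82, 74, 91, 82, 79, 95, 77, 92, 86, 74, 78, 69, 84, 77, 88, 78, 71]
other_marks = [99, 94, 80, 63, 78, 99, 67, 62, 95, 78, 66, 93, 65, 64, 98, 95]


def marks_by_decade(marks):
    """Count the marks in each block of ten (60 means the sixties, and so on)."""
    return dict(sorted(Counter(m // 10 * 10 for m in marks).items()))


for name, marks in (("Charlie", charlie_marks), ("Other instructor", other_marks)):
    print(name, mean(marks), median(marks), marks_by_decade(marks))
```

Both groups share the same mean and median, yet the counts by decade differ sharply: Charlie's marks pile up in the seventies and eighties, while the other group's marks sit mostly at the two ends.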
When are the Mean and Median Not Close to Each Other?
There are times when the mean and median may not be close to each other. One case is if an
outlier exists within the data set. An outlier is a data point that falls outside the overall pattern
of the data set. Consider the following data set where the data points have already been arranged
in ascending order.
{2.8, 3.0, 3.0, 3.1, 3.2, 3.4, 3.4, 3.5, 3.5, 3.6, 3.7, 3.9, 4.0, 4.2, 16.7}
Notice that all but one of the data points are between 2.8 and 4.2. The mean for this data set is 4.3 and the
median is 3.5. It is obvious that in this case the median is a far better measure of central
tendency than the mean. The outlier, 16.7, greatly influenced the mean to a point where it no
longer accurately represented the center of the data set.
The extreme sensitivity of the mean to even a single outlier and the insensitivity of the median to
outliers led to the development of trimmed means. Trimmed means are calculated by ordering
the data points from smallest to largest, deleting a selected number of points from both ends of
the ordered list, and finally averaging the remaining numbers. For example, to calculate the 5%
trimmed mean, the bottom 5% of the data points and the top 5% of the data points are deleted.
Consider the data set above. We will calculate the 5% trimmed mean for this data
set. Since 5% of the number of data points (i.e. 5% of 15) is 0.75, we round up to 1 (rounding to the
nearest whole number). We therefore drop one data point from the bottom
and one data point from the top of the data set.
2.8, 3.0, 3.0, 3.1, 3.2, 3.4, 3.4, 3.5, 3.5, 3.6, 3.7, 3.9, 4.0, 4.2, 16.7
Finally we work out the mean of the remaining thirteen data points.
$$\text{5\% trimmed mean} = \frac{3.0 + 3.0 + 3.1 + 3.2 + 3.4 + 3.4 + 3.5 + 3.5 + 3.6 + 3.7 + 3.9 + 4.0 + 4.2}{13} = 3.5$$
Notice that this trimmed mean is equal to the median that we previously calculated. By
eliminating the effects of outliers, the median and resulting mean should be in close proximity.
The symbol $\bar{x}_{(T)}$ is used to represent a trimmed mean. The only problem with this symbol is
that it does not indicate whether we are dealing with a 5%, 10%, 15% or 20% trimmed mean.
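The trimmed mean procedure described above can be sketched in Python as follows. The function name `trimmed_mean` is an illustrative assumption. The text only says to round to the nearest whole number; rounding an exact half upward is an assumption made here, since the text does not cover that case.

```python
def trimmed_mean(data, percent):
    """Drop `percent` % of the points from each end of the ordered list, then average the rest."""
    ordered = sorted(data)
    n = len(ordered)
    # Number of points to drop from each end, rounded to the nearest whole number.
    # Assumption: an exact half (e.g. 2.5) is rounded up.
    k = int(n * percent / 100 + 0.5)
    if 2 * k >= n:
        raise ValueError("trimming would remove every data point")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)

# Data set with the outlier 16.7 discussed above.
data = [2.8, 3.0, 3.0, 3.1, 3.2, 3.4, 3.4, 3.5, 3.5, 3.6, 3.7, 3.9, 4.0, 4.2, 16.7]
print(round(sum(data) / len(data), 1))   # ordinary mean, pulled up by the outlier
print(round(trimmed_mean(data, 5), 1))   # 5% trimmed mean
```

With 15 data points, 5% of 15 is 0.75, so one point is dropped from each end and the remaining thirteen are averaged, as in the hand calculation.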
Example 1
Twenty-two runners of the 100 m dash were randomly selected from colleges and universities in
Canada. The time of each runner in the last competition was recorded. Of these runners, one
person had pulled a hamstring and another had tripped during their last competition. The times
in seconds are recorded below. Determine the mean, median, and 10% trimmed mean.
10.23  10.89  11.76   9.87  11.33  10.75   9.96  11.54
10.52  18.57   9.72  12.05  11.56  10.15  19.42  11.68
12.09  11.49  11.67  10.19  10.52   9.99
Answer:
$$\text{Mean} = \frac{10.23 + 10.89 + 11.76 + \dots + 10.19 + 10.52 + 9.99}{22} = 11.63$$
Median: Rearrange the data points from smallest to largest. Since we are dealing with an
even number of data points (22), then the median is the mean of the two data points
that share the middle of the ordered list.
9.72, 9.87, 9.96, 9.99,…, 10.75, 10.89, 11.33, 11.49,…, 12.05, 12.09, 18.57, 19.42
$$\text{Median} = \frac{10.89 + 11.33}{2} = 11.11$$
10% Trimmed Mean
Since 10% of the number of data points (i.e. 10% of 22) is 2.2, we round down
to 2 (rounding to the nearest whole number). We now drop two data points from the
bottom and two data points from the top of the data set, and then work out the
mean of the remaining eighteen data points.
9.72, 9.87, 9.96, 9.99, 10.15,…, 11.76, 12.05, 12.09, 18.57, 19.42
$$\text{10\% trimmed mean} = \frac{9.96 + 9.99 + 10.15 + \dots + 11.76 + 12.05 + 12.09}{18} = 11.02$$
Questions
Please use the appropriate symbols ($\bar{x}$, µ, and $\bar{x}_{(T)}$) when answering these questions.
1. A study regarding the size of winter wolf packs in regions of the United States, Canada, and
Finland was conducted. The following data from 18 randomly selected packs was obtained.
2   3   15   8   7   8   2   4   13
7   3   7   10   7   5   4   2   4
(a) Are we dealing with a sample or a population?
_____________________
(b) Determine the mean and the median.
(c) Why would the researchers likely not use a trimmed mean with this data set?
2. A local cab company has a fleet of nine cars. The company kept records of the amount of
money each vehicle required for a one-week period. The data are shown below.
$125   $157   $210   $139   $182   $167   $143   $150   $162
(a) Are we dealing with a sample or a population?
_____________________
(b) Are we dealing with a numerical or categorical data set?
_____________________
(c) Determine the mean and median.
3. A magazine conducted a survey where they wished to understand the average class size of
first year courses at a local community college. They randomly selected 17 first year classes
and obtained the following numbers.
23   37   36   40   39   115   28   25   23
32   27   16   15   31   27   34   41
(a) Are we dealing with a sample or a population?
____________________
(b) Determine the mean, median, 5% trimmed mean, and 10% trimmed mean.
(c) Why is it appropriate to use trimmed means in this situation?
(d) If this data set was comprised of 78 data points and we wanted to calculate a 5% trimmed
mean, how many data points would be dropped from the bottom and top of the data set?
4. A new subdivision outside of Halifax was constructed over the last few years. Barb wanted
to know what the average value of the new homes was. She was not prepared to look at the
assessed values of all 218 new homes. Instead she randomly selected 24 homes and recorded
their assessed values. These values in thousands of dollars are shown below.
266   265   226   254   231   221   246   252
253   241   261   589   243   270   267   253
287   320   221   264   257   249   226   267
(a) Calculate the mean, median, and 5% trimmed mean.
(b) Which of these measures is not influenced or less influenced by extremely high or low
data points?
(c) Would a histogram or a bar graph be used with this data set?
5. (a) In gymnastics and diving, several judges score each athlete. The final score for the
athlete is calculated by removing the high and low scores and averaging the remainder.
Why do you think they use this trimmed mean scoring method in gymnastics and diving?
(b) Judging in figure skating has always been controversial, but this issue really came to the
surface during the 2002 Salt Lake City Winter Olympics when two Canadian skaters,
Jamie Sale and David Pelletier, were awarded the silver medal rather than the gold medal
expected by the crowd, by many television commentators, and on the basis of the scores of four
of the nine judges. It was later learned that the French judge had conspired with the
Russian judge to favor the Russian skating pair. At the time they were using an ordinal
method for awarding medals, rather than the trimmed mean method used in gymnastics
and diving. Explain why the trimmed mean method would also have been ineffective at
dealing with this incident of collusion during the 2002 Winter Olympics.
Describing Data, Part 2
Measures of central tendency (mean, median, and mode) do not give us any indication of how the data is
spread out. Consider the following two sets of data.
First Data Set: 13, 14, 15, 15, 15, 16, 17
Second Data Set: 10, 12, 13, 15, 17, 18, 20
The mean for both of these data sets is 15; however, the individual pieces of data in these sets are
considerably different. In the first set, the numbers range from 13 to 17, and clearly cluster
around the number 15. In the second set the numbers range from 10 to 20 and tend to be more
spread out around the mean. The dispersion is far greater in the second set, than in the first.
Standard deviation is one way of measuring the spread or dispersion of a set of data relative to
the mean. If the standard deviation is low, then the data cluster around the mean. If the standard
deviation is high, then the data are spread out around the mean. Without getting into the actual
calculations, the standard deviation for the first data set is 1.20 and the standard deviation for the
second data set is 3.30. The larger number indicates greater dispersion.
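The two quoted values can be reproduced with Python's standard library. This sketch assumes the quoted figures are population standard deviations (division by n), since that is what matches 1.20 and 3.30; the text itself does not say which formula was used at this point.

```python
from statistics import mean, pstdev

first = [13, 14, 15, 15, 15, 16, 17]
second = [10, 12, 13, 15, 17, 18, 20]

for label, data in (("First data set", first), ("Second data set", second)):
    # pstdev treats the data as a whole population (it divides by n).
    print(f"{label}: mean = {mean(data)}, standard deviation = {pstdev(data):.2f}")
```

Both sets have a mean of 15, but the second set's larger standard deviation reflects its greater dispersion.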
Calculating Standard Deviation
Before we get to the calculations, we have to remind you of an important point and introduce two
formulas. In the first section we talked about populations and samples. A population is the set
representing all measurements of interest to an investigator while a sample is simply a subset of
the measurements from the population chosen at random. We previously learned that both the
population mean and sample mean are calculated by adding all the data points and then dividing
by the number of data points. The only difference is that we use different symbols to
differentiate a population mean from a sample mean.
$$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
where µ (mu) is the population mean
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
where $\bar{x}$ (x bar) is the sample mean
Similarly we have two different formulas for population standard deviation and sample standard
deviation. They differ, however, in more than just the symbols used.
The formula for population standard deviation, σ (sigma), is shown below. You are not
expected to memorize this formula.
$$\sigma = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \dots + (x_n - \mu)^2}{n}}$$
This formula requires that you complete six steps.
Step 1: Find the mean, µ.
Step 2: Calculate the difference between each data point and the mean, $x_i - \mu$.
Step 3: Square the differences found in Step 2, $(x_i - \mu)^2$.
Step 4: Add the squared differences, $(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \dots + (x_n - \mu)^2$.
Step 5: Divide the sum from Step 4 by the number of data points.
Step 6: Take the square root of the value from Step 5.
The easiest way to learn how to use this formula (i.e. complete the six steps) is to construct a
table where only small portions of the calculation are completed at any one time.
Example 1
Mrs. Gillis teaches math to adults. At the end of the year she examines the final marks for all of
her students who have completed the course. She wants to work out the standard deviation of
those marks.
87   72   91   82   74   93   75   83   78   75   81
Answer:
Find the mean.
$$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
$$\mu = \frac{87 + 72 + 91 + 82 + 74 + 93 + 75 + 83 + 78 + 75 + 81}{11}$$
$$\mu = 81$$
Construct the table.
x_i     x_i − µ           (x_i − µ)²
87      87 − 81 = 6       6² = 36
72      72 − 81 = −9      (−9)² = 81
91      91 − 81 = 10      (10)² = 100
82      1                 1
74      −7                49
93      12                144
75      −6                36
83      2                 4
78      −3                9
75      −6                36
81      0                 0
                          Sum = 496
Note:
Remember that we stated that the standard deviation is one way of measuring the spread or
dispersion of a set of data relative to the mean. Notice that in the second column of this table
we are finding the differences between individual data points and the mean. These differences,
not surprisingly, are integral in calculating the standard deviation.
$$\sigma = \sqrt{\frac{496}{11}}$$
$$\sigma = 6.71$$
The population standard deviation is 6.71.
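The six steps can also be followed one at a time in Python. This is a minimal sketch; the function name `population_std_dev` is an illustrative assumption, and each line is labelled with the step it mirrors.

```python
import math

def population_std_dev(data):
    """Follow the six steps for the population standard deviation."""
    n = len(data)
    mu = sum(data) / n                        # Step 1: find the mean
    differences = [x - mu for x in data]      # Step 2: difference from the mean
    squares = [d ** 2 for d in differences]   # Step 3: square each difference
    total = sum(squares)                      # Step 4: add the squared differences
    average_square = total / n                # Step 5: divide by the number of data points
    return math.sqrt(average_square)          # Step 6: take the square root

# Mrs. Gillis's final marks from Example 1.
marks = [87, 72, 91, 82, 74, 93, 75, 83, 78, 75, 81]
print(round(population_std_dev(marks), 2))
```

The intermediate lists `differences` and `squares` correspond to the second and third columns of the table, which makes it easy to compare the code against the hand calculation.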
The formula for sample standard deviation, $S_x$ (S subscript x), is shown below. You are not
expected to memorize this formula.
$$S_x = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}}$$
This formula requires that you complete a six-step procedure very similar, but not identical, to
the procedure for population standard deviation. The difference is in Step 5, where the sum is
divided by n − 1 rather than n.
Example 2
Mr. MacDonald is the dean of the adult education program at the college. At the end of the year
he wants to understand the types of marks learners are obtaining in their new math program.
Instead of looking at every mark earned in this course, he randomly selects the final marks of 10
students. He wants to work out the standard deviation of those marks.
75   80   70   88   91   77   82   85   73   79
Answer:
Find the mean.
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
$$\bar{x} = \frac{75 + 80 + 70 + 88 + 91 + 77 + 82 + 85 + 73 + 79}{10}$$
$$\bar{x} = 80$$
Construct the table.
x_i     x_i − x̄     (x_i − x̄)²
75      −5          25
80      0           0
70      −10         100
88      8           64
91      11          121
77      −3          9
82      2           4
85      5           25
73      −7          49
79      −1          1
                    Sum = 398
$$S_x = \sqrt{\frac{398}{10 - 1}}$$
$$S_x = 6.65$$
The sample standard deviation is 6.65.
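A short Python sketch of the sample calculation follows. The function name `sample_std_dev` is an illustrative assumption; the only change from the population procedure is the division by n − 1. The standard library function `statistics.stdev` uses the same n − 1 divisor, so it serves as a cross-check.

```python
import math
from statistics import stdev

def sample_std_dev(data):
    """Sample standard deviation: divide the sum of squared differences by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    total = sum((x - x_bar) ** 2 for x in data)
    return math.sqrt(total / (n - 1))

# Mr. MacDonald's random sample of 10 final marks from Example 2.
marks = [75, 80, 70, 88, 91, 77, 82, 85, 73, 79]
print(round(sample_std_dev(marks), 2))   # hand-built version
print(round(stdev(marks), 2))            # library version, same divisor
```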
If you wish further clarification go to the Mathematics Multimedia Learning Objects (see page
iv), access Unit 11-5 Statistics, and view MLO6 Standard Deviation.
Questions
1. Determine the sample standard deviation for the following data.
25   32   24   28   31   28
x̄ =
x_i     x_i − x̄     (x_i − x̄)²
2. Determine the population standard deviation for the following data.
3.7   4.3   5.0   4.6   4.0   4.7   3.9   4.2
µ =
x_i     x_i − µ     (x_i − µ)²
3. Two data sets have been provided.
First Data Set: 15, 14, 13, 18, 16, 13, 16, 15, 15
Second Data Set: 17, 15, 16, 14, 11, 19, 16, 11, 16
(a) Calculate the sample mean and sample standard deviation for each data set.
First Data Set: x̄ =
x_i     x_i − x̄     (x_i − x̄)²
Second Data Set: x̄ =
x_i     x_i − x̄     (x_i − x̄)²
(b) The standard deviations are different for the two data sets. What is this telling you?
4. In the grocery store, Anne noticed that a particular brand of canned beans was labeled 540
grams. She randomly selected 8 cans and checked the weight of their contents. She ended up
with the following data.
542   539   544   549   537   541   548   552
(a) What is the median for these data?
(b) What is the mean for these data?
(c) Determine the standard deviation.
xi
(d) If Anne selected another random sample of size 8, would we expect to obtain the same
mean and standard deviation? Explain.
5. Barb, a math instructor, recorded the heights in centimetres of all of the male students in her
Level IV math courses. She obtained the following measurements.
181   173   184   183   190   180   186   176   185
(a) What is the median for these data?
(b) What is the mean for these data?
(c) Is Barb dealing with a categorical or numerical data set?
(d) Determine the standard deviation.
xi
(e) Another instructor at a different campus also has 9 male learners in his Level IV Math
courses. He measured their heights. He found the mean to be 182 cm with a standard
deviation of 6.4 cm. Based on these results, what can you say about the heights of this
instructor’s male learners compared to Barb’s male learners?
(f) A third instructor at another campus also has 9 male learners in her Level IV Math
courses. She measured their heights. She found the mean to be 179 cm with a standard
deviation of 4.8 cm. Based on these results, what can you say about the heights of this
instructor’s male learners compared to Barb’s male learners?
6. Create two data sets that meet all of the following conditions.
• They have at least eight pieces of data.
• They must have a mean of 10.
• They have standard deviations that are quite different.
7. Without attempting any calculations, match each standard deviation with the appropriate
histogram. Explain how you arrived at your answers. Please note that all of the histograms
are drawn at the same scale.
Standard Deviations:
(a) 0.69
(b) 1.40
(c) 3.34
(d) 3.62
Histograms:
(i)
(ii)
(iii)
(iv)
Matches with _____
Matches with _____
Matches with _____
Matches with _____
Explanation:
Note:
We have not fully explained the usefulness of standard deviation in this section of the chapter.
As we progress through the unit, we will constantly revisit this topic and broaden our
understanding of its usefulness, particularly in the context of the normal distribution.
Using Technology
In the last section we learned how to work out the population standard deviation (σ) and sample
standard deviation ($S_x$) using paper and pencil. The TI graphing calculators can calculate both
of these and more for us. Using such technology is particularly useful when our sample size is
large.
Example
Of the 1643 people who were departing the airport for overseas destinations on the morning of
January 13, an airport worker randomly selected 30 people and asked them how long, in minutes,
it took to check in and pass through security. She obtained the following data.
40   60   46   56   68   44   51   53   42   58
55   60   48   45   52   52   38   55   49   46
56   51   50   40   35   50   54   64   50   45
(a) Draw a histogram using technology. Use class widths of 5 starting at 35.
(b) Determine the mean time.
(c) Determine the standard deviation.
(d) Determine the median.
Answers:
(a) Step 1: Enter the Data
STAT > Edit > Enter the data in L1
(If data already exists in L1, move the cursor up so L1 is highlighted, press CLEAR, and
move the cursor back down.)
Step 2: Draw the Histogram
STATPLOT > Select Plot 1 > Turn on the plot, select histogram (Xlist should be L1 and Freq
should be 1) > WINDOW > Set Xmin at 35, Xmax at 70, Xscl at 5, Ymin at 0, Ymax at 10, Yscl at 1
> GRAPH > TRACE > Use the right and left arrows
(b) Parts (b) and (c) will be done simultaneously. Please note that the data has already been
entered in the calculator when we constructed the histogram.
STAT > CALC > 1-Var Stats > Enter the List > ENTER
The sample mean ($\bar{x}$) is 50.4.
(c) The sample standard deviation ($S_x$) is 7.64. (See above.)
(d) While the 1-Var Stats results are still on the screen, scroll using
the down arrow until you find Med.
The median in this case is 50.5.
Note:
The calculator uses the symbol $\sigma_x$, rather than σ, to represent the population standard
deviation. The calculator does not report the population mean (µ); however, as we previously
learned, the formulas for the sample mean and population mean are the same. We can therefore use
the value the calculator generates for $\bar{x}$ as the value for µ.
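For readers without a TI calculator, the same summary can be sketched in Python. This is an illustrative alternative, not part of the original unit. It reports the sample mean, both standard deviations, the median, and the class counts behind the histogram (class widths of 5 starting at 35); the variable names are assumptions.

```python
from statistics import mean, median, pstdev, stdev

# Check-in times (minutes) for the 30 randomly selected travellers.
times = [40, 60, 46, 56, 68, 44, 51, 53, 42, 58,
         55, 60, 48, 45, 52, 52, 38, 55, 49, 46,
         56, 51, 50, 40, 35, 50, 54, 64, 50, 45]

# Histogram classes of width 5 starting at 35; each class includes its
# lower boundary and excludes its upper boundary.
counts = {}
for lower in range(35, 70, 5):
    counts[lower] = sum(1 for t in times if lower <= t < lower + 5)

print("mean   =", round(mean(times), 1))
print("Sx     =", round(stdev(times), 2))    # sample standard deviation (n - 1)
print("sigma  =", round(pstdev(times), 2))   # population standard deviation (n)
print("median =", median(times))
for lower, count in counts.items():
    print(f"{lower} to {lower + 5}: {'*' * count}")
```

Because these 30 times are a sample of the 1643 travellers, the sample standard deviation is the value to report, just as with the calculator output.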
Questions
1. The Survey of Study Habits and Attitudes (SSHA) is a psychological test given to college
students to evaluate their motivation, study habits, and attitudes towards their post-secondary
studies. A local community college campus randomly selected 20 female first year students
to complete the SSHA. The individual results are listed below.
154   167   129   151   153   164   140   112   157   144
162   166   158   143   174   190   180   137   175   155
(a) Are we dealing with a population or a sample?
(b) Using technology draw a histogram showing the distribution of SSHA scores. Use class
widths of 10 starting at 110.
(c) Determine the mean, median, and standard deviation.
(d) Describe the distribution (normal, uniform, skewed, or bimodal).
2. Below you will find a list of Prime Ministers of Canada since Confederation in 1867. We
have also been supplied with their age upon first taking office as PM.
Prime Minister (PM)             First Term Starts    Age
John A. MacDonald               1867                 52
Alexander Mackenzie             1873                 51
John Abbott                     1891                 70
John Sparrow Thompson           1892                 48
Mackenzie Bowell                1894                 70
Charles Tupper                  1896                 74
Wilfrid Laurier                 1896                 54
Robert Borden                   1911                 57
Arthur Meighen                  1920                 46
William Lyon Mackenzie King     1921                 47
Richard Bennett                 1930                 60
Louis St-Laurent                1948                 66
John Diefenbaker                1957                 61
Lester Pearson                  1963                 65
Pierre Trudeau                  1968                 48
Joe Clark                       1979                 39
John Turner                     1984                 55
Brian Mulroney                  1984                 45
Kim Campbell                    1993                 46
Jean Chretien                   1993                 59
Paul Martin                     2003                 65
Stephen Harper                  2006                 46
(a) Are we dealing with a population or a sample?
(b) Using technology draw a histogram showing the distribution of ages for PMs first taking
office. Use class widths of 5 starting at 35.
(c) Determine the mean PM age for first taking office.
(d) Determine the standard deviation.
(e) Determine the median.
(f) What can you conclude based on the histogram and standard deviation?
3. Provincial governments keep records of the number of young offenders who are incarcerated
each year. The incarceration rates vary greatly from province to province. In 2006 Nova
Scotia reported an incarceration rate of 9.91. That means that 9.91 young persons out of every
10 000 young persons were incarcerated. Below you will find the incarceration rates for the
provinces and territories for 2006. (Source: Statistics Canada)
Province    Rate
YT          8.57
NT          46.12
NU          20.49
BC          4.45
AB          7.18
SK          24.54
MB          21.25
ON          7.51
QC          3.89
NB          10.20
PE          7.21
NS          9.91
NL          11.93
(a) Are we dealing with a population or a sample?
(b) Using technology draw a histogram showing the distribution of incarceration rates. Use class
widths of 5 starting at 0.
(c) Determine the mean, median, and standard deviation.
(d) There is a substantial difference between the mean and median. Why is this so?
Normal Distribution
A frequency polygon is the shape that is formed when midpoints of the tops of the bars on a
histogram are joined by straight lines.
In this case, the frequency polygon forms a bell-shaped curve that is associated with a population
that follows a normal distribution. Many variables observed in nature, including heights,
weights, and reaction times, follow normal distributions. Consider the heights of female students
at college. There are a few women who are less than 5 feet tall, a few who are taller than 6 feet,
but the majority of the women are probably between 5’3” and 5’8”. We would expect a normal
distribution for the heights of women attending college.
Let’s consider a population that results in a normal distribution. The normal curve will be
centered about the population mean (µ). The standard deviation (σ) determines the extent to
which the curve spreads out. If we look at the two normal distributions supplied below, we
can see that both distributions are centered around the same value, 65. That means that the mean for
both of these populations is 65. The standard deviations, although not supplied, are not the same.
The standard deviation for normal distribution A must be lower than that for distribution B because
curve A is narrower, meaning that the data points are more clustered around the mean.
[Figure: normal distributions A and B, both centered at 65]
Please note that the horizontal axis is labeled x. This indicates that we are looking at the
distribution of the individual data points denoted by the symbol x.
According to the 68-95-99.7 rule, in any bell-shaped distribution, the following holds true.
• Approximately 68% of the data points will lie within one standard deviation of the mean.
• Approximately 95% of the data points will lie within two standard deviations of the
mean.
• Approximately 99.7% of the data points will lie within three standard deviations of the
mean.
This rule is true for populations and large samples. However, it is written using different
symbols in each case.
For Populations:
• Approximately 68% of the data points are between µ − σ and µ + σ .
• Approximately 95% of the data points are between µ − 2σ and µ + 2σ .
• Approximately 99.7% of the data points are between µ − 3σ and µ + 3σ .
For Large Samples:
• Approximately 68% of the data points are between $\bar{x} - S_x$ and $\bar{x} + S_x$.
• Approximately 95% of the data points are between $\bar{x} - 2S_x$ and $\bar{x} + 2S_x$.
• Approximately 99.7% of the data points are between $\bar{x} - 3S_x$ and $\bar{x} + 3S_x$.
Let’s see how this rule applies to a population with a normal distribution where the population
mean ( µ ) is 40 and the population standard deviation ( σ ) is 10. This distribution is shown
below. Notice that it is centered about the mean.
For this population we would expect that approximately 68% of the data points would be
between 30 ( µ − σ or 40-10) and 50 ( µ + σ or 40+10). We would expect that approximately
95% of the data points would be between 20 ( µ − 2σ ) and 60 ( µ + 2σ ). Finally we would
expect approximately 99.7% of the data points to be between 10 ( µ − 3σ ) and 70 ( µ + 3σ ).
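The three intervals above can be checked against the exact normal curve with Python's standard library. This is a sketch that assumes Python 3.8 or later, where `statistics.NormalDist` is available; the dictionary name `shares` is illustrative.

```python
from statistics import NormalDist

mu, sigma = 40, 10
population = NormalDist(mu, sigma)

shares = {}
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    # cdf(x) gives the proportion of the population at or below x.
    shares[k] = 100 * (population.cdf(high) - population.cdf(low))
    print(f"between {low} and {high}: {shares[k]:.2f}%")
```

The exact percentages come out very close to 68, 95, and 99.7, which is why the rule is stated with the word "approximately".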
Checking the 68-95-99.7 Rule Using a Simulation
Most learners do not follow rules blindly; they like to know where the rule comes from and/or
see if the rule really works. The mathematics required to derive the 68-95-99.7 rule is beyond
this course. However, we can conduct a simulation on the graphing calculator to demonstrate
that the rule does indeed work. We will accomplish this using the random number generator
built into the calculator. Before doing so, we will have to seed the calculator to ensure that the
numbers generated on your calculator differ from those generated by your classmate’s calculator.
Since it is unlikely that you and your classmates share the same telephone
number, you will use your phone number to seed the calculator.
Type in phone number > STO → > MATH > PRB > rand > ENTER
For the simulation that follows we will be using the rand command that
is found under the MATH menu.
Suppose you wished to randomly select 100 bee hives of the same size from the same region of
the province. You wished to record the seasonal honey production (in kilograms) for each hive
over a four year period. We are obviously not going to actually collect this data; instead we will
use the graphing calculator to simulate the collection of this data.
Step 1:
We will simulate the collection of honey production numbers for the 100 hives for the first year.
Once completed, you will have values ranging from 40 to 70 in List 1.
STAT > EDIT > Move the cursor up to highlight L1 > Enter 40+30*rand(100) > ENTER
Step 2:
Now we will simulate the collection of honey production numbers for the 100 hives for years
two, three, and four. This will be accomplished by repeating Step 1 but using List 2, List 3, and
List 4. Once completed, you will have values ranging from 40 to 70 in all four lists.
Step 3:
Since we want the total honey production over the four year period for each hive, we will need to
add the numbers in the same row. This can be accomplished using List 5 where one states that
its values are generated by adding the corresponding values in Lists 1, 2, 3, and 4.
STAT > EDIT > Move the cursor up to highlight L5 > Enter L1+L2+L3+L4 > ENTER
Step 4:
Order the numbers in List 5 from smallest to largest (i.e. in ascending
order).
QUIT > CLEAR > STAT > SortA( > L5 > ENTER
Step 5:
Enter the one hundred data points from List 5 in the chart below. Round the values to the nearest
tenth.
Step 6:
Construct a histogram using the following classes.
Class          Frequency
150 to 160
160 to 170
170 to 180
180 to 190
190 to 200
200 to 210
210 to 220
220 to 230
230 to 240
240 to 250
250 to 260
260 to 270
270 to 280
Step 7:
Use the calculator to determine the sample mean and sample standard deviation for the data in
List 5.
STAT > CALC > 1-Var Stats > L5 > ENTER
Record the two values.
Sample Mean = ________
Sample Standard Deviation = ________
Questions:
1. In terms of this situation, what does the first data point in List 5 represent?
2. In terms of this situation, what does the last data point in List 5 represent?
3. In terms of this situation, what does $\bar{x}$ represent?
4. (a) Calculate $\bar{x} - S_x$ and $\bar{x} + S_x$ using the values we obtained in Step 7.
(b) Go through the chart from Step 5 and count the number of data points that are between
$\bar{x} - S_x$ and $\bar{x} + S_x$.
(c) According to the 68-95-99.7 rule, approximately what percentage of the data points
should be between $\bar{x} - S_x$ and $\bar{x} + S_x$? Is this supported by this simulation? Explain.
5. (a) Calculate $\bar{x} - 2S_x$ and $\bar{x} + 2S_x$ using the values we obtained in Step 7.
(b) Go through the chart from Step 5 and count the number of data points that are between
$\bar{x} - 2S_x$ and $\bar{x} + 2S_x$.
(c) According to the 68-95-99.7 rule, approximately what percentage of the data points
should be between $\bar{x} - 2S_x$ and $\bar{x} + 2S_x$? Is this supported by this simulation? Explain.
Note:
For Questions 4 and 5, you may feel that the data points from the simulation do not resoundingly
support the 68-95-99.7 rule. Remember that the sample we took was only of size 100
(i.e. n = 100). Better results could be obtained if we increased the sample size significantly (e.g.
n = 1000) but unfortunately it would take us a lot more time to complete the simulation and
accompanying questions.
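The calculator simulation can be mirrored in Python, which also makes it easy to try a larger sample. This is a sketch under stated assumptions: `random.random()` plays the role of the calculator's rand command, the seed is a made-up phone number, and the names `totals`, `within_1`, and `within_2` are illustrative. Unlike on the calculator, fixing the seed here makes a run repeatable; change it to get a different set of hives.

```python
import random
from statistics import mean, stdev

random.seed(9025550123)   # hypothetical phone number used as the seed
hives = 100               # try 1000 to see the effect of a larger sample

# Each hive: four yearly values of 40 + 30*rand, added together (Steps 1 to 3).
totals = [sum(40 + 30 * random.random() for _year in range(4))
          for _hive in range(hives)]
totals.sort()             # Step 4: ascending order

x_bar = mean(totals)      # Step 7: sample mean
s_x = stdev(totals)       # Step 7: sample standard deviation

within_1 = sum(1 for t in totals if x_bar - s_x <= t <= x_bar + s_x)
within_2 = sum(1 for t in totals if x_bar - 2 * s_x <= t <= x_bar + 2 * s_x)

print(f"sample mean = {x_bar:.1f}, sample standard deviation = {s_x:.1f}")
print(f"within one standard deviation:  {100 * within_1 / hives:.0f}%")
print(f"within two standard deviations: {100 * within_2 / hives:.0f}%")
```

The percentages will vary from seed to seed, especially with only 100 hives, so they should be read as rough agreement with the rule rather than an exact match.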
Using the 68-95-99.7 Rule
In the last section we learned how the 68-95-99.7 rule applies to normal populations or large
samples that result in a distribution that is approximately normal. In this section, we will show
how this rule can be used to answer a number of questions.
Consider the following statements for a normal population.
• If 68% of the data points are found between µ − σ and µ + σ, then 34% of the data
points would be between µ and µ + σ.
• If 68% of the data points are found between µ − σ and µ + σ, then 34% of the data
points would be between µ − σ and µ.
[Figure: normal curve with µ − σ, µ, and µ + σ marked on the x axis; 68% between µ − σ and µ + σ, 34% on each side of µ]
If we extend this line of thinking, we can state the following.
• If 95% of the data points are found between µ − 2σ and µ + 2σ, then 47.5% of the data
points would be between µ and µ + 2σ.
• If 95% of the data points are found between µ − 2σ and µ + 2σ, then 47.5% of the data
points would be between µ − 2σ and µ.
• If 99.7% of the data points are found between µ − 3σ and µ + 3σ, then 49.85% of the
data points would be between µ and µ + 3σ.
• If 99.7% of the data points are found between µ − 3σ and µ + 3σ, then 49.85% of the
data points would be between µ − 3σ and µ.
Naturally this line of thinking can also be applied to samples that result in a distribution that is
approximately normal; however, we will use the symbols $\bar{x}$ and $S_x$.
Example 1
For a normal population with a mean of 15 and standard deviation of 2, what percentage of the
data points would measure
(a) between 15 and 19?
(b) between 13 and 21?
(c) between 11 and 13?
Answers:
(a) This question could be restated. It would read, “What percentage of the data points
would be between µ and µ + 2σ ?”
[Figure: normal curve with 15 (µ) and 19 (µ + 2σ) marked on the x axis; 47.5% between them]
Therefore approximately 47.5% of the data points will be between 15 and 19.
(b) This question could be restated. It would read, “What percentage of the data points
would be between µ − σ and µ + 3σ ?”
[Figure: normal curve with 13 (µ − σ), 15 (µ), and 21 (µ + 3σ) marked on the x axis; 34% between 13 and 15, 49.85% between 15 and 21]
Therefore approximately 83.85% (34% + 49.85%) of the data points will be between 13
and 21.
(c) This question could be restated. It would read, “What percentage of the data points
would be between µ − 2σ and µ − σ ?”
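[Figure: normal curve with 11 (µ − 2σ), 13 (µ − σ), and 15 (µ) marked on the x axis; 47.5% between 11 and 15, 34% between 13 and 15]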
Therefore approximately 13.5% (47.5%-34%) of the data points will be between 11 and
13.
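The adding and subtracting of half-areas used in parts (a) to (c) can be sketched as a small Python function. The names `HALF_AREA`, `signed_area`, and `percent_between` are illustrative assumptions. The sketch only handles endpoints that are a whole number of standard deviations (at most 3) from the mean, because that is all the 68-95-99.7 rule covers.

```python
# Percentage of a normal population between the mean and k standard deviations
# on one side, from the 68-95-99.7 rule (half of 68, 95, and 99.7).
HALF_AREA = {0: 0.0, 1: 34.0, 2: 47.5, 3: 49.85}

def signed_area(k):
    """Area between the mean and k standard deviations; negative below the mean."""
    return HALF_AREA[abs(k)] if k >= 0 else -HALF_AREA[abs(k)]

def percent_between(low, high, mu, sigma):
    """Approximate percentage of data points between low and high."""
    k_low = (low - mu) / sigma
    k_high = (high - mu) / sigma
    for k in (k_low, k_high):
        if k != int(k) or abs(k) > 3:
            raise ValueError("each endpoint must be 0, 1, 2, or 3 standard "
                             "deviations from the mean")
    return round(signed_area(int(k_high)) - signed_area(int(k_low)), 2)

# Example 1: normal population with mean 15 and standard deviation 2.
print(percent_between(15, 19, 15, 2))   # part (a)
print(percent_between(13, 21, 15, 2))   # part (b)
print(percent_between(11, 13, 15, 2))   # part (c)
```

Treating areas below the mean as negative lets one subtraction cover all three cases: intervals that start at the mean, intervals that straddle it, and intervals that lie entirely on one side.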
This is a difficult concept to explain without a lot of diagrams. It is strongly recommended that
you seek further clarification by going to the Mathematics Multimedia Learning Objects (see
page iv), accessing Unit 11-5 Statistics, and viewing MLO8 Using Normal Distribution.
Questions
1. Use the 68-95-99.7 rule on a distribution of data points with a population mean of 230 and a
population standard deviation of 15 to answer the following questions.
(a) What percentage of the data points would measure between 215 and 245?
(b) What percentage of the data points would measure between 230 and 260?
(c) What percentage of the data points would measure between 215 and 230?
(d) What percentage of the data points would measure between 185 and 230?
(e) What percentage of the data points would measure between 200 and 245?
(f) What percentage of the data points would measure between 215 and 275?
(g) What percentage of the data points would measure between 185 and 260?
(h) What percentage of the data points would measure between 245 and 260?
(i) What percentage of the data points would measure between 185 and 200?
(j) What percentage of the data points would measure between 245 and 275?
2. A sample of 2000 randomly selected bagels of the same type was removed from a production
line. The mean weight was 104 grams with a standard deviation of 3 grams. Assume the
distribution of bagel weights is bell-shaped.
(a) Approximately how many bagels were within 9 grams of the mean?
(b) Approximately how many bagels were within 3 grams of the mean?
(c) Approximately how many bagels are between 98 grams and 104 grams?
(d) Approximately how many bagels are between 101 grams and 110 grams?
(e) Approximately how many bagels are between 107 grams and 110 grams?
(f) Approximately how many bagels are between 98 grams and 110 grams?
(g) Approximately how many bagels are between 95 grams and 101 grams?
(h) Approximately how many bagels are between 98 grams and 113 grams?
(i) Approximately how many bagels are between 95 grams and 104 grams?
(j) Approximately how many bagels are between 110 grams and 113 grams?
Making Inferences
Up to this point in this resource, we have looked at a variety of ways of describing data whether
that data was derived from a sample or a population. This means that we have focused on
descriptive statistics. At this point in the unit, we will now focus on making inferences about a
population based on a sample. In other words, we will use data obtained from a sample to try to
understand the population. This is inferential statistics.
There are specific steps for making such inferences.
1. Specify the question(s) to be investigated and identify the population of interest.
2. Determine how the sample will be selected from the population.
3. Obtain the sample and analyze the sample information.
4. Use the sample information to make inferences about the population.
5. Use a measure of reliability to indicate how much confidence can be placed on that
inference.
The next two sections of this unit will look at the different ways to collect a sample. After that,
the remaining sections will focus on steps 3, 4, and 5. Specifically we will look at a new concept
for most of you, confidence intervals based on a sample mean.
Collecting a Sample
Since collecting data from an entire population is often not feasible, we may use a sample from
that population in order to answer questions about the population as a whole. It is important that
we collect unbiased samples to ensure that the samples are representative of the population.
Investigation
Suppose we want to know the mean (average) square footage of buildings in a local industrial
park. Rather than looking at all forty-eight buildings, we only want to collect a sample of size 6
(i.e. look at only 6 buildings).
The diagram below shows an aerial view of the park where each quadrilateral (four-sided figure)
represents a building and each square represents 1000 square feet. For example building #2 is
represented by 4 squares therefore the area of that building is 4000 square feet.
[Diagram: aerial view of the industrial park drawn on a square grid, with the forty-eight buildings numbered 1 to 48.]
We will collect five samples of size six from this population.
In the first case, you will look at the population and then select the six buildings that you believe
best represent the population. This will be called our non-random sample. Record the six
building numbers and the corresponding areas (in square feet). After doing this, determine the
mean building area and the standard deviation for our sample of size 6.
Non-Random Sample
Building Number
Area (square feet)
Mean (x̄) = _________
Sample Standard Deviation (Sx) = ________
To collect the other four samples of size six, we will use the random number generator on the
graphing calculator. The calculator will be instructed to generate six random integers between 1
and 48, where each number generated is a building number. This can be accomplished using the
following commands.
MATH > PRB > randInt( > Enter the lower limit, the upper limit, and the sample size (all separated by commas).
Record the six building numbers and the corresponding areas (in square feet). After doing this,
determine the mean building area and the standard deviation for the sample of size 6. Repeat this
procedure three more times to generate random samples #2, #3, and #4.
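For learners without a graphing calculator, the same selection can be sketched in Python. This is only an illustration: the function name draw_building_numbers and the six areas listed are assumptions, since the real areas must be read from the diagram. Like randInt(, this sketch can produce the same building number more than once.

```python
import random
import statistics


def draw_building_numbers(sample_size=6, lowest=1, highest=48, seed=None):
    """Mimic randInt(lowest, highest, sample_size); repeats are possible."""
    rng = random.Random(seed)
    return [rng.randint(lowest, highest) for _ in range(sample_size)]


building_numbers = draw_building_numbers(seed=2010)
print("Buildings selected:", building_numbers)

# Hypothetical areas (square feet) standing in for values read off the diagram.
areas = [4000, 2000, 6000, 1000, 3000, 2000]
print("Sample mean:", statistics.mean(areas))
print("Sample standard deviation (Sx):", round(statistics.stdev(areas), 1))
```

If repeated building numbers are not wanted, random.sample(range(1, 49), 6) draws six different numbers instead.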
Random Sample #1
Building Number
Area (square feet)
Mean (x̄) = _________
Sample Standard Deviation (Sx) = ________
Random Sample #2
Building Number
Area (square feet)
Mean (x̄) = _________
Sample Standard Deviation (Sx) = ________
Random Sample #3
Building Number
Area (square feet)
Mean (x̄) = _________
Sample Standard Deviation (Sx) = ________
Random Sample #4
Building Number
Area (square feet)
Mean (x̄) = _________
Sample Standard Deviation (Sx) = ________
Conclusions:
The population mean (µ) and population standard deviation (σ) for the areas of these forty-eight buildings have already been worked out.
µ = 3146
σ = 2993
Look at the five sample means and five sample standard deviations you previously worked out.
(a) How do these sample means and standard deviations compare to the population mean and
standard deviation?
(b) How do the results from your non-random sample compare to those from your four random
samples? What can you conclude?
In this activity, you collected five samples (one non-random and four random), each with a
sample size of 6 (i.e. n = 6). The sample size refers to the number of data points in the sample.
Questions:
1. Explain why each of the following would likely produce a biased sample. For some there is
more than one reason.
(a) David will conduct a survey regarding violence in the media. He will randomly select
people who are attending an ultimate fighter competition at the local arena and ask them
to complete the survey.
(b) Genevieve wants to know how much money the average woman spends monthly on
clothing. She will conduct the survey at Mic Mac Mall. She approaches people who she
feels will likely answer her survey questions. If they agree to participate in the survey,
then she asks them the questions.
(c) A television talent show asks viewers to phone in their vote(s) for their favorite
contestant. The telephone lines are only open for four hours and viewers can vote as
many as six times.
(d) Kendrick wants to know how members of his community feel about the new gun registry
law. He leaves survey sheets on a counter at the local hardware store. There is also a
sign asking interested individuals to complete the survey.
(e) Robert wants to know if Canadians still support the military action in Afghanistan. He
conducts a phone survey where he asks 200 randomly selected adults the following
question. “Considering the number of deaths and injuries of Canadian soldiers, and
persistent allegations of prisoner abuse by local Afghan authorities, should Canadian
soldiers remain in Afghanistan?”
Sampling Methods
Preferred Sampling Methods
The sampling methods listed below are considered preferred sampling methods because these
methods have a greater chance of being unbiased. All four of these are forms of random
sampling.
1. Simple Random Sample
A simple random sample is a sample selected in such a manner that every sample of size n
has the same chance of being selected.
For example, suppose we put twenty different names into a hat, stirred the contents, and
without looking retrieved the following four names.
Barb, Elliot, Brian, Dave
Suppose those names were returned to the hat, the contents stirred, and now the following
four names were drawn.
Floyd, Joan, Manish, Krys
Suppose that process was repeated and the following four names were obtained.
Suzette, Charlie, Jeff, Elliot
We collected three samples of size four. All three of these samples, along with all other
combinations of four names, have the same probability of being drawn. The Barb, Elliot,
Brian, Dave combination has no greater chance of being selected than the Floyd, Joan,
Manish, Krys combination or any other four name combination. For this reason it is a simple
random sample.
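The hat can be imitated with a short Python sketch. Eleven of the names below appear in the example above; the nine "Person" entries are hypothetical placeholders added only to bring the list up to twenty names.

```python
import math
import random

# Eleven names from the example, plus nine hypothetical placeholders.
names = ["Barb", "Elliot", "Brian", "Dave", "Floyd", "Joan",
         "Manish", "Krys", "Suzette", "Charlie", "Jeff"]
names += [f"Person {i}" for i in range(12, 21)]

rng = random.Random(2010)  # fixed seed so the run can be repeated
for draw in range(1, 4):
    # Each draw starts again from the full hat of twenty names.
    print(f"Draw {draw}:", rng.sample(names, 4))

# Number of different four-name samples, each equally likely.
print("Possible samples of size four:", math.comb(len(names), 4))  # 4845
```

Because every one of the 4845 possible four-name combinations has the same chance of being drawn, the procedure matches the definition of a simple random sample.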
2. Cluster Sample
A cluster sample is used when the available sampling units are groups of elements, or
clusters. One or more clusters are randomly selected and then every element in that cluster is
included in the sample.
For example, suppose Tim Hortons wanted to know how much, on average, each person spent
in its Toronto establishments between the hours of 7 am and 9 am. Conducting a survey in
its hundreds of establishments scattered across the city would be costly. Instead, the company
could randomly select four of its establishments and record how much each patron spent at
those four establishments between the hours of 7 am and 9 am. Because four clusters are
randomly selected and every element (patron) in those clusters is included in the survey, this is
considered a cluster sample.
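The two stages of a cluster sample (randomly pick the clusters, then keep every element in them) can be sketched in Python. Everything in the data below is hypothetical: the ten stores, the number of patrons, and the amounts spent are generated only so the sketch has something to work with.

```python
import random


def cluster_sample(clusters, number_of_clusters, seed=None):
    """Randomly choose whole clusters, then keep every element in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), number_of_clusters)
    sample = [amount for name in chosen for amount in clusters[name]]
    return chosen, sample


# Hypothetical data: ten stores, each with 5 to 10 patrons and the amount
# (in dollars) each patron spent between 7 am and 9 am.
data_rng = random.Random(1)
stores = {
    f"Store {i}": [round(data_rng.uniform(1.50, 12.00), 2)
                   for _ in range(data_rng.randint(5, 10))]
    for i in range(1, 11)
}

chosen, amounts = cluster_sample(stores, 4, seed=2010)
print("Stores selected:", chosen)
print("Patrons in the sample:", len(amounts))
print("Mean amount spent:", round(sum(amounts) / len(amounts), 2))
```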
3. Stratified Random Sample
With a stratified random sample, one takes a simple random sample from each of a given
number of subpopulations, or strata.
For example, suppose the federal party in power wanted to know how Canadians felt about gun
registration. If they conducted a simple random survey of size 1000, they would not be
certain that every province or territory was fairly represented in the survey. It is possible that
one province is over-represented and another is under-represented. To alleviate this problem,
they could use a stratified random sample where each province and territory (a stratum) is
proportionally represented. So if one province accounted for 20% of the eligible voters in
Canada, then the survey would ensure that 200 of the 1000 randomly selected eligible voters
came from that province. If another province only accounted for 7% of the eligible voters in
Canada, then the survey would ensure that only 70 of the 1000 randomly selected eligible
voters came from this particular province.
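The proportional allocation described above can be sketched in Python. The province labels, the voter ID numbers, and the function names are all hypothetical; only the percentages (20% and 7%) and the sample size of 1000 come from the example.

```python
import random


def proportional_allocation(percent_of_population, total_sample_size):
    """Number of people to survey in each stratum."""
    return {name: round(total_sample_size * percent / 100)
            for name, percent in percent_of_population.items()}


def stratified_sample(strata, allocation, seed=None):
    """Take a simple random sample of the allocated size from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(strata[name], allocation[name])
            for name in allocation}


# Only two strata are shown; a real survey would list every province and territory.
percents = {"Province A": 20, "Province B": 7}
allocation = proportional_allocation(percents, 1000)
print(allocation)  # {'Province A': 200, 'Province B': 70}

# Hypothetical voter ID numbers for each stratum.
voters = {"Province A": list(range(1, 20001)),
          "Province B": list(range(20001, 27001))}
sample = stratified_sample(voters, allocation, seed=2010)
print({name: len(ids) for name, ids in sample.items()})
```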
4. Systematic Sample
A systematic sample is chosen according to a formula or rule.
For example, suppose you wanted to use the names listed in a phone book to conduct a
telephone survey within a small community. You might decide to contact every 200th person
in the book but you need to know where you should start on the list. You could use a
random number generator to select a number between 1 and 200. Suppose it produced the
number 67. That means the first five people selected would be the 67th, 267th, 467th, 667th,
and 867th in the telephone book. This is an example of a systematic sample because the rule
required that we interview every 200th person. To increase the likelihood that the sample
would be unbiased, random numbers were used to select the starting point on the list.
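The rule "start at a random position, then take every 200th person" can be written as a short Python sketch. The function name systematic_positions is an assumption made for this illustration.

```python
import random


def systematic_positions(interval, how_many, start=None, seed=None):
    """Positions on the list: a starting point, then every interval-th entry."""
    if start is None:
        # Random starting point between 1 and the interval, as in the example.
        start = random.Random(seed).randint(1, interval)
    return [start + i * interval for i in range(how_many)]


# Starting point 67, every 200th person, first five selections.
print(systematic_positions(200, 5, start=67))  # [67, 267, 467, 667, 867]
```

Leaving out the start argument lets the program choose the starting point at random, which is the step that helps keep the sample unbiased.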
Poor Sampling Methods
The two sampling methods listed below are not random sampling methods and are generally
biased. The results obtained from these samples are generally not representative of the
population.
1. Voluntary Response Sample
Participants are not selected; rather through their own actions they choose to participate in
the survey. The most common forms are call-in polls and online voting.
The most familiar example of this poor sampling method occurs on the television show
American Idol. Audience members are encouraged to phone or text their vote in for their
favorite performer. Some audience members will choose not to participate while others may
vote multiple times.
2. Convenience Sample
A convenience sample is chosen based on convenient availability.
For example, suppose an individual wanted to know how drivers feel about recent changes to
vehicle inspections proposed by the provincial government. The individual decides to
conduct his survey at a local shopping mall close to his residence. He approaches individuals
who he feels might be willing to participate in his survey.
If you wish further clarification go to the Mathematics Multimedia Learning Objects (see page
iv), access Unit 11-5 Statistics, and view MLO5 Identifying Types of Samples.
Questions
1. Identify the sampling method used. Both preferred and poor sampling methods are found
below.
(a) A community organization wanted to use a sample to infer how much parents of
elementary school children were spending this September on each child’s school supplies
in their school district. Each child has a five digit school identification number. The
organization placed the numbers 0 through 99 in a hat. They drew the number 28. Based
on this they asked any parent whose child’s ID number ended with the digits 2 and 8 to
participate in the survey.
Method: _______________________________________
(b) Asra wants to know how her fellow employees feel about the company’s new medical
plan. She leaves the survey sheets on a table in the company cafeteria. There is also a
sign that asks interested individuals to complete the survey.
Method: _______________________________________
(c) Jack received twelve baskets of apples from a farmer to sell at the local market. He
wanted to use a sample to infer the average weight of the apples he was selling. He
numbered the baskets 1 through 12, rolled a twelve-sided die, obtained a 9, weighed
every apple in basket 9, and worked out the average weight of those apples.
Method: _______________________________________
(d) Montez wants to know if people feel that cable companies should have to pay local
television stations when they rebroadcast their signals. Since he owns the local gas
station, he decides to conduct the survey there. He asks every customer to
complete the survey.
Method: _______________________________________
(e) Jorell’s company is giving away a one week all-inclusive vacation package to one of its
employees. The thirty employees fill out a ballot, the ballots are placed in a bucket, the
contents are stirred, and a ballot is drawn in order to determine the winner.
Method: _______________________________________
(f) A new reality show, So You Think You Can Yodel, asks its television audience to vote
online for their favorite performer.
Method: _______________________________________
(g) Kimi is in charge of the corporate headquarters for a large company. There are 1000
female employees and 500 male employees on the premises. She has decided to build an
employee wellness centre stocked with gym equipment on the premises. In order to
ensure that she buys the appropriate equipment for the employees, she surveys a sample
of size 120, asking respondents about their gym equipment preferences. She randomly
selects 80 of the 1000 females and 40 of the 500 males to complete the survey.
Method: _______________________________________
(h) The Metro Center is hosting an Ultimate Fighting Challenge competition. Following the
event the promoters wanted to know how the audience felt about the competition. The
Center is divided into 43 sections. They randomly selected four numbers between 1 and
43, and asked all the individuals in those four sections to complete a questionnaire.
Method: _______________________________________
(i) Ranelda is planning a trip to the Dominican Republic and wants to know how travelers felt
about the all-inclusive resort she is considering. She decides to go to TripAdvisor.com
where she can read reviews posted by other travelers regarding this resort.
Method: _______________________________________
(j) The owner of a large car dealership wants to know if her customers were satisfied with
the purchase of their vehicles. When each customer’s final paper work comes across the
owner’s desk, she rolls a six-sided die. If the number 1 or 2 is rolled, then the customer
is contacted two weeks later to complete a brief telephone survey.
Method: _______________________________________
(k) A Federal politician wants to know how his constituents in 12 different districts feel
about the new tax increases. A random sample is selected in such a manner that there is
proportional representation from each of the districts.
Method: _______________________________________
2. In question 1, which surveys would likely result in biased results?
3. Suppose you worked for the Nova Scotia Department of Education and you were in charge
of determining how well grade 12 students did on the last provincial math exam. When these
exams are distributed to the schools they are numbered, starting at 80000 and going to
82500. You could ask all 44 public and private schools to send in the 2500 corrected
exams, review them all, and report the results. This would be a very time-consuming and
costly endeavor. Instead you decide to collect a sample of size 500, review those exams, and
report the results. Please note that not all high schools in our province are of the same size.
Larger schools have graduating classes exceeding 300, while smaller schools may have as few
as 10 graduates. Explain how all four preferred sampling methods could be used to collect
this sample.
(a) Simple Random Sample
(b) Cluster Sample
(c) Stratified Random Sample
(d) Systematic Sample
Important Note:
In the remaining sections of this unit, we will only be concerned with data collected from simple
random samples. If we were to consider the other preferred sampling techniques, then we would
have to learn how to use a wider range of statistical tools. That is beyond the scope of this
course.
Simulated Sampling
In the next few sections, we will examine how a sample is used to make inferences about a
population. Ultimately this will lead us to confidence intervals but this will be a gradual process.
To understand the relationship between a sample and a population, we will have to start with a
known population. This is a population where we know the population mean and population
standard deviation. This probably seems a little backwards. Why would we want to collect a
sample if we already understand the population itself? We will be taking this approach so that
we can ultimately see how the concept of a confidence interval was developed.
Our Known Population
In the next three sections of this unit, we will work with the same known population when we are
conducting simulations or providing explanations. Suppose the federal government had tested
the air quality of every household residence (houses, apartments, mobile
homes, etc.) in Canada. They specifically looked at the concentration of a
specific airborne contaminant. This concentration was measured in
micrograms per cubic metre (µg/m³). Suppose they had found that the results
followed a normal distribution with a population mean of 412 µg/m³ and a
standard deviation of 38 µg/m³.
Investigation
We will simulate the collection of a large sample (sample size 40) from our known population.
We will then determine the mean for that sample. We will then simulate the collection of three
more samples of the same size from the same population and work out their corresponding
means. All of this will be done using a graphing calculator.
Sample 1
We will use the randNorm( command on the graphing calculator to simulate the collection of a
large sample from our known population. This command generates and displays one or more
random numbers from a specific normal distribution. For this reason we must also enter the
population mean ( µ ), standard deviation ( σ ), and sample size (n).
MATH > PRB > randNorm( > Enter 412, 38, and 40, all separated by commas. Close the brackets. > STO → > L1 > ENTER
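The same simulation can be sketched in Python for anyone working without a graphing calculator. This is an illustration only, not a replacement for the calculator steps. The function name simulate_sample and the seed value are assumptions.

```python
import random
import statistics


def simulate_sample(mu=412, sigma=38, n=40, seed=None):
    """Mimic randNorm(mu, sigma, n): n random values from a normal population."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]


sample = simulate_sample(seed=2010)  # plays the role of List 1
print("First five data points:", [round(x, 1) for x in sample[:5]])
print("Last five data points:", [round(x, 1) for x in sample[-5:]])

# Equivalent of mean(L1) on the calculator.
sample_mean = statistics.mean(sample)
print("Sample mean:", round(sample_mean, 1))
```

Changing or removing the seed produces a different sample each time, just as repeating the calculator command does.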
Go to List 1 and record the first and last five data points in the table below.
In terms of this situation, what does the first data point in the table represent?
Rather than using the 1-Var Stats command to determine the sample mean, we will use the
mean( command embedded in the LIST commands. The reason for this alternate approach will
become more apparent as we work through this section and the next.
LIST > MATH > mean( > L1 > ENTER
Record the sample mean (x̄).
Sample Mean = ___________
Sample 2
Follow the same procedure to simulate the collection of another sample of the same size. Record
the first and last five data points from L1 in the table below. Also determine the sample mean.
Sample Mean = ___________
Do the first and last five data points for Sample 2 match with the first and last five data points for
Sample 1? Why is this?
Sample 3
In this case, we will not use the randNorm( and mean( commands separately. We will combine
them such that we can obtain our sample mean in one step. Use the command shown below.
mean(randNorm(412, 38, 40))
Record the sample mean (x̄).
Sample Mean = ___________
Sample 4
Repeat the procedure used for Sample 3 to simulate the collection of our fourth sample. Record
the sample mean.
Sample Mean = ___________
Questions
1. In terms of this situation, what does the mean for Sample 1 represent?
2. Are the four sample means equal to each other? Why is this?
3. Are the sample means close to the expected value? Explain.
4. Given a specific situation and population, which one of these four statements is correct?
Explain how you arrived at your answer making reference to our simulations on the last two
pages.
(a) The population mean is random and the sample mean is fixed.
(b) The population mean is random and the sample mean is random.
(c) The population mean is fixed and the sample mean is fixed.
(d) The population mean is fixed and the sample mean is random.
Explanation:
5. In our simulation, we collected four samples and determined four sample means for our
known population. Suppose we were dealing with an unknown population. If this was the
case, would we be able to determine which of the sample means was closest to the
population mean?
YES or NO
Sampling Distribution of the Sample Means
Up to this point when we have looked at distributions, primarily normal distributions, we were
looking at how individual data points, x, were spread relative to the mean. For this reason the
horizontal axis on these distributions was labeled with the symbol x.
[Figure: bell-shaped distribution of individual data points; horizontal axis labeled x]
In this section, we will not be looking at the spread of individual data points for a known
population. Instead, we will look at the distribution of sample means (x̄) for a known
population. That means we could repeatedly take samples of the same size from our known
population, work out the sample means, and look at the distribution of sample means. This type
of distribution is called the sampling distribution of the sample means. As we learned in the
last section, sample means are random, but is there a pattern to all those sample means that we
can exploit? Yes, and one is that such distributions are normal as shown below. Notice that the
horizontal axis is labeled x̄, indicating that we are looking at the distribution of sample means,
not individual data points. There are two other important properties that will be discovered in
the next investigation.
[Figure: bell-shaped sampling distribution of the sample means; horizontal axis labeled x̄]
Investigation
A true sampling distribution of the sample means is the distribution of all possible values of the
sample means that result when random samples of the same size are drawn from the same
population. This means that we are taking all possible samples of size n from this population,
working out the sample means, and looking at the distribution of those means. The mathematics
required to create a true sampling distribution is beyond this course; however, we can use a
graphing calculator to simulate the collection of data to generate a rough approximation of the
sampling distribution of the sample means.
We will continue to work with the scenario involving the airborne contaminant
in Canadian households. Remember that for this known population we have a
mean of 412 and standard deviation of 38. We will simulate the collection of
100 samples of size 40 and calculate the corresponding 100 sample means. A
frequency distribution of these 100 sample means will serve as our rough
approximation of the sampling distribution of the sample means.
Step 1
In the last section we used the commands mean(randNorm(412,38,40)) to simulate the collection
of one sample of size 40 and then work out the sample mean. In this investigation, we want to
do this 100 times so that we end up with 100 sample means. This will be accomplished by
adding the seq( command. This command is found under LIST and then accessing OPS.
seq(mean(randNorm(412,38,40)),A,1,100) → L1
Press ENTER to activate the command. You should see a small
scrolling line in the upper right-hand corner of the screen. This
indicates that the calculator is busy working on your task. It will take
the calculator 4 to 5 minutes to complete the command.
(i) Why do you think it takes so long for the calculator to complete the command?
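As a rough Python counterpart to the seq( command, the sketch below repeats the "collect a sample, work out its mean" step 100 times and then sorts the results. The function name simulate_sample_means and the seed value are assumptions made for this illustration.

```python
import random
import statistics


def simulate_sample_means(mu=412, sigma=38, n=40, repeats=100, seed=None):
    """Collect `repeats` simulated samples of size n and return their means."""
    rng = random.Random(seed)
    means = []
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.mean(sample))
    return means


# sorted() plays the role of SortA(L1) in Step 2.
sample_means = sorted(simulate_sample_means(seed=2010))
print("Five smallest sample means:", [round(m, 1) for m in sample_means[:5]])
print("Five largest sample means:", [round(m, 1) for m in sample_means[-5:]])
```

Each of the 100 values in the list is the mean of 40 simulated data points, so 4000 random numbers are generated in total.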
Step 2
Rearrange the data in List 1 from smallest to largest (i.e. ascending order). Use the SortA(
command found by pressing the STAT button. Record the first and last five values in the newly
sorted List 1.
(ii) In terms of this situation, what does the first value in your table represent?
Step 3
Use the data in List 1 to construct a histogram. Use the classes shown in the table below.
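Class        Frequency
394 to 397   ________
397 to 400   ________
400 to 403   ________
403 to 406   ________
406 to 409   ________
409 to 412   ________
412 to 415   ________
415 to 418   ________
418 to 421   ________
421 to 424   ________
424 to 427   ________
427 to 430   ________
430 to 433   ________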
(iii) How would we describe this distribution (uniform, skewed, bimodal, normal)?
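The tallying in Step 3 can also be sketched in Python. The sketch regenerates 100 simulated sample means so that it runs on its own, then counts how many fall in each class. Two choices here are assumptions: a value that lands exactly on a class boundary is counted in the upper class, and values outside 394 to 433 are reported separately.

```python
import random
import statistics


def class_frequencies(values, start=394, width=3, number_of_classes=13):
    """Count the values falling in each class; also count values outside."""
    counts = [0] * number_of_classes
    outside = 0
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < number_of_classes:
            counts[index] += 1
        else:
            outside += 1
    return counts, outside


# 100 simulated sample means (samples of size 40 from our known population).
rng = random.Random(2010)
sample_means = [statistics.mean([rng.gauss(412, 38) for _ in range(40)])
                for _ in range(100)]

counts, outside = class_frequencies(sample_means)
for i, count in enumerate(counts):
    low = 394 + 3 * i
    print(f"{low} to {low + 3}: {'*' * count}")
print("Outside the classes:", outside)
```

The rows of asterisks form a sideways histogram, which is enough to judge the overall shape of the distribution.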
Step 4
Use the 1-Var Stats command to determine the mean and the standard deviation for the data in
List 1. Please note that the calculator does not know that the 100 values in List 1 are sample
means; therefore, it reports the mean as x̄ and the standard deviation as Sx. Although the
calculator will spit out the correct numbers, it does not report them using the correct symbols.
We will learn what the correct symbols should be in the next section of this unit.
Mean of the 100 Sample Means = __________
Standard Deviation of the 100 Sample Means = __________
Step 5
We will use the calculator to generate another 100 sample means. The only difference is that we
will use samples of size 60, rather than size 40. Enter the following command into the calculator
and give it 4 to 5 minutes to complete the task.
seq(mean(randNorm(412,38,60)),A,1,100) → L1
In this case we are still looking at data that would produce a rough approximation of the
sampling distribution of the sample means. We will not bother sorting the data or drawing a
histogram. We will, however, use the 1-Var Stats command to determine our mean and standard
deviation.
Mean of the 100 Sample Means = __________
Standard Deviation of the 100 Sample Means = __________
Questions
1. (a) What is the population mean in this situation?
_________
(b) When we collected 100 random samples of size 40 from our known population, what did
we obtain for the mean of the 100 sample means?
_________
(c) When we collected 100 random samples of size 60 from our known population, what did
we obtain for the mean of the 100 sample means?
_________
(d) Is there a relationship between the population mean and the two means of the sample
means? Explain? Why do you think this is?
2. (a) What is the population standard deviation in this situation?
_________
(b) When we collected 100 random samples of size 40 from our known population, what did
we obtain for the standard deviation of the 100 sample means?
_________
(c) When we collected 100 random samples of size 60 from our known population, what did
we obtain for the standard deviation of the 100 sample means?
_________
(d) Based on our answers for (a), (b) and (c), we can see that the population standard
deviation is not equal to either of the standard deviations of the sample means. There is,
however, a relationship between these standard deviations. Take the population standard
deviation and divide it by the square root of the sample size. This will have to be done
twice since we were working with two different sample sizes (n = 40 and n = 60).
$\dfrac{\text{population standard deviation}}{\sqrt{\text{sample size}}}$
For the 100 samples of size 40
$\dfrac{\text{population standard deviation}}{\sqrt{\text{sample size}}}$ = __________ = __________
For the 100 samples of size 60
$\dfrac{\text{population standard deviation}}{\sqrt{\text{sample size}}}$ = __________ = __________
What are these two values we just calculated approximately equal to?
Through this investigation, we have discovered three important properties of the sampling
distribution of the sample means. These three properties will be discussed in the next section of
the unit titled the Central Limit Theorem.
Central Limit Theorem
In the last section, we examined the sampling distribution of the sample means. This type of
distribution is created by repeatedly taking samples of the same size from a known population,
working out the sample means, and looking at the distribution of sample means.
Although we were unable to examine a true sampling distribution of the sample means, we were
able to use a graphing calculator to generate a rough approximation of the sampling distribution
of the sample means. Through a simulation we discovered three important properties of the
sampling distribution.
Three Properties
1. The sampling distribution of the sample means is approximately normal (i.e. bell-shaped).
2. The mean of the sample means is equal to the population mean.
Mean of the Sample Means = Population Mean
3. The standard deviation of the sample means is equal to the population standard deviation
divided by the square root of the sample size.
Standard Deviation of the Sample Means = $\dfrac{\text{Population Standard Deviation}}{\sqrt{\text{Sample Size}}}$
All of this can be restated using the appropriate notation. It is referred to as the Central Limit
Theorem.
The Central Limit Theorem states the following.
• If random samples of size n are repeatedly drawn from any population with a finite mean
and standard deviation, then the resulting sampling distribution of the sample means (x̄)
is approximately normal when n is large (i.e. n ≥ 30).
• The mean of the sample means is equal to the population mean.
$\mu_{\bar{x}} = \mu$ ($\mu_{\bar{x}}$ is pronounced “mu subscript x bar”)
• The standard deviation of the sample means is equal to the population standard deviation
divided by the square root of the sample size.
$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$ ($\sigma_{\bar{x}}$ is pronounced “sigma subscript x bar”)
[Figure: bell-shaped sampling distribution of the sample means, centred at $\mu_{\bar{x}} = \mu$, with $\sigma_{\bar{x}}$ marked on either side of the centre; horizontal axis labeled x̄]
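The theorem refers to "any population", so it can be interesting to try a population that is clearly not bell-shaped. The Python sketch below is an illustration only. It assumes a hypothetical right-skewed (exponential) population with mean 10 and standard deviation 10, draws 2000 samples of size 40, and compares the results with the three statements above.

```python
import math
import random
import statistics

# Hypothetical skewed population: exponential, so its standard deviation
# equals its mean.
POPULATION_MEAN = 10.0
POPULATION_SD = 10.0


def sample_means(n, repeats, seed=None):
    """Means of `repeats` random samples of size n from the skewed population."""
    rng = random.Random(seed)
    return [statistics.mean([rng.expovariate(1 / POPULATION_MEAN) for _ in range(n)])
            for _ in range(repeats)]


n = 40
means = sample_means(n, repeats=2000, seed=2010)

mean_of_means = statistics.mean(means)
sd_of_means = statistics.stdev(means)
predicted_sd = POPULATION_SD / math.sqrt(n)

# Fraction of sample means within one predicted standard deviation of the
# population mean; for a bell-shaped distribution this is about 68%.
within_one = sum(abs(m - POPULATION_MEAN) <= predicted_sd for m in means) / len(means)

print("Mean of the sample means:", round(mean_of_means, 2))
print("Standard deviation of the sample means:", round(sd_of_means, 2))
print("Population SD / sqrt(n):", round(predicted_sd, 2))
print("Fraction within one standard deviation:", round(within_one, 2))
```

Because only 2000 samples are drawn, the printed values should be close to, but not exactly equal to, the values the theorem predicts.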
Applying the 68-95-99.7 rule to the sampling distribution of the sample means, we can say that:
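• 68% of the sample means are between $\mu_{\bar{x}} - \sigma_{\bar{x}}$ and $\mu_{\bar{x}} + \sigma_{\bar{x}}$
Or
68% of the sample means are between $\mu - \dfrac{\sigma}{\sqrt{n}}$ and $\mu + \dfrac{\sigma}{\sqrt{n}}$
• 95% of the sample means are between $\mu_{\bar{x}} - 2\sigma_{\bar{x}}$ and $\mu_{\bar{x}} + 2\sigma_{\bar{x}}$
Or
95% of the sample means are between $\mu - 2\dfrac{\sigma}{\sqrt{n}}$ and $\mu + 2\dfrac{\sigma}{\sqrt{n}}$
• 99.7% of the sample means are between $\mu_{\bar{x}} - 3\sigma_{\bar{x}}$ and $\mu_{\bar{x}} + 3\sigma_{\bar{x}}$
Or
99.7% of the sample means are between $\mu - 3\dfrac{\sigma}{\sqrt{n}}$ and $\mu + 3\dfrac{\sigma}{\sqrt{n}}$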
Example 1
Random samples of size 50 are repeatedly drawn from a known population whose population
mean is 78 and population standard deviation is 12. This information is used to construct a
sampling distribution of the sample means.
(a) Describe the shape of the resulting distribution.
(b) Where is the sampling distribution of the sample means centred?
(c) What is the standard deviation of the sample means?
(d) Between what two values would one expect 68% of the sample means to fall?
(e) Between what two values would one expect 95% of the sample means to fall?
Answers:
We are not dealing with a single sample because the question stated that we are repeatedly
collecting samples of the same size. As stated, we are dealing with the sampling distribution
of the sample means. For this reason, this question expects that we understand the Central
Limit Theorem.
(a) According to the Central Limit Theorem, the sampling distribution of the sample means
will be bell-shaped.
(b) The sampling distribution of the sample means will be centred about the population mean
of 78.
(c) The standard deviation of the sample means must be calculated.
$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
$\sigma_{\bar{x}} = \dfrac{12}{\sqrt{50}}$
$\sigma_{\bar{x}} = 1.70$
(d) We know that for a sampling distribution of the sample means 68% of the sample means
are between $\mu - \dfrac{\sigma}{\sqrt{n}}$ and $\mu + \dfrac{\sigma}{\sqrt{n}}$.
$\mu - \dfrac{\sigma}{\sqrt{n}} = 78 - 1.70 = 76.3$
$\mu + \dfrac{\sigma}{\sqrt{n}} = 78 + 1.70 = 79.7$
For this particular sampling distribution 68% of the sample means are between 76.3 and
79.7.
(e) We know that for a sampling distribution of the sample means 95% of the sample means
are between µ − 2
µ −2
σ
n
and µ + 2
σ
σ
µ+2
n
= 78 − 2(1.70 )
= 74.6
n
.
σ
n
= 78 + 2(1.70 )
= 81.4
For this particular sampling distribution 95% of the sample means are between 74.6 and
81.4.
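The arithmetic in Example 1 can be checked with a short Python sketch. The function name `sample_mean_interval` is our own and is not part of this unit; it simply applies the standard deviation of the sample means, σ/√n, together with the 68-95-99.7 rule.

```python
import math

def sample_mean_interval(mu, sigma, n, k):
    """Return (low, high) for mu plus or minus k standard deviations of the sample means."""
    sigma_xbar = sigma / math.sqrt(n)  # standard deviation of the sample means
    return mu - k * sigma_xbar, mu + k * sigma_xbar

mu, sigma, n = 78, 12, 50              # values from Example 1
sigma_xbar = sigma / math.sqrt(n)
print(round(sigma_xbar, 2))            # 1.7

low68, high68 = sample_mean_interval(mu, sigma, n, 1)
low95, high95 = sample_mean_interval(mu, sigma, n, 2)
print(round(low68, 1), round(high68, 1))   # 76.3 79.7
print(round(low95, 1), round(high95, 1))   # 74.6 81.4
```

Changing `k` to 3 would give the interval expected to hold 99.7% of the sample means.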
Example 2
A random sample of size 40 is taken from a known population where µ = 24.3 and σ = 4.1 .
The data points collected are shown in the chart below.
18.78  15.33  27.53  26.45  24.49  27.99  25.83  20.44
21.08  22.08  25.43  21.41  22.36  15.50  15.21  26.02
20.70  20.45  18.54  20.84  16.91  22.47  29.13  26.68
28.26  24.20  19.89  26.98  22.96  19.31  20.01  23.53
26.49  34.21  26.85  24.15  25.66  23.26  20.61  27.19
(a) What is the population mean?
(b) What is the population standard deviation?
(c) What is the sample mean? Is it close to the expected value? Explain.
(d) What is the sample standard deviation?
(e) If we collected 800 random samples of size 40 from this known population, could the
distribution of the 800 sample means serve as a rough approximation of the sampling
distribution of the sample means? Explain.
(f) If we collected 800 random samples of size 40 from this known population, what would be
the approximate value of the mean of the sample means?
(g) If we collected 800 random samples of size 40 from this known population, what would be
the approximate value of the standard deviation of the sample means?
(h) Between what two values would one expect 544 of the 800 sample means to fall?
Answers:
The first four parts of this question have nothing to do with the Central Limit Theorem since
we are not repeatedly collecting random samples of the same size from a known population.
(a) The population mean ( µ ) is supplied in the question. It is 24.3.
(b) The population standard deviation ( σ ) is supplied in the question. It is 4.1.
(c) We were given a chart for a sample of size 40. We will enter the 40 data points into List 1
on the graphing calculator and use the 1-Var Stats command to determine the sample mean
($\bar{x}$) for this question and the sample standard deviation ($S_x$) for the next question.
The sample mean is 23.13. One would expect the sample mean (23.13) to be close to the
population mean (24.3). This is the case.
(d) The sample standard deviation is 4.16.
(e) A true sampling distribution of the sample means involves taking all possible samples of
the same size from the population. In this question we have limited ourselves to 800
samples of the same size (n = 40) but the resulting distribution of sample means will
serve as a rough approximation of the sampling distribution of the sample means. In the
previous section we completed an investigation where we simulated the collection of 100
samples from a known population to generate a rough approximation of the sampling
distribution of the sample means. We are doing the same thing in this question except
we are dealing with 800 random samples of the same size instead of 100 random samples
of the same size.
(f) Since we are dealing with a rough approximation of the sampling distribution of the
sample means, we can use the Central Limit Theorem. We learned that the mean of the
sample means is equal to the population mean.
$\mu_{\bar{x}} = \mu$
$\mu_{\bar{x}} = 24.3$
(g) We can use the Central Limit Theorem. The standard deviation of the sample means is
equal to the population standard deviation divided by the square root of the sample size.
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
$\sigma_{\bar{x}} = \frac{4.1}{\sqrt{40}}$
$\sigma_{\bar{x}} = 0.65$
(h) The number 544 is 68% of 800. That means that the question could be restated as
“Between what two values would one expect 68% of the sample means to fall?” We
know that for a sampling distribution of the sample means 68% of the sample means are
between $\mu - \frac{\sigma}{\sqrt{n}}$ and $\mu + \frac{\sigma}{\sqrt{n}}$.
$\mu - \frac{\sigma}{\sqrt{n}} = 24.3 - 0.65 = 23.65$
$\mu + \frac{\sigma}{\sqrt{n}} = 24.3 + 0.65 = 24.95$
For this particular rough approximation of the sampling distribution of the sample means,
544 of the 800 sample means will be between 23.65 and 24.95.
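Parts (c), (d), (f), and (g) of Example 2 can be explored with a short Python sketch. It computes the sample statistics with the standard `statistics` module (in place of the calculator's 1-Var Stats command) and then simulates the 800 samples described in part (e). The simulation assumes the population is normally distributed, which the example does not state; the seed and the variable names are also our own choices.

```python
import random
import statistics

data = [
    18.78, 15.33, 27.53, 26.45, 24.49, 27.99, 25.83, 20.44,
    21.08, 22.08, 25.43, 21.41, 22.36, 15.50, 15.21, 26.02,
    20.70, 20.45, 18.54, 20.84, 16.91, 22.47, 29.13, 26.68,
    28.26, 24.20, 19.89, 26.98, 22.96, 19.31, 20.01, 23.53,
    26.49, 34.21, 26.85, 24.15, 25.66, 23.26, 20.61, 27.19,
]

x_bar = statistics.mean(data)   # sample mean
s_x = statistics.stdev(data)    # sample standard deviation (n - 1 divisor)
print(round(x_bar, 2))          # 23.13
print(round(s_x, 2))

# Simulate 800 random samples of size 40 (assumption: normal population).
MU, SIGMA, N, SAMPLES = 24.3, 4.1, 40, 800
rng = random.Random(1)          # fixed seed so the run is repeatable
sample_means = []
for _ in range(SAMPLES):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    sample_means.append(statistics.mean(sample))

# These two values should land near 24.3 and near 4.1 / sqrt(40).
print(round(statistics.mean(sample_means), 2))
print(round(statistics.stdev(sample_means), 2))
```

The last two printed values only approximate the Central Limit Theorem results because 800 samples is a rough stand-in for all possible samples.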
Example 3
The mean height of men in a specific age group is 71 inches with a standard deviation of 2.3
inches. Let $\bar{x}$ be the sample mean height for a random sample of 30 men in this age group.
What is the mean value and standard deviation of the distribution of all possible $\bar{x}$’s?
Answer:
In this contextual problem, the relevant material is presented in a much more subtle
manner. The population mean ( µ = 71 ) and the population standard deviation ( σ = 2.3 )
have been supplied. We also know that we are dealing with the Central Limit Theorem
because the question is asking for the mean of the $\bar{x}$’s (i.e. mean of the sample means)
and the standard deviation of the $\bar{x}$’s (i.e. standard deviation of the sample means).
Mean of the Sample Means
$\mu_{\bar{x}} = \mu$
$\mu_{\bar{x}} = 71$
Standard Deviation of the Sample Means
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
$\sigma_{\bar{x}} = \frac{2.3}{\sqrt{30}}$
$\sigma_{\bar{x}} = 0.42$
Questions
1. Match each term with the appropriate symbol listed below.
Term                                            Symbol
(a) Sample Mean                                 ____
(b) Standard Deviation of the Sample Means      ____
(c) Sample Standard Deviation                   ____
(d) Population Mean                             ____
(e) Sample Size                                 ____
(f) Mean of the Sample Means                    ____
(g) Population Standard Deviation               ____
Symbols: $\mu$, $S_x$, $\mu_{\bar{x}}$, $\sigma$, $\bar{x}$, $\sigma_{\bar{x}}$, $n$
2. A random sample of size 45 is to be selected from a population with mean µ = 329 and
standard deviation σ = 27 .
(a) If samples of the same size are repeatedly collected from this population, what would the
mean of the sample means be equal to?
(b) If samples of the same size are repeatedly collected from this population, what would the
standard deviation of the sample means be equal to?
3. Random samples of size 60 are repeatedly selected from a known population with a mean of
87 and a standard deviation of 7.2. These repeatedly collected samples allow a sampling
distribution of the $\bar{x}$’s to be drawn.
(a) What type of distribution (uniform, bimodal, normal, or skewed) would result?
(b) Determine the mean of the sample means and indicate where it would be located on the
distribution of the $\bar{x}$’s.
(c) Determine the standard deviation of the sample means.
(d) What percentage of the sample means would be within one $\sigma_{\bar{x}}$ of the population mean?
(e) Between what two values would one expect 95% of the sample means to fall?
4. Researchers examined the speeds traveled by motorists on a specific section of a highway in
the month of August. The researchers found that the population mean was 106.2 km/h with a
population standard deviation of 4.1 km/h. We collect a random sample of 55 motorist
speeds from this known population. We then repeatedly collect samples of the same size
so that a sampling distribution of mean motorist speeds can be constructed. Where is the
resulting distribution centred and how much is it spread out about its centre?
5. The mean weight of baggage checked in by an individual adult passenger boarding a
domestic flight is 28.5 kg with a standard deviation of 5.0 kg. A sample of size 30 is taken
from this known population. The data points are shown below in the chart.
29.1  32.9  37.6  28.3  26.2  26.8  29.5  30.3  28.4  24.7
29.1  33.0  32.4  31.7  33.2  25.4  23.7  28.3  22.4  31.7
20.4  25.1  28.2  26.7  23.3  28.9  22.8  18.1  21.9  37.8
(a) What is the population mean?
(b) What is the population standard deviation?
(c) Determine the sample mean.
(d) What does the sample mean represent in this situation and is it close to the expected
value?
(e) Determine the sample standard deviation.
(f) If samples of the same size are repeatedly collected from this known population, what
would be the value of the standard deviation of the sample means?
(g) If samples of the same size are repeatedly collected from this known population, what
would be the value of the mean of the sample means?
(h) For the sampling distribution of $\bar{x}$’s, between what two values would one expect 68% of
the sample means to fall?
(i) For the sampling distribution of $\bar{x}$’s, between what two values would one expect 95% of
the sample means to fall?
6. Explain in your own words what the difference is between the sample standard deviation and
the standard deviation of the sample means.
7. Suppose we were to sample from a known population. In each of three cases, determine
which phrase would best describe the resulting distribution.
Answer
(a) 500 random samples of size 40 are collected from a known population
and 500 sample means are generated using these samples.
(b) A random sample of size 40 is collected from a known population
allowing a distribution of x’s to be drawn.
(c) Random samples of size 40 are repeatedly collected from a known
population allowing a distribution of $\bar{x}$’s to be drawn.
Choices:
(i) a distribution of data points
(ii) a sampling distribution of sample means
(iii) a rough approximation of the sampling distribution of the sample means
8. The Valley Apple Growing Association knows that the mean weight of a particular type of
apple grown in their county for sale in supermarkets is 86 grams with a standard deviation of
3.7 grams. Let $\bar{x}$ be the mean weight for a random sample of 52 apples.
(a) What is the mean value and standard deviation of the distribution of all possible $\bar{x}$’s?
(b) Between what two values would one expect 99.7% of the $\bar{x}$’s to fall?
9. Random samples of the same size are repeatedly collected from a known population with a
mean of 98.6 and a standard deviation of 10.8. Determine the mean and standard deviation
of the sampling distribution of all possible $\bar{x}$’s for each of the following sample sizes.
(a) n = 40
(b) n = 60
(c) n = 80
10. Look at the previous question. How does the standard deviation of the sample means change
as the sample size increases? Is this what we would expect? Explain.
11. Three sampling distributions of sample means have been created using the same known
population; however, three different sample sizes were used. One used repeatedly collected
random samples of size 30. The other two used repeatedly collected random samples of size
60 and 90.
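[Figure: three sampling distributions of sample means, labelled (i), (ii), and (iii), drawn on the same $\bar{x}$ axis]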
(a) What is the population mean for this known population?
(b) Match the three sampling distributions (i, ii, iii) to the appropriate sample sizes (30, 60, 90).
Briefly explain how you arrived at these answers.
12. Four different sampling distributions have been plotted on the same axes. Two of the
sampling distributions come from the same population; however, the sizes of repeatedly
collected samples differ. The same is true with the other two sampling distributions of the
sample means. Match the description with the appropriate distribution.
[Figure: four sampling distributions of sample means, labelled (i), (ii), (iii), and (iv), drawn on the same $\bar{x}$ axis]
(a) Population mean is 70 and repeatedly collected samples are of size 80.   Answer: ____
(b) Population mean is 40 and repeatedly collected samples are of size 80.   Answer: ____
(c) Population mean is 70 and repeatedly collected samples are of size 40.   Answer: ____
(d) Population mean is 40 and repeatedly collected samples are of size 40.   Answer: ____
Point Estimates and Interval Estimators
In the last three sections, we have learned the following.
• The population mean is fixed.
• The sample mean is random.
• If random samples of the same size are repeatedly collected from a known population, the
resulting sampling distribution of the sample means displays three distinct properties
defined by the Central Limit Theorem.
So what does this have to do with inferential statistics? In other words, how do we use this
information to help us make inferences about a population based on a sample?
In the section titled Simulated Sampling, we simulated the collection of four
samples of size 40 from a known population. In this case, the population mean
was 412 µg / m3 of contaminant in the air with a population standard deviation
of 38 µg / m3 . Here are the results another adult learner obtained when she
completed the activity.
Sample 1: $\bar{x} = 412.3$
Sample 2: $\bar{x} = 411.9$
Sample 3: $\bar{x} = 413.2$
Sample 4: $\bar{x} = 413.0$
Notice that all of these sample means are fairly close to the population mean (412 µg / m3 ). We
can use a sample mean obtained from one sample to represent a plausible value for the
population mean. A single sample mean is called a point estimate because this single value is
used as a plausible value for the population mean.
In inferential statistics, we prefer to report an interval of reasonable values based on a sample,
rather than a single plausible value (point estimate or sample mean). This interval of reasonable
values is called an interval estimator. The interval estimator of the population mean is called
the confidence interval. Associated with every confidence interval is a confidence level. The
confidence level indicates the level of assurance we have that the resulting confidence interval
encloses the population mean.
Example 1
Taylor works as a quality control officer at a compact fluorescent light bulb factory. She wants
to understand how long on average one of these light bulbs lasts. She randomly selects 40 new
bulbs off the assembly line, and tests them to see how long each will last. Rather than
reporting the mean lifespan of the 40 bulbs (i.e. sample mean/point estimate), she decides to
report the following confidence interval (i.e. interval estimator). She reports that the population
mean lifespan of this type of bulb is between 5880 hours and 6130 hours with 95% confidence.
What does this last sentence mean?
Answer:
Confidence intervals are constructed in a specific manner that we will learn about later in this
section. In this case, Taylor’s confidence interval is between 5880 hours and 6130 and has a
confidence level of 95%. The sentence means that the method that produced this interval
from 5880 to 6130 has a 0.95 probability of enclosing the true mean lifespan (i.e. population
mean) of these light bulbs. Therefore there is a 0.05 probability that this method does not
create an interval that encloses the true mean lifespan (i.e. population mean).
It does not mean that there is a 0.95 probability that the population mean falls within the
interval from 5880 to 6130. You are probably asking yourself how this sentence differs from
the one in the previous paragraph. It has to do with the fact that the population mean is fixed
and the sample mean (which a confidence interval is derived from) is random. The incorrect
meaning states that the “population mean falls within the interval.” This statement implies
that the population mean is random, rather than fixed. For this reason, the explanation is
wrong.
One way to visualize the correct meaning of a confidence interval is to think about a
parachutist trying to hit a target on the ground. The target, which is fixed to the ground, is
our population mean. The parachutist with the big parachute is the confidence interval. We
would like some portion of the parachute to hit the target, but there is a possibility that the
parachute might miss the target altogether. In the diagram below, the confidence interval
(width of the parachute) will enclose the population mean (the target).
[Figure: a parachute descending over a target. The width of the parachute represents the confidence interval and the target represents the population mean.]
In the diagram below we have three parachutes (three confidence intervals). Two of these
parachutes will enclose the target (population mean), but one will not.
[Figure: three parachutes (confidence intervals) descending toward the target (population mean); two cover the target and one misses it.]
When we are dealing with a 95% confidence level, we are saying that 95 out of 100
confidence intervals should enclose the population mean. Thinking about our parachuting
analogy, one would say that 95 out of 100 of the parachutes would enclose the target and 5 of
the 100 would not. The parachuting analogy makes sense when talking about a known
population where we know the population mean (i.e. the location of the target). In the real world,
we use the confidence interval as a set of plausible values for the population mean that may
enclose that unknown mean (i.e. we do not know the location of the target).
Example 2
The Department of Health randomly selected 200 males between the ages of 25 and 30. They
recorded the resting heart rate of these individuals. Rather than reporting the mean resting heart
rate, they reported the following. The population mean resting heart rate for males (ages 25
years to 30 years) is between 79 beats per minute and 83 beats per minute with 90% confidence.
Explain what is meant by the last sentence.
Answer:
The Department of Health is reporting a confidence interval that goes from 79 to 83. This
particular confidence interval was calculated with a 90% confidence level. They are stating
that the method used to construct their confidence interval has a 0.9 probability of enclosing
the population mean resting heart rate. There is a small probability (0.1) that this method
creates an interval that does not enclose the population mean.
Well, it is great that we know what a confidence interval is and understand what it means, but how
do we calculate it and what does it have to do with the Central Limit Theorem?
Developing the Confidence Interval
The development of the confidence interval is tied directly to our understanding of the sampling
distribution of the sample means and hence the Central Limit Theorem. For the sampling
distribution of the sample means, we learned that approximately 95% of the sample means
will be within 2 standard deviations of the population mean (more precisely, 1.96 standard
deviations of the population mean). In this case the standard deviation of the sample means is
defined as follows.
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Visually, 95% of the sample means are within the region on the diagram below.
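[Figure: the sampling distribution of the sample means, centred at $\mu$, with the region from $\mu - 1.96\frac{\sigma}{\sqrt{n}}$ to $\mu + 1.96\frac{\sigma}{\sqrt{n}}$ marked on the $\bar{x}$ axis]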
Get ready for it; here’s the big conceptual jump that you will have to think about.
If a single sample mean ($\bar{x}$) is within $1.96\frac{\sigma}{\sqrt{n}}$ of the population mean, then the interval from
$\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}$ to $\bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$ will enclose the population mean.
This can be seen in the diagram below. Three sample means within $1.96\frac{\sigma}{\sqrt{n}}$ of the
population mean have been drawn below the sampling distribution. We then went $1.96\frac{\sigma}{\sqrt{n}}$
to the left and right of each of these sample means to create our desired interval. Notice that all
three of these intervals enclose the population mean (i.e. cross the vertical line in the centre
representing the population mean, $\mu$).
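[Figure: the sampling distribution of the sample means, centred at $\mu$, with the region within $1.96\frac{\sigma}{\sqrt{n}}$ of $\mu$ marked. Below it, three sample means from that region are each shown with an interval from $\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}$ to $\bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$; all three intervals cross the vertical line at $\mu$.]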
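If, however, that sample mean is not within $1.96\frac{\sigma}{\sqrt{n}}$ of the population mean, then the resulting
interval constructed using that sample mean will not enclose the population mean. Such is the
case in the diagram below.
[Figure: the sampling distribution of the sample means, centred at $\mu$, with the region within $1.96\frac{\sigma}{\sqrt{n}}$ of $\mu$ marked. Below it, one sample mean from outside that region is shown with an interval from $\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}$ to $\bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$ that does not cross the vertical line at $\mu$.]
This interval from $\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}$ to $\bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$ is called a 95% confidence interval. This interval
is a range of plausible values for the population mean that may enclose the population mean. It
will enclose the population mean for 95% of the samples.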
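The conceptual jump in this section can also be written as a chain of equivalent inequalities. The following is a short algebraic sketch that uses the same symbols as above; it adds no new assumptions.

```latex
\begin{aligned}
\left|\bar{x} - \mu\right| \le 1.96\,\frac{\sigma}{\sqrt{n}}
&\iff -1.96\,\frac{\sigma}{\sqrt{n}} \le \mu - \bar{x} \le 1.96\,\frac{\sigma}{\sqrt{n}} \\
&\iff \bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}
\end{aligned}
```

The first statement says the sample mean is close to the population mean, which is true for about 95% of samples. The last statement says the interval built around the sample mean encloses the population mean, so it is also true for about 95% of samples.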
The Confidence Interval Formula
First Draft of the Formula:
When dealing with a random sample of size 30 or greater, the confidence interval based on a
sample mean is calculated using the following formula.
$\bar{x} \pm z\frac{\sigma}{\sqrt{n}}$
If we are calculating a 90% confidence interval then z equals 1.645.
If we are calculating a 95% confidence interval then z equals 1.96.
If we are calculating a 99% confidence interval then z equals 2.576.
This confidence interval formula requires that we know the population standard deviation ( σ ).
In the real world we use samples (and their resulting confidence intervals) to make inferences
about an unknown population. If it is an unknown population, then we will not know the population
standard deviation ( σ ). We need another approach.
If the sample is large ( n ≥ 30 ), then we can replace the population standard deviation with the
sample standard deviation.
Second (and Final) Draft of the Formula:
When dealing with a random sample of size 30 or greater, the confidence interval based on a
sample mean is calculated using the following formula.
$\bar{x} \pm z\frac{S_x}{\sqrt{n}}$
If we are calculating a 90% confidence interval then z equals 1.645.
If we are calculating a 95% confidence interval then z equals 1.96.
If we are calculating a 99% confidence interval then z equals 2.576.
This second and final draft of the confidence interval formula is the one we will use.
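The final formula can be sketched in Python as follows. The function names `z_value` and `confidence_interval` are our own, and the z values are computed from the standard normal distribution with `statistics.NormalDist` rather than typed in from a table. The numbers in the usage lines (a sample mean of 50, a sample standard deviation of 8, and a sample size of 64) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
from statistics import NormalDist

def z_value(level):
    """Critical z for a two-sided confidence level such as 0.95."""
    return NormalDist().inv_cdf(0.5 + level / 2)

def confidence_interval(x_bar, s_x, n, level):
    """Return (low, high) for x_bar plus or minus z * s_x / sqrt(n)."""
    margin = z_value(level) * s_x / math.sqrt(n)
    return x_bar - margin, x_bar + margin

for level in (0.90, 0.95, 0.99):
    print(level, round(z_value(level), 3))   # 1.645, 1.96, 2.576

# Hypothetical sample: mean 50, standard deviation 8, size 64.
low, high = confidence_interval(50, 8, 64, 0.95)
print(round(low, 2), round(high, 2))         # 48.04 51.96
```

Computing z this way avoids rounding slips, but a table value gives the same interval to the precision used in this unit.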
Example 3
Samir conducted a study where he examined the concentration of a particular airborne
contaminant in 250 randomly selected households from across Canada. The sample mean and
sample standard deviation were 413.2 µg / m3 and 40.5 µg / m3 respectively.
(a) Determine the 90% confidence interval.
(b) In question (a), did we generate an interval estimator or a point estimate?
(c) Explain what this confidence interval means.
(d) After completing his study Samir learns that the federal government had tested every
household for this particular airborne contaminant and found the population mean was
412 µg / m3 with a population standard deviation of 38 µg / m3 . Does the interval derived
from Samir’s sample enclose the population mean?
(e) If he collected 200 samples of size 40 and worked out 200 confidence intervals each with a
90% confidence level, how many would we expect to enclose the population mean?
(f) Suppose Samir took a sample of size 400 and the resulting mean and standard deviation were
411.7 µg / m3 and 36.4 µg / m3 respectively. Determine the 99% confidence interval and
state whether it encloses the population mean.
Answers:
(a) $\bar{x} \pm z\frac{S_x}{\sqrt{n}}$
$413.2 \pm 1.645\frac{40.5}{\sqrt{250}}$
$413.2 \pm 4.21$
From 408.99 to 417.41
(b) A confidence interval is an interval estimator.
(c) We are stating that the method used to construct the interval from 408.99 µg / m3 to
417.41 µg / m3 has a 0.9 probability of enclosing the true mean air contaminant level for
all households in Canada (i.e. population mean). There is a 0.1 probability that this
method does not create an interval that encloses the population mean.
(d) We were told that the population mean is 412 µg / m3 . The interval from 408.99 µg / m3
to 417.41 µg / m3 encloses the population mean.
(e) 90% of 200 = 180
We would expect that 180 out of the 200 confidence intervals would enclose the
population mean.
(f) $\bar{x} \pm z\frac{S_x}{\sqrt{n}}$
$411.7 \pm 2.576\frac{36.4}{\sqrt{400}}$
$411.7 \pm 4.69$
From 407.01 to 416.39 ← This interval encloses the population mean (412 µg / m3 ).
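Part (e) of Example 3 can be illustrated with a simulation. The sketch below draws 200 samples of size 40, builds a 90% confidence interval from each, and counts how many enclose the population mean of 412. It assumes the population is normally distributed, which the example does not state; the seed and variable names are our own choices.

```python
import math
import random
import statistics

MU, SIGMA = 412, 38             # population values given in part (d)
N, TRIALS, Z = 40, 200, 1.645   # sample size, number of samples, z for 90%

rng = random.Random(2010)       # fixed seed so the run is repeatable
enclosing = 0
for _ in range(TRIALS):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = statistics.mean(sample)
    s_x = statistics.stdev(sample)
    margin = Z * s_x / math.sqrt(N)
    if x_bar - margin <= MU <= x_bar + margin:
        enclosing += 1

print(enclosing, "of", TRIALS, "intervals enclose the population mean")
```

The count should be near 180 but will rarely equal it exactly, because each run is itself a random experiment.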
Example 4
Jamie and Angela each conduct a study where they record the weights of randomly selected 10
year old males. The weights in pounds for these two samples are recorded below.
Jamie’s Sample:
81.2   110.7  101.4  112.7  112.7  104.8
113.7  91.7   102.9  116.0  107.0  109.9
107.1  85.6   83.1   99.5   85.4   113.2
97.7   95.4   114.6  116.0  101.8  111.1
102.3  108.3  83.8   97.3   112.5  85.6
Angela’s Sample:
90.7   100.5  106.7  87.6   91.5   90.1
102.6  85.4   106.9  88.9   114.3  71.4
122.1  98.2   108.8  84.9   106.2  100.3
115.8  91.5   86.1   109.5  95.9   101.7
95.6   84.7   75.8   96.3   104.8  98.7
80.4   103.0  120.1  84.8   110.2  118.7
(a) Determine the 95% confidence interval for Jamie’s sample.
(b) Explain what the confidence interval from (a) means.
(c) Determine the 95% confidence interval for Angela’s sample.
(d) Which confidence interval has a greater probability of enclosing the population mean?
(e) Do either of the confidence intervals enclose the mean?
Answers:
In this question, we cannot calculate either of the confidence intervals without the sample
means ($\bar{x}$) and sample standard deviations ($S_x$) for the two samples. We will enter the data
points in our TI-83 or TI-84 calculators and use the 1-Var Stats command to obtain the
desired means and standard deviations.
(a) $\bar{x} \pm z\frac{S_x}{\sqrt{n}}$
$102.2 \pm 1.96\frac{11.1}{\sqrt{30}}$
$102.2 \pm 3.97$
From 98.23 to 106.17
(b) It means that the method used to produce the interval from 98.23 pounds to 106.17 pounds
has a 0.95 probability of enclosing the true mean weight of 10 year old males (i.e.
population mean). There is a 5% chance that this method creates an interval that does
not enclose the population mean.
(c) $\bar{x} \pm z\frac{S_x}{\sqrt{n}}$
$98.1 \pm 1.96\frac{12.6}{\sqrt{36}}$
$98.1 \pm 4.12$
From 93.98 to 102.22
(d) This is a trick question. These confidence intervals have the same confidence level
(95%) therefore the methods used to create both intervals have the same probability of
enclosing the population mean.
(e) We cannot tell if either of these confidence intervals encloses the population mean
because the population mean is not supplied. We are dealing with an unknown
population.
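For readers without a graphing calculator, part (a) of Example 4 can be reproduced with a short Python sketch that works directly from Jamie's data. The variable names are our own. The endpoints differ slightly from the hand calculation above because the hand calculation rounds the sample mean and sample standard deviation before using them.

```python
import math
import statistics

jamie = [
    81.2, 110.7, 101.4, 112.7, 112.7, 104.8,
    113.7, 91.7, 102.9, 116.0, 107.0, 109.9,
    107.1, 85.6, 83.1, 99.5, 85.4, 113.2,
    97.7, 95.4, 114.6, 116.0, 101.8, 111.1,
    102.3, 108.3, 83.8, 97.3, 112.5, 85.6,
]

n = len(jamie)
x_bar = statistics.mean(jamie)        # sample mean
s_x = statistics.stdev(jamie)         # sample standard deviation (n - 1 divisor)
margin = 1.96 * s_x / math.sqrt(n)    # z = 1.96 for 95% confidence
low, high = x_bar - margin, x_bar + margin

print(n, round(x_bar, 1))
print(round(low, 2), round(high, 2))
```

Replacing the list with Angela's 36 data points gives the interval for part (c) in the same way.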
If you wish further clarification go to the Mathematics Multimedia Learning Objects (see page
iv), access Unit 11-5 Statistics, and view MLO15 Using Confidence Intervals.
Questions
1. Brian wants to know how much on average Nova Scotian households spend on electricity in
the month of December. He could not get permission from the power corporation to access
their records for that month so he decided to collect a random sample of size 300. After
analyzing the data, he reports that the population mean power bill for Nova Scotian
households is between $292 and $304 with 95% confidence. Explain what this last sentence
means.
2. Barb collects a sample of size 98 from an unknown population. She calculates the sample
mean and finds that it is equal to 583.2. The sample standard deviation works out to be 32.1.
(a) Determine the 99% confidence interval based on this sample.
(b) Explain what this confidence interval means.
(c) Does the confidence interval enclose the population mean?
(d) If we collected 500 samples of the same size from the same population and then
generated five hundred 99% confidence intervals, how many would one expect to
enclose the population mean?
3. Dr. Saad conducted a medical study where he recorded the resting heart rate of 32 randomly
selected 18 year old girls. The data in beats per minute is supplied below.
79  71  78  71  76  70  69  66
76  84  77  67  78  78  87  75
65  72  68  72  77  73  70  72
72  81  84  82  66  89  76  72
(a) Determine $\bar{x}$. Is it a point estimate or interval estimator?
(b) Calculate the 95% confidence interval. Is it a point estimate or interval estimator?
(c) When Dr. Saad is asked to explain the meaning of the resulting 95% confidence interval
he responds, “There is a 0.95 probability that the true mean resting heart rate of 18 year
old girls falls within the interval we just calculated.” Is his interpretation correct?
Explain.
(d) If he collected 400 samples of size 32 and created four hundred confidence intervals with
the same confidence level as above, how many would one expect not to enclose the true
mean resting heart rate? Would he be able to determine which intervals did not enclose
the population mean?
4. Monica and Kadeer conducted two separate studies that looked at the daily water
consumption of randomly selected adult Nova Scotians. The data reported in litres is listed
below.
Monica’s Data:
360  366  300  313  223  348  343  299  340  330  317
303  254  335  345  368  402  362  306  281  405  321
366  303  393  289  339  444  377  299  306  285  429
Kadeer’s Data:
363  297  271  303  300  330  351  322  311  305  319  359
383  388  321  338  220  271  364  350  309  299  323  320
375  304  308  361  354  359  341  302  390  307  290  325
(a) Determine the 90% confidence interval based on Monica’s data.
(b) Determine the 99% confidence interval based on Kadeer’s data.
(c) Which method used to create the two confidence intervals has a greater probability of
enclosing the true mean daily water consumption?
5. Maurita collects a sample of size 56 from an unknown population. The sample mean works
out to be 148.0 and the sample standard deviation works out to be 17.4.
(a) Determine the 90% confidence interval for this sample.
(b) Determine the 95% confidence interval for this sample.
(c) Determine the 99% confidence interval for this sample.
(d) How does the confidence level affect the confidence interval?
(e) If µ = 143.9 , then did all three confidence intervals enclose the desired value?
6. Rana collects three samples of differing sizes from the same unknown population. We have
“cooked” the results so that the sample standard deviations remain the same for the three
samples. The reason for this will become apparent as you progress through the questions.
(a) Determine the 95% confidence interval for a sample of size 30 with a sample mean of
53.8 and sample standard deviation 4.89.
(b) Determine the 95% confidence interval for a sample of size 100 with a sample mean of
54.9 and sample standard deviation 4.89.
(c) Determine the 95% confidence interval for a sample of size 250 with a sample mean of
54.3 and sample standard deviation 4.89.
(d) Does the sample size affect the width of the confidence interval? Explain.
7. Which of these factors affect the width of confidence intervals? Simply indicate with a
check mark.
____ Population Mean
____ Sample Size
____ Confidence Level
____ Sample Mean
8. Indicate whether each of the following statements is true or false.
_________ (a) Once we calculate a confidence interval, the population mean may or may not
be enclosed by that interval.
_________ (b) There is a 95% chance that a 95% confidence interval will include the sample
mean.
_________ (c) A sample mean is an example of a point estimate.
_________ (d) If we are sampling from the same population and using the same sample size,
then higher confidence levels produce wider intervals than lower confidence
levels.
_________ (e) If we are sampling from the same population and constructing confidence
intervals with the same confidence levels, then larger sample sizes produce
wider intervals than those from smaller sample sizes.
_________ (f) For a 99% confidence interval, there is a 0.99 probability that the population
mean will fall between the two values.
_________ (g) Approximately 90% of the data points in a sample are enclosed within the
90% confidence interval.
9. Water from 70 different rainfalls in Nova Scotia was analyzed for acidity (pH). The mean
pH reading was 6.2 with a standard deviation of 0.5. Determine the 95% confidence interval
for the mean acidity and explain what the interval represents.
10. Go to the following website.
http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/
(or Google Search: Confidence Interval Applet RVLS)
Read the instructions and then click on the BEGIN icon. The window shown on the right
will appear on the screen. Press SAMPLE and examine the diagram on the left and the chart at
the bottom of the window. Press the SAMPLE button again. What is this applet trying to show
you?
11. A large national department store chain that offers extended warranties on its products wants
to know how long a particular brand of washing machine will last before needing
maintenance. They randomly selected customers who purchased this machine and asked
them how long their machine lasted before requiring maintenance. The data reported in
months is listed below.
56  47  45  50  42  51  49  41  49  49
46  44  45  49  46  51  45  50  46  46
44  45  41  52  51  51  44  56  54  51
55  44  45  49  48  43  49  50  45  46
(a) Calculate the 90% confidence interval.
(b) Did the interval enclose the population mean? Explain.
(c) If you collected another sample of size 40, would you expect the confidence interval to
change? Explain.
(d) If the confidence level is changed from 90% to 99%, how would that affect the width of
the confidence interval?
(e) If the sample size was changed from 40 to 100, how would that affect the width of the
90% confidence interval?
Putting It Together
Before we start working on review questions, we should look at the various sections of this unit
and the terms that were introduced in each of those sections.
Introductory Materials and Terminology
- Descriptive Statistics, Inferential Statistics, Population, Sample, Categorical Data,
Discrete Numerical Data, Continuous Numerical Data
Bar Graphs and Histograms
- Bar Graphs, Histograms, Distributions (Uniform, Skewed, Normal, Bimodal)
Describing Data, Part 1
- Population Mean, Sample Mean, Median, Outliers, Trimmed Means
Describing Data, Part 2
- Population Standard Deviation, Sample Standard Deviation
Normal Distribution
- 68-95-99.7 Rule
Collecting a Sample
- Unbiased Sample, Sample Size
Sampling Methods
- Simple Random Sample, Cluster Sample, Stratified Sample, Systematic Sample,
Voluntary Response Sample, Convenience Sample
Central Limit Theorem
- Sampling Distribution of the Sample Means, Mean of the Sample Means, Standard
Deviation of the Sample Means
Point Estimates and Interval Estimators
- Point Estimate, Interval Estimators, Confidence Interval, Confidence Level
Questions:
1. The manager of the community sportsplex wanted to know how the 1386 members might
feel about the discussion concerning an addition to the existing building that included a 25
metre, 8 lane pool. He asked 230 randomly selected members if they were willing to pay an
additional $35 a year on their membership fee to have these new features. Describe the
population and the sample for this situation.
2. For each of the following, state whether the data collection would result in a categorical data
set or numerical data set. If the data is numerical, indicate whether we are dealing with
discrete or continuous data.
(a) The number of pets in Nova Scotian households
(b) The type of MP3 player owned by adults.
(c) The diameter of the trunk of spruce trees growing in a particular
valley.
(d) The size of T-shirts worn by boys between the ages of 16 and 18
years
(e) The number of children traveling more than 1.5 kilometres to
school.
(f) The time to complete a driver’s license renewal at a specific
Access Nova Scotia location
3. If you were collecting a random sample in each situation, what type of distribution (normal,
uniform, bimodal, skewed) would you likely obtain?
Distribution Type
(a) Hodgkin’s lymphoma is a type of cancer that originates from
white blood cells. This disease typically affects people either in
early adulthood or when they are 55 years of age or older. You
randomly select 250 patients with Hodgkin’s lymphoma and ask
them to report the age of their initial diagnosis. What would the
distribution of ages likely look like?
(b) Most people make under $40,000 a year, but some make quite a
bit more, with a smaller number making many millions of
dollars a year. What would the distribution of yearly earnings
likely look like?
(c) James is working as a biologist for the summer and measuring
the circumferences of randomly selected maple trees in a natural
growth forest. What would the distribution of circumferences
likely look like?
(d) You use the random number generator on your calculator to find
500 random whole numbers between 1 and 10. What would the
distribution of numbers likely look like?
4. An airline company randomly selected eighteen suitcases from domestic flights and recorded
their weights in kilograms.
16.2
11.3
15.7
14.7
15.1
19.6
16.0
14.1
3.9
18.0
14.8
16.3
13.6
11.9
12.4
14.8
13.5
19.7
(a) Although the airline collected a sample, describe the population in this situation.
(b) Would a histogram or bar graph be used with this data set?
(c) Calculate the mean, median, and 5% trimmed mean without using the STAT feature on a
TI-83/84 calculator.
(d) Which of these measures is not influenced or less influenced by extremely high or low
data points?
5. A study looked at the concentration of iron in the bloodstream of ten randomly selected high
performance female athletes. The following data was collected. The concentrations are
measured in grams per decilitre (g/dl).
15.3
14.2
13.6
11.9
14.8
12.6
14.6
13.9
14.2
12.9
(a) Are we dealing with a population or a sample?
(b) Calculate the mean without using the STAT features on your calculator. Use the
appropriate symbol.
(c) Calculate the standard deviation without using the STAT features on your calculator.
xi
6. The body mass index of 600 randomly selected 20 year old males was taken. The sample
mean was 23.0 kg/m² and the sample standard deviation 2.5 kg/m². Assume that the
distribution of body mass indexes was bell shaped.
(a) Approximately how many 20 year old males had body mass indexes between 23.0 kg/m²
and 25.5 kg/m²?
(b) Approximately how many 20 year old males had body mass indexes between 18.0 kg/m²
and 23.0 kg/m²?
(c) Approximately how many 20 year old males had body mass indexes between 15.5 kg/m²
and 30.5 kg/m²?
(d) Approximately how many 20 year old males had body mass indexes between 20.5 kg/m²
and 28.0 kg/m²?
(e) Approximately how many 20 year old males had body mass indexes between 18.0 kg/m²
and 30.5 kg/m²?
(f) Approximately how many 20 year old males had body mass indexes between 15.5 kg/m²
and 25.5 kg/m²?
(g) Approximately how many 20 year old males had body mass indexes between 25.5 kg/m²
and 28.0 kg/m²?
(h) Approximately how many 20 year old males had body mass indexes between 15.5 kg/m²
and 18.0 kg/m²?
7. Identify the sampling method used. Also indicate whether we are dealing with a preferred
or a poor sampling method.
(a) A cable company wanted to know how its customers felt about upgrading the high
definition (HD) television signal from 720p to 1080p. There would be a small increase
in the monthly bill for this upgrade. As customers have signed up for the regular HD
television (720p), they have been assigned a six digit customer identification number
starting at 000000. The company wants to ask every hundredth HD customer about the
potential upgrade. They randomly generate the number 83, and then ask every customer
whose identification number ends with these two digits to respond to the company’s
survey.
Method: ______________________________________________
(b) A small community has 1000 adult residents. The community leaders want to know how
the residents feel about the rezoning of some municipal property so that a small strip mall
can be built. The leaders want to collect a sample of size 120. Each resident is assigned
a number 000 through to 999. They take the ball machine from the local bingo hall and
fill it with three sets of ping pong balls, each set numbered 0 through 9. The machine is
turned on and three balls are extracted. Those three numbers correspond to the three digit
number assigned to one resident. The three balls are returned to the machine and this
process is repeated 119 more times. The leaders now know which of their residents will
be asked to partake in the survey.
Method: ______________________________________________
(c) A national hardware store wants to know how its customers feel about the service,
products, and prices. At the bottom of every receipt, they include a website. If the
customer chooses to visit the site, they can answer a series of questions and have an
opportunity to win a prize.
Method: ______________________________________________
(d) A large hotel in a large city wants to know how much its customers spend at their hotel
during an overnight visit. The hotel already knows that 80% of its customers are there on
business while only 20% are there for leisure. They suspect that the spending habits for
these two groups may be quite different so they create a random sampling technique that
ensures that both groups are proportionally represented in the survey.
Method: ______________________________________________
(e) Ontario has 211 hospitals. The health authority wants to understand the demands that are
presently being put on emergency room staff. Rather than interviewing every ER staff
member at every hospital, they randomly select 10 hospitals and interview every ER staff
member at those ten facilities.
Method: ______________________________________________
8. Random samples of size 75 are repeatedly selected from a known population with a mean of
107.2 and a standard deviation of 9.3. These repeatedly collected samples allow a sampling
distribution of all possible x̄’s to be drawn.
(a) What type of distribution (uniform, bimodal, normal, or skewed) would result?
(b) Determine the mean of the sample means and indicate where it would be located on the
distribution of the x̄’s.
(c) Determine the standard deviation of the sample means.
9. The mean cost of a lunch at a particular eating establishment is $11.52 with a standard
deviation of $1.47. A sample is taken from this known population. The data points are
shown below in the chart.
10.63  9.58  12.43  10.66  11.85  11.64  11.58  11.52  9.12  10.12
14.05  11.09  12.20  13.00  11.22  10.05  11.39  8.77  13.11  12.46
12.50  11.21  10.41  9.76  12.99  13.03  14.15  11.06  12.27  11.42
13.73  14.06  14.10  9.74  12.38  11.21  12.80
(a) What is the population mean?
(b) What is the population standard deviation?
(c) Determine the sample mean.
(d) What does the sample mean represent in this situation and is it close to the expected
value?
(e) What is the sample size?
(f) Determine the sample standard deviation.
(g) If samples of the same size are repeatedly collected from this known population, what
would be the value of the standard deviation of the sample means?
(h) If samples of the same size are repeatedly collected from this known population, what
would be the value of the mean of the sample means?
(i) For the sampling distribution of x̄’s, between what two values would one expect 68% of
the sample means to fall?
10. (a) Does the sample size affect the mean of the sample means? Explain.
(b) Does the sample size affect the standard deviation of the sample means? Explain.
11. Meera collects a sample of size 125 from an unknown population. She calculates the sample
mean and finds that it is equal to 287.1. The sample standard deviation works out to be 25.7.
(a) Determine the 90% confidence interval based on this sample.
(b) Explain what this confidence interval means.
(c) If we collected 500 samples of the same size from the same population and then
generated five hundred 90% confidence intervals, how many would one expect to
enclose the population mean?
12. Dr. Bagnell conducted a medical study where she recorded the height in centimetres of 36
randomly selected 20 year old males. The data is supplied below.
177
177
179
176
182
176
192
172
183
171
185
184
192
180
178
174
184
167
172
179
179
184
171
184
182
176
178
177
172
172
178
173
180
181
173
179
(a) Determine x̄. Is it a point estimate or interval estimator?
(b) Calculate the 95% confidence interval. Is it a point estimate or interval estimator?
(c) Calculate the 99% confidence interval.
(d) Which of the methods used to create the two confidence intervals has a greater chance of
enclosing the true mean height of 20 year old males? Explain.
13. The head circumferences of 150 randomly selected infants (20 months of age) were recorded.
The mean circumference reading was 48.1 centimetres with a standard deviation of 1.2
centimetres. Determine the 95% confidence interval and explain what the interval
represents. Does the interval enclose the true mean head circumference?
14. Computer equipment can be sensitive to high temperatures. Leck Electronics wanted to test
a particular computer component to determine at what temperature the component would
fail. They randomly selected 35 of the same component, exposed them to increasing
temperatures, and recorded the temperature (°C) at which the component failed.
35.2
30.9
35.6
34.1
31.8
33.7
33.5
30.2
33.4
38.4
33.0
33.1
30.6
33.1
31.5
34.1
28.0
31.5
29.7
29.2
34.1
28.1
33.3
33.5
33.7
30.6
38.4
36.9
33.2
32.8
37.4
36.9
34.5
33.3
33.8
(a) Calculate the 90% confidence interval for the mean failure temperature.
(b) If you collected another sample of size 35, would you expect the confidence interval to
change? Explain.
(c) If 300 random samples of size 35 were obtained and three hundred 90% confidence
intervals were constructed, approximately how many would you expect not to enclose the
population mean?
(d) If the sample size was changed from 35 to 200, how would that affect the width of the
90% confidence interval?
(e) If the confidence level is changed from 90% to 95%, how would that affect the width of
the confidence interval?
If You Have the Time
We have spent the last few weeks looking at descriptive and inferential statistics. Although we
examined several statistical tools in real world applications, we have not seen how statistical
information can dramatically change our understanding of the world. There are, however, two
fascinating online videos that do just that.
The first video features Hans Rosling, a Swedish physician and professor of International Health. He
uses statistics to show how we must change our perceptions of other countries, particularly those
that we deem as third world. He even uses confidence intervals in his presentation to show that
Swedish undergraduate university students performed worse than chimpanzees on a test of
international child mortality rates. This video can be viewed at the following site.
TED Hans Rosling Shows the Best Stats You've Ever Seen
http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html
The second video features Peter Donnelly, an Australian statistician working at the University of
Oxford. He presents several real world examples where people, including professionals, have
difficulty reasoning with uncertainty and the implications of such shortcomings.
TED Peter Donnelly Shows How Stats Fool Juries
http://www.ted.com/talks/lang/eng/peter_donnelly_shows_how_stats_fool_juries.html
Optional Assignment
With your instructor’s permission, you may wish to negotiate an optional assignment based on
one or both of these videos.
Post-Unit Reflections
What is the most valuable or important
thing you learned in this unit?
What part did you find most interesting or
enjoyable?
What was the most challenging part, and
how did you respond to this challenge?
How did you feel about this topic when
you started this unit?
How do you feel about this topic now?
Of the skills you used in this unit, which
is your strongest skill?
What skill(s) do you feel you need to
improve, and how will you improve them?
How does what you learned in this unit fit
with your personal goals?
Terms, Symbols, and Formulas
By the end of this unit, you should be familiar with the following terms, symbols, and formulas.
These have been presented in the order that they appear in this resource.
Descriptive Statistics
Inferential Statistics
Population
Sample
Categorical Data Set
Numerical Data Set
Discrete Numerical Data
Continuous Numerical Data
Bar Graph
Histogram
Normal Distribution
Uniform Distribution
Skewed Distribution
Bimodal Distribution
Population Mean, µ
$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$
Sample Mean, x̄
$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$
Median
Trimmed Mean, x̄(T)
Population Standard Deviation, σ
$\sigma = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \dots + (x_n - \mu)^2}{n}}$
Sample Standard Deviation, Sx
$S_x = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}}$
Frequency Polygon
68-95-99.7 Rule
Sample Size, n
Simple Random Sample
Stratified Random Sample
Cluster Sample
Systematic Sample
Convenience Sample
Voluntary Sample
Sampling Distribution of the Sample Means
Mean of the Sample Means, µx̄
$\mu_{\bar{x}} = \mu$
Standard Deviation of the Sample Means, σx̄
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Central Limit Theorem
Point Estimate
Interval Estimator
Confidence Interval Based on a Sample Mean
$\bar{x} \pm z \frac{S_x}{\sqrt{n}}$
Confidence Level
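The confidence interval formula in this list can be tried out away from the calculator. The following is a minimal Python sketch, not part of the original resource. The function name confidence_interval and the summary values (mean 50.0, standard deviation 4.0, sample size 64) are hypothetical and chosen only for illustration. The z value is computed from the standard normal distribution rather than read from a table.

```python
import math
import statistics

def confidence_interval(sample_mean, sample_sd, n, level):
    """Return (low, high) for x-bar plus or minus z * S_x / sqrt(n)."""
    # z leaves (1 - level) / 2 of the standard normal curve in each tail,
    # e.g. about 1.96 for a 95% confidence level.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical summary values, used only to show the arithmetic.
low, high = confidence_interval(sample_mean=50.0, sample_sd=4.0, n=64, level=0.95)
print(round(low, 2), round(high, 2))   # 49.02 50.98
```

Raising the confidence level makes z larger and the interval wider, while raising n makes the interval narrower.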
TI-83/84 Statistics Information Sheet
The following commands are used throughout this unit.
1. The 1-Var Stats Command
This command allows one to determine the mean (x̄), median, sample standard deviation
(Sx), and population standard deviation (σ) for data entered into a list on the calculator.
The command also generates other values but none of these will be used in this course.
STAT > CALC > 1-Var Stats > Enter the list name (L1, L2, …) > ENTER
2. The SortA command
This command sorts data in a specific list in ascending order (i.e.
smallest to largest).
STAT > SortA( > Enter the list name (L1, L2, …) > ENTER
3. The rand Command
This command generates a random number between 0 and 1.
MATH > PRB > rand > In a set of brackets, indicate the number of random numbers you wish
to generate.
4. The randNorm Command
This command allows one to simulate the collection of a random sample of a specific size from
a known population that is normally distributed.
MATH > PRB > randNorm > Enter the population mean, population standard deviation, and
sample size, all separated by commas. Close the brackets.
5. The mean Command
This command finds the mean for data in a specific list.
LIST > MATH > mean( > Enter the list name (L1, L2, …) > ENTER
6. The seq Command
This command generates a sequence of numbers.
LIST > OPS > seq(
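For learners who do not have a TI-83/84 at hand, the six commands above have rough counterparts in Python's standard library. The sketch below is an illustration only, not part of the original resource. The short data list, the population values 412 and 38, and the seq arguments shown in the comment are assumptions made for the example.

```python
import random
import statistics

random.seed(1)  # fixed seed so repeated runs give the same random values

data = [56, 47, 45, 50, 42]   # a short list standing in for L1

# 1. 1-Var Stats: mean, median, sample and population standard deviation
mean = statistics.mean(data)
median = statistics.median(data)
s_x = statistics.stdev(data)      # divides by n - 1
sigma = statistics.pstdev(data)   # divides by n

# 2. SortA(L1): ascending order
ascending = sorted(data)

# 3. rand(3): three random numbers between 0 and 1
three_randoms = [random.random() for _ in range(3)]

# 4. randNorm(412, 38, 40): 40 values from a normal population
sample = [random.gauss(412, 38) for _ in range(40)]

# 5. mean(L1) applied to the simulated sample
sample_mean = statistics.mean(sample)

# 6. seq(X^2, X, 1, 5): the squares of 1 through 5 (assumed argument order)
squares = [x ** 2 for x in range(1, 6)]

print(mean, median, ascending, squares)
```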
Answers
Introductory Materials and Terminology (pages 1 to 5)
1. Population: all the taxpayers in this community (4127)
Sample: the 300 randomly selected taxpayers
2. Population: all the used bricks that the contractor purchased (6000)
Sample: the 200 randomly selected bricks that were examined to determine usability
3. Population: all of the employed workers in Nova Scotia (453 000)
Sample: the 1200 randomly selected employed workers who participated in the survey and
reported their annual gross income
4. Population: all of the adults who received a high school diploma from NSSAL between 2001
and 2009
Sample: the 240 randomly selected NSSAL graduates who participated in the interview
5. (a) numerical (continuous)
(b) categorical
(c) categorical
(d) numerical (continuous)
(e) numerical (discrete)
(f) numerical (continuous)
(g) categorical
(h) numerical (continuous)
(i) numerical (discrete)
6. (a) Quebec City, 340 cm per year
(b) 50 cm per year
(c) numerical data set – cities are reporting annual snowfalls in centimetres
7. (a) 187 people
(b) 231 people
(c) 176 people
(d) sample (reason: only 1100 of all the citizens were selected)
(e) categorical data set
8. (a) 58%
(b) 14%
(c) 1991 and 1996
(d) population (reason: census)
Bar Graphs and Histograms (pages 6 to 10)
1. (a) bar graph
(b) histogram
(c) histogram
(d) bar graph
(e) bar graph
(f) histogram
(g) bar graph
2. (a)
(b) 16 2/3 %
(c) skewed left
(d) Because we are dealing with continuous numerical data
(e) sample
3. (a) population
(b)
(c) normal
4. (a) uniform
(b) bimodal
(c) skewed right
(d) normal
(e) skewed left
Describing Data, Part 1 (pages 11 to 16)
1. (a) sample
(b) x̄ = 6.2, Median = 6
(c) There are no outliers.
2. (a) population
(b) numerical
(c) µ = 159.44 Median = 157
3. (a) sample
(b) x̄ = 35 (34.6), Median = 31
5% Trimmed Mean x̄(T) = 31 (30.6)
10% Trimmed Mean x̄(T) = 31 (30.9)
(c) Trimmed means are appropriate because the outlier 115 exists within the data set.
(d) Four data points from the bottom and four data points from top of the data set
4. (a) x̄ = 268 (267.875), Median = 254 (253.5), x̄(T) = 255 (255.409)
(b) Median and Trimmed Mean
(c) Histogram
5. (a) This score system was likely implemented to eliminate the effect of a single rogue judge
who would inflate or deflate the score of a particular athlete.
(b) The method used in gymnastics and diving removes only one high score and one low
score. If more than one judge work together to inflate or deflate the score of a particular
athlete then this particular trimmed mean technique will eliminate only one rogue judge,
but not all. In the case of this figure skating competition, we were dealing with more
than one rogue judge.
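The trimmed means reported in this section can be explored with a short Python sketch. It is an illustration only. It assumes that the stated percentage is removed from each end of the ordered data and that the number of points to remove is rounded to the nearest whole number, which may differ from the rounding convention used elsewhere in this resource. The data list is hypothetical.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent` % of the points from each end (assumed convention)."""
    ordered = sorted(values)
    # Number of data points removed from the bottom and again from the top.
    k = round(len(ordered) * percent / 100)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data set with one very large value.
data = [12, 15, 16, 18, 19, 20, 21, 22, 24, 115]

print(sum(data) / len(data))     # ordinary mean: 28.2
print(trimmed_mean(data, 10))    # 10% trimmed mean: 19.375
```

The single value of 115 pulls the ordinary mean well above most of the data, while the trimmed mean stays close to the bulk of the values.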
Describing Data, Part 2 (pages 17 to 24)
1.
xi      xi − x̄     (xi − x̄)²
25      -3          9
32      4           16
24      -4          16
28      0           0
31      3           9
28      0           0
Sum = 50
$S_x = \sqrt{\frac{50}{6 - 1}} = 3.16$
2.
xi      xi − µ      (xi − µ)²
3.7     -0.6        0.36
4.3     0           0
5.0     0.7         0.49
4.6     0.3         0.09
4.0     -0.3        0.09
4.7     0.4         0.16
3.9     -0.4        0.16
4.2     -0.1        0.01
Sum = 1.36
$\sigma = \sqrt{\frac{1.36}{8}} = 0.41$
3. (a) First Data Set: x̄ = 15, Sx = 1.58
Second Data Set: x̄ = 15, Sx = 2.65
(b) Although the sample means are equal, the sample standard deviations are different.
Since the standard deviation is lower for the first data set, we know that the
individual data points are more clustered around the mean compared to the values in the
second data set.
4. (a) 543
(b) 544
(c) S x = 5.24
(d) Although the means and standard deviations for the two samples would be similar, they
would likely not be the same. Because samples are a subset of the population, it is very
unlikely that the two samples would draw the same individual pieces of data.
5. (a) 183
(b) 182
(c) numerical data set
(d) σ = 4.90
(e) The average heights of these two groups of learners are the same however the standard
deviation for Barb’s group is much lower. That means that there is less variation in
heights between Barb’s male learners compared to the other instructor’s learners. The
heights of her learners are more clustered around the mean.
(f) The standard deviations are almost the same for the two groups of male learners,
however, the mean height for Barb’s group is higher. We can conclude that the average
height of male learners in Barb’s math courses is three centimeters more than the third
instructor’s male students. The variation in heights between the two groups is essentially
the same.
6. Answers will vary.
7. Histogram (i) matches with (c)
Histogram (ii) matches with (b)
Histogram (iii) matches with (d)
Histogram (iv) matches with (a)
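The deviation tables in answers 1 and 2 of this section can be checked with a few lines of Python. This is a sketch for verification only, not the method taught in the unit. The function name deviation_table is hypothetical. The data values are the xi columns of those two answers.

```python
import math

def deviation_table(values, sample=True):
    """Return (rows, sum_of_squares, standard_deviation).

    Each row is (x, x - mean, (x - mean) squared). Divide by n - 1 for a
    sample and by n for a population.
    """
    n = len(values)
    mean = sum(values) / n
    rows = [(x, x - mean, (x - mean) ** 2) for x in values]
    total = sum(squared for _, _, squared in rows)
    divisor = n - 1 if sample else n
    return rows, total, math.sqrt(total / divisor)

# Answer 1: sample standard deviation
_, total_1, s_x = deviation_table([25, 32, 24, 28, 31, 28], sample=True)
print(total_1, round(s_x, 2))             # 50.0 3.16

# Answer 2: population standard deviation
_, total_2, sigma = deviation_table([3.7, 4.3, 5.0, 4.6, 4.0, 4.7, 3.9, 4.2], sample=False)
print(round(total_2, 2), round(sigma, 2))  # 1.36 0.41
```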
Using Technology (pages 25 to 28)
1. (a) sample
(b)
(c) x̄ = 155.6, median: 156, Sx = 18.3
(d) normal distribution
2. (a) population
(b)
(c) µ = 55.6
(d) σ = 9.5
(e) median: 54.5
(f) The data does not cluster well around the mean.
3. (a) population
(b)
(c) µ = 14.1 , median: 9.91 , σ = 11.2
(d) The mean is high because the incarceration rate for the Northwest Territories is so much
higher than the other rates.
Normal Distribution (pages 29 to 34)
1. The first data point in List 5 represents the lowest total honey production over a four year
period for one of the one hundred randomly selected hives.
2. The last data point in List 5 represents the highest total honey production over a four year
period for one of the one hundred randomly selected hives.
3. The sample mean, x̄, represents the average total honey production over the four year period
of the one hundred randomly selected hives.
4. (a) Answers will vary.
(b) Answers will vary but there should be around 68 (give or take 3 or 4).
(c) 68%, it should be supported because you should have about 68 out of 100 data points
within this range.
5. (a) Answers will vary.
(b) Answers will vary but there should be around 95 (give or take 3 or 4).
(c) 95%, it should be supported because you should have about 95 out of 100 data points
within this range.
Using the 68-95-99.7 Rule (pages 35 to 39)
1.
     Hint:                              Calculation:       Answer:
(a)  Between µ − σ and µ + σ            --                 68%
(b)  Between µ and µ + 2σ               --                 47.5%
(c)  Between µ − σ and µ                --                 34%
(d)  Between µ − 3σ and µ               --                 49.85%
(e)  Between µ − 2σ and µ + σ           47.5% + 34%        81.5%
(f)  Between µ − σ and µ + 3σ           34% + 49.85%       83.85%
(g)  Between µ − 3σ and µ + 2σ          47.5% + 49.85%     97.35%
(h)  Between µ + σ and µ + 2σ           47.5% - 34%        13.5%
(i)  Between µ − 3σ and µ − 2σ          49.85% - 47.5%     2.35%
(j)  Between µ + σ and µ + 3σ           49.85% - 34%       15.85%
2.
     Hint:                              Calculation:       Percentage:    Answer:
(a)  Between x̄ − 3Sx and x̄ + 3Sx      --                 99.7%          1994
(b)  Between x̄ − Sx and x̄ + Sx        --                 68%            1360
(c)  Between x̄ − 2Sx and x̄            --                 47.5%          950
(d)  Between x̄ − Sx and x̄ + 2Sx       34% + 47.5%        81.5%          1630
(e)  Between x̄ + Sx and x̄ + 2Sx       47.5% - 34%        13.5%          270
(f)  Between x̄ − 2Sx and x̄ + 2Sx      --                 95%            1900
(g)  Between x̄ − 3Sx and x̄ − Sx       49.85% - 34%       15.85%         317
(h)  Between x̄ − 2Sx and x̄ + 3Sx      47.5% + 49.85%     97.35%         1947
(i)  Between x̄ − 3Sx and x̄            --                 49.85%         997
(j)  Between x̄ + 2Sx and x̄ + 3Sx      49.85% - 47.5%     2.35%          47
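The additions and subtractions used in these answers follow one pattern, which the Python sketch below tries to capture. It is an illustration only. The names HALF_BAND, percent_between, and expected_count are hypothetical. The sample size of 2000 is inferred from the answers to question 2 (for example, 68% corresponding to 1360) and should be treated as an assumption.

```python
# Percentage of the data between the mean and k standard deviations on ONE side,
# taken from the 68-95-99.7 rule (68 / 2, 95 / 2, 99.7 / 2).
HALF_BAND = {0: 0.0, 1: 34.0, 2: 47.5, 3: 49.85}

def percent_between(low_k, high_k):
    """Percentage between mean + low_k*SD and mean + high_k*SD, with k from -3 to 3."""
    def signed(k):
        # Areas below the mean count as negative so that one subtraction
        # handles both the "add" and the "subtract" cases.
        return HALF_BAND[abs(k)] if k >= 0 else -HALF_BAND[abs(k)]
    return signed(high_k) - signed(low_k)

def expected_count(low_k, high_k, n):
    """Approximate number of data points, out of n, that fall in the interval."""
    return round(percent_between(low_k, high_k) / 100 * n)

print(round(percent_between(-2, 1), 2))   # 81.5  (47.5% + 34%)
print(round(percent_between(1, 2), 2))    # 13.5  (47.5% - 34%)
print(expected_count(-3, 3, 2000))        # 1994
```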
Collecting a Sample (pages 41 to 44)
Conclusions (a) and (b)
Although it is not guaranteed, most learners’ non-random samples will not be a good
representation of the population. Generally students will choose a few small buildings (between
1000 and 4000 sq. ft.), a few medium sized buildings (between 6000 and 8000 sq. ft.) and at least
one large building (between 9000 and 16000 sq. ft.). The population, however, has a much
greater proportion of smaller buildings, than medium or large buildings. The random samples
are more likely to capture this and therefore be better representations of the population. The
purpose of this investigation was to show that when we conduct surveys we should be using
random sampling techniques to ensure that we end up with unbiased samples.
1
(a) Conducting the survey at an ultimate fighting competition is problematic. This type of
competition is extremely physical and some would say violent. Asking viewers their
opinions on media violence will not likely produce data that is representative of the
general population.
(b) There are two problems with Genevieve’s survey. The first is the location. Shopping
malls do not serve the needs of all shoppers. Low income families will likely use other
shopping establishments. High income individuals may shop predominantly at specialty
stores or boutiques. There is also the issue that Mic Mac Mall serves predominantly an
urban, rather than rural, clientele. The second problem lies in the manner she selected
survey participants. They were not randomly selected. She approached people who she
felt would answer her survey questions. She may inadvertently omit individuals from
differing age groups, cultures, or socioeconomic groups.
(c) Not everyone who views this show and has an opinion regarding the talent of the
contestants will participate in the voting. It is often difficult to register a vote through all
the busy circuit signals therefore only individuals who have a strong opinion regarding the
competition are likely to vote. Many of these will vote more than once. The other matter
is the cost. In some cases, the individuals have to incur long distance phone charges.
This may serve as a deterrent for some low income individuals from participating in the
vote.
(d) This survey technique has similar problems to the survey discussed in (b); location and
selection of participants. Conducting a survey on gun registration at a hardware store is
problematic. The store likely deals with more male clientele than female. In addition to
this, the store likely sells firearms and therefore likely attracts a greater proportion of
hunters and firearms enthusiasts than other establishments. Participants are not randomly
selected for this survey, rather they volunteer to respond. Individuals who have strong
opinions on the matter are likely to respond and they may respond more than once.
(e) The problem is not the sampling technique; rather it is the question itself. The question
is “loaded” in that it initially presents negative aspects of war in Afghanistan and then
asks the question whether Canadian soldiers should remain in the conflict. The solution
is not to include the positive aspects related to Canada’s involvement in the war, rather to
create a question that does not identify positive or negative aspects. The question should
simply be, “Should Canadian soldiers remain in Afghanistan?”
Sampling Methods (pages 45 to 49)
1. (a) systematic
(b) volunteer response
(c) cluster
(d) convenience
(e) simple random
(f) volunteer response
(g) stratified
(h) cluster
(i) volunteer response
(j) simple random
(k) stratified
2. (b) Asra’s cafeteria survey
(d) Montez’s Gas Station Survey
(f) Reality Show Online Voting
(i) Ranelda going to TripAdvisor.com
3. Answers will vary slightly.
(a) Place the numbers 80000 through 82500 on separate pieces of paper, place the pieces of
paper in a drum, stir the drum, and draw 500 pieces of paper.
(b) Assign the numbers 1 through 44 to each of the high schools. Randomly select five
numbers between 1 and 44. Review all of the math exams from the five schools with
those assigned numbers.
(c) Look at the enrollment of grade 12 math students in each of the 44 schools. Randomly
select 500 exams in such a manner that each school is proportionally represented.
(d) Randomly select a number between 0 and 4. If, for example, the number 3 is obtained,
then every exam whose identification number ends with this digit would be selected for
review.
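The four approaches described in answer 3 can be mimicked with Python's random module. The sketch below is an illustration under stated assumptions, not the procedure from the original question. The three school names and their enrolment figures are hypothetical, and the systematic step draws its starting digit from 0 to 9 so that every ending digit is eligible.

```python
import random

rng = random.Random(2011)   # seeded only so the sketch is repeatable

exam_ids = list(range(80000, 82501))   # identification numbers 80000 through 82500

# Simple random sample: every exam has the same chance of being drawn.
simple_random = rng.sample(exam_ids, 500)

# Cluster sample: pick 5 of the 44 schools, then review every exam from them.
schools = list(range(1, 45))
chosen_schools = rng.sample(schools, 5)

# Systematic sample: one random digit, then every ID that ends with that digit.
digit = rng.randrange(10)
systematic = [exam_id for exam_id in exam_ids if exam_id % 10 == digit]

# Stratified sample: share out the 500 exams in proportion to enrolment.
enrolment = {"School A": 120, "School B": 300, "School C": 80}   # hypothetical
total = sum(enrolment.values())
allocation = {school: round(500 * count / total) for school, count in enrolment.items()}

print(len(simple_random), chosen_schools, len(systematic), allocation)
```

With real enrolment figures the rounded allocations may not add up to exactly 500, so a small adjustment to one or two schools could be needed.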
Simulated Sampling (pages 50 to 52)
Sample 1
The first data point in the table represents the airborne contaminant level measured in µg/m²
for the first randomly selected Canadian household in the first sample of size 40.
Sample 2
It is highly unlikely that the first five and last five data points in the two tables are going to
match because we are dealing with different samples. For example, it is unlikely that the first
randomly selected household from the millions across Canada in the first sample would have the
same airborne contaminant level as the first randomly selected household in the second sample.
1. The mean for Sample 1 represents the average contaminant level in µg/m² for the 40
randomly selected households.
2. The sample means differ because they are from four different samples. Each sample
contains different data points and therefore likely results in a different mean.
3. The expected value is the population mean ( µ = 412). The sample means should be fairly
close to the population mean.
4. Statement (d) is correct.
(d) The population mean is fixed and the sample mean is random.
Explanation:
If you consider our simulations, the sample means differed hence they are random while our
population mean remained fixed at 412.
5. No
Sampling Distribution of the Sample Means (pages 53 to 57)
(i) The calculator must generate 40 random numbers from the specified normal population,
work out the mean for those 40 data points, store that piece of information in List 1, and then
repeat that procedure 99 more times. This is obviously a time consuming process even for a
calculator.
(ii) Of the 100 simulated samples, the first value in the table represents the smallest sample mean
obtained from our 100 random samples. These 40 randomly selected households had the
lowest mean airborne contaminant reading.
(iii) Normal Distribution
1. (a) 412
(b) Answers will vary but it should be very close to 412.
(c) Answers will vary but it should be very close to 412.
(d) If we took just one random sample of large enough size, we would expect it to be fairly
close to the population mean. However, we collected 100 samples of the same size,
worked out the sample means, and then averaged those 100 sample means. One would
expect that this average would be very close to the population mean.
Mean of the Sample Means = Population Mean
2. (a) 38
(b) Answers will vary but it should be very close to 6.
(c) Answers will vary but it should be very close to 4.9.
(d) We should have learned the following.
$\text{Standard Deviation of the Sample Means} = \frac{\text{Population Standard Deviation}}{\sqrt{\text{Sample Size}}}$
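The calculator simulation described in this section (100 samples of size 40) can be repeated in Python as a rough check on both conclusions. This sketch is an illustration only. It assumes a normal population with mean 412 and standard deviation 38, values taken from the answers above, and it uses a fixed seed so that the run is repeatable. The exact results will differ from any calculator run.

```python
import math
import random
import statistics

rng = random.Random(40)   # fixed seed, chosen arbitrarily

POP_MEAN, POP_SD = 412, 38          # assumed population values
SAMPLE_SIZE, NUM_SAMPLES = 40, 100

# Collect 100 samples of size 40 and record each sample mean.
sample_means = []
for _ in range(NUM_SAMPLES):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.mean(sample))

mean_of_means = statistics.mean(sample_means)       # should land near 412
sd_of_means = statistics.stdev(sample_means)        # should land near 6
predicted_sd = POP_SD / math.sqrt(SAMPLE_SIZE)      # 38 / sqrt(40), about 6.0

print(round(mean_of_means, 1), round(sd_of_means, 2), round(predicted_sd, 2))
```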
Central Limit Theorem (pages 58 to 67)
1. (a) x̄
(b) σx̄
(c) Sx
(d) µ
(e) n
(f) µx̄
(g) σ
2. (a) 329
(b) 4.02
3. (a) normal
(b) 87, centred on the normal distribution
(c) 0.93
(d) 68%
(e) Between 85.14 and 88.86
4. Centred about the population mean: 106.2 km/h
Spread out about its centre: 0.55 km/h (standard deviation of the sample mean)
5. (a) 28.5 kg
(b) 5.0 kg
(c) 27.9 kg
(d) It represents the mean luggage weight for our sample of size 30. It is close to the
expected value (population mean).
(e) 4.72 kg
(f) 0.91 kg
(g) 28.5 kg
(h) Between 27.59 and 29.41
(i) Between 26.68 and 30.32
6. The sample standard deviation describes how spread out or clustered individual data points
from a single sample are relative to one another. The standard deviation of the sample mean
describes how spread out or clustered sample means derived from repeatedly collected
samples of the same size are relative to one another.
7. (a) iii
(b) i
(c) ii
8. (a) µ x̄ = 86, σ x̄ = 0.51
(b) Between 84.47 and 87.53
9. (a) mean of sample means = 98.6, standard deviation of sample means = 1.71
(b) mean of sample means = 98.6, standard deviation of sample means = 1.39
(c) mean of sample means = 98.6, standard deviation of sample means = 1.21
10. As the sample size increases, the standard deviation of the sample means gets smaller,
meaning that the sample means are more clustered. This seems logical. The larger a single
sample is, the more likely it is to be representative of the population, and therefore to have a
sample mean that is close to the population mean. If one repeatedly collects random samples
of a larger size, then one would expect the resulting sample means to be collectively closer to
the population mean than sample means derived from samples of a smaller size. That means
that the standard deviation of the sample means will be smaller for these larger sample sizes.
11. (a) 40
(b) Sample size 30 corresponds to sampling distribution iii.
Sample size 60 corresponds to sampling distribution ii.
Sample size 90 corresponds to sampling distribution i.
Reason: As the sample size increases, the standard deviation of the sample means gets
smaller, meaning that the sample means are more clustered around the population mean.
12. (a) iii
(b) i
(c) iv
(d) ii
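The arithmetic behind answer 5 can be sketched in a few lines. This is an illustration only: the function names are invented for this example, and the standard deviation of the sample mean is rounded to two decimal places before the intervals are built, an assumption made so that the output matches the rounding used in the answer key.

```python
import math


def sd_of_sample_mean(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def interval_about_mean(mu, sd_mean, k):
    """Interval reaching k standard deviations either side of mu."""
    return (round(mu - k * sd_mean, 2), round(mu + k * sd_mean, 2))


# Values from answer 5: population mean 28.5 kg, population SD 5.0 kg, n = 30.
sd_mean = round(sd_of_sample_mean(5.0, 30), 2)  # rounded as in the answer key

# Under a normal model, about 68% of sample means fall within one standard
# deviation of the mean and about 95% fall within two.
within_one = interval_about_mean(28.5, sd_mean, 1)
within_two = interval_about_mean(28.5, sd_mean, 2)

print(sd_mean, within_one, within_two)
```

The same two functions reproduce the standard deviation of the sample mean for any other population standard deviation and sample size in this section.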
Point Estimates and Interval Estimators (pages 68 to 78)
1. It means that the method used to create the confidence interval from $292 to $304 has a 0.95
probability of enclosing the true mean power bill for all households in Nova Scotia (i.e.
population mean). There is a 0.05 probability (or 5% chance) that the method created an
interval that did not enclose the population mean.
2. (a) From 574.5 to 591.5
(b) The method that produced the confidence interval from 574.5 to 591.6 has a 0.99
probability (or 99% chance) of enclosing the population mean. There is a 0.01
probability that the method produced an interval that does not enclose the population
mean.
(c) We cannot tell if this confidence interval encloses the population mean because we are
dealing with an unknown population.
(d) 495
3. (a) 74.8, point estimate
(b) From 72.7 to 76.9, interval estimator
(c) No, when he states that the population mean “falls within the interval”, he is implying
that the population mean is random, rather than fixed.
(d) 20, no
4. (a) (From the Calculator: x̄ = 332.6 and S x = 47.2) From 319.7 litres to 345.5 litres
(b) (From the Calculator: x̄ = 327.6 and S x = 37.3) From 311.0 litres to 344.2 litres
(c) There is a greater likelihood that the method that produced the 99% confidence interval
encloses the population mean because we are dealing with a higher confidence level
(99% as opposed to 90%).
5. (a) From 144.2 to 151.2
(b) From 143.4 to 152.6
(c) From 142.0 to 154.0
(d) As the confidence level increases, the width of the confidence interval increases.
(e) The 90% confidence interval did not enclose the population mean but the other two
confidence intervals did.
6. (a) From 52.05 to 55.55
(b) From 53.94 to 55.86
(c) From 53.69 to 54.91
(d) Width of First Confidence Interval: 55.55 – 52.05 = 3.50
Width of Second Confidence Interval: 55.86 – 53.94 = 1.92
Width of Third Confidence Interval: 54.91 – 53.69 = 1.22
Yes: as the sample size increases, the width of the confidence interval decreases.
7. As you learned in the previous questions, both the sample size and the confidence level
affect the width of the confidence interval. When everything else is constant, larger sample
sizes produce narrower intervals. When everything else is constant, higher confidence levels
produce wider intervals.
8. (a) True
(b) False - The confidence interval is worked out using the sample mean. The sample mean
is in the middle of the confidence interval therefore there is a 100% chance that it
is enclosed within the interval.
(c) True
(d) True
(e) False - Larger sample sizes produce narrower, rather than wider, intervals
(f) False – The problem with this statement is that they are saying that the population mean
will fall between the two values. This implies that the population mean is
random, rather than fixed.
(g) False – Confidence intervals are designed so that they have a strong likelihood of
enclosing the population mean. Confidence intervals are quite narrow compared
to the wide range of data points one would expect to obtain from a single random
sample.
9. From 6.08 to 6.32
The method that produced the interval from 6.08 to 6.32 has a 0.95 probability of enclosing
the true mean rainfall pH (i.e. population mean). There is a 0.05 probability that this method
created an interval that does not enclose the population mean.
10. This applet allows one to generate one hundred 95% confidence intervals and one hundred
99% confidence intervals for a known population ( µ = 50 ) and track how many enclose the
population mean. When I used it, I obtained the following. It shows that 98 of my one hundred
99% confidence intervals, and 93 of my one hundred 95% confidence intervals enclosed the
population mean. Every time we press SAMPLE, more confidence intervals are generated and a
running record is kept in the chart at the bottom of the window.
11. (a) (From the Calculator: x̄ = 47.8 and S x = 3.9) From 46.8 months to 48.8 months
(b) We do not know because we are not supplied with the population mean. We are dealing
with an unknown population.
(c) The sample mean and sample standard deviation would likely change, therefore we
would end up with a different confidence interval.
(d) Width would increase
(e) Width would decrease, assuming no significant change in the sample standard deviation.
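The applet described in answer 10 can be imitated with a short simulation, which also illustrates what "the method has a 0.95 probability of enclosing the population mean" means in practice. The sketch below is an illustration under stated assumptions, not a copy of the applet: only µ = 50 comes from the answer above, while the population standard deviation of 10, the sample size of 30, the normal population, and the use of a known-σ (z) interval are all hypothetical choices made for this example.

```python
import math
import random
import statistics
from statistics import NormalDist


def z_interval(sample, sigma, confidence):
    """Confidence interval for the mean, assuming sigma is known."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / math.sqrt(len(sample))
    centre = statistics.mean(sample)
    return centre - margin, centre + margin


def count_enclosing(mu, sigma, n, confidence, trials=100, seed=2):
    """Count how many of `trials` intervals enclose the population mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma, confidence)
        if low <= mu <= high:
            hits += 1
    return hits


# mu = 50 comes from the applet; sigma = 10 and n = 30 are hypothetical.
hits_95 = count_enclosing(mu=50, sigma=10, n=30, confidence=0.95)
hits_99 = count_enclosing(mu=50, sigma=10, n=30, confidence=0.99)
print(f"95% intervals enclosing the mean: {hits_95} of 100")
print(f"99% intervals enclosing the mean: {hits_99} of 100")
```

Because both calls use the same seed, the two sets of intervals are built from the same 100 samples, so every sample whose 95% interval encloses the mean also has a (wider) 99% interval that encloses it.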
Putting It Together (pages 79 to 87)
1. Population: all 1386 members of the sportsplex
Sample: the 230 randomly selected members
2. (a) Numerical, Discrete
(b) Categorical
(c) Numerical, Continuous
(d) Categorical
(e) Numerical, Discrete
(f) Numerical, Continuous
3. (a) Bimodal
(b) Skewed (left)
(c) Normal
(d) Uniform
4. (a) Population: All suitcases on domestic flights
(b) Histogram
(c) x̄ = 14.5 kg, Median = 14.8 kg, x̄(T) = 14.9 kg
(d) Median and Trimmed Mean
5. (a) Sample
(b) 13.8 g/dl
(c) 1.06 g/dl
6. (a) 204
(b) 285
(c) 598
(d) 489
(e) 584
(f) 503
(g) 81
(h) 14
7. (a) Systematic Sampling (preferred)
(b) Simple Random Sampling (preferred)
(c) Voluntary Response (poor)
(d) Stratified Sampling (preferred)
(e) Cluster Sampling (preferred)
8. (a) Normal
(b) µ x̄ = 107.2, centred on the bell curve
(c) 1.07
9. (a) µ = $11.52
(b) σ = $1.47
(c) x̄ = $11.71
(d) It represents the average cost of lunch for the 37 randomly selected customers at this
particular restaurant.
(e) n = 37
(f) S x = $1.46
(g) µ x̄ = $11.52
(h) σ x̄ = $0.24
(i) From $11.28 to $11.76
10. (a) No, the mean of the sample means should equal the population mean regardless of the
sample size.
(b) Yes, as the sample size increases, the standard deviation of the sample means decreases.
11. (a) From 283.3 to 290.9
(b) It means that the method that produced the interval from 283.3 to 290.9 has a 0.9
probability of enclosing the population mean. There is a 0.1 probability (or 10% chance)
that this method created an interval that does not enclose the population mean.
(c) 450
12. (a) x̄ = 178.3 cm, point estimate
(b) From 176.5 cm to 180.1 cm, interval estimator
(c) From 175.9 cm to 180.7 cm
(d) The method that produced the 99% confidence interval has a greater likelihood of
enclosing the true mean height (i.e. population mean) because it has a higher confidence
level and therefore results in a wider interval.
13. From 47.9 cm to 48.3 cm
It means that the method that produced the interval from 47.9 cm to 48.3 cm has a 0.95
probability of enclosing the true mean head circumference (i.e. population mean). There is a
0.05 probability (or 5% chance) that this method created an interval that does not enclose the
population mean.
We cannot tell whether the interval encloses the population mean because we are dealing
with an unknown population.
14. (a) From 32.31°C to 34.03°C
(b) Yes, sample means are random (not fixed).
(c) 30
(d) decrease
(e) increase
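Answer 12 can be used to illustrate how the confidence level alone changes the width of an interval. The sketch below is a hedged illustration: it assumes the interval in 12(b) is a 95% interval built from a normal (z) critical value, which the answer key does not state, and `rescale_interval` is a hypothetical helper written for this example.

```python
from statistics import NormalDist


def critical_value(confidence):
    """Two-sided standard normal critical value, e.g. about 1.96 for 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def rescale_interval(low, high, old_confidence, new_confidence):
    """Rebuild an interval at a new confidence level (same sample, z-based)."""
    centre = (low + high) / 2
    old_margin = (high - low) / 2
    ratio = critical_value(new_confidence) / critical_value(old_confidence)
    new_margin = old_margin * ratio
    return centre - new_margin, centre + new_margin


# Interval from answer 12(b), assumed here to be at the 95% level.
low_99, high_99 = rescale_interval(176.5, 180.1, 0.95, 0.99)
print(f"{low_99:.1f} cm to {high_99:.1f} cm")
```

Under those assumptions the result rounds to the interval given in 12(c), and the centre of the interval (the point estimate, 178.3 cm) does not move; only the margin on either side grows.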