1.1 Definitions
1.1.1 Define the terms: statistics, data, population, and sample.
Statistics refers to the science of collecting, organizing, presenting, analyzing, and
interpreting data to assist in making more effective decisions.
Data are measurements of one or more variables of a sample which was drawn from a
population.
A population is the complete collection of elements (scores, people, measurements) that
we want to study.
A sample is a set of individual elements (again scores, people, measurements…) taken from
a population.
1.2 Type of Statistics
1.2.1 Describe the difference between descriptive and inferential statistics
The study of Statistics (as with the chapters of our textbook) can be broken down into two
main types: Descriptive Statistics and Inferential Statistics.
Descriptive Statistics usually utilizes graphs, charts, tables, or calculations to describe
data. On your income tax brochure, there is usually a pie chart which shows the breakdown
of your tax dollar. It clearly identifies where your hard earned dollar goes.
The other type is known as Inferential Statistics and this usually is utilized when making a
statement, reaching a decision, or coming to some conclusions. An example would be when
the local chip truck wants to trial test a new brand of crispy fries. A sample of the loyal
customers might be given the new chips and their responses to its tastiness and
marketability would give the owners sufficient information for launching the newer and
better chips!
In summary, descriptive statistics are methods of organizing, summarizing, and presenting
data in an informative way whereas inferential statistics can be a decision, an estimate, a
prediction, or a generalization about a population, based on a sample.
1.2.2 Identify the four types of data and the characteristics of each type.
Data are measurements of one or more variables of a sample which was drawn from a
population. This data can be classified into four types: nominal, ordinal, interval, or
ratio.
Types of Data:
Nominal data, as the name suggests, is essentially when the researcher puts a name to his
or her observations. Warranty cards and surveys often ask one to check the box that best
describes, say, one’s profession. The list may include secretarial; professional; skilled
trade. In such a case it is clear that we are simply naming data. Confusion may creep in
however when we use numbers as names. For instance, if we were comparing the
performance of men and women on some task, we might for ease of computation refer to
all men as 0 and all women as 1. Such numbers are still really just names, however, and
would have no more computational value than the simple label men or women.
Ordinal data is essentially the same as nominal data, but in this case the data may
meaningfully be arranged in order: for instance tall, medium, short. A psychologist may
want to rank children in a class on the basis of how skilled they are at reading. The
resultant data will tell us who is the best reader, who is the second best, and so on, but
gives us no information on how much the children differ from each other in terms of
reading skill.
Interval data is perhaps the most commonly-collected type of data in Psychology. In
interval data, the difference between any two adjacent numbers is equal to the
difference between any other two adjacent numbers. In other words, an interval scale
allows one to measure differences in size or magnitude. This may seem a confusing
concept, but bear in mind that the IQ scale, for instance, is an interval scale. If we say
that Jim scores 120 on an IQ test, and Tom 90, we can say that Jim’s score is 30 points
higher. We know that Jim’s score is the greater of the two, and we know by how much it is
greater. However it is not possible to say that Jim is one-third more intelligent than Tom.
This is because the IQ scale has no really meaningful absolute zero point.
Ratio data contains all of the attributes of interval data, but includes in addition an
absolute zero point. A good example is scores in an exam. Because of the properties of
the ratio scale, if Jim scores 100 in a test and Tom 50, we may meaningfully say that Jim
has scored twice as highly as Tom.
Note: Qualitative (or categorical) Data are non numerical data which includes nominal and
ordinal data.
Quantitative (or numerical) Data are numerical data which includes interval and ratio
data.
2.0 Descriptive Statistics
2.1 Discuss what is meant by a frequency distribution.
Frequency Distributions:
Sometimes we need to reduce a large set of data into a much smaller set of numbers that
can be more easily comprehended. For example, if you have recorded the
population sizes of 500 randomly selected cities, there is no easy way to examine these
500 numbers visually and learn anything. It would be easier to examine a condensed
version of this set of data, and this is where the frequency distribution comes into play.
Hence, a frequency distribution is a grouping of data into mutually exclusive classes
showing the number of observations in each.
Let's look at an example of a frequency distribution:
Class number    Number of dolls sold    Frequency
1               5000 up to 10000        1
2               10000 up to 15000       5
3               15000 up to 20000       2
4               20000 up to 25000       2
5               25000 up to 30000       6
6               30000 up to 35000       4
7               35000 up to 40000       9
8               40000 up to 45000       8
9               45000 up to 50000       4
10              50000 up to 55000       7
This frequency distribution describes the number of dolls sold by a group of companies.
For example, you can see that 1 company sold between 5000 and 10000 dolls, 5 companies
sold between 10000 and 15000 dolls, etc.
How many companies are there in total represented in the frequency distribution?
1+5+2+2+6+4+9+8+4+7= 48 companies.
If a company sells 10000 dolls exactly, which class would it be counted under?
It would be counted in the 10000 up to 15000 class NOT the 5000 up to 10000 class.
2.1.2 Define the terms: frequency, relative frequency, classes and class limits
frequency: how often something happens (count the number of times)
class: how the data is split up (For example, the data above is split up into 10 classes)
class limits: Class limits are the highest and lowest values in a particular class. For
example, in class number 2 above the lower class limit is 10000 and the upper class limit is
15000.
relative frequency: Relative class frequency is the percentage (given in decimal form) of
the data values which lie in each class.
i.e. relative frequency = (frequency of a given class) / (total number of values in the data)
example#1: What is the relative frequency of class #3?
example#2: What is the relative frequency of class#7?
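The two examples above can be checked with a short calculation. This is only a sketch in Python (the notes themselves do not use any programming language); the list `frequencies` is copied from the doll-sales table, and the function name `relative_frequency` is made up for illustration.

```python
# Frequencies for classes 1 to 10, copied from the doll-sales table above.
frequencies = [1, 5, 2, 2, 6, 4, 9, 8, 4, 7]

# Total number of values in the data (the 48 companies counted earlier).
total = sum(frequencies)

def relative_frequency(class_number):
    """Relative frequency (decimal form) of a class, numbered from 1."""
    return frequencies[class_number - 1] / total

rel_class_3 = relative_frequency(3)  # frequency 2 out of 48
rel_class_7 = relative_frequency(7)  # frequency 9 out of 48

print(round(rel_class_3, 4), round(rel_class_7, 4))
```

Multiplying either result by 100 expresses it as a percentage.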
2.1.3 Constructing a frequency distribution
When you are given data (a bunch of numbers) and you are asked to construct a frequency
diagram, there are certain steps you need to take.
The following example will illustrate.
Construct a frequency diagram for the following data on prices of vacation packages to
Europe.
Data for prices of European vacation packages:
2599, 3800, 9720, 4200, 1200, 9366, 8255,
4580, 5379, 5299, 3349, 2470, 3855, 1200,
9945, 899, 5208, 3800, 1199, 5399, 2100,
2100, 999, 7399, 3557, 2200, 6899, 9999
Steps:
1. Determine the number of classes (intervals) to use. (i.e. how to split the data up) This
requires judgment. It is best to have between ________________ classes. (10 is usually
good)
2. Come up with the class width. (CS) using the following equation:
(Tip! To make your frequency distribution easier to interpret, it is good to round to the
nearest ten, hundred, thousand, etc) Note: This will cause the number of classes to change
but not by much.
3. Make sure the lowest class contains the lowest data value and begin with a value that
makes the frequency distribution easy to interpret. In our case the lowest class must
include 899, so the first class could be 0-900. It doesn’t have to start at 0 though. We
could have picked 100-1000 or 200-1100 for our first class.
The frequency distribution for the data given above:
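The tallying in the steps above can be sketched in Python. Two things here are assumptions rather than part of the notes: the class-width rule (highest value minus lowest value, divided by the number of classes) is a commonly used one, chosen because the equation in step 2 is not reproduced in this copy, and the function name `frequency_distribution` is hypothetical. With 10 classes the rule gives (9999 - 899) / 10 = 910, which rounds to 900; starting at 0 with width 900, 12 classes are needed to reach 9999.

```python
prices = [2599, 3800, 9720, 4200, 1200, 9366, 8255, 4580, 5379, 5299,
          3349, 2470, 3855, 1200, 9945, 899, 5208, 3800, 1199, 5399,
          2100, 2100, 999, 7399, 3557, 2200, 6899, 9999]

def frequency_distribution(data, start, width, num_classes):
    """Count the values in each class 'lower up to upper' (upper limit excluded)."""
    counts = [0] * num_classes
    for value in data:
        index = (value - start) // width
        if not 0 <= index < num_classes:
            raise ValueError(f"{value} falls outside the classes")
        counts[index] += 1
    # One (lower limit, upper limit, frequency) row per class.
    return [(start + i * width, start + (i + 1) * width, count)
            for i, count in enumerate(counts)]

# Assumed choices: first class starts at 0, class width 900, 12 classes.
table = frequency_distribution(prices, start=0, width=900, num_classes=12)
for lower, upper, count in table:
    print(f"{lower} up to {upper}: {count}")
```

A value that sits exactly on a boundary (for example 1800) is counted in the class that starts there, matching the "up to" convention used in section 2.1.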
2.1.4 Construct a relative frequency distribution
Example: Use the frequency distribution constructed above in 2.1.3, and turn it into a
relative frequency distribution.
Steps:
Step 1: Do a frequency distribution.
Step 2: Calculate the relative frequency for each class and put this info into a new column
called “relative frequency”.
Price of ticket package ($)    Frequency
0-900                          1
900-1800                       4
1800-2700                      5
2700-3600                      2
3600-4500                      4
4500-5400                      5
5400-6300                      0
6300-7200                      1
7200-8100                      1
8100-9000                      1
9000-9900                      2
9900-10800                     2
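Step 2 can be sketched in a few lines of Python. The list `frequencies` is copied from the table above; the variable names are only illustrative.

```python
# Class frequencies from the vacation-package table, lowest class first.
frequencies = [1, 4, 5, 2, 4, 5, 0, 1, 1, 1, 2, 2]

total = sum(frequencies)  # 28 packages in the data set

# Relative frequency of each class, in decimal form.
relative = [f / total for f in frequencies]

for f, r in zip(frequencies, relative):
    print(f, round(r, 3))
```

A useful check is that the relative frequencies of all the classes add up to 1.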
2.1.5 Construct a cumulative frequency distribution
Example: Use the frequency distribution constructed above in 2.1.3, and turn it into a
cumulative frequency distribution.
Steps:
Step 1: Do a frequency distribution
Step 2: Add another column called “cumulative frequency” where, for each class, you add
the frequencies of all the classes below it to the frequency of the given class.
Price of ticket package ($)    Frequency
0-900                          1
900-1800                       4
1800-2700                      5
2700-3600                      2
3600-4500                      4
4500-5400                      5
5400-6300                      0
6300-7200                      1
7200-8100                      1
8100-9000                      1
9000-9900                      2
9900-10800                     2
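The running total described in Step 2 can be sketched in Python with `itertools.accumulate` from the standard library. The list `frequencies` is copied from the table above.

```python
from itertools import accumulate

# Class frequencies from the vacation-package table, lowest class first.
frequencies = [1, 4, 5, 2, 4, 5, 0, 1, 1, 1, 2, 2]

# Each entry is the frequency of its class plus all the classes below it.
cumulative = list(accumulate(frequencies))

print(cumulative)
```

The last cumulative frequency should always equal the total number of data values, which is 28 here.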
Extra Practice: Textbook readings: pg25-28, pg30, pg38-40
For additional practice try these in your textbook.
pg 31: #1-8
pg 41: #15a,b and 16a,b
Recommended problems #1
1. A set of data contains 53 observations. The lowest value is 42 and the largest is 129.
The data are to be organized into a frequency distribution.
a. How many classes would you suggest?
b. What would you suggest as the lower limit of the first class?
2. A manufacturing company produced the following number of units during the last 16
days.
27, 26, 27, 28, 27, 26, 28, 28,
27, 31, 25, 30, 25, 26, 28, 26
The information is to be organized into a frequency distribution.
a. How many classes would you recommend?
b. What class interval would you suggest?
c. What lower limit would you recommend for the first class?
d. Organize the information into a frequency distribution and determine the relative
frequency distribution.
e. Comment on the shape of the distribution.
3. An oil company has a number of outlets in the metropolitan Seattle area. The numbers
of oil changes at the Oak Street outlet in the past 20 days are:
65, 70, 98, 62, 55, 66, 62, 80, 79, 94,
59, 79, 51, 63, 90, 73, 72, 71, 56, 85
The data are to be organized into a frequency distribution.
a. How many classes would you recommend?
b. What class interval would you suggest?
c. What lower limit would you recommend for the first class?
d. Organize the number of oil changes into a frequency distribution.
e. Comment on the shape of the frequency distribution. Also determine the relative
frequency distribution.
4. The manager of a supermarket gathered the following information on the number of
times a customer visits the store during a month. The responses of 51 customers were:
5, 6, 3, 11, 9, 3, 7, 5, 3, 2, 3, 1, 3, 12, 12, 1, 1,
4, 4, 4, 14, 5, 7, 4, 1, 6, 6, 5, 2, 8, 5, 6, 4, 4,
15, 4, 4, 7, 1, 2, 4, 6, 1, 6, 5, 5, 10, 6, 6, 9, 8
a. Starting with 0 as the lower limit of the first class and using a class interval of 3,
organize the data into a frequency distribution.
b. Describe the distribution. Where do the data tend to cluster?
c. Convert the distribution to a relative frequency distribution.
5. The food services division of an amusement park is studying the amount families who
visit the amusement park spend per day on food and drink. A sample of 40 families who
visited the park yesterday revealed they spent the following amounts.
77, 50, 63, 63, 18, 34, 62, 58, 63, 44,
62, 61, 84, 41, 65, 71, 38, 58, 61, 54,
58, 52, 50, 53, 60, 59, 51, 60, 54, 62,
45, 56, 43, 66, 36, 52, 83, 26, 53, 71
a. Organize the data into a frequency distribution, using seven classes and 15 as the lower
limit of the first class. What class interval did you select?
b. Where do the data tend to cluster?
c. Describe the distribution.
d. Determine the relative frequency distribution.
6. The frequency distribution representing the number of frequent flier miles accumulated
by employees at a consulting company is represented below:
Frequent flier miles (in thousands)    Frequency
0 up to 3                              5
3 up to 6                              12
6 up to 9                              23
9 up to 12                             8
12 up to 15                            2
total                                  50
a. How many employees accumulated less than 3000 miles?
b. Convert the frequency distribution to a cumulative frequency distribution.
7. The frequency distribution of order lead time at a firm is:
Lead time (days)    Frequency
0 up to 5           6
5 up to 10          7
10 up to 15         12
15 up to 20         8
20 up to 25         7
total               40
a. How many orders were filled in less than 10 days? In less than 15 days?
b. Convert the frequency distribution to a cumulative frequency distribution.
MA1670 Comprehensive Assignment: 20%, out of 337 marks
Show all workings. All questions must be done by hand unless otherwise specified. Full
marks will not be given if workings are insufficient.
1. a. Construct and fully label a relative frequency distribution to represent the data
below. The data represents the price of books on one shelf in a bookstore. (4 marks)
16, 22, 10, 25, 19, 5, 60,
28, 30, 40, 85, 80, 40, 45,
30, 10, 15, 22, 28, 37, 10
b. Determine the mean, median, mode, and midrange(omit midrange) of the data. (You may
do this by hand or by using technology) (4 marks)
c. Construct a histogram to represent the data. (4 marks)
2. The following ogive represents the energy efficiencies for a group of buildings owned by
a company. Answer each of the following questions.
a. Approximately what percentage of the company’s buildings has an energy efficiency of
220 kWh/sq m/yr? (1 mark)
b. 70% of buildings have less than what energy efficiency? (1 mark)
3. The following pie chart represents the favourite pies of a class of 300 students.
a. How many students chose apple? (don’t give the percent) (1 mark)
b. How many students in total chose pecan and coconut cream? (1 mark)
4. Use the data below to find the following:
6, 29, 9, 45, 23, 30, 36, 10, 26, 30
a. 10th percentile (4 marks)
b. 29th percentile (4 marks)
c. the third quartile (4 marks)
5. Through calculating standard deviation, determine which set of data below (set A or set
B) has greater variation (i.e. the data has greater dispersion). (9 marks)
Set A: 5, 6, 9, 9, 11, 13
Set B: 1, 5, 9, 11, 12, 15
(omit)6. Use Chebychev’s Theorem to determine what proportion of data will generally fall
within +2 or -2 standard deviations of the mean of the data? (note: this does not refer to
the data in #5) (2 marks)
7. Use the frequency distribution below to answer the following questions.
# of hours worked    Frequency
0-5                  4
5-10                 2
10-15                0
15-20                5
a. What is the mean? (5 marks)
b. What is the standard deviation? (5 marks)
(omit)8. The price for a particular type of jeans has changed over the years, as seen in the
data below.
Year    Price ($)
1980    25
1985    32
1990    40
1995    52
a. Using the 1980 price as the base value, what is the index number for the 1990 cost? (1
mark)
b. Using the 1990 price as the base value, what is the index number for the 1995 cost? (1
mark)
9. A bag contains 58 balls.
red    white    blue
37     12       9
You pick one ball.
a. What is the probability that the ball is red? (1 mark)
b. What is the probability that the ball is blue or red? (1 mark)
c. Are the events, “pick a blue ball” and “pick a red ball” mutually exclusive? (assume you
only pick one ball) (1 mark)
You pick two balls.
d. What is the probability that the first ball is red and the second ball is blue if you
replace your first pick? (2 marks)
e. What is the probability that the first ball is red and the second ball is blue if you do not
replace your first pick? (2 marks)
f. What is the probability that you pick two white balls in a row if you replace your first
pick? (2 marks)
g. What is the probability that you pick two white balls in a row if you do not replace your
first pick? (2 marks)
You pick five balls.
h. What is the probability that you pick five red balls in a row if you do not replace any of
your picks? (2 marks)
10. A bag contains the following mixture of balls.
            red    white    blue
glittery    20     2        6
dull        17     10       3
If you pick one ball:
a. What is the probability that it is red and glittery? (1 mark)
b. What is the probability that it is red or glittery? (2 marks)
c. What is the probability that it is dull? (1 mark)
d. What is the probability that it is blue and dull? (1 mark)
e. What is the probability that it is dull or white? (2 marks)
f. What is the probability that it is white or blue? (2 marks)
If you pick two balls:
g. What is the probability that you pick a red, dull ball and then a white, glittery ball, if
you replace your first pick? (2 marks)
h. What is the probability that you pick a red, dull ball and then a white, glittery ball, if
you do not replace your first pick? (2 marks)
If you pick three balls:
i. What is the probability that you pick 3 glittery, blue balls in a row if you replace your
picks each time? (2 marks)
j. Which of the following is more likely: (2 marks)
 picking three red, glittery balls in a row if you do not replace any of your picks
or
 picking three dull, blue balls in a row if you replace your picks each time
11. A hairdresser knows how to do eight different haircuts, has five different dyes, and
four different types of highlights. How many different hairdos are possible for someone
who wishes to get a haircut, dye their hair, and get highlights? (2 marks)
12. In how many ways can horses in a 10-horse race finish first, second, and third? (3
marks)
13. How many different simple random samples of size 4 can be obtained from a population
whose size is 20? (3 marks)
14. What is the probability of flipping a coin and getting five heads in a row? (2 marks)
15. In a class, 45% are women and 55% are men. A test is given and it is determined that
10% of the women failed and 15% of the men failed.
a. What is the probability that if a student failed, then they are male? (4 marks)
b. What is the probability that if a student passed, then they are female? (4 marks)
16. State whether each of the following variables is discrete or continuous.
a. number of newspapers sold (1 mark)
b. number of days missed (1 mark)
c. height of students (1 mark)
d. length of table (1 mark)
17. The table below represents a discrete probability distribution.
x       0       1    2       3       4
P(x)    0.10    0    0.40    0.35    0.15
a. Find the probability that x is:
i. exactly 3 (1 mark)
ii. 3 or less (1 mark)
iii. more than 2 (1 mark)
b. Compute the mean of the distribution. (3 marks)
c. Compute the variance of the distribution. (3 marks)
d. Compute the standard deviation of the distribution. (1 mark)
(omit)18. It is determined that 20% of the population in a town have type A-positive blood.
A simple random sample of size 7 is taken and the number of people X with blood type
A-positive is recorded. Determine the probabilities of the following events using the
binomial formula.
a. none of them have type A-positive blood (4 marks)
b. exactly two of them have type A-positive blood (4 marks)
c. exactly five of them have type A-positive blood (4 marks)
19. A telemarketer makes 9 phone calls per hour and is able to make a sale on 10% of these
contacts. Determine (using your choice of method)
(omit)a. the probability of making exactly four sales? (1 mark)
(omit)b. the probability of making exactly six sales (1 mark)
(omit)c. the probability of making no sales? (1 mark)
d. the mean number of sales (3 marks)
e. the variance of the sales (3 marks)
f. the standard deviation of the sales (1 mark)
(omit)g. the probability of making at least 6 sales (hint: this equals the probability of
making 6 sales plus the probability of making 7 sales plus….etc.) (2 marks)
20. The annual commissions earned by sales representatives at a company follow a normal
distribution. The mean yearly amount earned is $40 000 and the standard deviation is
$5000. (4 marks each)
a. What percent of the sales representatives earn between $40 000 and $42 000?
b. What percent of the sales representatives earn more than $42 000?
c. What percent of the sales representatives earn less than $42 000?
d. What percent of the sales representatives earn between $32 000 and $42 000?
e. What percent of the sales representatives earn between $32 000 and $35 000?
21. The weights of cans of pears follow the normal distribution with a mean of 1000 grams
and a standard deviation of 50 grams. Calculate the percentage of cans that weigh: (4
marks each)
a. less than 860 grams
b. between 1055 and 1100 grams
c. between 860 and 1055 grams
22. Of a telemarketer’s calls, 0.20 are successful. Suppose that 75 of the telemarketer’s
calls are randomly selected. Use the normal approximation to the binomial distribution to
determine the probability that: (4 marks each)
a. fewer than 20 calls are successful
b. fewer than 13 calls are successful
c. more than 16 calls are successful
d. more than 14 calls are successful
23. State what method of probability sampling each of the following is.
a. A sample of a whole school needs to be taken so a sample of students from each class is
taken to represent the school. (1 mark)
b. A sample of a whole school needs to be taken so one class is chosen for the sample to
represent the entire school. (1 mark)
c. A sample of a whole school needs to be taken so all students names are put in a hat and
names are drawn from it to act as the sample. (1 mark)
d. A sample of a whole school needs to be taken so every 15th student is chosen for the
sample. (1 mark)
24. The standard deviation of a population is 1.6 and the mean is 24.6
a. What is the standard deviation of the sampling distribution of the sample mean if a
sample size of 8 is chosen? (2 marks)
b. What is the mean of the sampling distribution of the sample mean? (1 mark)
c. What is the standard error of the mean? (1 mark)
d. How does the spread of the sampling distribution of the sample mean compare with the
spread of the population? (1 mark)
25. The mean of a population is 140 and the standard deviation is 5.2.
a. What is the value above which 8% of the data lies? (4 marks)
b. What is the value above which 70% of the data lies? (4 marks)
c. What is the value below which 10% of the data lies? (4 marks)
26. A company wishes to determine whether the mean rating score for its employees is
greater than 85. The rating score was determined for a random sample of 122 managers
with the following results: the sample mean was 99.6 and the sample standard deviation
was 12.6. Can the company conclude that the mean rating score for employees is greater
than 85? Use the 0.025 significance level. (8 marks)
27. In 1995, 74% of the population felt that men were more aggressive than women. In a
poll, a simple random sample of 1026 people 18 years old or older resulted in 698
respondents stating that men were more aggressive than women. Is there significant
evidence to indicate that the proportion of people who believe that men are more
aggressive than women has decreased from the level reported in 1995 at the α=0.05 level
of significance? (8 marks)
28. A department wishes to determine whether the mean number of suicide bombings for
all Al Qaeda attacks against the US differs from 2.5. A sample of 21 recent incidents
involving suicide terrorist attacks was analyzed and the sample mean was 1.86 with a
sample standard deviation of 1.20. Use the 0.10 significance level to determine whether
the mean number of suicide bombings differs from 2.5. (8 marks)
29. In order to compare the means of two populations, independent random samples of 64
observations were selected from each population, with the following results:
Population 1
Population 2
At the 0.05 significance level, can we conclude that the mean of population one is greater
than the mean of population two? (8 marks)
30. In order to compare the means of two populations, independent random samples of 20
observations were selected from each population, with the following results:
Population 1
Population 2
At the 0.05 significance level, can we conclude that the mean of population one is greater
than the mean of population two? (10 marks)
31. A company wishes to determine whether the proportion of super-experienced auction
bidders who fall victim to the “winner’s curse” (i.e. the phenomenon of the winning bid price
being above the expected value of the item being auctioned.) is different from the
proportion of less-experienced bidders. In the super-experienced group, 29 of 189 winning
bids were above the item’s expected value, while in the less-experienced group, 32 of 149
winning bids were above the item’s expected value. At the 0.10 significance level, can we
conclude that there is a difference in the proportions of each population? (8 marks)
(omit)32. A real estate agent wants to compare the variation in the selling price of homes
on the ocean with that of homes one block from the ocean. A sample of 21 homes on the
ocean revealed the standard deviation of the selling prices was $15600. A sample of 18
homes that were sold one block from the ocean revealed the standard deviation of the
selling prices to be $11330. At the 0.01 significance level, can we conclude that there is
more variation in the selling prices of the homes sold on the ocean? (8 marks)
33. The data in the following table resulted from an experiment that used a completely
randomized design.
Treatment 1    Treatment 2    Treatment 3
3.8            5.4            1.3
1.2            2.0            0.7
4.1            4.8            2.2
5.5            3.8
2.3
Can we conclude that at least two of the means of the treatment groups differ? Use the
0.01 level of significance. (16 marks)
34. A six-sided die is rolled 30 times and the numbers 1 through 6 appear as shown in the
following frequency distribution. At the 0.10 significance level, can we conclude that the
categories are not all equal? (12 marks)
outcome      1    2    3    4    5    6
frequency    3    6    2    3    9    7
35. Traffic experts wanted to determine whether there is a relationship between cell
phone use and having a car accident. The data below was gathered. Using the 0.05
significance level, can we conclude that there is a relationship between the two variables?
(12 marks)
                         Had accident in last year    Did not have accident in last year
Cell phone in use        25                           300
Cell phone not in use    50                           400
36. Use the data below to answer the following questions.
x    1    6    2
y    4    8    3
a. Determine the coefficient of correlation. (6 marks)
b. Determine the regression equation. (8 marks)
c. Determine the value of y ' when x=5. (1 mark)
d. Determine the standard error estimate. (5 marks)
e. Determine the 95% confidence interval for the mean predicted when x=5. (5 marks)
37. A commuter airline selected a random sample of 25 flights and found that the
correlation between the number of passengers and the total weight, in pounds, of luggage
stored in the luggage compartment is 0.94. Using the 0.05 significance level, can we
conclude that there is a positive correlation between the two variables? (8 marks)
2.2 Graphical Techniques
2.2.1 Describe the shape of a frequency distribution
There are several ways that the information contained in a frequency distribution (what we learned how
to construct in the last section) can be graphically displayed to better display the information. Some of
these ways are histograms, frequency polygons, bar charts, stem and leaf diagrams, pie charts, and
ogives. The next few sections will describe how to analyze and construct these graphical displays.
2.2.2 Construct and analyze a histogram (read pg 32-34)
One of the most common ways to portray a frequency distribution is a histogram.
After you complete a frequency distribution, your next step will be to construct a "picture" of these data
values using a histogram. A histogram is a graphical representation of a frequency distribution. It
describes the shape of the data. You can use it to answer quickly such questions as, are the data
symmetric? Or where do most of the data values lie? Let’s look at the histogram for the following
frequency distribution:
Class number    Class                Frequency
1               250 and under 350    4
2               350 and under 450    8
3               450 and under 550    20
4               550 and under 650    8
5               650 and under 750    7
6               750 and under 850    3
Histogram:
Notes on histograms:
In a histogram, the classes are marked on the horizontal axis and the class frequencies are marked on
the vertical axis. The class frequencies are represented by the heights of the bars and the bars touch
each other.
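A rough text version of the histogram for the table above can be sketched in Python. This is only an illustration: it is turned on its side (classes run down the page and each bar runs across, one "#" per observation), and the function name `text_histogram` is made up.

```python
# Classes and frequencies copied from the frequency distribution above.
classes = [(250, 350), (350, 450), (450, 550),
           (550, 650), (650, 750), (750, 850)]
frequencies = [4, 8, 20, 8, 7, 3]

def text_histogram(classes, frequencies):
    """Return one line per class; the bar length equals the class frequency."""
    lines = []
    for (lower, upper), freq in zip(classes, frequencies):
        label = f"{lower} and under {upper}"
        lines.append(f"{label} | {'#' * freq} ({freq})")
    return lines

lines = text_histogram(classes, frequencies)
for line in lines:
    print(line)
```

Even in this rough form, the shape is easy to read: the longest bar is the 450 and under 550 class, with the frequencies falling away on both sides.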
Analyzing a histogram: Below is a histogram developed from the frequency distribution of the prices of
vehicles sold at a dealership.
The histogram provides an easily interpreted visual representation. We can see at a glance that a total
of 23 vehicles sold in the price range of $15,000 to $18,000. We can also readily see that 58 vehicles
(72.5%) sold in the range of $15,000 to $24,000.
Try answering these additional questions.
a. How many vehicles were sold in total? ____________
b. How many vehicles were sold between $21000 and $33000? ____________
c. What is the relative frequency of vehicles sold in the $21000 and $24000 range? ___________
d. Express part c as a percentage. ____________
e. what percentage of vehicles was sold between $12000 and $18000? ____________
Example #1: Construct a histogram given the frequency distribution given below.
Class number    Hours spent studying    Frequency
1               0 up to 3               2
2               3 up to 6               4
3               6 up to 9               7
4               9 up to 12              0
5               12 up to 15             5
Example #2: Answer the following questions on the histogram below, which represents employee salary
data for a privately-owned small company.
a. How many people in the company earned between $15,240 and $17,430?
b. How many people in the company earned between $6480 and $19,620?
c. In what salary range were 23 employees earning?
d. How many people in total worked for the company?
e. What is the relative frequency of people who earned between $6480 and $10,860?
f. What percentage of people earned between $15,240 and $26,190?
2.2.3 Construct and analyze a frequency polygon (pg 34-36)
Frequency Polygons:
Although a histogram does demonstrate the shape of the data, perhaps the shape can be more clearly
illustrated by using a frequency polygon. Here, you merely connect the centers of the tops of the
histogram bars (located at the class midpoints) with a series of straight lines. The resulting multi-sided
figure is a frequency polygon. Let’s look at the frequency polygon for the histogram in section 2.2.2.
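The points that get joined in a frequency polygon can be worked out from the class limits. Here is a minimal sketch in Python for the frequency distribution in section 2.2.2; the variable names are only illustrative.

```python
# Classes and frequencies from the frequency distribution in section 2.2.2.
classes = [(250, 350), (350, 450), (450, 550),
           (550, 650), (650, 750), (750, 850)]
frequencies = [4, 8, 20, 8, 7, 3]

# The class midpoint is halfway between the lower and upper class limits.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

# Each point of the polygon is (class midpoint, class frequency).
points = list(zip(midpoints, frequencies))

for x, y in points:
    print(x, y)
```

Plotting these points in order and connecting them with straight lines gives the polygon.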
Example #1: Construct a frequency polygon for the frequency distribution below.
Class number    Number of cars sold    Frequency
1               0 up to 20             1
2               20 up to 40            4
3               40 up to 60            8
4               60 up to 80            4
5               80 up to 100           0
For extra practice the following exercises are recommended: pg41 #13-16
2.2.7 Construct and analyze an ogive. (not in text)
Step 1: To construct an ogive, you must first construct a relative frequency distribution. (section 2.1.4)
Step 2: Then you must take the cumulative frequencies of the relative frequencies.
Step 3: The data contained in the cumulative relative frequency distribution is then graphed.
The following example will illustrate.
Example#1: Construct an ogive based on the data contained in the frequency distribution below.
Number of days of work    Frequency
250 up to 350             4
350 up to 450             8
450 up to 550             20
550 up to 650             8
650 up to 750             7
750 up to 850             3
Ogive:
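The points to plot for Example #1 can be sketched in Python. One detail here is a choice made for this sketch, not something stated in the notes: the frequencies are accumulated first and then divided by the total, which gives the same cumulative relative frequencies as Steps 1 and 2 while avoiding small rounding errors. Plotting each value against the upper class limit, with a starting point of 0 at the lowest class limit, is also an assumption about how the graph is drawn.

```python
from itertools import accumulate

# Upper class limits and frequencies from the table in Example #1.
upper_limits = [350, 450, 550, 650, 750, 850]
frequencies = [4, 8, 20, 8, 7, 3]

total = sum(frequencies)                         # 50 observations
cumulative_counts = list(accumulate(frequencies))

# Cumulative relative frequency at each upper class limit,
# starting from 0 at the lowest class limit (250).
points = [(250, 0.0)] + [(upper, count / total)
                         for upper, count in zip(upper_limits, cumulative_counts)]

for x, y in points:
    print(x, y)
```

Reading the list answers questions like "what percentage of values are less than 650?" directly: find the point at 650 and multiply its second number by 100.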
Example #2: Answer the following questions on the ogive below.
The following ogive shows the annual transactions counts for a company.
a. What is the class interval width? ___________
b. What percentage of transactions are less than 400? ____________
c. About 75% of transactions are less than what value? ____________
d. About 85% of transactions are less than what value? ____________
e. What percent of transactions are less than 300? ____________
f. 100% of the transactions are less than what value? ____________
2.2.4 Construct and analyze a bar chart. (pg 43)
A bar chart can be used to depict any of the types of data (nominal, ordinal, interval, or ratio). A bar
chart is similar to a histogram in that the height of a bar represents the frequency of the class. The bars
don’t touch in a bar chart though.
Example #1: Construct a bar chart for the following data.
Type of electrical equipment   Number sold
TV                             40
DVD player                     37
CD player                      22
Radio                          10
Example #2: Analyze the following bar chart.
a. Vendor E had how much sales? ______________
b. How much sales was earned in total by all of the vendors? ______________
(recommended problems: pg47 #22)
2.2.5 Construct and analyze a stem and leaf diagram (not in text)
A stem-and-leaf display is a statistical technique for displaying a set of data. Each number is divided into
two parts: the leading digit becomes the stem and the trailing digit becomes the leaf.
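As a rough sketch of the idea, the split into stem and leaf can be done with integer division by 10. The function name `stem_and_leaf` is invented for this illustration, and the data are Lisa's grades from Example #1; note that a one-digit value such as 8 gets stem 0 and a three-digit value such as 105 gets stem 10.

```python
from collections import defaultdict

grades = [8, 74, 54, 23, 50, 84, 66, 80, 44, 88, 67, 90, 72, 105, 84]

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its leading digits (stem)."""
    display = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        display[stem].append(leaf)
    return dict(display)

display = stem_and_leaf(grades)
# Print every stem in order, including stems that have no leaves
for stem in range(min(display), max(display) + 1):
    leaves = " ".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```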
Example #1: Lisa receives the following grades on her accounting assignments:
8, 74, 54, 23, 50, 84, 66, 80, 44, 88, 67, 90, 72, 105, 84
Represent the data in a stem-and-leaf display.
stem
leaf
Example #2: Answer the questions about the stem-and-leaf display below representing the per capita
GNP for each country in western Africa.
1. How many countries was data collected for? ______________
2. Did any of the countries have a GNP of $700? ______________
3. How many countries have a GNP less than $400? ______________
4. Did any countries have the same GNP? ______________
2.2.6 Construct and analyze a pie chart. (pg 44 in text)
Pie Chart - A pie chart is especially useful in displaying a relative frequency distribution. A circle is
divided proportionally into the relative frequency and portions of the circle are allocated for the
different groups. Its purpose is to show the relative comparison between parts of a total.
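To draw a pie chart by hand, each group's relative frequency is multiplied by 360 degrees to get the central angle of its sector. The sketch below illustrates this with the pet counts from Worksheet #2 (question 3); the variable names are just for illustration.

```python
# Pet counts from Worksheet #2, question 3
counts = {"Dog": 25, "Cat": 20, "Fish": 10, "Other": 30}

total = sum(counts.values())

# Central angle = relative frequency x 360 degrees
angles = {pet: count / total * 360 for pet, count in counts.items()}

for pet, angle in angles.items():
    print(f"{pet}: {angle:.1f} degrees")
```

The angles should always add up to 360 degrees, which is a handy check on the arithmetic.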
Example: Answer the following questions on the pie chart below representing the favourite movie
genres in a school of 200 students.
a. Horror movies account for what percentage of favourite movies in the school? _________
b. How many students in the school cite horror as their favourite genre? ___________
c. How many students combined cite foreign or romance as their favourite genres? __________
d. If you tried to draw this pie-chart, how many degrees (for the central angle of each sector) would
each genre have? Remember: a circle is 360º!
genre             Central angle measure for sector
Comedy            ________
Action            ________
Romance           ________
Drama             ________
Horror            ________
Foreign           ________
Science fiction   ________
e. All of your angle measures in part d should add up to what value?
recommended problems: pg37#9(omit f), 11a,c,e,12a,d,e
pg46#17,20,22
Worksheet #2
1. Answer the questions on the following ogive representing the amount of money raised by a class of
students.
a. What percentage of students raised less than $500?
b. 20% of students raised how much money?
c. What percentage of students raised less than $30?
d. 90% of students raised how much money?
e. How much money was raised in total? (i.e. by 100% of the students)
2. Construct a stem and leaf diagram for the following data.
24, 27, 28, 72, 66, 52, 50, 34, 39, 30
3. Construct a pie chart based on the following data. (A class of students with pets was polled.)
Pet     Frequency
Dog     25
Cat     20
Fish    10
Other   30
Recommended Problems #2
1. A store has several retail stores in the coastal areas of North and South Carolina. Many of the
customers ask the owner to ship their purchases. The following chart shows the number of packages
shipped per day for the last 100 days.
a. What is the chart called?
b. What is the total number of frequencies?
c. What is the class interval?
d. What is the class frequency for the 10 up to 15 class?
e. What is the relative frequency of the 10 up to 15 class?
g. On how many days were there 25 or more packages shipped?
2. The following frequency distribution reports the number of frequent flier miles, reported in
thousands, for employees of a consulting firm during the first quarter of 2004.
Frequent flier miles (in thousands)   Number of employees
0 up to 3                             5
3 up to 6                             12
6 up to 9                             23
9 up to 12                            8
12 up to 15                           2
total                                 50
a. How many employees were studied?
b. Construct a histogram.
c. Construct a frequency polygon.
3. An internet retailer is studying the lead time (elapsed time between when an order is placed and
when it is filled) for a sample of recent orders. The lead times are reported in days.
Lead time (days)   Frequency
0 up to 5          6
5 up to 10         7
10 up to 15        12
15 up to 20        8
20 up to 25        7
Total              40
a. How many orders were studied?
b. Draw a histogram.
c. Draw a frequency polygon.
4. A small business consultant is investigating the performance of several companies. The fourth-quarter
sales in 2004 (in thousands of dollars) for the selected companies were:
Corporation               Fourth-quarter sales (thousands of dollars)
Hoden building products   1645.2
J and R printing          4757.0
Long Bay concrete         8913.0
Mancell electric          627.1
Maxwell heating           24612.0
Mizelle roofing           191.9
The consultant wants to include a chart in his report comparing the sales of the six companies. Use a bar
chart to compare the fourth quarter sales of these corporations.
5. A report prepared for the mayor of a city indicated that 56 percent of the city’s tax revenue went to
education, 23 percent to the general fund, 10 percent to the counties, 9 percent to senior programs, and
the remainder to other social programs. Sketch a pie chart to show the breakdown of the budget. If 3.5
million dollars is generated in tax revenue, how much went to education?
6. Shown below are the military and civilian personnel expenditures for the eight largest military locations
in the US. Develop a bar chart to represent the data.
Location      Amount spent (millions)
St. Louis     6087
San Diego     4747
Pico Rivera   3272
Arlington     3284
Norfolk       3228
Marietta      2828
Fort Worth    2492
Washington    2347
2.4 Numerical Techniques
2.4.1 Use the summation notation representation to sum numbers
The Greek letter $\Sigma$ (capital sigma) is used to denote the summation of a collection of numbers.
If we have a quantitative data set consisting of $x_1, x_2, x_3, \ldots, x_n$, this means that $x_1$ is the first measurement
in the data set, $x_2$ is the second, and $x_n$ is the nth and last measurement in the group. If we have five
measurements in a set and they are:
$x_1 = 5$, $x_2 = 3$, $x_3 = 8$, $x_4 = 5$, and $x_5 = 4$, then in order to add up this set, we use the symbol $\Sigma$ as follows:
$\sum x = x_1 + x_2 + x_3 + x_4 + x_5$
$= 5 + 3 + 8 + 5 + 4 = 25$
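In code, the summation symbol corresponds to adding up every item in a list. A minimal Python sketch using the five measurements above (the name `measurements` is just for illustration):

```python
# The five measurements x1 ... x5 from the example
measurements = [5, 3, 8, 5, 4]

# sum() plays the role of capital sigma: it adds every value in the list
total = sum(measurements)
print(total)
```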
2.4.2 Measures of Central Tendency
2.4.2.1/2.4.2.2 Discuss the roles of mean, median, midrange and mode as ways of measuring
central tendency in data and calculate them (readings: pg 59: “The population Mean”, pg64-65: “The
Median”, and pg65-66: “The Mode”)
Measures of Central Tendency:
The purpose of a measure of central tendency is to determine the "center" of your data values or
possibly the "most typical" data value. Some measures of central tendency are the mean,
median, midrange, & mode.
The mean:
The mean is the most popular measure of central tendency. It is merely the average of the data.
The mean is equal to the sum of the data values divided by the number of data values.
Mathematically the mean is given as follows:
$\text{mean} = \bar{x} = \frac{\sum x}{N}$
where N (or n) is the number of values in the data set
Take for example: The number of accidents reported over a particular 5 month period was: 6, 9,
7, 23, 5
So the mean of this sample is:
$\bar{x} = \frac{\sum x}{N} = \frac{6 + 9 + 7 + 23 + 5}{5} = \frac{50}{5} = 10$
Special note:
When we talk of mean, we can have a population mean or sample mean:
the symbol for the population mean is the Greek letter $\mu$ (mu)
the symbol for the sample mean is $\bar{x}$
The median:
The median of a set of data is the value in the center of the data values when they are arranged
from smallest to largest. Consequently, it is in the center of the ordered array. Using the
accident data set, the median (Md) is found by first constructing an ordered array:
5, 6, 7, 9, 23
so the median here is 7.
If there is an even amount of data like 3, 8, 12, 14 then Md is the average of the two center
values thus the median for these numbers is (8 + 12)/2 = 10
Note: In our accident data set, one of the five values (23) is much larger than the remaining
values; it is what we call an outlier (an out-of-whack data value). Notice that the median
(Median = 7) was much less affected by this value than was the mean ($\bar{x}$ = 10). When dealing
with data that are likely to contain outliers (for example, personal incomes or prices of
residential housing), the median usually is preferred to the mean as a measure of central
tendency, since the median provides a more "typical" or "representative" value for these
situations.
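The effect of the outlier can be checked quickly with Python's standard `statistics` module. This is only a sketch using the accident data from above; the comparison with the outlier removed is added here purely for illustration.

```python
from statistics import mean, median

accidents = [6, 9, 7, 23, 5]

accidents_mean = mean(accidents)        # pulled upward by the outlier 23
accidents_median = median(accidents)    # middle value of the ordered array

# With an even number of values, the median is the average of the two middle values
even_median = median([3, 8, 12, 14])

# Illustration only: drop the outlier and see how each measure reacts
without_outlier = [value for value in accidents if value != 23]
print(accidents_mean, mean(without_outlier))      # the mean changes a lot
print(accidents_median, median(without_outlier))  # the median barely moves
```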
(omit midrange) The Midrange:
Although less popular than the mean and median, the midrange (Mr) provides an easy-to-grasp
measure of central tendency. Notice that it is also severely affected by the presence of an
outlier in the data. The midrange is:
$\text{Midrange} = \frac{(\text{smallest value}) + (\text{largest value})}{2}$
For our accident data: $\text{Midrange} = \frac{5 + 23}{2} = \frac{28}{2} = 14.0$
The mode:
The mode of a data set is the value that occurs more than once and the most often. The mode is
not always a measure of central tendency; this value need not occur in the center of your data.
One situation in which the mode is the value of interest is the manufacturing of clothing. The
most common hat size is what you would like to know, not the average hat size.
For our accident data there is no mode since all values occur only once but let’s consider this
data set: 4, 8, 7, 6, 9, 8, 10, 5, 8
Here 8 occurs three times which is most often so Mode = 8.
Note: There can be more than one mode in a set of data.
For example: 1, 1, 3, 5, 7, 7, 9
There are two modes for the data above. (1 and 7)
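The definition used in these notes (a mode must occur more than once, and there may be several modes) can be sketched with a small helper. The function name `modes` is made up for this example, and returning an empty list for "no mode" is a choice made for this sketch.

```python
from collections import Counter

def modes(data):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)

print(modes([6, 9, 7, 23, 5]))              # accident data: no mode
print(modes([4, 8, 7, 6, 9, 8, 10, 5, 8]))  # a single mode
print(modes([1, 1, 3, 5, 7, 7, 9]))         # two modes
```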
Example: A sample of ten was taken to determine the typical completion time (in months) for
the construction of a particular model of Brockwood Homes:
4.1, 3.2, 2.8, 2.6, 3.7, 3.2, 9.4, 2.5, 3.5, 3.8
Find the
a. mean
b. median
c. midrange
d. mode
2.4.3 Measures of variation (pg74:”Measures of Dispersion: The Range”,pg77-78:
“Variance and Standard Deviation”)
Measures of Variation:
Variability:
Variability provides a quantitative measure of the degree to which scores in a distribution are
spread out or clustered together. The purpose of measuring variability is to determine how
spread out a distribution of scores is. Are the scores all clustered together, or are they scattered
over a wide range of values?
The range:
The range is the numerical difference between the largest value and the smallest value in a data
set. If the number of accidents reported over a 5 month period was 6, 9, 7, 23, and 5, then the range
for this data set is:
range = (largest value) - (smallest value) = 23 - 5 = 18
The range is a rather crude measure of variation, but it is an easy number to calculate and
contains valuable information for many situations. Stock reports generally give prices in terms
of ranges, citing the high and low prices of the day.
Note: The value of the range is strongly influenced by an outlier in the data set.
Standard deviation:
The standard deviation is the most commonly used and the most important measure of
variability. Standard deviation uses the mean of the distribution as a reference point and
measures variability by considering the distance between each score and the mean. It determines
whether the scores are generally near or far from the mean. That is, are the scores clustered
together or scattered? In simple terms, the standard deviation approximates the average
distance from the mean.
The standard deviation is the square root of the variance.
The standard deviation is a measure of the average distance between the values of the data
in the set and the mean. If the data points are all similar, then the standard deviation will be low
(closer to zero). If the data points are highly variable, then the standard deviation is high (further
from zero).
Calculating the variance and standard deviation given a set of numbers representing a population.
(Note: the formulas for a sample are different from those for a population.)
Step One: Calculate the mean of the population (represented as $\mu$).
Step Two: Calculate $(X - \mu)$ for each number ($X$ represents a number from the population).
Step Three: Square each difference from Step 2, i.e. $(X - \mu)^2$.
Step Four: Get the mean of the squares from Step 3, i.e. $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$, where N is the
number of numbers in the data. This is your variance.
Step Five: If the question asked for the standard deviation of the population, then you would just
take the square root of your variance, i.e. $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$.
It is helpful to do the calculations in a chart such as the one used below to find variance and
standard deviation.
Example#1: A stores sells the following numbers of TV’s over a week.
3, 2, 5, 0, 7
a. Find the variance in the data.
b. Find the standard deviation.
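The five steps can be traced in a few lines of Python. This is a sketch of the population calculation for the TV data in Example #1; the variable names are just for illustration, and Python's `statistics.pvariance` and `statistics.pstdev` should give the same results.

```python
from math import sqrt

tvs_sold = [3, 2, 5, 0, 7]
n = len(tvs_sold)

# Step One: the population mean
mu = sum(tvs_sold) / n

# Steps Two and Three: each difference from the mean, squared
squared_deviations = [(x - mu) ** 2 for x in tvs_sold]

# Step Four: the mean of the squared differences is the variance
variance = sum(squared_deviations) / n

# Step Five: the standard deviation is the square root of the variance
std_dev = sqrt(variance)

print(f"mean = {mu}, variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```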
Note, if you were finding the standard deviation of a sample, the formula would be
$s = \sqrt{\frac{\sum (X - \bar{x})^2}{n - 1}}$, where $\bar{x}$ is the sample mean and n is the sample size.
2.4.3.2 Discuss the interpretation of standard deviation.
As mentioned above, the standard deviation is a good measure of dispersion, or how
spread out the data is. The bigger the standard deviation the more variation there is in the
data and the lower the standard deviation, the lower the variation in the data.
The standard deviation tells us, on average, how far a given number is from the mean.
OMIT 2.4.3.3! 2.4.3.3 Discuss Chebyshev’s Theorem (pg 82: “Chebyshev’s
Theorem”)
For any set of observations, the proportion of the values that lie within k standard
deviations of the mean is at least $1 - \frac{1}{k^2}$, where k is any constant greater than 1.
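The bound is easy to evaluate for any k. A small sketch (the function name `chebyshev_lower_bound` is invented for this illustration):

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

print(chebyshev_lower_bound(2))              # at least 75% within 2 standard deviations
print(round(chebyshev_lower_bound(1.4), 4))  # bound for 1.4 standard deviations
```

Keep in mind that the theorem gives a minimum: the actual proportion for a particular data set may be higher.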
Examples:
1. At least what proportion of data will lie within 2 standard deviations of the
mean?
2. At least what proportion of data will lie within 1.4 standard deviations of the
mean?
recommended extra problems:page 84#49,#50
2.4.4 Measures of Position
2.4.4.2/2.4.4.2 Define and calculate percentile and quartile. (pg 97-99:”Quartiles,
Deciles, and Percentiles”—can omit deciles though!)
Percentiles divide a set of numbers into 100 equal parts. For example, if your GPA was in
the 66th percentile, then 66 percent of the students had a lower GPA.
A formula can be applied to determine the location in a list of numbers for a given
percentile.
$L_p = (n + 1)\frac{P}{100}$
$L_p$ represents the location of the desired percentile
n is the number of observations (numbers)
P is the desired percentile.
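The location formula can be sketched in Python as follows. Two things here are assumptions made for this illustration: the function names, and the handling of a location that is not a whole number, where the sketch moves the fractional part of the way from the value at the lower position to the next value (linear interpolation). Check this against the method used in class.

```python
def percentile_location(n, p):
    """L_p = (n + 1) * P / 100, counting positions from 1."""
    return (n + 1) * p / 100

def percentile(data, p):
    values = sorted(data)                  # percentiles need ordered data
    n = len(values)
    loc = percentile_location(n, p)
    if loc <= 1:
        return values[0]
    if loc >= n:
        return values[-1]
    lower = int(loc)                       # whole-number part of the location
    fraction = loc - lower                 # how far to move toward the next value
    return values[lower - 1] + fraction * (values[lower] - values[lower - 1])

data = [1, 3, 4, 5, 6, 7, 7]
print(percentile(data, 25), percentile(data, 50), percentile(data, 75))
```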
Examples:
1. Use the following data to answer the questions.
1, 3, 4, 5, 6, 7, 7
a. Find the 25th percentile of the above data.
b. Find the 50th percentile of the above data.
c. Find the 75th percentile of the above data.
Note: Percentile problems do not always work out perfectly! This is how you do
them if your L p does not work out to a whole number.
#2. Use the data below to answer the following questions.
20, 4, 7, 22, 11, 14, 1, 8
a. Find the 36th percentile.
b. Find the 62nd percentile.
c. Find the 83rd percentile.
Finding quartiles
Quartiles divide a set of observations (numbers) into four equal parts (quarters).
• The first quartile, usually labeled $Q_1$, is the value below which 25 percent of the
observations occur. This would be the 25th percentile as well.
• The second quartile, $Q_2$, is the value below which 50 percent of the observations
occur. This is also the median. This would be the 50th percentile as well.
• The third quartile, usually labeled $Q_3$, is the value below which 75 percent of the
observations occur. This would be the 75th percentile as well.
So if you are asked to calculate the first, second, or third quartile, just use the appropriate
percentile value in the formula $L_p = (n + 1)\frac{P}{100}$.
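For checking quartile answers, Python's standard library offers `statistics.quantiles`. With `method="exclusive"` it places the cut points using an (n + 1) location rule and interpolates between neighbouring values when the location is not a whole number; whether that interpolation matches the method taught in class is an assumption to verify. The data below are from Example 3.

```python
from statistics import quantiles

data = [2, 5, 50, 11, 46, 16, 22, 36, 15, 8]

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
```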
Examples
3. Use the data set below to answer the following questions.
2, 5, 50, 11, 46, 16, 22, 36, 15, 8
a. Find the first quartile.
b. Find the third quartile.
c. Find the second quartile.
Math Worksheet #3
1. The data below represents the number of times an item is returned in a store each day.
4 4 5 6 8 10 10 15 15 17 18 20 22
a. Find the mean.
b. Find the median.
c. Find the mode.
d. Find the midrange.
2. The standard deviation for class A is 2.4 and the standard deviation for class B is 13.5. Which
class has a greater variety of test scores?
3. If a set of data has a standard deviation of 10, what, on average, is the distance of a given
number from the mean?
4. What is the standard deviation if the variance is 5.8?
5. What is the variance if the standard deviation is 15.6?
6. Use the data below to answer the following questions.
2, 17, 18, 9, 9, 18
a. Calculate the mean.
b. Calculate the variance.
c. Calculate the standard deviation.
7. On average, the numbers in the data set in #6 fall within ___________ units of the mean.
8. A set of data has a standard deviation of 9.2. Does this set of data have more or less variation
than the data in #6? Why?
9. Why is the standard deviation a better measure of dispersion than the range?
10. List four measures of central tendency and three measures of dispersion.
11. Use the data below to answer the following:
2, 8, 3, 10, 6, 17, 29, 40, 36
a. Find the 70th percentile.
b. Find the 27th percentile.
c. Find the third quartile.
d. Find the first quartile.
12. a. Lisa scored higher than 72% of her class. What would be her percentile?
b. Janice scored lower than 11% of her class. What would be her percentile?
13. What percentage of data will lie within 1.56 standard deviations of the mean? (Hint: Use
Chebyshev’s Theorem)
2.5 Grouped Data (not in text)
Grouped data:
Sometimes we may have to work with data in the form of a frequency distribution, called
grouped data, when the raw data are not available. We do not have the data values
used to make up this frequency distribution, so we are forced to approximate the values
when calculating the mean and standard deviation.
Let’s take the following data for example:
Class Number   Class (age in years)   Frequency
1              20 and under 30        5
2              30 and under 40        14
3              40 and under 50        9
4              50 and under 60        6
5              60 and under 70        2
Total                                 36
The approximation for the sample mean here is:
$\bar{X} = \frac{\sum (f \cdot \text{mid})}{N}$
where f is the frequency of each class
where mid is the midpoint of each class
where N is the sample size (i.e. add up all the frequencies!)
So for our above example:
$\bar{X} = \frac{\sum (f \cdot \text{mid})}{N}$
Class   Frequency
20-30   5
30-40   14
40-50   9
50-60   6
60-70   2
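The grouped-mean approximation can be sketched in Python for the age data above. Each class is written as (lower limit, upper limit, frequency); the variable names are just for illustration.

```python
# (lower limit, upper limit, frequency) for each age class
classes = [(20, 30, 5), (30, 40, 14), (40, 50, 9), (50, 60, 6), (60, 70, 2)]

# N is the sum of the frequencies
N = sum(f for _, _, f in classes)

# Each class is represented by its midpoint, weighted by its frequency
sum_f_mid = sum(f * (low + high) / 2 for low, high, f in classes)

grouped_mean = sum_f_mid / N
print(f"N = {N}, sum of f x mid = {sum_f_mid}, mean = {grouped_mean:.2f}")
```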
The sample variance can be found using this formula:
$\text{Sample Variance} = s^2 = \frac{\sum (f \cdot \text{mid}^2) - \frac{\left(\sum (f \cdot \text{mid})\right)^2}{N}}{N - 1}$
Class   Frequency
20-30   5
30-40   14
40-50   9
50-60   6
60-70   2
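A sketch of the grouped sample variance for the same age data, following the formula above term by term (variable names are illustrative only):

```python
from math import sqrt

# (midpoint, frequency) for each age class
classes = [(25, 5), (35, 14), (45, 9), (55, 6), (65, 2)]

N = sum(f for _, f in classes)
sum_f_mid = sum(f * mid for mid, f in classes)          # sum of f x mid
sum_f_mid_sq = sum(f * mid ** 2 for mid, f in classes)  # sum of f x mid squared

sample_variance = (sum_f_mid_sq - sum_f_mid ** 2 / N) / (N - 1)
sample_std_dev = sqrt(sample_variance)

print(f"s^2 = {sample_variance:.2f}, s = {sample_std_dev:.2f}")
```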
(OMIT 2.6) 2.6 Index Numbers
Index numbers:
How many times have you heard a remark such as "Fifteen years ago we could have
bought that house for $20,000 but now it's $120,000"? To compare effectively the change
in the price or value of a certain item between any two time periods, we use an index
number. An index number (or index) measures the change in a particular item (or
collection of items) between two time periods.
The formula for calculating index numbers:
$\text{Index number} = \frac{x}{\text{base value}} \times 100$
x: the number you want to change to an index number
base value: the number that represents the base rate
Example 1: The average hourly wages for production workers at Kessler Toy Company
in 1975, 1980, 1985, and 1990 are shown below.
Year   Wage
1975   $6.40
1980   $7.05
1985   $8.50
1990   $10.90
If the base is chosen to be 1975, compute the index numbers for:
a. 1980
b. 1985
c. 1990
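The index-number formula can be sketched in Python using the Kessler Toy Company wages with 1975 as the base year. The function name `index_number` is made up for this illustration.

```python
wages = {1975: 6.40, 1980: 7.05, 1985: 8.50, 1990: 10.90}

def index_number(value, base_value):
    """Express a value as a percentage of the base value."""
    return value / base_value * 100

base = wages[1975]
indexes = {year: index_number(wage, base) for year, wage in wages.items()}

for year, index in indexes.items():
    print(f"{year}: {index:.1f}")
```

The base year always has an index of 100, so an index of about 110 means the wage is roughly 10% higher than in the base year.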
3.0 Elementary Probability
3.1 Introduction
3.1.1 Define the terms: probability, experiment, event, outcome, and sample space.
• Probability: A probability is a measure of the likelihood that an event will occur
when the experiment is performed.
• Event: An event consists of one or more possible outcomes of an experiment.
• Experiment: An activity for which the outcome is uncertain
• Outcome: An outcome is any particular result of an experiment.
• Sample Space: The set of all possible outcomes of an experiment
Examples:
Experiment: Rolling one die
Events: A = rolling a one D = rolling a four
B = rolling a two E = rolling a five
C = rolling a three F = rolling a six
Experiment: Taking a Statistics midterm
Events: A = pass
B = fail
3.1.2 State and describe the methods for assigning probabilities
Methods of assigning Probabilities:
1. Intuition: For example, the sports announcer claims that Shelia has a 90% chance of
breaking the world record in the upcoming 100 meter dash. The statement was probably
developed by looking at Shelia's past record and the announcer's confidence in her ability
as a basis for this prediction.
2. Relative frequency: For example, The Right to Health Lobby claims the probability
is 40% of getting a wrong report from the medical laboratory. This probability comes
from a sample of 100 reports in which 40 were wrong: 40/100 = 0.40, so the probability
of getting a wrong report is estimated to be 40% based on this information.
3. Formula for equally likely outcomes:
$\text{Probability of an event} = \frac{\text{number of outcomes favorable to the event}}{\text{total number of outcomes}}$
Example: Let’s say we roll a die. There are 6 different outcomes we can have
(1, 2, 3, 4, 5, 6), which together make up its sample space. If we wanted to know the probability of
rolling a six, we would note that there is only one outcome in which we could have a six.
So using our formula for equally likely outcomes:
Probability of getting a six = 1/6 = 0.167
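The equally-likely-outcomes formula amounts to counting. A minimal Python sketch, where the helper name `probability` and the use of exact fractions are choices made for this illustration:

```python
from fractions import Fraction

# Sample space for rolling one die
sample_space = [1, 2, 3, 4, 5, 6]

def probability(event, sample_space):
    """Favorable outcomes divided by total outcomes (all equally likely)."""
    favorable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favorable), len(sample_space))

p_six = probability(lambda outcome: outcome == 6, sample_space)
p_even = probability(lambda outcome: outcome % 2 == 0, sample_space)

print(p_six, round(float(p_six), 3))
print(p_even, float(p_even))
```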
3.1.3 Explain the use of the word probability
Probability represents the likelihood of an event occurring. The notation used is as
follows: The probability of event A happening: P(A)
If A = rolling an even number on a die then P(A) = 3/6 = 0.5.
This could also be written in long form instead of using the letter A. i.e. P(rolling an
even number)=0.5
Single event Probability Problems for Practice:
1. What is the probability of rolling a 4 on a die on a single roll? ________
2. What is the probability of rolling a 2 or a 5 on a die on a single roll? ________
3. What is the probability of rolling a 1,2,3,4 or 5 on a die on a single roll? ________
4. What is the probability of drawing the queen of hearts from a deck of 52 cards on a single
draw? ________
5. What is the probability of drawing a queen from a deck of cards on a single draw?
________
6. What is the probability of drawing a spade from a deck of cards on a single draw?
________
3.2 Counting Problems (pg142-143) “Principles of Counting”
3.2.1 State and use the formula to determine the number of possible outcomes of an
experiment
Counting Problems: Fundamental Counting Principle:
Counting rules determine the number of possible outcomes that exist for a certain
broad range of experiments. They can be extremely useful in determining probabilities.
The question we wish to answer here is, for a particular experiment, how many possible
outcomes are there?
We use the following rule:
n1 = the number of ways of filling the first slot
n2 = the number of ways of filling the second slot after the first slot is finished
n3 = the number of ways of filling the third slot after the first two slots are filled
...
nk = the number of ways of filling the kth slot after filling slots 1 through (k - 1)
# of possible outcomes = n1 x n2 x n3 x ... x nk
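The counting rule is just a product of the number of choices for each slot. A short sketch (the function name `count_outcomes` is invented here), applied to the numbers that appear in the examples in this section:

```python
from math import prod

def count_outcomes(*choices_per_slot):
    """Multiply the number of ways of filling each slot."""
    return prod(choices_per_slot)

outfits = count_outcomes(4, 2)             # 4 shirts, 2 pairs of pants
color_schemes = count_outcomes(8, 10, 4)   # interior, exterior, roof colors

print(outfits, color_schemes)
```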
Let’s look at the following examples:
1. John has 4 shirts and 2 pairs of pants. How many different outfits can he make?
2. A restaurant serves 3 items for breakfast and 5 different drinks. How many different
pairings can be made?
3. Kelly has 6 different purses, 3 different pairs of shoes, and 2 belts. How many combos
are there?
4. When ordering a new car, you have a choice of eight interior colors, ten exterior
colors, and four roof colors. How many possible color schemes are there?
Worksheet #4
1. Use the frequency distribution below to find the mean.
Cost ($)   Frequency
15-25      2
25-35      5
35-45      0
45-55      7
2. The cost of a Nike jacket was $85 five years ago, and today it is $120. If you use the
cost five years ago as the base rate, what is the index number for the cost of the jacket
today?
3. There are 10 balls in a bag. Three are red and seven are blue.
a. What is the probability of picking a red ball?
b. What is the probability of picking a blue ball?
4. You have a standard deck of cards. Determine the following probabilities.
a. picking the ace of spades
b. picking an ace
c. picking a red ace
d. picking a red card
5. A restaurant has three types of drinks, five breakfast options, nine lunch options, and
two dessert options. How many combinations of dinners are possible?
Note: For the next 2 sections (Permutation and Combinations) you will be
determining the number of ways you can “PICK” or “CHOOSE” a certain number
of things from a bigger number of things. So be on the look-out for these key words
in problems.
3.3 Permutations (pg143)
3.3.1 Define Permutation
A permutation is any arrangement of r objects selected from a single group of n possible
objects.
3.3.2 Analyze permutations of a set of data.
For permutations, the order is important, as you will see in the following example!
Say for example you have three people (Tim, Sarah, and John) and you want to pick 2 of
them. These are the different permutations you could come up with.
i.e. Tim, Sarah
Tim, John
Sarah, Tim
Sarah, John
John, Sarah
John, Tim
NOTE! The order “Tim, Sarah” and “Sarah, Tim” are counted as 2 separate
permutations!!!!! (In the next section on “Combinations”, they will only be counted as
1)
So it is important for you to ask yourself if the order is important. If it is, then AB and
BA are counted as two separate possibilities.
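Python's `itertools.permutations` lists ordered selections directly, which makes it a convenient way to check a list written out by hand. A sketch for the three people above:

```python
from itertools import permutations

people = ["Tim", "Sarah", "John"]

# Every ordered way of picking 2 of the 3 people
ordered_pairs = list(permutations(people, 2))

for first, second in ordered_pairs:
    print(f"{first}, {second}")
print(len(ordered_pairs), "permutations")
```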
Example: There are 4 people at an event. 2 people must be picked from the 4 to win two
different door prizes. (first place and second place) How many ways can the prizes be
distributed?
3.3.3 Define the factorial function.
A factorial is represented by an exclamation mark. It is defined as:
$n! = n \times (n - 1) \times (n - 2) \times \cdots \times 3 \times 2 \times 1$
Examples: Calculate the following factorials.
a. 5!
b. 7!
c. $\frac{6!}{3!}$
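Factorials can be checked with `math.factorial` from Python's standard library. A small sketch for the three calculations above:

```python
from math import factorial

five_factorial = factorial(5)          # 5 x 4 x 3 x 2 x 1
seven_factorial = factorial(7)

# In 6!/3! the 3 x 2 x 1 cancels, leaving 6 x 5 x 4
ratio = factorial(6) // factorial(3)

print(five_factorial, seven_factorial, ratio)
```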
3.3.4 Calculate permutations using the factorial function.
There is a formula to use to quickly determine the number of permutations, rather than
writing out all the arrangements like we did in 3.3.2:
$_nP_r = \frac{n!}{(n - r)!}$
where: n represents the total number of objects
r represents the number of objects to be selected
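The permutation formula can be written directly from the factorial definition and compared with Python's built-in `math.perm` (available in Python 3.8 and later). The function name `n_p_r` is made up for this sketch.

```python
from math import factorial, perm

def n_p_r(n, r):
    """Number of ordered selections of r objects from n: n! / (n - r)!"""
    return factorial(n) // factorial(n - r)

print(n_p_r(4, 2))   # 2 of 4 people, where order matters
print(n_p_r(8, 3))
print(perm(8, 3))    # the built-in gives the same count
```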
Example #1: Just try to do the previous example that we worked out the long way using
this formula. We had to pick 2 people out of 4 and the order mattered (first place and second place).
REMEMBER: This formula is used only when the order is important (i.e. when
“Tim, Lisa” and “Lisa, Tim” count as two different answers).
$_nP_r = \frac{n!}{(n - r)!}$
Example #2:
a. Calculate $_8P_3$
b. You are having a dinner party and want to arrange one side of a table. There are 6
chairs, but there are 9 people. How many different arrangements are possible if order
matters?
c. You are developing a quiz. You can pick 4 questions from a list of 9 possible
questions. If the ordering of the questions matters, how many possible arrangements are
there?
d. You need to make up a four digit student number. How many possible arrangements
are there to give to students? (Remember there are 10 possible numbers to choose from)
3.4 Combinations
3.4.1 Define combination (pg145)
If the order of the selected objects is NOT important, the selection is called a
combination.
3.4.2 Analyze the number of combinations possible
To illustrate the idea of combinations, we will first do a problem the long way. It will
help illustrate the difference between a permutation and a combination.
Example: There are 3 people (Tim, John, and Sarah). You want to make a group of two
students. How many different groups of 2 are there?
i.e. Tim, Sarah
Tim, John
Sarah, Tim
Sarah, John
John, Sarah
John, Tim
BUT, notice that a group with Tim and Sarah in it is THE SAME as a group with Sarah
and Tim in it. So there are only 3 groups.
Hence when the order is not important, as above, there are fewer combinations than
permutations.
3.4.3 Calculate combinations using the factorial formula.
$_nC_r = \frac{n!}{r!(n - r)!}$
where: n represents the total number of objects
r represents the number of objects to be selected
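As a sketch, the combination formula can be coded from the factorial definition and checked against `math.comb` (Python 3.8 and later). Listing the groups with `itertools.combinations` shows why the three-person example has only 3 groups. The function name `n_c_r` is invented for this illustration.

```python
from itertools import combinations
from math import comb, factorial

def n_c_r(n, r):
    """Number of unordered selections of r objects from n: n! / (r! (n - r)!)"""
    return factorial(n) // (factorial(r) * factorial(n - r))

# Unordered groups of 2 from the three people in the example
groups = list(combinations(["Tim", "John", "Sarah"], 2))
print(groups)

print(n_c_r(3, 2), comb(3, 2))   # both count the groups listed above
print(n_c_r(8, 5))               # 5 colours chosen from 8
print(n_c_r(9, 3))               # committee of 3 from 9 people
```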
Examples:
1. $_3C_2$ can be used to answer the example above.
2. You need to pick 5 colours from a list of 8 colours. How many different groups of 5
are there if the order of the colours is not important?
3. You need to pick 3 people for a committee from a list of 9. How many different
groupings are possible if the ordering does not matter?
4. You are picking 4 lotto numbers from a bag containing 10 numbers. How many
different arrangements are possible?
(OMIT 3.4.2) 3.4.2 Use Pascal’s Triangle to determine the number of combinations
of sets in a set of data
Pascal's Triangle for Combinations:
                      1
                    1   1
                  1   2   1
                1   3   3   1
              1   4   6   4   1
            1   5  10  10   5   1
          1   6  15  20  15   6   1
        1   7  21  35  35  21   7   1
      1   8  28  56  70  56  28   8   1
    1   9  36  84 126 126  84  36   9   1
  1  10  45 120 210 252 210 120  45  10   1
                      .
                      .
                      .
To create this triangle we start with our first two rows, 1 and 1 1. Each later row has ones
on the outside, and every other entry is the sum of the two numbers directly above it; for
example, the 2 in Row 2 comes from adding 1 + 1, and so on.
Pascal's triangle is another way of finding the number of combinations for a particular
situation. Let's say you have five hats on a rack, and you want to know how many
different ways you can pick two of them to wear. It doesn't matter to you which hat is
on top, so the order of selection does not matter (This is a combination).
To solve this problem we could use our combination formula:
$_5C_2 = \frac{5!}{2!(5 - 2)!} = 10$
or we could use Pascal's Triangle, in which we look at the fifth row, second number, which
is of course 10.
Note: This triangle starts at Row 0, in other words the 1 at the top is considered to
be Row 0 then the next row is Row 1, and so on. The first number in every row is
numbered 0 as well.
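The construction rule can be sketched in Python. Because Python lists also start counting at 0, `triangle[row][position]` matches the Row 0 convention described in the note. The function name `pascal_rows` is made up for this illustration.

```python
def pascal_rows(n_rows):
    """Build the first n_rows rows of Pascal's triangle, starting with Row 0."""
    rows = [[1]]
    for _ in range(n_rows - 1):
        previous = rows[-1]
        # Each inner entry is the sum of the two numbers directly above it
        middle = [a + b for a, b in zip(previous, previous[1:])]
        rows.append([1] + middle + [1])
    return rows

triangle = pascal_rows(11)   # Rows 0 through 10

print(triangle[5][2])    # 5 hats, choose 2
print(triangle[10][5])   # 10 hats, choose 5
```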
Let's say we had 10 hats on the rack, and you want to know how many different ways you
can pick 5 of them. Then we could use our formula and get:
$_{10}C_5 = \frac{10!}{5!(10 - 5)!} = 252$
or we can look at the 10th row, 5th number, which of course is 252. (recommended
problems: pg 146 #33, 34, 35, 38, 39, 40)
3.6 Addition Rules
3.6.2/3.6.3 Define mutually exclusive events and be able to determine whether events
are mutually exclusive or not.
Mutually Exclusive: The occurrence of one event means that none of the other events can
occur at the same time. (i.e. they can’t both happen!)
The events “is male” and “is female” are mutually exclusive since they can’t occur at
the same time. (i.e. you can’t be male and female)
Similarly if you flip a coin once, the events “is heads” and “is tails” are mutually
exclusive since a single flip can’t land both heads and tails.
Examples: Determine whether the following are mutually exclusive or not.
(hint: ask yourself “Is it possible to be both at the same time?” If yes, then they aren’t
mutually exclusive, and if no, they are mutually exclusive.)
1. Event A: a number’s first digit is 2
Event B: a number’s second digit is 6
2. Event A: a number’s first digit is 1
Event B: a number’s first digit is 3
3. Event A: is taller than 5 feet
Event B: is shorter than 5 feet
4. Event A: has visited Florida
Event B: has visited France
5. Event A: likes strawberries
Event B: likes blueberries
6. Event A: likes strawberries
Event B: hates strawberries
7. Event A: is a king (in a deck of cards)
Event B: is a heart (in a deck of cards)
8. Event A: is a heart
Event B: is a spade
9. Event A: got an “A” on the first test
Event B: got a “B” on the first test
10. Event A: has brown hair
Event B: has blue eyes
3.6.1 Use the addition rules to determine the probability of an event (pg 131)
P(A or B) = P(A) + P(B) – P(A and B)
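A small sketch of the addition rule in Python, using exact fractions. The function name `p_a_or_b` is invented here, and the default of 0 for P(A and B) reflects the mutually exclusive case, where both events cannot happen together.

```python
from fractions import Fraction

def p_a_or_b(p_a, p_b, p_a_and_b=0):
    """P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b

# King or heart: 4 kings, 13 hearts, and the king of hearts is counted in both
p_king_or_heart = p_a_or_b(Fraction(4, 52), Fraction(13, 52), Fraction(1, 52))

# White or red ball: mutually exclusive, so nothing is subtracted
p_white_or_red = p_a_or_b(Fraction(3, 10), Fraction(5, 10))

print(p_king_or_heart, p_white_or_red)
```

Subtracting P(A and B) stops outcomes that belong to both events, such as the king of hearts, from being counted twice.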
Examples:
1. What is the probability that a card chosen at random from a standard deck of cards will
be either a king OR a heart? (there are 4 kings, 13 hearts and 1 king of hearts)
2. A bag contains 10 balls.
3 balls: white
5 balls: red
2 balls: yellow
You must pick one ball from the bag.
a. What is the probability that the ball you pick is white?
b. What is the probability that the ball is white or red?
3. The data below is gathered from a class of 30 students.
Sport played          Frequency
softball              10
hockey                20
softball and hockey   5
a. What is the probability that a student plays softball?
b. What is the probability that a student plays softball or hockey?
4. What is the probability of rolling a 3 or a 5 on a single roll of a die?
5. The hourly wages of a group of 30 students that each have one part time job is shown.
hourly wage ($)   frequency
7                 20
8                 6
9                 4
a. What is the probability that a student’s wage is $7?
b. What is the probability that a student’s wage is $7 or $9?
Worksheet for 3.3
1. A bag contains 10 different candies. How many ways can you choose 5 candies from a
bag? (order doesn’t matter)
2. You are buying a bed and there are 10 different frames you can choose and there are 4
types of bedspreads. How many combinations of frames and bedspreads can you have?
3. a. An area has eight plots of land. Five families want to move into the area. How many
ways can you arrange the five families?
b. Five people are being chosen from a group of eight to go on a trip. How many ways
can you pick five people?
4. A flight attendant has 11 seats available and must seat 7 passengers. How many ways
are there to arrange the passengers?
5. a. There are seven aisle seats at a concert and there are five people that will be picked
for these seats. How many ways can the people be arranged?
b. Nine concert-goers are going to be selected from a group of eleven to meet the band.
How many ways can they be picked?
c. Two out of eight dogs are going to be awarded first and second prize. How many ways
can the prizes be awarded?
6. State whether each of the following events are mutually exclusive or not.
a. Event A: born in 1980
Event B: born in 1982
b. Event A: born in 1980
Event B: born in Canada
c. Event A: has type A blood
Event B: has type O blood
d. Event A: has two children
Event B: has no children
e. Event A: likes dogs
Event B: likes cats
7. There are 28 students in a class. Eight of them are nine years old and the rest are ten.
One student is selected.
a. What is the probability of selecting a nine year old?
b. What is the probability of selecting a nine OR a ten year old?
8. Use the table below to answer the following questions. A class of 30 students was
surveyed for the following information.
Video system   frequency
X-Box 360      10
PS3            7
both           3
none           16
If one student is selected from the class:
a. What is the probability of the student having an X-Box 360?
b. What is the probability that they will have an X-Box 360 OR a PS3?
3.7 Conditional Probability
3.7.1 Define the term conditional probability
Conditional probability is the probability of a particular event occurring, given that
another event has occurred.
3.7.4 Define independent events
Two events are independent if the occurrence of one event has no effect on the
probability of the occurrence of another event.
Example #1: two independent events: flipping a coin twice.
If you flip a coin twice, the result of the first flip has NO EFFECT on the result of the
second flip. So, the result of the first flip is INDEPENDENT of the result of the second
flip.
Example #2: two independent events: picking cards (and replacing the picks)
If you pick a card from a deck and then put this card back before you pick the next card,
then the result of the first pick has NO effect on the result of the second pick. So the two
picks are INDEPENDENT events.
Note: If you had not put your first pick back in the deck before you picked again, then it
would affect the result of the second pick. This is because the second pick could be only
one of 51 options, rather than 52. In this case, the events would NOT be independent.
Example #3: two independent events: picking two balls out of a bag
You have a bag of 10 balls of various colours. If you pick a ball and then put it back
before you pick another ball, then the result of the first pick has NO EFFECT on the
result of the second pick. Hence, the events are INDEPENDENT.
Note: If you did not put the first pick back, then the events would not be independent,
since the second pick would be one of 9 options, not 10.
Helpful Hint: When you replace your first pick, the events are independent, but if you
don’t replace it, then they are NOT independent.
3.7.5 Determine whether the following events are independent.
1. Event A: picking a card and NOT putting it back
Event B: picking another card
2. Event A: rolling a die
Event B: rolling the die again
3. Event A: flipping a coin
Event B: flipping it again
4. Event A: picking a ball from a bag and putting it back
Event B: picking another ball from the bag
5. Event A: picking a ball from a bag and NOT putting it back
Event B: picking another ball from the bag
3.8 Multiplication Rules
3.8.1 Use the “special rule” for multiplication
The “special rule” for multiplication applies to events that are independent of each
other. The multiplication rule is used for “AND” problems. (Note: We did “OR”
problems yesterday) Up until now, we only picked a card ONCE, or rolled a die ONCE, or
flipped a coin ONCE. NOW, we are doing TWO or more things IN A ROW.
The “special rule” for independent events is:
P(A and B)=P(A)×P(B)
Or more generally, for more than two independent events:
P(A and B and C and D ..etc)=P(A)×P(B)×P(C)×P(D)….etc
Remember: This rule only applies to INDEPENDENT events!!!!!!
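A minimal sketch of the special rule, using exact fractions and the coin and ball-bag figures from the examples that follow (4 red and 6 white balls, picks put back). The helper name `p_all_independent` is hypothetical, chosen only for this illustration.

```python
from fractions import Fraction
from math import prod


def p_all_independent(*probabilities):
    """Special rule: multiply the probabilities of independent events."""
    return prod(probabilities)


# A tail and then a head on two flips of a fair coin.
p_tail_then_head = p_all_independent(Fraction(1, 2), Fraction(1, 2))

# Two white balls in a row from a bag of 4 red and 6 white,
# putting the first pick back (so the picks are independent).
p_two_white = p_all_independent(Fraction(6, 10), Fraction(6, 10))

print(p_tail_then_head)  # 1/4
print(p_two_white)       # 9/25
```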
Examples:
1. What is the probability of flipping a coin and getting a tail AND then flipping it again
and getting a head.
Since the events are independent we can use:
2. What is the probability of rolling a 4 and then rolling it again and getting a 2?
Since the events are independent we can use:
3. There are 10 balls in a bag. Four are red and 6 are white.
a. What is the probability of picking 2 white balls in a row if you put back your first pick?
Since the events are independent (cause we put our first pick back) we can use:
b. What is the probability of picking a red ball, then a white ball, and then another white
ball if you replace each ball back into the bag after you pick it?
Since the events are independent (cause we keep putting our picks back) we can use:
c. What is the probability of picking 3 red balls in a row if you put each pick back each
time?
Since the events are independent (cause we keep putting our picks back) we can use:
4. What is the probability of picking a red card and then a jack if you replace your first
pick? (26 red cards, 4 jacks, 52 cards in a deck)
3.8.2 Use the “general” multiplication rule to compute probabilities.
This rule can be used on “AND” problems (i.e. events happening in a row) when the
events aren’t independent! (This rule works always---even for independent events)
P(A and B) = P(A)×P(B given that A happened)
or more generally for more than one event:
P(A and B and C and D) = P(A)×P(B given that A happened)×P(C given that A and B happened)×P(D given that A, B, and C happened)
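The general rule can be sketched for picks made without putting anything back: each factor is the chance of the next pick given what has already been removed. The function name and the dictionary layout below are assumptions made for illustration; the bag (4 red, 6 white) is the one used in the examples that follow.

```python
from fractions import Fraction


def p_sequence_without_replacement(counts, draws):
    """Multiply P(first) x P(second given first) x ... for picks made
    without replacement. `counts` maps colour -> number of balls in the bag."""
    remaining = dict(counts)      # copy so the caller's bag is left untouched
    total = sum(remaining.values())
    probability = Fraction(1)
    for colour in draws:
        probability *= Fraction(remaining[colour], total)
        remaining[colour] -= 1    # the picked ball stays out of the bag
        total -= 1
    return probability


bag = {"red": 4, "white": 6}
print(p_sequence_without_replacement(bag, ["red", "red"]))    # 2/15
print(p_sequence_without_replacement(bag, ["red", "white"]))  # 4/15
```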
Examples:
1. There are 10 balls in a bag. Four are red and six are white.
a. What is the probability of picking a red ball and then picking another red ball if you
don’t put back your first pick?
b. What is the probability of picking a red ball and then picking a white ball if you don’t
put your first pick back?
c. What is the probability of picking a white ball and then a red ball and then a white ball
if you don’t put any of your picks back?
d. What is the probability of picking 4 white balls in a row if you don’t put any picks
back?
2. a What is the probability of picking a heart and then a diamond if you don’t put your
first pick back? (a 52 card deck has 13 hearts and 13 diamonds)
b. What is the probability of picking a jack and then another jack if you don’t put your
first pick back? (a 52 card deck has 4 jacks)
c. What is the probability of picking a red card and then a black card and then another red
card if you don’t put your picks back? (a 52 card deck has 26 red cards and 26 black
cards)
3. There are 10 rolls of film in a box, three of which are defective. (the rest aren’t)
a. What is the probability of picking a defective roll and then picking another defective roll
if you don’t put your first pick back?
b. What is the probability of picking a defective roll and then picking another defective
roll if you put your first pick back?
c. What is the probability of picking a non-defective roll and then a defective roll if you
do not put your first pick back?
d. What is the probability of picking 3 non-defective rolls in a row if you put your pick
back in the box each time?
e. What is the probability in part d if you do not put your pick back each time?
f. What is the probability of picking a defective roll, then another defective roll, and then
a non-defective roll if you replace your pick each time?
g. What would be your answer to part f if you did not replace your pick each time?
h. What is the probability of getting three defective rolls in a row if you do not replace
your picks?
3.9 Bayes’ Theorem
3.9.3 Use Bayes’ Theorem to calculate probabilities
To solve Bayes’ Theorem problems, it is very helpful to use a Tree Diagram. In Bayes’
Theorem problems, you will divide your data into groups more than once.
In probability theory, Bayes’ Theorem is a means for revising predictions in light of
relevant evidence, also known as conditional probability.
The following examples will help illustrate.
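As a hedged sketch of the tree-diagram approach, the code below multiplies along each branch and then divides the branch of interest by the total. It uses the class figures from Example 1 below (60% men, 40% women, 25% and 45% passing); the function name `posterior` and the dictionary keys are invented for this illustration.

```python
def posterior(priors, likelihoods, group):
    """Bayes' Theorem via the tree diagram:
    P(group | evidence) = P(group) * P(evidence | group) / P(evidence)."""
    # One product per branch of the tree: P(group) * P(evidence | group)
    branches = {g: priors[g] * likelihoods[g] for g in priors}
    p_evidence = sum(branches.values())  # total probability of the evidence
    return branches[group] / p_evidence


priors = {"men": 0.60, "women": 0.40}   # first split of the tree
p_pass = {"men": 0.25, "women": 0.45}   # second split: passed the test

p_male_given_pass = posterior(priors, p_pass, "men")
print(round(p_male_given_pass, 4))  # 0.4545
```

The same function answers the "failed the test" questions if the second dictionary holds the failure rates (1 minus each pass rate) instead.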
Examples:
1. A class consists of 60% men and 40% women. Of the men, 25% passed the test, while
45% of the women passed.
a. Illustrate the information with a tree diagram.
b. What is the probability that if a student chosen at random passed the test, then they are
male?
c. What is the probability that if a student chosen at random failed the test, then they are
male?
d. What is the probability that if a student chosen at random passed the test, then they are
female?
2. There are two different suppliers of a particular part. Company A supplies 80% of the
parts and company B supplies 20%. 5% of the parts supplied by company A and 9% of
the parts supplied by B are defective.
a. Construct a tree diagram to represent the information.
b. Suppose you reach into the bin and select a part and find it is nondefective. What is the
probability that it was supplied by A?
c. Suppose you reach in and select a part and find that it is defective. What is the
probability that it was supplied by B?
d. Suppose that the part you pick was defective. What is the probability that it was
supplied by A?
e. Suppose that the part you pick was non-defective. What is the probability that it was
supplied by B?
Math Worksheet for 3.7
1. There are 50 balls in a bag. 20 are blue, 10 are white, 15 are red, and 5 are green.
a. What is the probability that you pick a blue ball and then put it back and pick a red
ball?
b. What is the probability that you pick a blue ball and then a red ball if you don’t put
your first pick back?
c. What is the probability of picking a green ball and then another green ball if you don’t
put your first pick back?
d. What is the probability of picking a red ball, and then a green ball if you put your first
pick back?
e. What is the probability of picking a blue ball and then another blue ball if you put your
first pick back?
2. What is the probability of flipping a tail and then another tail and then a head?
3. There are 27 toys in a box. 11 of them are defective, and the rest are non-defective.
a. What is the probability that you pick a defective toy and then a non-defective toy if you
do not replace the first toy?
b. What is the probability that you pick a non-defective toy and then another non-defective toy if you replace your first pick?
c. What is the probability that you pick a defective toy and then another defective toy if
you do not replace the first pick?
d. What is the probability of picking 3 non-defective toys in a row if you replace your
picks?
e. What is the probability of picking 3 defective toys in a row if you do not replace your
picks?
4. What is the probability of 5 heads in a row?
5. In a population, 70% of the people have a certain disease. A test is developed that has a
40% chance of detecting the condition in a person who has it, and a 10% chance of
falsely indicating it in a person who does not have it. If a person gets a positive test result,
what is the probability they have the condition? (a tree diagram will help)
6. A class consists of 75% men and 25% women. Of the men, 60% of them passed the
test, and of the women, 65% passed the test.
a. construct a tree diagram to represent this data.
b. If a person passed the test, what is the probability that they are male?
c. If a person failed the test, what is the probability that they are female?
Test #1 up to end of Unit 3!
4.0 Discrete Probability Distributions
4.1 Introduction
4.1.1 Discuss the concept of a random variable
4.1.2 Explain the difference between a discrete random variable and a continuous
random variable
When the random variable can take on only a finite number of values or a countable number of
values, we say that the variable is a discrete random variable. A discrete random variable can
assume only certain clearly separated values resulting from a count of some item of interest.
Example: the number of highway deaths in Ontario on the Labour Day weekend may be 5, 6, 7
... etc. (not 5.3). This does not mean that the discrete random variable may not assume fraction
values - but it does mean that there is some distance between the values. The result is still in a
count - example: 12 stocks increased by $0.25 or 1/4.
When the random variable can take on any number on the number line - not just counting - we
say that the variable is a continuous random variable. We can have 2.5 or 1.8 mm of rainfall for
example or in a high school track meet, the winning time for the mile run may be reported as 4
minutes 20 seconds; 4 minutes 20.2 seconds; or 4 minutes 20.3416 seconds, etc.
NOTE: If the problem involves counting something, the resulting distribution is usually a discrete
probability distribution.
If the distribution is the result of a measurement, then it is usually a continuous probability
distribution.
Example: State whether the following random variables are discrete or continuous.
1. the number of letters delivered on time: ___________________
2. the length of a package: ___________________
3. the number who attended: ___________________
4. the height of each student: ___________________
4.1.3 Show how probability functions assign probabilities to different values of
random variables.
The following is a probability function that assigns probabilities to a random variable.
4.2 The mean of a random variable
4.2.1 Show how the mean of a probability function helps to give the expected value
of a probability function.
Special Note: The mean of a probability distribution is often referred to as the "expected
value" of the distribution.
μ = Σ[x · P(x)]
This formula directs you to multiply each outcome (x) by its probability P(x); and then add the
products.
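The multiply-and-add procedure described above can be sketched in a few lines. The data are the strike probabilities from Example 1 below; the function name `expected_value` is an assumption made for this illustration.

```python
def expected_value(distribution):
    """Mean of a discrete distribution: the sum of x * P(x) over all outcomes."""
    return sum(x * p for x, p in distribution.items())


# Number of strikes (x) and its probability P(x), from Example 1.
strikes = {0: 0.2, 1: 0.35, 2: 0.15, 3: 0.3}

mean = expected_value(strikes)
print(round(mean, 2))  # 1.55
```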
Example 1: Calculate the mean of the following probability function representing
the probabilities of batters getting a given number of strikes.
x     P(x)
0     0.2
1     0.35
2     0.15
3     0.3
4.3 Measuring Chance Variation
4.3.1 Show how the variance of a probability function helps to give the spread of a
probability function.
The standard deviation of a discrete probability distribution is found by taking the square root of
the variance σ², thus σ = √(σ²).
Variance of a probability distribution:
σ² = Σ[(x − μ)² · P(x)]
Example 1: Find the variance of the following probability distribution given in 4.2.1.
x     P(x)
0     0.2
1     0.35
2     0.15
3     0.3
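A short sketch of the variance calculation for this distribution, assuming the standard definition of variance for a discrete distribution (each squared deviation from the mean weighted by its probability). The function name `mean_and_variance` is hypothetical.

```python
from math import sqrt


def mean_and_variance(distribution):
    """Return (mean, variance) of a discrete distribution given as {x: P(x)}."""
    mean = sum(x * p for x, p in distribution.items())
    # Weight each squared deviation from the mean by its probability.
    variance = sum((x - mean) ** 2 * p for x, p in distribution.items())
    return mean, variance


strikes = {0: 0.2, 1: 0.35, 2: 0.15, 3: 0.3}

mean, variance = mean_and_variance(strikes)
std_dev = sqrt(variance)  # standard deviation is the square root of the variance

print(round(variance, 4))  # 1.2475
print(round(std_dev, 3))   # 1.117
```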
Example 2: Find the following using the population distribution below.
x     P(x)
0     0.1
1     0.15
2     0.2
3     0.4
4     0.15
a. What is the probability of exactly 2? __________
b. What is the probability of exactly none? ___________
c. What is the probability of at least 2? _______________
d. What is the probability of more than 3? _________________
e. What is the probability of no more than 1? _________________
f. Calculate the mean.
g. Calculate the variance.
4.4 Binomial Distribution
4.4.1 Describe a binomial experiment.
One of the most widely used discrete probability distributions is the binomial probability
distribution.
Illustrations of each characteristic:
1. An outcome is classified as either a "success" or "failure." For example, 40 percent of
the students at a particular university are enrolled in the Business program. For a
selected student there are only two possible outcomes - the student is enrolled in the
program (a success) or is not enrolled in the Business program (failure).
2. The binomial distribution is the result of counting the number of successes in a fixed
sample size. If we select 5 students, 0, 1, 2, 3, 4 or 5 could be enrolled in the Business
program. This rules out the possibility of 3.75 of the students being enrolled. There
cannot be fractional counts.
3. Probability of success stays the same - in this example the probability of a success
remains at 40% for all five students selected.
4. Trials are independent - this means that if the first student selected is enrolled in the
Business program, it has no effect on whether the second or fourth one selected will be in
the Business Program.
(OMIT 4.4.2) 4.4.2 State and graph the formula for a binomial experiment.
There will be TWO ways to get the probability values for a binomial experiment. The
first way is to use the formula below and the second way is to use a table of values and
look up your answer. This saves a LOT of time, but be careful to use the right method. If
you are asked to use the formula to come up with a probability then use the following
method.
To construct a binomial distribution, we need to know:
1. the number of trials - designated n
2. the probability of success on each trial - designated π
The binomial probability formula is:
P(x) = nCx × π^x × (1 − π)^(n − x)
Where:
nCx denotes a combination (i.e. use the Combination formula)
n is the number of trials
x is the number of observed successes
π is the probability of success on each trial
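A minimal sketch of the formula method, assuming the usual binomial formula P(x) = nCx × π^x × (1 − π)^(n − x). It uses the flight figures from the example that follows (n = 5, π = 0.20); the function name `binomial_probability` is made up for illustration.

```python
from math import comb


def binomial_probability(n, x, pi):
    """P(x) = nCx * pi**x * (1 - pi)**(n - x) for x successes in n trials."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


n, pi = 5, 0.20  # five flights, each late with probability 0.20

for x in (0, 1, 3):
    print(x, round(binomial_probability(n, x, pi), 4))
# 0 0.3277
# 1 0.4096
# 3 0.0512
```

Summing the function over x = 0 to n should give 1, which is a quick check that n and π were entered correctly.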
Examples:
1. There are five flights daily from Halifax. Suppose the probability that any flight arrives
late is 0.20. Use the binomial formula to determine the following probabilities.
a. What is the probability that none of the flights are late today?
b. What is the probability that exactly one of the five flights will arrive late today?
c. What is the probability that exactly three of the five flights will arrive late today?
(OMIT) 4.4.3 Solve Binomial Distribution Problems using the binomial distribution
tables.
Unless it states that you must use the binomial distribution formula, USE THE TABLES!
They are sooooo much easier and quicker to use. There are a bunch of tables. To figure
out which one you need you have to know your n value: which represents the number of
trials. You also need to know your π value (your probability of success on each trial)
Examples:
1. Five percent of the gears produced by a machine are defective.
a. What is the probability that out of 6 gears selected at random, none will be defective?
________
b. What is the probability that out of the 6 selected at random, exactly one will be
defective? ________
c. What is the probability that of the 6, two will be defective? ________
d. What is the probability that of the 6, three will be defective? ________
e. What is the probability that of the 6, four will be defective? ________
f. What is the probability that of the 6, five will be defective? ________
g. What is the probability that of the 6, all will be defective? ________
h. With the information from a-g, graph a binomial probability distribution.
*i. What is the probability that less than 3 will be defective?
________________________________
*j. What is the probability that at least 3 will be defective?
________________________________
*k. What is the probability that more than 3 will be defective?
________________________________
2. Eighty percent of employees use direct deposit. Suppose we select a random sample of
7 employees and count the number using direct deposit. What is the probability that:
a. none use direct deposit? ____________
b. exactly 3 use direct deposit? ____________
c. exactly 4 use direct deposit? ____________
3. The postal service reports ninety percent of its first class mail delivered within 2 days.
If 5 letters are randomly selected, what is the probability that:
a. exactly one arrives within 2 days? ____________
b. exactly 3 arrive within 2 days? ____________
c. all arrive within 2 days? ____________
4.5 The mean and standard deviation of the binomial distribution
4.5.1/4.5.2 Calculate the mean and standard deviation of a binomial distribution.
The mean of a probability distribution is often called the expected value of the distribution. This
terminology reflects the idea that the mean represents a "central point" or "cluster point" for the
entire distribution.
The mean and variance of a binomial distribution can be computed by these formulas:
μ = nπ
σ² = nπ(1 − π)
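As a sketch, assuming the standard binomial results (mean = nπ and variance = nπ(1 − π)), the gear problem above (n = 6, π = 0.05) works out as follows. The function name `binomial_summary` is hypothetical.

```python
from math import sqrt


def binomial_summary(n, pi):
    """Return (mean, variance, standard deviation) of a binomial distribution."""
    mean = n * pi
    variance = n * pi * (1 - pi)
    return mean, variance, sqrt(variance)


# Problem #1: 6 gears selected, 5 percent defective.
mean, variance, std_dev = binomial_summary(6, 0.05)
print(round(mean, 3), round(variance, 3), round(std_dev, 3))  # 0.3 0.285 0.534
```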
Examples:
Calculate the mean, variance, and standard deviation for the binomial distribution
problems #1-3 above.
For #1:
For #2:
For #3:
Worksheet for 4.0
1. Compute the mean and variance of the following discrete probability
distribution.
x     P(x)
0     0.2
1     0.4
2     0.3
3     0.1
2. Compute the mean and variance of the following discrete probability
distribution.
x     P(x)
2     0.5
8     0.3
10    0.2
3. Which of the following variables are discrete and which are continuous
random variables?
a. The number of new accounts established by a salesperson in a year.
b. The time between customer arrivals to a bank ATM.
c. The number of customers in Big Nick’s barber shop.
d. The amount of fuel in your car’s gas tank last week.
e. The number of minorities on a jury.
f. The outside temperature today.
4. Dan Woodward is the owner and manager of Dan’s Truck Stop. Dan offers
free refills on all coffee orders. He gathered the following information on
coffee refills. Compute the mean, variance, and standard deviation for the
distribution of number of refills.
x     Percent
0     30
1     40
2     20
3     10
5. The director of admissions at a university estimated the distribution of
student admissions for the fall semester on the basis of past experience.
What is the expected number (i.e. what is the mean) of admissions for the
fall semester? Compute the variance and the standard deviation of the
number of admissions.
admissions   Probability
1000         0.6
1200         0.3
1500         0.1
6. The following table lists the probability distribution for cash prizes in a
lottery conducted at Lawson’s Department Store.
x     P(x)
0     0.45
10    0.30
100   0.20
500   0.05
If you buy a single ticket, what is the probability that you win:
a. exactly $100?
b. at least $10?
c. no more than $100?
d. Compute the mean, variance, and standard deviation of this distribution.
7. You are asked to match three songs with the performers who made those
songs famous. If you guess, the probability distribution for the number of
correct matches is:
X (the number correct)   P(x)
0                        0.333
1                        0.500
2                        0
3                        0.167
What is the probability that you get:
a. exactly one correct?
b. at least one correct?
c. exactly two correct?
d. Compute the mean, variance, and standard deviation of this distribution.
Omit 8. In a binomial situation n=4 and π=0.25. Determine the probabilities
of the following events using the binomial formula.
a. x=2
b. x=3
omit 9. In a binomial situation n=5 and π=0.40. Determine the probabilities
of the following events using the binomial formula.
a. x=1
b. x=2
10. Assume a binomial distribution where n=3 and π=0.60.
Omit a. Refer to Appendix A, and list the probabilities for values of x from
0 to 3.
b. Determine the mean and standard deviation of the distribution.
11. Assume a binomial distribution where n=5 and π=0.30.
Omit a. Refer to Appendix A, and list the probabilities for values of x from
0 to 5.
b. Determine the mean and standard deviation of the distribution.
Omit 12. A society of investors’ survey found 30 percent of individual
investors have used a discount broker. In a random sample of nine
individuals, what is the probability:
a. exactly two of the sampled individuals have used a discount broker?
b. exactly four of them have used a discount broker?
c. none of them have used a discount broker?
13. The postal service reports 95 percent of first class mail within the same
city is delivered within two days of the time of mailing. Six letters are
randomly sent to different locations.
Omit a. What is the probability that all six arrive within two days?
Omit b. What is the probability that exactly five arrive within two days?
c. Find the mean number of letters that will arrive within two days.
d. Compute the variance and standard deviation of the number that will
arrive within two days.
14. The industry standards suggest that 10 percent of new vehicles require
warranty service within the first year. Jones sold 12 Nissans yesterday.
Omit a. What is the probability that none of these vehicles requires
warranty service?
Omit b. What is the probability that exactly one of these vehicles requires
warranty service?
Omit c. Determine the probability that exactly two of these vehicles require
warranty service.
d. Compute the mean and standard deviation of this probability distribution.
5.0 Continuous Probability Distributions
5.1 Introduction
5.1.1 Name and sketch the graph of the various probability distributions
The following are examples of some of the different types of probability
distributions.
Continuous Probability Distributions:
1) The normal distribution: (This is the one we will be dealing with)
2) The uniform distribution:
3) The Exponential distribution:
5.2.1/5.2.2 Discuss and sketch the normal distribution and state the
properties of the normal distribution
The Normal Distribution Curve:(Bell-Curve)
A population which is normally distributed will have a mean located at the center
and a curve that is symmetrical (which means that each side is a reflection of the
other). The percentages given below stem from the Empirical Rule which states that
68.26% of the data lie within one standard deviation of the mean; 95.44% of the
data lie within two standard deviations of the mean and 99.73% of the data lie
within three standard deviations from the mean.
The normal distribution is a continuous probability distribution that is uniquely
determined by its mean (µ ) and standard deviation (σ ).
Characteristics of a Normal Probability Distribution
A normal distribution is completely described by its mean and standard
deviation. This indicates that if the mean and standard deviation are known, a
normal distribution can be constructed and its curve drawn.
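Because the curve is fixed by its mean and standard deviation, the Empirical Rule intervals can be computed directly. The sketch below uses the battery figures from the example further on (mean 19 hours, standard deviation 1.2 hours); the function name `empirical_rule_intervals` is an assumption made for this illustration.

```python
def empirical_rule_intervals(mean, std_dev):
    """Return the intervals mean +/- k standard deviations for k = 1, 2, 3,
    which hold about 68%, 95% and virtually all of the data respectively."""
    return {k: (mean - k * std_dev, mean + k * std_dev) for k in (1, 2, 3)}


for k, (low, high) in empirical_rule_intervals(19, 1.2).items():
    print(k, round(low, 1), round(high, 1))
# 1 17.8 20.2
# 2 16.6 21.4
# 3 15.4 22.6
```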
The following chart shows three normal distributions, where the means are the
same but the standard deviations are different.
The following chart shows three distributions with different means but identical
standard deviations.
The following chart shows three distributions with different means and different
standard deviations.
Examples:
1. A company conducts a test on the lifespan of a battery. For a particular battery,
the mean life is 19 hours. The useful life of the battery follows a normal
distribution with a standard deviation of 1.2 hours. Answer the following questions.
a. About 68% of the batteries will have a lifespan between what two values?
b. About 95% of the batteries will have a lifespan between what two values?
c. Virtually all of the batteries will have a lifespan between what two values?
2. The mean of a normal probability distribution is 250; the standard deviation is
20.
a. About 95 percent of the observations lie between what values?
b. About what percent of the data lies between 230 and 270?
c. About what percent of the data lies between 190 and 310?
5.3 The Standard Normal Curve
2.4.4.4 and 2.4.4.5 Define and compute z-scores. (new numbers on outline
2.3.3)
The Standard Normal Distribution:
This is a normal distribution curve but in a standard normal distribution the mean is
zero and the standard deviation is 1.
Remember: the standard deviation is a measurement of how much a particular value
deviates from the mean.
The standard normal distribution can be used for all problems where the normal
distribution is applicable. Any normal distribution can be converted into the
"standard normal distribution" by using a z value. The z value measures the
distance between a particular value of X and the mean in units of the standard
deviation.
This is how to compute a z-score:
z = (X − μ) / σ
where:
X: value of your random variable
μ: the mean of the distribution of the random variable
σ: the standard deviation of the distribution
(So, looking at the formula you can see that the z-score measures how many
standard deviations a number is from the mean.)
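A minimal sketch of the z-score calculation, using the first example below (mean 100, standard deviation 10). The function name `z_score` is chosen only for illustration.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


print(z_score(110, 100, 10))  # 1.0
print(z_score(80, 100, 10))   # -2.0
```

A negative z-score simply means the value is below the mean.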
The following illustration demonstrates converting an X value to a standardized z
value:
Examples:
1. A distribution has a mean of 100 and a standard deviation of 10. Calculate the z-scores of each of the following.
a. 110
b. 80
2. The weekly incomes of shift foremen in the glass industry are normally
distributed with a mean of $1000 and a standard deviation of $100. What is the z-value for a foreman who earns:
a. $1150 per week?
b. $925 per week?
5.3.1 Use the standard normal curve to calculate probabilities.
You will be using another table (Appendix D) to calculate probabilities using the
standard normal curve. It contains a list of z-scores.
Obtaining the Probability: (2 STEPS!)
1. To obtain the probability of a value falling in the interval between the variable
of interest (X) and the mean, we first compute the z-score.
2. To obtain the probability we refer to the Standard Normal Probability Table
(Appendix D) for the associated probability of a given area under the curve. The
following is an illustration of how we read the Standard Normal Probability Table
for z = 0.12. (SKETCHES HELP!!!)
Note: The table gives the probability for the area under the curve from the mean
to the z-value. And remember the distribution is symmetrical (50% of values are on
the right and 50% are on the left)
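For checking table look-ups, the same mean-to-z area can be computed with Python's standard library. This is only a sketch: it uses `statistics.NormalDist` rather than Appendix D, and the helper name `area_mean_to_z` is hypothetical. The figures are from Example 1 below (mean 1000, standard deviation 100, value 1100).

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_mean_to_z(z):
    """Area under the standard normal curve between the mean (z = 0) and z,
    which is the quantity the table described above reports."""
    return abs(standard_normal.cdf(z) - 0.5)


z = (1100 - 1000) / 100         # step 1: compute the z-score
between = area_mean_to_z(z)     # step 2: area from the mean to z
below = 0.5 + between           # left half plus that area (valid since z > 0)
above = 0.5 - between           # right half minus that area (valid since z > 0)

print(round(between, 4), round(below, 4), round(above, 4))  # 0.3413 0.8413 0.1587
```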
Examples:
1. A normal population has a mean of 1000 and a standard deviation of 100.
a. Compute the z-value associated with 1000. (although you don’t really need to for
this one since it’s the mean!)
b. Compute the z-value associated with 1100.
c. What is the probability of selecting a value between 1000 and 1100?
d. What is the probability of selecting a value that is less than 1100?
e. What is the probability of selecting a value that is greater than 1100?
2. A recent study of the hourly wages of a group of employees showed that the
mean hourly salary was $20.50, with a standard deviation of $3.50. If we select a
crew member at random, what is the probability the crew member earns:
a. Between $20.50 and $24.50 per hour?
b. More than $24.50 per hour?
c. Less than $24.50 per hour?
d. Less than $19.00 per hour?
e. more than $19.00 per hour?
*f. between $19.00 and $24.50?
***g. between $22.50 and $24.50?
Worksheet for 5.0
1. The mean of a normal probability distribution is 500; the standard deviation is 10.
a. About 68 percent of the observations lie between what values?
b. About 95 percent of the observations lie between what two values?
c. Practically all of the observations lie between what two values?
2. The mean of a normal probability distribution is 60; the standard deviation is 5.
a. About what percent of the observations lie between 55 and 65?
b. About what percent of the observations lie between 50 and 70?
c. About what percent of the observations lie between 45 and 75?
3. The Kamp family has twins, Rob and Rachel. Both Rob and Rachel graduated
college 2 years ago, and each is now earning $50 000 per year. Rachel works in the
retail industry, where the mean salary for executives with less than 5 years’
experience is $35 000 with a standard deviation of $8 000. Rob is an engineer. The
mean salary for engineers with less than 5 years’ experience is $60 000 with a
standard deviation of $5000. Compute the z values for both Rob and Rachel.
4. A recent article reported that the mean labour cost to repair a heat pump is $90
with a standard deviation of $22. A company completed repairs on two heat pumps
this morning. The labour cost for the first was $75 and it was $100 for the second.
Compute z values for each and comment on your findings.
5. A normal population has a mean of 20.0 and a standard deviation of 4.0.
a. Compute the z value associated with 25.0.
b. What proportion of the population is between 20.0 and 25.0?
c. What proportion of the population is less than 18.0?
6. A normal population has a mean of 12.2 and a standard deviation of 2.5.
a. Compute the z value associated with 14.3.
b. What proportion of the population is between 12.2 and 14.3?
c. What proportion of the population is less than 10.0?
7. A recent study of the hourly wages of maintenance crew members for major
airlines showed that the mean hourly salary was $20.50, with a standard deviation
of $3.50. If we select a crew member at random, what is the probability the crew
member earns:
a. between $20.50 and $24.00 per hour?
b. more than $24.00 per hour?
c. less than $19.00 per hour?
8. The mean of a normal distribution is 400 pounds. The standard deviation is 10
pounds.
a. What is the area between 415 pounds and the mean of 400 pounds?
b. What is the area between the mean and 395?
c. What is the probability of selecting a value at random and discovering that it has
a value of less than 395 pounds?
5.3.1 (continued) Use the standard normal curve to calculate probabilities.
The following problems are similar to those done in the last class, only they involve a
bit more work. For example, the first few examples will require some interpretation to
determine the probability (the area under the curve).
There are really four types of problems. They are when you are trying to find the
following areas under the curve.
1.
2.
3.
4.
The following examples will represent each of the types.
Examples:
1. A normal population has a mean of 10.5 and a standard deviation of 2.5. What is the area
under the curve between 9.5 and 10.5?
2. A normal population has a mean of 500 and a standard deviation of 25. What
proportion of the data lies above 540?
3. The distribution of weekly incomes for a group of workers follows a normal
distribution. The mean is $1000 and has a standard deviation of $100. What is the area
under this normal curve between $840 and $1200? (Note: This question could’ve been
worded “What is the probability of a weekly salary being between $840 and $1200?”)
4. A normal distribution has a mean of 1000 and a standard deviation of 100. What is the
area under the normal curve between 1150 and 1250?
In brief, there are four situations for finding the area under the standard normal
distribution.
Type 1: To find the area between 0 and z (or –z), look up the probability directly in the
table.
Type 2: To find the area beyond z (or –z), locate the probability of z in the table and
subtract the probability from 0.5000.
Type 3: To find the area between two points on different sides of the mean, determine the
z values and add the corresponding probabilities.
Type 4: To find the area between two points on the same side of the mean, determine the
z values and subtract the smaller probability from the larger.
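The four situations can also be checked numerically. The sketch below is only an illustration and is not part of the course materials: it assumes Python's standard-library `statistics.NormalDist` as a stand-in for the printed z-table, and the helper name `area_between` is hypothetical. It is applied to Examples 1 to 4 above.

```python
from statistics import NormalDist

def area_between(low, high, mean, sd):
    """Area under the normal curve between low and high (hypothetical helper)."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(high) - dist.cdf(low)

# Type 1: between the mean and a value (Example 1: mean 10.5, sd 2.5)
type1 = area_between(9.5, 10.5, mean=10.5, sd=2.5)

# Type 2: beyond a value (Example 2: mean 500, sd 25, area above 540)
type2 = 1 - NormalDist(mu=500, sigma=25).cdf(540)

# Type 3: two points on different sides of the mean (Example 3)
type3 = area_between(840, 1200, mean=1000, sd=100)

# Type 4: two points on the same side of the mean (Example 4)
type4 = area_between(1150, 1250, mean=1000, sd=100)

for label, area in [("Type 1", type1), ("Type 2", type2),
                    ("Type 3", type3), ("Type 4", type4)]:
    print(f"{label}: {area:.4f}")
```

The printed areas should agree with the table method to about four decimal places; small differences come from rounding z to two decimals when using the chart.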
5.4 Applications of the Normal Curve
5.4.1 Solve normal distribution problems
There is another type of problem that you will be required to do.
Notice that, up until now, you have always been asked to find the probability (or, worded
differently, the area under the normal curve), and you were given the
value for X. In the next type of problem, you will be given the probability (or area under
the curve) and you will be required to find your X value.
It is a bit like working backwards. The steps involved:
Step 1: Since you are given the probability, you will need to draw a sketch to figure out
the area you are dealing with.
Step 2: Look up the area in your chart to get your z-score.
Step 3: Fill your z-score into the equation $z = \frac{X - \mu}{\sigma}$ and solve for your X.
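As a rough check on the "working backwards" steps, the sketch below finds X from a given area. It is an illustration under stated assumptions: `statistics.NormalDist.inv_cdf` from Python's standard library replaces the reverse table lookup, and the helper name `value_with_area_above` is hypothetical. The numbers are those of Example 1 below (mean 80, standard deviation 5).

```python
from statistics import NormalDist

def value_with_area_above(area_above, mean, sd):
    """Return X such that the given fraction of the data lies above X."""
    # Step 2: find the z-score whose area to the LEFT is 1 - area_above
    z = NormalDist().inv_cdf(1 - area_above)
    # Step 3: solve z = (X - mean) / sd for X
    return mean + z * sd

x_top_5 = value_with_area_above(0.05, mean=80, sd=5)    # 5% of data above
x_top_80 = value_with_area_above(0.80, mean=80, sd=5)   # 80% of data above

print(round(x_top_5, 2), round(x_top_80, 2))
```

A value with 80% of the data above it must sit below the mean, so its z-score is negative. Answers from the printed chart may differ slightly in the second decimal because the chart rounds z.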
Examples:
1. A normal distribution has a mean of 80 and a standard deviation of 5.
a. Determine the value above which 5% of the data occurs.
b. Determine the value above which 80% of the data occurs.
2. A normal distribution has a mean of 55 and a standard deviation of 7.
a. Determine the value below which 70% of the data occurs.
b. Determine the value above which 10% of the data occurs.
3. The mean cost to use a plane follows the normal distribution and has a value of $1500
per hour with a standard deviation of $150.
a. What is the operating cost for the lowest 4% of planes?
b. What is the operating cost for the highest 12%?
c. What is the operating cost for the lowest 60%?
Worksheet for 5.3.1
1. A normal distribution has a mean of 50 and a standard deviation of 4.
a. Compute the probability of a value between 44.0 and 55.0.
b. Compute the probability of a value greater than 55.0.
c. Compute the probability of a value between 52.0 and 55.0.
2. A normal population has a mean of 80.0 and a standard deviation of 14.0.
a. Compute the probability of a value between 75.0 and 90.0.
b. Compute the probability of a value 75.0 or less.
c. Compute the probability of a value between 55.0 and 70.0.
3. A cola-dispensing machine is set to dispense on average 7.00 ounces of cola per
use. The standard deviation is 0.10 ounces. The distribution of amounts dispensed
follows a normal distribution.
a. What is the probability that the machine will dispense between 7.10 and 7.25
ounces of cola?
b. What is the probability that the machine will dispense 7.25 ounces of cola or
more?
c. What is the probability that the machine will dispense between 6.80 and 7.25
ounces of cola?
4. The amounts of money requested on home loan applications at a bank follow the
normal distribution, with a mean of $70 000 and a standard deviation of $20 000.
A loan application is received this morning. What is the probability:
a. The amount requested is $80 000 or more?
b. The amount requested is between $65 000 and $80 000?
c. The amount requested is $65 000 or more?
5. A normal distribution has a mean of 50 and a standard deviation of 4. Determine
the value below which 95 percent of the observations will occur.
6. A normal distribution has a mean of 80 and a standard deviation of 14. Determine
the value above which 80 percent of the values will occur.
7. The amounts dispensed by a cola machine follow the normal distribution with a
mean of 7 ounces and a standard deviation of 0.10 ounces per cup. How much cola is
dispensed in the largest 1 percent of the cups?
8. The amount requested for home loans follows the normal distribution with a mean
of $70 000 and a standard deviation of $20 000.
a. How much is requested on the largest 3 percent of the loans?
b. How much is requested on the smallest 10 percent of the loans?
9. Assume that the mean hourly cost to operate a commercial airplane follows the
normal distribution with a mean of $2100 per hour and a standard deviation of
$250. What is the operating cost for the lowest 3 percent of the airplanes?
10. The monthly sales of muffins in the Richmond, Virginia, area follow the normal
distribution with a mean of 1200 and a standard deviation of 225. The
manufacturer would like to establish inventory levels such that there is only a 5
percent chance of running out of stock. Where should the manufacturer set the
inventory levels?
5.5 Normal Curve Approximation to the Binomial Distribution
5.5.1 Use the normal curve to find approximate solutions to binomial calculations
We have already done binomial distribution problems before. Remember, you could use
the formula or you could simply look up the answer in the binomial distribution chart. In
this section, you will be using the normal curve to come up with an approximate answer
to binomial distribution problems.
So, just like in the last two sections you will need to calculate a z-score and use your
normal curve probability chart to come up with your answer.
Remember from the binomial distribution section that you know how to calculate the
mean AND variance (and hence, standard deviation) using the following formulas.
$\mu = n\pi$
$\sigma^2 = n\pi(1 - \pi)$
The following example will help illustrate.
Examples:
1. Thermostats are manufactured in batches of 6 with a 70% rate that they are acceptable
(no defects). Use the normal curve approximation to estimate the probability of getting 4
or fewer acceptable thermostats.
Step 1: Calculate µ and σ2 using the binomial formulas.
Step 2: Get the z-score
2. Of a class of 14 students, each has a 50% chance of passing the test. Use the normal
curve approximation to determine the probability of 5 or more students passing.
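The sketch below works through the numbers of Example 1 (n = 6, π = 0.70, 4 or fewer) as an illustration only. It uses μ = nπ and σ² = nπ(1 − π), then compares the normal estimate with the exact binomial value. Whether to apply the continuity correction (using 4.5 instead of 4) is an assumption here, so both versions are shown. The helper `binomial_cdf` is a hypothetical name.

```python
import math
from statistics import NormalDist

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for a binomial distribution (hypothetical helper)."""
    return sum(math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1))

n, p, k = 6, 0.70, 4

# Step 1: mean and standard deviation from the binomial formulas
mu = n * p                             # n * pi
sigma = math.sqrt(n * p * (1 - p))     # square root of n * pi * (1 - pi)

# Step 2: use the normal curve with that mean and standard deviation
approx = NormalDist(mu, sigma)
p_plain = approx.cdf(k)            # without continuity correction
p_corrected = approx.cdf(k + 0.5)  # with continuity correction (assumption)
p_exact = binomial_cdf(k, n, p)    # exact binomial value for comparison

print(f"plain: {p_plain:.4f}  corrected: {p_corrected:.4f}  exact: {p_exact:.4f}")
```

With so few trials the approximation is rough; the corrected estimate lands closer to the exact binomial value than the uncorrected one.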
Worksheet for 5.5 and 6.0 (6.0 is the next section)
1. Of a group of 5 cars, each will have a 70% chance of having no defects. Use the
normal curve approximation to estimate the probability of getting 4 or fewer cars with no
defects.
2. Of a group of 6 people, each has an 80% chance of getting the flu. Use the normal
curve approximation to estimate the probability of 3 or fewer people getting the flu.
3. State what type of sample each of the following is.
a. a sample is taken from each of the 12 provinces to represent the whole country
b. a sample is taken from Manitoba to represent the whole country
c. Every third person at the mall is surveyed
d. People are picked at random from the telephone book
e. A sample from each school in the Western district is chosen to represent the district
f. One school in the district is chosen to represent the whole district.
4. T/F
a. There is more variation or dispersion in the sampling distribution of the sample means
than in the population.
b. When the size of the sample increases, the standard error of the mean decreases.
c. The population mean is equal to the mean of the sampling distribution of the sample
means.
5. The standard deviation of a population of 130 is 3.2.
a. What is the standard deviation of the sampling distribution of sample means if a
sample size of 5 is chosen?
b. What is the standard error of the mean if a sample size of 6 is chosen?
6. The mean of a population is 28. What will be the mean of the sampling distribution of
sample means if a sample size of 3 is chosen?
7. a. The standard deviation of the sampling distribution of sample means is 5. If a
sample of 3 was used, what is the standard deviation of the population?
b. If the mean of the sampling distribution of sample means is 37, what is the population
mean?
c. How would the amount of variation (i.e. dispersion) of the sampling distribution of
sample means compare with the amount of variation in the population distribution?
6.0 Sampling Distributions
6.1 Introduction
6.1.1 Define Sampling
Sampling is the process of selecting a part of a population to represent the whole population.
6.1.2 Name and describe the various types of sampling.
There are four types of probability sampling commonly used - the most widely used is the
simple random sample.
1. Simple Random Sample
Several ways of selecting a simple random sample are:
a) The name or identifying number of each item in the population is recorded on a slip of
paper and placed in a box. The slips of paper are shuffled and the required sample size is
chosen from the box.
b) Each item is numbered and a table of random numbers, such as the one in Appendix E,
is used to select the members of the sample. (Refer to text illustration for using a table of
random numbers. Note: The starting point is randomly selected.)
c) There are many software programs that will randomly select a given number of items
from the population.
2. Systematic Random Sampling
A random starting point is selected, let's say 39. Then every kth item thereafter, such as
every 100th, is selected for the sample. This means the numbers 39, 139, 239, 339, and so
on would be part of the sample.
3. Stratified Random Sampling
The population is first divided into subgroups, called strata, and a sample is then selected
from each stratum. For example, if our study involved Army personnel, we might decide to stratify the
population into (1) generals, (2) other officers, and (3) enlisted personnel. The number
selected from each of the three strata could be proportional to the total number in the
population for the corresponding stratum. Each member of the population can belong to only
one of the strata.
4. Cluster Sampling
Cluster sampling is often used to reduce cost when the population is scattered over a large
geographic area. Suppose your objective is to study household waste collection in a very
large city. The first step would be to divide the city into smaller units (perhaps blocks).
The units/blocks would be numbered and several of the units/blocks would be selected
randomly for inclusion in the sample. Finally, households within each of these units/blocks
are randomly selected.
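The four sampling methods can be sketched in a few lines of code. This is an illustration only, using Python's standard `random` module. The function names, the made-up population of 1000 numbered items, and the strata and block sizes are all assumptions chosen for the demonstration.

```python
import random

def simple_random_sample(population, n, rng):
    """Every item has the same chance of selection."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Random starting point, then every kth item thereafter."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, fraction, rng):
    """Select the same fraction from each stratum (proportional allocation)."""
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

def cluster_sample(clusters, n_clusters, per_cluster, rng):
    """Randomly pick clusters, then randomly pick members within each."""
    chosen = rng.sample(list(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        sample.extend(rng.sample(members, min(per_cluster, len(members))))
    return sample

rng = random.Random(0)                 # fixed seed so the run is repeatable
population = list(range(1, 1001))      # hypothetical items numbered 1..1000

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 100, rng)

strata = {"generals": list(range(10)),
          "other officers": list(range(10, 110)),
          "enlisted": list(range(110, 1000))}
stratified = stratified_sample(strata, 0.10, rng)

clusters = {f"block {i}": [f"block {i} house {j}" for j in range(20)]
            for i in range(50)}
clustered = cluster_sample(clusters, 5, 4, rng)

print(len(srs), len(systematic), len(stratified), len(clustered))
```

Note how the stratified sample keeps every stratum represented, while the cluster sample only visits the few blocks that were drawn.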
6.2 Selecting a Random Sample
6.2.3 Show how to use a table of random digits to obtain a random
sample
Example: Suppose you want to select a random sample of 32 workers from a
population of 800 employees.
Step 1: Get a table of random numbers from a computer.
Step 2: Number the employees. Give each a 3-digit code so each employee
has an equal chance of selection.
001: first employee
002: second employee
etc
Since 800 is the largest possible code, discard all 3-digit codes that are
bigger than 800 (801-999 and 000)
Step 3: Pick an arbitrary starting point on the table of random numbers and
begin reading the numbers. You can read the numbers in any direction you
choose. In the example on the sheet provided, they went from left to right.
Remember to discard any numbers bigger than 800. Also, if a number appears
more than once, keep its first occurrence and discard the repeats.
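The selection procedure can be imitated in code. The sketch below is an illustration, not the textbook's table: it assumes a pseudo-random generator in place of a printed table of random digits, and the function name `select_by_random_digits` is hypothetical. It applies the same discard rules to pick 32 of the 800 employees.

```python
import random

def select_by_random_digits(population_size, sample_size, rng):
    """Read 3-digit codes one at a time, discarding invalid and repeated codes."""
    digits = len(str(population_size))       # 800 employees -> 3-digit codes
    chosen = []
    while len(chosen) < sample_size:
        code = rng.randrange(10 ** digits)   # one code from 000 to 999
        if code == 0 or code > population_size:
            continue                         # discard 000 and 801-999
        if code in chosen:
            continue                         # discard repeats
        chosen.append(code)
    return chosen

rng = random.Random(2024)                    # arbitrary "starting point"
sample = select_by_random_digits(800, 32, rng)
print([f"{code:03d}" for code in sample])
```

Changing the seed plays the same role as picking a different starting point on the printed table.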
6.4 Chance Variation among Samples
6.4.1 State the formula for sampling error.
Sampling Error: It is the difference between a sample statistic and its
corresponding population parameter.
A sample of a population has mean of 3.6, but the population mean is 3.8.
What is the sampling error?
6.5 The distribution of Sample Means
The data below represents the number of TV’s sold by four employees.
Employee    # of TV’s sold
Tim         2
Bill        3
Jill        10
Kim         1
a. What is the population mean?
b. Construct the sampling distribution of the sample mean for samples of
size 2.
Sample of 2 employees    # of TV’s sold for each    Mean of the sample of 2 employees
c. What is the mean of the sampling distribution of sample means?
d. What observations can be made regarding the mean of the population and
the mean of the sampling distribution?
e. What observations can be made regarding the spread of scores in the
population compared with the spread of scores in the sampling distribution?
Conclusions about the sampling distribution of the sample means
1. The mean of the sample means is exactly equal to the population
mean.
2. The dispersion of the sampling distribution of the sample means is
narrower than the population distribution.
**3. The sampling distribution of the sample means tends to become
bell-shaped and to approximate the normal probability distribution.
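The first two conclusions can be checked by listing every possible sample. The sketch below is an illustration that uses the TV sales data from this section (2, 3, 10, 1). It assumes samples of size 2 are drawn without replacement, which gives 6 possible samples. Only Python's standard library is used.

```python
from itertools import combinations
from statistics import mean, pstdev

tvs_sold = {"Tim": 2, "Bill": 3, "Jill": 10, "Kim": 1}
population = list(tvs_sold.values())

# Every possible sample of 2 employees and the mean of each sample
sample_means = [mean(pair) for pair in combinations(population, 2)]

population_mean = mean(population)
mean_of_sample_means = mean(sample_means)

# Spread (standard deviation) of the population and of the sample means
population_spread = pstdev(population)
sampling_spread = pstdev(sample_means)

print("sample means:", sorted(sample_means))
print("population mean:", population_mean)
print("mean of sample means:", mean_of_sample_means)
print(f"spread: population {population_spread:.2f}, sample means {sampling_spread:.2f}")
```

With only four values in the population, the third conclusion (the bell shape) cannot be seen here; it shows up with larger samples.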
6.5.1 Show how to determine the distribution of sample means
insert photocopies (you won’t have to make one, just interpret it)
NOTE: The population mean and the mean of the sample means are the same.
6.5.2 Analyze and Calculate the standard error of the means and state
what the error represents
The standard error of the mean is actually the standard deviation of the
sampling distribution of the sample means.
If you know the standard deviation of the population, you can calculate the
standard error of the means using the following formula:
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
where:
$\sigma_{\bar{X}}$ : standard error of the mean (i.e. standard deviation of the distribution of
sample means)
$\sigma$ : the standard deviation of the population
n : the number of observations in each sample
So, the standard error of the means represents, on average, how far a
sample mean is from the population mean.
There are two important things to note about the distribution of sample
means:
1. The mean of the distribution of sample means will be exactly equal to the
population mean if we are able to select all possible samples of the same size
from a given population. That is: $\mu = \mu_{\bar{X}}$
2. There will be less dispersion in the sampling distribution of the sample
mean than in the population. If the standard deviation of the population is $\sigma$,
the standard deviation of the distribution of sample means is $\frac{\sigma}{\sqrt{n}}$. Note that
when the size of the sample is increased, the standard error of the mean
decreases.
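A short sketch of the standard error calculation, the population standard deviation divided by the square root of the sample size, is given below for illustration. The function name `standard_error` is hypothetical, and the first call uses the numbers of Example 1 below (population standard deviation 5.7, sample size 5).

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: population standard deviation / sqrt(n)."""
    return sigma / math.sqrt(n)

se_example = standard_error(5.7, 5)

# The standard error shrinks as the sample size grows
for n in (5, 10, 20, 40):
    print(n, round(standard_error(5.7, n), 3))
```

Going the other way, as in Example 2, the population standard deviation is the standard error multiplied by the square root of n.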
Examples:
1. The standard deviation of a population of 35 marks is 5.7. What is the
standard deviation of the sampling distribution of the sample means (i.e. the
standard error of the mean) if a sample size of 5 is chosen?
2. The standard error of the mean (i.e. the standard deviation of the
sampling distribution of the sample means) is 13 for a distribution, when a
sample of 8 is used. What is the standard deviation of the population?
3. The data below represents the number of TV’s sold by four employees.
Employee    # of TV’s sold
Tim         4
Bill        5
Jill        6
Kim         7
a. What is the population mean?
b. Construct the sampling distribution of the sample mean for samples of
size 2. Calculate the mean of this distribution.
c. What is the mean of the sampling distribution?
d. What observations can be made regarding the mean of the population and
the mean of the sampling distribution?
e. What observations can be made regarding the spread of scores in the
population compared with the spread of scores in the sampling distribution?
4. The standard deviation of a population of 40 marks is 7.9.
a. What is the standard deviation of the sampling distribution of the sample
means (i.e. the standard error of the mean) if a sample size of 6 is chosen?
b. What is the standard error of the mean if a sample size of 10 is chosen?
6.6 The Central Limit Theorem
6.6.1 State the Central Limit Theorem
The Central Limit Theorem states that for large random samples, the
sampling distribution of the sample means is close to a normal probability
distribution.
6.6.2 Apply the Central Limit Theorem to make predictions about and
calculate probabilities for sample means.
The following steps are used for using the central limit theorem to calculate
the probability for a given sample mean.
Step 1: Calculate the z-score for the sample mean, $\bar{X}$, using the following
formula:
$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ (if the POPULATION standard deviation, $\sigma$, is known)
where: $\bar{X}$ : is the SAMPLE mean
$\mu$ : is the population mean
$\sigma$ : is the population standard deviation
n : the sample size
OR
$z = \frac{\bar{X} - \mu}{s / \sqrt{n}}$ (if the SAMPLE standard deviation is known)
where: $\bar{X}$ : is the SAMPLE mean
$\mu$ : is the population mean
s : is the sample standard deviation
n : the sample size
Step 2: Look up the probability in Appendix D and determine the desired
probability using the same methods as before.
Examples:
1. A normal population has a mean of 60 and a standard deviation of 12. You
select a random sample of 9. Compute the probability that the sample mean
is:
a. greater than 63
b. less than 56
c. between 56 and 63
2. A population of 100 with an unknown shape has a mean of 75. You select a
sample of 40. The standard deviation of the sample is 5. Compute the
probability that the sample mean is:
a. less than 74
b. between 74 and 76
c. between 76 and 77
d. greater than 77
6.6 The Central Limit Theorem
6.6.1 State the Central Limit Theorem
The Central Limit Theorem states that for large random samples, the
sampling distribution of the sample means is close to a normal probability
distribution.
6.6.2 Apply the Central Limit Theorem to make predictions about and
calculate probabilities for sample means.
The following steps are used for using the central limit theorem to calculate
the probability for a given sample mean.
Step 1: Calculate the z-score for the sample mean, $\bar{X}$, using the following
formula:
$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ (if the POPULATION standard deviation, $\sigma$, is known)
where: $\bar{X}$ : is the SAMPLE mean
$\mu$ : is the population mean
$\sigma$ : is the population standard deviation
n : the sample size
OR
$z = \frac{\bar{X} - \mu}{s / \sqrt{n}}$ (if the SAMPLE standard deviation is known)
where: $\bar{X}$ : is the SAMPLE mean
$\mu$ : is the population mean
s : is the sample standard deviation
n : the sample size
Step 2: Look up the probability in Appendix D and determine the desired
probability using the same methods as before.
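The two steps can be sketched in code as a check on hand calculations. This is an illustration only: it assumes Python's standard-library `statistics.NormalDist` in place of Appendix D, and the helper name `sample_mean_z` is hypothetical. The z-score divides the distance between the sample mean and the population mean by the standard error (standard deviation over the square root of n). The numbers come from Example 1 below (mean 60, standard deviation 12, sample of 9).

```python
import math
from statistics import NormalDist

def sample_mean_z(sample_mean, pop_mean, sd, n):
    """z-score of a sample mean; sd is sigma if known, otherwise s."""
    return (sample_mean - pop_mean) / (sd / math.sqrt(n))

standard_normal = NormalDist()   # mean 0, standard deviation 1

# Step 1: z-scores for sample means of 63 and 56
z_63 = sample_mean_z(63, pop_mean=60, sd=12, n=9)
z_56 = sample_mean_z(56, pop_mean=60, sd=12, n=9)

# Step 2: turn the z-scores into probabilities
p_greater_63 = 1 - standard_normal.cdf(z_63)
p_less_56 = standard_normal.cdf(z_56)
p_between = standard_normal.cdf(z_63) - standard_normal.cdf(z_56)

print(f"z = {z_63:.2f} and {z_56:.2f}")
print(f"{p_greater_63:.4f} {p_less_56:.4f} {p_between:.4f}")
```

The key difference from the earlier normal curve problems is the divisor: the standard error, not the population standard deviation itself.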
Examples:
1. A normal population has a mean of 60 and a standard deviation of 12. You
select a random sample of 9. Compute the probability that the sample mean
is:
a. between 60 and 63
b. greater than 63
c. less than 56
d. between 56 and 63
e. between 50 and 56
2. A population of 100 with an unknown shape has a mean of 75. You select a
sample of 40. The standard deviation of the sample is 5. Compute the
probability that the sample mean is:
a. less than 74
b. between 74 and 77
c. between 76 and 77
d. greater than 77
e. less than 76
Remember:
For finding sample mean probabilities:
Step 1: Calculate the z-score
Step 2: Use Appendix D (z-score chart) and interpret your answer.
Extra Examples:
1. A normal population has a mean of 60 and a standard deviation of 8. A
random sample of 9 is taken.
a. What is the probability that the sample mean is between 60 and 65?
b. What is the probability that the sample mean is between 54 and 60?
c. What is the probability that the sample mean is between 54 and 65?
2. A population of unknown shape has a mean of 70. You select a sample of 42.
The standard deviation of the sample is 5.
a. Compute the probability the sample mean is greater than 71.
b. Compute the probability the sample mean is less than 68.8.
c. Compute the probability the sample mean is greater than 68.8.
d. Compute the probability the sample mean is less than 71.
3. A trucking company claims that the mean weight of their delivery trucks
when they are fully loaded is 6000 pounds and the standard deviation is 250
pounds. Assume that the population follows the normal distribution. Ninety
trucks are randomly selected and weighed.
a. What is the probability that the sample mean is between 6020 and 6070
pounds?
b. What is the probability that the sample mean is between 5970 and 5980
pounds?
c. What is the probability that the sample mean is between 5970 and 6020?
d. What is the probability that the sample mean is more than 6020?
e. What is the probability that the sample mean is less than 6020?
f. What is the probability the sample mean is more than 5970?
g. What is the probability that the sample mean is less than 5970?
h. What is the probability that the sample mean is between 5970 and 6000?
More Practice!
Worksheet 6.6
1. The mean rent for a one-bedroom apartment in Southern California is
$2,200 per month. The distribution of the monthly costs does not
follow the normal distribution. In fact, it is positively skewed. What is
the probability of selecting a sample of 50 one-bedroom apartments
and finding the mean to be at least $1,950 per month? The standard
deviation of the sample is $250.
2. According to an IRS study, it takes an average of 330 minutes for
taxpayers to prepare, copy, and electronically file a 1040 tax form. A
consumer watchdog agency selects a random sample of 40 taxpayers
and finds the standard deviation of the time to prepare, copy, and
electronically file form 1040 is 80 minutes.
a. What assumption or assumptions do you need to make about the
shape of the population?
b. What is the standard error of the mean in this example?
c. What is the likelihood the sample mean is greater than 320
minutes?
d. What is the likelihood the sample mean is between 320 and 350
minutes?
e. What is the likelihood the sample mean is greater than 350
minutes?
3. Recent studies indicate that the typical 50-year-old woman spends
$350 per year for personal-care products. The distribution of the
amounts spent is positively skewed. We select a random sample of 40
women. The mean amount spent for those sampled is $335, and the
standard deviation of the sample is $45. What is the likelihood of
finding a sample mean this large or larger from the specified
population?
4. Information from the American Institute of Insurance indicates the
mean amount of life insurance per household in the United States is
$110,000. This distribution is positively skewed. The standard
deviation of the population is not known.
a. A random sample of 50 households revealed a mean of $112,000
and a standard deviation of $40,000. What is the standard
error of the mean?
b. Suppose that you selected 50 samples of households. What is
the expected shape of the distribution of the sample mean?
c. What is the likelihood of selecting a sample with a mean of at
least $112,000?
d. What is the likelihood of selecting a sample with a mean of
more than $100,000?
e. Find the likelihood of selecting a sample with a mean of more
than $100,000 but less than $112,000.
5. The mean age at which men in the United States marry for the first
time is 24.8 years. The shape and the standard deviation of the
population are both unknown. For a random sample of 60 men, what is
the likelihood that the age at which they were married for the first
time is less than 25.1 years? Assume that the standard deviation of
the sample is 2.5 years.
6. A recent study by the Greater Los Angeles Taxi Drivers Association
showed that the mean fare charged for service from Hermosa Beach
to the Los Angeles International Airport is $18.00 and the standard
deviation is $3.50. We select a sample of 15 fares.
a. What is the likelihood that the sample mean is between $17.00
and $20.00?
b. What must you assume to make the above calculation?
6.3.3 (new outline addition) Apply the Central Limit Theorem to
construct confidence intervals for sample means and sample proportions
Confidence Interval for a Population Mean: Normal (z) Statistic
What does a confidence interval for a population mean actually mean?
When we form a 95% confidence interval, for example, for μ, we usually
express our confidence interval with a statement such as, “We can be 95%
confident that μ lies between the lower and upper bounds of the confidence
interval.” The statement reflects our confidence in the estimation process
rather than in the particular interval that is calculated from the sample
data. We know that repeated application of the same procedure will result in
different lower and upper bounds on the interval. Also, we know that 95% of
the resulting intervals will contain μ.
Large Sample 100(1 − α)% Confidence Interval for μ
$\bar{x} \pm z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$
Where $z_{\alpha/2}$ is the z-value with an area α/2 to its right, σ is
the standard deviation of the sampled population, and n is the sample size.
Note: When σ is unknown and n is large (say, n ≥ 30), the confidence
interval is approximately equal to
$\bar{x} \pm z_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$
Where s is the sample standard deviation
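A large-sample interval for μ takes the sample mean plus or minus the z-value for the chosen confidence level times the standard error. The sketch below illustrates this with the numbers of Example 1 below (225 flights, sample mean 11.6, sample standard deviation 4.1, 90% confidence). It is an illustration under stated assumptions: the sample standard deviation stands in for σ because the sample is large, and the function name `mean_confidence_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sd, n, confidence):
    """Large-sample z interval for a population mean."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z-value with area alpha/2 to its right
    margin = z * sd / math.sqrt(n)            # z times the standard error
    return sample_mean - margin, sample_mean + margin

lower, upper = mean_confidence_interval(11.6, sd=4.1, n=225, confidence=0.90)
print(f"{lower:.2f} to {upper:.2f} unoccupied seats per flight")
```

Raising the confidence level widens the interval, while a larger sample narrows it.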
Examples:
1. Unoccupied seats on flights cause airlines to lose revenue. Suppose a
large airline wants to estimate its average number of unoccupied seats per
flight over the past year. To accomplish this, the records of 225 flights are
randomly selected, and the number of unoccupied seats is noted for each
of the sampled flights. The mean of the sample is 11.6 and the standard
deviation of the sample is 4.1. Estimate μ, the mean number of unoccupied
seats per flight during the past year, using a 90% confidence interval.
2. A random sample of n measurements was selected from a population
with unknown mean μ and known standard deviation σ. Calculate a 95%
confidence interval for μ for each of the following situations:
a.
b.
,
,
,
,
c.
,
d.
,
,
,
Large Sample Confidence Interval for a Population Proportion
The fact that $\hat{p}$ is a “sample mean number of successes per trial” allows us to
form confidence intervals about p in a manner that is completely analogous
to that used for large-sample estimation of μ.
Large Sample Confidence Interval for p:
$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$
Where $\hat{p} = \frac{x}{n}$ and $\hat{q} = 1 - \hat{p}$
Note: In order to be able to use this formula, the sample must be a random
sample from the population and the sample size n must be large (usually this
means that $n\hat{p}$ and $n\hat{q}$ are both sufficiently large)
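The interval for a proportion follows the same pattern: the sample proportion plus or minus a z-value times its estimated standard error, the square root of p̂(1 − p̂)/n. The sketch below is an illustration using the numbers of Example 1 below (257 optimistic consumers out of 484, 90% confidence). The function name `proportion_confidence_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence):
    """Large-sample z interval for a population proportion."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin

lower, upper = proportion_confidence_interval(257, 484, confidence=0.90)
print(f"{lower:.3f} to {upper:.3f}")

# A "majority" claim needs the whole interval to sit above 0.5
majority_supported = lower > 0.5
print("majority supported:", majority_supported)
```

Comparing the lower bound with 0.5 is one way to answer the "majority" question in the example.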
Examples:
1. A polling agency conducts a survey to determine the current consumer
sentiment concerning the state of the economy. Suppose that the company
randomly samples 484 consumers and finds that 257 are optimistic about
the state of the economy. Use a 90% confidence interval to estimate the
proportion of all consumers who are optimistic about the state of the
economy. Based on the confidence interval, can the company conclude that
the majority of consumers are optimistic about the economy?
2. A newspaper reported that the majority of Americans say that Starbucks
coffee is overpriced. A telephone survey of 1000 American adults found
that 730 of them believed the coffee was overpriced. Find and interpret a
95% confidence interval for the proportion.
6.3.4 Compute sample size for sampling data for an estimation of mean
or proportion
One of the most important decisions researchers need to make when
planning an experiment is the size of the sample. This section shows
how to calculate an appropriate sample size for making an inference
about a population mean or proportion. The sample size depends strongly on the desired
reliability of the experiment.
Sample Size Determination for 100(1 − α)% Confidence Interval for μ
In order to estimate μ with a sampling error (SE) and with 100(1 − α)%
confidence, the required sample size is found as follows:
$n = \frac{(z_{\alpha/2})^2 \sigma^2}{(SE)^2}$
Note: The value of σ is usually unknown. It can be estimated using: either
the sample standard deviation, s, if it is known OR $\sigma \approx \frac{R}{4}$, where R is the
range of observations in the population. Also, it is good practice to round
the value of n upward to ensure that the sample size will be sufficient to
achieve the specified reliability.
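The sample size calculation for a mean can be sketched as follows. This is an illustration under stated assumptions: the required n is taken as (z·σ/SE)² rounded up, σ is estimated as one quarter of the range, and the function name `sample_size_for_mean` is hypothetical. The numbers are those of Example 1 below (pressures ranging from 13.3 to 13.7 pounds, sampling error 0.025, 99% confidence).

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, sampling_error, confidence):
    """Smallest n with z * sigma / sqrt(n) no larger than the sampling error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / sampling_error) ** 2)   # always round upward

sigma_estimate = (13.7 - 13.3) / 4      # range / 4 as a rough estimate of sigma
n_required = sample_size_for_mean(sigma_estimate, sampling_error=0.025,
                                  confidence=0.99)
print(n_required)
```

Halving the sampling error roughly quadruples the required sample size, since SE is squared in the denominator.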
Examples:
1. The manufacturer of official NFL footballs uses a machine to inflate its
new balls to a pressure of 13.5 pounds. When the machine is properly
calibrated, the mean inflation pressure is 13.5 pounds, but uncontrollable
factors cause the pressures of individual footballs to vary randomly from
about 13.3 to 13.7 pounds. For quality-control purposes, the manufacturer
wishes to estimate the mean inflation pressure to within 0.025 pound of its
true value with a 99% confidence interval. What sample size should be used?
2. Suppose you wish to estimate a population mean correct to within 0.20
with a probability equal to 0.90. You do not know σ, but you know that the
observations will range in value between 30 and 34.
3. A study wishes to determine the mean bending strength of imported
white wood used on the roof of an ancient Japanese temple. The researchers
would like to estimate the true mean breaking strength of the wood to
within 4 MPa using a 90% confidence interval. How many pieces of the wood
need to be tested if the sample standard deviation of the breaking
strengths from the study was 10.9 MPa?
The procedure for finding the sample size necessary to estimate a
population proportion with a specified sampling error, SE, is as
follows:
Sample Size Determination for a 100(1 − α)% Confidence Interval for p
In order to estimate a binomial probability p with sampling error SE and
with 100(1 − α)% confidence, the required sample size is found by solving
the following equation for n:
$$ z_{\alpha/2} \sqrt{\frac{pq}{n}} = SE, \quad \text{so that} \quad n = \frac{(z_{\alpha/2})^2 \, pq}{(SE)^2}, \quad q = 1 - p $$
Note: Because the value of the product pq is unknown, it can be estimated
by using the sample fraction of successes, p̂, from a prior sample.
Remember that the value of pq is at its maximum when p equals 0.5, so
you can obtain conservatively large values of n by approximating p by 0.5
or values close to 0.5. In any case, you should round the value of n obtained
upward to ensure that the sample size will be sufficient to achieve the
specified reliability. So, if you don’t know p, let it equal 0.5!
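As a rough sketch of this procedure in Python (the function name and the numbers are hypothetical, chosen only for illustration, and z comes from the exact normal distribution rather than a rounded table value):

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(sampling_error, confidence, p=0.5):
    """Smallest n satisfying n >= z_{alpha/2}^2 * p * q / SE^2, rounded upward.

    p defaults to 0.5, the conservative choice when nothing is known about p.
    """
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    q = 1 - p
    return math.ceil(z ** 2 * p * q / sampling_error ** 2)


# Hypothetical numbers: estimate p to within 0.03 with 95% confidence.
n_conservative = sample_size_for_proportion(0.03, 0.95)        # no prior knowledge of p
n_with_prior = sample_size_for_proportion(0.03, 0.95, p=0.3)   # prior sample suggests p near 0.3
print(n_conservative, n_with_prior)                            # 1068 897
```

The conservative choice p = 0.5 always gives the largest sample size, which is why it is safe to use when p is unknown.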
Examples:
1. In each case, find the approximate sample size required to construct a
95% confidence interval for p that has a sampling error of 0.08.
a. Assume p is near 0.2.
b. Assume you have no prior knowledge about p, but you wish to be certain
that your sample is large enough to achieve the specified accuracy for the
estimate.
2. A warehouse stores approximately 60 million empty aluminum beer and
soda cans. Recently, a fire occurred at the warehouse. The smoke from the
fire contaminated many of the cans with blackspot, making them unusable. A
statistician was hired to estimate p, the true proportion of cans that were
contaminated by the fire. How many aluminum cans should be randomly
sampled to estimate p to within 0.02 with 90% confidence?
3. A survey was conducted to determine what “Made in Canada” means to
consumers. 64 of 106 shoppers at a mall believe that it implies that all labor
and materials are produced in Canada. Suppose the researchers want to
increase the sample size in order to estimate the true proportion to within
0.05 of its true value using a 90% confidence interval. Compute the sample
size necessary to obtain the desired estimate.
Test #2 up to here!!!
7.0 Hypothesis Testing
7.1 Introduction
7.1.1 Describe the purpose of Hypothesis Testing
What is Hypothesis Testing?
Hypothesis testing is a statistical procedure which involves a decision-making process for
evaluating claims about a certain parameter of a population.
As a researcher of data, you may be interested in answering many types of questions.
Automobile manufacturers may be interested in determining whether seat belts will
reduce the severity of injuries caused by accidents. A ladies' wear store may want to know
whether the general public prefers a certain colour in a new line of fashion swim wear.
These types of questions can be answered using the methods of hypothesis testing.
Hypothesis testing starts with a statement about a population parameter such as the mean.
What is a Hypothesis?
In statistical analysis we make a claim, that is, state a hypothesis, then follow up with
tests to verify the assertion or to determine that it is untrue.
Because we utilize statistical inference, it is not necessary to measure the entire population;
instead, we take a sample from the population to determine whether the empirical evidence
from the sample does or does not support the statement concerning the population.
As noted, hypothesis testing starts with a statement about a population parameter such as
the mean.


Example: One statement about the performance of a new model car is that the
mean miles per gallon is 30.
Another statement is that the mean miles per gallon is not 30.
Only one of these statements is correct.
To test the validity of the assumption (hypothesis) that the mean miles per gallon is 30, we
must select a sample from the population, calculate sample statistics, and based on certain
decision rules either accept or reject the hypothesis.
7.2 Hypothesis Tests
7.2.1 Name and describe the components of a statistical hypothesis test
Five-Step Procedure for Hypotheses Testing
When conducting hypothesis tests we actually employ a strategy of "proof by
contradiction." We hope to accept a statement to be true by rejecting or ruling out another
statement. Statistical hypothesis testing is a five-step procedure:
Hypothesis Testing - Step 1
The first step is to state the null and alternate hypotheses. What is the null hypothesis?
For example, a recent newspaper report made the claim that the mean length of a hospital
stay was 3.3 days. You think that the true length of stay is some other length than 3.3
days.
The null hypothesis is written Ho: µ = 3.3
It is the statement about the value of the population parameter - in this case the population
mean. The null hypothesis is established for the purpose of testing. On the basis of the
sample evidence, it is either rejected or not rejected. In other words, it is accepted or
rejected.
If the null hypothesis is rejected, then we accept the alternate hypothesis.
The alternate hypothesis is written H1: µ ≠ 3.3
There are two other formats for writing the null and alternate hypotheses: Suppose you
think that the mean length of stay is greater than 3.3 days. The null and alternate
hypothesis would be written:
Ho: µ ≤ 3.3
H1: µ > 3.3
Note that in this case the null hypothesis indicates "no change or that µ
is less than 3.3."
The alternate hypothesis states that the mean length of stay is greater than 3.3 days.
Suppose you think that the mean length of stay is less than 3.3 days. The null and alternate
hypothesis would be written:
Ho: µ ≥ 3.3
H1: µ < 3.3
It is important to remember that no matter how the problem is stated, the null
hypothesis will always contain the equal sign. The equality sign will never appear in the
alternate hypothesis.
One-tailed versus two-tailed test
When a direction is expressed in the alternate hypothesis, such as > or <, the test is
referred to as being a one-tailed test. When the alternate hypothesis is that of "≠" (not
equal to), the test will be a two-tailed test.
Hypothesis Testing - Step 2
After setting up the null hypothesis and alternate hypothesis, the next step is to state the
level of significance.
The level of significance is designated α, the Greek letter alpha. It is the probability of
rejecting the null hypothesis when it is actually true, and it will indicate when the
sample mean is too far away from the hypothesized mean for the null hypothesis to be true.


When a true null hypothesis is rejected it is referred to as a Type I error.
If the null hypothesis is not true, but our sample results indicate that it is, we have
a Type II error.
Hypothesis Testing - Step 3
Step 3 of the hypothesis testing procedure is to compute the test statistic. What is a
test statistic?
Which test statistic do I use? The answer to this question is determined by factors such
as whether the population standard deviation is known and the size of the sample.
The standard normal distribution, the z value, is used
if the population is normally distributed
if the population standard deviation is known
and, when the sample size is greater than 30.
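The notes do not reproduce the formula for the z test statistic at this point, so the sketch below assumes the usual form for a test about one mean, z = (sample mean − hypothesized mean) / (standard deviation / √n). The function name and the numbers are hypothetical and serve only as an illustration.

```python
import math


def z_statistic(sample_mean, hypothesized_mean, std_dev, n):
    """z test statistic for one mean (assumed standard form).

    std_dev is the population standard deviation if it is known; for a sample
    larger than 30 the sample standard deviation is used in its place.
    """
    standard_error = std_dev / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error


# Hypothetical numbers: Ho says the mean is 50; a sample of 64 has a mean of 52,
# and the population standard deviation is 8.
z = z_statistic(52, 50, 8, 64)
print(z)  # 2.0
```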
Hypothesis Testing - Step 4
Formulate the Decision Rule: A decision rule is based on Ho and H1 , the level of
significance, and the test statistic. The decision rule is formulated by finding the critical
values for z.
If we are applying a one-tailed test, there is only one critical value. If we are applying a
two-tailed test, there are two critical values.
The following diagram illustrates the critical values for a two-tailed test, at the 0.01 level
of significance. Since this is a two-tailed test, half of the 0.01, that is 0.005, is found in each tail. The area where Ho is not rejected is therefore 0.99. Since appendix D is based on
half of the area under the curve, we locate 0.99/2 = 0.4950 in the body of the table to find
the corresponding z critical values, -2.58 and +2.58.
Therefore, our decision rule is:
Reject the null hypothesis and accept the alternate hypothesis if the computed value of z
does not fall in the region between -2.58 and +2.58.
To find the critical value for a one-tailed test, at the 0.01 level of significance, place the
0.01 of the total area in the upper or lower tail. This means that 0.5000 - 0.01 = 0.4900 of
the area is located between the z value of 0 and the critical value. We locate 0.4900 in the
body of Appendix D and our decision rule is to reject the null hypothesis if the computed
value from the test statistic exceeds 2.33 for an upper-tailed test or is less than -2.33 for
a lower tailed test.
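The table lookups described above can be checked with a short Python sketch. It uses the standard library's normal distribution instead of Appendix D; the function name and the tail labels are assumptions made for this illustration.

```python
from statistics import NormalDist


def critical_values(alpha, tail="two"):
    """Critical z value(s) for a given level of significance.

    tail is "two", "upper", or "lower" (labels chosen for this sketch).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        z = std_normal.inv_cdf(1 - alpha / 2)  # alpha/2 in each tail
        return (-z, z)
    if tail == "upper":
        return (std_normal.inv_cdf(1 - alpha),)  # all of alpha in the upper tail
    if tail == "lower":
        return (std_normal.inv_cdf(alpha),)      # all of alpha in the lower tail
    raise ValueError("tail must be 'two', 'upper', or 'lower'")


# The cases discussed in the notes, at the 0.01 level of significance.
print([round(z, 2) for z in critical_values(0.01, "two")])    # [-2.58, 2.58]
print([round(z, 2) for z in critical_values(0.01, "upper")])  # [2.33]
print([round(z, 2) for z in critical_values(0.01, "lower")])  # [-2.33]
```

The exact values are slightly different from the table (for example 2.576 rather than 2.58), because the table is rounded to two decimal places.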
The following diagrams will illustrate the acceptance and rejection area for an upper-tailed
test.
Hypothesis Testing - Step 5
Select the Sample and Make a Decision: The final step is to select the sample and
compute the value of the test statistic. This value is compared to the critical value, or
values, and a decision is made whether to reject or not reject the null hypothesis.
In the following example the critical values for z are -2.58 and +2.58 (a two-tailed test).
The computed value of z = 1.55. Since the computed value falls in the acceptance range, we
do not reject the null hypothesis.
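The decision in this example can be reproduced with a small Python sketch. The function name is an assumption for this illustration; the p-value it also returns is the probability of a z value at least this extreme in either tail, which the later worksheet questions ask for.

```python
from statistics import NormalDist


def two_tailed_decision(z, alpha):
    """Compare a computed z with the two-tailed critical values."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    reject = abs(z) > critical                   # outside -critical .. +critical
    p_value = 2 * (1 - std_normal.cdf(abs(z)))   # area in both tails beyond |z|
    return reject, critical, p_value


# The example from the notes: computed z = 1.55 at the 0.01 significance level.
reject, critical, p_value = two_tailed_decision(1.55, 0.01)
print(reject)              # False, so Ho is not rejected
print(round(critical, 2))  # 2.58
print(round(p_value, 3))   # about 0.121, which is larger than 0.01
```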
7.0 Hypothesis test examples:
1. A company manufactures desks. Their production follows the normal
distribution, with a mean of 200 per week and a standard deviation of 16.
The president would like to investigate whether the mean number of desks is
different from 200 at the 0.01 significance level. A sample accumulated over
150 weeks has a mean of 203.5. Is the president right in assuming that the
mean number of desks is different from 200?
2. The rate at which a stock of aspirin changes each year has a mean of
6.0 and a standard deviation of 0.50. A random sample of 64 aspirin revealed
a mean of 5.84. It is suspected that the mean turnover has changed and is
no longer 6.0. Use the 0.05 significance level to test the hypothesis that the
mean turnover is not 6.0.
3. The mean age of passenger cars in the US is 8.4 years. A sample of 40
cars in the student lots at the University of Tennessee showed the mean age
to be 9.2 years. The standard deviation of this sample was 2.8 years. At the
0.1 significance level, can we conclude the mean age is more than 8.4 years
for the cars of Tennessee students?
4. The manager of a store wants to find whether the mean unpaid balance is
more than $400. The level of significance is set at 0.05. A random sample of
60 unpaid balances revealed the sample mean is $407 and the standard
deviation is $22.50. Should she conclude that the mean is greater than
$400?
5. The mean amount of time spent watching TV per day for eighth graders is
1.6 hours. A sample of 35 eighth graders showed the mean number of hours to
be 1.3 hours with a standard deviation of 1.0 hours. At the 0.01 significance
level, can we conclude that the mean time is less than 1.6 hours?
6. The mean number of minutes spent on the phone by employees is said to be
37 with a standard deviation of 2.1. The owner of a company wants to
determine whether the mean number of minutes is less than 37. She takes a
sample of 43 employees and finds that the mean amount of time spent is 33.
Can we conclude that the mean number of minutes is less than 37? (Use the
0.05 significance level.)
7. A town council claims that the mean time citizens spend
commuting to work is 28 minutes. A company believes that the mean is not
28 minutes and takes a sample of 50 citizens. They determine that the mean
commuting time of the sample is 36 minutes with a standard deviation of 11
minutes. At the 0.01 significance level, can the company conclude that the
mean commuting time for the town is different from 28 minutes?
Worksheet for 7.0
1. The following information is available.
H0: µ = 50
H1: µ ≠ 50
The sample mean is 49, and the sample size is 36. The population follows the normal
distribution and the standard deviation is 5. Use the .05 significance level.
2. The following information is available.
H0: µ ≤ 10
H1: µ > 10
The sample mean is 12 for a sample of 36. The population follows the normal
distribution and the standard deviation is 3. Use the .02 significance level.
3. A sample of 36 observations is selected from a normal population. The sample mean
is 21, and the sample standard deviation is 5. Conduct the following test of
hypothesis using the .05 significance level.
H0: µ ≤ 20
H1: µ > 20
4. A sample of 64 observations is selected from a normal population. The sample mean
is 215, and the sample standard deviation is 15. Conduct the following test of
hypothesis using the .03 significance level.
H0: µ ≥ 220
H1: µ < 220
For Exercises 5-8: (a) State the null hypothesis and the alternate hypothesis.
(b) State the decision rule. (c) Compute the value of the test statistic. (d)
What is your decision regarding H0? (e) What is the p-value? Interpret it.
5. The manufacturer of the X-15 steel-belted radial truck tire claims that the mean
mileage the tire can be driven before the tread wears out is 60,000 miles. The
Crosset Truck Company bought 48 tires and found that the mean mileage for their
trucks is 59,500 miles with a standard deviation of 5,000 miles. Is Crosset’s
experience different from that claimed by the manufacturer at the .05 significance
level?
6. The MacBurger restaurant chain claims that the waiting time of customers for
service is normally distributed, with a mean of 3 minutes and a standard deviation of
1 minute. The quality-assurance department found in a sample of 50 customers at
the Warren Road MacBurger that the mean waiting time was 2.75 minutes. At the
.05 significance level, can we conclude that the mean waiting time is less than 3
minutes?
7. A recent national survey found that high school students watched an average (mean)
of 6.8 DVDs per month. A random sample of 36 college students revealed that the
mean number of DVDs watched last month was 6.2, with a standard deviation of 0.5.
At the .05 significance level, can we conclude that college students watch fewer
DVDs a month than high school students?
8. At the time she was hired as a server at the Grumney Family Restaurant, Beth
Brigden was told, “You can average more than $80 a day in tips.” Over the first 35
days she was employed at the restaurant, the mean daily amount of her tips was
$84.85, with a standard deviation of $11.38. At the .01 significance level, can Ms.
Brigden conclude that she is earning an average of more than $80 in tips?
7.5 Tests Concerning Means for Small Samples and Tests concerning
Proportions
This is the method for doing hypothesis tests if you are given s and the
sample size, n, is LESS than 30.
t distribution:
So far we have been talking about the Standard Normal distribution with
the sample chosen being greater than 30, but what happens if the sample is
30 or less?
If we have a population that is already normally distributed and the
population standard deviation, σ, is unknown, then we can pick a sample less
than 30 and we can use the t distribution.
1. The degrees of freedom of a sample are given by: d.f. = n - 1
2. The t score is given by:
$$ t = \frac{\bar{X} - \mu}{s / \sqrt{n}} $$
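As a minimal sketch of these two formulas in Python (the function name and the numbers are hypothetical and are not one of the examples below). Python's standard library has no t distribution, so the critical value here is read from a standard t table, as it would be by hand.

```python
import math


def t_statistic(sample_mean, hypothesized_mean, s, n):
    """Return (t, degrees of freedom) for a one-sample t test."""
    t = (sample_mean - hypothesized_mean) / (s / math.sqrt(n))
    return t, n - 1


# Hypothetical numbers: Ho says the mean is 100; a sample of 16 has a mean of 103
# and a sample standard deviation of 6.
t, df = t_statistic(103, 100, 6, 16)
print(t, df)  # 2.0 15

# From a t table: two-tailed test, 0.05 level of significance, 15 degrees of freedom.
T_CRITICAL = 2.131
reject = abs(t) > T_CRITICAL
print(reject)  # False, so Ho is not rejected
```

The same computed value, 2.0, would have been enough to reject Ho with the z critical values of ±1.96, which shows why a larger value is needed when t is used.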
Characteristics of t Distribution
The Student's t distribution is also referred to as the t distribution. It is similar to the
standard normal distribution in some ways, but quite different in others.
From the following chart comparing the two distributions, you will note that the t
distribution is flatter and more spread out than the z distribution.
Note that when t is used as the test statistic instead of the standard normal z distribution:
o The region for which the null hypothesis cannot be rejected is wider.
o A larger value of t will be required to reject the null hypothesis.
A further requirement is that the population from which the sample is obtained should be
normal or approximately normal.
Hypothesis testing using t-tests:
We have seen how we can use the z-score test statistic for hypothesis
testing but if the sample that is chosen from the population is 30 or less
then we must use the t test statistic.
Note: To use the t statistic we must:
a) have n ≤ 30
b) have a population that is essentially normal
c) have a population standard deviation, σ, that is unknown
Examples:
1. A manufacturer of computer disk drives monitors the retail prices of its
drives in order to gauge the market. For one type of drive the average price
is $750, and the manufacturer wishes to know whether the current mean
retail price differs from $750. Seventeen retail establishments are
sampled, and their current prices for the drive are determined. The mean
for the 17 retail prices is $732 and the standard deviation is $38. Can we
conclude that the mean price differs from the list price of $750? (Use 0.05
for the level of significance.)
2. In any bottling process, a manufacturer will lose money if the bottles
contain more or less than is claimed on the label. Suppose a quality manager
for a mustard company is interested in testing whether the mean number of
ounces of mustard per family-size bottle differs from the labeled amount of
20 ounces. The manager samples nine bottles, measures the weight of their
contents, and finds the sample to have a mean of 19.7 ounces and a standard
deviation of 0.3 ounces. Does the sample evidence indicate that the mustard
dispensing machine needs adjusting? Test at the 0.05 significance level.
3. An insurance company reports that the mean cost to process a claim is
$60. Cost-cutting measures were introduced. To evaluate if the measures
worked, a sample of 26 claims was taken and found to have a mean of
$56.42 with a standard deviation of $10.04. At the 0.01 significance level, is
it reasonable to conclude that the mean cost to process a claim is now less
than $60?
4. The mean life of a clock battery is 305 days. The battery was changed to
make it last longer. A sample of 20 of the changed batteries had a mean life
of 311 days with a standard deviation of 12 days. Is it reasonable to conclude
that the change increased the mean life of the battery? Use the 0.05
significance level.
5. The mean length of a bar is 43 millimeters. The production supervisor is
concerned that the adjustments of the machine producing the bars have
changed. To investigate, he takes a sample of 12 bars and they are found to
have a mean of 41.5 millimetres with a standard deviation of 43.0
millimetres. Is it fair to conclude that there has been a change in the mean
length of the bars? Use the 0.02 significance level.
Tests Concerning Proportions
What is a proportion?
For example, we want to estimate the proportion of all home sales made to first time
buyers. A random sample of 200 recent transactions showed that 40 were first time
buyers. Therefore, we estimate that 0.20, or 20% (40/200), of all sales are made to first
time buyers.
To conduct a test of hypothesis for proportions, the same assumptions required for the
binomial distribution must be met.




Each outcome is classified into one of two categories such as, buyers were either
first time home buyers or they were not.
The number of trials is fixed. In the above case it is 200.
Each trial is independent - meaning that the outcome of one trial has no bearing on
the outcome of any other.
The probability of a success is fixed. In the above example, the probability is 0.20
for all 200 buyers.
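Again, the five-step procedure for hypothesis testing is used. The test statistic for
testing hypotheses about proportions is the standard normal distribution. The computed z
is found by:
$$ z = \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}} $$
where p̂ is the sample proportion, p is the hypothesized population proportion, and n is the
sample size.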
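A short Python sketch of the computed z for a test about one proportion follows. It assumes the usual form of the statistic, the difference between the sample proportion and the hypothesized proportion divided by √(p(1 − p)/n), with p the hypothesized value. The function name and the numbers are hypothetical.

```python
import math


def proportion_z(successes, n, hypothesized_p):
    """z test statistic for one proportion (assumed standard form)."""
    sample_p = successes / n
    standard_error = math.sqrt(hypothesized_p * (1 - hypothesized_p) / n)
    return (sample_p - hypothesized_p) / standard_error


# Hypothetical numbers: Ho says p = 0.50; 115 of 200 sampled items are successes.
z = proportion_z(115, 200, 0.50)
print(round(z, 2))  # 2.12
```

Note that the standard error is built from the hypothesized proportion, not the sample proportion, because the test statistic is computed assuming Ho is true.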
Examples:
1. A report shows that 40 percent of people involved in minor traffic
accidents this year have been involved in at least one other traffic accident
in the last five years. A group decided to investigate this claim, believing it
was too large. A sample of 200 traffic accidents this year showed 74 people
were also involved in another accident within the last five years. Use the
0.01 significance level to see if you can conclude that a smaller percentage
of people than 40 percent are involved in more than one accident.
2. Prior elections in Indiana indicate it is necessary for a candidate for
governor to receive at least 80 percent of the vote in the northern section
of the state to be elected. The incumbent governor is interested in
assessing his chances of returning to office and plans to conduct a survey of
2000 voters. He finds that of the 2000 sampled, 1550 plan on voting for
him. Can he conclude that the percentage of votes needed to win is different
from 80%? (Use the 0.05 significance level)
3. An urban planner claims that, nationally, 20 percent of all families renting
condos move during a given year. A random sample of 200 families renting
condos in Dallas revealed that 56 had moved during the past year. At the
0.01 significance level, can we conclude that a larger proportion of condo
renters moved in the Dallas area?
4. A TV manufacturer found that 10 percent of its sets needed repair in the
first 2 years of operation. In a sample of 50 sets manufactured 2 years ago,
9 needed repair. At the 0.05 significance level, can we conclude that the
percent needing repair is larger than the 10 percent cited by the
manufacturer?
Worksheet for 7.5 (t-test for one sample)
1. A company claims that the mean number of cars sold per day is 10. A
random sample of 10 days is chosen, and the sample mean was found to be 12
cars a day and the sample standard deviation was 3. Use the 0.05 significance level to
determine whether we can conclude that the mean number of cars sold is
greater than 10.
2. A government agency claims that 400 people have traffic accidents every
day. A sample of 12 days is taken and the mean is found to be 407 and the
sample standard deviation is 6. Use the 0.01 significance level to determine
whether we can conclude that the mean is different from 400.
3. A district sales manager claims that the sales representatives make an
average of 40 sales calls per week on professors. Several reps say that this
estimate is too low. To investigate, a random sample of 28 sales
representatives reveals that the mean number of calls made last week was
42. The standard deviation of the sample is 2.1 calls. Using the 0.05
significance level, can we conclude that the mean number of calls per
salesperson per week is more than 40?
4. The management of White Industries is considering a new method of
assembling its golf cart. The present method requires 42.3 minutes, on the
average, to assemble a cart. The mean assembly time for a random sample of
24 carts, using the new method, was 40.6 minutes, and the standard
deviation of the sample was 2.7 minutes. Using the 0.10 significance level,
can we conclude that the assembly time using the new method is faster?
5. A spark plug manufacturer claimed that the mean life of its plugs is
different from 22100 miles. Assume that the life of the spark plugs follows
the normal distribution. A fleet owner purchased a large number of sets. A
sample of 18 sets revealed that the mean life was 23400 miles and the
standard deviation was 1500 miles. Is there enough evidence to substantiate
the manufacturer’s claim at the 0.05 significance level?
Worksheet for 7.5 (proportion test for one sample)
1. A company claims that 70% of pet owners in a town have dogs. A
sample of 100 observations revealed that 75 pet owners have dogs. At the
0.05 significance level, can we conclude that the percentage of pet owners
who have dogs is bigger than 70%?
2. A newspaper reported that 40% of people taking a certain drug exhibit
side effects. A sample of 120 observations revealed that 36 people
exhibited side effects. At the 0.05 significance level, can we conclude that
the percentage exhibiting side effects is different from 40%?
3. The National Safety Council reported that 52 percent of American
turnpike drivers are men. A sample of 300 cars travelling southbound on the
New Jersey Turnpike yesterday revealed that 170 were driven by men. At
the 0.01 significance level, can we conclude that a larger proportion of men
were driving on the New Jersey Turnpike than the national statistics
indicate?
4. A recent article in USA Today reported that a job awaits only one in
three new college graduates. The major reasons given were an
overabundance of college graduates and a weak economy. A survey of 200
recent graduates from your school revealed that 80 students had jobs. At
the 0.02 significance level, can we conclude that a larger proportion of
students at your school have jobs?
5. Chicken Delight claims that 90 percent of its orders are delivered within
10 minutes of the time the order is placed. A sample of 100 orders revealed
that 82 were delivered within the promised time. At the 0.10 significance
level, can we conclude that less than 90 percent of the orders are delivered
in less than 10 minutes?
6. Research at the University of Toledo indicates that 50 percent of the
students change their major area of study after their first year in a
program. A random sample of 100 students in the College of Business
revealed that 48 had changed their major area of study after their first
year of the program. Has there been a significant change in the proportion
of students who change their major after the first year in their program?
Test at the 0.05 level of significance.
7.6 Tests Concerning Differences between the Means of large samples
and Tests Concerning the Differences between Proportions
Hypothesis Testing: Two Population Means
If there are two populations, we can compare two sample means to determine if they came
from populations with the same or equal means.
For example, a purchasing agent is considering two brands of tires for use on the company's
fleet of cars. A sample of 60 Rossford tires indicates the mean useful life to be 45,000
miles. A sample of 50 Maumee tires revealed the useful life to be 48,000 miles. Could the
difference between the two sample means be due to chance?
The assumption is that for both populations, the standard deviations are
either known or have been computed from samples whose sizes are
greater than 30. The test statistic used is the standard normal distribution and its
value is computed as:
$$ z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} $$
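A minimal Python sketch of the two-sample z statistic, assuming the usual form (the difference between the two sample means divided by √(s1²/n1 + s2²/n2)). The function name and the numbers are hypothetical and are not taken from the examples that follow.

```python
import math


def two_sample_z(mean1, s1, n1, mean2, s2, n2):
    """z test statistic for the difference between two means (large samples)."""
    standard_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# Hypothetical numbers: sample 1 has mean 82, standard deviation 6, n = 36;
# sample 2 has mean 80, standard deviation 8, n = 64.
z = two_sample_z(82, 6, 36, 80, 8, 64)
print(round(z, 2))  # 1.41
```

The computed z is then compared with the critical value(s) exactly as in the one-sample case.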
Examples:
1. Customers at a market have a choice when paying for their groceries. They
may check out and pay using the standard cashier check-out, or they can use
the new U-Scan procedure where they do it on their own. The store manager
would like to know if the mean check-out time using the standard method is
longer than using the U-Scan. She gathered the following information.
Method used    Sample mean     Sample standard deviation    Sample size
Standard       5.50 minutes    0.40 minutes                 50
U-Scan         5.30 minutes    0.30 minutes                 100
Can she conclude that the standard method takes longer than the U-Scan
method? Use the 0.01 significance level.
2. The owner of a company noticed a difference in the dollar value of the
sales between the men and the women who work for her. A sample of 40
days showed that the men sold a mean of $1400 worth of goods with a
standard deviation of $200. For a sample of 50 days, the women sold a mean
of $1500 worth of appliances per day with a standard deviation of $250. At
the 0.05 significance level, can she conclude that the mean amount sold per
day is larger for the women?
3. Karen would like to determine whether there are more units produced on
the afternoon shift than on the day shift. A sample of 54 day-shift workers
showed that the mean number of units produced was 345, with a standard
deviation of 21. A sample of 60 afternoon-shift workers showed that the
mean number of units produced was 351, with a standard deviation of 28
units. At the 0.05 significance level, is the number of units produced on the
day shift smaller?
4. A company would like to determine whether the average mark of version
one of a test is different from the average mark of version two of a test. A
sample of 48 people taking test one had a mean mark of 72 percent and a
standard deviation of 12. A sample of 56 people taking test two had a mean
mark of 65 and a standard deviation of 15. At the 0.01 significance level, can
we conclude that the average mark of test one is different from the
average mark of test two?
Two Population Proportions
Often we are interested in whether two population proportions are the same. For example,
we want to compare the proportion of rural voters planning to vote for the incumbent
governor with the proportion of urban voters.
In order to conduct this test, we assume each sample is large enough so that the normal
distribution may be used as a good approximation of the binomial.
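Again, the difference lies in the formula for finding the computed z-value. In this formula
we use the "pooled estimate" of the population proportion:
$$ z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{p_c(1 - p_c)}{n_1} + \frac{p_c(1 - p_c)}{n_2}}}, \quad p_c = \frac{x_1 + x_2}{n_1 + n_2} $$
where x1 and x2 are the numbers of successes in the two samples.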
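The pooled calculation can be sketched in Python as follows. The sketch assumes the usual pooled form, in which the pooled proportion is the total number of successes divided by the total number sampled. The function name and the numbers are hypothetical.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Return (z, pooled proportion) for a test of two proportions."""
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # pooled estimate of the population proportion
    standard_error = math.sqrt(pooled * (1 - pooled) / n1 + pooled * (1 - pooled) / n2)
    return (p1 - p2) / standard_error, pooled


# Hypothetical numbers: 60 successes out of 100 in sample 1,
# 90 successes out of 200 in sample 2.
z, pooled = two_proportion_z(60, 100, 90, 200)
print(pooled)       # 0.5
print(round(z, 2))  # 2.45
```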
Examples:
1. A department store is interested in whether there is a difference in the
proportions of younger and older women who would purchase a type of
perfume if it were marketed. There are two independent populations, a
population consisting of the younger women and a population consisting of
the older women. Each sampled woman will be asked to smell the perfume and
state whether she would buy it. A random sample of 100 young women
revealed 20 liked it enough to buy it. Similarly, a sample of 200 older women
revealed 100 liked it enough to buy it. Using the 0.05 significance level, can
we conclude that there is a difference in the proportion of younger women
and the proportion of older women in whether they would buy the perfume?
2. Of 150 adults who tried a new peach-flavoured peppermint patty, 87
rated it excellent. Of 200 children sampled, 123 rated it excellent. Using
the 0.10 level of significance, can we conclude that there is a significant
difference in the proportion of adults and the proportion of children who
rate the new flavour excellent?
3. The manufacturer of Advil developed a new version and claimed it to be
more effective. To evaluate it, a sample of 200 current users is asked to try
it. After a one-month trial, 180 indicated the new drug was more effective.
At the same time, a sample of 300 current users is given the old drug but
told it is the new formulation (i.e. were given a placebo). From this group,
261 said it was an improvement. At the 0.05 significance level, can we
conclude that the new drug is more effective?
4. A company wishes to determine whether a higher proportion of women
than men prefer cats to dogs as pets. A sample of 320 men showed that 125
preferred cats to dogs and a sample of 250 women showed that 105
preferred cats to dogs. At the 0.02 significance level, can we conclude that
a higher proportion of women than men prefer cats to dogs as pets?
Worksheet #1 for 7.6 (z-test for 2 samples)
1. A sample of 40 observations is selected from one population. The sample mean is
102 and the sample standard deviation is 5. A sample of 50 observations is
selected from a second population. The sample mean is 99 and the sample
standard deviation is 6. Can we conclude that the mean of the first population is
different from the mean of the second population using the 0.04 significance
level?
2. A sample of 65 observations is selected from one population. The sample mean is
2.67 and the sample standard deviation is 0.75. A sample of 50 observations is
selected from a second population. The sample mean is 2.59 and the sample
standard deviation is 0.66. Can we conclude that the mean of the first population
is bigger than the mean of the second population at the 0.08 significance level?
3. The Gibbs Baby Food Company wishes to compare the weight gain of infants
using their brand versus their competitor’s. A sample of 40 babies using Gibbs
products revealed a mean weight gain of 7.6 pounds in the first three months after
birth. The standard deviation of the sample was 2.3 pounds. A sample of 55
babies using the competitor’s brand revealed a mean increase in weight of 8.1
pounds, with a standard deviation of 2.9 pounds. At the 0.05 significance level,
can we conclude that babies using Gibbs brand gained less weight?
4. As part of a study of corporate employees, the Director of Human Resources for
PNC, Inc. wants to compare the distance traveled to work by employees at their
office in downtown Cincinnati with the distance for those in downtown
Pittsburgh. A sample of 35 Cincinnati employees showed they travel a mean of
370 miles per month, with a standard deviation of 30 miles per month. A sample
of 40 Pittsburgh employees showed they travel a mean of 380 miles per month,
with a standard deviation of 26 miles per month. At the .05 significance level, is
there a difference in the mean number of miles traveled per month between
Cincinnati and Pittsburgh employees? Use the five-step hypothesis-testing
procedure.
Worksheet #2 for 7.6 (2 sample proportion test)
1. A company wishes to determine whether a larger proportion of people are married in
population 1 than in population 2. A sample of 100 observations from population 1 indicated that 70
were married. A sample of 150 observations from the second population found that 80 were
married. Can we conclude at the 0.05 significance level that the proportion of people married in
population 1 is bigger than the proportion married in population 2?
2. The Damon family owns a large grape vineyard in western New York along Lake Erie. The
grapevines must be sprayed at the beginning of the growing season to protect against various
insects and diseases. Two new insecticides have just been marketed: Pernod 5 and Action. To test
their effectiveness, three long rows were selected and sprayed with Pernod 5, and three others
were sprayed with Action. When the grapes ripened, 400 of the vines treated with Pernod 5 were
checked for infestation. Likewise, a sample of 400 vines sprayed with Action were checked. The
results are:
Insecticide    Number of Vines Checked (sample size)    Number of Infested Vines
Pernod 5       400                                      24
Action         400                                      40
3. A nationwide sample of influential Republicans and Democrats was asked as a part of a
comprehensive survey whether they favoured lowering environmental standards so that
high-sulfur coal could be burned in coal-fired power plants. The results were:
                    Republicans    Democrats
Number sampled      1,000          800
Number in favour    200            168
Can we conclude that the proportion of Republicans who favour lowering standards is less than
the proportion of Democrats? Use the 0.02 significance level.
7.7 Tests Concerning Differences between Means for Small Samples
TWO SAMPLE TEST OF MEANS
For a test of hypothesis for the difference between two population means, three
conditions must be satisfied:
1. The populations must be normally distributed, or approximately normally
distributed.
2. The populations must be independent.
3. The population variances must be equal.
The t statistic is similar to that used for a z statistic for the difference between two
population means. However, we must make one additional calculation.


The two sample variances must be "pooled" to form a single estimate of the unknown
population variances.
The value of t is then computed.
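Pooled Sample Variance:
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
Test Statistic:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$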
Note: The degrees of freedom for a two sample test is found by (n1 +
n2 - 2).
So, the above formula is used when you are given two samples, both sample
sizes are less than 30, and you are given each sample mean and each sample
standard deviation.
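The pooled calculation can be sketched in a few lines of Python. This is only an illustration: the function name `pooled_t_test` is an assumption made for this sketch, and the numbers plugged in are those of the first lawnmower example. The critical value still has to be read from the t table using n1 + n2 - 2 degrees of freedom.

```python
import math

def pooled_t_test(mean1, s1, n1, mean2, s2, n2):
    """Return (pooled variance, t statistic, degrees of freedom).

    Hypothetical helper for illustration. Assumes two independent samples
    from normal populations with equal variances.
    """
    df = n1 + n2 - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    t = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return pooled_var, t, df

# Lawnmower example: 5 mowers (mean 4, s = 2.92) vs 6 mowers (mean 5, s = 2.10)
pooled_var, t, df = pooled_t_test(4, 2.92, 5, 5, 2.10, 6)
print(f"pooled variance = {pooled_var:.3f}, t = {t:.3f}, df = {df}")
```

The computed t is then compared with the critical value from the t table in step 5 of the usual procedure.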
Examples:
1. There are two different ways to mount an engine on a lawnmower. A
sample of 5 mowers was used to test the first procedure and it was found to
have a mean time of 4 minutes with a standard deviation of 2.92 minutes. A
sample of 6 mowers was used to test the second method and it was found to
have a mean time of 5 minutes and a standard deviation of 2.10. Can we
conclude that there is a difference between the means of the two methods?
(Use the 0.10 significance level.)
2. A manager wants to compare the number of defective wheelchairs
produced on the day shift with the number on the afternoon shift. A sample
of 6 day shifts was found to have a mean of 7 and a standard deviation of
1.41. A sample of 8 afternoon shifts was found to have a mean of 10 and a
standard deviation of 2.27. At the 0.05 significance level, can we conclude
that the afternoon shift makes more defective wheelchairs on average?
3. A budget director would like to compare the travel expenses for the sales
staff and the audit staff. A sample of 6 sales staff is found to have a mean
of 142.5 and a standard deviation of 12.2. A sample of 7 audit staff is found
to have a mean of 130.3 and a standard deviation of 15.8. At the 0.10
significance level, can she conclude that the mean daily expenses for the
audit staff are less than the sales staff?
8.0 Analysis of Variance
8.2.3 Discuss the F-distribution for comparing variances.
The F Distribution
The test statistic used to compare the sample variances and to conduct the ANOVA test is
the F distribution.
F Distribution:
A continuous probability distribution where F is always 0 or positive.
The distribution is positively skewed. It is based on two parameters, the
number of degrees of freedom in the numerator and the number of
degrees of freedom in the denominator.
The major characteristics of the F distribution are:
The shape of the distribution is illustrated in the following graph. Note that the shape of
the curve changes as the degrees of freedom change.
(omit this F test) 8.1.1 Describe the method of analysis of variance.
Comparing Two Population Variances
The F distribution is used to test the hypothesis that the variance of one normal population
equals the variance of another normal population. The F distribution can also be used to
validate assumptions with respect to certain statistical tests.
REGARDLESS OF WHICH TEST STATISTIC WE ARE USING, WE STILL USE THE
USUAL FIVE-STEP HYPOTHESIS TESTING PROCEDURE.


Since we are using a two-tailed test, the critical value is found
by halving the significance level. (Example: At a significance level of
0.10, we find the critical value by looking up 0.10/2 = 0.05.)
We use n1 - 1 degrees of freedom for the numerator, and n2 - 1 degrees of freedom for the denominator.
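A minimal sketch of the calculation, assuming the usual test statistic for comparing two variances, F = s1²/s2² (the ratio of the two sample variances; in a two-tailed test the larger variance is conventionally placed in the numerator). The function name `variance_ratio` is hypothetical, and the numbers are taken from the stockbroker example in these notes.

```python
def variance_ratio(s1, n1, s2, n2):
    """Return (F, numerator df, denominator df) for F = s1^2 / s2^2.

    Hypothetical helper for illustration. Sample 1 goes in the numerator.
    """
    f = s1 ** 2 / s2 ** 2
    return f, n1 - 1, n2 - 1

# Stock example: 10 software stocks (s = 3.9) vs 8 utility stocks (s = 3.5)
f, df_num, df_den = variance_ratio(3.9, 10, 3.5, 8)
print(f"F = {f:.3f} with {df_num} and {df_den} degrees of freedom")
```

The resulting F is compared with the critical value from the F table for the stated degrees of freedom.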
For one-tailed:
Examples:
1. Colin, a stockbroker, reports the mean rate of return on a sample of 10
software stocks is 12.6 percent with a standard deviation of 3.9 percent.
The mean rate of return on a sample of 8 utility stocks is 10.9 percent with
a standard deviation of 3.5 percent. At the 0.05 significance level, can Colin
conclude that there is more variation in the software stocks?
2. A random sample of seven observations from the first population resulted
in a standard deviation of 7. A random sample of five observations from the
second population showed a standard deviation of 12. At the 0.1 significance
level, is there a difference in the variation of the two samples?
3. The standard deviation of the marks for a sample of five females in a
class is 6.2 and the standard deviation for a sample of eight males is 4.3. At
the 0.05 significance level, can we conclude that the male scores have less
variation than the female scores?
Worksheet #1 for 7.7
1. A random sample of 10 observations from one population revealed a sample
mean of 23 and a sample standard deviation of 4. A random sample of 8
observations from another population revealed a sample mean of 26 and a
sample standard deviation of 5. At the 0.05 significance level, is there a
difference between the population means?
2. A random sample of 15 observations from the first population revealed a
sample mean of 350 and a sample standard deviation of 12. A random sample
of 17 observations from the second population revealed a sample mean of
342 and a sample standard deviation of 15. At the 0.10 significance level, is
there a difference in the population means?
3. A recent study compared the time spent together by single- and dual-earner
couples. According to the records kept by the wives during the study,
the mean amount of time spent together watching television among the
single-earner couples was 61 minutes per day, with a standard deviation of
15.5 minutes. For the dual-earner couples, the mean number of minutes spent
watching television was 48.4 minutes, with a standard deviation of 18.1
minutes. At the 0.01 significance level, can we conclude that the
single-earner couples on average spend more time watching television together?
There were 15 single-earner and 12 dual-earner couples studied.
Worksheet #2 for 8.0
1. What is the critical F value (step 4) for a sample of six observations in the
numerator and four in the denominator? Use a two-tailed test and the 0.10
significance level.
2. What is the critical F value (step 4) for a sample of four observations in
the numerator and seven in the denominator? Use a one-tailed test and the
0.01 significance level.
3. A random sample of eight observations from the first population resulted
in a standard deviation of 10. A random sample of six observations from the
second population resulted in a standard deviation of 7. At the 0.02
significance level, is there a difference in the variation of the two
populations?
4. A random sample of five observations from the first population resulted
in a standard deviation of 12. A random sample of seven observations from
the second population showed a standard deviation of 7. At the 0.01
significance level, is there more variation in the first population?
5. A media research company conducted a study of the radio listening habits
of men and women. One aspect of the study involved the mean listening time.
It was discovered that the mean listening time for men was 35 minutes per
day. The standard deviation of the sample of the 10 men studied was 10
minutes per day. The mean listening time for the 12 women studied was also
35 minutes, but the standard deviation of the sample was 12 minutes. At the
0.10 significance level, can we conclude that there is a difference in the
variation in the listening times for men and women?
(Definitely do NOT omit this F test!) 8.0 continued: ANOVA - Analysis of Variance
The F distribution is used for testing the equality of more than two means using a
technique known as analysis of variance (ANOVA).
ANOVA requires the following conditions.
1. The populations being sampled are normally distributed.
2. The populations have equal standard deviations.
3. The samples are randomly selected and are independent.
The ANOVA test is used to determine if the various sample means came from a single
population or from populations with different means. The sample means are compared
through their variances.



THE SAME FIVE-STEP HYPOTHESIS TESTING PROCEDURE IS
USED.
THE DIFFERENCE LIES IN THE TEST STATISTIC.
THE TEST STATISTIC IS THE F DISTRIBUTION.
Analysis of Variance Procedure
For an analysis of variance problem the appropriate test statistic is F. The F
statistic is the ratio of two variance estimates and is computed as:
$$F = \frac{\text{Estimate of the population variance based on the differences between the sample means}}{\text{Estimate of the population variance based on the variation within samples}}$$
The critical value for F is determined from the F tables found in Appendix G.
The ANOVA Table
A convenient way of organizing the calculations for F is to put them into a
table - referred to as an ANOVA table.
Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square          F
Treatments             SST               k - 1                 SST/(k - 1) = MST    MST/MSE
Error                  SSE               n - k                 SSE/(n - k) = MSE
Total                  SS Total
Keep in mind the fact that the SS total term is the total variation, SST is the
variation due to the treatments, and SSE is the variation within the samples.
The three values are determined by first calculating SS total and SST, then finding
SSE by subtraction.
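The order of calculation described here (SS total and SST first, SSE by subtraction, then the mean squares and F) can be sketched as follows. The function name `one_way_anova`, the dictionary keys, and the three small groups of numbers are all made up for illustration; they are not data from these notes.

```python
def one_way_anova(groups):
    """Return the entries of a one-way ANOVA table as a dict.

    Hypothetical helper for illustration. `groups` is a list of samples,
    one list of observations per treatment.
    """
    values = [x for group in groups for x in group]
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n

    # Total variation: every observation against the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    # Treatment variation: each treatment mean against the grand mean.
    sst = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Error variation is found by subtraction.
    sse = ss_total - sst

    mst = sst / (k - 1)
    mse = sse / (n - k)
    return {"SS total": ss_total, "SST": sst, "SSE": sse,
            "df treatments": k - 1, "df error": n - k,
            "MST": mst, "MSE": mse, "F": mst / mse}

# Made-up illustration data: three treatments, three observations each.
table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
for name, value in table.items():
    print(f"{name}: {value:.3f}")
```

The computed F is then compared with the critical value from the F table using k - 1 and n - k degrees of freedom.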
8.2.4 Apply the F-distribution to help determine whether differences in sample
means.
8.0 continued...ANOVA Table
Source of variation    Sum of squares    Degrees of freedom    Mean square          F
Treatments             SST               k - 1                 SST/(k - 1) = MST    MST/MSE
Error                  SSE               n - k                 SSE/(n - k) = MSE
Total                  SS Total          n - 1
Examples
1. Use the following sample information to see whether we can conclude that
there is a difference in the means of the three groups. Use the 0.05
significance level.
Treatment 1: 4, 2, 6, 3, 5
Treatment 2: 3, 9, 5
Treatment 3: 4, 3, 5, 6
Source of variation    SS    df    Mean square    F
Treatments
Error
2. A real estate developer is considering investing in a shopping mall
somewhere on the outskirts of town. Three different parcels of land are
being evaluated. Of particular importance is the income in the area
surrounding the proposed mall. A random sample of four families is selected
near each proposed mall. Following are the sample results. At the 0.05
significance level, can the developer conclude that there is a difference in
the mean income at the three locations?
Area 1 (in thousands $)    Area 2 (in thousands $)    Area 3 (in thousands $)
64                         74                         75
68                         71                         80
70                         69                         76
60                         70                         78
Source of variation    SS    df    Mean square    F
Treatment
Error
Worksheet for ANOVA
1. The following is sample information for three different treatments. Can
we conclude at the 0.05 significance level that the treatment means are not
equal?
Treatment 1: 9, 7, 11, 9, 12, 10
Treatment 2: 13, 20, 14, 13
Treatment 3: 10, 9, 15, 14, 15
2. A manager of a computer software company wishes to study the number
of hours senior executives spend at their computers by type of industry.
The manager selected a sample of five executives from each of the three
industries. At the 0.05 significance level, can she conclude there is a
difference in the mean number of hours spent per week by industry?
Banking    Retail    Insurance
12         8         10
10         8         8
10         6         6
12         8         8
10         10        10
8.3 Two-way analysis of variance (ANOVA)
8.3.1 Analyze two-way analysis of variance (ANOVA) tables where two
factors may affect the sample means
You are responsible for interpreting and analyzing results tables, but you will
not have to come up with the data yourself!
When we have two factors with at least two levels and one or more observations at each
level, we say we have a two-way layout. We say that the two-way layout is crossed when
every level of Factor A occurs with every level of Factor B. With this kind of layout we
can estimate the effect of each factor (Main Effects) as well as any interaction between
the factors.
As in the one-way case, we test the null hypotheses that the two main effects and
the interaction are zero.
The two-way crossed ANOVA is useful when we want to compare the effect
of multiple levels of two factors and we can combine every level of one
factor with every level of the other factor. If we have multiple observations
at each level, then we can also estimate the effects of interaction between
the two factors.
Example:
Let's assume that we want to test if there are any differences in pin
diameters for a part due to different types of coolant. We still have five
different machines making the same part and we take five samples from
each machine for each coolant type to obtain the following data.
             Coolant 1                            Coolant 2
Machine 1    0.125, 0.127, 0.125, 0.126, 0.128    0.124, 0.128, 0.127, 0.126, 0.129
Machine 2    0.118, 0.122, 0.120, 0.124, 0.119    0.116, 0.125, 0.119, 0.125, 0.120
Machine 3    0.123, 0.125, 0.125, 0.124, 0.126    0.122, 0.121, 0.124, 0.126, 0.125
Machine 4    0.126, 0.128, 0.126, 0.127, 0.129    0.126, 0.129, 0.125, 0.130, 0.124
Machine 5    0.118, 0.129, 0.127, 0.120, 0.121    0.125, 0.123, 0.114, 0.124, 0.117
We can summarize the analysis results in an ANOVA table as follows:
Source             Sum of Squares    Deg. of Freedom    Mean Square    F0
machine            0.000303          4                  0.000076       8.8
coolant            0.00000392        1                  0.00000392     0.45
interaction        0.00001468        4                  0.00000367     0.42
residuals          0.000346          40                 0.0000087
corrected total    0.000668          49
By dividing the mean square for machine by the mean square for residuals we obtain an
F0 value of 8.8 which is greater than the critical value of 2.61 based on 4 and 40 degrees
of freedom and a 0.05 significance level. Likewise the F0 values for Coolant and
Interaction, obtained by dividing their mean squares by the residual mean square, are less
than their respective critical values of 4.08 and 2.61 (0.05 significance level).



So, you’ll have to remember that the degrees of freedom for the first row is:
I-1 for the numerator and IJ(K-1) for the denominator
And for the second row, it’s J-1 for the numerator and IJ(K-1) for the
denominator
And for the third row, it’s (I-1)(J-1) for the numerator and IJ(K-1) for the
denominator
Where I is the number of levels of the row factor (here, the 5 machines), J is the
number of levels of the column factor (here, the 2 types of coolant) and K is the
number of values in each machine-coolant combination (here, 5)!
From the ANOVA table we can conclude that machine is the most important factor and is
statistically significant. Coolant is not significant and neither is the interaction. These
results would lead us to believe that some tool-matching efforts would be useful for
improving this process.
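The degrees-of-freedom rules for the two-way table can be checked with a short sketch. The function name `two_way_df` and the dictionary keys are assumptions made for this illustration; the inputs are the 5 machines, 2 coolants, and 5 values per combination from the example.

```python
def two_way_df(i_levels, j_levels, k_values):
    """Degrees of freedom for a crossed two-way ANOVA with replication.

    Hypothetical helper for illustration:
    i_levels = levels of the row factor (I),
    j_levels = levels of the column factor (J),
    k_values = observations per combination (K).
    """
    return {
        "row factor": i_levels - 1,
        "column factor": j_levels - 1,
        "interaction": (i_levels - 1) * (j_levels - 1),
        "residuals": i_levels * j_levels * (k_values - 1),
        "total": i_levels * j_levels * k_values - 1,
    }

# Pin-diameter example: 5 machines, 2 coolants, 5 values per combination.
df = two_way_df(5, 2, 5)
print(df)
```

The numerator degrees of freedom for each F0 come from the row being tested, and the denominator degrees of freedom always come from the residuals row.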
Extra Example!
Suppose you want to determine whether the brand of laundry detergent used and
the temperature affect the amount of dirt removed from your laundry. To this
end, you buy two different brands of detergent (“Super” and “Best”) and choose
three different temperature levels (“cold”, “warm”, and “hot”). Then you divide your
laundry randomly into 6×r piles of equal size and assign r piles to each
combination of detergent (“Super” and “Best”) and temperature (“cold”, “warm”, and “hot”).
In this example, we are interested in testing the null hypotheses
H0D : The amount of dirt removed does not depend on the type of detergent
H0T : The amount of dirt removed does not depend on the temperature
        Super          Best
Cold    4, 5, 6, 5     6, 6, 4, 4
Warm    7, 9, 8, 12    13, 15, 12, 12
Hot     10, 12, 11, 9  12, 13, 10, 13
Here are the results:
Analysis of Variance Table
Response: wash
               Df    Sum Sq     Mean Sq    F value    Pr(>F)
deter          1     20.167     20.167     9.8108     0.005758 **
water          2     200.333    100.167    48.7297    5.44e-08 ***
deter:water    2     16.333     8.167      3.9730     0.037224 *
Residuals      18    37.000     2.056
Determine the proper critical values using the 0.05 significance level to draw your
conclusion!
Test 3 up to here!
9.0 Chi-Square distribution
9.1 Introduction
9.1.1 Describe the properties and uses of the Chi-Square Distribution.
The Chi-Square Distribution
So far we have used the standard normal z, t, and F distributions as the test statistics. In
Chapter 13 we will learn how and when to use the Chi-Square as the test statistic.
The chi-square is similar to the t and F distributions in that there is a family of
distributions - each has a different shape depending on the number of degrees of freedom.
As the illustration shows, when the number of degrees of freedom is small, the
distribution is positively skewed, but as the number of degrees of freedom increases it
becomes symmetrical and approaches the normal distribution.
Chi-square is based on squared deviations between an observed frequency and an expected
frequency - therefore, it is always positive.
9.2 Goodness of Fit Test
9.2.1 Describe the purpose of a goodness of fit test
Goodness-of-Fit Tests
In the goodness-of-fit test the χ² distribution is used to determine how well an observed
set of observations fits an expected set of observations.

Goodness-of-fit test: A nonparametric test involving a set of observed
frequencies and a corresponding set of expected frequencies.
The purpose of the goodness-of-fit test is to determine if there is a statistical difference
between the two sets of data - one which is observed and the other expected. Is the
difference due to chance, or can we conclude there is a significant difference between the
two sets?
NOTE: Again, the same systematic five-step hypothesis testing procedure is followed in
our solution.
We begin by denoting f0 as the observed set of frequencies in a particular category and
fe as the expected frequency in a particular category.
NOTE: A category is referred to as a cell.
Step 1: State the null and alternate hypotheses:
Step 2: Select the Level of Significance - This is the probability of committing a Type I
error.
Step 3: Select the test statistic. The test statistic is the chi-square statistic:
$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$
Step 4: Formulate the decision rule. Find the critical value of χ². This critical value is
found in Appendix H by locating the number of degrees of freedom in the left
column and moving horizontally to the right to read the value associated with the level of
significance.
Step 5: Compute the value of chi-square and make your decision. Page 443 of your
text illustrates the procedure for computing the χ² value.
It is not necessary that the expected frequencies be equal to apply the goodness-of-fit
test. The text illustrates the case of unequal frequencies and also gives a practical use of
chi-square.
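The computation in Step 5 can be sketched as follows, assuming the usual chi-square statistic (the sum over all cells of (f0 - fe)²/fe) and k - 1 degrees of freedom for k categories. The function name `chi_square_gof` is hypothetical. When no expected frequencies are supplied, the sketch assumes they are all equal. The data are the baseball card sales from the first example in this section.

```python
def chi_square_gof(observed, expected=None):
    """Return (chi-square statistic, degrees of freedom).

    Hypothetical helper for illustration. If `expected` is omitted, equal
    expected frequencies are assumed.
    """
    if expected is None:
        expected = [sum(observed) / len(observed)] * len(observed)
    chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
    return chi2, len(observed) - 1

# Baseball card example: cards sold for each of the six players.
cards_sold = [13, 33, 14, 7, 36, 17]
chi2, df = chi_square_gof(cards_sold)
print(f"chi-square = {chi2:.2f} with {df} degrees of freedom")
```

For unequal expected frequencies, pass them explicitly as the second argument.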
Examples:
1. A student sells baseball cards for a day. At the end of the day she records the sales of
the six types of cards in a chart as shown below.
Player          Cards Sold
Tom Seaver      13
Nolan Ryan      33
Ty Cobb         14
George Brett    7
Hank Aaron      36
Johnny Bench    17
At the 0.05 significance level, can she conclude the sales are not the same for each player?
2. A human resources manager records the number of sick days over a week. The
following data was gathered.
Day of the week    Number absent
Monday             12
Tuesday            9
Wednesday          11
Thursday           10
Friday             9
Saturday           9
At the 0.01 significance level, can she conclude that there is no difference in the
absenteeism throughout the six-day workweek?
9.3 Test of Independence
9.3.1 Perform the Chi-Square Test to determine whether two
classifications of the same data are independent of each other
Contingency Tables
The χ² distribution is also used to determine if there is a relationship between two or more
criteria of classifications.
For example, we may be interested in whether or not there is a relationship between job
advancement within a company and the gender of the employee.
Contingency Table: A table made up of rows and columns. Each box is referred to as a cell.
The usual five-step hypothesis testing procedure is followed. The expected frequency, fe,
is computed by the formula:
$$f_e = \frac{(\text{row total})(\text{column total})}{\text{grand total}}$$
The number of degrees of freedom used to find the critical value for χ² is:
df = (number of rows - 1)(number of columns - 1)
There is a limitation to the use of the χ² distribution. The value of fe should be at least 5
for each cell (box). This requirement is to prevent any cell from carrying an inordinate
amount of weight and causing the null hypothesis to be rejected.
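A minimal sketch of the expected-frequency and degrees-of-freedom calculations for a contingency table. The function names and the 2 x 2 table of counts are made up for illustration and are not data from these notes. The chi-square statistic itself is assumed to be the usual sum of (f0 - fe)²/fe over all cells.

```python
def expected_frequencies(table):
    """Expected count for each cell: (row total)(column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom, expected counts)."""
    expected = expected_frequencies(table)
    chi2 = sum((fo - fe) ** 2 / fe
               for obs_row, exp_row in zip(table, expected)
               for fo, fe in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected

# Made-up 2 x 2 table of observed counts (rows and columns are the two classifications).
observed = [[10, 20],
            [30, 40]]
chi2, df, expected = chi_square_independence(observed)

# Check the limitation noted above: every expected frequency should be at least 5.
all_cells_ok = all(fe >= 5 for row in expected for fe in row)
print(expected, round(chi2, 4), df, all_cells_ok)
```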
Examples:
1. A Correction Agency is investigating whether those released from prison
show a different adjustment if they return to their hometown or if they go
elsewhere to live. In other words, they would like to know whether there is a
relationship between adjustment to civilian life and place of residence. The
data below was gathered. Using the 0.01 significance level, determine if a
relationship exists.
                    Outstanding    Fair    Unsatisfactory
Live in hometown    27             35      33
Live elsewhere      13             15      27
2. A social scientist sampled 140 people and classified them according to
income level and whether or not they played a lottery in the last month. The
info is given below. Can we conclude that playing the lottery is related to
income level? Use the 0.05 significance level.
                                              High income    Low income
Played the lottery in the last month          21             46
Did not play the lottery in the last month    19             14
Worksheet for 9.0
1. In a particular chi-square goodness-of-fit test there are four categories
and 200 observations. Use the .05 significance level.
a. How many degrees of freedom are there?
b. What is the critical value of chi-square?
2. In a particular chi-square goodness-of-fit test there are six categories and
500 observations. Use the .01 significance level.
a. How many degrees of freedom are there?
b. What is the critical value of chi-square?
3. The null hypothesis and the alternate are:
H0: The cell categories are equal.
H1: The cell categories are not equal.
Category    f0
A           10
B           20
C           30
a. State the decision rule, using the .05 significance level.
b. Compute the value of chi-square.
c. What is your decision regarding H0?
4. The null hypothesis and the alternate are:
H0: The cell categories are equal.
H1: The cell categories are not equal.
Category    f0
A           10
B           20
C           30
D           20
5. Classic Golf, Inc. manages five courses in the Jacksonville, Florida, area. The
Director wishes to study the number of rounds of golf played per weekday
at the five courses. He gathered the following sample information.
Day          Rounds
Monday       124
Tuesday      74
Wednesday    104
Thursday     98
Friday       120
6. The director of advertising for the Carolina Sun Times, the largest
newspaper in the Carolinas, is studying the relationship between the type of
community in which a subscriber resides and the section of the newspaper
he or she reads first. For a sample of readers, she collected the following
sample information.
          National News    Sports    Comics
City      170              124       90
Suburb    120              112       100
Rural     130              90        88
At the .05 significance level, can we conclude there is a relationship between
the type of community where the person resides and the section of the
paper read first?
7. The Quality Control Department at Food Town, Inc., a grocery chain in
upstate New York conducts a monthly check on the comparison of scanned
prices to posted prices. The chart below summarizes the results of a sample
of 500 items last month. Company management would like to know whether
there is any relationship between error rates on regular priced items and
specially priced items. Use the .01 significance level.
                 Regular Price    Advertised Special Price
Undercharge      20               10
Overcharge       15               30
Correct Price    200              225
9.2.1 (in new outline) Perform the chi-square test to determine whether more than
two population proportions can be considered equal
Remember: The chi-square independence test is used to find out whether there is an
association between a row variable and column variable in a contingency table constructed
from sample data. The null hypothesis is that the variables are not associated; in other
words, they are independent. The alternative hypothesis is that the variables are
associated, or dependent.
Chi-Square Test for Homogeneity of Proportions:
In a chi-square test for homogeneity of proportions, we test the claim that different
populations have the same proportion of individuals with some characteristic.
Note: The chi-square test for independence is a test regarding a sample from a single
population. We now discuss a second type of chi-square test, which can be used to compare
the population proportions from different populations. (same method, so this is just an
extension of the last section!)
Example:
A drug manufacturer makes a drug (Zocor) that is meant to reduce the level of LDL
cholesterol, while increasing the level of HDL cholesterol. In clinical trials of the drug,
patients were randomly divided into three groups. Group 1 received Zocor, group 2 received
a placebo, and group 3 received cholestyramine, a currently available drug. The table below
contains the number of patients in each group who did and did not experience abdominal pain
as a side effect.
                                                          Group 1 (Zocor)    Group 2 (Placebo)    Group 3 (Cholestyramine)
Number of people who experienced abdominal pain           51                 5                    16
Number of people who did not experience abdominal pain    1532               152                  163
Is there evidence to indicate that the proportion of subjects in each group who experienced
abdominal pain is different at the α = 0.01 level of significance?
10.0 Linear Regression and Correlation
10.1 Introduction
10.1.1 Describe the applications of linear regression.
The main purpose of linear regression is firstly to see if there is a linear
relationship between two variables, and secondly to predict the value of one
variable given some particular value of the other variable. To start our
discussion we must introduce the concept of scatter diagrams.
10.1.2 Distinguish between independent and dependent variables
10.2 Scatter Diagrams
10.2.1 Construct scatter diagrams to determine if two variables are
related.
The purpose of correlation analysis is to find out how strong the relationship is between two
variables. One way of looking at the relationship between two variables is to portray the
information in a scatter diagram.
The values of the independent variable are portrayed on the horizontal axis (X-axis)
and the dependent variable along the vertical axis (Y-axis).
The scatter diagram provides a visual graphical display of the "scatter" of the data
and whether or not there appears to be a linear relationship.
Example: Construct a scatter diagram using the data below.
x    4    5    6    7
y    6    7    9    10
10.2.2 What is correlational analysis?
The purpose of correlation analysis is to find out how strong the relationship is between two
variables.
We are often interested in whether two variables exhibit a linear correlation. One way to
tell if there is a linear correlation is to construct a scatter diagram and if it takes the
shape of a line, then there exists some degree of linear correlation.
10.3 The Coefficient of Correlation
10.3.1 Calculate the coefficient of correlation and use it to determine the strength of
linear relationships between variables.
The Coefficient of Correlation
A measure of the linear (straight-line) strength of the association between
two variables is given by the coefficient of correlation. It is also called
Pearson's product moment correlation coefficient or Pearson's r - after
its founder Karl Pearson.
This information is summarized in the following charts:
Perfect Negative Correlation:
Perfect Positive Correlation:
The formula to compute the coefficient of correlation is:
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, s_x s_y}$$


The coefficient of correlation requires that both variables be at
least of interval scale.
The degree of strength of the relationship is not related to the sign
(direction + or -) of the coefficient of correlation.
o For example, an r value of -0.60 represents the same degree
of correlation as that of +0.60.
NOTE: The following is the formula for s (SAMPLE standard deviation). It is
slightly different than the population standard deviation formula you used
before.
$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$
Where s: sample standard deviation
X: the data value
$\bar{X}$: the sample mean
n: the sample size
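The two formulas can be combined in a short sketch. The function names `sample_std` and `pearson_r` are assumptions made for this illustration, and the data are the four (x, y) pairs from the scatter diagram example earlier in this section.

```python
import math

def sample_std(values):
    """Sample standard deviation, dividing by n - 1."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

def pearson_r(xs, ys):
    """Coefficient of correlation: sum of cross deviations over (n - 1) s_x s_y."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * sample_std(xs) * sample_std(ys))

# Scatter diagram data: x = 4, 5, 6, 7 and y = 6, 7, 9, 10
r = pearson_r([4, 5, 6, 7], [6, 7, 9, 10])
print(f"r = {r:.3f}")
```

A value of r close to +1 here agrees with the nearly straight upward pattern those points make on a scatter diagram.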
Example 1: Calculate the coefficient of correlation for the following
variables.
(Formulas that will be needed:
$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \qquad r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, s_x s_y}$$ )
x    y
4    10
5    2
6    3
Example 2: Calculate the coefficient of correlation for the following
variables. (Formulas that will be needed:
$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \qquad r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, s_x s_y}$$ )
x     y
2     1
4     5
9     10
11    6
10.4 The Reliability of Correlation
10.4.1 Determine the reliability of the coefficient of correlation
through hypothesis testing.
Test of Significance of the Correlation Coefficient
A test of significance for the coefficient of correlation may be used to determine if the
computed r could have occurred in a population in which the two variables are not related.
Is the correlation in the population zero?
For a two-tailed test the null hypothesis and the alternate hypothesis are written as
follows:
H0 : ρ = 0 (The correlation in the population is zero)
H1 : ρ ≠ 0 (The correlation in the population is different from zero)
The Greek lower case rho, ρ , represents the correlation in the population. The null
hypothesis is that there is no correlation in the population, and the alternate is that there
is a correlation.

The test statistic follows the t distribution with n - 2 degrees of freedom.
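A minimal sketch of the computation, assuming the standard test statistic for the coefficient of correlation, t = r√(n - 2) / √(1 - r²), which is not written out in these notes. The function name `correlation_t` is hypothetical, and the numbers are those of the mayoral campaign example (n = 25, r = 0.43).

```python
import math

def correlation_t(r, n):
    """Return (t statistic, degrees of freedom) for testing H0: rho = 0.

    Hypothetical helper for illustration. Uses t = r * sqrt(n - 2) / sqrt(1 - r^2).
    """
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t, n - 2

# Mayoral campaign example: 25 campaigns, r = 0.43
t, df = correlation_t(0.43, 25)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

The computed t is compared with the critical value from the t table for n - 2 degrees of freedom.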
Examples:
1. A sample of 25 mayoral campaigns in cities with populations bigger than
50,000 showed that the correlation between the percent of the vote
received and the amount spent on the campaign by the candidate was 0.43.
At the 0.05 significance level, is there a correlation between the
variables?
2. An airline selected a random sample of 25 flights and found that the
correlation between the number of passengers and the total weight, in
pounds, of luggage stored in the luggage compartment is 0.94. Using the 0.05
significance level, can we conclude that there is a positive correlation
between the two variables?
3. A company does a survey of 26 people and finds that there is a correlation
between video game playing and television watching. The correlation is found
to be -0.67. Using the 0.025 significance level, can we conclude that there is
a negative correlation between the two variables?
Worksheet for 10.0 and 10.5 (to come)
1. The following sample observations were randomly selected.
X     Y
4     4
5     6
3     5
6     7
10    7
a. Determine the coefficient of correlation.
b. Determine the regression equation. (will do in 10.5)
c. Determine the value of Y ' when X is 5. (10.5)
d. Determine the standard error of estimate. (10.5)
e. Determine the 0.90 confidence interval for the mean predicted when X=5.
(10.5)
2. The following sample observations were randomly selected.
X     Y
4     10
5     5
7     1
10    2
a. Determine the coefficient of correlation.
b. Determine the regression equation. (will do in 10.5)
c. Determine the value of Y ' when X is 1. (10.5)
d. Determine the standard error of estimate. (10.5)
e. Determine the 0.95 confidence interval for the mean predicted when X=1.
(10.5)
3. A random sample of 12 paired observations indicated a correlation of 0.32.
Can we conclude that the correlation in the population is greater than zero?
Use the 0.05 significance level.
4. A random sample of 15 paired observations has a correlation of -0.46. Can
we conclude that the correlation in the population is less than zero? Use the
0.05 significance level.
5. A refining company is studying the relationship between the pump price of
gasoline and the number of gallons sold. For a sample of 20 stations last
Tuesday, the correlation was 0.78. At the 0.01 significance level, is the
correlation in the population greater than zero?
6. A study of 20 worldwide financial institutions showed the correlation
between their assets and pre-tax profit to be 0.86. At the 0.05 significance
level, can we conclude that there is positive correlation in the population?
10.5 Linear Regression
10.5.1/10.5.2 Describe the method of least squares for developing a
simple linear regression model/Calculate a linear regression line and use
it to predict the value of one variable when given the other.
Regression Analysis
The equation for a straight line is used to estimate Y based on X and is referred to as the
regression equation. The equation for a straight line is: Y' = a + bX.
The technique used to develop the equation for the line and make these predictions is called
regression analysis.
Purpose:
Procedure:
The linear relationship between two variables is given by the regression equation:
Y' = a + bX
The intercept a and the slope b can be written as:
a = \bar{Y} - b\bar{X}
b = r \frac{s_y}{s_x}
(omit) Least Squares Principle
In the regression equation: Y' = a + bX, the value of a is the Y intercept and b is the
regression coefficient. These two values are developed mathematically using the least
squares principle.

least squares principle: Determining a regression equation by minimizing the sum
of the squares of the vertical distances between the actual Y values and the
predicted values of Y'.
There is only one line for which SSE is a minimum. This line is the least
squares line, the regression line, or the least squares prediction equation.
The methodology used to obtain this line is called the method of least
squares.
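The method of least squares can be sketched in a few lines of Python. This is a minimal illustration: the function names and the four data points are assumptions made for the example, and the slope is computed from sums of deviations, which is algebraically the same as b = r(s_y / s_x).

```python
def least_squares(xs, ys):
    """Return (a, b) for the line Y' = a + bX with the smallest SSE."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope; equal to r * s_y / s_x
    a = y_bar - b * x_bar  # intercept
    return a, b

def sse(xs, ys, a, b):
    """Sum of squared vertical distances between Y and Y' = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, not taken from the examples in these notes.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]
a, b = least_squares(xs, ys)
print(a, b, sse(xs, ys, a, b))
# Any other line, for instance a slightly steeper one, has a larger SSE.
print(sse(xs, ys, a, b + 0.1) > sse(xs, ys, a, b))
```

Comparing the fitted line with a nearby line is a quick way to see why only one line minimizes SSE.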
Examples:
1. Suppose an appliance store conducts a 5-month experiment to determine
the effect of advertising on sales revenue. The results are shown in the
table below.
Month      Advertising expenditure, x (in $100s)    Sales revenue, y (in $1000s)
Month 1    1                                        1
Month 2    2                                        1
Month 3    3                                        2
Month 4    4                                        2
Month 5    5                                        4
a. Determine the regression equation for the data above.
b. Predict the sales revenue when advertising expenditure is $200 (i.e., when
x=2).
c. Give practical interpretations for the y-intercept (a) and the slope (b) of
the line.
(omit) d. Show that the sum of the errors (SE) equals 0.
(omit) e. Calculate the sum of the squared errors (SSE) and state its
significance for the regression model.
2. An investigation of the properties of bricks used to line aluminum smelter
pots was published in a journal. Six different commercial bricks were
evaluated. The life length of a smelter pot depends on the porosity of the
brick lining (the less porosity, the longer the life); consequently, the
researchers measured the apparent porosity of each brick specimen, as well
as the mean pore diameter of each brick. The data are given in the
accompanying table.
Brick    Mean pore diameter (micrometers)    Apparent porosity (%)
A        12.0                                18.8
B        9.7                                 18.3
C        7.3                                 16.3
D        5.3                                 6.9
E        10.9                                17.1
F        16.8                                20.4
a. Find the least squares line relating porosity (y) to mean pore diameter. (x).
b. Predict the apparent porosity percentage for a brick with a mean pore
diameter of 10 micrometers.
c. Give practical interpretations for the y-intercept (a) and the slope (b) of
the line.
(omit) d. Show that the sum of the errors (SE) equals 0.
(omit) e. Calculate the sum of the squared errors (SSE) and state its
significance for the regression model.
3. Researchers at a company investigated the effect of tablet surface area
and volume on the rate at which a drug is released in a controlled-release
dosage. Six similarly shaped tablets were prepared with different weights
and thicknesses, and the ratio of surface area to volume was measured for
each. Using a dissolution apparatus, each tablet was placed in 900 milliliters
of deionized water, and the diffusional drug release rate (percentage of
drug released divided by the square root of time) was determined. The
experimental data are listed in the table.
Surface area to volume (mm2/mm3)    Drug release rate (% released/√time)
1.50                                60
1.05                                48
0.90                                39
0.75                                33
0.60                                30
0.65                                29
a. Find the least squares line relating drug release rate (y) to surface to
volume ratio (x).
b. Predict the drug release rate for a tablet that has a surface area/volume
ratio of 0.50.
c. Give practical interpretations for the y-intercept (a) and the slope (b) of
the line.
(omit) d. Show that the sum of the errors (SE) equals 0.
(omit) e. Calculate the sum of the squared errors (SSE) and state its
significance for the regression model.
10.6 Standard Error of Estimate
10.6.1 Compute the standard error of estimate and use it to estimate the
predictability of the regression line.
The Standard Error of Estimate
The predicted value Y' will rarely be exactly the same as the actual Y value. We expect
some prediction error. One measure of this error is called the standard error of
estimate. It is written s_{y.x}.
A small standard error of estimate indicates that the independent variable is a good
predictor of the dependent variable. It is similar to the standard deviation.
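A minimal Python sketch of the calculation, assuming the usual definition s_{y.x} = sqrt(Σ(Y - Y')² / (n - 2)). The function name, the data and the fitted line Y' = 0.5 + 1.4X are illustrative assumptions.

```python
import math

def standard_error_of_estimate(xs, ys, a, b):
    """s_yx = sqrt(sum((Y - Y')^2) / (n - 2)) for the line Y' = a + bX."""
    n = len(xs)
    squared_errors = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(squared_errors / (n - 2))

# Hypothetical data with a line assumed to be already fitted by least squares.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]
a, b = 0.5, 1.4
print(round(standard_error_of_estimate(xs, ys, a, b), 4))
```

The divisor is n - 2 rather than n - 1 because two quantities, a and b, are estimated from the sample.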
Linear regression is based on these four assumptions:
Examples:
1. Determine the standard error of estimate for the following data. (Note, the
data is from #1 in 10.5.1/10.5.2)
Month      Advertising expenditure, x (in $100s)    Sales revenue, y (in $1000s)
Month 1    1                                        1
Month 2    2                                        1
Month 3    3                                        2
Month 4    4                                        2
Month 5    5                                        4
2. Determine the standard error of estimate for the following data. (Note, the
data is from #2 in 10.5.1/10.5.2)
Brick    Mean pore diameter (micrometers)    Apparent porosity (%)
A        12.0                                18.8
B        9.7                                 18.3
C        7.3                                 16.3
D        5.3                                 6.9
E        10.9                                17.1
F        16.8                                20.4
3. Determine the standard error of estimate for the following data. (Note,
the data is from #3 in 10.5.1/10.5.2)
Surface area to volume (mm2/mm3)    Drug release rate (% released/√time)
1.50                                60
1.05                                48
0.90                                39
0.75                                33
0.60                                30
0.65                                29
10.6.2 Determine confidence intervals for regression estimates.
Establishing a Confidence Interval for Y
The standard error is also used to set confidence intervals for the predicted value of Y'.
When the sample size is large and the scatter about the regression line is approximately
normally distributed, then the following relationships can be expected:
Y' ± 1 s_{y.x} encompasses about 68% of the observed values.
Y' ± 2 s_{y.x} encompasses about 95.5% of the observed values.
Y' ± 3 s_{y.x} encompasses about 99.7% of the observed values.
Two types of confidence intervals may be developed:
• For the mean value of Y for a given value of X.
• For an individual value of Y for a given value of X, called a prediction interval.
To explain the difference: Suppose we are predicting the salary of management
personnel who are 40 years old. In this case we are predicting the mean salary of all
management personnel age 40. However, if we want to predict the salary of a particular
manager who is 40, then we are making a prediction about a particular individual.
The confidence interval for the mean value of Y for a given value of X is
given by:
Y' \pm t \, s_{y.x} \sqrt{\frac{1}{n} + \frac{(X - \bar{X})^2}{\sum (X - \bar{X})^2}}
Where:
Y' is the predicted value for any selected X value
X is any selected value of X
\bar{X} is the mean of the X's
n is the number of observations
s_{y.x} is the standard error of estimate
t is the value of t from Appendix F with n - 2 degrees of freedom
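The interval for the mean of Y at a chosen X can be sketched in Python as follows. The function name and all input values are illustrative assumptions; the t value is meant to be read from the t table, as in class.

```python
import math

def mean_confidence_interval(xs, x0, y_pred, s_yx, t_value):
    """Y' +/- t * s_yx * sqrt(1/n + (X - Xbar)^2 / sum((X - Xbar)^2))."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    margin = t_value * s_yx * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    return y_pred - margin, y_pred + margin

# Hypothetical inputs: four X values, a prediction of 4.7 at X = 3,
# s_yx = 0.3162, and t = 4.303 (tabled value for 2 df, 95% confidence).
low, high = mean_confidence_interval([1, 2, 3, 4], 3, 4.7, 0.3162, 4.303)
print(round(low, 2), round(high, 2))
```

The interval is narrowest when the selected X equals the mean of the X's, because the second term under the square root is then zero.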
Examples:
1. a. Determine the 0.95 confidence interval for the mean predicted when X=2.
(Note, the data is from #1 in 10.5.1/10.5.2)
b. What does it represent?
Month      Advertising expenditure, x (in $100s)    Sales revenue, y (in $1000s)
Month 1    1                                        1
Month 2    2                                        1
Month 3    3                                        2
Month 4    4                                        2
Month 5    5                                        4
2. Determine the 0.90 confidence interval for the mean predicted when X=10 for
the following data. (Note, the data is from #2 in 10.5.1/10.5.2)
Brick    Mean pore diameter (micrometers)    Apparent porosity (%)
A        12.0                                18.8
B        9.7                                 18.3
C        7.3                                 16.3
D        5.3                                 6.9
E        10.9                                17.1
F        16.8                                20.4
3. Determine the 0.99 confidence interval for the mean predicted when X=0.50 for
the following data. (Note, the data is from #3 in 10.5.1/10.5.2)
Surface area to volume (mm2/mm3)    Drug release rate (% released/√time)
1.50                                60
1.05                                48
0.90                                39
0.75                                33
0.60                                30
0.65                                29
Extra Example for 10.0 and 10.5
x     y
2     1
4     5
9     10
11    14
a. Find the coefficient of correlation, r.
b. Find the regression equation that fits the data.
c. Predict the value of y, when x = 10.
d. Calculate the standard error of estimate.
e. Calculate the 90% confidence interval for x = 10.
11.0 Multiple Regression
11.1 Compute the coefficients of a multiple regression line and use the
equation to predict the value of a dependent variable when given the value of
independent variables/11.1.1 Interpret the coefficients of the multiple
regression equation
When there are several independent variables, you can extend the simple linear
regression model that we did in unit 10.
The following equation defines the multiple regression model with two
independent variables:
Y'i = b0 + b1 X1i + b2 X2i
Where
b0 = intercept
b1 = slope of Y with variable X1, holding variable X2 constant
b2 = slope of Y with variable X2, holding variable X1 constant
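A minimal Python sketch of how a two-variable regression equation of the form Y' = b0 + b1·X1 + b2·X2 is used for prediction. The function name and the coefficient values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def predict(b0, b1, b2, x1, x2):
    """Predicted Y for the two-variable equation Y' = b0 + b1*X1 + b2*X2."""
    return b0 + b1 * x1 + b2 * x2

# Hypothetical coefficients, chosen only to illustrate the arithmetic.
b0, b1, b2 = 100.0, -2.0, 0.5
print(predict(b0, b1, b2, x1=10, x2=40))  # 100 - 20 + 20
# Raising X1 by one unit while holding X2 constant changes Y' by b1.
print(predict(b0, b1, b2, 11, 40) - predict(b0, b1, b2, 10, 40))
```

The second line of output shows what "holding the other variable constant" means when a slope is interpreted.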
Examples:
1. A graphing calculator was used to determine the values of the
regression coefficients for a multiple regression equation. The
two variables used were:
X1i = price of an OmniPower bar (in cents) for store i
X2i = monthly in-store promotional expenditures (in dollars) for store i
And these independent variables were used to calculate:
Y'i = predicted monthly sales of OmniPower bars (in # of bars) for store i
The computed values for the regression coefficients are:
a. What is the multiple regression equation?
b. Interpret the Y intercept.
c. Interpret b1.
d. Interpret b2.
e. Use your multiple regression equation to predict the sales for a store
charging 79 cents during a month in which promotional expenditures are
$400.
Example 2: A marketing analyst for a shoe manufacturer is
considering the development of a new brand of running shoes. The
marketing analyst wants to determine which variables to use in
predicting durability. Two independent variables under
consideration are X1, a measurement of the forefoot
shock-absorbing capability, and X2, a measurement of the change in
impact properties over time. The dependent variable, Y, is a
measure of the shoe’s durability after a repeated impact test. A
random sample of 15 types of currently manufactured running
shoes was selected for testing, and the following regression
coefficients were found:
a. What is the multiple regression equation?
b. Interpret the Y intercept.
c. Interpret b1.
d. Interpret b2.
e. Use your multiple regression equation to predict the shoe’s durability if
the forefoot shock-absorbing capability is 1.5 units and the measurement
of the change in impact properties over time is 2.1 units.
11.1.2 Conduct a test of hypothesis to determine if the regression
coefficients differ from zero
Testing for the Slope in Multiple Regression
t = \frac{b_j - \beta_j}{S_{b_j}}
Where:
b_j = slope of variable j with Y, holding constant the effects of all
other independent variables
S_{b_j} = standard error of the regression coefficient b_j
t = test statistic for a t distribution with n - k - 1 degrees of
freedom
k = number of independent variables in the regression equation
\beta_j = hypothesized value of the population slope for variable j,
holding constant the effects of all other independent variables
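A minimal Python sketch of the slope test, assuming the statistic t = (b_j - β_j) / S_bj with n - k - 1 degrees of freedom. The function names, the coefficient 3.0 and the standard error 0.5 are hypothetical; the resulting t is compared with a critical value from the t table.

```python
def slope_t(b_j, se_b_j, beta_j=0.0):
    """t = (b_j - beta_j) / S_bj for one regression coefficient."""
    return (b_j - beta_j) / se_b_j

def degrees_of_freedom(n, k):
    """n observations and k independent variables give n - k - 1."""
    return n - k - 1

# Hypothetical coefficient and standard error; 34 observations, 2 predictors.
t = slope_t(b_j=3.0, se_b_j=0.5)
df = degrees_of_freedom(n=34, k=2)
print(t, df)
```

The hypothesized slope defaults to zero because the usual question is whether the variable contributes anything to the model.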
Examples:
1. a. Determine whether variable X2 (amount of promotional
expenditures) has a significant effect on sales in the OmniPower
bar example from example 1 in Section 11.1 if the standard error
of b2 was determined to be 0.6852. There were 34 stores sampled
for data. Use the 0.05 level of significance. (i.e. Determine
whether β2 is different from zero at the 0.05 level of significance.)
Recall that:
X1i = price of an OmniPower bar (in cents) for store i
X2i = monthly in-store promotional expenditures (in dollars) for store i
Y'i = predicted monthly sales of OmniPower bars for store i
The computed values for the regression coefficients are:
b. Determine whether there is evidence that the slope of sales
with price (i.e. β1) is different from zero at the 0.05 level of
significance if the standard error for b1 is 6.8522.
2. Recall example #2 in section 11.1: A marketing
analyst for a shoe manufacturer is considering the development
of a new brand of running shoes. The marketing analyst wants to
determine which variables to use in predicting durability. Two
independent variables under consideration are X1, a measurement
of the forefoot shock-absorbing capability, and X2, a measurement
of the change in impact properties over time. The dependent
variable, Y, is a measure of the shoe’s durability after a repeated
impact test. A random sample of 15 types of currently
manufactured running shoes was selected for testing, and the
following regression coefficients were found:
At the 0.05 level of significance, determine whether each
independent variable makes a significant contribution to the
regression model (i.e. conduct two hypothesis tests) if the
standard error for b1 is 0.06295 and the standard error for b2 is
0.07174.
To add to your Comprehensive assignment
38. A random sample of 90 observations produced a mean
and a standard deviation
a. Find an approximate 95% confidence interval for the population mean μ. (4 marks)
b. Find an approximate 90% confidence interval for the population mean μ. (4 marks)
c. Find an approximate 99% confidence interval for the population mean μ. (4 marks)
39. A survey was conducted by a broadcasting company in which they asked 501 satellite radio
subscribers if they had a satellite radio receiver in their cars. They found that 396 subscribers did
have a satellite receiver in their car. Find and interpret a 90% confidence interval for the
proportion. (4 marks)
40. A Golf Association would like to measure the average distance traveled when a golf ball is
hit by a machine. Suppose the association wishes to estimate the mean distance for a new
brand to within 1 yard with 90% confidence. Assume that past tests have indicated that the
standard deviation of the distance the machine hits golf balls is approximately 10 yards. How
many golf balls should be hit by the machine to achieve the desired accuracy in estimating the
mean? (4 marks)
41. A company warns that bottled water may contain more bacteria than allowed by law. Of the
more than 1000 bottles studied, nearly one-third exceeded government levels. Suppose that the
company wants an updated estimate of the population proportion of bottled water that violates
government standards. Determine the sample size (number of bottles) needed to estimate this
proportion to within
with 99% confidence. (4 marks)
42. A researcher wishes to determine whether prenatal alcohol affects learning in rats, whether
learning in rats changes with age and whether the effects of prenatal alcohol depend on the age
of the testing. The researcher uses a two by two design with five subjects per group.
The following results table is produced:
Source               SS       df    MS       F        p
A (alcohol)          192.2    1     192.2    40.68    .05
B (age)              57.8     1     57.8     12.23    .05
AxB (interaction)    168.2    1     168.2    35.60    .05
Within               75.6     16    4.725
Total                493.8    19
Determine the critical values for the first three rows of your table and compare with the
calculated values in the table to help you draw your conclusion. (8 marks)
43. A mail-order catalog business selling personal computer supplies, software, and hardware
maintains a centralized warehouse. Management is currently examining the process of
distribution from the warehouse and wants to study the factors that affect warehouse
distribution costs. Currently, a small handling fee is added to each order, regardless of the
amount of the order. Data collected over the past 24 months indicate the warehouse
distribution costs, Y (in thousands of dollars), the sales, X1 (in thousands of dollars), and the
number of orders received, X2.
a. State the multiple regression equation if the regression coefficients were determined to be: (2
marks)
b. Interpret the meaning of the slopes, b1 and b2, in this problem. (4 marks)
c. Does an interpretation of the regression coefficient, , make any sense in this example? Why
or why not? (2 marks)
d. Predict the mean monthly warehouse distribution cost when sales are $400,000 and the
number of orders is 4500. (2 marks)
e. At the 0.05 level of significance, determine whether each independent variable makes a
significant contribution to the regression model if the standard error for b1 is 0.0203 and the
standard error for b2 is 0.00225. (i.e. perform two hypothesis tests) (16 marks)
Optional: 11.1.3 Conduct a test of hypothesis on each of the regression
coefficients
Final Exam up to here!