Statistical Reasoning Student Notes

CHAPTER 5: STATISTICAL REASONING
1. Exploring Data – pg. 238-240
Assignment: pg. 239-240 #1-3
2. Frequency Tables, Histograms, and Frequency Polygons – pg. 241-253
Assignment: pg. 249-253 #1-7, 11
3. Standard Deviation – pg. 254-265
Assignment: pg. 261-265 #1-9, 11-13
4. Mid-Unit Review – pg. 266-268
Assignment: pg. 267-268 #1-6
5. The Normal Distribution – pg. 269-282
Assignment: pg. 279-282 #1-5, 7-13
6. Z-scores – pg. 283-294
Assignment: pg. 292-294 #1-10, 12, 13, 16, 18, 21
7. Confidence Intervals – pg. 295-304
Assignment: pg. 302-304 #1-6, 9, 10
8. Worksheet Review
9. Chapter Quiz
10. Chapter Review – pg. 306-310
Assignment: pg. 308-310 #1-12
11. Chapter Exam
LESSON 1: EXPLORING DATA
Learning Outcome: Learn to explore the similarities and differences
between two sets of data.
When comparing two or more sets of data, which statistical tools could
you use? (Draw on previous experience with statistics.) For each tool
you name, can you offer a definition?
Mean:
Median:
Mode:
Range:
Outlier:
With a partner, complete Getting Started on pages 236-237 of your textbook.
Ex. Frank needs a new battery for his car. He is trying to decide between
two different brands. Both brands are the same price. He obtains data for
the lifespan, in years, of 30 batteries of each brand, as shown below.
Measured Lifespans of 30 Car Batteries (years)
Brand X
5.1 7.3 6.9 4.7 5.0 6.2 6.4 5.5 5.7 6.1
6.8 6.0 4.8 4.1 5.2 8.1 6.3 7.5 5.0 4.6
5.7 8.2 3.3 3.1 4.3 5.9 6.6 5.8 6.4 5.7
Brand Y
5.4 6.3 4.8 5.9 5.5 4.7 6.0 4.5 6.6 6.0
5.0 6.5 5.8 5.4 5.1 5.7 6.8 5.6 4.9 6.1
4.9 5.7 6.2 7.0 5.8 6.8 5.9 5.3 5.6 5.9
How can you compare the data to help Frank decide which brand of battery
to buy?
Describe how the data in each set is distributed. Describe any similarities
and differences between the two sets of data.
Explain why the mean and median do not fully describe the difference
between these two brands of batteries. Consider the range, which is one
measure of dispersion for data. Explain what additional information can be
learned from the range of the data.
Is the mode useful to compare in this situation? Explain.
Suppose that one battery included in the set of data for brand Y is
defective, and its lifespan is 0.5 years instead of 5.9 years. Discuss how
this would or would not affect Frank’s decision.
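The comparison Frank needs can be sketched in a few lines of Python. This is only an illustration, and it assumes that the first three rows of the table (30 values) belong to Brand X and the last three rows to Brand Y. The names `brand_x`, `brand_y`, and `summarize` are made up for this sketch.

```python
from statistics import mean, median

# Assumption: rows 1-3 of the table are Brand X, rows 4-6 are Brand Y.
brand_x = [5.1, 7.3, 6.9, 4.7, 5.0, 6.2, 6.4, 5.5, 5.7, 6.1,
           6.8, 6.0, 4.8, 4.1, 5.2, 8.1, 6.3, 7.5, 5.0, 4.6,
           5.7, 8.2, 3.3, 3.1, 4.3, 5.9, 6.6, 5.8, 6.4, 5.7]
brand_y = [5.4, 6.3, 4.8, 5.9, 5.5, 4.7, 6.0, 4.5, 6.6, 6.0,
           5.0, 6.5, 5.8, 5.4, 5.1, 5.7, 6.8, 5.6, 4.9, 6.1,
           4.9, 5.7, 6.2, 7.0, 5.8, 6.8, 5.9, 5.3, 5.6, 5.9]

def summarize(lifespans):
    """Return (mean, median, range) for a list of lifespans in years."""
    return mean(lifespans), median(lifespans), max(lifespans) - min(lifespans)

# Defective-battery scenario: one 5.9 in Brand Y becomes 0.5.
brand_y_defect = brand_y.copy()
brand_y_defect[brand_y_defect.index(5.9)] = 0.5

for name, data in (("Brand X", brand_x),
                   ("Brand Y", brand_y),
                   ("Brand Y with defect", brand_y_defect)):
    m, med, rng = summarize(data)
    print(f"{name}: mean={m:.2f}  median={med:.2f}  range={rng:.1f}")
```

The two means come out almost equal, while the ranges differ a lot, which is why the range adds information that the mean and median do not.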
Dispersion: A measure that varies by the spread among the data in a set;
dispersion has a value of zero if all the data in a set is identical, and it
increases in value as the data becomes more spread out.
Assignment: pg. 239-240 #1-3
LESSON 2: FREQUENCY TABLES, HISTOGRAMS, AND FREQUENCY POLYGONS
Learning Outcome: Learn to create frequency tables and graphs from a
set of data.
If you inherited a hockey card collection, how could you organize a
catalogue of the cards to see how many of each player you have?
Frequency and Frequency Tables
The frequency of a particular data value is the number of times the data
value occurs.
For example, if four students have a score of 80 in mathematics, then
the score of 80 is said to have a frequency of 4. The frequency of a data
value is often represented by f.
A frequency table is constructed by arranging collected data values in
ascending order of magnitude with their corresponding frequencies.
Ex.
The marks awarded for an assignment set for a Year 8 class of 20 students
were as follows:
6 7 5 7 7 8 7 6 9 7
4 10 6 8 8 9 5 6 4 8
Present this information in a frequency table.
Construct a table with three columns. The first column shows what is being
arranged in ascending order (i.e. the marks). The lowest mark is 4. So,
start from 4 in the first column as shown below.
Go through the list of marks. The first mark in the list is 6, so put a tally
mark against 6 in the second column. The second mark in the list is 7, so
put a tally mark against 7 in the second column. The third mark in the list is
5, so put a tally mark against 5 in the second column.
Count the number of tally marks for each mark and write it in the third column.
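The tallying procedure can be checked with a short Python sketch. It is not part of the calculator method in these notes; it only counts the 20 marks listed above and prints them in ascending order with a simple tally.

```python
from collections import Counter

# The 20 assignment marks from the example.
marks = [6, 7, 5, 7, 7, 8, 7, 6, 9, 7,
         4, 10, 6, 8, 8, 9, 5, 6, 4, 8]

# Counter does the tallying: it maps each mark to its frequency f.
frequency = Counter(marks)

print("Mark  Tally  Frequency")
for mark in sorted(frequency):
    tally = "|" * frequency[mark]
    print(f"{mark:>4}  {tally:<5}  {frequency[mark]}")
```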
With the given tally chart, we can create a histogram of the situation using
our calculator:
Histogram: Graph of a frequency distribution, in which equal intervals of
values are marked on a horizontal axis and the frequencies associated with
these intervals are indicated by the areas of the rectangles drawn for these
intervals.
Stat – 1: Edit: this opens up the table editor
List the Mark in 𝐿1 and the frequency of the mark in 𝐿2
Use 2nd function STAT PLOT to access the graphing screen.
Select 1: turn plot ON, select the histogram graph
Xlist: L1
Freq: L2
In general:
We use the following steps to construct a frequency table:
Step 1:
Construct a table with three columns. Then in the first column, write down
all of the data values in ascending order of magnitude.
Step 2:
To complete the second column, go through the list of data values and
place one tally mark at the appropriate place in the second column for
every data value. When the fifth tally is reached for a mark, draw a
horizontal line through the first four tally marks as shown for 7 in the above
frequency table. We continue this process until all data values in the list
are tallied.
Step 3:
Count the number of tally marks for each data value and write it in the third
column
Class Intervals (or Groups)
When the set of data values is spread out, it is difficult to set up a
frequency table for every data value as there will be too many rows in the
table. So we group the data into class intervals (or groups) to help us
organize, interpret and analyze the data.
Ideally, we should have between five and ten rows in a frequency
table. Bear this in mind when deciding the size of the class interval (or
group).
Each group starts at a data value that is a multiple of the group size. For
example, if the size of the group is 5, then the groups should start at 5, 10,
15, 20 etc. Likewise, if the size of the group is 10, then the groups should
start at 10, 20, 30, 40 etc.
The frequency of a group (or class interval) is the number of data values
that fall in the range specified by that group (or class interval).
Ex.
The number of calls from motorists per day for roadside service was
recorded for the month of December 2003. The results were as follows:
Set up a frequency table for this set of data values.
To construct a frequency table, we proceed as follows:
Choose any appropriate range
for the values, keeping in mind
you want between 5-10 ranges.
Step 1: Construct a table with three columns, and then write the data
groups or class intervals in the first column. The size of each group is
40. So, the groups will start at 0, 40, 80, 120, 160 and 200 to include all of
the data. Note that in fact we need 6 groups (1 more than we first thought).
Step 2: Go through the list of data values. For the first data value in the
list, 28, place a tally mark against the group 0-39 in the second
column. For the second data value in the list, 122, place a tally mark
against the group 120-159 in the second column. For the third data value
in the list, 217, place a tally mark against the group 200-239 in the second
column.
Step 3: Count the number of tally marks for each group and write it in the
third column. The finished frequency table is as follows:
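As a rough sketch of Step 2, the group that a data value falls into can be found with integer division. The function name `class_interval` is hypothetical, and only the three data values quoted in Step 2 are used, because the full list of calls is not reproduced in these notes.

```python
def class_interval(value, width=40):
    """Return (start, end) of the group containing value, for groups of the given width."""
    start = (value // width) * width   # largest multiple of width not above value
    return start, start + width - 1

# The first three data values mentioned in Step 2.
for calls in (28, 122, 217):
    start, end = class_interval(calls)
    print(f"{calls} calls -> group {start}-{end}")
```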
With our tally chart completed, create a histogram without the use of a
calculator.
Roadside Service Calls
Using a Frequency Polygon:
Using our previous information:
We will need to add on one more component to our tally chart in order to
graph the situation. We need to add a midpoint value of our class intervals:
Class Interval    Midpoint    Frequency
0-40              20          1
40-80             60          5
80-120            100         12
120-160           140         8
160-200           180         4
200-240           220         1
We calculate our midpoint by adding the boundaries of each interval and
dividing by 2.
The chart now gives us values we can input into our calculator.
List 1: Midpoint
List 2: Frequency
Recall how to input information into lists on our calculator.
In our stat plot, choose the line instead of the histogram.
We may need to change our window in order to see the graph:
X: [0, 250, 50]   Y: [0, 15, 5]
Frequency Polygon: The graph of a frequency distribution, produced by
joining the midpoints of the intervals using straight lines.
Assignment: pg. 249-253 #1-7, 11
LESSON 3: STANDARD DEVIATION
Learning Outcome: Learn to determine the standard deviation for sets of
data, and use it to solve problems and make decisions.
Celebrity Guessing Game
Suppose that you are in some course and have just received your grade on
an exam. It is natural to ask how the rest of the class did on the exam so
that you can put your grade in some context. Knowing the mean or median
tells you the "center" or "middle" of the grades, but it would also be helpful
to know some measure of the spread or variation in the grades.
Let’s look at a small example. Suppose three classes of 5 students each
write the same exam and the grades are:
Class 1    Class 2    Class 3
82         82         67
78         82         66
70         82         66
58         42         66
42         42         65
Find the mean for each class:
What do we notice?
Does the mean describe the differences in each set of data?
What is the range in each class? Does this help with picking the more
consistent class?
Each of these classes has a mean, 𝑥̅, of 66, and yet there is a great difference
in the variation of the grades in each class. One measure of the variation is
the range, which is the difference between the highest and lowest grades.
In this example the range for the first two classes is 82 - 42 = 40 while the
range for the third class is 67 - 65 = 2. The range is not a very good
measure of variation here as classes 1 and 2 have the same range yet their
variation seems to be quite different. One way to see this variation is to
notice that in class 3 all the grades are very close to the mean, in class 1
some of the grades are close to the mean and some are far away and in
class 2 all of the grades are a long way from the mean. It is this concept
that leads to the definition of the standard deviation.
Standard Deviation: A measure of the dispersion or scatter of data values
in relation to the mean; a low standard deviation indicates that most data
values are close to the mean, and a high standard deviation indicates that
most data values are scattered farther from the mean.
We can find standard deviation using our calculators.
What is the standard deviation (𝜎), of each class in the chart above?
Enter the grades for each class into a list on the calculator. Once they are
entered, choose STAT: CALC: 1: 1-Var Stats, then enter the list you want
evaluated:
Class 1: 𝜎 = 14.5
Class 2: 𝜎 = 19.6
Class 3: 𝜎 = 0.63
According to the standard deviations calculated, which class is the most
consistent? Which class is the most inconsistent?
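The calculator results can be reproduced with Python's `statistics` module. This sketch assumes the population standard deviation (`pstdev`, which divides by n), since that is what matches the 𝜎 values listed above.

```python
from statistics import mean, pstdev

classes = {
    "Class 1": [82, 78, 70, 58, 42],
    "Class 2": [82, 82, 82, 42, 42],
    "Class 3": [67, 66, 66, 66, 65],
}

# For each class: (mean, range, population standard deviation).
results = {name: (mean(g), max(g) - min(g), pstdev(g))
           for name, g in classes.items()}

for name, (m, rng, sd) in results.items():
    print(f"{name}: mean={m}  range={rng}  sigma={sd:.2f}")
```

All three means are equal, Classes 1 and 2 share the same range, and only the standard deviation separates all three classes.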
Ex. Brendan was wondering about the accuracy of the mass
measurements given on two cartons that contained sunflower seeds. He
decided to measure the masses of the 20 bags in the two cartons. One
carton contained 227g bags, and the other contained 454g bags.
Masses of 227g Bags (g)
228 220 233 227
230 227 221 229
224 235 224 231
226 232 218 218
229 232 236 223
Masses of 454g Bags (g)
458 445 457 458
452 457 445 452
463 455 451 460
455 453 456 459
451 455 456 450
How can measures of dispersion be used to determine if the accuracy of
measurement is the same for both bag sizes?
227 g bags:    Range:          Mean =          𝜎 =
454 g bags:    Range:          Mean =          𝜎 =
The accuracy of measurement is not the same for both sizes of bag.
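One way to fill in the blanks above without a calculator is the following Python sketch. It assumes the population standard deviation, as in the earlier class example, and the dictionary name `stats` is only illustrative.

```python
from statistics import mean, pstdev

bags_227 = [228, 220, 233, 227, 230, 227, 221, 229, 224, 235,
            224, 231, 226, 232, 218, 218, 229, 232, 236, 223]
bags_454 = [458, 445, 457, 458, 452, 457, 445, 452, 463, 455,
            451, 460, 455, 453, 456, 459, 451, 455, 456, 450]

# For each carton: range, mean, and population standard deviation (grams).
stats = {}
for label, masses in (("227 g", bags_227), ("454 g", bags_454)):
    stats[label] = {
        "range": max(masses) - min(masses),
        "mean": mean(masses),
        "sigma": pstdev(masses),
    }
    print(label, stats[label])
```

The ranges turn out to be equal, so the standard deviation is the measure that distinguishes the two bag sizes here.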
Ex. Angela conducted a survey to determine the number of hours per
week that Grade 11 males in her school play video games. She
determined that the mean was 12.84h, with a standard deviation of 2.16h.
Janessa conducted a similar survey of Grade 11 females. She organized
her results in this frequency table. Compare the results of the two surveys.
Gaming Hours per Week for Grade 11 Females
Hours    Frequency
3-5      7
5-7      11
7-9      16
9-11     19
11-13    12
13-15    5
How would we have to modify the table to enter the results into our
calculator?
Find a midpoint of the hours.
Gaming Hours per Week for Grade 11 Females
Hours    Midpoint    Frequency
3-5      4           7
5-7      6           11
7-9      8           16
9-11     10          19
11-13    12          12
13-15    14          5
Enter values into calculator:
Gaming hours per week for Grade 11 females:
𝑥̅ =
𝜎=
Gaming hours per week for Grade 11 males:
𝑥̅ =
𝜎=
Comparing the two sets of data:
The standard deviation for the females is higher than the standard
deviation for the males.
Therefore, the females' times vary more from their mean of about 9 h.
The standard deviation for the males is lower. Therefore, their data is more
consistent, even though their mean is higher.
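The grouped calculation behind this comparison can be sketched in Python. Each interval is represented by its midpoint and weighted by its frequency, which is what entering midpoints in one list and frequencies in another does on the calculator. The variable names are illustrative, and the result is an estimate because the individual responses are not known.

```python
from math import sqrt

intervals = [(3, 5), (5, 7), (7, 9), (9, 11), (11, 13), (13, 15)]
frequencies = [7, 11, 16, 19, 12, 5]

# Midpoint = (lower boundary + upper boundary) / 2
midpoints = [(low + high) / 2 for low, high in intervals]

n = sum(frequencies)
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
grouped_sd = sqrt(sum(f * (x - grouped_mean) ** 2
                      for f, x in zip(frequencies, midpoints)) / n)

print(f"females: n={n}  mean={grouped_mean:.2f} h  sigma={grouped_sd:.2f} h")
print("males:   mean=12.84 h  sigma=2.16 h  (given in the example)")
```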
Assignment: pg. 261-265 #1-9, 11-13
Mid-Unit Review – pg. 266-268
Assignment: pg. 267-268 #1-6
LESSON 4: THE NORMAL DISTRIBUTION
Learning Outcome: Learn to determine the properties of a normal
distribution, and compare normally distributed data.
Many games require dice. For example, the game of Yahtzee requires five
dice.
What shape is the data distribution for the sum of the numbers rolled with
dice, using various numbers of dice?
If we use two dice, find the number of ways the two dice can add up to each
sum given in the chart below:
Sum:                2   3   4   5   6   7   8   9   10   11   12
Ways to find sum:
Create a histogram using the above information.
Use the information to calculate the mean of the frequency table.
Mean =
Which phrase best describes the distribution of the data?
A: There are more data above the mean than below it
B: There are more data below the mean than above it
C: The data are symmetrically distributed about the mean
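The chart for two dice can be checked by listing all 36 equally likely outcomes. The following Python sketch is one way to do that; it is an illustration, not the method used in class.

```python
from collections import Counter
from itertools import product

# Every ordered pair (first die, second die) and the sum it produces.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

for total in sorted(ways):
    print(f"{total:>2}: {'#' * ways[total]} ({ways[total]})")

# Mean of the frequency table: sum of (value x frequency) / total frequency.
mean_sum = sum(total * count for total, count in ways.items()) / 36
print("mean =", mean_sum)
```

The text histogram printed by the loop rises to a single peak and falls away evenly on both sides, which is the shape the questions above ask about.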
With a partner, roll two dice 50 times. Record the sum for each roll in a
frequency distribution table. Then draw a graph to represent the
distribution of the data. Comment on the distribution of the data.
Sum:                2   3   4   5   6   7   8   9   10   11   12
Ways to find sum:
What would happen to the graph if we combined the data from the entire
class? How would the graph look?
Make a conjecture about what the graph would look like if you rolled the
two dice 50 000 times.
Normal Curve: a symmetrical curve that represents the normal distribution;
also called the bell curve.
Normal Distribution: Data that, when graphed as a histogram or a
frequency polygon, results in a unimodal symmetric distribution about the
mean.
The Standard Normal Distribution Curve
In order to compare normal curves and to solve probability problems
involving normal distributions, we convert the normal distribution curve
given in a problem into the standard normal distribution curve.
The diagram shows the approximate area under the standard normal
distribution curve sub-divided into regions of width equal to one standard
deviation. The percentage of the area under the curve in each region is
indicated.
• 50% of the data is above the mean
• 68.26% of the data is within one standard deviation of the mean
• 95.44% of the data is within two standard deviations of the mean
• 99.74% of the data is within three standard deviations of the mean
• Total area under the curve is 1 or 100%
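These areas can be checked with `NormalDist` from Python's standard library, as in the sketch below. The computed values differ from the percentages above in the last digit, because the figures in the notes are sums of rounded region areas.

```python
from statistics import NormalDist

standard = NormalDist()  # standard normal curve: mean 0, standard deviation 1

# Area within k standard deviations of the mean = A(k) - A(-k).
within = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} standard deviation(s): {area:.2%}")
print(f"above the mean: {1 - standard.cdf(0):.0%}")
```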
Ex. A nurse records the number of hours an infant sleeps during a day. He
then records the data on a normal distribution curve shown below. The
values shown on the horizontal axis differ by one standard deviation.
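[Figure: normal distribution curve. Horizontal axis: Number of Hours Slept, marked A, B, 10, 12, 14, C, D at intervals of one standard deviation. Regions from left to right: 0.13%, 2.15%, 13.59%, 34.13%, 34.13%, 13.59%, 2.15%, 0.13%.]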
a. What is the mean of the data?
b. What is the standard deviation?
c. What are the values for A, B, C, and D?
d. What percentage of the time, to the nearest hundredth, does the infant
sleep between 8 and 16 hours in a day?
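The reasoning for this example can be laid out as a short Python sketch. It assumes the axis labels A, B, 10, 12, 14, C, D are evenly spaced, one standard deviation apart, with the mean at the centre label, and it uses the region percentages shown on the curve. The variable names are illustrative.

```python
# Read from the axis: the centre label is the mean, and neighbouring
# labels differ by one standard deviation.
mean_hours = 12
sd_hours = 14 - 12

# Assumption: A, B, C, D sit at -3, -2, +2, +3 standard deviations.
labels = {
    "A": mean_hours - 3 * sd_hours,
    "B": mean_hours - 2 * sd_hours,
    "C": mean_hours + 2 * sd_hours,
    "D": mean_hours + 3 * sd_hours,
}

# 8 h and 16 h are two standard deviations either side of the mean,
# so add the four regions between them.
percent_8_to_16 = 13.59 + 34.13 + 34.13 + 13.59

print(labels)
print(f"{percent_8_to_16:.2f}% between 8 and 16 hours")
```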
Ex. Shirley wants to buy a new cellphone. She researches the cellphone
she is considering and finds the following data on its longevity, in years.
2.0 2.4 3.3 1.7 2.5 3.7 2.0 2.3 2.9 2.2
2.3 2.7 2.5 2.7 1.9 2.4 2.6 2.7 2.8 2.5
1.7 1.1 3.1 3.2 3.1 2.9 2.9 3.0 2.1 2.6
2.6 2.2 2.7 1.8 2.4 2.5 2.4 2.3 2.5 2.6
3.2 2.1 3.4 2.2 2.7 1.9 2.9 2.6 2.7 2.8
a. Does the data approximate a normal distribution? (Plot the data on a
graphing calculator and find the mean and standard deviation.)
Histogram looks symmetrical about the mean.
Mean =          , standard deviation =          , median =
Using the information from our data, draw a normal distribution.
Draw curve for students, label mean, and values of each increment.
u + sd = 3…
b. If Shirley purchases this cellphone, what is the likelihood that it will last
for more than three years?
Draw the histogram, label the mean and standard deviations.
𝜇 + 1𝜎 = 2.526 + 0.482 = 3.008, so three years is about one standard
deviation above the mean.
The area to the left of one standard deviation above the mean is about
50% + 34% = 84%.
For the cellphones that last more than three years:
100% − 84% = 16%.
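The mean and standard deviation used above can be reproduced from the 50 data values with the Python sketch below. It also estimates the likelihood of lasting more than three years in two ways: from a normal model with the computed mean and standard deviation, and by counting the data directly. Both are illustrations rather than the textbook method, and the variable names are made up.

```python
from statistics import NormalDist, mean, pstdev

longevity = [2.0, 2.4, 3.3, 1.7, 2.5, 3.7, 2.0, 2.3, 2.9, 2.2,
             2.3, 2.7, 2.5, 2.7, 1.9, 2.4, 2.6, 2.7, 2.8, 2.5,
             1.7, 1.1, 3.1, 3.2, 3.1, 2.9, 2.9, 3.0, 2.1, 2.6,
             2.6, 2.2, 2.7, 1.8, 2.4, 2.5, 2.4, 2.3, 2.5, 2.6,
             3.2, 2.1, 3.4, 2.2, 2.7, 1.9, 2.9, 2.6, 2.7, 2.8]

mu = mean(longevity)
sigma = pstdev(longevity)   # population standard deviation

# Normal-model estimate of P(lifespan > 3 years).
p_model = 1 - NormalDist(mu, sigma).cdf(3.0)

# Share of the observed phones that actually lasted more than 3 years.
p_observed = sum(x > 3.0 for x in longevity) / len(longevity)

print(f"mean={mu:.3f}  sigma={sigma:.3f}")
print(f"model estimate: {p_model:.1%}   observed share: {p_observed:.1%}")
```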
Normal curves can vary in two main ways: the mean determines the
location of the centre of the curve on the horizontal axis, and the standard
deviation determines the width and height of the curve.
Assignment: pg. 279-282 #1-5, 7-13
LESSON 5: Z-SCORES
Learning Outcome: Learn to use z-scores to compare data, make
predictions, and solve problems.
What are z-scores?
A common statistical way of standardizing data on one scale so a
comparison can take place is using a z-score. The z-score is like a
common yard stick for all types of data. Each z-score corresponds to a
point in a normal distribution and as such is sometimes called a normal
deviate since a z-score will describe how much a point deviates from a
mean or specification point.
A z-score is a standardized value that indicates the number of standard
deviations of a data value above or below the mean.
Formula for calculating z-scores: z = (x − 𝜇) / 𝜎
where z = the z-score, x = the particular data value, 𝜇 = the mean, and
𝜎 = the standard deviation
Given a single data point, how many standard deviations is it from the
mean?
How do we find how many
standard deviations from the
mean the line is?
Ex. Tony’s midterm marks are shown below, together with the class mean
and the standard deviation for each subject. By calculating z-scores,
determine in which subject Tony performed best relative to the class.
Subject      Tony's Mark    Mean Mark    Standard Deviation
Math         74             68           12
Chemistry    79             73           14
Physics      68             66           11
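The comparison for Tony's marks can be sketched in Python by applying the z-score formula to each row of the table. The function name `z_score` and the dictionary layout are only illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# subject: (Tony's mark, class mean, class standard deviation)
marks = {
    "Math": (74, 68, 12),
    "Chemistry": (79, 73, 14),
    "Physics": (68, 66, 11),
}

z_scores = {subject: z_score(*row) for subject, row in marks.items()}
best_subject = max(z_scores, key=z_scores.get)

for subject, z in z_scores.items():
    print(f"{subject}: z = {z:.2f}")
print("Best relative to the class:", best_subject)
```

Tony's highest raw mark is in Chemistry, but the highest z-score is what shows where he did best relative to the class.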
Z-Score Tables
The z-score table gives the area to the left of a particular z value. This
area to the left of z is denoted by A(z)
Properties of z-scores
• A z-score for a data value describes the number of standard deviations
above or below the mean.
• A negative z-score indicates that the data value is below the mean and
is shown to the left of the mean on the standard normal curve.
• A positive z-score indicates that the data value is above the mean and is
shown to the right of the mean on the standard normal curve.
• The z-score table gives:
  - Area under the curve, to the left of the z-score, or
  - Percentage of data to the left of the z-score, or
  - Probability that a randomly chosen data value is to the left of the z-score.
• The mean, median, and mode have a z-score of zero.
Ex. Use z-score table to calculate:
a. A(-3)
Locate –3 on the left hand side, then locate 0.00 on the top and match
up the answers to get 0.0013
b. A(1)
A(1) =
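Table look-ups like these can be checked with the standard normal distribution in Python's `statistics` module. The helper name `area_left` is made up for this sketch; it plays the role of A(z). A printed table rounds or truncates to four decimal places, so the last digit may differ slightly from the computed value.

```python
from statistics import NormalDist

def area_left(z):
    """A(z): area under the standard normal curve to the left of z."""
    return NormalDist().cdf(z)

for z in (-3, 1):
    print(f"A({z}) = {area_left(z):.4f}")
```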
Ex. IQ tests are sometimes used to measure a person’s intellectual
capacity at a particular time. IQ scores are normally distributed, with a
mean of 100 and a standard deviation of 15. If a person scores 119 on an
IQ test, how does this score compare with the scores of the general
population? Sketch the situation as a normally distributed histogram.
z = (119 − 100) / 15 ≈ 1.27, and the z-score table gives A(1.27) = 0.8980.
This means that an IQ score of 119 is greater than 89.80% of IQ scores in
the general population.
Solving the same question using technology:
Normalcdf(
The command “normalcdf” can be used to calculate normal distribution
probability between two data values or to the left or right of a data value for
a specified mean and standard deviation.
Accessing normalcdf:
1. Access the distribution menu DISTR by pressing 2nd then VARS
2. Select “normalcdf”, second selection, press enter
3. Use the following to determine the area between two data values:
Normalcdf(lower bound, upper bound, mean, standard deviation)
4. Press enter to determine the area
To calculate to the left of a data value, replace the lower bound with
−1 × 10^99, or with 0 when 0 lies many standard deviations below the mean.
To calculate to the right of a data value, replace the upper bound with
1 × 10^99.
The answers obtained will not be exactly the same as those obtained from
tables, due to the increased accuracy provided by the calculator.
Normalcdf(0, 119, 100, 15) = 0.8973
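For readers without the calculator, a rough Python equivalent of the command can be built on `statistics.NormalDist`. The function below is named `normalcdf` only to mirror the calculator; it is a sketch, not the calculator's own implementation.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean, sd):
    """Area under the normal curve with the given mean and sd, between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

# Same inputs as the calculator example above.
p_iq = normalcdf(0, 119, 100, 15)
print(f"normalcdf(0, 119, 100, 15) = {p_iq:.4f}")

# Using a very large negative lower bound gives practically the same area.
p_iq_wide = normalcdf(-1e99, 119, 100, 15)
print(f"normalcdf(-1e99, 119, 100, 15) = {p_iq_wide:.4f}")
```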
Ex. Running shoes lose their shock-absorption after a mean distance of
640 km, with a standard deviation of 160 km. Zack is an elite runner and
wants to replace his shoes after 25% of their natural life. At what distance
should he replace his shoes?
Sketch the situation:
25%
We need to find out the z-score that represents 25% or 0.25
Solving the previous question using technology:
InvNorm
Given the area to the left of a data value, the command “invNorm(“ can be
used to calculate the data value. The mean and standard deviation must
be given.
Steps:
1. Access the distribution menu DISTR by pressing 2nd then VARS
2. Select “invNorm(“, the 3rd choice and then press enter
3. Enter data values:
InvNorm(area, mean, standard deviation)
4. Press enter
Using InvNorm to solve our previous question:
InvNorm(0.25, 640, 160) = 532.08
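The same calculation can be sketched in Python with `NormalDist.inv_cdf`, which returns the data value that has a given area to its left. The wrapper name `inv_norm` is only chosen to resemble the calculator command.

```python
from statistics import NormalDist

def inv_norm(area, mean, sd):
    """Data value with the given area to its left, for a normal curve (mean, sd)."""
    return NormalDist(mean, sd).inv_cdf(area)

distance = inv_norm(0.25, 640, 160)
z = NormalDist().inv_cdf(0.25)   # the z-score that has 25% of the area to its left

print(f"z = {z:.4f}")
print(f"replace shoes at about {distance:.2f} km")
```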
Assignment: pg. 292-294 #1-10, 12, 13, 16, 18, 21
LESSON 6: CONFIDENCE INTERVALS
Learning Outcome: Learn to use the normal distribution to solve problems
that involve confidence intervals.
If a light bulb company wants to test the number of hours that a bulb will
burn before failing, is it logical to test every bulb?
If not, propose a method that the company could use to determine the
longevity of its light bulbs.
Ex. A telephone survey of 600 randomly selected people was conducted in
an urban area. The survey determined that 76% of people from 18 to 34
years of age have a social networking account. The results are accurate
within plus or minus 4 percentage points, 19 times out of 20.
How can this result be interpreted if the total population of 18- to
34-year-olds is 92 500?
Calculate the range of people that have a social networking account, and
determine the certainty of the results.
The margin of error is ±4%, so the confidence interval is 76% ± 4%.
The range of results is 72% to 80%.
The confidence level of the survey is 95% (19 out of 20)
Confidence interval for population:
92 500 x 0.76 = 70 300
92500 x 0.04 = 3700
Population interval is 66 600 to 74 000.
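The arithmetic for this example can be written out as a small Python sketch. Percentages are kept as whole numbers so the results are exact, and the variable names are illustrative.

```python
population = 92_500
estimate_pct = 76      # survey result, in percent
margin_pct = 4         # margin of error, in percentage points
confidence = 19 / 20   # "19 times out of 20"

# Confidence interval in percent, then scaled to the population.
low_pct, high_pct = estimate_pct - margin_pct, estimate_pct + margin_pct
low_people = population * low_pct // 100
high_people = population * high_pct // 100

print(f"confidence level: {confidence:.0%}")
print(f"interval: {low_pct}% to {high_pct}%")
print(f"people: {low_people} to {high_people}")
```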
Ex. To meet regulation standards, baseballs must have a mass from
142.0 g to 149.0 g. A manufacturing company has set its production
equipment to create baseballs that have a mean mass of 145.0 g.
To ensure that the production equipment continues to operate as expected,
the quality control engineer takes a random sample of baseballs each day
and measures their mass to determine their mean mass. If the mean mass
of the random sample is 144.7 g to 145.3 g, then the production equipment
is running correctly. If the mean mass of the sample is outside the
acceptable level, the production equipment is shut down and adjusted. The
quality control engineer refers to the chart shown when conducting random
sampling.
Confidence Level    Sample Size Needed
99%                 110
95%                 65
90%                 45
a. What is the confidence interval and margin of error the engineer is using
for quality control tests?
b. Interpret the table
• Confidence level 99%: the engineer needs to measure 110 baseballs to be
confident in the result 99 out of 100 times.
• Confidence level 95%: the engineer needs to measure 65 baseballs to be
confident in the result 95 out of 100 times.
• Confidence level 90%: the engineer needs to measure 45 baseballs to be
confident in the result 90 out of 100 times.
c. What is the relationship between confidence level and sample
size?
For a constant margin of error, as the confidence level increases, the size
of the sample needed to attain that confidence level increases. To have
greater confidence that the baseballs meet quality standards, the engineer
must use a larger sample.
Ex. A poll was conducted to ask voters the following question: If an
election were held today, whom would you vote for? The results indicated
that 53% would vote for Smith and 47% would vote for Jones. The results
were stated as being accurate within 3.8 percentage points, 19 times out of 20.
Who will win the election?
For Smith: 53% ±3.8%: Confidence interval:
For Jones: 47% ±3.8%: Confidence interval:
The two confidence intervals overlap from _____ to _____.
If the poll is accurate, Smith is more likely to ______. However, there is a
chance that Jones will win, since the confidence intervals overlap by
_____ of the votes.
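The intervals and their overlap can be worked out with a short Python sketch. The helper name `confidence_interval` is hypothetical, and the values are rounded to one decimal place to avoid floating-point noise.

```python
def confidence_interval(result_pct, margin_pct):
    """Return (low, high) in percent for a poll result and its margin of error."""
    return round(result_pct - margin_pct, 1), round(result_pct + margin_pct, 1)

smith = confidence_interval(53, 3.8)
jones = confidence_interval(47, 3.8)

# The overlap runs from the larger of the two lower ends
# to the smaller of the two upper ends.
overlap = (max(smith[0], jones[0]), min(smith[1], jones[1]))
overlap_width = round(overlap[1] - overlap[0], 1)

print("Smith:", smith)
print("Jones:", jones)
print("Overlap:", overlap, "width:", overlap_width, "percentage points")
```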
Need to know:
• A confidence interval is expressed as the survey or poll result, plus or
minus the margin of error.
• The margin of error increases as the confidence level increases (with
a constant sample size). The sample size that is needed also
increases as the confidence level increases (with a constant margin
of error).
• The sample size affects the margin of error. A larger sample results
in a smaller margin of error, assuming that the same confidence level
is required.
For example:
• A sample of 1000 is considered to be accurate to within ±3.1%, 19
times out of 20.
• A sample of 2000 is considered to be accurate to within ±2.2%, 19
times out of 20.
• A sample of 3000 is considered to be accurate to within ±1.8%, 19
times out of 20.
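The three margins of error quoted above are consistent with a common approximation for a surveyed proportion, margin ≈ z × √(p(1 − p)/n). That formula does not appear in these notes, so treat the sketch below as an assumption: it uses the worst case p = 0.5 and z = 1.96 for 19 times out of 20.

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion p from a sample of size n.

    Assumptions (not from the notes): worst-case p = 0.5, and z = 1.96,
    the z-score for a 95% confidence level.
    """
    return z * sqrt(p * (1 - p) / n)

for n in (1000, 2000, 3000):
    print(f"n = {n}: about ±{margin_of_error(n):.1%}")
```

Doubling the sample from 1000 to 2000 does not halve the margin of error, because the margin shrinks with the square root of the sample size.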
Assignment: pg. 302-304 #1-6, 9, 10
Chapter Quiz
Chapter Review – pg. 306-310
Assignment: pg. 308-310 #1-12
Chapter Exam