MAT 110 WORKSHOP
Created by Michael Brown, Haden McDonald & Myra Bentley
for use by the Center for Academic Support
UNIT 3: STATISTICS
Introduction
Definitions
Mean: The average in a set of data.
Median: The middle number in an ordered list. If there are two middles, the median is the average of those two.
Mode: The number(s) that appears the most frequently in a data set.
Range: The difference between the largest and smallest values.
Standard Deviation: An average measure of how far each data point is from the mean.
Normal Distribution: A very common distribution that describes many real life values. The symmetric Bell curve.
Z-Score: The number of standard deviations a value is from the mean.
Confidence Interval: A range that is 'likely' to contain the actual mean of the population a sample came from. It is written as the sample mean plus or minus a margin of error.
Margin of error: The amount added to and subtracted from the sample mean to form a confidence interval (half the width of the interval).
Example
• Calculate the mode, mean, and median of the following
data:
• 12, 11, 6, 24, 11, 9, 15, 11
Example
6,9,11,11,11,12,15,24
• Mode = 11 because it appears 3 times while all the other
numbers only appear once.
• Mean = 99/8 = 12.375 because 6+9+11+11+11+12+15+24=99 and
there are 8 numbers
• Median = 11 because 11 is in the middle of the numbers
when placed smallest to largest
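The three values above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable name `data` is illustrative and not part of the workshop.

```python
from statistics import mean, median, multimode

data = [12, 11, 6, 24, 11, 9, 15, 11]

print(sorted(data))      # ordered list used to locate the median
print(multimode(data))   # every value tied for most frequent
print(mean(data))        # sum of the values divided by how many there are
print(median(data))      # average of the two middle values for an even count
```

`multimode` returns a list, so it also covers data sets with more than one mode.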
Frequency Tables
A frequency table is a table that shows the total for each category or group of data.
•Example: 25 viewers evaluated the latest episode of CSI. The possible evaluations
are:
(E)xcellent, (A)bove average, a(V)erage, (B)elow average, (P)oor.
After the show, the 25 evaluations were as follows:
A, V, V, B, P, E, A, E, V, V, A, E, P, B, V, V, A, A, A, E, B, V, A, B, V
Construct a frequency table and a relative frequency table for this list of evaluations.
Frequency Tables
Grouped by category, the 25 evaluations are:
E, E, E, E, A, A, A, A, A, A, A, V, V, V, V, V, V, V, V, B, B, B, B, P, P
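Counting each category gives the frequency table, and dividing each count by 25 gives the relative frequency table. The sketch below does this with `collections.Counter`; the variable names and the column layout are assumptions made for illustration.

```python
from collections import Counter

# The 25 evaluations in the order they were collected
evaluations = "A V V B P E A E V V A E P B V V A A A E B V A B V".split()
counts = Counter(evaluations)
total = len(evaluations)

print(f"{'Evaluation':<12}{'Frequency':>10}{'Relative frequency':>20}")
for label in "EAVBP":
    # relative frequency = frequency / total number of evaluations
    print(f"{label:<12}{counts[label]:>10}{counts[label] / total:>20.2f}")
```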
Representing Data Visually
• Solution: The bar graph for the relative frequency is
shown below.
Frequency Tables
•Example: Suppose 40 health care workers take an AIDS
awareness test and earn the following scores:
79, 62, 87, 84, 53, 76, 67, 73, 82, 68,
82, 79, 61, 51, 66, 77, 78, 66, 86, 70,
76, 64, 87, 82, 61, 59, 77, 88, 80, 58,
56, 64, 83, 71, 74, 79, 67, 79, 84, 68
Construct a frequency table and a relative frequency table for
these data.
Frequency Tables
79, 62, 87, 84, 53, 76, 67, 73, 82, 68,
82, 79, 61, 51, 66, 77, 78, 66, 86, 70,
76, 64, 87, 82, 61, 59, 77, 88, 80, 58,
56, 64, 83, 71, 74, 79, 67, 79, 84, 68
Representing Data Visually
A variable quantity that cannot take on arbitrary
values is called discrete. Other quantities, called
continuous variables, can take on arbitrary values.
The number of children in a family is an example of
a discrete variable. Weight is an example of a
continuous variable.
We use a special type of bar graph called a
histogram to graph a frequency distribution when we
are dealing with a continuous variable quantity or a
variable quantity that is discrete, but has a very large
number of different possible values.
Representing Data Visually
• Example: A clinic has the following data regarding
the weight lost by its clients over the past 6
months. Draw a histogram for the relative
frequency distribution for these data.
(continued on next slide)
Representing Data Visually
• Solution: We first find the relative frequency
distribution.
(continued on next slide)
Representing Data Visually
Draw the histogram exactly like a bar graph
except that we do not allow spaces between the
bars.
Stem and Leaf Display
• Example: The following are the number of home
runs hit by the home run champions in the
National League for the years 1975 to 1989 and
for 1993 to 2007.
1975–1989: 38, 38, 52, 40, 48, 48, 31, 37, 40,
36, 37, 37, 49, 39, 47
1993–2007: 46, 43, 40, 47, 49, 70, 65, 50, 73,
49, 47, 48, 51, 58, 50
Compare these home run records using a stem-and-leaf display.
(continued on next slide)
Stem and Leaf Display
• Solution: In constructing a stem-and-leaf display,
we view each number as having two parts. The
left digit is considered the stem and the right digit
the leaf. For example, 38 has a stem of 3 and a
leaf of 8.
[Stem-and-leaf displays for 1975 to 1989 and for 1993 to 2007]
(continued on next slide)
Stem and Leaf Display
We can compare these data by placing these two
displays side by side as shown below. Some call
this display a back-to-back stem-and-leaf display.
It is clear that the home run champions hit
significantly more home runs from 1993 to 2007
than from 1975 to 1989.
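A back-to-back display like the one described can be generated from the two lists. The following is a rough sketch rather than the layout used on the slide; the function name `stem_and_leaf` and the column widths are assumptions.

```python
from collections import defaultdict

early = [38, 38, 52, 40, 48, 48, 31, 37, 40, 36, 37, 37, 49, 39, 47]  # 1975-1989
late = [46, 43, 40, 47, 49, 70, 65, 50, 73, 49, 47, 48, 51, 58, 50]   # 1993-2007

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits)."""
    display = defaultdict(list)
    for value in sorted(values):
        display[value // 10].append(value % 10)
    return display

left, right = stem_and_leaf(early), stem_and_leaf(late)
for stem in range(3, 8):
    # Leaves on the left side read outward from the stem, so reverse them.
    left_leaves = " ".join(str(leaf) for leaf in reversed(left.get(stem, [])))
    right_leaves = " ".join(str(leaf) for leaf in right.get(stem, []))
    print(f"{left_leaves:>20} | {stem} | {right_leaves}")
```

The 1975 to 1989 leaves cluster on stems 3 and 4, while the 1993 to 2007 leaves reach stems 5, 6, and 7.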
The Mean and the Median
We use the Greek letter Σ (capital sigma) to indicate a
sum. For example, we will write the sum of the data
values 7, 2, 9, 4, and 10 as Σx = 7 + 2 + 9 + 4 + 10.
We represent the mean of a sample of a population by
x̄ (read as “x bar”), and we will use the Greek letter μ
(lowercase mu) to represent the mean of the whole
population.
The Mean and the Median
• Example: A car company has been studying its
safety record at a factory and found that the
number of accidents over the past 5 years was
25, 23, 27, 22, and 26. Find the mean annual
number of accidents for this 5-year period.
• Solution: We add the number of accidents and
divide by 5:
x̄ = Σx / 5 = (25 + 23 + 27 + 22 + 26) / 5 = 123 / 5 = 24.6
The Mean and the Median
• Example: The water temperature at a point
downstream from a plant for the last 30 days is
summarized in the table. What is the mean
temperature for this distribution?
(continued on next slide)
The Mean and the Median
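• Solution: A third column is added to the table that
contains the products of the raw scores and their
frequencies.
The mean is the sum of these products divided by the sum of
the frequencies (here, 30 days): x̄ = Σ(x · f) / Σf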
The Mean and the Median
• Example: Listed are the
yearly earnings of
some celebrities.
a) What is the mean of
the earnings of the
celebrities on this list?
b) Is this mean an accurate measure of the
“average” earnings for these celebrities?
(continued on next slide)
The Mean and the Median
• Solution (a):
Summing the salaries and dividing by 10
gives us
Solution (b): Eight of the celebrities have
earnings below the mean, whereas only two have
earnings above the mean. The mean in this
example does not give an accurate sense of
what is “average” in this set of data because it
was unduly influenced by higher earnings.
The Mean and the Median
• Example: The table lists the ages at
inauguration of the presidents who
assumed office between 1901 and
1993. Find the median age for this
distribution.
• Solution: We first arrange the ages in
order to get
42, 43, 46, 51, 51, 51, 52, 54, 55,
55, 56, 56, 60, 61, 61, 64, 69.
There are 17 ages. The middle age is
the ninth, which is 55.
The Mean and the Median
• Example: Fifty 32-ounce quarts of a particular
brand of milk were purchased and the actual
volume determined. The results of this survey
are reported in the table. What is the median for
this distribution?
The Mean and the Median
• Solution: Because the 50 scores are in increasing
order, the two middle scores are in positions 25
and 26. We see that 29 ounces is in position 25
and 30 ounces is in position 26. The median for
this distribution is (29 + 30) / 2 = 29.5 ounces.
Five Number Summary
The Five Number Summary
• Example: Consider the list of ages of the presidents
from a previous example:
42, 43, 46, 51, 51, 51, 52, 54, 55,
55, 56, 56, 60, 61, 61, 64, 69.
Find the following for this data set:
a) the lower and upper halves
b) the first and third quartiles
c) the five-number summary
(continued on next slide)
The Five Number Summary
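• Solution:
(a): Finding the median, we can identify the lower
and upper halves. The median is the ninth value, 55.
Leaving the median itself out, the lower half is
42, 43, 46, 51, 51, 51, 52, 54 and the upper half is
55, 56, 56, 60, 61, 61, 64, 69.
(b): The median of the lower half is Q1 = (51 + 51) / 2 = 51.
The median of the upper half is Q3 = (60 + 61) / 2 = 60.5.
(continued on next slide)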
The Five Number Summary
(c): The five number summary is 42, 51, 55, 60.5, 69 (minimum, Q1, median, Q3, maximum).
We represent the five-number summary by a
graph called a box-and-whisker plot.
(continued on next slide)
Example
• Find the five-number summary for the following 10 values:
• 40, 37, 32, 28, 27, 24, 22, 34, 19, 36
• Find the minimum:
• Find Q1:
• Find the median:
• Find Q3:
• Find the maximum:
Example
• 19,22,24,27,28,32,34,36,37,40
• minimum: 19 because that is the smallest number
• Q1: 24 because it is the median of the lower half (19, 22, 24, 27, 28)
• median: 30 because the two middle numbers are 28 and 32, and
(28+32)/2=30
• Q3: 36 because it is the median of the upper half (32, 34, 36, 37,
40)
• maximum: 40 because it is the largest number
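The halves method used here can be written as a small function. This is a sketch under one stated assumption: when the number of values is odd, the median itself is left out of both halves. The function name `five_number_summary` is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the halves method."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median position
    upper = ordered[-half:]   # values above the median position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

print(five_number_summary([40, 37, 32, 28, 27, 24, 22, 34, 19, 36]))

# The presidents' ages from the earlier example (17 values, so the median is excluded)
ages = [42, 43, 46, 51, 51, 51, 52, 54, 55, 55, 56, 56, 60, 61, 61, 64, 69]
print(five_number_summary(ages))
```

Statistical software often uses other quartile conventions, so results from a calculator or spreadsheet may differ slightly from this method.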
The Five Number Summary
• Example: Find the mode for each data set.
a) 5, 5, 68, 69, 70
b) 3, 3, 3, 2, 1, 4, 4, 9, 9, 9
c) 98, 99, 100, 101, 102
d) 2, 3, 4, 2, 3, 4, 5
• Solution:
a) The mode is 5.
b) There are two modes: 3 and 9.
In c) and d) there is no mode.
Comparing Measures of Central
Tendency
• Example: Assume that you are negotiating the
contract for your union. You have gathered
annual wage data and found that three workers
earn $30,000, five workers earn $32,000, three
workers earn $44,000, and one worker earns
$50,000. In your negotiations, which measure of
central tendency should you emphasize?
Comparing Measures of Central
Tendency
• Solution:
Mode: $32,000
Median: $32,000
Mean: (3 × $30,000 + 5 × $32,000 + 3 × $44,000 + 1 × $50,000) / 12 = $432,000 / 12
The mean is $36,000.
To make the salaries appear as low as possible,
you would want to use the mode and median.
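The three measures for the wage data can be compared side by side. The sketch below expands the (wage, number of workers) pairs into one list of 12 wages; the names `wage_counts` and `wages` are illustrative.

```python
from statistics import mean, median, multimode

# (annual wage, number of workers) pairs from the example
wage_counts = [(30_000, 3), (32_000, 5), (44_000, 3), (50_000, 1)]
wages = [wage for wage, count in wage_counts for _ in range(count)]

print(multimode(wages), median(wages), mean(wages))
```

The mean sits above the mode and median because the few higher wages pull it upward.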
The Range of a Data Set
Standard Deviation
• Example: A company has hired six interns. After 4
months, their work records show the following
number of work days missed for each worker:
0, 2, 1, 4, 2, 3
Find the standard deviation of this data set.
• Solution:
Mean: x̄ = (0 + 2 + 1 + 4 + 2 + 3) / 6 = 12 / 6 = 2
(continued on next slide)
Standard Deviation
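We calculate the squares of the deviations of the
data values from the mean: 4, 0, 1, 4, 0, 1, which sum to 10.
Standard Deviation: s = sqrt(10 / (6 - 1)) = sqrt(2) ≈ 1.41 (dividing by n - 1, the sample formula)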
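The steps for the intern data can be traced in code. This sketch assumes the sample formula, which divides by n - 1; that choice is an assumption, made because it reproduces the value of about 8.6 quoted on the Representing Data Solutions slide later in this workshop. The variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

days_missed = [0, 2, 1, 4, 2, 3]

n = len(days_missed)
mean_days = sum(days_missed) / n
squared_deviations = [(x - mean_days) ** 2 for x in days_missed]
# Sample standard deviation: divide by n - 1 (an assumption, see the note above)
s = sqrt(sum(squared_deviations) / (n - 1))

print(mean_days, squared_deviations, round(s, 2))
print(round(stdev(days_missed), 2))  # library cross-check, also uses n - 1
```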
Standard Deviation
• Example: The following are the closing prices for a
stock for the past 20 trading sessions:
37, 39, 39, 40, 40, 38, 38, 39, 40, 41,
41, 39, 41, 42, 42, 44, 39, 40, 40, 41
What is the standard deviation for this data set?
• Solution:
Mean: x̄ = 800 / 20 = 40
(sum of the closing prices is 800)
(continued on next slide)
Standard Deviation
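We create a table with values that will facilitate
computing the standard deviation. The squared deviations from the mean of 40 sum to 50.
Standard Deviation: s = sqrt(50 / (20 - 1)) = sqrt(50 / 19) ≈ 1.62 (dividing by n - 1, the sample formula)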
Standard Deviation
Comparing Standard Deviations
All three distributions have a mean and median
of 5; however, as the spread of the distribution
increases, so does the standard deviation.
The Normal Distribution
The normal distribution describes many real-life
data sets. The histogram shown gives an idea of
the shape of a normal distribution.
The Normal Distribution
We represent the mean by μ and the standard
deviation by σ.
The Normal Distribution
• Example: Suppose that the distribution of scores of
1,000 students who take a standardized
intelligence test is a normal distribution. If the
distribution’s mean is 450 and its standard
deviation is 25,
a) how many scores do we expect to fall between
425 and 475?
b) how many scores do we expect to fall above
500?
(continued on next slide)
The Normal Distribution
• Solution (a): 425 and 475 are
each 1 standard deviation
from the mean.
Approximately 68% of the
scores lie within 1 standard
deviation of the mean.
We expect about
0.68 × 1,000 = 680 scores
to be in the range 425 to 475.
(continued on next slide)
The Normal Distribution
• Solution (b): A score of 500 is 2 standard deviations
above the mean. We know about 5% of the
scores lie more than 2 standard
deviations above or below the mean, so
we expect 0.05 ÷ 2 = 0.025 of
the scores to be above 500. Multiplying by 1,000,
we can expect 0.025 × 1,000 = 25 scores to be above 500.
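The 68% and 5% figures are rounded rules of thumb. As a hedged cross-check, the sketch below computes the same two areas from the normal curve with `statistics.NormalDist`; the exact areas give roughly 683 and 23 scores instead of 680 and 25. The variable names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=450, sigma=25)
students = 1_000

within_one_sd = scores.cdf(475) - scores.cdf(425)  # the rule of thumb uses 0.68
above_two_sd = 1 - scores.cdf(500)                 # the rule of thumb uses 0.025

print(round(within_one_sd * students), round(above_two_sd * students))
```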
Quartile Problem
• The scores of students on an exam are normally distributed
with a mean of 516 and a standard deviation of 36.
• A) What is the first quartile score for this exam?
• B) What is the third quartile score for this exam?
Quartile Problem
• The first quartile has 25% of the data below it and the third quartile has
25% of the data above it, so the area between the mean and each quartile is
0.25. Looking up this area in the table gives a Z-Score of +- 0.67.
• A) x = -0.67*36+516, so the first quartile is at 491.88
• B) x = 0.67*36+516, so the third quartile is at 540.12
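The quartiles can also be found without a table. This is a sketch using `NormalDist.inv_cdf`, which returns the score with a given area to its left; the results differ from the table-based answers in the first decimal place because the table z-score is rounded to 0.67.

```python
from statistics import NormalDist

exam = NormalDist(mu=516, sigma=36)

q1 = exam.inv_cdf(0.25)  # score with 25% of the area below it
q3 = exam.inv_cdf(0.75)  # score with 75% of the area below it

print(round(q1, 2), round(q3, 2))
```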
z-Scores
The standard normal distribution has a mean of 0
and a standard deviation of 1.
There are tables (see next slide) that give the area
under this curve between the mean and a number
called a z-score. A z-score represents the number of
standard deviations a data value is from the mean.
For example, for a normal distribution with mean
450 and standard deviation 25, the value 500 is 2
standard deviations above the mean; that is, the
value 500 corresponds to a z-score of 2.
z-Scores
Below is a portion of a table that gives the area
under the standard normal curve between the mean
and a z-score.
z-Scores
• Example: Use a table to find the percentage of the
data (area under the curve) that lie in the
following regions for a standard normal
distribution:
a) between z = 0 and z = 1.3
b) between z = 1.5 and z = 2.1
c) between z = 0 and z = –1.83
(continued on next slide)
z-Scores
• Solution (a): The area under
the curve between z = 0
and z = 1.3 is shown. Using
a table we find this area for
the z-score 1.30. We find
that A is 0.403
when z = 1.30. We expect 40.3% of the data to
fall between 0 and 1.3 standard deviations above
the mean.
(continued on next slide)
z-Scores
• Solution (b): The area under the
curve between z = 1.5 and
z = 2.1 is shown. We first find
the area from z = 0 to z =
2.1 and then subtract the
area from z = 0
to z = 1.5. Using a table we get A = 0.482 when
z = 2.1, and A = 0.433 when z = 1.5. The area is
0.482 – 0.433 = 0.049 or 4.9%
(continued on next slide)
z-Scores
• Solution (c): Due to the
symmetry of the normal
distribution, the area
between z = 0 and z = –1.83
is the same as the area
between z = 0 and z = 1.83.
Using a table, we see that A = 0.466 when
z = 1.83. Therefore, 46.6% of the data values lie
between 0 and –1.83.
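The three table lookups in this example can be reproduced numerically. The sketch below uses the standard normal curve from `statistics.NormalDist`; the helper name `area_between` is an assumption for illustration.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

def area_between(z_low, z_high):
    """Area under the standard normal curve between two z-scores."""
    return standard.cdf(z_high) - standard.cdf(z_low)

print(round(area_between(0, 1.3), 3))     # part a)
print(round(area_between(1.5, 2.1), 3))   # part b)
print(round(area_between(-1.83, 0), 3))   # part c)
```

Part c) uses the negative z-score directly; by symmetry it gives the same area as the region from 0 to 1.83.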
Converting Raw Scores to z-Scores
• Example: Suppose the
mean of a normal
distribution is 20 and its
standard deviation is 3.
a) Find the z-score
corresponding to
the raw score 25.
b) Find the z-score
corresponding to
the raw score 16.
(continued on next slide)
Converting Raw Scores to z-Scores
• Solution (a):
We have μ = 20, σ = 3, and x = 25.
We compute z = (x - μ) / σ = (25 - 20) / 3 = 5/3 ≈ 1.67
(continued on next slide)
Converting Raw Scores to z-Scores
• Solution (b):
We have μ = 20, σ = 3, and x = 16.
We compute z = (x - μ) / σ = (16 - 20) / 3 = -4/3 ≈ -1.33
Applications
• Example: Suppose you take a standardized test.
Assume that the distribution of scores is normal
and you received a score of 72 on the test, which
had a mean of 65 and a standard deviation of 4.
What percentage of those who took this test had
a score below yours?
• Solution: We first find the z-score that corresponds
to 72: z = (72 - 65) / 4 = 1.75
(continued on next slide)
Applications
Using a table, we have
that A = 0.460 when z
= 1.75. The normal
curve is symmetric, so
another 50% of the
scores fall below the
mean. So about
50% + 46% = 96%
of the scores are below 72.
(continued on next slide)
Applications
• Example: Consider the following information:
1911: Ty Cobb hit .420. Mean average was .266
with standard deviation .0371.
1941: Ted Williams hit .406. Mean average was
.267 with standard deviation .0326.
1980: George Brett hit .390. Mean average was
.261 with standard deviation .0317.
Assuming normal distributions, use z-scores to
determine which of the three batters was ranked
the highest in relationship to his contemporaries.
(continued on next slide)
Applications
• Solution:
Ty Cobb’s average of .420 corresponded to a
z-score of (.420 - .266) / .0371 ≈ 4.15.
Ted Williams’s average of .406 corresponded to
a z-score of (.406 - .267) / .0326 ≈ 4.26.
George Brett’s average of .390 corresponded to
a z-score of (.390 - .261) / .0317 ≈ 4.07.
Compared with his contemporaries, Ted Williams
ranks as the best hitter.
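The comparison comes down to applying z = (x - mean) / standard deviation to each season. A minimal sketch, with the helper name `z_score` and the dictionary layout chosen for illustration:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# (batting average, league mean, league standard deviation) for each season
seasons = {
    "Ty Cobb (1911)": (0.420, 0.266, 0.0371),
    "Ted Williams (1941)": (0.406, 0.267, 0.0326),
    "George Brett (1980)": (0.390, 0.261, 0.0317),
}

for name, (average, mean, sd) in seasons.items():
    print(name, round(z_score(average, mean, sd), 2))

best = max(seasons, key=lambda name: z_score(*seasons[name]))
print("Highest z-score:", best)
```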
Applications
• Example:
A manufacturer plans to offer a warranty
on an electronic device. Quality control
engineers found that the device has a mean time
to failure of 3,000 hours with a standard
deviation of 500 hours. Assume that the typical
purchaser will use the device for 4 hours per day.
If the manufacturer does not want more than 5%
to be returned as defective within the warranty
period, how long should the warranty period be
to guarantee this?
(continued on next slide)
Applications
• Solution: We need to find
a z-score such that at
least 95% of the area
is beyond this point.
This score is to the
left of the mean and is
negative. By
symmetry we find
the z-score such that 95% of the area is below
this score.
(continued on next slide)
Applications
50% of the entire area lies below the mean, so
our problem reduces to finding a z-score greater
than 0 such that 45% of the area lies between
the mean and that z-score. If A = 0.450, the
corresponding z-score is 1.64. 95% of the area
underneath the standard normal curve falls
below z = 1.64. By symmetry, 95% of the values
lie above –1.64.
Since z = (x - μ) / σ, we obtain -1.64 = (x - 3,000) / 500.
(continued on next slide)
Applications
Solving the equation for x, we get x = 3,000 - 1.64 × 500 = 2,180 hours.
Owners use the device about 4 hours per day, so
we divide 2,180 by 4 to get 545 days. This is
approximately 18 months if we use 31 days per
month. The warranty should be for roughly 18
months.
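The arithmetic in this solution can be laid out step by step. The sketch below takes the table value z = -1.64 as given and follows the slide in using 31 days per month; the variable names are illustrative.

```python
mean_hours, sd_hours = 3_000, 500
z = -1.64          # table value: 95% of the area lies above this z-score
hours_per_day = 4

# Solve z = (x - mean) / sd for x
failure_hours = mean_hours + z * sd_hours
warranty_days = failure_hours / hours_per_day
warranty_months = warranty_days / 31   # 31 days per month, as on the slide

print(round(failure_hours), round(warranty_days), round(warranty_months, 1))
```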
Right and Left Z-Score
• Find the z-score such that:
• A) The area under the standard normal curve to its left is 0.518
• B) The area under the standard normal curve to its left is 0.8167
• C) The area under the standard normal curve to its right is 0.2879
• D) The area under the standard normal curve to its right is 0.3573
Right and Left Z-Score
• A) 0.518 - 0.5 = 0.018; look this up in the table to get z ≈ 0.05 (the nearest entries are 0.0160 at z = 0.04 and 0.0199 at z = 0.05)
• B) 0.8167 - 0.5 = 0.3167; look this up in the table to get z ≈ 0.90
• C) 0.5 - 0.2879 = 0.2121; look this up in the table to get z ≈ 0.56
• D) 0.5 - 0.3573 = 0.1427; look this up in the table to get z ≈ 0.37
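A table lookup in reverse is the inverse of the cumulative area function. The sketch below uses `NormalDist.inv_cdf` for this; the helper names are assumptions, and the computed values can differ from a table lookup in the last digit because a table only lists z-scores to two decimal places.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

def z_for_left_area(area):
    """z-score with the given area to its left."""
    return standard.inv_cdf(area)

def z_for_right_area(area):
    """z-score with the given area to its right."""
    return standard.inv_cdf(1 - area)

print("A)", round(z_for_left_area(0.518), 2))
print("B)", round(z_for_left_area(0.8167), 2))
print("C)", round(z_for_right_area(0.2879), 2))
print("D)", round(z_for_right_area(0.3573), 2))
```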
Practice Problem
• The lengths of skateboards in a skate shop are normally
distributed with a mean of 30.9 in and a standard deviation
of 1 in. The figure below shows the distribution of these
lengths. Calculate the shaded area
under the curve.
• Express your answer in decimal form, accurate to at least two
decimal places.
Practice Problem
• There is an area of .475 on the right side of the curve but
we must use the Z-Score formula to find the area on the left
side.
• z = (30.23 - 30.9) / 1 = -.67
• The area between the mean and a Z-Score of -.67 is 0.2486
• So the total area is 0.2486+.475= .7236
Confidence Intervals
A level C confidence interval is a range that is C% likely to
contain the population mean of a set of data based on a
sample mean (a 95% confidence interval based on sample
data would be 95% likely to contain the population mean
that the sample came from). The formula for the lower and
upper bounds of a confidence interval is:
x̄ ± z* · σ / sqrt(n)
Where the term on the left (x̄) is the sample average, and the
term on the right (z* · σ / sqrt(n), with critical value z*, standard
deviation σ, and sample size n) is referred to as the margin of error.
Confidence Intervals
•Example: Suppose that the distribution of scores of 100
students who take a standardized intelligence test is a
normal distribution. If the distribution’s sample mean is 90
and its standard deviation is 10, what is a 95% confidence
interval for the population mean?
Here, the z-score is related to an area equal to half the confidence, i.e. z is related
to .95/2 = .475.
Locating this area in a z-score table will yield that z = 1.96.
The left end of the interval is: 90 - 1.96 * 10 / sqrt(100) = 88.04
The right end of the interval is: 90 + 1.96 * 10 / sqrt(100) = 91.96
So the 95% confidence interval is (88.04, 91.96), OR we are 95% confident the
population mean is between 88.04 and 91.96.
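The interval in this example can be computed directly from the sample mean, standard deviation, sample size, and critical value. A minimal sketch, with the function name `confidence_interval` chosen for illustration:

```python
from math import sqrt

def confidence_interval(sample_mean, sd, n, z_star):
    """Return (lower, upper) bounds: sample mean plus or minus the margin of error."""
    margin = z_star * sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = confidence_interval(sample_mean=90, sd=10, n=100, z_star=1.96)
print(round(low, 2), round(high, 2))
```

A larger sample size shrinks the margin of error, because the standard deviation is divided by the square root of n.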
Critical Values Z*
• To find the critical Z* value, first write the confidence
level in decimal form.
• After that, divide the confidence level by 2 to get the
area between the mean and Z*.
• Use your chart to find the area closest to your value; the
corresponding z-score is your critical Z*
Critical Z Problem
• Find the critical z* for a level 51 % confidence interval.
Critical Z* Solution
• 51/100 = .51
• .51/2 = .255
• Find the closest value to .255 (which is .2549).
• Your Critical Z* value is 0.69
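The same critical value can be found without a chart. In this sketch, half the confidence level is added to 0.5 to get the area to the left of Z*, and `NormalDist.inv_cdf` converts that area back to a z-score; the function name `critical_z` is an assumption.

```python
from statistics import NormalDist

def critical_z(confidence_level):
    """Z* such that `confidence_level` of the area lies between -Z* and +Z*."""
    return NormalDist().inv_cdf(0.5 + confidence_level / 2)

print(round(critical_z(0.51), 2))  # the 51% level from this problem
print(round(critical_z(0.95), 2))  # the 95% level used for confidence intervals
```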
Representing Data Exercises
Create a frequency and relative frequency table for the
following set of numbers.
78 70 92 81 75
90 74 99 76 79
80 76 78 86 96
Calculate the mean, median, mode, and standard deviation of
the data.
Representing Data Solutions
Create a frequency and relative frequency table for the
following set of numbers.
Calculate the mean, median, mode, and standard deviation
of the data.
Range     Frequency                     Relative Frequency
70-74     2 (70, 74)                    2/15 = .1333
75-79     6 (75, 76, 76, 78, 78, 79)    6/15 = .4
80-84     2 (80, 81)                    2/15 = .1333
85-89     1 (86)                        1/15 = .0667
90-94     2 (90, 92)                    2/15 = .1333
95-100    2 (96, 99)                    2/15 = .1333
Mean = 82
Median = 79
Mode = 76, 78
Std. Dev. = 8.62
Normal Distribution Exercises
Suppose 200 students took a test, and their scores were
approximately normally distributed. The mean of the test
scores was 82 and the standard deviation was 9.
How many students got at least a 73? How many students
got more than 95?
What would a 95% confidence interval for this population be?
Normal Distribution Solutions
200 students took a test. The mean was 82 and the std. dev.
was 9.
(a) How many students got at least a 73? (b) More than 95?
(c) What would a 95% confidence interval for this population
be?
(a) .84 or 84% (168 students)
(b) .075 or 7.5% (15 students)
(c) (80.75, 83.25)
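These three answers can be checked numerically. The sketch below assumes the critical value 1.96 for the 95% interval and treats the 200 students as the sample; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

scores = NormalDist(mu=82, sigma=9)
students = 200

at_least_73 = 1 - scores.cdf(73)    # area to the right of z = -1
more_than_95 = 1 - scores.cdf(95)   # area to the right of z = 13/9
margin = 1.96 * 9 / sqrt(students)  # margin of error for the 95% interval

print(round(at_least_73 * students), round(more_than_95 * students))
print(round(82 - margin, 2), round(82 + margin, 2))
```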