Download Chapter 4 - Algebra I PAP

Chapter 4
Numerical Methods for
Describing Data
Section 4.1
Describing the Center of a
Data Set
Population characteristic: a fixed value about a population that is typically unknown
Suppose we want to know the MEAN length of all the fish in Lake Sam Rayburn . . .
Is this a value that is known?
Can we find it out?
At any given point in time, how many values are there for the mean length of fish in the lake?
Statistic—a value calculated from a sample
Suppose we want to know the MEAN length
of all the fish in Lake Sam Rayburn.
What can we do to estimate this unknown
population characteristic?
Measures of Central Tendency
mode--the observation that occurs the most often
• Can be more than one mode
• If all values occur only once – there is no mode
• Not used as often as mean & median
Measures of Central Tendency
The mean of a set of numerical
observations is just the familiar arithmetic
average: the sum of the observations
divided by the number of observations.
Important Notations
• x = the variable for which we have sample data
• n = the number of observations in the sample (the
sample size)
• x1 = the first observation in the sample
• x2 = the second observation in the sample…
• xn = the nth (last) observation in the sample
Battery Life Example
We might have a sample consisting of n = 4
observations on x = battery lifetime (in hours):
x1 = 5.9, x2 = 7.3, x3 = 6.6, x4 = 5.7
• x1 is just the first observation in the data set
and not necessarily the smallest observation
• xn is the last observation but not necessarily
the largest
More Notation
The sum of x1, x2, …, xn can be denoted by x1 + x2 + … + xn, but this could be a daunting task
for a large sample. The Greek letter Σ (pronounced sigma) is traditionally used in
mathematics to denote summation.
• In particular, Σx denotes the sum of all the x values in the data set under consideration
Sample Mean
The sample mean of a numerical sample x1, x2, x3, …, xn, denoted x̄, is
$$\bar{x} = \frac{\text{sum of all observations in the sample}}{\text{number of observations in the sample}} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n}$$
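As a quick check of the formula, here is a minimal Python sketch that applies it to the four battery lifetimes from the Battery Life Example. The function name `sample_mean` is just an illustrative choice, not something from the slides.

```python
# Battery lifetimes (hours) from the Battery Life Example: x1, x2, x3, x4.
battery_life = [5.9, 7.3, 6.6, 5.7]


def sample_mean(observations):
    """Return (sum of the observations) / (number of observations)."""
    n = len(observations)
    if n == 0:
        raise ValueError("the sample mean needs at least one observation")
    return sum(observations) / n


x_bar = sample_mean(battery_life)
print(f"n = {len(battery_life)}, sample mean = {x_bar:.3f} hours")
```

The sum of the four observations is 25.5, so the sample mean works out to 25.5 / 4 = 6.375 hours.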
Fancytown Example
During a two-week period, 10 houses were sold in
Fancytown. Calculate the sample mean.
House Price in Fancytown (x)
225,000
311,000
299,000
310,000
285,000
315,000
291,000
287,000
300,000
287,000
Σx = 2,910,000

$$\bar{x} = \frac{\sum x}{n} = \frac{2{,}910{,}000}{10} = 291{,}000$$
The average (or mean) price
for this sample of 10 houses
in Fancytown is $291,000.
Lowtown Example
During a two-week period, 10 houses were sold in
Lowtown. Calculate the sample mean.
House Price in Lowtown (x)
97,000
93,000
110,000
121,000
113,000
95,000
100,000
122,000
99,000
2,000,000   (outlier)
Σx = 2,950,000

$$\bar{x} = \frac{\sum x}{n} = \frac{2{,}950{,}000}{10} = 295{,}000$$
The average (or mean) price for this sample of 10 houses in Lowtown is $295,000.
Reflections on the Sample Mean
Calculations
Looking at the dotplots of the samples for
Fancytown and Lowtown, we can see that the mean of
$291,000 appears to accurately represent the
“center” of the data for Fancytown, but the mean of
$295,000 is not representative of the Lowtown data.
Clearly, the mean can be greatly affected by the
presence of even a single outlier.
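The effect of the single outlier can be seen numerically. The sketch below recomputes both sample means from the two house-price tables using Python's standard `statistics` module. The third calculation (the Lowtown mean with the $2,000,000 sale left out) is a hypothetical "what if" added for illustration and is not part of the original example.

```python
from statistics import mean

# House prices from the Fancytown and Lowtown examples.
fancytown = [225_000, 311_000, 299_000, 310_000, 285_000,
             315_000, 291_000, 287_000, 300_000, 287_000]
lowtown = [97_000, 93_000, 110_000, 121_000, 113_000,
           95_000, 100_000, 122_000, 99_000, 2_000_000]

fancytown_mean = mean(fancytown)
lowtown_mean = mean(lowtown)

# Hypothetical comparison: the Lowtown mean if the outlier were left out.
lowtown_without_outlier = [price for price in lowtown if price != 2_000_000]
lowtown_mean_without_outlier = mean(lowtown_without_outlier)

print(f"Fancytown mean:            {fancytown_mean:,.0f}")
print(f"Lowtown mean:              {lowtown_mean:,.0f}")
print(f"Lowtown mean, no outlier:  {lowtown_mean_without_outlier:,.0f}")
```

With the outlier removed, the Lowtown mean falls from 295,000 to roughly 105,600, which is much closer to where most of the Lowtown prices sit.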
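Dotplots for Fancytown and Lowtown
[Figure: dotplots of the Fancytown and Lowtown house prices on a common price axis with ticks up to 2,000,000. The value 295,000 is marked, and the Lowtown observation at 2,000,000 is labeled as an outlier.]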
Describing the Center of a Data Set with the
arithmetic mean
The population mean, denoted by
µ, is the average of all x values in
the entire population.
Important Note
•The value of x̄ varies from sample to sample.
•There is only one value for µ.
Drawback with the Mean
One potential drawback to the mean as a
measure of center for a data set is that its
value can be greatly affected by the
presence of even a single outlier (an
unusually large or small observation) in
the data set.
Describing the Center of a Data Set with the median
The sample median is obtained by first
ordering the n observations from smallest to
largest (with any repeated values included, so
that every sample observation appears in the
ordered list). Then
$$\text{sample median} = \begin{cases} \text{the single middle value} & \text{if } n \text{ is odd} \\ \text{the mean of the middle two values} & \text{if } n \text{ is even} \end{cases}$$
Population Median
The population median is the middle
value in the ordered list consisting of all
population observations.
The population median plays the same role
for the population as the sample median
plays for the sample.
The Median
The stability of the median in the presence of outliers is what
justifies its use as a measure of center in some
situations.
Income distributions are commonly summarized
by reporting the median rather than the mean,
because otherwise a few very high salaries
could result in a mean that is not representative
of a typical salary.
Median Calculation
Consider the Fancytown data. Calculate the median house value for
Fancytown.
House Price in Fancytown (x)
225,000
311,000
299,000
310,000
285,000
315,000
291,000
287,000
300,000
287,000
Σx = 2,910,000
First, we put the data in numerical increasing
order to get:
225,000 285,000 287,000 287,000 291,000
299,000 300,000 310,000 311,000 315,000
Median Calculation
Since there is an even number of data values, the median is the
mean of the two values in the middle.
$$\text{median} = \frac{291{,}000 + 299{,}000}{2} = \$295{,}000$$
Another Median Calculation
Consider the Lowtown data. Calculate the median house value for
Lowtown.
House Price in Lowtown (x)
97,000
93,000
110,000
121,000
113,000
95,000
100,000
122,000
99,000
2,000,000
Σx = 2,950,000
We put the data in numerical increasing
order to get:
93,000 95,000 97,000 99,000 100,000
110,000 113,000 121,000 122,000 2,000,000
$$\text{median} = \frac{100{,}000 + 110{,}000}{2} = \$105{,}000$$
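The two median calculations follow the same recipe: sort, then take the middle value (or the mean of the middle two). A minimal Python sketch of that recipe follows. `sample_median` is an illustrative name, and the standard library's `statistics.median` is used only as a cross-check.

```python
from statistics import median


def sample_median(observations):
    """Order the observations, then return the middle value (n odd)
    or the mean of the two middle values (n even)."""
    ordered = sorted(observations)
    n = len(ordered)
    if n == 0:
        raise ValueError("the sample median needs at least one observation")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


fancytown = [225_000, 311_000, 299_000, 310_000, 285_000,
             315_000, 291_000, 287_000, 300_000, 287_000]
lowtown = [97_000, 93_000, 110_000, 121_000, 113_000,
           95_000, 100_000, 122_000, 99_000, 2_000_000]

fancytown_median = sample_median(fancytown)
lowtown_median = sample_median(lowtown)
print(fancytown_median, lowtown_median)
```

Unlike the Lowtown mean of $295,000, the Lowtown median of $105,000 is barely influenced by the $2,000,000 sale.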
Imagine a ruler with pennies placed at 3”, 4”, 5”, 6”, 8” and 10”.
To balance the ruler
on your finger, you
would need to place
your finger at the
mean of 6.
The mean is the balance
point of a distribution
Comparing the Sample Mean
& Sample Median
Notice from the preceding pictures that the median splits the
area in the distribution in half and the mean is the point of
balance.
Typically,
1. when a distribution is skewed positively, the mean
is larger than the median,
2. when a distribution is skewed negatively, the mean
is smaller than the median, and
3. when a distribution is symmetric, the mean and the
median are equal.
Mean vs. Median
•In a skewed distribution, the mean is pulled in the
direction of the skewness.
•In a symmetrical distribution, you should report the
mean!
•In a skewed distribution, the median should be reported
as the measure of center!
The Trimmed Mean
A trimmed mean is computed by first
ordering the data values from smallest to
largest, deleting a selected number of
values from each end of the ordered list,
and finally computing the mean of the
remaining values.
The trimming percentage is the
percentage of values deleted from each
end of the ordered list.
The Trimmed Mean
Purpose is to remove outliers from a data set
To calculate a trimmed mean:
• Multiply the percent to trim by n
• Truncate that many observations from BOTH ends of
the distribution (when listed in order)
• Calculate the mean with the shortened data set
Find the mean of the following set of data:
12  14  19  20  22  24  25  26  26  50
Mean = 23.8
Find the 10% trimmed mean.
10%(10) = 1
So remove one observation from each side!
$$\bar{x}_T = \frac{14 + 19 + 20 + 22 + 24 + 25 + 26 + 26}{8} = 22$$
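The three trimming steps can be sketched in Python as follows. The function name `trimmed_mean` is illustrative. Truncating to a whole number when the trimming percentage times n is not an integer is an assumption made here, since the slides only show a case where the product is exactly 1.

```python
def trimmed_mean(observations, trim_percent):
    """Drop trim_percent of the ordered values from EACH end, then average the rest."""
    ordered = sorted(observations)
    n = len(ordered)
    # Number of values to delete from each end. Truncation for non-whole
    # results is an assumption of this sketch, not a rule from the slides.
    k = int(n * trim_percent / 100)
    if 2 * k >= n:
        raise ValueError("trimming percentage removes every observation")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


data = [12, 14, 19, 20, 22, 24, 25, 26, 26, 50]
ordinary_mean = sum(data) / len(data)
ten_percent_trimmed = trimmed_mean(data, 10)
print(ordinary_mean, ten_percent_trimmed)
```

For this data set the ordinary mean is 23.8, while the 10% trimmed mean drops the 12 and the 50 and gives 22.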
FancyTown Trimmed Mean
Calculate the 10% trimmed mean for FancyTown.
House Price in Fancytown (x)
231,000
285,000
287,000
294,000
297,000
299,000
312,000
313,000
315,000
317,000
Σx = 2,950,000
x̄ = 295,000
median = (297,000 + 299,000)/2 = 298,000
Sum of the eight middle values is 2,402,000.
Divide this value by 8 to obtain the 10% trimmed mean.
10% trimmed mean = 2,402,000/8 = 300,250
Summary of Trimmed Means
A trimmed mean with a small to moderate trimming
percentage—between 5% and 25%--is less affected by
outliers than the mean, but it is not as insensitive as the
median.
Is the median affected by extreme
values?
NO
Is the mean affected by extreme values?
YES
Sample Proportion for Categorical Data
The sample proportion of successes, denoted by p̂, is
$$\hat{p} = \frac{\text{number of S's in the sample}}{n}$$
where S is the label used for the response designated
as success. The population proportion of successes is
denoted by p.
Tampering with Automobile Antipollution
Equipment Example
The use of antipollution equipment on automobiles has substantially
improved air quality in certain areas. Unfortunately, many car owners
have tampered with smog control devices to improve performance.
Suppose that a sample of 15 cars is selected and that each car is
classified as S or F, according to whether or not tampering has taken
place. The resulting data are:
S F S S S F F S S F S S S F F
Counting each S as a success, the sample proportion (of
successes) is:
$$\hat{p} = \frac{9}{15} = 0.60$$
Example Tampering with Automobile Antipollution
Equipment
That is, 60% of the sample responses
are S’s. In 60% of the cars sampled,
there has been tampering with the air
pollution control devices.
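The 60% figure can be reproduced by counting the S responses directly. This is a minimal Python sketch using the 15 responses listed in the example. The variable names are illustrative.

```python
# The 15 responses from the tampering example, in the order listed.
responses = list("SFSSSFFSSFSSSFF")

n = len(responses)                   # sample size
successes = responses.count("S")     # number of cars classified S
p_hat = successes / n                # sample proportion of successes

print(f"{successes} successes out of {n}: sample proportion = {p_hat:.2f}")
```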
Section 4.2
Describing Variability in a Data
Set
Why is the study of variability important?
Does this can of soda contain
exactly 12 ounces?
There is variability in virtually everything
Allows us to distinguish between usual & unusual
values
Reporting only a measure of center doesn’t provide
a complete picture of the distribution.
[Figure: three dotplots, each drawn on a scale from 20 to 70.]
Notice that these three data sets all have the
same mean and median (at 45), but they have
very different amounts of variability.
Describing Variability
The simplest numerical measure of the variability
of a numerical data set is the range, which is
defined to be the difference between the largest
and smallest data values.
range = maximum - minimum
Calculating Range
Calculate the range for each data set
from the previous example:
[Figure: the three dotplots, each drawn on a scale from 20 to 70.]
The first two data sets have a range of 50 (70 - 20), but the third data set
has a much smaller range of 10.
Describing Variability
The n deviations from the sample mean
are the differences:
$$x_1 - \bar{x},\; x_2 - \bar{x},\; \ldots,\; x_n - \bar{x}$$
Note: The sum of all of the deviations from the
sample mean will be equal to 0, except possibly
for the effects of rounding the numbers. This
means that the average deviation from the
mean is always 0 and cannot be used as a
measure of variability.
Calculating Deviations
from the Sample Mean
Suppose we caught a sample of 6 fish from the lake with the
following lengths:
3”, 4”, 5”, 6”, 8”, 10”
Calculate the deviations from the sample mean. What must we
find first?
Now find how each observation deviates from the mean.
The sample mean is x̄ = 6 inches.
x      (x - x̄)
3      3 - 6 = -3   (this is the deviation from the mean)
4      -2
5      -1
6      0
8      2
10     4
Sum    0
Find the rest of the deviations from the mean.
What is the sum of the deviations from the mean?
Will this sum always equal zero? YES
The mean is considered the balance point of the distribution because it
“balances” the positive and negative deviations.
Notes on Deviations
A particular deviation is positive if the x value
exceeds x̄ and negative if the x value is less
than x̄.
In general, the greater the amount of variability
in the sample, the larger the magnitudes
(ignoring the signs) of the deviations.
Measures of Variability
Can we find an average deviation?
Suppose that we caught a sample of 6 fish from the lake with the following lengths:
3”, 4”, 5”, 6”, 8”, 10”
The mean length is 6 inches. Recall that we calculated
the deviations from the mean. What was the sum of these deviations?
What can we do to the deviations so that we could find an average?
Another measure of the variability in a data set
uses the deviations from the mean (x - x̄).
The estimated average of the squared deviations is called the variance.
Population variance is denoted by σ². The sample variance, denoted by s², is
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
The denominator, n - 1, is the degrees of freedom.
The customary way to prevent negative
and positive deviations from
counteracting one another is to square
them before combining.
Suppose that everyone in the class caught a sample
of 6 fish from the lake. Would each of our
samples contain the same fish?
Would our mean lengths be the same?
The samples would also have different
ranges!
When calculating sample variance, we
use degrees of freedom (n – 1) in the
denominator instead of n because this
tends to produce better estimates.
Degrees of freedom will be revisited
in Chapter 8.
Remember the sample of 6 fish that we
caught from the lake . . .
Find the variance of the length of the fish.
First square the deviations.
x      (x - x̄)   (x - x̄)²
3      -3        9
4      -2        4
5      -1        1
6      0         0
8      2         4
10     4         16
Sum    0         34
Finding the average of the deviations would always
equal 0!
What is the sum of the deviations squared? 34
Divide this by 5.
s² = 34/5 = 6.8
Sample Standard Deviation
The sample standard deviation, denoted s, is
the positive square root of the sample variance.
$$s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{S_{xx}}{n - 1}}$$
The population standard deviation is
denoted by σ.
Sample Variance
A large amount of variability in the sample is
indicated by a relatively large value of s² or s,
whereas a value of s² or s close to 0 indicates a
small amount of variability.
For most statistical purposes, s is the desired
quantity, but s² must be computed first.
The most commonly used measures of center
and variability are the mean and standard
deviation, respectively.
Measures of Variability
Calculate the standard deviation for the fish sample.
s² = 6.8 inches², so s = √6.8 ≈ 2.608 inches
The fish in our sample typically deviate from the mean of 6
inches by about 2.608 inches.
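The whole fish calculation (deviations, squared deviations, variance, standard deviation) can be traced step by step in Python. This is a minimal sketch with illustrative variable names. The standard library's `statistics.variance` and `statistics.stdev`, which also divide by n - 1, are used only to cross-check the hand calculation.

```python
from math import sqrt
from statistics import stdev, variance

lengths = [3, 4, 5, 6, 8, 10]          # fish lengths in inches
n = len(lengths)
x_bar = sum(lengths) / n               # sample mean

deviations = [x - x_bar for x in lengths]        # x - x_bar
squared_deviations = [d ** 2 for d in deviations]

s_squared = sum(squared_deviations) / (n - 1)    # divide by n - 1, not n
s = sqrt(s_squared)

print("sum of deviations:", sum(deviations))
print("sum of squared deviations:", sum(squared_deviations))
print(f"s^2 = {s_squared:.1f}, s = {s:.3f}")
```

The deviations sum to 0, the squared deviations sum to 34, and dividing by n - 1 = 5 gives s² = 6.8 and s ≈ 2.608.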
Apple Weight Example
A sample of 10 Macintosh apples was randomly selected and weighed
(in ounces). Calculate the standard deviation of the sample.
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{5.5398}{10 - 1} = \frac{5.5398}{9} \approx 0.61554$$
$$s = \sqrt{0.61554} \approx 0.78456$$
Interquartile Range
Interquartile range (iqr)--the range of the middle half
of the data.
What advantage does the interquartile range have
over the standard deviation?
The iqr is resistant to extreme values.
The iqr is based on quantities called quartiles.
The lower quartile separates the bottom 25% of
the data set from the upper 75%, and the upper
quartile separates the top 25% of the data set
from the bottom 75%.
Quartiles
Finding Quartiles
The quartiles for sample data are obtained
by dividing the n ordered observations into
a lower half and an upper half: if n is odd,
the median is excluded from both halves.
Quartiles and the Interquartile Range
Lower Quartile (Q1) = median of the lower half of the data set.
Upper Quartile (Q3) = median of the upper half of the data set.
The interquartile range (iqr), a resistant measure of
variability is given by
iqr = upper quartile – lower quartile = Q3 – Q1
Note: If n is odd, the median is excluded from both the
lower and upper halves of the data.
Quartiles and IQR Example
A sample of 15 students with part time jobs was randomly
selected and the number of hours worked last week was
recorded. Find the interquartile range for this set of data.
19, 12, 14, 10, 12, 10, 25,
9, 8, 4, 2, 10, 7, 11, 15
The data is put in increasing order to get
2, 4, 7, 8, 9, 10, 10, 10, 11, 12, 12, 14, 15, 19, 25
Quartiles and IQR Example
With 15 data values, the median is the 8th value. Specifically,
the median is 10.
Lower half: 2, 4, 7, 8, 9, 10, 10
Median: 10
Upper half: 11, 12, 12, 14, 15, 19, 25
Lower quartile (Q1) = 8
Upper quartile (Q3) = 14
iqr = 14 - 8 = 6
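The median-of-halves method used here can be sketched in Python as follows. The function name `quartiles` is illustrative. Other software may use different quartile conventions, so results from a calculator or library can differ slightly from this method on other data sets.

```python
from statistics import median


def quartiles(observations):
    """Return (Q1, Q3) as the medians of the lower and upper halves.
    When n is odd, the overall median is excluded from both halves."""
    ordered = sorted(observations)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower_half), median(upper_half)


hours = [19, 12, 14, 10, 12, 10, 25, 9, 8, 4, 2, 10, 7, 11, 15]
q1, q3 = quartiles(hours)
iqr = q3 - q1
print(f"Q1 = {q1}, Q3 = {q3}, iqr = {iqr}")
```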
The Chronicle of Higher Education (2009-2010 issue)
published the accompanying data on the percentage
of the population with a bachelor’s or higher degree in
2007 for each of the 50 states and the District of
Columbia.
21 27 35 25 22 26 27 30 38 32 25 26 24 25 26 29 19
29 31 24 33 30 22 19 22 34 35 24 24 28 30 35 29 27
26 17 26 20 27 30 25 47 20 23 23 23 26 27 34 25 34
Find the interquartile range for this set of data.
First put the data in order & find the median.
17 19 19 20 20 21 22 22 22 23
23 23 24 24 24 24 25 25 25 25
25 26 26 26 26 26 26 27 27 27
27 27 28 29 29 29 30 30 30 30
31 32 33 34 34 34 35 35 35 38
47
The median is the 26th of the 51 ordered values: 26.
Find the lower quartile (Q1) by finding the median of the lower half: Q1 = 24.
Find the upper quartile (Q3) by finding the median of the upper half: Q3 = 30.
iqr = 30 – 24 = 6
Quartiles and iqr
The resistant nature of the interquartile
range follows from the fact that up to 25% of
the smallest sample observations and up to
25% of the largest sample observations can be
made more extreme without affecting the
value of the interquartile range.
Special Note on Rounding
Protection against adverse rounding effects can almost
always be achieved by using four digits of decimal
accuracy.
Section 4.3
Summarizing a Data Set:
Boxplots
Boxplots
A boxplot is a picture that conveys information
about the most important features of a data
set: center, spread, extent of skewness, and
presence of outliers.
Boxplots
What are some advantages of boxplots?
• ease of construction
• convenient handling of outliers
• construction is not subjective (like histograms)
• used with medium or large size data sets (n > 10)
• useful for comparative displays
Boxplots
When to Use: Univariate numerical data
Use for moderate to large data sets. Don’t use with data sets of n < 10.
The five-number summary is the smallest
observation, first quartile, median, third
quartile, and largest observation.
How to construct a Skeleton Boxplot:
• Calculate the five number summary
• Draw a horizontal (or vertical) scale
• Construct a rectangular box from the lower quartile (Q1) to
the upper quartile (Q3)
• Draw a line inside the rectangle at the median value
• Draw lines from the lower quartile to the smallest
observation and from the upper quartile to the largest
observation
To describe:
comment on the center, spread, and shape of the distribution
and whether there are any unusual features
Remember the data on the percentage of the
population with a bachelor’s or higher degree in 2007
for each of the 50 states and the District of Columbia.
17 19 19 20 20 21 22 22 22 23
23 23 24 24 24 24 25 25 25 25
25 26 26 26 26 26 26 27 27 27
27 27 28 29 29 29 30 30 30 30
31 32 33 34 34 34 35 35 35 38
47
First draw a scale.
Draw a box from Q1 to Q3.
Draw a line for the median.
Draw lines for the whiskers.
[Figure: skeleton boxplot of the data on a horizontal axis labeled "Percentages" running from 10 to 50.]
Outliers
An observation is an outlier if it is more than
1.5 iqr away from the closest end of the box
(less than the lower quartile minus 1.5 iqr or
more than the upper quartile plus 1.5 iqr).
An outlier is extreme if it is more than 3 iqr
from the closest end of the box, and it is mild
otherwise.
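The outlier definition translates directly into a small classification routine. The sketch below is one possible Python rendering, applied to the 51 bachelor's-degree percentages from this chapter. The function name `classify_outliers` is illustrative, and the quartiles are computed with the median-of-halves method described earlier.

```python
from statistics import median


def classify_outliers(observations):
    """Return (mild, extreme) outliers using the 1.5 iqr and 3 iqr rules."""
    ordered = sorted(observations)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    iqr = q3 - q1
    mild, extreme = [], []
    for x in ordered:
        # Distance outside the box; zero or negative for values inside it.
        distance = max(q1 - x, x - q3)
        if distance > 3 * iqr:
            extreme.append(x)
        elif distance > 1.5 * iqr:
            mild.append(x)
    return mild, extreme


# Percentage with a bachelor's or higher degree (50 states and D.C.).
percentages = [17, 19, 19, 20, 20, 21, 22, 22, 22, 23,
               23, 23, 24, 24, 24, 24, 25, 25, 25, 25,
               25, 26, 26, 26, 26, 26, 26, 27, 27, 27,
               27, 27, 28, 29, 29, 29, 30, 30, 30, 30,
               31, 32, 33, 34, 34, 34, 35, 35, 35, 38,
               47]

mild_outliers, extreme_outliers = classify_outliers(percentages)
print("mild:", mild_outliers, "extreme:", extreme_outliers)
```

With Q1 = 24, Q3 = 30, and iqr = 6, the only value beyond 30 + 1.5(6) = 39 is 47, and because 47 is below 30 + 3(6) = 48 it is classified as mild.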
Modified Boxplots
A modified boxplot represents mild outliers by shaded
circles and extreme outliers by open circles. Whiskers
extend on each end to the most extreme observations
that are not outliers.
Remember the data on the percentage of the
population with a bachelor’s or higher degree in 2007
for each of the 50 states and the District of Columbia.
17 19 19 20 20 21 22 22 22 23
23 23 24 24 24 24 25 25 25 25
25 26 26 26 26 26 26 27 27 27
27 27 28 29 29 29 30 30 30 30
31 32 33 34 34 34 35 35 35 38
47
Next calculate the fences and the outliers.
24 - 1.5(6) = 15
30 + 1.5(6) = 39
30 + 3(6) = 48
There is one outlier at the upper end of the distribution, but none at the lower end.
Is it extreme? No: 47 is less than 48, so it is a mild outlier.
First, draw the scale, the box, and the line for the median.
Draw lines for the whiskers.
Place a solid dot for the outlier.
[Figure: modified boxplot of the data on a horizontal axis labeled "Percentages" running from 10 to 50.]
To describe:
The distribution of percent of the population with
a bachelor’s degree or higher for the U.S. states
and District of Columbia is positively skewed with an
outlier at 47%. The median percentage is at 26% with a
range of 30%.
Symmetrical boxplots
Notice that all 3 boxplots
are identical, but their
corresponding
histograms are very
different. Can you
determine the number of
modes from a boxplot?
Approximately symmetrical boxplot
Notice that the range of
the lower half and the
range of the upper half
of this distribution are
approximately equal so
we can say that it is
approximately
symmetrical.
Skewed boxplot
However, the ranges of
the two halves of this
distribution are definitely
different sizes, so it
would be skewed in the
direction of the longest
side.
The 2009-2010 salaries of NBA players published
on the web site hoopshype.com were used to
construct the comparative boxplot of salary data for
five teams.
Discuss the
similarities
and
differences.
Modified Boxplot Example
Consider the ages of the 79 students from the
classroom data set from Chapter 3. Create a modified
boxplot for the data below.
17 19 19 20 21 22 22 25
18 19 19 20 21 22 23 26
18 19 19 20 21 22 23 28
18 19 19 20 21 22 23 28
18 19 19 20 21 22 23 30
18 19 19 20 21 22 23 37
19 19 20 21 21 22 23 38
19 19 20 21 21 22 24 44
19 19 20 21 21 22 24 47
19 19 20 21 21 22 24
The ages are listed in increasing order down each column.
Lower quartile = 19, median = 21, upper quartile = 22
iqr = 22 – 19 = 3
Lower quartile – 1.5 iqr = 14.5
Upper quartile + 1.5 iqr = 26.5
Lower quartile – 3 iqr = 10
Upper quartile + 3 iqr = 31
Mild outliers: 28, 28, 30
Extreme outliers: 37, 38, 44, 47
Modified Boxplot Example
Here is the modified boxplot for the student age data.
[Figure: horizontal modified boxplot on a scale from 15 to 50. Labels mark the smallest data value that isn’t an outlier, the largest data value that isn’t an outlier, the mild outliers, and the extreme outliers.]
Modified Boxplot Example
Here is the same boxplot
reproduced with a vertical
orientation.
[Figure: vertical modified boxplot on a scale from 15 to 50.]
Comparative Boxplot Example
By putting boxplots of two separate groups or subgroups side by side, we can
compare their distributional behaviors. Describe the similarities
and differences between the two groups.
The distributional patterns of female and male student weights have
similar shapes, although the females are roughly 20 pounds lighter
(as a group).
[Figure: comparative boxplots of Student Weight by Gender (Males, Females) on a scale from 100 to 240.]
Comparative Boxplot Example
Boxplots of Age by Gender
(means are indicated by solid red circles)
[Figure: vertical boxplots of Age, on a scale from 20 to 50, for Gender = Male and Female.]
Section 4.4
Interpreting Center and
Variability: Chebyshev’s Rule,
the Empirical Rule, and z
Scores
Interpreting Center & Variability
Chebyshev’s Rule: The percentage of
observations that are within k standard deviations
of the mean is at least
$$100\left(1 - \frac{1}{k^2}\right)\%$$
where k > 1.
This rule can be used with any distribution, no matter its shape!
If k = 2, then
$$100\left(1 - \frac{1}{2^2}\right)\% = 75\%$$
so at least 75% of the observations are within 2 standard
deviations of the mean.
Interpreting Variability
Chebyshev’s Rule
For specific values of k Chebyshev’s Rule reads
• At least 75% of the observations are within 2 standard deviations of the
mean.
• At least 89% of the observations are within 3 standard deviations of the
mean.
• At least 90% of the observations are within 3.16 standard deviations of the
mean.
• At least 94% of the observations are within 4 standard deviations of the
mean.
• At least 96% of the observations are within 5 standard deviations of the
mean.
• At least 99% of the observations are within 10 standard deviations of the
mean.
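Each percentage in this list comes from evaluating 100(1 - 1/k²) at the given k. A minimal Python sketch of that calculation follows. The function name `chebyshev_lower_bound` is illustrative.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations
    of the mean, according to Chebyshev's Rule (stated for k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule is stated for k > 1")
    return 100 * (1 - 1 / k ** 2)


k_values = (2, 3, 3.16, 4, 5, 10)
bounds = {k: chebyshev_lower_bound(k) for k in k_values}
for k, bound in bounds.items():
    print(f"k = {k}: at least {bound:.2f}%")
```

The listed percentages are these values rounded to the nearest whole percent. For example, k = 3 gives about 88.89% and k = 4 gives 93.75%.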
For a sample of families with one preschool child, it
was reported that the mean child care time per
week was approximately 36 hours with a standard
deviation of approximately 12 hours.
Using Chebyshev’s rule, at least 75% of the sample
observations must be between 12 and 60 hours
(within 2 standard deviations of the mean).
At least 89% of the observations are
between 0 & 72 hours (within 3 standard deviations of the mean).
At most, what percent of the
observations are greater than
72 hours?
Since time can’t be negative, at most 11% of
the observations are above 72 hours.
Example - Chebyshev’s Rule
Consider the student age data
17 19 19 20 21 22 22 25
18 19 19 20 21 22 23 26
18 19 19 20 21 22 23 28
18 19 19 20 21 22 23 28
18 19 19 20 21 22 23 30
18 19 19 20 21 22 23 37
19 19 20 21 21 22 23 38
19 19 20 21 21 22 24 44
19 19 20 21 21 22 24 47
19 19 20 21 21 22 24
Color code: within 1 standard deviation of the mean
within 2 standard deviations of the mean
within 3 standard deviations of the mean
within 4 standard deviations of the mean
within 5 standard deviations of the mean
Example - Chebyshev’s Rule
Summarizing the student age data
Interval                                      Chebyshev’s    Actual
within 1 standard deviation of the mean       ≥ 0%           72/79 = 91.1%
within 2 standard deviations of the mean      ≥ 75%          75/79 = 94.9%
within 3 standard deviations of the mean      ≥ 88.8%        76/79 = 96.2%
within 4 standard deviations of the mean      ≥ 93.8%        77/79 = 97.5%
within 5 standard deviations of the mean      ≥ 96.0%        79/79 = 100%
Notice that Chebyshev gives very conservative lower bounds and the
values aren’t very close to the actual percentages.
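The "Actual" column can be reproduced by counting how many of the 79 student ages fall within k sample standard deviations of the sample mean. The Python sketch below does this. The ages are entered as value counts taken from the data table, and the helper name `count_within` is illustrative.

```python
from statistics import mean, stdev

# The 79 student ages, entered as repeated values.
ages = ([17] + [18] * 5 + [19] * 20 + [20] * 10 + [21] * 14 + [22] * 11
        + [23] * 6 + [24] * 3 + [25, 26, 28, 28, 30, 37, 38, 44, 47])

x_bar = mean(ages)
s = stdev(ages)          # sample standard deviation (n - 1 denominator)


def count_within(k):
    """Number of ages within k standard deviations of the mean."""
    return sum(1 for x in ages if abs(x - x_bar) <= k * s)


counts = [count_within(k) for k in range(1, 6)]
for k, count in zip(range(1, 6), counts):
    # Chebyshev's bound is 100(1 - 1/k^2)%, which is 0% when k = 1.
    bound = 100 * (1 - 1 / k ** 2)
    actual = 100 * count / len(ages)
    print(f"k = {k}: Chebyshev >= {bound:.1f}%, actual {count}/79 = {actual:.1f}%")
```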
What’s my area?
Input the following command into a graphing calculator in order to graph
a normal curve with a mean of 20 and standard deviation of 3:
•Y1 = normalpdf(X,20,3)
(Window x: [10,30] y: [0,0.2])
•Use the command 2nd trace, 7 to find the area under the curve for:
(Round to 4 decimal places.)
•Lower limit: 17   Upper limit: 23   Area: ____________________
•Lower limit: 14   Upper limit: 26   Area: ____________________
•Lower limit: 11   Upper limit: 29   Area: ____________________
What’s my area?
Graph a normal curve with a mean of 50 and standard deviation of 5.
•Y1 = normalpdf(X,50,5)
(x: [30,70] y: [0,0.1])
•Find the area under the curve for the following:
•Lower limit: 45   Upper limit: 55   Area: ________
•Lower limit: 40   Upper limit: 60   Area: ________
•Lower limit: 35   Upper limit: 65   Area: ________
What pattern do you notice?
Chebyshev’s Rule
Chebyshev’s Rule states that at least 75% of
the observations in a data set are
within 2 standard deviations of the
mean; however, in many data sets
substantially more than 75% of the
values satisfy this condition.
Interpreting Center & Variability
Empirical Rule
Can ONLY be used with distributions that are mound shaped!
• Approximately 68% of the observations are within 1
standard deviation of the mean
• Approximately 95% of the observations are within 2
standard deviations of the mean
• Approximately 99.7% of the observations are within
3 standard deviations of the mean
[Figure: normal curve with the central 68%, 95%, and 99.7% regions marked.]
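The 68%, 95%, and 99.7% figures are rounded areas under a normal curve. As a rough check, this Python sketch computes those areas with `statistics.NormalDist` from the standard library (Python 3.8 or later). The helper name `area_within` is illustrative, and the mean of 20 with standard deviation 3 is borrowed from the graphing-calculator exercise.

```python
from statistics import NormalDist


def area_within(k, mu=0.0, sigma=1.0):
    """Area under the normal curve within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


areas = {k: area_within(k, mu=20, sigma=3) for k in (1, 2, 3)}
for k, area in areas.items():
    lower, upper = 20 - 3 * k, 20 + 3 * k
    print(f"between {lower} and {upper}: area = {area:.4f}")
```

The areas do not depend on the particular mean and standard deviation chosen, which is why the same three percentages apply to every normal curve.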
The height of male students at PWSH is
approximately normally distributed with a mean of
71 inches and standard deviation of 2.5 inches.
a) What percent of the male students are
shorter than 66 inches?
About 2.5%
b) Taller than 73.5 inches?
About 16%
c) Between 66 & 73.5 inches?
About 81.5%
Empirical Rule vs. Chebyshev’s Rule
The Empirical Rule makes “approximately”
instead of “at least” statements, and the
percentages for k = 1, 2, and 3 standard
deviations are much higher than the lower
bounds guaranteed by Chebyshev’s Rule.
Empirical Rule vs. Chebyshev’s Rule
With the Empirical Rule, in contrast to Chebyshev’s Rule, dividing the
percentages in half is permissible because a
normal curve is symmetric.
Empirical Rule
Another reminder!!
The Empirical Rule can only be used if
the histogram of values in a data set is
reasonably symmetric and unimodal
(specifically, is reasonably approximated
by a normal curve).
Empirical Rule
It is unusual to see an observation from a
normally distributed population that is
farther than 2 standard deviations from the
mean (only 5%), and it is very surprising to
see one that is more than 3 standard
deviations away.
z Scores
The z score is how many standard
deviations the observation is from the
mean.
A positive z score indicates the observation
is above the mean and a negative z score
indicates the observation is below the
mean.
The z score corresponding to a particular
observation in a data set is calculated as:
$$z\text{ score} = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$$
What do these z scores mean?
-2.3
2.3 standard deviations below the mean
1.8
1.8 standard deviations above the mean
-4.3
4.3 standard deviations below the mean
z Scores
Computing the z score is often referred to
as standardization and the z score is
called a standardized score.
The formula used with sample data is
$$z\text{ score} = \frac{x - \bar{x}}{s}$$
Sally is taking two different math achievement tests
with different means and standard deviations. The
mean score on test A was 56 with a standard
deviation of 3.5, while the mean score on test B
was 65 with a standard deviation of 2.8. Sally
scored a 62 on test A and a 69 on test B. On which
test did Sally score the best?
z-score on test A:
$$z = \frac{62 - 56}{3.5} = 1.714$$
z-score on test B:
$$z = \frac{69 - 65}{2.8} = 1.429$$
She did better on test A.
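The comparison of Sally's two scores can be written as a short Python sketch. The function name `z_score` and its parameter names are illustrative.

```python
def z_score(observation, mean, standard_deviation):
    """Number of standard deviations the observation lies from the mean."""
    if standard_deviation <= 0:
        raise ValueError("the standard deviation must be positive")
    return (observation - mean) / standard_deviation


z_test_a = z_score(62, mean=56, standard_deviation=3.5)
z_test_b = z_score(69, mean=65, standard_deviation=2.8)

print(f"test A: z = {z_test_a:.3f}")
print(f"test B: z = {z_test_b:.3f}")
print("better relative score:", "A" if z_test_a > z_test_b else "B")
```

Although the raw score on test B (69) is higher, the score on test A lies farther above its own mean in standard deviation units.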
Measures of Relative Standing
percentiles--A value in the data set where
r percent of the observations fall AT or
BELOW that value.
In addition to weight and length, head
circumference is another measure of health in
newborn babies. The National Center for Health
Statistics reports the following summary values for
head circumference (in cm) at birth for boys.
Percentile                  5      10     25     50     75     90     95
Head circumference (cm)     32.2   33.2   34.5   35.8   37.0   38.2   38.6
What percent of newborn boys had head
circumferences greater than 37.0 cm?
25%
10% of newborn babies have head circumferences
bigger than what value?
38.2 cm