Download Chapter 3 PPT

Document related concepts

History of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Time series wikipedia , lookup

World Values Survey wikipedia , lookup

Transcript
Honors Statistics
Chapter 3
Measures of
Variation
3.1 - 1
Review
 Chapter 1
Distinguish between population and
sample, parameter and statistic
Good sampling methods: simple random
sample, collect in appropriate ways
 Chapter 2
Frequency distribution: summarizing data
Graphs designed to help understand data
Center, variation, distribution, outliers,
changing characteristics over time
3.1 - 2
Chapter 3
3-1 Measures of Central Tendency:
Mode Median and Mean
3-2 Measures of Variation
3-3 Percentiles and Box-and-Whisker
3.1 - 3
Section 3-1
Measures of Central
Tendency:
Mode Median and
Mean
Copyright © 2010
2010,Pearson
2007, 2004
Education
Pearson Education, Inc. All Rights Reserved.
3.1 - 4
Key Concept
Characteristics of center. Measures of center,
including mean and median, as tools for
analyzing data. Not only determine the value of
each measure of center, but also interpret
those values.
3.1 - 5
Measure of Central Tendency
 Compute mean median and mode
from raw date
Interpret what mean, median and
mode tell you
Explain how mean, median, and mode
can be affected by extreme data values
What is a trimmed mean and how to
compute it
Compute a weighted average
3.1 - 6
Notation

denotes the sum of a set of values.
x
is the variable usually used to represent
the individual data values.
n
represents the number of data values in a
sample.
N represents the number of data values in a
population.
3.1 - 7
Arithmetic Mean
 Arithmetic Mean (Mean)
the measure of center obtained by
adding the values and dividing the
total by the number of values
What most people call an average.
3.1 - 8
Notation
x is pronounced ‘x-bar’ and denotes the mean of a set
of sample values
x =
x
n
µ is pronounced ‘mu’ and denotes the mean of all values
population
µ =
in a
x
N
3.1 - 9
Mean
 Advantages
Is relatively reliable, means of samples drawn
from the same population don’t vary as much
as other measures of center
Takes every data value into account
 Disadvantage
Is sensitive to every data value, one
extreme value can affect it dramatically;
is not a resistant measure of center
3.1 - 10
5% Trimmed Mean
This is the mean of the data values left after trimming
the specific percentage off the largest and smallest
data values from the data set.
1
2
3
4
Put date in ascending order
Delete the bottom 5% of the data and the top 5%
of the data
If the 5% is not a whole number round to the
Nearest integer
Compute the mean of the remaining 90%
3.1 - 11
Median
 Median
the middle value when the original data values
are arranged in order of increasing (or
decreasing) magnitude
 often denoted by x~
(pronounced x-tilde’)
 is not affected by an extreme value - is a
resistant measure of the center
.
3.1 - 12
Finding the Median
First sort the values (arrange them in
order), the follow one of these
1. If the number of data values is odd,
the median is the number located in
the exact middle of the list.
2. If the number of data values is even,
the median is found by computing the
mean of the two middle numbers.
3.1 - 13
5.40
1.10
0.42
0.73
0.48
1.10
0.42
0.48
0.73
1.10
1.10
5.40
(in order - even number of values – no exact middle
shared by two numbers)
0.73 + 1.10
MEDIAN is 0.915
2
5.40
1.10
0.42
0.73
0.48
1.10
0.66
0.42
0.48
0.66
0.73
1.10
1.10
5.40
(in order - odd number of values)
exact middle
MEDIAN is 0.73
3.1 - 14
Mode
 Mode
the value that occurs with the MOST
FREQUENCY
 Data set can have one, more than one, or no
mode
Bimodal
two data values occur with the
same greatest frequency
Multimodal more than two data values occur
with the same greatest
frequency
No Mode
no data value is repeated
Mode is the only measure of central
tendency that can be used with nominal data
3.1 - 15
Mode - Examples
a. 5.40 1.10 0.42 0.73 0.48 1.10
Mode is 1.10
b. 27 27 27 55 55 55 88 88 99
Bimodal -
c. 1 2 3 6 7 8 9 10
No Mode
27 & 55
3.1 - 16
Critical Thinking
Think about whether the results
are reasonable.
Think about the method used to
collect the sample data.
3.1 - 17
Mean from a Frequency
Distribution
Assume that all sample values in
each class are equal to the class
midpoint.
3.1 - 18
Weighted Mean
When data values are assigned
different weights, we can compute a
weighted mean.
 (w • x)
x =
w
3.1 - 19
Best Measure of Center
3.1 - 20
Skewed and Symmetric
 Symmetric
distribution of data is symmetric if the
left half of its histogram is roughly a
mirror image of its right half
 Skewed
distribution of data is skewed if it is not
symmetric and extends more to one
side than the other
3.1 - 21
Skewed Left or Right
 Skewed to the left
(also called negatively skewed) have a
longer left tail, mean and median are to
the left of the mode
 Skewed to the right
(also called positively skewed) have a
longer right tail, mean and median are
to the right of the mode
3.1 - 22
Shape of the Distribution
The mean and median cannot
always be used to identify the
shape of the distribution.
3.1 - 23
Skewness
3.1 - 24
Section 3-2
Measures of Variation
Copyright © 2010
2010,Pearson
2007, 2004
Education
Pearson Education, Inc. All Rights Reserved.
3.1 - 25
Definition
The range of a set of data values is
the difference between the
maximum data value and the
minimum data value.
Range = (maximum value) – (minimum value)
It is very sensitive to extreme values; therefore
not as useful as other measures of variation.
3.1 - 26
Definition
 Midrange
the value midway between the
maximum and minimum values in the
original data set
Midrange =
maximum value + minimum value
2
3.1 - 27
Definition
The standard deviation of a set of
sample values, denoted by s, is a
measure of variation of values about
the mean.
3.1 - 28
Standard Deviation Important Properties
 The standard deviation is a measure of
variation of all values from the mean.
 The value of the standard deviation s is
usually positive.
 The value of the standard deviation s can
increase dramatically with the inclusion of
one or more outliers (data values far away
from all others).
 The units of the standard deviation s are the
same as the units of the original data values.
3.1 - 29
Midrange
 Sensitive to extremes
because it uses only the maximum
and minimum values, so rarely used
 Redeeming Features
(1) very easy to compute
(2) reinforces that there are several
ways to define the center
(3) Avoids confusion with median
3.1 - 30
Sample Standard
Deviation Formula
s=
.
 (x – x)
n–1
2
3.1 - 31
Population Standard
Deviation
 =
 (x – µ)
2
N
This formula is similar to the previous
formula, but instead, the population mean
and population size are used.
3.1 - 32
Properties of the
Standard Deviation
• Measures the variation among data
values
• Values close together have a small
standard deviation, but values with
much more variation have a larger
standard deviation
• Has the same units of measurement
as the original data
3.1 - 33
Comparing Variation in
Different Samples
It’s a good practice to compare two
sample standard deviations only when
the sample means are approximately
the same.
When comparing variation in samples
with very different means, it is better to
use the COEFFICIENT OF VARIATION,
which is defined later in this section.
3.1 - 34
Variance
 The variance of a set of values is a
measure of variation equal to the
square of the standard deviation.
 Sample variance: s2 - Square of the
sample standard deviation s
 Population variance: 2 - Square of
the population standard deviation 
3.1 - 35
Variance - Notation
s = sample standard deviation
s2 = sample variance
 = population standard deviation
 2 = population variance
3.1 - 36
Range Rule of Thumb
is based on the principle that for many
data sets, the vast majority (such as
95%) of sample values lie within two
standard deviations of the mean.
3.1 - 37
Properties of the
Standard Deviation
• For many data sets, a value is unusual
if it differs from the mean by more
than two standard deviations
• Compare standard deviations of two
different data sets only if the they use
the same scale and units, and they
have means that are approximately
the same
3.1 - 38
Empirical (or 68-95-99.7) Rule
For data sets having a distribution that is
approximately bell shaped, the following
properties apply:
 About 68% of all values fall within 1
standard deviation of the mean.
 About 95% of all values fall within 2
standard deviations of the mean.
 About 99.7% of all values fall within 3
standard deviations of the mean.
3.1 - 39
The Empirical Rule
3.1 - 40
The Empirical Rule
3.1 - 41
The Empirical Rule
3.1 - 42
Chebyshev’s Theorem
The proportion (or fraction) of any set of
data lying within K standard deviations
of the mean is always at least 1–1/K2,
where K is any positive number greater
than 1.
 For K = 2, at least 3/4 (or 75%) of all
values lie within 2 standard
deviations of the mean.
 For K = 3, at least 8/9 (or 89%) of all
values lie within 3 standard
deviations of the mean.
3.1 - 43
Rationale for using n – 1
versus n
There are only n – 1 independent
values. With a given mean, only n – 1
values can be freely assigned any
number before the last value is
determined.
Dividing by n – 1 yields better results
than dividing by n. It causes s2 to
target 2 whereas division by n causes
s2 to underestimate 2.
3.1 - 44
Coefficient of Variation
The coefficient of variation (or CV) for a set
of nonnegative sample or population data,
expressed as a percent, describes the
standard deviation relative to the mean.
Sample
CV =
s  100%
x
Population
CV =

 100%
m
3.1 - 45
Recap
In this section we have looked at:
 Range
 Standard deviation of a sample and
population
 Variance of a sample and population
 Range rule of thumb
 Empirical distribution
 Chebyshev’s theorem
 Coefficient of variation (CV)
3.1 - 46
Section 3-3
Percentiles
and
Box-and-Whisker
Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.
3.1 - 47
Percentiles,
Quartiles, and
Boxplots
3.1 - 48
Percentiles
are measures of location. There are 99
percentiles denoted P1, P2, . . . P99,
which divide a set of data into 100
groups with about 1% of the values in
each group.
3.1 - 49
Finding the Percentile
of a Data Value
Percentile of value x =
number of values less than x
• 100
total number of values
3.1 - 50
Converting from the kth Percentile to
the Corresponding Data Value
Notation
total number of values in the
data set
k percentile being used
L locator that gives the
position of a value
Pk kth percentile
n
L=
k
100
•n
3.1 - 51
Quartiles
Are measures of location, denoted Q1, Q2, and
Q3, which divide a set of data into four groups
with about 25% of the values in each group.
 Q1 (First Quartile) separates the bottom
25% of sorted values from the top 75%.
 Q2 (Second Quartile) same as the median;
separates the bottom 50% of sorted
values from the top 50%.
 Q3 (Third Quartile) separates the bottom
75% of sorted values from the top 25%.
3.1 - 52
Quartiles
Q1, Q2, Q3
divide ranked scores into four equal parts
25%
(minimum)
25%
25% 25%
Q1 Q2 Q3
(maximum)
(median)
3.1 - 53
Some Other Statistics
 Interquartile Range (or IQR): Q3 – Q1
 Semi-interquartile Range:
Q3 – Q1
2
 Midquartile:
Q3 + Q1
2
 10 - 90 Percentile Range: P90 – P10
Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.
3.1 - 54
5-Number Summary
 For a set of data, the 5-number
summary consists of the
minimum value; the first quartile
Q1; the median (or second
quartile Q2); the third quartile,
Q3; and the maximum value.
3.1 - 55
Boxplot
 A boxplot (or box-and-whiskerdiagram) is a graph of a data set
that consists of a line extending
from the minimum value to the
maximum value, and a box with
lines drawn at the first quartile,
Q1; the median; and the third
quartile, Q3.
3.1 - 56
Boxplots
Boxplot of Movie Budget Amounts
3.1 - 57
Boxplots - Normal Distribution
Normal Distribution:
Heights from a Simple Random Sample of Women
3.1 - 58
Boxplots - Skewed Distribution
Skewed Distribution:
Salaries (in thousands of dollars) of NCAA Football Coaches
3.1 - 59
Outliers
 An outlier is a value that lies very far
away from the vast majority of the other
values in a data set.
.
3.1 - 60
Important Principles
 An outlier can have a dramatic effect
on the mean.
 An outlier can have a dramatic effect
on the standard deviation.
 An outlier can have a dramatic effect
on the scale of the histogram so that
the true nature of the distribution is
totally obscured.
3.1 - 61
Modified Boxplots - Example
Pulse rates of females listed in Data Set 1 in
Appendix B.
3.1 - 62
Recap
In this section we have discussed:
 Percentiles
 Quartiles
 Converting a percentile to corresponding
data values
 Other statistics
 5-number summary
 Boxplots and modified boxplots
 Effects of outliers
3.1 - 63