3.3 MEASURING VARIATION OR SPREAD
Both sets of data have the same mean, median and mode but the values
obviously differ in another respect -- the variation or spread of the values.
The values in List 1 are much more tightly clustered around the center value of
60. The values in List 2 are much more dispersed or spread out.
List 1: 55, 56, 57, 58, 59, 60, 60, 60, 61, 62, 63, 64, 65
mean = median = mode = 60
[Dot plot of List 1 on a scale from 35 to 85: the values fall between 55 and 65]
List 2: 35, 40, 45, 50, 55, 60, 60, 60, 65, 70, 75, 80, 85
mean = median = mode = 60
[Dot plot of List 2 on the same scale from 35 to 85: the values are spread across most of the scale]
Range
The range is the simplest measure of variability or spread.
Range is just the difference between the largest value and the smallest value.
Range can give a distorted picture of the actual pattern of variation.
Two distributions: same range but different patterns of variation.
The first distribution has most of its values far from the center, while the
second distribution has most of its values closer to the center.
[Two dot plots, each on a scale from 20 to 30]
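A minimal sketch of the point about range, using two hypothetical data sets on a 20 to 30 scale (these are illustrative values, not the ones plotted in the handout). Both have the same range even though their values are arranged very differently.

```python
# Two hypothetical data sets (assumed for illustration only).
far_from_center = [20, 20, 20, 21, 25, 29, 30, 30, 30]
near_center = [20, 24, 25, 25, 25, 25, 25, 26, 30]

def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

print(value_range(far_from_center))  # 10
print(value_range(near_center))      # 10
```

The range only looks at the two extreme values, so it cannot tell these two patterns apart.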
Interquartile Range
The interquartile range measures the spread of the middle 50% of the data.
You first find the median (represented by Q2, the value that divides the data
into two halves), and then find the median for each half. The three values that
divide the data into four parts are called the quartiles, represented by Q1, Q2,
and Q3. The difference between the third quartile and the first quartile is called
the interquartile range, denoted by IQR = Q3 - Q1.
Finding the Quartiles
1. Find the median of all of the observations.
2. First Quartile = Q1 = median of observations that fall below the
median.
3. Third Quartile = Q3 = median of observations that fall above the
median.
Notes
• When the number of observations is odd, the middle observation is
the median. This observation is not included in either of the two
halves when computing Q1 and Q3.
• Although different books, calculators, and computers may use
slightly different ways to compute the quartiles, they are all based
on the same idea.
• In a left-skewed distribution, the first quartile will be farther from the
median than the third quartile is. If the distribution is symmetric, the
quartiles should be the same distance from the median.
Example Quartiles for Age
The ages of the 20 subjects in the medical study are listed below in order.
32, 37, 39, 40, 41, 41, 41, 42, 42, 43, 44, 45, 45, 45, 46, 47, 47, 49, 50, 51
The histogram of the ages is also provided.
(a) Calculate the median age.
(b) Calculate the first quartile Q1 for this age data.
(c) Calculate the third quartile Q3 for this age data.
(d) Calculate the range for this age data.
median = 43.5
Q1 = 41
Q3 = 46.5
[Histogram of the ages: Count (axis up to 8) against age (30 to 55)]
We see that the distribution of age is approximately symmetric and that the quartiles are
about the same distance from the median.
The quartiles are actually the 25th, 50th, and 75th percentiles.
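The median-of-halves procedure can be sketched in Python as follows. The function name `quartiles` is an assumption made for this illustration; as the notes say, software packages may use slightly different conventions and return slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    With an odd number of observations, the middle observation is
    left out of both halves, as described in the notes.
    """
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]          # observations below the median
    upper = data[(n + 1) // 2 :]    # observations above the median
    return median(lower), median(data), median(upper)

ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

q1, q2, q3 = quartiles(ages)
print(q1, q2, q3)              # 41.0 43.5 46.5
print(q3 - q1)                 # IQR: 5.5
print(max(ages) - min(ages))   # range: 19
```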
DEFINITION:
The pth percentile is the value such that p% of the observations fall at or
below that value and (100 - p)% of the observations fall at or above that
value.
Five-Number Summary
Five-number summary: Minimum, Q1, Median, Q3, Maximum
[Boxplot diagram labeling Min, Q1, Q2 = Median, Q3, and Max]
To Build a Basic Boxplot
• List the data values in order from smallest to largest.
• Find the five-number summary: minimum, Q1, median, Q3, and maximum.
• Locate the values for Q1, the median, and Q3 on the scale. These values
determine the “box” part of the boxplot. The quartiles determine the ends of
the box, and a line is drawn inside the box to mark the value of the median.
• Draw lines (called whiskers) from the midpoints of the ends of the box out to
the minimum and maximum.
Example
Five-Number Summary and Boxplot for Age
Problem
Consider the (ordered) ages of the 20 subjects in a medical study:
32, 37, 39, 40, 41, 41, 41, 42, 42, 43, 44, 45, 45, 45, 46, 47, 47, 49, 50, 51
The five-number summary for the age data is given by:
min = 32, Q1 = 41, median = 43.5, Q3 = 46.5, and
max = 51.
Draw the basic boxplot.
The distance between the median and the quartiles is roughly the same,
supporting the rough symmetry of the distribution as seen previously from the
histogram.
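As a rough stand-in for the drawn boxplot, the sketch below renders the five-number summary of the age data as one line of text. The function `ascii_boxplot`, the scale limits (30 to 55), and the plot width are all assumptions chosen for this illustration.

```python
def ascii_boxplot(minimum, q1, med, q3, maximum, lo, hi, width=51):
    """Draw a one-line text boxplot on a scale running from lo to hi."""
    def col(value):
        # Map a data value to a character column on the scale.
        return round((value - lo) * (width - 1) / (hi - lo))

    line = [" "] * width
    for c in range(col(minimum), col(maximum) + 1):
        line[c] = "-"               # whiskers out to the min and max
    for c in range(col(q1), col(q3) + 1):
        line[c] = "="               # the box spans Q1 to Q3
    line[col(q1)] = "["             # left end of the box (Q1)
    line[col(q3)] = "]"             # right end of the box (Q3)
    line[col(med)] = "|"            # line marking the median
    return "".join(line)

# Five-number summary of the age data: min, Q1, median, Q3, max.
plot = ascii_boxplot(32, 41, 43.5, 46.5, 51, lo=30, hi=55)
print(plot)
```

The median mark sits near the middle of the box, which matches the rough symmetry noted above.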
Side-by-side boxplots are helpful for comparing two or more distributions with
respect to the five-number summary.
Although the median of the first process
is closer to the target value of 20.000
cm, the second process produces a
less variable distribution.
Using the 1.5 x IQR Rule to Identify Outliers and Build a Modified Boxplot
• List the data values in order from smallest to largest.
• Find the five-number summary: minimum, Q1, median, Q3, and maximum.
• Locate the values for Q1, the median, and Q3 on the scale. These values
determine the “box” part of the boxplot. The quartiles determine the ends of
the box, and a line is drawn inside the box to mark the value of the median.
• Find the IQR = Q3 - Q1.
• Compute the quantity STEP = 1.5 x (IQR).
• Find the location of the inner fences by taking 1 step out from each of the
quartiles:
lower inner fence = Q1 - STEP;
upper inner fence = Q3 + STEP.
• Draw the lines (whiskers) from the midpoints of the ends of the box out to
the smallest and largest values WITHIN the inner fences.
• Observations that fall OUTSIDE the inner fences are considered potential
outliers. If there are any outliers, plot them individually along the scale using
a solid dot.
Five-number summary: min = 1, Q1 = 21, median = 32, Q3 = 66, max = 325
[Modified boxplot annotated with the inner fences, the potential outliers (an outside value and a far outside value), and the farthest observations that are not potential outliers]
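For the five-number summary shown with the figure (Q1 = 21, Q3 = 66), the fence arithmetic can be sketched as below. The helper name `inner_fences` is an assumption made for this illustration.

```python
def inner_fences(q1, q3):
    """Return (lower, upper) inner fences from the 1.5 x IQR rule."""
    iqr = q3 - q1
    step = 1.5 * iqr
    return q1 - step, q3 + step

lower_fence, upper_fence = inner_fences(21, 66)
print(lower_fence, upper_fence)   # -46.5 133.5

# The minimum (1) is inside the fences; the maximum (325) is not.
print(1 < lower_fence)            # False
print(325 > upper_fence)          # True
```

Here IQR = 45 and STEP = 67.5, so the maximum of 325 lies beyond the upper inner fence and is flagged as a potential outlier.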
Example Any Age Outlier?
Let’s apply the "rule of thumb" to our age data set to assess if there are any
outliers.
(a) Construct the fences for the modified boxplot based on the 1.5 * IQR rule.
(b) Are there any outliers using the 1.5 * IQR rule?
(c) Construct the modified boxplot.
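One way to check parts (a) and (b) for the age data is sketched below. It assumes the median-of-halves quartiles used earlier (Q1 = 41, Q3 = 46.5); the variable names are illustrative.

```python
from statistics import median

ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

# Quartiles as medians of the lower and upper halves (n = 20 is even).
q1 = median(ages[:10])
q3 = median(ages[10:])

step = 1.5 * (q3 - q1)
lower_fence = q1 - step
upper_fence = q3 + step
print(lower_fence, upper_fence)   # 32.75 54.75

# Potential outliers fall outside the inner fences.
outliers = [a for a in ages if a < lower_fence or a > upper_fence]
print(outliers)                   # [32]

# Whiskers end at the most extreme values still within the fences.
inside = [a for a in ages if lower_fence <= a <= upper_fence]
print(min(inside), max(inside))   # 37 51
```

Under these quartiles the youngest subject (age 32) falls just below the lower inner fence, so the modified boxplot would show it as a separate point with the left whisker ending at 37.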
Let's Do It! 1 (3 min)
Five-Number Summary and Outliers
Let's Do It! 2 (3 min)
Cost of Running Shoes
The prices for 12 comparable pairs of running shoes produced the
following boxplot.
[Boxplot of PRICE on a scale from 40 to 120, with one potential outlier marked by *]
(a) What was the approximate range of prices for such running shoes?
Range = ______________
(b) Twenty-five percent of the shoes cost more than
approximately what amount?
$ _____________
Let's Do It! 3 (10 min)
Comparing Ages—Antibiotic Study
Variable = age for 23 children randomly assigned to one of two treatment
groups.
(a) Give the five-number summary for each of the two
treatment groups. Comment on your results.
Amoxicillin Group (n=11): 8 9 9 10 10 11 11 12 14 14 17
Five-number summary:
Cefadroxil Group (n=12): 7 8 9 9 9 10 10 11 12 13 14 16
Five-number summary:
(b) Make side-by-side boxplots for the antibiotic study
data in part (a).
(c) Using our “rule of thumb,” are there any outliers for
the Amoxicillin group? If so, modify your boxplot
above.
(d) Using our “rule of thumb,” are there any outliers for
the Cefadroxil group?
If so, modify your boxplot above.
Standard Deviation
The standard deviation is a measure of the spread of the observations from the
mean.
Think of the standard deviation as an “average (or
standard) distance of the observations from the mean.”
Example 5.9 Standard Deviation—What Is It?
Observations: 0, 5, 7
Deviations: -4, 1, 3
Squared Deviations: 16, 1, 9
------------------------------------------------------------
Observation    Deviation          Squared Deviation
x              x - x̄              (x - x̄)^2
------------------------------------------------------------
0              0 - 4 = -4         16
5              5 - 4 = 1          1
7              7 - 4 = 3          9
------------------------------------------------------------
mean = 4       sum always = 0     sum = 26
\[
\text{sample variance} = \frac{(-4)^2 + (1)^2 + (3)^2}{3 - 1} = \frac{16 + 1 + 9}{2} = \frac{26}{2} = 13
\]
\[
\text{sample standard deviation} = \sqrt{13} \approx 3.6
\]
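The same calculation, step by step, as a short Python sketch. The variable names are chosen for this illustration; the observations 0, 5, 7 are the ones in the table.

```python
from math import sqrt

observations = [0, 5, 7]
n = len(observations)

mean = sum(observations) / n                    # 4.0
deviations = [x - mean for x in observations]   # [-4.0, 1.0, 3.0]
squared = [d ** 2 for d in deviations]          # [16.0, 1.0, 9.0]

# Divide the sum of squared deviations by n - 1 for a sample.
variance = sum(squared) / (n - 1)               # 26 / 2 = 13.0
std_dev = sqrt(variance)

print(sum(deviations))     # 0.0 (deviations from the mean always sum to 0)
print(variance)            # 13.0
print(round(std_dev, 1))   # 3.6
```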
Interpretation of the Standard Deviation
Think of the standard deviation as roughly an average distance of the
observations from their mean. If all of the observations are the same, then the
standard deviation will be 0 (i.e. no spread). Otherwise the standard deviation
is positive and the more spread out the observations are about their mean, the
larger the value of the standard deviation.
If \(x_1, x_2, \ldots, x_n\) denote a sample of n observations,
the sample variance is denoted by:
\[
s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}
    = \frac{\sum (x_i - \bar{x})^2}{n - 1}
    = \frac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n - 1}
    = \frac{n\left(\sum x_i^2\right) - \left(\sum x_i\right)^2}{n(n - 1)}
\]
Sample standard deviation, denoted by s,
is the square root of the variance: \( s = \sqrt{s^2} \).
The population standard deviation, denoted by the Greek letter \(\sigma\) (sigma),
is the square root of the population variance and is computed as:
\[
\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}
\]
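A brief sketch checking that the definitional and shortcut forms of the sample variance agree, and showing the population version, which divides by N. The function names are assumptions for this illustration; `statistics.variance` and `statistics.pstdev` are the Python standard library's sample variance and population standard deviation.

```python
import statistics
from math import sqrt, isclose

def sample_variance_definition(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Shortcut form: [n * sum(x^2) - (sum x)^2] / [n * (n - 1)]."""
    n = len(xs)
    return (n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1))

def population_std_dev(xs):
    """Square root of the sum of squared deviations from mu, divided by N."""
    big_n = len(xs)
    mu = sum(xs) / big_n
    return sqrt(sum((x - mu) ** 2 for x in xs) / big_n)

data = [0, 5, 7]
print(sample_variance_definition(data))   # 13.0
print(sample_variance_shortcut(data))     # 13.0
print(isclose(population_std_dev(data), statistics.pstdev(data)))  # True
```

For the same data the population formula gives a smaller value than the sample formula, because it divides by N rather than n - 1.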
Remarks:
• The variance is measured in squared units. By taking the square root
of the variance we bring this measure of spread back into the original
units.
• Just as the mean is not a resistant measure of center, the standard
deviation, which uses the mean in its definition, is not a resistant
measure of spread. It is heavily influenced by extreme values.
• There are statistical arguments that support why we divide by n - 1
instead of n in the denominator of the sample standard deviation.
Let's Do It! 4 (4 min) 5.13 Increasing Spread
Consider the following three data sets.
I: 20 20 20
II: 18 20 22
III: 17 20 23
(a) Which data set will have the smallest standard deviation?
(b) Which data set will have the largest standard deviation?
(c) Find the standard deviation for each data set and
check your answers to (a) and (b).
Think About It (3 min)
Given that two (or more) sets of n observations yield the same standard deviation, will
these sets show the same amount of variability? Just what is variability anyway?
Example
There Are Many Measures of Variability
Consider the following four data sets along with their histograms:
Data Set I:   2 3 3 3 4 4 4 4 5 5 5 5 5
Data Set II:  3 3 3 3 3 4 4 4 4 5 5 5 6
Data Set III: 2 3 3 4 4 4 4 4 4 4 5 5 6
Data Set IV:  3 3 3 3 3 3 4 5 5 5 5 5 5
[Histograms of Distributions I, II, III, and IV, each on a scale from 1 to 6]
(a) Calculate the mean for each data set.
(b) Calculate the range for each data set.
(c) Calculate the interquartile range, IQR, for each data set.
(d) Calculate the standard deviation for each data set.
(e) Which data set is most variable? Explain.
The mean for all four distributions is \( \bar{x} = 4 \).
The table presents three measures of variability for each of the four
distributions:
Distribution   Range   IQR   Std dev
I              3       2     1
II             3       2     1
III            4       1     1
IV             2       2     1
If we look at the range, Distribution III is most variable; if we
look at the IQR, Distribution III is least variable; while all four
distributions have the SAME standard deviation.
Some people associate variability with range while others
associate variability with how values differ from the mean.
There are many measures of variability, with the standard
deviation being the most widely used measure. But keep in
mind, a distribution with the smallest standard deviation is not
necessarily the distribution that is least variable with respect to
other definitions or to your own definition of variability.
(Reference: A. J. Nitko, (1983), Educational Tests and Measurement: An Introduction.)
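The three measures for the four data sets can be recomputed with the sketch below. It assumes the median-of-halves quartiles described earlier in this section; the helper `iqr` and the dictionary names are illustrative.

```python
from statistics import mean, median, stdev

data_sets = {
    "I":   [2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5],
    "II":  [3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6],
    "III": [2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6],
    "IV":  [3, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 5],
}

def iqr(values):
    """Q3 - Q1, with quartiles taken as medians of the two halves."""
    data = sorted(values)
    n = len(data)
    return median(data[(n + 1) // 2 :]) - median(data[: n // 2])

summary = {}
for name, values in data_sets.items():
    summary[name] = (max(values) - min(values), iqr(values), stdev(values))
    print(name, mean(values), *summary[name])
```

Each data set has mean 4 and sample standard deviation 1, while the range and IQR rank the four distributions differently.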
Think About It
What do you think would happen to the measures of variability if the last value
in all four of the preceding data sets were changed to 16?
IQR and Standard Deviation
The interquartile range, IQR, is the distance between the first and third
quartiles (Q3 - Q1), and measures the spread of the middle 50% of the data.
When the median is used as a measure of center, the IQR is often used as a
measure of spread. For skewed distributions, or distributions with outliers, the
IQR tends to be a better measure of spread if your goal is to summarize that
distribution.
Adding the minimum and maximum values to the median and quartiles results
in the five-number summary. A graphical display of the five-number
summary is a boxplot, and the length of the box corresponds to the IQR.
The standard deviation is roughly the average distance of the observed
values from their mean. The mean and the standard deviation are most useful
for approximately symmetric distributions with no outliers. In the next chapter
we will discuss an important family of symmetric distributions, called the normal
distributions, for which the standard deviation is a very useful summary.
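To illustrate why the IQR is preferred when outliers are present, the sketch below compares the age data with a hypothetical altered copy in which the largest age is replaced by an extreme value of 95. The altered data set is an assumption made purely for illustration.

```python
from statistics import median, stdev

def iqr(values):
    """Q3 - Q1, with quartiles taken as medians of the two halves."""
    data = sorted(values)
    n = len(data)
    return median(data[(n + 1) // 2 :]) - median(data[: n // 2])

ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

# Hypothetical change: replace the largest age (51) with 95.
ages_with_extreme = ages[:-1] + [95]

print(iqr(ages), iqr(ages_with_extreme))   # 5.5 5.5
print(round(stdev(ages), 2), round(stdev(ages_with_extreme), 2))
```

The IQR is unchanged because the middle 50% of the data did not move, while the standard deviation grows noticeably.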
Tip:
The numerical summaries presented in this chapter provide information about
the center and spread of a distribution, but a graph, such as a histogram or
stem-and-leaf plot, provides the best picture of the overall shape of the
distribution.
Graph your data first!
Variance and Standard Deviation for Grouped Data
The procedure for finding the variance and standard deviation for grouped data
is similar to that for finding the mean for grouped data, and it uses the
midpoints of each class.
Example
The data represent the number of miles that
20 runners ran during one week. Find the
variance and the standard deviation for the
frequency distribution of the data.
Solution
Step 1
Make a table as shown, and
find the midpoint of each class.
Step 2
Multiply the frequency
by the midpoint for each class, and
place the products in column D.
1 · 8 = 8, 2 · 13 = 26, . . . , 2 · 38 = 76
Step 3
Multiply the frequency
by the square of the midpoint, and
place the products in column E.
1 · 8^2 = 64, 2 · 13^2 = 338, . . . , 2 · 38^2 = 2888
Step 4
Find the sums of columns B, D, and E. The sum of column B is n, the
sum of column D is \( \sum f \cdot x_m \), and the sum of column E is \( \sum f \cdot x_m^2 \). The completed
table is shown.
Step 5
Substitute in the formula and solve for s2 to get the variance.
Step 6
Take the square root to get the standard deviation.
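The six steps can be sketched in Python as below. Two assumptions are made: the frequency table is hypothetical (it is NOT the runners' table, which is not reproduced in this transcript), and the formula in Step 5 is taken to be the shortcut form of the sample variance with the column sums substituted, \( s^2 = \frac{n\left(\sum f \cdot x_m^2\right) - \left(\sum f \cdot x_m\right)^2}{n(n - 1)} \).

```python
from math import sqrt

# Hypothetical frequency table (assumed for illustration):
# (lower class boundary, upper class boundary, frequency).
classes = [(0.5, 10.5, 2), (10.5, 20.5, 5), (20.5, 30.5, 3)]

n = 0           # sum of column B (the frequencies)
sum_fx = 0.0    # sum of column D (frequency * midpoint)
sum_fx2 = 0.0   # sum of column E (frequency * midpoint squared)
for lower, upper, f in classes:
    midpoint = (lower + upper) / 2      # Step 1
    n += f
    sum_fx += f * midpoint              # Step 2
    sum_fx2 += f * midpoint ** 2        # Step 3

# Step 5: substitute the column sums into the shortcut formula.
variance = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
# Step 6: take the square root.
std_dev = sqrt(variance)

print(n, sum_fx, sum_fx2)                      # 10 165.0 3212.5
print(round(variance, 1), round(std_dev, 1))   # 54.4 7.4
```

Because every observation in a class is treated as if it sat at the class midpoint, the result is an approximation to the variance of the raw data.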
Let's Do It! 5
The data show the distribution of the birth weights (in oz.) of 100 consecutive
deliveries. Find the variance and the standard deviation.
Interval          Frequency
29.50-69.45       5
69.50-89.45       10
89.50-99.45       11
99.50-109.45      19
109.50-119.45     17
119.50-129.45     20
129.50-139.45     12
139.50-169.45     6
Practice Exercises from Textbook For 3.3 section
Page 129: 1-7 all, 9-11 all, 16, 18-21 all
Page 157: 1-12 all, 16, 17, and 18
TI Quick Steps
Obtaining Summary Measures
Step 1
Clear data.
Step 2
Enter data to be summarized.
Step 3
Obtain the summary measures for the data in L1.
Summary measures are obtained by requesting the 1-Var Stats
from under the STAT CALC menu list. The sequence of buttons is
as follows:
The 1-Var Stats are now displayed in the window. Notice that both the sample
standard deviation s and the population standard deviation σ are displayed; which one to use
depends on whether the values in L1 are a sample or the entire population.
The only mean provided is x̄, which also serves as the population mean when the
values are the entire population. To find more information, in particular the
five-number summary, press the down arrow button.
Producing a Boxplot
Step 1
Clear data and plots
Step 2
Enter data to be plotted
Step 3
Setting the STAT PLOT options for a boxplot.
Finally set the stat plot options for producing
a boxplot of the data in L1 as Plot 1.
The sequence of steps is as follows:
Press the ZOOM button and then “9” to have the boxplot displayed. Use the
TRACE button and the right and left arrow keys to see values for the five-number
summary. Note that the modified boxplot type is the 4th graph icon in the
Type list.