Chapter 3: Averages and Variation
Copyright © Cengage Learning. All rights reserved.
Section 3.1
Measures of Central Tendency: Mode, Median, and Mean
Focus Points
• Compute mean, median, and mode from raw data.
• Interpret what mean, median, and mode tell you.
• Explain how mean, median, and mode can be affected by extreme data values.
• Compute a weighted average.
Arithmetic Mean
• Arithmetic Mean (Mean): the measure of center obtained by adding the values and dividing the total by the number of values.
• It is what most people call an average.
Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.
Notation
Σ denotes the sum of a set of values.
x is the variable usually used to represent the individual data values.
n represents the number of data values in a sample.
N represents the number of data values in a population.
x̄ is pronounced 'x-bar' and denotes the mean of a set of sample values:
$\bar{x} = \frac{\sum x}{n}$
µ is pronounced 'mu' and denotes the mean of all values in a population:
$\mu = \frac{\sum x}{N}$
Mean
• Advantages
Is relatively reliable: means of samples drawn from the same population don't vary as much as other measures of center
Takes every data value into account
• Disadvantage
Is sensitive to every data value: one extreme value can affect it dramatically, so it is not a resistant measure of center
Trimmed Mean (optional)
• A measure of center that is more resistant than the
mean but still sensitive to specific data values is the
trimmed mean.
• A trimmed mean is the mean of the data values left
after “trimming” a specified percentage of the
smallest and largest data values from the data set.
• Usually a 5% trimmed mean is used. This implies that
we trim the lowest 5% of the data as well as the
highest 5% of the data. A similar procedure is used
for a 10% trimmed mean.
• Procedure:
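The trimming described above can be sketched in Python. This is only an illustrative sketch: the function name `trimmed_mean`, the sample data, and the choice to round the number of trimmed values to the nearest whole number are assumptions, not something stated on the slide.

```python
def trimmed_mean(values, percent):
    """Mean after trimming `percent`% of the values from each end."""
    data = sorted(values)
    # Assumed convention: round the trim count to the nearest whole number.
    k = int(len(data) * percent / 100 + 0.5)
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)

# Hypothetical data: 19 ordinary values plus one extreme value.
scores = list(range(1, 20)) + [1000]
print(sum(scores) / len(scores))   # ordinary mean, pulled up by 1000
print(trimmed_mean(scores, 5))     # 5% trimmed mean drops 1 and 1000
```

With 20 values, a 5% trim removes one value from each end, so the extreme value no longer influences the result.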
Median
• Median: the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude
• often denoted by x̃ (pronounced 'x-tilde')
• is not affected by an extreme value, so it is a resistant measure of the center
Finding the Median
First sort the values (arrange them in
order), then follow one of these two rules:
1. If the number of data values is odd,
the median is the number located in
the exact middle of the list.
2. If the number of data values is even,
the median is found by computing the
mean of the two middle numbers.
Example 1 (even number of values)
Data: 5.40 1.10 0.42 0.73 0.48 1.10
In order: 0.42 0.48 0.73 1.10 1.10 5.40
There is no exact middle; the middle is shared by two numbers, 0.73 and 1.10.
MEDIAN is $\frac{0.73 + 1.10}{2} = 0.915$
Example 2 (odd number of values)
Data: 5.40 1.10 0.42 0.73 0.48 1.10 0.66
In order: 0.42 0.48 0.66 0.73 1.10 1.10 5.40
The exact middle is the fourth value.
MEDIAN is 0.73
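The two median examples can be checked with Python's standard `statistics` module. The variable names `even_set` and `odd_set` are chosen here for illustration only.

```python
from statistics import median

even_set = [5.40, 1.10, 0.42, 0.73, 0.48, 1.10]
odd_set = even_set + [0.66]

print(sorted(even_set))   # the two middle values are 0.73 and 1.10
print(median(even_set))   # mean of the two middle values
print(median(odd_set))    # exact middle of the seven sorted values
```

`median` sorts the data itself, so the values can be passed in their original order.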
Mode
• Mode: the value that occurs with the greatest frequency
• A data set can have one, more than one, or no mode
Bimodal: two data values occur with the same greatest frequency
Multimodal: more than two data values occur with the same greatest frequency
No Mode: no data value is repeated
Mode is the only measure of central tendency that can be used with nominal data
Mode - Examples
a. 5.40 1.10 0.42 0.73 0.48 1.10
Mode is 1.10
b. 27 27 27 55 55 55 88 88 99
Bimodal - 27 & 55
c. 1 2 3 6 7 8 9 10
No Mode
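A short Python sketch of the counting behind these three examples follows. The helper name `modes` is hypothetical, and it follows the slide's convention that a data set with no repeated value has no mode.

```python
from collections import Counter

def modes(values):
    """Return the list of modes; an empty list means 'no mode'."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []   # no value is repeated
    return [v for v, c in counts.items() if c == top]

print(modes([5.40, 1.10, 0.42, 0.73, 0.48, 1.10]))     # one mode
print(modes([27, 27, 27, 55, 55, 55, 88, 88, 99]))     # bimodal
print(modes([1, 2, 3, 6, 7, 8, 9, 10]))                # no mode
```

Because it only counts occurrences, the same function also works for nominal data such as a list of color names.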
Definition
• Midrange: the value midway between the maximum and minimum values in the original data set
$\text{Midrange} = \frac{\text{maximum value} + \text{minimum value}}{2}$
Midrange
• Sensitive to extremes: because it uses only the maximum and minimum values, it is rarely used
• Redeeming Features
(1) very easy to compute
(2) reinforces that there are several ways to define the center
(3) avoids confusion with median
Round-off Rule for
Measures of Center
Carry one more decimal place than is
present in the original set of values.
Critical Thinking
Think about whether the results
are reasonable.
Think about the method used to
collect the sample data.
Weighted Average
• Sometimes we wish to average numbers, but we
want to assign more importance, or weight, to some
of the numbers.
• For instance, suppose your professor tells you that
your grade will be based on a midterm and a final
exam, each of which is based on 100 possible points.
• However, the final exam will be worth 60% of the
grade and the midterm only 40%. How could you
determine an average score that would reflect these
different weights?
• The average you need is the weighted average:
$\text{Weighted average} = \frac{\sum xw}{\sum w}$
where x is a data value and w is the weight assigned to that data value.
Example – Weighted Average
• Suppose your midterm test score is 83 and your final
exam score is 95.
• Using weights of 40% for the midterm and 60% for
the final exam, compute the weighted average of
your scores.
• If the minimum average for an A is 90, will you earn
an A?
• Solution:
• By the formula, we multiply each score by its weight
and add the results together.
• Then we divide by the sum of all the weights.
Converting the percentages to decimal notation, we
get
$\text{Weighted average} = \frac{83(0.40) + 95(0.60)}{0.40 + 0.60} = \frac{33.2 + 57}{1} = 90.2$
Your average is high enough to earn an A.
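The same computation can be written as a small Python function. The name `weighted_average` is chosen here for illustration; the scores and weights are the ones from the example.

```python
def weighted_average(values, weights):
    """Multiply each value by its weight, add, then divide by the total weight."""
    total = sum(x * w for x, w in zip(values, weights))
    return total / sum(weights)

scores = [83, 95]          # midterm, final exam
weights = [0.40, 0.60]     # 40% and 60%, in decimal notation
average = weighted_average(scores, weights)

print(round(average, 1))   # 90.2
print(average >= 90)       # True: high enough for an A
```

Dividing by the sum of the weights means the function also works when the weights do not add up to 1, for example credit hours.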
Example 2– Weighted Mean
In her first semester of college, a student of the author took five courses.
Her final grades along with the number of credits for each course were A (3
credits), A (4 credits), B (3 credits), C (3 credits), and F (1 credit).
The grading system assigns quality points to letter grades as follows:
A = 4; B = 3; C = 2; D = 1; F = 0.
Compute her grade point average.
Solution
Use the numbers of credits as the weights: w = 3, 4, 3, 3, 1.
Replace the letters grades of A, A, B, C, and F with the
corresponding quality points: x = 4, 4, 3, 2, 0.
$\bar{x} = \frac{\sum (w \cdot x)}{\sum w} = \frac{(3 \times 4) + (4 \times 4) + (3 \times 3) + (3 \times 2) + (1 \times 0)}{3 + 4 + 3 + 3 + 1} = \frac{43}{14} = 3.07$
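The grade point average can be reproduced in Python, with credits as the weights. The names `QUALITY_POINTS` and `grade_point_average` are hypothetical helpers introduced for this sketch.

```python
QUALITY_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def grade_point_average(courses):
    """courses: list of (letter grade, credits) pairs."""
    points = sum(QUALITY_POINTS[grade] * credits for grade, credits in courses)
    total_credits = sum(credits for _, credits in courses)
    return points / total_credits

courses = [("A", 3), ("A", 4), ("B", 3), ("C", 3), ("F", 1)]
print(round(grade_point_average(courses), 2))   # 43 / 14, about 3.07
```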
Mean from a Frequency Distribution
Assume that all sample values in each class are equal to the class midpoint. Use the class midpoints as the values of the variable x and the class frequencies f as the weights:
$\bar{x} = \frac{\sum (f \cdot x)}{\sum f}$
Example
• Estimate the mean from the IQ scores in Chapter 2.
$\bar{x} = \frac{\sum (f \cdot x)}{\sum f} = \frac{7201.0}{78} = 92.3$
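A sketch of the frequency-table calculation in Python is given below. The frequency table itself is not reproduced in this transcript, so the class limits and frequencies in the code are an assumption, chosen so that they reproduce the totals 78 and 7201.0 quoted above.

```python
# Assumed frequency table (class limits and frequencies are NOT given in
# this transcript; they were chosen to match the totals 78 and 7201.0).
classes = [((50, 69), 2), ((70, 89), 33), ((90, 109), 35),
           ((110, 129), 7), ((130, 149), 1)]

sum_f = 0
sum_fx = 0.0
for (low, high), f in classes:
    midpoint = (low + high) / 2   # stands in for every value in the class
    sum_f += f
    sum_fx += f * midpoint

mean = sum_fx / sum_f
print(sum_f, sum_fx, round(mean, 1))
```

The result is an estimate, because the midpoint replaces the actual values in each class.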
Best Measure of Center
Skewed and Symmetric
• Symmetric: a distribution of data is symmetric if the left half of its histogram is roughly a mirror image of its right half
• Skewed: a distribution of data is skewed if it is not symmetric and extends more to one side than the other
Skewed Left or Right
• Skewed to the left (also called negatively skewed): the distribution has a longer left tail; mean and median are to the left of the mode
• Skewed to the right (also called positively skewed): the distribution has a longer right tail; mean and median are to the right of the mode
Shape of the Distribution
The mean and median cannot
always be used to identify the
shape of the distribution.
Skewness
Section 3.2
Measures of Variation
Focus Points
• Find the range, variance, and standard deviation.
• Compute the coefficient of variation from raw data. Why is the coefficient of variation important?
Definition
The range of a set of data values is
the difference between the
maximum data value and the
minimum data value.
Range = (maximum value) – (minimum value)
Example: Range of {1, 3, 14} is 14-1=13.
It is very sensitive to extreme values; therefore
not as useful as other measures of variation.
Round-Off Rule for
Measures of Variation
When rounding the value of a
measure of variation, carry one more
decimal place than is present in the
original set of data.
Round only the final answer, not values in
the middle of a calculation.
Definition
The standard deviation of a set of
sample values, denoted by s, is a
measure of variation of values about
the mean.
Sample Standard Deviation Formula
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Sample Standard Deviation (Shortcut Formula)
$s = \sqrt{\frac{n \sum (x^2) - (\sum x)^2}{n(n - 1)}}$
Example
Use either formula to find the standard
deviation of these numbers of a sample
of chocolate chips:
22, 22, 26, 24
$\bar{x} = \frac{\sum x}{n} = \frac{22 + 22 + 26 + 24}{4} = 23.5$
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{(22 - 23.5)^2 + (22 - 23.5)^2 + (26 - 23.5)^2 + (24 - 23.5)^2}{4 - 1}} = \sqrt{\frac{11}{3}} = 1.9149$
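The chocolate chip calculation can be followed step by step in Python and then checked against the standard library. The variable names are illustrative; `statistics.stdev` computes the sample standard deviation with the n − 1 divisor.

```python
from math import sqrt
from statistics import stdev

chips = [22, 22, 26, 24]
n = len(chips)
mean = sum(chips) / n
squared_deviations = [(x - mean) ** 2 for x in chips]
s = sqrt(sum(squared_deviations) / (n - 1))

print(mean, sum(squared_deviations))   # 23.5 and 11.0
print(round(s, 4))                     # 1.9149
print(round(stdev(chips), 4))          # library check of the hand computation
```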
Another Example: Publix checkout waiting times in minutes
Dataset: {1, 4, 10}. Find the sample mean and sample standard deviation.

x        x − x̄     (x − x̄)²    x²
1        −4        16          1
4        −1        1           16
10       5         25          100
Σ = 15             Σ = 42      Σ = 117

n = 3
$\bar{x} = \frac{\sum x}{n} = \frac{15}{3} = 5.0$ min
Using the definition:
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{42}{3 - 1}} = \sqrt{21} = 4.6$ min
Using the shortcut formula:
$s = \sqrt{\frac{n \sum (x^2) - (\sum x)^2}{n(n - 1)}} = \sqrt{\frac{3(117) - 15^2}{3(3 - 1)}} = \sqrt{\frac{351 - 225}{6}} = \sqrt{\frac{126}{6}} = \sqrt{21} = 4.6$ min
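To see that the definition and the shortcut formula agree on the waiting-time data, here is a minimal Python sketch. The function names `stdev_definition` and `stdev_shortcut` are made up for this illustration.

```python
from math import sqrt

def stdev_definition(data):
    """s from the sum of squared deviations about the mean."""
    n = len(data)
    mean = sum(data) / n
    return sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

def stdev_shortcut(data):
    """s from the sum of x and the sum of x squared only."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

waits = [1, 4, 10]   # minutes
print(round(stdev_definition(waits), 1))   # 4.6
print(round(stdev_shortcut(waits), 1))     # 4.6
```

The shortcut version never computes the mean, which is why it is convenient for hand calculation.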
Standard Deviation Important Properties
• The standard deviation is a measure of variation of all values from the mean.
• The value of the standard deviation s is usually positive. It is zero only when all of the data values are equal, and it is never negative.
• The value of the standard deviation s can increase dramatically with the inclusion of one or more outliers (data values far away from all others).
• The units of the standard deviation s are the same as the units of the original data values.
Comparing Variation in
Different Samples
It’s a good practice to compare two
sample standard deviations only when
the sample means are approximately
the same.
When comparing variation in samples
with very different means, it is better to
use the coefficient of variation, which is
defined later in this section.
Population Standard Deviation
$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
This formula is similar to the previous formula, but instead, the population mean and population size are used.
Variance
• The variance of a set of values is a measure of variation equal to the square of the standard deviation.
• Sample variance: s², the square of the sample standard deviation s
• Population variance: σ², the square of the population standard deviation σ
Unbiased Estimator
The sample variance s² is an unbiased estimator of the population variance σ², which means values of s² tend to target the value of σ² instead of systematically tending to overestimate or underestimate σ².
Variance - Notation
s = sample standard deviation
s² = sample variance
σ = population standard deviation
σ² = population variance
Properties of the
Standard Deviation
• Measures the variation among data
values
• Values close together have a small
standard deviation, but values with
much more variation have a larger
standard deviation
• Has the same units of measurement
as the original data
• For many data sets, a value is unusual
if it differs from the mean by more
than two standard deviations
• Compare standard deviations of two different data sets only if they use the same scale and units, and they have means that are approximately the same
Coefficient of Variation
The coefficient of variation (or CV) for a set
of nonnegative sample or population data,
expressed as a percent, describes the
standard deviation relative to the mean.
Sample: $CV = \frac{s}{\bar{x}} \cdot 100\%$
Population: $CV = \frac{\sigma}{\mu} \cdot 100\%$
Example: How to compare the variability in heights and weights of men?
Sample: 40 males were randomly selected. The summarized statistics are given below.

          Sample mean    Sample standard deviation
Height    68.34 in       3.02 in
Weight    172.55 lb      26.33 lb

Solution: Use CV to compare the variability.
Heights: $CV = \frac{s}{\bar{x}} \cdot 100\% = \frac{3.02}{68.34} \cdot 100\% = 4.42\%$
Weights: $CV = \frac{s}{\bar{x}} \cdot 100\% = \frac{26.33}{172.55} \cdot 100\% = 15.26\%$
Conclusion: Heights (with CV = 4.42%) have considerably less variation than weights (with CV = 15.26%).
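The comparison can be reproduced with a few lines of Python. The function name `coefficient_of_variation` is chosen for illustration; the inputs are the summary statistics from the example.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation relative to the mean, as a percent."""
    return std_dev / mean * 100

cv_height = coefficient_of_variation(3.02, 68.34)     # inches
cv_weight = coefficient_of_variation(26.33, 172.55)   # pounds
print(f"{cv_height:.2f}%  {cv_weight:.2f}%")
```

Because the units cancel in the ratio, the two percentages can be compared directly even though one sample is in inches and the other in pounds.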
Section 3.3
Percentiles and Box-and-Whisker Plots
Focus Points
• Interpret the meaning of percentile scores.
• Compute the median, quartiles, and five-number summary from raw data.
• Make a box-and-whisker plot. Interpret the results.
• Describe how a box-and-whisker plot indicates spread of data about the median.
Percentiles and Box-and-Whisker
Plots
• We’ve seen measures of central tendency and spread
for a set of data. The arithmetic mean x̄ and the
standard deviation s will be very useful in later work.
• However, because they each utilize every data value,
they can be heavily influenced by one or two extreme
data values.
• In cases where our data distributions are heavily
skewed or even bimodal, we often get a better
summary of the distribution by utilizing relative
position of data rather than exact values.
• We know that the median is an average computed by
using relative position of the data.
If we are told that 81 is the median score on a biology
test, we know that after the data have been ordered,
50% of the data fall at or below the median value of
81.
The median is an example of a percentile; in fact, it is
the 50th percentile. The general definition of the Pth
percentile follows.
• In Figure 3-3, we see the 60th percentile marked on a
histogram. We see that 60% of the data lie below the
mark and 40% lie above it.
A Histogram with the 60th Percentile Shown
Figure 3-3
• There are 99 percentiles, and in an ideal situation, the
99 percentiles divide the data set into 100 equal
parts.
(See Figure 3-4.)
However, if the number of data elements is not
exactly divisible by 100, the percentiles will not
divide the data into equal parts.
Percentiles
Figure 3-4
• There are several widely used conventions for
finding percentiles. They lead to slightly different
values for different situations, but these values are
close together.
• For all conventions, the data are first ranked or
ordered from smallest to largest. A natural way to
find the Pth percentile is to then find a value such
that P% of the data fall at or below it.
• This will not always be possible, so we take the
nearest value satisfying the criterion. It is at this
point that there is a variety of processes to determine
the exact value of the percentile.
• We will not be very concerned about exact
procedures for evaluating percentiles in general.
However, quartiles are special percentiles used so
frequently that we want to adopt a specific procedure
for their computation.
Quartiles are those percentiles that divide the data
into fourths.
• The first quartile Q1 is the 25th percentile, the second
quartile Q2 is the median, and the third quartile Q3 is
the 75th percentile. (See Figure 3-5.)
Quartiles
Figure 3-5
• Again, several conventions are used for computing
quartiles, but the convention described next utilizes
the median and is widely adopted.
• Procedure
• In short, all we do to find the quartiles is find three
medians. The median, or second quartile, is a
popular measure of the center utilizing relative
position.
• A useful measure of data spread utilizing relative
position is the interquartile range (IQR). It is simply
the difference between the third and first quartiles.
• Interquartile range = Q3 – Q1
• The interquartile range tells us the spread of the
middle half of the data. Now let’s look at an example
to see how to compute all of these quantities.
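Finding the quartiles as three medians can be sketched in Python as follows. This assumes the convention used in the worked example (Q1 is the median of the values below the median position, Q3 the median of the values above it); the function name `quartiles` and the small data set are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd count, the middle value belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

sample = [7, 1, 4, 2, 6, 3, 5]   # hypothetical data
q1, q2, q3 = quartiles(sample)
print(q1, q2, q3)
print("IQR =", q3 - q1)
```

Other conventions, including those used by many software packages, can give slightly different quartiles for the same data.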
Example – Quartiles
• In a hurry? On the run? Hungry as well? How about an ice
cream bar as a snack? Ice cream bars are popular among
all age groups.
Consumer Reports did a study of ice cream bars.
Twenty-seven bars with taste ratings of at least “fair” were
listed, and cost per bar was included in the report.
Just how much does an ice cream bar cost? The data,
expressed in dollars, appear in Table 3-4.
Cost of Ice Cream Bars (in dollars)
Table 3-4
• As you can see, the cost varies quite a bit, partly
because the bars are not of uniform size.
(a) Find the quartiles.
• Solution:
• We first order the data from smallest to largest. Table
3-5 shows the data in order.
Ordered Cost of Ice Cream Bars (in dollars)
Table 3-5
• Next, we find the median.
• Since the number of data values is 27, there is an odd number of data values, and the median is simply the center or 14th value.
• The value is shown boxed in Table 3-5.
• Median = Q2 = 0.50
• There are 13 values below the median position, and Q1 is the median of these values.
• It is the middle or seventh value and is shaded in Table 3-5.
• First quartile = Q1 = 0.33
• There are also 13 values above the median position. The median of these is the seventh value from the right end.
• This value is also shaded in Table 3-5.
• Third quartile = Q3 = 1.00
• (b) Find the interquartile range.
• Solution:
• IQR = Q3 – Q1 = 1.00 – 0.33 = 0.67
• This means that the middle half of the data has a cost
spread of 67¢.
Box-and-Whisker Plots
• The quartiles together with the low and high data
values give us a very useful five-number summary of
the data and their spread.
• We will use these five numbers to create a graphic
sketch of the data called a box-and-whisker plot.
Box-and-whisker plots provide another useful
technique from exploratory data analysis (EDA) for
describing data.
• Procedure
Box-and-Whisker Plot
Figure 3-6
• The next example demonstrates the process of
making a box-and-whisker plot.
Example – Box-and-whisker plot
• Make a box-and-whisker plot showing the calories in
vanilla-flavored ice cream bars.
• Use the plot to make observations about the
distribution of calories.
• (a) We ordered the data (see Table 3-7) and found the
values of the median, Q1, and Q3.
Ordered Data
Table 3-7
• From this previous work we have the following five-number summary:
• low value = 111; Q1 = 182; median = 221.5; Q3 = 319;
high value = 439
• (b) We select an appropriate vertical scale and make
the plot (Figure 3-7).
Box-and-Whisker Plot for Calories in
Vanilla-Flavored Ice Cream Bars
Figure 3-7
• (c) Interpretation A quick glance at the box-and-whisker
plot reveals the following:
(i) The box tells us where the middle half of the data lies, so
we see that half of the ice cream bars have between 182
and 319 calories, with an interquartile range of 137
calories.
(ii) The median is slightly closer to the lower part of the box.
This means that the lower calorie counts are more
concentrated. The calorie counts above the median are
more spread out, indicating that the distribution is
slightly skewed toward the higher values.
• (iii) The upper whisker is longer than the lower, which again emphasizes skewness toward the higher values.
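The observations in this interpretation can be checked numerically from the five-number summary given in the example. The variable names below are illustrative.

```python
# Five-number summary for the calories example.
low, q1, med, q3, high = 111, 182, 221.5, 319, 439

iqr = q3 - q1                 # spread of the middle half of the data
lower_whisker = q1 - low      # length of the lower whisker
upper_whisker = high - q3     # length of the upper whisker

print(iqr, lower_whisker, upper_whisker)   # 137 71 120
print(med - q1, q3 - med)                  # 39.5 97.5
```

The upper whisker and the upper part of the box are both longer than their lower counterparts, which is the numerical form of the skewness toward higher values noted above.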