Chapter 4
Descriptive Statistics (Part 1)
Numerical Description
Central Tendency
Dispersion
McGraw-Hill/Irwin
Copyright © 2009 by The McGraw-Hill Companies, Inc.
Numerical Description
• Statistics are descriptive measures derived
from a sample (n items).
• Parameters are descriptive measures derived
from a population (N items).
4A-2
Numerical Description
• Three key characteristics of numerical data:
Characteristic: Central Tendency
Interpretation: Where are the data values concentrated? What seem to be typical or middle data values?
Characteristic: Dispersion
Interpretation: How much variation is there in the data? How spread out are the data values? Are there unusual values?
Characteristic: Shape
Interpretation: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
4A-3
Numerical Description
Example: Vehicle Quality
• Consider the data set of vehicle defect rates
from J. D. Power and Associates.
• Defect rate = (total no. defects / no. inspected) × 100
• Numerical statistics can be used to summarize
this random sample of brands.
• Must allow for sampling error since the
analysis is based on sampling.
4A-4
Numerical Description
• Number of defects per 100 vehicles, 2006 models.
4A-5
Numerical Description
To begin, sort the
data in Excel.
4A-6
Numerical Description
• Sorted data provides insight into central
tendency and dispersion.
4A-7
Numerical Description
Visual Displays
• The dot plot offers a visual impression of the
data.
4A-8
Numerical Description
Visual Displays
• Histograms with 5 bins (suggested by Sturges' Rule) and 10 bins are shown below.
• Both are symmetric with no extreme values
and show a modal class toward the low end.
4A-9
Central Tendency
• Central tendency refers to the middle or typical values of a distribution.
• Central tendency can be assessed using a dot
plot, histogram or more precisely with
numerical statistics.
4A-10
Central Tendency
Six Measures of Central Tendency
Statistic: Mean
Formula: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Excel Formula: =AVERAGE(Data)
Pro: Familiar and uses all the sample information.
Con: Influenced by extreme values.
Statistic: Median
Formula: Middle value in sorted array
Excel Formula: =MEDIAN(Data)
Pro: Robust when extreme data values exist.
Con: Ignores extremes and can be affected by gaps in data values.
4A-11
Central Tendency
Six Measures of Central Tendency
Statistic: Mode
Formula: Most frequently occurring data value
Excel Formula: =MODE(Data)
Pro: Useful for attribute data or discrete data with a small range.
Con: May not be unique, and is not helpful for continuous data.
Statistic: Midrange
Formula: $\frac{x_{min} + x_{max}}{2}$
Excel Formula: =0.5*(MIN(Data)+MAX(Data))
Pro: Easy to understand and calculate.
Con: Influenced by extreme values and ignores most data values.
4A-12
Central Tendency
Six Measures of Central Tendency
Statistic: Geometric mean (G)
Formula: $G = \sqrt[n]{x_1 x_2 \cdots x_n}$
Excel Formula: =GEOMEAN(Data)
Pro: Useful for growth rates and mitigates high extremes.
Con: Less familiar and requires positive data.
Statistic: Trimmed mean
Formula: Same as the mean except omit highest and lowest k% of data values (e.g., 5%)
Excel Formula: =TRIMMEAN(Data, Percent)
Pro: Mitigates effects of extreme values.
Con: Excludes some data values that could be relevant.
4A-13
Central Tendency
Mean
• A familiar measure of central tendency.
Population Formula: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
Sample Formula: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• In Excel, use function =AVERAGE(Data) where
Data is an array of data values.
4A-14
Central Tendency
Mean
• For the sample of n = 37 car brands:
4A-15
Central Tendency
Characteristics of the Mean
• Arithmetic mean is the most familiar average.
• Affected by every sample item.
• The balancing point or fulcrum for the data.
4A-16
Central Tendency
Characteristics of the Mean
• Regardless of the shape of the distribution, the signed deviations of the data points from the mean always sum to zero:
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
• Consider the following asymmetric distribution of quiz scores whose mean = 65.
$\sum_{i=1}^{n} (x_i - \bar{x})$ = (42 – 65) + (60 – 65) + (70 – 65) + (75 – 65) + (78 – 65)
= (-23) + (-5) + (5) + (10) + (13) = -28 + 28 = 0
4A-17
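The balancing-point property can be checked numerically. The following is a minimal Python sketch using the five quiz scores from the slide; the variable names are illustrative only. It also shows that the absolute deviations, unlike the signed ones, do not cancel.

```python
# Quiz scores from the slide; their mean is 65.
scores = [42, 60, 70, 75, 78]

mean = sum(scores) / len(scores)

# Signed deviations from the mean cancel out.
deviations = [x - mean for x in scores]
total_signed = sum(deviations)

# Absolute deviations ignore the sign, so they do not cancel.
total_absolute = sum(abs(d) for d in deviations)

print(mean, deviations, total_signed, total_absolute)
```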
Central Tendency
Median
• The median (M) is the 50th percentile or
midpoint of the sorted sample data.
• M separates the upper and lower half of the
sorted observations.
• If n is odd, the median is the middle
observation in the data array.
• If n is even, the median is the average of the
middle two observations in the data array.
4A-18
Central Tendency
Median
• Consider the following n = 6 data values:
11 12 15 17 21 32
• What is the median?
For even n, Median = $\frac{x_{n/2} + x_{(n/2+1)}}{2}$
n/2 = 6/2 = 3 and n/2+1 = 6/2 + 1 = 4
M = (x3+x4)/2 = (15+17)/2 = 16
(Figure: the sorted values 11 12 15 17 21 32 with the median 16 marked between 15 and 17.)
4A-19
Central Tendency
Median
(Figure 4.6)
• For n = 8, the median is between the fourth and fifth
observations in the data array.
4A-20
Central Tendency
Median
• For n = 9, the median is the fifth observation in the
data array.
4A-21
Central Tendency
Median
• Consider the following n = 7 data values:
12 23 23 25 27 34 41
• What is the median?
For odd n, Median = $x_{(n+1)/2}$
(n+1)/2 = (7+1)/2 = 8/2 = 4
M = x4 = 25
(Figure: the sorted values 12 23 23 25 27 34 41 with the median 25 marked.)
4A-22
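The position rules for odd and even n translate directly into code. This is a small Python sketch, not Excel's implementation; the function name `median` and the variable names are chosen here for illustration, and the two data sets are the ones from the slides.

```python
def median(values):
    """Return the median using the odd/even position rules from the slides."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        # Odd n: the observation in position (n + 1) / 2 (1-based).
        return data[(n + 1) // 2 - 1]
    # Even n: average the observations in positions n/2 and n/2 + 1 (1-based).
    return (data[n // 2 - 1] + data[n // 2]) / 2

even_example = [11, 12, 15, 17, 21, 32]
odd_example = [12, 23, 23, 25, 27, 34, 41]
print(median(even_example))  # 16.0
print(median(odd_example))   # 25
```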
Central Tendency
Median
• Use Excel’s function =MEDIAN(Data) where
Data is an array of data values.
• For the 37 vehicle quality ratings (odd n) the
position of the median is
(n+1)/2 = (37+1)/2 = 19.
• So, the median is x19 = 121.
• When there are several duplicate data values,
the median does not provide a clean “50-50”
split in the data.
4A-23
Central Tendency
Characteristics of the Median
• The median is insensitive to extreme data values.
• For example, consider the following quiz scores for 3 students:
Tom’s scores: 20, 40, 70, 75, 80 (Mean = 57, Median = 70, Total = 285)
Jake’s scores: 60, 65, 70, 90, 95 (Mean = 76, Median = 70, Total = 380)
Mary’s scores: 50, 65, 70, 75, 90 (Mean = 70, Median = 70, Total = 350)
• What does the median for each student tell you?
4A-24
Central Tendency
Mode
• The most frequently occurring data value.
• Similar to mean and median if data values
occur often near the center of sorted data.
• May have multiple modes or no mode.
4A-25
Central Tendency
Mode
• For example, consider the following quiz scores for 4 students:
Lee’s scores: 60, 70, 70, 70, 80 (Mean = 70, Median = 70, Mode = 70)
Pat’s scores: 45, 45, 70, 90, 100 (Mean = 70, Median = 70, Mode = 45)
Sam’s scores: 50, 60, 70, 80, 90 (Mean = 70, Median = 70, Mode = none)
Xiao’s scores: 50, 50, 70, 90, 90 (Mean = 70, Median = 70, Modes = 50, 90)
• What does the mode for each student tell you?
4A-26
Central Tendency
Mode
• Easy to define, not easy to calculate in large
samples.
• Use Excel’s function =MODE(Array)
- will return #N/A if there is no mode.
- will return first mode found if multimodal.
• May be far from the middle of the distribution
and not at all typical.
4A-27
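Because a data set may have one mode, several modes, or none, it can help to return all of them rather than only the first one found. The sketch below is one possible Python approach using the four students' quiz scores from the slides; the helper name `modes` and the choice to report "no mode" as an empty list are assumptions made for this illustration.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] when no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: treat as "no mode"
    return sorted(v for v, c in counts.items() if c == top)

quiz_scores = {
    "Lee": [60, 70, 70, 70, 80],
    "Pat": [45, 45, 70, 90, 100],
    "Sam": [50, 60, 70, 80, 90],
    "Xiao": [50, 50, 70, 90, 90],
}
for student, scores in quiz_scores.items():
    print(student, modes(scores))
```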
Central Tendency
Mode
• Generally isn’t useful for continuous data
since data values rarely repeat.
• Best for attribute data or a discrete variable
with a small range (e.g., Likert scale).
4A-28
Central Tendency
Example: Price/Earnings Ratios and Mode
• Consider the following P/E ratios for a random sample of 68 Standard & Poor’s 500 stocks.
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
• What is the mode?
4A-29
Central Tendency
Example: Price/Earnings Ratios and Mode
• Excel’s descriptive statistics results are:
Mean: 22.7206
Median: 19
Mode: 13
Range: 84
Minimum: 7
Maximum: 91
Sum: 1545
Count: 68
• The mode 13 occurs 7 times, but what does the dot plot show?
4A-30
Central Tendency
Example: Price/Earnings Ratios and Mode
• The dot plot shows local modes (a peak with
valleys on either side) at 10, 13, 15, 19, 23, 26,
29.
• These multiple modes suggest that the mode
is not a stable measure of central tendency.
4A-31
Central Tendency
Example: Rose Bowl Winners’ Points
• Points scored by the winning NCAA football team tend to have modes in multiples of 7 because each touchdown with a successful extra point yields 7 points.
• Consider the dot plot of the points scored by
the winning team in the first 87 Rose Bowl
games.
• What is the mode?
4A-32
Central Tendency
Mode
• A bimodal distribution refers to the shape of the
histogram rather than the mode of the raw data.
• Occurs when dissimilar populations are combined in
one sample. For example,
4A-33
Central Tendency
Skewness
• Compare mean and median or look at
histogram to determine degree of skewness.
4A-34
Central Tendency
Symptoms of Skewness
Distribution’s Shape: Skewed left (negative skewness)
Histogram Appearance: Long tail of histogram points left (a few low values but most data on right)
Statistics: Mean < Median
Distribution’s Shape: Symmetric
Histogram Appearance: Tails of histogram are balanced (low/high values offset)
Statistics: Mean ≈ Median
Distribution’s Shape: Skewed right (positive skewness)
Histogram Appearance: Long tail of histogram points right (most data on left but a few high values)
Statistics: Mean > Median
4A-35
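The mean-versus-median comparison in the table can be written as a small rule. This Python sketch is only a rough screening aid, not a formal skewness test; the function name `skew_direction` and its `tolerance` parameter are assumptions introduced here. The inputs are the means and medians reported elsewhere on these slides.

```python
def skew_direction(mean, median, tolerance=0.0):
    """Classify skewness from the mean/median comparison in the table."""
    if mean - median > tolerance:
        return "skewed right"
    if median - mean > tolerance:
        return "skewed left"
    return "symmetric"

# (mean, median) pairs taken from the slides.
examples = {
    "P/E ratios": (22.7206, 19),
    "Tom's quiz scores": (57, 70),
    "Mary's quiz scores": (70, 70),
}
for name, (mean, median) in examples.items():
    print(name, "->", skew_direction(mean, median))
```

With very small samples, such as five quiz scores, the comparison is only suggestive, so a histogram remains the better check.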
Central Tendency
Skewness
• For the sample of spending per customer at 74
Noodles &, the mean ($7.04) exceeds the
median ($7.00). What does this suggest?
4A-36
Central Tendency
Geometric Mean
• The geometric mean (G) is a multiplicative average.
$G = \sqrt[n]{x_1 x_2 \cdots x_n}$
• For the J. D. Power quality data (n=37):
$G = \sqrt[37]{(87)(93)(98) \cdots (164)(173)} = \sqrt[37]{2.37667 \times 10^{77}} = 123.38$
• In Excel use =GEOMEAN(Array)
• The geometric mean tends to mitigate the
effects of high outliers.
4A-37
Central Tendency
Growth Rates
• A variation on the geometric mean used to find
the average growth rate for a time series.
• For example, from 2002 to 2006, JetBlue Airlines revenues are:
Year: Revenue (mil)
2002: 635
2003: 998
2004: 1265
2005: 1701
2006: 2363
4A-38
Central Tendency
Growth Rates
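• The average growth rate is given by taking the geometric mean of the ratios of each year’s revenue to the preceding year.
• Due to cancellations, only the first and last years are relevant:
Average growth rate = $\sqrt[4]{\frac{998}{635} \cdot \frac{1265}{998} \cdot \frac{1701}{1265} \cdot \frac{2363}{1701}} - 1 = \sqrt[4]{\frac{2363}{635}} - 1 = 1.389 - 1 = 0.389$
or 38.9% per year. In Excel, we would use =(2363/635)^(1/4)-1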
4A-39
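The cancellation can be confirmed numerically. The Python sketch below computes the average growth rate both ways from the JetBlue revenues on the slide: once from the year-over-year ratios and once from the endpoints only. Variable names are illustrative, and `math.prod` requires Python 3.8 or later.

```python
import math

# JetBlue revenues (millions) for 2002-2006, from the slide.
revenue = [635, 998, 1265, 1701, 2363]

# Year-over-year ratios: each year's revenue over the preceding year's.
ratios = [later / earlier for earlier, later in zip(revenue, revenue[1:])]

# Geometric mean of the ratios, minus 1, gives the average growth rate.
growth_from_ratios = math.prod(ratios) ** (1 / len(ratios)) - 1

# The intermediate years cancel, so only the endpoints matter.
periods = len(revenue) - 1
growth_from_endpoints = (revenue[-1] / revenue[0]) ** (1 / periods) - 1

print(f"{growth_from_ratios:.1%}  {growth_from_endpoints:.1%}")
```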
Central Tendency
Midrange
• The midrange is the point halfway between the
lowest and highest values of X.
• Easy to use but sensitive to extreme data values.
Midrange = $\frac{x_{min} + x_{max}}{2}$
• For the J. D. Power quality data (n=37):
Midrange = $\frac{x_{min} + x_{max}}{2} = \frac{91 + 204}{2} = 147.5$
• Here, the midrange (147.5) is higher than the mean
(134.51) or median (132).
4A-40
Central Tendency
Trimmed Mean
• To calculate the trimmed mean, first remove the
highest and lowest k percent of the observations.
• For example, for the n = 68 P/E ratios, we want a 5
percent trimmed mean (i.e., k = .05).
• To determine how many observations to trim, multiply k × n = 0.05 × 68 = 3.4, which is rounded down to 3 observations.
• So, we would remove the three smallest and three
largest observations before averaging the remaining
values.
4A-41
Central Tendency
Trimmed Mean
• Here is a summary of all the measures of
central tendency for the n = 68 P/E values.
Mean: 22.72 =AVERAGE(PERatio)
Median: 19.00 =MEDIAN(PERatio)
Mode: 13.00 =MODE(PERatio)
Geometric Mean: 19.85 =GEOMEAN(PERatio)
Midrange: 49.00 =(MIN(PERatio)+MAX(PERatio))/2
5% Trim Mean: 21.10 =TRIMMEAN(PERatio,0.1)
• The trimmed mean mitigates the effects of very high
values, but still exceeds the median.
4A-42
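The six measures can be reproduced outside Excel. The following Python sketch uses the standard `statistics` module on the 68 P/E ratios listed on the earlier slide; `trimmed_mean` is a helper written for this illustration, not a library function. Note that Excel's TRIMMEAN takes the total fraction to exclude from both tails combined, which is why 0.1 in the Excel formula corresponds to trimming 5 percent from each end.

```python
import statistics

# The 68 P/E ratios listed on the earlier slide.
pe_ratios = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

def trimmed_mean(values, k):
    """Mean after dropping the lowest and highest k fraction of sorted data."""
    data = sorted(values)
    cut = int(k * len(data))  # truncate: 0.05 * 68 = 3.4 -> 3 per tail
    kept = data[cut:len(data) - cut]
    return sum(kept) / len(kept)

summary = {
    "mean": statistics.mean(pe_ratios),
    "median": statistics.median(pe_ratios),
    "mode": statistics.mode(pe_ratios),
    "geometric mean": statistics.geometric_mean(pe_ratios),
    "midrange": (min(pe_ratios) + max(pe_ratios)) / 2,
    "5% trimmed mean": trimmed_mean(pe_ratios, 0.05),
}
for name, value in summary.items():
    print(f"{name:>16}: {value:.2f}")
```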
Central Tendency
Trimmed Mean
• The Federal Reserve uses a 16% trimmed mean to mitigate the effects of extremes in its analysis of the Consumer Price Index.
4A-43
Dispersion
• Variation is the “spread” of data points about
the center of the distribution in a sample.
Consider the following measures of dispersion:
Measures of Variation
Statistic: Range
Formula: $x_{max} - x_{min}$
Excel: =MAX(Data)-MIN(Data)
Pro: Easy to calculate.
Con: Sensitive to extreme data values.
Statistic: Variance ($s^2$)
Formula: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Excel: =VAR(Data)
Pro: Plays a key role in mathematical statistics.
Con: Non-intuitive meaning.
4A-44
Dispersion
Measures of Variation
Statistic: Standard deviation (s)
Formula: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Excel: =STDEV(Data)
Pro: Most common measure. Uses same units as the raw data ($, £, ¥, etc.).
Con: Nonintuitive meaning.
Statistic: Coefficient of variation (CV)
Formula: $CV = 100 \times \frac{s}{\bar{x}}$
Excel: None
Pro: Measures relative variation in percent so can compare data sets.
Con: Requires nonnegative data.
4A-45
Dispersion
Measures of Variation
Statistic: Mean absolute deviation (MAD)
Formula: $\text{MAD} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
Excel: =AVEDEV(Data)
Pro: Easy to understand.
Con: Lacks “nice” theoretical properties.
4A-46
Dispersion
Range
• The difference between the largest and
smallest observation.
Range = xmax – xmin
• For example, for the n = 68 P/E ratios,
Range = 91 – 7 = 84
4A-47
Dispersion
Variance
• The population variance ($\sigma^2$) is defined as the sum of squared deviations around the mean $\mu$ divided by the population size.
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• For the sample variance ($s^2$), we divide by n – 1 instead of n, otherwise $s^2$ would tend to underestimate the unknown population variance $\sigma^2$.
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
4A-48
Dispersion
Standard Deviation
• The square root of the variance.
• Explains how individual values in a data set
vary from the mean.
• Units of measure are the same as X.
Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
4A-49
Dispersion
Standard Deviation
• Excel’s built-in functions are:
Statistic: Variance
Excel population formula: =VARP(Array)
Excel sample formula: =VAR(Array)
Statistic: Standard deviation
Excel population formula: =STDEVP(Array)
Excel sample formula: =STDEV(Array)
4A-50
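The population and sample versions differ only in the divisor (N versus n – 1). As a hedged illustration, the Python sketch below applies the standard `statistics` module to the five quiz scores from the earlier mean example (42, 60, 70, 75, 78); treating those scores first as a population and then as a sample is an assumption made purely to show both formulas.

```python
import statistics

# Quiz scores reused from the earlier mean example (mean = 65).
scores = [42, 60, 70, 75, 78]

pop_var = statistics.pvariance(scores)   # divides by N, like =VARP
samp_var = statistics.variance(scores)   # divides by n - 1, like =VAR
pop_sd = statistics.pstdev(scores)       # like =STDEVP
samp_sd = statistics.stdev(scores)       # like =STDEV

print(f"population: variance {pop_var:.1f}, std dev {pop_sd:.2f}")
print(f"sample:     variance {samp_var:.1f}, std dev {samp_sd:.2f}")
```

The sum of squared deviations here is 848, so the sample variance (848/4) is larger than the population variance (848/5).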
Dispersion
Calculating a Standard Deviation
• Consider the following five quiz scores for Stephanie. (Table 4.12)
4A-51
Dispersion
Calculating a Standard Deviation
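• Now, calculate the sample standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{2380}{5 - 1}} = \sqrt{595} = 24.39$
• The two-sum formula can also be used:
$s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}} = \sqrt{\frac{28300 - \frac{(360)^2}{5}}{5 - 1}} = \sqrt{\frac{28300 - 25920}{5 - 1}} = \sqrt{595} = 24.39$
4A-52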
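The two-sum formula needs only the count, the sum of the values, and the sum of their squares. The Python sketch below applies it to the totals quoted on the slide (n = 5, sum = 360, sum of squares = 28300); the function name `sample_sd_from_sums` is hypothetical.

```python
import math

def sample_sd_from_sums(n, sum_x, sum_x_sq):
    """Sample standard deviation from n, the sum of x, and the sum of x squared."""
    if n < 2:
        raise ValueError("need at least two observations")
    # Equals the sum of squared deviations around the mean.
    sum_sq_dev = sum_x_sq - sum_x ** 2 / n
    return math.sqrt(sum_sq_dev / (n - 1))

# Totals quoted on the slide for the five quiz scores.
s = sample_sd_from_sums(n=5, sum_x=360, sum_x_sq=28300)
print(round(s, 2))
```

In floating-point arithmetic this shortcut can lose precision when the mean is large relative to the spread, so the deviation-based formula is usually the safer choice in software.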
Dispersion
Calculating a Standard Deviation
• The standard deviation is nonnegative because
deviations around the mean are squared.
• When every observation is exactly equal to the
mean, the standard deviation is zero.
• Standard deviations can be large or small,
depending on the units of measure.
• Compare standard deviations only for data
sets measured in the same units and only if
the means do not differ substantially.
4A-53
Dispersion
Coefficient of Variation
• Useful for comparing variables measured in
different units or with different means.
• A unit-free measure of dispersion
• Expressed as a percent of the mean.
$CV = 100 \times \frac{s}{\bar{x}}$
• Only appropriate for nonnegative data. It is
undefined if the mean is zero or negative.
4A-54
Dispersion
Coefficient of Variation
• For example:
$CV = 100 \times \frac{s}{\bar{x}}$
Defect rates (n = 37): s = 22.89, $\bar{x}$ = 125.38 gives CV = 100 × (22.89)/(125.38) = 18%
ATM deposits (n = 100): s = 280.80, $\bar{x}$ = 233.89 gives CV = 100 × (280.80)/(233.89) = 120%
P/E ratios (n = 68): s = 14.08, $\bar{x}$ = 22.72 gives CV = 100 × (14.08)/(22.72) = 62%
4A-55
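The three CV calculations can be reproduced with a few lines of Python. In this sketch the function name `coefficient_of_variation` is illustrative, the guard on the mean reflects the restriction stated on the previous slide, and the P/E standard deviation is taken as 14.08, the value used in the CV arithmetic.

```python
def coefficient_of_variation(s, mean):
    """Return CV = 100 * s / mean as a percentage; the mean must be positive."""
    if mean <= 0:
        raise ValueError("CV is undefined when the mean is zero or negative")
    return 100 * s / mean

# (standard deviation, mean) pairs from the slide.
examples = {
    "Defect rates": (22.89, 125.38),
    "ATM deposits": (280.80, 233.89),
    "P/E ratios": (14.08, 22.72),
}
for name, (s, mean) in examples.items():
    print(f"{name}: CV = {coefficient_of_variation(s, mean):.0f}%")
```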
Dispersion
Mean Absolute Deviation
• The Mean Absolute Deviation (MAD) reveals
the average distance from an individual data
point to the mean (center of the distribution).
• Uses absolute values of the deviations around
the mean.
$\text{MAD} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
• Excel’s function is =AVEDEV(Array)
4A-56
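As a rough illustration of the MAD formula, the Python sketch below applies it to the five quiz scores from the earlier mean example (42, 60, 70, 75, 78). The function name `mean_absolute_deviation` is chosen here and is not part of any library.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean."""
    n = len(values)
    if n == 0:
        raise ValueError("MAD of empty data is undefined")
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

# Quiz scores reused from the earlier mean example (mean = 65).
scores = [42, 60, 70, 75, 78]
mad = mean_absolute_deviation(scores)
print(mad)
```

The absolute deviations are 23, 5, 5, 10 and 13, which total 56, so the MAD is 56/5 = 11.2.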
Dispersion
Central Tendency vs. Dispersion:
Manufacturing
Machine A
Desired mean (5mm)
but too much variation.
Machine B
Acceptable variation but
mean is less than 5 mm.
• Take frequent samples to monitor quality.
4A-57
Dispersion
Central Tendency vs. Dispersion:
Job Performance
• Consider student ratings of four professors on
eight teaching attributes (10-point scale).
4A-58
Dispersion
Central Tendency vs. Dispersion:
Job Performance
• Jones and Wu have identical means but
different standard deviations.
4A-59
Dispersion
Central Tendency vs. Dispersion:
Job Performance
• Smith and Gopal have different means but
identical standard deviations.
4A-60
Dispersion
Central Tendency vs. Dispersion:
Job Performance
• A high mean (better rating) and low standard
deviation (more consistency) is preferred.
Which professor do you think is best?
4A-61
Applied Statistics in
Business & Economics
End of Chapter 4A
4A-62