Essential Statistics 1/e
4A-1
Chapter 4A
Descriptive Statistics (Part 1)
Numerical Description
Central Tendency
Dispersion
McGraw-Hill/Irwin
© 2008 The McGraw-Hill Companies, Inc. All rights reserved.
4A-3
Numerical Description
• Statistics are descriptive measures derived from a
sample (n items).
• Parameters are descriptive measures derived from
a population (N items).
4A-4
Numerical Description
• Three key characteristics of numerical data:
Characteristic | Interpretation
Central Tendency | Where are the data values concentrated? What seem to be typical or middle data values?
Dispersion | How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape | Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
4A-5
Numerical Description
■ Example: Vehicle Quality
• Consider the data set of vehicle defect rates from J. D. Power and Associates.
• Defect rate = (total no. of defects / no. inspected) × 100
• Numerical statistics can be used to summarize this
random sample of brands.
• Must allow for sampling error since the analysis is
based on sampling.
4A-6
Numerical Description
• Number of defects per 100 vehicles, 2004 models.
4A-7
To begin, sort the data in Excel.
4A-8
Numerical Description
• Sorted data provides insight into central tendency
and dispersion.
4A-9
Numerical Description
■ Visual Displays
• The dot plot offers a visual impression of the data.
4A-10
Numerical Description
■ Visual Displays
• Histograms with 5 bins (suggested by Sturges’ Rule) and 10 bins are shown below.
• Both are symmetric with no extreme values and
show a modal class toward the low end.
4A-11
Descriptive Statistics in Excel
Go to Tools | Data Analysis and select Descriptive Statistics.
4A-12
Highlight the data range, specify a cell for the upper-left corner of the output range, check Summary Statistics and click OK.
4A-13
Here is the resulting analysis.
4A-14
Descriptive Statistics in MegaStat
4A-15
Here is the resulting MegaStat analysis:
4A-16
Central Tendency
• Central tendency refers to the middle or typical values of a distribution.
• Central tendency can be assessed using a dot plot or histogram or, more precisely, with numerical statistics.
4A-17
Central Tendency
■ Six Measures of Central Tendency
Mean
   Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
   Excel formula: =AVERAGE(Data)
   Pro: Familiar and uses all the sample information.
   Con: Influenced by extreme values.
Median
   Formula: Middle value in sorted array
   Excel formula: =MEDIAN(Data)
   Pro: Robust when extreme data values exist.
   Con: Ignores extremes and can be affected by gaps in data values.
4A-18
Central Tendency
■ Six Measures of Central Tendency
Mode
   Formula: Most frequently occurring data value
   Excel formula: =MODE(Data)
   Pro: Useful for attribute data or discrete data with a small range.
   Con: May not be unique, and is not helpful for continuous data.
Midrange
   Formula: $\frac{x_{\min} + x_{\max}}{2}$
   Excel formula: =0.5*(MIN(Data)+MAX(Data))
   Pro: Easy to understand and calculate.
   Con: Influenced by extreme values and ignores most data values.
4A-19
Central Tendency
■ Six Measures of Central Tendency
Geometric mean (G)
   Formula: $G = \sqrt[n]{x_1 x_2 \cdots x_n}$
   Excel formula: =GEOMEAN(Data)
   Pro: Useful for growth rates and mitigates high extremes.
   Con: Less familiar and requires positive data.
Trimmed mean
   Formula: Same as the mean except omit highest and lowest k% of data values (e.g., 5%)
   Excel formula: =TRIMMEAN(Data, %)
   Pro: Mitigates effects of extreme values.
   Con: Excludes some data values that could be relevant.
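As a rough illustration of the last two measures, the sketch below computes a geometric mean and a trimmed mean in Python. The growth factors and the sample list are made-up values for illustration only, and `trimmed_mean` is a hypothetical helper that drops k% of the observations from each end; Excel's own trimming rule may round the number of dropped values differently.

```python
import math
import statistics

def trimmed_mean(data, k_percent):
    """Mean after dropping the lowest and highest k% of the sorted values.

    Assumption: the number trimmed from each end is floor(n * k / 100).
    """
    values = sorted(data)
    cut = math.floor(len(values) * k_percent / 100)
    kept = values[cut:len(values) - cut] if cut else values
    return statistics.mean(kept)

# Hypothetical growth factors (1.10 means 10% growth); all must be positive.
growth = [1.10, 1.20, 0.95, 1.05]
g = statistics.geometric_mean(growth)

# Hypothetical sample with one extreme value.
sample = [10, 12, 13, 14, 15, 16, 17, 18, 19, 200]
plain = statistics.mean(sample)        # pulled upward by the value 200
trimmed = trimmed_mean(sample, 10)     # drops 10 and 200 before averaging

print(round(g, 4))
print(plain, trimmed)
```

The trimmed mean stays close to the bulk of the data, while the ordinary mean is pulled toward the single extreme value.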
4A-20
Central Tendency
■ Mean
• A familiar measure of central tendency.
Population formula: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• In Excel, use function =AVERAGE(Data) where
Data is an array of data values.
4A-21
Central Tendency
■ Mean
• For the sample of n = 37 car brands:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{87 + 93 + 98 + \cdots + 159 + 164 + 173}{37} = \frac{4639}{37} = 125.38$
4A-22
Central Tendency
■ Characteristics of the Mean
• Arithmetic mean is the most familiar average.
• Affected by every sample item.
• The balancing point or fulcrum for the data.
4A-23
Central Tendency
■ Characteristics of the Mean
• Regardless of the shape of the distribution, the signed deviations of the data points from the mean always sum to zero (absolute distances would not cancel):
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
• Consider the following asymmetric distribution of quiz scores whose mean = 65.
$\sum_{i=1}^{n} (x_i - \bar{x})$ = (42 – 65) + (60 – 65) + (70 – 65) + (75 – 65) + (78 – 65)
= (-23) + (-5) + (5) + (10) + (13) = -28 + 28 = 0
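The balancing property can be checked numerically. This is a minimal sketch using the five quiz scores from the slide; the variable names are illustrative only.

```python
scores = [42, 60, 70, 75, 78]  # quiz scores from the slide

mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]          # signed deviations
total_signed = sum(deviations)                   # negatives cancel positives
total_absolute = sum(abs(d) for d in deviations) # distances do not cancel

print(mean)            # 65.0
print(deviations)      # [-23.0, -5.0, 5.0, 10.0, 13.0]
print(total_signed)    # 0.0
print(total_absolute)  # 56.0
```

Only the signed deviations sum to zero; the absolute distances add up to 56.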
4A-24
Central Tendency
■ Median
• The median (M) is the 50th percentile or midpoint
of the sorted sample data.
• M separates the upper and lower half of the sorted
observations.
• If n is odd, the median is the middle observation in
the data array.
• If n is even, the median is the average of the
middle two observations in the data array.
4A-25
Central Tendency
■ Median
• For n = 8, the median is between the fourth and
fifth observations in the data array.
4A-26
Central Tendency
■ Median
• For n = 9, the median is the fifth observation in the
data array.
4A-27
Central Tendency
■ Median
• Consider the following n = 6 data values:
11 12 15 17 21 32
• What is the median?
For even n, Median = $\frac{x_{n/2} + x_{(n/2+1)}}{2}$
n/2 = 6/2 = 3 and n/2 + 1 = 6/2 + 1 = 4
M = $(x_3 + x_4)/2$ = (15 + 17)/2 = 16
11 12 15 | M = 16 | 17 21 32
4A-28
Central Tendency
■ Median
• Consider the following n = 7 data values:
12 23 23 25 27 34 41
• What is the median?
For odd n, Median = $x_{(n+1)/2}$
(n+1)/2 = (7+1)/2 = 8/2 = 4
M = $x_4$ = 25
12 23 23 | M = 25 | 27 34 41
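The odd and even rules can be written as one short function. This is a minimal sketch rather than a library routine; `median` is an illustrative name, and positions are converted from the 1-based numbering used on the slides to Python's 0-based indexing.

```python
def median(data):
    """Median by the position rules on the slides (1-based positions)."""
    values = sorted(data)
    n = len(values)
    if n % 2 == 1:
        # Odd n: the observation at position (n + 1) / 2.
        return values[(n + 1) // 2 - 1]
    # Even n: average of the observations at positions n/2 and n/2 + 1.
    lower = values[n // 2 - 1]
    upper = values[n // 2]
    return (lower + upper) / 2

print(median([11, 12, 15, 17, 21, 32]))      # 16.0
print(median([12, 23, 23, 25, 27, 34, 41]))  # 25
```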
4A-29
Central Tendency
■ Median
• Use Excel’s function =MEDIAN(Data) where Data is an array of data values.
• For the 37 vehicle quality ratings (odd n) the position of the median is (n+1)/2 = (37+1)/2 = 19.
• So, the median is $x_{19}$ = 121.
• When there are several duplicate data values, the
median does not provide a clean “50-50” split in
the data.
4A-30
Central Tendency
■ Characteristics of the Median
• The median is insensitive to extreme data values.
• For example, consider the following quiz scores for 3 students:
Tom’s scores: 20, 40, 70, 75, 80 (Mean = 57, Median = 70, Total = 285)
Jake’s scores: 60, 65, 70, 90, 95 (Mean = 76, Median = 70, Total = 380)
Mary’s scores: 50, 65, 70, 75, 90 (Mean = 70, Median = 70, Total = 350)
• What does the median for each student tell you?
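The three students' summaries can be reproduced with Python's standard `statistics` module. This is only a sketch of the comparison; the dictionary layout and variable names are illustrative.

```python
import statistics

quiz = {
    "Tom":  [20, 40, 70, 75, 80],
    "Jake": [60, 65, 70, 90, 95],
    "Mary": [50, 65, 70, 75, 90],
}

# (mean, median, total) for each student
summary = {
    name: (statistics.mean(s), statistics.median(s), sum(s))
    for name, s in quiz.items()
}

for name, (m, med, total) in summary.items():
    print(f"{name}: mean={m:.1f}, median={med}, total={total}")
```

All three medians equal 70 even though the means range from 57 to 76, which is the sense in which the median ignores how extreme the other scores are.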
4A-31
Central Tendency
■ Mode
• The most frequently occurring data value.
• Similar to mean and median if data values occur
often near the center of sorted data.
• May have multiple modes or no mode.
4A-32
Central Tendency
■ Mode
• For example, consider the following quiz scores for 4 students:
Lee’s scores: 60, 70, 70, 70, 80 (Mean = 70, Median = 70, Mode = 70)
Pat’s scores: 45, 45, 70, 90, 100 (Mean = 70, Median = 70, Mode = 45)
Sam’s scores: 50, 60, 70, 80, 90 (Mean = 70, Median = 70, Mode = none)
Xiao’s scores: 50, 50, 70, 90, 90 (Mean = 70, Median = 70, Modes = 50, 90)
• What does the mode for each student tell you?
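The four cases (one mode, an untypical mode, no mode, two modes) can be reproduced with a small counting function. This is a sketch under one assumption: a data set in which no value repeats is reported as having no mode, matching the slide. `modes` is an illustrative name.

```python
from collections import Counter

def modes(data):
    """Return all most-frequent values, or [] if no value repeats."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: treat as "no mode"
    return sorted(value for value, c in counts.items() if c == top)

print(modes([60, 70, 70, 70, 80]))   # Lee:  [70]
print(modes([45, 45, 70, 90, 100]))  # Pat:  [45]
print(modes([50, 60, 70, 80, 90]))   # Sam:  []
print(modes([50, 50, 70, 90, 90]))   # Xiao: [50, 90]
```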
4A-33
Central Tendency
■ Mode
• Easy to define, not easy to calculate in large
samples.
• Use Excel’s function =MODE(Array)
- will return #N/A if there is no mode.
- will return first mode found if multimodal.
• May be far from the middle of the distribution and
not at all typical.
4A-34
Central Tendency
■ Mode
• Generally isn’t useful for continuous data since
data values rarely repeat.
• Best for attribute data or a discrete variable with a
small range (e.g., Likert scale).
4A-35
Central Tendency
■ Example: Price/Earnings Ratios and Mode
• Consider the following P/E ratios for a random sample of 68 Standard & Poor’s 500 stocks.
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
• What is the mode?
4A-36
Central Tendency
■ Example: Price/Earnings Ratios and Mode
• Excel’s descriptive statistics results are:
Mean | 22.7206
Median | 19
Mode | 13
Range | 84
Minimum | 7
Maximum | 91
Sum | 1545
Count | 68
• The mode 13 occurs 7 times, but what does the dot plot show?
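The Excel summary for the 68 P/E ratios can be cross-checked in Python. This is a sketch using the standard `statistics` module on the values listed on the earlier slide; it is not the method used to produce the Excel output.

```python
import statistics

pe = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

stats = {
    "Mean": round(statistics.mean(pe), 4),
    "Median": statistics.median(pe),
    "Mode": statistics.mode(pe),
    "Range": max(pe) - min(pe),
    "Minimum": min(pe),
    "Maximum": max(pe),
    "Sum": sum(pe),
    "Count": len(pe),
}

for label, value in stats.items():
    print(f"{label}: {value}")

print(pe.count(13))  # how often the mode occurs: 7
```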
4A-37
Central Tendency
■ Example: Rose Bowl Winners’ Points
• Points scored by the winning NCAA football team tend to have modes in multiples of 7 because each touchdown typically yields 7 points (6 plus the extra point).
• Consider the dot plot of the points scored by the
winning team in the first 87 Rose Bowl games.
• What is the mode?
4A-38
Central Tendency
■ Skewness
• Compare mean and median or look at histogram to
determine degree of skewness.
4A-39
Central Tendency
■ Symptoms of Skewness
Distribution’s Shape | Histogram Appearance | Statistics
Skewed left (negative skewness) | Long tail of histogram points left (a few low values but most data on right) | Mean < Median
Symmetric | Tails of histogram are balanced (low/high values offset) | Mean ≈ Median
Skewed right (positive skewness) | Long tail of histogram points right (most data on left but a few high values) | Mean > Median
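The mean-versus-median comparison in the table can be turned into a rough rule of thumb. In the sketch below, `skew_hint` and its 2% tolerance for "approximately equal" are assumptions made for illustration; the slides do not specify how close the two values must be to count as symmetric.

```python
def skew_hint(mean, median, rel_tol=0.02):
    """Rough skewness label from comparing the mean with the median.

    Assumption: values within rel_tol * |median| count as 'about equal'.
    """
    if abs(mean - median) <= rel_tol * abs(median):
        return "approximately symmetric"
    return "skewed right" if mean > median else "skewed left"

print(skew_hint(22.7206, 19))  # P/E ratios: skewed right
print(skew_hint(57, 70))       # Tom's quiz scores: skewed left
print(skew_hint(70, 70))       # Mary's quiz scores: approximately symmetric
```

This is only a symptom check; a histogram remains the better way to judge shape.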
4A-40
Central Tendency
■ Midrange
• The midrange is the point halfway between the lowest and highest values of X.
• Easy to use but sensitive to extreme data values.
Midrange = $\frac{x_{\min} + x_{\max}}{2}$
• For the J. D. Power quality data (n=37):
Midrange = $\frac{x_{\min} + x_{\max}}{2} = \frac{x_1 + x_{37}}{2} = \frac{87 + 173}{2} = 130$
• Here, the midrange (130) is higher than the mean (125.38) or median (121).
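Because the midrange depends only on the two extremes, it can be sketched in one line. The full list of 37 defect rates is not reproduced here, so the example passes just the smallest and largest values quoted on the slide; `midrange` is an illustrative name.

```python
def midrange(data):
    """Point halfway between the smallest and largest values."""
    return (min(data) + max(data)) / 2

# Only the extremes matter, so the minimum (87) and maximum (173) suffice.
print(midrange([87, 173]))  # 130.0
```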
4A-41
Dispersion
• Variation is the “spread” of data points about the center of the distribution in a sample. Consider the following measures of dispersion:
■ Measures of Variation
Range
   Formula: $x_{\max} - x_{\min}$
   Excel: =MAX(Data)-MIN(Data)
   Pro: Easy to calculate.
   Con: Sensitive to extreme data values.
Variance ($s^2$)
   Formula: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
   Excel: =VAR(Data)
   Pro: Plays a key role in mathematical statistics.
   Con: Non-intuitive meaning.
4A-42
Dispersion
■ Measures of Variation
Standard deviation (s)
   Formula: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
   Excel: =STDEV(Data)
   Pro: Most common measure. Uses same units as the raw data ($, £, ¥, etc.).
   Con: Non-intuitive meaning.
Coefficient of variation (CV)
   Formula: $CV = 100 \times \frac{s}{\bar{x}}$
   Excel: None
   Pro: Measures relative variation in percent so can compare data sets.
   Con: Requires nonnegative data.
4A-43
Dispersion
■ Measures of Variation
Mean absolute deviation (MAD)
   Formula: $MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
   Excel: =AVEDEV(Data)
   Pro: Easy to understand.
   Con: Lacks “nice” theoretical properties.
4A-44
Dispersion
■ Range
• The difference between the largest and smallest observation.
Range = $x_{\max} - x_{\min}$
• For example, for the n = 68 P/E ratios,
Range = 91 – 7 = 84
4A-45
Dispersion
■ Variance
• The population variance ($\sigma^2$) is defined as the sum of squared deviations around the mean $\mu$ divided by the population size.
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• For the sample variance ($s^2$), we divide by n – 1 instead of n, otherwise $s^2$ would tend to underestimate the unknown population variance $\sigma^2$.
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
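The only difference between the two variance formulas is the divisor. The sketch below applies both to the five quiz scores used on the earlier mean slide (borrowed here purely for illustration) and compares the hand-written formulas with Python's `statistics.pvariance` and `statistics.variance`.

```python
import statistics

scores = [42, 60, 70, 75, 78]  # quiz scores from the mean slide, reused here
n = len(scores)
mean = sum(scores) / n

# Sum of squared deviations around the mean.
ss = sum((x - mean) ** 2 for x in scores)

pop_var = ss / n         # population formula: divide by N
samp_var = ss / (n - 1)  # sample formula: divide by n - 1

print(ss, pop_var, samp_var)  # 848.0 169.6 212.0

# The standard library gives the same results.
print(statistics.pvariance(scores), statistics.variance(scores))
```

Dividing by n – 1 always gives the larger value, which is what offsets the tendency to underestimate the population variance.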
4A-46
Dispersion
■ Standard Deviation
• The square root of the variance.
• Explains how individual values in a data set vary from the mean.
• Units of measure are the same as X.
Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
4A-47
Dispersion
■ Standard Deviation
• Excel’s built-in functions are:
Statistic | Excel population formula | Excel sample formula
Variance | =VARP(Array) | =VAR(Array)
Standard deviation | =STDEVP(Array) | =STDEV(Array)
4A-48
Dispersion
■ Calculating a Standard Deviation
• Consider the following five quiz scores for
Stephanie.
4A-49
Dispersion
■ Calculating a Standard Deviation
• Now, calculate the sample standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{2380}{5-1}} = \sqrt{595} = 24.39$
• Somewhat easier, the two-sum formula can also be used:
$s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}} = \sqrt{\frac{28300 - \frac{(360)^2}{5}}{5-1}} = \sqrt{\frac{28300 - 25920}{5-1}} = \sqrt{595} = 24.39$
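The two-sum formula needs only n, the sum of the scores, and the sum of their squares. The sketch below starts from the totals quoted in the calculation (n = 5, sum = 360, sum of squares = 28300) rather than the individual scores, which are not listed in the text; the variable names are illustrative.

```python
import math

n = 5
sum_x = 360      # sum of the five quiz scores
sum_x2 = 28300   # sum of the squared quiz scores

# Numerator of the two-sum formula: sum(x^2) - (sum(x))^2 / n
ss = sum_x2 - sum_x ** 2 / n
variance = ss / (n - 1)
s = math.sqrt(variance)

print(ss)           # 2380.0
print(variance)     # 595.0
print(round(s, 2))  # 24.39
```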
4A-50
Dispersion
■ Calculating a Standard Deviation
• The standard deviation is nonnegative because
deviations around the mean are squared.
• When every observation is exactly equal to the
mean, the standard deviation is zero.
• Standard deviations can be large or small,
depending on the units of measure.
• Compare standard deviations only for data sets
measured in the same units and only if the means
do not differ substantially.
4A-51
Dispersion
■ Coefficient of Variation
• Useful for comparing variables measured in different units or with different means.
• A unit-free measure of dispersion.
• Expressed as a percent of the mean.
$CV = 100 \times \frac{s}{\bar{x}}$
• Only appropriate for nonnegative data. It is undefined if the mean is zero or negative.
4A-52
Dispersion
■ Coefficient of Variation
• For example, using $CV = 100 \times \frac{s}{\bar{x}}$:
Defect rates (n = 37): s = 22.89, $\bar{x}$ = 125.38 gives CV = 100 × (22.89)/(125.38) = 18%
ATM deposits (n = 100): s = 280.80, $\bar{x}$ = 233.89 gives CV = 100 × (280.80)/(233.89) = 120%
P/E ratios (n = 68): s = 14.08, $\bar{x}$ = 22.72 gives CV = 100 × (14.08)/(22.72) = 62%
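The three comparisons can be reproduced with a small helper. In this sketch, `coefficient_of_variation` is an illustrative name, the check for a positive mean reflects the restriction stated on the previous slide, and the P/E standard deviation is taken as the 14.08 used in the slide's calculation.

```python
def coefficient_of_variation(s, mean):
    """CV as a percent of the mean; only defined for a positive mean."""
    if mean <= 0:
        raise ValueError("CV is undefined when the mean is zero or negative")
    return 100 * s / mean

examples = {
    "Defect rates": (22.89, 125.38),
    "ATM deposits": (280.80, 233.89),
    "P/E ratios": (14.08, 22.72),
}

cv_percent = {
    name: round(coefficient_of_variation(s, mean))
    for name, (s, mean) in examples.items()
}
print(cv_percent)
```

Although the three data sets have different units and very different means, the CV puts their relative variation on a common percent scale.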
4A-53
Dispersion
■ Mean Absolute Deviation
• The Mean Absolute Deviation (MAD) reveals the average distance from an individual data point to the mean (center of the distribution).
• Uses absolute values of the deviations around the mean.
$MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
• Excel’s function is =AVEDEV(Array)
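A minimal sketch of the MAD calculation follows, applied to the five quiz scores from the earlier mean slide (reused here for illustration only). `mean_absolute_deviation` is an illustrative name, not a standard-library function.

```python
def mean_absolute_deviation(data):
    """Average absolute distance of each value from the mean."""
    n = len(data)
    mean = sum(data) / n
    return sum(abs(x - mean) for x in data) / n

scores = [42, 60, 70, 75, 78]  # mean = 65
mad = mean_absolute_deviation(scores)
print(mad)  # (23 + 5 + 5 + 10 + 13) / 5 = 11.2
```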
4A-54
Dispersion
■ Central Tendency vs. Dispersion: Manufacturing
• Consider the histograms of hole diameters drilled in
a steel plate during manufacturing.
Machine A
Machine B
• The desired distribution is outlined in red.
4A-55
Dispersion
■ Central Tendency vs. Dispersion: Manufacturing
Machine A: Desired mean (5 mm) but too much variation.
Machine B: Acceptable variation but mean is less than 5 mm.
• Take frequent samples to monitor quality.
4A-56
Dispersion
■ Central Tendency vs. Dispersion: Job Performance
• Consider student ratings of four professors on eight
teaching attributes (10-point scale).
4A-57
Dispersion
■ Central Tendency vs. Dispersion: Job Performance
• Jones and Wu have identical means but different
standard deviations.
4A-58
Dispersion
■ Central Tendency vs. Dispersion: Job Performance
• Smith and Gopal have different means but identical
standard deviations.
4A-59
Dispersion
■ Central Tendency vs. Dispersion: Job Performance
• A high mean (better rating) and low standard
deviation (more consistency) is preferred. Which
professor do you think is best?