Measures of Dispersion
SOCY601—Alan Neustadtl
Measures of Dispersion
- Measures of central tendency estimate the numerical center of a distribution—these are measures of location.
- Measures of dispersion estimate the spread or variability of a distribution around that center:
  - Variation Ratio
  - Range
  - Interquartile Range
  - Quartile Deviation
  - Mean Deviation
  - Variance/Standard Deviation
Variation Ratio
- The variation ratio can be used with grouped data and is most useful for nominal-level data.

$V.R. = 1 - \frac{f_{\mathrm{modal}}}{n}$

- Basically, this is the proportion of cases that lie outside of the modal category.
Variation Ratio
Frequency distribution of race in the 2000 General Social Survey:

  Race    f
  White   2,244
  Black   404
  Other   170
  Total   2,817

$V.R. = 1 - \frac{f_{\mathrm{modal}}}{n} = 1 - \frac{2{,}244}{2{,}817} \approx 0.2$
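As a quick illustration (not part of the original slides), the following Python sketch computes the variation ratio from a frequency table like the one above; the dictionary of counts is transcribed from the table.

    # Variation ratio: the proportion of cases outside the modal category.
    freqs = {"White": 2244, "Black": 404, "Other": 170}  # counts from the table above

    n = sum(freqs.values())                # total number of cases
    f_modal = max(freqs.values())          # frequency of the modal category (White)
    variation_ratio = 1 - f_modal / n

    print(round(variation_ratio, 2))       # prints 0.2, matching the slide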
Variation Ratio
- Advantages: Can be used with data that do not contain much information (i.e., nominal-level data). This measure is easily interpretable.

$V.R. = 1 - \frac{f_{\mathrm{modal}}}{n}$

- Disadvantages: The variation ratio depends on the categorization scheme used by the researcher (i.e., it is somewhat arbitrary) and does not reflect the distribution of data in the non-modal categories.
Range
- The range is simply the difference between the values of the largest and smallest observations.

$\mathrm{range} = \max - \min$

- In the 2000 General Social Survey, the minimum and maximum respondent ages are, respectively, 18 and 89. So, the range is:

$\mathrm{range} = 89 - 18 = 71$
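A minimal Python sketch of the same calculation, using a small made-up list of ages rather than the actual GSS file:

    # Range: difference between the largest and smallest observations.
    ages = [18, 23, 35, 46, 57, 89]        # hypothetical ages, not the GSS data

    age_range = max(ages) - min(ages)
    print(age_range)                       # 71 for this illustrative list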
Range
- Advantages: The range is an extremely simple measure to calculate and interpret. It is useful for spotting "out of range" values (i.e., values entered erroneously in a dataset).

$\mathrm{range} = \max - \min$

- Disadvantages: The range depends entirely on just two values—the most extreme, and therefore most variable (sample-dependent), observations in a data set.
Interquartile Range/Quartile Deviation
- The interquartile range, or IQR, is the numerical difference or distance between the third and first quartiles of a distribution.

$IQR = Q_3 - Q_1$

- A related measure is the quartile deviation, which numerically represents half the distance between the first and third quartiles—the middle half of a distribution.

$Q = \frac{Q_3 - Q_1}{2}$
Interquartile Range/Quartile Deviation
- Advantages: The IQR and quartile deviation are more stable estimators of spread, since they use two values closer to the middle of the distribution that vary less from sample to sample than more extreme values.
- Disadvantages: These measures are still totally dependent on just two values and ignore all other observations in a data set.

$IQR = Q_3 - Q_1$

$Q = \frac{Q_3 - Q_1}{2}$
Interquartile Range/Quartile Deviation
- In the 2000 General Social Survey, the third and first quartiles of age are, respectively, 57 and 32. Therefore:

$IQR = Q_3 - Q_1 = 57 - 32 = 25$

$Q = \frac{Q_3 - Q_1}{2} = \frac{57 - 32}{2} = 12.5$
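In Python, the quartiles can be estimated with numpy's percentile function; note that different interpolation conventions can shift Q1 and Q3 slightly. The ages array below is a stand-in, not the GSS data.

    import numpy as np

    ages = np.array([18, 22, 25, 32, 40, 43, 51, 57, 64, 89])  # stand-in data

    q1, q3 = np.percentile(ages, [25, 75])   # first and third quartiles
    iqr = q3 - q1                            # interquartile range
    quartile_dev = iqr / 2                   # quartile deviation (half the IQR)

    print(iqr, quartile_dev)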
Mean Deviation
- With data that contain more information, we may calculate a measure that uses all of that information.
- We can do this by calculating the deviation of each score from the mean and computing an average deviation.

$\text{mean deviation} = \frac{\sum \left| X_i - \bar{X} \right|}{n}$

- Note: absolute values are taken as a theoretical convenience, but they unfortunately have poor mathematical properties.
Mean Deviation
- The mean deviation of age in the 2000 General Social Survey is equal to:

$\text{mean deviation} = \frac{\sum \left| X_i - \bar{X} \right|}{n} = 14.5$
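A short Python sketch of the calculation, again with stand-in data rather than the GSS file:

    # Mean deviation: average absolute distance of each score from the mean.
    ages = [18, 22, 25, 32, 40, 43, 51, 57, 64, 89]   # stand-in data

    mean_age = sum(ages) / len(ages)
    mean_dev = sum(abs(x - mean_age) for x in ages) / len(ages)

    print(round(mean_dev, 1))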
Mean Deviation
- Advantages: The mean deviation uses all the valid observations of a variable to produce this summary statistic—it is a "democratic" measure. It may be interpreted intuitively.
- Disadvantages: Absolute values are not easily manipulated algebraically. There is no ready-made metric to aid the interpretation of this statistic as there is for the standard deviation.

$\text{mean deviation} = \frac{\sum \left| X_i - \bar{X} \right|}{n}$
Standard Deviation
- The most widely used measure of variability, usually paired with the mean, is the standard deviation.

$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}}$
Standard Deviation
- The standard deviation of age in the 2000 General Social Survey data is equal to:

$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}} = 17.4$
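A minimal Python sketch using the slide's formula, which divides by n; note that many statistical packages default to dividing by n - 1 instead. The data are stand-in values, not the GSS file.

    import math

    ages = [18, 22, 25, 32, 40, 43, 51, 57, 64, 89]   # stand-in data

    mean_age = sum(ages) / len(ages)
    variance = sum((x - mean_age) ** 2 for x in ages) / len(ages)  # divide by n
    std_dev = math.sqrt(variance)

    print(round(std_dev, 1))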
Standard Deviation
- Advantages:
  - Like the mean deviation, the standard deviation uses all the valid observations of a variable to produce this summary statistic—it also is a "democratic" measure.
  - It may be interpreted by using the Gaussian normal distribution.
  - This statistic varies from low to high with the spread of the distribution.
- Disadvantages: Squaring the differences gives greater weight to more extreme values.

$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}}$
Standard Deviation
Samples:

$s^2 = \frac{\sum (X_i - \bar{X})^2}{n} \qquad s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}}$

Populations:

$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N} \qquad \sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$
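As a side note, numpy's ddof argument switches between the two denominators; the slide writes the sample formula with n, while many texts use n - 1 for the sample variance. The array below is stand-in data.

    import numpy as np

    x = np.array([18, 22, 25, 32, 40, 43, 51, 57, 64, 89])  # stand-in data

    sigma = x.std(ddof=0)   # population form: divide by N
    s = x.std(ddof=1)       # common sample form: divide by n - 1

    print(sigma, s)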
Summary
Age (2000 General Social Survey):

  Mean                 46
  Median               43
  Mode                 32
  Range                71.0
  IQR                  25.0
  Quartile deviation   12.5
  Mean deviation       14.5
  Variance             301.6
  Standard deviation   17.4
Heuristic vs. Computational Formulas
The definitional (heuristic) formula can be rewritten into a computational formula by expanding the squared deviations:

$s^2 = \frac{\sum (X_i - \bar{X})^2}{n}$
$= \frac{\sum (X_i^2 - 2 X_i \bar{X} + \bar{X}^2)}{n}$
$= \frac{\sum X_i^2 - 2\bar{X}\sum X_i + n\bar{X}^2}{n}$
$= \frac{\sum X_i^2}{n} - 2\bar{X}^2 + \bar{X}^2 \quad \text{(since } \sum X_i = n\bar{X}\text{)}$
$= \frac{\sum X_i^2}{n} - \bar{X}^2$

$s = \sqrt{\frac{\sum X_i^2}{n} - \bar{X}^2}$
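A quick numerical check (with made-up data) that the computational formula gives the same answer as the definitional one:

    # Verify that sum(x**2)/n - mean**2 equals the definitional variance.
    x = [18, 22, 25, 32, 40, 43, 51, 57, 64, 89]      # stand-in data
    n = len(x)
    mean_x = sum(x) / n

    definitional = sum((xi - mean_x) ** 2 for xi in x) / n
    computational = sum(xi ** 2 for xi in x) / n - mean_x ** 2

    print(abs(definitional - computational) < 1e-9)   # True, up to rounding error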
Interpreting Standard Deviations
- The Empirical Rule (applies to "normal"-shaped distributions):
  - Approximately 68% of all cases fall within 1 standard deviation of the mean $(\bar{X} - s, \bar{X} + s)$
  - Approximately 95% of all cases fall within 2 standard deviations of the mean $(\bar{X} - 2s, \bar{X} + 2s)$
  - Essentially all cases fall within 3 standard deviations of the mean $(\bar{X} - 3s, \bar{X} + 3s)$
Interpreting Standard Deviations
- Chebyshev's Rule (applies to any sample regardless of shape):
  - It is possible that very few cases will fall within 1 standard deviation of the mean $(\bar{X} - s, \bar{X} + s)$
  - At least 3/4 of all cases will fall within 2 standard deviations of the mean $(\bar{X} - 2s, \bar{X} + 2s)$
  - At least 8/9 of all cases will fall within 3 standard deviations of the mean $(\bar{X} - 3s, \bar{X} + 3s)$
  - Generally, at least $1 - \frac{1}{k^2}$ of the cases will fall within k standard deviations of the mean $(\bar{X} - ks, \bar{X} + ks)$ for any k greater than 1
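A small simulation sketch (assuming normally distributed data generated with random.gauss) comparing the observed share of cases within k standard deviations to Chebyshev's lower bound:

    import random
    import statistics

    random.seed(0)
    data = [random.gauss(46, 17.4) for _ in range(1000)]   # simulated, not GSS data

    mean = statistics.mean(data)
    sd = statistics.pstdev(data)                           # population-style sd

    for k in (1, 2, 3):
        share = sum(abs(x - mean) <= k * sd for x in data) / len(data)
        bound = 1 - 1 / k ** 2                             # Chebyshev's lower bound
        print(k, round(share, 3), ">=", round(bound, 3))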
Understanding the Mean
- "Best Guess" Interpretation
  - We have proven that the tendency for cases in a distribution to differ from the mean in one direction is exactly balanced by differences in the other direction.
  - But there is another way to understand the mean. Suppose that a single observation were selected at random from a sample and you were asked to determine the value of that observation, i.e., to make a guess about its value.
Understanding the Mean
- A single observation is selected at random—make a guess about its value.
- If you guess the mean of the distribution, your guess might be too high, too low, or exactly right. The extent of your error is equal to:

$d = (X - \bar{X})$

- Over all possible cases that could be drawn from the distribution, the average signed error, or mean signed deviation, is equal to:

$\bar{d} = \frac{\sum d_i}{n}$

We can make the following statement: if the mean is guessed as the score for any case drawn at random from a distribution, then, on average, the amount of signed error will be zero.
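A one-line check of that claim in Python, with stand-in data: the signed deviations from the mean average out to zero.

    x = [18, 22, 25, 32, 40, 43, 51, 57, 64, 89]      # stand-in data
    mean_x = sum(x) / len(x)

    mean_signed_dev = sum(xi - mean_x for xi in x) / len(x)
    print(abs(mean_signed_dev) < 1e-9)                # True, up to floating-point error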
Understanding the Mean
- A single observation is selected at random—make a guess about its value.
- Suppose the rules are changed somewhat—for a single guess about the value of this case, you are required to be absolutely correct with the greatest probability.

In this situation you would guess the mode, since it is the most frequently occurring score and therefore the most probable value in the distribution—it has the greatest likelihood of being selected.
Understanding the Mean
- A single observation is selected at random—make a guess about its value.
- The final situation requires that your guess have the smallest absolute error possible. Here the sign of the error is unimportant, but its sheer size is critical.

In this situation you would guess the median, since it is closest, on average, to every other score in the distribution. This is shown symbolically by:

$\sum \left| X_i - \mathrm{md} \right| = \text{minimum}$
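A brute-force sketch of that property: over integer guesses, the sum of absolute errors is smallest at the median (an odd-length stand-in list keeps the median unique).

    import statistics

    x = [18, 22, 25, 32, 40, 43, 51, 57, 64]          # stand-in data (odd length)

    def sum_abs_error(guess):
        return sum(abs(xi - guess) for xi in x)

    best_guess = min(range(min(x), max(x) + 1), key=sum_abs_error)
    print(best_guess, statistics.median(x))           # both are 40 here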
Final Thoughts
- An easy way to understand variance is by way of a physical analogy.
  - A deviation from the mean can be identified as having a certain amount of force away from the mean.
  - We do not necessarily know what factors are at work to make an observation deviate from the mean, but we can measure the actual deviation for each case.
  - Therefore, the value of any observation is composed of the mean plus a deviation from the mean:

$X_i = \bar{X} + d_i$
Final Thoughts
- Consider two cases, d1 and d2, drawn from a distribution.
  - If we use normal random sampling techniques, these two cases are unrelated, or independent of each other.
  - Independent values may be represented geometrically as being at right angles to each other. The issue here is to determine the net force away from the mean for these two cases. This can be calculated using this formula:

$\sum d^2 = (X_1 - \bar{X})^2 + (X_2 - \bar{X})^2$

- This is, of course, the Pythagorean Theorem, and if we divide this mess by n and take the square root, it looks suspiciously like the standard deviation.
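A quick numeric sketch of that remark, with stand-in data: the length of the resultant deviation vector, divided by the square root of n, reproduces the standard deviation in the n-denominator form.

    import math

    x = [18, 22, 25, 32, 40, 43, 51, 57, 64, 89]      # stand-in data
    n = len(x)
    mean_x = sum(x) / n

    d = [xi - mean_x for xi in x]                     # deviations from the mean
    resultant = math.sqrt(sum(di ** 2 for di in d))   # "net force" away from the mean

    print(resultant / math.sqrt(n))                   # equals sqrt(sum(d**2)/n), the sd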
Final Thoughts
[Figure: right-angle diagram with the mean at the vertex and the two independent deviations, d1 and d2, as the legs; the hypotenuse represents the resultant distance from the mean.]
Final Thoughts
- What if we had three independent cases? The resultant force away from the mean can be calculated as:

$\sqrt{d_1^2 + d_2^2 + d_3^2}$

- This may be generalized to all n cases in a distribution, even though it is difficult to "visualize."
- The larger the standard deviation, the larger the total force away from the mean.
- Error is viewed as the resultant force away from homogeneity.
- The standard deviation reflects the net effect of such forces per observation.
Final Thoughts
- People with physics backgrounds will see that:
  - the mean is equal to the center of gravity
  - the variance is equal to the moment of inertia of a distribution of mass
  - the standard deviation is equal to the radius of gyration of a distribution of mass