Pol 600: Research Methods
Scott Granberg-Rademacker
Handout #2
Measures of Central Tendency
Measures of central tendency are mathematical operations which supply information about the “typical” observation in a set or variable. There are several measures of central tendency, each with different pros and cons: expected values (sometimes called expectations, means, or averages), medians, and modes. Expected values (usually denoted E(X) or x̄) are most commonly used in practice, but there are applications where medians (denoted x̃) or modes may prove to be a better indicator of what the “typical” observation is like.
Most of the time, the expected value is identical to the simple average, which is nothing more than the arithmetic mean of a set or variable. Simple averages, however, assume that the probability of each observation is equal: P(x_1) = P(x_2) = \cdots = P(x_k). If X is a discrete stochastic variable, the simple average can be found as follows:
E(X) = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad (1)
However, such an assertion may or may not be true. If the probabilities associated with each observation are different, then the expected value is a weighted average. Consider the expected value of a variable, x, where the probability of each possible observation is different. In a case like this, the expected value is simply the sum of each observation times its probability:
E(X) = \bar{x} = \sum_{i=1}^{n} x_i f(x_i) \qquad (2)
The problem with weighted averages in practice, though, is that we often do not know the exact probabilities that make up f(x) (remember that f(x) is the probability density function of x). When these probabilities are not known, the most common approach is to simply assume that the probabilities are all the same and use the simple average formula.
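As a quick illustration of Equation 2 (a made-up example, not one from the handout's data), suppose a discrete variable takes the value 1 with probability 0.5, the value 2 with probability 0.3, and the value 3 with probability 0.2. Then:

E(X) = \bar{x} = 1(0.5) + 2(0.3) + 3(0.2) = 0.5 + 0.6 + 0.6 = 1.7

Note that the simple average of the three values, (1 + 2 + 3)/3 = 2, would overstate the expected value here, since it ignores the fact that the smaller values are more probable.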
One of the main problems with using expected values is that the influence of
outliers is poorly mitigated. Basically, extreme values which are not “typical” of other observations may heavily skew the expected value. Consider
two variables:
a = {3, 4, −2, 4, 5, 3}
b = {3, 4, −2, 4, 5, 3, 170}
The only difference between the two is that b has one more observation than a, but that single observation is clearly much different from the rest of the observations. Such abnormal observations are outliers, which can badly skew the expected value:
\bar{a} = \frac{\sum_{i=1}^{n} a_i}{n} = \frac{3 + 4 + (-2) + 4 + 5 + 3}{6} = \frac{17}{6} = 2.83

\bar{b} = \frac{\sum_{i=1}^{n} b_i}{n} = \frac{3 + 4 + (-2) + 4 + 5 + 3 + 170}{7} = \frac{187}{7} = 26.71
So, how can one blunt the influence of extreme outliers while still getting a good idea of the “typical” observation? One possibility is to use the median. The median of a set or variable is the value that has just as many values greater than it as less than it. When the set or variable has an even number of observations, the median is the average of the two middle values. When the set or variable has an odd number of observations, the median is simply the middle value.
It is important to note for discrete variables that the median will always
satisfy the following condition:
P(X \le \tilde{x}) \ge 0.5 \quad \text{and} \quad P(X \ge \tilde{x}) \ge 0.5 \qquad (3)
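To see what this condition means in practice, look ahead to the variable b, whose median turns out to be 4: five of its seven observations are less than or equal to 4, and four of its seven observations are greater than or equal to 4 (treating each observation as equally likely), so

P(X \le 4) = \frac{5}{7} \approx 0.71 \ge 0.5 \quad \text{and} \quad P(X \ge 4) = \frac{4}{7} \approx 0.57 \ge 0.5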
Finding the median is quite simple. The first step is to arrange the values in
the variable(s) from least to greatest. Let us denote the arranged variables
as a∗ and b∗ .
a∗ = {−2, 3, 3, 4, 4, 5}
b∗ = {−2, 3, 3, 4, 4, 5, 170}
When the total number of observations is odd, the median can be found
using the following formula:
\tilde{x} = x^*_{\frac{n+1}{2}} \qquad (4)
and when the total number of observations is even:
\tilde{x} = \frac{x^*_{\frac{n}{2}} + x^*_{\frac{n}{2}+1}}{2} \qquad (5)
Since a has six observations (n = 6), it is necessary for us to use Equation 5
to find the median of a:
\tilde{a} = \frac{a^*_{\frac{n}{2}} + a^*_{\frac{n}{2}+1}}{2} = \frac{a^*_{\frac{6}{2}} + a^*_{\frac{6}{2}+1}}{2} = \frac{a^*_{3} + a^*_{4}}{2} = \frac{3 + 4}{2} = \frac{7}{2} = 3.5
Finding the median of b is simply a matter of using Equation 4, since b has
an odd number of observations (n = 7):
\tilde{b} = b^*_{\frac{n+1}{2}} = b^*_{\frac{7+1}{2}} = b^*_{\frac{8}{2}} = b^*_{4} = 4
When we compare the means and medians of a and b, we can see that they are not the same:
ā = 2.83, ã = 3.5
b̄ = 26.71, b̃ = 4
However, both the mean and median are fairly “typical” of a, which is to be expected since there is no extreme outlier in a. Note that the mean of b has been heavily skewed by the outlier, but the median of b easily mitigates the outlier's impact. This illustrates one of the nice properties of the median: it tends to be resistant to outliers.
Another measure of central tendency which is not used very often is the
mode. The mode of a set or variable is simply the value that occurs most
frequently within that set or variable. A given set or variable may have one mode, several modes, or no mode at all. For example, a has two modes, since 3 and 4 each appear twice:
Mode (a) = {3, 4}
Modes are seldom used in practice for good reason. They are often unreliable
and misleading, as illustrated in the following example:
c = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 902, 902}
Here, Mode (c) = 902, which is hardly typical of c.
Consider another example:
d = {1, 2, 3, 4, 5, 6, 7}
In this instance, there is no mode of d, because there is only one instance of
each value.
Mode (d) = ∅, where ∅ denotes the empty set.
Measures of Variability
Measures of variability are mathematical operations which measure the amount of dispersion or spread in a given set or variable. While measures of central tendency tell you what the “typical” observation is like, measures of variability tell you how dispersed or spread out the data in a set or variable are. There are several measures of variability available to us, each with advantages and disadvantages.
The most basic measure of variability is the range. The range of a set or
variable is simply the largest value minus the smallest value. The range can
be denoted as:
\text{Range}(x) = x_{\max} - x_{\min} \qquad (6)
So if we have two variables:
e = {3, 5, 5, 7}
f = {4, 4, 6, 6}
Finding the ranges is quite simple:
Range (e) = 7 − 3 = 4
Range (f ) = 6 − 4 = 2
Ranges are nice but are only informative about the extreme values of a variable. This means that they are susceptible to outliers, and can ultimately
provide a badly skewed picture of the variability of a variable.
A better measure of variability is the mean deviation. The mean deviation
is the average distance an observation in a set or variable is away from the
mean. This makes for a nice interpretation about the “typical” observation.
The mean deviation can be found by using the following formula:
MD(x) = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \qquad (7)
Absolute value bars | | simply mean that after all operations inside the bars are finished, a negative result is turned positive. For example, |5 − 8| = |−3| = 3. The absolute value of a positive number is a positive number: |5| = 5.
Despite this nice interpretation, the mean deviation is not used all that often. First of all, absolute values are problematic (particularly for computers) when doing more complex operations. Secondly, it is possible for variables with different distributions to have the same mean deviation. Consider e and f once again:
e = {3, 5, 5, 7}
f = {4, 4, 6, 6}
Clearly they are distributed differently, but the mean deviation will not reveal this to us. Observe how both mean deviations yield the same result (keep in mind that ē = f̄ = 5):
MD(e) = \frac{\sum_{i=1}^{n} |e_i - \bar{e}|}{n} = \frac{|3-5| + |5-5| + |5-5| + |7-5|}{4} = \frac{|-2| + |0| + |0| + |2|}{4} = \frac{2+0+0+2}{4} = \frac{4}{4} = 1

MD(f) = \frac{\sum_{i=1}^{n} |f_i - \bar{f}|}{n} = \frac{|4-5| + |4-5| + |6-5| + |6-5|}{4} = \frac{|-1| + |-1| + |1| + |1|}{4} = \frac{1+1+1+1}{4} = \frac{4}{4} = 1
This is where the variance (commonly denoted σ², pronounced “sigma squared”) and standard deviation (denoted σ) can help out. The formula for the variance is very similar to that of the mean deviation, but it avoids the problem of taking the absolute value by simply squaring the deviations. Additionally, it provides us with a measure that is more sensitive to variation than the mean deviation. The formula for the variance is simply:
\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n} \qquad (8)

The standard deviation is simply the square root of the variance:

\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} \qquad (9)
All of these benefits do have a downside, however. Since the deviations are being squared, the variance and standard deviation do not have a clean and simple interpretation like the mean deviation does. They do have some nice qualities, though, which will be illustrated when we talk about distributions and hypothesis testing.
So how do the variance and standard deviation fare with e and f? Let's find the variances:

\sigma^2_e = \frac{\sum_{i=1}^{n} (e_i - \mu_e)^2}{n} = \frac{(3-5)^2 + (5-5)^2 + (5-5)^2 + (7-5)^2}{4} = \frac{(-2)^2 + 0^2 + 0^2 + 2^2}{4} = \frac{4+4}{4} = \frac{8}{4} = 2

\sigma^2_f = \frac{\sum_{i=1}^{n} (f_i - \mu_f)^2}{n} = \frac{(4-5)^2 + (4-5)^2 + (6-5)^2 + (6-5)^2}{4} = \frac{(-1)^2 + (-1)^2 + 1^2 + 1^2}{4} = \frac{1+1+1+1}{4} = \frac{4}{4} = 1
And the standard deviations:
\sigma_e = \sqrt{\sigma^2_e} = \sqrt{2} = 1.41

\sigma_f = \sqrt{\sigma^2_f} = \sqrt{1} = 1
Notice that the standard deviations are close (or, in the case of f, identical) to the mean deviations found earlier, but are now different from each other, better reflecting the true variability of e and f. In general, the larger the standard deviation, the greater the variability.
All of what we have done so far assumes that we are dealing with populations. Populations are complete sets of all observations of interest. In reality, true populations are often unknown. Most of the time, what we have in social science is sample data. Samples are simply subsets of a population. Because we often deal with sample data, we need to account for the extra uncertainty that comes with a sample. Think of it like a currency: every observation in a sample is a unit of currency, and whenever an estimate is calculated, one unit is “spent”. These units of “currency” are known as degrees of freedom (referred to as “df” for short), and one degree of freedom is lost each time we “spend” one to calculate an estimate.
More technically, degrees of freedom are the number of unrestricted, random values that constitute a statistic. In practice, this means that we have to make small adjustments to some of our formulas when dealing with samples. The biggest change for us right now is to remember that the formulas for the variance and standard deviation need to be slightly corrected. The sample variance can be found using the following formula:
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \qquad (10)

And the sample standard deviation is:

s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \qquad (11)
You might ask, what really changed? The most noticeable change is that the Greek letter σ is not used in either formula. Instead, the sample variance is denoted as s² and the sample standard deviation is denoted as s. These are estimates which approximate the unknown population variance σ² and population standard deviation σ. Since these are sample estimates, we lose one degree of freedom, which comes out of the denominator: instead of dividing by n, we divide by n − 1 when finding s² and s.
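To see the size of this correction, take the variable e from earlier and treat it as a sample rather than a population (an illustrative reuse of the same four numbers, not a distinction the data themselves can tell us). The sum of squared deviations is still 8, but now we divide by n − 1 = 3:

s^2_e = \frac{(3-5)^2 + (5-5)^2 + (5-5)^2 + (7-5)^2}{4-1} = \frac{8}{3} = 2.67 \qquad s_e = \sqrt{2.67} = 1.63

which is slightly larger than the population values \sigma^2_e = 2 and \sigma_e = 1.41 found above.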
Also of note is that the typical notation for the population mean and sample
mean are different. The population mean is usually denoted by the Greek
letter µ (pronounced “mu”), and the sample mean is usually denoted with
a bar over the variable name, x̄. Once again, in practice the true value of
µ is often unknown, and the mean of the observed sample data x̄ is only an
estimate of µ.
GNUMERIC Commands:
Average: =AVERAGE(number1,number2,...)
Median: =MEDIAN(number1,number2,...)
Mode: =MODE(number1,number2,...)
Range: =MAX(number1,number2,...)-MIN(number1,number2,...)
Mean Deviation: =AVEDEV(number1,number2,...)
Population Variance: =VARP(number1,number2,...)
Population Standard Deviation: =STDEVP(number1,number2,...)
Sample Variance: =VAR(number1,number2,...)
Sample Standard Deviation: =STDEV(number1,number2,...)
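For example, if the seven observations of b were entered in cells A1 through A7 (this cell layout is just an illustration, not something the handout prescribes), the following formulas would reproduce the results computed by hand above:

Mean of b: =AVERAGE(A1:A7) returns 26.71
Median of b: =MEDIAN(A1:A7) returns 4
Range of b: =MAX(A1:A7)-MIN(A1:A7) returns 170 − (−2) = 172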