Pol 600: Research Methods
Scott Granberg-Rademacker
Handout #2
Measures of Central Tendency
Measures of central tendency are mathematical operations which supply information about the “typical” observation in a set or variable. There are several measures of central tendency, each with different pros and cons: expected values (sometimes called expectations, means, or averages), medians, and modes. Expected values (usually denoted E(X) or x̄) are most commonly used in practice, but there are applications where medians (denoted x̃) or modes may prove to be a better indicator of what the “typical” observation is like.
Most of the time, the expected value is identical to the simple average, which is nothing more than the arithmetic mean of a set or variable. Simple averages, however, assume that the probability of each observation is equal: P(x_1) = P(x_2) = \cdots = P(x_k). If X is a discrete stochastic variable, the simple average can be found as follows:
E(X) = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad (1)
However, such an assertion may or may not be true. If the probabilities associated with each observation are different, then the expected value is a weighted average. Consider the expected value of a variable, x, where the probability of each possible observation is different. In a case like this, the expected value is simply the sum of each observation times its probability:
E(X) = \bar{x} = \sum_{i=1}^{n} x_i f(x_i) \qquad (2)
The problem with weighted averages in practice, though, is that we often do not know the exact probabilities that make up f(x) (remember that f(x) is the probability density function of x). When these probabilities are not known, the most common approach is to simply assume that the probabilities are all the same and use the simple average formula.
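As a quick illustration of Equation 2 (a made-up example, not one from the handout's data), suppose a discrete variable takes the value 1 with probability 0.5, the value 2 with probability 0.3, and the value 3 with probability 0.2. Then:

E(X) = \bar{x} = 1(0.5) + 2(0.3) + 3(0.2) = 0.5 + 0.6 + 0.6 = 1.7

Note that the simple average of the three values, (1 + 2 + 3)/3 = 2, would overstate the expected value here, since it ignores the fact that the smaller values are more probable.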
One of the main problems with using expected values is that the influence of
outliers is poorly mitigated. Basically, extreme values which are not “typical” of other observations may heavily skew the expected value. Consider
two variables:
a = {3, 4, −2, 4, 5, 3}
b = {3, 4, −2, 4, 5, 3, 170}
The only difference between the two is that b has one more observation than a, but that single observation is clearly much different from the rest of the observations. Such abnormal observations are outliers, which can badly skew the expected value:
\bar{a} = \frac{\sum_{i=1}^{n} a_i}{n} = \frac{3 + 4 + (-2) + 4 + 5 + 3}{6} = \frac{17}{6} = 2.83

\bar{b} = \frac{\sum_{i=1}^{n} b_i}{n} = \frac{3 + 4 + (-2) + 4 + 5 + 3 + 170}{7} = \frac{187}{7} = 26.71
So, how can one blunt the influence of extreme outliers while still getting a good idea of the “typical” observation? One possibility is to use the median. The median of a set or variable is the value that has just as many values greater than it as less than it. When the set or variable has an even number of observations, the median is the average of the two middle values. When the set or variable has an odd number of observations, the median is simply the middle value.
It is important to note for discrete variables that the median will always
satisfy the following condition:
P(X \le \tilde{x}) \ge 0.5 \quad \text{and} \quad P(X \ge \tilde{x}) \ge 0.5 \qquad (3)
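To see what this condition means in practice, look ahead to the variable b, whose median turns out to be 4: five of its seven observations are less than or equal to 4, and four of its seven observations are greater than or equal to 4 (treating each observation as equally likely), so

P(X \le 4) = \frac{5}{7} \approx 0.71 \ge 0.5 \quad \text{and} \quad P(X \ge 4) = \frac{4}{7} \approx 0.57 \ge 0.5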
Finding the median is quite simple. The first step is to arrange the values in
the variable(s) from least to greatest. Let us denote the arranged variables
as a∗ and b∗ .
a∗ = {−2, 3, 3, 4, 4, 5}
b∗ = {−2, 3, 3, 4, 4, 5, 170}
When the total number of observations is odd, the median can be found
using the following formula:
\tilde{x} = x^*_{\frac{n+1}{2}} \qquad (4)
and when the total number of observations is even:
\tilde{x} = \frac{x^*_{\frac{n}{2}} + x^*_{\frac{n}{2}+1}}{2} \qquad (5)
Since a has six observations (n = 6), it is necessary for us to use Equation 5
to find the median of a:
\tilde{a} = \frac{a^*_{\frac{n}{2}} + a^*_{\frac{n}{2}+1}}{2} = \frac{a^*_{\frac{6}{2}} + a^*_{\frac{6}{2}+1}}{2} = \frac{a^*_{3} + a^*_{4}}{2} = \frac{3 + 4}{2} = \frac{7}{2} = 3.5
Finding the median of b is simply a matter of using Equation 4, since b has
an odd number of observations (n = 7):
\tilde{b} = b^*_{\frac{n+1}{2}} = b^*_{\frac{7+1}{2}} = b^*_{\frac{8}{2}} = b^*_{4} = 4
When we compare the means and medians of a and b, we can see that they are not the same:
ā = 2.83, ã = 3.5
b̄ = 26.71, b̃ = 4
However, both the mean and median are fairly “typical” of a, which is to be expected since there is no extreme outlier in a. Note that the mean of b has been heavily skewed by the outlier, but the median of b easily mitigates the outlier's impact. This illustrates one of the nice properties of the median: it tends to be resistant to outliers.
Another measure of central tendency which is not used very often is the
mode. The mode of a set or variable is simply the value that occurs most
frequently within that set or variable. A given set or variable may have one mode, several modes, or no mode at all. For example, a has two modes, since 3 and 4 each appear twice:
Mode (a) = {3, 4}
Modes are seldom used in practice for good reason. They are often unreliable
and misleading, as illustrated in the following example:
c = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 902, 902}
Here, Mode (c) = 902, which is hardly typical of c.
Consider another example:
d = {1, 2, 3, 4, 5, 6, 7}
In this instance, there is no mode of d, because there is only one instance of
each value.
Mode (d) = ∅, where ∅ denotes the empty set.
Measures of Variability
Measures of variability are mathematical operations which measure the amount of dispersion or spread in a given set or variable. While measures of central tendency tell you what the “typical” observation is like, measures of variability tell you how dispersed or spread out the data in a set or variable are. There are several measures of variability available to us, each with advantages and disadvantages.
The most basic measure of variability is the range. The range of a set or
variable is simply the largest value minus the smallest value. The range can
be denoted as:
\text{Range}(x) = x_{\max} - x_{\min} \qquad (6)
So if we have two variables:
e = {3, 5, 5, 7}
f = {4, 4, 6, 6}
Finding the ranges is quite simple:
Range (e) = 7 − 3 = 4
Range (f ) = 6 − 4 = 2
Ranges are nice but are only informative about the extreme values of a variable. This means that they are susceptible to outliers, and can ultimately
provide a badly skewed picture of the variability of a variable.
A better measure of variability is the mean deviation. The mean deviation
is the average distance an observation in a set or variable is away from the
mean. This makes for a nice interpretation about the “typical” observation.
The mean deviation can be found by using the following formula:
MD(x) = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \qquad (7)
Absolute value bars | | simply mean that after all operations inside the bars are finished, a negative result is turned positive. For example, |5 − 8| = |−3| = 3. The absolute value of a positive number is a positive number: |5| = 5.
Despite this nice interpretation, the mean deviation is not used all that often. First of all, absolute values are problematic (particularly for computers) when doing more complex operations. Secondly, it is possible for variables with different distributions to have the same mean deviation. Consider e and f once again:
e = {3, 5, 5, 7}
f = {4, 4, 6, 6}
Clearly they are distributed differently, but the mean deviation will not reveal this to us. Observe how both mean deviations yield the same result (keep in mind that ē = f̄ = 5):
MD(e) = \frac{\sum_{i=1}^{n} |e_i - \bar{e}|}{n} = \frac{|3-5| + |5-5| + |5-5| + |7-5|}{4} = \frac{|-2| + |0| + |0| + |2|}{4} = \frac{2+0+0+2}{4} = \frac{4}{4} = 1

MD(f) = \frac{\sum_{i=1}^{n} |f_i - \bar{f}|}{n} = \frac{|4-5| + |4-5| + |6-5| + |6-5|}{4} = \frac{|-1| + |-1| + |1| + |1|}{4} = \frac{1+1+1+1}{4} = \frac{4}{4} = 1
This is where the variance (commonly denoted σ², pronounced “sigma squared”) and standard deviation (denoted σ) can help out. The formula for the variance is very similar to that of the mean deviation, but it avoids the problem of taking the absolute value by simply squaring the deviations. Additionally, it provides us with a measure that is more sensitive to variation than the mean deviation. The formula for the variance is simply:
\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n} \qquad (8)

The standard deviation is simply the square root of the variance:

\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} \qquad (9)
All of these benefits do have a downside, however. Since the deviations are being squared, the variance and standard deviation do not have a clean and simple interpretation like the mean deviation does. They do have some nice qualities, though, which will be illustrated when we talk about distributions and hypothesis testing.
So how do the variance and standard deviation fare with e and f? Let's find the variances:

\sigma^2_e = \frac{\sum_{i=1}^{n} (e_i - \mu_e)^2}{n} = \frac{(3-5)^2 + (5-5)^2 + (5-5)^2 + (7-5)^2}{4} = \frac{(-2)^2 + 0^2 + 0^2 + 2^2}{4} = \frac{4+4}{4} = \frac{8}{4} = 2

\sigma^2_f = \frac{\sum_{i=1}^{n} (f_i - \mu_f)^2}{n} = \frac{(4-5)^2 + (4-5)^2 + (6-5)^2 + (6-5)^2}{4} = \frac{(-1)^2 + (-1)^2 + 1^2 + 1^2}{4} = \frac{1+1+1+1}{4} = \frac{4}{4} = 1
And the standard deviations:
\sigma_e = \sqrt{\sigma^2_e} = \sqrt{2} = 1.41

\sigma_f = \sqrt{\sigma^2_f} = \sqrt{1} = 1
Notice that the standard deviations are close (or, in the case of f, identical) to the mean deviations found earlier, but are now different from each other, better reflecting the true variability of e and f. In general, the larger the standard deviation, the greater the variability.
All of what we have done so far assumes that we are dealing with populations. Populations are complete sets of all observations of interest. In reality, true populations are often unknown. Most of the time, what we have in social science is sample data. Samples are simply subsets of a population. Because we often deal with sample data, we need to account for the extra uncertainty that comes with a sample. Think of it like a currency: every observation in a sample is a unit of currency, and whenever an estimate is calculated, one unit is “spent”. These units of “currency” are known as degrees of freedom (referred to as “df” for short), and one degree of freedom is lost each time we “spend” one to calculate an estimate.
More technically, degrees of freedom are the number of unrestricted, random values that constitute a statistic. In practice, this means that we have to make small adjustments to some of our formulas when dealing with samples. The biggest change for us right now is to remember that the formulas for the variance and standard deviation need to be slightly corrected. The sample variance can be found using the following formula:
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \qquad (10)

And the sample standard deviation is:

s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \qquad (11)
You might ask, what really changed? The most noticeable change is that the Greek letter σ is not used in either formula. Instead, the sample variance is denoted as s² and the sample standard deviation is denoted as s. These are estimates which approximate the unknown population variance σ² and population standard deviation σ. Since these are sample estimates, we lose one degree of freedom, which comes out of the denominator: instead of dividing by n, we divide by n − 1 when finding s² and s.
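To see the size of this correction, take the variable e from earlier and treat it as a sample rather than a population (an illustrative reuse of the same four numbers, not a distinction the data themselves can tell us). The sum of squared deviations is still 8, but now we divide by n − 1 = 3:

s^2_e = \frac{(3-5)^2 + (5-5)^2 + (5-5)^2 + (7-5)^2}{4-1} = \frac{8}{3} = 2.67 \qquad s_e = \sqrt{2.67} = 1.63

which is slightly larger than the population values \sigma^2_e = 2 and \sigma_e = 1.41 found above.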
Also of note is that the typical notation for the population mean and sample
mean are different. The population mean is usually denoted by the Greek
letter µ (pronounced “mu”), and the sample mean is usually denoted with
a bar over the variable name, x̄. Once again, in practice the true value of
µ is often unknown, and the mean of the observed sample data x̄ is only an
estimate of µ.
GNUMERIC Commands:
Average: =AVERAGE(number1,number2,...)
Median: =MEDIAN(number1,number2,...)
Mode: =MODE(number1,number2,...)
Range: =MAX(number1,number2,...)-MIN(number1,number2,...)
Mean Deviation: =AVEDEV(number1,number2,...)
Population Variance: =VARP(number1,number2,...)
Population Standard Deviation: =STDEVP(number1,number2,...)
Sample Variance: =VAR(number1,number2,...)
Sample Standard Deviation: =STDEV(number1,number2,...)
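For example, if the seven observations of b were entered in cells A1 through A7 (this cell layout is just an illustration, not something the handout prescribes), the following formulas would reproduce the results computed by hand above:

Mean of b: =AVERAGE(A1:A7) returns 26.71
Median of b: =MEDIAN(A1:A7) returns 4
Range of b: =MAX(A1:A7)-MIN(A1:A7) returns 170 − (−2) = 172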