XII. REGRESSION ANALYSIS

Subtopics
o Introduction
o Scatterplots
o Regression Equations
o Pearson’s r²
o Pearson’s r
o Deviant Case Analysis
o Multivariate Analysis
o Curvilinear Relationships
o Key Concepts
o Exercises
o For Further Study

SPSS Tools
New with this topic
o Scatterplot
o Correlate
o Regression
Review
o Getting Started
o Compute

Introduction[1]

This topic describes regression analysis, a technique that, for reasons that will be explained shortly, is also called ordinary least squares (OLS). It is called regression analysis because Francis Galton (1822-1911), a pioneer in the application of OLS to the behavioral sciences, used it to study “regression toward the mean.”[2] Regression analysis is a simple but extremely powerful technique with a wide variety of applications. It also forms the basis for many other techniques in intermediate and advanced research methods courses.

To use regression analysis appropriately, all variables must be measured at least at the interval level, although, as we will see, dichotomous variables constitute a special case that may seem to violate this rule but really does not.

To help us understand regression analysis, we will try to explain why people in some states identify themselves more with the Republican Party (and less with the Democratic Party) than do people in other states. Our measure of party identification is a scale derived from analysis of CBS/New York Times polls by Gerald Wright et al. The data used here are from 1999 through 2003. The scale has a theoretical range from -100 (a completely Democratic state) to +100 (an all-GOP state).

Scatterplots

The philosopher and mathematician René Descartes (1596-1650) famously wrote, “I think, therefore I am.” One of the things he thought about was coordinate graphs.
(In his honor, the locations of points on such a graph are sometimes referred to as “Cartesian coordinates.”[3]) The graph consists of a horizontal (X) axis, sometimes called the “abscissa,” and a vertical (Y) axis, sometimes called the “ordinate.” You can think of the X axis as being similar to the columns of a contingency table, and the Y axis as similar to the rows, except that, in testing a hypothesis, the independent variable should always be placed on the X axis, and the dependent variable should always be placed on the Y axis. (When you are using a scatterplot for purely descriptive purposes, it doesn't matter which is on the horizontal and which is on the vertical axis. This would be the case, for example, if you wanted to compare two different measures of the voting records of members of Congress.)

Each case is represented by a point on the graph based on its values for X and Y. Taken together, the points form a scatterplot (or scattergram, or scatter diagram). In the following figure, each point represents a state. We will begin by examining the perhaps obvious hypothesis that the more liberal the people of a state are, the less they will identify with the Republican Party. The independent variable, ideology, is also derived from Wright et al., and uses a scale on which -100 is most conservative and +100 is most liberal.

Regression Equations

Notice that the points in the scatterplot form a pattern. As the value of the independent variable increases, the value of the dependent variable tends to decrease. Insofar as it decreases at a constant rate, the scatterplot will tend to form a downwardly sloping straight line. Conversely, insofar as the values of the two variables increase together at a constant rate, the scatterplot will tend to form an upwardly sloping straight line.

The line of best fit (also called the regression line) is the straight line that, loosely speaking, passes through the "middle" of the scatterplot.
More precisely, it is the one with the smallest variance of points about the line. Recall that variance is the mean squared deviation. The best fitting line, in other words, is the one that minimizes the sum of squared deviations between the line and the points on the graph, hence the name “least squares.”

In high school geometry, you probably learned that the general formula for a straight line is: Y = mX + b. Statisticians use slightly different symbols, using “a” instead of “b,” “b” instead of “m,” and reversing the order of the two terms on the right side of the equation, thus producing: Y = a + bX.

In this formula, “a” is the Y intercept (the value of Y when X = 0), and “b” is the slope. The former usually has little or no theoretical importance, and would often lead to absurd conclusions if interpreted. The “b” coefficient (also called the regression coefficient) is very important. It tells us the nature of the linear relationship between the dependent and independent variables: the increase or decrease in the dependent variable that, all else being equal, can be expected from an increase of one unit in the value of the independent variable.

This general equation describes any straight line. Using appropriate formulas,[4] we solve for the values of “a” and “b” that produce the least squares equation for our data. We obtain the results shown here:

We’ll return to the “model summary” and the “ANOVA” later. For now, we are interested only in the “B” column under “unstandardized coefficients” in the “coefficients” table. The “constant” (-10.263) is the Y intercept, while the number underneath it (-.493) is the slope. Rewriting these numbers in standard algebraic form, we obtain: Y′ = -10.263 - .493X.

Note that the dependent variable in the equation is shown as Y′ (Y prime), the “predicted”[5] value of Y.
This means that, all else being equal, we would “predict” that a given case will fall exactly on the line (the coordinates represented by multiplying a given value of X by .493 and subtracting the result from -10.263). Arkansas, for example, has an ideology score of -25.84. Plugging this value into the equation, we predict a party id score for this state of 2.48 (that is, with Republicans enjoying a slight edge over Democrats). In fact, the party identification score for Arkansas is -21.88 (that is, showing a decided advantage for the Democrats).

The least squares regression line, even though it is the best fitting straight line, is far from a perfect fit to the data. If it were, all the points in the graph would fall exactly on the line. (The deviations between the actual points and where they would fall on the line are called the residuals.) The following figure repeats the scatterplot shown above, but this time the regression line has been added.

In the next figure, each point has been labeled with the name of the state, and points have been coded by region. Note that several southern states have negative residuals: they seem to identify less with the GOP than we would expect based on ideology. In other words, despite Republican inroads into the once “solid South,” the South (notwithstanding its relatively conservative ideology) still retains some of its traditional ties to the Democratic Party. On the other hand, a number of states in the Rocky Mountains and Great Plains are more Republican than we would expect based on ideology.

Pearson’s r² (the Coefficient of Determination)

This then raises the question of how good the best fitting line is.
In guessing the value of the dependent variable, how much does it help to know the value of the independent variable?[6] For an interval or ratio variable, our best guess as to the score of an individual case, if we knew nothing else about that case, would be the mean. The total sum of the squared deviations from the mean gives us a measure of the error we make by guessing the mean, since the greater this sum, the less reliable a predictor the mean will be. From the analysis of variance (“ANOVA”) table presented earlier in this topic, we see that in this case the total sum of squares is 5865.732.

How much less will our error be in guessing the value of the dependent variable (in this case, party identification) if we know the value of the independent variable (ideology)? We can calculate the sum of squared deviations about the least squares line in the same way as total variation is calculated, except that, instead of subtracting the overall mean from each score, we subtract the predicted value of Y. In the case of Arkansas, for example, we subtract the predicted value of 2.48 from the actual value of -21.88, leaving us with a deviation (or residual) of -24.36. By doing this for each state, then squaring and summing the results, we obtain the residual sum of squares (5054.208, from the ANOVA table above).

We can then determine how much less variation there is about the regression line than about the mean. The formula:

r² = (total sum of squares - residual sum of squares) / total sum of squares

provides us with the familiar proportional reduction in error. (Note: as when computing eta² in the previous topic, we don't need to divide each element in the equation by N, since it is the same in each instance.) In this case:

r² = (5865.732 - 5054.208) / 5865.732 = .138

In other words, by knowing a state’s ideology score, we can reduce the error we make in guessing its partisanship score by about 13.8 percent.
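The arithmetic in this section can be retraced with a few lines of code. The Python sketch below is only an illustration: it uses the coefficients and sums of squares reported in the text, and the variable and function names are made up for this example rather than taken from SPSS.

```python
# Figures reported in the text (SPSS "coefficients" and "ANOVA" tables).
INTERCEPT = -10.263   # the "a" coefficient
SLOPE = -0.493        # the "b" coefficient
TOTAL_SS = 5865.732   # sum of squared deviations about the mean
RESIDUAL_SS = 5054.208  # sum of squared deviations about the regression line


def predict_party_id(ideology):
    """Predicted party identification (Y') for a given ideology score (X)."""
    return INTERCEPT + SLOPE * ideology


# Arkansas: ideology score of -25.84, actual party identification of -21.88.
arkansas_predicted = predict_party_id(-25.84)
arkansas_residual = -21.88 - arkansas_predicted

# Proportional reduction in error: how much less variation there is
# about the regression line than about the mean.
r_squared = (TOTAL_SS - RESIDUAL_SS) / TOTAL_SS

print(f"Predicted: {arkansas_predicted:.2f}")  # about 2.48
print(f"Residual:  {arkansas_residual:.2f}")   # about -24.36
print(f"r squared: {r_squared:.3f}")           # about .138
```

Repeating the residual calculation for every state, then squaring and summing, would reproduce the residual sum of squares; that step is omitted here because the state-level data are not listed in the text.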
Pearson’s r² thus belongs to the same “PRE” family of measures of association as Lambda, Gamma, Kendall’s tau, eta², and others. Pearson’s r² is also called the coefficient of determination, because it tells us the proportion of the variance in the dependent variable that is “determined” by (or “explained” by) its association with the independent variable. Put another way, it tells us how much closer the points in the scatterplot come to the regression line than they do to the mean.

Pearson’s r

Just as the standard deviation, rather than the variance, is usually reported in measuring dispersion, Pearson’s r (also called the correlation coefficient) is usually reported rather than r². Pearson’s r is the positive square root of r² when the relationship (as indicated by the sign of the “b” coefficient) is positive, and the negative square root when the relationship is negative.[7] It thus ranges from -1 (a perfect negative relationship) through 0 (no linear relationship between the two variables) to +1 (a perfect positive relationship). In the case of the relationship between partisanship and ideology, it is -.372.

We can also perform a test for the statistical significance of the relationship. From the ANOVA table presented earlier, we see that the relationship is significant at the .009 level (written “p = .009”). Note that this is a so-called "two-tailed" test, that is, one in which the hypothesis does not predict the direction (positive or negative) of the relationship. Since, in this and in most cases, we do in fact predict the direction (in this instance, negative), we can use a "one-tailed" test, for which the p value is half as large (that is, p = .0045).

Deviant Case Analysis

One difference between political science as a social science and political science as a humanity is that the latter tends to focus on the unique person or event while the former tends to focus on typical patterns in human behavior.
This is not, however, a hard and fast distinction, and the study of the unique and the typical are complementary, not conflicting, pursuits. Regression analysis illustrates this point quite well. Once we have discovered an overall pattern, we can then focus on those cases that do not fit that pattern, that is, those with high residuals. This may be a matter of an unusual case that is “the exception that proves the rule.” (The original meaning of this saying was that the exception "tests" the rule, which actually makes a lot more sense.)

Earlier, we found that several southern states were a good deal less Republican than we would have predicted. There are, on the other hand, some Rocky Mountain and Great Plains states that are substantially more Republican than their ideology would have led us to expect. Perhaps these areas have great potential for party building efforts (the South for Republicans and the Rocky Mountains and Great Plains for the Democrats).

Finding deviant cases may help us generate additional hypotheses. In other words, just as finding a pattern helps us focus on cases that do not fit the pattern, finding such cases may in turn help us in looking for other patterns. Figure 3 shows that two New England states, New Hampshire and Vermont, are about as liberal as two others, Massachusetts and Rhode Island, but are far more Republican. Can you explain why?

Multivariate Analysis

Regression can be extended to analysis that includes more than one independent variable. There are limits to such analysis. Among these is multicollinearity. When two or more independent variables are highly correlated with one another, it may be impossible to separate the impact of each on the dependent variable. Despite this, regression provides a very powerful tool for creating more comprehensive models of political life.

Though not easily represented graphically, the multiple regression equation is relatively straightforward:

Y′ = a + b₁X₁ + b₂X₂ + . . . + bₙXₙ

This equation, instead of describing the least squares line in a two dimensional plane, describes the least squares plane (or hyperplane) in a space with as many dimensions as there are variables in the equation. Don't panic: the computer can handle the calculations.

The equation just described is called the unstandardized regression equation, because each b coefficient is expressed in terms of the original units of measurement. For example, if Y is measured in percents and X₁ in thousands of dollars, a value of -0.8 for b₁ would mean that, all else being equal, an increase of $1,000 in X₁ will result in a decrease of 0.8 percent in Y.

There is also a standardized regression equation, in which all relationships are expressed in standard scores. This equation takes the following general form:

Y′ = β₁X₁ + β₂X₂ + . . . + βₙXₙ

If β₁ (pronounced "beta sub 1"), for example, were to equal -0.8, it would mean that, all else being equal, an increase of 1 standard deviation in X₁ will result in a decrease of .8 standard deviations in Y. In a standardized regression equation, the “a” coefficient is always zero, and so drops out of the equation.

Both standardized and unstandardized regression equations are important. Because it expresses relationships in terms of the original units of measurement, the unstandardized equation is often easier to understand. It is also easier to use the unstandardized equation to calculate the predicted value of Y for any given case or set of cases. On the other hand, because it expresses all relationships in terms of standard scores, the standardized equation lets us evaluate the relative importance of each independent variable: the higher the value (ignoring sign) of, say, β₁, the bigger the change in Y produced by a change of one standard deviation in X₁.

For either form of the equation, we can obtain the multiple correlation coefficient and coefficient of determination, written R and R² respectively.
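The link between the two forms of the equation can be checked numerically. The following Python sketch is not drawn from the text: it assumes NumPy is available, uses made-up data, and the helper names (ols, zscores, r_squared) are hypothetical. It fits both equations and confirms that the standardized intercept is zero, that each β equals the corresponding b rescaled by the ratio of standard deviations, and that R² is the same for either form.

```python
import numpy as np

# Made-up data for illustration only: two independent variables
# measured on very different scales, and one dependent variable.
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2)) * np.array([3.0, 10.0]) + np.array([5.0, 50.0])
y = 2.0 - 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)


def ols(X, y):
    """Least squares fit of y' = a + b1*x1 + ... + bn*xn; returns (a, b)."""
    design = np.column_stack([np.ones(len(y)), X])  # column of 1s for "a"
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[0], coefs[1:]


def zscores(values):
    """Convert each column to standard scores (mean 0, std. deviation 1)."""
    return (values - values.mean(axis=0)) / values.std(axis=0)


def r_squared(X, y, a, b):
    """Proportional reduction in error relative to guessing the mean."""
    residual_ss = np.sum((y - (a + X @ b)) ** 2)
    total_ss = np.sum((y - y.mean()) ** 2)
    return (total_ss - residual_ss) / total_ss


a, b = ols(X, y)                              # unstandardized equation
a_std, beta = ols(zscores(X), zscores(y))     # standardized equation

print("a (standardized):", round(float(a_std), 6))   # zero, up to rounding
print("beta:            ", beta)
print("b * sx / sy:     ", b * X.std(axis=0) / y.std())
print("R squared:       ", r_squared(X, y, a, b))
```

Because the two independent variables here have very different standard deviations, the b coefficients are not directly comparable with each other, while the β coefficients are.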
R² is a proportional reduction in error measure that tells us how much more accurately we can guess the value of the dependent variable by knowing the values of all the independent variables in the equation. R² should usually be adjusted to take into account the number of independent variables in the equation.

When an independent variable is a dichotomy, it can be entered into a regression equation like any other variable. Called dummy variables in this context, dichotomies are usually coded “0” and “1.” (Actually, any two numbers will do, but 0 and 1 will make the results easier to interpret.) Suppose that gender is an independent variable, with female coded 1 and male coded 0. In an unstandardized regression equation in which the dependent variable is a “feeling thermometer” for Hillary Clinton, a “b” coefficient of 8.106 associated with gender would mean that, all else being equal, women rate her a little more than 8 points higher than do men.

A variable with more than two categories can be converted into a series of dummy variables. If we have a region variable with four categories (Northeast, Midwest, South, and West), we can create up to three dummy variables (such as Northeast, South, and West) in which a case is coded 1 if it is located in the region, and 0 if it is not. We could not create a fourth dummy variable, since if we specify the value of a case for three regions, we have in effect already specified its value for the fourth. (If a respondent does not live in the Northeast, the South, or the West, and there is only one other region, (s)he must live in the Midwest.)

Note that dummy variables cannot be used as dependent variables in a regression equation. There are other methods available for such a purpose (notably logit and probit), but they are beyond the scope of this topic.
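The region example can be sketched in code. The short Python function below is an illustration only (the function name and the choice of the Midwest as the omitted category are assumptions made for this example): it converts a four-category region variable into three dummy variables, so that a case coded 0 on all three must be in the omitted region.

```python
REGIONS = ("Northeast", "Midwest", "South", "West")
REFERENCE = "Midwest"  # the omitted category; any one region could be chosen


def region_dummies(region):
    """Code a region as three 0/1 dummy variables, omitting REFERENCE."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region!r}")
    return {name: int(region == name) for name in REGIONS if name != REFERENCE}


print(region_dummies("South"))    # {'Northeast': 0, 'South': 1, 'West': 0}
print(region_dummies("Midwest"))  # {'Northeast': 0, 'South': 0, 'West': 0}
```

With this coding, the “b” coefficient for each dummy variable is interpreted relative to the omitted region, just as the gender coefficient above is interpreted relative to men.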
Finally, we can perform a test for the statistical significance of each independent variable in a multiple regression equation (called the “t test”) and for the equation as a whole (called the “F ratio”).

To illustrate the notion of multiple regression, consider the following variables for the American states:

Y = Party identification
X₁ = Ideology
X₂ = Percent of households that include a married couple
X₃ = Dummy variable (South = 1; other states = 0)

Let’s begin by creating a correlation matrix among all four of these variables.

We can see that there is a moderate positive correlation between identification with the GOP and the percent of households containing a married couple, and moderate negative correlations with ideology (that is, with liberalism) and with being in the South. (Note that the coefficients in this matrix are “r,” not “r².”)

Now let’s carry out a regression analysis. The relevant portions of the output provided by SPSS are as follows:

The “Adjusted R Square” (adjusted for the number of variables in the equation) for the model summary shows that all three independent variables taken together explain about 52 percent of the variation among the states in party identification.

The ANOVA table shows that the residual sum of squares (the sum of squared deviations from the least squares line) is 2612.883, while the total sum of squares (the sum of squared deviations from the mean) is 5865.732. Note that (5865.732 - 2612.883) / 5865.732 = .555. This is identical to the unadjusted R Square in the model summary. The “Sig” of .000 is the significance level (based on an “F ratio”). In other words, for the model as a whole, p < .001.

The “coefficients” table provides the regression equations. Under “unstandardized coefficients,” the “Constant” (-88.163) is the “a” coefficient. The remaining values in this column are the “b” coefficients.
Rewriting this in standard algebraic form, the unstandardized regression equation is: Y′ = -88.163 - .559X₁ + 1.593X₂ - 10.355X₃. Similarly, the standardized coefficients can be rewritten as: Y′ = -.421X₁ + .390X₂ - .442X₃.

The unstandardized equation tells us that, all else being equal, each additional point on the ideology (liberalism) scale is associated with a reduction of something over a half point on the party identification (Republicanism) scale, that each additional percent of households containing married couples is associated with an increase of about 1.6 points, and that a Southern state will have a score about 10.4 points lower than a state outside the South. Note that each of these coefficients holds constant the other variables in the equation. Thus, for example, the coefficient for the South indicates that we would expect a Southern state to score about 10.4 points lower than a non-Southern state with the same ideology score and the same proportion of married couples.

A limitation of the unstandardized equation is that the variables are measured in very different and hard to compare units. For example, it isn't obvious whether a decrease of .559 points on the party id scale for an increase of one point on the ideology scale constitutes a larger or a smaller change than an increase of 1.593 points for an increase of one percent in households including married couples.

The standardized equation makes it easier to compare the relative importance of the different independent variables. In this case, it tells us that each independent variable has a moderately important impact on party identification, even when the other independent variables are held constant. A disadvantage of the standardized equation is that it's rather abstract.
What does it mean, for example, to say that an increase of one standard deviation in household composition is associated with an increase of .390 standard deviations in party id? Because unstandardized and standardized equations each have their strengths and limitations, it is helpful to have both.

Finally, the figures in the “Sig” column show that (based on t tests) the contribution of each independent variable is statistically significant even when the other variables in the equation are taken into account.

Curvilinear Relationships

Sometimes a scatterplot will show that there is a relationship between two variables, but that the pattern forms a curved rather than a straight line. There are techniques for dealing with so-called curvilinear patterns, but they are beyond the scope of this topic.

Key Concepts
o coefficient of determination
o correlation coefficient
o deviant case analysis
o dummy variables
o line of best fit
o multicollinearity
o multivariate analysis
o ordinary least squares
o Pearson's r
o Pearson's r²
o regression
o regression coefficient
o regression line
o residuals
o scatterplots
o standardized regression equation
o unstandardized regression equation

Exercises

1. Start SPSS, and open the states.sav file. Open the states codebook. Repeat the analysis described in this topic, but use either "ideo" or percent voting for Bush (which you will need to compute) as your dependent variable. Select independent variables that you hypothesize influence your dependent variable. An optional byproduct of the SPSS regression tool is the ability to save residual scores as a new variable. Choose this option. Using SPSS Data View, find the states with the highest positive and negative residuals. Can you think of any reasons that would explain these?

2. Start SPSS, and open the senate.sav file and the senate codebook. Look at the measures of senators' voting records described in exercise 3 of the topic on Standard Scores and the Normal Distribution.
Using the correlate procedure, see if these variables are all measuring more or less the same thing. Note that unity measures the degree to which a senator votes with his or her own party when majorities of the two parties are opposed. To make these measures comparable to the others, convert them so that they represent the degree to which a senator votes with the Republican Party. Note again that Lieberman (I, Connecticut) and Sanders (I, Vermont) are treated as Democrats for purposes of this variable. Repeat this exercise, but use the House of Representatives instead of the Senate. (There are currently no independents serving in the House.)

3. What constituency variables can be used to explain a senator’s voting record? Given a senator’s constituency, does knowing anything about the senator as an individual (such as party or gender) provide a more complete explanation? Are there some individual senators who are much more liberal or much more conservative than predicted by your equation? (Note: Joseph Lieberman of Connecticut and Bernie Sanders of Vermont are coded as independents for the party variable. To treat party as a dummy variable, either 1) recode to treat these two senators as Democrats (since they caucus with the Democratic Party), 2) go to SPSS Variable View and make “3” a missing value for this variable, or 3) use select cases to exclude these senators from your analysis.) Repeat this exercise, but use the House of Representatives instead of the Senate.

4. Open the countries.sav file and the countries codebook. Freedom House provides estimates of each country’s level of political rights and civil liberties. Compute an additive index summing these two measures. What variables help explain the value of this index? Are there countries that are either much more or much less democratic by this measure than your equation predicts? Can you explain these “deviant cases”?
Repeat this analysis, but use perceived political corruption as your dependent variable.

5. Open the states.sav file and the states codebook. What explains the differences in the "policy" variable among the states? (Note: strictly speaking, this variable is only ordinal level, so we are taking some liberties here in using it in regression analysis, and results should be regarded as at best tentative.)

For Further Study

“Correlations,” StatSoft Electronic Textbook, http://www.statsoft.com/textbook/stbasic.html#Correlations.

Lowry, Richard, “Introduction to Linear Correlation and Regression,” Concepts and Applications of Inferential Statistics, http://faculty.vassar.edu/lowry/webtext.html.

“Presenting Data for Two Continuous Measures,” Surfstat, http://surfstat.anu.edu.au/surfstathome/1-4-1.html.

Applets:

"Regression Applet," http://www.stattucino.com/berrie/dsl/regression/regression.html.

"Regression Applet," http://www.stat.sc.edu/~west/javahtml/Regression.html.

"Regression by Eye," http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html.

[1] The order in which concepts are introduced here differs a bit from that found in most introductory texts, which calculate Pearson’s r directly and then proceed to Pearson’s r². I cover r² first because of its relationship to other PRE measures discussed earlier and because r² follows more directly from the discussion of lines of best fit. For textbooks that employ a similar approach, see William Buchanan, Understanding Political Variables, 4th edition (NY: Macmillan, 1988), chs. 18-19, and Susan Ann Kay, Introduction to the Analysis of Political Data (Englewood Cliffs, NJ: Prentice Hall, 1991), chs. 4-5.

[2] A. Abebe, J. Daniels, J. W. McKean, and J. A. Kapenga, “How Regression Got Its Name,” Statistics and Data Analysis. http://www.stat.wmich.edu/s160/book/node70.html. 2001. Accessed December 5, 2003.

[3] “Descartes, René,” Columbia Encyclopedia, 6th ed. http://www.bartleby.com/65/de/Descarte.html. 2001. Accessed December 5, 2003.

[4] $b = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2}$ and $a = \bar{Y} - b\bar{X}$, where N is the number of cases, $\bar{Y}$ is the mean of Y, and $\bar{X}$ is the mean of X.

[5] Sometimes shown as Yc (the “calculated” value of Y) or Ŷ (Y hat).

[6] Note the similarity between the explanation that follows and that used in explaining eta².

[7] In most textbooks, Pearson’s r is calculated directly. The formula is: $r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}$.