The POWERMUTT Project: Regression Analysis

XII. REGRESSION ANALYSIS
Subtopics

Introduction
Scatterplots
Regression Equations
Pearson’s r2
Pearson’s r
Deviant Case Analysis
Multivariate Analysis
Curvilinear Relationships
Key Concepts
Exercises
For Further Study

SPSS Tools

New with this topic
o Scatterplot
o Correlate
o Regression
Review
o Getting Started
o Compute
Introduction[1]
This topic describes regression analysis which, for reasons that will be explained shortly, is also
called ordinary least squares (OLS). It is called regression analysis because Francis Galton
(1822-1911), a pioneer in the application of OLS to the behavioral sciences, used it to study
“regression toward the mean.”[2] Regression analysis is a simple but extremely powerful technique
with a wide variety of applications. It also forms the basis for many other techniques in
intermediate and advanced research methods courses. To use regression analysis appropriately, all
variables must be at least interval, though, as we will see, dichotomous variables constitute a
special case that may seem to, but really doesn't, violate this rule.
To help us understand regression analysis, we will try to explain why people in some states
identify themselves more with the Republican Party (and less with the Democratic Party) than do
people in some other states. Our measure of party identification is a scale derived from analysis
of CBS/New York Times polls by Gerald Wright et al. The data used here are from 1999
through 2003. The scale has a theoretical range from -100 (a completely Democratic state) to
+100 (an all GOP state).
Scatterplots
The philosopher and mathematician René Descartes (1596-1650) famously wrote, “I think,
therefore I am.” One of the things he thought about was coordinate graphs. In his honor, the
locations of points on such a graph are sometimes referred to as “Cartesian coordinates.”[3]
The graph consists of a horizontal (X) axis, sometimes called the “abscissa,” and a vertical (Y)
axis, sometimes called the “ordinate.” You can think of the X axis as being similar to the
columns of a contingency table, and the Y axis as similar to the rows, except that, in testing a
hypothesis, the independent variable should always be placed on the X axis, and the dependent
variable should always be placed on the Y axis. (When you are using a scatterplot for purely
descriptive purposes, it doesn't matter which is on the horizontal and which is on the vertical
axis. This would be the case, for example, if you wanted to compare two different measures of
the voting record of members of congress.) Each case is represented by a point on the graph
based on its values for X and Y. Taken together, the points form a scatterplot (or scattergram, or
scatter diagram). In the following figure, each point represents a state. We will begin by
examining the perhaps obvious hypothesis that the more liberal the people of a state are, the less
they will identify with the Republican Party. The independent variable, ideology, is also derived
from Wright et al., and uses a scale on which -100 is most conservative and +100 is most liberal.
Regression Equations
Notice that the points in the scatterplot form a pattern. As the value of the independent variable
increases, the value of the dependent variable tends to decrease. Insofar as it decreases at a
constant rate, the scatterplot will tend to form a downwardly sloping straight line. Conversely,
insofar as the values of the two variables increase together at a constant rate, the scatterplot will
tend to form an upwardly sloping straight line. The line of best fit (also called the regression
line) is the straight line that, loosely speaking, passes through the "middle" of the scatterplot.
More precisely, it is the one with the smallest variance of points about the line. Recall that
variance is the mean squared deviation. The best fitting line, in other words, is the one with the
least sum of squared deviations between the line and the points on the graph (hence “least squares”).
110
XII. Regression Analysis
In high school geometry, you probably learned that the general formula for a straight line is: Y =
mX + b.
Statisticians use slightly different symbols, using “a” instead of “b,” “b” instead of “m,” and
reversing the order of the two terms on the right side of the equation, thus producing: Y = a +
bX.
In this formula, “a” is the Y intercept (the value of Y when X = 0), and “b” is the slope. The
former usually has little or no theoretical importance, and would often lead to absurd conclusions
if interpreted. The “b” coefficient (also called the regression coefficient) is very important. It
tells us the nature of the linear relationship between the dependent and independent variables: the
increase or decrease in the dependent variable that, all else being equal, can be expected from an
increase of one unit in the value of the independent variable.
This general equation describes any straight line. Using appropriate formulas,[4] we solve for
the values of “a” and “b” that produce the least squares equation for our data. We obtain the
results shown here:
We’ll return to the “model summary” and the “ANOVA” later. For now, we are interested only
in the “B” column under “unstandardized coefficients” in the “coefficients” table. The
“constant” (-10.263) is the Y intercept, while the number underneath it (-.493) is the slope.
Rewriting these numbers in standard algebraic form, we obtain: Y' = -10.263 - .493X. Note that
the dependent variable in the equation is shown as Y′ (Y prime), the “predicted”[5] value of Y.
This means that, all else being equal, we would “predict” that a given case will fall exactly on the
line (at the point obtained by multiplying a given value of X by -.493 and adding the result
to -10.263). Arkansas, for example, has an ideology score of -25.84. Plugging this
value into the equation, we predict a party id score for this state of 2.48 (that is, with Republicans
enjoying a slight edge over Democrats).
In fact, the party identification score for Arkansas is -21.88 (that is, showing a decided advantage
for the Democrats). The least squares regression line, even though it is the best fitting straight
line, is far from a perfect fit to the data. If it were, all the points in the graph would fall exactly
on the line. (The deviations between the actual points and where they would fall on the line are
called the residuals.)
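Using the coefficients reported above, the prediction and residual for Arkansas can be reproduced in a few lines of Python:

```python
a, b = -10.263, -0.493          # intercept and slope from the output above

def predict(ideology):
    """Predicted party identification score for a given ideology score."""
    return a + b * ideology

y_pred = predict(-25.84)        # Arkansas's ideology score
residual = -21.88 - y_pred      # actual minus predicted party id

print(round(y_pred, 2), round(residual, 2))   # 2.48 -24.36
```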
The following figure repeats the scatterplot shown above, but this time the regression line has
been added.
In the next figure, each point has been labeled with the name of the state, and points have been
coded by region. Note that several southern states have negative residuals — they seem to
identify less with the GOP than we would expect based on ideology. In other words, despite
Republican inroads into the once “solid South,” the South (notwithstanding its relatively
conservative ideology) still retains some of its traditional ties to the Democratic Party. On the
other hand, a number of states in the Rocky Mountains and Great Plains are more Republican
than we would expect based on ideology.
Pearson’s r2 (the Coefficient of Determination)
This then raises the question of how good the best fitting line is. In guessing the value of the
dependent variable, how much does it help to know the value of the independent variable?[6]
For an interval or ratio variable, our best guess as to the score of an individual case, if we knew
nothing else about that case, would be the mean. The total sum of the squared deviations from
the mean gives us a measure of the error we make by guessing the mean, since the greater this
sum, the less reliable a predictor the mean will be. From the analysis of variance (“ANOVA”)
table presented earlier in this topic, we see that in this case the total sum of squares is 5865.732.
How much less will our error be in guessing the value of the dependent variable (in this case,
party identification) if we know the value of the independent variable (ideology)? We can
calculate the sum of squared deviations about the least squares line in the same way as total
variation is calculated, except that, instead of subtracting the overall mean from each score, we
subtract the predicted value of Y. In the case of Arkansas, for example, we subtract the
predicted value of 2.48 from the actual value of -21.88, leaving us with a deviation (or residual)
of -24.36. By doing this for each state, then squaring and summing the results, we obtain the
residual sum of squares (5054.208, from the ANOVA table above). We can then determine how
much less variation there is about the regression line than about the mean. The formula:
r2 = (total sum of squares - residual sum of squares)/total sum of squares
provides us with the familiar proportional reduction in error. (Note: as when computing eta2 in
the previous topic, we don't need to divide each element in the equation by N, since it is the same
in each instance.)
In this case: r2 = (5865.732 - 5054.208)/5865.732 = .138.
In other words, by knowing a state’s ideology score, we can reduce the error we make in
guessing its partisanship score by about 13.8 percent. Pearson’s r2 thus belongs to the same
“PRE” family of measures of association as Lambda, Gamma, Kendall’s tau, eta2, and others.
Pearson’s r2 is also called the coefficient of determination, because it tells us the proportion of
the variance in the dependent variable that is “determined” by (or “explained” by) its association
with the independent variable. Put another way, it tells us how much closer the points in the
scatterplot come to the regression line than they do to the mean.
Pearson’s r
Just as the standard deviation, rather than the variance, is usually reported in measuring
dispersion, Pearson’s r (also called the correlation coefficient) is usually reported rather than r2.
Pearson’s r is the positive square root of r2 when the relationship (as indicated by the sign of the
“b” coefficient) is positive, and the negative square root when the relationship is negative.[7] It
thus ranges from 0 (when there is no relationship between the two variables) to ±1 (when
indicating a perfect relationship). In the case of the relationship between partisanship and
ideology, it is -.372.
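Since Pearson’s r takes the sign of the “b” coefficient, it can be recovered from r2 in a line of Python, using the sums of squares from the ANOVA table:

```python
import math

r_squared = (5865.732 - 5054.208) / 5865.732   # from the ANOVA table
b = -0.493                                      # slope of the regression line

# r is the square root of r2, with the sign of the slope.
r = math.copysign(math.sqrt(r_squared), b)
print(round(r, 3))   # -0.372
```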
We can also perform a test for the statistical significance of the relationship. From the ANOVA
table presented earlier, we see that the relationship is significant at the .009 level (written “p =
.009”). Note that this is a so-called "two-tailed" test, that is, one in which the hypothesis does
not predict the direction (positive or negative) of the relationship. Since, in this and in most
cases, we do in fact predict the direction (in this instance, negative), we can use a "one-tailed"
test, which cuts the p value in half (p = .0045).
Deviant Case Analysis
One difference between political science as a social science and political science as a humanity is
that the latter tends to focus on the unique person or event while the former tends to focus on
typical patterns in human behavior. This is not, however, a hard and fast distinction, and the
study of the unique and the typical are complementary, not conflicting, pursuits. Regression
analysis illustrates this point quite well. Once we have discovered an overall pattern, we can
then focus on those cases that do not fit that pattern, that is, those with high residuals. This may
be a matter of an unusual case that is “the exception that proves the rule.” (The original meaning
of this saying was that the exception "tests" the rule, which actually makes a lot more sense.)
Earlier, we found that several southern states were a good deal less Republican than we would
have predicted. There are, on the other hand, some Rocky Mountain and Great Plains states that
are substantially more Republican than their ideology would have led us to expect. Perhaps these
areas have great potential for party building efforts (the South for Republicans and the Rocky
Mountains and Great Plains for the Democrats).
Finding deviant cases may help us generate additional hypotheses. In other words, just as
finding a pattern helps us focus on cases that do not fit the pattern, finding such cases may in turn
help us in looking for other patterns. Figure 3 shows that two New England states, New
Hampshire and Vermont, are about as liberal as two others, Massachusetts and Rhode Island, but
are far more Republican. Can you explain why?
Multivariate Analysis
Regression can be extended to analysis that includes more than one independent variable. There
are limits to such analysis. Among these is multicollinearity. When two or more independent
variables are highly correlated with one another, it may be impossible to separate the impact of
each on the dependent variable. Despite this, regression provides a very powerful tool for
creating more comprehensive models of political life.
Though not easily represented graphically, the multiple regression equation is relatively
straightforward:
Y' = a + b1X1 + b2X2 + . . . + bnXn
This equation, instead of describing the least squares line in a two dimensional plane, describes
the least squares plane (or hyper plane) in a space with as many dimensions as there are variables
in the equation. Don't panic — the computer can handle the calculations.
The equation just described is called the unstandardized regression equation, because each b
coefficient is expressed in terms of the original units of analysis. For example, if Y is measured
in percents and X1 in thousands of dollars, a value of -0.8 for b1 would mean that, all else being
equal, an increase of $1,000 in X1will result in a decrease of 0.8 percent in Y.
There is also a standardized regression equation, in which all relationships are expressed in
standard scores. This equation takes the following general form:
Y' = β1X1 + β2X2 + . . . + βnXn
If β1 (pronounced "beta sub 1"), for example, were to equal -0.8, it would mean that, all else
being equal, an increase of 1 standard deviation in X1 will result in a decrease of .8 standard
deviations in Y. In a standardized regression equation, the “a” coefficient is always zero, and so
drops out of the equation.
Both standardized and unstandardized regression equations are important. Because it expresses
relationships in terms of the original units of analysis, the unstandardized equation is often easier
to understand. It is also easier to use the unstandardized equation to calculate the predicted
value of Y for any given case or set of cases. On the other hand, because it expresses
all relationships in terms of standard scores, the standardized equation lets us evaluate the
relative importance of each independent variable: the higher the value (ignoring sign) of, say, β1,
the bigger the change in Y produced by a change of one standard deviation in X1.
For either form of the equation, we can obtain the multiple correlation coefficient and coefficient
of determination, written R and R2 respectively. R2 is a proportional reduction in error measure
that tells us how much more accurately we can guess the value of the dependent variable by
knowing the values of all the independent variables in the equation. R2 should usually be
adjusted to take into account the number of variables in the equation.
When an independent variable is a dichotomy, it can be entered into a regression equation like
any other variable. Called dummy variables in this context, dichotomies are usually coded “0”
and “1.” (Actually, any two numbers will do, but 0 and 1 will make the results easier to
interpret.) Suppose that gender is an independent variable, with female coded 1 and male coded
0. In an unstandardized regression equation in which the dependent variable is a “feeling
thermometer” for Hillary Clinton, a “b” coefficient of 8.106 associated with gender would mean
that, all else being equal, women rate her a little more than 8 points higher than do men.
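A quick sketch shows why the “b” coefficient for a 0/1 dummy is exactly the predicted gap between the two groups (the intercept of 50 here is hypothetical, for illustration only; the 8.106 comes from the example above):

```python
a = 50.0          # hypothetical intercept, for illustration only
b_female = 8.106  # "b" coefficient for the gender dummy, from the example

def predicted_rating(female):
    """Predicted feeling thermometer score; female is coded 0 or 1."""
    return a + b_female * female

# With 0/1 coding, the predicted gap between women and men is just b.
gap = predicted_rating(1) - predicted_rating(0)
print(round(gap, 3))   # 8.106
```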
A variable with more than two categories can be converted into a series of dummy variables. If
we have a region variable with four categories (Northeast, Midwest, South, and West), we can
create up to three dummy variables (such as Northeast, South, and West) in which a case is
coded 1 if it is located in the region, and 0 if it is not. We could not create a fourth dummy
variable, since if we specify the value of a case for three regions, we have in effect already
specified its value for the fourth. (If a respondent does not live in the Northeast, the South, or
the West, and there is only one other region, (s)he must live in the Midwest.)
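The recoding into dummy variables can be sketched in Python as follows (a minimal illustration, with Midwest treated as the omitted reference category):

```python
def region_dummies(region):
    """Code a four-category region as three 0/1 dummies; Midwest is the reference."""
    assert region in ("Northeast", "Midwest", "South", "West")
    return {r: int(region == r) for r in ("Northeast", "South", "West")}

print(region_dummies("South"))    # {'Northeast': 0, 'South': 1, 'West': 0}
print(region_dummies("Midwest"))  # {'Northeast': 0, 'South': 0, 'West': 0}
```

A Midwestern case is the one coded 0 on all three dummies, which is why a fourth dummy would be redundant.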
Note that dummy variables cannot be used as dependent variables in a regression equation.
There are other methods available for such a purpose (notably logit and probit), but they are
beyond the scope of this topic.
Finally, we can perform a test for the statistical significance of each independent variable in a
multiple regression equation (called the “t test”) and for the equation as a whole (called the “F
ratio”).
To illustrate the notion of multiple regression, consider the following variables for the American
states:
Y = Party identification
X1 = Ideology
X2 = Percent of households that include a married couple
X3 = Dummy variable (South = 1; other states = 0)
Let’s begin by creating a correlation matrix among all four of these variables. We can see that
there is a moderate positive correlation between identification with the GOP and the percent of
households containing a married couple, and moderate negative correlations with ideology (that
is, with liberalism) and with being in the South. (Note that the coefficients in this matrix are “r,”
not “r2.”)
Now let’s carry out a regression analysis. The relevant portions of the output provided by SPSS
are as follows:
The “Adjusted R Square” (adjusted for the number of variables in the equation) in the model
summary shows that all three independent variables taken together explain about 52 percent of
the variation among the states in party identification.
The ANOVA table shows that the residual sum of squares (the sum of squared deviations from
the least squares line) is 2612.883, while the total sum of squares (the sum of squared deviations
from the mean) is 5865.732. Note that (5865.732 – 2612.883) / 5865.732 = .555. This is
identical to the unadjusted R Square in the model summary. The “Sig” of .000 is the
significance level (based on an “F ratio”). In other words, for the model as a whole, p < .001.
The “coefficients” table provides the regression equations. Under “unstandardized coefficients,”
the “Constant” (-88.163) is the “a” coefficient. The remaining values in this column are the “b”
coefficients. Rewriting this in standard algebraic form, the unstandardized regression equation
is:
Y' = -88.163 - .559X1 + 1.593X2 - 10.355X3.
Similarly, the standardized coefficients can be rewritten as Y' = -.421X1 + .390X2 - .442X3.
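As a sketch of how the unstandardized equation generates predictions, the following uses Arkansas’s ideology score (-25.84) together with a made-up married-couple percentage of 55 (hypothetical, for illustration only):

```python
def predict_party_id(ideology, pct_married, south):
    """Predicted party identification from the unstandardized equation above."""
    return -88.163 - 0.559 * ideology + 1.593 * pct_married - 10.355 * south

# A hypothetical Southern state: ideology -25.84, 55 percent married couples.
y_hat = predict_party_id(-25.84, 55.0, 1)
print(round(y_hat, 2))   # 3.54
```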
The unstandardized equation tells us that, all else being equal, each additional point on the
ideology (liberalism) scale is associated with a reduction of something over a half point on the
party identification (Republicanism) scale, that each additional percent of households containing
married couples is associated with an increase of about 1.6 points, and that a Southern state will
have a score about 10.4 points lower than a state outside the South. Note that each of these
coefficients holds constant the other variables in the equation. Thus, for example, the coefficient
for the South indicates that we would expect a Southern state to score about 10.4 points lower
than a non-Southern state with the same ideology score and the same proportion of married
couples.
A limitation of the unstandardized equation is that each variable is measured in terms of very
different and hard to compare units of analysis. For example, it isn't obvious whether a decrease
of .559 points on the party id scale for an increase of one point on the ideology scale constitutes a
larger or a smaller change than an increase of 1.593 points for an increase of one percent in
households including married couples.
The standardized equation makes it easier to compare the relative importance of the different
independent variables. In this case, it tells us that each independent variable has a
moderately important impact on party identification, even when the other independent variables
are held constant. A disadvantage of the standardized equation is that it's rather abstract. What
does it mean, for example, to say that an increase of one standard deviation in household
composition is associated with an increase of .390 standard deviations in party id? Because
unstandardized and standardized equations each have their strengths and limitations, it is helpful
to have both.
Finally, the figures in the “Sig” column show that (based on t tests) the contribution of each
independent variable is statistically significant even when the other variables in the equation are
taken into account.
Curvilinear Relationships
Sometimes a scatterplot will show that there is a relationship between two variables, but that the
pattern forms a curved rather than a straight line. There are techniques for dealing with so-called
curvilinear patterns, but they are beyond the scope of this topic.
Key Concepts
coefficient of determination
correlation coefficient
deviant case analysis
dummy variables
line of best fit
multicollinearity
multivariate analysis
ordinary least squares
Pearson's r
Pearson's r2
regression
regression coefficient
regression line
residuals
scatterplots
standardized regression equation
unstandardized regression equation
Exercises
1. Start SPSS, and open the states.sav file. Open the states codebook. Repeat the analysis
described in this topic, but use either "ideo" or percent voting for Bush (which you will need to
compute) as your dependent variable. Select independent variables that you hypothesize
influence your dependent variable.
An optional byproduct of the SPSS regression tool is the ability to save residual scores as a new
variable. Choose this option. Using SPSS Data View, find the states with the highest positive
and negative residuals. Can you think of any reasons that would explain these?
2. Start SPSS, and open the senate.sav file and the senate codebook. Look at the measures of
senators’ voting records described in exercise 3 of the topic on Standard Scores and the Normal
Distribution. Using the correlate procedure, see if these variables are all measuring more or less
the same thing. Note that unity measures the degree to which a senator votes with his or her own
party when majorities of the two parties are opposed. To make these measures comparable to the
others, convert them so that they represent the degree to which a senator votes with the
Republican Party. Note again that Lieberman (I, Connecticut) and Sanders (I, Vermont) are
treated as Democrats for purposes of this variable.
Repeat this exercise, but use the House of Representatives instead of the Senate. (There are
currently no independents serving in the House.)
3. What constituency variables can be used to explain a senator’s voting record? Given a
senator’s constituency, does knowing anything about the senator as an individual (such as party
or gender) provide a more complete explanation? Are there some individual senators who are
much more liberal or much more conservative than predicted by your equation? (Note: Joseph
Lieberman of Connecticut and Bernie Sanders of Vermont are coded as independents for the
party variable. To treat party as a dummy variable either, 1) recode to treat these two senators as
Democrats (since they caucus with the Democratic Party), 2) go to SPSS Variable View and
make “3” a missing value for this variable, or 3) use select cases to exclude these senators from
your analysis.)
Repeat this exercise, but use the House of Representatives instead of the Senate.
4. Open the countries.sav file and the countries codebook. Freedom House provides estimates
of each country’s level of political rights and civil liberties. Compute an additive index summing
these two measures. What variables help explain the value of this index? Are there countries
that are either much more or much less democratic by this measure than your equation predicts?
Can you explain these “deviant cases”?
Repeat this analysis, but use perceived political corruption as your dependent variable.
5. Open the states.sav file and the states codebook. What explains the differences in the
"policy" variable among the states? (Note: strictly speaking, this variable is only ordinal level,
so we are taking some liberties here in using it in regression analysis, and so results should be
regarded as at best tentative.)
For Further Study
“Correlations,” StatSoft Electronic Textbook,
http://www.statsoft.com/textbook/stbasic.html#Correlations.
Lowry, Richard, “Introduction to Linear Correlation and Regression,” Concepts and
Applications of Inferential Statistics http://faculty.vassar.edu/lowry/webtext.html.
“Presenting Data for Two Continuous Measures,” Surfstat. http://surfstat.anu.edu.au/surfstathome/1-4-1.html.
Applets:
"Regression Applet," http://www.stattucino.com/berrie/dsl/regression/regression.html.
"Regression Applet," http://www.stat.sc.edu/~west/javahtml/Regression.html.
"Regression by Eye," http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html.
[1] The order in which concepts are introduced here differs a bit from that found in most
introductory texts, which calculate Pearson’s r directly and then proceed to Pearson’s r2. I cover
r2 first because of its relationship to other PRE measures discussed earlier and because r2 follows
more directly from the discussion of lines of best fit. For textbooks that employ a similar
approach, see William Buchanan, Understanding Political Variables, 4th edition. (NY:
Macmillan, 1988), chs. 18-19, and Susan Ann Kay, Introduction to the Analysis of Political
Data. (Englewood Cliffs, NJ: Prentice Hall, 1991), chs. 4-5.
[2] A. Abebe, J. Daniels, J. W. McKean, and J. A. Kapenga, “How Regression Got Its Name,”
Statistics and Data Analysis. http://www.stat.wmich.edu/s160/book/node70.html. 2001.
Accessed December 5, 2003.
[3] “Descartes, René,” Columbia Encyclopedia, 6th ed.
http://www.bartleby.com/65/de/Descarte.html. 2001. Accessed December 5, 2003.
[4] b = (NΣXY - ΣXΣY) / (NΣX² - (ΣX)²), and a = Ȳ - bX̄, where N is the number of cases, Ȳ is
the mean of Y, and X̄ is the mean of X.
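The least squares formulas in note [4] can be verified on a tiny made-up dataset (a sketch; the numbers are arbitrary, chosen to lie exactly on the line y = 1 + 2x):

```python
# Verify the ordinary least squares formulas on a tiny made-up dataset.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]   # exactly y = 1 + 2x

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope
a = sy / n - b * (sx / n)                        # intercept: mean(Y) - b * mean(X)

print(a, b)   # 1.0 2.0
```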
[5] Sometimes shown as Yc (the “calculated” value of Y) or Ŷ (Y hat).
[6] Note the similarity between the explanation that follows and that used in explaining eta2.
[7] In most textbooks, Pearson’s r is calculated directly. The formula is:
r = (NΣXY - ΣXΣY) / √[(NΣX² - (ΣX)²)(NΣY² - (ΣY)²)].