QMIN Regression 2006-03-01

1 GLM: Regression

1.1 Simple Regression

1.1.1 Background

Simple regression involves predicting one quantitative variable (called the dependent variable) from another quantitative variable (called the independent or predictor variable). The terms dependent and independent imply predictability but do not necessarily imply causality. The most common notation in regression is to let Y denote the dependent variable and X the independent variable. The phrase "regress (name of dependent variable) on (name of independent variable)" is often used. For example, "regress receptor levels on age" denotes that receptor level is the dependent variable and age is the independent variable.

In simple regression, one fits a straight line through the data points and then uses that line to predict values of the dependent variable from values of the independent variable. The fundamental equation for the predicted value is

	Ŷ = α + βX,

where Ŷ is the predicted value of the dependent variable, α is the intercept of the line (i.e., the place where it crosses the vertical axis, which is the same as the predicted value of Y when X = 0), and β is the slope of the line.

We can write a similar equation for an observed value of Y. Because one will never be able to predict all of the observed Ys with perfect accuracy, there will be prediction errors. A prediction error is the difference between an observed value of the dependent variable and its predicted value, i.e., Y − Ŷ. Letting E denote a prediction error, the equation for an observed Y is

	Y = Ŷ + E = α + βX + E.

In simple regression, the estimates of the parameters α and β are those that minimize the sum of squared prediction errors. That is, calculate the prediction error for the first observation and square it, then do the same for all other observations in the sample. Finally, add together all the squared prediction errors. The result is termed the error sum of squares.
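The least-squares idea just described is easy to make concrete. The sketch below, in Python with NumPy, uses invented numbers that merely mimic the receptor example (they are not the QMIN data): it computes the closed-form least-squares estimates and verifies that any other line gives a larger error sum of squares.

```python
import numpy as np

# Invented data standing in for receptor density (fmol/mg protein) vs. age.
rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=27)
receptor = 9.4 - 0.06 * age + rng.normal(0.0, 0.9, size=27)

# Closed-form least-squares estimates: slope = cov(X, Y) / var(X),
# intercept chosen so the line passes through the means.
beta = np.cov(age, receptor)[0, 1] / np.var(age, ddof=1)
alpha = receptor.mean() - beta * age.mean()

# Prediction errors E = Y - Yhat, and the error sum of squares.
predicted = alpha + beta * age
errors = receptor - predicted
sse = np.sum(errors**2)

# With an intercept in the model, the residuals sum to zero, and
# perturbing the fitted line can only increase the error sum of squares.
assert abs(errors.sum()) < 1e-9
assert sse < np.sum((receptor - (alpha + 0.5 + beta * age)) ** 2)
```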
Parameters that minimize the error sum of squares are called least squares estimates.

We illustrate simple regression by considering the problem of change in the number of a particular receptor in human cortex with age. To investigate this, a researcher obtains cortex from a series of post mortems, extracts protein, and then performs a binding assay for the receptor.

1.1.2 How to Do It

1.1.2.1 Step 1: Check the Data

The very first step in a simple regression should be to examine the data with an eye towards a possible nonlinear relationship and outliers. We discuss nonlinearity later, so here we concentrate on outlier detection. This is an essential step because even a single outlier can give very misleading results, especially with the moderate sample sizes used in experimental neuroscience. The best method to assess outliers for this simple case is to follow the procedures outlined in Section X.X and construct a scatter plot. Figure 1.1 illustrates the plot, along with the regression line. (We discuss this line later.) In the present example, there do not appear to be any disconnected data points. (Later we shall demonstrate the effects of outliers.)

Figure 1.1 Example of a scatter plot and the regression line (line of best fit).

1.1.2.2 Step 2: Compute the Regression

The overall orientation of the data points in Figure 1.1, along with the slope of the regression line, suggests that the density of receptors decreases with age. There are, however, only 27 observations in this data set. Could such a pattern result simply from chance? Only a rigorous statistical test can answer this question.

All general statistical packages contain at least one routine for computing regressions. In the present case, we used the PROC REG routine in SAS. The dependent variable is called Receptor, which is the concentration of bound receptor in a binding assay per unit of protein; we refer to variable Receptor as reflecting the number of receptors.
The independent variable is simply called Age. The mathematical model behind simple regression fits a straight line through the data points. For the present case, the population equation behind the regression is

	Receptor̂ = α + β·Age.	(X.1)

Here, the hat (^) over Receptor denotes that this is the predicted value of Receptor. Observed values of receptor number in the cortex, however, will not always equal their predicted values. Hence, simple regression adds an error term when writing the equation for observed values of the dependent variable:

	Receptor = Receptor̂ + Error = α + β·Age + E.

Note that we abbreviate error as E. Regression procedures obtain estimates of the population parameters α and β by minimizing the sum of squared errors, the summation being taken over all observations in the data set.

1.1.2.3 Step 3: Interpret the Results

The output from the regression procedure is given in Figure 1.2. In general, a regression procedure will output two different tables. The first is called an analysis of variance table, and its purpose is to assess the overall fit of the GLM. The second is a table of parameter estimates that assesses the contribution of each independent variable to prediction. Because we have only one independent variable in the model, both tables will contain the same information, albeit expressed in different forms. Other output depends on the computer package and on options chosen by the user.

All regression software will calculate the squared multiple correlation, or R². R² is the square of the correlation between the predicted and the observed values of the dependent variable. Hence, it is a measure of the proportion of variance in the dependent variable explained by the predicted values (i.e., the model). With only a single independent variable, R² equals the square of the ordinary Pearson product-moment correlation between Receptor and Age.
Its value, rounded to .27, tells us that about 27% of the variance in receptor numbers is attributable to age.

Regression can capitalize on chance peculiarities in the data, so R² is, on average, an upwardly biased estimate of the population parameter. Hence, an adjusted R² is also printed. The actual degree to which R² is biased is generally unknown, so the adjustment is only approximate. The difference between R² and adjusted R² is a function of sample size and the number of independent variables in the model. Small samples and large numbers of independent variables will give greater discrepancies between R² and adjusted R² than large samples with few independent variables. R² is reported more often than adjusted R².

Figure 1.2 Output from a simple regression predicting the quantity of receptors in human cortex as a function of age.

	The REG Procedure
	Model: MODEL2
	Dependent Variable: receptor   Receptor Binding fmol/mg

	Analysis of Variance

	Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
	Model               1          7.55991       7.55991      9.02   0.0060
	Error              25         20.95872       0.83835
	Corrected Total    26         28.51863

	Root MSE           0.91561    R-Square   0.2651
	Dependent Mean     4.90630    Adj R-Sq   0.2357
	Coeff Var         18.66202

	Parameter Estimates

	Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
	Intercept    1              9.43824          1.51942      6.21     <.0001
	age          1             -0.06052          0.02015     -3.00     0.0060

With only a single independent variable, the analysis of variance table can be skipped because it will lead to the same inference as the table of parameter estimates. The estimate of the intercept (i.e., the estimate of α in Equation X.1) is 9.44. If the relationship between receptor number and age were linear throughout the lifespan, then this would be the predicted receptor concentration at birth. (Although this is the correct mathematical interpretation of the intercept, one should never extrapolate the regression line beyond the range of the data at hand.
Hence, it is best to regard the receptor concentration at birth as an unknown best examined by empirical data.)

The slope of the regression line (i.e., the estimate of parameter β in Equation X.1) is −.06. The minus sign implies a negative or inverse relationship: increasing age is associated with lower receptor numbers. The value of the estimate implies that a one-year increase in age is associated with a .06 reduction in receptor concentration (measured in this hypothetical example as fmol per mg of protein). Taking the estimates of α and β and placing them into Equation X.1 gives

	Receptor̂ = 9.44 − .06·Age.

These numbers define the regression line in Figure 1.1. This line is sometimes referred to as the line of best fit because the estimates are based on minimizing the sum of squared errors.

The standard error for this parameter estimate is an estimated standard error, so the appropriate test statistic for the hypothesis that β = 0 is the t statistic. The value here (−3.00) is large, and its associated p value (.006) is less than .05. Hence, we would conclude that there is evidence for a change in receptor number with age. In write-ups of a regression result, be careful with the degrees of freedom (the column labeled DF in Figure 1.2), because that column refers to the single parameter being tested. The actual degrees of freedom for the t test equal the degrees of freedom for error in the model, in this case 25.

Note that there is also a test that the intercept is 0. Usually, but not always, this test is unimportant. You may also have noticed that the p value of the regression coefficient is identical, to the fourth decimal place, to that of the F test in the ANOVA table. This is no coincidence; with a single independent variable, the F statistic is the square of the t statistic.

1.1.2.4 Step 4: Communicating the Results

Depending on the importance of the analysis, a graph similar to the one in Figure 1.1 may be the best way to communicate the results.
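When reporting these quantities, it helps to keep in mind how they relate to one another. Using the rounded values from the hypothetical output in Figure 1.2, the sketch below shows that the t statistic is simply the parameter estimate divided by its standard error, and that the fitted line drops by the slope for each additional year of age.

```python
# Values as reported in Figure 1.2 (a hypothetical example).
slope, slope_se = -0.06052, 0.02015
intercept = 9.43824

# The t statistic is the estimate divided by its standard error.
t_value = slope / slope_se            # about -3.00, matching the output

def predicted_receptor(age):
    """Predicted receptor concentration (fmol/mg protein) at a given age."""
    return intercept + slope * age

# A one-year increase in age lowers the prediction by |slope| units.
one_year_drop = predicted_receptor(50) - predicted_receptor(51)
```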
A simple regression has two measures of effect size: the raw regression coefficient and R². We recommend that both be included in the write-up. Naturally, the p value or some statement about statistical significance is also required. A very convenient way of presenting the statistical information is to include the regression equation and other statistics in the graph. In the present case, one could add the following line to Figure 1.1:

	Ŷ = 9.44 − .06·Age, R² = .27, p = .006.

As always, we recommend publishing figures only for statistically significant results or for theoretically or empirically meaningful results.

1.1.3 Assumptions

At this point it is useful to examine the assumptions underlying regression analysis because they will apply to the other sections in this chapter and beyond.

1.1.3.1 Linearity

Simple regression assumes that the relationship between the IV and the DV is linear. Figure 1.3 illustrates two different forms of a nonlinear relationship.

Figure 1.3 Examples of nonlinear relationships.

Usually, the effect of fitting a straight line to data with a strong nonlinear relationship is to reduce power. This is especially true when the relationship is U-shaped or inverted U-shaped, because the slope of the linear term will be close to 0. Hence, the danger is that it might fool the analyst into concluding that there was no relationship. The best way to diagnose linearity is to construct a scatterplot along the lines of Figure 1.1 or Figure 1.3 and visually inspect it. If you suspect the relationship may be nonlinear, then you can test for that using polynomial regression (see Section 1.3.3).

1.1.3.2 Normality of Residuals

The assumption that the residuals are normally distributed is necessary for the validity of the F statistic. Small departures from this assumption are of little concern. Large departures, however, can create problems. Usually, careful screening of the data prior to analysis can avoid these problems.
There are several options for assessing this assumption. One could calculate the residuals from the regression and then plot a histogram or perform a statistical test for normality (see Section X.X). Most regression packages have the option of constructing a cumulative distribution plot and/or a quantile-quantile (or Q-Q) plot of the residuals to assess normality (see Section X.X for a definition of these plots). Figure 1.4 illustrates both of these residual plots from a regression of dependent variable Y on independent variable X when X has a lognormal distribution and the true linear relationship is between Y and the log of X.

Figure 1.4 Example of residuals that are not normally distributed.

If the residuals in this case were normally distributed, then they would tend to form a straight line. It is quite clear that they do not, so one should question this regression. Often, the solution to non-normal residuals can be found in a transformation of the independent and/or the dependent variable. For the present example, the solution is clear: regress Y on the log of X and not on X itself. Figure 1.5 gives the residual plots for this regression.

Figure 1.5 Example of normally distributed residuals.

1.1.3.3 Equality of Residual Variances

A final assumption of the regression model is a condition called homoscedasticity, which, quite frankly, sounds more like a medical condition than a statistical one. Homoscedasticity is defined as the equality of variance around the regression line. It implies that the variance of the residuals is the same for each and every value of Ŷ. The most common way to assess homoscedasticity is to plot the residuals from the regression as a function of the predicted values. Figure 1.6 illustrates homoscedasticity (Panel A) and the lack of it (Panel B, a condition called heteroscedasticity).
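The log-transform remedy illustrated in Figures 1.4 and 1.5 can also be checked numerically. The sketch below simulates the same situation with entirely invented values: X is lognormal, the true relationship is between Y and log(X), and regressing on log(X) leaves a much smaller error sum of squares than regressing on X itself.

```python
import numpy as np

# Simulated data mimicking the Figure 1.4 scenario (all values invented).
rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=200)
y = 2.0 + 3.0 * np.log(x) + rng.normal(0.0, 1.0, size=200)

def ols_residuals(predictor, outcome):
    """Residuals from a simple least-squares regression with an intercept."""
    X = np.column_stack([np.ones_like(predictor), predictor])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return outcome - X @ coef

resid_raw = ols_residuals(x, y)          # regress Y on X (misspecified)
resid_log = ols_residuals(np.log(x), y)  # regress Y on log(X) (correct)

# The transformation greatly reduces the error sum of squares; a histogram
# or Q-Q plot of resid_log would look correspondingly better behaved.
assert np.sum(resid_log**2) < np.sum(resid_raw**2)
```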
Figure 1.6 Examples of equal variance of residuals (homoscedasticity) and unequal variance of residuals (heteroscedasticity).

Heteroscedasticity of the form in Panel B of Figure 1.6 usually results from a scaling problem. Often, taking a square root or log transform of Y will remove the heteroscedasticity.

1.1.4 Problems and Diagnostics

1.1.4.1 Outliers and Influential Data Points

To illustrate the effect of an outlier, consider a study in cognitive neuroscience aimed at exploring retrieval from working memory. To recruit subjects, research techs post fliers throughout the university offering a small honorarium for participation in the study. Naturally, the majority of respondents will be students. On a whim, Ralph, a 75-year-old emeritus professor, signs up because he knows little about the field and would like to experience first-hand the techniques of measuring cognition.

Figure 1.7 presents a scatter plot of age and a score of retrieval from working memory, and Figure 1.8 gives the results from regressing the working memory score on age. It is clear from visual inspection that Ralph is a disconnected data point. Not only is he discrepant in age, but it is apparent that his working memory has not retired even though Ralph has.

Figure 1.7 Example of a scatter plot containing an outlier.

Figure 1.8 Results from a regression with an outlier.

	Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
	Intercept    1             26.31758          5.26743      5.00     <.0001
	age          1              0.42592          0.20180      2.11     0.0450

The results of the regression analysis show a significant effect of age, and both the regression coefficient and the regression line suggest that the relationship is positive: like a good wine, retrieval from working memory improves with age. But these results depend entirely on Ralph. Figure 1.9 and Figure 1.10 repeat the scatter plot and the regression, but this time after removing Ralph from the data set.
Notice how the scatter plot gives a very different impression of the relationship between age and working memory. Even if one were to erase Ralph and the regression line from the previous scatter plot, the remaining data points give little hint of the negative relationship clearly apparent in Figure 1.9. Furthermore, this negative relationship is significant.

Figure 1.9 The same scatter plot after removing the outlier.

Figure 1.10 Results of the regression with the outlier removed.

	Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
	Intercept    1             82.71226         21.35160      3.87     0.0007
	age          1             -2.13679          0.96343     -2.22     0.0363

This example, albeit extreme, should impress on the reader the importance of screening data in preparation for analysis. By happenstance, the sample contained an elderly gentleman with an excellent memory, and including him in the analysis gives very misleading information about the relationship between age and memory. In general, the effect of an outlier on tests of statistical significance is unpredictable. It can, as it did in this example, retain statistical significance but switch the direction of the effect. In other cases, outliers can produce statistical significance where there is none, or cause a failure to detect findings that are in fact significant. In almost all cases, however, inclusion of an outlier can seriously bias parameter estimates.

1.2 Multiple Regression

1.2.1 Background

The term multiple regression refers to the case in which one quantitative dependent variable is predicted by more than one quantitative independent variable. For two independent variables (which we denote as X1 and X2), the equation is

	Ŷ = α + β1X1 + β2X2,

and in the general case with k independent variables, the equation is

	Ŷ = α + β1X1 + β2X2 + … + βkXk.

As in simple regression, the estimates of the parameters are those that minimize the sum of squared prediction errors.
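In matrix terms, this minimization is a single call in most computing environments. The sketch below, with simulated data and invented coefficients, fits a two-predictor model by least squares and recovers the generating values.

```python
import numpy as np

# Simulated data; the true coefficients are known only because we made them up.
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0.0, 0.1, size=n)

# Design matrix: a column of ones for the intercept, then one column per IV.
X = np.column_stack([np.ones(n), x1, x2])
alpha_hat, b1_hat, b2_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# With little noise, the estimates land close to the generating values.
assert abs(alpha_hat - 1.0) < 0.1
assert abs(b1_hat - 2.0) < 0.1
assert abs(b2_hat + 0.5) < 0.1
```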
The equation defined by the set of independent variables selected for the analysis is called the model.

There are three reasons for adding independent variables in a multiple regression. The first of these is for experimental designs that manipulate more than one factor. For example, suppose a study examined the effect of a drug on plasma cortisol levels in rats that were either stressed or not stressed. There are two factors in this design, drug and stress. Hence, X1 could be coded as 0 = control and 1 = drug. Similarly, X2 could be coded as 0 = not stressed and 1 = stressed.

A second reason for adding variables is scientific hypothesis testing, e.g., does variable X2 add predictability above and beyond that of X1? In this situation, the independent variables are often correlated, and the major research question is "does X1 directly predict Y, or does X1 predict Y because it is correlated with X2 and X2 predicts Y?" We will see an example of this use of multiple regression below.

The third reason for adding independent variables is for statistical control. Here, X2 is known to predict Y (or has a very strong likelihood of predicting Y), and by adding it to the equation, one achieves a more powerful statistical test for X1. The best variables for statistical control will be correlated with the dependent variable but not correlated with the other independent variables in the model. A classic example of statistical control is the clinical trial where baseline variables are entered into the regression model. If participants are truly randomized to control and experimental conditions, then a baseline variable will not correlate with the treatment variable but will usually correlate with the outcome measure.

A simple regression fits a one-dimensional model (a straight line) to data points distributed in two-dimensional space.
Similarly, a multiple regression with two independent variables fits a two-dimensional model (a plane) to data points distributed in three-dimensional space. A multiple regression with three independent variables fits a three-dimensional model to data points located in four-dimensional space, a task that cannot be easily visualized but can be dealt with in the world of mathematics.

The interpretation of the parameters is easiest to learn by considering two independent variables. Parameter β1 gives the predicted change in the dependent variable per unit change in X1, holding variable X2 constant. Expressed in different terms, if one fixed X2 at any value, then a one-unit change in X1 predicts a change of β1 units in Y. Similarly, if one fixed X1 at any value, then a one-unit change in X2 predicts a change of β2 units in Y. In general, βi gives the predicted change in Y for a one-unit change in Xi, holding all other independent variables in the model constant. The term controlling for usually refers to the phrase "holding all other variables constant." For instance, "βi measures the effect (predictive effect, not necessarily causal effect) of variable Xi controlling for X1, X2, …."

We illustrate multiple regression by elaborating on the data set used in simple regression. Suppose that the receptor was a nicotinic receptor. Use of nicotine could upregulate the number of receptors, and we all know that smokers die young. Could the relationship between receptor number and age be due to the effects of smoking? To check for this, we will include among the assays one for cotinine, a metabolite of nicotine. We can now use cotinine levels as a control variable in a multiple regression that predicts the levels of nicotinic receptors in cortex from age and cotinine levels.

1.2.2 How to Do It

1.2.2.1 Step 1: Check the Data

The purpose of the data check in multiple regression is the same as it is in simple regression: examining the data for nonlinearity and outliers.
Outlier detection by visual inspection can be difficult in multiple regression because one cannot deal with more than three variables at a time. Most statisticians recommend constructing a series of scatter plots for all pairs of variables. Usually, a rogue value or a blunder will appear in one or more of these graphs. After the regression model is fitted to the data, one can use additional procedures to check for outliers and influential data points. These procedures are detailed below in Section 1.2.4.1.

1.2.2.2 Step 2: Compute the Regression

Fitting the model is the preferred phrase in multiple regression. The major decision in fitting a multiple regression model is the method. There are two major types of methods for fitting models: (1) complete estimation, and (2) variable-selection methods. Complete estimation is almost always the default method for statistical computer programs. Here, the model that you specify, and only that model, gets fitted to the data. If the model has four independent variables, then four regression coefficients (plus an intercept) will be estimated and tested. In variable-selection methods (sometimes called stepwise regression), a series of models is fitted to the data, and independent variables are added, kept, or deleted based on some statistical criterion. Complete estimation should be the choice for the vast majority of problems in neuroscience and must always be used for planned experiments.¹

	¹ Variable-selection techniques and stepwise regression were useful in the past when it took considerable time to compute regressions by hand. In the modern era, it is very easy to compute regressions for all possible subsets of IVs. One can then accept the model that best fits predetermined criteria. For this reason, we will not discuss variable-selection methods in this text.

1.2.2.3 Step 3: Interpret the Results

Figure 1.11 gives the results from PROC REG in SAS of fitting the model to the data. Here, the first item to inspect is the analysis of variance (ANOVA) table. This table assesses whether the model as a whole predicts better than chance. The crucial test statistic is the F value. Because this F refers to the model as a whole, it is sometimes called the omnibus F. The numerator degrees of freedom for this F equal the df for the model (2 in this case), and the denominator df equal the df for error (24). Here the test statistic is F(2, 24) = 13.35 and its associated p value is .0001. This suggests that the model as a whole does indeed predict nicotinic receptor levels better than chance.

Figure 1.11 Multiple regression of receptor number on age and cotinine.

	Analysis of Variance

	Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
	Model               2         15.02076       7.51038     13.35   0.0001
	Error              24         13.49787       0.56241
	Corrected Total    26         28.51863

	Root MSE           0.74994    R-Square   0.5267
	Dependent Mean     4.90630    Adj R-Sq   0.4873
	Coeff Var         15.28527

	Parameter Estimates

	Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
	Intercept    1              7.33197          1.37229      5.34     <.0001
	age          1             -0.05382          0.01661     -3.24     0.0035
	cotinine     1              0.03823          0.01050      3.64     0.0013

The next item for inspection is R², called the squared multiple correlation, the quantity R being the multiple correlation. As in simple regression, R is the correlation between the predicted and the observed values of the dependent variable. Squaring R gives the proportion of variance in the dependent variable explained by the model (i.e., by all of the independent variables). R² is a measure of effect size (see Section X.X) for the whole model. Here, the value of R² is .53, so 53% of the variance in receptor levels is predictable from the model, i.e., from both age and cotinine. R² values from nested models can be added or subtracted.
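This additivity for nested models can be verified numerically. The sketch below uses entirely fictitious receptor-like data (not the QMIN data set): the age-only model is nested within the age-plus-cotinine model, so adding cotinine can only increase R², and the difference is the extra variance it explains.

```python
import numpy as np

# Fictitious data built to resemble the example: receptor declines with age
# and rises with cotinine. None of these numbers come from the real study.
rng = np.random.default_rng(3)
n = 200
age = rng.uniform(20, 80, size=n)
cotinine = rng.uniform(0, 100, size=n)
receptor = 7.3 - 0.05 * age + 0.04 * cotinine + rng.normal(0.0, 0.7, size=n)

def r_squared(X, y):
    """R-squared from a least-squares fit; X must include an intercept column."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)

ones = np.ones(n)
r2_age = r_squared(np.column_stack([ones, age]), receptor)
r2_both = r_squared(np.column_stack([ones, age, cotinine]), receptor)

# Nested models: the larger model's R-squared can never be smaller.
assert r2_both >= r2_age
```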
Two models are nested when all the independent variables in the smaller model are contained in the larger model. R² values for non-nested models should not be compared to each other (see Section 1.3.1). The R² from the simple regression of Receptor on Age was .27 (see Figure 1.2). Because the simple regression is nested within the current model, their R²s can be compared. Hence, we can say that adding cotinine to the prediction equation explains an additional (.53 − .27) = .26, or 26%, of the variance in nicotinic receptors. Model comparisons are such an important part of regression that we devote a whole section to them (Section 1.3.1), and we explain the testing of interactions and polynomial regression in multiple regression along these lines (Sections 1.3.2.3 and 1.3.3.1).

The R² and its significance inform us that the overall predictability is much better than chance would allow. The table of parameter estimates and their significance can help us decide which independent variables contribute to that significant predictability. In Figure 1.11 both age and cotinine are statistically significant. Hence, there are contributions from both age and cotinine to the density of nicotinic receptors. The decline in nicotinic receptors with age is not due to the fact that smokers elevate their receptor levels and also die at younger ages.

1.2.2.4 Step 4: Communicating the Results

Rather than follow a rigid formula, the hypotheses of interest should always determine the write-up of the results from multiple regression models. For example, consider the case of a clinical trial where the first IV is a dummy code for control versus active treatment and the second IV is baseline symptoms. The purpose of the whole study is to test whether or not the treatment was efficacious. Hence, one would report only two pieces of information: (1) the test statistic and significance level for the treatment variable; and (2) the effect size of the treatment.
There is no need to report anything about the extent to which baseline symptoms predict follow-up symptoms. Nor is there any need to provide the reader with extraneous (and distracting) information about the overall R² or the significance of the omnibus F.

Absent detailed hypotheses of interest, three pieces of information from the multiple regression should be conveyed to the reader. The first piece is whether or not the overall model predicts better than chance. This entails saying something about the omnibus F statistic and its associated p level. The second is the estimate of overall effect size. Here, the statistical index is R². Usually, the first piece of information (is the prediction better than chance?) and the second (how well does the model predict the dependent variable?) can be combined into a single sentence. For the current example, one might write, "Together, age and cotinine significantly predicted individual differences in the number of receptors (R² = .53, p < .0001)."

The third class of information conveys the extent to which each IV contributes to the prediction. Although we chunk this into a single "piece" of information here, the overall purpose is to explain to the reader which IVs are significant and which are not. Phrasing of this information should always be done in terms of the purpose of the hypotheses that motivated the model for the multiple regression in the first place. For example, in the current study one might blithely write, "Both age and cotinine significantly predicted receptor concentration." That phrasing is uninformative. Instead, write the results in terms that emphasize the fact that receptor concentration decreases with age even when controlling for cotinine levels. A good write-up might be: "Increased cotinine levels significantly predicted increased receptor numbers (b = .038, t(24) = 3.64, p = .001).
Even controlling for the effect of cotinine, however, receptor concentration still significantly declined with age (b = −.054, t(24) = −3.24, p = .004). Hence the relationship between age and receptor number cannot be explained solely in terms of cotinine concentrations in brain." Note how this latter write-up expresses the significance of the regression coefficients, gives the direction of each effect (positive for cotinine, negative for age), and provides a clear summary of the results in terms of the reason for performing the regression in the first place (despite controlling for cotinine, age is still significant). Instead of following a slavish and formulaic write-up (which is usually quite uninteresting), the analyst is urged to be creative in explaining the results in terms of the major hypotheses responsible for the analysis.

1.2.3 Assumptions

The assumptions of multiple regression are the same as those for simple linear regression: linearity, normality of residuals, and equality of residual variances. Two of these, normality of residuals and equality of residual variances, can be checked in the same way as in simple regression.

1.2.3.1 Linearity

If there are k independent variables, then the inclusion of the dependent variable gives a problem in (k + 1)-dimensional space. It is not possible to construct plots for visual inspection in more than three-dimensional space. So how does one assess linearity? The answer is that there is no exact mathematical way. Some statisticians recommend constructing scatter plots of all variables in the analysis taken two at a time. Indeed, software packages such as the Interactive Data Analysis feature of SAS allow one to do this with only a few point-and-clicks. A second method is to construct plots of the residuals as a function of each independent variable. A U- or inverted U-shaped plot suggests nonlinearity.
Finally, an analysis of the residuals for outliers and/or influential data points may give indications of nonlinearity.

1.2.3.2 Normality of Residuals

Both simple and multiple regression have one and only one dependent variable. Because the residuals apply to the dependent variable, examining this assumption is the same in both simple and multiple regression. Hence, see Section 1.1.3.2 above for the techniques used to do this.

1.2.3.3 Equality of Residual Variances

The procedure for testing equality of the variances of the residuals in multiple regression is identical to that in simple regression. Hence, consult Section 1.1.3.3.

1.2.4 Problems and Diagnostics

1.2.4.1 Outliers and Influential Data Points

The identification of outliers and/or influential data points is simple when there is only one independent variable: simply plot the Y values by the X values and visually inspect the graph. The situation becomes more complicated as the number of IVs increases. For example, when there are three X variables, the data occupy a four-dimensional space (the three Xs plus the Y variable), something very difficult for us humans to conceptualize, let alone plot. Most statisticians recommend two processes to deal with outliers and influential data points. The first is the inspection of the residuals, or errors. The second is the inspection of statistics designed to identify multivariate outliers and/or influential data points. We shall speak of each in turn.

1.2.4.1.1 Inspection of Residuals

There are two major reasons for the inspection of residuals. The first reason is purely statistical: to isolate those data points that clearly fall outside the prediction region. Here, one might look at data recording errors or processing errors as a cause. If the discrepancy is severe, one might even delete that observation from the analysis.
The second reason is substantive—to uncover a plausible reason why those data points are outside the regression sphere. For example, suppose a study of restrictive eating disorders included four males along with 37 females. If inspection of the residuals revealed that all three potential outliers were male, one might conclude that the regression model is probably different for males and females.

The first step in inspecting residuals is to determine which type of residual to inspect. The raw residual is simply the prediction error, i.e., the difference between the observed value of Y and the predicted value of Y, or Y − Ŷ. The second type is the standardized residual. This is computed as the raw residual divided by the standard error of the residuals. In effect, standardized residuals are Z scores of the residuals and hence have a clear meaning to statisticians. For example, a standardized residual of −2.3 implies that the observation is 2.3 standard deviations below its predicted value. Finally, a studentized residual converts the residual into a t score using the t distribution and taking into account the influence of that observation on the regression model (technically, the leverage, which will be discussed below). If an observation does not influence the regression very much, then the standardized residual will be very similar to the studentized residual. When that observation has a large influence on the regression, the two will differ.²

The question at hand should dictate which of these three quantities should be used. If you are very familiar with the metric behind the dependent variable, then raw residuals are fine. A raw residual of 6.3 will be meaningful to you. If you are not familiar with the metric, or if you have any doubts about it, then a residual of 6.3 might be large or small. Here, either the standardized or the studentized residual should be preferred. The second step is to choose the manner of inspecting the residuals.
If the number of observations is small, then one can visually inspect the numerical values and flag observations with deviant residuals. When the number of observations is large, then construct a boxplot or histogram of the residuals and inspect that plot for outliers.

² There are several variations on how these residuals may be calculated. The major variation is whether the observation in question is included or excluded from the regression model when its error (residual) is calculated. Consult the manual for your software to make certain that you know what the residuals mean.

1.2.4.1.2 Multivariate Outliers and Influential Data Points
The two main statistics for examining influence and potential multivariate outliers are leverage and Cook's D. The formula for leverage (also called h, the hat statistic, or the hat value) is complicated, but the numerical values range from 0 (the observation has little influence on the regression and hence is not a potential multivariate outlier) to 1 (the observation has an extraordinary influence and is almost certainly a multivariate outlier). Belsley, Kuh, and Welsch (1980) suggest a cutoff of 2k/N, where k equals the number of predictors in the model and N is the number of observations. For example, with three X variables and a sample of 24, observations with leverages above 2(3)/24 = .25 should be examined. This criterion is slightly conservative but works well with sample sizes typical in neuroscience. As N gets very large, however, one may spend unneeded time exploring observations that still fit the regression model. Others recommend that observations with a leverage of 0.5 or more should always be examined, while those with leverages less than .20 can be ignored.

Cook's D for an observation measures the extent to which the regression parameters change when that observation is not included in the calculation of the parameters.
In short, D is a measure of the effect of deleting that observation from the regression. D can range from 0 to a very high, positive number. Values close to 0 suggest that there is little change when that observation is deleted, and hence that observation is not an outlier or unduly influential. Different criteria have been proposed to flag potential outliers: D > 4/N and D > 4/(N − k − 1) are but two of them. Another recommended strategy is to examine the distribution of D and look for outliers that are large positive values. As in the examination of residuals, one can eyeball leverage and/or D, or use graphical and mathematical methods to detect discrepant data points. Large data sets will certainly require graphical plots or mathematical methods.

1.2.4.2 Multicollinearity
Multicollinearity is usually defined as a state that exists when two or more of the independent variables are highly correlated. A more precise definition may be developed by imagining that we computed a series of multiple regressions. In each regression, one of the independent variables became the dependent variable and all of the other IVs were the predictors. Multicollinearity would then occur when the R² for at least one of these regressions was high. Note that multicollinearity applies to the X variables only. It is not influenced in any way by the extent to which the X variables correlate with the Y variable. Hence, R² for the model is not influenced by multicollinearity. Instead, multicollinearity increases the standard errors of the regression coefficients, thus making it more difficult to detect whether a coefficient is in fact significant. As with many statistical phenomena, multicollinearity is not an either-or state, akin to falling off a cliff. Instead, regressions descend gradually into multicollinearity as the correlations among the X variables increase.
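This "series of regressions" definition can be computed directly. The sketch below is a hypothetical illustration using NumPy (the function and variable names are ours, not part of any statistics package): each IV in turn is regressed on the remaining IVs, and a high R² for any of these regressions signals multicollinearity.

```python
import numpy as np

def r_squared(y, X):
    """R^2 from regressing y on the columns of X (an intercept is added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    dev = y - y.mean()
    return 1.0 - (resid @ resid) / (dev @ dev)

def collinearity_check(X):
    """Regress each IV on all the other IVs and return the R^2 values.
    A high R^2 for any of these regressions signals multicollinearity."""
    return [r_squared(X[:, j], np.delete(X, j, axis=1))
            for j in range(X.shape[1])]

# Hypothetical data: x2 is an exact linear function of x1; x3 is not
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = 2.0 * x1 + 1.0
x3 = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
r2s = collinearity_check(np.column_stack([x1, x2, x3]))
# r2s[0] and r2s[1] are near 1 (x1 and x2 are collinear); r2s[2] is much lower
```

In practice one would request these diagnostics from the regression procedure itself rather than compute them by hand; the point of the sketch is only to make the definition concrete.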
The central issue for the analyst is to identify when multicollinearity becomes such a problem that it compromises the interpretation of a regression. Most designs in neuroscience do not have to worry about multicollinearity, except for the very important situation of statistical interactions (which we deal with below). Why? Most designs are experimental, and hence the independent variables will not be correlated (if there are an equal number of observations in each cell) or will be very weakly correlated (if the number is close to being equal in each cell). Outside of statistical interactions, the most likely situation that could induce multicollinearity in experimental designs occurs when two or more highly correlated variables are entered into the equation as control variables. Generally, however, multicollinearity is a problem most often encountered in observational studies.

1.2.4.2.1 Diagnosing multicollinearity
If the model has an interaction term in it—regardless of whether or not the design is a true experiment—then the interaction may induce some degree of multicollinearity. Rather than apply the diagnoses and remedies described below, the reader is referred to Section 1.3.2.3, which deals with interactions in a very straightforward manner. The diagnoses and remedies described here apply to the situation in which single predictors—and not their interactions—are highly correlated.

Recall that multicollinearity increases the standard errors of the regression coefficients but does not affect the overall predictability of the model. Hence, one of its major effects is to reduce the statistical power of the tests of the regression coefficients. (The test for the significance of a regressor equals the regression coefficient divided by its standard error, which follows a t distribution. As the denominator of this t statistic increases, the value of t decreases.)
Thus, one of the major hints that multicollinearity might influence a regression occurs when the whole model predicts well (i.e., R² is large and significant) but few, if any, of the regression coefficients are significant.

A second way to examine multicollinearity is to examine the tolerance or the variance inflation factor (VIF) of the independent variables. The tolerance is simply the quantity (1 − R²) when that independent variable is regressed on all of the other IVs. Hence, tolerance will range from 0 to 1. A tolerance near 1.0 implies that the IV is close to being statistically independent of the other IVs. Tolerances close to 0 indicate multicollinearity. There is no sharp dividing line for a "good" versus a "bad" tolerance, but tolerances below .20 should alert the analyst to potential problems with multicollinearity. The VIF equals the reciprocal of tolerance. Hence, a VIF close to 1 denotes independence, while large VIFs suggest multicollinearity.

1.2.4.2.2 Remedies for multicollinearity
Potential fixes for multicollinearity range from the simple to the esoteric. If the multicollinearity involves only two variables, then one can delete one variable or combine the two into a single variable. If the standard deviations of the two variables are similar, then simply adding them together is satisfactory. If the standard deviations differ, then convert both variables to Z scores before adding them. In some cases, one or more sets of variables are responsible for the multicollinearity. Again, one can drop one of the variables (or create a new variable from the sums), rerun the model, and then assess the fit. In other cases, one might want to construct new variables on the basis of existing theory or empirical evidence. For example, suppose a neuropsychologist used all of the subscales of the WAIS (Wechsler Adult Intelligence Scale) as predictors in a study of CNS lesions, and the regression results suggested multicollinearity among the WAIS subscales.
In such a case, one could use the WAIS norms to calculate total IQ or, perhaps, Verbal IQ and Performance IQ. These variables could then be used as regressors. Here, the validation of the WAIS and subsequent research on it serve as a guide for reducing a large number of variables to a smaller number of variables. If a set of intercorrelated variables creates multicollinearity and there is no theoretical or empirical basis for data reduction, then the best solution is to subject those variables to a principal component analysis and use the resulting principal component scores as the IVs. This technique is sometimes referred to as regression on principal components.

1.3 Special Topics in Multiple Regression
1.3.1 Model Comparisons
Many tasks in GLM require a comparison between models. Does a model with an additional IV predict significantly better than the model without that variable? Can I drop two predictor variables from a model without a significant loss in fit? In experimental designs, the question usually arises whether models with or without interaction terms fit better. The thrust of all of these questions is a comparison between two models. We want to know whether the larger of two nested models predicts significantly better than the smaller model. (Conversely, we may ask whether the smaller of two nested models still gives satisfactory prediction without a significant loss of fit.) Indeed, all of the General Linear Model can be viewed in terms of the comparison of nested models (Judd & McClelland, 1989).

Note the use of the word nested in the above statements. Two linear models are nested whenever all the variables in the smaller model are contained in the larger model. Only a smaller nested model can be compared to a larger model. If the smaller model is not nested (i.e., if it has a predictor variable that is not in the larger model), then the two models cannot be compared.³
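The nesting rule (every predictor in the smaller model must also appear in the larger model) amounts to a simple set comparison. A minimal sketch in Python, with a helper of our own naming, purely for illustration:

```python
def is_nested(smaller, larger):
    """True if every predictor in the smaller model also appears in the
    larger model, i.e., the two models can be compared."""
    return set(smaller) <= set(larger)

# Suppose the larger model uses predictors X1, X2, and X3:
larger = ["X1", "X2", "X3"]

print(is_nested(["X1", "X2"], larger))  # True: nested, can be compared
print(is_nested(["X1", "X4"], larger))  # False: X4 is absent from the larger model
```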
To illustrate nesting, suppose that a data set had four potential predictor variables, which we denote here as X1, X2, X3, and X4. Now consider a GLM that uses the first three of these:

Ŷ = α + β1X1 + β2X2 + β3X3.

The following three models are smaller models that are nested within this larger model:

Ŷ = α + β1X1 + β2X2,
Ŷ = α + β1X1 + β3X3, and
Ŷ = α + β2X2 + β3X3.

Each of the three smaller models is nested within the larger model because all of the predictor variables on the right-hand side of the equations are predictor variables in the larger model. In contrast, none of the following models is nested within the larger model:

Ŷ = α + β1X1 + β4X4,
Ŷ = α + β2X2 + β4X4,
Ŷ = α + β3X3 + β4X4, and
Ŷ = α + β4X4.

Even though each of the above four models is smaller than the larger model, they are not nested within it. Why? Because they all contain variable X4, which is not contained in the larger model. Hence, it is not possible to compare these four models with the larger model.

³ More advanced methods can permit the assessment of non-nested models. They are, however, beyond the purview of this book.

To examine model comparisons, we will first develop the general case of comparing any two nested models. After that, we will examine the special case of comparing two models that differ in one and only one parameter.

1.3.1.1 Model Comparisons: The General Case
Suppose that we have a linear model with k predictors in the equation:

Ŷ = α + β1X1 + β2X2 + … + βkXk.

We want to compare this model to another model that has the same predictors X1 through Xk but adds m new predictors, giving the model

Ŷ = α + β1X1 + β2X2 + … + βkXk + βk+1Xk+1 + βk+2Xk+2 + … + βk+mXk+m.

We want to test whether the m new predictors significantly add to prediction.
This, of course, is the same as starting with the general model of all (k + m) predictors and testing whether a model that drops the m predictors significantly worsens the fit. The statistical test involves the R²s of the two models. Let R²k+m denote the squared multiple correlation for the general model and let R²k denote the squared multiple correlation for the reduced model. Then the test statistic for a significant difference between the two R²s is an F statistic of the form

F(m, N − k − m − 1) = [(R²k+m − R²k) / (1 − R²k+m)] · [(N − k − m − 1) / m],   (X.X)

where N equals the number of observations in the sample. This F statistic has m degrees of freedom in the numerator and (N − k − m − 1) degrees of freedom in the denominator. If the p level of the F reaches significance, then the larger model is a significantly better model in terms of prediction. Otherwise, the smaller model should be preferred.

Most modern statistical packages have provisions to test for differences in R². Typically, these involve fitting the general model first and then testing whether one or more of the terms can be set to 0. In SAS, for example, the TEST statement used with PROC REG allows one to test for the significance of a set of parameter estimates. If your software does not have this option easily available, then run the two regression models and use Equation X.X to calculate the F statistic. You can find the significance level of the observed F in tables found at the back of statistics books.

1.3.1.2 Model Comparisons: The Special Case of One Predictor
In practical terms, if you test a general model against a smaller model that drops one and only one predictor, then simply run the general model and examine the t statistic and its p level for the predictor that you want to drop. Below, we give a demonstration of this principle. Figure 1.12 presents the results of regressing the dependent variable Y on four predictors (X1 through X4).
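Equation X.X is simple enough to compute with a few lines of code. The sketch below is a hypothetical helper of our own (plain Python, no statistical package) that returns the F statistic and its degrees of freedom for comparing a general model with (k + m) predictors to a reduced model with k predictors:

```python
def nested_model_f(r2_full, r2_reduced, n, k, m):
    """F statistic for comparing a general model with (k + m) predictors
    against a reduced, nested model with k predictors (Equation X.X).
    Returns (F, numerator df, denominator df)."""
    df_denom = n - k - m - 1
    f = ((r2_full - r2_reduced) / (1 - r2_full)) * (df_denom / m)
    return f, m, df_denom

# Worked example from the text: N = 53, k = 3, m = 1,
# R-squared (full) = .425, R-squared (reduced) = .4084
f, df1, df2 = nested_model_f(0.425, 0.4084, 53, 3, 1)
# f is about 1.39 with (1, 48) df, matching the hand calculation
```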
Suppose that we wanted to test this model against a smaller model that sets β3 to 0. There are two different ways in which we can do that. First, we could compute the regression without X3 as a predictor and use the R² from this model along with the R² from the general model to compute an F statistic from Equation X.X. In terms of the notation used in Equation X.X, N = 53; (k + m) = 4 (the total number of IVs in the general model given in Figure 1.12); and m = 1 (the number of predictors dropped from the general model). From Figure 1.12, the quantity R²k+m equals .425. From the regression that drops X3 (not shown), the quantity R²k = .4084. Substituting these quantities into Equation X.X gives

F(1, 48) = [(.425 − .4084) / (1 − .425)] · [(53 − 3 − 1 − 1) / 1] = 1.39.

The critical value for this F is 4.04. Because the observed F is less than its critical value, the observed F is not significant. Hence, dropping X3 from the model does not significantly worsen fit. (Note that if we had started with the reduced model and compared it to a model that added X3, then we could have stated that adding X3 to the model does not significantly increase R².)

Figure 1.12 Model Comparisons: A General Model with Four Predictors

                         Model Comparisons
                         The REG Procedure
                           Model: MODEL1
                      Dependent Variable: Y

            Number of Observations Read          53
            Number of Observations Used          53

                       Analysis of Variance
                              Sum of        Mean
Source             DF        Squares      Square    F Value    Pr > F
Model               4       15.14742     3.78685       8.87    <.0001
Error              48       20.49560     0.42699
Corrected Total    52       35.64302

        Root MSE            0.65345    R-Square    0.4250
        Dependent Mean      4.97358    Adj R-Sq    0.3771
        Coeff Var          13.13833

                       Parameter Estimates
                   Parameter     Standard
Variable     DF     Estimate        Error    t Value    Pr > |t|
Intercept     1      1.34183      0.66310       2.02      0.0486
X1            1      0.14587      0.06591       2.21      0.0317
X2            1      0.05578      0.02916       1.91      0.0617
X3            1      0.01578      0.01341       1.18      0.2453
X4            1      0.05944      0.04428       1.34      0.1858

A second method is to use the TEST statement.
The SAS syntax for this, along with the results of the TEST statement, is given in Figure 1.13. Note that the F statistic for the TEST statement is the same (within rounding error) as the one calculated above. Note also that the p value for the F statistic is identical to the p value for the t statistic in the original, general model presented earlier in Figure 1.12. This is no coincidence. The F statistic in Figure 1.13 is actually the square of the t statistic in Figure 1.12. Both statistics answer the same question: Can β3 be set to 0 without sacrificing predictability? In mathematics, two methods that answer the same question with the same data must give the same answer. Hence, the t statistic for a parameter in a general model is the same test as the F statistic in a model comparison that drops that parameter and only that parameter.

Figure 1.13 Model Comparisons: Example of the TEST statement in SAS for a single predictor variable.

SAS Syntax:

TITLE Model Comparisons;
PROC REG DATA=ModelComparison2;
   MODEL Y = X1 X2 X3 X4;
RUN;
TITLE2 Test that beta3 = 0;
beta3_EQ_0: TEST x3=0;
RUN;
QUIT;

SAS Output (NOTE: Only results of the TEST statement are shown):

                         Model Comparisons
                       Test that beta3 = 0
                         The REG Procedure
                        Model: beta3_EQ_0

            beta3_EQ_0 Results for Dependent Variable Y
                                Mean
Source             DF         Square    F Value    Pr > F
Numerator           1        0.59084       1.38    0.2453
Denominator        48        0.42699

1.3.1.3 Model Comparisons: An Example
Re-examine Figure 1.12, which presented the results of regressing the dependent variable Y on four predictors (X1 through X4). Suppose that we wanted to compare this general model to a nested, smaller model in which β2 = 0 and β4 = 0. Before proceeding with this example, let us take a moment to dispel a commonly held myth of data analysis. Neither the t statistic for X2 nor the t for X4 is significant. Novice analysts sometimes make the mistake of concluding that therefore both X2 and X4 can be set to 0.
This may be the case, but it does not have to be. A short exercise in logic can illustrate this. From Figure 1.12, the nonsignificant t statistic for X2 tells us that we can fit a model with the three predictors X1, X3, and X4 without a significant worsening of fit. The nonsignificant value for X4 suggests that we can fit a model with the three predictors X1, X2, and X3 without a significant worsening of fit. Note, however, that neither of these conclusions addresses the question at hand—can we fit a model with only the two predictors X1 and X3 without a significant loss of prediction? The t statistics for individual predictors apply to dropping one and only one parameter from the model. They do not necessarily inform us of the effect of dropping more than one parameter from the model.

As an analogy, imagine that your ship sinks and you are left with two flotation devices—a life-jacket and a circular life-saver. You can discard the life-saver and still float for a long time because of the life-jacket. Similarly, you can discard the life-jacket, keeping the life-saver, and still float for a long time. But does this imply that you can safely throw away both the life-jacket and the life-saver and still remain afloat until rescue arrives?

Hence, it is quite legitimate to ask whether both X2 and X4 can be dropped from the general model without a significant sacrifice in prediction. The first method of doing this is to use the TEST statement with PROC REG (or, of course, an equivalent statement in another statistical package). The SAS syntax is shown in the upper part of Figure 1.14 and the results are provided in the lower part of that figure. The F statistic is 3.67 and its p value is .03. Hence, we cannot simultaneously set β2 and β4 to 0. At least one of these—and perhaps both—is important for prediction.
In the original, general model, we either lacked the power or had conditions such as multicollinearity that prevented us from detecting significance.

Figure 1.14 Model Comparisons: Example of the TEST statement in SAS for two predictor variables.

SAS Syntax:

TITLE Model Comparisons;
PROC REG DATA=ModelComparison2;
   MODEL Y = X1 X2 X3 X4;
RUN;
TITLE2 Test that beta2 = 0 and beta4=0;
beta2_AND_beta4_EQ_0: TEST X2=0, X4=0;
RUN;
QUIT;

SAS Output (NOTE: Only results of the TEST statement are shown):

                         Model Comparisons
                 Test that beta2 = 0 and beta4=0
                         The REG Procedure
                   Model: beta2_AND_beta4_EQ_0

       beta2_AND_beta4_EQ_0 Results for Dependent Variable Y
                                Mean
Source             DF         Square    F Value    Pr > F
Numerator           2        1.56733       3.67    0.0329
Denominator        48        0.42699

The second method for model comparisons is to run the reduced model with only X1 and X3 as predictors and then compute the F statistic using Equation X.X. We will not show all the results from this reduced model but simply give its R² (.337). Hence, in terms of the algebraic quantities in Equation X.X, R²k+m = .425, R²k = .337, N = 53, k = 2, and m = 2. The F statistic becomes

F(2, 48) = [(.425 − .337) / (1 − .425)] · [(53 − 2 − 2 − 1) / 2] = 3.67.

The critical value for F with 2 and 48 df is 3.19. Because the observed F is larger than the critical value, we reject the smaller model. Eliminating both X2 and X4 from the model significantly worsens fit.

1.3.2 Interactions
In everyday language, we often say that two variables "interact" in predicting a third variable, meaning that both variables are important for the prediction. In the GLM, however, the term "interaction" has a more precise meaning. A statistical interaction between two variables implies that the slope (or curve) for one independent variable differs in shape as a function of the second variable.
For example, a statistical interaction between dose of drug and sex implies that the dose-response curves for males and females differ in shape. In different words, an interaction implies that the effect of a dose is different for men and women. An interaction can be looked upon as a nonadditive contribution of two (or more) variables to the prediction of Y. We explore this viewpoint on interactions further in the discussion of a two-by-two factorial design (Section 1.3.2.1) but urge the reader to view interactions as the extent to which nonadditive factors contribute to prediction.

Let us first consider the case of two independent variables. In multiple regression, modeling an interaction begins by creating a new variable that is the product of the two variables involved in the interaction: e.g., X3 = X1*X2 denotes that the new variable (X3) represents the interaction between variables X1 and X2. Then the two original variables plus the new variable become the independent variables in the regression model. Hence, the model with the interactive term is

Ŷ = α + β1X1 + β2X2 + β3X3,

or, writing it in terms of the original variables,

Ŷ = α + β1X1 + β2X2 + β3X1X2.

With three independent variables, we can construct three new variables that are the product of any two of the three variables: X4 = X1*X2, X5 = X1*X3, and X6 = X2*X3. These are called two-way interactions because they model the interaction between two variables. In addition, we can model a three-way interaction by calculating yet another new variable that is the product of all three independent variables: X7 = X1*X2*X3. The model, expressed in terms of the original variables, is

Ŷ = α + β1X1 + β2X2 + β3X3 + β4X1X2 + β5X1X3 + β6X2X3 + β7X1X2X3.

With four independent variables, there would be six possible two-way interactions, four possible three-way interactions (X1*X2*X3, X1*X2*X4, X1*X3*X4, and X2*X3*X4), and a four-way interaction (X1*X2*X3*X4).
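Creating these product variables requires nothing more than elementwise multiplication. A brief sketch for the three-variable case, with hypothetical data and our own variable names:

```python
import numpy as np

# Hypothetical values for three independent variables
X1 = np.array([1.0, 2.0, 3.0, 4.0])
X2 = np.array([0.0, 1.0, 0.0, 1.0])
X3 = np.array([5.0, 6.0, 7.0, 8.0])

# Two-way interaction terms: elementwise products of pairs of IVs
X4 = X1 * X2
X5 = X1 * X3
X6 = X2 * X3

# Three-way interaction term: product of all three IVs
X7 = X1 * X2 * X3

# The design matrix for the full interaction model contains all
# seven predictors plus an intercept column.
design = np.column_stack([np.ones_like(X1), X1, X2, X3, X4, X5, X6, X7])
```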
Usually, higher-order interactions such as a four-way interaction are ignored because they are very difficult to interpret.

Let us examine a specific problem to illustrate interactions. It has long been known that testosterone induces sexual activity in castrated male rats. The recovery of sexual activity is also a function of prior sexual experience—rats with high levels of previous experience have higher post-testosterone sexual activity than those with less experience. Suppose that a lab is investigating a new compound with a resemblance to testosterone. The experimenters raised male rats in two ways—one group had no opportunity to mate with females, while the other was allowed to have sexual activity. Rats were then surgically castrated and divided into three groups: (1) controls, (2) those given 10 mg of the new drug per unit weight, and (3) those given 15 mg. All groups were then allowed access to females in estrus, and a composite index of sexual activity was derived. The results from this hypothetical study are depicted in Figure 1.15.

Figure 1.15 Mean sexual activity (± 1 standard error) of rats with and without prior sexual experience as a function of dose of a testosterone-like compound.

In principle, the procedure for assessing an interaction model is to fit two regression models. The first of these may be termed the main effects model, and it does not have the interaction term. The second regression model has the same variables as the first but includes the interaction term. The purpose is to assess the significance of the interaction term in the second model. (In practice, one can attain the same result by fitting a model that includes the main effects and the interaction and then assessing the t statistic for the regression coefficient for the interaction term—see Section 1.3.1.2.
The notion of fitting two models, however, greatly aids in the interpretation of the regression coefficients, as the following algebraic exercise will illustrate.)

For the main effects model, we start with two independent variables. The first, Experience, is dummy coded by assigning a 0 to rats with no previous sexual experience and a 1 to rats with prior experience. The second variable is Dose, with values of 0, 10, and 15. Now examine the regression equation in this model:

Ŷ = α + β1·Experience + β2·Dose.   (X.X)

Substitute the numeric values for Experience to get the predicted values for rats with no experience (Ŷ0) and those with prior experience (Ŷ1):

Ŷ0 = α + β2·Dose,
Ŷ1 = (α + β1) + β2·Dose.

Notice that the equation for the inexperienced rats is a simple regression with intercept α and slope β2. The equation for the experienced rats is also a simple regression, but here the intercept is now (α + β1) while the slope remains the same at β2. Hence, the main effects model fits two simple regressions—one for the inexperienced group, the other for the experienced group—allowing the intercepts to differ between groups but constraining the slopes to be equal. Consequently, the regression lines for main effects models will be parallel. Parallel regression lines are not idiosyncratic to this example—they will always be predicted by a main effects model.

Note the careful phrasing above about the intercept. The main effects model permits or allows the intercepts to differ. It does not force them to be different. Because the intercept for the inexperienced group is α and the intercept for the experienced rats is (α + β1), a test of parameter β1 is a test for equality of intercepts. If β1 = 0, then the intercepts are the same; if the hypothesis that β1 = 0 is rejected, then there is evidence for different intercepts.
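The parallel-lines property is easy to verify numerically. With hypothetical parameter values α = 10, β1 = 2.5, and β2 = 0.3 (made-up numbers for illustration only), the gap between the two Experience groups under the main effects model is the same constant β1 at every dose:

```python
alpha, b1, b2 = 10.0, 2.5, 0.3   # hypothetical parameter values

def predict_main_effects(experience, dose):
    """Predicted value under the main effects model:
    Y-hat = alpha + b1*Experience + b2*Dose."""
    return alpha + b1 * experience + b2 * dose

# The gap between the experienced (1) and inexperienced (0) groups
# equals b1 at every dose, so the two regression lines are parallel.
for dose in (0, 10, 15):
    gap = predict_main_effects(1, dose) - predict_main_effects(0, dose)
    print(dose, gap)
```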
Now examine the regression equation that includes the interaction term:

Ŷ = α + β1·Experience + β2·Dose + β3·Experience·Dose.   (X.X)

Substituting the numeric values for Experience gives the equations for the two groups of rats as

Ŷ0 = α + β2·Dose,   (X.X)
Ŷ1 = (α + β1) + (β2 + β3)·Dose.   (X.X)

Once again, the equation for the inexperienced rats is a simple regression with intercept α and slope β2. The equation for the experienced rats also remains a simple regression. The intercept, however, is now (α + β1) and the slope is now (β2 + β3). Hence, the interaction term allows the slopes for the two groups to differ in addition to the intercepts. Furthermore, parameter β3, the coefficient for the interaction term in Equation X.X, provides the test for differing slopes. When β3 is 0, then the slopes for the two groups are the same; if we reject the hypothesis that β3 = 0, then we have evidence for different slopes. When the two groups have different slopes, their regression lines will not be parallel. (See Figure 1.18 and Figure 1.19 for, respectively, examples of parallel and nonparallel regression slopes.)

To summarize, main effects models allow intercepts to differ among groups but force their slopes to be equal. Interaction models permit different intercepts but also allow different slopes.⁴ Hence, a significant interaction rejects the null hypothesis that the slopes are parallel. Although the example used groups, this principle extends to continuous independent variables. For a continuous X1, the main effects model predicts that the slopes will be parallel for any set of specific values of X1. The interaction model provides a test for parallel slopes.

Figure 1.16 provides output from PROC GLM in SAS for the main effects model. The main effects model fit the data well—R² = .56, F(2, 57) = 36.09, p < .0001. Both of the regression coefficients are significant.
For Experience, b = 2.51, t(57) = 4.64, p < .0001, and for Dose, b = 0.31, t(57) = 7.11, p < .0001. These findings agree with previous research, so there is evidence that the new compound has physiological activity like testosterone.

⁴ The terms homogeneity of slopes and heterogeneity of slopes are sometimes used to refer to, respectively, parallel and non-parallel slopes.

Figure 1.16 Regression results for the main effects model.

Dependent Variable: response    Sexual Activity Index

                              Sum of        Mean
Source             DF        Squares      Square    F Value    Pr > F
Model               2       316.2206    158.1103      36.09    <.0001
Error              57       249.7392      4.3814
Corrected Total    59       565.9598

R-Square    Coeff Var    Root MSE    response Mean
0.558733     13.81483    2.093177         15.15167

                             Standard
Parameter       Estimate        Error    t Value    Pr > |t|
Intercept    11.32785714   0.52578024      21.54      <.0001
experience    2.51000000   0.54045600       4.64      <.0001
dose          0.30825714   0.04333288       7.11      <.0001

Results from fitting the interactive model are given in Figure 1.17. This interaction model is also significant, R² = .59, F(3, 56) = 27.19, p < .0001. The appropriate statistical test for the interaction is for the coefficient for Experience*Dose. The coefficient is b = .182, and we can reject the null hypothesis that it is a random draw from a sampling distribution with a mean of 0 (t(56) = 2.17, p = .034). Hence, we can reject the hypothesis that the regression lines are parallel for the rats with and without prior sexual experience.

Figure 1.17 Regression results for the interaction model.
Dependent Variable: Response   Sexual Activity Index

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3          335.5551       111.8517      27.19    <.0001
Error              56          230.4048         4.1144
Corrected Total    59          565.9598

R-Square    Coeff Var    Root MSE    Response Mean
0.592896    13.38725     2.028391         15.15167

Parameter              Estimate    Standard Error    t Value    Pr > |t|
Intercept           12.08642857        0.61810091      19.55      <.0001
experience           0.99285714        0.87412669       1.14      0.2609
dose                 0.21722857        0.05938521       3.66      0.0006
experience*dose      0.18205714        0.08398338       2.17      0.0344

Note that the coefficient for Experience in Figure 1.17 is not significant (β1 = .99, t(56) = 1.14, p = .26). This does not imply that Experience plays no role in the recovery of sexual function. Why? Because the coefficients in interactive models do not have the same interpretation as the coefficients in main effects models. To examine the meaning of a coefficient in an interactive model, one must substitute numeric values into the regression equation—as we did above—and then examine their meaning. In Equation X.X, the coefficient for Experience is β1, and from Equations X.X and X.X, we see that β1 tests whether the intercepts for the inexperienced and experienced rats differ. Consequently, the lack of significance for Experience implies that the two groups have the same intercept. Substantively, this means that in the absence of the drug (i.e., Dose = 0) there is no difference in sexual activity for rats with and without prior sexual experience. Hence, the difference in means for the two Vehicle groups in Figure 1.15 is not statistically significant. By substituting the numeric values of the regression coefficients into Equations X.X, we can examine the interaction between Experience and Dose:

Ŷ0 = 12.09 + .22⋅Dose, (X.X)
Ŷ1 = 13.08 + .40⋅Dose. (X.X)

For naïve rats, a 1 mg increase in the drug will increase sexual activity by .22 units. In experienced rats, however, there will be almost a two-fold increase (.40 units).
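These substitutions are easy to verify by hand. A minimal sketch (Python) plugs the coefficient estimates from Figure 1.17 into the two group equations to recover the intercepts and slopes quoted above:

```python
# Coefficient estimates taken from the interaction model output (Figure 1.17)
a  = 12.08642857   # Intercept
b1 = 0.99285714    # experience
b2 = 0.21722857    # dose
b3 = 0.18205714    # experience*dose

# Inexperienced rats (Experience = 0): intercept a, slope b2
# Experienced rats  (Experience = 1): intercept a + b1, slope b2 + b3
print(round(a, 2), round(b2, 2))            # 12.09 0.22
print(round(a + b1, 2), round(b2 + b3, 2))  # 13.08 0.4
```

The experienced-rat slope, .22 + .18 ≈ .40, is the "almost two-fold increase" noted in the text.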
Hence, experienced rats are more sensitive to the drug than inexperienced rats.

1.3.2.1 Interactions: Example 1: The two-by-two factorial

One of the classic examples in experimental design is the two-by-two factorial. This design involves two variables (or, in ANOVA terminology, "factors"), each of which has two levels, say a control level and a treatment. Table 1.1 illustrates the design.

Table 1.1 Schematic of a two by two factorial design.

                         Factor 1:
Factor 2:        Control        Treatment
Control
Treatment

In concrete terms, let us assume that the mean of the group receiving both Control treatments is 10.3. Suppose that Treatment 1 increases the response by 4.6 units. Hence, the mean for those observations that are Controls for Factor 2 but Treatments for Factor 1 will be 10.3 + 4.6 = 14.9. Let us further assume that Treatment 2 increases the response by 2.4 units. Then the mean for the observations that are Controls for Factor 1 but Treatments for Factor 2 will be 10.3 + 2.4 = 12.7. If the Treatments are additive, then the predicted value for those observations receiving both Treatments will be the base rate (10.3) plus the effect of Treatment 1 (4.6) plus the effect of Treatment 2 (2.4), or 10.3 + 4.6 + 2.4 = 17.3. Hence, the additive model predicts the Treatment/Treatment cell in Table 1.1. If the mean of this cell differs significantly from this predicted value, then we have evidence that the additive model is false. Thus, the test of an additive model (i.e., a Main Effects model) versus a non-additive model (i.e., an Interactive model) consists of testing how well the mean of the Treatment/Treatment cell is predicted by the Main Effects of Treatment 1 and Treatment 2.
To place this design into a regression analysis, we code two independent variables: X1 is the independent variable for Factor 1 and it is dummy coded as 0 = Control and 1 = Experimental; X2 is the independent variable for the second Factor and it is similarly coded as 0 = Control and 1 = Experimental. The Main Effects model (i.e., the additive model) is

Ŷ = α + β1X1 + β2X2,

and the interactive model is

Ŷ = α + β1X1 + β2X2 + β3X1X2.

Clearly, the main effects model is a subset of the interactive model that stipulates that β3 = 0. We can now substitute the dummy codes for X1 and X2 to obtain the predicted values for the dependent variable in any of the four cells in this design. For example, the predicted value for a Control for Factor 1 (i.e., X1 = 0) and a Treatment for X2 is

Ŷ = α + β1(0) + β2(1) + β3(0)(1) = α + β2.

Filling all the predicted values into the empty cells of Table 1.1 gives the algebraic expressions in Table 1.2.

Table 1.2 Predicted values of the dependent variable in a two by two factorial design with interaction.

                         Factor 1:
Factor 2:        Control          Treatment
Control          α                α + β1
Treatment        α + β2           α + β1 + β2 + β3

Here, the intercept in the regression model (i.e., α) is the predicted value for the two Control conditions. Parameter β1 gives the effect of Treatment for Factor 1, and parameter β2 gives the effect of Treatment for Factor 2. If the effect of both Treatments is additive (i.e., the Main Effects model), then the predicted value for those observations that have received both Treatments is

Ŷ = α + β1 + β2.

If, on the other hand, the effects of both Treatments are non-additive (or interactive), then the predicted value will differ from the Main Effects model. Hence, the predicted value of this cell becomes

Ŷ = α + β1 + β2 + β3.

Consequently, a test that β3 = 0 effectively tests whether the relationship between Treatment 1 and Treatment 2 is additive (β3 is not significant) or non-additive (β3 is significant).
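The additive prediction can be reproduced directly from the dummy-coded regression equation. In this sketch, α, β1, and β2 take the values from the worked example above (10.3, 4.6, and 2.4), with β3 = 0 as the Main Effects model stipulates:

```python
# Main Effects (additive) model: Yhat = a + b1*X1 + b2*X2, dummy codes 0/1
a, b1, b2 = 10.3, 4.6, 2.4

def predict(x1, x2):
    """Predicted cell mean under the additive model."""
    return a + b1 * x1 + b2 * x2

print(round(predict(0, 0), 1))  # 10.3  Control/Control
print(round(predict(1, 0), 1))  # 14.9  Treatment 1 only
print(round(predict(0, 1), 1))  # 12.7  Treatment 2 only
print(round(predict(1, 1), 1))  # 17.3  additive prediction for Treatment/Treatment
```

If the observed Treatment/Treatment mean departs significantly from 17.3, the additive model fails and β3 is needed.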
If the test that β3 = 0 is not significant, then we prefer the Main Effects model. Otherwise, we favor the Interactive model.

1.3.2.2 Interactions: Example 2: Two Continuous Independent Variables

The philosophy outlined above on homogeneous versus heterogeneous slopes still holds in the case of two continuous independent variables. If the additive, Main Effects model holds, then the slope of the regression line of Y on X1 will remain the same irrespective of the value of X2. That is, the predicted value of Y given any specific value of X2—a quantity that we denote as (Ŷ | X2)—will be

(Ŷ | X2) = (α + β2X2) + β1X1. (X.X)

In this equation, the intercept is the quantity (α + β2X2), which will indeed depend on X2, but the slope of the regression line is always β1, which does not depend on X2. To illustrate this, let α = .7, β1 = .4, and β2 = 1.3. Now fix X2 at any three values that you wish, substitute each of these into Equation X.X, and draw the three regression lines. No matter what three values you select, the regression lines will be parallel, as illustrated in Figure 1.18.

Figure 1.18 Regression lines from a main effects model.

The interactive model, on the other hand, is

Ŷ = α + β1X1 + β2X2 + β3X1X2.

If X2 is fixed at any value, then this equation can be rearranged to become

(Ŷ | X2) = (α + β2X2) + (β1 + β3X2)X1.

Here, the intercept equals the quantity (α + β2X2), which is, as before, a function of X2. The slope, however, now equals (β1 + β3X2), which is a function of X2. Hence, the slope of the regression line of Y on X1 depends on the value of X2. To illustrate, keep α, β1 and β2 as before but let β3 = .8. Then, for the same three values of X2 used to compute the lines in Figure 1.18, we have the regression lines in Figure 1.19.

Figure 1.19 Regression lines from an interaction model.
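The parallel-versus-nonparallel contrast can be computed directly. Using the same coefficients as the text (α = .7, β1 = .4, β2 = 1.3, and β3 = .8 for the interactive model), this sketch evaluates the conditional intercept and slope for three fixed values of X2:

```python
# Coefficients from the text's illustration
a, b1, b2, b3 = 0.7, 0.4, 1.3, 0.8

for x2 in (0.0, 1.0, 2.0):                  # any three fixed values of X2
    intercept = a + b2 * x2                 # both models: depends on X2
    main_slope = b1                         # main effects model: constant
    inter_slope = b1 + b3 * x2              # interactive model: grows with X2
    print(x2, round(intercept, 1), main_slope, round(inter_slope, 1))
```

The main-effects slope is .4 for every X2 (parallel lines, Figure 1.18), while the interactive slopes are .4, 1.2, and 2.0 (fanning lines, Figure 1.19).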
In the example, we fixed the value(s) of X2 and then examined the slope of the regression line of Y on X1. The same principles, however, hold if we were to fix values of X1 and examine the slopes of the regression lines of Y on X2. For example, according to the additive, Main Effects model, the equation is

Ŷ = α + β1X1 + β2X2 = (α + β1X1) + β2X2.

Again, the intercept depends on the value of X1, but the slope does not. The non-additive or Interactive model is

Ŷ = α + β1X1 + β2X2 + β3X1X2 = (α + β1X1) + (β2 + β3X1)X2.

Here, both the intercept and the slope depend on the value of X1. As in the case of the two by two factorial, the procedure for dealing with interactions involving continuous variables is to fit two different regression models. The first of these contains the terms without an interaction. That is,

Ŷ = α + β1X1 + β2X2.

The second model contains the interaction, or

Ŷ = α + β1X1 + β2X2 + β3X1X2.

A test that β3 = 0 is used to decide between models. If β3 is significant, then one favors the interactive model. If it is not significant, then one favors the model without the interaction. At this point, the reader may protest, "Why not just fit the interactive model, and if the interaction is not significant, just ignore it? What does one gain by also fitting the model without the interaction?" The answer is that one can gain a lot, mostly in the form of avoiding errors. Because an interactive term is formed as the product of two variables, it can create strong multicollinearity among the IVs.5 As a result, β1, say, may not be significant in the interactive model while it is highly significant in the reduced model. If the interaction term is not significant and β1 is not significant, then one might falsely conclude that variable X1 has no influence on the dependent variable.
To avoid such incorrect inferences, we now develop a strategy for analyzing interactions using a model-comparison approach.

1.3.2.3 Assessing Interactions: A Model-Comparison Approach

The general rules for assessing interactions using a model-comparison approach are stated in Figure 1.20. The general philosophy is to assess the highest-order interactions first. If they are not significant, then drop them from the model and rerun the regression using the reduced model. Note that we recommend this approach for both observational studies and for experimental designs. The reason why it is needed for experimental designs is that interactions may induce multicollinearity in the data. This can have the effect of masking the significance of the lower-order terms in the model. It is also possible to work using a bottom-up approach. Here, one starts with a simple model and then tests whether adding terms to that model significantly improves prediction. See Judd and McClelland (1989) for advice on implementing that approach. We emphasize the top-down philosophy here because it is suitable for the "test every possible interaction" approach sometimes used in the analysis of experimental data.

5 The operative phrase here is "can induce multicollinearity." Interactions do not always induce multicollinearity. Rather than learn all the conditions under which interactions do and do not result in multicollinearity, we urge the reader to use the algorithm outlined in the text—statistically test for interactions, and if the test is not significant, then eliminate the interaction term and rerun the model. This simple algorithm will always work.

Figure 1.20 Model comparison algorithm for interactions.

(1) Fit the most plausible general model first. The most plausible general model should not necessarily be the one with the highest possible interaction term, but the model with the highest plausible interaction term.
In fitting the general model, always include all lower-order interactions, and always include all of the variables that form the interaction. For example, if the highest plausible interaction is a three-way interaction, then the model should include all of the two-way interactions as well as the three variables that compose the interaction.

(2) If the highest plausible interaction term is significant, then accept that model. (One may, however, wish to test whether other terms in the model can be eliminated.) If the interaction is not significant, then remove it from the model, consider the reduced model as the next plausible general model, and return to step (1).

NOTE: In some cases, this algorithm may result in evaluating a model with several interactions of the same order. For example, suppose that the initial general model contained a non-significant three-way interaction. Then the next model will have three two-way interactions as the next highest interactive terms. What follows next is an exercise in logic. If all three of these interactions are not significant, then we know that we can eliminate any one of them without a significant loss of fit. We do not know, however, whether we can eliminate any two of them (or even all three of them) without a significant loss of fit. In this case, it is safest to test the fit by comparing the model with all three interactions to the model with none of the two-way interactions. If there is no significant difference between the two, then all of the two-way interactions can be safely eliminated. If there is a significant loss of fit, then we know that at least two of the interactions are needed. Here, the next step would be to compare each of the three models that have one and only one two-way interaction to the model with all three two-way interactions. At the end of the day, one should be left with a parsimonious model that still satisfactorily predicts the dependent variable.
Inferences about the predictor variables should be based on the significance of their coefficients in this reduced model.

1.3.2.4 An Example of the Model Comparison Approach to Interactions.

We illustrate this approach using fabricated data from a clinical trial of three treatments. For each treatment, there is a control group along with an experimental group, and each treatment is crossed with the other two treatments. Hence, there are eight total groups in the study. Examples of these groups would be: Control 1—Control 2—Control 3; Active 1—Control 2—Control 3; Control 1—Active 2—Active 3; and Active 1—Active 2—Active 3. We can code this design by using dummy codes for each treatment. For treatment 1, we construct a variable X1 with values of 0 for a control subject and 1 for an active treatment subject. Variables X2 and X3 contain the dummy codes for, respectively, treatments 2 and 3. There will be three two-way interactions among the three treatments—X1 and X2, X1 and X3, and X2 and X3. These may be calculated as the products of the two variables involved in the interaction. For instance, X1X2 = X1 * X2. Finally, there will be the three-way interaction. Again, this will be the product of all three variables: X1X2X3 = X1 * X2 * X3. Hence, the general model is

Ŷ = α + β1X1 + β2X2 + β3X3 + β4X1X2 + β5X1X3 + β6X2X3 + β7X1X2X3.

The logic is to fit this general model first and then test for the significance of the three-way interaction. If it is significant, it will be kept in all subsequent models. If it is not significant, then it will be dropped from the model. Next, the two-way interactions will be evaluated. The ultimate goal is to arrive at a parsimonious model that explains the data well without sacrificing predictability. The output from the general regression model is given in Figure 1.21.
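The dummy codes and product terms for the general model can be built mechanically. A short NumPy sketch, using the X1, X2, X3 notation of the text (one row per cell of the 2 × 2 × 2 design):

```python
import itertools
import numpy as np

# All eight cells of the 2 x 2 x 2 design: dummy codes for X1, X2, X3
cells = np.array(list(itertools.product((0, 1), repeat=3)), dtype=float)
x1, x2, x3 = cells.T

# Design matrix for the general model: intercept, three main effects,
# three two-way products, and the three-way product
X = np.column_stack([
    np.ones(8), x1, x2, x3,
    x1 * x2, x1 * x3, x2 * x3,
    x1 * x2 * x3,
])
print(X.shape)   # (8, 8)
print(X[-1])     # Active 1-Active 2-Active 3 row: every term equals 1
```

In a real analysis each row would be repeated for every subject in that cell; the point here is only that the interaction columns are literal products of the dummy codes.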
The coefficient for the three-way interaction is .7625, and it is not close to being significant (t = 0.54, p = .59). Hence, the next step is to create a reduced model without the three-way interaction. This will serve as a baseline model for testing the two-way interactions. Before leaving Figure 1.21, note that there is only one significant regression coefficient, the one for X3. The overall R2 for this model is very high (R2 = .65, not shown in Figure 1.21). The lack of significance for the coefficients coupled with the high R2 strongly suggests multicollinearity.

Figure 1.21 Interactions: Regression coefficients for the general model.

QMIN: Regression Model Comparisons: Interactions in an experimental design
General Model

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1             18.52500           0.50302       36.83      <.0001
X1            1              0.87500           0.71138        1.23      0.2238
X2            1              1.07500           0.71138        1.51      0.1364
X3            1             -1.98750           0.71138       -2.79      0.0071
X1X2          1              1.58750           1.00605        1.58      0.1202
X1X3          1              0.50000           1.00605        0.50      0.6211
X2X3          1              0.82500           1.00605        0.82      0.4157
X1X2X3        1              0.76250           1.42277        0.54      0.5941

The coefficients for the first reduced model are presented in Figure 1.22. The coefficient for the interaction term X1X2 is significant. Given that the effect of multicollinearity is usually to make it difficult to detect significance when it is indeed present, this suggests that the interaction between X1 and X2 is meaningful. But what of the other two-way interactions?

Figure 1.22 Interactions: Regression coefficients from the first reduced model.
QMIN: Regression Model Comparisons: Interactions in an experimental design
Reduced Model 1

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1             18.62031           0.46758       39.82      <.0001
X1            1              0.68437           0.61221        1.12      0.2683
X2            1              0.88437           0.61221        1.44      0.1541
X3            1             -2.17813           0.61221       -3.56      0.0008
X1X2          1              1.96875           0.70692        2.78      0.0073
X1X3          1              0.88125           0.70692        1.25      0.2176
X2X3          1              1.20625           0.70692        1.71      0.0934

It is clear that we can delete X1X3 from the model (t = 1.25, p = .22) or we can delete X2X3 from the model (t = 1.71, p = .09). So the major question is whether we can delete both of them without significantly worsening the fit of the model. (Recall that the lack of significance of X1X3 and X2X3 does not necessarily imply that both can safely be removed from the model—see Section 1.3.1.3.) All good statistical packages have provisions that allow one to delete more than one IV from the model and then assess the fit. In PROC REG in SAS, that provision is given by the TEST statement. The output from a test for simultaneously setting the coefficients for X1X3 and X2X3 to 0 is given in Figure 1.23. Here the test statistic is an F ratio, and its associated p value (.12) is not significant. Hence, both of these coefficients can be set to 0 without a significant loss of fit.

Figure 1.23 Interactions: Test of the significance of both interaction terms X1X3 and X2X3.

QMIN: Regression Model Comparisons: Interactions in an experimental design
Reduced Model 1

Test No_X1X3_X2X3 Results for Dependent Variable Y

Source         DF    Mean Square    F Value    Pr > F
Numerator       2        4.46328       2.23    0.1165
Denominator    57        1.99895

We now run a second reduced model, this time eliminating both of these two-way interactions. The results from this reduced model are given in Figure 1.24. Note that all of the coefficients are now significant. Hence, we cannot reduce this model any further. This second reduced model becomes our final model.
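The F ratio that the TEST statement reports is an ordinary nested-model F test, and it can be reproduced from the two models' error sums of squares. A sketch of the general computation; the SSE values below are reconstructed from the mean squares and degrees of freedom in Figure 1.23 (MS × df), not taken from raw data:

```python
def nested_f(sse_reduced, sse_full, df_reduced, df_full):
    """F test that the q = df_reduced - df_full dropped coefficients are all 0."""
    q = df_reduced - df_full                    # number of coefficients dropped
    return ((sse_reduced - sse_full) / q) / (sse_full / df_full)

# The mean squares reported in Figure 1.23 give the ratio directly:
print(round(4.46328 / 1.99895, 2))              # 2.23, the reported F value

# Reconstructed error sums of squares for the two models (an assumption):
sse_full = 1.99895 * 57                         # model keeping X1X3 and X2X3
sse_reduced = sse_full + 2 * 4.46328            # model dropping both terms
print(round(nested_f(sse_reduced, sse_full, 59, 57), 2))   # 2.23 again
```

The same function applies to any "drop q terms at once" comparison, which is exactly the situation the NOTE in Figure 1.20 describes.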
Figure 1.24 Interactions: Regression coefficients from the second reduced model.

QMIN: Regression Model Comparisons: Interactions in an experimental design
Reduced Model 2

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1             18.09844           0.40335       44.87      <.0001
X1            1              1.12500           0.51021        2.20      0.0314
X2            1              1.48750           0.51021        2.92      0.0050
X3            1             -1.13438           0.36077       -3.14      0.0026
X1X2          1              1.96875           0.72154        2.73      0.0084

Note the difference in the significance of the main effects in the general model (Figure 1.21) versus the final model (Figure 1.24). In the final model, all three main effects were significant. In the general model, only X3 was significant. Note that the interaction X1X2 was also significant in the final, but not the general, model. The reason for this is multicollinearity. The three main effects have a correlation of .38 with the three-way interaction, and each two-way interaction correlates .65 with X1X2X3. Also, in the general model, the tolerance for X1X2X3 is quite low (.14). When the three-way interaction is dropped and the model is reduced to its final form, the maximum tolerance is .33—still small, but given that each coefficient is significant, there is no way to reduce the model without sacrificing predictability. If we had interpreted the coefficients from the general model, we would have erroneously concluded that only Treatment 3 had an effect on the response. Both Treatments 1 and 2, as well as their interaction, influenced the response. Hence, we state a cardinal rule for analyzing interaction terms in regression, ANOVA, and GLM: Never, under any circumstances, interpret the significance of lower-order coefficients in a model in which a higher-order interaction is not significant. Always remove the higher-order interaction from the model and rerun the GLM.

1.3.3 Polynomial Regression

An important case of multiple regression is polynomial regression.
Here, the independent variables consist of an original independent variable and a series of new independent variables that are powers of the original independent variable.6 Let X1 denote the original independent variable. Then, in the data set, we create additional independent variables so that X2 = X1², X3 = X1³, and so on. The GLM written in terms of the original independent variable is

Ŷ = α + β1X1 + β2X1² + β3X1³ + ⋯ + βkX1ᵏ. (X.X)

In order to understand polynomial regression, we must first understand the mathematical implications of regressing a dependent variable on a single independent variable. As we have shown before, this regression fits a straight line to the data points. In mathematical terms, the straight line ranges from negative infinity to positive infinity. For any concrete data set, however, the straight line applies only to the range of values of the independent variable in the data set. To illustrate, consider a linear regression of weight on height in humans. Fitting a straight line through the data points might generate the mathematical prediction that a person who is -1.7 meters in stature should weigh -322 kilograms. Mathematically, this prediction is correct, but it is quite illogical from the common-sense view that no human can be -1.7 meters in height. In a similar way, when we fit a polynomial model with a squared term for an independent variable, i.e.,

Ŷ = α + β1X + β2X²,

we fit a parabola to the data. Mathematically, that parabola takes the form of the curve in Figure 1.25a (or that curve flipped on its head), and that curve extends from negative infinity to positive infinity along the horizontal axis and from the minimum of the curve to positive (or negative) infinity along the vertical axis.

6 An "original independent variable" may be a variable originally recorded in the data set or some transform of that variable (e.g., a log or square root transform).
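Creating the power terms is a purely mechanical step. A minimal sketch (NumPy; the X1 values are illustrative, not from any data set in the text):

```python
import numpy as np

# Given an original independent variable X1, create the power terms by hand
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # illustrative values
x2 = x1 ** 2                                 # quadratic term (X2 in the text)
x3 = x1 ** 3                                 # cubic term (X3 in the text)

# Design matrix for Yhat = a + b1*X1 + b2*X1**2 + b3*X1**3
X = np.column_stack([np.ones_like(x1), x1, x2, x3])
print(X[1])   # [1. 2. 4. 8.]
```

The fitted model is still linear in the coefficients; only the predictors are nonlinear transforms of X1, which is why ordinary least squares applies unchanged.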
For any specific data set, however, the range of values on the horizontal axis will be limited. Hence, the practical effect of fitting a model with the X² independent variable to the data will be to take a "slice" from the parabola in Figure 1.25a (or a "slice" from an inverted form of that parabola). Examples of those slices—i.e., the types of curves that could be fitted to real data—are illustrated in panels (b) through (d) of Figure 1.25.

Figure 1.25 Examples of quadratic curves, panels (a) through (d).

Similarly, a cubic regression fits the model

Y = α + β1X + β2X² + β3X³ + E,

and an example of the general form of the cubic polynomial is provided in panel (a) of Figure 1.26. Specific "slices" of different cubic curves are illustrated in panels (b) through (d) of that Figure. Note that a cubic can be used to model relationships that asymptote (panels b and c) as well as curves in which the rate of acceleration to a maximum differs from the deceleration rate to baseline (panel d).

Figure 1.26 Examples of cubic curves, panels (a) through (d).

1.3.3.1 Fitting Polynomial Models

We illustrate polynomial regression with the serotonin data set. Here, the purpose is to examine the time course of the effect of administering a drug that purports to increase levels of serotonin (5-HT) in the CNS. Groups of rats were administered a standard dose of the agonist and sacrificed at various time points after administration. CSF was then assayed for 5-HIAA, the principal metabolite of 5-HT. Figure 1.27 presents the mean level of 5-HIAA for each group of rats. (The predicted means in the Figure will be discussed later.)

Figure 1.27 Assays of 5-HIAA in CSF as a function of time.

The first step in fitting polynomial regression is to formulate an a priori hypothesis about the form of the curve. In experimental studies, this hypothesis should guide the original design of the study.
For example, in the 5-HT study, we assume that 5-HT levels will increase, reach a peak, and then decrease to baseline. Hence, we expect at least a quadratic relationship, but we should also be prepared to fit a cubic or even a quartic to test whether the rate of increase in the initial stage equals the rate of decrease in the latter stages. In terms of fitting polynomial models, think of the term X² as X * X, or an interaction between X and itself. Similarly, X³ may be viewed as the three-way interaction of variable X with itself. According to this perspective, polynomial regression amounts to a series of interactions involving one variable with itself. Hence, the algorithm developed for interactions also applies to polynomial regression. In concrete terms, this algorithm is stated in Figure 1.28.

Figure 1.28 Algorithm for fitting polynomial regression models.

(1) Fit the model with the highest plausible polynomial term.

(2) If the coefficient for the highest plausible polynomial term is significant, then accept that model (although one may wish to test the significance of lower-order coefficients in that model). If the coefficient for the highest plausible polynomial term is not significant, then drop that term from the model, consider the next highest polynomial term as the most plausible polynomial term, and go to (1).

In short, one starts with the largest plausible polynomial and then tests for the significance of the largest term and only the largest term. If that coefficient is significant, then accept the model (although one may subsequently test for the significance of other terms in the model). If the term is not significant, then eliminate it from the model, rerun the regression, and examine the significance of the next highest term. Continue this procedure to reduce the polynomial to the lowest significant order. In the 5-HIAA data, we would start with a quartic and then test the quartic term.
If the quartic is significant, then we stop there and accept that model. If it is not significant, then we delete the quartic term, fit a cubic polynomial, and examine the significance of the regression coefficient for the cubic term. If it is significant, then we accept the cubic polynomial as the final model. If it is not significant, then we delete the cubic term and fit a quadratic model. If the quadratic term is significant, then we accept the quadratic. Otherwise, we drop the quadratic term and fit the simple linear model. If the linear model is significant, then we accept that model. Otherwise, we conclude that there is no change in 5-HIAA over time. Figure 1.29 gives the output from PROC REG in SAS that fitted up to a quartic polynomial to the 5-HT data. In this output, variable time2 is the square of time, time3 is the cube of time, etc. Only that section of the output that pertains to the regression coefficients is shown. For completeness, all regressions from the linear to the quartic are shown. We would begin with the quartic model given in the last part of the table. The coefficient for the quartic is not significant (b = .0043, t = .40, df = 91, p = .69). Hence, we would eliminate the quartic and fit a cubic. Recall here the importance of the algorithm. Our interest is only in the quartic term. If we examined the significance of all terms in the model, we might have erroneously concluded that there was no change of 5-HIAA over time. The next step is to eliminate the quartic term and run the cubic model. The coefficient for the cubic is significant (b = .0399, t = 2.14, df = 92, p = .03), so we would accept this model as the most parsimonious polynomial that does not sacrifice predictability. The plot of the predicted means in Figure 1.27 was derived from the parameters of the cubic model. The inclusion of the linear and the quadratic regressions in Figure 1.29 suggests an important lesson—be wary of starting with the lowest term and working up.
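The top-down loop just described is easy to sketch in code. The data below are simulated with a true quadratic time course (the actual 5-HIAA values are not listed in the text, so the printed t statistics will not match Figure 1.29); the point is the mechanics of fitting each polynomial and examining only the t statistic of its highest-order term:

```python
import numpy as np

def poly_fit(x, y, order):
    """OLS fit of y on 1, x, x**2, ..., x**order; returns (coefs, t_stats)."""
    X = np.column_stack([x ** p for p in range(order + 1)])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    df = len(y) - X.shape[1]
    sigma2 = resid @ resid / df                   # error mean square
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coefs, coefs / se

# Simulated data: quadratic trend plus noise (NOT the real 5-HIAA data)
rng = np.random.default_rng(0)
x = np.repeat(np.arange(1.0, 11.0), 10)           # ten observations per time point
y = 10.3 + 1.4 * x - 0.17 * x**2 + rng.normal(0, 0.5, x.size)

# Top-down: at each step, look only at the highest-order term
for order in (4, 3, 2, 1):
    coefs, t = poly_fit(x, y, order)
    print(order, round(t[-1], 2))
```

With these simulated data the loop would stop at the quadratic, whose highest-order t statistic is large; in the text's data the stopping point is the cubic.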
Had we fitted the linear model first, then we might have been tempted to say there was no significant change over time because the linear term was not significant (b = -.133, t = -1.68, df = 94, p = .10). It is quite true that this coefficient is not significant, but the only thing that this tells us is that the best fitting straight line through the means in Figure 1.27 has a slope near 0. There may be nonlinear effects of time on 5-HIAA. Indeed, fitting a quadratic term substantially increases predictability—R2 increases from .03 to .22, and the coefficient for the squared term (-.1713) is highly significant (t = -4.80, df = 93, p < .0001). This implies that there is a significant "bend" to the plot of means in Figure 1.27. But this quadratic effect would have been missed if one had stopped the analysis after fitting only the linear model.

Figure 1.29 Output from regression of 5-HIAA on the polynomials of time.

Linear Model: R2 = .029
Variable     DF     Estimate      Error      t Value    Pr > |t|
Intercept     1     12.89554    0.40067       32.18       <.0001
time          1     -0.13317    0.07934       -1.68       0.0966

Quadratic Model: R2 = .222
Variable     DF     Estimate      Error      t Value    Pr > |t|
Intercept     1     10.32589    0.64576       15.99       <.0001
time          1      1.40861    0.32924        4.28       <.0001
time2         1     -0.17131    0.03571       -4.80       <.0001

Cubic Model: R2 = .259
Variable     DF     Estimate      Error      t Value    Pr > |t|
Intercept     1      8.34881    1.11909        7.46       <.0001
time          1      3.46558    1.01261        3.42       0.0009
time2         1     -0.71051    0.25400       -2.80       0.0063
time3         1      0.03994    0.01863        2.14       0.0347

Quartic Model: R2 = .260
Variable     DF     Estimate      Error      t Value    Pr > |t|
Intercept     1      9.07845    2.13917        4.24       <.0001
time          1      2.39322    2.86166        0.84       0.4052
time2         1     -0.24312    1.19340       -0.20       0.8390
time3         1     -0.03745    0.19393       -0.19       0.8473
time4         1      0.00430    0.01072        0.40       0.6894

1.3.3.1.1 Advanced Topic: Fitting polynomial models: A sneaky shortcut.

In Section X.X, we spoke of two different sums of squares—hierarchical sums of squares and partial sums of squares.
The algorithm outlined above will always work regardless of the types of sums of squares used. If, however, the regression or GLM software that you use allows easy calculation of the hierarchical sums of squares, then there is an easy shortcut for computing polynomial regression. The shortcut is to fit the polynomial model and request the hierarchical sums of squares from the software. Then select the highest order of the polynomial that is significant. Figure 1.30 presents the hierarchical solution (labeled as the "Type I SS") from fitting a quartic to the 5-HIAA data set referenced above.7 (Recall that variable "time1" is the linear term, "time2" is the quadratic, and so on.)

Figure 1.30 Hierarchical solution (Type I SS) for fitting a quartic polynomial model to the 5-HIAA data set.

Source     DF      Type I SS     Mean Square    F Value    Pr > F
time1       1     8.93867937      8.93867937       3.58    0.0617
time2       1    59.16345714     59.16345714      23.68    <.0001
time3       1    11.37122475     11.37122475       4.55    0.0356
time4       1     0.40152468      0.40152468       0.16    0.6894

The quartic term (time4) is not significant (F(1, 94) = 0.16, p = .69). The cubic term (time3), however, is significant (F(1, 94) = 4.55, p = .04). Hence, we select the cubic polynomial as the best model. We would then run the cubic model and use the coefficients from this model to obtain predicted values of 5-HIAA as a function of time. The results would be the same as if we had followed the algorithm presented in Figure 1.28.

1.3.3.2 Advanced Topic: Maxima and Minima in Polynomial Regression

One of the advantages of polynomial regression is that it can provide precise estimates of maximal (or minimal) response. In designing experiments such as the time-course one given above, selection of the initial time intervals can be problematic, especially when investigating a new drug.
Also, the time intervals are usually round numbers (e.g., 5 minutes, 10 minutes), while the physical action of a drug respects no arbitrary time unit and can peak or trough anywhere. Hence, coefficients from a polynomial equation can be helpful in “filling in” those portions of the curve between adjacent groups.

Finding the maxima and minima of a function is a problem of differential calculus: take the first derivative of the equation, set it to 0, and solve for the roots. For example, the first derivative of a quadratic equation with respect to the independent variable is

∂Ŷ/∂X = β1 + 2β2X.

Setting this to 0 gives

β1 + 2β2X = 0,

and solving for X gives

X = −β1 / (2β2).

Substituting the numeric estimates of β1 and β2 into this equation provides an estimate of the maximum (or minimum). Simply look at the curve of predicted values to determine whether the result is a maximum or a minimum. Table 1.3 gives the first derivatives for up to a fifth-order polynomial. The easiest way to solve these equations is to use mathematical software that solves for the roots (i.e., the values of X) of polynomials. Good statistical packages will always have such a feature.8 For a polynomial of order k, there will be (k − 1) roots. Select only those roots that are within (or very close to being within) the range of values of X in the data set.

Table 1.3 Equations for finding the maxima and minima of some polynomial functions.

Order of Polynomial    Equation
2                      β1 + 2β2X = 0
3                      β1 + 2β2X + 3β3X² = 0
4                      β1 + 2β2X + 3β3X² + 4β4X³ = 0
5                      β1 + 2β2X + 3β3X² + 4β4X³ + 5β5X⁴ = 0

As an example, substituting the regression coefficients for the serotonin time-course example into the equation in Table 1.3 for a cubic polynomial gives

3.46558 − 2(.71051)X + 3(.03994)X² = 0,
3.46558 − 1.42102X + .11982X² = 0.

7 This solution was generated from PROC GLM in SAS.
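The text uses the POLYROOT function in SAS PROC IML for this step; as a plain-Python stand-in, the quadratic formula recovers the same two roots of the equation 3.46558 − 1.42102X + .11982X² = 0, and the second derivative of the cubic tells which root is the maximum and which the minimum. This is only an illustrative sketch, not part of the text's software.

```python
import math

# Derivative of the fitted cubic, set to zero (coefficients from the text):
#   3.46558 - 1.42102*X + 0.11982*X**2 = 0
a, b, c = 0.11982, -1.42102, 3.46558
disc = b * b - 4.0 * a * c
roots = sorted([(-b - math.sqrt(disc)) / (2.0 * a),
                (-b + math.sqrt(disc)) / (2.0 * a)])

# Second derivative of the cubic at each root (2aX + b):
# negative curvature means a maximum, positive means a minimum
curvature = [b + 2.0 * a * r for r in roots]
print(roots, curvature)
```

The smaller root (about 3.4) has negative curvature, hence the maximum; the larger (about 8.4) is the minimum, matching the values reported below.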
Note that it is recommended practice to retain several decimal places when solving for the roots of polynomials. Using the POLYROOT function in PROC IML of SAS, we find that the maximal response occurred at time unit 3.4 and the minimal response at time unit 8.4.

1.3.3.3 Example: Polynomial Regression with Ordered Groups

An important application of polynomial regression occurs with ordered groups (see Section X.X). Many studies of substance abuse in humans place subjects into ordinal categories. Cigarette use within the past month, for example, may be coded into levels of “never smoked,” “previous user, currently abstinent,” “occasional, but not a daily smoker,” “daily smoker, less than 20 cigarettes per day,” “daily smoker, 20 to 40 cigarettes,” and “daily smoker, more than 40 cigarettes.” Suppose that a study on the effects of nicotine withdrawal placed subjects into these categories. The purpose of the study was to examine the effects of a specified period of abstinence from nicotine on a variety of biological measures. Participants entered the lab and spent a fixed amount of time waiting, without the opportunity to smoke, before being outfitted with a series of electrodes. The dependent variable is the percentage of time spent in alpha EEG wave activity during a relaxation session; more time spent in alpha during this session is taken as an overall measure of relaxation. Figure 1.31 plots the observed means of percent time in alpha activity as a function of the level of cigarette smoking. (Ignore, for the moment, the lines in this Figure that represent predicted values; we deal with them later.) It is obvious from the observed means that subjects with high cigarette consumption have lower measures of relaxation than those with little or no consumption. But can we extract more information from these data?
The answer is “Yes.” Analysis of these data would begin by assigning ordinal numbers to the categories: 1 for “never smoked,” 2 for “previous user, currently abstinent,” and so on, up to 6 for “daily smoker, more than 40 cigarettes.” Next, create new variables that are the polynomials of this ordered-group variable. Finally, perform the polynomial regressions as outlined above in Sections 1.3.3.1 and 1.3.3.1.1.

Figure 1.31 Mean (± one standard error) relaxation scores of six categories of cigarette use after a brief period of abstinence. Also shown are the best-fitting linear, quadratic, and cubic polynomials.

In analyzing these data, we used the hierarchical method specified in Section 1.3.3.1.1 and started with a quartic model. The polynomial variables were called Poly1 (the linear term) through Poly4 (the quartic term). Output for the hierarchical SS from PROC GLM in SAS is given in Figure 1.32.

Figure 1.32 Hierarchical solution (Type I SS) from fitting a quartic polynomial to the cigarette-abstinence data.

Source    DF    Type I SS       Mean Square     F Value    Pr > F
Poly1      1    28588.12723     28588.12723       51.30    <.0001
Poly2      1     7158.86418      7158.86418       12.85    0.0005
Poly3      1     4888.27163      4888.27163        8.77    0.0036
Poly4      1     1430.65951      1430.65951        2.57    0.1115

The quartic model overfits the data because the quartic term is not significant. The first three terms, however, are significant, so we settle on the cubic polynomial as the best model. As should be clear from this description, the mechanics of fitting polynomials to ordered groups are the same as when the independent variable is measured on an interval or ratio scale. The difference lies in the interpretation of the results. With an interval or ratio scale, one can interpolate between groups and calculate maximal and minimal response points.

8 Sometimes the routine will be described as finding “zeros” of a polynomial.
With ordered groups, however, such interpolations and calculations are valid only to the extent that the underlying order of the groups approaches an interval scale. The polynomial functions for ordered groups should instead be used to interpret differences in the group means. Hence, the model can be used to make inferences about the groups, but the nature of those inferences depends on the nature of the groups. Let us illustrate.

Return to Figure 1.31 and compare the predicted values for the linear model to those from the quadratic model. The linear model predicts that increasing exposure to tobacco is associated with decreased relaxation. The quadratic model (which, recall, fits better than the linear model) reveals something more. The curve for the first three groups is flat. This suggests that those who have never smoked, are ex-smokers, or smoke only occasionally have similar levels of relaxation. Once we get to daily smokers, however, the curve descends in a “dose-dependent” fashion. This pattern could be interpreted as a difference between those currently addicted to nicotine and those not currently addicted.

The cubic curve adds to prediction in two ways. First, it suggests a meaningful difference among the first three groups. Those who have never smoked may differ in their tendency to relax from those who have taken up smoking, whether in the past (the abstinent group) or only occasionally. This may have less to do with the addictive and physiological properties of nicotine and more to do with the participants’ environments and personalities during the period of maximal risk for sampling cigarettes. The statistical analysis cannot prove this, but it acts as a good heuristic that can guide future research in this area. The second difference between the cubic and the quadratic curve lies in the “dose-response” portion of the curve for daily smokers.
The quadratic curve predicts an almost linear decrease in relaxation from the 4th group (daily smokers, less than one pack) to the 6th (daily smokers, more than two packs). The cubic curve, on the other hand, agrees with the observed means in suggesting that the “dose-response” curve flattens after a certain point. Because the data consist of ordered groups and not a firm quantitative estimate of dose, one should not make strong claims about where the predictions asymptote. One could, however, use the form of the curve to guide the design of further studies in this area.

1.3.3.4 Example: Polynomial Regression with Interaction

Figure 1.33 presents the mean dose-response curves for two groups of mice: a control group and a group receiving a drug agonist. (The predicted values in this Figure come from the best-fitting regression model, which awaits further explication below.) It is clear from this Figure that the agonist has the effect of increasing the response by roughly two units. But does the overall shape of the dose-response curve for the agonist differ from that of the controls?

Figure 1.33 Observed means and means predicted from the best-fitting regression model for dose-response curves from a control group and a group administered a drug agonist.

To answer this question, we write a model that includes both polynomial terms (to model the dose response) and interaction terms (to test whether the shape of the dose-response curve differs between the Control and the Agonist groups). Let us begin by dummy coding the group variable such that 0 = Control and 1 = Agonist. We will then fit a series of regression models, starting with the most general model and reducing it to a parsimonious one that still predicts well. Finally, after we settle on the best model, we will substitute a value of 0 for the Control group and 1 for the Agonist group into the regression equation to interpret the meaning of the model’s parameters.
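The dummy coding and product terms just described can be generated mechanically. In this sketch, design_row is our illustrative name (not from the text): each observation contributes the dummy code, the dose polynomials, and their products as regressors for the most general model.

```python
def design_row(dose, group):
    """Regressors for the general model; group is dummy coded
    (0 = Control, 1 = Agonist), and interactions are simple products."""
    agonist = 1 if group == "Agonist" else 0
    return {
        "Intercept": 1,
        "agonist": agonist,
        "dose": dose,
        "dose*dose": dose * dose,
        "agonist*dose": agonist * dose,
        "agonist*dose*dose": agonist * dose * dose,
    }

row = design_row(5, "Agonist")
print(row)
```

For a Control animal, the agonist column and both product columns are 0, which is what makes the group-specific equations derived below drop those terms.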
Figure 1.34 presents the statistics for overall fit and the parameter estimates from two regression models. The first order of business is to find the most parsimonious model that explains the data without sacrificing explanatory power. Model 1 fits the main effect for agonist, the polynomial for dose, an interaction between agonist and the linear term for dose (agonist*dose), and an interaction between agonist and the quadratic term for dose (agonist*dose*dose).

Figure 1.34 Results of regressing dependent variable Response on independent variables group (Control vs. Agonist) and dose of drug, including the polynomial effects of drug and the interactions of group and dose of drug.

Model 1: F(5,174) = 16.05, p < .0001, R2 = .316
Parameter             Estimate       Std Error     t Value    Pr > |t|
Intercept             16.75001587    0.41305636      40.55      <.0001
agonist               -0.13864286    0.58414990      -0.24      0.8127
dose                   0.20909714    0.11697777       1.79      0.0756
dose*dose             -0.01479873    0.00743560      -1.99      0.0481
agonist*dose           0.36138286    0.16543155       2.18      0.0303
agonist*dose*dose     -0.01326095    0.01051552      -1.26      0.2090

Model 2: F(4,175) = 19.60, p < .0001, R2 = .309
Parameter             Estimate       Std Error     t Value    Pr > |t|
Intercept             16.54281349    0.37961019      43.58      <.0001
agonist                0.27576190    0.48376998       0.57      0.5694
dose                   0.30855429    0.08653941       3.57      0.0005
dose*dose             -0.02142921    0.00526662      -4.07      <.0001
agonist*dose           0.16246857    0.04996355       3.25      0.0014

In the general model, the interaction between agonist and the quadratic term (agonist*dose*dose) is not significant (t(174) = -1.26, p = .21). Hence, this term is removed from the model and the regression is rerun. In the reduced model, both the quadratic term for dose (dose*dose) and the interaction between agonist and dose (agonist*dose) are significant, so we accept this model. (Note: One could also drop the variable agonist from the model and rerun it. We hold off doing that in order to explain the meaning of the coefficient for agonist.)
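Numerically, the accepted model's implications can be previewed with the rounded estimates reported below (after variable agonist is dropped: intercept = 16.681, β2 = .296, β3 = -.021, β4 = .187). This is a sketch only; yhat is our name for the prediction function, not anything in the text's software.

```python
# Rounded estimates from the reduced model with the agonist main effect
# dropped (reported in the text): intercept, dose, dose^2, agonist*dose
alpha, b2, b3, b4 = 16.681, 0.296, -0.021, 0.187

def yhat(dose, agonist):
    """Predicted response; agonist is the dummy code (0 = Control, 1 = Agonist)."""
    return alpha + (b2 + b4 * agonist) * dose + b3 * dose ** 2

# The predicted group difference reduces to b4 * dose (shown algebraically below)
diff_at_10 = yhat(10, 1) - yhat(10, 0)

# Dose of maximal response: set the derivative to zero, Dose = -slope / (2 * b3)
max_dose_control = -b2 / (2 * b3)
max_dose_agonist = -(b2 + b4) / (2 * b3)

print(diff_at_10, max_dose_control, max_dose_agonist)
```

The printed values reproduce the quantities worked out in the remainder of this section: a group difference of 1.87 units at dose 10, and maximal responses near doses 7.05 (Control) and 11.5 (Agonist).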
Having arrived at a satisfactory model, we must now perform some algebraic manipulations to derive the meaning of the parameters. The general equation for Model 2 is

Ŷ = α + β1·Agonist + β2·Dose + β3·Dose² + β4·Agonist·Dose.

Because we have coded Controls as 0 and Agonists as 1, we have the following two equations for the Control and the Agonist groups:

Ŷ_Control = α + β2·Dose + β3·Dose², and    (X.X)

Ŷ_Agonist = (α + β1) + (β2 + β4)·Dose + β3·Dose².    (X.X)

Using the logic outlined above for interactions, the intercept for the Control group is α, while the intercept for the Agonist group is (α + β1). Hence, the regression coefficient β1 tests whether the intercepts for the two groups differ. From the preferred model (Model 2) in Figure 1.34, the t test for this parameter is not significant (t(175) = 0.57, p = .57), so we conclude that the intercepts for the two groups are the same. In concrete terms, we conclude that the mean for Controls who have not received the active drug is the same as the mean for the Agonist group who have not received the active drug.

Having explained the meaning of coefficient β1, let us now drop variable “agonist” from the model and rerun it. The overall R2 changes very little; it drops from .309 to .308. Similarly, the parameter estimates hardly change. Keeping the same subscripts as in Equations X.X and X.X, they are: intercept = 16.681, β2 = .296, β3 = -.021, and β4 = .187. We can now substitute 0 for β1 into Equations (X.X) and (X.X), giving

Ŷ_Control = α + β2·Dose + β3·Dose², and

Ŷ_Agonist = α + (β2 + β4)·Dose + β3·Dose².

Now ask the following question: as long as the dose of the drug is not 0 (a situation that we have already described), what is the predicted difference between the Control and the Agonist groups?
To answer this query, subtract the predicted value for the Controls from the predicted value for the Agonists:

Ŷ_Agonist − Ŷ_Control = α + (β2 + β4)·Dose + β3·Dose² − α − β2·Dose − β3·Dose² = β4·Dose.

Thus, the predicted mean difference between the two groups depends on the dose of the drug. The estimate of β4 in the accepted model is .187. If the dose is 5 mg/weight, the predicted difference in response is .187(5) = 0.94 units; if the dose is 10 mg/weight, the predicted difference is .187(10) = 1.87 units; and if the dose is 15 mg/weight, the predicted difference is .187(15) = 2.81 units.

Finally, we can use the equations in Section 1.3.3.2 (maximal/minimal response) to calculate the dose of maximal response for the Control and the Agonist groups. The equation for the Control group is

Ŷ_Control = 16.681 + .296·Dose − .021·Dose²,

so the maximal response occurs when

Dose = −.296 / (2(−.021)) = 7.05 mg/weight.

The equation for the Agonist group is

Ŷ_Agonist = 16.681 + .296·Dose − .021·Dose² + .187·Dose = 16.681 + .483·Dose − .021·Dose²,

so the maximal predicted response for this group occurs when

Dose = −.483 / (2(−.021)) = 11.5 mg/weight.

1.5 Tables

Table 1.1 Schematic of a two by two factorial design. 1-28
Table 1.2 Predicted values of the dependent variable in a two by two factorial design with interaction. 1-29
Table 1.3 Equations for finding the maxima and minima of some polynomial functions. 1-44

1.6 Figures

Figure 1.1 Example of a scatter plot and the regression line (line of best fit). 1-2
Figure 1.2 Output from a simple regression predicting the quantity of receptors in human cortex as a function of age. 1-4
Figure 1.3 Examples of nonlinear relationships. 1-5
Figure 1.4 Example of residuals that are not normally distributed. 1-6
Figure 1.5 Example of normally distributed residuals. 1-7
Figure 1.6 Examples of equal variance of residuals (homoscedasticity) and unequal variance of residuals (heteroscedasticity). 1-7
Figure 1.7 Example of a scatter plot containing an outlier. 1-8
Figure 1.8 Results from a regression with an outlier. 1-9
Figure 1.9 The same scatterplot after removing the outlier. 1-9
Figure 1.10 Results of the regression with the outlier removed. 1-10
Figure 1.11 Multiple regression of receptor number on age and cotinine. 1-12
Figure 1.12 Model Comparisons: A General Model with Four Predictors. 1-20
Figure 1.13 Model Comparisons: Example of the TEST statement in SAS for a single predictor variable. 1-21
Figure 1.14 Model Comparisons: Example of the TEST statement in SAS for two predictor variables. 1-23
Figure 1.15 Mean sexual activity (± 1 standard error) of rats with and without prior sexual experience as a function of dose of a testosterone-like compound. 1-25
Figure 1.16 Regression results for the main effect model. 1-27
Figure 1.17 Regression results for the interaction model. 1-27
Figure 1.18 Regression lines from a main effects model. 1-30
Figure 1.19 Regression lines from an interaction model. 1-31
Figure 1.20 Model comparison algorithm for interactions. 1-33
Figure 1.21 Interactions: Regression coefficients for the general model. 1-34
Figure 1.22 Interactions: Regression coefficients from the first reduced model. 1-35
Figure 1.23 Interactions: Test of the significance of both interaction terms X1X3 and X2X3. 1-35
Figure 1.24 Interactions: Regression coefficients from the second reduced model. 1-36
Figure 1.25 Examples of quadratic curves. 1-38
Figure 1.26 Examples of cubic curves. 1-39
Figure 1.27 Assays of 5-HIAA in CSF as a function of time. 1-40
Figure 1.28 Algorithm for fitting polynomial regression models. 1-41
Figure 1.29 Output from regression of 5-HIAA on the polynomials of time. 1-42
Figure 1.30 Hierarchical solution (Type I SS) for fitting a quartic polynomial model to the 5-HIAA data set. 1-43
Figure 1.31 Mean (± one standard error) relaxation scores of six categories of cigarette use after a brief period of abstinence. Also shown are the best-fitting linear, quadratic, and cubic polynomials. 1-45
Figure 1.32 Hierarchical solution (Type I SS) from fitting a quartic polynomial to the cigarette-abstinence data. 1-46
Figure 1.33 Observed means and means predicted from the best-fitting regression model for dose-response curves from a control group and a group administered a drug agonist. 1-47
Figure 1.34 Results of regressing dependent variable Response on independent variables group (Control vs. Agonist) and dose of drug, including the polynomial effects of drug and the interactions of group and dose of drug. 1-48