THE EFFECT OF CENTERING ON THE CONDITION NUMBER OF POLYNOMIAL REGRESSION MODELS

Robert B. Bendel, Washington State University

ABSTRACT

It has been recognized that centering reduces the condition number of the incidence matrix in ordinary linear regression models. In polynomial models, centering can occur first, as in

    y = γ₀ + γ₁(x − x̄) + γ₂(x − x̄)² + ... + βₚ(x − x̄)ᵖ + ε,

or last, with terms such as (x² − x̄²) in place of (x − x̄)². This paper determines condition numbers using simulated incidence matrices and the COLLIN option in SAS PROC REG. The results empirically verify that centering first dramatically reduces the condition number, whereas centering last provides only a small improvement over no centering at all. The empirical evidence supports the theoretical discussion in Bradley and Srivastava (1979), Marquardt (1980) and Snee (1983).

INTRODUCTION

Although centering in ordinary linear regression has been a subject of considerable debate recently (Hocking (1984), Snee (1983), Belsley (1984b)), it is generally recognized that centering reduces the condition number of the incidence matrix X in the (ordinary) linear regression model. As pointed out by Bradley and Srivastava (1979), Marquardt (1980) and Snee (1983), centering in polynomial regression models is even more critical, since the "intercorrelation" of the variables (x, x², x³, etc.) becomes higher as the degree of the polynomial increases. The purpose of this paper is to evaluate the effect of centering on the condition number in polynomial regression models by using simulated incidence matrices and the COLLIN option in SAS PROC REG.

The COLLIN option in SAS PROC REG is used to determine the condition number, CN, of the (entire) incidence matrix, X, including the constant term. The CN is the ratio of the largest singular value of X to the smallest singular value of X. It is also the square root of the ratio of the largest eigenvalue of X'X to the smallest eigenvalue of X'X and hence represents a good measure of multicollinearity and of the ill-conditioning of the linear system of normal equations X'Xβ = X'y. As discussed in Belsley, Kuh and Welsch (1980), the condition number is determined from the scaled X matrix, where X is scaled so that each column has unit length. This scaling ensures that an incidence matrix with orthogonal columns has a condition number of one. The condition number is related to the variance inflation factor, or VIF: if X has been centered and scaled, then the condition number of X'X is greater than or equal to the maximum VIF. A further discussion of condition numbers, VIF and multicollinearity can be found in Wilson (1983); Berk (1977); Belsley, Kuh and Welsch (1980); and Bendel (1985).

PROCEDURE

Three types of (curve fitting) polynomial models are considered:

    Uncentered:      y = β₀ + β₁x + β₂x² + ... + βₚxᵖ + ε

    Centered-first:  y = γ₀ + γ₁(x − x̄) + γ₂(x − x̄)² + ... + βₚ(x − x̄)ᵖ + ε

    Centered-last:   y = α + β₁(x − x̄) + β₂(x² − x̄²) + ... + βₚ(xᵖ − x̄ᵖ) + ε

The random error, ε, is assumed to be independent and identically distributed with variance σ². The symbols for the constant terms and the regression coefficients are chosen to reflect the fact that βₚ is the same for all three models and that the constant terms are not the same for all three models, with the estimate of α equal to ȳ but the estimate of γ₀ not equal to ȳ. The centered-first model has been advocated by Bradley and Srivastava (1979), as well as Marquardt and Snee (1975). Centering is accomplished by using the PROC MEANS procedure either before (centering-first) or after (centering-last) construction of the X matrix.
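The condition number computed by the COLLIN option can be made concrete outside SAS. The following is a minimal NumPy sketch (not the paper's SAS code; the seed, sample draw and variable names are this illustration's own): scale each column of X, intercept included, to unit length, then take the ratio of the largest to smallest singular value, which equals the square root of the eigenvalue ratio of X'X.

```python
# Minimal sketch of the COLLIN-style condition number in NumPy
# (illustrative only; not the paper's SAS PROC REG computation).
import numpy as np

rng = np.random.default_rng(0)
n, p, mu, cv = 40, 4, 10.0, 0.2
x = rng.normal(mu, mu * cv, size=n)          # x ~ N(mu, (mu*CV)^2)

def condition_number(X):
    """CN of X after scaling every column to unit length."""
    Xs = X / np.linalg.norm(X, axis=0)       # unit-length columns
    s = np.linalg.svd(Xs, compute_uv=False)
    return s.max() / s.min()

# Uncentered incidence matrix: columns [1, x, x^2, ..., x^p]
X_unc = np.vander(x, p + 1, increasing=True)
cn = condition_number(X_unc)

# Same number from the eigenvalues of X'X: CN = sqrt(lmax / lmin)
Xs = X_unc / np.linalg.norm(X_unc, axis=0)
lam = np.linalg.eigvalsh(Xs.T @ Xs)
assert np.isclose(cn, np.sqrt(lam.max() / lam.min()))
print(f"uncentered 4th-degree CN = {cn:,.0f}")
```

The unit-length scaling is what makes a matrix with orthogonal columns come out with a condition number of exactly one.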
The X matrix is n×(p+1). It is noteworthy that the condition number for centering-last can also be obtained by using the COLLINOINT option on the uncentered matrix X.

The vector x was chosen to be normal with mean μ and standard deviation μ × CV, with CV representing the coefficient of variation expressed as a proportion. The three parameters of the study were n, μ and the CV, with values of n = 20, 40; μ = 1, 10, 10,000; and CV = .1, .2, .3. It was anticipated that there would be no effect due to n or μ, since these parameters would not affect the correlation structure of X'X. However, the CV was expected to affect the condition number, since the intercorrelations among x, x², x³, etc. would depend upon the standard deviation of x.

RESULTS AND DISCUSSION

The results of the simulation are presented in Tables 1, 2 and 3. Table 1 illustrates the type of intercorrelations that occur among the variables for the three types of polynomial regression models considered. Note that the correlation between (x − x̄)² and (x − x̄)³ is much lower than the correlation between x² and x³. Bradley and Srivastava (1979) showed more generally that the correlation between (x − x̄)ᵃ and (x − x̄)ᵇ is "smaller" than the correlation between xᵃ and xᵇ: it may be smaller if a+b is even; it is much smaller if a+b is odd; and it is zero if a+b is odd and the values of x are symmetrically chosen about their mean x̄, as in experimental design models. Note also in Table 1 that the correlation between x − x̄ and x² − x̄² is of course the same as the correlation between x and x². Hence, the sample results in Table 1 support the premise that centering-first reduces the intercorrelations among polynomial terms, with pairs like x, x² reduced more than pairs like x, x³. Centering-last, by contrast, does not change any of the correlations between xᵃ and xᵇ (a, b ≥ 1) but only reduces the collinearity with the constant term.

Table 2 presents condition numbers of the scaled X matrix, as well as the minimum eigenvalues of (the scaled) X'X, using n = 40, μ = 10 and CV = .2. The results for the uncentered matrix indicate that the minimum eigenvalues decrease rapidly as the degree of the polynomial increases. These results were as expected, since the collinearity among the polynomial terms x, x², x³, etc. should increase as the degree of the polynomial increases. When the degree of the polynomial reaches six, an error message for both the X'X inversion and the eigenvalue decomposition was printed by SAS PROC REG under the COLLIN option. In the table, error messages occurred whenever λ_min < 10⁻¹². The condition numbers associated with these eigenvalues less than 10⁻¹² are correctly noted as lower bounds, since the SOLVIT procedure in PROC MATRIX obtained the same eigenvalues. (The SOLVIT procedure uses more precision in its calculations than PROC REG.)

For the centered-last results, note that the minimum eigenvalues are slightly larger, and the condition numbers slightly smaller, than those without any centering. This shows that centering-last improves the condition number of X only slightly, by removing the collinearity with the constant term. For the centered-first results, the minimum eigenvalues decrease rather slowly, with acceptable condition numbers for polynomial models as high as the eighth degree. It is clear, then, that centering-first reduces the condition numbers dramatically for the situations considered here. Similar conclusions are reached for the other parametric configurations as well.

Table 3 presents condition numbers for a fourth degree polynomial model using all values of the parameters. (The pattern of the results is similar for other degree polynomials as well.) Note that the pattern of the condition numbers does not appear to be heavily influenced by n or by μ. There was, however, a strong effect of the CV for the uncentered and centered-last condition numbers. As indicated earlier, a smaller CV was expected to increase the magnitude of the correlations among x, x², x³, etc. and, hence, to increase the condition numbers also.

It is of interest to comment on the practical implications of these results. This will be accomplished by addressing three questions:

1. Should we center?
2. What is affected by centering?
3. What happens if we do not center?

The author believes that centering should be used for the curve fitting polynomial models considered here, as well as in response surface models. Centering affects the values and significance of all terms except the highest power, but does not affect the important quantities such as R², the predicted values, the residuals and s², the estimated variance of the random error ε. (For a further discussion of the effect of linear transformations see Griepentrog et al. (1982).) If we do not center, then, we have not lost anything if curve fitting and prediction is our objective. Centering-first, however, generally reduces the intercorrelations among x, x², x³, etc.; reduces the maximum VIF; and, as we noted, reduces the condition number of the incidence matrix X. Hence, centering-first protects against extreme multicollinearity, especially with higher order polynomial models. As noted by Marquardt (1980), centering also aids in the interpretation of the regression coefficients. For example, the sign of the estimated γ₁ would reflect the slope of a second degree polynomial model in the region of the data. Reducing the collinearity may also be useful when selecting the proper degree polynomial to represent the data.

It should be noted that although the intercorrelations among some of the x, x², x³, etc. terms are known to be reduced, the arguments presented here do not ensure that the condition number of the centered-first matrix X would necessarily be reduced.
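The simulation pattern behind Table 2 can also be re-created outside SAS. The sketch below (in NumPy, not the original SAS/PROC REG run; the random draw, and hence the exact condition numbers, are this illustration's own) builds the uncentered, centered-last and centered-first incidence matrices for degrees one through eight with the paper's parameter values n = 40, μ = 10, CV = .2, and prints the condition number of each unit-length-scaled matrix.

```python
# Illustrative re-creation of the Table 2 comparison (not the paper's SAS run).
import numpy as np

rng = np.random.default_rng(1)
n, mu, cv = 40, 10.0, 0.2
x = rng.normal(mu, mu * cv, size=n)

def cn(X):
    """Condition number after scaling each column of X to unit length."""
    Xs = X / np.linalg.norm(X, axis=0)
    s = np.linalg.svd(Xs, compute_uv=False)
    return s.max() / s.min()

ones = np.ones((n, 1))
for p in range(1, 9):
    powers = x[:, None] ** np.arange(1, p + 1)        # x, x^2, ..., x^p
    X_unc = np.hstack([ones, powers])                 # uncentered
    X_last = np.hstack([ones, powers - powers.mean(axis=0)])  # centered-last
    cpow = (x - x.mean())[:, None] ** np.arange(1, p + 1)
    X_first = np.hstack([ones, cpow])                 # centered-first
    print(f"degree {p}: uncentered {cn(X_unc):>12,.0f}  "
          f"last {cn(X_last):>12,.0f}  first {cn(X_first):>8,.1f}")
```

Any such run reproduces the qualitative pattern reported in the paper: the uncentered and centered-last condition numbers explode with the degree, while the centered-first condition numbers grow slowly.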
Indeed, Belsley (1984b) notes that this is still "an open question" and cites a reference in which an example is presented where "centering worsens conditioning." Along this same vein, however, it would be possible to both center-first and center-last, although the results presented here indicate that for most practical problems centering-last would not reduce the condition number very much, since centering-first will usually reduce it considerably.

TABLE 1
Data Structure and Correlation Matrix Among the First Three Powers of x for
Uncentered, Centered-Last and Centered-First Polynomial Models;
Illustrative Data for n = 40, μ = 10, CV = .2

Data Structure (First Five Observations)

Obs          x       x²       x³    x − x̄  (x − x̄)²  (x − x̄)³   x² − x̄²   x³ − x̄³
1        10.96    120.0     1315     .75      .56       .41       12.0       136
2        11.51    132.5     1525    1.30     1.69      2.20       24.5       346
3         9.58     91.7      878    −.63      .40      −.26      −16.3      −301
4         6.99     48.8      341   −3.22    10.39    −33.51      −59.2      −838
5         9.82     96.5      947    −.39      .15      −.06      −11.5      −232
Mean      10.2    108.0   1179.1       0     3.77       .46          0         0

Correlation Matrix (n = 40)

            x       x²       x³   (x − x̄)²  (x − x̄)³
x           1    .9941    .9754     .1690     .8630
x²                   1    .9935     .0623     .8668
x³                            1    −.0518     .8683
(x − x̄)²                               1     −.0082
(x − x̄)³                                         1

Since x − x̄, x² − x̄² and x³ − x̄³ are linear in x, x² and x³, their correlations duplicate those of x, x² and x³ (for example, the correlation between x − x̄ and x² − x̄² is .9941, the same as that between x and x²).

TABLE 2
Minimum Eigenvalues and Condition Numbers for Uncentered, Centered-Last and
Centered-First Polynomial Regression Models (n = 40, μ = 10, CV = .2)

Degree      Uncentered               Centered-Last            Centered-First
            λ_min        CN          λ_min        CN          λ_min     CN
First       1.8E-02      11          1.0          1           1.0       1
Second      3.0E-04      99          6.5E-03      17          3.6E-01   2
Third       4.0E-06      980         4.6E-05      255         1.3E-01   4
Fourth      5.0E-08      9,690       3.7E-07      3,278       3.8E-02   8
Fifth       5.4E-10      102,403     2.7E-09      42,265      8.4E-03   18
Sixth       <1.0E-12     >2,440,201  1.9E-11      551,934     1.6E-03   44
Seventh     <1.0E-12     >2,704,445  <1.0E-12     >2,596,985  2.1E-04   134
Eighth      <1.0E-12     >2,850,581  <1.0E-12     >2,765,330  4.5E-05   310

TABLE 3
Condition Numbers for Uncentered, Centered-Last and Centered-First
Fourth Degree Polynomial Models

n       μ        CV    Uncentered   Centered-Last   Centered-First
20      1        .1       266,025          46,049             11.3
20      1        .2        46,115          13,363             13.3
20      1        .3         1,334             709             12.7
20      10       .1       253,359          41,715             11.2
20      10       .2         2,094             947             13.3
20      10       .3           892             439             16.0
20      10,000   .1       754,024         108,407             15.3
20      10,000   .2         9,192           3,659             11.5
20      10,000   .3         3,145           1,504             12.6
40      1        .1       290,664          47,456             10.2
40      1        .2         5,744           2,048              9.2
40      1        .3           457             259             11.2
40      10       .1       146,694          26,597             10.0
40      10       .2         9,690           3,278              7.9
40      10       .3         1,218             713             14.1
40      10,000   .1       123,425          22,265              9.2
40      10,000   .2        12,475           4,318              9.2
40      10,000   .3           735             374              8.5

REFERENCES AND SELECTED BIBLIOGRAPHY

Belsley, D. A. (1984a). "Eigenvector Weaknesses and Other Topics for Assessing Conditioning Diagnostics," Technometrics, Letters to the Editor, 26, 297-299.

Belsley, D. A. (1984b). "Demeaning Conditioning Diagnostics Through Centering," The American Statistician, 38, 73-93.

Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley, New York.

Bendel, R. B. (1985). "Multicollinearity: Past, Present and Future Considerations." Presented at the 1985 WNAR Biometric Meetings, San Luis Obispo, CA.

Berk, K. (1977). "Tolerance and Condition in Regression Computations," Journal of the American Statistical Association, 72, 863-866.

Bradley, R. A. and Srivastava, S. S. (1979). "Correlation in Polynomial Regression," The American Statistician, 33, 11-14.

Griepentrog, G. L., Ryan, J. M. and Smith, D. (1982). "Linear Transformations of Polynomial Regression Models," The American Statistician, 36, 171-174.

Hocking, R. R. (1983). "Developments in Linear Regression Methodology: 1959-1982" (with discussion), Technometrics, 25, 219-249.

Hocking, R. R. (1984). Response to "Eigenvector Weaknesses and Other Topics for Assessing Conditioning Diagnostics," Technometrics, 26, 299-301.

Marquardt, D. W. (1980). "You Should Standardize the Predictor Variables in Your Regression Models" (discussion of "A Critique of Some Ridge Regression Methods" by G. Smith and F. Campbell), Journal of the American Statistical Association, 75, 87-91.

Marquardt, D. W. and Snee, R. D. (1975). "Ridge Regression in Practice," The American Statistician, 29, 3-19.

SAS Institute, Inc. (1982). SAS User's Guide: Statistics, Cary, NC.

Silvey, S. D. (1969). "Multicollinearity and Imprecise Estimation," Journal of the Royal Statistical Society, Series B, 31, 539-552.

Snee, R. D. (1983). Discussion of "Developments in Linear Regression Methodology: 1959-1982" by R. R. Hocking, Technometrics, 25, 230-236.

Wilson, W. J. (1983). "Treating Multicollinearity with SAS," SUGI 9 Conference Proceedings.

SAS is a registered trademark of SAS Institute Inc., Cary, NC.