THE EFFECT OF CENTERING ON THE CONDITION NUMBER OF POLYNOMIAL REGRESSION MODELS

Robert B. Bendel, Washington State University
ABSTRACT

It has been recognized that centering reduces the condition number of the incidence matrix in ordinary linear regression models. In polynomial models, centering can occur first, (x - x̄)², or last, (x² - x̄²). This paper determines condition numbers using simulated incidence matrices and the COLLIN option in SAS PROC REG. The results empirically verify that centering first dramatically reduces the condition number, whereas centering last provides only a small improvement over no centering at all. The empirical evidence supports the theoretical discussion in Bradley and Srivastava (1979), Marquardt (1980) and Snee (1983).
INTRODUCTION

Although centering in ordinary linear regression has been a subject of considerable debate recently (Hocking (1984), Snee (1983), Belsley (1984b)), it is generally recognized that centering reduces the condition number of the incidence matrix X in the (ordinary) linear regression model.

As pointed out by Bradley and Srivastava (1979), Marquardt (1980) and Snee (1983), centering in polynomial regression models is even more critical, since the "intercorrelation" of the variables (x, x², x³, etc.) becomes higher as the degree of the polynomial increases. The purpose of this paper is to evaluate the effect of centering on the condition number in polynomial regression models by using simulated incidence matrices and the COLLIN option in SAS PROC REG.

The COLLIN option in SAS PROC REG is used to determine the condition number, CN, of the (entire) incidence matrix X, including the constant term. The CN is the ratio of the largest singular value of X to the smallest singular value of X. It is also the square root of the ratio of the largest eigenvalue of X'X to the smallest eigenvalue of X'X and, hence, represents a good measure of multicollinearity and of the ill-conditioning of the linear system of normal equations, X'Xβ̂ = X'y.
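To make the two equivalent definitions concrete, the following minimal NumPy sketch (an editorial illustration, not part of the original SAS analysis; the simulated regressor, seed and quadratic degree are arbitrary choices) computes the CN of a small incidence matrix both ways:

```python
import numpy as np

# Minimal sketch: the CN of an incidence matrix X computed two
# equivalent ways, as described above. Data are illustrative only.
rng = np.random.default_rng(0)
x = rng.normal(10.0, 2.0, size=40)               # simulated regressor
X = np.column_stack([np.ones_like(x), x, x**2])  # constant, x, x^2

s = np.linalg.svd(X, compute_uv=False)           # singular values of X
cn_svd = s.max() / s.min()                       # largest / smallest

lam = np.linalg.eigvalsh(X.T @ X)                # eigenvalues of X'X
cn_eig = np.sqrt(lam.max() / lam.min())          # sqrt of the ratio

print(cn_svd, cn_eig)                            # agree to rounding error
```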
As discussed in Belsley, Kuh and Welsch (1980), the condition number is determined from the scaled X matrix, where X is scaled so that each column has unit length. This scaling ensures that an incidence matrix with orthogonal columns has a condition number of one. The condition number is related to the variance inflation factor, or VIF. If X has been centered and scaled, then the condition number of X'X is greater than or equal to the maximum VIF. A further discussion of condition numbers, VIF and multicollinearity can be found in Wilson (1983); Berk (1977); Belsley, Kuh and Welsch (1980); and Bendel (1985).
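The scaling convention and the VIF comparison can be checked numerically; a sketch follows (again an editorial illustration with arbitrary simulated data). For centered, unit-length columns, X'X is the correlation matrix, and each VIF is a diagonal element of its inverse:

```python
import numpy as np

# Sketch of the scaling convention and the VIF bound stated above.
rng = np.random.default_rng(1)
x = rng.normal(10.0, 2.0, size=40)
X = np.column_stack([x, x**2, x**3])

Xc = X - X.mean(axis=0)                   # center each column
Xs = Xc / np.linalg.norm(Xc, axis=0)      # scale columns to unit length

R = Xs.T @ Xs                             # correlation matrix of x, x^2, x^3
vif = np.diag(np.linalg.inv(R))           # variance inflation factors
lam = np.linalg.eigvalsh(R)

print(vif.max(), lam.max() / lam.min())   # max VIF <= CN of X'X
```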
PROCEDURE

Three types of (curve fitting) polynomial models are considered:

Uncentered:
y = β₀ + β₁x + β₂x² + ... + βₚxᵖ + ε

Centered-first:
y = γ₀ + γ₁(x - x̄) + γ₂(x - x̄)² + ... + βₚ(x - x̄)ᵖ + ε

Centered-last:
y = α + β₁(x - x̄) + β₂(x² - x̄²) + ... + βₚ(xᵖ - x̄ᵖ) + ε

The random error, ε, is assumed to be independent and identically distributed with variance σ². The symbols for the constant terms and the regression coefficients are chosen to reflect the fact that βₚ is the same for all three models and that the constant terms are not the same for all three models, with α̂ = ȳ but γ̂₀ ≠ ȳ. The centered-first model has been advocated by Bradley and Srivastava (1979), as well as Marquardt and Snee (1975).

Centering is accomplished by using the PROC MEANS procedure either before (centering-first) or after (centering-last) construction of the X matrix, and X is n × (p+1). It is noteworthy that the condition number for centering-last can also be obtained by using the COLLINOINT option on the uncentered matrix X.
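The construction of the three incidence matrices can be summarized in a short NumPy sketch (an editorial stand-in for the PROC MEANS/PROC REG steps; the configuration shown is one of those described below, n = 40, μ = 10, CV = .2 with a fourth degree fit):

```python
import numpy as np

def scaled_cn(X):
    """CN of X after scaling each column to unit length (COLLIN-style)."""
    Xs = X / np.linalg.norm(X, axis=0)
    s = np.linalg.svd(Xs, compute_uv=False)
    return s.max() / s.min()

# Build the three incidence matrices for one simulated sample;
# NumPy stands in for PROC MEANS here. Each X is n x (p+1).
rng = np.random.default_rng(2)
n, p, mu, cv = 40, 4, 10.0, 0.2
x = rng.normal(mu, mu * cv, size=n)
ones = np.ones(n)

powers = np.column_stack([x**j for j in range(1, p + 1)])      # x, ..., x^p
X_unc = np.column_stack([ones, powers])                        # uncentered
X_cf = np.column_stack([ones] +
                       [(x - x.mean())**j for j in range(1, p + 1)])
X_cl = np.column_stack([ones, powers - powers.mean(axis=0)])   # centered-last

for name, X in (("uncentered", X_unc), ("centered-first", X_cf),
                ("centered-last", X_cl)):
    print(name, scaled_cn(X))
```

The centered-first value should come out orders of magnitude smaller, mirroring Table 2; and, consistent with the COLLINOINT remark above, the centered-last value coincides with what COLLINOINT reports for the uncentered matrix, since the constant column of the centered-last matrix is orthogonal to its centered columns.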
The vector x was chosen to be normal with mean μ and standard deviation given by σ = μ × CV, with CV representing the coefficient of variation expressed as a proportion. The three parameters of the study were n, μ and the CV, with values of n = 20, 40; μ = 1, 10, 10,000; and CV = .1, .2, .3. It was anticipated that there would be no effect due to n or μ, since these parameters would not affect the correlation structure of X'X. However, the CV was expected to affect the condition number, since the intercorrelations among x, x², x³, etc. depend upon the standard deviation of x.
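The anticipated CV effect is easy to see directly; the following NumPy sketch (an editorial illustration; the large n is used only so the sample correlations sit close to their population values) computes the correlation between x and x² for each CV in the study:

```python
import numpy as np

# Sketch of the anticipated CV effect: the correlation between x and
# x^2 when x is normal with mean mu and standard deviation mu * CV.
rng = np.random.default_rng(3)
mu, n = 10.0, 100_000
for cv in (0.1, 0.2, 0.3):
    x = rng.normal(mu, mu * cv, size=n)
    print(cv, np.corrcoef(x, x**2)[0, 1])   # correlation falls as CV grows
```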
RESULTS AND DISCUSSION

The results of the simulation are presented in Tables 1, 2 and 3. Table 1 illustrates the type of intercorrelations that occur among the variables for the three types of polynomial regression models considered. Note that the correlation between (x - x̄)² and (x - x̄)³ is much lower than the correlation between x² and x³. Bradley and Srivastava (1979) showed more generally that the correlation between (x - x̄)ᵃ and (x - x̄)ᵇ is "smaller" than the correlation between xᵃ and xᵇ: it may be smaller if a + b is even; it is much smaller if a + b is odd; and it is zero if a + b is odd and the values of x are symmetrically chosen about their mean x̄, as in experimental design models. Note also in Table 1 that the correlation between x - x̄ and x² - x̄² is of course the same as the correlation between x and x². Hence, the results in Table 1 support the premise that centering-first reduces the intercorrelations among polynomial terms, with terms like x, x² reduced more than terms like x, x³. Centering-last does not change any of the correlations between xᵃ and xᵇ (a, b ≥ 1) but only reduces the collinearity with the constant term.
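The Table 1 pattern is straightforward to reproduce; the sketch below (an editorial illustration using one simulated sample of the Table 1 configuration, with an arbitrary seed) contrasts an odd and an even a + b:

```python
import numpy as np

# Sketch of the Table 1 pattern: centering collapses the correlation
# between x^a and x^b when a + b is odd, and reduces it only modestly
# when a + b is even.
rng = np.random.default_rng(4)
x = rng.normal(10.0, 2.0, size=40)
xc = x - x.mean()

def r(u, v):
    return np.corrcoef(u, v)[0, 1]

print(r(x**2, x**3), r(xc**2, xc**3))   # a + b = 5 (odd): falls to near zero
print(r(x, x**3), r(xc, xc**3))         # a + b = 4 (even): falls only modestly
```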
Table 2 presents condition numbers of the scaled X matrix as well as the minimum eigenvalues of (the scaled) X'X, using n = 40, μ = 10 and CV = .2. The results for the uncentered matrix indicate that the minimum eigenvalues decrease rapidly as the degree of the polynomial increases. These results were as expected, since the collinearity among the polynomial terms x, x², x³, etc. should increase as the degree of the polynomial increases. When the degree of the polynomial reaches six, an error message for both the X'X inversion and the eigenvalue decomposition was printed by SAS PROC REG using the COLLIN option. In the table, error messages occurred whenever λₘᵢₙ < 10⁻¹². The condition numbers associated with these eigenvalues less than 10⁻¹² are correctly noted as lower bounds, since the SOLVIT procedure in PROC MATRIX obtained the same eigenvalues. (The SOLVIT procedure uses more precision in its calculations than PROC REG.)

For the centered-last results, note that the minimum eigenvalues are slightly larger and the condition numbers slightly smaller than those without any centering. This shows that centering-last improves the condition number of X only slightly, by removing the collinearity with the constant term. For the centered-first results, the minimum eigenvalues decrease rather slowly, with acceptable condition numbers for polynomial models as high as the eighth degree. It is clear, then, that centering-first reduces the condition numbers dramatically for the situations considered here. Similar conclusions are reached for other parametric configurations as well.

Table 3 presents condition numbers for a fourth degree polynomial model using all values of the parameters. (The pattern of the results is similar for other degree polynomials as well.) Note that the pattern of the condition numbers does not appear to be heavily influenced by n or by μ. There was, however, a strong effect of the CV on the uncentered and centered-last condition numbers. As indicated earlier, a smaller CV was expected to increase the magnitude of the correlations among x, x², x³, etc. and, hence, to increase the condition numbers as well.
It is of interest to comment on the practical implications of these results. This will be accomplished by addressing three questions:

1. Should we center?
2. What is affected by centering?
3. What happens if we do not center?
The author believes that centering should be used for the curve fitting polynomial models considered here as well as in response surface models. Centering affects the values and significance of all terms except the highest power, but does not affect the important quantities such as R², the predicted values, the residuals, and s², the estimated variance of the random error ε. (For a further discussion of the effect of linear transformations see Griepentrog, et al. 1982.) If we do not center, then, we have not lost anything if curve fitting and prediction is our objective. Centering-first, however, generally reduces the intercorrelations among x, x², x³, etc.; reduces the maximum VIF; and, as we noted, reduces the condition number of the incidence matrix X. Hence, centering-first protects against extreme multicollinearity, especially with higher order polynomial models. As noted by Marquardt (1980), centering also aids in the interpretation of the regression coefficients. For example, the sign of γ̂₁ in the region of the data would reflect the slope of a second degree polynomial model. Reducing the collinearity may also be useful when selecting the proper degree polynomial to represent the data.
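The invariance claims above are simple to verify; the following NumPy sketch (an editorial illustration with arbitrary simulated data and coefficients) fits the same quadratic uncentered and centered-first:

```python
import numpy as np

# Sketch of the invariance claim: uncentered and centered-first fits of
# the same quadratic give identical fitted values (hence identical
# residuals and R^2); only the lower-order coefficients change, while
# the highest-power coefficient is the same in both parameterizations.
rng = np.random.default_rng(5)
x = rng.normal(10.0, 2.0, size=40)
y = 1.0 + 0.5 * x - 0.2 * x**2 + rng.normal(size=40)

X_unc = np.column_stack([np.ones_like(x), x, x**2])
xc = x - x.mean()
X_cf = np.column_stack([np.ones_like(x), xc, xc**2])

b_unc = np.linalg.lstsq(X_unc, y, rcond=None)[0]
b_cf = np.linalg.lstsq(X_cf, y, rcond=None)[0]

print(np.allclose(X_unc @ b_unc, X_cf @ b_cf))   # True: same fitted values
print(b_unc, b_cf)                               # coefficients differ,
                                                 # except the x^2 term
```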
It should be noted that although the intercorrelations among some of the x, x², x³, etc. terms are known to be reduced, the arguments presented here do not ensure that the condition number of the centered-first matrix X would necessarily be reduced. Indeed, Belsley (1984b) notes that this is still "an open question" and cites a reference in which an example is presented where "centering worsens conditioning." Along this same vein, however, it would be possible to center-first and center-last, although the results presented here indicate that for most practical problems centering-last would not reduce the condition number very much, since centering-first will usually reduce it considerably.
TABLE 1
Data Structure and Correlation Matrix Among the First Three Powers of x
for Uncentered, Centered-Last and Centered-First Polynomial Models;
Illustrating Data for n = 40, μ = 10, CV = .2

Data Structure - First Five Observations

Observation      x      x²      x³     x-x̄   (x-x̄)²   (x-x̄)³   x²-x̄²   x³-x̄³
     1        10.96   120.0    1315    .75     .56      .41      12.0     136
     2        11.51   132.5    1525   1.30    1.69     2.20      24.5     346
     3         9.58    91.7     878   -.63     .40     -.26     -16.3    -301
     4         6.99    48.8     341  -3.22   10.39   -33.51     -59.2    -838
     5         9.82    96.5     947   -.39     .15     -.06     -11.5    -232
Mean (n=40)    10.2   108.0  1179.1     0     3.77      .46        0       0

Correlation Matrix (n = 40)

            x      x²      x³     x-x̄   (x-x̄)²  (x-x̄)³   x²-x̄²   x³-x̄³
x           1    .9935   .9754     1    -.0518   .8683   .9935   .9754
x²        .9935     1    .9941   .9935   .0623   .8668     1     .9941
x³        .9754   .9941     1    .9754   .1690   .8630   .9941     1
x-x̄         1    .9935   .9754     1    -.0518   .8683   .9935   .9754
(x-x̄)²   -.0518   .0623   .1690  -.0518     1    -.0082   .0623   .1690
(x-x̄)³    .8683   .8668   .8630   .8683  -.0082     1     .8668   .8630
x²-x̄²     .9935     1    .9941   .9935   .0623   .8668     1     .9941
x³-x̄³     .9754   .9941     1    .9754   .1690   .8630   .9941     1
TABLE 2
Minimum Eigenvalues and Condition Numbers for Uncentered, Centered-Last
and Centered-First Polynomial Regression Models (n = 40, μ = 10, CV = .2)

Degree of        Uncentered              Centered-Last           Centered-First
Polynomial    λₘᵢₙ          CN        λₘᵢₙ           CN         λₘᵢₙ       CN
First        1.8E-02         11       1.0              1        1.0          1
Second       3.0E-04         99       6.5E-03         17        3.6E-01      2
Third        4.0E-06        980       4.6E-05        255        1.3E-01      4
Fourth       5.0E-08      9,690       3.7E-07      3,278        3.8E-02      8
Fifth        5.4E-10    102,403       2.7E-09     42,265        8.4E-03     18
Sixth       <1.0E-12 >2,440,201       1.9E-11    551,934        1.6E-03     44
Seventh     <1.0E-12 >2,704,445      <1.0E-12 >2,596,985        2.1E-04    134
Eighth      <1.0E-12 >2,850,581      <1.0E-12 >2,765,330        4.5E-05    310
TABLE 3
Condition Numbers for Uncentered, Centered-Last and Centered-First
Fourth Degree Polynomial Models

  n       μ      CV    Uncentered   Centered-Last   Centered-First
 20       1      .1      266,025         46,049          11.3
 20       1      .2       46,115         13,363          13.3
 20       1      .3        1,334            709          12.7
 20      10      .1      253,359         41,715          11.2
 20      10      .2        2,094            947          13.3
 20      10      .3          892            439          16.0
 20  10,000      .1      754,024        108,407          15.3
 20  10,000      .2        9,192          3,659          11.5
 20  10,000      .3        3,145          1,504          12.6
 40       1      .1      290,664         47,456          10.2
 40       1      .2        5,744          2,048           9.2
 40       1      .3          457            259          11.2
 40      10      .1      146,694         26,597          10.0
 40      10      .2        9,690          3,278           7.9
 40      10      .3        1,218            713          14.1
 40  10,000      .1      123,425         22,265           9.2
 40  10,000      .2       12,475          4,318           9.2
 40  10,000      .3          735            374           8.5
REFERENCES AND SELECTED BIBLIOGRAPHY

Belsley, D. A. (1984a). "Eigenvector Weaknesses and Other Topics for Assessing Conditioning Diagnostics," Technometrics, Letters to the Editor, 26, 297-299.

Belsley, D. A. (1984b). "Demeaning Conditioning Diagnostics Through Centering," The American Statistician, 38, 73-93.

Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Observations and Sources of Collinearity, Wiley, New York.

Bendel, R. B. (1985). "Multicollinearity: Past, Present and Future Considerations." Presented at the 1985 WNAR Biometric Meetings, San Luis Obispo, CA.

Berk, K. (1977). "Tolerance and Condition in Regression Computations," Journal of the American Statistical Association, 72, 863-866.

Bradley, R. A. and Srivastava, S. S. (1979). "Correlation in Polynomial Regression," The American Statistician, 33, 11-14.

Griepentrog, G. L., Ryan, J. M. and Smith, D. (1982). "Linear Transformations of Polynomial Regression Models," The American Statistician, 36, 171-174.

Hocking, R. R. (1983). "Developments in Linear Regression Methodology: 1959-1982" (with discussion), Technometrics, 25, 219-249.

Hocking, R. R. (1984). Response to "Eigenvector Weaknesses and Other Topics for Assessing Conditioning Diagnostics," Technometrics, 26, 299-301.

Marquardt, D. W. (1980). "You Should Standardize the Predictor Variables in Your Regression Models" (discussion of "A Critique of Some Ridge Regression Methods" by G. Smith and F. Campbell), Journal of the American Statistical Association, 75, 87-91.

Marquardt, D. W. and Snee, R. D. (1975). "Ridge Regression in Practice," The American Statistician, 29, 3-19.

SAS Institute, Inc. (1982). SAS User's Guide: Statistics, Cary, NC.

Silvey, S. D. (1969). "Multicollinearity and Imprecise Estimation," Journal of the Royal Statistical Society, Series B, 31, 539-552.

Snee, R. D. (1983). Discussion of "Developments in Linear Regression Methodology: 1959-1982" by R. R. Hocking, Technometrics, 25, 230-236.

Wilson, W. J. (1983). "Treating Multicollinearity with SAS," SUGI 9 Conference Proceedings.

SAS is a registered trademark of SAS Institute Inc., Cary, NC.