The Ridge Regression Estimated Linear Probability Model: A Report of Recent Work

Author: Rajaram Gana
Occasion: Invited Presentation, The Philadelphia Chapter of the American Statistical Association
Meeting Date: March 16, 1999, at The Wharton School, University of Pennsylvania, Philadelphia, PA

Introduction

The linear probability model (LPM) is $Y = Xb + u$, where $Y$ takes only the values 0 and 1 (Goldberger 1964). Such binary-response relationships are usually estimated by logit or probit analysis. The LPM is familiar to econometricians and statisticians and is discussed in many standard texts on econometrics (e.g., Judge, Hill, Griffiths & Lee 1985; Maddala 1992; Gujarati 1995).

The LPM is a heteroscedastic model: if $E(u) = 0$, then each $u_i$ has variance $E(Y_i)(1 - E(Y_i))$. Goldberger suggested estimating $E(Y_i)$ by ordinary least squares (OLS) and then re-estimating the model by weighted least squares (WLS) to achieve homoscedasticity. Goldberger's LPM estimator is consistent (McGillivray 1970), and the problem of obtaining negative estimated residual variances is not an asymptotic one (Amemiya 1977) but a finite-sample one, which hinders empirical work. The classic problem with the LPM, then, is that the least squares estimator of $b$ cannot guarantee that the LPM predictions, which represent conditional probabilities, lie between 0 and 1. This problem has made the LPM, despite its simplicity, unfashionable.

However, over the years several researchers have been attracted to the LPM and have proposed interesting methods to resolve this problem (see Judge et al.; Mullahy 1990). One ad hoc method simply sets LPM predictions greater than 1 to a number close to 1 (such as 0.999) and negative LPM predictions to a number close to 0 (such as 0.001). Another ad hoc method uses the absolute values of the OLS estimated residual variances to do the WLS estimation. Goldfeld and Quandt (1972) proposed using only those observations whose OLS estimates lie between 0 and 1 to do the WLS estimation. In another approach, the sum of squared errors is minimized subject to the constraints $0 \le Xb \le 1$ (see Judge & Takayama 1966). Hensher and Johnson (1981) proposed bounding the weights and assigning negative weights a constant value. More recently, Mullahy proposed a quasi generalized least squares estimator which is a generalization of the Goldfeld-Quandt and Hensher-Johnson estimators.

In the spirit of this tradition, ridge regression (RR) estimation (Hoerl & Kennard 1970; 1990) of the LPM was also proposed (Gana 1995). A detailed account of RR is provided by Vinod and Ullah (1981) and, more recently, by Gruber (1998). The RR estimator of $b$, $b_R$, is given by $(X^TX + kI)^{-1}X^TY$, where $k \ge 0$ is the smallest constant for which all of the resultant LPM predictions, $Xb_R$, lie between 0 and 1. The classical bisection method can be used to calculate such a value of $k$. Next, WLS is used to re-estimate $b$ using the weights $Xb_R(1 - Xb_R)$. If any of the resultant WLS estimated LPM predictions fall outside the 0-1 interval, then RR is used again (this is "weighted" RR) to resolve the problem as before (Gana 1996); let $b_{WR}$ denote this estimate of $b$.
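Although the computations in this report were done in SAS (a macro is referenced at the end of the presentation), the two-stage procedure is easy to sketch. The following Python/NumPy version is a minimal illustration under my own assumptions: the function names, the bisection bracket and tolerance, and the use of strict 0-1 bounds are mine, not part of the original proposal.

```python
import numpy as np

def ridge_coef(X, Y, k):
    """Ridge estimate b_R = (X'X + kI)^{-1} X'Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ Y)

def smallest_valid_k(X, Y, k_hi=1e6, tol=1e-8):
    """Bisection for the smallest k >= 0 whose ridge predictions X b_R
    all lie strictly inside (0, 1). The bracket [0, k_hi] and the
    tolerance are assumptions for this sketch."""
    def ok(k):
        p_hat = X @ ridge_coef(X, Y, k)
        return np.all((p_hat > 0.0) & (p_hat < 1.0))
    if ok(0.0):
        return 0.0
    lo, hi = 0.0, k_hi
    assert ok(hi), "no admissible k in the bracket; widen k_hi"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if ok(mid) else (mid, hi)
    return hi

def rr_lpm(X, Y):
    """Two-stage RR-estimated LPM: ridge first stage, then WLS with
    weights derived from the estimated variances X b_R (1 - X b_R)."""
    k = smallest_valid_k(X, Y)
    b_r = ridge_coef(X, Y, k)
    v = (X @ b_r) * (1.0 - X @ b_r)   # estimated residual variances
    W = np.diag(1.0 / v)              # WLS weights = inverse variances
    b_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
    return k, b_r, b_wls
```

The bisection presumes, as the text does, that increasing $k$ eventually pulls all predictions inside the unit interval; if no $k$ in the bracket works, the sketch simply reports that.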
The conventional use of RR is to resolve the problem of multicollinearity in the usual linear model (i.e., where $Y$ is a continuous regressand). Furthermore, Brook and Moore (1980) have shown that the least squares estimated coefficient vector is, on average, much too long. Hence, some shrinkage of the least squares coefficient vector may, in general, be desirable. Obenchain (1977) has shown that the RR estimator yields the same t-statistics and F-ratios as does the classical least squares estimator. Saccucci (1985) showed that the RR estimator is also robust to the effects of outliers under the assumption that the usual linear model has a multicollinearity problem. Frank and Friedman (1993) have shown that RR employs optimal linear shrinkage of the coefficient vector as well.

It is interesting to note that when Saccucci's thesis is applied to the case of the LPM, there is an indication that the RR estimated LPM will be robust to the effects of outliers as well (Gana 1996; when my 1995 paper was completed, I was not aware of Saccucci's work). This is interesting because RR estimation of the LPM does not necessarily require invoking the assumption of multicollinearity to justify its application. Furthermore, this may be useful in applied work with 0-1 dummy regressands because logit models, for example, are sensitive to outliers (see Pregibon 1981). Since Saccucci's work is unpublished, it is reviewed in this report as an addition to the brief history, presented here, of the flow of ideas that have influenced me and, thus, allowed me to lay the foundations for the doctoral work of John Monyak, done during the years 1996-1998 at the University of Delaware (UD). Finally, with hindsight, it is easy to wonder why using RR to estimate the LPM was not thought of in the 1970s, soon after the invention of RR.

On Michael Saccucci's Thesis

Little work has been done on the impact of outliers on RR (Saccucci 1985; Walker & Birch 1988; Chalton & Troskie 1992). Saccucci considered the case of variance inflated outliers (VIOs) in the usual linear model where $Y$ is a continuous regressand. A VIO is an observation whose residual variance is $\sigma^2 w$, where $w > 1$ is a constant. Saccucci assumed that, given $n$ observations, $m$ of them are VIOs, each with residual variance $\sigma^2 w$, while the remaining $n - m$ observations each have residual variance $\sigma^2$. Let $X_m$ denote the sub-matrix of $X$ containing the VIOs.

Saccucci showed that the mean square error (MSE) of the RR estimated $b$ under this assumption of VIOs equals the MSE of the RR estimated $b$ under the assumption of no outliers plus $\sigma^2(w - 1)$ times the sum of the diagonal elements of the matrix

$(X^TX + kI)^{-1} X_m^T X_m (X^TX + kI)^{-1}$

where, as usual, $I$ denotes the identity matrix, $T$ denotes the transpose operator, and $k \ge 0$ is a constant. Saccucci showed that this matrix (which carries the additional MSE for the RR estimator) decreases monotonically with $k$. He also showed that there always exists a $k > 0$ for which the MSE of the RR estimated $b$ under his assumption of VIOs, $\mathrm{MSE}(b_R \mid \mathrm{VIOs})$, say, is less than the MSE of the least squares estimated $b$, $b_{LS}$, under his assumption of VIOs, $\mathrm{MSE}(b_{LS} \mid \mathrm{VIOs})$, say. In symbols:

$\exists\, k > 0 : \mathrm{MSE}(b_R \mid \mathrm{VIOs}) < \mathrm{MSE}(b_{LS} \mid \mathrm{VIOs})$

This can be viewed as a generalization of the original existence theorem of Hoerl and Kennard (1970).
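The monotone decrease of the additional-MSE term is easy to verify numerically. The sketch below, on arbitrary made-up data (the design matrix, the choice of VIO rows, and the grid of $k$ values are all my own assumptions), evaluates the trace of the matrix above over increasing $k$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))   # made-up design matrix
Xm = X[:3]                     # suppose the first 3 rows are the VIOs

def additional_mse_trace(X, Xm, k):
    """Trace of (X'X + kI)^{-1} Xm' Xm (X'X + kI)^{-1}, the factor
    scaling the extra MSE contributed by the VIOs."""
    A = np.linalg.inv(X.T @ X + k * np.eye(X.shape[1]))
    return np.trace(A @ Xm.T @ Xm @ A)

ks = [0.0, 0.1, 1.0, 10.0, 100.0]
traces = [additional_mse_trace(X, Xm, k) for k in ks]
assert all(a > b for a, b in zip(traces, traces[1:]))  # decreases with k
print(dict(zip(ks, traces)))
```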
Saccucci used simulation to show that his result holds (with probability greater than 0.5) for the values of $k$ selected using the algorithms proposed by Lawless and Wang (1976), Hoerl, Kennard and Baldwin (1975), Hoerl and Kennard (1976), and Dempster, Schatzoff and Wermuth (1977). Saccucci allowed $w$ to take the integer values 1 through 10, and $m$ the values 1, 2 and 3. Saccucci used six known $X$ matrices for his simulation. Five of these were obtained from real data, and the other was artificially created to illustrate the effects of multicollinearity on $b_R$. The dimensions (number of regressors $\times$ number of observations) of these $X$ matrices are $3 \times 10$, $4 \times 13$, $7 \times 20$, $6 \times 16$, $10 \times 36$ and $19 \times 36$. Saccucci generated the true $b_i$ values using the formula

$b_i = R\, u_i / \sqrt{\textstyle\sum_j u_j^2}$

where $i$ indexes the coefficients, $u_i$ is a random uniform number on the interval $[-0.5, +0.5]$, and $R$ is the pre-selected length of the coefficient vector (i.e., $R^2 = b^Tb$). Saccucci used the following values of $R^2$: 10.0, 15.8, 25.1, 39.8, 63.1, 100, 158, 251, 398, 631, 1000, 1580, 2510, 3980, 6310, and 10000. The $b_i$ values generated are pairwise uncorrelated and create an approximate, but not exact, uniform distribution of the vector $b$ over the hypersphere of radius $R$ centered at the origin (see Lawless and Wang). For each combination of factors over the six data sets, he generated 500 regression simulations.

At this stage of the analysis it is easy to see, as Saccucci did, that embedded in the additional MSE term is a diagnostic to flag outliers. The additional MSE for an incremental increase in the variance of observation $i$ is given by $\sigma^2\, x_i^T (X^TX)^{-2} x_i$, where the vector $x_i$ denotes row $i$ of $X$. Saccucci's diagnostic is $x_i^T (X^TX)^{-2} x_i$. He indicated, by example, that his diagnostic (which is related to Cook's (1977) distance) can flag an outlier that remains undetected by Cook's distance. (Both the coefficient generation scheme and this diagnostic are sketched in code at the end of this section.) Saccucci ended his dissertation by suggesting that his results could be extended to the case of $m$ VIOs with distinct variances (i.e., variances of the form $\sigma^2 w_i$, where $w_i > 1$), and that the RR estimator would exhibit similar MSE properties for other types of outliers.
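Here is a minimal sketch of the two computations just described, again in Python/NumPy with made-up data; the function names are mine:

```python
import numpy as np

def generate_b(n_coef, R, rng):
    """Saccucci's scheme: b_i = R * u_i / sqrt(sum_j u_j^2), with u_i
    uniform on [-0.5, +0.5]; yields a coefficient vector of length R."""
    u = rng.uniform(-0.5, 0.5, size=n_coef)
    return R * u / np.sqrt(np.sum(u**2))

def vio_diagnostic(X):
    """Saccucci's outlier diagnostic x_i' (X'X)^{-2} x_i for each row."""
    A2 = np.linalg.matrix_power(np.linalg.inv(X.T @ X), 2)
    return np.einsum("ij,jk,ik->i", X, A2, X)

rng = np.random.default_rng(1)
b = generate_b(5, R=np.sqrt(100.0), rng=rng)  # R^2 = 100, one of the values used
print(b @ b)                                  # ~100, the pre-selected squared length
X = rng.normal(size=(16, 5))                  # made-up design matrix
print(vio_diagnostic(X))                      # large values flag candidate VIOs
```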
The Next Step

Let us now look at Saccucci's work in a new way and, thereby, connect it to the idea of the RR estimated LPM. Each residual, $u_i$, under the LPM has variance $x_i b(1 - x_i b)$. Hence, the LPM can be viewed as a linear model with observations having distinct variances, in Saccucci's sense. Therefore, it is not unreasonable to conjecture the following existence property:

$\exists\, k > 0 : \mathrm{MSE}(b_R \mid \mathrm{LPM}) < \mathrm{MSE}(b_{LS} \mid \mathrm{LPM})$

where $\mathrm{MSE}(\,\cdot \mid \mathrm{LPM})$ denotes the MSE of "$\cdot$" under the LPM assumption. Furthermore, if we can show that this result also holds for the proposed value of $k$ (i.e., the smallest value of $k > 0$ for which all of the LPM predictions are between 0 and 1), then we have a stronger case for considering the RR estimated LPM to have some measure of usefulness. Now, Theobald (1974) had shown that the RR estimator can also improve prediction properties. Thus, we are led to conjecture the following property:

$\exists\, k > 0 : \mathrm{MSE}(Xb_R \mid \mathrm{LPM}) < \mathrm{MSE}(Xb_{LS} \mid \mathrm{LPM})$

Again, this result will be stronger if we can also show that it holds for the proposed value of $k$.

Investigating these conjectures requires the use of standard results in matrix theory (see, for example, the text of Rao & Toutenburg 1995) and the use of simulation in the spirit of Hoerl, Schuenemeyer, and Hoerl (1984). Interestingly, even though the proofs use some non-elementary ideas, they are, like the idea of the LPM itself, simple in style and in execution. Most of the statistical computing involved can be done using "proc iml" in the SAS System (SAS Institute, Inc.). The last page of this presentation includes a SAS macro that I wrote to compute $b_R$ without having to invoke "proc iml". In early 1996 (more than 30 years after Goldberger's LPM proposition!), my ideas outlined in this section were imparted to John Monyak, who was then a doctoral student at UD in search of a thesis topic (Professor John Schuenemeyer of UD, whom I have known for many years, introduced me to him). He found these ideas interesting and decided to take up the task of demonstrating their worth.

On John Monyak's Thesis

Using matrix algebra, Monyak showed that the conjectures outlined above are true. In the spirit of McGillivray, he showed that the RR estimated LPM is consistent as well. His simulation results indicate that the RR estimated LPM is superior to the least squares estimated LPM both in terms of coefficient and prediction MSEs. Monyak's simulation results indicate that the best improvement in coefficient and prediction MSEs is achieved by $b_R$ (closely followed by $b_{WR}$). This is interesting because it indicates that solving for $b_R$ resolves the twin problems of heteroscedasticity and getting the predictions to lie in the range 0-1.

For his simulation, Monyak used two known $X$ matrices. One is the classic data set of Spector and Mazzeo (1980), which has 3 regressors and 32 observations. The other is a modification of an $X$ matrix used by Hoerl, Schuenemeyer and Hoerl, and has 5 regressors and 36 observations. To simulate the large-sample case, the sample sizes of the two data sets were increased (as in Hoerl, Schuenemeyer & Hoerl 1984) to 200 without changing the correlation structure of $X^TX$. The condition number of $X^TX$ took the values 1 (no multicollinearity), 1,000 ("medium" multicollinearity), and 10,000 ("high" multicollinearity). Monyak also used "low" and "high" levels of heteroscedasticity. For the "low" level he generated $b$ vectors such that $0.1 \le Xb \le 0.9$. For the "high" level he generated $b$ vectors in the usual manner, so that $0 \le Xb \le 1$. Due to computing constraints in generating the $b$ vectors, Monyak limited the number of regressors to five. For each scenario, 1,000 $b$ vectors were generated. Since there are 24 combinations ($2 \times 2 \times 3 \times 2$) of these factors, he generated 24,000 $b$ vectors in all.

While estimating $b_{WR}$ using $b_R$, values of $Xb_R$ that are close to 0 or 1 will tend to produce large weights. This could lead to large values of $k$ when estimating $b_{WR}$. So, such a point was deleted before solving for $b_{WR}$ (Gana 1996). Monyak noticed this phenomenon in his simulation runs. He noted that the problem was more pronounced when the observation was a high leverage point (RR changes the leverage of points relative to least squares). Instead of deleting such a point, Monyak set an upper bound of 100 on the weights (in the spirit of Hensher and Johnson). He then calculated $b_{WR}$.
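A sketch of that capping step follows. The bound of 100 is from the text; the function name, the variable names, and the NumPy framing are my own assumptions:

```python
import numpy as np

def capped_wls(X, Y, b_r, max_weight=100.0):
    """WLS re-estimation with the weights 1 / (p(1-p)) capped at
    max_weight, so predictions near 0 or 1 cannot dominate the fit."""
    p = X @ b_r                                  # first-stage ridge predictions
    w = np.minimum(1.0 / (p * (1.0 - p)), max_weight)
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
```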
Some of his results for the proposed value of $k$ are stated next. $\mathrm{MSE}(b_R \mid \mathrm{LPM})$ and $\mathrm{MSE}(b_{WR} \mid \mathrm{LPM})$ are 77.5% (standard error = 3.6%) and 80.2% (standard error = 3.9%) of $\mathrm{MSE}(b_{LS} \mid \mathrm{LPM})$, respectively. $\mathrm{MSE}(Xb_R \mid \mathrm{LPM})$ and $\mathrm{MSE}(Xb_{WR} \mid \mathrm{LPM})$ are 87.5% (standard error = 1.8%) and 87.8% (standard error = 2.0%) of $\mathrm{MSE}(Xb_{LS} \mid \mathrm{LPM})$, respectively. $\mathrm{MSE}(b_R \mid \mathrm{LPM}) < \mathrm{MSE}(b_{LS} \mid \mathrm{LPM})$ 60.5% of the time, and $\mathrm{MSE}(Xb_R \mid \mathrm{LPM}) < \mathrm{MSE}(Xb_{LS} \mid \mathrm{LPM})$ 66.0% of the time. Hence, the probability that the RR estimated LPM improves upon least squares is greater than 0.5.

When the RR estimated LPM is compared with some of the other proposed LPM estimators (the ad hoc, Goldfeld & Quandt, and Mullahy estimators), Monyak's simulation results indicate that its superiority, in terms of MSEs, continues to hold. The Goldfeld-Quandt and Mullahy methods produce coefficient and prediction vector MSEs that are each about 118% of the corresponding least squares MSE values. The first ad hoc method (rounding predictions to 0.999 or 0.001) produces coefficient and prediction MSEs of 98.3% and 98.6% of the corresponding least squares MSE values, respectively. Least squares produces predictions outside [0, 1] 42% of the time.

LPM versus Logit and Probit Models: some empirical results

It is natural to ask how the RR estimated LPM compares with logit and probit models in applied work. Three empirical studies (Trusheim & Gana 1994; Gana & Trusheim 1995; Gana & Rossi 1998) have addressed this question. Two of these studies are discussed next.

Two data sets were modeled in the study of Gana and Trusheim. The first consisted of data on 296 freshmen who were offered admission to UD for Fall 1991. Of the 296, 128 students enrolled at the University. The aim was to model the college selection process, which is a complex one. The following regressors were used: SAT score, high school GPA, parental income, ethnicity (White, Black, or other), number of colleges applied to, type of high school attended (public, independent non-religious, independent Catholic, or other), and a UD "attitude" score (a composite score developed from students' ratings of some 20 college characteristics as "very important", "somewhat important", or "not important"). Assuming that estimated probabilities greater than 0.43 (128/296) predict an enrolling student (although such cutoff probabilities are arbitrary), we found that there are virtually no differences between logit, probit, and the RR estimated LPM.

The second data set consisted of data on 3,215 first-time UD freshmen in Fall 1993 and their retention to the sophomore year. Of these freshmen, 2,834, or about 88%, returned to UD for their second year. The aim was to model the probability of retention to the second year. Clearly, the distribution of the dependent variable here is much more skewed than in the first data set.
The following regressors were used: academic probation status (a 0-1 dummy variable), GPA, ethnicity, gender, and deficit points (a number between 0 and 30 that students begin to accumulate if their GPA falls below 2.0). Assuming that estimated probabilities greater than or equal to 0.88 predict retention to the second year, we found that the RR estimated LPM produced errors (i.e., false positives and false negatives) of 17.7%, while the logit and probit models produced errors of 15.4% and 16.6%, respectively. However, when specific retention probabilities are compared, all three models show close agreement on average. We also noted that the academic probation status variable was not significant (and had a positive coefficient) in the logit and probit models, but was significant in the RR estimated LPM (and had a negative coefficient).

In the study of Gana and Rossi, a sample of 95 mortgage loans was selected. Of the 95 loans, 30 were delinquent. The aim was to model the probability of delinquency. The following regressors were used: FICO score (a credit score given by Fair, Isaac & Company, San Rafael, CA), the ratio of the loan amount to the value of the property, the borrower's income, and whether or not the property is located in California (a 0-1 dummy variable). Both the RR estimated LPM and the weighted RR (WRR) estimated LPM were compared to the logit model. Assuming that predicted probabilities greater than 30/95 predict a delinquent loan, we found that the RR estimated LPM, the WRR estimated LPM, and the logit model produced errors of 24%, 20%, and 18%, respectively. Next, assuming that predicted probabilities greater than 0.5 predict a delinquent loan, we found that the RR estimated LPM, the WRR estimated LPM, and the logit model produced errors of 15%, 16%, and 19%, respectively. Finally, the Kolmogorov-Smirnov (KS) statistic (Smirnov 1939) was computed to measure the degree of separation between the distributions of predicted probabilities for the delinquent and non-delinquent loans in the sample. All three models yielded a KS statistic value of 0.60 (this statistic is often used in credit scoring).

Monyak also empirically compared the RR and WRR estimated LPMs with logit and probit models using the classic data set of Spector and Mazzeo. He found that the RR and WRR estimated LPMs, the logit model, and the probit model produced PRESS statistic (Allen 1971) values of 5.3, 5.0, 5.8, and 5.8, respectively, and errors of 25%, 19%, 19%, and 19%, respectively, when the cutoff probability is set equal to the sample mean of the regressand. These error proportions become 19% each, for all models, when the cutoff probability is set equal to 0.5. Finally, it should be mentioned that Monyak's simulation results show that the RR and WRR estimated LPMs produce errors of 31% and 32%, respectively, when the cutoff probability is set to 0.5, and errors of 36% each when the cutoff probability is set to the sample mean of the regressand. The trends in these results are consistent with the empirical results considered above. In contrast, the OLS estimated LPM produces errors of 30% when the cutoff probability is 0.5. These results indicate that the RR estimated LPM is competitive with logit and probit models.
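Both evaluation measures used above are simple to compute. Here is a minimal NumPy sketch with made-up predicted probabilities and labels; the function names and the data are mine, not from the studies:

```python
import numpy as np

def error_rate(p_hat, y, cutoff):
    """Fraction of misclassifications (false positives plus false
    negatives) when probabilities above the cutoff predict y = 1."""
    return np.mean((p_hat > cutoff) != (y == 1))

def ks_statistic(p_pos, p_neg):
    """Two-sample KS statistic: the maximum gap between the empirical
    CDFs of the predicted probabilities for the two groups."""
    grid = np.sort(np.concatenate([p_pos, p_neg]))
    cdf_pos = np.searchsorted(np.sort(p_pos), grid, side="right") / len(p_pos)
    cdf_neg = np.searchsorted(np.sort(p_neg), grid, side="right") / len(p_neg)
    return np.max(np.abs(cdf_pos - cdf_neg))

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=95)                       # made-up 0-1 labels
p_hat = np.clip(0.3 * y + rng.uniform(0, 0.7, size=95), 0.0, 1.0)
print(error_rate(p_hat, y, cutoff=y.mean()))          # cutoff = sample mean
print(ks_statistic(p_hat[y == 1], p_hat[y == 0]))     # separation of the groups
```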
References

Allen, D.M. (1971): The prediction sum of squares as a criterion for selecting predictor variables, Technical Report No. 23, Department of Statistics, University of Kentucky.
Amemiya, T. (1977): Some theorems on the linear probability model, International Economic Review.
Brook, R.J. and T. Moore (1980): On the expected length of the least squares coefficient vector, Journal of Econometrics.
Chalton, D.O. and C.G. Troskie (1992): Identification of outlying and influential data with biased estimation: a simulation study, Communications in Statistics - Simulation.
Cook, R.D. (1977): Detection of influential observations in linear regression, Technometrics.
Dempster, A.P., M. Schatzoff, and N. Wermuth (1977): A simulation study of alternatives to ordinary least squares, Journal of the American Statistical Association.
Frank, I.E. and J.H. Friedman (1993): A statistical view of some chemometrics regression tools, Technometrics.
Gana, R. (1995): Ridge regression estimation of the linear probability model, Journal of Applied Statistics, England.
Gana, R. (1996): The effect of influential data on the ridge regression estimated linear probability model, Proceedings of the Northeast Decision Sciences Institute annual meeting, St. Croix, USA.
Gana, R. and C.V. Rossi (1998): An empirical comparison of linear probability and logit models of mortgage default, presented at the international conference on "Credit Scoring and Control V", University of Edinburgh, UK, September 1997.
Gana, R. and D.W. Trusheim (1995): An empirical comparison of linear probability, logit, and probit models of enrollment, presented at the Association for Institutional Research national meeting, Boston, Massachusetts.
Goldberger, A.S. (1964): Econometric Theory, John Wiley, New York.
Goldfeld, S.M. and R.E. Quandt (1972): Nonlinear Methods in Econometrics, North-Holland, Amsterdam.
Gruber, M.H.J. (1998): Improving Efficiency By Shrinkage: The James-Stein and Ridge Regression Estimators, Marcel Dekker, New York.
Gujarati, D.N. (1995): Basic Econometrics, McGraw-Hill, New York.
Hensher, D.A. and L.W. Johnson (1981): Applied Discrete Choice Modeling, Croom Helm, London.
Hoerl, A.E. and R.W. Kennard (1970): Ridge regression: biased estimation for nonorthogonal problems, Technometrics.
Hoerl, A.E. and R.W. Kennard (1976): Ridge regression: iterative estimation of the biasing parameter, Communications in Statistics.
Hoerl, A.E. and R.W. Kennard (1990): Ridge regression: degrees of freedom in the analysis of variance, Communications in Statistics.
Hoerl, A.E., R.W. Kennard and K.F. Baldwin (1975): Ridge regression: some simulations, Communications in Statistics.
Hoerl, R.W., J.H. Schuenemeyer, and A.E. Hoerl (1984): A simulation of biased estimation and subset regression techniques, Technometrics.
Judge, G. and T. Takayama (1966): Inequality restrictions in regression analysis, Journal of the American Statistical Association.
Judge, G., C. Hill, W. Griffiths, and T. Lee (1985): The Theory and Practice of Econometrics, John Wiley.
Lawless, J.F. and P. Wang (1976): A simulation study of ridge and other regression estimators, Communications in Statistics.
Maddala, G.S. (1992): Introduction to Econometrics, Macmillan, New York.
McGillivray, R.G. (1970): Estimating the linear probability function, Econometrica.
Monyak, J.T. (1998): Mean squared error properties of the ridge regression estimated linear probability model, Ph.D. dissertation, University of Delaware, Newark, Delaware.
Mullahy, J. (1990): Weighted least squares estimation of the linear probability model revisited, Economics Letters.
Obenchain, R.L. (1977): Classical F-tests and confidence regions for ridge regression, Technometrics.
Pregibon, D. (1981): Logistic regression diagnostics, Annals of Statistics.
Rao, C.R. and H. Toutenburg (1995): Linear Models: Least Squares and Alternatives, Springer-Verlag.
Saccucci, M.S. (1985): The effect of variance-inflated outliers on least squares and ridge regression, unpublished Ph.D. dissertation (supervised by Arthur E. Hoerl), University of Delaware, Newark, Delaware.
SAS Institute, Inc.: The SAS System, Cary, North Carolina.
Smirnov, N.V. (1939): On the estimation of the discrepancy between empirical curves of distribution for two independent samples, Bulletin of the University of Moscow, Russia.
Spector, L.C. and M. Mazzeo (1980): Probit analysis and economic education, Journal of Economic Education.
Theobald, C.M. (1974): Generalizations of mean square error applied to ridge regression, Journal of the Royal Statistical Society, Series B.
Trusheim, D.W. and R. Gana (1994): How much can financial aid increase the probability of freshman enrollment?, presented at the annual meeting of the Association for Institutional Research, New Orleans.
Vinod, H.D. and A. Ullah (1981): Recent Advances in Regression Methods, Marcel Dekker.
Walker, E. and J.B. Birch (1988): Influence measures in ridge regression, Technometrics.