SELECTING SELECTION METHODS

ERHARD RESCHENHOFER
Department of Statistics and Decision Support Systems, University of Vienna, Universitätsstr. 5, A-1010 Vienna, Austria

Abstract. This paper proposes a new approach for model selection and applies it to a classical time series modeling problem. In contrast to conventional model selection methods like AIC and BIC, whose penalty terms typically depend only on the number of model parameters, the proposed method also takes the values of the model parameters and the sets of candidate models into account. A brief sketch of a Bayesian further development of this method is given within the framework of the linear regression model.

Key words and phrases: AIC; BIC; Model selection criteria

1. Introduction

The expected (predictive) log likelihood E log f(z | θ̂(y)) is a widely used measure for the evaluation of a model f(· | θ), θ∈Θ. Here y and z are independent samples from an n-dimensional density f(· | θ) and θ̂(y) is the ML estimator of the unknown (k+1)-dimensional parameter vector θ based on the sample y. The maximum log likelihood log f(y | θ̂(y)) of a given sample y is a biased estimator of the expected log likelihood because one and the same sample is used both for the estimation of the parameter vector and for the evaluation of the goodness of fit. In the classical linear regression model y ~ N(Xβ, σ²I) the bias is given by

k + 1 + (k² + 3k + 2)/(n − k − 2)

(see Sugiura, 1978) and therefore depends only on the number of parameters but not on the values of θ=(βᵀ,σ²)ᵀ. The asymptotic bias is k+1. Ignoring the fact that these two bias terms are meaningful only for correctly specified models, Akaike (1973) proposed to select from m competing models f_j(· | θ_j), θ_j∈Θ_j, j=1,…,m, the model which minimizes the criterion

−2 log f_j(y | θ̂_j(y)) + 2(k_j+1)   (AIC)

(for the case of possibly misspecified models see Reschenhofer, 1999).
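As a small numerical aside (the function names below are mine, not the paper's), the exact bias can be compared with its asymptotic value to see how large the finite-sample correction is:

```python
def exact_bias(k: int, n: int) -> float:
    """Exact bias of the maximum log likelihood in the Gaussian linear
    regression model (Sugiura, 1978): k+1 + (k^2+3k+2)/(n-k-2)."""
    return k + 1 + (k**2 + 3 * k + 2) / (n - k - 2)

def asymptotic_bias(k: int) -> float:
    """Asymptotic bias, i.e. half the AIC penalty term."""
    return k + 1

# The finite-sample correction is substantial for small n and vanishes as n grows
print(exact_bias(3, 20))    # noticeably larger than the asymptotic value 4
print(exact_bias(3, 2000))  # close to 4
```

For k=3 and n=20 the correction term (k²+3k+2)/(n−k−2) alone exceeds 1.3, which is why small-sample corrections of AIC (discussed in Section 4) matter.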
Since Kempthorne (1984) has shown that all model selection methods are admissible, AIC cannot be uniformly better than any other method (for a more general setting see Kabaila, 1997). Nevertheless, Shibata (1980, 1981) was able to prove some kind of asymptotic optimality of AIC. Kabaila (2002) put Shibata's results into perspective by pointing out that this optimality holds only pointwise and may therefore be misleading. In contrast to AIC, the criterion

−2 log f_j(y | θ̂_j(y)) + (k_j+1) log n   (BIC)

(see Schwarz, 1978) is consistent in the sense that the probability of selecting the true model approaches 1 provided that it is among the candidate models. Yang (2005) derived the minimax-rate optimality of AIC and showed that no model selection criterion can share the main strengths of AIC and BIC simultaneously; in particular, no consistent criterion can be minimax-rate optimal. In addition, Leeb and Pötscher (2004) showed that any consistent selection criterion has maximal risk that diverges to infinity whenever the loss function is unbounded. Of course, evidence obtained by simulation studies can also be dubious and misleading. For example, claims of "near optimality" of some selection methods over a broad range of data generating mechanisms (see, e.g., George and Foster, 2000) may, after slight changes in the design of the simulation studies, turn out to be overoptimistic (see Reschenhofer, 2004). Thus there is little to be said against simply using the method we like best. Perhaps the most appealing and straightforward approach to automatic model selection is via correction of the data snooping bias. But the devil is in the details. Firstly, in more sophisticated models like time series models the bias term depends not only on the model dimension but also on the model parameters. Secondly, practitioners rarely get up in the morning and start pondering whether, for example, an MA(2) model or an ARMA(3,2) model would be more appropriate for their data.
It is more likely that they end up with two competing models because they apply AIC and BIC simultaneously. Now if BIC selects an MA(2) model and AIC selects an ARMA(3,2) model, the bias terms for these models should depend not only on the respective dimensions and model parameters but also on the sets of candidate models from which the two models have been selected. In Section 3 we describe a fully operational selection method that takes all "degrees of freedom" consumed by the whole model building procedure into account and apply it to real data. For comparison, the results obtained by applying conventional selection methods like AIC and BIC to the same data are presented in Section 2. Section 4 presents a Bayesian extension of the selection method described in Section 3.

2. Modeling the US GDP

To illustrate the difficulty of automatic modeling we consider the seasonally adjusted real U.S. Gross Domestic Product (GDP) from 1947.1 to 2005.1. The data were downloaded from FRED® (Federal Reserve Economic Data), the database of the Federal Reserve Bank of St. Louis. First we use simple autoregressive (AR) models for the description of the first differences of the log GDP. According to AIC the AR(4) model is best, while BIC selects the AR(1) model. Figure 1 allows us to interpret the different models and to form our own opinion about their goodness. It shows the periodogram of the first differences together with the spectral densities of the estimated AR(p) models, p=0,…,8. In contrast to the AR(1) model, the AR(4) model implies a decrease of the spectral density near frequency zero. The higher the spectral density near frequency zero, the more persistent the shocks. When we use moving average (MA) models instead of AR models, both AIC and BIC select the MA(2) model. Remarkably, AIC now also decides that there is no decline near frequency zero (see Figure 2).
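The order-selection step can be illustrated with a short script. This is my own sketch, not the paper's code: since the FRED series is not bundled here, it fits AR(p) models, p=0,…,8, by conditional least squares to a simulated stand-in series and picks the orders minimizing n log σ̂² + 2(k+1) (AIC) and n log σ̂² + (k+1) log n (BIC), where k counts the estimated regression coefficients (intercept plus p lags):

```python
import numpy as np

def ar_sigma2(x, p, pmax):
    """Conditional least-squares AR(p) fit with intercept; every order
    conditions on the first pmax observations so all fits share one sample."""
    n = len(x)
    y = x[pmax:]
    cols = [np.ones(n - pmax)] + [x[pmax - j: n - j] for j in range(1, p + 1)]
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid) / len(y)

# Simulated stand-in for the differenced log GDP (an AR(1) with mild persistence)
rng = np.random.default_rng(1)
e = 0.01 * rng.normal(size=240)
x = np.empty(240)
x[0] = e[0]
for t in range(1, 240):
    x[t] = 0.35 * x[t - 1] + e[t]

pmax, n_eff = 8, 240 - 8
# k+1 = p+2 parameters: intercept, p AR coefficients, and the innovation variance
aic = [n_eff * np.log(ar_sigma2(x, p, pmax)) + 2 * (p + 2) for p in range(pmax + 1)]
bic = [n_eff * np.log(ar_sigma2(x, p, pmax)) + (p + 2) * np.log(n_eff) for p in range(pmax + 1)]
best_aic, best_bic = int(np.argmin(aic)), int(np.argmin(bic))
```

On the actual GDP series the paper reports AR(4) for AIC and AR(1) for BIC; on simulated data the selected orders depend on the draw, but BIC's heavier penalty never selects a larger order than AIC.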
Obviously, the results obtained with automatic model selection methods are not so objective after all. Since they depend strongly on the building blocks (AR or MA) used in the model building procedure, we would need yet another selection method for the selection of the model class. Not surprisingly, AIC comes up with a third spectral shape when we use a third model class. According to AIC the best ARMA model is the ARMA(3,2) model, which implies two approximately constant regimes. The possible jump discontinuity between the two regimes is cobbled together by a sharp increase immediately followed by a sharp decrease. In contrast, the more conservative BIC adheres stubbornly to the MA(2) model. This remains true when we make the fractional differencing parameter d available, which allows the parsimonious description of a sharp decline or incline near frequency zero. BIC decides that the inclusion of this parameter is not advisable. In contrast, AIC decides in favor of a sharp decline at frequency zero. According to AIC the best fractionally integrated ARMA (ARFIMA) model is the ARFIMA(3,d,3) model.

Figure 1: Periodogram of differenced log GDP with spectral densities of AR(p) models, p=0,1,…,8.
Figure 2: Periodogram of differenced log GDP with spectral densities of MA(q) models, q=0,1,…,8.
Figure 3: Periodogram of differenced log GDP with spectral densities of ARMA(p,q) models, p=1,2,3 (rows), q=1,2,3 (columns).
Figure 4: Periodogram of differenced log GDP with spectral densities of ARFIMA(p,d,q) models, p=1,2,3 (rows), q=1,2,3 (columns).

3. Choosing the selection method

Suppose we have to decide between the ARMA(3,2) model and the MA(2) model for the description of the differenced log GDP. To find an acceptable balance between the goodness of fit (bias) and the complexity of the model (variance), we want to take into account that these models are not given a priori but have been found as the best ARMA models according to AIC and BIC, respectively.
So actually we are going to decide between two model selection methods rather than between two models. We proceed as follows. First we fit an ARMA model of order (p₀,q₀)=(3,2) to the differenced log GDP, y, and compute the residuals, û. Then we generate r=100 synthetic time series y(i), i=1,…,r, using the estimated model parameters θ̂(y; (p₀,q₀)) and different random permutations û(i), i=1,…,r, of the residuals. Finally we fit all ARMA models up to order (3,3) to each synthetic series y(i) and select the optimal orders (p_i(AIC),q_i(AIC)) and (p_i(BIC),q_i(BIC)) according to AIC and BIC, respectively. The quantities of interest are the discrepancies between the original goodness-of-fit measures and those obtained from the synthetic series. The differences

D_AIC = log f(y | θ̂(y; (p₀,q₀))) − (1/r) Σ_{i=1}^{r} log f(y(i) | θ̂(y(i); (p_i(AIC),q_i(AIC))))

and

D_BIC = log f(y | θ̂(y; (p₀,q₀))) − (1/r) Σ_{i=1}^{r} log f(y(i) | θ̂(y(i); (p_i(BIC),q_i(BIC))))

may serve as penalties for the models found with AIC and BIC, respectively. In our case D_AIC − D_BIC = 2.503 − (−1.842) = 4.345, hence the difference between the penalties is greater than the difference between the numbers of parameters. The fact that D_BIC is negative indicates that BIC is too conservative in this situation. Since the difference between the log likelihood terms of the two models is much greater than D_AIC − D_BIC, i.e.,

log f(y | θ̂(y; (3,2))) − log f(y | θ̂(y; (0,2))) = 7.202 > D_AIC − D_BIC = 4.345,

the ARMA(3,2) model comes off much better than the MA(2) model when we use the former model for the computation of the penalties. Alternatively, we may use the model selected by BIC for the computation of the penalties. Starting with (p₀,q₀)=(0,2) and running through all steps again we get D_AIC − D_BIC = 3.329 − 0.650 = 2.679. While both penalties are now greater than before, which is due to an increased risk of overfitting, their difference is smaller than before; it is even smaller than the difference between the numbers of parameters.
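The resampling procedure above can be sketched in a few lines of code. This is my own illustration, not the paper's implementation: to stay self-contained it uses AR models fitted by conditional least squares instead of ARMA maximum likelihood, a simulated series in place of the GDP data, and Gaussian conditional log likelihoods, so the numbers will not reproduce D_AIC = 2.503 or D_BIC = −1.842. The steps (fit the chosen model, permute its residuals, rebuild synthetic series, re-run the order selection, average the synthetic log likelihoods) follow the text.

```python
import numpy as np

def fit_ar(x, p):
    """Conditional least-squares AR(p) fit with intercept.
    Returns (coefficients, residuals, Gaussian conditional log likelihood)."""
    n = len(x)
    cols = [np.ones(n - p)] + [x[p - j: n - j] for j in range(1, p + 1)]
    X = np.column_stack(cols)
    y = x[p:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = float(resid @ resid) / len(y)
    loglik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1.0)
    return beta, resid, loglik

def simulate_ar(beta, resid, x, rng):
    """Synthetic series: the estimated AR recursion driven by a random
    permutation of the residuals, started at the observed initial values."""
    p = len(beta) - 1
    u = rng.permutation(resid)
    y = np.zeros(len(x))
    y[:p] = x[:p]
    for t in range(p, len(x)):
        y[t] = beta[0] + sum(beta[1 + j] * y[t - 1 - j] for j in range(p)) + u[t - p]
    return y

def selection_penalties(x, p0, pmax=4, r=50, seed=0):
    """D_AIC and D_BIC for the AR(p0) model: original log likelihood minus the
    mean log likelihood of the orders re-selected on each synthetic series.
    (Conditional likelihoods have slightly different sample sizes across p;
    adequate for a sketch.)"""
    rng = np.random.default_rng(seed)
    beta, resid, ll0 = fit_ar(x, p0)
    ll_aic, ll_bic = [], []
    for _ in range(r):
        y = simulate_ar(beta, resid, x, rng)
        fits = [fit_ar(y, p) for p in range(pmax + 1)]
        p_aic = min(range(pmax + 1), key=lambda p: -2 * fits[p][2] + 2 * (p + 2))
        p_bic = min(range(pmax + 1), key=lambda p: -2 * fits[p][2] + (p + 2) * np.log(len(y)))
        ll_aic.append(fits[p_aic][2])
        ll_bic.append(fits[p_bic][2])
    return ll0 - np.mean(ll_aic), ll0 - np.mean(ll_bic)

# Illustrative input: a simulated AR(1) standing in for the differenced log GDP
rng = np.random.default_rng(2)
e = 0.01 * rng.normal(size=200)
x = np.empty(200)
x[0] = e[0]
for t in range(1, 200):
    x[t] = 0.3 * x[t - 1] + e[t]

D_aic, D_bic = selection_penalties(x, p0=1)
```

Replacing `fit_ar` with full ARMA maximum likelihood (e.g. from a time series library) and `x` with the differenced log GDP would reproduce the procedure of this section exactly.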
So our approach of taking the complexity of the whole modeling procedure into account (and not only the complexity of the selected model) clearly favors the ARMA(3,2) model over the more parsimonious MA(2) model.

4. Bayesian extension of the selection method

But what can we do when the outcome is not as unambiguous as in the GDP example or when there are more than two model selection criteria? In such cases we could try a Bayesian modification of our approach, where the synthetic series are generated according to the posterior probabilities of the different models and the posterior distributions of the model parameters. We describe the details for the classical problem of choosing a linear regression model. Suppose that the n-dimensional random vector y has a multivariate normal distribution with mean vector μ=Xβ, where the (n×K)-dimensional matrix X has rank K, and covariance matrix σ²I. We want to select one of m competing models M_j, j=1,…,m. Each model M_j is represented by an (n×k_j)-dimensional submatrix X_j of X. Since μ is typically unknown, we cannot approximate μ by the projections μ*_j = X_j(X_jᵀX_j)⁻¹X_jᵀμ and must therefore use their sample analogues μ̂_j = X_j(X_jᵀX_j)⁻¹X_jᵀy instead. The best models according to AIC and BIC can be obtained by minimizing

n log σ̂²_j + 2(k_j+1)

and

n log σ̂²_j + (k_j+1) log n,

respectively, where σ̂²_j = (1/n)(y−X_j β̂_j)ᵀ(y−X_j β̂_j) and β̂_j = (X_jᵀX_j)⁻¹X_jᵀy.
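The quantities σ̂²_j, β̂_j and the two criteria can be computed directly. The following sketch (my own helper names and simulated data, not from the paper) evaluates them for a family of nested submodels:

```python
import numpy as np

def criteria_for_submodel(y, Xj):
    """Return (sigma2_j, AIC_j, BIC_j) for one candidate design matrix X_j,
    with AIC_j = n log sigma2_j + 2(k_j+1), BIC_j = n log sigma2_j + (k_j+1) log n."""
    n, kj = Xj.shape
    beta, *_ = np.linalg.lstsq(Xj, y, rcond=None)
    resid = y - Xj @ beta
    sigma2 = float(resid @ resid) / n
    return sigma2, n * np.log(sigma2) + 2 * (kj + 1), n * np.log(sigma2) + (kj + 1) * np.log(n)

# Simulated example: K = 4 candidate regressors, two of which actually matter
rng = np.random.default_rng(3)
n, K = 120, 4
X = rng.normal(size=(n, K))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

# Candidate models M_j represented by submatrices X_j of X (here: nested subsets)
models = [X[:, :k] for k in range(1, K + 1)]
results = [criteria_for_submodel(y, Xj) for Xj in models]
best_aic = min(range(K), key=lambda j: results[j][1])
best_bic = min(range(K), key=lambda j: results[j][2])
```

In general the candidate set need not be nested; any collection of column subsets of X can be scored the same way.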
But we might wish to use alternative model selection criteria as well, e.g., an extension of AIC for small sample sizes (see Sugiura, 1978, Hurvich and Tsai, 1989),

n log σ̂²_j + 2(k_j+1) + (2k_j² + 6k_j + 4)/(n − k_j − 2),

an extension of AIC for small sample sizes and possible misspecification (see Reschenhofer, 1999),

n log σ̂²_j + 2(k_j+2)Q̂_j − 2Q̂_j² + (14k_j Q̂_j² + 2k_j² Q̂_j² − 8k_j Q̂_j³ + 24Q̂_j² − 32Q̂_j³ + 12Q̂_j⁴)/(n − k_j − 2),

a small sample version of BIC (see Reschenhofer, 1996a),

(n−2) log σ̂²_j + k_j log(n/2) − k_j log(σ̂²_j/τ̂²) − 2 log Γ(k_j/2 + 1),

and so on, where Q̂_j = σ̂²/σ̂²_j, σ̂² = (1/n)(y−Xβ̂)ᵀ(y−Xβ̂), β̂ = (XᵀX)⁻¹Xᵀy, τ̂² = (1/n)yᵀy, and Γ denotes the gamma function. For the generation of each of the r synthetic series y(i), i=1,…,r, we randomly draw a fully specified model using the approximate posterior probabilities

P̂(M_j | y) = exp(−B(j)/2) / Σ_{s=1}^{m} exp(−B(s)/2),  j=1,…,m,

of the m models (see Kass and Wasserman, 1995, Reschenhofer, 1996b) and the usual posterior distributions

nσ̂²_j/σ²_j | y, M_j ~ χ²(n−k_j),  β_j | σ²_j, y, M_j ~ N(β̂_j, σ²_j(X_jᵀX_j)⁻¹)

of the model parameters. Here B(j) is the BIC value of model M_j (or the corresponding value of the small sample version of BIC). Now we fit all m models to the original sample y as well as to each synthetic sample y(i) and use the different criteria for the selection of the respective optimal models. The discrepancies between the original goodness-of-fit measures and the means of those obtained from the synthetic samples can be used to compare the different optimal models for y and to select the overall best model for y. Of course, this Bayesian procedure is computationally practicable only if m is not too large. The magnitude of m is particularly critical when the methods used for the estimation of the model parameters are very time-consuming.
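The model-drawing step can be sketched as follows (the helper name is mine; the computation is a numerically stable softmax of −B(j)/2):

```python
import numpy as np

def posterior_model_probs(bic_values):
    """Approximate posterior probabilities P(M_j | y) proportional to
    exp(-B(j)/2), computed stably by shifting by the smallest B(j) first."""
    b = np.asarray(bic_values, dtype=float)
    w = np.exp(-(b - b.min()) / 2.0)
    return w / w.sum()

# Draw the model underlying one synthetic sample y(i)
rng = np.random.default_rng(4)
probs = posterior_model_probs([100.0, 102.0, 110.0])
j = rng.choice(len(probs), p=probs)
```

Given the drawn model M_j, σ²_j and β_j would then be drawn from the χ² and normal posteriors stated above before generating y(i).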
In the case of subset selection with criteria like

n log σ̂²_j + n log(1 + 2ζ(k_j,K)/(n − ζ(k_j,K))),

where ζ(k_j,K) is the expected value of the sum of the k_j largest of K independent χ²(1) statistics (see Reschenhofer, 2004), m may be very large even when K is small.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In: Petrov, B.N., Csaki, F. (eds.), Second International Symposium on Information Theory. Akademiai Kiado, Budapest, pp. 267-281.
George, E.I., Foster, D.P. (2000). Calibration and empirical Bayes variable selection. Biometrika 87:731-747.
Hurvich, C.M., Tsai, C.L. (1989). Regression and time series model selection in small samples. Biometrika 76:297-307.
Kabaila, P. (1997). Admissible variable-selection procedures when fitting misspecified regression models by least squares. Communications in Statistics – Theory and Methods 26:2303-2306.
Kabaila, P. (2002). On variable selection in linear regression. Econometric Theory 18:913-925.
Kass, R.E., Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association 90:928-934.
Kempthorne, P.J. (1984). Admissible variable-selection procedures when fitting regression models by least squares for prediction. Biometrika 71:593-597.
Leeb, H., Pötscher, B.M. (2004). Sparse estimators and the oracle property, or the return of Hodges' estimator. Working paper.
Reschenhofer, E. (1996a). Approximating the Bayes factor. Statistics & Probability Letters 30:241-245.
Reschenhofer, E. (1996b). Prediction with vague prior knowledge. Communications in Statistics – Theory and Methods 25:601-608.
Reschenhofer, E. (1999). Improved estimation of the expected Kullback-Leibler discrepancy in case of misspecification. Econometric Theory 15:377-387.
Reschenhofer, E. (2004). On subset selection and beyond. Advances and Applications in Statistics 4:265-286.
Sawa, T. (1978). Information criteria for discriminating among alternative regression models. Econometrica 46:1273-1291.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6:461-464.
Shibata, R. (1980). Asymptotically efficient selection of the order of the model for estimating parameters of a linear process. Annals of Statistics 8:147-164.
Shibata, R. (1981). An optimal selection of regression variables. Biometrika 68:45-54.
Sugiura, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections. Communications in Statistics – Theory and Methods 7:13-26.
Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 92:937-950.