SELECTING SELECTION METHODS
ERHARD RESCHENHOFER
Department of Statistics and Decision Support Systems
University of Vienna, Universitätsstr. 5, A-1010 Vienna, Austria
Abstract
This paper proposes a new approach for model selection and applies it to a classical time
series modeling problem. In contrast to conventional model selection methods like AIC and
BIC, whose penalty terms typically depend only on the number of model parameters, the
proposed model selection method also takes the values of the model parameters and the sets
of candidate models into account. A brief sketch of a Bayesian further development of this
method is given within the framework of the linear regression model.
Key words and phrases: AIC; BIC; Model selection criteria
1. Introduction
The expected (predictive) log likelihood
E log f(z | θ̂(y))
is a widely used measure for the evaluation of a model f(· | θ), θ∈Θ. Here y and z are
independent samples from an n-dimensional density f(· | θ) and θ̂(y) is the ML estimator for
the unknown (k+1)-dimensional parameter vector θ based on the sample y. The maximum log
likelihood
log f(y | θ̂(y))
of a given sample y is a biased estimator of the expected log likelihood because one and the
same sample is used both for the estimation of the parameter vector and the evaluation of the
goodness of fit. In the classical linear regression model
y~N(Xβ,σ2I)
the bias is given by
(k+1) + (k² + 3k + 2)/(n − k − 2)
(see Sugiura, 1978) and therefore depends only on the number of parameters but not on the
values of θ=(βᵀ,σ²)ᵀ. The asymptotic bias is k+1. Ignoring the fact that these two bias terms
are only meaningful for correctly specified models, Akaike (1973) proposed to select that
model from m competing models fj(· | θj), θj∈Θj, j=1,…,m, which minimizes the criterion
−2 log fj(y | θ̂j(y)) + 2(kj+1)
(AIC)
(for the case of possibly misspecified models see Reschenhofer, 1999). Since Kempthorne
(1984) has shown that all model selection methods are admissible, AIC cannot be uniformly
better than any other method (for a more general setting see Kabaila, 1997). Nevertheless, Shibata
(1980, 1981) was able to prove a kind of asymptotic optimality of AIC. Kabaila (2002)
put Shibata’s results into perspective by pointing out that this optimality holds only pointwise
and may therefore be misleading. In contrast to AIC, the criterion
−2 log fj(y | θ̂j(y)) + (kj+1) log(n)
(BIC)
(see Schwarz, 1978) is consistent in the sense that the probability of selecting the true model
approaches 1 provided that it is among the candidate models. Yang (2005) derived the
minimax-rate optimality of AIC and showed that no model selection criterion can share the
main strengths of AIC and BIC simultaneously; in particular, no consistent criterion can be
minimax-rate optimal. In addition, Leeb and Pötscher (2004) showed that any consistent
selection criterion has maximal risk that diverges to infinity whenever the loss function is
unbounded.
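For concreteness, the two criteria can be illustrated numerically. The following Python sketch is an editorial illustration, not part of the original analysis; the Gaussian mean example, the sample size, and the seed are all assumptions. It computes AIC and BIC for a zero-mean model and a free-mean model fitted to the same simulated sample:

```python
import math
import random

def gaussian_max_loglik(resid):
    """Maximized Gaussian log likelihood implied by the residuals of a fitted mean model."""
    n = len(resid)
    s2 = sum(e * e for e in resid) / n  # ML estimate of the variance
    return -0.5 * n * (math.log(2 * math.pi * s2) + 1.0)

def aic(loglik, k):
    return -2.0 * loglik + 2.0 * (k + 1)        # k mean parameters + 1 variance parameter

def bic(loglik, k, n):
    return -2.0 * loglik + (k + 1) * math.log(n)

random.seed(1)
n = 200
y = [0.3 + random.gauss(0.0, 1.0) for _ in range(n)]

# M0: zero-mean model (k = 0); M1: free-mean model (k = 1)
ll0 = gaussian_max_loglik(y)
ybar = sum(y) / n
ll1 = gaussian_max_loglik([v - ybar for v in y])

print("AIC:", aic(ll0, 0), aic(ll1, 1))
print("BIC:", bic(ll0, 0, n), bic(ll1, 1, n))
```

Since the free-mean model has one extra parameter, AIC charges it 2 additional units while BIC charges it log n, which is why BIC tends to pick the smaller model for large n.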
Of course, evidence obtained by simulation studies can also be dubious and misleading. For
example, claims of “near optimality” of some selection methods over a broad range of data
generating mechanisms (see, e.g., George and Foster, 2000) may, after slight changes in the
design of the simulation studies, turn out to be overoptimistic (see Reschenhofer, 2004). Thus
there is little to be said against simply using the method we like best. Perhaps the most
appealing and straightforward approach to automatic model selection is via correction of the
data snooping bias. But the devil is in the details. Firstly, in more sophisticated models like
time series models the bias term does not only depend on the model dimension but also on the
model parameters. Secondly, practitioners rarely get up in the morning and start pondering
whether, for example, an MA(2) model or an ARMA(3,2) model would be more appropriate
for their data. It is more likely that they end up with two competing models because they
apply AIC and BIC simultaneously. Now if BIC selects an MA(2) model and AIC selects an
ARMA(3,2) model, the bias terms for these models should not only depend on the respective
dimensions and model parameters but also on the sets of candidate models from which the
two models have been selected.
In Section 3 we describe a fully operational selection method that takes all “degrees of
freedom” consumed by the whole model building procedure into account and apply it to real
data. For comparison the results obtained by applying conventional selection methods like
AIC and BIC to the same data are presented in Section 2. Section 4 presents a Bayesian
extension of the selection method described in Section 3.
2. Modeling the US GDP
To illustrate the difficulty of automatic modeling we consider the seasonally adjusted real
U.S. Gross Domestic Product (GDP) from 1947.1 to 2005.1. The data were downloaded from
FRED® (Federal Reserve Economic Data), the database of the Federal Reserve Bank of Saint
Louis.
First we use simple autoregressive (AR) models for the description of the first differences of
the log GDP. According to AIC the AR(4) model is the best while BIC selects the AR(1)
model. Figure 1 allows us to interpret the different models and to form our own opinion about
their goodness. It shows the periodogram of the first differences together with the spectral
densities of the estimated AR(p) models with p=0,…,8. In contrast to the AR(1) model the
AR(4) model implies a decrease of the spectral density near frequency zero. The higher the
spectral density near frequency zero, the more persistent the shocks.
When we use moving average (MA) models instead of AR models both AIC and BIC select
the MA(2) model. Remarkably, AIC now also decides that there is no decline near frequency
zero (see Figure 2). Obviously, the results obtained with automatic model selection methods
are not so objective after all. Since they depend strongly on the building blocks (AR or MA)
used in the model building procedure, we would need yet another selection method for the
selection of the model class. Not surprisingly, AIC comes up with a third spectral shape
when we use a third model class (see Figure 3). According to AIC the best ARMA model is the
ARMA(3,2) model, which implies two approximately constant regimes. The possible jump
discontinuity between the two regimes is cobbled together by a sharp increase immediately
followed by a sharp decrease. In contrast, the more conservative BIC adheres stubbornly to
the MA(2) model. This remains true when we make the fractional differencing
parameter d available, which allows the parsimonious description of a sharp decline or incline
near frequency zero. BIC decides that the inclusion of this parameter is not advisable. In
contrast, AIC decides in favor of a sharp decline at frequency zero. According to AIC the best
fractionally integrated ARMA (ARFIMA) model is the ARFIMA(3,d,3) model.
Figure 1: Periodogram of differenced log GDP with spectral densities of AR(p) models,
p=0,1,…,8.
Figure 2: Periodogram of differenced log GDP with spectral densities of MA(q) models,
q=0,1,…,8.
Figure 3: Periodogram of differenced log GDP with spectral densities of ARMA(p,q) models,
p=1,2,3 (rows), q=1,2,3 (columns).
Figure 4: Periodogram of differenced log GDP with spectral densities of ARFIMA(p,d,q)
models, p=1,2,3 (rows), q=1,2,3 (columns).
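The AR order selection of Section 2 can be sketched in Python. Since the FRED GDP series is not reproduced here, a simulated AR(2) series (with purely illustrative parameters) stands in for the differenced log GDP; the models are fitted by conditional least squares, so the maximized likelihoods are only approximations to exact ML values:

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_ar(y, p):
    """Conditional least squares fit of an AR(p) model with intercept.
    Returns the maximized Gaussian log likelihood of the last n-p points
    and the number of mean parameters (p coefficients + intercept)."""
    n = len(y)
    rows = [[1.0] + [y[t - i] for i in range(1, p + 1)] for t in range(p, n)]
    target = y[p:]
    d = p + 1
    A = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    b = [sum(r[i] * z for r, z in zip(rows, target)) for i in range(d)]
    coef = solve(A, b)
    resid = [z - sum(c * v for c, v in zip(coef, r)) for r, z in zip(rows, target)]
    m = len(resid)
    s2 = sum(e * e for e in resid) / m
    return -0.5 * m * (math.log(2 * math.pi * s2) + 1.0), d

random.seed(7)
y = [0.04, 0.04]  # simulated placeholder standing in for differenced log GDP
for _ in range(230):
    y.append(0.008 + 1.2 * y[-1] - 0.4 * y[-2] + random.gauss(0.0, 0.009))

scores = {}
for p in range(9):  # AR(0), ..., AR(8), as in Figure 1
    ll, k = fit_ar(y, p)
    m = len(y) - p
    scores[p] = (-2 * ll + 2 * (k + 1), -2 * ll + (k + 1) * math.log(m))

p_aic = min(scores, key=lambda p: scores[p][0])
p_bic = min(scores, key=lambda p: scores[p][1])
print("AIC order:", p_aic, "BIC order:", p_bic)
```

With the real GDP data the paper reports that AIC selects AR(4) while BIC selects AR(1); on a simulated series the selected orders depend on the draw.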
3. Choosing the selection method
Suppose we have to decide between the ARMA(3,2) model and the MA(2) model for the
description of the differenced log GDP. To find an acceptable balance between the goodness
of fit (bias) and the complexity of the model (variance) we want to take into account that these
models are not given a priori but have been found as the best ARMA models according to
AIC and BIC, respectively. So actually we are going to decide between two model selection
methods rather than between two models.
We proceed as follows. First we fit an ARMA model of order (p0,q0)=(3,2) to the
differenced log GDP, y, and compute the residuals, û. Then we generate r=100 synthetic time
series y(i), i=1,…,r, using the estimated model parameters θ̂(y;(p0,q0)) and different random
permutations û(i), i=1,…,r, of the residuals. Finally we fit all ARMA models up to order
(3,3) to each synthetic series y(i) and select the optimal orders (pi(AIC),qi(AIC)) and
(pi(BIC),qi(BIC)) according to AIC and BIC, respectively. The quantities of interest are the
discrepancies between the original goodness-of-fit measures and those obtained from the
synthetic series. The differences
DAIC = log f(y | θ̂(y;(p0,q0))) − (1/r) Σi=1,…,r log f(y(i) | θ̂(y(i);(pi(AIC),qi(AIC))))
and
DBIC = log f(y | θ̂(y;(p0,q0))) − (1/r) Σi=1,…,r log f(y(i) | θ̂(y(i);(pi(BIC),qi(BIC))))
may serve as penalties for the models found with AIC and BIC, respectively. In our case
DAIC-DBIC=2.503-(-1.842)=4.345,
hence the difference between the penalties is greater than the difference between the numbers
of parameters. The fact that DBIC is negative indicates that BIC is too conservative in this
situation. Since the difference between the log likelihood terms of the two models is much
greater than DAIC-DBIC, i.e.,
log f(y | θ̂(y;(3,2))) − log f(y | θ̂(y;(0,2))) = 7.202 > DAIC−DBIC = 4.345,
the ARMA(3,2) model comes off much better than the MA(2) model when we use the former
model for the computation of the penalties.
Alternatively, we may use the model selected by BIC for the computation of the penalties.
Starting with (p0,q0)=(0,2) and running through all steps again we get
DAIC-DBIC=3.329-0.650=2.679.
While both penalties are now greater than before, which is due to an increased risk of
overfitting, their difference is smaller than before; it is even smaller than the difference
between the numbers of parameters. So our approach of taking the complexity of the whole
modeling procedure into account (and not only the complexity of the selected model) clearly
favors the ARMA(3,2) model over the more parsimonious MA(2) model.
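The resampling scheme of this section can be sketched in Python. For a self-contained illustration the candidate set is reduced to AR(0) versus AR(1) and the data are simulated (the paper fits ARMA models up to order (3,3) to the GDP series); the fitted reference model, the permuted residuals, and the regenerated synthetic series follow the steps described above, but every numeric ingredient here is an illustrative assumption:

```python
import math
import random

def loglik(resid):
    """Maximized Gaussian log likelihood implied by a residual vector."""
    m = len(resid)
    s2 = sum(e * e for e in resid) / m
    return -0.5 * m * (math.log(2 * math.pi * s2) + 1.0)

def fit_ar0(y):
    mu = sum(y) / len(y)
    return (mu,), [v - mu for v in y]

def fit_ar1(y):
    x, z = y[:-1], y[1:]
    mx, mz = sum(x) / len(x), sum(z) / len(z)
    phi = sum((u - mx) * (w - mz) for u, w in zip(x, z)) / sum((u - mx) ** 2 for u in x)
    c = mz - phi * mx
    return (c, phi), [w - c - phi * u for u, w in zip(x, z)]

def selected_loglik(y, use_bic):
    """Fit both candidates, select by AIC or BIC, return the winner's log likelihood."""
    fits = [(loglik(fit_ar0(y)[1]), 1), (loglik(fit_ar1(y)[1]), 2)]  # (loglik, k)
    n = len(y)
    pen = (lambda k: (k + 1) * math.log(n)) if use_bic else (lambda k: 2 * (k + 1))
    return min(fits, key=lambda f: -2 * f[0] + pen(f[1]))[0]

random.seed(3)
y = [0.0]  # simulated stand-in for the differenced log GDP
for _ in range(200):
    y.append(0.01 + 0.4 * y[-1] + random.gauss(0.0, 0.01))

(c, phi), resid = fit_ar1(y)          # step 1: fit the reference model
ll_orig = loglik(resid)

r = 100
sums = {"AIC": 0.0, "BIC": 0.0}
for _ in range(r):
    e = resid[:]
    random.shuffle(e)                 # step 2: random permutation of the residuals
    ysim = [y[0]]
    for t in range(len(e)):           # step 3: regenerate a synthetic series
        ysim.append(c + phi * ysim[-1] + e[t])
    for name in ("AIC", "BIC"):       # step 4: refit and select on each synthetic series
        sums[name] += selected_loglik(ysim, name == "BIC")

D_AIC = ll_orig - sums["AIC"] / r     # penalties in the sense of Section 3
D_BIC = ll_orig - sums["BIC"] / r
print("D_AIC =", round(D_AIC, 3), "D_BIC =", round(D_BIC, 3))
```

The values reported in the paper (DAIC = 2.503, DBIC = −1.842) come from the richer ARMA candidate set and the GDP data; this sketch only demonstrates the mechanics of the procedure.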
4. Bayesian extension of the selection method
But what can we do when the outcome is not as unambiguous as in the GDP example or when
there are more than two model selection criteria? In such cases we could try a Bayesian
modification of our approach, where the synthetic series are generated according to the
posterior probabilities of the different models and the posterior distributions of the model
parameters. We describe the details for the classical problem of choosing a linear regression
model. Suppose that the n-dimensional random vector y has a multivariate normal distribution
with mean vector µ=Xβ, where the (n×K)-dimensional matrix X has rank K, and covariance
matrix σ²I. We want to select one from m competing models Mj, j=1,…,m. Each model Mj is
represented by an (n×kj)-dimensional submatrix Xj of X. Since µ is typically unknown we
cannot compute the projections µ*j = Xj(XjᵀXj)⁻¹Xjᵀµ and must therefore use their
sample analogues µ̂j = Xj(XjᵀXj)⁻¹Xjᵀy instead. The best models according to AIC and BIC
can be obtained by minimizing
n log σ̂²j + 2(kj+1)
and
n log σ̂²j + (kj+1) log n,
respectively, where
σ̂²j = (1/n)(y − Xjβ̂j)ᵀ(y − Xjβ̂j),  β̂j = (XjᵀXj)⁻¹Xjᵀy.
But we might wish to use alternative model selection criteria as well, e.g., an extension of
AIC for small sample sizes (see Sugiura, 1978, Hurvich and Tsai, 1989),
n log σ̂²j + 2(kj+1) + (2kj² + 6kj + 4)/(n − kj − 2),
an extension of AIC for small sample sizes and possible misspecifications (see Reschenhofer,
1999),
n log σ̂²j + 2(kj+2)Q̂j − 2Q̂j² + (14kjQ̂j² + 2kj²Q̂j² − 8kjQ̂j³ + 24Q̂j² − 32Q̂j³ + 12Q̂j⁴)/(n − kj − 2),
a small sample version of BIC (see Reschenhofer, 1996a),
(n−2) log σ̂²j + kj log(n/2) − kj log(σ̂²j/τ̂²) − 2 log Γ(kj/2 + 1),
and so on, where
Q̂j = σ̂²/σ̂²j,  σ̂² = (1/n)(y − Xβ̂)ᵀ(y − Xβ̂),  β̂ = (XᵀX)⁻¹Xᵀy,  τ̂² = (1/n)yᵀy,
and Γ denotes the gamma function.
For the generation of each of the r synthetic series y(i), i=1,…,r, we randomly generate a fully
specified model using the approximate posterior probabilities
P̂(Mj | y) = e^(−B(j)/2) / Σs=1,…,m e^(−B(s)/2),  j=1,…,m,
of the m models (see Kass and Wasserman, 1995, Reschenhofer, 1996b) and the usual
posterior distributions
nσ̂²j/σ²j | y, Mj ~ χ²(n−kj),
βj | y, σ²j, Mj ~ N(β̂j, σ²j(XjᵀXj)⁻¹)
of the model parameters, respectively. Here B(j) is the BIC value of model Mj (or the
corresponding value of the small sample version of BIC).
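The approximate posterior model probabilities used above can be computed stably by shifting all BIC values by their minimum before exponentiating; the shift cancels in the ratio and prevents underflow. The three BIC values in this Python sketch are made up for illustration:

```python
import math

def posterior_model_probs(bic_values):
    """Approximate posterior model probabilities P(Mj | y) proportional to
    exp(-B(j)/2).  Subtracting the minimum BIC first avoids underflow
    without changing the probability ratios."""
    b_min = min(bic_values)
    w = [math.exp(-(b - b_min) / 2.0) for b in bic_values]
    total = sum(w)
    return [v / total for v in w]

# hypothetical BIC values B(1), B(2), B(3) for m = 3 candidate models
probs = posterior_model_probs([250.4, 252.1, 260.0])
print([round(p, 4) for p in probs])
```

A difference of 2 log-units in BIC translates into a posterior probability ratio of e ≈ 2.7.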
Now we fit all m models to the original sample y as well as to each synthetic sample y(i)
and use the different criteria for the selection of the respective optimal models. The
discrepancies between the original goodness-of-fit measures and the means of those obtained
from the synthetic samples can be used to compare the different optimal models for y and to
select the overall best model for y.
Of course, this Bayesian procedure is computationally practicable only if m is not too large.
The magnitude of m is particularly critical when the methods used for the estimation of the
model parameters are very time-consuming. In the case of subset selection with criteria like
n log σ̂²j + n log(1 + 2ζ(kj,K)/(n − ζ(kj,K))),
where ζ(kj,K) is the expected value of the sum of the kj largest of K independent χ²(1)-statistics
(see Reschenhofer, 2004), m may be very large even when K is small.
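The quantity ζ(kj,K) can be approximated by Monte Carlo simulation, as in the following Python sketch (the replication count and seed are arbitrary choices; a χ²(1) variate is drawn as the square of a standard normal):

```python
import math
import random

def zeta_mc(k, K, reps=20000, seed=0):
    """Monte Carlo estimate of ζ(k, K): the expected value of the sum of the
    k largest of K independent χ²(1) variates."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        draws = sorted(rng.gauss(0.0, 1.0) ** 2 for _ in range(K))
        total += sum(draws[-k:]) if k > 0 else 0.0  # sum of the k largest draws
    return total / reps

# ζ(K, K) is the mean of a χ²(K) variable, i.e. exactly K,
# so zeta_mc(3, 3) should be close to 3
print(round(zeta_mc(3, 3), 2), round(zeta_mc(1, 5), 2))
```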
References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle.
In Petrov, B.N. and Csaki, F., eds., Second International Symposium on Information Theory.
Akademiai Kiado, Budapest, pp. 267-281.
George, E.I. and Foster, D.P. (2000). Calibration and empirical Bayes variable selection.
Biometrika 87:731-747.
Hurvich, C.M. and Tsai, C.-L. (1989). Regression and time series model selection in small samples.
Biometrika 76:297-307.
Kabaila, P. (1997). Admissible variable-selection procedures when fitting misspecified
regression models by least squares. Communications in Statistics – Theory and Methods 26:
2303-2306.
Kabaila, P. (2002). On variable selection in linear regression. Econometric Theory 18:913-925.
Kass, R.E. and Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its
relationship to the Schwarz criterion. Journal of the American Statistical Association 90:928-934.
Kempthorne, P.J. (1984). Admissible variable-selection procedures when fitting regression
models by least squares for prediction. Biometrika 71:593-597.
Leeb, H. and Pötscher, B.M. (2004). Sparse estimators and the oracle property, or the return of
Hodges’ estimator. Working paper.
Reschenhofer, E. (1996a). Approximating the Bayes factor. Statistics & Probability Letters
30:241-245.
Reschenhofer, E. (1996b). Prediction with vague prior knowledge. Communications in
Statistics – Theory and Methods 25:601-608.
Reschenhofer, E. (1999). Improved estimation of the expected Kullback-Leibler discrepancy
in case of misspecification. Econometric Theory 15:377-387.
Reschenhofer, E. (2004). On subset selection and beyond. Advances and Applications in
Statistics 4:265-286.
Sawa, T. (1978). Information criteria for discriminating among alternative regression models.
Econometrica 46:1273-1291.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6:461-464.
Shibata, R. (1980). Asymptotically efficient selection of the order of the model for estimating
parameters of a linear process. Annals of Statistics 8: 147-164.
Shibata, R. (1981). An optimal selection of regression variables. Biometrika 68:45-54.
Sugiura, N. (1978). Further analysis of the data by Akaike’s information criterion and the
finite corrections. Communications in Statistics – Theory and Methods 7:13-26.
Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model
identification and regression estimation. Biometrika 92: 937-950.