Lecture 12 – Estimating Autoregressions and
Vector Autoregressions
(Reference - Section 6.4, Hayashi)
Assume that the observed time series data, $y_1,\dots,y_T$,
have been generated by the AR(p) process:
$y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \varepsilon_t$
where the roots of $(1 - \phi_1 z - \dots - \phi_p z^p)$ lie outside
the unit circle and $\varepsilon_t \sim$ i.i.d. $(0, \sigma^2)$. (The i.i.d.
assumption can be replaced with white noise and
additional conditions.)
Then the OLS estimator of the AR(p) parameter
vector [c φ1 … φp]’ is
1) (strongly) consistent
2) asymptotically normal
3) asymptotically efficient
Further, for sufficiently large samples the model can
be treated as if it is a classical linear regression
model with strictly exogenous regressors and
normally distributed, serially uncorrelated, and
conditionally homoskedastic errors. So, for example,
the t-statistic
$t_{\hat\phi_i} = (\hat\phi_i - \phi_i)/se(\hat\phi_i) \to_d N(0,1)$, where
$se(\hat\phi_i) = \sqrt{\hat\sigma^2 \, [(X'X)^{-1}]_{ii}}$ and
$\hat\sigma^2 = SSR/(T - p)$.
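As an illustration of the estimator and t-statistics just described, here is a minimal NumPy sketch. The function name `fit_ar_ols`, the simulated AR(1) data, and the parameter values are assumptions made for this example; the variance estimate uses the divisor T − p stated in the notes.

```python
import numpy as np

def fit_ar_ols(y, p):
    """OLS fit of y_t = c + phi_1 y_{t-1} + ... + phi_p y_{t-p} + eps_t.

    Returns (beta, se, t_stats, ssr) with beta = [c, phi_1, ..., phi_p].
    The t-statistics test H0: coefficient = 0.
    """
    y = np.asarray(y, dtype=float)
    T = y.size
    # Rows are t = p+1, ..., T; columns are [1, y_{t-1}, ..., y_{t-p}].
    X = np.column_stack([np.ones(T - p)] + [y[p - j:T - j] for j in range(1, p + 1)])
    Y = y[p:]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ Y
    resid = Y - X @ beta
    ssr = float(resid @ resid)
    sigma2_hat = ssr / (T - p)                 # divisor as in the notes
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
    return beta, se, beta / se, ssr

# Hypothetical data: a stationary AR(1) with c = 1 and phi = 0.5.
rng = np.random.default_rng(0)
T, c, phi = 5000, 1.0, 0.5
y = np.empty(T)
y[0] = c / (1 - phi)                           # start at the stationary mean
for t in range(1, T):
    y[t] = c + phi * y[t - 1] + rng.standard_normal()

beta, se, t_stats, ssr = fit_ar_ols(y, p=1)
```

With a sample this large the estimates should sit close to the values used in the simulation, which is what consistency predicts.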
To prove these claims (except for asymptotic
efficiency), we can show that the assumptions
underlying Hayashi’s Proposition 2.5 apply to this model.
Consider the proof for the case where p = 1:
A.1 – Linearity: yes.
A.2 – $[y_t \; x_t]$ is strictly stationary and ergodic.
For this model, $x_t = [1 \;\; y_{t-1}]'$, so A.2 requires that
$y_t$ is strictly stationary and ergodic. This follows
from the fact that $y_t$ has an absolutely summable
MA(∞) form in terms of the i.i.d. process $\varepsilon_t$.
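A.4 – $E(x_t x_t')$ is nonsingular.
$E(x_t x_t') = E\{[1 \;\; y_{t-1}]'[1 \;\; y_{t-1}]\} = \begin{bmatrix} 1 & E(y_{t-1}) \\ E(y_{t-1}) & E(y_{t-1}^2) \end{bmatrix}$
Recall that $E(y_t) = \mu = c/(1-\phi)$ and $Var(y_t) = E(y_t^2) - E(y_t)^2 = \gamma_0$. So,
$E(x_t x_t') = \begin{bmatrix} 1 & \mu \\ \mu & \gamma_0 + \mu^2 \end{bmatrix}$
which is nonsingular. [$\det E(x_t x_t') = \gamma_0 + \mu^2 - \mu^2 = \gamma_0$,
with $0 < \gamma_0 < \infty$.]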
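For reference, the AR(1) moments used in this step can be derived in a few lines. This is a sketch under the stated assumptions ($|\phi| < 1$, $\varepsilon_t$ i.i.d. with variance $\sigma^2$); the closed form for $\gamma_0$ is the standard AR(1) result and is added here for illustration.

```latex
\begin{align*}
E(y_t) &= c + \phi\,E(y_{t-1})
  \;\Rightarrow\; \mu = c + \phi\mu
  \;\Rightarrow\; \mu = \frac{c}{1-\phi}, \\
y_t - \mu &= \phi\,(y_{t-1} - \mu) + \varepsilon_t
  \;\Rightarrow\; \gamma_0 = \phi^2\gamma_0 + \sigma^2
  \;\Rightarrow\; \gamma_0 = \frac{\sigma^2}{1-\phi^2}, \\
\det E(x_t x_t') &= 1\cdot(\gamma_0 + \mu^2) - \mu\cdot\mu = \gamma_0 > 0 .
\end{align*}
```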
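A.5 (and, therefore, A.3) – $\{x_t\varepsilon_t\}$ is an m.d.s. with
finite second moment.
$x_t\varepsilon_t = [\varepsilon_t \;\; y_{t-1}\varepsilon_t]'$
Show that $\{x_t\varepsilon_t\}$ is an m.d.s. by proving the sufficient condition
$E(x_t\varepsilon_t \mid x_{t-1}\varepsilon_{t-1}, x_{t-2}\varepsilon_{t-2}, \dots) = 0$.
First element:
$E(\varepsilon_t \mid \varepsilon_{t-1}, \varepsilon_{t-2}, \dots, y_{t-2}\varepsilon_{t-1}, y_{t-3}\varepsilon_{t-2}, \dots) = E(\varepsilon_t \mid \varepsilon_{t-1}, \varepsilon_{t-2}, \dots) = 0$
Second element:
$E(y_{t-1}\varepsilon_t \mid \varepsilon_{t-1}, \varepsilon_{t-2}, \dots, y_{t-2}\varepsilon_{t-1}, y_{t-3}\varepsilon_{t-2}, \dots)$
$= E\{ E(y_{t-1}\varepsilon_t \mid y_{t-1}, \varepsilon_{t-1}, \varepsilon_{t-2}, \dots, y_{t-2}\varepsilon_{t-1}, y_{t-3}\varepsilon_{t-2}, \dots) \mid \varepsilon_{t-1}, \varepsilon_{t-2}, \dots, y_{t-2}\varepsilon_{t-1}, y_{t-3}\varepsilon_{t-2}, \dots \}$
by the Law of Iterated Expectations
$= E\{ y_{t-1}\, E(\varepsilon_t \mid y_{t-1}, \varepsilon_{t-1}, \varepsilon_{t-2}, \dots, y_{t-2}\varepsilon_{t-1}, y_{t-3}\varepsilon_{t-2}, \dots) \mid \varepsilon_{t-1}, \varepsilon_{t-2}, \dots, y_{t-2}\varepsilon_{t-1}, y_{t-3}\varepsilon_{t-2}, \dots \}$
$= 0$, since $E(\varepsilon_t \mid y_{t-1}, \varepsilon_{t-1}, \varepsilon_{t-2}, \dots, y_{t-2}\varepsilon_{t-1}, y_{t-3}\varepsilon_{t-2}, \dots) = 0$.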
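So, $[\varepsilon_t \;\; y_{t-1}\varepsilon_t]'$ is an m.d.s. To complete A.5, we need
to show that $Var(x_t\varepsilon_t)$ is finite.
Since $E(x_t\varepsilon_t) = 0$ (by the m.d.s. property),
$Var(x_t\varepsilon_t) = E\{[\varepsilon_t \;\; y_{t-1}\varepsilon_t]'[\varepsilon_t \;\; y_{t-1}\varepsilon_t]\} = \begin{bmatrix} E(\varepsilon_t^2) & E(y_{t-1}\varepsilon_t^2) \\ E(y_{t-1}\varepsilon_t^2) & E(y_{t-1}^2\varepsilon_t^2) \end{bmatrix}$
Applying the Law of Iterated Expectations:
$E(y_{t-1}\varepsilon_t^2) = E\{E(y_{t-1}\varepsilon_t^2 \mid y_{t-1})\} = E(y_{t-1}\sigma^2) = \sigma^2 E(y_{t-1}) = \sigma^2\mu$
$E(y_{t-1}^2\varepsilon_t^2) = E\{E(y_{t-1}^2\varepsilon_t^2 \mid y_{t-1})\} = E(y_{t-1}^2\sigma^2) = \sigma^2(\gamma_0 + \mu^2)$
So,
$E\{[\varepsilon_t \;\; y_{t-1}\varepsilon_t]'[\varepsilon_t \;\; y_{t-1}\varepsilon_t]\} = \begin{bmatrix} \sigma^2 & \sigma^2\mu \\ \sigma^2\mu & \sigma^2(\gamma_0 + \mu^2) \end{bmatrix} = \sigma^2 E(x_t x_t')$,
which is finite and nonsingular.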
A.7 – $E(\varepsilon_t^2 \mid x_t) = \sigma^2 > 0$
This follows from the facts that 1) $x_t$ is a linear
combination of past ε’s, 2) $\varepsilon_t$ is independent of past
ε’s, and 3) $E(\varepsilon_t^2) = \sigma^2 > 0$.
Therefore, the assumptions of Proposition 2.5 are satisfied.
The proof for the general AR(p) follows along the same lines.
A couple of practical issues –
1. What transformation(s), if any, should we make to
a time series before we fit it to an AR model?
We want the series to look like a realization of a
stationary process.
• Use the logged form of the series (especially with
trending series, since the changes in levels will
typically also be growing over time, while the
changes in the logs, which are approximately
percentage changes, will typically be relatively
stable over time).
• Remove the trend. Should we use the linear trend
model or should we use first differences? More
on this in the next section of the course.
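The point about logs can be seen in a small numerical example. The series below is hypothetical (it grows by exactly 5% per period) and is used only to illustrate that changes in levels grow with the series while changes in logs stay constant.

```python
import math

# Hypothetical series growing 5% per period.
levels = [100.0, 105.0, 110.25, 115.7625]

# Changes in levels grow along with the series.
level_changes = [b - a for a, b in zip(levels, levels[1:])]

# Changes in logs are constant and close to the growth rate of 0.05.
log_changes = [math.log(b) - math.log(a) for a, b in zip(levels, levels[1:])]

print(level_changes)
print(log_changes)
```

Each log change equals log(1.05), which is approximately the 5% growth rate.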
2. How do we select the appropriate value for p?
This gets kind of messy. There is no “best way” to
select the appropriate lag length, although there are a
number of sensible ways to go about this process.
Selecting the lag length for the AR(p) model –
The idea is that we want to choose p large enough to
remove any serial correlation in the error term but we
want to choose p small enough so that we are not
including irrelevant regressors. (Why is including
irrelevant regressors a problem?)
There are two approaches to lag length selection –
• Hypothesis testing – sequential t or F tests
• Minimize the AIC or SIC statistic
Sequential testing –
Consider sequential t tests.
First, select the largest “plausible” p, say pmax. (For
quarterly real GDP this might be, say, 6–8.)
Second, fit the AR model using p = pmax and test H0:
$\phi_{pmax} = 0$. If H0 is rejected, set p = pmax. If H0 is not
rejected, redo with p = pmax−1. Continue until H0 is
rejected.
Notes –
1) After you have selected your p, you should check
for serial correlation in the error term (using the
sample autocorrelogram of the residuals) to make
sure that your p is large enough to have removed the
serial correlation in the error process.
2) For any given p, you can only fit the model for t =
p+1,…,T. One way to perform the lag length tests we
have described is to fit all the models using t =
pmax+1,…,T and then, having selected the preferred
lag length, p*, fit the model using t = p*+1,…,T. The
advantage of this approach is that the lag length test
results depend only on varying p rather than varying
p and samples. The alternative approach is to use
the maximum number of observations for each test.
3) As T → ∞, Prob(p* < p) → 0 but Prob(p* > p) →
c > 0. (The probability of underfitting goes to zero,
but there is a positive probability of overfitting even
in large samples.) (Why? If $\phi_p \neq 0$, the t-statistic on
$\hat\phi_p$ satisfies $|t_{\hat\phi_p}| \to \infty$ with probability 1.)
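The sequential t-test procedure can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names, the 1.96 critical value (a two-sided 5% normal test), the degrees-of-freedom divisor in the variance estimate, and the simulated AR(2) data are all assumptions of this example. Every model is fit on the common sample t = pmax+1,…,T.

```python
import numpy as np

def ar_design(y, p, start):
    """Regressors [1, y_{t-1}, ..., y_{t-p}] and target y_t for t = start+1, ..., T."""
    y = np.asarray(y, dtype=float)
    T = y.size
    X = np.column_stack([np.ones(T - start)] + [y[start - j:T - j] for j in range(1, p + 1)])
    return X, y[start:]

def t_stat_last_lag(y, p, start):
    """t-statistic for H0: phi_p = 0 in an AR(p) fit by OLS."""
    X, Y = ar_design(y, p, start)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ Y
    resid = Y - X @ beta
    # Assumption: divide by observations minus regressors (asymptotically immaterial).
    sigma2_hat = resid @ resid / (len(Y) - X.shape[1])
    return beta[-1] / np.sqrt(sigma2_hat * XtX_inv[-1, -1])

def select_lag_sequential(y, pmax, crit=1.96):
    """Start at pmax and drop the last lag until its t-test rejects H0."""
    for p in range(pmax, 0, -1):
        if abs(t_stat_last_lag(y, p, start=pmax)) > crit:
            return p
    return 0

# Hypothetical data: AR(2) with phi_1 = 0.5 and phi_2 = 0.3.
rng = np.random.default_rng(2)
T = 2000
y = np.zeros(T)
for t in range(2, T):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.standard_normal()

p_star = select_lag_sequential(y, pmax=6)
```

In line with note 3, the selected lag should not fall below the true order of 2 in a sample this large, but it can exceed it.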
AIC and SIC –
The AIC (Akaike Information Criterion) and the SIC
(Schwarz Information Criterion) are based on the
following statistics:
$AIC(p, pmax) = \log\left(SSR_p/(T - pmax)\right) + 2(p+1)/(T - pmax)$
$SIC(p, pmax) = \log\left(SSR_p/(T - pmax)\right) + (p+1)\log(T - pmax)/(T - pmax)$
The AR(p) is fit to the data for t = pmax+1,…,T for p
= 0,1,…,pmax. The optimal lag length p* is chosen to
minimize the AIC (or the SIC).
The idea – The AIC and SIC are like adjusted R²’s.
The first term will decrease as p increases since the
SSR will fall as p increases. The second term is a
“penalty term” that increases as p increases.
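A minimal sketch of lag selection by information criteria follows. It assumes the AIC is log(SSR_p/(T − pmax)) + 2(p+1)/(T − pmax) and takes the SIC penalty to be (p+1)·log(T − pmax)/(T − pmax), the scalar analogue of the vector formula given later in these notes. The function names and the simulated AR(2) data are hypothetical.

```python
import numpy as np

def ssr_ar(y, p, pmax):
    """SSR from an OLS fit of AR(p) on the common sample t = pmax+1, ..., T."""
    y = np.asarray(y, dtype=float)
    T = y.size
    X = np.column_stack([np.ones(T - pmax)] + [y[pmax - j:T - j] for j in range(1, p + 1)])
    Y = y[pmax:]
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    return float(resid @ resid)

def information_criteria(y, pmax):
    """Return {p: (AIC, SIC)} for p = 0, ..., pmax."""
    n_obs = len(y) - pmax
    out = {}
    for p in range(pmax + 1):
        log_s2 = np.log(ssr_ar(y, p, pmax) / n_obs)    # first term: falls as p rises
        aic = log_s2 + 2 * (p + 1) / n_obs             # penalty term rises with p
        sic = log_s2 + (p + 1) * np.log(n_obs) / n_obs
        out[p] = (aic, sic)
    return out

# Hypothetical data: AR(2) with phi_1 = 0.5 and phi_2 = 0.3.
rng = np.random.default_rng(2)
T = 2000
y = np.zeros(T)
for t in range(2, T):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.standard_normal()

ic = information_criteria(y, pmax=6)
p_aic = min(ic, key=lambda p: ic[p][0])
p_sic = min(ic, key=lambda p: ic[p][1])
```

Because the SIC penalty per parameter exceeds the AIC penalty once log(T − pmax) > 2, the SIC never selects a longer lag than the AIC on the same sample.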
Notes –
1) So long as T − pmax > 8 (i.e., for all practical
purposes), p*(AIC) ≥ p*(SIC), because log(T − pmax) > 2
makes the SIC penalty larger than the AIC penalty.
2) If $\varepsilon_t$ is independent white noise with finite
fourth moment, the following can be shown:
• p*(SIC) → p as T → ∞
• Prob(p*(AIC) < p) → 0 as T → ∞
• Prob(p*(AIC) > p) → c > 0 as T → ∞
(So why ever use AIC?)
3) The SIC is also called the SBC (Schwarz
Bayesian Criterion) and the BIC (Bayesian
Information Criterion).
Estimating Vector Autoregressions –
Let $y_t = [y_{1t} \dots y_{nt}]'$ evolve according to the
VAR(p) model
$y_t = A_0 + \sum_{s=1}^{p} A_s y_{t-s} + \varepsilon_t$
where $\varepsilon_t$ is n-dimensional white noise and $A_1,\dots,A_p$
satisfy the stationarity condition.
Then, OLS applied equation by equation is
1) (strongly) consistent
2) asymptotically normal
3) asymptotically efficient
For sufficiently large samples, each of the n
equations can be treated as if it is a classical linear
regression model with strictly exogenous regressors
and normally distributed, serially uncorrelated, and
conditionally homoskedastic errors.
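Equation-by-equation OLS for a VAR can be sketched in a few lines of NumPy. The function name `fit_var_ols`, the bivariate VAR(1) used to generate data, and its coefficient values are assumptions for this illustration. Since every equation has the same regressors, one least-squares call with a matrix of left-hand-side variables gives the same coefficients as n separate regressions.

```python
import numpy as np

def fit_var_ols(Y, p):
    """OLS for each equation of y_t = A_0 + A_1 y_{t-1} + ... + A_p y_{t-p} + eps_t.

    Y is a (T, n) array. Returns (A, resid) where A = [A_0 A_1 ... A_p]
    has shape (n, n*p + 1) and resid has shape (T - p, n).
    """
    Y = np.asarray(Y, dtype=float)
    T, n = Y.shape
    # Rows are t = p+1, ..., T; columns are [1, y_{t-1}', ..., y_{t-p}'].
    X = np.column_stack([np.ones(T - p)] + [Y[p - s:T - s] for s in range(1, p + 1)])
    B, *_ = np.linalg.lstsq(X, Y[p:], rcond=None)   # column i holds equation i
    resid = Y[p:] - X @ B
    return B.T, resid

# Hypothetical data: a stationary bivariate VAR(1).
rng = np.random.default_rng(3)
A0 = np.array([1.0, 0.0])
A1 = np.array([[0.5, 0.1], [0.0, 0.3]])
T = 5000
Y = np.zeros((T, 2))
for t in range(1, T):
    Y[t] = A0 + A1 @ Y[t - 1] + rng.standard_normal(2)

A_hat, resid = fit_var_ols(Y, p=1)
```

Here `A_hat[:, 0]` estimates A_0 and `A_hat[:, 1:]` estimates A_1.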
Linear cross-equation restrictions can be tested as
follows. Let $A = [A_0 \; A_1 \dots A_p]$; A is n × (np+1),
so vec(A) is (n²p+n) × 1.
Consider H0: R vec(A) = r, where R is a known q ×
(n²p+n) matrix and r is a known q × 1 vector.
Then, under H0, the (quasi-) likelihood ratio statistic,
$T\,[\log\det\hat\Sigma_R - \log\det\hat\Sigma_U]$
converges in distribution to a $\chi^2(q)$, where
$\hat\Sigma_R = \frac{1}{T}\sum_{t=p+1}^{T} \hat\varepsilon_{t,R}\,\hat\varepsilon_{t,R}'$,
$\hat\varepsilon_{t,R}$ = residual from the “restricted regression”
$\hat\Sigma_U = \frac{1}{T}\sum_{t=p+1}^{T} \hat\varepsilon_{t,U}\,\hat\varepsilon_{t,U}'$,
$\hat\varepsilon_{t,U}$ = residual from the “unrestricted regression”
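As one concrete case, the restriction H0: A_p = 0 (so q = n²) can be tested by comparing a VAR(p−1) with a VAR(p) on the same sample. The sketch below is illustrative only: the function names, the simulated bivariate VAR(1), and the choice of restriction are assumptions of this example. The value 9.488 is the 5% critical value of a χ²(4) distribution.

```python
import numpy as np

def var_resid(Y, p, start):
    """OLS residuals of a VAR(p) fit on observations t = start+1, ..., T."""
    Y = np.asarray(Y, dtype=float)
    T = Y.shape[0]
    X = np.column_stack([np.ones(T - start)] + [Y[start - s:T - s] for s in range(1, p + 1)])
    B, *_ = np.linalg.lstsq(X, Y[start:], rcond=None)
    return Y[start:] - X @ B

def lr_stat_last_lag(Y, p):
    """LR statistic T [log det Sigma_R - log det Sigma_U] for H0: A_p = 0."""
    Y = np.asarray(Y, dtype=float)
    T = Y.shape[0]
    e_u = var_resid(Y, p, start=p)        # unrestricted: VAR(p)
    e_r = var_resid(Y, p - 1, start=p)    # restricted: VAR(p-1), same sample
    sigma_u = e_u.T @ e_u / T
    sigma_r = e_r.T @ e_r / T
    return T * (np.linalg.slogdet(sigma_r)[1] - np.linalg.slogdet(sigma_u)[1])

# Hypothetical data: a stationary bivariate VAR(1), so A_1 != 0 and A_2 = 0.
rng = np.random.default_rng(4)
A1 = np.array([[0.5, 0.1], [0.0, 0.3]])
Y = np.zeros((2000, 2))
for t in range(1, 2000):
    Y[t] = A1 @ Y[t - 1] + rng.standard_normal(2)

CHI2_4_CRIT_5PCT = 9.488                  # q = n**2 = 4 restrictions
lr_lag1 = lr_stat_last_lag(Y, p=1)        # H0: A_1 = 0, false in this design
lr_lag2 = lr_stat_last_lag(Y, p=2)        # H0: A_2 = 0, true in this design
```

The statistic for the first lag should be far above the critical value, while the statistic for the second lag behaves like a draw from a χ²(4).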
Notes –
1) OLS is algebraically equivalent to SUR
because the same regressors appear in
each equation. (Once you start allowing for
different lag lengths in different equations or
different variables in different equations, SUR
will be preferred to OLS, unless the
innovations are uncorrelated across equations.)
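2) We can choose the lag length by applying the
vector versions of the AIC and SIC:
$AIC(p, pmax) = \log\det\hat\Sigma_p + 2(pn^2 + n)/(T - pmax)$
$SIC(p, pmax) = \log\det\hat\Sigma_p + (pn^2 + n)\log(T - pmax)/(T - pmax)$
where
$\hat\Sigma_p = \frac{1}{T - pmax}\sum_{t=pmax+1}^{T} \hat\varepsilon_{t,p}\,\hat\varepsilon_{t,p}'$,
$\hat\varepsilon_{t,p}$ = residual from the VAR(p) model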
We choose p* from p = 0,1,…,pmax to minimize
the AIC (or SIC). The large sample properties of the
AIC and SIC in the VAR(p) case are the same as for
the AR(p) case.
3) How to select the variables that appear in $y_t$?
• (Wiener-Granger-Sims) causality tests
4) Bayesian VARs (BVAR)
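The vector AIC and SIC from note 2 can be computed with a short NumPy routine. This is a sketch under stated assumptions: the residual covariance is normalized by T − pmax (matching the scalar AR case), every VAR(p) is fit on the common sample t = pmax+1,…,T, and the function name and simulated VAR(1) data are hypothetical.

```python
import numpy as np

def var_information_criteria(Y, pmax):
    """Return {p: (AIC, SIC)} for VAR(p), p = 0, ..., pmax."""
    Y = np.asarray(Y, dtype=float)
    T, n = Y.shape
    n_obs = T - pmax
    out = {}
    for p in range(pmax + 1):
        X = np.column_stack([np.ones(n_obs)] + [Y[pmax - s:T - s] for s in range(1, p + 1)])
        B, *_ = np.linalg.lstsq(X, Y[pmax:], rcond=None)
        e = Y[pmax:] - X @ B
        logdet = np.linalg.slogdet(e.T @ e / n_obs)[1]   # log det of Sigma_hat_p
        k = p * n**2 + n                                 # number of coefficients
        out[p] = (logdet + 2 * k / n_obs, logdet + k * np.log(n_obs) / n_obs)
    return out

# Hypothetical data: a stationary bivariate VAR(1).
rng = np.random.default_rng(5)
A1 = np.array([[0.5, 0.1], [0.0, 0.3]])
Y = np.zeros((2000, 2))
for t in range(1, 2000):
    Y[t] = A1 @ Y[t - 1] + rng.standard_normal(2)

ic = var_information_criteria(Y, pmax=4)
p_aic = min(ic, key=lambda p: ic[p][0])
p_sic = min(ic, key=lambda p: ic[p][1])
```

As in the AR(p) case, the SIC choice is never larger than the AIC choice on the same sample.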