DMcKch4: Notes on Davidson-MacKinnon, Chapter 4
Even though Ch.4 deals exclusively with hypothesis testing, this is a rather long
chapter.
Section 4.2 is quite elementary. If you have a reasonable background in statistics, we
may not need to cover most of this material in class. However, precisely because this
material is so elementary, it is important that you understand it thoroughly.
Section 4.3 deals with the normal, chi-squared, Student’s t, and F distributions. None of
this material is very advanced.
The fact that linear combinations of normal variables are themselves normally distributed is very
important, but the proof can be omitted.
The next two sections (page 138 on) deal with exact tests and asymptotic tests,
respectively. Most of the results in Section 4.4 are well known and reasonably
elementary, as is the Chow test, which is introduced here as an example of the F test.
In contrast, even though there are no real proofs, the asymptotic results in Section 4.5
are inevitably a bit more advanced. Here we continue the discussion of laws of large
numbers, which was begun in Section 3.3, and we also introduce central limit theorems.
If you are to understand asymptotic theory at all, this material is essential.
Page 153: The subsection on the t test with predetermined regressors is somewhat
more advanced than the rest of this section, and it can be omitted without loss of
continuity.
p. 155 Section 4.6 contains what is intended to be an accessible introduction to
simulation-based tests, including Monte Carlo tests, which are exact, and bootstrap
tests, which are not. Students who understand the material of this section should be
capable of performing bootstrap tests in regression models, including dynamic
regression models, with or without assumptions about the distribution of the error terms.
The last substantive section of the chapter, Section 4.7 (page 166), deals with test power, a
subject that is often overlooked by applied econometricians. You should understand the
implications of Figure 4.10, which shows power functions for several sample sizes.
Page 122
We must take the randomness of β̂ into account if we are to make inferences about β. In
classical econometrics, the two principal ways of doing this are performing hypothesis tests and
constructing confidence intervals or, more generally, confidence regions.
Section 4.2 is quite elementary. If you have a reasonable background in statistics, we
may not need to cover most of this material in class. However, precisely because this
material is so elementary, it is important that you understand it thoroughly.
Try the ch08ppln.ppt file on my webpage.
The simplest sort of hypothesis test concerns the (population) mean of the distribution from which a random
sample has been drawn. To test such a hypothesis, we may assume that the data are generated by the
regression model (4.01).
Here β̂ is the sample mean.
Page 123
The least-squares estimator of β and its variance, for a sample of size n, are given by (4.02).
Thus, for the model (4.01), the standard formulas β̂ = (XᵀX)⁻¹Xᵀy and Var(β̂) = σ²(XᵀX)⁻¹
yield the two formulas given in (4.02). See page 100.
We wish to test the hypothesis that β = β₀, where β₀ is some specified value of β. The hypothesis that we
are testing is called the null hypothesis; it is given the label H₀ for short. In order to test H₀, we
must calculate a test statistic, which is a random variable that has a known distribution when the
null hypothesis is true and some other distribution when the null hypothesis is false. If the value
of the test statistic is an extreme one that would rarely be encountered by chance under the null,
then the test does provide evidence against the null. If this evidence is sufficiently convincing,
we may decide to reject the null hypothesis that β = β₀.
We will restrict the model (4.01) by making two very strong assumptions. The first is that uₜ is
normally distributed, and the second is that σ is known.
Under the null hypothesis, z must be distributed as N(0,1). It must have variance unity because
of (4.02).
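To make this concrete, here is a minimal R sketch (not from the book) of the z test for the mean model, assuming the statistic takes the familiar form z = √n(β̂ − β₀)/σ with σ known; the data and numbers are made up.
set.seed(1)
n <- 40
sigma <- 1.5                              # assumed known, per the second strong assumption
beta0 <- 0                                # value of beta specified by the null
y <- rnorm(n, mean = 0, sd = sigma)       # data generated under the null
betahat <- mean(y)                        # OLS estimate of beta = the sample mean
z <- sqrt(n) * (betahat - beta0) / sigma  # distributed N(0,1) under the null
z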
Page 124
The statistic z defined in (4.03) has the first property that we would like a test statistic to possess: It has a known
distribution under the null hypothesis.
For every null hypothesis there is, at least implicitly, an alternative hypothesis, which is often
given the label H1. Just as important as the fact that z follows the N(0,1) distribution under the
null is the fact that z does not follow this distribution under the alternative. Suppose that β takes
on some other value, say β₁. Then clearly β̂ = β₁ + v, where v has mean 0 and variance σ²/n;
z is again normally distributed, but no longer with mean 0, and we find from (4.03) that (4.04) holds.
We would expect the mean of z to be large and positive if β₁ > β₀, and large and negative if β₁ <
β₀. We reject the null hypothesis whenever z is sufficiently far from 0.
If the alternative is that β ≠ β₀, we must perform a two-tailed test and reject the null whenever
the absolute value of z is sufficiently large. If instead we were interested in testing the null
hypothesis that β ≤ β₀ against the alternative that β > β₀, we would perform a one-tailed test.
Decide in advance on a rejection rule, according to which we choose to reject the null
hypothesis if and only if the value of z falls into the rejection region of the rule.
Page 125
Rejecting a true null hypothesis is a Type I error. The probability of making such an error is, by construction,
the probability, under the null hypothesis, that z falls into the rejection region. This probability is sometimes
called the level of significance, or just the level, of the test, and is denoted α. Popular values of α include .05
and .01.
Sometimes the distribution of the test statistic under the null hypothesis is known exactly, so that we have what
is called an exact test. Usually, however, it is known only approximately. In this case, we need to draw a
distinction between the nominal level of the test, that is, the Type I error probability we are aiming at,
and the actual rejection probability, which may differ greatly from the nominal level.
The probability that a test rejects the null is called the power of the test. Power depends on
precisely how the data were generated and on the sample size.
Size of a test. Technically, this is the supremum of the rejection probability over all DGPs that
satisfy the null hypothesis. For an exact test, the size equals the level. It is often, but by no means
always, greater than the nominal level of the test.
λ is the non-centrality parameter (the non-zero value of the mean). The figure on page 126 shows the
effect of λ on the power of the test.
Page 126
Mistakenly failing to reject a false null hypothesis is called making a Type II error. The
probability of making such a mistake is equal to 1 minus the power of the test.
To construct the rejection region for a test at level α, the first step is to calculate the critical
value. The critical value c_α is defined implicitly by (4.05).
Here φ denotes the standard normal density and Φ denotes the corresponding cumulative distribution function.
The critical value for a two-tailed test at the .05 level is Φ⁻¹(0.975) = 1.96.
We reject the null if the observed value is more extreme than the critical value in either direction.
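In R, the critical value quoted above can be checked directly:
qnorm(0.975)   # 1.959964, the two-tailed critical value at the .05 level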
P-Values
Page 127
The result of a test is yes or no (accept or reject). A more sophisticated approach to deciding whether
or not to reject the null hypothesis is to calculate the P value, or marginal significance level,
associated with the observed test statistic ẑ.
The P value associated with ẑ is denoted p(ẑ). It means that we incur a probability of Type I error of p(ẑ)
when we REJECT the null.
Equation (4.07) defines the P value for a two-tailed test.
The smallest value of α for which the inequality holds is thus obtained by solving the equation
(7a).
The solution is easily seen to be the right-hand side of (4.07).
Computing a P value transforms z from a random variable with the N(0,1) distribution into a
new random variable p(ẑ) with the uniform U(0,1) distribution. A test at level α rejects whenever p(ẑ) <
α. Generally, one rejects the null when the observed test statistic is large; however, one rejects the
null when the P value is small. This takes some getting used to!
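A small R sketch of these ideas, assuming the two-tailed P value takes the form 2(1 − Φ(|ẑ|)) described above; the numbers are illustrative only.
zhat <- 2.3
2 * (1 - pnorm(abs(zhat)))                 # two-tailed P value, about 0.021
set.seed(4)
p <- 2 * (1 - pnorm(abs(rnorm(100000))))   # P values computed when the null is true
hist(p)                                    # roughly flat: p is approximately U(0,1)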
Page 128
In the downloadable answer to Exercise 4.15, readers are asked to show how to compute P values
for two-tailed tests based on an asymmetric distribution, involving the min function.
Page 129
Section 4.3 deals with the normal, chi-squared, Student’s t, and F distributions.
None of this material is very advanced.
The fact that linear combinations of normal variables are themselves normally distributed is very
important, but the proof can be omitted.
In Exercise 1.8, the PDF of the N(μ, σ²) distribution, evaluated at x, was given. We use lower-case letters to
denote both random variables and the arguments of their PDFs or CDFs.
Page 130
The third central moment, which measures the skewness of the distribution, is always zero for the normal. The
fourth central moment of a symmetric distribution provides a way to measure its kurtosis, which
essentially means how thick the tails are. For the normal, the fourth central moment is 3σ⁴. See Exercise 4.2;
check this numerically in R and see whether rnorm gives a good approximation to the skewness and the fourth
moment (kurtosis) as the sample size increases.
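One possible way to do the suggested R check (my own sketch, not the book's): estimate the third and fourth central moments of rnorm draws and compare them with 0 and 3σ⁴.
set.seed(123)
sigma <- 2
for (n in c(100, 10000, 1000000)) {
  x  <- rnorm(n, mean = 0, sd = sigma)
  m3 <- mean((x - mean(x))^3)    # should approach 0 (no skewness)
  m4 <- mean((x - mean(x))^4)    # should approach 3*sigma^4 = 48
  cat("n =", n, " m3 =", round(m3, 3), " m4 =", round(m4, 3), "\n")
}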
Any linear combination of independent normally distributed random variables is itself normally
distributed.
Page 131
Given the conditional mean and variance we have just computed, we see that the conditional
distribution must be N(b₁z₁, b₂²).
The joint density can also be expressed as in (4.12), but with z₁ and w interchanged, as follows in
(4.14):
We are now ready to compute the unconditional, or marginal, density of w. To do so, we
integrate the joint density (4.14) with respect to z₁, and conclude that the marginal density of w is
f(w) = φ(w).
Page 132
We now consider linear combinations of normal random variables that are not necessarily independent. To do so, we
introduce the multivariate normal distribution. This is a family of distributions for random
vectors, completely characterized by their first two moments.
Start with a set of m mutually independent standard normal variables zᵢ, which we can assemble into a
random m-vector z. Then any m-vector x of linearly independent linear combinations of the zᵢ can
always be written as Az, for some nonsingular m × m matrix A.
We can always find a lower-triangular A such that AAᵀ = Ω. We write this as x ~ N(0, Ω). If we
add an m-vector μ of constants to x, the resulting vector must follow the N(μ, Ω) distribution.
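A minimal R sketch of this construction, using a made-up 2 × 2 covariance matrix Ω: take the lower-triangular Cholesky factor A with AAᵀ = Ω and set x = μ + Az.
set.seed(42)
Omega <- matrix(c(4, 1.5,
                  1.5, 1), nrow = 2)   # an illustrative positive definite covariance matrix
mu <- c(1, -2)
A  <- t(chol(Omega))                   # chol() returns U with t(U) %*% U = Omega, so A is lower triangular
z  <- rnorm(2)                         # z ~ N(0, I)
x  <- mu + A %*% z                     # x ~ N(mu, Omega)
x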
Page 133
If x is any multivariate normal vector with zero covariances, the components of x are mutually
independent. In general, a zero covariance between two random variables does not imply that
they are independent. A nice property of normal variables!
Page 134
Let z₁, …, z_m be mutually independent standard normal random variables, so that z ~ N(0, I) in matrix
notation.
The sum of squares defined in (4.15) is said to follow the chi-squared distribution with m degrees of freedom.
Its mean is m.
The variance of the sum of the zᵢ² is just the sum of the (identical) variances:
by (4.17), the variance of a chi-squared variable with m degrees of freedom is 2m,
using the fact that E(zᵢ⁴) = 3. If y₁ ~ χ²(m₁) and y₂ ~ χ²(m₂), and y₁ and y₂ are independent, then
y₁ + y₂ is chi-squared with df = m₁ + m₂: just add the degrees of freedom!
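A quick simulation check of these moments (illustrative, not from the book): sums of m squared standard normals should have mean near m and variance near 2m.
set.seed(1)
m <- 5
y <- replicate(100000, sum(rnorm(m)^2))   # 100000 draws of a chi-squared(5) variable
mean(y)   # close to m = 5
var(y)    # close to 2m = 10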
Page 135
Many test statistics can be written as quadratic forms in normal vectors, or as functions of such
quadratic forms.
Theorem 4.1 is useful, especially because it says something when projection matrices (the hat matrix P_X, or
the matrix M_X = I − P_X) are involved. We have seen that these are common in regression, so
regression theory is easy once you understand these projection matrices. A quadratic form in a standard
normal vector and such a projection matrix is chi-squared distributed, with df equal to the rank of the
projection (easily counted).
Page 136
If z ~ N(0,1) and y ~ χ²(m), and z and y are independent, then the random variable defined in (4.18) is said
to follow the Student's t distribution with m degrees of freedom. Note the square root in the
denominator of the definition.
Only the first m – 1 moments exist for the t variable. Thus the t(1) distribution, which is also
called the Cauchy distribution, has no moments at all, and the t(2) distribution has no variance.
As m increases, the chance that the denominator of (4.18) is close to zero diminishes (see Figure
4.4), and so the tails become thinner.
Var(t) = m/(m − 2). Thus, as m → ∞, the variance tends to 1.
Since a chi-squared variable can be expressed as a sum of squares of N(0,1) variables, by a law of large
numbers, such as (3.16), y/m, which is the average of the zᵢ², tends to its expectation as m → ∞. The
denominator of (4.18), (y/m)^½, also tends to 1, and hence t → z ~ N(0,1) as m → ∞.
Page 137
If y₁ and y₂ are independent random variables distributed as χ²(m₁) and χ²(m₂), respectively,
then the random variable defined in (4.19) is said to follow the F distribution.
The square of a random variable which is distributed as t(m₂) is distributed as F(1, m₂).
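This relationship is easy to verify numerically in R; for instance, with m₂ = 10 the squared two-tailed t critical value equals the F(1, 10) critical value:
qt(0.975, df = 10)^2          # about 4.9646
qf(0.95, df1 = 1, df2 = 10)   # the same value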
Page 138
The next two sections (page 138 on) deal with exact tests and asymptotic tests,
respectively. Most of the results in Section 4.4 are well known and reasonably
elementary, as is the Chow test, which is introduced here as an example of the F test.
In contrast, even though there are no real proofs, the asymptotic results in Section 4.5
are inevitably a bit more advanced. Here we continue the discussion of laws of large
numbers, which was begun in Section 3.3, and we also introduce central limit theorems.
If you are to understand asymptotic theory at all, this material is essential.
It is assumed in (4.20) that the error vector u is statistically independent of the matrix X. This
idea is expressed in matrix notation in (4.20) by using the multivariate normal distribution. It is fine to
express this independence assumption by saying that the regressors X are exogenous.
Test of a Single Restriction:
Page 139
Under the assumption that the model (4.21) is correctly specified, we can derive the needed distributional result.
This yields a test statistic analogous to (4.03) on p. 123, given in (4.23).
The test statistic zβ₂ defined in (4.23) has exactly the same distribution under the null
hypothesis as the test statistic z defined in (4.03).
In the more realistic case in which the variance of the error terms is unknown, we need to
replace σ in equation (4.23) by s.
Page 140
If, as usual, M_X is the orthogonal projection on to S⊥(X), then we obtain the test statistic.
This test statistic is distributed as t(n−k) under the null hypothesis. Not surprisingly, it is called a
t statistic.
(4.25) can be rewritten as (4.26).
Under any DGP that belongs to (4.21) we write (4.27).
We saw above that M_X y = M_X u. The n × n matrix of covariances of the components of P_X u and M_X u
is thus given by the last equation on page 140.
Page 141
Even though the numerator and denominator of (4.26) both depend on y, this orthogonality
implies that they are independent.
We just have to use the t(n-k) distribution instead of the N(0,1) distribution to compute P values
or critical values.
What if there are several restrictions?
Tests of Several Restrictions:
Either β₂ is zero, or we have (4.28).
We want to compute a single test statistic for all the k₂ restrictions at once.
The SSR from the restricted model (4.29) cannot be smaller, and is almost always larger, than the SSR
from the unrestricted model (4.28). It seems natural to base a test statistic on the difference
between these two SSRs. See (4.30).
Page 142
The restricted SSR is yᵀM₁y, and the unrestricted one is yᵀM_X y. One way to obtain a convenient
expression for the difference between these expressions is to use the FWL Theorem.
Under the null hypothesis, M_X y = M_X u and M₁y = M₁u. Thus, under this hypothesis, the F
statistic (4.33) reduces to (4.34),
where, as before, ε = u/σ. The random variables in the numerator and denominator are independent,
because M_X and P_{M₁X₂} project on to mutually orthogonal subspaces: M_X M₁X₂ = M_X(X₂ −
P₁X₂) = O. Thus the statistic (4.34) follows the F(r, n − k) distribution.
Page 143
A Threefold Orthogonal Decomposition
The threefold orthogonal decomposition is (4.37).
We use a tilde (˜) to denote the restricted estimates, and a hat (ˆ) to denote the unrestricted estimates.
β̂₂ is a subvector of the estimates from the unrestricted model. Finally, M_X y is the vector of
residuals from the unrestricted model.
In (4.38) there are two hats on the left side and one tilde and one hat on the right side.
The F statistic (4.33) can be written as the ratio of the squared norm of the second component in
(4.37) to the squared norm of the third, each normalized by the appropriate number of degrees of
freedom.
F test serves to detect the possible presence of systematic variation, related to X2, in the second
component of (4.37).
We want to reject the null whenever the numerator of the F statistic, RSSR – USSR, is relatively
large.
Page 144
Thus we compute the P value as if for a one-tailed test. However, F tests are really two-tailed
tests.
The square of the t statistic tβ₂ defined in (4.25) is given in the equation just before (4.39).
This test statistic is evidently a special case of (4.33), with the vector x₂ replacing the matrix X₂.
An Example of F test:
The most familiar application of the F test is testing the hypothesis that all the coefficients in a
classical normal linear model, except the constant term, are zero.
The last equation on page 144 shows that the F statistic (4.40) depends on the data only through the
centered R², of which it is a monotonically increasing function.
Page 145
The Chow test of the equality of two parameter vectors appears in eq. (4.43) below.
It is often natural to divide a sample into two, or possibly more than two, subsamples. These
might correspond to periods of fixed exchange rates and floating exchange rates. We may then
ask whether a linear regression model has the same coefficients for both the subsamples. It is
natural to use an F test for this purpose.
A good example in Greene’s text is demand for gasoline before and after the Arab oil embargo
of 1973. Clearly the gasoline demand changed due to long lines at the gas pump, and high prices
and energy efficiency requirements on Detroit cars imposed by the US Congress.
Two subsamples, of lengths n1 and n2, with n = n1 + n2. We will assume that both n1 and n2 are
greater than k, the number of regressors. Now partition y and X into two parts.
Write (4.41) and use it to define the matrix Z, and then write (4.42).
Equation (4.42) is a regression model with n observations and 2k regressors. It is constructed in such a way
that β₁ is estimated directly, while β₂ is estimated using the relation β₂ = γ + β₁. The restriction that
β₁ = β₂ is equivalent to the restriction that γ = 0 in (4.42). Since (4.42) is just a classical normal
linear model with k linear restrictions to be tested, the F test provides the appropriate way to test
those restrictions. The null hypothesis in the oil embargo application is that the gas demand
model remained unchanged before and after the embargo.
There is another way to compute the USSR. In Exercise 4.11, readers are invited to show that it
is simply the sum of the two SSRs obtained by running two independent regressions on the two
subsamples. The F statistic becomes (4.43). This is the Chow statistic.
Page 146
(4.43) is distributed as F(k, n − 2k) under the null hypothesis that β₁ = β₂.
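A sketch of the Chow computation in R (illustrative data; the DGP has no break, so the statistic should be small): the USSR is the sum of the two subsample SSRs, and the restricted SSR comes from the pooled regression.
set.seed(3)
n1 <- 60; n2 <- 40; n <- n1 + n2
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)                  # same coefficients in both subsamples
k <- 2                                       # constant plus one slope
RSSR <- sum(resid(lm(y ~ x))^2)              # pooled (restricted) regression
USSR <- sum(resid(lm(y[1:n1] ~ x[1:n1]))^2) +
        sum(resid(lm(y[(n1+1):n] ~ x[(n1+1):n]))^2)
Fchow <- ((RSSR - USSR) / k) / (USSR / (n - 2 * k))
pf(Fchow, k, n - 2 * k, lower.tail = FALSE)  # a large P value is expected, since there is no break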
Large Sample Tests:
The tests that we developed in the previous section are exact only under the strong assumptions of
the classical normal linear model. If the error vector were not normally distributed or not
independent of the matrix of regressors, we could still compute t and F statistics, but they would
not actually follow their namesake distributions in finite samples. In many cases, however, they
approximately follow known distributions in large samples.
Asymptotic theory
Asymptotic theory gives us results about the distributions of t and F statistics under much weaker
assumptions than those of the classical normal linear model.
A law of large numbers (LLN) may apply to any quantity that can be written as an average of
n random variables, that is, 1/n times their sum.
Page 147
A fairly simple LLN assures us that, as n → ∞, x̄ tends to μ: the sample mean approaches the
population mean as n tends to infinity.
The empirical distribution defined by this sample is the discrete distribution that puts a weight
of 1/n at each of the xt, t = 1,…,n.
I(·) is the indicator function, which takes the value 1 when its argument is true and takes the
value 0 otherwise. (4.44) thus gives the proportion of realizations x_t that are smaller than or equal to x.
The EDF has the form of a step function: The height of each step is 1/n, and the width is equal to
the difference between two successive values of xt.
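In R, the EDF is available directly as a step function; a small sketch:
set.seed(5)
x <- rnorm(25)
Fhat <- ecdf(x)                              # the EDF: steps of height 1/n
Fhat(0)                                      # proportion of observations <= 0; compare pnorm(0) = 0.5
plot(Fhat)                                   # step function, as described above
curve(pnorm(x), add = TRUE, col = "red")     # true standard normal CDF for comparison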
Page 148
These may be compared with the CDF of the standard normal distribution in the lower panel of
Figure 4.2.
Thus (4.44) is the mean of n IID random terms, each with finite expectation. The simplest of all
LLNs (due to Khinchin) applies to such a mean, and we conclude that, for every x, F̂(x) is a
consistent estimator of F(x).
If we can apply an LLN to a random average, we can treat it as a nonrandom quantity for the
purpose of asymptotic analysis. The matrix n⁻¹XᵀX, under many plausible assumptions about
how X is generated, tends to a nonstochastic limiting matrix S_XᵀX as n → ∞.
Central limit theorems are crucial in establishing the asymptotic distributions of estimators and
test statistics. In many circumstances, 1/√n times the sum of n centered random variables
approximately follows a normal distribution when n is sufficiently large.
Page 149
It may seem curious that we divide by √n instead of by n in (4.45), but this is an essential feature
of every CLT. To see why, we calculate the variance of z.
Whenever we want to use a CLT, we must ensure that a factor of n^(−1/2) = 1/√n is present.
The assumption that the xt are identically distributed is easily relaxed, as is the assumption that
they are independent. However, if there is either too much dependence or too much
heterogeneity, a CLT may not apply. A CLT says that, for a sequence of random variables x_t, t =
1, 2, …, with E(x_t) = 0, a quantity of the form (4.45) is approximately normally distributed.
There are also multivariate versions of CLTs. Suppose that we have a sequence of random m-vectors x_t, for
some fixed m, with E(x_t) = 0, where each Var(x_t) is an m × m matrix. Then the appropriate multivariate
version of a CLT tells us that 1/√n times the sum of the x_t is approximately multivariate normal.
Figure 4.7 illustrates the fact that CLTs often provide good approximations.
Page 150
The top panel is for the uniform density and the lower panel is for the chi-squared. Both converge to
normality by a process of averaging over n draws from these densities, but the convergence is much faster,
i.e., occurs at a lower value of n, when the underlying density is the uniform.
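An illustrative simulation of this comparison (my own sketch): standardized means of uniform and of chi-squared(1) draws, for the same small n.
set.seed(8)
std_mean <- function(rdraw, mu, sigma, n, reps = 10000)
  replicate(reps, sqrt(n) * (mean(rdraw(n)) - mu) / sigma)
z_unif <- std_mean(runif, mu = 0.5, sigma = sqrt(1/12), n = 8)                      # close to N(0,1) already
z_chi  <- std_mean(function(n) rchisq(n, df = 1), mu = 1, sigma = sqrt(2), n = 8)   # still visibly skewed
c(mean(z_unif), sd(z_unif), mean(z_chi), sd(z_chi))
# qqnorm(z_chi) shows the slower convergence for the skewed chi-squared case.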
ASYMPTOTIC t and F TESTS are actually valid under much weaker conditions than
normality of the errors.
How do we show this?
Weaken the conditions, then find the limit of the t or F statistic as n → ∞ by
applying the CLT and LLN.
For example, let us assume IID errors, where the error terms are drawn from some specific but
unknown distribution, as in eq. (4.47).
Page 151
When we use IID errors, we abandon the assumption of exogenous regressors and replace it with
assumption (3.10). The conditional means and variances in the new setup are given in (4.48).
From the point of view of the error terms, (4.48) says that they are innovations. From the point of
view of the explanatory variables X_t, assumption (3.10) says that they are predetermined with
respect to the error terms.
In general, the matrix XᵀX → ∞ as the sample size n → ∞, because each term in the k × k matrix increases
beyond bound. However, the average of each term does not necessarily diverge to ∞. This is stated as
an additional assumption, (4.49), in order to be able to use asymptotic results,
where S_XᵀX is a finite, deterministic, positive definite matrix. Condition (4.49) is violated in many
cases. For example, it cannot hold if one of the columns of the X matrix is a linear time trend,
because Σt² grows at a rate faster than n.
Here the numerator and denominator have both been multiplied by n^(−1/2). It follows from the
consistency of s² that the first factor in (4.50) tends to 1/σ₀ as n → ∞.
If the data are generated by (4.47) with β₂ = 0, we have M₁y = M₁u, and so (4.50) is
asymptotically equivalent to (4.51).
Page 152
Recall that we want to
weaken the conditions and find the limit of the t statistic as n → ∞ by applying
the CLT and LLN.
We saw that (4.51) is asymptotically equivalent to (4.50); we now show what its limit is.
The numerator of (4.51) is n^(−1/2) times a weighted sum of terms that each have mean 0, and its
variance does not depend on X. Hence the conditional variance given X is also the unconditional
variance when knowledge of X is missing.
Thus the numerator of (4.51) evidently has mean 0 and variance 1, and asymptotically it is N(0,1).
The denominator can be treated as nonrandom under the null.
Under the null with exogenous regressors, we have (4.52), where
the notation "~a" means that tβ₂ is asymptotically distributed as N(0,1).
This result (4.52) justifies the use of the t test beyond the standard assumptions of the normal regression
model.
This subsection provides an excellent example of how asymptotic theory works: how to apply an
LLN and a CLT to the expression for the t statistic in a regression.
We begin by
pulling a k-vector v out of thin air and considering its limit, which by a CLT is multivariate normal.
We will then consider subvectors of this v and submatrices S₁₁, S₁₂, S₂₂ of S_XᵀX.
Page 153
We applied an LLN in reverse to go from the first line to the second.
Recall that (4.51) is equivalent to the t statistic.
Consider the numerator of (4.51). After considerable manipulation, it can be written as (4.55).
The first term of this expression is just the last, or kth, component of v, which we can denote by
v₂.
The subsection on the t test with predetermined regressors is somewhat more
advanced than the rest of this section, and it can be omitted without loss of continuity.
Page 154
But the key idea of p. 154 is that even if the regressors are not exogenous but merely predetermined,
asymptotic theory can be used to justify the usual t test on regression coefficients.
The denominator of the t statistic in (4.51) is easier to analyze.
In the limit, all the pieces of this expression become submatrices of S_XᵀX.
And we have:
Numerator → a normal variable with mean zero.
Denominator → the standard deviation of that normal variable.
For example, if y ~ N(0, σ²), we know that y/σ ~ N(0,1).
This is what happens to the t statistic.
Hence the t statistic tends to N(0,1) asymptotically.
With regressors that are not necessarily exogenous but merely predetermined, we still have the
useful asymptotic result that tβ₂ ~a N(0,1).
ASYMPTOTIC F TESTS
A similar analysis can be performed for the F statistic (4.33).
See (4.59), which says that r times the F statistic tends to a chi-squared variable with df = r, where r
denotes the degrees of freedom of the numerator of F.
Results (4.52) and (4.59) justify the use of t and F tests outside the confines of the classical normal linear
model.
These P values from asymptotic tests are approximate, and one worries that tests based on them
are not exact in finite samples, in the sense that they may over-reject (or under-reject) in small samples.
If they over-reject, the conclusions may end up being too liberal.
Page 155
Whether they overreject or underreject, and how severely, depends on many things, including the
sample size, the distribution of the error terms, the number of regressors and their properties, and
the relationship between the error terms and the regressors.
SIMULATION-BASED TESTS
p. 155 Section 4.6 contains what is intended to be an accessible introduction to
simulation-based tests, including Monte Carlo tests, which are exact, and bootstrap
tests, which are not. Students who understand the material of this section should be
capable of performing bootstrap tests in regression models, including dynamic
regression models, with or without assumptions about the distribution of the error terms.
In the usual tests, we assumed that the distribution of the statistic under the null hypothesis was not only
(approximately) known, but also exactly the same for all DGPs contained in the null hypothesis.
Consider instead a compound hypothesis, which is represented by a model that contains more than one
DGP. Now the test statistic has different distributions under the different DGPs contained in the
model. This creates a practical problem for econometrics: the loss of the pivotal property.
A random variable with the property that its distribution is the same (say normal) for all DGPs
in a model M is said to be pivotal. There is no pivotalness problem for simple hypotheses (one DGP at a
time), but there is a problem for compound hypotheses.
A pivot is a statistic whose sampling distribution does not depend on unknown parameters.
(β̂ − β)/SE(β̂) is a good example of a pivotal statistic. Its distribution is Student's t, and it does
not depend on the unknown parameters.
Page 156
One can use asymptotic tests for compound hypotheses. The price that large-sample asymptotic
tests pay for this added generality is that the t and F statistics now have distributions that depend on
things like the error distribution: they are therefore not pivotal statistics in finite samples.
However, their asymptotic distributions are independent of such things across DGPs, and they are
said to be asymptotically pivotal.
For any pivotal test statistic, the P value can be estimated by simulation to any desired level of
accuracy.
The Fundamental Theorem of Statistics says: the empirical distribution of a set of independent drawings of
a random variable generated by some DGP converges to the true CDF of the random variable
under that DGP. This is just as true of simulated drawings.
Empirical CDF → true CDF.
If we knew that a certain test statistic was pivotal but did not know how it was distributed, we
could select any DGP in the null model and generate simulated samples from it.
The TRICK: use simulation to get the empirical CDF of the statistic and apply the Fundamental Theorem.
Suppose that we have computed a test statistic τ̂. The P value for a test based on τ̂ is given by (4.60).
Note that the P value is the probability of observing a value as extreme as, or even more extreme than, the
observed value. This probability is the tail area beyond the observed value.
Let capital F denote the cumulative distribution function (CDF).
F(τ̂) gives the CDF evaluated at the observed value τ̂; it is the area to the left of τ̂. We do not want the area
to the left, we want the tail area; hence we subtract from 1 in eq. (4.60) to get the P value. The observed P
value is denoted by lower-case p.
This P value can be estimated if we can estimate the CDF F evaluated at τ̂.
The procedure is very general.
Page 157
How do we compute P values in a simulation?
Choose any DGP in M, and draw B (= 999, say) samples of size n from it. Denote the simulated
samples as y*_j, j = 1, …, B. The star (*) notation will be used systematically to denote quantities
generated by simulation.
The proportion of simulations for which the statistic τ*_j is less than or equal to τ̂ can be readily calculated.
Just count the number of times the simulated value exceeds the observed value;
see eq. (4.62).
Since the EDF converges to the true CDF, it follows that, if B were infinitely large, this
procedure would yield an exact test.
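A hedged R sketch of such a simulated P value (the DGP, statistic, and names are made up): because the t ratio for a mean is pivotal under normal errors, its null distribution can be simulated from any convenient DGP in the null model.
set.seed(2024)
n <- 20; B <- 999
tstat <- function(y) sqrt(length(y)) * mean(y) / sd(y)   # test statistic for H0: mean = 0
y <- rnorm(n, mean = 0.4)                 # the "observed" data
tau_hat  <- tstat(y)
tau_star <- replicate(B, tstat(rnorm(n))) # B simulated statistics under a DGP in the null
p_star   <- mean(tau_star > tau_hat)      # simulated upper-tail P value, as in (4.62)
p_star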
Simulating a pivotal statistic in this way is called a Monte Carlo test; Dufour and Khalaf (2001) provide a
more detailed introduction and references. Simulation experiments in general are often referred
to as Monte Carlo experiments.
An RNG is a program for generating random numbers.
Page 158
Drawings from the uniform U(0,1) distribution can then be transformed into drawings from
other distributions. Fortunately, we do not have to do this explicitly if we have R software.
The RNG in (4.63) starts with a (generally large) positive integer z₀ called the seed,
multiplies it by λ, and then adds c to obtain an integer that may well be bigger than m. It then applies
the modulo function to bring the result back into the desired range.
In R this is done easily. If I want to choose 12 questions from a list of 50 to ask on a
final exam, I can use uniform random variables between 1 and 50 which are rounded to
integers. I type
sort(round(runif(12, min=1, max=50)))
and get 3 7 8 10 16 22 30 31 40 49 49 50. (Note that rounded uniform draws can repeat, as 49 does here;
sample(1:50, 12) would draw 12 distinct question numbers.)
How well or badly this procedure works depends on how λ, m, and c are chosen. Set c = 0 and
use for m a prime number that is either a little less than 2³² or a little less than 2³¹. When λ and
m are chosen properly with c = 0, the RNG has a
period of m − 1. This means that it generates every rational number with denominator m
between 1/m and (m − 1)/m precisely once until, after m − 1 steps, z₀ comes up again. After that,
the generator repeats itself.
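A minimal R sketch of such a generator, with the classic choices λ = 16807, m = 2³¹ − 1, c = 0 used purely as an illustration (these particular constants are not taken from the book):
lcg <- function(n, seed = 12345, lambda = 16807, m = 2^31 - 1, c = 0) {
  z <- numeric(n)
  z_prev <- seed
  for (i in 1:n) {
    z_prev <- (lambda * z_prev + c) %% m   # exact in double precision, since lambda*m < 2^53
    z[i] <- z_prev
  }
  z / m                                    # map the integers to the (0, 1) interval
}
lcg(5)                                     # five pseudo-random U(0,1) draws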
Page 159
Most test statistics in econometrics are not pivotal. The vast majority of them are, however,
asymptotically pivotal. The bootstrap is for the non-pivotal cases, and for cases in which the underlying
distribution is not known to have any standard form.
It is then necessary to estimate a bootstrap DGP from which to draw the simulated samples. The DGP
that generated the original data is unknown, and so it cannot be used to generate simulated data.
The bootstrap DGP is an estimate of the unknown true DGP.
There are many ways to specify the bootstrap DGP.
Page 160
We will take for our example a linear regression model with normal errors, but with a lagged
dependent variable among the regressors.
(4.65) is a fully specified parametric model, which means that each set of values for the regression
coefficients and σ² defines just one DGP. The simplest type of bootstrap DGP for fully specified models is
given by the parametric bootstrap.
Draw an n-vector u* from the N(0, s̃²I) distribution.
Since lagged values are present, use a recursive method; see (4.67).
The value y₁* determined by the first equation is put into the right-hand side of the second equation,
and so on sequentially.
The bootstrap sample from (4.67), y₁*, y₂*, …, y_n*, is of course CONDITIONAL on the
observed initial value y₀ when such a recursive scheme is used.
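A hedged sketch of the recursive step in R, assuming (only for illustration) a dynamic regression of the form y_t = β₁ + β₂x_t + δy_{t−1} + u_t with restricted estimates and s̃ already in hand; the function and variable names are made up.
simulate_recursive <- function(x, y0, beta1, beta2, delta, s_tilde) {
  n <- length(x)
  u_star <- rnorm(n, mean = 0, sd = s_tilde)   # parametric bootstrap errors
  y_star <- numeric(n)
  y_lag  <- y0                                 # condition on the observed initial value
  for (t in 1:n) {
    y_star[t] <- beta1 + beta2 * x[t] + delta * y_lag + u_star[t]
    y_lag <- y_star[t]                         # feed y*_t into the equation for period t + 1
  }
  y_star
}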
Page 161
RECALL THE PROBLEM at hand:
Let τ̂ denote the value of the t statistic obtained from the data, before any bootstrapping.
Note that we want to know if it is significant, but we are not sure that its underlying sampling
distribution is Student's t. So we are going to create B = 999 bootstrap (simulated) values τ*_j,
which will follow whatever density they do follow. We are not worried if it is not Student's t.
All we want to know is the tail area, or P value, based on these 999 realizations.
For each of the B bootstrap samples, denoted by a vector y*, a bootstrap test statistic τ*_j is
computed. The bootstrap P value p*(τ̂) is then computed by formula (4.62).
We simply count the number of times τ*_j exceeds τ̂ out of the 999 attempts, divide by 999 and,
bingo, we have the P value we want to know, without bothering with asymptotic distribution
theory.
Under the null hypothesis, the OLS residual vector ũ for the restricted model is a consistent
estimator of the error vector u: for each t, the plim of ũ_t is the true u_t.
From the Fundamental Theorem of Statistics, we know that the empirical (cumulative)
distribution function of the residuals is a consistent estimator of the unknown CDF of the error
distribution:
the ECDF of ũ converges to the true CDF of u,
since, after all, the residuals consistently estimate the true errors.
Each bootstrap sample contains some of the residuals exactly once, some of them more than
once, and some of them not at all.
Page 162
Suppose that, when forming one of the bootstrap samples, the ten drawings from the U(0,1)
distribution happen to be those shown in the middle of page 162.
This implies that, after rounding to integers, the ten index values are those given there:
they are 7, 3, 8, and so on.
They mean that we select the 7th, 3rd, 8th, … data points to go into the bootstrap resample.
Using (4.68) with the empirical CDF of the observed regression residuals is called the nonparametric
bootstrap, although strictly speaking it is semiparametric, since it uses the estimated regression coefficients
(the tilde estimates) on the right-hand side before computing the starred values on the left-hand side.
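A small sketch of residual resampling in R (illustrative stand-ins for the restricted residuals): drawing indices with replacement is exactly what produces the pattern described in the next sentence.
set.seed(99)
n <- 10
u_tilde <- rnorm(n)                                # stand-in for the restricted OLS residuals
idx     <- sample(1:n, size = n, replace = TRUE)   # indices drawn with replacement
u_star  <- u_tilde[idx]                            # resampled errors for one bootstrap sample
table(idx)                                         # shows repeated and omitted indices
# y* is then built from the restricted fitted values plus u_star (recursively, as in (4.67), if lags are present).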
The empirical distribution of the residuals may fail to satisfy some of the properties that the null
hypothesis imposes on the true error distribution.
Page 163
One case in which this failure has grave consequences arises when the regression (4.65), with regressors X_t
and the lagged dependent variable y_{t−1}, is forced through the origin, because then the sample mean of the
residuals is not, in general, equal to 0. {Beware of forcing regressions through the origin, i.e., eliminating the
intercept.}
The variance of the empirical distribution of the residuals is s̃²(n − k₁)/n. We can still draw from a
distribution with variance s̃². All we have to do is rescale the residuals: the rescaled residuals are obtained by
multiplying the OLS residuals by (n/(n − k₁))^½.
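As a one-line R illustration (the names u_tilde and k1 are hypothetical), the rescaling just described is:
rescale <- function(u_tilde, k1) u_tilde * sqrt(length(u_tilde) / (length(u_tilde) - k1))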
If the distribution of the error terms displays substantial skewness (that is, a nonzero third
moment) or excess kurtosis (that is, a fourth moment greater than 3σ₀⁴), then the same may be true
of the rescaled residuals. [The double bootstrap can be used in that case; see "Implementing the Double
Bootstrap," Vinod and McCullough, Computational Economics, Vol. 10, 1-17, 1997,
or "Double Bootstrap for Shrinkage Estimators," Journal of Econometrics, Vol. 68(2), 1995, pp.
287-302.]
Page 164
Suppose that α = .05 and B = 99. Then there are 5 out of 100 values of r, namely, r = 0,1,…,4,
that would lead us to reject the null hypothesis. We suggest choosing B = 999.
Page 165
Bootstrap tests generally perform better than tests based on approximate asymptotic
distributions. The errors committed by both asymptotic and bootstrap tests diminish as n
increases, but those committed by bootstrap tests diminish more rapidly.
Figure 4.8 p. 166, shows the rejection frequencies based on 500,000 replications for each of 31
sample sizes: n = 10, 12, 14,…,60.
The results of this experiment are striking. The asymptotic test over-rejects quite noticeably,
although it gradually improves as n increases.
Page 166
In contrast, in Fig. 4.8, two bootstrap tests over-reject only very slightly. Their rejection
frequencies are always very close to the nominal level of .05, and they approach that level quite
quickly as n increases.
The last substantive section of the chapter, Section 4.7 (page 166), deals with
test power, a subject that is often overlooked by applied econometricians. You should
understand the implications of Figure 4.10, which shows power functions for several
sample sizes.
To study the power of a test, we need to know how it behaves when the null
hypothesis fails to hold. The null is often that a mean is zero; when the null fails, the mean becomes
nonzero. Then we need noncentral t and F densities, which in turn depend on the noncentral chi-squared
density. See Figure 4.9 on page 168 for noncentral chi-squared distributions.
The distribution under the null is quite different from the distribution under the alternative.
When the NULL SHOULD BE REJECTED, and the DGP places most of the probability mass
of the test statistic in the rejection region of a test, the test has high power, that is, a high probability
of rejecting the null. POWER TO REJECT WHAT SHOULD BE REJECTED.
Page 167
We would surely prefer to employ the more powerful test statistic.
We study the power of exact and bootstrap tests in this section.
We need to focus on the difference between what happens under the null and under the alternative.
Hence, for exact tests, the power must come from the numerator of (4.34) on page 142 alone.
Use these facts to determine the distribution of the quadratic form (4.71). To do so, we must
introduce the noncentral chi-squared distribution. The noncentrality parameter is Λ = μᵀΩ⁻¹μ:
if x ~ N(μ, Ω), then xᵀΩ⁻¹x follows the noncentral chi-squared distribution χ²(m, Λ) with Λ = μᵀΩ⁻¹μ.
Under the null, β₂ = 0.
The last line on page 167 involves β₂, since it is now nonzero under the alternative hypothesis.
Page 168
Under the null, Λ = 0. The F statistic therefore has a distribution that we can write as
a ratio of two chi-squared variables, each divided by its degrees of freedom, i.e., an F variable,
where the numerator chi-squared is noncentral, with Λ stated explicitly: χ²(r, Λ)/r.
As Λ increases, the distribution moves to the right and becomes more spread out. This is
illustrated in Figure 4.9 on page 168.
Page 169
At any given level, the critical value (the dividing line between accept and reject) of a chi-squared or F test
increases (moves to the right) as the degrees of freedom r increase.
The distribution given just after (4.73) is known as the noncentral t distribution, with (n − k) degrees of
freedom and noncentrality parameter λ. It is written as t(n−k, λ). Note that λ² = Λ.
When we know the distribution of a test statistic under the alternative hypothesis, we can
determine the power of a test of given level α as a function of the (UNKNOWN) parameters of
that hypothesis.
This function is called the power function of the test. The power of the t test depends only on this ratio,
i.e., on the noncentrality parameter. The power function is generally unknown; it is hypothesized and
analyzed for theoretical discussion in the context of comparing different testing methods.
Test method 1 has higher power than test method 2 ⇒ reject (i.e., do not use) method 2.
Page 170
Since the test is exact, all the power functions are equal to α = 0.05 when the null hypothesis is
valid, i.e., when the parameter discrepancy is 0 (see the figure on page 170).
Power then increases as the discrepancy moves away from 0.
Recall that power is the probability of rejection when the null should be rejected.
The power when n = 400 exceeds the power when n = 100, which in turn exceeds the power when n
= 25, for every nonzero value of the discrepancy. The foot of the vertical axis is at 0.05 and the head is at 1.
The horizontal axis runs over the range −1 to 1.
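A hedged R sketch of such power curves, assuming (as an illustration, not the book's exact formula) that the noncentrality parameter of the t test is λ = √n·δ, where δ is the discrepancy plotted on the horizontal axis:
power_t <- function(delta, n, k = 2, alpha = 0.05) {
  df   <- n - k
  crit <- qt(1 - alpha / 2, df)
  ncp  <- sqrt(n) * delta                                   # assumed form of the noncentrality parameter
  1 - pt(crit, df, ncp = ncp) + pt(-crit, df, ncp = ncp)    # two-tailed rejection probability
}
delta <- seq(-1, 1, by = 0.01)
plot(delta, power_t(delta, 400), type = "l", ylim = c(0, 1), ylab = "power")
lines(delta, power_t(delta, 100))
lines(delta, power_t(delta, 25))
# All three curves equal 0.05 at delta = 0 and rise toward 1, with larger n giving higher power.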
What about the power when the test is NOT EXACT, such as when we use the bootstrap?
Recall that the bootstrap P value is a tail area;
it is only estimated from a random bootstrap, not exact.
But as B (999, or whatever number is typical in a bootstrap) → ∞, we need not worry, in terms of plims.
Page 171
The bootstrap testing procedure discussed in Section 4.6 incorporates this random variation, and in
so doing it reduces the power of the test.
But a z test based on N(0,1) can be MORE powerful than a t test if the bootstrap is used.
This is somewhat counterintuitive, since one would think that the t test is more conservative, i.e., less
willing to reject the null. The result arises because the denominator of the t test is itself
computed by the bootstrap, which creates additional variability in the t ratio and renders it not
conservative at all.
Page 172
Power loss is very rarely a problem when B = 999, and it is never a problem when B =
9,999.