DMcKch4: Notes on Davidson-MacKinnon, Chapter 4
Even though Ch.4 deals exclusively with hypothesis testing, this is a rather long
chapter.
Section 4.2 is quite elementary. If you have a reasonable background in statistics, we
may not need to cover most of this material in class. However, precisely because this
material is so elementary, it is important that you understand it thoroughly.
Section 4.3 deals with the normal, chi-squared, Student’s t, and F distributions. None of
this material is very advanced.
The fact that linear combinations of normal variables are themselves normally distributed is very
important, but the proof can be omitted.
The next two sections (page 138 on) deal with exact tests and asymptotic tests,
respectively. Most of the results in Section 4.4 are well known and reasonably
elementary, as is the Chow test, which is introduced here as an example of the F test.
In contrast, even though there are no real proofs, the asymptotic results in Section 4.5
are inevitably a bit more advanced. Here we continue the discussion of laws of large
numbers, which was begun in Section 3.3, and we also introduce central limit theorems.
If you are to understand asymptotic theory at all, this material is essential.
Page 153: The subsection on the t test with predetermined regressors is somewhat
more advanced than the rest of this section, and it can be omitted without loss of
continuity.
p. 155 Section 4.6 contains what is intended to be an accessible introduction to
simulation-based tests, including Monte Carlo tests, which are exact, and bootstrap
tests, which are not. Students who understand the material of this section should be
capable of performing bootstrap tests in regression models, including dynamic
regression models, with or without assumptions about the distribution of the error terms.
The last substantive section of the chapter, Section 4.7 (page 166), deals with test power, a
subject that is often overlooked by applied econometricians. You should understand the
implications of Figure 4.10, which shows power functions for several sample sizes.
Page 122
We must take the randomness of β̂ into account if we are to make inferences about β. In
classical econometrics, the two principal ways of doing this are performing hypothesis tests and
constructing confidence intervals or, more generally, confidence regions.
Section 4.2 is quite elementary. If you have a reasonable background in statistics, we
may not need to cover most of this material in class. However, precisely because this
material is so elementary, it is important that you understand it thoroughly.
Try the ch08ppln.ppt file on my webpage.
The simplest sort of hypothesis test concerns the (population) mean of the distribution from which a random
sample has been drawn. To test such a hypothesis, we may assume that the data are generated by the
regression model (4.01).
Here β̂ is the sample mean.
Page 123
The least-squares estimator of β and its variance, for a sample of size n, are given by (4.02).
Thus, for the model (4.01), the standard formulas β̂ = (XᵀX)⁻¹Xᵀy and Var(β̂) = σ²(XᵀX)⁻¹
yield the two formulas given in (4.02). See page 100.
We wish to test the hypothesis that β = β₀, where β₀ is some specified value of β. The hypothesis that we
are testing is called the null hypothesis; it is given the label H₀ for short. In order to test H₀, we
must calculate a test statistic, which is a random variable that has a known distribution when the
null hypothesis is true and some other distribution when the null hypothesis is false. If the value
of the test statistic is an extreme one that would rarely be encountered by chance under the null,
then the test does provide evidence against the null. If this evidence is sufficiently convincing,
we may decide to reject the null hypothesis that β = β₀.
We will restrict the model (4.01) by making two very strong assumptions. The first is that uₜ is
normally distributed, and the second is that σ is known.
Under the null hypothesis, z must be distributed as N(0,1). It must have variance unity because
of (4.02).
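To make this concrete, here is a minimal R sketch (not from the book) of the z test for the mean model, assuming the statistic takes the familiar form z = √n(β̂ − β₀)/σ with σ known; the data and numbers are made up.
set.seed(1)
n <- 40
sigma <- 1.5                              # assumed known, per the second strong assumption
beta0 <- 0                                # value of beta specified by the null
y <- rnorm(n, mean = 0, sd = sigma)       # data generated under the null
betahat <- mean(y)                        # OLS estimate of beta = the sample mean
z <- sqrt(n) * (betahat - beta0) / sigma  # distributed N(0,1) under the null
z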
Page 124
The statistic z defined in (4.03) has the first property that we would like a test statistic to possess: It has a known
distribution under the null hypothesis.
For every null hypothesis there is, at least implicitly, an alternative hypothesis, which is often
given the label H1. Just as important as the fact that z follows the N(0,1) distribution under the
null is the fact that z does not follow this distribution under the alternative. Suppose that β takes
on some other value, say β₁. Then clearly β̂ = β₁ + v, where v has mean 0 and variance σ²/n;
z is again normally distributed, but no longer with mean 0, and we find from (4.03) that (4.04) holds.
We would expect the mean of z to be large and positive if β₁ > β₀, and large and negative if β₁ <
β₀. We reject the null hypothesis whenever z is sufficiently far from 0.
If the alternative is that β ≠ β₀, we must perform a two-tailed test and reject the null whenever
the absolute value of z is sufficiently large. If instead we were interested in testing the null
hypothesis that β ≤ β₀ against the alternative that β > β₀, we would perform a one-tailed test.
Decide in advance on a rejection rule, according to which we choose to reject the null
hypothesis if and only if the value of z falls into the rejection region of the rule.
Page 125
Rejecting a true null hypothesis is a Type I error. The probability of making such an error is, by construction,
the probability, under the null hypothesis, that z falls into the rejection region. This probability is sometimes
called the level of significance, or just the level, of the test, and is denoted α. Popular values of α include .05
and .01.
Sometimes the distribution of the test statistic under the null hypothesis is known exactly, so that we have what
is called an exact test. Usually, however, it is known only approximately. In this case, we need to draw a
distinction between the nominal level of the test, that is, the Type I error probability we are aiming at,
and the actual rejection probability, which may differ greatly from the nominal level.
The probability that a test rejects the null is called the power of the test. Power depends on
precisely how the data were generated and on the sample size.
Size of a test. Technically, this is the supremum of the rejection probability over all DGPs that
satisfy the null hypothesis. For an exact test, the size equals the level. It is often, but by no means
always, greater than the nominal level of the test.
λ is the non-centrality parameter (the non-zero value of the mean). The figure on page 126 shows the
effect of λ on the power of the test.
Page 126
Mistakenly failing to reject a false null hypothesis is called making a Type II error. The
probability of making such a mistake is equal to 1 minus the power of the test.
To construct the rejection region for a test at level α, the first step is to calculate the critical
value. The critical value c_α is defined implicitly by (4.05).
Here φ denotes the standard normal density and Φ denotes the corresponding cumulative distribution function.
The critical value for a two-tailed test at the .05 level is Φ⁻¹(0.975) = 1.96.
We reject the null if the observed value is more extreme than the critical value in either direction.
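In R, the critical value quoted above can be checked directly:
qnorm(0.975)   # 1.959964, the two-tailed critical value at the .05 level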
P-Values
Page 127
The result of a test is yes or no (accept or reject). A more sophisticated approach to deciding whether
or not to reject the null hypothesis is to calculate the P value, or marginal significance level,
associated with the observed test statistic ẑ.
The P value associated with ẑ is denoted p(ẑ). It means that we incur a probability of Type I error of p(ẑ)
when we REJECT the null.
Equation (4.07) defines the P value for a two-tailed test.
The smallest value of α for which the inequality holds is thus obtained by solving the equation
(7a).
The solution is easily seen to be the right-hand side of (4.07).
Computing a P value transforms z from a random variable with the N(0,1) distribution into a
new random variable p(ẑ) with the uniform U(0,1) distribution. A test at level α rejects whenever p(ẑ) <
α. Generally, one rejects the null when the observed test statistic is large; however, one rejects the
null when the P value is small. This takes some getting used to!
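A small R sketch of these ideas, assuming the two-tailed P value takes the form 2(1 − Φ(|ẑ|)) described above; the numbers are illustrative only.
zhat <- 2.3
2 * (1 - pnorm(abs(zhat)))                 # two-tailed P value, about 0.021
set.seed(4)
p <- 2 * (1 - pnorm(abs(rnorm(100000))))   # P values computed when the null is true
hist(p)                                    # roughly flat: p is approximately U(0,1)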
Page 128
In the downloadable answer to Exercise 4.15, readers are asked to show how to compute P values
for two-tailed tests based on an asymmetric distribution, involving the min function.
Page 129
Section 4.3 deals with the normal, chi-squared, Student’s t, and F distributions.
None of this material is very advanced.
The fact that linear combinations of normal variables are themselves normally distributed is very
important, but the proof can be omitted.
In Exercise 1.8, the PDF of the N(μ, σ²) distribution, evaluated at x, was given. We use lower-case letters to
denote both random variables and the arguments of their PDFs or CDFs.
Page 130
The third central moment, which measures the skewness of the distribution, is always zero for the normal. The
fourth central moment of a symmetric distribution provides a way to measure its kurtosis, which
essentially means how thick the tails are. For the normal, the fourth central moment is 3σ⁴. See Exercise 4.2;
check this numerically in R and see whether rnorm gives a good approximation to the skewness and the fourth
moment (kurtosis) as the sample size increases.
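One possible way to do the suggested R check (my own sketch, not the book's): estimate the third and fourth central moments of rnorm draws and compare them with 0 and 3σ⁴.
set.seed(123)
sigma <- 2
for (n in c(100, 10000, 1000000)) {
  x  <- rnorm(n, mean = 0, sd = sigma)
  m3 <- mean((x - mean(x))^3)    # should approach 0 (no skewness)
  m4 <- mean((x - mean(x))^4)    # should approach 3*sigma^4 = 48
  cat("n =", n, " m3 =", round(m3, 3), " m4 =", round(m4, 3), "\n")
}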
Any linear combination of independent normally distributed random variables is itself normally
distributed.
Page 131
Given the conditional mean and variance we have just computed, we see that the conditional
distribution must be N(b₁z₁, b₂²).
The joint density can also be expressed as in (4.12), but with z₁ and w interchanged, as follows in
(4.14):
We are now ready to compute the unconditional, or marginal, density of w. To do so, we
integrate the joint density (4.14) with respect to z₁, and conclude that the marginal density of w is
f(w) = φ(w).
Page 132
We now consider linear combinations of normal random variables that are not necessarily independent. To do so, we
introduce the multivariate normal distribution. This is a family of distributions for random
vectors, completely characterized by their first two moments.
Start with a set of m mutually independent standard normal variables zᵢ, which we can assemble into a
random m-vector z. Then any m-vector x of linearly independent linear combinations of the zᵢ can
always be written as Az, for some nonsingular m × m matrix A.
We can always find a lower-triangular A such that AAᵀ = Ω. We write this as x ~ N(0, Ω). If we
add an m-vector μ of constants to x, the resulting vector must follow the N(μ, Ω) distribution.
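A minimal R sketch of this construction, using a made-up 2 × 2 covariance matrix Ω: take the lower-triangular Cholesky factor A with AAᵀ = Ω and set x = μ + Az.
set.seed(42)
Omega <- matrix(c(4, 1.5,
                  1.5, 1), nrow = 2)   # an illustrative positive definite covariance matrix
mu <- c(1, -2)
A  <- t(chol(Omega))                   # chol() returns U with t(U) %*% U = Omega, so A is lower triangular
z  <- rnorm(2)                         # z ~ N(0, I)
x  <- mu + A %*% z                     # x ~ N(mu, Omega)
x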
Page 133
If x is any multivariate normal vector with zero covariances, the components of x are mutually
independent. In general, a zero covariance between two random variables does not imply that
they are independent. A nice property of normal variables!
Page 134
Let z₁, …, z_m be mutually independent standard normal random variables, so that z ~ N(0, I) in matrix
notation.
The sum of squares defined in (4.15) is said to follow the chi-squared distribution with m degrees of freedom.
Its mean is m.
The variance of the sum of the zᵢ² is just the sum of the (identical) variances:
by (4.17), the variance of a chi-squared variable with m degrees of freedom is 2m,
using the fact that E(zᵢ⁴) = 3. If y₁ ~ χ²(m₁) and y₂ ~ χ²(m₂), and y₁ and y₂ are independent, then
y₁ + y₂ is chi-squared with df = m₁ + m₂: just add the degrees of freedom!
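A quick simulation check of these moments (illustrative, not from the book): sums of m squared standard normals should have mean near m and variance near 2m.
set.seed(1)
m <- 5
y <- replicate(100000, sum(rnorm(m)^2))   # 100000 draws of a chi-squared(5) variable
mean(y)   # close to m = 5
var(y)    # close to 2m = 10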
Page 135
Many test statistics can be written as quadratic forms in normal vectors, or as functions of such
quadratic forms.
Theorem 4.1 is useful, especially because it says something when projection matrices (the hat matrix P_X, or
the matrix M_X = I − P_X) are involved. We have seen that these are common in regression, so
regression theory is easy once you understand these projection matrices. A quadratic form in a standard
normal vector and such a projection matrix is chi-squared distributed, with df equal to the rank of the
projection (easily counted).
Page 136
If z ~ N(0,1) and y ~ χ²(m), and z and y are independent, then the random variable defined in (4.18) is said
to follow the Student's t distribution with m degrees of freedom. Note the square root in the
denominator of the definition.
Only the first m – 1 moments exist for the t variable. Thus the t(1) distribution, which is also
called the Cauchy distribution, has no moments at all, and the t(2) distribution has no variance.
As m increases, the chance that the denominator of (4.18) is close to zero diminishes (see Figure
4.4), and so the tails become thinner.
Var(t) = m/(m − 2). Thus, as m → ∞, the variance tends to 1.
Since a chi-squared variable can be expressed as a sum of squares of N(0,1) variables, by a law of large
numbers, such as (3.16), y/m, which is the average of the zᵢ², tends to its expectation as m → ∞. The
denominator of (4.18), (y/m)^½, also tends to 1, and hence t → z ~ N(0,1) as m → ∞.
Page 137
If y₁ and y₂ are independent random variables distributed as χ²(m₁) and χ²(m₂), respectively,
then the random variable defined in (4.19) is said to follow the F distribution.
The square of a random variable which is distributed as t(m₂) is distributed as F(1, m₂).
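This relationship is easy to verify numerically in R; for instance, with m₂ = 10 the squared two-tailed t critical value equals the F(1, 10) critical value:
qt(0.975, df = 10)^2          # about 4.9646
qf(0.95, df1 = 1, df2 = 10)   # the same value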
Page 138
The next two sections (page 138 on) deal with exact tests and asymptotic tests,
respectively. Most of the results in Section 4.4 are well known and reasonably
elementary, as is the Chow test, which is introduced here as an example of the F test.
In contrast, even though there are no real proofs, the asymptotic results in Section 4.5
are inevitably a bit more advanced. Here we continue the discussion of laws of large
numbers, which was begun in Section 3.3, and we also introduce central limit theorems.
If you are to understand asymptotic theory at all, this material is essential.
It is assumed in (4.20) that the error vector u is statistically independent of the matrix X. This
idea is expressed in matrix notation in (4.20) by using the multivariate normal distribution. It is fine to
express this independence assumption by saying that the regressors X are exogenous.
Test of a Single Restriction:
Page 139
Under the assumption that the model (4.21) is correctly specified, we can derive the needed distributional result.
This yields a test statistic analogous to (4.03) on p. 123, given in (4.23).
The test statistic zβ₂ defined in (4.23) has exactly the same distribution under the null
hypothesis as the test statistic z defined in (4.03).
In the more realistic case in which the variance of the error terms is unknown, we need to
replace σ in equation (4.23) by s.
Page 140
If, as usual, M_X is the orthogonal projection on to S⊥(X), then we obtain the test statistic.
This test statistic is distributed as t(n−k) under the null hypothesis. Not surprisingly, it is called a
t statistic.
(4.25) can be rewritten as (4.26).
Under any DGP that belongs to (4.21) we write (4.27).
We saw above that M_X y = M_X u. The n × n matrix of covariances of the components of P_X u and M_X u
is thus given by the last equation on page 140.
Page 141
Even though the numerator and denominator of (4.26) both depend on y, this orthogonality
implies that they are independent.
We just have to use the t(n-k) distribution instead of the N(0,1) distribution to compute P values
or critical values.
What if there are several restrictions?
Tests of Several Restrictions:
Either β₂ is zero, or we have (4.28).
We want to compute a single test statistic for all the k₂ restrictions at once.
The SSR from the restricted model (4.29) cannot be smaller, and is almost always larger, than the SSR
from the unrestricted model (4.28). It seems natural to base a test statistic on the difference
between these two SSRs. See (4.30).
Page 142
The restricted SSR is yᵀM₁y, and the unrestricted one is yᵀM_X y. One way to obtain a convenient
expression for the difference between these expressions is to use the FWL Theorem.
Under the null hypothesis, M_X y = M_X u and M₁y = M₁u. Thus, under this hypothesis, the F
statistic (4.33) reduces to (4.34),
where, as before, ε = u/σ. The random variables in the numerator and denominator are independent,
because M_X and P_{M₁X₂} project on to mutually orthogonal subspaces: M_X M₁X₂ = M_X(X₂ −
P₁X₂) = O. Thus the statistic (4.34) follows the F(r, n − k) distribution.
Page 143
A Threefold Orthogonal Decomposition
The threefold orthogonal decomposition is (4.37).
We use a tilde (˜) to denote the restricted estimates, and a hat (ˆ) to denote the unrestricted estimates.
β̂₂ is a subvector of the estimates from the unrestricted model. Finally, M_X y is the vector of
residuals from the unrestricted model.
In (4.38) there are two hats on the left side and one tilde and one hat on the right side.
The F statistic (4.33) can be written as the ratio of the squared norm of the second component in
(4.37) to the squared norm of the third, each normalized by the appropriate number of degrees of
freedom.
F test serves to detect the possible presence of systematic variation, related to X2, in the second
component of (4.37).
We want to reject the null whenever the numerator of the F statistic, RSSR – USSR, is relatively
large.
Page 144
Thus we compute the P value as if for a one-tailed test. However, F tests are really two-tailed
tests.
The square of the t statistic tβ₂ defined in (4.25) is given in the equation just before (4.39).
This test statistic is evidently a special case of (4.33), with the vector x₂ replacing the matrix X₂.
An Example of F test:
The most familiar application of the F test is testing the hypothesis that all the coefficients in a
classical normal linear model, except the constant term, are zero.
The last equation on page 144 shows that the F statistic (4.40) depends on the data only through the
centered R², of which it is a monotonically increasing function.
Page 145
The Chow test of the equality of two parameter vectors appears in eq. (4.43) below.
It is often natural to divide a sample into two, or possibly more than two, subsamples. These
might correspond to periods of fixed exchange rates and floating exchange rates. We may then
ask whether a linear regression model has the same coefficients for both the subsamples. It is
natural to use an F test for this purpose.
A good example in Greene’s text is demand for gasoline before and after the Arab oil embargo
of 1973. Clearly the gasoline demand changed due to long lines at the gas pump, and high prices
and energy efficiency requirements on Detroit cars imposed by the US Congress.
Two subsamples, of lengths n1 and n2, with n = n1 + n2. We will assume that both n1 and n2 are
greater than k, the number of regressors. Now partition y and X into two parts.
Write (4.41) and use it to define the matrix Z, and then write (4.42).
Equation (4.42) is a regression model with n observations and 2k regressors. It is constructed in such a way
that β₁ is estimated directly, while β₂ is estimated using the relation β₂ = γ + β₁. The restriction that
β₁ = β₂ is equivalent to the restriction that γ = 0 in (4.42). Since (4.42) is just a classical normal
linear model with k linear restrictions to be tested, the F test provides the appropriate way to test
those restrictions. The null hypothesis in the oil embargo application is that the gas demand
model remained unchanged before and after the embargo.
There is another way to compute the USSR. In Exercise 4.11, readers are invited to show that it
is simply the sum of the two SSRs obtained by running two independent regressions on the two
subsamples. The F statistic becomes (4.43). This is the Chow statistic.
Page 146
(4.43) is distributed as F(k, n − 2k) under the null hypothesis that β₁ = β₂.
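A sketch of the Chow computation in R (illustrative data; the DGP has no break, so the statistic should be small): the USSR is the sum of the two subsample SSRs, and the restricted SSR comes from the pooled regression.
set.seed(3)
n1 <- 60; n2 <- 40; n <- n1 + n2
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)                  # same coefficients in both subsamples
k <- 2                                       # constant plus one slope
RSSR <- sum(resid(lm(y ~ x))^2)              # pooled (restricted) regression
USSR <- sum(resid(lm(y[1:n1] ~ x[1:n1]))^2) +
        sum(resid(lm(y[(n1+1):n] ~ x[(n1+1):n]))^2)
Fchow <- ((RSSR - USSR) / k) / (USSR / (n - 2 * k))
pf(Fchow, k, n - 2 * k, lower.tail = FALSE)  # a large P value is expected, since there is no break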
Large Sample Tests:
The tests that we developed in the previous section are exact only under the strong assumptions of
the classical normal linear model. If the error vector were not normally distributed or not
independent of the matrix of regressors, we could still compute t and F statistics, but they would
not actually follow their namesake distributions in finite samples. In many cases, however, they
approximately follow known distributions in large samples.
Asymptotic theory
Asymptotic theory gives us results about the distributions of t and F statistics under much weaker
assumptions than those of the classical normal linear model.
A law of large numbers (LLN) may apply to any quantity that can be written as an average of
n random variables, that is, 1/n times their sum.
Page 147
A fairly simple LLN assures us that, as n → ∞, x̄ tends to μ: the sample mean approaches the
population mean as n tends to infinity.
The empirical distribution defined by this sample is the discrete distribution that puts a weight
of 1/n at each of the xt, t = 1,…,n.
I(·) is the indicator function, which takes the value 1 when its argument is true and takes the
value 0 otherwise. (4.44) thus gives the proportion of realizations x_t that are smaller than or equal to x.
The EDF has the form of a step function: The height of each step is 1/n, and the width is equal to
the difference between two successive values of xt.
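In R, the EDF is available directly as a step function; a small sketch:
set.seed(5)
x <- rnorm(25)
Fhat <- ecdf(x)                              # the EDF: steps of height 1/n
Fhat(0)                                      # proportion of observations <= 0; compare pnorm(0) = 0.5
plot(Fhat)                                   # step function, as described above
curve(pnorm(x), add = TRUE, col = "red")     # true standard normal CDF for comparison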
Page 148
These may be compared with the CDF of the standard normal distribution in the lower panel of
Figure 4.2.
Thus (4.44) is the mean of n IID random terms, each with finite expectation. The simplest of all
LLNs (due to Khinchin) applies to such a mean, and we conclude that, for every x, F̂(x) is a
consistent estimator of F(x).
If we can apply an LLN to a random average, we can treat it as a nonrandom quantity for the
purpose of asymptotic analysis. The matrix n⁻¹XᵀX, under many plausible assumptions about
how X is generated, tends to a nonstochastic limiting matrix S_XᵀX as n → ∞.
Central limit theorems are crucial in establishing the asymptotic distributions of estimators and
test statistics. In many circumstances, 1/√n times the sum of n centered random variables
approximately follows a normal distribution when n is sufficiently large.
Page 149
It may seem curious that we divide by √n instead of by n in (4.45), but this is an essential feature
of every CLT. To see why, we calculate the variance of z.
Whenever we want to use a CLT, we must ensure that a factor of n^(−1/2) = 1/√n is present.
The assumption that the xt are identically distributed is easily relaxed, as is the assumption that
they are independent. However, if there is either too much dependence or too much
heterogeneity, a CLT may not apply. A CLT says that, for a sequence of random variables x_t, t =
1, 2, …, with E(x_t) = 0, a quantity of the form (4.45) is approximately normally distributed.
There are also multivariate versions of CLTs. Suppose that we have a sequence of random m-vectors x_t, for
some fixed m, with E(x_t) = 0, where each Var(x_t) is an m × m matrix. Then the appropriate multivariate
version of a CLT tells us that 1/√n times the sum of the x_t is approximately multivariate normal.
Figure 4.7 illustrates the fact that CLTs often provide good approximations.
Page 150
The top panel is for the uniform density and the lower panel is for the chi-squared. Both converge to
normality by a process of averaging over n draws from these densities, but the convergence is much faster,
i.e., occurs at a lower value of n, when the underlying density is the uniform.
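An illustrative simulation of this comparison (my own sketch): standardized means of uniform and of chi-squared(1) draws, for the same small n.
set.seed(8)
std_mean <- function(rdraw, mu, sigma, n, reps = 10000)
  replicate(reps, sqrt(n) * (mean(rdraw(n)) - mu) / sigma)
z_unif <- std_mean(runif, mu = 0.5, sigma = sqrt(1/12), n = 8)                      # close to N(0,1) already
z_chi  <- std_mean(function(n) rchisq(n, df = 1), mu = 1, sigma = sqrt(2), n = 8)   # still visibly skewed
c(mean(z_unif), sd(z_unif), mean(z_chi), sd(z_chi))
# qqnorm(z_chi) shows the slower convergence for the skewed chi-squared case.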
ASYMPTOTIC t and F TESTS are actually valid under much weaker conditions than
normality of the errors.
How do we show this?
Weaken the conditions, then find the limit of the t or F statistic as n → ∞ by
applying the CLT and LLN.
For example, let us assume IID errors, where the error terms are drawn from some specific but
unknown distribution, as in eq. (4.47).
Page 151
When we use IID errors, we abandon the assumption of exogenous regressors and replace it with
assumption (3.10). The conditional means and variances in the new setup are given in (4.48).
From the point of view of the error terms, (4.48) says that they are innovations. From the point of
view of the explanatory variables X_t, assumption (3.10) says that they are predetermined with
respect to the error terms.
In general, the matrix XᵀX → ∞ as the sample size n → ∞, because each term in the k × k matrix increases
beyond bound. However, the average of each term does not necessarily diverge to ∞. This is stated as
an additional assumption, (4.49), in order to be able to use asymptotic results,
where S_XᵀX is a finite, deterministic, positive definite matrix. Condition (4.49) is violated in many
cases. For example, it cannot hold if one of the columns of the X matrix is a linear time trend,
because Σt² grows at a rate faster than n.
Here the numerator and denominator have both been multiplied by n^(−1/2). It follows from the
consistency of s² that the first factor in (4.50) tends to 1/σ₀ as n → ∞.
If the data are generated by (4.47) with β₂ = 0, we have M₁y = M₁u, and so (4.50) is
asymptotically equivalent to (4.51).
Page 152
Recall that we want to
weaken the conditions and find the limit of the t statistic as n → ∞ by applying
the CLT and LLN.
We saw that (4.51) is asymptotically equivalent to (4.50); we now show what its limit is.
The numerator of (4.51) is n^(−1/2) times a weighted sum of terms that each have mean 0, and its
variance does not depend on X. Hence the conditional variance given X is also the unconditional
variance when knowledge of X is missing.
Thus the numerator of (4.51) evidently has mean 0 and variance 1, and asymptotically it is N(0,1).
The denominator can be treated as nonrandom under the null.
Under the null with exogenous regressors, we have (4.52), where
the notation "~a" means that tβ₂ is asymptotically distributed as N(0,1).
This result (4.52) justifies the use of the t test beyond the standard assumptions of the normal regression
model.
This subsection provides an excellent example of how asymptotic theory works: how to apply an
LLN and a CLT to the expression for the t statistic in a regression.
We begin by
pulling a k-vector v out of thin air and considering its limit, which by a CLT is multivariate normal.
We will then consider subvectors of this v and submatrices S₁₁, S₁₂, S₂₂ of S_XᵀX.
Page 153
We applied an LLN in reverse to go from the first line to the second.
Recall that (4.51) is equivalent to the t statistic.
Consider the numerator of (4.51). After considerable manipulation, it can be written as (4.55).
The first term of this expression is just the last, or kth, component of v, which we can denote by
v₂.
The subsection on the t test with predetermined regressors is somewhat more
advanced than the rest of this section, and it can be omitted without loss of continuity.
Page 154
But the key idea of p. 154 is that even if the regressors are not exogenous but merely predetermined,
asymptotic theory can be used to justify the usual t test on regression coefficients.
The denominator of the t statistic in (4.51) is easier to analyze.
In the limit, all the pieces of this expression become submatrices of S_XᵀX.
And we have:
Numerator → a normal variable with mean zero.
Denominator → the standard deviation of that normal variable.
For example, if y ~ N(0, σ²), we know that y/σ ~ N(0,1).
This is what happens to the t statistic.
Hence the t statistic tends to N(0,1) asymptotically.
With regressors that are not necessarily exogenous but merely predetermined, we still have the
useful asymptotic result that tβ₂ ~a N(0,1).
ASYMPTOTIC F TESTS
A similar analysis can be performed for the F statistic (4.33).
See (4.59), which says that r times the F statistic tends to a chi-squared variable with df = r, where r
denotes the degrees of freedom of the numerator of F.
Results (4.52) and (4.59) justify the use of t and F tests outside the confines of the classical normal linear
model.
These P values from asymptotic tests are approximate, and one worries that tests based on them
are not exact in finite samples, in the sense that they may over-reject (or under-reject) in small samples.
If they over-reject, the conclusions may end up being too liberal.
Page 155
Whether they overreject or underreject, and how severely, depends on many things, including the
sample size, the distribution of the error terms, the number of regressors and their properties, and
the relationship between the error terms and the regressors.
SIMULATION-BASED TESTS
p. 155 Section 4.6 contains what is intended to be an accessible introduction to
simulation-based tests, including Monte Carlo tests, which are exact, and bootstrap
tests, which are not. Students who understand the material of this section should be
capable of performing bootstrap tests in regression models, including dynamic
regression models, with or without assumptions about the distribution of the error terms.
In the usual tests, we assumed that the distribution of the statistic under the null hypothesis was not only
(approximately) known, but also exactly the same for all DGPs contained in the null hypothesis.
Consider instead a compound hypothesis, which is represented by a model that contains more than one
DGP. Now the test statistic has different distributions under the different DGPs contained in the
model. This creates a practical problem for econometrics: the loss of the pivotal property.
A random variable with the property that its distribution is the same (say normal) for all DGPs
in a model M is said to be pivotal. There is no pivotalness problem for simple hypotheses (one DGP at a
time), but there is a problem for compound hypotheses.
A pivot is a statistic whose sampling distribution does not depend on unknown parameters.
(β̂ − β)/SE(β̂) is a good example of a pivotal statistic. Its distribution is Student's t, and it does
not depend on the unknown parameters.
Page 156
One can use asymptotic tests for compound hypotheses. The price that large-sample asymptotic
tests pay for this added generality is that the t and F statistics now have distributions that depend on
things like the error distribution: they are therefore not pivotal statistics in finite samples.
However, their asymptotic distributions are independent of such things across DGPs, and they are
said to be asymptotically pivotal.
For any pivotal test statistic, the P value can be estimated by simulation to any desired level of
accuracy.
The Fundamental Theorem of Statistics says: the empirical distribution of a set of independent drawings of
a random variable generated by some DGP converges to the true CDF of the random variable
under that DGP. This is just as true of simulated drawings.
Empirical CDF → true CDF.
If we knew that a certain test statistic was pivotal but did not know how it was distributed, we
could select any DGP in the null model and generate simulated samples from it.
The TRICK: use simulation to get the empirical CDF of the statistic and apply the Fundamental Theorem.
Suppose that we have computed a test statistic τ̂. The P value for a test based on τ̂ is given by (4.60).
Note that the P value is the probability of observing a value as extreme as, or even more extreme than, the
observed value. This probability is the tail area beyond the observed value.
Let capital F denote the cumulative distribution function (CDF).
F(τ̂) gives the CDF evaluated at the observed value τ̂; it is the area to the left of τ̂. We do not want the area
to the left, we want the tail area; hence we subtract from 1 in eq. (4.60) to get the P value. The observed P
value is denoted by lower-case p.
This P value can be estimated if we can estimate the CDF F evaluated at τ̂.
The procedure is very general.
Page 157
How do we compute P values in a simulation?
Choose any DGP in M, and draw B (= 999, say) samples of size n from it. Denote the simulated
samples as y*_j, j = 1, …, B. The star (*) notation will be used systematically to denote quantities
generated by simulation.
The proportion of simulations for which the statistic τ*_j is less than or equal to τ̂ can be readily calculated.
Just count the number of times the simulated value exceeds the observed value;
see eq. (4.62).
Since the EDF converges to the true CDF, it follows that, if B were infinitely large, this
procedure would yield an exact test.
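A hedged R sketch of such a simulated P value (the DGP, statistic, and names are made up): because the t ratio for a mean is pivotal under normal errors, its null distribution can be simulated from any convenient DGP in the null model.
set.seed(2024)
n <- 20; B <- 999
tstat <- function(y) sqrt(length(y)) * mean(y) / sd(y)   # test statistic for H0: mean = 0
y <- rnorm(n, mean = 0.4)                 # the "observed" data
tau_hat  <- tstat(y)
tau_star <- replicate(B, tstat(rnorm(n))) # B simulated statistics under a DGP in the null
p_star   <- mean(tau_star > tau_hat)      # simulated upper-tail P value, as in (4.62)
p_star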
Simulating a pivotal statistic in this way is called a Monte Carlo test; Dufour and Khalaf (2001) provide a
more detailed introduction and references. Simulation experiments in general are often referred
to as Monte Carlo experiments.
An RNG is a program for generating random numbers.
Page 158
Drawings from the uniform U(0,1) distribution can then be transformed into drawings from
other distributions. Fortunately, we do not have to do this explicitly if we have R software.
The RNG in (4.63) starts with a (generally large) positive integer z₀ called the seed,
multiplies it by λ, and then adds c to obtain an integer that may well be bigger than m. It then applies
the modulo function to bring the result back into the desired range.
In R this is done easily. If I want to choose 12 questions from a list of 50 to ask on a
final exam, I can use uniform random variables between 1 and 50 which are rounded to
integers. I type
sort(round(runif(12, min=1, max=50)))
and get 3 7 8 10 16 22 30 31 40 49 49 50. (Note that rounded uniform draws can repeat, as 49 does here;
sample(1:50, 12) would draw 12 distinct question numbers.)
How well or badly this procedure works depends on how λ, m, and c are chosen. Set c = 0 and
use for m a prime number that is either a little less than 2³² or a little less than 2³¹. When λ and
m are chosen properly with c = 0, the RNG has a
period of m − 1. This means that it generates every rational number with denominator m
between 1/m and (m − 1)/m precisely once until, after m − 1 steps, z₀ comes up again. After that,
the generator repeats itself.
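A minimal R sketch of such a generator, with the classic choices λ = 16807, m = 2³¹ − 1, c = 0 used purely as an illustration (these particular constants are not taken from the book):
lcg <- function(n, seed = 12345, lambda = 16807, m = 2^31 - 1, c = 0) {
  z <- numeric(n)
  z_prev <- seed
  for (i in 1:n) {
    z_prev <- (lambda * z_prev + c) %% m   # exact in double precision, since lambda*m < 2^53
    z[i] <- z_prev
  }
  z / m                                    # map the integers to the (0, 1) interval
}
lcg(5)                                     # five pseudo-random U(0,1) draws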
Page 159
Most test statistics in econometrics are not pivotal. The vast majority of them are, however,
asymptotically pivotal. The bootstrap is for the non-pivotal cases, and for cases in which the underlying
distribution is not known to have any standard form.
It is then necessary to estimate a bootstrap DGP from which to draw the simulated samples. The DGP
that generated the original data is unknown, and so it cannot be used to generate simulated data.
The bootstrap DGP is an estimate of the unknown true DGP.
There are many ways to specify the bootstrap DGP.
Page 160
We will take for our example a linear regression model with normal errors, but with a lagged
dependent variable among the regressors.
(4.65) is a fully specified parametric model, which means that each set of values for the regression
coefficients and σ² defines just one DGP. The simplest type of bootstrap DGP for fully specified models is
given by the parametric bootstrap.
Draw an n-vector u* from the N(0, s̃²I) distribution.
Since lagged values are present, use a recursive method; see (4.67).
The value y₁* determined by the first equation is put into the right-hand side of the second equation,
and so on sequentially.
The bootstrap sample from (4.67), y₁*, y₂*, …, y_n*, is of course CONDITIONAL on the
observed initial value y₀ when such a recursive scheme is used.
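A hedged sketch of the recursive step in R, assuming (only for illustration) a dynamic regression of the form y_t = β₁ + β₂x_t + δy_{t−1} + u_t with restricted estimates and s̃ already in hand; the function and variable names are made up.
simulate_recursive <- function(x, y0, beta1, beta2, delta, s_tilde) {
  n <- length(x)
  u_star <- rnorm(n, mean = 0, sd = s_tilde)   # parametric bootstrap errors
  y_star <- numeric(n)
  y_lag  <- y0                                 # condition on the observed initial value
  for (t in 1:n) {
    y_star[t] <- beta1 + beta2 * x[t] + delta * y_lag + u_star[t]
    y_lag <- y_star[t]                         # feed y*_t into the equation for period t + 1
  }
  y_star
}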
Page 161
RECALL THE PROBLEM at hand:
Let τ̂ denote the value of the t statistic obtained from the data, before any bootstrapping.
Note that we want to know if it is significant, but we are not sure that its underlying sampling
distribution is Student's t. So we are going to create B = 999 bootstrap (simulated) values τ*_j,
which will follow whatever density they do follow. We are not worried if it is not Student's t.
All we want to know is the tail area, or P value, based on these 999 realizations.
For each of the B bootstrap samples, denoted by a vector y*, a bootstrap test statistic τ*_j is
computed. The bootstrap P value p*(τ̂) is then computed by formula (4.62).
We simply count the number of times τ*_j exceeds τ̂ out of the 999 attempts, divide by 999 and,
bingo, we have the P value we want to know, without bothering with asymptotic distribution
theory.
Under the null hypothesis, the OLS residual vector ũ for the restricted model is a consistent
estimator of the error vector u: for each t, the plim of ũ_t is the true u_t.
From the Fundamental Theorem of Statistics, we know that the empirical (cumulative)
distribution function of the residuals is a consistent estimator of the unknown CDF of the error
distribution:
the ECDF of ũ converges to the true CDF of u,
since, after all, the residuals consistently estimate the true errors.
Each bootstrap sample contains some of the residuals exactly once, some of them more than
once, and some of them not at all.
Page 162
Suppose that, when forming one of the bootstrap samples, the ten drawings from the U(0,1)
distribution happen to be those shown in the middle of page 162.
This implies that, after rounding to integers, the ten index values are those given there:
they are 7, 3, 8, and so on.
They mean that we select the 7th, 3rd, 8th, … data points to go into the bootstrap resample.
Using (4.68) with the empirical CDF of the observed regression residuals is called the nonparametric
bootstrap, although strictly speaking it is semiparametric, since it uses the estimated regression coefficients
(the tilde estimates) on the right-hand side before computing the starred values on the left-hand side.
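A small sketch of residual resampling in R (illustrative stand-ins for the restricted residuals): drawing indices with replacement is exactly what produces the pattern described in the next sentence.
set.seed(99)
n <- 10
u_tilde <- rnorm(n)                                # stand-in for the restricted OLS residuals
idx     <- sample(1:n, size = n, replace = TRUE)   # indices drawn with replacement
u_star  <- u_tilde[idx]                            # resampled errors for one bootstrap sample
table(idx)                                         # shows repeated and omitted indices
# y* is then built from the restricted fitted values plus u_star (recursively, as in (4.67), if lags are present).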
The empirical distribution of the residuals may fail to satisfy some of the properties that the null
hypothesis imposes on the true error distribution.
Page 163
One case in which this failure has grave consequences arises when the regression (4.65), with regressors X_t
and the lagged dependent variable y_{t−1}, is forced through the origin, because then the sample mean of the
residuals is not, in general, equal to 0. {Beware of forcing regressions through the origin, i.e., eliminating the
intercept.}
The variance of the empirical distribution of the residuals is s̃²(n − k₁)/n. We can still draw from a
distribution with variance s̃². All we have to do is rescale the residuals: the rescaled residuals are obtained by
multiplying the OLS residuals by (n/(n − k₁))^½.
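As a one-line R illustration (the names u_tilde and k1 are hypothetical), the rescaling just described is:
rescale <- function(u_tilde, k1) u_tilde * sqrt(length(u_tilde) / (length(u_tilde) - k1))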
If the distribution of the error terms displays substantial skewness (that is, a nonzero third
moment) or excess kurtosis (that is, a fourth moment greater than 3σ₀⁴), then the same may be true
of the rescaled residuals. [The double bootstrap can be used in that case; see "Implementing the Double
Bootstrap," Vinod and McCullough, Computational Economics, Vol. 10, 1-17, 1997,
or "Double Bootstrap for Shrinkage Estimators," Journal of Econometrics, Vol. 68(2), 1995, pp.
287-302.]
Page 164
Suppose that α = .05 and B = 99. Then there are 5 out of 100 values of r, namely, r = 0,1,…,4,
that would lead us to reject the null hypothesis. We suggest choosing B = 999.
Page 165
Bootstrap tests generally perform better than tests based on approximate asymptotic
distributions. The errors committed by both asymptotic and bootstrap tests diminish as n
increases, but those committed by bootstrap tests diminish more rapidly.
Figure 4.8 p. 166, shows the rejection frequencies based on 500,000 replications for each of 31
sample sizes: n = 10, 12, 14,…,60.
The results of this experiment are striking. The asymptotic test over-rejects quite noticeably,
although it gradually improves as n increases.
Page 166
In contrast, in Fig. 4.8, two bootstrap tests over-reject only very slightly. Their rejection
frequencies are always very close to the nominal level of .05, and they approach that level quite
quickly as n increases.
The last substantive section of the chapter, Section 4.7 (page 166), deals with
test power, a subject that is often overlooked by applied econometricians. You should
understand the implications of Figure 4.10, which shows power functions for several
sample sizes.
To study the power of a test, we need to know how it behaves when the null
hypothesis fails to hold. The null is often that a mean is zero; when the null fails, the mean becomes
nonzero. Then we need noncentral t and F densities, which in turn depend on the noncentral chi-squared
density. See Figure 4.9 on page 168 for noncentral chi-squared distributions.
The distribution under the null is quite different from the distribution under the alternative.
When the NULL SHOULD BE REJECTED, and the DGP places most of the probability mass
of the test statistic in the rejection region of a test, the test has high power, that is, a high probability
of rejecting the null. POWER TO REJECT WHAT SHOULD BE REJECTED.
Page 167
We would surely prefer to employ the more powerful test statistic.
We study the power of exact and bootstrap tests in this section.
We need to focus on the difference between what happens under the null and under the alternative.
Hence, for exact tests, the power must come from the numerator of (4.34) on page 142 alone.
Use these facts to determine the distribution of the quadratic form (4.71). To do so, we must
introduce the noncentral chi-squared distribution. The noncentrality parameter is Λ = μᵀΩ⁻¹μ:
if x ~ N(μ, Ω), then xᵀΩ⁻¹x follows the noncentral chi-squared distribution χ²(m, Λ) with Λ = μᵀΩ⁻¹μ.
Under the null, β₂ = 0.
The last line on page 167 involves β₂, since it is now nonzero under the alternative hypothesis.
Page 168
Under the null, Λ = 0. The F statistic therefore has a distribution that we can write as
a ratio of two chi-squared variables, each divided by its degrees of freedom, i.e., an F variable,
where the numerator chi-squared is noncentral, with Λ stated explicitly: χ²(r, Λ)/r.
As Λ increases, the distribution moves to the right and becomes more spread out. This is
illustrated in Figure 4.9 on page 168.
Page 169
At any given level, the critical value (the dividing line between accept and reject) of a chi-squared or F test
increases (moves to the right) as the degrees of freedom r increase.
The distribution given just after (4.73) is known as the noncentral t distribution, with (n − k) degrees of
freedom and noncentrality parameter λ. It is written as t(n−k, λ). Note that λ² = Λ.
When we know the distribution of a test statistic under the alternative hypothesis, we can
determine the power of a test of given level α as a function of the (UNKNOWN) parameters of
that hypothesis.
This function is called the power function of the test. The power of the t test depends only on this ratio,
i.e., on the noncentrality parameter. The power function is generally unknown; it is hypothesized and
analyzed for theoretical discussion in the context of comparing different testing methods.
Test method 1 has higher power than test method 2 ⇒ reject (i.e., do not use) method 2.
Page 170
Since the test is exact, all the power functions are equal to α = 0.05 when the null hypothesis is
valid, i.e., when the parameter discrepancy is 0 (see the figure on page 170).
Power then increases as the discrepancy moves away from 0.
Recall that power is the probability of rejection when the null should be rejected.
The power when n = 400 exceeds the power when n = 100, which in turn exceeds the power when n
= 25, for every nonzero value of the discrepancy. The foot of the vertical axis is at 0.05 and the head is at 1.
The horizontal axis runs over the range −1 to 1.
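A hedged R sketch of such power curves, assuming (as an illustration, not the book's exact formula) that the noncentrality parameter of the t test is λ = √n·δ, where δ is the discrepancy plotted on the horizontal axis:
power_t <- function(delta, n, k = 2, alpha = 0.05) {
  df   <- n - k
  crit <- qt(1 - alpha / 2, df)
  ncp  <- sqrt(n) * delta                                   # assumed form of the noncentrality parameter
  1 - pt(crit, df, ncp = ncp) + pt(-crit, df, ncp = ncp)    # two-tailed rejection probability
}
delta <- seq(-1, 1, by = 0.01)
plot(delta, power_t(delta, 400), type = "l", ylim = c(0, 1), ylab = "power")
lines(delta, power_t(delta, 100))
lines(delta, power_t(delta, 25))
# All three curves equal 0.05 at delta = 0 and rise toward 1, with larger n giving higher power.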
What about the power when the test is NOT EXACT, such as when we use the bootstrap?
Recall that the bootstrap P value is a tail area;
it is only estimated from a random bootstrap, not exact.
But as B (999, or whatever number is typical in a bootstrap) → ∞, we need not worry, in terms of plims.
Page 171
The bootstrap testing procedure discussed in Section 4.6 incorporates this random variation, and in
so doing it reduces the power of the test.
But a z test based on N(0,1) can be MORE powerful than a t test if the bootstrap is used.
This is somewhat counterintuitive, since one would think that the t test is more conservative, i.e., less
willing to reject the null. The result arises because the denominator of the t test is itself
computed by the bootstrap, which creates additional variability in the t ratio and renders it not
conservative at all.
Page 172
Power loss is very rarely a problem when B = 999, and it is never a problem when B =
9,999.