SIMULTANEOUS EQUATIONS REGRESSION MODEL

INTRODUCTION

The classical linear regression model, the general linear regression model, and the seemingly unrelated regressions model all make the following assumption: the error term is uncorrelated with each explanatory variable. If this assumption is violated, then the OLS, FGLS, and SUR estimators produce biased estimates in small samples and inconsistent estimates in large samples.

SOURCES OF CORRELATION BETWEEN THE ERROR TERM AND EXPLANATORY VARIABLE

The most important sources of correlation between the error term and an explanatory variable are omitted confounding variables and reverse causation.

Omitted Confounding Variable

Consider the following wage equation,

Y = β1 + β2X + μ

where Y is the worker's wage, X is the worker's years of education, and μ is the error term. We want to analyze the effect of education on the wage. Let Z be the worker's innate ability. Since we omit Z from the equation, its effect is included in μ. Workers with more innate ability have higher wages, and therefore larger errors. Also, workers with more innate ability have more education, and therefore higher values of X. Thus, the error term and education are positively correlated. The OLS, FGLS, and SUR estimators will include the effect of innate ability in the estimate of β2. This results in a biased estimate of the effect of education on the wage.

Reverse Causation

Consider the following simple Keynesian model of income determination comprised of two equations: a consumption function and an equilibrium condition,

C = a + bY + ε
Y = C + I

where C is aggregate consumption; Y is aggregate income; I is exogenous investment; a and b are parameters; and ε is an error term that summarizes all factors other than Y that influence C (e.g., wealth, interest rate). Now, suppose that ε increases. This will directly increase C in the consumption function. However, the equilibrium condition tells us that the increase in C will increase Y. Therefore, ε and Y are positively correlated. The OLS, FGLS, and SUR estimators will produce a biased estimate of the effect of income on consumption because it will capture the reverse effect of consumption on income.
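To make the reverse-causation argument concrete, here is a short derivation (a sketch, assuming 0 < b < 1 and that I and ε are uncorrelated). Solving the two equations for Y gives the reduced-form equation for income:

Y = C + I = a + bY + ε + I  ⇒  Y(1 − b) = a + I + ε  ⇒  Y = (a + I + ε)/(1 − b)

so that Cov(Y, ε) = Var(ε)/(1 − b) > 0. Income, the right-hand side variable in the consumption function, is therefore necessarily correlated with the consumption error term.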
OBJECTIVE

In this section of the course, we will examine statistical models that assume the error term is correlated with an explanatory variable. This can result from either an omitted confounding variable or reverse causation. We will spend most of our time on the simultaneous equations model. This model assumes that the error term is correlated with an explanatory variable because of reverse causation. However, the estimators we develop for the simultaneous equations model can be used for any model in which the error term is correlated with an explanatory variable because of an omitted confounding variable.

INTRODUCTION TO THE SIMULTANEOUS EQUATIONS MODEL

When a single equation is embedded in a system of simultaneous equations, at least one of the right-hand side variables will be endogenous, and therefore the error term will be correlated with at least one of the right-hand side variables. In this case, the true data generation process is not described by the classical linear regression model, general linear regression model, or seemingly unrelated regressions model; rather, it is described by a simultaneous equations regression model. If you use the OLS, FGLS, SUR, or ISUR estimator, you will get biased and inconsistent estimates of the population parameters.

Definitions and Basic Concepts

Endogenous variable – a variable whose value is determined within an equation system. The values of the endogenous variables are the solution of the equation system. More generally, any variable that is correlated with the error term.
Exogenous variable – a variable whose value is determined outside an equation system. More generally, any variable not correlated with the error term.
Structural equation – an equation that has one or more endogenous right-hand side variables.
Reduced form equation – an equation for which all right-hand side variables are exogenous.
Structural parameters – the parameters of a structural equation.
Reduced form parameters – the parameters of a reduced form equation.

THE IDENTIFICATION PROBLEM

Before you estimate a structural equation, you must first determine whether it is identified. An equation is identified if you have enough information to get meaningful estimates of its parameters. A meaningful estimate is one that has a useful interpretation. An equation is not identified if you don't have enough information to get meaningful estimates of its parameters. If an equation is not identified, then estimating its parameters is pointless, because the estimates you obtain will have no useful interpretation.

Example

You want to estimate the price elasticity of demand for a good. You collect annual data on price (P) and quantity bought and sold (Q) for the period 1980 to 2015. You estimate the following equation,

lnQ = γ1 + γ2 lnP + μ

where ln designates the natural logarithm. The problem is that this equation has no identity: it could be a demand equation, a supply equation, or some combination of the two. Therefore, γ2 might measure the price elasticity of demand, the price elasticity of supply, or some combination of both. A regression of lnQ on lnP therefore has no useful interpretation.

Classifying Structural Equations

Every structural equation can be placed under one of three categories.
Unidentified equation – not enough information to get a meaningful estimate.
Exactly identified equation – just enough information to get a meaningful estimate.
Overidentified equation – more than enough information to get a meaningful estimate.

Exclusion Restrictions

The most often used way to identify a structural equation is to use prior information provided by economic theory to exclude from the equation certain variables that appear elsewhere in the model. This is called obtaining identification through exclusion restrictions. To exclude a variable from a structural equation, we restrict the value of its coefficient to zero. This type of zero fixed-value restriction is called an exclusion restriction because it has the effect of omitting a variable from the equation to obtain identification.

Rank and Order Conditions for Identification

Exclusion restrictions are most often used to identify a structural equation in a simultaneous equations model. When using exclusion restrictions, you can use two general rules to check whether identification is achieved: the rank condition and the order condition. The order condition is a necessary but not sufficient condition for identification. The rank condition is both a necessary and sufficient condition for identification. Because the rank condition is more difficult to apply, many economists only check the order condition and gamble that the rank condition is satisfied. This is usually, but not always, the case.

Order Condition

The order condition is a simple counting rule that you can use to determine whether one structural equation in a system of linear simultaneous equations is identified. Define the following:

G = total number of endogenous variables in the model (i.e., in all equations that comprise the model).
K = total number of variables (endogenous and exogenous) excluded from the equation being checked for identification.

The order condition is as follows:

If K = G – 1, the equation is exactly identified.
If K > G – 1, the equation is overidentified.
If K < G – 1, the equation is unidentified.
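As an illustration of the counting rule, consider a hypothetical two-equation market model (not taken from the notes above):

Demand: Q = α1 + α2P + α3INC + εd
Supply: Q = β1 + β2P + β3W + εs

where Q and P are endogenous, INC (consumer income) appears only in the demand equation, and W (an input price) appears only in the supply equation. Here G = 2, so G – 1 = 1. The demand equation excludes one variable (W), so K = 1 = G – 1 and the equation is exactly identified. The supply equation excludes INC, so it is exactly identified as well. If the demand equation also excluded a second supply shifter, it would be overidentified; if it excluded nothing, it would be unidentified.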
SPECIFICATION OF A SIMULTANEOUS EQUATIONS MODEL

A simultaneous equations regression model has two alternative specifications: reduced form and structural form. The reduced-form specification is comprised of M reduced-form equations and a set of assumptions about the error terms in the reduced-form equations. The reduced-form specification of the model is usually not estimated, because it provides limited information about the economic process in which you are interested. The structural-form specification is comprised of M structural equations and a set of assumptions about the error terms in the structural equations. The structural-form specification of the model is the specification most often estimated. This is because it provides more information about the economic process in which you are interested.

Specification of the Structural Form

A set of assumptions defines the specification of the structural form of a simultaneous equations regression model. The key assumption is that the error term is correlated with one or more explanatory variables. There are several alternative specifications of the structural form of the model, depending on the remaining assumptions we make about the error term. For example, if we assume that the error term has non-constant variance, then we have a simultaneous equations regression model with heteroscedasticity. If we assume the errors in one or more equations are correlated, then we have a simultaneous equations regression model with autocorrelation.

ESTIMATION

Single Equation vs. System Estimation

Two alternative approaches can be used to estimate a simultaneous equations regression model: single equation estimation and system estimation.

Single Equation Estimation

Single equation estimation involves estimating either one equation in the model, or two or more equations in the model separately. For example, suppose you have a simultaneous equations regression model that consists of two equations: a demand equation and a supply equation. Suppose your objective is to obtain an estimate of the price elasticity of demand. In this case, you might estimate the demand equation only. Suppose your objective is to obtain estimates of the price elasticity of demand and the price elasticity of supply. In this case, you might estimate the demand equation by itself and the supply equation by itself.

System Estimation

System estimation involves estimating two or more equations in the model jointly. For instance, in the above example you might estimate the demand and supply equations together. You might do this even if your objective is to obtain an estimate of the price elasticity of demand only.

Advantages and Disadvantages of the Two Approaches

The major advantage of system estimation is that it uses more information, and therefore results in more precise parameter estimates. The major disadvantages are that it requires more data and is sensitive to model specification errors.
The opposite is true for single equation estimation.

SINGLE EQUATION ESTIMATION

If the error term is correlated with an explanatory variable, then we cannot find an estimator that is unbiased in small samples. This means we must look for an estimator that has desirable large sample properties. We will consider four single equation estimators:

1. Ordinary least squares (OLS) estimator
2. Instrumental variables (IV) estimator
3. Two-stage least squares (2SLS) estimator
4. Generalized method of moments (GMM) estimator

ORDINARY LEAST SQUARES (OLS) ESTIMATOR

The OLS estimator is given by the rule:

β̂OLS = (XᵀX)⁻¹Xᵀy

Properties of the OLS Estimator

If the error term is correlated with an explanatory variable, then the OLS estimator is biased in small samples and inconsistent in large samples. It does not produce maximum likelihood estimates. Thus, it has undesirable small and large sample properties.

Role of OLS Estimator

The OLS estimator should be used as a preliminary estimator. You should initially estimate the equation using the OLS estimator, then estimate the equation using a consistent estimator, and then compare the OLS estimate and the consistent estimate of a parameter to determine the possible direction of the bias. This is because, in a sufficiently large sample, a consistent estimator will have a smaller bias than the inconsistent OLS estimator.

INSTRUMENTAL VARIABLES (IV) ESTIMATOR

The IV estimator involves the following two-step procedure.

1. Find one instrumental variable for each right-hand side variable in the equation to be estimated. A valid instrumental variable has two properties:
   1. Instrument relevance – it is correlated with the variable for which it is to serve as an instrument.
   2. Instrument exogeneity – it is not correlated with the error term in the equation to be estimated.
2. Apply the following formula to the sample data:

β̂IV = (ZᵀX)⁻¹Zᵀy

where X is the TxK data matrix for the original right-hand side variables, Z is the TxK data matrix for the instrumental variables, and y is the Tx1 column vector of observations on the dependent variable in the equation to be estimated.

Comments

Each exogenous right-hand side variable can serve as its own instrumental variable. This is because it is perfectly correlated with itself and is not correlated with the error term by the assumption of exogeneity. The best candidates to be an instrumental variable for an endogenous right-hand side variable in the equation to be estimated are exogenous variables that appear in other equations in the model. This is because they are correlated with the endogenous variables in the model via the reduced-form equations, but they are not correlated with the error term in any equation. Oftentimes there will exist more than one exogenous variable that can serve as an instrumental variable for an endogenous variable. In this case, you can do one of two things. 1) Use as your instrumental variable the exogenous variable that is most highly correlated with the endogenous variable. 2) Use as your instrumental variable the linear combination of candidate exogenous variables most highly correlated with the endogenous variable. As we will see later, if we do this we have a more general type of IV estimator called the two-stage least squares estimator.
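The IV rule is easy to compute directly. Below is a minimal numerical sketch in Python/NumPy; the data-generating process, variable names, and parameter values are hypothetical and only illustrate the formula, they are not part of the notes.

import numpy as np

rng = np.random.default_rng(0)
T = 500
z = rng.normal(size=T)                      # exogenous instrument
u = rng.normal(size=T)                      # structural error term
x = 0.8 * z + 0.5 * u + rng.normal(size=T)  # endogenous regressor: correlated with u
y = 1.0 + 2.0 * x + u                       # structural equation, true slope = 2

X = np.column_stack([np.ones(T), x])        # original right-hand side variables (with constant)
Z = np.column_stack([np.ones(T), z])        # instruments: the constant serves as its own instrument

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)  # biased and inconsistent here
beta_iv  = np.linalg.solve(Z.T @ X, Z.T @ y)  # IV rule: (Z'X)^(-1) Z'y, consistent
print("OLS:", beta_ols, "IV:", beta_iv)

In a large sample the IV slope is close to 2, while the OLS slope is pushed upward by the positive correlation between x and u.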
Relationship Between the IV Estimator and Identification

The following relationship exists between the IV estimator and identification. If the equation is exactly identified, then there are exactly enough exogenous variables excluded from the equation to serve as instrumental variables for the endogenous right-hand side variable(s). If the equation is overidentified, then there are more than enough exogenous variables excluded from the equation to serve as instrumental variables for the endogenous right-hand side variable(s). If the equation is unidentified, then there are not enough exogenous variables excluded from the equation to serve as instrumental variables for the endogenous right-hand side variable(s). In this case, the IV estimator cannot be used.

Properties of the IV Estimator

Like all estimators, it is biased in finite samples. It is consistent in large samples. It is not necessarily asymptotically efficient. This is because an endogenous variable can have more than one instrumental variable, and each instrumental variable results in a different IV estimator. The higher the correlation between the endogenous variable and the instrumental variable, the more efficient the IV estimator. If there is heteroscedasticity, then the IV estimator is not efficient in the class of consistent estimators, and the estimated standard errors are biased and inconsistent. It is not the maximum likelihood estimator.

TWO-STAGE LEAST SQUARES (2SLS) ESTIMATOR

The 2SLS estimator is a generalization of the IV estimator. It reduces to the IV estimator if the equation is exactly identified.

2SLS Rule

The 2SLS estimator is given by the rule,

β̂2SLS = (XᵀPX)⁻¹XᵀPy, where P = Z(ZᵀZ)⁻¹Zᵀ is called the projection matrix.

Note that Z is now a TxI matrix, where I is the number of instruments (identifying and other). If the equation is exactly identified, then I = K. If the equation is overidentified, then I > K. If the error term has constant variance and the errors are uncorrelated, then the variance-covariance matrix of the estimates is

cov(β̂2SLS) = σ²(XᵀPX)⁻¹

The estimated variance-covariance matrix replaces the unknown σ² with the estimate σ̂² = RSS/T. The default 2SLS procedure in Stata divides by T – k instead, on the grounds that this is a better approximation in finite samples. Asymptotically, the two are equivalent.

Two-Stage Implementation of Rule

This estimator can be implemented by using two successive applications of the OLS estimator. This two-stage procedure is as follows.

Stage #1: Regress each right-hand side endogenous variable in the equation to be estimated on all exogenous variables in the simultaneous equations model using the OLS estimator. Calculate the fitted values for each of these endogenous variables.

Stage #2: In the equation to be estimated, replace each endogenous right-hand side variable by its fitted value variable. Estimate the equation using the OLS estimator.
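A minimal NumPy sketch of the 2SLS rule, reusing the hypothetical y, X, Z arrays from the IV example above (the names and the homoscedasticity assumption are illustrative, not part of the notes):

import numpy as np

def two_sls(y, X, Z):
    # Projection matrix P = Z (Z'Z)^(-1) Z'
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T)
    XPX = X.T @ P @ X
    beta = np.linalg.solve(XPX, X.T @ P @ y)   # 2SLS rule: (X'PX)^(-1) X'Py
    # Use residuals from the *structural* equation (original X), not the stage 2 regression,
    # when estimating sigma^2 -- this is the standard-error correction discussed below.
    resid = y - X @ beta
    sigma2 = resid @ resid / len(y)            # Stata-style alternative: divide by T - k
    cov = sigma2 * np.linalg.inv(XPX)
    return beta, cov

beta_2sls, cov_2sls = two_sls(y, X, Z)
print(beta_2sls, np.sqrt(np.diag(cov_2sls)))

With exactly as many instruments as right-hand side variables (as in the earlier example), the 2SLS coefficients reproduce the IV coefficients.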
Comments

Stage 1 is identical to estimating the reduced-form equation for each endogenous right-hand side variable in the equation to be estimated. The exogenous variables in the stage 1 regression are the instruments. They can be placed under two categories: 1) identifying instruments, and 2) other instruments. An identifying instrument is any exogenous variable that has been excluded from an equation to identify it. Other instruments are exogenous variables included in the equation that serve as instruments for themselves. The fitted value variable from the stage 1 regression is the linear combination of instruments that has the highest correlation with the endogenous explanatory variable in the structural equation. At least one identifying instrument must be partially correlated with the endogenous explanatory variable; if not, then the fitted value variable will be perfectly correlated with the exogenous variables included in the stage 2 regression and the 2SLS estimator cannot be used.

The estimated standard errors obtained from the stage 2 regression are incorrect and must be corrected. This is because the estimate σ̂² = RSS/(T – k) that uses the RSS from the stage 2 regression is wrong; we need to use the RSS from the estimated structural equation. Statistical programs that have a 2SLS procedure make this correction automatically and report the correct standard errors.

Logic of 2SLS Estimator

Suppose Y is the dependent variable, X is the endogenous right-hand side variable, μ is the error term, and Z is the instrumental variable. We can decompose the variation in X into two parts: one part is correlated with μ, and the other part is uncorrelated with μ. To get an unbiased estimate of the effect of X on Y, we need to use the variation in X that is uncorrelated with μ, and eliminate the variation in X that is correlated with μ. To capture the variation in X that is uncorrelated with μ, we use an instrumental variable, Z, that is correlated with X but uncorrelated with μ. For Z to perform this function, it must be relevant and exogenous. If it is not relevant, then it is not correlated with X, and therefore it does not capture variation in X. If it is not exogenous, then it is correlated with μ, and therefore it captures variation in X that is correlated with μ.

How does the 2SLS estimator capture the variation in X uncorrelated with μ, and disregard the variation in X correlated with μ? The stage 1 regression can be written as

Xt = π0 + π1Zt + νt

This regression decomposes the variation in X into two parts. 1) The systematic component π0 + π1Z captures the variation in X explained by Z but not explained by μ. This is because Z is correlated with X but uncorrelated with μ. 2) The error term ν captures the variation in X explained by μ and any additional factors other than Z. However, the true values π0 + π1Z are unknown because the parameters π0 and π1 are unknown. Therefore, we use the predicted values X̂ = π̂0 + π̂1Z from a regression of X on Z using OLS.

The stage 2 regression can be written as

Yt = α + βX̂t + εt

OLS yields a consistent estimate of β because X̂t is not correlated with the error term μt. Note that εt = Yt – α – βX̂t, while μt = Yt – α – βXt. To obtain a correct estimate of the standard error of the estimate, we must use the residuals μ̂t = Yt – α̂ – β̂Xt. Statistical programs with a 2SLS command calculate these residuals for you.

Properties of the 2SLS Estimator

Like all estimators, it is biased in finite samples. It is consistent in large samples. If there is no heteroscedasticity or autocorrelation, then it is asymptotically efficient. If there is heteroscedasticity or autocorrelation, then it is not asymptotically efficient and the estimated standard errors are inconsistent. To get consistent estimates of the standard errors, you can use White robust standard errors. It is not the maximum likelihood estimator.

2SLS vs OLS

If an explanatory variable is correlated with the error term, the OLS estimator is biased and inconsistent. OLS has a smaller variance than 2SLS. If you compare the OLS and 2SLS standard errors and t-statistics, OLS tends to have smaller standard errors and bigger t-statistics. The 2SLS estimator is consistent regardless of whether or not the error term is correlated with an explanatory variable.
But if the error term is not correlated with an explanatory variable, then you should use OLS, because it has a smaller variance than 2SLS and will produce more precise estimates.

GENERALIZED METHOD OF MOMENTS (GMM) ESTIMATOR

The generalized method of moments estimator is a generalization of the 2SLS and IV estimators. If the error term has constant variance and the errors are uncorrelated, then the GMM estimator reduces to the 2SLS estimator if the equation is overidentified and to the IV estimator if the equation is exactly identified.

Logic of GMM Estimator

Assume that the instruments (identifying and other) are not correlated with the error term. If this is valid, then Cov(Z, μ) = E[Z(Y – Xβ)] = 0 in the population. This results in I moment or orthogonality conditions, one for each instrumental variable. GMM imposes this restriction on the sample. This yields a system of I equations in the K unknown parameters β. The expectations operator E[∙] for the population is replaced by the sample average operator (1/T)∑ (summing over t = 1, …, T).

If the structural equation is exactly identified (I = K), then the number of instruments is exactly equal to the number of right-hand side variables, the number of equations is equal to the number of unknown parameters, and there is a unique solution for β. In this case GMM reduces to IV. However, if the equation is overidentified (I > K), then the number of instruments is greater than the number of right-hand side variables, the number of equations is greater than the number of unknown parameters, and there is not a unique solution for β. In this case, weights can be applied to the instruments to find a unique solution. These weights are the elements of a weighting matrix; designate this M. M is an IxI matrix.

GMM Estimator Rule

The GMM estimator is given by the rule,

β̂GMM = (XᵀZMZᵀX)⁻¹XᵀZMZᵀy

There is a different GMM estimator for each possible weighting matrix M. The optimal weighting matrix is the one that produces asymptotically efficient estimates. This is given by

M = [(1/T)(ZᵀWZ)]⁻¹

where T is the sample size and W is the TxT variance-covariance matrix of the errors. To obtain an estimate of M we need to estimate the elements of W. Assume the errors are uncorrelated but we have heteroscedasticity of unknown form. The elements on the principal diagonal of W are the unknown variances of the T observations. The elements off the principal diagonal are zero by the assumption of no autocorrelation. To estimate the T unknown variances, we use the T squared residuals. This will produce a consistent estimate of M. This is because M is an IxI matrix, where I is the number of instruments (identifying and other). We can get a consistent estimate of an IxI matrix with T observations, because the elements of Z are known numbers (data for the instruments). The most common way to implement this GMM estimator is to use the following two-step procedure, called the two-step GMM estimator.

Two-Step GMM Estimator

Step #1: Estimate the equation using 2SLS. Save the residuals. Square the residuals. Use the squared residuals to obtain an estimate of W. Use the estimate of W to obtain an estimate of M.

Step #2: Apply the GMM estimator rule: β̂GMM = (XᵀZMZᵀX)⁻¹XᵀZMZᵀy
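A minimal NumPy sketch of the two-step procedure, assuming heteroscedasticity of unknown form and hypothetical y, X, Z arrays like those above (the 1/T scaling of M cancels in the estimator, so it is omitted here):

import numpy as np

def gmm_two_step(y, X, Z):
    # Step 1: 2SLS to obtain preliminary residuals.
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T)
    b1 = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)
    e = y - X @ b1
    # Step 2: estimate W by diag(e^2), build the weighting matrix M = (Z' diag(e^2) Z)^(-1),
    # then apply the GMM rule (X'ZMZ'X)^(-1) X'ZMZ'y.
    S = (Z * (e ** 2)[:, None]).T @ Z      # Z' diag(e^2) Z, an I x I matrix
    M = np.linalg.inv(S)
    A = X.T @ Z @ M @ Z.T @ X
    beta = np.linalg.solve(A, X.T @ Z @ M @ Z.T @ y)
    cov = np.linalg.inv(A)                 # heteroscedasticity-robust variance-covariance matrix
    return beta, cov

When the errors happen to be homoscedastic, this estimator collapses asymptotically to 2SLS, as the notes state.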
Properties of the GMM Estimator

Like all estimators, it is biased in small samples. It is consistent in large samples. It is asymptotically efficient and produces consistent estimates of the standard errors. If there is heteroscedasticity, it produces more efficient estimates than 2SLS. If there is no heteroscedasticity, then the GMM estimator reduces to the 2SLS estimator.

Testing for Heteroscedasticity

Suppose that we want to test the null hypothesis of no heteroscedasticity for one structural equation in a system of M structural equations. If the remaining M – 1 structural equations have no heteroscedasticity, then you can use the White test. However, if any of these other structural equations have heteroscedasticity, then the White test is not valid. This is true even if we don't estimate these other structural equations. In this case, the appropriate test is a Pagan-Hall test. The test statistic has an approximate chi-square distribution with degrees of freedom equal to the number of instruments I (identifying and other) in the equation,

PH statistic ~ χ²(I)

CHECKING FOR VALIDITY OF INSTRUMENTS

For the IV, 2SLS, and GMM estimators to have desirable properties, the instruments must be relevant and exogenous. We can use the sample data to check instrument relevance if there is only one endogenous explanatory variable. (There are more complicated methods if you have two or more endogenous variables.) We can also test the hypothesis of exogeneity if we have enough information in the sample. We have enough information if the equation is overidentified.

Checking Instrument Relevance

The instruments can be either irrelevant or relevant. If they are relevant, they can vary from weak to strong. We can think of the strength of instruments as a continuum:

Irrelevant → Weak → Strong

If the instruments are irrelevant, then they are not correlated with the endogenous explanatory variable. This is typically not the case in practice. The higher the correlation, the stronger the instruments. Irrelevant or weak instruments cause two problems. 1) The 2SLS estimator is still consistent, but it can have a large bias in finite samples; it can produce estimates that are worse than the OLS estimator. 2) Hypothesis tests are not valid.

To check the strength of the identifying instruments, calculate the F-statistic for the null hypothesis that the identifying instruments have no joint effect in the first-stage regression (a numerical sketch of this F-statistic is given at the end of this subsection). The bigger (smaller) the F-statistic, the stronger (weaker) the instruments. A larger F-statistic indicates that the instruments contain more information about the endogenous variable. How big must the F-statistic be for the instruments to be sufficiently strong? There is no specific answer to this question, only rules of thumb. Stock and Watson show that the mean of the sampling distribution of the 2SLS estimator in large samples is approximately

E(β̂2SLS) ≈ β + (βOLS – β) · 1/[E(F) – 1]

where βOLS is the OLS estimator, (βOLS – β) is the bias in the OLS estimator, and E(F) is the expected value of the F-statistic. Note that the expression 1/[E(F) – 1] is the bias in β̂2SLS relative to βOLS. The larger (smaller) the F-statistic, the smaller (larger) the bias in β̂2SLS relative to βOLS. For example, if F = 2, then 1/[E(F) – 1] = 1/(2 – 1) = 1. In this case, the bias in β̂2SLS is the same as the bias in βOLS. If F = 3, then 1/[E(F) – 1] = 1/(3 – 1) = ½. In this case, the bias in β̂2SLS is one-half the bias in βOLS. If F = 11, then 1/[E(F) – 1] = 1/(11 – 1) = 1/10. In this case, the bias in β̂2SLS is one-tenth the bias in βOLS. Some econometricians believe that a bias of about 10% or less is small enough to be acceptable in most applications, but this is only a rule of thumb.
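A minimal NumPy sketch of the first-stage F-statistic for the joint significance of the identifying instruments; the function and argument names are hypothetical (x_endog is the endogenous regressor, X_incl the included exogenous variables with a constant, Z_ident the identifying instruments):

import numpy as np

def first_stage_F(x_endog, X_incl, Z_ident):
    # Compare the restricted first stage (included exogenous variables only) with the
    # unrestricted first stage (included exogenous variables plus identifying instruments).
    def rss(y, X):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        e = y - X @ b
        return e @ e
    X_unres = np.column_stack([X_incl, Z_ident])
    T = len(x_endog)
    q = Z_ident.shape[1]                      # number of identifying instruments
    rss_r = rss(x_endog, X_incl)
    rss_u = rss(x_endog, X_unres)
    return ((rss_r - rss_u) / q) / (rss_u / (T - X_unres.shape[1]))

Under the rule of thumb above, an F-statistic near 11 or larger keeps the bias of 2SLS relative to OLS at roughly 10% or less.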
Checking Instrument Exogeneity

If any instrument is correlated with the error term, then it is not exogenous. If it is not exogenous, then the IV, 2SLS, and GMM estimators will be inconsistent. We cannot test whether an instrument is correlated with the error term if the equation is exactly identified, because we don't have enough information. We can test whether the instruments are correlated with the error term if the equation is overidentified, because we have sufficient information. To test for exogeneity of the instruments, we do a test of overidentifying restrictions. This test is discussed below.

SYSTEM ESTIMATORS

A system estimator can be used to estimate two or more equations in a simultaneous equations model together. It uses more information than a single equation estimator (e.g., contemporaneous correlation among the error terms across equations, cross-equation restrictions, etc.), and therefore will produce more precise estimates. We will consider two system estimators:

1. Three-stage least squares (3SLS) estimator
2. Iterated three-stage least squares (I3SLS) estimator

THREE-STAGE LEAST SQUARES (3SLS) ESTIMATOR

The 3SLS estimator involves the following three-stage procedure.

1. Same as stage 1 in 2SLS.
2. Same as stage 2 in 2SLS.
3. Apply the SUR estimator.

ITERATED THREE-STAGE LEAST SQUARES (I3SLS) ESTIMATOR

The I3SLS estimator involves the following three-stage procedure.

1. Same as stage 1 in 2SLS.
2. Same as stage 2 in 2SLS.
3. Apply the ISUR estimator.

Properties of the 3SLS and I3SLS Estimators

If the error term is correlated with one or more explanatory variables, then the 3SLS and I3SLS estimators are biased in small samples. However, if there is no heteroscedasticity, then they are both consistent and asymptotically more efficient than single equation estimators. Even though they have the same asymptotic properties, their estimates can differ in small samples. There is an ongoing debate about whether I3SLS or 3SLS produces better estimates in small samples. If there is heteroscedasticity, then both 3SLS and I3SLS produce inconsistent estimates of the parameters and they should not be used.

Major Shortcoming of the 3SLS and I3SLS Estimators

If there is heteroscedasticity, then both 3SLS and I3SLS are inconsistent. Many economists choose not to use either of these with cross-section data, because with cross-section data the error term often has non-constant variance. In this case, these estimators can produce poor estimates.

HYPOTHESIS TESTING

The small sample t-test and F-test cannot be used for a simultaneous equations model. This is because, if the error term is correlated with one or more explanatory variables, we don't know the sampling distributions of the t-statistic and F-statistic. The following large sample tests can be used: 1) asymptotic t-test, 2) approximate F-test, 3) Wald test, 4) Lagrange multiplier test. Note that because the IV, 2SLS, GMM, 3SLS, and I3SLS estimators do not produce maximum likelihood estimates, the likelihood ratio test cannot be used to test hypotheses.

SPECIFICATION TESTING

A specification test uses the sample data to test an assumption that defines the specification of the model. Two important specification tests for simultaneous equations regression models are:

1. Test of exogeneity
2. Test of overidentifying restrictions

We will implement these tests using a single equation estimation procedure.

TEST OF EXOGENEITY

This is a test of whether one or more right-hand side variables are exogenous against the alternative that they are endogenous.
It is also a test of whether the OLS estimator is biased against the alternative that it is unbiased.

Notation

Designate the equation to be estimated and the identifying instruments as

Y = a + bY1 + cX + ε;  Z = identifying instruments

where Y is the dependent variable; Y1 is a vector of one or more right-hand side variables that you believe may or may not be exogenous; X is a vector of right-hand side variables you believe are exogenous; a is the intercept; b and c are vectors of slope coefficients attached to the variables in Y1 and X, respectively; Z is a vector of exogenous variables that are excluded from this equation, and therefore are used as identifying instruments for the endogenous variable(s) in Y1; and ε is the error term.

Hausman Test

The most often used test of exogeneity is the Hausman test. It is also used to test whether the OLS estimator is biased. The Hausman test is based on the following methodology. Let Y1 be interpreted more generally as a vector that contains one or more variables that you believe may be correlated with the error term ε. The null and alternative hypotheses are as follows:

H0: Y1 and ε are not correlated (Y1 is exogenous).
H1: Y1 and ε are correlated (Y1 is endogenous).

To test the null hypothesis that Y1 and ε are not correlated, we proceed as follows.

1. Compare the OLS and 2SLS estimators. OLS is a consistent estimator if the null hypothesis is true but inconsistent if the null hypothesis is false. 2SLS is a consistent estimator whether the null hypothesis is true or false.
2. If the null hypothesis is true, then both estimators should produce similar estimates. If the null hypothesis is false, then the two estimators should produce significantly different estimates. Thus, to test the null hypothesis you test the equality of the estimates produced by the two estimators.
3. If the estimates produced by the two estimators are significantly different, then you reject the null hypothesis and conclude that the sample provides evidence that Y1 is correlated with ε in the population. If the parameter estimates produced by the two estimators are not significantly different, then you accept the null hypothesis and conclude that Y1 is not correlated with ε in the population.

If the vector Y1 contains one variable, then you are testing whether a single right-hand side variable is exogenous. If the vector Y1 contains two or more variables, then you are testing whether two or more right-hand side variables are jointly exogenous. If we are testing whether the OLS estimator produces biased estimates, then we interpret the null and alternative hypotheses as follows:

H0: OLS is unbiased
H1: OLS is biased

Interpretation of the Hausman Test

If we reject the null hypothesis, then we have evidence that Y1 is correlated with ε, and therefore Y1 is endogenous. However, we cannot conclude with certainty what causes the correlation between Y1 and ε. It may be reverse causation, an omitted confounding variable, or both. If we reject the null hypothesis, we have also found evidence that OLS is biased relative to 2SLS. If we accept the null, this suggests that OLS may not be biased. This may be the case if Y1 is not correlated, or only weakly correlated, with ε. In this case, we may want to use OLS. This is because OLS is more efficient than 2SLS, and therefore may produce estimates that are closer to the true values of the parameters.

Implementation of the Hausman Test

The easiest way to implement the Hausman test is to use Wu's approach. This involves the following steps.
1. Regress each variable in Y1 on all variables in X and Z (all exogenous variables in the model) using the OLS estimator. This is the stage 1 regression(s) of 2SLS.
2. Save the residuals from each of these regressions. Denote this vector of residuals ν̂. The residuals from each regression in step #1 form a "residual variable".
3. Estimate the following regression equation using the OLS estimator:

Y = a + bY1 + cX + dν̂ + v

where d denotes the vector of coefficients attached to the residual variables.
4. Test the following null and alternative hypotheses:

H0: d = 0 (Y1 is exogenous; OLS is unbiased)
H1: d ≠ 0 (Y1 is endogenous; OLS is biased)

5. If there is one variable in Y1, and therefore one residual variable in ν̂ and one coefficient in d, then this hypothesis can be tested using a t-test. If there is more than one variable in Y1, and therefore more than one residual variable in ν̂ and more than one coefficient in d, then this hypothesis can be tested using an F-test. (A numerical sketch of this procedure follows the next subsection.)

Logic of Wu's Approach

The structural equation is

Y = a + bY1 + cX + ε

We want to test whether Y1 is correlated with ε. The first-stage regression is

Y1 = α1 + α2X + α3Z + ν

Because Y1 depends upon ν, Y1 is correlated with ν. Y1 is uncorrelated with ε if ν is uncorrelated with ε. Write ε = dν + v. If d = 0, then ε and ν are uncorrelated, and therefore ε and Y1 are uncorrelated. If d ≠ 0, then they are correlated. We can substitute the expression for ε into the structural equation and rewrite it as

Y = a + bY1 + cX + dν + v

If d = 0, then there is no evidence that Y1 is correlated with ε. If d ≠ 0, then there is evidence that Y1 is correlated with ε. To do the test, we use the residuals ν̂ as an estimate of the unknown errors ν.
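A minimal NumPy sketch of Wu's regression-based version of the Hausman test; the function name and arguments (y, Y1 as a T x m array, X_incl with a constant, Z_ident) are hypothetical, and the statistic returned is the F-statistic described in step 5:

import numpy as np

def wu_hausman_F(y, Y1, X_incl, Z_ident):
    def ols_resid(y, X):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return y - X @ b
    # Steps 1-2: first-stage residuals, one column for each variable in Y1.
    W = np.column_stack([X_incl, Z_ident])           # all exogenous variables in the model
    V_hat = np.column_stack([ols_resid(Y1[:, j], W) for j in range(Y1.shape[1])])
    # Step 3: augmented structural regression including the residual variables.
    X_aug = np.column_stack([X_incl, Y1, V_hat])
    X_res = np.column_stack([X_incl, Y1])            # restricted regression (d = 0 imposed)
    e_u = ols_resid(y, X_aug)
    e_r = ols_resid(y, X_res)
    # Steps 4-5: F-test that the coefficients on the residual variables are jointly zero.
    T, m = len(y), V_hat.shape[1]
    return ((e_r @ e_r - e_u @ e_u) / m) / ((e_u @ e_u) / (T - X_aug.shape[1]))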
TEST OF THE OVERIDENTIFYING RESTRICTIONS

It is possible to test the overidentifying restrictions for a single equation in a system of equations. When we test the overidentifying restrictions, we are testing whether the variables that were excluded to get identification can be validly excluded, or whether at least one of them should be included in the equation. Therefore, we are testing the following null and alternative hypotheses:

H0: The overidentifying restrictions are valid.
H1: The overidentifying restrictions are not valid.

An alternative interpretation of the null and alternative hypotheses is:

H0: The instruments are exogenous (not correlated with the error term).
H1: At least one instrument is endogenous (correlated with the error term).

We cannot test the identifying restriction(s) for an equation that is exactly identified because we don't have enough information to conduct the test.

Notation

Designate the equation to be estimated before the identifying instruments are excluded as

Y = a + bY1 + cX + dZ + ε

where all variables and parameters have been defined previously. Note that this is the equation before it is identified, and therefore the variables in the vector Z have not been excluded. The null and alternative hypotheses can be expressed as follows:

H0: d = 0 (Z has no effect on Y, and therefore Z is not correlated with ε)
H1: d ≠ 0 (At least one variable in Z has an effect on Y, and therefore is correlated with ε)

If we reject the null and conclude that at least one of the instruments in Z belongs in the equation, then at least one of the instruments is endogenous and correlated with the error term, because its effect is included in the error term.

Sargan Lagrange Multiplier Test

The easiest way to test the null hypothesis that the overidentifying restrictions are valid is to use a Lagrange multiplier test. This is called a Sargan test. The test statistic and sampling distribution for this test are

LM = T·R² ~ χ²(#Z – #Y1)

where T is the sample size; R² is the uncentered R-squared statistic from an auxiliary regression; and χ² is the chi-square distribution with #Z – #Y1 degrees of freedom, where #Z is the number of variables excluded from the equation and #Y1 is the number of endogenous right-hand side variables in the equation (this difference is equal to the number of overidentifying restrictions).

Calculating the LM Test Statistic

To calculate the LM test statistic, we need to estimate the restricted model without the variables in Z. We then use information obtained from the restricted model to run an auxiliary regression to obtain the uncentered R² statistic. This two-step approach is as follows.

1. Estimate the following restricted model using the 2SLS estimator,

Y = a + bY1 + cX + ε

using as instruments for Y1 all variables in the vectors X and Z.
2. Save the residuals from this regression. Denote the residual variable ε̂.
3. Regress the residual variable ε̂ on all the variables in X and Z using the OLS estimator; that is, estimate the following equation using the OLS estimator,

ε̂ = θ1X + θ2Z + v

4. Use this regression to calculate the LM test statistic LM = T·R².

(A numerical sketch of this calculation is given at the end of this section, after the discussion of the Hansen test.)

Notes about the Test of Overidentifying Restrictions

1. A sufficiently high R² statistic indicates that one or more of the variables in Z is correlated with the residuals ε̂. What determines "sufficiently high"? We compare the LM test statistic to a critical value for a given level of significance. A significant statistic provides evidence that at least one variable in Z is correlated with the error term and should be included in the restricted model. This variable may be correlated with the error term either because it has a direct effect on Y or because it is correlated with another variable in ε that has an effect on Y.
2. If you reject the null hypothesis, then you are rejecting the overidentifying restrictions. This casts doubt on the identifying restrictions, because the overidentifying restrictions cannot be separated from the identifying restrictions.
3. If you reject the overidentifying restrictions, the test gives you no guidance about what to do next. A test does not exist that allows you to determine which variable or variables in Z should not be excluded from the equation being estimated.

Heteroscedasticity

The Sargan test assumes that the error term has constant variance. If this assumption is not valid, then the Sargan test is not valid. The appropriate test in this case is a Hansen test. The test statistic for the Hansen test is called the J-statistic. It has an approximate chi-square distribution with degrees of freedom equal to the number of overidentifying restrictions tested. The Sargan test statistic is a special case of the J-statistic when there is no heteroscedasticity. To do a Hansen test, you must estimate the equation using the GMM estimator.
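A minimal NumPy sketch of the Sargan statistic as described above (homoscedastic errors assumed; the function name and arguments y, Y1, X_incl, Z_ident are hypothetical):

import numpy as np

def sargan_LM(y, Y1, X_incl, Z_ident):
    # Step 1: 2SLS estimate of the restricted equation, using X_incl and Z_ident as instruments.
    X = np.column_stack([X_incl, Y1])
    Z = np.column_stack([X_incl, Z_ident])
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T)
    beta = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)
    # Step 2: residuals from the structural equation.
    e = y - X @ beta
    # Step 3: auxiliary OLS regression of the residuals on all exogenous variables.
    g, *_ = np.linalg.lstsq(Z, e, rcond=None)
    fitted = Z @ g
    # Step 4: LM = T * uncentered R^2.
    R2_uncentered = (fitted @ fitted) / (e @ e)
    return len(y) * R2_uncentered

The degrees of freedom for the chi-square comparison are Z_ident.shape[1] – Y1.shape[1], the number of overidentifying restrictions.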