Statistical Analysis of Quantitative Data
Arkadiusz M. Kowalski
Tomasz M. Napiórkowski
Warsaw 2014
This textbook was prepared for the purposes of the International Doctoral Programme in Management and Economics organized within the Collegium of World Economy at the Warsaw School of Economics.
The textbook is co-financed by the European Union from the European Social Fund.
This textbook is distributed free of charge.
Table of Contents
INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1. BASELINE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1. Equations Explained . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2. Parameter Estimation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3. Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4. Using Statistical Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5. Econometrics Software. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2. DATA TYPES AND STRUCTURAL EQUATIONS DESCRIPTION . . . . . . . . . . 13
2.1. Cross-section, Time-series and Panel data defined . . . . . . . . . . . . . . 13
2.2. Structural Equation Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1. Cross-Section structural equation . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2. Time-Series structural equation . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.3. Panel structural equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3. VARIABLES AND DATASET WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1. Naming Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2. Examining Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3. Stationarity Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4. Correlation Matrix Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5. Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.6. Hypotheses Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.7. Dummy Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.7.1. Dummy Variables: Example . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.7.2. Dummy Variables: Pitfalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.8. Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.9. Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4. MODEL DETERMINATION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1. Model Estimation with a Forward Stepwise Method . . . . . . . . . . . . 34
5. MODEL TESTING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1. Multicollinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2. Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3. Heteroscedasticity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6. MODEL RESULTS INTERPRETATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.1. Interpreting and Testing Variables and Coefficients . . . . . . . . . . . . . 48
6.2. Interpreting Model’s Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7. FORECASTING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.1. Forecasting as a Model Testing Tool . . . . . . . . . . . . . . . . . . . . . . . . 57
7.2. Forecasting with ARIMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.3. Forecast Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
8. CONCLUSIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
A. TRANSITION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
EXAMPLE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Hypothesis Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Correlation matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Unit Root Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Model Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
FINAL REMARKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
STATISTICAL TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
z-table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
t-table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
F-table at 0.01 level of significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
F-table at 0.025 level of significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
F-table at 0.05 level of significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
F-table at 0.1 level of significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
χ2 distribution table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
NOTES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
INTRODUCTION
Dr Arkadiusz M. Kowalski
Tomasz M. Napiórkowski
Is this book for you? If you are connected with econometrics in any way,
this book is for you. If you are just starting the subject, this book will provide
you with the basic theory and show you how to use it effectively through the
employment of econometric software (EViews,1 for example). On the other hand,
if you already have some experience, then this book will be a useful bring-it-all-together
reference that you may want to visit every time you have a question about
statistical tests, degrees of freedom or other aspects of econometrics and
its uses.
So, now that we know that this book is for you, welcome to the place
where some of the most common econometric theories and methodologies are
explained using a step-by-step look at the blueprint of econometric research,
starting with the raw data and ending with the final ready-to-submit
model, its testing and interpretation. Each chapter uses everyday-language
explanations, so there is no worry that you will be overwhelmed by pages of
equations or words that would require you to carry a dictionary with you. Every
theory and every statistical test is clearly defined and supported by a real-world
example with software outputs and their full interpretation. Take note that this
book is example-heavy and hands-on, with only the essential theory explained.
At the end of the book you will find a full-length example of a research project
that ties directly to the book by following its chapters and subsections. The use of
a full, real-life example that takes the reader from the beginning to the end
reinforces the reader's understanding of the methodologies used. Clear
references to specific sections of the book will provide a deep understanding
of the workflow associated with performing an econometric study.
1 To find out more about this software, please visit: http://www.eviews.com/home.html.
Book Sections
The book is designed around eight chapters and the example.
1. Baseline:
• this section explains basic econometrics and associated notation as
well as what data is used in the examples.
2. Data Types and Structural Equations Description:
• a comprehensive introduction to three most common dataset types
(cross-section, time-series and panel) and a how-to regarding the
construction of the structural equation.
3. Variables and Dataset Work:
• a look at how to efficiently name, examine, test, adjust and record
variables, what they are, how to create and use dummy variables as
well as how to clean the dataset.
4. Model Determination:
• development of the model from the initial stage to its final version by
employing the LM test for additional information in residuals.
5. Model Testing:
• a detailed look at detection, consequences and solutions to problems
of multicollinearity, autocorrelation and heteroscedasticity.
6. Model Results Interpretation:
• interpretation of the estimated and corrected model, its coefficients
(including coefficient testing) and model descriptive statistics like
R-squared.
7. Forecasting:
• a look at forecasting as a model verification tool, ex-ante and ex-post
forecasting using the plug-in method and ARIMA models.
8. Conclusion:
• crafting finishing remarks.
CHAPTER ONE
Baseline
In order to make sure that all readers, regardless of the level of advancement
in the subject, can use this book to their full benefit, this section covers basic
topics and terms needed when working with econometrics and using this book
to its full potential.
Econometrics is the science of crafting mathematical models based on sets of
economic data. As one will come to realize, data can be found on almost anything
that is happening in the world; be it human behavior like consumer spending
or decisions of Federal Reserve Banks on interest rates, all of it is
recorded and used by others for analysis.
Popular sources of data are central banks (e.g., for the U.S., the Federal Reserve
Bank of St. Louis Federal Reserve Economic Data, or FRED, is such an example1),
special government agencies like the U.S. Bureau of Labor Statistics2 and international
organizations in the vein of the World Bank,3 the Organisation for Economic
Co-operation and Development (OECD),4 Eurostat5 and the International Monetary
Fund (IMF).6
1.1. Equations Explained
In this book, all models will be built based on the “skeleton” with n
independent variables as can be seen in Equation 1:
1 For more information see: http://research.stlouisfed.org/fred2/.
2 For more information see: http://bls.gov/.
3 For more information see: http://data.worldbank.org/.
4 For more information see: http://stats.oecd.org/.
5 For more information see: http://epp.eurostat.ec.europa.eu/portal/page/portal/statistics/search_database.
6 For more information see: http://www.imf.org/external/data.htm.
Equation 1. Basic structural equation, i.e., the skeleton
Y = β0 + β1X1 + β2X2 + … + βnXn + ε
Source: Authors’ own equation
Here, the following symbols are used:
Y – the dependent variable. It is called the dependent variable because its value depends on other variables. This is the variable that the model is aiming to explain.
β0 – a parameter called the constant term. Its presence is required in all estimated models.
β1 – a parameter called the coefficient of the first independent (or explanatory) variable. When we talk about estimating the model, we are referring to estimating these coefficients as well as the constant term. After the estimation is completed, especially if all of the explanatory variables have identical units (dollars, for example), it is useful to list them in the estimated model in order of magnitude. This benefits the reader of the report by immediately informing him or her which of the variables used have the greatest impact on the dependent variable.
X1 – the first independent or explanatory variable. This is one of the variables used in explaining the movement or the value of the dependent variable. It is called the independent variable because it comes from the dataset and, ideally, does not directly depend on any other variables in the model.
ε – the error term. This accounts for any inaccuracies in the model. Since there is no such thing as a perfect model, the gap between the estimated model and the "perfect model" that predicts values of the dependent variable equal to its actual values is called the residual.
1.2. Parameter Estimation Methods
There are many methods that allow the researcher to obtain a model and its
parameters, each suited for specific situations. The word "method" really refers
to the way the parameters, βn, are estimated. Such approaches include: Ordinary
Least Squares (OLS), Generalized Least Squares (GLS), Weighted Least Squares
(WLS), Two- and Three-Stage Least Squares (2SLS and 3SLS) and the Generalized
Method of Moments (GMM). This book will mostly employ Ordinary Least
Squares, which can be modified by, for example, adding cross-section and time
fixed or random effects, as it is the most common and the easiest one to use,
and it adequately serves the purpose of estimating models like the ones discussed
in this book.
For detailed explanations of OLS and other approaches, as well as detailed
mathematical explanations of the concepts covered in this work, we suggest referring
to books that are theory-heavy. One book worth recommending is Econometric
Models and Economic Forecasts by Robert S. Pindyck and Daniel L. Rubinfeld.7
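For readers who would like to see the mechanics of OLS outside EViews, below is a minimal sketch in Python using the statsmodels package. The data are simulated and the variable names (gdp, imports) are purely illustrative, not taken from the book's datasets.

```python
# A minimal OLS sketch; the data are simulated and the names illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
gdp = rng.normal(100, 10, size=50)                    # hypothetical explanatory variable
imports = 5 + 0.8 * gdp + rng.normal(0, 2, size=50)   # hypothetical dependent variable

X = sm.add_constant(gdp)              # adds the constant term (beta_0)
model = sm.OLS(imports, X).fit()      # estimates beta_0 and beta_1 by Ordinary Least Squares
print(model.params)                   # estimated constant and coefficient
print(model.summary())                # t-statistics, p-values, R-squared, etc.
```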
1.3. Hypothesis Testing
Hypothesis testing is used to statistically test models and their parts as well
as other statistics-based questions, e.g., the difference of means of two collected
data sets.
The hardest step in performing a statistical test is correctly setting up the
two hypotheses. The first one is called the null hypothesis (H0) and the second
one, adequately, is referred to as the alternative hypothesis (HA or H1), which,
as can be expected, states the opposite of the null. When performing a test,
the decision is made whether to reject or fail to reject the null hypothesis. The
decision depends on the decision rule, which states that the null hypothesis is to
be rejected if the observed value exceeds the critical value and if the p-value is
less than the established level of significance.
The p-value represents the probability that, given the random sample, the
difference between sample means is as large as, or larger than, the one
observed. The level of significance is how much error we are allowing to exist
in the model. At 5% a test is much more restrictive than at 10%. Depending
on the area of research, different levels can be implemented.8 For example,
in marketing the levels will be greater than when performing research in the
medical field.
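As an illustration of the decision rule, here is a sketch of a two-sample difference-of-means test in Python with SciPy. The samples are simulated and the 5% level is chosen only for the example.

```python
# A sketch of a difference-of-means test; the two samples are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample_a = rng.normal(10.0, 2.0, size=40)
sample_b = rng.normal(11.0, 2.0, size=40)

t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
alpha = 0.05                                # the chosen level of significance
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0 (the means differ)")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```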
1.4. Using Statistical Tables
When performing a statistical test, the researcher, through the use of
appropriate formulas, arrives at the observed value that he or she then compares
with the critical value, which is obtained from statistical tables – tables that list
values for different distributions based on degrees of freedom and the level of
significance (more on this in later parts of the book). The main distributions include:
the t-distribution, the F-distribution and the Chi-square (χ2) distribution. The way of using
these tables is explained in the book on the first occasion each is used. All
tables are included at the end of the book.
7 Pindyck, Rubinfeld (1998).
8 Do not worry, all this will be clear when we move to examples.
1.5. Econometrics Software
There are many different econometrics software packages available, each
with unique strengths, limitations and designations (e.g., SPSS is widely used
in the social sciences). EViews, SAS,9 STATA,10 and SPSS,11 just to name a few, are the
most common ones used. In this book, all of the outputs, model graphs and
estimations will be done using the EViews package, which can be acquired at
http://www.eviews.com/. One of the main advantages of this software is that it
is very easy to use as well as visual in providing econometric solutions. Another
benefit is that it allows working with the most common types of datasets (each of
which is explained in detail in the Data Types and Structural Equations Description
chapter). Lastly, it also allows using many statistical methods, some of which
were mentioned in this chapter's section titled Parameter Estimation Methods.
9 For more information see: http://www.sas.com/.
10 For more information see: http://www.stata.com/.
11 For more information see: http://www-01.ibm.com/software/analytics/spss/.
CHAPTER TWO
Data Types and Structural Equations Description
Depending on the type of research, datasets – the way the data is arranged
– can be divided into three main categories: time-series, cross-section and panel
data. Each econometrics project that aims at estimating the model requires
setting up a structural equation that can be viewed as a skeleton on which the
final model will be constructed. Such equations have their unique specifics that
depend on the data set that is being used.
2.1. Cross-section, Time-series and Panel data defined
Cross-section data looks at many variables at one particular moment in
time, a snapshot of the entire situation. That is why we say that it is a onedimensional dataset. An example of such data would be an attempt to estimate
the price of the house by looking at individual factors of the house sold (number
of rooms, size of the house, and presence of a pool, for example). To perform
such research, a dataset would consist of many observations (houses), each with
a sale price as well as above-suggested data points.
Time-series data is one that observes a particular variable (or a few variables)
over a set time period (for example, the U.S. imports from the first quarter
of 1960 to the last quarter of 2010); the letter t is usually used to depict the
period in which the measurement was taken. Time-series data is used in most
macroeconomic models (the U.S. GDP as a function of consumer spending,
trade deficit, government spending and whether the country is in a recession or
not, for example).
Panel data consists of observations of subjects (our dependent variables)
over a specified period of time; it is a combination of the cross-section and time-series sets.
A good example is a set of data looking at profits of the top 10 transportation
companies in the U.S. over 10 years. Of course, each of the firms comes with its own
set of explanatory data points – an example of such a set is included in Table 1.
Table 1. An example of panel data with averages per firm and per year listed in the last row and the last column, respectively

Year / Firm      Firm 1   Firm 2   ...   Firm 10   Industry's Average
2000             10       42       ...   44        32.00
2001             22       15       ...   21        19.33
2002             53       62       ...   37        50.67
...              ...      ...      ...   ...       ...
2008             17       20       ...   52        29.67
2009             9        11       ...   5         8.33
Firm's Average   22.2     30       ...   31.8
Source: Authors’ own table on original theoretical data.
Some of the advantages of panel data are: 1) a large number of data
points (allowing for increased accuracy and additional degrees of freedom), and
2) a combination of time-series and cross-section approaches that minimizes
the probability of the omitted-variables problem. In the firm example, the use of
panel data allows the researcher not only to measure the variation in profits of
a single company over time but also to measure the variation in profits between
companies. The sophistication of panel data is also the source of its problems,
as it brings together issues from both cross-section and time-series sets.
This book focuses on research done using cross-section and time-series sets of
data. For a detailed look at working with panel data, there is no better place than
Econometric Analysis of Cross Section and Panel Data by J.M. Wooldridge.1
2.2. Structural Equation Description
The structural equation is the basic representation of the model to be
estimated. It provides the reader with a quick mathematical view of what it is
that the research is going to do and what it is trying to achieve.
2.2.1. Cross-Section structural equation
For a simple cross-section dataset, a structural equation in linear form that
attempts, for example, to model house sale price (SalePrice) as a linear function
of the area of the house (Area), number of bedrooms (Beds) and the existence
of an in-ground pool (Pool) will look like Equation 2:
1 Wooldridge (2010).
Equation 2. Simple linear form structural equation for working with a cross-section data set, with i representing a specific observation
SalePricei = β0 + β1Areai + β2Bedsi + β3Pooli + εi
Source: Authors’ own equation.
The interpretation of coefficients obtained with a model in simple linear
form is very straightforward – a one-unit increase in the independent variable X
will impact the dependent variable Y by βX of its units, where βX is the coefficient
of the X independent variable. For example, referring to Equation 2, assume
that the sale price of a house (SalePrice) is reported in U.S. dollars, the area
of a house (Area) is reported in square meters and the value of β1 is 1520.
In this case the interpretation of β1 is as follows: an increase in the area of the
house by one square meter will increase the price of the house by 1,520 U.S.
dollars.
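Although the book carries out all estimations in EViews, Equation 2 can also be sketched in Python; the small DataFrame below is hypothetical and serves only to show where the coefficients come from.

```python
# A sketch of estimating Equation 2; the house data below are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "SalePrice": [250000, 310000, 180000, 420000, 295000, 350000],
    "Area":      [120, 150, 90, 200, 140, 170],   # square meters
    "Beds":      [3, 4, 2, 5, 3, 4],
    "Pool":      [0, 1, 0, 1, 0, 1],              # dummy: 1 = in-ground pool
})

model = smf.ols("SalePrice ~ Area + Beds + Pool", data=df).fit()
print(model.params)   # Intercept (beta_0) and the three slope coefficients
```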
A semi-log form (Equation 3 and Equation 4) of the same model would have
at least one variable (dependent or independent) in the model logged – common
practice is to either log the entire right- (linear-log form) or left-hand (log-linear
form) side and to use the natural logarithm, ln.
Equation 3. Simple semi-log form structural equation for working with a cross-section data set, with i representing a specific observation – log-linear form
lnSalePricei = β0 + β1Areai + β2Bedsi + β3Pooli + εi
Source: Authors’ own equation.
Equation 4. Simple semi-log form structural equation for working with a cross-section data set, with i representing a specific observation – linear-log form
SalePricei = β0 + β1lnAreai + β2lnBedsi + β3lnPooli + εi
Source: Authors’ own equation.
In case of semi-log forms, the interpretation of the coefficient is a bit more
complicated.
Starting with the log-linear form (Equation 3), a one-unit increase in the
independent variable X will impact the dependent variable Y by 100βX %, where
βX is the coefficient of the X independent variable. Holding all the assumptions
from the linear-form example, let us assume now that β1 equals 0.20. In this case
the interpretation of β1 is as follows: an increase in the area of the house by one
square meter will increase the price of the house by 20%.
Moving to the linear-log form (Equation 4), a one-percent increase in the
independent variable X will impact the dependent variable Y by 0.01βX of its
units, where βX is the coefficient of the X independent variable. Given a β1 value
of 3000, its interpretation is as follows: an increase in the area of the
house by 1% will increase the price of the house by 30 U.S. dollars.
A log-log form (also known as a full-log form, Equation 5) has all variables
in logs.
Equation 5. Simple full-log form structural equation for working with
a cross-section data set, with i representing a specific observation
lnSalePricei = β0 + β1lnAreai + β2lnBedsi + β3lnPooli + εi
Source: Authors’ own equation.
In the case of a full-log form, the interpretation is simpler, that is: a one-percent
increase in the independent variable X will impact the dependent variable
Y by βX %, where βX is the coefficient of the X independent variable. Assigning β1
the value of 4, its interpretation in Equation 5 is as follows: an increase in the area of
the house by 1% will increase the price of the house by 4%.
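These interpretation rules are not arbitrary; they follow from a short calculus argument, sketched below (using the approximation d ln Y ≈ ΔY/Y for small changes).

```latex
% Log-linear form: the coefficient is a proportional (percent) effect.
\ln Y = \beta_0 + \beta_1 X + \varepsilon
\;\Rightarrow\;
\frac{\partial \ln Y}{\partial X} = \beta_1
\;\Rightarrow\;
\%\Delta Y \approx 100\,\beta_1\,\Delta X

% Log-log form: the coefficient is an elasticity.
\ln Y = \beta_0 + \beta_1 \ln X + \varepsilon
\;\Rightarrow\;
\beta_1 = \frac{\partial \ln Y}{\partial \ln X} = \frac{dY/Y}{dX/X}
```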
2.2.2. Time-Series structural equation
When presenting the reader with a functional form of a model based on time-series
data (one with a time factor), a subscript representing the time period
should be added. The equation (Equation 6) that regresses the U.S. Imports (IM)
on the U.S. GDP (GDP), the U.S. Exports (EX) and the Change in Inventory (chg_inv)
has the following structural form.
Equation 6. Simple linear form structural equation for working with a timeseries data set, with t representing a specific year
IMt = β0 + β1GDPt + β2EXt + β3chg_invt + εt
Source: Authors’ own equation.
This structural equation, just like the model presented in Equation 2, can also
be transformed into its semi-log and full-log forms.
2.2.3. Panel structural equation
As can be expected, the structural equation of a model that is to be
estimated based on panel data will combine the features of Equation 2 and
Equation 6. For example, when attempting to model inward foreign direct
investment from the U.S. (IFDI) to six countries (i = 1, 2… 6) over ten years (from
the year 2000 to the year 2009, that is, t = 2000, 2001… 2009) as a function
of hosts’ gross domestic products (GDP), their exports (X) and costs of labor
(LCOST), Equation 7 can be used.
Equation 7. Simple linear form structural equation for working with a panel
data set, with i representing cross-section elements, i.e., host countries,
and t representing time-series elements, i.e., a specific year
IFDIit = β0 + β1GDPit + β2Xit + β3LCOSTit + εit
Source: Authors’ own equation.
This structural equation, just like the model presented in Equation 2, can also
be transformed into its semi-log and full-log forms.
Notice that in all of the above-shown cases, the constant, β0, does not have
a cross-section or a time subscript, unlike the error term, εit.
CHAPTER THREE
Variables and Dataset Work
Proper treatment of variables is probably one of the most crucial steps in
setting up a successful project. Indistinct names, errors, missing values
and other mistakes are bound to occur and can invalidate the entire
research. This section aims to show how to avoid such pitfalls. Additionally,
it is very important and useful, for reference and further research, that all
steps and changes are documented as work is conducted.
3.1. Naming Variables
After the literature review and establishing which variables are going to play
a role in obtaining the model, the next step is to properly name them all.
There are two rules to doing so properly:
1) keep it short – it is very likely that the names will have to be entered multiple
times while using the software package,
2) be sure that you can recognize the name.
For example, if one of your variables is Disposable Income, naming the
variable disposable_income is not efficient (rule 1 violation) and naming it Yd,
if you are not familiar with using the letter Y to represent income, infringes on
the second rule.
Again, writing everything down is crucial. Table 2 represents an example of
a good way of keeping track of your variables.
Table 2. Variables Info Table

Name                     Symbol in the model   Unit                                  Source of data   Transformations
Gross Domestic Product   GDP                   Constant 2000 United States Dollars   World Bank       NA
Source: Authors’ own table.
3.2. Examining Variables
When dealing with time-series data, it is a good practice to examine how the
variable moves over time.
For example, the gross domestic product (GDP) data for the United States in
its graphical representation is shown in Graph 1.
Graph 1. The U.S. gross domestic product (left-hand axis in billion, USD)
Source: Authors’ own graph of data from International Monetary Fund.
A simple analysis of the variable presented in Graph 1 should follow for
each of the variables. An example of such an analysis is: As expected, as time
progresses, GDP increases; therefore, it has an upward trend and it appears to
be a non-stationary variable.
When looking at time as the only component, it is useful to add a trendline
(this can be done in Microsoft Excel, for example1).2
1 This can be done by right-clicking the line on the graph and choosing "Add Trendline…". Here, a regression can be fitted in various forms (exponential, linear, logarithmic, power, polynomial and moving average). It is also useful to check the "Display Equation on chart" box and the "Display R-squared value on chart" box – more on these topics later in the book.
2 Be very careful when analyzing and falling back on these results. As much as these tools are helpful in providing some insight, these insights are very limited as the presented model uses the horizontal axis' variable, in this case time, only to model the vertical axis' variable.
Graph 2. The U.S. gross domestic product (left-hand axis in billion, USD)
with a linear trendline
Source: Authors’ own graph of data from International Monetary Fund.
A visual analysis can also give an indication of whether we should use
a linear, square (a parabolic shape – think of the capital letter U, upright or
upside down), cube (wave-like) or log form (a half-parabola on its side with
the tip being the intercept term) of the variable in the model.
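Outside of Excel, the same visual check can be sketched in Python with numpy and matplotlib; the GDP series below is simulated purely for illustration.

```python
# A sketch of plotting a series with a fitted linear trendline;
# the GDP series is simulated.
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(1990, 2011)
rng = np.random.default_rng(2)
gdp = 5000 + 250 * (years - 1990) + rng.normal(0, 300, years.size)

slope, intercept = np.polyfit(years, gdp, deg=1)      # linear trend
plt.plot(years, gdp, label="GDP")
plt.plot(years, intercept + slope * years, "--", label=f"trend: {slope:.0f}/year")
plt.xlabel("Year")
plt.ylabel("GDP (billion USD)")
plt.legend()
plt.show()
```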
3.3. Stationarity Test
In the case of cross-section data, since there is no time factor, this analysis can
be skipped.
When dealing with time-series data, a variable needs to be tested and
corrected for nonstationarity. By definition, a stationary variable will have its
mean, variance and autocorrelation constant over time.
There are three general tests to see if the variable is stationary:
1) visual test (also known as the ocular test, which is the easiest),
2) correlogram,
3) Augmented Dickey-Fuller test.
The ocular test can be done by plotting the data in levels – without any
adjustments – as presented in Graph 1. If an average line drawn (in this case the
linear trend line, Graph 2) is not close to a horizontal line, the data is considered
to be not stationary.
Table 3. An example of a correlogram of data with a unit root present

Autocorrelation      Partial Correlation
.|*******            .|*******            1
.|******             .|.                  2
.|*****              .|.                  3
.|****               .|.                  4
.|***                .|.                  5
.|**                 .|.                  6
.|*                  .|.                  7
Source: Authors’ own graph based on results obtained with EViews software.
The correlogram (again, in levels), here presented in Table 3, will, in the case of
nonstationary data, have Autocorrelation bars decreasing slowly, and the Partial
Correlation will have one bar that represents a unit root. More on autocorrelation
in Chapter 5: Model Testing. In this type of output, the extent of the bar is
represented by the number of stars; the longer the bar, the more stars are used
to represent it.
Table 4. Output of the Augmented Dickey-Fuller test

                                          t-Statistic   Prob.*
Augmented Dickey-Fuller test statistic    0.885655      0.9952
Test critical values:    1% level         -3.464280
                         5% level         -2.876356
                         10% level        -2.574746
Source: Authors’ own table based on results obtained with EViews software.
The hypotheses setup for the Augmented Dickey-Fuller test used to detect
the presence of a unit root (data being nonstationary) is:
H0: the variable is nonstationary
H1: the variable is stationary
The analysis of the Augmented Dickey-Fuller output (presented in Table 4)
looks first at the test t-statistic (0.885655) and compares it with the test critical
value (negative 2.876356) at a chosen level of significance, i.e., 5%. Also, Prob.
(the p-value associated with the test) equals 0.9952, which is greater than the
one associated with a 5% level of significance, i.e., 0.05. Based on the test's
results, we fail to reject the null hypothesis and therefore conclude that the variable
in question is not stationary. This conclusion is a result of the test t-statistic being
greater than the test's critical value and Prob. being more than 0.05.
To solve the problem of nonstationarity, differencing is applied: Yt – Yt-1.
Differencing takes the past period's value (Yt-1) and subtracts
it from the current period's observation (Yt).
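The whole workflow – test in levels, difference, test again – can be sketched in Python with statsmodels; the series below is a simulated random walk, so it contains a unit root by construction.

```python
# A sketch of the Augmented Dickey-Fuller test before and after differencing;
# the series is a simulated random walk (nonstationary by construction).
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(3)
y = np.cumsum(rng.normal(size=200))          # random walk in levels

adf_stat, p_value = adfuller(y)[:2]
print(f"levels:      ADF = {adf_stat:.3f}, p = {p_value:.4f}")   # expect p > 0.05

dy = np.diff(y)                               # first difference: Y_t - Y_(t-1)
adf_stat, p_value = adfuller(dy)[:2]
print(f"differenced: ADF = {adf_stat:.3f}, p = {p_value:.4f}")   # expect p < 0.05
```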
Graph 3. A graphical representation of the U.S. GDP after it has been
transformed into a stationary variable via first-order differencing; D (GDP)
Source: Authors’ own graph based on results obtained with EViews software.
Table 5. A correlogram of the U.S. GDP after it has been transformed into a stationary variable

Autocorrelation      Partial Correlation
.|**                 .|**                 1
.|**                 .|*                  2
.|*                  .|.                  3
.|*                  .|.                  4
.|.                  .|.                  5
Source: Authors’ own table based on results obtained with EViews software.
Table 6. The Augmented Dickey-Fuller test output testing the 1st difference of the U.S. GDP for stationarity (only relevant information included)

                                          t-Statistic   Prob.*
Augmented Dickey-Fuller test statistic    -5.477582     0.0000
Test critical values:    5% level         -2.876356
Source: Authors’ own table based on results obtained with EViews software.
Stationary data will have a graph with the overall linear trend
being nearly horizontal (Graph 3), a quickly decaying correlogram (Table 5) and an Augmented
Dickey-Fuller output (Table 6) with the test t-statistic (-5.477582) being
less than (greater in absolute value than) the critical value at the desired confidence level
(-2.876356), with Prob. = 0.00.
Sometimes taking the first difference is not enough. If that is the case,
differencing should be repeated until the data is proved to be stationary. This
should be done within the realm of reason; the second degree is usually the
highest degree of differencing.
3.4. Correlation Matrix Analysis
Next should come the analysis of the correlation matrix (a table of
correlation coefficients between variables). This analysis has the following goals:
1) to see if there is a linear relationship between the dependent variable and
chosen independent variables,
2) to see the relative strength of the relationship,
3) to see the sign of the relationship,
4) to assess the possibility of multicollinearity.
Table 7. A correlation matrix for the number of the U.S. FDI firms and the GDP in two regions in Poland

                      Pearson Correlation   Sig. (2-tailed)
DOLNOŚLĄSKIE          -0.246                0.639
KUJAWSKO-POMORSKIE    0.909                 0.012
Source: Authors’ own table based on results obtained with SPSS software.
Let us go over points 1 through 4 by looking at the example data shown in
Table 7.
The null hypothesis states that the coefficient of correlation is equal to zero,
i.e., that there is no linear correlation between the two tested
variables. In the example, the p-value for the correlation coefficient of -0.246
between the number of the U.S. FDI firms in the Dolnośląskie region and that
region’s GDP is equal to 0.639. Since this value is significantly above any logical
and practical level of significance, the conclusion is that there is no evidence to
state that there is a linear relationship between the two tested variables. When
looking at the Kujawsko-Pomorskie region, since the p-value is equal to 0.012,
i.e., less than the one set at a 5% level of significance (0.05), a statement can be
made that there is a high, positive and statistically significant linear correlation
between the two tested variables for this region. When describing correlation
between two variables, it is important to make a note of three facts: one, the
strength of the correlation; two, the direction (is the correlation coefficient
positive/negative, suggesting that as one variable increases the other increases/
decreases); and three, the statistical significance.
The correlation matrix should also be used to look at correlation coefficients
between independent variables in order to detect multicollinearity, which occurs
when one explanatory variable is highly correlated with another. For example,
if a model were to use household income and household taxes, where the
latter is a derivative of the former, there would be a strong suspicion of
multicollinearity.
The rule-of-thumb is that if the correlation coefficient (which suggests the
strength of a linear relationship between two variables) is greater than 0.8, then
we can expect multicollinearity (which is also suspected when the model has
a significantly high R-squared and very small, in absolute value, t-statistics). More
on this problem, its consequences and its solutions in the Model Testing chapter.
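A correlation matrix of this kind can be sketched in Python with pandas and SciPy; the three short series below are hypothetical and serve only to illustrate the mechanics.

```python
# A sketch of a correlation matrix plus a significance test of one
# coefficient; the data are hypothetical.
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "imports": [108, 150, 230, 390, 490, 735, 1200, 2208],
    "gdp":     [543, 800, 1100, 2100, 2800, 4300, 7400, 14500],
    "exports": [95, 130, 210, 330, 355, 560, 1000, 1670],
})

print(df.corr())                              # Pearson correlations, all pairs

r, p = stats.pearsonr(df["imports"], df["gdp"])
print(f"r = {r:.3f}, p = {p:.4f}")            # H0: the correlation equals zero
```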
It is important to note that just because two or more variables are highly
correlated with each other, it does not mean that one causes another. For
example, the U.S. imports and the U.S. GDP are highly correlated, but it does
not automatically mean that one causes another. Using the same
example, the question can be asked: does a high correlation coefficient between
the U.S. imports and the U.S. GDP signify that changes in the U.S. imports
cause changes in the U.S. GDP, or do changes in the U.S. GDP cause changes in
the U.S. imports? This question can be answered by falling back on theory,
but on the basis of the correlation coefficient alone it is
impossible to answer.3
3.5. Descriptive Statistics
The next-to-last step in the analysis of variables is to look at the statistical
summary (Table 8), usually provided within the econometrics software, of all the
variables.
Mean (the average value), median (the value in the middle of the set), mode
(the most common value), extreme values (the minimum and maximum) and the
number of observations should be examined – it is important that all variables
have the same number of observations (196 in this case) as missing values will
significantly distort the estimated model’s coefficients. When looking at dummy
variables, the mean will represent the percentage of observations that were
coded with 1 (for example, 18.3673% of all observations took place during
a recession).
3 A hint into the cause-and-effect relationship can also be given by the Granger Causality test; see: Pindyck, Rubinfeld (1998), pp. 242–245.
To see if a variable has a normal distribution, the researcher can use three
statistics. First, skewness, which shows whether the mass of the distribution is
shifted to the left, with a long tail on the right (positive value), or to the right,
with a long tail on the left (negative value). Kurtosis, on the other hand, measures
how flat or how tall the distribution is, with an ideal value of 3; the lower/higher
the value, the flatter/more peaked the distribution is. Third, the Jarque-Bera statistic can be used
to test for normal distribution, with the null hypothesis stating that the variable
is normally distributed. In this example (Table 8), as the p-values (Probability) are
well below 0.05, we reject the null in favor of the alternative. Still, it needs to
be remembered that the assumption of normal distribution is an "ideal" one
and very often does not hold in the real world.4
Table 8. Descriptive statistics of the U.S. imports, the U.S. exports and a dummy variable for recession

               IM          EX          RECES
Mean           735.4768    561.9069    0.183673
Median         490.3720    355.4060    0.000000
Maximum        2208.336    1670.431    1.000000
Minimum        108.4540    94.75800    0.000000
Std. Dev.      635.9210    441.4533    0.388209
Skewness       1.049200    0.827283    1.633843
Kurtosis       2.767814    2.426459    3.669444
Jarque-Bera    36.40038    25.04337    90.86179
Probability    0.000000    0.000004    0.000000
Sum            144153.4    110133.7    36.00000
Sum Sq. Dev.   78857124    38001795    29.38776
Observations   196         196         196
Source: Authors’ own table based on results obtained with EViews software.
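The statistics in Table 8 can be reproduced for any series in Python with SciPy; the series below is simulated (log-normal, hence right-skewed) purely for illustration.

```python
# A sketch of the key descriptive statistics; the series is simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.lognormal(mean=6.0, sigma=0.8, size=196)     # a right-skewed series

print("mean:    ", np.mean(x))
print("median:  ", np.median(x))
print("skewness:", stats.skew(x))                    # > 0: long tail on the right
print("kurtosis:", stats.kurtosis(x, fisher=False))  # equals 3 for a normal variable
jb_stat, jb_p = stats.jarque_bera(x)
print(f"Jarque-Bera = {jb_stat:.2f}, p = {jb_p:.4f}")  # H0: normal distribution
```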
3.6. Hypotheses Formulation
The next step, taken after performing all of the analytical steps presented in
the previous sections, is to construct hypothesis tests for each variable based on
economic theory and the literature review.
For example, for GDP in relation to imports, the hypotheses regarding the sign
of the coefficient of the GDP explanatory variable are as follows: H0: βGDP ≤ 0 and
H1: βGDP > 0, where we want to statistically reject the null hypothesis, therefore
allowing for a statement that GDP has a positive and statistically significant
impact on the dependent variable, i.e., the U.S. imports.
4 For more see: Wooldridge (2010).
A summary of the information for all variables can be presented in the form of
a table (e.g., Table 9).
Table 9. A summary of information for the U.S. GDP variable

Variable   Name in the model   Alternative Hypothesis
U.S. GDP   GDP                 H1: βGDP > 0
Source: Authors’ own table.
3.7. Dummy Variables
It is often the case that some information cannot be directly inputted into the
model. Variables like sex (male, female), race (white, black, for example), location
(Washington, Richmond, for example) and many more need to be transformed
prior to their use.
between two periods. When looking at any variables around an economic or
a social event, a researcher may want to designate those observations that took
place prior to the event versus those that followed. For example, Poland joined
the European Union in the year 2004; as a result, a dummy variable (coded
EUDV) can be created that takes the value of zero for the years prior to the year
2004 and one for the years 2004 and after (see Table 10).
3.7.1. Dummy Variables: Example
Table 10. Dummy variable creation: European Union membership example

Year   EUDV
2002   0
2003   0
2004   1
2005   1
2006   1
Source: Authors’ own table.
Here is another example. When seeing if the sale price of a specific car, which
is the left-hand-side variable of the original structural equation (Equation 8),
depends on the sex of the buyer, the researcher should proceed with the following
course of action; first, the setup.
Equation 8. Dummy variable creation: Sale price example, original equation (no dummy variable)
SalePricei = β0 + β1X1i + . . . + βnXni + εi
where:
SalePricei – the dependent variable; the ith sale price of a specific car
βn – the coefficient of the nth independent variable X
εi – the error term
Source: Authors’ own equation.
Let us assume that we have the original data set as presented in Table 11. The
first purchase was done by a male, the second by a female and the third by a male;
the data coded this way cannot be effectively used in model determination.
The solution is to simply assign the value of 1 if the buyer was a female, and
the value of 0 if the buyer was a male.5 In this case, the original data set will be
transformed to look like the one presented in Table 12.
Table 11. Dummy variable creation: Sale price example, original data set

Sale Price   Sex
$120,000     M
$67,450      F
$87,090      M
Source: Authors’ own table on original data.
Table 12. Dummy variable creation: Sale price example, transformed data set

Sale Price   SexDV
$120,000     0
$67,450      1
$87,090      0

Notice that when a variable is a dummy variable, it is very useful to mark that fact by, for example, adding a capital DV at the end of its name.
Source: Authors’ own table on original data.
This introduces one dummy variable to the original structural equation
(Equation 8), which results in a new one (Equation 9).
5 It does not matter which sex takes which value as long as you have it clearly noted for interpretation purposes.
Equation 9. Dummy variable creation: Sale price example, original equation
(with a dummy variable)
SalePricei = β0 + β1X1i + . . . + βnXni + βn+1SEXDVi + εi
Source: Authors’ own equation.
Given how SEXDV is coded (0 for male and 1 for female), the interpretation
of its coefficient, βn+1 , is as follows:
1) if the coefficient of the dummy variable SEXDV is positive, then a statement
can be made that if the buyer is a female, the dependent variable, i.e., the
price for which the car is sold, will be higher than in the case of a male
buyer,
2) if the coefficient of the dummy variable SEXDV is negative, then a statement
can be made that if the buyer is a female, the dependent variable, i.e., the
price for which the car is sold, will be lower than in the case of a male
buyer.
3.7.2. Dummy Variables: Pitfalls
One may be tempted to solve the example from the previous section by
creating two dummy variables; namely, Male (MDV) and Female (FDV). The
first taking the value of zero if the buyer was a female and one if the buyer
was male; the second, the value of zero for a male buyer and the value of
one for a female buyer. In this case, the transformed data set will look as is
presented in Table 13 and the structural equation will take the form shown in
Equation 10.
Table 13. Dummy variable creation: Sale price example, transformed, version 2, data set

Sale Price   MDV   FDV
$120,000     1     0
$67,450      0     1
$87,090      1     0
Source: Authors’ own table on original data.
Equation 10. Dummy variable creation: Sale price example, original equation
(with two dummy variables)
SalePricei = β0 + β1X1i + . . . + βnXni + βn+1MDVi + βn+2FDVi + εi
Source: Authors’ own equation.
This procedure shows the first and most common pitfall when employing
dummy variables; namely, including all categories in the model. This creates
the problem of multicollinearity (more on this later on) as MDV and FDV are
perfectly correlated with each other: for every ith observation the two variables
sum to one, so as one changes from zero to one, the other changes from one
to zero. Obviously, one can include just one of the two
new variables; MDV showing if the buyer was male (value of one) or not (value
of zero), or FDV showing if the buyer was female (value of one) or not (value of
zero). The interpretation will be parallel to the one made in the
example from the previous section.
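In practice the coding can be automated; the pandas sketch below uses drop_first=True so that only one category enters the model, sidestepping the perfect-correlation trap just described. Note that here the kept dummy happens to be Sex_M, i.e., 1 = male, the reverse of the book's SexDV coding.

```python
# A sketch of dummy variable creation with pandas; drop_first=True keeps
# only one of the two categories and avoids perfect multicollinearity.
import pandas as pd

df = pd.DataFrame({
    "SalePrice": [120000, 67450, 87090],
    "Sex":       ["M", "F", "M"],
})

dummies = pd.get_dummies(df["Sex"], prefix="Sex", drop_first=True, dtype=int)
df = pd.concat([df.drop(columns="Sex"), dummies], axis=1)
print(df)    # a single column Sex_M remains (1 = male, 0 = female)
```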
The second common pitfall is when the researcher decides to base the
model on too many dummy variables. The rule of thumb is that the model
should not contain more than two, maximum three, dummy variables – this
of course being subject to the fact that the model does not suffer from the
problem of underspecification (too few explanatory variables) or the problem of
overspecification (too many explanatory variables).6
3.8. Data Cleaning
One of the main reasons for inspecting the data, in addition to getting a feel
for the links between variables (i.e., correlations) as well as how variables change
over time, is to determine whether there are any inconsistencies. Looking
at extreme values, for instance, allows the researcher to identify miscoded
entries. Examples would be a house with 0 square footage, 22 bathrooms
and 2 bedrooms, or an average minimum labor cost of 15 dollars an hour with
a maximum value of 115 – all clearly illogical and likely errors. Another
way of finding such values, or finding missing entries, is to sort the data according
to each variable to see which cells were left empty in the spreadsheet. Importantly,
this should be done one variable at a time to avoid distorting the
data.
Identifying the problems is straightforward whereas amending the issue can
be as easy as deleting observations and as complicated as finding alternative
ways of acquiring the missing data.
Prior to deciding on the solution, it is important to note that retrieving the
missing data is the preferred approach – this way the size of the data
set, and therefore the number of degrees of freedom, is not decreased. If the researcher
decides to look for an alternative source of data, say to complement the unit
labor cost for Poland for the year 2004 (values for other years are known), it
is crucial to look at the methodology behind the data collection of the first
source and make sure that the data point being supplemented comes
from a source that employs the same methodology. Due to differences in
methodology, values can differ by as much as 30% – this issue is evident when looking at
data on Foreign Direct Investment, for example.
6 The number of independent variables depends first and foremost on the number of observations and on the literature review.
When dealing with cross-section data, where there is no continuity between
observations, deletion of a single or even a few observations usually does not
cause concern, as long as the sample size stays large enough. Deletion,
though tempting, is not a good solution when working with time-series data,
data that has a "flow" to it. Removing an observation from a time-series set (for
example, for the first quarter of 1998, when looking at quarterly data from the year 1990
to the year 2010) creates a hole. If deletion is the only option while
working with time-series data, it has to be done on a variable basis, that is,
an entire variable for which data is missing is deleted.
As can be expected, when put into a corner, that is, when deletion of an
entire variable is not possible, it is possible to employ algorithms that
will methodologically supplement the missing data, e.g., supplementing the
data according to its simplified, linear trend. A simpler alternative is the
averaging method; see Equation 11.
Equation 11. Simple averaging method
Vt = (Vt+1 + Vt–1) / 2
Source: Authors’ own equation.
Say the situation is as presented in Table 14, where observation of GDP for
the year 2004 (GDP2004) is missing and there is no possibility of obtaining it from
another source. Deletion, as has been explained, is not an option as it distorts
continuity.
Table 14. Supplementing the missing data example, original data set

Year   GDP (in billion)
2002   4
2003   6
2004   ???
2005   9
Source: Authors’ own table based on original data.
In this case, the missing value will be 7.5 = (9+6)/2 = (GDP2005 + GDP2003)/2.
This method can be employed under the following conditions:
1) values of the missing variable continue to grow at a more-or-less steady
pace; in other words, the value preceding the missing data point is not,
for example, 5 while the subsequent value is 134,
2) the number of supplemented observations is minimal in reference to the
entire number of observations in the series,
3) the researcher employing this, or for the matter of fact any other method
regardless of its mathematical advancement, is aware of its limitations.
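Returning to Equation 11, the averaging method amounts to linear interpolation, which pandas can sketch directly; the figures below are the ones from Table 14.

```python
# A sketch of the averaging method from Equation 11 via linear interpolation;
# the GDP figures are those from Table 14.
import numpy as np
import pandas as pd

gdp = pd.Series([4.0, 6.0, np.nan, 9.0], index=[2002, 2003, 2004, 2005])
gdp_filled = gdp.interpolate(method="linear")   # consecutive years, so this
print(gdp_filled[2004])                         # yields 7.5 = (6 + 9) / 2
```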
3.9. Data Description
The purpose of describing the data is to explain to the reader everything that
he or she needs to know. This will include things like sources of data (the International
Monetary Fund, for example), the frequency and the range of the data (for time-series),
the number of observations, any transformations done to the data and the methods used
(for example, converting monthly data into quarterly data),
assumptions (if using one variable to represent another, a proxy; for example, using
daily market closing numbers to reflect customers' wealth) and the creation of
dummy variables.
CHAPTER FOUR
Model Determination
Now that the variables have been examined and the structural equation is
defined, the next step is the model estimation. This part of the book outlines
step-by-step the procedure of moving from raw data obtained from data sources
to arriving at, correcting and interpreting the final model.
There are many options when it comes to deciding which variables should
be included in the model. Usually, the explanatory variables are decided on
based on the literature review and then simply put into the model. The problem
arises when there is the issue of oversaturation of the literature with possible
determinants.1
In this situation, when there is no empirical research that dictates which
independent variables should be used, the researcher is usually forced to rely
on his or her subjective judgment, which, due to its nature, can be questioned by
others and should usually be avoided in any research. Other solutions include, but are not
limited to, stepwise approaches that add explanatory factors to an initial, very
limited model based on some statistical property. This property is usually the
maximization of the F-statistic or R-squared.
Addressing the shortcomings of the stepwise method, three main issues
should be understood. First, this method is not a substitute for a literature
review. What this means is that it picks the variables from a given evoked set,
regardless of their theoretical connection, or lack thereof, with the dependent
variable. As a result, the set of possible explanatory factors should include only
those variables that have a strong backing in the theory and in the literature
on the topic being researched. Second, new variables are added based on their
statistical importance, not their theoretical importance. As a result, the order
in which the variables are entered is not necessarily the order of importance
from the point of view of theory and/or the impact a change in the independent
variable will have on the dependent variable. Third, which variables are added
depends on which variables are already in the model. Therefore, at least some
variables should be forced in to create an initial model based on their most
common occurrence in the literature on the subject.2
1 To study a great article that shows the extent of this issue when doing research in the field of foreign direct investment see: B.A. Blonigen, J. Piger (2011), Determinants of Foreign Direct Investment, NBER Working Paper 16704.
4.1. Model Estimation with a Forward Stepwise Method
In the forward stepwise method,3 the starting point is the initial model that
consists of a small number of independent variables that were decided on based
on commonalities in the articles read during the literature review. This section
looks at the procedure from the manual point of view, that is, with all the
steps carried out by the researcher. While this can be done
automatically in such software packages as SPSS, other econometric programs
(e.g., EViews) do not have the automatic option and require the following
procedure to be conducted "by hand." All estimations are done with the Ordinary
Least Squares method of estimation.
Holding all other variables constant, the initial structural equation is presented
in Equation 12. Notice that in this example, subscripts that would designate either
cross-section (i) or time-series (t) modeling are substituted for simplicity with a.
Equation 12. Model estimation with forward stepwise method example –
initial structural, restricted equation
Ya = β0 + β1X1a + β2X2a + β3X3a + β4X4a + β5X5a + εa
Source: Authors’ own equation.
After the structural equation is estimated with econometric software, it
becomes a model. The structural representation is shown in Equation 13. Since
new explanatory factors are going to be added to this model, it is called the
restricted model. Notice that now that we are talking about an estimated
model, all the parameters that have been estimated have a hat (^) on top of
them and the error term becomes known as the residuals.
2 For more information on the stepwise approach and its limitations see:
1) B. Thompson (1989), Why Won't Stepwise Methods Die?, "Measurement and Evaluation in Counseling and Development," Vol. 21, pp. 146–149.
2) C.J. Hubert (1989), Problems with Stepwise Methods – Better Alternatives, "Advances in Social Science Methodology," Vol. 1, pp. 43–70.
3) J.S. Whitaker (1997), Use of Stepwise Methodology in Discriminant Analysis, paper presented at the annual meeting of the Southwest Educational Research Association, Austin, Texas, January 23, 1997.
3 The reason why this method is called "forward" is that the researcher starts with a small, restricted initial model and then adds new variables to it. If the opposite were the case, that is, an unrestricted model with many explanatory variables were the starting point and the objective were to statistically drop independent variables, the method would be referred to as a backward stepwise method.
Equation 13. Model estimation with forward stepwise method example –
initial structural, restricted model
Ya = β0 + β1X1a + β2X2a + β3X3a + β4X4a + β5X5a + εa
Source: Authors’ own equation.
For the sale price of the house example that has been mentioned previously,
the initial model would, for example, consist of the area of the house, location
(that is, a city or a state), its age, number of rooms and the number of baths.
In order to determine if the initial model is sufficient or not, a statistical
test, the Lagrange Multiplier (LM) test, should be implemented to check for the
presence of additional information hidden in residuals (estimates of errors). The
reason why residuals are expected to hold additional information is that the
restricted model, or any other model, only extracts the information relating to
the used independent variables. As a result, there is always some information
that is not accounted for.
The mentioned test requires an auxiliary regression. Such an equation has the residuals ($\hat{\varepsilon}_a$) from the initial model (Equation 13) as the dependent variable, which is regressed on all explanatory variables collected by the researcher. In the example, there are overall 20 possible independent variables suggested by the literature, X1–X20, for which the data has been collected. In Equation 14, the structural equation has alphas (α) that designate the parameters to be estimated, and γa represents the error of the auxiliary regression.
Equation 14. Model estimation with forward stepwise method example –
auxiliary structural equation
$\hat{\varepsilon}_a = \alpha_0 + \alpha_1 X_{1a} + \alpha_2 X_{2a} + \alpha_3 X_{3a} + \ldots + \alpha_{19} X_{19a} + \alpha_{20} X_{20a} + \gamma_a$
Source: Authors’ own equation.
The estimated auxiliary regression is shown in Equation 15.
Equation 15. Model estimation with forward stepwise method example –
auxiliary structural model
$\hat{\varepsilon}_a = \hat{\alpha}_0 + \hat{\alpha}_1 X_{1a} + \hat{\alpha}_2 X_{2a} + \hat{\alpha}_3 X_{3a} + \ldots + \hat{\alpha}_{19} X_{19a} + \hat{\alpha}_{20} X_{20a} + \hat{\gamma}_a$
Source: Authors’ own equation.
When looking at the output of the auxiliary regression, it is important to note that the variables already included in the model (X1–X5) will have low (in absolute value) t-statistics and high p-values. The null hypothesis states that all of the coefficients in the auxiliary model are equal to zero, and therefore, there is no further information to be extracted. The alternative hypothesis states that at least one of the referred-to coefficients is not equal to zero.

H0: αk+1 = αk+2 = … = αk+m = 0 (no more information to be extracted)
H1: αk+i ≠ 0 for at least some i (some information that can be added)
The LM formula that is used has a Chi-square distribution and is shown in Equation 16, where n represents the number of observations and R²aux is the R-squared statistic from the auxiliary model, described in detail in the Model Results Interpretation section of this work.

Equation 16. Lagrange Multiplier formula

LM = n · R²aux
Source: Pindyck, Rubinfeld (1998), p. 282.
The degrees of freedom are the number of all available variables minus the number of variables used in the model being tested, that is, the initial (restricted, Equation 13) model (20 − 5 = 15 in this example).
Table 15. A section of the Chi-square table with error levels in the first row and degrees of freedom in the first column

Right tail areas for the Chi-square Distribution

df\area    0.25        0.1         0.05
1          1.3233      2.70554     3.84146
14         17.11693    21.06414    23.68479
15         18.24509    22.30713    24.99579
16         19.36886    23.54183    26.29623
Source: Authors’ own table.
The first step after the calculation of the LM statistic is completed is to find the critical value. In our example, χ²critical for 15 degrees of freedom (the size of the set of possible explanatory variables net the number of independent variables used) at 5% (0.05) is 24.99579 and can be read from a Chi-square distribution table (a part of which is shown in Table 15). This value is compared with χ²observed from
the LM formula.
If the number of observations (n) is, for example, 900 and R²aux is 0.257, the LM would be 900 · 0.257 = 231.3. As a result of the test, with χ²critical being less than χ²observed (24.99579 < 231.3), the null hypothesis is rejected and a statement can be made that there is still some information to be added to the model.
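To make the arithmetic of this procedure concrete, the following is a minimal sketch of the LM test in Python (using statsmodels and scipy; the data and variable names are synthetic illustrations, not the textbook's dataset):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, k_used, k_all = 900, 5, 20           # observations, variables in model, variables available

X_all = rng.normal(size=(n, k_all))      # all candidate regressors (synthetic)
y = 1.0 + X_all[:, :k_used] @ np.ones(k_used) + 0.5 * X_all[:, 5] + rng.normal(size=n)

# 1) Estimate the restricted model with the first five regressors only
restricted = sm.OLS(y, sm.add_constant(X_all[:, :k_used])).fit()

# 2) Auxiliary regression: residuals on ALL candidate regressors
aux = sm.OLS(restricted.resid, sm.add_constant(X_all)).fit()

# 3) LM statistic and the chi-square critical value with (20 - 5) degrees of freedom
lm_observed = n * aux.rsquared
lm_critical = chi2.ppf(0.95, df=k_all - k_used)
print(f"LM = {lm_observed:.2f}, critical = {lm_critical:.2f}")
if lm_observed > lm_critical:
    print("Reject H0: residuals still contain usable information.")
```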
In order to determine which variables ought to be added to the model, the
examination of the auxiliary regression’s output should follow. The possible
explanatory variables that have the highest (again, in absolute value) t-statistics,
and therefore, the lowest p-values, should be added as they are the most
statistically significant. It is wise to add no more than two variables at a time; the safest course of action is to add one new independent variable at a time.
Let us say that one new variable, X6, has been added to the original restricted model's right-hand side. After this addition, the new model looks as presented in Equation 17.

Equation 17. Model estimation with forward stepwise method example – structural, unrestricted model

$Y_a = \hat{\beta}_0 + \hat{\beta}_1 X_{1a} + \hat{\beta}_2 X_{2a} + \hat{\beta}_3 X_{3a} + \hat{\beta}_4 X_{4a} + \hat{\beta}_5 X_{5a} + \hat{\beta}_6 X_{6a} + \hat{\varepsilon}_a$

Source: Authors' own equation.
Notice that the model, after it has been expanded with the addition of a new
explanatory element, is referred to as an unrestricted model.
At this point, the new model should be tested again with the LM test. The procedure is to be repeated until we fail to reject the null hypothesis (in other words, until χ²critical > χ²observed). At that point, a statement can be made that the final model has been achieved; it should now be tested for multicollinearity, autocorrelation and heteroscedasticity, as described in the next chapter, titled Model Testing.
CHAPTER FIVE
Model Testing
After the model is estimated, it needs to be checked. The three most common
and major problems are: multicollinearity, autocorrelation and heteroscedasticity.
This chapter provides the definition of each of these three issues and the ways of
detecting and remedying them.
5.1. Multicollinearity
Multicollinearity exists when two or more of the explanatory variables (for
example, the U.S. GDP and the U.S. Exports in the U.S. Imports estimation
example) are highly correlated with each other. Another cause of multicollinearity
is overfitting or overspecification, which suggests that the researcher was
adding independent variables simply to maximize R-squared without regard
for their statistical significance. As mentioned earlier, another common cause
of multicollinearity is associated with dummy variables. When using dummy
variables, it is important to always leave one of the categories out. If, for example,
the explained variable is believed to be dependent on the seasons of the year,
four dummy variables would be created to reflect whether the observation
took place in summer, autumn, winter or spring. But, when estimating the
model, only three of the four dummy variables would be included to avoid the
multicollinearity problem.
If perfect multicollinearity is present, the computer software will not be able to estimate the model, as one of the mathematical operations it uses (inverting the matrix of regressors) is impossible to execute.
There are two main ways of detecting this problem: one, the correlation matrix (shown in the 3.4. Correlation Matrix Analysis section, in Table 7) and, two, the examination of the regression output (discussed in detail in the next chapter). If the correlation coefficient between any independent variables is high (0.8 and above – again, a rule of thumb), multicollinearity can be a problem.
Also, if the model has a very high R-squared statistic, but the coefficients are
not statistically significant (low t-statistics and high p-values), multicollinearity
is expected.
The most common remedies are to either increase the sample size (get more
observations) or drop variables that are the least significant (highest p-values)
and/or are the main suspects of causing the problem. The latter solution needs to be performed with caution, as deletion of too many variables can lead to the problem of underspecification.
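As an illustration of such checks outside of EViews, the sketch below computes a correlation matrix and variance inflation factors (VIFs) in Python on synthetic data; treating VIF values above roughly 10 as suspect is a common rule of thumb, not a rule from this text:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
gdp = rng.normal(size=200)
ex = 0.95 * gdp + 0.1 * rng.normal(size=200)   # deliberately near-collinear with GDP
pop = rng.normal(size=200)
X = pd.DataFrame({"GDP": gdp, "EX": ex, "POP": pop})

print(X.corr().round(2))                        # flag pairwise correlations of 0.8+

exog = sm.add_constant(X)
for i, name in enumerate(exog.columns):
    print(name, round(variance_inflation_factor(exog.values, i), 1))
```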
5.2. Autocorrelation
Autocorrelation (Serial Correlation) exists when a variable is a time function of itself (today is affected by yesterday, for example) and is a problem only when dealing with time-series and panel sets of data. If the problem occurs in a cross-section set, it can be either ignored or, preferably, the order of observations can be changed to solve the problem.
The presence of autocorrelation causes the estimated coefficients of
independent variables to be inefficient (though still unbiased). In addition,
standard errors are biased and any individual hypothesis testing is invalid.
Autocorrelation of the dependent variable can be detected by the ocular test
of the residuals, the correlogram, the Breusch-Godfrey Serial Correlation LM test
and the examination of the Durbin-Watson statistic.
Graph 4. Graph of residuals of a model with the U.S. imports (IM) as the
dependent variable
Source: Authors’ own graph based on calculations conducted with EViews software.
The residuals graph may look like the one in Graph 4, in which a pattern that
suggests the presence of the problem of autocorrelation is visible. Here, one
observation appears to be dictated by the one before it. These quick changes
in the trend create sharp tips of the graph. The main benefit of this approach
is that it is quick, as it does not require any calculations. At the same time, its disadvantage comes in the form of the subjectivity of the researcher. As much as this method can be a good indicator, conclusions on the presence of autocorrelation should not be made solely based on it.
Table 16. An example of a correlogram output for the U.S. imports model

Autocorrelation    Partial Correlation    Lag   AC      PAC      Q-Stat   Prob.
.|*****|           .|*****|                1    0.751    0.751   114.36   0.000
.|**** |           .|*    |                2    0.598    0.080   187.40   0.000
.|***  |           **|.   |                3    0.375   -0.227   216.18   0.000
.|**   |           .|.    |                4    0.228   -0.020   226.93   0.000
.|*    |           .|*    |                5    0.169    0.147   232.85   0.000
Source: Authors’ own table based on calculations conducted with EViews software.
Significant bars in the Partial Correlation column of the correlogram (Table 16) suggest that there is a problem of autocorrelation. The placement of these bars serves as an indicator of which order of autocorrelation is present in the model. In this case, the first and possibly the third order of autocorrelation can be expected, as those orders have the longest bars in the Partial Correlation column. The Prob. column provides p-values for each of the autocorrelation orders. It is important to note that initially a large number of autocorrelation orders will be found statistically significant, which is a big disadvantage of this approach. The reason for this is that the third order, for example, may be caused by the first and/or the second order of autocorrelation. On the plus side, this method of detecting autocorrelation provides more information than the ocular examination, as it suggests which orders of autocorrelation can be expected.
The next possibility of detecting autocorrelation involves the use of the
Breusch-Godfrey Serial Correlation Lagrange Multiplier Test, for which the
hypotheses setup for autocorrelation looks as follows:
H0: No Autocorrelation
H1: Autocorrelation exists
Table 17. An example of the Breusch-Godfrey Serial Correlation LM test output for the U.S. imports model

Breusch-Godfrey Serial Correlation LM Test:
F-statistic      425.9017    Prob. F (2,192)        0.0000
Obs*R-squared    163.2115    Prob. Chi-Square (2)   0.0000
Source: Authors’ own table based on calculations conducted with EViews software.
The LM formula (Equation 16), as mentioned earlier, has the Chi-square distribution. For the U.S. imports example, the degrees of freedom would be 2 (the number of lagged residual terms tested, as indicated by Prob. Chi-Square (2) in the output). From the Chi-square table χ²critical = 5.99, and χ²observed = 163.2115 (which we can either calculate or read from the Breusch-Godfrey Serial Correlation LM test output – Table 17); hence, we reject the null hypothesis and conclude that autocorrelation is present. This is the preferred way of approaching the issue of testing residuals for the presence of autocorrelation as, due to its mathematical nature, it removes all subjectivity and its interpretation is clear.
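A minimal sketch of the Breusch-Godfrey test in Python follows (using statsmodels; the AR(1) errors are generated deliberately so the test has something to detect):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):                 # build AR(1) errors so the test has something to find
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 2.0 + 0.5 * x + e

results = sm.OLS(y, sm.add_constant(x)).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = acorr_breusch_godfrey(results, nlags=2)
print(f"Obs*R-squared = {lm_stat:.2f}, Prob. Chi-Square(2) = {lm_pvalue:.4f}")
# A p-value below 0.05 rejects H0 of no autocorrelation.
```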
The last way of determining the presence of the autocorrelation is to examine
the Durbin-Watson statistic (that is explained in more detail in the following
chapter). The ideal value is 2.00. Anything below 2.00 suggests a positive
autocorrelation and when the reading is above 2.00 it indicates the presence of
a negative autocorrelation.
There are a number of ways to correct for autocorrelation (the Generalized Least Squares method or adding more significant variables, to name just two). The easiest two to implement are the introduction of an autoregressive (AR(p)) term, where the letter p indicates the order of the serial correlation, and the introduction of a lagged dependent variable as one of the explanatory variables.
When using the AR(p) approach (Equation 18), it is important that AR(1)
through p terms are introduced. For example, if there is third order autocorrelation
(as suggested in Table 16) terms AR(1), AR(2) and AR(3) should be added to the
model (Equation 19).
Equation 18. Structural equation with an AR(p) term
Yt = β0 + β1X1t + . . . + βnXnt + δ1AR(p) + εt
Source: Authors’ own equation.
Equation 19. Structural equation with AR(p) terms 1 through 3
Yt = β0 + β1X1t + . . . + βnXnt + δ1AR(1) + δ2AR(2) + δ3AR(3)+ εt
Source: Authors’ own equation.
AR terms are subject to the same statistical significance tests as other coefficients (more on that topic in the next chapter). As much as they are easy to implement, their biggest drawback is that they are very hard, if not impossible, to interpret.
When applying the second solution, after introducing the lagged dependent variable term (Yt–1) into the equation (as shown in Equation 20), all of the original coefficients (including the constant term) need to be adjusted to properly reflect their values that have changed due to the correction.

Equation 20. Structural equation with lagged dependent variable as an additional explanatory variable

$Y_t = \beta_0 + \beta_1 X_{1t} + \ldots + \beta_n X_{nt} + \lambda_1 Y_{t-1} + \varepsilon_t$

Source: Authors' own equation.
The adjustment (Equation 21) requires dividing the original coefficient's estimated value ($\hat{\beta}_n$) by 1 minus the sum of all coefficients associated with lagged dependent variables used as explanatory variables.

Equation 21. Adjustment of the nth coefficient with r lagged dependent variables used as independent factors

$\hat{\beta}'_n = \dfrac{\hat{\beta}_n}{1 - \sum_{m=1}^{r} \hat{\lambda}_m}$

where:
$\hat{\beta}'_n$ – the adjusted value of the original coefficient, $\hat{\beta}_n$
m – index of the coefficient of the lagged dependent variable
r – number of lagged dependent variables used as explanatory variables

Source: Authors' own equation.
For Equation 20, the adjustment for the coefficient of the first independent variable would take the form presented in Equation 22.

Equation 22. Adjustment of the 1st coefficient with one lagged dependent variable used as an independent factor

$\hat{\beta}'_1 = \dfrac{\hat{\beta}_1}{1 - \hat{\lambda}_1}$

Source: Authors' own equation.
Analogously to using AR(p) terms, if higher orders of autocorrelation are expected (for example, 3rd order), all of the orders should be included in the model (1 through 3). The advantage of this method is that, despite the need for adjustment, it provides coefficients that are easy to incorporate in the interpretation and description of the estimated coefficients assigned to the used explanatory variables.
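A small sketch of the Equation 21 adjustment, with illustrative (not estimated) numbers:

```python
# Adjustment from Equation 21: divide the raw coefficient by
# 1 minus the sum of the coefficients on the lagged dependent variables.
def adjust_coefficient(beta_hat: float, lag_coefs: list) -> float:
    return beta_hat / (1.0 - sum(lag_coefs))

beta_1 = 0.30          # illustrative raw estimate of beta_1
lambdas = [0.45]       # illustrative coefficient on Y(t-1)
print(adjust_coefficient(beta_1, lambdas))   # 0.30 / (1 - 0.45) = 0.5454...
```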
5.3. Heteroscedasticity
Heteroscedasticity is the existence of different variances among random variables. A good example of this problem is the variance of consumer spending – lower income earners will have a smaller variance, while people in the upper income bracket will have a higher variance. It causes the same problems as autocorrelation.
To detect this problem, tests like the ocular test of residuals (at one end the spread of residuals will be small and it will increase as the residuals are plotted – a megaphone or cone-shaped graph), or any of the White, Goldfeld-Quandt or Breusch-Pagan LM tests can be implemented. Just like with autocorrelation or stationarity, the ocular examination of the graph should be used only as an indicator of the presence, or the lack, of the problem. Statistical tests, like the ones mentioned above, are the preferred option.
For the LM White test, for example, the hypotheses will look as follows:
H0: No Heteroscedasticity
H1: Heteroscedasticity exists
Table 18. An example of a heteroscedasticity LM White test for the U.S. imports model

Heteroskedasticity Test: White
F-statistic            40.04048    Prob. F (20,179)        0.0000
Obs*R-squared          163.4623    Prob. Chi-Square (20)   0.0000
Scaled explained SS    271.0250    Prob. Chi-Square (20)   0.0000
Source: Authors’ own table based on calculations conducted with EViews software.
An example of a statistical test for heteroscedasticity, the LM White test (shown in Table 18), suggests that the model tested suffers from the presence of heteroscedasticity and needs to be corrected for it. We reach such a conclusion because the LM test statistic, also known as χ²observed, 163.4623 (in Table 18 reflected as Obs*R-squared – the number of observations multiplied by the R-squared of the auxiliary regression), is greater than χ²critical at the 5% level of significance and 20 degrees of freedom. The decision to reject the null of no heteroscedasticity is supported by the fact that the p-value of Prob. Chi-Square (20), read from the output, is less than 0.05.
One of the popular remedies is called the Weighted Least Squares method of estimating the parameters of a model – where weights are assigned to observations to adjust for the difference in variance. The key problem with Weighted Least Squares is assigning proper weights to specific observations in such a way as not to distort the results of the research. An easy way out is provided by many software packages (EViews, for example) that offer an automatic option that cures the problem of heteroscedasticity.
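A hedged sketch of the White test and a Weighted Least Squares re-estimation in Python follows; the weighting scheme (1/income²) is an assumption made for illustration, since proper weights depend on the variance structure of the data at hand:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

rng = np.random.default_rng(3)
n = 200
income = rng.uniform(1.0, 10.0, size=n)
# Error variance grows with income: classic heteroscedasticity
y = 1.0 + 0.8 * income + rng.normal(scale=0.5 * income)

exog = sm.add_constant(income)
results = sm.OLS(y, exog).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(results.resid, exog)
print(f"Obs*R-squared = {lm_stat:.2f}, Prob. Chi-Square = {lm_pvalue:.4f}")
# p-value < 0.05 rejects H0 of no heteroscedasticity; one cure is WLS:
wls = sm.WLS(y, exog, weights=1.0 / income**2).fit()  # weights are an assumption
print(wls.params)
```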
CHAPTER SIX
Model’s Results Interpretation
Model interpretation consists of analyzing two parts of the output received after estimating the designed model using econometric software: one, the output regarding the estimation of the model's parameters (Table 19) and, two, statistics describing the model as a whole (Table 22). Each of the mentioned outputs plays a key role in assessing the estimated model. This process will give the researcher hints as to whether or not the chosen independent variables and the model as a whole, statistically, do a good job of representing the data.
For this section, the model that is used as an example is estimated based on
the following linear structural equation, Equation 23.
Equation 23. Linear structural equation of the model used in Model’s
Results Interpretation chapter
IMt = β0 + β1YDt + β2POPt + β3Wt + β4GDPt + β5EXt + εt
Source: Authors’ own equation.
6.1. Interpreting and Testing Variables and Coefficients
Table 19. Coefficient estimation output from software after estimating the
U.S. imports
Variable    Coefficient    Std. Error    t-Statistic    Prob.
C           3376.174       278.8045      12.10947       0.0000
YD          0.142595       0.065082      2.191002       0.0296
POP         -0.024214      0.001978      -12.24395      0.0000
W           0.032785       0.007324      4.476295       0.0000
GDP         0.299346       0.067465      4.437042       0.0000
EX          0.21193        0.06036       3.51112        0.0006
Where the dependent variable, the U.S. imports (IM) is being regressed on the constant
term (C), Disposable Income (YD), the U.S. Population (POP), Wealth (W), the U.S. GDP
(GDP) and the U.S. Exports (EX).
Source: Authors’ own table based on calculations conducted with EViews software.
In Table 19, the Variable column lists all the explanatory variables entered into the model as well as the constant term; the Coefficient column lists the estimated values of the coefficients of the independent variables as well as the constant term; Std. Error represents the standard errors of the coefficients and the constant term; t-Statistic is the parameter's value divided by its standard error; and the Prob. column shows the p-value associated with each of the estimated coefficients and the constant term.
The estimated version of Equation 23, based on results presented in Table 19,
is shown in Equation 24.
Equation 24. Estimated version of the linear structural equation of the model used in Model's Results Interpretation chapter

$IM_t = 3{,}376.174 + 0.143\,YD_t - 0.024\,POP_t + 0.033\,W_t + 0.299\,GDP_t + 0.212\,EX_t + \hat{\varepsilon}_t$
Source: Authors’ own equation.
When dealing with non-probability models (ones that do not involve estimating the probability that an occurrence will take place based on given characteristics of the object making the decision, for example), the coefficients are easy to interpret – as has been shown in section 2.2. Structural Equation Description. When interpreting the coefficients, it is very important to remember that all other coefficients are held constant (ceteris paribus). An explanation of why such a statement is necessary follows. Moving away from economics, let us say that an overweight person decides to go on a diet and start an exercise program. After two months, the weight of this person has decreased by 10 kilograms. The questions are: was it due to the decrease in calories eaten, the exercise program, or maybe both, and if that is the case, which of the two, the diet or the exercise program, had a greater impact on the achieved weight loss? A parallel example can of course be found in any discipline. To bring the discussion back to economics, consider the unemployment rate, which, as we know, depends on many economic conditions; or the gross domestic product – is a change in it driven by consumer spending, investment in capital, government spending, or net exports?
If none of the variables in the model are logged, like the example in Table 19,
coefficients represent something called marginals. The marginal is interpreted
as follows: a one unit increase in the disposable income (YD) will increase the
dependent variable, the U.S. imports, by 0.142595 units; this is why it is crucial
for the model interpretation to have the units specified clearly. In the U.S.
imports example, the measurements are done in billions of U.S. 2005 dollars,
unless stated otherwise. Using that information, the above analysis of the
coefficient of the disposable income can be improved upon by stating that: In
case of the disposable income of the U.S. customers increasing by 1 billion U.S.
2005 dollars, the model suggests that the U.S. imports will increase by 0.142595
billion U.S. 2005 dollars, or by 142,595,000 U.S. 2005 dollars. Analogously, if
the population of the U.S. increases by one person (unit of measurement), the
U.S. imports will decrease by 0.024214 billion U.S. 2005 dollars; and so on.
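A minimal sketch of producing such a marginal in Python on synthetic data (the units and magnitudes below are illustrative only, not the textbook's estimates):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
yd = rng.normal(10_000, 1_000, size=200)        # disposable income, illustrative units
im = 3_000 + 0.14 * yd + rng.normal(scale=50, size=200)

results = sm.OLS(im, sm.add_constant(yd)).fit()
b_yd = results.params[1]
# With no logged variables, the coefficient is a marginal effect:
print(f"A one-unit rise in YD changes IM by {b_yd:.4f} units, ceteris paribus.")
```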
The above-presented interpretation of the estimated parameters of the model is only valid if the estimated coefficients are found to be statistically significant. To check this, the output shown in Table 19 is examined again.
By looking at the assigned value of Prob. (that is, the p-value), a statement regarding the statistical significance of an individual variable can be made. At the 5% level of significance, the cutoff point for the p-value is 0.05. The hypotheses for this test are as presented below:
H0: Xn is not statistically significant
H1: Xn is statistically significant
If the p-value of the estimated coefficient of the nth variable is less than 0.05, we
reject the null hypothesis and state that the nth variable is statistically significant.
Any Prob. reading above the cutoff point fails to reject the null because, as will be
shown in a bit, its coefficient is not significantly different from zero.
In addition to looking at the p-value, each of the coefficients can be tested for its significance with a t-test. If, for example, it is expected that the coefficient of the U.S. GDP variable ($\hat{\beta}_{GDP}$) will be positive (in other words, it is expected that as GDP increases, the imports of the U.S. will also increase), it is important to test whether the estimated coefficient, $\hat{\beta}_{GDP}$, actually is, as expected, greater than zero.
The test for the significance of the calculated coefficient of the nth variable
should have the following steps:
1) set up the null and the alternative hypotheses statements,
2) select the critical value based on the confidence interval,
3) compute the tobserved and compare against tcritical,
4) make a statement about rejecting, or failing to reject, the null hypothesis.
This is called the one-tail t-test, as there is some expectation as to the sign of the tested coefficient.
Table 20. Summary of the coefficient testing procedure for one-tail tests

If the variable is expected to have    Hypothesis statement
a negative coefficient                 H0: βn ≥ 0;  H1: βn < 0
a positive coefficient                 H0: βn ≤ 0;  H1: βn > 0

Formula:

$t_{\hat{\beta}_n} = \dfrac{\hat{\beta}_n - \beta_{test}}{S_{\hat{\beta}_n}}$

Where $S_{\hat{\beta}_n}$ is the standard error of the coefficient of the nth variable and βtest is the value that βn is compared to; in this case βtest = 0.
Source: Authors’ own table with formula from Pindyck, Rubinfeld (1998), p. 112.
The degrees of freedom are equal to n – k (number of observations less the
number of explanatory variables in the model). If, for example, the number of
observations is 50 and the number of independent variables in the model is
20, then the degrees of freedom are 30 and the t-statistic at 5% is 1.697. If
tcritical is less than tobserved, we reject the null and confirm that the coefficient of
the nth variable is consistent with the assumptions made based on the economic
theory.
It is possible to test whether, statistically, the coefficient of the nth variable is greater than, less than, or equal to a specific value. For the first two tests, simply choose the appropriate hypothesis statement from Table 20 and substitute the value that $\hat{\beta}_n$ is being tested against for βtest in the formula.
For example, to test if the coefficient of disposable income (YD in Table 19,
with 200 observations) is statistically greater than 0.01, the test would look as
follows:
Hypothesis setup:
H0: βn ≤ 0.01
H1: βn > 0.01
The t-test is shown in Equation 25.
Equation 25. Example of the t-test

$t_{\hat{\beta}_{YD}} = \dfrac{0.142595 - 0.01}{0.065082} = 2.0373$
Source: Authors’ own equation.
Conclusion
Since tobserved (2.0373) is greater than tcritical (1.6448), we reject the null hypothesis in favor of the alternative hypothesis, and therefore state that, statistically, the coefficient of the disposable income variable is greater than 0.01.
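The same test can be scripted; the sketch below assumes 200 observations and six estimated parameters, so its exact critical value (about 1.653) differs slightly from the large-sample value of 1.6448 used above:

```python
from scipy.stats import t

beta_hat, beta_test, se = 0.142595, 0.01, 0.065082   # values from Table 19
df = 200 - 6                                          # assumed: observations minus parameters

t_observed = (beta_hat - beta_test) / se
t_critical = t.ppf(0.95, df)                          # one-tail test at 5%
print(f"t_observed = {t_observed:.4f}, t_critical = {t_critical:.4f}")
if t_observed > t_critical:
    print("Reject H0: the coefficient is statistically greater than 0.01.")
```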
To test whether the coefficient is statistically different from a given value, the two-tail t-test is used. This test is also used when there is no inclination, or hint, as to whether a positive change in a tested independent variable will have a positive or a negative impact on the dependent variable. The two-tail t-test mimics the one-tail t-test, with the adjustments listed in Table 21.
Table 21. Summary of the coefficient testing procedure for two-tail tests

Hypothesis statement:   H0: βn = 0;  H1: βn ≠ 0

Formula:   $t_{\hat{\beta}_n} = \dfrac{\hat{\beta}_n - \beta_{test}}{S_{\hat{\beta}_n}}$

t-statistic example: given d.f. = 30, at 5% the t-statistic = 2.042
Source: Authors’ own table with formula from Pindyck, Rubinfeld (1998), p. 112.
The decision rule for null’s rejection (tcritical < tobserved) stays the same.
Sometimes, the one- or two-tail t-test, which is used for only a single variable at a time, will find that variable statistically insignificant. Yet, there are times when the same variable combined with another one (or two, or three) will be, as a group, found statistically significant, and therefore the model will be improved by their addition. To test for the combined significance (or joint significance) of two or more variables at the same time, the F-test with the F distribution is used. In principle, the F-test compares the restricted model (the one to which new variables are to be added, Equation 26) to the unrestricted model (Equation 27) containing the new independent variables.
Equation 26. Joint significance test – structural model, restricted

$Y_t = \hat{\beta}_0 + \hat{\beta}_1 X_{1t} + \hat{\beta}_2 X_{2t} + \hat{\beta}_3 X_{3t} + \hat{\varepsilon}_t$
Source: Authors’ own equation.
Equation 27. Joint significance test – structural model, unrestricted

$Y_t = \hat{\beta}_0 + \hat{\beta}_1 X_{1t} + \hat{\beta}_2 X_{2t} + \hat{\beta}_3 X_{3t} + \hat{\beta}_4 X_{4t} + \hat{\beta}_5 X_{5t} + \hat{\varepsilon}_t$
Source: Authors’ own equation.
The hypothesis statement for the joint test is as follows:
H0: β4 = β5 = 0
H1: at least one of β4, β5 ≠ 0
Similarly to the LM procedure for adding new explanatory variables to a restricted model, the null hypothesis assumes that the coefficients of the newly inserted independent variables are jointly equal to zero. The alternative hypothesis states that at least one of these coefficients is different from zero. The F-test formula is shown in Equation 28.
Equation 28. F-test formula with Error Sum of Squares

$F_{q,(n-k)} = \dfrac{(ESS_R - ESS_{UR}) / q}{ESS_{UR} / (n - k)}$
Source: Pindyck, Rubinfeld (1998), p. 129.
The UR and R subscripts designate the use of the unrestricted or restricted models, respectively; q is the number of variables tested (in this case 2, Equation 27), n is the number of observations and k is the number of explanatory variables in the unrestricted model. The principle of this test is that the Error Sum of Squares (covered later) is lower in the unrestricted model than in the restricted one if the added variables, combined, are truly significant explanatory contributors. After a few transformations (shown in Equation 29 and Equation 30), the formula can be rewritten using R-squared (R²) as presented in Equation 31.
Equation 29. R² of the unrestricted model as a function of its Error Sum of Squares and Total Sum of Squares

$R^2_{UR} = 1 - \dfrac{ESS_{UR}}{TSS_{UR}}$
Source: Pindyck, Rubinfeld (1998), p. 130.
Equation 30. R² of the restricted model as a function of its Error Sum of Squares and Total Sum of Squares

$R^2_R = 1 - \dfrac{ESS_R}{TSS_R}$
Source: Pindyck, Rubinfeld (1998), p. 130.
Equation 31. F-test formula with R-squared

$F_{q,(n-k)} = \dfrac{(R^2_{UR} - R^2_R) / q}{(1 - R^2_{UR}) / (n - k)}$
Source: Pindyck, Rubinfeld (1998), p. 130.
This formula assumes that, if the alternative hypothesis is correct, the unrestricted model explains a greater amount of variation in the dependent variable as compared with the restricted model.
The decision rule is similar to other tests: if Fcritical is less than Fobserved, the null hypothesis is rejected and, in the example above, the coefficients of X4 and X5 are jointly statistically different from zero; therefore, they add additional information to the model.
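A minimal sketch of this joint F-test in Python on synthetic data; compare_f_test in statsmodels implements the comparison that Equation 31 expresses:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
X = rng.normal(size=(n, 5))
y = 1.0 + X @ np.array([0.5, 0.4, 0.3, 0.2, 0.1]) + rng.normal(size=n)

restricted = sm.OLS(y, sm.add_constant(X[:, :3])).fit()    # X1..X3 only
unrestricted = sm.OLS(y, sm.add_constant(X)).fit()          # adds X4 and X5

# Equivalent to Equation 31: ((R2_UR - R2_R)/q) / ((1 - R2_UR)/(n - k))
f_value, p_value, q = unrestricted.compare_f_test(restricted)
print(f"F = {f_value:.2f}, p-value = {p_value:.4f}, restrictions = {q:.0f}")
# p-value < 0.05: X4 and X5 are jointly significant.
```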
6.2. Interpreting Model’s Statistics
After conducting a detailed analysis of the estimated coefficients, the model as a whole has to be reviewed using its descriptive statistics (Table 22).
Table 22. Model’s statistics output from the software after estimating
the U.S. imports by regressing them on the constant term (C), disposable
income (YD), the U.S. population (POP), wealth (W), the U.S. GDP and the
U.S. exports
R-squared             0.99315      Mean dependent var     757.2164
Adjusted R-squared    0.992974     S.D. dependent var     647.7554
S.E. of regression    54.29724     Akaike info criterion  10.85636
Sum squared resid     571948.9     Schwarz criterion      10.95531
Log likelihood        -1079.636    Hannan-Quinn criter.   10.89641
F-statistic           5625.545     Durbin-Watson stat     0.185468
Prob(F-statistic)     0.000000
Source: Authors’ own table based on calculations conducted with EViews software.
R-squared – Equation 32 – represents the percentage of the variation
in the dependent variable explained by the model. For example, in Table 22,
R-squared equals 0.99315; therefore, a statement can be made that 99.31%
of variation in the dependent variable is explained by its regression on the
independent variables. It ranges from 0 to 1. As often as this statistic is quoted, it suffers from a serious problem: it will increase as more explanatory variables are added to the model, regardless of their true significance. This issue arises because there is no adjustment for the changing degrees of freedom. To solve this issue, the Adjusted R-squared statistic was developed.
Equation 32. R-squared formula

$R^2 = \dfrac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2}$
Source: Pindyck, Rubinfeld (1998), pp. 112–113.
Adjusted R-squared – Equation 33 – is interpreted similarly to R-squared, but it is the preferred measurement as it is adjusted for the degrees of freedom. With regular R-squared, the addition of variables, regardless of their statistical significance, will always increase it. When insignificant variables are added, the Adjusted R-squared will decrease, and unlike R-squared it can even become negative. Similar to R-squared, the higher the value of the Adjusted R-squared, the better job the model does in explaining the variation of the dependent variable.
Equation 33. Adjusted R-squared formula

$Adjusted\ R^2 = 1 - \left[ \dfrac{\sum \hat{\varepsilon}_t^2}{n - k} \right] \Big/ \left[ \dfrac{\sum (Y_i - \bar{Y})^2}{n - 1} \right]$
Source: Pindyck, Rubinfeld (1998), pp. 112–113.
Sum squared resid – the Error Sum of Squares (ESS, Equation 35) is a measure of the discrepancy between the original data ($Y_i$) and the estimated model ($\hat{Y}_i$); in other words, the variation in the residuals, or the unexplained variation. In the estimated model, ESS is equal to the Total Sum of Squares (TSS) net of the Regression Sum of Squares (RSS), as shown in Equation 34. TSS is the total variation in the dependent variable (Equation 36) and RSS is the explained variation (Equation 37).
Equation 34. Total Sum of Squares decomposition

$TSS = RSS + ESS$
or
$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$

Source: Pindyck, Rubinfeld (1998), p. 89.

Equation 35. Error (Residual) Sum of Squares

$ESS = \sum (Y_i - \hat{Y}_i)^2$

Source: Pindyck, Rubinfeld (1998), p. 89.

Equation 36. Total Sum of Squares

$TSS = \sum (Y_i - \bar{Y})^2$

Source: Pindyck, Rubinfeld (1998), p. 89.

Equation 37. Regression Sum of Squares

$RSS = \sum (\hat{Y}_i - \bar{Y})^2$
Source: Pindyck, Rubinfeld (1998), p. 89.
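The decomposition can be verified numerically; the sketch below checks Equations 32 and 34–37 on synthetic data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(size=100)
results = sm.OLS(y, sm.add_constant(x)).fit()

y_hat = results.fittedvalues
tss = np.sum((y - y.mean()) ** 2)        # total variation (Equation 36)
ess = np.sum((y - y_hat) ** 2)           # unexplained variation (Equation 35)
rss = np.sum((y_hat - y.mean()) ** 2)    # explained variation (Equation 37)

print(np.isclose(tss, rss + ess))                # Equation 34 holds
print(np.isclose(results.rsquared, rss / tss))   # Equation 32 holds
```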
F-statistic – statistic used to measure the overall statistical significance of the
model.
Prob(F-statistic) – probability of the F-statistic. The null hypothesis states that
the model as a whole is statistically insignificant, while the alternative hypothesis
says that the model as a whole is statistically significant. If the probability of the
F-statistic is less than the level of significance the null hypothesis is rejected and
a conclusion can be made that the model as a whole is statistically significant.
Mean Dependent Variable – mean of the dependent variable.
S.D. Dependent Variable – standard deviation of the dependent variable.
Akaike, Schwarz and Hannan-Quinn criteria – information criteria that measure the goodness of fit of an estimated statistical model. The smaller the value of a criterion, the better the fit.
Durbin-Watson statistic – a guide to detecting autocorrelation, with the ideal value being 2.00 (suggesting no autocorrelation). A reading below 2.00 suggests the presence of positive autocorrelation and a reading above hints at negative autocorrelation. As described in the section on autocorrelation, this statistic comes with drawbacks that should be noted.
In addition to R-squared and other statistics, plotting the actual data, fitted data and residuals on one graph (Graph 5) provides a good representation of how the model fits the original data and how the residuals are behaving. For example, the presented graph shows that the model does a good job of fitting the data, as the fitted and the actual plots are hard to distinguish. Residuals, with the exception of the year 2009, appear to show no signs of heteroscedasticity and suggest minimal, if any, signs of autocorrelation.1 The residuals graph is also a good place to look for signs of seasonality.
Graph 5. Graph of actual values (Actual), the fitted model (Fitted); both on
left-hand side axis, and resulting residuals (Residuals); right-hand axis
Source: Authors’ own graph based on calculations conducted with EViews software.
1 Of course, these are just ocular observations and, as has been mentioned when discussing each of these problems, additional tests should be conducted before making final claims about the residuals of the model.
CHAPTER SEVEN
Forecasting
The purpose of econometric models can be seen from two perspectives: one,
to look at what took place and two, to look into the future (time-series) or to
predict values (cross-section). In other words, given certain conditions, what
should be the value of the dependent variable? Looking into the past allows the
researcher to see what variables, and in what magnitude, have contributed to
the value of the researched (explained) variable. An example would be a hedonic housing price model that, given a set of provided characteristics, estimates their direct effect on the price for which the house was sold. This allows, with a certain degree of error, estimating what a house with a given set of descriptive characteristics should sell for. The same can be applied to time-series models: given values of specific explanatory variables, the model allows for an estimation of the parameters of the explanatory variables, which then can be used to simulate what the value of the dependent, researched, variable will be given the values of the independent, explanatory variables.
7.1. Forecasting as a Model Testing Tool
There is no testing like testing in the field. In addition to testing the model
by looking at its statistics (R-squared, for example) another form of testing the
model is by using it as a forecasting tool. The problem with testing the model
using conventional forecasting is that it cannot be done immediately; it has to
be done at the future (ex-ante or out-of-sample) time when model forecasts can
be compared with actual numbers. For example, if we were to estimate a model with Poland's imports as the dependent variable based on data from 1990 to 2010, and the estimation itself took place in the year 2010, we would have to wait until new observations of the independent variables could be collected (if the data is annual, then the year 2011), plug them into the model and compare the value obtained from the model with the actual record of Poland's imports.
The solution to this problem is ex-post forecasting, or forecasting within the
dataset available. In order to do this, prior to estimating the model, a sample
has to be properly set up.
Figure 1. Division of the original data set into Estimation Period, Ex post
and Ex ante sections
[Timeline: T1 ——— Estimation period ——— T2 ——— Ex post forecast period ——— T3 (present time) ——— Ex ante forecast period →]
Source: Authors’ own graphic based on Pindyck, Rubinfeld (1998), p. 203.
Usually, the model is estimated on all data that is available at the moment the research is being conducted (in Figure 1, T1 to T3). There are many good reasons why; two big ones are: one, to have the biggest data set possible, which in turn increases precision as well as allows for the use of more explanatory variables due to the increased number of degrees of freedom; and two, to capture the most recent trends (for obvious reasons, an estimation of a model on data from the years 1960 to 1970 to reflect current trends misses the point completely).
Data permitting, some observations, ideally the most recent ones as they
are the closest to what comes next, should be left out of the data used to
estimate the model. That is, the model should be estimated on data from T1 to T2 (Figure 1), leaving observations from T2 to T3 for testing the model via ex post forecasting.
To perform a forecast, or to test the model ex post, the easiest way is to simply plug the values of the explanatory variables from T2 to T3 (Figure 1) into the model with estimated parameters, and then to compare the results with the values of the dependent variable from that data frame. The plug-in method can also be used to forecast ex ante. The only difference is that the question is not how close our estimated values of the dependent variable are to their corresponding actual values, but, given the values of the independent variables, what the value of the dependent variable would be in, for example, T3+1.
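A minimal sketch of the plug-in ex post procedure in Python (the data is synthetic, and the 100/20 split stands in for the T1–T2 and T2–T3 periods):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 120
x = rng.normal(size=n)
y = 5.0 + 2.0 * x + rng.normal(size=n)

# T1..T2: estimation period; T2..T3: held-out ex post period
X_est, X_post = sm.add_constant(x)[:100], sm.add_constant(x)[100:]
y_est, y_post = y[:100], y[100:]

model = sm.OLS(y_est, X_est).fit()
forecast = model.predict(X_post)          # the plug-in method
rmse = np.sqrt(np.mean((y_post - forecast) ** 2))
print(f"Ex post RMSE = {rmse:.3f}")
```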
7.2. Forecasting with ARIMA
When dealing with time-series data, a very popular way of forecasting variables is through the use of an ARIMA (p, d, q), an Autoregressive Integrated Moving Average model. ARIMA models consist of the autoregressive (AR, p) and moving average (MA, q) terms, where p and q are the respective orders of those processes. The I in the model comes from differencing the data to make the variable stationary, with d being the order of integration. When the forecasted variable is stationary to begin with (d = 0), ARIMA becomes ARMA (p, q).
ARIMA as a tool has significant advantages. First, it does not require any explanatory variables; all one needs is the dependent variable itself. Second, this makes the analysis quick to conduct and, when needed, not time consuming when the procedure has to be repeated (which comes in handy, as will be shown a bit later). The obvious drawback of ARIMA is that, as it depends on the order of observations, it can only be used with time-series data. Also, as no independent variables are used, this analysis does not provide any information on the determinants of changes in the variable of interest.
There are four steps that need to be followed in order to achieve an effective
ARIMA model:
1) test and correct the variable for nonstationarity,1
2) identify the AR and MA terms.
The correlogram of the U.S. imports variable (shown in Table 23) is used to identify p and q, with the Autocorrelation column suggesting the order of the Moving Average (q) and the Partial Correlation column suggesting the autoregressive order (p).
Table 23. Correlogram of the U.S. imports variable after it has been differenced

Autocorrelation    Partial Correlation    Lag
.|**** |           .|**** |                1
.|***  |           .|*    |                2
.|*    |           *|.    |                3
.|*    |           .|.    |                4
Source: Authors’ own table based on calculations conducted with EViews software.
Looking at the above results (which are for the I(1), i.e., first-differenced, series), p = 1 and q = 2.
3) finalize the ARIMA model.
1 Described in detail in section 3.3. Stationarity Test.
Based on the correlogram (Table 23), the ARIMA (1, 1, 2) model is estimated
(results of which are posted in Table 24).
Note that since the first difference had to be taken in order to achieve stationarity, the dependent variable (the U.S. imports in this case) enters the model in its first difference.
Table 24. ARIMA (1, 1, 2) model output for the U.S. imports variable. Note that the dependent variable is not IM but d(IM) – the first difference of the original variable
Variable    Coefficient    Std. Error    t-Statistic    Prob.
C           10.95103       2.640449      4.147412       0.0001
AR(1)       0.41178        0.146435      2.812033       0.0055
MA(1)       0.042426       0.143773      0.29509        0.7683
MA(2)       0.308006       0.089616      3.436974       0.0007
Source: Authors’ own table based on calculations conducted with EViews software.
The coefficients themselves are hard to interpret, though their statistical significance can be tested using p-values. Similarly, model statistics like R-squared can be used to evaluate the model's fit to the original data; since fitting the original data is not the main purpose of using ARIMA models, high R-squared values are unlikely.
4) forecast and test the model, adjusting when needed.
It can be hard to estimate the best ARIMA model on the first attempt, as the reading of the correlogram is subject to the researcher's interpretation. That is why using this approach is sometimes referred to as an art or a skill. If the initial model proves to be unsatisfactory, adjustments to the number of AR and MA terms can be made. Also, it is always worth checking the neighboring models, that is, ±1 AR and ±1 MA orders, when looking for the best fit and the best forecast (the word "best" being used relatively, of course).
Of course, it is useful to first test the estimated model ex post in order to evaluate it, and only then forecast ex ante.
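A minimal sketch of estimating an ARIMA (1, 1, 2) model and producing an ex ante forecast in Python (the series is synthetic and merely stands in for the U.S. imports variable):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(8)
# A synthetic trending series standing in for the U.S. imports variable
im = pd.Series(np.cumsum(10 + rng.normal(scale=5, size=200)))

# ARIMA(1, 1, 2): p = 1 AR term, d = 1 difference, q = 2 MA terms
model = ARIMA(im, order=(1, 1, 2)).fit()
print(model.summary().tables[1])          # coefficients with p-values

forecast = model.forecast(steps=8)        # ex ante forecast, 8 periods ahead
print(forecast)
```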
7.3. Forecast Evaluation
The researcher has many tools to evaluate the forecast. Two common ones are descriptive statistics such as the Proportions and the Root Mean Square Error (provided by the software, and which the estimation process aims to minimize) and the ocular test with upper and lower limits introduced.
Starting with the latter: as with any ocular test, a lot is left open to the interpretation of the examiner. As a result, setting the limits is a subjective
procedure. One common approach is to take the forecasted value and then add double the standard error of the forecast to create the upper limit, and to subtract it to create the lower limit. Plotting the original and the forecasted values with the addition of limits (example shown in Graph 6) over the ex post period shows how well the forecast fits the actual occurrences within the set boundaries. If the forecast is expected to meet more restrictive requirements, the above-mentioned limits can be created with, for example, just one standard error – and the opposite for more liberal requirements. The rule of thumb is that as long as the original values stay within the limits of the forecast, the model does a good job of forecasting the dependent variable. The same evaluation method can be applied to the plug-in method.
Graph 6. A plot of the original U.S. imports data (IM) versus the forecast
(IMF) and the upper (UP) and lower (DOWN) limits
Source: Authors’ own graph based on calculations conducted with EViews software.
From the ocular examination of the forecasted values, the used ARIMA (1, 1, 2)
model performs well over the first year; its values are nearly indistinguishable
from the actual ones. But at the end of the year 2008, the forecast loses its validity
as the actual values cross the set lower limit. It is very likely that a better ARIMA
model should be used. Moreover, this shows that the longer the forecasted
period, the greater the allowance for its error.
Moving to statistics as tools of evaluation, the first one is the Root Mean Squared Error (in the shown example equal to 267.5270), which is useful when comparing forecasts carried out with different models; the better the forecast, the lower the value of the discussed statistic. The catch is that this is a comparative statistic, i.e., it is used to compare between forecasts performed with different models, not to judge a forecast on its own. Three other statistics that should be examined are the Bias Proportion, the Variance Proportion and the Covariance Proportion. The first statistic shows the spread between the mean of the forecast and the mean of the actual data. The second one does the same but for the variation of the forecast and the actual data. The last one measures what is left, that is, the unsystematic forecasting error. For a forecast to be considered a good one, the bias and the variance proportions should be as close to zero as possible, with all the noise being collected in the covariance proportion.2 In the example, the bias proportion is equal to 0.449675, the variance proportion to 0.237572 and the covariance proportion to 0.312753; again suggesting that a better ARIMA model should be sought.
2 For more information see: Pindyck, Rubinfeld (1998), pp. 210–214.
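A hedged sketch of these evaluation statistics in Python; the numbers are illustrative, and the standard error of the forecast is approximated here by the RMSE, which is a simplification:

```python
import numpy as np

# Illustrative actual values and forecasts over an ex post window
actual = np.array([100.0, 103.0, 101.0, 107.0, 110.0])
forecast = np.array([99.0, 104.0, 103.0, 105.0, 112.0])

mse = np.mean((forecast - actual) ** 2)
rmse = np.sqrt(mse)
bias_prop = (forecast.mean() - actual.mean()) ** 2 / mse
var_prop = (forecast.std() - actual.std()) ** 2 / mse
cov_prop = 1.0 - bias_prop - var_prop     # the unsystematic remainder
print(rmse, bias_prop, var_prop, cov_prop)

# Ocular limits: forecast plus/minus two standard errors of the forecast
se = rmse                                  # a simple stand-in for the forecast SE
upper, lower = forecast + 2 * se, forecast - 2 * se
print(np.all((actual >= lower) & (actual <= upper)))
```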
CHAPTER EIGHT
Conclusions
After completing the research and describing it at appropriate length and in appropriate detail, a Conclusions section consisting of closing remarks is written. The conclusion is not the same as the abstract, which talks about what took place from the beginning to the end; the conclusion focuses more on end results and future actions.
A brief summary of the results and their comparison with conclusions drawn
from the literature review and economic theory are a good starting point.
Another common topic to be included in this segment is the discussion of any problems encountered during the work, their sources and a list of possible solutions.
One person cannot cover the topic researched in its entirety. Therefore, the
researcher should suggest the areas related to the topic in which further studies
should be conducted or parts of his or her own work that can be improved
upon.
A. Transition
At this point you have a good understanding of what it takes to get raw
data and transform it using econometrics software packages into meaningful
information. To further see how this is done, it is a good idea to take a look at
one example that carries you through all the steps.
Example
Let us work with data regarding the U.S., more specifically, its macroeconomic
conditions. Prior to starting, it is important to note three things: one, this
example is a full-length one, but it is made as short as possible by omitting some
descriptions; two, this example focuses on the econometric part of a study, as
a result the descriptive parts of the study as well as the literature review have
been omitted; and three, as this is a real-world example, that is, the data is not
staged or edited, some of the results may not look as pretty as they should.
Setup
The aim of this study is to look at what factors should be taken into consideration when explaining changes in the imports of the U.S. Therefore, the structural equation will take the form shown in Equation 38.
Equation 38. Structural equation for the U.S. imports as the dependent variable

IMt = β0 + β1X1t + β2X2t + … + βnXnt + εt
Source: Authors’ own equation.
IMt represents the U.S. imports in year t and it will be explained with the set
of potential independent variables listed in Table 25. These variables were found
through the process of the literature review.
Table 25. Potential independent variables

Name | Symbol in the model | Unit | Source of data | Note
U.S. imports (Imports of Goods and Services) | IM | Billions of Chained U.S. 2005 Dollars | U.S. Department of Commerce: Bureau of Economic Analysis | Seasonally Adjusted, Annual Rate

Independent Variables:
Real Disposable Personal Income | YD | Billions of Chained U.S. 2005 Dollars | U.S. Department of Commerce: Bureau of Economic Analysis | Seasonally Adjusted, Annual Rate
Total Population: All Ages including Armed Forces Overseas | POP | Thousands | U.S. Department of Commerce: Census Bureau | Reported monthly, transformed into quarterly to match other data
Dow Jones Index | DJ | Index | finance.yahoo.com* | Reported monthly, transformed into quarterly to match other data
Consumer Price Index For All Urban Consumers: All Items | CPI | Index, 1982–84 = 100 | U.S. Department of Labor: Bureau of Labor Statistics | Seasonally Adjusted
Exports of Goods and Services | EX | Billions of U.S. Dollars** | U.S. Department of Commerce: Bureau of Economic Analysis | Seasonally Adjusted, Annual Rate
Real Gross Domestic Product | GDP | Billions of Chained U.S. 2005 Dollars | U.S. Department of Commerce: Bureau of Economic Analysis | Seasonally Adjusted, Annual Rate
Real Change in Private Inventories | CHG.INV | Billions of Chained U.S. 2005 Dollars | U.S. Department of Commerce: Bureau of Economic Analysis |
Presence of NAFTA | NAFTA | Dummy Variable (1 – Yes, 0 – No) | |
Presence of the Gold Standard | GOLD | Dummy Variable (1 – Yes, 0 – No) | |
Presence of the recession | RECES | Dummy Variable (1 – Yes, 0 – No) | FRED*** |

* Source: http://finance.yahoo.com/q/hp?s=^DJI&a=00&b=1&c=1960&d=02&e=2&f=2010&g=m&z=66&y=594.
** Ideally, all data would be in the same constant units, but such data was not available for the U.S. exports.
*** http://research.stlouisfed.org/fred2/help-faq/.
Source: Authors’ own table.
Descriptive Statistics
Now that the set of variables to work with has been selected and the data for them has been collected, it is time to look at the descriptive statistics presented at the end of the text. The most important observation is that there is an equal number of observations (200) for all variables. As expected, none of the variables have a normal distribution, but, as discussed earlier and in the suggested reference, this is not an issue.
Hypothesis Statements
Hypothesis statements, which are based on the examined literature, are
presented in Table 26.
Table 26. Hypothesis statements for all independent variables

Variable                                                     Name in the model    Alternative Hypothesis
Real Disposable Personal Income                              YD                   H1: βYD > 0
Total Population: All Ages including Armed Forces Overseas   POP                  H1: βPOP > 0
Dow Jones Index                                              DJ                   H1: βDJ > 0
Consumer Price Index For All Urban Consumers: All Items      CPI                  H1: βCPI < 0
Exports of Goods and Services                                EX                   H1: βEX ≠ 0
Real Gross Domestic Product                                  GDP                  H1: βGDP > 0
Real Change in Private Inventories                           CHG.INV              H1: βCHG.INV > 0
Presence of NAFTA                                            NAFTA                H1: βNAFTA > 0
Presence of the Gold Standard                                GOLD                 H1: βGOLD ≠ 0
Presence of the recession                                    RECES                H1: βRECES < 0
Source: Authors’ own table.
Correlation matrix
The next step is to look at the correlation matrix (see Table 27) for high correlation coefficients between the dependent variable and the possible independent variables, as well as for signs of multicollinearity.
Table 27. Correlation Matrix for all variables

          IM      YD      POP     DJ      CPI     EX      GDP     CHGINV  NAFTA   GOLD    RECES
IM        1.00    0.97    0.94    0.98    0.94    0.98    0.97    -0.02   0.85    -0.49   0.01
YD        0.97    1.00    0.99    0.94    0.99    0.98    1.00    -0.05   0.86    -0.64   0.02
POP       0.94    0.99    1.00    0.91    0.99    0.97    0.99    -0.04   0.86    -0.68   0.01
DJ        0.98    0.94    0.91    1.00    0.91    0.97    0.95    -0.08   0.87    -0.41   0.03
CPI       0.94    0.99    0.99    0.91    1.00    0.96    0.99    -0.05   0.87    -0.64   0.01
EX        0.98    0.98    0.97    0.97    0.96    1.00    0.98    -0.05   0.89    -0.53   0.04
GDP       0.97    1.00    0.99    0.95    0.99    0.98    1.00    -0.03   0.87    -0.62   0.00
CHGINV    -0.02   -0.05   -0.04   -0.08   -0.05   -0.05   -0.03   1.00    0.03    -0.01   -0.50
NAFTA     0.85    0.86    0.86    0.87    0.87    0.89    0.87    0.03    1.00    -0.42   -0.05
GOLD      -0.49   -0.64   -0.68   -0.41   -0.64   -0.53   -0.62   -0.01   -0.42   1.00    -0.02
RECES     0.01    0.02    0.01    0.03    0.01    0.04    0.00    -0.50   -0.05   -0.02   1.00
Source: Authors’ own table based on calculations conducted with EViews software.
From the above-presented correlation table it is clear that all but three of the independent variables (change in inventories, presence of the Gold Standard and presence of the recession) are highly, positively and statistically significantly correlated with the dependent variable. Unfortunately, when looking at the correlation coefficients between the independent variables themselves, there exists a high probability of multicollinearity. As a result, attention should be paid to variables that are derived from other variables (for example, exports and gross domestic product), as only one of them – theoretically, the one that has the highest correlation coefficient with the dependent variable – should be included in the model. Additionally, the R-squared of the model and the p-values of the coefficients of included explanatory variables will be monitored for signs of multicollinearity.
Unit Root Test
As the literature shows that the most important explanatory variables are disposable income (the more money people have, the more they will buy) and population (the higher the number of customers, the higher the number of purchases), the test for stationarity is first carried out only for those two and the dependent variable. Hypothesis statements for the unit root test for each of the three variables are shown in Table 28.
Table 28. Hypothesis statements for the Unit Root tests

Variable    Null Hypothesis                        Alternative Hypothesis
IM          H0: the variable is nonstationary      H1: the variable is stationary
YD          H0: the variable is nonstationary      H1: the variable is stationary
POP         H0: the variable is nonstationary      H1: the variable is stationary
Source: Authors’ own table.
The results of the Augmented Dickey-Fuller tests1 are shown in Table 29. None of the original series was stationary in levels. Taking the first difference solved the problem for the U.S. imports and disposable income, but the variable representing the U.S. population had to be differenced twice in order to achieve stationarity.
Table 29. Results of the Augmented Dickey-Fuller test for the presence of
a Unit Root

Series        ADF test statistic    Prob.
IM                   0.493          0.986
D(IM)               -7.628          0.000
YD                   3.455          1.000
D(YD)               -8.902          0.000
POP                  1.776          1.000
D(POP, 2)           -4.427          0.000

Test critical values: 1% level, -3.463 to -3.465; 5% level, -2.876 to -2.877;
10% level, -2.575 (the exact critical values differ marginally across the six
tests with the number of usable observations).

Source: Authors’ own table based on calculations conducted with EViews software.

1 There are other tests for the presence of the unit root, but the Augmented
Dickey-Fuller test is administered as it does not suffer from the problem of
subjectivism like other tests, for example, the analysis of the graph.
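For readers replicating this check outside EViews, a minimal sketch with
statsmodels follows; im and pop are assumed to be pandas Series holding the
quarterly data (hypothetical names, for illustration only).

from statsmodels.tsa.stattools import adfuller

def adf_report(series, name):
    # adfuller returns the test statistic, its p-value, the lag count used,
    # the number of observations and the critical values, among others.
    stat, pvalue, _, _, crit, _ = adfuller(series.dropna(), autolag="AIC")
    print(name, round(stat, 3), round(pvalue, 3), crit)

adf_report(im, "IM")                        # level: non-stationary expected
adf_report(im.diff(), "D(IM)")              # first difference
adf_report(pop.diff().diff(), "D(POP, 2)")  # second difference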
Model Estimation
As mentioned earlier, thankfully, the literature review has put forward
two independent variables that are the most frequently cited in previous works,
thereby allowing for the construction of the restricted structural equation
shown in Equation 39.2
Equation 39. Restricted structural equation
D(IMt) = β0 + β1D(YD) + β2D(POP, 2) + εt
Source: Authors’ own equation.
Now that the restricted equation is properly specified, the estimation
procedure can begin. The Ordinary Least Squares method is employed to estimate
the parameters of the model. The results of the estimation are presented in
Table 30, the model’s statistics are shown in Table 31, and the resulting
structural model is shown in Equation 40.
Equation 40. Restricted structural model
D(IMt) = 5.073 + 0.142D(YD) − 0.008D(POP, 2)
Source: Authors’ own equation based on calculations conducted with EViews software.
2 Note that if the final model can be constructed based on the literature review, it is the preferred way to proceed.
Table 30. Values of the restricted model’s parameters
Variable     Coefficient   Std. Error   t-Statistic   Prob.
C              5.073         1.741        2.913       0.004
D(YD)          0.142         0.027        5.210       0.000
D(POP,2)      -0.008         0.015       -0.513       0.609
Source: Authors’ own table based on calculations conducted with EViews software.
Table 31. Values of the restricted model’s statistics
R-squared             0.130       Mean dependent var.      11.023
Adjusted R-squared    0.120       S.D. dependent var.      19.094
S.E. of regression    17.908      Akaike info criterion     8.624
Sum squared resid.    58686.390   Schwarz criterion         8.676
Log likelihood       -799.065     Hannan-Quinn criter.      8.645
F-statistic           13.662      Durbin-Watson stat.       1.168
Prob. (F-statistic)   0.000
Source: Authors’ own table based on calculations conducted with EViews software.
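For readers who want to reproduce this estimation step outside EViews, a
minimal OLS sketch in Python with statsmodels follows; data is the hypothetical
DataFrame from the earlier snippet, and d_im, d_yd and d2_pop are names assumed
here for illustration.

import pandas as pd
import statsmodels.api as sm

# The series are differenced exactly as dictated by the stationarity tests.
d_im = data["IM"].diff()
d_yd = data["YD"].diff().rename("D_YD")
d2_pop = data["POP"].diff().diff().rename("D2_POP")

X = sm.add_constant(pd.concat([d_yd, d2_pop], axis=1))
restricted = sm.OLS(d_im, X, missing="drop").fit()
print(restricted.summary())  # coefficients as in Table 30, statistics as in Table 31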
Examining the model’s statistics first, it can be said that the model as a whole
is statistically significant – Prob. (F-statistic) = 0.000 – but it is a poor
model, as it explains only 13% of the variation in the dependent variable
according to the R-squared statistic, and even less, 12%, according to the
Adjusted R-squared statistic. In addition, the model suffers from the presence
of autocorrelation, which is suggested by the Durbin-Watson statistic (1.168)
and confirmed by the Breusch-Godfrey Serial Correlation Lagrange Multiplier
test, whose null hypothesis of no autocorrelation is rejected due to
p-value = 0.000 (Table 32).
Table 32. Results of the Breusch-Godfrey Serial Correlation Lagrange Multiplier test

Breusch-Godfrey Serial Correlation LM Test
F-statistic       23.221    Prob. F (2,181)        0.000
Obs*R-squared     37.980    Prob. Chi-Square (2)   0.000
Source: Authors’ own table based on calculations conducted with EViews software.
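The same diagnostic can be sketched in Python, reusing the fitted restricted
model from the earlier snippet (an illustration of the technique, not the
authors' EViews session).

from statsmodels.stats.diagnostic import acorr_breusch_godfrey

# Breusch-Godfrey LM test with 2 lags; the null hypothesis is no serial
# correlation up to the chosen lag order.
lm_stat, lm_pvalue, f_stat, f_pvalue = acorr_breusch_godfrey(restricted, nlags=2)
print("Obs*R-squared:", round(lm_stat, 3), "p-value:", round(lm_pvalue, 4))
print("F-statistic:  ", round(f_stat, 3), "p-value:", round(f_pvalue, 4))
# A p-value below 0.05 rejects the null of no autocorrelation.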
As for the coefficients of the independent variables forced in based on the
literature (Table 30), only the one assigned to disposable income is
statistically significant (p-value = 0.000), with its sign in line with the
stated hypothesis (0.142). The coefficient of population is found to be highly
statistically insignificant (p-value = 0.609) at the 5% level of significance,
and its sign is the opposite of what was expected (-0.008).3
Obviously, the model needs to be improved upon. To do so, the auxiliary
regression (Equation 41) is first estimated, with the residuals from
Equation 40 as the dependent variable and all possible explanatory variables
from Table 25 as independent factors.
Equation 41. Structural auxiliary equation
εa = α0 + α1YD + α2POP + α3DJ + α4CPI + α5EX + α6GDP + α7CHG.INV + α8NAFTA + α9GOLD + α10RECES + γa
Source: Authors’ own equation.
The results of the equation (Table 33) are as expected when looking at the
p-values of the already included independent variables: the p-value of
disposable income is very high (0.791), and the p-value for population is low,
which is expected given its lack of statistical significance in the restricted
model.
Table 33. Results of the auxiliary regression (1)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           -195.817       107.739      -1.818       0.071
YD            -0.005         0.017      -0.266       0.791
POP            0.001         0.001       1.907       0.058
DJ             0.005         0.002       2.088       0.038
CPI            0.154         0.175       0.878       0.381
EX            -0.014         0.034      -0.411       0.681
GDP           -0.016         0.015      -1.041       0.299
CHGINV         0.192         0.043       4.492       0.000
NAFTA         -0.756         7.061      -0.107       0.915
GOLD           1.819         6.596       0.276       0.783
RECES        -10.658         3.416      -3.120       0.002
Source: Authors’ own table based on calculations conducted with EViews software.
Prior to adding any new explanatory variables to the restricted model,
a statistical test with the use of a Lagrange Multiplier (number of observations
times the R-squared from the auxiliary model; 200 • 0.360679 = 72.1358) is
carried out, with the null hypothesis of no more information to be extracted
(H0: αk+1 = αk+2 = … = αk+m = 0) and the alternative that some more information
can be extracted (H1: αk+i ≠ 0 for at least some i). Since at a 5% level of
significance and 10 − 2 = 8 degrees of freedom χ2critical (15.50731) is less
than χ2observed (72.1358), the null hypothesis is rejected and a statement can
be made that there is still some information that can be extracted.

3 This may happen. Some variables work for some test subjects, in this case
countries, and some, even the ones most often used in the literature, may be
found to be highly statistically insignificant. Still, as both of the used
explanatory factors are the ones used most often in the literature, they will
stay in the model. At the same time, if none, or only a few, of the staple
independent variables work, it is wise to use other ones.
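A sketch of this Lagrange Multiplier check in Python follows; candidates is a
hypothetical DataFrame assumed to hold all ten candidate regressors from
Table 25.

import statsmodels.api as sm
from scipy.stats import chi2

# Auxiliary regression: residuals of the restricted model on all candidates.
aux = sm.OLS(restricted.resid, sm.add_constant(candidates), missing="drop").fit()

lm_observed = int(aux.nobs) * aux.rsquared  # e.g., 200 * 0.360679 = 72.1358
df = 10 - 2                                 # candidates minus regressors already included
lm_critical = chi2.ppf(0.95, df)            # 15.50731 for 8 degrees of freedom

# Reject H0 (no more information to extract) when observed exceeds critical.
print(lm_observed, lm_critical, lm_observed > lm_critical)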
From the output presented above in Table 33, the obvious choices for addition
to the unrestricted model are the variable representing changes in inventories
(p-value = 0.000) and the presence of recession (p-value = 0.002). First, the
two variables are tested for stationarity (Table 34). Both variables prove to
be stationary in levels – they do not have a unit root – as the observed test
statistics lie below the critical values for both and the p-values are smaller
than 0.05.
Table 34. Stationarity test for CHG.INV and RECES variables
CHG.INV
Augmented Dickey-Fuller test statistic     t-Statistic: -7.2201    Prob.: 0.000
Test critical values:   1% level: -3.4654   5% level: -2.8768   10% level: -2.5750

RECES
Augmented Dickey-Fuller test statistic     t-Statistic: -5.3811    Prob.: 0.000
Test critical values:   1% level: -3.4654   5% level: -2.8768   10% level: -2.5750
Source: Authors’ own table based on calculations conducted with EViews software.
After adding the newly selected independent variables to the model, the
structural equation takes the form shown in Equation 42, the estimated
parameters have the values shown in Table 35, and the model’s statistics are
presented in Table 36.
Equation 42. Unrestricted structural model
D(IMt) = β0 + β1D(YD) + β2D(POP, 2) + β3CHGINV + β4RECES + εt
Source: Authors’ own equation.
Table 35. Values of the unrestricted model’s parameters
Variable     Coefficient   Std. Error   t-Statistic   Prob.
C              3.064         2.105        1.456       0.147
D(YD)          0.094         0.024        3.844       0.000
D(POP,2)      -0.002         0.013       -0.178       0.859
CHGINV         0.209         0.041        5.142       0.000
RECES        -12.254         3.457       -3.544       0.001
Source: Authors’ own table based on calculations conducted with EViews software.
Table 36. Values of the unrestricted model’s statistics
R-squared             0.355        Mean dependent var       11.023
Adjusted R-squared    0.340        S.D. dependent var       19.094
S.E. of regression    15.507       Akaike info criterion     8.347
Sum squared resid     43,526.250   Schwarz criterion         8.434
Log likelihood       -771.272      Hannan-Quinn criter.      8.382
F-statistic           24.870       Durbin-Watson stat        1.413
Prob(F-statistic)     0.000
Source: Authors’ own table based on calculations conducted with EViews software.
As expected, both the R-squared and the Adjusted R-squared have increased,
confirming the notion that the addition of the new explanatory variables was
a good decision. The model as a whole is still statistically valid, with a
higher F-statistic whose probability equals 0.000.
Another round of tests is run to see if there is still some more information to
be extracted. As it turns out, for 10 − 4 = 6 degrees of freedom at a 5% level
of significance, χ2critical (12.59159) is less than χ2observed
(200 • 0.195209 = 39.0418); hence, the output of the second auxiliary model
should be examined for new independent variables.
Still, for the purpose of this example, let us assume that the unrestricted
model based on the structural equation shown in Equation 42 is the final model
and proceed with the tests.
Starting with the test of the model for multicollinearity, the correlation
matrix (Table 27) strongly suggests that it may prove to be an issue. Still,
given the fact that the independent variables in the model are statistically
significant (with the exception of population) and the R-squared is not
excessively high, multicollinearity is not expected to be an issue. As for
autocorrelation, the Breusch-Godfrey Serial Correlation Lagrange Multiplier
test shows that it is an issue for both the 1st (Table 37) and the 2nd
(Table 38) order.
Table 37. Breusch-Godfrey Serial Correlation Lagrange Multiplier test for the final model (1)

Breusch-Godfrey Serial Correlation LM Test
F-statistic      15.49961    Prob. F (2,179)        0.00
Obs*R-squared    27.45655    Prob. Chi-Square (2)   0.00
Source: Authors’ own table based on calculations conducted with EViews software.
Table 38. Breusch-Godfrey Serial Correlation Lagrange Multiplier test for the final model (2)

Breusch-Godfrey Serial Correlation LM Test
F-statistic      6.178035    Prob. F (2,177)        0.0025
Obs*R-squared    12.07182    Prob. Chi-Square (2)   0.0024
Source: Authors’ own table based on calculations conducted with EViews software.
Because there are only a few independent variables in the model, using lags of
the dependent variable as additional regressors is not advised, since the ratio
of the former to the latter would be 2:1. A solution to this issue is the
inclusion of AR(p) terms. After AR(1) was introduced, the Breusch-Godfrey
Serial Correlation Lagrange Multiplier test still showed that autocorrelation
was an issue (Table 38). Therefore, the second term, AR(2), was added; the test
results shown in Table 39 suggest failing to reject the null of no
autocorrelation, eventually yielding the output of the final model shown in
Table 40 and its statistics in Table 41. This result is supported by the
Durbin-Watson statistic (1.942) being very close to its ideal value of 2.00.
Table 39. Breusch-Godfrey Serial Correlation Lagrange Multiplier test for the final model (3)

Breusch-Godfrey Serial Correlation LM Test
F-statistic      0.867717    Prob. F (2,175)        0.4217
Obs*R-squared    1.806768    Prob. Chi-Square (2)   0.4052
Source: Authors’ own table based on calculations conducted with EViews software.
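One way to mimic EViews' AR(1)/AR(2) terms outside EViews is to estimate the
regression with AR(2) errors via statsmodels' SARIMAX. This is a sketch under
the assumption that y holds D(IM) and X holds the four regressors prepared
earlier (hypothetical names); the exact estimates may differ slightly from the
EViews output.

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Regression with AR(2) errors: order=(2, 0, 0) adds the two autoregressive
# terms, trend="c" keeps the constant in the model.
ar_model = SARIMAX(y, exog=X, order=(2, 0, 0), trend="c").fit(disp=False)
print(ar_model.summary())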
Table 40. Values of the corrected unrestricted model’s parameters

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C              6.478         2.810        2.305       0.022
D(YD)          0.055         0.022        2.558       0.011
D(POP,2)      -0.001         0.010       -0.150       0.881
CHGINV         0.156         0.045        3.485       0.001
RECES        -15.311         3.981       -3.846       0.000
AR(1)          0.262         0.077        3.401       0.001
AR(2)          0.243         0.075        3.240       0.001
Source: Authors’ own table based on calculations conducted with EViews software.
Table 41. Values of the corrected unrestricted model’s statistics
R-squared             0.448       Mean dependent var       11.191
Adjusted R-squared    0.429       S.D. dependent var       19.129
S.E. of regression    14.457      Akaike info criterion     8.217
Sum squared resid     36991.520   Schwarz criterion         8.340
Log likelihood       -749.007     Hannan-Quinn criter.      8.267
F-statistic           23.903      Durbin-Watson stat        1.942
Prob(F-statistic)     0.000
Source: Authors’ own table based on calculations conducted with EViews software.
Another test is to see if the residuals have a normal distribution.4 This is
done by examining the Jarque-Bera statistic (5.569), with the null hypothesis
of a normal distribution. Since the p-value associated with the statistic
equals 0.059 and is just above the 5% significance cut-off (0.05), it is
possible to say that, at this level of significance, the residuals are normally
distributed.
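A sketch of the normality check in Python, assuming resid holds the residuals
of the final model (a hypothetical name):

from scipy.stats import jarque_bera

jb_stat, jb_pvalue = jarque_bera(resid)
print("Jarque-Bera:", round(jb_stat, 3), "p-value:", round(jb_pvalue, 4))
# A p-value above 0.05 fails to reject the null of normally distributed residuals.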
The last test is the White test for heteroscedasticity, with the null
hypothesis of no heteroscedasticity. Since the p-value of the test (0.086,
shown in Table 42) is above the 5% cut-off point, it can be concluded that the
model does not suffer from heteroscedasticity.
Table 42. White heteroscedasticity test for the final model

Heteroscedasticity Test: White
F-statistic            1.444     Prob. F (27,156)        0.086
Obs*R-squared         36.788     Prob. Chi-Square (27)   0.099
Scaled explained SS   44.504     Prob. Chi-Square (27)   0.018
Source: Authors’ own table based on calculations conducted with EViews software.
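The White test can be sketched in Python as well; het_white takes the residuals
and the regressor matrix, including the constant (shown here on the OLS fit
from the earlier snippets as an illustration of the technique, not the EViews
output above).

from statsmodels.stats.diagnostic import het_white

lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(restricted.resid,
                                                 restricted.model.exog)
print("Obs*R-squared:", round(lm_stat, 3), "p-value:", round(lm_pvalue, 4))
# A p-value above 0.05 fails to reject the null of no heteroscedasticity.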
Moving to the model’s assessment, the estimated model explains 44.8%
of variation in the dependent variable (R-squared = 0.448). In its entirety, the
model is statistically significant – Prob. (F-statistic) = 0.000.
As for the coefficients, all but the one assigned to population (p-value =
0.881) are statistically significant at a 5% level of significance (this
includes both autoregressive terms) – the highest p-value among them, 0.011,
is associated with disposable income. The interpretation of the statistically
significant coefficients is as follows:
1) YD: If the difference in real disposable income increases by one billion (of
Chained U.S. 2005) USD, the difference5 in the U.S. imports will increase
by 0.055 billion (of Chained U.S. 2005) USD, or 55,491,000 Chained U.S.
2005 USD,
4 Remembering that, as presented earlier, this is an ideal assumption.
5 Remember that for stationarity reasons, the dependent variable had to be differenced.
2) CHGINV: If the real change in private inventories increases by one billion
(of Chained U.S. 2005) USD, the difference in the U.S. imports will increase
by 0.156 billion (of Chained U.S. 2005) USD, or 156,196,000 Chained U.S.
2005 USD,
3) RECES: If the U.S. is in a recession, the difference in the U.S. imports will
decrease by 15.311 billion (of Chained U.S. 2005) USD, or 15,311,240,000
Chained U.S. 2005 USD.
All of the hypothesis statements regarding the signs of incorporated
independent variables (as listed in Table 26) have been statistically confirmed at
a 5% level of significance.6
Additionally, let us examine the graph (Graph 7), which shows the fitted data
set against the actual data, with the resulting residuals incorporated.
Graph 7. Actual, fitted data and residuals of the final model
Source: Authors’ own graph based on calculations conducted with EViews software.
The fitted data is still off the actual data, which can be seen in the
discrepancy between the two series and in the large jumps in the values of the
residuals.7
Lastly, the model is tested ex post on the data from the first quarter of 2007
to the fourth quarter of 2009. This is done in two ways. First, the forecast
(IMF) is evaluated visually (Graph 8) by comparing it to the actual data (IM),
with the upper/lower boundary set by adding/subtracting twice the value of the
standard error of the forecast to/from the IMF value.
6 This statement is made based on the examination of the p-values of those
coefficients. Of course, t-tests can be carried out to manually prove the
referred-to statement, but in practice this is omitted to avoid repetition.
7 This is expected, as the low quality of the fit is suggested by the value of
the R-squared statistic and is due to the assumption that the analyzed model is
the final model.
Graph 8. Ex post forecast of the final model
Source: Authors’ own graph based on calculations conducted with EViews software.
The graph shows that until the third quarter of 2008 the model did a very good
job of forecasting the values of the U.S. imports. After that, the actual data
begins to deviate significantly from the forecast, which was still able to
detect the incoming downward trend with a recovery at the end. This discrepancy
between the two series can be corrected for by adding new explanatory variables
(for example, a variable correcting for the presence of the 2007 economic
crisis, which occurred at this time).
Examining the forecast’s statistics – the three key proportions (bias, 0.3411;
variance, 0.5972; and covariance, 0.0617)8 – it is possible to say that the
bulk of the forecast error is associated with the fact that the variation of
the forecast is far from the variation of the actual series, followed by the
fact that the mean of the forecast is far from the mean of the actual series,
with the least error being associated with the covariance proportion.9
8 The Root Mean Squared Error, although very important, is not evaluated here,
as it is used to compare between forecasts.
9 In a good forecast, the bias and variance proportions ought to be very low,
with the bulk of the error being attributed to the covariance proportion.
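The three proportions follow the standard decomposition of the forecast mean
squared error; a sketch in Python, where actual and forecast are hypothetical
arrays holding the ex post data:

import numpy as np

def forecast_proportions(actual, forecast):
    # Decomposes the forecast MSE into bias, variance and covariance
    # proportions; the three numbers sum to one.
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    mse = np.mean((f - a) ** 2)
    bias = (f.mean() - a.mean()) ** 2 / mse
    variance = (f.std() - a.std()) ** 2 / mse
    covariance = 2.0 * (1.0 - np.corrcoef(a, f)[0, 1]) * a.std() * f.std() / mse
    return bias, variance, covariance

# The plot bounds in Graph 8: forecast plus/minus twice its standard error.
# upper, lower = forecast + 2 * se_forecast, forecast - 2 * se_forecast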
Table 43. Descriptive Statistics

            IM        YD        POP         DJ         CPI      EX        GDP        CHGINV    NAFTA   GOLD    RECES
Mean        757.22    5402.41   240896.80   3770.61    106.52   580.11    7339.93     24.62    0.38    0.22    0.20
Median      505.55    5078.25   237375.50   1262.77    105.78   357.77    6708.77     25.01    0.00    0.00    0.00
Max.        2208.34   10095.10  308413.30   13379.36   218.91   1670.43   13415.27   117.20    1.00    1.00    1.00
Min.        108.45    1955.50   179590.30   573.47     29.40    94.76     2802.62   -160.22    0.00    0.00    0.00
Std. Dev.   647.76    2415.92   36955.19    3940.41    61.10    455.34    3202.17     37.41    0.49    0.42    0.40
Skewness    0.97      0.43      0.18        1.01       0.20     0.78      0.43       -1.21     0.49    1.35    1.54
Kurtosis    2.54      2.03      1.85        2.41       1.66     2.29      1.95        7.82     1.24    2.83    3.37
J-B         32.94     13.94     12.07       37.21      16.29    24.47     15.35      242.21    33.83   61.16   80.16
Prob.       0.00      0.00      0.00        0.00       0.00     0.00      0.00        0.00     0.00    0.00    0.00
Obs.        200       200       200         200        200      200       200         200      200     200     200

Source: Authors’ own table based on calculations conducted with EViews software.
Final Remarks
Performing econometric research is a science and writing a clear description
is an art. The purpose of this book was to guide you through bringing both
of those skills together. Be it describing variables or using the LM test to find
the presence of autocorrelation, the important thing is to understand that
conducting research is a step-by-step process. Yet, just like any other highly
structured process, even this one sometimes requires adjustments. Think, make
a plan, take notes and you will be fine.
And always remember: if the p-value is low, the null has to go.
Statistical Tables
z-table
Area between 0 and z

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.2   0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
0.3   0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.4   0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
0.5   0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
0.6   0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
0.8   0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
0.9   0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.0   0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.1   0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
1.2   0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
1.3   0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
1.4   0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
1.5   0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
1.6   0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
1.7   0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
1.8   0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
1.9   0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767
2.0   0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
2.1   0.4821  0.4826  0.4830  0.4834  0.4838  0.4842  0.4846  0.4850  0.4854  0.4857
2.2   0.4861  0.4864  0.4868  0.4871  0.4875  0.4878  0.4881  0.4884  0.4887  0.4890
2.3   0.4893  0.4896  0.4898  0.4901  0.4904  0.4906  0.4909  0.4911  0.4913  0.4916
2.4   0.4918  0.4920  0.4922  0.4925  0.4927  0.4929  0.4931  0.4932  0.4934  0.4936
2.5   0.4938  0.4940  0.4941  0.4943  0.4945  0.4946  0.4948  0.4949  0.4951  0.4952
2.6   0.4953  0.4955  0.4956  0.4957  0.4959  0.4960  0.4961  0.4962  0.4963  0.4964
2.7   0.4965  0.4966  0.4967  0.4968  0.4969  0.4970  0.4971  0.4972  0.4973  0.4974
2.8   0.4974  0.4975  0.4976  0.4977  0.4977  0.4978  0.4979  0.4979  0.4980  0.4981
2.9   0.4981  0.4982  0.4982  0.4983  0.4984  0.4984  0.4985  0.4985  0.4986  0.4986
3.0   0.4987  0.4987  0.4987  0.4988  0.4988  0.4989  0.4989  0.4989  0.4990  0.4990
t-table

degrees of   probability
freedom      (one-tail test)  0.4        0.25       0.1        0.05       0.025     0.01      0.005     0.0005
             (two-tail test)  0.8        0.5        0.2        0.1        0.05      0.02      0.01      0.001
1                             0.32492    1          3.077684   6.313752   12.7062   31.82052  63.65674  636.6192
2                             0.288675   0.816497   1.885618   2.919986   4.30265   6.96456   9.92484   31.5991
3                             0.276671   0.764892   1.637744   2.353363   3.18245   4.5407    5.84091   12.924
4                             0.270722   0.740697   1.533206   2.131847   2.77645   3.74695   4.60409   8.6103
5                             0.267181   0.726687   1.475884   2.015048   2.57058   3.36493   4.03214   6.8688
6                             0.264835   0.717558   1.439756   1.94318    2.44691   3.14267   3.70743   5.9588
7                             0.263167   0.711142   1.414924   1.894579   2.36462   2.99795   3.49948   5.4079
8                             0.261921   0.706387   1.396815   1.859548   2.306     2.89646   3.35539   5.0413
9                             0.260955   0.702722   1.383029   1.833113   2.26216   2.82144   3.24984   4.7809
10                            0.260185   0.699812   1.372184   1.812461   2.22814   2.76377   3.16927   4.5869
11                            0.259556   0.697445   1.36343    1.795885   2.20099   2.71808   3.10581   4.437
12                            0.259033   0.695483   1.356217   1.782288   2.17881   2.681     3.05454   4.3178
13                            0.258591   0.693829   1.350171   1.770933   2.16037   2.65031   3.01228   4.2208
14                            0.258213   0.692417   1.34503    1.76131    2.14479   2.62449   2.97684   4.1405
15                            0.257885   0.691197   1.340606   1.75305    2.13145   2.60248   2.94671   4.0728
16                            0.257599   0.690132   1.336757   1.745884   2.11991   2.58349   2.92078   4.015
17                            0.257347   0.689195   1.333379   1.739607   2.10982   2.56693   2.89823   3.9651
18                            0.257123   0.688364   1.330391   1.734064   2.10092   2.55238   2.87844   3.9216
19                            0.256923   0.687621   1.327728   1.729133   2.09302   2.53948   2.86093   3.8834
20                            0.256743   0.686954   1.325341   1.724718   2.08596   2.52798   2.84534   3.8495
21                            0.25658    0.686352   1.323188   1.720743   2.07961   2.51765   2.83136   3.8193
22                            0.256432   0.685805   1.321237   1.717144   2.07387   2.50832   2.81876   3.7921
23                            0.256297   0.685306   1.31946    1.713872   2.06866   2.49987   2.80734   3.7676
24                            0.256173   0.68485    1.317836   1.710882   2.0639    2.49216   2.79694   3.7454
25                            0.25606    0.68443    1.316345   1.708141   2.05954   2.48511   2.78744   3.7251
26                            0.255955   0.684043   1.314972   1.705618   2.05553   2.47863   2.77871   3.7066
27                            0.255858   0.683685   1.313703   1.703288   2.05183   2.47266   2.77068   3.6896
28                            0.255768   0.683353   1.312527   1.701131   2.04841   2.46714   2.76326   3.6739
29                            0.255684   0.683044   1.311434   1.699127   2.04523   2.46202   2.75639   3.6594
30                            0.255605   0.682756   1.310415   1.697261   2.04227   2.45726   2.75      3.646
inf                           0.253347   0.67449    1.281552   1.644854   1.95996   2.32635   2.57583   3.2905
F-table at 0.01 level of significance

df in          df in numerator
denominator    1         2         3         4         5         6
1              4052.181  4999.5    5403.352  5624.583  5763.65   5858.986
2              98.503    99        99.166    99.249    99.299    99.333
3              34.116    30.817    29.457    28.71     28.237    27.911
4              21.198    18        16.694    15.977    15.522    15.207
5              16.258    13.274    12.06     11.392    10.967    10.672
6              13.745    10.925    9.78      9.148     8.746     8.466
7              12.246    9.547     8.451     7.847     7.46      7.191
8              11.259    8.649     7.591     7.006     6.632     6.371
9              10.561    8.022     6.992     6.422     6.057     5.802
10             10.044    7.559     6.552     5.994     5.636     5.386
11             9.646     7.206     6.217     5.668     5.316     5.069
12             9.33      6.927     5.953     5.412     5.064     4.821
13             9.074     6.701     5.739     5.205     4.862     4.62
14             8.862     6.515     5.564     5.035     4.695     4.456
15             8.683     6.359     5.417     4.893     4.556     4.318
16             8.531     6.226     5.292     4.773     4.437     4.202
17             8.4       6.112     5.185     4.669     4.336     4.102
18             8.285     6.013     5.092     4.579     4.248     4.015
19             8.185     5.926     5.01      4.5       4.171     3.939
20             8.096     5.849     4.938     4.431     4.103     3.871
21             8.017     5.78      4.874     4.369     4.042     3.812
22             7.945     5.719     4.817     4.313     3.988     3.758
23             7.881     5.664     4.765     4.264     3.939     3.71
24             7.823     5.614     4.718     4.218     3.895     3.667
25             7.77      5.568     4.675     4.177     3.855     3.627
26             7.721     5.526     4.637     4.14      3.818     3.591
27             7.677     5.488     4.601     4.106     3.785     3.558
28             7.636     5.453     4.568     4.074     3.754     3.528
29             7.598     5.42      4.538     4.045     3.725     3.499
30             7.562     5.39      4.51      4.018     3.699     3.473
40             7.314     5.179     4.313     3.828     3.514     3.291
60             7.077     4.977     4.126     3.649     3.339     3.119
120            6.851     4.787     3.949     3.48      3.174     2.956
inf            6.635     4.605     3.782     3.319     3.017     2.802

df in          df in numerator
denominator    7         8         9         10        12        15
1              5928.356  5981.07   6022.473  6055.847  6106.321  6157.285
2              99.356    99.374    99.388    99.399    99.416    99.433
3              27.672    27.489    27.345    27.229    27.052    26.872
4              14.976    14.799    14.659    14.546    14.374    14.198
5              10.456    10.289    10.158    10.051    9.888     9.722
6              8.26      8.102     7.976     7.874     7.718     7.559
7              6.993     6.84      6.719     6.62      6.469     6.314
8              6.178     6.029     5.911     5.814     5.667     5.515
9              5.613     5.467     5.351     5.257     5.111     4.962
10             5.2       5.057     4.942     4.849     4.706     4.558
11             4.886     4.744     4.632     4.539     4.397     4.251
12             4.64      4.499     4.388     4.296     4.155     4.01
13             4.441     4.302     4.191     4.1       3.96      3.815
14             4.278     4.14      4.03      3.939     3.8       3.656
15             4.142     4.004     3.895     3.805     3.666     3.522
16             4.026     3.89      3.78      3.691     3.553     3.409
17             3.927     3.791     3.682     3.593     3.455     3.312
18             3.841     3.705     3.597     3.508     3.371     3.227
19             3.765     3.631     3.523     3.434     3.297     3.153
20             3.699     3.564     3.457     3.368     3.231     3.088
21             3.64      3.506     3.398     3.31      3.173     3.03
22             3.587     3.453     3.346     3.258     3.121     2.978
23             3.539     3.406     3.299     3.211     3.074     2.931
24             3.496     3.363     3.256     3.168     3.032     2.889
25             3.457     3.324     3.217     3.129     2.993     2.85
26             3.421     3.288     3.182     3.094     2.958     2.815
27             3.388     3.256     3.149     3.062     2.926     2.783
28             3.358     3.226     3.12      3.032     2.896     2.753
29             3.33      3.198     3.092     3.005     2.868     2.726
30             3.304     3.173     3.067     2.979     2.843     2.7
40             3.124     2.993     2.888     2.801     2.665     2.522
60             2.953     2.823     2.718     2.632     2.496     2.352
120            2.792     2.663     2.559     2.472     2.336     2.192
inf            2.639     2.511     2.407     2.321     2.185     2.039

df in          df in numerator
denominator    20        30        40        60        120       inf
1              6208.73   6260.649  6286.782  6313.03   6339.391  6365.864
2              99.449    99.466    99.474    99.482    99.491    99.499
3              26.69     26.505    26.411    26.316    26.221    26.125
4              14.02     13.838    13.745    13.652    13.558    13.463
5              9.553     9.379     9.291     9.202     9.112     9.02
6              7.396     7.229     7.143     7.057     6.969     6.88
7              6.155     5.992     5.908     5.824     5.737     5.65
8              5.359     5.198     5.116     5.032     4.946     4.859
9              4.808     4.649     4.567     4.483     4.398     4.311
10             4.405     4.247     4.165     4.082     3.996     3.909
11             4.099     3.941     3.86      3.776     3.69      3.602
12             3.858     3.701     3.619     3.535     3.449     3.361
13             3.665     3.507     3.425     3.341     3.255     3.165
14             3.505     3.348     3.266     3.181     3.094     3.004
15             3.372     3.214     3.132     3.047     2.959     2.868
16             3.259     3.101     3.018     2.933     2.845     2.753
17             3.162     3.003     2.92      2.835     2.746     2.653
18             3.077     2.919     2.835     2.749     2.66      2.566
19             3.003     2.844     2.761     2.674     2.584     2.489
20             2.938     2.778     2.695     2.608     2.517     2.421
21             2.88      2.72      2.636     2.548     2.457     2.36
22             2.827     2.667     2.583     2.495     2.403     2.305
23             2.781     2.62      2.535     2.447     2.354     2.256
24             2.738     2.577     2.492     2.403     2.31      2.211
25             2.699     2.538     2.453     2.364     2.27      2.169
26             2.664     2.503     2.417     2.327     2.233     2.131
27             2.632     2.47      2.384     2.294     2.198     2.097
28             2.602     2.44      2.354     2.263     2.167     2.064
29             2.574     2.412     2.325     2.234     2.138     2.034
30             2.549     2.386     2.299     2.208     2.111     2.006
40             2.369     2.203     2.114     2.019     1.917     1.805
60             2.198     2.028     1.936     1.836     1.726     1.601
120            2.035     1.86      1.763     1.656     1.533     1.381
inf            1.878     1.696     1.592     1.473     1.325     1
F-table at 0.025 level of significance

df in          df in numerator
denominator    1         2         3         4         5         6
1              647.789   799.5     864.163   899.5833  921.8479  937.1111
2              38.5063   39        39.1655   39.2484   39.2982   39.3315
3              17.4434   16.0441   15.4392   15.101    14.8848   14.7347
4              12.2179   10.6491   9.9792    9.6045    9.3645    9.1973
5              10.007    8.4336    7.7636    7.3879    7.1464    6.9777
6              8.8131    7.2599    6.5988    6.2272    5.9876    5.8198
7              8.0727    6.5415    5.8898    5.5226    5.2852    5.1186
8              7.5709    6.0595    5.416     5.0526    4.8173    4.6517
9              7.2093    5.7147    5.0781    4.7181    4.4844    4.3197
10             6.9367    5.4564    4.8256    4.4683    4.2361    4.0721
11             6.7241    5.2559    4.63      4.2751    4.044     3.8807
12             6.5538    5.0959    4.4742    4.1212    3.8911    3.7283
13             6.4143    4.9653    4.3472    3.9959    3.7667    3.6043
14             6.2979    4.8567    4.2417    3.8919    3.6634    3.5014
15             6.1995    4.765     4.1528    3.8043    3.5764    3.4147
16             6.1151    4.6867    4.0768    3.7294    3.5021    3.3406
17             6.042     4.6189    4.0112    3.6648    3.4379    3.2767
18             5.9781    4.5597    3.9539    3.6083    3.382     3.2209
19             5.9216    4.5075    3.9034    3.5587    3.3327    3.1718
20             5.8715    4.4613    3.8587    3.5147    3.2891    3.1283
21             5.8266    4.4199    3.8188    3.4754    3.2501    3.0895
22             5.7863    4.3828    3.7829    3.4401    3.2151    3.0546
23             5.7498    4.3492    3.7505    3.4083    3.1835    3.0232
24             5.7166    4.3187    3.7211    3.3794    3.1548    2.9946
25             5.6864    4.2909    3.6943    3.353     3.1287    2.9685
26             5.6586    4.2655    3.6697    3.3289    3.1048    2.9447
27             5.6331    4.2421    3.6472    3.3067    3.0828    2.9228
28             5.6096    4.2205    3.6264    3.2863    3.0626    2.9027
29             5.5878    4.2006    3.6072    3.2674    3.0438    2.884
30             5.5675    4.1821    3.5894    3.2499    3.0265    2.8667
40             5.4239    4.051     3.4633    3.1261    2.9037    2.7444
60             5.2856    3.9253    3.3425    3.0077    2.7863    2.6274
120            5.1523    3.8046    3.2269    2.8943    2.674     2.5154
inf            5.0239    3.6889    3.1161    2.7858    2.5665    2.4082

df in          df in numerator
denominator    7         8         9         10        12        15
1              948.2169  956.6562  963.2846  968.6274  976.7079  984.8668
2              39.3552   39.373    39.3869   39.398    39.4146   39.4313
3              14.6244   14.5399   14.4731   14.4189   14.3366   14.2527
4              9.0741    8.9796    8.9047    8.8439    8.7512    8.6565
5              6.8531    6.7572    6.6811    6.6192    6.5245    6.4277
6              5.6955    5.5996    5.5234    5.4613    5.3662    5.2687
7              4.9949    4.8993    4.8232    4.7611    4.6658    4.5678
8              4.5286    4.4333    4.3572    4.2951    4.1997    4.1012
9              4.197     4.102     4.026     3.9639    3.8682    3.7694
10             3.9498    3.8549    3.779     3.7168    3.6209    3.5217
11             3.7586    3.6638    3.5879    3.5257    3.4296    3.3299
12             3.6065    3.5118    3.4358    3.3736    3.2773    3.1772
13             3.4827    3.388     3.312     3.2497    3.1532    3.0527
14             3.3799    3.2853    3.2093    3.1469    3.0502    2.9493
15             3.2934    3.1987    3.1227    3.0602    2.9633    2.8621
16             3.2194    3.1248    3.0488    2.9862    2.889     2.7875
17             3.1556    3.061     2.9849    2.9222    2.8249    2.723
18             3.0999    3.0053    2.9291    2.8664    2.7689    2.6667
19             3.0509    2.9563    2.8801    2.8172    2.7196    2.6171
20             3.0074    2.9128    2.8365    2.7737    2.6758    2.5731
21             2.9686    2.874     2.7977    2.7348    2.6368    2.5338
22             2.9338    2.8392    2.7628    2.6998    2.6017    2.4984
23             2.9023    2.8077    2.7313    2.6682    2.5699    2.4665
24             2.8738    2.7791    2.7027    2.6396    2.5411    2.4374
25             2.8478    2.7531    2.6766    2.6135    2.5149    2.411
26             2.824     2.7293    2.6528    2.5896    2.4908    2.3867
27             2.8021    2.7074    2.6309    2.5676    2.4688    2.3644
28             2.782     2.6872    2.6106    2.5473    2.4484    2.3438
29             2.7633    2.6686    2.5919    2.5286    2.4295    2.3248
30             2.746     2.6513    2.5746    2.5112    2.412     2.3072
40             2.6238    2.5289    2.4519    2.3882    2.2882    2.1819
60             2.5068    2.4117    2.3344    2.2702    2.1692    2.0613
120            2.3948    2.2994    2.2217    2.157     2.0548    1.945
inf            2.2875    2.1918    2.1136    2.0483    1.9447    1.8326

df in          df in numerator
denominator    20        30        40        60        120       inf
1              993.1028  1001.414  1005.598  1009.8    1014.02   1018.258
2              39.4479   39.465    39.473    39.481    39.49     39.498
3              14.1674   14.081    14.037    13.992    13.947    13.902
4              8.5599    8.461     8.411     8.36      8.309     8.257
5              6.3286    6.227     6.175     6.123     6.069     6.015
6              5.1684    5.065     5.012     4.959     4.904     4.849
7              4.4667    4.362     4.309     4.254     4.199     4.142
8              3.9995    3.894     3.84      3.784     3.728     3.67
9              3.6669    3.56      3.505     3.449     3.392     3.333
10             3.4185    3.311     3.255     3.198     3.14      3.08
11             3.2261    3.118     3.061     3.004     2.944     2.883
12             3.0728    2.963     2.906     2.848     2.787     2.725
13             2.9477    2.837     2.78      2.72      2.659     2.595
14             2.8437    2.732     2.674     2.614     2.552     2.487
15             2.7559    2.644     2.585     2.524     2.461     2.395
16             2.6808    2.568     2.509     2.447     2.383     2.316
17             2.6158    2.502     2.442     2.38      2.315     2.247
18             2.559     2.445     2.384     2.321     2.256     2.187
19             2.5089    2.394     2.333     2.27      2.203     2.133
20             2.4645    2.349     2.287     2.223     2.156     2.085
21             2.4247    2.308     2.246     2.182     2.114     2.042
22             2.389     2.272     2.21      2.145     2.076     2.003
23             2.3567    2.239     2.176     2.111     2.041     1.968
24             2.3273    2.209     2.146     2.08      2.01      1.935
25             2.3005    2.182     2.118     2.052     1.981     1.906
26             2.2759    2.157     2.093     2.026     1.954     1.878
27             2.2533    2.133     2.069     2.002     1.93      1.853
28             2.2324    2.112     2.048     1.98      1.907     1.829
29             2.2131    2.092     2.028     1.959     1.886     1.807
30             2.1952    2.074     2.009     1.94      1.866     1.787
40             2.0677    1.943     1.875     1.803     1.724     1.637
60             1.9445    1.815     1.744     1.667     1.581     1.482
120            1.8249    1.69      1.614     1.53      1.433     1.31
inf            1.7085    1.566     1.484     1.388     1.268     1
F-table at 0.05 level of significance

df in          df in numerator
denominator    1         2         3         4         5         6
1              161.4476  199.5     215.7073  224.5832  230.1619  233.986
2              18.5128   19        19.1643   19.2468   19.2964   19.3295
3              10.128    9.5521    9.2766    9.1172    9.0135    8.9406
4              7.7086    6.9443    6.5914    6.3882    6.2561    6.1631
5              6.6079    5.7861    5.4095    5.1922    5.0503    4.9503
6              5.9874    5.1433    4.7571    4.5337    4.3874    4.2839
7              5.5914    4.7374    4.3468    4.1203    3.9715    3.866
8              5.3177    4.459     4.0662    3.8379    3.6875    3.5806
9              5.1174    4.2565    3.8625    3.6331    3.4817    3.3738
10             4.9646    4.1028    3.7083    3.478     3.3258    3.2172
11             4.8443    3.9823    3.5874    3.3567    3.2039    3.0946
12             4.7472    3.8853    3.4903    3.2592    3.1059    2.9961
13             4.6672    3.8056    3.4105    3.1791    3.0254    2.9153
14             4.6001    3.7389    3.3439    3.1122    2.9582    2.8477
15             4.5431    3.6823    3.2874    3.0556    2.9013    2.7905
16             4.494     3.6337    3.2389    3.0069    2.8524    2.7413
17             4.4513    3.5915    3.1968    2.9647    2.81      2.6987
18             4.4139    3.5546    3.1599    2.9277    2.7729    2.6613
19             4.3807    3.5219    3.1274    2.8951    2.7401    2.6283
20             4.3512    3.4928    3.0984    2.8661    2.7109    2.599
21             4.3248    3.4668    3.0725    2.8401    2.6848    2.5727
22             4.3009    3.4434    3.0491    2.8167    2.6613    2.5491
23             4.2793    3.4221    3.028     2.7955    2.64      2.5277
24             4.2597    3.4028    3.0088    2.7763    2.6207    2.5082
25             4.2417    3.3852    2.9912    2.7587    2.603     2.4904
26             4.2252    3.369     2.9752    2.7426    2.5868    2.4741
27             4.21      3.3541    2.9604    2.7278    2.5719    2.4591
28             4.196     3.3404    2.9467    2.7141    2.5581    2.4453
29             4.183     3.3277    2.934     2.7014    2.5454    2.4324
30             4.1709    3.3158    2.9223    2.6896    2.5336    2.4205
40             4.0847    3.2317    2.8387    2.606     2.4495    2.3359
60             4.0012    3.1504    2.7581    2.5252    2.3683    2.2541
120            3.9201    3.0718    2.6802    2.4472    2.2899    2.175
inf            3.8415    2.9957    2.6049    2.3719    2.2141    2.0986

df in          df in numerator
denominator    7         8         9         10        12        15
1              236.7684  238.8827  240.5433  241.8817  243.906   245.9499
2              19.3532   19.371    19.3848   19.3959   19.4125   19.4291
3              8.8867    8.8452    8.8123    8.7855    8.7446    8.7029
4              6.0942    6.041     5.9988    5.9644    5.9117    5.8578
5              4.8759    4.8183    4.7725    4.7351    4.6777    4.6188
6              4.2067    4.1468    4.099     4.06      3.9999    3.9381
7              3.787     3.7257    3.6767    3.6365    3.5747    3.5107
8              3.5005    3.4381    3.3881    3.3472    3.2839    3.2184
9              3.2927    3.2296    3.1789    3.1373    3.0729    3.0061
10             3.1355    3.0717    3.0204    2.9782    2.913     2.845
11             3.0123    2.948     2.8962    2.8536    2.7876    2.7186
12             2.9134    2.8486    2.7964    2.7534    2.6866    2.6169
13             2.8321    2.7669    2.7144    2.671     2.6037    2.5331
14             2.7642    2.6987    2.6458    2.6022    2.5342    2.463
15             2.7066    2.6408    2.5876    2.5437    2.4753    2.4034
16             2.6572    2.5911    2.5377    2.4935    2.4247    2.3522
17             2.6143    2.548     2.4943    2.4499    2.3807    2.3077
18             2.5767    2.5102    2.4563    2.4117    2.3421    2.2686
19             2.5435    2.4768    2.4227    2.3779    2.308     2.2341
20             2.514     2.4471    2.3928    2.3479    2.2776    2.2033
21             2.4876    2.4205    2.366     2.321     2.2504    2.1757
22             2.4638    2.3965    2.3419    2.2967    2.2258    2.1508
23             2.4422    2.3748    2.3201    2.2747    2.2036    2.1282
24             2.4226    2.3551    2.3002    2.2547    2.1834    2.1077
25             2.4047    2.3371    2.2821    2.2365    2.1649    2.0889
26             2.3883    2.3205    2.2655    2.2197    2.1479    2.0716
27             2.3732    2.3053    2.2501    2.2043    2.1323    2.0558
28             2.3593    2.2913    2.236     2.19      2.1179    2.0411
29             2.3463    2.2783    2.2229    2.1768    2.1045    2.0275
30             2.3343    2.2662    2.2107    2.1646    2.0921    2.0148
40             2.249     2.1802    2.124     2.0772    2.0035    1.9245
60             2.1665    2.097     2.0401    1.9926    1.9174    1.8364
120            2.0868    2.0164    1.9588    1.9105    1.8337    1.7505
inf            2.0096    1.9384    1.8799    1.8307    1.7522    1.6664

df in          df in numerator
denominator    20        30        40        60        120       inf
1              248.0131  250.0951  251.1432  252.1957  253.2529  254.3144
2              19.4458   19.4624   19.4707   19.4791   19.4874   19.4957
3              8.6602    8.6166    8.5944    8.572     8.5494    8.5264
4              5.8025    5.7459    5.717     5.6877    5.6581    5.6281
5              4.5581    4.4957    4.4638    4.4314    4.3985    4.365
6              3.8742    3.8082    3.7743    3.7398    3.7047    3.6689
7              3.4445    3.3758    3.3404    3.3043    3.2674    3.2298
8              3.1503    3.0794    3.0428    3.0053    2.9669    2.9276
9              2.9365    2.8637    2.8259    2.7872    2.7475    2.7067
10             2.774     2.6996    2.6609    2.6211    2.5801    2.5379
11             2.6464    2.5705    2.5309    2.4901    2.448     2.4045
12             2.5436    2.4663    2.4259    2.3842    2.341     2.2962
13             2.4589    2.3803    2.3392    2.2966    2.2524    2.2064
14             2.3879    2.3082    2.2664    2.2229    2.1778    2.1307
15             2.3275    2.2468    2.2043    2.1601    2.1141    2.0658
16             2.2756    2.1938    2.1507    2.1058    2.0589    2.0096
17             2.2304    2.1477    2.104     2.0584    2.0107    1.9604
18             2.1906    2.1071    2.0629    2.0166    1.9681    1.9168
19             2.1555    2.0712    2.0264    1.9795    1.9302    1.878
20             2.1242    2.0391    1.9938    1.9464    1.8963    1.8432
21             2.096     2.0102    1.9645    1.9165    1.8657    1.8117
22             2.0707    1.9842    1.938     1.8894    1.838     1.7831
23             2.0476    1.9605    1.9139    1.8648    1.8128    1.757
24             2.0267    1.939     1.892     1.8424    1.7896    1.733
25             2.0075    1.9192    1.8718    1.8217    1.7684    1.711
26             1.9898    1.901     1.8533    1.8027    1.7488    1.6906
27             1.9736    1.8842    1.8361    1.7851    1.7306    1.6717
28             1.9586    1.8687    1.8203    1.7689    1.7138    1.6541
29             1.9446    1.8543    1.8055    1.7537    1.6981    1.6376
30             1.9317    1.8409    1.7918    1.7396    1.6835    1.6223
40             1.8389    1.7444    1.6928    1.6373    1.5766    1.5089
60             1.748     1.6491    1.5943    1.5343    1.4673    1.3893
120            1.6587    1.5543    1.4952    1.429     1.3519    1.2539
inf            1.5705    1.4591    1.394     1.318     1.2214    1
F-table at 0.1 level of significance

df in          df in numerator
denominator    1         2         3         4         5         6
1              39.86346  49.5      53.59324  55.83296  57.24008  58.20442
2              8.52632   9         9.16179   9.24342   9.29263   9.32553
3              5.53832   5.46238   5.39077   5.34264   5.30916   5.28473
4              4.54477   4.32456   4.19086   4.10725   4.05058   4.00975
5              4.06042   3.77972   3.61948   3.5202    3.45298   3.40451
6              3.77595   3.4633    3.28876   3.18076   3.10751   3.05455
7              3.58943   3.25744   3.07407   2.96053   2.88334   2.82739
8              3.45792   3.11312   2.9238    2.80643   2.72645   2.66833
9              3.3603    3.00645   2.81286   2.69268   2.61061   2.55086
10             3.28502   2.92447   2.72767   2.60534   2.52164   2.46058
11             3.2252    2.85951   2.66023   2.53619   2.45118   2.38907
12             3.17655   2.8068    2.60552   2.4801    2.39402   2.33102
13             3.13621   2.76317   2.56027   2.43371   2.34672   2.28298
14             3.10221   2.72647   2.52222   2.39469   2.30694   2.24256
15             3.07319   2.69517   2.48979   2.36143   2.27302   2.20808
16             3.04811   2.66817   2.46181   2.33274   2.24376   2.17833
17             3.02623   2.64464   2.43743   2.30775   2.21825   2.15239
18             3.00698   2.62395   2.41601   2.28577   2.19583   2.12958
19             2.9899    2.60561   2.39702   2.2663    2.17596   2.10936
20             2.97465   2.58925   2.38009   2.24893   2.15823   2.09132
21             2.96096   2.57457   2.36489   2.23334   2.14231   2.07512
22             2.94858   2.56131   2.35117   2.21927   2.12794   2.0605
23             2.93736   2.54929   2.33873   2.20651   2.11491   2.04723
24             2.92712   2.53833   2.32739   2.19488   2.10303   2.03513
25             2.91774   2.52831   2.31702   2.18424   2.09216   2.02406
26             2.90913   2.5191    2.30749   2.17447   2.08218   2.01389
27             2.90119   2.51061   2.29871   2.16546   2.07298   2.00452
28             2.89385   2.50276   2.2906    2.15714   2.06447   1.99585
29             2.88703   2.49548   2.28307   2.14941   2.05658   1.98781
30             2.88069   2.48872   2.27607   2.14223   2.04925   1.98033
40             2.83535   2.44037   2.22609   2.09095   1.99682   1.92688
60             2.79107   2.39325   2.17741   2.04099   1.94571   1.87472
120            2.74781   2.34734   2.12999   1.9923    1.89587   1.82381
inf            2.70554   2.30259   2.0838    1.94486   1.84727   1.77411

df in          df in numerator
denominator    7         8         9         10        12        15
1              58.90595  59.43898  59.85759  60.19498  60.70521  61.22034
2              9.34908   9.36677   9.38054   9.39157   9.40813   9.42471
3              5.26619   5.25167   5.24      5.23041   5.21562   5.20031
4              3.97897   3.95494   3.93567   3.91988   3.89553   3.87036
5              3.3679    3.33928   3.31628   3.2974    3.26824   3.23801
6              3.01446   2.98304   2.95774   2.93693   2.90472   2.87122
7              2.78493   2.75158   2.72468   2.70251   2.66811   2.63223
8              2.62413   2.58935   2.56124   2.53804   2.50196   2.46422
9              2.50531   2.46941   2.44034   2.41632   2.37888   2.33962
10             2.41397   2.37715   2.34731   2.3226    2.28405   2.24351
11             2.34157   2.304     2.2735    2.24823   2.20873   2.16709
12             2.28278   2.24457   2.21352   2.18776   2.14744   2.10485
13             2.2341    2.19535   2.16382   2.13763   2.09659   2.05316
14             2.19313   2.1539    2.12195   2.0954    2.05371   2.00953
15             2.15818   2.11853   2.08621   2.05932   2.01707   1.97222
16             2.128     2.08798   2.05533   2.02815   1.98539   1.93992
17             2.10169   2.06134   2.02839   2.00094   1.95772   1.91169
18             2.07854   2.03789   2.00467   1.97698   1.93334   1.88681
19             2.05802   2.0171    1.98364   1.95573   1.9117    1.86471
20             2.0397    1.99853   1.96485   1.93674   1.89236   1.84494
21             2.02325   1.98186   1.94797   1.91967   1.87497   1.82715
22             2.0084    1.9668    1.93273   1.90425   1.85925   1.81106
23             1.99492   1.95312   1.91888   1.89025   1.84497   1.79643
24             1.98263   1.94066   1.90625   1.87748   1.83194   1.78308
25             1.97138   1.92925   1.89469   1.86578   1.82      1.77083
26             1.96104   1.91876   1.88407   1.85503   1.80902   1.75957
27             1.95151   1.90909   1.87427   1.84511   1.79889   1.74917
28             1.9427    1.90014   1.8652    1.83593   1.78951   1.73954
29             1.93452   1.89184   1.85679   1.82741   1.78081   1.7306
30             1.92692   1.88412   1.84896   1.81949   1.7727    1.72227
40             1.87252   1.82886   1.7929    1.76269   1.71456   1.66241
60             1.81939   1.77483   1.73802   1.70701   1.65743   1.60337
120            1.76748   1.72196   1.68425   1.65238   1.6012    1.545
inf            1.71672   1.6702    1.63152   1.59872   1.54578   1.48714

df in          df in numerator
denominator    20        30        40        60        120       inf
1              61.74029  62.26497  62.52905  62.79428  63.06064  63.32812
2              9.44131   9.45793   9.46624   9.47456   9.48289   9.49122
3              5.18448   5.16811   5.15972   5.15119   5.14251   5.1337
4              3.84434   3.81742   3.80361   3.78957   3.77527   3.76073
5              3.20665   3.17408   3.15732   3.14023   3.12279   3.105
6              2.83634   2.79996   2.78117   2.76195   2.74229   2.72216
7              2.59473   2.55546   2.5351    2.51422   2.49279   2.47079
8              2.42464   2.38302   2.36136   2.3391    2.31618   2.29257
9              2.29832   2.25472   2.23196   2.20849   2.18427   2.15923
10             2.20074   2.15543   2.13169   2.10716   2.08176   2.05542
11             2.12305   2.07621   2.05161   2.02612   1.99965   1.97211
12             2.05968   2.01149   1.9861    1.95973   1.93228   1.90361
13             2.00698   1.95757   1.93147   1.90429   1.87591   1.8462
14             1.96245   1.91193   1.88516   1.85723   1.828     1.79728
15             1.92431   1.87277   1.84539   1.81676   1.78672   1.75505
16             1.89127   1.83879   1.81084   1.78156   1.75075   1.71817
17             1.86236   1.80901   1.78053   1.75063   1.71909   1.68564
18             1.83685   1.78269   1.75371   1.72322   1.69099   1.65671
19             1.81416   1.75924   1.72979   1.69876   1.66587   1.63077
20             1.79384   1.73822   1.70833   1.67678   1.64326   1.60738
21             1.77555   1.71927   1.68896   1.65691   1.62278   1.58615
22             1.75899   1.70208   1.67138   1.63885   1.60415   1.56678
23             1.74392   1.68643   1.65535   1.62237   1.58711   1.54903
24             1.73015   1.6721    1.64067   1.60726   1.57146   1.5327
25             1.71752   1.65895   1.62718   1.59335   1.55703   1.5176
26             1.70589   1.64682   1.61472   1.5805    1.54368   1.5036
27             1.69514   1.6356    1.6032    1.56859   1.53129   1.49057
28             1.68519   1.62519   1.5925    1.55753   1.51976   1.47841
29             1.67593   1.61551   1.58253   1.54721   1.50899   1.46704
30             1.66731   1.60648   1.57323   1.53757   1.49891   1.45636
40             1.60515   1.54108   1.50562   1.46716   1.42476   1.37691
60             1.54349   1.47554   1.43734   1.3952    1.34757   1.29146
120            1.48207   1.40938   1.3676    1.32034   1.26457   1.19256
inf            1.4206    1.34187   1.29513   1.23995   1.1686    1
χ2 distribution table

degrees of   probability
freedom      0.995      0.99       0.975      0.95       0.9        0.75       0.5
1            0.00004    0.00016    0.00098    0.00393    0.01579    0.10153    0.45494
2            0.01003    0.0201     0.05064    0.10259    0.21072    0.57536    1.38629
3            0.07172    0.11483    0.2158     0.35185    0.58437    1.21253    2.36597
4            0.20699    0.29711    0.48442    0.71072    1.06362    1.92256    3.35669
5            0.41174    0.5543     0.83121    1.14548    1.61031    2.6746     4.35146
6            0.67573    0.87209    1.23734    1.63538    2.20413    3.4546     5.34812
7            0.98926    1.23904    1.68987    2.16735    2.83311    4.25485    6.34581
8            1.34441    1.6465     2.17973    2.73264    3.48954    5.07064    7.34412
9            1.73493    2.0879     2.70039    3.32511    4.16816    5.89883    8.34283
10           2.15586    2.55821    3.24697    3.9403     4.86518    6.7372     9.34182
11           2.60322    3.05348    3.81575    4.57481    5.57778    7.58414    10.341
12           3.07382    3.57057    4.40379    5.22603    6.3038     8.43842    11.34032
13           3.56503    4.10692    5.00875    5.89186    7.0415     9.29907    12.33976
14           4.07467    4.66043    5.62873    6.57063    7.78953    10.16531   13.33927
15           4.60092    5.22935    6.26214    7.26094    8.54676    11.03654   14.33886
16           5.14221    5.81221    6.90766    7.96165    9.31224    11.91222   15.3385
17           5.69722    6.40776    7.56419    8.67176    10.08519   12.79193   16.33818
18           6.2648     7.01491    8.23075    9.39046    10.86494   13.67529   17.3379
19           6.84397    7.63273    8.90652    10.11701   11.65091   14.562     18.33765
20           7.43384    8.2604     9.59078    10.85081   12.44261   15.45177   19.33743
21           8.03365    8.8972     10.2829    11.59131   13.2396    16.34438   20.33723
22           8.64272    9.54249    10.98232   12.33801   14.04149   17.23962   21.33704
23           9.26042    10.19572   11.68855   13.09051   14.84796   18.1373    22.33688
24           9.88623    10.85636   12.40115   13.84843   15.65868   19.03725   23.33673
25           10.51965   11.52398   13.11972   14.61141   16.47341   19.93934   24.33659
26           11.16024   12.19815   13.8439    15.37916   17.29188   20.84343   25.33646
27           11.80759   12.8785    14.57338   16.1514    18.1139    21.7494    26.33634
28           12.46134   13.56471   15.30786   16.92788   18.93924   22.65716   27.33623
29           13.12115   14.25645   16.04707   17.70837   19.76774   23.56659   28.33613
30           13.78672   14.95346   16.79077   18.49266   20.59923   24.47761   29.33603

degrees of   probability
freedom      0.25       0.1        0.05       0.025      0.01       0.005
1            1.3233     2.70554    3.84146    5.02389    6.6349     7.87944
2            2.77259    4.60517    5.99146    7.37776    9.21034    10.59663
3            4.10834    6.25139    7.81473    9.3484     11.34487   12.83816
4            5.38527    7.77944    9.48773    11.14329   13.2767    14.86026
5            6.62568    9.23636    11.0705    12.8325    15.08627   16.7496
6            7.8408     10.64464   12.59159   14.44938   16.81189   18.54758
7            9.03715    12.01704   14.06714   16.01276   18.47531   20.27774
8            10.21885   13.36157   15.50731   17.53455   20.09024   21.95495
9            11.38875   14.68366   16.91898   19.02277   21.66599   23.58935
10           12.54886   15.98718   18.30704   20.48318   23.20925   25.18818
11           13.70069   17.27501   19.67514   21.92005   24.72497   26.75685
12           14.8454    18.54935   21.02607   23.33666   26.21697   28.29952
13           15.98391   19.81193   22.36203   24.7356    27.68825   29.81947
14           17.11693   21.06414   23.68479   26.11895   29.14124   31.31935
15           18.24509   22.30713   24.99579   27.48839   30.57791   32.80132
16           19.36886   23.54183   26.29623   28.84535   31.99993   34.26719
17           20.48868   24.76904   27.58711   30.19101   33.40866   35.71847
18           21.60489   25.98942   28.8693    31.52638   34.80531   37.15645
19           22.71781   27.20357   30.14353   32.85233   36.19087   38.58226
20           23.82769   28.41198   31.41043   34.16961   37.56623   39.99685
21           24.93478   29.61509   32.67057   35.47888   38.93217   41.40106
22           26.03927   30.81328   33.92444   36.78071   40.28936   42.79565
23           27.14134   32.0069    35.17246   38.07563   41.6384    44.18128
24           28.24115   33.19624   36.41503   39.36408   42.97982   45.55851
25           29.33885   34.38159   37.65248   40.64647   44.3141    46.92789
26           30.43457   35.56317   38.88514   41.92317   45.64168   48.28988
27           31.52841   36.74122   40.11327   43.19451   46.96294   49.64492
28           32.62049   37.91592   41.33714   44.46079   48.27824   50.99338
29           33.71091   39.08747   42.55697   45.72229   49.58788   52.33562
30           34.79974   40.25602   43.77297   46.97924   50.89218   53.67196
Bibliography
1) Barro, R.J., D.B. Gordon (1983), A Positive Theory of Monetary Policy in
a Natural Rate Model, “The Journal of Political Economy,” Vol. 91, No. 4,
pp. 589–610, accessed via jstor.org, date of publication: 8.1983, date of
accession: 3.2010, http://www.jstor.org/pss/1831069.
2) Caporale, G.M., L.A. Gil-Alana (2008), Modeling the U.S., U.K. and
Japanese unemployment rates: Fractional integration and structural breaks,
“Computational Statistics & Data Analysis,” Vol. 52, No. 11, pp. 4998–5013,
date of publication: 7.2008, date of accession: 4.2010,
http://www.sciencedirect.com/science/article/B6V8V-4SC78KW1/1/4db0cb865a44de068cb172a6c0f8ece3.
3) Chiang, A.C. (1984), Fundamental Methods of Mathematical Economics,
McGraw-Hill, 1984.
4) Dunaev, B.B. (2005), Measuring Unemployment and Inflation as Wages
Functions, “Cybernetics and System Analysis,” Vol. 41, No. 3, pp. 403–414,
date of publication: 5.2005, date of accession: 4.2010,
http://www.springerlink.com/content/f075283t34082664/.
5) Greene, W.H. (2003), Econometric Analysis, Prentice Hall, 2003.
6) Gujarati, D.N. (2006), Essentials of Econometrics, McGraw-Hill/Irwin, New
York 2006.
7) Hanke, J.E., D.W. Wichern (2005), Business Forecasting, Pearson Education,
2005.
8) Intriligator, M.D. (1978), Econometric Models, Techniques & Applications,
Prentice-Hall, 1978.
9) Montgomery, A.L., V. Zarnowitz, R.S. Tsay, G.C. Tiao (1998), Forecasting the
U.S. Unemployment Rate, “Journal of the American Statistical Association,”
Vol. 93, No. 442, pp. 478–493, accessed via jstor.org, date of publication:
6.1998, date of accession: 3.2010, http://www/jstor.ord/pss/2670094.
10) Pindyck, R.S., D.L. Rubinfeld (1998), Econometric Models and Econometric
Forecasts, Irwin/McGraw-Hill International Editions, Singapore 1998.
11) Proietti, T. (2003), Forecasting the U.S. unemployment rate, “Computational
Statistics & Data Analysis,” Vol. 42, No. 3, pp. 451–476, date of publication:
3.2003, date of accession: 3.2010, http://portal.acm.org/citation.cmf?id=770742.
12) Rothman, Ph. (1998), Forecasting Asymmetric Unemployment Rates, MIT
Press, “The Review of Economics and Statistics,” Vol. 80, Issue 1, pp. 164–168,
date of publication: 2.1998, date of accession: 3.2010,
http://www.mitpressjournals.org/doi/abs/10.1162/003465398557276.
13) Salop, S. (1979), A Model of the Natural Rate of Unemployment, “The
American Economic Review,” Vol. 69, No. 1, pp. 117–125, accessed via jstor.org,
date of publication: 3.1979, date of accession: 4.2010,
http://www/jstor.ord/pss/1802502.
14) Shimer, R. (1998), Why Is the U.S. Unemployment Rate so Much Lower?,
“NBER Macroeconomics Annual,” Vol. 13, pp. 11–61, accessed via jstor.org,
date of publication: 1998, date of accession: 4.2010,
http://www.jstor.org/pss/4623732.
15) Stock, J.H., M.W. Watson (2008), Introduction to Econometrics, Pearson
Education, 2008.
16) Studenmund, A.H. (2006), Using Econometrics. A practical guide, Pearson
Education, 2006.
17) Theil, H. (1971), Principles of Econometrics, A Wiley/Hamilton Publication,
1971.
18) Wooldridge, J.M. (2010), Econometric Analysis of Cross Section and Panel
Data, Massachusetts Institute of Technology, MIT Press, Cambridge 2010.
List of Figures
Equation 1. Basic structural equation, i.e., the skeleton . . . . . . . . . . . . . . . .10
Equation 2. Simple, linear form structural equation for working with
a cross-section data set, with i representing a specific observation . . . . .15
Equation 3. Simple, semi-log form structural equation for working
with a cross-section data set, with i representing a specific
observation – log-linear form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15
Equation 4. Simple, semi-log form structural equation for working
with a cross-section data set, with i representing a specific
observation – linear-log form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15
Equation 5. Simple, full-log form structural equation for working with
a cross-section data set, with i representing a specific observation . . . . .16
Equation 6. Simple, linear form structural equation for working with
a time-series data set, with t representing a specific year . . . . . . . . . . . .16
Equation 7. Simple, linear form structural equation for working
with a panel data set, with i representing cross-section elements,
i.e., host countries, and t representing time-series elements, i.e.,
a specific year. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17
Equation 8. Dummy variable creation: Sale price example, original
equation (no dummy variable) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .28
Equation 9. Dummy variable creation: Sale price example, original
equation (with a dummy variable) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .29
Equation 10. Dummy variable creation: Sale price example, original
equation (with two dummy variables) . . . . . . . . . . . . . . . . . . . . . . . . . . .29
Equation 11. Simple averaging method . . . . . . . . . . . . . . . . . . . . . . . . . . . .31
Equation 12. Model estimation with forward stepwise method
example – initial structural, restricted equation . . . . . . . . . . . . . . . . . . . .34
Equation 13. Model estimation with forward stepwise method
example – initial structural, restricted model . . . . . . . . . . . . . . . . . . . . . .35
Equation 14. Model estimation with forward stepwise method
example – auxiliary structural equation . . . . . . . . . . . . . . . . . . . . . . . . . .35
Equation 15. Model estimation with forward stepwise method
example – auxiliary structural model . . . . . . . . . . . . . . . . . . . . . . . . . . . .35
Equation 16. Lagrange Multiplier formula . . . . . . . . . . . . . . . . . . . . . . . . . .36
Equation 17. Model estimation with forward stepwise method
example – initial structural, unrestricted model . . . . . . . . . . . . . . . . . . . .37
Equation 18. Structural equation with a AR(p) term . . . . . . . . . . . . . . . . . . .42
Equation 19. Structural equation with AR(p) terms 1 through 3 . . . . . . . . . .42
Equation 20. Structural equation with lagged dependent variable as
an additional explanatory variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .43
Equation 21. Adjustment of the nth coefficient with r lagged
dependent variables used as independent factors . . . . . . . . . . . . . . . . . .43
Equation 22. Adjustment of the 1st coefficient with one lagged
dependent variables used as independent factors . . . . . . . . . . . . . . . . . .43
Equation 23. Linear structural equation of the model used in Model’s
Results Interpretation chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .47
Equation 24. Estimated version of the linear structural equation of
the model used in Model’s Results Interpretation chapter . . . . . . . . . . . .48
Equation 25. Example of the t-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .51
Equation 26. Joint significance test – structural model, restricted. . . . . . . . .51
Equation 27. Joint significance test – structural model, unrestricted. . . . . . .52
Equation 28. F-test formula with Error Sum Squares. . . . . . . . . . . . . . . . . . .52
Equation 29. R2 of the unrestricted model as a function of its Error
Sum of Squares and Total Sum of Squares . . . . . . . . . . . . . . . . . . . . . . . .52
Equation 30. R2 of the unrestricted model as a function of its Error
Sum of Squares and Total Sum of Squares . . . . . . . . . . . . . . . . . . . . . . . .53
Equation 31. F-test formula with R-squared . . . . . . . . . . . . . . . . . . . . . . . . .53
Equation 32. R-squared formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .54
Equation 33. Adjusted R-squared formula. . . . . . . . . . . . . . . . . . . . . . . . . . .54
Equation 34. Total Sum or Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .55
Equation 35. Error (Residual) sum of squares . . . . . . . . . . . . . . . . . . . . . . . .55
Equation 36. Total sum of squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .55
Equation 37. Regression sum of squares . . . . . . . . . . . . . . . . . . . . . . . . . . . .55
Equation 38. Structural equation for U.S. imports
as the dependent variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .67
Equation 39. Restricted structural equation . . . . . . . . . . . . . . . . . . . . . . . . .73
Equation 40. Restricted structural model . . . . . . . . . . . . . . . . . . . . . . . . . . .73
Equation 41. Structural auxiliary equation . . . . . . . . . . . . . . . . . . . . . . . . . .75
Equation 42. Unrestricted structural model. . . . . . . . . . . . . . . . . . . . . . . . . .76
Figure 1. Division of the original data set into Estimation Period, Ex
post and Ex ante sections. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .58
Graph 1. U.S. gross domestic product (left-hand axis in billion, USD) . . . . . .20
Graph 2. U.S. gross domestic product (left-hand axis in billion, USD)
with a linear trendline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .21
Graph 3. A graphical representation of U.S. GDP after it has been
transformed into a stationary variable via first-order differencing; D(GDP) . . . 23
Graph 4. Graph of residuals of a model with U.S. imports (IM) as the
dependent variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .40
Graph 5. Graph of actual values (Actual), the fitted model (Fitted);
both on left-hand side axis, and resulting residuals (Residuals);
right-hand axis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .56
Graph 6. A plot of the original U.S. imports data (IM) versus the
forecast (IMF) and the upper (UP) and lower (DOWN) limits . . . . . . . . . .61
Graph 7. Actual, fitted data and residuals of the final model . . . . . . . . . . . .80
Graph 8. Ex post forecast of the final model . . . . . . . . . . . . . . . . . . . . . . . . .81
Table 1. An example of panel data with averages per firm and per
year listed in the last row and the last column respectively . . . . . . . . . . .14
Table 2. Variables Info Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20
Table 3. An example of a correlogram of data with a unit root present . . . .22
Table 4. Output of the Augmented Dickey-Fuller test. . . . . . . . . . . . . . . . . . .22
Table 5. A correlogram of U.S. GDP after it has been transformed into
a stationary variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .23
Table 6. The Augmented Dickey-Fuller test output testing the 1st
difference of U.S. GDP for stationarity (only relevant
information included) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .23
Table 7. A correlation matrix for the number of U.S. FDI firms and the
GDP in two regions in Poland. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .24
Table 8. Descriptive statistics of U.S. imports, U.S. exports and
a dummy variable for recession . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .26
Table 9. A summary of information for the U.S. GDP variable . . . . . . . . . . . .27
Table 10. Dummy variable creation: European Union
membership example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .27
Table 11. Dummy variable creation: Sale price example, original data set . . .28
Table 12. Dummy variable creation: Sale price example,
transformed data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .28
Table 13. Dummy variable creation: Sale price example, transformed,
version 2, data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .29
Table 14. Supplementing the missing data example, original data set . . . . .31
Table 15. A section of the Chi-square table with error levels in the first
row and degrees of freedom in the first column . . . . . . . . . . . . . . . . . . .36
Table 16. An example of a correlogram output for the U.S.
imports model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .41
Table 17. An example of the Breusch-Godfrey Serial Correlation LM
test output for the U.S. imports model . . . . . . . . . . . . . . . . . . . . . . . . . .42
Table 18. An example of the White heteroscedasticity LM test for the
U.S. imports model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .44
Table 19. Coefficient estimation output from the software after
estimating the U.S. imports model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .48
Table 20. Summary of the coefficient testing procedure for one-tail tests. . .50
Table 21. Summary of the coefficient testing procedure for two-tail tests . .51
Table 22. Model’s statistics output from the software after estimating
the U.S. imports by regressing them on the constant term (C),
disposable income (YD), U.S. population (POP), wealth (W), U.S.
GDP and U.S. exports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .53
Table 23. Correlogram of the U.S. imports variable after it has been
differenced . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59
Table 24. ARIMA (1, 1, 2) model output for the U.S. imports variable.
Note that the dependent variable is not IM but d(IM) – the first
difference of the original dependent variable . . . . . . . . . . . . . . . . . . . . .60
Table 25. Potential independent variables . . . . . . . . . . . . . . . . . . . . . . . . . . .68
Table 26. Hypothesis statements for all independent variables . . . . . . . . . . .70
Table 27. Correlation Matrix for all variables . . . . . . . . . . . . . . . . . . . . . . . . .71
Table 28. Hypothesis statements for the Unit Root tests . . . . . . . . . . . . . . . .72
Table 29. Results of the Augmented Dickey-Fuller test for the presence
of a Unit Root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .72
Table 30. Values of the restricted model’s parameters. . . . . . . . . . . . . . . . . .74
Table 31. Values of the restricted model’s statistics. . . . . . . . . . . . . . . . . . . .74
Table 32. Results of the Breusch-Godfrey Serial Correlation Lagrange
Multiplier test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .74
Table 33. Results of the auxiliary regression (1) . . . . . . . . . . . . . . . . . . . . . . .75
Table 34. Stationarity test for CHG.INV and RECES variables . . . . . . . . . . . . .76
Table 35. Values of the unrestricted model’s parameters . . . . . . . . . . . . . . .77
Table 36. Values of the unrestricted model’s statistics. . . . . . . . . . . . . . . . . .77
Table 37. Breusch-Godfrey Serial Correlation Lagrange Multiplier test
for the final model (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .78
Table 38. Breusch-Godfrey Serial Correlation Lagrange Multiplier test
for the final model (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .78
Table 39. Breusch-Godfrey Serial Correlation Lagrange Multiplier test
for the final model (3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .78
Table 40. Values of the corrected unrestricted model’s parameters. . . . . . . .78
Table 41. Values of the corrected unrestricted model’s statistics . . . . . . . . . .79
Table 42. White heteroscedasticity test for the final model . . . . . . . . . . . . . .79
Table 43. Descriptive Statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .82