CERAM, February-March-April 2008
Class 4: Ordinary Least Squares
Lionel Nesta, Observatoire Français des Conjonctures Economiques
[email protected]

Introduction to Regression

Ideally, the social scientist is interested not only in knowing the intensity of a relationship, but also in quantifying the magnitude of the variation of one variable associated with a one-unit variation of another variable.

Regression analysis is a technique that examines the relation of a dependent variable to independent or explanatory variables.

Simple regression: y = f(X)
Multiple regression: y = f(X, Z)

Let us start with simple regressions.

[Figure: Scatter Plot of Fertilizer and Production, annotated with the prediction $\hat{y}_i$ and the error $y_i - \hat{y}_i$]

Objective of Regression

It is time to ask: "What is a good fit?"
"A good fit is what makes the error small."
"The best fit is what makes the error smallest."

Three candidates
1. To minimize the sum of all errors
2. To minimize the sum of absolute values of errors
3.
To minimize the sum of squared errors

To minimize the sum of all errors

$$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)$$

Problem of sign: positive and negative errors offset each other.

[Figure: two panels, Y against X, with errors marked + and -]

To minimize the sum of absolute values of errors

$$\min \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

Problem of middle point.

[Figure: two panels, Y against X, with errors labelled +2, -1, +3, -1]

To minimize the sum of squared errors

$$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min \sum_{i=1}^{n} \varepsilon_i^2$$

Solves both problems:
Overcomes the sign problem
Goes through the middle point
Squaring emphasizes large errors
Easily manageable
Has a unique minimum
Has a unique (and best) solution

[Figure: ε² plotted against ε]

[Figure: Scatter Plot of Fertilizer and Production]
[Figure: Scatter Plot of R&D and Patents (log)]

The Simple Regression Model

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

$$E(y_i) = \alpha + \beta x_i$$

y_i: Dependent variable (to be explained)
x_i: Independent variable (explanatory)
α: First parameter of interest
β: Second parameter of interest
ε_i: Error term

$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$

$\hat{\alpha}$ and $\hat{\beta}$ are estimates of the true, but unknown, α and β.
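The three candidate criteria can be compared numerically. The sketch below is only an illustration: the toy data and the helper name `criteria` are assumptions, not part of the course material, and the least-squares line is computed with the standard closed-form formulas. It suggests why the plain sum of errors is a poor criterion: a flat line that ignores x altogether also drives it to zero, whereas the sum of squared errors separates the two candidates clearly.

```python
import numpy as np

def criteria(x, y, alpha, beta):
    """Return the three candidate criteria for the line y = alpha + beta * x."""
    errors = y - (alpha + beta * x)
    return {
        "sum": errors.sum(),              # criterion 1: sum of all errors
        "sum_abs": np.abs(errors).sum(),  # criterion 2: sum of absolute errors
        "sum_sq": (errors ** 2).sum(),    # criterion 3: sum of squared errors
    }

# Hypothetical toy data (not taken from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.0])

# Candidate A: the least-squares line (closed-form slope and intercept).
x_dev = x - x.mean()
beta_ls = (x_dev * (y - y.mean())).sum() / (x_dev ** 2).sum()
alpha_ls = y.mean() - beta_ls * x.mean()
fit_ls = criteria(x, y, alpha_ls, beta_ls)

# Candidate B: a flat line at the mean of y, which ignores x entirely.
fit_flat = criteria(x, y, y.mean(), 0.0)

for name, fit in [("least squares", fit_ls), ("flat line", fit_flat)]:
    print(f"{name:>13}: sum={fit['sum']:.3f}  "
          f"abs={fit['sum_abs']:.3f}  sq={fit['sum_sq']:.3f}")
```

Both lines have a sum of errors that is zero up to rounding, so criterion 1 cannot tell them apart; criteria 2 and 3 both prefer the least-squares line on this toy sample.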
To minimize the sum of squared errors

$$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2$$

$$\frac{\partial \sum_{i=1}^{n} \varepsilon_i^2}{\partial \hat{\alpha}} = 0 \qquad \frac{\partial \sum_{i=1}^{n} \varepsilon_i^2}{\partial \hat{\beta}} = 0$$

Solving these two conditions gives:

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$$

Application to CERAM_BIO Data using Excel

(The first seven observations are shown; means and sums refer to the whole sample.)

        lnpat_assets  lnrd_assets  Deviation to the mean   Numerator   Denominator
            (y)           (x)         (y)       (x)         Beta_hat     Beta_hat
          -12.77        -2.28       -0.61      0.01          -0.01         0.00
          -12.51        -2.24       -0.35      0.05          -0.02         0.00
          -12.74        -2.20       -0.58      0.09          -0.05         0.01
          -12.52        -2.31       -0.36     -0.02           0.01         0.00
          -12.12        -2.25        0.04      0.04           0.00         0.00
          -12.53        -2.26       -0.37      0.03          -0.01         0.00
          -12.09        -2.25        0.07      0.04           0.00         0.00
Mean      -12.16        -2.29
Sum                                                         448.75       256.55

Alpha_hat = -8.148
Beta_hat = 1.749

$$\ln\left(\frac{\text{Patent}}{\text{Assets}}\right) = -8.148 + 1.748 \, \ln\left(\frac{\text{R\&D}}{\text{Assets}}\right) + \varepsilon_i$$

Interpretation

When the log of R&D (per asset) increases by one unit, the log of patents per asset increases by 1.748.

Remember! A change in the log of x is a relative change of x itself.

A 1% increase in R&D (per asset) entails a 1.748% increase in the number of patents (per asset).

Application to Data using SPSS

Analyze > Regression > Linear

Coefficients (a)

                 Unstandardized coefficients   Standardized coefficients
Model 1             B          Std. Error               Beta                  t        Sig.
(Constant)       -8.151          .244                                      -33.392     .000
lnrd_assets       1.748          .101                   .642                17.323     .000

a. Dependent variable: lnpat_assets

Assessing the Goodness of Fit

It is important to ask whether a specification provides a good prediction of the dependent variable, given values of the independent variable.
Ideally, we want an indicator of the proportion of variance of the dependent variable that is accounted for (or explained) by the statistical model. To build it, we compare the variance of the predictions (ŷ) with the variance of the residuals (ε): by construction, the two sum to the overall variance of the dependent variable (y).

[Figures: Overall Variance; Decomposing the overall variance (1); Decomposing the overall variance (2)]

Coefficient of determination R²

R² is a statistic which provides information on the goodness of fit of the model.

$$SS_{tot} = \sum (y_i - \bar{y})^2 \qquad SS_{fit} = \sum (\hat{y}_i - \bar{y})^2 \qquad SS_{res} = \sum (y_i - \hat{y}_i)^2$$

$$SS_{tot} = SS_{fit} + SS_{res}$$

$$R^2 = \frac{SS_{fit}}{SS_{tot}} \qquad 0 \le R^2 \le 1$$

Fisher's F Statistics

Fisher's statistic is relevant as a form of ANOVA on SS_fit, which tells us whether the regression model brings significant (in a statistical sense) information.

Model (1)    SS (2)                              df (3)       MSS = (2)/(3)
Fitted       $\sum (\hat{y}_i - \bar{y})^2$      p            MSS_fit
Residual     $\sum (y_i - \hat{y}_i)^2$          N - p - 1    MSS_res
Total        $\sum (y_i - \bar{y})^2$            N - 1

$$F = \frac{MSS_{fit}}{MSS_{res}}$$

p: number of parameters (explanatory variables, excluding the constant)
N: number of observations

Application to Data using SPSS

Analyze > Regression > Linear

Model summary

Model    R          R-squared    Adjusted R-squared    Std. error of the estimate
1        .642 (a)   .412         .410                  1.61647

a. Predictors: (Constant), lnrd_assets

ANOVA (b)

Model 1       Sum of squares    df     Mean square    F          Sig.
Regression        784.132         1      784.132      300.090    .000 (a)
Residual         1120.970       429        2.613
Total            1905.102       430

a. Predictors: (Constant), lnrd_assets
b. Dependent variable: lnpat_assets

What the R² is not

R² does not tell us whether:
Independent variables are a true cause of the changes in the dependent variable
The correct regression was used
The most appropriate set of independent variables has been chosen
There is collinearity present in the data
The model could be improved by using transformed versions of the existing set of independent variables

Inference on β

We have estimated

$$E(y_i) = \hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$

If β = 0, then E(y_i) = α.
If β ≠ 0, then E(y_i) = α + β x_i.

Therefore we must test whether the estimated parameter is significantly different from 0. Consequently, we must say something about the distribution (the mean and variance) of the estimator of the true but unobserved β*.

The mean and variance of β

It is possible to show that $\hat{\beta}$ is a good approximation, i.e. an unbiased estimator, of the true parameter β*.

$$E(\hat{\beta}) = \beta^*$$

The variance of $\hat{\beta}$ is defined as the ratio of the mean square of errors over the sum of squares of the explanatory variable:

$$VAR(\hat{\beta}) = \frac{\sigma_\varepsilon^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad \text{where} \qquad \sigma_\varepsilon^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 1 - 1}$$

The confidence interval of β

We must now define the confidence interval of β, at 95%. To do so, we use the mean and variance of $\hat{\beta}$ and define the t value as follows:

$$t = \frac{\hat{\beta} - \beta^*}{s_{\hat{\beta}}} \qquad \text{with} \qquad s_{\hat{\beta}} = \sqrt{VAR(\hat{\beta})}$$

Therefore, the 95% confidence interval of β is:

$$\beta^* \in \hat{\beta} \pm t_{.025} \sqrt{\frac{\sigma_\varepsilon^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

If the 95% CI does not include 0, then β is significantly different from 0.

Student t Test for β

We are also in a position to make inferences about β:

H0: β* = 0
H1: β* ≠ 0

$$t = \frac{\hat{\beta} - \beta^*}{s_{\hat{\beta}}} = \frac{\hat{\beta}}{s_{\hat{\beta}}} \quad \text{under H0}$$

Rule of decision
Accept H0 if |t| < t_{α/2}
Reject H0 if |t| ≥ t_{α/2}

Application to Data using SPSS

Analyze > Regression > Linear

Coefficients (a)

                 Unstandardized coefficients   Standardized coefficients
Model 1             B          Std. Error               Beta                  t        Sig.
(Constant)       -8.151          .244                                      -33.392     .000
lnrd_assets       1.748          .101                   .642                17.323     .000

a. Dependent variable: lnpat_assets

Assignments on CERAM_BIO

Regress the number of patents on R&D expenses and consider:
1. The quality of the fit
2. The significance and direction of R&D expenses
3. The interpretation of the result in an economic sense

Repeat steps 1 to 3 using:
R&D expenses divided by one million (you need to generate a new variable for that)
The log of R&D expenses

What do you observe? Why?
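The estimators and statistics from this class (β̂, α̂, the sums of squares, R², Fisher's F and the t value for β̂) can be put together in a few lines of NumPy. This is a minimal sketch under stated assumptions: the data are synthetic (not the CERAM_BIO sample), and the function name `ols_simple` and its returned keys are hypothetical choices made for illustration. The last lines rescale x by one million, as in the assignment, so the effect on each statistic can be inspected.

```python
import numpy as np

def ols_simple(x, y):
    """Simple regression of y on x with the closed-form OLS formulas."""
    n, p = len(x), 1                      # p = one explanatory variable
    x_dev, y_dev = x - x.mean(), y - y.mean()

    beta = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
    alpha = y.mean() - beta * x.mean()
    y_hat = alpha + beta * x

    ss_tot = (y_dev ** 2).sum()               # overall variation of y
    ss_fit = ((y_hat - y.mean()) ** 2).sum()  # explained by the model
    ss_res = ((y - y_hat) ** 2).sum()         # left in the residuals

    mss_res = ss_res / (n - p - 1)            # mean square of errors
    se_beta = np.sqrt(mss_res / (x_dev ** 2).sum())

    return {
        "alpha": alpha, "beta": beta,
        "ss_tot": ss_tot, "ss_fit": ss_fit, "ss_res": ss_res,
        "r2": ss_fit / ss_tot,
        "f": (ss_fit / p) / mss_res,
        "se_beta": se_beta,
        "t_beta": beta / se_beta,             # t value under H0: beta* = 0
    }

# Synthetic data for illustration only (assumed parameters, arbitrary seed).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = -8.0 + 1.7 * x + rng.normal(scale=1.5, size=200)

res = ols_simple(x, y)
res_scaled = ols_simple(x / 1e6, y)  # same regressor, expressed in millions

for key in ("beta", "r2", "f", "t_beta"):
    print(f"{key:>7}: original={res[key]:.4g}  rescaled={res_scaled[key]:.4g}")
```

In a simple regression the F statistic equals the square of the t value of β̂, which is also visible in the SPSS output above (17.323² ≈ 300.09).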