Simple linear regression and correlation analysis

Outline:
1. Regression
2. Correlation
3. Significance testing

1. Simple linear regression analysis

Simple regression describes the relationship between two variables, generally Y = f(X):
Y = dependent variable (regressand)
X = independent variable (regressor)

Simple linear regression:
$y_i = f(x_i) + e_i$
f(x) – regression equation
e_i – random error (residual deviation), an independent N(0, σ²) random quantity

Simple linear regression – straight line:
$y_i^{est} = b_0 + b_1 x_i$
b_0 = constant, b_1 = coefficient of regression

Parameter estimates follow from the least squares condition
$\sum_{i=1}^{n} (y_i - y_i^{est})^2 \rightarrow \min$
i.e. the differences of the actual Y from the estimated Y^est are minimal:
$d_i = y_i - y_i^{est} = y_i - b_0 - b_1 x_i$
hence
$f(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \rightarrow \min$
where n is the number of observations (y_i, x_i).

Adjustment: the sum of squared deviations S is partially differentiated with respect to the parameters b_0 and b_1, and the derivatives are set equal to zero:
$\frac{\partial f}{\partial b_0} = -2 \sum (y_i - b_0 - b_1 x_i) = 0$
$\frac{\partial f}{\partial b_1} = -2 \sum (y_i - b_0 - b_1 x_i)\, x_i = 0$

There are two approaches to the parameter estimates using the least squares condition (shown for the straight-line equation):

1. Normal equation system for the straight line:
$b_0 n + b_1 \sum x_i = \sum y_i$
$b_0 \sum x_i + b_1 \sum x_i^2 = \sum x_i y_i$

2. Matrix approach:
$y = Xb + \varepsilon$, with solution $b = (X^T X)^{-1} X^T y$
y = vector of the dependent variable
X = matrix of the independent variables
b = vector of regression coefficients (straight line → b_0 and b_1)
ε = vector of random errors

Observed values y_i, smoothed values y_i^{est}, residual deviations $d_i = y_i - y_i^{est}$:
residual sum of squares $S_r = \sum_{i=1}^{n} (y_i - y_i^{est})^2 = \sum_{i=1}^{n} d_i^2$
residual variance $s_r^2 = \frac{S_r}{n-k} = \frac{\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2}{n-2}$ (k = 2 parameters for the straight line)

Simple linear regression – dependence of Y on X
Straight-line equation: $y_i^{est} = b_{0yx} + b_{1yx} x_i$
Normal equation system:
$n\, b_{0yx} + b_{1yx} \sum x_i = \sum y_i$
$b_{0yx} \sum x_i + b_{1yx} \sum x_i^2 = \sum x_i y_i$
Parameter estimates – computational formulas:
$b_{1yx} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}$, $b_{0yx} = \bar{y} - b_{1yx}\, \bar{x}$

Simple linear regression – dependence of X on Y
Associated straight-line equation: $x_i^{est} = b_{0xy} + b_{1xy} y_i$
Parameter estimates – computational formulas:
$b_{1xy} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum y_i^2 - (\sum y_i)^2}$, $b_{0xy} = \bar{x} - b_{1xy}\, \bar{y}$

2. Correlation analysis

Correlation analysis measures the strength of the dependence through the coefficient of correlation r:
|r| lies in <0; 1>
|r| in <0; 0.33> → weak dependence
|r| in <0.34; 0.66> → medium strong dependence
|r| in <0.67; 1> → strong to very strong dependence

$r_{yx} = r_{xy} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - (\sum x_i)^2} \cdot \sqrt{n \sum y_i^2 - (\sum y_i)^2}}$
$r_{yx} = \sqrt{b_{1yx} \cdot b_{1xy}}$
$b_{1yx} = r_{yx} \cdot \frac{s_y}{s_x}$, $b_{1xy} = r_{xy} \cdot \frac{s_x}{s_y}$
r² = coefficient of determination, the proportion (%) of the variance of Y that is explained by the effect of X.

3. Significance testing in simple regression

Significance test of the parameter b_1 (straight line):
H0: β1 = 0, H1: β1 ≠ 0
test criterion (two-sided): $t = \frac{b_1}{s_b}$
estimate s_b for the parameter b_1: $s_b = \frac{s_y \sqrt{1-r^2}}{s_x \sqrt{n-2}}$
table value: $t_{\alpha}(n-k)$ (two-sided)
If the test criterion > table value, H0 is rejected and H1 is valid; equivalently, if α > p-value, H0 is rejected.

Interval estimate of the regression coefficient, for the unknown β_i:
$P(b_i - t_{\alpha}(n-k)\, s_b \le \beta_i \le b_i + t_{\alpha}(n-k)\, s_b) = 1 - \alpha$
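Before moving on to the correlation test, the computational formulas above are easy to check in code. The following is a minimal Python sketch (assuming numpy and scipy are available; the function name fit_straight_line and its return layout are ours, not part of the notes). It fits the straight line by the computational formulas, computes r, and runs the two-sided t-test of b_1:

```python
import numpy as np
from scipy import stats

def fit_straight_line(x, y, alpha=0.05):
    """Least-squares fit of y_est = b0 + b1*x by the computational
    formulas, plus the two-sided t-test of b1 (H0: beta1 = 0)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)

    # b1yx and b0yx from the computational formulas
    b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (
        n * np.sum(x ** 2) - np.sum(x) ** 2)
    b0 = y.mean() - b1 * x.mean()

    # coefficient of correlation r (r**2 = coeff. of determination)
    r = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (
        np.sqrt(n * np.sum(x ** 2) - np.sum(x) ** 2)
        * np.sqrt(n * np.sum(y ** 2) - np.sum(y) ** 2))

    # estimate s_b for parameter b1 (sample standard deviations)
    sx, sy = x.std(ddof=1), y.std(ddof=1)
    s_b = sy * np.sqrt(1 - r ** 2) / (sx * np.sqrt(n - 2))

    t_stat = b1 / s_b                           # test criterion
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)  # two-sided table value
    reject_h0 = abs(t_stat) > t_crit            # b1 significant if True

    # interval estimate for the unknown beta1
    ci = (b1 - t_crit * s_b, b1 + t_crit * s_b)
    return b0, b1, r, t_stat, t_crit, reject_h0, ci
```

The matrix approach gives the same estimates: with X = np.column_stack([np.ones(n), x]), solving the normal equations X.T @ X @ b = X.T @ y (e.g. via np.linalg.solve) reproduces b_0 and b_1.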
Significance test of the coefficient of correlation r (straight line):
H0: ρ = 0, H1: ρ ≠ 0
test criterion (two-sided): $t = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}}$
table value: $t_{\alpha}(n-k)$ (two-sided)
If the test criterion > table value, H0 is rejected and H1 is valid; equivalently, if α > p-value, H0 is rejected.

Interval estimation of the coefficient of correlation
For small samples and non-normal distributions, use the Fisher Z-transformation:
first, r is assigned to Z (from tables)
interval estimate for the unknown ρ:
$Z_1 = Z - u_{1-\alpha/2} \cdot \frac{1}{\sqrt{n-3}}, \quad Z_2 = Z + u_{1-\alpha/2} \cdot \frac{1}{\sqrt{n-3}}$
in the last step, Z_1 and Z_2 are assigned back to r_1 and r_2 (see the first sketch at the end of these notes).

The summary ANOVA

Variation                      | Sum of squared deviations            | df    | Variance            | Test criterion
along the regression function  | $S_1 = \sum (y_i^{est} - \bar{y})^2$ | k − 1 | $s_1^2 = S_1/(k-1)$ | $F = s_1^2 / s_r^2$
across the regression function | $S_r = \sum (y_i - y_i^{est})^2$     | n − k | $s_r^2 = S_r/(n-k)$ |

The summary ANOVA (alternatively):
test criterion $F = \frac{R^2}{1-R^2} \cdot \frac{n-k}{k-1}$
table value $F_{\alpha}(k-1,\, n-k)$

Multicollinearity
- a relationship between (among) the independent variables
- when there is an almost perfect linear relationship among the independent variables (X_1, X_2, …, X_N), multicollinearity is high
- the relationships need to be analyzed before the model is formed
- the linear independence of the columns (variables) is disturbed

Causes of multicollinearity
- trends in time series, similar tendencies among the variables (regression)
- inclusion of exogenous variables, lags
- using 0/1 coding in the sample

Consequences of multicollinearity
- wrong sampling
- the null hypothesis of a zero regression coefficient is not rejected when it really should be rejected
- confidence intervals are wide
- the regression coefficient estimates are strongly influenced by changes in the data
- regression coefficients can have the wrong sign
- the regression equation is not suitable for prediction

Testing of multicollinearity
- paired coefficients of correlation, t-test
- Farrar–Glauber test:
test criterion $B = -\left(n - 1 - \frac{2p+5}{6}\right) \ln |R|$
table value $\chi^2_{1-\alpha}$ with $p(p-1)/2$ degrees of freedom
If the test criterion > table value, H0 is rejected.

Elimination of multicollinearity
- exclude variables
- take a new sample
- re-formulate and rethink the model (choice of variables)
- transformation – recompute chosen variables (e.g. not total consumption, but consumption per capita)

Regression diagnostics
- quality of the data for the chosen model
- suitability of the model for the chosen dataset
- method conditions

Data quality evaluation (see the second sketch at the end of these notes)
A) Outlying observations in the y set
Studentized residuals: |SR| > 2 → outlying observation. An outlying observation need not be influential (an influential observation has a cardinal influence on the regression).

B) Outlying observations in the x set
Hat Diag leverage: h_ii are the diagonal values of the hat matrix H, $H = X (X^T X)^{-1} X^T$
$h_{ii} > \frac{2p}{n}$ → outlying observation

C) Influential observations
Cook's D (an influential observation influences the whole equation): $D_i > \frac{4}{n}$ → influential observation
Welsch–Kuh distance DFFITS (an influential observation influences its smoothed value): $|DFFITS| > 2\sqrt{\frac{p}{n}}$ → influential observation

Method conditions
- the regression parameters can take any value in (−∞; +∞)
- the regression model is linear in its parameters (if not linear → data transformation)
- independence of the residuals
- normal distribution of the residuals, N(0; σ²)
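The Fisher Z interval above can be computed directly instead of read from tables: the assignment of r to Z is arctanh, and the back-assignment is tanh. A minimal sketch, assuming numpy and scipy; fisher_z_interval is our name, and we use the two-sided normal quantile u_{1-α/2}:

```python
import numpy as np
from scipy import stats

def fisher_z_interval(r, n, alpha=0.05):
    """Interval estimate of the correlation coefficient via the
    Fisher Z-transformation (computed instead of table lookup)."""
    z = np.arctanh(r)                  # r assigned to Z
    u = stats.norm.ppf(1 - alpha / 2)  # two-sided normal quantile
    half = u / np.sqrt(n - 3)
    z1, z2 = z - half, z + half
    # last step: Z1 and Z2 assigned back to r1 and r2
    return np.tanh(z1), np.tanh(z2)
```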
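The diagnostic measures (studentized residuals, leverage h_ii, Cook's D, DFFITS) can likewise be sketched for the straight-line model. This is an illustrative implementation using the cutoffs quoted in the notes, not code from the notes; the leave-one-out variance identity used for DFFITS is a standard shortcut not derived above:

```python
import numpy as np

def regression_diagnostics(x, y):
    """Outlier and influence measures for the straight-line model
    (p = 2 parameters): studentized residuals, hat-matrix leverage,
    Cook's D and DFFITS."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, p = len(x), 2
    X = np.column_stack([np.ones(n), x])   # design matrix

    # hat matrix H = X (X'X)^-1 X'; leverages h_ii on its diagonal
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)

    b = np.linalg.solve(X.T @ X, X.T @ y)  # least-squares estimates
    e = y - X @ b                          # residuals
    s2 = e @ e / (n - p)                   # residual variance s_r^2

    # internally studentized residuals; |SR| > 2 -> outlier in y
    sr = e / np.sqrt(s2 * (1 - h))

    # Cook's distance; D_i > 4/n -> influential observation
    cook = sr ** 2 * h / (p * (1 - h))

    # DFFITS (Welsch-Kuh); |DFFITS| > 2*sqrt(p/n) -> influential
    s2_loo = ((n - p) * s2 - e ** 2 / (1 - h)) / (n - p - 1)  # leave-one-out variance
    dffits = e * np.sqrt(h) / (np.sqrt(s2_loo) * (1 - h))

    return {
        "x_outlier": h > 2 * p / n,        # leverage rule
        "y_outlier": np.abs(sr) > 2,
        "influential_cook": cook > 4 / n,
        "influential_dffits": np.abs(dffits) > 2 * np.sqrt(p / n),
    }
```

Forming the full n×n hat matrix is fine for the small samples typical of these exercises; for large n one would compute only the leverages h_ii rather than the whole matrix.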