Simple linear regression and correlation analysis
1. Regression
2. Correlation
3. Significance testing
1. Simple linear regression analysis
Simple regression describes the relationship between two variables.
- Two variables, generally Y = f(X)
- Y = dependent variable (regressand)
- X = independent variable (regressor)
Simple linear regression
$$y_i = f(x_i) + e_i$$
- $f(x)$ – regression equation
- $e_i$ – random error (residual deviation): an independent $N(0, \sigma^2)$ random quantity
Simple linear regression – straight line
$$y_{i\,\text{est}} = b_0 + b_1 x_i$$
- $b_0$ = constant
- $b_1$ = coefficient of regression
Parameter estimates → least squares condition
$$\sum_{i=1}^{n} \left( y_i - y_{i\,\text{est}} \right)^2 \to \min, \qquad y_{i\,\text{est}} = b_0 + b_1 x_i$$
The difference of the actual $y$ from the estimated $y_{\text{est}}$ is minimal:
$$d_i = y_i - y_{i\,\text{est}} = y_i - (b_0 + b_1 x_i)$$
hence
$$f(b_0, b_1) = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2 \to \min$$
where $n$ is the number of observations $(y_i, x_i)$.
Adjustment: to minimize $\sum_{i=1}^{n} \left( y_i - y_{i\,\text{est}} \right)^2$, the function is partially differentiated with respect to the parameters $b_0$, $b_1$, and the derivatives of the sum of squared deviations are set equal to zero:
$$\frac{\partial f}{\partial b_0} = -2 \sum \left( y_i - b_0 - b_1 x_i \right) = 0$$
$$\frac{\partial f}{\partial b_1} = -2 \sum \left( y_i - b_0 - b_1 x_i \right) \cdot x_i = 0$$
Two approaches to parameter estimation using the least squares condition (shown for the straight-line equation):
1. Normal equation system for the straight line
$$b_0 \cdot n + b_1 \sum x_i = \sum y_i$$
$$b_0 \sum x_i + b_1 \sum x_i^2 = \sum x_i y_i$$
2. Matrix computation approach
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \boldsymbol{\varepsilon}$$
- y = dependent variable vector
- X = independent variable matrix
- b = vector of regression coefficients (straight line → $b_0$ and $b_1$)
- ε = vector of random errors
$$\mathbf{b} = \left( \mathbf{X}^{\mathsf{T}}\mathbf{X} \right)^{-1} \mathbf{X}^{\mathsf{T}}\mathbf{y}$$
Simple linear regression
- observation: $y_i$
- smoothed values: $y_{i\,\text{est}}$
- residual deviation: $d_i = y_i - y_{i\,\text{est}}$
Residual sum of squares:
$$S_r = \sum_{i=1}^{n} \left( y_i - y_{i\,\text{est}} \right)^2 = \sum_{i=1}^{n} d_i^2$$
Residual variance:
$$s_r^2 = \frac{S_r}{n-k} = \frac{\sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2}{n-2}$$
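A short sketch computing $d_i$, $S_r$ and $s_r^2$ directly from these formulas, again on illustrative data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, k = len(x), 2                      # k = number of parameters (b0, b1)

X = np.column_stack([np.ones(n), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

y_est = b0 + b1 * x                   # smoothed values
d = y - y_est                         # residual deviations d_i
S_r = (d**2).sum()                    # residual sum of squares
s2_r = S_r / (n - k)                  # residual variance, here S_r / (n - 2)
print(S_r, s2_r)
```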
Simple lin. reg. → dependence of Y on X
- Straight line equation
$$y_{i\,\text{est}} = b_{0yx} + b_{1yx} x_i$$
- Normal equation system
$$n \cdot b_{0yx} + b_{1yx} \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$
$$b_{0yx} \sum_{i=1}^{n} x_i + b_{1yx} \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$
- Parameter estimates – computational formula
$$b_{1yx} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}, \qquad b_{0yx} = \bar{y} - b_{1yx}\,\bar{x}$$
Simple lin. reg. → dependence of X on Y
- Associated straight line equation
$$x_{i\,\text{est}} = b_{0xy} + b_{1xy} y_i$$
- Parameter estimates – computational formula
$$b_{1xy} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum y_i^2 - \left( \sum y_i \right)^2}, \qquad b_{0xy} = \bar{x} - b_{1xy}\,\bar{y}$$
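The two sets of computational formulas can be evaluated side by side; a minimal sketch on illustrative data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sx, Sy, Sxy = x.sum(), y.sum(), (x * y).sum()
Sxx, Syy = (x**2).sum(), (y**2).sum()

# dependence of Y on X
b1_yx = (n * Sxy - Sx * Sy) / (n * Sxx - Sx**2)
b0_yx = y.mean() - b1_yx * x.mean()

# dependence of X on Y (associated straight line)
b1_xy = (n * Sxy - Sx * Sy) / (n * Syy - Sy**2)
b0_xy = x.mean() - b1_xy * y.mean()

print(b0_yx, b1_yx)
print(b0_xy, b1_xy)
```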
2. Correlation analysis
Correlation analysis measures the strength of dependence – the coefficient of correlation r.
- |r| lies in ⟨0; 1⟩:
  - |r| in ⟨0; 0.33⟩ → weak dependence
  - |r| in ⟨0.34; 0.66⟩ → medium strong dependence
  - |r| in ⟨0.67; 1⟩ → strong to very strong dependence
- $r_{yx} = r_{xy}$
$$r_{yx} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left( n \sum x_i^2 - \left( \sum x_i \right)^2 \right) \cdot \left( n \sum y_i^2 - \left( \sum y_i \right)^2 \right)}}$$
$$r_{yx} = \pm\sqrt{b_{1yx} \cdot b_{1xy}}, \qquad b_{1yx} = r_{yx}\,\frac{s_y}{s_x}, \qquad b_{1xy} = r_{xy}\,\frac{s_x}{s_y}$$
- $r^2$ = coefficient of determination: the proportion (%) of the variance of Y that is caused by the effect of X
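A sketch of the correlation coefficient from the computational formula, its relation to the two slopes, and $r^2$, on illustrative data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sx, Sy, Sxy = x.sum(), y.sum(), (x * y).sum()
Sxx, Syy = (x**2).sum(), (y**2).sum()

r = (n * Sxy - Sx * Sy) / np.sqrt((n * Sxx - Sx**2) * (n * Syy - Sy**2))

b1_yx = (n * Sxy - Sx * Sy) / (n * Sxx - Sx**2)
b1_xy = (n * Sxy - Sx * Sy) / (n * Syy - Sy**2)
r_from_slopes = np.sign(b1_yx) * np.sqrt(b1_yx * b1_xy)   # r = ±sqrt(b1_yx·b1_xy)

print(r, r_from_slopes)   # both agree
print(r**2)               # coefficient of determination
```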
3. Significance testing in simple regression
Significance test of parameter $b_1$ (straight line)
- $H_0: \beta = 0$ vs. $H_1: \beta \neq 0$ (two-sided)
- test criterion: $t = \dfrac{b_1}{s_b}$
- estimate $s_b$ for parameter $b_1$: $s_b = \dfrac{s_y}{s_x}\sqrt{\dfrac{1-r^2}{n-2}}$
- table value: $t_{\alpha}(n-k)$ (two-sided)
- if |test criterion| > table value → $H_0$ is rejected and $H_1$ is valid; if $\alpha$ > p-value → $H_0$ is rejected
Coefficient of regression – interval estimation
- interval estimate for the unknown $\beta_i$:
$$P\left( b_i - \Delta \le \beta_i \le b_i + \Delta \right) = 1 - \alpha, \qquad \Delta = t_{\alpha}(n-k) \cdot s_b$$
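A minimal sketch of the slope test and the interval estimate, using scipy's t quantile in place of the table value (data and α = 0.05 are illustrative choices):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, k, alpha = len(x), 2, 0.05

r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)         # b1 = r * s_y / s_x

# s_b = (s_y / s_x) * sqrt((1 - r^2) / (n - 2))
s_b = (y.std(ddof=1) / x.std(ddof=1)) * np.sqrt((1 - r**2) / (n - 2))

t_crit = b1 / s_b                              # test criterion
t_table = stats.t.ppf(1 - alpha / 2, n - k)    # two-sided table value
print(abs(t_crit) > t_table)                   # True -> reject H0

delta = t_table * s_b                          # Delta = t_alpha(n-k) * s_b
print(b1 - delta, b1 + delta)                  # interval estimate for beta_1
```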
Significance test of the coefficient of correlation r (straight line)
- $H_0: \rho = 0$ vs. $H_1: \rho \neq 0$ (two-sided)
- test criterion: $t = \dfrac{r}{\sqrt{1-r^2}}\,\sqrt{n-2}$
- table value: $t_{\alpha}(n-k)$ (two-sided)
- if |test criterion| > table value → $H_0$ is rejected and $H_1$ is valid; if $\alpha$ > p-value → $H_0$ is rejected
Coefficient of correlation – interval estimation
- for small samples and non-normal distributions → Fisher Z-transformation
- first, r is assigned to Z (from tables)
- interval estimate for the unknown $\rho$:
$$Z_1 = Z - u_{1-\alpha}\,\frac{1}{\sqrt{n-3}}; \qquad Z_2 = Z + u_{1-\alpha}\,\frac{1}{\sqrt{n-3}}$$
- in the last step, $Z_1$ and $Z_2$ are converted back to $r_1$ and $r_2$
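A sketch of the correlation test and the Fisher Z interval; scipy's arctanh and normal quantile replace the table lookups (α = 0.05 and the two-sided quantile are illustrative assumptions):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, k, alpha = len(x), 2, 0.05

r = np.corrcoef(x, y)[0, 1]

# test criterion t = r / sqrt(1 - r^2) * sqrt(n - 2)
t_crit = r / np.sqrt(1 - r**2) * np.sqrt(n - 2)
t_table = stats.t.ppf(1 - alpha / 2, n - k)
print(abs(t_crit) > t_table)          # True -> reject H0

# Fisher Z-transformation: Z = artanh(r) replaces the table lookup
Z = np.arctanh(r)
u = stats.norm.ppf(1 - alpha / 2)
Z1, Z2 = Z - u / np.sqrt(n - 3), Z + u / np.sqrt(n - 3)
r1, r2 = np.tanh(Z1), np.tanh(Z2)     # back-transform Z1, Z2 to r1, r2
print(r1, r2)
```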
The summary ANOVA

| Variation | Sum of squared deviations | df | Variance | Test criterion |
|---|---|---|---|---|
| along the regression function | $S_1 = \sum \left( y_{i\,\text{est}} - \bar{y} \right)^2$ | $k-1$ | $s_1^2 = \dfrac{S_1}{k-1}$ | $F = \dfrac{s_1^2}{s_r^2}$ |
| across the regression function | $S_r = \sum \left( y_i - y_{i\,\text{est}} \right)^2$ | $n-k$ | $s_r^2 = \dfrac{S_r}{n-k}$ | |
The summary ANOVA (alternatively)
- test criterion
$$F = \frac{R^2}{1-R^2} \cdot \frac{n-k}{k-1}$$
- table value
$$F_{\alpha}\left[ (m-1), (n-1) \right]$$
Multicollinearity
- a relationship between (among) the independent variables
- multicollinearity is high when there is an almost perfect linear relationship among the independent variables (X1, X2, …, XN)
- the relationships must be analyzed before the model is formed
- linear independence of the columns (variables) is violated
Causes of multicollinearity
- trends in time series, similar tendencies among variables (regression)
- inclusion of exogenous variables, lags
- use of 0/1 coding in the sample
Consequences of multicollinearity
- wrong sampling
- the null hypothesis of a zero regression coefficient is not rejected even though it actually should be
- confidence intervals are wide
- regression coefficient estimates are strongly influenced by changes in the data
- regression coefficients can have the wrong sign
- the regression equation is not suitable for prediction
Testing of multicollinearity
- pairwise coefficients of correlation, t-test
- Farrar–Glauber test
  - test criterion
$$B = -\left( n - 1 - \frac{2p + 5}{6} \right) \ln |R|$$
  - table value: $\chi^2_{1-\alpha}\left( p(p-1)/2 \right)$
  - if test criterion > table value → $H_0$ is rejected
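A minimal sketch of the Farrar–Glauber test on an illustrative design of p = 3 regressors, two of which are deliberately made nearly collinear:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x1 = rng.normal(size=30)
x2 = x1 + rng.normal(scale=0.1, size=30)   # nearly collinear with x1
x3 = rng.normal(size=30)
X = np.column_stack([x1, x2, x3])
n, p = X.shape

R = np.corrcoef(X, rowvar=False)           # correlation matrix of regressors

# B = -(n - 1 - (2p + 5)/6) * ln|R|
B = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))

chi2_table = stats.chi2.ppf(0.95, p * (p - 1) // 2)
print(B > chi2_table)   # True -> reject H0 of no multicollinearity
```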
Elimination of multicollinearity
- excluding variables
- getting a new sample
- re-formulating and re-thinking the model (choice of variables)
- transformation of the chosen variables
- re-expressing variables (not total consumption, but consumption per capita, etc.)
Regression diagnostics
- data quality for the chosen model
- suitability of the model for the chosen dataset
- method conditions
Data quality evaluation
A) Outlying observations in the y set
- studentized residuals: |SR| > 2 → outlying observation
- an outlying observation need not be influential (an influential observation has a cardinal influence on the regression)
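The slide states only the decision rule |SR| > 2; the sketch below assumes the common internally studentized residual $d_i / (s_r\sqrt{1-h_{ii}})$ and uses illustrative data with one planted outlier in y:

```python
import numpy as np

x = np.arange(1.0, 11.0)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9, 12.1, 13.8, 16.2, 18.0, 35.0])
n, k = len(x), 2                      # last y value is a planted outlier

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
d = y - H @ y                         # residual deviations
s_r = np.sqrt((d**2).sum() / (n - k))

SR = d / (s_r * np.sqrt(1 - np.diag(H)))
print(np.abs(SR) > 2)                 # flags the outlying observation in y
```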
Data quality evaluation
B) Outlying observations in the x set
- Hat Diag leverage: $h_{ii}$ – diagonal values of the hat matrix H
$$H = X \left( X^{\mathsf{T}} X \right)^{-1} X^{\mathsf{T}}$$
- $h_{ii} > 2\,\dfrac{p}{n}$ → outlying observation
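A short sketch of the leverage rule, on an x set containing one deliberately extreme value:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])   # x = 20 is far from the rest
n, p = len(x), 2                                # p = number of parameters

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                  # leverages h_ii

print(h > 2 * p / n)   # flags the outlying observation in the x set
```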
Data quality evaluation
C) Influential observations
- Cook's D (an influential observation influences the whole equation): $D_i > 4$ → influential observation
- Welsch–Kuh DFFITS distance (an influential observation influences its smoothed value): $|DFFITS| > 2\sqrt{\dfrac{p}{n}}$ → influential observation
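The slide gives only the two decision rules; the $D_i$ and DFFITS formulas in the sketch below are the standard textbook definitions, applied to the same illustrative data with a planted outlier:

```python
import numpy as np

x = np.arange(1.0, 11.0)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9, 12.1, 13.8, 16.2, 18.0, 35.0])
n, p = len(x), 2

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
d = y - H @ y
s2 = (d**2).sum() / (n - p)

SR = d / np.sqrt(s2 * (1 - h))                 # internally studentized
D = SR**2 * h / (p * (1 - h))                  # Cook's distance

# externally studentized residuals, as used in DFFITS
t = d * np.sqrt((n - p - 1) / ((d**2).sum() * (1 - h) - d**2))
DFFITS = t * np.sqrt(h / (1 - h))

print(D > 4)                                   # slide's rule (a stricter common rule is 4/n)
print(np.abs(DFFITS) > 2 * np.sqrt(p / n))     # slide's rule for DFFITS
```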
Method conditions
- regression parameters can take any value in (−∞; +∞)
- the regression model is linear in its parameters (if not linear → data transformation)
- independence of residuals
- normal distribution of residuals, $N(0; \sigma^2)$
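As a rough illustration of checking the last two conditions, the sketch below applies scipy's Shapiro–Wilk normality test to the residuals and computes their lag-1 autocorrelation (an informal independence check, not a test named in the source):

```python
import numpy as np
from scipy import stats

x = np.arange(1.0, 9.0)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9, 12.1, 13.8, 16.2])

X = np.column_stack([np.ones(len(x)), x])
d = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]   # residuals

print(stats.shapiro(d))                    # normality of residuals
print(np.corrcoef(d[:-1], d[1:])[0, 1])    # lag-1 residual autocorrelation
```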