CORRELATION AND REGRESSION
MULTIVARIATE VARIABLE
The data set contains n OBJECTS (rows) and m VARIABLES (columns):

$$C = \begin{pmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_j^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix} = \begin{pmatrix} x_{1,1} & \cdots & x_{1,i} & \cdots & x_{1,m} \\ \vdots & & \vdots & & \vdots \\ x_{j,1} & \cdots & x_{j,i} & \cdots & x_{j,m} \\ \vdots & & \vdots & & \vdots \\ x_{n,1} & \cdots & x_{n,i} & \cdots & x_{n,m} \end{pmatrix}$$

Each row $\mathbf{x}_j^T$ is one object (the last row is the n-th object); each column is one variable (the last column is the m-th variable).
STATISTICAL DEPENDENCE
CORRELATION – relationship between QUANTITATIVE (measured) data
CONTINGENCY – relationship between QUALITATIVE (descriptive) data
CORRELATION
• simple – for two variables
• multiple – for more than two variables
• partial – describes the relationship of two variables in a multivariate data set (the influence of all other variables is excluded)
CORRELATION
[Figure: scatterplots of positive and negative correlation]
CORRELATION
[Figure: decomposition of the variability of x2 in a plot of x2 against x1]
• TOTAL VARIABILITY of Y – deviation of the measured values from the mean
• RESIDUAL VARIABILITY – deviation of the measured values from the model (computed) values
• MODEL VARIABILITY (variability explained by the model) – deviation of the model values from the mean
CORRELATION
COEFF. OF DETERMINATION

$$R^2 = \frac{S^2_{\hat{x}_2}}{S^2_{x_2}} = 1 - \frac{S^2_{x_1 x_2}}{S^2_{x_2}}$$

COEFF. OF CORRELATION

$$R = \sqrt{\frac{S^2_{\hat{x}_2}}{S^2_{x_2}}} = \sqrt{1 - \frac{S^2_{x_1 x_2}}{S^2_{x_2}}}$$

($S^2_{\hat{x}_2}$ – model variability, $S^2_{x_1 x_2}$ – residual variability, $S^2_{x_2}$ – total variability of $x_2$.)
COEFF. OF DETERMINATION
Quantifies which part of the total variability of the response is explained by the model.
[Figure: example scatterplots with r² = 0.9, r² = 0.05, and r² = 1]
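As an illustration, here is a minimal Python sketch (the data and names are made up for this example) that fits a straight line and computes r² as one minus residual over total variability:

```python
import numpy as np

# Illustrative data (made up): x1 is the predictor, x2 the response.
rng = np.random.default_rng(1)
x1 = np.linspace(0, 10, 50)
x2 = 2.0 * x1 + 1.0 + rng.normal(scale=2.0, size=x1.size)

# Fit a straight line and compute the model values.
slope, intercept = np.polyfit(x1, x2, deg=1)
x2_model = slope * x1 + intercept

# r^2 = 1 - residual variability / total variability
rss = np.sum((x2 - x2_model) ** 2)    # residual sum of squares
tss = np.sum((x2 - x2.mean()) ** 2)   # total sum of squares
r2 = 1.0 - rss / tss
print(f"r^2 = {r2:.3f}")
```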
COEFF. OF CORRELATION
Simple correlation:
• Pearson
• Spearman (rank correlation)
PEARSON COEFF. OF CORRELATION
Assumes a BIVARIATE normal distribution; it is a standardised covariance:

$$r_{x_1 x_2} = r_{x_2 x_1} = \frac{\mathrm{cov}_{x_1 x_2}}{S_{x_1} \cdot S_{x_2}}$$
COVARIANCE
• measure of linear relationship
• can be positive or negative
• the product of the standard deviations is the upper limit of its absolute value
• its magnitude depends on the units of the arguments ⇒ standardisation is necessary

$$\mathrm{cov}_{x_1 x_2} = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_{1i} - \bar{x}_1 \right) \left( x_{2i} - \bar{x}_2 \right)$$
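A quick numpy check of this formula and of the standardisation from the previous slide (made-up data; `cov_manual` is our name):

```python
import numpy as np

rng = np.random.default_rng(8)
x1 = rng.normal(size=50)
x2 = 0.5 * x1 + rng.normal(scale=0.5, size=50)

# Covariance by the formula above; ddof=1 gives the 1/(n-1) factor.
cov_manual = np.sum((x1 - x1.mean()) * (x2 - x2.mean())) / (x1.size - 1)
print(cov_manual, np.cov(x1, x2, ddof=1)[0, 1])      # identical values

# Standardising by the standard deviations yields Pearson's r.
r = cov_manual / (np.std(x1, ddof=1) * np.std(x2, ddof=1))
print(r, np.corrcoef(x1, x2)[0, 1])                  # identical values
```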
PEARSON COEFF. OF CORRELATION
Basic properties:
• It is a dimensionless measure of correlation.
• It ranges from 0 to 1 for positive correlation and from 0 to −1 for negative correlation.
• 0 means that there is no linear relationship between the variables (the relationship can be nonlinear!) or that the relationship is not statistically significant on the basis of the available data.
• 1 or −1 indicates a functional (perfect) relationship.
• The value of the correlation coefficient is the same for the dependence of x1 on x2 and for the reverse dependence of x2 on x1.
SPEARMAN CORRELATION COEFFICIENT
Nonparametric correlation coeff. based on ranks:

$$r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n^3 - n}$$

where $d_i$ is the difference between the ranks of X and Y in one row.
SPEARMAN CORRELATION COEFFICIENT
Influential points (extremes):
• Pearson R = −0.412 (influential points are fully counted)
• Spearman R = +0.541 (influential points are strongly limited)
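A small Python sketch of this effect (made-up data, not the data behind the numbers above): one planted extreme point moves Pearson's coefficient far more than Spearman's:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

# Made-up positively correlated data plus one extreme influential point.
x = np.concatenate([rng.uniform(0, 10, 30), [100.0]])
y = np.concatenate([x[:30] + rng.normal(scale=2.0, size=30), [-50.0]])

r_pearson, _ = pearsonr(x, y)    # uses the values -> fully counts the extreme
r_spearman, _ = spearmanr(x, y)  # uses ranks -> the extreme is one more rank

print(f"Pearson  R = {r_pearson:+.3f}")
print(f"Spearman R = {r_spearman:+.3f}")
```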
CONFIDENCE INTERVAL R (CI)
CI(ρ) includes the interval of possible values of the population correlation coefficient ρ (with probability 1 − α).
Because the distribution of the corr. coeff. is not normal, we must use the Fisher transformation

$$Z(R) = \operatorname{arctanh}(R) = 0.5 \ln \frac{1+R}{1-R}$$

with approx. normal distribution with mean E(Z) = Z(ρ) and variance D(Z) = 1/(n−3).
CONFIDENCE INTERVAL R (CI)
Half-width of the CI of the transformed value:

$$Z(R) \pm z_{1-\alpha/2} \cdot \frac{1}{\sqrt{n-3}}$$

This gives the lower and upper boundary of the CI in the Fisher transformation; retransformation of Z(R) back to the correlation coefficient then gives the lower and upper boundary of the CI of the correlation coeff.
CONFIDENCE INTERVAL R (CI)
Example: R = 0.95305, n = 12.
Fisher value: fisherz(0.95305) = 1.864
CI of the Fisher value:

$$Z(\rho) \in 1.864 \pm 1.96 \cdot \frac{1}{\sqrt{12-3}} = 1.864 \pm 0.65333 = (1.2107;\ 2.5174)$$

CI of the correlation coeff.:
fisherz2r(1.2107) = 0.83689
fisherz2r(2.5174) = 0.98707
i.e. R = 0.953 with CI (0.837; 0.987).
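A short Python sketch that reproduces this example; the slide's `fisherz`/`fisherz2r` correspond to `np.arctanh`/`np.tanh`, and the helper name `fisher_ci` is ours:

```python
import numpy as np
from scipy.stats import norm

def fisher_ci(r, n, alpha=0.05):
    """CI for a correlation coefficient via the Fisher transformation."""
    z = np.arctanh(r)                      # fisherz: Z(R) = arctanh(R)
    half = norm.ppf(1 - alpha / 2) / np.sqrt(n - 3)
    lo, hi = z - half, z + half            # CI on the transformed scale
    return np.tanh(lo), np.tanh(hi)        # fisherz2r: retransformation

# Example from the slide: R = 0.95305, n = 12
print(fisher_ci(0.95305, 12))  # approx. (0.8369, 0.9871)
```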
REGRESSION ANALYSIS
[Figure: MEASURED VALUES as points, MODEL VALUES as the fitted line]
• Y axis: dependent (explained, response) variable
• X axis: independent (explanatory) variable
REGRESSION MODEL

$$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1m} \\
x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2m} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{im} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{nm}
\end{pmatrix}
\cdot
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_m \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_i \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

y – response variable; X – explanatory variable(s); β – regression parameters; ε – random error.
OLS ESTIMATOR
Estimation of parameters:

$$\mathbf{b} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y}$$

Estimation of predicted values:

$$\hat{\mathbf{y}} = \mathbf{X} \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y} = \mathbf{H} \mathbf{y}$$

where $\mathbf{H} = \mathbf{X} \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T$ is the hat matrix.
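A minimal numpy sketch of these two formulas on made-up data (in practice `np.linalg.lstsq` is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up design matrix: an intercept column and one predictor.
n = 20
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=1.0, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y          # b = (X'X)^-1 X'y, parameter estimates
H = X @ XtX_inv @ X.T          # hat matrix H
y_hat = H @ y                  # fitted values, y_hat = H y

print("b     =", b)
print("check =", np.linalg.lstsq(X, y, rcond=None)[0])  # same estimates
```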
ASSUMPTIONS OF OLS ESTIMATOR
linearity – the relationship between each of the predictors and the response variable is not better represented by a curved relationship. The model should be linear in the parameters, namely the βk.
ASSUMPTIONS OF OLS ESTIMATOR
normality – the residuals, and therefore the populations from which each of the responses was collected, are normally distributed. Note that in the majority of multiple linear regression cases the predictor variables are measured (not specifically set), and therefore the respective populations are also assumed to be normally distributed.
ASSUMPTIONS OF OLS ESTIMATOR
homogeneity of variance – the residuals (the populations from which each of the responses was collected) are equally varied.
ASSUMPTIONS OF OLS ESTIMATOR
(multi)collinearity – a predictor variable must not be correlated with a combination of the other predictor variables. Multicollinearity has major detrimental effects on model fitting:
• instability of the estimated partial regression slopes (small changes in the data or in variable inclusion can cause dramatic changes in the parameter estimates);
• inflated standard errors and confidence intervals of the model parameters, thereby increasing the type II error rate (reducing the power) of parameter hypothesis tests.
ASSUMPTIONS OF OLS ESTIMATOR
VIF – variance inflation factor: the diagonal of the inverse of the correlation matrix of the predictors, diag(R⁻¹).
VIF > 5 ⇒ high multicollinearity
VIF > 10 ⇒ "critical" multicollinearity
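A small numpy sketch of the diag(R⁻¹) formula on made-up predictors, where x3 is deliberately built as a near-combination of x1 and x2:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up predictors; x3 is almost a linear combination of x1 and x2.
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.1, size=100)
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)   # correlation matrix of the predictors
vif = np.diag(np.linalg.inv(R))    # VIF = diag(R^-1)
print(np.round(vif, 1))            # all far above the "critical" 10 here
```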
REGRESSION MODEL
[Figure: simple linear regression line with labelled parts]
• Y: response (dependent variable)
• intercept
• X: independent (explanatory) variable
• b: regression parameter
CONFIDENCE INTERVAL OF MODEL
[Figure: regression line with its confidence band]
The values of the regression model are only point estimates; the CI of one model value runs from the lower to the upper boundary of the band.
The band is the area where all possible models computed from any sample (coming from the same population) appear with probability 1 − α.
CI OF Y VALUES – PREDICTION INTERVAL
An estimate of an interval in which future observations will fall, with a certain probability 1 − α:

$$y_{i(\min,\max)} = \hat{y}_i \pm t_{\alpha/2;\, n-m} \cdot s_{\hat{y}_i}$$

(the quantile of Student's t distribution with n − m degrees of freedom times the standard error of the prediction)
CONFIDENCE INTERVAL OF MODEL (CI), PREDICTION INTERVAL OF RESPONSE (PI)
[Figure: regression line with the narrower CI band and the wider PI band]
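A Python sketch of both intervals for a simple linear model, built from the OLS formulas above (made-up data; the factor under the square root gains a "1 +" for the PI):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4)
n, m = 25, 2                                # n observations, m parameters
X = np.column_stack([np.ones(n), np.linspace(0, 10, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=1.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
resid = y - X @ b
s2 = resid @ resid / (n - m)                # residual variance estimate
tq = t.ppf(0.975, df=n - m)                 # t quantile for alpha = 0.05

x0 = np.array([1.0, 5.0])                   # point where we want intervals
y0 = x0 @ b
lev = x0 @ XtX_inv @ x0
ci = tq * np.sqrt(s2 * lev)                 # CI of the model value
pi = tq * np.sqrt(s2 * (1.0 + lev))         # PI of a future observation
print(f"model value {y0:.2f}, CI +/- {ci:.2f}, PI +/- {pi:.2f}")
```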
COMPARISON OF REGRESSION MODELS
Akaike information criterion (AIC):

$$AIC = n \cdot \ln\left( \frac{RSS}{n} \right) + 2m$$

RSS – residual sum of squares
m – number of parameters
The smaller the AIC, the better the model (from the statistical point of view!!).
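A quick sketch of this formula, comparing a straight-line and a quadratic fit on made-up (truly linear) data; the helper name `aic` is ours:

```python
import numpy as np

def aic(rss, n, m):
    # AIC = n * ln(RSS / n) + 2m
    return n * np.log(rss / n) + 2 * m

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 40)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)

for degree in (1, 2):                       # line vs. quadratic
    coeffs = np.polyfit(x, y, deg=degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    m = degree + 1                          # number of parameters
    print(f"degree {degree}: AIC = {aic(rss, x.size, m):.2f}")
```

On linear data the extra quadratic parameter rarely reduces RSS enough to pay its 2m penalty, so the line usually wins.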
REGRESSION DIAGNOSTICS
Diagnostics of residuals:
• normality
• homoscedasticity (constant variance)
• independence
REGRESSION DIAGNOSTICS
Heteroscedasticity: Breusch–Pagan test (and many others…); remedy: weighted OLS method.
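One way to run the test is statsmodels' implementation; a minimal sketch on made-up heteroscedastic data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(6)
x = np.linspace(1, 10, 100)
X = sm.add_constant(x)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x)   # error variance grows with x

fit = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")  # small -> heteroscedastic
```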
REGRESSION DIAGNOSTICS
Influential points
REGRESSION DIAGNOSTICS
HAT VALUES (leverages)
The hat matrix H relates the fitted values to the observed values; it describes the influence each observed value has on each fitted value. The diagonal elements of the hat matrix are the leverages, which describe the influence each observed value has on the fitted value for that same observation.
REGRESSION DIAGNOSTICS
Cook's distance
Measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression.
REGRESSION DIAGNOSTICS
DFFITS
A scaled measure of the change in the predicted value for the i-th observation, calculated by deleting the i-th observation. A large value indicates that the observation is very influential in its neighborhood of the X space.
A general cutoff to consider is 2; the recommended size-adjusted cutoff is $2\sqrt{m/n}$.
REGRESSION DIAGNOSTICS
DFBETAS
Scaled measures of the change in each parameter estimate, calculated by deleting the i-th observation.
A general cutoff value is 2; the size-adjusted cutoff is $2/\sqrt{n}$.
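All of these influence measures are available from statsmodels' OLSInfluence; a sketch on made-up data with one planted outlier:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)
y[-1] += 15.0                      # plant one influential outlier

fit = sm.OLS(y, sm.add_constant(x)).fit()
infl = OLSInfluence(fit)

n, m = len(y), 2                   # observations, parameters
print("max leverage      :", infl.hat_matrix_diag.max())
print("max Cook distance :", infl.cooks_distance[0].max())
print("max |DFFITS|      :", np.abs(infl.dffits[0]).max(),
      "size-adjusted cutoff", 2 * np.sqrt(m / n))
print("max |DFBETAS|     :", np.abs(infl.dfbetas).max(),
      "size-adjusted cutoff", 2 / np.sqrt(n))
```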