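MULTIVARIATE VARIABLE

The data form a matrix with n objects (rows) and m variables (columns):

$$
\mathbf{C} =
\begin{pmatrix}
\mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_j^T \\ \vdots \\ \mathbf{x}_n^T
\end{pmatrix}
=
\begin{pmatrix}
x_{1,1} & \cdots & x_{1,i} & \cdots & x_{1,m} \\
\vdots & & \vdots & & \vdots \\
x_{j,1} & \cdots & x_{j,i} & \cdots & x_{j,m} \\
\vdots & & \vdots & & \vdots \\
x_{n,1} & \cdots & x_{n,i} & \cdots & x_{n,m}
\end{pmatrix}
$$

STATISTICAL DEPENDENCE

• CORRELATION: relationship between QUANTITATIVE (measured) data
• CONTINGENCY: relationship between QUALITATIVE (descriptive) data

CORRELATION

• simple: for two variables,
• multiple: for more than two variables,
• partial: describes the relationship of two variables in a multivariable data set (the influence of all other variables is excluded).

[Figure: examples of positive and negative correlation]

CORRELATION: DECOMPOSITION OF VARIABILITY

[Figure: scatter plot of x2 against x1 with a fitted model]

• TOTAL VARIABILITY of Y: deviation of the measured value from the mean
• RESIDUAL VARIABILITY: deviation between the measured values and the model (computed) values
• MODEL VARIABILITY (variability explained by the model): deviation of the model values from the mean

CORRELATION: COEFF. OF DETERMINATION AND COEFF. OF CORRELATION

$$
R^2 = \frac{S^2_{\hat{x}_2}}{S^2_{x_2}} = 1 - \frac{S^2_{x_1 x_2}}{S^2_{x_2}}
\qquad
R = \sqrt{\frac{S^2_{\hat{x}_2}}{S^2_{x_2}}} = \sqrt{1 - \frac{S^2_{x_1 x_2}}{S^2_{x_2}}}
$$

where $S^2_{x_2}$ is the total variability of x2, $S^2_{\hat{x}_2}$ is the variability explained by the model and $S^2_{x_1 x_2}$ is the residual variability.

COEFF. OF DETERMINATION

The coefficient of determination quantifies which part of the total variability of the response is explained by the model.

[Figure: example scatter plots with r² = 0.9, r² = 0.05 and r² = 1]

COEFF. OF CORRELATION

Simple correlation:
• Pearson
• Spearman (rank correlation)

PEARSON COEFF. OF CORRELATION

Assumes a BIVARIATE normal distribution. The coefficient is the standardised covariance:

$$
r_{x_1 x_2} = r_{x_2 x_1} = \frac{\mathrm{cov}_{x_1 x_2}}{S_{x_1} S_{x_2}}
$$

COVARIANCE

• it is a measure of linear relationship;
• it can be positive or negative, and its absolute value has the product of the standard deviations as its upper limit;
• its magnitude depends on the units of the arguments, so standardisation is necessary.

$$
\mathrm{cov}_{x_1 x_2} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)
$$

PEARSON COEFF. OF CORRELATION

Basic properties:
• it is a dimensionless measure of correlation;
• it ranges from 0 to 1 for positive correlation and from 0 to -1 for negative correlation;
• 0 means that there is no linear relationship between the variables (the relationship can be nonlinear!)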
or that the relationship is not statistically significant on the basis of the available data;
• 1 or -1 indicates a functional (perfect) relationship;
• the value of the correlation coefficient is the same for the dependence of x1 on x2 and for the reverse dependence of x2 on x1.

SPEARMAN CORRELATION COEFFICIENT

A nonparametric correlation coefficient based on ranks:

$$
r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n^3 - n}
$$

where $d_i$ is the difference between the ranks of X and Y in row i.

Influential points (extremes):

[Figure: scatter plot with influential points]

• Pearson R = -0.412 (influential points are fully counted)
• Spearman R = +0.541 (the effect of influential points is strongly limited)

CONFIDENCE INTERVAL OF R (CI)

CI($\rho$) is the interval of possible values of the population correlation coefficient $\rho$ (with probability $1 - \alpha$). Because the distribution of the correlation coefficient is not normal, we must use the Fisher transformation

$$
Z(R) = \operatorname{arctanh}(R) = 0.5 \ln\frac{1 + R}{1 - R}
$$

which has an approximately normal distribution with mean $E(Z) = Z(\rho)$ and variance $D(Z) = 1/(n-3)$.

The lower and upper boundaries of the CI in the Fisher transformation are

$$
Z(R) \pm z_{1-\alpha/2} \frac{1}{\sqrt{n-3}}
$$

where the second term is half the width of the CI of the transformed value. Transforming these boundaries back from Z(R) to the correlation coefficient gives the lower and upper boundaries of the CI of the correlation coefficient.

Worked example (n = 12):

Fisher value: R = 0.95305, fisherz(0.95305) = 1.864

CI of the Fisher value:

$$
Z = 1.864 \pm 1.96 \cdot \frac{1}{\sqrt{12 - 3}} = 1.864 \pm 0.65333 = (1.2107;\ 2.51737)
$$

CI of the correlation coefficient:
fisherz2r(1.2107) = 0.83689
fisherz2r(2.5174) = 0.98707

Result: lower boundary 0.837, estimate 0.953, upper boundary 0.987.
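The formulas above for the Pearson coefficient, the Spearman coefficient and the Fisher-transformation confidence interval can be sketched in a few lines of standard-library Python. This is only an illustrative sketch: the function names (`pearson_r`, `ranks`, `spearman_r`, `fisher_ci`) are hypothetical, the rank helper assumes there are no tied values, and the critical value 1.96 is taken from the worked example.

```python
import math

def pearson_r(x, y):
    """Pearson coefficient: covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)

def ranks(values):
    """Ranks starting at 1 (hypothetical helper; assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_r(x, y):
    """Spearman coefficient from squared rank differences (valid without ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n ** 3 - n)

def fisher_ci(r, n, z_crit=1.96):
    """Confidence interval of a correlation coefficient via the Fisher transformation."""
    z = math.atanh(r)                    # Z(R)
    half = z_crit / math.sqrt(n - 3)     # half-width on the Z scale
    return math.tanh(z - half), math.tanh(z + half)

# Values from the worked example: R = 0.95305, n = 12
low, high = fisher_ci(0.95305, 12)
print(round(low, 3), round(high, 3))
```

With the values of the worked example, the printed boundaries can be compared with the 0.837 and 0.987 reported above.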
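REGRESSION ANALYSIS

[Figure: MEASURED VALUES and MODEL VALUES; horizontal axis: independent (explanatory) variable X; vertical axis: dependent (explained, response) variable Y]

REGRESSION MODEL

$$
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1m} \\
x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2m} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{im} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{nm}
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_m \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_i \\ \vdots \\ \varepsilon_n \end{pmatrix}
$$

$$
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}
$$

where $\mathbf{y}$ is the response variable, $\mathbf{X}$ holds the explanatory variable(s), $\boldsymbol{\beta}$ are the regression parameters and $\boldsymbol{\varepsilon}$ is the random error.

OLS ESTIMATOR

Estimation of parameters:

$$
\mathbf{b} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
$$

Estimation of predicted values:

$$
\hat{\mathbf{y}} = \mathbf{X}\mathbf{b} = \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} = \mathbf{H}\mathbf{y}
$$

where $\mathbf{H} = \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T$ is the hat matrix.

ASSUMPTIONS OF OLS ESTIMATOR: LINEARITY

The relationship between each of the predictors and the response variable is linear and is not better represented by a curved relationship. The model should be linear in the parameters, namely the $\beta_k$.

ASSUMPTIONS OF OLS ESTIMATOR: NORMALITY

The residuals, and therefore the populations from which each of the responses were collected, are normally distributed. Note that in the majority of multiple linear regression cases the predictor variables are measured (not specifically set), and therefore the respective populations are also assumed to be normally distributed.

ASSUMPTIONS OF OLS ESTIMATOR: HOMOGENEITY OF VARIANCE

The residuals (the populations from which each of the responses were collected) are equally varied.

ASSUMPTIONS OF OLS ESTIMATOR: (MULTI)COLLINEARITY

A predictor variable must not be correlated with a combination of the other predictor variables. Multicollinearity has major detrimental effects on model fitting:
• instability of the estimated partial regression slopes (small changes in the data or in the variables included can cause dramatic changes in the parameter estimates);
• inflated standard errors and confidence intervals of the model parameters, thereby increasing the type II error rate (reducing the power) of parameter hypothesis tests.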
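The OLS estimator and the hat matrix introduced above translate almost directly into matrix code. The following is a minimal sketch, assuming NumPy is available; the function name `ols_fit` and the small data set are hypothetical and serve only to illustrate the formulas.

```python
import numpy as np

def ols_fit(X, y):
    """Return (b, y_hat, H) for the model y = X b + e.

    X is the n x m design matrix; include a column of ones for an intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    XtX_inv = np.linalg.inv(X.T @ X)   # (X^T X)^-1; fails if columns are exactly collinear
    b = XtX_inv @ X.T @ y              # estimates of the regression parameters
    H = X @ XtX_inv @ X.T              # hat matrix
    y_hat = H @ y                      # predicted values, equal to X @ b
    return b, y_hat, H

# Hypothetical data lying exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
b, y_hat, H = ols_fit(X, y)
print(b)
```

The explicit inverse mirrors the formula for readability; for real data a least-squares solver such as `np.linalg.lstsq` is usually the numerically safer choice.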
ASSUMPTIONS OF OLS ESTIMATOR: VIF (VARIANCE INFLATION FACTOR)

$$
\mathrm{VIF} = \operatorname{diag}(\mathbf{R}^{-1})
$$

where $\mathbf{R}$ is the correlation matrix of the predictors.

• VIF > 5: high multicollinearity
• VIF > 10: "critical" multicollinearity

REGRESSION MODEL

[Figure: regression line; labels: response (dependent) variable Y, independent (explanatory) variable X, intercept, regression parameter b]

CONFIDENCE INTERVAL OF MODEL

The values of the regression model are only point estimates. The CI of one model value lies between the lower and the upper boundary of the CI. Together these boundaries delimit the area in which all possible models computed from any sample (coming from the same population) appear with probability $1 - \alpha$.

CI OF Y VALUES: PREDICTION INTERVAL

The prediction interval is an estimate of an interval in which future observations will fall with a certain probability $1 - \alpha$:

$$
y_{i\,(\min,\,\max)} = \hat{y}_i \pm t_{\alpha/2;\,n-m} \ldots
$$

[Figure: CONFIDENCE INTERVAL OF MODEL (CI) and PREDICTION INTERVAL OF RESPONSE (PI)]

COMPARISON OF REGRESSION MODELS

Akaike information criterion (AIC):

$$
\mathrm{AIC} = n \ln\left(\frac{\mathrm{RSS}}{n}\right) + 2m
$$

where RSS is the residual sum of squares and m is the number of parameters. The smaller the AIC, the better the model (from the statistical point of view!).

REGRESSION DIAGNOSTICS

Diagnostics of residuals:
• normality
• homoscedasticity (constant variance)
• independence

REGRESSION DIAGNOSTICS: HOMOSCEDASTICITY

• Breusch–Pagan test (and many others…)
• Weighted OLS method

REGRESSION DIAGNOSTICS: INFLUENTIAL POINTS

REGRESSION DIAGNOSTICS: HAT VALUES (LEVERAGES)

The hat matrix, H, relates the fitted values to the observed values. It describes the influence each observed value has on each fitted value. The diagonal elements of the hat matrix are the leverages, which describe the influence each observed value has on the fitted value for that same observation.

REGRESSION DIAGNOSTICS: COOK'S DISTANCE

Cook's distance measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression.
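Several of the quantities described above (VIF, AIC, leverages and Cook's distance) can be computed with a few matrix operations. The sketch below assumes NumPy; the function names and the data are hypothetical. The text gives no formula for Cook's distance, so the code uses the common textbook closed form, which is an assumption here rather than the author's definition.

```python
import numpy as np

def vif(predictors):
    """Variance inflation factors: diagonal of the inverse correlation matrix.

    predictors: n x k array (k >= 2) without the intercept column.
    """
    R = np.corrcoef(np.asarray(predictors, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(R))

def aic(rss, n, m):
    """AIC = n ln(RSS / n) + 2 m, with m the number of parameters."""
    return n * np.log(rss / n) + 2 * m

def leverage_and_cook(X, y):
    """Leverages h_ii and Cook's distances for an OLS fit of y on X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)                      # leverages
    resid = y - H @ y                   # residuals
    s2 = resid @ resid / (n - m)        # residual variance estimate
    # Assumed textbook form: D_i = e_i^2 / (m s^2) * h_ii / (1 - h_ii)^2
    cook = resid**2 / (m * s2) * h / (1.0 - h) ** 2
    return h, cook

# Hypothetical data: a straight line with one shifted point at the end
x = np.arange(10, dtype=float)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
y[9] += 10.0
h, cook = leverage_and_cook(X, y)
print(np.argmax(cook))                  # index of the most influential observation

# Two nearly collinear hypothetical predictors give large VIF values
rng = np.random.default_rng(0)
x1 = rng.normal(size=30)
x2 = x1 + 0.1 * rng.normal(size=30)
print(vif(np.column_stack([x1, x2])))
```

The leverages sum to the number of parameters m, which is a quick sanity check on the hat matrix.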
REGRESSION DIAGNOSTICS: DFFITS

The DFFITS statistic is a scaled measure of the change in the predicted value for the ith observation and is calculated by deleting the ith observation. A large value indicates that the observation is very influential in its neighborhood of the X space. A general cutoff to consider is 2; a size-adjusted cutoff is also recommended.

REGRESSION DIAGNOSTICS: DFBETAS

DFBETAS are scaled measures of the change in each parameter estimate and are calculated by deleting the ith observation. The general cutoff value is 2; a size-adjusted cutoff is also used.
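Both statistics can be illustrated by literally deleting each observation in turn and refitting the model. The sketch below assumes NumPy and uses the usual scaling by the leave-one-out residual standard deviation; the function name and the data are hypothetical. The size-adjusted cutoffs are not recoverable from the text; commonly cited conventions are 2·sqrt(m/n) for DFFITS and 2/sqrt(n) for DFBETAS, which are treated here as assumptions.

```python
import numpy as np

def dffits_dfbetas(X, y):
    """Leave-one-out DFFITS (length n) and DFBETAS (n x m) for an OLS fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                              # fit on all observations
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)        # leverages h_ii
    dffits = np.empty(n)
    dfbetas = np.empty((n, m))
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        b_i = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)    # fit without observation i
        resid_i = yi - Xi @ b_i
        s_i = np.sqrt(resid_i @ resid_i / (n - 1 - m)) # residual std. dev. without i
        # Scaled change in the predicted value for observation i
        dffits[i] = X[i] @ (b - b_i) / (s_i * np.sqrt(h[i]))
        # Scaled change in each parameter estimate
        dfbetas[i] = (b - b_i) / (s_i * np.sqrt(np.diag(XtX_inv)))
    return dffits, dfbetas

# Hypothetical data: a noisy line with one shifted point at the end
x = np.arange(10, dtype=float)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.5 * np.sin(x)
y[9] += 10.0
dffits, dfbetas = dffits_dfbetas(X, y)

n, m = X.shape
# Assumed size-adjusted cutoffs (not taken from the text)
print(np.where(np.abs(dffits) > 2 * np.sqrt(m / n))[0])
print(np.where(np.abs(dfbetas) > 2 / np.sqrt(n))[0])
```

The explicit refits keep the code close to the verbal definition; for larger data sets the same values can be obtained from closed-form expressions without refitting.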