Applied Regression Analysis (BUSI 6220), KNN Ch. 3: Diagnostics and Remedial Measures

Diagnostics for the Predictor Variable
* Dot plots, sequence plots, stem-and-leaf plots
* Used essentially to check for outlying observations, which will be useful in later diagnosis.

Residual Analysis: Why Look at the Residuals?
* To detect non-linearity of the regression function
* To detect heteroscedasticity (lack of constant variance)
* Auto-correlation
* Outliers
* Non-normality
* Important predictor variables left out?

Regression Model Assumptions
* Errors are independent (have zero covariance)
* Errors have constant variance
* Errors are normally distributed

Diagnostics for Residuals: Plots of Residuals
1. Against the predictor (if X1 is the only predictor)
2. Absolute or squared residuals against the predictor
3. Against fitted values (when there are many predictors X_i)
4. Against time
5. Against omitted predictor variables
6. Box plot
7. Normal probability plot

Diagnostics for Residuals: Normal Probability Plot
* Approximate expected value of the kth smallest residual:
    E{e_(k)} ≈ sqrt(MSE) * z[(k - 0.375) / (n + 0.25)]

Tests Involving Residuals: The Correlation Test for Normality
* H0: The residuals are normal. HA: The residuals are not normal.
* Compute the correlation between the e_i's and their expected values under normality.
* Use Table B.6: the observed coefficient of correlation should be at least as large as the table value for a given level of significance.

Tests Involving Residuals: Other Tests for Normality
* H0: The residuals are normal. HA: The residuals are not normal.
* Anderson-Darling (very powerful; may be used for small sets, n < 25)
* Ryan-Joiner
* Shapiro-Wilk
* Kolmogorov-Smirnov
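The correlation test above can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration, not the textbook's code: the function name `normality_correlation` and the default MSE (residual sum of squares over n - 2, as in simple linear regression) are my own choices, and the final comparison against Table B.6 is left to the reader since the critical values live in that table.

```python
import numpy as np
from scipy import stats

def normality_correlation(residuals, mse=None):
    """Correlation test for normality (sketch).

    Correlates the ordered residuals with their approximate expected
    values under normality, sqrt(MSE) * z[(k - 0.375) / (n + 0.25)].
    """
    e = np.sort(np.asarray(residuals, dtype=float))
    n = len(e)
    if mse is None:
        # Assumption: simple linear regression, so n - 2 degrees of freedom.
        mse = np.sum(e ** 2) / (n - 2)
    k = np.arange(1, n + 1)
    expected = np.sqrt(mse) * stats.norm.ppf((k - 0.375) / (n + 0.25))
    return np.corrcoef(e, expected)[0, 1]

# Illustrative draw: residuals that really are normal should give r near 1.
rng = np.random.default_rng(0)
r = normality_correlation(rng.normal(size=50))
# Compare r with the critical value from Table B.6 for the chosen alpha;
# H0 (normality) is rejected only if r falls BELOW the table value.
```

Note that the correlation coefficient is unchanged by the positive scaling sqrt(MSE), so the MSE estimate only matters for plotting, not for the test statistic itself.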
Tests Involving Residuals (Constancy of Error Variance): The Modified Levene Test
* Partitions the independent variable into two groups (high X values and low X values), then tests the null H0: the groups have equal variances.
* Similar to a pooled-variance t-test for the difference in two means of independent samples.
* Robust to departures from normality of the error terms.
* A large sample size is essential so that the dependencies of the error terms on each other can be neglected.
* Uses the group "median" instead of the "mean" (why?).

The Modified Levene Test: Test Statistic
    t*_L = (dbar_1 - dbar_2) / (s * sqrt(1/n1 + 1/n2))
where d_i1 = |e_i1 - emed_1| and d_i2 = |e_i2 - emed_2| (emed_1, emed_2 are the group medians of the residuals). The d_i1 and d_i2 are now the data points, i.e. the t-test is based on these two sets of data points, with pooled variance
    s^2 = [(n1 - 1)s1^2 + (n2 - 1)s2^2] / (n1 + n2 - 2)
Read the "Comments" on page 118 and go through the Breusch-Pagan test on page 119.

F Test for Lack of Fit
* A comparison of the "full model" sum of squared errors with the "lack of fit" sum of squares.
* For best results, requires repeat observations at at least one X level.
* Full model: Y_ij = mu_j + e_ij (mu_j = mean response when X = X_j)
* Reduced model: Y_ij = b0 + b1*X_j + e_ij (why "reduced"?)

F Test for Lack of Fit (continued)
* SSE(Full) = SSPE = sum_j sum_i (Y_ij - Ybar_j)^2
  (labeled "pure error" since it is an unbiased estimator of the true error variance; see 3.31 and 3.32, page 123)
* SSLF = SSE(Reduced) - SSPE, where SSE(Reduced) is the SSE from the ordinary least squares regression model.
* Test statistic:
    F* = [SSLF / (c - p)] / [SSPE / (n - c)]
  (what is "p"?)
* Be sure to compare the ANOVA table on page 126 with the OLS ANOVA table.

Overview of Some Remedial Measures
* The problem: simple linear regression is not appropriate.
* The solution:
  1. Abandon the model ("Eagle to Hawk; abort mission and return to base.")
  2. Remedy the situation:
     * If the error terms are non-independent, work with a model that calls for correlated error terms (Ch. 12).
     * If heteroscedasticity, use the WLS method to estimate parameters (Ch. 10) or use transformations of the data.
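The modified Levene computation above can be sketched directly from its definition. This is an illustrative implementation, not the textbook's: the function name is hypothetical, and splitting the two groups at the median of X is my own choice of partition rule (the slides only say "high X values and low X values").

```python
import numpy as np
from scipy import stats

def modified_levene(x, e):
    """Modified Levene (Brown-Forsythe) test sketch for constant error variance.

    Splits the residuals e into low-X and high-X groups (here: at the
    median of x, an assumed split rule), then runs a pooled-variance
    t-test on the absolute deviations from each group's median residual.
    """
    x, e = np.asarray(x, float), np.asarray(e, float)
    low = x <= np.median(x)
    d1 = np.abs(e[low] - np.median(e[low]))    # d_i1 = |e_i1 - median_1|
    d2 = np.abs(e[~low] - np.median(e[~low]))  # d_i2 = |e_i2 - median_2|
    n1, n2 = len(d1), len(d2)
    s2 = ((n1 - 1) * d1.var(ddof=1) + (n2 - 1) * d2.var(ddof=1)) / (n1 + n2 - 2)
    t = (d1.mean() - d2.mean()) / np.sqrt(s2 * (1 / n1 + 1 / n2))
    p = 2 * stats.t.sf(abs(t), n1 + n2 - 2)    # two-sided p-value
    return t, p

# Illustrative check on synthetic residuals: constant variance vs.
# variance that grows with x (heteroscedasticity).
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)
t_const, p_const = modified_levene(x, rng.normal(size=200))
t_fan, p_fan = modified_levene(x, rng.normal(size=200) * x)
```

The fanning residuals should produce a very small p-value, while the constant-variance residuals should usually not.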
* If the scatter plot indicates non-linearity, then either use a non-linear regression function (Ch. 7) or transform to linear.
* NEXT: We will look at one such powerful transformation method.

The Box-Cox Transformation Method
* The family of power transforms on Y is given as: Y' = Y^lambda
* The family easily includes simple transforms such as the square root, the square, etc.
* By definition, when lambda = 0, Y' = log_e(Y).
* When the response variable is so transformed, the normal error regression model becomes:
    Y_i^lambda = b0 + b1*X_i + e_i
* We would like to determine the "best" value of lambda.

Method 1: Maximum Likelihood Estimation
* Maximize over b0, b1, lambda, and sigma^2:
    L = [1 / (2*pi*sigma^2)]^(n/2) * exp[ -(1 / (2*sigma^2)) * sum_{i=1}^{n} (Y_i^lambda - b0 - b1*X_i)^2 ]

The Box-Cox Transformation Method, Method 2: Numerical Search
* Step 1: Set a value of lambda.
* Step 2: Standardize the Y_i observations:
    If lambda != 0, then W_i = K1 * (Y_i^lambda - 1)
    If lambda = 0, then W_i = K2 * log_e(Y_i)
  where K2 = (prod_{i=1}^{n} Y_i)^(1/n) and K1 = 1 / (lambda * K2^(lambda - 1))
* Step 3: Now regress the set W on the set X.
* Step 4: Note the corresponding SSE.
* Step 5: Change lambda, and repeat steps 2 to 4 until the lowest SSE is obtained.
* Let's try this method with the GMAT data. What should we get as the best lambda?
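Steps 1 to 5 of the numerical search can be sketched as a grid search over lambda. This is a minimal sketch under assumed synthetic data (the GMAT data set from the slides is not available here): Y is constructed so that Y^0.5 is roughly linear in X, so the search should favor a lambda near 0.5. The function name `boxcox_sse` is my own.

```python
import numpy as np

def boxcox_sse(x, y, lambdas):
    """Box-Cox numerical search sketch: for each lambda, standardize Y
    into W (Steps 1-2), regress W on X by OLS (Step 3), and record the
    SSE (Step 4); the best lambda minimizes SSE (Step 5).
    """
    x = np.asarray(x, float)
    y = np.asarray(y, float)            # requires y > 0
    n = len(y)
    k2 = np.exp(np.mean(np.log(y)))     # K2 = geometric mean of the Y_i
    X = np.column_stack([np.ones(n), x])
    sses = []
    for lam in lambdas:
        if lam == 0:
            w = k2 * np.log(y)                       # W_i = K2 * ln(Y_i)
        else:
            k1 = 1.0 / (lam * k2 ** (lam - 1))       # K1 = 1/(lambda*K2^(lambda-1))
            w = k1 * (y ** lam - 1.0)                # W_i = K1 * (Y_i^lambda - 1)
        beta, *_ = np.linalg.lstsq(X, w, rcond=None)
        sses.append(np.sum((w - X @ beta) ** 2))
    return np.array(sses)

# Hypothetical data: sqrt(Y) is nearly linear in X, so lambda ~ 0.5 should win.
rng = np.random.default_rng(2)
x = np.linspace(1, 10, 100)
y = (2 + 3 * x + rng.normal(scale=0.1, size=100)) ** 2
lams = np.arange(-1.0, 2.01, 0.25)
best = lams[np.argmin(boxcox_sse(x, y, lams))]
```

Standardizing Y into W is what makes the SSEs comparable across different lambda values; without the K1, K2 scaling, the SSE would change scale with lambda and the comparison in Step 5 would be meaningless.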