STAB22 Statistics I
Lecture 9

Linear Model
Linear model equation: $\hat{y} = b_0 + b_1 x$
Where:
intercept $b_0$: ŷ value at x = 0
slope $b_1$: change in ŷ for unit increase in x
[Figure: fitted line with intercept b0 and slope b1, marking the true value (y), the predicted value (ŷ) and the residual (y−ŷ)]

Linear Regression
Best fitting line minimizes sum of squared residuals (least squares criterion), given by:
$b_1 = r \dfrac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$
Where:
$r$: correlation between x and y
$s_x, s_y$: std. dev. of x, y
$\bar{x}, \bar{y}$: sample mean of x, y
Note: regression line always passes through point $(\bar{x}, \bar{y})$, i.e. through means of the variables

Example
Lung Cancer vs Smoking, # obs n = 25

Variable                     Mean     SD      r
Lung Cancer Mortality (y)    109.00   26.11   .7162
Smoking (x)                  102.88   17.20

[Figure: scatterplot "Lung Cancer vs Smoking", Mortality against Smoking]
Find linear model
Predict lung cancer mortality for Smoking = 85

Example (cont'd)
Let's standardize data (i.e. take variable z-scores)
Fill in summary table

Variable       Mean   SD   r
z-score (y)
z-score (x)

[Figure: scatterplot of z-score(Mortality) against z-score(Smoking)]
Find new linear model for z-scores

Regression & Correlation
Correlation coefficient (r) between two variables essentially equals slope of linear model of standardized values (z-scores)
Direction & strength of relationship ↔ sign & magnitude of slope
[Figure: three scatterplots of standardized data, e.g. r = +0.05, r = +0.85, r = −0.50]

Regression Diagnostics
Can fit linear model to any set of data
E.g. X, Y don't need to be linearly related
Want to check whether linear model offers good description of data
Can use scatterplot; but residual plot often provides a better picture

Residual Plot
Plot residuals (y−ŷ) against x
[Figure: scatterplot of y against x, next to residual plot of (y−ŷ) against x with reference line at 0]
Residual plot should be evenly scattered around 0, with no particular pattern
Note: mean of residuals is always 0

Residual Plot
What can go wrong?
Non-linearity
[Figure: scatterplot and residual plot for a non-linear relationship]
Easier to see in residual plot

Residual Plot
Uneven dispersion
[Figure: scatterplot and residual plot with spread changing along x]
If residual spread changes with x → linear model is not evenly accurate throughout x

Residual Standard Deviation
If linear regression assumptions are satisfied, can measure prediction accuracy using residual SD (a.k.a. error SD)
$s_e = \sqrt{\text{mean square residuals}} = \sqrt{\dfrac{\sum (y - \hat{y})^2}{n - 2}}$
$s_e$ measures average distance between true and predicted values (i.e. between data & linear model)

Coefficient of Determination
How useful is linear regression model in describing y-variable?
Compare y-data's variation to residual variation from linear model
data variation: $\sum (y - \bar{y})^2$
model variation: $\sum (\hat{y} - \bar{y})^2$
[Figure: data with fitted line $\hat{y} = b_0 + b_1 x$ and the mean $\bar{y}$]

Coefficient of Determination
Coefficient of determination ($R^2$): Proportion of y-variation accounted for by linear model
$R^2$ between 0 & 1
Equal to squared coefficient of correlation (r):
$R^2 = \dfrac{\text{model var.}}{\text{data var.}} = r^2$
Proportion of y-variation left in residuals = $1 - R^2$
residual variation: $\sum (y - \hat{y})^2$

Example
Lung Cancer vs Smoking

Variable                     Mean     SD      r
Lung Cancer Mortality (y)    109.00   26.11   .7162
Smoking (x)                  102.88   17.20

[Figure: scatterplot "Lung Cancer vs Smoking", Mortality against Smoking]
Find R2
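
The lung cancer example can be worked through numerically using only the summary statistics on the slides. The following is a minimal Python sketch, not part of the original lecture; the function name fit_from_summary and the variable names are illustrative assumptions. It applies the least squares formulas for the slope and intercept, predicts mortality for Smoking = 85, repeats the fit for z-scores (mean 0, SD 1), and squares r to get R2.

```python
# Summary statistics from the lung cancer vs smoking example (n = 25).
x_mean, x_sd = 102.88, 17.20   # Smoking (x)
y_mean, y_sd = 109.00, 26.11   # Lung cancer mortality (y)
r = 0.7162                     # correlation between x and y


def fit_from_summary(x_mean, x_sd, y_mean, y_sd, r):
    """Least squares intercept and slope from means, SDs and correlation.

    Hypothetical helper name; implements b1 = r * sy / sx and
    b0 = y_mean - b1 * x_mean.
    """
    b1 = r * y_sd / x_sd
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Linear model on the original scale.
b0, b1 = fit_from_summary(x_mean, x_sd, y_mean, y_sd, r)

# Predicted lung cancer mortality for Smoking = 85.
predicted_at_85 = b0 + b1 * 85

# Standardized variables have mean 0 and SD 1, so the same formulas
# give intercept 0 and slope r.
z_b0, z_b1 = fit_from_summary(0.0, 1.0, 0.0, 1.0, r)

# Coefficient of determination equals the squared correlation.
r_squared = r ** 2

print(f"y-hat = {b0:.3f} + {b1:.4f} x")
print(f"prediction at Smoking = 85: {predicted_at_85:.2f}")
print(f"z-score model: intercept {z_b0:.1f}, slope {z_b1:.4f}")
print(f"R^2 = {r_squared:.3f}")
```

With these inputs the slope comes out at about 1.087 and the intercept at about −2.85, so the predicted mortality for Smoking = 85 is about 89.6. R2 is about 0.513, meaning roughly half of the variation in mortality is accounted for by the linear model and the rest is left in the residuals.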