Outline

Ordinary least squares regression

Ridge regression
Ordinary least squares regression (OLS)

Model:
$y = \beta_0 + \beta^T X + \text{error}$
$y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + \text{error}$

[Diagram: inputs $x_1, x_2, \ldots, x_p$ feeding into the response $y$]

Terminology:
$\beta_0$: intercept (or bias)
$\beta_1, \ldots, \beta_p$: regression coefficients (or weights)

The response variable responds directly and linearly to changes in the inputs.
Least squares regression

Assume that we have observed a training set of data:

Case   X1      X2      ...   Xp      Y
1      x_11    x_21    ...   x_p1    y_1
2      x_12    x_22    ...   x_p2    y_2
3      x_13    x_23    ...   x_p3    y_3
...
N      x_1N    x_2N    ...   x_pN    y_N

Estimate the $\beta$ coefficients by minimizing the residual sum of squares

$RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$
Matrix formulation of OLS regression

Differentiating the residual sum of squares

$RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$

and setting the first derivatives equal to zero, we obtain

$X^T (y - X\beta) = 0$

where

$X = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{p1} \\ 1 & x_{12} & x_{22} & \cdots & x_{p2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1N} & x_{2N} & \cdots & x_{pN} \end{pmatrix}$
and
$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$
Parameter estimates and predictions

Least squares estimates of the parameters minimizing

$RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$

are given by

$\hat\beta = (X^T X)^{-1} X^T y$

Predicted values:

$\hat y = X\hat\beta = X (X^T X)^{-1} X^T y = Hy$
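As an illustration (not part of the original lecture, which uses SAS), a minimal Python/NumPy sketch of these closed-form expressions on simulated data; the variable names and the simulated data are hypothetical.

import numpy as np

# Minimal sketch: closed-form OLS estimate and the hat matrix H = X (X^T X)^{-1} X^T
rng = np.random.default_rng(0)                               # simulated (hypothetical) data
N, p = 30, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # first column of ones for the intercept
beta_true = np.array([0.5, 2.0, -1.0])
y = X @ beta_true + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
y_hat = H @ y                                  # identical to X @ beta_hat
rss = np.sum((y - y_hat) ** 2)                 # residual sum of squares
print(beta_hat, rss)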
Different sources of inputs

- Quantitative inputs
- Transformations of quantitative inputs
- Numeric or dummy coding of the levels of qualitative inputs
- Interactions between variables (e.g. X3 = X1·X2)

Example of dummy coding (month of the year):
X1 = 1 if Jan, 0 otherwise
X2 = 1 if Feb, 0 otherwise
...
X11 = 1 if Nov, 0 otherwise
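A minimal Python/pandas sketch of this month coding (not from the lecture; the sample data are hypothetical, and December is taken as the reference level):

import pandas as pd

# Hypothetical month labels for six cases
months = pd.Series(["Jan", "Feb", "Nov", "Dec", "Jan", "Mar"], name="month")

# One 0/1 indicator column per observed level; dropping "Dec" makes it the
# reference level, mirroring the X1 (Jan), ..., X11 (Nov) coding above
dummies = pd.get_dummies(months).drop(columns=["Dec"])
print(dummies)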
An example of multiple linear regression

Response variable:
Requested price of used Porsche cars (1000 SEK)

Inputs:
X1 = Manufacturing year
X2 = Mileage (km)
X3 = Model (0 or 1)
X4 = Equipment (1, 2, 3)
X5 = Colour (Red, Black, Silver, Blue, White, Green)
Price of used Porsche cars

Response variable:
Requested price of used Porsche cars (1000 SEK)

Inputs:
X1 = Manufacturing year
X2 = Mileage (km)

Inputs          Estimated model                                     RSS
Year            Price = -76829 + 38.6·Year                          113030
Mileage         Price = 430.7 - 0.001862·Mileage                    230212
Year, Mileage   Price = -63809 + 32.1·Year - 0.000789·Mileage        92541
Interpretation of multiple regression coefficients

Assume that

$Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon$

and that the regression coefficients are estimated by ordinary least squares regression.

Then the multiple regression coefficient $\hat\beta_j$ represents the additional contribution of $x_j$ to $y$, after $x_j$ has been adjusted for $x_0, x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p$.
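A minimal Python/NumPy sketch of this interpretation (not from the lecture; simulated data): the multiple-regression coefficient of x2 equals the simple-regression coefficient of y on the residual of x2 after x2 has been adjusted for the intercept and x1.

import numpy as np

rng = np.random.default_rng(1)               # simulated (hypothetical) data
n = 200
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)           # correlated inputs
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]  # full multiple regression

# Adjust x2 for the intercept and x1, then regress y on the residual
Z = np.column_stack([np.ones(n), x1])
x2_res = x2 - Z @ np.linalg.lstsq(Z, x2, rcond=None)[0]
beta2_partial = (x2_res @ y) / (x2_res @ x2_res)

print(beta[2], beta2_partial)                # the two estimates coincide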
Confidence intervals for regression parameters

Assume that

$Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon$

where the X-variables are fixed and the error terms are i.i.d. and $N(0, \sigma^2)$.

Then

$\beta_j \in \hat\beta_j \pm t_{0.05}(N - p - 1) \sqrt{v_j}\,\hat\sigma$   (95%)

where $v_j$ is the jth diagonal element of $(X^T X)^{-1}$.
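A minimal Python sketch of this interval (not from the lecture; simulated data, and assuming the slide's $t_{0.05}(N-p-1)$ denotes the two-sided 95% critical value, i.e. the 97.5th percentile of the t distribution):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)                               # simulated (hypothetical) data
N, p = 60, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # intercept column + p inputs
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - p - 1))   # estimate of sigma
v = np.diag(XtX_inv)                               # v_j, diagonal of (X^T X)^{-1}

t_crit = stats.t.ppf(0.975, df=N - p - 1)          # two-sided 95% critical value
lower = beta_hat - t_crit * np.sqrt(v) * sigma_hat
upper = beta_hat + t_crit * np.sqrt(v) * sigma_hat
print(np.column_stack([lower, beta_hat, upper]))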
Interpretation of software outputs

Regression of the price of used Porsche cars vs mileage (km) and manufacturing year

Predictor      Coef        SE Coef     T       P
Constant       430.69      17.42       24.72   0.000
Milage (km)    -0.0018621  0.0002959   -6.29   0.000

Predictor      Coef        SE Coef     T       P
Constant       -63809      6976        -9.15   0.000
Milage (km)    -0.0007894  0.0002222   -3.55   0.001
Year           32.103      3.486        9.21   0.000

Adding new independent variables to a regression model alters at least one of the old regression coefficients unless the columns of the X-matrix are orthogonal, i.e.

$\sum_{i=1}^{N} x_{ij} x_{ik} = 0$
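A minimal Python/NumPy sketch of this point (not from the lecture; simulated data): a new column that has been made orthogonal to the existing columns of X leaves the old coefficients unchanged.

import numpy as np

rng = np.random.default_rng(3)               # simulated (hypothetical) data
N = 100
x1 = rng.normal(size=N)
y = 2.0 + 1.5 * x1 + rng.normal(size=N)

X_old = np.column_stack([np.ones(N), x1])
beta_old = np.linalg.lstsq(X_old, y, rcond=None)[0]

# A new column made orthogonal to both existing columns via projection
x2 = rng.normal(size=N)
x2_orth = x2 - X_old @ np.linalg.lstsq(X_old, x2, rcond=None)[0]

beta_new = np.linalg.lstsq(np.column_stack([X_old, x2_orth]), y, rcond=None)[0]
print(beta_old)        # intercept and slope for x1
print(beta_new[:2])    # unchanged, because x2_orth is orthogonal to X_old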
Stepwise Regression: Price (1000SEK) versus Year, Milage (km), ...

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15

Step                 1         2         3         4
Constant        -76829    -63809    -53285    -52099

Year              38.6      32.1      26.8      26.2
T-Value          11.87      9.21      7.00      6.88
P-Value          0.000     0.000     0.000     0.000

Milage (km)              -0.00079  -0.00066  -0.00062
T-Value                     -3.55     -3.08     -2.88
P-Value                     0.001     0.003     0.006

Model                                     37        27
T-Value                                 2.72      1.83
P-Value                                0.009     0.073

Equipment                                         11.0
T-Value                                           1.52
P-Value                                          0.135

S                 44.1      40.3      38.2      37.8
R-Sq             70.82     76.11     78.89     79.74
R-Sq(adj)        70.32     75.27     77.76     78.27
Mallows Cp        23.8      11.3       5.7       5.4
Classical statistical model selection techniques are model-based. In data mining, the model selection is data-driven.

The p-value refers to a t-test of the hypothesis that the regression coefficient of the last entered x-variable is zero.
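A minimal Python sketch (not from the lecture, which uses Minitab/SAS) of the forward half of such a procedure: at each step the candidate with the smallest t-test p-value is entered, as long as it falls below alpha-to-enter. The alpha-to-remove (backward) check is omitted for brevity, and the function and argument names are hypothetical.

import pandas as pd
import statsmodels.api as sm

def forward_select(X: pd.DataFrame, y: pd.Series, alpha_enter: float = 0.15):
    # Greedy forward selection based on the p-value of the last entered variable
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for c in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            pvals[c] = fit.pvalues[c]          # t-test p-value of candidate c
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha_enter:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

With a DataFrame of candidate inputs (e.g. Year, Milage, Model, Equipment) and the price as y, this returns the entered variables in order.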
Stepwise Regression: Price (1000SEK) versus Year, Milage (km), ...
- model validation by visual inspection of residuals

Residual = Observed - Predicted

[Figure: two residual plots for the response Price (1000SEK) - residuals versus fitted values, and residuals versus Milage (km).]
The Gram-Schmidt procedure for regression by successive orthogonalization and simple linear regression

1. Initialize $z_0 = x_0 = 1$
2. For $j = 1, \ldots, p$, compute
   $z_j = x_j - \sum_{k=0}^{j-1} \frac{\langle z_k, x_j \rangle}{\langle z_k, z_k \rangle} z_k = x_j - \sum_{k=0}^{j-1} \hat\gamma_{kj} z_k$,
   where $\langle \cdot, \cdot \rangle$ denotes the inner product (the sum of coordinate-wise products)
3. Regress $y$ on $z_p$ to obtain the multiple regression coefficient $\hat\beta_p$
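A minimal Python/NumPy sketch of this procedure (not from the lecture; simulated data), checking that the coefficient from step 3 matches the last coefficient of the ordinary least squares fit.

import numpy as np

rng = np.random.default_rng(4)               # simulated (hypothetical) data
N, p = 100, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

# Step 1: z_0 = x_0 = 1 (the intercept column)
Z = [np.ones(N)]

# Step 2: orthogonalize each x_j against z_0, ..., z_{j-1}
for j in range(p):
    xj = X[:, j]
    zj = xj - sum((zk @ xj) / (zk @ zk) * zk for zk in Z)
    Z.append(zj)

# Step 3: regress y on z_p to get the multiple regression coefficient beta_hat_p
zp = Z[-1]
beta_p_gs = (zp @ y) / (zp @ zp)

# Compare with the ordinary least squares solution
beta_ols = np.linalg.lstsq(np.column_stack([np.ones(N), X]), y, rcond=None)[0]
print(beta_p_gs, beta_ols[-1])   # the two values agree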
Prediction of a response variable using correlated explanatory variables
- daily temperatures in Stockholm, Göteborg, and Malmö

[Figure: pairwise scatter plots of daily temperatures in Stockholm, Göteborg, and Malmö plotted against each other.]
Absorbance records for ten samples of chopped meat

[Figure: absorbance versus channel (1-100) for Sample_1 through Sample_10.]

1 response variable (protein)

100 predictors (absorbance at 100 wavelengths or channels)

The predictors are strongly correlated to each other.
Absorbance records for 240 samples of chopped meat

[Figure: Protein (%) versus absorbance in channel 50.]

The target is poorly correlated to each predictor.
Ridge regression

The ridge regression coefficients minimize a penalized residual sum of squares:

$\hat\beta^{ridge} = \arg\min_\beta \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \beta_1 x_{1i} - \ldots - \beta_p x_{pi} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$

or

$\hat\beta^{ridge} = \arg\min_\beta \sum_{i=1}^{N} \left( y_i - \beta_0 - \beta_1 x_{1i} - \ldots - \beta_p x_{pi} \right)^2$ subject to $\sum_{j=1}^{p} \beta_j^2 \le s$

Normally, inputs are centred prior to the estimation of regression coefficients.
Matrix formulation of ridge regression for centred inputs

$RSS(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta$

$\hat\beta^{ridge} = (X^T X + \lambda I)^{-1} X^T y$

If the inputs are orthogonal, the ridge estimates are just a scaled version of the least squares estimates: $\hat\beta^{ridge} = \gamma \hat\beta$, where $0 \le \gamma \le 1$.

Shrinking enables estimation of regression coefficients even if the number of parameters exceeds the number of cases.

Figure 3.7
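A minimal Python/NumPy sketch of the closed form above (not from the lecture; simulated data): inputs and response are centred, so the intercept is handled separately as the mean of y.

import numpy as np

rng = np.random.default_rng(5)                    # simulated (hypothetical) data
N, p = 50, 5
X = rng.normal(size=(N, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=N)     # two strongly correlated inputs
y = 3.0 + X @ np.array([1.0, 1.0, 0.0, -2.0, 0.5]) + rng.normal(size=N)

Xc = X - X.mean(axis=0)                           # centre the inputs
yc = y - y.mean()                                 # centre the response

lam = 10.0
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
intercept = y.mean()                              # intercept for centred inputs

beta_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]
print(beta_ols)     # OLS coefficients (can be erratic for the nearly collinear columns)
print(beta_ridge)   # ridge coefficients, shrunk toward zero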
Ridge regression – pros and cons

Ridge regression is particularly useful if the explanatory variables are strongly correlated to each other.

The variance of the estimated regression coefficients is reduced at the expense of (slightly) biased estimates.
The Gauss-Markov theorem

Consider a linear regression model in which:
– the inputs are regarded as fixed
– the error terms are i.i.d. with mean 0 and variance $\sigma^2$.

Then the least squares estimator of a parameter $a^T\beta$ has variance no bigger than that of any other linear unbiased estimator of $a^T\beta$.

Biased estimators may have smaller variance and mean squared error!
SAS code for an ordinary least squares regression
proc reg data=mining.dailytemperature outest = dtempbeta;
model daily_consumption = stockholm g_teborg malm_;
run;
SAS code for ridge regression
proc reg data=mining.dailytemperature outest = dtempbeta ridge=0 to 10 by 1;
model daily_consumption = stockholm g_teborg malm_;
proc print data=dtempbeta;
run;
_TYPE_  _DEPVAR_            _RIDGE_    _RMSE_   Intercept   STOCKHOLM   G_TEBORG     MALM_
PARMS   Daily_Consumption         .   30845.8    480268.9     -5364.6     -548.3   -3598.2
RIDGE   Daily_Consumption         0   30845.8    480268.9     -5364.6     -548.3   -3598.2
RIDGE   Daily_Consumption         1   36314.6    462824.0     -2327.8    -2357.6   -2512.6
RIDGE   Daily_Consumption         2   43008.7    450349.7     -1830.1    -1899.4   -2011.6
RIDGE   Daily_Consumption         3   48325.9    442054.5     -1514.3    -1584.8   -1674.9
RIDGE   Daily_Consumption         4   52401.2    436146.6     -1292.7    -1358.6   -1434.4
RIDGE   Daily_Consumption         5   55571.5    431726.2     -1128.0    -1188.6   -1254.1
RIDGE   Daily_Consumption         6   58092.1    428294.6     -1000.8    -1056.3   -1114.1
RIDGE   Daily_Consumption         7   60138.0    425553.4      -899.4     -950.4   -1002.1
RIDGE   Daily_Consumption         8   61829.0    423313.5      -816.7     -863.8    -910.6
RIDGE   Daily_Consumption         9   63248.9    421448.8      -747.9     -791.7    -834.4
RIDGE   Daily_Consumption        10   64457.3    419872.4      -689.8     -730.6    -770.0