Regression
Jordan Boyd-Graber, University of Colorado Boulder
Lecture 11
Slides adapted from Lauren Hannah

Linear Regression

[Figure: scatter plot of the data, y against x]

Data are the set of inputs and outputs, D = {(x_i, y_i)}_{i=1}^n.

In linear regression, the goal is to predict y from x using a linear function.

Examples of linear regression:
• given a child's age and gender, what is his/her height?
• given unemployment, inflation, number of wars, and economic growth, what will the president's approval rating be?
• given a browsing history, how long will a user stay on a page?

[Figure: the same scatter plot with one point labeled (x_i, y_i) and a fitted line f(x) = β_0 + β_1 x]

Multiple Covariates

Often, we have a vector of inputs where each entry represents a different feature of the data:

    x = (x_1, ..., x_p)

The function fitted to the response is a linear combination of the covariates:

    f(x) = β_0 + Σ_{j=1}^p β_j x_j

• Often, it is convenient to represent x as (1, x_1, ..., x_p)
• In this case x is a vector, and so is β (we'll represent both in bold face)
• The prediction is then the dot product between these two vectors
• This dot product expands into a sum (this should be familiar!):

    βx = β_0 + Σ_{j=1}^p β_j x_j

Hyperplanes: Linear Functions in Multiple Dimensions

[Figure: a hyperplane fit to points in (x_1, x_2, y) space]

Covariates

• Do not need to be the raw values of x_1, x_2, ...
• Can be any feature or function of the data (a short code sketch follows below):
  ◦ Transformations like x_2 = log(x_1) or x_2 = cos(x_1)
  ◦ Basis expansions like x_2 = x_1², x_3 = x_1³, x_4 = x_1⁴, etc.
  ◦ Indicators of events like x_2 = 1{−1 ≤ x_1 ≤ 1}
  ◦ Interactions between variables like x_3 = x_1 x_2
• Because of its simplicity and flexibility, linear regression is one of the most widely implemented regression techniques
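To make the dot-product view concrete, here is a minimal R sketch (not from the slides; the data and coefficient values are made up for illustration). It builds a design matrix with a leading column of ones plus transformed and interaction covariates, then computes f(x) = βx for every row at once.

# Minimal sketch (not from the slides): prediction as a dot product.
# The values of x1, x2, and beta are made up for illustration.
x1 <- c(0.5, 1.2, 2.0)
x2 <- c(3.0, 1.0, 0.4)

# Design matrix: a column of 1s for the intercept beta_0, the raw covariates,
# a transformation, and an interaction
X <- cbind(1, x1, x2, log(x1), x1 * x2)

beta <- c(0.3, 1.5, -0.7, 0.2, 0.1)  # one coefficient per column of X

# f(x) = beta . x, computed for every row of X
y_hat <- as.vector(X %*% beta)
y_hat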
Plan
• Fitting a Regression
• Example
• Regularized Regression
• Wrapup

Fitting a Linear Regression

[Figure: scatter plot with a fitted line; the vertical distance y_i − (β_0 + β_1 x_i) is marked for one point]

Idea: minimize the Euclidean distance between the data and the fitted line:

    RSS(β) = (1/2) Σ_{i=1}^n (y_i − β x_i)²

How to Find β

• Use calculus to find the value of β that minimizes the RSS
• The optimal value is

    β̂ = (Σ_{i=1}^n y_i x_i) / (Σ_{i=1}^n x_i²)

Prediction

• After finding β̂, we would like to predict an output value for a new set of covariates
• We just find the point on the line that corresponds to the new input:

    ŷ = β_0 + β_1 x

For example, with the fitted line y = 1.0 + 0.5x and a new input x = 5.0:

    ŷ = 1.0 + 0.5 · 5 = 3.5

Probabilistic Interpretation

• Our analysis so far has not included any probabilities
• Linear regression does, however, have a probabilistic (probability model-based) interpretation
• It assumes that response values have a Gaussian distribution around the linear mean function:

    Y_i | x_i, β ~ N(x_i β, σ²)

• This is a discriminative model: the inputs x are not modeled
• Minimizing the RSS is equivalent to maximizing the conditional likelihood

Example: Old Faithful

We will predict the time we will have to wait to see the next eruption, given the duration of the current eruption.

[Figure: Waiting Time (min) against Current Eruption Time (min) for the Old Faithful data]
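The R code on the next slide assumes a fitted model object called fit.lm and an existing scatter plot to draw the line on. A minimal sketch of how that setup might look, using R's built-in faithful data set (the exact calls below are an assumption; the slides only show the resulting coefficients):

# Assumed setup (not shown on the slides): fit waiting time on eruption
# duration with R's built-in faithful data set, producing the fit.lm object
# that the next slide's code uses.
data(faithful)
fit.lm <- lm(waiting ~ eruptions, data = faithful)

# Scatter plot of the data that the fitted line will be drawn on
plot(faithful$eruptions, faithful$waiting,
     xlab = "Current Eruption Time (min)",
     ylab = "Waiting Time (min)")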
Example: Old Faithful

We can plot our data and make a function for new predictions:

> # Plot a line on the data
> abline(fit.lm, col="red", lwd=3)
>
> # Make a function for prediction
> fit.lm$coefficients[1]
(Intercept)
    33.4744
> fit.lm$coefficients[2]
eruptions
 10.72964
> faithful.fit <- function(x) fit.lm$coefficients[1] + fit.lm$coefficients[2]*x
> x.pred <- c(2.0, 2.7, 3.8, 4.9)
> faithful.fit(x.pred)
[1] 54.93368 62.44443 74.24703 86.04964

[Figure: the Old Faithful scatter plot with the fitted regression line overlaid]

Regularized Regression

Multivariate Linear Regression

[Figure: two panels with p = 1 and only two data points; a line passes exactly through both]

Example: p = 1 with only 2 points.
• With p + 1 or fewer points, the line hits all of them exactly (or p points, with mean-zero data)
• With at least p + 1 points (but still close to that number), the line goes close to every point

Noise, Bias, Variance Tradeoff

• Noise: lower bound on performance
• Bias: error as a result of choosing the wrong model
• Variance: variation due to the training sample and randomization
• No model is perfect
• More complex models are more susceptible to errors due to variance

Multivariate Linear Regression

[Figure: illustrations of low vs. high bias and low vs. high variance]

Why linear regression:
• has few parameters to estimate (p)
• is a really restrictive model: low variance, higher bias
• should be good for data with few observations and a large number of covariates...
• ... but we can't use it in this situation (a sketch of why follows this list)
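A minimal sketch (not from the slides) of what goes wrong: with more covariates than observations, the least-squares problem is underdetermined, and R's lm() cannot estimate all of the coefficients.

# Minimal sketch (not from the slides): ordinary least squares breaks down
# when there are more covariates than observations (p > n).
set.seed(1)
n <- 10                               # observations
p <- 20                               # covariates
X <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)

fit <- lm(y ~ X)
# The system is underdetermined, so R reports NA for the coefficients
# it cannot estimate.
sum(is.na(coef(fit)))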
Multivariate Linear Regression

Idea: if we have a large number of covariates compared to observations, say n < 2p, it is best to estimate most coefficients as 0!
• not enough information to determine all coefficients
• try to estimate the ones with a strong signal
• set everything else to 0 (or close to it)

Coefficients of 0 may not be a bad assumption: if we have 1,000s of coefficients, are they all equally important?

Gene Expression

Example: microarray gene expression data
• gene expression: the level at which the information in a gene is used in the synthesis of a functional gene product (usually a protein)
• gene expression data can be used to determine the subtype of a cancer (e.g. which type of B-cell lymphoma?) or to predict recurrence, survival time, etc.
• problem: thousands of genes, hundreds of patients, p > n!

Intuition: only a handful of genes should affect outcomes.

• gene expression levels are continuous values
• data: observation i is the gene expression levels from patient i, attached to an outcome for that patient (survival time)
• covariates: expression levels for p genes

• collinearity: does it matter which gene is selected for prediction? No!
• overfitting: fitting p′ non-zero coefficients to n observations with p′ << n means less fitting of noise

Regularized Linear Regression

Regularization:
• still minimize the RSS
• place a penalty on large values of β_1, ..., β_p (why not β_0? we can always easily estimate the mean)
• add this penalty to the objective function
• solve for β̂!

New objective function (a code sketch of this objective follows):

    β̂ = argmin_β (1/2) Σ_{i=1}^n (y_i − x_i β)² + λ Σ_{j=1}^p penalty(β_j)

λ acts as a weight on the penalty: low values mean few coefficients near 0, high values mean many coefficients near 0.
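A minimal sketch (not from the slides) of this objective as R code, with the penalty left pluggable; the function name and argument layout are illustrative.

# Minimal sketch (not from the slides): the regularized objective with a
# pluggable penalty. beta[1] plays the role of beta_0 (the intercept, which
# is not penalized); X holds one covariate per column.
regularized_rss <- function(beta, X, y, lambda, penalty) {
  y_hat <- beta[1] + X %*% beta[-1]      # linear mean function
  0.5 * sum((y - y_hat)^2) +             # residual sum of squares
    lambda * sum(penalty(beta[-1]))      # penalty on beta_1, ..., beta_p
}

# e.g. with a squared penalty:
#   regularized_rss(beta, X, y, lambda = 0.5, penalty = function(b) b^2)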
Regularized Linear Regression

Regularization: what is a good penalty function? The same penalties we use to fit errors:

• Ridge regression (squared penalty):

    β̂^Ridge = argmin_β (1/2) Σ_{i=1}^n (y_i − x_i β)² + λ Σ_{j=1}^p β_j²

• Lasso regression (absolute value penalty):

    β̂^Lasso = argmin_β (1/2) Σ_{i=1}^n (y_i − x_i β)² + λ Σ_{j=1}^p |β_j|

Comparing Ridge and Lasso

              Ridge                                    Lasso
Objective     (1/2) Σ_i (y_i − x_i β)² + λ Σ_j β_j²    (1/2) Σ_i (y_i − x_i β)² + λ Σ_j |β_j|
Estimator     (XᵀX + λI)⁻¹ Xᵀy                         no closed form
Coefficients  most close to 0                          most exactly 0
Stability     robust to changes in X, y                not robust to changes in X, y

Regularized linear regression is fantastic for low-signal data sets or those with p >> n:
• Ridge: good when many coefficients affect the response but none is large (gene expression)
• Lasso: good when you want an interpretable estimator

Choosing λ

Both ridge and lasso have a tunable parameter, λ:
• use cross validation to find the best λ,

    λ̂ = argmin_λ Σ_{i=1}^n (y_i − x_i β̂_{−i,λ})²

  where β̂_{−i,λ} is the fit with observation i held out
• try out many values
• see how well each works on "development" data

Wrapup

Regression
• Workhorse technique of data analysis
• Fundamental tool that we saw before ("Logistic Regression")
• Important to understand the interpretation of regression parameters
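As a runnable companion to the regularized-regression section, here is a minimal sketch (not from the slides) of choosing λ by cross validation. It assumes the glmnet package, one common R implementation of ridge and lasso; the simulated data and all names are illustrative.

# Minimal sketch (not from the slides): lasso with lambda chosen by cross
# validation, using the glmnet package on simulated data where only a
# handful of covariates matter.
library(glmnet)

set.seed(1)
n <- 100
p <- 50
X <- matrix(rnorm(n * p), nrow = n)
beta_true <- c(rep(2, 5), rep(0, p - 5))   # only 5 non-zero coefficients
y <- as.vector(X %*% beta_true + rnorm(n))

cv_fit <- cv.glmnet(X, y, alpha = 1)       # alpha = 1 is lasso, alpha = 0 is ridge
cv_fit$lambda.min                          # lambda chosen by cross validation
coef(cv_fit, s = "lambda.min")             # most coefficients are exactly 0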