Chapter 4.2
Regression Topics
Credits
Hastie, Tibshirani, Friedman Chapter 3
Padhraic Smyth Lecture Notes
Wolfgang Jank Lecture Notes
Regression Review
• Linear Regression models a numeric outcome as a linear function of
several predictors.
• It is the king of all statistical and data mining models
– ease of interpretation
– mathematically concise
– tends to perform well for prediction, even under violations of assumptions
• Characteristics
– numeric response - ideally real valued
– numeric predictors- but not necessarily
Linear Regression Model
• Basic model:  y = a0 + a1 x1 + … + ap xp + e
• you are not modelling y itself, but the mean of y for a given x:
  E[y | x] = a0 + a1 x1 + … + ap xp
• Simple Regression - one x
  – easy to describe, good for mathematics, but not used often in data mining
• Multiple regression - many x
  – response surface is a plane…harder to conceptualize
• Useful as a baseline model
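As a minimal baseline-fit sketch in R (using the built-in mtcars data, which is not from these slides; the variable choices are arbitrary):

```r
# Baseline linear regression: a numeric outcome (mpg) modelled as a
# linear function of two numeric predictors (wt, hp).
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)                                            # coefficients, standard errors, R^2
predict(fit, newdata = data.frame(wt = 3, hp = 110))    # estimated mean of y at a given x
```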
Linear Regression Model
• Assumptions:
– linearity
– constant variance
– normality of errors
• residuals ~ Normal(0, sigma^2)
• Assumptions must be checked,
  – but if inference is not the goal, you can accept some deviation from
    the assumptions (don’t tell the statisticians I said that!)
• Multicollinearity also an issue
– creates unstable estimates
Fitting the Model
• We can look at regression as a matrix problem:  y = X a + e
  (y is the n × 1 response vector, X is the n × (p+1) design matrix, a is the coefficient vector)
• We want the a which minimizes a score function - the sum of squared errors -
  and it is minimized by  a = (X' X)^-1 X' y
Fitting models: in-sample
Minimize the sum of the squared errors:

  S = Σ e_i^2 = e' e
    = (y – X a)' (y – X a)
    = y' y – a' X' y – y' X a + a' X' X a
    = y' y – 2 a' X' y + a' X' X a

Take the derivative of S with respect to a:

  dS/da = –2 X' y + 2 X' X a

Set this to 0 to find the minimum of S as a function of a:

  –2 X' y + 2 X' X a = 0
  X' X a = X' y
  a = (X' X)^-1 X' y

Prediction follows easily:  ŷ_k = x_k' a
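A small sketch of this closed-form solution in R on simulated data (nothing here comes from the slides), checking that the normal-equations estimate agrees with lm():

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 - 1.0 * x2 + rnorm(n)    # simulated data with known coefficients

X <- cbind(1, x1, x2)                       # design matrix with an intercept column

# a = (X'X)^-1 X'y, computed via solve() rather than an explicit matrix inverse
a_hat <- solve(t(X) %*% X, t(X) %*% y)

cbind(normal_equations = a_hat, lm = coef(lm(y ~ x1 + x2)))   # identical estimates

y_hat <- X %*% a_hat                        # predictions: y_hat_k = x_k' a
```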
Fitting regression: out-of-sample
• Can also optimize “a” based on a hold-out sample and a search over all “a”s
  – But how to search over all values of all a’s?
  – This minimizes hold-out MSE – and might give a different answer
    • MSE = Bias^2 + Variance
• Because of the nice algebraic form, the in-sample solution is typically used
  – But a different loss function may change things
  – R^2 measures a ratio between
    • the regression sum of squares - how much of the variance the regression explains, and
    • the total sum of squares - how much variation there is altogether
  – If R^2 is close to 1, your fit is good. But be careful.
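A sketch of the hold-out idea on simulated data (the 70/30 split is an arbitrary choice for illustration):

```r
set.seed(2)
dat <- data.frame(x = rnorm(200))
dat$y <- 1 + 2 * dat$x + rnorm(200)

train <- sample(nrow(dat), size = 140)          # 70/30 train / hold-out split
fit   <- lm(y ~ x, data = dat[train, ])

summary(fit)$r.squared                          # in-sample: R^2 = regression SS / total SS

pred <- predict(fit, newdata = dat[-train, ])
mean((dat$y[-train] - pred)^2)                  # out-of-sample: MSE on the hold-out set
```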
Limitations of Linear Regression
• True relationship of X and Y might be non-linear
– Suggests generalizations to non-linear models
• Correlation/Collinearity among the X variables
– Can cause numerical instability
– Problems in interpretability (identifiability)
• Least squares includes all the variables in the model…
– But what if p=100 and only 3 variables are related to Y?
Checking assumptions
• linearity
– look to see if transformations make relationships ‘more’
linear
• normality of errors
– Histograms and qqplots
• Non-constant variance
– Beware of ‘fanning’ residuals
• Time effects
– Can be revealed in an ordering plot
• Influence
– Use hat matrix
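In R, most of these checks come straight from the residuals of a fitted lm object; a rough sketch (the mtcars fit is just a stand-in):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

plot(fitted(fit), resid(fit)); abline(h = 0, lty = 2)   # linearity / fanning (non-constant variance)

hist(resid(fit))                                        # normality of errors
qqnorm(resid(fit)); qqline(resid(fit))

plot(resid(fit), type = "l")                            # ordering plot for time effects

plot(hatvalues(fit), type = "h")                        # influence via the hat matrix
```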
Checking influence
• Influence:  ŷ = H y,  where  H = X (X' X)^-1 X'
• H is called the hat matrix because it “puts the hat on y”
• The ith diagonal element of H, the leverage h_i, quantifies the influence that the
  observed response y_i has on its predicted value ŷ_i
• It measures the distance between the X values for the ith case and the means of
  the X values for all n cases
• The leverage h_i is a number between 0 and 1 inclusive
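A short sketch of these quantities in R (for any fitted lm object; the 2·mean(h) cutoff is just a common rule of thumb):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(fit)        # diagonal of H = X (X'X)^-1 X'; each h_i lies in [0, 1]
sum(h)                     # the leverages sum to the number of fitted parameters

which(h > 2 * mean(h))     # flag high-leverage cases (rule of thumb)

cooks.distance(fit)        # influence: combines leverage with the size of the residual
influence.measures(fit)    # DFBETAS, DFFITS, Cook's distance and hat values together
```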
Influence Measures for Linear Model
• There are a few quite influential (and extreme) points…
• What to do?
Diagnostic Plots
Model selection: finding the best k variables
• If noisy variables are included in the model, they can affect the overall performance.
• Best to remove any predictors which have no effect, lest random patterns look significant.
• Searching all possible models
  – How many are there? (2^p for p candidate predictors)
  – Heuristic search is used to search over model space:
    • Forward or backward stepwise search
    • Leaps-and-bounds techniques do an exhaustive search
  – In-sample: penalize for complexity (AIC, BIC, Mallow’s Cp)
  – Out-of-sample: use cross-validation
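A sketch of both search styles in R: step() (base R) does stepwise search under AIC, and regsubsets() from the leaps package (assumed installed) does the exhaustive leaps-and-bounds search scored by Mallow's Cp:

```r
library(leaps)                                   # assumed installed

full <- lm(mpg ~ ., data = mtcars)               # all candidate predictors

step_fit <- step(full, direction = "both", trace = 0)   # heuristic stepwise search, AIC penalty
summary(step_fit)

subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)  # exhaustive search over subset sizes
summary(subsets)$cp                              # Mallow's Cp for the best model of each size
```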
R ‘step’: uses AIC
Leaps output
R ‘leaps’ : uses Cp
Generalizing Linear Regression
Complexity versus Goodness of Fit
[Figure, built up over several slides: training data (y plotted against x) fit by three
candidate models - too simple? too complex? about right?]
Complexity and Generalization
[Figure: a score function (e.g., squared error) plotted against model complexity.
Strain(θ) keeps decreasing as complexity grows, while Stest(θ) turns back up;
the optimal model complexity is where Stest(θ) is smallest.
Complexity = degrees of freedom in the model (e.g., number of variables).]
Non-linear models, linear in parameters
• We can add additional polynomial terms to our equations, e.g.
    y = a0 + a1 x1 + a2 x2 + a3 x1 x2 + a4 x1^2 + e
• non-linear functional form, but linear in the parameters (so still referred
  to as “linear regression”)
  – We can just treat the x1 x2 and x1^2 terms as additional fixed inputs
  – In fact we can add in any non-linear input functions, e.g. fixed transformations g_j(x)
Comments:
  – The number of parameters can explode => greater chance of overfitting
  – Adding complexity: must use penalties! (A short fitting sketch follows below.)
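In R this just amounts to adding constructed terms to the formula; a sketch (again on mtcars, chosen arbitrarily):

```r
# Non-linear in the inputs, but still linear in the parameters a_j
fit_poly <- lm(mpg ~ poly(wt, 2) + hp,  data = mtcars)   # quadratic term in wt
fit_int  <- lm(mpg ~ wt * hp,           data = mtcars)   # wt, hp and their interaction wt:hp
fit_fun  <- lm(mpg ~ log(wt) + I(hp^2), data = mtcars)   # arbitrary fixed input functions

AIC(fit_poly, fit_int, fit_fun)   # extra terms add parameters, so compare with a penalty
```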
Non-linear (both model and parameters)
• We can generalize further to models that are non-linear in all aspects, e.g.
    y = a0 + Σ_{k=1..K} a_k g_k(x; b_k) + e
  where the g_k’s are non-linear functions (K of them), each with its own parameters
• This is called a Neural Network (we’ll talk about it later)
• Closed-form (analytical) solutions are rare:
  this is a multivariate non-linear optimization problem (which may be quite difficult!)
Generalizing Regression
• Generalized Linear Models (GLM):  g(E[y]) = a0 + a1 x1 + … + ap xp
  – the right-hand side is a linear combination of the predictors
  – y is an independent random variable with a distribution based on the error structure
  – g is the link function which connects the two
• GLMs are defined by
  – error structure (Gaussian, Poisson, Binomial)
  – linear predictor (single variables, interactions, polynomials)
  – link function (identity, log, reciprocal)
Logistic Regression
• Logistic regression is the most common GLM.
• The response in this case is binary (0,1); Y follows a Bernoulli or Binomial distribution.
• We model p, the probability of a 1 occurring.
• For mathematical convenience, we model the odds:
  – odds: p / (1 – p)
  – log odds are even better - the logit function:
    logit(p) = log( p / (1 – p) ) = a0 + a1 x1 + … + ap xp
  – this lives on the whole real line, rather than [0,1]
• Deviance: –2 × (difference in log-likelihood from the saturated model)
Logistic Regression
• Interpretation of coefficients changes: each a_j is the change in the log odds for a
  one-unit change in x_j, so exp(a_j) is an odds ratio.
Logistic example
• womensrole data (R handbook)
– Survey in 1975: “Women should take care of running their homes and leave
  running the country up to men”
    education   sex  agree  disagree
1           0  Male      4         2
2           1  Male      2         0
3           2  Male      4         0
4           3  Male      6         3
5           4  Male      5         5
6           5  Male     13         7
7           6  Male     25         9
8           7  Male     27        15
9           8  Male     75        49
10          9  Male     29        29
11         10  Male     32        45
…
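A sketch of how this model might be fit in R; the womensrole data ship with the HSAUR package that accompanies the R handbook (assumed installed):

```r
library("HSAUR")                               # assumed installed; provides the womensrole data
data("womensrole", package = "HSAUR")

# Binomial response supplied as (agree, disagree) counts; logit link is the default
fit <- glm(cbind(agree, disagree) ~ education + sex,
           data = womensrole, family = binomial())

summary(fit)         # coefficients are on the log-odds (logit) scale
exp(coef(fit))       # exponentiate to read them as odds ratios
```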
Womensrole Logistic fit
Other GLMs
• Another useful GLM is for count data
  – model Y ~ Poisson(lambda)
  – the link is the log: log(E[Y]) = linear predictor
  – Also called ‘log-linear’ models
  – Typically used for counts:
    • People at a store
    • Calls at a help center
    • Spam messages in an hour
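A minimal sketch with simulated counts (the variable names are made up for illustration):

```r
set.seed(3)
hour  <- runif(200, 0, 24)
calls <- rpois(200, lambda = exp(0.5 + 0.08 * hour))   # simulated help-center call counts

# Log-linear (Poisson) regression: log(E[Y]) = a0 + a1 * hour
fit <- glm(calls ~ hour, family = poisson(link = "log"))

summary(fit)
exp(coef(fit))   # multiplicative change in the expected count per unit increase in hour
```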
Shrinkage Models: Ridge Regression
• Variable selection is a binary process
  – That makes it high variance: small changes in the data can affect the final model
  – Can we have a more continuous process, where each variable is ‘partly’ included?
• Ridge regression “shrinks” the coefficients by imposing a penalty for the model “size”
• Minimize the penalized sum of squares:
    Σ_i ( y_i – a0 – Σ_j a_j x_ij )^2  +  λ Σ_j a_j^2
• λ is a complexity parameter which controls the amount of shrinkage - the larger λ is,
  the more the coefficients are shrunk towards 0.
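A sketch using the glmnet package (assumed installed); alpha = 0 selects the ridge (L2) penalty, and glmnet standardizes the inputs by default. MASS::lm.ridge is an alternative:

```r
library(glmnet)                                          # assumed installed

x <- as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")])  # arbitrary example predictors
y <- mtcars$mpg

ridge <- glmnet(x, y, alpha = 0)      # whole path of solutions over a grid of lambda values
plot(ridge, xvar = "lambda")          # coefficients shrink toward 0 as lambda grows

cv <- cv.glmnet(x, y, alpha = 0)      # pick lambda by cross-validation
coef(ridge, s = cv$lambda.min)        # coefficients at the selected lambda
```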
Ridge Regression
• The model imposes a penalty on the coefficient size
• Since the a’s depend on the units, care must be taken to standardize the inputs.
• Also, you can show that the ridge estimates are a linear function of y:
    a_ridge = (X' X + λ I)^-1 X' y
• Adding a positive constant to the diagonal allows inversion even if X' X is not of full rank
  – So ridge can be used in cases where p > n!
• In general: increasing bias, decreasing variance
  – Often decreases MSE
Ridge coefficients
df(λ) is a one-to-one monotone function of λ such that df(λ) ranges from 0 to p.
  – λ = 0 (s = p): least squares solution; p degrees of freedom
  – λ = ∞ (s = 0): heaviest shrinkage; all parameter estimates = 0; zero degrees of freedom
Look at the coefficient profile plot as a function of the degrees of freedom df(λ).
Lasso
• Very similar to ridge with one important difference:
• the L2 penalty (λ Σ a_j^2) is replaced by an L1 penalty (λ Σ |a_j|)
• this has an interesting effect on the profile plot:
  – if λ is large enough, estimates go exactly to zero
  – so the lasso does continuous variable selection
  – s = 1 is the least squares answer
  – s = 0: all estimates are 0
  – s = 0.5 was the value chosen by cross-validation
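The same glmnet sketch with alpha = 1 gives the lasso (package and variable choices as in the ridge example above):

```r
library(glmnet)                                          # assumed installed

x <- as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")])
y <- mtcars$mpg

lasso <- glmnet(x, y, alpha = 1)      # alpha = 1 selects the L1 (lasso) penalty
plot(lasso, xvar = "lambda")          # profile plot: coefficients hit exactly zero

cv <- cv.glmnet(x, y, alpha = 1)      # choose lambda by cross-validation
coef(lasso, s = cv$lambda.min)        # sparse fit: some coefficients are exactly 0
```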
lasso coefficients
[Coefficient profile plot against s = df(λ) / p.]
Note how the parameters shrink exactly to zero! This is the appeal of the lasso
(in addition to good performance).
Principal Components Regression
• Create principal components (PCs) from the original data vectors and use them in any
  of the above regression schemes
• Removes the ‘less important’ parts of the data space, while creating a reduced data set
• Since each PC is a linear combination of the original variables, we can express the
  solution in terms of coefficients on the original variables.
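A base-R sketch of the idea (prcomp() for the components, then an ordinary regression on the first k of them; the pls package's pcr() wraps the same steps):

```r
x <- scale(mtcars[, c("wt", "hp", "disp", "qsec")])   # standardize before PCA
y <- mtcars$mpg

pc <- prcomp(x)                       # principal components of the predictors
k  <- 2                               # number of components kept (would be chosen by CV)

fit_pcr <- lm(y ~ pc$x[, 1:k])        # regress the response on the first k PC scores

# Because each PC is a linear combination of the original variables, the PC
# coefficients map back to coefficients on the (standardized) original inputs:
pc$rotation[, 1:k] %*% coef(fit_pcr)[-1]
```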
Comparison of results (prostate data)
Term          LS     Best Subset   Ridge    Lasso     PCR
Intercept    2.465      2.477      2.452    2.468    2.497
Lcavol       0.680      0.740      0.420    0.533    0.543
Lweight      0.236      0.316      0.238    0.169    0.289
Age         -0.141                -0.046            -0.152
Lbph         0.210                 0.162    0.002    0.214
Svi          0.305                 0.227    0.094    0.315
Lcp         -0.288                 0.000            -0.051
Gleason     -0.021                 0.040             0.232
Pgg45        0.267                 0.133            -0.056
Test Error   0.521      0.492      0.492    0.479    0.449
Std Error    0.179      0.143      0.165    0.164    0.105
Cross validation allows all of these different methods to be
comparable to each other
Nonparametric Modeling
• A nonparametric model does not assume a fixed set of parameters to be estimated
  (thus the name nonparametric)
  – Its general form is Y = f(X) + ε
  – Typically, we only assume that f() is some smooth, continuous function
  – Also, we typically assume independent and identically distributed errors,
    ε ~ N(0, σ^2), but that’s not necessary
  – 1-D nonparametric regression is closely related to density estimation
Advantages & Disadvantages
• Advantages
  – More flexibility leads to a better fit to the data, and often also to better
    predictive capability
  – Smoothness can also lead to entirely new concepts, such as dynamics (via
    derivatives) and thus to flexible differential-equation models, etc.
• Disadvantages
  – Much more complexity; harder to explain
Fitting Nonparametric models
• How do we estimate the function f()?
  – Restrictions on f: smoothness, continuity, existence of the first and second derivatives
  – options for estimating f include scatterplot smoothers, regression splines, smoothing
    splines, B-splines, thin-plate splines, wavelets, and many, many more…
  – one particularly popular option is the smoothing spline
Splines
• Splines are piecewise polynomials smoothly
connected together. The joining points of the
polynomial pieces are called knots.
• Smoothing splines are splines that are
penalized against too much local variability
(and thus appear smoother)
– Must be differentiable at the knots
– linear spline: 0-times differentiable
– cubic spline: twice differentiable
Piecewise Polynomial cont.
• Piecewise constant and piecewise linear fits
  [Figure: piecewise constant and piecewise linear fits; vertical lines mark the “knots”.]
Spline cont. (Linear Spline)
Spline cont. (Cubic Spline)
Definition of Smoothing Splines
• Smoothing Splines arise as the solution to the
following simple regression problem
– Find a piecewise polynomial f(x) with smooth breakpoints
– f(x) minimizes the penalized sum-of-squares
    RSS(f, λ) = Σ_{i=1}^{n} { y_i – f(x_i) }^2  +  λ ∫ { f''(t) }^2 dt
    (the first term measures fit; the second penalizes curvature)
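In R, smooth.spline() minimizes exactly this penalized criterion; a sketch on the built-in cars data (not the Prestige data used on the next slides):

```r
plot(cars$speed, cars$dist)                                             # raw data

lines(smooth.spline(cars$speed, cars$dist, spar = 0.4), col = "red")    # little smoothing (small lambda)
lines(smooth.spline(cars$speed, cars$dist, spar = 1.0), col = "blue")   # heavy smoothing (large lambda)

smooth.spline(cars$speed, cars$dist, cv = TRUE)   # or let cross-validation choose the penalty
```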
Example of Smoothing Splines
• Two smoothing splines fit to the Prestige data
  – Little smoothing, λ small (red line)
  – Heavy smoothing, λ large (blue line)
The smoothing parameter
• The magnitude of λ affects the quality of the
smoother; many ad-hoc approaches to find a
“good” smoothing parameter
– Visual trial and error
– Minimize mean-squared error of the fit
– Cross-validation, optimization on hold-out
sample, etc
Prestige Data Revisited
• Education (X1) and Income (X2) influence the perceived Prestige (Y) of a profession
• Is there a linear relationship between the X’s and Y?
• If we’re not sure of the type of relationship between X and Y, nonparametric regression
  can be a very useful exploratory tool (a fitting sketch follows below).
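The estimates on the next two slides look like output from an additive model with smooth terms; a sketch of how such a fit might be produced with mgcv::gam() (the Prestige data live in the carData package; both packages assumed installed):

```r
library(mgcv)       # assumed installed; gam() chooses smoothing parameters by GCV
library(carData)    # assumed installed; contains the Prestige data

# Additive model: prestige = constant + s(income) + s(education) + error
fit_gam <- gam(prestige ~ s(income) + s(education), data = Prestige)
summary(fit_gam)              # edf and approximate significance of the smooth terms

# Classical linear regression for comparison
fit_lm <- lm(prestige ~ income + education, data = Prestige)
summary(fit_lm)

plot(fit_gam, pages = 1)      # estimated smooth functions of income and education
```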
Additive Model Estimates
Parametric coefficients:
              Estimate   std. err.   t ratio   Pr(>|t|)
constant       46.833     0.6889      67.98     <2e-16      <- the intercept

Approximate significance of smooth terms:
                 edf     chi.sq     p-value
s(income)       3.118     58.12     8.39e-10                <- inference for income and education,
s(education)    3.177    152.79     <2e-16                     similar to an F-test

R-sq.(adj) = 0.836    Deviance explained = 84.7%    GCV score = 52.143    <- measures of model fit
Compare to Classical Regression
Parametric coefficients:
               Estimate    std. err.   t ratio   Pr(>|t|)
(Intercept)    -6.8478      3.219       -2.127    0.0359
income          0.0013612   0.000224     6.071    2.36e-08
education       4.1374      0.3489      11.86     <2e-16

R-sq.(adj) = 0.794    Deviance explained = 79.8%    GCV score = 62.847

Better model fit for the nonparametric model!!
Function Estimates from Additive
Regression Model
• What is the nature of the relationship of the individual predictor
variables and prestige?