Advanced Statistics
for Interventional
Cardiologists
What you will learn
• Introduction
• Basics of multivariable statistical modeling
• Advanced linear regression methods
• Hands-on session: linear regression
• Bayesian methods
• Logistic regression and generalized linear model
• Resampling methods
• Meta-analysis
• Hands-on session: logistic regression and meta-analysis
• Multifactor analysis of variance
• Cox proportional hazards analysis
• Hands-on session: Cox proportional hazard analysis
• Propensity analysis
• Most popular statistical packages
• Conclusions and take home messages
(topics spread over the 1st day and the 2nd day)
What you will learn
• Multiple Linear Regression
  – Basic concepts
    • Some examples
    • Linear regression model
    • Estimation and testing the regression coefficients
    • Testing and evaluating the regression model
    • Predictions
  – Multiple regression models
  – The model building process
  – Selection of predictor variables
  – Model diagnostics
  – Remedial measures
  – Model validation
  – Qualitative predictor variables
  – Practical examples
Multiple linear regression
Example from cardiology
How can I predict the impact of balloon dilation
pressure on post-procedure minimum lumen
diameter (MLD), taking concomitantly into
account diabetes status and ACC/AHA lesion
type?
In other words, how can I predict the impact of a
given variable (the independent variable) on another
continuous variable (the dependent variable), while
concomitantly taking other variables into account?
Multiple linear regression
Example from cardiology
[Figure: scatterplot of minimum lumen diameter (mm) and time to restenosis (days) against dilation pressure during stenting (ATM) and lesion length]
Multiple linear regression
Example from cardiology
Briguori et al, Eur Heart J 2002
Multiple linear regression
Example from cardiology
Briguori et al, Eur Heart J 2002
Multiple linear regression
Example from cardiology
Mauri et al, Circulation 2005
Multiple linear regression
Example from cardiology
Mauri et al, Circulation 2005
Multiple linear regression
Fitness demo example
Aerobic fitness can be evaluated using a special test that measures
the oxygen uptake of a person running on a treadmill for a prescribed
distance. However, it would be more economical to evaluate fitness
with a formula that predicts oxygen uptake using simple
measurements such as running time and pulse.
The table on the next slide shows the partial listing of the
measurements for 31 subjects.
Objective
Find the regression equation that allows us to make the most
reliable predictions of O2 uptake.
Multiple Regression
Fitness Data
Subject    Sex   Age   Weight   Oxy     RunTime   RunPulse   RstPulse   MaxPulse
Donna      F     42    68,15    59,57   8,17      166        40         172
Gracie     F     38    81,87    60,06   8,63      170        48         186
Luanne     F     43    85,84    54,30   8,65      156        45         168
Mimi       F     50    70,87    54,63   8,92      146        48         155
Chris      M     49    81,42    49,16   8,95      180        44         185
Allen      M     38    89,02    49,87   9,22      178        55         180
Nancy      F     49    76,32    48,67   9,40      186        56         188
Patty      F     52    76,32    45,44   9,63      164        48         166
Suzanne    F     57    59,08    50,55   9,93      148        49         155
Teresa     F     51    77,91    46,67   10,00     162        48         168
Bob        M     40    75,07    45,31   10,07     185        62         185
Harriett   F     49    73,37    50,39   10,08     168        67         168
Jane       F     44    73,03    50,54   10,13     168        45         168
Harold     M     48    91,63    46,77   10,25     162        48         164
Sammy      M     54    83,12    51,85   10,33     166        50         170
Buffy      F     52    73,71    45,79   10,47     186        59         188
Trent      M     52    82,78    47,47   10,50     170        53         172
Jackie     F     47    79,15    47,27   10,60     162        47         164
Ralph      M     43    81,19    49,09   10,85     162        64         170
Jack       M     51    69,63    40,84   10,95     168        57         172
Annie      F     51    67,25    45,12   11,08     172        48         172
Kate       F     45    66,45    44,75   11,12     176        51         176
Carl       M     54    79,38    46,08   11,17     156        62         165
Don        M     44    89,47    44,61   11,37     178        62         182
Effie      F     48    61,24    47,92   11,50     170        52         176
George     M     47    77,45    44,81   11,63     176        58         176
Iris       F     40    75,98    45,68   11,95     176        70         180
Mark       M     57    73,37    39,41   12,63     174        58         176
Steve      M     54    91,63    39,20   12,88     168        44         172
Vaughn     M     44    81,42    39,44   13,08     174        63         176
William    M     45    87,66    37,39   14,03     186        56         192
Multiple linear regression
• Simple linear regression is a statistical model to predict
the value of one continuous variable Y (dependent,
response) from another continuous variable X
(independent, predictor, covariate, prognostic factor).
• Multiple linear regression is a natural extension of the
simple linear regression model
– We use it to investigate the effect on the response
variable of several predictor variables, simultaneously
– It is a hypothetical model of the relationship between
several independent variables and a response
variable.
• Let’s start by reviewing the concepts of the simple linear
regression model.
Simple linear regression
The theoretical model
Y = β0 + β1 X + ε
[Diagram: regression line; at each value of the independent variable there is a distribution of the dependent variable, whose mean lies on the regression line]
Simple linear regression
The estimated model
Yestimated = b0 + b1 X
Ŷi = Yestimated
Residuals: ei = Yi − Ŷi
b1: slope (change in Ŷ per unit increase in X)
b0: intercept
[Diagram: the fitted line in the (X, Y) plane, with an observed point Yi, its fitted value Ŷi and the residual]
Linear Regression
An estimation problem
• Estimate the model parameters β0 and β1 as well as possible.
• Find the 'best-fitting' line (Y = b0 + b1·X) through the measured coordinates.
• How do we find this line? Minimize the sum of squared differences (least squares) between y and yestimated.
Parameter Estimators
b1 = (n Σ Xi Yi - Σ Xi Σ Yi) / ( n Σ Xi2 – (Σ Xi)2 )
b0 = mean Y – (b1 . mean X)
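The two estimator formulas above can be checked numerically. Below is a minimal Python/NumPy sketch with made-up x and y values (purely illustrative, not data from the course examples):

```python
import numpy as np

# Hypothetical example data (not from the course examples)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.2])
n = len(x)

# Least-squares estimators, exactly as in the formulas above
b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x        # fitted values
residuals = y - y_hat      # e_i = Y_i - Y_hat_i
print(b0, b1)
```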
Linear regression
Assumptions
Least square assumptions
Linear relation between X and Y :
E(εi) = 0 for all i
Homoscedasticity (constant variance):
Var(εi) = σ2 for all i
Uncorrelated residuals :
E(εi εj) = 0 for all i ≠ j
Significance tests assumptions
Residuals are normally distributed :
εi ~ N(0, σ2) for all i
Linear Regression
Testing parameter significance
• Testing the significance of the regression parameters allows us to evaluate whether there is
an effect of the independent variable on the dependent variable.
• If the slope β1 is significantly different from zero, then we will conclude that the
independent variable has a significant effect on the dependent variable.
• Is the slope β1 significantly different from 0?
  – No: the value of x will not improve the prediction of y over the ordinary mean
  – Yes: knowledge of the x-values will significantly improve the predictions
• We can test if the slope is significantly different from 0 in two ways:
  – Using a classical t-test
  – Construction of a confidence interval
Linear Regression
Testing parameter significance
• The t-test is based on a function of the slope estimate b1 which
has the t-distribution when the null hypothesis of 'zero slope' is
true:
tdf = b1 / SE(b1)
Decision rule: reject H0 if |t| ≥ tα/2, n−2
• Testing with confidence intervals:
b1 − tn−2, α/2 · SEb1 < β1 < b1 + tn−2, α/2 · SEb1
Decision rule: reject H0 if 0 is not in the confidence region
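Both approaches can be sketched in a few lines, assuming SciPy is available and reusing the same kind of made-up data; the critical value tα/2, n−2 comes from scipy.stats.t:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.2])
n = len(x)

b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

s2 = np.sum(resid**2) / (n - 2)                       # MSE, estimate of sigma^2
se_b1 = np.sqrt(s2 / np.sum((x - x.mean())**2))       # standard error of the slope

t_stat = b1 / se_b1                                   # t with n-2 df under H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

t_crit = stats.t.ppf(0.975, df=n - 2)                 # two-sided 95% critical value
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)    # reject H0 if 0 is outside
print(t_stat, p_value, ci_b1)
```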
Simple Linear Regression
Growth example
Examine how weight to height
ratio changes as kids grow up.
Measurements were taken from
72 children between birth and
70 months.
What are your conclusions
looking at the scatterplot ?
Linear Regression
Growth Example
Linear Fit
ratio = 0,66562 + 0,00528 age
Summary of Fit
  RSquare                      0,822535
  RSquare Adj                  0,819999
  Root Mean Square Error       0,051653
  Mean of Response             0,855556
  Observations (or Sum Wgts)   72

Analysis of Variance
  Source    DF   Sum of Squares   Mean Square   F Ratio    Prob>F
  Model      1   0,8656172        0,865617      324,4433   <,0001
  Error     70   0,1867605        0,002668
  C Total   71   1,0523778

Parameter Estimates
  Term        Estimate    Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
  Intercept   0,6656231   0,012176    54,67     <,0001     0,6413397   0,6899065
  age         0,0052759   0,000293    18,01     <,0001     0,0046917   0,0058601

Evaluate the effect of age on ratio using the Parameter Estimates table.
What you will learn
• Multiple Linear Regression
  – Basic concepts
    • Some examples
    • Linear regression model
    • Estimation and testing the regression coefficients
    • Testing and evaluating the regression model
    • Predictions
  – Multiple regression models
  – The model building process
  – Selection of predictor variables
  – Model diagnostics
  – Remedial measures
  – Model validation
  – Qualitative predictor variables
  – Practical examples
Linear regression
Assessing the fit of the model
Analysis of Variance
summarizes info about the
sources of variation in the
data by splitting the total sum
of squares into two or more
components.
[Scatterplot of Y versus X with the fitted regression line]

SSTotal = SSModel + SSError

Σi=1..n (Yi − Ȳ)² = Σi=1..n (Ŷi − Ȳ)² + Σi=1..n (Yi − Ŷi)²
Linear regression
ANOVA table for simple regression
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F-ratio   P-value
Model                 SSM              1                    SSM/dfM       MSM/MSE
Error                 SSE              n-2                  SSE/dfE
Total                 SST              n-1                  SST/dfT
In general, dfM equals the number of predictor terms in the model; dfE
equals the number of observations minus the number of estimated
coefficients in the model; and dfT equals the number of observations
minus 1 (if the intercept is included in the model)
Linear regression
Significance of the model
• F-ratio (and its p-value) is used to evaluate significance of
the regression model.
• MSModel / MSError ≈ F1;n-2 for simple regression
• If the observed F-ratio is greater than a critical F-value, then
we can conclude that this ratio is significantly greater than 1
and that the regression model explains a significant portion of
the variation of the response variable.
• Since the simple regression model has only one predictor
variable, the F-ratio can also be used to determine if β1 = 0,
i.e. if there is a significant effect of the predictor on the
response variable (note: squared t-ratio = F-ratio)
Linear regression
Measure of Fit
• A natural measure of ‘goodness of fit’ for the regression model is
the coefficient of determination: R2
• R2 expresses the % of variability of the dependent variable
explained by the variations in the independent variable
• R² = ( total variation (SST) – unexplained variation (SSE) ) / total variation (SST)
• Properties
– R2 varies between 0 and 1 (perfect fit)
– The larger R2 is, the more variation of Y is explained by the predictor
X
– Large values indicate a “strong” relationship between predictor and
response variables.
– For simple linear regression, R2 is the square of the correlation
coefficient.
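As an illustration of the ANOVA decomposition, the F-ratio and R² described above, here is a hedged NumPy/SciPy sketch with made-up data (illustrative only):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.2])
n = len(x)

b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean())**2)        # total sum of squares
ssm = np.sum((y_hat - y.mean())**2)    # model sum of squares
sse = np.sum((y - y_hat)**2)           # error sum of squares

f_ratio = (ssm / 1) / (sse / (n - 2))          # MSM / MSE with 1 and n-2 df
p_value = stats.f.sf(f_ratio, 1, n - 2)
r2 = 1 - sse / sst                             # = ssm / sst
print(f_ratio, p_value, r2)
```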
Linear Regression
Growth Example
Linear Fit
ratio = 0,66562 + 0,00528 age
Summary of Fit
  RSquare                      0,822535
  RSquare Adj                  0,819999
  Root Mean Square Error       0,051653
  Mean of Response             0,855556
  Observations (or Sum Wgts)   72

Analysis of Variance
  Source    DF   Sum of Squares   Mean Square   F Ratio    Prob>F
  Model      1   0,8656172        0,865617      324,4433   <,0001
  Error     70   0,1867605        0,002668
  C Total   71   1,0523778

Parameter Estimates
  Term        Estimate    Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
  Intercept   0,6656231   0,012176    54,67     <,0001     0,6413397   0,6899065
  age         0,0052759   0,000293    18,01     <,0001     0,0046917   0,0058601

Is this regression model significant?
What % of the variation of the response is explained by the model?
Linear regression
Examine residuals
It is always a good idea to look at the residuals from a
regression (the difference between the actual values and the
predicted values). Residuals should be scattered randomly
about a mean of zero.
Linear Regression
Residual analysis
• Residual is the difference between the
observed value and the fitted value at a
certain level of X
ei = Yi − Ŷi
• Once a model has been fit, the residuals
are used to:
  – Validate the assumptions of the model
  – Diagnose departures from those assumptions
  – Identify corrective methods to refine the model
Linear Regression
Predictions
• Two types of predictions of the response Y at new levels of X, the
predictor variable, can be made using a validated regression
equation.
– Estimating the mean response (mean of the distribution of the response
Y at new level of X)
– Estimating a new observation of Y (individual outcome drawn from the
distribution of the response Y at new level of X)
• We calculate prediction intervals using the variance of the
estimators. The estimation interval for an individual outcome is
always larger than the one for a mean response, since the variance
of the individual responses is greater than the variance of the mean
response.
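The widths of the two intervals can be compared using the standard textbook formulas for simple regression (assumed here, not given explicitly on the slide): the standard error for the mean response uses 1/n + (x0 − x̄)²/Sxx, and the one for a new observation adds 1 under the square root. A minimal sketch:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.2])
n = len(x)

b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x))**2) / (n - 2))   # root MSE
sxx = np.sum((x - x.mean())**2)
t_crit = stats.t.ppf(0.975, df=n - 2)

x0 = 3.5                          # hypothetical new level of X
y0 = b0 + b1 * x0

se_mean = s * np.sqrt(1/n + (x0 - x.mean())**2 / sxx)       # mean response
se_new  = s * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / sxx)   # individual outcome

ci_mean = (y0 - t_crit * se_mean, y0 + t_crit * se_mean)
pi_new  = (y0 - t_crit * se_new,  y0 + t_crit * se_new)
print(ci_mean, pi_new)            # the interval for a new observation is always wider
```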
Linear Regression
Confidence band for regression line
• This band allows us to see the region in which the entire regression
line lies. It is useful for determining the appropriateness of a fitted
regression function.
Linear Regression
Demonstration
How to do a linear regression analysis with
the EXCEL data analysis option?
What you will learn
• Multiple Linear Regression
  – Basic concepts
  – Multiple regression models
  – The model building process
  – Selection of predictor variables
  – Model diagnostics
  – Remedial measures
  – Model validation
  – Qualitative predictor variables
  – Practical examples
Multiple linear regression
Fitness demo example
Aerobic fitness can be evaluated using a special test that measures
the oxygen uptake of a person running on a treadmill for a prescribed
distance. However, it would be more economical to evaluate fitness
with a formula that predicts oxygen uptake using simple
measurements such as running time and pulse.
The table on the next slide shows the partial listing of the
measurements for 31 subjects.
Objective
Find the regression equation that allows us to make the most
reliable predictions of O2 uptake.
Multiple Regression
Fitness Data
Subject
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
Donna
Gracie
Luanne
Mimi
Chris
Allen
Nancy
Patty
Suzanne
Teresa
Bob
Harriett
Jane
Harold
Sammy
Buffy
Trent
Jackie
Ralph
Jack
Annie
Kate
Carl
Don
Effie
George
Iris
Mark
Steve
Vaughn
William
Sex
F
F
F
F
M
M
F
F
F
F
M
F
F
M
M
F
M
F
M
M
F
F
M
M
F
M
F
M
M
M
M
Age
42
38
43
50
49
38
49
52
57
51
40
49
44
48
54
52
52
47
43
51
51
45
54
44
48
47
40
57
54
44
45
Weight
68,15
81,87
85,84
70,87
81,42
89,02
76,32
76,32
59,08
77,91
75,07
73,37
73,03
91,63
83,12
73,71
82,78
79,15
81,19
69,63
67,25
66,45
79,38
89,47
61,24
77,45
75,98
73,37
91,63
81,42
87,66
Oxy
59,57
60,06
54,30
54,63
49,16
49,87
48,67
45,44
50,55
46,67
45,31
50,39
50,54
46,77
51,85
45,79
47,47
47,27
49,09
40,84
45,12
44,75
46,08
44,61
47,92
44,81
45,68
39,41
39,20
39,44
37,39
RunTime
8,17
8,63
8,65
8,92
8,95
9,22
9,40
9,63
9,93
10,00
10,07
10,08
10,13
10,25
10,33
10,47
10,50
10,60
10,85
10,95
11,08
11,12
11,17
11,37
11,50
11,63
11,95
12,63
12,88
13,08
14,03
RunPulse
166
170
156
146
180
178
186
164
148
162
185
168
168
162
166
186
170
162
162
168
172
176
156
178
170
176
176
174
168
174
186
RstPulse
40
48
45
48
44
55
56
48
49
48
62
67
45
48
50
59
53
47
64
57
48
51
62
62
52
58
70
58
44
63
56
MaxPulse
172
186
168
155
185
180
188
166
155
168
185
168
168
164
170
188
172
164
170
172
172
176
165
182
176
176
180
176
172
176
192
Multiple linear regression
• To investigate, simultaneously, the effect of several
independent X variables on the response variable Y.
• Even if we are interested in the effect of only one
variable, it is wise to include other variables as
regressors to reduce the residual variance and
improve significance tests of the effects.
• Multiple regression models often improve precision of
the predictions.
Multiple linear regression
• The model: yi = β0 + β1 xi1 + β2 xi2 + ... + βp xip + εi
• βi represents the change in the response for
an incremental change in the ith predictor
variable, while all other predictor variables
are held constant. βi is referred to as the
partial regression coefficient.
• Assumptions: residuals (or errors) εi are
independent and normally distributed with
mean 0 and standard deviation σ
Multiple linear regression
Estimated additive model with two predictors
yi = b0 + b1 xi1 + b2 xi2 + ei
b0 : Y value when X1 and X2 equal 0
b1 : effect of X1 on Y controlling for X2
b2 : effect of X2 on Y controlling for X1
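A minimal sketch of fitting such a two-predictor additive model by least squares with NumPy (the data are simulated and the variable names only mimic the fitness example; they are not the real measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
run_time  = rng.uniform(8, 14, size=31)        # hypothetical predictor x1
run_pulse = rng.uniform(150, 190, size=31)     # hypothetical predictor x2
oxy = 90 - 3.0 * run_time - 0.05 * run_pulse + rng.normal(0, 2, size=31)

X = np.column_stack([np.ones(31), run_time, run_pulse])   # design matrix with intercept
b, *_ = np.linalg.lstsq(X, oxy, rcond=None)               # b0, b1, b2

fitted = X @ b
residuals = oxy - fitted
print(b)
```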
Multiple linear regression
Fitness Example
Response: O2 Uptake

Summary of Fit
  RSquare                      0,761424
  RSquare Adj                  0,744383
  Root Mean Square Error       2,693374
  Mean of Response             47,37581
  Observations (or Sum Wgts)   31

The model explains 76% of the variation around the mean of O2 uptake.

Parameter Estimates
  Term        Estimate     Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
  Intercept   93,088766    8,248823    11,29     <,0001     76,191947   109,98559
  Run Time    -3,140188    0,373265    -8,41     <,0001     -3,90478    -2,375595
  Run Pulse   -0,073509    0,050514    -1,46     0,1567     -0,176983   0,0299637

Effect Test
  Source      Nparm   DF   Sum of Squares   F Ratio   Prob>F
  Run Time    1       1    513,41745        70,7746   <,0001
  Run Pulse   1       1    15,36208         2,1177    0,1567

Evaluate the effect of Run Time and Run Pulse on O2 uptake.

If you take an effect away from the model, then the SSError will be higher. The
difference in SS is used to construct an F-test on whether the contribution of the
variable is significant.

Prediction equation: O2 uptake = 93,089 – 3,14 Run Time – 0,074 Run Pulse
Multiple Regression
Whole Model Leverage Plot
Graphical method to view the whole-model hypothesis using a scatterplot
of actual response values against the
predicted values.
The vertical distance from a point to
the 45° line is the residual error. The
idea is to get a feel for how much
better the slope line fits than the
horizontal line at the mean.
If the confidence curves cross the
horizontal line, the whole model F
test is significant.
Multiple Regression
Effect Leverage Plots
partial plot, partial regression leverage plot, added variable plot
Plot shows how each effect contributes to the fit after all the other effects
have been included in the model.
The distance from each point to the sloped line measures the residual for
the full model. The distance from each point to the horizontal line measures
the residual for a model without the effect (reduced model).
Multiple regression models
Interaction model with two predictor variables
Y = β0 + β1 X1 + β2 X2 + β12 X1X2 + ε
The change in the response associated with X1 depends on the level
of X2 (and vice versa)
Multiple regression models
Quadratic model with two predictor variables
Y = β0 + β1 X1 + β2 X2 + β12 X1X2 + β11 X1² + β22 X2² + ε
Quadratic models can only represent three basic types of shapes:
mountains, valleys, saddles.
Multiple regression models
• Model terms may be divided into the following categories
– Constant term
– Linear terms / main effects (e.g. X1)
– Interaction terms (e.g. X1X2)
– Quadratic terms (e.g. X1²)
– Cubic terms (e.g. X1³)
• Models are usually described by the highest term
present
– Linear models have only linear terms
– Interaction models have linear and interaction terms
– Quadratic models have linear, quadratic and first order
interaction terms
– Cubic models have terms up to third order.
What you will learn
• Multiple Linear Regression
  – Basic concepts
  – Multiple regression models
  – The model-building process
  – Selection of predictor variables
  – Model diagnostics
  – Remedial measures
  – Model validation
  – Qualitative predictor variables
  – Interaction effects
  – Practical examples
The model-building process
Source:
Applied Linear Statistical Models,
Neter, Kutner, Nachtsheim, Wasserman
The model-building process
Aerobic fitness can be evaluated using a special test that measures
the oxygen uptake of a person running on a treadmill for a prescribed
distance. However, it would be more economical to evaluate fitness
with a formula that predicts oxygen uptake using simple
measurements such as running time and pulse. The table shows the
partial listing of the measurements for 31 subjects.
Age   Weight   O2 uptake   Runtime   Rest pulse   Run pulse   Max pulse
38    81,87    60,055      8,63      48           170         186
38    89,02    49,874      9,22      55           178         180
40    75,07    45,313      10,07     62           185         185
40    75,98    45,681      11,95     70           176         180
42    68,15    59,571      8,17      40           166         172
44    85,84    54,297      8,65      45           156         184
Objective
Find the regression equation that allows us to make the most
reliable predictions of O2 uptake.
Selection of predictor variables
Objective
• Goal is to find the “best” model that is able to predict well over the
range of interest. Many variables (especially in exploratory
observational studies) may contribute to the response variation.
Include them all in the model ?
• No, the model must also be parsimonious
– A simple model with only a few relevant explanatory variables is easier to
understand and use than a complex model
– Pareto principle – a few variables contribute most of the information
• Reducing the number of variables reduces 'multicollinearity'
• Increasing the ratio of observations to variables reduces the variability of b and
improves prediction.
• Gathering and maintaining data on many factors is difficult and
expensive
• In the words of Albert Einstein :
“Make things as simple as possible but no simpler”
Model Selection Methods
• Find the ‘best’ model by comparing all
possible regression models using all
combinations of the explanatory variables.
• Automatic model selection methods
– Forward selection
– Backward elimination
– Stepwise selection
Model selection methods
All Possible Subsets
• A regression model is estimated for each possible subset
of the predictor variables. The constant term is included
in each of the subset models.
• If there are k possible terms to be included in the
regression model, there are 2^k possible subsets to be
estimated.
• Purpose of the all subsets approach is to identify a small
group of models that are “good” according to a specified
criterion, so that further detailed examination can be
done of these models.
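A sketch of the all-subsets idea: enumerate every combination of the candidate predictors (the intercept is always kept) and record a fit criterion for each of the 2^k models. R² is used here only for brevity; any of the criteria listed on the following slides could be plugged in instead:

```python
import numpy as np
from itertools import combinations

def all_subsets_r2(X, y, names):
    """Fit every subset of the predictor columns of X (intercept always included)
    and return each subset together with its R^2."""
    n = len(y)
    sst = np.sum((y - y.mean())**2)
    k = X.shape[1]
    results = []
    for size in range(k + 1):
        for idx in combinations(range(k), size):
            Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in idx])
            b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            sse = np.sum((y - Xs @ b)**2)
            results.append((tuple(names[j] for j in idx), 1 - sse / sst))
    return results      # 2^k candidate models
```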
Determining the Best Model
• A variety of statistics have been developed to help
determine the “best” subset model. These include :
– Coefficient of determination: R²
– Adjusted R²
– Relative PRESS
– Mallows Cp
– Akaike Information Criterion (AIC)
– Schwarz Information Criterion (SIC or BIC)
• Residual analysis and various graphical displays also
help to select the “best” subset regression model.
Coefficient of Determination
• R2 measures the proportion of the total variation in the response that
is explained by the regression model, i.e.
R² = SSModel / SSTotal = 1 − SSError / SSTotal
• Values of R2 close to 1 indicate a good fit to the data. However, R2
can be arbitrarily increased by adding extra terms in the model
without necessarily improving the predictive ability of the model.
• To correct for the number of terms in the model, Adjusted R2 is
defined as
R²adj = 1 − (SSError / dfError) / (SSTotal / dfTotal),   with R²adj ≤ 1

• Large difference between R² and R²adj indicates the presence of
unnecessary terms in the model.
Relative PRESS
• The PRedictive Error Sums of Squares is given by,
PRESS = Σi=1..n ( Yi − Ŷ(i) )²

where Ŷ(i) represents the predicted value of Yi using the model that was fitted
with the ith observation deleted.
• Relative PRESS is similar to R² and R²adj :
Relative PRESS = 1 – PRESS / SSTotal
• It can be shown that
Relative PRESS ≤ 1
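A sketch of computing PRESS for an ordinary least-squares fit. It relies on the standard leave-one-out identity e(i) = ei / (1 − hii), so the model does not have to be refitted n times; X is assumed to already contain the intercept column:

```python
import numpy as np

def press_statistic(X, y):
    """PRESS and Relative PRESS for a linear model with design matrix X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b                                    # ordinary residuals
    h = np.diag(X @ np.linalg.pinv(X.T @ X) @ X.T)   # leverages h_ii
    press = np.sum((e / (1 - h))**2)                 # sum of squared deleted residuals
    sst = np.sum((y - y.mean())**2)
    return press, 1 - press / sst                    # PRESS, Relative PRESS
```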
Mallows Cp Statistic
Cp = (n − p) sp² / s² − (n − 2p)
• n is the number of observations
• p is the number of terms in the model (including the intercept)
• sp2 is the estimate of error from the subset model containing p terms
• s2 is the estimate of error from the model containing all possible terms.
s2 is assumed to be a “good” estimate of the experimental error.
If a model with p terms is adequate, Cp ≈ p
Otherwise, if the model contains unnecessary terms, Cp > p
AIC and BIC
AIC (Akaike Information Criterion) and BIC (Schwarz Information
Criterion) are two popular model selection methods. They not only
reward goodness of fit, but also include a penalty that is an
increasing function of the number of estimated parameters. This
penalty discourages overfitting. The preferred model is the one with
the lowest value for AIC or for BIC. These criteria attempt to find the
model that best explains the data with a minimum of free
parameters. The AIC penalizes free parameters less strongly than
does the Schwarz criterion.
AIC = 2k + n [ln (SSError / n)]
BIC = n ln (SSError / n) + k ln(n)
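The criteria from the last few slides can be collected in one helper. This is a sketch that follows the formulas given above, under the assumption that k in AIC/BIC is taken as p, the number of estimated parameters including the intercept:

```python
import numpy as np

def model_criteria(sse_p, p, sse_full, p_full, n, sst):
    """Selection criteria for a candidate model with p terms (intercept included).
    sse_p: error SS of the candidate model; sse_full / p_full: the full model."""
    r2 = 1 - sse_p / sst
    r2_adj = 1 - (sse_p / (n - p)) / (sst / (n - 1))      # adjusted R^2
    s2_full = sse_full / (n - p_full)                     # "good" estimate of the error
    s2_p = sse_p / (n - p)
    cp = (n - p) * s2_p / s2_full - (n - 2 * p)           # Mallows Cp (close to p if adequate)
    aic = 2 * p + n * np.log(sse_p / n)
    bic = n * np.log(sse_p / n) + p * np.log(n)
    return r2, r2_adj, cp, aic, bic
```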
Model selection methods
Stepwise regression
• While the All Subsets procedure is the only way to guarantee that
the “best” subset is chosen, other techniques have been developed
that are less computationally intensive.
• Stepwise Regression is an automatic model selection procedure
which enters or removes terms sequentially.
• The basic steps in the procedure are:
– Compute an initial regression model
– Enter terms that significantly improve the fit (p-value less than the p-to-enter value), or remove terms that do not significantly harm the fit (p-value greater than the p-to-remove value)
– Compute the new regression model
– Stop when entering or removing terms will not significantly improve the
model
• Unfortunately, the order in which the terms are entered or removed
can lead to different models.
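For illustration, here is a sketch of the entry half of the procedure only (pure forward selection based on partial F-test p-values); a full stepwise routine would also re-check already-entered terms against a p-to-remove threshold:

```python
import numpy as np
from scipy import stats

def forward_selection(X, y, names, p_to_enter=0.05):
    """Add, at each step, the predictor with the smallest partial-F p-value,
    as long as that p-value is below p_to_enter."""
    n = len(y)
    selected, remaining = [], list(range(X.shape[1]))

    def sse_of(cols):
        Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        return np.sum((y - Xs @ b)**2)

    while remaining:
        sse_cur = sse_of(selected)
        best = None
        for j in remaining:
            sse_new = sse_of(selected + [j])
            df_err = n - (len(selected) + 2)       # intercept + selected terms + candidate
            f = (sse_cur - sse_new) / (sse_new / df_err)
            p_val = stats.f.sf(f, 1, df_err)
            if best is None or p_val < best[1]:
                best = (j, p_val)
        if best[1] >= p_to_enter:
            break
        selected.append(best[0])
        remaining.remove(best[0])
    return [names[j] for j in selected]
```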
Variable Selection
Fitness Example
[Screenshot of the stepwise variable selection report: p-values to enter or remove a variable from the model, model comparison criteria, an overview of the variable selection procedure, and the resulting best model]
Best Model with Leverage Plot
Fitness Example
Response: Oxy

Summary of Fit
  RSquare                      0,835425
  RSquare Adj                  0,810106
  Root Mean Square Error       2,321437
  Mean of Response             47,37581
  Observations (or Sum Wgts)   31

Parameter Estimates
  Term        Estimate     Std Error   t Ratio   Prob>|t|
  Intercept   97,185202    11,65754    8,34      <,0001
  Age         -0,189218    0,09439     -2,00     0,0555
  Runtime     -2,775606    0,341602    -8,13     <,0001
  RunPulse    -0,345272    0,118209    -2,92     0,0071
  MaxPulse    0,2714364    0,134383    2,02      0,0538

Effect Test
  Source      Nparm   DF   Sum of Squares   F Ratio   Prob>F
  Age         1       1    21,65647         4,0186    0,0555
  Runtime     1       1    355,78610        66,0199   <,0001
  RunPulse    1       1    45,97614         8,5314    0,0071
  MaxPulse    1       1    21,98663         4,0799    0,0538
Multiple linear regression
Example from cardiology
Mauri et al, Circulation 2005
Multiple linear regression
Example from cardiology
Mauri et al, Circulation 2005
What you will learn
• Multiple Linear Regression
  – Basic concepts
  – Multiple regression models
  – The model-building process
  – Selection of predictor variables
  – Model diagnostics
  – Remedial measures
  – Model validation
  – Qualitative predictor variables
  – Practical examples
What you will learn
• Multiple Linear Regression
  – Model diagnostics and Remedial measures
    • Assumptions
    • Scaling of Residuals
    • Distribution of Residuals
    • Unequal Variances
    • Outliers and Influential Observations
    • Multicollinearity
  – Remedial Measures
  – Model Validation
Building the Regression Model
Diagnostics
Remember that the residuals are useful for :
Validating the assumptions of the model
Diagnosing departures from those assumptions
Identifying corrective methods to refine the model
ei = Yi − Ŷi
Linear regression
Assumptions
Least square assumptions
Linear relation between X and Y :
E(εi) = 0 for all i
Homoscedasticity (constant variance):
Var(εi) = σ2 for all i
Uncorrelated residuals :
E(εi εj) = 0 for all i ≠ j
Significance tests assumptions
Residuals are normally distributed :
εi ~ N(0, σ2) for all i
Scaling of Residuals
• Raw – the unscaled residuals
These are in the original units of the response. They
are used to diagnose extreme values in magnitude
and direction.
• Standardized – to make the residuals more
comparable
Standardized residuals are calculated dividing the
residuals by their standard deviation. They are
approximately normal with a mean of zero and a
standard deviation of one. Hence approximately 95%
of the standardized residuals should fall between
−2 and +2.
Examining residuals
Growth example
The picture you hope to see is the
residuals scattered randomly about a
mean of zero. Look for patterns and
for points that violate this random
scatter.
The plot above is suspicious.
Why? Violation against which
assumption?
Examining residuals
Regression line with
extreme points excluded.
Now the residuals are scattered
randomly around a mean of zero.
But excluding points is probably not
the best solution.
Examining residuals
Comparison of Linear and Second Order
Polynomial Fit
Ratio = b0 + b1 age + b2 age2 + residual
There still appears to be a
pattern in the residuals.
Continue by fitting a model with
higher order terms.
Examining residuals
Comparison of Linear and Higher Order
Polynomial Fit
The residuals of the fourth
order polynomial fit do not
show any pattern anymore.
Violation Linearity Assumption
Remedial measures
• Polynomial regression: previous example
• Transformation (Box-Cox) of the independent variable
• Broken line (or piecewise) regression
Distribution of residuals
Normality check
• Graphical Methods
– Histogram, Box-plot
– Normal Probability Plot (NPP)
• Formal Tests given in many stat. packages
– Shapiro-Wilk statistic W for small samples (< 50)
– Kolmogorov-Smirnov test
• If the normality assumption for Y is violated, we
can try to solve the problem using transformations
of Y (use the NPP and Box-Cox procedure to decide which
transformation).
Distribution of residuals
Normality check
Normal quantile plots of the residuals for the simple regression and for the polynomial 4th order model.

Test for Normality (Shapiro-Wilk W Test)
  Simple regression:            W = 0,862849   Prob<W  <,0001
  Polynomial 4th order model:   W = 0,973852   Prob<W  0,3949

Conclusions?
Normality check
Example from cardiology: late loss
Mauri et al, Circulation 2005
Normality check
Example from cardiology: late loss
Mauri et al, Circulation 2005
Why graphics are important?
Statistical reports for four analyses

What do you expect about the underlying data?

Linear Fit: Y1 = 3,00009 + 0,50009 X1
  Summary of Fit: RSquare 0,666542; RSquare Adj 0,629492; Root Mean Square Error 1,236603; Mean of Response 7,500909; Observations 11
  Analysis of Variance: Model (DF 1, SS 27,510001, MS 27,5100, F Ratio 17,9899); Error (DF 9, SS 13,762690, MS 1,5292); C Total (DF 10, SS 41,272691); Prob>F 0,0022
  Parameter Estimates: Intercept 3,0000909 (Std Error 1,124747, t 2,67, p 0,0257); X1 0,5000909 (Std Error 0,117906, t 4,24, p 0,0022)

Linear Fit: Y2 = 3,00091 + 0,5 X2
  Summary of Fit: RSquare 0,666242; RSquare Adj 0,629158; Root Mean Square Error 1,237214; Mean of Response 7,500909; Observations 11
  Analysis of Variance: Model (DF 1, SS 27,500000, MS 27,5000, F Ratio 17,9656); Error (DF 9, SS 13,776291, MS 1,5307); C Total (DF 10, SS 41,276291); Prob>F 0,0022
  Parameter Estimates: Intercept 3,0009091 (Std Error 1,125302, t 2,67, p 0,0258); X2 0,5 (Std Error 0,117964, t 4,24, p 0,0022)

Linear Fit: Y3 = 3,00245 + 0,49973 X3
  Summary of Fit: RSquare 0,666324; RSquare Adj 0,629249; Root Mean Square Error 1,236311; Mean of Response 7,500909; Observations 11
  Analysis of Variance: Model (DF 1, SS 27,470008, MS 27,4700, F Ratio 17,9723); Error (DF 9, SS 13,756192, MS 1,5285); C Total (DF 10, SS 41,226200); Prob>F 0,0022
  Parameter Estimates: Intercept 3,0024545 (Std Error 1,124481, t 2,67, p 0,0256); X3 0,4997273 (Std Error 0,117878, t 4,24, p 0,0022)

Linear Fit: Y4 = 3,00173 + 0,49991 X4
  Summary of Fit: RSquare 0,666707; RSquare Adj 0,629675; Root Mean Square Error 1,235695; Mean of Response 7,5; Observations 11
  Analysis of Variance: Model (DF 1, SS 27,490001, MS 27,4900, F Ratio 18,0033); Error (DF 9, SS 13,742490, MS 1,5269); C Total (DF 10, SS 41,232491); Prob>F 0,0022
  Parameter Estimates: Intercept 3,0017273 (Std Error 1,123921, t 2,67, p 0,0256); X4 0,4999091 (Std Error 0,117819, t 4,24, p 0,0022)
Why graphics are important?
Regression lines for the four analyses
Unequal Variances
• Model Assumption: var (εi) = var (yi) is a constant σ2
• Heteroscedasticity (unequal variances) does not
bias the estimates of the regression parameters β
but it causes variances of parameter estimates to be
large and can affect R2, s2 and tests substantially.
• Detect heteroscedasticity through plots of the
(standardized) residuals against ŷ
• Remedial actions :
– Variance stabilizing transformations of yi (eg.
square root, logarithm)
– Weighting the regression parameters (WLS)
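A sketch of the weighted least squares remedy mentioned above: each observation is weighted (typically by the reciprocal of its variance), which is equivalent to ordinary least squares after scaling the rows by the square root of the weights:

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Minimize sum_i w_i * (y_i - x_i b)^2 for design matrix X (intercept included).
    A common choice is w_i proportional to 1 / Var(y_i)."""
    sw = np.sqrt(w)
    Xw = X * sw[:, None]      # scale each row of X by sqrt(w_i)
    yw = y * sw
    b, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return b
```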
Unequal Variances
example
[Plot of residuals e against fitted values ŷ]
Outliers and Influential
Observations
• Sometimes, while most of the observations fit the model, some of
the observations clearly do not. This occurs when there is something
wrong with the observations or if the model is faulty.
• A point has great influence when it has a large effect on the
parameter estimates.
• There are two types of influential observations
– Outliers: extreme observations of the dependent variable that exhibit
large residuals
– Leverage points: observation with extreme value on one of the
independent variables
• Outliers are detected by examining various types of residuals.
• Leverage points are detected with the leverage hii. This measure
describes how far away a point is from the centroid of all points in
the space of the independent variables. So, leverage is a measure
of remoteness.
• Influence is assessed by examining the residuals and the leverages.
Outliers and Influential
Observations
Influential Outlier
Influential Leverage point
Outliers and Influential diagnostics
• For detecting leverage points we examine the leverage
hii of the observations.
( X i  X )2
1
hii   n
n
2
(
X

X
)
 j
j 1
• leverage value of ith observation increases with
increasing devation of Xi of the mean of this variable.
• hii takes values between 0 and 1 and the sum is p
(number of parameters)
• hii with values greater than 2p/n are called leverage
points and need further investigation
Outliers and Influential diagnostics
• For detecting outliers that do not belong to the
model, Studentized (deleted) residuals are mostly
used.
ei* = ei / ( s(i) · √(1 − hii) )   ~   tn−p−1
s(i) is equivalent to the standard deviation s if least
squares is run after deleting the ith case.
hii is the leverage
Outliers and Influential diagnostics
• Not all influential points have large ei* s.
• Additional measures are defined, measures that tell us
how much parameter b or ŷ would change if a given
point were deleted.
• Most used measures are:
  – DFBETAS
  – DFFITS
  – Cook's Distance Di = ( ei² · hii ) / ( (k + 1) · s² · (1 − hii)² )
• Criteria can be defined to decide if point is influential.
“Large” Cook’s Distance values are the most influential
and should be investigated.
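A sketch that pulls the diagnostics of the last few slides together (leverages, studentized deleted residuals and Cook's distances) for an ordinary least-squares fit; X is assumed to contain the intercept column, so p here plays the role of k + 1 in the Cook's distance formula:

```python
import numpy as np

def influence_diagnostics(X, y):
    """Leverages, studentized deleted residuals and Cook's distances."""
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    h = np.diag(X @ np.linalg.pinv(X.T @ X) @ X.T)            # leverages h_ii
    s2 = np.sum(e**2) / (n - p)                               # MSE
    s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)    # s_(i)^2 with obs i deleted
    t_del = e / np.sqrt(s2_del * (1 - h))                     # studentized deleted residuals
    cooks_d = (e**2 * h) / (p * s2 * (1 - h)**2)              # Cook's distance
    high_leverage = h > 2 * p / n                             # rule-of-thumb flag
    return h, t_del, cooks_d, high_leverage
```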
Checking for Influential
Observations
Contour curves show Cook’s Distances corresponding to the
50th, 75th and 95th influence percentiles. As a result, 5% of
the residuals will always fall outside the 95th percentile curve.
Multicollinearity
• The quality of estimates, as measured by their
variances, can be seriously affected if the
independent variables are closely related (highly
correlated) to each other.
• An obvious method of assessing the degree to
which each independent variable is related to all
other independent variables is to examine Rj², the R²
obtained by regressing Xj on the remaining predictors.
• Popular measures:
– tolerance TOLj = 1 – Rj²
– variance inflation factor VIFj = TOLj⁻¹ = 1 / (1 – Rj²)
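A sketch of computing the variance inflation factors directly from their definition (each Rj² is obtained by regressing one predictor on the others; X holds the predictor columns without the intercept):

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - Rj^2) for every column j of the predictor matrix X."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        yj = X[:, j]                                            # predictor j as the response
        Xj = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(Xj, yj, rcond=None)
        sse = np.sum((yj - Xj @ b)**2)
        sst = np.sum((yj - yj.mean())**2)
        r2_j = 1 - sse / sst
        vifs.append(1.0 / (1.0 - r2_j))
    return np.array(vifs)
```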
Remedial measures
Overview
• Depending on the nature of the problem, one or
more of the following may be appropriate :
– Consider transforming the data
– Consider using Weighted Least Squares
– Consider using Robust Regression
– Use a more complicated equation, e.g. add quadratic or cubic terms to the model
– Add an omitted predictor variable
– Consider more complicated models, e.g. time series models
– Consider variable reduction techniques
– Consider ridge regression
Model validation
• After remedial measures have been taken and
diagnostics analyzed to make sure that the remedial
measures were successful, the final step of the model-building process is the validation of the selected
regression model.
• Three basic ways of validating a regression model are:
– Collection of new data to check predictive ability of the model
→ preferred method, but not practical
– Comparison of the results with theory, simulation results or
previous empirical results
– Split the study data into model-building (training) and validation
data set randomly : cross-validation. Validation data set is used
to re-estimate and compare the regression coefficients.
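A sketch of the cross-validation approach from the last bullet: randomly split the data into a model-building (training) part and a validation part, re-estimate the coefficients on each, and measure the prediction error of the training-set model on the held-out data:

```python
import numpy as np

def split_validation(X, y, train_frac=0.7, seed=0):
    """Random training/validation split for a linear model (X includes the intercept)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    train, valid = idx[:n_train], idx[n_train:]

    b_train, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    b_valid, *_ = np.linalg.lstsq(X[valid], y[valid], rcond=None)

    mspe = np.mean((y[valid] - X[valid] @ b_train)**2)   # out-of-sample prediction error
    return b_train, b_valid, mspe                        # compare coefficients and error
```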
Sample Size
• For the linear regression model the desired sample size
is determined via an analysis of the power of the F-test.
• For this analysis we need following info :
– the number of predictors you want to analyze (rule of thumb:
number of observations must be at least 15 times number of
predictors in the study)
– the significance level alpha (type 1 error)
– the size of the effect in the population (use measures for
proportion of explained variation such as R2 for the whole model)
• The bigger the expected effect, the smaller the size of
the sample.
What you will learn
• Multiple Linear Regression
  – Basic concepts
  – Multiple regression models
  – The model-building process
  – Selection of predictor variables
  – Model diagnostics
  – Remedial measures
  – Model validation
  – Qualitative predictor variables
  – Practical examples
Categorical predictors
• So far we have utilized only quantitative
predictor variables in the regression models.
• Qualitative predictor variables can also be
incorporated in the linear regression model by
using indicator variables.
• Indicator variables or dummy variables are
variables that take only two values, e.g. 0 and 1
or -1 and 1.
• Let’s have a look at the simplest example :
one qualitative predictor with two levels:
dichotomous predictor.
Dichotomous predictor
Fitness example
Regression model with one dichotomous predictor variable.
yi = β0 + β1 xi + ε
with xi = -1 for male and xi = 1 for female
Response: Oxy

This is a model for the familiar two-sample testing problem:
H0: µ1 = µ2 against H1: µ1 ≠ µ2

Summary of Fit
  RSquare                      0,234692
  RSquare Adj                  0,208302
  Root Mean Square Error       4,740032
  Mean of Response             47,37581
  Observations (or Sum Wgts)   31

Parameter Estimates
  Term        Estimate     Std Error   t Ratio   Prob>|t|
  Intercept   47,293867    0,851778    55,52     <,0001
  Sex[F-M]    2,5401333    0,851778    2,98      0,0057

Effect Test
  Source   Nparm   DF   Sum of Squares   F Ratio   Prob>F
  Sex      1       1    199,81246        8,8932    0,0057

What can you conclude from the statistical output about the mean response for the males versus the females?
Dichotomous predictor
Leverage Plots
Polychotomous predictor
• To incorporate a categorical variable with more than
two levels in the regression model, we will create
several dummy variables.
• For a variable with g categories, we need to
incorporate only g -1 dummy variables in the model.
The category for which the dummy variable is not in
the model is the reference category.
Category   X1 (A)   X2 (B)   X3 (C)
A          1        0        0
B          0        1        0
C          0        0        1
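A sketch of building the g − 1 indicator columns for a categorical predictor using 0/1 reference-cell coding; the JMP-style coding used in the drug example later (−1 for the reference level) differs only in how the reference rows are coded:

```python
import numpy as np

def dummy_columns(levels, reference):
    """Return g-1 indicator (0/1) columns, omitting the reference category."""
    levels = np.asarray(levels)
    categories = [c for c in np.unique(levels) if c != reference]
    cols = np.column_stack([(levels == c).astype(float) for c in categories])
    return cols, categories

# Hypothetical usage with a three-level factor and reference category "C"
group = np.array(["A", "B", "C", "A", "B", "C"])
X_dummies, cats = dummy_columns(group, reference="C")
print(cats)         # the two non-reference categories
print(X_dummies)    # columns X1(A) and X2(B) as in the table above
```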
Polychotomous predictor
Example
Let’s examine the effect of drug treatment (a, d or placebo) on the
response (LBS=bacteria count) in 30 subjects.
Which statistical
method would you use
to tackle this problem ?
Polychotomous predictor
Example
Response: LBS

Summary of Fit
  RSquare                      0,227826
  RSquare Adj                  0,170628
  Root Mean Square Error       6,070878
  Mean of Response             7,9
  Observations (or Sum Wgts)   30

Parameter Estimates
  Term              Estimate   Std Error   t Ratio   Prob>|t|
  Intercept         7,9        1,108386    7,13      <,0001
  Drug[a-placebo]   -2,6       1,567494    -1,66     0,1088
  Drug[d-placebo]   -1,8       1,567494    -1,15     0,2609

Effect Test
  Source   Nparm   DF   Sum of Squares   F Ratio   Prob>F
  Drug     2       2    293,60000        3,9831    0,0305

From linear regression to the general linear model.
Coding scheme for the categorical variable defines the interpretation of the parameter estimates.
Polychotomous predictor
Example - Regressor construction
• Terms are named according to how the regressor variables were
constructed.
• Drug[a-placebo] means that the regressor variable is coded as 1
when the level is “a”, - 1 when the level is “placebo”, and 0
otherwise.
• Drug[d-placebo] means that the regressor variable is coded as 1
when the level is “d”, - 1 when the level is “placebo”, and 0
otherwise.
• You can write the notation for Drug[a-placebo] as ([Drug=a] − [Drug=placebo]), where [Drug=a] is a one-or-zero indicator of
whether the drug is "a" or not.
• The regression equation then looks like:
Y = b0 + b1*((Drug=a) − (Drug=placebo)) + b2*((Drug=d) − (Drug=placebo)) + error
Polychotomous predictor
Example – Parameters and Means
• With this regression equation, the predicted values for the levels "a", "d"
and "placebo" are the means for these groups.
• For the "a" level:
Pred y = 7.9 + -2.6*(1-0) + -1.8*(0-0) = 5.3
For the "d" level:
Pred y = 7.9 + -2.6*(0-0) + -1.8*(1-0) = 6.1
For the "placebo" level:
Pred y = 7.9 + -2.6*(0-1) + -1.8*(0-1) = 12.3
• The advantage of this coding system is that the regression parameter
tells you how different the mean for that group is from the mean of the
group means (the average response across all levels).
• Other coding schemes result in different interpretations of the
parameters.
What did you learn - hopefully
• Multiple Linear Regression
  – Basic concepts
  – Multiple regression models
  – The model-building process
  – Selection of predictor variables
  – Model diagnostics
  – Remedial measures
  – Model validation
  – Qualitative predictor variables
  – Practical examples
Linear regression:
do-it-yourself with SPSS
Scatterplot
Linear regression
Linear regression
Linear regression
Questions?
Take home messages
• Multiple regression models are generally used to study the
effect of continuous independent variables on a
continuous response variable, simultaneously.
• Before modelling, look carefully at the data and always
start the analysis by plotting all variables individually and
against each other. Evaluate correlations.
• Keep in mind the model assumptions from the start.
• Building the ‘best’ regression model is an iterative
process.
• Be careful with automatic variable selection procedures.
• Examine residuals and leverages to diagnose and remedy
the model.
• Validate the model using an independent data set.
And now a real break…
For further slides on these topics
please feel free to visit the
metcardio.org website:
http://www.metcardio.org/slides.html