Linear regression
Brian Healy, PhD
BIO203
Previous classes
• Hypothesis testing
  – Parametric
  – Nonparametric
• Correlation

What are we doing today?
• Linear regression
  – Continuous outcome with a continuous, dichotomous, or categorical predictor
  – Equation: $E(Y \mid X = x) = \beta_0 + \beta_1 x$
• Interpretation of coefficients
• Connection between regression and
  – correlation
  – t-test
  – ANOVA

Big picture
• Linear regression is the most commonly used statistical technique. It allows the comparison of dichotomous, categorical, and continuous predictors with a continuous outcome.
• Extensions of linear regression allow
  – Dichotomous outcomes: logistic regression
  – Survival analysis: Cox proportional hazards regression
  – Repeated measures
• Amazingly, many of the analyses we have learned can be completed using linear regression

Example
• Yesterday, we investigated the association between age and BPF using a correlation coefficient
• Can we fit a line to these data?
[Scatter plot of BPF (0.75–0.95) versus Age (20–60)]

Quick math review
• As you remember from high school math, the basic equation of a line is given by y = mx + b, where m is the slope and b is the y-intercept
• One definition of m is that for every one unit increase in x, there is an m unit increase in y
• One definition of b is that it is the value of y when x is equal to zero
[Plot of the line y = 1.5x + 4 for x from 0 to 12]

Picture
• Look at the data in this picture
• Does there seem to be a correlation (linear relationship) in the data?
• Is the data perfectly linear?
• Could we fit a line to these data?
[Scatter plot of y (0–25) versus x (0–12)]

What is linear regression?
• Linear regression tries to find the best line (curve) to fit the data
• The method for finding the best line (curve) is least squares, which minimizes the sum of the squared vertical distances from the line to each of the points
[Scatter plot with the fitted line y = 1.5x + 4; y from 0 to 25, x from 0 to 12]

How do we find the best line?
• Let's look at three candidate lines
• Which do you think is the best?
• What is a way to determine the best line to use?

Residuals
• The actual observations, $y_i$, may be slightly off the population line because of variability in the population. The equation is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\varepsilon_i$ is the deviation from the population line (see picture).
• This is called the residual
[Picture: the distance from the line for patient 1 is $\varepsilon_1$]

Least squares
• The method employed to find the best line is called least squares. This method finds the values of $\beta$ that minimize the squared vertical distance from the line to each of the points. This is the same as minimizing the sum of the $\varepsilon_i^2$:

$\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2$

Estimates of regression coefficients
• Once we have solved the least squares equation, we obtain estimates for the $\beta$'s, which we refer to as $\hat\beta_0, \hat\beta_1$:

$\hat\beta_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$

• The final least squares equation is $\hat{y} = \hat\beta_0 + \hat\beta_1 x_1$, where $\hat{y}$ is the mean value of y for a value of $x_1$

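As an illustration, here is a minimal Stata sketch of these formulas, assuming a dataset with variables bpf (outcome) and age (predictor) is already in memory; the hand computation should match the slope and intercept reported by regress.

* Hand computation of the least-squares estimates (sketch; variables bpf and age assumed)
quietly summarize age
scalar xbar = r(mean)
quietly summarize bpf
scalar ybar = r(mean)
generate double xy = (age - xbar) * (bpf - ybar)
generate double xx = (age - xbar)^2
quietly summarize xy
scalar num = r(sum)
quietly summarize xx
scalar den = r(sum)
scalar b1hat = num / den
scalar b0hat = ybar - b1hat * xbar
display "slope = " b1hat "   intercept = " b0hat
* Compare with Stata's built-in estimates
regress bpf age
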
Assumptions of linear regression
• Linearity
  – Linear relationship between outcome and predictors
  – $E(Y \mid X = x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2^2$ is still a linear regression equation because each of the $\beta$'s is to the first power
• Normality of the residuals
  – The residuals, $\varepsilon_i$, are normally distributed, $N(0, \sigma^2)$
• Homoscedasticity of the residuals
  – The residuals, $\varepsilon_i$, have the same variance
• Independence
  – All of the data points are independent
  – Correlated data points can be taken into account using multivariate and longitudinal data methods

Linearity assumption
• One of the assumptions of linear regression is that the relationship between the predictors and the outcome is linear
• We call this the population regression line: $E(Y \mid X = x) = \mu_{y|x} = \beta_0 + \beta_1 x$
• This equation says that the mean of y given a specific value of x is defined by the $\beta$ coefficients
• The coefficients act exactly like the slope and y-intercept from the simple equation of a line from before

Normality and homoscedasticity assumptions
• Two other assumptions of linear regression are related to the $\varepsilon_i$'s
  – Normality: the distribution of the residuals is normal
  – Homoscedasticity: the variance of y given x is the same for all values of x
[Picture: the distribution of y-values at each value of x is normal with the same variance]

Example
• Here is a regression equation for the comparison of age and BPF:

$BPF_i = \beta_0 + \beta_1 \, age_i + \varepsilon_i$
$E(BPF \mid age) = \beta_0 + \beta_1 \, age$

[Scatter plot of BPF (0.75–0.95) versus Age (20–60)]

Results
• The estimated regression equation is

$\widehat{BPF} = 0.957 - 0.0029 \times age$

[Scatter plot of BPF and the predicted values (predval) versus Age (20–60)]

. regress bpf age

      Source |       SS       df       MS              Number of obs =      29
-------------+------------------------------           F(  1,    27) =   13.48
       Model |  .022226034     1  .022226034           Prob > F      =  0.0010
    Residual |  .044524108    27  .001649041           R-squared     =  0.3330
-------------+------------------------------           Adj R-squared =  0.3083
       Total |  .066750142    28  .002383934           Root MSE      =  .04061

------------------------------------------------------------------------------
         bpf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0028799   .0007845    -3.67   0.001    -.0044895   -.0012704
       _cons |    .957443    .035037    27.33   0.000      .885553    1.029333
------------------------------------------------------------------------------

(Annotations from the slide: the age row gives the estimated slope; the _cons row gives the estimated intercept.)

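The predicted values (predval) plotted above can be obtained after regress; a minimal sketch, assuming the bpf/age data are in memory:

* Fitted (predicted) values from the regression just run
regress bpf age
predict predval, xb
* Overlay the fitted line on the scatter plot of the raw data
twoway (scatter bpf age) (line predval age, sort)
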
Interpretation of regression coefficients
• The final regression equation is

$\widehat{BPF} = 0.957 - 0.0029 \times age$

• The coefficients mean
  – the estimate of the mean BPF for a patient with an age of 0 is 0.957 ($\hat\beta_0$)
  – an increase of one year in age leads to an estimated decrease of 0.0029 in mean BPF ($\hat\beta_1$)

Unanswered questions
• Is the estimate of $\beta_1$ ($\hat\beta_1$) significantly different from zero? In other words, is there a significant relationship between the predictor and the outcome?
• Have the assumptions of regression been met?

Estimate of variance for the $\hat\beta$'s
• In order to determine if there is a significant association, we need an estimate of the variance of $\hat\beta_0$ and $\hat\beta_1$:

$\widehat{se}(\hat\beta_0) = s_{y|x} \sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \qquad \widehat{se}(\hat\beta_1) = \dfrac{s_{y|x}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$

• $s_{y|x}$ is the residual standard deviation of y after accounting for x (the standard deviation from the regression, i.e., the root mean squared error)

Test statistic
• For both regression coefficients, we use a t-statistic to test any specific hypothesis
  – Each has n − 2 degrees of freedom (this is the sample size minus the number of parameters estimated)
• What is the usual null hypothesis for $\beta_1$?

$t = \dfrac{\hat\beta_0 - \beta_0}{\widehat{se}(\hat\beta_0)}, \qquad t = \dfrac{\hat\beta_1 - \beta_1}{\widehat{se}(\hat\beta_1)}$

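For the age coefficient above, the t-statistic is simply the estimate divided by its standard error (testing $\beta_1 = 0$); a quick hand check in Stata, using the numbers from the output:

* t-statistic for the age coefficient under H0: beta1 = 0
display -0.0028799 / 0.0007845        // approximately -3.67
* Two-sided p-value from a t distribution with n - 2 = 27 degrees of freedom
display 2 * ttail(27, abs(-0.0028799 / 0.0007845))
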
Hypothesis test
1) H0: $\beta_1 = 0$
2) Continuous outcome, continuous predictor
3) Linear regression
4) Test statistic: t = -3.67 (27 dof)
5) p-value = 0.0011
6) Since the p-value is less than 0.05, we reject the null hypothesis
7) We conclude that there is a significant association between age and BPF

. regress bpf age

      Source |       SS       df       MS              Number of obs =      29
-------------+------------------------------           F(  1,    27) =   13.48
       Model |  .022226034     1  .022226034           Prob > F      =  0.0010
    Residual |  .044524108    27  .001649041           R-squared     =  0.3330
-------------+------------------------------           Adj R-squared =  0.3083
       Total |  .066750142    28  .002383934           Root MSE      =  .04061

------------------------------------------------------------------------------
         bpf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0028799   .0007845    -3.67   0.001    -.0044895   -.0012704
       _cons |    .957443    .035037    27.33   0.000      .885553    1.029333
------------------------------------------------------------------------------

(Annotations from the slide: the age row gives the estimated slope and its p-value; Prob > F is the p-value for the slope; the _cons row gives the estimated intercept.)

Comparison to correlation
• In this example, we found a relationship between age and BPF. We also investigated this relationship using correlation
• We get the same p-value!!
• Our conclusion is exactly the same!!
• There are other relationships we will see later

Method              p-value
Correlation         0.0010
Linear regression   0.0010

Confidence interval for $\beta_1$
• As we have done previously, we can construct a confidence interval for the regression coefficients
• Since we are using a t-distribution, we do not automatically use 1.96. Rather, we use the cut-off from the t-distribution:

$\left( \hat\beta_1 - t_{\alpha/2,\,dof} * \widehat{se}(\hat\beta_1), \ \hat\beta_1 + t_{\alpha/2,\,dof} * \widehat{se}(\hat\beta_1) \right)$

• Interpretation of the confidence interval is the same as we have seen previously

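A minimal Stata sketch of this calculation, assuming regress bpf age has just been run (the stored results _b[age] and _se[age] hold the estimate and its standard error):

* 95% confidence interval for the age coefficient, 27 degrees of freedom
display _b[age] - invttail(27, 0.025) * _se[age]   // lower limit
display _b[age] + invttail(27, 0.025) * _se[age]   // upper limit
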
Intercept
• STATA also provides a test statistic and p-value for the estimate of the intercept
• This is for H0: $\beta_0 = 0$, which is often not a hypothesis of interest because it corresponds to testing whether the BPF is equal to zero at an age of 0
• Since BPF can't be 0 at age 0, this test is not really of interest
• We can center covariates to make this test meaningful

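One way to center a covariate in Stata, as a sketch assuming the bpf/age data are in memory (after centering, the intercept is the estimated mean BPF at the average age rather than at age 0):

* Center age at its sample mean, then refit the model
quietly summarize age
generate age_c = age - r(mean)
regress bpf age_c
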
Prediction
• Beyond determining if there is a significant association, linear regression can also be used to make predictions
• Using the regression equation, we can predict the BPF for patients with specific age values
  – Ex. A patient with age = 40:

$\widehat{BPF} = 0.957 - 0.0029 \times 40 = 0.841$

• The expected BPF for a patient of age 40, based on our experiment, is 0.841

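As a sketch, the same prediction can be obtained in Stata after fitting the model (margins is one built-in way to get the predicted mean at a specified age):

* Predicted mean BPF at age 40, two equivalent ways
regress bpf age
display _b[_cons] + _b[age] * 40
margins, at(age = 40)
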
Extrapolation
• Can we predict the BPF for a patient with age 80? What assumption would we be making?
[Scatter plot of BPF and predval versus Age (20–60)]

Confidence interval for prediction
• We can place a confidence interval around our predicted mean value
• This corresponds to the plausible values for the mean BPF at a specific age
• To calculate a confidence interval for the predicted mean value, we need an estimate of the variability in the predicted mean:

$\widehat{se}(\hat{y}) = s_{y|x} \sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$

Confidence interval
• Note that the standard error equation has a different magnitude based on the x value. In particular, the magnitude is the least when x equals the mean of x
• Since the test statistic is based on the t-distribution, our confidence interval is

$\left( \hat{y} - t_{\alpha/2,\,df} * \widehat{se}(\hat{y}), \ \hat{y} + t_{\alpha/2,\,df} * \widehat{se}(\hat{y}) \right)$

• This confidence interval is rarely used for hypothesis testing because
[Plot of the fitted line and its confidence band for the mean, BPF (0.75–0.95) versus Age (20–60)]

Prediction interval
• A confidence interval for a mean provides information regarding the accuracy of an estimated mean value for a given sample size
• Often, we are interested in how accurate our prediction would be for a single observation, not the mean of a group of observations. This is called a prediction interval
• What would you estimate as the value for a single new observation?
• Do you think a prediction interval is narrower or wider?

Prediction interval
• The confidence interval is always tighter than the prediction interval
• The variability in the prediction of a single observation contains two types of variability
  – Variability of the estimate of the mean (confidence interval)
  – Variability around the estimate of the mean (residual variability)

$\widetilde{se}(y) = \sqrt{s_{y|x}^2 + \widehat{se}(\hat{y})^2}$
$\left( \hat{y} - t_{\alpha/2,\,df} * \widetilde{se}(y), \ \hat{y} + t_{\alpha/2,\,df} * \widetilde{se}(y) \right)$

[Plot of the fitted line and its prediction band for a single observation, BPF (0.7–1.0) versus Age (20–60)]

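In Stata, both bands can be sketched from predict after regress: stdp gives the standard error of the predicted mean (confidence interval) and stdf the standard error of the forecast for a single new observation (prediction interval). A minimal sketch, assuming the bpf/age model has been fit:

regress bpf age
predict yhat, xb
predict se_mean, stdp          // standard error of the predicted mean
predict se_single, stdf        // standard error of a single-observation forecast
* 95% confidence and prediction limits, using the residual degrees of freedom
generate lo_ci = yhat - invttail(e(df_r), 0.025) * se_mean
generate hi_ci = yhat + invttail(e(df_r), 0.025) * se_mean
generate lo_pi = yhat - invttail(e(df_r), 0.025) * se_single
generate hi_pi = yhat + invttail(e(df_r), 0.025) * se_single
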
Conclusions
• The prediction interval is always wider than the confidence interval
  – It is common to find significant differences between groups but not be able to predict very accurately
  – To predict accurately for a single patient, we need limited overlap of the distributions. The benefit of an increased sample size decreasing the standard error does not help here

Model checking
How good is our model?
• Although we have found a relationship between age and BPF, linear regression also allows us to assess how well our model fits the data
• $R^2$ = coefficient of determination = proportion of variance in the outcome explained by the model
  – When we have only one predictor, it is the proportion of the variance in y explained by x

$R^2 = \dfrac{s_y^2 - s_{y|x}^2}{s_y^2}$

$R^2$
• What if all of the variability in y were explained by x?
  – What would $R^2$ equal?
  – What does this tell you about the correlation between x and y?
  – What if the correlation between x and y is negative?
• What if none of the variability in y is explained by x?
  – What would $R^2$ equal?
  – What is the correlation between x and y in this case?

r vs. $R^2$
• $R^2$ = (Pearson's correlation coefficient)$^2$ = $r^2$
• Since r is between -1 and 1, $R^2$ is always less than or equal to |r|
  – r = 0.1, $R^2$ = 0.01
  – r = 0.5, $R^2$ = 0.25

Method   Estimate
r        -0.577
$R^2$    0.333

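A quick way to verify this relationship in Stata, as a sketch (correlate stores the correlation in r(rho), which squared should match the R-squared reported by regress):

* Pearson correlation between age and BPF, squared, vs. model R-squared
quietly correlate bpf age
display r(rho), r(rho)^2
quietly regress bpf age
display e(r2)
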
Evaluation of model
• Linear regression required several assumptions
  – Linearity
  – Homoscedasticity
  – Normality
  – Independence (usually from the study design)
• We must determine if the model assumptions were reasonable, or a different model may have been needed
• Statistical research has investigated relaxing each of these assumptions

Scatter plot
• A good first step in any regression is to look at the x vs. y scatter plot. This allows us to see
  – Are there any outliers?
  – Is the relationship between x and y approximately linear?
  – Is the variance in the data approximately constant for all values of x?

Tests for the assumptions
• There are several different ways to test the assumptions of linear regression
  – Graphical
  – Statistical
• Many of the tests use the residuals, which are the distances between the outcomes and the fitted line:

$\hat{\varepsilon}_i = y_i - \hat{y}_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$

Residual plot
• If the assumptions of linear regression are met, we will observe a random scatter of points
[Residual plot: residuals (-0.1 to 0.1) versus fitted values (0.8–0.9)]

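A sketch of how these residual plots can be produced in Stata after fitting the model (rvfplot is the built-in residual-versus-fitted plot; the predict approach also keeps the residuals for other checks):

regress bpf age
* Built-in residual-versus-fitted plot
rvfplot
* Or save the residuals and fitted values explicitly
predict ehat, residuals
predict fitted, xb
scatter ehat fitted
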
Investigating linearity
• Scatter plot of predictor vs. outcome
• What do you notice here?
• One way to handle this is to transform the predictor to include a quadratic or other term
[Scatter plot titled "Non-linear relationship": y (0–50) versus x (0–12)]

Aging
• Research has shown that the decrease in BPF in normal people is fairly slow up until age 65, and then there is a steeper drop
[Scatter plot of BPF (0.65–0.85) versus Age (40–80)]

Fitted line
[Scatter plot of BPF (0.65–0.85) versus Age (40–80) with a straight fitted line]
• Note how the majority of the values are above the fitted line in the middle and below the fitted line at the two ends

What if we fit a line for this?
• The residual plot shows a non-random scatter because the relationship is not really linear
[Residual plot: residuals (-0.05 to 0.05) versus fitted values (0.72–0.82)]

What can we do?
• If the relationship between x and y is not linear, we can try a transformation of the values
• Possible transformations (see the sketch after this list)
  – Add a quadratic term
  – Fit a spline. This is when there is one slope for a certain part of the curve and a different slope for the rest of the curve

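A minimal Stata sketch of the quadratic-term approach, assuming variables bpf and age are in memory (the squared term is created by hand and added as a second predictor):

* Add a quadratic term in age to capture the non-linear relationship
generate age2 = age^2
regress bpf age age2
* Inspect the new residual-versus-fitted plot
rvfplot
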
Adding a quadratic term
[Scatter plot of BPF (0.65–0.85) versus Age (40–80) with the quadratic fit]
Residual plot
[Residual plot after adding the quadratic term: residuals (-0.05 to 0.05) versus fitted values (0.7–0.8)]

Checking linearity
• A plot of residuals vs. the predictor is also used to detect departures from linearity
• These plots allow you to investigate each predictor separately, which becomes important in multiple regression
• If linearity holds, we anticipate a random scatter of the residuals on both types of residual plot

Homoscedasticity
• The second assumption is equal variance across the values of the predictor
• The top plot shows the assumption being met, while the bottom plot shows that there is a greater amount of variance for larger fitted values

Example
[Scatter plot of expression level (0–300,000) versus lipid number (1–6)]
• In this example, we can fit a linear regression model assuming that there is a linear increase in expression with lipid number, but here is the residual plot from this analysis
• What is wrong?
[Residual plot: residuals (-100,000 to 200,000) versus fitted values (-50,000 to 100,000)]

Transform the y-value
• Clearly, the residuals showed that we did not have equal variance
• What if we log-transform our y-value?
[Scatter plot of log expression level (6–14) versus lipid number (1–6)]

New regression equation
• By transforming the outcome variable, we have changed our regression equation:
  – Original: $Expression_i = \beta_0 + \beta_1 * lipid_i + \varepsilon_i$
  – New: $\ln(Expression_i) = \beta_0 + \beta_1 * lipid_i + \varepsilon_i$
• What is the interpretation of $\beta_1$ from the new regression model? (See the sketch below.)
  – For every one unit increase in lipid number, there is a $\beta_1$ unit increase in ln(Expression) on average
  – The interpretation has changed due to the transformation

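A sketch of the transformation in Stata, assuming variables named expression and lipid (hypothetical names for this dataset):

* Log-transform the outcome and refit the model
generate lnexpression = ln(expression)
regress lnexpression lipid
* Check the residual-versus-fitted plot on the log scale
rvfplot
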
Residual plot
• On the log scale, the assumption of equal variance appears much more reasonable
[Residual plot: residuals (-3 to 2) versus fitted values (7–11)]

Checking homoscedasticity
• If we do not appear to have equal variance, a transformation of the outcome variable can be used
  – Most common are the log transformation or the square root transformation
• Other approaches, involving weighted least squares, can also be used if a transformation does not work

Normality
• Regression requires that the residuals are normally distributed
• To test if the residuals are normal:
  – Histogram of residuals
  – Normal probability plot
[Histogram of the residuals (-0.1 to 0.05) with density on the vertical axis]
• Several statistical tests for normality of residuals are also available

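A sketch of these checks in Stata, using the residuals from the fitted model (sktest is one of the built-in normality tests; others, such as swilk, could also be used):

regress bpf age
predict resid_bpf, residuals
* Graphical checks
histogram resid_bpf, normal
qnorm resid_bpf
* Skewness/kurtosis test for normality of the residuals
sktest resid_bpf
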
What if normality does not hold?
• Transformations of the outcome can often help
• Changing to another type of regression that does not require normality of the residuals
  – Logistic regression
  – Poisson regression

Outliers
• Investigating the residuals also provides information regarding outliers
• If a value is extreme in the vertical direction, the residual will be extreme as well
  – You will see this in lab
• If a value is extreme in the horizontal direction, this value can have too much influence (leverage)
  – This is beyond the scope of this class

Example
• Another measure of disease burden in MS is the T2 lesion volume in the brain
  – Over the course of the disease, patients accumulate brain lesions that they do not recover from
• This is a measure of the disease burden in the brain
• Is there a significant linear relationship between T2 lesion volume and age?
[Scatter plot of lesion volume (0–30) versus Age (20–60)]

Linear model
• Our initial linear model:
  – $LV_i = \beta_0 + \beta_1 * age_i + \varepsilon_i$
  – What is the interpretation of $\beta_1$?
  – What is the interpretation of $\beta_0$?
• Using STATA, we get the following regression equation:

$\widehat{LV}_i = 3.70 + 0.062 \times age_i$

  – Is there a significant relationship between age and lesion volume?

Hypothesis test
1) H0: $\beta_1 = 0$
2) Continuous outcome, continuous predictor
3) Linear regression
4) Test statistic: t = 0.99 (102 dof)
5) p-value = 0.32
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between age and lesion volume

. regress lv_entry age

      Source |       SS       df       MS              Number of obs =     104
-------------+------------------------------           F(  1,   102) =    0.98
       Model |  33.1886601     1  33.1886601           Prob > F      =  0.3236
    Residual |  3440.84404   102  33.7337651           R-squared     =  0.0096
-------------+------------------------------           Adj R-squared = -0.0002
       Total |   3474.0327   103  33.7284729           Root MSE      =  5.8081

------------------------------------------------------------------------------
    lv_entry |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0623605   .0628706     0.99   0.324    -.0623429     .187064
       _cons |   3.699857   2.742369     1.35   0.180    -1.739618    9.139333
------------------------------------------------------------------------------

(Annotation from the slide: the age row gives the estimated coefficient and its p-value.)

[Residual plot: residuals (-10 to 30) versus fitted values (5–7.5)]

Linear model
• Our new linear model:
  – $\ln(LV_i) = \beta_0 + \beta_1 * age_i + \varepsilon_i$
  – What is the interpretation of $\beta_1$?
  – What is the interpretation of $\beta_0$?
• Using STATA, we get the following regression equation:

$\widehat{\ln(LV_i)} = 1.36 + 0.0034 \times age_i$

  – Is there a significant relationship between age and lesion volume?

Hypothesis test
1) H0: $\beta_1 = 0$
2) Continuous outcome, continuous predictor
3) Linear regression
4) Test statistic: t = 0.38 (102 dof)
5) p-value = 0.71
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between age and lesion volume

. regress lnlv age

      Source |       SS       df       MS              Number of obs =     104
-------------+------------------------------           F(  1,   102) =    0.14
       Model |  .100352931     1  .100352931           Prob > F      =  0.7059
    Residual |  71.4750773   102  .700736052           R-squared     =  0.0014
-------------+------------------------------           Adj R-squared = -0.0084
       Total |  71.5754302   103   .69490709           Root MSE      =   .8371

------------------------------------------------------------------------------
        lnlv |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0034291   .0090613     0.38   0.706     -.014544    .0214022
       _cons |   1.355875   .3952489     3.43   0.001     .5719006    2.139849
------------------------------------------------------------------------------

(Annotation from the slide: the age row gives the estimated coefficient and its p-value.)

[Residual plot: residuals (-2 to 2) versus fitted values (1.4–1.55)]

Histograms of residuals
[Histograms of the residuals: untransformed values (left) and transformed values (right)]

Conclusions for model checking
• Checking model assumptions for linear regression is needed to ensure inferences are correct
  – If you have the wrong model, your inference will be wrong as well
• The majority of model checking is based on the residuals
• If the model fit is bad, a different model should be used

Dichotomous predictors
Linear regression with dichotomous predictor
• Linear regression can also be used for dichotomous predictors, like sex
• To do this, we use an indicator variable, which equals 1 for male and 0 for female. The resulting regression equation for BPF is

$E(BPF \mid sex) = \beta_0 + \beta_1 \, sex$
$BPF_i = \beta_0 + \beta_1 \, sex_i + \varepsilon_i$

Graph
[Scatter plot of BPF (0.75–0.95) versus Sex (0 = female, 1 = male)]

• The regression equation can be rewritten as

$BPF_{female} = \beta_0 + \varepsilon_i$
$BPF_{male} = \beta_0 + \beta_1 + \varepsilon_i$

• The meaning of the coefficients in this case is
  – $\beta_0$ is the mean BPF when sex = 0, i.e., in the female group
  – $\beta_0 + \beta_1$ is the mean BPF when sex = 1, i.e., in the male group
• What is the interpretation of $\beta_1$?
  – For a one-unit increase in sex, there is a $\beta_1$ increase in the mean BPF
  – This is the difference in mean BPF between the males and females

Interpretation of results
• The final regression equation is

$\widehat{BPF} = 0.823 + 0.037 \times sex$

• The meaning of the coefficients in this case is
  – 0.823 is the estimate of the mean BPF in the female group
  – 0.037 is the estimate of the mean increase in BPF between the males and females
  – What is the estimated mean BPF in the males?
• How could we test if the difference between the groups is statistically significant? (A sketch follows below.)

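As a sketch, both the regression and the equal-variance t-test can be run in Stata on the same indicator variable (sex assumed coded 0/1 in the data); the two give the same t-statistic and two-sided p-value:

* Regression with a dichotomous predictor
regress bpf sex
* Equivalent two-sample t-test assuming equal variances
ttest bpf, by(sex)
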
Hypothesis test
1) H0: There is no difference based on gender ($\beta_1 = 0$)
2) Continuous outcome, dichotomous predictor
3) Linear regression
4) Test statistic: t = 1.82 (27 dof)
5) p-value = 0.079
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant difference in the mean BPF in males compared to females

. regress bpf sex

      Source |       SS       df       MS              Number of obs =      29
-------------+------------------------------           F(  1,    27) =    3.33
       Model |  .007323547     1  .007323547           Prob > F      =  0.0792
    Residual |  .059426595    27  .002200985           R-squared     =  0.1097
-------------+------------------------------           Adj R-squared =  0.0767
       Total |  .066750142    28  .002383934           Root MSE      =  .04691

------------------------------------------------------------------------------
         bpf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |   .0371364   .0203586     1.82   0.079     -.004636    .0789087
       _cons |   .8228636   .0100022    82.27   0.000     .8023407    .8433865
------------------------------------------------------------------------------

(Annotations from the slide: the sex row gives the estimated difference between groups and the p-value for the difference.)

[Residual plot: residuals (-0.1 to 0.1) versus fitted values (0.82–0.86)]

T-test
• As hopefully you remember, you could have tested this same null hypothesis using a two-sample t-test
• Linear regression makes an equal variance assumption, so let's use the same assumption for our t-test

Hypothesis test
1) H0: There is no difference based on gender
2) Continuous outcome, dichotomous predictor
3) t-test
4) Test statistic: t = -1.82 (27 dof)
5) p-value = 0.079
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant difference in the mean BPF in males compared to females

. ttest bpf, by(sex)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      22    .8228636    .0096717    .0453645    .8027502    .8429771
       1 |       7         .86    .0196457    .0519775    .8119288    .9080712
---------+--------------------------------------------------------------------
combined |      29    .8318276    .0090667    .0488255    .8132553    .8503998
---------+--------------------------------------------------------------------
    diff |           -.0371364    .0203586               -.0789087     .004636
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -1.8241
Ho: diff = 0                                     degrees of freedom =       27

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0396         Pr(|T| > |t|) = 0.0792          Pr(T > t) = 0.9604

Amazing!!!
• We get the same result using both approaches!!
• Linear regression has the advantages of:
  – Allowing multiple predictors (tomorrow)
  – Accommodating continuous predictors (relationship to correlation)
  – Accommodating categorical predictors (tomorrow)
• Very flexible approach

Conclusion
• Indicator variables can be used to represent dichotomous variables in a regression equation
• Interpretation of the coefficient for an indicator variable is the same as for a continuous variable
  – It provides a group comparison
• Tomorrow we will see how to use regression to match ANOVA results