Statistics for Health Research
Correlation and Linear
Regression
Peter T. Donnan
Professor of Epidemiology and Biostatistics
CONTENTS
• Correlation coefficients
• meaning
• values
• role
• significance
• Regression
• line of best fit
• prediction
• significance
INTRODUCTION
• Correlation
• the strength of the linear relationship between two
variables
• Regression analysis
• determines the nature of the relationship
• Is there a relationship between the number of
units of alcohol consumed and the likelihood of
developing cirrhosis of the liver?
PEARSON’S COEFFICIENT OF
CORRELATION (r)
• Measures the strength of the linear relationship
between one dependent and one independent
variable
• curvilinear relationships need other techniques
• Values lie between +1 and -1
• perfect positive correlation r = +1
• perfect negative correlation r = -1
• no linear relationship r = 0
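As a rough illustration of the values listed above, the following sketch computes Pearson's r from its usual definition (sum of cross-products of deviations, scaled by the sums of squared deviations). The function name `pearson_r` and the toy data are illustrative assumptions, not taken from the slides. The last case shows a perfect curvilinear relationship (y = x squared) for which r is 0.

```python
import math

def pearson_r(x, y):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                                       # toy data (hypothetical)
r_pos = pearson_r(x, [2 * xi + 1 for xi in x])            # points on a rising line
r_neg = pearson_r(x, [10 - 3 * xi for xi in x])           # points on a falling line
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])   # y = x^2, not linear

print(r_pos, r_neg, r_curve)
```

The third result is why a scatter plot should be inspected before r is interpreted: r only measures the linear part of a relationship.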
PEARSON’S COEFFICIENT OF
CORRELATION
[Figure: four scatter plots illustrating r = +1, r = -1, r = 0 and r = 0.6]
SCATTER PLOT
[Figure: scatter plot of BMD (y-axis; dependent variable, the one we make inferences about) against calcium intake (x-axis; independent variable)]
NON-NORMAL DATA
NORMALISED
SPSS OUTPUT: SCATTER PLOT
SPSS OUTPUT: CORRELATIONS
Interpreting correlation
A large or statistically significant r does not necessarily imply:
• strong correlation
• the significance of r (not its size) increases with sample size, so a weak correlation can be highly significant in a large sample
• cause and effect
• e.g. a strong correlation between the number of televisions sold and the number of cases of paranoid schizophrenia
• this does not mean that watching TV causes paranoid schizophrenia
• may be due to an indirect relationship
Interpreting correlation
Variation in dependent variable due to:
• relationship with independent variable: r^2
• random factors: 1 - r^2
r^2 is the Coefficient of Determination or Variation explained
e.g. r = 0.661
r^2 = 0.661^2 = 0.44
less than half of the variation (44%) in the dependent variable is due to the independent variable
Agreement
Correlation should never be used to determine the level of agreement between repeated measures:
• measuring devices
• users
• techniques
It measures the degree of linear relationship
You can have high correlation with poor agreement
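To make the last point concrete, here is a minimal sketch with two hypothetical measuring devices, where the second always reads 10 units higher than the first. The readings and the helper `pearson_r` are invented for illustration. The correlation is perfect, yet the devices never agree.

```python
import math

def pearson_r(x, y):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

device_a = [50, 55, 60, 65, 70]            # hypothetical readings
device_b = [a + 10 for a in device_a]      # second device reads 10 units higher

r = pearson_r(device_a, device_b)
# Mean difference between paired readings: a simple summary of (dis)agreement
mean_difference = sum(b - a for a, b in zip(device_a, device_b)) / len(device_a)

print(r, mean_difference)
```

Looking at the paired differences directly, rather than at r, is what reveals the systematic offset.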
Non-parametric correlation
• Make no assumptions about the distribution of the data
• Carried out on ranks
• Spearman’s ρ (rho)
• easy to calculate
• Kendall’s τ (tau)
• has some advantages over ρ
• distribution has better statistical properties
• easier to identify concordant / discordant pairs
• Usually both lead to same conclusions
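The idea of working on ranks can be sketched as follows: Spearman's coefficient is Pearson's r applied to the ranks, and Kendall's coefficient counts concordant and discordant pairs. All function names and data here are illustrative assumptions. The Kendall version shown is the simple tau-a form with no correction for ties.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1 for the smallest value; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1   # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's coefficient: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """(concordant - discordant) / number of pairs; no tie correction."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5]            # toy data (hypothetical)
y = [1, 8, 27, 64, 125]        # increases with x, but not in a straight line

print(pearson_r(x, y), spearman_rho(x, y), kendall_tau_a(x, y))
```

For this monotonic but non-linear example both rank-based coefficients equal 1, while Pearson's r is below 1.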
Role of regression
• Shows how one variable changes with another
• By determining the line of best fit
• linear
• curvilinear
Line of best fit
• Simplest case linear
• Line of best fit between:
• dependent variable Y: BMD
• independent variable X: dietary intake of Calcium
Y = a + bX
• a: value of Y when X = 0
• b: change in Y when X increases by 1
Role of regression
• Used to predict
• the value of the dependent variable
• when value of independent variable(s) known
• within the range of the known data
• extrapolation risky!
• e.g. relation between age and bone age
• Does not imply causality
SPSS OUTPUT: REGRESSION
Multiple regression
• More than one independent variable
• BMD dependent on:
• age
• gender
• calorific intake
• use of bisphosphonates
• exercise
• etc
Summary
• Correlation
• strength of linear relationship between two variables
• Pearson’s - parametric
• Spearman’s / Kendall’s - non-parametric
• Interpret with care!
• Regression
• line of best fit
• prediction
• Multiple regression
• logistic
Statistics for Health Research
Regression:
Checking the Model
Peter T. Donnan
Professor of Epidemiology and Biostatistics
Objectives of session
• Recognise the need to check fit of
the model
• Carry out checks of assumptions in
SPSS for simple linear regression
• Understand predictive model
• Understand residuals
How is the fitted line
obtained?
Use method of least squares (LS)
Seek to minimise squared vertical
differences between each point and
fitted line
Results in parameter estimates or
regression coefficients of slope (b)
and intercept (a) – y=a+bx
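A minimal sketch of the least squares idea, using the standard closed-form estimates for simple linear regression (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x). The function names and the toy data are assumptions made for illustration. The check at the end shows that nudging either estimate increases the sum of squared vertical differences.

```python
def least_squares(x, y):
    """Return (a, b): intercept and slope of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def sum_squared_differences(a, b, x, y):
    """Sum of squared vertical differences between the points and the line."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

x = [0, 1, 2, 3, 4]     # toy data (hypothetical)
y = [1, 3, 2, 5, 4]

a, b = least_squares(x, y)
best = sum_squared_differences(a, b, x, y)

print(a, b, best)
# Any other line does worse on this criterion:
print(sum_squared_differences(a + 0.1, b, x, y) > best)
print(sum_squared_differences(a, b - 0.1, x, y) > best)
```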
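Consider fitted line of y = a + bx
[Figure: fitted line with the dependent variable (y) on the vertical axis, the explanatory variable (x) on the horizontal axis, and the intercept a marked on the y-axis]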
Consider the regression of minimum LDL cholesterol achieved on age
• Select Regression, then Linear…
• Dependent (y): Min LDL achieved
• Independent (x): Age_Base
Output from SPSS linear regression
Coefficients(a)
                    Unstandardized Coefficients   Standardized Coefficients
Model 1             B         Std. Error          Beta                        t         Sig.
(Constant)          2.024     .105                                            19.340    .000
Age at baseline     -.008     .002                -.121                       -4.546    .000
a. Dependent Variable: Min LDL achieved
N.B. 0.008 may look very small but represents:
The DECREASE in LDL achieved for each increase in one unit of age, i.e. ONE year
H0: slope b = 0
Test t = slope/se = -0.008/0.002 = -4.546 (SPSS uses the unrounded estimates; the rounded values shown would give -4) with p<0.001, so statistically significant
Predicted LDL = 2.024 - 0.008 x Age
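The t statistic above is the slope divided by its standard error. As a sketch of where that standard error comes from, the code below uses the textbook formula for simple linear regression: se(b) = sqrt(residual variance / Sxx), with the residual variance on n - 2 degrees of freedom. The function name and toy data are illustrative assumptions. The p-value would additionally need a t distribution with n - 2 degrees of freedom, which is not computed here.

```python
import math

def slope_t_statistic(x, y):
    """Return (b, se_b, t) for the least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Residual sum of squares around the fitted line
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Two parameters (a and b) were estimated, hence n - 2 degrees of freedom
    residual_variance = sse / (n - 2)
    se_b = math.sqrt(residual_variance / sxx)
    return b, se_b, b / se_b

x = [0, 1, 2, 3, 4]     # toy data (hypothetical)
y = [1, 3, 2, 5, 4]

b, se_b, t = slope_t_statistic(x, y)
print(b, se_b, t)
```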
Prediction Equation from linear regression
Predicted LDL achieved = 2.024 - 0.008 x Age
So for a man aged 65 the predicted LDL achieved = 2.024 - 0.008 x 65 = 1.504
Age    Predicted Min LDL
45     1.664
55     1.584
65     1.504
75     1.424
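The table can be reproduced with a few lines of code. This sketch simply plugs the two coefficients from the SPSS output into the prediction equation; the function and constant names are illustrative.

```python
INTERCEPT = 2.024    # (Constant) from the SPSS coefficients table
SLOPE = -0.008       # coefficient for Age at baseline

def predicted_min_ldl(age):
    """Predicted minimum LDL achieved for a given age in years."""
    return INTERCEPT + SLOPE * age

for age in (45, 55, 65, 75):
    print(age, round(predicted_min_ldl(age), 3))
```

As with any regression prediction, this should only be used for ages within the range of the data used to fit the model.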
Assumptions of Regression
1. Relationship is linear
2. Outcome variable and hence
residuals or error terms are approx.
Normally distributed
Use Graphs and Scatterplot to
obtain the Lowess line of fit
1. Create Scatterplot and then double-click to enter chart editor
2. Choose the icon ‘Add fit line at total’
3. Then select type of fit such as Lowess
Linear assumption: Fitted
lowess smoothed line
Lowess smoothed line (red) gives a good eyeball
examination of linear assumption (green)
Definition of a residual
A residual is the difference between the actual value and the predicted value (fitted line), i.e. the unexplained variation
r_i = y_i - E(y_i)
Or
r_i = y_i - (a + b x_i)
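As a small worked check of this definition, the sketch below subtracts predicted from observed values for three subjects. The observed and predicted Min LDL values are taken from the Casewise Diagnostics output shown later in these slides; the function and variable names are illustrative.

```python
def residual(observed, predicted):
    """Residual = actual value minus value predicted by the fitted line."""
    return observed - predicted

# (case number, observed Min LDL, predicted Min LDL) from the casewise output
cases = [
    (164, 5.5840, 1.518153),
    (209, 4.5260, 1.368685),
    (250, 3.7875, 1.529325),
]

residuals = {case: residual(obs, pred) for case, obs, pred in cases}
for case, value in residuals.items():
    print(case, round(value, 4))
```

These agree with the Residual column reported by SPSS for the same cases.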
Residuals
To assess the residuals in SPSS linear regression, select Plots…
• Plot the normalised (or standardised) residual against the normalised (or standardised) predicted value of LDL
• Select histogram of residuals and normal probability plot
In SPSS linear regression, select Statistics…
• Model fit
• Select confidence intervals for regression coefficients
• Select Durbin-Watson for serial correlation and identification of outliers
Output:
Scatterplot of residuals vs. predicted
Note
1) Mean of residuals = 0
2) Most of data lie within + or - 3 SDs of mean
Output:
Histogram of standardised residuals
Plot of residuals with normal curve superimposed
Output:
Cumulative probability plot
Look for deviation from diagonal line to indicate non-normality
Output:
Description of residuals
Descriptive statistics for residuals
Residuals Statistics(a)
                        Minimum     Maximum      Mean        Std. Deviation   N
Predicted Value         1.314867    1.843205     1.556478    .0878548         1383
Residual                -1.65389    4.0658469    .0000000    .7181448         1383
Std. Predicted Value    -2.750      3.264        .000        1.000            1383
Std. Residual           -2.302      5.660        .000        1.000            1383
a. Dependent Variable: Min LDL achieved
Worth investigation?
Subjects with standardised residuals > 3
Casewise Diagnostics(a)
Case Number   Std. Residual   Min LDL   Predicted   Residual
164           5.660           5.5840    1.518153    4.0658471
209           4.395           4.5260    1.368685    3.1573148
250           3.143           3.7875    1.529325    2.2581750
268           3.064           3.8730    1.671664    2.2013357
274           3.227           4.0953    1.777153    2.3180975
362           4.095           4.5350    1.593460    2.9415398
517           3.636           4.3240    1.711788    2.6122125
849           3.968           4.3290    1.478113    2.8508873
1047          4.207           4.4360    1.413686    3.0223141
1075          3.885           4.4040    1.613219    2.7907805
1103          3.519           3.9905    1.462584    2.5279157
1229          3.016           3.7660    1.599254    2.1667456
1290          3.975           4.2345    1.379107    2.8553933
a. Dependent Variable: Min LDL achieved
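The standardised residuals in the casewise output can be recovered, to rounding, by dividing each raw residual by the Std. Error of the Estimate (0.7184048) reported in the Model Summary for this model. The sketch below does this for a few cases and flags those beyond 3; the extra small residual labelled "hypothetical" is invented to show a case that is not flagged, and all names are illustrative.

```python
STD_ERROR_OF_ESTIMATE = 0.7184048   # from the SPSS Model Summary for this model
THRESHOLD = 3.0                     # flag standardised residuals beyond +/- 3

def standardised_residual(raw_residual):
    """Raw residual divided by the residual standard deviation."""
    return raw_residual / STD_ERROR_OF_ESTIMATE

raw_residuals = [
    (164, 4.0658471),            # from the casewise output
    (209, 3.1573148),            # from the casewise output
    (1229, 2.1667456),           # from the casewise output
    ("hypothetical", 0.5),       # invented example of an unremarkable residual
]

flagged = [case for case, res in raw_residuals
           if abs(standardised_residual(res)) > THRESHOLD]

for case, res in raw_residuals:
    print(case, round(standardised_residual(res), 3))
print("worth investigating:", flagged)
```

Flagged subjects are candidates for a data check, not automatic exclusions.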
Output:
Model fit and serial correlation
Model Summary
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .121(a)    .015       .014                .7184048                     2.034
a. Predictors: (Constant), Age at baseline
R: correlation between min LDL achieved and Age at baseline, here 0.121 in magnitude (the relationship is negative, as the slope shows)
R^2: % variation explained, here 1.5%, not particularly high
Durbin-Watson test: serial correlation of residuals; should be approximately 2 if no serial correlation
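For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below uses two invented residual sequences to show how it moves away from 2: values towards 0 suggest positive serial correlation and values towards 4 suggest negative serial correlation. The function name and data are illustrative assumptions.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

trending = [1, 1, 1, -1, -1, -1]       # neighbours alike: positive serial correlation
alternating = [1, -1, 1, -1, 1, -1]    # neighbours opposite: negative serial correlation

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
print(dw_trending, dw_alternating)
```

The value of 2.034 in the SPSS output above is close to 2, which is consistent with little serial correlation.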
Summary
After fitting any regression model check assumptions
• Functional form: linearity is default, often not best fit, consider quadratic…
• Check Residuals for approx. normality
• Check Residuals for outliers (> 3 SDs)
• All accomplished within SPSS
Practical on Model Checking
Read in ‘LDL Data.sav’
1) Fit age squared term in min LDL model and check fit of model compared to linear fit
(Hint: Use transform/compute to create age squared term and fit age and age^2)
2) Fit separate linear regressions with min Chol achieved as the outcome and each of the following as the predictor: a) baseline Chol, b) APOE_lin, c) adherence
Check assumptions and interpret results