Introduction to Statistics
Lecture 4
Outline

• Regression
• Correlation in more detail
• Multiple Regression
• ANCOVA
• Normality Checks
• Non-parametrics
• Sample Size Calculations
Linear Regression & Correlation

• Sometimes we have two quantitative measurements which are related in some way, e.g. the peak height on a chromatograph and the concentration of analyte in a specimen
• They are called the dependent variable (Y) and the independent variable (X)
• The labelling of Y and X may be arbitrary, but if a variable is set by the experimenter rather than simply observed, then that variable is X
Regression & Correlation

• A correlation measures the "degree of association" between two variables (interval (50, 100, 150, …) or ordinal (1, 2, 3, ...))
• Associations can be positive (an increase in one variable is associated with an increase in the other) or negative (an increase in one variable is associated with a decrease in the other)
Correlation coefficient r

• The (Pearson) sample correlation coefficient is defined as

  r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}

• The numerator describes how y varies with x; it is scaled (by dividing by the variability of x and of y) so that -1 ≤ r ≤ 1
• -1 = 100% negative correlation, +1 = 100% positive correlation (a numerical check follows below)
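As a minimal sketch in Python, the formula can be checked against scipy.stats.pearsonr; the five x/y pairs are taken from the SBP/age example a few slides below, purely for illustration:

    import numpy as np
    from scipy import stats

    x = np.array([34, 40, 28, 21, 39], dtype=float)       # age (illustrative subset)
    y = np.array([115, 130, 128, 123, 126], dtype=float)  # daytime SBP

    # Numerator: how y varies with x; denominator: scaling by the
    # variability of x and of y, so that -1 <= r <= 1
    num = np.sum((x - x.mean()) * (y - y.mean()))
    den = np.sqrt(np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
    r_manual = num / den

    r_scipy, p_value = stats.pearsonr(x, y)   # same r, plus a p-value
    print(r_manual, r_scipy, p_value)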
Interpreting r

• r = 0: no relationship
• |r| = 1: a perfect linear relationship
• |r| < 0.4: weak relationship
• 0.4 ≤ |r| ≤ 0.7: moderate relationship
• |r| > 0.7: strong relationship
• The p-value attached to r tells us how likely such a value of r would be observed if there were in fact no association, i.e. how much confidence we can have in our statement
Example

• Daytime SBP and age collected for 447 hypertensive males.

  SBP   Age
  115   34
  130   40
  128   28
  123   21
  126   39
  …     …
Example (contd)

• Is there a linear relationship between SBP and Age?
• r = 0.145 → weak positive relationship
Correlation examples

[Figure: example scatterplots showing correlations of different strengths and directions]
Example 2: Height vs. Weight

[Graph One: Relationship between Height and Weight; scatterplot of Weight (kgs) against Height (cms)]

• Strong positive correlation between height and weight
• Can see how the relationship works, but cannot predict one from the other
• If 120 cm tall, then how heavy?
Regression

• Problem: to draw a straight line through the points that best explains the variance
• The line can then be used to predict Y from X

[Figure: scatterplot with a candidate straight line through the points]
Example: Symptom Index vs Drug A

[Graph Three: Relationship between Symptom Index and Drug A (dose in mg), with best-fit line]

• "Best fit line" allows us to describe the relationship between the variables more accurately
• We can now predict specific values of one variable from knowledge of the other
• All points are close to the line
Simple Linear Regression

• Assume the population regression line:

  y = \alpha + \beta x

  where: α = y intercept (constant), β = slope of line, y = dependent variable, x = independent variable
• For an individual observation:

  y_i = \alpha + \beta x_i + \varepsilon_i

[Figure: scatterplot with the population regression line]
Regression

• Establish the equation for the best-fit line: y = a + bx
• The best-fit line is the same as the regression line
• b is the regression coefficient for x
• x is the predictor or regressor variable for y
Fit a line to the data:

• Not great:

[Figure: scatterplot with a poorly fitting line]

Fit a line to the data:

• Better:

[Figure: the same scatterplot with a better fitting line]
Least Squares

• Minimise the (squared) distance between the points and the line
• a and b are the estimates of α and β which minimise

  \sum_i \big( y_i - (\alpha + \beta x_i) \big)^2
Least Squares Estimates

• Using calculus (partial derivatives), we get

  b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\bar{x}

• Note b is related to the correlation coefficient r (same numerator): if x and y are positively correlated then the slope is positive (a numerical check follows below)
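A minimal NumPy sketch of these estimates, checked against np.polyfit; the data values are made up for illustration:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # roughly y = 2x

    # Least squares estimates from the formulas above
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    a = y.mean() - b * x.mean()

    slope, intercept = np.polyfit(x, y, 1)   # should agree with b and a
    print(a, b, intercept, slope)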
Example from the literature

• Predicting tandem repeat variability

[Figure: regression example reproduced from the literature]
Example: Symptom Index versus Drug A dose

[Graph Three again: Relationship between Symptom Index and Drug A (dose in mg), with best-fit line]

• The "best fit line" describes the relationship between the variables more accurately, and we can now predict specific values of one variable from knowledge of the other
• All points are close to the line
Example: Symptom Index versus Drug B dose

[Graph Four: Relationship between Symptom Index and Drug B (dose in mg), with best-fit line]

• We can still predict specific values of one variable from knowledge of the other
• Will predictions be as accurate? Why not?
• Large "residual" variation (random error) = difference between the observed data and that predicted by the equation
Regression Hypothesis Tests

• Hypotheses about the intercept:
  H0: α = 0  vs  HA: α ≠ 0
• But most research focuses on the slope:
  H0: β = 0  vs  HA: β ≠ 0
  This addresses the general question "Is X predictive of Y?"
Regression

• Estimates of a slope (b) have a sampling distribution, like any other statistic
• If certain assumptions are met (NB normality, homogeneity of variance) the sampling distribution approximates the t-distribution
• Thus, we can assess the probability that a given value of b would be observed if β = 0 → hypothesis tests & confidence intervals (see the sketch below)
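A sketch of this in Python: scipy.stats.linregress reports the slope, its standard error and the p-value for H0: β = 0, from which a confidence interval follows via the t-distribution. The data here are simulated:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 50)
    y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 50)   # true slope is 0.5

    res = stats.linregress(x, y)
    t_crit = stats.t.ppf(0.975, df=len(x) - 2)   # n-2 degrees of freedom
    ci = (res.slope - t_crit * res.stderr,
          res.slope + t_crit * res.stderr)       # 95% CI for the slope
    print(res.slope, res.pvalue, ci)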
Regression

• R², the coefficient of determination, is the percentage of variation explained by the "regression"
• R² > 0.6 is deemed reasonably good
• Note, the model must also be significant (see the hypothesis tests above)
Back to SBP and Age example

• a = 123 and b = 0.159 approximately
• What does b mean?
• Is age predictive of BP? i.e. is there evidence that β ≠ 0?
• How good is the fit of the regression line?
Regression

• R² interpretation: R² = proportion of variation explained by (or predictive ability of) the regression
• Variation in y is almost fully explained by x: R² ≈ 1
• Still some variation in y left over (not explained by x): R² < 1

[Figure: the TOTAL error about the MEAN splits into error explained by the REGRESSION LINE and UNexplained (residual) error; this decomposition is computed in the sketch below]
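The decomposition in the figure translates directly into code; a minimal sketch with simulated data:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 52, 100)
    y = 5 + 0.3 * x + rng.normal(0, 2.0, 100)

    b, a = np.polyfit(x, y, 1)             # slope, intercept
    y_hat = a + b * x                      # fitted values on the regression line

    ss_total = np.sum((y - y.mean())**2)   # TOTAL error about the MEAN
    ss_resid = np.sum((y - y_hat)**2)      # UNexplained (residual) error
    r_squared = 1 - ss_resid / ss_total    # proportion explained by the regression
    print(r_squared)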
Regression – four possibilities

  b ≠ 0   P-value non-significant   Relationship, but not much evidence
  b ≈ 0   P-value non-significant   No relationship & not much evidence
  b ≠ 0   P-value significant       Plenty of evidence for a relationship
  b ≈ 0   P-value significant       Plenty of evidence for no relationship
Regression Assumption:
Homoskedasticity (Equal Error Variance)

• Examine the error at different values of X. Is it roughly equal?
• Here, things look pretty good.

[Figure: residuals with roughly constant spread across INCOME (0 to 100000)]
Heteroskedasticity: Unequal Error Variance

• At higher values of X, the error variance increases a lot
• A transformation of the data (e.g. log) can remove heteroskedasticity (illustrated below)

[Figure: residuals fanning out as INCOME increases from 0 to 100000]
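A small simulated sketch of this fix: residuals fan out with income on the raw scale, but are roughly even after a log transform (all values made up):

    import numpy as np

    rng = np.random.default_rng(2)
    income = rng.uniform(10_000, 100_000, 500)
    # Multiplicative noise, so the error spread grows with income
    y = 0.5 * income * np.exp(rng.normal(0, 0.3, 500))

    fit_raw = np.poly1d(np.polyfit(income, y, 1))
    fit_log = np.poly1d(np.polyfit(income, np.log(y), 1))

    resid_raw = y - fit_raw(income)           # variance increases with income
    resid_log = np.log(y) - fit_log(income)   # variance roughly constant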
50% of slides complete!
Multiple Regression

• Extension of simple linear regression to more than one (continuous/ordinal) independent variable
• We use least squares in exactly the same way to obtain estimates of the regression coefficients
• e.g. with 2 independent variables x and z, we fit the regression y = a + bx + cz, where a, b and c are the regression coefficients. This represents a plane in 3d space (see the sketch below)

[Figure: previous example]
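A minimal sketch of fitting y = a + bx + cz by least squares with NumPy; the data are hypothetical, and in practice a statistics package would report coefficients, p-values and R² together:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    z = np.array([2, 1, 4, 3, 6, 5], dtype=float)
    y = np.array([5.1, 5.9, 9.2, 9.8, 13.1, 13.9])

    # Design matrix: a column of ones (for the intercept a) plus the predictors
    X = np.column_stack([np.ones_like(x), x, z])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    a, b, c = coef
    print(a, b, c)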
Notes on multiple regression

• Make sure the variables are normal. If not, transform them. If still not normal, a variable can be split into 2 groups (categories (0/1)), e.g. high vs. low responders
• Can combine with "stepwise selection": instead of using every variable and forcing them into a final model, variables that are not predictive (e.g. petri dish temperature) can be dropped out automatically
Example

• Study to evaluate the effect of the duration of anesthesia and degree of trauma on percentage depression of lymphocyte transformation
• 35 patients
• Trauma factor classified as 0, 1, 3 and 4, depending upon severity
[Data table: duration of anesthesia, trauma factor and % depression of lymphocyte transformation for each of the 35 patients]
Example (con't)

• The fitted regression line is y = -2.55 + 10.375x + 1.105z, i.e.

  Depression = -2.55 + 10.375·Trauma + 1.105·Duration

• Both slopes are non-significant (p-value = 0.1739 for trauma, 0.7622 for duration)
• R² = 16.6%: only 16.6% of the variation in lymphocyte depression is explained by the regression
• Conclusion: trauma score and duration of anesthesia are poor explanations for lymphocyte depression
Collinearity

• If two (independent) variables are closely related, it's difficult to estimate their regression coefficients, because they tend to get confused
• This difficulty is called collinearity
• The solution is to exclude one of the highly correlated variables (a quick check is sketched below)
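A quick check in Python: the correlation between two predictors, and the corresponding variance inflation factor (VIF = 1/(1 - r²) in the two-predictor case). The scores here are made up:

    import numpy as np

    trauma   = np.array([3, 3, 2, 2, 4, 3, 2, 1], dtype=float)   # made-up scores
    duration = np.array([4, 6, 2, 2, 8, 5, 3, 1], dtype=float)   # made-up hours

    r = np.corrcoef(trauma, duration)[0, 1]
    vif = 1.0 / (1.0 - r**2)
    print(r, vif)   # rough rule of thumb: VIF above 5-10 signals trouble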
Example (cont'd)

• Correlation between trauma and duration = 0.762 (quite strong)
• Drop trauma from the regression analysis:

  Depression = 9.73 + 4.94·Duration

• The p-value for duration is 0.0457: statistically significant!
• However, the R² is still small (11.6%)
• Conclusion: although there is evidence for a non-zero slope, i.e. a linear relationship with duration, there is still considerable variation not explained by the regression
Outliers in Regression

• Outliers: cases with extreme values that differ greatly from the rest of your sample
• Even a few outliers can dramatically change estimates of the slope (b) (see the sketch below)
• Outliers can result from:
  - Errors in coding or data entry (→ rectify)
  - Highly unusual cases (→ exclude?)
  - Or, sometimes they reflect important "real" variation (→ include?)
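A small sketch of the second point: refitting with and without a single extreme case shows how much one outlier can move the slope (values made up):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 10.0])
    y = np.array([1.1, 2.0, 2.9, 4.2, 5.0, 30.0])   # last case is extreme

    b_all, a_all = np.polyfit(x, y, 1)              # slope pulled up by the outlier
    b_clean, a_clean = np.polyfit(x[:-1], y[:-1], 1)
    print(b_all, b_clean)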
Outliers: Example

[Figure: an extreme case pulls the regression line up; a second line shows the fit with the extreme case removed from the sample]
What about non-linear relationships?

[Figure: two non-linear scatterplots fitted with straight lines; in one case we fail to reject β = 0, in the other we reject β = 0, and in both the straight line misses the true pattern]
Non-linear models

• Linear regression fits a straight line to your data
• Non-linear models may, in some cases, be more appropriate
• Non-linear models are usually used in situations where non-linear relationships have some biological explanation, e.g. they are commonly used in pharmacokinetics studies and compartmental analysis
• Computationally intensive: estimates may not "converge" and results may be difficult to interpret (see the sketch below)
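As an illustration of the pharmacokinetics case, a one-compartment elimination curve C(t) = C0·e^(-kt) can be fitted with scipy.optimize.curve_fit; the data are simulated, and reasonable starting values (p0) help the fit converge:

    import numpy as np
    from scipy.optimize import curve_fit

    def model(t, c0, k):                  # C(t) = C0 * exp(-k t)
        return c0 * np.exp(-k * t)

    rng = np.random.default_rng(3)
    t = np.linspace(0, 10, 30)
    c = model(t, 100.0, 0.4) * (1 + rng.normal(0, 0.05, t.size))

    params, cov = curve_fit(model, t, c, p0=[80.0, 0.1])
    print(params)                         # estimates of C0 and k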
Analysis of Covariance (ANCOVA)

• Modelling both categorical and continuous independent variables (covariates)
• Justification: consideration of a covariate may improve the precision (reduce variability) of the comparisons of categories made in ANOVA
• Often used as a correction for baseline imbalances
• ANOVA + regression combined (see the sketch below)
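A sketch of such a fit using statsmodels. The column names follow the trial data table below, and the handful of rows are re-keyed from the start of that table with the row pairing assumed, so treat this as illustrative only:

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "Red_daySBP": [0.6, -0.7, 17.9, -2.4, 21.6, 6.0, 15.3, 5.1],
        "Ba_daySBP":  [131.1, 160.4, 147.3, 144.8, 150.7, 152.4, 143.6, 143.2],
        "Age":        [66, 62, 48, 32, 61, 68, 60, 33],
        "Treatment":  ["SPP75", "SPP75", "Los100", "Los100",
                       "Los100", "SPP75", "Los100", "SPP75"],
        "gender":     ["Female", "Male", "Male", "Male",
                       "Female", "Male", "Male", "Male"],
    })

    # Factors (Treatment, gender) and covariates (Ba_daySBP, Age) in one model
    fit = smf.ols("Red_daySBP ~ Ba_daySBP + Treatment + Age + gender", df).fit()
    print(sm.stats.anova_lm(fit, typ=1))   # sequential ANOVA table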
Example

• Antihypertensive Drug Clinical Trial: 84 patients randomly allocated to either Aliskiren or Losartan. Blood pressure measured at baseline and after several weeks of treatment. Age and gender were also recorded for each patient. Is there a significant difference between the two treatments for BP reduction?
[Data table: Age, Treatment (SPP75 = Aliskiren arm, Los100 = Losartan arm), baseline daytime SBP (Ba_daySBP), reduction in daytime SBP (Red_daySBP) and gender for each of the 84 patients; a few Red_daySBP values are missing]
Analysis without covariates

• Since treatment has only 2 levels (Losartan & Aliskiren), the ANOVA is equivalent to the two-sample t-test (see the sketch below)
• Treatment difference (Losartan - Aliskiren) = 5.06 with P-value = 0.0554
• Borderline non-significant at the 5% level of significance
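A minimal two-sample t-test in Python with scipy.stats.ttest_ind; the short arrays below merely stand in for the 84 observed SBP reductions in each arm:

    from scipy import stats

    red_losartan  = [17.9, -2.4, 21.6, 15.3, 24.6, 15.5, -1.7]   # illustrative
    red_aliskiren = [0.6, -0.7, 6.0, 5.1, -0.2, 0.5, 7.3]        # illustrative

    t_stat, p = stats.ttest_ind(red_losartan, red_aliskiren)
    print(t_stat, p)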
ANCOVA analysis

• We have as factors (categorical independent variables):
  - Treatment
  - Gender
• As covariates (continuous independent variables):
  - Age
  - Baseline BP
Results: ANOVA table

  Source       df   SS        MS        F      P-value
  Baseline BP  1    1572.90   1572.90   14.36  0.0005
  Treatment    1    651.73    651.73    5.58   0.0208
  Age          1    71.57     71.57     0.61   0.4363
  Gender       1    279.02    279.02    2.39   0.1264
Distributional assumptions

Classification of statistical methods based on distributional assumptions:

• Likelihood-based methods
  - Assume a distribution of the data explicitly at the outset
  - Can fit very complex models, e.g. model correlation structures within the data, allow for unequal variances, etc.
  - T-tests, ANOVA, regression, etc. rely on normality of the data or on the data being of a sufficiently large size (Central Limit Theorem)
• Non-parametric methods
  - Relax assumptions about the shape of the distribution
  - Methods are based on ranking of the data points
So how normal is your data?

• It is difficult to see visually whether a histogram looks normal
• Use normal probability (quantile-quantile) plots: the points should lie approximately along a straight line (see the sketch below)
• Also useful for detecting outliers
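A minimal Q-Q plot in Python via scipy.stats.probplot, with simulated data:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(4)
    data = rng.normal(50, 10, 100)

    # For near-normal data the points hug the reference line
    stats.probplot(data, dist="norm", plot=plt)
    plt.show()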
Significant Deviations from Normality: Alternatives

• Transform your data to make it more "normal" and use standard parametric tests
  - e.g. log transform (eliminates skew)
  - Difficulty: may wish to reverse-transform results back (e.g. exponentiate parameter estimates)
• Use non-parametric methods
  - Make fewer assumptions about the shape of the distribution
  - These may have less power than parametric methods
Non-parametric Equivalents

  Parametric               Non-parametric
  Paired T-test            Wilcoxon Signed Rank Test
  Two-sample T-test        Wilcoxon Rank-Sum Test
  ANOVA                    Kruskal-Wallis Analysis
  Pearson's Correlation    Spearman's Rank Correlation
  ANCOVA                   ANCOVA on ranked data

(SciPy counterparts of these tests are sketched below.)
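A quick sketch of the table's SciPy counterparts, with made-up numbers; the ANCOVA-on-ranks row has no single built-in call:

    from scipy import stats

    before = [5.1, 6.2, 5.8, 7.0, 6.5, 5.4]
    after  = [5.9, 6.8, 6.1, 7.4, 7.2, 6.0]
    g1, g2, g3 = [1, 2, 3, 5], [2, 4, 4, 6], [4, 5, 7, 8]

    print(stats.wilcoxon(before, after))    # paired: Wilcoxon signed rank
    print(stats.mannwhitneyu(g1, g2))       # two-sample rank-sum (Mann-Whitney)
    print(stats.kruskal(g1, g2, g3))        # ANOVA analogue: Kruskal-Wallis
    print(stats.spearmanr(before, after))   # Spearman's rank correlation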
Other Multivariate Methods

• Multivariate Analysis of Variance (MANOVA)
  - ANOVA with more than one dependent variable or response
  - Also MANCOVA
  - Low power is a problem
  - Not so widely used
• The objectives behind multivariate analyses can be quite different (to those presented), namely:
  - Discriminant Analysis
  - Classification
  - Clustering
  - Pattern Recognition (principal components analysis)
Sample Size Calculations

• How many observations are needed in a study?
• No mystery to this: sample sizes are often calculated by inverting the computations involved in determining statistical significance
• Unfortunately, several quantities are required before we can do any calculations (and the argument is a little circular!)
Need to specify:

• Study design and statistical test to be used, e.g. we wish to compare 2 treatments using the two-sample t-test, two-tailed
• Level of significance α (usually 0.05, i.e. 5%)
• Power 1 - β (commonly set to 0.80)
• But we also need (with more difficulty):
  - The difference we would like to detect
  - An estimate of the variability σ² we expect to see in the data
Example: Two-sample t-test

• The sample size calculation for the two-sample t-test is (with n per group)

  n = \frac{2 (z_{1-\alpha} + z_{1-\beta})^2}{\Delta^2}, \qquad \Delta = \frac{\mu_1 - \mu_2}{\sigma}

  where Δ is called the standardized difference (evaluated in the sketch below)
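This formula in Python, reproducing the worked example on the next slide (one-tailed α = 0.05, power 0.80, Δ = 0.3):

    import math
    from scipy import stats

    alpha, power, delta = 0.05, 0.80, 0.3

    z_a = stats.norm.ppf(1 - alpha)   # 1.645 (one-tailed)
    z_b = stats.norm.ppf(power)       # 0.842
    n = 2 * (z_a + z_b)**2 / delta**2
    print(math.ceil(n))               # 138 per group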
Example: Two-sample t-test calculation

• In order to detect a standardised difference of 0.3 at the 5% level of significance (one-tailed) and 80% power (z0.95 = 1.645, z0.80 = 0.84), we need a sample size of 138 per group.
Recommended Books/Links

• "Statistical Methods in Medical Research" by Armitage, Berry, Matthews (4th edition, 2002). Oxford: Blackwell Science
• "Practical Statistics for Medical Research" by Altman (1999). Kluwer Academic Publishers
• Statnotes: Topics in Multivariate Analysis, by G. David Garson: http://www2.chass.ncsu.edu/garson/PA765/statnote.htm
• Statistical computing resources – UCLA: http://www.ats.ucla.edu/stat/
Statistics Software

• The Excel stats add-in is not recommended
  - Stats calculations cannot be trusted (e.g. standard deviations may be inaccurate)
  - Doesn't have the functionality of other software
• STATA, S-Plus, Minitab, SAS, JMP, SPSS, R
  - A certain amount of programming skill is required, but well worth it
  - R is free; SAS is the most expensive (up to 10k per person per annum)
• DataDesk, InStat: user-friendly, "point-and-click", with reasonable functionality
You’re free!!
Questions?
[email protected]