Transcript
Introduction to choosing the correct statistical test + Tests for Continuous Outcomes I
Which test should I use?

Outcome variable: Continuous (e.g. pain scale, cognitive function)
- Independent observations: t-test; ANOVA; linear correlation; linear regression
- Correlated observations: paired t-test; repeated-measures ANOVA; mixed models/GEE modeling
- Assumptions: outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship

Outcome variable: Binary or categorical (e.g. fracture yes/no)
- Independent observations: relative risks; chi-square test; logistic regression
- Correlated observations: McNemar's test; conditional logistic regression; GEE modeling
- Assumptions: sufficient numbers in each cell (>=5)

Outcome variable: Time-to-event (e.g. time to fracture)
- Independent observations: Kaplan-Meier statistics; Cox regression
- Correlated observations: n/a
- Assumptions: Cox regression assumes proportional hazards between groups
The same table is then walked through in three steps:

1. What is the dependent (outcome) variable?
2. Are the observations correlated?
3. Are key model assumptions met?
Are the observations correlated?

1. What is the unit of observation?
   - person* (most common)
   - limb
   - half a face
   - physician
   - clinical center

2. Are the observations independent or correlated?
   - Independent: observations are unrelated (usually different, unrelated people)
   - Correlated: some observations are related to one another, for example: the same person over time (repeated measures), legs within a person, half a face
Example: correlated data

Split-face trial:
- Researchers assigned 56 subjects to apply SPF 85 sunscreen to one side of their faces and SPF 50 to the other prior to engaging in 5 hours of outdoor sports during midday. The outcome is sunburn (yes/no).
- Unit of observation = side of a face
- Are the observations correlated? Yes.

Russak JE et al. JAAD 2010; 62: 348-349.
Results ignoring correlation:

Table I. Dermatologist grading of sunburn after an average of 5 hours of skiing/snowboarding (P = .03; Fisher's exact test)

Sun protection factor   Sunburned   Not sunburned
85                      1           55
50                      8           48

Fisher's exact test compares the following proportions: 1/56 versus 8/56. Note that individuals are being counted twice!
Correct analysis of the data:

Table 1. Correct presentation of the data from Russak JE et al. JAAD 2010; 62: 348-349 (P = .016; McNemar's exact test).

                        SPF-50 side
SPF-85 side             Sunburned   Not sunburned
Sunburned               1           0
Not sunburned           7           48

McNemar's exact test evaluates the probability of the following: in all 7 of the 7 cases where the two sides of the face were discordant (one side burned and the other did not), the SPF-50 side sustained the burn.
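The discordant-pair logic can be sketched in code. Below is a minimal pure-Python version of the exact McNemar test (the function is our own illustration, not from the lecture): under the null hypothesis, each of the b + c discordant pairs is equally likely to fall on either side, so the split follows a Binomial(b + c, 1/2) distribution.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c.
    Under H0, the discordant pairs split as Binomial(b + c, 0.5)."""
    n = b + c
    k = min(b, c)
    p_one_sided = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p_one_sided)

# Sunburn trial: 7 discordant faces, all burned on the SPF-50 side.
p = mcnemar_exact(b=0, c=7)   # 2 * (1/2)**7 = 0.015625, reported as .016
```

This reproduces the P = .016 reported in the table.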
Correlations

Ignoring correlations will:
- overestimate p-values for within-person or within-cluster comparisons
- underestimate p-values for between-person or between-cluster comparisons
Common statistics for various types of outcome

Are key model assumptions met? (Same decision table as above.)
Key assumptions of linear models

Assumptions for linear models (t-test, ANOVA, linear correlation, linear regression, paired t-test, repeated-measures ANOVA, mixed models):

1. Normally distributed outcome variable
   - Most important for small samples; large samples are quite robust against this assumption.
2. Predictors have a linear relationship with the outcome
   - Graphical displays can help evaluate this.
Common statistics for various types of outcome data

Are key model assumptions met? (Same decision table as above.)
Key assumptions for categorical tests

Assumptions for categorical tests (relative risks, chi-square, logistic regression, McNemar's test):

1. Sufficient numbers in each cell (>=5)

In the sunscreen trial, "exact" tests (Fisher's exact, McNemar's exact) were used because of the sparse data.
With sparse data

- Need to use "exact" tests
- Need to be cautious with regression modeling, as there is a risk of overfitting
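The ">=5 in each cell" rule is usually checked against expected (not observed) counts. Here is a short sketch of that check, applied to the sunscreen 2x2 table from above (the helper name `expected_counts` is our own):

```python
# Expected count for cell (i, j): E_ij = (row_i total * col_j total) / grand total.

def expected_counts(table):
    """Return the matrix of chi-square expected counts for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

table = [[1, 55],   # SPF 85: sunburned, not sunburned
         [8, 48]]   # SPF 50: sunburned, not sunburned

expected = expected_counts(table)
sparse = any(e < 5 for row in expected for e in row)
# Smallest expected count is 56 * 9 / 112 = 4.5 < 5, so an exact test is needed.
```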
Sparse Data, Example

Retrospective study comparing prophylaxis during rehabilitation and VTEs (Enoxaparin Versus Tinzaparin for Venous Thromboembolic Prophylaxis During Rehabilitation. PM&R 2012; 4:11-17).

Risk Factor                                        n     No. VTEs
All                                                140   14
Pharmacologic prophylaxis during rehabilitation
  Tinzaparin 3500 units                            14    5
  Tinzaparin 4500 units                            58    5
  Enoxaparin                                       68    4
Pharmacologic prophylaxis before admission
  None                                             33    7
  Enoxaparin 40 or 30 mg 2 × daily                 78    4
  Heparin 5000 units 3 × daily                     24    3
  Treatment doses LMWH†                            2     0
IVC filter
  Absent                                           113   11
  Present                                          27    3
AIS level
  Nontraumatic                                     54    3
  A, B, C, or D                                    78    11

A much higher proportion of tinzaparin 3500 patients had VTEs. Could it be due to confounding?
(Same table as above.) Note the sparse data, due to low numbers of VTEs and low numbers of tinzaparin 3500-treated patients.
Characteristic                                     Tinzaparin 3500 units, n = 14
AIS level, n (%)
  Nontraumatic                                     1
  A, B, or C                                       13
  D                                                0
  Not available                                    0
Walk or use wheelchair, n (%)
  Use wheelchair                                   12
  Walk                                             1
  Not available                                    1
Pharmacologic prophylaxis before admission, n (%)
  None                                             9
  Enoxaparin 40 or 30 mg 2 × daily                 4
  Heparin 5000 units 3 × daily                     1
  Treatment doses LMWH                             0
  Not available                                    0

Dividing tinzaparin 3500 participants by their other characteristics identifies some of them uniquely.

The authors ran regressions to adjust for confounding. But it may be impossible to adjust for some confounders, and small numbers risk over-fitting.
Initial regression model

VTE = intercept (1 parameter) + prophylaxis during rehabilitation (2 parameters) + AIS level (1 parameter) + age (1 parameter) + prophylaxis before rehabilitation (3 parameters)

14 events / 8 parameters: high risk of over-fitting.
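The over-fitting risk can be made concrete as an events-per-parameter ratio (the "roughly 10 events per parameter" rule of thumb below is a general guideline we are adding, not something stated in the lecture):

```python
# Events-per-parameter check for the VTE regression described above.
events = 14
# intercept + prophylaxis during rehab + AIS level + age + prophylaxis before admission
parameters = 1 + 2 + 1 + 1 + 3
epv = events / parameters   # 1.75 events per parameter
risky = epv < 10            # common rule of thumb: want roughly >= 10
```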
Continuous outcome (means)

Outcome variable: Continuous (e.g. pain scale, cognitive function). Are the observations independent or correlated?

Independent:
- T-test: compares means between two independent groups
- ANOVA: compares means between more than two independent groups
- Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
- Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated:
- Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
- Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
- Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
- Wilcoxon sign-rank test: non-parametric alternative to the paired t-test
- Wilcoxon sum-rank test (= Mann-Whitney U test): non-parametric alternative to the t-test
- Kruskal-Wallis test: non-parametric alternative to ANOVA
- Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient
Binary or categorical outcomes (proportions)

Outcome variable: Binary or categorical (e.g. fracture, yes/no). Are the observations correlated?

Independent:
- Relative risks: odds ratios or risk ratios
- Chi-square test: compares proportions between two or more groups
- Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios

Correlated:
- McNemar's chi-square test: compares binary outcome between correlated groups (e.g., before and after)
- Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
- GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if sparse cells:
- Fisher's exact test: compares proportions between independent groups when there are sparse data (some cells <5)
- McNemar's exact test: compares proportions between correlated groups when there are sparse data (some cells <5)
Time-to-event outcome (survival data)

Outcome variable: Time-to-event (e.g., time to fracture). Are the observation groups independent or correlated?

Independent:
- Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with the log-rank test
- Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios

Correlated: n/a (already over time)

Modifications to Cox regression if proportional hazards is violated: time-dependent predictors or time-dependent hazard ratios (tricky!)
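The Kaplan-Meier estimate itself is simple to compute by hand: at each event time, multiply the running survival probability by (1 − events/at-risk). A minimal sketch follows; the follow-up times below are hypothetical, not from any study in this lecture.

```python
# Minimal Kaplan-Meier estimator. S(t) is the product over event times
# t_i <= t of (1 - d_i / n_i), with d_i events among n_i still at risk.

def kaplan_meier(times, events):
    """Return (event_time, survival_probability) pairs.
    events[i] is 1 if subject i had the event, 0 if censored."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(e for tt, e in data if tt == t)        # events at time t
        removed = sum(1 for tt, _ in data if tt == t)  # events + censorings
        if d > 0:
            surv *= 1 - d / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
        i += removed
    return curve

# Hypothetical data: events at t = 2, 3, 5; censored at t = 3 and t = 7.
curve = kaplan_meier([2, 3, 3, 5, 7], [1, 1, 0, 1, 0])
# Survival is about 0.8, 0.6, and 0.3 at t = 2, 3, and 5.
```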
Tests for continuous outcomes I…
To be continued next week…
Continuous outcome (means)

(Same summary table as above.)
Example: two-sample t-test

In 1980, some researchers reported that "men have more mathematical ability than women," as evidenced by the 1979 SATs, where a sample of 30 random male adolescents had a mean score ± 1 standard deviation of 436 ± 77 and 30 random female adolescents scored lower: 416 ± 81 (genders were similar in educational backgrounds, socio-economic status, and age). Do you agree with the authors' conclusions?
Two-sample t-test

Statistical question: Is there a difference in math SAT scores between men and women?
- What is the outcome variable? Math SAT scores
- What type of variable is it? Continuous
- Is it normally distributed? Yes
- Are the observations correlated? No
- Are groups being compared, and if so, how many? Yes, two
→ two-sample t-test
Two-sample t-test mechanics…

Data summary

Group            n    Sample Mean   Sample Standard Deviation
Group 1: women   30   416           81
Group 2: men     30   436           77
Two-sample t-test

1. Define your hypotheses (null, alternative):
H0: ♂ − ♀ math SAT = 0
Ha: ♂ − ♀ math SAT ≠ 0 [two-sided]
Two-sample t-test

2. Specify your null distribution:

F and M have approximately equal standard deviations/variances, so make a "pooled" estimate of the standard deviation/variance:

s_p = (81 + 77)/2 = 79, so s_p² = 79²

The standard error of a difference of two means is:

SE = √(s_p²/n + s_p²/m) = √(79²/30 + 79²/30) ≈ 20.4

Differences in means follow a T-distribution for small samples and a Z-distribution for large samples…
T distribution

- A t-distribution is like a Z distribution, except it has slightly fatter tails to reflect the uncertainty added by estimating the standard deviation.
- The bigger the sample size (i.e., the bigger the sample used to estimate the standard deviation), the closer t becomes to Z.
- If n > 100, t approaches Z.
Student's t Distribution

[Figure: the standard normal curve (t with df = ∞) overlaid with t (df = 13) and t (df = 5). t-distributions are bell-shaped and symmetric but have "fatter" tails than the normal. Note: t → Z as n increases.]

from "Statistics for Managers" Using Microsoft® Excel, 4th Edition, Prentice-Hall 2004
Student's t Table

Upper-tail area:

df    .25      .10      .05
1     1.000    3.078    6.314
2     0.817    1.886    2.920
3     0.765    1.638    2.353

The body of the table contains t values, not probabilities. Example: let n = 3, so df = n − 1 = 2; with α = .10 (α/2 = .05 in each tail), the critical value is t = 2.920.

from "Statistics for Managers" Using Microsoft® Excel, 4th Edition, Prentice-Hall 2004
t distribution values

With comparison to the Z value:

Confidence Level   t (10 d.f.)   t (20 d.f.)   t (30 d.f.)   Z
.80                1.372         1.325         1.310         1.28
.90                1.812         1.725         1.697         1.64
.95                2.228         2.086         2.042         1.96
.99                3.169         2.845         2.750         2.58

Note: t → Z as n increases.

from "Statistics for Managers" Using Microsoft® Excel, 4th Edition, Prentice-Hall 2004
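The "fatter tails" behavior in these tables can be checked numerically by integrating the t density. This is a rough sketch using midpoint integration (not how statistical packages actually compute tail areas):

```python
from math import gamma, sqrt, pi, exp

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def norm_pdf(x):
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def upper_tail(pdf, cutoff, upper=60.0, steps=200000):
    """Midpoint-rule approximation of the area under pdf from cutoff to upper."""
    h = (upper - cutoff) / steps
    return sum(pdf(cutoff + (i + 0.5) * h) for i in range(steps)) * h

z_tail = upper_tail(norm_pdf, 1.96)                    # about .025
t5_tail = upper_tail(lambda x: t_pdf(x, 5), 1.96)      # noticeably larger
t100_tail = upper_tail(lambda x: t_pdf(x, 100), 1.96)  # close to the normal
```

The ordering t5_tail > t100_tail > z_tail is exactly the "t → Z as n increases" pattern shown in the tables above.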
Two-sample t-test

2. (continued) Using the pooled estimate s_p = 79 and SE ≈ 20.4 from above: differences in means follow a T-distribution; here we have a T-distribution with 58 degrees of freedom (60 observations − 2 means)…
Two-sample t-test

3. Observed difference in our experiment = 20 points.

4. Calculate the p-value of what you observed:

T58 = (20 − 0)/20.4 = 0.98

The critical value for a two-tailed p-value of .05 is T58 = 2.000; since 0.98 < 2.000, p > .05 (p = .33).

5. Do not reject the null! No evidence that men are better in math ;)
Corresponding confidence interval…

20 ± 2.00 × 20.4 = (−20.8, 60.8)

Note that the 95% confidence interval crosses 0 (the null value).
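The whole calculation can be reproduced from the summary statistics. A sketch (note the lecture averages the two SDs, which with equal group sizes matches the usual pooled estimate to rounding):

```python
from math import sqrt

n1, mean1, sd1 = 30, 436, 77   # men
n2, mean2, sd2 = 30, 416, 81   # women

# Pooled variance, then the standard error of the difference in means
sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
se = sqrt(sp2 / n1 + sp2 / n2)   # about 20.4
t = (mean1 - mean2) / se         # about 0.98, compared to T with 58 df
```

With |t| = 0.98 below the critical value 2.000, the two-sided p-value is about .33, as on the slide.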
Review Question 1

A t-distribution:
a. Is approximately a normal distribution if n > 100.
b. Can be used interchangeably with a normal distribution as long as the sample size is large enough.
c. Reflects the uncertainty introduced when using the sample, rather than population, standard deviation.
d. All of the above.
Review Question 2

In a medical student class, the 6 people born on odd days had a mean height of 64.64 inches; the 10 people born on even days had a mean height of 71.15 inches. Height is roughly normally distributed. Which of the following best represents the correct statistical test for these data?

a. Z = (71.1 − 64.6)/4.5 = 6.5/4.5 = 1.44; p = ns

b. Z = (71.1 − 64.6)/(4.5/√16) = 6.5/1.4 = 4.6; p < .0001

c. T14 = (71.1 − 64.6)/√(4.7²/10 + 4.7²/6) = 6.5/2.4 = 2.7; p < .05

d. T14 = (71.1 − 64.6)/4.5 = 6.5/4.5 = 1.44; p = ns
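Option (c) can be verified directly (the per-group SD of 4.7 is read off the answer choices; this is just a check of the arithmetic):

```python
from math import sqrt

mean_even, n_even = 71.1, 10   # born on even days
mean_odd, n_odd = 64.6, 6      # born on odd days
sd = 4.7                       # per-group standard deviation from option (c)

se = sqrt(sd**2 / n_even + sd**2 / n_odd)   # about 2.4
t = (mean_even - mean_odd) / se             # about 2.7, with 14 df
```

Since 2.7 exceeds the T14 critical value of 2.145 (two-sided .05), p < .05: option (c) is the correct test.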
Continuous outcome (means)

(Same summary table as above.)
Example: paired t-test

TABLE 1. Difference between means of "before" and "after" botulinum toxin A treatment

                        Before BTxnA   After BTxnA   Difference   Significance
Social skills           5.90           5.84          NS           .293
Academic performance    5.86           5.78          .08          .068**
Date success            5.17           5.30          .13          .014*
Occupational success    6.08           5.97          .11          .013*
Attractiveness          4.94           5.07          .13          .030*
Financial success       5.67           5.61          NS           .230
Relationship success    5.68           5.68          NS           .967
Athletic success        5.15           5.38          .23          .000**

* Significant at 5% level. ** Significant at 1% level.
Paired t-test

Statistical question: Is there a difference in date success after BoTox?
- What is the outcome variable? Date success
- What type of variable is it? Continuous
- Is it normally distributed? Yes
- Are the observations correlated? Yes, it's the same patients before and after
- How many time points are being compared? Two
→ paired t-test
Paired t-test mechanics

1. Calculate the change in date success score for each person.
2. Calculate the average change in date success for the sample (= .13).
3. Calculate the standard error of the change in date success (= .05).
4. Calculate a T-statistic by dividing the mean change by the standard error (T = .13/.05 = 2.6).
5. Look up the corresponding p-value (T = 2.6 corresponds to p = .014).
6. A significant p-value indicates that the average change is significantly different from 0.
Paired t-test example 2…

Patient   Diastolic BP Before   BP After
1         100                   92
2         89                    84
3         83                    80
4         98                    93
5         108                   98
6         95                    90
Example problem: paired t-test

Patient   Diastolic BP Before   BP After   Change
1         100                   92         −8
2         89                    84         −5
3         83                    80         −3
4         98                    93         −5
5         108                   98         −10
6         95                    90         −5

Null hypothesis: average change = 0
Example problem: paired t-test

Changes: −8, −5, −3, −5, −10, −5

X̄ = (−8 − 5 − 3 − 5 − 10 − 5)/6 = −36/6 = −6

s_x = √[((−8 + 6)² + (−5 + 6)² + (−3 + 6)² + …)/5] = √[(4 + 1 + 9 + 1 + 16 + 1)/5] = √(32/5) ≈ 2.5

SE(X̄) = 2.5/√6 ≈ 1.0

T5 = (−6 − 0)/1.0 = −6

Null hypothesis: average change = 0. With 5 df, |T| > 2.571 corresponds to p < .05 (two-sided test).
Example problem: paired t-test

95% CI: −6 ± 2.571 × (1.0) = (−8.57, −3.43)

Note: the interval does not include 0.
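The steps above translate line-for-line into code. A sketch of the paired t-test done by hand (the slide rounds s_x to 2.5 and T to −6; unrounded, T is about −5.8):

```python
from math import sqrt

changes = [-8, -5, -3, -5, -10, -5]   # after minus before, per patient
n = len(changes)
mean = sum(changes) / n                                  # -6.0
var = sum((x - mean) ** 2 for x in changes) / (n - 1)    # 32/5 = 6.4
sd = sqrt(var)                                           # about 2.5
se = sd / sqrt(n)                                        # about 1.0
t = (mean - 0) / se                                      # about -5.8; T with 5 df
```

|t| exceeds the critical value 2.571, so p < .05, matching the conclusion above.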
Continuous outcome (means)

(Same summary table as above.)
Using our class data…

- Hypothesis: Students who consider themselves street smart drink more alcohol than students who consider themselves book smart.
- Null hypothesis: no difference in alcohol drinking between street smart and book smart students.

"Non-normal" class data… alcohol…
Wilcoxon sum-rank test

Statistical question: Is there a difference in alcohol drinking between street smart and book smart students?
- What is the outcome variable? Weekly alcohol intake (drinks/week)
- What type of variable is it? Continuous
- Is it normally distributed? No (and small n)
- Are the observations correlated? No
- Are groups being compared, and if so, how many? Yes, two
→ Wilcoxon sum-rank test

Results:
- Book smart: mean = 1.6 drinks/week; median = 1.5
- Street smart: mean = 2.7 drinks/week; median = 3.0
Wilcoxon rank-sum test mechanics…

- Book smart values (n = 13): 0 0 0 0 1 1 2 2 2 3 3 4 5
- Street smart values (n = 7): 0 0 2 3 3 5 6
- Combined groups (n = 20): 0 0 0 0 0 0 1 1 2 2 2 2 3 3 3 3 4 5 5 6
- Corresponding ranks: 3.5* 3.5 3.5 3.5 3.5 3.5 7.5 7.5 10.5 10.5 10.5 10.5 14.5 14.5 14.5 14.5 17 18.5 18.5 20

*Ties are assigned average ranks; e.g., there are 6 zeros, so the zeros get the average of ranks 1 through 6.
Wilcoxon rank-sum test…

- Ranks, book smart: 3.5 3.5 3.5 3.5 7.5 7.5 10.5 10.5 10.5 14.5 14.5 17 18.5
- Ranks, street smart: 3.5 3.5 10.5 14.5 14.5 18.5 20
- Sum of ranks, book smart: 3.5 + 3.5 + 3.5 + 3.5 + 7.5 + 7.5 + 10.5 + 10.5 + 10.5 + 14.5 + 14.5 + 17 + 18.5 = 125
- Sum of ranks, street smart: 3.5 + 3.5 + 10.5 + 14.5 + 14.5 + 18.5 + 20 = 85
- The Wilcoxon sum-rank test compares these sums, accounting for the difference in sample size between the two groups.
- Resulting p-value (from computer) = 0.24. Not significantly different!
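The rank sums can be recomputed with a small helper that assigns average ranks to ties (pure Python; a statistical package would then convert these sums to the p-value):

```python
def ranks_with_ties(values):
    """Map each distinct value to its average rank in the combined sample."""
    ordered = sorted(values)
    return {v: sum(i + 1 for i, x in enumerate(ordered) if x == v)
               / ordered.count(v)
            for v in set(ordered)}

book_smart = [0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 5]
street_smart = [0, 0, 2, 3, 3, 5, 6]

rank_of = ranks_with_ties(book_smart + street_smart)
book_sum = sum(rank_of[v] for v in book_smart)      # 125, as above
street_sum = sum(rank_of[v] for v in street_smart)  # 85
```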
Example 2, Wilcoxon sum-rank test…

10 dieters following Atkin's diet vs. 10 dieters following Jenny Craig.

Hypothetical RESULTS:
- Atkin's group loses an average of 34.5 lbs.
- J. Craig group loses an average of 18.5 lbs.
Conclusion: Atkin's is better?

Example: non-parametric tests

BUT take a closer look at the individual data…
- Atkin's, change in weight (lbs): +4, +3, 0, −3, −4, −5, −11, −14, −15, −300
- J. Craig, change in weight (lbs): −8, −10, −12, −16, −18, −20, −21, −24, −26, −30
[Figure: histograms of weight change (percent of dieters on the y-axis). Jenny Craig: changes spread between −30 and 0 lbs. Atkin's: most changes clustered near 0 lbs, with a single extreme outlier at −300 lbs.]
Wilcoxon Rank-Sum test

RANK the values, 1 being the least weight loss and 20 being the most weight loss.
- Atkin's: +4, +3, 0, −3, −4, −5, −11, −14, −15, −300 → ranks 1, 2, 3, 4, 5, 6, 9, 11, 12, 20
- J. Craig: −8, −10, −12, −16, −18, −20, −21, −24, −26, −30 → ranks 7, 8, 10, 13, 14, 15, 16, 17, 18, 19
Wilcoxon Rank-Sum test

- Sum of Atkin's ranks: 1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73
- Sum of Jenny Craig's ranks: 7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137
- Jenny Craig clearly ranked higher!
- P-value (from computer) = .018
Review Question 3

When you want to compare mean blood pressure between two groups, you should:
a. Use a t-test
b. Use a nonparametric test
c. Use a t-test if blood pressure is normally distributed.
d. Use a two-sample proportions test.
e. Use a two-sample proportions test only if blood pressure is normally distributed.
Continuous outcome (means)

(Same summary table as above.)
DHA and eczema…

P-values are from Wilcoxon sign-rank tests.

Figure 3 from: Koch C, Dölle S, Metzger M, Rasche C, Jungclas H, Rühl R, Renz H, Worm M. Docosahexaenoic acid (DHA) supplementation in atopic eczema: a randomized, double-blind, controlled trial. Br J Dermatol. 2008 Apr;158(4):786-92. Epub 2008 Jan 30.
Wilcoxon sign-rank test

Statistical question: Did patients improve in SCORAD score from baseline to 8 weeks?
- What is the outcome variable? SCORAD
- What type of variable is it? Continuous
- Is it normally distributed? No (and small numbers)
- Are the observations correlated? Yes, it's the same people before and after
- How many time points are being compared? Two
→ Wilcoxon sign-rank test
Wilcoxon sign-rank test mechanics…

1. Calculate the change in SCORAD score for each participant.
2. Rank the absolute values of the changes in SCORAD score from smallest to largest.
3. Add up the ranks from the people who improved and, separately, the ranks from the people who got worse.
4. The Wilcoxon sign-rank test compares these values to determine whether improvements significantly exceed declines (or vice versa).
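The four steps can be sketched directly; the eight change scores below are invented for illustration (they are not the SCORAD data):

```python
changes = [3, -1, 4, -2, 6, 5, -1, 2]   # hypothetical before-to-after changes

abs_sorted = sorted(abs(c) for c in changes)

def avg_rank(v):
    """Average rank of |change| = v among all absolute changes (handles ties)."""
    positions = [i + 1 for i, x in enumerate(abs_sorted) if x == v]
    return sum(positions) / len(positions)

w_plus = sum(avg_rank(abs(c)) for c in changes if c > 0)   # rank sum, improved
w_minus = sum(avg_rank(abs(c)) for c in changes if c < 0)  # rank sum, declined
# The test statistic is the smaller of the two sums; its p-value comes from
# the exact Wilcoxon distribution or a table.
```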
Continuous outcome (means)

(Same summary table as above.)
ANOVA example

Mean micronutrient intake from the school lunch by school:

               Calcium (mg)    Iron (mg)     Folate (μg)    Zinc (mg)
               Mean (SD)       Mean (SD)     Mean (SD)      Mean (SD)
S1 (a), n=28   117.8 (62.4)    2.0 (0.6)     26.6 (13.1)    1.9 (1.0)
S2 (b), n=25   158.7 (70.5)    2.0 (0.6)     38.7 (14.5)    1.5 (1.2)
S3 (c), n=21   206.5 (86.2)    2.0 (0.6)     42.6 (15.1)    1.3 (0.4)
P-value (d)    0.000           0.854         0.000          0.055

(a) School 1 (most deprived; 40% subsidized lunches).
(b) School 2 (medium deprived; <10% subsidized).
(c) School 3 (least deprived; no subsidization, private school).
(d) ANOVA; significant differences are highlighted in bold (P < 0.05).

FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in England - are the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92.
ANOVA

Statistical question: Does the calcium content of school lunches differ by school type (privileged, average, deprived)?
- What is the outcome variable? Calcium
- What type of variable is it? Continuous
- Is it normally distributed? Yes
- Are the observations correlated? No
- Are groups being compared and, if so, how many? Yes, three
→ ANOVA
ANOVA (ANalysis Of VAriance)

- Idea: for two or more groups, test the difference between means, for normally distributed variables.
- Just an extension of the t-test (an ANOVA with only two groups is mathematically equivalent to a t-test).
One-Way Analysis of Variance

Assumptions (same as the t-test):
- Normally distributed outcome
- Equal variances between the groups
- Groups are independent
Hypotheses of One-Way ANOVA

H0: μ1 = μ2 = μ3 = …
H1: Not all of the population means are the same
ANOVA

It's like this: if I have three groups to compare:
- I could do three pair-wise t-tests, but this would increase my type I error.
- So, instead, I want to look at the pairwise differences "all at once."
- To do this, I can recognize that variance is a statistic that lets me look at more than one difference at a time.
The "F-test"
Is the difference in the means of the groups more than background noise (= variability within groups)?

F = variability between groups / variability within groups

The F statistic summarizes the mean differences between all groups at once. The denominator is analogous to the pooled variance from a t-test.
The F-distribution
• A ratio of variances follows an F-distribution:

  s²between / s²within ~ F(n, m)

• The F-test tests the hypothesis that two variances are equal.
• F will be close to 1 if the sample variances are equal.

H0: σ²between = σ²within
Ha: σ²between ≠ σ²within
ANOVA example 2
• Randomize 33 subjects to three groups: 800 mg calcium supplement vs. 1500 mg calcium supplement vs. placebo.
• Compare the spine bone density of all 3 groups after 1 year.
Spine bone density vs. treatment
[Figure: spine bone density (y-axis, 0.7 to 1.2 g/cm²) by treatment group (placebo, 800 mg calcium, 1500 mg calcium), annotated to contrast the within-group variability of each group with the between-group variation.]
Group means and standard deviations
• Placebo group (n=11): mean spine BMD = .92 g/cm², standard deviation = .10 g/cm²
• 800 mg calcium supplement group (n=11): mean spine BMD = .94 g/cm², standard deviation = .08 g/cm²
• 1500 mg calcium supplement group (n=11): mean spine BMD = 1.06 g/cm², standard deviation = .11 g/cm²
The F-Test

s²between = n × s²x̄ = 11 × [(.92 - .97)² + (.94 - .97)² + (1.06 - .97)²] / (3 - 1) = .063

Here n = 11 is the size of each group, and each squared term is the difference of a group's mean from the overall mean (.97); this captures the between-group variation.

s²within = average s² = (1/3)(.10² + .08² + .11²) = .0095

This is the average amount of variation within groups, computed from each group's variance.

F(2,30) = s²between / s²within = .063 / .0095 = 6.6

A large F value indicates that the between-group variation exceeds the within-group variation (= the background noise).
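The hand calculation above can be reproduced in a few lines. This is a minimal sketch using only the slide's summary statistics (group size, group means, and standard deviations); the variable names are mine, and the grand mean is computed exactly rather than rounded to .97 as on the slide, so the intermediate value agrees to the displayed precision.

```python
# Reproduce the slide's F-statistic calculation for the
# calcium / bone-density example (three groups of n = 11 each).
n = 11                      # subjects per group
means = [0.92, 0.94, 1.06]  # group means (spine BMD, g/cm^2)
sds = [0.10, 0.08, 0.11]    # group standard deviations
k = len(means)

grand_mean = sum(means) / k  # valid because the groups are equal-sized

# Between-group variance: n times the variance of the group means
s2_between = n * sum((m - grand_mean) ** 2 for m in means) / (k - 1)

# Within-group variance: the average of the group variances
s2_within = sum(sd ** 2 for sd in sds) / k

F = s2_between / s2_within
print(round(s2_between, 3), round(s2_within, 4), round(F, 1))
# → 0.063 0.0095 6.6
```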
Review Question 4
Which of the following is an assumption of
ANOVA?
a. The outcome variable is normally
distributed.
b. The variance of the outcome variable is the
same in all groups.
c. The groups are independent.
d. All of the above.
e. None of the above.
ANOVA summary
• A statistically significant ANOVA (F-test) only tells you that at least two of the groups differ, but not which ones differ.
• Determining which groups differ (when it's unclear) requires more sophisticated analyses to correct for the problem of multiple comparisons…
Question: Why not just do 3 pairwise t-tests?
• Answer: because, at an error rate of 5% per test, you have an overall chance of up to 1 - (.95)³ = 14% of making a type I error (if all 3 comparisons were independent).
• If you wanted to compare 6 groups, you'd have to do 15 pairwise t-tests, which would give you a high chance of finding something significant just by chance.
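The arithmetic above generalizes directly. A small sketch (the function name is mine) that counts the pairwise tests for k groups and bounds the familywise type I error rate under independence:

```python
from math import comb

def familywise_error(k_groups, alpha=0.05):
    """Number of pairwise tests for k groups, and an upper bound on the
    chance of at least one false positive, assuming independent tests."""
    m = comb(k_groups, 2)          # pairwise comparisons: k choose 2
    return m, 1 - (1 - alpha) ** m

print(familywise_error(3))  # 3 tests, ~14% familywise error
print(familywise_error(6))  # 15 tests, ~54% familywise error
```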
Multiple comparisons
Correction for multiple comparisons
How to correct for multiple comparisons post-hoc:
• Bonferroni correction (adjusts by the most conservative amount; assuming all tests are independent, divide the alpha level by the number of tests)
• Tukey (adjusts p)
• Scheffé (adjusts p)
1. Bonferroni
For example, to make a Bonferroni correction, divide your desired alpha cut-off level (usually .05) by the number of comparisons you are making. This assumes complete independence between comparisons, which is way too conservative.

Obtained P-value   Original Alpha   # tests   New Alpha   Significant?
.001               .05              5         .010        Yes
.011               .05              4         .013        Yes
.019               .05              3         .017        No
.032               .05              2         .025        No
.048               .05              1         .050        Yes
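The table's decision rule fits in one line. A minimal sketch (the function name is mine) checking each row of the table above:

```python
def bonferroni_significant(p, n_tests, alpha=0.05):
    """True if p survives the Bonferroni-corrected threshold alpha / n_tests."""
    return p < alpha / n_tests

# The five rows of the Bonferroni table:
for p, n in [(.001, 5), (.011, 4), (.019, 3), (.032, 2), (.048, 1)]:
    print(p, n, bonferroni_significant(p, n))
```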
2/3. Tukey and Scheffé
Both methods increase your p-values to account for the fact that you've done multiple comparisons, but they are less conservative than Bonferroni (let the computer calculate them for you!).
Review Question 5
I am doing an RCT of 4 treatment regimens for blood pressure. At the end of the day, I compare blood pressures in the 4 groups using ANOVA. My p-value is .03. I conclude:
a. All of the treatment regimens differ.
b. I need to use a Bonferroni correction.
c. One treatment is better than all the rest.
d. At least one treatment is different from the others.
e. In pairwise comparisons, no treatment will be significantly different.
Continuous outcome (means)
Are the observations independent or correlated?
Outcome variable: continuous (e.g. pain scale, cognitive function)

Independent:
• T-test: compares means between two independent groups
• ANOVA: compares means between more than two independent groups
• Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
• Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated:
• Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
• Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
• Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
• Wilcoxon sign-rank test: non-parametric alternative to the paired t-test
• Wilcoxon sum-rank test (= Mann-Whitney U test): non-parametric alternative to the t-test
• Kruskal-Wallis test: non-parametric alternative to ANOVA
• Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient
Non-parametric ANOVA (Kruskal-Wallis test)
Statistical question: Do nevi counts differ by training velocity (slow, medium, fast) group in marathon runners?
• What is the outcome variable? Nevi count
• What type of variable is it? Continuous
• Is it normally distributed? No (and small sample size)
• Are the observations correlated? No
• Are groups being compared and, if so, how many? Yes, three
→ non-parametric ANOVA
Example: Nevi counts and
marathon runners
Richtig et al. Melanoma Markers in Marathon Runners: Increase with Sun Exposure and Physical Strain. Dermatology 2008;217:38-44.
Non-parametric ANOVA
Kruskal-Wallis one-way ANOVA
(just an extension of the Wilcoxon Sum-Rank test for
2 groups; based on ranks)
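Because the test is rank-based, the H statistic is easy to compute by hand. A minimal pure-Python sketch with made-up nevi counts (illustrative only, not the data from Richtig et al.), ignoring the tie correction:

```python
# Minimal Kruskal-Wallis H statistic, computed from ranks by hand
# (no tie correction). The data are made-up illustrative nevi counts.

def kruskal_wallis_H(*groups):
    # Pool all observations and assign ranks (average ranks for ties)
    pooled = sorted(x for g in groups for x in g)
    N = len(pooled)
    rank = {}
    i = 0
    while i < N:
        j = i
        while j < N and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    # H = 12 / (N(N+1)) * sum(R_g^2 / n_g) - 3(N+1)
    total = sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (N * (N + 1)) * total - 3 * (N + 1)

slow = [3, 5, 4, 8, 2, 6]
medium = [7, 9, 6, 12, 8]
fast = [15, 11, 18, 14, 20, 13]
print(round(kruskal_wallis_H(slow, medium, fast), 2))
```

In practice you would hand the groups to a library routine (e.g. `scipy.stats.kruskal`), which also applies the tie correction and returns a p-value.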
Example: Nevi counts and
marathon runners
By non-parametric ANOVA, the groups
differ significantly in nevi count
(p<.05) overall.
By Wilcoxon sum-rank test (adjusted
for multiple comparisons), the lowest
velocity group differs significantly
from the highest velocity group
(p<.05)
Richtig et al. Melanoma Markers in Marathon Runners: Increase with Sun Exposure and Physical Strain. Dermatology 2008;217:38-44.
Review Question 6
I want to compare depression scores between three groups, but I'm not sure if depression is normally distributed. What should I do?
a. Don't worry about it; run an ANOVA anyway.
b. Test depression for normality.
c. Use a Kruskal-Wallis (non-parametric) ANOVA.
d. Nothing, I can't do anything with these data.
e. Run 3 nonparametric t-tests.
Review Question 7
If depression score turns out to be very non-normal, then what should I do?
a. Don't worry about it; run an ANOVA anyway.
b. Test depression for normality.
c. Use a Kruskal-Wallis (non-parametric) ANOVA.
d. Nothing, I can't do anything with these data.
e. Run 3 nonparametric t-tests.
Review Question 8
I measure blood pressure in a cohort of elderly men yearly for 3 years. To test whether or not their blood pressure changed over time, I compare the mean blood pressures in each time period using a one-way ANOVA. This strategy is:
a. Correct. I have three means, so I have to use ANOVA.
b. Wrong. Blood pressure is unlikely to be normally distributed.
c. Wrong. The variance in BP is likely to greatly differ at the three time points.
d. Correct. It would also be OK to use three t-tests.
e. Wrong. The samples are not independent.