Scientific Methods for Health Sciences (SMHS): Special Topics
(HS853)
SOCR Team
http://www.socr.umich.edu/people/dinov/2016/Fall/HS853/
Contents

Methods for Studying Heterogeneity of Treatment Effects, Case-Studies of Comparative Effectiveness Research
    Methods for Studying Heterogeneity of Treatment Effects
        Examine the Fifth Dutch growth study (2009) fdgs
        Classification and Regression Trees (CART)
        Random Forests
    Latent growth and growth mixture modeling (LGM/GMM)
        Results interpretation
    Meta-analysis
    Series of "N of 1" trials
    Quantile Treatment Effect (QTE)
    Nonparametric Regression Methods
    Predictive risk models
        HIV Example
    Comparative Effectiveness Research: Case-Studies
        Case-Study 1: The Cetuximab Study
        Case-Study 2: The Rosiglitazone Study
        Case-Study 3: The Nurses' Health Study
Methods for Studying Heterogeneity of Treatment Effects, Case-Studies of Comparative Effectiveness Research
#install.packages("rpart")
#install.packages("rpart.plot")
library("rpart")
library(rpart.plot)
Methods for Studying Heterogeneity of Treatment Effects
Examine the Fifth Dutch growth study (2009) fdgs
Is it true that "the world's tallest nation has stopped growing taller: the height of Dutch
children from 1955 to 2009"?
#install.packages("mice")
library("mice")
## Loading required package: Rcpp
## mice 2.25 2015-11-09
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# ?fdgs # to see more info about the FDGS
head(fdgs)
##       id  reg      age  sex   hgt  wgt  hgt.z  wgt.z
## 1 100001 West 13.09514  boy 175.5 75.0  1.751  2.410
## 2 100003 West 13.81793  boy 148.4 40.0 -2.292 -1.494
## 3 100004 West 13.97125  boy 159.9 42.0 -1.000 -1.315
## 4 100005 West 13.98220 girl 159.7 46.5 -0.743 -0.783
## 5 100006 West 13.52225 girl 160.3 47.8 -0.414 -0.355
## 6 100018 East 10.21492  boy 157.8 39.7  2.025  0.823
summary(fdgs)
##        id            reg            age                sex      
##  Min.   :100001   North: 732   Min.   : 0.008214   boy :4829  
##  1st Qu.:106353   East :2528   1st Qu.: 1.618754   girl:5201  
##  Median :203855   South:2931   Median : 8.084873              
##  Mean   :180091   West :2578   Mean   : 8.157936              
##  3rd Qu.:210591   City :1261   3rd Qu.:13.547570              
##  Max.   :401955                Max.   :21.993155              
## 
##       hgt             wgt              hgt.z               wgt.z         
##  Min.   : 46.0   Min.   :  2.585   Min.   :-4.470000   Min.   :-5.04000  
##  1st Qu.: 83.8   1st Qu.: 11.600   1st Qu.:-0.678000   1st Qu.:-0.62475  
##  Median :131.5   Median : 27.500   Median :-0.019000   Median : 0.02600  
##  Mean   :123.9   Mean   : 32.385   Mean   :-0.006054   Mean   : 0.04573  
##  3rd Qu.:162.3   3rd Qu.: 51.100   3rd Qu.: 0.677000   3rd Qu.: 0.70700  
##  Max.   :208.0   Max.   :135.300   Max.   : 3.900000   Max.   : 4.74100  
##  NA's   :23      NA's   :20        NA's   :23          NA's   :20
Classification and Regression Trees (CART)
CART models start with an a priori regression model $y \sim x_1 + x_2 + \ldots + x_n + \epsilon$, where $y$ is the response variable we are modeling as a function of the predictors $\{x_1, x_2, \ldots, x_n\}$, a collection of independent variables.

• If $y$ is numeric, the resulting tree will be a regression tree.
• If $y$ is categorical, the resulting tree will be a classification tree.

The rpart package allows independent variables of all types for both classification and regression trees. To determine whether missing observations should be split left or right, rpart assesses surrogate splits, where the missing observation is sent to the child node with the largest relative frequency. The following libraries are helpful:

library(rpart)        # CART modeling
library(randomForest) # Random forest modeling
library(rpart.plot)   # rpart graphics
library(caret)        # model assessment using confusion matrices
Let's use the data frame fdgs to predict Region from Age, Height, and Weight (CP = complexity parameter).
# grow the tree
fit.1 <- rpart(reg ~ age + hgt + wgt, method="class", data=fdgs[,-1])
par(mar=c(3,6,3,6))
printcp(fit.1) # display the results
## 
## Classification tree:
## rpart(formula = reg ~ age + hgt + wgt, data = fdgs[, -1], method = "class")
## 
## Variables actually used in tree construction:
## [1] age
## 
## Root node error: 7099/10030 = 0.70778
## 
## n= 10030
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.029582      0   1.00000 1.00000 0.0064159
## 2 0.021834      1   0.97042 0.97619 0.0065193
## 3 0.010000      2   0.94858 0.95140 0.0066161
plotcp(fit.1)  # visualize cross-validation results
rpart.plot(fit.1, extra=103)
prp(fit.1, type=1, extra=103, branch=1)
summary(fit.1) # detailed summary of splits
## Call:
## rpart(formula = reg ~ age + hgt + wgt, data = fdgs[, -1], method = "class")
##   n= 10030
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.02958163      0 1.0000000 1.0000000 0.006415919
## 2 0.02183406      1 0.9704184 0.9761938 0.006519283
## 3 0.01000000      2 0.9485843 0.9514016 0.006616142
## 
## Variable importance
## age wgt hgt
##  49  26  25
## 
## Node number 1: 10030 observations,    complexity param=0.02958163
##   predicted class=South  expected loss=0.7077767  P(node) =1
##     class counts:   732  2528  2931  2578  1261
##    probabilities: 0.073 0.252 0.292 0.257 0.126
##   left son=2 (4207 obs) right son=3 (5823 obs)
##   Primary splits:
##       age < 5.397673 to the left,  improve=61.38357, (0 missing)
##       hgt < 115.15   to the left,  improve=53.45438, (23 missing)
##       wgt < 19.55    to the left,  improve=51.41483, (20 missing)
##   Surrogate splits:
##       hgt < 115.05   to the left,  agree=0.983, adj=0.959, (0 split)
##       wgt < 20.05    to the left,  agree=0.973, adj=0.935, (0 split)
## 
## Node number 2: 4207 observations
##   predicted class=East   expected loss=0.6893273  P(node) =0.4194417
##     class counts:   266  1307  1097  1228   309
##    probabilities: 0.063 0.311 0.261 0.292 0.073
## 
## Node number 3: 5823 observations,    complexity param=0.02183406
##   predicted class=South  expected loss=0.6850421  P(node) =0.5805583
##     class counts:   466  1221  1834  1350   952
##    probabilities: 0.080 0.210 0.315 0.232 0.163
##   left son=6 (1118 obs) right son=7 (4705 obs)
##   Primary splits:
##       age < 17.11294 to the right, improve=82.03326, (0 missing)
##       hgt < 166.85   to the right, improve=52.81994, (4 missing)
##       wgt < 52.95    to the right, improve=51.27184, (13 missing)
##   Surrogate splits:
##       wgt < 63.9     to the right, agree=0.848, adj=0.207, (0 split)
##       hgt < 179.65   to the right, agree=0.840, adj=0.165, (0 split)
## 
## Node number 6: 1118 observations
##   predicted class=East   expected loss=0.5805009  P(node) =0.1114656
##     class counts:    50   469   314   224    61
##    probabilities: 0.045 0.419 0.281 0.200 0.055
## 
## Node number 7: 4705 observations
##   predicted class=South  expected loss=0.6769394  P(node) =0.4690927
##     class counts:   416   752  1520  1126   891
##    probabilities: 0.088 0.160 0.323 0.239 0.189
# plot the tree
par(oma=c(0,0,2,0))
plot(fit.1, uniform=TRUE, margin=0.3, main="Classification Tree for Region (FDGS Data)")
text(fit.1, use.n=TRUE, all=TRUE, cex=1.0)
# create a better plot of the classification tree
post(fit.1, title = "Classification Tree for Region (FDGS Data)", file = "")
The printcp() call generates a cost complexity parameter table that includes:

• the complexity parameter (CP), which controls the size of the tree; greater CP values correspond to fewer splits in the tree. The optimal CP value is determined by a 10-fold cross-validation, and the optimal size of the tree is then the row in the CP table that minimizes the error with the fewest branches,
• the relative model error (1 − relative error ≈ variance explained),
• the error estimated from a 10-fold cross-validation (xerror), and
• the standard error of the xerror (xstd).
Pruning the tree

Pruning back the resulting classification tree aims to reduce overfitting of the data. A tree size that minimizes the cross-validation error, the xerror column printed by printcp(), is preferred. The command below automatically chooses the optimal complexity parameter (CP), the one associated with the smallest cross-validation error (xerror).
pruned.fit.1 <- prune(fit.1, cp=fit.1$cptable[which.min(fit.1$cptable[,"xerror"]),"CP"])
# plot the pruned tree
plot(pruned.fit.1, uniform=TRUE, margin=0.4, main="Pruned Classification Tree for Region (FDGS Data)")
text(pruned.fit.1, use.n=TRUE, all=TRUE, cex=1.0)
post(pruned.fit.1, title = "Pruned Classification Tree for Region (FDGS Data)")
# not much change, as the initial tree is not complex!
Assessing the Tree Classification Accuracy

First, assess the model with the suboptimal (largest) CP = 0.029582:
fdgs.train80.ind <- sort(sample(nrow(fdgs), nrow(fdgs)*0.8)) # 80% training
fdgs.train80 <- fdgs[fdgs.train80.ind,]
fdgs.test20 <- fdgs[-fdgs.train80.ind,]
fit.1 <- rpart(reg ~ age + hgt + wgt, method="class", data=fdgs.train80[,-1])
printcp(fit.1)
## 
## Classification tree:
## rpart(formula = reg ~ age + hgt + wgt, data = fdgs.train80[,
##     -1], method = "class")
## 
## Variables actually used in tree construction:
## [1] age
## 
## Root node error: 5700/8024 = 0.71037
## 
## n= 8024
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.030526      0   1.00000 1.00000 0.0071283
## 2 0.025789      1   0.96947 0.98895 0.0071842
## 3 0.010000      2   0.94368 0.94561 0.0073796
pruned.1 <- prune(fit.1, cp=0.029582) # prune using the suboptimal (largest) CP = 0.029582
pred.1 <- predict(pruned.1, newdata=fdgs.test20, type="class") # predict class labels for the test data using the pruned model
confusionMatrix(pred.1, fdgs.test20$reg) # compute confusion matrix and summary statistics
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction North East South West City
##      North     0    0     0    0    0
##      East     54  256   231  252   66
##      South    79  241   376  264  187
##      West      0    0     0    0    0
##      City      0    0     0    0    0
## 
## Overall Statistics
## 
##                Accuracy : 0.3151
##                  95% CI : (0.2948, 0.3359)
##     No Information Rate : 0.3026
##     P-Value [Acc > NIR] : 0.1171
## 
##                   Kappa : 0.0499
##  Mcnemar's Test P-Value : NA
## 
## Statistics by Class:
## 
##                      Class: North Class: East Class: South Class: West
## Sensitivity                0.0000      0.5151       0.6194      0.0000
## Specificity                1.0000      0.6004       0.4489      1.0000
## Pos Pred Value                NaN      0.2980       0.3278         NaN
## Neg Pred Value             0.9337      0.7899       0.7311      0.7428
## Prevalence                 0.0663      0.2478       0.3026      0.2572
## Detection Rate             0.0000      0.1276       0.1874      0.0000
## Detection Prevalence       0.0000      0.4282       0.5718      0.0000
## Balanced Accuracy          0.5000      0.5577       0.5342      0.5000
##                      Class: City
## Sensitivity               0.0000
## Specificity               1.0000
## Pos Pred Value               NaN
## Neg Pred Value            0.8739
## Prevalence                0.1261
## Detection Rate            0.0000
## Detection Prevalence      0.0000
## Balanced Accuracy         0.5000
#sensitivity=producer's accuracy and specificity=user's accuracy
Next, assess the model with the optimal (smallest) CP = 0.01:
pruned.2 <- prune(fit.1, cp=0.01) # cp=0.01 is the optimal CP
pred.2 <- predict(pruned.2, newdata=fdgs.test20, type="class") # predict class labels for the test data using the pruned model
confusionMatrix(pred.2, fdgs.test20$reg) # compute confusion matrix and summary statistics
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction North East South West City
##      North     0    0     0    0    0
##      East     62  331   299  300   76
##      South    71  166   308  216  177
##      West      0    0     0    0    0
##      City      0    0     0    0    0
## 
## Overall Statistics
## 
##                Accuracy : 0.3185
##                  95% CI : (0.2982, 0.3394)
##     No Information Rate : 0.3026
##     P-Value [Acc > NIR] : 0.0634
## 
##                   Kappa : 0.0621
##  Mcnemar's Test P-Value : NA
## 
## Statistics by Class:
## 
##                      Class: North Class: East Class: South Class: West
## Sensitivity                0.0000      0.6660       0.5074      0.0000
## Specificity                1.0000      0.5116       0.5497      1.0000
## Pos Pred Value                NaN      0.3099       0.3284         NaN
## Neg Pred Value             0.9337      0.8230       0.7200      0.7428
## Prevalence                 0.0663      0.2478       0.3026      0.2572
## Detection Rate             0.0000      0.1650       0.1535      0.0000
## Detection Prevalence       0.0000      0.5324       0.4676      0.0000
## Balanced Accuracy          0.5000      0.5888       0.5285      0.5000
##                      Class: City
## Sensitivity               0.0000
## Specificity               1.0000
## Pos Pred Value               NaN
## Neg Pred Value            0.8739
## Prevalence                0.1261
## Detection Rate            0.0000
## Detection Prevalence      0.0000
## Balanced Accuracy         0.5000
#sensitivity=producer's accuracy and specificity=user's accuracy
Notice that the classification accuracy, albeit not great, increased slightly from 0.3151 (model.1) to 0.3185 (model.2); in both cases only the East and South classes are ever predicted, so the classifier remains degenerate for the other regions.
Random Forests
Random forests may improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables). They classify cases using each tree in this new "forest" and decide the final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification). See the randomForest package.

The randomForest algorithm fits many (1,000's of) CART models to random subsets of the input data and combines the results of the predictions of the individual trees. Again, all data types may be used as independent variables, regardless of whether the model is a classification or regression tree. However, missing values are handled with rfImpute(), which uses a proximity matrix from the randomForest to impute (populate) incomplete data by:

• for continuous predictors: the weighted average of the non-missing observations (weighted by the proximities),
• for categorical predictors: the category with the largest average proximity.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
fit.2 <- randomForest(reg ~ age + hgt + wgt, method="class", na.action=na.omit, data=fdgs[,-1])
print(fit.2) # view results
## 
## Call:
##  randomForest(formula = reg ~ age + hgt + wgt, data = fdgs[, -1], method = "class", na.action = na.omit)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 68.06%
## Confusion matrix:
##       North East South West City class.error
## North    23  176   247  183  100   0.9684499
## East     48  951   728  628  165   0.6226190
## South    48  640  1246  739  248   0.5734338
## West     59  712   844  731  219   0.7150097
## City     50  265   396  302  239   0.8091054
importance(fit.2) # importance of each predictor
##     MeanDecreaseGini
## age         2623.875
## hgt         2304.597
## wgt         2370.084
Note on missing values/incomplete data: if the data have missing values, we have 3 choices:

• use a different tool (rpart handles missing values well),
• impute the missing values (see the rfImpute() sketch after this list), or
• for a small number of missing cases, use na.action = na.omit.
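As a minimal sketch of the imputation option (assuming the fdgs data from above; the object names fdgs.imputed and fit.2b, and the set.seed call, are illustrative additions, not part of the original analysis), rfImpute() can fill in the missing hgt/wgt values before refitting the forest:

# Hedged sketch: impute missing predictor values via randomForest proximities,
# then refit the forest on the completed data.
library(randomForest)
set.seed(1234)                                               # only for reproducibility
fdgs.imputed <- rfImpute(reg ~ age + hgt + wgt, data = fdgs[, -1])
fit.2b <- randomForest(reg ~ age + hgt + wgt, data = fdgs.imputed)
print(fit.2b)                                                # compare with the na.omit fit above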
Latent growth and growth mixture modeling (LGM/GMM)
We can illustrate the latent class linear mixed models implemented in hlme through a study of the quadratic trajectories of the response (remission) with TumorSize, adjusting for the CO2*Pain interaction and assuming correlated random effects for the functions of SmokingHx and Sex. To estimate the corresponding standard linear mixed model using 1 latent class, where CO2 interacts with Pain:
# install.packages("lcmm")
library("lcmm")
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
hdp <- read.csv("http://www.ats.ucla.edu/stat/data/hdp.csv")
hdp <- within(hdp, {
  Married <- factor(Married, levels = 0:1, labels = c("no", "yes"))
  DID <- factor(DID)
  HID <- factor(HID)
})
# add a new subject ID column (last column in the data, "ID"); this is necessary for the hlme call
hdp$ID <- seq.int(nrow(hdp))
model.hlme <- hlme(remission ~ IL6 + CRP + LengthofStay + Experience + I(tumorsize^2) + co2*pain + I(tumorsize^2)*pain,
                   random=~ SmokingHx + Sex, subject='ID', data=hdp, ng=1)
## Be patient, hlme is running ...
## The program took 22.05 seconds
summary(model.hlme)
## Heterogenous linear mixed model
##      fitted by maximum likelihood method
## 
## hlme(fixed = remission ~ IL6 + CRP + LengthofStay + Experience +
##     I(tumorsize^2) + co2 * pain + I(tumorsize^2) * pain, random = ~SmokingHx +
##     Sex, subject = "ID", ng = 1, data = hdp)
## 
## Statistical Model:
##      Dataset: hdp
##      Number of subjects: 8525
##      Number of observations: 8525
##      Number of latent classes: 1
##      Number of parameters: 21
## 
## Iteration process:
##      Convergence criteria satisfied
##      Number of iterations: 34
##      Convergence criteria: parameters= 1.2e-09
##                          : likelihood= 8.3e-06
##                          : second derivatives= 2.7e-05
## 
## Goodness-of-fit statistics:
##      maximum log-likelihood: -5223.9
##      AIC: 10489.79
##      BIC: 10637.86
## 
## Maximum Likelihood Estimates:
## 
## Fixed effects in the longitudinal model:
## 
##                          coef      Se    Wald p-value
## intercept             0.28636 0.24314   1.178 0.23890
## IL6                  -0.01134 0.00183  -6.184 0.00000
## CRP                  -0.00674 0.00167  -4.043 0.00005
## LengthofStay         -0.04834 0.00463 -10.436 0.00000
## Experience            0.01695 0.00119  14.263 0.00000
## I(tumorsize^2)        0.00000 0.00001  -0.076 0.93953
## co2                  -0.03549 0.16204  -0.219 0.82663
## pain                  0.03930 0.04278   0.919 0.35832
## co2:pain             -0.01489 0.02871  -0.519 0.60395
## I(tumorsize^2):pain   0.00000 0.00000   0.553 0.58045
## 
## Variance-covariance matrix of the random-effects:
##                 intercept SmokingHxformer SmokingHxnever Sexmale
## intercept         0.19311
## SmokingHxformer  -0.10618         0.20916
## SmokingHxnever   -0.12389         0.06834        0.22627
## Sexmale          -0.08131        -0.00735       -0.00002 0.17302
## 
## Residual standard error:
##    coef      Se
## 0.12998 1.18743
Results interpretation:

• The first part of the summary provides information about the dataset: the number of subjects, observations, observations deleted (by default, missing observations are deleted), number of latent classes, and number of parameters.
• Next, details about the algorithm's convergence are provided, along with the number of iterations, the convergence criteria, and the indication that the model converged correctly: "convergence criteria satisfied".
• The maximum log-likelihood, Akaike information criterion (AIC) and Bayesian information criterion (BIC) are reported.
• Estimates of the parameters, their estimated standard errors, the Wald test statistics (with Normal approximation) and the corresponding p-values are reported below them.
• For the random-effect distribution, the estimated covariance matrix of the random effects is displayed.
• The standard error of the residuals is given along with its estimated standard error.
• The effect of TumorSize on Remission does not appear to vary with Pain. This may be formally assessed using a multivariate Wald test:
WaldMult(model.hlme, pos=c(6,8))
##                           Wald Test p_value
## I(tumorsize^2) = pain = 0   0.85562 0.65193
# pos - a vector containing the indices in model.hlme of the parameters to test
We may consider the model with an adjustment for CRP only on the intercept. Below we estimate the corresponding models for a varying number of latent classes (from 1 to 3) using the default initial values:
# Initial Model: model.hlme <- hlme(remission ~ IL6 + CRP + LengthofStay + Experience + I(tumorsize^2) + co2*pain + I(tumorsize^2)*pain, random=~ SmokingHx + Sex, subject='ID', data=hdp, ng=1)
model.hlme.1 <- hlme(tumorsize ~ IL6 + CRP + LengthofStay, subject='ID', data=hdp, ng=1)
## Be patient, hlme is running ...
## The program took 0.57 seconds
model.hlme.2 <- hlme(tumorsize ~ IL6 + CRP + LengthofStay + SmokingHx, mixture=~ SmokingHx, subject='ID', data=hdp, ng=2)
## Be patient, hlme is running ...
## The program took 29.53 seconds
model.hlme.3 <- hlme(tumorsize ~ IL6 + CRP + LengthofStay + SmokingHx, mixture=~ SmokingHx, subject='ID', data=hdp, ng=3) # this may take over 6 minutes to complete!
## Be patient, hlme is running ...
## The program took 375.19 seconds
The estimation process for a varying number of latent classes can be summarized with summarytable(), which gives the log-likelihood, the number of parameters, the Bayesian Information Criterion, and the posterior proportion of each class:
summarytable(model.hlme.1, model.hlme.2, model.hlme.3)
##              G    loglik npm      BIC    %class1    %class2  %class3
## model.hlme.1 1 -33301.82   5 66648.89 100.000000
## model.hlme.2 2 -31592.79  11 63285.15  99.214076  0.7859238
## model.hlme.3 3 -31589.55  15 63314.86   6.357771 82.2991202 11.34311
(This program may take over 400 seconds to complete!)
In this example, the optimal number of latent classes according to the BIC is two (the
smallest BIC). The posterior classification is described with:
postprob(model.hlme.2)
## 
## Posterior classification:
##    class1 class2
## N 8458.00  67.00
## %   99.21   0.79
## 
## Posterior classification table:
##      --> mean of posterior probabilities in each class
##         prob1  prob2
## class1 0.8555 0.1445
## class2 0.4362 0.5638
## 
## Posterior probabilities above a threshold (%):
##          class1 class2
## prob>0.7  92.48   2.99
## prob>0.8  77.38   0.00
## prob>0.9  38.53   0.00
In this example, the first class includes a posteriori 8458 subjects (99.21%), while class 2 includes 67 subjects (0.79%). Subjects were classified into class 1 with a mean posterior probability of 0.8555.

In class 1, 92.48% of subjects were classified with a posterior probability above 0.7, while 2.99% of the subjects were classified in class 2 with a posterior probability above 0.7. Goodness-of-fit of the model can be assessed by displaying the residuals (figure, left panel) and the mean predictions of the model (figure, right panel), according to the time variable given in var.time:
plot(model.hlme.2)
# Figure (left panel)
plot(model.hlme.2, which="fit", var.time="Age", bty="l", ylab="Remission", xlab="Age", lwd=2)
# Figure (right panel)
plot(model.hlme.2, which="fit", var.time="Age", bty="l", ylab="Remission", xlab="Age", lwd=2, marg=FALSE)
The latent process mixed models implemented in lcmm are illustrated through the study of the linear trajectory of ntumors with Age, adjusted for Sex and assuming correlated random effects for the intercept and Age.

In the plot below, lines estimate the corresponding latent process mixed model with different link functions:
library("lcmm")
model.hlme.lin <- lcmm(ntumors ~ Age*Sex, random=~ Age, subject='ID', data=hdp)
## Be patient, lcmm is running ...
## The program took 27.02 seconds
model.hlme.beta <- lcmm(ntumors ~ Age*Sex, random=~ Age, subject='ID', data=hdp, link='beta')
## Be patient, lcmm is running ...
## The program took 109.44 seconds
model.hlme.spl <- lcmm(ntumors ~ Age*Sex, random=~ Age, subject='ID', data=hdp, link='splines')
## Be patient, lcmm is running ...
## The program took 54.36 seconds
model.hlme.spl5q <- lcmm(ntumors ~ Age*Sex, random=~ Age, subject='ID', data=hdp, link='5-quant-splines') # takes over 4 minutes
## Be patient, lcmm is running ...
## The program took 53.15 seconds
Link function: an optional family of link functions. By default:

• "linear" specifies a linear link function, leading to a standard linear mixed model (homogeneous or heterogeneous, as estimated in hlme),
• "beta" estimates a link function from the family of Beta cumulative distribution functions,
• "thresholds" uses a threshold model to describe the correspondence between each level of an ordinal outcome and the underlying latent process, and
• "splines" approximates the link function by I-splines. In this latter case, the number of nodes and the node locations should also be specified: the number of nodes is entered first, followed by "-"; the location is then specified with "equi", "quant" or "manual" for, respectively, equidistant nodes, nodes at quantiles of the marker distribution, or interior nodes entered manually in the argument intnodes; this is followed by "-" and finally "splines". For example, "7-equi-splines" means I-splines with 7 equidistant nodes, "6-quant-splines" means I-splines with 6 nodes located at the quantiles of the marker distribution, and "9-manual-splines" means I-splines with 9 nodes, the vector of 7 interior nodes being entered in the argument intnodes.
summary(model.hlme.lin)
The objects model.hlme.lin, model.hlme.beta, model.hlme.spl and model.hlme.spl5q are latent process mixed models that assume the exact same trajectory for the underlying latent process, but respectively a linear, Beta CDF, I-splines with 5 equidistant knots (the default with link='splines'), and I-splines with 5 knots at percentiles link function. model.hlme.lin reduces to a standard linear mixed model (link='linear' by default); the only difference from an hlme object is the parameterization of the intercept and the residual standard error, which are treated as rescaling parameters.
col <- rainbow(4)
plot(model.hlme.lin, which="linkfunction", bty='l', ylab="Number-of-Tumors", col=col[1], lwd=2, xlab="underlying latent process")
plot(model.hlme.beta, which="linkfunction", add=T, col=col[2], lwd=2)
plot(model.hlme.spl, which="linkfunction", add=T, col=col[3], lwd=2)
plot(model.hlme.spl5q, which="linkfunction", add=T, col=col[4], lwd=2)
legend(x="topleft", legend=c("linear", "beta", "splines (5 equidistant)", "splines (5 at quantiles)"), lty=1, col=col, bty="n", lwd=2)
# to obtain confidence bands use the function predictlink
link.lin <- predictlink(model.hlme.lin, ndraws=2000)
You would most likely get an error like: Error in predictlink.lcmm(model.hlme.spl, ndraws = 2000): No confidence intervals can be produced since the program did not converge properly.
To fix that, relax the convergence criteria:
model.hlme.lin <- lcmm(ntumors ~ Age*Sex, random=~ Age, subject='ID', epsY = 0.5, convB = 1e-01, convL = 1e-01, convG = 1e-01, maxiter=200, data=hdp)
model.hlme.lin$conv
# Now that we have convergence, we can obtain CIs!!!
link.lin <- predictlink(model.hlme.lin, ndraws=2000)
# plot(model.hlme.lin, which="linkfunction", bty='l', ylab="Number-of-Tumors", col=col[1], lwd=2, xlab="underlying latent process")
plot(link.lin, add=TRUE, col=col[1], lty=2, lwd=2)
legend(x="left", legend=c("95% confidence bands", "for linear fit"), lty=c(2,NA), col=c(col[1],NA), bty="n", lwd=2)
# Repeat using the other link functions: model.hlme.beta, model.hlme.spl, ...
model.hlme.beta <- lcmm(ntumors ~ Age*Sex, random=~ Age, subject='ID', data=hdp, link='beta', convB = 1e-01, convL = 1e-01, convG = 1e-01, maxiter=200)
model.hlme.beta$conv
link.beta <- predictlink(model.hlme.beta, ndraws=2000)
plot(link.beta, add=TRUE, col=col[2], lty=2, lwd=2)
legend(x="left", legend=c("95% confidence bands", "for BETA fit"), lty=c(3,NA), col=c(col[2],NA), bty="n", lwd=1)
Meta-analysis
Meta-analysis is an approach to combine treatment effects across trials or studies into an aggregated treatment effect with higher statistical power than observed in each individual trial. It may detect HTE by testing for differences in treatment effects across similar RCTs. It requires that the individual treatment effects be similar enough to ensure that pooling is meaningful. In the presence of large clinical or methodological differences between the trials, it may be best to avoid meta-analyses. The presence of HTE across studies in a meta-analysis may be due to differences in the design or execution of the individual trials (e.g., randomization methods, patient selection criteria). Cochran's Q is a method for detecting heterogeneity; it is computed as the weighted sum of squared differences between each study's treatment effect and the pooled effect across the studies. It is a barometer of inter-trial differences impacting the observed study results. A possible source of error in a meta-analysis is publication bias.

Trial size may introduce publication bias, since larger trials are more likely to be published. Language and accessibility represent other potential confounding factors. When the heterogeneity is not due to poor study design, it may be useful to optimize the treatment benefits for different cohorts of participants.

The Cochran's Q statistic is the weighted sum of squares on a standardized scale. The corresponding p-value indicates the strength of the evidence for the presence of heterogeneity. This test may sometimes have low power to detect heterogeneity, and it has been suggested to use a p-value of 0.10 as the cut-off for significance (Higgins et al., 2003). Conversely, the Q statistic may have too much power as a test of heterogeneity when the number of studies is large.
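For reference, Cochran's Q and the related $I^2$ statistic have the standard forms below (a generic formulation, not specific to any example in this section), where $\hat{\theta}_i$ is the effect estimate of study $i$, $w_i = 1/\widehat{\mathrm{var}}(\hat{\theta}_i)$ is its inverse-variance weight, and $k$ is the number of studies:

$$Q = \sum_{i=1}^{k} w_i \left(\hat{\theta}_i - \hat{\theta}\right)^2, \qquad \hat{\theta} = \frac{\sum_{i=1}^{k} w_i \hat{\theta}_i}{\sum_{i=1}^{k} w_i}, \qquad I^2 = \max\left(0,\ \frac{Q - (k-1)}{Q}\right) \times 100\%.$$

Under the null hypothesis of homogeneity, $Q$ approximately follows a $\chi^2$ distribution with $k-1$ degrees of freedom.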
Simulation Example 1
# Install and load the library
# install.packages("meta")
library(meta)
## Loading 'meta' package (version 4.6-0).
## Type 'help("meta-package")' for a brief overview.
# Set the number of studies
n.studies = 15
# number of treatments: case1, case2, control
n.trt = 3
# number of outcomes
n.event = 2
# simulate the (balanced) number of cases (case1 and case2) and controls in each study
ctl.group   = rbinom(n = n.studies, size = 200, prob = 0.3)
case1.group = rbinom(n = n.studies, size = 200, prob = 0.3)
case2.group = rbinom(n = n.studies, size = 200, prob = 0.3)
# Simulate the number of outcome events (e.g., deaths) and non-events in the control group
event.ctl.group = rbinom(n = n.studies, size = ctl.group, prob = rep(0.1, length(ctl.group)))
noevent.ctl.group = ctl.group - event.ctl.group
# Simulate the number of events and non-events in the case1 group
event.case1.group = rbinom(n = n.studies, size = case1.group, prob = rep(0.5, length(case1.group)))
noevent.case1.group = case1.group - event.case1.group
# Simulate the number of events and non-events in the case2 group
event.case2.group = rbinom(n = n.studies, size = case2.group, prob = rep(0.6, length(case2.group)))
noevent.case2.group = case2.group - event.case2.group
# Run the univariate meta-analysis using metabin(), meta-analysis of binary outcome data:
# calculation of fixed and random effects estimates (risk ratio, odds ratio, risk difference
# or arcsine difference) for meta-analyses with binary outcome data. The Mantel-Haenszel (MH),
# inverse variance and Peto methods are available for pooling.
# method = a character string indicating which method is to be used for pooling of studies,
#          one of "MH", "Inverse", or "Cochran"
# sm = a character string indicating which summary measure ("OR", "RR", "RD"=risk difference)
#      is to be used for pooling of studies
# Control vs. Case1; n.e and n.c are the numbers in the experimental and control groups
meta.ctr_case1 <- metabin(event.e = event.case1.group, n.e = case1.group, event.c = event.ctl.group, n.c = ctl.group, method = "MH", sm = "OR")
# in this case we use the Odds Ratio of the odds of death in the experimental and control studies
forest(meta.ctr_case1)
# Control vs. Case2
meta.ctr_case2 <- metabin(event.e = event.case2.group, n.e = case2.group, event.c = event.ctl.group, n.c = ctl.group, method = "MH", sm = "OR")
forest(meta.ctr_case2)
# Case1 vs. Case2
meta.case1_case2 <- metabin(event.e = event.case1.group, n.e = case1.group, event.c = event.case2.group, n.c = case2.group, method = "MH", sm = "OR")
forest(meta.case1_case2)
summary(meta.case1_case2)
## Number of studies combined: k = 15
## 
##                          OR           95%-CI     z  p-value
## Fixed effect model   0.6178 [0.5133; 0.7435] -5.10 < 0.0001
## Random effects model 0.6183 [0.5131; 0.7450] -5.06 < 0.0001
## 
## Quantifying heterogeneity:
## tau^2 = 0; H = 1.00 [1.00; 1.36]; I^2 = 0.0% [0.0%; 45.8%]
## 
## Test of heterogeneity:
##      Q d.f. p-value
##  11.98   14  0.6076
## 
## Details on meta-analytical method:
## - Mantel-Haenszel method
## - DerSimonian-Laird estimator for tau^2
The forest plot and the heterogeneity statistics ($I^2 = 0\%$, $Q = 11.98$, $p = 0.61$) indicate no evidence against the null hypothesis of study homogeneity, so the fixed effects model should be used.
Series of "N of 1" trials
This technique combines (a "series of") n-of-1 trial data to identify HTE. An n-of-1 trial is a repeated crossover trial for a single patient, which randomly assigns the patient to one treatment vs. another for a given time period, after which the patient is re-randomized to treatment for the next time period, usually repeated for 4-6 time periods. Such trials are most feasible in chronic conditions, where little or no washout period is needed between treatments and treatment effects are identifiable in the short term, such as for pain or reliable surrogate markers. Combining data from identical n-of-1 trials across a set of patients enables statistical analysis controlling for patient fixed or random effects, covariates, centers, or sequence effects; see the figure below. These combined trials are often analyzed within a Bayesian context using shrinkage estimators that combine individual and group mean treatment effects to create a "posterior" individual mean treatment effect estimate, which is a form of inverse variance-weighted average of the individual and group effects. Such trials are typically more expensive than standard RCTs on a per-patient basis; however, they require much smaller sample sizes, often fewer than 100 patients (due to the efficient individual-as-own-control design), and create individual treatment effect estimates that are not possible in a non-crossover design. For the individual patient, the treatment effect can be re-estimated after each time period, and the trial stopped at any point when the more effective treatment is identified with reasonable statistical certainty.
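For reference, the shrinkage idea can be sketched in its standard empirical-Bayes form (a generic formulation, not taken from a specific trial): for patient $i$ with observed mean treatment effect $\hat{\delta}_i$ (variance $\sigma_i^2$), group mean effect $\bar{\delta}$, and between-patient variance $\tau^2$,

$$\hat{\delta}_i^{\text{post}} = \frac{\hat{\delta}_i/\sigma_i^2 + \bar{\delta}/\tau^2}{1/\sigma_i^2 + 1/\tau^2},$$

i.e., an inverse variance-weighted average that pulls (shrinks) the individual estimate toward the group mean, more strongly when the individual estimate is noisy.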
Example
A study involving 8 participants collected data across 30 days, in which 15 treatment days and 15 control days were randomly assigned within each participant. The treatment effect is represented as a binary variable (control day=0; treatment day=1). The outcome variable represents the response to the intervention within each of the 8 participants. The study employed fixed-effects modeling, creating N-1 dummy-coded variables to represent the N=8 participants, with the last (i=8) participant serving as the reference (i.e., as the model intercept). Each dummy-coded variable thus represents the difference between participant i and the 8th participant, so all other patients' values are relative to those of the 8th (reference) subject. The overall differences across participants in fixed effects can be evaluated with multiple degree-of-freedom F-tests (a minimal sketch follows the data preview below).
Variable    Description
Constant    Intercept
PhyAct      Physical Activity
Tx          Intervention
WPSS        WP Social Support
PMss3       PM Social Support (0=3)
SelfEff25   Self Efficacy (0=25)
rm(list=ls())
Nof1 <- read.table("https://umich.instructure.com/files/330385/download?download_frd=1", sep=",", header = TRUE)  # 02_Nof1_Data.csv
attach(Nof1)
head(Nof1)
##   ID Day Tx SelfEff SelfEff25  WPSS SocSuppt PMss PMss3 PhyAct
## 1  1   1  1      33         8  0.97     5.00 4.03  1.03     53
## 2  1   2  1      33         8 -0.17     3.87 4.03  1.03     73
## 3  1   3  0      33         8  0.81     4.84 4.03  1.03     23
## 4  1   4  0      33         8 -0.41     3.62 4.03  1.03     36
## 5  1   5  1      33         8  0.59     4.62 4.03  1.03     21
## 6  1   6  1      33         8 -1.16     2.87 4.03  1.03      0
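A minimal sketch of the fixed-effects dummy coding described above (an illustrative model using the Nof1 data just loaded, separate from the mixed model fitted next; lm.fe is a hypothetical name, and note that R's factor() uses the first ID level as the reference by default rather than the last, which relevel() can change):

# Hedged sketch: participant fixed effects via dummy-coded ID variables.
# factor(ID) expands into one dummy per non-reference participant.
lm.fe <- lm(PhyAct ~ Tx + factor(ID), data = Nof1)
anova(lm.fe)  # multiple degree-of-freedom F-test for overall participant differences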
df.1 = data.frame(PhyAct, Tx, WPSS, PMss3, SelfEff25)
library("lme4")
## Loading required package: Matrix
## 
## Attaching package: 'lme4'
## The following objects are masked from 'package:lcmm':
## 
##     fixef, ranef
lm.1 <- lmer(PhyAct ~ Tx + SelfEff + Tx*SelfEff + (1|Day) + (1|ID), data = df.1)
summary(lm.1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: PhyAct ~ Tx + SelfEff + Tx * SelfEff + (1 | Day) + (1 | ID)
##    Data: df.1
## 
## REML criterion at convergence: 8820
## 
## Scaled residuals:
##     Min      1Q  Median      3Q     Max
## -2.7012 -0.6833 -0.0333  0.6542  3.9612
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  Day      (Intercept) 4.808e-13 6.934e-07
##  ID       (Intercept) 6.015e+02 2.453e+01
##  Residual             9.690e+02 3.113e+01
## Number of obs: 900, groups:  Day, 30; ID, 30
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  38.3772    14.4738   2.651
## Tx            4.0283     6.3745   0.632
## SelfEff       0.5818     0.5942   0.979
## Tx:SelfEff    0.9702     0.2617   3.708
## 
## Correlation of Fixed Effects:
##            (Intr) Tx     SlfEff
## Tx         -0.220
## SelfEff    -0.946  0.208
## Tx:SelfEff  0.208 -0.946 -0.220
# Model: PhyAct = Tx + WPSS + PMss3 + Tx*WPSS + Tx*PMss3 + SelfEff25 + Tx*SelfEff25 + error
lm.2 = lm(PhyAct ~ Tx + WPSS + PMss3 + Tx*WPSS + Tx*PMss3 + SelfEff25 + Tx*SelfEff25, df.1)
summary(lm.2)
## Call:
## lm(formula = PhyAct ~ Tx + WPSS + PMss3 + Tx * WPSS + Tx * PMss3 +
##     SelfEff25 + Tx * SelfEff25, data = df.1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max
##  -102.39   -28.24    -1.47    25.16   122.41
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   52.0067     1.8080  28.764  < 2e-16 ***
## Tx            27.7366     2.5569  10.848  < 2e-16 ***
## WPSS           1.9631     2.4272   0.809 0.418853
## PMss3         13.5110     2.7853   4.851 1.45e-06 ***
## SelfEff25      0.6289     0.2205   2.852 0.004439 **
## Tx:WPSS        9.9114     3.4320   2.888 0.003971 **
## Tx:PMss3       8.8422     3.9390   2.245 0.025025 *
## Tx:SelfEff25   1.0460     0.3118   3.354 0.000829 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.03 on 892 degrees of freedom
## Multiple R-squared:  0.2394, Adjusted R-squared:  0.2334
## F-statistic: 40.11 on 7 and 892 DF,  p-value: < 2.2e-16
Quantile Treatment Effect (QTE)
QTE employs quantile regression estimation (QRE) to examine the central tendency and
statistical dispersion of the treatment effect in a population. These may not be revealed by
the conventional mean estimation in RCTs. For instance, patients with different
comorbidity scores may respond differently to a treatment. Quantile regression has the
ability to reveal HTE according to the ranking of patients' comorbidity scores or some other
relevant covariate by which patients may be ranked. Therefore, in an attempt to inform
patient-centered care, quantile regression provides more information on the distribution of
the treatment effect than typical conditional mean treatment effect estimation. QTE
characterizes the heterogeneous treatment effect on individuals and groups across various
positions in the distributions of different outcomes of interest. This unique feature has
given quantile regression analysis substantial attention and has been employed across a
wide range of applications, particularly when evaluating the economic effects of welfare
reform.
One caveat of applying QRE in clinical trials for examining HTE is that the QTE doesn't demonstrate the treatment effect for a given patient. Instead, it focuses on the treatment effect among subjects within the qth quantile of some covariate of interest (e.g., a comorbidity score), such as those who are exactly at the top 10th percentile in terms of blood pressure or a depression score. It is not uncommon for the qth quantiles to comprise two different sets of patients before and after the treatment. For this reason, we have to assume that these two groups of patients are homogeneous if they were in the same quantiles.
Income-Food Expenditure Example
Let's examine the Engel data (N=235) on the relationship between food expenditure (foodexp) and household income (income). We can plot the data and then explore the superposition of the six fitted quantile regression lines.
#install.packages("quantreg")
library(quantreg)
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
## 
## Attaching package: 'quantreg'
## The following object is masked from 'package:survival':
## 
##     untangle.specials
data(engel)
attach(engel)
head(engel)
##     income  foodexp
## 1 420.1577 255.8394
## 2 541.4117 310.9587
## 3 901.1575 485.6800
## 4 639.0802 402.9974
## 5 750.8756 495.5608
## 6 945.7989 633.7978
Note: if $Y$ is a real-valued random variable with cumulative distribution function $F_Y(y) = P(Y \le y)$, then the $\tau$-quantile of $Y$ is given by

$$Q_Y(\tau) = F_Y^{-1}(\tau) = \inf\{y : F_Y(y) \ge \tau\}, \qquad 0 \le \tau \le 1.$$
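As a quick sanity check of this definition (a small illustrative computation, not part of the original analysis), the empirical 0.25-quantile of the food expenditure variable can be obtained directly:

quantile(engel$foodexp, probs = 0.25)  # empirical 25th percentile of foodexp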
# (1) Graphics
plot(income, foodexp, cex=.25, type="n", xlab="Household Income", ylab="Food Expenditure")
points(income, foodexp, cex=.5, col="blue")
# tau - the quantile(s) to be estimated, in the range from 0 to 1. An object "rq.process" and an object
# "rqs" are returned containing the matrix of coefficient estimates at the specified quantiles.
abline( rq(foodexp ~ income, tau=.5), col="blue")       # Quantile Regression Model
abline( lm(foodexp ~ income), lty=2, lwd=3, col="red")  # linear model
taus <- c(0.05, 0.1, 0.25, 0.75, 0.90, 0.95)
colors <- rainbow(length(taus))
models <- vector(mode = "list", length = length(taus))      # a list of models to store the QR fits for the different taus
model.names <- vector(mode = "list", length = length(taus)) # a list of model names
for( i in 1:length(taus)){
  models[[i]] <- rq(foodexp ~ income, tau=taus[i])
  var <- taus[i]
  model.names[[i]] <- paste("Model [", i , "]: tau=", var)
  abline( models[[i]], lwd=2, col= colors[[i]])
}
legend(3000, 1100, model.names, col= colors, pch= taus, bty='n', cex=.75)
Inference about quantile regression coefficients

As an alternative to the rank-inversion confidence intervals, we can obtain a table of coefficients, standard errors, t-statistics, and p-values using the summary function:
summary(models[[3]], se = "nid")
## 
## Call: rq(formula = foodexp ~ income, tau = taus[i])
## 
## tau: [1] 0.25
## 
## Coefficients:
##             Value    Std. Error t value  Pr(>|t|)
## (Intercept) 95.48354 21.39237    4.46344  0.00001
## income       0.47410  0.02906   16.31729  0.00000
Equivalently, we can call summary.rq directly (setting se = "boot" would instead compute bootstrapped standard errors):
summary.rq(models[[3]], se = "nid")
## 
## Call: rq(formula = foodexp ~ income, tau = taus[i])
## 
## tau: [1] 0.25
## 
## Coefficients:
##             Value    Std. Error t value  Pr(>|t|)
## (Intercept) 95.48354 21.39237    4.46344  0.00001
## income       0.47410  0.02906   16.31729  0.00000
Nonparametric Regression Methods
Nonparametric regression enables dealing with HTE in RCTs. Different nonparametric methods, such as kernel smoothing methods and series methods, can be used to generate test statistics for examining the presence of HTE. A kernel method is a weighting scheme based on a kernel function (e.g., uniform, Gaussian). When evaluating the treatment effect of a patient in RCTs, the kernel method assigns larger weights to those observations with similar covariates, because patients with similar covariates are assumed to provide more relevant data on the predicted treatment response. Even when examining participants that have different backgrounds (e.g., demographic, clinical), kernel smoothing methods still utilize information from highly divergent participants when estimating a particular subject's treatment effect; lower weights are simply assigned to very different subjects, and kernel methods require choosing a set of smoothing parameters to group patients according to their relative degrees of similarity. A drawback is that the corresponding proposed test statistics may be sensitive to the chosen bandwidths, which inhibits the interpretation of the results. Series methods use approximating functions (splines or power series of the explanatory variables) to construct test statistics. Compared to kernel smoothing methods, series methods normally have the advantage of computational convenience; however, the precision of the test statistics depends on the number of terms selected in the series.
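To make the kernel-weighting idea concrete, here is a minimal hand-rolled Nadaraya-Watson (local-constant) estimator on simulated data (an illustrative sketch only; nw.estimate, x, y, and h are hypothetical names, not part of the np package used below):

# Hedged sketch: local-constant kernel regression with a Gaussian kernel.
# Observations whose x is close to x0 receive larger weights.
nw.estimate <- function(x0, x, y, h) {
  w <- dnorm((x - x0) / h)     # Gaussian kernel weights
  sum(w * y) / sum(w)          # weighted average of the responses
}
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)
nw.estimate(5, x, y, h = 0.5)  # estimate of E[y | x = 5]; the bandwidth h controls the smoothing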
Canadian Wage Data Example: Nonparametric regression extends the classical parametric regression (e.g., lm, lmer) involving one continuous dependent variable, y, and one or more continuous explanatory variable(s), x. Let's start with a popular parametric model of a wage equation that we can extend to a fully nonparametric regression model. First, we will compare and contrast the parametric and nonparametric approaches to univariate regression, and then proceed to multivariate regression.

Let's use the Canadian cross-section wage data (cps71), consisting of a random sample taken from the 1971 Canadian Census of male individuals having common education (high school): N=205 observations of 2 variables, the logarithm of the individual's wage (logwage) and their age (age). The classical wage equation model includes a quadratic term in age.
# install.packages("np")
library("np")
## Nonparametric Kernel Methods for Mixed Datatypes (version 0.60-2)
## [vignette("np_faq",package="np") provides answers to frequently asked ques
tions]
data("cps71")
# (1) Linear Model: R^2 = 0.2308
model.lin <- lm( logwage ~ age + I(age^2), data = cps71)
summary(model.lin)
## Call:
## lm(formula = logwage ~ age + I(age^2), data = cps71)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.4041 -0.1711  0.0884  0.3182  1.3940
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.0419773  0.4559986  22.022  < 2e-16 ***
## age          0.1731310  0.0238317   7.265 7.96e-12 ***
## I(age^2)    -0.0019771  0.0002898  -6.822 1.02e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5608 on 202 degrees of freedom
## Multiple R-squared:  0.2308, Adjusted R-squared:  0.2232
## F-statistic:  30.3 on 2 and 202 DF,  p-value: 3.103e-12
# (2) Next, we consider the local linear nonparametric method, employing cross-validated
# bandwidth selection and estimation in one step. Start by computing the least-squares
# cross-validated bandwidths for the local constant estimator (default).
# Note that R^2 = 0.3108675
bandwidth <- npregbw(formula= logwage ~ age, data = cps71)
## Multistart 1 of 1 ...
model.np <- npreg(bandwidth, regtype = "ll", bwmethod = "cv.aic", gradients = TRUE, data = cps71)
summary(model.np)
## 
## Regression Data: 205 training points, in 1 variable(s)
##                       age
## Bandwidth(s): 1.892157
## 
## Kernel Regression Estimator: Local-Constant
## Bandwidth Type: Fixed
## Residual standard error: 0.5307943
## R-squared: 0.3108675
## 
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 1
# NP model significance may be tested by
npsigtest(model.np)
## Kernel Regression Significance Test
## Type I Test with IID Bootstrap (399 replications, Pivot = TRUE, joint = FALSE)
## Explanatory variables tested for significance:
## age (1)
## 
##               age
## Bandwidth(s): 1.892157
## 
## Individual Significance Tests
## P Value:
## age < 2.22e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, as was the case for the linear parametric model, Age is significant in the local linear NP model.
# (3) Graphical comparison of the parametric and nonparametric models.
plot(cps71$age, cps71$logwage, xlab = "age", ylab = "log(wage)", cex=.1)
lines(cps71$age, fitted(model.lin), lty = 2, col = "red")
lines(cps71$age, fitted(model.np), lty = 1, col = "blue")
legend("topright", c("Data", "Linear", "Non-linear"), col=c("Black", "Red", "Blue"), pch = c(1, 1, 1), bty='n', cex=.75)
# some additional plots presenting the parametric (quadratic, dashed line) and the
# nonparametric (solid line) estimates of the regression function for the cps71 data.
plot(model.np, plot.errors.method = "asymptotic")
plot(model.np, gradients = TRUE)
lines(cps71$age, coef(model.lin)[2]+2*cps71$age*coef(model.lin)[3], lty = 2, col = "red")
plot(model.np, gradients = TRUE, plot.errors.method = "asymptotic")
# (4) Use the linear and nonlinear models to generate predictions based on the obtained
# appropriate bandwidths and the estimated nonparametric model. We need to create a set of
# explanatory variables for which to generate predictions. These can be part of the original
# dataset or be outside its scope. Typically, we don't have the outcome for the evaluation
# data and need only provide the explanatory variables for which predicted values are
# generated by the models. Occasionally, splitting the dataset into two independent samples
# (training/testing) allows estimation of a model on one sample, and evaluation of its
# performance on the other.
cps.eval.data <- data.frame(age = seq(10, 70, by=10))    # simulate some explanatory X values (ages)
pred.lin <- predict(model.lin, newdata = cps.eval.data)  # linear prediction of log(Wage)
pred.np <- predict(model.np, newdata = cps.eval.data)    # nonlinear prediction of log(Wage)
plot(pred.lin, pred.np)
abline(lm(pred.np ~ pred.lin))
Predictive risk models
Predictive risk models represent a class of methods for identifying the potential for HTE when the individual patient's risk for disease-related events at baseline depends on observed factors. For instance, common measures are disease staging criteria, such as those used in COPD or heart failure, Framingham risk scores for cardiovascular event risk, or genetic variations, e.g., HER2 for breast cancer. Initial predictive risk modeling, aka risk function estimation, is often performed without accounting for treatment effects. Least squares or Cox proportional hazards regression methods are appropriate in many cases and provide relatively more interpretable risk functions, but they rely on linearity assumptions and may not provide optimal predictive metrics. Partial least squares is an extension of least squares methods that can reduce the dimensionality of the predictor space by interposing latent variables, predicted by linear combinations of observable characteristics, as the intermediate predictors of one or more outcomes. Recursive partitioning methods (such as random forests), support vector machines, and neural networks represent more recent methods with better predictive power than linear methods. Risk function estimation can range from highly exploratory analyses to near meta-analytic model validation, and may be useful at any stage of product development.
HIV Example
The "hmohiv" dataset represents a study of HIV-positive patients examining whether there was a difference in survival times between a cohort using intravenous drugs (drug=1) and a cohort not using IV drugs (drug=0). The hmohiv data includes the following variables: ID, time, age, drug, censor, entdate, enddate.
# clean up the environment
rm(list=ls())
# load the survival library
library(survival)
# load the hmohiv data
hmohiv <- read.table("http://www.ats.ucla.edu/stat/r/examples/asa/hmohiv.csv", sep=",", header = TRUE)
attach(hmohiv)
## The following object is masked from Nof1:
## 
##     ID
# construct a frame of the 2 cohorts: IV-drug and no-IV-drug
drug.new <- data.frame(drug=c(0,1))
# Fit a Cox proportional hazards regression model
cox.model <- coxph( Surv(time, censor) ~ drug, method="breslow")
fit.1 <- survfit(cox.model, newdata=drug.new)
# plot the results
plot(fit.1, xlab="Survival Time (Months)", ylab="Survival Probability")
points(fit.1$time, fit.1$surv[,1], pch=1)
points(fit.1$time, fit.1$surv[,2], pch=2)
legend(40, .8, c("Drug Absent", "Drug Present"), pch=c(1,2))
# to inspect the resulting Cox proportional hazards model
cox.model
## Call:
## coxph(formula = Surv(time, censor) ~ drug, method = "breslow")
## 
##       coef exp(coef) se(coef)    z      p
## drug 0.779     2.180    0.242 3.22 0.0013
## 
## Likelihood ratio test=10.2  on 1 df, p=0.00141
## n= 100, number of events= 80
Comparative Effectiveness Research: Case-Studies [1]

[1] Based on the 2009 NPC report, www.npcnow.org/publication/demystifying-comparative-effectiveness-research-case-study-learning-guide
Observational Studies: Tips for the CER Practitioners
• Different study types can offer different understandings; none should be discounted without closer examination.
• RCTs provide an accurate understanding of the effect of a particular intervention in
a well-defined patient group under “controlled” circumstances.
• Observational studies provide an understanding of real-world care and its impact,
but can be biased due to uncontrolled factors.
• Observational studies differ in the types of databases used. These databases may
lack clinical detail and contain incomplete or inaccurate data.
• Before accepting the findings from an observational study, consider whether
confounding factors may have influenced the results.
• In this scenario, subgroup analysis was vital in clarifying both study designs; what is
true for the many (e.g., overall, estrogen appeared to be detrimental) may not be true
for the few (e.g., that for the younger post-menopausal woman, the benefits were
greater and the harms less frequent).
• Carefully examine the generalizability of the study. Do the study’s patients and
intervention match those under consideration?
• Observational studies can identify associations but cannot prove cause-and-effect
relationships.
Case-Study 1: The Cetuximab Study [2]

[2] http://www.cancer.gov/cancertopics/druginfo/fda-cetuximab
What was done and what was found?
Cetuximab, an anti-epidermal growth factor receptor (EGFR) agent, has recently been added to the therapeutic armamentarium. Two important RCTs examined its impact in patients with mCRC (metastatic-stage colorectal cancer). In the first one, 56 centers in 11 European countries investigated the outcomes associated with cetuximab therapy in 329 mCRC patients who experienced disease progression either on irinotecan therapy or within 3 months thereafter. The study reported that the group on a combination of irinotecan and cetuximab had a significantly higher rate of overall response to treatment (the primary endpoint) than the group on cetuximab alone: 22.9% (95% CI, 17.5-29.1%) vs. 10.8% (95% CI, 5.7-18.1%) (P=0.007), respectively. Similarly, the median time to progression was significantly longer in the combination-therapy group (4.1 vs. 1.5 months, P<0.001). As these patients had already progressed on irinotecan prior to the study, any response was viewed as positive. Safety between the two treatment arms was similar: approximately 80% of patients in each arm experienced a rash. Grade 3 or 4 (the more severe) toxic effects on the skin were slightly more frequent in the combination-therapy group compared to cetuximab monotherapy, observed in 9.4% and 5.2% of participants, respectively. Other side effects such as diarrhea and neutropenia observed in the combination-therapy arm were considered to be in the range expected for irinotecan alone. Data from this study demonstrated the efficacy and safety of cetuximab and were instrumental in the FDA's 2004 approval.
A second RCT (2007) examined 572 patients and suggested efficacy of cetuximab in the treatment of mCRC. This study was a randomized, non-blinded, controlled trial that compared cetuximab monotherapy plus best supportive care to best supportive care alone in patients who had received and failed prior chemotherapy regimens. It reported that median overall survival (the primary endpoint) was significantly higher in patients receiving cetuximab plus best supportive care compared to best supportive care alone (6.1 vs. 4.6 months, respectively; hazard ratio for death=0.77; 95% CI: 0.64-0.92, P=0.005). This RCT described a greater incidence of adverse events in the cetuximab plus best supportive care group compared to best supportive care alone, including (most significantly) rash, as well as edema, fatigue, nausea and vomiting.
Was this the right answer?
These RCTs had fairly broad enrollment criteria and the cetuximab benefits were modest. Emerging scientific theories raised the possibility that genetically defined population subsets might experience a greater-than-average treatment benefit. One such area of inquiry entailed examining "biomarkers," or genetic indicators of a patient's greater response to therapy. Even as the above RCTs were being conducted, data emerged showing the importance of the KRAS gene.
Emerging Data
Based on the emerging biochemical evidence that the epidermal growth factor receptor (EGFR) treatment mechanism (targeted by agents such as cetuximab) was even more finely detailed than previously understood, the study authors of the 2007 RCT
undertook a retrospective subgroup analysis using tumor tissue samples
preserved from their initial study. Following laboratory analysis, all viable
tissue samples were classified as having a wild-type (non-mutated) or a
mutated KRAS gene. Instead of the previous two study arms (cetuximab plus
best supportive care vs. best supportive care alone), there were 4 for this new
analysis: each of the two original study arms was further divided by wild-type
vs. mutated KRAS status. Laboratory evaluation determined that a KRAS mutation was present in 40.9% of patients in the cetuximab plus best supportive care group and in 42.3% of patients in the best supportive care alone group. The efficacy of cetuximab was found to be significantly correlated with KRAS status: in patients with wild-type (non-mutated) KRAS genes,
cetuximab plus best supportive care compared to best supportive care alone
improved overall survival (median 9.5 vs. 4.8 months, respectively; hazard
ratio for death=0.55; 95% CI, 0.41-0.74, P<0.001), and progression-free
survival (median 3.7 vs. 1.9 months, respectively; hazard ratio for progression
or death=0.40; 95% CI, 0.30-0.54, P<0.001). Meanwhile, in patients with
mutated KRAS tumors, the authors found no significant difference in outcome
between cetuximab plus best supportive care vs. best supportive care alone.
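Statistically, the core of such a retrospective subgroup analysis is a treatment-by-biomarker interaction test. Below is a minimal sketch in R using simulated survival data (not the trial’s data); the variable names, event rates, and effect sizes are all hypothetical assumptions chosen only to illustrate the idea.

# Minimal sketch of a treatment-by-biomarker (e.g., KRAS) subgroup analysis
# on simulated survival data; all names and effect sizes are hypothetical.
library(survival)
set.seed(1234)
n <- 400
kras_mut <- rbinom(n, 1, 0.42)    # ~42% of patients carry the mutation
tx <- rbinom(n, 1, 0.5)           # 1 = cetuximab + BSC, 0 = BSC alone
# assume the treatment lowers the hazard only in wild-type (kras_mut == 0) patients
hazard <- 0.15 * exp(-0.6 * tx * (1 - kras_mut))
time <- rexp(n, rate = hazard)
status <- rbinom(n, 1, 0.8)       # ~80% of events observed (rest censored)
fit <- coxph(Surv(time, status) ~ tx * kras_mut)
summary(fit)  # the tx:kras_mut term tests whether the treatment effect differs by genotype

A significant interaction term is evidence of effect heterogeneity; the within-subgroup hazard ratios can then be reported separately, as was done for the wild-type and mutated KRAS strata.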
What next?
Based on these and similar results from other studies, the FDA narrowed its
product labeling in July 2009 to indicate that cetuximab is not recommended
for mCRC patients with mutated KRAS tumors. This distinction reduces the
relevant population by approximately 40%. Similarly, the American Society of Clinical Oncology released a provisional clinical recommendation that all mCRC
patients have their tumors tested for KRAS status before receiving anti-EGFR
therapy. The benefits of targeted treatment are many. Patients who previously
underwent cetuximab therapy without knowing their genetic predisposition
would no longer have to be exposed to the drug’s toxic effects if unnecessary, as
the efficacy of cetuximab is markedly higher in the genetically defined
appropriate patients. In a less-uncertain environment, clinicians can be more
confident in advocating a course of action in their care of patients. And finally,
knowledge that targeted therapy is possible suggests the potential for further
innovation in treatment options. In fact, research continues to demonstrate
options for targeted cetuximab treatment of mCRC at an even finer scale than
seen with KRAS; and similar genetic targeting is being investigated, and
advocated, in other cancer types.
Lessons Learned From This Case Study
Although RCTs are generally viewed as the gold standard, results of one or even
a series of trials may not accurately reflect the benefits experienced by an
individual patient. This case-study suggests that cetuximab initially appeared
to have rather modest clinical benefits. However, new information that became available, and the subsequent genetic subgroup assessments, led to very different conclusions. Clinicians should be aware that the current knowledge is likely to
evolve and any decisions about patient care should be carefully considered
with that sense of uncertainty in mind. As in this case study, subgroup analyses
(e.g., genetic subtypes) need a theoretical rationale. Ideally, the analyses should
be determined at the time of original RCT design and should not just occur as
explorations of the subsequent data. When improperly employed, post hoc
analyses may lead to incorrect patient care conclusions.
RCTs: Tips for CER Practitioners
o RCTs can determine whether an intervention can provide benefit in a very
controlled environment.
o The controlled nature of an RCT may limit its generalizability to a broader
population.
o No results are permanent; advances in scientific knowledge and
understanding can influence how we view the effectiveness (or safety) of a
therapeutic intervention.
o Targeted therapy illuminated by carefully thought out subgroup analyses can
improve the efficacious and safe use of an intervention.
Case-Study 2: The Rosiglitazone Study [3]
Meta-analysis
Often the results for the same intervention differ across clinical trials and it may not be
clear whether one therapy provides more benefit than another. As CER increases and
more studies are conducted, clinicians and policymakers are more likely to encounter
this scenario. In a systematic review, a researcher identifies similar studies and
displays their results in a table, enabling qualitative comparisons across the studies.
With a meta-analysis, the data from included studies are statistically combined into a
single “result.” Merging the data from a number of studies increases the effective
sample size of the investigation, providing a statistically stronger conclusion about the
body of research. By so doing, investigators may detect low frequency events and
demonstrate more subtle distinctions between therapeutic alternatives.
When studies have been properly identified and combined, the meta-analysis produces
a summary estimate of the findings and a confidence interval that can serve as a
benchmark in medical opinion and practice. However, when done incorrectly, the
quantitative and statistical analysis can create impressive “numbers” but biased
results. The following are important criteria for properly conducted meta-analyses:
1. Carefully defining unbiased inclusion or exclusion criteria for study selection
2. Including only those studies that have similar design elements, such as patient
population, drug regimen, outcomes being assessed, and timeframe
3. Applying correct statistical methods to combine and analyze the data
Reporting this information is essential for the reader to determine whether the data
were suitable to combine, and whether the meta-analysis draws unbiased conclusions. Meta-analyses of randomized clinical trials are considered to be the highest level of medical evidence, as they are based upon a synthesis of rigorously controlled trials that
systematically reduce bias and confounding. This technique is useful in summarizing
available evidence and will likely become more common in the era of publicly funded
comparative effectiveness research. The following case study will examine several key
principles that will be useful as the reader encounters these publications.
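As a concrete (toy) illustration of the pooling step, the sketch below combines the 2x2 outcome tables of five hypothetical trials into a single summary odds ratio and 95% confidence interval using the metafor package; all event counts are simulated, not taken from any real study.

# Toy fixed-effect meta-analysis: pool 2x2 tables from five hypothetical
# trials into one summary odds ratio; the counts are simulated.
# install.packages("metafor")
library(metafor)
dat <- data.frame(
  ev.t = c(5, 2, 8, 1, 4),           # events in the treatment arms
  n.t  = c(250, 120, 400, 90, 300),  # treatment arm sizes
  ev.c = c(2, 1, 5, 1, 2),           # events in the control arms
  n.c  = c(240, 118, 395, 88, 290)   # control arm sizes
)
fit <- rma(measure = "OR", ai = ev.t, n1i = n.t, ci = ev.c, n2i = n.c,
           data = dat, method = "FE")  # fixed-effect inverse-variance pooling
summary(fit)                # pooled log odds ratio, SE, and 95% CI
predict(fit, transf = exp)  # the summary estimate back on the OR scale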
Clinical Application
Heart disease is the leading cause of mortality in the United States, resulting in
approximately 20% of all deaths. Diabetics are particularly susceptible to heart
disease, with more than 65% of deaths attributable to it. The nonfatal complications of
diabetes are wide-ranging and include kidney failure, nerve damage, amputation,
stroke and blindness, among other outcomes. In 2007, the total estimated cost of
diabetes in the United States was $174B; $116B was derived from direct medical
expenditures and the rest from the indirect cost of lost productivity due to the disease.
With such serious health effects and heavy direct and indirect costs tied to diabetes, proper disease management is critical. Historically, diabetes treatment has focused on strict blood sugar control, assuming that this goal not only targets diabetes but also reduces other serious comorbidities of the disease.

[3] http://www.nejm.org/doi/full/10.1056/NEJMoa072761
Anti-diabetic agents have long been associated with key questions as to their
benefits/risks in the treatment of diabetes. The sulfonylurea tolbutamide, a first-generation anti-diabetic drug, was found in a landmark study in the 1970s to
significantly increase the CV mortality rate compared to patients not on this agent.
Further analysis by external parties concluded that the methods employed in this trial
were significantly flawed (e.g., use of an “arbitrary” definition of diabetes status,
heterogeneous baseline characteristics of the populations studied, and incorrect
statistical methods). Since these early studies, CV concerns continue to be an issue
with selected oral hypoglycemic agents that have subsequently entered the
marketplace.
A class of drugs, the thiazolidinediones (TZDs), was approved in the late 1990s as a solution
to the problems associated with the older generation of sulfonylureas. Rosiglitazone,
a member of the TZD class, was approved by the FDA in 1999 and was widely
prescribed for the treatment of type-2 diabetes. A number of RCTs supported the
benefit of rosiglitazone as an important new oral antidiabetic agent. However, safety
concerns developed as the FDA received reports of adverse cardiac events potentially
associated with rosiglitazone. It was in this setting that a meta-analysis by Nissen and
Wolski was published in the New England Journal of Medicine in June 2007.
What was done?
Nissen and Wolski conducted a meta-analysis examining the impact of rosiglitazone on
cardiac events and mortality compared to alternative therapeutic approaches. The
study began with a broad search to locate potential studies for review. The authors
screened published phase II, III, and IV trials; the FDA website; and the drug
manufacturer’s clinical-trial registry for applicable data relating to rosiglitazone use.
When the initial search was complete, the studies were further categorized by pre-stated inclusion criteria. Meta-analysis inclusion criteria were simple: studies had to
include rosiglitazone and a randomized comparator group treated with either another
drug or placebo, study arms had to show similar length of treatment, and all groups
had to have received more than 24 weeks of exposure to the study drugs. The studies
had to contain outcome data of interest including the rate of myocardial infarction (MI)
or death from all CV causes. Out of 116 studies surveyed by the authors, 42 met their
inclusion criteria and were included in the meta-analysis. Of the studies they included,
23 had durations of 26 weeks or less, and only five studies followed patients for more
than a year. Until this point, the study’s authors were following a path similar to that of
any reviewer interested in CV outcomes, examining the results of these 42 studies and
comparing them qualitatively. Quantitatively combining the data, however, required
the authors to make choices about the studies they could merge and the statistical
methods they should apply for analysis. Those decisions greatly influenced the results
that were reported.
What was found?
When the studies were combined, the meta-analysis contained data from 15,565
patients in the rosiglitazone group and 12,282 patients as comparators. Analyzing
their data, the authors chose one particular statistical method (the Peto odds ratio
method, a fixed-effect statistical approach), which calculates the odds of events
occurring where the outcomes of interest are rare and small in number. In comparing
rosiglitazone with a “control” group that included other drugs or placebo, the authors
reported odds ratios of 1.43 (95% CI, 1.03-1.98; P=0.03) and 1.64 (95% CI,
0.98-2.74; P=0.06) for MI and death from CV causes, respectively. In other words, the
odds of an MI or death from a CV cause are higher for rosiglitazone patients than for
patients on other therapies or placebo. The authors reported that rosiglitazone was
significantly associated with an increase in the risk of MI and had borderline
significance in increasing the risk of death from all CV causes. These findings appeared
online on the same day that the FDA issued a safety alert regarding rosiglitazone.
Discussion of the meta-analysis was immediately featured prominently in the news
media. By December 2007, prescription claims for the drug at retail pharmacies had
fallen by more than 50%.
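For intuition about the method itself: for a single 2x2 table, the one-step Peto estimate of the log odds ratio is (O - E)/V, where O is the observed number of events in the treatment arm, E is its expectation under the null hypothesis, and V is the hypergeometric variance; the pooled estimator sums these quantities across trials. A minimal sketch with made-up counts:

# One-step Peto odds ratio for a single sparse 2x2 table; the counts here
# are invented purely for illustration.
o1 <- 4;  n1 <- 300   # events / patients, treatment arm
o2 <- 1;  n2 <- 290   # events / patients, control arm
N <- n1 + n2; m1 <- o1 + o2                       # totals
E <- m1 * n1 / N                                  # expected treatment-arm events under H0
V <- m1 * (N - m1) * n1 * n2 / (N^2 * (N - 1))    # hypergeometric variance
exp((o1 - E) / V)                                 # Peto odds ratio estimate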
As diabetic patients and their clinicians reacted to the news, a methodologic debate
also ensued. This discussion included statistical issues pertaining to the conduct of the
analysis, its implications for clinical care, and finally the FDA and drug manufacturer’s
roles in overseeing and regulating rosiglitazone. Concern among patients with diabetes regarding this treatment continues in the medical community today.
Was this the right answer?
Should the studies have been combined? Commentators faulted the authors for
including several studies that were not originally intended to investigate diabetes, and
for combining both placebo and drug therapy data into one comparator arm. Some
critics noted that despite the stated inclusion criteria, some data were derived from
studies where the rosiglitazone arm was allowed a longer follow-up than the
comparator arm. By failing to account for this longer follow-up period, commentators
felt that the authors may have overestimated the effect of rosiglitazone on CV
outcomes. Many reviewers were concerned that this meta-analysis excluded trials in
which no patients suffered an MI or died from CV causes – the outcomes of greatest
interest. Some reviewers also noted that the exclusion of zero-event trials from the
pooled dataset not only gave an incomplete picture of the impact of rosiglitazone but
could have increased the odds ratio estimate. In general, the pooled dataset was
criticized by many for being a faulty microcosm of the information available regarding
rosiglitazone.
It is essential that a meta-analysis be based on similarity in the data sources. If studies
differ in important areas such as the patient populations, interventions, or outcomes,
combining their data may not be suitable. The researchers accepted studies and
populations that were clinically heterogeneous, yet pooled them as if they were not.
The study reported that the results were combined from a number of trials that were
not initially intended to investigate CV outcomes. Furthermore, the available data did
not allow for time-to-event analysis, an essential tool in comparing the impact of
alternative treatment options. Reviewers considered the data to be insufficiently
homogeneous, and the line of cause and effect to be murkier than the authors
described.
Were the statistical methods optimal?
The statistical methods for this meta-analysis also came under significant criticism.
The critiques focused on the authors’ use of the Peto method, arguing that it was an incorrect choice because data were pooled from both small and very large studies, resulting in a potential overestimation of the treatment effect. Other reviewers pointed out that the Peto
method should not have been used, as a number of the underlying studies did not have
patients assigned equally to rosiglitazone and comparator groups. Finally, critics
suggested that the heterogeneity of the included studies required an altogether
different set of analytic techniques.
Demonstrating the sensitivity of the authors’ initial analysis to the inclusion criteria
and statistical tests used, a number of researchers reworked the data from this study.
One researcher used the same studies but analyzed the data with a more commonly
used statistical method (Mantel-Haenszel), and found no significant increase in the
relative risk or common odds ratio with MI or CV death. When the pool of studies was
expanded to include those originally eliminated because they had zero CV events, the
odds ratios for MI and death from CV causes dropped from 1.43 to 1.26 (95% CI, 0.93-1.72) and from 1.64 to 1.14 (95% CI, 0.74-1.74), respectively. Neither of the recalculated odds ratios was significant for MI or CV death. Finally, several newer long-term studies have been published since the Nissen meta-analysis. Incorporating their
results with the meta-analysis data showed that rosiglitazone is associated with an
increased risk of MI but not of CV death. Thus, the findings from these meta-analyses
varied with the methods employed, the studies included, and the addition of later
trials.
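The sensitivity of pooled rare-event results to the choice of estimator is easy to demonstrate directly. The sketch below pools the same simulated sparse 2x2 data with both the Peto and the Mantel-Haenszel estimators from the metafor package; with rare events and deliberately unbalanced arm sizes, the two pooled odds ratios need not agree.

# Sensitivity sketch: pool identical sparse 2x2 data with the Peto and
# Mantel-Haenszel estimators; all counts are simulated.
library(metafor)
set.seed(42)
k  <- 20
n1 <- sample(100:800, k)        # treatment arm sizes
n2 <- round(0.8 * n1)           # deliberately unbalanced comparator arms
e1 <- rbinom(k, n1, 0.006)      # rare events (some trials will have zero)
e2 <- rbinom(k, n2, 0.004)
peto <- rma.peto(ai = e1, n1i = n1, ci = e2, n2i = n2)
mh   <- rma.mh(measure = "OR", ai = e1, n1i = n1, ci = e2, n2i = n2)
round(exp(c(Peto = peto$beta, MH = mh$beta)), 2)  # the pooled ORs can differ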
Emerging Data
The controversy surrounding the rosiglitazone meta-analysis authored by Nissen and
Wolski forced an unplanned interim analysis of a long-term, randomized trial
investigating the CV effects of rosiglitazone among patients with type 2 diabetes. The
authors of the RECORD trial noted that even though the follow-up at 3.75 years was
shorter than expected, rosiglitazone, when added to standard glucose-lowering
therapy, was found to be associated with an increase in the risk of heart failure but was
not associated with any increase in death from CV or other causes. Data at the time
were found to be insufficient to determine the effect of rosiglitazone on an increase in
the risk of MI. The final report of that trial, published in June 2009, confirmed the
elevated risk of heart failure in people with type 2 diabetes treated with rosiglitazone
in addition to glucose-lowering drugs, but continued to show inconclusive results
about the effect of the drug therapy on the risk of MI. Further, the RECORD trial
clarified that rosiglitazone does not result in an increased risk of CV morbidity or
mortality compared to standard glucose-lowering drugs. Other trials conducted since
the publishing of the meta-analysis have corroborated these results, casting further
doubt on the findings of the meta-analysis published by Nissen and Wolski.
Now what?
Some sources suggest that the original Nissen meta-analysis delivered more harm than
benefit, and that a well-recognized medical journal may have erred in its process of
peer review. Despite this criticism, it is important to note that subsequent publications
support the risk of adverse CV events associated with rosiglitazone, although
rosiglitazone use does not appear to increase deaths. These results and emerging data
point to the need for further rigorous research to clarify the benefits and risks of
rosiglitazone on a variety of outcomes, and the importance of directing the drug to the
population that will maximally benefit from its use.
Lessons Learned From This Case Study
Results from initial randomized trials that seem definitive at one time may not be
conclusive, as further trials may emerge to clarify, redirect, or negate previously
accepted results. A meta-analysis of those trials can lead to varying results based upon
the timing of the analysis and the choices made in its performance.
Meta-Analysis: Tips for CER Practitioners
o The results of a meta-analysis are highly dependent on the studies included (and excluded). Are these criteria properly defined and relevant to the purposes of the meta-analysis? Were the combined studies sufficiently similar? Can results from this cohort be generalized to other populations of interest?
o The statistical methodology can impact study results. Have there been reviews critiquing the methods used in the meta-analysis?
o A variety of statistical tests should be considered, and perhaps reported, in the analysis of results. Do the authors mention their rationale in choosing a statistical method? Do they show the stability of their results across a spectrum of analytical methods?
o Nothing is permanent. Emerging data may change the playing field, and meta-analysis results are only as good as the data and statistics from which they are derived.
Case-Study 3: The Nurses’ Health Study [4]
An observational study
An observational study is a very common type of research design in which the effects of
a treatment or condition are studied without formally randomizing patients in an
experimental design. Such studies can be done prospectively, wherein data are
collected about a group of patients going forward in time; or retrospectively, in which
the researcher looks into the past, mining existing databases for data that have already
been collected. Latter studies are frequently performed by using an electronic database
that contains, for example, administrative, “billing,” or claims data. Less commonly,
observational research uses electronic health records, which have greater clinical
information that more closely resembles the data collected in an RCT. Observational
studies often take place in “real- world” environments, which allow researchers to
collect data for a wide array of outcomes. Patients are not randomized in these studies,
but the findings can be used to generate hypotheses for investigation in a more
constrained experimental setting. Perhaps the best known observational study is the
“Framingham study,” which collected demographic and health data for a group of
individuals over many years (and continues to do so) and has provided an
understanding of the key risk factors for heart disease and stroke.
Observational studies present many advantages to the comparative effectiveness
researcher. The study design can provide a unique glimpse of the use of a health care
intervention in the “real world,” an essential step in gauging the gap between efficacy
(can a treatment work in a controlled setting?) and effectiveness (does the treatment
work in a real-life situation?). Furthermore, observational studies can be conducted at
low cost, particularly if they involve the secondary analysis of existing data sources.
CER often uses administrative databases, which are based upon the billing data
submitted by providers during routine care. These databases typically have limited
clinical information, may have errors in them, and generally do not undergo auditing.
The uncontrolled nature of observational studies allows them to be subject to bias and
confounding. For example, doctors may prescribe a new medication only for the sickest
patients. Comparing these outcomes (without careful statistical adjustment) with those
from less ill patients receiving alternative treatment may lead to misleading results.
Observational studies can identify important associations but cannot prove cause and
effect. These studies can generate hypotheses that may require RCTs for fuller
demonstration of those relationships. Secondary analysis can also be problematic if
researchers overwork datasets by doing multiple exploratory analyses (e.g., data-dredging): the more we look, the more we find, even if those findings are merely statistical aberrations. Unfortunately, the growing need for CER and the wide availability of administrative databases may lead to a proliferation of poor-quality research with inaccurate findings.
[4] http://jech.bmj.com/content/59/9/740.short
In comparative effectiveness research, observational studies are typically considered to
be less conclusive than RCTs and meta-analyses. Nonetheless, they can be useful,
especially because they examine typical care. Due to lower cost and improvements in
health information, observational studies will become increasingly common. Critical
assessment of whether the described results are helpful or biased (based upon how the study was performed) is necessary. This case will illustrate several characteristics of
the types of studies that will assist in evaluating newly published work.
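The confounding-by-indication problem described above can be made concrete with a few lines of R. In the simulation below (all variables hypothetical), the drug has no true effect, but sicker patients receive it more often, so a naive comparison suggests harm; adjusting for an estimated propensity score recovers the null. This is only a sketch of one common adjustment strategy, not a full observational-analysis workflow.

# Confounding-by-indication sketch on simulated data: the drug has NO true
# effect, but sicker patients are more likely to receive it.
set.seed(11)
n <- 2000
severity <- rnorm(n)                                  # disease severity (confounder)
treated  <- rbinom(n, 1, plogis(1.2 * severity))      # sicker -> more often treated
outcome  <- rbinom(n, 1, plogis(-1 + 1.5 * severity)) # outcome driven by severity only
# naive comparison: treatment appears harmful (confounded estimate)
coef(summary(glm(outcome ~ treated, family = binomial)))["treated", ]
# propensity-score adjustment: model P(treated | severity), then adjust for it
ps <- fitted(glm(treated ~ severity, family = binomial))
coef(summary(glm(outcome ~ treated + ps, family = binomial)))["treated", ]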
Clinical Applications
Cardiovascular diseases (CVD) are the leading cause of death in women older than the
age of 50. Epidemiologic evidence suggests that estrogen is a key mediator in the
development of CVD. Estrogen is an ovarian hormone whose production decreases as
women approach menopause. The steep increase in CVD in women at menopause and
older and in women who have had hysterectomies further supports a relationship
between estrogen and CVD. Building on this evidence of biologic plausibility,
epidemiological and observational studies suggested that estrogen replacement
therapy (a form of hormone replacement therapy, or HRT) had positive effects on the
risk of CVD in postmenopausal women (albeit with some negative effects in its potential to increase the risk for breast cancer and stroke). Based on these findings,
in the 1980s and 1990s HRT was routinely employed to treat menopausal symptoms
and serve as prophylaxis against CVD.
What was done?
The Nurses’ Health Study (NHS) began collecting data in 1976. In the study, researchers
intended to examine a broad range of health effects in women over a long period of
time, and a key goal was to clarify the role of HRT in heart disease. The cohort (i.e., the
group being followed) included married registered nurses aged 30-55 in 1976 who
lived in the 11 most populous states. To collect data, the researchers mailed the study
participants a survey every 2 years that asked questions about topics such as smoking,
hormone use, menopausal status, and less frequently, diet. Data were collected for key
end points that included MI, coronary-artery bypass grafting or angioplasty, stroke,
total CVD mortality, and deaths from all causes.
What was found?
At a 10-year follow-up point, the NHS had a study pool of 48,470 women. The
researchers found that estrogen use (alone, without progestin) in postmenopausal
women was associated with a reduction in the incidence of CVD as well as in CVD
mortality compared to non-users. Later, estrogen-progestin combination therapy was
shown to be even more cardioprotective than estrogen monotherapy, and lower doses
of estrogen replacement therapy were found to deliver equal cardioprotection and
lower the risk for adverse events. NHS researchers were alert to the potential for bias in
observational studies. Adjustment for risk factors such as age (a typical practice to
eliminate confounding) did not change the reported findings.
Was this the right answer?
The NHS was not unique in reporting the benefits associated with HRT; other
observational studies corroborated the NHS findings. A secondary retrospective data
analysis of the UK primary care electronic medical record database, for example, also
showed the protective effect associated with HRT use. Researchers were aware of the
fundamental limitations of observational studies, particularly with regard to selection
bias. They and practicing clinicians were also aware of the potential negative health
effects of HRT, which had to be constantly weighed against the potential
cardioprotective benefits in deciding a patient’s course of treatment. As a large section
of the population could experience the health effects of HRT, researchers began
planning RCTs to verify the promising observational study results. It was highly
anticipated that those RCTs would corroborate the belief that estrogen replacement can
reduce CVD risk.
Randomized Controlled Trial: The Women’s Health Initiative
The Women’s Health Initiative (WHI) was a major study established by the National Institutes of Health in 1992 to assess a broad range of health effects in postmenopausal
women. The trial was intended to follow these women for 8 years, at a cost of millions
of dollars in federal funding. Among its many facets, it included an RCT to confirm the
results from the observational studies discussed above. To fully investigate earlier
findings, the WHI had two subgroups. One subgroup consisted of women with prior
hysterectomies; they received estrogen monotherapy. The second group consisted of
women who had not undergone hysterectomy; they received estrogen in combination
with progestin. The WHI enrolled 27,347 women in their HRT investigation: 10,739 in
the estrogen-alone arm and 16,608 in the estrogen plus progestin arm. Within each
arm, women were randomly assigned to receive either HRT or placebo. All women in
the trial were postmenopausal and aged 50-79 years; the mean age was 63.6 years (a
fact that would be important in later analysis). Some participants had experienced
previous CV events. The primary outcome of both subgroups was coronary heart
disease (CHD), as described by nonfatal MI or death due to CHD.
The estrogen-progestin arm of the WHI was halted after a mean follow-up of 5.2 years,
3 years earlier than expected, as the HRT users in this arm were found to be at
increased risk for CHD compared to those who received placebo. The study also noted
elevated rates of breast cancer and stroke, among other poor outcomes. The estrogen-alone arm continued for an average follow-up of 6.8 years before being similarly
discontinued ahead of schedule. Although this part of the study did not find an
increased risk of CHD, it also did not find any cardioprotective effect. Beyond failing to
locate any clear CV benefits, the WHI also found real evidence of harm, including
increased risk of blood clots, breast cancer and stroke. Initial WHI publications
therefore recommended against HRT being prescribed for the secondary prevention of
CVD.
What Next?
Scientists, and the clinicians who relied on their data for guidance in treating patients, were faced with conflicting data: epidemiological and observational studies suggested
that HRT was cardioprotective while the higher-quality evidence from RCTs strongly
suggested the opposite. Clinicians primarily followed the WHI results, so prescriptions
for HRT in postmenopausal women quickly declined. Meanwhile, researchers began to
analyze the studies for potential discrepancies, and found that the women being
followed in the NHS and the WHI differed in several important characteristics.
First, the WHI population was older than the NHS cohort, and many had entered
menopause at least 10 years before they enrolled in the RCT. Thus, the WHI enrollees
experienced a long duration from the onset of menopause to the commencement of
HRT. At the same time, many in the NHS population were closer to the onset of
menopause and were still displaying hormonal symptoms when they began HRT.
Second, although the NHS researchers adjusted the data for various confounding effects,
their results could still have been subject to bias. In general, the NHS cohort was more
highly educated and of a higher socioeconomic status than the WHI participants, and
therefore more likely to see a physician regularly. The NHS women were also leaner
and generally healthier than their RCT counterparts, and had been selected for their
evident lack of pre-existing CV conditions. This selection bias in the NHS enrollment
may have led to a “healthy woman” effect that in turn led to an overestimation of the
benefits of therapy in the observational study. Third, researchers noted that dosing
differences between the two study types may have contributed to the divergent results.
The NHS reported beneficial results following low-dose estrogen therapy. The WHI,
meanwhile, used a higher estrogen dose, exposing women to a larger dosage of
hormones and increasing their risk for adverse events. The increased risk profile of the
WHI women (e.g., older, more comorbidities, higher estrogen dose) could have
contributed to the evidence of harm seen in the WHI results.
Emerging Data
In addition to identifying the inherent differences between the two study populations,
researchers began a secondary analysis of the NHS and WHI trials. NHS researchers
reported that women who began HRT close to the onset of menopause had a
significantly reduced risk of CHD. In the subgroups of women that were older and had a
similar duration after menopause compared with the WHI women, they found no
significant relationship between HRT and CHD. Also, the WHI study further stratified
these results by age, and found that women who began HRT close to their onset of
menopause experienced some cardioprotection, while women who were further from
the onset of menopause had a slightly elevated risk for CHD.
Secondary analysis of both studies was therefore necessary to show that age and a
short duration from the onset of menopause are crucial to HRT success as a
cardioprotective agent. Neither study type provided “truth”; or rather, both studies provided “truth” when viewed carefully (e.g., both produced valid and important results). The differences seen in the studies were rooted in the timing of HRT and the populations being studied.
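A minimal sketch of the timing-hypothesis re-analysis idea, using simulated data: the HRT effect on CHD is assumed to fade (and eventually reverse) with years since menopause, and an interaction term in a logistic model recovers that effect modification. All variable names and coefficients are hypothetical.

# Timing-hypothesis sketch on simulated data: the assumed HRT benefit fades
# (and reverses) as years since menopause increase; values are hypothetical.
set.seed(7)
n <- 5000
yrs_post_meno <- runif(n, 0, 25)
hrt <- rbinom(n, 1, 0.5)
lp  <- -3 + hrt * (-0.4 + 0.03 * yrs_post_meno)  # assumed effect modification
chd <- rbinom(n, 1, plogis(lp))
fit <- glm(chd ~ hrt * yrs_post_meno, family = binomial)
coef(summary(fit))  # the hrt:yrs_post_meno term captures the change in HRT effect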
Lessons Learned From This Case Study
Although RCTs are given a higher evidence grade, observational studies provide
important clinical insights. In this example, the study populations differed. For
policymakers and clinicians, it is crucial to examine whether the CER was based upon
patients similar to those being considered. Any study with a dissimilar population may
provide non-relevant results. Thus, readers of CER need to carefully examine the
generalizability of the findings being reported.