Download Hands-on Lab using mboost: Modeling Breast Cancer Gene

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Gene therapy wikipedia , lookup

Gene expression profiling wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Designer baby wikipedia , lookup

Gene therapy of the human retina wikipedia , lookup

RNA-Seq wikipedia , lookup

Gene expression programming wikipedia , lookup

Transcript
Hands-on Lab using mboost:
Modeling Breast Cancer Gene
Expression Data
Matthias Schmid, Andreas Mayr
Institut für Medizininformatik, Biometrie und Epidemiologie;
Friedrich-Alexander-Universität Erlangen-Nürnberg
Email: [email protected]
Exercises
1. Download the data from
http://www.imbe.med.uni-erlangen.de/ma/M.Schmid/Tutorial.html
(files array.dat, clinical.dat) and read the data into R using the script getData.r. The
data set is now available as bcdata.
2. Make yourself familiar with the data set using the summary and table functions. Values
of the clinical covariates are contained in columns 1 to 11, gene expression measurements
are contained in columns 12 to 4930. For a short description of clinical covariates, see
below.
3. The goal is to construct a prediction rule for the outcome variable “time to death”
(variable survival). To obtain a fair estimate of the accuracy of the rule, we will use
independent data sets for estimating the prediction rule (“learning sample”) and for
evaluating the prediction accuracy of the rule (“test sample”).
⇒ Split the data randomly into a learning sample of size 197 (≈ two thirds of the
observations) and a test sample of size 98 (≈ one third of the observations) using the
script splitData.r.
4. Prediction rule 1. Fit a linear Cox model on the learning sample using the gene expression measurements only (function glmboost). Use 10-fold cross-validation to determine
the optimal stopping iteration. Evaluate the prediction accuracy using the observations
in the test sample and the CoxPH()@risk function. (Note: The partial log likelihood of
the observations in the test sample will be used as a measure of prediction accuracy.)
5. Prediction rule 2. To evaluate clinical relevance, it is of interest to investigate whether
gene expression data improve established clinical predictors, i.e., if they there is an added
predictive value when combining gene expression measurements and clinical data.
This issue can be addressed by constructing a prediction rule including maximum partial
likelihood estimates of the clinical data and regularized (boosting) estimates of the gene
expression data.
To construct such a prediction rule, first use the coxph function of the survival package
and fit a Cox model based on the learning sample and the clinical predictors only (“clinical model”). Afterwards, use the glmboost function and fit a linear Cox model based
on the gene expression measurements. Use the predictions of the clinical model as offset
values to force the clinical variables into the model (argument offset in glmboost).
6. Compare the prediction accuracies of prediction rules 1 and 2. How many genes were
selected by prediction rules 1 and 2?
7. Compare the accuracy of prediction rules 1 and 2 to the following alternative prediction
rules:
(a) The null model without any predictor variables (→ Cox model without covariates).
(b) The clinical Cox model (without gene expression measurements).
(c) Optionally, a Ridge-penalized Cox model (using the optL2 function of the (modified) R package penalized, which needs to be downloaded and installed from the
tutorial website). Use unpenalized estimation for the clinical variables and penalized estimation for the gene expression measurements.
An example call for Ridge regression (using 10-fold cross-validation to determine
the tuning parameter) is given by
# specify model formula for unpenalized variables
formula_1 <- ~ diameter + posnodes + age + mlratio +
chemotherapy + hormonaltherapy + typesurgery +
histolgrade + vasc.invasion
# fit model
Ridge_model <- optL2(Surv(survival, eventdea), penalized = training_sample[,12:4930],
unpenalized = formula_1, lambda1 = 0, fold = 10,
data=training_sample, model = "cox")
# compute predictions
pred_ridge <- predict(Ridge_model$fullfit, penalized = test_sample[,12:4930],
unpenalized = formula_1, data=test_sample, lperrg=TRUE)
(d) Optionally, an L1 -penalized Cox model (using the optL1 function of R package
penalized).
8. Repeat Excercises 4 to 7 using 50 different random splits of the data (→ use the script
splitData.r with different seeds). Compare the distributions of the prediction errors
obtained from the models/methods.
Data Set
The data set is described in
Van’t Veer L. J. et al. (2002), “Gene Expression Profiling Predicts Clinical Outcome of Breast
Cancer ”, Nature, 415 (6871), 530–536.
and
Van de Vijver M. J. et al. (2002), “A Gene-Expression Signature as a Predictor of Survival
in Breast Cancer ”, New England Journal of Medicine, 347 (25), 1999–2009.
Description of the clinical variables:
eventdea
survival status (0 = alive, 1 = dead).
survival
time to death (in years).
diameter
tumor diameter (in mm).
posnodes
number of positive lymph nodes.
age
age (in years).
mlratio
intensity ratio for estrogen-receptor status.
chemotherapy
chemotherapy (1 = no, 2 = yes).
hormonaltherapy hormonal therapy (1 = no, 2 = yes).
typesurgery
type of surgery (1 = breast-conserving therapy, 2 = mastectomy).
histolgrade
histologic grade (1 = intermediate, 2 = poor, 3 = good).
vasc.invasion
vascular invasion (1 = absent, 2 = > 3 vessels, 3 = 1 − 3 vessels).