* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Hands-on Lab using mboost: Modeling Breast Cancer Gene
Survey
Document related concepts
Transcript
Hands-on Lab using mboost: Modeling Breast Cancer Gene Expression Data Matthias Schmid, Andreas Mayr Institut für Medizininformatik, Biometrie und Epidemiologie; Friedrich-Alexander-Universität Erlangen-Nürnberg Email: [email protected] Exercises 1. Download the data from http://www.imbe.med.uni-erlangen.de/ma/M.Schmid/Tutorial.html (files array.dat, clinical.dat) and read the data into R using the script getData.r. The data set is now available as bcdata. 2. Make yourself familiar with the data set using the summary and table functions. Values of the clinical covariates are contained in columns 1 to 11, gene expression measurements are contained in columns 12 to 4930. For a short description of clinical covariates, see below. 3. The goal is to construct a prediction rule for the outcome variable “time to death” (variable survival). To obtain a fair estimate of the accuracy of the rule, we will use independent data sets for estimating the prediction rule (“learning sample”) and for evaluating the prediction accuracy of the rule (“test sample”). ⇒ Split the data randomly into a learning sample of size 197 (≈ two thirds of the observations) and a test sample of size 98 (≈ one third of the observations) using the script splitData.r. 4. Prediction rule 1. Fit a linear Cox model on the learning sample using the gene expression measurements only (function glmboost). Use 10-fold cross-validation to determine the optimal stopping iteration. Evaluate the prediction accuracy using the observations in the test sample and the CoxPH()@risk function. (Note: The partial log likelihood of the observations in the test sample will be used as a measure of prediction accuracy.) 5. Prediction rule 2. To evaluate clinical relevance, it is of interest to investigate whether gene expression data improve established clinical predictors, i.e., if they there is an added predictive value when combining gene expression measurements and clinical data. This issue can be addressed by constructing a prediction rule including maximum partial likelihood estimates of the clinical data and regularized (boosting) estimates of the gene expression data. To construct such a prediction rule, first use the coxph function of the survival package and fit a Cox model based on the learning sample and the clinical predictors only (“clinical model”). Afterwards, use the glmboost function and fit a linear Cox model based on the gene expression measurements. Use the predictions of the clinical model as offset values to force the clinical variables into the model (argument offset in glmboost). 6. Compare the prediction accuracies of prediction rules 1 and 2. How many genes were selected by prediction rules 1 and 2? 7. Compare the accuracy of prediction rules 1 and 2 to the following alternative prediction rules: (a) The null model without any predictor variables (→ Cox model without covariates). (b) The clinical Cox model (without gene expression measurements). (c) Optionally, a Ridge-penalized Cox model (using the optL2 function of the (modified) R package penalized, which needs to be downloaded and installed from the tutorial website). Use unpenalized estimation for the clinical variables and penalized estimation for the gene expression measurements. An example call for Ridge regression (using 10-fold cross-validation to determine the tuning parameter) is given by # specify model formula for unpenalized variables formula_1 <- ~ diameter + posnodes + age + mlratio + chemotherapy + hormonaltherapy + typesurgery + histolgrade + vasc.invasion # fit model Ridge_model <- optL2(Surv(survival, eventdea), penalized = training_sample[,12:4930], unpenalized = formula_1, lambda1 = 0, fold = 10, data=training_sample, model = "cox") # compute predictions pred_ridge <- predict(Ridge_model$fullfit, penalized = test_sample[,12:4930], unpenalized = formula_1, data=test_sample, lperrg=TRUE) (d) Optionally, an L1 -penalized Cox model (using the optL1 function of R package penalized). 8. Repeat Excercises 4 to 7 using 50 different random splits of the data (→ use the script splitData.r with different seeds). Compare the distributions of the prediction errors obtained from the models/methods. Data Set The data set is described in Van’t Veer L. J. et al. (2002), “Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer ”, Nature, 415 (6871), 530–536. and Van de Vijver M. J. et al. (2002), “A Gene-Expression Signature as a Predictor of Survival in Breast Cancer ”, New England Journal of Medicine, 347 (25), 1999–2009. Description of the clinical variables: eventdea survival status (0 = alive, 1 = dead). survival time to death (in years). diameter tumor diameter (in mm). posnodes number of positive lymph nodes. age age (in years). mlratio intensity ratio for estrogen-receptor status. chemotherapy chemotherapy (1 = no, 2 = yes). hormonaltherapy hormonal therapy (1 = no, 2 = yes). typesurgery type of surgery (1 = breast-conserving therapy, 2 = mastectomy). histolgrade histologic grade (1 = intermediate, 2 = poor, 3 = good). vasc.invasion vascular invasion (1 = absent, 2 = > 3 vessels, 3 = 1 − 3 vessels).