Part 5: Linking Microarray Data with Survival Analysis

Use of microarray data via model-based classification in the study and prediction of survival from lung cancer (Ben-Tovim Jones et al., 2005)

Problems
• Censored observations: the time of occurrence of the event (death) has not yet been observed.
• Small sample sizes: the study is limited by patient numbers.
• Specific patient group: is the study applicable to other populations?
• Difficulty in integrating different studies (different microarray platforms).

A Case Study: The Lung Cancer Data Sets from CAMDA’03
Four independently acquired lung cancer data sets (Harvard, Michigan, Stanford and Ontario).
The challenge: To integrate information from different data sets (2 Affy chips of different versions, 2 cDNA arrays).
The final goal: To make an impact on cancer biology and eventually patient care.
“Especially, we welcome the methodology of survival analysis using microarrays for cancer prognosis (Park et al. Bioinformatics: S120, 2002).”

Methodology of Survival Analysis using Microarrays
• Cluster the tissue samples (e.g. using hierarchical clustering), then compare the survival curves for each cluster using a non-parametric Kaplan-Meier analysis (Alizadeh et al. 2000).
• Park et al. (2002) and Nguyen and Rocke (2002) used partial least squares with the proportional hazards model of Cox.

Unsupervised vs. Supervised Methods
• Semi-supervised approach of Bair and Tibshirani (2004), to combine gene expression data with the clinical data.

AIM: To link gene-expression data with survival from lung cancer in the CAMDA’03 challenge
A. CLUSTER ANALYSIS
   We apply a model-based clustering approach to classify tumour tissues on the basis of microarray gene expression.
B. SURVIVAL ANALYSIS
   The association between the clusters so formed and patient survival (recurrence) times is established.
C. DISCRIMINANT ANALYSIS
   We demonstrate the potential of the clustering-based prognosis as a predictor of the outcome of disease.

Lung Cancer

Approx.
80% of lung cancer patients have NSCLC (of which adenocarcinoma is the most common form).
All patients diagnosed with NSCLC are treated on the basis of stage at presentation (tumour size, lymph node involvement and presence of metastases).
Yet 30% of patients with resected stage I lung cancer will die of metastatic cancer within 5 years of surgery.
Want a prognostic test for early-stage lung adenocarcinoma to identify patients who are more likely to recur, and who would therefore benefit from adjuvant therapy.

Lung Cancer Data Sets (see http://www.camda.duke.edu/camda03)
Wigle et al. (2002), Garber et al. (2001), Bhattacharjee et al. (2001), Beer et al. (2002).

[Figure: Heat map for 2880 Ontario genes (39 tissues). Axes: genes vs. tissues.]

[Figure: Heat maps for the 20 Ontario gene-groups (39 tissues). Axes: genes vs. tissues.]
Tissues are ordered as: Recurrence (1-24) and Censored (25-39)

[Figure: Expression profiles for useful metagenes (Ontario 39 tissues). Panels: Gene Groups 1, 2, 19 and 20. Axes: log expression value vs. tissues. Legend: our tissue cluster 1, our tissue cluster 2. Tissues ordered as Recurrence (1-24), Censored (25-39).]

Tissue Clusters
CLUSTER ANALYSIS via EMMIX-GENE of 20 METAGENES yields TWO CLUSTERS:
CLUSTER 1 (31), poor-prognosis: 23 (recurrence) plus 8 (censored)
CLUSTER 2 (8), good-prognosis: 1 (recurrence) plus 7 (censored)

SURVIVAL ANALYSIS: LONG-TERM SURVIVOR (LTS) MODEL
$$S(t) = \mathrm{prob}\{T > t\} = p_1 S_1(t) + p_2$$
where $T$ is time to recurrence and $p_1 = 1 - p_2$ is the prior prob. of recurrence.
Adopt Weibull model for the survival function for recurrence $S_1(t)$.

[Figure: Fitted LTS model vs. Kaplan-Meier.]

[Figure: PCA of tissues based on metagenes (two plots). Axes: first PC vs. second PC.]

[Figure: PCA of tissues based on all genes, via SVD (two plots). Axes: first PC vs. second PC.]

[Figure: Cluster-specific Kaplan-Meier plots.]

Survival Analysis for Ontario Dataset
• Nonparametric analysis:

  Cluster   No. of Tissues   No. of Censored   Mean time to Failure (± SE)
  1         29               8                 665 ± 85.9
  2         8                7                 1388 ± 155.7

  A significant difference between Kaplan-Meier estimates for the two clusters (P=0.027).

• Cox’s proportional hazards analysis:

  Variable                       Hazard ratio (95% CI)   P-value
  Cluster 1 vs. Cluster 2        6.78 (0.9 – 51.5)       0.06
  Tumor stage (I vs. II&III)     1.07 (0.57 – 2.0)       0.83

Discriminant Analysis (Supervised Classification)
A prognosis classifier was developed to predict the class of origin of a tumor tissue with a small error rate after correction for the selection bias.
A support vector machine (SVM) was adopted to identify important genes that play a key role in predicting the clinical outcome, using all the genes, and the metagenes.
A cross-validation (CV) procedure was used to calculate the prediction error, after correction for the selection bias.

ONTARIO DATA (39 tissues): Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
[Figure: Ten-fold cross-validation error rate (CV10E) of the SVM vs. log2(number of genes), applied to g=2 clusters (G1: 1-14, 16-29, 33, 36, 38; G2: 15, 30-32, 34, 35, 37, 39).]

STANFORD DATA
918 genes based on 73 tissue samples from 67 patients.
Row and column normalized, retained 451 genes after select-genes step.
Used 20 metagenes to cluster tissues.
Retrieved histological groups.
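The cluster-specific Kaplan-Meier comparisons reported for the Ontario data above, and for the Stanford data in the slides that follow, rest on the product-limit estimator, which uses censored tissues in the risk set up to their last follow-up time. The block below is a minimal sketch of that estimator, not the code used in the study. The function name `kaplan_meier` and the toy follow-up times are invented for illustration and are not taken from either data set.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : follow-up time for each tissue (any consistent unit)
    events : 1 if recurrence was observed at that time, 0 if censored
    Returns a list of (time, survival probability) pairs, one per
    distinct time at which at least one event occurred.
    """
    observed = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)          # events and censorings per time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = observed.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Everyone who failed or was censored at t leaves the risk set.
        at_risk -= leaving[t]
    return curve


# Toy data (hypothetical): six tissues, three observed recurrences.
toy_times = [5, 8, 8, 12, 15, 20]
toy_events = [1, 1, 0, 1, 0, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
for t, s in toy_curve:
    print(t, round(s, 4))
```

Computing one such curve per tissue cluster gives the curves that a log-rank type test then compares. Note that a censored tissue lowers the risk set without lowering the curve, which is why heavy censoring leaves the tail of the estimate poorly determined.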
[Figure: Heat maps for the 20 Stanford gene-groups (73 tissues). Axes: genes vs. tissues.]
Tissues are ordered by their histological classification: Adenocarcinoma (1-41), Fetal Lung (42), Large cell (43-47), Normal (48-52), Squamous cell (53-68), Small cell (69-73)

STANFORD CLASSIFICATION:
Cluster 1: 1-19 (good prognosis)
Cluster 2: 20-26 (long-term survivors)
Cluster 3: 27-35 (poor prognosis)

[Figure: Heat maps for the 15 Stanford gene-groups (35 tissues). Axes: genes vs. tissues.]
Tissues are ordered by the Stanford classification into AC groups: AC group 1 (1-19), AC group 2 (20-26), AC group 3 (27-35)

[Figure: Expression profiles for top metagenes (Stanford 35 AC tissues). Panels: Gene Groups 1 to 4. Axes: log expression value vs. tissues. Legend: Stanford AC group 1, Stanford AC group 2, Stanford AC group 3, misallocated.]

[Figure: Cluster-specific Kaplan-Meier plots (two plots).]

Survival Analysis for Stanford Dataset
• Kaplan-Meier estimation:

  Cluster   No. of Tissues   No. of Censored   Mean time to Failure (± SE)
  1         17               10                37.5 ± 5.0
  2         5                0                 5.2 ± 2.3

  A significant difference in survival between clusters (P<0.001)

• Cox’s proportional hazards analysis:

  Variable                           Hazard ratio (95% CI)   P-value
  Cluster 3 vs. Clusters 1&2         13.2 (2.1 – 81.1)       0.005
  Grade 3 vs. grades 1 or 2          1.94 (0.5 – 8.5)        0.38
  Tumor size                         0.96 (0.3 – 2.8)        0.93
  No. of tumors in lymph nodes       1.65 (0.7 – 3.9)        0.25
  Presence of metastases             4.41 (1.0 – 19.8)       0.05

Survival Analysis for Stanford Dataset
• Univariate Cox’s proportional hazards analysis (metagenes):

  Metagene   Coefficient (SE)   P-value
  1          1.37 (0.44)        0.002
  2          -0.24 (0.31)       0.44
  3          0.14 (0.34)        0.68
  4          -1.01 (0.56)       0.07
  5          0.66 (0.65)        0.31
  6          -0.63 (0.50)       0.20
  7          -0.68 (0.57)       0.24
  8          0.75 (0.46)        0.10
  9          -1.13 (0.50)       0.02
  10         0.73 (0.39)        0.06
  11         0.35 (0.50)        0.48
  12         -0.55 (0.41)       0.18
  13         -0.61 (0.48)       0.20
  14         0.22 (0.36)        0.53
  15         1.70 (0.92)        0.06

Survival Analysis for Stanford Dataset
• Multivariate Cox’s proportional hazards analysis (metagenes):

  Metagene   Coefficient (SE)   P-value
  1          3.44 (0.95)        0.0003
  2          -1.60 (0.62)       0.010
  8          -1.55 (0.73)       0.033
  11         1.16 (0.54)        0.031

  The final model consists of four metagenes.

STANFORD DATA: Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
[Figure: Ten-fold cross-validation error rate (CV10E) of the SVM vs. log2(number of genes), applied to g=2 clusters.]

CONCLUSIONS
We applied a model-based clustering approach to classify tumors using their gene signatures into:
(a) clusters corresponding to tumor type
(b) clusters corresponding to clinical outcomes for tumors of a given subtype
In (a), almost perfect correspondence between cluster and tumor type, at least for non-AC tumors (but not in the Ontario dataset).

CONCLUSIONS (cont.)
The clusters in (b) were identified with clinical outcomes (e.g. recurrence/recurrence-free and death/long-term survival).
We were able to show that gene-expression data provide prognostic information, beyond that of clinical indicators such as stage.

CONCLUSIONS (cont.)
Based on the tissue clusters, a discriminant analysis using support vector machines (SVM) further demonstrated the potential of gene expression as a tool for guiding treatment and patient care for lung cancer patients.
This supervised classification procedure was used to provide marker genes for prediction of clinical outcomes (in addition to those provided by the cluster-genes step in the initial unsupervised classification).

LIMITATIONS
• Small number of tumors available (e.g. Ontario and Stanford datasets).
• Clinical data available for only subsets of the tumors; often for only one tumor type (AC).
• High proportion of censored observations limits comparison of survival rates.
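
To make the last limitation concrete, it may help to evaluate the long-term survivor (LTS) model used for the Ontario data, S(t) = p1 S1(t) + p2 with a Weibull S1(t). The sketch below assumes the common Weibull parametrization S1(t) = exp(-(t/scale)^shape). The function names and all parameter values are invented for illustration and are not the fitted values from the study.

```python
import math


def weibull_survival(t, shape, scale):
    """Weibull survival function S1(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def lts_survival(t, p1, shape, scale):
    """Long-term survivor mixture: S(t) = p1 * S1(t) + p2, with p2 = 1 - p1.

    p1 is the prior probability of recurrence; p2 is the proportion of
    long-term survivors, who never experience the event.
    """
    p2 = 1.0 - p1
    return p1 * weibull_survival(t, shape, scale) + p2


# Hypothetical parameters, chosen only to show the shape of the curve.
for t in (0, 500, 1000, 5000):
    print(t, round(lts_survival(t, p1=0.7, shape=1.5, scale=800.0), 3))
```

Under these assumed values the curve starts at 1 and levels off at p2 instead of decaying to zero, so tissues censored late in follow-up are treated as a mixture of long-term survivors and recurrences not yet observed.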