CZ5225: Modeling and Simulation in Biology
Lecture 6: Microarray Cancer Classification
Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, Level 7, SOC1, National University of Singapore

Why Microarray?
• Despite some improvements over the past 30 years, there is still no general way of:
– Identifying new cancer classes
– Assigning tumors to known classes
• The paper introduces two general approaches:
– Class prediction for a new tumor
– Class discovery of new, unknown subclasses
– Without using prior biological information

Why Microarray?
• Why do we need to classify cancers?
– The general way of treating cancer is to:
• Categorize the cancers into different classes
• Use a specific treatment for each class
• Traditional way: morphological appearance.

Why Microarray?
• Why are the traditional ways not enough?
– Some tumors in the same class have completely different clinical courses
• A more accurate classification may be needed
– Assigning new tumors to known cancer classes is not easy
• e.g.
assigning an acute leukemia tumor to AML or ALL

Cancer Classification
• Class discovery – identifying new cancer classes
• Class prediction – assigning tumors to known classes

Cancer Genes and Pathways
• 15 cancer-related pathways, 291 cancer genes, 34 angiogenesis genes, 12 tumor immune tolerance genes
• Nature Medicine 10, 789-799 (2004); Nature Reviews Cancer 4, 177-183 (2004); 6, 613-625 (2006); Critical Reviews in Oncology/Hematology 59, 40-50 (2006)
• http://bidd.nus.edu.sg/group/trmp/trmp.asp

Disease Outcome Prediction with Microarray
[Figure: expression profiles of patients and normal persons are fed to an SVM, which identifies the most discriminative genes; these signatures (predictor-genes) give better predictive power and clues to disease genes and drug targets]
• Expected features of signatures:
– Composition: certain percentages of cancer genes, genes in cancer pathways, and angiogenesis genes
– Stability: a similar set of predictor-genes for different patient compositions measured under the same or similar conditions

How Many Genes Should Be in a Signature?

Class | No. of genes or pathways
Cancer genes (oncogenes, tumor suppressors, stability genes) | 219
Cancer pathways | 15
Angiogenesis | 34
Cancer immune tolerance | 15

Class Prediction
• How can an initial collection of samples belonging to known classes be used to create a class predictor?
– Gathering samples
– Hybridizing RNAs to the microarray
– Obtaining the quantitative expression level of each gene
– Identifying informative genes via neighborhood analysis
– Weighted voting

Neighborhood Analysis
• We want to identify the genes whose expression patterns are strongly correlated with the class distinction to be predicted, and ignore the other genes
– Each gene is represented by an expression vector consisting of its expression level in each sample
– Counting the number
of genes having various levels of correlation with the idealized gene c
– Comparing this with the correlation of randomly permuted versions of c
• The results show an unusually high density of correlated genes!

Idealized Expression Pattern / Neighborhood Analysis
[Figure: the idealized expression pattern and the neighborhood-analysis plot]

Class Predictor
• The general approach:
– Choose a set of informative genes based on their correlation with the class distinction
– Each informative gene casts a weighted vote for one of the classes
– Sum the votes to determine the winning class and the prediction strength

Computing Votes
• Each gene Gi votes for AML or ALL depending on:
– Whether the expression level of the gene in the new tumor is nearer to the mean of Gi in AML or in ALL
• The value of the vote is WiVi, where:
– Wi reflects how well Gi is correlated with the class distinction
– Vi = | xi – (AML mean + ALL mean) / 2 |
• The prediction strength reflects the margin of victory:
– PS = (Vwin – Vlose) / (Vwin + Vlose)

Class Predictor
[Figure]

Evaluation
• Data
– Initial sample: 38 bone marrow samples (27 ALL, 11 AML) obtained at the time of diagnosis
– Independent sample: 34 leukemia samples, consisting of 24 bone marrow and 10 peripheral blood samples (20 ALL, 14 AML)
• Validation of gene voting
– Initial samples: 36 of the 38 samples were predicted as either AML or ALL, and two as uncertain; all 36 predictions agreed with the clinical diagnosis
– Independent samples: 29 of the 34 samples were strongly predicted, with 100% accuracy.
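The weighted-voting scheme above can be sketched in plain Python. This is a minimal illustration, not the lecture's code; the signal-to-noise statistic (mean_AML − mean_ALL) / (sd_AML + sd_ALL) is assumed here as the correlation weight Wi, following Golub et al.

```python
from statistics import mean, stdev

def s2n_weight(aml, all_):
    # Signal-to-noise statistic (mean_AML - mean_ALL) / (sd_AML + sd_ALL),
    # assumed here as the correlation weight W_i.
    return (mean(aml) - mean(all_)) / (stdev(aml) + stdev(all_))

def predict(new_sample, train_aml, train_all):
    """new_sample: expression of each informative gene in the new tumor.
    train_aml / train_all: per-gene lists of training expression levels."""
    v_aml = v_all = 0.0
    for x, aml, all_ in zip(new_sample, train_aml, train_all):
        w = s2n_weight(aml, all_)
        b = (mean(aml) + mean(all_)) / 2      # midpoint of the class means
        vote = w * (x - b)                    # |vote| = W_i * V_i
        if vote > 0:
            v_aml += vote                     # gene votes for AML
        else:
            v_all += -vote                    # gene votes for ALL
    winner = "AML" if v_aml > v_all else "ALL"
    v_win, v_lose = max(v_aml, v_all), min(v_aml, v_all)
    ps = (v_win - v_lose) / (v_win + v_lose)  # prediction strength
    return winner, ps
```

For example, with one gene high in AML and one mildly favoring ALL, `predict([10.5, 1.8], [[10, 11, 12], [4, 5, 6]], [[1, 2, 3], [0, 1, 2]])` returns AML with a prediction strength well below 1, reflecting the split vote.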
Validation of Gene Voting
[Figure]

An Early Kind of Analysis: Unsupervised Learning – Learning Disease Sub-types
[Figure: samples plotted on p53 vs. Rb expression]

Sub-type Learning
• Seeking 'natural' groupings and hoping that they will be useful…
[Figure: p53 vs. Rb]

E.g., for Treatment
[Figure: one sub-type responds to treatment Tx1, the other does not]

The 'One-Solution-Fits-All' Trap
[Figure: the same grouping fails for treatment Tx2 – responders and non-responders to Tx2 cut across the sub-types]

A More Modern View: Supervised Learning
[Diagram: train instances over variables A–E (values A1, B1, …, En) are fed to an inductive algorithm, which outputs a classifier or regression model; the model is applied to application instances and scored by classification performance]

Predictive Biomarkers & Supervised Learning
[Diagram: as above; the variables used by the learned model are the predictive biomarkers]

A More Modern View 2: Unsupervised Learning as Structure Learning
[Diagram: the inductive algorithm outputs a structure (a graph over the variables A–E) rather than a classifier]

Causative Biomarkers & (Structural) Unsupervised Learning
[Diagram: as above; the variables linked to the outcome in the learned structure are the causative biomarkers]

Supervised Learning: the Geometrical Interpretation
[Figure: cancer patients and normals plotted on p53 vs. Rb; an SVM classifier separates the two groups, and each new case is classified as cancer or normal depending on which side of the boundary it falls]

If 2D looks good, what happens in 3D?
• 10,000-50,000 (regular gene expression microarrays, aCGH, and early SNP arrays)
• 500,000 (tiled microarrays, SNP arrays)
• 10,000-300,000 (regular MS proteomics)
• >10,000,000 (LC-MS proteomics)
This is the 'curse of dimensionality' problem

Problems Associated with High Dimensionality (especially with small samples)
• Some methods do not run at all (classical regression)
• Some methods give bad results
• Very slow analysis
• Very expensive/cumbersome clinical application

Solution 1: Dimensionality Reduction
[Figure: normal subjects and cancer patients plotted on Gene X vs. Gene Y; the first principal component (PC1: 3X – Y = 0) captures the direction along which the two groups separate]

Solution 2: Feature Selection
[Figure: from the full set of variables, a small informative subset is selected]

Another (Very Real and Unpleasant) Problem: Over-fitting
• Over-fitting (a model to your data) = building a model that is good on the original data but fails to generalize well to fresh data
• Over-fitting is directly related to the complexity of the decision surface (relative to the complexity of the modeling task)
[Figure: a complex predictor fits the training data perfectly but performs poorly on test data]

Over-fitting Is Also Caused by Multiple Validations & Small Samples
• General population: AUC of Model_1 = 65%; AUC of Model_2 = 85%; AUC of Model_3 = 55%
• Training & validation phase: on modeling sample MS_1 (used for training & validation), the validated AUCs are Model_1 = 88%, Model_2 = 76%, Model_3 = 63%; on modeling sample MS_n (not used for training & validation), they are Model_1 = 61%, Model_2 = 87%, Model_3 = 67%
• Validation with an independent dataset: evaluation sample ES_1 detects the over-fitting (AUC of Model_1 = 65%), while evaluation sample ES_2 does not (AUC of Model_1 = 84%)
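A minimal, self-contained demonstration of the over-fitting just described (not from the lecture): a 1-nearest-neighbour rule, which simply memorizes a small training sample of pure-noise "expression profiles", scores perfectly on its own data yet no better than chance on fresh data.

```python
import random

random.seed(0)

def one_nn_predict(train, query):
    # 1-nearest-neighbour: memorizes the training data, i.e. a maximally
    # complex decision surface that is prone to over-fitting.
    return min(train, key=lambda s: sum((a - b) ** 2
               for a, b in zip(s[0], query)))[1]

def make_data(n, n_genes):
    # Pure-noise profiles: the label is independent of the features,
    # so the true achievable accuracy is 50%.
    return [([random.gauss(0, 1) for _ in range(n_genes)],
             random.choice(["AML", "ALL"])) for _ in range(n)]

train = make_data(20, 500)    # small sample, high dimensionality
fresh = make_data(200, 500)

train_acc = sum(one_nn_predict(train, x) == y for x, y in train) / len(train)
test_acc = sum(one_nn_predict(train, x) == y for x, y in fresh) / len(fresh)
print(train_acc)   # 1.0: perfect on the data it memorized
print(test_acc)    # no better than chance on fresh data
```

The gap between training and fresh-data accuracy is exactly what an independent evaluation sample is meant to expose.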
Over-fitting Is Also Caused by Multiple Validations & Small Samples
• General population: AUC of Model_1 = 65%; AUC of Model_2 = 85%; AUC of Model_3 = 55%
• Training & validation phase: on modeling sample MS_1 (not used for training & validation), the validated AUCs are Model_1 = 88%, Model_2 = 76%, Model_3 = 63%; on modeling sample MS_n (used for training & validation), they are Model_1 = 61%, Model_2 = 87%, Model_3 = 67%
• Validation with an independent dataset: evaluation sample ES_1 falsely detects over-fitting (AUC of Model_2 = 74%), while evaluation sample ES_2 does not detect it (AUC of Model_2 = 90%)

A Method to Produce Realistic Performance Estimates: Nested N-fold Cross-validation
• Dataset: predictor variables and an outcome variable, split into parts P1, P2, P3
• Outer loop (cross-validation for performance estimation):
– Train on P1, P2; test on P3 (with C = 1): accuracy 89%
– Train on P1, P3; test on P2 (with C = 2): accuracy 84%
– Train on P2, P3; test on P1 (with C = 1): accuracy 76%
– Average accuracy: 83%
• Inner loop (cross-validation for model selection):
– C = 1: train on P1, validate on P2 (86%); train on P2, validate on P1 (84%); average 85%
– C = 2: train on P1, validate on P2 (70%); train on P2, validate on P1 (90%); average 80%
– Choose C = 1, since it maximizes the inner-loop accuracy

How well does supervised learning work in practice?
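The nested cross-validation procedure above can be sketched in a few lines of Python. This is an illustrative skeleton, not the lecture's implementation: the "model" is a toy threshold rule on 1-D points, with the threshold `c` playing the role of the hyper-parameter C chosen in the inner loop.

```python
def accuracy(c, samples):
    # Toy decision rule standing in for a trained model: predict the
    # positive class when x >= c (c is the hyper-parameter under selection).
    return sum((x >= c) == y for x, y in samples) / len(samples)

def nested_cv(folds, params):
    outer_scores = []
    for i, test_fold in enumerate(folds):          # outer loop: estimation
        inner = [f for j, f in enumerate(folds) if j != i]
        # Inner loop: cross-validate on the remaining folds to pick c,
        # never touching the outer test fold.
        best_c = max(params,
                     key=lambda c: sum(accuracy(c, v) for v in inner) / len(inner))
        outer_scores.append(accuracy(best_c, test_fold))
    return sum(outer_scores) / len(outer_scores)   # performance estimate

# Toy 1-D data: the label is True when x >= 1, plus one mislabelled point.
folds = [
    [(0.2, False), (1.5, True), (0.8, False), (1.9, True)],
    [(0.1, False), (1.2, True), (0.4, False), (1.7, True)],
    [(0.6, False), (1.4, True), (0.9, True), (1.1, True)],
]
print(nested_cv(folds, params=[0.5, 1.0, 2.0]))
```

Because the outer test fold is excluded from the inner model-selection loop, the returned average is an honest estimate rather than one inflated by multiple validations on the same data.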
36 Datasets • • • • • • • • • • • • • • Bhattacharjee2 Bhattacharjee2_I Bhattacharjee3 Bhattacharjee3_I Savage Rosenwald4 Rosenwald5 Rosenwald6 Adam Yeoh Conrads Beer_I Su_I Banez - Lung cancer vs normals [GE/DX] Lung cancer vs normals on common genes between Bhattacharjee2 and Beer [GE/DX] Adenocarcinoma vs Squamous [GE/DX] Adenocarcinoma vs Squamous on common genes between Bhattacharjee3 and Su [GE/DX] Mediastinal large B-cell lymphoma vs diffuse large B-cell lymphoma [GE/DX] 3-year lymphoma survival [GE/CO] 5-year lymphoma survival [GE/CO] 7-year lymphoma survival [GE/CO] Prostate cancer vs benign prostate hyperplasia and normals [MS/DX] Classification between 6 types of leukemia [GE/DX-MC] Ovarian cancer vs normals [MS/DX] Lung cancer vs normals (common genes with Bhattacharjee2) [GE/DX] Adenocarcinoma vs squamous (common genes with Bhattacharjee3) [GE/DX Prostate cancer vs normals [MS/DX] 37 Methods: Gene Selection Algorithms • • • • • • • • • • • • • • • • • • ALL - No feature selection LARS - LARS HITON_PC HITON_PC_W -HITON_PC+ wrapping phase HITON_MB HITON_MB_W -HITON_MB + wrapping phase GA_KNN - GA/KNN RFE - RFE with validation of feature subset with optimized polynomial kernel RFE_Guyon - RFE with validation of feature subset with linear kernel (as in Guyon) RFE_POLY - RFE (with polynomial kernel) with validation of feature subset with polynomial optimized kernel RFE_POLY_Guyon - RFE (with polynomial kernel) with validation of feature subset with linear kernel (as in Guyon) SIMCA - SIMCA (Soft Independent Modeling of Class Analogy): PCA based method SIMCA_SVM - SIMCA (Soft Independent Modeling of Class Analogy): PCA based method with validation of feature subset by SVM WFCCM_CCR - Weighted Flexible Compound Covariate Method (WFCCM) applied as in Clinical Cancer Research paper by Yamagata (analysis of microarray data) WFCCM_Lancet - Weighted Flexible Compound Covariate Method (WFCCM) applied as in Lancet paper by Yanagisawa (analysis of mass-spectrometry 
data) UAF_KW - Univariate with Kruskal-Walis statistic UAF_BW - Univariate with ratio of genes between groups to within group sum of squares UAF_S2N - Univariate with signal-to-noise statistic 38 Classification Performance (average over all tasks/datasets) 39 How well dimensionality reduction and feature selection work in practice? 40 Number of Selected Features (average over all tasks/datasets) 10000.00 9000.00 8000.00 UAF_S2N UAF_BW UAF_KW WFCCM_CCR SIMCA_SVM SIMCA RFE_POLY_Guyon RFE_POLY RFE_Guyon RFE GA_KNN HITONgp_MB_W HITONgp_PC_W HITONgp_MB LARS ALL 2000.00 1000.00 0.00 HITONgp_PC 7000.00 6000.00 5000.00 4000.00 3000.00 41 Number of Selected Features (zoom on most powerful methods) 100.00 80.00 60.00 40.00 20.00 _G uy R on RF FE E_ _ PO PO LY LY _G uy on RF E RF E HI T LA RS O Ng p HI TO _PC Ng HI p_ TO M Ng B p_ HI PC TO _W Ng p_ M B_ W G A_ KN N 0.00 42 Number of Selected Features (average over all tasks/datasets) 43
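Of the methods listed, the univariate signal-to-noise filter (UAF_S2N) is the simplest to make concrete. The sketch below (illustrative names, stdlib only) ranks genes by the absolute signal-to-noise statistic and keeps the top k.

```python
from statistics import mean, stdev

def s2n(pos, neg):
    # Signal-to-noise statistic: (mu_pos - mu_neg) / (sd_pos + sd_neg).
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))

def select_top_genes(expr, labels, k):
    """expr: samples x genes matrix (list of lists); labels: True/False
    per sample. Returns the indices of the k genes with largest |S2N|."""
    scores = []
    for g in range(len(expr[0])):
        pos = [row[g] for row, y in zip(expr, labels) if y]
        neg = [row[g] for row, y in zip(expr, labels) if not y]
        scores.append((abs(s2n(pos, neg)), g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]
```

For example, with a matrix where gene 0 clearly separates the classes and gene 1 is noise, `select_top_genes([[5, 0.1], [6, 0.2], [1, 0.15], [2, 0.25]], [True, True, False, False], 1)` returns `[0]`. Being univariate, this filter scores each gene in isolation, which is what distinguishes the UAF methods from multivariate selectors such as RFE or HITON.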