* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Classification, subtype discovery, and prediction of outcome in
Gene desert wikipedia , lookup
Genome (book) wikipedia , lookup
Polycomb Group Proteins and Cancer wikipedia , lookup
Vectors in gene therapy wikipedia , lookup
Quantitative comparative linguistics wikipedia , lookup
Gene therapy wikipedia , lookup
Gene nomenclature wikipedia , lookup
Neuronal ceroid lipofuscinosis wikipedia , lookup
Public health genomics wikipedia , lookup
Oncogenomics wikipedia , lookup
Microevolution wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Epigenetics of diabetes Type 2 wikipedia , lookup
Gene therapy of the human retina wikipedia , lookup
Computational phylogenetics wikipedia , lookup
Designer baby wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Gene expression profiling wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Using Emerging Patterns to Analyze Gene Expression Data Jinyan Li BioComputing Group Knowledge & Discovery Program Laboratories for Information Technology Singapore Outline • Introduction • Brief history of decision trees and emerging patterns • Basic ideas for decision trees and EPs • Advanced topics • Comparisons using gene expression data • Summary Introduction • Decision trees and emerging patterns both classification rules • Sharp discrimination power (no or little uncertainty) • Advantage over black-box learning models • Decision trees not the best (on accuracy) • EP-based classifiers: competitive to the best Brief History of Decision Trees CLS (Hunt etal. 1966)--- cost driven CART (Breiman et al. 1984) --- Gini Index ID3 (Quinlan, 1986 MLJ) --- Information-driven C4.5 (Quinlan, 1993) --- Pruning ideas Brief history of emerging patterns General EP (Dong & Li, 1999 Sigkdd) CAEP (Dong etal, DS99), JEP-C (Li etal, KAIS00) EP-space (Li etal, ICML00), DeEPs (Li etal, MLJ) PCL (Li & Wong, ECML02) Basic definitions • Relational data • Attributes (color, gene_x), attribute values (red, 200.1), attribute-value pair (equivalently, condition, item) • Patterns, instances • Training data, test data A simple dataset Outlook Sunny Sunny Sunny Sunny Sunny Overcast Overcast Overcast Overcast Rain Rain Rain Rain Rain Temp 75 80 85 72 69 72 83 64 81 71 65 75 68 70 Humidity 70 90 85 95 70 90 78 65 75 80 70 80 80 96 Windy true true false true false true false true false true true false false false class Play Don’t Don’t Don’t Play Play Play Play Play Don’t 9 Play samples Don’t Play 5 Don’t Play Play A total of 14. A decision tree outlook sunny overcast humidity <= 75 Play 2 rain > 75 windy false true Don’t 3 Play Play 4 NP-complete problem 3 Don’t 2 C4.5 • A heuristic • Using information gain to select the most discriminatory feature (for tree and subtrees) • Recursive subdivision over the original training data Characteristics of C4.5 trees • • • • Single coverage of training data (elegance) Divide-and-conquer splitting strategy Fragmentation problem Locally reliable but globally un-significant rules Missing many globally significant rules; mislead the system. Emerging Patterns (1) • An emerging pattern is a set of conditions usually involving several genes, with which most of a class satisfy but none of the other class satisfy. • Real example: {gene(37720_at) > 215, gene(38028_at)<=12}. 73% vs 0% • EPs are multi-gene discriminators. Emerging Patterns (2) 100 C1 Samples 80% 100 C2 Samples 0% C1 C2 An EP = {cond.1, cond.2, and cond.3} Boundary emerging patterns • Definition: A boundary EP is an EP whose proper subsets are not EPs. • Boundary EPs separate EPs from non- EPs • Distinguish EPs with high frequency from EPs with low frequency. • Boundary EPs are of our greatest interests. EP rules derived • A total of 12 EPs, some important ones of them never discovered by C4.5. • Examples: {Humi <=80, windy = false} -> Play (5:0). • A total of 5 rules in the decision tree induced by C4.5. • C4.5 missed many important rules. Characteristics of EP approach • Each EP is a tree with only one branch • A cluster of trees: EPs combined (loss of elegance) • Globally significant rules • Exponential number in size (need to focus on very important feature if large number of features exist.) Usefulness of Emerging Patterns in Classification • PCL (Prediction by Collective Likelihood of Emerging Patterns). • Accurate • Easily understandable. Spirit of the PCL Classifier (1) Top-Ranked EPs in Positive class Top-Ranked EPs in Negative class EP_1 (90%) EP_2 (86%) . . EP_n (68%) EP_1 (100%) EP_2 (95%) . . EP_n (80%) The idea of summarizing multiple top-ranked EPs is intended to avoid some rare tie cases. Spirit of the PCL Classifier (2) Most freq. EP from posi. class in the test sample K=10, Ideal Score_p = 10 Score_n = 0 Score_p = EP’_1_p / EP_1_p + … + EP’_k_p / EP_k_p Most freq. EP of posi class Similarly, Score_n = EP’_1_n / EP_1_n + … + EP’_k_n / EP_k_n If Score_p > Score_n, then positive class, otherwise negative class. C4.5 and PCL • Differences: – C4.5 is a greedy search algorithm using the divide-and-conquer idea. – PCL is a global search algorithm. – Leaves in C4.5 are allowed to contain mixed samples, however PCL does not. – Multiple trees used by PCL, but single tree used by C4.5. • Similarities: – Both can provide highlevel rules. Advanced topics • Bagging, boosting and C4.5 • Convexity of EP spaces (ICML00, Li etal). • Decomposition of EP spaces into a series of P-spaces and a small convex space (ECML02, Li and Wong). • DeEPs (to appear in MLJ, Li etal). • Version spaces (Mitchell, 1982 AI). Gene Expression Profiles • Huge number of features • Most of them can be ignored when for classification • Many good discriminating feautures • Number of instances relatively small Expression data in this talk • Prostate disease, 102 instances, Cancer Cell v.1, issue 1, 2002 • ALL disease, 327 instances, Cancer Cell, v.1, issue 2, 2002 • MLL disease, 72 instances, Nature Genetics, Jan., 2002 • Breast cancer, 98 instances, Nature, Jan. 2002 Classification models in this talk • K-nearest neighbor (simplest) • C4.5 (decision tree based, easily understandable) • Bagged and Boosted C4.5 • Support Vector Machines (black box) • Our PCL classifier Our Work Flow Original Training Data Feature selection Establishing Classification Model Giving a Test Sample Making a Prediction Selecting Discriminatory Genes • T-statistics and MIT-correlation. – Based on expression average and deviation between two classes of cell. • Entropy-based discretization methods, including Chi-Square statistics, and CFS. – Based on clear boundaries in the expression range of a gene. An Ideal Gene Expression Range C1 Xyz.x C2 • Expression values are two-end distributed. Results on Prostate Dataset 52 tumor samples and 50 normal samples, each represented by ~12,500 numeric values. Two Problems: -1- What is the main difference? How to use rules to represent the difference? -2- What’s the LOOCV accuracy by PCL and C4.5? C4.5 Tree 32598_at <=29 33886_at <= 10 40707_at > 10 Normal 6 34950_at <=5 Tumor >29 <= -6 > -6 Tumor 3+1 Normal >5 Normal 3+1 Emerging patterns Patterns {9, 36} {9, 23} {4, 9} {9, 14} {6, 9} {7, 21} {7, 11} {7, 43} {7, 39} {24, 29} Frequency (T) 38 instances 38 38 38 38 0 0 0 0 0 Frequency(N) 0 0 0 0 0 36 35 35 34 34 Reference number 9: the expression of 37720_at > 215. Reference number 36: the expression of 38028_at <= 12. LOOCV accuracies Classifier PCL Accuracy Error rate 95.1 5 C4.5 SVM Single Bagged Boosted 91.2 94.1 93.1 90.2 9 6 7 10 3-NN 96.1 4 Subtype classification of Childhood Leukemia Using Gene Expression Profiling One of our important projects. Collaborating Parties • St. Jude Children’s Research Hospital, USA. – Mary E. Ross, Sheila A. Shurtleff, W. Kent Williams, Divyen Patel, Rami Mahfouz, Fred G. Behm, Susana C. Raimondi, Mary V. Relling, Anami Patel, Cheng Cheng, Dario Campana, Ching-Hon Pui, William E. Evans, Clayton Naeve, and James R. Downing • NUH, Singapore. – Eng-Juh Yeoh • University of Mississippi, USA. – Dawn Wilkins, Xiaodong Zhou • LIT, Singapore. – Jinyan Li, Huiqing Liu, Limsoon Wong Important Motivations • Leukemia is a heterogeneous disease (T-ALL, E2A-PBX1, TEL-AML1, BCRABL, MLL, and Hyperdip>50). • Response is different. • Leukemia is 80% curable if subtype is correctly diagnosed. • Correct diagnosis needs many different tests and experts. • Tests and experts not commonly available in a single hospital, especially in less advanced countries. • So, developing new methods is needed. ALL Data Description Subtype Training Testing T-ALL E2A-PBX1 TEL-AML1 BCR-ABL MLL Hyperdip50 Others 28 18 52 9 14 42 52 15 9 27 6 6 22 27 Total 215 112 A sample = {gene_1, gene_2, …, gene_12558} Central questions • Diagnosis: Classification of more than six subtypes of the leukemia disease • Prognosis: Prediction of outcome of therapy Classification Strategy A new sample T-ALL? no E2A-PBX1? no TEL-AML1? no BCR-ABL? no MLL? no Hyperdip50? no Yes T-ALL predicted Yes E2A-PBX1 predicted Yes TEL-AML1 predicted Yes BCR-ABL predicted Yes MLL predicted Yes Hyperdip50 predicted Total Mis-classifications of the 112 Test Samples Feature Selection SVM NB k-NN C4.5 PCL Top20-ChiSquare Entropy Top20-mit 6 5 7 8 7 7 7 4 9 14 14 14 4 5 4 Overall, we achieved 96% accuracy level in the prediction. An Excellent Publication Cancer Cell, March ‘02 • Lead article • Cover page Other Questions for the Childhood Leukemia • Can one subtype separating from other subtypes? • Can we do classification in parallel, rather than using a tree structure? Test Error Rates Datasets PCL (112 test instances) BCR-ABL vs others 1:0 E2A-PBX1 vs others 0:0 Hyperdip50 vs others 2:2 MLL vs others 0:2 T-ALL vs others 0:0 TEL-AML1 vs others 2:0 Parallel classification 7 C4.5 Single Bagged Boosted 4:4 6:0 4:4 0:0 0:0 0:0 4:7 4:2 4:7 2:2 1:0 2:2 1:0 1:0 1:0 2:2 2:1 2:2 27 20 10 Overall, PCL is better than C4.5 (single, bagged, or boosted). MLL Distinction from Conventional ALL • Armstrong, etal, Nature Genetics, Jan. 2002 • 3 classes, 57 training samples, 15 test instances. • Much smaller than the St.Jude’s data. • Independently obtained data. Test Error Rates by PCL and C4.5 Datasets PCL ALL vs others AML vs others MLL vs others ALL vs AML ALL vs MLL AML vs MLL 0:0 0:0 0:1 0:0 0:0 0:0 C4.5 Single Bagged Boosted 1:2 0:0 1:2 1:0 0:0 0:0 0:0 0:0 0:0 0:0 1:0 0:0 0:0 0:0 0:0 0:0 0:0 0:0 Relapse Study on Breast Cancer • Veer etal, Nature, Jan. 2002 • 78 training samples, 19 test data for relapse study. Test Error Rates by Classifiers Dataset PCL Relapse vs Non-relapse 3:3 (12:7) C4.5 Single Bagged Boosted 5:0 5:2 2:4 SVM 3-NN 6:2 7:3 Accuracy not good; may need other more sophisticated methods; Gene expression may be not sufficient for relapse study. Summary • Discussed similarities and differences between decision trees and emerging patterns. • Discussed advanced topics such as bagging, boosting, convexity. • Performance comparison using 4 gene expression datasets. • Overall, PCL is better that C4.5 (single, bagged, or boosted) on accuracy and rules. Thank you! May 27, 2002