* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Hierarchical Text Categorization and Its Evaluation
History of genetic engineering wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Gene therapy wikipedia , lookup
Metagenomics wikipedia , lookup
Point mutation wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Epigenetics in learning and memory wikipedia , lookup
Gene desert wikipedia , lookup
Biology and consumer behaviour wikipedia , lookup
Vectors in gene therapy wikipedia , lookup
Gene nomenclature wikipedia , lookup
Genome evolution wikipedia , lookup
Genome (book) wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Helitron (biology) wikipedia , lookup
Gene expression programming wikipedia , lookup
Gene expression profiling wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Microevolution wikipedia , lookup
Hierarchical Text Categorization and its Application to Bioinformatics Stan Matwin and Svetlana Kiritchenko joint work with Fazel Famili (NRC), and Richard Nock (Université Antilles-Guyane) School of Information Technology and Engineering University of Ottawa Outline • • • • • What is hierarchical text categorization (HTC) Functional gene annotation requires HTC Ensemble-based learning and AdaBoost Multi-class multi-label AdaBoost Generalized local hierarchical learning method • New global hierarchical learning algorithm • New hierarchical evaluation measure • Application to Bioinformatics 2 Text categorization • Given: dj D - textual documents C = {c1, …, c|C|} – predefined categories • Task: <dj, ci> DC {True, False} c1 c2 TC c3 c7 c6 c4 c5 3 Hierarchical text categorization • Hierarchy of categories: ≤ CC - reflexive, antisymmetric, transitive binary relation on C c1 HTC c2 c4 c3 c5 c6 c7 4 Advantages of HTC • Additional, potentially valuable information – Relationships between categories • Flexibility – High levels: general topics – Low levels: more detail 5 Outline • • • • • What is hierarchical text categorization (HTC) Functional gene annotation requires HTC Ensemble-based learning and AdaBoost Multi-class multi-label AdaBoost Generalized local hierarchical learning method • New global hierarchical learning algorithm • New hierarchical evaluation measure • Application to Bioinformatics 6 Text classification and bioinformatics • Clustering and classification of gene expression data – DNA chip time series – performance data – Gene function, process,… – genetic knowledge GO – Literature will connect the two - domain knowledge • Validation of results from performance data 7 Example: Gene Ontology 8 From data to knowledge via literature • Functional annotation of genes from biomedical literature 9 Other applications • Web directories • Digital libraries • Patent databases • Biological ontologies • Email folders 10 Outline • • • • • What is hierarchical text categorization (HTC) Functional gene annotation requires HTC Ensemble-based learning and AdaBoost Multi-class multi-label AdaBoost Generalized local hierarchical learning method • New global hierarchical learning algorithm • New hierarchical evaluation measure • Application to Bioinformatics 11 Boosting • not a learning technique on its own, but a method in which a family of “weakly” learning agents (simple learners) is used for learning • based on the fact that multiple classifiers that disagree with one another can be together more accurate than its component classifiers • if there are L classifiers, each with an error rate < 1/2, and the errors are independent, then the prob. that the majority vote is wrong is the area under binomial distribution for more than L/2 hypotheses 12 Why do we have committees (ensembles)? 13 Boosting – the very idea • Train an ensemble of classifiers, sequentially • Each next classifier focuses more on the training instances on which the previous one has made a mistake • The “focusing” is done thru the weighting of the training instances • To classify a new instance, make the ensemble vote 14 15 Boosting - properties • If each hl is only better than chances, boosting can attain ANY accuracy!! • No need for new examples, additional knowledge, etc • Original AdaBoost is on single-labeled data 16 Outline • • • • • What is hierarchical text categorization (HTC) Functional gene annotation requires HTC Ensemble-based learning and AdaBoost Multi-class multi-label AdaBoost Generalized local hierarchical learning method • New global hierarchical learning algorithm • New hierarchical evaluation measure • Application to Bioinformatics 17 AdaBoost.MH [Schapire and Singer, 1999] • (di, Ci) ((di, l), Ci[l]), l C • Initialize distribution P1(i,l) = 1/(mk) . • For t = 1, …, T: – Train weak learner using distribution Pt . – Get weak hypothesis ht: DC . • Update: • The final hypothesis: 18 BoosTexter [Schapire and Singer, 2000] • “Weak” learner: decision stump word w occurs Pt (i, l ) 1 i:wd i ,Ci [ l ] 1 q1l ln 2 Pt (i, l ) i:wd i ,Ci [ l ] 1 doesn’t occur Pt (i,l ) 1 i:wd i ,Ci [ l ] 1 q0l ln 2 Pt (i,l ) i:wd i ,Ci [ l ] 191 Thresholds for AdaBoost • AdaBoost often underestimates its confidences • 3 approaches to selecting better thresholds – single threshold for all classes – individual thresholds for each class – separate thresholds for each subtree rooted in the children of a top node (for tree-hierarchies only) 20 Thresholds for AdaBoost 21 Outline • • • • • What is hierarchical text categorization (HTC) Functional gene annotation requires HTC Ensemble-based learning and AdaBoost Multi-class multi-label AdaBoost Generalized local hierarchical learning method • New global hierarchical learning algorithm • New hierarchical evaluation measure • Application to Bioinformatics 22 Hierarchical consistency • if (dj, ci) True, then (dj, Ancestor(ci)) True c1 c1 c2 c4 c2 c3 c5 c6 consistent c7 c4 c3 c5 c6 c7 inconsistent 23 Hierarchical local approach c1 c2 c4 c3 c5 c8 c6 c7 c9 24 Hierarchical local approach c1 c2 c4 c3 c5 c8 c6 c7 c9 25 Hierarchical local approach c1 c2 c4 c3 c5 c8 c6 c7 c9 26 Hierarchical local approach c1 c2 c4 c3 c5 c8 c6 c7 c9 27 Hierarchical local approach c1 c2 c4 c3 c5 c8 c6 c7 c9 consistent classification 28 Generalized hierarchical local approach • stop classification at an intermediate level if none of the children categories seem relevant • a category node can be assigned only after all its parent nodes have been assigned c1 c2 c4 c3 c5 c8 c6 c7 c9 29 Outline • • • • • What is hierarchical text categorization (HTC) Functional gene annotation requires HTC Ensemble-based learning and AdaBoost Multi-class multi-label AdaBoost Generalized local hierarchical learning method • New global hierarchical learning algorithm • New hierarchical evaluation measure • Application to Bioinformatics 30 New global hierarchical approach • Make a dataset consistent with a class hierarchy – add ancestor category labels • Apply a regular learning algorithm – AdaBoost • Make prediction results consistent with a class hierarchy – for inconsistent labeling make a consistent decision based on confidences of all ancestor classes 31 New global hierarchical approach • Hierarchical (shared) attributes sports team, game, winner, etc. hockey football NHL, Senators, goalkeeper, etc. Super Bowl, Patriots, touchdown, etc. 32 Outline • • • • • What is hierarchical text categorization (HTC) Functional gene annotation requires HTC Ensemble-based learning and AdaBoost Multi-class multi-label AdaBoost Generalized local hierarchical learning method • New global hierarchical learning algorithm • New hierarchical evaluation measure • Application to Bioinformatics 33 Evaluation in TC c1 c2 Correct category c3 Incorrect category c4 c5 c6 precision c7 correctly predicted total predicted correctly predicted recall total in category ( 2 1) P R F - measure , 2 P R 0 34 Weaknesses of standard measures H1 H2 c1 c2 c4 c3 c5 c6 H3 c1 c2 c7 c4 c3 c5 c6 c1 c2 c7 c4 c3 c5 c6 c7 P(H1) = P(H2) = P(H3) R(H1) = R(H2) = R(H3) F(H1) = F(H2) = F(H3) Ideally, M(H1) > M(H3) and M(H2) > M(H3) 35 Requirements for a hierarchical measure 1. to give credit to partially correct classification H1 c1 c2 c4 c8 c2 c3 c5 c6 c9 H2 c1 c10 c4 c7 c11 M(H1) > M(H2) c8 c3 c5 c6 c9 c10 c7 c11 36 Requirements for a hierarchical measure 2. to punish distant errors more heavily: – to give higher evaluation for correctly classifying one level down comparing to staying at the parent node H1 c1 c2 c4 c8 c2 c3 c5 c6 c9 H2 c1 c10 c4 c7 c11 M(H1) > M(H2) c8 c3 c5 c6 c9 c10 c7 c11 37 Requirements for a hierarchical measure 2. to punish distant errors more heavily: – gives lower evaluation for incorrectly classifying one level down comparing to staying at the parent node H1 c1 c2 c4 c8 c2 c3 c5 c6 c9 H2 c1 c10 c4 c7 c11 M(H1) > M(H2) c8 c3 c5 c6 c9 c10 c7 c11 38 Requirements for a hierarchical measure 3. to punish errors at higher levels of a hierarchy more heavily H1 c1 c2 c4 c8 c2 c3 c5 c6 c9 H2 c1 c10 c4 c7 c11 M(H1) > M(H2) c8 c3 c5 c6 c9 c10 c7 c11 39 Advantages of the new measure • Simple, straight-forward to calculate • Based solely on a given hierarchy (no parameters to tune) • Satisfies all three requirements • Has much discriminating power • Allows to trade off between classification precision and classification depth 40 Our new hierarchical measure Correct category + all its ancestors (excluding c4 root) c1 c2 Correct category c3 Incorrect category c5 precision c6 c7 correctly predicted total predicted correctly predicted recall total in category ( 2 1) P R F - measure , 2 P R 0 41 Our new hierarchical measure H1 H2 c1 c2 c4 c3 c5 c6 c2 c7 correct: {c4} {c2, c4} predicted: {c2} {c2} hP( H 1 ) | {c2 } | 1 | {c2 } | | {c2 } | 1 hR( H1 ) | {c2 , c4 } | 2 H3 c1 c4 c3 c5 c6 {c4} {c2, c4} {c5} {c2, c5} hP( H 2 ) | {c2 } | 1 | {c2 , c5 } | 2 | {c2 } | 1 hR( H 2 ) | {c2 , c4 } | 2 c1 c2 c7 c4 c3 c5 c6 c7 {c4} {c2, c4} {c7} {c3, c7} hP( H 3 ) | {} | 0 | {c3 , c7 } | | {} | hR( H 3 ) 0 | {c2 , c4 }42| Measure consistency • Definition [Huang & Ling, 2005]: f, g – measures on domain R = {(a,b)|a,b , f(a)>f(b), g(a)>g(b)} S = {(a,b)|a,b , f(a)>f(b), g(a)<g(b)} f is statistically consistent with g if |R|>|S| • Experiment: – 100 randomly chosen hierarchies – New hierarchical F-measure and standard accuracy were consistent on 85% of random classifiers (|R|>5|S|) 43 Measure discriminancy • Definition [Huang & Ling, 2005]: f, g – measures on domain P = {(a,b)|a,b , f(a)>f(b), g(a)=g(b)} Q = {(a,b)|a,b , f(a)=f(b), g(a)>g(b)} f is statistically more discriminating than g if |P|>|Q| H1 • Examples: c1 c2 c4 H2 c3 c5 c6 H3 c1 c2 c7 c4 c3 c5 c6 c1 c2 c7 c4 c3 c5 c6 44 For one accuracy value - 3 different hierarchical values c7 Results: Hierarchical vs. Flat Synthetic data (hierarchical attributes) levels branching Flat H. AdaBoost 2 2 68.30 76.22 3 2 58.35 74.21 4 2 44.90 73.22 5 2 20.88 72.70 2 3 53.47 63.45 3 3 29.51 60.69 4 3 2.67 58.22 2 4 41.35 55.25 3 4 6.98 50.70 2 5 29.99 47.87 45 Results: Hierarchical vs. Flat Synthetic data (no hierarchical attributes) levels branching Flat H. AdaBoost 2 2 61.69 65.95 3 2 42.47 51.53 4 2 24.49 40.18 5 2 8.45 32.61 2 3 41.53 48.02 3 3 14.50 29.97 4 3 0.79 21.91 2 4 26.72 35.01 3 4 2.46 19.70 2 5 17.14 27.12 46 Results: Hierarchical vs. Flat Real data dataset Flat H. AdaBoost newsgroups 75.51 79.26 reuters 87.06 88.31 47 Results: Hierarchical vs. Local Synthetic data (hierarchical attributes) levels branching Local H. AdaBoost 2 2 73.42 76.22 3 2 69.40 74.21 4 2 68.18 73.22 5 2 68.44 72.70 2 3 61.99 63.45 3 3 58.81 60.69 4 3 57.40 58.22 2 4 54.26 55.25 3 4 50.66 50.70 2 5 47.26 47.87 48 Results: Hierarchical vs. Local Synthetic data (no hierarchical attributes) levels branching Local H. AdaBoost 2 2 59.83 65.95 3 2 44.00 51.53 4 2 33.44 40.18 5 2 26.03 32.61 2 3 43.87 48.02 3 3 26.33 29.97 4 3 17.97 21.91 2 4 32.51 35.01 3 4 17.96 19.70 2 5 26.04 27.12 49 Results: Hierarchical vs. Local Real data dataset Local H. AdaBoost newsgroups 80.01 79.26 reuters 89.11 88.31 50 Outline • • • • • What is hierarchical text categorization (HTC) Functional gene annotation requires HTC Ensemble-based learning and AdaBoost Multi-class multi-label AdaBoost Generalized local hierarchical learning method • New global hierarchical learning algorithm • New hierarchical evaluation measure • Application to Bioinformatics 51 Application to Bioinformatics • Functional annotation of genes from biomedical literature Learning (from fully-annotated genes in the db) retrieve GO codes and IDs of Medline entries from the db records 1 Genomic database (SGD) ID Symbol Name S0007287 15S_RRNA S0007287 15S_RRNA S0004660 AAC1 … … ADP/ATP translocator … Medline reference Evidence … GO ID PMID:6261980 PMID:6280192 ISS IGI GO:0003735 GO:0006412 PMID:2167309 TAS GO:0005743 … … … … 53 Learning (from fully-annotated genes in the db) 2 retrieve the corresponding Medline abstracts Medline PMID Abstract PMID:6261980 Nucleotide sequence of the gene for the mitochondrial 15S ribosomal RNA of yeast Sor F, Fukuhara H. We have determined the nucleotide sequence of a DNA segment carrying the entire 15S ribosomal RNA gene of yeast mitochondrial genome. … PMID:6280192 Suppressor of yeast mitochondrial ochre mutations that maps in or near the 15S ribosomal RNA gene of mtDNA. Fox TD, Staempfli S. A polypeptide chain-terminating mutation in the yeast mitochondrial oxi 1 gene has been shown to be an ochre (TAA) mutation by DNA sequence analysis. … PMID:2167309 Structure-function studies of adenine nucleotide transport in mitochondria. II. Biochemical analysis of distinct AAC1 and AAC2 proteins in yeast. Gawaz M, Douglas MG, Klingenberg M. AAC1 and AAC2 genes in yeast each encode functional ADP/ATP carrier (AAC) proteins of the mitochondrial inner membrane. … … … 54 Learning (from fully-annotated genes in the db) 3 form the training set: words from Medline abstracts (features) and GO codes (categories) Training set Abstract GO ID Nucleotide sequence of the gene for the mitochondrial 15S ribosomal RNA of yeast Sor F, Fukuhara H. We have determined the nucleotide sequence of a DNA segment carrying the entire 15S ribosomal RNA gene of yeast mitochondrial genome. … GO:0003735 Suppressor of yeast mitochondrial ochre mutations that maps in or near the 15S ribosomal RNA gene of mtDNA. Fox TD, Staempfli S. A polypeptide chain-terminating mutation in the yeast mitochondrial oxi 1 gene has been shown to be an ochre (TAA) mutation by DNA sequence analysis. … GO:0006412 Structure-function studies of adenine nucleotide transport in mitochondria. II. Biochemical analysis of distinct AAC1 and AAC2 proteins in yeast. Gawaz M, Douglas MG, Klingenberg M. AAC1 and AAC2 genes in yeast each encode functional ADP/ATP carrier (AAC) proteins of the mitochondrial inner membrane. … GO:0005743 … 55 … Learning (from fully-annotated genes in the db) 4 learn a classifier from the training set Learning Algorithm Classifier: Abstracts GO codes 56 Classification (for genes with missing annotation) 1 retrieve Medline abstracts mentioning the gene Medline Gene YLL057C Abstract Cloning and characterization of a sulfonate/alphaketoglutarate dioxygenase from Saccharomyces cerevisiae. Hogan DA, Auchtung TA, Hausinger RP. The Saccharomyces cerevisiae open reading frame YLL057c is predicted to encode a gene product with 31.5% amino acid sequence identity to Escherichia coli taurine/alphaketoglutarate dioxygenase and 27% identity to Ralstonia eutropha TfdA, a herbicide-degrading enzyme. … 57 Classification (for genes with missing annotation) 2 classify these abstracts in GO codes Classifier: Abstracts GO codes Gene GO code GO function YLL057C GO:0006790 sulfur metabolism 58 Results dataset level branching Flat Local H. AdaBoost biol. process 12 5.41 15.06 59.27 59.31 mol. function 10 10.29 8.78 43.36 38.17 cell. component 8 6.45 44.18 72.07 73.35 59 Conclusion We have presented: • hierarchical categorization task (categories are partially ordered) • generalized hierarchical local approach • new hierarchical global approach (hierarchical AdaBoost) • new hierarchical evaluation measure • application to gene annotation task 60 Future work • to try global hierarchical approach with other learning algorithms • to extend the gene annotation training sets with similar documents from Medline • to perform similar task for other organisms • to use gene annotations in gene classification and clustering 61 Gene expression analysis with functional annotations GO:0006790 GO:0006798 GO:0007315 GO:0007289 GO:0002132 GO:0002166 62