CleanTAX: An Infrastructure for Reasoning about Biological Taxonomies
Dave Thau and Bertram Ludäscher
keywords: knowledge management, automatic reasoning, semantic integration, biological classification
Stanford Research Institute Artificial Intelligence Center Seminar 8/16/2007 CleanTAX, Dave Thau [email protected]

Outline
• Brief Overview of Taxonomies
• Impact of Different Taxonomic Views on Data Analysis
• Taxonomies and Relations Between Them
• Using Logic to Determine Inconsistencies and Discover New Relations
• Initial Results of Large Scale Analysis
• Some Optimizations
• Future Work

Beginnings of Biological Taxonomy
Egypt, 1500 BC: Ebers medical papyrus, classification of medicinal plants
China, 350 BC: Erh-ya dictionary (second century BC) – classifies trees, grasses, herbs, grains, vegetables
Greece, 300 BC: Theophrastus, Historia plantarum and Causae plantarum – 500 plants – trees, herbs, fruiting plants, perennials

Taxonomies are Everywhere: Systematics
Plantae (kingdom)
Tracheophyta (phylum)
Magnoliopsida (class)
Ranunculales (order)
Ranunculaceae (family)
Ranunculus (genus)
Ranunculus asiaticus (species)

Taxonomies are Everywhere: The Dewey Decimal System
000 Computers and general reference
100 Philosophy and psychology
200 Religion
300 Social sciences
400 Language
500 Science
600 Technology
700 Arts and Recreation
800 Literature
900 History and geography

Taxonomies are Everywhere: Phylogenies
From Thomas D. Als, Roger Vila, Nikolai P. Kandul, David R. Nash, Shen-Horn Yen, Yu-Feng Hsu, André A. Mignault, Jacobus J. Boomsma and Naomi E. Pierce. Nature 432, 386-390.

Taxonomies are Everywhere: Protein Structure
From Ed Green http://compbio.berkeley.edu/people/ed/SeqCompEval/

Taxonomies are Useful, But Slippery
• In all of these cases, taxonomies
  – Help us organize information
  – Allow us to make inferences at many levels of generality
• However, taxonomies are simply "views" of real data
  – Dewey Decimal or Library of Congress?
  – Benson's view of Ranunculus or Kartesz's view?
  – Conflicting phylogenies are common
  – SCOP versus CATH

Different Taxonomies Can Lead To Different Results
[Figure: two maps, "Predicted Distribution of Anhinga melanogaster based on Clement's 4th Edition" and "Predicted Distribution of Anhinga melanogaster based on Clement's 5th Edition", each shown with a taxonomy of Anhinga (Anhinga melanogaster, Anhinga nova., Anhinga rufa) drawn with "is a" and "contained in" links. Photo by David Behrens. Articulations by Santa Barbara Software Products.]

Different Taxonomies Complicate Data Analysis
What was the average number of Ranunculus arizonicus seen in transect 1 in 2005?

Reasoning With Taxonomic Concepts
• Peet05 articulates relation between Benson’48 and Kartesz’04 names …
• Is that articulation consistent?
• Can we infer additional information?

Problem Statement
• What are taxonomies, anyway?
• How do you know a taxonomy makes sense?
• Given some articulations meant to translate between taxonomies:
  – do they make sense, or are there internal contradictions?
  – have they left out anything which may be inferred logically?

What are Taxonomies?
A simple definition: a directed acyclic graph of nodes and edges, where the edges represent a "subtype" relation
[Figure: example taxonomy in which Anhinga melanogaster, Anhinga nova., and Anhinga rufa are each linked to Anhinga by an "is a" edge]
Potential additional constraints:
• children are disjoint (child-disjointness, D)
• children together cover their parent (coverage, C)
• nodes are non-empty (non-emptiness, N)
We call these "latent taxonomic assumptions" (LTAs)
• More than one LTA may apply
• 8 combinations: none, C, D, N, CD, CN, DN, CDN

Inconsistency in a Taxonomy
Inconsistent under the ND (non-emptiness and disjoint children) LTA.
[Figure: taxonomy with nodes A, B, C, D; B and C are children of A, and D is a child of both B and C]
If B and C are children of A, then they must be disjoint. However, they both contain elements of D.

How do Taxonomies Relate?
Articulations relate nodes between taxonomies.
Between any two nodes M and N in the taxonomies, one, and only one, of the following five relations must hold:
(i) congruence: M ≡ N
(ii) proper inclusion: M > N
(iii) proper inverse inclusion: M < N
(iv) partial overlap: M o N
(v) exclusion: M x N

Many Possible Articulation Sets
Benson, 1948: Ranunculus aquatilis; R.a. var calvescens; R.a. var capillaceus
FNA-03, 1997: Ranunculus aquatilis; R.a. var aquatilis; R.a. var diffusus; R.a. var hispidulus
[Figure: the two taxonomies side by side, with ≡ and < articulations drawn between their nodes]
With five relations plus an "unknown/unstated" relation, and 3 x 4 pairs of nodes, there are 6^12 (over 2 billion) possible sets of articulations.

Articulations: Some Make Sense
Taxonomy 1: A, with children B and C (isa)
Taxonomy 2: D, with children E and F (isa)
Articulations: A < D, C ≡ E, B < F

Articulations: Some Are Impossible
Taxonomy 1: A, with children B and C (isa)
Taxonomy 2: D, with children E and F (isa)
Articulations: C > F, B < F
Assuming the non-emptiness and disjoint children LTAs

Articulations: Some Imply other Articulations
Taxonomy 1: A, with children B and C (isa)
Taxonomy 2: D, with children E and F (isa)
Articulations: A ≡ D, C ≡ E
Implies B ≡ F
Assuming the non-emptiness, disjoint children and coverage LTAs

The Relation Lattice
• Sometimes, a single relation between two nodes is unknown.
• The relation lattice shows all 32 possible combined relations.
• Each node represents a disjunction of relations.
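The two counts quoted above (32 lattice nodes, and over 2 billion articulation sets for 3 x 4 nodes) can be reproduced with a few lines of Python. This is only an illustrative sketch: the names `BASIC_RELATIONS`, `relation_lattice`, and `count_articulation_sets` are made up for this example, "=" stands in for the congruence symbol, and the lattice is assumed to include the empty disjunction so that the total is 2^5.

```python
from itertools import combinations

# The five basic relations: congruence, proper inclusion, proper inverse
# inclusion, partial overlap, exclusion. "=" stands in for congruence.
BASIC_RELATIONS = ("=", ">", "<", "o", "x")


def relation_lattice(relations=BASIC_RELATIONS):
    """Return every subset of the basic relations, grouped by rank (subset size)."""
    return {
        rank: [frozenset(combo) for combo in combinations(relations, rank)]
        for rank in range(len(relations) + 1)
    }


def count_articulation_sets(nodes_a, nodes_b, choices_per_pair=6):
    """Count articulation sets when every node pair gets one of the given choices."""
    return choices_per_pair ** (nodes_a * nodes_b)


lattice = relation_lattice()
total_nodes = sum(len(nodes) for nodes in lattice.values())
rank_sizes = [len(lattice[rank]) for rank in range(len(BASIC_RELATIONS) + 1)]

print(total_nodes)                    # 32
print(rank_sizes)                     # [1, 5, 10, 10, 5, 1]
print(count_articulation_sets(3, 4))  # 2176782336
```

Each rank of the lattice holds the disjunctions with the same number of basic relations, which is why the rank sizes follow the binomial coefficients.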
><ox ><o ><x >ox <ox > < > o <o >x < x >< x >< o > < <x o >< >o <o  > < o ><ox ox >o x <o x >x x ox x  Stanford Research Institute Artificial Intelligence Center Seminar 8/16/2007 CleanTAX, Dave Thau [email protected] 20 of 47 The Complexity of Developing Articulations The Ranunculus data set 9 Taxonomies 654 Taxa 704 Articulations visualization by Martin Graham Stanford Research Institute Artificial Intelligence Center Seminar 8/16/2007 CleanTAX, Dave Thau [email protected] 21 of 47 Example Articulation Set Benson, 1948 Kartesz, 2004 O O A B C C D B K L M I A J E F G H X A: B: C: Stanford Research Institute Artificial Intelligence Center Seminar 8/16/2007 R. petioralis R. macrantus R. fascicularis O X CleanTAX, Dave Thau [email protected] is included in equals overlaps disjoint 22 of 47 Goal – To Help Bob Know • that the taxonomies he's working with are consistent • when he's introduced an articulation that leads to inconsistency • when an articulation is implied by others • about ambiguous articulations Stanford Research Institute Artificial Intelligence Center Seminar 8/16/2007 CleanTAX, Dave Thau [email protected] 23 of 47 Berendsohn, et. al, 2003 - MoReTaX Stanford Research Institute Artificial Intelligence Center Seminar 8/16/2007 CleanTAX, Dave Thau [email protected] 24 of 47 Logic Based Approach • Devise a language LTax – First-order logic constraints on single-place predicates, where each predicate is a "taxon" • Render taxonomies and articulations between them into a set of first-order formulas • Then can ask, – does a taxonomy follow your definition of taxonomy? – is a pair of taxonomies plus articulations between them consistent? – are there unstated articulations? 
Translating Taxonomy into Logic

Taxonomy and LTA Formulas
isa: for each edge M isa N, add ∀x: M(x) → N(x)
Non-Emptiness (N): for each node N, add ∃x: N(x)
Child Disjointness (D): for each two children N1, N2 of M, add ∀x: N1(x) → ¬N2(x)
Coverage (C): for each node M with children N1, …, NL, add ∀x: M(x) → N1(x) ∨ … ∨ NL(x)

Articulation Formulas
Congruence, M ≡ N: ∀x: M(x) ↔ N(x)
Proper Inclusion, M > N: (∀x: N(x) → M(x)) ∧ (∃a: M(a) ∧ ¬N(a))
Proper Inverse Inclusion, M < N: (∀x: M(x) → N(x)) ∧ (∃a: N(a) ∧ ¬M(a))
Partial Overlap, M o N: ∃a ∃b ∃c: M(a) ∧ N(a) ∧ M(b) ∧ ¬N(b) ∧ ¬M(c) ∧ N(c)
Exclusion, M x N: ∀x: M(x) → ¬N(x)

Theorem Proving
Φ = { ∀x: B.Rac(x) → B.Ra(x), ∀x: B.Rat(x) → B.Ra(x), ∀x: B.Ra(x) ↔ K.Ra(x), ∀x: B.Rat(x) → K.Ra(x), ... }
ψ = (∀x: B.Rac(x) → K.Ra(x)) ∧ (∃a: K.Ra(a) ∧ ¬B.Rac(a))
Want to show that Φ ⊨ ψ, that ψ holds in Φ.
To prove it, show: Φ ∪ {¬ψ} ⊢ ⊥

CleanTax Methodology
Given a set of taxonomies and articulations between them:
1. Check each taxonomy under each LTA set to see if it's consistent
2. Check the articulations under each LTA set to see if they are consistent
3. Check the taxonomies plus the articulations under the LTA sets from above and make sure the combination is consistent
4. If so, for each pair-wise combination of nodes, try to prove each possible relationship under each consistent LTA set.
Implemented in Python. The theorem prover Prover9 and the model searcher Mace4 are used to prove relationships and check consistency.
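Step 1 of the methodology can be illustrated on very small taxonomies without any external reasoner. The sketch below is a brute-force stand-in for a model search, not the Prover9/Mace4 pipeline the talk describes, and `is_consistent` is a hypothetical name. It relies on an assumption about this particular fragment: isa, disjointness, and coverage only restrict which taxa a single individual may belong to, while non-emptiness only asks for some individual in each taxon, so it is enough to enumerate the admissible membership patterns and look for a witness per node.

```python
from itertools import combinations, product


def is_consistent(edges, ltas=""):
    """Brute-force check of whether a taxonomy has a model under the given LTAs.

    edges -- (child, parent) pairs, each read as "child isa parent"
    ltas  -- any combination of the letters "N", "D", "C"
    """
    nodes = sorted({node for edge in edges for node in edge})
    children = {}
    for child, parent in edges:
        children.setdefault(parent, []).append(child)

    def admissible(member):
        # `member` maps each taxon to True/False for one hypothetical individual.
        if any(member[c] and not member[p] for c, p in edges):
            return False  # breaks an isa edge
        if "D" in ltas:
            for kids in children.values():
                if any(member[a] and member[b] for a, b in combinations(kids, 2)):
                    return False  # sits in two sibling taxa at once
        if "C" in ltas:
            for parent, kids in children.items():
                if member[parent] and not any(member[k] for k in kids):
                    return False  # in a parent but in none of its children
        return True

    kinds = []
    for bits in product([False, True], repeat=len(nodes)):
        member = dict(zip(nodes, bits))
        if admissible(member):
            kinds.append(member)

    if "N" in ltas:
        # Every taxon needs at least one admissible kind of individual in it.
        return all(any(kind[n] for kind in kinds) for n in nodes)
    return bool(kinds)


# The inconsistent example from the talk: D is a child of both B and C.
diamond = [("B", "A"), ("C", "A"), ("D", "B"), ("D", "C")]
results = {ltas: is_consistent(diamond, ltas) for ltas in ("", "N", "D", "ND")}
print(results)
```

On this example only the ND combination is reported as inconsistent, matching the slide: with disjoint children D must be empty, which non-emptiness forbids. The enumeration is exponential in the number of nodes, so it is suitable only for toy inputs.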
The CleanTAX Infrastructure
• Features
  – Designed to plug in a variety of reasoners
  – Works with computer clusters (Sun Grid Engine)
  – Can work with whole taxonomies or subsets
• Command line options
  – Specify taxonomies and articulation sets to test
  – Specify relations to test
  – Specify LTAs to test
  – Specify nodes to test
  – Pass parameters to the reasoners
• Inputs
  – Taxonomic Concept Schema (an XML spec)
  – Individual reasoner files
  – Internal representation
• Example Reports
  – Which taxonomies are consistent under which LTAs
  – For each pair of nodes tested, for each relation, under each LTA, whether or not it can be proven true
  – For each set of taxonomies and articulations, under each LTA, a graph showing new inferred relations

Initial results
We ran two Ranunculus taxonomies (Benson 1948, 218 Taxa and Kartesz 2004, 142 Taxa) and 206 Articulations from Peet 2005.
When the taxonomies and the articulations were analyzed as a whole, only two LTA combinations were provably consistent: no LTAs and non-emptiness.
This involved 928,680 judgments and took 46.0 hours.
To get a better sense for the impact of LTAs, the combined taxonomies and articulations were divided into 82 connected subgraphs.
Among these we found 5 inconsistencies and 1946 new articulations.
This involved 166,920 judgments and took 4.8 hours.

Discovered Inconsistent Mapping under the {coverage, disjointness, non-emptiness} LTA set
Benson, 1948: Ranunculus hydrocharoides; R.h. var natans; R.h. var stolonifer; R.h. var typicus
Kartesz, 2004: Ranunculus hydrocharoides; R.h. var stolonifer; R.h. var typicus
Peet, 2005:
B.1948:R.h.stolonifer is congruent to K.2004:R.h.stolonifer
B.1948:R.h.typicus is congruent to K.2004:R.h.typicus
B.1948:R. hydrocharoides is congruent to K.2004:R. hydrocharoides
The most likely fix here is to change the congruence relation between the top two nodes to instead state that Benson's R. hydrocharoides includes Kartesz's.

Formal Proof of Inconsistency

Inferring Additional Knowledge
Does C = E? Or, is C > E?
[Figure: articulations between Benson, 1948 and Kartesz, 2004]
A: Ranunculus hispidus
B: R.h. var caricetorum
C: R.h. var hispidus
D: R.h. var nitidus
E: Ranunculus hispidus
F: R.h. var eurylobus
G: R.h. var greenmanii
H: R.h. var marilandicus
I: R.h. var typicus
J: R. septentrionalis
K: R. carolinianus
Legend: taxonomy-provided isa; articulated proper inverse inclusion (<); articulated congruence (≡)

Most Informative Relation (MIR)
[Figure: the relation lattice over disjunctions of the five basic relations (≡, >, <, o, x)]

Latent Taxonomic Assumptions vs New Maximally Informative Relations
                  The Basic Five Relations    The Other 28 Relations
No LTAs                     245                        304
All Three LTAs              475                         74
Numbers represent novel provably true relations within 75 subtaxonomies.
Main finding: More constraints lead to more specificity in provably true relations.

Optimizations
LTA Optimization
[Figure: lattice of LTA sets, with NDC at the top and NC, ND, DC, N, D, C below it]
If a set of axioms is inconsistent under one node, it will be inconsistent under all the supersets of that node.

Finding the MIR
Algorithm 1: Bottom Up (A↑)
[Figure: the relation lattice]
Try relations on the bottom rank in order, then, if none is true, go to the next rank.

Finding the MIR
Algorithm 2: Top Down (A↓)
[Figure: the relation lattice]
Just check the relations in the penultimate rank.
Example, for relations A through E: proving (A ∨ B ∨ C ∨ D), which rules out E, and (B ∨ C ∨ D ∨ E), which rules out A, leaves (B ∨ C ∨ D).

Relation Lattice Optimization Results 1
Comparing the two full taxonomies under the non-emptiness LTA shows a strong improvement for the top-down optimization.
                            A0         A↑         A↓
Number of Judgments         928,680    912,779    154,780
Time (hours)                46.0       45.3       7.8 (a 5.8x speedup)
Logical Steps (millions)    2,634      2,589      442

Relation Lattice Optimization Results 2
Under more restrictive constraints, the bottom-up optimization improves. Results are for 75 sub-taxonomies under the NDC LTA.
                             A0        A↑                        A↓
Number of Judgments          17,019    2,194                     2,745
Time (seconds)               574.59    83.61 (a 6.9x speedup)    100.47 (a 5.7x speedup)
Logical Steps (thousands)    2,484     384                       394

Summary: Contributions To Date
• Represented taxonomies and articulations between them in logic
• Clarified and represented latent taxonomic assumptions
• Created an infrastructure capable of applying reasoners to large taxonomies and articulation sets
  – discovering inconsistencies
  – discovering interesting new relations
  – elucidating impact of LTAs on reasoning
• Described and tested three optimizations

Future Work: Applications
[Figure: Paul Craig and Jessie Kennedy (2007), School of Computing, Napier University, Edinburgh]

Future Work: Suggesting Fixes
Benson, 1948: Ranunculus hydrocharoides; R.h. var natans; R.h. var stolonifer; R.h. var typicus
Kartesz, 2004: Ranunculus hydrocharoides; R.h. var stolonifer; R.h. var typicus
Inconsistency found, suggested fixes:
1. Change relation between Ranunculus hydrocharoides (Benson, 1948) and Ranunculus hydrocharoides (Kartesz, 2004) from ≡ to >.
2. Relax Non-Emptiness constraint, allowing Ranunculus hydrocharoides var. natans to be empty.
3. Relax Coverage constraint, allowing R. hydrocharoides to contain specimens not contained in its children.
4. …

Future Work: Other Logics – DL
[Figure: Benson, 1948 and Kartesz, 2004 taxonomies of Ranunculus (Ranunculus macranthus, Ranunculus petiolaris, …) with < and > articulations between them]

Other Future Work
• Better parallelization
• Better interfaces (GUI, Web Services)
• Applications to other domains
• Enhancing reporting tools to better support data curation

Conclusions
• Taxonomies are more complicated than you may have thought.
• Logic is a useful tool for discovering inconsistencies and new relations in taxonomies and articulations between them.
• This is an interesting interdisciplinary line of research combining elements from systematics, artificial intelligence, and high-performance computing.

Thanks!
Acknowledgements
Invaluable Consultation: Bertram Ludäscher and Shawn Bowers
Ranunculus Data Set: Bob Peet
Visualization Tools: Jessie Kennedy, Martin Graham and Paul Craig
Niche Modeling: Kirsten Menger-Anderson
Funding and Context: The SEEK project

References
D. Thau and B. Ludäscher. Reasoning about Taxonomies in First-Order Logic. Journal of Ecological Informatics (accepted for publication in 2007).
D. Thau and B. Ludäscher. Toward Optimizing CleanTAX: An Automated Reasoning Method for Taxonomies and Articulations (submitted to the 2007 IEEE/WIC/ACM International Conference on Web Intelligence).

SEEK is supported by the National Science Foundation under awards 0225676, 0225665, 0225635, and 0533368.