CleanTAX:
An Infrastructure for Reasoning
about Biological Taxonomies
Dave Thau and
Bertram Ludäscher
keywords: knowledge management, automatic reasoning,
semantic integration, biological classification
Stanford Research Institute
Artificial Intelligence Center Seminar
8/16/2007
CleanTAX, Dave Thau [email protected]
1 of 47
Outline
• Brief Overview of Taxonomies
• Impact of Different Taxonomic Views on Data
Analysis
• Taxonomies and Relations Between Them
• Using Logic to Determine Inconsistencies and
Discover New Relations
• Initial Results of Large Scale Analysis
• Some Optimizations
• Future Work
Beginnings of Biological Taxonomy
Egypt, 1500 BC: Ebers
medical papyrus,
classification of medicinal
plants
China, 350 BC: Erh-ya
dictionary (second century
BC) – classifies trees,
grasses, herbs, grains,
vegetables
Greece, 300 BC:
Theophrastus, Historia
plantarum and Causae
plantarum – 500 plants –
trees, herbs, fruiting plants,
perennials
Taxonomies are Everywhere:
Systematics
Plantae (kingdom)
Tracheophyta (phylum)
Magnoliopsida (class)
Ranunculales (order)
Ranunculaceae (family)
Ranunculus (genus)
Ranunculus asiaticus (species)
Taxonomies are Everywhere:
The Dewey Decimal System
000 Computers and general reference
100 Philosophy and psychology
200 Religion
300 Social sciences
400 Language
500 Science
600 Technology
700 Arts and Recreation
800 Literature
900 History and geography
Taxonomies are Everywhere:
Phylogenies
From Thomas D. Als, Roger Vila, Nikolai P. Kandul, David R. Nash,
Shen-Horn Yen, Yu-Feng Hsu, André A. Mignault, Jacobus J. Boomsma and
Naomi E. Pierce. Nature 432, 386-390.
Taxonomies are Everywhere:
Protein Structure
From Ed Green http://compbio.berkeley.edu/people/ed/SeqCompEval/
Taxonomies are Useful, But
Slippery
• In all of these cases, taxonomies
– Help us organize information
– Allow us to make inferences at many levels of
generality
• However, taxonomies are simply "views" of real
data
– Dewey Decimal or Library of Congress?
– Benson's view of Ranunculus or Kartesz's view?
– Conflicting phylogenies are common
– SCOP versus CATH
Different Taxonomies Can Lead To Different
Results
photo by David Behrens
[Figure: two maps of the predicted distribution of Anhinga melanogaster, one based on Clement's 4th Edition and one based on Clement's 5th Edition.]
[Figure: the "is a" trees of the two editions over Anhinga, Anhinga melanogaster, Anhinga nova. and Anhinga rufa, linked by "contained in" articulations.]
Articulations by Santa Barbara Software Products
Different Taxonomies Complicate
Data Analysis
What was the average number of Ranunculus
arizonicus seen in transect 1 in 2005?
Reasoning With Taxonomic Concepts
• Peet05 articulates relation between
Benson’48 and Kartesz’04 names …
• Is that articulation consistent?
• Can we infer additional information?
Problem Statement
• What are taxonomies, anyway?
• How do you know a taxonomy makes
sense?
• Given some articulations meant to
translate between taxonomies:
– do they make sense, or are there internal
contradictions?
– have they left out anything which may be
inferred logically?
What are Taxonomies?
A simple definition: A directed acyclic graph of
nodes and edges, where the edges represent a
"subtype" relation
[Figure: Anhinga melanogaster, Anhinga nova. and Anhinga rufa are each linked to Anhinga by an "is a" edge.]
Potential additional constraints:
• children are disjoint (child-disjointness, D)
• children partition their parents (coverage, C)
• nodes are non-empty (non-emptiness, N)
We call these "latent taxonomic assumptions"
• More than one LTA may apply
• 8 combinations: none, C, D, N, CD, CN, DN, CDN
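The eight combinations can be enumerated mechanically. A minimal Python sketch, assuming the one-letter codes from the slide; the names LTAS and lta_combinations are illustrative and not taken from CleanTAX:

```python
from itertools import combinations

# One-letter codes from the slide: coverage, child-disjointness, non-emptiness.
LTAS = ("C", "D", "N")

def lta_combinations():
    """Return every subset of the three LTAs, smallest first."""
    return [
        "".join(combo) or "none"
        for size in range(len(LTAS) + 1)
        for combo in combinations(LTAS, size)
    ]

print(lta_combinations())
# ['none', 'C', 'D', 'N', 'CD', 'CN', 'DN', 'CDN']
```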
Inconsistency in a Taxonomy
Inconsistent under the ND (non-emptiness
and disjoint children) LTA.
[Figure: A has children B and C; D is a child of both B and C.]
If B and C are children of A, then they must be
disjoint. However, they both contain elements of D
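This inconsistency can be checked by hand with a small enumeration. The sketch below is an illustration only, not the CleanTAX method: it assumes the edges B isa A, C isa A, D isa B and D isa C, lists every way a single individual could belong to the four nodes, and keeps the patterns that respect isa and child-disjointness. A node that appears in no surviving pattern is forced to be empty, which contradicts non-emptiness.

```python
from itertools import product

NODES = ["A", "B", "C", "D"]
# (child, parent) edges, assumed from the slide: D sits under both B and C.
ISA = [("B", "A"), ("C", "A"), ("D", "B"), ("D", "C")]

def admissible(pattern):
    """True if one individual with this membership pattern breaks no axiom."""
    # isa: membership in a child implies membership in its parent.
    if any(pattern[c] and not pattern[p] for c, p in ISA):
        return False
    # Child-disjointness (D): nobody lies in two children of one parent.
    for parent in NODES:
        kids = [c for c, p in ISA if p == parent]
        if sum(pattern[k] for k in kids) > 1:
            return False
    return True

patterns = [dict(zip(NODES, bits))
            for bits in product([False, True], repeat=len(NODES))]
allowed = [p for p in patterns if admissible(p)]

# Non-emptiness (N) needs every node to occur in at least one allowed pattern.
forced_empty = [n for n in NODES if not any(p[n] for p in allowed)]
print(forced_empty)  # ['D']
```

D is forced to be empty, so the taxonomy has no model once non-emptiness is also required.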
How do Taxonomies Relate?
Articulations relate nodes between
taxonomies
Between any two nodes in the taxonomies, one, and only one, of the following
five relations must hold:
(i) congruence: M ≡ N
(ii) proper inclusion: M > N
(iii) proper inverse inclusion: M < N
(iv) partial overlap: M o N
(v) exclusion: M x N
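When the members of two taxa are known explicitly, the five relations can be told apart with ordinary set operations. A minimal Python sketch; the function name relation is illustrative, and the "one and only one" property is assumed to rely on both sets being non-empty:

```python
def relation(m, n):
    """Name the one basic relation that holds between non-empty sets m and n."""
    if not m or not n:
        raise ValueError("both taxa must be non-empty")
    if m == n:
        return "congruence"
    if n < m:
        return "proper inclusion"          # M > N
    if m < n:
        return "proper inverse inclusion"  # M < N
    if m & n:
        return "partial overlap"           # M o N
    return "exclusion"                     # M x N

print(relation({1, 2, 3}, {2, 3}))  # proper inclusion
print(relation({1, 2}, {2, 3}))     # partial overlap
print(relation({1}, {2}))           # exclusion
```

In practice the members of a taxon are rarely listed, which is why the relations have to be stated as articulations and reasoned about logically.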
Many Possible Articulation Sets
Benson, 1948: Ranunculus aquatilis, with children R.a. var calvescens, R.a. var capillaceus
FNA-03, 1997: Ranunculus aquatilis, with children R.a. var aquatilis, R.a. var diffusus, R.a. var hispidulus
[Figure: example articulations (such as <) drawn between nodes of the two taxonomies.]
Five relations, plus "unknown/unstated relation", and 3 x 4 node pairs result in 6^12 (over 2 billion) possible sets of
articulations.
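The count follows from six choices for each of the twelve node pairs. A short Python check of the arithmetic, with illustrative variable names:

```python
# Five basic relations plus "unknown/unstated" for each pair of nodes.
CHOICES_PER_PAIR = 5 + 1
# 3 nodes in one taxonomy times 4 nodes in the other.
NODE_PAIRS = 3 * 4

articulation_sets = CHOICES_PER_PAIR ** NODE_PAIRS
print(f"{articulation_sets:,}")  # 2,176,782,336
```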
Articulations: Some Make Sense
Taxonomy 1: B isa A, C isa A
Taxonomy 2: E isa D, F isa D
Articulations: A < D, C ≡ E, B < F
Articulations: Some Are Impossible
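Taxonomy 1: B isa A, C isa A
Taxonomy 2: E isa D, F isa D
Articulations: C > F, B < F
Assuming the non-emptiness and disjoint-children LTAs: B < F and C > F place B inside C, but B and C must be disjoint and B must be non-empty.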
Articulations: Some Imply other
Articulations
Taxonomy 1: B isa A, C isa A
Taxonomy 2: E isa D, F isa D
Articulations: A ≡ D, C ≡ E
Implies B ≡ F
Assuming the non-emptiness, disjoint-children and coverage LTAs
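The implied articulation on this slide can be confirmed with a small enumeration. The sketch below is an illustration, not the CleanTAX procedure: it assumes the two congruences are A with D and C with E, lists every membership pattern a single individual could have across the six nodes, and keeps the patterns allowed by isa, child-disjointness, coverage and the congruences. If B and F agree in every surviving pattern, congruence of B and F follows.

```python
from itertools import product

NODES = ["A", "B", "C", "D", "E", "F"]
ISA = [("B", "A"), ("C", "A"), ("E", "D"), ("F", "D")]  # (child, parent)
CONGRUENT = [("A", "D"), ("C", "E")]                    # assumed articulations

def admissible(t):
    """True if membership pattern t violates none of the universal axioms."""
    if any(t[c] and not t[p] for c, p in ISA):           # isa
        return False
    for parent in {p for _, p in ISA}:
        kids = [c for c, p in ISA if p == parent]
        if sum(t[k] for k in kids) > 1:                  # disjoint children
            return False
        if t[parent] and not any(t[k] for k in kids):    # coverage
            return False
    return all(t[m] == t[n] for m, n in CONGRUENT)       # congruence

patterns = [dict(zip(NODES, bits))
            for bits in product([False, True], repeat=len(NODES))]
allowed = [t for t in patterns if admissible(t)]

implied = all(t["B"] == t["F"] for t in allowed)
print(len(allowed), implied)  # 3 True
```

Only three patterns survive (in nothing, in A, B, D and F, or in A, C, D and E), so every individual in B is in F and vice versa.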
The Relation Lattice
• Sometimes, a single relation between two nodes is unknown.
• The relation lattice shows all 32 possible combined relations.
• Each node represents a disjunction of relations.
[Figure: the relation lattice, whose nodes are the disjunctions of the five basic relations.]
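The 32 lattice nodes are the subsets of the five basic relations, grouped into ranks by size. A minimal Python sketch of that enumeration; the ASCII "=" is used here as a stand-in for the congruence symbol, and lattice_ranks is an illustrative name:

```python
from itertools import combinations

# "=" stands in for congruence; the other symbols follow the slides.
RELATIONS = ("=", ">", "<", "o", "x")

def lattice_ranks():
    """Map each rank (subset size) to the disjunctions on that rank."""
    return {
        size: [frozenset(c) for c in combinations(RELATIONS, size)]
        for size in range(len(RELATIONS) + 1)
    }

ranks = lattice_ranks()
for size, nodes in ranks.items():
    print(size, len(nodes))
print(sum(len(nodes) for nodes in ranks.values()))  # 32
```

Rank 1 holds the five basic relations, rank 5 is the disjunction of all five (nothing is known), and rank 0 is the empty disjunction.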
The Complexity of Developing Articulations
The Ranunculus
data set
9 Taxonomies
654 Taxa
704 Articulations
visualization by
Martin Graham
Example Articulation Set
[Figure: an example articulation set between the taxonomies of Benson, 1948 and Kartesz, 2004.]
A: R. petiolaris
B: R. macranthus
C: R. fascicularis
Edge legend: is included in, equals, overlaps, disjoint
Goal – To Help Bob Know
• that the taxonomies he's working with are
consistent
• when he's introduced an articulation that
leads to inconsistency
• when an articulation is implied by others
• about ambiguous articulations
Berendsohn et al., 2003 - MoReTaX
Logic Based Approach
• Devise a language LTax
– First-order logic constraints on single-place predicates,
where each predicate is a "taxon"
• Render taxonomies and articulations between them
into a set of first-order formulas
• Then we can ask:
– does a taxonomy follow your definition of taxonomy?
– is a pair of taxonomies plus articulations between them
consistent?
– are there unstated articulations?
Translating Taxonomy into Logic
Taxonomy and LTA Formulas
isa: for each edge M isa N, add ∀x: M(x) → N(x)
Non-Emptiness (N): for each node N, add ∃x: N(x)
Child Disjointness (D): for each two children N1, N2 of M, add ∀x: N1(x) → ¬N2(x)
Coverage (C): for each node M with children N1, ..., NL, add ∀x: M(x) → N1(x) ∨ … ∨ NL(x)
Articulation Formulas
Congruence, M ≡ N: ∀x: M(x) ↔ N(x)
Proper Inclusion, M > N: ∀x: N(x) → M(x) ∧ ∃a: M(a) ∧ ¬N(a)
Proper Inverse Inclusion, M < N: ∀x: M(x) → N(x) ∧ ∃a: N(a) ∧ ¬M(a)
Partial Overlap, M o N: ∃a∃b∃c: M(a) ∧ N(a) ∧ M(b) ∧ ¬N(b) ∧ ¬M(c) ∧ N(c)
Exclusion, M x N: ∀x: M(x) → ¬N(x)
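The translation rules on this slide are mechanical enough to script. The following Python sketch generates the formulas as plain strings; it is an illustration under stated assumptions (taxonomies given as (child, parent) pairs, LTAs as a string of the codes N, D and C), and the function names are hypothetical rather than part of CleanTAX:

```python
ARTICULATION_TEMPLATES = {
    "≡": "∀x: {m}(x) ↔ {n}(x)",
    ">": "∀x: {n}(x) → {m}(x) ∧ ∃a: {m}(a) ∧ ¬{n}(a)",
    "<": "∀x: {m}(x) → {n}(x) ∧ ∃a: {n}(a) ∧ ¬{m}(a)",
    "o": "∃a∃b∃c: {m}(a) ∧ {n}(a) ∧ {m}(b) ∧ ¬{n}(b) ∧ ¬{m}(c) ∧ {n}(c)",
    "x": "∀x: {m}(x) → ¬{n}(x)",
}

def taxonomy_formulas(isa_edges, ltas=""):
    """Formulas for (child, parent) edges under the LTAs named in ltas."""
    formulas = [f"∀x: {c}(x) → {p}(x)" for c, p in isa_edges]
    nodes = sorted({node for edge in isa_edges for node in edge})
    children = {}
    for child, parent in isa_edges:
        children.setdefault(parent, []).append(child)
    if "N" in ltas:  # non-emptiness
        formulas += [f"∃x: {n}(x)" for n in nodes]
    if "D" in ltas:  # child disjointness, one formula per pair of siblings
        for kids in children.values():
            formulas += [f"∀x: {a}(x) → ¬{b}(x)"
                         for i, a in enumerate(kids) for b in kids[i + 1:]]
    if "C" in ltas:  # coverage
        for parent, kids in children.items():
            body = " ∨ ".join(f"{k}(x)" for k in kids)
            formulas.append(f"∀x: {parent}(x) → {body}")
    return formulas

def articulation_formula(m, rel, n):
    """Formula for the articulation 'm rel n'."""
    return ARTICULATION_TEMPLATES[rel].format(m=m, n=n)

for line in taxonomy_formulas([("B", "A"), ("C", "A")], "NDC"):
    print(line)
print(articulation_formula("A", "≡", "D"))
```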
Theorem Proving
Φ = { ∀x: B.Rac(x) → B.Ra(x),
∀x: B.Rat(x) → B.Ra(x),
∀x: B.Ra(x) ↔ K.Ra(x),
∀x: B.Rat(x) → K.Ra(x)...}
φ = ∀x: B.Rac(x) → K.Ra(x) ∧
∃a: K.Ra(a) ∧ ¬B.Rac(a)
Want to show that Φ ╞ φ, that φ holds in Φ
To prove it, show: Φ ∪ {¬φ} ├ ⊥
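A proof task of this shape can be written out as a Prover9 input file, with the axioms as assumptions and the candidate relation as the goal; Prover9 is understood to negate the goal and search for a contradiction. The Python sketch below only builds the text of such a file. It is an assumption-laden illustration: the underscore names stand in for the dotted names on the slide, prover9_input is a hypothetical helper, and only the four axioms shown on the slide are included, so this particular file is not expected to yield a proof without the elided axioms.

```python
def prover9_input(assumptions, goal):
    """Assemble Prover9 input text from formula strings (without final dots)."""
    lines = ["formulas(assumptions)."]
    lines += [f"  {a}." for a in assumptions]
    lines += ["end_of_list.", "", "formulas(goals).", f"  {goal}.", "end_of_list."]
    return "\n".join(lines)

# The four axioms visible on the slide; the rest of the set is elided there.
assumptions = [
    "all x (B_Rac(x) -> B_Ra(x))",
    "all x (B_Rat(x) -> B_Ra(x))",
    "all x (B_Ra(x) <-> K_Ra(x))",
    "all x (B_Rat(x) -> K_Ra(x))",
]
# Candidate relation: B.Rac is properly included in K.Ra.
goal = "(all x (B_Rac(x) -> K_Ra(x))) & (exists a (K_Ra(a) & -B_Rac(a)))"

text = prover9_input(assumptions, goal)
print(text)
```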
CleanTax Methodology
Given a set of taxonomies and articulations between
them
1. Check each taxonomy under each LTA set to see if it's consistent
2. Check the articulations under each LTA set to see if they are
consistent
3. Check the taxonomies plus the articulations under the LTA sets
from above and make sure the combination is consistent
4. If so, for each pair-wise combination of nodes, try to prove each
possible relationship under each consistent LTA set.
Implemented in Python. The theorem prover
Prover9 and the model searcher Mace4 are used to
prove relationships and check consistency.
The CleanTAX Infrastructure
• Features
– Designed to plug in a variety of reasoners
– Works with computer clusters (Sun Grid Engine)
– Can work with whole taxonomies or subsets
• Command line options
– Specify taxonomies and articulation sets to test
– Specify relations to test
– Specify LTAs to test
– Specify nodes to test
– Pass parameters to the reasoners
• Inputs
– Taxonomic Concept Schema (an XML spec)
– Individual reasoner files
– Internal representation
• Example Reports
– Which taxonomies are consistent under which LTAs
– For each pair of nodes tested, for each relation, under each LTA, whether or not it can be
proven true
– For each set of taxonomies and articulations, under each LTA, a graph showing new inferred
relations
Initial results
We ran two Ranunculus taxonomies (Benson 1948, 218 taxa, and
Kartesz 2004, 142 taxa) and 206 articulations from Peet 2005.
When the taxonomies and the articulations were analyzed as a whole,
only two LTA combinations were provably consistent: no LTAs and non-emptiness.
This involved 928,680 judgments and took 46.0 hours.
To get a better sense of the impact of LTAs, the combined taxonomies
and articulations were divided into 82 connected subgraphs.
Among these we found 5 inconsistencies and 1946 new articulations.
This involved 166,920 judgments and took 4.8 hours.
Discovered Inconsistent Mapping
under the {coverage, disjointness, non-emptiness} LTA set
Benson, 1948: Ranunculus hydrocharoides, with children R.h. var natans, R.h. var stolonifer, R.h. var typicus
Kartesz, 2004: Ranunculus hydrocharoides, with children R.h. var stolonifer, R.h. var typicus
Peet, 2005:
B.1948:R.h.stolonifer is congruent to K.2004:R.h.stolonifer
B.1948:R.h.typicus is congruent to K.2004:R.h.typicus
B.1948:R. hydrocharoides is congruent to K.2004:R. hydrocharoides
The most likely fix here is to change the congruence relation between the top
two nodes to instead state that Benson's R. hydrocharoides includes
Kartesz's.
Formal Proof of Inconsistency
Inferring Additional Knowledge
Does C = E? Or, is C > E?
[Figure: the Ranunculus hispidus concepts of Benson, 1948 and Kartesz, 2004 (nodes A to K), with taxonomy-provided isa edges and articulated < and ≡ relations.]
A: Ranunculus hispidus
B: R.h. var caricetorum
C: R.h. var hispidus
D: R.h. var nitidus
E: Ranunculus hispidus
F: R.h. var eurylobus
G: R.h. var greenmanii
H: R.h. var marilandicus
I: R.h. var typicus
J: R. septentrionalis
K: R. carolinianus
Legend: taxonomy-provided isa; articulated proper inverse inclusion (<); articulated congruence (≡)
Most Informative Relation (MIR)
[Figure: the relation lattice.]
Latent Taxonomic Assumptions vs New
Maximally Informative Relations
                  The Basic Five Relations    The Other 28 Relations
No LTAs           245                         304
All Three LTAs    475                         74
Numbers represent novel provably true relations within 75 subtaxonomies.
Main finding: More constraints lead to more specificity in
provably true relations
Optimizations
LTA Optimization
[Figure: lattice of LTA sets, with NDC at the top, then NC, ND and DC, then N, D and C.]
If a set of axioms is
inconsistent under one node, it
will be inconsistent under all
the supersets of that node.
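This pruning rule can be sketched in a few lines of Python. The example is an illustration under assumptions: is_consistent stands for whatever reasoner call decides consistency (here a toy stand-in that rejects any LTA set containing D), and consistent_lta_sets is a hypothetical name, not a CleanTAX function.

```python
from itertools import combinations

LTAS = ("C", "D", "N")

def consistent_lta_sets(is_consistent):
    """Test LTA sets smallest first, skipping supersets of inconsistent ones."""
    consistent, inconsistent, calls = [], [], 0
    for size in range(len(LTAS) + 1):
        for combo in combinations(LTAS, size):
            lta_set = frozenset(combo)
            # Adding constraints cannot repair an inconsistency.
            if any(bad <= lta_set for bad in inconsistent):
                continue
            calls += 1
            if is_consistent(lta_set):
                consistent.append(lta_set)
            else:
                inconsistent.append(lta_set)
    return consistent, calls

# Toy oracle for illustration: anything involving D is inconsistent.
toy_oracle = lambda lta_set: "D" not in lta_set

consistent, calls = consistent_lta_sets(toy_oracle)
print(sorted("".join(sorted(s)) for s in consistent), calls)
# ['', 'C', 'CN', 'N'] 5
```

With this toy oracle only five of the eight LTA sets reach the reasoner; CD, DN and CDN are skipped because D alone already failed.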

Finding the MIR
Algorithm 1: Bottom Up (A↑)
[Figure: the relation lattice.]
Try relations on the bottom rank in order,
then, if none is true, go to the next rank.
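A minimal Python sketch of the bottom-up search, assuming a function provable(disjunction) that asks the reasoner whether a disjunction of relations follows from the axioms. Here provable is replaced by a toy oracle built from the set of relations the axioms leave open; the names and the ASCII "=" for congruence are illustrative.

```python
from itertools import combinations

RELATIONS = ("=", ">", "<", "o", "x")  # "=" stands in for congruence

def bottom_up_mir(provable):
    """Return the first provable disjunction, lowest rank first, and the call count."""
    calls = 0
    for size in range(1, len(RELATIONS) + 1):
        for combo in combinations(RELATIONS, size):
            calls += 1
            if provable(frozenset(combo)):
                return frozenset(combo), calls
    return None, calls

def make_oracle(possible):
    """Toy reasoner: a disjunction is provable iff it covers every open relation."""
    possible = frozenset(possible)
    return lambda disjunction: possible <= disjunction

mir, calls = bottom_up_mir(make_oracle({"<", "o"}))
print(sorted(mir), calls)  # ['<', 'o'] 13
```

When the answer is one of the five basic relations the search stops within five calls; the less the axioms determine, the more ranks have to be tried.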
Finding the MIR
Algorithm 2: Top Down (A↓)
[Figure: the relation lattice.]
Just check the relations in the penultimate rank.
Writing A to E for the five basic relations: if (A ∨ B ∨ C ∨ D) is provable, E is ruled out; if (B ∨ C ∨ D ∨ E) is provable, A is ruled out; together they leave (B ∨ C ∨ D).
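The top-down idea can be sketched as five reasoner calls per node pair, one for each disjunction on the penultimate rank. In the Python sketch below, provable(disjunction) is an assumed interface to the reasoner, replaced by a toy oracle for illustration; the function names and the ASCII "=" for congruence are hypothetical.

```python
RELATIONS = ("=", ">", "<", "o", "x")  # "=" stands in for congruence

def top_down_mir(provable):
    """Intersect the provable four-relation disjunctions to obtain the MIR."""
    remaining = set(RELATIONS)
    for relation in RELATIONS:
        # The penultimate-rank node that leaves out exactly this relation.
        disjunction = frozenset(RELATIONS) - {relation}
        if provable(disjunction):
            remaining.discard(relation)  # this relation is ruled out
    return frozenset(remaining)

def make_oracle(possible):
    """Toy reasoner: a disjunction is provable iff it covers every open relation."""
    possible = frozenset(possible)
    return lambda disjunction: possible <= disjunction

print(sorted(top_down_mir(make_oracle({"<", "o"}))))  # ['<', 'o']
print(sorted(top_down_mir(make_oracle({">"}))))       # ['>']
```

The number of calls is fixed at five regardless of the answer, which fits cases where many pairs have no single provable basic relation.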
Relation Lattice Optimization
Results 1
Comparing the two full taxonomies under the non-emptiness
LTA shows a strong improvement for the top-down
optimization.
                            A0         A↑         A↓
Number of Judgments         928,680    912,779    154,780
Time (hours)                46.0       45.3       7.8 (a 5.8x speedup)
Logical Steps (millions)    2,634      2,589      442
Relation Lattice Optimization
Results 2
Under more restrictive constraints, the bottom-up
optimization improves. Results are for 75 sub-taxonomies
under the NDC LTA.
                             A0        A↑                        A↓
Number of Judgments          17,019    2,194                     2,745
Time (seconds)               574.59    83.61 (a 6.9x speedup)    100.47 (a 5.7x speedup)
Logical Steps (thousands)    2,484     384                       394
Summary: Contributions To Date
• Represented taxonomies and articulations
between them in logic
• Clarified and represented latent taxonomic
assumptions
• Created an infrastructure capable of applying
reasoners to large taxonomies and articulation sets
– discovering inconsistencies
– discovering interesting new relations
– elucidating impact of LTAs on reasoning
• Described and tested three optimizations
Future Work: Applications
Paul Craig and Jessie Kennedy (2007), School of Computing, Napier University, Edinburgh
Future Work: Suggesting Fixes
Benson, 1948: Ranunculus hydrocharoides, with children R.h. var natans, R.h. var stolonifer, R.h. var typicus
Kartesz, 2004: Ranunculus hydrocharoides, with children R.h. var stolonifer, R.h. var typicus
Inconsistency found, suggested fixes:
1. Change relation between Ranunculus hydrocharoides (Benson, 1948) and
Ranunculus hydrocharoides (Kartesz, 2004) from ≡ to >.
2. Relax Non-Emptiness constraint, allowing Ranunculus hydrocharoides var.
natans to be empty.
3. Relax Coverage constraint, allowing R. hydrocharoides to contain specimens
not contained in its children
4. …
Future Work: Other Logics – DL
[Figure: Ranunculus concepts in Benson, 1948 and Kartesz, 2004, including Ranunculus macranthus and Ranunculus petiolaris, with < and > articulations.]
Other Future Work
• Better parallelization
• Better interfaces (GUI, Web Services)
• Applications to other domains
• Enhancing reporting tools to better support
data curation
Conclusions
• Taxonomies are more complicated than you may
have thought.
• Logic is a useful tool for discovering
inconsistencies and new relations in taxonomies
and articulations between them.
• This is an interesting interdisciplinary line of
research combining elements from systematics,
artificial intelligence, and high-performance
computing.
Thanks!
Acknowledgements
Invaluable Consultation:
Bertram Ludäscher and Shawn Bowers
Ranunculus Data Set:
Bob Peet
Visualization Tools:
Jessie Kennedy, Martin Graham and
Paul Craig
Niche Modeling:
Kirsten Menger-Anderson
Funding and Context:
The SEEK project
References
D. Thau and B. Ludäscher. Reasoning about Taxonomies in First-Order Logic. Journal of
Ecological Informatics, (accepted for publication in 2007).
D. Thau and B. Ludäscher. Toward Optimizing CleanTAX: An Automated Reasoning
Method for Taxonomies and Articulations. (submitted to 2007 IEEE/WIC/ACM
International Conference on Web Intelligence).
SEEK is supported by the National Science Foundation under awards 0225676, 0225665, 0225635, and 0533368.