Knowledge-Based Integration of Neuroscience Data Sources
Amarnath Gupta, Bertram Ludäscher, Maryann Martone
University of California San Diego

A Standard Information Mediation Framework
[Figure: mediation architecture. Labels: Client Query; Integrated XML View; View Definition; Mediator; XML View; Wrapper; Data Source; XML Data Source]

A Neuroscience Question
• Cerebellar distribution of rat proteins with more than 70% homology with human NCS-1?
• Any structure specificity?
• How about other rodents?
[Figure: Integrated View and View Definition at the Mediator, over Wrappers for the sources. Source labels: protein localization; morphometry; neurotransmission; WWW (CaBP, Expasy)]

Integration Issues
• Structural Heterogeneity
  – Resolved by converting to common semistructured data model
• Heterogeneity in Query Capabilities
  – Resolved by writing wrappers with binding patterns and other capability-definition languages
• Semantic Heterogeneity
  – Schema conflicts
    • Partially resolved by mapping rules in the mediator
  – Hidden Semantics?

Hidden Semantics: Protein Localization
[Figure: image labels: Purkinje Cell layer of Cerebellar Cortex; Molecular layer of Cerebellar Cortex; Fragment of dendrite]

  <protein_localization>
    <neuron type="purkinje cell" />
    <protein channel="red">
      <name>RyR</>
      …
    </protein>
    <region h_grid_pos="1" v_grid_pos="A">
      <density>
        <structure fraction="0.8">
          <name>spine</>
          <amount name="RyR">0</>
        </>
        <structure fraction="0.2">
          <name>branchlet</>
          <amount name="RyR">30</>
        </>

Hidden Semantics: Morphometry

  <neuron name="purkinje cell">
    <branch level="10">
      <shaft> … </shaft>
      <spine number="1">
        <attachment x="5.3" y="-3.2" z="8.7" />
        <length>12.348</>
        <min_section>1.93</>
        <max_section>4.47</>
        <surface_area>9.884</>
        <volume>7.930</>
        <head>
          <width>4.47</>
          <length>1.79</>
        </head>
      </spine>
      …

Annotations:
• Branch level beyond 4 is a branchlet
• Must be dendritic because Purkinje cells don’t have somatic spines

The Problem
• Multiple Worlds Integration
  – compatible terms not directly joinable
  – complex, indirect associations among schema elements
  – unstated integrity constraints
• Why not use ontologies?
  – typical ontologies associate terms along a limited number of dimensions
• What’s needed
  – a “theory” under which non-identical terms can be “semantically” joined

Our Approach
• Modify the standard Mediation Architecture
  – Wrapper
    • Extend to encode an object-version of the structure schema
  – Mediator
    • Redesign to incorporate auxiliary knowledge sources to
      – Correlate object schema of sources
      – Define additional objects not specified but derivable from sources
• At the Mediator
  – Use a logic engine to
    • Encode the mapping rules between sources
    • Define integrated views using a combination of exported objects from source and the auxiliary knowledge sources
    • Perform query decomposition
  – We still use Global-as-View form of mediation

The KIND Architecture
[Figure: architecture diagram. Labels: Integrated User View; View Definition Rules; Logic Engine; Integration Logic; Schema of Registered Sources; Materialized Views; Auxiliary Knowledge Source 1; Auxiliary Knowledge Source 2; Object Wrapper (two); Structure Wrapper (two); Src 1; Src 2]

The Knowledge-Base
• Situate every data object in its anatomical context
  – An illustration
  – New data is registered with the knowledge-base
  – Insertion of new data reconciles the current knowledge-base with the new information by:
    • Indexing the data with the source as part of registration
    • Extending the knowledge-base
    • Creating new views with complex rules to encode additional domain knowledge

F-Logic for the Mediation Engine
• Why F-Logic?
  – Provides the power of Datalog (with negation) and object creation through Skolem IDs
  – Correct amount of “notational sugar” and rules to provide object-oriented abstraction
  – Schema-level reasoning
  – Expressing variable arity
• F-Logic in KIND
  – Source schema wrapped into F-Logic schema
  – Knowledge-sources programmed in F-Logic
  – Definition of Integrated Views

Wrapping into Logic Objects
• Automated Part

    <!ELEMENT Studies (Study)*>
    <!ELEMENT Study (study_id, … animal, experiments, experimenters)>
    <!ELEMENT experiments (experiment)*>
    <!ELEMENT experiment (description, instrument, parameters)>

    studyDB[studies =>> study].
    study[study_id => string; … animal => animal; experiments =>> experiment; experimenters =>> string].
    …

• Non-automated Part
  – Subclasses

      mushroom_spine::spine

  – Rules

      S:mushroom_spine IF S:spine[head -> _; neck -> _].

  – Integrity Constraints

      ic1(S):alert[type -> "invalid spine"; object -> S] IF S:spine[undef ->> {head, neck}].

Computing with Auxiliary Sources
• Creating Mediated Classes

    animal[M => R] IF S:source, S.animal[M => R].      (union view)
    animal[taxon => 'TAXON'.taxon].
    X[taxon -> T] IF X:'PROLAB'.animal[name -> N], words(N,[W1,W2|_]), T:'TAXON'.taxon[genus -> W1; species -> W2].      (association rule)

• Reasoning with Schema
  – Schema

      taxon[subspecies => string; species => string; genus => string; … phylum => string; kingdom => string; superkingdom => string].

  – At Mediator

      subspecies::species::genus:: … kingdom::superkingdom

  – Class creation by schema reasoning

      T:TR, TR::TR1 IF T:'TAXON'.taxon[Taxon_Rank -> TR, Taxon_Rank1 -> TR1], Taxon_Rank::Taxon_Rank1.
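
The association rule above links each PROLAB animal to a TAXON entry by reading the first two words of the animal's name as genus and species. A minimal Python sketch of that join is given here. The dictionary layout, the taxon identifiers, and the sample animal records are hypothetical placeholders for illustration and are not taken from PROLAB or TAXON.

```python
from typing import Optional

# Hypothetical stand-in for the 'TAXON' knowledge source:
# (genus, species) -> taxon identifier.
TAXON = {
    ("Rattus", "norvegicus"): "taxon_rat",
    ("Mus", "musculus"): "taxon_mouse",
}


def associate_taxon(animal_name: str) -> Optional[str]:
    """Mimic words(N, [W1, W2|_]): take the first two words of the name
    and look them up as genus and species."""
    words = animal_name.split()
    if len(words) < 2:
        return None  # the rule body fails, so no taxon is derived
    genus, species = words[0], words[1]
    return TAXON.get((genus, species))


# Hypothetical PROLAB animal records.
prolab_animals = [
    {"name": "Rattus norvegicus Sprague-Dawley"},
    {"name": "Mus musculus"},
    {"name": "unknown"},
]

# Attach the derived taxon to every animal, as the rule X[taxon -> T] does.
for animal in prolab_animals:
    animal["taxon"] = associate_taxon(animal["name"])
```

As in the rule, any words after the second are ignored, and an animal whose name does not start with a known genus and species simply receives no taxon.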
Integrated View Definition
• Views are defined between sources and knowledge base
• Example: protein_distribution
  – given: organism, protein, brain_region
  – KB Anatom:
    • recursively traverse the has_a paths under brain_region
    • collect all anatomical_entities
  – Source PROLAB:
    • join with anatomical structures and collect the value of attribute “image.segments.features.feature.protein_amount” where “image.segments.features.feature.protein_name” = protein and “study_db.study.animal.name” = organism
  – Mediator:
    • aggregate over all parents up to brain_region
    • report distribution

Query Evaluation Example
• protein distribution of Human NCS-1 homologue (a second integrated view)
  – from wrapped CaBP website:
    • get the amino acid sequence for human NCS-1
  – from wrapped Expasy website:
    • submit amino acid sequence, get ranked homologues
  – at Mediator:
    • select homologues H found in rat, and homology > 0.70
  – at Mediator:
    • for each h in H
      – from previous view:
        » protein_distribution(rat, h, cerebellum, distribution)
    • Construct result

Implementation
• System
  – Flora as F-Logic Engine
  – Communicate with ODBC databases through underlying XSB Prolog
  – XML wrapping and Web querying through XMAS, our XML query language and custom-built wrappers
• Data
  – Human Brain Project sites
  – NPACI Neuroscience Thrust sites

Work in Progress
• Architecture
  – plug-in architecture for
    • domain knowledge sources
    • conceptual models from data sources
• Functionality
  – better handling of large data
  – operations
    • expressive query language
    • operators for domain knowledge manipulation
  – query evaluation
    • query optimization using domain knowledge
• Demonstration
  – at VLDB 2000
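
The protein_distribution view from the Integrated View Definition slide combines three steps: a recursive has_a traversal in the knowledge base, a filtered lookup in the source, and an aggregation at the mediator. The Python sketch below walks through those steps under stated assumptions. The anatomy fragment, the flattened row layout, and the amounts are hypothetical placeholders, the has_a graph is assumed to be a tree, and summation is only one possible choice of aggregate.

```python
from collections import defaultdict

# Hypothetical fragment of the anatomical knowledge base: parent -> has_a children.
HAS_A = {
    "cerebellum": ["cerebellar cortex"],
    "cerebellar cortex": ["purkinje cell layer", "molecular layer"],
}

# Hypothetical rows flattened from the PROLAB source:
# (organism, protein_name, anatomical_entity, protein_amount)
PROLAB_ROWS = [
    ("rat", "RyR", "purkinje cell layer", 30.0),
    ("rat", "RyR", "molecular layer", 10.0),
    ("mouse", "RyR", "molecular layer", 5.0),
]


def descendants(region):
    """Recursively follow has_a edges and collect every entity under region."""
    found = []
    for child in HAS_A.get(region, []):
        found.append(child)
        found.extend(descendants(child))
    return found


def protein_distribution(organism, protein, brain_region):
    """Return {entity: aggregated amount} for brain_region and everything below it."""
    entities = [brain_region] + descendants(brain_region)

    # Source step: amounts recorded directly at each anatomical entity.
    direct = defaultdict(float)
    for org, prot, entity, amount in PROLAB_ROWS:
        if org == organism and prot == protein and entity in entities:
            direct[entity] += amount

    # Mediator step: an entity's total is its own amount plus its descendants'.
    return {
        e: direct[e] + sum(direct[d] for d in descendants(e))
        for e in entities
    }


dist = protein_distribution("rat", "RyR", "cerebellum")
```

With the placeholder rows above, the amounts recorded for the two layers roll up into "cerebellar cortex" and then into "cerebellum", which is the aggregation over all parents up to brain_region described in the view.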