From Database Federation to Model-Based Mediation: Databases Meets* Knowledge Representation
(* or rather rediscovers)

Bertram Ludäscher
Data and Knowledge Systems, San Diego Supercomputer Center, U.C. San Diego

Outline
• Information Integration from a database perspective
  – examples, mediator approach, some technical challenges
• Part I: XML-Based Mediation
  – based on querying semistructured data & XML
  – navigation-driven query evaluation
  – ongoing/future research: querying XML streams
• Part II: Model-Based Mediation
  – basic ideas & architecture, lifting data to knowledge sources
  – “glue maps” (domain maps, process maps)
  – ongoing/future research: mix of DB & KR techniques
• Summary

An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
[Figure: “One-World” Mediation. Information integration over WWW sources: addall.com, public library, amazon.com, barnes&noble.com, half.com, A1books.com]

A Home Buyer’s Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
[Figure: “Multiple-Worlds” Mediation. Information integration over sources: Realtor, Crime Stats, School Rankings, Demographics]

Information Integration from a DB Perspective
• Information Integration Challenge
  – Given: data sources S_1, ..., S_k (DBMS, web sites, ...) and user questions Q_1, ..., Q_n that can be answered using the S_i
  – Find: the answers to Q_1, ..., Q_n
• The Database Perspective:
  – source = “database”: S_i has a schema (relational, XML, OO, ...) and can be queried
  – define virtual (or materialized) integrated views V over S_1, ..., S_k using database query languages
  – questions become queries Q_i against V(S_1, ..., S_k)
• Why a Database Perspective?
  – scalability, efficiency, reusability (declarative queries), ...
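
The database perspective above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the mediator system described in the talk: the two in-memory sources, their field names, and the helpers `integrated_view` and `cheapest_within` are hypothetical. The view V is virtual (a generator evaluated on demand), and the shopper's question becomes a query Q evaluated against V(S_1, S_2).

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, Any]

# Hypothetical sources S_1 and S_2 with different schemas (made-up data).
S1: List[Row] = [
    {"title": "Tractatus", "price": 12.0, "shipping": 4.0, "days": 5},
    {"title": "Tractatus", "price": 9.0, "shipping": 9.5, "days": 3},
]
S2: List[Row] = [
    {"name": "Tractatus", "total_cost": 15.0, "delivery_days": 10},
    {"name": "Tractatus", "total_cost": 17.0, "delivery_days": 2},
]


def integrated_view(s1: Iterable[Row], s2: Iterable[Row]) -> Iterator[Row]:
    """Virtual view V(S_1, S_2): maps both source schemas to one common schema."""
    for r in s1:
        yield {"title": r["title"], "total": r["price"] + r["shipping"],
               "days": r["days"], "source": "S1"}
    for r in s2:
        yield {"title": r["name"], "total": r["total_cost"],
               "days": r["delivery_days"], "source": "S2"}


def cheapest_within(view: Iterable[Row], title: str, max_days: int) -> Optional[Row]:
    """Query Q against V: cheapest offer for `title` deliverable within `max_days`."""
    offers = [o for o in view if o["title"] == title and o["days"] <= max_days]
    return min(offers, key=lambda o: o["total"]) if offers else None


best = cheapest_within(integrated_view(S1, S2), "Tractatus", max_days=7)
print(best)
```

The sketch evaluates Q by scanning all of V. A real mediator would instead try to push the selection on title and delivery time into the sources, which is the query composition problem raised in Part I.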
PART I: XML-Based Mediation

Abstract XML-Based Mediator Architecture
[Figure: architecture diagram. Labels: USER/Client; Query Q o V(S_1,...,S_k); Integrated XML View V; MEDIATOR with Integrated View Definition IVD(S1,...,Sn); XML Queries & Results; one XML View and Wrapper per source S_1, S_2, ..., S_k]

A Concrete (Future) XML-Based Mediator System
[Figure: architecture diagram. Labels: USER/Client; XQuery; XML (Integrated View); MEDIATOR Engine with Integrated View Definition IVD and XQuery Processor; XSQL, XPATH, XSLT; XML Queries & Results; XML-Wrappers (XSQL, XPath, XSLT, SQL, XScan, http-get) over S1, S2, S3]
First Results & Demos: XMAS language and algebra, VXD evaluation, BBQ UI [WebDB99] [SSD99] [SIGMOD99] [EDBT00] (w/ Papakonstantinou, Vianu, ...)

Some Technical Challenges ...
• XML Query Languages
  – DB community: QLs for semistructured data, e.g., TSIMMIS/MSL, Lorel, Yatl, ..., Florid/F-logic [InfSystems98]
  – CSE/SDSC: XMAS [SSD99,WebDB99,EDBT00]
  – W3C: XPath, XSLT, XQuery (Working Draft, June 2001)
• DB Theory: Expressiveness/Complexity Trade-Off
  – querying: FO, (WF/S-)Datalog, FO(LFP), FO(PFP), ..., all
  – reasoning: query satisfiability, containment, equivalence

... Some More Technical Challenges ...
• DB Practice: Query Composition
  – compute Q o V(S_1,...,S_k) w/o computing all of V: “push Q through V into S_i”
  – in Datalog: view unfolding (resolution, unification) + simplification ~ top-down evaluation ~ magic sets
  – in XML: some solutions (Papakonstantinou, ...)
• Navigation-Driven Evaluation of Integrated View V:
  – V materialized => warehousing approach
  – V virtual => mediator approach
  – V virtual & driven by user-navigation => VXD approach [EDBT00] (w/ Papakonstantinou, Velikhov)

XMAS: XML Matching And Structuring language
XMAS Integrated View Definition: “Find books from amazon.com and DBLP, join on author, group by authors and title”

  CONSTRUCT <books>
              <book> $a1 $t
                <pubs> $p { $p } </pubs>
              </book> { $a1, $t }
            </books>
  WHERE <books.book> $a1 : <author /> $t : <title /> </> IN "amazon.com"
    AND <authors.author> $a2 : <author /> <pubs> $p : <pub/> </> </> IN "www...DBLP… "
    AND value( $a1 ) = value( $a2 )

[Figure: XMAS Algebra]

XML (XMAS) Query Processing
[Figure: processing pipeline. Labels: XMAS Query Q; XMAS View Definition V; Translator; algebraic plans; Composition (Q o V); composed plan; Compile-time Rewriter/Optimizer; optimized plan; Run-time: lazy VXD evaluation; Plan Execution]

Navigation-Driven Evaluation: Lazy Mediators
[Figure: Lazy Mediator. Labels: Input: client navigations; result; view definition ans = V(S_1 … S_k); Output: source navigations; XML sources S_1, ..., S_k]

Open Issue: Querying XML Streams
• Given:
  – stream S of XML events (open, close, data)
  – XML query Q over S
  – constraints: 1-pass “on-the-fly” processing, bounded memory
• Find:
  – decide whether, and if so how, Q can be evaluated given the constraints
• Initial Approach:
  – transducer model XSM (XML Stream Machine) to approximate “streamable” queries (w/ Papakonstantinou, Mukhopadhyay, Vianu)

Example: XML Stream Query
XML query (r) = for each customer $C, list all orders $O
Query-aware DTD design is even more important for stream queries!

Example: XML Stream Machine (XSM)
• transducer model
• input/output: stream of XML events
• memory: finite state control, buffers
• transitions: on EVENT do ACTION

PART II: Model-Based Mediation

A Geoscientist’s Information Integration Problem
What is the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?
[Figure: “Complex Multiple-Worlds” Mediation. Information integration over sources: Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB)]

A Neuroscientist’s Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
[Figure: “Complex Multiple-Worlds” Mediation. Information integration over sources: protein localization (NCMIR), sequence info (CaPROT), morphometry (SYNAPSE), neurotransmission (SENSELAB)]

What’s the Problem with XML & Complex Multiple-Worlds?
• XML is Syntax
  – canonical syntax for labeled ordered trees
  – a metalanguage, but all semantics lies outside of XML
    • DTDs => tags + nesting, XML Schema => DTDs + data modeling
    • need anything else? => write comments!
• Domain Semantics is complex:
  – implicit assumptions, hidden semantics
  – sources seem unrelated to the non-expert
• Need Structure and Semantics beyond XML trees!
  – employ richer OO models
  – make domain semantics and “glue knowledge” explicit
  – use ontologies to fix terminology and conceptualization
  – avoid ambiguities by using formal semantics

Information Integration Landscape
[Figure: landscape plot. Vertical axis: conceptual complexity/depth (low to high); horizontal axis: conceptual distance (one-world to multiple-worlds). Plotted labels: Model-Based Mediation, GO, EcoCyc, Ontologies, KR formalisms, RiboWeb, UMLS, Bioinformatics, Geoinformatics, Tambis, BLAST, MIA, Entrez, Cyc, WordNet, DB mediation techniques, addall, book-buyer, home-buyer, 24x7 consumer]

XML-Based vs. Model-Based Mediation
• XML-Based: Integrated-DTD := XML-QL(Src1-DTD,...)
  – Structural Constraints (DTDs): Parent, Child, Sibling, ... (A = (B*|C),D; B = ...)
  – No Domain Constraints
  – layers: Raw Data, XML Elements, XML Models
• Model-Based: Integrated-CM := CM-QL(Src1-CM,...)
  – CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …}
  – CM-QL ~ {F-Logic, DAML+OIL, …}
  – Glue Maps: DMs, PMs
  – Logical Domain Constraints (IF ... THEN ...): Classes, Relations, is-a, has-a, ...
  – layers: Raw Data, (XML) Objects, Conceptual Models

What’s the Glue? What’s in a Link?
• Syntactic Joins (equality)
  – (X,Y) := X.SSN = Y.SSN
  – (X,Y) := X.UMLS-ID = Y.UID
• “Speciality” Joins (similarity)
  – (X,Y,Score) := BLAST(X,Y,Score)
• Semantic/Rule-Based Joins
  – (X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S>0.8   (homology, lub)
  – (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y.   (rule-based)
    e.g., X=-secretase, B=beta amyloid, Y=Alzheimer’s disease
• YAC (Yet Another Challenge):
  – compile semantic joins into efficient syntactic ones

Model-Based Mediation Methodology ...
• Lift Sources to export CMs: CM(S) = OM(S) + KB(S) + CON(S)
• Object Model OM(S):
  – complex objects (frames), class hierarchy, OO constraints
• Knowledge Base KB(S):
  – explicit representation of (“hidden”) source semantics
  – logic rules over OM(S)
• Contextualization CON(S):
  – situate OM(S) data using “glue maps” (GMs):
    • domain maps DMs (ontology) = terminological knowledge: concepts + roles
    • process maps PMs = “procedural knowledge”: states + transitions

... Model-Based Mediation Methodology
• Integrated View Definition (IVD)
  – declarative (logic) rules with object-oriented features
  – defined over CM(S), domain maps, process maps
  – needs “mediation engineers” = domain + KRDB experts
• Knowledge-Based Querying and Browsing (runtime):
  – mediator composes the user query Q with the IVD
  – ... rewrites (Q o IVD), sends subqueries to sources
  – ... post-processes returned results (e.g., situate in context)

Model-Based Mediator Architecture
[Figure, upper part: USER/Client; CM (Integrated View); “Glue” Maps GMs: Domain Maps DMs, Process Maps PMs; semantic context CON(S); Mediator Engine: Integrated View Definition IVD, XSB Engine, LP rule proc., FL rule proc., Graph proc.]
[Figure, lower part: GCM; CM S1, CM S2, CM S3; CM Queries & Results (exchanged in XML); CM(S) = OM(S)+KB(S)+CON(S); one CM-Wrapper (XML-Wrapper) per source S1, S2, S3]
First Results & Demos: KIND prototype, formal DM semantics, PMs [SSDBM00] [VLDB00] [ICDE01] [NIH-HB01] (w/ Gupta, Martone)

Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR
• Domain Map = labeled graph with concepts ("classes") and roles ("associations")
• additional semantics: expressed as logic rules (F-logic)

Domain Expert Knowledge:
  Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).
[Figure panels: Domain Expert Knowledge; Domain Map (DM); DM in Description Logic]

Source Contextualization & DM Refinement
In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map...
  – sources can register new concepts at the mediator ...

Example: ANATOM Domain Map

Browsing Registered Data with Domain Maps

Compilation: Domain Maps => F-Logic Rules
• Domain Maps ~ Ontologies
• DMs have a formal semantics via a translation to F-Logic (~ Datalog + OO features)
  => Declarative + “Executable” Specification
• query evaluation with deductive rules
• reasoning over decidable fragments:
  – checking concept subsumption, equivalence

Query Processing “Demo”
Integrated View Definition
• provided by the domain expert and mediation engineer
• deductive OO language (here: F-logic)

  DERIVE protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
  IF I:protein_label_image[ proteins ->> {Protein}; organism -> Organism;
         anatomical_structures ->> {AS:anatomical_structure[name->Anatom]}],   % from PROLAB
     NAE:neuro_anatomic_entity[name->Anatom;                                   % from ANATOM
         located_in->>{Brain_region}],
     AS..segments..features[name->Feature_name; value->Value].

Contextualization CON(Result) wrt. ANATOM: query results in context

Example: Inside Query Evaluation
“How does the parallel fiber output (Yale/SENSELAB) relate to the distribution of Ryanodine Receptors (UCSD/NCMIR)?”
• push selection
  @SENSELAB: X1 := select targets of “output from parallel fiber”;
• determine source context
  @MEDIATOR: X2 := “find and situate” X1 in ANATOM Domain Map;
• compute region of interest (here: downward closure)
  @MEDIATOR: X3 := subregion-closure(X2);
• push selection
  @NCMIR: X4 := select PROT-data(X3, Ryanodine Receptors);
• compute protein distribution
  @MEDIATOR: X5 := compute aggregate(X4);
• display in context
  @MEDIATOR/GUI: display X5 in context (ANATOM)

Some Open Database & Knowledge Representation Issues
• Mix of Query Processing and Reasoning
  – FaCT description logic reasoner for DMs?
  – or reconciliation of DMs via argumentation frameworks (“games”) using well-founded and stable models of logic programs [ICDT97,PODS97,TCS00]
• Modeling “Process Knowledge” => Process Maps
  – formal semantics? (dynamic/temporal/Kripke models?)
  – executable semantics? (Statelog?)
• Graph Queries over DMs and PMs
  – expressible in F-logic [InfSystems98]
  – scalability? (UMLS Domain Map has millions of entries)
• ...
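
The evaluation steps X1 to X5 on the “Inside Query Evaluation” slide, including the downward closure over the domain map, can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the KIND mediator: the domain map fragment `HAS_PART`, the two stand-in source tables, the protein amounts, and all function names are hypothetical, and plain Python replaces the F-logic rules used in the talk.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical fragment of an anatomical domain map: region -> direct subregions.
HAS_PART: Dict[str, List[str]] = {
    "cerebellum": ["cerebellar_cortex"],
    "cerebellar_cortex": ["molecular_layer", "purkinje_layer"],
    "purkinje_layer": ["purkinje_cell"],
    "purkinje_cell": ["purkinje_dendrite"],
}
# All concepts known to the (toy) domain map.
CONCEPTS: Set[str] = set(HAS_PART) | {c for cs in HAS_PART.values() for c in cs}

# Hypothetical stand-ins for the two sources (made-up data).
TARGETS_OF_OUTPUT: Dict[str, List[str]] = {"parallel fiber": ["purkinje_cell"]}
PROTEIN_DATA: List[Tuple[str, str, float]] = [  # (region, protein, amount)
    ("purkinje_cell", "RyR", 2.0),
    ("purkinje_dendrite", "RyR", 3.5),
    ("molecular_layer", "RyR", 1.0),
    ("purkinje_dendrite", "Calbindin", 4.0),
]


def subregion_closure(start: Set[str]) -> Set[str]:
    """Downward closure: every region reachable from `start` via has-part edges."""
    seen, stack = set(start), list(start)
    while stack:
        for child in HAS_PART.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def protein_distribution(output_of: str, protein: str) -> Dict[str, float]:
    x1 = set(TARGETS_OF_OUTPUT.get(output_of, []))   # X1: selection pushed to source 1
    x2 = x1 & CONCEPTS                               # X2: situate X1 in the domain map
    x3 = subregion_closure(x2)                       # X3: region of interest
    x4 = [(r, a) for (r, p, a) in PROTEIN_DATA       # X4: selection pushed to source 2
          if p == protein and r in x3]
    x5: Dict[str, float] = defaultdict(float)        # X5: aggregate per region
    for region, amount in x4:
        x5[region] += amount
    return dict(x5)


print(protein_distribution("parallel fiber", "RyR"))
```

The closure is computed by a plain graph traversal here. For a domain map with millions of entries, as in the UMLS scalability question above, such a traversal would have to be indexed or pushed into a database engine.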
Towards Process Maps with Abstractions and Elaborations
• nodes ~ states
• edges ~ processes, transitions
• blue/red edges: processes in Src1/Src2
• general form of edges: [Figure: general form of edges]

Summary: Mediation Scenarios & Techniques

  Federated Databases       XML-Based Mediation       Model-Based Mediation
  ------------------------  ------------------------  ------------------------------
  One-World                 One-/Multiple-Worlds      Complex Multiple-Worlds
  Common Schema             Mediated Schema           Common Glue Maps
  SQL, rules                XML query languages       DOOD query languages
  Schema Transformations    Syntax-Aware Mappings     Semantics-Aware Mappings
  Syntactic Joins           Syntactic Joins           “Semantic” Joins via Glue Maps
  DB expert                 DB expert                 KRDB + domain expert

Questions? Queries?

Some References
• XML-Based and Model-Based Mediation:
  – MBM: Model-Based Mediation with Domain Maps, B. Ludäscher, A. Gupta, M. E. Martone, 17th Intl. Conference on Data Engineering (ICDE), Heidelberg, Germany, IEEE Computer Society, 2001.
  – VXD/Lazy Mediators: Navigation-Driven Evaluation of Virtual Mediated Views, B. Ludäscher, Y. Papakonstantinou, P. Velikhov, Intl. Conference on Extending Database Technology (EDBT), Konstanz, Germany, LNCS 1777, Springer, 2000.
  – DOOD: Managing Semistructured Data with FLORID: A Deductive Object-Oriented Perspective, B. Ludäscher, R. Himmeröder, G. Lausen, W. May, C. Schlepphorst, Information Systems, 23(8), Special Issue on Semistructured Data, 1998.
• STATELOG (Logic Programming with States)
  – On Active Deductive Databases: The Statelog Approach, G. Lausen, B. Ludäscher, W. May. In Transactions and Change in Logic Databases, H. Decker, B. Freitag, M. Kifer, A. Voronkov (eds.), LNCS 1472, Springer, 1998.
• Argumentation Frameworks as Games
  – Games and Total DatalogNeg Queries, J. Flum, M. Kubierschky, B. Ludäscher, Theoretical Computer Science, 239(2), pp. 257-276, Elsevier, 2000.
  – Referential Actions as Logical Rules, B. Ludäscher, W. May, G. Lausen, Proc. 16th ACM Symposium on Principles of Database Systems (PODS'97), Tucson, Arizona, ACM Press, 1997.