From Data Integration To Semantic Mediation: Addressing Heterogeneities in Data

Bertram Ludäscher
[email protected]
Knowledge-Based Information Systems Lab
San Diego Supercomputer Center and Department of Computer Science & Engineering
University of California, San Diego

Outline
1. Information Integration from a Database Perspective
2. XML-Based Data Integration
3. Model-Based / Semantic Mediation
4. Discussion

An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
[Figure: the question goes to a mediator (virtual DB) performing information integration (vs. data warehouse) over the sources addall.com, amazon.com, barnes&noble.com, half.com, A1books.com]
“One-World” Scenario: XML-based mediator

A Home Buyer’s Information Integration Problem
Which houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
[Figure: information integration over the sources Realtor, Crime Stats, School Rankings, Demographics]
“Multiple-Worlds” Scenario: XML-based mediator

A Neuroscientist’s Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
[Figure: information integration over the sources protein localization (NCMIR), sequence info (CaPROT), morphometry (SYNAPSE), neurotransmission (SENSELAB)]
“Complex Multiple-Worlds” Scenario: Model-based mediator

A Geoscientist’s Information Integration Problem
What is the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?
[Figure: information integration over the sources Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB)]
“Complex Multiple-Worlds” Scenario: Model-based mediator

Information Integration Challenges: Heterogeneities = S4...
• System Aspects
  – platforms, devices, distribution, APIs, protocols, …
• Syntaxes
  – heterogeneous data formats (one for each tool ...)
• Structures
  – heterogeneous schemas (one for each DB ...)
  – heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …)
• Semantics
  – unclear & “hidden” semantics: e.g., incoherent terminology, multiple / informal taxonomies, implicit assumptions, ...

Information Integration Challenges
[Figure: layers Semantics, Structure, Syntax, System aspects]
Goals: reconciling S4 heterogeneities; “gluing” together multiple data sources; bridging information and knowledge gaps computationally
• System aspects: “Grid” middleware
  – distributed data & computing
  – Web services, WSDL/SOAP, …
  – sources = functions, files, databases, …
• Syntax & Structure: (XML-Based) Mediators
  – wrapping, restructuring
  – (XML) queries and views
  – sources = (XML) databases
• Semantics: Model-Based/Semantic Mediators
  – conceptual models and declarative views
  – Semantic Web: ontologies, description logics, RDF(S), DAML+OIL, OWL, ...
  – sources = knowledge bases (DB+CMs+ICs)

Information Integration from a DB Perspective
• Information Integration Problem
  – Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user questions Q1, ..., Qn that can be answered using the Si
  – Find: the answers to Q1, ..., Qn
• The Database Perspective:
  – source = “database”: Si has a schema (relational, XML, OO, ...), Si can be queried
  – define virtual (or materialized) integrated views V over S1, ..., Sk using database query languages (SQL, XQuery, ...)
  – questions become queries Qi against V(S1, ..., Sk)

Outline
1. Information Integration from a Database Perspective
2. XML-Based Data Integration
3. Model-Based / Semantic Mediation
4. Discussion

Extensible Markup Language (XML)
[Figure: the same content as plain text, as marked-up text, as an XML element, as a tree, and as nested boxes]
  ... in their wonderful book called SemWeb Tractat by B. Schatz and T.B. Lee, the authors show how ...
  ... in their wonderful book called <title>SemWeb Tractat</title> by <author>B. Schatz</author> and <author>T.B. Lee</author>, the authors show how ...
  <book>
    <title>SemWeb Tractat</title>
    <author>B. Schatz</author>
    <author>T.B. Lee</author>
  </book>
  Tree: book with children title (“SemWeb Tractat”), author (“B. Schatz”), author (“T.B. Lee”)
  Boxes: book: [ title: “SemWeb Tractat”, author: “B. Schatz”, author: “T.B. Lee” ]
• (meta)language for marking up text & data with user-definable tags
  – (X)HTML, XSLT, XML Schema, ...
  – MathML, BioML, GeoML, NeuroML, ...
  – XML-RPC, SOAP, WSDL, OWL, ...
• semistructured tree data model
• container model:
  – flexible: marked-up text, web-pages, databases, ...
  – “boxes within boxes”

XML-Based Mediator Architecture
[Figure, top to bottom:]
  USER/Client: Query Q(G(S1, ..., Sk))
  Integrated Global XML View G
  MEDIATOR: Integrated View Definition, G(..) defined over S1(..)…Sk(..)
  XML Queries & Results
  XML View, XML View, XML View
  Wrapper, Wrapper, Wrapper
  S1, S2, ..., Sk

Some Challenges in XML-Based Integration ...
• XML Query/Transformation Languages
  – DB community: QLs for semistructured data, e.g., TSIMMIS/MSL, Lorel, Yatl, ..., Florid/F-logic [InfSystems98]
  – CSE/SDSC: XMAS [SSD99, SIGMOD99, WebDB99, EDBT00]
  – W3C: XPath, XSLT, XQuery (Working Draft, June 2001)
• XML Schema Languages
  – DTDs, RELAX NG, XML Schema, ... [XMLDM02]
• DB Theoreticians:
  – Expressiveness/Complexity Trade-Off
    • querying: FO, (WF/S-)Datalog, FO(LFP), FO(PFP), ..., all
    • reasoning: query satisfiability, containment, equivalence
    • ...
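The XML-based mediator architecture above (wrappers exporting XML views, an integrated global view G defined over S1, ..., Sk, and a user query Q posed against G) can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the two sources, their field names, the element names `offers`/`offer`/`title`/`total`, and the helper functions are hypothetical and are not taken from the talk. The sketch also materializes G eagerly for simplicity, whereas a virtual mediator would rewrite Q into source queries instead.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical source S1: a CSV export with prices in dollars.
S1_CSV = "title,price,shipping\nTractatus,12.50,3.00\nSemWeb Tractat,30.00,0.00\n"

# Hypothetical source S2: records with other field names and prices in cents.
S2_RECORDS = [
    {"name": "Tractatus", "cents": 1100, "ship_cents": 600},
    {"name": "SemWeb Tractat", "cents": 2500, "ship_cents": 400},
]


def make_offer(parent, title, total):
    """Append one <offer> with <title> and <total> children to parent."""
    offer = ET.SubElement(parent, "offer")
    ET.SubElement(offer, "title").text = title
    ET.SubElement(offer, "total").text = f"{total:.2f}"
    return offer


def wrap_s1(raw_csv):
    """Wrapper for S1: export the CSV rows as an XML view."""
    view = ET.Element("offers", source="S1")
    for row in csv.DictReader(io.StringIO(raw_csv)):
        make_offer(view, row["title"], float(row["price"]) + float(row["shipping"]))
    return view


def wrap_s2(records):
    """Wrapper for S2: export the records as an XML view (cents to dollars)."""
    view = ET.Element("offers", source="S2")
    for rec in records:
        make_offer(view, rec["name"], (rec["cents"] + rec["ship_cents"]) / 100)
    return view


def global_view(*source_views):
    """Integrated view G(S1, ..., Sk): union of all offers, tagged by source."""
    g = ET.Element("offers")
    for view in source_views:
        for offer in view.findall("offer"):
            merged = make_offer(g, offer.findtext("title"), float(offer.findtext("total")))
            merged.set("source", view.get("source"))
    return g


def cheapest(g, title):
    """User query Q over G: cheapest offer (including shipping) for a title."""
    matches = [o for o in g.findall("offer") if o.findtext("title") == title]
    if not matches:
        return None
    best = min(matches, key=lambda o: float(o.findtext("total")))
    return best.get("source"), float(best.findtext("total"))


G = global_view(wrap_s1(S1_CSV), wrap_s2(S2_RECORDS))
print(cheapest(G, "Tractatus"))       # ('S1', 15.5)
print(cheapest(G, "SemWeb Tractat"))  # ('S2', 29.0)
```

The wrappers hide the syntactic and structural differences between the sources (CSV vs. records, dollars vs. cents), so the query only has to know the schema of G. This corresponds to the "One-World" scenario: both sources describe the same kind of object, and a syntactic join on the title is enough.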
XMAS: XML Matching And Structuring language
  CONSTRUCT
    <books>
      <book> $a1 $t
        <pubs> $p { $p } </pubs>
      </book> { $a1, $t }
    </books>
  WHERE
    <books.book> $a1 : <author /> $t : <title /> </> IN "amazon.com"
    AND <authors.author> $a2 : <author /> <pubs> $p : <pub/> </> </> IN "www...DBLP… "
    AND value( $a1 ) = value( $a2 )
Integrated View Definition: “Find books from amazon.com and DBLP, join on author, group by authors and title”
XMAS [QL98, SIGMOD99]; XMAS Algebra [EDBT00]

XML (XMAS) Query Processing
[Figure: processing pipeline]
  Inputs: XML Query Q; XML Global View Definition G(S)
  Translator: produces algebraic plans
  Composition: Q(G), the composed plan
  Compile-time Rewriter/Optimizer: Q’(S), the optimized plan
  Run-time query evaluation: Plan Execution

…New Challenges in (XML-Based) Mediation
• Global-As-View (GAV)
  – user query Q over global relations G: Q(G)
  – global relations G defined over source relations S: G(S)
  – challenge: compute answers Q(G(V(S))) without computing all of V and G
    => query rewriting (with limited source capabilities): Q’(S) = Q(G)
• Local-As-View (LAV)
  – user query Q over global relations G: Q(G)
  – source relations S defined over global relations G: S(G)
  – challenge: “reverse/rewrite rules” from S(G) to some G’(S)
    => answering queries using views: equivalent rewritings may not exist
    => find maximally contained ones: Q’(G’(S)) ⊆ Q(G)
• Inter(CS)disciplinary research needed: DB, FP, LP
  – GAV/LAV view (un)folding, Clark’s completion, resolution, factoring

Querying XML Streams: A New Frontier
• New applications for stream-based XML processing:
  – Continuous, real-time data streams (wireless sensor networks, …)
  – Data / message transformation in Web services (SOAP, RMI, processing …)
  – Extract-transform-load applications (Tera/Peta-byte archival migration, …)
• … leading to a new XML querying & transformation paradigm:
  – how to execute (some) XML queries & transformations on very large (infinite) data streams using only limited memory
  – XML stream machine (XSM): extended XML transducers with buffers
[Figure: XQuery, XSM network]
XSMs clearly outperform tree-based
approaches on streamable queries (100x over Xalan)
[A Transducer-Based XML Query Processor, Ludäscher, Mukhopadhyay, Papakonstantinou, VLDB’02]

Outline
1. Information Integration from a Database Perspective
2. XML-Based Data Integration
3. Model-Based / Semantic Mediation
4. Discussion

A Neuroscientist’s Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
[Figure: information integration over the sources protein localization (NCMIR), sequence info (CaPROT), morphometry (SYNAPSE), neurotransmission (SENSELAB)]
“Complex Multiple-Worlds” Mediation

A Geoscientist’s Information Integration Problem
What is the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?
[Figure: information integration over the sources Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB)]
“Complex Multiple-Worlds” Mediation

What’s the Problem with XML & Complex Multiple-Worlds?
• XML is Syntax
  – ... for labeled ordered trees
  – ... all semantics lies outside of XML
    • XML DTDs => tags + nesting
    • XML Schema => DTDs + data modeling
    • need anything else? => write comments!
• Domain Semantics is Complex:
  – implicit assumptions, hidden semantics
    => sources seem unrelated to the non-expert
• Need Structure and Semantics beyond trees!
  – employ richer OO models
  – make domain semantics and “glue knowledge” explicit
  – use ontologies to fix terminology and conceptualization
  – avoid ambiguities by using KR and formal semantics

Information Integration Landscape
[Figure: landscape plot. Vertical axis: conceptual complexity/depth (low to high). Horizontal axis: conceptual distance (multiple-worlds, one-world). Labels: Model-Based Mediation; Ontologies, KR formalisms; DB mediation techniques; GO, EcoCyc, RiboWeb, UMLS, Tambis, BLAST, MIA, Entrez, Cyc, WordNet; Bioinformatics; Geo-, Ecoinformatics; home-buyer, 24x7 consumer, addall, book-buyer]

XML-Based vs.
Model-Based Mediation
[Figure: side-by-side comparison of the two approaches]
XML-Based Mediation:
  – Integrated-DTD := XML-QL(Src1-DTD, ...)
  – No Domain Constraints
  – Structural Constraints (DTDs): Parent, Child, Sibling, ...; e.g., A = (B*|C),D  B = ...
  – XML Elements, XML Models, over Raw Data
Model-Based Mediation:
  – CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …}
  – CM-QL ~ {F-Logic, DAML+OIL, …}
  – Integrated-CM := CM-QL(Src1-CM, ...)
  – “Glue Maps” = Domain & Process Maps (ontologies)
  – Logical Domain Constraints (IF ... THEN ... rules)
  – Classes, Relations, is-a, has-a, ... (classes C1, C2, C3, relation R)
  – (XML) Objects, Conceptual Models, over Raw Data

What’s the Glue? What’s in a Link?
• Syntactic Joins (equality)
  – (X,Y) := X.SSN = Y.SSN
  – (X,Y) := X.UMLS-ID = Y.UID
• “Speciality” Joins (similarity)
  – (X,Y,Score) := BLAST(X,Y,Score)
• Semantic/Rule-Based Joins
  – (X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S>0.8 (homology, lub)
  – (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y. (rule-based)
    e.g., X=-secretase, B=beta amyloid, Y=Alzheimer’s disease
• CS Challenge:
  – compile semantic joins into efficient syntactic ones

Semantic Mediation Methodology @ SOURCES
• Lift Sources to export CMs: CM(S) = OM(S) + KB(S) + CON(S)
• Object Model OM(S):
  – complex objects (frames), class hierarchy, OO constraints
• Knowledge Base KB(S):
  – explicit representation of (“hidden”) source semantics
  – logic rules over OM(S)
• Contextualization CON(S):
  – situate OM(S) data using “glue maps” (ontologies):
    domain maps DMs = terminological knowledge: concepts + roles
    process maps PMs = “procedural knowledge”: states + transitions

Semantic Mediation Methodology @ MEDIATOR
• Integrated View Definition (IVD)
  – declarative (logic) rules with object-oriented features
  – defined over CM(S), domain maps, process maps
  – needs “mediation engineers” = domain + KRDB experts
• Knowledge-Based Querying and Browsing (runtime):
  – mediator composes the user query Q with the IVD
    ... rewrites (Q o IVD), sends subqueries to sources ...
    post-processes returned results (e.g., situate in context)

Model-Based Mediator Architecture
[Figure, top to bottom:]
  USER/Client
  CM (Integrated View)
  Mediator Engine: Integrated View Definition IVD; LP rule proc. (XSB Engine); FL rule proc.; Graph proc.
  “Glue” Maps GMs: Domain Maps DMs, Process Maps PMs; semantic context CON(S)
  GCM, GCM, GCM
  CM Queries & Results (exchanged in XML)
  CM S1, CM S2, CM S3, with CM(S) = OM(S)+KB(S)+CON(S)
  CM-Wrapper (XML-Wrapper), CM-Wrapper (XML-Wrapper), CM-Wrapper (XML-Wrapper)
  S1, S2, S3
First results & Demos: KIND prototype, formal DM semantics, PMs [SSDBM00] [VLDB00] [ICDE01] [NIH-HB01] [BNCOD02] [ER02] [EDBT02] [BioInf02]

Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR
Domain Map = labeled graph with concepts ("classes") and roles ("associations")
• additional semantics: expressed as logic rules (F-logic)
Domain Expert Knowledge:
  Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).
[Figure: Domain Map (DM); DM in Description Logic]

Source Contextualization & DM Refinement
In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map...
  sources can register new concepts at the mediator ...

Example: ANATOM Domain Map

Browsing Registered Data with Domain Maps

Query Processing Demo

Mediator View Definition
  DERIVE protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
  WHERE
    I:protein_label_image[ proteins ->> {Protein}; organism -> Organism;
        anatomical_structures ->> {AS:anatomical_structure[name->Anatom]}] ,   % from PROLAB
    NAE:neuro_anatomic_entity[name->Anatom;                                    % from ANATOM
        located_in->>{Brain_region}],
    AS..segments..features[name->Feature_name; value->Value].
Contextualization: CON(Result) wrt. ANATOM. Query results in context.
• provided by the domain expert and mediation engineer
• deductive OO language (here: F-logic)

Example: Inside Query Evaluation
"How does the parallel fiber output (Yale/SENSELAB) relate to the distribution of Ryanodine Receptors (UCSD/NCMIR)?”
  push selection
    @SENSELAB: X1 := select targets of “output from parallel fiber” ;
  determine source context
    @MEDIATOR: X2 := “find and situate” X1 in ANATOM Domain Map;
  compute region of interest (here: downward closure)
    @MEDIATOR: X3 := subregion-closure(X2);
  push selection
    @NCMIR: X4 := select PROT-data(X3, Ryanodine Receptors);
  compute protein distribution
    @MEDIATOR: X5 := compute aggregate(X4);
  display in context
    @MEDIATOR/GUI: display X5 in context (ANATOM)
=> DEMONSTRATION

Open Database & Knowledge Representation Issues
• Mix of Query Processing and Reasoning
  – GAV & LAV with semantic query optimization (NIH BIRN, NSF GEON)
  – description logic reasoner for DMs (FaCT)?
  – reconciliation of conflicting DMs via argumentation-frameworks (“games”) using well-founded and stable models of logic programs [ICDT97, PODS97, TCS00, TODS02]
• Modeling “Process Knowledge” => Process Maps
  – formal semantics? (dynamic/temporal/Kripke models/Petri nets?)
  – executable semantics? (Statelog?)
• Graph Queries over DMs and PMs
  – expressible in F-logic [InfSystems98]
  – scalability? (UMLS Domain Map has millions of entries)
• How to incorporate “procedural features”?
  – Bioinformatics, Ecoinformatics, … => sources = DBs + analytical tools + …
  – scientific workflow planning and management (“promoter identification workflow” for DOE SciDAC, NSF/ITR SEEK)

Process Maps with Abstractions and Elaborations: From Terminological to Procedural Glue
• nodes ~ states
• edges ~ processes, transitions
• blue/red edges: processes in Src1/Src2
• general form of edges:
• related formalisms

A Scientific Workflow: Promoter Identification
[Figure: workflow diagram, Matthew Coleman, LLNL, 2002]
  Tools: clusfavor, blast (human, other species), GRAIL, TRANSFAC, CLUSTAL, Data Consolidation
  Data items: gi#’s from clusfavor; cDNA gi#; Genomic gi#; Chr #; Gene name; Gene location; GC Island location; Exon/intron location; Repeats location; Promoter location; TAF’s; Location on Genomic gi#’s; Probabilities of match; Probabilities of random match; Consensus sequences; Shared TAF’s across cluster; Common consensus sequence
  Validates polII promoter location
  Questions: Are chr#’s in common? Are chr#’s locations in common? Are there conserved upstream sequences? Are gene locations conserved across species?
  Questions: RNA POLII promoter? GpC Island present? Are there common TAF’s across genomic gi#?
  Questions: Are there other common genes?

SDM Demo & Architecture
Translation Approach: Abstract Workflow (AWF) => Executable Workflow (EWF)

Analytical Pipelines: An Open Source Tool

A Commercial Tool for Analytical Pipelines

Summary: Mediation Scenarios & Techniques
• Federated Databases
  – One-World
  – Glue? Common Schema
  – SQL, rules
  – Schema Transformations
  – Syntactic Joins
  – DB expert
• XML-Based Mediation
  – One-/Multiple-Worlds
  – Glue? Mediated Schema
  – XML query languages
  – Syntax-Aware Mappings
  – Syntactic Joins
  – DB expert
• Model-Based Mediation
  – Complex Multiple-Worlds
  – Glue? Common Glue Maps
  – DOOD query languages
  – Semantics-Aware Mappings
  – “Semantic” Joins via Glue Maps
  – KRDB + domain experts

GEON vs. SEEK

Outline
1. Information Integration from a Database Perspective
2. XML-Based Data Integration
3. Model-Based / Semantic Mediation
4. Discussion

Thank you! Questions? Queries?

Some References
• Model-Based Mediation:
  – A Model-Based Mediator System for Scientific Data Management, B. Ludäscher, A. Gupta, M. Martone, in Bioinformatics: Managing Scientific Data, Lacroix, Critchlow (eds.), Morgan Kaufmann, to appear, 2003.
  – Model-Based Mediation with Domain Maps, B. Ludäscher, A. Gupta, M. E. Martone, 17th Intl. Conference on Data Engineering (ICDE’01), Heidelberg, Germany, IEEE Computer Society, 2001.
  – Managing Semistructured Data with FLORID: A Deductive Object-Oriented Perspective, B. Ludäscher, R. Himmeröder, G. Lausen, W. May, C. Schlepphorst, Information Systems, 23(8), Special Issue on Semistructured Data, 1998.
• XML-Based Mediation:
  – VXD/Lazy Mediators: Navigation-Driven Evaluation of Virtual Mediated Views, B. Ludäscher, Y. Papakonstantinou, P. Velikhov, Intl. Conference on Extending Database Technology (EDBT’00), Konstanz, Germany, LNCS 1777, Springer, 2000.
  – XML Streams: A Transducer-Based XML Query Processor, B. Ludäscher, P. Mukhopadhyay, Y. Papakonstantinou, Intl. Conference on Very Large Databases (VLDB’02), Hong Kong, 2002.

Knowledge Representation: Relating Theory to the World via Formal Models
John F. Sowa, Knowledge Representation: Logical, Philosophical, and Computational Foundations
“All models are wrong, but some are useful!”