Semantic Data Integration:
From Syntax and Structural Transformations to Semantics
Bertram Ludäscher
Data and Knowledge Systems
San Diego Supercomputer Center
U.C. San Diego
Outline
• Information Integration from a DB Perspective
• Part I: XML-Based Mediation
– wrapper/mediator approach
– based on querying semistructured data & XML
• Part II: Model-Based Mediation
– basic ideas & architecture, lifting data to knowledge sources
– “glue maps” (domain maps, process maps)
– formal framework: Description Logic, Frame-Logic
– ongoing/future research: mix of DB & KR techniques
• Summary
An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of
Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
Figure: “One-World” mediation; sites shown: addall.com, amazon.com, barnes&noble.com, half.com, A1books.com.
A Home Buyer’s Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms,
a nearby school ranking in the upper third, in a neighborhood
with below-average crime rate and diverse population?
Figure: “Multiple-Worlds” mediation; sources shown: Realtor, Crime Stats, School Rankings, Demographics.
A Geoscientist’s Information Integration Problem
What are the distribution and U/Pb zircon ages of A-type plutons in VA?
How about their 3-D geometry?
How does it relate to host rock structures?
Figure: “Complex Multiple-Worlds” mediation; sources shown: Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB).
A Neuroscientist’s Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70%
homology with human NCS-1? Any structure specificity?
How about other rodents?
Figure: “Complex Multiple-Worlds” mediation; sources shown: protein localization (NCMIR), sequence info (CaPROT), morphometry (SYNAPSE), neurotransmission (SENSELAB).
Information Integration from a DB Perspective
• Information Integration Challenge
– Given: data sources S_1, ..., S_k (DBMS, web sites, ...) and
user questions Q_1,...,Q_n that can be answered using the S_i
– Find: the answers to Q_1, ..., Q_n
• The Database Perspective: source = “database”
– S_i has a schema (relational, XML, OO, ...)
– S_i can be queried
– define virtual (or materialized) integrated views V over S_1,...,S_k using database query languages
– questions become queries Q_i against V(S_1,...,S_k)
• Why a Database Perspective?
– scalability, efficiency, reusability (declarative queries), ...
Technical Issues and Challenges
• Integration Method and Architecture
– federated DBs, wrapper-mediator approach, GAV/LAV,
warehouse/on-demand, ...
• Suitable KRDB Formalisms and Frameworks
– XML, DTDs/XML Schema, XPath, XQuery, ...
– RDF(S), Ontologies, Description Logics, DAML+OIL, ...
– querying, deduction, subsumption, classification, ...
• Algorithms and Implementation
– query composition, rewriting, reasoning, source capabilities, ...
• Information Integration Scenario and Scope
– simple/complex, single/multiple worlds, ...
Information Integration Landscape
Figure: integration scenarios and systems plotted by conceptual complexity/depth (low to high) against conceptual distance (one-world to multiple-worlds). Labels in the figure:
– approaches: DB mediation techniques; Model-Based Mediation; Ontologies, KR formalisms
– domains and systems: Bioinformatics, Geoinformatics, GO, EcoCyc, RiboWeb, UMLS, Tambis, BLAST, MIA, Entrez, Cyc, WordNet
– scenarios: addall, book-buyer, home-buyer, 24x7 consumer
PART I: XML-Based Mediation
Abstract (XML-Based) Mediator Architecture
• USER/Client: poses Query Q o V(S_1,...,S_k) against the Integrated XML View V
• MEDIATOR: holds the Integrated View Definition IVD(S_1,...,S_k); exchanges XML Queries & Results with the wrappers
• Wrappers: one per source, each exporting an XML View
• Sources: S_1, S_2, ..., S_k
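The layering in this architecture can be sketched in a few lines of Python. This is only an illustration, written in Python rather than an XML query language: the names `Wrapper`, `Mediator`, `union_ivd`, and `cheapest`, and the two toy shops, are assumptions and do not come from the talk.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class Wrapper:
    """Exports one source S_i as a uniform view (here: a list of dicts)."""

    def __init__(self, name: str, rows: List[Record]):
        self.name = name
        self._rows = rows

    def view(self) -> List[Record]:
        # A real wrapper would translate the native format into XML;
        # this sketch only tags each row with its source name.
        return [dict(r, source=self.name) for r in self._rows]


class Mediator:
    """Holds the integrated view definition IVD(S_1,...,S_k)."""

    def __init__(self, wrappers: List[Wrapper],
                 ivd: Callable[[List[List[Record]]], List[Record]]):
        self.wrappers = wrappers
        self.ivd = ivd

    def integrated_view(self) -> List[Record]:
        return self.ivd([w.view() for w in self.wrappers])

    def answer(self, query: Callable[[List[Record]], List[Record]]) -> List[Record]:
        # Q o V(S_1,...,S_k): evaluate the user query against the view.
        return query(self.integrated_view())


def union_ivd(views: List[List[Record]]) -> List[Record]:
    """Hypothetical IVD: the integrated view is the union of all source views."""
    return [row for view in views for row in view]


def cheapest(rows: List[Record]) -> List[Record]:
    """Hypothetical user query: cheapest offer including shipping."""
    return sorted(rows, key=lambda r: r["price"] + r["shipping"])[:1]


s1 = Wrapper("shopA", [{"title": "Tractatus", "price": 12.0, "shipping": 4.0}])
s2 = Wrapper("shopB", [{"title": "Tractatus", "price": 14.0, "shipping": 0.0}])
mediator = Mediator([s1, s2], union_ivd)
best = mediator.answer(cheapest)
```

This sketch materializes the whole view before applying the query. A mediator as described on the slide would instead send XML queries to the wrappers and combine the returned results.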
XMAS: XML Matching And Structuring language
CONSTRUCT <books>
            <book>
              $a1
              $t
              <pubs>
                $p { $p }
              </pubs>
            </book> { $a1, $t }
          </books>
WHERE <books.book>
        $a1 : <author />
        $t : <title />
      </> IN WRAP("amazon.com")
  AND <authors.author>
        $a2 : <author />
        <pubs> $p : <pub/> </>
      </> IN WRAP("www...DBLP…")
  AND value( $a1 ) = value( $a2 )
XMAS Integrated View Definition: “Find publications from amazon.com and DBLP, join on author, group by authors and title”
XMAS Algebra
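To make the intended result of this view concrete, here is a rough Python sketch of its join and grouping. The two input lists stand in for the wrapper outputs and are invented sample data. The function mirrors only the join condition and the `{ $a1, $t }` grouping, not XMAS evaluation itself.

```python
from collections import defaultdict

# Hypothetical stand-in for WRAP("amazon.com"): one dict per <book>.
amazon_books = [
    {"author": "A. Author", "title": "Book One"},
    {"author": "B. Writer", "title": "Book Two"},
]

# Hypothetical stand-in for the DBLP wrapper: one dict per <author>.
dblp_authors = [
    {"author": "A. Author", "pubs": ["Paper X", "Paper Y"]},
    {"author": "C. Other", "pubs": ["Paper Z"]},
]


def integrated_view(books, authors):
    """Join on author value, then group publications by (author, title)."""
    groups = defaultdict(list)
    for b in books:
        for a in authors:
            if b["author"] == a["author"]:          # value($a1) = value($a2)
                key = (b["author"], b["title"])     # grouping { $a1, $t }
                groups[key].extend(a["pubs"])       # collected $p { $p }
    return [{"author": au, "title": t, "pubs": pubs}
            for (au, t), pubs in groups.items()]


view = integrated_view(amazon_books, dblp_authors)
```

Only authors present in both sources appear in the view, which is the usual inner-join reading of the `AND` between the two patterns.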
PART II: Model-Based Mediation
What’s the Problem with XML & Complex Multiple-Worlds?
• XML is Syntax
– DTDs talk about element nesting
– XML Schema schemas give you data types
– need anything else? => write comments!
• Domain Semantics is complex:
– implicit assumptions, hidden semantics
=> sources seem unrelated to the non-expert
• Need Structure and Semantics beyond XML trees!
– employ richer OO models
– make domain semantics and “glue knowledge” explicit
– use ontologies to fix terminology and conceptualization
– avoid ambiguities by using formal semantics
XML-Based vs. Model-Based Mediation
• XML-Based Mediation
– Integrated-DTD := XML-QL(Src1-DTD,...)
– no domain constraints
– structural constraints (DTDs): Parent, Child, Sibling, ...; e.g., A = (B*|C),D and B = ...
– XML models: XML elements over raw data
• Model-Based Mediation
– CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …}
– CM-QL ~ {F-Logic, DAML+OIL, …}
– Integrated-CM := CM-QL(Src1-CM,...)
– glue maps: DMs, PMs
– logical domain constraints (IF ... THEN ... rules)
– conceptual models: classes (C1, C2, C3, ...), relations (R, ...), is-a, has-a, ... over (XML) objects and raw data
What’s the Glue? What’s in a Link?
Figure: a link between X and Y.
• Syntactic Joins (equality)
– (X,Y) := X.SSN = Y.SSN
– (X,Y) := X.UMLS-ID = Y.UID
• “Speciality” Joins (similarity)
– (X,Y,Score) := BLAST(X,Y,Score)
• Semantic/Rule-Based Joins
– (X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S>0.8 (homology, lub)
– (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y. (rule-based)
e.g., X=-secretase, B=beta amyloid, Y=Alzheimer’s disease
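The three kinds of link can be contrasted in a short Python sketch. Everything here is illustrative. `similarity` is a plain string-similarity stand-in for BLAST (it uses `difflib.SequenceMatcher` from the standard library, not a sequence-alignment tool). The `isa` lookup simplifies the least-upper-bound idea to "same class". All identifiers and sample data are assumptions.

```python
from difflib import SequenceMatcher


def syntactic_join(xs, ys, key_x, key_y):
    """(X,Y) := X.key_x = Y.key_y  (plain equality join)."""
    return [(x, y) for x in xs for y in ys if x[key_x] == y[key_y]]


def similarity(a: str, b: str) -> float:
    """Stand-in for BLAST: a string similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def semantic_join(xs, ys, isa, threshold=0.8):
    """(X,Y,C) := X isa C, Y isa C, similarity(X,Y) > threshold."""
    out = []
    for x in xs:
        for y in ys:
            same_class = isa[x["id"]] == isa[y["id"]]
            if same_class and similarity(x["seq"], y["seq"]) > threshold:
                out.append((x["id"], y["id"], isa[x["id"]]))
    return out


def rule_join(produces, increased_in):
    """(X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y."""
    return [(x, y, ["produces", b, "increased_in"])
            for (x, b) in produces
            for (b2, y) in increased_in
            if b == b2]


people = [{"name": "Ann", "SSN": "111"}]
records = [{"SSN": "111", "city": "SD"}, {"SSN": "222", "city": "LA"}]
equal_links = syntactic_join(people, records, "SSN", "SSN")

xs = [{"id": "p1", "seq": "ACGTACGTAC"}]
ys = [{"id": "p2", "seq": "ACGTACGTAA"}, {"id": "p3", "seq": "TTTTTTTTTT"}]
isa = {"p1": "calcium_sensor", "p2": "calcium_sensor", "p3": "calcium_sensor"}
homology_links = semantic_join(xs, ys, isa)

rule_links = rule_join([("secretase", "beta amyloid")],
                       [("beta amyloid", "Alzheimer's disease")])
```

The rule-based link carries its justification (the intermediate B and the two roles) as part of the answer tuple, which is what distinguishes it from the purely syntactic join.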
Model-Based Mediation Methodology ...
• Lift Sources to export CMs:
CM(S) = OM(S) + KB(S) + CON(S)
• Object Model OM(S):
– complex objects (frames), class hierarchy, OO constraints
• Knowledge Base KB(S):
– explicit representation of (“hidden”) source semantics
– logic rules over OM(S)
• Contextualization CON(S):
– situate OM(S) data using “glue maps” (GMs):
  – domain maps DMs (ontology) = terminological knowledge: concepts + roles
  – process maps PMs = “procedural knowledge”: states + transitions
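One way to picture the decomposition CM(S) = OM(S) + KB(S) + CON(S) is as three small data structures combined by a lifting step. The sketch below is an assumption-laden illustration in Python: the class names, the `lifted_objects` method, the sample measurement, and its unit are all hypothetical. Actual rules would be logic rules over OM(S), not Python callables.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ObjectModel:
    """OM(S): class hierarchy plus the source's objects."""
    classes: Dict[str, str]          # class name -> superclass ("" for roots)
    objects: List[dict]


@dataclass
class KnowledgeBase:
    """KB(S): rules that make hidden source semantics explicit."""
    rules: List[Callable[[dict], dict]] = field(default_factory=list)


@dataclass
class Contextualization:
    """CON(S): anchors source classes at domain map concepts."""
    anchors: Dict[str, str] = field(default_factory=dict)


@dataclass
class ConceptualModel:
    """CM(S) = OM(S) + KB(S) + CON(S)."""
    om: ObjectModel
    kb: KnowledgeBase
    con: Contextualization

    def lifted_objects(self) -> List[dict]:
        out = []
        for obj in self.om.objects:
            obj = dict(obj)                       # do not mutate the source
            for rule in self.kb.rules:
                obj = rule(obj)                   # apply KB(S)
            obj["concept"] = self.con.anchors.get(obj["class"])  # apply CON(S)
            out.append(obj)
        return out


om = ObjectModel(classes={"spine_measurement": ""},
                 objects=[{"class": "spine_measurement", "length": 1.2}])
# Hypothetical hidden assumption of the source: lengths are in micrometers.
kb = KnowledgeBase(rules=[lambda o: {**o, "unit": "micrometer"}])
con = Contextualization(anchors={"spine_measurement": "Spine"})
lifted = ConceptualModel(om, kb, con).lifted_objects()
```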
... Model-Based Mediation Methodology
• Integrated View Definition (IVD)
– declarative (logic) rules with object-oriented features
– defined over CM(S), domain maps, process maps
– needs “mediation engineers” = domain + KRDB experts
• Knowledge-Based Querying and Browsing (runtime):
– mediator composes the user query Q with the IVD
... rewrites (Q o IVD), sends subqueries to sources
... post-processes returned results (e.g., situate in context)
Model-Based Mediator Architecture
• USER/Client: queries the CM (Integrated View)
• Mediator Engine: Integrated View Definition IVD; LP rule proc. (XSB Engine), FL rule proc., Graph proc.
• “Glue” Maps (GMs): Domain Maps (DMs) and Process Maps (PMs), giving the semantic context CON(S)
• GCM (label shown three times in the figure)
• CM Queries & Results (exchanged in XML)
• CM-Wrappers (XML-Wrappers): export CM S1, CM S2, CM S3, where CM(S) = OM(S)+KB(S)+CON(S)
• Sources: S1, S2, S3
• First results & Demos: KIND prototype, formal DM semantics, PMs [SSDBM00] [VLDB00] [ICDE01] [NIH-HB01]
Domain Maps (Ontologies) as Glue Knowledge Sources
• Domain Map = Ontology
– representation of terminological knowledge
• Use in Model-Based Mediation
– (derived) concepts as “drop points”, “anchor points”, “context” for source classes
– compile-time use: view definition, subsumption, classification, ...
– runtime use: querying/deduction, path queries, ...
• Formalisms:
– Semantic nets, Thesauri, Frame-logic, Description logics, ...
Ontologies
• So what is an Ontology?
– definition of things that are relevant to your application
– representation of terminological knowledge (“TBox”)
– explicit specification of a conceptualization
– concept hierarchy (“is-a”)
– further semantic relationships between concepts
– abstractions of relational schemas, (E)ER, UML classes, XML Schemas
• Examples:
– NCMIR ANATOM
– GO (Gene Ontology)
– UMLS (Unified Medical Language System)
– CYC
Formalism for Ontologies: Description Logic
• DL definition of “Happy Father”
(Example from Ian Horrocks, U Manchester, UK)
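The formula on this slide was an image and did not survive in the transcript. Reading it back from the F-logic rules on the later slide “Description Logic Statements as F-logic Rules”, a plausible reconstruction is the following. The concept and role names are taken from those rules; the exact notation on the original slide is an assumption.

```latex
\mathit{HappyFather} \doteq \mathit{Man}
  \sqcap \exists\, \mathit{child}.\mathit{Blue}
  \sqcap \exists\, \mathit{child}.\mathit{Green}
  \sqcap \forall\, \mathit{child}.(\mathit{Rich} \sqcup \mathit{Happy})
```

Read aloud: a happy father is a man with at least one blue child and at least one green child, all of whose children are rich or happy.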
Description Logics
• Terminological Knowledge (TBox)
– Concept Definition (naming of concepts):
– Axiom (constraining of concepts):
=> a mediator’s “glue knowledge source”
• Assertional Knowledge (ABox)
– the marked neuron in image 27
=> the concrete instances/individuals of the concepts/classes that your sources export
Description Logic Statements as F-logic Rules
• In F-logic:
X : happyFather :- X : man, (X..child) : blue, (X..child) : green,
    not ( (X..child) : poorunhappyChild ).
C : poorunhappyChild :- not C : rich, not C : happy.
• Alternatively: DLs as fragments of First-Order Logic
Querying vs. Reasoning
• Querying:
– given a DB instance I (= logic interpretation), evaluate a query
expression (e.g. SQL, FO formula, Prolog program, ...)
– boolean query: check if I |= φ (i.e., if I is a model of φ)
– (ternary) query: { (X, Y, Z) | I |= φ(X,Y,Z) }
=> check happyFathers in a given database
• Reasoning:
– check if I |= φ implies I |= ψ for all databases I,
– i.e., if φ => ψ
– undecidable for FO, F-logic, etc.
– Description Logics are decidable fragments
  – concept subsumption, concept hierarchy, classification
  – semantic tableaux, resolution, specialized algorithms
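The querying side, "check happyFathers in a given database", can be sketched directly. The Python function below evaluates the happyFather rules over one concrete instance. The instance (three men and their children) is invented, and negation is read as negation-as-failure over that instance, which is an assumption of this sketch. The reasoning side, deciding subsumption over all databases, is a different problem and is not attempted here.

```python
from typing import Dict, List, Set


def happy_fathers(men: Set[str],
                  children: Dict[str, List[str]],
                  props: Dict[str, Set[str]]) -> Set[str]:
    """Evaluate the happyFather rule over one database instance I."""
    result = set()
    for x in men:
        kids = children.get(x, [])
        has_blue = any("blue" in props.get(c, set()) for c in kids)
        has_green = any("green" in props.get(c, set()) for c in kids)
        # poorunhappyChild: neither rich nor happy (closed-world reading).
        no_poor_unhappy = all(props.get(c, set()) & {"rich", "happy"}
                              for c in kids)
        if has_blue and has_green and no_poor_unhappy:
            result.add(x)
    return result


# Hypothetical instance I.
men = {"adam", "bob", "carl"}
children = {"adam": ["c1", "c2"], "bob": ["c3"]}
props = {
    "c1": {"blue", "rich"},
    "c2": {"green", "happy"},
    "c3": {"blue", "green"},      # neither rich nor happy
}
answers = happy_fathers(men, children, props)
```

Here `bob` fails because his only child is a poor unhappy child, and `carl` fails because he has no blue or green child at all.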
What’s in an Answer?
(What’s in a Link? revisited)
Figure: a link between X and Y.
• Semantic/Rule-Based Joins
– (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y. (rule-based)
e.g., X=-secretase, B=beta amyloid, Y=Alzheimer’s disease
• What is the Erdős number of person P?
– 3
• Really? Why?
– authority based: <VIP> said so
– faith based: don’t know but believe firmly
– query statement Q = ... derived it from DB
– query Q = ... derived it from DB and KB using derivation D
=> logic-based systems often “come with explanations”
=> “computations as proofs”
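A small sketch of an answer that comes with its derivation: the function below computes an Erdős number by breadth-first search and returns the co-authorship chain alongside the number, so the chain serves as the explanation. The co-author graph and the names in it are invented, and `erdos_number` is a hypothetical helper, not something from the talk.

```python
from collections import deque


def erdos_number(coauthors, person, root="Erdos"):
    """Return (number, chain), where chain is the co-authorship path
    from `person` back to `root` that justifies the number."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        cur = queue.popleft()
        if cur == person:
            chain = []
            while cur is not None:
                chain.append(cur)
                cur = parent[cur]
            return len(chain) - 1, chain
        for nxt in sorted(coauthors.get(cur, ())):
            if nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None, []          # no derivation exists


# Hypothetical co-authorship facts (the "DB").
graph = {
    "Erdos": {"A"},
    "A": {"Erdos", "B"},
    "B": {"A", "P"},
    "P": {"B"},
}
number, derivation = erdos_number(graph, "P")
```

With these facts the answer is 3, and the returned chain P, B, A, Erdos is the derivation that backs it up.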
Formalizing Glue Knowledge:
Domain Map for SYNAPSE and NCMIR
• Domain Map = labeled graph with concepts ("classes") and roles ("associations")
• additional semantics: expressed as logic rules (F-logic)
Domain Expert Knowledge:
Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).
Figure panels: Domain Expert Knowledge; Domain Map (DM); DM in Description Logic.
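The "labeled graph" reading of a domain map, and the kind of path query run over it, can be sketched as follows. The edge list is one possible encoding of the domain expert knowledge above. The concept and role names are my own transcription, not the actual SYNAPSE/NCMIR domain map, and `role_paths` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]   # (concept, role, concept)

# Assumed encoding of the expert statements as labeled edges.
DOMAIN_MAP: List[Edge] = [
    ("Purkinje_cell", "has", "Dendrite"),
    ("Pyramidal_cell", "has", "Dendrite"),
    ("Dendrite", "has", "Branch"),
    ("Branch", "contains", "Spine"),
    ("Spine", "isa", "Ion_regulating_component"),
    ("Spine", "has", "Ion_binding_protein"),
    ("Ion_binding_protein", "controls", "Ion_activity"),
    ("Ion_regulating_component", "affects", "Ion_activity"),
    ("Neurotransmission", "involves", "Ion_activity"),
]


def role_paths(edges: List[Edge], start: str, goal: str) -> List[List[Edge]]:
    """Enumerate all cycle-free role paths from `start` to `goal`."""
    out_edges: Dict[str, List[Tuple[str, str]]] = {}
    for src, role, dst in edges:
        out_edges.setdefault(src, []).append((role, dst))

    paths: List[List[Edge]] = []

    def walk(node, visited, trail):
        if node == goal:
            paths.append(list(trail))
            return
        for role, dst in out_edges.get(node, []):
            if dst not in visited:
                trail.append((node, role, dst))
                walk(dst, visited | {dst}, trail)
                trail.pop()

    walk(start, {start}, [])
    return paths


paths = role_paths(DOMAIN_MAP, "Purkinje_cell", "Ion_activity")
first_roles = [role for (_, role, _) in paths[0]]
```

Each returned path is a chain of labeled edges, so it doubles as an explanation of how two concepts are related in the map.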
Source Contextualization & DM Refinement
In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map...
=> sources can register new concepts at the mediator ...
Example:
ANATOM Domain Map
Browsing Registered Data with Domain Maps
Query Processing “Demo”
Integrated View Definition
DERIVE
  protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
IF
  I:protein_label_image[proteins ->> {Protein}; organism -> Organism;
      anatomical_structures ->> {AS:anatomical_structure[name->Anatom]}] ,   % from PROLAB
  NAE:neuro_anatomic_entity[name->Anatom;                                   % from ANATOM
      located_in->>{Brain_region}],
  AS..segments..features[name->Feature_name; value->Value].
• provided by the domain expert and mediation engineer
• deductive OO language (here: F-logic)
Figure annotations: Contextualization CON(Result) wrt. ANATOM; query results in context.
Some Open Database & Knowledge Representation Issues
• Mix of Query Processing and Reasoning
– FaCT description logic reasoner for DMs?
– or reconciliation of DMs via argumentation frameworks (“games”) using well-founded and stable models of logic programs [ICDT97,PODS97,TCS00]
• Modeling “Process Knowledge” => Process Maps
– formal semantics? (dynamic/temporal/Kripke models?)
– executable semantics? (Statelog?)
• Graph Queries over DMs and PMs
– expressible in F-logic [InfSystem98]
– scalability? (UMLS Domain Map has millions of entries)
• ...
Process Maps with Abstractions and Elaborations:
=> From Terminological to Procedural Glue
• nodes ~ states
• edges ~ processes, transitions
• blue/red edges: processes in Src1/Src2
• general form of edges:
how about these?
Summary: Mediation Scenarios & Techniques
• Federated Databases
– One-World
– Common Schema
– SQL, rules
– Schema Transformations
– Syntactic Joins
– DB expert
• XML-Based Mediation
– One-/Multiple-Worlds
– Mediated Schema
– XML query languages
– Syntax-Aware Mappings
– Syntactic Joins
– DB expert
• Model-Based Mediation
– Complex Multiple-Worlds
– Common Glue Maps
– DOOD query languages
– Semantics-Aware Mappings
– “Semantic” Joins via Glue Maps
– KRDB + domain expert
Models and Formal Approaches:
Relating Theory to the World
©2000 by John F. Sowa, http://www.jfsowa.com/krbook/, Knowledge Representation: Logical, Philosophical,
and Computational Foundations, Brooks/Cole, Pacific Grove, CA.
All models are wrong, but some are useful!
Questions?
Queries?
Some References
• XML-Based and Model-Based Mediation:
– MBM: Model-Based Mediation with Domain Maps, B. Ludäscher, A. Gupta, M. E. Martone, 17th Intl. Conference on Data Engineering (ICDE), Heidelberg, Germany, IEEE Computer Society, 2001.
– VXD/Lazy Mediators: Navigation-Driven Evaluation of Virtual Mediated Views, B. Ludäscher, Y. Papakonstantinou, P. Velikhov, Intl. Conference on Extending Database Technology (EDBT), Konstanz, Germany, LNCS 1777, Springer, 2000.
– DOOD: Managing Semistructured Data with FLORID: A Deductive Object-Oriented Perspective, B. Ludäscher, R. Himmeröder, G. Lausen, W. May, C. Schlepphorst, Information Systems, 23(8), Special Issue on Semistructured Data, 1998.
• STATELOG (Logic Programming with States)
– On Active Deductive Databases: The Statelog Approach, G. Lausen, B. Ludäscher, W. May. In Transactions and Change in Logic Databases, Hendrik Decker, Burkhard Freitag, Michael Kifer, and Andrei Voronkov, editors. LNCS 1472, Springer, 1998.
• Argumentation Frameworks as Games
– Games and Total DatalogNeg Queries, J. Flum, M. Kubierschky, B. Ludäscher, Theoretical Computer Science, 239(2), pp. 257-276, Elsevier, 2000.
– Referential Actions as Logical Rules, B. Ludäscher, W. May, G. Lausen, Proc. 16th ACM Symposium on Principles of Database Systems (PODS'97), Tucson, Arizona, ACM Press, 1997.