Data and Metadata Alignment
ICDE 2008 Tutorial
Lise Getoor, University of Maryland
Renée J. Miller, University of Toronto
From Webster….
Main Entry: align·ment. Variant(s): also aline·ment \ə-ˈlīn-mənt\. Function: noun. Date: 1790.
1: the act of aligning or state of being aligned; especially: the proper positioning or state of adjustment of parts (as of a mechanical or electronic device) in relation to each other
Information alignment: the process of finding,
modeling and using the correspondences or
connections that place information artifacts in relation
to each other
Outline
 1. Introduction
 What is Data / Metadata Alignment?
 2. Data Alignment
 Entity Resolution
 3. Metadata Alignment
 Schema Mapping
 4. Data & Metadata Alignment
 Ontology Alignment
 5. Conclusion
Alignment Example: Metadata
Web Scraping (S):
  BibEntry(BibId, Title, Author, Publisher, Booktitle, Editor, Volume, Number, Keywords)

Curated Publication Database (T):
  Publication(PubID, DateAdded, Source)
  Writes(PubID, Author)
  Keywords(PubID, Keyword)
  JournalArticle(PubID, Title, Journal, Year, Volume, Number, Pages)
  TechnicalReport(PubID, Title, Institution, Number)
  Conference(PubID, Title, Conference, Year)

[Diagram: correspondence lines connect BibEntry attributes in S to attributes of the target tables in T.]
Lines are not enough…
 Suppose we want to translate data from Web
extraction source S into the relational schema T
 Arrows do not give sufficient information to be able to
shred a bibEntry into a faithful relational instance
(maintaining the semantics of the data!)
 What if S and T contain overlapping data which is
represented differently?
S: BibEntry(bib1, J. Smith, "Alignment, Solved!")
T: Writes(908765, J. Smith); JournalArticle(908765, "Aligment: solved", ...)
(The differing titles are deliberate: the same publication is represented differently in S and T.)
Alignment is Fundamental
 Result of alignment is a declarative mapping
 Representing semantic relationship between data
(entities) or metadata (schemas or ontologies)
 Mappings are basic building block for model
management
 Schema Integration requires models and mappings
 Well-studied operators on mappings: compose, invert
 Uses of mappings
 Data exchange, data integration, peer data sharing, and
more
(Virtual) Integration Architecture
[Diagram: a user query is posed against a mediated (integrated or target) schema. The mediator consists of a reformulation engine, an optimization engine, and an execution engine, and accesses a city database, a county database, a public data server, and an outside website through wrappers, guided by metadata.]
Common Case: Data Publishing
[Diagram: a legacy (relational) schema is connected by a mapping to a standard (XML) schema; data instances "conform to" their respective schemas.]

Data Publishing: map a legacy (often relational) source schema into a standardized (often XML) target suitable for data sharing or publishing on the web
• Need to create a target instance, not (necessarily) answer queries
Other Applications
 Variations appear in numerous scenarios
 Data Warehousing
  Map a new data source into the warehouse schema
  For performance, source data is transferred to the target for complex analysis
 Enterprise Integration
  Map databases of an acquired company into existing operational databases
  May be too expensive to build new software for runtime coupling of systems
Data Exchange
[Diagram: a mapping relates source schema S to target schema T; data instances "conform to" their respective schemas.]

Data Exchange: given a mapping and a source instance, create an instance of the target that reflects the source data as accurately as possible
[Popa et al. VLDB02, Fagin et al. ICDT03]
Talk Scope
 Our focus
 Creating mappings
 Will not be covering extensively
 Representational power of different mapping expression
languages
 Mapping operators (compose, invert)
 Mapping maintenance
 Mapping use in different data management tasks
Talk Goals
 Bring together relevant research on finding data and
metadata alignments from the database and machine
learning communities
 Seek to understand commonalities and differences in the
underlying inference problems that drive research in this
area
 Provide a common language to discuss the problems of
data and metadata alignment.
Caveats
 Not an exhaustive survey; our goal is to give examples of key issues
 We will make available an extended bibliography
 If you have references to add, please email them
(bibtex w/ url) to us, [email protected] and
[email protected]
Acknowledgments
 Indrajit Bhattacharya, Mustafa Bilgic, Matthias
Broecheler, Fei Chiang, Oktie Hassanzadeh, Hyunmo
Kang, Louis Licamele, Preetam Maloor, Walaa El-Din
Moustafa, Patricia Rodriguez-Gianolli, Mo Sadoghi,
Hossam Sharara, Octavian Udrea and many more
 NSF, KDD, NSERC, CITO, IBM
 ICDE !
Roadmap
 Introduction
 Attribute-based Methods
 Relational Methods
 Interactive Methods
Basic Problem(s)
 Detecting and eliminating duplicate records
 Integrating and matching data across sources
 Many names: deduplication, merge/purge problem, entity
disambiguation, duplicate detection, record matching,
identity uncertainty, instance identification, object
identification, co-reference resolution, reference
reconciliation, record linkage, database hardening, fuzzy
matching, entity resolution…..
Example: Citation Data
L. Breiman, L. Friedman, and P. Stone, (1984).
Classification and Regression. Wadsworth, Belmont,
CA.
Leo Breiman, Jerome H. Friedman, Richard A. Olshen,
and Charles J. Stone. Classification and Regression
Trees. Wadsworth and Brooks/Cole, 1984.
R. Agrawal, R. Srikant. Fast algorithms for mining
association rules in large databases. In VLDB-94,
1994.
Rakesh Agrawal and Ramakrishnan Srikant. Fast
Algorithms for Mining Association Rules In Proc. Of the
20th Int'l Conference on Very Large Databases,
Santiago, Chile, September 1994.
Example: Customer Data
 Name: Preetam Maloor
Address: 18055 Cottage Garden Dr, Germantown, MD
 Name: Maloor, P
Marital Status: Single
Occupation: Research Engineer
 Name: Preetam A. Maloor
Type of Housing: Rent
Location: Germantown, MD
Other Examples
 Natural language processing
 Noun/Pronoun Resolution - John gave Edward the book.
He then stood up and called to John to come back into
the room.
 Computer vision
 Object Correspondence - Matching objects across video
frames
 Biology
 E.g., proteins, genes, etc.
 Many more: social networks, personal information
management, privacy (guaranteeing alignments are
not possible)
In the News…
 Companies such as SRD (Las Vegas, NV),
acquired by IBM in 2005, offer solutions to
casinos for detecting scams based on simple
entity resolution techniques using a system
called NORA (Non-Obvious Relationship Awareness). Founder Jeff Jonas is now an IBM
Distinguished Engineer and Chief Scientist
 Spock Awards $50,000 Grand Prize to
Spock Challenge Winners, from
Germany's Bauhaus University Weimar,
March 2008
Origin
First explored in the late 1950s and 1960s
 Newcombe, Kennedy, Axford, & James. Automatic
linkage of vital records. Science 130(3381), 954–959
(1959)
 Fellegi & Sunter. A theory for record linkage. Journal of
the American Statistics Association 64(328), 1183–1210
(1969).
 Seminal papers studied record linkage in the context
of matching census population records
Currently, explosion of research
 DB: Benjelloun et al.; Sarawagi & Bhamidipaty;
Ananthakrishna, Chaudhuri, Ganti, Motwani; Doan et
al.; Dong, Halevy, Madhavan; Hernández & Stolfo;
Kalashnikov, Mehrotra & Chen; lots more…
 ML: Bhattacharya & Getoor; Bilenko & Mooney;
Cohen; Singla & Domingos; McCallum et al.; Pasula,
Russell & Milch; Monge & Elkan; Winkler; Tejada,
Knoblock, and Minton; lots more…
 Here, we try to summarize some of the main ideas…
The Entity Resolution Problem
[Diagram: real-world entities James Smith, John Smith, and Jonathan Smith, and the name references observed for them: "John Smith", "Jim Smith", "J Smith", "James Smith", "Jon Smith", "Jonthan Smith". The same string (e.g., "J Smith") can refer to more than one entity, and one entity appears under many strings.]

Issues:
1. Identification
2. Disambiguation
Attribute-based Entity Resolution
Pair-wise classification: assign a match score to each pair of references

  ("Jim Smith", "James Smith")      0.8
  ("John Smith", "James Smith")     0.4
  ("Jon Smith", "James Smith")      0.3
  ("Jonthan Smith", "James Smith")  0.2
  ("J Smith", "James Smith")        ?
Attribute-based Similarity
 Static similarity computation
 Individual record fields are often stored as strings.
 Key component: string similarity measures
 Adaptive Similarity computation
 Learn weights for attributes
 More generally, can formulate as a classification problem
and apply standard machine learning algorithms
 Can also be formulated as a clustering problem (more on
this later)
String Similarity Overview
 Edit-based (character-based)
  Levenshtein Distance
  Jaro-Winkler
 Token-based
  Overlap
  Jaccard and Weighted Jaccard
  IR methods: Cosine w/ tf-idf
  Language Modeling: Hidden Markov Models
 Hybrid
  Generalized Edit Similarity (GES)
  SoftTFIDF

Comparison for name-matching: Cohen et al., IJCAI'03 workshop
Survey of duplicate detection: Elmagarmid, Ipeirotis, Verykios, TKDE'07
Edit-based Similarity
 tc(t, s): minimum cost of edit operations to transform t into s
 Edit operations: character insert, delete, and replace
 Levenshtein distance: unit cost for all operations

  sim_edit(t, s) = 1 − tc(t, s) / max(|t|, |s|)

  sim_edit("Microsoft", "Macrosft") = 1 − (2/9) ≈ 0.78
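For concreteness, here is a minimal Python sketch (not from the tutorial) of unit-cost Levenshtein distance and the normalized similarity above; it reproduces the "Microsoft"/"Macrosft" example.

def levenshtein(t, s):
    # Dynamic-programming edit distance with unit-cost insert/delete/replace.
    prev = list(range(len(s) + 1))
    for i, tch in enumerate(t, 1):
        curr = [i]
        for j, sch in enumerate(s, 1):
            curr.append(min(prev[j] + 1,                  # delete
                            curr[j - 1] + 1,              # insert
                            prev[j - 1] + (tch != sch)))  # replace
        prev = curr
    return prev[-1]

def sim_edit(t, s):
    # sim_edit(t, s) = 1 - tc(t, s) / max(|t|, |s|)
    return 1 - levenshtein(t, s) / max(len(t), len(s))

print(round(sim_edit("Microsoft", "Macrosft"), 2))  # 0.78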
Other Edit-based Measures
 Jaro: a more sophisticated string similarity score that looks at the similarity within a certain neighborhood
 Jaro-Winkler: based on Jaro, but weights matches at the beginning of the string more highly
 Other variants: Smith-Waterman and Monge-Elkan
Token-based
 Token/word order is unimportant
 Convert the strings s and t to token sets or multisets (where each token is a word) and consider similarity metrics on these sets
 Jaccard Similarity: |S ∩ T| / |S ∪ T|
 To weight agreement on rare terms more heavily than agreement on more common terms, use term frequency-inverse document frequency (TF-IDF) weights
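A small illustrative sketch of token-set Jaccard similarity (a TF-IDF-weighted variant would replace the uniform counts with term weights):

def jaccard(s, t):
    # Tokenize on whitespace; token order is ignored (set semantics).
    S, T = set(s.lower().split()), set(t.lower().split())
    return len(S & T) / len(S | T) if S | T else 0.0

print(round(jaccard("Fast Algorithms for Mining Association Rules",
                    "Fast algorithms for mining association rules in large databases"), 2))  # 0.67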
Limitations of Static Methods
 Monge and Elkan, "The Field Matching Problem: Algorithms and Applications," Knowledge Discovery and Data Mining, 1996
 Issue: fundamental limitation of static similarity functions
  By nature they can include no special knowledge of the specific problem at hand
  Even methods that have been tuned and tested on many previous matching problems can perform poorly on new and different matching problems
 Make methods adaptive by formulating matching as a supervised classification problem
Supervised Learning
 Represent record pairs as feature vectors, using as features
the distances between corresponding fields.
 Train a binary support vector machine classifier using these
feature vectors
 Challenge: being able to provide a covering and challenging
set of training pairs that bring out the subtlety of the
deduplication function.
 Bilenko, Mooney, Cohen, Ravikumar, Fienberg, "Adaptive Name
Matching in Information Integration," IEEE Intelligent Systems,
vol. 18, no. 5, pp. 16-23, Sept/Oct, 2003
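A minimal sketch of this pairwise-classification formulation, with hypothetical records, fields, and per-field similarity functions; the SVM stands in for whatever classifier is used in practice:

from sklearn.svm import SVC

# Each record pair becomes a feature vector of per-field similarities;
# labels mark which training pairs are true duplicates.
def pair_features(r1, r2, field_sims):
    return [sim(r1[f], r2[f]) for f, sim in field_sims.items()]

# Hypothetical per-field similarity functions (here just exact match).
field_sims = {"name": lambda a, b: float(a == b),
              "city": lambda a, b: float(a == b)}

train = [({"name": "J Smith", "city": "Boston"},
          {"name": "J Smith", "city": "Boston"}, 1),
         ({"name": "J Smith", "city": "Boston"},
          {"name": "A Jones", "city": "Austin"}, 0)]

X = [pair_features(r1, r2, field_sims) for r1, r2, _ in train]
y = [label for _, _, label in train]

clf = SVC().fit(X, y)             # train the pairwise duplicate classifier
print(clf.predict([[1.0, 0.0]]))  # classify a new record pair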
Active Learning
 Learning-based deduplication system that uses a novel
method of interactively discovering challenging training
pairs using active learning
 Unlike an ordinary learner that trains using a static training
set, an active learner actively picks subsets of instances
which when labeled will provide the highest information
gain to the learner.
 S. Sarawagi and A. Bhamidipaty, “Interactive deduplication using active
learning,” Proceedings of the eighth ACM SIGKDD international
conference on Knowledge discovery and data mining, 2002.
Issues: Attribute-based ER
Pair-wise classification scores:

  ("Jim Smith", "James Smith")      0.8
  ("John Smith", "James Smith")     0.1
  ("Jon Smith", "James Smith")      0.7
  ("Jonthan Smith", "James Smith")  0.05
  ("J Smith", "James Smith")        ?

1. Choosing threshold: precision/recall tradeoff
2. Inability to disambiguate
3. Perform transitive closure?
Common Issues: Efficiency
 Naïve pairwise entity resolution is computationally expensive: O(N²) comparisons
 Blocking [Hernandez and Stolfo, 1995]
 Sliding window technique [Hernandez and Stolfo, 1995]
 Canopies [McCallum et al. 2003]
 Baxter, R., Christen, P., Churches, T.: A comparison of fast blocking
methods for record linkage. In: Proceedings of ACM SIGKDD’03
Workshop on Data Cleaning, Record Linkage, and Object
Consolidation , 2003.
 Elmagarmid, Ipeirotis, Verykios. Duplicate Record Detection: A
Survey, TKDE, 2007.
 Interesting connections to efficient join algorithms
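A toy sketch of blocking (illustrative only): candidate pairs are generated only within blocks that share a cheap key, rather than over all O(N²) pairs.

from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, key):
    # Group records by a cheap blocking key, then compare only within blocks.
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    for group in blocks.values():
        yield from combinations(group, 2)

names = ["James Smith", "Jim Smith", "Jonathan Smith", "John Doe", "Jane Doe"]
# Hypothetical blocking key: surname.
for a, b in blocked_pairs(names, key=lambda n: n.split()[-1]):
    print(a, "<->", b)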
Common Issue: Evaluation
 Mostly evaluated as a pair-wise classification problem.
 Accuracy may not be the best metric
 datasets tend to be highly skewed; often less than 1% of
all pairs are duplicates
 Using precision over the duplicate prediction and recall
over the entire set of duplicates [Monge and Elkan, 1996]
 Cluster Purity – evaluate clusters generated compared to
the true clusters [Monge and Elkan, 1997]
 Other measures: F1, AUC, average precision, etc.
 Upcoming LREC workshop on evaluating entity resolution
algorithms, May 2008
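A small sketch of pairwise precision, recall, and F1 over predicted duplicate pairs (illustrative, with made-up record ids):

def pairwise_prf(predicted, truth):
    # predicted and truth are sets of frozenset({id1, id2}) duplicate pairs.
    tp = len(predicted & truth)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(truth) if truth else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

truth = {frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})}
predicted = {frozenset({1, 2}), frozenset({4, 5})}
print(pairwise_prf(predicted, truth))  # (0.5, 0.333..., 0.4)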
InfoVis Co-Author Network Fragment
[Figure: a fragment of the InfoVis co-author network, before and after entity resolution.]
Relational Entity Resolution
 References are not observed independently
 Links between references indicate relations between the entities
  Co-author relations for bibliographic data
  To, cc: lists for email
 Use relations to improve identification and disambiguation
Pasula et al. 03, Ananthakrishna et al. 02, Bhattacharya & Getoor 04,06,07,
McCallum & Wellner 04, Li, Morie & Roth 05, Culotta & McCallum 05,
Kalashnikov et al. 05, Chen, Li, & Doan 05, Singla & Domingos 05, Dong et
al. 05
Relational Identification
Very similar names; added evidence from shared co-authors

Relational Disambiguation
Very similar names but no shared collaborators

Relational Constraints
Co-authors are typically distinct

Collective Entity Resolution
One resolution provides evidence for another => joint resolution
Entity Resolution with Relations
 Naïve Relational Entity Resolution
 Also compare attributes of related references
 Two references have co-authors w/ similar names
 Collective Entity Resolution
 Use discovered entities of related references
 Entities cannot be identified independently
 Harder problem to solve
Relational ER Algorithms
 Relational Clustering (RC-ER)
 Bhattacharya & Getoor, DMKD’04, Wiley’06, DE
Bulletin’06,TKDD’07
 Generative Probabilistic Models (LDA-ER)
 Conditional Probabilistic Models (CRFs & MLNs)
 Experimental Comparison
P1: "JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines", C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: "Partitioning Mapping of Unstructured Meshes to Parallel Machine Topologies", C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
P3: "Dynamic Mesh Partitioning: A Unified Optimisation and Load-Balancing Algorithm", C. Walshaw, M. Cross, M. G. Everett
P4: "Code Generation for Machines with Multiregister Operations", Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman
P5: "Deterministic Parsing of Ambiguous Grammars", A. Aho, S. Johnson, J. Ullman
P6: "Compilers: Principles, Techniques, and Tools", A. Aho, R. Sethi, J. Ullman
Relational Clustering (RC-ER)
[Diagram, repeated over several animation steps: the author references of P1 (C. Walshaw, M. Cross, M. G. Everett, S. Johnson), P2 (C. Walshaw, M. Cross, M. Everett, S. Johnson, K. McManus), P4 (Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman), and P5 (A. Aho, S. Johnson, J. Ullman), with candidate reference clusters progressively merged.]
Cut-based Formulation of RC-ER
[Diagram: two candidate clusterings of the Everett, Johnson, and Aho references.
 Option 1: good separation of attributes, but many cluster-cluster relationships (Aho-Johnson1, Aho-Johnson2, Everett-Johnson1).
 Option 2: worse in terms of attributes, but fewer cluster-cluster relationships (Aho-Johnson1, Everett-Johnson2).]
Objective Function
 Minimize:

  Σ_{i,j} [ w_A · sim_A(c_i, c_j) + w_R · sim_R(c_i, c_j) ]

 where w_A is the weight for attributes, sim_A(c_i, c_j) the similarity of attributes, w_R the weight for relations, and sim_R(c_i, c_j) the similarity based on relational edges between c_i and c_j

 Greedy clustering algorithm: merge the cluster pair with the maximum reduction in the objective function

  Δ(c_i, c_j) = w_A · sim_A(c_i, c_j) + w_R · |N(c_i) ∩ N(c_j)|

 (attribute similarity plus the common cluster neighborhood)
Attribute Similarity
 Use best available measure for each attribute
 Name strings: Soft TF-IDF, Levenshtein, Jaro
 Textual Attributes: TF-IDF
 Aggregate to find similarity between clusters
 Single link, Average link, Complete link
 Cluster representative
Relational Similarity: Example 1
[Diagram: the clusters {A. Aho, Alfred V. Aho}, {S. Johnson, Stephen C. Johnson}, and {J. Ullman, Jefferey D. Ullman} all connect to both papers P4 and P5.]
All neighborhood clusters are shared: high relational similarity
Relational Similarity: Example 2
[Diagram: the Walshaw, Cross, Everett, and McManus clusters and one S. Johnson cluster connect to papers P1 and P2, while the Aho, Ullman, and Stephen C. Johnson clusters connect to papers P4 and P5; the two S. Johnson clusters have no neighbors in common.]
No neighborhood cluster is shared: no relational similarity
Comparing Cluster Neighborhoods
 Consider neighborhood as multi-set
 Different measures of set similarity
 Common Neighbors: Intersection size
 Jaccard’s Coefficient: Normalize by union size
 Adar Coefficient: Weighted set similarity
 Higher order similarity: Consider neighbors of
neighbors
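A sketch of these neighborhood measures on multisets (illustrative; the Adar-style weighting shown is one common variant, not necessarily the tutorial's exact formula):

from collections import Counter
import math

def common_neighbors(N1, N2):
    # Size of the multiset intersection.
    return sum((N1 & N2).values())

def jaccard_coefficient(N1, N2):
    union = sum((N1 | N2).values())
    return common_neighbors(N1, N2) / union if union else 0.0

def adar_coefficient(N1, N2, freq):
    # Weight shared neighbors inversely by their overall frequency.
    return sum(1.0 / math.log(freq[n]) for n in (N1 & N2) if freq[n] > 1)

N1 = Counter({"Aho": 2, "Ullman": 2})  # toy neighborhoods
N2 = Counter({"Aho": 1, "Sethi": 1})
print(common_neighbors(N1, N2), jaccard_coefficient(N1, N2))  # 1 0.2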
Relational Clustering Algorithm
1. Find similar references using 'blocking'
2. Bootstrap clusters using attributes and relations
3. Compute similarities for cluster pairs and insert into priority queue
4. Repeat until priority queue is empty
5. Find 'closest' cluster pair
6. Stop if similarity below threshold
7. Merge to create new cluster
8. Update similarity for 'related' clusters

 O(n k log n) algorithm w/ efficient implementation
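A compact sketch of steps 3-8 as a greedy agglomerative loop with a priority queue (illustrative; an efficient implementation updates only 'related' cluster pairs rather than recomputing against all clusters):

import heapq

def rc_er(clusters, sim, threshold):
    # Greedy merging: repeatedly pop the most similar cluster pair and merge
    # it, until the best remaining similarity falls below the threshold.
    pq = [(-sim(a, b), i, j) for i, a in enumerate(clusters)
          for j, b in enumerate(clusters) if i < j]
    heapq.heapify(pq)
    active = dict(enumerate(clusters))
    next_id = len(clusters)
    while pq:
        neg_sim, i, j = heapq.heappop(pq)
        if i not in active or j not in active:
            continue                    # stale entry for a merged cluster
        if -neg_sim < threshold:
            break                       # step 6: stop below threshold
        merged = active.pop(i) | active.pop(j)      # step 7: merge
        for k, c in active.items():                 # step 8: update sims
            heapq.heappush(pq, (-sim(merged, c), next_id, k))
        active[next_id] = merged
        next_id += 1
    return list(active.values())

def token_jaccard(c1, c2):
    # Toy cluster similarity: Jaccard over the name tokens of the references.
    t1 = {w for ref in c1 for w in ref.split()}
    t2 = {w for ref in c2 for w in ref.split()}
    return len(t1 & t2) / len(t1 | t2)

print(rc_er([{"j smith"}, {"james smith"}, {"a aho"}], token_jaccard, 0.3))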
Relational ER Algorithms
 Relational Clustering (RC-ER)
 Generative Probabilistic Models (LDA-ER)
 Bhattacharya & Getoor, SDM’06
 Conditional Probabilistic Models (CRFs & MLNs)
 Experimental Comparison
Probabilistic Generative Model
for Collective Entity Resolution

Model how references co-occur in data
1. Generation of references from entities
2. Relationships between underlying entities

Groups of entities instead of pair-wise relations
Discovering Collaboration Groups
Stephen P Johnson
Chris Walshaw
Mark Cross
Kevin McManus
Martin Everett
Parallel Processing Research Group
Stephen C Johnson
Alfred V Aho
Ravi Sethi
Jeffrey D Ullman
Bell Labs Group
P1: C. Walshaw, M. Cross, M. G. Everett,
S. Johnson
P4: Alfred V. Aho, Stephen C. Johnson,
Jefferey D. Ullman
P2: C. Walshaw, M. Cross, M. G. Everett,
S. Johnson, K. McManus
P5: A. Aho, S. Johnson, J. Ullman
P3: C. Walshaw, M. Cross, M. G. Everett
P6: A. Aho, R. Sethi, J. Ullman
LDA*-ER Model
*Latent Dirichlet Allocation: Blei, Ng & Jordan, JMLR03

[Plate diagram over variables α, θ, z, a, r, Φ (one per group, T plates), V (one per entity, A plates), and β, with R references per co-occurrence and P co-occurrences.]

 Entity label a and group label z for each reference r
 θ: 'mixture' of groups for each co-occurrence
 Φ_z: multinomial for choosing entity a for each group z
 V_a: multinomial for choosing reference r from entity a
 Dirichlet priors with α and β
Approx. Inference Using Gibbs Sampling
 Conditional distribution over labels for each reference
 Sample next labels from the conditional distribution
 Repeat over all references until convergence

  P(z_i = t | z_−i, a, r) ∝ (n^DT_{d_i,t} + α) / (n^DT_{d_i,*} + Tα) × (n^AT_{a_i,t} + β) / (n^AT_{*,t} + Aβ)

  P(a_i = a | z, a_−i, r) ∝ (n^AT_{a,t_i} + β) / (n^AT_{*,t_i} + Aβ) × Sim(r_i, v_a)

 (T is the number of groups and A the number of entities)
 Converges to most likely number of entities
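Purely as an illustration, a highly simplified sketch of one conditional sampling step for the group label, using only the count ratios above; the entity-label step, with its Sim(r_i, v_a) factor, would follow the same pattern. The matrix names and toy counts are made up.

import numpy as np

def sample_group(n_dt, n_at, d, a, alpha, beta, rng):
    # n_dt[d, t]: count of group t in co-occurrence d; n_at[a, t]: count of
    # entity a in group t (both with the current reference's counts removed).
    T, A = n_dt.shape[1], n_at.shape[0]
    p = ((n_dt[d] + alpha) / (n_dt[d].sum() + T * alpha)
         * (n_at[a] + beta) / (n_at.sum(axis=0) + A * beta))
    return rng.choice(T, p=p / p.sum())

rng = np.random.default_rng(0)
n_dt = np.array([[3., 1.], [0., 2.]])   # 2 co-occurrences x 2 groups
n_at = np.array([[2., 0.], [1., 3.]])   # 2 entities x 2 groups
print(sample_group(n_dt, n_at, d=0, a=1, alpha=0.5, beta=0.5, rng=rng))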
Faster Inference: Split-Merge Sampling
 Naïve strategy reassigns references individually
 Alternative: allow entities to merge or split
 For entity a_i, find the conditional distribution for:
  1. Merging with existing entity a_j
  2. Splitting back to last merged entities
  3. Remaining unchanged
 Sample the next state for a_i from this distribution
 O(n g + e) time per iteration compared to O(n g + n e)
Relational ER Algorithms
 Relational Clustering (RC-ER)
 Generative Probabilistic Models (LDA-ER)
 Conditional Probabilistic Models (CRFs & MLNs)
 McCallum and Wellner, IJCAI WS 2003
 Singla & Domingos, ICDM06
 Experimental Comparison
Markov Networks for ER
McCallum and Wellner
 Formulate entity resolution as a Markov network
 N² hidden variables model coreference decisions
 Uses feature and consistency functions
  Feature functions → learn weights
  Consistency functions → weights = −∞
Markov Networks for ER
[Diagram: three references Ref1 (Name="Powell", Tag-Context = ., VBP), Ref2 (Name="Santa Claus", Tag-Context = V, .), and Ref3 (Name="Mr. Powell", Tag-Context = AUX, V), with pairwise coreference variables C12, C13, C23.]

Feature f_c (consistency over a triangle of coreference nodes):
 = 1 if any two nodes are 1 and the other one is 0
 = 0 otherwise

Feature f_1:
 = 1 if C_XY = 1 and substring(Ref_X.name, Ref_Y.name)
 = 0 otherwise
Markov Networks for ER
Feature construction is very flexible
 Normalized Substring
 Un-normalized Substring
 Acronym
 Identical
 Head Identical
 Modifier words and POS
Note: features are certainly not independent!
Markov Networks for ER
Computation of the conditional probability:

  P(y | x) = (1/Z_x) exp( Σ_{i,j,l} λ_l f_l(x_i, x_j, y_ij) + Σ_{i,j,k,l'} λ_{l'} f_{l'}(y_ij, y_jk, y_ik) )

Find parameters using maximum likelihood and gradient ascent:

  ∂L/∂λ_l = Σ_{<x,y>∈D} ( Σ_{i,j} f_l(x_i, x_j, y_ij) − Σ_{y'} P_Λ(y' | x) Σ_{i,j} f_l(x_i, x_j, y'_ij) )

Computation of the expected feature value is intractable:
 single training instance
 use the most likely assignment to compute the feature value
Markov Networks for ER
The data set considered by McCallum and Wellner does
not have much structure
 Entities are not related to each other
 Collective resolution in a weak sense
Markov Networks are a general technique which can also
be applied to collective entity resolution
Markov Logic Networks (MLN)
Singla & Domingos, ICDM06
 General framework for constructing Markov Networks
and associated features
 Define cliques and features in Markov Networks using
First Order Logic
Markov Logic Networks (MLN)
1. Fix representation of domain in First Order Logic
 Need equality for entity resolution
"Entity Resolution with Markov Logic";
Authors: P. Singla, P. Domingos
Venue: ICDM 2006

HasTitle(P1, Entity Resolution with Markov Logic)
HasWord(P1, Entity), ...
HasAuthor(P1, A1), ..., HasVenue(P1, V1)
HasName(A1, P. Singla)
HasWord(A1, Singla), ...
HasEngram(A1, gla), ...
HasName(A2, P. Domingos)
HasVenue(V1, ICDM 2006)
Markov Logic Networks (MLN)
2. Capture knowledge about the domain in a set of FOL
formulas F
 Facts:
  ∀x,y1,y2: HasAuthor(x,y1) ∧ HasAuthor(x,y2) ⇒ Coauthor(y1,y2)
 Evidence:
  ∀x1,x2,y1,y2: HasWord(x1,y1) ∧ HasWord(x2,y2) ∧ y1 = y2 ⇒ x1 = x2
  ∀x1,x2,y1,y2: Coauthor(x1,y1) ∧ Coauthor(x2,y2) ∧ x1 = x2 ⇒ y1 = y2
Markov Logic Networks (MLN)
3. Construct Markov Network M as follows
 Assume a fixed number of references in domain
 Papers, authors, venues, etc.
 M contains one binary node for each possible grounding of each predicate appearing in F
  = 1 ⇔ true; = 0 ⇔ false
 M contains one clique/feature for each possible grounding of each formula F_i ∈ F, with weight w_i
  Each formula constitutes a clique template
  = 1 if the formula is true, 0 otherwise
Markov Logic Networks (MLN)
[Diagram, repeated over several animation steps: a ground network with nodes hasAuthor(P1,A4), hasAuthor(P1,A7), hasAuthor(P3,A2), hasAuthor(P1,A3), Coauthor(A4,A7), Coauthor(A3,A2), A2=A7, A3=A4, HasWord(A4,"Domingos"), HasWord(A3,"Domingos"), and "Domingos"="Domingos". Assuming all nodes are true, groundings of the fact
  ∀x,y1,y2: HasAuthor(x,y1) ∧ HasAuthor(x,y2) ⇒ Coauthor(y1,y2)
take feature value 1 where the formula is satisfied and 0 where it is violated, and groundings of the evidence formula
  ∀x1,x2,y1,y2: Coauthor(x1,y1) ∧ Coauthor(x2,y2) ∧ x1=x2 ⇒ y1=y2
take feature value 1.]
Markov Logic Networks (MLN)
4. Probability Computation and Learning

  P(X = x) = (1/Z_x) exp( Σ_{i=1}^{|F|} w_i n_i(x) )

 Learn weights w_i as before
  Derivative of the log-likelihood
  Gradient ascent, using MaxWalkSAT to estimate the expected value
 Inference
  argmax_x P(X = x)
  Uses MaxWalkSAT with the learned weights
Markov Logic Networks (MLN)
Scalability
 Network grows at least quadratically in the number of references → intractable
  Due to extensive use of clique templates
 Use blocking techniques (e.g., TF-IDF) to exclude irrelevant parts of the network a priori
 Certain nodes are computed on the fly
Relational ER Algorithms
 Relational Clustering (RC-ER)
 Generative Probabilistic Models (LDA-ER)
 Conditional Probabilistic Models (CRFs & MLNs)
 Experimental Comparison
Datasets
 CiteSeer
 1,504 citations to machine learning papers (Lawrence et al.)
 2,892 references to 1,165 author entities
 arXiv
 29,555 publications from High Energy Physics (KDD Cup’03)
 58,515 refs to 9,200 authors
 Elsevier BioBase
 156,156 Biology papers (IBM KDD Challenge ’05)
 831,991 author refs
 Keywords, topic classifications, language, country and affiliation of
corresponding author, etc
Baselines
 A: pair-wise duplicate decisions w/ attributes only
  Names: Soft-TFIDF with Levenshtein, Jaro, Jaro-Winkler
  Other textual attributes: TF-IDF
 A*: transitive closure over A
 A+N: add attribute similarity of co-occurring refs
 A+N*: transitive closure over A+N

 Evaluate pair-wise decisions over references
 F1-measure (harmonic mean of precision and recall)
ER over Entire Dataset (pair-wise F1)

            A      A*     A+N    A+N*   RC-ER  LDA-ER
  CiteSeer  0.980  0.990  0.973  0.984  0.995  0.993
  arXiv     0.976  0.971  0.938  0.934  0.985  0.981
  BioBase   0.568  0.559  0.710  0.753  0.818  0.645

 RC-ER & LDA-ER outperform baselines in all datasets
 Collective resolution better than naïve relational resolution
 RC-ER and baselines require a threshold as parameter
  Best achievable performance over all thresholds reported
 Best RC-ER performance better than LDA-ER
 LDA-ER does not require a similarity threshold

Bhattacharya and Getoor, TKDD 07
ER over Entire Dataset (pair-wise F1)

            A      A*     A+N    A+N*   RC-ER  LDA-ER
  CiteSeer  0.980  0.990  0.973  0.984  0.995  0.993
  arXiv     0.976  0.971  0.938  0.934  0.985  0.981
  BioBase   0.568  0.559  0.710  0.753  0.818  0.645

 CiteSeer: near-perfect resolution; 22% error reduction
 arXiv: 6,500 additional correct resolutions; 20% error reduction
 BioBase: biggest improvement over baselines
Performance for Specific Names (arXiv)

  Name        Best F1 for ATTR/ATTR*   F1 for LDA-ER
  cho_h       0.80                     1.00
  davis_a     0.67                     0.89
  kim_s       0.93                     0.99
  kim_y       0.93                     0.99
  lee_h       0.88                     0.99
  lee_j       0.98                     1.00
  liu_j       0.95                     0.97
  sarkar_s    0.67                     1.00
  sato_h      0.82                     0.97
  sato_t      0.85                     1.00
  shin_h      0.69                     1.00
  veselov_a   0.78                     1.00
  yamamoto_k  0.29                     1.00
  yang_z      0.77                     0.97
  zhang_r     0.83                     1.00
  zhu_z       0.57                     1.00

Significantly larger improvements for 'ambiguous names'
Trends in Synthetic Data
 Bigger improvement with
  bigger % of ambiguous refs
  more refs per co-occurrence
  more neighbors per entity

[Plots: F1 of A, A*, and RC-ER versus (a) percentage of ambiguous attributes, (b) avg #references per hyper-edge, and (c) avg #neighbors per entity.]
User Interfaces for ER
 Combining rich statistical inference models with
visual interfaces that support knowledge discovery and
understanding
 Because the statistical confidence we may have in any
of our inferences may be low, it is important to be able
to have a human in the loop, to understand and
validate results, and to provide feedback.
 Especially for graph and network data, a well-chosen
visual representation, suited to the inference task at
hand, can improve the accuracy and confidence of user
input
D-Dupe: An Interactive Tool for ER
http://www.cs.umd.edu/projects/linqs/ddupe
Interactive Entity Resolution in Relational Data: A Visual Analytic Tool and Its Evaluation,
Kang, Getoor, Shneiderman, Bilgic, and Licamele, TVCG to appear
GeoDDupe: Tool for Interactive ER in Geospatial Data
http://www.cs.umd.edu/projects/linqs/geoddupe
Kang, Sehgal, Getoor, IV 07
Metadata Alignment
Metadata Alignment
 Input
 What type of metadata and data is assumed?
 Output
 What type of mapping is produced?
 Objectives
 How will mapping be used?
 Methods
Input to Schema Alignment
 Type of data and metadata
 Attribute names
  Alignment of Web Interfaces
 Schema structure
 Relation names, domain types
 Constraints or relationships
 Nesting or foreign keys
 Keys or more general constraints
 Data
 Full instance of schema (with or without metadata)
 Data examples only
Output of Alignment Tasks
 Correspondences (matching)
 Set of pairs of attributes (or possibly, relations) that
“correspond”
 May be 1-1 (or sometimes 1-N, N-M)
 (Possibly) confidence score associated with each pair in
matching
 Similar to output of data alignment but correspondence
rarely means “identical”
 Many inference techniques for schema matching closely
resemble techniques used for data matching
Output of Alignment Task
 Schema mappings
 Define relationship between instances of schemas
 Conceptually can also be thought of as a binary relation
over instances, but this misses important point


Schemas are finite, but set of possible instances is generally
not
Need a finite, declarative expression for mapping
 Inference cannot effectively be done by matching
possible instances
 Creation of schema mappings uses very different
techniques
Schema Matching
Schema Matching Methods
 Vary based on assumed input
 Metadata Labels only
 Labels plus domain knowledge


Dictionary, thesaurus, ontology
May be generic or specific to application domain
 Schema structure
 Schema and data
Label Matching
 Based on labels (names) used in schema
 Method: compute similarity between labels
 Similarity functions include those used for entity
matching, e.g.,
  Edit distance (number of operations needed to transform one label into the other)
  Q-gram based similarity measures
 As with data matching can formulate as pairwise
classification problem or as clustering problem
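For instance, a q-gram based label similarity can be sketched as follows (illustrative code; the padding convention is one common choice):

def qgrams(label, q=3):
    # Pad the label so boundary characters appear in q q-grams each.
    padded = "#" * (q - 1) + label.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_similarity(l1, l2, q=3):
    g1, g2 = qgrams(l1, q), qgrams(l2, q)
    return len(g1 & g2) / len(g1 | g2)

print(round(qgram_similarity("Booktitle", "BookTitle"), 2))   # 1.0
print(round(qgram_similarity("Publisher", "Publication"), 2)) # ~0.26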
Interface matching ≈ schema matching
[He and Chang, SIGMOD 03]
Web Forms are not schemas
 Attribute labels must be sufficiently expressive to permit users to understand the form
  Limited use of acronyms, abbreviations, and specialized domain vocabulary
  In contrast, structured schemas have a limited (and specialized) user population
 Limited vocabulary: for easy understanding
 A large number of similar forms: a large number of sites offer the same services or sell the same products
 Additional structure: the information is usually organized in some meaningful way in the interface
  Mark-up language around attributes can convey relationships between attributes
Adding Domain Knowledge
 Lexical analysis
  E.g., knowledge of abbreviation rules: ValId more likely to match Value Number than Valid
 Text similarity
  Common IR techniques
  In semi-structured schemas and interfaces, labels may be pieces of text
 Dictionary-based (e.g., WordNet)
  Synonyms, hypernyms, & other common natural language relations
  Particularly popular in matching ontologies
Using Data
 Basic features of data can guide matching
  E.g., difference between two numeric fields with two decimal places: price for computer products vs. salary values
  Learn a basic classifier based on these features
 Rely on data matching
  Apply similarity measures to individual values or to the set of values in an attribute
Combining Matches
 Basic matchers can be combined
  Combine similarity measures
 View schema as a graph
  Combine measures using a local notion of graph neighborhood
  Use global graph similarity measures
   E.g., Similarity Flooding [Melnik et al, 2001]
Schema matching methods

Individual matchers
 Schema-based
  Element-level
   Linguistic: names, descriptions
   Constraint-based: types, keys
  Structure-level
   Constraint-based: graph matching
 Instance-based
  Element-level
   Linguistic: IR (word frequencies, key terms)
   Constraint-based: value pattern and ranges

Taxonomy from [E. Rahm, P. Bernstein, 2001]
Some trends…
 Adding context to matches
 Use local context in inference, [Bohannon et al, VLDB 06]
 Learn functions on matches, IMAP [Dhamankar, SIGMOD 04]
 Human interaction and visual interfaces very important in
matching
 Tools for schema matching highly interactive (like D-Dupe)
 Community-based tools where interaction comes from a massive
population
 Corpus-based techniques [Madhavan et al, ICDE 05]
 Leverage previous matches from corpus
 Extra knowledge permits use of unsupervised techniques
Schema Matching Methods
 Extensive research literature and commercial tools
 Cupid, COMA, LSD, SF, Artemis, Anchor-Prompt, S-Match, SemInt, AutoMatch, Similarity Flooding, ...
 BEA Aqualogic, Microsoft BizTalk Mapper, IBM
WebSphere Data Stage TX, Stylus Studio’s XML
Mapper, …
 Surveys
 Rahm & Bernstein, 01
 Doan and Halevy, 05
Schema Mapping
Schema Mapping
 Schema mappings
 High-level, declarative assertions that specify the
relationship between set of possible instances of two or more
schemas
 Ideally, schema mappings should be
 Expressive enough to specify many data interoperability
tasks including
 Data exchange
 Data integration
 Simple enough to be efficiently manipulated by tools
 Easy to maintain and reuse
 Incrementally create and refine
Creating Schema Mappings
 Provide high-level declarative language with common data
transformation operators
 IBM Express [Shu et al, TODS 77]: CONVERT language
  Efficient compilation into executable code
 Restructuring Complex Object Models [Abiteboul & Hull, TCS88]
  Local structural transformations of types
  Each restructuring primitive has a corresponding query that expresses the data transformation
 Important development
  ILOG [Hull & Yoshikawa, VLDB 90]: declarative creation and manipulation of object identifiers
Leveraging Matchings
 TranScm [Milo & Zohar, VLDB 98]
 Apply local transformation rules to matched schemas
[Diagram: a nested Article element with an authors set of author elements, matched to a BibEntry element with repeated author attributes.]

 E.g., a Descendents function checks the numbers and types of the children of the current node
Beyond transformation operators
 Toronto/IBM Clio project begun in 1999
 Can we further automate the creation of
mapping specifications?
 Can we go beyond local transformation
rules?
 [Miller et al, VLDB 00, Popa et al, VLDB 02,
Fuxman et al, VLDB 06], IBM Rational Data
Architect
Mapping Creation
 Leverage attribute matches
  User friendly
  Automatic discovery
  Even in 1999, the quality of matching tools was already very good
 Preserve data semantics
  Discover data associations
  Use constraints and schema structure
 Model incompleteness
  Generate new values for data exchange
 Produce correct grouping
Mapping Requirements
 From the mapping, we should be able to produce execution scripts that perform data exchange
 In different execution environments
 Declarative representation that is easy to manage and
maintain
 Incrementally create and reuse
What is a Schema Mapping?
Source: company join with grant on cid
company(cid,name,city),grant(cid,gid,amt,project)
If we stop here, we have a plain old view (GAV)
create view org (cid, cname) as
( select cid, name
from company, grant
where company.cid = grant.cid )
What is a Schema Mapping?
Source: company join with grant on cid
company(cid,name,city),grant(cid,gid,amt,project)
Target: org with nested set funding (FS) join financial on aid
org(cid,name,FS),FS(gid,proj,aid,recv),financial(aid,amt,date)
What is a Schema Mapping?
Source: query on source
∀ cid,name,city,gid,amt,project
company(cid,name,city),grant(cid,gid,amt,project)
→
Target: query on target
org(cid,name,FS),FS(gid,proj,aid,recv),financial(aid,amt,date)
What is a Schema Mapping?
Source: query on source
∀ cid,name,city,gid,amt,project
company(cid,name,city),grant(cid,gid,amt,project) →
Target: query on target
∃ FS,proj,aid,recv,date
org(cid,name,FS),FS(gid,proj,aid,recv),financial(aid,amt,date)
Schema Mapping Specification
 The relationship between the source and the target is given by a set of mappings M_st that are source-to-target tuple generating dependencies (s-t tgds):

  ∀x ( ϕ(x) → ∃y ψ(x, y) )

 ϕ(x) is a query over the source
 ψ(x, y) is a query over the target
 In theory, the queries are conjunctive queries
 In practice (tool), queries may include:
  aggregation
  order by
  bag semantics
Mapping Generation
 Mapping: ∀x ϕ_s(x) → ∃y ψ_t(x, y)
 Matching determines the shared variables x
 How do we find the source (ϕ_s) and target (ψ_t) queries?
 Use the chase [Maier, Mendelzon, Sagiv 79] to find connections within the schemas
  Originally defined to solve the inference problem for relational dependencies
  We use it to generate possible alternative representations of information (logical associations)
  Generalized to the nested-relational model [Popa et al, VLDB02]
Associations
Source schema:
  expenseDB: Rcd
    companies: Set of Rcd
      company: Rcd [cid, name, city]
    grants: Set of Rcd
      grant: Rcd [cid, gid, amt, sponsor, project]

Target schema:
  statDB: Set of Rcd
    cityStat: Rcd [city]
      orgs: Set of Rcd
        org: Rcd [cid, name]
          fundings: Set of Rcd
            funding: Rcd [gid, proj, aid]
    financials: Set of Rcd
      financial: Rcd [aid, date, amount]

Source associations:
  company(CID,N,C)
  company(CID,N,C), grant(CID,GID,A,S,P)

Target associations:
  cityStat(C,Os), Os(CID,N)
  cityStat(C,Os), Os(CID,N), funding(GID,P,AID), financial(AID,D,A)
Associations
[Same source and target schemas as above.]

M1: company(CID,N,C) → ∃C',CID',Os cityStat(C',Os), Os(CID',N)
Associations
[Same source and target schemas as above.]

M1: company(CID,N,C) → ∃C',CID',Os cityStat(C',Os), Os(CID',N)
M2: company(CID,N,C), grant(CID,GID,A,S,P) → ∃C',CID',Os,P',AID',D cityStat(C',Os), Os(CID',N), funding(GID,P',AID'), financial(AID',D,A)
Multiple Associations
[Same source and target schemas as above.]

 Grants may be associated with companies in multiple ways
  Association 1: grants ⋈ companies, joining on cid = cid
  Association 2: grants ⋈ companies, joining on sponsor = cid
Logical Inference
 Logical inference is the basis for mapping discovery
 Use schema structure
  (nesting implies a logical relationship between schemas)
 Constraints
  Postulated or discovered constraints
  Mine for approximate or context-dependent constraints
Adding more knowledge
 Data-Metadata transformations
 HePToX [Bonifati et al, VLDB 05]
 Tupelo [Fletcher & Wyss, EDBT 07]
 If a domain ontology or ER conceptual schema is
available, we can use it in our inference
 [An, Borgida et al ICDE 07]
Using Conceptual Model
[Diagram: a conceptual model with Doctor (ssn, clinic) and Scientist (ssn, lab) as subclasses of Employee (ssn, name), mapped to relations doctor(ssn, name, clinic), scientist(ssn, name, lab), and employee(eid, name, clinic, lab).]

1. ∀ssn,name,clinic ( doctor(ssn,name,clinic) → ∃x,y employee(x,name,clinic,y) )
2. ∀ssn,name,lab ( scientist(ssn,name,lab) → ∃x,y employee(x,name,y,lab) )
3. ∀ssn,name,name',clinic,lab ( doctor(ssn,name,clinic) ∧ scientist(ssn,name',lab) → ∃x employee(x,name,clinic,lab) )
Open Issues
 Evaluation
 How to compare matchers or mappers?
 How to determine when one matcher or mapper will
produce better results?
 Are there schema or data characteristics that give us
clues to which type of inference would work best?
 Can we do matching and mapping collectively?
Recap
 Data Alignment
 Classification based on attributes-only
 Collective inference based on attributes and relations
 Metadata Alignment
 Schema matching based on classification
 Logical inference can be used to find schema mapping
We now look at methods for combining them for ontology alignment
ICDE 2008
Getoor, Miller --- Data & Metadata Alignment
134
Ontology Alignment
 Basic Idea
 Short overview of OWL Lite
 The ILIADS method
 Experimental evaluation
The basic idea
 Produce better quality alignments by
 using data (instances) effectively and
 using logical inference (e.g., in OWL) to estimate how
good an alignment is
 Parameterize the method such that
 It can be adapted for a wide variety of inputs
 The parameters can be adjusted with minimal effort
based on the input ontologies
Defining the terms
 Entity: everything that has a URI identifier (plus literals)
 Ontology: software artifact consisting of classes,
instances, facts, axioms
 Alignment: Given two ontologies, find relationships
between their respective entities
 Integration: Merge two ontologies under a set of
alignments to obtain a consistent result
Ontology Alignment
 Motivation and goals
 Short overview of OWL Lite
 The ILIADS method
 Experimental evaluation
Example OWL Lite ontologies
(discoveredBy, owl:inverseOf, discoverer); (discoveredBy, owl:type, owl:FunctionalProperty)
(discoveredBy, owl:inverseOf, discoverer); (associatedWith, owl:type, owl:TransitiveProperty)
(resultsFrom, rdfs:subPropertyOf, associatedWith)
Inference in OWL (Lite)
 A tableau-based method
 Example tableau rule:
  (p owl:inverseOf p'), (o1 p o2) ⊢ (o2 p' o1)
 Example inconsistency:
  (o1 owl:sameAs o2), (o2 owl:differentFrom o1) ⊢ ⊥
Example inference
(discoveredBy, owl:inverseOf, discoverer); (discoveredBy, owl:type, owl:FunctionalProperty)
(discoveredBy, owl:inverseOf, discoverer); (associatedWith, owl:type, owl:TransitiveProperty)
(resultsFrom, rdfs:subPropertyOf, associatedWith)
[Subsequent animation steps highlight the axioms (discoveredBy, owl:inverseOf, discoverer) and (discoveredBy, owl:type, owl:FunctionalProperty).]
The alignment problem
 Find a set of triples (entity1 relation entity2) where:
 entity1, entity2 are entities from the two ontologies
 relation is one of
  subClassOf, equivalentClass, subPropertyOf, equivalentProperty, sameAs
 For integration, the union of the ontologies and the
alignment must be consistent.
Ontology Alignment
 Motivation and goals
 Short overview of OWL Lite
 The ILIADS method
 Udrea, Getoor, Miller, SIGMOD07
 Experimental evaluation
State of the art
 Ideally, alignment should be treated as an
optimization problem
 Choose candidate pairs to maximize an ontology-level
similarity measure
 Unfeasible in practice
 To approximate, existing tools use locally computed
similarity measures
 Often, this means the “big picture” of the search space is
ignored
Incremental methods
[Diagram, stepped over several slides: (1) a candidate pair's score is high enough, so we commit to the owl:sameAs relation; (2) this changes the scores of the neighbors; (3) another score is now high enough, so we have found another alignment.]
The core of ILIADS
 Compute alignment candidates based on well
established methods
 Lexical, structural, extensional similarity
 In addition, evaluate how “good” a candidate pair is
based on the logical consequences of asserting the
alignment
 We call this “inference similarity”
 Essentially a look-ahead that estimates the impact of the
alignment on the global similarity score
The ILIADS algorithm
repeat until no more candidates
 1. Compute local similarities
 2. Select promising candidates
 3. For each candidate
    a. Perform N inference steps
    b. Update score with the inference similarity
 4. Select the candidate with the best score
end
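A high-level sketch of this loop (not the released system; the similarity and look-ahead functions are stubs, and the threshold is an illustrative assumption):

def iliads(candidates, local_sim, inference_sim, n_steps=5, threshold=0.1):
    # Greedy loop: score candidates locally, adjust by an inference
    # look-ahead, and commit the best-scoring alignment each round.
    alignment = []
    while candidates:
        scored = [(local_sim(pair) * inference_sim(pair, n_steps), pair)
                  for pair in candidates]
        best_score, best = max(scored, key=lambda sp: sp[0])
        if best_score < threshold:
            break
        alignment.append(best)
        candidates.remove(best)
    return alignment

# Toy run with stubbed similarities for two hypothetical entity pairs.
cands = [("Disease", "Illness"), ("Virus", "Bacterium")]
print(iliads(cands,
             local_sim=lambda p: 0.6 if p[0] == "Disease" else 0.2,
             inference_sim=lambda p, n: 1.5 if p[0] == "Disease" else 0.4))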
Computing similarity
(steps 1-2 of the loop above)

 sim(e,e') = λ_l · sim_lexical(e,e') + λ_s · sim_structural(e,e') + λ_e · sim_extensional(e,e')

 Lexical similarity: Jaro-Winkler and WordNet
 Structural similarity: Jaccard for various neighborhoods
 Extensional similarity: Jaccard on extensions
 Select candidates with sim(e,e') above a threshold
Performing inference
(step 3a of the loop above)

For the candidate pair (e,e'):
 Select an axiom and apply the corresponding rule
 The logical consequences are the pairs of entities (e_i, e_j) that have just become equivalent
 Repeat a small number of times (5)
Updated score
(step 3b of the loop above)

For the candidate pair (e,e'):
 Compute the product P of sim(e_i, e_j) / (1 − sim(e_i, e_j)) over all logical consequences
 sim_updated(e,e') = sim(e,e') × P
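In code, the update is one line; this sketch reproduces the numeric example on the next slide (one consequence with similarity 0.6 and a base score of 0.5):

def updated_score(base_sim, consequence_sims):
    # Multiply the base score by the odds sim/(1 - sim) of each consequence.
    P = 1.0
    for s in consequence_sims:
        P *= s / (1.0 - s)
    return base_sim * P

print(updated_score(0.5, [0.6]))  # 0.5 * (0.6 / 0.4) = 0.75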
Example inference similarity
[Diagram, stepped: we assume the candidate pair is in an owl:sameAs relation before starting inference; the axioms (discoveredBy, owl:inverseOf, discoverer) and (discoveredBy, owl:type, owl:FunctionalProperty) are applied; during inference, (E-Coli Poisoning, owl:sameAs, E-Coli) is derived.]

This is the only logical consequence, with similarity 0.6:
 P = 0.6 / 0.4 = 1.5
 Updated score: 0.5 × 1.5 = 0.75
The ILIADS algorithm
 It is still a local method
 Ultimately, it selects the best alignment after each step
 But it estimates the global impact of each alignment
better
 The inference similarity is a look-ahead measure of how
good the candidate alignment is
Other issues
 ILIADS may not produce a consistent result
 Inconsistent ontologies in less than .5% of runs
 Pellet used to check consistency after ILIADS
 How do we decide between subsumption and
equivalence for a pair of entities?
 How do we select the promising candidates?
 How do we choose the axioms to apply in the five
inference steps?
Subsumption vs. equivalence
 Deciding whether two entities should subsume each
other or be equivalent is not clear-cut
 Simple extensional technique to distinguish between the two cases
  E.g., measure whether the instances of class c are "almost" the same as those of class c' => owl:equivalentClass
  If they are a subset, then rdfs:subClassOf
Deciding relationship type
[Diagram: some instances are present in the extensions of both FoodPoisoning and FoodBorneDisease.]
 To measure how much the two classes have in common, we divide the size of the unique part by the size of the common part; we obtain 1/3 and 2/4 respectively
 We decide based on λ_r: if λ_r = .49, then we choose rdfs:subClassOf
 If λ_r = .7, then we choose owl:equivalentClass
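A sketch of this extensional test with hypothetical instance sets (the exact ratio and thresholding conventions in ILIADS may differ):

def relation_type(ext1, ext2, lambda_r):
    # Ratio of each class's unique part to the common part of the extensions.
    common = ext1 & ext2
    if not common:
        return None
    r1 = len(ext1 - common) / len(common)
    r2 = len(ext2 - common) / len(common)
    if r1 <= lambda_r and r2 <= lambda_r:
        return "owl:equivalentClass"
    if r1 <= lambda_r or r2 <= lambda_r:
        return "rdfs:subClassOf"  # one extension is 'almost' inside the other
    return None

food_poisoning = {"i1", "i2", "i3", "i4"}              # hypothetical instances
food_borne_disease = {"i2", "i3", "i4", "i5", "i6"}
print(relation_type(food_poisoning, food_borne_disease, lambda_r=0.49))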
Cluster type selection
 Existing tools use various strategies to generate
candidates from classes, individuals or properties
 ILIADS supports:
 Randomly select from the three types
 Weighted random (more classes than individuals means
classes will be selected more often)
 Classes first / Individuals first
 Alternate at each step
Axiom selection policies
 The number of inference steps is small
 The axioms applied must make a difference
 ILIADS always selects from relevant axioms according
to a policy:
 Random
  Property axioms first (e.g., owl:TransitiveProperty)
  Class axioms first (e.g., rdfs:subClassOf)
  Transitive/Inverse/Functional first (since they tend to "generate" sameAs relationships)
Ontology Alignment
 Motivation and goals
 Short overview of OWL Lite
 The ILIADS method
 Experimental evaluation
Experimental framework
 30 pairs of ontologies
 Ontologies from 194 to over 20000 triples
 Ground truth provided by human reviewers
 Comparison in terms of recall and precision with FCA-
merge and COMA++
 Two versions of the algorithm
 Best overall average quality ILIADS – FP
 Best parameters for each pair ILIADS – BP
ILIADS-BP parameter setting
[Plot: precision vs. recall for ILIADS-FP, ILIADS-BP, FCA-merge, and COMA++; recall roughly 0.25-0.95, precision roughly 0.4-1.0.]
Precision/recall comparison
[Bar chart: precision, recall, and F1 for ILIADS-FP, ILIADS-BP, FCA-merge, and COMA++; values range from 50.6% to 78.8%, with the ILIADS variants highest.]
Precision/recall for ontologies with substantial instance data
[Bar chart: precision, recall, and F1 for ILIADS-FP, ILIADS-BP, FCA-merge, and COMA++; values range from 49.7% to 80.8%, with a larger ILIADS advantage than over all pairs.]
False negative analysis
[Pie charts: breakdown of false negatives for ILIADS, FCA-merge, and COMA++ by type: equivalent individuals, equivalent properties, subproperties, equivalent classes, and subclasses.]
Number of inference steps
N = 5 inference steps was chosen as the best compromise between running time and F1 quality.
[Plots: running time (seconds) and F1 quality as a function of the number of inference steps N, for N from 0 to about 12.]
Cluster type/axiom selection policies
[Bar chart: F1 quality for different cluster type and axiom selection policies (Trans/Inv/Func, Classes first, Prop. first, Random axiom), with F1 between roughly 0.55 and 0.8.]
And the result is...
(discoveredBy, owl:inverseOf, discoverer); (discoveredBy, owl:type, owl:FunctionalProperty)
(discoveredBy, owl:inverseOf, discoverer); (associatedWith, owl:type, owl:TransitiveProperty)
(resultsFrom, rdfs:subPropertyOf, associatedWith)
Choosing the parameters
 The structural similarity coefficients strongly
correlate with the average degree of the node
 The structural coefficient for classes correlates
with the number of rdfs:subClassOf relationships
 The extensional coefficients correlate with the
ratio of instance to classes
Parameter sensitivity
 Structural coefficients are stable around the ILIADS-
FP setting for 25 out of 30 pairs
 The remaining 5 pairs have large differences between
their average node degrees
 Extensional coefficients are stable around the ILIADS-
FP setting for 21 pairs
 The remaining 9 pairs have a low ratio of instances to
classes (< 1.9)
Experimental results summary
 ILIADS has better quality than COMA++ and FCA-
merge, with a significant difference for all pairs with
substantial instance data
 Matching properties is the major cause of false
negatives for all three systems, but ILIADS does better
at matching instances
 Structural and extensional coefficients correlate with
structural properties and are stable for ontologies with
similar structure
ILIADS Summary
 New algorithm that tightly integrates statistical
matching and logical inference to produce better
quality alignments
 Found intriguing correlations between structure and
matching strategies
 Improvement over existing systems
 25% higher quality than FCA-merge,
 11% higher recall than COMA++ at comparable precision
HOMER: Tool for Ontology Alignment
http://www.cs.umd.edu/projects/linqs/iliads
Information Alignment:
Summary
 The process of finding, modeling and using the
correspondences or connections that place
information artifacts in relation to each other
Need new, flexible, adaptive methods for information
alignment which can take context into account and which
can exploit both logical and probabilistic consequences
Open Issues
 Query-time data and metadata alignment
 Notion of multiple alignments; no single one best
 Need to keep track and make use of lineage
 Need to understand which information is most
informative and useful for alignment: data, structure,
metadata, etc.
 Need for methods for evaluation and quality measures
Thanks!