Data & Metadata Alignment (ICDE 2008 Tutorial)
Lise Getoor, University of Maryland
Renée J. Miller, University of Toronto

From Webster: Main Entry: align·ment. Variant(s): also aline·ment \əlīn-mənt\. Function: noun. Date: 1790. 1: the act of aligning or state of being aligned; especially: the proper positioning or state of adjustment of parts (as of a mechanical or electronic device) in relation to each other.

Information alignment: the process of finding, modeling and using the correspondences or connections that place information artifacts in relation to each other.

Outline
1. Introduction: What is Data / Metadata Alignment?
2. Data Alignment: Entity Resolution
3. Metadata Alignment: Schema Mapping
4. Data & Metadata Alignment: Ontology Alignment
5. Conclusion

Alignment Example: Metadata
Web scraping source (S): BibEntry(BibId, Title, Author, Publisher, Booktitle, Editor, Volume, Number, Keywords).
Curated publication database (T): Writes(PubID, Author); Keywords(PubID, Keyword); Publication(PubID, DateAdded, Source); JournalArticle(PubID, Title, Journal, Year, Volume, Number, Pages); TechnicalReport(PubID, Title, Institution, Number); Conference(PubID, Title, Conference, Year).

Lines are not enough…
Suppose we want to translate data from Web extraction source S into the relational schema T. Arrows do not give sufficient information to be able to shred a BibEntry into a faithful relational instance (maintaining the semantics of the data!). And what if S and T contain overlapping data which is represented differently? For example, BibEntry(bib1, "J. Smith", "Alignment, Solved!") in S vs. Writes(908765, "J. Smith") and JournalArticle(908765, "Aligment: solved") in T.

Alignment is Fundamental
- The result of alignment is a declarative mapping representing a semantic relationship between data (entities) or metadata (schemas or ontologies).
- Mappings are a basic building block for model management: schema integration requires models and mappings; there are well-studied operators on mappings (compose, invert).
- Mappings are used in data exchange, data integration, peer data sharing, and more.

(Virtual) Integration Architecture
A user query is posed against a mediated (integrated or target) schema; a mediator (reformulation engine, optimization engine, execution engine) answers it over wrapped sources: a city database, a county database, a public data server, and an outside website (via a metadata wrapper).

Common Case: Data Publishing
A legacy (relational) schema is mapped to a standard (XML) schema, with data "conforming to" each. Data publishing: map a legacy (often relational) source schema into a standardized (often XML) target suitable for data sharing or publishing on the web.
- Need to create a target instance, not (necessarily) answer queries.

Other Applications
Variations appear in numerous scenarios:
- Data warehousing: map a new data source into the warehouse schema; for performance, source data is transferred to the target for complex analysis.
- Enterprise integration: map the databases of an acquired company into existing operational databases; it may be too expensive to build new software for runtime coupling of systems.

Data Exchange
Source schema S, mapping, target schema T, with data "conforming to" each schema. Data exchange: given a mapping and a source instance, create an instance of the target that reflects the source data as accurately as possible [Popa et al. VLDB 02, Fagin et al. ICDT 03].
Talk Scope
Our focus: creating mappings. We will not be covering extensively: the representational power of different mapping expression languages; mapping operators (compose, invert); mapping maintenance; mapping use in different data management tasks.

Talk Goals
- Bring together relevant research on finding data and metadata alignments from the database and machine learning communities.
- Seek to understand commonalities and differences in the underlying inference problems that drive research in this area.
- Provide a common language to discuss the problems of data and metadata alignment.

Caveats
This is not an exhaustive survey; our goal is to give examples of key issues. We will make available an extended bibliography. If you have references to add, please email them (bibtex w/ url) to us: [email protected] and [email protected].

Acknowledgments
Indrajit Bhattacharya, Mustafa Bilgic, Matthias Broecheler, Fei Chiang, Oktie Hassanzadeh, Hyunmo Kang, Louis Licamele, Preetam Maloor, Walaa El-Din Moustafa, Patricia Rodriguez-Gianolli, Mo Sadoghi, Hossam Sharara, Octavian Udrea and many more. NSF, KDD, NSERC, CITO, IBM. ICDE!

Roadmap
Introduction; attribute-based methods; relational methods; interactive methods.

Basic Problem(s)
Detecting and eliminating duplicate records; integrating and matching data across sources. The problem goes by many names: deduplication, merge/purge problem, entity disambiguation, duplicate detection, record matching, identity uncertainty, instance identification, object identification, co-reference resolution, reference reconciliation, record linkage, database hardening, fuzzy matching, entity resolution…

Example: Citation Data
- L. Breiman, L. Friedman, and P. Stone, (1984). Classification and Regression. Wadsworth, Belmont, CA.
- Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth and Brooks/Cole, 1984.
- R. Agrawal, R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB-94, 1994.
- Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.

Example: Customer Data
- Name: Preetam Maloor. Address: 18055 Cottage Garden Dr, Germantown, MD.
- Name: Maloor, P. Marital Status: Single. Occupation: Research Engineer.
- Name: Preetam A. Maloor. Type of Housing: Rent. Location: Germantown, MD.

Other Examples
- Natural language processing: noun/pronoun resolution, e.g., "John gave Edward the book. He then stood up and called to John to come back into the room."
- Computer vision: object correspondence, e.g., matching objects across video frames.
- Biology: e.g., proteins, genes, etc.
- Many more: social networks, personal information management, privacy (guaranteeing that alignments are not possible).

In the News…
Companies such as SRD (Las Vegas, NV), acquired by IBM in 2005, offer solutions to casinos for detecting scams based on simple entity resolution techniques using a system called NORA (Non-Obvious Relationship Awareness).
Founder Jeff Jonas is now an IBM Distinguished Engineer and Chief Scientist. Spock awarded a $50,000 Grand Prize to the Spock Challenge winners, from Germany's Bauhaus University Weimar, March 2008.

Origin
First explored in the late 1950s and 1960s:
- Newcombe, Kennedy, Axford, & James. Automatic linkage of vital records. Science 130(3381), 954-959 (1959).
- Fellegi & Sunter. A theory for record linkage. Journal of the American Statistical Association 64(328), 1183-1210 (1969).
The seminal papers studied record linkage in the context of matching census population records. Currently, there is an explosion of research:
- DB: Benjelloun et al.; Sarawagi & Bhamidipaty; Ananthakrishna, Chaudhuri, Ganti, Motwani; Doan et al.; Dong, Halevy, Madhavan; Hernández & Stolfo; Kalashnikov, Mehrotra & Chen; lots more…
- ML: Bhattacharya & Getoor; Bilenko & Mooney; Cohen; Singla & Domingos; McCallum et al.; Pasula, Russell & Milch; Monge & Elkan; Winkler; Tejada, Knoblock, and Minton; lots more…
Here, we try to summarize some of the main ideas.

The Entity Resolution Problem
References such as "John Smith", "Jim Smith", "J Smith", "James Smith", "Jon Smith", and "Jonthan Smith" must be resolved against the entities James Smith, John Smith, and Jonathan Smith. Issues: 1. identification; 2. disambiguation.

Attribute-based Entity Resolution
Pair-wise classification, e.g., sim("Jim Smith", "James Smith") = 0.8; sim("John Smith", "James Smith") = 0.4; sim("Jon Smith", "James Smith") = 0.3; sim("Jonthan Smith", "James Smith") = 0.2. Is "J Smith" the same as "James Smith"?

Attribute-based Similarity
- Static similarity computation: individual record fields are often stored as strings, so the key component is string similarity measures.
- Adaptive similarity computation: learn weights for attributes; more generally, formulate matching as a classification problem and apply standard machine learning algorithms. It can also be formulated as a clustering problem (more on this later).

String Similarity Overview
- Character-based: edit-based measures such as Levenshtein distance and Jaro-Winkler.
- Token-based: overlap, Jaccard and weighted Jaccard; IR methods (cosine with TF-IDF); language modeling (hidden Markov models).
- Hybrid: Generalized Edit Similarity (GES), SoftTFIDF.
Comparison for name-matching: Cohen et al., IJCAI'03 workshop. Survey for duplicate detection: Elmagarmid, Ipeirotis, Verykios, TKDE'07.

Edit-based Similarity
tc(t, s): the minimum cost of edit operations to transform t to s. Edit operations: character insert, delete and replace. Levenshtein distance: unit cost for all operations.

  sim_edit(t, s) = 1 − tc(t, s) / max(|t|, |s|)

Example: sim_edit("Microsoft", "Macrosft") = 1 − 2/9 ≈ 0.78.

Other Edit-based Measures
- Jaro: a more sophisticated string similarity score that looks at the similarity within a certain neighborhood.
- Jaro-Winkler: based on Jaro, but weights matches at the beginning of the string more highly.
- Other variants: Smith-Waterman and Monge-Elkan.

Token-based
Token/word order is unimportant. Convert the strings s and t to token sets or multisets (where each token is a word) and consider similarity metrics on these sets. Jaccard similarity: |S ∩ T| / |S ∪ T|. To weight agreement on rare terms more heavily than agreement on more common terms, use term frequency-inverse document frequency (TF-IDF) weights.
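To make these two families of measures concrete, here is a minimal, self-contained sketch (not from the tutorial) of normalized edit similarity and token-based Jaccard similarity; it reproduces the Microsoft/Macrosft score from the slide above.

```python
# Minimal sketch: normalized Levenshtein similarity and Jaccard token similarity.
# Illustrative only; production matchers use tuned, adaptive measures.

def levenshtein(s: str, t: str) -> int:
    """Minimum number of character inserts, deletes, and replaces."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # delete
                            curr[j - 1] + 1,             # insert
                            prev[j - 1] + (cs != ct)))   # replace
        prev = curr
    return prev[-1]

def sim_edit(s: str, t: str) -> float:
    """sim_edit(t, s) = 1 - tc(t, s) / max(|t|, |s|)."""
    return 1.0 - levenshtein(s, t) / max(len(s), len(t))

def sim_jaccard(s: str, t: str) -> float:
    """Token-based: word order is ignored."""
    S, T = set(s.lower().split()), set(t.lower().split())
    return len(S & T) / len(S | T)

print(round(sim_edit("Microsoft", "Macrosft"), 2))   # 0.78, as on the slide
print(sim_jaccard("J Smith", "Smith J"))             # 1.0 despite word order
```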
Limitations of Static Methods
Monge and Elkan, "The Field Matching Problem: Algorithms and Applications," Knowledge Discovery and Data Mining, 1996.
Issue: a fundamental limitation of static similarity functions is that, by their nature, they can include no special knowledge of the specific problem at hand. Even methods that have been tuned and tested on many previous matching problems can perform poorly on new and different matching problems. The remedy is to make methods adaptive by formulating matching as a supervised classification problem.

Supervised Learning
Represent record pairs as feature vectors, using as features the distances between corresponding fields, and train a binary support vector machine classifier on these feature vectors. Challenge: providing a covering and challenging set of training pairs that bring out the subtlety of the deduplication function.
Bilenko, Mooney, Cohen, Ravikumar, Fienberg, "Adaptive Name Matching in Information Integration," IEEE Intelligent Systems, vol. 18, no. 5, pp. 16-23, Sept/Oct 2003.

Active Learning
A learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs via active learning. Unlike an ordinary learner that trains on a static training set, an active learner actively picks the subsets of instances which, when labeled, will provide the highest information gain to the learner.
S. Sarawagi and A. Bhamidipaty, "Interactive deduplication using active learning," Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.

Issues: Attribute-based ER
Pair-wise classification, e.g., sim("Jim Smith", "James Smith") = 0.8; sim("John Smith", "James Smith") = 0.1; sim("Jon Smith", "James Smith") = 0.7; sim("Jonthan Smith", "James Smith") = 0.05. Is "J Smith" the same as "James Smith"?
1. Choosing a threshold: precision/recall tradeoff.
2. Inability to disambiguate.
3. Perform transitive closure?

Common Issues: Efficiency
The naïve pairwise entity resolution problem is computationally expensive: O(N²) comparisons. Remedies (a blocking sketch follows):
- Blocking [Hernandez and Stolfo, 1995]
- Sliding window technique [Hernandez and Stolfo, 1995]
- Canopies [McCallum et al. 2003]
Baxter, R., Christen, P., Churches, T.: A comparison of fast blocking methods for record linkage. In: Proceedings of ACM SIGKDD'03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 2003. Elmagarmid, Ipeirotis, Verykios. Duplicate Record Detection: A Survey, TKDE, 2007. There are interesting connections to efficient join algorithms.
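A minimal sketch of the blocking idea above, assuming a hypothetical lowercased-surname blocking key; real systems use multiple keys, sliding windows, or canopies.

```python
# Key-based blocking: only records sharing a blocking key are compared,
# reducing the naive O(N^2) comparisons. The key choice here is a toy
# illustration, not any particular system's strategy.
from collections import defaultdict
from itertools import combinations

records = ["James Smith", "J Smith", "Jim Smith", "Jon Smith", "Renee Miller"]

blocks = defaultdict(list)
for r in records:
    blocks[r.split()[-1].lower()].append(r)  # blocking key: surname

candidate_pairs = [p for block in blocks.values()
                   for p in combinations(block, 2)]
print(len(candidate_pairs))  # 6 pairs instead of C(5,2) = 10
```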
Common Issue: Evaluation
Entity resolution is mostly evaluated as a pair-wise classification problem, but accuracy may not be the best metric: the datasets tend to be highly skewed, and often less than 1% of all pairs are duplicates. Alternatives: precision over the duplicate predictions and recall over the entire set of duplicates [Monge and Elkan, 1996]; cluster purity, evaluating the generated clusters against the true clusters [Monge and Elkan, 1997]; other measures: F1, AUC, average precision, etc. Upcoming LREC workshop on evaluating entity resolution algorithms, May 2008.

[Figure: InfoVis co-author network fragment, before and after entity resolution]

Relational Entity Resolution
References are not observed independently: links between references indicate relations between the entities (co-author relations for bibliographic data; to/cc lists for email). Use relations to improve identification and disambiguation. Pasula et al. 03; Ananthakrishna et al. 02; Bhattacharya & Getoor 04, 06, 07; McCallum & Wellner 04; Li, Morie & Roth 05; Culotta & McCallum 05; Kalashnikov et al. 05; Chen, Li, & Doan 05; Singla & Domingos 05; Dong et al. 05.

- Relational identification: very similar names; added evidence from shared co-authors.
- Relational disambiguation: very similar names but no shared collaborators.
- Relational constraints: co-authors are typically distinct.
- Collective entity resolution: one resolution provides evidence for another => joint resolution.

Entity Resolution with Relations
- Naïve relational entity resolution: also compare attributes of related references (e.g., two references have co-authors with similar names).
- Collective entity resolution: use the discovered entities of related references; entities cannot be identified independently; a harder problem to solve.

Relational ER Algorithms
- Relational Clustering (RC-ER): Bhattacharya & Getoor, DMKD'04, Wiley'06, DE Bulletin'06, TKDD'07
- Generative Probabilistic Models (LDA-ER)
- Conditional Probabilistic Models (CRFs & MLNs)
- Experimental Comparison

Running example references:
P1: "JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines", C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: "Partitioning Mapping of Unstructured Meshes to Parallel Machine Topologies", C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
P3: "Dynamic Mesh Partitioning: A Unified Optimisation and Load-Balancing Algorithm", C. Walshaw, M. Cross, M. G. Everett
P4: "Code Generation for Machines with Multiregister Operations", Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman
P5: "Deterministic Parsing of Ambiguous Grammars", A. Aho, S. Johnson, J. Ullman
P6: "Compilers: Principles, Techniques, and Tools", A. Aho, R. Sethi, J. Ullman

Relational Clustering (RC-ER)
[Figure sequence: the references of P1/P2 (C. Walshaw, M. Cross, M. G./M. Everett, S. Johnson, K. McManus) and of P4/P5 (Alfred V./A. Aho, Stephen C./S. Johnson, Jefferey D./J. Ullman) are progressively merged into entity clusters]
Cut-based Formulation of RC-ER
[Figure: two candidate clusterings of the S. Johnson references between the Everett and Aho groups] One clustering gives good separation of attributes but many cluster-cluster relationships (Aho-Johnson1, Aho-Johnson2, Everett-Johnson1); the other is worse in terms of attributes but has fewer cluster-cluster relationships (Aho-Johnson1, Everett-Johnson2).

Objective Function
Minimize

  Σ_{i,j} [ w_A · sim_A(c_i, c_j) + w_R · sim_R(c_i, c_j) ]

where w_A is the weight for attributes, sim_A the similarity of attributes, w_R the weight for relations, and sim_R a similarity based on the relational edges between c_i and c_j. Greedy clustering algorithm: merge the cluster pair with the maximum reduction in the objective function,

  Δ(c_i, c_j) = w_A · sim_A(c_i, c_j) + w_R · |N(c_i) ∩ N(c_j)|

combining the similarity of attributes with the common cluster neighborhood.

Attribute Similarity
Use the best available measure for each attribute: for name strings, SoftTFIDF, Levenshtein, Jaro; for textual attributes, TF-IDF. Aggregate to find the similarity between clusters: single link, average link, complete link, or a cluster representative.

Relational Similarity: Example 1
[Figure] The clusters A. Aho / Alfred V. Aho, Stephen C. Johnson / S. Johnson, and Jefferey D. Ullman / J. Ullman all co-occur in P4 and P5: all neighborhood clusters are shared, so relational similarity is high.

Relational Similarity: Example 2
[Figure] The Walshaw / Cross / Everett / McManus clusters (P1, P2) and the Aho / Ullman clusters (P4, P5) share no neighborhood cluster: there is no relational similarity between, e.g., the two S. Johnson clusters.

Comparing Cluster Neighborhoods
Consider the neighborhood as a multi-set and use different measures of set similarity: common neighbors (intersection size); Jaccard coefficient (normalize by union size); Adamic/Adar coefficient (weighted set similarity); higher-order similarity (consider neighbors of neighbors).

Relational Clustering Algorithm
1. Find similar references using 'blocking'
2. Bootstrap clusters using attributes and relations
3. Compute similarities for cluster pairs and insert them into a priority queue
4. Repeat until the priority queue is empty:
5.   Find the 'closest' cluster pair
6.   Stop if its similarity is below a threshold
7.   Merge to create a new cluster
8.   Update similarities for 'related' clusters
An O(n k log n) algorithm with an efficient implementation.
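The merge loop (steps 3-8) can be sketched as follows; this is an illustrative reading of the algorithm, not the authors' implementation. sim_attr and the neighbors map are assumed to be supplied by the bootstrapping phase; clusters are frozensets of reference ids.

```python
import heapq

# Minimal sketch of the greedy RC-ER merge loop, assuming a caller-supplied
# attribute similarity and a neighborhood map (both hypothetical helpers).

def rc_er(clusters, sim_attr, neighbors, w_a=0.5, w_r=0.5, threshold=0.1):
    def delta(ci, cj):
        # Δ(ci,cj) = w_A·sim_A(ci,cj) + w_R·|N(ci) ∩ N(cj)|
        return w_a * sim_attr(ci, cj) + w_r * len(neighbors[ci] & neighbors[cj])

    active = set(clusters)
    pq = [(-delta(ci, cj), ci, cj)
          for i, ci in enumerate(clusters) for cj in clusters[i + 1:]]
    heapq.heapify(pq)
    while pq:
        d, ci, cj = heapq.heappop(pq)
        if -d < threshold:
            break                              # closest pair too dissimilar: stop
        if ci not in active or cj not in active:
            continue                           # stale queue entry: skip
        ck = ci | cj                           # merge into a new cluster
        active -= {ci, cj}
        neighbors[ck] = neighbors[ci] | neighbors[cj]
        for c in active:                       # update 'related' similarities
            heapq.heappush(pq, (-delta(ck, c), ck, c))
        active.add(ck)
    return active
```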
Relational ER Algorithms
- Generative Probabilistic Models (LDA-ER): Bhattacharya & Getoor, SDM'06

Probabilistic Generative Model for Collective Entity Resolution
Model how references co-occur in the data: 1. the generation of references from entities; 2. the relationships between the underlying entities, using groups of entities instead of pair-wise relations.

Discovering Collaboration Groups
Parallel Processing Research Group: Stephen P Johnson, Chris Walshaw, Mark Cross, Kevin McManus, Martin Everett. Bell Labs Group: Stephen C Johnson, Alfred V Aho, Ravi Sethi, Jeffrey D Ullman.
P1: C. Walshaw, M. Cross, M. G. Everett, S. Johnson / P2: C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus / P3: C. Walshaw, M. Cross, M. G. Everett / P4: Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman / P5: A. Aho, S. Johnson, J. Ullman / P6: A. Aho, R. Sethi, J. Ullman

LDA*-ER Model (*Latent Dirichlet Allocation: Blei, Ng, Jordan, JMLR 03)
[Plate diagram over variables α, θ, z, Φ, a, V, r with plate sizes T, A, R, P and prior β]
- An entity label a and a group label z for each reference r
- Θ: 'mixture' of groups for each co-occurrence
- Φ_z: multinomial for choosing entity a for each group z
- V_a: multinomial for choosing reference r from entity a
- Dirichlet priors with α and β

Approximate Inference Using Gibbs Sampling
Compute the conditional distribution over labels for each reference, sample the next labels from it, and repeat over all references until convergence:

  P(z_i = t | z_-i, a, r) ∝ [(n^DT_{d_i,t} + α/T) / (n^DT_{d_i,*} + α)] × [(n^AT_{a_i,t} + β/A) / (n^AT_{*,t} + β)]

  P(a_i = a | z, a_-i, r) ∝ [(n^AT_{a,t_i} + β/A) / (n^AT_{*,t_i} + β)] × Sim(r_i, v_a)

This converges to the most likely number of entities.

Faster Inference: Split-Merge Sampling
The naïve strategy reassigns references individually. Alternative: allow entities to merge or split. For entity a_i, find the conditional distribution over: 1. merging with an existing entity a_j; 2. splitting back to the last merged entities; 3. remaining unchanged; and sample the next state for a_i from this distribution. O(n g + e) time per iteration compared to O(n g + n e).

Relational ER Algorithms
- Conditional Probabilistic Models (CRFs & MLNs): McCallum and Wellner, IJCAI WS 2003; Singla & Domingos, ICDM 06

Markov Networks for ER (McCallum and Wellner)
Formulate entity resolution as a Markov network: N² hidden variables model the coreference decisions. The model uses feature and consistency functions: feature weights are learned; consistency weights are fixed to −∞.

Markov Networks for ER
Consistency feature: f_c = 1 if, among the three pairwise decisions C12, C13, C23, two are 1 and the other is 0 (a transitivity violation); f_c = 0 otherwise.
[Figure: Ref1 (Name="Powell", Tag-Context = ., VBP), Ref2 (Name="Santa Claus", Tag-Context = V, .), Ref3 (Name="Mr. Powell", Tag-Context = AUX, V), with decision variables C12, C13, C23]
Example feature: f_1 = 1 if C_XY = 1 and substring(Ref_X.name, Ref_Y.name); 0 otherwise.

Markov Networks for ER
Feature construction is very flexible: normalized substring, un-normalized substring, acronym, identical, head identical, modifier words and POS. Note: the features are certainly not independent!

Markov Networks for ER
Computation of the conditional probability:

  P(y | x) = (1/Z_x) exp( Σ_{i,j,l} λ_l f_l(x_i, x_j, y_ij) + Σ_{i,j,k,l'} λ_l' f_l'(y_ij, y_jk, y_ik) )

Find the parameters using maximum likelihood and gradient ascent:

  ∂L/∂λ_l = Σ_{⟨x,y⟩∈D} ( Σ_{i,j} f_l(x_i, x_j, y_ij) − Σ_{y'} P_Λ(y' | x) Σ_{i,j} f_l(x_i, x_j, y'_ij) )

Computation of the expected feature value is intractable; with a single training instance, use the most likely assignment to compute the feature value.

Markov Networks for ER
The data set considered by McCallum and Wellner does not have much structure: entities are not related to each other, so this is collective resolution only in a weak sense. Markov networks are, however, a general technique which can also be applied to collective entity resolution.

Markov Logic Networks (MLN) (Singla & Domingos, ICDM 06)
A general framework for constructing Markov networks and their associated features: define the cliques and features of the Markov network using first-order logic.

Markov Logic Networks (MLN)
1. Fix a representation of the domain in first-order logic; entity resolution needs equality. For the paper "Entity Resolution with Markov Logic", Authors: P. Singla, P. Domingos, Venue: ICDM 2006:
HasTitle(P1, Entity Resolution with Markov Logic); HasWord(P1, Entity), …; HasAuthor(P1, A1), …, HasVenue(P1, V1); HasName(A1, P. Singla); HasWord(A1, Singla), …; HasEngram(A1, gla), …; HasName(A2, P. Domingos); HasVenue(V1, ICDM 2006)

Markov Logic Networks (MLN)
2. Capture knowledge about the domain in a set of FOL formulas F.
Facts: ∀x,y1,y2: HasAuthor(x,y1) ∧ HasAuthor(x,y2) → Coauthor(y1,y2)
Evidence: ∀x1,x2,y1,y2: HasWord(x1,y1) ∧ HasWord(x2,y2) ∧ y1=y2 → x1=x2
∀x1,x2,y1,y2: Coauthor(x1,y1) ∧ Coauthor(x2,y2) ∧ x1=x2 → y1=y2
Markov Logic Networks (MLN)
3. Construct the Markov network M as follows. Assume a fixed number of references in the domain (papers, authors, venues, etc.). M contains one binary node for each possible grounding of each predicate appearing in F (= 1 if true, = 0 if false), and one clique/feature for each possible grounding of each formula F_i ∈ F, with weight w_i. Each formula constitutes a clique template; the feature is 1 if the formula is true, 0 otherwise.

[Figure sequence: ground network over nodes such as hasAuthor(P1,A4), hasAuthor(P1,A7), hasAuthor(P3,A2), hasAuthor(P1,A3), Coauthor(A4,A7), Coauthor(A3,A2), equality nodes A2=A7 and A3=A4, HasWord(A4,"Domingos"), HasWord(A3,"Domingos"), and "Domingos"="Domingos". Assuming all nodes are true, the grounding of the fact ∀x,y1,y2: HasAuthor(x,y1) ∧ HasAuthor(x,y2) → Coauthor(y1,y2) evaluates to 1; with hasAuthor(P3,A5) false, it evaluates to 0; the grounding of the evidence formula ∀x1,x2,y1,y2: Coauthor(x1,y1) ∧ Coauthor(x2,y2) ∧ x1=x2 → y1=y2 evaluates to 1.]

Markov Logic Networks (MLN)
4. Probability computation and learning:

  P(X = x) = (1/Z) exp( Σ_{i=1..|F|} w_i n_i(x) )

where n_i(x) is the number of true groundings of formula F_i in world x. Learn the weights w_i as before from the derivative of the log-likelihood, using gradient ascent with MaxWalkSAT to estimate the expected value. Inference: argmax_x P(X = x), using MaxWalkSAT with the learned weights.

Markov Logic Networks (MLN)
Scalability: the network grows at least quadratically in the number of references, which is intractable, due to the extensive use of clique templates. Use blocking techniques (TF-IDF) to exclude irrelevant parts of the network a priori; certain nodes are computed on the fly.
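To make the MLN scoring concrete, here is a toy sketch (not Singla & Domingos' system) that counts the true groundings n_i(x) of one formula in a small world and computes the unnormalized probability; the world and the weight are hypothetical.

```python
# Toy MLN scoring: P(X=x) ∝ exp(Σ_i w_i n_i(x)), where n_i(x) counts the
# true groundings of formula F_i in world x.
import itertools, math

authors = ["A1", "A2"]
papers = ["P1"]
world = {("hasAuthor", "P1", "A1"): True,
         ("hasAuthor", "P1", "A2"): True,
         ("coauthor", "A1", "A2"): True,
         ("coauthor", "A2", "A1"): False}

def n_coauthor_rule(world):
    """Count true groundings of:
    HasAuthor(x,y1) ∧ HasAuthor(x,y2) → Coauthor(y1,y2)."""
    count = 0
    for x, y1, y2 in itertools.product(papers, authors, authors):
        body = world.get(("hasAuthor", x, y1), False) and \
               world.get(("hasAuthor", x, y2), False)
        head = world.get(("coauthor", y1, y2), False)
        count += (not body) or head          # truth value of the implication
    return count

w = 1.5                                      # hypothetical learned weight
score = math.exp(w * n_coauthor_rule(world)) # unnormalized probability
print(score)
```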
Relational ER Algorithms
- Experimental Comparison

Datasets
- CiteSeer: 1,504 citations to machine learning papers (Lawrence et al.); 2,892 references to 1,165 author entities.
- arXiv: 29,555 publications from High Energy Physics (KDD Cup'03); 58,515 references to 9,200 authors.
- Elsevier BioBase: 156,156 biology papers (IBM KDD Challenge '05); 831,991 author references; keywords, topic classifications, language, country and affiliation of the corresponding author, etc.

Baselines
- A: pair-wise duplicate decisions with attributes only (names: SoftTFIDF with Levenshtein, Jaro, Jaro-Winkler; other textual attributes: TF-IDF)
- A*: transitive closure over A
- A+N: add attribute similarity of co-occurring references
- A+N*: transitive closure over A+N
We evaluate pair-wise decisions over references using the F1 measure (the harmonic mean of precision and recall), as sketched below.
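A minimal sketch of pairwise evaluation with the F1 measure, on hypothetical toy data: truth and pred map each reference to its true and predicted entity id.

```python
# Pairwise precision, recall, and F1 over co-referent pairs.
from itertools import combinations

def pairs(assignment):
    refs = sorted(assignment)
    return {(a, b) for a, b in combinations(refs, 2)
            if assignment[a] == assignment[b]}

def pairwise_f1(truth, pred):
    t, p = pairs(truth), pairs(pred)
    tp = len(t & p)
    precision = tp / len(p) if p else 1.0
    recall = tp / len(t) if t else 1.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

truth = {"r1": "e1", "r2": "e1", "r3": "e2", "r4": "e2"}
pred  = {"r1": "e1", "r2": "e1", "r3": "e1", "r4": "e2"}
print(round(pairwise_f1(truth, pred), 2))   # 0.4
```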
ER over the Entire Dataset (F1)

Method   CiteSeer  arXiv  BioBase
A        0.980     0.976  0.568
A*       0.990     0.971  0.559
A+N      0.973     0.938  0.710
A+N*     0.984     0.934  0.753
RC-ER    0.995     0.985  0.818
LDA-ER   0.993     0.981  0.645

- RC-ER and LDA-ER outperform the baselines on all datasets; collective resolution is better than naïve relational resolution.
- RC-ER and the baselines require a threshold as a parameter; the table shows the best achievable performance over all thresholds. The best RC-ER performance is better than LDA-ER, but LDA-ER does not require a similarity threshold. (Bhattacharya and Getoor, TKDD 07)
- CiteSeer: near-perfect resolution, a 22% error reduction. arXiv: 6,500 additional correct resolutions, a 20% error reduction. BioBase: the biggest improvement over the baselines.

Performance for Specific Names (arXiv)

Name        Best F1 for ATTR/ATTR*  F1 for LDA-ER
cho_h       0.80                    1.00
davis_a     0.67                    0.89
kim_s       0.93                    0.99
kim_y       0.93                    0.99
lee_h       0.88                    0.99
lee_j       0.98                    1.00
liu_j       0.95                    0.97
sarkar_s    0.67                    1.00
sato_h      0.82                    0.97
sato_t      0.85                    1.00
shin_h      0.69                    1.00
veselov_a   0.78                    1.00
yamamoto_k  0.29                    1.00
yang_z      0.77                    0.97
zhang_r     0.83                    1.00
zhu_z       0.57                    1.00

Significantly larger improvements for 'ambiguous' names.

Trends in Synthetic Data
[Plots: F1 of A, A*, and RC-ER vs. the percentage of ambiguous attributes, vs. the average #references per hyper-edge, and vs. the average #neighbors per entity] The improvement is bigger with a bigger percentage of ambiguous references, more references per co-occurrence, and more neighbors per entity.

User Interfaces for ER
Combine rich statistical inference models with visual interfaces that support knowledge discovery and understanding. Because the statistical confidence we may have in any of our inferences may be low, it is important to have a human in the loop, to understand and validate results, and to provide feedback. Especially for graph and network data, a well-chosen visual representation, suited to the inference task at hand, can improve the accuracy and confidence of user input.

D-Dupe: An Interactive Tool for ER
http://www.cs.umd.edu/projects/linqs/ddupe
Interactive Entity Resolution in Relational Data: A Visual Analytic Tool and Its Evaluation, Kang, Getoor, Shneiderman, Bilgic, and Licamele, TVCG, to appear.

GeoDDupe: Tool for Interactive ER in Geospatial Data
http://www.cs.umd.edu/projects/linqs/geoddupe
Kang, Sehgal, Getoor, IV 07.

Metadata Alignment
- Input: what type of metadata and data is assumed?
- Output: what type of mapping is produced?
- Objectives: how will the mapping be used?
- Methods

Input to Schema Alignment
Type of data and metadata:
- Attribute names (alignment of Web interfaces)
- Schema structure: relation names, domain types
- Constraints or relationships: nesting or foreign keys; keys or more general constraints
- Data: a full instance of the schema (with or without metadata), or data examples only

Output of Alignment Tasks
Correspondences (matching): a set of pairs of attributes (or possibly relations) that "correspond". A matching may be 1-1 (or sometimes 1-N, N-M), possibly with a confidence score associated with each pair. This is similar to the output of data alignment, but a correspondence rarely means "identical". Many inference techniques for schema matching closely resemble techniques used for data matching.

Output of Alignment Tasks
Schema mappings define the relationship between instances of schemas. Conceptually, a mapping can also be thought of as a binary relation over instances, but this misses an important point: schemas are finite, while the set of possible instances generally is not. We need a finite, declarative expression for the mapping, since inference cannot effectively be done by matching possible instances. The creation of schema mappings uses very different techniques.

Schema Matching
Schema matching methods vary based on the assumed input:
- Metadata: labels only; labels plus domain knowledge (a dictionary, thesaurus, or ontology, which may be generic or specific to the application domain); schema structure
- Schema and data

Label Matching
Based on the labels (names) used in the schema. Method: compute the similarity between labels. The similarity functions include those used for entity matching, e.g., edit distance (the number of operations needed to transform one label into the other) and q-gram based similarity measures. As with data matching, this can be formulated as a pairwise classification problem or as a clustering problem.

Interface Matching
Interface matching ≈ schema matching [He and Chang, SIGMOD 03].

Web Forms Are Not Schemas
- Attribute labels must be sufficiently expressive to permit users to understand the form, with limited use of acronyms, abbreviations, and specialized domain vocabulary. In contrast, structured schemas have a limited (and specialized) user population.
- Limited vocabulary: for easy understanding.
- A large number of similar forms: many sites offer the same services or sell the same products.
- Additional structure: the information is usually organized in some meaningful way in the interface.
- Mark-up language around the attributes can convey relationships between attributes.

Adding Domain Knowledge
- Lexical analysis, e.g., knowledge of abbreviation rules: ValId is more likely to match Value Number than Valid.
- Text similarity: common IR techniques; in semi-structured schemas and interfaces, labels may be pieces of text.
- Dictionary-based (e.g., WordNet): synonyms, hypernyms, and other common natural language relations; particularly popular in matching ontologies.

Using Data
Basic features of the data can guide matching, e.g., the difference between two numeric fields with two decimal places: prices for computer products vs. salary values. Learn a basic classifier based on these features. Alternatively, rely on data matching: apply similarity measures to individual values or to the set of values in an attribute.

Combining Matches
Basic matchers can be combined (see the sketch after the taxonomy below): combine similarity measures; view the schema as a graph and combine measures using a local notion of graph neighborhood; or use global graph similarity measures, e.g., Similarity Flooding [Melnik et al, 2001].

Schema Matching Methods (taxonomy from [E. Rahm, P. Bernstein, 2001])
Individual matchers:
- Schema-based
  - Element-level: linguistic (names, descriptions); constraint-based (types, keys)
  - Structure-level: constraint-based (graph matching)
- Instance-based
  - Element-level: linguistic (IR: word frequencies, key terms); constraint-based (value patterns and ranges)
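A small illustrative sketch of a hybrid matcher that combines a label-based and an instance-based score with fixed weights; both component measures are simple stand-ins, not those of any particular tool.

```python
# Weighted combination of two basic matchers for a pair of attributes.

def sim_label(a: str, b: str) -> float:
    # stand-in: trivial normalized common-prefix measure
    n = 0
    for x, y in zip(a.lower(), b.lower()):
        if x != y:
            break
        n += 1
    return n / max(len(a), len(b))

def sim_instance(vals_a, vals_b) -> float:
    # stand-in: Jaccard over the observed values of the two attributes
    A, B = set(vals_a), set(vals_b)
    return len(A & B) / len(A | B) if A | B else 0.0

def combined(a, b, vals_a, vals_b, w_label=0.6, w_inst=0.4):
    return w_label * sim_label(a, b) + w_inst * sim_instance(vals_a, vals_b)

print(round(combined("PubID", "PublicationID", ["p1", "p2"], ["p2", "p3"]), 2))
```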
Some Trends
- Adding context to matches: use local context in inference [Bohannon et al, VLDB 06]; learn functions on matches, iMAP [Dhamankar et al, SIGMOD 04].
- Human interaction and visual interfaces are very important in matching: tools for schema matching are highly interactive (like D-Dupe); community-based tools draw interaction from a massive population.
- Corpus-based techniques [Madhavan et al, ICDE 05]: leverage previous matches from a corpus; the extra knowledge permits the use of unsupervised techniques.

Schema Matching Methods
There is an extensive research literature and a range of commercial tools: Cupid, COMA, LSD, SF, Artemis, Anchor-Prompt, S-Match, SemInt, AutoMatch, Similarity Flooding, BEA AquaLogic, Microsoft BizTalk Mapper, IBM WebSphere DataStage TX, Stylus Studio's XML Mapper, … Surveys: Rahm & Bernstein, 01; Doan and Halevy, 05.

Schema Mapping
Schema mappings are high-level, declarative assertions that specify the relationship between the sets of possible instances of two or more schemas. Ideally, schema mappings should be: expressive enough to specify many data interoperability tasks, including data exchange and data integration; simple enough to be efficiently manipulated by tools; and easy to maintain and reuse (incrementally create and refine).

Creating Schema Mappings
- Provide a high-level declarative language with common data transformation operators: IBM Express [Shu et al, TODS 77], the CONVERT language, with efficient compilation into executable code.
- Restructuring complex object models [Abiteboul & Hull, TCS 88]: local structural transformations of types; each restructuring primitive has a corresponding query that expresses the data transformation.
- An important development: the declarative creation and manipulation of object identifiers, ILOG [Hull & Yoshikawa, VLDB 90].

Leveraging Matchings
TranScm [Milo & Zohar, VLDB 98]: apply local transformation rules to matched schemas (e.g., Article with nested authors/author elements vs. BibEntry with author). A Descendents function, for example, checks the numbers and types of the children of the current node.

Beyond Transformation Operators
The Toronto/IBM Clio project, begun in 1999: can we further automate the creation of mapping specifications? Can we go beyond local transformation rules? [Miller et al, VLDB 00; Popa et al, VLDB 02; Fuxman et al, VLDB 06], IBM Rational Data Architect.

Mapping Creation
- Leverage attribute matches: user friendly; automatic discovery (even in 1999, the quality of matching tools was already very good).
- Preserve data semantics: discover data associations; use constraints and schema structure.
- Model incompleteness: generate new values for data exchange.
- Produce correct grouping.

Mapping Requirements
From a mapping, we should be able to produce execution scripts that perform data exchange in different execution environments, plus a declarative representation that is easy to manage and maintain (incrementally create and reuse).

What is a Schema Mapping?
Source: company joined with grant on cid: company(cid,name,city), grant(cid,gid,amt,project). If we stop here, we have a plain old view (GAV):

create view org (cid, cname) as (
  select cid, name
  from company, grant
  where company.cid = grant.cid )

Target: org with a nested set funding (FS) joined with financial on aid: org(cid,name,FS), FS(gid,proj,aid,recv), financial(aid,amt,date). Putting a query on the source and a query on the target together, with the shared variables universally quantified and the target's unmatched variables existentially quantified:

∀cid,name,city,gid,amt,project
  company(cid,name,city), grant(cid,gid,amt,project)
  → ∃FS,proj,aid,recv,date
  org(cid,name,FS), FS(gid,proj,aid,recv), financial(aid,amt,date)

Schema Mapping Specification
The relationship between the source and the target is given by a set of mappings M_st that are source-to-target tuple generating dependencies (s-t tgds):

  ∀x ϕ(x) → ∃y ψ(x, y)

where ϕ(x) is a query over the source and ψ(x, y) is a query over the target. In theory, the queries are conjunctive queries; in practice (in tools), the queries may include aggregation, order by, and bag semantics.
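To illustrate how such an s-t tgd drives data exchange, here is a toy sketch (not Clio's engine) that materializes target tuples from the company/grant example, inventing labeled nulls (Skolem values) for the existential variables FS, proj, aid, recv, and date. The relations shown are hypothetical toy instances.

```python
# Executing one s-t tgd: for each source tuple combination satisfying the
# premise, emit target tuples, with fresh labeled nulls for existentials.
from itertools import count

null = (f"N{i}" for i in count())          # stream of fresh labeled nulls

company = [("c1", "Acme", "Toronto")]
grant = [("c1", "g1", 100, "mapping")]

org, funding, financial = [], [], []
for cid, name, city in company:
    for cid2, gid, amt, project in grant:
        if cid == cid2:                    # join on cid, per the premise
            fs, proj, aid, recv, date = (next(null) for _ in range(5))
            org.append((cid, name, fs))
            funding.append((fs, gid, proj, aid, recv))
            financial.append((aid, amt, date))

print(org, funding, financial)
```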
Mapping Generation
Mapping: ∀x ϕ_s(x) → ∃y ψ_t(x, y). Matching determines the shared variables x. How do we find the source (ϕ_s) and target (ψ_t) queries? Use the chase [Maier, Mendelzon, Sagiv 79] to find connections within the schemas. Originally defined to solve the inference problem for relational dependencies, here it is used to generate possible alternative representations of information (logical associations). Generalized to the nested-relational model [Popa et al, VLDB 02].

Associations
Source schema:
expenseDB: Rcd
  companies: Set of Rcd
    company: Rcd (cid, name, city)
  grants: Set of Rcd
    grant: Rcd (cid, gid, amt, sponsor, project)
Target schema:
statDB: Set of Rcd
  cityStat: Rcd (city,
    orgs: Set of Rcd
      org: Rcd (cid, name,
        fundings: Set of Rcd
          funding: Rcd (gid, proj, aid)))
financials: Set of Rcd
  financial: Rcd (aid, date, amount)

Source associations: company(CID,N,C); and company(CID,N,C), grant(CID,GID,A,S,P). Target associations: cityStat(C,Os), Os(CID,N); and cityStat(C,Os), Os(CID,N), funding(GID,P,AID), financial(AID,D,A). From these, candidate mappings:

M1: company(CID,N,C) → ∃C',CID',Os cityStat(C',Os), Os(CID',N)

M2: company(CID,N,C), grant(CID,GID,A,S,P) → ∃C',CID',Os,P',AID',D cityStat(C',Os), Os(CID',N), funding(GID,P',AID'), financial(AID',D,A)

Multiple Associations
Grants may be associated with companies in multiple ways. Association 1: grants ⋈ companies joined on cid = cid. Association 2: grants ⋈ companies joined on sponsor = cid.

Logical Inference
Logical inference is the basis for mapping discovery. Use the schema structure (nesting implies a logical relationship between schemas) and constraints: postulated or discovered constraints; mine for approximate or context-dependent constraints.

Adding More Knowledge
Data-metadata transformations: HePToX [Bonifati et al, VLDB 05]; Tupelo [Fletcher & Wyss, EDBT 07]. If a domain ontology or an ER conceptual schema is available, we can use it in our inference [An, Borgida et al, ICDE 07].

Using a Conceptual Model
[Figure: conceptual model relating Doctor(ssn, clinic), Scientist(ssn, lab), and Employee(ssn, name), over the relational schemas doctor(ssn, name, clinic), scientist(ssn, name, lab), and employee(eid, name, clinic, lab)]
1. ∀ssn,name,clinic ( doctor(ssn,name,clinic) → ∃x,y employee(x,name,clinic,y) )
2. ∀ssn,name,lab ( scientist(ssn,name,lab) → ∃x,y employee(x,name,y,lab) )
3. ∀ssn,name,name',clinic,lab ( doctor(ssn,name,clinic) ∧ scientist(ssn,name',lab) → ∃x employee(x,name,clinic,lab) )

Open Issues
Evaluation: how do we compare matchers or mappers? How do we determine when one matcher or mapper will produce better results?
Are there schema or data characteristics that give us clues as to which type of inference would work best? Can we do matching and mapping collectively?

Recap
- Data alignment: classification based on attributes only; collective inference based on attributes and relations.
- Metadata alignment: schema matching based on classification; logical inference can be used to find schema mappings.
We now look at methods for combining them for ontology alignment.

Ontology Alignment
- Basic idea
- Short overview of OWL Lite
- The ILIADS method
- Experimental evaluation

The Basic Idea
Produce better quality alignments by using data (instances) effectively and by using logical inference (e.g., in OWL) to estimate how good an alignment is. Parameterize the method so that it can be adapted to a wide variety of inputs, and so that the parameters can be adjusted with minimal effort based on the input ontologies.

Defining the Terms
- Entity: everything that has a URI identifier (plus literals).
- Ontology: a software artifact consisting of classes, instances, facts, and axioms.
- Alignment: given two ontologies, find relationships between their respective entities.
- Integration: merge two ontologies under a set of alignments to obtain a consistent result.

Example OWL Lite Ontologies
Ontology 1: (discoveredBy, owl:inverseOf, discoverer); (discoveredBy, owl:type, owl:FunctionalProperty).
Ontology 2: (discoveredBy, owl:inverseOf, discoverer); (associatedWith, owl:type, owl:TransitiveProperty); (resultsFrom, rdfs:subPropertyOf, associatedWith).

Inference in OWL (Lite)
A tableau-based method. Example tableau rule: (p owl:inverseOf p′) ∧ (o1 p o2) ⊢ (o2 p′ o1). Example inconsistency: (o1 owl:sameAs o2) ∧ (o2 owl:differentFrom o1) ⊢ ⊥.

Example Inference
Applying the rules to the example triples above, e.g., starting from (discoveredBy, owl:inverseOf, discoverer) and from (discoveredBy, owl:type, owl:FunctionalProperty).
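A toy sketch of forward chaining with two of these rules (owl:inverseOf and owl:TransitiveProperty) over a small triple set; the instance data (e.g., Escherich) is hypothetical and only echoes the running example.

```python
# Tableau-style forward chaining over RDF-like triples, to a fixed point.
triples = {("discoveredBy", "owl:inverseOf", "discoverer"),
           ("associatedWith", "owl:type", "owl:TransitiveProperty"),
           ("EColi", "discoveredBy", "Escherich"),
           ("A", "associatedWith", "B"), ("B", "associatedWith", "C")}

changed = True
while changed:
    changed = False
    for (s, p, o) in list(triples):
        # rule: (p owl:inverseOf p') ∧ (o1 p o2) => (o2 p' o1)
        for (p1, r, p2) in list(triples):
            if r == "owl:inverseOf" and p == p1:
                new = (o, p2, s)
                if new not in triples:
                    triples.add(new); changed = True
        # rule: transitive property
        if (p, "owl:type", "owl:TransitiveProperty") in triples:
            for (s2, p2, o2) in list(triples):
                if p2 == p and s2 == o:
                    new = (s, p, o2)
                    if new not in triples:
                        triples.add(new); changed = True

print(("Escherich", "discoverer", "EColi") in triples)   # True
print(("A", "associatedWith", "C") in triples)           # True
```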
The Alignment Problem
Find a set of triples (entity1 relation entity2) where entity1 and entity2 are entities from the two ontologies and relation is one of subClassOf, equivalentClass, subPropertyOf, equivalentProperty, sameAs. For integration, the union of the ontologies and the alignment must be consistent.

Ontology Alignment
- The ILIADS method: Udrea, Getoor, Miller, SIGMOD 07

State of the Art
Ideally, alignment should be treated as an optimization problem: choose candidate pairs to maximize an ontology-level similarity measure. This is infeasible in practice, so existing tools approximate it with locally computed similarity measures. Often, this means the "big picture" of the search space is ignored.

Incremental Methods
[Figure sequence] A pair's score is high enough, so we commit to the owl:sameAs relation; this changes the scores of the neighbors; another pair's score is now high enough, so we have found another alignment.

The Core of ILIADS
Compute alignment candidates based on well-established methods: lexical, structural, and extensional similarity. In addition, evaluate how "good" a candidate pair is based on the logical consequences of asserting the alignment. We call this "inference similarity": essentially a look-ahead that estimates the impact of the alignment on the global similarity score.

The ILIADS Algorithm
repeat until there are no more candidates:
1. Compute local similarities
2. Select promising candidates
3. For each candidate:
   a. Perform N inference steps
   b. Update the score with the inference similarity
4. Select the candidate with the best score

Computing Similarity (steps 1-2)
sim(e,e′) = λ_l · sim_lexical(e,e′) + λ_s · sim_structural(e,e′) + λ_e · sim_extensional(e,e′)
- Lexical similarity: Jaro-Winkler and WordNet
- Structural similarity: Jaccard for various neighborhoods
- Extensional similarity: Jaccard on extensions
Select candidates with sim(e,e′) above a threshold.

Performing Inference (step 3a)
For the candidate pair (e,e′): select an axiom and apply the corresponding rule. The logical consequences are the pairs of entities (e_i, e_j) that have just become equivalent. Repeat a small number of times (5).

Updated Score (step 3b)
For the candidate pair (e,e′): compute the product P of sim(e_i, e_j) / (1 − sim(e_i, e_j)) over all logical consequences; then sim_updated(e,e′) = sim(e,e′) × P.

Example Inference Similarity
Assume the candidate pair is in an owl:sameAs relation before starting inference. From (discoveredBy, owl:inverseOf, discoverer) and, remembering that during inference (E-Coli Poisoning, owl:sameAs, E-Coli), from (discoveredBy, owl:type, owl:FunctionalProperty), we obtain a single logical consequence with similarity 0.6. So P = 0.6/0.4 = 1.5, and the updated score is 0.5 × 1.5 = 0.75.
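A minimal sketch of this update rule, reproducing the example's numbers.

```python
# ILIADS-style score update: multiply the local similarity by
# P = Π sim(ei, ej) / (1 - sim(ei, ej)) over the logical consequences.

def updated_score(local_sim, consequence_sims):
    p = 1.0
    for s in consequence_sims:
        p *= s / (1.0 - s)
    return local_sim * p

# one consequence with similarity 0.6: P = 0.6/0.4 = 1.5
print(updated_score(0.5, [0.6]))   # 0.5 * 1.5 = 0.75
```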
The ILIADS Algorithm
It is still a local method: ultimately, it selects the best alignment after each step. But it estimates the global impact of each alignment better; the inference similarity is a look-ahead measure of how good the candidate alignment is.

Other Issues
- ILIADS may not produce a consistent result: inconsistent ontologies arise in less than 0.5% of runs; Pellet is used to check consistency after ILIADS runs.
- How do we decide between subsumption and equivalence for a pair of entities?
- How do we select the promising candidates?
- How do we choose the axioms to apply in the five inference steps?

Subsumption vs. Equivalence
Deciding whether two entities should subsume each other or be equivalent is not clear-cut. A simple extensional technique distinguishes between the two cases: e.g., measure whether the instances of class c are "almost" the same as those of class c′ (=> owl:equivalentClass); if they are a subset, then rdfs:subClassOf.

Deciding Relationship Type
[Figure: instances present in the extensions of both FoodPoisoning and FoodBorneDisease] To measure how much the two classes have in common, we divide the size of the unique part by the size of the common part, obtaining 1/3 and 2/4 respectively. We then decide based on λ_r: if λ_r = 0.49, we choose rdfs:subClassOf; if λ_r = 0.7, we choose owl:equivalentClass.

Cluster Type Selection
Existing tools use various strategies to generate candidates from classes, individuals, or properties. ILIADS supports: randomly selecting from the three types; weighted random (more classes than individuals means classes will be selected more often); classes first / individuals first; alternating at each step.

Axiom Selection Policies
The number of inference steps is small, so the axioms applied must make a difference. ILIADS always selects from the relevant axioms according to a policy: random; property axioms first (e.g., owl:TransitiveProperty); class axioms first (e.g., rdfs:subClassOf); transitive/inverse/functional first (since they tend to "generate" sameAs relationships).

Ontology Alignment
- Experimental evaluation

Experimental Framework
30 pairs of ontologies, from 194 to over 20,000 triples, with ground truth provided by human reviewers. Comparison in terms of recall and precision with FCA-merge and COMA++. Two versions of the algorithm: ILIADS-FP (the parameters with the best overall average quality) and ILIADS-BP (the best parameters for each pair).

[Scatter plot: recall vs. precision per ontology pair for ILIADS-FP, ILIADS-BP, FCA-merge, and COMA++ under the ILIADS-BP parameter setting]

Precision/Recall Comparison (all 30 pairs)

System      Precision  Recall  F1
ILIADS-FP   78.8%      73.9%   76.2%
ILIADS-BP   75.3%      76.3%   75.6%
FCA-merge   73.2%      50.6%   60.1%
COMA++      73.3%      66.7%   70.7%

Precision/Recall for Ontologies with Substantial Instance Data

System      Precision  Recall  F1
ILIADS-FP   80.8%      76.7%   78.7%
ILIADS-BP   77.2%      75.1%   76.1%
FCA-merge   74.0%      49.7%   58.9%
COMA++      72.2%      64.6%   69.0%

False Negative Analysis
[Chart: false negatives broken down into equivalent individuals, equivalent properties, subproperties, equivalent classes, and subclasses, for ILIADS, FCA-merge, and COMA++]

Number of Inference Steps
[Plots: running time (s) and F1 quality as functions of the number of inference steps N] N = 5 inference steps was chosen as the best compromise between running time and F1 quality.

Cluster Type / Axiom Selection Policies
[Bar chart: F1 quality by cluster type selection policy and by axiom selection policy (random axiom, property axioms first, class axioms first, transitive/inverse/functional first)]

And the Result Is…
[Figure: the aligned example ontologies, with (discoveredBy, owl:inverseOf, discoverer), (discoveredBy, owl:type, owl:FunctionalProperty), (associatedWith, owl:type, owl:TransitiveProperty), and (resultsFrom, rdfs:subPropertyOf, associatedWith)]

Choosing the Parameters
The structural similarity coefficients strongly correlate with the average degree of the nodes; the structural coefficient for classes correlates with the number of rdfs:subClassOf relationships; the extensional coefficients correlate with the ratio of instances to classes.

Parameter Sensitivity
The structural coefficients are stable around the ILIADS-FP setting for 25 out of 30 pairs; the remaining 5 pairs have large differences between their average node degrees. The extensional coefficients are stable around the ILIADS-FP setting for 21 pairs; the remaining 9 pairs have a low ratio of instances to classes (< 1.9).

Experimental Results Summary
ILIADS has better quality than COMA++ and FCA-merge, with a significant difference for all pairs with substantial instance data. Matching properties is the major cause of false negatives for all three systems, but ILIADS does better at matching instances. Structural and extensional coefficients correlate with structural properties and are stable for ontologies with similar structure.

ILIADS Summary
A new algorithm that tightly integrates statistical matching and logical inference to produce better quality alignments. We found intriguing correlations between structure and matching strategies. Improvement over existing systems: 25% higher quality than FCA-merge, and 11% higher recall than COMA++ at comparable precision.

HOMER: Tool for Ontology Alignment
http://www.cs.umd.edu/projects/linqs/iliads

Information Alignment: Summary
The process of finding, modeling and using the correspondences or connections that place information artifacts in relation to each other. We need new, flexible, adaptive methods for information alignment which can take context into account and which can exploit both logical and probabilistic consequences.

Open Issues
- Query-time data and metadata alignment.
- The notion of multiple alignments: no single one is best; we need to keep track of and make use of lineage.
- We need to understand which information is most informative and useful for alignment: data, structure, metadata, etc.
- We need methods for evaluation and quality measures.

Thanks!
27% 19% 13% 9% 11% 8% 9% 4% ILIADS FCA-merge COMA++ 28% Number of inference steps The number of 5 inference steps was chosen as the best compromise between: F1 quality 18000 0.78 16000 0.76 14000 0.74 12000 0.72 F-1 quality Running time[s] Running time 10000 8000 6000 0.7 0.68 0.66 4000 0.64 2000 0.62 0 0.6 0 2 4 6 N 8 10 12 0 2 4 6 N 8 10 Cluster type/axiom selection policies 0.8 0.7 0.65 Trans/Inv/Func 0.6 Classes first 0.55 Prop. first Random axiom Cluster type selection policy Axiom selection policy F1 Quality 0.75 And the result is... (discoveredBy, owl:inverseOf, discoverer); (discoveredBy, owl:type, owl:FunctionalProperty) (discoveredBy, owl:inverseOf, discoverer); (associatedWith, owl:type, owl:TransitiveProperty) (resultsF rom, rdfs:subPropertyOf, associatedWith) Choosing the parameters The structural similarity coefficients strongly correlate with the average degree of the node The structural coefficient for classes correlates with the number of rdfs:subClassOf relationships The extensional coefficients correlate with the ratio of instance to classes Parameter sensitivity Structural coefficients are stable around the ILIADS- FP setting for 25 out of 30 pairs The remaining 5 pairs have large differences between their average node degrees Extensional coefficients are stable around the ILIADS- FP setting for 21 pairs The remaining 9 pairs have a low ratio of instances to classes (< 1.9) Experimental results summary ILIADS has better quality than COMA++ and FCA- merge, with a significant difference for all pairs with substantial instance data Matching properties is the major cause of false negatives for all three systems, but ILIADS does better at matching instances Structural and extensional coefficients correlate with structural properties and are stable for ontologies with similar structure ILLIADS Summary New algorithm that tightly integrates statistical matching and logical inference to produce better quality alignments Found intriguing correlations between structure and matching strategies Improvement over existing systems 25% higher quality than FCA-merge, 11% higher recall than COMA++ at comparable precision HOMER: Tool for Ontology Alignment http://www.cs.umd.edu/projects/linqs/iliads ICDE 2008 Getoor, Miller --- Data & Metadata Alignment 188 Information Alignment: Summary The process of finding, modeling and using the correspondences or connections that place information artifacts in relation to each other Need new, flexible, adaptive methods for information alignment which can take context into account and which can exploit both logical and probabilistic consequences ICDE 2008 Getoor, Miller --- Data & Metadata Alignment 189 Open Issues Query-time data and metadata alignment Notion of multiple alignments; no single one best Need to keep track and make use of lineage Need to understand which information is most informative and useful for alignment: data, structure, metadata, etc. Need for methods for evaluation and quality measures ICDE 2008 Getoor, Miller --- Data & Metadata Alignment 190 Thanks! ICDE 2008 Getoor, Miller --- Data & Metadata Alignment 191