Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Automatic Schema Matching Seminar on Databases and the Internet Yaron Naveh January 2006 Articles A survey of approaches to automatic schema matching Rahm & Bernstein (2001) Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach He, Chen-Chuan Chang & Han (2004) Automatic Schema Matching, SDBI, 2006 2 Contents Problem Definition Applications Classic Approaches Correlation Mining Approach Automatic Schema Matching, SDBI, 2006 3 Match Definition Authors Authors ID AID Name AName NumOfBooks ANumOfBooks A match is a mapping between elements of two schemas that correspond semantically to each other Automatic Schema Matching, SDBI, 2006 4 Match Properties Authors Authors (1:1) ID ID Name NumOfBooks FName ? (1:n) LName ? YearOfBirth • (n:m) matching also possible Automatic Schema Matching, SDBI, 2006 5 Match Properties (cont’d) Authors Authors ID ID Name Name Salary ($) Salary (NIS) • Salary(NIS) = Salary($) * 4.55 • We will not find the function, just the attributes Automatic Schema Matching, SDBI, 2006 6 Match Properties (cont’d) Employees Employees EmpName EmpName DeptID Join DeptName Departments DeptID DeptName One relation is mapped to two others Automatic Schema Matching, SDBI, 2006 7 Match Properties (cont’d) Lessons Lessons Teacher Teacher StartTime EndTime ? ? Time • Too hard for PC! • PC should only suggest mappings to the user Automatic Schema Matching, SDBI, 2006 8 Match Properties (cont’d) So maybe it can all be done manually? Field1 Field1 Field2 Field2 Field3 Field3 Field4 Field4 Field5 Field5 Field6 Field6 Field7 Field7 Field8 Field8 Field9 Field9 field10 field10 An automated tool can be helpful here… Automatic Schema Matching, SDBI, 2006 9 Match Generalization We have defined a match for the relational model. There are other interesting models: … Books <author> Authors <id>1</id> ID <name>Calvino</name> </author> Authors … Automatic Schema Matching, SDBI, 2006 Name 10 Match Generalization (cont’d) Define a Schema to be a set of elements connected by some structure Use the natural correspondence: • nodes and edges in graphs • elements, subelements, and IDREFs in XML … Automatic Schema Matching, SDBI, 2006 11 Contents Problem Definition Applications Classic Approaches Correlation Mining Approach Automatic Schema Matching, SDBI, 2006 12 Data Migration Migrate data from old DB to new DB Old Forum New Forum Date Time From Writer Message Message IsVisible Special case: Data warehouse Automatic Schema Matching, SDBI, 2006 ResponseTo 13 E-Commerce Map between different message formats Book Store <book> <name>The Invisible Cities</name> <price>50</price> </book> General Store <product> <name>book</name> <price>50</price> </product> Automatic Schema Matching, SDBI, 2006 14 Global Query Interface You want to build a Meta-Querier. However… GOOGLE <input name=search> MSN <input name=q> <select name=type> Yahoo <input name=qry> <input name=type> Automatic Schema Matching, SDBI, 2006 15 Global Query Interface (cont’d) Solution: Reduce the html form to its “schema” GOOGLE Yahoo MSN Search Qry q Type Type Automatic Schema Matching, SDBI, 2006 16 Semantic Query Processing Keywords search scenario Authors Find: Author + Ram + Oren Id Author Name Ram Oren SELECT * WHERE Id=‘Ram Oren’ ? How does this differ from previous examples? SELECT * WHERE Name=‘Ram Oren’ ? Automatic Schema Matching, SDBI, 2006 17 Contents Problem Definition Applications Classic Approaches Correlation Mining Approach Automatic Schema Matching, SDBI, 2006 18 Matchers There are a few algorithms to map attributes of 2 schemas Define such an algorithm as a matcher Define a hybrid matcher as a matcher that combines results from other matchers Automatic Schema Matching, SDBI, 2006 19 Schema-based Vs. Instance-based Two ways to perform a match: • Use schema data (field name, type, constraints…) • Use data from the table Automatic Schema Matching, SDBI, 2006 20 Instance-based Two options for using data from the table: • Build a schema from instance data, then use schema matchers • Use the data directly. Example: Books BookID 1 TotPages TotPrice 500 50 2 3 400 450 What is TotalP? Books BookID TotalP 1 60 40 90 Automatic Schema Matching, SDBI, 2006 21 Instance-based (cont’d) When will we use/not use instance based matchers? • Useful when no schema data is available • Not useful when no instance data is available… Automatic Schema Matching, SDBI, 2006 22 Schema-Based What useful data is there in the schema? • Element’s name • Description • Data Type • Relationships • Constraints Automatic Schema Matching, SDBI, 2006 23 Schema-Based: Name Matching Map elements with similar names: • String equality • Common substrings (Birthday --> DayOfBirth) • Canonical names (CName --> Customer Name) • Synonyms (Car --> Automobile) • Hypernyms (Book is-a Publication) • Soundex (ShipTo --> Ship2) • User provided (Issue --> Bug) Automatic Schema Matching, SDBI, 2006 24 Schema-Based: Description Map elements based on description Schema A empn //employee name Schema B name //name of employee Automatic Schema Matching, SDBI, 2006 25 Schema-Based: Constraint Based Map elements based on Constraints: • Data Types • Unique, Primary, Foreign Permissions Employees Employees Payments ID Name Name ID PLevel PID PID Sum ? Automatic Schema Matching, SDBI, 2006 26 Reuse Previous Matching Schema A Schema B Schema C Name AName Author Salary Income Money • Get mapping AC From mappings AB and BC • A partial reuse is also possible (e.g. on some of the attributes) • Be aware of the domain: salary and income are not always the same! Automatic Schema Matching, SDBI, 2006 27 Complexity • We must compare every subgroup of attributes in schema A to every subgroup in schema B • Exponential in the number of attributes • However, we can assume the number of attributes is blocked… • Also check (n:m) matching only for n,m<C for some C Automatic Schema Matching, SDBI, 2006 28 Contents Problem Definition Applications Classic Approaches Correlation Mining Approach Automatic Schema Matching, SDBI, 2006 29 Data Mining Data Mining is the process of discovering patterns in data, usually stored in a Database. Sells TransID Item 1 Book 1 Pencil 2 Book 2 Soap 3 Book 3 Soap Which items are likely to co-appear? Automatic Schema Matching, SDBI, 2006 30 Data Mining (cont’d) Sells TransID Item 1 Book 1 Pencil 2 Book 2 Soap 3 Book 3 Soap Support of an itemset: the fraction of transactions that contain all items in the itemset. What is the support for {Book}? 1 And for {Book, Soap}? 0.666 The A-Priori property: the support for any subset of an itemset is bigger than the support for the itemset Automatic Schema Matching, SDBI, 2006 31 Data Mining (cont’d) Algorithm to find frequent itemsets: Sells TransID Item 1 Book 1 Pencil 2 Book 2 Soap 3 Book 3 Soap 1. Define a threshold minSupport for “frequent” itemsets 2. Calculate support for all itemsets of size (1) 3. Calculate support for itemsets of size Why can 2,3,4… we stop? 4. For each size k save the frequent itemsets 5. Stop when there are no frequent itemsets in size K. Automatic Schema Matching, SDBI, 2006 32 Data Mining (cont’d) Sells Example: TransID Item 1. Set minSupport = 0.5 1 Book 2. 1 Pencil S({Book})=1, S({Pencil})=0.33, S({Soap})=0.666 2 Book 3. S({Book, Soap})=0.666 2 Soap 3 Book 4. S({Book, Soap, Pencil})=0 3 Soap Where is {Soap, Pencil}? Automatic Schema Matching, SDBI, 2006 33 Back to Schema Matching… Authors Id Salary Name Year Id Author Id Id Id AuthorFirst First FirstName AuthorLast Last LastName What is the YearBirth Income difference from the Goal: Map {Name} to {Author}, {Salary} supermarket to {Income}… example? Idea:{Name} and {Author} are unlikely to appear together Solution: go to the supermarket, but instead of food buy attributes! Automatic Schema Matching, SDBI, 2006 34 The Algorithm Input: set of m schemas Id Salary Name Year Id Author Id Id Id AuthorFirst First FirstName AuthorLast Last LastName YearBirth Income Output: set of n-ary mappings {Name}:{Author}:{AuthorFirst, AuthorLast}:{First,Last}… {Salary}:{Income} {Year}:{YearBirth} Automatic Schema Matching, SDBI, 2006 35 Algorithm Naive Algorithm 1. Make a list L of all attributes from all schemas L = {Name, Salary, FirstName, LastName, Author, First, Last…} 2. For each pair of attributes, calculate their support (how often they appear together) S(Name, Salary) = 0.4 S(First, Last) = 0.95 S(Last, Name) = 0.1 Automatic Schema Matching, SDBI, 2006 36 Algorithm (Cont’d) 3. Choose groups with low support S(Name, Salary) = 0.4 S(First, Last) = 0.95 S(Last, Name) = 0.1 4. Using the A-Priory property calculate support for groups of sizes 3,4,5… S(Name, LastName, Salary) = 0 S(First, Last, Salary) = 0.1 5. Return all groups with low support Automatic Schema Matching, SDBI, 2006 37 Algorithm (Cont’d) The algorithm is naive. suppose we have negative correlation for this: {name, author} Then we also have negative correlation for this: {name, author, salary} {name, author, yearOfBirth} Actually for any attribute X we have: {name, author, X} Automatic Schema Matching, SDBI, 2006 38 Improvement Improvement: Define the support (s) of an itemset {a,b,c…} to be MAX { s(a,b), s(b,c), s(a,c) … } Example: s(name, author)=0.1 s(name, salary)=0.5 What is the logic behind s(salary, author)=0.6 this? s(name,author,salary)=MAX (0.1,0.5,0.6)=0.6 Now the support can go up so checking it is not trivial Automatic Schema Matching, SDBI, 2006 39 Generalizing the algorithm Now the algorithm finds all groups of attributes (a,b,c…) s.t. none of the pairs appears together. Hopefully these are attributes with the same semantic: {name, author} {salary, payments} … But what about this? ({first,last}, {name}) Currently we find only (1:1) matching For (n:m) we need to preprocess… Automatic Schema Matching, SDBI, 2006 40 Preprocess Pre-Process for the algorithm: 1. Make a list L of all attributes from all schemas L = {Name, Salary, FirstName, LastName, Author, First, Last…} 2. Run the normal A-Priori algorithm (find all attributes that DO appear together) S(first, last)=0.9 S(firstName,lastName)=0.85 Automatic Schema Matching, SDBI, 2006 41 Preprocess 3. For each schema S in the input: For each frequent attributes group A: S Id First If A intersects with S than add new attribute “A” to S S’ A Id {First,Last} Last Last 4. Run the previous algorithm on S1’, S2’… to find negative correlation Now we can find groups like: ({first,last}, {name}) First Automatic Schema Matching, SDBI, 2006 First, Last 42 Still Not Perfect… Suppose we found these mappings: {first,last}:{name}:{author} {first, yearOfBirth}:{birthDate} {yearOfBirth, monthOfBirth}:{birthDate} There is a contradiction! Automatic Schema Matching, SDBI, 2006 43 Solution Solution: rank the mappings according to the support of the lowest pair in each mapping 1. {first,last}:{name}:{author} 2. {first, yearOfBirth}:{birthDate} 3. {yearOfBirth, monthOfBirth}:{birthDate} Add the top rank to the results 1. {first,last}:{name}:{author} Delete contradictions to this rank: 2. {first, yearOfBirth}:{birthDate} X Process next mapping 3. {yearOfBirth, monthOfBirth}:{birthDate} Automatic Schema Matching, SDBI, 2006 44 Attributes with the same name Step 1 of the algorithm (reminder): Make a list S of all attributes from all schemas S = {Name, Salary, FirstName, LastName, Author, First, Last…} This means that two attributes with the same name are always considered the same. Payment (longint) ? Payment (datetime) Solution: add the type to the name Id Id_Int First First_String Last Last_String Automatic Schema Matching, SDBI, 2006 45 Correlation Measure Id Salary Name Id Author Year Id Id Id AuthorFirst First FirstName AuthorLast Last LastName YearBirth Income The rare attribute problem: s(Income, Id)=0.2 So Income=Id? Automatic Schema Matching, SDBI, 2006 46 Correlation Measure (cont’d) Id Salary Name Year Id Author Id Id Id AuthorFirst First FirstName AuthorLast Last LastName YearBirth Income The sparseness problem: s(Salary, Income)=0 If Salary=Income than what is their equivalence in the other tables? Automatic Schema Matching, SDBI, 2006 47 Correlation Measure (cont’d) Support of an itemset: the fraction of transactions that contain all items in the itemset. There are other ways to calculate support: Let A,B be two attributes. Define A ^A B f11 f10 f1+ ^B f01 f00 f0+ f+1 f+0 f++ f11: the number of schemas where both A,B appears f10: number of schemas where only A appears … f1+: f11+f10 Automatic Schema Matching, SDBI, 2006 48 Correlation Measure (cont’d) B ^B A f11 f01 f+1 ^A f10 f00 f+0 f1+ f0+ f++ We used: support=f11/f++ Lift: f00f11/f10f11 H-measure f01f10/f+1f1+ Every measure fits a different situation For example, in the matching problem we want to “punish” attributes that co-appear Automatic Schema Matching, SDBI, 2006 Id Salary Name Year 49 Applications This approach can only be used when we have many schemas • Data Migration? • Web query interfaces. Example: El-Al.Com Arkia.Com American Airlines.Com •Adult •Adult •Passengers •Child •Child •To •Infant •Destination Is it possible to use the algorithm for migration by running it on many Automatic random schemas? Schema Matching, SDBI, 2006 50 Complexity The A-Priory algorithm is O(2^n) Usually there are only few correlations, so in step (k+1) we consider just a few from the groups of size k Automatic Schema Matching, SDBI, 2006 51 Automatic Schema Matching, SDBI, 2006 52