CSE 636 Data Integration
Overview

Data Warehouse Architecture
[Diagram: Users and Applications; OLAP / Decision Support and Data Cubes / Data Mining; Relational Database (Warehouse); ETL Tools (Extract-Transform-Load) and Data Cleaning; three Data Sources]

Virtual Integration Architecture
• Leave the data in the sources
• When a query comes in:
  – Determine the sources relevant to the query
  – Break down the query into sub-queries for the sources
  – Get the answers from the sources, filter them if needed, and combine them appropriately
• Data is fresh
• Otherwise known as On Demand Integration

Virtual Integration Architecture (Design-Time)
[Diagram: End Users and Applications; Global Schema; Schema Mappings; Data Sources, each with a Local Schema]
Sources can be:
• Relational DBs
• Excel Files
• Web Sites
• Web Services

Schema Mappings
• Differences in:
  – Names in schema
  – Attribute grouping
  – Coverage of databases
  – Granularity and format of attributes

Inventory Database A
    Books(Title, ISBN, Price, DiscountPrice, Edition)
    Authors(ISBN, FirstName, LastName)
    BookCategories(ISBN, Category)
    CDs(Album, ASIN, Price, DiscountPrice, Studio)
    CDCategories(ASIN, Category)
    Artists(ASIN, ArtistName, GroupName)

Inventory Database B
    BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)

Issues for Schema Mappings
[Design-Time diagram: End Users and Applications; Global Schema; Schema Mappings; Data Sources, each with a Local Schema]
• What formalisms to express them?
• How to create them?
• Can we discover them somehow?
• How do we use them?
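One possible way to express and use a mapping between the two inventory schemas above is to define the global relation as a view over a source relation. The sketch below is only an illustration under stated assumptions: it uses Python's standard sqlite3 module as a stand-in for source B, the correspondences ISBN ↔ ItemID and Price ↔ SuggestedPrice are taken from the reformulation example later in the deck, the function names are hypothetical, and the sample rows are illustrative data.

```python
import sqlite3


def build_source_b() -> sqlite3.Connection:
    """Create an in-memory stand-in for Inventory Database B (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE BooksAndMusic ("
        " Title TEXT, Author TEXT, Publisher TEXT, ItemID INTEGER,"
        " ItemType TEXT, SuggestedPrice REAL, Categories TEXT, Keywords TEXT)"
    )
    conn.executemany(
        "INSERT INTO BooksAndMusic VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [
            ("On the Road", "Jack Kerouac", None, 123, "Books", 10.86, None, None),
            ("Some Album", None, None, 999, "Music", 5.00, None, None),
        ],
    )
    return conn


def add_books_mapping(conn: sqlite3.Connection) -> None:
    """Express the mapping as a view: global Books defined over BooksAndMusic.

    Assumed correspondences: ISBN <-> ItemID, Price <-> SuggestedPrice,
    restricted to rows whose ItemType is 'Books'.
    """
    conn.execute(
        "CREATE VIEW Books AS"
        " SELECT Title, ItemID AS ISBN, SuggestedPrice AS Price"
        " FROM BooksAndMusic"
        " WHERE ItemType = 'Books'"
    )


def query_global_books(conn: sqlite3.Connection, title: str) -> list[tuple]:
    """Pose a query against the global schema; the view does the rewriting."""
    return conn.execute(
        "SELECT ISBN, Price FROM Books WHERE Title = ?", (title,)
    ).fetchall()


conn = build_source_b()
add_books_mapping(conn)
rows = query_global_books(conn, "On the Road")
print(rows)  # [(123, 10.86)]
```

In this sketch the database engine unfolds the view for us; a mediator over autonomous sources would have to perform the equivalent rewriting itself, since the sources do not share one engine.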
Virtual Integration Architecture (Run-Time)
[Diagram: Query in, Result out; Mediator (Reformulation, Optimization, Execution) over the Global Schema; Wrappers; Data Sources, each with a Local Schema]

Issues for Query Processing: Reformulation
[Diagram: Query; Mediator (Reformulation); Global Schema; Data Sources, each with a Local Schema]
• User queries refer to the global schema
• Data is stored in the sources in a local schema
• Rewriting algorithms

Issues for Query Processing: Reformulation
Global Schema: Books(Title, ISBN, Price, DiscountPrice, Edition)
    SELECT ISBN, Price
    FROM Books
    WHERE Title = 'on the road'
Local Schema A: BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
    SELECT ItemID, SuggestedPrice
    FROM BooksAndMusic
    WHERE Title = 'on the road' AND ItemType = 'Books'

Issues for Query Processing: Query Translation
[Diagram: Query; Mediator (Reformulation, Optimization, Execution) over the Global Schema; Wrapper; Data Sources, each with a Local Schema]
• Different query languages

Issues for Query Processing: Query Translation
Global Schema: Books(Title, ISBN, Price, DiscountPrice, Edition)
    SELECT ISBN, Price
    FROM Books
    WHERE Title = 'on the road'
Local Source A:
    http://www.amazon.com/homepage.html?ItemType=Books&Title=on+the+road

Issues for Query Processing: Data Translation
[Diagram: Query; Mediator (Reformulation, Optimization, Execution) over the Global Schema; Wrapper; Data Sources, each with a Local Schema]
• Different data models

Issues for Query Processing: Data Translation
Global Schema: Books(Title, ISBN, Price, DiscountPrice, Edition)
    Title        ISBN  Price
    On the Road  123   10.86
    …            …     …
Local Result A:
    …
    <table> <tr> <td>
      <a href=/details?isbn=123> <b>On the Road</b> </a> -- by Jack Kerouac; Paperback <br>
      <a href=/details?isbn=123> Buy new </a> :<b class=price>$10.86</b>
    </td> </tr> </table>

Issues for Query Processing: Query Execution
[Diagram: Query; Mediator (Reformulation, Optimization, Execution) over the Global Schema; Wrappers; Data Sources, each with a Local Schema]
• Access as many data sources as needed
• Duplicate/redundant and irrelevant data
• Limited query capabilities

Issues for Query Processing: Limited Query Capabilities
Global Schema: Books(Title, ISBN, Price, DiscountPrice, Edition)
    SELECT ISBN, Price, DiscountPrice
    FROM Books
    WHERE Title = 'on the road'
Local Schema A: BooksAndMusic
Local Schema B: DiscountBooks
Steps (labelled A to E on the slide):
  A. Query sent to source A:
         SELECT ItemID, SuggestedPrice
         FROM BooksAndMusic
         WHERE Title = 'on the road'
  B. Source A returns ItemID = 123, SuggestedPrice = 10.86
  C. Query sent to source B, using the value obtained from source A:
         SELECT GreatPrice
         FROM DiscountBooks
         WHERE ISBN = 123
  D. Source B returns GreatPrice = 8.86
  E. Combined answer: ISBN = 123, Price = 10.86, DiscountPrice = 8.86

Issues for Query Processing: Query Answering
[Diagram: Query in, Result out; Mediator (Reformulation, Optimization, Execution) over the Global Schema; Wrappers; Data Sources, each with a Local Schema]
• Combine the results and further process them if needed
• Mainly union and merge
• Inconsistencies

Issues for Query Processing: Query Answering (Union)
Result:
    ISBN  Price
    123   10.86
    456   8.86
Source results:
    ItemID  SuggestedPrice
    123     10.86

    ISBN  GreatPrice
    456   8.86

Issues for Query Processing: Query Answering (Merge)
Result (primary key: ISBN):
    ISBN  Title        Edition  Price
    123   On the Road  2nd      8.86
Source results (primary keys: ItemID and ISBN):
    ItemID  Title
    123     On the Road

    ISBN  Edition  Price
    123   2nd      8.86

Issues for Query Processing: Query Answering (Inconsistencies)
Result (primary key: ISBN):
    ISBN  Title        Edition  Price
    123   On the Road  ???      8.86
Source results (primary keys: ItemID and ISBN):
    ItemID  Title        Edition
    123     On the Road  1st

    ISBN  Edition  Price
    123   2nd      8.86

Peer-Based Integration
[Diagram: five peers (Peer 1 to Peer 5), with queries posed at individual peers]

Peer-Based Integration
• No need for a central mediated schema
• Peers serve as mediators for other peers
• A peer can be both a server and a client
• Semantic relationships are specified locally (between small sets of peers)
• Queries are posed using the peer's schema
• Answers come from anywhere in the system
• This is not P2P file sharing.
  – Data has rich semantics

References
• Information Integration
  – Maurizio Lenzerini
  – Eighteenth International Joint Conference on Artificial Intelligence, IJCAI 2003
  – Invited Tutorial
• Data Integration: a Status Report
  – Alon Halevy
  – German Database Conference (BTW), 2003
  – Invited Talk
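Looking back at the Query Answering slides, the union and merge steps, together with the detection of inconsistencies, might be sketched as follows. This is a minimal illustration and not the course's algorithm: the function names are hypothetical, source tuples are assumed to be already translated into global attribute names (for example, ItemID renamed to ISBN), and conflicting values are only flagged, mirroring the "???" on the slide, rather than resolved.

```python
from typing import Any

Row = dict[str, Any]


def union_answers(*answer_sets: list[tuple]) -> list[tuple]:
    """Union the tuples returned by several sources, dropping exact duplicates."""
    seen: set[tuple] = set()
    result: list[tuple] = []
    for answers in answer_sets:
        for row in answers:
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result


def merge_records(*records: Row) -> tuple[Row, dict[str, set]]:
    """Merge records that describe the same entity (same key value).

    Returns the merged record plus a dict of conflicting attributes.
    A conflicting attribute is set to None in the merged record.
    """
    merged: Row = {}
    conflicts: dict[str, set] = {}
    for record in records:
        for attr, value in record.items():
            if attr not in merged:
                merged[attr] = value
            elif merged[attr] != value:
                # Remember every distinct value seen for this attribute.
                conflicts.setdefault(attr, {merged[attr]}).add(value)
    for attr in conflicts:
        merged[attr] = None  # unresolved, like "???" on the slide
    return merged, conflicts


# Union slide: one tuple from each source.
union = union_answers([(123, 10.86)], [(456, 8.86)])

# Merge slide: the sources agree, so nothing conflicts.
clean, no_conflicts = merge_records(
    {"ISBN": 123, "Title": "On the Road"},
    {"ISBN": 123, "Edition": "2nd", "Price": 8.86},
)

# Inconsistencies slide: the sources disagree on Edition.
merged, conflicts = merge_records(
    {"ISBN": 123, "Title": "On the Road", "Edition": "1st"},
    {"ISBN": 123, "Edition": "2nd", "Price": 8.86},
)
```

Flagging rather than resolving keeps the sketch neutral; a real mediator would need some policy, such as preferring a trusted source, to choose between the conflicting editions.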