Download Data/Information Integration/aggregation

Information Integration Kambhampati & Knoblock Information Integration on the Web (MA-1) 1 Services Source Trust Webpages Structured data Sensors (streaming Data) Source Fusion/ Query Planning Mediator Executor Answers Monitor Kambhampati & Knoblock Information Integration on the Web (MA-1) 2 Services • • • • Source Trust User queries refer to the mediated schema. Data is stored in the sources in a local schema. Content descriptions provide the semantic mappings between the different schemas. Mediator uses the descriptions to translate user queries into queries on the sources. Ontologies; Source/Service Descriptions Webpages Structured data Sensors (streaming Data) Source Fusion/ Query Planning Needs to handle: Multiple objectives, Service composition, Source quality & overlap DWIM Executor Answers Kambhampati & Knoblock Probing Queries Needs to handle Source/network Interruptions, Runtime uncertainity, replanning Information Integration on the Web (MA-1) Monitor 3 Who is dying to have it? (Applications) • WWW: – Comparison shopping – Portals integrating data from multiple sources – B2B, electronic marketplaces • Science and culture: – Medical genetics: integrating genomic data – Astrophysics: monitoring events in the sky. – Culture: uniform access to all cultural databases produced by countries in Europe provinces in Canada • Enterprise data integration – An average company has 49 different databases and spends 35% of its IT dollars on integration efforts Kambhampati & Knoblock Information Integration on the Web (MA-1) 4 Why isn’t this just • Search engines do text-based retrieval of URLS – Works reasonably well for single document texts, or for finding sites based on single document text • Cannot integrate information from multiple documents • Cannot do effective “query relaxation” or generalization • Cannot link documents and databases • The aim of Information integration is to support query processing over structured and semistructured sources as well as services. Kambhampati & Knoblock Information Integration on the Web (MA-1) 9 Is it like Expedia/Travelocity/Orbitz… • Surpringly, NO! • The online travel sites don’t quite need to do data integration; they just use SABRE – SABRE was started in the 60’s as a joint project between American Airlines and IBM – It is the de facto database for most of the airline industry (who voluntarily enter their data into it) • There are very few airlines that buck the SABRE trend—SouthWest airlines is one (which is why many online sites don’t bother with South West) • So, online travel sites really are talking to a single database (not multiple data sources)… – To be sure, online travel sources do have to solve a hard problem. Finding an optimal fare (even at a given time) is basically computationally intractable (not to mention the issues brought in by fluctuating fares). So don’t be so hard on yourself • Check out http://www.maa.org/devlin/devlin_09_02.html Kambhampati & Knoblock Information Integration on the Web (MA-1) 10 Why isn’t this just Databases Distributed Databases • No common schema – Sources with heterogeneous schemas (and ontologies) – Semi-structured sources • Legacy Sources – Not relational-complete – Variety of access/process limitations Query (SQL) Database Manager (DBMS) Database • Autonomous sources -Storage mgmt (relational) -Query processing Answer – No central administration -View management (relation) -(Transaction processing) – Uncontrolled source content overlap • Unpredictable run-time behavior – Makes query execution hard • Predominantly “Read-only” – Could be a blessing—less worry about transaction management – (although the push now is to also support transactions on web) Kambhampati & Knoblock Information Integration on the Web (MA-1) 11 Are we talking “comparison shopping” agents? • Certainly closer to the aims of these • But: • Wider focus • Consider larger range of databases • Consider services • Implies more challenges • “warehousing” may not work • Manual source characterization/ integration won’t scale-up Kambhampati & Knoblock Junglee Netbot Information Integration on the Web (MA-1) DealPilot.Com 12 11/7 Search Engine Reported Size Page Depth Google 8.1 billion 101K MSN 5.0 billion 150K Yahoo 4.2 billion (estimate) 500K Ask Jeeves 2.5 billion 101K+ --Project part C due 11/30 --Homework 4 due 11/14 --Google guy talk on 11/18 (TBA) Kambhampati & Knoblock Information Integration on the Web (MA-1) 14 Services • • • • Source Trust User queries refer to the mediated schema. Data is stored in the sources in a local schema. Content descriptions provide the semantic mappings between the different schemas. Mediator uses the descriptions to translate user queries into queries on the sources. Ontologies; Source/Service Descriptions Webpages Structured data Sensors (streaming Data) Source Fusion/ Query Planning Needs to handle: Multiple objectives, Service composition, Source quality & overlap DWIM Executor Answers Kambhampati & Knoblock Probing Queries Needs to handle Source/network Interruptions, Runtime uncertainity, replanning Information Integration on the Web (MA-1) Monitor 15 Information Integration Linkage • Discovering information sources (e.g. deep web modeling, schema learning, …) • Gathering data (e.g., wrapper learning & information extraction, federated search, …) Queries • Querying integrated • Cleaning data (e.g., de-duping and linking records) to form a single [virtual] database information sources (e.g. queries to views, execution of web-based queries, …) • Data mining & analyzing integrated information (e.g., collaborative filtering/classification learning using extracted data, …) Different “Integration” scenarios • “Data Aggregation” (Vertical) • – All sources export (parts of a) single relation • No need for joins etc • Could be warehouse or virtual – E.g. BibFinder, Junglee, Employeds etc – Challenges: Schema mapping; data overlap • “Data Linking” (Horizontal) – Joins over multiple relations stored in multiple DB • E.g. Softjoins in WHIRL • Ted Kennedy episode – Challenges: record linkage over text fields (object mapping); query reformulation Kambhampati & Knoblock • • “Collection Selection” – All sources export text documents – E.g. meta-crawler etc. Challenges: Similarity definition; relevance handling All together (vertical & horizontal) – Many interesting research issues – ..but few actual fielded systems Information Integration on the Web (MA-1) 17 Dimensions to Consider • • • • How many sources are we accessing? How autonomous are they? Can we get meta-data about sources? Is the data structured? – Discussion about soft-joins. See slide next • Supporting just queries or also updates? • Requirements: accuracy, completeness, performance, handling inconsistencies. • Closed world assumption vs. open world? – See slide next Kambhampati & Knoblock Information Integration on the Web (MA-1) 18 Soft Joins..WHIRL [Cohen]  We can extend the notion of Joins to “Similarity Joins” where similarity is measured in terms of vector similarity over the text attributes. So, the join tuples are output n a ranked form—with the rank proportional to the similarity  Neat idea… but does have some implementation difficulties   Most tuples in the cross-product will have non-zero similarities. So, need query processing that will somehow just produce highly ranked tuples Also other similarity/distance metrics may be used  E.g. Edit distance 19 20 Local Completeness Information • If sources are incomplete, we need to look at each one of them. • Often, sources are locally complete. • Movie(title, director, year) complete for years after 1960, or for American directors. • Question: given a set of local completeness statements, is a query Q’ a complete answer to Q? Problems: 1. Sources may not be interested in giving these!Advertised description Need to learn hard to learn! 2. Even if sources are willing to give, there may not be any “big enough” LCWs Saying “I definitely have the car with vehicle ID XXX is useless Kambhampati & Knoblock True source contents Guarantees (LCW; Inter-source comparisons) Information Integration on the Web (MA-1) 21 Source Descriptions Kambhampati & Knoblock Services Source Trust Ontologies; Source/Service Descriptions Probing Queries Webpages Structured data Sensors (streaming Data) re nc e/U ti l ity M od e Qu Executor Pr efe Information Integration on the Web (MA-1) Ca ur ce el ry So Needs to handle: Multiple objectives, Service composition, Source quality & overlap lls Source Fusion/ Query Planning Answers Needs to handle Source/network Interruptions, Runtime uncertainty, replanning tics atis g St atin Upd ing nn pla ts Re ques Re • Contains all meta-information about the sources: – Logical source contents (books, new cars). – Source capabilities (can answer SQL queries) – Source completeness (has all books). – Physical properties of source and network. – Statistics about the data (like in an RDBMS) – Source reliability – Mirror sources – Update frequency. Monitor 23 Source Access • How do we get the “tuples”? – Many sources give “unstructured” output • Some inherently unstructured; while others “englishify” their database-style output – Need to (un)Wrap the output from the sources to get tuples – “Wrapper building”/Information Extraction – Can be done manually/semi-manually Kambhampati & Knoblock Information Integration on the Web (MA-1) 24 Source Fusion/Query Planning Services Source Trust Ontologies; Source/Service Descriptions Webpages Structured data Sensors (streaming Data) re nc e/U til ity M od el y er Qu Pr efe Executor Information Integration on the Web (MA-1) Needs to handle Source/network Interruptions, Runtime uncertainty, replanning Ca ce ur So Needs to handle: Multiple objectives, Service composition, Source quality & overlap lls Source Fusion/ Query Planning Answers Kambhampati & Knoblock Probing Queries tics atis g St atin Upd ing nn pla ts Re ques Re • Accepts user query and generates a plan for accessing sources to answer the query – Needs to handle tradeoffs between cost and coverage – Needs to handle source access limitations – Needs to reason about the source quality/reputation Monitor 25 Monitoring/Execution Services Source Trust Ontologies; Source/Service Descriptions Webpages Structured data Sensors (streaming Data) el ry So ur ce Needs to handle: Multiple objectives, Service composition, Source quality & overlap Ca lls Source Fusion/ Query Planning til e/U nc re efe Pr tics Information Integration on the Web (MA-1) atis g St Executor Needs to handle Source/network Interruptions, Runtime uncertainty, replanning atin ity M Upd od e Qu Answers Kambhampati & Knoblock Probing Queries ing nn pla ts Re ques Re • Takes the query plan and executes it on the sources – Needs to handle source latency – Needs to handle transient/short-term network outages – Needs to handle source access limitations – May need to re-schedule or re-plan Monitor 26 Models for Integration Overview Ser vi ces Source Trust Ont ol ogies; Source /Servic e Descripti ons Probing Queries Webpages Struc tured data M ity en ce /U til y er Qu od el Executor Answers Nee ds t o handle Source/ net work Int errupt ions, Runt ime unce rt ainit y, repl anning s Sensors (streaming Data) Ca ll Ne eds t o handl e: Multi pl e obje cti ves, Se rvice composit ion, Source qua lit y & ove rl ap Sou rc e Source Fusion/ Query Planning Pr efe r Motivation for Information Integration [Rao] Accessing Information Sources [Craig] Models for Integration [Rao] Query Planning & Optimization [Rao] Plan Execution [Craig] Standards for Integration/Mediation [Rao] Ontology & Data Integration [Craig] Future Directions [Craig] tics at is g St at in Upd ing lann s R ep ue st R eq • • • • • • • • Monitor Modified from Alon Halevy’s slides Kambhampati & Knoblock Information Integration on the Web (MA-1) 29 Solutions for small-scale integration lls Ca ce el So ur Needs to handle: Multiple objectives, Service composition, Source quality & overlap od M ity til e/U nc re Pr ef e s Needs to handle Source/network Interruptions, Runtime uncertainity, replanning tistic Executor Answers ing nn pla ts Re ques Re Kambhampati & Knoblock Sensors (streaming Data) Source Fusion/ Query Planning e Qu User queries Webpages Structured data ry Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money. Data warehousing: load all the data periodically into a warehouse. – 6-18 months lead time – Separates operational DBMS from decision support DBMS. (not only a solution to data integration). – Performance is good; data may not be fresh. – Need to clean, scrub you data. Probing Queries g Sta atin • Ontologies; Source/Service Descriptions Upd • Services Source Trust Monitor OLAP / Decision support/ Data cubes/ data mining Relational database (warehouse) Data extraction programs Data source Information Integration on the Web (MA-1) Data cleaning/ scrubbing Data source Data source 30 Services Sensors (streaming Data) Ca lls Source Fusion/ Query Planning So ur od el ry ce Needs to handle: Multiple objectives, Service composition, Source quality & overlap re nc e/U til i ty M e Qu Needs to handle Source/network Interruptions, Runtime uncertainity, replanning s Pr ef e Executor Answers tistic Kambhampati & Knoblock Webpages Structured data g Sta – Relating Sources & Mediator – Reformulating the query – Efficient planning & execution Probing Queries ing nn pla ts Re ques Re • Leave the data in the sources. • When a query comes in: – Determine the relevant sources to the query – Break down the query into sub-queries for the sources. – Get the answers from the sources, and combine them appropriately. • Data is fresh. Approach scalable • Issues: Ontologies; Source/Service Descriptions atin Upd The Virtual Integration Architecture Source Trust Monitor User queries Mediated schema Mediator: Which data model? Reformulation engine optimizer Execution engine Data source catalog wrapper wrapper wrapper Data source Data source Data source Garlic [IBM], Hermes[UMD];Tsimmis, InfoMaster[Stanford]; DISCO[INRIA]; Information Manifold [AT&T]; SIMS/Ariadne[USC];Emerac/Havasu[ASU] Information Integration on the Web (MA-1) 32 Desiderata for Relating Source-Mediator Schemas • Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources. • Easy addition: make it easy to add new data sources. • Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively. • Nonlossy: be able to handle all queries that can be answered by directly accessing the sources Kambhampati & Knoblock User queries Mediated schema Mediator: Reformulation engine optimizer Execution engine Data source catalog wrapper wrapper wrapper Data source Data source Data source Reformulation • Given: – A query Q posed over the mediated schema – Descriptions of the data sources • Find: – A query Q’ over the data source relations, such that: • Q’ provides only correct answers to Q, and • Q’ provides all possible answers to Q given the sources. Information Integration on the Web (MA-1) 35 11/9 --Google Mashups—what type of integration are they? --Mturk.comcollaborative computing the capitalist way Kambhampati & Knoblock Information Integration on the Web (MA-1) 36 Information Integration Data Integration Text/Data Integration Service integration Collection Selection Data aggregation (vertical integration) Data Linking (horizontal integration) Kambhampati & Knoblock Information Integration on the Web (MA-1) 38 Extremes of automation in Information Integration • Fully automated II (blue sky for now) – – – – – – – • Fully hand-coded II – Decide on the only query Get a query from the user you want to support (most interesting on the mediator schema – Write a (java)script that action is Go “discover” relevant data supports the query by “in between”) sources accessing specific (predetermined) sources, piping Figure out their “schemas” Map the schemas on to the E.g. We may start with results (through known APIs) to specific other known sources and mediator schema sources Reformulate the user query their known schemas, • Examples include Google do hand-mapping into data source queries Map Mashups and support automated Optimize and execute the reformulation and queries optimization Return the answers Kambhampati & Knoblock Information Integration on the Web (MA-1) 39 Differences minor for data aggregation… Approaches for relating source & Mediator Schemas “View” Refresher • Global-as-view (GAV): express the mediated schema relations as a set of views over the data source relations • Local-as-view (LAV): express the source relations as views over the mediated schema. • Can be combined…? Kambhampati & Knoblock CREATE VIEW Seattle-view AS SELECT buyer, seller, product, store FROM Person, Purchase WHERE Person.city = “Seattle” AND Person.name = Purchase.buyer We can later use the views: Virtual vs Materialized SELECT name, store FROM Seattle-view, Product WHERE Seattle-view.product = Product.name AND Product.category = “shoes” Information Integration on the Web (MA-1) 40 Global-as-View Mediated schema: Express mediator schema relations as views over Movie(title, dir, year, genre), source relations Schedule(cinema, title, time). Create View Movie AS select * from S1 [S1(title,dir,year,genre)] union select * from S2 [S2(title, dir,year,genre)] union [S3(title,dir), S4(title,year,genre)] select S3.title, S3.dir, S4.year, S4.genre from S3, S4 where S3.title=S4.title Kambhampati & Knoblock Information Integration on the Web (MA-1) 42 Global-as-View Mediated schema: Express mediator schema relations as views over Movie(title, dir, year, genre), source relations Schedule(cinema, title, time). Create View Movie AS select * from S1 [S1(title,dir,year,genre)] union select * from S2 [S2(title, dir,year,genre)] union [S3(title,dir), S4(title,year,genre)] select S3.title, S3.dir, S4.year, S4.genre from S3, S4 Mediator schema relations are where S3.title=S4.title Virtual views on source relations Kambhampati & Knoblock Information Integration on the Web (MA-1) 43 Local-as-View: example 1 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Express source schema relations as views over mediator relations Create Source S1 AS select * from Movie S1(title,dir,year,genre) Create Source S3 AS select title, dir from Movie S3(title,dir) Create Source S5 AS select title, dir, year S5(title,dir,year), year >1960 from Movie where year > 1960 AND genre=“Comedy” Sources are “materialized views” of mediator schema Kambhampati & Knoblock Information Integration on the Web (MA-1) 46 GAV vs. LAV Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre) Create View Movie AS Create Source S4 select NULL, NULL, NULL, genre select cinema, genre from S4 from Movie m, Schedule s Create View Schedule AS where m.title=s.title select cinema, NULL, NULL from S4. But what if we want to find which cinemas are playing comedies? Now if we want to find which cinemas are playing comedies, there is hope! Lossy mediation Kambhampati & Knoblock Information Integration on the Web (MA-1) 49 GAV vs. • Not modular – Addition of new sources changes the mediated schema • Can be awkward to write mediated schema without loss of information • Query reformulation easy – reduces to view unfolding (polynomial) – Can build hierarchies of mediated schemas • Best when – Few, stable, data sources – well-known to the mediator (e.g. corporate integration) • Garlic, TSIMMIS, HERMES Kambhampati & Knoblock LAV • Modular--adding new sources is easy • Very flexible--power of the entire query language available to describe sources • Reformulation is hard – Involves answering queries only using views (can be intractable—see below) • Best when – Many, relatively unknown data sources – possibility of addition/deletion of sources • Information Manifold, InfoMaster, Emerac, Havasu Information Integration on the Web (MA-1) 50

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Data/Information Integration/aggregation