Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data center wikipedia , lookup
Entity–attribute–value model wikipedia , lookup
Data analysis wikipedia , lookup
Clusterpoint wikipedia , lookup
Versant Object Database wikipedia , lookup
Information privacy law wikipedia , lookup
3D optical data storage wikipedia , lookup
Business intelligence wikipedia , lookup
Relational model wikipedia , lookup
Data Integration Systems overview The architecture of a data integration system: Components and their interaction Tasks Concepts 2005 Integration-intro 1 Main components of a DI system (I) Mediator מתווך Supports in its user interface : • The global data model • The integrated / global / mediated schema / world view • A query language Manages the interaction with sources • Posing queries • Receiving answers, transforming and showing them Is responsible for query execution strategies • planning • carrying out 2005 Integration-intro 2 (II) Wrapper עוטף Serves as the interface to a source • Receive queries from a mediator • Plan and execute how to retrieve the data from its source • Transform data to global data model • Send to mediator For an SQL source, these are rather easy For a restricted capability source, may require • A series of queries on the source, or • A program to be executed (on a non-db source) • Filtering results obtained from the source 2005 Integration-intro 3 A simple architecture: mediator wrapper wrapper source source Arrows represent query and data flow 2005 Integration-intro 4 A more complex architecture: mediator mediator wrapper wrapper wrapper source source source Mediators can serves as wrapped sources for other mediators 2005 Integration-intro 5 Important: • The global database is virtual – contain no data • The data reside in the sources • The users pose queries as if the data resides in the global db • Users may/may not be aware that the data actually comes from the sources 2005 Integration-intro 6 Main tasks & activities: At mediator: • Query reformulation & decomposition – express queries in terms of the sources’ schemas – decompose into queries on sources • Planning query execution, including optimization – a declarative query may be executed in various ways (even in a single centralized db) – different sources may provide same data at different costs (money, communication time, response time, delays, …) – If data is associated with user priorities, we may want to retrieve some answers before others • When answers arrive – fuse them – a full answer is not a simple union of partial answers; data on an entity must be combined (fused) into a single record 2005 Integration-intro 7 Requirements: (from mediator, wrapper, system) Ability to handle • Incomplete information (data may be missing from available sources) • Heterogeneity – in data model, schema, contents • Both data and meta-data ability to describe sources: • Capabilities – what queries can a source answer – What mechanisms does it offer for data retrieval • Coverage – allows to know – Where can data relevant to a query be found – Is there overlap between sources? 2005 Integration-intro 8 The relationship between source and global data The global data is virtual the mediated schema describes data that • resides in the sources • is described by source schemas The relationship between the mediated and source data determines how queries are answered Two main approaches (a combination of the two – later) 2005 Integration-intro 9 Global as View – GAV The global db is defined as a view on the sources In relational model: Each global relation defined as a view, by a query on sources Obvious advantage: simplicity of query answering Given Q on global relations, • expand it: replace each atom R(x) by an expression on sources, using the definition of R • Then send appropriate sub-queries to sources 2005 Integration-intro 10 Simple example: a university database Source A: • Dept(D, C) – departments and their courses • Teaches(C,T) – teachers of courses Source B: • Enroll(S, C) – student enrollment to courses Integrated schema & its definition: • Stud(S, D, T) :- Dept(D, C), Teach(C, T), Enroll(S, C) Query Q: • Stud(S, ‘CS’, ‘Beeri’) Expand body to Dept(‘CS’, C), Teach(C, ‘Beeri’), Enroll(S, C) Then use one of (at least) two execution strategies on sources A,B 2005 Integration-intro 11 Local as View – LAV The global database is viewed as the “real world” Each source is defined as a view on it Example (revisited): Global schema: Univ(D, C, T, S) Source A: • Dept(D, C) :- Univ(D, C, T, S) • Teaches(C, T) :- Univ(D, C, T, S) Source B: • Enroll(S, C) :- Univ(D, C, T, S) 2005 Integration-intro 12 Possible assumptions on sources: • A source contains all data in its defining view • A source contains some of the data in its view, usually not all 2nd is more realistic Example: Global database describes cars for sale A source may contain : • only some of the attributes of cars present in the global schema (e.g., it may not contain history, or owner-contact) • Only some of the cars for sale full view / contained view Obviously, the more sources we have, the more cars 2005 Integration-intro 13 Query answering in LAV: • Expansion is not possible • An approach: answering queries using views Practically: rewriting queries using views (differences explored later) Only the views have data rewrite query to an expression over the views expression must be (explained in more detail later) • Full views: equivalent to query • Contained views: contained in query Solution may/may not exist (contrast to expansion) Finding it is more difficult This problem was explored in many contexts, e.g.: Query optimization using views/previous answers 2005 Integration-intro 14 Why prefer LAV to GAV? • Ease of expanding a system: – In GAV, adding a source may require re-definition of global schema – makes it difficult to add sources – In LAV, just define the new source as a view given an algorithm for using views to answer queries, it automatically uses the new source As for expanding queries vs. using views: Even in GAV, when sources have restricted capabilities, query answering requires using views 2005 Integration-intro 15 • Typically, a global schema reflects a real ‘world’, as we know it; each source materializes only a fragment – Horizontal – not all entity types or attributes are present – Vertical – not all entities of a type are present Thus, it is natural to define the sources as (contained) views Examples: • Cars for sale: – global db reflects our understanding and requirements – A source provides only some info, only on the cars it has • Looking for personal information using UNIX facilities – we know about: name, office, phone, email, … – Each facility may offer only some of the above 2005 Integration-intro 16 LAV is a natural approach in the presence of • www and its diversity & dynamicity of source • Legacy systems Most research efforts & systems are LAV 2005 Integration-intro 17 On rewriting queries using views: It is not clear (now) how to obtain a rewriting, given Q But, given v1(..), v2(..), …, vn(..) as a candidate, we may • expand each vi using its definition in terms of the global schema • Check whether the resulting expression is equivalent to or contained in Q (both Q and the expansion are in terms of global schema relations) Equivalence and containment of queries are fundamental problems for data integration 2005 Integration-intro 18 Example (our LAV example): Q: ans(S,’CS’,’Beeri’) :- Univ(’CS, C, ’Beeri’,S) Guess an answer in terms of views: ans`(S, ’CS’ , ‘Beeri’ ) :Dept(‘CS’,C), Teach(C, ‘Beeri’), Enroll(S,C) Univ(‘CS’,C, T1, S1) Univ(D2,C, ‘Beeri’, S2) Univ(D3,C, T3, S) (Note: must use distinct variables in different expansions for all non-join variables) Is the query equivalent to this expansion? Is the expansion contained in the query? 2005 Integration-intro 19