TU/e Eindhoven University of Technology
Faculty of Mathematics and Informatics
Technologie van Informatiesystemen (Technology of Information Systems, TIS), lecture 3

Contents
• Introduction, 30/11
• Web engineering & Web information systems, 7/12
• Data transformation & data integration, 14/12
• ERP, Smulders (Deloitte), 21/12 + 11/1
• Flower, Berens (Pallas Athena), 25/1 + 1/2
• Biztalk, van den Boom (Microsoft), 15+22/2

Data Transformation and Data Integration
Philippe Thiran
Computer Science Department, Technische Universiteit Eindhoven, The Netherlands

Agenda
• Problem Statement
  – Existing database systems
  – Heterogeneity, distribution, autonomy
• Data Transformation
  – Schema conversion
  – Query conversion: Wrapper
• Data Integration
  – Schema integration
  – Query processing: Multidatabase and Federation

Problem Statement: Existing Database Systems
• Data are recorded in existing database systems
• Existing database systems are:
  – Mission critical (essential to the organization's business)
  – Required to be operational at all times
  – Inflexible
• Typically, existing database systems are:
  – Very large (millions of lines of code)
  – Old (often more than 10 years old)
  – Written in old programming languages such as COBOL, PL/1, SQL!
  – Built around an old DBMS
• They were built to answer old requirements, but now face:
  – New functions and services
  – New user requirements
  – New technology (Web)
  – Communication among them?

Problem Statement: New Services and Communication
• Existing systems: new services
  – How to deal with existing database systems?
    • Abandon the existing systems: migration to a new system
    • Keep and modify the existing systems
    • Keep the existing systems and wrap them: autonomy
• Existing systems: communication
  – How to integrate existing database systems?

Problem Statement: Data Integration
• Data integration problems
  – Integrating database systems is very hard and costly
  – Three main dimensions of the problem:
    • Distribution
    • Autonomy
    • Heterogeneity
  – Figure: three axes labelled Distribution, Autonomy and Heterogeneity, with "Centralized DBMS" and "Distributed databases" marked on the diagram
• Autonomy
  – Autonomy refers to the distribution of control
  – Four dimensions of autonomy:
    • Design: each system has its own data models and its own transaction management techniques
    • Communication: a system knows neither that other systems exist nor how to communicate with them
    • Execution: each system executes independently of the other systems
    • Association: each system decides how much of its data and processing capabilities it will share with the other systems
• Heterogeneity
  – Heterogeneity may exist at three basic levels:
    • DBMS level. Data is managed by a variety of DBMSs based on different data models and data languages
      – Data models: relational model, hierarchical model and file model
      – Data languages: SQL, DL/1, COBOL programs
    • Platform level. Different hardware, different network protocols
    • Semantic level. Different designer viewpoints in modelling the same objects of the application domain; incompatible design specifications, which lead to different naming, types or integrity constraints

Data Integration: Generic Integration Architecture
• Schema hierarchy (top to bottom)
  – Integrated Schema (common model): homogenizes and unions import schemas
  – Import Schema 1, Import Schema 2, Import Schema 3: view on export schema available for non-local access
  – Export Schema 1, Export Schema 2, Export Schema 3: unifies data models
  – Database Schema 1, Database Schema 2, Data Schema 3 (local models)
  – Data sources: DB1 (relational DBMS), DB2 (OO DBMS), file system
• Data and schema transformation takes place between the database schemas and the export schemas; data and schema integration takes place between the import schemas and the integrated schema

Data Transformation
• Schema conversion
• Query conversion: Wrapper

Data Transformation: Schema Conversion
• Introduction
  – Schema conversion: each database schema (local data model) is converted into an export schema (common data model)
  – Query/data conversion: Query1 against Export Schema 1 becomes Query1' against Database Schema 1 of Data Source 1, and the local result Data1' is converted into Data1; likewise Query2, Query2', Data2' and Data2 for Data Source 2
• Schema conversion
  – Schema transformation
    • Transformation of a schema expressed in a data model (Ms) into an equivalent schema expressed in another data model (Mt)
    • Examples
      – ER model → Relational model (lecture ISO)
      – Relational model → XML Schema (see later)
    • Schema transformation operators
    • Schema conversion consists in applying the relevant transformations to the relevant constructs of the schema expressed in Ms, in such a way that the final result complies with Mt
  – A (schema) transformation is basically an operator by which a source data structure C is replaced with a target structure C'
  – Example of a semantics-preserving transformation: transforming a relationship type into an attribute
    • RT-FK: transforming a binary relationship type into a foreign key
      – Before: A (A1) and B (B1, B2; id: B1), linked by relationship type R with cardinality 1-1 on the A side and 0-N on the B side
      – After: A (A1, B1; ref: B1) and B (B1, B2; id: B1)
  – Two main schema transformations for ER model → Relational model
    • RT-ET: transforming a relationship type into an entity type. Inverse: ET-RT
      – Before: A (A1) and B (B1, B2; id: B1), linked by R with cardinality 0-N on both sides
      – After: R becomes an entity type (id: rB.B1 rA.A) linked to A by rA and to B by rB, with cardinality 1-1 on the R side and 0-N on the A and B sides
    • RT-FK: transforming a binary relationship type into a foreign key. Inverse: FK-RT
  – Exercise: from ER model → Relational model
    • Source ER schema
      – Entity types: CUSTOMER (Code, Description; id: Code), ORDER (Code; id: Code), STOCK (Code, Name, Level; id: Code)
      – Relationship types: place (CUSTOMER 0-N, ORDER 1-1), purchase (Tot; CUSTOMER 0-N, STOCK 0-N), details (Order-qty; ORDER 0-N, STOCK 0-N)
    • Step 1 (RT-ET): the many-to-many relationship types become entity types
      – details (Order-qty; id: det_ORD.ORDER det_STO.STOCK), linked to ORDER by det_ORD and to STOCK by det_STO (1-1 on the details side, 0-N on the other side)
      – purchase (Tot; id: pur_CUS.CUSTOMER pur_STO.STOCK), linked to CUSTOMER by pur_CUS and to STOCK by pur_STO (1-1 on the purchase side, 0-N on the other side)
    • Step 2 (RT-FK): the remaining relationship types become foreign keys
      – CUSTOMER (Code, Description; id: Code)
      – STOCK (Code, Name, Level; id: Code)
      – ORDER (Code, Cus_Code; id: Code; ref: Cus_Code)
      – purchase (P_C_Code, Code, Tot; id: P_C_Code Code; ref: Code; ref: P_C_Code)
      – details (D_O_Code, Code, Order-qty; id: D_O_Code Code; ref: Code; ref: D_O_Code)

Data Transformation: Wrappers
• Definition
  – Figure: the wrapper sits between the export schema (common data model, common query language) and the database schema of the data source (local data model)
  – A wrapper controls a (legacy) data source
  – Basically, a wrapper is a software component that offers a homogeneous query interface based on a common data model (XML for the Web)
  – It converts data and queries from the common data model to a local data model
  – It offers an adequate way of solving the DBMS heterogeneity that appears when one wants to integrate existing, heterogeneous data systems
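To make the wrapper definition concrete, here is a minimal sketch of a two-way wrapper, assuming an in-memory SQLite database as a stand-in for the legacy data source and XML as the common data model. The class `RelationalWrapper`, its methods `export_schema` and `query`, the table name `Orders` and the sample rows are all hypothetical and chosen for illustration only; a real wrapper would accept a proper query language rather than a single column/value pair.

```python
import sqlite3
import xml.etree.ElementTree as ET


class RelationalWrapper:
    """Hypothetical two-way wrapper over one relational table."""

    def __init__(self, connection, table, columns):
        self.connection = connection
        self.table = table
        self.columns = list(columns)

    def export_schema(self):
        # Export schema in the common model: one element name per column.
        return {self.table.lower(): [c.lower() for c in self.columns]}

    def query(self, column, value):
        # Accept a request against the export schema and reject unknown names.
        if column not in self.columns:
            raise ValueError(f"unknown column: {column}")
        # Translate the request into a query the local source understands.
        column_list = ", ".join(self.columns)
        sql = f"SELECT {column_list} FROM {self.table} WHERE {column} = ?"
        rows = self.connection.execute(sql, (value,)).fetchall()
        # Transform the local result into the common data model (XML).
        root = ET.Element(self.table.lower())
        for row in rows:
            row_element = ET.SubElement(root, "row")
            for name, cell in zip(self.columns, row):
                ET.SubElement(row_element, name.lower()).text = str(cell)
        return ET.tostring(root, encoding="unicode")


# Illustrative data source (assumption: not the lecture's actual system).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (Id INTEGER, Custname TEXT, Custnum INTEGER)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [(10, "Philips", 7734), (9, "Unilever", 7725)],
)

wrapper = RelationalWrapper(conn, "Orders", ["Id", "Custname", "Custnum"])
result = wrapper.query("Custname", "Philips")
print(result)
```

The caller only sees element names and XML; the SQL text and the SQLite connection stay hidden behind the wrapper, which is the point of the definition above.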
• Definition (continued)
  – A data wrapper is basically defined as a converter of data and queries
  – That is, a wrapper:
    • Offers an export schema in the common data model
    • Accepts queries against the export schema
    • Translates them into queries understandable by the data system
    • Transforms the results of the local queries into a format understood by the application
  – Figure: queries and data cross the wrapper between the export schema (common data model, common query language) and the database schema of the data source (local data model, local query language)
• Categories of wrappers
  – There is no standard approach to building wrappers
  – Functionality
    • One-way: only transformation of data (e.g., for data warehouses)
    • Two-way: transformation of requests and data
  – Development
    • Hard-wired wrappers, for specific data sources
    • Semi-automated generation: wrapper development tools
    • Automatically generated wrappers
  – Availability
    • Standalone programs (data conversion, data migration)
    • Components of a federation (see later)
    • Database interface for foreign data
• Wrappers and the Web
  – Wrapper interface
    • Data format: XML
    • Common data model: XML DTD and XML Schema
    • Common query language: XPath, XQuery, none
  – Wrapper mapping
    • Generally between relational data and XML
    • Two translation types
      – Automated
      – Defined by the user
    • XML- or SQL-oriented query language
• XML views of relational databases
  – Sample relational database

    Order
    Id   Custname   Custnum
    10   Philips    7734
    9    Unilever   7725

    Item
    Oid  Desc       Cost
    10   Ship       24000
    10   Generator  8000

    Payement
    Oid  Due      Amt
    10   1/10/01  20000
    9    6/10/01  12000

  – Automated translation

    <db>
      <order>
        <row><id>10</id><custname>Philips</custname><custnum>7734</custnum></row>
        <row><id>9</id><custname>Unilever</custname><custnum>7725</custnum></row>
      </order>
      <item>
        <row><oid>10</oid><desc>Ship</desc><cost>24000</cost></row>
        <row><oid>10</oid><desc>Generator</desc><cost>8000</cost></row>
      </item>
      <payement> similar to <order> and <item> </payement>
    </db>

  – User-defined translation

    <order id='10'>
      <custname> Philips </custname>
      <items>
        <item description="Ship"> <cost> 24000 </cost> </item>
        <item description="Generator"> <cost> 8000 </cost> </item>
      </items>
      <payements>
        <payement due='1/10/01'> <amount> 20000 </amount> </payement>
      </payements>
    </order>
    <order id='9'> … </order>

  – Exercise: what is the XML document for this relational database?
    • Order (OrderID, Customer, Date, Total[0-1]; id: OrderID)
    • Detail (OrderID, Reference, Quantity, Amount; id: OrderID Reference; ref: OrderID; ref: Reference)
    • Product (Reference, Label[0-1], UnitPrice, Supplier; id: Reference)
  – A possible answer, given as a DTD

    <!ELEMENT Catalog (Order*, Product*)>
    <!ELEMENT Order (Customer, Date, Total?, Detail+)>
    <!ATTLIST Order OrderID ID #REQUIRED>
    <!ELEMENT Customer ANY>
    <!ELEMENT Date (#PCDATA)>
    <!ELEMENT Total (#PCDATA)>
    <!ELEMENT Detail (Quantity, Amount)>
    <!ATTLIST Detail Product IDREF #REQUIRED>
    <!ELEMENT Quantity (#PCDATA)>
    <!ELEMENT Amount (#PCDATA)>
    <!ELEMENT Product (Supplier+)>
    <!ATTLIST Product Reference ID #REQUIRED
                      Label CDATA #IMPLIED
                      UnitPrice CDATA #REQUIRED>
    <!ELEMENT Supplier ANY>

• XML views of existing relational databases
  – Mapping definition: SQL-oriented query language

    for $b in SQL("select * from Order where Custname='" + $x + "'")
    return <order> {$b/Id} <Custname>{$x}</Custname> </order>

    Figure: the relation Order (Id, Custname, Custnum) is mapped to order elements containing Id and Custname
  – XML view definition
    • Bottom-up (from the relational schema)
    • Top-down (from a given XML schema)
  – Mappings between XML views and relational schemas
    • Automated (algorithm)
    • Manual (defined by the user)
  – Examples (columns: product name; SQL-written mapping; XML-written mapping; XML schema; query over views)
    • Xperanto: no; yes; XML Schema; yes (XQuery); update
    • Microsoft's SQL Server: yes (FOR XML clause); no; XDR Schema; yes (XPath)
    • DB2 (IBM): no; yes (subset XQuery); yes (XQuery); no
    • Oracle9i: yes; no
    • SilkRoute (AT&T): no; yes; no; XML Schema; yes (XQuery); update

Data Integration
• Generic integration architecture
• Schema integration
• Query processing: multidatabase and federation

Data Integration: Generic Integration Architecture
• Schema hierarchy: the integrated schema (common model) homogenizes and unions the import schemas; import schemas are defined over export schemas, which are derived from the database schemas (local models) of DB1 (relational DBMS), DB2 (OO DBMS) and a file system
• Component architecture
  – Applications 1, 2 and 3 access the integrated schema through a common DDL/DML
  – Mediator
    • Offers an abstract integrated view of sources
    • Reconciles independent data structures to yield a unique, coherent view of the data
  – Wrapper (one per source, between the import/export schema and the database schema, using the local DDL/DML)
    • Controls a local data source
    • Offers a homogeneous query interface based on a common data model
  – DBMS 1, DBMS 2 and DBMS 3 manage DB1, DB2 and DB3
• Aspects to consider for integration
  – General issues
    • Bottom-up vs. top-down engineering
      – From existing schemas to an integrated schema, or vice versa
      – Schema integration vs. schema matching
    • Virtual vs. materialized integration
    • Read-only vs. read-write access
    • Transparency
      – Language, schema, location
  – Data model related issues
    • Types of sources
      – Structured, semi-structured, unstructured
    • Common data model of the integrated system
    • Tight vs. loose integration
      – Use of a global schema
    • Query model

Data Integration: Schema Integration
• Methodology
  – Bottom-up process
  – Four main steps
    • Preparing the local schemas
    • Detecting what is common between the components of the local schemas
      – Correspondence (what is common)
    • Solving the conflicts
      – Conflict (what is incompatible)
    • Integrating the different schemas according to the correspondences and conflicts detected in the previous steps
• Concept of correspondence
  – Two complementary views of correspondence:
    • Structural correspondence (schema level: concepts)
    • Instance correspondence (instance level: data)
  – Five types of structural correspondence:
    • Identity
    • Independence
    • Complementarity
    • Subtyping
    • Common supertype
  – Four types of instance correspondence:
    • Disjoint: the instance sets of the classes are disjoint
    • Inclusion: the instance set of one class is included in that of the other class
    • Equivalence: the classes contain the same instances
    • Overlapping: the classes share some instances but not all
• Concept of conflict
  – Conflicts fall into four kinds: syntactic (naming conflicts), structural, semantic and instance
  – Syntactic conflicts (resolution: use of an ontology)
    • Synonyms. Two identical objects (entities, attributes, relationships) that have different names are synonyms
    • Homonyms. Two different objects that have identical names are homonyms
  – Structural conflicts (resolution: mapping function or transformation)
    • Domain. Two identical objects have different domains (differences in dimension, units and scales)
    • Structure. The same concept is represented by different data structures (e.g., different attributes)
    • Example: in the left-hand schema, Address is a compound attribute, whereas in the right-hand one, Address is represented by an entity type. Resolution: transformation
      – Site 1: CUSTOMER (CUSTID, NAME, ADDRESS (STREET, ZIP CODE, CITY); id: CUSTID)
      – Site 2: CUSTOMER (CUSTID, NAME; id: CUSTID) linked by lives (1-1, 1-1) to ADDRESS (STREET, ZIP CODE, CITY)
  – Semantic conflicts
    • A semantic conflict appears when there is a contradiction between two representations A and B of the same application domain concept, or between two integrity constraints (resolution?)
    • Example: in the left-hand schema, Customer is identified by CustId, whereas in the right-hand one, it is identified by Name
      – Site 1: CUSTOMER (CUSTID, NAME, ADDRESS (STREET, ZIP CODE, CITY); id: CUSTID)
      – Site 2: CUSTOMER (CUSTID, NAME, ADDRESS (STREET, ZIP CODE, CITY); id: NAME)
  – Instance conflicts
    • Instance conflicts are specific to existing data
    • Modelling constructs A and B that are recognized as corresponding can cover sets with different scopes
    • Examples
      – ZIP codes of addresses can be written like "NL-5600 MB" or "56oo MB" or "5600"
      – Different ZIP codes can be recorded for the same address (encoding errors)
    • Resolution: data transformation… cleaning?
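The ZIP code example of an instance conflict can be sketched as a small cleaning step. This is only an illustration of the "data transformation… cleaning?" idea: the functions `normalize_zip` and `zips_agree` are hypothetical, and the rules they apply (drop an "NL-" prefix, read a letter "o" typed in the digit part as zero, treat a missing letter suffix as compatible with any suffix) are assumptions rather than rules stated in the lecture.

```python
import re


def normalize_zip(raw):
    """Return (digits, letters) for a Dutch-style ZIP code, or None if unrecognized."""
    text = raw.strip().upper()
    text = re.sub(r"^NL-", "", text)  # drop an optional country prefix
    match = re.fullmatch(r"([0-9O]{4})\s*([A-Z]{2})?", text)
    if match is None:
        return None
    # Assumption: a letter O inside the four-digit part is a mistyped zero.
    digits = match.group(1).replace("O", "0")
    return digits, match.group(2)


def zips_agree(first, second):
    """True when two raw ZIP codes can denote the same code after cleaning."""
    left, right = normalize_zip(first), normalize_zip(second)
    if left is None or right is None:
        return False
    same_digits = left[0] == right[0]
    # A missing letter suffix is treated as compatible with any suffix.
    letters_compatible = left[1] is None or right[1] is None or left[1] == right[1]
    return same_digits and letters_compatible


for raw in ["NL-5600 MB", "56oo MB", "5600"]:
    print(raw, "->", normalize_zip(raw))
```

Cleaning of this kind resolves differences in notation only. Two genuinely different ZIP codes recorded for the same address still need a decision about which source to trust.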
Data Integration: Query Processing (multidatabase and federation)
• Three classical architectures
  – Multidatabases
    • No integrated schema
    • Integrated access to different relational DBMSs
  – Federated databases
    • Integrated schema
    • Integrated access to different DBMSs
    • Integrated access to different data sources (on the Web)
  – Data warehouses
    • Materialized integrated data sources
    • Not covered here
• Classical architecture: Multidatabase
  – Enables transparent access to multiple (relational) databases
    • Hides distribution and different SQL variants
    • Processes queries and updates against multiple databases (two-phase commit)
    • Does not provide any type of global schema (does not hide the different database schemas)
  – Example: IBM DataJoiner
    • Figure: DataJoiner reaches a Sybase server through Sybase Open Client and an Oracle server through Oracle SQL*Net, over a TCP/IP network
  – Multidatabase schema
    • Source 1 (Sybase): Publications (PNR, Title, Author, Journal; id: PNR) and Authors (ANR, Title, Name, Affiliation; id: ANR)
    • Source 2 (Oracle): Papers (Number, Title, Writer, Published; id: Number) and Writer (FirstName, LastName, NRofPublications; id: FirstName LastName)
    • Multidatabase schema: the same relations under qualified names Sybase.Publications, Sybase.Authors, Oracle.Papers and Oracle.Writer
  – Query processing
    • Query against the multidatabase schema:
      SELECT p2.title FROM Sybase.PUBLICATIONS p1, Oracle.PAPERS p2 WHERE p1.title = p2.title
    • Subquery sent to Source 1 (Sybase): SELECT title FROM PUBLICATIONS
    • Subquery sent to Source 2 (Oracle): SELECT title FROM PAPERS
  – Main properties
    • Transparency
      – Low level of transparency provided to the user (the user is responsible for finding the relevant information, understanding each database schema, detecting and resolving the semantic conflicts and, finally, building the required view of the data in the sources)
    • Autonomy
      – Does not intrude on the autonomy of the data sources
      – Suitable when component systems are strongly autonomous
    • Methodology
      – Simple, since there is no schema integration
    • Maintenance and evolution
      – No integrated schema to maintain
• Classical architecture: Federation
  – Integrated schema(s) and unique interface
    • Hides the semantic and location heterogeneity
    • Wrapper/Mediator hierarchy
      – Wrapper
        » Controls a local data source
        » Offers a homogeneous query interface based on a common data model
      – Mediator
        » Offers an abstract integrated view of several sources
        » Reconciles independent data structures to yield a unique, coherent view of the data
  – Research projects
    • Tsimmis (Stanford)
    • Garlic (IBM)
    • Oasis (Dublin University)
  – Typical example
    • Source DB1: Oracle SQL DBMS with Publication (PNR, Title, Authors, Journal, Pages; id: PNR) and Authors (ANR, Title, FirstName, Surname, Affiliation; id: ANR)
    • Source DB2: XML DBMS with the DTD

      <!ELEMENT Book (title, author)>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT author (#PCDATA)>

    • Export schema provided by the wrapper of DB1

      <complexType name="Book">
        <element name="title" type="string"/>
        <element name="authors" type="string"/>
        <element name="pages" type="string"/>
      </complexType>

    • Export schema provided by the wrapper of DB2

      <complexType name="Book">
        <element name="title" type="string"/>
        <element name="author" type="string"/>
      </complexType>

    • Import schema DB1

      <complexType name="Book">
        <element name="title" type="string"/>
        <element name="authors" type="string"/>
      </complexType>

    • Import schema DB2

      <complexType name="Book">
        <element name="title" type="string"/>
        <element name="author" type="string"/>
      </complexType>

    • Integrated schema offered by the mediator (views)

      <complexType name="Book">
        <element name="title" type="string"/>
        <element name="author" type="string"/>
      </complexType>

  – Query processing
    • The application submits query Q to the mediator: Q = FOR $b IN //Book RETURN $b/author
    • The mediator rewrites Q for each source: Q1 = FOR $b IN //Book RETURN $b/authors and Q2 = FOR $b IN //book RETURN $b/author
    • The wrapper of the Oracle SQL DBMS translates Q1 into Q1' = SELECT a.name FROM AUTHORS a
    • The wrapper of the XML DBMS translates Q2 into Q2' = //book/author
    • The wrappers return A1 = {<authors> … </authors>} and A2 = {<author> … </author>}
    • The mediator converts A1 into A1' = {<author> … </author>} and returns the result A = A1' ∪ A2
  – Main properties
    • Transparency
      – High level of transparency provided to the user. The user is not aware of the distribution and the heterogeneity of the integrated data sources
    • Autonomy
      – Each local data source has control over its sharable information
    • Methodology
      – Problems of defining an integrated schema
• Web as loosely coupled federation
  – Many different, widely distributed information systems
  – Heterogeneity
    • Structurally homogeneous: XML
    • Semantically heterogeneous: no explicit schemas (ontology?)
  – Autonomy
    • Runtime autonomy: pages change on average every 4 weeks, dangling links
  – Distribution
    • Replication (proxies) and caching frequently used
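The federated query flow of the typical example (Q rewritten into Q1 and Q2, translated by the wrappers into Q1' and Q2', answers merged into A) can be sketched in a few lines. This is a toy illustration under stated assumptions: an in-memory SQLite table stands in for the Oracle source, an in-memory XML tree stands in for the XML DBMS, the functions `sql_wrapper`, `xml_wrapper` and `mediator` and all sample names are hypothetical, and the merge is taken to be a union that drops answers with identical text.

```python
import sqlite3
import xml.etree.ElementTree as ET


def sql_wrapper(connection):
    # Q1' = SELECT a.name FROM AUTHORS a; each row becomes an <authors> element (A1).
    rows = connection.execute("SELECT a.name FROM AUTHORS a").fetchall()
    answers = []
    for (name,) in rows:
        element = ET.Element("authors")
        element.text = name
        answers.append(element)
    return answers


def xml_wrapper(document):
    # Q2' = //book/author, evaluated with ElementTree's limited XPath support (A2).
    return document.findall(".//book/author")


def mediator(connection, document):
    a1 = sql_wrapper(connection)
    a2 = xml_wrapper(document)
    # A1 -> A1': rename <authors> to the <author> element of the integrated schema.
    a1_prime = []
    for element in a1:
        renamed = ET.Element("author")
        renamed.text = element.text
        a1_prime.append(renamed)
    # A = A1' union A2; assumption: answers with the same text count as duplicates.
    seen, answer = set(), []
    for element in a1_prime + a2:
        if element.text not in seen:
            seen.add(element.text)
            answer.append(ET.tostring(element, encoding="unicode"))
    return answer


# Illustrative sources (placeholder names, not data from the lecture).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AUTHORS (name TEXT)")
conn.executemany("INSERT INTO AUTHORS VALUES (?)", [("Author One",), ("Author Two",)])
doc = ET.fromstring(
    "<library>"
    "<book><title>Some title</title><author>Author Two</author></book>"
    "<book><title>Other title</title><author>Author Three</author></book>"
    "</library>"
)

answer = mediator(conn, doc)
print(answer)
```

The application only calls `mediator`; which source produced which author, and in which local language it was queried, stays hidden, which is the high transparency listed among the federation's main properties.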