Corso di Architetture della Informazione, A.A. 2009 – 2010, Carlo Batini

5.1.1 Before Data Integration: Enterprise Application Integration (EAI) & Enterprise Information Integration (EII) & Extract Transform Load (ETL) / Data Warehouse (DW)

Remark
• A dedicated symbol marks the topics that are part of the programme of Rappresentazione della Informazione.
• The references, acronyms and synonyms used can be found at the end of the presentation.

Summary
• Enterprise Application Integration (EAI), Enterprise Information Integration (EII), Extract Transform Load (ETL) / Data Warehouse, and Enterprise Information Quality (EIQ) are different technologies that pursue the same goal: the integration of software applications, services and information (databases and other sources) managed in an organization or in a set of organizations. This integration is inherently lost because of the legacy style of application and database development, and because of the different points of view adopted by the different players using the information system.

There is big money to be spent on integration…

Some details

1. Definitions and classifications & a short history

1.1 The beginnings

Short history - 1
A typical Fortune 1000 company has a myriad of mission-critical data systems on which the enterprise depends. Such systems include ERP systems, sales tracking systems, HR systems, etc. Over the last twenty years, the conventional wisdom was to implement such applications using a client-server computing model, as shown in Figure 1. Here, the DBMS (or other storage technology) ran at the server level, accessed by a collection of applications running on the client desktop. Client-server computing was enabled by the emergence of desktop PCs, which provided a client computing platform, and was pushed by many software and hardware companies selling alternatives to IBM mainframes.

[Figure 1: two-tier client-server architecture (Client, Server)]

Short history - 2
Recently, several factors have rendered client-server computing completely obsolete, but we will focus only on the web in this discussion. Basically, the web forces every enterprise to move to the architecture of Figure 2. The client level of Figure 1 moves inside a web browser, where at most a portion of the application can run. Sometimes an ultra-thin client is run, whereby no code exists at the client level. Other enterprises use Java, Javascript, or other technology to run some code on the client desktop. In either case, the remainder of the application must be executed in a new middleware tier.

[Figure 2: three-tier architecture (Client, Middleware, Server)]

Middleware products – Application servers
Application servers have been in existence for more than twenty-five years. Historically, they were called TP monitors, the most popular being the IBM Customer Information Control System (CICS).

TP monitors came into existence to allow multiple client users to run the same application efficiently on a mainframe computer. Specifically, they provided application activation. To perform efficiently, they also provided multi-threading of applications (that were written to allow threading) and connection multiplexing. The latter feature lowered the number of connections that had to be maintained between the middleware layer and the server level, typically increasing efficiency. To provide scalability, many application servers support automatic load sharing over multiple hardware machines.
Lastly, most provide a security module including authentication, single site login, and access control. In summary, an application server is capable of providing code activation and related services.

Evolution from Application Servers
[Figure: application servers evolved along two paths – application-driven: Enterprise Application Integration (EAI); data-driven: Extract Transform Load / Data Warehouse and Enterprise Information Integration (EII)]

1.2 Evolution toward Enterprise Application Integration (EAI)

Motivation of the evolution toward EAI - example
A typical large enterprise has more than 5000 major application systems. Such systems are always logically interconnected. For example, when a new employee is hired, he must be inserted into the payroll system, added to the pension plan, added to the medical insurance system, etc. When a customer changes his telephone options, a change must be made to the billing system as well as to the provisioning system. Whenever two companies merge, there are two sets of information systems to contend with, one from each of the partners. It is desirable, but rarely possible, to kill one of the two overlapping systems, at least not immediately. Again, multiple interacting application systems are part of the resulting enterprise architecture.

Core functionalities of EAI middleware - 1
To service the needs of interacting applications, a class of products called enterprise application integration (EAI) systems arose. The core functionality is to reliably deliver a message from one application to a second one, which needs to be alerted concerning an action taken by the first one. Such a messaging service is a core functionality of EAI systems. However, the two communicating systems, which were written independently, never agree on the syntax and semantics of application constructs. A transaction to the net market is never the same as the one expected by the buyer's procurement system or the seller's order processing system. Hence, there is a need for message transformation between the originator and the recipient.

Core functionalities of EAI middleware - 2
Lastly, the two communicating applications are never written using the same technology. A home-brew order processing system must communicate with a net market based on Ariba. A customer relationship system written using Siebel must communicate with an ERP system from SAP. In this heterogeneous world, a collection of adapters is required. A source adapter transforms the source message to a common form, while a target adapter changes the common form to the one required by the recipient. The basic functionality provided by EAI systems is message delivery and transformation; a minimal sketch of the adapter idea follows.
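As an illustration of the source/target adapter idea, here is a minimal Python sketch. The message formats, the field names and the `CommonOrder` intermediate form are hypothetical, invented for the example rather than taken from any specific EAI product.

```python
from dataclasses import dataclass

# Hypothetical common form shared by all adapters.
@dataclass
class CommonOrder:
    order_id: str
    customer: str
    amount: float

# Source adapter: maps the (hypothetical) format of a home-brew
# order processing system into the common form.
def source_adapter(legacy_msg: dict) -> CommonOrder:
    return CommonOrder(
        order_id=str(legacy_msg["ORDNO"]),
        customer=legacy_msg["CUST_NAME"],
        amount=float(legacy_msg["TOT_AMT"]),
    )

# Target adapter: maps the common form into the (hypothetical)
# format expected by the recipient system.
def target_adapter(order: CommonOrder) -> dict:
    return {"id": order.order_id,
            "buyer": order.customer,
            "total": order.amount}

# The EAI broker only moves and transforms messages; it knows
# nothing about the internals of either application.
def deliver(legacy_msg: dict) -> dict:
    return target_adapter(source_adapter(legacy_msg))

print(deliver({"ORDNO": 42, "CUST_NAME": "ACME", "TOT_AMT": "99.50"}))
```

Note the design point: neither application changes; only the adapters know both formats, and the broker sees only the common form.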
Typical architecture of an EAI system
[Figure: user queries/updates reach the middleware, which connects Application 1 / Source 1, Application 2 / Source 2, …, Application n / Source n]

A more recent example of middleware: Publish/Subscribe
It realizes a many-to-many, anonymous interaction. The participants are divided into two groups:
– publishers: producers, message senders
– subscribers: consumers, interested in receiving specific typologies of messages
The communication is performed through a central unit (event service) that receives the messages from the publishers and sends them to the suitable subscribers. This is useful for multiple asynchronous updates of autonomous but overlapping databases.

Typical architecture of a P&S system
[Figure: a user update of object O flows through the middleware, which notifies the applications/sources subscribed to that type of event]

From EAI to workflow processes and business process languages and platforms
Workflow systems have also been in existence for many years. The early systems were oriented toward procurement, and the focus was on process flow. A typical purchase order must be signed by the manager of the originating employee. If it is large enough, it must also be signed by a division manager. If it entails computer equipment, it must be signed by the CIO. A typical large enterprise has tens or hundreds of such rules. The process of obtaining approval for a purchase order entails moving through a collection of approval steps, whose sequencing is governed by business rules. Of course, a workflow system must also have an execution environment. This run-time environment must be capable of executing applications, as well as supporting connectivity between one application and the subsequent one in the process flow. Hence workflow diagrams are typically compiled into some sort of middleware framework, and executed in the middle tier of the Figure 2 architecture.

1.3 The evolution of ETL and EII solutions

Problems in traditional DB architectures - 1
Many different types of heterogeneities arise among several databases that have to be used together:
1. Different platforms -> technological heterogeneity
2. Different data models at the participating DBMSs -> model heterogeneity
3. Different query languages -> language heterogeneity
4. Different data schemas and different conceptual representations in previously developed databases -> schema (or semantic) heterogeneity
5. Errors in the data, resulting in different values for the same piece of information -> instance heterogeneity

Problems in traditional DB architectures - 2
Dependencies exist:
– among databases (databases assume the existence of certain data in other databases),
– between databases and applications (applications assume certain properties to hold in the database),
– among applications (applications assume that other applications are performing certain operations).
Changes in the data and in the applications become difficult to manage. Since the "knowledge" of the dependencies is distributed in many places, and since the same functionality is implemented in many places, each change becomes a complex task to perform. The sketch below illustrates schema and instance heterogeneity on a toy example.
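To make the heterogeneity types concrete, here is a small Python sketch of the same real-world customer held by two independently developed sources; all names and values are invented for illustration. The two schemas disagree on attribute names and formats (schema heterogeneity), and the stored values disagree as well (instance heterogeneity).

```python
# Source A (hypothetical billing DB): one flat record per customer.
billing_row = {"cust_id": 117, "name": "Rossi, Maria",
               "phone": "+39 02 1234567"}

# Source B (hypothetical provisioning DB): same customer, different
# schema (split name, prefixed key) and a conflicting phone value --
# an instance-level heterogeneity.
provisioning_row = {"id": "C-117", "first": "Maria", "last": "Rossi",
                    "tel": "02/7654321"}

# Any integration effort must (a) map the two schemas onto a common
# one, and (b) decide which conflicting value to trust.
def to_common(a_row=None, b_row=None):
    if a_row is not None:
        last, first = [s.strip() for s in a_row["name"].split(",")]
        return {"key": str(a_row["cust_id"]), "first": first,
                "last": last, "phone": a_row["phone"]}
    return {"key": b_row["id"].removeprefix("C-"), "first": b_row["first"],
            "last": b_row["last"], "phone": b_row["tel"]}

print(to_common(a_row=billing_row))
print(to_common(b_row=provisioning_row))  # same key, different phone
```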
New types of transparency
– DQ frameworks: transparency from DQ errors / instance heterogeneity
– Mediator based: semantic (schema) transparency
– Federated databases: model transparency
– Multidatabases: relational dialect transparency
– Traditional transparencies in DBMSs: physical, logical and distribution transparency (fragmentation + allocation)

The data integration problem
• Combining data coming from different data sources, providing the user with a unified vision of the data
• Detecting correspondences between similar concepts that come from different sources, and solving conflicts

Data architecture
• The allocation of the entities (conceptual, intensional level) and of the data (extensional level) over the multiplicity of DBMS platforms.

Data architecture in a centralized DB
• The data integration problem arises even in the simplest situation: a unique centralized DB.
• Here, in order to avoid inconsistencies and duplications, and to favour efficiency, a three-step methodology is suggested: conceptual, logical and physical design.
[Figure: access queries and update transactions reach a single DBMS with one logical schema over Data Base 1]

Distributed DBMS architecture
• Distributed architectures address the efficiency issue; in dealing with heterogeneities they are quite similar to the centralized case, since they still have a unique (global) schema.
• New: design decisions on vertical and horizontal fragmentation, allocation and replication.
[Figure: access queries and update transactions reach a global schema; the DBMSs of Data Base 1 … Data Base n, each with its own local schema, are connected through a network]

In organizations the data architecture is built over the years
[Figure: independent DBMSs, each with its own logical schema and database, created in 2000, 2001, 2003, 2005, 2007]

Evolution of the data architecture
• In organizations the data architecture is built over the years.
• This leads to the potential explosion of all the types of heterogeneities seen before.

General approaches to data integration
• Consolidation: creation of a new centralized database in which all databases are integrated
• Data exchange: creation of gateways between pairs of systems
– Appropriate only when there are just two systems; no support for queries spanning multiple systems
– The number of gateways potentially grows as n²
• Multidatabase / federated / data integration / data warehouse solutions: creation of a global schema

Consolidation
Process:
• Reverse engineering of the logical schemas into conceptual schemas
• Integration of the local conceptual schemas into a global schema, with resolution of schema-level heterogeneities
• Integration of the data, using techniques such as record linkage to resolve instance-level heterogeneities (a minimal record-linkage sketch follows)
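Record linkage decides whether two records refer to the same real-world entity. Here is a minimal Python sketch; the similarity measure (a normalized edit-distance ratio via `difflib`), the fields compared and the 0.85 threshold are illustrative assumptions, not a prescribed algorithm.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def same_entity(rec1: dict, rec2: dict, threshold: float = 0.85) -> bool:
    # Average the similarity of a few identifying fields;
    # real record-linkage systems use trained weights per field.
    fields = ["last_name", "city"]
    score = sum(similarity(rec1[f], rec2[f]) for f in fields) / len(fields)
    return score >= threshold

r1 = {"last_name": "Batini", "city": "Milano"}
r2 = {"last_name": "Battini", "city": "Milan"}   # typo + language variant
print(same_entity(r1, r2))                        # True
```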
Virtual vs materialized integration
Besides the different approaches of data-driven integration, a distinction is made in how the needed information is retrieved. Independently of which approach is used to define the mapping between the local sources and the global schema, a 1. virtual or 2. materialized integration approach determines how the information is accessed.

Virtual Data Integration
• A system designer builds a mediated schema over which a user poses queries. The source databases are interfaced with wrappers if needed. The data reside at the data sources. Wrappers are format translators.

Data Warehouse: Materialized Data Integration
• Data warehouse [Inmon (1997)]: "Subject-oriented, integrated, time-variant (temporal), non-volatile collection of summary and detailed data, used to support the strategic decision-making process for the enterprise".
• The information from the source databases is extracted, transformed and then loaded into the data warehouse. ETL = Extraction, Transformation, Loading.

Relevant types of integrated database systems
1. Use a materialized database (data are merged in a new database) – Extract Transform Load systems / Data Warehouses: materialized integrated data sources
2. Use a virtual, non-materialized database (data remain at the sources) – Enterprise Information Integration (EII) (or Data Integration) systems. Several solutions exist.

Virtual integration
• The virtual integration approach leaves the requested information in the local sources. The virtual approach will always return a fresh answer to the query. The query posed to the global schema is reformulated into the formats of the local information systems. The retrieved information then needs to be combined to answer the query.

1.4 Data Warehouses as materialized data integration systems

ETL - 1
A large enterprise has more than 5000 major application systems that hold data critical to the efficient functioning of the enterprise. There is an obvious need for data integration, so that a business analyst can get a more complete picture of the enterprise and of how to optimize it. The conventional wisdom to perform this task is to buy a giant machine located in a safe place (say, under Mount Washington). Periodically, the desired data is "scraped" from each operational data system and copied to Mount Washington. Of course, the various operational systems never have a common notion of enterprise objects, such as a purchase order. Hence, there is a requirement to transform source data to a common format before loading it onto the Mount Washington machine.

ETL - 2
Large common machines came to be known as warehouses, and the software to access, scrape, transform, and load data into warehouses became known as extract, transform, and load (ETL) systems. In a dynamic environment, one must perform ETL periodically (say once a day or once a week), thereby building up a history of the enterprise. The main purpose of a data warehouse is to allow systematic or ad-hoc data mining. The requirement for data integration has driven the ETL market, and products are available from Informatica, Ascential, and the major database vendors. A minimal ETL sketch follows.
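The extract/transform/load cycle can be condensed into a few lines of Python. The source formats, the common record layout and the list-based "warehouse" are toy assumptions standing in for real extractors and a real DBMS.

```python
import datetime

# --- Extract: scrape raw records from each (hypothetical) source. ---
def extract_sales_db():
    return [{"po": "A-1", "amt_cents": 9950}]

def extract_erp():
    return [{"order_no": 7, "total_eur": 123.00}]

# --- Transform: bring both formats to a common purchase-order shape,
#     stamping each row with the load date to build up history. ---
def transform(rows, mapping):
    today = datetime.date.today().isoformat()
    return [{"po_id": str(r[mapping["id"]]),
             "amount": float(r[mapping["amount"]]) * mapping["scale"],
             "load_date": today} for r in rows]

# --- Load: append into the warehouse (a list standing in for a DBMS). ---
warehouse = []

def run_etl():
    warehouse.extend(transform(extract_sales_db(),
                               {"id": "po", "amount": "amt_cents", "scale": 0.01}))
    warehouse.extend(transform(extract_erp(),
                               {"id": "order_no", "amount": "total_eur", "scale": 1.0}))

run_etl()   # would be scheduled once a day or once a week
print(warehouse)
```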
Data warehouses - 1
Data warehouses are used to globally analyse large datasets composed from data originating in many On-Line Transaction Processing databases (e.g. the transaction data from all the stores of a retailer all over the country), in order to derive strategic business decisions. Since updates play no role here, and efficient processing of large datasets is the critical issue, the data is not accessed directly in the component databases, but materialized in the integrated database, the central data warehouse. From this warehouse, application-specific views are typically derived by materializing them in so-called data marts, to which the different data analysis tools are then applied.

Typical architecture of a data warehouse
[Figure: user queries run against a global schema; an Extract/Transform/Load component maps each local schema (Source 1 … Source n) into the warehouse]

Data warehouses - 2
The canonical data model for the global schema in data warehouses is typically a specialized form of the relational data model, the so-called multidimensional data model, where all data is kept in one central table with different attributes that represent the different dimensions. It differs from the relational model insofar as it provides much more powerful operations for aggregating the multidimensional data (similar to those of an Excel spreadsheet). Important problems in data warehouses include the development of wrappers, which are used to
1. map from the different data models and schemas into the data warehouse schema,
2. perform so-called data cleansing, i.e. remove errors from the original data.

Example of multidimensional schema
[Figure: client facts (SSN, last name, cell phone) linked to the dimensions "city of residence / region" and "business / business area", with the associated measures]

Example of a DW application
[Figure, translated from Italian: client facts (fiscal code, surname, wealth) linked to the dimensions "wealth and standard of living" and "income-tax return"; the measure is the wealth]

Data Warehouses - 3
Data warehouses are a specialized form of federated database system, in which the integrated database is materialized in order to
– perform efficient data analysis
– support On-Line Analytical Processing (OLAP)
– support data mining
They typically use limited forms of database schemas (e.g. relational star schemas) and perform efficient processing of OLAP and data mining queries. A star-schema sketch follows.
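A relational star schema keeps one central fact table whose rows reference the dimension tables. The sketch below models the client example above with Python dictionaries; the table and column names are invented for illustration, and a real warehouse would of course live in a DBMS.

```python
# Dimension tables: one row per dimension member (hypothetical names).
dim_city = {1: {"city": "Milano", "region": "Lombardia"},
            2: {"city": "Roma",   "region": "Lazio"}}
dim_business = {10: {"business": "retail", "area": "food"}}

# Fact table: foreign keys into the dimensions plus the measure.
facts = [
    {"client_ssn": "AAA", "city_id": 1, "business_id": 10, "revenue": 120.0},
    {"client_ssn": "BBB", "city_id": 1, "business_id": 10, "revenue": 80.0},
    {"client_ssn": "CCC", "city_id": 2, "business_id": 10, "revenue": 50.0},
]

# A typical OLAP-style aggregation: total revenue per region.
def revenue_by_region():
    totals = {}
    for f in facts:
        region = dim_city[f["city_id"]]["region"]
        totals[region] = totals.get(region, 0.0) + f["revenue"]
    return totals

print(revenue_by_region())   # {'Lombardia': 200.0, 'Lazio': 50.0}
```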
ETL evolution
• In the past fifteen years, data warehousing and its associated ETL (Extract, Transform, Load) technologies have delivered great business value to enterprises looking to integrate and analyze large volumes of disparate customer and product data.
• The data warehouse has successfully evolved from monthly dumps of operational data, lightly cleansed and transformed by batch programs, to sophisticated metadata-driven systems that move large volumes of data through staging areas to operational data stores, to data warehouses, to data marts.

1.5 Evolution of Virtual Data Integration Systems

EII - 1
The conventional wisdom is to use data warehousing and ETL products to perform data integration. However, there is a serious flaw in one aspect of this wisdom. Suppose one wants to integrate current (operational) data rather than historical information. Consider, for example, an e-commerce web site which wishes to sell hotel rooms over the web. The actual inventory of available hotel rooms exists in 100 or so information systems. After all, Hilton, Hyatt and Marriott all run their own reservation systems. Applying ETL and warehousing to this problem will create a copy of the hotel availability data, which is quickly out of date. If a web site sells a hotel room based on this data, it has no way of guaranteeing delivery of the room, because somebody else may have sold the room in the meantime.

EII - 2
The only way to guarantee correct data integration in this environment is to fetch the data at the time it is needed from the originating system. Data integration in dynamic environments requires integration on demand, not integration in advance. Dynamic data integration has created a market for data federation systems. These products construct a composite view of disparate data systems and allow a user to run SQL commands, including retrievals and updates, against this composite view. Then they translate each SQL command on the composite view into a collection of local SQL commands that "solves" the user's request.

Historical relevant types of EII solutions
a. Multidatabases – integrated access to different relational DBMSs
b. Federated databases – integrated access to different DBMSs with different logical models
c. Mediator systems (or data integration systems) – integrated access to different data sources (on the web) adopting different schemas and different meanings for concepts
– Mediator based: semantic transparency, and also value (instance) heterogeneity transparency
– Federated databases: model transparency
– Multidatabases: relational dialect transparency
– Traditional transparencies in DBMSs: physical, logical and distribution transparency (fragmentation + allocation)

1.5.1 Multidatabases (relational dialect transparency)

Multidatabases
• Enable transparent access to multiple (relational) databases (relational database middleware)
• Hide distribution and the different database language variants
• Process queries and updates against multiple databases (2-phase commit)
• Do not hide the different database schemas

Example (2004)
[Figure: example of a multidatabase configuration]

Query execution in multidatabases
When a DML statement, such as a query, is submitted, the multi-DBMS has to decompose it into queries against the component databases. This is a non-trivial task, because it has to
1. determine first which tables come from which database,
2. determine which predicates apply to only a single database, and
3. determine which predicates apply to tuples from both databases.
The latter can only be evaluated within the multidatabase, whereas the former can be evaluated within the component databases.

Query execution: example
• This is illustrated in the example: the queries against the component databases do not contain the join predicate. The join itself needs to be evaluated at the multidatabase level. One of the main challenges in developing multidatabases is thus to find good strategies for decomposing and evaluating queries against multiple databases in an efficient manner. A decomposition sketch follows.
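The decomposition idea can be sketched in Python as follows: single-database predicates are pushed down to the component databases, while the cross-database join is evaluated in the middleware. The tables, predicates and the nested-loop join are toy assumptions; a real multi-DBMS would generate SQL for each component and use a proper join algorithm.

```python
# Toy component databases: table name -> list of tuples (as dicts).
db1 = {"EMP":  [{"eno": 1, "name": "Rossi",   "dno": 10},
                {"eno": 2, "name": "Bianchi", "dno": 20}]}
db2 = {"DEPT": [{"dno": 10, "city": "Milano"},
                {"dno": 20, "city": "Roma"}]}

# Steps 1-2: push single-database predicates down to each component DB.
def local_query(db, table, predicate):
    return [row for row in db[table] if predicate(row)]

emps  = local_query(db1, "EMP",  lambda r: True)               # no local filter
depts = local_query(db2, "DEPT", lambda r: r["city"] == "Milano")

# Step 3: the join predicate spans both databases, so it is
# evaluated here, at the multidatabase level (naive nested loop).
result = [{**e, **d} for e in emps for d in depts
          if e["dno"] == d["dno"]]

print(result)   # [{'eno': 1, 'name': 'Rossi', 'dno': 10, 'city': 'Milano'}]
```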
1.5.2 Federated databases (model transparency)

From multidatabases to federated databases
• Federated databases go one step further than multidatabases, by transforming the databases to be integrated both with respect to the data model and with respect to the database schema.
• Thus data stored according to different data models can be integrated, and the database schema of the integrated database can be very different from the schemas of the component databases (which in the following we will also call component schemas).
• In order to structure the process of mapping the data from the component databases to the integrated database, a 5-layer abstract schema architecture has been introduced. It separates the different concerns in the mapping process.

Federated databases schema architecture
• Export schema – unifies the data model, defines access functions
• Import schema – a view on the export schema
• Integrated schema – homogenizes and unions the import schemas
• Views – as in a centralized DBMS

Besides the 1. physical schema, in a first step the differences in the data models (relational, XML, etc.) are overcome by providing export schemas, which are expressed in the same (canonical) data model. Each component database can decide autonomously which data, and which access to the data, to provide in the export schema. These 2. export schemas can be used by different integrated databases. In a second step, the integrated databases decide which data to use from the component databases by specifying an import schema. The 3. import schema can of course only use what is provided by the export schema; thus it is a view on the export schema.

In a third step, the integrated DBMS maps the data obtained from the different databases, as defined in the import schemas, into one common view, the 4. integrated schema. This can be a complex mapping, both at the schema level and at the data level. The integrated schema is what the integrated database provides to the applications as its logical database schema.

As in standard databases, from the integrated schema 5. application-specific views can be derived. In practice, systems do not always follow this architecture strictly, and different layers are combined (e.g. often there is no clear distinction between the export and the import schemas). Still, the architecture is a very useful tool for distinguishing the different aspects that need to be addressed in database integration. The sketch below renders the five layers in code.
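The five layers can be made tangible with a small Python sketch: the export schema exposes a component table in a canonical form, the import schema is a projection on it, and the integrated schema unions the imports. All table and attribute names are invented for the example.

```python
# 1. Physical/component database (hypothetical): its own model and names.
component_rows = [("Rossi", "DB Systems", 320),
                  ("Batini", "Information Architecture", 410)]

# 2. Export schema: canonical form (dicts), deciding what to expose.
def export_schema():
    return [{"author": a, "title": t, "pages": p}
            for (a, t, p) in component_rows]

# 3. Import schema: a view on the export schema -- here the
#    integrated database chooses not to import "pages".
def import_schema():
    return [{"author": r["author"], "title": r["title"]}
            for r in export_schema()]

# 4. Integrated schema: homogenize and union the import schemas of
#    all component databases (only one component shown here).
def integrated_schema():
    return import_schema()   # in general, the union of all imports

# 5. Application-specific view on the integrated schema.
def authors_view():
    return sorted({r["author"] for r in integrated_schema()})

print(authors_view())   # ['Batini', 'Rossi']
```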
1.5.2.1 Details on models for the integrated schema

The common model for the integrated schema
All these solutions call for a common model used to represent the global data schema homogeneously. We can consider all of the existing kinds of data models that have been proposed; in fact, each of them has also been used as the global model of a federated DBMS.

The common model in federated databases: 1. the relational model
+ The most frequently used data model for centralized and distributed DBMSs
- Difficult to represent data from more complex models (documents, OO)
+ Thus best suited for integrating relational data sources
- Difficult to represent constraints occurring in integration (e.g. generalization/specialization)

The common model in federated databases: 2. the object-oriented model
+ Expressive
+ Has been proven a successful approach, used in research and in some systems
- Did not prevail in practice, because of the lack of commercial success of OODBMSs

The common model in federated databases: 3. XML
+ Is becoming a ubiquitously used data model
+ XML Schema contains the main elements of an object-oriented data model
+ Can be used both for the representation of data and of documents
From these characterizations we can see that most likely XML-based models will become the prevailing global data model used for the integration of databases.

The common model in federated databases: 4. object-oriented + XML
+ Effective when databases (OO) and document bases (XML) are considered together
• Adopted in Momis, the academic prototype described later in the course.

1.5.2.2 An example of federated architecture

Federated DBMS example
[Figure: a federated DBMS whose canonical data model is XML, with the export schemas of the component databases]

Example - 2
Here the component DBMSs use different data models (relational and XML DTD). The canonical data model is again different, i.e. XML Schema. Thus the wrappers provide the functionality necessary to access the data in the component databases according to the canonical data model.

Example - 3
For the relational database, for example, a complex type "Book" is defined in the export schema; it contains data from both tables PUBLICATIONS and AUTHORS (assuming the element "authors" contains the author names). For populating the export schema (or for transforming the data), the wrapper needs to compute a join using the underlying database system. For the XML database, the export schema is an XML Schema that corresponds exactly to the XML DTD. Thus the transformation of the XML data is also simple, because the data needs no transformation at all!

Example – 4: import schemas and integrated schema
[Figure: the import schemas of the two component databases and the resulting integrated schema]

Example - 5
Next we zoom into the internal workings of the federated DBMS. First the federated database determines the import schemas. As it is not using the element "pages" from DB1, the import schema drops this element; otherwise the schemas remain unchanged. The two import schemas, though similar, still contain differences in the names: both "Book"/"book" and "authors"/"author" are incompatible names. Therefore, in order to arrive at an integrated schema, the federated DBMS has to map the two import schemas into one integrated schema, as shown above (of course, in general the differences among the schemas will be much larger).

Example - 6
What is not shown in the illustration is the corresponding mapping at the data level. Since books from two databases are now mapped into one integrated database, it can happen that the same book is contained in both component databases. The federated DBMS has to recognize such a situation and take the corresponding action, e.g. by instantiating such a book only once in the virtual integrated database. A sketch of this mapping follows.
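Here is a minimal Python sketch of the schema- and data-level mapping of this example: the two import schemas are renamed to the integrated names ("book", "author"), and duplicate books are instantiated only once. The record layouts and book titles are invented; the element names follow the slides.

```python
# Import schema of DB1 (hypothetical layout): uses "Book"/"authors".
db1_books = [{"Book": "Data Integration", "authors": ["Lenzerini"]}]
# Import schema of DB2: already uses "book"/"author".
db2_books = [{"book": "Data Integration", "author": ["Lenzerini"]},
             {"book": "Federated DBs", "author": ["Sheth", "Larson"]}]

# Schema-level mapping: rename the DB1 elements to the integrated names.
def map_db1(record):
    return {"book": record["Book"], "author": record["authors"]}

# Data-level mapping: the same book may come from both databases,
# so it is instantiated only once in the virtual integrated DB.
def integrate():
    seen, result = set(), []
    for rec in [map_db1(r) for r in db1_books] + db2_books:
        key = rec["book"]          # toy duplicate-detection key
        if key not in seen:
            seen.add(key)
            result.append(rec)
    return result

print(integrate())   # "Data Integration" appears only once
```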
Federated databases: query processing
Once the integration is established at the schema level, the federated DBMS needs to support the standard operations on the integrated database, in particular DML operations. Since multiple schemas and data models are involved, this task is considerably more complex than in the case of multidatabases.

Federated databases query processing: example - 2
This is illustrated in the example. Assume a query for all authors of books is submitted to the integrated database in XML Query.
1. First the federated DBMS has to determine which databases can contribute to the answer, and generate from the query Q, posed against the integrated schema, the queries that can be posed against the import schemas (Q1 and Q2).

Example - 3
1.bis In that step it also has to invert the mappings performed during integration (e.g. mapping back to the names "book" and "authors").
2. Then each query is forwarded to the wrapper without change, since the import schema is just a projection on the export schema.

Example - 4
3. The wrapper now has the task of converting the query from a query in the canonical query language and data model into a query in the model of the component database. For the relational database this requires a transformation into an SQL query, whereas for the XML database the query is converted into an XPath query.
4. Then the wrappers submit those queries to their respective component database systems and receive the results in the respective data models.

Example - 5
4.bis For the relational database this means that the wrapper now has to convert the relational result into an XML representation. For the XML database no change is required, since the data representation (namely, XML documents) is the same independently of whether XML DTDs or XML Schemas are used as XML type definitions.

Example - 6
6. Once the federated DBMS receives the results, it has to map the result data from the import schemas to the representation of the integrated schema.
7. For our example this requires renaming the element name "authors" to "author" in result A1 (note that renaming "book" is not required, as this name does not occur in the result).
8. Then the federated DBMS computes the result of the query by performing a union (thus eliminating duplicates). Note that for more complex mappings from the import schemas to the integrated schema this step can be considerably more complex.
9. Finally, the result can be returned to the application. The sketch below walks through these steps.
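The whole round trip can be condensed into a Python sketch: decompose the query, let each (toy) wrapper answer in a canonical form, rename back to the integrated names, and union with duplicate elimination. The wrapper internals, the data and the author names are illustrative assumptions; the element names follow the example.

```python
# Toy wrappers: each answers "all authors of books" in a canonical
# (XML-like) form, hiding SQL vs. XPath behind the same interface.
def wrapper_db1_authors():          # relational DB behind the scenes
    return [{"authors": "Sheth"}, {"authors": "Larson"}]

def wrapper_db2_authors():          # native XML DB behind the scenes
    return [{"author": "Larson"}, {"author": "Lenzerini"}]

def query_all_authors():
    # Steps 1/1.bis: decompose Q into Q1, Q2 against the import
    # schemas, inverting the integration mapping (names differ).
    a1 = wrapper_db1_authors()      # result uses element name "authors"
    a2 = wrapper_db2_authors()      # result already uses "author"

    # Steps 6/7: map the results to the integrated schema --
    # rename "authors" to "author" in A1 only.
    a1 = [{"author": r["authors"]} for r in a1]

    # Step 8: union with duplicate elimination.
    names = {r["author"] for r in a1 + a2}
    return sorted(names)

print(query_all_authors())   # ['Larson', 'Lenzerini', 'Sheth']
```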