UmbrellaDB Virtual Data Base Architecture to Integrate Heterogeneous Data Sources

Eric T. Matson
CIS 864 Data Engineering
Kansas State University – Manhattan, KS, USA, 66506
[email protected]

Abstract

The UmbrellaDB Virtual Database Architecture's main purpose is to reconcile the heterogeneity occurring in large organizations. UmbrellaDB provides a generic architecture through which multiple, heterogeneous data sources can be efficiently accessed and utilized from a unified, simple and intuitive application. It presents the varied data sources to the user as a single virtual database schema and allows the user to query and manipulate the data sources with an easy-to-use, “point and click” Graphical User Interface (GUI). The UmbrellaDB tool uses the Structured Query Language (SQL) as its base query language, so there is no requirement to learn a new language or syntax.

1 Introduction

Large organizations, with a great number of people, products, customers, and services, are vast repositories of information. As organizations grow and change, their data organization requirements and needs evolve. Far too often the ability to proactively manage the organizational information requirements cannot keep pace with the actual information growth. Enterprises tend to represent their data using a variety of data models and schemas, while users push for data integration and cohesiveness [3].

Organizations naturally create information storage and classification heterogeneity. The heterogeneity can be purposeful, meaning there are specific and justified reasons for selecting and implementing multiple platforms, systems and architectures. Heterogeneity can also be driven by acquisitions and mergers: if an organization takes control of or inherits another organization, the data needs of both have to be considered and managed for the greater good of the resulting new organization structure. The worst case is heterogeneity resulting from unintended or accidental causes.
Normally this is propagated by a lack of strategy, a lack of discipline or instances of poor execution. The result is an unorganized and fragmented organization-wide data schema with very few or no intersections through which to integrate the data to provide information.

There are many perfectly justifiable reasons for not changing the residence or format of organizational information. It may be cost prohibitive to implement new technology, from either a human resource or financial resource perspective [1]. Implemented technology or infrastructure may be obsolete, not allowing the organization to migrate to a newer instance or release. The implementation may have a short life span, such that there is no Return on Investment (ROI) or Net Present Value (NPV) justification for migrating from the information platform.

To approach the problem of data and information heterogeneity, an organization must be able and willing to develop an enterprise-level schema capable of representing the minimal set of relations required to provide information to answer questions and solve problems. One perspective from which to view this type of technology architecture is that of a virtual data warehouse. The movement toward data warehousing is recognition of the fact that the fragmented information that an organization uses to support day-to-day operations at a department level can have immense strategic value when brought together [2].

In this paper, I address the problem of data source heterogeneity and integration very specifically. I will introduce the UmbrellaDB tool, which implements a Virtual Database Architecture (VDA).

2 Problem Statement

An organization's need to integrate its data is not a single requirement with minimal constraints. The organization must have the ability to change the configuration of a global data schema very quickly, both to adjust to changes in business and to satisfy temporal aspects of organizational information evolution.
The organization must also have the ability to integrate data sources of many types existing on many platforms. The tool used to integrate the data sources must be intuitive to use and must not require a steep learning curve or intimate knowledge of all data source architectures to configure.

3 UmbrellaDB Virtual Database Introduction

The UmbrellaDB architecture that I present in this paper simplifies the process of integrating data of different formats across heterogeneous platforms and architectures. Data is integrated at the user level. The user sees a virtual view of all data sources integrated into a single conceptual database. All technical architecture details of where the data is located and how it is configured are hidden from the user. The user sees only a graphical database where they can “point-and-click” to formulate queries in order to retrieve the required information. UmbrellaDB is an example of a Virtual Database Architecture. Schemas of all data sources are integrated within the UmbrellaDB workbench so that it appears the user is working with a single, unified global schema. A conceptual, high-level description of the UmbrellaDB VDA is given in Figure 1, which shows UmbrellaDB connected over TCP/IP to an Oracle database in France, an object database in China, a PostgreSQL database in the USA, text files in the USA, and local text file and CLIPS data sources.

Figure 1: UmbrellaDB Virtual Database

4 Architecture Overview

The general architecture of the UmbrellaDB VDA is shown in Figure 2. The architecture consists of three main components. The first is the UmbrellaDB GUI and Application Programming Interface (API). The GUI is the tool to be used if the user needs to execute queries across data sources and look at information within the tool's workbench. The API is necessary if the user wants to develop additional programs, in Java, C or C++, that use all of the data sources in a real-time environment as a single active database schema. The second part of the architecture is the DataServer.
The DataServer acts similarly to a Relational Database Management System (RDBMS), except that it accesses and manages formatted and delimited text files. This allows the textual files to be easily accessed and manipulated like RDBMS tables.

Figure 2: UmbrellaDB Architecture

The third architecture piece is the UmbrellaDB Engine, which is the heart of the application. It has six distinct parts that require additional explanation: Parser, Splitter, Router, Unifier, Profile and Data Connectors. The data sources are represented as schemas. A data source Profile stores that source's metadata. Profile metadata, such as the number of records and the size of records, is used by the UmbrellaDB engine to optimize queries prior to execution. The UmbrellaDB Engine Agent continually monitors the defined data sources to retrieve and update each data source's Profile. A Data Connector is an intelligent agent that acts as an interpreter for each type of data source. It interprets and communicates the query to each data source and translates query results back to the UmbrellaDB engine. The GUI displays the parts of the virtual schema in a user-friendly and organized manner, which helps the user easily build queries using “point-and-click” technology.

The main components of the UmbrellaDB Engine Agent are the Parser, Splitter, Router and Unifier. Upon execution of a query, the UmbrellaDB engine proceeds through a series of steps:

1. Parse the query to ensure correctness.
2. Split the query into sub-queries relevant to each data source being accessed.
3. Route the sub-queries to the data sources for processing.
4. Unify the result sets of all of the returned queries into a single, unified relation.

Query optimization across the data sources is conducted by two general processing algorithms: parallel and sequential execution.
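As a concrete illustration, the four-step cycle can be sketched in a few lines of Python. This is a toy sketch only: the in-memory tables stand in for the remote data sources, and names such as `parse`, `split`, `route` and `unify` are illustrative, not the actual UmbrellaDB API.

```python
# Toy sketch of the engine's parse -> split -> route -> unify cycle.
# In-memory lists of dicts stand in for the remote data sources.

sources = {
    "hotel":   [{"hotel_no": 1, "hotel_name": "Grande"}],
    "booking": [{"hotel_no": 1, "room_no": 101, "guest_no": 7}],
    "room":    [{"room_no": 101, "price": 120.0}],
}

def parse(sql):
    # Step 1: minimal correctness check on the query text.
    assert sql.upper().startswith("SELECT"), "not a SELECT query"
    return sql

def split(sql, tables):
    # Step 2: one sub-query per data source named in the FROM list.
    return {t: f"SELECT * FROM {t}" for t in tables}

def route(subqueries):
    # Step 3: "send" each sub-query to its source, collect datasets.
    return {t: sources[t] for t in subqueries}

def unify(datasets):
    # Step 4: join the returned datasets on their shared key columns.
    rows = []
    for h in datasets["hotel"]:
        for b in datasets["booking"]:
            for r in datasets["room"]:
                if h["hotel_no"] == b["hotel_no"] and r["room_no"] == b["room_no"]:
                    rows.append({**h, **b, **r})
    return rows

sql = parse("SELECT * FROM hotel, booking, room")
result = unify(route(split(sql, ["hotel", "booking", "room"])))
print(result)
```

The real engine would, of course, push selections and projections into the sub-queries rather than shipping whole tables; that refinement is exactly what the optimization step below addresses.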
A decision is made at the time of query execution whether to execute the query in parallel or sequential mode. The intelligent agent engine that executes and manages the query plan is based on a metadata description of the data sources. The goal of the query optimization process is to minimize the defined cost during query execution. Cost is defined by the amount of data transported across the network, access time, input/output, response time and other relevant factors.

5 Data Sources Integrated with UmbrellaDB

UmbrellaDB is a flexible architecture designed to support many data source types. Examples of the data source types supported are formatted text files, delimited text files and relational databases that can be accessed via Java Database Connectivity (JDBC). Future research and development will provide for integration with knowledge-based formats such as Prolog and CLIPS files; non-relational data sources such as Key Sequenced Data Sets (KSDS), Open Database Connectivity (ODBC) data sources and B-tree data sources; and traditional mainframe technologies such as IMS.

6 Example Schema

Figure 3: Example Virtual Schema

Figure 3 provides an example to be used throughout the description of the Query Engine processes. It is a specific example of the UmbrellaDB virtual database schema. There are sub-schemas representing four data sources: an Informix Relational Database Management System (RDBMS), a PostgreSQL Object-Relational Database Management System (ORDBMS), a Microsoft Excel spreadsheet ODBC data source, and a formatted text file. The virtual schema contains four tables: Hotel (hotel_no, hotel_name, address), Room (room_no, hotel_no, type, price), Booking (hotel_no, guest_no, date_from, date_to, room_no) and Guest (guest_no, guest_name, no_adults, no_children, address).

7 Query Process

The Query Engine is responsible for processing queries provided by the GUI or the API.
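Before walking through the Query Engine steps, it helps to have the example schema in an executable form. The following sketch mocks it up with a single in-memory SQLite database standing in for the four heterogeneous sources; the sample rows are invented for illustration.

```python
import sqlite3

# One SQLite database stands in for the Informix, PostgreSQL,
# Excel and text-file sources of the example virtual schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hotel   (hotel_no INTEGER, hotel_name TEXT, address TEXT);
CREATE TABLE room    (room_no INTEGER, hotel_no INTEGER, type TEXT, price REAL);
CREATE TABLE booking (hotel_no INTEGER, guest_no INTEGER,
                      date_from TEXT, date_to TEXT, room_no INTEGER);
CREATE TABLE guest   (guest_no INTEGER, guest_name TEXT,
                      no_adults INTEGER, no_children INTEGER, address TEXT);
""")
con.execute("INSERT INTO hotel VALUES (1, 'Grande', 'Manhattan KS')")
con.execute("INSERT INTO room VALUES (101, 1, 'double', 120.0)")
con.execute("INSERT INTO booking VALUES (1, 7, '07-04-2002', '07-06-2002', 101)")
con.execute("INSERT INTO guest VALUES (7, 'A. Smith', 2, 1, 'Topeka KS')")

# The sample query used throughout the Query Process section.
rows = con.execute("""
    SELECT hotel.hotel_no, hotel.hotel_name, booking.room_no, room.price
    FROM hotel, booking, room
    WHERE hotel.hotel_no = booking.hotel_no
      AND room.room_no   = booking.room_no
""").fetchall()
print(rows)
```

In UmbrellaDB itself this query never runs against one database; the engine decomposes it across the four sources, as the following subsections describe.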
A query is processed in four distinct steps: Parsing, Splitting, Routing and Unification. The query is represented by SQL language constructs.

There are two execution processes that the Query Engine can use to most efficiently process a query over a set of heterogeneous data sources: parallel and sequential. The decision is made by the Query Engine Agent upon the execution of a query. If it is more efficient to split a query into parallel processes and run them simultaneously, the Engine Agent will select that option. If it is more efficient to execute the query sequentially, the agent will select that option. The decision is made by evaluating the metadata of each data source involved in the query and calculating the most efficient solution.

7.1 Query Parsing

When a query is generated graphically through the Umbrella GUI Workbench or passed as an SQL statement from the API, it is parsed to test correctness and validity. Splitting the query into parts is the initial step of the parsing process. The parts are strings that contain a specific part of the query. The second step checks the query for correct syntax. A sample query:

SELECT hotel.hotel_no, hotel.hotel_name, booking.room_no, room.price
FROM hotel, booking, room
WHERE hotel.hotel_no = booking.hotel_no
  AND room.room_no = booking.room_no;

is decomposed into a set of Tokens:

Token: SELECT
Token: Field List
Token: FROM
Token: Table List
Token: WHERE
Token: Conditionals List

The SELECT, FROM, and WHERE Tokens are checked to ensure the spellings are correct. The Field List is checked against the Profile database to make sure all requested fields are valid, spelled correctly, and exist in the schema. The Table List is checked to ensure that all listed tables are valid, spelled correctly and are registered data sources in the global schema. The Conditionals List check makes sure that all conditional fields are valid, spelled correctly, and exist in the schema.
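The checks against the Profile can be sketched as follows. The Profile layout here is hypothetical (a simple table-to-fields mapping); the real metadata store also carries record counts and sizes.

```python
# Hypothetical Profile metadata: table name -> known field names.
profile = {
    "hotel":   {"hotel_no", "hotel_name", "address"},
    "booking": {"hotel_no", "guest_no", "date_from", "date_to", "room_no"},
    "room":    {"room_no", "hotel_no", "type", "price"},
}

def check_tables(table_list):
    # Return any listed table that is not a registered data source.
    return [t for t in table_list if t not in profile]

def check_fields(field_list):
    # Return any table.field reference that does not exist in the schema.
    bad = []
    for ref in field_list:
        table, field = ref.split(".")
        if table not in profile or field not in profile[table]:
            bad.append(ref)
    return bad

tables = ["hotel", "booking", "room"]
fields = ["hotel.hotel_no", "hotel.hotel_name", "booking.room_no", "room.price"]
print(check_tables(tables))            # every table is registered
print(check_fields(fields))            # every field is valid
print(check_fields(["room.prise"]))    # a misspelled field is caught
```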
It also checks the operands and syntax of each conditional statement. The final check it performs is to ensure that the data types in any conditional comparison match.

7.2 Query Splitting

After the query is successfully parsed, it is forwarded to the Splitter. The Splitter decomposes the query into a set of sub-queries. Each data source represented will process a sub-query. Using the sample query again:

SELECT hotel.hotel_no, hotel.hotel_name, booking.room_no, room.price
FROM hotel, booking, room
WHERE hotel.hotel_no = booking.hotel_no
  AND room.room_no = booking.room_no;

it is decomposed into a set of sub-queries:

SELECT hotel_no, hotel_name FROM hotel;
SELECT room_no, hotel_no FROM booking;
SELECT room_no, price FROM room;

These sub-queries are not optimized in this example. The sub-queries are passed to the Router, and the conditionals are used by the Unifier to perform the necessary joins upon successful return from the Router.

7.3 Query Routing

The Query Router takes the output of the Query Splitting process and sends it to the appropriate data source for processing. If the query is successful and data is returned, it will be in the form of an UmbrellaDB dataset that can then be used by the Unifier. For the sub-queries above:

SELECT hotel_no, hotel_name FROM hotel;

The hotel query is routed to the Informix database by a JDBC call and returned as a dataset.

SELECT room_no, hotel_no FROM booking;

The booking query is routed to the PostgreSQL database by a JDBC call and returned as a dataset.

SELECT room_no, price FROM room;

The room query is routed to a DataServer object, processed against the formatted textual file and returned as a dataset.

7.4 Query Unification

The goal of the Unification step of the query process is to take all of the returned datasets, from each data source involved in the query, and join them into a single relation.
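The DataServer side of that last routing step, serving a sub-query from a delimited text file, can be sketched with Python's csv module. The file contents and the `dataserver_query` helper are invented for illustration.

```python
import csv
import io

# A delimited text file standing in for the "room" data source.
room_file = io.StringIO("room_no,price\n101,120.0\n102,85.0\n")

def dataserver_query(f, columns):
    # Minimal DataServer: read a delimited file and return the
    # requested columns as a dataset of dicts, like an RDBMS table.
    return [{c: row[c] for c in columns} for row in csv.DictReader(f)]

dataset = dataserver_query(room_file, ["room_no", "price"])
print(dataset)  # [{'room_no': '101', 'price': '120.0'}, {'room_no': '102', 'price': '85.0'}]
```

A real DataServer would additionally carry column types (the values above come back as strings) so that the Unifier can compare them against JDBC results.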
Using the set of sub-queries:

SELECT hotel_no, hotel_name FROM hotel;
SELECT room_no, hotel_no FROM booking;
SELECT room_no, price FROM room;

and the conditional statement from the initial query:

WHERE hotel.hotel_no = booking.hotel_no
  AND room.room_no = booking.room_no;

the returned datasets are joined together as a universal relation.

7.5 Query Optimization

An interesting research area of the UmbrellaDB project is query optimization. With virtual database architectures, there are new design issues to be addressed that do not occur in traditional, integrated database architectures. Firstly, virtual database architectures are not represented by a native global schema. The absence of a global schema means that integration must be defined by the user. The integration is managed at the user level by an active and intelligent agent that monitors the data sources and collects metadata to continuously update the virtual database schema. For JDBC-related data sources, the Engine Agent makes calls to the database and picks up changes in size and schema on a scheduled basis. The textual data sources have no metadata environment, so they are monitored and updated by the user.

Most RDBMS, ORDBMS and commercial systems have query optimizations in use that execute in the database server run-time environment. Our optimization issues focus on three distinct areas. First, we examine the ability to efficiently retrieve textual data from formatted and delimited data sources. The second consideration is the order in which the queries are executed: if a sub-query's result dataset is very small, it can be executed first and its result then used against a dataset that is very large, reducing the overall data transfer across network interfaces. The third area is unification of datasets from multiple heterogeneous sources. The concept of query tree decomposition [5][6] is used to design the algorithms for the economic analysis and query evaluation.
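A toy version of that economic comparison might look like the following. The metadata figures, selectivities and the max-of-transfers model for parallel cost are all invented for illustration; the real engine evaluates projections, selections, joins and Cartesian products over query trees.

```python
# Hypothetical per-source metadata gathered by the Engine Agent.
meta = {
    "booking": {"rows": 500,   "bytes_per_row": 40},
    "guest":   {"rows": 20000, "bytes_per_row": 120},
}

def transfer_cost(source, selectivity):
    # Bytes shipped over the network if this source is queried
    # with a filter of the given selectivity.
    m = meta[source]
    return m["rows"] * selectivity * m["bytes_per_row"]

# Parallel: both full sub-queries run at once; elapsed transfer is
# dominated by the larger of the two result sets.
parallel_cost = max(transfer_cost("booking", 0.01),
                    transfer_cost("guest", 0.3))

# Sequential: run the small sub-query first, then use its guest
# numbers to restrict the second one, shrinking the second transfer.
sequential_cost = (transfer_cost("booking", 0.01) +
                   transfer_cost("guest", 0.01))

plan = "sequential" if sequential_cost < parallel_cost else "parallel"
print(plan, parallel_cost, sequential_cost)
```

With these figures the restricted second sub-query is far cheaper than shipping the large Guest result in parallel, so the sequential plan wins; with two sources of similar, small size the comparison would flip.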
Depending on the selection of parallel or sequential execution, the Engine decomposes the queries into query trees and uses the cost of projections, selections, joins and Cartesian products to evaluate the economic factors associated with executing a query. Once the total cost is calculated, the parallel or sequential algorithm is selected and the query propagates through its steps to successful completion. The evaluation considers all economic factors possible to calculate the overall cost. There are some factors that cannot be accounted for because of the potential for errors and unforeseen issues, such as network bandwidth and traffic [5].

7.5.1 Sequential Query Execution Example

We present an example of a sequential execution of a query. This example is based on the virtual database exhibited in Figure 4. The query accesses two data sources: a PostgreSQL ORDBMS and an MS Excel spreadsheet via ODBC. The two virtual database tables are related by a guest_no field. The objective is to find the guest name and address of every registered guest with children who stayed at the hotel on July 4th.

Step 1 is to split the query into sub-queries so that the Query Engine can evaluate them. Step 2 takes the sub-query for each data source and weighs it against the current metadata for that data source to decide whether the query should be executed in sequential or parallel order. Once the decision is made to evaluate in sequential order, the Query Engine Agent develops the query plan to most efficiently execute the query across the data sources. In this example the Agent decides to process the data from the PostgreSQL data source first, because it will return a small set of guest numbers that meet the query constraints; the data returned will be smaller than if we went to the Excel data source first. In Step 3, a selection is executed on the PostgreSQL data source to return all records where the date is equal to July 4th.
A projection of guest numbers from the July 4th dataset is then returned to the Umbrella Engine. Step 4 uses that list of guest numbers to query the MS Excel data source. First a selection is run to return all relations where the number of children is greater than zero. A projection is then applied to that dataset to return only the guest name and address of each relation. To complete the query, the MS Excel result is restricted by the guest number list from the PostgreSQL dataset to produce the final guest name and address dataset.

SELECT guest_name, address
FROM Booking, Guest
WHERE date_from = '07-04-2002' AND no_children > 0;

Figure 4: Sequential Execution Example

7.5.2 Parallel Query Execution Example

We present an example of a parallel query execution. This example is based on the virtual database exhibited in Figure 5. The query accesses two data sources: a PostgreSQL ORDBMS and an MS Excel spreadsheet via ODBC. The two virtual database tables are related by a guest_no field. The objective is to find the guest name and address of every registered guest with children who stayed at the hotel on July 4th.

Step 1 is to split the query into sub-queries so that the Query Engine can evaluate them. Step 2 takes the sub-query for each data source and weighs it against the current metadata for that data source to decide whether the query should be executed in sequential or parallel order. Once the decision is made to evaluate in parallel order, the Query Engine Agent develops the query plan to most efficiently execute the query across the data sources.
In this case, the Query Engine Agent makes the decision to process in parallel because it is the most efficient execution. A selection is executed against the Booking table of the PostgreSQL database where the date_from is July 4th, 2002. At the same time, a selection is executed against the Guest MS Excel spreadsheet where the number of children is greater than zero. Once the selections are complete, projections are executed against both resulting datasets to return the guest number from the PostgreSQL dataset and the guest number, guest name and address from the MS Excel dataset, respectively. The two returned datasets are joined by guest number to derive the result set of guest name and address.

SELECT guest_name, address
FROM Booking, Guest
WHERE date_from = '07-04-2002' AND no_children > 0;

Figure 5: Parallel Execution Example

10 Conclusion

In the present paper we present a working prototype of UmbrellaDB. The prototype successfully integrates delimited text, formatted text, MySQL Relational Database Management System (RDBMS) and PostgreSQL Object-Relational Database Management System (ORDBMS) data sources. The data sources reside on different physical hardware systems, consisting of Sun Solaris, Red Hat Linux, Windows XP and Windows NT, connected via a TCP/IP-based network. Virtual schemas of each data source have been defined to UmbrellaDB, and queries have been successfully executed across the heterogeneous sources. We have tested both the sequential and parallel execution of queries for optimality. The ability to integrate numerous heterogeneous data sources has been tested and proven with this architecture.
User-level integration has been achieved. Queries are easily constructed and optimally executed to completion.

11 Future Work

There are numerous additional developments planned for the UmbrellaDB Virtual Database tool. A representative list of future research topics and enhancements follows:

- Extension to numerous additional data source types
- Enhancement of the Query Engine algorithms
- Inclusion of data compression techniques to drastically reduce data transmission time
- Use of proposed Huffman code methods [4] to allow fast retrieval and processing of large text datasets

References

[1] E. Matson, “Querying Distributed, Heterogeneous Databases and Knowledge Bases with a Single Unified Approach”, Research Paper, CIS 860, 1997.
[2] I. Witten, E. Frank, “Data Mining”, Morgan Kaufmann Publishers, San Francisco, 2000.
[3] Y. Papakonstantinou, H. Garcia-Molina, J. Widom, “Object Exchange Across Heterogeneous Information Sources”, research sponsored by Wright Laboratory, Aeronautical Systems Center, Air Force Materiel Command, USAF.
[4] E. Moura, G. Navarro, N. Ziviani, R. Baeza-Yates, “Direct pattern matching on compressed text”, in Proc. 5th International Symposium on String Processing and Information Retrieval (SPIRE'98), September 1998, pp. 90–95, IEEE Computer Society.
[5] M. Özsu, P. Valduriez, “Principles of Distributed Database Systems”, Prentice-Hall, New Jersey, 1991.
[6] J. Ullman, “Principles of Database and Knowledge-Base Systems, Volume II: The New Technologies”, Computer Science Press, Rockville, Maryland, 1989.
[7] J. Ullman, “Principles of Database and Knowledge-Base Systems, Volume I: Classical Database Systems”, Computer Science Press, Rockville, Maryland, 1988.