Download Paper - People - Kansas State University

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

Microsoft SQL Server wikipedia, lookup

Extensible Storage Engine wikipedia, lookup

Open Database Connectivity wikipedia, lookup

Entity–attribute–value model wikipedia, lookup

Database wikipedia, lookup

Microsoft Jet Database Engine wikipedia, lookup

Functional Database Model wikipedia, lookup

Relational model wikipedia, lookup

Clusterpoint wikipedia, lookup

Database model wikipedia, lookup

Transcript
UmbrellaDB
Virtual Data Base Architecture to Integrate Heterogeneous Data Sources
Eric T. Matson
CIS 864 Data Engineering
Kansas State University – Manhattan, KS, USA, 66506
[email protected]
Abstract
The UmbrellaDB Virtual Database Architecture’s main purpose is to reconcile the
heterogeneity occurring in large organizations. UmbrellaDB provides a generic
architecture for multiple, heterogeneous data sources to be efficiently accessed and
utilized from a unified, simple and intuitive application. It presents the varied data
sources to the user as a single virtual database schema and allows the user to query and
manipulate the data sources with an easy-to-use, “point and click” Graphical User
Interface (GUI). The UmbrellaDB tool uses the Standard Query Language (SQL) as its
base query language so there is no requirement to learn a new language or syntax.
1
Introduction
Large organizations, with a great number of people, products, customers, and
services are vast repositories of information. As organizations grow and change, the data
organization requirements and needs evolve. Far too often the ability to proactively
manage the organizational information requirements cannot keep pace with the actual
1
information growth. Enterprises tend to represent their data using a variety of data
models and schemas while users drive for data integration and cohesiveness. [3]
Organizations naturally create information storage and classification
heterogeneity. The heterogeneity can be purposeful, which means there are specific and
just reasons for selecting and implementing multiple platforms, systems and
architectures. Heterogeneity can be driven by acquisitions and mergers. If an
organization takes control of or inherits another organization, the data needs of both have
to be considered and managed for the greater good of the resulting new organization
structure. The worst cast is of the heterogeneity resulting from unintended or accidental
purposes. Normally this is propagated by a lack of strategy, lack of discipline or
instances of poor execution. The result is a unorganized and fragmented organizationwide data schema with very few or no intersections of which to integrate the data to
provide information.
There are many perfectly justifiable reasons for not changing the residence or
format of organizational information. It may be cost prohibitive to implement new
technology, either from a human resource or financial resource perspective [1].
Implemented technology or infrastructure may be obsolete not allowing the organization
to migrate to a newer instance or release. The implementation may be of a short life span
where there is no Return on Investment (ROI) or Net Present Value (NPV) justification
for migrating from the information platform.
To approach the problem of data and information heterogeneity, an organization
must be able and willing to develop an enterprise level schema capable of representing
the minimal set of relations required to provide information to answer questions and
2
solve problems. One perspective to view this type of technology architecture is that of a
virtual data warehouse. The movement toward data warehousing is recognition of the
fact that the fragmented information that an organization uses to support day-to-day
operations at a department level can have immense strategic value when brought
together. [2]
In this paper, I address the problem of data source heterogeneity and integration
very specifically. I will introduce the UmbrellaDB tool which implements a Virtual
Database Architecture (VDA).
2
Problem Statement
An organization’s need to integrate its data is not a single requirement with minimal
constraints. The organization must have the ability to change the configuration of a
global data schema very quickly to adjust to changes in business and to satisfy temporal
aspects of organization information evolution. The organization must also have the
ability to integrate data sources of many types existing on many platforms. The tool
used to integrate the data sources must be intuitive to use and not require a steep learning
curve or intimate knowledge of all data source architectures to configure.
3
UmbrellaDB Virtual Database Introduction
The UmbrellaDB architecture that I present in this paper, simplifies the process of
integrating data of different formats across heterogeneous platforms and architectures.
Data is integrated at the user level. The user sees a virtual view of all data sources
integrated into a single conceptual database. All technical architecture details, of where
3
the data is located and configured, are hidden from the user. The user sees only a
graphical database where they can “point-and-click” to formulate queries in order to
retrieve the required information.
UmbrellaDB is an example of a Virtual Database Architecture. Schemas of all
data sources are integrated within the UmbrellaDB workbench so that it appears a user is
working with a single, unified global schema. A conceptual, high-level description of the
UmbrellaDB VDA is described by Figure 1.
UmbrellaDB
TCP/IP
TCP/IP
Oracle
Database
France
Object PostgreSQL
Database Database
China
USA
Text
File
USA
Text
File
local
CLIPS
Database
local
Figure 1: UmbrellaDB Virtual Database
4
Architecture Overview
The general architecture of the UmbrellaDB VDA is shown in Figure 2. The architecture
consists of three main components: The UmbrellaDB GUI and Application
Programming Interface (API). The GUI is the tool to be used if the user needs to execute
queries across data sources and look at information within the tool’s workbench. The
4
API is necessary if the user wants to develop additional programs in Java or C or C++ to
use all of the data sources in a real time environment as a single active database schema.
The second part of the architecture is the DataServer. The DataServer acts similar to a
Relational Database Management System (RDBMS) except it accesses and manages
formatted and delimited text files. The allows the textual files to easily be accessed and
manipulated similar to RDBMS tables.
Parser
Splitter
D
a
t
a
Relational
Database
Object
Database
API
Router
Umbrella
GUI
Unifier
Profile
Engine
C
o
n
n
e
c
t
o
r
s
TCP/IP
TCP/IP
Text
File
KB
File
Data Server
Data Sources
Figure 2: UmbrellaDB Architecture
The third architecture piece is the UmbrellaDB Engine which is the heart of the
application. IT has six distinct parts that require additional explanation: Parser, Splitter,
Router, Unifier, Profile and Data Connectors.
The data sources are represented as schemas. A data source Profile stores that
source’s metadata. Profile metadata such as the number of records and size of records is
used by the UmbrellaDB engine to optimize queries prior to execution. The UmbrellaDB
Engine Agent continually monitors the defined data sources to retrieve and update the
data source’s Profile. A Data Connector is an intelligent agent that acts as an interpreter
5
for each type of data source. It interprets and communicates the query to each data
source and translates query results back to the UmbrellaDB engine. The GUI displays
the parts of a in a user-friendly and organized manner, which assists the user to easily
build queries using “point-and-click” technology.
The main components of the UmbrellaDB Engine Agent are: Parser, Splitter,
Router and Unifier. Upon execution of a query, the UmbrellaDB engine propagates
through a series of steps:




Parse the query to insure correctness
Split the query into sub-queries relevant to each data source being accessed
Route the queries to the data source for processing
Unify the result sets of all of the returned queries into a single, unified relation
Query optimization across the data sources will be conducted by two general processing
algorithms: parallel and sequential execution. A decision is made at the time of the query
execution whether to execute the query in parallel or sequential mode. The intelligent
agent engine that executes and manages the query plan is based on a metadata description
of the data sources. The goal of the query optimization process is to minimize the
defined cost during query execution. Cost is defined by amount of data transported
across the network, access time, input/output, response time and other relevant factors.
5
Data Sources integrated with UmbrellaDB
UmbrellaDB is a flexible architecture designed to support many data source types.
Examples of the data source types supported are: formatted text files, delimited text files
and relational databases that can be accessed via a Java Database Connector (JDBC).
Future research and development will provide for integration with knowledge based
formats such as Prolog and CLIPS files, non-relational data sources, such as Key
6
Sequenced Data Sets (KSDS), Open Data Base Connectivity (ODBC) data sources, Btree data sources and traditional mainframe technologies such as IMS.
6
Example Schema
Hotel
hotel_no
hotel_name
address
JD
BC
BC
JD
Informix
Room
room_no
hotel_no
type
price
Data Connector
PostgreSQL
UmbrellaDB
OD
BC
Text
Booking
hotel_no
guest_no
date_from
date_to
room_no
Guest
guest_no
guest_name
no_adults
no_children
address
MS Excel
Figure 3: Example Virtual Schema
Figure 3 provides an example to be used throughout the description of the Query Engine
processes. It is a specific example of the UmbrellaDB virtual database schema. There
are sub-schemas representing four data sources: an Informix Relational Database
Management System (RDBMS), a PostgreSQL Object Relational Database Management
System (ORDBMS), a Microsoft Excel spreadsheet ODBC data source, and a formatted
text file.
7
Query Process
The Query Engine is responsible for processing queries provided by the GUI or the API..
A query is processed in four distinct steps: Parsing, Splitting, Routing and Unification.
The query is represented by SQL language constructs.
7
There are two execution processes that the Query Engine can use to most
efficiently process a query over a set of heterogenenous data sources: parallel and
sequential. The decision is made by the Query Engine Agent upon the execution of a
query. If it is more efficient to split a query into parallel processes and run them
simultaneously, the Engine Agent will select that option. If it is more efficient to
sequentially execute a query, the agent will select that option. The decision is made by
evaluating the meta-data, of each data source involved in the query, and calculating the
most efficient solution.
7.1
Query Parsing
When a Query is generated graphically through the Umbrella GUI Workbench or passed
as an SQL statement from the API, it is parsed to test correctness and validity. Splitting
the query into parts is the initial step of the parsing process. The parts are strings that
contain a specific part of the query. The second step checks the query for the correct
syntax.
A sample query:
SELECT
FROM
WHERE
hotel.hotel_no, hotel.hotel_name, booking.room_no, room.price
hotel, booking, room
hotel.hotel_no = booking.hotel_no AND
room.room_no = booking.room_no;
Is decomposed into a set of Tokens:
Token: SELECT
Token: Field List
Token: FROM
Token: Table List
Token: WHERE
Token: Conditionals List
8
The SELECT, FROM, and WHERE Tokens are checked to insure the spellings are
correct. The Field List is checked against the Profile database to make sure all requested
fields are valid, spelled correctly, and exist in the schema. The Table List is checked to
insure that all listed tables are valid, spelled correctly and are registered data sources in
the global schema. The Conditionals List checks to make sure that all conditional fields
are valid, spelled correctly, and exist in the schema. It also checks the operands and
syntax of each conditional statement. The final check it performs is to insure that the
data types of any conditional comparison match.
7.2
Query Splitting
After the query is successfully parsed, it is forwarded to the Splitter. The Splitter
decomposes the query into a set of sub-queries. Each data source represented will
process a sub-query.
Using the sample query again:
SELECT
FROM
WHERE
hotel.hotel_no, hotel.hotel_name, booking.room_no, room.price
hotel, booking, room
hotel.hotel_no = booking.hotel_no AND
room.room_no = booking.room_no;
Is decomposed into a set of sub-queries:
SELECT hotel_no, hotel_name FROM hotel;
SELECT room_no, hotel_no FROM booking;
SELECT room_no, price from FROM room;
These sub-queries are not optimized in this example. The sub-queries are passed to the
Router and the conditionals would be used by the Unifier to perform the necessary joins
upon successful return from the Router.
9
7.3
Query Routing
The Query Router will take the output of the Query Splitting process and send it to the
appropriate data source for processing. If the query is successful and data is returned it
will be in the form of a UmbrellaDB dataset that can then be used by the Unifer.
The sub-queries:
SELECT hotel_no, hotel_name FROM hotel;
The hotel query is routed to the Informix database by a JDBC call and returned as a
dataset.
SELECT room_no, hotel_no FROM booking;
The booking query is routed to the PostgreSQL database by a JDBC call and returned as
a dataset.
SELECT room_no, price from FROM room;
The room query is router to a DataServer object and processed against the formatted
textual file and returned as a dataset.
7.4
Query Unification
The goal of the Unification step of the Query process is to take all of the returned
datasets, from each data source involved in the query, and join the data sets into a single
relation.
Using the set of sub-queries:
SELECT hotel_no, hotel_name FROM hotel;
SELECT room_no, hotel_no FROM booking;
SELECT room_no, price from FROM room;
And the conditional statement from the initial query:
WHERE
hotel.hotel_no = booking.hotel_no AND
room.room_no = booking.room_no;
The query is joined together as a universal relation.
10
7.5
Query Optimization
An interesting research area of the UmbrellaDB project is query optimization. With
virtual database architectures, there are new design issues to be addressed that don’t
occur in traditional, integrated database architectures. Firstly, virtual database
architectures are not represented by a native global schema. The absence of a global
schema indicates that integration must be defined by the user.
The integration is managed at the user level by an active and intelligent agent that
monitors the data sources and collects metadata to continuously update the virtual
database schema. With JDBC related data sources the Engine Agent will make calls to
the database and get changes in size and schema on a planned basis. The textual data
sources have no meta-data environment, so they will be monitored and updated by the
user.
Most RDBMS, ORDBMS, and commercial systems have query optimizations in
use that execute in the database server run time environment. Our optimization issues
are focus on three distinct areas. First we examine the ability to efficiently retrieve
textual data from formatted and delimited data sources. The second consideration is the
order in which the queries are executed. If a sub-query’s result dataset is very small, it
can be executed and then used against a dataset that is very large. This will reduce the
overall data transfer across network interfaces. The third area is unification of data sets
from multiple heterogeneous sources.
The concept of query tree decomposition [5][6] is used to design the
algorithms for the economic analysis and query evaluation. Depending on a selection of
parallel or sequential execution the Engine will decomposed the queries into query trees
11
and use the cost of projections, selections, joins and Cartesian products to evaluate the
economic factors associated with executing a query. Once the total cost is calculated the
selection of the parallel or sequential algorithm is made and the query propagates through
its steps to successful completion. The evaluation will consider all economic factors
possible to calculate the overall cost. There are some factors that cannot be factored
because of the potential for errors and unforeseen issues such as network bandwidth and
traffic [5].
7.5.1 Sequential Query Execution Example
We present an example of a sequential execution of a query. This example is based on
the virtual database exhibited in Figure 4. This query accesses to data sources: a
PostgreSQL ORDBMS and an MS Excel spreadsheet via ODBC. The two virtual
database tables are related using a guest_no field that relates the data. The objective is to
find the guest name and address of every registered guest who stayed at the hotel on July
4th who has children.
Step 1 is to split the query into sub-queries so that the Query Engine can evaluate.
Step 2 uses the sub-query of each data source and factors it against the current metadata
for that data source to evaluate if the query should be executed in sequential or parallel
order. Once the decision is made to evaluate in sequential order the Query Engine Agent
develops the query plan to most efficiently execute the query across the data sources.
In this example the Agent decides to process the data from the PostgreSQL data
source first because it will have a small set of guest numbers that meet the query
constraints. The data returned will be smaller than if we went to the Excel data source
12
first. A selection is executed on the PostgreSQL data source to return all records where
the date is equal to July 4th. A projection of guest numbers from the July 4th data set is
then returned to the Umbrella Engine.
Step 4 will use the list of the data sets to query the MS Excel data source. First a
selection is run to return all relations where the number of children is greater than zero.
A selection is used on that dataset to only return the guest name and address of each
relation. To complete the query the return of the MS Excel dataset is minimized by the
guest number list from the PostgreSQL dataset to produce the final guest name and
address dataset.
SELECT guest_name, address
FROM Booking, Guest
WHERE date_from = 07-04-2002
AND no_children > 0;
2
guest_name, address
Split Queries
2
Evaluate Impact
3
Process Query
(smallest return)
4
Process 2nd Query
Local Server
4
3
1
Over the network
 guest_no

 guest_name, address

Date_from=07-04-2002
Booking
hotel_no
guest_no
date_from
date_to PostgreSQL
room_no
Dist. Server
no_children > 0 AND guest_no
Guest
guest_no
guest_name
no_adults
MS Excel
no_children
address
1
Figure 4: Sequential Execution Example
7.5.2 Parallel Query Execution Example
We present an example of a parallel query execution . This example is based on the
virtual database exhibited in Figure 5. This query accesses to data sources: a PostgreSQL
ORDBMS and an MS Excel spreadsheet via ODBC. The two virtual database tables are
13
related using a guest_no field that relates the data. The objective is to find the guest
name and address of every registered guest who stayed at the hotel on July 4th who has
children.
Step 1 is to split the query into sub-queries so that the Query Engine can evaluate.
Step 2 uses the sub-query of each data source and factors it against the current metadata
for that data source to evaluate if the query should be executed in sequential or parallel
order. Once the decision is made to evaluate in sequential order the Query Engine Agent
develops the query plan to most efficiently execute the query across the data sources.
In this case, the Query Engine Agent makes the decision to process in parallel
because it is the most efficient execution. A selection is executed against the Booking
table of the PostgreSQL database where the date from is July 4th, 2002. At the same time
a selection is being executed against the Guest MS Excel spreadsheet where the number
of children is greater than zero. Once the projections are complete both resulting datasets
have selections executed against them to return the guest number on the dataset from the
PostgreSQL dataset and the guest number, guest name and address from the MS Excel
dataset, respectively. The two returned datasets are joined by guest number to derive the
result set of guest name and address.
14
SELECT guest_name, address
FROM Booking, Guest
WHERE date_from = 07-04-2002
AND no_children > 0;
Local Server
guest_name, address
guest_no
Over the network
1a
 guest_no
 Date_from=07-04-2002
Booking
hotel_no
guest_no
date_from
date_to
room_no
1b
 Guest_no, guest_name, address

PostgreSQL
Distributed Servers
no_children > 0
MS Excel
Guest
guest_no
guest_name
no_adults
no_children
address
Figure 5: Sequential Execution Example
10
Conclusion
In the present paper we present a working prototype of UmbrellaDB. The prototype
successfully integrates delimited text, formatted text, MySQL Relational Database
Management System (RDBMS) and PostgreSQL Object-Relational Database
Management System (ORDBMS) data sources. The data sources reside on different
physical hardware systems consisting of Sun Solaris, Red Hat Linux, Windows XP and
Windows NT connected via a TCP/IP based network. Virtual schemas of each data
source have been defined to UmbrellaDB and queries have been successfully executed
across the heterogeneous sources. We have been able to test that the sequential and
parallel execution of queries are optimal.
15
The ability to integrate numerous heterogeneous data sources has been tested and
proven with this architecture. User level integration has been achieved. Queries are
easily constructed and optimally executed to completion.
11
Future Work
There are numerous additional developments planned for the UmbrellaDB Virtual
Database tool. A representative list of future research topics and enhancements is listed
below:




Extension to numerous data source types
Enhancement of the Query Engine Algorithms
Inclusion data decompression techniques to drastically reduce data transmission
time
Use of proposed Huffman Code methods [4] will allow for fast retrieval and
processing of large text datasets.
16
References
[1] E. Matson, “Querying Distributed, Heterogeneous Databases and Knowledge Bases
with a Single Unified Approach”, Research Paper, CIS 860, 1997.
[2] I. Witten, E. Frank, “Data Mining”, Morgan-Kaufman Publishers, San Francisco,
2000.
[3] Y. Papakonstantinou, H. Garcia-Molina, J. Widom, “Object Exchange Across
Heterogeneous Information Sources”, Research sponsored by Wright Laboratory,
Aeronautical Systems Center, Air Force Material Command, USAF.
[4] E. Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates, “Direct pattern matching on
compressed text”. In Proc. 5th International Symposium on String Processing and
Information Retrieval (SPIRE'98) (September 1998), pp. 90--95. IEEE Computer
Society.
[5] M. Oszu, P. Valduriez, “Principles of Distributed Database Systems”, 1991, PrenticeHall, Inc. New Jersey.
[6] J. Ullman, Principles of Database and Knowledge-Base Systems, Volume II: The
New Technologies”. 1989, Computer Science Press, Rockville, Maryland.
[7] J. Ullman, Principles of Database and Knowledge-Base Systems, Volume I: Classical
Database Systems” 1988, Computer Science Press, Rockville, Maryland.
17