Automatic Integration of Relational Database Systems
Ramon Lawrence
University of Manitoba

Outline
• Introduction, Motivation, and Background
• Our integration approach
• The integration architecture
  - Standard dictionary, X-Specs, query processor
• Example integration
  - Northwind, Southstorm databases
• Querying the integrated databases
  - Generating SQL queries from semantic queries
• Unity implementation
• Contributions, Conclusions, and Future Work

Database Terminology
• Database system - a database and a system to manage the data.
• Schema - a description of the data organization and format in a database.
• Schema integration - the process of combining local schemas into a global, integrated view by resolving conflicts present between the schemas.
• Data integration - the process of combining data at the entity level. It requires resolving representational conflicts and determining equivalent keys.
• Multidatabase system (MDBS) - a collection of autonomous, local databases participating in a global database system to share data.

What is Integration?
• Two levels of integration:
  - Schema integration - the description of the data
  - Data integration - the individual data instances
• Integration problems include:
  - Different data models and conflicts within a model
  - Incompatible concept representations
  - Different user or view perspectives
  - Naming conflicts (homonym, synonym)
• Integration handles the different mechanisms for storing data (structural conflicts), for referencing data (naming conflicts), and for attributing meaning to the data (semantic conflicts).

Why is Integration Required?
• There are many integration environments:
  - Operational systems within an organization
  - System integration during company merger
  - Data warehouses, Intranets, and the WWW
• Users require information from many data sources which often do not work together.
• Companies require a global view of their entire operations, which may be spread across numerous operational databases in different departments and geographic locations.
• E-commerce demands integration of web databases with production systems.

What is the Current Solution?
• Manual integration algorithms:
  - Allow designer to detect and resolve conflicts
  - Manipulate information using semantic models
• Knowledge bases/Artificial Intelligence:
  - Cyc knowledge base and Carnot project
• Global dictionaries and lexical semantics:
  - WordNet, Clio, Summary Schemas Model
  - Concept hierarchies (Castano)

What is the Current Solution? (2)
• SQL and multidatabase query languages:
  - SQL, MSQL, IDL, DIRECT, SchemaSQL
  - Require user to understand DB structure & semantics
• Wrapper and mediator systems:
  - Information Manifold, TSIMMIS, Infomaster
  - Use query languages or description logics
  - Focus on query rewriting and reformulation
• Industrial standards:
  - XML, BizTalk, E-commerce portals
  - Apply to limited domains/industries
  - Require standard structures and database changes

Previous Work Summary
• Current techniques for database integration have some of these problems:
  - Require integrator to understand all databases
  - Integration process is manual
  - Do not hide system complexity from the user
  - Force changes on the existing database systems
  - Construct global view manually
  - Suffer from query imprecision (query containment)

Our Approach
• Our approach combines standardization and query mapping algorithms.
• The major idea is that schema conflicts can be resolved if we:
  - Eliminate all naming conflicts
  - Define a language capable of determining schema equivalence and performing transformations
• Naming conflicts are eliminated by accepting a standard term dictionary.
  - Not a knowledge base or set of mediated views
  - Leverages semantic information in English words

Integration Architecture
[Diagram: clients connect to a multidatabase layer containing the integrated context view, the X-Spec editors, the standard dictionary, the integration algorithm, and the query processor and ODBC manager. The layer issues subtransactions to the local databases, each of which is described by an X-Spec and runs its own local transactions.]
Architecture Components:
1) Integrated Context View
  • user's view of integration
2) X-Spec Editor
  • stores schema & metadata
  • uses XML
3) Standard Dictionary
  • terms to express semantics
4) Integration Algorithm
  • combines X-Specs into integrated context view
5) Query Processor
  • accepts query on view
  • determines data source mappings and joins
  • executes queries and formats results

Architecture Components
• The architecture consists of four components:
• A standard dictionary (SD) to capture data semantics
  - SD terms are used to build semantic names describing semantics of schema elements.
• X-Specs for storing data semantics
  - Database metadata and semantic names are stored using XML.
• Integration Algorithm
  - Matches concepts in different databases by semantic names.
  - Produces an integrated view of all database concepts.
• Query Processor
  - Allows the user to formulate queries on the view.
  - Translates from semantic names in the integrated view to SQL queries, then integrates and formats the results.
  - Involves determining correct field and table mappings and discovering join conditions and join paths.

Integration Processes
• The integration architecture consists of three separate processes:
  - Capture process: independently extracts database schema information and metadata into an XML document called an X-Spec.
  - Integration process: combines X-Specs into a structurally-neutral hierarchy of database concepts called an integrated context view.
  - Query process: allows the user to formulate queries on the integrated view. The query processor maps these to structural queries (SQL), and the results are integrated and formatted.

Integration Architecture: The Capture Process
[Diagram: a relational schema passes through automatic extraction into the specification editor, where the DBA looks up terms in the standard dictionary; the output is an X-Spec.]
• Capture process involves:
  - Automatically extracting the schema information and metadata using a specification editor
  - Assigning semantic names to each schema element (tables and fields) to capture their semantics

Architecture Components: The Standard Dictionary
• A standard dictionary (SD) provides standardized terms to capture data semantics.
  - Hierarchy of terms related by IS-A or Has-A links
  - Contains base set of common database concepts, but new concepts can be added
• An SD term is a single, unambiguous semantic definition.
  - Several SD entries for a single English word are required if the word has multiple definitions.
• The top-level dictionary terms are those proposed by Sowa.

Architecture Components: Dictionary vs. Knowledge Base
• The standard dictionary differs from a knowledge base such as Cyc because:
• Not intended to be a general English dictionary or contain knowledge facts about the world
  - Dictionary is evolved as new terms are required
  - Not all English words are used
• Dictionary provides the system with no "knowledge"
  - Since no facts are stored, the system cannot deduce new facts
  - Dictionary terms are just semantic placeholders; the integrators, not the system, determine the semantics of the database
• Simplified organization
  - Dictionary is organized as a tree for efficiency and simplicity in determining related concepts
• Re-use of terms
  - Terms are re-used in semantic names

Architecture Components: Using the Standard Dictionary
• SD terms are used to build semantic names describing semantics of schema elements.
• Semantic names have the form:
  - semantic name := [CT_Type] | [CT_Type] CN
  - CT_Type := CT | CT {; CT} | CT {,CT}
  - CT := context term, CN := concept name
  - each CT and CN is a single term from the SD
• Semantic names are included in specifications describing a database.
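
The semantic name grammar above can be illustrated with a small parser. This is a minimal sketch and not part of Unity: the names SemanticName and parse_semantic_name are hypothetical, the ";" and "," separators are treated identically because the slides do not define how they differ, and terms are not checked against a standard dictionary.

```python
import re
from typing import NamedTuple, Optional, Tuple


class SemanticName(NamedTuple):
    """Parsed form of '[CT;CT;...] CN' (hypothetical helper type)."""
    contexts: Tuple[str, ...]   # the context terms (CT)
    concept: Optional[str]      # the concept name (CN), or None for a pure context


# One bracketed group of context terms, optionally followed by a concept name.
_PATTERN = re.compile(r"^\[([^\[\]]+)\]\s*(.*)$")


def parse_semantic_name(text: str) -> SemanticName:
    """Split a semantic name into its context terms and optional concept name."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a semantic name: {text!r}")
    # Assumption: ';' and ',' both simply separate context terms.
    contexts = tuple(term.strip() for term in re.split(r"[;,]", match.group(1)))
    if any(not term for term in contexts):
        raise ValueError(f"empty context term in: {text!r}")
    concept = match.group(2).strip() or None
    return SemanticName(contexts, concept)


if __name__ == "__main__":
    for name in ("[Category]", "[Order;Product] Id", "[Employee] Last Name"):
        print(name, "->", parse_semantic_name(name))
```

A name with no concept part, such as [Category], describes a table, while [Order;Product] Id describes a field. Multi-word concept names such as Last Name are kept as one term.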

Northwind & Southstorm Integration Example
Northwind Database Schema
Table          Fields
Categories     CategoryID, CategoryName
Customers      CustomerID, CompanyName
Employees      EmployeeID, LastName, FirstName
OrderDetails   OrderID, ProductID, UnitPrice, Quantity
Orders         OrderID, CustomerID, EmployeeID, OrderDate, Shipvia
Products       ProductID, ProductName, SupplierID, CategoryID
Shippers       ShipperID, CompanyName
Suppliers      SupplierID, CompanyName

Northwind & Southstorm Integration Example (2)
Southstorm Database Schema
Table          Fields
Orders_tb      Order_num, Cust_name, Emp_name, Item1_id, Item1_qty, Item1_price, Item2_id, Item2_qty, Item2_price

Integration Example (3)
Northwind Semantic Name Mappings (T = table, F = field)
Type   Semantic Name             System Name
T      [Category]                Categories
F      [Category] Id             CategoryID
F      [Category] Name           CategoryName
T      [Customer]                Customers
F      [Customer] Id             CustomerID
F      [Customer] Name           CompanyName
T      [Employee]                Employees
F      [Employee] Id             EmployeeID
F      [Employee] Last Name      LastName
F      [Employee] First Name     FirstName
T      [Order;Product]           OrderDetails
F      [Order] Id                OrderID
F      [Order;Product] Id        ProductID
F      [Order;Product] Price     UnitPrice
F      [Order;Product] Quantity  Quantity
T      [Order]                   Orders
F      [Order] Id                OrderID
F      [Order;Customer] Id       CustomerID
F      [Order;Employee] Id       EmployeeID
F      [Order] Date              OrderDate
F      [Order;Shipper] Id        Shipvia
T      [Product]                 Products
F      [Product] Id              ProductID
F      [Product] Name            ProductName
F      [Product;Supplier] Id     SupplierID
F      [Product;Category] Id     CategoryID
T      [Shipper]                 Shippers
F      [Shipper] Id              ShipperID
F      [Shipper] Name            ShipperName
T      [Supplier]                Suppliers
F      [Supplier] Id             SupplierID
F      [Supplier] Name           SupplierName

Northwind & Southstorm Integration Example (4)
Southstorm Semantic Name Mappings
Type    Semantic Name             System Name
Table   [Order]                   Orders_tb
Field   [Order] Id                Order_num
Field   [Order;Customer] Name     Cust_name
Field   [Order;Employee] Name     Emp_name
Field   [Order;Product] Id        Item1_id
Field   [Order;Product] Quantity  Item1_qty
Field   [Order;Product] Price     Item1_price
Field   [Order;Product] Id        Item2_id
Field   [Order;Product] Quantity  Item2_qty
Field   [Order;Product] Price     Item2_price

What is a semantic name?
• A semantic name is a universal, semantic identifier in a domain.
  - Similar to a field name in the Universal Relation.
  - Semantics are guaranteed unique by construction.
  - The system has a mechanism for comparing semantics across domains even though it does not understand them. (Exploiting semantics in English words.)
• Important definitions:
  - context - a semantic name is a context if it maps to a table
  - concept - a semantic name is a concept if it maps to a field
  - context closure - the context closure of a semantic name Si, denoted Si*, is the set of semantic names produced by taking ordered subsets of the terms of Si = {T1, T2, … TN} starting with T1.

Architecture Components: X-Specs
• Database metadata and semantic names are combined into specifications called X-Specs:
  - Stored and transmitted using XML
  - Contain information on a relational schema
  - Organized into database, table, and field levels
  - Store semantic names to describe and integrate schema elements

Southstorm X-Spec
<?xml version="1.0" ?>
<Schema name="Southstorm_xspec.xml" xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <ElementType name="[Order]" sys_name="Orders_tb" sys_type="Table">
    <element type="[Order] Id" sys_name="Order_num" sys_type="Field"/>
    <element type="[Order] Total Amount" sys_name="Order_total" sys_type="Field"/>
    <element type="[Order;Customer] Name" sys_name="Cust_name" sys_type="Field"/>
    <element type="[Order;Customer;Address] Address Line 1" sys_name="Cust_address" sys_type="Field"/>
    <element type="[Order;Customer;Address] City" sys_name="Cust_city" sys_type="Field"/>
    <element type="[Order;Customer;Address] Postal Code" sys_name="Cust_pc" sys_type="Field"/>
    <element type="[Order;Customer;Address] Country" sys_name="Cust_country" sys_type="Field"/>
    <element type="[Order;Product] Id" sys_name="Item1_id" sys_type="Field"/>
    <element type="[Order;Product] Quantity" sys_name="Item1_quantity" sys_type="Field"/>
    <element type="[Order;Product] Price" sys_name="Item1_price" sys_type="Field"/>
    <element type="[Order;Product] Id" sys_name="Item2_id" sys_type="Field"/>
    <element type="[Order;Product] Quantity" sys_name="Item2_quantity" sys_type="Field"/>
    <element type="[Order;Product] Price" sys_name="Item2_price" sys_type="Field"/>
  </ElementType>
</Schema>

Architecture Components: Integrating X-Specs
• Each database to be integrated is described using an X-Spec.
• Identical concepts in different databases are identified by similar semantic names.
• Concepts with identical (or hierarchically related) semantic names are combined regardless of their physical representation in the individual databases.

Integration Architecture: The Integration Process
• Integration process involves:
  - Automatically identifying identical concepts by matching semantic names
  - Constructing a global view of database concepts consisting of a hierarchy of concept terms
  - Resolving structural differences during query generation and submission (e.g. a concept may be represented as a table in one database and a field (attribute) in another)

Integration Product: The Integrated Context View
• The product of the integration is a structurally-neutral hierarchy of concepts called an integrated context view.
• Define a context view (CV) as follows:
  - If a semantic name Si is in CV, then for any Sj in Si*, Sj is also in CV.
  - For each semantic name Si in CV, there exists a set of zero or more mappings Mi that associate a schema element Ej with Si.
  - A semantic name Si can only occur once in the CV.
• A context view (CV) is a valid Universal Relation.
  - Each field is assigned a semantic name which uniquely identifies its semantic connotation.
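
The context closure and context view definitions can be sketched in a few lines. This is an illustration under stated assumptions, not the Unity integration algorithm: "ordered subsets of the terms starting with T1" is read here as the prefixes of the term list, the function names context_closure and build_context_view are hypothetical, semantic names are matched only when identical (IS-A related terms are not handled), and the mappings are a small hand-picked subset of the Northwind and Southstorm tables.

```python
from typing import Dict, List, Optional, Tuple

# One X-Spec entry: (context terms, concept name or None, system name).
Mapping = Tuple[Tuple[str, ...], Optional[str], str]


def context_closure(contexts: Tuple[str, ...], concept: Optional[str] = None) -> List[str]:
    """Return the closure Si* as a list, assuming it consists of term prefixes."""
    names = ["[" + ";".join(contexts[:i]) + "]" for i in range(1, len(contexts) + 1)]
    if concept is not None:
        names.append(names[-1] + " " + concept)
    return names


def build_context_view(xspecs: Dict[str, List[Mapping]]) -> Dict[str, List[str]]:
    """Merge X-Spec mappings into one view keyed by semantic name."""
    view: Dict[str, List[str]] = {}
    for db_name, mappings in xspecs.items():
        for contexts, concept, element in mappings:
            closure = context_closure(contexts, concept)
            # Rule 1: every name in the closure is present in the view.
            for name in closure:
                view.setdefault(name, [])
            # Rule 2: the schema element is mapped to its own semantic name.
            view[closure[-1]].append(f"{db_name}.{element}")
    return view  # Rule 3: dictionary keys guarantee each name occurs once.


NORTHWIND: List[Mapping] = [
    (("Order",), None, "Orders"),
    (("Order",), "Id", "Orders.OrderID"),
    (("Order", "Product"), None, "OrderDetails"),
    (("Order", "Product"), "Id", "OrderDetails.ProductID"),
]
SOUTHSTORM: List[Mapping] = [
    (("Order",), None, "Orders_tb"),
    (("Order",), "Id", "Orders_tb.Order_num"),
    (("Order", "Product"), "Id", "Orders_tb.Item1_id"),
    (("Order", "Product"), "Id", "Orders_tb.Item2_id"),
]

if __name__ == "__main__":
    for name, sources in build_context_view({"NW": NORTHWIND, "SS": SOUTHSTORM}).items():
        print(f"{name:22} {', '.join(sources)}")
```

In this sketch [Order;Product] is a table in Northwind and has no table of its own in Southstorm, yet both databases contribute mappings to [Order;Product] Id, which is the kind of structural difference the integrated context view is meant to hide.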

Northwind & Southstorm Integration Example
Integrated Context View
Integrated View Term      Data Source Mappings (not visible to user)
V (view root)             N/A
- [Category]              NW.Categories
  - Id                    NW.Categories.CategoryID
  - Name                  NW.Categories.CategoryName
- [Customer]              NW.Customers
  - Id                    NW.Customers.CustomerID
  - Name                  NW.Customers.CompanyName
- [Employee]              NW.Employees
  - Id                    NW.Employees.EmployeeID
  - [Name]
    - First Name          NW.Employees.FirstName
    - Last Name           NW.Employees.LastName
- [Product]               NW.Products
  - Id                    NW.Products.ProductID
  - Name                  NW.Products.ProductName
  - [Supplier]
    - Id                  NW.Products.SupplierID
  - [Category]
    - Id                  NW.Products.CategoryID

Integrated View Term      Data Source Mappings (not visible to user)
V (view root) (cont.)     N/A
- [Order]                 NW.Orders, SS.Orders_tb
  - Id                    NW.[Orders,OrderDetails].OrderID, SS.Orders_tb.Order_num
  - [Customer]
    - Id                  NW.Orders.CustomerID
    - Name                SS.Orders_tb.Cust_name
  - [Employee]
    - Id                  NW.Orders.EmployeeID
    - Name                SS.Orders_tb.Emp_name
  - [Product]             NW.OrderDetails
    - Id                  NW.OrderDetails.ProductID, SS.Orders_tb.Item[1,2]_id
    - Price               NW.OrderDetails.UnitPrice, SS.Orders_tb.Item[1,2]_price
    - Quantity            NW.OrderDetails.Quantity, SS.Orders_tb.Item[1,2]_qty
- [Shipper]               NW.Shippers
  - Id                    NW.Shippers.ShipperID
  - Name                  NW.Shippers.ShipperName
- [Supplier]              NW.Suppliers
  - Id                    NW.Suppliers.SupplierID
  - Name                  NW.Suppliers.SupplierName

Architecture Components: The Query Processor
• The query processor:
  - Allows the user to formulate queries on the view.
  - Translates from semantic names in the context view to structural queries (SQL) on databases.
  - Involves determining correct field and table mappings and discovering join conditions and join paths.
  - Retrieves query results and formats them for display to the user.
• Client-side query processing:
  - Perform joins between databases using common keys.
  - Data value formatting and transformation

The Query Processor: Determining field/table mappings
For each database (D) in the context view
  For each semantic name (S) in query
    If S has only one semantic name mapping in D Then
      Add field mapping to query and its parent table
    Else If S has multiple mappings but all in one table Then
      Add each field mapping to query and the parent table
    Else (S has multiple mappings in more than one table)
      If any field mapping has a table already in query, take that one
      Else take field mapping with best semantic name match
      Otherwise take first mapping found
    End If
  Next
Next

The Query Processor: Constructing Join Graphs
• Given a set of fields (F) and tables (T) to access, joins are applied to connect the tables.
• A join graph is an undirected graph where:
  - Each node Ni is a table in the database.
  - There is a link from node Ni to node Nj if there is a join between the two tables.
• A join path is a sequence of joins connecting two nodes in the graph.
• A join tree is a set of joins connecting two or more nodes.
• A join matrix M stores the shortest join paths between any two nodes (tables).

The Query Processor: Join Graph for Northwind
[Figure: join graph for Northwind. Nodes are the tables Suppliers, Categories, Products, OrderDetails, Orders, Shippers, Employees, and Customers. Each edge is a 1:N join: Suppliers-Products, Categories-Products, Products-OrderDetails, Orders-OrderDetails, Shippers-Orders, Employees-Orders, and Customers-Orders.]

The Query Processor: Join Discovery Results
• Join discovery in a database with a connected, acyclic join graph and a join matrix M:
  - There exists only one join tree for any set of tables.
  - The joins required to connect a table set T are found by taking any Ti of T and unioning the join paths in M[Ni,N1], M[Ni,N2], ... M[Ni,Nn], where N1, N2, ... Nn are the nodes corresponding to the set of tables T.
• For a cyclic join graph:
  - There may exist more than one join tree for a set of tables, and each tree may have different semantics.
  - The user can be allowed to uniquely determine the join tree by graphically displaying join conditions as they browse the context view.

Advanced Query Processing
• Advanced query processor features include:
  - global keys and joins - a mechanism for specifying when a field stores a global key such as a social security number.
  - result normalization - a procedure for normalizing query results returned from each individual database (e.g. Southstorm).
  - data integration - transforming data representational conflicts at the global level.
• For example, "M" and "F" may represent "Male" and "Female" in one database, and another may represent these concepts using "0" and "1".

Northwind & Southstorm Query Examples
• Example 1: Retrieve all order ids ([Order] Id) and customers ([Customer] Name):
  - SS: SELECT Order_num, Cust_name FROM Orders_tb
  - NW: SELECT OrderID, CompanyName FROM Orders, Customers WHERE Orders.CustomerID = Customers.CustomerID
• Example 2: Retrieve all ordered products ([Order;Product] Id) and their order ids:
  - SS: SELECT Order_num, Item1_id, Item2_id FROM Orders_tb
  - NW: SELECT OrderID, ProductID FROM OrderDetails
• Note: In NW, the queries select from two different order id mappings. In SS, result normalization is required.

Integration Example: Discussion
• Important points:
  - System table and field names are not presented to the user, who queries based on semantic names.
  - Database structure is not shown to the user.
  - Field and table mappings are automatically determined based on X-Spec information.
  - Join conditions are inserted as needed when available to join tables.
  - Different physical representations for the same concept are combined.
  - Hierarchically related concepts are combined based on their IS-A relationship in the standard dictionary.

Unity Overview
• Unity is a software package that implements the integration architecture with a GUI.
  - Developed using Microsoft Visual C++ 6 and Microsoft Foundation Classes (MFC).
• Unity allows the user to:
  - Construct and modify standard dictionaries
  - Build X-Specs to describe data sources
  - Integrate X-Specs into an integrated view
  - Transparently query integrated systems using ODBC and automatically generate SQL transactions
• Unity is available for demonstration and distribution.

Architecture Discussion
• The architecture automatically integrates relational schemas into a multidatabase.
• Desirable properties:
  - Individual mappings - information sources are integrated one at a time and independently
  - Integrated view constructed for query transparency - the user queries the system by semantics instead of structure
  - Handles schema conflicts - including semantic, structural, and naming conflicts
  - Automated integration - integrated view constructed efficiently and automatically
  - No wrapper or mediator software is required
  - Transparent querying - users issue semantic queries which are translated to SQL by the query processor

Contributions
• Architecture contributions:
  - A unique application of a standard dictionary which is not a knowledge base
  - Separates the capture and integration processes
  - Allows transparent querying without structure
  - Provides algorithms for dynamically extracting database data (creating relevant views)
  - Provides algorithms for mediation of global-level conflicts (global keys, normalization, etc.)
  - Arguably simpler method for capturing data semantics than using description logic
  - An implementation, Unity, which demonstrates the practical benefits of the architecture

Conclusions
• Automatic database integration is possible by using a standard term dictionary and defining semantic names for schema elements.
• Integration of data sources has applications to the WWW and construction of data warehouses.
• Users are able to transparently query integrated systems by concept instead of structure.

Future Work
• The integration architecture is evolving with standards on XML and captures metadata information in XML documents.
• We are constantly refining Unity.
  - Develop an integration component for a web browser
• The query processor is being extended to resolve more complex queries and conflicts.
• Test the system in large industrial projects.
• Allow distributed updates and global updates on all databases.

References
• Publications:
  - Unity - A Database Integration Tool, R. Lawrence and K. Barker, TRLabs Emerging Technology Bulletin, January 2000.
  - Multidatabase Querying by Context, R. Lawrence and K. Barker, DataSem2000, pages 127-136, Oct. 2000.
  - Integrating Relational Database Schemas using a Standardized Dictionary, to appear in SAC'2001 - ACM Symposium on Applied Computing, March 2001.
• Sponsors:
  - NSERC, TRLabs
• Further Information:
  - http://www.cs.umanitoba.ca/~umlawren/