Unity Demonstration
Dr. Ramon Lawrence
University of Iowa

Outline
- Motivation and Background
- Two basic integration approaches:
  - global as view (GAV)
  - local as view (LAV)
- What is the open problem?
- How Unity is different
- Using Unity example
- Benefits and Contributions
- Future Work

Motivation
There are many integration environments:
- Operational systems within an organization
- System integration during company merger
- Data warehouses, Intranets, and the WWW
Users require information from many data sources which often do not work together.

What is Integration?
Two levels of integration:
- Schema integration - the description of the data
- Data integration - the individual data instances
Integration handles the different mechanisms for storing data (structural conflicts), for referencing data (naming conflicts), and for attributing meaning to the data (semantic conflicts).

Two Current Approaches
The current state-of-the-art integration systems can all be reduced to a logical basis. For this demo, assume the data is physically stored in the relational model and queried using Datalog.
There are two basic "database" approaches to integration:
- global as view approach - the extraction and integration of data is defined simultaneously with the global view definition
  - TSIMMIS using Object Exchange Model (OEM)
- local as view approach - pre-defines the global view and then defines what portion of the global view each local source provides
  - Information Manifold using description logic

BodyWorks Systems
- Customer Web Server
- Order Database
- Invoice Database
- Shipment Database
- Custom Accounting Package
- Shipment Tracking Software
Question: Who has a complete picture of a customer's order, or the entire customer relationship?
Answer: No one, but management wants to know...

Data Warehouse Approach
[Figure: a warehouse fed by three Gather / Refine / Aggregate / Store pipelines, one each from the Invoice, Order, and Shipment databases]
Features:
- static, materialized view
- performs data cleansing and aggregation
- historical more than operational

Query-Driven Dynamic Approach
[Figure: a mediator over three wrappers, one each for the Invoice, Order, and Shipment databases]
Features:
- view dynamically built
- data is extracted at query-time
- still typically read-only

Source schemas:
Invoice Database:
    Cust(id,name,addr,city,state,cty)
    Invoice(invId,custId,shipId,iDate)
    InvProd(invId,prodId,amt,pr)
    Prod(id,name,pr,desc)
Order Database:
    Cust(id,name,addr,city,state,cty)
    Order(oid,cid,odate)
    OrdProd(oid,pid,amt,pr)
    Prod(id,name,pr,desc)
Shipment Database:
    Cust(id,name,addr,city,state,cty)
    Shipment(shipid,oid,cid,shipdate)
    ShipProd(shipid,prodid,amt)
    Prod(id,name,pr,desc,inv)

Global as View Approach
Define global objects by specifying how to extract their information from the local sources.
Requires that the administrator defining the global view understand the semantics of every local data source.
Further, if the local views or global views must be changed for whatever reason (such as adding a new data source), the global view must be re-compiled.
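As a rough illustration of the global as view idea, the following Python sketch defines a global customer object directly in terms of how it is extracted from each source. It is not TSIMMIS or Unity code: the in-memory SQLite databases, the sample rows, and the helper names `make_source` and `global_customers` are all assumptions made for the example, and it assumes customer ids are comparable across sources.

```python
import sqlite3

def make_source(rows):
    """Create an in-memory database holding a customer(id, name, addr) table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, addr TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    return conn

# Hypothetical sample data standing in for the three local sources.
sources = {
    "invoiceDB": make_source([(1, "Ann", "12 Elm St")]),
    "orderDB": make_source([(1, "Ann", "12 Elm St"), (2, "Bob", "9 Oak Ave")]),
    "shipmentDB": make_source([(3, "Cy", "4 Pine Rd")]),
}

def global_customers(sources):
    """GAV-style global view: union of every source's customers, matched on id."""
    merged = {}
    for conn in sources.values():
        query = "SELECT id, name, addr FROM customer"
        for cid, name, addr in conn.execute(query):
            # Keep the first record seen for each id (assumes ids agree across sources).
            merged.setdefault(cid, (cid, name, addr))
    return [merged[cid] for cid in sorted(merged)]

result = global_customers(sources)
print(result)
```

Because the extraction logic lives inside the view definition, adding a fourth source means editing `sources` and possibly `global_customers`, which mirrors the re-compilation cost noted on the slide.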
Global as View Example
TSIMMIS MSL example extracting customer info:
    <f(I) customer {<id I> <name N> <addr A>}>@med :- <customer {<id I> <name N> <addr A>}>@invoiceDB
    <f(I) customer {<id I> <name N> <addr A>}>@med :- <customer {<id I> <name N> <addr A>}>@orderDB
    <f(I) customer {<id I> <name N> <addr A>}>@med :- <customer {<id I> <name N> <addr A>}>@shipmentDB
Equivalent SQL: union the results of the following 3 queries (matching ids if possible):
    orderDB:    SELECT * FROM customer
    invoiceDB:  SELECT * FROM customer
    shipmentDB: SELECT * FROM customer

Global as View Example (2)
Extract all orders with invoices and shipments:
    <shipInvOrd {<shipment S> <invoice I> <order O>}>@med :-
        <shipment {<shipid S> <oid O>}>@shipmentDB AND
        <order {<oid O>}>@orderDB AND
        <invoice {<invId I> <shipId S>}>@invoiceDB
Equivalent SQL (if it is possible to query multiple databases):
    SELECT shipment.shipid, invoice.invId, order.oid
    FROM shipment, invoice, order
    WHERE shipment.shipid = invoice.shipId AND shipment.oid = order.oid

Local as View Approach
Pre-define an integrated global view (GV) that encompasses the information present in all sources.
For each local source, specify the local view as a subset of the information available in the GV.
Building the GV is typically not discussed. However, the LAV approach makes it easier to add or remove sources, as the GV does not have to be updated.
Query processing using the LAV approach is more difficult than with the GAV approach, as the query processor has to determine what information can be extracted from the views.

Local as View Example
Consider this global customer relation in the GV:
    customer(id, name, addr)
Assume that the order, invoice, and shipment databases each contain a customer record only if the customer had an order, invoice, or shipment, respectively.
Further, assume that only shipmentDB contains a customer address.
Local views of each source:
    orderView(C,N)   :- customer(C,N,A)
    invoiceView(C,N) :- customer(C,N,A)
    shipView(C,N,A)  :- customer(C,N,A)

Local as View Example (2)
Let the user pose the following query:
    q(N) :- customer(I, N, A)
The query asks for all customer names.
The query processor must determine which views are relevant (in this case all of them).
Local queries on each source:
    q(N) :- orderView(C,N)
    q(N) :- invoiceView(C,N)
    q(N) :- shipView(C,N,A)

What is the open problem?
The two approaches are both viable methods for solving data integration.
However, the open problem is that neither approach performs schema integration - the construction of the global view itself.
- GAV - GV constructed (schema integration performed) by the global designer when specifying extraction rules
- LAV - GV is pre-defined using some previous integration process (most likely manual in nature)
Both methods rely on the concept of a global user to create the global schema.

How Unity is Different
Our integration architecture, called Unity, is different because it approaches the integration problem from a different perspective:
How can we automate, or semi-automate, the construction of the global view by extracting information from the local data sources?
Thus, the integration problem is tackled from a different set of starting assumptions:
- Do not assume a pre-existing or manually created GV.
- However, assume we have a dictionary and a language for describing schema and data element semantics.
- Attempt to automatically build a GV from source descriptions of each data source.

The Unity Approach
Given a set of data sources and a dictionary and language to describe data semantics:
1) Semi-automatically extract and represent data source semantics in the language using the dictionary.
2) Automatically match concepts across data sources by using the dictionary to determine related concepts.
   This process effectively builds the global-level relations or objects initially assumed or created in other approaches. However, since there is no manual intervention, the precision of global view construction is affected by inconsistencies in the descriptions of the data sources and matching concepts.
3) Automatically generate queries specified by the user using dictionary terms (not structures) and map the user's query to appropriate data elements in the local sources.

Unity Overview
Unity is a software package that implements the integration architecture with a GUI.
Developed using Microsoft Visual C++ 6 and Microsoft Foundation Classes (MFC).
Unity allows the user to:
- Construct and modify standard dictionaries
- Build X-Specs to describe data sources
- Integrate X-Specs into an integrated view
- Transparently query integrated systems using ODBC and automatically generate SQL transactions

Unity Example Step #1 - Standard Dictionary
A standard dictionary (SD) provides standardized terms to capture data semantics.
- Hierarchy of terms related by IS-A or HAS-A links
- Contains a base set of common database concepts, but new concepts can be added
An SD term is a single, unambiguous semantic definition.
- Several SD entries for a single English word are required if the word has multiple definitions.
The top-level dictionary terms are those proposed by Sowa.

Unity Example Step #2 - Data Extraction
For each data source, an X-Spec document is constructed that consists of:
- field, table, key, and join information extracted from the ODBC source
- assignment of semantic names for each field and table
Semantic names combine dictionary terms to describe the semantics of schema elements.
    semantic name := [CT_Type] | [CT_Type] PN
    CT_Type := CT | CT {; CT} | CT {,CT}
    CT := context term, PN := property name
Each CT and PN is a single term from the dictionary.

Unity Example Step #2 - Data Extraction (2)
Semantic names are initially assigned using an automatic algorithm which attempts to find the best matches.
- The integrator can then refine initial semantic name assignments.
Semantic names have two major purposes:
- used as a means for describing, documenting, and comparing concepts across systems
- allow information in the database (and later in the integrated view) to be organized by semantic concept instead of using structures or relations
This simplifies querying the database and integrated view because the information is not divided into normalized relations.

Unity Example Step #3 - Schema Building
Unlike previous approaches, the global view (or schema) is constructed automatically by combining source specifications (X-Specs).
This is possible because semantic naming of concepts allows matching across systems:
- The same semantic name in two databases is assumed to represent the same concept.
- The hierarchical nature of semantic names (consisting of multiple terms) allows a schema to be built up from pieces of relations or objects from each data source.
Effectively, the global view is synthesized by the union of concepts in the underlying systems.

Unity Example Step #4 - Query Processing
The query processor:
- Allows the user to formulate queries on the view.
- Translates from semantic names in the context view to structural queries (SQL) on databases.
  - Involves determining correct field and table mappings and discovery of join conditions and join paths
- Retrieves query results and formats them for display to the user.
Client-side query processing:
- Perform joins between databases using common keys.

Benefits and Contributions
The architecture automatically integrates relational schemas into a global view for querying.
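The automatic integration can be pictured with a small Python sketch in which fields that carry the same semantic name are merged into one global concept. This is only an illustration of the matching rule, not Unity's integration algorithm: the dictionary-shaped X-Specs, the semantic names assigned to each field, and the function `integrate` are hypothetical.

```python
from collections import defaultdict

# Hypothetical X-Specs: each maps a "table.field" to its assigned semantic name.
xspecs = {
    "orderDB": {
        "Cust.id": "[Customer] Id",
        "Cust.name": "[Customer] Name",
        "Order.oid": "[Order] Id",
    },
    "shipmentDB": {
        "Cust.id": "[Customer] Id",
        "Cust.addr": "[Customer] Address",
        "Shipment.oid": "[Order] Id",
    },
}

def integrate(xspecs):
    """Union the concepts of all X-Specs; identical semantic names are merged."""
    view = defaultdict(list)
    for db, fields in xspecs.items():
        for field, semantic_name in fields.items():
            # Record which physical field supplies this global concept.
            view[semantic_name].append(f"{db}.{field}")
    return dict(view)

view = integrate(xspecs)
for semantic_name in sorted(view):
    print(semantic_name, "<-", view[semantic_name])
```

In this sketch a concept that appears in only one source still becomes part of the view, so the result is the union of concepts rather than their intersection.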
Unique contributions:
- Synthesizing a global view from the bottom up instead of top down. This should improve integration scalability.
- Organizing the global view as a hierarchy of concepts instead of relations or predicates simplifies querying, similar to the Universal Relation, as the user does not have to specify specific predicates/relations or join conditions.
- Query processing is achieved by dynamically discovering extraction rules. The discovered rules are similar to the extraction rules of GAV systems.

Future Work
Unity performs schema integration by extracting data source information and performing global joins. However, the global query processor needs to be extended to handle more diverse queries involving:
- aggregation and grouping, recursive queries, queries with selection conditions that span data sources
- support for typical data integration problems of scaling, data type conversions, and translation of units
Synthesizing the global view by combining concepts can be improved by exploiting dictionary knowledge:
- Use IS-A relationships in the dictionary to improve matching.
- Determine when to create new global-level attributes and contexts that are discovered based on interschema relationships.

References
Publications:
- Unity - A Database Integration Tool, R. Lawrence and K. Barker, TRLabs Emerging Technology Bulletin, Jan. 2000.
- Multidatabase Querying by Context, R. Lawrence and K. Barker, DataSem2000, pages 127-136, Oct. 2000.
- Integrating Relational Database Schemas using a Standardized Dictionary, SAC'2001 - ACM Symposium on Applied Computing, pages 225-230, March 2001.
- Querying Relational Databases without Explicit Joins, DASWIS 2001 - International Workshop on Data Semantics in Web Information Systems (with ER'2001), Nov. 2001.
Further Information:
- http://www.cs.uiowa.edu/~rlawrenc/

Extra Slides
Data Warehouse Approach
[Figure: a warehouse fed by three Gather / Refine / Aggregate / Store pipelines, one each from the Invoice, Order, and Shipment databases]
Invoice Database:
    Cust(id,name,addr,city,state,cty)
    Invoice(invId,custId,invDate)
    InvProd(invId,prodId,amt,pr)
    Prod(id,name,pr,desc)
Order Database:
    Cust(id,name,addr,city,state,cty)
    Order(oid,cid,odate)
    OrdProd(oid,pid,amt,pr)
    Prod(id,name,pr,desc)
Shipment Database:
    Cust(id,name,addr,city,state,cty)
    Shipment(shipid,oid,cid,shipdate)
    ShipProd(shipid,prodid,amt)
    Prod(id,name,pr,desc,inv)

Integration Architecture
[Figure: clients connect to a multidatabase layer that holds the integrated context view, X-Spec editor, standard dictionary, integration algorithm, and the query processor with ODBC manager; the query processor sends subtransactions through the X-Specs to the databases, which run them as local transactions]
Architecture Components:
1) Integrated Context View
   - user's view of integration
2) X-Spec Editor
   - stores schema & metadata
   - uses XML
3) Standard Dictionary
   - terms to express semantics
4) Integration Algorithm
   - combines X-Specs into integrated context view
5) Query Processor
   - accepts query on view
   - determines data source mappings and joins
   - executes queries and formats results

Architecture Components
The architecture consists of four components:
- A standard dictionary (SD) to capture data semantics
  - SD terms are used to build semantic names describing semantics of schema elements.
- X-Specs for storing data semantics
  - Database metadata and semantic names stored using XML
- Integration Algorithm
  - Matches concepts in different databases by semantic names.
  - Produces an integrated view of all database concepts.
- Query Processor
  - Allows the user to formulate queries on the view.
  - Translates from semantic names in the integrated view to SQL queries and integrates and formats results.
  - Involves determining correct field and table mappings and discovery of join conditions and join paths

Integration Processes
The integration architecture consists of three separate processes:
- Capture process: independently extracts database schema information and metadata into an XML document called an X-Spec.
- Integration process: combines X-Specs into a structurally neutral hierarchy of database concepts called an integrated context view.
- Query process: allows the user to formulate queries on the integrated view; the query processor maps them to structural queries (SQL), then integrates and formats the results.

Architecture Components: Dictionary vs. Knowledge Base
The standard dictionary differs from a knowledge base such as Cyc because:
- Not intended to be a general English dictionary or contain knowledge facts about the world
  - Dictionary is evolved as new terms are required
  - Not all English words are used
- Dictionary provides the system with no "knowledge"
  - Since no facts are stored, the system cannot deduce new facts
  - Dictionary terms are just semantic placeholders; integrators, not the system, determine the semantics of the database
- Simplified organization
  - Dictionary is organized as a tree for efficiency and simplicity in determining related concepts
- Re-use of terms
  - Terms are re-used in semantic names
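To make the tree organization concrete, here is a minimal Python sketch of a dictionary whose terms each keep a single parent link, so that related concepts can be found by walking toward the root. The class `Term`, the function `closest_common_term`, and the sample terms are assumptions for illustration only; they are not taken from Unity's actual dictionary or its top-level terms.

```python
class Term:
    """One standard-dictionary term with a single link to its parent."""

    def __init__(self, name, parent=None, link="IS-A"):
        self.name = name
        self.parent = parent
        self.link = link  # kind of link to the parent: "IS-A" or "HAS-A"

    def ancestors(self):
        """Return term names from this term up to the root, nearest first."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path

def closest_common_term(a, b):
    """Return the nearest term shared by the root paths of a and b."""
    seen = set(a.ancestors())
    for name in b.ancestors():
        if name in seen:
            return name
    return None

# Hypothetical dictionary fragment (term names are assumed to be unique).
root = Term("Entity")
person = Term("Person", root)
customer = Term("Customer", person)
employee = Term("Employee", person)
order = Term("Order", root)

print(closest_common_term(customer, employee))
print(closest_common_term(customer, order))
```

Because every term has exactly one parent, each lookup only walks two root paths, which is the kind of simplicity the tree organization is meant to provide; no facts are stored, so nothing beyond term relatedness can be derived.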