Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Integrating data sources on the World-Wide Web Ramon Lawrence and Ken Barker U. of Manitoba, U. of Calgary umlawren,[email protected] Introduction • Integration of data is required when accessing multiple databases within an organization or on the WWW. • Our focus is automatically combining database schema using schema integration. • Schema integration requires knowledge of data semantics and use of metadata. Motivation • Organizations have several database systems which must interoperate. • Users often access multiple Web databases whose knowledge must be integrated and presented in a useful form. • Data warehouses and OLAP systems require data semantics to be understood and data to be cleansed and summarized. Background • Schema integration involves combining diverse database schema into an integrated view by resolving conflicts. • Schema conflicts include naming, structural, and semantic conflicts. • Schema integration is required for database interoperability, but it is currently a manual process. Previous Work • Research systems: – integrating systems by logical rules (Sheth) – defining global dictionaries (Castano) – Carnot Project using the Cyc knowledge base • Industrial systems and standards: – Metadata Interchange Specification (MDIS) – XML, BizTalk, E-commerce portals Architecture Components: The Global Dictionary • A global dictionary (GD) provides standardized terms to capture data semantics. – Hierarchy of terms related by IS-A or Has-A links – Contains base set of common database concepts, but new concepts can be added • A GD term is a single, unambiguous semantic definition. – Several GD entries for a single English word are required if the word has multiple definitions. Architecture Components: Using the Global Dictionary • GD terms are used to build semantic names to describe the semantics of schema elements. • Semantic names have the form: – semantic name = “[“CT [[;CT] | [,CT]] “]” CN – CT = context term, CN = concept name – each CT and CN is a single term from the GD • Semantic names are included in RIM specifications describing a data source. Architecture Components: The Relational Integration Model • Database metadata and semantic names are combined into Relational Integration Model (RIM) Specifications (RIM Specs) – contains information on a relational schema – organized into database, table, and field levels – stores semantic names to describe and integrate schema elements Architecture Components: Integrating RIM Specs • Each database to be integrated is described using a RIM specification. • Identical concepts in different databases are identified by similar semantic names. • Concepts with identical (or hierarchially related) semantic names are combined regardless of their physical representation in the individual databases. Integration Architecture • Our integration architecture consists of two separate phases: – capture process: RIM specs are constructed for each data source independently – integration process: RIM specs are combined using the integration algorithm which matches semantic names using the global dictionary Integration Architecture: The Capture Process • Capture process involves: – automatically extracting the schema information and metadata using a specification editor – assigning semantic names to each schema element (tables and fields) to capture their semantics Integration Architecture: The Capture Process Relational Schema Automatic Extraction Specification Editor Global Dictionary DBA Lookup of terms RIM Spec Integration Architecture: The Integration Process • Integration process involves: – automatically identifying identical concepts by matching semantic names – constructing a global view of database concepts consisting of a hierarchy of concept terms – resolving structural differences during query generation and submission (e.g. a concept may be represented as a table in one database and a field (attribute) in another) Integration Architecture: The Integration Process Client …………. Client Integration Site Subtransactions RIM spec RIM spec RDBMS …….. RDBMS Integration Architecture Benefits • The benefits of the two phase architecture are: – Dynamic integration: schemas integrated as needed – RIM Specs are constructed only once and independent of each other – Automatic conflict resolution by integrating based on semantic name rather than physical structure – Users are isolated from system names and organization by querying through a global view using semantic names for concepts Integration Example • Two claims databases to be integrated: – ABC Company: Claims_tb(claim_id, claimant, net_amount, paid_amount) – XYZ Company: T_claims(id, customer, claim_amt), T_payments(cid, pid, amount) • First step is to construct RIM specs for each database. Integration Example: ABC Database RIM Spec Type System Name Semantic Name Table Field Claims_tb Claim_id [Claim] [Claim] Id Field Field Field Claimant Net_amount Paid_amount [Claim;Claimant] Name [Claim] Amount [Claim;Payment] Amount Integration Example: XYZ Database RIM Spec Type System Name Table T_claims Field id Field customer Field claim_amt Table Field Field Field T_payments cid pid amount Semantic Name [Claim] [Claim] Id [Claim;Customer] Name [Claim] Amount [Claim;Payment] [Claim] Id [Claim;Payment] Id [Claim;Payment] Amount Integration Example: Integrated View • Global view after integration: – [Claim] • Id • Net amount • [Customer] – name • [Payment] – id – amount Integration Example: Discussion • Important points: – system and field names are not presented to the user who queries based on semantic names – database structure is not shown to the user – different physical representations for the same concept are combined (e.g. payment (attribute) in ABC with payment table in XYZ database) – hierarchially related concepts (customer vs. claimant) are combined based on their IS-A relationship in the global dictionary Applications to the WWW • Integrating diverse data sources is involved in constructing a data warehouse and other operational systems. • The WWW is a diverse organizations of databases which users access. • Automatically integrating web data sources by a browser or portal reduces query complexity and integration of results for the user. Conclusions • Automatic integration of database schema is possible by using a global dictionary of terms and constructing semantic names for schema elements. • Integration of data sources has applications to the WWW and construction of data warehouses. Important Changes • The integration architecture is constantly being refined. Some notable differences in this presentation versus the paper: – Our integration system uses XML to represent a RIM spec which is renamed as a X-Spec. – An integration site is used as a central portal for integration and management. – No longer using semantic distance calculations between terms. – Format of semantic name has been simplified. Future Work • The integration architecture is involving with standards on XML and now captures metadata information in XML documents. • The system is being tested on sample problems, and a query mechanism is workin-progress. • We are refining a prototype of the system called Unity.