Download Integrating data sources on the World

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Architecture of the United Kingdom wikipedia , lookup

Sacred architecture wikipedia , lookup

History of business architecture wikipedia , lookup

Architectural theory wikipedia , lookup

Architecture wikipedia , lookup

Transcript
Integrating data sources on the
World-Wide Web
Ramon Lawrence and Ken Barker
U. of Manitoba, U. of Calgary
umlawren,[email protected]
Introduction
• Integration of data is required when
accessing multiple databases within an
organization or on the WWW.
• Our focus is automatically combining
database schema using schema integration.
• Schema integration requires knowledge of
data semantics and use of metadata.
Motivation
• Organizations have several database
systems which must interoperate.
• Users often access multiple Web databases
whose knowledge must be integrated and
presented in a useful form.
• Data warehouses and OLAP systems
require data semantics to be understood and
data to be cleansed and summarized.
Background
• Schema integration involves combining
diverse database schema into an integrated
view by resolving conflicts.
• Schema conflicts include naming,
structural, and semantic conflicts.
• Schema integration is required for database
interoperability, but it is currently a manual
process.
Previous Work
• Research systems:
– integrating systems by logical rules (Sheth)
– defining global dictionaries (Castano)
– Carnot Project using the Cyc knowledge base
• Industrial systems and standards:
– Metadata Interchange Specification (MDIS)
– XML, BizTalk, E-commerce portals
Architecture Components:
The Global Dictionary
• A global dictionary (GD) provides
standardized terms to capture data semantics.
– Hierarchy of terms related by IS-A or Has-A links
– Contains base set of common database concepts,
but new concepts can be added
• A GD term is a single, unambiguous semantic
definition.
– Several GD entries for a single English word are
required if the word has multiple definitions.
Architecture Components:
Using the Global Dictionary
• GD terms are used to build semantic names to
describe the semantics of schema elements.
• Semantic names have the form:
– semantic name = “[“CT [[;CT] | [,CT]] “]” CN
– CT = context term, CN = concept name
– each CT and CN is a single term from the GD
• Semantic names are included in RIM
specifications describing a data source.
Architecture Components:
The Relational Integration Model
• Database metadata and semantic names are
combined into Relational Integration Model
(RIM) Specifications (RIM Specs)
– contains information on a relational schema
– organized into database, table, and field levels
– stores semantic names to describe and integrate
schema elements
Architecture Components:
Integrating RIM Specs
• Each database to be integrated is described
using a RIM specification.
• Identical concepts in different databases are
identified by similar semantic names.
• Concepts with identical (or hierarchially
related) semantic names are combined
regardless of their physical representation in
the individual databases.
Integration Architecture
• Our integration architecture consists of two
separate phases:
– capture process: RIM specs are constructed for
each data source independently
– integration process: RIM specs are combined
using the integration algorithm which matches
semantic names using the global dictionary
Integration Architecture:
The Capture Process
• Capture process involves:
– automatically extracting the schema information
and metadata using a specification editor
– assigning semantic names to each schema
element (tables and fields) to capture their
semantics
Integration Architecture:
The Capture Process
Relational
Schema
Automatic
Extraction
Specification
Editor
Global
Dictionary
DBA Lookup
of terms
RIM
Spec
Integration Architecture:
The Integration Process
• Integration process involves:
– automatically identifying identical concepts by
matching semantic names
– constructing a global view of database concepts
consisting of a hierarchy of concept terms
– resolving structural differences during query
generation and submission (e.g. a concept may be
represented as a table in one database and a field
(attribute) in another)
Integration Architecture:
The Integration Process
Client
………….
Client
Integration Site
Subtransactions
RIM spec
RIM spec
RDBMS …….. RDBMS
Integration Architecture Benefits
• The benefits of the two phase architecture are:
– Dynamic integration: schemas integrated as needed
– RIM Specs are constructed only once and
independent of each other
– Automatic conflict resolution by integrating based
on semantic name rather than physical structure
– Users are isolated from system names and
organization by querying through a global view
using semantic names for concepts
Integration Example
• Two claims databases to be integrated:
– ABC Company: Claims_tb(claim_id, claimant,
net_amount, paid_amount)
– XYZ Company: T_claims(id, customer, claim_amt),
T_payments(cid, pid, amount)
• First step is to construct RIM specs for each
database.
Integration Example:
ABC Database RIM Spec
Type
System Name
Semantic Name
Table
Field
Claims_tb
Claim_id
[Claim]
[Claim] Id
Field
Field
Field
Claimant
Net_amount
Paid_amount
[Claim;Claimant] Name
[Claim] Amount
[Claim;Payment] Amount
Integration Example:
XYZ Database RIM Spec
Type
System Name
Table T_claims
Field id
Field customer
Field claim_amt
Table
Field
Field
Field
T_payments
cid
pid
amount
Semantic Name
[Claim]
[Claim] Id
[Claim;Customer] Name
[Claim] Amount
[Claim;Payment]
[Claim] Id
[Claim;Payment] Id
[Claim;Payment] Amount
Integration Example:
Integrated View
• Global view after integration:
– [Claim]
• Id
• Net amount
• [Customer]
– name
• [Payment]
– id
– amount
Integration Example:
Discussion
• Important points:
– system and field names are not presented to the
user who queries based on semantic names
– database structure is not shown to the user
– different physical representations for the same
concept are combined (e.g. payment (attribute)
in ABC with payment table in XYZ database)
– hierarchially related concepts (customer vs.
claimant) are combined based on their IS-A
relationship in the global dictionary
Applications to the WWW
• Integrating diverse data sources is involved
in constructing a data warehouse and other
operational systems.
• The WWW is a diverse organizations of
databases which users access.
• Automatically integrating web data sources
by a browser or portal reduces query
complexity and integration of results for the
user.
Conclusions
• Automatic integration of database schema is
possible by using a global dictionary of
terms and constructing semantic names for
schema elements.
• Integration of data sources has applications
to the WWW and construction of data
warehouses.
Important Changes
• The integration architecture is constantly
being refined. Some notable differences in
this presentation versus the paper:
– Our integration system uses XML to represent a
RIM spec which is renamed as a X-Spec.
– An integration site is used as a central portal for
integration and management.
– No longer using semantic distance calculations
between terms.
– Format of semantic name has been simplified.
Future Work
• The integration architecture is involving
with standards on XML and now captures
metadata information in XML documents.
• The system is being tested on sample
problems, and a query mechanism is workin-progress.
• We are refining a prototype of the system
called Unity.