Data integration
mediation system
Heterogeneous
reasoning and
mediator system
“ … The mountain is a mountain,
The mountain is not a mountain
The mountain is a mountain. “
Presented by
Taras Mahlin
Problem: Mountain is not a mountain
• The past few decades have
witnessed a spectacular
explosion in the quantity
of data available in one
electronic form or another.
• This vast quantity of data
has been gathered,
organized, and stored by a
small army of individuals,
working for different
organizations on varied
problems.
Solution: Mountain is a mountain
• Synergetic approach - the
complete thing is much
more than all its
components together.
• Integration of disparate
data sources by pooling
fragmented data together,
resolving data conflicts,
and transforming them into
information objects.
• All this while users
continue to use existing
systems for the routine
functions of add, change, and
delete.
Integrating Heterogeneous Data Sources
Advantages
• Integration alleviates the burden of duplicating data-gathering
efforts.
• Synergetic effect - it enables the extraction of information that
would otherwise be impossible.
For example:
– Law enforcement agencies (Interpol)
– Insurance companies
– Medical researchers and epidemiologists
Integrating Heterogeneous Reasoning Paradigms
• In conjunction with the ability to integrate a variety of data
sources is the need to integrate diverse forms of reasoning.
• Access to such reasoning systems provides mediators with
sophisticated abilities to extract and produce new information
from existing data.
For example:
– Problem of terrain reasoning
• Determining where resources can be physically
situated
• Integrating multiple forms of reasoning that may
include logical inference, numerical optimization,
planning, pattern recognition, scheduling, and learning.
Mediator technology
• Seamless integration of
information located across
multiple, heterogeneous
computer platforms and
recorded in multiple,
heterogeneous electronic
formats.
– relational database
management systems,
– other non-relational database
management systems,
– flat files, text files, etc.
• Mediator technology defines a
structure and architecture that
allows software applications
to be independent of the
underlying data resources.
Mediator technology cont.
• Mediators provide:
– Intelligence for
understanding, selecting,
accessing, merging, and
manipulating data.
– New level of knowledge
– Consistent responses to
questions regardless of
who asks the question.
– Seamless integration of
information from
multiple existing sources
without having to
redesign existing
databases (i.e., legacy
data) or change existing
operational systems.
Mediator technology - summary
• Mediators perform “mediation” between applications and databases.
• Mediators are software modules that occupy an explicit, active layer
between an end user application and the data sources the application is
accessing. In this way, the Mediator forms a distinct middle layer,
making user applications independent of data sources.
• They capture knowledge from the data experts so that the common
user can find the information.
• Mediators do not create a new database. A mediator creates a “virtual”
database that supplies data contained in the existing database(s).
Mediators use existing databases and require no redesign or changes in
these databases or existing operational systems.
• Mediators provide easy access to information. They support a
heterogeneous computing environment (i.e., multiple hardware,
software, and databases). They provide a cost-effective means to integrate
data from heterogeneous information systems.
Mediators - goals and implementation
• The aim of the system is
– to develop a principled methodology for integrating multiple data
sources and reasoning systems,
– and to propose a mediator language within which access to the data
sources and reasoning systems can be expressed uniformly.
• There are two important aspects to constructing a
mediator: domain integration and semantic integration.
– Domain integration is the physical linking of the data sources and
reasoning systems.
– Semantic integration is the coherent extraction and combination
of the information provided by the data and reasoning sources,
serving a given purpose.
Domain integration
Goal:
– Adding a new source of data or reasoning system to an
existing mediated system (or one being developed) such
that the requirements below are met.
Requirements:
– resources provided by the new system, whether it is
• new data,
• new representations of data,
• or a corpus of new reasoning algorithms,
may be accessed by various mediators.
– no recompilation of the whole system is needed
– integrity of the system is preserved
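The requirements above can be illustrated with a small registry, offered only as a sketch: the class name `SourceRegistry`, its methods, and the sample rows are hypothetical and are not taken from the presented system. New sources are plugged in at run time (no rebuild of the whole system), and a duplicate name is rejected so that existing registrations stay intact.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Fetch = Callable[[], List[Row]]


class SourceRegistry:
    """Hypothetical registry: mediators look sources up by name at run time."""

    def __init__(self) -> None:
        self._sources: Dict[str, Fetch] = {}

    def register(self, name: str, fetch: Fetch) -> None:
        # Refuse to overwrite an existing source, so earlier registrations
        # keep working (a simple stand-in for "integrity is preserved").
        if name in self._sources:
            raise ValueError(f"source already registered: {name}")
        self._sources[name] = fetch

    def names(self) -> List[str]:
        return sorted(self._sources)

    def fetch(self, name: str) -> List[Row]:
        return self._sources[name]()


registry = SourceRegistry()
registry.register("Personnel", lambda: [{"employee_id": 1, "title": "Engineer"}])
# A new source is added later without touching the code above.
registry.register("Benefits", lambda: [{"employee_id": 1, "dependent": "Dependent A"}])
```

Any mediator holding a reference to `registry` can use the new "Benefits" source immediately.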
Semantic integration
• Semantic integration is the process of specifying
methods
– to resolve conflicts,
– pool information together,
– and define new, compositional operations based on
existing operations in the individual data sources.
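As a rough illustration of a compositional operation, the sketch below builds a new operation, `employee_profile`, from two existing per-source lookups. The function names and the sample values are invented for this example; the slides do not specify any of them.

```python
from typing import Dict, Optional

# Hypothetical stand-ins for operations that already exist in each source.
_PERSONNEL_TITLES: Dict[int, str] = {1: "Engineer"}
_PAYROLL_SALARIES: Dict[int, int] = {1: 50000}


def personnel_title(employee_id: int) -> Optional[str]:
    return _PERSONNEL_TITLES.get(employee_id)


def payroll_salary(employee_id: int) -> Optional[int]:
    return _PAYROLL_SALARIES.get(employee_id)


def employee_profile(employee_id: int) -> Dict[str, object]:
    """New operation composed from the two existing source operations."""
    return {
        "employee_id": employee_id,
        "title": personnel_title(employee_id),
        "salary": payroll_salary(employee_id),
    }


profile = employee_profile(1)
```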
Data Integration and Mediation System
• DIMS is an implementation of
“intelligent middleware” that
resides between user
applications and independent
data sources.
• Data sources can reside on
multiple, heterogeneous
computer platforms and may
be recorded in a variety of
formats.
• DIMS creates a “virtual object
database” so that the user
application sees the data
retrieved from the various
sources as though it were
returned from a single,
integrated database.
System major functions
• DIMS performs five major
functions:
– query decomposition/routing
– object unification and fusion
– removal of data redundancies
– identification/resolution of
data inconsistencies
– advanced data integration
techniques
• Although DIMS performs
query decomposition/routing
to multiple, heterogeneous data
sources, DIMS’s main
advantage is its data instance
integration functionality.
Query processing example
• Query : retrieve information
about Employees and their
associated Dependents.
• We assume that the Employee
and Dependent information is
spread across three disparate
data sources:
– Personnel database
– Payroll database
– Benefits database.
• The Employee information is
distributed across the Personnel
and Payroll databases.
• The Dependent information is
contained in the Benefits
database.
Query processing example cont.
• Initially, a single query
for Employee and
their associated
Dependent(s)
information is sent
from a user
application to DIMS.
Query processing example cont.
• First retrieve the Employee
objects which meet the
specified constraint.
• Based upon domain-specific
knowledge, we know that the
Personnel database can
supply the Employee name
and title information,
whereas the Payroll database
can supply the Employee
name and salary information.
• In both cases, DIMS
automatically “knows” to
also retrieve the Employee
ID which will be needed for
later data integration
functions.
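The routing step described on this slide might look roughly like the following. This is a minimal sketch, not the DIMS implementation: the `CATALOGUE` table, the `decompose` function, and the attribute name `employee_id` are assumptions made for illustration. Only the name/title and name/salary split comes from the slide.

```python
from typing import Dict, List, Set

# Which source can supply which Employee attributes (per the slide).
CATALOGUE: Dict[str, Set[str]] = {
    "Personnel": {"name", "title"},
    "Payroll": {"name", "salary"},
}
JOIN_KEY = "employee_id"  # always fetched so later steps can match rows


def decompose(requested: List[str]) -> Dict[str, List[str]]:
    """Split one request into per-source attribute lists, adding the join key."""
    plan: Dict[str, List[str]] = {}
    for source, provided in CATALOGUE.items():
        wanted = [attr for attr in requested if attr in provided]
        if wanted:
            plan[source] = [JOIN_KEY] + wanted
    return plan


plan = decompose(["name", "title", "salary"])
```

Here `plan` asks Personnel for the ID, name, and title, and Payroll for the ID, name, and salary.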
Query processing example cont.
• Tabular results are
returned from the
Personnel and Payroll
databases to DIMS.
• Note that Mark Smith is
returned only from the
Personnel database and
Jane Peterson is returned
only from the Payroll
database.
Query processing - data retrieving
• DIMS performs object
unification based on the data
returned from the data sources.
Object unification is the
combining of the data into
object instances.
• Notice that the “Mark Smith”
and the “Jane Peterson”
objects have empty attributes
since their information was
returned from single sources
with only partial information.
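Object unification can be sketched as grouping rows by Employee ID. In the example below the IDs, titles, and salaries are invented placeholders (the original tables are not reproduced in this transcript), and `unify` is a hypothetical helper. Each attribute keeps its values together with the source that supplied them, which later steps rely on.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Sourced = List[Tuple[object, str]]  # (value, source name) pairs

# Placeholder rows; only the employee names come from the slides.
personnel: List[Row] = [
    {"employee_id": 1, "name": "Tim Andrews", "title": "Engineer"},
    {"employee_id": 2, "name": "Mark Smith", "title": "Analyst"},
]
payroll: List[Row] = [
    {"employee_id": 1, "name": "Tim Andrews", "salary": 50000},
    {"employee_id": 3, "name": "Jane Peterson", "salary": 60000},
]


def unify(sources: Dict[str, List[Row]]) -> Dict[object, Dict[str, Sourced]]:
    """Combine rows that share an employee_id into one object per employee."""
    objects: Dict[object, Dict[str, Sourced]] = {}
    for source_name, rows in sources.items():
        for row in rows:
            obj = objects.setdefault(
                row["employee_id"], {"name": [], "title": [], "salary": []}
            )
            for attr, value in row.items():
                if attr != "employee_id":
                    obj[attr].append((value, source_name))
    return objects


employees = unify({"Personnel": personnel, "Payroll": payroll})
```

"Mark Smith" ends up with an empty salary and "Jane Peterson" with an empty title, mirroring the partially filled objects described above.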
Query processing - redundancy elimination
• Once the object instances have
been created, DIMS then
removes any extraneous data
redundancies.
• In this example, the “Tim
Andrews” object has the same
name listed twice
• Assume that the domain object
model specified that each
Employee object should have
only one name attribute.
• Therefore, the “Tim Andrews”
object has an extraneous,
redundant name attribute which
should not exist.
Query processing - redundancy elimination cont.
• The system will
automatically remove
the second, redundant
occurrence of the
name for the “Tim
Andrews” object.
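A minimal sketch of this step, assuming each attribute holds (value, source) pairs: `remove_redundant` is a hypothetical helper that keeps the first occurrence of each distinct value and drops exact repeats.

```python
from typing import List, Tuple

Sourced = List[Tuple[object, str]]  # (value, source name) pairs


def remove_redundant(values: Sourced) -> Sourced:
    """Keep the first occurrence of each distinct value; drop later repeats."""
    seen = set()
    kept: Sourced = []
    for value, source in values:
        if value not in seen:
            seen.add(value)
            kept.append((value, source))
    return kept


tim_names = [("Tim Andrews", "Personnel"), ("Tim Andrews", "Payroll")]
deduped = remove_redundant(tim_names)
```

Values that differ are left alone, so a genuine conflict between sources survives this step and can be handled as an inconsistency.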
Inconsistency resolving
• DIMS will then identify data
inconsistencies within the objects.
It can also provide resolutions to
these data inconsistencies.
• In this example, there is a data
inconsistency in the “Sarah
Jones/Kaiser” object because
– the Personnel database
returned the name as “Sarah
Jones”
– whereas the Payroll database
returned the name as “Sarah
Kaiser” for the same
Employee ID.
Inconsistency resolving cont.
• DIMS identifies the data
inconsistency. DIMS will
then flag the identified
inconsistencies within an
object.
• DIMS can also provide the
source information
associated with each data
inconsistency to allow
further automated and/or
manual inconsistency
handling.
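Flagging could be sketched as follows. The function `find_inconsistencies` and the idea of a `single_valued` attribute list are assumptions for illustration; the title value is a placeholder. The flagged entry keeps the source of every conflicting value, so it can be passed on to automated or manual handling.

```python
from typing import Dict, List, Sequence, Tuple

Sourced = List[Tuple[object, str]]  # (value, source name) pairs


def find_inconsistencies(
    obj: Dict[str, Sourced], single_valued: Sequence[str] = ("name",)
) -> Dict[str, Sourced]:
    """Flag single-valued attributes that hold more than one distinct value."""
    flagged: Dict[str, Sourced] = {}
    for attr in single_valued:
        values = obj.get(attr, [])
        if len({value for value, _ in values}) > 1:
            flagged[attr] = list(values)  # values stay paired with their sources
    return flagged


sarah = {
    "name": [("Sarah Jones", "Personnel"), ("Sarah Kaiser", "Payroll")],
    "title": [("Manager", "Personnel")],  # placeholder value
}
flags = find_inconsistencies(sarah)
```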
Data inconsistency rules.
• Data inconsistency rules can
be defined for a specific
domain for DIMS.
• DIMS uses a rules-based
expert system to apply the
rules over the data.
• In this example, assume that
a data inconsistency rule is
defined that says to use data
from the Payroll database if
there is an inconsistency in an
employee’s name attribute.
Data inconsistency rules cont.
• Based upon the
example’s rule, DIMS
will remove the
“Sarah Jones” name
that came from the
Personnel database
from the “Sarah
Jones/Kaiser” object.
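The example rule can be approximated with a simple lookup table. This is a deliberate simplification: the slides describe a rules-based expert system, whereas `PREFERRED_SOURCE` and `apply_rules` below are hypothetical names for a much smaller stand-in.

```python
from typing import Dict, List, Tuple

Sourced = List[Tuple[object, str]]  # (value, source name) pairs

# Hypothetical rule table: on a conflict in this attribute, trust this source.
PREFERRED_SOURCE: Dict[str, str] = {"name": "Payroll"}


def apply_rules(obj: Dict[str, Sourced]) -> Dict[str, Sourced]:
    """Resolve conflicting attributes by keeping values from the preferred source."""
    resolved: Dict[str, Sourced] = {}
    for attr, values in obj.items():
        distinct = {value for value, _ in values}
        preferred = PREFERRED_SOURCE.get(attr)
        if len(distinct) > 1 and preferred is not None:
            kept = [(v, s) for v, s in values if s == preferred]
            # If the preferred source supplied nothing, leave the conflict as is.
            resolved[attr] = kept or list(values)
        else:
            resolved[attr] = list(values)
    return resolved


sarah = {"name": [("Sarah Jones", "Personnel"), ("Sarah Kaiser", "Payroll")]}
resolved = apply_rules(sarah)
```

Attributes without a rule are returned unchanged, so unresolved conflicts remain visible for manual handling.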
Getting dependent information
• After the Employee objects have
been integrated, DIMS will then
send another query for the
Dependent information associated
with each of these Employees.
• This example assumes that only
the Benefits database contains
Dependent information. Based on
the domain-specific knowledge,
DIMS “knows” that each
Employee object is associated with
its Dependent object(s) via the
Employee ID attribute.
• Therefore DIMS uses this
information to constrain the new
query.
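One way to picture the constrained query, assuming for illustration that the Benefits source is a relational database: the sketch uses an in-memory SQLite table with invented table, column, and dependent names. None of these come from the slides.

```python
import sqlite3
from typing import List, Sequence, Tuple

# Stand-in for the Benefits database; all rows are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dependents (employee_id INTEGER, dependent_name TEXT)")
conn.executemany(
    "INSERT INTO dependents VALUES (?, ?)",
    [(1, "Dependent A"), (1, "Dependent B"), (9, "Dependent C")],
)


def fetch_dependents(
    db: sqlite3.Connection, employee_ids: Sequence[int]
) -> List[Tuple[int, str]]:
    """Fetch dependents only for the Employee IDs already retrieved."""
    if not employee_ids:
        return []
    placeholders = ", ".join("?" for _ in employee_ids)
    sql = (
        "SELECT employee_id, dependent_name FROM dependents "
        f"WHERE employee_id IN ({placeholders}) "
        "ORDER BY employee_id, dependent_name"
    )
    return db.execute(sql, list(employee_ids)).fetchall()


rows = fetch_dependents(conn, [1, 2, 3])
```

The dependent of employee 9 is never returned, because that ID was not among the integrated Employee objects.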
Getting dependent information cont.
• The Dependents
information is returned
from the Benefits
database.
Object unification
• DIMS again performs object
unification on the new
results.
• However, instead of making
totally independent objects
for the Dependents, DIMS
integrates the Dependent
objects with the appropriate
Employee objects.
• Since the Dependent objects
contained neither redundant data
nor data inconsistencies, no
further processing is needed
on the Dependent
information.
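Attaching Dependents to their Employees might be sketched like this. The helper `attach_dependents`, the `dependents` attribute, and the sample data are hypothetical; the only detail taken from the slides is that the Employee ID links the two.

```python
from typing import Dict, List, Tuple


def attach_dependents(
    employees: Dict[int, Dict[str, object]],
    dependent_rows: List[Tuple[int, str]],
) -> Dict[int, Dict[str, object]]:
    """Nest each dependent under the employee with the matching ID."""
    merged = {emp_id: dict(emp, dependents=[]) for emp_id, emp in employees.items()}
    for employee_id, dependent_name in dependent_rows:
        if employee_id in merged:
            merged[employee_id]["dependents"].append(dependent_name)
    return merged


# Placeholder data for illustration only.
employees = {1: {"name": "Tim Andrews"}, 2: {"name": "Mark Smith"}}
dependent_rows = [(1, "Dependent A"), (1, "Dependent B")]
integrated = attach_dependents(employees, dependent_rows)
```

Employees without dependents simply carry an empty list, so the response has the same shape for every object.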
Composing the result
• Finally, DIMS returns all the
Employee objects and their
associated Dependent objects to
the user application as a single,
packaged integrated response.
• The user application never had
to “know” anything about all the
extra processing that DIMS
performed -- it simply knows that
it had to send one query to
DIMS and received one “clean”,
integrated response.
Advanced integration
• Units conversion
• Data abstraction
• Data aggregations
• Expert rules