Summarization – CS 257
Chapter – 21 (Information Integration)
Database Systems: The Complete Book
Submitted by:
Deepti Kundu
Submitted to:
Dr. T. Y. Lin
21.1 Introduction to Information Integration

Need for Information Integration
 Ideally, all the data in the world could be put in a single database (the ideal database system).
 In the real world this is impossible: databases are created independently, and it is hard to design a database that supports every future use.
University Database
 Registrar: records students and grades
 Bursar: records tuition payments by students
 Human Resources Department: records employees
 Other departments…
Inconvenient
 Record grades only for students who have paid tuition
 Want to swim in the SJSU aquatic center for free during summer vacation?
(None of the cases above can be handled by a single one of these databases.)
Solution: one database
How to integrate
 Start over: build one database that contains all the legacy data, and rewrite all the applications. Result: painful.
 Build a layer of abstraction (middleware) on top of all the legacy databases. This layer is often defined by a collection of classes.
BUT…

Heterogeneity Problem
What is the Heterogeneity Problem?
Aardvark Automobile Co.
 1000 dealers have 1000 databases
 To find a model at another dealer, can we use this command?
SELECT * FROM CARS
WHERE MODEL = 'A6';

Types of Heterogeneity
 Communication Heterogeneity
 Query-Language Heterogeneity
 Schema Heterogeneity
 Data Type Differences
 Value Heterogeneity
 Semantic Heterogeneity
21.2 Modes of Information Integration

Federations
 The simplest architecture for integrating several DBs
 One-to-one connections between all pairs of DBs
 If n DBs talk to each other, n(n-1) wrappers are needed
 Good when communication between DBs is limited

Wrapper
 Software that translates incoming queries and outgoing answers. As a result, it allows information sources to conform to some shared schema.
Federations Diagram
[Diagram: DB1, DB2, DB3, and DB4, with 2 wrappers on the connection between each of the 6 pairs of DBs.]
A federated collection of 4 DBs needs 12 components to translate queries from one to another.
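
The n(n-1) count can be checked with a short sketch. The function name `federation_wrappers` is made up for this illustration; it simply lists one wrapper per ordered pair of distinct databases.

```python
from itertools import permutations

def federation_wrappers(databases):
    """Return one (source, target) wrapper per ordered pair of distinct DBs."""
    return list(permutations(databases, 2))

dbs = ["DB1", "DB2", "DB3", "DB4"]
wrappers = federation_wrappers(dbs)
print(len(wrappers))  # 12, i.e. n(n-1) for n = 4
```

The count grows quadratically with the number of databases, which is why a federation is only attractive when few pairs actually need to communicate.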
Data Warehouse
 Sources are translated from their local schema to a global schema and copied to a central DB.
 User transparent: the user queries the data warehouse just like an ordinary DB.
 The user is not allowed to update the data warehouse.
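Warehouse Diagram
[Diagram: the user sends a query to the warehouse and receives a result; a combiner loads the warehouse with data produced by two extractors, one for Source 1 and one for Source 2.]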
Constructing the Data Warehouse
There are three main ways to construct the data in the warehouse:
1) Periodically reconstruct it from the current data in the sources, once a night or at even longer intervals.
Advantages:
 simple algorithms.
Disadvantages:
 the warehouse must be shut down during reconstruction;
 data can become out of date.
2) Update it periodically (e.g., each night) based on the changes made to the sources.
Advantages:
 involves smaller amounts of data (important when the warehouse is large and must be modified in a short period).
Disadvantages:
 the process of calculating changes to the warehouse is complex;
 data can become out of date.
3) Change it immediately, in response to each change or a small set of changes at one or more of the sources.
Advantages:
 data won’t become out of date.
Disadvantages:
 requires too much communication, so it is generally too expensive
(practical only for warehouses whose underlying sources change slowly).
Mediators
 A virtual warehouse: it supports a virtual view, or a collection of views, that integrates several sources.
 A mediator doesn’t store any data.
 A mediator’s tasks:
1) receive the user’s query,
2) send queries to the wrappers,
3) combine the results from the wrappers,
4) send the final result to the user.
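
The four tasks can be sketched as follows. This is only an illustration under assumptions: the class names `Wrapper` and `Mediator`, the attribute names, and the in-memory sources are all hypothetical, and a real wrapper would translate queries for an external source rather than scan a list.

```python
class Wrapper:
    """Hypothetical wrapper: answers a global query from one local source."""

    def __init__(self, rows, rename):
        self.rows = rows        # the source's data, in its local schema
        self.rename = rename    # local attribute name -> global attribute name

    def query(self, model):
        # Translate the source's rows into the shared (global) schema.
        translated = [{self.rename[k]: v for k, v in row.items()}
                      for row in self.rows]
        return [row for row in translated if row["model"] == model]


class Mediator:
    """Stores no data; it only forwards queries and combines answers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, model):                        # 1) receive the user's query
        results = []
        for wrapper in self.wrappers:              # 2) send queries to wrappers
            results.extend(wrapper.query(model))   # 3) combine the results
        return results                             # 4) return the final result


source1 = [{"serialNo": 1, "model": "Gobi"}, {"serialNo": 2, "model": "A6"}]
source2 = [{"vin": 7, "type": "Gobi"}]
mediator = Mediator([
    Wrapper(source1, {"serialNo": "serial", "model": "model"}),
    Wrapper(source2, {"vin": "serial", "type": "model"}),
])
gobis = mediator.query("Gobi")
print(gobis)
```

Both sources use different local attribute names, yet the user sees a single answer in the global schema.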
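A Mediator Diagram
[Diagram: the user sends a query to the mediator and receives a result; the mediator sends a query to each of two wrappers and receives their results; each wrapper in turn queries its own source (Source 1 or Source 2) and receives a result.]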
21.3 Wrappers in Mediator-Based Systems
 Intro
 Templates for Query Patterns
 Wrapper Generator
 Filter
Wrappers in Mediator-Based Systems
 More complicated than wrappers in most data warehouse systems.
 Able to accept a variety of queries from the mediator and translate them into the terms of the source.
 Communicate the result to the mediator.
How to design a wrapper?
Classify the possible queries that the mediator can ask into templates, which are queries with parameters that represent constants.
Wrapper Generators
Templates for Query Patterns:
 Use the notation T => S to express the idea that the template T is turned by the wrapper into the source query S.
 The wrapper generator creates a table that holds the various query patterns contained in the templates, together with the source queries associated with each.
Filter
 A wrapper can use a filter in order to support more queries.
A driver is used in each wrapper. The task of the driver is to:
 Accept a query from the mediator.
 Search the table for a template that matches the query.
 Send the resulting source query to the source, again using a “plug-in” communication mechanism.
 Process the response from the source.
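
A small sketch of a template table, a driver, and a filter is given below. The relation names `AutosMed` and `Autos`, the attribute names, and the use of regular expressions to match templates are assumptions made for this illustration, not details taken from the slides.

```python
import re

# Template table: mediator query pattern T => source query S.
# The named groups (c, m) are the parameters that stand for constants.
TEMPLATES = [
    (r"SELECT \* FROM AutosMed WHERE color = '(?P<c>\w+)'",
     "SELECT serialNo, model, color FROM Autos WHERE color = '{c}'"),
    (r"SELECT \* FROM AutosMed WHERE model = '(?P<m>\w+)'",
     "SELECT serialNo, model, color FROM Autos WHERE model = '{m}'"),
]

def driver(mediator_query):
    """Return the source query for the first matching template, or None."""
    for pattern, source_query in TEMPLATES:
        match = re.fullmatch(pattern, mediator_query)
        if match:
            return source_query.format(**match.groupdict())
    return None

def filter_rows(rows, **conditions):
    """Wrapper-side filter for conditions the source query did not check."""
    return [r for r in rows if all(r[k] == v for k, v in conditions.items())]

source_query = driver("SELECT * FROM AutosMed WHERE color = 'blue'")
print(source_query)
```

If the mediator asks for blue cars of one model but only the color template exists, the wrapper can run the color query and then apply `filter_rows(rows, model=...)` to the answer. That is the sense in which a filter lets a wrapper support more queries than it has templates for.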
21.4 Capability-Based Optimization
Introduction
 A typical DBMS estimates the cost of each query plan and picks what it believes to be the best.
 A mediator often has little knowledge of how long its sources will take to answer.
 Optimization of mediator queries therefore cannot rely on cost measures alone to select a query plan.
 Optimization by a mediator follows capability-based optimization.
21.4.1 The Problem of Limited Source Capabilities




 Many sources have only Web-based interfaces.
 Web sources usually allow querying only through a query form.
 E.g., the Amazon.com interface allows us to query about books in many different ways.
 But we cannot ask questions that are too general, e.g., SELECT * FROM books;
(cont’d)
Reasons why a source may limit the ways in which queries can be asked:
 The earliest databases did not use a relational DBMS that supports SQL queries.
 Indexes on a large database may make certain queries feasible, while others are too expensive to execute.
 Security reasons.
21.4.2 A Notation for Describing Source Capabilities
 For relational data, the legal forms of queries are described by adornments.
 Adornments – sequences of codes that represent the requirements for the attributes of the relation, in their standard order:
 f (free) – the attribute can be specified or not
 b (bound) – a value must be specified for the attribute, but any value is allowed
 u (unspecified) – it is not permitted to specify a value for the attribute
 c[S] (choice from set S) – a value must be specified, and it must be from the finite set S
 o[S] (optional from set S) – either no value is specified, or a value from the finite set S is specified
 A prime (e.g., f’) specifies that an attribute is not part of the output of the query.
 A capabilities specification is a set of adornments.
 A query must match one of the adornments in its capabilities specification.
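
The codes can be made concrete with a small checker. The encoding is an assumption of this sketch: plain codes are strings, c[S] and o[S] are written as the pairs `("c", S)` and `("o", S)`, and a query is given as one value per attribute, with `None` where no value is specified.

```python
def matches(adornment, query):
    """Check a query (one value or None per attribute) against one adornment."""
    if len(adornment) != len(query):
        return False
    for code, value in zip(adornment, query):
        kind, allowed = code if isinstance(code, tuple) else (code, None)
        kind = kind.rstrip("'")    # a prime restricts output only, not input
        if kind == "b" and value is None:
            return False           # bound: a value is required
        if kind == "u" and value is not None:
            return False           # unspecified: no value may be given
        if kind == "c" and value not in allowed:
            return False           # choice: required, and must come from S
        if kind == "o" and value is not None and value not in allowed:
            return False           # optional: if given, must come from S
    return True                    # "f" accepts a value or no value

def is_legal(specification, query):
    """A query is legal if it matches at least one adornment."""
    return any(matches(a, query) for a in specification)

# Capabilities for a relation Options(serial, option): bu and uc[autoTrans, navi].
options_spec = [["b", "u"], ["u", ("c", {"autoTrans", "navi"})]]
print(is_legal(options_spec, (None, "navi")))     # True: matches uc[...]
print(is_legal(options_spec, (None, "sunroof")))  # False: not in the set S
```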
21.4.3 Capability-Based Query-Plan Selection
 Given a query at the mediator, a capability-based query optimizer first considers what queries it can ask at the sources to help answer the query.
 The answers to those queries supply bindings that may make further source queries possible. The process is repeated until either:
 Enough queries have been asked at the sources to resolve all the conditions of the mediator query, so the query is answered. Such a plan is called feasible.
 We can construct no more valid forms of source queries, yet we still cannot answer the mediator query. The mediator query was then impossible.
(cont’d)
 The simplest form of mediator query for which we need to apply the above strategy is a join of relations.
 E.g., we have sources for dealer 2:
 Autos(serial, model, color)
 Options(serial, option)
 Suppose that ubf is the sole adornment for Autos, and Options has two adornments, bu and uc[autoTrans, navi].
 The query is: find the serial numbers and colors of Gobi models with a navigation system.
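
One feasible plan for this query can be sketched as below. The rows in `AUTOS` and `OPTIONS` are made-up sample data, and the two functions stand in for the only query forms the sources are assumed to accept (ubf for Autos, bu for Options).

```python
AUTOS = [(1, "Gobi", "red"), (2, "Gobi", "blue"), (3, "A6", "red")]
OPTIONS = [(1, "navi"), (2, "autoTrans"), (3, "navi")]

def autos_ubf(model):
    """Source query allowed by adornment ubf: model bound, serial not given."""
    return [row for row in AUTOS if row[1] == model]

def options_bu(serial):
    """Source query allowed by adornment bu: serial bound, option not given."""
    return [row for row in OPTIONS if row[0] == serial]

def gobis_with_navi():
    answer = []
    # Step 1: the constant "Gobi" satisfies the b in ubf, so Autos can be asked.
    for serial, _model, color in autos_ubf("Gobi"):
        # Step 2: each serial returned is now a binding for the b in bu.
        options = {option for _serial, option in options_bu(serial)}
        # Step 3: the remaining condition is checked at the mediator.
        if "navi" in options:
            answer.append((serial, color))
    return answer

print(gobis_with_navi())
```

A second feasible plan would ask Options with the uc[autoTrans, navi] form, binding the option to navi, and then join the returned serial numbers with the Gobi rows from Autos. Choosing between such plans is a matter of cost, not capability.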
21.4.4 Adding Cost-Based Optimization
 The mediator’s query optimizer is not done once the capabilities of the sources have been examined.
 Having found the feasible plans, it must choose among them.
 Making an intelligent, cost-based choice requires that the mediator know a great deal about the costs of the queries involved.
 Sources are independent of the mediator, so it is difficult to estimate these costs.
21.5 Optimizing Mediator Queries
 Chain algorithm – a greedy algorithm that finds a way to answer the query by sending a sequence of requests to its sources.
 Will always find a solution, assuming at least one solution exists.
 The solution may not be optimal.
21.5.1 Simplified Adornment Notation
 A query at the mediator is limited to b (bound) and f (free) adornments.
 We use the following convention for describing adornments:
 Name^adornments(attributes), with the adornments written as a superscript
 where:
 Name is the name of the relation
 the number of adornments = the number of attributes
21.5.2 Obtaining Answers for Subgoals
Rules for subgoals and sources:
 Suppose we have the following subgoal:
R^{x1x2…xn}(a1, a2, …, an),
and a source adornment for R is y1y2…yn.
 The adornment on the subgoal matches the adornment at the source if, for every i:
 If yi is b or c[S], then xi = b.
 If xi = f, then yi is not output restricted (not primed).
 If yi is f, u, or o[S], then xi may be either b or f.
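
These matching rules translate into a short check. The representation is an assumption of this sketch: the subgoal adornment is a list of 'b' and 'f', and the source adornment is a list of code strings such as 'b', 'u', "f'", or 'c[S]'.

```python
def subgoal_matches(subgoal, source):
    """Return True if the subgoal adornment matches the source adornment."""
    if len(subgoal) != len(source):
        return False
    for x, y in zip(subgoal, source):
        if (y.startswith("b") or y.startswith("c")) and x != "b":
            return False   # the source demands a value the subgoal cannot supply
        if x == "f" and y.endswith("'"):
            return False   # the subgoal needs the value, but the source hides it
    return True

print(subgoal_matches(["b", "f"], ["b", "f"]))   # True
print(subgoal_matches(["f", "f"], ["b", "f"]))   # False
```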
21.5.3 The Chain Algorithm
 Maintains 2 types of information:
 An adornment for each subgoal.
 A relation X that is the join of the relations for all the subgoals that have been resolved.
 Initially, the adornment for a subgoal has b in a position iff the mediator query provides a constant binding for the corresponding argument of that subgoal.
 Initially, X is a relation over no attributes, containing just an empty tuple.
(cont’d)
First, initialize the adornments of the subgoals and X. Then, repeatedly select a subgoal that can be resolved. Let R^α(a1, a2, …, an) be the subgoal:
1. Wherever α has a b, we shall find that the corresponding argument of R is a constant or a variable in the schema of X. Project X onto its variables that appear in R.
(cont’d)
2. For each tuple t in the projection of X, issue a query to the source as follows (β is a source adornment):
 If a component of β is b, then the corresponding component of α is b, and we can use the corresponding component of t for the source query.
 If a component of β is c[S], and the corresponding component of t is in S, then the corresponding component of α is b, and we can use the corresponding component of t for the source query.
 If a component of β is f, and the corresponding component of α is b, provide a constant value for the source query.
(cont’d)
 If a component of β is u, then provide no binding for this component in the source query.
 If a component of β is o[S], and the corresponding component of α is f, then treat it as if it were f.
 If a component of β is o[S], and the corresponding component of α is b, then treat it as if it were c[S].
3. Every variable among a1, a2, …, an is now bound. For each remaining unresolved subgoal, change its adornment so that any position holding one of these variables is b.
(cont’d)
4. Replace X with X ⋈ π_S(R), where S is all of the variables among a1, a2, …, an.
5. Project out of X all components that correspond to variables that do not appear in the head or in any unresolved subgoal.
If every subgoal is resolved, then X is the answer.
If subgoals remain but none of them can be resolved, then the algorithm fails.
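
A minimal sketch of this loop follows, under simplifying assumptions that are not from the slides: each relation has a single source with one adornment over b, f, and u only (no c[S] or o[S]); variables are strings starting with `?` and appear at most once per subgoal; sources are in-memory lists of tuples; and step 5 is applied only once, at the end. The sample data is made up.

```python
def is_var(arg):
    """Convention for this sketch: variables are strings starting with '?'."""
    return isinstance(arg, str) and arg.startswith("?")

def chain(head, subgoals, sources):
    """head: list of variables; subgoals: list of (relation, args);
    sources: relation -> (adornment over 'b'/'f'/'u', list of tuples)."""
    X = [{}]                  # a relation over no attributes, one empty tuple
    bound = set()             # variables currently in the schema of X
    unresolved = list(subgoals)
    while unresolved:
        for goal in unresolved:
            rel, args = goal
            adornment, rows = sources[rel]
            # Resolvable if every 'b' position holds a constant or bound variable.
            if all(code != "b" or not is_var(a) or a in bound
                   for code, a in zip(adornment, args)):
                break
        else:
            return None       # subgoals remain but none can be resolved: fail
        new_X = []
        for t in X:
            # Argument positions whose value is known for this tuple t.
            known = {i: t.get(a, a) for i, a in enumerate(args)
                     if not is_var(a) or a in t}
            # Bindings are sent to the source only where it demands them.
            sent = {i: v for i, v in known.items() if adornment[i] == "b"}
            result = [r for r in rows if all(r[i] == v for i, v in sent.items())]
            for r in result:
                # The other known values are checked at the mediator (the join).
                if all(r[i] == v for i, v in known.items()):
                    extended = dict(t)
                    extended.update({a: r[i] for i, a in enumerate(args)
                                     if is_var(a)})
                    new_X.append(extended)
        X = new_X
        bound |= {a for a in args if is_var(a)}
        unresolved.remove(goal)
    # Final projection onto the head variables.
    return sorted({tuple(t[v] for v in head) for t in X})

sources = {
    "Autos": ("ubf", [(1, "Gobi", "red"), (2, "Gobi", "blue"), (3, "A6", "red")]),
    "Options": ("bu", [(1, "navi"), (2, "autoTrans"), (3, "navi")]),
}
query = [("Options", ["?s", "navi"]), ("Autos", ["?s", "Gobi", "?c"])]
print(chain(["?s", "?c"], query, sources))
```

In this run Options cannot be resolved first, because its b position holds the unbound variable `?s`. Autos is resolved first, which binds `?s`, and Options then becomes resolvable. This is the chaining that gives the algorithm its name.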
21.5.4 Incorporating Union Views at the Mediator
 This implementation of the Chain Algorithm does not consider that several sources can contribute tuples to a relation.
 If specific sources have tuples to contribute that other sources may not have, it adds complexity.
 To resolve this, we can consult all sources, or make best efforts to return all the answers.
(cont’d)
 Consulting All Sources
 We can only resolve a subgoal when each source for its relation has an adornment matched by the current adornment of the subgoal.
 Less practical, because it makes queries harder to answer, and impossible to answer if any source is down.
 Best Efforts
 We need only 1 source with a matching adornment to resolve a subgoal.
 We need to modify the chain algorithm to revisit each subgoal when that subgoal has new bound requirements.
21.6 Local-as-View Mediators
 GAV: In a global-as-view mediator, the global data is like a view: it doesn’t exist physically, but pieces of it are constructed by the mediator by asking queries of the sources.
 LAV: A local-as-view mediator defines global predicates at the mediator, but we do not define these predicates as views of the source data.
 Instead, for each source we define expressions involving the global predicates that describe the tuples that source is able to produce. Queries are answered at the mediator by discovering all possible ways to construct the query using the views provided by the sources.
Motivation for LAV Mediators
 With LAV, the mediator can discover how and when to use a source in a given query.
 Example: Par(c,p) is a global predicate meaning that p is a parent of c. Under GAV, Par(c,p) can be defined from sources that supply child-parent facts, but a source that supplies only grandparent facts cannot be used.
 Under LAV, both kinds of source can be described in terms of Par(c,p), so queries can draw on child-parent and even grandparent information.
Terminology for LAV Mediation
 LAV mediators are defined using a form of logic that serves as the language for defining views.
 That language is Datalog. Both the queries at the mediator and the view definitions that describe sources are single Datalog rules, known as conjunctive queries.
 A LAV mediator has global predicates, which are used as the subgoals of mediator queries.
 Other conjunctive queries define the views: each has a head with a unique view predicate and a body consisting of global predicates, and each is associated with a particular source.
Containment of Conjunctive Queries
 For a conjunctive query S to be a solution to the mediator query Q, the expansion of S, call it E, must produce only answers that Q also produces; that is, E ⊆ Q.
 A containment mapping from Q to E is a function Γ from the variables of Q to the variables and constants of E such that:
1. If x is the ith argument of the head of Q, then Γ(x) is the ith argument of the head of E.
2. Add to Γ the rule that Γ(c) = c for any constant c. If P(x1, x2, …, xn) is a subgoal of Q, then P(Γ(x1), Γ(x2), …, Γ(xn)) is a subgoal of E.
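
The two conditions can be tested by brute force on small queries. This sketch assumes a representation that is not in the slides: a conjunctive query is a pair (head arguments, body), the body is a list of (predicate, arguments), and variables are strings starting with `?`. It tries every assignment, so it is only practical for tiny examples.

```python
from itertools import product

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def containment_mapping(Q, E):
    """Return a containment mapping from Q to E as a dict, or None."""
    q_head, q_body = Q
    e_head, e_body = E
    q_vars = sorted({a for _, args in q_body for a in args if is_var(a)}
                    | {a for a in q_head if is_var(a)})
    e_terms = sorted({a for _, args in e_body for a in args} | set(e_head),
                     key=str)
    e_subgoals = {(p, tuple(args)) for p, args in e_body}
    for image in product(e_terms, repeat=len(q_vars)):
        gamma = dict(zip(q_vars, image))
        # gamma.get(a, a) maps every constant to itself.
        if [gamma.get(a, a) for a in q_head] != list(e_head):
            continue          # condition 1: heads must correspond
        if all((p, tuple(gamma.get(a, a) for a in args)) in e_subgoals
               for p, args in q_body):
            return gamma      # condition 2: every subgoal of Q lands in E
    return None

# Q: grandparents. E: grandparents who themselves have a recorded parent.
Q = (["?c", "?g"], [("par", ["?c", "?p"]), ("par", ["?p", "?g"])])
E = (["?a", "?b"], [("par", ["?a", "?m"]), ("par", ["?m", "?b"]),
                    ("par", ["?b", "?k"])])
print(containment_mapping(Q, E))  # a mapping exists, so E is contained in Q
print(containment_mapping(E, Q))  # None: Q is not contained in E
```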
Why the Containment-Mapping Test Works
Questions:
1. If there is a containment mapping, why must there be a containment of conjunctive queries?
2. If there is containment, why must there be a containment mapping?
Finding Solutions to a Mediator Query
 We have a query Q and look for solutions S such that the expansion E of S is contained in Q.
 “If a query Q has n subgoals, then any answer produced by any solution is also produced by a solution that has at most n subgoals.”
 This is known as the LMSS Theorem.
Why the LMSS Theorem Holds
 Suppose query Q has n subgoals and S is a solution with more than n subgoals. The expansion E of S must be contained in Q, so there is a containment mapping from Q to E.
 Let S’ be the solution obtained by removing from S all subgoals whose expansion is not the target of any subgoal of Q under that mapping.
 Then E’ ⊆ Q, where E’ is the expansion of S’, by the same containment mapping.
 Also, S ⊆ S’, by the identity mapping.
 Thus there is no need for S among the solutions to query Q.
21.7 Entity Resolution

Determining whether two records or tuples do or do not
represent the same person, organization, place or other entity
is called ENTITY RESOLUTION.
Deciding Whether Records Represent a Common Entity
Two records represent the same individual if the two records have similar values for each of the fields associated with those records.
Requiring the values of corresponding fields to be identical is not adequate, for the following reasons:
1. Misspellings
2. Variant Names
3. Misunderstanding of Names
4. Evolution of Values
5. Abbreviations
Deciding Whether Records Represent a Common Entity - Edit Distance
A first approach to measuring the similarity of records is edit distance.
Values that are strings can be compared by counting the number of insertions and deletions of characters it takes to turn one string into another.
The records are taken to represent the same entity if their edit distance is below a given threshold.
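
The insertion/deletion distance described here can be computed with a standard dynamic program, sketched below. The helper `same_entity` and its default threshold are assumptions for illustration; a suitable threshold depends on the data.

```python
def edit_distance(s, t):
    """Number of single-character insertions and deletions turning s into t."""
    # prev[j] is the distance between the current prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]                       # deleting i characters gives ""
        for j, ct in enumerate(t, start=1):
            if cs == ct:
                curr.append(prev[j - 1])  # characters agree: no edit needed
            else:
                # Either delete cs from s, or insert ct.
                curr.append(1 + min(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def same_entity(a, b, threshold=4):
    """Hypothetical rule: values match if their distance is below threshold."""
    return edit_distance(a, b) < threshold

print(edit_distance("Smythe", "Smith"))  # 3: delete y, insert i, delete e
```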
Deciding Whether Records Represent a Common Entity - Normalization
We can normalize records by replacing certain substrings with others. For instance, we can use a table of abbreviations and replace abbreviations with what they normally stand for.
Once records are normalized, we can use the edit distance to measure the difference between normalized values in the fields.
Merging Similar Records
 Merging means replacing two records that are similar enough to merge with a single record that contains the information of both.
 There are many possible merge rules, for example:
1. Set the field in which the records disagree to the empty string.
2. (i) Merge by taking the union of the values in each field, and
(ii) declare two records similar if at least two of the three fields have a nonempty intersection.
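
Rule 2 can be sketched as follows, assuming each record is a tuple of three fields (name, address, phone) and each field is a set of values. The sample records are illustrative only.

```python
def similar(r, s):
    """Rule 2(ii): at least two of the three fields share a value."""
    return sum(1 for a, b in zip(r, s) if a & b) >= 2

def merge(r, s):
    """Rule 2(i): field-by-field union of the values."""
    return tuple(a | b for a, b in zip(r, s))

r = (frozenset({"Susan"}), frozenset({"123 Oak St."}), frozenset({"818-555-1234"}))
s = (frozenset({"Susan"}), frozenset({"456 Maple St."}), frozenset({"818-555-1234"}))

if similar(r, s):
    print(merge(r, s))  # one record holding both addresses
```

Keeping sets of values, rather than overwriting one value with another, means no information is lost when the records disagree in a field.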
Useful Properties of Similarity and Merge Functions
The following properties say that the merge operation is a semilattice:
1. Idempotence: The merge of a record with itself should surely be that record.
2. Commutativity: If we merge two records, the order in which we list them should not matter.
3. Associativity: The order in which we group records for a merger should not matter.
There are some other properties that we expect the similarity relationship to have:
• Idempotence for similarity: A record is always similar to itself.
• Commutativity of similarity: In deciding whether two records are similar, it does not matter in which order we list them.
• Representability: If r is similar to some other record s, but s is instead merged with some other record t, then r remains similar to the merger of s and t and can be merged with that record.
R-Swoosh Algorithm for ICAR Records
(ICAR: the similarity and merge functions satisfy Idempotence, Commutativity, Associativity, and Representability.)
Input: A set of records I, a similarity function, and a merge function.
Output: A set of merged records O.
Method:
O := emptyset;
WHILE I is not empty DO BEGIN
    let r be any record in I;
    find, if possible, some record s in O that is similar to r;
    IF no such record s exists THEN
        move r from I to O
    ELSE BEGIN
        delete r from I;
        delete s from O;
        add the merger of r and s to I;
    END;
END;
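
The method above can be transcribed into Python roughly as follows. The `similar` and `merge` functions used here (union of values per field, similar when two of three fields overlap) and the sample records are assumptions chosen for the demonstration; any pair of functions with the ICAR properties could be passed in instead.

```python
def r_swoosh(records, similar, merge):
    """Merge similar records until no two records in the output are similar."""
    I = list(records)
    O = []
    while I:
        r = I.pop()                                   # let r be any record in I
        s = next((x for x in O if similar(r, x)), None)
        if s is None:
            O.append(r)                               # move r from I to O
        else:
            O.remove(s)                               # delete s from O
            I.append(merge(r, s))                     # add the merger to I
    return O

def similar(r, s):
    """At least two of the three fields share a value."""
    return sum(1 for a, b in zip(r, s) if a & b) >= 2

def merge(r, s):
    """Field-by-field union of the values."""
    return tuple(a | b for a, b in zip(r, s))

def record(name, address, phone):
    return (frozenset({name}), frozenset({address}), frozenset({phone}))

records = [
    record("Susan", "123 Oak St.", "818-555-1234"),
    record("Susan", "456 Maple St.", "818-555-1234"),
    record("Susan", "456 Maple St.", "213-555-5678"),
]
result = r_swoosh(records, similar, merge)
print(len(result))  # 1: all three records describe the same entity
```

The merged record is put back into I rather than O because it may now be similar to records that neither of its parts matched on its own.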
Other Approaches to Entity Resolution
The other approaches to entity resolution are:
 Non-ICAR Datasets: We can define a dominance relation r <= s, meaning that record s contains all the information contained in record r. If so, then we can eliminate record r from further consideration.
 Clustering: Sometimes we group the records into clusters such that members of a cluster are in some sense similar to each other and members of different clusters are not similar.
 Partitioning: We can group the records, perhaps several times, into groups that are likely to contain similar records, and look only within each group for pairs of similar records.