Planning for the Web I
Data Integration
Dan Weld
University of Washington
June, 2003
Acknowledgements
• Alon Halevy
• Zack Ives
• Rao Kambhampati
• UW students
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
My Two Talks for Today
• Data Integration
Providing uniform access to disparate data sources
AI meets DB
Answering queries using views
Execution in the face of uncertainty, latency
• Service Integration
Invoking and composing web services
Query and update
Planning with incomplete information
Overview: Data Integration
• Motivation
• Wrappers / information extraction
• Database Review
• Modeling data sources
Content, completeness, capabilities
Reformulation algorithms: Bucket, MINICON
What is Data Integration?
A system providing:
Uniform (same query interface to all sources)
Access to (queries; eventually updates too)
Multiple (we want many, but 2 is hard too)
Autonomous (DBA doesn’t report to you)
Heterogeneous (data models are different)
Structured (or at least semi-structured)
Data Sources (not only databases).
[Figure: meta-search flow. User enters query; formulate queries; send to engines (Lycos, Excite, ...); collate results; remove duplicates; download?; post-process + rank; present to user.]
Meta-?
• Web Search
• Shopping
• Product Reviews
• Chat Finder
• Columnists (e.g. jokes, sports, …)
• Email Lookup
• Event Finder
• People Finder
• Restaurant Reviews
• Job Listings
• Classifieds
• Apartment + Real Estate
Intuition: Info Integration
• Info aggregation … on Steroids!
• Want agent such that
• User says what she wants
• Agent decides how & when to achieve it
• Example:
Show me all reviews of movies starring Matt Damon that are currently playing in Seattle
[Figure: sources involved: Sidewalk, IMDB, Ebert]
Info. Aggregation vs. Integration
[Figure: Aggregation: the query "prices of laptop with …" is sent to store1, store2, … storeN and the results are sorted. Integration: the query "movies in Seattle starring …" uses IMDB and Sidewalk, joins their results, and consults rev1, rev2, … revN; operations include join, sort, aggregate.]
• More complex queries
• Dynamic generation/optimization of execution plan
• Applicable to wider range of problems
• Much harder to implement efficiently
Challenges
User must know which sites have relevant info
User must go to each one in turn
Slow: Sequential access takes time
Confusing: Each site has a different interface
User must manually integrate information
Practical Motivation
• Enterprise
Business “dashboard”; web-site construction.
• WWW
Comparison shopping
Portals integrating data from multiple sources
B2B, electronic marketplaces
• Science and culture:
Medical genetics: integrating genomic data
Astrophysics: monitoring events in the sky.
Environment: Puget Sound Regional Synthesis Model
Culture: uniform access to all cultural databases
produced by countries in Europe.
The Problem: Data Integration
[Figure: the mybooks.com mediated schema sits above several groups of sources, reached over the Internet or a WAN. Labels shown: Books (MorganKaufman, PrenticeHall, ...), Inventory, Orders (East, West), Shipping (FedEx, UPS), Customer Reviews, Reviews (NYTimes, ..., alt.books.reviews).]
Uniform query capability across autonomous, heterogeneous data sources on LAN, WAN, or Internet
Current Solutions
• Mostly ad-hoc programming: create a
special solution for every case;
pay consultants a lot of money.
• Data warehousing: load all the data
periodically into a warehouse.
6-18 months lead time
Separates operational DBMS from decision
support DBMS. (not only a solution to data
integration).
Performance is good; data may not be fresh.
Need to clean, scrub your data.
Data Warehouse Architecture
[Figure: data warehouse architecture, top to bottom: user queries; OLAP / decision support / data cubes / data mining; relational database (warehouse); data extraction, cleaning/scrubbing; data extraction programs; data sources.]
Warehouse Summary
• Pro
Relatively simple
Good performance (OLAP support)
Mature technology (DB, ETL industries)
• Con
Expensive
Stale data
Risky – most warehouse projects fail
• Rigid architecture
• Fixed schema
• Must know all queries ahead of time
Architecture for Virtual Integration
Leave the data in the sources.
When a query comes in:
1) Determine the sources relevant to the query
2) Break down the query into sub-queries for the sources.
3) Get the answers from the sources, and combine them appropriately.
Data is fresh.
Challenge: performance.
Virtual Integration Architecture
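[Figure: virtual integration architecture, top to bottom: user queries posed over the mediated schema (which data model?); the mediator, made up of a reformulation engine, an optimizer and an execution engine, with access to the data source catalog; one wrapper per data source; the data sources.]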
Sources can be: relational, hierarchical (IMS), structured files, web sites.
Research Projects
• Garlic (IBM)
• Information Manifold (AT&T)
• Tsimmis, InfoMaster (Stanford)
• Internet Softbot/Razor/Tukwila (U Wash.)
• Hermes (Maryland)
• Telegraph / Eddies (UC Berkeley)
• Niagara (Univ Wisconsin)
• DISCO, Agora (INRIA, France)
• SIMS/Ariadne (USC/ISI)
• Emerac/Havasu (ASU)
Industry
• Nimble Technology
• Enosys Markets
• IBM starting to announce stuff
• BEA marketing announcing stuff too.
Dimensions to Consider
• How many sources are we accessing?
• How autonomous are they?
• Meta-data about sources?
• Is the data structured?
• Queries or also updates?
• Requirements: accuracy, completeness, performance, handling inconsistencies.
• Closed world assumption vs. open world?
Outline
Motivation
• Wrappers / information extraction
• Database Review
• Modeling data sources
Content, completeness, capabilities
Reformulation algorithms: Bucket, MINICON
Wrapper Programs
• Task
to communicate with the data sources and do
format translations.
• Built w.r.t. a specific source.
• Can sit either at the source or mediator.
• Often hard to build
(very little science).
• Can be “intelligent”
perform source-specific optimizations.
Example
Transform:
<b> Introduction to DB </b>
<i> Phil Bernstein </i>
<i> Eric Newcomer </i>
Addison Wesley, 1999
into:
<book>
<title> Introduction to DB </title>
<author> Phil Bernstein </author>
<author> Eric Newcomer </author>
<publisher> Addison Wesley </publisher>
<year> 1999 </year>
</book>
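A minimal sketch of such a wrapper in Python, assuming every listing follows exactly the layout above (title in <b>, each author in <i>, then a "publisher, year" line). The function name wrap_book is hypothetical, and real pages usually need sturdier parsing than regular expressions.

```python
import re
import xml.etree.ElementTree as ET


def wrap_book(html: str) -> ET.Element:
    """Translate one book listing from the source's HTML layout into a <book> element."""
    # Assumption: the title is the only <b> element in the listing.
    title = re.search(r"<b>(.*?)</b>", html, re.S).group(1).strip()
    # Assumption: every <i> element holds exactly one author.
    authors = [a.strip() for a in re.findall(r"<i>(.*?)</i>", html, re.S)]
    # Assumption: whatever follows the last </i> reads "publisher, year".
    tail = html.rsplit("</i>", 1)[1].strip()
    publisher, year = (part.strip() for part in tail.rsplit(",", 1))

    book = ET.Element("book")
    ET.SubElement(book, "title").text = title
    for author in authors:
        ET.SubElement(book, "author").text = author
    ET.SubElement(book, "publisher").text = publisher
    ET.SubElement(book, "year").text = year
    return book


SOURCE_HTML = """<b> Introduction to DB </b>
<i> Phil Bernstein </i>
<i> Eric Newcomer </i>
Addison Wesley, 1999"""

book = wrap_book(SOURCE_HTML)
print(ET.tostring(book, encoding="unicode"))
```

The wrapper is tied to one source's layout, which is why the slides describe wrappers as built with respect to a specific source.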
Wrapper Construction
• Use PERL, or
• Generate wrappers automatically
Get training examples
• Human marks up selected pages with GUI tool
Use shallow NLP to create features
Favorite learning method
• HMMs, VS on prefix, postfix strings, ??
Boosting
Co-training
• See research on information extraction
SemTag & Seeker
• WWW-03 Best Paper Prize
• Seeded with TAP ontology (72k concepts)
And ~700 human judgments
• Crawled 264 million web pages
• Extracted 434 million semantic tags
Automatically disambiguated
Outline
Motivation
Wrappers / information extraction
• Database Review
• Relational algebra, SQL, datalog
• Views
• Optimization (query planning)
• Modeling data sources
Content, completeness, capabilities
Reformulation algorithms: Bucket, MINICON
Traditional Database Architecture
[Figure: a query (SQL) goes into the Database Manager (DBMS), which works against the (relational) database and returns an answer (relation).]
Database Manager (DBMS)
-Storage mgmt
-Query processing
-View management
-(Transaction processing)
Relational Data: Terminology
Product: relation (Arity=4)
Name          Price     Category      Manufacturer    (each column is an attribute)
gizmo         $19.99    gadgets       GizmoWorks      (each row is a tuple)
Power gizmo   $29.99    gadgets       GizmoWorks
SingleTouch   $149.99   photography   Canon
MultiTouch    $203.99   household     Hitachi
schema:
Product(Name: string, Price: real, category: enum, Manufacturer: string)
Relational Algebra
• Operators
tuple sets as input,
new set as output
• Operations
Union, intersection, difference, ...
Selection (σ)
Projection (π)
Cartesian product (×)
• Join (⋈)
Name          Price     Category      Manufacturer
gizmo         $19.99    gadgets       GizmoWorks
Power gizmo   $29.99    gadgets       GizmoWorks
SingleTouch   $149.99   photography   Canon
MultiTouch    $203.99   household     Hitachi
City     Manufacturer
Tempe    GizmoWorks
Kyoto    Canon
Dayton   Hitachi
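The operators can be sketched in a few lines of Python over lists of dictionaries. This is an illustrative sketch, not how a DBMS implements them: the function names, the name COMPANY for the second table, and the choice to store prices as floats are all assumptions made here.

```python
def select(rows, predicate):
    """Selection (sigma): keep the tuples that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, attributes):
    """Projection (pi): keep the named attributes, dropping duplicate tuples."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


def natural_join(left, right):
    """Join: pair tuples that agree on every attribute name they share."""
    result = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[a] == r_row[a] for a in shared):
                result.append({**l_row, **r_row})
    return result


PRODUCT = [
    {"Name": "gizmo", "Price": 19.99, "Category": "gadgets", "Manufacturer": "GizmoWorks"},
    {"Name": "Power gizmo", "Price": 29.99, "Category": "gadgets", "Manufacturer": "GizmoWorks"},
    {"Name": "SingleTouch", "Price": 149.99, "Category": "photography", "Manufacturer": "Canon"},
    {"Name": "MultiTouch", "Price": 203.99, "Category": "household", "Manufacturer": "Hitachi"},
]
COMPANY = [  # hypothetical name for the City/Manufacturer table
    {"City": "Tempe", "Manufacturer": "GizmoWorks"},
    {"City": "Kyoto", "Manufacturer": "Canon"},
    {"City": "Dayton", "Manufacturer": "Hitachi"},
]

# Cities of manufacturers that make a product cheaper than $100.
cheap = select(PRODUCT, lambda row: row["Price"] < 100)
cities = project(natural_join(cheap, COMPANY), ["City"])
print(cities)
```

Every operator takes tuple sets in and returns a new tuple set, so the calls compose freely; that composability is what the optimizer exploits later in the deck.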
SQL: A query language for Relational Algebra
Many standards out there: SQL92, SQL2,
SQL3, SQL99
Select attributes
From relations (possibly multiple, joined)
Where conditions (selections)
Other features:
aggregation, group-by
etc.
“Find companies that manufacture products bought by Joe Blow”
SELECT Company.name
FROM Company, Product
WHERE Company.name = Product.maker
AND Product.name IN
(SELECT product
FROM Purchase
WHERE buyer = 'Joe Blow');
Deductive Databases
• Tables viewed as predicates.
• Operations on tables expressed as “datalog” rules
(Horn clauses, without function symbols)
Enames(Name) :- Employee(Name, SSN) [Projection]
Wealthy-Employee(Name) :- Employee(Name, SSN), Salary(SSN, Money), Money > 100000 [Selection]
Ed(Name, Dname) :- Employee(Name, SSN), Employee_Dependents(SSN, Dname) [Join]
Emprelated(Name, Dname) :- Ed(Name, Dname)
Emprelated(Name, Dname) :- Ed(Name, D1), Emprelated(D1, Dname) [Recursion]
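Recursive rules are evaluated bottom-up until no new facts appear. The sketch below does this naively in Python for the two Emprelated rules, reading the recursive rule as transitive closure (second body atom Emprelated(D1, Dname)). The function name and the sample Ed facts are hypothetical.

```python
def emprelated(ed):
    """Least fixpoint of:
         Emprelated(N, D) :- Ed(N, D)
         Emprelated(N, D) :- Ed(N, D1), Emprelated(D1, D)
    """
    result = set(ed)  # facts produced by the non-recursive rule
    while True:
        # One application of the recursive rule to everything derived so far.
        derived = {
            (name, dname)
            for (name, d1) in ed
            for (start, dname) in result
            if d1 == start
        }
        if derived <= result:  # nothing new: fixpoint reached
            return result
        result |= derived


ED = {("ann", "bob"), ("bob", "cy"), ("cy", "dee")}  # hypothetical facts
closure = emprelated(ED)
print(sorted(closure))
```

Naive evaluation re-derives old facts on every round; semi-naive evaluation, which joins only against newly derived facts, is the usual refinement.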
More datalog terminology
A datalog program is a set of datalog rules.
A program with a single rule is a conjunctive query.
We distinguish EDB predicates and IDB predicates
• EDB’s are stored in the database, appear only in the bodies
• IDB’s are intensionally defined, appear in both bodies and heads.
Views
Views are relations,
except that they are not physically stored.
Uses:
• simplify complex queries, &
• define conceptually different views of DB for diff. users.
Example:
purchases of telephony products:
CREATE VIEW telephony_purchases AS
SELECT product, buyer, seller, store
FROM Purchase, Product
WHERE Purchase.product = Product.name
AND Product.category = 'telephony'
A Different View
CREATE VIEW Seattle_view AS
SELECT buyer, seller, product, store
FROM Person, Purchase
WHERE Person.city = 'Seattle' AND
Person.name = Purchase.buyer
We can later use the view:
SELECT name, store
FROM Seattle_view, Product
WHERE Seattle_view.product = Product.name AND
Product.category = 'shoes'
What’s really happening when we query a view??
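One way to answer that question: the system substitutes the view's definition into the query (view unfolding) and runs the result against the base tables. The sketch below checks this on a toy SQLite database. The table contents are invented, and the view is spelled Seattle_view because a hyphen is not valid in an unquoted SQL identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person   (name TEXT, city TEXT);
CREATE TABLE Purchase (buyer TEXT, seller TEXT, product TEXT, store TEXT);
CREATE TABLE Product  (name TEXT, category TEXT);

-- Hypothetical sample data.
INSERT INTO Person   VALUES ('Ann', 'Seattle'), ('Bob', 'Tacoma');
INSERT INTO Purchase VALUES ('Ann', 'ShoeCo',  'sneaker', 'Downtown'),
                            ('Bob', 'ShoeCo',  'sneaker', 'Mall'),
                            ('Ann', 'PhoneCo', 'handset', 'Downtown');
INSERT INTO Product  VALUES ('sneaker', 'shoes'), ('handset', 'telephony');

CREATE VIEW Seattle_view AS
  SELECT buyer, seller, product, store
  FROM Person, Purchase
  WHERE Person.city = 'Seattle' AND Person.name = Purchase.buyer;
""")

# Query written against the view.
via_view = conn.execute("""
  SELECT Product.name, store
  FROM Seattle_view, Product
  WHERE Seattle_view.product = Product.name AND Product.category = 'shoes'
  ORDER BY 1, 2
""").fetchall()

# The same query with the view definition unfolded by hand.
unfolded = conn.execute("""
  SELECT Product.name, Purchase.store
  FROM Person, Purchase, Product
  WHERE Person.city = 'Seattle' AND Person.name = Purchase.buyer
    AND Purchase.product = Product.name AND Product.category = 'shoes'
  ORDER BY 1, 2
""").fetchall()

print(via_view, unfolded)
```

Both queries return the same rows, so the view adds no stored data of its own.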
Materialized Views
• Views whose corresponding queries have been
executed and the data is stored in a separate
database
Uses: Caching
• Issues
Using views in answering queries
• Normally, the views are available in addition to DB
– (so, views are local caches)
• In information integration, views may be the only
things we have access to.
– An internet source that specializes in Woody Allen movies can be seen as a view on a database of all movies.
– Except that no database out there contains all movies.
Query Optimization
Goal:
Declarative SQL query
SELECT P.buyer
FROM Purchase P, Person Q
WHERE P.buyer = Q.name AND
Q.city = 'seattle' AND
Q.phone > '5430000'
Imperative query execution plan:
[Figure: plan tree, top to bottom: π buyer; σ city='seattle' ∧ phone>'5430000'; ⋈ buyer=name (Simple Nested Loops); leaves Person and Purchase, one read by table scan and the other by index scan.]
Inputs:
• the query
• statistics about the data (indexes, cardinalities, selectivity factors)
• available memory
Ideally: Want to find best plan. Practically: Avoid worst plans!
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND
R.bid=100 AND S.rating>5
[Figure: three alternative plans for this query.
Plan 1: join Reserves and Sailors on sid=sid (Simple Nested Loops), then apply bid=100 and rating > 5 and project sname, all on-the-fly.
Plan 2: select bid=100 on Reserves (scan; write to temp T1) and rating > 5 on Sailors (scan; write to temp T2), join on sid=sid (Sort-Merge Join), then project sname on-the-fly.
Plan 3: select bid=100 on Reserves (use hash index; do not write result to temp), join with Sailors on sid=sid with pipelining, then apply rating > 5 and project sname on-the-fly.]
• Goal of optimization: To find more efficient plans that compute the same answer.
Relational Algebra Equivalences
• Allow us to choose different join orders and to ‘push’ selections and projections ahead of joins.
Selections:
$\sigma_{c_1 \wedge \dots \wedge c_n}(R) \equiv \sigma_{c_1}(\dots \sigma_{c_n}(R))$ (Cascade)
$\sigma_{c_1}(\sigma_{c_2}(R)) \equiv \sigma_{c_2}(\sigma_{c_1}(R))$ (Commute)
Projections:
$\pi_{a_1}(R) \equiv \pi_{a_1}(\dots(\pi_{a_n}(R)))$ (Cascade)
Joins:
$R \bowtie (S \bowtie T) \equiv (R \bowtie S) \bowtie T$ (Associative)
$(R \bowtie S) \equiv (S \bowtie R)$ (Commute)
Optimizing Joins
• Q(u,x) :- R(u,v), S(v,w), T(w,x)
[Figure: relations R, S, T joined in a chain]
• Many ways of doing a single join R ⋈ S
Symmetric vs. asymmetric join operations
• Nested join, hash join, double pipe-lined hash join etc.
Processing costs alone vs. processing + transfer costs
• Get R and S together vs. get R, then get just the tuples of S that will join with R (“semi-join”)
• Many orders in which to do the join
(R join S) join T
(S join R) join T
(T join S) join R etc.
• All with different costs
Determining Join Order
• In principle, we need to consider all possible join orderings:
[Figure: three alternative join trees over the relations A, B, C, D]
• As the number of joins increases, the number of alternative plans
grows rapidly; we need to restrict the search space.
• System-R: consider only left-deep join trees.
Left-deep trees allow us to generate all fully pipelined plans:
• Intermediate results not written to temporary files.
– Not all left-deep trees are fully pipelined
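To get a feel for how fast the search space grows, the sketch below counts plans for n relations under two simplifying assumptions made here: cross products are allowed, and swapping the left and right input of a join counts as a different plan. Under those assumptions there are n! left-deep orders and (2(n-1))!/(n-1)! join trees of any shape. The function names are hypothetical.

```python
from math import factorial


def left_deep_orders(n: int) -> int:
    """Number of left-deep join orders over n relations: one per permutation."""
    return factorial(n)


def all_join_trees(n: int) -> int:
    """Number of binary join trees with n labeled leaves, of any shape,
    counting left/right operand order: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (2, 4, 6, 8, 10):
    print(n, left_deep_orders(n), all_join_trees(n))
```

Restricting the search to left-deep trees shrinks the space considerably, though it still grows factorially, which is why System R pairs the restriction with dynamic programming.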
Cost Estimation
• For each plan considered, estimate cost:
Estimate cost of each operation in plan tree.
• Depends on input cardinalities.
Estimate size of result for each op in tree!
• Use information about the input relations.
• Selectivity (Histograms)
• For selections and joins, assume independence of
predicates.
• System R cost estimation approach.
Very inexact, but works ok in practice.
More sophisticated techniques known now.
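A rough sketch of size estimation under the independence assumption. The selection estimate multiplies the selectivities of the individual predicates. The equi-join estimate divides by the larger number of distinct join values, a common textbook formula rather than anything stated on the slide. The function names and all numbers are hypothetical.

```python
def selection_size(cardinality, selectivities):
    """Estimated output size of a conjunctive selection.

    Assumes the predicates are independent, so their selectivities multiply.
    """
    size = float(cardinality)
    for selectivity in selectivities:
        size *= selectivity
    return size


def equijoin_size(card_r, card_s, distinct_r, distinct_s):
    """Estimated size of R join S on one attribute.

    Textbook estimate: |R| * |S| / max(V(R, a), V(S, a)), where V counts
    the distinct values of the join attribute in each input.
    """
    return card_r * card_s / max(distinct_r, distinct_s)


# Hypothetical statistics: 40,000 tuples, two predicates with
# selectivities 0.5 and 0.25.
print(selection_size(40_000, [0.5, 0.25]))
# Hypothetical join: 100,000 tuples joined with 40,000 tuples on a key
# that has 40,000 distinct values on both sides.
print(equijoin_size(100_000, 40_000, 40_000, 40_000))
```

When predicates are correlated the product underestimates the result size, one reason the slide calls the approach very inexact.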
Key Lessons in Optimization
• Classic planning / execution scenario
Uncertainty / replanning key for data integration
• Main points
Disk IO as cost metric
Algebraic rules / use in query transformation.
Join ordering via dynamic programming
Estimating cost of plans
• Size of intermediate results.
Integrator vs. DBMS
No common schema
Sources with heterogeneous schemas
Semi-structured sources
Legacy Sources
Not relational-complete
Variety of access/process limitations
Autonomous sources
No central administration
Uncontrolled source content overlap
Lack of source statistics
Tradeoffs between query plan cost, coverage, quality, …
Multi-objective cost models
Unpredictable run-time behavior
Makes query execution hard
Outline
Motivation
Wrappers / information extraction
Database Review
• Modeling data sources
Content, completeness, capabilities
Reformulation algorithms: Bucket, MINICON
Data Source Catalog
• Contains meta-information about sources:
Logical source contents (books, new cars).
Source capabilities (can answer SQL queries?)
Source completeness (has all books).
Physical properties of source and network.
Statistics about the data (like in an RDBMS)
Source reliability
Mirror sources?
Update frequency.
Content Descriptions
• User queries refer to the mediated schema.
• Source data is stored in a local schema.
• Content descriptions provide
semantic mappings between different schemas.
• Data integration system
uses the descriptions to translate user queries
into queries on the sources.
Desiderata for Source Descriptions
• Expressive power: distinguish between
sources with closely related data. Enable
pruning of access to irrelevant sources.
• Easy addition: make it easy to add new
data sources.
• Reformulation: be able to reformulate a
user query into a query on the sources
efficiently and effectively.
Reformulation Problem
• Given:
A query Q posed over the mediated schema
Descriptions of the data sources
• Find:
A query Q’ over the data source relations, such
that:
• Q’ provides only correct answers to Q, and
• Q’ provides all possible answers to Q given the sources.
Approaches to Specifying Source Descriptions
• Global-as-view: express the mediated
schema relations as a set of views over
the data source relations
• Local-as-view: express the source
relations as views over the mediated
schema.
• Can be combined with no additional cost.
Global-as-View
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Create View Movie AS
select * from S1 [S1(title,dir,year,genre)]
union
select * from S2 [S2(title,dir,year,genre)]
union
select S3.title, S3.dir, S4.year, S4.genre [S3(title,dir), S4(title,year,genre)]
from S3, S4
where S3.title=S4.title
Global-as-View: Example 2
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Create View Movie AS
select title, dir, year, NULL [S1(title,dir,year)]
from S1
union
select title, dir, NULL, genre [S2(title,dir,genre)]
from S2
Global-as-View: Example 3
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Source S4: S4(cinema, genre)
Create View Movie AS
select NULL, NULL, NULL, genre
from S4
Create View Schedule AS
select cinema, NULL, NULL
from S4.
But what if we want to find which cinemas are
playing comedies?
Global-as-View Summary
• Very easy conceptually.
Query reformulation amounts to view unfolding.
• Can build hierarchies of mediated schemas.
• Sometimes loses information.
Not always natural.
• Adding sources is hard.
Need to consider all other sources that are available.
May need to modify every global view definition.
Local-as-View: example 1
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Create Source S1 AS
select * from Movie
Create Source S3 AS [S3(title, dir)]
select title, dir from Movie
Create Source S5 AS
select title, dir, year
from Movie
where year > 1960 AND genre='Comedy'
Local-as-View: Example 2
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Source S4: S4(cinema, genre)
Create Source S4 AS
select cinema, genre
from Movie m, Schedule s
where m.title=s.title
Now if we want to find which cinemas are
playing comedies, there is hope!
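The contrast with Global-as-View Example 3 can be checked on a toy instance. In the sketch below the GAV views pad the missing columns of S4 with NULL, so the join on title between Movie and Schedule finds nothing. The LAV description says S4 already is that join, so the question can be answered directly from S4. The cinema names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE S4 (cinema TEXT, genre TEXT);
INSERT INTO S4 VALUES ('Egyptian', 'Comedy'), ('Neptune', 'Drama');  -- hypothetical

-- Global-as-View, as in Example 3: each mediated relation is a view over S4.
CREATE VIEW Movie AS
  SELECT NULL AS title, NULL AS dir, NULL AS year, genre FROM S4;
CREATE VIEW Schedule AS
  SELECT cinema, NULL AS title, NULL AS time FROM S4;
""")

# "Which cinemas are playing comedies?" posed over the mediated schema.
# The NULL titles never compare equal, so the join is empty.
gav_answer = conn.execute("""
  SELECT s.cinema
  FROM Movie m, Schedule s
  WHERE m.title = s.title AND m.genre = 'Comedy'
""").fetchall()

# Local-as-View: S4 is described as the join of Movie and Schedule on title,
# projected onto (cinema, genre), so the query rewrites to a selection on S4.
lav_answer = conn.execute(
    "SELECT cinema FROM S4 WHERE genre = 'Comedy'"
).fetchall()

print(gav_answer, lav_answer)
```

The GAV views discard the fact that a cinema and a genre came from the same S4 tuple, which is exactly the information the query needs.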
Local-as-View Summary
• Very flexible.
You have the power of the entire query language
to define the contents of the source.
• Hence, can easily distinguish between
contents of closely related sources.
• Adding sources is easy:
They’re independent of each other.
• Query reformulation:
Answering queries using views!
The General Problem
• Given a set of views V1,…,Vn, and a query Q,
can we answer Q using only the answers to
V1,…,Vn?
Many, many papers on this problem.
Great survey on the topic: (Halevy, 2001).
• The best performing algorithm:
MiniCon (Pottinger & Levy, 2000).
Modeling Source Capabilities
• Negative capabilities:
A web site may require certain inputs (in an
HTML form).
Need to consider only valid query execution
plans.
• Positive capabilities:
A source may be an ODBC compliant system.
Need to decide placement of operations
according to capabilities.
• Problem: how to describe and exploit source
capabilities.
Example #1: Access Patterns
Mediated schema relation: Cites(paper1, paper2)
Create Source S1 as
select *
from Cites
given paper1
Create Source S2 as
select paper1
from Cites
Query: select paper1 from Cites where paper2='H03'
Example #1: Continued
Create Source S1 as
select *
from Cites
given paper1
Create Source S2 as
select paper1
from Cites
Select S1.paper1
From S1, S2
Where S2.paper1=S1.paper1 AND
S1.paper2='H03'
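Read as an execution plan, the rewriting first asks S2 for candidate values of paper1 and then feeds each one into S1, which cannot be called without that binding. A minimal Python sketch, with invented citation data and hypothetical function names:

```python
# Hypothetical contents of the mediated relation Cites(paper1, paper2).
CITES = {("P1", "H03"), ("P2", "H03"), ("P2", "P1"), ("P3", "P2")}


def s1(paper1):
    """Source S1: all Cites tuples for a given paper1 (the binding is required)."""
    return {(p1, p2) for (p1, p2) in CITES if p1 == paper1}


def s2():
    """Source S2: every paper1 value in Cites; no binding needed."""
    return {p1 for (p1, _) in CITES}


def papers_citing(target):
    """Plan: get candidates from S2, call S1 once per candidate, filter on paper2."""
    return {
        p1
        for candidate in s2()
        for (p1, p2) in s1(candidate)
        if p2 == target
    }


print(sorted(papers_citing("H03")))
```

The plan is valid only because S2 supplies the bindings; a query against S1 alone would have no legal way to start.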
Example #2: Access Patterns
Create Source S1 as
select *
from Cites
given paper1
Create Source S2 as
select paperID
from UW-Papers
Create Source S3 as
select paperID
from AwardPapers
given paperID
Query: select * from AwardPapers
Example #2: Solutions
• Can’t go directly to S3 (it requires a binding).
• Can go to S2, get UW papers, and check if they’re in S3.
• Can go to S2, get UW papers, feed them into S1, and then check if the results are in S3.
• Can go to S2, feed results into S1, feed results into S1 again, and then check if they’re in S3.
• Note: we can’t a priori decide when to stop.
Need recursive query processing.
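A sketch of that recursive plan in Python, using the three source definitions from Example #2: the UW-Papers source gives the starting IDs, the Cites source (which needs paper1 bound) is called repeatedly to discover more IDs, and every ID found is probed against the AwardPapers source (which needs paperID bound). The data and function names are invented for illustration.

```python
# Hypothetical contents behind the three sources.
CITES = {("U1", "A"), ("A", "B"), ("B", "C"), ("U2", "U1")}
UW_PAPERS = {"U1", "U2"}
AWARD_PAPERS = {"U2", "B", "C", "Z"}  # "Z" is never cited by anything reachable


def cites_given(paper1):
    """Cites source: papers cited by paper1 (binding required)."""
    return {p2 for (p1, p2) in CITES if p1 == paper1}


def uw_papers():
    """UW-Papers source: no binding required."""
    return set(UW_PAPERS)


def is_award_paper(paper_id):
    """AwardPapers source: can only be probed with a given paperID."""
    return paper_id in AWARD_PAPERS


def reachable_award_papers():
    """Follow citations from the UW papers until no new IDs appear,
    then probe the award source with every ID that was found."""
    known = set()
    frontier = uw_papers()
    while frontier:  # the number of rounds is not known in advance
        known |= frontier
        discovered = set()
        for paper in frontier:
            discovered |= cites_given(paper)
        frontier = discovered - known
    return {paper for paper in known if is_award_paper(paper)}


print(sorted(reachable_award_papers()))
```

The loop finds every award paper reachable through citations but never "Z", so the plan returns the best obtainable answer, not necessarily the complete one.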
Local Completeness Information
• If sources are incomplete, we need to look at
each one of them.
• Often, sources are locally complete.
• Movie(title, director, year) complete for
years after 1960, or for American directors.
• Question: given a set of local completeness
statements, is a query Q’ a complete answer
to Q?
Example
• Movie(title, director, year)
(complete after 1960).
• Show(title, theater, city, hour)
• Query: find movies (and directors) playing in
Seattle:
Select m.title, m.director
From Movie m, Show s
Where m.title=s.title AND s.city='Seattle'
• Complete or not?
Example #2
• Sources
Movie(title, director, year),
Oscar(title, year)
• Query: find directors whose movies won
Oscars after 1965:
select m.director
from Movie m, Oscar o
where m.title=o.title AND m.year=o.year
AND o.year > 1965.
• Complete or not?
Matching Objects Across Sources
• How do I know that D. Weld in source 1 is
the same as Daniel S. Weld in source 2?
• If uniform keys exist across sources, easy.
• If not:
Domain specific solutions
• (e.g., maybe look at the address, …).
Use IR techniques (Cohen, 98).
• Judge similarity as you would between documents.
Use concordance tables.
• These are time-consuming to build, but you can then
sell them for lots of money.
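As an illustration of the IR route, the sketch below treats each name as a tiny document, weights its tokens by TF-IDF, and compares names by cosine similarity. It is a generic textbook formulation and is not claimed to be the method of the cited paper; the tokenizer, the function names and the third name in the example are assumptions.

```python
import math
import re
from collections import Counter


def tokens(name):
    """Lower-case alphanumeric tokens, so 'D. Weld' becomes ['d', 'weld']."""
    return re.findall(r"[a-z0-9]+", name.lower())


def tfidf_vectors(names):
    """One sparse TF-IDF vector (dict token -> weight) per name."""
    docs = [Counter(tokens(name)) for name in names]
    # Document frequency: in how many names does each token occur?
    df = Counter(token for doc in docs for token in doc)
    n = len(docs)
    return [
        {token: tf * math.log(n / df[token]) for token, tf in doc.items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


NAMES = ["D. Weld", "Daniel S. Weld", "Alon Halevy"]
vectors = tfidf_vectors(NAMES)
same_person = cosine(vectors[0], vectors[1])
different_person = cosine(vectors[0], vectors[2])
print(same_person, different_person)
```

The two Weld variants share only the surname token, so their score is positive but modest; any cut-off for declaring a match would have to be tuned per domain.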
The Structure Mapping Problem
• Types of structures:
Database schemas, XML DTDs, ontologies, …
• Input:
Two (or more) structures, S1 and S2
(perhaps) Data instances for S1 and S2
Background knowledge
• Output:
A mapping between S1 and S2
• Should enable translating between data instances.
Semantic Mappings between Schemas
• Source schemas = XML DTDs
[Figure: two DTD trees rooted at house. One has address, contact-info (agent-name, agent-phone) and num-baths; the other has location, contact (name, phone), full-baths and half-baths. Lines between the trees mark 1-1 mappings and non 1-1 mappings.]
Why Matching is Difficult
• Structures represent same entity differently
different names => same entity:
• area & address => location
same names => different entities:
• area => location or square-feet
• Intended semantics is typically subjective!
IBM Almaden Lab = IBM?
• Schema, data and rules never fully capture
semantics!
not adequately documented, certainly not for machine
consumption.
• Often hard for humans (committees are formed!)
Desiderata
• Accuracy, efficiency, ease of use.
• Realistic expectations:
Unlikely to be fully automated. Need user in
the loop.
• Some notion of semantics for mappings.
• Extensibility:
Solution should exploit additional background
knowledge.
• “Memory”, knowledge reuse:
System should exploit previous manual or
automatically generated matchings.
Key idea behind LSD [Doan SIGMOD 2001].