DataSpaces:
A New Abstraction for Data
Management
Alon Halevy*
DASFAA, 2006
Singapore
*Joint work with Mike Franklin and David Maier
Outline
• Dataspaces:
– collections of heterogeneous (un)-structured
data.
– examples and characteristics
• Dataspace Support Platforms:
– “Pay-as-you-go” data management
• Getting there:
– Recent progress and research challenges
Shrapnels in Baghdad
Story courtesy of Phil Bernstein
Personal Information Management
[Semex: Dong et al.]
[Figure: graph of personal information objects linked by associations: AttachedTo, Recipient, Sender, ConfHomePage, ExperimentOf, CourseGradeIn, PublishedIn, Cites, EarlyVersion, ArticleAbout, PresentationFor, FrequentEmailer, CoAuthor, BudgetOf, OriginatedFrom, HomePage, AddressOf]
Google Base
What do they have in common?
• All dataspaces contain >20% porn.
• The rest is spam.
What do they have in common?
• Must manage all the data in the space
• Need best-effort services with no setup
time.
• Data is heterogeneous,
– possibly unstructured
• Do not have control over the data, just
access.
Isn’t this Data Integration?
[Figure: biological data integration example. Schema entities: Phenotype, Gene, Sequenceable Entity, Protein, Experiment, Nucleotide Sequence, Microarray Experiment, Structured Vocabulary. Data sources: OMIM, SwissProt, HUGO, GeneClinics, LocusLink, GO, Entrez, GEO]
No, it’s Data Co-existence
• Data integration systems require semantic
mappings.
Semantic Mappings
Inventory Database A
– BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B
– Books(Title, ISBN, Price, DiscountPrice, Edition)
– Authors(ISBN, FirstName, LastName)
– BookCategories(ISBN, Category)
– CDs(Album, ASIN, Price, DiscountPrice, Studio)
– CDCategories(ASIN, Category)
– Artists(ASIN, ArtistName, GroupName)
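A semantic mapping between the two inventory schemas can be pictured as a small transformation. The sketch below is only an illustration: the attribute correspondences (ISBN to ItemID, Price to SuggestedPrice, the constant ItemType "Book") and the sample rows are assumptions, not taken from the slides.

```python
def books_to_books_and_music(books, authors):
    """Express Database B's Books and Authors as BooksAndMusic rows of Database A.

    The correspondences used here are assumed for illustration only.
    """
    names_by_isbn = {}
    for a in authors:
        full_name = f'{a["FirstName"]} {a["LastName"]}'
        names_by_isbn.setdefault(a["ISBN"], []).append(full_name)
    for b in books:
        yield {
            "Title": b["Title"],
            "Author": ", ".join(names_by_isbn.get(b["ISBN"], [])),
            "ItemID": b["ISBN"],           # assumed correspondence
            "ItemType": "Book",            # assumed constant
            "SuggestedPrice": b["Price"],  # assumed correspondence
        }

# Made-up sample rows.
books = [{"Title": "Example Book", "ISBN": "isbn-1", "Price": 20.0,
          "DiscountPrice": 15.0, "Edition": 2}]
authors = [{"ISBN": "isbn-1", "FirstName": "Ada", "LastName": "Author"},
           {"ISBN": "isbn-1", "FirstName": "Bob", "LastName": "Writer"}]
rows = list(books_to_books_and_music(books, authors))
```

Writing and maintaining such mappings for every pair of sources is the up-front cost that data integration systems require.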
No, It’s Data Co-existence
• Data integration systems require semantic
mappings.
• Dataspaces are “pay-as-you-go”:
– Provide some services immediately
– Create more tight integrations as needed.
The Cost of Semantics
Schema first vs. schema last
[Figure: benefit plotted against investment (time, cost), with one curve for dataspaces and one for data integration solutions]
Why Now?
• Data management is moving towards
dataspace-like applications.
• Prediction:
– Data management is about people, not enterprises.
– In 5 years our community will figure it out.
• We’ve made relevant progress:
– Combining DB & IR
– Creation and management of semantic mappings.
– Uncertainty, lineage, inconsistency
Dataspaces Fundamentals:
Participants and Relationships
[Figure: graph of participants (RDB, SDB, XML, java, sensor, WSDL) connected by relationships (snapshot, 1hr updates, manually created, view, schema mapping, replica)]
DSSP Components
• Catalog
• Local Store & Index
• Search & query
• Administration
• Discovery & Enhancement
[Figure: the participant-and-relationship graph surrounded by the five components. Legible callout fragments: participants’ capabilities; relationships; quality; lineage, uncertainty, completeness; heterogeneous index; cache for performance & availability; seamless flow from search to query; reference reconciliation; additional associations; set up workflows]
Querying a Dataspace
• Best effort, based on:
– Approximate semantic mappings
– Other mechanisms
• Example -- searching for Beng Chin’s
phone number:
– Keyword search for “beng chin ooi”
– Examine attributes of tuples/XML elements
– Match attributes to ‘address’
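The three steps of the example can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the `synonyms` table, and the sample data are hypothetical, and a real DSSP would rank candidate answers rather than return an unordered list.

```python
def keyword_hits(records, query):
    """Step 1: keep records whose values contain every query keyword."""
    words = query.lower().split()
    hits = []
    for rec in records:
        text = " ".join(str(v) for v in rec.values()).lower()
        if all(w in text for w in words):
            hits.append(rec)
    return hits

def best_effort_lookup(records, query, target, synonyms):
    """Steps 2 and 3: examine attributes of the hits, keep those matching the target."""
    wanted = {target} | synonyms.get(target, set())
    answers = []
    for rec in keyword_hits(records, query):
        for attr, value in rec.items():
            if attr.lower() in wanted:
                answers.append((value, attr))
    return answers

# Hypothetical data and synonym table.
records = [
    {"name": "Beng Chin Ooi", "tel": "555-0100"},
    {"author": "Beng Chin Ooi", "title": "Some paper"},
    {"name": "Someone Else", "phone": "555-0199"},
]
synonyms = {"phone": {"tel", "telephone"}}
answers = best_effort_lookup(records, "beng chin ooi", "phone", synonyms)
```

No schema mapping was set up in advance; the attribute match is approximate and may miss or mislabel answers, which is what "best effort" implies.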
Querying a Dataspace
• Best effort
• Combine structured and unstructured data
Two Kinds of Data
[Dong, Liu, Halevy]
• Structured data (or XML)
• Unstructured data
Querying a Dataspace
• Best effort
• Combine structured and unstructured data
• Rank: answers, sources
[Figure: example queries “Volvo Palo alto” and “Acura integra Palo alto”]
Querying a Dataspace
• Best effort
• Combine structured and unstructured data
• Rank: answers, sources
• Iterative -- sequences of queries:
– Used cars palo alto
– Saab for sale
– Classified ads
Classified ads for Saabs near Palo Alto
Querying a Dataspace
• Best effort
• Combine structured and unstructured data
• Rank: answers, sources
• Iterative: sequences of queries
• Reflective: need to explain and expose
confidence in answers.
Dataspace Reflection
• Sources of uncertainty:
– Unreliable sources
– Data obtained by imprecise extractions
– Inconsistent data
– Approximate query answering (mappings)
• A DSSP must:
– Expose the lineage of an answer
• Web search engines already do this
– Reason about relationship between sources
LUI Introspection
• Currently, three distinct formalisms:
– Uncertainty
– Lineage
– Inconsistency
• Single formalism should do it all:
– Uncertainty and lineage (Trio @ Stanford)
– Lineage & inconsistency (Orchestra @ U.Penn)
ULDBs
[Benjelloun, Das Sarma, Halevy, Widom]
• Combine uncertainty and lineage
– Based on x-tuples: {(t1 | t2 | t3)}
• Queries can be answered with no
additional complexity
– You can do even better than uncertain DBs.
• Because of lineage, you can sometimes
obtain completeness.
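A rough sketch of how x-tuples and lineage fit together, assuming a simple dictionary representation: each x-tuple is a list of mutually exclusive alternatives, and a selection records which source alternative each result came from. The relation, the names, and the `maybe` flag convention are illustrative assumptions, not the Trio implementation.

```python
def select(relation, predicate):
    """Select over x-tuples, recording lineage for every surviving alternative.

    relation: dict mapping x-tuple id -> list of alternatives (tuples).
    Input x-tuples are assumed certain (exactly one alternative holds).
    """
    result, lineage = {}, {}
    for xid, alts in relation.items():
        kept = [(i, alt) for i, alt in enumerate(alts) if predicate(alt)]
        if not kept:
            continue
        new_id = f"sel({xid})"
        result[new_id] = {
            "alts": [alt for _, alt in kept],
            # The result may be absent if some alternative was filtered out.
            "maybe": len(kept) < len(alts),
        }
        for j, (i, _) in enumerate(kept):
            lineage[(new_id, j)] = {(xid, i)}
    return result, lineage

# Illustrative relation Saw(witness, car): x-tuple s1 has two alternatives.
saw = {
    "s1": [("Amy", "Mazda"), ("Amy", "Toyota")],
    "s2": [("Betty", "Honda")],
}
toyotas, lineage = select(saw, lambda alt: alt[1] == "Toyota")
```

Here the single result alternative points back to the second alternative of `s1`, so a later operator can tell which source choices an answer depends on.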
DSSP Components
[Figure: DSSP component diagram, repeated: Catalog, Local Store & Index, Search & query, Administration, and Discovery & Enhancement around the participant graph]
[Figure sequence: indexing a dataspace with an inverted list; a cell marked 1 means the keyword occurs in that instance]
• Personal information space: Alon Halevy, Luna Dong, and the paper “Semex: …”, connected by authoredPaper and author associations.
• Departmental database: table with columns StuID, LastName, FirstName, …; one row (1000001, Xin, Dong, …).
• Plain inverted list, keywords: Alon, Dong, Halevy, Luna, Semex, Xin
– Query: Dong
– Query: FirstName “Dong”
• Keywords tagged with attributes: Alon&&name&&, Dong&&FirstName&&, Dong&&name&&, Halevy&&name&&, Luna&&name&&, Semex&&title&&, Xin&&LastName&&
– Query: FirstName “Dong”
– Query: name “Dong”
• Keywords tagged with the attribute hierarchy: Dong&&name&&FirstName&&, Xin&&name&&LastName&&
– Query: name “Dong”
– Query: Paper author “Dong”
• Keywords tagged with associations: Alon&&author&&, Dong&&author&&, Semex&&authoredPaper&&
– Query: Paper author “Dong”
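The keyword encoding on these slides can be sketched as a small inverted list whose keys combine a keyword with its attribute path. This is a minimal sketch under assumptions: the `PARENT` hierarchy table, the instance ids, and the linear prefix scan are illustrative choices (a real index would use a sorted structure for prefix lookups), and association keys such as Dong&&author&& are left out.

```python
from collections import defaultdict

SEP = "&&"

# Assumed attribute hierarchy: FirstName and LastName are sub-attributes of name.
PARENT = {"FirstName": "name", "LastName": "name"}

def attribute_path(attr):
    """Return the attribute with its ancestors, e.g. ['name', 'FirstName']."""
    path = [attr]
    while path[0] in PARENT:
        path.insert(0, PARENT[path[0]])
    return path

def index_key(keyword, attr):
    """Build a key such as 'dong&&name&&FirstName&&'."""
    return keyword.lower() + SEP + SEP.join(attribute_path(attr)) + SEP

def build_index(instances):
    """Map every (keyword, attribute path) key to the instances containing it."""
    index = defaultdict(set)
    for inst_id, attrs in instances.items():
        for attr, value in attrs.items():
            for word in value.split():
                index[index_key(word, attr)].add(inst_id)
    return index

def lookup(index, keyword, attr):
    """Prefix match, so a query on 'name' also finds FirstName and LastName."""
    prefix = index_key(keyword, attr)
    hits = set()
    for key, ids in index.items():
        if key.startswith(prefix):
            hits |= ids
    return hits

# Instances from the slides; the ids are made up.
instances = {
    "person:alon": {"name": "Alon Halevy"},
    "person:luna": {"name": "Luna Dong"},
    "paper:semex": {"title": "Semex"},
    "student:1000001": {"FirstName": "Dong", "LastName": "Xin"},
}
index = build_index(instances)
by_name = lookup(index, "Dong", "name")
by_first_name = lookup(index, "Dong", "FirstName")
```

Because the hierarchy is folded into the key, the query name “Dong” reaches both the person and the student with one prefix lookup, while FirstName “Dong” reaches only the student.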
DSSP Components
[Figure: DSSP component diagram, repeated: Catalog, Local Store & Index, Search & query, Administration, and Discovery & Enhancement around the participant graph]
Enhancing a Dataspace
• Creating associations
– [Dong & Halevy, CIDR 2005]
• Reference reconciliation
– [Dong et al., SIGMOD 2005]
– Very active field.
[Figure: association graph with edge labels AttachedTo, Recipient, Sender, ConfHomePage, ExperimentOf, CourseGradeIn, PublishedIn, Cites, EarlyVersion, ArticleAbout, ComeFrom, PresentationFor, FrequentEmailer, CoAuthor, OriginatedFrom, BudgetOf, HomePage, AddressOf]
DSSP Components
[Figure: DSSP component diagram, repeated: Catalog, Local Store & Index, Search & query, Administration, and Discovery & Enhancement around the participant graph]
Reusing Human Attention
• Human attention is the most expensive resource.
• Reuse it whenever possible, e.g.:
– Manual schema mapping
– Annotations
– Queries written on data
– Temporary collections of items
– Operations on the data (cut & paste)
• Solicit semantic information selectively
– The ESP game: [von Ahn et al.]
Learning from Past Matches
[Doan et al., Transformic]
• Every manual map is a
learning example.
• Learn models for
elements in mediated
schema.
• Use multi-strategy
learning.
• Thousands of maps in
very little time.
Reuse for a very related task.
Corpus-based Matching
[Madhavan et al.]
Product(productID, name, price, salePrice)
– (0X7630AB12, The Concert in Central Park, $13.99, $11.99)
Music(ASIN, title, artists, recordLabel, discountPrice)
– (no tuples)
Obtaining More Evidence
Product(productID, name, price, salePrice), shown next to corpus schema CD and annotated with its elements prodID and albumName
– (0X7630AB12, The Concert in Central Park, $13.99, $11.99)
Corpus
MusicCD(ASIN, album, artistName, price, discountPrice)
– (4Y3026DF23, The Best of the Doors, The Doors, $16.99, $12.99)
CD(prodID, albumName, artists, recordCompany, price, salePrice)
– (9R4374FG56, Saturday Night Fever, The Bee Gees, Columbia, $14.99, $9.99)
Comparing with More Evidence
[Figure: Product(productID, name, price, salePrice), with tuple (0X7630AB12, The Concert in Central Park, $13.99, $11.99) and corpus evidence from CD (prodID, albumName), is compared with Music(ASIN, Title, artists, recordLabel, discount), with corpus evidence from MusicCD (album, artistName, recordCompany, price; values 4Y6DF23, The Best of the Doors, The Doors, Columbia, $12.99)]
Challenges
• Learn from other kinds of user activities
• Create other kinds of relationships
between participants
• Identify higher-level goals from user
actions
• Develop a formal framework for reusing
human attention.
Conclusion
“Dataspaces:
because that’s the size of the problem”
• Pay-as-you-go data integration
• Data management for the masses
• Key: reuse of human attention
• Need to be very data driven in our
research
Some References
• www.cs.washington.edu/homes/alon
• SIGMOD Record, December 2005:
– Original dataspace vision paper
• PODS 2006:
– Specific technical challenges for dataspace research
• Semex: CIDR 2005, SIGMOD 2005
• Teaching integration to undergraduates:
– SIGMOD Record, September, 2003.
1. Build initial models
– Models Ms and Mt for elements s of schema S and t of schema T, built from Name, Instances, Type, …
2. Find similar elements in corpus
– Corpus of schemas and mappings
3. Build augmented models
– Models M’s and M’t, again over Name, Instances, Type, …
4. Match using augmented models
5. Use additional statistics (IC’s) to refine match
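Steps 1 to 4 can be sketched with a toy similarity measure. Everything below is an assumption made for illustration: models are reduced to name tokens and instance values, Jaccard overlap stands in for the learned models, the threshold is arbitrary, and the corpus entry (including the name albumTitle) is hypothetical. Step 5 is not covered.

```python
import re

def name_tokens(name):
    """Split an attribute name such as 'albumName' into lowercase tokens."""
    return {t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+", name)}

def make_model(name, instances=()):
    """Step 1: an initial model built from the element's name and instances."""
    return {"names": name_tokens(name),
            "instances": {v.lower() for v in instances}}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

def similarity(m1, m2):
    return max(jaccard(m1["names"], m2["names"]),
               jaccard(m1["instances"], m2["instances"]))

def augment(model, corpus, threshold=0.3):
    """Steps 2 and 3: merge in evidence from similar corpus elements."""
    names, instances = set(model["names"]), set(model["instances"])
    for other in corpus:
        if similarity(model, other) >= threshold:
            names |= other["names"]
            instances |= other["instances"]
    return {"names": names, "instances": instances}

# Hypothetical corpus entry: elements that past mappings found equivalent.
corpus = [{
    "names": name_tokens("album") | name_tokens("albumName") | name_tokens("albumTitle"),
    "instances": {"the best of the doors", "saturday night fever"},
}]

product_name = make_model("name", ["The Concert in Central Park"])
music_title = make_model("title")  # Music has no tuples

before = similarity(product_name, music_title)
# Step 4: match using the augmented models.
after = similarity(augment(product_name, corpus), augment(music_title, corpus))
```

With these toy inputs the direct comparison finds nothing in common, while the augmented models share the corpus evidence; real matchers combine many more signals and would not jump to a perfect score.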