Indexing Dataspaces
Xin (Luna) Dong
Alon Halevy
University of Washington
@ SIGMOD 2007
Google Inc.
Many Data Management Applications Need to Manage Heterogeneous Data Sources
[Figure: five heterogeneous data sources, D1 to D5]
Traditional Data Integration Systems
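Mediated Schema: Publication(title, year, author)
Source schema (shown next to D1):
  Author(aid, name)
  Paper(pid, title, year)
  AuthoredBy(aid, pid)
Mapping from the source schema to the mediated schema:
  SELECT P.title AS title, P.year AS year, A.name AS author
  FROM Author AS A, Paper AS P, AuthoredBy AS B
  WHERE A.aid = B.aid AND P.pid = B.pid
[Figure: data sources D1 to D5 connected to the mediated schema]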
Querying on Traditional Data Integration Systems
[Figure: a query Q posed over the mediated schema is reformulated into queries Q1 to Q5 over the data sources D1 to D5]
In Many Applications it is Hard to Obtain Precise Semantic Mappings
[Figure: data sources D1 to D5, with the mappings between them marked "?"]
Scenario 1. Different Websites About Movies
Scenario 2. Personal Information Space
Intranet
Internet
Querying Dataspaces
Dataspaces
  Collections of heterogeneous data sources
  Don’t necessarily include semantic mappings
  Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web
Pay-as-you-go data management
  Provide some services from the outset
  Improve the mappings on an as-needed basis
How to effectively query and search a dataspace?
Example Dataspace
<publication>
<title>Semex: Personal information management
and integration</title>
<author>Xin Dong</author>
<author>Alon Halevy</author>
<conference>IIWeb Workshop</conference>
</publication>
<thesis-proposal>
<title>Semex: Personal …</title>
<student>
<name>Xin (Luna) Dong</name>
<entryYear>2001</entryYear>
</student>
</thesis-proposal>
stuID    lastName  firstName  entryYear
5001438  Xin       Dong       2001
…        …         …          …
Searching and Querying a Dataspace
Structured query?
  Requires detailed knowledge of schemas
  Requires precise attribute values
Keyword search?
  Forgiving, but…
  Does not allow specifications on structure
We consider queries that are
  keyword-based
  structure-aware
I. Predicate Query
<publication>
<title>Semex: Personal information management
and integration</title>
<author>Xin Dong</author>
<author>Alon Halevy</author>
<conference>IIWeb Workshop</conference>
</publication>
<thesis-proposal>
<title>Semex: Personal …</title>
<student>
<name>Xin (Luna) Dong</name>
<entryYear>2001</entryYear>
</student>
</thesis-proposal>
Conjunction of predicates
Predicate: (v, {K1, …, Kn})
  v: an attribute or association label
  {K1, …, Kn}: a keyword set
stuID    lastName  firstName  entryYear
5001438  Xin       Dong       2001
…        …         …          …
Example I:
(title, ‘Semex’)
(author, ‘Luna Dong’)
Example II:
(name, ‘Dong’)
II. Neighborhood Keyword Query
<publication>
<title>Semex: Personal information management
and integration</title>
<author>Xin Dong</author>
<author>Alon Halevy</author>
<conference>IIWeb Workshop</conference>
</publication>
<thesis-proposal>
<title>Semex: Personal …</title>
<student>
<name>Xin (Luna) Dong</name>
<entryYear>2001</entryYear>
</student>
</thesis-proposal>
Form: {K1, …, Kn}
Example: ‘Semex’
Relevant items
Associated items
stuID    lastName  firstName  entryYear
5001438  Xin       Dong       2001
…        …         …          …
Indexing of the Heterogeneous Data
Challenges
  Index data from heterogeneous data sources
  Capture both text values and structural information
Traditional Indexes
  Build a separate index for each attribute to support structured queries
  Build an inverted list to support keyword search
  XML indexes assume tree models and build multiple indexes ([Cooper et al., 01], [Kaushik et al., 05], [Wang et al., 03], etc.)
Our approach: Extend inverted lists to capture both text values and structure of the data
Contributions
Design an index that
  indexes data from heterogeneous data sources
  captures both structure and text of the data
  incorporates various types of heterogeneity, including synonyms and hierarchies of attributes and associations
Outline
Motivation
Overview of our approach
Our algorithm
  Indexing structure
  Indexing hierarchies
Experimental Results
Conclusions
View Data Sources as Triple Base
<publication>
<title>Semex: Toward …</title>
<authors>
<author><name>
Xin Dong</name></author>
<author><name>
Alon Halevy</name></author>
</authors>…
</publication>
[Figure: triple base. The object "Semex: …" is linked by "author" associations to the objects "Alon Halevy" and "Luna Dong". Legend: Object, Attribute, Association]
View Data Sources as Triple Base
[Figure: the same triple base ("Semex: …" with "author" associations to "Alon Halevy" and "Luna Dong"), extended with a departmental database]
Departmental Database
StuID    lastName  firstName  …
1000001  Xin       Dong       …
…        …         …          …
Goal: Index triples to efficiently answer queries
that combine text and structure
Indexing a Triple Base Using an Inverted List
[Figure: the triple base and the departmental database, shown next to an inverted list]
Inverted List (one row per keyword)
  Alon
  Dong
  Halevy
  Luna
  Semex
  Xin
Indexing a Triple Base Using an Inverted List
Query: Dong
[Figure: the triple base, the departmental database, and the inverted list. Each row (Alon, Dong, Halevy, Luna, Semex, Xin) carries a 1 in the column of every instance that contains the keyword]
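The plain inverted list on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the instance ids (p1, a1, a2, s1) are hypothetical, the title is shortened to "Semex", and each row stores a set of instances instead of the cells drawn in the figure.

```python
from collections import defaultdict

# Attribute triples (instance, attribute label, text value).
# Instance ids are hypothetical; the title is shortened for the example.
ATTRIBUTES = [
    ("p1", "title", "Semex"),
    ("a1", "name", "Luna Dong"),
    ("a2", "name", "Alon Halevy"),
    ("s1", "lastName", "Xin"),
    ("s1", "firstName", "Dong"),
]

def build_plain_index(attributes):
    """Map each keyword to the set of instances whose values contain it."""
    index = defaultdict(set)
    for instance, _label, value in attributes:
        for keyword in value.split():
            index[keyword].add(instance)
    return index

plain_index = build_plain_index(ATTRIBUTES)
# The keyword alone matches both the author and the student row;
# nothing in the index says which attribute the keyword came from.
print(sorted(plain_index["Dong"]))
```

The query "Dong" returns both instances, which is why the following slides add structure to the indexed keywords.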
Outline
Motivation
Overview of our approach
Our algorithm
  Indexing structure
  Indexing hierarchies
Experimental Results
Conclusions
Incorporate Attribute Labels in the Inverted List
Query: firstName “Dong”
[Figure: the triple base, the departmental database, and the plain inverted list with rows Alon, Dong, Halevy, Luna, Semex, Xin]
Incorporate Attribute Labels in the Inverted List
Query: firstName “Dong” → “Dong/firstName/”
[Figure: the triple base and the departmental database, shown next to the inverted list]
Inverted List
  Alon/name/
  Dong/name/
  Dong/firstName/
  Halevy/name/
  Luna/name/
  Semex/title/
  Xin/lastName/
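A minimal sketch of the idea on this slide, assuming hypothetical instance ids and a shortened title: each indexed keyword is concatenated with the label of the attribute it occurs in, so a predicate becomes a single key lookup. The function names are illustrative and do not come from the talk.

```python
from collections import defaultdict

# Attribute triples (instance, attribute label, text value); ids are hypothetical.
ATTRIBUTES = [
    ("p1", "title", "Semex"),
    ("a1", "name", "Luna Dong"),
    ("a2", "name", "Alon Halevy"),
    ("s1", "lastName", "Xin"),
    ("s1", "firstName", "Dong"),
]

def build_attribute_index(attributes):
    """Index each keyword under the key 'keyword/label/'."""
    index = defaultdict(set)
    for instance, label, value in attributes:
        for keyword in value.split():
            index[f"{keyword}/{label}/"].add(instance)
    return index

def answer_predicate(index, label, keyword):
    """Answer a one-keyword predicate (label, keyword) with one lookup."""
    return index.get(f"{keyword}/{label}/", set())

attribute_index = build_attribute_index(ATTRIBUTES)
print(answer_predicate(attribute_index, "firstName", "Dong"))
```

With the label in the key, (firstName, "Dong") matches only the student row and no longer the author.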
Incorporate Association Labels in the Inverted List
Query: author “Dong”
[Figure: the triple base and the departmental database, shown next to the inverted list]
Inverted List
  Alon/name/
  Dong/name/
  Dong/name/firstName/
  Halevy/name/
  Luna/name/
  Semex/title/
  Xin/name/lastName/
Incorporate Association Labels in the Inverted List
Query: author “Dong” → “Dong/author/”
[Figure: the triple base with its "author" associations and the departmental database, shown next to the inverted list]
Inverted List
  Alon/author/
  Alon/name/
  Dong/author/
  Dong/name/
  Dong/name/firstName/
  Halevy/name/
  Luna/author/
  Luna/name/
  Semex/title/
  Xin/name/lastName/
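One way to read this slide, sketched under stated assumptions: for an association such as (paper, author, person), every keyword of the associated person is also indexed under "keyword/author/" and points back to the paper. The instance ids are hypothetical, and the direction of the posting (from the keyword to the source of the association) is an interpretation of the example, not something the slide spells out.

```python
from collections import defaultdict

# Hypothetical ids: p1 is the paper, a1 and a2 are its authors.
ATTRIBUTES = [
    ("p1", "title", "Semex"),
    ("a1", "name", "Luna Dong"),
    ("a2", "name", "Alon Halevy"),
]
# Association triples (source instance, association label, target instance).
ASSOCIATIONS = [("p1", "author", "a1"), ("p1", "author", "a2")]

def build_index(attributes, associations):
    index = defaultdict(set)
    keywords_of = defaultdict(set)
    for instance, label, value in attributes:
        for keyword in value.split():
            index[f"{keyword}/{label}/"].add(instance)
            keywords_of[instance].add(keyword)
    # Keywords of the associated instance are indexed under the
    # association label and lead back to the source instance.
    for source, label, target in associations:
        for keyword in keywords_of[target]:
            index[f"{keyword}/{label}/"].add(source)
    return index

association_index = build_index(ATTRIBUTES, ASSOCIATIONS)
print(association_index["Dong/author/"])
```

Under this reading, the predicate (author, "Dong") returns the paper, while (name, "Dong") still returns the person.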
Outline
Motivation
Overview of our approach
Our algorithm
  Indexing structure
  Indexing hierarchies
Experimental Results
Conclusions
Hierarchies of Attributes and Associations
<publication>
<title>Semex: Toward on-the-fly personal
information integration</title>
<author>Xin Dong</author>
<author>Alon Halevy</author>
<conference>IIWeb Workshop</conference>
</publication>
<thesis-proposal>
<title>Semex: Personal …</title>
<student>
<name>Xin (Luna) Dong</name>
<entryYear>2001</entryYear>
</student>
</thesis-proposal>
Example II:
(name, ‘Dong’)
Attribute Hierarchy: name has the sub-attributes firstName and lastName
stuID    lastName  firstName  entryYear
5001438  Xin       Dong       2001
…        …         …          …
Incorporate Attribute Hierarchy in the Inverted List
Query: name “Dong”
Attribute Hierarchy: name has the sub-attributes firstName and lastName
[Figure: the triple base and the departmental database, shown next to the inverted list]
Inverted List
  Alon/name/
  Dong/name/
  Dong/firstName/
  Halevy/name/
  Luna/name/
  Semex/title/
  Xin/lastName/
Naïve Approach: Expand Queries with Sub-Attributes
Query: name “Dong” → “Dong/name/ OR Dong/firstName/ OR …”
Attribute Hierarchy: name has the sub-attributes firstName and lastName
[Figure: the triple base and the departmental database, shown next to the inverted list]
Inverted List
  Alon/name/
  Dong/name/
  Dong/firstName/
  Halevy/name/
  Luna/name/
  Semex/title/
  Xin/lastName/
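The naïve approach can be sketched as query expansion over the attribute hierarchy: one lookup per attribute in the subtree, then a union of the results. The index contents mirror the rows on the slide; the instance ids and function names are hypothetical.

```python
# Attribute hierarchy from the slide: parent label -> direct sub-attributes.
HIERARCHY = {"name": ["firstName", "lastName"]}

# Inverted list rows from the slide, with hypothetical instance ids.
INDEX = {
    "Alon/name/": {"a2"},
    "Dong/name/": {"a1"},
    "Dong/firstName/": {"s1"},
    "Halevy/name/": {"a2"},
    "Luna/name/": {"a1"},
    "Semex/title/": {"p1"},
    "Xin/lastName/": {"s1"},
}

def expand(label):
    """Return the label followed by all of its descendant labels."""
    labels = [label]
    for child in HIERARCHY.get(label, []):
        labels.extend(expand(child))
    return labels

def answer_naive(index, label, keyword):
    """Issue one lookup per expanded label and union the results."""
    result = set()
    for sub_label in expand(label):
        result |= index.get(f"{keyword}/{sub_label}/", set())
    return result

print(expand("name"))
print(sorted(answer_naive(INDEX, "name", "Dong")))
```

The number of lookups grows with the size of the subtree below the queried attribute, which is the cost the next approaches try to avoid.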
Approach I: Duplicate Entries for Parent Attributes
Query: name “Dong” → “Dong/name/”
Attribute Hierarchy: name has the sub-attributes firstName and lastName
[Figure: the triple base and the departmental database, shown next to the inverted list]
Inverted List
  Alon/name/
  Dong/name/ (two entries)
  Dong/firstName/
  Halevy/name/
  Luna/name/
  Semex/title/
  Xin/lastName/
  Xin/name/
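Approach I can be sketched as follows: at indexing time, each keyword is entered once for its own attribute and once more for every ancestor attribute, so a query on a parent attribute stays a single lookup at the price of a larger index. Instance ids and function names are hypothetical.

```python
from collections import defaultdict

# Attribute hierarchy from the slide: child label -> parent label.
PARENT = {"firstName": "name", "lastName": "name"}

# Attribute triples (instance, attribute label, text value); ids are hypothetical.
ATTRIBUTES = [
    ("p1", "title", "Semex"),
    ("a1", "name", "Luna Dong"),
    ("a2", "name", "Alon Halevy"),
    ("s1", "lastName", "Xin"),
    ("s1", "firstName", "Dong"),
]

def self_and_ancestors(label):
    """Return the label and every ancestor above it in the hierarchy."""
    labels = [label]
    while label in PARENT:
        label = PARENT[label]
        labels.append(label)
    return labels

def build_duplicated_index(attributes):
    """Duplicate each keyword entry for every ancestor attribute."""
    index = defaultdict(set)
    for instance, label, value in attributes:
        for keyword in value.split():
            for indexed_label in self_and_ancestors(label):
                index[f"{keyword}/{indexed_label}/"].add(instance)
    return index

duplicated_index = build_duplicated_index(ATTRIBUTES)
print(sorted(duplicated_index["Dong/name/"]))
```

The row "Dong/name/" now holds two entries and a new row "Xin/name/" appears, matching the list on the slide.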
Approach II. Concatenate a Keyword with a Hierarchy Path
Query: name “Dong” → “Dong/name/*”
Attribute Hierarchy: name has the sub-attributes firstName and lastName
[Figure: the triple base and the departmental database, shown next to the inverted list]
Inverted List
  Alon/name/
  Dong/name/
  Dong/name/firstName/
  Halevy/name/
  Luna/name/
  Semex/title/
  Xin/name/lastName/
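Approach II can be sketched with a sorted list of keys and a prefix scan. Each keyword is concatenated with the full hierarchy path of its attribute, and a query on a parent attribute becomes a prefix search. The use of Python's bisect module stands in for a sorted term array; instance ids and function names are hypothetical.

```python
import bisect
from collections import defaultdict

# Attribute hierarchy from the slide: child label -> parent label.
PARENT = {"firstName": "name", "lastName": "name"}

ATTRIBUTES = [
    ("p1", "title", "Semex"),
    ("a1", "name", "Luna Dong"),
    ("a2", "name", "Alon Halevy"),
    ("s1", "lastName", "Xin"),
    ("s1", "firstName", "Dong"),
]

def hierarchy_path(label):
    """Return the path from the root attribute down to label, e.g. name/firstName."""
    path = [label]
    while path[0] in PARENT:
        path.insert(0, PARENT[path[0]])
    return "/".join(path)

def build_path_index(attributes):
    index = defaultdict(set)
    for instance, label, value in attributes:
        for keyword in value.split():
            index[f"{keyword}/{hierarchy_path(label)}/"].add(instance)
    return index

def prefix_search(index, prefix):
    """Union the rows whose key starts with prefix, scanning a sorted key list."""
    keys = sorted(index)  # a real index would keep its keys sorted already
    result = set()
    position = bisect.bisect_left(keys, prefix)
    while position < len(keys) and keys[position].startswith(prefix):
        result |= index[keys[position]]
        position += 1
    return result

path_index = build_path_index(ATTRIBUTES)
print(sorted(prefix_search(path_index, "Dong/name/")))
```

No entry is duplicated, but a prefix search may have to read several consecutive rows, which motivates the summary rows of Approach III.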
Approach III. Hierarchy Path + Summary Rows
Query: name “Dong” → “Dong/name/*”
Attribute Hierarchy: name has the sub-attributes firstName and lastName
[Figure: the triple base and the departmental database, shown next to the inverted list]
Inverted List
  Alon/name/
  Dong/name// (two entries)
  Dong/name/firstName/
  Halevy/name/
  Luna/name/
  Semex/title/
  Xin/name/lastName/
Summary Rows
Goal: Given a threshold t, answer any prefix search by reading no more than t rows.
Definition:
  The indexed keyword of a summary row has the form p//, e.g. “Dong/name//”
  Rows starting with p/ are shadowed by the summary row p//, e.g. “Dong/name/lastName/” is shadowed by “Dong/name//”
Answering Prefix Search with Summary Rows
Once a summary row has been read, ignore the rows shadowed by it
Example (t=1)
Query: name “Dong” → “Dong/name/*”
Inverted List
  Alon/name/
  Dong/name// (two entries)
  Dong/name/firstName/
  Halevy/name/
  Luna/name/
  Semex/title/
  Xin/name/lastName/
Answering Prefix Search with Summary Rows
Once a summary row has been read, ignore the rows shadowed by it
Example (t=1)
Query: name “Xin” → “Xin/name/*”
Inverted List
  Alon/name/
  Dong/name// (two entries)
  Dong/name/firstName/
  Halevy/name/
  Luna/name/
  Semex/title/
  Xin/name/lastName/
Adding Summary Rows
Step 1. Create a summary row for a prefix p if
  searching prefix p needs to read more than t rows, and
  there is no p’ with p as a prefix such that searching prefix p’ needs to read more than t rows
Step 2. Remove row p if summary row p/ exists
Example (t=1)
Initial inverted list
  Alon/name/
  Dong/name/
  Dong/name/firstName/
  Halevy/name/
  Luna/name/
  Semex/title/
  Xin/name/lastName/
After Step 1: the summary row Dong/name// (two entries) is added next to Dong/name/
After Step 2: the row Dong/name/ is removed
  Alon/name/
  Dong/name// (two entries)
  Dong/name/firstName/
  Halevy/name/
  Luna/name/
  Semex/title/
  Xin/name/lastName/
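The two steps and the shadowing rule can be sketched as below. This is an illustrative reading of the slides, not the authors' code: prefixes are processed from the deepest to the shallowest (an assumption, chosen so that deeper summary rows exist before shallower prefixes are counted), instance ids are hypothetical, and attribute labels are assumed to start with a character that sorts after "/".

```python
def scan(index, prefix):
    """Prefix search that skips rows shadowed by a summary row.

    Returns (rows_read, instances)."""
    rows_read, found, shadow = 0, set(), None
    for key in sorted(index):
        if not key.startswith(prefix):
            continue
        if shadow is not None and key.startswith(shadow):
            continue  # shadowed by a summary row that was already read
        rows_read += 1
        found |= index[key]
        if key.endswith("//"):
            shadow = key[:-1]  # summary row p// shadows rows starting with p/
    return rows_read, found

def add_summary_rows(rows, t):
    """Return a copy of rows with summary rows added for threshold t."""
    index = {key: set(value) for key, value in rows.items()}
    prefixes = set()
    for key in rows:
        parts = key.rstrip("/").split("/")
        for depth in range(1, len(parts) + 1):
            prefixes.add("/".join(parts[:depth]) + "/")
    # Deepest prefixes first (assumption, see the lead-in).
    for p in sorted(prefixes, key=lambda s: s.count("/"), reverse=True):
        rows_read, found = scan(index, p)
        if rows_read > t:
            index.pop(p, None)      # Step 2: drop the row replaced by the summary
            index[p + "/"] = found  # Step 1: the summary row holds the union
    return index

ROWS = {
    "Alon/name/": {"a2"},
    "Dong/name/": {"a1"},
    "Dong/name/firstName/": {"s1"},
    "Halevy/name/": {"a2"},
    "Luna/name/": {"a1"},
    "Semex/title/": {"p1"},
    "Xin/name/lastName/": {"s1"},
}

summarized = add_summary_rows(ROWS, t=1)
print(sorted(summarized))
print(scan(summarized, "Dong/name/"))
```

On the example with t=1, the sketch produces the same rows as the final slide, and both prefix searches "Dong/name/" and "Xin/name/" read a single row.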
Answering Neighborhood Keyword Queries
Query: Semex → “Semex/*”
[Figure: the triple base, where each "author" association also has an inverse association "~author", and the departmental database, shown next to the inverted list]
Inverted List
  Alon/author/
  Alon/name/
  Dong/author/
  Dong/name/
  Dong/name/firstName/
  Halevy/name/
  Luna/name/
  Semex/~author/
  Semex/title/
  Xin/name/lastName/
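A neighborhood keyword query can be sketched as a prefix search on "keyword/": rows such as "Semex/title/" give the relevant instances and rows such as "Semex/~author/" give the associated ones. The sketch assumes hypothetical instance ids and assumes that each association is also indexed in its inverse direction under a "~" label, as the row "Semex/~author/" on the slide suggests.

```python
from collections import defaultdict

# Hypothetical ids: p1 is the paper, a1 and a2 are its authors.
ATTRIBUTES = [
    ("p1", "title", "Semex"),
    ("a1", "name", "Luna Dong"),
    ("a2", "name", "Alon Halevy"),
]
ASSOCIATIONS = [("p1", "author", "a1"), ("p1", "author", "a2")]

def build_index(attributes, associations):
    index = defaultdict(set)
    keywords_of = defaultdict(set)
    for instance, label, value in attributes:
        for keyword in value.split():
            index[f"{keyword}/{label}/"].add(instance)
            keywords_of[instance].add(keyword)
    for source, label, target in associations:
        # Forward direction: keywords of the target lead to the source.
        for keyword in keywords_of[target]:
            index[f"{keyword}/{label}/"].add(source)
        # Inverse direction: keywords of the source lead to the target.
        for keyword in keywords_of[source]:
            index[f"{keyword}/~{label}/"].add(target)
    return index

def neighborhood_query(index, keyword):
    """Union all rows whose key starts with 'keyword/'."""
    prefix = keyword + "/"
    result = set()
    for key, instances in index.items():
        if key.startswith(prefix):
            result |= instances
    return result

neighborhood_index = build_index(ATTRIBUTES, ASSOCIATIONS)
print(sorted(neighborhood_query(neighborhood_index, "Semex")))
```

The query returns the paper itself together with both of its authors.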
Outline
Motivation
Overview of our approach
Our algorithm
  Indexing structure
  Indexing hierarchies
Experimental Results
Conclusions
Implementation Details
Our index extends the Lucene Indexing Tool
  Lucene stores an inverted list as a sorted array
Implemented in Java
Run on a machine with four 3.2GHz CPUs (1024KB cache each) and 1GB memory
Experimental Setting
Data sets
  A 50MB personal data set
  Two 10GB XML data sets: Wikipedia, XMark Benchmark
Queries: with one predicate or keyword
  Predicate Query with leaf attributes
  Predicate Query with branch attributes
  Predicate Query with associations
  Neighborhood Keyword Query
Measure (in milliseconds)
  Index-lookup time
  Query-answering time
Our Indexing Method Significantly Improves Query Answering
                                   Plain Inverted List (10.6MB)   Extended Inverted List (15.2MB)
Query Type                         Index Lookup   Query Answer    Index Lookup   Query Answer
                                   (ms)           (ms)            (ms)           (ms)
Pred Query with leaf attributes    2              22              4              6
Pred Query with branch attributes  3              43              4              6
Pred Query with associations       3              88              6              17
Neighborhood Keyword Query         18             4174            48             97
XML Index [Kaushik et al, Sigmod’05]
Three indexes
  Inverted list: index each attribute value on its text
  Structured index: index each attribute value on the labels of the attribute and its ancestor attributes
  Relationship index: index each instance on its associated instances
Our Indexing Method Performs Better Than XML Indexes
                                   XML Index (28.1MB)             Extended Inverted List (15.2MB)
Query Type                         Index Lookup   Query Answer    Index Lookup   Query Answer
                                   (ms)           (ms)            (ms)           (ms)
Pred Query with leaf attributes    7              9               4              6
Pred Query with branch attributes  7              11              4              6
Pred Query with associations       301            415             6              17
Neighborhood Keyword Query         365            488             48             97
Our Indexing Method Scales Well
                                   Wikipedia         XMark w/o asso    XMark with asso
Index                              4.15hr (1.13GB)   6.64hr (3.04GB)   12.72hr (4.08GB)
Pred Query with leaf attributes    156               94                116
Pred Query with branch attributes  -                 67                93
Pred Query with associations       -                 -                 217
Neighborhood Keyword Query         1646              1838              13468
Conclusions
Contributions: An index for heterogeneous data
  Index heterogeneous data from multiple sources through a (virtual) central triple base
  Extend inverted lists to capture both texts and structure of data
Future Work
  Support value heterogeneity
  Incorporate approximate matching of schema terms and object instances