Indexing Dataspaces
Xin (Luna) Dong, University of Washington
Alon Halevy, Google Inc.
@ SIGMOD 2007
Many Data Management Applications Need to Manage Heterogeneous Data Sources
[Figure: five data sources, D1 to D5]
Traditional Data Integration Systems
Mediated Schema: Publication(title, year, author)
Semantic mapping (SQL over the source schema):
SELECT P.title AS title, P.year AS year, A.name AS author
FROM Author AS A, Paper AS P, AuthoredBy AS B
WHERE A.aid=B.aid AND P.pid=B.pid
Source schema (D1):
Author(aid, name)
Paper(pid, title, year)
AuthoredBy(aid, pid)
[Figure: data sources D1 to D5 beneath the mediated schema]
Querying on Traditional Data Integration Systems
[Figure: a query Q posed on the mediated schema is reformulated into queries Q1 to Q5 over data sources D1 to D5]
In Many Applications it is Hard to Obtain Precise Semantic Mappings
[Figure: data sources D1 to D5 with the mediated schema replaced by a question mark]
Scenario 1. Different Websites About Movies
Scenario 2. Personal Information Space
Intranet
Internet
Querying Dataspaces

Dataspaces
  - Collections of heterogeneous data sources
  - Don’t necessarily include semantic mappings
  - Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web

Pay-as-you-go data management
  - Provide some services from the outset
  - Improve the mappings on an as-needed basis

How to effectively query and search a dataspace?
Example Dataspace
<publication>
  <title>Semex: Personal information management and integration</title>
  <author>Xin Dong</author>
  <author>Alon Halevy</author>
  <conference>IIWeb Workshop</conference>
</publication>
<thesis-proposal>
  <title>Semex: Personal …</title>
  <student>
    <name>Xin (Luna) Dong</name>
    <entryYear>2001</entryYear>
  </student>
</thesis-proposal>

stuID   | lastName | firstName | entryYear
5001438 | Xin      | Dong      | 2001
…       | …        | …         | …
Searching and Querying a Dataspace

Structured query?
  - Require detailed knowledge on schemas
  - Require precise attribute values

Keyword search?
  - Forgiving, but…
  - Does not allow specifications on structure

We consider queries that are
  - keyword-based
  - structure-aware
I. Predicate Query
Conjunction of predicates
Predicate: (v, {K1, …, Kn})
  - v: an attribute or association label
  - {K1, …, Kn}: a keyword set
I. Predicate Query
Example I:
(title, ‘Semex’)
(author, ‘Luna Dong’)
I. Predicate Query
Example II:
(name, ‘Dong’)
II. Neighborhood Keyword Query
Form: {K1, …, Kn}
Example: ‘Semex’
  - Relevant items
  - Associated items
Indexing of the Heterogeneous Data

Challenges
  - Index data from heterogeneous data sources
  - Capture both text values and structural information

Traditional Indexes
  - Build a separate index for each attribute to support structured queries
  - Build an inverted list to support keyword search
  - XML indexes assume tree models and build multiple indexes ([Cooper et al., 01], [Kaushik et al., 05], [Wang et al., 03], etc.)

Our approach: Extend inverted lists to capture both text values and structure of the data
Contributions

Design an index that
  - indexes data from heterogeneous data sources
  - captures both structure and text of the data
  - incorporates various types of heterogeneity, including synonyms and hierarchies of attributes and associations
Outline
  - Motivation
  - Overview of our approach
  - Our algorithm
      Indexing structure
      Indexing hierarchies
  - Experimental Results
  - Conclusions
View Data Sources as Triple Base
<publication>
  <title>Semex: Toward …</title>
  <authors>
    <author><name>Xin Dong</name></author>
    <author><name>Alon Halevy</name></author>
  </authors>…
</publication>
[Figure: the publication drawn as a graph. Object nodes "Semex: …", "Luna Dong", and "Alon Halevy"; an "author" association links the publication to each person. Legend: Attribute, Object, Association]

Departmental Database
StuID   | lastName | firstName | …
1000001 | Xin      | Dong      | …
…       | …        | …         | …

Goal: Index triples to efficiently answer queries that combine text and structure
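A minimal sketch of how the triple-base view could be held in memory, assuming two kinds of triples (attribute and association) as in the figure legend. The class names, helper functions, and instance ids such as `paper1` are hypothetical and chosen only for illustration; the values are the ones visible on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeTriple:
    instance: str   # hypothetical instance id
    attribute: str  # attribute label, e.g. "name"
    value: str      # text value of the attribute


@dataclass(frozen=True)
class AssociationTriple:
    source: str       # hypothetical id of the instance the edge leaves
    association: str  # association label, e.g. "author"
    target: str       # hypothetical id of the instance the edge reaches


triples = [
    AttributeTriple("paper1", "title", "Semex: …"),  # title is elided on the slide
    AttributeTriple("person1", "name", "Luna Dong"),
    AttributeTriple("person2", "name", "Alon Halevy"),
    AssociationTriple("paper1", "author", "person1"),
    AssociationTriple("paper1", "author", "person2"),
    # Row of the departmental database, copied as it appears on the slide.
    AttributeTriple("student1", "StuID", "1000001"),
    AttributeTriple("student1", "lastName", "Xin"),
    AttributeTriple("student1", "firstName", "Dong"),
]


def attributes_of(instance):
    """Collect the attribute/value pairs of one instance."""
    return {t.attribute: t.value for t in triples
            if isinstance(t, AttributeTriple) and t.instance == instance}


def associated(instance, association):
    """Follow one association label from an instance to its targets."""
    return [t.target for t in triples
            if isinstance(t, AssociationTriple)
            and t.source == instance and t.association == association]
```

Both the XML publication and the relational row end up in the same list, which is what lets a single index cover sources with different data models.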
Indexing a Triple Base Using an Inverted List
[Figure: triple base graph and Departmental Database table]
Inverted List rows: Alon, Dong, Halevy, Luna, Semex, Xin
Indexing a Triple Base Using an Inverted List
Query: Dong
[Figure: triple base graph and Departmental Database table]
Inverted List rows: Alon, Dong, Halevy, Luna, Semex, Xin
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
Incorporate Attribute Labels in the Inverted List
Query: firstName “Dong”
[Figure: triple base graph and Departmental Database table]
Inverted List rows: Alon, Dong, Halevy, Luna, Semex, Xin
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
Incorporate Attribute Labels in the Inverted List
Query: firstName “Dong” → “Dong/firstName/”
[Figure: triple base graph and Departmental Database table]
Inverted List rows:
Alon/name/
Dong/name/
Dong/firstName/
Halevy/name/
Luna/name/
Semex/title/
Xin/lastName/
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
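The keyword-plus-attribute rows above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' Lucene-based implementation: the instance ids, the whitespace tokenizer, and the function names are hypothetical, and only the visible keyword of the truncated title is indexed.

```python
from collections import defaultdict

# (instance id, attribute label, value); the instance ids are hypothetical.
attribute_triples = [
    ("paper1", "title", "Semex"),      # only the keyword visible on the slide
    ("person1", "name", "Luna Dong"),
    ("person2", "name", "Alon Halevy"),
    ("student1", "lastName", "Xin"),   # departmental row, as on the slide
    ("student1", "firstName", "Dong"),
]


def build_attribute_index(triples):
    """Map each 'keyword/attribute/' row to per-instance occurrence counts."""
    index = defaultdict(lambda: defaultdict(int))
    for instance, attribute, value in triples:
        for keyword in value.split():  # naive tokenizer, an assumption
            index[f"{keyword}/{attribute}/"][instance] += 1
    return index


index = build_attribute_index(attribute_triples)


def lookup(attribute, keyword):
    """Answer a one-predicate query (attribute, keyword) with one row read."""
    return dict(index.get(f"{keyword}/{attribute}/", {}))
```

With this layout, the query firstName “Dong” touches only the row `Dong/firstName/`, so the person named "Luna Dong" is not returned.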
Incorporate Association Labels in the Inverted List
Query: author “Dong”
[Figure: triple base graph and Departmental Database table]
Inverted List rows:
Alon/name/
Dong/name/
Dong/name/firstName/
Halevy/name/
Luna/name/
Semex/title/
Xin/name/lastName/
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
Incorporate Association Labels in the Inverted List
Query: author “Dong” → “Dong/author/”
[Figure: triple base graph and Departmental Database table]
Inverted List rows:
Alon/author/
Alon/name/
Dong/author/
Dong/name/
Dong/name/firstName/
Halevy/name/
Luna/name/
Luna/author/
Semex/title/
Xin/name/LastName/
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
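One way to read the association rows above: a keyword found in the attributes of an associated instance is indexed under the association label, pointing back at the instance that owns the association. The sketch below assumes that reading; the instance ids and function name are hypothetical, and attribute hierarchy paths are left out for brevity.

```python
from collections import defaultdict

# Hypothetical instance ids; values as they appear on the slide.
attribute_triples = [
    ("paper1", "title", "Semex"),
    ("person1", "name", "Luna Dong"),
    ("person2", "name", "Alon Halevy"),
]
# (source, association label, target)
association_triples = [
    ("paper1", "author", "person1"),
    ("paper1", "author", "person2"),
]


def build_index(attribute_triples, association_triples):
    index = defaultdict(lambda: defaultdict(int))
    keywords_of = defaultdict(list)
    for instance, attribute, value in attribute_triples:
        for keyword in value.split():
            index[f"{keyword}/{attribute}/"][instance] += 1
            keywords_of[instance].append(keyword)
    # Keywords of the target are indexed for the source under the
    # association label, e.g. "Dong/author/" points at the paper.
    for source, association, target in association_triples:
        for keyword in keywords_of[target]:
            index[f"{keyword}/{association}/"][source] += 1
    return index


index = build_index(attribute_triples, association_triples)
```

The query author “Dong” then becomes a lookup of `index["Dong/author/"]`, which returns the paper rather than the person.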
Hierarchies of Attributes and Associations
<publication>
  <title>Semex: Toward on-the-fly personal information integration</title>
  <author>Xin Dong</author>
  <author>Alon Halevy</author>
  <conference>IIWeb Workshop</conference>
</publication>
<thesis-proposal>
  <title>Semex: Personal …</title>
  <student>
    <name>Xin (Luna) Dong</name>
    <entryYear>2001</entryYear>
  </student>
</thesis-proposal>

stuID   | lastName | firstName | entryYear
5001438 | Xin      | Dong      | 2001
…       | …        | …         | …

Example II:
(name, ‘Dong’)

Attribute Hierarchy:
name
  - firstName
  - lastName
Incorporate Attribute Hierarchy in the Inverted List
Query: name “Dong”
[Figure: triple base graph and Departmental Database table; attribute hierarchy: name with sub-attributes firstName and lastName]
Inverted List rows:
Alon/name/
Dong/name/
Dong/firstName/
Halevy/name/
Luna/name/
Semex/title/
Xin/lastName/
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
Naïve Approach: Expand Queries with Sub-Attributes
Query: name “Dong” → “Dong/name/ OR Dong/firstName/ OR …”
[Figure: triple base graph and Departmental Database table; attribute hierarchy: name with sub-attributes firstName and lastName]
Inverted List rows:
Alon/name/
Dong/name/
Dong/firstName/
Halevy/name/
Luna/name/
Semex/title/
Xin/lastName/
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
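The query expansion of the naïve approach can be sketched as below, assuming the hierarchy is given as a parent-to-children map. The instance ids and function names are hypothetical. Note that the number of row lookups grows with the number of sub-attributes, which is the cost the later approaches try to avoid.

```python
# Attribute hierarchy from the slide: name has sub-attributes
# firstName and lastName.
hierarchy = {"name": ["firstName", "lastName"]}

# Inverted list rows from the slide; instance ids are hypothetical.
index = {
    "Alon/name/": {"person2": 1},
    "Dong/name/": {"person1": 1},
    "Dong/firstName/": {"student1": 1},
    "Halevy/name/": {"person2": 1},
    "Luna/name/": {"person1": 1},
    "Semex/title/": {"paper1": 1},
    "Xin/lastName/": {"student1": 1},
}


def descendants(attribute):
    """The attribute itself followed by all of its sub-attributes."""
    result = [attribute]
    for child in hierarchy.get(attribute, []):
        result.extend(descendants(child))
    return result


def expand(attribute, keyword):
    """Rewrite one predicate into a disjunction of index keys."""
    return [f"{keyword}/{a}/" for a in descendants(attribute)]


def answer(attribute, keyword):
    """Look up every expanded key and merge the occurrence counts."""
    hits = {}
    for key in expand(attribute, keyword):
        for instance, count in index.get(key, {}).items():
            hits[instance] = hits.get(instance, 0) + count
    return hits
```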
Approach I: Duplicate Entries for Parent Attributes
Query: name “Dong” → “Dong/name/”
[Figure: triple base graph and Departmental Database table; attribute hierarchy: name with sub-attributes firstName and lastName]
Inverted List rows:
Alon/name/
Dong/name/
Dong/firstName/
Halevy/name/
Luna/name/
Semex/title/
Xin/lastName/
Xin/name/
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
Approach II. Concatenate a Keyword with a Hierarchy Path
Query: name “Dong” → “Dong/name/*”
[Figure: triple base graph and Departmental Database table; attribute hierarchy: name with sub-attributes firstName and lastName]
Inverted List rows:
Alon/name/
Dong/name/
Dong/name/firstName/
Halevy/name/
Luna/name/
Semex/title/
Xin/name/lastName/
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
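With hierarchy paths in the keys, a predicate on a parent attribute becomes a prefix search over the sorted rows. The sketch below uses binary search to find the first matching row and then scans forward; it is an illustration under the assumption of a sorted in-memory key list, with hypothetical instance ids and function name.

```python
import bisect

# Inverted list rows from the slide; instance ids are hypothetical.
rows = {
    "Alon/name/": {"person2": 1},
    "Dong/name/": {"person1": 1},
    "Dong/name/firstName/": {"student1": 1},
    "Halevy/name/": {"person2": 1},
    "Luna/name/": {"person1": 1},
    "Semex/title/": {"paper1": 1},
    "Xin/name/lastName/": {"student1": 1},
}
keys = sorted(rows)  # rows sharing a prefix are contiguous once sorted


def prefix_search(prefix):
    """Return merged hits and the number of rows read for 'prefix*'."""
    hits = {}
    start = bisect.bisect_left(keys, prefix)
    position = start
    while position < len(keys) and keys[position].startswith(prefix):
        for instance, count in rows[keys[position]].items():
            hits[instance] = hits.get(instance, 0) + count
        position += 1
    return hits, position - start
```

Here `prefix_search("Dong/name/")` reads two rows, one per attribute on the path; the number of rows read is what the summary rows of Approach III are meant to bound.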
Approach III. Hierarchy Path + Summary Rows
Query: name “Dong” → “Dong/name/*”
[Figure: triple base graph and Departmental Database table; attribute hierarchy: name with sub-attributes firstName and lastName]
Inverted List rows:
Alon/name/
Dong/name//
Dong/name/firstName/
Halevy/name/
Luna/name/
Semex/title/
Xin/name/lastName/
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
Summary Rows

Goal: Given a threshold t, answer any prefix search by reading no more than t rows.

Definition:
  - The indexed keyword: p//
    E.g. “Dong/name//”
  - Rows starting with p/ are shadowed by the summary row p//
    E.g. “Dong/name/lastName/” is shadowed by “Dong/name//”
Answering Prefix Search with Summary Rows

Once a summary row has been read, ignore the rows shadowed by it

Example (t=1)
Query: name “Dong” → “Dong/name/*”
Inverted List rows:
Alon/name/
Dong/name//
Dong/name/firstName/
Halevy/name/
Luna/name/
Semex/title/
Xin/name/lastName/
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
Answering Prefix Search with Summary Rows

Once a summary row has been read, ignore the rows shadowed by it

Example (t=1)
Query: name “Xin” → “Xin/name/*”
Inverted List rows:
Alon/name/
Dong/name//
Dong/name/firstName/
Halevy/name/
Luna/name/
Semex/title/
Xin/name/lastName/
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
Adding Summary Rows

Step 1. Create a summary row for a prefix p if
  - Searching prefix p needs to read more than t rows
  - There is no p’ with p as prefix such that searching prefix p’ needs to read more than t rows

Step 2. Remove row p if summary row p/ exists

Example (t=1)
Inverted List rows:
Alon/name/
Dong/name/
Dong/name/firstName/
Halevy/name/
Luna/name/
Semex/title/
Xin/name/lastName/
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
Adding Summary Rows
Example (t=1), after Step 1: summary row Dong/name// is added
Inverted List rows:
Alon/name/
Dong/name//
Dong/name/
Dong/name/firstName/
Halevy/name/
Luna/name/
Semex/title/
Xin/name/lastName/
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
Adding Summary Rows
Example (t=1), after Step 2: row Dong/name/ is removed
Inverted List rows:
Alon/name/
Dong/name//
Dong/name/firstName/
Halevy/name/
Luna/name/
Semex/title/
Xin/name/lastName/
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
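The two steps can be sketched as a single pass over the index, followed by a prefix search that skips shadowed rows. This is a simplified reading of the slide, not the authors' algorithm: it assumes prefixes end at a "/" boundary, it counts rows against the original index only, and it scans linearly where a real sorted array would seek. The instance ids and function names are hypothetical.

```python
def prefixes_of(key):
    """'Dong/name/firstName/' -> ['Dong/', 'Dong/name/', 'Dong/name/firstName/']"""
    parts = key.rstrip("/").split("/")
    return ["/".join(parts[:n]) + "/" for n in range(1, len(parts) + 1)]


def rows_under(index, prefix):
    return [key for key in index if key.startswith(prefix)]


def add_summary_rows(index, t):
    candidates = {p for key in index for p in prefixes_of(key)}
    heavy = {p for p in candidates if len(rows_under(index, p)) > t}
    result = dict(index)
    for p in heavy:
        # Step 1, second condition: skip p if a longer heavy prefix extends it.
        if any(q != p and q.startswith(p) for q in heavy):
            continue
        merged = {}
        for key in rows_under(index, p):
            for instance, count in index[key].items():
                merged[instance] = merged.get(instance, 0) + count
        result[p + "/"] = merged  # Step 1: summary row, e.g. "Dong/name//"
        result.pop(p, None)       # Step 2: the plain row p is now redundant
    return result


def prefix_search(index, prefix):
    """Return merged hits and rows read, ignoring rows shadowed by a summary."""
    matching = sorted(key for key in index if key.startswith(prefix))
    shadows = [key[:-1] for key in matching if key.endswith("//")]
    hits, rows_read = {}, 0
    for key in matching:
        if any(key != s + "/" and key.startswith(s) for s in shadows):
            continue  # shadowed by a summary row
        rows_read += 1
        for instance, count in index[key].items():
            hits[instance] = hits.get(instance, 0) + count
    return hits, rows_read


# Rows from the slide before summary rows are added; ids are hypothetical.
index = {
    "Alon/name/": {"person2": 1},
    "Dong/name/": {"person1": 1},
    "Dong/name/firstName/": {"student1": 1},
    "Halevy/name/": {"person2": 1},
    "Luna/name/": {"person1": 1},
    "Semex/title/": {"paper1": 1},
    "Xin/name/lastName/": {"student1": 1},
}
summarized = add_summary_rows(index, t=1)
```

On the example rows with t=1, the only summary row created is `Dong/name//`, and the plain row `Dong/name/` is dropped, which matches the last inverted list shown above.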
Answering Neighborhood Keyword Queries
Query: Semex → “Semex/*”
[Figure: triple base graph with "author" associations and their inverses "~author", and Departmental Database table]
Inverted List rows:
Alon/author/
Alon/name/
Dong/author/
Dong/name/
Dong/name/firstName/
Halevy/name/
Luna/name/
Semex/~author/
Semex/title/
Xin/name/LastName/
[Matrix of keyword rows by instance columns; each filled cell shown is 1]
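A neighborhood keyword query can be sketched as a prefix search on the keyword alone, provided the index also holds inverse-association rows such as `Semex/~author/`. The sketch assumes that an inverse row indexes the source's keywords for the target instance; the instance ids and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical instance ids; values as they appear on the slide.
attribute_triples = [
    ("paper1", "title", "Semex"),
    ("person1", "name", "Luna Dong"),
    ("person2", "name", "Alon Halevy"),
]
association_triples = [
    ("paper1", "author", "person1"),
    ("paper1", "author", "person2"),
]


def build_index(attribute_triples, association_triples):
    index = defaultdict(lambda: defaultdict(int))
    keywords_of = defaultdict(list)
    for instance, attribute, value in attribute_triples:
        for keyword in value.split():
            index[f"{keyword}/{attribute}/"][instance] += 1
            keywords_of[instance].append(keyword)
    for source, association, target in association_triples:
        for keyword in keywords_of[target]:
            index[f"{keyword}/{association}/"][source] += 1
        # Inverse association, written with a leading "~" as on the slide.
        for keyword in keywords_of[source]:
            index[f"{keyword}/~{association}/"][target] += 1
    return index


def neighborhood_query(index, keyword):
    """Merge every row whose key starts with 'keyword/'."""
    hits = defaultdict(int)
    for key in sorted(index):
        if key.startswith(f"{keyword}/"):
            for instance, count in index[key].items():
                hits[instance] += count
    return dict(hits)


index = build_index(attribute_triples, association_triples)
```

For the keyword Semex, the row `Semex/title/` contributes the relevant item (the paper) and `Semex/~author/` contributes the associated items (its two authors).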
Implementation Details

Our index extends the Lucene Indexing Tool
  - Lucene stores an inverted list as a sorted array

Implemented in Java

Run on a machine with four 3.2GHz, 1024KB-cache CPUs and 1GB memory
Experimental Setting

Data sets
  - A 50MB personal data set
  - Two 10GB XML data sets: Wikipedia, XMark Benchmark

Queries: with one predicate or keyword
  - Predicate Query with leaf attributes
  - Predicate Query with branch attributes
  - Predicate Query with associations
  - Neighborhood Keyword Query

Measure: in milliseconds
  - Index-lookup time
  - Query-answering time
Our Indexing Method Significantly Improves Query Answering

                                  | Plain Inverted List (10.6MB)          | Extended Inverted List (15.2MB)
Query Type                        | Index Lookup (ms) | Query Answer (ms) | Index Lookup (ms) | Query Answer (ms)
Pred Query with leaf attributes   | 2                 | 22                | 4                 | 6
Pred Query with branch attributes | 3                 | 43                | 4                 | 6
Pred Query with associations      | 3                 | 88                | 6                 | 17
Neighborhood Keyword Query        | 18                | 4174              | 48                | 97
XML Index [Kaushik et al, Sigmod’05]

Three indexes
  - Inverted list: index each attribute value on its text
  - Structured index: index each attribute value on the labels of the attribute and its ancestor attributes
  - Relationship index: index each instance on its associated instances
Our Indexing Method Performs Better Than XML Indexes

                                  | XML Index (28.1MB)                    | Extended Inverted List (15.2MB)
Query Type                        | Index Lookup (ms) | Query Answer (ms) | Index Lookup (ms) | Query Answer (ms)
Pred Query with leaf attributes   | 7                 | 9                 | 4                 | 6
Pred Query with branch attributes | 7                 | 11                | 4                 | 6
Pred Query with associations      | 301               | 415               | 6                 | 17
Neighborhood Keyword Query        | 365               | 488               | 48                | 97
Our Indexing Method Scales Well

                                  | Wikipedia       | XMark w/o asso  | XMark with asso
Index                             | 4.15hr (1.13GB) | 6.64hr (3.04GB) | 12.72hr (4.08GB)
Pred Query with leaf attributes   | 156             | 94              | 116
Pred Query with branch attributes | -               | 67              | 93
Pred Query with associations      | -               | -               | 217
Neighborhood Keyword Query        | 1646            | 1838            | 13468
Conclusions

Contributions: An index for heterogeneous data
  - Index heterogeneous data from multiple sources through a (virtual) central triple base
  - Extend inverted lists to capture both texts and structure of data

Future Work
  - Support value heterogeneity
  - Incorporate approximate matching of schema terms and object instances