Download Presentation - Xin Luna Dong

Indexing Dataspaces
Xin (Luna) Dong, University of Washington
Alon Halevy, Google Inc.
@ SIGMOD 2007

Many Data Management Applications Need to Manage Heterogeneous Data Sources
[Figure: data sources D1, D2, D3, D4, D5]

Traditional Data Integration Systems
Mediated schema: Publication(title, year, author)
Source schema (D1): Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid)
Mapping:
    SELECT P.title AS title, P.year AS year, A.name AS author
    FROM Author AS A, Paper AS P, AuthoredBy AS B
    WHERE A.aid=B.aid AND P.pid=B.pid
[Figure: the mediated schema connected to sources D1 through D5]

Querying on Traditional Data Integration Systems
[Figure: a query Q posed on the mediated schema, with queries Q1 through Q5 on sources D1 through D5]

In Many Applications it is Hard to Obtain Precise Semantic Mappings
[Figure: sources D1 through D5 with the mappings marked "?"]
Scenario 1. Different Websites About Movies
Scenario 2. Personal Information Space
[Figure labels: Intranet, Internet]

Querying Dataspaces
• Dataspaces
  - Collections of heterogeneous data sources
  - Don’t necessarily include semantic mappings
  - Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web
• Pay-as-you-go data management
  - Provide some services from the outset
  - Improve the mappings on an as-needed basis
• How to effectively query and search a dataspace?

Example Dataspace
<publication>
  <title>Semex: Personal information management and integration</title>
  <author>Xin Dong</author>
  <author>Alon Halevy</author>
  <conference>IIWeb Workshop</conference>
</publication>

<thesis-proposal>
  <title>Semex: Personal …</title>
  <student>
    <name>Xin (Luna) Dong</name>
    <entryYear>2001</entryYear>
  </student>
</thesis-proposal>

stuID    lastName  firstName  entryYear
5001438  Xin       Dong       2001
…        …         …          …

Searching and Querying a Dataspace
• Structured query?
  - Requires detailed knowledge of schemas
  - Requires precise attribute values
• Keyword search?
  - Forgiving, but…
  - Does not allow specifications on structure
• We consider queries that are
  - keyword-based
  - structure-aware

I. Predicate Query
[Shown against the example dataspace above]
• Conjunction of predicates
• Predicate: (v, {K1, …, Kn})
  - v: an attribute or association label
  - {K1, …, Kn}: a keyword set
• Example I: (title, ‘Semex’) (author, ‘Luna Dong’)
• Example II: (name, ‘Dong’)

II. Neighborhood Keyword Query
[Shown against the example dataspace above]
• Form: {K1, …, Kn}
• Example: ‘Semex’
• Returns
  - Relevant items
  - Associated items

Indexing of the Heterogeneous Data
• Challenges
  - Index data from heterogeneous data sources
  - Capture both text values and structural information
• Traditional indexes
  - Build a separate index for each attribute to support structured queries
  - Build an inverted list to support keyword search
  - XML indexes assume tree models and build multiple indexes ([Cooper et al., 01], [Kaushik et al., 05], [Wang et al., 03], etc.)
• Our approach: Extend inverted lists to capture both text values and structure of the data

Contributions
• Design an index that
  - indexes data from heterogeneous data sources
  - captures both structure and text of the data
  - incorporates various types of heterogeneity, including synonyms and hierarchies of attributes and associations

Outline
• Motivation
• Overview of our approach
• Our algorithm
  - Indexing structure
  - Indexing hierarchies
• Experimental Results
• Conclusions

View Data Sources as Triple Base
<publication>
  <title>Semex: Toward …</title>
  <authors>
    <author><name> Xin Dong</name></author>
    <author><name> Alon Halevy</name></author>
  </authors>…
</publication>

[Figure: the object "Semex: …" linked by "author" associations to the objects "Alon Halevy" and "Luna Dong"; legend: Attribute, Object, Association]

Departmental Database
StuID    lastName  firstName  …
1000001  Xin       Dong       …
…        …         …          …

Goal: Index triples to efficiently answer queries that combine text and structure
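
The triple-base view and the predicate query can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the instance identifiers (p1, a1, a2, s1), the class and function names, and the crude tokenizer are all hypothetical, and the predicate semantics (a label matches either an attribute of the instance or an association leading to an instance that contains the keyword) is my reading of the slides. The scan below is the baseline that an index is meant to avoid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeTriple:
    instance: str   # hypothetical instance id
    attribute: str  # attribute label, e.g. "name"
    value: str      # text value


@dataclass(frozen=True)
class AssociationTriple:
    source: str       # hypothetical instance id
    association: str  # association label, e.g. "author"
    target: str       # hypothetical instance id


# Triples read off the slide example; the ids are made up for illustration.
ATTRIBUTES = [
    AttributeTriple("p1", "title", "Semex: …"),
    AttributeTriple("a1", "name", "Luna Dong"),
    AttributeTriple("a2", "name", "Alon Halevy"),
    AttributeTriple("s1", "lastName", "Xin"),
    AttributeTriple("s1", "firstName", "Dong"),
]
ASSOCIATIONS = [
    AssociationTriple("p1", "author", "a1"),
    AssociationTriple("p1", "author", "a2"),
]


def keywords_of(value):
    """Crude tokenizer (an assumption): split on spaces, drop ':' and '…'."""
    words = (w.strip(":…").lower() for w in value.split())
    return {w for w in words if w}


def scan_predicate(label, keyword):
    """Answer one predicate (label, keyword) by scanning every triple."""
    kw = keyword.lower()
    # Instances that contain the keyword in any attribute value.
    holders = {t.instance for t in ATTRIBUTES if kw in keywords_of(t.value)}
    # Case 1: the label is an attribute of the instance itself.
    hits = {t.instance for t in ATTRIBUTES
            if t.attribute == label and kw in keywords_of(t.value)}
    # Case 2: the label is an association to an instance holding the keyword.
    hits |= {t.source for t in ASSOCIATIONS
             if t.association == label and t.target in holders}
    return hits
```

With these triples, `scan_predicate("firstName", "Dong")` finds the student record and `scan_predicate("author", "Dong")` finds the publication, but every query touches every triple, which is what the inverted-list extensions are designed to avoid.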

Indexing a Triple Base Using an Inverted List
[Figure: the triple base and the Departmental Database table, as above]
Query: Dong
Inverted list rows (each non-empty cell holds an occurrence count of 1): Alon, Dong, Halevy, Luna, Semex, Xin

Incorporate Attribute Labels in the Inverted List
Query: firstName “Dong” → “Dong/firstName/”
Inverted list rows: Alon/name/, Dong/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/

Incorporate Association Labels in the Inverted List
Query: author “Dong” → “Dong/author/”
Inverted list rows: Alon/author/, Alon/name/, Dong/author/, Dong/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Luna/author/, Semex/title/, Xin/name/LastName/

Hierarchies of Attributes and Associations
[Shown against the example dataspace]
• Example II: (name, ‘Dong’)
• Attribute hierarchy: name, with sub-attributes firstName and lastName

Incorporate Attribute Hierarchy in the Inverted List
Query: name “Dong”
Attribute hierarchy: name, with sub-attributes firstName and lastName
Inverted list rows: Alon/name/, Dong/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/

Naïve Approach: Expand Queries with Sub-Attributes
Query: name “Dong” → “Dong/name/ OR Dong/firstName/ OR …”
Inverted list rows: Alon/name/, Dong/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/

Approach I: Duplicate Entries for Parent Attributes
Query: name “Dong” → “Dong/name/”
Inverted list rows: Alon/name/, Dong/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/, Xin/name/

Approach II: Concatenate a Keyword with a Hierarchy Path
Query: name “Dong” → “Dong/name/*”
Inverted list rows: Alon/name/, Dong/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/

Approach III: Hierarchy Path + Summary Rows
Query: name “Dong” → “Dong/name/*”
Inverted list rows: Alon/name/, Dong/name//, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/

Summary Rows
• Goal: Given a threshold t, answer any prefix search by reading no more than t rows.
• Definition:
  - The indexed keyword: p//. E.g. “Dong/name//”
  - Rows starting with p/ are shadowed by the summary row p//. E.g. “Dong/name/lastName/” is shadowed by “Dong/name//”

Answering Prefix Search with Summary Rows
• Once a summary row has been read, ignore the rows shadowed by it
• Example (t=1): Query: name “Dong” → “Dong/name/*”
• Example (t=1): Query: name “Xin” → “Xin/name/*”
Inverted list rows: Alon/name/, Dong/name//, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/

Adding Summary Rows
• Step 1. Create a summary row for a prefix p if
  - Searching prefix p needs to read more than t rows
  - There is no p’ with p as prefix such that searching prefix p’ needs to read more than t rows
• Step 2. Remove row p if summary row p/ exists
• Example (t=1)
  - Initial rows: Alon/name/, Dong/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
  - After Step 1: Alon/name/, Dong/name//, Dong/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
  - After Step 2: Alon/name/, Dong/name//, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/

Answering Neighborhood Keyword Queries
Query: Semex → “Semex/*”
[Figure: the triple base with "author" and "~author" association labels]
Inverted list rows: Alon/author/, Alon/name/, Dong/author/, Dong/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/~author/, Semex/title/, Xin/name/LastName/

Implementation Details
• Our index extends the Lucene indexing tool
  - Lucene stores an inverted list as a sorted array
• Implemented in Java
• Run on a machine with four 3.2GHz and 1024KB-cache CPUs, and 1GB memory

Experimental Setting
• Data sets
  - A 50MB personal data set
  - Two 10GB XML data sets: Wikipedia, XMark Benchmark
• Queries: with one predicate or keyword
  - Predicate Query with leaf attributes
  - Predicate Query with branch attributes
  - Predicate Query with associations
  - Neighborhood Keyword Query
• Measure: in milliseconds
  - Index-lookup time
  - Query-answering time

Our Indexing Method Significantly Improves Query Answering
                                   Plain Inverted List (10.6MB)     Extended Inverted List (15.2MB)
Query Type                         Index Lookup   Query Answer      Index Lookup   Query Answer
                                   (ms)           (ms)              (ms)           (ms)
Pred Query with leaf attributes    2              22                4              6
Pred Query with branch attributes  3              43                4              6
Pred Query with associations       3              88                6              17
Neighborhood Keyword Query         18             4174              48             97

XML Index [Kaushik et al., SIGMOD’05]
• Three indexes
  - Inverted list: index each attribute value on its text
  - Structured index: index each attribute value on the labels of the attribute and its ancestor attributes
  - Relationship index: index each instance on its associated instances

Our Indexing Method Performs Better Than XML Indexes
                                   XML Index (28.1MB)               Extended Inverted List (15.2MB)
Query Type                         Index Lookup   Query Answer      Index Lookup   Query Answer
                                   (ms)           (ms)              (ms)           (ms)
Pred Query with leaf attributes    7              9                 4              6
Pred Query with branch attributes  7              11                4              6
Pred Query with associations       301            415               6              17
Neighborhood Keyword Query         365            488               48             97

Our Indexing Method Scales Well
                                   Wikipedia         XMark w/o asso    XMark with asso
Index                              4.15hr (1.13GB)   6.64hr (3.04GB)   12.72hr (4.08GB)
Pred Query with leaf attributes    156               94                116
Pred Query with branch attributes  -                 67                93
Pred Query with associations       -                 -                 217
Neighborhood Keyword Query         1646              1838              13468

Conclusions
• Contributions: An index for heterogeneous data
  - Index heterogeneous data from multiple sources through a (virtual) central triple base
  - Extend inverted lists to capture both texts and structure of data
• Future Work
  - Support value heterogeneity
  - Incorporate approximate matching of schema terms and object instances
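
The hierarchy-path rows and summary rows from the deck can be sketched in a few lines. This is a toy model under stated assumptions, not the Lucene-based implementation the slides report: all function names and the instance ids (a1, a2, p1, s1) are hypothetical, rows live in a plain dictionary that is sorted on every lookup, and Step 1 is applied bottom-up (longest prefixes first), which is one way to satisfy the "no longer prefix p’ exceeds t" condition. It also assumes labels begin with a character that sorts after "/", so a summary row precedes the rows it shadows.

```python
from bisect import bisect_left
from collections import Counter

# Assumed attribute hierarchy from the slides: child label -> parent label.
PARENT = {"firstName": "name", "lastName": "name"}


def row_key(keyword, label):
    """Concatenate a keyword with its hierarchy path, e.g. Dong/name/firstName/."""
    path = [label]
    while path[0] in PARENT:
        path.insert(0, PARENT[path[0]])
    return keyword + "/" + "/".join(path) + "/"


def build_rows(entries):
    """entries: iterable of (keyword, label, instance) -> {row key: Counter}."""
    rows = {}
    for keyword, label, instance in entries:
        rows.setdefault(row_key(keyword, label), Counter())[instance] += 1
    return rows


def prefix_search(rows, prefix):
    """Return (merged occurrence counts, number of rows read) for prefix*."""
    keys = sorted(rows)
    merged, read, shadow = Counter(), 0, None
    for key in keys[bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break
        if shadow is not None and key.startswith(shadow):
            continue  # shadowed by a summary row that was already read
        merged.update(rows[key])
        read += 1
        if key.endswith("//"):   # summary row p// shadows rows starting with p/
            shadow = key[:-1]
    return merged, read


def add_summary_rows(base, t):
    """Add summary rows so that every prefix search reads at most t rows."""
    rows = dict(base)
    prefixes = set()
    for key in base:
        parts = key.rstrip("/").split("/")
        for n in range(1, len(parts) + 1):
            prefixes.add("/".join(parts[:n]) + "/")
    # Longest prefixes first, so deeper prefixes are fixed before shallower ones.
    for p in sorted(prefixes, key=lambda s: (-s.count("/"), s)):
        if prefix_search(rows, p)[1] > t:
            summary = Counter()
            for key, counts in base.items():
                if key.startswith(p):
                    summary.update(counts)
            rows[p + "/"] = summary  # Step 1: create the summary row
            rows.pop(p, None)        # Step 2: drop the row it replaces
    return rows


ENTRIES = [
    ("Alon", "name", "a2"), ("Halevy", "name", "a2"),
    ("Luna", "name", "a1"), ("Dong", "name", "a1"),
    ("Semex", "title", "p1"),
    ("Xin", "lastName", "s1"), ("Dong", "firstName", "s1"),
]
BASE = build_rows(ENTRIES)
SUMMARIZED = add_summary_rows(BASE, t=1)
```

On the slide example with t=1, the only summary row created is `Dong/name//`, and the search for `Dong/name/` drops from two rows read to one. A production index would keep the keys in a sorted array, as the slides say Lucene does, rather than re-sorting a dictionary per query.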
