Download ppt

Index Structures for Querying the Deep Web Jian Qiu, Feng Shao, Jayavel Shanmugasundaram Cornell Universersity Misha Zatsman Google Deep Web Keyword queries Static web pages Surface web Deep Web Keyword queries Static web pages Surface web 400-500 times the size of surface web! www.ebay.com Deep web Ebay databases CNN databases Cars.com databases … … Amazon databases Deep Glue Query results Structured queries 400-500 times the size of surface web! Deep web Ebay database CNN databases Cars.com database … Amazon database Deep Glue System Find textbooks with price<$50 Our focus Indexer Index structures Query Query Engine Superset of relevant data sources Internet Half.com databases Database Concepts @ half.com… … Index structure for deep web: Challenges  Deal with structured data Underlying databases are structured Surface web typically unstructured Deal with large volumes Orders of magnitude larger than the size of surface web Our approach  Understand the structure/typing of the data   Support equality and range queries Heavily compress the index  Achieve a factor of 10 compression Tradeoff between compression factor and the number of false positives  Compression factor 10 with only ~10 false positives for 1000 data sources.  Outline Query model  Index Structures  Experimental Evaluation  Related work and conclusion  Assumptions  Data sources are classified into domains  Online car dealers, online auctions, online travel agents, …  Data sources in the same domain use same logical relational schema  Indexing attributes    Price, date, make, model, isbn,… Indexed by Deep Glue system Indexing data can be obtained via   Crawling the deep web [Raghavan 01 ] Previously agreed-upon protocols [Froogle] Query Model  Support equality and range queries  currently on a single indexing attribute Schema: Car(Id,Make,Model,Year,Price)  Queries:   Find all year 2003 cars, year = 2003  Find all cars that cost less than $1000, price < 1000 Outline Query model  Index Structures  Experimental Evaluation  Related work and conclusion  Overview  Uncompressed Index  Compressed Index, still support equality and range queries  Value Clustered Index (VCI)  DataSource Clustered Index (DCI)  Value DataSource Clustered Index (VDCI)  Histogram Based Index (HBI) Uncompressed Index (UI) For each distinct value v for an indexing attribute, stores the list of data sources data source UI: value d d d d d d d d 1 2 3 4 5 6 7 8 value data sources  1 X X X 2 X X X X 3 X 4 X X X 5 X X X 6 X X X d1: ebay.com , d2: amazon.com … B+tree 1 2 3 4 5 6 d6,d7,d8 d2,d3,d4,d6 d4 d1,d4,d5 d2,d3,d6 d1,d7,d8 Problems  A huge number of values and data sources in deep web !!  Indexing space  every indexing attribute requires Need to compress UI !  Use gzip?  Have to uncompress the index  index lookup too expensive!  Need new compression techniques Overview  Uncompressed Index  Compressed Index, still support equality and range queries  Value Clustered Index (VCI)  DataSource Clustered Index (DCI)  Value DataSource Clustered Index (VDCI)  Histogram Based Index (HBI) Value Clustered Index (VCI)  Intuition:  “closely related” values are stored in “closely related” data sources  ISBN numbers of antique books in the online book retailers specializing in antique books.  Cluster “closely related” values  Stores the list of data sources only for each cluster VCI Example data source value d d d d d d d d 1 2 3 4 5 6 7 8 1 XXX 2 XXX X 3 X 4X XX 5 XX X 6X XXX value Cluster id 1 2 3 4 5 6 B+tree False c1 c2 c3 c3 c2 c1 Cluster 2: { 2, 5} Cluster 3: { 3, 4} Cluster id data sources c1 d1,d6,d7,d8 c2 d2,d3,d4,d6 c3 d1,d4,d5 positives value 1  data source d1 Tradeoff between Cluster 1: { 1, 6} Union VCI structures: Mapping Mapping space and accuracy all values in one cluster each distinct value into a separate cluster VCI Implementation  Use existing scalable algorithm   Scales to large data sets: Birch Framework [Zhang96] Minimize the number of false positives  Specify the parameters for Birch Centroid, the mid-point of a cluster  Radius, a measure of quality for Centroid   cluster1 a cluster Distance between clusters Radius Distance cluster2 VCI formulae   For a cluster having the set of values V ds(v): the set of data sources for value v  ds(v) centroid(V) = vV  radius(V) =  centroid (V )  ds(v) Data sources associated with the cluster Sum of number of false positives vV V  distance(V1, V2)  Additional number of false positives when merging two clusters  V 1  centroid (V 2)  centroid (V 1)  V 2  centroid (V 1)  centroid (V 2) Overview  Uncompressed Index  Compressed Index:  Value Clustered Index (VCI)  DataSource Clustered Index (DCI)  Value DataSource Clustered Index (VDCI)  Histogram Based Index (HBI) DataSource Clustered Index (DCI)  Intuition: “closely related” data sources may have “closely related” sets of values  Amazon and b&n has similar sets of ISBN numbers In the data graph, VCI clusters rows and DCI clusters columns data source Cluster 1: {d2,d3,d6} d d d d d d d d value Cluster 2: { d4, d5} 1 2 3 4 5 6 7 8  1 X X X 2 X X X X 3 X X X X 4 X X X 5 X X X 6 X X X Cluster 3: { d1, d7, d8}  Table structures are similar to VCI.  See paper for other details Overview  Uncompressed Index  Compressed Index:  Value Clustered Index (VCI)  DataSource Clustered Index (DCI)  Value DataSource Clustered Index (VDCI)  Histogram Based Index (HBI) Value-DataSource Clustered Index (VDCI)    VCI, DCI: clusters in 1 dimension VDCI: clusters in 2 dimensions, generalizes VCI/DCI Cluster: a set of values and a set of data sources data source value d 1 d 2 d 3 d 4 d 5 d 6 d 7 d 8 1 X X X 2 X X X X X X 3 X X X X 4 X X X X 5 X X X X X 6 X X X Cluster 1:{ {2,3}, {d2,d3,d4}} Cluster 2:{ {4,5}, {d4,d5,d6} } Cluster 3:{ {1,2}, {d6,d7,d8} } Data source d4 is in two clusters Value 2 is in two clusters Table structures are similar to VCI. See paper for other details Overview  Uncompressed Index  Compressed Index:  Value Clustered Index (VCI)  DataSource Clustered Index (DCI)  Value DataSource Clustered Index (VDCI)  Histogram Based Index (HBI) Histogram Based Index (HBI)  VCI/VDCI don’t consider the ordering among values  Range queries implies this need  HBI groups adjacent values in the same cluster  Also need to ensure the accuracy  Use threshold to determine the boundary of a cluster  Threshold: average number of false positives in a cluster HBI Example data source Threshold: 2 d1 d2 d3 d4 d5 d6 d7 d8 value 1 2 3 4 5 6 X X X X X X X X X X X X X X X X X Cluster adjacent values Cluster 1: {1} Cluster 2: {2,3,4} Cluster 3: {5,6} Outline Query model  Index Structures  Experimental Evaluation  Related work and conclusion  Experimental setup Synthetic data 1000 data sources, 100,000 values, 4,000,000 (value,data source) pairs Other parameters are in the paper Metrics Index creation time Compression factor False positives  Setup   2.8GHz Pentium IV, 1GB memory, 80GB disk C++ Index creation time Index structure Time(min) UI 0.25 VCI 15 DCI 3 VDCI 180 HBI 2.5 Equality queries (1000 data sources) VCI DCI VDCI HBI false positives 50 40 30 20 10 0 0 5 10 compression factor 15 20 Range Queries (1000 data sources) VCI DCI VDCI HBI false positives 50 45 40 35 30 25 20 15 10 5 0 0 5 10 compression factor 15 20 Outline Query model  Index Structures  Experimental Evaluation  Related work and conclusion  Related work  Distributed database & information integration     Niagara system [Naughton01] GlOSS [Gravano99] … Database/Inverted list compression       Query Optimization in Compressed Databases [Chen 01] Compressing the Relations and Index [Goldstein 98] Improved Query Performance with Variant Indices [O’Neill 97] Implementation and Performance of Compressed Databases [Westmann 00] Size Reduction of Inverted Files [Weiss 90] … Conclusion     Space-efficient index structures for querying the deep web Support equality and range queries A factor of 10 compression with a little loss in precision Future work     Combine cluster-based and histogram-based Multiple attributes queries Joins Incremental index maintenance Questions? Experimental setup Other parameters: Number of groups  The data sources in the same group use same distribution to generate the values  Default 20 Group mode  How many groups a data source belongs to  Default 1 Value correlation  How the orders in the value space maps to the value ordering over which Gaussian distribution is used.  Default 0.2

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download ppt