Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Index Structures for Querying the Deep Web Jian Qiu, Feng Shao, Jayavel Shanmugasundaram Cornell Universersity Misha Zatsman Google Deep Web Keyword queries Static web pages Surface web Deep Web Keyword queries Static web pages Surface web 400-500 times the size of surface web! www.ebay.com Deep web Ebay databases CNN databases Cars.com databases … … Amazon databases Deep Glue Query results Structured queries 400-500 times the size of surface web! Deep web Ebay database CNN databases Cars.com database … Amazon database Deep Glue System Find textbooks with price<$50 Our focus Indexer Index structures Query Query Engine Superset of relevant data sources Internet Half.com databases Database Concepts @ half.com… … Index structure for deep web: Challenges Deal with structured data Underlying databases are structured Surface web typically unstructured Deal with large volumes Orders of magnitude larger than the size of surface web Our approach Understand the structure/typing of the data Support equality and range queries Heavily compress the index Achieve a factor of 10 compression Tradeoff between compression factor and the number of false positives Compression factor 10 with only ~10 false positives for 1000 data sources. Outline Query model Index Structures Experimental Evaluation Related work and conclusion Assumptions Data sources are classified into domains Online car dealers, online auctions, online travel agents, … Data sources in the same domain use same logical relational schema Indexing attributes Price, date, make, model, isbn,… Indexed by Deep Glue system Indexing data can be obtained via Crawling the deep web [Raghavan 01 ] Previously agreed-upon protocols [Froogle] Query Model Support equality and range queries currently on a single indexing attribute Schema: Car(Id,Make,Model,Year,Price) Queries: Find all year 2003 cars, year = 2003 Find all cars that cost less than $1000, price < 1000 Outline Query model Index Structures Experimental Evaluation Related work and conclusion Overview Uncompressed Index Compressed Index, still support equality and range queries Value Clustered Index (VCI) DataSource Clustered Index (DCI) Value DataSource Clustered Index (VDCI) Histogram Based Index (HBI) Uncompressed Index (UI) For each distinct value v for an indexing attribute, stores the list of data sources data source UI: value d d d d d d d d 1 2 3 4 5 6 7 8 value data sources 1 X X X 2 X X X X 3 X 4 X X X 5 X X X 6 X X X d1: ebay.com , d2: amazon.com … B+tree 1 2 3 4 5 6 d6,d7,d8 d2,d3,d4,d6 d4 d1,d4,d5 d2,d3,d6 d1,d7,d8 Problems A huge number of values and data sources in deep web !! Indexing space every indexing attribute requires Need to compress UI ! Use gzip? Have to uncompress the index index lookup too expensive! Need new compression techniques Overview Uncompressed Index Compressed Index, still support equality and range queries Value Clustered Index (VCI) DataSource Clustered Index (DCI) Value DataSource Clustered Index (VDCI) Histogram Based Index (HBI) Value Clustered Index (VCI) Intuition: “closely related” values are stored in “closely related” data sources ISBN numbers of antique books in the online book retailers specializing in antique books. Cluster “closely related” values Stores the list of data sources only for each cluster VCI Example data source value d d d d d d d d 1 2 3 4 5 6 7 8 1 XXX 2 XXX X 3 X 4X XX 5 XX X 6X XXX value Cluster id 1 2 3 4 5 6 B+tree False c1 c2 c3 c3 c2 c1 Cluster 2: { 2, 5} Cluster 3: { 3, 4} Cluster id data sources c1 d1,d6,d7,d8 c2 d2,d3,d4,d6 c3 d1,d4,d5 positives value 1 data source d1 Tradeoff between Cluster 1: { 1, 6} Union VCI structures: Mapping Mapping space and accuracy all values in one cluster each distinct value into a separate cluster VCI Implementation Use existing scalable algorithm Scales to large data sets: Birch Framework [Zhang96] Minimize the number of false positives Specify the parameters for Birch Centroid, the mid-point of a cluster Radius, a measure of quality for Centroid cluster1 a cluster Distance between clusters Radius Distance cluster2 VCI formulae For a cluster having the set of values V ds(v): the set of data sources for value v ds(v) centroid(V) = vV radius(V) = centroid (V ) ds(v) Data sources associated with the cluster Sum of number of false positives vV V distance(V1, V2) Additional number of false positives when merging two clusters V 1 centroid (V 2) centroid (V 1) V 2 centroid (V 1) centroid (V 2) Overview Uncompressed Index Compressed Index: Value Clustered Index (VCI) DataSource Clustered Index (DCI) Value DataSource Clustered Index (VDCI) Histogram Based Index (HBI) DataSource Clustered Index (DCI) Intuition: “closely related” data sources may have “closely related” sets of values Amazon and b&n has similar sets of ISBN numbers In the data graph, VCI clusters rows and DCI clusters columns data source Cluster 1: {d2,d3,d6} d d d d d d d d value Cluster 2: { d4, d5} 1 2 3 4 5 6 7 8 1 X X X 2 X X X X 3 X X X X 4 X X X 5 X X X 6 X X X Cluster 3: { d1, d7, d8} Table structures are similar to VCI. See paper for other details Overview Uncompressed Index Compressed Index: Value Clustered Index (VCI) DataSource Clustered Index (DCI) Value DataSource Clustered Index (VDCI) Histogram Based Index (HBI) Value-DataSource Clustered Index (VDCI) VCI, DCI: clusters in 1 dimension VDCI: clusters in 2 dimensions, generalizes VCI/DCI Cluster: a set of values and a set of data sources data source value d 1 d 2 d 3 d 4 d 5 d 6 d 7 d 8 1 X X X 2 X X X X X X 3 X X X X 4 X X X X 5 X X X X X 6 X X X Cluster 1:{ {2,3}, {d2,d3,d4}} Cluster 2:{ {4,5}, {d4,d5,d6} } Cluster 3:{ {1,2}, {d6,d7,d8} } Data source d4 is in two clusters Value 2 is in two clusters Table structures are similar to VCI. See paper for other details Overview Uncompressed Index Compressed Index: Value Clustered Index (VCI) DataSource Clustered Index (DCI) Value DataSource Clustered Index (VDCI) Histogram Based Index (HBI) Histogram Based Index (HBI) VCI/VDCI don’t consider the ordering among values Range queries implies this need HBI groups adjacent values in the same cluster Also need to ensure the accuracy Use threshold to determine the boundary of a cluster Threshold: average number of false positives in a cluster HBI Example data source Threshold: 2 d1 d2 d3 d4 d5 d6 d7 d8 value 1 2 3 4 5 6 X X X X X X X X X X X X X X X X X Cluster adjacent values Cluster 1: {1} Cluster 2: {2,3,4} Cluster 3: {5,6} Outline Query model Index Structures Experimental Evaluation Related work and conclusion Experimental setup Synthetic data 1000 data sources, 100,000 values, 4,000,000 (value,data source) pairs Other parameters are in the paper Metrics Index creation time Compression factor False positives Setup 2.8GHz Pentium IV, 1GB memory, 80GB disk C++ Index creation time Index structure Time(min) UI 0.25 VCI 15 DCI 3 VDCI 180 HBI 2.5 Equality queries (1000 data sources) VCI DCI VDCI HBI false positives 50 40 30 20 10 0 0 5 10 compression factor 15 20 Range Queries (1000 data sources) VCI DCI VDCI HBI false positives 50 45 40 35 30 25 20 15 10 5 0 0 5 10 compression factor 15 20 Outline Query model Index Structures Experimental Evaluation Related work and conclusion Related work Distributed database & information integration Niagara system [Naughton01] GlOSS [Gravano99] … Database/Inverted list compression Query Optimization in Compressed Databases [Chen 01] Compressing the Relations and Index [Goldstein 98] Improved Query Performance with Variant Indices [O’Neill 97] Implementation and Performance of Compressed Databases [Westmann 00] Size Reduction of Inverted Files [Weiss 90] … Conclusion Space-efficient index structures for querying the deep web Support equality and range queries A factor of 10 compression with a little loss in precision Future work Combine cluster-based and histogram-based Multiple attributes queries Joins Incremental index maintenance Questions? Experimental setup Other parameters: Number of groups The data sources in the same group use same distribution to generate the values Default 20 Group mode How many groups a data source belongs to Default 1 Value correlation How the orders in the value space maps to the value ordering over which Gaussian distribution is used. Default 0.2