Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB 2006 Outline • Introduction • Robust Index • Compute Robust Index – Exact Solution – Approximate Solution – Multiple Indices • Performance Study • Discussion and Conclusions 2 Introduction Sample Database R Ranked Query Tid A1 A2 t1 0.10 1.00 Select top 3 * from R t2 0.15 0.80 order by A1+A2 asc t3 0.25 0.55 t4 0.40 0.35 t5 0.80 0.25 t6 0.30 0.70 t7 0.35 0.50 t8 0.75 0.45 Linear Ranking Functions Query Results Tid A1 A2 A1+A2 t4 0.40 0.35 0.75 t3 0.25 0.55 0.80 t7 0.35 0.50 0.85 3 Efficient Processing of Ranked Queries • Naïve Solution: scan the whole database and evaluate all tuples • Using indices or materialized views – Distributed Indexing • Sort each attribute individually and merge attributes by a threshold algorithm (TA) [Fagin et al, PODS’96,’99,’01] – Spatial Indexing • Organize tuples into R-tree and determine a threshold to prune the search space [Goldstain et al, PODS’97] • Organize tuples into R-tree and retrieve data progressively [Papadias et al, SIGMOD’03] – Sequential Indexing • Organize tuples into convex hulls [Chang et al, SIGMOD’00] • Materialize ranked views according to the preference functions [Hristidis et al, SIGMOD’01] – And More… 4 Sequential Indexing • Sequential Index (ranked view) – Linearly sort tuples – No sophisticated data structures – Sequential data access (good for database I/O) • Representative work – Onion [Chang et al, SIGMOD’00] – PREFER [Hristidis et al, SIGMOD’01] • Our proposal: Robust Index 5 Review: Onion Technique Sample Database R A1 Tid A1 A2 t1 0.10 1.00 t2 0.15 0.80 t3 0.25 0.55 t4 0.40 0.35 t5 0.80 0.25 t6 0.30 0.70 t7 0.35 0.50 t8 0.75 0.45 Ranked Query Select top 3 * from R Index by Convex hull t1 Second layer t2 t6 t3 t7 Retrieve data layer by layer until the top-k results are found t8 t4 In worst case, retrieve top-k layers of tuples t5 First layer A1 A2 If a and b are non-negative (a, b are weighing parameters in linear ranking function) t1 t2 t6 t3 t7 order by aA1+bA2 asc Second layer t8 Index by Convex Shell t4 Expect less number of tuples in each layer t5 First layer A2 6 Review: PREFER System Sample Database R Tid A1 A2 t1 0.10 1.00 t2 0.15 0.80 t3 0.25 0.55 t4 0.40 0.35 t5 0.80 0.25 t6 0.30 0.70 t7 0.35 0.50 t8 0.75 0.45 Index by the ranking function: A1+A2 Index on preference ranking function A1 t1 Given query ranking function: A1+2A2 t2 t6 t3 t7 t4 t5 A2 Ranked Query Map query ranking function to index ranking function Will retrieve t1, t2, t3, t4, t6, t7 Select top 3 * from R order by w1A1+w2A2 asc t8 Query ranking function Map from query to preference 7 Comments on Sequential Indexing • PREFER – Works extremely well when query functions are close to the index function; Sensitive to query weights • Onion – Less sensitive to query weights; Can we do better? • Both methods – Require considerable online computation • Motivation for robust indexing – Not sensitive to query weights – Push most computation to index building phase Average #tuples retrieved for 10 random queries asking for top50 answers Query weights are randomly selected from 1,2,3,4 8 Outline • Introduction • Robust Index • Compute Robust Index – Exact Solution – Approximate Solution – Multiple Indices • Performance Study • Discussion and Conclusions 9 Robust Indexing: Motivating Example A1 t1 Index by Convex hull (shell) Second layer t2 t6 t3 t7 Organize data layer by layer In order to keep the convexity, each layer is built conservatively t8 t4 t5 Robust Index First layer A2 A1 Organize data layer by layer Layer 4 t1 Layer 3 t2 t6 t3 t7 Layer 3 Exploit dominating properties between data and push a tuple as deep as possible t8 t7: dominated by t2 and t4 (for any query, at least one of t2 and t4 ranks before t7) t4 t5 First layer A2 t7: dominated by t3 and t5 10 Robust Indexing: Formal Definition • How does it work? – Offline phase • Put each tuple in its deepest layer: the minimal (best) rank of all possible linear queries – Online phase • Retrieve tuples in top-k layers • Evaluate all of them, and report top-k • What are expected? – Correctness – Less tuples in each layer than convex hull • If a tuple does not belong to top-k for any query, it will not be retrieved 11 Robust Indexing: Appealing Properties • Database Friendly – No online algorithm required – Simply use the following SQL statement Select top k * from R where layer <=k order by Frank • Space efficient – Suppose the upper bound of the value k is given (e.g. k<=100) – Only need to index those tuples in top 100 layers – Robust indexing uses the minimal space comparing with other alternatives 12 Outline • Introduction • Robust Index • Compute Robust Index – Exact Solution – Approximate Solution – Multiple Indices • Performance Study • Discussion and Conclusions 13 Robust Indexing: Algorithm Highlights • Exact Solution – Compute the deepest layer for each tuple – Complexity: • n: number of tuples • d: number of dimensions • Approximate Solution – Compute the lower bound layer for each tuple – Complexity: • Multiple Indices – Transform R to different subspaces by linear transformation – Build an index in each subspace 14 Exact Solution Task: to compute the minimal rank over all possible linear queries for tuple t L1 A1 L2 L3 L4 t1 t6 t2 t t5 t3 t4 Given a query Q, with ranking function F=w1A1+w2A2, 0<=w1,w2<=1, and w1+w2=1 Q is one-to-one mapped to a line L e.g. A1+2A2 maps to L1 A2 Naïve Proposal: Enumerate all possible combinations of (w1,w2) Not feasible since the enumerating space is infinite Alternative Solution: Only enumerate (w1,w2) whose corresponding line passes t and another tuple, e.g., L1, … ,L4 Do not consider t3 and t6 because the corresponding weights does not satisfy 0<=w1,w2<=1 15 Exact Solution, cont. L1 Task: to compute the minimal rank over all possible linear queries for tuple t LV A1 L2 L3 L4 t1 t6 t2 LH t Given a query Q, with ranking function F=w1A1+w2A2, 0<=w1,w2<=1, and w1+w2=1 t5 Lv=>L1: minimal rank is 4 (after t1, t2, t3) t3 t4 L1=>L2: minimal rank is 3 (after t2, t3) A2 Complexity: to sort all lines takes O(n log n), to compute minimal rank for all t, In general, L2=>L3: minimal rank is 4 (after t2, t3, t4) L3=>L4: minimal rank is 3 (after t3, t4) L4=>LH: minimal rank is 4 (after t3, t4, t5) Minimal rank (the deepest layer) of t is 3 16 Approximate Solution A1 I4 I Task: to compute the lower bould of the minimal rank of tuple t IV I3 I2 I1 Four regions II: dominating region, data ranked before t IV: dominated region, data ranked after t I and III? t III4 II III3 III1 III2 III A2 Lower Bounding Theorem [Minimal ranking of t] >= card(II) + min( card(I3+I2+I1), card(I2+I1+III1), card(I1+III1+III2), card(III1+III2+III3)) Step 1: Partition regions I and III Step 2: Count cardinalities of region II and sub-regions I1,…,I4, III1,…,III4 Step 3: Match the cardinalities of the sub-regions and compute the lower bound 17 Approximate Solution, Cont. A1 I4 I Step 2: Count cardinalities of region II and sub-regions I1,…,I4, III1,…,III4 IV I3 L I2 I1 Count the cardinality of region II? 1. All tuples in region II dominate t 2. A reversed version of skyline problem 3. Standard divide and conquer solution (details in the paper) t III4 III3 II III1 III2 III A2 Tid A1 A2 Tid A1 A2 t 0.50 0.50 t -0.50 0.63 t1 0.15 0.80 t1 -0.15 0.35 t2 0.25 0.55 t2 -0.25 0.39 t3 0.40 0.35 t3 -0.40 0.49 A1=-A1 A2=A1+0.25A2 Count the cardinality of region I1? Suppose t: (a1,a2) Line L: A1 + 0.25A2=a1 + 0.25a2 Tuples in region I1 satisfy -A1 <= -a1 A1+0.25A2 <= a1 + 0.25 a2 18 Quality of the Approximate Solution • Complexity: – B: number of partitions in each subspace – n: number of tuples – d: number of dimensions • Approximate quality: – Assume data forms a uniform distribution – Each subspace is partitioned evenly – Partitioning according to the data distribution is an important and interesting future topic 19 Multiple Indices • Why? Ranking function: F=w1A1+w2A2 – To relax the constraint – To decompose and strengthen the constraints • How? (e.g., for w1<=w2) – Linearly transform R to R’, and build index on R’ (A1,A2) => (A1+A2, A2) – Rewrite query weights (w1,w2) => (w1,w2-w1) Relax Ranking function: F=w1A1+w2A2 Where 0<=w1,w2<=1 Data are projected to a smaller subspace (e.g., A1’ >=A2’ in the transformed subspace) Tuples can be pushed deeper since more domination can be found Strengthen Ranking function: F=w1A1+w2A2 Where 0<=w1<=w2<=1, or 0<=w2<=w1<=1 20 Multiple Indices, Cont. Number of tuples in top-k layers Top-k Convex Shell Robust Indexing 5 329 148 10 823 262 20 2064 427 50 6130 813 100 9965 1271 150 10000 1618 200 10000 1922 Using the same index space, robust indexing can build 8 indices (if the value of k is up bounded by 100) Synthetic Data: 10K tuples 21 Outline • Introduction • Robust Index • Compute Robust Index – Exact Solution – Approximate Solution – Multiple Indices • Performance Study • Discussion and Conclusions 22 Performance Study • Data – Synthetic data – Real dataset (abalone3D, cover3D) • Measure – Number of tuples retrieved – Execution time not reported, but the robust indexing is expected to be even better • Approaches for comparison – Onion (convex shell) – PREFER – Approximate Robust Indexing (AppRI), #partition=10 23 Index Construction Time Convex Shell, Convex Hull and AppRI are implemented by C++ Construction time on PREFER is not included since it is implemented in Java Using the system default parameter, PREFER takes more than 1200 seconds on the 50k data set 24 Query Performance Average Number of tuples retrieved on synthetic data Average Number of tuples retrieved on Cover3D data set 25 Multiple Indices (Views) Synthetic Data, 3 dimensions Build 3 robust indices by decompose the weighting parameters: w1=max(w1,w2,w3) w2=max(w1,w2,w3) w2=max(w1,w2,w3) 26 Discussion and Conclusions • Strength – Easy to integrate with current DBMS – Good query performance – Practical construction complexity • Limitation – Online index maintenance is expensive (some weaker maintaining strategies available) – Indexing high dimensional data remains a challenging problem 27