On Random Sampling over Joins - Summary
This paper is concerned with random sampling as a primitive operation in
relational databases. A major bottleneck in implementing sampling as a primitive
relational operation is the inefficiency of sampling the output of a query. In large
databases the cost of executing an SQL query can be very high. The goal is therefore
to produce an efficient sample of the result of a join tree (SAMPLE(R, f), which produces a sample
S that is an f-fraction of a relation R) without first evaluating the join tree completely.
The paper presents new sampling algorithms that are significantly more efficient
than those known earlier.
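For concreteness, the SAMPLE(R, f) primitive can be pictured as the minimal sketch below. It assumes WR semantics and an in-memory list of tuples; the function name and setup are illustrative, not the paper's operator definition.

```python
import random

def sample(R, f):
    """Return a with-replacement (WR) sample that is an f-fraction of relation R."""
    return random.choices(R, k=int(f * len(R)))

# Example: a 1%-fraction WR sample of a toy relation.
R = [(i, i * i) for i in range(10_000)]
S = sample(R, 0.01)   # 100 tuples drawn with replacement
```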
The Difficulty of Join Sampling – Example
The following example will help illustrate some of the subtleties of the problem.
Suppose that we have the relations
R1(A, B) = {(a1, b0), (a2, b1), (a2, b2), ..., (a2, bk)},
R2(A, C) = {(a2, c0), (a1, c1), (a1, c2), ..., (a1, ck)}.
R1: k+1 tuples - one has the A-value a1, k tuples have the A-value a2.
R2: k+1 tuples - one has the A-value a2, k tuples have the A-value a1.
Observe that the join over A, J = R1 ⋈ R2, is of size 2k and has k tuples with
A-value a1 and k tuples with A-value a2. Assume that we wish to choose a random
sample with WR (sampling with replacement) semantics. Considering a random sample
S ⊆ J, we expect that roughly half of the tuples in S have A-value a1 and half of the
tuples in S have A-value a2.
Suppose we pick random samples S1 ⊆ R1 and S2 ⊆ R2. It is quite unlikely that S1
will contain the tuple (a1, b0), or that S2 will contain the tuple (a2, c0). Thus it is
essentially impossible to generate a random sample of J this way: we would expect S1 ⋈ S2 to
be empty, and even if we allow S2 to be all of R2 while S1 remains a proper random subset of R1,
the result is likely to miss every tuple with A-value a1 and so is far from a uniform sample of J.
The preceding discussion suggests that we sample tuples from R1 based on frequency
statistics for R2. This requires that R2 be materialized and indexed appropriately.
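The effect is easy to reproduce with a small script. The sketch below (illustrative names; k and the fraction f are chosen arbitrarily, not taken from the paper) builds the two relations above, computes the full join, and then joins two independently drawn samples; the sampled join is almost always empty.

```python
import random

k = 1000
R1 = [("a1", "b0")] + [("a2", f"b{i}") for i in range(1, k + 1)]   # one a1, k a2's
R2 = [("a2", "c0")] + [("a1", f"c{i}") for i in range(1, k + 1)]   # one a2, k a1's

def join(r1, r2):
    """Naive nested-loop equi-join on the first attribute A."""
    return [(a, b, c) for (a, b) in r1 for (a_, c) in r2 if a == a_]

J = join(R1, R2)
print(len(J))   # 2k tuples: k with A-value a1, k with A-value a2

# Sampling each relation independently and joining the samples:
f = 0.01
S1 = random.sample(R1, int(f * len(R1)))
S2 = random.sample(R2, int(f * len(R2)))
print(len(join(S1, S2)))   # almost always 0: the two "rare" tuples are missed
```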
Previous Sampling Strategies
The following strategies were the only ones known earlier:
Strategy Naive-Sample: No information is available for either R1 or R2 to help with
sampling. Therefore, the only possible approach appears to be the naive
one: compute the full join J = R1 ⋈ R2 and sample from it.
Strategy Olken-Sample: Here frequency statistics and an index for R2 are required (the index is needed in step (b) to fetch the R2 tuples matching a given join value).
1. Let M be an upper bound on m2(v) (the number of distinct tuples in R2 that
contain the value v) for all v ∈ D.
2. Repeat
(a) Sample a tuple t1 ∈ R1 uniformly at random.
(b) Sample a random tuple t2 ∈ R2 from among all
tuples t ∈ R2 that have t.A = t1.A.
(c) Output t1 ⋈ t2 with probability m2(t2.A)/M, and
with the remaining probability reject the sample.
Until r tuples have been produced.
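The following is a minimal sketch of this rejection-sampling loop, assuming in-memory relations that join on their first attribute; a dictionary stands in for the index on R2, and the helper names are illustrative rather than taken from the paper.

```python
import random
from collections import defaultdict

def olken_sample(R1, R2, r, join_attr=0):
    """Sketch of Olken-style rejection sampling over R1 join R2."""
    # "Index" on R2: join value -> matching R2 tuples (stands in for a real index).
    idx2 = defaultdict(list)
    for t in R2:
        idx2[t[join_attr]].append(t)
    m2 = {v: len(ts) for v, ts in idx2.items()}     # frequency statistics for R2
    M = max(m2.values())                            # upper bound on m2(v)

    out = []
    while len(out) < r:
        t1 = random.choice(R1)                      # (a) uniform tuple from R1
        matches = idx2.get(t1[join_attr], [])
        if not matches:                             # t1 joins with nothing: reject
            continue
        t2 = random.choice(matches)                 # (b) uniform among matching R2 tuples
        if random.random() < m2[t2[join_attr]] / M: # (c) accept with prob m2(v)/M
            out.append(t1 + t2[1:])                 # keep joined tuple, else reject
    return out
```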
New Strategies for Join Sampling
Strategy Stream-Sample: The first goal was to improve Olken's strategy.
Strategy Stream-Sample does not require any information about R1
and avoids the inefficiency of rejecting samples.
1. Use a WR algorithm to produce a WR sample S1 of size r, where the weight w(t) for a
tuple t ∈ R1 is set to m2(t.A).
2. While the tuples of S1 are streaming by, do begin
(a) get the next tuple t1 and let v = t1.A;
(b) sample a random tuple t2 ∈ R2 from among all tuples
t ∈ R2 that have t.A = v;
(c) output t1 ⋈ t2.
end.
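A minimal sketch of this strategy follows, under the same in-memory assumptions as above. Python's random.choices stands in for the streaming black-box weighted-WR algorithm, and the helper names are illustrative.

```python
import random
from collections import defaultdict

def stream_sample(R1, R2, r, join_attr=0):
    """Sketch of the Stream-Sample idea: needs statistics and lookups for R2 only."""
    idx2 = defaultdict(list)
    for t in R2:
        idx2[t[join_attr]].append(t)
    m2 = {v: len(ts) for v, ts in idx2.items()}

    # Step 1: weighted WR sample of R1, weight of t is m2(t.A).
    weights = [m2.get(t[join_attr], 0) for t in R1]
    S1 = random.choices(R1, weights=weights, k=r)

    # Step 2: for each sampled t1, join it with one random matching R2 tuple.
    out = []
    for t1 in S1:
        t2 = random.choice(idx2[t1[join_attr]])
        out.append(t1 + t2[1:])
    return out
```

Because tuples of R1 with no join partner get weight zero, they are never sampled, so no rejection step is needed.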
Strategy Group-Sample: No information is required for R1, but statistics for R2 are needed.
1. Use a WR algorithm to produce a WR sample S1 ⊆ R1 of size r, where the weight
w(t) for a tuple t ∈ R1 is set to m2(t.A).
2. Let S1 consist of the tuples s1, s2, ..., sr.
Produce S2 = S1 ⋈ R2, whose tuples are grouped by the tuple of S1 (s1, s2, ..., sr)
that generated them.
3. Use r invocations of Black-Box U1 or U2 to sample r tuples, one from each group.
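A corresponding sketch is shown below; a uniform in-memory pick stands in for Black-Box U1/U2, and the helper names are illustrative.

```python
import random
from collections import defaultdict

def group_sample(R1, R2, r, join_attr=0):
    """Sketch of the Group-Sample idea."""
    idx2 = defaultdict(list)
    for t in R2:
        idx2[t[join_attr]].append(t)
    m2 = {v: len(ts) for v, ts in idx2.items()}

    # Step 1: WR sample S1 of R1 of size r, weight of t is m2(t.A).
    weights = [m2.get(t[join_attr], 0) for t in R1]
    S1 = random.choices(R1, weights=weights, k=r)

    # Step 2: S2 = S1 join R2, grouped by the S1 tuple that generated each result.
    groups = [[s + t2[1:] for t2 in idx2[s[join_attr]]] for s in S1]

    # Step 3: one uniform sample per group (stand-in for Black-Box U1/U2).
    return [random.choice(g) for g in groups]
```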
Strategy Frequency-Partition-Sample: Since skew in frequency is the
problem illustrated in the example, it makes sense to handle high-frequency
and low-frequency values differently, and to avoid computing the full join for the
high-frequency values, because the size of the join is large precisely for that set of
values. This strategy logically partitions the domain into two sets of values, of high
and low frequency in R2, and applies Strategy Group-Sample to the high-frequency
values and the naive strategy to the remaining values. No
information is required for R1, but statistics for R2 are needed (a histogram giving the
frequency statistics for all values with high frequency, say higher than a threshold t).
The algorithm:
1. Select a frequency threshold t for the domain D of the join attribute.
Determine D^hi = the set of values in D whose frequency in R2 exceeds t, and
D^lo = the remaining values in D.
Divide R1 into R1^hi = tuples with join-attribute value in D^hi and
R1^lo = tuples with join-attribute value in D^lo.
2. Use a WR algorithm, with the statistics from R2 as weights, to create a
sample S1 ⊆ R1^hi. Merge S1 with R1^lo (to obtain R1^*) and determine the size
n_hi = |J^hi| = |R1^hi ⋈ R2|.
3. Determine n_lo = |J^lo| = |R1^lo ⋈ R2| and compute J^* = R1^* ⋈ R2.
4. Partition J^* into J^hi and J^lo.
Use the Group-Sample strategy for J^hi to sample r tuples.
Use an algorithm to pick r tuples uniformly from J^lo.
5. Flip r coins with head probability proportional to n_hi and tail probability
proportional to n_lo. Set r_hi = the number of heads and r_lo = the number of tails.
6. Use a WoR (without replacement) algorithm to produce r_lo samples from J^lo.
Use a WoR (without replacement) algorithm to produce r_hi samples from J^hi.
7. Combine the r_lo and r_hi samples to get the overall r samples.
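A rough sketch of the overall flow is given below, under the same in-memory assumptions as the earlier sketches. The threshold is a parameter, and the final WoR step is simplified to with-replacement mixing, so this illustrates the partitioning idea rather than the paper's exact procedure.

```python
import random
from collections import defaultdict

def frequency_partition_sample(R1, R2, r, threshold, join_attr=0):
    """Sketch of the Frequency-Partition-Sample idea (simplified)."""
    idx2 = defaultdict(list)
    for t in R2:
        idx2[t[join_attr]].append(t)
    m2 = {v: len(ts) for v, ts in idx2.items()}   # frequency statistics for R2

    # Step 1: split the domain, and R1, by frequency in R2.
    D_hi = {v for v, freq in m2.items() if freq > threshold}
    R1_hi = [t for t in R1 if t[join_attr] in D_hi]
    R1_lo = [t for t in R1 if t[join_attr] not in D_hi]

    # High-frequency side: weighted WR sample of R1^hi plus one random partner
    # per sampled tuple (the Group-Sample idea), avoiding the full join J^hi.
    w_hi = [m2.get(t[join_attr], 0) for t in R1_hi]
    J_hi_samples = []
    if R1_hi and sum(w_hi) > 0:
        for t1 in random.choices(R1_hi, weights=w_hi, k=r):
            J_hi_samples.append(t1 + random.choice(idx2[t1[join_attr]])[1:])

    # Low-frequency side: the join is small, so compute it naively and sample it.
    J_lo = [t1 + t2[1:] for t1 in R1_lo for t2 in idx2.get(t1[join_attr], [])]
    J_lo_samples = [random.choice(J_lo) for _ in range(r)] if J_lo else []

    # Join sizes on each side, used to mix the two sample streams.
    n_hi = sum(m2.get(t[join_attr], 0) for t in R1_hi)
    n_lo = len(J_lo)
    if n_hi + n_lo == 0:
        return []

    # Steps 5-7: flip r coins biased by n_hi / (n_hi + n_lo) and combine.
    out = []
    for _ in range(r):
        if random.random() < n_hi / (n_hi + n_lo):
            out.append(random.choice(J_hi_samples))
        else:
            out.append(random.choice(J_lo_samples))
    return out
```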
Experimental Results
The running times of Naive-Sample, Olken-Sample, Stream-Sample and Frequency-
Partition-Sample were compared in the experiments, while varying the skew, the
sampling fraction and the index structure. The experiments show that the new strategies are
almost always better than those known earlier, and Stream-Sample is almost always the best.
Summary
The paper presents a problem, the difficulty of sampling over joins, suggests new strategies and
compares them experimentally to the earlier strategies to show that they are much better.
The experiments, however, only show that the new strategies are faster; they do not
evaluate the quality of the samples themselves.