On Random Sampling over Joins - Summary
This paper is concerned with random sampling as a primitive operation in
relational databases. A major bottleneck in implementing sampling as a primitive
relational operation is the inefficiency of sampling the output of a query. In large
databases the cost of executing an SQL query can be very high. The goal is therefore
to produce an efficient sample of the result of a join tree (SAMPLE(R, f), which produces a sample
S that is an f-fraction of a relation R) without first evaluating the join tree completely.
The paper presents new sampling algorithms that are significantly more efficient
than those known earlier.
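For concreteness, the SAMPLE(R, f) primitive can be pictured as the minimal sketch below. It assumes WR semantics and an in-memory list of tuples; the function name and setup are illustrative, not the paper's operator definition.

```python
import random

def sample(R, f):
    """Return a with-replacement (WR) sample that is an f-fraction of relation R."""
    return random.choices(R, k=int(f * len(R)))

# Example: a 1%-fraction WR sample of a toy relation.
R = [(i, i * i) for i in range(10_000)]
S = sample(R, 0.01)   # 100 tuples drawn with replacement
```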
The Difficulty of Join Sampling – Example
The following example will help illustrate some of the subtleties of the problem.
Suppose that we have the relations
R1(A, B) = {(a1, b0), (a2, b1), (a2, b2), ..., (a2, bk)},
R2(A, C) = {(a2, c0), (a1, c1), (a1, c2), ..., (a1, ck)}.
R1: k+1 tuples - one has the A-value a1, k tuples have the A-value a2.
R2: k+1 tuples - one has the A-value a2, k tuples have the A-value a1.
Observe that the join over A, J = R1 ⋈ R2, is of size 2k and has k tuples with
A-value a1 and k tuples with A-value a2. Assume that we wish to choose a random
sample with WR (sampling with replacement) semantics. Considering a random sample
S ⊆ J, we expect that roughly half of the tuples in S have A-value a1 and half of the
tuples in S have A-value a2.
Suppose we pick random samples S1 ⊆ R1 and S2 ⊆ R2. It is quite unlikely that S1
will contain the tuple (a1, b0), or that S2 will contain the tuple (a2, c0). Thus it is
essentially impossible to generate a random sample of J this way: we would expect S1 ⋈ S2 to
be empty, and even if we allow S2 to be all of R2 while S1 remains a proper random subset of R1,
the result is likely to miss every tuple with A-value a1 and so is far from a uniform sample of J.
The preceding discussion suggests that we sample tuples from R1 based on frequency
statistics for R2. This requires that R2 be materialized and indexed appropriately.
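The effect is easy to reproduce with a small script. The sketch below (illustrative names; k and the fraction f are chosen arbitrarily, not taken from the paper) builds the two relations above, computes the full join, and then joins two independently drawn samples; the sampled join is almost always empty.

```python
import random

k = 1000
R1 = [("a1", "b0")] + [("a2", f"b{i}") for i in range(1, k + 1)]   # one a1, k a2's
R2 = [("a2", "c0")] + [("a1", f"c{i}") for i in range(1, k + 1)]   # one a2, k a1's

def join(r1, r2):
    """Naive nested-loop equi-join on the first attribute A."""
    return [(a, b, c) for (a, b) in r1 for (a_, c) in r2 if a == a_]

J = join(R1, R2)
print(len(J))   # 2k tuples: k with A-value a1, k with A-value a2

# Sampling each relation independently and joining the samples:
f = 0.01
S1 = random.sample(R1, int(f * len(R1)))
S2 = random.sample(R2, int(f * len(R2)))
print(len(join(S1, S2)))   # almost always 0: the two "rare" tuples are missed
```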
Previous Sampling Strategies
The following strategies were the only ones known earlier:
Strategy Naive-Sample: No information is available for either R1 or R2 to help with
sampling. Therefore, the only possible approach appears to be the naive
one: compute the full join J = R1 ⋈ R2 and sample from it.
Strategy Olken-Sample: Here frequency statistics and an index for R2 are required (the index is needed in step (b) to fetch the R2 tuples matching a given join value).
1. Let M be an upper bound on m2(v) (the number of distinct tuples in R2 that
contain the value v) for all v ∈ D.
2. Repeat
(a) Sample a tuple t1 ∈ R1 uniformly at random.
(b) Sample a random tuple t2 ∈ R2 from among all
tuples t ∈ R2 that have t.A = t1.A.
(c) Output t1 ⋈ t2 with probability m2(t2.A)/M, and
with the remaining probability reject the sample.
Until r tuples have been produced.
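The following is a minimal sketch of this rejection-sampling loop, assuming in-memory relations that join on their first attribute; a dictionary stands in for the index on R2, and the helper names are illustrative rather than taken from the paper.

```python
import random
from collections import defaultdict

def olken_sample(R1, R2, r, join_attr=0):
    """Sketch of Olken-style rejection sampling over R1 join R2."""
    # "Index" on R2: join value -> matching R2 tuples (stands in for a real index).
    idx2 = defaultdict(list)
    for t in R2:
        idx2[t[join_attr]].append(t)
    m2 = {v: len(ts) for v, ts in idx2.items()}     # frequency statistics for R2
    M = max(m2.values())                            # upper bound on m2(v)

    out = []
    while len(out) < r:
        t1 = random.choice(R1)                      # (a) uniform tuple from R1
        matches = idx2.get(t1[join_attr], [])
        if not matches:                             # t1 joins with nothing: reject
            continue
        t2 = random.choice(matches)                 # (b) uniform among matching R2 tuples
        if random.random() < m2[t2[join_attr]] / M: # (c) accept with prob m2(v)/M
            out.append(t1 + t2[1:])                 # keep joined tuple, else reject
    return out
```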
New Strategies for Join Sampling
Strategy Stream-Sample: The first goal was to improve Olken's strategy.
Strategy Stream-Sample does not require any information about R1
and avoids the inefficiency of rejecting samples.
1. Use a WR algorithm to produce a WR sample S1 of size r, where the weight w(t) for a
tuple t ∈ R1 is set to m2(t.A).
2. While the tuples of S1 are streaming by, do begin
(a) get the next tuple t1 and let v = t1.A;
(b) sample a random tuple t2 ∈ R2 from among all tuples
t ∈ R2 that have t.A = v;
(c) output t1 ⋈ t2.
end.
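A minimal sketch of this strategy follows, under the same in-memory assumptions as above. Python's random.choices stands in for the streaming black-box weighted-WR algorithm, and the helper names are illustrative.

```python
import random
from collections import defaultdict

def stream_sample(R1, R2, r, join_attr=0):
    """Sketch of the Stream-Sample idea: needs statistics and lookups for R2 only."""
    idx2 = defaultdict(list)
    for t in R2:
        idx2[t[join_attr]].append(t)
    m2 = {v: len(ts) for v, ts in idx2.items()}

    # Step 1: weighted WR sample of R1, weight of t is m2(t.A).
    weights = [m2.get(t[join_attr], 0) for t in R1]
    S1 = random.choices(R1, weights=weights, k=r)

    # Step 2: for each sampled t1, join it with one random matching R2 tuple.
    out = []
    for t1 in S1:
        t2 = random.choice(idx2[t1[join_attr]])
        out.append(t1 + t2[1:])
    return out
```

Because tuples of R1 with no join partner get weight zero, they are never sampled, so no rejection step is needed.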
Strategy Group-Sample: No information is required for R1, but statistics for R2 are needed.
1. Use a WR algorithm to produce a WR sample S1 ⊆ R1 of size r, where the weight
w(t) for a tuple t ∈ R1 is set to m2(t.A).
2. Let S1 consist of the tuples s1, s2, ..., sr.
Produce S2 = S1 ⋈ R2, whose tuples are grouped by the tuple of S1 (s1, s2, ..., sr)
that generated them.
3. Use r invocations of Black-Box U1 or U2 to sample r tuples, one from each group.
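A corresponding sketch is shown below; a uniform in-memory pick stands in for Black-Box U1/U2, and the helper names are illustrative.

```python
import random
from collections import defaultdict

def group_sample(R1, R2, r, join_attr=0):
    """Sketch of the Group-Sample idea."""
    idx2 = defaultdict(list)
    for t in R2:
        idx2[t[join_attr]].append(t)
    m2 = {v: len(ts) for v, ts in idx2.items()}

    # Step 1: WR sample S1 of R1 of size r, weight of t is m2(t.A).
    weights = [m2.get(t[join_attr], 0) for t in R1]
    S1 = random.choices(R1, weights=weights, k=r)

    # Step 2: S2 = S1 join R2, grouped by the S1 tuple that generated each result.
    groups = [[s + t2[1:] for t2 in idx2[s[join_attr]]] for s in S1]

    # Step 3: one uniform sample per group (stand-in for Black-Box U1/U2).
    return [random.choice(g) for g in groups]
```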
Strategy Frequency-Partition-Sample: Since skew in frequency is the
problem illustrated in the example, it makes sense to handle high-frequency
and low-frequency values differently, and to avoid computing the full join for the
high-frequency values, because the size of the join is large precisely for that set of
values. This strategy logically partitions the domain into two sets of values, of high
and low frequency in R2, and applies Strategy Group-Sample to the high-frequency
values and the naive strategy to the remaining values. No
information is required for R1, but statistics for R2 are needed (a histogram giving the
frequency statistics for all values with high frequency, say higher than a threshold t).
The algorithm:
1. Select a frequency threshold t for the domain D of the join attribute.
Determine D^hi = the set of values in D whose frequency in R2 exceeds t, and
D^lo = the remaining values in D.
Divide R1 into R1^hi = tuples with join-attribute value in D^hi and
R1^lo = tuples with join-attribute value in D^lo.
2. Use a WR algorithm, with the statistics from R2 as weights, to create a
sample S1 ⊆ R1^hi. Merge S1 with R1^lo (to obtain R1^*) and determine the size
n_hi = |J^hi| = |R1^hi ⋈ R2|.
3. Determine n_lo = |J^lo| = |R1^lo ⋈ R2| and compute J^* = R1^* ⋈ R2.
4. Partition J^* into J^hi and J^lo.
Use the Group-Sample strategy for J^hi to sample r tuples.
Use an algorithm to pick r tuples uniformly from J^lo.
5. Flip r coins with head probability proportional to n_hi and tail probability
proportional to n_lo. Set r_hi = the number of heads and r_lo = the number of tails.
6. Use a WoR (without replacement) algorithm to produce r_lo samples from J^lo.
Use a WoR (without replacement) algorithm to produce r_hi samples from J^hi.
7. Combine the r_lo and r_hi samples to get the overall r samples.
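A rough sketch of the overall flow is given below, under the same in-memory assumptions as the earlier sketches. The threshold is a parameter, and the final WoR step is simplified to with-replacement mixing, so this illustrates the partitioning idea rather than the paper's exact procedure.

```python
import random
from collections import defaultdict

def frequency_partition_sample(R1, R2, r, threshold, join_attr=0):
    """Sketch of the Frequency-Partition-Sample idea (simplified)."""
    idx2 = defaultdict(list)
    for t in R2:
        idx2[t[join_attr]].append(t)
    m2 = {v: len(ts) for v, ts in idx2.items()}   # frequency statistics for R2

    # Step 1: split the domain, and R1, by frequency in R2.
    D_hi = {v for v, freq in m2.items() if freq > threshold}
    R1_hi = [t for t in R1 if t[join_attr] in D_hi]
    R1_lo = [t for t in R1 if t[join_attr] not in D_hi]

    # High-frequency side: weighted WR sample of R1^hi plus one random partner
    # per sampled tuple (the Group-Sample idea), avoiding the full join J^hi.
    w_hi = [m2.get(t[join_attr], 0) for t in R1_hi]
    J_hi_samples = []
    if R1_hi and sum(w_hi) > 0:
        for t1 in random.choices(R1_hi, weights=w_hi, k=r):
            J_hi_samples.append(t1 + random.choice(idx2[t1[join_attr]])[1:])

    # Low-frequency side: the join is small, so compute it naively and sample it.
    J_lo = [t1 + t2[1:] for t1 in R1_lo for t2 in idx2.get(t1[join_attr], [])]
    J_lo_samples = [random.choice(J_lo) for _ in range(r)] if J_lo else []

    # Join sizes on each side, used to mix the two sample streams.
    n_hi = sum(m2.get(t[join_attr], 0) for t in R1_hi)
    n_lo = len(J_lo)
    if n_hi + n_lo == 0:
        return []

    # Steps 5-7: flip r coins biased by n_hi / (n_hi + n_lo) and combine.
    out = []
    for _ in range(r):
        if random.random() < n_hi / (n_hi + n_lo):
            out.append(random.choice(J_hi_samples))
        else:
            out.append(random.choice(J_lo_samples))
    return out
```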
Experimental Results
The running times of Naive-Sample, Olken-Sample, Stream-Sample and Frequency-
Partition-Sample were compared in the experiments, while varying the skew, the
sampling fraction and the index structure. The experiments show that the new strategies are
almost always better than those known earlier, and Stream-Sample is almost always the best.
Summary
The paper presents a problem, the difficulty of sampling over joins, suggests new strategies and
compares them experimentally to the earlier strategies to show that they are much better.
The experiments, however, only show that the new strategies are faster; they do not
evaluate the quality of the samples themselves.