A Paper on
RANDOM SAMPLING OVER
JOINS
by
SURAJIT CHAUDHURI
RAJEEV MOTWANI
VIVEK NARASAYYA
PRESENTED BY,
JEEVAN KUMAR GOGINENI
SARANYA GOTTIPATI
Outline
• Introduction
• Semantics of Sample
• Algorithms for Sampling
• Join Sampling Problem
• New Strategies for Join Sampling
• Extensions and Negative Results
• Experimental Evaluations
• Conclusions
Terms Used
• SAMPLE(R, f) is an SQL operation that returns a random sample of relation R.
• f is the fraction of the tuples of R to include in the sample.
• Relation R is the result produced when a query Q is evaluated.
Introduction
• Evaluating a query in full and then sampling its output is inefficient.
• OLAP and data mining applications often need only a sample of the result
of the query posed.
• Sampling must therefore be supported on the result of an
arbitrary SQL query.
Continued…
• The goal is to support random sampling as a primitive
relational operation in relational databases.
• This primitive is the SAMPLE(R, f) operation.
• Rather than computing R in full, partially evaluate Q to generate a sample of R.
• The sample operation may appear anywhere in the query
tree T.
• The core problem is commuting the sample operation down the tree
through a single join operation.
Semantics of Sample
1. Sampling with Replacement (WR)
2. Sampling without Replacement (WoR)
3. Independent Coin Flips (CF): each tuple is included in the sample with
probability f, independently of the other tuples.
f: fraction of the tuples in R to sample
n: number of tuples in R
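The three semantics can be contrasted with a small sketch. It assumes R is held as an in-memory Python list and that the sample size for WR and WoR is r = round(f * n); the function names are hypothetical and are not part of the slides.

```python
import random

def sample_wr(rows, f, rng):
    """WR: r independent uniform draws, so a tuple may appear more than once."""
    r = round(f * len(rows))
    return [rng.choice(rows) for _ in range(r)]

def sample_wor(rows, f, rng):
    """WoR: r distinct positions of R are chosen uniformly."""
    r = round(f * len(rows))
    return rng.sample(rows, r)

def sample_cf(rows, f, rng):
    """CF: each tuple is kept with probability f, so the sample size varies."""
    return [t for t in rows if rng.random() < f]

rng = random.Random(0)          # fixed seed only to make the run repeatable
R = list(range(100))
wr = sample_wr(R, 0.1, rng)     # exactly 10 tuples, duplicates allowed
wor = sample_wor(R, 0.1, rng)   # exactly 10 distinct tuples
cf = sample_cf(R, 0.1, rng)     # about 10 tuples on average
```

WR and WoR return exactly r tuples, while CF returns a sample whose size is itself random with mean f * n.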
Algorithms for Unweighted Sequential WR Sampling
Black-Box U1: Given relation R with n
tuples, generate an UNWEIGHTED WR
sample of size r.
Black-Box U2: Given relation R with n tuples,
generate an UNWEIGHTED WR sample of
size r.
The two black boxes are compared on:
• Must the size of the relation being sampled be known?
• How is the relation scanned?
• Is any significant auxiliary memory needed?
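The slide gives the same statement for both black boxes, so the sketch below rests on an assumption: U1 is taken to be a variant that knows n in advance and emits samples in scan order, and U2 a reservoir-style variant that does not need n but holds r slots in memory. The function names are hypothetical, and the binomial draw is simulated with repeated coin flips to stay within the standard library.

```python
import random

def u1_known_size(stream, n, r, rng):
    """One pass with n known up front; samples come out in scan order."""
    x = r                                   # draws not yet assigned to a tuple
    for i, t in enumerate(stream):
        if x == 0:
            break
        p = 1.0 / (n - i)                   # chance that a remaining draw lands here
        copies = sum(rng.random() < p for _ in range(x))   # Binomial(x, p)
        for _ in range(copies):
            yield t
        x -= copies

def u2_unknown_size(stream, r, rng):
    """One pass without knowing n; keeps r slots and outputs only at the end."""
    slots = [None] * r
    seen = 0
    for t in stream:
        seen += 1
        for j in range(r):
            if rng.random() < 1.0 / seen:   # each slot is replaced independently
                slots[j] = t
    return slots
```

Under these assumptions the trade-off matches the three comparison questions: the first variant needs n but almost no memory, while the second needs no size information but O(r) memory and cannot emit anything before the scan ends.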
Algorithms for Weighted Sequential WR Sampling
Black-Box WR1: Given relation R with n tuples,
generate a WEIGHTED WR sample of size r.
Black-Box WR2: Given relation R with n tuples,
generate a WEIGHTED WR sample of size r.
The two black boxes are compared on the same criteria:
• Must the size of the relation being sampled be known?
• How is the relation scanned?
• Is any significant auxiliary memory needed?
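A weighted sequential WR sampler can be sketched along the same lines. The version below is only an illustration: it assumes non-negative integer weights, a total weight known before the scan, and a hypothetical function name. Each tuple should end up chosen for any given draw with probability weight / total_weight.

```python
import random

def weighted_wr_sample(stream, weight, total_weight, r, rng):
    """One pass; emits r tuples in scan order, each draw weighted by weight(t)."""
    x = r                          # draws not yet assigned to a tuple
    remaining = total_weight       # weight of the tuples not yet scanned
    for t in stream:
        if x == 0:
            break
        w = weight(t)
        if w > 0:
            p = w / remaining      # chance that a remaining draw lands on t
            copies = sum(rng.random() < p for _ in range(x))   # Binomial(x, p)
            for _ in range(copies):
                yield t
            x -= copies
        remaining -= w

weights = {"a": 3, "b": 0, "c": 1}   # illustrative weights, not from the slides
sample = list(
    weighted_wr_sample(weights, weights.get, sum(weights.values()), 8, random.Random(1))
)
```

A tuple with weight 0 is never emitted, and the last tuple with positive weight absorbs every draw still unassigned, so the output always has exactly r tuples.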
The Difficulty of Join Sampling
SAMPLE(R1 ⋈ R2, f) =? SAMPLE(R1, f1) ⋈ SAMPLE(R2, f2)
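A small experiment suggests why the answer to the question above is generally no. The relations below are illustrative assumptions, not data from the slides: every tuple of the join depends on one specific tuple of R1 or one specific tuple of R2, so independent samples of the operands usually join to nothing.

```python
import random

def join(r1, r2):
    """Equi-join on the first attribute of each relation."""
    return [(a, b, c) for (a, b) in r1 for (a2, c) in r2 if a == a2]

def coin_flip_sample(rows, f, rng):
    """CF semantics: keep each tuple independently with probability f."""
    return [t for t in rows if rng.random() < f]

k = 100
# a1 is rare in R1 and frequent in R2; a2 is the other way round.
R1 = [("a1", "b0")] + [("a2", f"b{i}") for i in range(1, k + 1)]
R2 = [("a2", "c0")] + [("a1", f"c{i}") for i in range(1, k + 1)]
J = join(R1, R2)    # 2k tuples, each using ("a1", "b0") or ("a2", "c0")

def empty_join_rate(f, trials, rng):
    """Fraction of trials in which the join of the two samples is empty."""
    empty = sum(
        not join(coin_flip_sample(R1, f, rng), coin_flip_sample(R2, f, rng))
        for _ in range(trials)
    )
    return empty / trials
```

With sampling fractions f1 and f2, both pivotal tuples are missed with probability (1 - f1)(1 - f2), and in that case the joined samples are empty even though the full join has 2k tuples.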
Classification of the Problem
Case A: No information is available for either R1 or R2.
Case B: No information is available for R1, but
indexes and/or statistics are available for R2.
Case C: Indexes and/or statistics are available for both R1 and R2.
Previous Sampling Strategies
• Strategy Naive-Sample.
• Strategy Olken-Sample.
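The slide only names the two earlier strategies. As a rough sketch of how they are usually understood: Naive-Sample computes the whole join and samples from it, while Olken-Sample draws a random R1 tuple, picks a random matching R2 tuple through an index, and accepts the pair with probability proportional to the match count. The function names, the tuple layout (join attribute first), and the in-memory index are assumptions made for illustration.

```python
import random
from collections import defaultdict

def build_index(r2):
    """Hash index on the join attribute (first field) of R2."""
    idx = defaultdict(list)
    for t in r2:
        idx[t[0]].append(t)
    return idx

def naive_sample(r1, r2, r, rng):
    """Compute the full join, then draw r tuples from it with replacement."""
    idx = build_index(r2)
    full = [(t1, t2) for t1 in r1 for t2 in idx.get(t1[0], [])]
    return [rng.choice(full) for _ in range(r)]

def olken_sample(r1, r2, r, rng):
    """Rejection sampling; needs an index and the maximum frequency in R2."""
    idx = build_index(r2)
    max_freq = max((len(v) for v in idx.values()), default=0)
    if max_freq == 0 or not any(t1[0] in idx for t1 in r1):
        raise ValueError("the join is empty, so nothing can be sampled")
    out = []
    while len(out) < r:
        t1 = rng.choice(r1)                         # uniform tuple from R1
        matches = idx.get(t1[0], [])
        if rng.random() < len(matches) / max_freq:  # accept in proportion to matches
            out.append((t1, rng.choice(matches)))   # uniform matching R2 tuple
    return out
```

The rejection step is what keeps the result uniform over the join: without it, R1 tuples with few matches would be over-represented.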
New Strategies for Join Sampling
The three new sampling strategies are:
• Strategy Stream Sample.
• Strategy Group Sample.
• Strategy Frequency-Partition-Sample.
Table showing the information about R1
and R2
Strategy Stream Sample
• Performs only a sequential sample from R1
• Does not generate excess tuples
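One way to read the two bullets above: draw a weighted WR sample from R1 in a single scan, weighting each tuple by the number of R2 tuples it joins with, then pair each sampled tuple with one random match from R2. The sketch below assumes an in-memory index on R2 and uses a reservoir-style weighted sampler so that the total weight need not be known in advance; the names and these choices are illustrative, not taken from the slides.

```python
import random
from collections import defaultdict

def stream_sample(r1_stream, r2, r, rng):
    """Return r join tuples (with replacement) after one sequential scan of R1."""
    idx = defaultdict(list)                 # index on R2's join attribute
    for t in r2:
        idx[t[0]].append(t)

    slots = [None] * r                      # weighted reservoir over R1
    total = 0                               # running sum of weights seen so far
    for t1 in r1_stream:
        w = len(idx.get(t1[0], ()))         # weight = number of matches in R2
        if w == 0:
            continue                        # t1 contributes nothing to the join
        total += w
        for j in range(r):
            if rng.random() < w / total:
                slots[j] = t1
    if total == 0:
        return []                           # empty join
    # Pair every sampled R1 tuple with one uniformly chosen matching R2 tuple.
    return [(t1, rng.choice(idx[t1[0]])) for t1 in slots]
```

Each output pair is produced exactly once per slot, so no join tuples are generated and then discarded.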
Strategy Group Sample
Strategy Frequency-Partition-Sample
• Assumes that full statistics are available for R2.
• Uses Strategy Group Sample for high-frequency
values.
• Uses Strategy Naive Sample for low-frequency values.
• Join attribute values need not be of high frequency
in both operand relations.
• The sample must be distributed between the
high-frequency and low-frequency subdomains.
• Advantage: summary statistics for R2, in the form
of histograms, are sufficient.
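The partitioning step can be sketched as follows. This is a simplified illustration under stated assumptions: the frequency threshold is a free parameter, full frequency counts for R2 are available, and the r sample slots are split between the two subdomains by r coin flips biased by the sizes of the two partial joins. The function names are hypothetical.

```python
import random
from collections import Counter

def high_frequency_values(r2, threshold):
    """Join-attribute values that occur at least `threshold` times in R2."""
    freq = Counter(t[0] for t in r2)
    return {v for v, m in freq.items() if m >= threshold}

def split_sample_size(r1, r2, r, threshold, rng):
    """Decide how many of the r samples come from each subdomain."""
    m2 = Counter(t[0] for t in r2)                       # frequencies in R2
    hi = {v for v, m in m2.items() if m >= threshold}
    n_hi = sum(m2[t[0]] for t in r1 if t[0] in hi)       # size of the high-frequency join
    n_lo = sum(m2[t[0]] for t in r1 if t[0] not in hi)   # size of the low-frequency join
    if n_hi + n_lo == 0:
        return 0, 0                                      # empty join
    p_hi = n_hi / (n_hi + n_lo)
    r_hi = sum(rng.random() < p_hi for _ in range(r))    # r biased coin flips
    return r_hi, r - r_hi
```

The two counts would then be handed to the high-frequency and low-frequency sampling strategies respectively, and the two partial samples combined.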
Extensions and Negative Results
• The inherent difficulty of join sampling:
Even with large samples from R1 and
R2 and detailed statistics, it is in general not possible
to generate a non-empty random sample of
R1 join R2 from them.
• Dealing with join trees:
Push the sample operation down to the
operands.
Experimental Evaluations
• Naïve Sample: add a U1 operator at the root of the query tree.
• Olken Sample: create a uniform random sample T
from the key values of R1.
• Stream Sample: insert a WR1 operator as a child of
the join operator.
• Frequency-Partition-Sample: implement a
modified version of the WR1 operator to produce
the random sample from R1.
Experimental results
Conclusions
• Studied the issues involved in implementing sampling
as a primitive operation.
• Presented a series of sampling strategies.
• Provided new schemes for sequential random
sampling under uniform and weighted sampling
distributions.
• Even more efficient strategies can be developed.
QUESTIONS??
Thank you