A Paper on RANDOM SAMPLING OVER JOINS
by SURAJIT CHAUDHURI, RAJEEV MOTWANI, VIVEK NARASAYYA
PRESENTED BY JEEVAN KUMAR GOGINENI, SARANYA GOTTIPATI

Outline
• Introduction
• Semantics of Sample
• Algorithms of Sampling
• Join Sampling Problem
• New Strategies for Join Sampling
• Extensions and Negative Results
• Experimental Evaluations
• Conclusions

Terms Used
• SAMPLE(R, f) is an SQL operation.
• f is the fraction of relation R to be sampled.
• Relation R is the result produced when a query Q is evaluated.

Introduction
• Evaluating a query in full and then sampling its output is inefficient.
• OLAP and data mining applications use a sample of the result of the query posed.
• Sampling must be supported on the result of an arbitrary SQL query.
• The paper supports random sampling as a primitive relational operation in relational databases: the SAMPLE(R, f) operation.
• Goal: partially evaluate Q to generate a sample of R.
• The sample operation can appear anywhere in the query tree T.
• The focus is on commuting the sample operation down the tree past a single join operation.

Semantics of Sample
1. Sampling with Replacement (WR)
2. Sampling without Replacement (WoR)
3. Independent Coin Flips (CF): each tuple is sampled with probability f, independent of other tuples.
f: fraction of tuples in R
n: number of tuples in R

Algorithms for Unweighted Sequential WR Sampling
• Black-Box U1: Given relation R with n tuples, generate an unweighted WR sample of size r.
• Black-Box U2: Given relation R with n tuples, generate an unweighted WR sample of size r.
Points of comparison:
• Must the size of the relation being sampled be known?
• How does it scan the relation?
• Does it need any significant auxiliary memory?

Algorithms for Weighted Sequential WR Sampling
• Black-Box WR1: Given relation R with n tuples, generate a weighted WR sample of size r.
• Black-Box WR2: Given relation R with n tuples, generate a weighted WR sample of size r.
Points of comparison:
• Must the size of the relation being sampled be known?
• How does it scan the relation?
• Does it need any significant auxiliary memory?

The Difficulty of Join Sampling
SAMPLE(R1 ⋈ R2, f) ≠ SAMPLE(R1, f1) ⋈ SAMPLE(R2, f2)

Classification of the Problem
• Case A: No information is available for either R1 or R2.
• Case B: No information is available for R1, but indexes and/or statistics are available for R2.
• Case C: Indexes/statistics are available for both R1 and R2.

Previous Sampling Strategies
• Strategy Naive-Sample
• Strategy Olken-Sample

New Strategies for Join Sampling
The three new sampling strategies are:
• Strategy Stream-Sample
• Strategy Group-Sample
• Strategy Frequency-Partition-Sample
[Table: information about R1 and R2]

Strategy Stream-Sample
• Performs only a sequential sample from R1.
• Does not generate excess tuples.

Strategy Group-Sample

Strategy Frequency-Partition-Sample
• Assumes that we have full statistics for R2.
• Uses Strategy Group-Sample for high-frequency values.
• Uses Strategy Naive-Sample for low-frequency values.
• Join attribute values need not be of high frequency in both operand relations.
• Determines how the sample is distributed between the high- and low-frequency subdomains.
• Advantage: it needs summary statistics in the form of histograms for R2.

Extensions and Negative Results
• The inherent difficulty of join sampling: even if we have large samples from R1 and R2 and detailed statistics, it is not possible to generate from them any non-empty random sample of R1 ⋈ R2.
• Dealing with join trees: push the sample operation down to the operands.

Experimental Evaluations
• Naive-Sample: add the U1 operator as the root of the tree.
• Olken-Sample: create a uniform random sample T from the key values of R1.
• Stream-Sample: insert the WR1 operator as a child of the join operator.
• Frequency-Partition-Sample: implement a modified version of the WR1 operator for producing a random sample from R1.

Experimental Results
[Figures: experimental results]

Conclusions
• Study of the issues involved in implementing sampling as a primitive operation.
• Series of sampling strategies.
• Provided new schemes for sequential random sampling for uniform and weighted sampling distributions.
• Even more efficient strategies can be developed.

Questions?

Thank you
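
Supplementary sketch: the three sampling semantics
The three semantics listed under Semantics of Sample (WR, WoR, CF) can be illustrated with a few lines of Python. This is a minimal sketch, assuming the relation fits in memory as a list; the function names sample_wr, sample_wor and sample_cf are hypothetical and are not taken from the paper.

```python
import random

def sample_wr(rows, f, rng):
    """Sampling with replacement (WR): draw round(f * n) tuples,
    each chosen uniformly and independently, so duplicates can occur."""
    r = round(f * len(rows))
    return [rng.choice(rows) for _ in range(r)]

def sample_wor(rows, f, rng):
    """Sampling without replacement (WoR): draw round(f * n) distinct
    tuples, every subset of that size being equally likely."""
    r = round(f * len(rows))
    return rng.sample(rows, r)

def sample_cf(rows, f, rng):
    """Independent coin flips (CF): keep each tuple with probability f,
    independently of the others. The sample size is itself random."""
    return [t for t in rows if rng.random() < f]

if __name__ == "__main__":
    rng = random.Random(42)          # fixed seed only for repeatability
    relation = list(range(1000))     # stand-in for the tuples of R
    print(len(sample_wr(relation, 0.05, rng)))
    print(len(sample_wor(relation, 0.05, rng)))
    print(len(sample_cf(relation, 0.05, rng)))
```

WR and WoR return exactly round(f * n) tuples, while CF returns about f * n tuples on average.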
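
Supplementary sketch: unweighted sequential WR sampling
The two black boxes for unweighted sequential WR sampling share a specification but differ in what they need to know and store. The sketch below shows two standard single-pass techniques that meet that specification: one that needs n in advance and no buffer, and one that needs no n but keeps r slots in memory. It is an illustration under those assumptions, not necessarily the exact procedure used in the paper, and both function names are hypothetical.

```python
import random

def wr_sample_known_n(rows, n, r, rng):
    """One sequential pass, n known in advance, no sample buffer needed.
    For the i-th tuple (0-based), each still-unassigned sample slot
    picks it with probability 1 / (n - i). Assumes rows yields exactly
    n tuples."""
    out = []
    remaining = r
    for i, t in enumerate(rows):
        if remaining == 0:
            break
        p = 1.0 / (n - i)
        # Binomial(remaining, p) drawn as a sum of coin flips.
        copies = sum(1 for _ in range(remaining) if rng.random() < p)
        out.extend([t] * copies)     # emitted as soon as the tuple is seen
        remaining -= copies
    return out

def wr_sample_unknown_n(rows, r, rng):
    """One sequential pass, n not needed, but r slots of memory.
    Each slot is an independent size-1 reservoir: the i-th tuple
    (1-based) overwrites it with probability 1 / i. If rows is empty
    the slots stay None."""
    reservoir = [None] * r
    for i, t in enumerate(rows, start=1):
        for j in range(r):
            if rng.random() < 1.0 / i:
                reservoir[j] = t
    return reservoir                 # only valid once the scan has finished

if __name__ == "__main__":
    rng = random.Random(7)
    relation = list("abcdefghij")
    print(wr_sample_known_n(relation, len(relation), 5, rng))
    print(wr_sample_unknown_n(iter(relation), 5, rng))
```

The first variant can stream its output while scanning. The second can only report the sample after the whole relation has been read.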
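
Supplementary sketch: sampling a join without computing it
The idea behind Strategy Stream-Sample, as it is usually described, is to draw a weighted WR sample from R1, where the weight of a tuple is the number of R2 tuples it joins with, and then to pick one matching R2 tuple at random for each sampled R1 tuple. The sketch below contrasts this with Naive-Sample, which builds the full join first. It makes several assumptions: relations are in-memory lists of (join_key, payload) tuples, a Python dict stands in for the index on R2, and random.choices replaces a true sequential WR1 operator. All function names are hypothetical.

```python
import random

def build_index(r2):
    """Dict from join key to the R2 tuples with that key (stand-in for an index)."""
    index = {}
    for t in r2:
        index.setdefault(t[0], []).append(t)
    return index

def naive_sample(r1, r2, r, rng):
    """Naive-Sample: compute the whole join, then draw r tuples WR."""
    index = build_index(r2)
    joined = [(a, x, y) for (a, x) in r1 for (_, y) in index.get(a, [])]
    if not joined:
        return []
    return [rng.choice(joined) for _ in range(r)]

def stream_sample(r1, r2, r, rng):
    """Stream-Sample idea: weighted WR sample of R1, then one random
    matching R2 tuple per sampled R1 tuple. The full join is never built."""
    index = build_index(r2)
    # Weight of an R1 tuple = frequency of its join value in R2.
    weights = [len(index.get(t[0], ())) for t in r1]
    if sum(weights) == 0:
        return []                    # the join is empty
    s1 = rng.choices(r1, weights=weights, k=r)
    return [(a, x, rng.choice(index[a])[1]) for (a, x) in s1]

if __name__ == "__main__":
    rng = random.Random(3)
    r1 = [(1, "a"), (2, "b"), (3, "c")]
    r2 = [(1, "x"), (1, "y"), (2, "z")]
    print(stream_sample(r1, r2, 4, rng))
    print(naive_sample(r1, r2, 4, rng))
```

An R1 tuple with m matches in R2 is chosen with probability m / |R1 ⋈ R2|, and each of its matches with probability 1 / m, so every join tuple is drawn with probability 1 / |R1 ⋈ R2|. Sampling R1 uniformly instead would under-represent join values that are frequent in R2.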