Flexible Database Generators
Nicolas Bruno, Surajit Chaudhuri
DMX Group, Microsoft Research
VLDB’05

DBMS Components
- DBMS are complex pieces of software
- Components still evolving/being added
[Diagram: DBMS components: Multidimensional STHoles Histograms, Automatic Physical Design Tool, Statistics, Statistics Management, Statistics on Query Expressions, Cost Model, Optimizer]

Evaluating New Components
- Functional vs. quality evaluation
- Black-box vs. gray-box evaluation
- Steps:
    - Generate data
    - Generate workload
    - Evaluate improvement
- Manual task: time consuming, sometimes difficult, not reusable

DGL (Data Generation Language)
- Special purpose specification language
    - Based on iterators
    - Functional flavor (can compose iterators)
    - Interface with DBMS for scalability
    - Flexible and extensible
- SQL extensions using DGL annotations
    - Extend CREATE TABLE statements
    - Specify how a database is populated
    - Inter-table dependencies are possible

Populating a Database with DGL
Pipeline: Annotated Schema -> Preprocessor -> DGL Program -> DGL Compiler -> C++ Program -> C++ Compiler/Linker -> Data Generator

Annotated Schema:

    CREATE TABLE R ( a INT, b INT, c INT )
    POPULATE (
      (a,d) = Step(1, 100) & UniformInt(0, 10),
      (b,c) = Duplicate( Query("SELECT DISTINCT(R.d) FROM R"), 100 )
    )

DGL Program:

    LET R_ad = Step(1, 100) & UniformInt(0, 10),
        R_bc = Duplicate ( Query ( "SELECT DISTINCT(<<0>>.v1) FROM <<0>>", R_ad ), 100 )
    IN PersistToExisting ( R_ad[0] & R_bc , "R" )

DGL Program with the proxy R_ad_Prx:

    LET R_ad_Prx = Persist ( Step(1, 100) & UniformInt(0, 10) ),
        R_ad = Query ( "SELECT * FROM <<0>>", R_ad_Prx ),
        R_bc = Duplicate ( Query ( "SELECT DISTINCT(<<0>>.v1) FROM <<0>>", R_ad_Prx ), 100 )
    IN PersistToExisting ( R_ad[0] & R_bc , "R" )

DGL Primitives and Runtime Library (linked into the Data Generator):
- User-defined Iterator plumbing.
- Buffering of intermediate results.
- DB bulk-loading, querying, etc.

DGL: Data Types
- Base: Integers, Strings, Dates, etc.
- Rows: heterogeneous sequence of scalars
    - Inherit operations from scalar types
    - ( [1, 4, 5.0] + [2, 3, 4.5] ) ++ ['John'] = [3, 7, 9.5, 'John']
- Iterators: key data type
    - Step(1,5) returns <[1],[2],[3],[4],[5]>
    - Constant( [1,2] ) returns <[1,2], [1,2], [1,2], … >
    - Iterators inherit operations from rows: Step(1,5) ++ Step(6,10) = <[1,6], [2,7], [3,8], [4,9], [5,10]>
- Associative tables: main memory, random access

Primitive Iterators
- Statistical distributions: Uniform, Gaussian, Zipfian, Poisson, etc.
    - Uniform( Constant([0,0]), Constant([10,10]) ) = Uniform( [0,0], [10,10] ) (implicit casts)
- SQL: bridge DGL and DBMS
    - Persist (expression, [table name])
    - Query (parameterized query, iterator1, …)
- Others: duplicate elimination, union, etc.

Expressions and Functions
- Expressions (acyclic reference graph)
- Functions

Annotated Schemas with DGL
- Annotations: specify how to populate a database
- From annotations to DGL:
    - Create single DGL fragment for all annotations
    - Vertical partitions for inter-table dependencies
    - Query rewriting
    - Proxy introduction
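
The row and iterator operations listed under "DGL: Data Types" can be imitated in a few lines of Python. This is only an illustrative sketch, not DGL's implementation: the helper names (step, constant, row_add, row_concat, lift) are hypothetical, and stopping at the shorter input when two iterators are combined is an assumption the slides do not state.

```python
from itertools import islice

def step(lo, hi):
    """Finite iterator of one-column rows, mirroring Step(lo, hi) = <[lo], ..., [hi]>."""
    return ([v] for v in range(lo, hi + 1))

def constant(row):
    """Infinite iterator that repeats the same row, mirroring Constant(row)."""
    while True:
        yield list(row)

def row_add(r1, r2):
    """Element-wise '+' on two rows of equal length."""
    if len(r1) != len(r2):
        raise ValueError("rows must have the same number of columns")
    return [a + b for a, b in zip(r1, r2)]

def row_concat(r1, r2):
    """Row concatenation, written '++' on the slide."""
    return list(r1) + list(r2)

def lift(op, it1, it2):
    """Apply a row operation pairwise to two iterators.

    Assumption: the result ends when the shorter input ends.
    """
    return (op(a, b) for a, b in zip(it1, it2))

# The two examples from the slide, reproduced with the helpers above.
row_example = row_concat(row_add([1, 4, 5.0], [2, 3, 4.5]), ['John'])
iter_example = list(lift(row_concat, step(1, 5), step(6, 10)))
first_constants = list(islice(constant([1, 2]), 3))
```

Because constant never terminates, it is only usable when paired with a finite iterator or truncated, here with islice.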

Example: Multidimensional Distributions

SQL Annotation:

    CREATE TABLE testDGL (
      posX float,
      posY float,
      posZ float
    ) POPULATE 1000 (
      gaussian = multiGauss([0,0,0], [1000,1000,1000], [@1,@1,@1], 1.0, @0, 10000),
      block = Uniform([100,100,100], [200,300,400]),
      (posX, posY, posZ) = ProbUnion(gaussian, block, @2)
    )

Callouts on the annotation:
- Traditional DDL: the column list of testDGL
- Target size: 1000
- Parameters: @0, @1, @2
- Temporary columns: gaussian, block
- User-defined iterator: ProbUnion

DGL Function (built from distributions):

    FUN multiGauss(lo, hi, sigma, z, p, N) =
      LET ctrList = Top( Uniform(lo, hi), p ),
          idxs = Zipfian(z, p),
          centers = TApply(ctrList, idxs)
      IN Top( Normal(centers, sigma), N )

C++ UDI:

    class ProbUnion : public CIterator {
    public:
      virtual void _open() { ... }
      virtual bool _getNext(CRow* row) { ... }
    };

[Diagram: dataflow of multiGauss, with inputs lo, hi, p, σ, N, z feeding Uniform, Top, Zipfian, TApply, Normal, Top]

Example: Inter-table dependencies
- Employees' ages are normally distributed around 40
- Employees' departments follow a Zipfian distribution
- Employees' bonus is distributed around the department 'category', which depends on budget
- Dept budget is normally distributed around 10000 * size (number of employees)
- Dept building follows a Zipfian distribution

Annotated Schema:

    CREATE TABLE employee (
      empID int,
      age int,
      deptID int,
      bonus int
    ) POPULATE 10000 (
      empID = Step(0,10000),
      age = Normal(40.0,5.0),
      deptID = Zipfian(0.75, 50),
      empBaseBonus = Query("
        SELECT D.budget / 1000
        FROM employee JOIN dept
        ON employee.deptID = dept.deptID
        ORDER BY employee.empID"),
      bonus = empBaseBonus * Uniform(0.5,1.5)
    )

    CREATE TABLE dept (
      deptID int,
      budget float,
      building int
    ) POPULATE (
      (deptID, size) = Query("
        SELECT employee.deptID, count(*)
        FROM employee
        GROUP BY employee.deptID"),
      budget = Normal(10000*size, 5000),
      building = Zipfian(1.0, 20)
    )

DGL Program:

    LET employeeempID = Top ( Step ( 0, 10000 ), 10000 ),
        employeeage = Top ( Normal ( 40.00, 5.00 ), 10000 ),
        deptbuilding = Zipfian ( 1.00, 20 ),
        employeedeptIDProxy = Persist ( Top ( Zipfian ( 0.75, 50 ), 10000 ) ),
        employeedeptID = Query ( "SELECT * FROM <<0>>", employeedeptIDProxy ),
        deptdeptIDsize = Query ( "SELECT <<0>>.v0, count(*) FROM <<0>> GROUP BY <<0>>.v0", employeedeptIDProxy ),
        deptbudget = Normal ( 10000 * deptdeptIDsize_1, 5000 ),
        tmp1Proxy = Persist ( deptbudget & deptdeptIDsize ),
        tmp2Proxy = Persist ( employeedeptID & employeeempID ),
        employeeempBaseBonus = Top ( Query (
          "SELECT <<0>>.v0 / 1000 FROM <<1>> JOIN <<0>> ON <<1>>.v0 = <<0>>.v1 ORDER BY <<1>>.v1",
          tmp1Proxy, tmp2Proxy ), 10000 ),
        ...
    IN Union (
      Persist ( employeeempID & employeeage & employeedeptID & employeebonus, "employee" ),
      Persist ( deptdeptIDsize_0 & deptbudget & deptbuilding, "dept" )
    )

Evaluation Model
- Iterator model (open/getNext/close)
- Program is a DAG
- Depending on consumers, buffering is required
- In-memory circular queue that spills to disk

Examples
- Multi-gaussian
- Wisconsin Benchmark
- Skewed primary/foreign key joins

Complex TPC-H Examples
- All parts in an order are sold by suppliers that live in the same country as the customer.
- Order arrivals follow a Poisson distribution starting in '1992/01/01'.
- Item discounts are correlated to the global number of parts sold.
- Top 100 customers' debt is normally distributed around 3*balances. Remaining customers, around balances/2.
- Number of items per order follows a Zipfian distribution.
- Ship date occurs k days after order date, where k follows Zipfian.
- Commit and receipt dates follow a bi-gaussian distribution after ship date.

Initial Performance Results
- Populate 1GB databases with various generators

Conclusion
- Creating datasets for quality evaluation of new database components is time-consuming
- DGL is expressive and easy to use
- SQL annotations reduce time needed to create and populate databases with non-trivial correlations
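
The iterator model named under "Evaluation Model" (open/getNext/close, with buffering when one iterator in the DAG feeds several consumers) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the DGL runtime: the class names StepIterator and SharedProducer are hypothetical, the buffer is a plain in-memory deque, and spilling to disk is left out.

```python
from collections import deque

class StepIterator:
    """Produces the rows [lo], [lo+1], ..., [hi] through open/get_next/close."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.cur = None

    def open(self):
        self.cur = self.lo

    def get_next(self):
        # Returns the next row, or None once the iterator is exhausted or closed.
        if self.cur is None or self.cur > self.hi:
            return None
        row = [self.cur]
        self.cur += 1
        return row

    def close(self):
        self.cur = None

class SharedProducer:
    """Lets several consumers read one producer, pulling each row only once.

    Rows are buffered until every consumer has read them
    (in-memory only; disk spilling is omitted in this sketch).
    """

    def __init__(self, source, n_consumers):
        self.source = source
        self.queue = deque()          # rows not yet read by every consumer
        self.base = 0                 # absolute index of queue[0]
        self.pos = [0] * n_consumers  # absolute read position per consumer

    def open(self):
        self.source.open()

    def get_next(self, consumer):
        idx = self.pos[consumer] - self.base
        if idx == len(self.queue):
            # This consumer is the furthest ahead: pull a fresh row.
            row = self.source.get_next()
            if row is None:
                return None
            self.queue.append(row)
        row = self.queue[idx]
        self.pos[consumer] += 1
        # Discard rows that every consumer has already passed.
        while self.queue and min(self.pos) > self.base:
            self.queue.popleft()
            self.base += 1
        return row

    def close(self):
        self.source.close()

# Consumer 0 runs ahead, so all five rows must be buffered for consumer 1.
shared = SharedProducer(StepIterator(1, 5), n_consumers=2)
shared.open()
fast = [shared.get_next(0) for _ in range(5)]
buffered_rows = len(shared.queue)
slow = [shared.get_next(1) for _ in range(5)]
remaining_rows = len(shared.queue)
shared.close()
```

The buffer grows with the distance between the fastest and slowest consumer, which is the situation the slide's disk-spilling queue is meant to handle.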