Download slides - VLDB 2005

Flexible Database Generators
Nicolas Bruno, Surajit Chaudhuri
DMX Group, Microsoft Research
VLDB'05

DBMS Components
- DBMSs are complex pieces of software
- Components are still evolving or being added
[Figure: DBMS component diagram. Labels: Multidimensional STHoles Histograms, Automatic Physical Design Tool, Statistics, Statistics Management, Statistics on Query Expressions, Cost Model, Optimizer, DBMS]

Evaluating New Components
- Functional vs. quality evaluation
- Black-box vs. gray-box evaluation
- Steps:
  - Generate data
  - Generate workload
  - Evaluate improvement
- Manual task: time consuming, sometimes difficult, not reusable

DGL (Data Generation Language)
- Special purpose specification language
- Based on iterators
- Functional flavor (can compose iterators)
- Interface with DBMS for scalability
- Flexible and extensible
- SQL extensions using DGL annotations
- Extend CREATE TABLE statements
- Specify how a database is populated
- Inter-table dependencies are possible

Populating a Database with DGL
Pipeline: Annotated Schema -> Preprocessor -> DGL Program -> DGL Compiler -> C++ Program -> C++ Compiler/Linker -> Data Generator

Annotated schema:

    CREATE TABLE R (
      a INT, b INT, c INT
    ) POPULATE (
      (a,d) = Step(1, 100) & UniformInt(0, 10),
      (b,c) = Duplicate( Query("SELECT DISTINCT(R.d) FROM R"), 100 )
    )

DGL program:

    LET R_ad = Step(1, 100) & UniformInt(0, 10),
        R_bc = Duplicate (
                 Query ( "SELECT DISTINCT(<<0>>.v1) FROM <<0>>", R_ad ),
                 100 )
    IN PersistToExisting ( R_ad[0] & R_bc , "R" )

DGL program with a persisted proxy (R_ad_Prx):

    LET R_ad_Prx = Persist ( Step(1, 100) & UniformInt(0, 10) ),
        R_ad = Query ( "SELECT * FROM <<0>>", R_ad_Prx ),
        R_bc = Duplicate (
                 Query ( "SELECT DISTINCT(<<0>>.v1) FROM <<0>>", R_ad_Prx ),
                 100 )
    IN PersistToExisting ( R_ad[0] & R_bc , "R" )

DGL Primitives and Runtime Library:
- User-defined Iterator plumbing.
- Buffering of intermediate results.
- DB bulk-loading, querying, etc.

DGL: Data Types
- Base: Integers, Strings, Dates, etc.
- Rows: heterogeneous sequence of scalars
  - Inherit operations from scalar types:
    ( [1, 4, 5.0] + [2, 3, 4.5] ) ++ ['John'] = [3, 7, 9.5, 'John']
- Iterators: key data type
  - Step(1,5) returns <[1],[2],[3],[4],[5]>
  - Constant( [1,2] ) returns <[1,2], [1,2], [1,2], … >
  - Iterators inherit operations from rows:
    Step(1,5) ++ Step(6,10) = <[1,6], [2,7], [3,8], [4,9], [5,10]>
- Associative tables: main memory, random access

Primitive Iterators
- Statistical distributions: Uniform, Gaussian, Zipfian, Poisson, etc.
    Uniform( Constant([0,0]), Constant([10,10]) ) = Uniform( [0,0], [10,10] )  (implicit casts)
- SQL: bridge between DGL and the DBMS
    Persist (expression, [table name])
    Query (parameterized query, iterator1, …)
- Others: duplicate elimination, union, etc.

Expressions and Functions
- Expressions (acyclic reference graph)
- Functions

Annotated Schemas with DGL
- Annotations: specify how to populate a database
- From annotations to DGL:
  - Create single DGL fragment for all annotations
  - Vertical partitions for inter-table dependencies
  - Query rewriting
  - Proxy introduction
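The row and iterator semantics from the "DGL: Data Types" slide can be illustrated with a minimal Python sketch. This is not DGL and not the authors' runtime: the helper names `step`, `constant`, `add_rows`, and `concat` are hypothetical, and stopping at the shorter input in `concat` is an assumption the slides do not confirm.

```python
from itertools import count, islice


def step(lo, hi):
    """Yield one-column rows [lo], [lo+1], ..., [hi], mirroring Step(lo, hi)."""
    return ([v] for v in range(lo, hi + 1))


def constant(row):
    """Yield a fresh copy of the same row forever, mirroring Constant(row)."""
    return (list(row) for _ in count())


def add_rows(r1, r2):
    """Element-wise addition: rows inherit operations from their scalars."""
    return [a + b for a, b in zip(r1, r2)]


def concat(it1, it2):
    """Row-wise concatenation of two iterators (written ++ on the slide).

    Assumption: the result ends when the shorter input ends.
    """
    return (r1 + r2 for r1, r2 in zip(it1, it2))


# Row example from the slide: ([1, 4, 5.0] + [2, 3, 4.5]) ++ ['John']
row = add_rows([1, 4, 5.0], [2, 3, 4.5]) + ["John"]

# Iterator example from the slide: Step(1,5) ++ Step(6,10)
pairs = list(concat(step(1, 5), step(6, 10)))

# Constant is infinite, so only a prefix is taken here.
prefix = list(islice(constant([1, 2]), 3))
```

Generators keep the sketch lazy, which matches the slides' point that iterators such as Constant can be unbounded and are consumed on demand.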

Example: Multidimensional Distributions

SQL annotation:

    CREATE TABLE testDGL (
      posX float,
      posY float,
      posZ float
    ) POPULATE 1000 (
      gaussian = multiGauss([0,0,0], [1000,1000,1000],
                            [@1,@1,@1], 1.0, @0, 10000 ),
      block = Uniform([100,100,100], [200,300,400]),
      (posX, posY, posZ) = ProbUnion(gaussian, block, @2)
    )

Callouts on the slide: traditional DDL (the column list), target size (1000), parameters (@0, @1, @2), temporary columns (gaussian, block), user-defined iterator (ProbUnion).

DGL function:

    FUN multiGauss(lo, hi, sigma, z, p, N) =
      LET ctrList = Top( Uniform(lo, hi), p ),
          idxs = Zipfian(z, p),
          centers = TApply(ctrList, idxs)
      IN Top( Normal(centers, sigma), N)

C++ UDI:

    class ProbUnion : public CIterator {
    public:
      virtual void _open() { ... }
      virtual bool _getNext(CRow* row) { ... }
    };

[Figure: multiGauss dataflow. Inputs: lo, hi, p, σ, N, z. Operators: Uniform, Top, Zipfian, TApply, Normal, Top]

Example: Inter-table dependencies
- Employees' ages are normally distributed around 40
- Employees' departments follow a Zipfian distribution
- Employees' bonus is distributed around the department 'category', which depends on budget
- Dept budget is normally distributed around 10000 * size (number of employees)
- Dept building follows a Zipfian distribution

Annotated schema:

    CREATE TABLE employee (
      empID int,
      age int,
      deptID int,
      bonus int
    ) POPULATE 10000 (
      empID = Step(0,10000),
      age = Normal(40.0,5.0),
      deptID = Zipfian(0.75, 50),
      empBaseBonus = Query("
        SELECT dept.budget / 1000
        FROM employee JOIN dept
        ON employee.deptID = dept.deptID
        ORDER BY employee.empID"),
      bonus = empBaseBonus * Uniform(0.5,1.5)
    )

    CREATE TABLE dept (
      deptID int,
      budget float,
      building int
    ) POPULATE (
      (deptID, size) = Query("
        SELECT employee.deptID, count(*)
        FROM employee
        GROUP BY employee.deptID"),
      budget = Normal(10000*size, 5000),
      building = Zipfian(1.0, 20)
    )

Resulting DGL program:

    LET employeeempID = Top ( Step ( 0, 10000 ), 10000 ),
        employeeage = Top ( Normal ( 40.00, 5.00 ), 10000 ),
        deptbuilding = Zipfian ( 1.00, 20 ),
        employeedeptIDProxy = Persist ( Top ( Zipfian ( 0.75, 50 ), 10000 ) ),
        employeedeptID = Query ( "SELECT * FROM <<0>>", employeedeptIDProxy ),
        deptdeptIDsize = Query ( "SELECT <<0>>.v0, count(*) FROM <<0>>
                                  GROUP BY <<0>>.v0", employeedeptIDProxy ),
        deptbudget = Normal ( 10000 * deptdeptIDsize_1, 5000 ),
        tmp1Proxy = Persist ( deptbudget & deptdeptIDsize ),
        tmp2Proxy = Persist ( employeedeptID & employeeempID ),
        employeeempBaseBonus = Top ( Query (
            "SELECT <<0>>.v0 / 1000 FROM <<1>> JOIN <<0>>
             ON <<1>>.v0 = <<0>>.v1 ORDER BY <<1>>.v1",
            tmp1Proxy, tmp2Proxy ), 10000 ),
        ...
    IN Union (
         Persist ( employeeempID & employeeage & employeedeptID & employeebonus,
                   "employee" ),
         Persist ( deptdeptIDsize_0 & deptbudget & deptbuilding, "dept" )
       )

Evaluation Model
- Iterator model (open/getNext/close)
- Program is a DAG
- Depending on consumers, buffering is required
- In-memory circular queue that spills to disk

Examples
- Multi-gaussian
- Wisconsin Benchmark
- Skewed primary/foreign key joins

Complex TPC-H Examples
- All parts in an order are sold by suppliers that live in the same country as the customer.
- Order arrivals follow a Poisson distribution starting in '1992/01/01'.
- Item discounts are correlated to the global number of parts sold.
- Top 100 customers' debt is normally distributed around 3*balances. Remaining customers, around balances/2.
- Number of items per order follows a Zipfian distribution.
- Ship date occurs k days after order date, where k follows Zipfian.
- Commit and receipt dates follow a bi-gaussian distribution after ship date.

Initial Performance Results
- Populate 1GB databases with various generators

Conclusion
- Creating datasets for quality evaluation of new database components is time-consuming
- DGL is expressive and easy to use
- SQL annotations reduce time needed to create and populate databases with non-trivial correlations
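The "Evaluation Model" slide (open/getNext/close iterators in a DAG, with buffering when one iterator feeds several consumers) can be sketched roughly as follows. This is an illustrative Python sketch, not the DGL runtime: the classes `Step` and `SharedBuffer` and their method names are hypothetical, and the buffer here is a plain in-memory queue that never spills to disk, unlike the circular queue the slide describes.

```python
from collections import deque


class Step:
    """Hypothetical producer following an open/get_next/close protocol."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.cur = None

    def open(self):
        self.cur = self.lo

    def get_next(self):
        """Return the next one-column row, or None when exhausted."""
        if self.cur is None or self.cur > self.hi:
            return None
        row = [self.cur]
        self.cur += 1
        return row

    def close(self):
        self.cur = None


class SharedBuffer:
    """Lets several consumers read one producer at different speeds.

    Rows stay queued until every consumer has read them.
    Assumption: everything fits in memory (no spilling to disk).
    """

    def __init__(self, producer, n_consumers):
        self.producer = producer
        self.queue = deque()
        self.base = 0                  # absolute index of queue[0]
        self.pos = [0] * n_consumers   # absolute read position per consumer
        self.done = False
        producer.open()

    def get_next(self, consumer):
        idx = self.pos[consumer] - self.base
        if idx == len(self.queue):     # this consumer is at the frontier
            if self.done:
                return None
            row = self.producer.get_next()
            if row is None:
                self.done = True
                self.producer.close()
                return None
            self.queue.append(row)
        row = self.queue[idx]
        self.pos[consumer] += 1
        # Drop rows that every consumer has already read.
        while self.queue and min(self.pos) > self.base:
            self.queue.popleft()
            self.base += 1
        return row


buf = SharedBuffer(Step(1, 3), n_consumers=2)
fast = [buf.get_next(0), buf.get_next(0)]          # consumer 0 runs ahead
slow = [buf.get_next(1) for _ in range(4)]         # consumer 1 drains the input
rest = [buf.get_next(0), buf.get_next(0)]          # consumer 0 catches up
```

The point of the design is that the producer is pulled only once per row, while the buffer holds just the rows between the slowest and fastest consumer.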
