Download slides - VLDB 2005

Flexible Database Generators
Nicolas Bruno
Surajit Chaudhuri
DMX Group
Microsoft Research
VLDB’05

DBMS Components
- DBMS are complex pieces of software
- Components still evolving/being added
[Figure: components inside a DBMS. Labels: Multidimensional STHoles Histograms, Automatic Physical Design Tool, Statistics, Statistics Management, Statistics on Query Expressions, Cost Model, Optimizer]

Evaluating New Components
- Functional vs. quality evaluation
- Black-box vs. gray-box evaluation
- Steps:
  - Generate data
  - Generate workload
  - Evaluate improvement
- Manual task: time-consuming, sometimes difficult, not reusable
[Figure: DBMS component diagram repeated from the previous slide]

DGL (Data Generation Language)
- Special-purpose specification language
  - Based on iterators
  - Functional flavor (can compose iterators)
  - Interface with DBMS for scalability
  - Flexible and extensible
- SQL extensions using DGL annotations
  - Extend CREATE TABLE statements
  - Specify how a database is populated
- Inter-table dependencies are possible

Populating a Database with DGL
Pipeline: Annotated Schema -> Preprocessor -> DGL Program -> DGL Compiler -> C++ Program -> C++ Compiler/Linker -> Data Generator

Annotated Schema:
    CREATE TABLE R ( a INT, b INT, c INT )
    POPULATE (
        (a,d) = Step(1, 100) & UniformInt(0, 10),
        (b,c) = Duplicate( Query("SELECT DISTINCT(R.d) FROM R"), 100 ) )

DGL Program:
    LET R_ad = Step(1, 100) & UniformInt(0, 10),
        R_bc = Duplicate (
                   Query ( "SELECT DISTINCT(<<0>>.v1) FROM <<0>>", R_ad ),
                   100 )
    IN PersistToExisting ( R_ad[0] & R_bc , "R" )

DGL Program, with a persisted proxy (R_ad_Prx):
    LET R_ad_Prx = Persist ( Step(1, 100) & UniformInt(0, 10) ),
        R_ad = Query ( "SELECT * FROM <<0>>", R_ad_Prx ),
        R_bc = Duplicate (
                   Query ( "SELECT DISTINCT(<<0>>.v1) FROM <<0>>", R_ad_Prx ),
                   100 )
    IN PersistToExisting ( R_ad[0] & R_bc , "R" )

DGL Primitives and Runtime Library:
- User-defined Iterator plumbing.
- Buffering of intermediate results.
- DB bulk-loading, querying, etc.

DGL: Data Types
- Base: Integers, Strings, Dates, etc.
- Rows: heterogeneous sequence of scalars
  - Inherit operations from scalar types
    ( [1, 4, 5.0] + [2, 3, 4.5] ) ++ ['John'] = [3, 7, 9.5, 'John']
- Iterators: key data type
  - Step(1,5) returns <[1],[2],[3],[4],[5]>
  - Constant( [1,2] ) returns <[1,2], [1,2], [1,2], … >
  - Iterators inherit operations from rows:
    Step(1,5) ++ Step(6,10) = <[1,6], [2,7], [3,8], [4,9], [5,10]>
- Associative tables: main memory, random access

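The row and iterator examples above can be mimicked in a few lines of Python. This is only a sketch, not DGL: rows are modeled as Python lists and iterators as generators, and the helper names (`row_add`, `row_concat`, `step`, `constant`, `iter_concat`) are hypothetical. Stopping at the shorter input when two iterators are combined is an assumption, not something the slides state.

```python
from itertools import islice

def row_add(r1, r2):
    """Element-wise '+' on two rows of equal length."""
    return [a + b for a, b in zip(r1, r2)]

def row_concat(r1, r2):
    """Row concatenation, written '++' on the slide."""
    return r1 + r2

def step(lo, hi):
    """Finite iterator of one-column rows [lo], [lo+1], ..., [hi]."""
    for v in range(lo, hi + 1):
        yield [v]

def constant(row):
    """Infinite iterator that repeats the same row."""
    while True:
        yield list(row)

def iter_concat(it1, it2):
    """Lift '++' from rows to iterators (assumed to stop with the shorter input)."""
    for r1, r2 in zip(it1, it2):
        yield row_concat(r1, r2)

print(row_concat(row_add([1, 4, 5.0], [2, 3, 4.5]), ["John"]))  # [3, 7, 9.5, 'John']
print(list(iter_concat(step(1, 5), step(6, 10))))  # [[1, 6], [2, 7], [3, 8], [4, 9], [5, 10]]
print(list(islice(constant([1, 2]), 3)))           # [[1, 2], [1, 2], [1, 2]]
```

Generators keep the sketch lazy, which is what lets `constant` be infinite without exhausting memory.
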
Primitive Iterators
- Statistical distributions: Uniform, Gaussian, Zipfian, Poisson, etc.
    Uniform( Constant([0,0]), Constant([10,10]) ) = Uniform( [0,0], [10,10] ) (implicit casts).
- SQL: Bridge DGL and DBMS
    Persist (expression, [table name])
    Query (parameterized query, iterator1, …)
- Others: Duplicate elimination, union, etc.

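The Persist/Query bridge can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: SQLite (via the standard `sqlite3` module) stands in for the DBMS, `persist` and `query` are hypothetical helpers, and `query` takes table names returned by `persist` rather than iterators. The `<<i>>` placeholders and the column names `v0`, `v1`, ... follow the notation used in the DGL programs on the slides.

```python
import sqlite3
from itertools import count

_conn = sqlite3.connect(":memory:")   # stand-in DBMS for this sketch
_names = count()                      # source of fresh temporary table names

def persist(rows, table=None):
    """Bulk-load an iterable of rows into a table and return the table name."""
    rows = [tuple(r) for r in rows]
    if not rows:
        raise ValueError("persist() needs at least one row in this sketch")
    table = table or f"tmp{next(_names)}"
    ncols = len(rows[0])
    cols = ", ".join(f"v{i}" for i in range(ncols))
    marks = ", ".join("?" for _ in range(ncols))
    _conn.execute(f"CREATE TABLE {table} ({cols})")
    _conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
    return table

def query(sql, *tables):
    """Replace <<i>> with the i-th table name, run the query, yield rows."""
    for i, name in enumerate(tables):
        sql = sql.replace(f"<<{i}>>", name)
    for row in _conn.execute(sql):
        yield list(row)

proxy = persist([[1, 7], [2, 3], [3, 7], [4, 0]])
distinct = query("SELECT DISTINCT(<<0>>.v1) FROM <<0>> ORDER BY 1", proxy)
print(list(distinct))  # [[0], [3], [7]]
```

Pushing duplicate elimination into the database, as in the last two lines, is the kind of work that would otherwise need the whole intermediate result in memory.
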
Expressions and Functions
- Expressions (acyclic reference graph)
- Functions

Annotated Schemas with DGL
- Annotations: specify how to populate a database
- From annotations to DGL:
  - Create single DGL fragment for all annotations
  - Vertical partitions for inter-table dependencies
  - Query rewriting
  - Proxy introduction

Example: Multidimensional Distributions

SQL Annotation (slide callouts: traditional DDL for the column list, target size for 1000, temporary columns for gaussian and block, parameters for @0, @1, @2, user-defined iterator for ProbUnion):
    CREATE TABLE testDGL (
        posX float,
        posY float,
        posZ float
    ) POPULATE 1000 (
        gaussian = multiGauss([0,0,0],
                              [1000,1000,1000],
                              [@1,@1,@1],
                              1.0, @0, 10000 ),
        block = Uniform([100,100,100],
                        [200,300,400]),
        (posX, posY, posZ) = ProbUnion(gaussian,
                                       block,
                                       @2)
    )

DGL Function (slide callout: Distributions):
    FUN multiGauss(lo, hi, sigma, z, p, N) =
        LET ctrList = Top( Uniform(lo, hi), p ),
            idxs = Zipfian(z, p),
            centers = TApply(ctrList, idxs)
        IN Top( Normal(centers, sigma), N)

C++ UDI:
    class ProbUnion : public CIterator {
    public:
        virtual void _open() { ... }
        virtual bool _getNext(CRow* row) { ... }
    };

[Figure: dataflow of multiGauss. Inputs: Lo, Hi, p, σ, N, z. Operators: Uniform, Top, Zipfian, TApply, Normal, Top]

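To make the multiGauss function above concrete, here is a rough Python analogue. It rests on assumptions the slides do not spell out: that Top(it, n) keeps the first n rows, that TApply(table, idxs) looks up table entries by index, and that Zipfian(z, p) draws indexes over p values with weight 1/k^z. The name `multi_gauss` and the sample argument values (sigma 25, p = 5) are hypothetical; the slide passes sigma and p as parameters @1 and @0.

```python
import random

def multi_gauss(lo, hi, sigma, z, p, n, seed=0):
    """Return n points scattered around p uniformly placed centers.

    Centers are picked with Zipfian skew z: center k (1-based) has weight 1/k**z.
    """
    rng = random.Random(seed)
    # ctrList = Top(Uniform(lo, hi), p): p random centers inside the box [lo, hi]
    centers = [[rng.uniform(l, h) for l, h in zip(lo, hi)] for _ in range(p)]
    # idxs = Zipfian(z, p): skewed choice of center indexes
    weights = [1.0 / (k ** z) for k in range(1, p + 1)]
    idxs = rng.choices(range(p), weights=weights, k=n)
    # Normal(TApply(ctrList, idxs), sigma): Gaussian noise around each chosen center
    return [[rng.gauss(c, s) for c, s in zip(centers[i], sigma)] for i in idxs]

points = multi_gauss([0, 0, 0], [1000, 1000, 1000], [25, 25, 25], z=1.0, p=5, n=10000)
print(len(points), len(points[0]))  # 10000 3
```

With z = 1.0 the first center receives roughly twice as many points as the second, which is what produces clusters of unequal density.
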
Example: Inter-table dependencies
- Employees' ages are normally distributed around 40
- Employees' departments follow a Zipfian distribution
- Employees' bonuses are distributed around the department 'category', which depends on budget
- Dept budget is normally distributed around 10000 * size (number of employees)
- Dept building follows a Zipfian distribution

Annotated schemas:
    CREATE TABLE employee (
        empID int,
        deptID int,
        age int,
        bonus int
    ) POPULATE 10000 (
        empID = Step(0,10000),
        age = Normal(40.0,5.0),
        deptID = Zipfian(0.75, 50),
        empBaseBonus = Query("
            SELECT D.budget / 1000
            FROM employee JOIN dept
            ON employee.deptID = dept.deptID
            ORDER BY employee.empID"),
        bonus = empBaseBonus * Uniform(0.5,1.5) )

    CREATE TABLE dept (
        deptID int,
        budget float,
        building int
    ) POPULATE (
        (deptID, size) = Query("
            SELECT employee.deptID, count(*)
            FROM employee
            GROUP BY employee.deptID"),
        budget = Normal(10000*size, 5000),
        building = Zipfian(1.0, 20)
    )

Resulting DGL program:
    LET employeeempID = Top ( Step ( 0, 10000 ), [...] ),
        employeeage = Top ( Normal ( 40.00, 5.00 ), 10000 ),
        deptbuilding = Zipfian ( 1.00, 20 ),
        employeedeptIDProxy = Persist ( Top ( Zipfian ( 0.75, 50 ), 10000 ) ),
        employeedeptID = Query ( "SELECT * FROM <<0>>", employeedeptIDProxy ),
        deptdeptIDsize = Query ( "SELECT <<0>>.v0, count(*) FROM <<0>> GROUP BY <<0>>.v0",
                                 employeedeptIDProxy ),
        deptbudget = Normal ( 10000 * deptdeptIDsize_1, 5000 ),
        tmp1Proxy = Persist ( deptbudget & deptdeptIDsize ),
        tmp2Proxy = Persist ( employeedeptID & employeeempID ),
        employeeempBaseBonus = Top ( Query (
            "SELECT <<0>>.v0 / 1000 FROM <<1>> JOIN <<0>> ON <<1>>.v0 = <<0>>.v1 ORDER BY <<1>>.v1",
            tmp1Proxy, tmp2Proxy ), 10000 ),
        ...
    IN Union (
        Persist ( employeeempID & employeeage & employeedeptID & employeebonus, "employee" ),
        Persist ( deptdeptIDsize_0 & deptbudget & deptbuilding, "dept" )
    )

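The dependency chain in this example (employee rows first, then department sizes derived from them, then budgets, then bonuses that join back to departments) can be traced in plain Python. This is a sketch under assumptions: a `Counter` replaces the GROUP BY query, a dictionary lookup replaces the join, Zipfian(0.75, 50) is read as weights 1/k^0.75 over 50 values, and all variable names are hypothetical.

```python
import random
from collections import Counter

rng = random.Random(42)
N_EMPLOYEES, N_DEPTS = 10000, 50

# deptID = Zipfian(0.75, 50): department k (1-based) has weight 1/k**0.75
weights = [1.0 / (k ** 0.75) for k in range(1, N_DEPTS + 1)]
dept_ids = rng.choices(range(N_DEPTS), weights=weights, k=N_EMPLOYEES)

# empID and age = Normal(40.0, 5.0), one row per employee
employees = [
    {"empID": e, "age": round(rng.gauss(40.0, 5.0)), "deptID": d}
    for e, d in enumerate(dept_ids)
]

# (deptID, size) = SELECT deptID, count(*) FROM employee GROUP BY deptID
sizes = Counter(emp["deptID"] for emp in employees)

# budget = Normal(10000 * size, 5000): depends on data generated for employee
depts = {
    d: {"size": n, "budget": rng.gauss(10000 * n, 5000)}
    for d, n in sizes.items()
}

# bonus = (budget / 1000) * Uniform(0.5, 1.5), found through the join on deptID
for emp in employees:
    base = depts[emp["deptID"]]["budget"] / 1000
    emp["bonus"] = round(base * rng.uniform(0.5, 1.5))

print(len(employees), sum(d["size"] for d in depts.values()))  # 10000 10000
```

The order of the steps matters: department sizes cannot be computed before employee department IDs exist, which mirrors why the DGL program persists `employeedeptIDProxy` before querying it.
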
Evaluation Model
- Iterator model (open/getNext/close)
  - Program is DAG
  - Depending on consumers, buffering is required
- In-memory circular queue that spills to disk

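Why a DAG of iterators needs buffering can be seen in a small Python sketch: when two consumers read the same producer at different speeds, rows the faster one has already seen must be kept until the slower one catches up. The class below is a hypothetical illustration, not the runtime described in the talk. It keeps the buffer in memory only and omits spilling to disk.

```python
from collections import deque

class BufferedSource:
    """Lets several consumers read one producer, each at its own pace."""

    def __init__(self, producer, n_consumers):
        self._producer = iter(producer)
        self._buffer = deque()          # rows not yet read by every consumer
        self._base = 0                  # absolute index of self._buffer[0]
        self._pos = [0] * n_consumers   # next absolute index for each consumer

    def get_next(self, consumer):
        """Return the consumer's next row, or None when the producer is exhausted."""
        idx = self._pos[consumer]
        # Pull from the producer until the requested row is in the buffer.
        while idx - self._base >= len(self._buffer):
            try:
                self._buffer.append(next(self._producer))
            except StopIteration:
                return None
        row = self._buffer[idx - self._base]
        self._pos[consumer] = idx + 1
        # Drop rows that every consumer has already read.
        while self._buffer and self._base < min(self._pos):
            self._buffer.popleft()
            self._base += 1
        return row

src = BufferedSource(range(3), n_consumers=2)
fast = [src.get_next(0) for _ in range(3)]   # consumer 0 runs ahead
slow = [src.get_next(1) for _ in range(3)]   # consumer 1 replays from the buffer
print(fast, slow)  # [0, 1, 2] [0, 1, 2]
```

The buffer grows with the gap between the fastest and slowest consumer, which is the situation where a queue that can spill to disk becomes useful.
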
Examples
- Multi-gaussian
- Wisconsin Benchmark
- Skewed primary/foreign key joins

Complex TPC-H Examples
- Order arrivals follow a Poisson distribution starting on '1992/01/01'.
- All parts in an order are sold by suppliers that live in the same country as the customer.
- Item discounts are correlated to the global number of parts sold.
- Top 100 customers' debt is normally distributed around 3*balances. Remaining customers, around balances/2.
- Number of items per order follows a Zipfian distribution.
- Ship date occurs k days after order date, where k follows Zipfian.
- Commit and receipt dates follow a bi-gaussian distribution after ship date.

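Two of the date rules above (Poisson order arrivals from 1992/01/01, and ship dates k days later with Zipfian k) can be sketched in Python. The arrival rate, the 30-day cap on k, the skew value, and the function names are all assumptions chosen for illustration; the slides give none of them. Poisson arrivals are generated in the standard way, by summing exponentially distributed gaps.

```python
import random
from datetime import datetime, timedelta

def order_dates(n, rate_per_day, start=datetime(1992, 1, 1), seed=0):
    """Arrival times of a Poisson process: exponential gaps between orders."""
    rng = random.Random(seed)
    t = start
    out = []
    for _ in range(n):
        t += timedelta(days=rng.expovariate(rate_per_day))
        out.append(t)
    return out

def ship_dates(orders, max_days=30, z=1.0, seed=1):
    """Ship date = order date + k days, k drawn with weight 1/k**z over 1..max_days."""
    rng = random.Random(seed)
    ks = range(1, max_days + 1)
    weights = [1.0 / (k ** z) for k in ks]
    return [o + timedelta(days=rng.choices(ks, weights=weights)[0]) for o in orders]

orders = order_dates(5, rate_per_day=2.0)
for o, s in zip(orders, ship_dates(orders)):
    print(o.date(), "->", s.date())
```

Because each ship date is computed from its own order date, the correlation between the two columns holds row by row, which independent per-column generators would not guarantee.
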
Initial Performance Results
- Populate 1GB databases with various generators

Conclusion
- Creating datasets for quality evaluation of new database components is time-consuming
- DGL is expressive and easy to use
- SQL annotations reduce time needed to create and populate databases with non-trivial correlations