Download An SQL API for Object Oriented Perl

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Database wikipedia , lookup

DBase wikipedia , lookup

Microsoft Access wikipedia , lookup

Oracle Database wikipedia , lookup

Relational algebra wikipedia , lookup

Ingres (database) wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Microsoft Jet Database Engine wikipedia , lookup

Clusterpoint wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Open Database Connectivity wikipedia , lookup

Versant Object Database wikipedia , lookup

Microsoft SQL Server wikipedia , lookup

Join (SQL) wikipedia , lookup

SQL wikipedia , lookup

PL/SQL wikipedia , lookup

Database model wikipedia , lookup

Relational model wikipedia , lookup

Transcript
An SQL API for
Object Oriented Perl
Tim Adye
BaBar Collaboration
Particle Physics Department
eScience Centre informal seminar
12th November 2004
12th November 2004
Tim Adye
1
Talk Plan
• Summary of BaBar and its bookkeeping system
• “large-scale distributed metadata catalogue” in Gridspeak
• The problem
• “Why not just write SQL?”
• The BaBar SQL API
•
•
•
•
User view: Perl classes and command-line tool
Behind the scenes
Table schema configuration classes
Summary of features
• Could this be generalised to other applications?
• Possible improvements
• Comparison with other DBIx packages
• Summary and references
12th November 2004
Tim Adye
2
Feedback and Discussion
• Would this be useful outside BaBar?
• Would it be worth making a public release?
• eg. on CPAN
• Does it need any improvements?
• New features
• Make it compatible with some other standards
• eg. sit on top of another abstraction like DBIx::Table
• A better name!
• or have I just reinvented the wheel?
• If so, maybe I should import some classic wheel
technology into BaBar, rather than try to export my own
“square wheel” idea to the rest of the world
12th November 2004
Tim Adye
3
The BaBar Collaboration
10 Countries
12th November 2004
77 Institutions
Tim Adye
593 Physicists
4
The BaBar Detector at SLAC
Linear Accelerator
26th May 1999: first events recorded by BaBar
2.2 km PEP-II ring
BaBar
The BaBar Bookkeeping
• New bookkeeping system created in parallel with
new data model
• Switch from OO database (Objectivity) to flat files
(ROOT)
• Much simpler to manage, but no longer have the
top-level Objectivity catalogue
• Actually the catalogue didn’t really help us, so random set
of bookkeeping tools had been build up
• The main aim of the project was to create a coherent
bookkeeping system to replace these ad hoc tools
• Successfully deployed Feb 2004. In use ever since.
• Started development in Jan 2003
• beta test in Autumn
12th November 2004
Tim Adye
6
Bookkeeping Concepts
• Currently BaBar analysis data consists of
•
•
•
•
43G events (event = results from one e+e- collision)
in 335k files
total size of 186 TB
156M rows in largest table
• Expect all to grow geometrically over the next few years
• Data grouped in an overlapping hierarchy
• many-to-many relations between tables, eg.
•
•
•
•
1.7M runs: unit of datataking
219k collections: unit of access to event store
335k files: each collection stored in a few physical files
18k datasets: groups of data that analysis users want
• Need to select runs, collections, etc based on attributes
from any of these tables
12th November 2004
Tim Adye
7
8
Features of the BaBar Bookkeeping
• Decoupled from event store
• Job configuration written to site-independent tcl file
• Worker nodes don’t need access to database server
• Mirrored to other sites
• From SLAC to 4 “Tier A”, ~15 “Tier C”, user laptops
• Supports Oracle and MySQL (mirror OracleMySQL)
• Remote access
• Used for mirroring and for sites who don’t want local db
• Connection parameter/key distribution system
• Eg. BookkeepingCommand –-site=slac or –-site=ral
transparently connects to correct Oracle server at SLAC or
MySQL at RAL.
• Allows fallback servers if there are problems
12th November 2004
Tim Adye
9
Other Features of the Bookkeeping
• Import of data to each site managed by local
bookkeeping system
• Each site knows what it has and what it wants
• New data imported automatically in cron job
• Task management system
• Controls job submission, checking, resubmission
• Currently just for “production jobs”, but soon for user
analysis jobs too
• Looking at interfacing this to LCG job submission
• Already use LCG for UK Monte Carlo production
12th November 2004
Tim Adye
10
User Access
• Users need to query database to find out what
data to process
• May also need other information
• eg. luminosity, run numbers, file sizes
• Mostly select by dataset, but may need to limit further
• eg. only data available locally, taken at “peak” energy,
excluding some problem datataking period
• Cannot expect users to know which tables to use,
how to join them, or even the SQL syntax
• Even worse if the schema change
• Cannot expect developers to code for all
combinations of queries with all possible selections
• Previous ad hoc tools (some mine!) tried to do this and it
was a nightmare, even for a simpler table structure
12th November 2004
Tim Adye
11
BaBar SQL API – user view
• Each column that users might want to query or
select on is given a unique logical name – regardless
of which table it lives in
• These names are used to specify query values
$query->addValues('collection', 'gbytes');
and selections
Each of these
happens to be in
a different table
$query->addSelector('dataset', 'Dilepton-*');
$query->addSelector('run', '10000-19999');
• Different types of data allow for different selection
syntax, eg. wildcards for names, or ranges for run numbers.
• Can also use SQL expressions (in terms of logical names)
$query->addValues('SUM(lumi)/1000');
and sorting, row limits, etc
12th November 2004
Tim Adye
12
SQL API – returning results
• That’s enough to generate a valid SQL SELECT
query. To return the results:my $sta = $query->execute();
while (my $row = $sta->fetch()) {
print $row->gbytes(), $row->collection(), "\n";
}
• The $query object collects the user requests
• $query->execute() returns a “statement accessor” (like a
DBI statement handle).
• $sta iterates over row objects, each of which has accessors
for each query value, gbytes and collection.
• That’s all there is to it!
• After the usual DBI connect, and $query object
instantiation (see later), these statements form a
working program
12th November 2004
Tim Adye
13
Command-Line Tools
• Standard BaBar tools use this API to create job
configuration, create datasets, calculate
luminosities, etc.
• Standard tasks, but optionally allowing additional
selections
• Also provide an “expert” tool that allows access to
full API functionality from the command line
• This has proved very popular, with many “non-experts”
making their own unique queries
12th November 2004
Tim Adye
14
Examples
$ BbkUser --dataset=A0-Run4-OnPeak-R14 \
--is_local=1 --file_status=0 \
dse_lumi events gbytes file
DSE_LUMI
========
1250.3
1250.3
1348.4
1348.4
...
156 rows
EVENTS GBYTES FILE
====== ====== ====================================================
526115
1.6 /store/PRskims/R14/14.4.0d/A0/02/A0_0239.01.root
526115
0.8 /store/PRskims/R14/14.4.0d/A0/02/A0_0239.02HBCA.root
576239
1.6 /store/PRskims/R14/14.4.0d/A0/02/A0_0240.01.root
576239
1.0 /store/PRskims/R14/14.4.0d/A0/02/A0_0240.02HBCA.root
returned
$ BbkUser –-collection-file=coll.lis \
tot_gbytes collection
12th November 2004
Tim Adye
15
What happens behind your back
• The SQL API
• translates the logical names to table columns
• selects the required tables and joins
• including otherwise unused tables required for the joins
• generates and executes a valid SQL SELECT statement
• creates a statement accessor object
• dynamically generates a class for the row objects with
accessors for each query value
12th November 2004
Tim Adye
16
BbkUser --dataset=A0-Run4-OnPeak-R14
--is_local=1 --file_status=0
dse_lumi events gbytes file
Our Example
• That first BbkUser command involved 5 tables
• including one that provides the join between dataset and
collection tables
SELECT dse.lumi_sum AS "dse_lumi",
dse.output_nev AS "events",
file.bytes,
dse.name AS "collection",
file.suffix AS "file_suffix", ds.id AS "ds_id",
dse.id AS "dse_id",
dtd.id AS "dtd_id",
dtd.link_status
FROM bbk_dataset ds, bbk_dsentities dse, data_files dfile,
bbk_files file, bbk_dstodse dtd
WHERE ds.id=dtd.ds_id
The SQL API can even
AND dtd.dse_id=dse.id
pretty-print it like this for you 
AND dse.id=file.dse_id
(What’s shown here is somewhat
AND file.id=dfile.file_id
abbreviated: actual command
AND ds.name='A0-Run4-OnPeak-R14'
includes full database and table
AND dse.is_local='1'
names in case of ambiguities)
AND dfile.status='0';
12th November 2004
Tim Adye
17
18
Table schema configuration classes
• Mapping between logical names and table columns is
defined in the configuration classes.
• One class per table
• Can also define special properties of each column (eg. whether
to allow ranges (“100-199”) for selection).
• Possible joins between tables defined here too
• Use logical column names for join conditions, so one table class
does not need to know about column names in other classes.
• In most cases it’s just a matter of listing logical vs.
column names
• with a little Perl syntactic sugar
• Inheritence of config classes expresses commonalities
• eg. common id and created, and modified columns
12th November 2004
Tim Adye
19
Example Table Configuration
sub table { return 'bbk_files' }
sub tableConfig {
return {
alias => 'file',
columns => [
bytes
uuid
checksum
file_suffix
nfiles
gbytes
tot_gbytes
file_dse_id
file
],
joins => [
dse_id
file_id
],
}}
=>
=>
=>
=>
=>
=>
=>
=>
=>
'bytes‘,
'uuid',
'checksum',
'suffix',
'COUNT(DISTINCT file_id)',
'(bytes/1073741824)',
'SUM(bytes)/1073741824',
{ column => 'dse_id', selectorType => 'range' },
{ valueAction => 'addLfnValue', selectorAction => 'lfnSelector' },
=> 'file_dse_id',
=> 'dfile_file_id',
12th November 2004
Tim Adye
20
Putting it all together
• Configuration classes must be registered with
$query object
my $query = new BbkSqlSelect;
$query->addModules(new MyTableClass($dbi));
but of course it is usually simpler to provide a
$query object pre-registered with all the table
configs as part of a specific API.
12th November 2004
Tim Adye
21
Overriding and Synthetic Columns
• A crutial advantage of this system is that it allows us to
override the default behaviour
• Allows us to hide complexities from user
• Make even complex schema changes transparent to users
• A logical name can refer to
•
•
•
•
ordinary database column name
SQL expression (in terms of database columns, or other logical names)
Perl method to pre- or post-process selection or query value
“synthetic” query value or selection
• can return calculated value or alter behaviour
• Global post-processing
• Can be triggered by value, selection, or table inclusion
• Allows global filtering of returned rows
12th November 2004
Tim Adye
22
What happens behind your back 2
•
We already used some of these features without
noticing!
1. Dataset names can be found in the bbk_dataset or the
bbk_aliases table
•
Requires a check and translation using the alias table
2. Datasets can evolve with time, with collection being
added or removed
•
•
Need to query dataset for any time in the past, or use
tagged dataset alias (like a CVS tag)
Implemented by automatically including date selection in
query, and post-processing returned results to remove
deleted collections
3. File names are made from a collection name + a suffix
•
$query->addSelector('file') splits the file name for the
query and the $row->file() accessor rejoins them
12th November 2004
Tim Adye
23
Features
• Supports Oracle 8 and MySQL 3.23
• Most queries that can be expressed in both these dialects
can be expressed by users via the API – without breaking the
paradigm of a flat namespace
•
•
•
•
•
aggregatation and grouping
sorting and distinct
MySQL’s LIMIT emulated in Oracle
inner and outer joins (generates Oracle or MySQL syntax)
Does not support UNION or subqueries
• Could be added, but not in MySQL 3.23
• Convenience features
• automatic Getopt specification
• query results display formatting and summary table generation
• Configuration class summary table generation
12th November 2004
Tim Adye
24
Limitations
• Assumes tables can be joined in a unique way
• ie. the joins form an acyclic graph
• can still select different joins with explicit switches
• Each column must have its own unique logical name
• This is usually a good thing
• but if the same data is held in different columns, it would
be more efficient to automatically select from tables
that are already included
12th November 2004
Tim Adye
25
A public version
• Current version has a few BaBar-specific pieces
•
•
•
•
BaBar Connection manager – can use DBI directly
BaBar Options manager – can use Getopt directly
BaBar base objects – borrow required methods
BaBar table formatting class – remove functionality or publish
this too
… otherwise just uses standard Perl modules
but with different table configs could be used
elsewhere
• Already do this in BaBar – used for QA and TM databases
• Maybe I’m making some other assumptions that are true of our
database and requirements, but not more generally so. I can’t
think of any.
• Needs a better name!
• This is really an SQL API creator
12th November 2004
Tim Adye
26
Possible improvements
• Tidy up code!
• User and config APIs are OK, but in between it’s pretty ugly
• Separate functionality that can be used on its own
• Already true of the DBI statement accessor class
• More SQL dialects: PostgreSQL, MS SQL?
• New SQL syntax: subqueries, UNIONs,…
• INSERT, UPDATE, etc
• these don’t need joins, so hand-coding not such a problem
• Automatic selection of different join possibilities
• Automatic generation of default table classes from SQL
schema
• Could use “The SQL Fairy”
• though not much work to do it by hand
12th November 2004
Tim Adye
27
Why not use another package?
• More than 100 DBIx and other SQL access packages
in CPAN
• Could not find any that do all (or even most) of
•
•
•
•
•
•
hide table structure from user
allow multi-table queries, taking care of joins automatically
do not impose their own conventions on table schema
allow query values and selections to be overridden
allow transparent post-processing of query results
provide accessor functions for query results
• I believe that taken together these features provide
a clear and easy to use abstraction
12th November 2004
Tim Adye
28
References
• The BaBar Experiment
http://hepunx.rl.ac.uk/BFROOT/
• BaBar Bookkeeping project
http://hepunx.rl.ac.uk/BFROOT/www/Computing/Distribu
ted/Bookkeeping/Documentation/
• BaBar Bookkeeping presentation and paper
http://indico.cern.ch/contributionDisplay.py?contrib
Id=338&sessionId=7&confId=0
D.A. Smith et al.,
BaBar Book Keeping project –
a distributed meta-data catalog of the BaBar event store,
Proc. Computing in High Energy and Nuclear Physics 2004 (CHEP04).
• CPAN Database Interfaces
(see particularly DBIx)
http://cpan.uwinnipeg.ca/chapter/Database_Interfaces
12th November 2004
Tim Adye
29