The Bookkeeping SQL API
Tim Adye, Rutherford Appleton Laboratory
Bookkeeping / Data Distribution Parallel
BaBar Collaboration Meeting, 8th December 2004

Talk Plan
• The problem
• “Why not just write SQL?”
• The BaBar SQL API
  • User view: Perl classes and command-line tool
  • Behind the scenes
  • Table schema configuration classes
  • Summary of features
• Could this be generalised to other applications?
• Possible improvements
• Comparison with other DBIx packages
• Summary and references

User Access
• Users need to query the database to find out what data to process
• May also need other information
  • eg. luminosity, run numbers, file sizes
• Mostly select by dataset, but may need to limit further
  • eg. only data available locally, taken at peak energy, excluding some problem data-taking period
• Cannot expect users to know which tables to use, how to join them, or even the SQL syntax
  • Even worse if the schema changes
• Cannot expect developers to code for all combinations of queries with all possible selections
  • Previous ad hoc tools (some mine!) tried to do this and it was a nightmare, even for a simpler table structure

BaBar SQL API – user view
• Each column that users might want to query or select on is given a unique logical name – regardless of which table it lives in
• These names are used to specify query values
      $query->addValues('collection', 'gbytes');
  and selections
      $query->addSelector('dataset', 'Dilepton-*');
      $query->addSelector('run', '10000-19999');
  (Each of these happens to be in a different table.)
• Different types of data allow for different selection syntax, eg. wildcards for names, or ranges for run numbers.
• Can also use SQL expressions (in terms of logical names)
      $query->addValues('SUM(lumi)/1000');
  and sorting, row limits, etc

SQL API – returning results
• That’s enough to generate a valid SQL SELECT query.
• To return the results:
      my $sta = $query->execute();
      while (my $row = $sta->fetch()) {
          print $row->gbytes(), ' ', $row->collection(), "\n";
      }
• The $query object collects the user requests
• $query->execute() returns a “statement accessor” (like a DBI statement handle).
• $sta iterates over row objects, each of which has accessors for each query value, gbytes and collection.
• That’s all there is to it!
  • After the usual DBI connect, and $query object instantiation (see later), these statements form a working program

Command-Line Tools
• Standard BaBar tools use this API to create job configuration, create datasets, calculate luminosities, etc.
  • Standard tasks, but optionally allowing additional selections
• Also provide an “expert” tool that allows access to full API functionality from the command line
  • This has proved very popular, with many “non-experts” making their own unique queries

Examples
    $ BbkUser --dataset=A0-Run4-OnPeak-R14 \
              --is_local=1 --file_status=0 \
              dse_lumi events gbytes file \
              --style=adye --display

    DSE_LUMI  EVENTS  GBYTES  FILE
    ========  ======  ======  ====================================================
      1250.3  526115     1.6  /store/PRskims/R14/14.4.0d/A0/02/A0_0239.01.root
      1250.3  526115     0.8  /store/PRskims/R14/14.4.0d/A0/02/A0_0239.02HBCA.root
      1348.4  576239     1.6  /store/PRskims/R14/14.4.0d/A0/02/A0_0240.01.root
      1348.4  576239     1.0  /store/PRskims/R14/14.4.0d/A0/02/A0_0240.02HBCA.root
    ...
    156 rows returned

    $ BbkUser --collection-file=coll.lis \
              tot_gbytes collection

What happens behind your back
• The SQL API
  • translates the logical names to table columns
  • selects the required tables and joins
    • including otherwise unused tables required for the joins
  • generates and executes a valid SQL SELECT statement
  • creates a statement accessor object
  • dynamically generates a class for the row objects with accessors for each query value

Our Example
    BbkUser --dataset=A0-Run4-OnPeak-R14 --is_local=1 --file_status=0 dse_lumi events gbytes file
• That first BbkUser command involved 5 tables
  • including one that provides the join between dataset and collection tables

    SELECT dse.lumi_sum AS "dse_lumi",
           dse.output_nev AS "events",
           file.bytes,
           dse.name AS "collection",
           file.suffix AS "file_suffix",
           ds.id AS "ds_id",
           dse.id AS "dse_id",
           dtd.id AS "dtd_id",
           dtd.link_status
      FROM bbk_dataset ds,
           bbk_dsentities dse,
           data_files dfile,
           bbk_files file,
           bbk_dstodse dtd
     WHERE ds.id=dtd.ds_id
       AND dtd.dse_id=dse.id
       AND dse.id=file.dse_id
       AND file.id=dfile.file_id
       AND ds.name='A0-Run4-OnPeak-R14'
       AND dse.is_local='1'
       AND dfile.status='0';

• The SQL API can even pretty-print it like this for you
  (What’s shown here is somewhat abbreviated: actual command includes full database and table names in case of ambiguities)

Table schema configuration classes
• Mapping between logical names and table columns is defined in the configuration classes.
  • One class per table
  • Can also define special properties of each column (eg. whether to allow ranges (“100-199”) for selection).
• Possible joins between tables defined here too
  • Use logical column names for join conditions, so one table class does not need to know about column names in other classes.
• In most cases it’s just a matter of listing logical vs. column names
  • with a little Perl syntactic sugar
• Inheritance of config classes expresses commonalities
  • eg. common id, created, and modified columns

Example Table Configuration
    sub table { return 'bbk_files' }
    sub tableConfig { return {
      alias   => 'file',
      columns => [
        bytes       => 'bytes',
        uuid        => 'uuid',
        checksum    => 'checksum',
        file_suffix => 'suffix',
        nfiles      => 'COUNT(DISTINCT file_id)',
        gbytes      => '(bytes/1073741824)',
        tot_gbytes  => 'SUM(bytes)/1073741824',
        file_dse_id => { column => 'dse_id', selectorType => 'range' },
        file        => { valueAction    => 'addLfnValue',
                         selectorAction => 'lfnSelector' },
      ],
      joins => [
        dse_id  => 'file_dse_id',
        file_id => 'dfile_file_id',
      ],
    }}

Putting it all together
• Configuration classes must be registered with the $query object
      my $query = BbkSqlSelect->new($bbkconfig);
      $query->addModules(MyTableClass->new($bbkconfig));
  but of course it is usually simpler to provide a $query object pre-registered with all the table configs as part of a specific API.
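
The slides do not show the API's internals, so the following is only a toy sketch of how registered table configurations could drive query generation: logical names are mapped to table columns, and the join tree is pruned down to the tables actually needed plus any link tables between them. The table and column names are taken from the SQL on the "Our Example" slide; the %TABLE, %COLUMN and @JOIN structures and the build_select function are hypothetical and are not part of the BaBar code.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical toy configuration (not the BaBar classes).
# Table alias => table name, as in the SQL on the "Our Example" slide.
my %TABLE = (ds => 'bbk_dataset', dse => 'bbk_dsentities', dtd => 'bbk_dstodse');

# Logical name => [table alias, column].
my %COLUMN = (
    dataset    => [ds  => 'name'],
    collection => [dse => 'name'],
    dse_lumi   => [dse => 'lumi_sum'],
    events     => [dse => 'output_nev'],
);

# Join conditions, assumed to form an acyclic graph: ds - dtd - dse.
my @JOIN = (
    [ds  => dtd => 'ds.id=dtd.ds_id'],
    [dtd => dse => 'dtd.dse_id=dse.id'],
);

sub build_select {
    my ($values, $selectors) = @_;   # array ref of names, hash ref name => pattern
    die "no query values\n" unless @$values;

    # 1. Translate logical names to columns and note which tables they need.
    my (%needed, @select, @where, @bind);
    for my $name (@$values) {
        my ($alias, $col) = @{ $COLUMN{$name} or die "unknown logical name '$name'\n" };
        $needed{$alias} = 1;
        push @select, qq{$alias.$col AS "$name"};
    }
    for my $name (sort keys %$selectors) {
        my ($alias, $col) = @{ $COLUMN{$name} or die "unknown logical name '$name'\n" };
        $needed{$alias} = 1;
        my $value = $selectors->{$name};
        # A '*' wildcard becomes an SQL LIKE pattern; otherwise match exactly.
        my $op = ($value =~ tr/*/%/) ? 'LIKE' : '=';
        push @where, "$alias.$col $op ?";
        push @bind, $value;
    }

    # 2. Prune the join tree: repeatedly drop any table that is not needed and
    #    has at most one remaining join. What is left is the needed tables
    #    plus the link tables required to connect them.
    my %keep  = map { $_ => 1 } keys %TABLE;
    my @joins = @JOIN;
    my $changed = 1;
    while ($changed) {
        $changed = 0;
        my %degree;
        $degree{$_}++ for map { @$_[0, 1] } @joins;
        for my $alias (sort keys %keep) {
            next if $needed{$alias} or ($degree{$alias} // 0) > 1;
            delete $keep{$alias};
            @joins = grep { $_->[0] ne $alias and $_->[1] ne $alias } @joins;
            $changed = 1;
        }
    }

    # 3. Assemble the statement; join conditions precede the user selections.
    my $from = join ', ', map { "$TABLE{$_} $_" } sort keys %keep;
    unshift @where, map { $_->[2] } @joins;
    my $sql = 'SELECT ' . join(', ', @select) . " FROM $from";
    $sql .= ' WHERE ' . join(' AND ', @where) if @where;
    return ($sql, @bind);
}

my ($sql, @bind) = build_select([qw(dse_lumi events)], { dataset => 'A0-Run4-*' });
print "$sql\n";
print "bind values: @bind\n";
```

In this call the dataset selection lives in bbk_dataset and the query values in bbk_dsentities, so the sketch also pulls in the otherwise unused link table bbk_dstodse. Asking only for collection would prune the FROM list down to bbk_dsentities alone. Pruning leaves in this way is valid only because the joins are assumed to form an acyclic graph, which is the same assumption the talk lists under its limitations.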

Overriding and Synthetic Columns
• A crucial advantage of this system is that it allows us to override the default behaviour
  • Allows us to hide complexities from the user
  • Makes even complex schema changes transparent to users
• A logical name can refer to
  • ordinary database column name
  • SQL expression (in terms of database columns, or other logical names)
  • Perl method to pre- or post-process selection or query value
  • “synthetic” query value or selection
    • can return calculated value or alter behaviour
• Global post-processing
  • Can be triggered by value, selection, or table inclusion
  • Allows global filtering of returned rows

What happens behind your back 2
• We already used some of these features without noticing!
1. Dataset names can be found in the bbk_dataset or the bbk_aliases table
   • Requires a check and translation using the alias table
2. Datasets can evolve with time, with collections being added or removed
   • Need to query dataset for any time in the past, or use tagged dataset alias (like a CVS tag)
   • Implemented by automatically including date selection in query, and post-processing returned results to remove deleted collections
3. File names are made from a collection name + a suffix
   • $query->addSelector('file') splits the file name for the query and the $row->file() accessor rejoins them

Features
• Supports Oracle 8 and MySQL 3.23
• Most queries that can be expressed in both these dialects can be expressed by users via the API – without breaking the paradigm of a flat namespace
  • aggregation and grouping
  • sorting and distinct
  • MySQL’s LIMIT emulated in Oracle
  • inner and outer joins (generates Oracle or MySQL syntax)
  • Does not support UNION or subqueries
    • Could be added, but not in MySQL 3.23
• Convenience features
  • automatic Getopt specification
  • query results display formatting and summary table generation
  • Configuration class summary table generation

Limitations
• Assumes tables can be joined in a unique way
  • ie. the joins form an acyclic graph
  • can still select different joins with explicit switches
• Each column must have its own unique logical name
  • This is usually a good thing
  • but if the same data is held in different columns, it would be more efficient to automatically select from tables that are already included

A public version
• Current version has a few BaBar-specific pieces
  • BaBar Connection/Configuration manager – can use DBI directly
  • BaBar Options manager – can use Getopt directly
  • BaBar base objects – borrow required methods
  • BaBar table formatting class – publish this too
  … otherwise just uses standard Perl modules
• but with different table configs could be used elsewhere
  • Already do this in BaBar – used for QA and TM databases
  • Maybe I’m making some other assumptions that are true of our database and requirements, but not more generally so. I can’t think of any.
• Needs a better name!
  • This is really an SQL API creator
  • DBIx::SqlAbstractor ???

Possible improvements
• Tidy up code!
  • User and config APIs are OK, but in between it’s pretty ugly
• Separate functionality that can be used on its own
  • Already true of the DBI statement accessor class
• More SQL dialects: PostgreSQL, MS SQL?
• New SQL syntax: subqueries, UNIONs,…
• INSERT, UPDATE, etc
  • these don’t need joins, so hand-coding not such a problem
• Automatic selection of different join possibilities
• Automatic generation of default table classes from SQL schema
  • Could use “The SQL Fairy”
  • though not much work to do it by hand

Why not use another package?
• More than 100 DBIx and other SQL access packages in CPAN
• Could not find any that do all (or even most) of
  • hide table structure from user
  • allow multi-table queries, taking care of joins automatically
  • do not impose their own conventions on table schema
  • allow query values and selections to be overridden
  • allow transparent post-processing of query results
  • provide accessor functions for query results
• I believe that taken together these features provide a clear and easy to use abstraction

Feedback and Discussion
• Would this be useful outside BaBar?
• Is it a good idea to make a public release?
  • eg. on CPAN
• Does it need any improvements?
  • New features
  • Make it compatible with some other standards
    • eg. sit on top of another abstraction like DBIx::Table
  • A better name!

References
• BaBar Bookkeeping project
  http://slac.stanford.edu/BFROOT/www/Computing/Distributed/Bookkeeping/Documentation/
• BaBar Bookkeeping presentation and paper
  http://indico.cern.ch/contributionDisplay.py?contribId=338&sessionId=7&confId=0
  D.A. Smith et al., BaBar Book Keeping project – a distributed meta-data catalog of the BaBar event store, Proc. Computing in High Energy and Nuclear Physics 2004 (CHEP04).
• CPAN Database Interfaces (see particularly DBIx)
  http://cpan.uwinnipeg.ca/chapter/Database_Interfaces