Download ppt - Institute of Physics

High-Level Data Access With Grace Ease Greg Landsberg Prague Software Workshop Run I Lessons Tower of Babylon!  Plethora of Ntuple formats: How to avoid this in Run II?  Enforce standard physics analysis  QCD packages  WZQCD  Strongly enforce standard object  Top ID  LQ  Create and support standard  SUSY heptuples …  Proprietary physics analysis code  Develop a number of WWWfriendly high-end analysis tools  Number of nearly identical particle ID criteria  The ultimate product of this  Lack of standard documented collaboration is a flow of highways to start a new physics quality papers, and we have to analysis make it as easy as possible for Result: a living hell for remote remote physicists to efficiently collaborators! contribute to physics analyses Standard Object ID  The two key components are: Strong ID groups Strong management  Strong ID groups:  Develop out-of-the-box particle objects and ID criteria  Provide enough versatility to satisfy different physics needs (e.g. low energy and high energy electrons)  Provide well optimized selection tools  Provide efficiencies and fake probabilities for standard ID cuts  Strong Management  No new analysis is allowed to start with a proprietary object ID  If a new object ID is convincingly proved to be essential as a result of a new analysis, it should be immediately standardized and either be added to a list of accepted standards or replace one of the existing standards  New efficiencies and fake probabilities are then calculated for a new ID  Standard objects should be used in standard heptuples Standard HEPtuple Proposal  To become a standard, these  The proposed format is based on heptuples need to be introduced my experience in analyzing Run I early on, before people go off data; it’s just a starting point! and do MC-based physics  Necessary components: analyses  Event tags  The time is now!  Triggers  They should be enforced  Accelerator conditions (management) and supported  Global quantities (ID groups)  Basic physics quantities  Versatile enough to meet most  Electrons/photons of the physics goals  Muons  Expandable to accommodate  Jets new analyses  t’s, b- and c-jets  RCP/WWW-controlled user  High-pT tracks interface …  Sufficiently smaller than mDST, i.e.  2 KB/event Standard HEPtuples details Triggers Event Tags Accelerator Run # L0 Instantaneous Lum Event # MI Flag Accelerator conditions BadRun Flag L1 bit string Calorimeter baselines Time Stamp L2 bit string Average int/crossings Subdetector flags L3 bit string … … Compressed L1 info Compressed L2 info Compressed L3 info User-defined list of L1/L2/L3 triggers … Standard HEPtuples details (cont’d) Global Quantities Physics Quantities Photons # Primary vertices N leptons E, ET Primary Vertices N photons Px, Py, Pz # Tracks per vertex N jets per algorithm h, f, hdet # Secondary vertices Total energy Vertex from pointing Secondary vertices ST ID flag # Tracks per vertex HT c2, ISOA, ISOF, EMF L0/L1 vertex 2-body Masses Compressed cells L2 vertices Sphericity Preshower info Index of the closest track, jet L3 vertices Missing ET’s Aplanarity User-defined list of L1/L2/L3 triggers Likelihood Standard HEPtuples details (cont’d) Electrons Muons Jets E, ET p, pT E, ET Px, Py, Pz Px, Py, Pz Ex, Ey, Ez h, f, hdet h, f, hdet h, f, hdet Vertex ID, pointing Muon system info EMF, CHF, ICDF, HCF E/p, dE/dx ID flag, Sign ID flag, Rcone/KT ID flag, Sign c2, ISO, CAL, dE/dx Ntr, Width, Q/G/t/b/c c2, ISOs, EMF, s Timing Timing Compressed cells Compressed hits Compressed cells Preshower info Closest track, jet Preshower info Closest track, jet Closest track, jet Energy correction Likelihood Likelihood Standard HEPtuples details (cont’d) b/c/t Jets High pT tracks E, ET p, pT Ex, Ey, Ez Px, Py, Pz h, f, hdet h, f, hdet EMF, CHF, ICDF, HCF ISO, CAL, dE/dx, occ. ID flag, Rcone/KT Compressed CFT info Ntr, Width Compressed SMT info Vertex, Impact param. Compressed Muon info Neutrino energy corr. Timing Compressed SMT info Compressed hits Compressed cal. cells Preshower info Closest track, jet Closest track, jet Energy correction Likelihood Physics Object Database  Slides from Richard Partridge (GCM talk, Seattle workshop) What is POD?  R&D project to investigate the use of database technology for physics analysis of large data samples  POD uses a commercial relational database program to store:  Calibrated physics objects (leptons, jets, ET)  Results of particle ID algorithms  Global quantities (triggers, vertices, etc.)  Database queries performs event selection  Example: select top em events by requiring 1 e, 1m, and 2 jets with |h| cuts and ET/ET thresholds  Query output is physics analysis input  Ntuple with database info for selected events  List of run/event numbers allow selected mDSTs to be quickly fetched for advanced analyses  Current goal is to demonstrate feasibility, develop necessary tools, and establish performance benchmarks using a database loaded with the Run 1 data sample Why Should One Use a Database?  Designed to store, retrieve, update, and manage complex data samples  Large number of data types  Bits, integers, floats, characters, binary objects, etc.  Many ways of organizing data  Physics object, event, file, stream, run, etc.  Architecture allows fast access to data  Avoid reading/unpacking entire event to look at 1 bit  Separating algorithm results from physics object data eliminates need to look at all 600M events  Flexible access to data  Data, columns, tables, etc. can be added, updated, or deleted without recreating the database  Example: new calibrations/algorithms can be added to the database and compared to the old ones  Central location for latest calibrations, corrections, algorithms, etc.  Local processing, minimal network IO POD Status  Database server running at Brown  Dual P-II/450 with ~40 GB available for testing  Using SQL Server for present tests  Preliminary studies using pseudo-data  30M “electron” 4-vectors generated with flat ET, h, f distributions  ~1 minute to select 100K events satisfying restrictive cuts ET, |h|  1.5 - 3.5 minutes to select ~16K events with 2 electrons satisfying loose ET, |h| cuts  For comparison, expect 2M produced Wen per fb-1  While rather crude, these results suggest that the POD approach can increase the speed for event selection by several orders of magnitude  C++ program is being developed to load ntuples into the database  Heptuple used to read ntuples  ADO API used to write to database  Database is being loaded with Run 1 data (ALL stream), LQ-based ntuples POD Tools Planned  Program to load heptuples into the database  Web interface to construct database queries that perform event selection Provide web form for selecting desired physics objects, algorithms, kinematic cuts (ET, |h|, etc.), triggers, runs, etc. Translate selection criteria into an SQL command Save resulting event list, output ntuple  Ntuple generator to create heptuple of database variables for selected events  Web interface to define correspondence between heptuple and database columns NT as POD Server OS  NT has proven ability to handle large databases  Supported by all leading database vendors  NT has best price/performance in standard database benchmarks  Good scalability in multi-processor systems (up to 8 P-III processors with forthcoming Profusion chip set)  NT supports ADO (Active Data Objects)  Provides high level API that greatly simplifies programming the database interface  ADO interfaces to all leading database products  Brown is using ADO to develop software for loading ntuples into a database for our prototype studies  Brown plans to also provide a web-based query capability based on ADO  NT makes setting up and managing a high performance / high reliability database remarkably easy  CD support for NT project servers not required POD Server Requirements  Disk subsystem  Disk capacity determines how much info is stored  ~ 1 TB would allow ~ 1K of info/event  ~20 objects/event in Run 1  ~10 words/object  More info can be stored by adding disk space  Multiprocessor Server(s)  Large database queries are CPU (and disk) intensive  Queries execute in parallel on multiple CPUs  Goal would be to have a typical query selecting a small sub-sample in ~1 minute  Multiple servers can be clustered if needed  Optional DVD-RAM jukebox  Expect to be able to store ~2.8 TB in a single jukebox at 1/4 cost of disk space  Allows retrieval of full mDST event information for events selected by the database query POD Server is well matched to Project Server specs DVD Jukebox Storage  DVD-RAM: 2.6 GB/side, 5.2 GB total, $15-25/DVD; 4.8 GB/side were just announced!  DVD libraries: 600 DVD, 3 TB of storage for about $45K or $15/GB!  10-40 MB/s throughput  Very promising technology, potential capacity up to 17 GB/DVD  Fast price drop, wide availability  Brown group has purchased a single 1X DVD-RAM recorder for performance tests  Excellent tool for remote collaborators to have local copies of mDST data set (or selected STA/DST streams) Possible POD/DVD Server Central Analysis Server 20-30 TB Fermilab »$1M T-3 or faster Network Remote Physicist Web-based GUI T-1/T-3 Network A solution to the public data access provision H.R.4328 (or Quarknet)? POD Server DB DB DB SCSI bus $60/GB DVD-RAM Library 15 TB storage mDST DST STA $15/GB Tape robot 3 TB fast RAID storage Eight P-III or Merced CPU 1 GB memory $30K/ 5000 MIP server One Fast or two (shown) 1Gbit/s Ethernet fast server(s) SCSI DVD Server SCSI bus Eight P-III or bus 10-40 600-DVD MB/sec multi-drive changers Merced CPU 1 GB memory Cache disk POD/DVD Server at YOUR INSTITUTION

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download ppt - Institute of Physics