Fermi (previously GLAST) Gamma-Ray Space Telescope Processing Pipeline and Data Catalog
Tony Johnson
[email protected]
Launched 11 June 2008 – LAT activated 25 June
Fermi Mission Elements
[Mission-elements diagram: the Fermi spacecraft (Large Area Telescope & GBM, Delta 7920H launch vehicle, GPS timing to the msec, 1 kbps telemetry link) communicates via TDRSS SN (S & Ku band) and the S-band ground network (GN) through White Sands to the Mission Operations Center (MOC); schedules, alerts, data, and command loads flow among the MOC, the GRB Coordinates Network, the Fermi Science Support Center, HEASARC, the LAT Instrument Science Operations Center (SLAC), and the GBM Instrument Operations Center.]
Design Goals
• Experiment goals
  – Completely automate Fermi data processing at SLAC
  – 10-year lifetime for the experiment
  – Unlike HEP data, astrophysics data is time critical
    • NASA requirement to make data public within 24 hours
    • Internal goal to complete processing of each downlink within 3 hours
  – Support submission and bookkeeping of simulation and data reprocessing
  – Maintain full history of all data processing
  – Data catalog to keep track of all data products
  – Provide access to data for collaborators
Pipeline Design Goals
• Automated submission and monitoring of batch jobs
  – Very high reliability
• Ability to define an arbitrary graph of jobs to be run
  – Ability to parallelize processing tasks
  – Ability to perform simple computations as part of the job graph
    • E.g. compute how many parallel streams to create as a function of the number of events to be processed
• Ability to "roll back" jobs (whether successful or not)
  – Capability to automatically compute the sub-graph of jobs to rerun (see the sketch after this list)
• Web interface for monitoring jobs and submitting new tasks
  – Worldwide collaboration
  – Plus command line client and programmatic API
• Avoid tight coupling to a specific experiment to allow for reuse
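The rollback requirement implies a small graph computation: starting from the rolled-back jobs, every downstream job that consumed their outputs must also be rescheduled. A minimal sketch of that idea, assuming a simple adjacency-list representation of job dependencies; the class, method, and job names are illustrative, not the pipeline's actual implementation:

```java
import java.util.*;

// Hypothetical illustration: compute the sub-graph of jobs to rerun after a
// rollback, i.e. the rolled-back jobs plus everything reachable from them
// along dependency edges.
public class RollbackPlanner {

    // downstream.get(job) = jobs that directly depend on that job's outputs
    private final Map<String, List<String>> downstream;

    public RollbackPlanner(Map<String, List<String>> downstream) {
        this.downstream = downstream;
    }

    public Set<String> jobsToRerun(Collection<String> rolledBack) {
        Set<String> toRerun = new LinkedHashSet<>(rolledBack);
        Deque<String> frontier = new ArrayDeque<>(rolledBack);
        while (!frontier.isEmpty()) {
            String job = frontier.pop();
            for (String dependent : downstream.getOrDefault(job, List.of())) {
                if (toRerun.add(dependent)) {   // not visited before
                    frontier.push(dependent);
                }
            }
        }
        return toRerun;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = Map.of(
                "digitization", List.of("reconstruction"),
                "reconstruction", List.of("merge", "monitoring"),
                "merge", List.of());
        System.out.println(new RollbackPlanner(deps).jobsToRerun(List.of("digitization")));
        // -> [digitization, reconstruction, merge, monitoring]
    }
}
```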
Data Catalog Design Goals
• Support multiple storage formats
  – AFS, NFS, xrootd, …
• Support multiple file locations for the same datasets
  – SLAC, IN2P3, …
• Allow arbitrary meta-data to be stored with datasets
  – As much meta-data as possible should be extracted from the file itself to ensure integrity
• Dataset access from web, command line, and API (a sketch of such an API call follows this list)
  – Including search based on meta-data
• Avoid tight coupling to a specific experiment to allow for reuse
  – Avoid tight coupling with the pipeline
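To make the access goal concrete, here is a minimal sketch of the kind of programmatic metadata search the catalog is meant to support; the `DataCatalogClient` interface, the method names, the logical path, and the metadata keys are hypothetical illustrations, not the actual Fermi data catalog API.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a programmatic metadata search against the data
// catalog; class and method names are illustrative only.
public class CatalogSearchExample {

    // Minimal stand-in for a dataset record: logical path, site, and meta-data.
    record Dataset(String logicalPath, String site, Map<String, Object> metaData) {}

    interface DataCatalogClient {
        // Find datasets under a logical folder whose meta-data matches a query.
        List<Dataset> findDatasets(String logicalFolder, String metaDataQuery);
    }

    static void printLevel1Runs(DataCatalogClient catalog) {
        // e.g. all Level 1 files from a given period with more than a million events
        List<Dataset> results = catalog.findDatasets(
                "/Data/Flight/Level1",
                "nEvents > 1000000 && runStart >= '2010-10-01'");
        for (Dataset ds : results) {
            System.out.printf("%s @ %s (%s events)%n",
                    ds.logicalPath(), ds.site(), ds.metaData().get("nEvents"));
        }
    }
}
```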
Pipeline and Data Catalog Components
[Component diagram: at SLAC, the Pipeline Server and Pipeline Web Interface, the Data Catalog, Catalog Web Interface, and Data Portal are backed by an Oracle database (plus a read-only instance); web, line-mode, and other clients connect to these services; Job Control services drive batch farms at SLAC, IN2P3, and (possibly) INFN/Grid.]
Anatomy of a Pipeline Task
Tasks, subtasks, processes, and streams are specified in user-written XML (a rough model of this anatomy is sketched below).
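As an illustrative mental model only (not the pipeline's actual XML schema or classes): a task groups processes and may nest subtasks, and each submission of a task creates a stream that runs those processes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative-only model of the task anatomy described above; names and
// structure are a sketch, not the Fermi pipeline's real schema.
public class TaskAnatomy {

    // A process is one unit of work: a batch job or a small script.
    record Process(String name, String command) {}

    // A task groups processes and may nest subtasks.
    record Task(String name, List<Process> processes, List<Task> subtasks) {}

    // A stream is one execution instance of a task, e.g. one downlink or one run.
    record Stream(long streamId, Task task) {}

    public static void main(String[] args) {
        Task level1 = new Task("Level1",
                List.of(new Process("digitization", "runDigi.sh"),     // script names are hypothetical
                        new Process("reconstruction", "runRecon.sh")),
                new ArrayList<>());
        Stream downlink = new Stream(101, level1);
        System.out.println("Stream " + downlink.streamId()
                + " runs task " + downlink.task().name());
    }
}
```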
Level 1 Processing Task Example
[Task diagram showing the Digitization and Reconstruction stages]
Processing Pipeline Web Interface
• Pipeline web interface allows:
  – Many views of data processing, down to the log files of individual jobs
  – Job submission (though this is normally done from the command line)
  – Failed jobs can be "rolled back" directly from the web interface
Front End: Activity Plots
[Activity plots: Simulation, L1 Reconstruction, L1 Digitization]
Monitoring Data Processing
• The pipeline provides a web API for incorporating processing information into other web applications (a sketch of such a call follows)
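A minimal sketch of pulling processing status into another application over HTTP. The base URL is the pipeline address given in the conclusion, but the path, query parameters, and response format are invented for illustration; the real web API's endpoints are not shown here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch of fetching pipeline status from the web API so it can
// be embedded in another web application. The endpoint path is illustrative.
public class PipelineStatusFetcher {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://glast-ground.slac.stanford.edu/Pipeline-II/"
                        + "status?task=Level1&format=xml"))   // path and parameters are hypothetical
                .GET()
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}
```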
Data Catalog Web Interface
• Summary of all files in a group
• Meta-data associated with folders, groups and files
• Logical tree allows browsing for files
• Folders and Groups
• A crawler runs in the background, validating all files and extracting size, number of events, etc. (a sketch of such a crawler follows)
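A minimal sketch of what one pass of such a background crawler might do, assuming plain filesystem access; the dataset root, the metadata extracted, and the class name are illustrative, not the actual catalog crawler.

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

// Illustrative sketch of one crawler pass: walk a registered dataset location,
// check that each file is readable, and record its size. Event counting is
// stubbed; a real crawler would open the file, run repeatedly in the
// background, and write results back to the catalog database.
public class CatalogCrawlerSketch {

    static void crawlOnce(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                boolean readable = Files.isReadable(file);
                long sizeBytes = attrs.size();
                long nEvents = -1;   // stub: a real crawler would read this from the file itself
                System.out.printf("%s readable=%b size=%d nEvents=%d%n",
                        file, readable, sizeBytes, nEvents);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        crawlOnce(Paths.get("/data/datasets"));   // hypothetical dataset root
    }
}
```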
Data Catalog Web Interface (continued)
• Drill down to get more details
• Download manager for reliable download of multiple files
Pipeline Performance and Reliability
Technologies Used
• Database
  – Oracle
    • Java stored procedures for performance
    • Scheduled server-side jobs for monitoring and stats-gathering
    • Hierarchical queries
• Servers and client libraries (Pipeline, Data Catalog)
  – Java
    • Extensive use of threads and concurrency utilities for performance
  – Jython interpreter for user scripts
  – JMX MBean interfaces for monitoring and communication (see the sketch after this list)
  – XML used for processing-task definitions
  – Batch jobs use e-mail for status notification
    • Apache/James e-mail server
• Web
  – Apache/Tomcat servers
  – JSP for web pages
    • DisplayTag for tabular data
    • AIDA tag libraries for plotting
    • Custom tag libraries expose Pipeline client methods
  – Java servlets
    • Serve GraphViz state diagrams
  – JMX interfaces
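As an illustration of the JMX monitoring approach, here is a minimal standard-MBean sketch; the attribute and operation names and the ObjectName are invented for this example, not the pipeline server's real MBeans.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Minimal JMX sketch: expose a couple of server statistics so that JConsole or
// a line-mode client can read them remotely. Names are illustrative only.
public class PipelineJmxSketch {

    // Standard MBean: the interface must be named <ImplementationClass>MBean.
    public interface PipelineStatsMBean {
        int getRunningJobs();
        long getProcessedStreams();
        void requestShutdown();
    }

    public static class PipelineStats implements PipelineStatsMBean {
        public int getRunningJobs()        { return 42; }          // placeholder values
        public long getProcessedStreams()  { return 1_000_000L; }
        public void requestShutdown()      { System.out.println("shutdown requested"); }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(new PipelineStats(),
                new ObjectName("org.example.pipeline:type=PipelineStats"));  // illustrative name
        System.out.println("MBean registered; attach with JConsole to inspect it.");
        Thread.sleep(Long.MAX_VALUE);   // keep the JVM alive so a monitor can connect
    }
}
```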
Conclusion
• Pipeline is in extensive production use:
  – Real data
  – Automated and routine science processing
  – Large simulations
  – Reprocessing
• Software is not coupled specifically to Fermi or SLAC
  – A second pipeline is now in use for EXO and CDMS
    • New job control daemon written for CDMS
      – Submits jobs to a Condor farm (at SMU)
      – Possibility to extend this to submit Cloud/Grid jobs in the future
  – Pipeline is being evaluated for use by the James Webb Space Telescope ("Hubble 2.0")
• To see the pipeline in action:
  – http://srs.slac.stanford.edu/Pipeline-II/
  – http://glast-ground.slac.stanford.edu/Pipeline-II/
Acknowledgments
Software Development Team:
Daniel Flath
Charlotte Hee
Karen Heidenreich
Claudia Lavalley
Tony Johnson
Max Turri
Beta Testers:
Warren Focke
Tom Glanzman
Extra Slides
3-Tier Architecture
[Architecture diagram. Back-End: Oracle RDBMS with stored procedures, accessed over TNS/SQL through connection pools. Middle-Ware: Pipeline Server with scheduler, thread pool, and a JMX MBean interface; RMI links to Job Control Services that drive the batch farms; Apache/James e-mail server polled over POP3. Front-End User Interfaces: web browsers served over HTTP by the web application on Tomcat; line-mode clients and a JConsole monitor attached via JMX.]
Front End: Line-mode Clients
• Command line tools for direct or scripted interaction with the middle-ware
  – Control server
    • Ping
    • Restart
    • Shutdown
  – Upload task definitions
  – Manage processing streams
    • Create
    • Delete
    • Cancel
    • Retry from failure point
  – Query processing history
  – Plus interaction with the Data Catalog
Front End: Web Interfaces
• Provides all administrative functions in a user-friendly interactive GUI
• Interactive displays show active (and historical) processing
  – Filtering by task, process, status(es), stream range, and date range
• Processing statistics plots
  – Provided by the AIDA tag library
  – System throughput plots
    • Filterable by task and date range
  – Individual process statistics plots
    • CPU time (vs. wall-clock time)
    • Pending time
    • By batch host type
• Task diagrams generated by GraphViz and image-mapped to provide links to task element (sub-task, process) displays
Front End: Task Summary Display
Front End: Process Detail Plots
Front End: Job Detail Display
Middle Tier: Threading
• Makes extensive use of the Java concurrency library (java.util.concurrent); a minimal sketch follows this list
• Scheduler threads
  – Look for work and delegate it to the execution pool
    • Ready jobs (script and batch)
    • Submit work to execution threads
  – Handle e-mail status messages
    • Check for e-mail
    • Submit status-transition calculations to execution threads
    • Receive confirmation from workers, then delete the e-mail
  – Reaper
    • Searches for 'lost' jobs in batch and updates the processing history accordingly
• Execution threads
  – Decode e-mail status messages and update process records
  – Execute Jython script processes directly
  – Submit batch jobs to the farms
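A minimal sketch of the scheduler-delegates-to-executor pattern described above, using java.util.concurrent; the pool sizes, polling intervals, and job/message types are illustrative, not the pipeline server's actual code.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the threading model: scheduler threads periodically
// look for ready work and hand it to an execution pool; execution threads do
// the actual batch submission and status-update processing.
public class SchedulerSketch {

    private final ScheduledExecutorService schedulers = Executors.newScheduledThreadPool(2);
    private final ExecutorService executors = Executors.newFixedThreadPool(8);

    void start() {
        // Scheduler thread: find jobs that are ready and delegate them.
        schedulers.scheduleWithFixedDelay(() -> {
            for (String job : findReadyJobs()) {
                executors.submit(() -> submitToBatchFarm(job));
            }
        }, 0, 10, TimeUnit.SECONDS);

        // Scheduler thread: poll for e-mail status messages and delegate the
        // status-transition work to the execution pool.
        schedulers.scheduleWithFixedDelay(() -> {
            for (String message : pollEmail()) {
                executors.submit(() -> updateProcessRecord(message));
            }
        }, 0, 30, TimeUnit.SECONDS);
    }

    // Stubs standing in for database queries, batch submission, POP3 polling, etc.
    private List<String> findReadyJobs()          { return List.of(); }
    private List<String> pollEmail()              { return List.of(); }
    private void submitToBatchFarm(String job)    { System.out.println("submit " + job); }
    private void updateProcessRecord(String msg)  { System.out.println("update " + msg); }

    public static void main(String[] args) {
        new SchedulerSketch().start();
    }
}
```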