ATLAS Use and Experience of FTS
FTS workshop 16 Nov 05
Outline
• Intro to ATLAS DDM
• How we use FTS
• SC3 Tier 0 exercise experience
• Things we like
• Things we would like
ATLAS DDM System
• Moves from a file-based system to one based on datasets
– Hides file-level granularity from users
– A hierarchical structure makes cataloging more manageable
– However, file-level access is still possible
[Diagram: Sites hold Datasets, which group Files]
• Scalable global data discovery and access via a catalog hierarchy
• No global physical file replica catalog (but global dataset replica catalog and global logical file catalog); a minimal sketch of the hierarchy follows
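To make the discovery path concrete, here is a toy sketch of the hierarchy. All names and data structures below are illustrative stand-ins, not the real DQ2 interfaces: discovery goes from the global dataset replica catalog to per-site file catalogs, and there is no single global physical replica catalog to query.

# Global dataset replica catalog: dataset -> sites holding a replica
dataset_replicas = {'esd.0003': ['CERN', 'TW-ASGC']}

# Global logical file catalog: dataset -> logical file names
dataset_content = {'esd.0003': ['esd.0003._5645.1']}

# Per-site local catalogs (e.g. LFC or RLS): LFN -> SURL, one per site
site_catalogs = {
    'CERN': {'esd.0003._5645.1':
             'srm://castorgridsc.cern.ch/castor/cern.ch/grid/atlas/'
             'ddm_tier0/perm/esd.0003/esd.0003._5645.1'},
    'TW-ASGC': {},  # replica not (yet) transferred
}

def find_physical_replicas(dataset):
    """Resolve all known SURLs for a dataset via the hierarchy."""
    replicas = {}
    for site in dataset_replicas.get(dataset, []):
        for lfn in dataset_content.get(dataset, []):
            surl = site_catalogs[site].get(lfn)
            if surl:
                replicas.setdefault(lfn, []).append(surl)
    return replicas

print(find_physical_replicas('esd.0003'))

The point of the hierarchy is that only the small dataset-to-site mapping needs to be global; the bulky LFN-to-SURL mappings stay local to each site.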
ATLAS DDM System
• As well as the catalogs for datasets and locations, we have 'site services' to replicate data
• We use 'subscriptions' of datasets to sites, held in a global catalog
• Site services take care of the replica resolution, transfer and registration at the destination site
[Diagram: Dataset 'A' (File1, File2) and (Container) Dataset 'B' (Data block1, Data block2); the global subscription catalog lists Dataset 'A' | Site 'X' and Dataset 'B' | Site 'Y', driving replication to Sites 'X' and 'Y']
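As a toy illustration of the subscription mechanism (the data structures are hypothetical, not the actual catalog schema), each site's services simply act on the global entries naming their own site:

# Global subscription catalog: (dataset, destination site) pairs
subscriptions = [
    ('Dataset A', 'Site X'),
    ('Dataset B', 'Site Y'),
]

def subscriptions_for(site):
    """What a site's services pick up from the global catalog."""
    return [dataset for dataset, dest in subscriptions if dest == site]

print(subscriptions_for('Site X'))  # ['Dataset A']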
Subscription Agents
File state is kept in a site-local MySQL DB; the agents below are what runs on the VO Boxes.

Agent              Function                          Resulting file state
Fetcher            Finds incomplete datasets         unknownSURL
ReplicaResolver    Finds remote SURL                 knownSURL
MoverPartitioner   Assigns Mover agents              assigned
Mover              Moves file (uses FTS here!)       toValidate
ReplicaVerifier    Verifies local replica            validated
BlockVerifier      Verifies whole dataset complete   done
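Read as a state machine, the chain is straightforward. The table data above is from the slide; the code around it is a hypothetical rendering of how each agent advances a file's state in the site-local database:

# (agent, state it picks up, state it leaves behind)
AGENT_CHAIN = [
    ('Fetcher',          None,          'unknownSURL'),
    ('ReplicaResolver',  'unknownSURL', 'knownSURL'),
    ('MoverPartitioner', 'knownSURL',   'assigned'),
    ('Mover',            'assigned',    'toValidate'),  # FTS is used here
    ('ReplicaVerifier',  'toValidate',  'validated'),
    ('BlockVerifier',    'validated',   'done'),
]

def advance(file_state):
    """Return the next state in the chain, as it would be recorded
    in the site-local MySQL database."""
    for agent, picks_up, leaves in AGENT_CHAIN:
        if picks_up == file_state:
            return leaves
    return file_state  # 'done' is terminal: nothing left to do

state = None  # a newly fetched file has no state yet
while state != 'done':
    state = advance(state)
    print(state)  # unknownSURL, knownSURL, assigned, toValidate, validated, done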
Within the Mover agent
• The python Mover agent reads in an XML file catalog of source files to copy
<File ID="bc340aff-4057-4dcc-98aa-204432c4bb07">
  <physical>
    <pfn filetype=""
         name="srm://castorgridsc.cern.ch/castor/cern.ch/grid/atlas/ddm_tier0/perm/esd.0003/esd.0003._5645.1"/>
  </physical>
  <logical/>
  <metadata att_name="destination"
            att_value="http://vobox.grid.sinica.edu.tw:8000/dq2//esd.0003"/>
  <metadata att_name="fsize" att_value="500000000"/>
  <metadata att_name="md5sum" att_value=""/>
</File>
• The destination file name is based on the SRM endpoint + dataset name + source filename (sketched below)
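A minimal sketch of that naming rule. The helper function is illustrative, not the actual Mover code, and the destination endpoint is hypothetical:

import posixpath

def destination_surl(srm_endpoint, dataset, source_surl):
    """destination = SRM endpoint + dataset name + source filename"""
    filename = posixpath.basename(source_surl)
    return '%s/%s/%s' % (srm_endpoint.rstrip('/'), dataset, filename)

src = ('srm://castorgridsc.cern.ch/castor/cern.ch/grid/atlas/'
       'ddm_tier0/perm/esd.0003/esd.0003._5645.1')
# hypothetical destination endpoint:
print(destination_surl('srm://srm.sinica.edu.tw/atlas', 'esd.0003', src))
# -> srm://srm.sinica.edu.tw/atlas/esd.0003/esd.0003._5645.1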
Within the Mover agent
• We create a file of source and dest SURLs and submit the bulk job to FTS (using the CLI via the python commands module)
• Then query every x seconds using glite-transfer-status to see if the status changes (see the sketch after this list)
– 'Done': mark all files as successfully copied
– 'Hold', 'Failed': some or all files failed, so look through the output for successes and failures
• In the case of a failed file:
– The file is put back into the 'unknownSURL' state and goes through the chain of agents again (max 5 times x 3 FTS retries = 15 retries overall)
• Successful files:
– The destination file is validated by using SRM commands directly (getFileMetaData) to compare the file size with the source catalog file size
– Would like to know if this stage is really necessary or if FTS already does it (or will in future?) (more later…)
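A hedged sketch of that submit-and-poll loop, using the CLI through the Python 2 commands module as the Mover agent does. glite-transfer-submit and glite-transfer-status are the commands named above, but the '-f' bulk-file option, the output parsing and the helper names are assumptions to be checked against the installed client:

import commands  # Python 2 module used by the Mover agent
import time

def submit_bulk(surl_pair_file):
    """Submit a bulk job from a file of 'source dest' SURL pairs.
    The -f option is an assumption; check your glite-transfer-submit."""
    status, output = commands.getstatusoutput(
        'glite-transfer-submit -f %s' % surl_pair_file)
    if status != 0:
        raise RuntimeError('submission failed: %s' % output)
    return output.strip()  # the CLI prints the job ID

def wait_for_job(job_id, poll_interval=60):
    """Poll glite-transfer-status every poll_interval seconds."""
    while True:
        status, output = commands.getstatusoutput(
            'glite-transfer-status %s' % job_id)
        state = output.strip()
        if state == 'Done':
            # mark all files as successfully copied, then validate each
            # destination via SRM getFileMetaData vs the catalog fsize
            return state
        if state in ('Hold', 'Failed'):
            # some or all files failed: parse the per-file output; failed
            # files go back to the 'unknownSURL' state (max 15 retries)
            return state
        time.sleep(poll_interval)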
Using FTS within SC3
• ATLAS' SC3 is a Tier 0 exercise where we produce RAW data at CERN and replicate reconstructed data to Tier 1 sites (using FTS!)
• We started officially on 2nd Nov, so we have been running for ~2 weeks now
– With ~1 month of small-scale testing using the FTS pilot service beforehand - this was very useful for testing the integration of FTS and debugging site problems with SRM paths etc.
Results so far
[Plots: 1 - 7 Nov]
Results so far…
• Put latest plots here
[Plots: 9 - 15 Nov]
What worked well
• The service is very reliable
– virtually no failures connecting to the service (apart from when CERN had an unstable network)
– 99.9% of failures are problems with sites/humans
– It hasn't lost any of our jobs' information
• The interface is friendly and self-explanatory
• The throughput rate is fast enough, but we haven't really stressed it so far
• Response to reported errors is good (fts-support)
What we would like
• Staging from tape
– In theory this is not a problem for us in SC3, but it will be in the future
– Would like FTS to deal with staging from tape properly (rather than giving SRM get timeouts), having a 'staging' status and perhaps enabling us to query through FTS whether files are on tape or disk
• Integration with replica catalogs
– We use LFC (LCG) and Oracle/Globus RLS (through the POOL FC interface) (OSG)
– So we could say "move LFN x from site y to site z" and FTS would call a service that takes care of resolution and registration
• Bandwidth monitoring within FTS
• Error reporting
– Email lists again… would like to know who to tell in case of error. Can you give a hint based on the error?
What we would like
• TierX to TierY transfers handled by the network fabric, so channels between all sites should exist
• Support for priorities, with the possibility of late reshuffling
• Plugins to allow interactions with experiments' services. Examples of plug-ins - or experiment-specific services:
– catalog interactions (not exclusively grid catalogs)
– plugins to zip files on the fly (transparently to users, but very good for MSS) - after the transfer starts and/or before files are stored on storage
– one idea is for FTS to provide a callback? Must understand the VO agents framework and what can be done with that!
• Reliability: keep retrying until told to stop
– but allow real-time monitoring of transfer errors (parseable errors preferable) so that we can reshuffle transfers, cancel them, etc.
– signal conditions such as source missing, destination down, etc.
Some Questions
(maybe already answered today!)
• Would like to understand how to optimise (number of files per bulk job, etc.)
• Do you distinguish between permanent errors (channel doesn't exist) and temporary errors (SRM timeout)?
– I.e. not retrying permanent errors - and is there a way to report this to us so we don't retry either?
• Do we need our own verification stage, or are we just repeating what FTS does?
• 'Duration' - is this the time from submission to completion, or the 'Active' time?
Conclusion
• We are happy with the FTS service so far - it has given us some good results
– But we haven't tested it until it breaks!
• Probably the most reliable part of SC3 in our experience
• We would like to see it integrated with more components to reduce our workload (staging, catalogs)
• Look forward to further developments!