T0 report
WLCG Operations Workshop, Barcelona, 07/07/2014
Maite Barroso, CERN IT

Outline
• Facilities
• Next Linux version
• Network
• Cloud
• Grid and batch services
• Databases
• Summary

Facilities
• Wigner (Budapest)
  – Additional capacity installed: mainly for OpenStack and for EOS, plus some for business continuity of DB services
  – Wigner participated for the first time in the last HEPiX workshop
  – Network to Wigner:
    • Extensive testing was done on the GÉANT 100 Gbps link to identify the source of the observed flaps
    • All segments of the link have been tested without errors, but the source of the problem has still not been identified
    • It is possible that cleaning of the fibres ahead of the tests resolved the problem; if not, only an incompatibility between the Brocade and Alcatel equipment remains as a possible cause

Linux: next version
• Plan: adopt CentOS 7, adding the CERN-specific setup via addon repositories
  (http://cern.ch/linux/docs/Hepix-Spring-2014 Next Linux version at CERN.pdf)
• CentOS 7 is approaching release, expected within a few weeks (http://seven.centos.org/)
• We expect a CERN-customized test installation to be available in July/August
• Is CERN's own version certification still necessary? To be discussed with the Linux Certification Committee

Network (1)
• LHCONE
  – CERN LHCONE bandwidth increased to 30 Gbps (was 20 Gbps)
  – Working on the definition of an LHCONE AUP that guarantees enough security while remaining workable in practice
  – Organization of an LHCONE Asia workshop (https://indico.cern.ch/event/318813/) is ongoing; it aims to expand LHCONE connectivity to sites in Asia
• LHCOPN
  – Connected the KI and JINR sites of the Russian Tier1s; they have two 10G links to CERN, one via Amsterdam and one via Wigner
  – Bandwidth to the US Tier1s will increase with the upcoming deployment of the ESnet PoP at CERN

Network (2)
• IPv6
  – From the network point of view, the IPv6 deployment is finished
  – IT services are becoming dual stack; dual stack today:
    • email (smtp, imap, pop, owa)
    • lxplus-ipv6
    • LDAP
    • web redirection
  – The HEPiX IPv6 WG is testing the IPv6 compliance of WLCG applications, taking advantage of the IPv6 deployment at CERN (a quick resolver check is sketched below)
  – CERN, KIT, PIC, NDGF and IN2P3 have IPv6 connectivity over the LHCOPN
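The dual-stack status of a service can be verified from any client by asking the resolver for both A (IPv4) and AAAA (IPv6) records. A minimal sketch in Python, using only the standard library; the fully qualified name lxplus-ipv6.cern.ch is an assumption based on the service alias listed above.

```python
import socket

def resolve(host):
    """Return the IPv4 and IPv6 addresses a host name resolves to."""
    infos = socket.getaddrinfo(host, None)
    v4 = sorted({i[4][0] for i in infos if i[0] == socket.AF_INET})
    v6 = sorted({i[4][0] for i in infos if i[0] == socket.AF_INET6})
    return v4, v6

if __name__ == "__main__":
    # hypothetical FQDN for the lxplus-ipv6 alias named on the slide
    v4, v6 = resolve("lxplus-ipv6.cern.ch")
    print("A:   ", v4)
    print("AAAA:", v6)  # both lists are non-empty once the service is dual stack
```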
Cloud (1)
• All components now run the latest Havana-3 release
  – Planning the upgrade to Icehouse
• The cloud continues to grow; today: 2800 servers, 7000 VMs, 150 TB of volumes
• Work in progress:
  – Commissioning resources at Wigner for the experiments (until now, batch service only)
  – SSO and Kerberos integration; accounting with Ceilometer
  – Adding hardware, aiming at 6000 compute nodes this year

Cloud (2) and Cloud (3): VM provisioning [these slides contained charts only]

Services (1)
• VOMRS to VOMS-Admin migration
  – ATLAS, ALICE, CMS and LHCb still run VOMRS: migrating these VOs needs the new VOMS-Admin release, since they must sync with the CERN HR DB and this does not work in the current version; expected mid-July
  – voms-admin is in production for the remaining VOs (test, ops, geant4, ...)
• LFC
  – Decommissioned for ATLAS in early June; all data is kept for the moment
  – In contact with LHCb about the expected end date of their need for an LFC service
• FTS: agreed to stop FTS2 on August 1st

Services (2)
Batch:
• SLC6 migration: the SLC5 CEs are decommissioned and no grid jobs can be submitted to SLC5; the final migration of the SLC5 WNs is ongoing
• Batch system migration from LSF to HTCondor
  – Goals: scalability, dynamism, dispatch rate, query scaling
  – Replacement candidates:
    • SLURM feels too young
    • HTCondor is mature and promising
    • Son of Grid Engine is fast, but a bit rough
  – More details on the selection process: https://indico.cern.ch/event/247864/session/5/contribution/22/material/slides/0.pdf

Services (3)
• Batch system migration from LSF to HTCondor
  – Setting up a pilot, which will be opened to the experiments: starting with 10 nodes, plus a CREAM CE for HTCondor for grid submissions (a minimal submission sketch is appended at the end of this report)
  – Work is ongoing on integrating AFS token granting and extension
  – A full-capacity test runs in parallel, on ~5000 nodes
  – Close contact with the developers
• New Squid service
  – Request from ATLAS for a more generic Squid service covering their Frontier needs as well as the already covered CVMFS needs
  – The implementation will be an extension of the existing service: a different alias, same instance

Databases (1)
• Oracle version upgrade
  – The majority of DB services upgraded to 11.2.0.4 (including half of the Tier1 sites)
  – A few DB services upgraded to 12.1.0.1 (LHCb offline, ATLARC, COMPASS, LANDB, ...)
  – End of 11.2 support in January 2018; looking at moving to 12c gradually
• HW and storage evolution
  – New HW installation RAC50 in the BARN; migration of the production services completed by May
  – New HW installations being prepared: RAC51 in the BARN and at Wigner (for disaster recovery)
  – New generation of storage from NetApp
• Integration with the Agile Infrastructure at CERN

Databases (2)
• Replication evolution
  – The Replication Technology Evolution Workshop took place on June 3rd-4th
  – Replication tests from T0 to T1 using production data are ongoing
  – The plan to migrate from Streams to GoldenGate is agreed with the experiments and the Tier0
• Database as a Service (DBoD) evolution
  – New HW and storage installations
  – SW upgrades: MySQL (migrating to 5.6) and Oracle (migrating to 12c multi-tenancy)
  – PostgreSQL (version 9.2) offered since September 2013
• More details in "Evolution of Database Services", today at 17:10

Summary
• Getting experience with recent changes:
  – Wigner
  – Cloud VM provisioning
  – IPv6
• And preparing the next ones:
  – Quattor phase-out
  – Next Linux version
  – HTCondor
• In a continuous feedback loop with the experiments and WLCG
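Referenced from Services (3) above: a minimal sketch of a plain HTCondor job submission, as a pilot user would exercise it. The test job and file names are hypothetical, and the sketch assumes a working HTCondor client configuration pointing at the pilot pool.

```python
import subprocess
import tempfile

# A minimal vanilla-universe submit description; $(Cluster) is expanded
# by HTCondor to the ID assigned to the submission.
SUBMIT = """\
universe   = vanilla
executable = /bin/hostname
output     = test.$(Cluster).out
error      = test.$(Cluster).err
log        = test.$(Cluster).log
queue 1
"""

with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
    f.write(SUBMIT)
    submit_file = f.name

# condor_submit parses the description and queues the job on the pool
subprocess.check_call(["condor_submit", submit_file])
```

Grid submissions would instead enter through the CREAM CE mentioned in Services (3) rather than through a local condor_submit.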