PES Platform & Engineering Services

Improving resilience of T0 grid services
Manuel Guijarro – IT/PES
Steve Traylen – IT/PES
EGI Community Forum 2012
CERN IT Department, CH-1211 Geneva 23, Switzerland
www.cern.ch/it

Outline
• Introduction
• One server, one application
• Virtualisation
• Service Consolidation
• DNS Load balancing
• Grid WMS Example
• Conclusion

Introduction
Platform Support Section in IT-PES:
• Interactive Login Services and Batch
• Grid (mainly Computing) Services:
  – CEs, WMS, LB, VOMS, BDII, CVMFS, FTS, and LFC
• Infrastructure Services:
  – Messaging Service
  – DNS Load Balancing Service
  – Service Consolidation Service
  – Internal Cloud Infrastructure

Introduction II
• Grid Services are not all HA by design
• Need to increase their availability
• Use in-house infrastructure services:
  – Service Consolidation Service (Virtualisation)
  – DNS Load Balancing Service
  – Cheap solutions
• These do not provide real High Availability
• But they greatly reduce the downtime of Grid Services

“One server, one application”
• Low Infrastructure Utilization
  – Typically one application per server, to avoid the risk of vulnerabilities in one application affecting the availability of another application on the same server
• Increasing Physical Infrastructure Costs
  – Power consumption, cooling and facilities costs that do not vary with utilization levels
• Increasing IT Management Costs
  – Disproportionate time and resources spent on manual tasks associated with server maintenance, so more personnel are required to complete these tasks
• Insufficient Failover and Disaster Protection
  – The threat of security attacks and natural disasters has elevated the importance of business continuity
[Figure: three servers, each running a single application on its own operating system]

Virtualization
• Virtualization is the ability to run multiple independent virtual operating systems on a single physical computer

Server consolidation
• Grid VOMS servers usage
[Figure: CPU utilization, Grid VOMS cluster, March 2012]

Server consolidation
[Figure: three servers in Computer Center (513), each running a hypervisor that hosts several virtual machines (application + OS)]
• Main advantages:
  – Multiple services on the same server
  – Hardware agnostic
  – No underutilization of resources

Hardware interventions
[Figure: three hypervisor servers in Computer Center (513) hosting virtual machines labelled A1, A2 and Ax]
• Main advantages:
  – User transparent
  – No service degradation

Virtualization tools
• There are different virtualization technologies:
  – XEN
  – KVM
  – Microsoft Hyper-V
  – VMware ESXi
• PES-PS tested XEN; we are currently using KVM and Microsoft Hyper-V

Cloud Orchestration tools
• There are several cloud orchestration tools to build private clouds:
  – OpenStack, OpenNebula, Platform ISF, Eucalyptus, Nimbus, Microsoft SCVMM, VMware vSphere, ...
• PES-PS test(ed) Platform ISF, OpenNebula, Microsoft SCVMM and OpenStack
• For Service Consolidation currently using Microsoft SCVMM

Is this the silver bullet?
• 90% of PES Grid services run on VMs
• Still some on real HWD (until it expires)
• Other saving excuse:
  – 5-10% lost in CPU performance
  – 20% lost on disk I/O
  – Overall performance still OK for most services
• Still exposed to (partial) interruptions:
  – OS or Grid Application upgrades
  – …..

DNS Load Balancing
[Figure: a Load Balancing Arbiter, an Application Cluster and a DNS Server, with links labelled SNMP and DynDNS]
• Node metrics:
  – node1: metric=24
  – node2: metric=48
  – node3: metric=35
  – node4: metric=27
• 2 best nodes for application.cern.ch: node1, node4
• Q: What is the IP address of application.cern.ch?
• A: application.cern.ch resolves to: node4.cern.ch, node1.cern.ch
• Connecting to node4.cern.ch
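
The selection step shown on the DNS Load Balancing slide can be illustrated with a minimal sketch. It assumes that a lower metric means a better node, which is inferred from the slide (node1 and node4 have the two lowest values and are the ones chosen). The function name `select_best_nodes` and the tie-breaking rule are hypothetical and are not taken from the actual arbiter.

```python
from typing import Dict, List


def select_best_nodes(metrics: Dict[str, int], count: int = 2) -> List[str]:
    """Return the `count` node names with the lowest load metric.

    Assumption: lower metric = less loaded = better candidate.
    """
    # Sort by metric first, then by name so that ties are resolved deterministically.
    ranked = sorted(metrics.items(), key=lambda item: (item[1], item[0]))
    return [name for name, _ in ranked[:count]]


# Metric values taken from the slide.
metrics = {"node1": 24, "node2": 48, "node3": 35, "node4": 27}
print(select_best_nodes(metrics))  # ['node1', 'node4']
```

In a real setup the chosen names would then be published under the alias on the DNS server; that step is left out here because the slide gives no detail beyond the DynDNS label.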

WMS Example: setup
• 3 load-balancing DNS aliases for different configuration classes (“subclusters”)
  – SAM monitoring (wmssam.cern.ch), CMS (wmscms.cern.ch), other VOs (wmsshared.cern.ch)
  – Identical configuration for all nodes in the same subcluster (using central configuration management)
• Node load taken into account to select a set of “best nodes” to be exposed in each DNS alias
  – Using metrics specific to WMS
  – Highly loaded nodes stop receiving new jobs
• Well supported by client software (gLite UI)
  – Users specify a single server name in their config: the DNS alias
  – DNS server returns a list of IP addresses for the alias
  – Client software randomly tries IP addresses from the list

Benefits & limits
• Benefits
  – Flexibility: nodes can be added to or removed from a DNS alias without users changing their configuration
  – Resource optimization: even load distribution on WMS nodes
  – Availability: highly loaded or sick nodes are automatically removed from the DNS alias
  – Transparent maintenance: nodes undergoing maintenance are not exposed to users
• But does not replace a full HA solution
  – Each job remains tied to a specific node (we use WMS+LB co-hosting)
  – WMS node unavailable = no job status update

Conclusion
• Service Consolidation via Virtualisation should become a common practice
• DNS Load balancing is cheap and helps
• The real challenge is ahead of us:
  – Running services in a(n) (internal) cloud
  – # of Nodes varies constantly
  – Dynamic Configuration becomes a must
• Will require service redesign for most of what we know.
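
As a supplementary illustration of the client-side behaviour described in the WMS example (a single alias, a list of addresses, random tries), here is a minimal sketch. It is not the gLite UI implementation: `pick_server` and the `try_connect` callback are hypothetical names, and the addresses are documentation placeholders standing in for whatever the DNS alias returns.

```python
import random
from typing import Callable, Iterable, List, Optional


def pick_server(addresses: Iterable[str],
                try_connect: Callable[[str], bool],
                rng: Optional[random.Random] = None) -> Optional[str]:
    """Try the addresses in random order; return the first that accepts, else None."""
    candidates: List[str] = list(addresses)
    # Shuffle a copy so that clients spread across the exposed nodes.
    (rng or random).shuffle(candidates)
    for address in candidates:
        if try_connect(address):
            return address
    return None


# Hypothetical usage: stand-in addresses and a fake connection check.
alias_addresses = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
healthy = {"192.0.2.2"}
print(pick_server(alias_addresses, lambda address: address in healthy))  # 192.0.2.2
```

Passing the connection check in as a callback keeps the sketch independent of any particular protocol. As the Benefits & limits slide notes, this only helps when a job is first submitted, since each job stays tied to the node that accepted it.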