PES
Platform & Engineering Services
Improving resilience of T0 grid services
Manuel Guijarro – IT/PES
Steve Traylen – IT/PES
EGI Community Forum 2012
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
Outline
• Introduction
• One server, one application
• Virtualisation
• Service Consolidation
• DNS Load balancing
• Grid WMS Example
• Conclusion
Introduction
Platform Support Section in IT-PES:
• Interactive Login Services and Batch
• Grid (mainly Computing) Services:
– CEs, WMS, LB, VOMS, BDII, CVMFS, FTS, and LFC.
• Infrastructure Services:
– Messaging Service
– DNS Load Balancing Service
– Service Consolidation Service
– Internal Cloud Infrastructure
Introduction II
• Not all Grid Services are HA by design.
• Need to increase their availability
• Use in-house infrastructure services:
– Service Consolidation Service (Virtualisation)
– DNS Load Balancing Service
– Cheap solutions
• These do not provide real High Availability
• But they greatly reduce the downtime of Grid Services
“one server, one application”
• Low Infrastructure Utilization
– Typically one application per server, to avoid the risk of vulnerabilities in one application affecting the availability of another application on the same server
• Increasing Physical Infrastructure Costs
– Power consumption, cooling and facilities costs do not vary with utilization levels
• Increasing IT Management Costs
– Disproportionate time and resources are spent on manual server maintenance tasks, which requires more personnel
• Insufficient Failover and Disaster Protection
– The threat of security attacks and natural disasters has elevated the importance of business continuity
[Diagram: three servers, each running a single operating system and a single application]
Virtualization
• Virtualization is the ability to run multiple independent virtual operating systems on a single physical computer
Server consolidation
• Grid VOMS servers usage
[Plot: CPU utilization, Grid VOMS cluster, March 2012]
Server consolidation
[Diagram: three servers in Computer Center (513); on each, the single operating system and application are replaced by a hypervisor hosting several virtual machines, each with its own application (Ap) and OS]
• Main advantages:
– Multiple services in the same server
– Hardware agnostic
– No underutilization of resources
Hardware interventions
[Diagram: three servers in Computer Center (513), each running a hypervisor that hosts virtual machines labelled A1, A2 and Ax, each with its own OS]
• Main advantages:
– User transparent
– No service degradation
Virtualization tools
• There are different virtualization technologies:
– XEN
– KVM
– Microsoft Hyper-V
– VMware ESXi
• PES-PS tested XEN and currently uses KVM and Microsoft Hyper-V
Cloud Orchestration tools
• There are several cloud orchestration tools to build private clouds:
– OpenStack, OpenNebula, Platform ISF, Eucalyptus, Nimbus, Microsoft SCVMM, VMware vSphere, ...
• PES-PS test(ed) Platform ISF, OpenNebula, Microsoft SCVMM and OpenStack
• Service Consolidation currently uses Microsoft SCVMM
Is this the silver bullet?
• 90% of PES Grid services run on VMs
• Some still run on real hardware (until it expires)
• Performance cost:
– 5-10% loss in CPU performance
– 20% loss in disk I/O performance
– Overall performance still OK for most services
• Still exposed to (partial) interruptions:
– OS or Grid Application upgrades
– …..
DNS Load Balancing
[Diagram: DNS load balancing flow]
• Load Balancing Arbiter collects a metric from each node of the Application Cluster (SNMP):
– node1: metric=24
– node2: metric=48
– node3: metric=35
– node4: metric=27
• 2 best nodes for application.cern.ch: node1, node4
• Load Balancing Arbiter updates the DNS Server (DynDNS)
• Q: What is the IP address of application.cern.ch?
• A: application.cern.ch resolves to: node4.cern.ch, node1.cern.ch
• Client: Connecting to node4.cern.ch
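The selection step in the diagram can be sketched in Python. This is only an illustration, assuming that a lower metric means a better candidate (consistent with node1 and node4 being chosen above). The function name `best_nodes` and the tie-breaking rule are hypothetical and do not describe the arbiter's actual implementation.

```python
def best_nodes(metrics, count=2):
    """Return the `count` node names with the lowest load metric.

    Assumption: lower metric = better candidate. Ties are broken by
    node name so that the result is deterministic.
    """
    ranked = sorted(metrics.items(), key=lambda item: (item[1], item[0]))
    return [name for name, _ in ranked[:count]]


# Metrics as shown in the diagram.
metrics = {"node1": 24, "node2": 48, "node3": 35, "node4": 27}
print(best_nodes(metrics))  # ['node1', 'node4']
```

The returned names are the ones the arbiter would then publish behind the alias.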
WMS Example: setup
• 3 load-balancing DNS aliases for different configuration classes (“subclusters”)
– SAM monitoring (wmssam.cern.ch), CMS (wmscms.cern.ch), other VOs (wmsshared.cern.ch)
– Identical configuration for all nodes in the same subcluster (using central configuration management)
• Node load is taken into account to select a set of “best nodes” to be exposed in each DNS alias
– Using metrics specific to WMS
– Highly loaded nodes stop receiving new jobs
• Well supported by client software (gLite UI)
– Users specify a single server name in their configuration: the DNS alias
– DNS server returns a list of IP addresses for the alias
– Client software randomly tries IP addresses from the list
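The client-side behaviour described on this slide (resolve the alias, then try the returned addresses in random order) might look roughly like the sketch below. It is not the gLite UI code: the function `connect_via_alias`, its injectable `resolve` and `connect` parameters, and the 5-second timeout are all assumptions made for illustration.

```python
import random
import socket


def connect_via_alias(alias, port, resolve=None, connect=None, rng=random):
    """Resolve a DNS alias and try its addresses in random order.

    `resolve` and `connect` are injectable for testing; by default they
    use the standard library resolver and a plain TCP connection.
    """
    if resolve is None:
        def resolve(name):
            infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
            # Deduplicate while keeping order; info[4][0] is the IP address.
            return list(dict.fromkeys(info[4][0] for info in infos))
    if connect is None:
        def connect(address):
            return socket.create_connection((address, port), timeout=5)

    addresses = list(resolve(alias))
    rng.shuffle(addresses)
    last_error = None
    for address in addresses:
        try:
            return address, connect(address)
        except OSError as error:
            last_error = error  # Node unreachable: try the next one.
    raise ConnectionError(f"no node behind {alias} is reachable") from last_error
```

Because the user only ever configures the alias, a failed node costs one connection attempt before the client moves on to the next address.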
Benefits & limits
• Benefits
– Flexibility: nodes can be added to or removed from a DNS alias without users changing their configuration
– Resource optimization: even load distribution across WMS nodes
– Availability: highly loaded or sick nodes are automatically removed from the DNS alias
– Transparent maintenance: nodes undergoing maintenance are not exposed to users
• But it does not replace a full HA solution
– Each job remains tied to a specific node (we use WMS+LB co-hosting)
– WMS node unavailable = no job status update
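How the availability and maintenance benefits translate into alias membership changes can be illustrated with a small Python sketch. Everything here is hypothetical: the node names, the `load`, `healthy` and `maintenance` fields, and the `max_load` threshold are invented for the example and do not reflect the real service's data model.

```python
def alias_update(current_alias, nodes, max_load):
    """Compute which nodes to add to or remove from a DNS alias.

    `nodes` maps node name -> dict with hypothetical keys:
      "load" (number), "healthy" (bool), "maintenance" (bool).
    """
    eligible = {
        name
        for name, state in nodes.items()
        if state["healthy"] and not state["maintenance"] and state["load"] <= max_load
    }
    to_add = sorted(eligible - set(current_alias))
    to_remove = sorted(set(current_alias) - eligible)
    return to_add, to_remove


nodes = {
    "wms01": {"load": 20, "healthy": True, "maintenance": False},
    "wms02": {"load": 95, "healthy": True, "maintenance": False},
    "wms03": {"load": 10, "healthy": True, "maintenance": True},
    "wms04": {"load": 30, "healthy": True, "maintenance": False},
}
print(alias_update(["wms01", "wms02", "wms03"], nodes, max_load=80))
# (['wms04'], ['wms02', 'wms03'])
```

Users keep pointing at the same alias throughout; only the set of nodes behind it changes.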
Conclusion
• Service Consolidation via Virtualisation should become common practice
• DNS Load balancing is cheap and helps
• The real challenge is ahead of us:
– Running services in a(n) (internal) cloud
– # of Nodes varies constantly
– Dynamic Configuration becomes a must
• This will require service redesign for most of what we know.