Monitoring, Alerting, DevOps, SLAs, and all that
Jason Banfelder
May 2017
Boston, MA
The Rockefeller University

Founded 1901 as The Rockefeller Institute
for Medical Research

First institution in the U.S. devoted
exclusively to biomedical research

Rockefeller Hospital, founded 1910, was
the nation’s first clinical research facility

Graduate program, created in the 1950s,
trains new generations of scientific leaders
University Mission
Science for the benefit of humanity
The Rockefeller University Community
• 82 laboratory heads
• 200 research and clinical scientists
• 170 PhD and 25 MD/PhD students
• 350 postdoctoral researchers
• 1,050 clinicians, technicians, administrative, and support staff
• 1,178 alumni
Structure and Philosophy
To conduct TRANSFORMATIVE—rather than incremental—science
• Unique open structure with no academic departments encourages collaboration and empowers faculty to embrace high-risk, high-reward research
• Focus on hiring only the boldest, most creative scientists
• Philosophy has always been to recruit the scientist, not the field
• Rockefeller attracts scientists who want to collaborate across disciplines
• Provide innovation funding through lab head salaries and core grants
• 24 Nobel Laureates
– 5 are current faculty
• 22 Lasker Award recipients
• 20 National Medal of Science recipients
• 39 current National Academy of Sciences members
• 17 current Howard Hughes Medical Institute Investigators
• Rockefeller is known for pioneering whole new fields of endeavor, such as cell biology and protein chemistry
River Campus Under Construction
Representative Scientific Driver: CryoEM
• Cryo-electron Microscopy
– Take pictures (with electrons instead of light) of proteins
– Reconstruct protein structure from many blurry pictures
– Computationally intensive
• several weeks per structure with an HPC cluster
– Extremely competitive (time-to-result rules)
CryoEM raw data and results
Other Scientific Drivers
• High Throughput Genomics
• Microscopy
– Optical super‐resolution
– Light sheet
• “Come visit us at Janelia to use our microscope, so you too can walk away with 10 TB of data that you don’t know what to do with!”
• “[Biologists] are leaving 99% of scientific insight on the table because we aren’t able to extract meaning from our data.”
– Eric Betzig, Tri-Institutional Seminar Series, January 12th, 2016
Our New Scientific Computing Data Center (I)
Our New Scientific Computing Data Center (II)
High Performance Computing Cluster
• Single heterogeneous cluster
• Mix of shared access and dedicated nodes (“hotel” and “condo” in the same building)
– Hotel nodes owned by HPC group
– Condo nodes purchased by labs
• Slurm batch scheduler
– Participation is optional for condo nodes
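A minimal sketch of what the hotel/condo split can look like from the Slurm command line, assuming it is expressed as partitions; the partition names below are hypothetical, and the slides do not specify how condo access is actually mapped.
sinfo -o "%P %D %t"                                  # list partitions, node counts, and node states
sbatch -p hpc -N 1 -n 24 -t 12:00:00 job.sh          # shared "hotel" nodes (hypothetical partition name)
sbatch -p smithlab -N 1 -n 24 -t 7-00:00:00 job.sh   # a lab's dedicated "condo" nodes (hypothetical)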
High Performance Computing: Vital Statistics
• Hotel (shared)
– 60 general purpose nodes (24 cores; 256 GB RAM)
– 4 GPU nodes (2 x K80; 24 cores; 512 GB RAM)
– 2 large memory nodes (64 cores; 3072 GB RAM)
• Condo (dedicated)
– 1600+ additional cores in 60+ additional nodes
– 20+ additional GPUs (K80/P100)
High Performance Computing:
Storage Hardware
• Shared DDN SFA12K
– GPFS (Spectrum Scale) filesystem; three NSD servers
– 1.5 PB usable
– 6 TB data drives (RAID6 pools)
– Metadata on SSDs (RAID1 pools)
– 4K inodes (small files on SSD)
• Condo DDN SFA12K
– GPFS (Spectrum Scale) filesystem; two NSD servers
– 1.0 PB usable
– 6 TB data drives (RAID6 pools)
– 10K SAS tier (RAID6 pools)
– Metadata on SSDs (RAID1 pools)
– 4K inodes (small files on SSD)
• Backup target: Dell MD3460/R630
– XFS filesystem
– Backup target (825 TB usable)
– Zmanda software
Storage Organization
• All storage is allocated via GPFS quotas
– No “free” scratch space
– No oversubscriptions
– Heavy use of GPFS filesets
• Very small (40 GB) home directory for each user
• Space for data is leased at the lab level
– 1 TB increments
– Annual or monthly terms
– “scratch” (not backed up) or “store” (backed up)
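A hedged sketch of how a 1 TB lease can be carved out with GPFS filesets and quotas; the filesystem, fileset, and junction names are hypothetical, and in practice this is scripted rather than typed by hand.
mmcrfileset gpfs0 smithlab_scratch --inode-space new            # independent fileset for the lab
mmlinkfileset gpfs0 smithlab_scratch -J /gpfs0/scratch/smithlab # expose it at a junction path
mmsetquota gpfs0:smithlab_scratch --block 1T:1T                 # hard 1 TB block quota (no oversubscription)
mmlsquota -j smithlab_scratch gpfs0                             # verify usage against the lease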
Storage Reliability
• Shared DDN SFA12K
– live Dec 2015
• Dedicated DDN SFA12K
– live May 2016
• 100% availability since go‐live on both systems
• Is this a good thing?
– SLAs and bonuses
– Are we doing enough?
Networking Summary
• We have found GPFS to be extremely robust if the underlying networks are solid.
– Non-routed 10 GbE network for data and provisioning
– InfiniBand for GPFS data movement/MPI
– Physically separate, non-routed 1 GbE management network
– NAT bridge to the campus (and world)
HPC IP Network Overview
• provision01 runs a DHCP and PXE server (serving the HPC P/D Network), and so can never touch the campus network
• all nodes have their iDRAC on the HPC Admin Network
• provision01 and mgmt0[12] OSes can talk to the HPC Admin Network, to issue commands to, and monitor, iDRACs
• all nodes except provision01 have first boot device as NIC1 (PXE) and second boot device as /dev/sda
• any node can be rebuilt by (dynamically) setting its next-server and filename in provision01’s DHCP config, and then sending a power-cycle command to its iDRAC via the HPC Admin Network (see the sketch below)
• login02 runs NAT to allow machines on the HPC P/D network to initiate outbound traffic to the campus and internet (e.g. to allow wget from a node). This will run on a dedicated IP address; login01 will act as a failover for this.
[Network diagram: login0[12] (NIC1/10 GbE, PXE boot/NAT; NIC2/10 GbE; iDRAC/1 GbE) connect the Campus Network to the HPC Provisioning/Data Network (10 GbE, Dell S4048-ON). provision01 (NIC1/10 GbE, PXE boot; NIC3/1 GbE; iDRAC/1 GbE) hosts the DHCP & PXE servers; mgmt0[12] (NIC4/1 GbE; iDRAC/1 GbE) and the compute[0-9]{3} and gpfs[0-9]{2} nodes (NIC1/10 GbE, PXE boot; NIC3/1 GbE; iDRAC/1 GbE) also attach, along with the Mellanox InfiniBand Fabric. The iDRACs and the NIC3/NIC4 1 GbE ports sit on the HPC Admin Network (1 GbE, Dell S3048-ON). VLAN labels 100 and 1000 appear in the diagram.]
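A hedged sketch of the rebuild flow described in the bullets above, assuming an ISC dhcpd host entry on provision01 and ipmitool for the iDRAC power-cycle; the next-server address, filename, service name, and credentials are illustrative, while the MAC and IPs are taken from the inventory example later in the deck.
# On provision01: point the node's DHCP host entry at the desired PXE profile
# (illustrative dhcpd.conf fragment; next-server and filename are hypothetical).
cat <<'EOF' >> /etc/dhcp/dhcpd.conf
host compute022 {
  hardware ethernet 14:18:77:57:bf:96;
  fixed-address 172.31.4.22;
  next-server 172.31.0.10;     # provision01 on the P/D network (hypothetical address)
  filename "pxelinux.0";
}
EOF
systemctl restart dhcpd
# From a management host: power-cycle the node via its iDRAC on the HPC Admin Network.
# NIC1 (PXE) is already the first boot device, so the node rebuilds itself on the way up.
ipmitool -I lanplus -H 172.31.0.22 -U root -P "$IPMI_PASSWORD" chassis power cycle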
InfiniBand Fabric
• EDR switches (100 Gbps); FDR cards in nodes (56 Gbps)
• ~3:1 blocking factor
– Most of our applications are not MPI dependent (at scale)
– Can implement rack-aware scheduling if the need arises
InfiniBand Topology
[Diagram: two-level fat tree of 36-port EDR switches. Each Level 1 switch serves up to 30 FDR nodes (30 x 56 Gbps = 1680 Gbps down) and has 6 x 100 Gbps = 600 Gbps of uplinks to Level 0, giving 1680 / 600 = 2.8 blocking; up to 12 Level 1 replicates.]
This works for up to 360 nodes, without a director switch
InfiniBand Robustness
• Mellanox-only deployment
• Unified Fabric Manager
– redundant opensm daemons
– continuous, comprehensive monitoring and
alerting
• Can be used in monitoring only mode
• Licensed per device on fabric
• Complex HA deployment (MLNX on-site)
Alerting Philosophy (I)
• Once email/pager fatigue sets in, you don’t have an alerting system anymore
– SAs receiving 1,000 nagios emails/day
• Configure for no false positive alerts
– It is better to miss real conditions than it is to alert on false positives
See Kyle Brandt’s GrafanaCon 2016 talk on monitoring and alerting at Stack Overflow
Alerting Philosophy (II)
[Diagram: thresholds tuned to catch everything. During normal operations, alerts are predominantly false positives; during problems, few conditions are missed. Net result: all alerts will be ignored.]
Alerting Philosophy (II)
[Diagram: thresholds tuned for no false positives. Some more conditions are missed during problems, but there are very few alerts per day; nearly all are real and will be acted upon immediately.]
Alerting Philosophy (III)
• Only alert on critical conditions
– e.g. don’t email on a network port going dark (node reboot)
– e.g. don’t email on a single disk failure in a RAID6 pool
• Only send one alert per escalation (see the sketch after this list)
– Don’t repeat hourly, e.g.
• We re-alert every 8 hours
• An unacknowledged alert is a (management) problem
– Don’t alert at all on conditions clearing
• The cause still needs to be investigated
• Avoids “wait and see” approach to problem resolution
• Check dashboard periodically (e.g. daily) for non-critical conditions
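The escalation rules above are policy rather than a feature of a particular tool named in the slides; the following is a minimal cron-style bash sketch of that policy, with the check command, state file, and mail address all hypothetical (acknowledgement tracking is omitted).
#!/bin/bash
# Hypothetical alert wrapper, run from cron every few minutes.
# One mail per escalation, re-alert after 8 hours if the condition persists,
# and no mail when the condition clears (the cause still gets investigated via the dashboard).
STATE=/var/run/alert.gpfs_down           # hypothetical per-condition state file
REALERT=$((8 * 3600))
if ! /usr/local/sbin/check_gpfs_ok; then          # hypothetical critical-condition check
    now=$(date +%s)
    last=$(cat "$STATE" 2>/dev/null || echo 0)
    if (( now - last >= REALERT )); then
        echo "GPFS is unhealthy" | mail -s "CRITICAL: GPFS" [email protected]
        echo "$now" > "$STATE"
    fi
else
    rm -f "$STATE"                                # clear silently; no "recovered" email
fi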
Monitoring Philosophy I
• Collect raw data on everything you can
• At high time resolution
– We use one minute as a default
• Keep it all… forever (see the retention sketch after this list)
– Avoid summarizing or aggregating
– Averages are not nearly as useful as you think
– Collect short-term histograms or tail percentile values
• Keep it in the same place
– Vendor-specific tools are of limited value
• Make it ubiquitously available
– establish a single source of truth that everyone has access to
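A hedged sketch of the “keep it all, forever” point in InfluxDB 1.x terms (the InfluxQL below is standard for that version); the database name is hypothetical.
# Create a database whose default retention policy never expires raw points (no downsampling).
influx -execute 'CREATE DATABASE "hpc" WITH DURATION INF REPLICATION 1 NAME "raw_forever"'
# Confirm that the retention policy duration is infinite.
influx -database hpc -execute 'SHOW RETENTION POLICIES'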
Monitoring Tool Stack
• We prefer Open Source tools
– InfluxDB for continuous data
– Grafana for dashboarding
– Probably ELK for event data (not there yet)
– R for reporting and analytics (Shiny?)
• Avoid cost models that scale with…
– Amount of data collected
– Number of devices or metrics monitored (IoT?)
– Number of users that have access to the data
• Configuration is via DevOps procedures
InfluxDB (I)
• Our use is for data collection and storage only
– Does this one thing really well
– Visualization, dashboarding, alerting, analytics, and reporting are separate (albeit dependent) problems
• Space efficient
– 21 GB for 7 months of data on 127 cluster nodes and core switch port data
• Stable
– influxd daemon running for 237 days
• CPU efficient
– 5% of one core for database daemon
– 0.2% of one core for collection daemon
The Value of High Resolution Data
[Plots: free memory on a GPFS NSD server at two zoom levels, six months and 90 minutes. Captions: “Six months of free memory on a GPFS NSD server”; “Historical record of an ‘event’ on that server: 10 minutes of downtime (Sept 25, 2016)”; “This is the speaker fork bombing one of the GPFS NSD servers.”]
The Value of Raw Data
Raw packet counts are stored; rates (packets or bytes per sec) are computed at plot time
Different questions need different windows
Rates are on a log scale – average rates are misleading
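A hedged example of deriving a rate at query time from raw counters, assuming InfluxQL; the measurement, field, and tag names are hypothetical.
# Packets per second over the last 6 hours, computed at plot time from raw cumulative counters.
influx -database hpc -execute \
  "SELECT non_negative_derivative(max(\"rx_packets\"), 1s) FROM \"switch_ports\" WHERE \"port\" = 'Te1/1' AND time > now() - 6h GROUP BY time(1m)"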
InfluxDB (II)
• Many out-of-the-box collectors available
• The “API” is a simple text line protocol (see the write sketch below)
slurm_node_state,host=node001 partition="hpc",state="allocated"
• Zero configuration on the server side
– Just send it data and it will store it
– No server configuration for new devices or metrics
• VMs, containers, other transients
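A hedged sketch of pushing the line shown above by hand through InfluxDB 1.x’s HTTP write endpoint (in production telegraf does this); the host and database names are hypothetical.
# One POST writes the point; no schema or per-metric server configuration is required.
curl -i -XPOST 'http://influx01:8086/write?db=hpc' \
  --data-binary 'slurm_node_state,host=node001 partition="hpc",state="allocated"'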
DevOps Software Stack
• (Too) Many tools to choose from.
• Our philosophy: keep it as thin and simple as possible
• Encourage incremental adoption
DevOps Example I: Build a new cluster node
• Define new node in ansible inventory
• Ansible creates a kickstart file with a minimal build
• IPMI command sent to reboot node into PXE environment
• Standard ansible for rest of build (see the run sketch after the inventory example)
manufacturer: Dell
hardware_model: R630
remedy_description: NFS AFM backend for
rack_name: 25
rack_location: 26
drac_ip: 172.31.0.22
node_ip: 172.31.4.22
pxe_mac: 14:18:77:57:BF:96
pxe_dev: NIC.Integrated.1-1-1
boot_partition_size: 1000
pvs:
  - disk: sda
    name: pv.100000
vgs:
  - suffix: rootvg
    pvs:
      - pv.100000
lvs:
  - path: swap
    fstype: swap
    name: swap
    vg: rootvg
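A hedged sketch of kicking off the build once the node is defined in the inventory; the playbook name, inventory path, and host name are hypothetical.
ansible-playbook -i inventory/hpc site.yml --limit compute022 --check --diff   # dry run first
ansible-playbook -i inventory/hpc site.yml --limit compute022                  # actual build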
DevOps Example II: Monitor slurm state
[root@mgmt01-hpc roles]# tree slurm_telegraf/
slurm_telegraf/
├── meta
│   └── main.yml
├── README.md
├── tasks
│   └── main.yml
└── templates
    ├── node_slurm_states.j2
    └── slurm_telegraf.conf.j2
[root@mgmt01-hpc slurm_telegraf]# cat templates/slurm_telegraf.conf.j2
[[inputs.exec]]
interval="1m"
commands=["/etc/telegraf/node_slurm_states"]
timeout="15s"
data_format="influx"
[root@mgmt01-hpc slurm_telegraf]# cat templates/node_slurm_states.j2
#! /bin/bash
sinfo -N --noheader --format "%N %R %T" | \
awk '{ printf("slurm_node_state,host=%s partition=\"%s\",state=\"%s\"\n", $1, $2, $3); }'
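Two hedged ways to check the deployed collector by hand; the paths follow the templates above, and telegraf’s --test flag (gather all inputs once and print to stdout instead of writing to InfluxDB) is standard behavior.
/etc/telegraf/node_slurm_states | head -3              # should print line protocol, one line per node
telegraf --config /etc/telegraf/telegraf.conf --test   # gather once and show what would be written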
DevOps Example III: Add a node to the GPFS cluster
• Complex, 20-step process; reward: no operational mistakes
– Tricky to achieve idempotency
– Authored custom ansible modules (python scripts)
– Handle rebuild of a node that wasn’t cleanly removed from the cluster
• Orchestrate node and NSD servers (see the sketch after the task list)
- name: retrieve list of gpfs clients
- name: assess if this host needs to be added as a gpfs client
- name: allow administration from gpfs heads
- name: add gpfs bin to root user PATH
- name: install gpfs dependencies
- name: install gpfs rpms
- name: gpfs make Autoconfig
- name: gpfs make World
- name: check if mmfs kernel module needs to be installed
- name: gpfs make InstallImages
- name: check if mmfsNodeData exists
- name: re-add a rebuilt gpfs node -- part I
- name: re-add a rebuilt gpfs node -- part II
- name: add client to gpfs cluster
- name: get gpfs client info
- name: add gpfs client license
- name: refresh gpfs client config files
- name: start gpfs on client
- name: wait for gpfs to be ready
- name: mount all gpfs filesystems
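For orientation, a hedged sketch of the manual GPFS commands that the later tasks wrap (the real role drives these through custom ansible modules so they stay idempotent); the node name is hypothetical.
mmaddnode -N compute022                     # add client to gpfs cluster
mmchlicense client --accept -N compute022   # add gpfs client license
mmstartup -N compute022                     # start gpfs on client
mmgetstate -N compute022                    # check that the node reports "active"
mmmount all -N compute022                   # mount all gpfs filesystems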
Questions
Thank You!
Jason Banfelder
[email protected]