Monitoring, Alerting, DevOps, SLAs, and all that
Jason Banfelder
May 2017
Boston, MA
The Rockefeller University

Founded 1901 as The Rockefeller Institute
for Medical Research

First institution in the U.S. devoted
exclusively to biomedical research

Rockefeller Hospital, founded 1910, was
the nation’s first clinical research facility

Graduate program, created in the 1950s,
trains new generations of scientific leaders
University Mission
Science for the benefit of humanity
The Rockefeller University Community
• 82 laboratory heads
• 200 research and clinical scientists
• 170 PhD and 25 MD/PhD students
• 350 postdoctoral researchers
• 1,050 clinicians, technicians, administrative, and support staff
• 1,178 alumni
Structure and Philosophy
To conduct TRANSFORMATIVE—rather than incremental—science
• Unique open structure with no academic departments encourages collaboration and empowers faculty to embrace high-risk, high-reward research
• Focus on hiring only the boldest, most creative scientists
• Philosophy has always been to recruit the scientist, not the field
• Rockefeller attracts scientists who want to collaborate across disciplines
• Provide innovation funding through lab head salaries and core grants
• 24 Nobel Laureates
– 5 are current faculty
• 22 Lasker Award recipients
• 20 National Medal of Science recipients
• 39 current National Academy of Sciences members
• 17 current Howard Hughes Medical Institute Investigators
• Rockefeller is known for pioneering whole new fields of endeavor, such as cell biology and protein chemistry
River Campus Under Construction
Representative Scientific Driver: CryoEM
• Cryo-electron Microscopy
– Take pictures (with electrons instead of light) of proteins
– Reconstruct protein structure from many blurry pictures
– Computationally intensive
• several weeks per structure with an HPC cluster
– Extremely competitive (time-to-result rules)
CryoEM raw data and results
Other Scientific Drivers
• High Throughput Genomics
• Microscopy
– Optical super‐resolution
– Light sheet
• “Come visit us at Janelia to use our microscope, so you too can walk away with 10 TB of data that you don’t know what to do with!”
• “[Biologists] are leaving 99% of scientific insight on the table because we aren’t able to extract meaning from our data.”
– Eric Betzig, Tri-Institutional Seminar Series, January 12th, 2016
Our New Scientific Computing Data Center (I)
Our New Scientific Computing Data Center (II)
High Performance Computing Cluster
• Single heterogeneous cluster
• Mix of shared access and dedicated nodes (“hotel” and “condo” in the same building)
– Hotel nodes owned by HPC group
– Condo nodes purchased by labs
• Slurm batch scheduler
– Participation is optional for condo nodes
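A minimal sketch of what the hotel/condo split can look like from the Slurm command line, assuming it is expressed as partitions; the partition names below are hypothetical, and the slides do not specify how condo access is actually mapped.
sinfo -o "%P %D %t"                                  # list partitions, node counts, and node states
sbatch -p hpc -N 1 -n 24 -t 12:00:00 job.sh          # shared "hotel" nodes (hypothetical partition name)
sbatch -p smithlab -N 1 -n 24 -t 7-00:00:00 job.sh   # a lab's dedicated "condo" nodes (hypothetical)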
High Performance Computing: Vital Statistics
• Hotel (shared)
– 60 general purpose nodes (24 cores; 256 GB RAM)
– 4 GPU nodes (2 x K80; 24 cores; 512 GB RAM)
– 2 large memory nodes (64 cores; 3072 GB RAM)
• Condo (dedicated)
– 1600+ additional cores in 60+ additional nodes
– 20+ additional GPUs (K80/P100)
High Performance Computing:
Storage Hardware
• Shared DDN SFA12K
– GPFS (Spectrum Scale) filesystem; three NSD servers
– 1.5 PB usable
– 6 TB data drives (RAID6 pools)
– Metadata on SSDs (RAID1 pools)
– 4K inodes (small files on SSD)
• Condo DDN SFA12K
– GPFS (Spectrum Scale) filesystem; two NSD servers
– 1.0 PB usable
– 6 TB data drives (RAID6 pools)
– 10K SAS tier (RAID6 pools)
– Metadata on SSDs (RAID1 pools)
– 4K inodes (small files on SSD)
• Backup target: Dell MD3460/R630
– XFS filesystem
– Backup target (825 TB usable)
– Zmanda software
Storage Organization
• All storage is allocated via GPFS quotas
– No “free” scratch space
– No oversubscriptions
– Heavy use of GPFS filesets
• Very small (40 GB) home directory for each user
• Space for data is leased at the lab level
– 1 TB increments
– Annual or monthly terms
– “scratch” (not backed up) or “store” (backed up)
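A hedged sketch of how a 1 TB lease can be carved out with GPFS filesets and quotas; the filesystem, fileset, and junction names are hypothetical, and in practice this is scripted rather than typed by hand.
mmcrfileset gpfs0 smithlab_scratch --inode-space new            # independent fileset for the lab
mmlinkfileset gpfs0 smithlab_scratch -J /gpfs0/scratch/smithlab # expose it at a junction path
mmsetquota gpfs0:smithlab_scratch --block 1T:1T                 # hard 1 TB block quota (no oversubscription)
mmlsquota -j smithlab_scratch gpfs0                             # verify usage against the lease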
Storage Reliability
• Shared DDN SFA12K
– live Dec 2015
• Dedicated DDN SFA12K
– live May 2016
• 100% availability since go‐live on both systems
• Is this a good thing?
– SLAs and bonuses
– Are we doing enough?
Networking Summary
• We have found GPFS to be extremely robust if the underlying networks are solid.
– Non-routed 10 GbE network for data and provisioning
– InfiniBand for GPFS data movement/MPI
– Physically separate, non-routed 1 GbE management network
– NAT bridge to the campus (and world)
HPC IP Network Overview
• provision01 runs a DHCP and PXE server (serving the HPC P/D Network), and so can never touch the campus network
• all nodes have their iDRAC on the HPC Admin Network
• provision01 and mgmt0[12] OSes can talk to the HPC Admin Network, to issue commands to, and monitor, iDRACs
• all nodes except provision01 have first boot device as NIC1 (PXE) and second boot device as /dev/sda
• any node can be rebuilt by (dynamically) setting its next-server and filename in provision01’s DHCP config, and then sending a power-cycle command to its iDRAC via the HPC Admin Network (see the sketch below)
• login02 runs NAT to allow machines on the HPC P/D network to initiate outbound traffic to the campus and internet (e.g. to allow wget from a node). This will run on a dedicated IP address; login01 will act as a failover for this.
[Network diagram: login0[12] (NIC1/10 GbE, PXE boot/NAT; NIC2/10 GbE; iDRAC/1 GbE) connect the Campus Network to the HPC Provisioning/Data Network (10 GbE, Dell S4048-ON). provision01 (NIC1/10 GbE, PXE boot; NIC3/1 GbE; iDRAC/1 GbE) hosts the DHCP & PXE servers; mgmt0[12] (NIC4/1 GbE; iDRAC/1 GbE) and the compute[0-9]{3} and gpfs[0-9]{2} nodes (NIC1/10 GbE, PXE boot; NIC3/1 GbE; iDRAC/1 GbE) also attach, along with the Mellanox InfiniBand Fabric. The iDRACs and the NIC3/NIC4 1 GbE ports sit on the HPC Admin Network (1 GbE, Dell S3048-ON). VLAN labels 100 and 1000 appear in the diagram.]
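A hedged sketch of the rebuild flow described in the bullets above, assuming an ISC dhcpd host entry on provision01 and ipmitool for the iDRAC power-cycle; the next-server address, filename, service name, and credentials are illustrative, while the MAC and IPs are taken from the inventory example later in the deck.
# On provision01: point the node's DHCP host entry at the desired PXE profile
# (illustrative dhcpd.conf fragment; next-server and filename are hypothetical).
cat <<'EOF' >> /etc/dhcp/dhcpd.conf
host compute022 {
  hardware ethernet 14:18:77:57:bf:96;
  fixed-address 172.31.4.22;
  next-server 172.31.0.10;     # provision01 on the P/D network (hypothetical address)
  filename "pxelinux.0";
}
EOF
systemctl restart dhcpd
# From a management host: power-cycle the node via its iDRAC on the HPC Admin Network.
# NIC1 (PXE) is already the first boot device, so the node rebuilds itself on the way up.
ipmitool -I lanplus -H 172.31.0.22 -U root -P "$IPMI_PASSWORD" chassis power cycle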
InfiniBand Fabric
• EDR switches (100 Gbps); FDR cards in nodes (56 Gbps)
• ~3:1 blocking factor
– Most of our applications are not MPI dependent (at scale)
– Can implement rack-aware scheduling if the need arises
InfiniBand Topology
[Diagram: two-level fat tree of 36-port EDR switches. Each Level 1 switch serves up to 30 FDR nodes (30 x 56 Gbps = 1680 Gbps down) and has 6 x 100 Gbps = 600 Gbps of uplinks to Level 0, giving 1680 / 600 = 2.8 blocking; up to 12 Level 1 replicates.]
This works for up to 360 nodes, without a director switch
InfiniBand Robustness
• Mellanox-only deployment
• Unified Fabric Manager
– redundant opensm daemons
– continuous, comprehensive monitoring and
alerting
• Can be used in monitoring only mode
• Licensed per device on fabric
• Complex HA deployment (MLNX on-site)
Alerting Philosophy (I)
• Once email/pager fatigue sets in, you don’t have an alerting system anymore
– SAs receiving 1,000 nagios emails/day
• Configure for no false positive alerts
– It is better to miss real conditions than it is to alert on false positives
See Kyle Brandt’s GrafanaCon 2016 talk on monitoring and alerting at Stack Overflow
Alerting Philosophy (II)
[Diagram: thresholds tuned to catch everything. During normal operations, alerts are predominantly false positives; during problems, few conditions are missed. Net result: all alerts will be ignored.]
Alerting Philosophy (II)
[Diagram: thresholds tuned for no false positives. Some more conditions are missed during problems, but there are very few alerts per day; nearly all are real and will be acted upon immediately.]
Alerting Philosophy (III)
• Only alert on critical conditions
– e.g. don’t email on a network port going dark (node reboot)
– e.g. don’t email on a single disk failure in a RAID6 pool
• Only send one alert per escalation (see the sketch after this list)
– Don’t repeat hourly, e.g.
• We re-alert every 8 hours
• An unacknowledged alert is a (management) problem
– Don’t alert at all on conditions clearing
• The cause still needs to be investigated
• Avoids “wait and see” approach to problem resolution
• Check dashboard periodically (e.g. daily) for non-critical conditions
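The escalation rules above are policy rather than a feature of a particular tool named in the slides; the following is a minimal cron-style bash sketch of that policy, with the check command, state file, and mail address all hypothetical (acknowledgement tracking is omitted).
#!/bin/bash
# Hypothetical alert wrapper, run from cron every few minutes.
# One mail per escalation, re-alert after 8 hours if the condition persists,
# and no mail when the condition clears (the cause still gets investigated via the dashboard).
STATE=/var/run/alert.gpfs_down           # hypothetical per-condition state file
REALERT=$((8 * 3600))
if ! /usr/local/sbin/check_gpfs_ok; then          # hypothetical critical-condition check
    now=$(date +%s)
    last=$(cat "$STATE" 2>/dev/null || echo 0)
    if (( now - last >= REALERT )); then
        echo "GPFS is unhealthy" | mail -s "CRITICAL: GPFS" [email protected]
        echo "$now" > "$STATE"
    fi
else
    rm -f "$STATE"                                # clear silently; no "recovered" email
fi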
Monitoring Philosophy I
• Collect raw data on everything you can
• At high time resolution
– We use one minute as a default
• Keep it all… forever (see the retention sketch after this list)
– Avoid summarizing or aggregating
– Averages are not nearly as useful as you think
– Collect short-term histograms or tail percentile values
• Keep it in the same place
– Vendor-specific tools are of limited value
• Make it ubiquitously available
– establish a single source of truth that everyone has access to
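A hedged sketch of the “keep it all, forever” point in InfluxDB 1.x terms (the InfluxQL below is standard for that version); the database name is hypothetical.
# Create a database whose default retention policy never expires raw points (no downsampling).
influx -execute 'CREATE DATABASE "hpc" WITH DURATION INF REPLICATION 1 NAME "raw_forever"'
# Confirm that the retention policy duration is infinite.
influx -database hpc -execute 'SHOW RETENTION POLICIES'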
Monitoring Tool Stack
• We prefer Open Source tools
– InfluxDB for continuous data
– Grafana for dashboarding
– Probably ELK for event data (not there yet)
– R for reporting and analytics (Shiny?)
• Avoid cost models that scale with…
– Amount of data collected
– Number of devices or metrics monitored (IoT?)
– Number of users that have access to the data
• Configuration is via DevOps procedures
InfluxDB (I)
• Our use is for data collection and storage only
– Does this one thing really well
– Visualization, dashboarding, alerting, analytics, and reporting are separate (albeit dependent) problems
• Space efficient
– 21 GB for 7 months of data on 127 cluster nodes and core switch port data
• Stable
– influxd daemon running for 237 days
• CPU efficient
– 5% of one core for database daemon
– 0.2% of one core for collection daemon
The Value of High Resolution Data
[Plots: free memory on a GPFS NSD server at two zoom levels, six months and 90 minutes. Captions: “Six months of free memory on a GPFS NSD server”; “Historical record of an ‘event’ on that server: 10 minutes of downtime (Sept 25, 2016)”; “This is the speaker fork bombing one of the GPFS NSD servers.”]
The Value of Raw Data
Raw packet counts are stored; rates (packets or bytes per sec) are computed at plot time
Different questions need different windows
Rates are on a log scale – average rates are misleading
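A hedged example of deriving a rate at query time from raw counters, assuming InfluxQL; the measurement, field, and tag names are hypothetical.
# Packets per second over the last 6 hours, computed at plot time from raw cumulative counters.
influx -database hpc -execute \
  "SELECT non_negative_derivative(max(\"rx_packets\"), 1s) FROM \"switch_ports\" WHERE \"port\" = 'Te1/1' AND time > now() - 6h GROUP BY time(1m)"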
InfluxDB (II)
• Many out-of-the-box collectors available
• The “API” is a simple text line protocol (see the write sketch below)
slurm_node_state,host=node001 partition="hpc",state="allocated"
• Zero configuration on the server side
– Just send it data and it will store it
– No server configuration for new devices or metrics
• VMs, containers, other transients
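A hedged sketch of pushing the line shown above by hand through InfluxDB 1.x’s HTTP write endpoint (in production telegraf does this); the host and database names are hypothetical.
# One POST writes the point; no schema or per-metric server configuration is required.
curl -i -XPOST 'http://influx01:8086/write?db=hpc' \
  --data-binary 'slurm_node_state,host=node001 partition="hpc",state="allocated"'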
DevOps Software Stack
• (Too) Many tools to choose from.
• Our philosophy: keep it as thin and simple as possible
• Encourage incremental adoption
DevOps Example I: Build a new cluster node
• Define new node in ansible inventory
• Ansible creates a kickstart file with a minimal build
• IPMI command sent to reboot node into PXE environment
• Standard ansible for rest of build (see the run sketch after the inventory example)
manufacturer: Dell
hardware_model: R630
remedy_description: NFS AFM backend for
rack_name: 25
rack_location: 26
drac_ip: 172.31.0.22
node_ip: 172.31.4.22
pxe_mac: 14:18:77:57:BF:96
pxe_dev: NIC.Integrated.1-1-1
boot_partition_size: 1000
pvs:
  - disk: sda
    name: pv.100000
vgs:
  - suffix: rootvg
    pvs:
      - pv.100000
lvs:
  - path: swap
    fstype: swap
    name: swap
    vg: rootvg
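A hedged sketch of kicking off the build once the node is defined in the inventory; the playbook name, inventory path, and host name are hypothetical.
ansible-playbook -i inventory/hpc site.yml --limit compute022 --check --diff   # dry run first
ansible-playbook -i inventory/hpc site.yml --limit compute022                  # actual build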
DevOps Example II: Monitor slurm state
[root@mgmt01-hpc roles]# tree slurm_telegraf/
slurm_telegraf/
├── meta
│   └── main.yml
├── README.md
├── tasks
│   └── main.yml
└── templates
    ├── node_slurm_states.j2
    └── slurm_telegraf.conf.j2
[root@mgmt01-hpc slurm_telegraf]# cat templates/slurm_telegraf.conf.j2
[[inputs.exec]]
interval="1m"
commands=["/etc/telegraf/node_slurm_states"]
timeout="15s"
data_format="influx"
[root@mgmt01-hpc slurm_telegraf]# cat templates/node_slurm_states.j2
#! /bin/bash
sinfo -N --noheader --format "%N %R %T" | \
awk '{ printf("slurm_node_state,host=%s partition=\"%s\",state=\"%s\"\n", $1, $2, $3); }'
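Two hedged ways to check the deployed collector by hand; the paths follow the templates above, and telegraf’s --test flag (gather all inputs once and print to stdout instead of writing to InfluxDB) is standard behavior.
/etc/telegraf/node_slurm_states | head -3              # should print line protocol, one line per node
telegraf --config /etc/telegraf/telegraf.conf --test   # gather once and show what would be written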
DevOps Example III: Add a node to the GPFS cluster
• Complex, 20-step process; reward: no operational mistakes
– Tricky to achieve idempotency
– Authored custom ansible modules (python scripts)
– Handle rebuild of a node that wasn’t cleanly removed from the cluster
• Orchestrate node and NSD servers (see the sketch after the task list)
- name: retrieve list of gpfs clients
- name: assess if this host needs to be added as a gpfs client
- name: allow administration from gpfs heads
- name: add gpfs bin to root user PATH
- name: install gpfs dependencies
- name: install gpfs rpms
- name: gpfs make Autoconfig
- name: gpfs make World
- name: check if mmfs kernel module needs to be installed
- name: gpfs make InstallImages
- name: check if mmfsNodeData exists
- name: re-add a rebuilt gpfs node -- part I
- name: re-add a rebuilt gpfs node -- part II
- name: add client to gpfs cluster
- name: get gpfs client info
- name: add gpfs client license
- name: refresh gpfs client config files
- name: start gpfs on client
- name: wait for gpfs to be ready
- name: mount all gpfs filesystems
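For orientation, a hedged sketch of the manual GPFS commands that the later tasks wrap (the real role drives these through custom ansible modules so they stay idempotent); the node name is hypothetical.
mmaddnode -N compute022                     # add client to gpfs cluster
mmchlicense client --accept -N compute022   # add gpfs client license
mmstartup -N compute022                     # start gpfs on client
mmgetstate -N compute022                    # check that the node reports "active"
mmmount all -N compute022                   # mount all gpfs filesystems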
Questions
Thank You!
Jason Banfelder
[email protected]