IBM Systems and Technology Group

System Management Considerations for Beowulf Clusters
Bruce Potter, Lead Architect, Cluster Systems Management
IBM Corporation, Poughkeepsie NY
[email protected]
© 2006 IBM Corporation

Beowulf Clusters
Beowulf is the earliest surviving epic poem written in English. It is the story of a hero of great strength and courage who defeated a monster called Grendel.
Beowulf clusters are scalable performance clusters built from commodity hardware on a private system network, with an open source software (Linux) infrastructure. The designer can improve performance proportionally with added machines.
http://www.beowulf.org

Beowulf Clusters Are Used in a Broad Spectrum of Markets
– Life Sciences: research, drug discovery, diagnostics, information-based medicine
– Digital Media: digital content creation, management, and distribution; game server farms
– Business Intelligence: data warehousing and data mining, CRM
– Petroleum: oil and gas exploration and production, seismic and reservoir analysis
– Industrial/Product Lifecycle Management: CAE, EDA, CAD/PDM for electronics, automotive, and aerospace
– Financial Services: optimizing IT infrastructure, risk management and compliance, financial market modeling, insurance/actuarial analysis
– Government & Higher Education: scientific research, classified/defense, weather/environmental sciences

IBM Deep Computing: What Drives HPC?
“The Need for Speed…”
Computational needs of technical, scientific, digital media, and business applications approach or exceed the petaflop/s range:
– CFD wing simulation: 512x64x256 grid (8.3 x 10^6 mesh points), 5000 FLOPs per mesh point, 5000 time steps/cycles = 2.15 x 10^14 FLOPs (Source: A. Jameson, et al.)
– CFD full-plane simulation: 3.5 x 10^17 mesh points, 5000 FLOPs per mesh point, 5000 time steps/cycles = 8.7 x 10^24 FLOPs (Source: A. Jameson, et al.)
– Materials science, magnetic materials: current, 2000 atoms at 2.64 TF/s and 512 GB; future, HDD simulation at 30 TF/s and 2 TB (Source: D. Bailey, NERSC)
– Materials science, electronic structures: current, 300 atoms at 0.5 TF/s and 100 GB; future, 3000 atoms at 50 TF/s and 2 TB (Source: D. Bailey, NERSC)
– Digital movies and special effects: ~10^14 FLOPs per frame, 50 frames/sec, 90-minute movie = 2.7 x 10^19 FLOPs, or ~150 days on 2000 1-GFLOP/s CPUs (Source: Pixar)
– Spare-parts inventory planning: modeling the optimized deployment of 10,000 part numbers across 100 parts depots requires 2 x 10^14 FLOPs (12 hours on 10 650-MHz CPUs); 2.4 PFLOP/s sustained performance (1-hour turnaround time). The industry trend toward rapid, frequent modeling for timely business decision support drives higher sustained performance. (Source: B. Dietrich, IBM)

The Impact of Machine-Generated Data
– Data created by hand (salary, orders, etc.):
  • historically in databases
  • low volume (but high value)
– Machine-generated data (sensors):
  • high volume
  • not amenable to traditional database architectures

(Figure: machine-generated versus authored data, in gigabytes per US capita per year, 1995–2015, log scale from 0.001 to 1,000. Machine-generated categories: all medical imaging, medical data stored, surveillance, surveillance for urban areas, personal multimedia, online storage, in databases. Authored categories: static web data, text data.)

GPFS on ASC Purple/C Supercomputer
– 1536-node, 100 TF pSeries cluster at Lawrence Livermore National Laboratory
– 2 PB GPFS file system (one mount point)
– 500 RAID controller pairs, 11,000 disk drives
– 126 GB/s parallel I/O measured to a single file (134 GB/s to multiple files)

IDE: Programming-Model-Specific Editor

Scheduling and Resource Management Concepts
– Resource manager: job submission and control; launches jobs (serial and parallel) on specific resources at specific times
– Batch scheduler: optimizes use of cluster resources to maximize throughput and comply with organizational policies
– Additional capabilities: a) provide a runtime execution environment; b) system utilization and monitoring environment
(Diagram: Job1–Job4 flow from a job queue through the scheduler and resource manager onto the compute cluster.)

LL (LoadLeveler) Vision: Enterprise Data Centers
Scheduling domain: the enterprise
(Diagram: Clusters A–D scheduled across a WAN; Cluster D at NCSA.)
– Customer data center environments are heterogeneous
– Customer data center environments are sometimes scattered over a WAN
– Customers want meta-scheduling capabilities with automatic load balancing
– Facility-wide file sharing is now becoming possible, making scheduling across the enterprise critical

Wide Area File Sharing: GPFS
With 64 dual-processor IA-64 IBM eServers and 500 TB of IBM FAStT100 storage, we created a General Parallel File System (GPFS) accessible over the dedicated 40 Gb/s backbone network (WAN) of the TeraGrid. The WAN GPFS is mounted at the TeraGrid partner sites in addition to any local parallel file systems. To determine ownership of files between sites with different UID spaces, a user's Globus x.509 certificate identity is used.

Scientists and researchers use geographically distributed compute, data, and other scientific instruments and resources on the TeraGrid. Efficient use of these distributed resources is often hindered by the manual process of checking for available storage at various sites, copying files, and then managing multiple copies of datasets. If a scientist's files and datasets were automatically available at each geographically distributed resource, the scientist's ability to perform research efficiently would increase: a scientist could schedule resources at different sites to process the globally available dataset more quickly.

(Figure: GPFS-WAN read performance, MB/sec versus node count from 1 to 192, for SDSC, ANL, and NCSA at 1 and 2 processes per node.)

Several applications, including the Biomedical Informatics Research Network (BIRN), the Southern California Earthquake Center (SCEC), the National Virtual Observatory (NVO), and ENZO, the cosmological simulation code, use WAN GPFS as part of a pilot project.
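As an aside on cross-site ownership: the talk only says that a user's Globus x.509 identity is used, not how. One conventional Globus mechanism is the grid-mapfile, which maps a certificate's quoted distinguished name to one or more local accounts. A minimal illustrative parser of that format (the entries and names below are hypothetical, and this is a sketch, not the actual GPFS-WAN mechanism):

```python
# Minimal parser for the conventional Globus grid-mapfile format, which maps
# an x.509 distinguished name (quoted) to local account name(s).
# Illustrative sketch only -- not the actual GPFS-WAN ownership mechanism.
import shlex

def parse_gridmap(text):
    """Return {distinguished_name: [local_accounts]} from grid-mapfile text."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # shlex honors the quoted DN, which itself contains spaces
        parts = shlex.split(line)
        if len(parts) >= 2:
            dn, accounts = parts[0], parts[1]
            mapping[dn] = accounts.split(",")  # multiple accounts are comma-separated
    return mapping

example = '''
# hypothetical entries
"/C=US/O=Example Grid/CN=Jane Scientist" jsci
"/C=US/O=Example Grid/CN=Shared Account" proj1,proj2
'''
print(parse_gridmap(example))
```

With such a map at each site, a file's owning DN can be resolved to whatever local UID that site assigns, sidestepping the mismatched UID spaces the slide mentions.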
The upper graph shows measured performance on parallel I/O benchmarks from SDSC, NCSA, and ANL; the SDSC numbers represent the peak achievable bandwidth to the file system. The lower graph shows sustained usage of the TeraGrid network generated by GPFS-WAN over a 24-hour period.

Special thanks to: Roger Haskin, IBM; Yuri Volobuev, IBM; Puneet Chaudhury, IBM; Daniel McNabb, IBM; Jim Wyllie, IBM; Phil Andrews, SDSC; Patricia Kovatch, SDSC; Brian Battistuz, SDSC; Tim Cockerill; Ruth Ayott, NCSA; Dan Lapine, NCSA; Anthony Tong, NCSA; Darrn Adams, NCSA; Stephen Simms, NJ; Chris Raymhauer, Purdue; Dave Carver, TACC.
Point of contact: Chris Jordan, [email protected]
Applications contact: Don Frederik, [email protected]

Mapping Applications to Architectures
Application performance is affected by:
– Node design (cache, memory bandwidth, memory bus architecture, clock, operations per clock cycle, registers)
– Network architecture (latency, bandwidth)
– I/O architecture (separate I/O network, NFS, SAN, NAS, local disk, diskless)
– System architecture (OS support, network drivers, scheduling software)
A few considerations:
– Memory bandwidth, I/O, clock frequency, CPU architecture, cache
– SMP size (1–64 CPUs)
– SMP memory size (1–256 GB)
– Price/performance
The cost of switching to other microprocessor architectures or operating systems is high, especially in production environments or in organizations with low-to-average skill levels. Deep Computing customers have application workloads whose properties are both problem- and implementation-dependent.

Platform Positioning – Application Characteristics
(Chart: application segments plotted by memory bandwidth versus I/O bandwidth. Segments include auto/aero (NVH, structural and thermal analysis, crash, selected CFD), climate/ocean (weather, selected CFD), media and entertainment (games, DCC, image processing, imaging), electronics (EDA), environment, PLM, HCLS (structure-based drug design, gene sequencing and assembly, general bioinformatics), petroleum (seismic migration, general seismic, reservoir simulation), and data analysis/data mining.)

Power Affects System Size
– Floor-loading size = 36 sq ft
– Physical size = 14.5 sq ft
– Service size = 25 sq ft
– Cooling size = 190 sq ft (@ 150 W/sq ft, 8% floor utilization)
"What matters most to the computer designers at Google is not speed, but power -- low power, because data centers can consume as much electricity as a city." — Eric Schmidt, CEO, Google (quoted in the NY Times, 9/29/02)

Top 500 List (http://www.top500.org/)
The #1 position was again claimed by the Blue Gene/L system, a joint development of IBM and DOE's National Nuclear Security Administration (NNSA), installed at DOE's Lawrence Livermore National Laboratory in Livermore, Calif.
It has reached a Linpack benchmark performance of 280.6 TFlop/s ("teraflops," or trillions of calculations per second).
"Even as processor frequencies seem to stall, the performance improvements of full systems seen at the very high end of scientific computing show no sign of slowing down… the growth of average performance remains stable and ahead of Moore's Law."

IBM Supercomputing Leadership (TOP500, November 2005)
Semiannual independent ranking of the top 500 supercomputers in the world (source: www.top500.org). Nov 2005 aggregate performance: IBM, 1,214 of 2,300 TFlops.
(Charts: vendor share of the TOP500, TOP100, and TOP10, and aggregate-performance share — IBM 53%, HP 19%, Other 7%, Cray 6%, SGI 6%, Dell 5%, NEC 2%, Linux Networx 2%, Sun 0%.)
IBM is the clear leader:
• #1 system – DOE/LLNL BlueGene/L (280.6 TF)
• Most entries on the TOP500 list (219)
• Most installed aggregate throughput (over 1,214 TF)
• Most in the TOP10 (5), TOP20 (8), and TOP100 (49)
• Fastest system in Europe (Mare Nostrum)
• Most Linux commodity clusters, with 158 of 360

System Management Challenges for Beowulf Clusters
Scalability/simultaneous operations
– ping: 3 seconds per node; 3 x 1500 / 60 = 75 minutes to ping 1500 nodes serially
– Do operations in parallel, but you can run into limitations in:
  • # of file descriptors per process
  • # of port numbers
  • ARP cache
  • network bandwidth
– Need a "fan-out" limit on the # of simultaneous parallel operations

System Management Challenges: Lights-Out Machine Room
– Location, security, ergonomics
– Out-of-band hardware control and console
– Automatic discovery of hardware
– IPMI – de facto standard hardware-management protocol for x86 machines
  • http://www.intel.com/design/servers/ipmi/
  •
http://openipmi.sourceforge.net/
– conserver – logs consoles and supports multiple viewers of a console
  • http://www.conserver.com/
– vnc – remotes a whole desktop
  • http://www.tightvnc.com/
– openslp – Service Location Protocol
  • http://www.openslp.org/

System Management Challenges: Node Installation
– Parallel and hierarchical (management server feeding install servers, each installing its own set of nodes)
– Unattended, over the network
– Repeatable (uniformity)
– Different models: direct (e.g., kickstart), cloning, diskless
– Kickstart, AutoYaST – direct installation of distro RPMs
– syslinux, dhcp, tftp – PXE network boot of nodes
– SystemImager – cloning of nodes
  • http://www.systemimager.org/
– Warewulf – diskless boot of nodes
  • http://www.warewulf-cluster.org/

System Management Challenges: Software/Firmware Maintenance
– Selection of a compatible combination of upgrades (BIOS, NIC firmware, RAID firmware, multiple OS patches, application updates)
– Distribution and installation of updates
– Inventory of current levels
– up2date – Red Hat Network
  • https://rhn.redhat.com/
– you – YaST Online Update
– yum – Yellow Dog Updater, Modified (RPM update manager)
  • http://linux.duke.edu/projects/yum/
– autoupdate – RPM update manager
  • http://www.mat.univie.ac.at/~gerald/ftp/autoupdate/index.html

System Management Challenges: Configuration Management
– Manage /etc files across the cluster
– Apply changes immediately (without reboot)
– Detect configuration inconsistencies between nodes
– Time synchronization
– User management
– rdist – distribute files in parallel to many nodes (management server to node 1 … node n)
  • http://www.magnicomp.com/rdist/
– rsync – copy files (when changed) to another machine
– cfengine – manage machine configuration
  • http://www.cfengine.org/
– NTP – Network Time Protocol
  • http://www.ntp.org/
– OpenLDAP – user information server
  • http://www.openldap.org/
– NIS – user information server
  • http://www.linux-nis.org/

System Management Challenges: Monitoring for Failures
– Given the mean time to failure of some disks, several nodes in a large cluster can fail each week
– Heartbeating
– Storage servers, networks, node temperatures, OS metrics, daemon status, etc.
– Notification and automated responses
– Use minimal resources on the nodes and network
– fping – ping many nodes in parallel
  • http://www.fping.com/
– Event monitoring with web browser interfaces
  • ganglia – http://ganglia.sourceforge.net/
  • nagios – http://www.nagios.org/
  • big brother – http://bb4.com/
– snmp – Simple Network Management Protocol
  • http://www.net-snmp.org/
– pegasus – CIM Object Manager
  • http://www.openpegasus.org/

System Management Challenges: Cross-Geography
– Secure connections (e.g., SSL)
  • WANs are often not secure
– Tolerance of routing
  • Broadcast protocols (e.g., DHCP) usually are not forwarded through routers
– Tolerance of slow connections
  • Move large data transfers (e.g.,
OS installation) close to the target
– Firewalls
  • Minimize the number of ports used
– OpenSSH – Secure Shell
  • http://www.openssh.com/

System Management Challenges: Security/Auditability
– Install/confirm security patches
– Conform to company security policies
– Log all root activity
– sudo – gives certain users the ability to run some commands as root while logging the commands
  • http://www.courtesan.com/sudo/
Accounting
– Track usage and charge departments for use

System Management Challenges: Flexibility
– Broad variety of hardware and Linux distros
– Node installation customization
– Monitoring extensions
– Spectrum of security policies
– Hierarchical clusters (executive management server over first-line management servers, each managing its own nodes)
– Scriptable commands

Beowulf Cluster Management Suites
Open source
– OSCAR – collection of cluster management tools
  • http://oscar.openclustergroup.org/
– ROCKS – stripped-down RHEL with management software
  • http://www.rocksclusters.org/
– webmin – web interface to Linux administration tools
  • http://www.webmin.com/
Products
– Scali Manage
  • http://www.scali.com/
– Linux Networx Clusterworx
  • http://linuxnetworx.com/
– Egenera
  • http://www.egenera.com/
– Scyld
  • http://www.penguincomputing.com/
– HP XC
  • http://h20311.www2.hp.com/HPC/cache/275435-0-0-0-121.html
– Windows Compute Cluster Server
  • http://www.microsoft.com/windowsserver2003/ccs/

Beowulf Cluster Management Suites: IBM Products
– IBM 1350 – cluster hardware & software bundle
  • http://www.ibm.com/systems/clusters/hardware/1350.html
– IBM BladeCenter
  • http://www.ibm.com/systems/bladecenter/
– Blue Gene
  •
http://www.ibm.com/servers/deepcomputing/bluegene.html
– CSM (Cluster Systems Management)
  • http://www14.software.ibm.com/webapp/set2/sas/f/csm/home.html
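Returning to the scalability challenge earlier in the deck: pinging 1,500 nodes serially at 3 seconds each takes 75 minutes, while unbounded parallelism exhausts file descriptors, port numbers, and the ARP cache, hence the need for a fan-out limit. A minimal sketch of bounded fan-out using a worker pool (the fan-out value, node names, and probe function below are illustrative, not from the talk):

```python
# Bounded fan-out: run a per-node check across many nodes, but never
# more than `fanout` at once.
from concurrent.futures import ThreadPoolExecutor

FANOUT = 64  # illustrative limit; tune to fd/port/ARP-cache headroom

def check_node(host):
    # Placeholder probe; a real version might run something like
    # subprocess.run(["ping", "-c", "1", "-W", "1", host]).
    return (host, "up")

def run_with_fanout(hosts, probe, fanout=FANOUT):
    """Apply `probe` to every host with at most `fanout` in flight."""
    with ThreadPoolExecutor(max_workers=fanout) as pool:
        return dict(pool.map(probe, hosts))

nodes = ["node%03d" % i for i in range(1, 1501)]  # node001..node1500
status = run_with_fanout(nodes, check_node)
print(len(status), status["node001"])  # -> 1500 up
```

Parallel remote shells typically expose the same idea as a configurable "fanout" setting rather than launching one connection per node at once.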