Supercomputing on Windows Clusters: Experience and Future Directions
Andrew A. Chien
CTO, Entropia, Inc.
SAIC Chair Professor, Computer Science and Engineering, UCSD
National Computational Science Alliance
Invited Talk, USENIX Windows, August 4, 2000

Overview
- Critical enabling technologies
- The Alliance's Windows Supercluster: design and performance
- Other Windows cluster efforts
- Future: Terascale clusters, Entropia

External Technology Factors: Microprocessor Performance
[Chart: clock period (ns) vs. year introduced, 1975-1995, for vector supercomputers (Cray 1S 12.5, Cray X-MP 8.5, Cray Y-MP 6, Cray C90 4.2) and microprocessors (MIPS R2000 125, MIPS R3000 40, HP 7000 15, R4000 10, R4400 6.7, DEC Alpha 5, x86/Alpha 1); adapted from Baskett, SGI, and CSC Vanguard]
- Micros: 10 MF -> 100 MF -> 1 GF -> 3 GF -> 6 GF (2001?)
- Memory system performance is catching up (2.6 GB/s 21264 memory bandwidth)

Killer Networks
- LAN: 10 Mb/s -> 100 Mb/s -> ?
- SAN: 12 MB/s -> 110 MB/s (Gbps) -> 1100 MB/s -> ? (Myricom, Compaq, Giganet, Intel, ...)
- For comparison: Ethernet ~1 MB/s, Fast Ethernet ~12 MB/s, UW SCSI ~40 MB/s, gigabit SAN/GigE ~110 MB/s
- Network bandwidths are now limited by systems' internal memory bandwidths
- Communication hardware is cheap and very fast

Rich Desktop Operating System Environments
[Timeline, 1981-1999: basic device access -> graphical interfaces, audio/graphics, HD storage, networks -> SMP support, multiprocess protection, management -> clustering, performance, mass store, high-performance networking, availability]
Desktop (PC) operating systems now provide:
- the richest OS functionality
- the best program development tools
- the broadest peripheral/driver support
- the broadest application software/ISV support

Critical Enabling Technologies
- Cluster management and resource integration ("use like" one system)
- Delivered communication performance (IP protocols are inappropriate)
- Balanced systems: memory bandwidth, I/O capability

The HPVM System
Goals:
- Enable tightly coupled and distributed clusters with high efficiency and low effort (an integrated solution)
- Provide usable access through convenient, standard parallel interfaces
- Deliver the highest possible performance with a simple programming model

Delivered Communication Performance
- Early-1990s gigabit testbeds delivered 500 Mb/s (~60 MB/s), and only at 1-megabyte packets; IP protocols are not suited to gigabit SANs
- Cluster objective: high-performance communication for both small and large messages
- Performance balance shift: networks are now faster than I/O, memory, and processors

Fast Messages Design Elements
- User-level network access
- Lightweight protocols: flow control, reliable delivery, tightly coupled link, buffer, and I/O bus management
- Poll-based notification
- Streaming API for efficient composition
- Many generations, 1994-1999 [Supercomputing '95, 12/95; IEEE Concurrency, 6/97]
- Related efforts: UCB Active Messages, Cornell U-Net, RWCP PM, Princeton VMMC/Shrimp, Lyon BIP => the VIA standard
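These design elements shape what application-level send and receive code looks like: messages are composed from pieces (no staging copy), and receives complete only when the application or library polls. The sketch below is a minimal illustration of that pattern; every fm_* name is a hypothetical placeholder (with a no-op stub so the file compiles), not the actual HPVM/FM interface.

```c
/*
 * Sketch of a poll-based, user-level messaging pattern in the spirit of the
 * Fast Messages streaming API.  All fm_* names are hypothetical placeholders,
 * stubbed out so this compiles; they are NOT the real HPVM/FM entry points.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef struct { uint32_t tag, len; } msg_header;

/* --- hypothetical API, stubbed for illustration ----------------------- */
static void fm_begin_send(int dest, int handler_id) { (void)dest; (void)handler_id; }
static void fm_send_piece(const void *buf, size_t len) { (void)buf; (void)len; }
static void fm_end_send(void) {}
static int  fm_poll(void) { return 0; }  /* would run handlers for arrived data */
/* ----------------------------------------------------------------------- */

/* Streaming composition: header and payload are handed over as separate
 * pieces, so no intermediate copy into one staging buffer is needed --
 * one reason small-message overhead stays low. */
static void send_tagged(int dest, int handler_id, uint32_t tag,
                        const void *payload, uint32_t len)
{
    msg_header h = { tag, len };
    fm_begin_send(dest, handler_id);
    fm_send_piece(&h, sizeof h);
    fm_send_piece(payload, len);
    fm_end_send();                 /* library handles flow control, reliability */
}

int main(void)
{
    const char *data = "hello";
    send_tagged(/*dest=*/1, /*handler_id=*/0, /*tag=*/42, data, (uint32_t)strlen(data));

    /* Poll-based notification: receives make progress only when the
     * application (or an MPI layer built on FM) explicitly polls. */
    for (int i = 0; i < 1000; i++)
        if (fm_poll() == 0)
            break;
    puts("sketch only: the fm_* calls above are stubs");
    return 0;
}
```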
Improved Bandwidth Performance
[Chart: delivered bandwidth (MB/s), 1995-1999, rising from roughly 20 to over 200 MB/s]
- 20 MB/s -> 200+ MB/s, a 10x improvement
- Much of the advance is software structure: APIs and their implementation
- Deliver *all* of the underlying hardware performance

Improved Latency
[Chart: one-way latency (microseconds), 1995-1999]
- Overhead reduced from ~100 µs to ~2 µs (50x)
- Careful design to minimize overhead while maintaining throughput
- Efficient event handling, fine-grained resource management, and inter-layer coordination
- Deliver *all* of the underlying hardware performance

HPVM = Cluster Supercomputers
- Software stack: MPI, Put/Get, Global Arrays, and BSP interfaces; scheduling and management (LSF); performance tools; all layered on Fast Messages over Myrinet, ServerNet, Giganet VIA, SMP shared memory, and WAN transports
- Releases: HPVM 1.0 (8/1997); HPVM 1.2 (2/1999) - multi, dynamic, install; HPVM 1.9 (8/1999) - Giganet, SMP
- Turnkey cluster computing with standard APIs
- Network hardware and APIs increase leverage for users and achieve critical mass for the system
- Each generation involved new research challenges and provided deeper insight into the research issues, driving continually better solutions (e.g., multi-transport integration, robust flow control and queue management)

HPVM Communication Performance
[Chart: bandwidth (MB/s) vs. message size for FM on Myrinet and MPI on FM-Myrinet; N1/2 ~ 400 bytes, i.e., half of peak bandwidth is reached at ~400-byte messages]
- Delivers the underlying performance even for small messages; the endpoints are the limits
- 100 MB/s at 1 KB messages, versus the gigabit testbeds' 60 MB/s at 1,000 KB packets: better bandwidth at >1500x smaller messages

HPVM/FM on VIA
[Chart: bandwidth (MB/s) vs. message size for FM on Giganet VIA and MPI-FM on Giganet VIA; N1/2 ~ 400 bytes]
- The FM protocol and techniques port to Giganet VIA
- Slightly lower peak bandwidth, comparable N1/2
- Commercial version: WSDI (stay tuned)

Unified Transfer and Notification (all transports)
[Diagram: processes and networks deposit variable-size data into fixed-size frames at increasing addresses, each frame ending in a fixed-size trailer carrying a length and a notification flag]
- Solution: uniform notify-and-poll over a single queue representation
- Scalability: n sources hashed into k queues; arbitrary SMP size or number of NIC cards
- Key: integrate variable-sized messages and achieve a single DMA transfer; no pointer-based memory management, no special synchronization primitives, no complex computation
- The memory format provides atomic notification in a single contiguous memory transfer (bcopy or DMA)

Integrated Notification Results
                               Single transport    Integrated
Myrinet (latency)              8.3 µs              8.4 µs
Myrinet (bandwidth)            101 MB/s            101 MB/s
Shared memory (latency)        3.4 µs              3.5 µs
Shared memory (bandwidth)      200+ MB/s           200+ MB/s
- No polling or discontiguous-access performance penalties
- Uniform high performance that is stable across configuration changes and the addition of new transports; no custom per-configuration tuning required
- The framework scales to large numbers of SMP processors and network interfaces

Supercomputer Performance Characteristics (11/99)
System                     MF/proc         Flops/byte    Flops/network round trip
Cray T3E                   1,200           ~2            ~2,500
SGI Origin2000             500             ~0.5          ~1,000
HPVM NT Supercluster       600             ~8            ~12,000
IBM SP2 (4- or 8-way)      2,600-5,200     ~12-25        ~150K-300K
Beowulf (100 Mbit)         600             ~50           ~200,000
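The two balance ratios in the table can be sanity-checked from the communication numbers earlier in the talk. The arithmetic below is a back-of-the-envelope reconstruction of the HPVM NT Supercluster row, assuming roughly 600 MF per processor, ~75 MB/s of delivered network bandwidth per processor, and a ~20 µs round trip (about twice FM's 8.3 µs one-way latency); the slides do not state the exact assumptions behind the table.

```latex
% Hedged reconstruction of the HPVM NT Supercluster row, not the authors' stated method.
\[
\frac{\mathrm{flops}}{\mathrm{byte}}
  \approx \frac{600 \times 10^{6}\ \mathrm{flops/s}}{75 \times 10^{6}\ \mathrm{bytes/s}}
  \approx 8,
\qquad
\frac{\mathrm{flops}}{\mathrm{network\ RT}}
  \approx \left(600 \times 10^{6}\ \mathrm{flops/s}\right) \times \left(20 \times 10^{-6}\ \mathrm{s}\right)
  \approx 12{,}000 .
\]
```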
The NT Supercluster

Windows Clusters
- Early prototypes in CSAG: 1/1997, 30 processors, 6 GF; 12/1997, 64 processors, 20 GF
- The Alliance's Supercluster: 4/1998, 256 processors, 77 GF; 6/1999, 256 processors*, 109 GF

NCSA's Windows Supercluster
- 128 HP Kayak XU nodes, dual Pentium III 550 MHz, 1 GB RAM per node
- Windows NT, Myrinet interconnect, and HPVM
- #207 in the Top 500 Supercomputing Sites
[Chart: AS-PCG MPI performance on a 2D Navier-Stokes engineering fluid flow problem, GFLOPS vs. processors (up to 128), comparing the SGI Origin 2000 with the NT cluster at 300 MHz and 550 MHz; D. Tafti, NCSA]
Rob Pennington (NCSA), Andrew Chien (UCSD)

Windows Cluster System
- 128 compute nodes, 256 CPUs: dual 550 MHz systems running Windows NT, Myrinet, and HPVM
- Front-end systems for application development and job submission; LSF master and LSF batch job scheduler
- File servers: 128 GB home, 200 GB scratch; FTP to mass storage; daily backups; Fast Ethernet to the Internet
- Infrastructure and development testbeds (Windows 2000 and NT): 8 quad-processor 550 MHz, 32 dual-processor 300 MHz, and 8 dual-processor 333 MHz systems
(courtesy Rob Pennington, NCSA)

Example Application Results
- MILC (QCD), a 2D Navier-Stokes kernel, and Zeus-MP (astrophysics CFD)
- Large-scale science and engineering codes
- Comparisons to the SGI O2K and Linux clusters

MILC Performance
[Chart: GFLOPS vs. processors for IA-32/Windows NT at 300 MHz (PII) and 550 MHz (Xeon), a 250 MHz SGI O2K, and a T3E 900; src: D. Toussaint and K. Orginos, Arizona]

Zeus-MP (Astrophysics CFD)
[Chart: MFLOPS vs. processor count (1-256) for the SGI O2K, Janus (ASCI Red), and the NT Supercluster at 550 MHz]

2D Navier-Stokes Kernel
[Chart: AS-PCG MPI performance, GFLOPS vs. processors (up to 256), for the SGI O2000 (250 MHz R10000), the NT cluster with 550 MHz PIII Xeon HP Kayaks, the NT cluster with 300 MHz PII HP Kayaks, and a mixed cluster of 128 550 MHz + 128 300 MHz nodes; source: Danesh Tafti, NCSA]

Applications with High Performance on the Windows Supercluster
- Zeus-MP (256P, Mike Norman)
- ISIS++ (192P, Robert Clay)
- ASPCG (256P, Danesh Tafti)
- Cactus (256P, Paul Walker/John Shalf/Ed Seidel)
- MILC QCD (256P, Lubos Mitas)
- QMC Nanomaterials (128P, Lubos Mitas)
- Boeing CFD test codes, CFD Overflow (128P, David Levine)
- freeHEP (256P, Doug Toussaint)
- ARPI3D (256P, weather code, Dan Weber)
- GMIN (L. Munro, in K. Jordan's group)
- DSMC-MEMS (Ravaioli)
- FUN3D with PETSc (Kaushik)
- SPRNG (Srinivasan)
- MOPAC (McKelvey)
- Astrophysical N-body codes (Bode)
=> Little code retuning, and codes were up and running quickly ...
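One reason so many of these codes moved over with little retuning is that HPVM exposes standard MPI: a portable MPI code compiles and runs against MPI-FM unchanged. The short ping-pong below is the kind of ordinary, portable microbenchmark that produces latency and bandwidth curves like the ones earlier in the talk; it is plain MPI, with nothing HPVM-specific in it.

```c
/* Portable MPI ping-pong microbenchmark: measures round-trip time and
 * bandwidth between ranks 0 and 1.  Ordinary MPI; nothing HPVM-specific. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 1000, size = 1024;                /* 1 KB messages by default */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (argc > 1) size = atoi(argv[1]);

    char *buf = malloc(size);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double dt = MPI_Wtime() - t0;
    if (rank == 0) {
        double rt_us = dt / iters * 1e6;                 /* round trip per iteration */
        double mb_s  = 2.0 * size * iters / dt / 1e6;    /* bytes moved both ways */
        printf("%d-byte messages: %.1f us round trip, %.1f MB/s\n", size, rt_us, mb_s);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```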
Parallel sorting (Rivera, CSAG): 18.3 GB MinuteSort world record

MinuteSort
- Sort the maximum amount of data disk-to-disk in one minute
- "Indy sort" rules: fixed-size keys, a special-purpose sorter, and a fixed file format
- The HPVM/Windows cluster was the winner for 1999 (10.3 GB) and 2000 (18.3 GB)
- Adaptation of the Berkeley NOWSort code (Arpaci-Dusseau)
- Commodity configuration ($$ was not a metric): PCs, IDE disks, Windows, HPVM, and 1 Gb/s Myrinet

MinuteSort Architecture
- 32 HP Kayaks with 3Ware controllers and 4 x 20 GB IDE disks each
- 32 HP Netservers with 2 x 16 GB SCSI disks each
- Connected by HPVM and 1 Gb/s Myrinet
(Luis Rivera, UIUC; Xianan Zhang, UCSD)

Sort Scaling
- The concurrent read / bucket-sort / communicate phase is the bottleneck; a faster I/O infrastructure is required (busses and memory, not disks)

MinuteSort Execution Time
[Chart: execution-time breakdown of the MinuteSort run]
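The read / bucket-sort / communicate phase mentioned above follows the general shape of a one-pass distributed sort. The sketch below shows that overall structure with plain MPI and in-memory data; it is a simplified illustration, not the NOWSort-derived record code, and it omits the disk I/O, pipelining, and key/record formats that the record run depended on.

```c
/* Simplified one-pass distributed bucket sort over MPI: each rank buckets its
 * local keys by destination rank, exchanges buckets with MPI_Alltoallv, then
 * sorts what it received.  Illustrates the structure only; the real MinuteSort
 * code is disk-to-disk and overlaps I/O with communication. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_uint(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
    int rank, nprocs, n_local = 1 << 16;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* 1. "Read": here, just generate random keys. */
    unsigned *keys = malloc(n_local * sizeof *keys);
    srand(rank + 1);
    for (int i = 0; i < n_local; i++) keys[i] = (unsigned)rand();

    /* 2. Bucket by destination: the key range is split evenly across ranks. */
    int *sendcnt = calloc(nprocs, sizeof *sendcnt);
    int *sdispl  = calloc(nprocs, sizeof *sdispl);
    unsigned range = (unsigned)(((unsigned long)RAND_MAX + 1) / nprocs + 1);
    for (int i = 0; i < n_local; i++) sendcnt[keys[i] / range]++;
    for (int p = 1; p < nprocs; p++) sdispl[p] = sdispl[p - 1] + sendcnt[p - 1];

    unsigned *sendbuf = malloc(n_local * sizeof *sendbuf);
    int *fill = calloc(nprocs, sizeof *fill);
    for (int i = 0; i < n_local; i++) {
        int dest = keys[i] / range;
        sendbuf[sdispl[dest] + fill[dest]++] = keys[i];
    }

    /* 3. Communicate: exchange bucket sizes, then the buckets themselves. */
    int *recvcnt = calloc(nprocs, sizeof *recvcnt);
    int *rdispl  = calloc(nprocs, sizeof *rdispl);
    MPI_Alltoall(sendcnt, 1, MPI_INT, recvcnt, 1, MPI_INT, MPI_COMM_WORLD);
    int n_recv = 0;
    for (int p = 0; p < nprocs; p++) { rdispl[p] = n_recv; n_recv += recvcnt[p]; }
    unsigned *recvbuf = malloc((n_recv ? n_recv : 1) * sizeof *recvbuf);
    MPI_Alltoallv(sendbuf, sendcnt, sdispl, MPI_UNSIGNED,
                  recvbuf, recvcnt, rdispl, MPI_UNSIGNED, MPI_COMM_WORLD);

    /* 4. Local sort: rank r now holds the r-th slice of the global order. */
    qsort(recvbuf, n_recv, sizeof *recvbuf, cmp_uint);
    printf("rank %d: %d keys after exchange\n", rank, n_recv);

    MPI_Finalize();
    return 0;
}
```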
Reliability
- Gossip: "Windows platforms are not reliable," so larger systems imply an intolerably low MTBF
- Our experience: nodes don't crash
  - Application runs of thousands of hours
  - A node failure means an application failure; effectively not a problem
- Hardware
  - Short term: infant mortality (one-month burn-in)
  - Long term: ~1 hardware problem per 100 machines per month (disks, network interfaces, memory); no processor or motherboard problems

Windows Cluster Usage
[Chart: NT cluster usage by number of processors, May 1999 to July 2000; CPU hours for jobs of 1-31, 32-63, and 64-256 processors]
- Lots of large jobs
- Runs of up to ~14,000 CPU-hours (64 processors for 9 days)

Other Large Windows Clusters
- Sandia's Kudzu cluster (144 processors, 550 disks, 10/98)
- Cornell's AC3 Velocity cluster (256 processors, 8/99)
- Others (sampled from vendors): GE Research Labs (16, scientific), Boeing (32, scientific), PNNL (96, scientific), Sandia (32, scientific), NCSA (32, scientific), Rice University (16, scientific), U. of Houston (16, scientific), U. of Minnesota (16, scientific), oil & gas (8, scientific), Merrill Lynch (16, e-commerce), UIT (16, ASP/e-commerce)

The AC3 Velocity
- 64 Dell PowerEdge 6350 servers: quad Pentium III 500 MHz / 2 MB cache processors (SMP), 4 GB RAM per node, 50 GB disk (RAID 0) per node
- Giganet full interconnect: 100 MB/s bandwidth between any two nodes, very low latency
- 2 TB of Dell PowerVault 200S storage: 2 Dell PowerEdge 6350 dual-processor file servers, 4 PowerVault 200S units per file server, 8 x 36 GB disk drives per PowerVault 200S, quad-channel SCSI RAID adapters, 180 MB/s sustained throughput per server
- 2 TB PowerVault 130T tape library: 4 DLT 7000 tape drives, 28-tape capacity
- #381 in the Top 500 Supercomputing Sites
(courtesy David A. Lifka, Cornell TC)

Recent AC3 Additions
- 8 Dell PowerEdge 2450 servers (serial nodes): Pentium III 600 MHz / 512 KB cache, 1 GB RAM per node, 50 GB disk (RAID 0) per node
- 7 Dell PowerEdge 2450 servers (the first all-NT-based AFS cell): dual Pentium III 600 MHz / 512 KB cache; 1 GB RAM per file server node, 512 MB RAM per database server node; 1 TB of SCSI-based RAID 5 storage; cross-platform filesystem support
- 64 Dell PowerEdge 2450 servers (protein folding, fracture analysis): dual Pentium III 733 MHz / 256 KB cache, 2 GB RAM per node, 27 GB disk (RAID 0) per node, full Giganet interconnect
- 3 Intel ES6000 and 1 ES1000 Gigabit switches: upgrading the server backbone network to Gigabit Ethernet
(courtesy David A. Lifka, Cornell TC)

AC3 Goals
- Only commercially supported technology: rapid spin-up and spin-out; package technologies for vendors to sell as integrated systems
- => All of the commercial packages were moved from the SP2 to Windows; all the users are back, and more
- Users went from "I don't do Windows" to "I'm agnostic about operating systems; I just focus on getting my work done."

Protein Folding
- The cooperative motion of an ion and water through the gramicidin ion channel: the effective quasi-particle that permeates the channel includes eight water molecules and the ion. Work of Ron Elber with Bob Eisenberg, Danuta Rojewska, and Duan Pin.
- Reaction-path study of ligand diffusion in leghemoglobin: the ligand is CO (white), moving from the binding site, the heme pocket, to the protein exterior. A study by Weislaw Nowak and Ron Elber.
- http://www.tc.cornell.edu/reports/NIH/resource/CompBiologyTools/
(courtesy David A. Lifka, Cornell TC)

Protein Folding Per-Processor Performance
Results on different computers for alpha protein structures:
Machine                       System      CPU                CPU speed [MHz]   Compiler    Energy evaluations/s
Blue Horizon (SP San Diego)   AIX 4       Power3             222               xlf         44.3
Linux cluster                 Linux 2.2   Pentium III        650               PGF 3.1     59.1
Velocity (CTC)                Win 2000    Pentium III Xeon   500               df v6.1     46.0
Velocity+ (CTC)               Win 2000    Pentium III        733               df v6.1     59.2

Results on different computers for alpha/beta or beta proteins:
Machine                       System      CPU                CPU speed [MHz]   Compiler    Energy evaluations/s
Blue Horizon (SP San Diego)   AIX 4       Power3             222               xlf         15.0
Linux cluster                 Linux 2.2   Pentium III        650               PGF 3.1     21.0
Velocity (CTC)                Win 2000    Pentium III Xeon   500               df v6.1     16.9
Velocity+ (CTC)               Win 2000    Pentium III        733               df v6.1     22.4
(courtesy David A. Lifka, Cornell TC)

AC3 Corporate Members
Air Products and Chemicals, Candle Corporation, Compaq Computer Corporation, Conceptual Reality Presentations, Dell Computer Corporation, Etnus, Inc., Fluent, Inc., Giganet, Inc., IBM Corporation, ILOG, Inc., Intel Corporation, KLA-Tencor Corporation, Kuck & Associates, Inc., Lexis-Nexis, MathWorks, Inc., Microsoft Corporation, MPI Software Technologies, Inc., Numerical Algorithms Group, Portland Group, Inc., Reed Elsevier, Inc., Reliable Network Solutions, Inc., SAS Institute, Inc., Seattle Lab, Inc., Visual Numerics, Inc., Wolfram Research, Inc.
(courtesy David A. Lifka, Cornell TC)

Windows Cluster Summary
- Good performance
- Lots of applications
- Good reliability
- Reasonable management complexity (TCO)
- The future is bright; uses are proliferating!
Windows Cluster Resources
- NT Supercluster, NCSA: http://www.ncsa.uiuc.edu/General/CC/ntcluster/ and http://www-csag.ucsd.edu/projects/hpvm.html
- AC3 Cluster, Cornell TC: http://www.tc.cornell.edu/UserDoc/Cluster/
- University of Southampton: http://www.windowsclusters.org/
=> application and hardware/software evaluation
=> many of these groups will work with you on deployment

Tools and Technologies for Building Windows Clusters
- Communication hardware
  - Myrinet, http://www.myri.com/
  - Giganet, http://www.giganet.com/
  - ServerNet II, http://www.compaq.com/
- Cluster management and communication software
  - LSF, http://www.platform.com/
  - Codine, http://www.gridware.net/
  - Cluster CoNTroller and MPI, http://www.mpi-softtech.com/
  - Maui Scheduler, http://www.cs.byu.edu/
  - MPICH, http://www-unix.mcs.anl.gov/mpi/mpich/
  - PVM, http://www.epm.ornl.gov/pvm/
- Microsoft cluster info
  - Windows 2000, http://www.microsoft.com/windows2000/
  - MSCS, http://www.microsoft.com/ntserver/ntserverenterprise/exec/overview/clustering.asp

Future Directions: Terascale Clusters and Entropia

A Terascale Cluster
- 10+ teraflops in 2000?
- NSF is currently running a $36M Terascale competition
- The budget could buy an Itanium cluster (3000+ processors) with ~3 TB of main memory and a >1.5 Gb/s high-speed network interconnect
- #1 in the Top 500 Supercomputing Sites?

Entropia: Beyond Clusters
- COTS and SHV parts enable larger, cheaper, faster systems: from supercomputers (MPPs), to commodity clusters (the NT Supercluster), to Entropia and Internet computing
- Idea: assemble large numbers of idle PCs in people's homes and offices into a massive computational resource
- Enabled by broadband connections, fast microprocessors, and huge PC volumes

Unprecedented Power
- Entropia network: ~30,000 machines (and growing fast!)
- 100,000 machines at 1 GHz => a 100 TeraOp system
- 1,000,000 machines at 1 GHz => a 1,000 TeraOp (1 PetaOp) system
- For comparison: IBM ASCI White is 12 TeraOp, 8K processors, a $110 million system
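These aggregate-throughput claims, like the 10+ TF terascale estimate above, are straightforward products of machine count and per-processor rate. The arithmetic below just makes the assumptions explicit: the ~1 GF per ~1 GHz PC rate is the one the slides use, while the ~3 GF per Itanium processor figure is an assumption chosen to be consistent with the 3 GF microprocessor trend cited earlier in the talk, not a number given here.

```latex
% Assumed rates: ~3 GF per Itanium processor (assumption), ~1 GF per ~1 GHz PC (as in the slides).
\[
3000 \times 3\ \mathrm{GF} \approx 10\ \mathrm{TF},
\qquad
100{,}000 \times 1\ \mathrm{GF} = 100\ \mathrm{TF},
\qquad
1{,}000{,}000 \times 1\ \mathrm{GF} = 1\ \mathrm{PF}.
\]
```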
Why Participate: Cause Computing!
- People will contribute: millions have already demonstrated a willingness to donate their idle cycles
- "Great cause" computing
  - Current: find ET, find large primes, crack DES, ...
  - Next: find cures for cancer and muscular dystrophy, fight air and water pollution, ...; understand the human genome, ecology, the fundamental properties of matter, the economy
- Participate in science, medical research, and promoting the causes that you care about!

Technical Challenges
- Heterogeneity (machine, configuration, network)
- Scalability (thousands to millions)
- Reliability (machines turn off, disconnect, fail)
- Security (integrity, confidentiality)
- Performance
- Programming
- ...
Entropia: harnessing the computational power of the Internet

Entropia is . . .
- Power: a network with unprecedented power and scale
- Empower: ordinary people participating in solving the great social challenges and mysteries of our time
- Solve: a team solving fascinating technical problems

Summary
- Windows clusters are powerful, successful high-performance platforms: cost effective, with excellent performance, and poised for rapid proliferation
- Beyond clusters are Internet computing systems: radical technical challenges, and vast and profound opportunities
- For more information, see HPVM (http://www-csag.ucsd.edu/) and Entropia (http://www.entropia.com/)

Credits
- NT cluster team members: CSAG (UIUC and UCSD Computer Science), my research group; the NCSA Leading Edge Site, Robert Pennington's team
- Talk materials: NCSA (Rob Pennington, numerous application groups), Cornell TC (David Lifka), Boeing (David Levine), MPI Software Technology (Tony Skjellum), Giganet (David Wells), Microsoft (Jim Gray)