Commodity Computing Clusters - next generation supercomputers?
Paweł Pisarczyk, ATM S. A.
[email protected]

Agenda
• Introduction
• Supercomputer classification
• Architecture and implementations
• Commodity clusters
• Processors
• Operating systems
• Summary

Supercomputer
• "A supercomputer is a device for turning compute-bound problems into I/O-bound problems" (Seymour Cray)
• A supercomputer is a computer system that leads the world in terms of processing capacity, particularly speed of calculations, at the time of its introduction.
source: http://en.wikipedia.org

Supercomputer History (1)
• 1945-50 - Manchester Mark I
• 1950-55 - MIT Whirlwind
• 1955-60 - IBM 7090 - 210 KFLOPS
• 1960-65 - CDC 6600 - 10.24 MFLOPS
• 1965-70 - CDC 7600 - 32.27 MFLOPS
• 1970-75 - CDC Cyber 76

Supercomputer History (2)
• 1975-80 - Cray-1 - 160 MFLOPS
• 1980-85 - Cray X-MP - 500 MFLOPS
• 1985-90 - Cray Y-MP - 1.3 GFLOPS
• 1990-95 - Fujitsu Numerical Wind Tunnel - 236 GFLOPS
• 1995-00 - Intel ASCI Red - 2.150 TFLOPS
• 2000-02 - IBM ASCI White, SP Power3 375 MHz - 7.226 TFLOPS
• 2002-03 - NEC Earth Simulator - 35 TFLOPS

Supercomputer Classes (1)
• General-purpose supercomputers:
    – vector processing machines - the same operation is carried out on a large amount of data simultaneously
    – tightly connected cluster computers (NUMA) - communication oriented architectures engineered from the ground up, based on high-speed interconnects and a large number of processors
    – commodity clusters - a collection of a large number of commodity PCs (COTS) interconnected by a high-bandwidth, low-latency network

Supercomputer Classes (2)
• Special-purpose supercomputers - high performance computing devices with a hardware architecture dedicated to solving a single problem (equipped with custom ASICs or FPGA chips)
• Examples:
    – Deep Blue
    – GRAPE for astrophysics

Flynn taxonomy - 1972 (1)
• SISD - Single Instruction Single Data (DEC, Sun Microsystems, PC)
• SIMD - Single Instruction Multiple Data
    – computers with a large number of processing units (i.e. ALUs) - CPP DAP Gamma II, Quadrics Apemille
    – vector processing machines - NEC SX-6, IA32 MMX
• MISD - Multiple Instruction Single Data
    – theoretical model, no practical implementation

Flynn taxonomy - 1972 (2)
• MIMD - Multiple Instruction Multiple Data
    – SM-MIMD - Shared Memory MIMD
        • global address space
        • SMP systems and ccNUMA systems
    – DM-MIMD - Distributed Memory MIMD
        • many nodes with local address spaces
        • high-bandwidth, low-latency communication
        • common NUMA architectures (Non-Uniform Memory Access)
        • the operating system has to be communication oriented (Mach project)

SM-MIMD implementations
• S-COMA - Simple Cache-Only Memory Architecture
    – common SMP systems
• ccNUMA - Cache Coherent NUMA
    – SGI Origin 3000
    – SGI Altix 3000
    – HP SuperDome

S-COMA (SMP)
[Figure: CPU 0, CPU 1 ... CPU N, each with its own L2 cache, sharing a single RAM]

ccNUMA
[Figure: RAM 0 ... RAM K, each behind an L3 cache that serves a group of CPUs (CPU 0 ... CPU N) with their own L2 caches]

ccNUMA implementation
SGI Altix 3000 (ccNUMA)
• 64 Itanium 2 (IA64) processors
• C-brick modules with 2 CPUs and the SHUB ASIC
• NUMAflex, NUMAlink interconnects (6.4 GB/s, 2.4 GB/s)
• Modified Linux kernel (2.6 NUMA support)

DM-MIMD implementations
• Massively parallel systems (NUMA)
    – communication oriented architecture
    – low-latency, high-bandwidth interconnects
    – topologies: hypercube, torus, tree
    – Butterfly networks, Omega networks
    – engineered from the ground up for communication

DM-MIMD implementations
• Commodity clusters
    – a cluster is a collection of connected, independent computers working in unison to solve a problem
    – COTS technology
    – nodes are interconnected by Ethernet LAN, Myrinet, QsNet ELAN, etc.
    – computation can be performed using popular programming toolkits and frameworks: OpenMP, MPI
    – clusters require dedicated management software

NUMA implementations
Cray T3E-1350
• Processor: Alpha 21164 675 MHz
• Number of CPUs: 40 - 2176
• 3-D Torus topology
• Operating system: UNICOS/mk - microkernel based
• Peak performance: 3 TFLOPS

Commodity cluster implementation (1)
Linux Networx/Quadrics
• Processor: Intel Xeon 2.4 GHz
• CPUs: 2304
• Interconnections: QsNet ELAN3
• Operating system: Linux + management tools + Lustre Cluster File System
• Linpack performance: 7.6 TFLOPS
• 3rd computer on the TOP500 list
• Developed for Lawrence Livermore National Laboratory in 2002

Commodity cluster implementation (2)
HP XC6000 Cluster (XC3000 Cluster)
• Processor: Intel Itanium 2 6M 1.5 GHz (Intel Xeon 3 GHz)
• Node: HP Integrity rx2600 (HP ProLiant DL380)
• Number of processors: 34-512
• Interconnections: QsNet ELAN3 (Myricom Myrinet XP)
• Operating system: Linux + SSI Middleware + management tools + Lustre Cluster File System
• Peak performance: 34 CPUs - 204 GFLOPS, 512 CPUs - 3 TFLOPS

Commodity Clusters - software
• Operating system - Linux or SSI Linux (Single System Image)
• Platform for specialized applications for science, engineering and business (simulation, modeling, data mining)
• Distributed computation environments are used for software development (OpenMP, MPI)
• Common supercomputer applications require porting to clusters

Performance Scaling
[Figure: Scale Right, with axes Scale-Out (Cluster) and Scale-Up (SMP, ccNUMA)]

Processors (1)
• Many types of existing processors are used in supercomputers
• Microprocessor development directions:
    – Increasing clock frequency and the speed of instruction stream processing
    – Processing a large collection of data with a single processor instruction (SIMD)
    – Multiplication of control paths (multithreading)

Processors (2)
• Vector processors
    – NEC SX-6
    – Cray (Cray X1)
• RISC processors
    – MIPS
    – IBM Power4
    – Alpha
• CISC processors
    – IA32
    – AMD x86-64
• VLIW processors
    – IA64

Intel Itanium 2 features
• State-of-the-art, unconventional 64-bit architecture
• New programming model implementing the VLIW paradigm
• EPIC technology - Explicitly Parallel Instruction Computing
    – the compiler determines instruction dependencies and tells the processor how to process the instruction stream in parallel
• Many registers (128 64-bit registers), register stack management
• 6 GFLOPS peak performance
• The full advantages of the processor can be exploited only with a dedicated compiler

Operating systems
• Monolithic kernel based OSs - UNIX (modification of existing solutions)
    – BSD
    – Solaris
    – Irix
    – Linux
• Microkernel based OSs
    – Mach

Microkernel architecture
[Figure: Task A, Task B and Task C running on top of the kernel, which runs on the hardware]

Summary
• Today there are many supercomputer architectures
• Both vector processors and common RISC, CISC and VLIW chips are used in supercomputers
• Commodity clusters running Linux are an attractive way to build a supercomputer

TOP 500 list (1)
1. Earth Simulator, NEC - 35.86 TFLOPS
2. HP Alphaserver SC, HP - 13.88 TFLOPS
3. Linux Networx / Quadrics IA32 - 7.634 TFLOPS

TOP 500 list (2)
Source: http://www.top500.org/list/2003/06/
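
The peak figures quoted for the HP XC6000 cluster (204 GFLOPS for 34 CPUs, 3 TFLOPS for 512 CPUs) are consistent with the 6 GFLOPS per-processor peak listed for Itanium 2. The sketch below reproduces that arithmetic in Python. The function name `peak_gflops` and the value of 4 floating-point operations per clock cycle are assumptions introduced here for illustration. The slides state only the 1.5 GHz clock and the 6 GFLOPS result.

```python
def peak_gflops(processors, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: processors x clock (GHz) x FLOPs per cycle."""
    return processors * clock_ghz * flops_per_cycle


# Values taken from the slides: Itanium 2 at 1.5 GHz, clusters of 34 to 512 CPUs.
ITANIUM2_CLOCK_GHZ = 1.5
# ASSUMPTION (not in the slides): 4 floating-point operations per cycle,
# chosen so that one processor yields the quoted 6 GFLOPS peak.
ITANIUM2_FLOPS_PER_CYCLE = 4

for cpus in (1, 34, 512):
    gflops = peak_gflops(cpus, ITANIUM2_CLOCK_GHZ, ITANIUM2_FLOPS_PER_CYCLE)
    print(f"{cpus:4d} CPUs: {gflops:7.1f} GFLOPS")
```

The result for 512 CPUs is 3072 GFLOPS, which the slide rounds to 3 TFLOPS. This is a theoretical upper bound, so measured results such as the Linpack values on the TOP500 list are lower.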