Towards Petaflops
Summary of the "Towards Petaflops" Workshop, 24—28 May 1999
A D Kennedy, D C Heggie, S P Booth
The University of Edinburgh

Programme
The workshop opened with a welcome from Prof A D Kennedy. Each day was devoted to one broad application area:
– Monday 24 May: Computational Chemistry
– Tuesday 25 May: Particle Physics & Astronomy
– Wednesday 26 May: Biological Sciences
– Thursday 27 May: Materials, Soft & Hard
– Friday 28 May: Meteorology & Fluids
Speakers and their affiliations:
– Dr N H Christ, Columbia University, USA
– Dr J M Levesque, IBM Research, USA
– Dr G Ackland, University of Edinburgh
– Dr P Burton, UKMO, Bracknell
– Prof R Catlow, Royal Institution, London
– Prof R d'Inverno, University of Southampton
– Dr M F Guest, CLRC, Daresbury
– Dr M Payne, Cambridge University
– Dr T Arber, University of St Andrews
– Prof E De Schutter, University of Antwerp, Belgium
– Dr R Tripiccione, INFN-Pisa, Italy
– Dr N Topham, University of Edinburgh
– Dr D Roweth, QSW, Bristol
– Dr A H Nelson, University of Cardiff
– Prof M van Heel, Imperial College, London
– Prof D C Heggie, University of Edinburgh
– Mr M Woodacre, SGI/Cray
– Dr M F O'Boyle, University of Edinburgh
– Dr L G Pedersen, University of North Carolina, USA
– Dr M Wilson, University of Durham
– Dr M Ruffert, University of Edinburgh
– Dr P Coveney, Queen Mary & Westfield College, London
– Dr A R Jenkins, University of Durham
– Dr C M Reeves, University of Edinburgh
Discussion sessions were chaired by Prof R Catlow, Prof N Christ, Dr J Bard, Dr T Arber, and Dr M Payne.

Participants
– Dr Graham Ackland (graeme@holyrood)
– Prof Tony Arber ([email protected])
– Dr Jonathan Bard ([email protected])
– Dr Stephen Booth ([email protected])
– Dr Ken Bowler ([email protected])
– Dr Paul Burton ([email protected])
– Prof Mike Cates ([email protected])
– Prof Richard Catlow ([email protected])
– Prof Norman Christ ([email protected])
– Prof Peter Coveney ([email protected])
– Prof Ray d'Inverno ([email protected])
– Prof Erik De Schutter ([email protected])
– Dr Paul Durham ([email protected])
– Mr Dietland Gerloff ([email protected])
– Mr Simon Glover ([email protected])
– Dr Bruce Graham ([email protected])
– Dr Martyn F Guest ([email protected])
– Prof Douglas C Heggie ([email protected])
– Dr Suhail A Islam ([email protected])
– Dr Adrian R Jenkins ([email protected])
– Mr Bruce Jones ([email protected])
– Mr Balint Joo ([email protected])
– Prof Anthony D Kennedy ([email protected])
– Prof Richard Kenway ([email protected])
– Dr Crispin Knebel ([email protected])
– Mr John M Levesque ([email protected])
– Dr Nick Maclaren ([email protected])
– Mr Rod McAllister ([email protected])
– Dr Avery Meiksin ([email protected])
– Dr Alistair Nelson ([email protected])
– Dr Mike O'Boyle ([email protected])
– Dr John Parkinson ([email protected])
– Dr Mike C Payne ([email protected])
– Dr Lee G Pedersen ([email protected])
– Dr Nilesh Raj ([email protected])
– Dr Federico Rapuano (Federico.Rapuano@roma1)
– Dr Clive M Reeves ([email protected])
– Dr Duncan Roweth ([email protected])
– Dr Max Ruffert ([email protected])
– Mr Vance Shaffer ([email protected])
– Dr Doug Smith ([email protected])
– Mr Philip Snowdon ([email protected])
– Dr Nigel Topham ([email protected])
– Dr Arthur Trew ([email protected])
– Dr Raffaele Tripiccione ([email protected])
– Prof Marin Van Heel ([email protected])
– Mr Claudio Verdozzi ([email protected])
– Dr Mark Wilson ([email protected])
– Mr Stuart Wilson ([email protected])
– Mr Michael Woodacre ([email protected])
– Dr Andrea Zavanella ([email protected])

Introduction
Objectives of this summary:
– Summarise areas of general agreement
– Highlight areas of uncertainty or disagreement
– Concentrate on technology, architecture, & organisation
– For details of the science which might be done, see the slides of the individual talks
This summary expresses the views & opinions of its authors:
– it does not necessarily represent a consensus or
majority view…
– … but it tries to do so as far as possible

Devices & Hardware
Silicon CMOS will continue to dominate; GaAs is still tomorrow's technology (and always will be?).
Moore's Law:
– performance increases exponentially, with a doubling time of about 18 months
– this will continue for at least 5 years, and probably longer
Trade-offs between density & speed:
– Gb DRAM and GHz CPUs within O(5 years)…
– … but not both on the same chip
– there is also a choice between speed & power
Transistor counts per device will continue to grow by orders of magnitude through 2005 and on to 2012.
The most cost-effective technology is usually a generation behind the latest technology.

Memory latency will increase:
– more levels of cache hierarchy, implying a tree-like hierarchy of access speeds; it is not clear how scientific HPC applications map onto this
– access to main memory is becoming relatively as slow as access to a remote processor's cache
– an understanding of the memory architecture is required to achieve optimal performance (analogous to the use of virtual memory)
In the fairly near future arithmetic will be almost free; one will pay for memory & communications bandwidth.

Technology is driven by the mass market:
– commodity parts
– "intercepting technology": systems designed to use technology currently under development; the cost & risk of the newest generation v. its performance benefit; the "sweet point" on the technology curve
– PCs, workstations, DSPs, … are not designed for HPC

Level of integration:
– HPC vendors will move from board-level to chip-level design
– it is cost-effective to produce O(10³) chips or more
– silicon compilers
– time scale?

Error rates will increase:
– fault tolerance will be required
– implications for very large systems?
– time scale?
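As an aside, the quoted 18-month doubling time can be turned into concrete growth factors. A minimal sketch, purely illustrative (the helper name and the 5-year horizon are our own, not from the slides):

```python
# Moore's-law growth factor: performance doubles every 18 months
# (the figure quoted above).  Hypothetical helper for illustration.

def moore_factor(years, doubling_months=18):
    """Return the performance growth factor after `years`."""
    return 2.0 ** (years * 12.0 / doubling_months)

# Over the 5-year horizon the workshop considered reliable,
# performance grows by roughly an order of magnitude:
print(round(moore_factor(5), 1))   # about 10.1x
```

An order of magnitude every five years is why the summary treats arithmetic as becoming "almost free" relative to memory and communications bandwidth.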
Disks & I/O:
– increasing density
– decreasing cost per bit
– increasing relative latency

Architecture
Memory addressing:
– Flat memory (implicit communications): the model naïve users want; it does not really exist in hardware, and dynamic hardware coherency mechanisms seem unlikely to work well enough in practice
– Distributed memory (explicit communications): NUMA; protocols such as MPI, OpenMP, …, and SHMEM; scientific problems usually have a simple static communications structure, easily handled by get and put primitives

Single node performance — fat nodes or thin nodes?
– limited by communication network bandwidth?
– limited by memory bandwidth (off-chip access)?
– the "sweet point" on the technology curve
Single node architecture:
– VLIW, vectors, superscalar, or multiple CPUs on a chip

Communications:
– what network topology? a 2d, 3d, or 4d grid; an Ω network, butterfly, hypercube, or fat tree; a crossbar switch
– bandwidth
– latency: a major problem for coarse-grain machines
– packet size: a problem for very fine-grain machines

MPP:
– scalable for the right kind of problems, up to technological limits
– commercial interconnects, e.g., from QSW (http://www.quadrics.com/)
– flexibility v. price/performance: custom networks for a few well-understood problems which require high-end performance (e.g., QCD); more general networks for large but more general-purpose machines

SMP clusters:
– limited scalability?
– this appears to be what vendors want to sell us (IBM, SGI, Compaq, …): there is a large market for general-purpose SMP machines, and adding a cluster interconnect is cheap
– it is unclear whether large-scale scientific problems map well onto the tree-like effective network topology
– how do we program such machines?
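To make the topology trade-offs above concrete: a d-dimensional hypercube connects each of its 2^d nodes to the d nodes whose addresses differ in exactly one bit. A minimal sketch (the helper is hypothetical, not from the talks):

```python
# Hypercube topology: node i's neighbours are found by flipping each
# of the d bits of its address.  Illustrative sketch only.

def hypercube_neighbours(node, d):
    """Neighbours of `node` in a d-dimensional hypercube of 2**d nodes."""
    return [node ^ (1 << k) for k in range(d)]

# Node 5 (binary 101) in a 3-cube is linked to 100, 111, and 001:
print(sorted(hypercube_neighbours(5, 3)))   # [1, 4, 7]
```

Each node thus needs only d = log₂(N) links, which is why the hypercube (and its butterfly and fat-tree equivalents) scales to large machines far more cheaply than a full crossbar.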
PC or workstation clusters:
– Beowulf, Avalon, …
– cheap, but not tested for large machines
– communication mechanisms unspecified
– "farms" of PCs are a very cost-effective way of providing large capacity

Static v. dynamic scheduling:
– static (compiler) instruction scheduling is more appropriate than dynamic (hardware) scheduling for most large scientific applications

Languages & Tools
Efficiency:
– new languages will not be widely used for HPC unless they can achieve performance comparable with low-level languages (assembler, C, Fortran)
Portability:
– to different parallel architectures
– to the next generation of machines
– to different vendors' architectures
Reusability:
– object-oriented programming
– current object-oriented languages (C++, Java, …) were not designed for HPC

Optimisation:
– compilers can handle local optimisation well: register allocation and instruction scheduling
– global optimisation will not be automatic: the choice of algorithms, data layout, memory hierarchy management, and re-computation v. memory use; it could be helped by better languages & tools

How do we get scientists & engineers to use new languages & tools?
– performance must be good enough
– compilers & tools must be widely available and reliable
– documentation & training
A new generation of scientists with interest and expertise in both computer science and applications is required:
– encouragement to work & publish in this area
– the usual problems of interdisciplinary work apply: credit for software written is not on a par with that for publications

Models & Algorithms
Disciplines with simple, well-established methods:
– models are "exact"
– methods are well understood & stable
– errors are under control, at least as well as for experiments
– leading-edge computation is required for international competitiveness
– examples: particle physics (QCD), astronomy (N-body)

Disciplines with complex models:
– approximate models are used for small-scale physics
– is reliability limited by the sophistication of the underlying models, the scale of computation, or the availability of data for initial or boundary conditions?
– many different calculations for different systems: capacity v.
capability issues
– examples: meteorology, materials

Reliance on packages:
– commercial packages are not well tuned for large parallel machines, and their algorithms may need changing
– the community is resistant to changing to new packages or writing its own systems
– examples: chemistry, engineering

Exploration:
– HPC is not yet widely used; access to machines and to expertise is a big hurdle
– exciting prospects for future progress
– algorithms and models need development
– example: biology

Access & Organisation
Bespoke machines:
– the best solution for a few special areas: QCDSP, APE
– special-purpose machines: GRAPE

Performance versus cost:
(Figure omitted: performance versus cost of various machines; slide courtesy of Norman Christ, Columbia University. The diagonal lines are lines of fixed cost; note the dates of the various machines.)

Commercial machines:
– SMP: convenient and easy to use, but not very powerful; good for capacity as opposed to capability
– SMP clusters: it is unclear how effective they are for large-scale problems, and unclear how they will be programmed
– commercial interconnects (QSW, …)

Capacity v. capability:
– a large machine is required to get the "final point on the graph"
– cost-effectiveness
– international competitiveness

Shared v.
dedicated resources
– systems management costs
– advantages of shared resources: a central large-scale data store; backups
– disadvantages of shared resources: users make more reasonable requests if they have to pay for their implementation; centres tend to invent software projects which are not the users' highest priority

Dedicated machines for consortia:
– flexibility in scheduling: there is no need to prioritise projects in totally different subjects, and the users know & can negotiate with each other
– ease of access for experimental projects: consortia can be more flexible at allocating resources to promising new approaches

Sponsors
The workshop was supported by:
– The University of Edinburgh Faculty of Science & Engineering
– Hitachi
– IBM

Glossary
API: Application Program Interface
– a documented interface to a software subsystem, so that its facilities can be used by application (user) programs
Butterfly network: a network topology which allows the "perfect shuffle" required for FFTs to be carried out in parallel
– equivalent to the Ω network, fat tree, and hypercube
Cache: fast near-processor memory (usually SRAM) which contains a copy of the contents of parts of main memory
– often separate instruction & data caches
– commonly organised as a hierarchy of larger caches of increasing latency
– data is automatically fetched from memory when needed, if it is not already in the cache
– an entire cache "line" is moved from/to memory, even if only part of it is required/modified
– data is written back to memory when its cache "line" is needed for data from some other memory address, or when someone else needs the new value of the data
– one (direct-mapped) or several (set-associative) cache "lines" can be associated with a given memory address
Capability: the ability to solve one big problem in a given time
Capacity: the ability to solve many small problems in a given time
CISC: Complex Instruction Set Computer
– instructions combine memory and arithmetic operations
– instructions are not of uniform size or duration
– often implemented using microcode
CMOS: Complementary Metal Oxide Semiconductor
– by far the most widely used VLSI technology at present
– a VLSI technology in which both p-type and n-type FETs are used
– the design methodology is that each output is connected by a low-impedance path to either the source or the drain voltage; power is dissipated only when switching states, so static power dissipation is low
– NMOS technology requires fewer fabrication steps, but draws more current and is no longer in common use
– BiCMOS allows the construction of both FET and bipolar transistors on the same chip
• it requires more fabrication steps, and therefore a larger minimum feature size (and thus lower density) for an acceptable yield
• bipolar transistors can drive more current (for a given size and delay) than FETs
Coherency: a means of ensuring that the copies of data in memory and caches are consistent
– the illusion that a single value is associated with each address
Crossbar: a network topology allowing an arbitrary permutation in a single operation
Data parallel: a programming model in which all nodes carry out the same operation on different data simultaneously
– may be implemented using SIMD or MIMD architectures
Delayed branches: several instructions following a branch are unconditionally executed before the branch is taken, allowing the instruction pipeline to remain filled
DRAM: Dynamic RAM
– each bit is stored as the charge on the gate of an FET transistor
– only one transistor is required to
store each bit
– DRAM needs to be refreshed (read and rewritten) every few milliseconds, before the charge leaks away
DSP: Digital Signal Processor
– low cost
– low power
– no cache
– used for embedded devices
Dynamic scheduling: the order in which instructions are issued is determined on the basis of current activity
– dynamic branch prediction: instructions are prefetched along the path taken the last few times through the branch
– scoreboarding: instructions are delayed until the resources they require (e.g., registers) are free
ECC: Error Correcting Codes
– a mechanism for correcting bit errors in DRAM by using, e.g., Hamming codes
Fat nodes: fast & large processor nodes in a multiprocessor machine
– allow a relatively large sub-problem to live on each node
– permit coarse-grained communications (large packets, but fewer of them)
– memory bandwidth is a potential problem
Fat tree: a network topology which allows the "perfect shuffle" required for FFTs to be carried out in parallel
– equivalent to the butterfly, Ω network, and hypercube
FET: Field Effect Transistor
– a transistor in which the channel (source-to-drain) impedance is controlled by the charge applied to the gate
– no current flows from the gate to the channel
– cf. a bipolar transistor, in which a current drawn through the base controls the current flowing from the emitter to the collector
FFT: Fast Fourier Transform
– an O(n log n) algorithm for taking the Fourier transform of n data values
GaAs: Gallium Arsenide
– a semiconductor whose band-gap structure allows faster switching times than those for Si (silicon)
– its fabrication technology lags that for Si
– a larger minimum feature size is needed for an acceptable yield
– VLSI speed is limited by path length, not by switching times
Generation: a level of technology used for chip
fabrication, usually measured by the minimum feature size
GRAPE: a special-purpose machine for solving the n-body problem
Hypercube: a network topology which allows the "perfect shuffle" required for FFTs to be carried out in parallel
– nodes live on the vertices of a d-dimensional hypercube
– the communications links are the edges of the hypercube
– equivalent to the butterfly, Ω network, and fat tree
Instruction prefetch: the ability to fetch & decode an instruction while the previous instruction is still executing
– allows "pipelining" of the instruction execution stages
– requires special techniques to deal with conditional branches, where it is not yet known which instruction is "next"
Latency: the time between issuing a request for data and receiving it
Microcode: a sequence of microinstructions used to implement a single machine instruction
– similar functionality to having simpler (RISC) instructions with an instruction cache, but less flexible
– stored in ROM
– reduces the complexity of the processor logic
– some similarities: "vertical" microcode resembles RISC; "horizontal" microcode resembles VLIW
MPI: Message Passing Interface
MPP: Massively Parallel Processor
– a collection of processor nodes, each with its own local memory
– an explicitly distributed memory architecture
– nodes are connected by a network of some regular topology
MIMD: Multiple Instruction Multiple Data
– an architecture in which each processor has its own instruction and data streams
NUMA: Non-Uniform Memory Access
Object-oriented programming: a programming paradigm in which data and procedures are encapsulated into objects
– objects are defined by their interfaces, i.e., what they do and not how they do it
– the way the data is represented within an object, and the way the methods which can manipulate it are
implemented, are hidden from the rest of the program
Ω network: a network topology which allows the "perfect shuffle" required for FFTs to be carried out in parallel
– equivalent to the butterfly, fat tree, and hypercube
OpenMP: a shared-memory parallel programming API
– see http://www.openmp.org for details
Packet size: the amount of data that can (or has to) be transferred as an atomic unit
– there is an overhead associated with each packet (framing, headers, …)
– the packet size must be small for a fine-grained machine, otherwise the available bandwidth cannot be used to send useful data
QCD: Quantum ChromoDynamics
– the theory of the strong interaction, by which strongly interacting elementary particles are built from quarks and gluons
– non-perturbative QCD calculations use Monte Carlo methods on a lattice discretisation of space-time
RAM: Random Access Memory
RISC: Reduced Instruction Set Computer
– arithmetic operations act only on registers
– instructions are of uniform length and duration
• this rule is almost always violated by the inclusion of floating-point instructions
– memory access, instruction decoding, and arithmetic can be carried out by separate units in the processor
ROM: Read Only Memory
SDRAM: Synchronous DRAM
– a DRAM chip protocol allowing more overlap of memory access operations
SECDED: Single Error Correction, Double Error Detection
– the most common form of ECC used for DRAM memories
– usually uses 7 syndrome bits for 32-bit words, or 8 syndrome bits for 64-bit words
Silicon compilers: translators from a hardware description language (such as VHDL) into the set of masks from which a (semi-)custom chip can be made
SIMD: Single Instruction Multiple Data
– an architecture in which all processors execute the same instruction at the same time on different data
–
instructions are usually broadcast from a single copy of the program
– the processors run in lock-step
SMP: Symmetric MultiProcessor
– a set of processors connected to a shared memory by a common bus or switch
SRAM: Static RAM
– each bit value is stored as the state of a bistable "flip-flop"
– six transistors are required per bit
– used for registers and on-chip caches
Static scheduling: scheduling of instructions (usually by a compiler) to optimise performance
– does not rely on hardware to analyse the dynamic behaviour of the program
– may make use of knowledge of the "average" behaviour of a program, obtained from profiling
Superscalar: an architecture having several functional units which can carry out several operations simultaneously
– load/store
– integer arithmetic
– branch
– floating-point pipelines
Thin nodes: relatively slow & small processors in a multiprocessor machine
– require fine-grained parallelism
Vector instructions: instructions which carry out the same operation on a whole stream of data values
– reduce memory bandwidth requirements by providing many data words for one address word
– reduce memory bandwidth requirements by reducing the number of instructions fetched (no gain if there is a reasonably large instruction cache)
– require more register space for temporaries
– some architectures use short vector instructions
• Intel Pentium MMX instructions
• HP PA-RISC extensions
VLIW: Very Long Instruction Word
– instructions which allow software to control many functional units simultaneously
– hard to program by hand
– allows compilers opportunities for static scheduling
VLSI: Very Large Scale Integration
– a technology in which a large number of TTL (transistor-transistor
logic) circuits are fabricated on a single semiconductor chip
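The "perfect shuffle" that the butterfly, Ω-network, fat-tree, and hypercube entries all refer to can be sketched in a few lines. For n = 2^d elements it cyclically rotates the d index bits, so d successive shuffles restore the original order (a hypothetical illustration, not from the slides):

```python
# Perfect shuffle: interleave the two halves of a sequence, as a
# riffle shuffle of cards does.  For n = 2**d elements this rotates
# the index bits left by one, so d shuffles give back the identity.

def perfect_shuffle(xs):
    half = len(xs) // 2
    out = []
    for a, b in zip(xs[:half], xs[half:]):
        out.extend([a, b])
    return out

xs = list(range(8))                       # d = 3
print(perfect_shuffle(xs))                # [0, 4, 1, 5, 2, 6, 3, 7]
s = xs
for _ in range(3):
    s = perfect_shuffle(s)
print(s == xs)                            # True: identity after d shuffles
```

It is exactly this bit-rotation of indices that a butterfly or Ω network realises in hardware in one pass, which is why those topologies support a parallel FFT directly.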