Rhodes Island, Greece, April 25-26, 2006
Reiner Hartenstein, TU Kaiserslautern (keynote)
(from HPC to) New Horizons of Very High Performance Computing (VHPC): Hurdles and Chances

Reconfigurable Supercomputing (VHPC) going commercial
• Cray XD1
• Silicon Graphics RASC
• … and other vendors
© 2006, [email protected]  http://hartenstein.de

The Pervasiveness of RC
[Figure: number of Google hits for searches of the form “FPGA and ….”, shown for an ECE-savvy scene and for a Math/SW-savvy scene (unqualified for RC?). Hit counts shown: 647,000; 1,490,000; 171,000; 194,000; 398,000; 1,620,000; 127,000; 113,000; 158,000; 162,000; 915,000; 272,000.]

Methodology?
• a world-wide mass movement
• reminds me of the mass migration of lemmings
• not really a sense of direction
• terminology chaos
• an urgent need to get organized

>> Outline <<
• Reconfigurable Computing Paradox
• The Supercomputing Paradox
• We are using the wrong model
• Coarse-grained Reconfigurable Devices
• Super Pentium for Desktop Supercomputer
http://www.uni-kl.de

The Reconfigurable Computing Paradox
• poor FPGA technology:
    • very poor effective integration density
    • “very power-hungry” [Rick Kornfeld*]
    • lower clock frequencies, and more expensive
• poor tools:
    • very poor application development support
    • languages and tools unacceptable for software people
    • most hardware experts (86%**) hate their tools
• poor education:
    • RC education: extremely poor, or none
    • … teach like for a 50 year old mainframe …
    • ignored by CS curricula
*) personal communication
**) DeHon ‘98

Education?
• Joint Task Force for Computing Curricula 2004 fully ignores Reconfigurable Computing
• FPGA & synonyms: 0 hits (Google: 10 million hits)
• not even here

Completed?
• Computing Curricula v.2005: no changes other than “… FPGA, etc.” (not really mentioning that it is missing)
• Task force activity completed?
• Next task force in 2020 or later?

Tools?
End of this week: brainstorming session at DARPA (urgently needed, overdue!)

Technology: fine-grained RC: 1st DeHon’s Law [1996: Ph.D., MIT]
[Figure: density in transistors / microchip (log scale, 10^0 to 10^9) versus year (1980 to 2010). Curves and their overheads: FPGA physical (wiring overhead), FPGA logical (reconfigurability overhead), FPGA routed (routing congestion). Annotations: overhead >> 10 000; immense area inefficiency.]

published speed-up factors
[Figure: relative performance (log scale, 10^0 to 10^9) versus year (1980 to 2010), with the pre-FPGA era marked and the 8080 and Pentium 4 as processor reference points. Published speed-up factors between about 20 and 15 000 are plotted for DSP and wireless, image processing, pattern matching, multimedia, crypto, and bioinformatics applications. Labeled applications: Reed-Solomon decoding, real-time face detection, video-rate stereo vision, MAC, Viterbi decoding, pattern recognition, SPIHT wavelet-based image compression, Smith-Waterman pattern matching, molecular dynamics simulation, BLAST, protein identification, FFT, Los Alamos traffic simulation, GRAPE astrophysics, 2-D FIR filter [TU-KL], Lee routing (by TU-KL), and grid-based DRC (“fair comparison”; no FPGA: DPLA on MoM by TU-KL).]
http://xputers.informatik.uni-kl.de/faq-pages/fqa.html

pre-FPGA era: Why DPLA* was so good
• Large arrays of canonical boolean expressions
• PLA layout ~similar to RAM / ROM layout: close to Moore because of small overhead (wiring, programmability, routing)
• Mid-’80s: first very tiny FPGAs available
• GAG: Generic Address Generator, to avoid address computation overhead
• ASM: AutoSequencing Memory
*) designed by TU-KL, fabricated by the E.I.S. German multi-university project
[M. Herz et al.: ICECS 2003, Dubrovnik]

Data Counter instead of Program Counter (anti-von-Neumann machine paradigm)
• ASM: AutoSequencing Memory (RAM with a GAG)
• Generalization of the DMA
• GAG & enabling technology: published 1989 [by TU-KL]
• Survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik]
• patented by TI** 1995
• Storage scheme optimization methodology, etc.
*) IMEC & TU-KL
**) -

Thousands or Millions of $ for free
• Application migration [from supercomputer] results not only in massive speed-ups
• Electricity bills reduced by an order of magnitude and even more
• you may get for free …. up to millions of dollars per year (also a matter of national energy policy)
• Examples shown: Google, Amsterdam, NY

Reconfigurable Scientific Computing
• How do software types program the FPGAs?
• Hiring a good student from the EE Dept.?
• Because of missing RC education: far away from optimum solutions?
• Much higher speed-up achievable? 1 or 2 more orders of magnitude? 100,000? 1,000,000?

By education: better speed-up factors?
[Figure: the same chart of published speed-up factors as on the “published speed-up factors” slide (relative performance versus year, 1980 to 2010). http://xputers.informatik.uni-kl.de/faq-pages/fqa.html]

The Supercomputing Paradox
• COTS processors: decreasing cost
• Increasing number of processors running in parallel
• Growing listed Teraflops
• Almost stalled application implementation progress
• Often limited sustained Teraflops
• Very high total cost of the Tera(?)flops
• Scientists waiting for affordable compute capacity
The Law of More

Why traditional supercomputing / HPC failed
because of:
• the wrong multi-core interconnect architecture
• the wrong way the data are moved around
• instruction-stream-based: memory-cycle-hungry
(stolen from Bob Colwell)

moving data around inside the Earth Simulator
• Crossbar weight: 220 t, 3000 km of thick cable
• 5120 processors, 5000 pins each
• ES 20: TFLOPS

Bringing together data and processor
Moving data to the processor by software: moving the grand piano

coarse-grained RC: Hartenstein’s Law [1996: ISIS, Austin, TX]
[Figure: transistors / microchip (log scale, 10^0 to 10^9) versus year (1980 to 2010). rDPA: area efficiency very close to Moore’s law. FPGA routed: >> 10 000 below.]

higher speed-up factors by coarse-grained?
[Figure: the same chart of published speed-up factors as on the “published speed-up factors” slide.]

Coarse grain is about computing, not logic
• SNN filter on KressArray (mainly a pipe network)
• array size: 10 x 16 = 160 rDPUs
• no CPU
• rDPU: reconfigurable Data Path Unit, e.g. 32 bits wide
Legend: rDPU not used; rDPU used for routing only (route-through only); operator and routing; port location marker; backbus connect; backbus connect not used
[Ulrich Nageldinger]

SW-to-coarse-grained-CW migration example
[Figure: array of rDPUs; the operator “+” and the output S are marked.]

Compare it to software solution on CPU
S = R + (if C then A else B endif);
on a very simple CPU, with C = 1 and a 200 MHz clock:
• if C then read A: read instruction, instruction decoding, read operand*, operate & register transfers
• if not C then read B: read instruction, instruction decoding
• add & store: read instruction, instruction decoding, operate & register transfers, store result

hypothetical branching example to illustrate software-to-configware migration
S = R + (if C then A else B endif);
simple conservative CPU example, C = 1, clock 200 MHz (5 nanoseconds)

    instruction             step                         memory cycles   nanoseconds
    if C then read A        read instruction                   1             100
                            instruction decoding
                            read operand*                      1             100
                            operate & reg. transfers
    if not C then read B    read instruction                   1             100
                            instruction decoding
    add & store             read instruction                   1             100
                            instruction decoding
                            operate & reg. transfers
                            store result                       1             100
    total                                                      5             500

*) if no intermediate storage in register file

Why the speed-up? What’s the difference?
moving the locality of operation into the route of the data stream by P&R, instead of moving data by instruction streams

The wrong mind set ....
S = R + (if C then A else B endif);
“but you can’t implement decisions!”
[Figure: section of a very large pipe network with inputs R, B, A and C = 1, an adder “+”, and route-through-only rDPUs.]
• not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm
• We need Reconfigurable Computing Education
Legend: rDPU not used; rDPU used for routing only; operator and routing; port location marker; backbus connect; backbus connect not used
[Ulrich Nageldinger]

The new paradigm: how the data are traveling
• no, not by instruction execution
• not transport-triggered: old hat [Jack Lipovski, EUROMICRO, Nice, 1975]
• vN Move Processor: instruction-driven
• DPU pipeline, or chaining: instruction-driven
• super systolic array
• P&R: move locality of operation, not data!

Data streams
• H. T. Kung paradigm (systolic array)
• define: ... which data item at which time at which port
• DPA (pipe network) surrounded by ASMs that supply the input data streams and collect the output data streams
• ASM implemented by distributed memory
• 50 & more on-chip ASM are feasible
• ASM: AutoSequencing Memory (RAM with a GAG as data counter)
[Figure: input and output data streams drawn as data items (x) over time and port #, entering and leaving the DPA through ASMs.]

The Generalization of the Systolic Array
• systolic array: only for applications with regular data dependencies
• remedy? discard algebraic synthesis methods; use optimization algorithms, e.g. simulated annealing [R. Kress]
• Kress-Kung paradigm: super systolic array
• reconfigurability makes sense
• Achievement: also non-linear and non-uniform pipes, and even more wild pipe structures possible

Here is the common model
• it’s not von Neumann
• the vN monopoly in our curricula is severely harmful: the tail is wagging the dog
• we need dual paradigm education
• software code: instruction-stream-based CPU
• configware code: data-stream-based reconfigurable accelerator
• hardwired accelerator

A potential Pentium successor
• Discard most caches
• have 64* cores, 0.5 - 1 GHz
• with clever interconnect for:
    • concurrent processes,
    • multithreading, and
    • Kung-Kress pipe network
The Desk-top Supercomputer!
*) CPU mode / DPU mode capability

“Super Pentium” configuration example
[Figure: array of cores, most configured as rDPUs and four as CPUs.]

e.g.: ~ 8 x 8 rDPA: all feasible under 500 MHz
World TV & game console & multimedia center
• Variable resolutions and refresh rates
• Variable scan mode characteristics
• Noise reduction and artifact removal
• High performance requirements
• Variable file encoding formats
• Variable content security formats
• Variable displays
• Luminance processing
• Detail enhancement
• Color processing
• Sharpness enhancement
• Shadow enhancement
• Differentiation
• Programmable de-interlacing heuristics
• Frame rate detection and conversion
• Motion detection & estimation & compensation
• Different standards (MPEG2/4, H.264)
• A single device handles all modes
[Figure: SMeXPP rDPA connected to games, camera, SD/MMC cards, radio interface, videos, music, audio interface, LCD display, and baseband processor.]
http://pactcorp.com

Dual Paradigm Application Development
• high level language
• software/configware co-compiler
• software code: instruction-stream-based CPU
• configware code: data-stream-based reconfigurable accelerator
• hardwired accelerator

Software / Configware Co-Compilation
• C language source
• Partitioner
• SW compiler: CPU
• CW compiler: Placement & Routing (Move the Locality of Operation) onto the rDPU array
• Resource Parameters: supporting different platforms
Juergen Becker’s CoDe-X, 1996

Bringing together data and processor
Place the location of execution into the data pipe by Configware: move the stool
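The branching example from the slides above, S = R + (if C then A else B endif), can be sketched in a few lines of Python. This is only an illustration, not the KressArray or CoDe-X tooling: the function and constant names are hypothetical, the memory-cycle count and the 100 ns / 5 ns figures are taken from the slide's table, and the assumption that a filled pipe network delivers one result per clock tick is mine, not stated on the slides.

```python
"""Sketch: S = R + (if C then A else B) as CPU steps versus a data path."""

MEMORY_CYCLE_NS = 100  # from the slide: each memory cycle costs 100 ns
CLOCK_NS = 5           # from the slide: 200 MHz clock, i.e. 5 ns per tick

# Memory-bound steps from the slide's table for the CPU version.
# Decoding and register transfers appear on the slide without a
# memory cycle, so they are left out of the count here.
CPU_MEMORY_STEPS = (
    "read instruction (if C then read A)",
    "read operand",
    "read instruction (if not C then read B)",
    "read instruction (add & store)",
    "store result",
)


def cpu_time_ns() -> int:
    """Time the simple CPU spends on memory cycles for one result."""
    return len(CPU_MEMORY_STEPS) * MEMORY_CYCLE_NS


def pipe_network(r: int, a: int, b: int, c: bool) -> int:
    """Configware view: a multiplexer picks A or B, an adder adds R.

    The decision is a routing choice inside the data path; no
    instruction is fetched to make it.
    """
    selected = a if c else b  # multiplexer controlled by C
    return r + selected       # adder producing S


def run_stream(rs, as_, bs, cs):
    """Feed four input data streams through the pipe, item by item."""
    return [pipe_network(r, a, b, c) for r, a, b, c in zip(rs, as_, bs, cs)]


# HYPOTHETICAL: assume the filled pipe delivers one result per clock tick.
ASSUMED_PIPE_NS_PER_RESULT = CLOCK_NS

if __name__ == "__main__":
    print(cpu_time_ns())
    print(run_stream([1, 2], [10, 20], [30, 40], [True, False]))
    print(cpu_time_ns() / ASSUMED_PIPE_NS_PER_RESULT)
```

Under that one-result-per-tick assumption the ratio printed last is the per-result time of the CPU version divided by that of the pipe; a different pipe depth or memory model would change it.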
Conclusions (1): Hurdles
Obstacles are:
• unbelievably disastrous tools market: enabling technologies available, partly decades old, but not used
• unbelievably ignorant curricula:
    • fragmentation into application-domain-specific cultures and trick boxes
    • transdisciplinary models not available nor taught at CS, nor elsewhere

Conclusions (2): Future Work
• The monopoly of the von-Neumann-based mind set in CS education:
    • heavily stalls progress in R&D, not only in HPC
    • causes high cost in R&D, not only in supercomputing
    • CS graduates are not qualified for our job market
• The von-Neumann-only-based mind set in CS urgently needs to go, to adopt the dual paradigm common model
• CS disciplines must recognize and accept their strategic role and their responsibility toward all their application disciplines: embedded and scientific computing

Conclusions (3): Chances
New horizons: chances are brilliant

thank you

END

Backup:

Co-Compiler Enabling Technology
• is available from academia
• only a small team needed for commercial re-implementation
• on the road map to the Personal Supercomputer

Compilation: Software vs. Configware
• Software Engineering: source program (C, FORTRAN, MATLAB) → software compiler → software code
• Configware Engineering: source “program” → mapper (placement & routing) → configware compiler → configware code; data scheduler → flowware code

Nick Tredennick’s Paradigm Shifts explain the differences
• Software Engineering (CPU): resources: fixed; algorithm: variable; 1 programming source needed (software)
• Configware Engineering: resources: variable; algorithm: variable; 2 programming sources needed (configware, flowware)

Co-Compilation
Software / Configware Co-Compiler:
• source: C, FORTRAN, MATLAB
• automatic SW / CW partitioner
• software compiler → software code
• mapper → configware compiler → configware code
• data scheduler → flowware code

Co-Compiler for Hardwired Kress/Kung Machine [e.g. Brodersen]
Software / Flowware Co-Compiler:
• source
• automatic SW / CW partitioner
• software compiler → software code
• flowware compiler, data scheduler → flowware code

The first archetype machine model
• Software Industry: procedural personalization, RAM-based
• instruction-stream-based mind set: “von Neumann”
• compile or assemble for the mainframe CPU
• Software Industry’s Secret of Success: simple basic Machine Paradigm

The 2nd archetype machine model
• Configware Industry: structural personalization, RAM-based
• data-stream-based mind set: “Kress-Kung”
• compile for the reconfigurable accelerator
• Configware Industry’s Secret of Success: simple basic Machine Paradigm

“Saves more than $10,000 in electricity bills per year (7¢ / kWh) - .... per 64-processor 19" rack” [Herb Riley, R. Associates]

modern FPGA bestsellers
The new model is reality: FPGA fabrics, together with several µprocessors, many memory banks, and other IP cores, on the same COTS microchip

DSP platform FPGA [courtesy Xilinx Corp.]
• 500 MHz PowerPC™ processors (680 DMIPS) with Auxiliary Processor Unit
• 500 MHz multi-port distributed 10 Mb SRAM
• 500 MHz DCM Digital Clock Management
• 500 MHz flexible soft logic architecture, 200K logic cells
• 0.6-11.1 Gbps serial transceivers
• 1 Gbps differential I/O
• 500 MHz programmable DSP execution units
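As a back-of-envelope check on the electricity quote in the backup slides above, the quoted saving can be converted into energy and average power. The sketch below uses only the two numbers from the quote ($10,000 per year, 7¢ / kWh, per 64-processor rack); continuous operation over 8760 hours per year is my assumption, and the function names are hypothetical.

```python
"""Sketch: what a $10,000 yearly electricity saving means in kWh and kW."""

PRICE_USD_PER_KWH = 0.07        # from the quote: 7 cents per kWh
SAVINGS_USD_PER_YEAR = 10_000   # from the quote: lower bound per rack
PROCESSORS_PER_RACK = 64        # from the quote: 64-processor 19" rack
HOURS_PER_YEAR = 24 * 365       # ASSUMPTION: the rack runs continuously


def energy_saved_kwh(savings_usd: float, price_usd_per_kwh: float) -> float:
    """Energy that the saved money would have bought."""
    return savings_usd / price_usd_per_kwh


def average_power_saved_kw(energy_kwh: float, hours: float) -> float:
    """Average power reduction needed to save that energy over the period."""
    return energy_kwh / hours


energy = energy_saved_kwh(SAVINGS_USD_PER_YEAR, PRICE_USD_PER_KWH)
power_kw = average_power_saved_kw(energy, HOURS_PER_YEAR)
power_per_processor_w = power_kw * 1000 / PROCESSORS_PER_RACK

if __name__ == "__main__":
    print(f"{energy:,.0f} kWh per year")
    print(f"{power_kw:.1f} kW average per rack")
    print(f"{power_per_processor_w:.0f} W average per processor")
```

With a lower duty cycle than the assumed continuous operation, the implied power reduction per rack would be correspondingly higher.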