Dynamically Reconfigurable Architectures
Dagstuhl, Germany, April 2 - 7, 2006
Reiner Hartenstein, TU Kaiserslautern
Reconfigurable Supercomputing: What are the Problems? What are the Solutions?

The Supercomputing Paradox
COTS processor decreasing cost
Increasing number of processors running in parallel
Rapidly growing listed Teraflops
Almost stalled application implementation progress
Often limited sustained Teraflops
Very high total cost of the Tera(?)flops
Scientists waiting for affordable compute capacity

© 2006, [email protected] http://hartenstein.de

dangerously telling this to the supercomputing people:
You … used the wrong roadmap the past 20 years !!!

progress stalled

3 Reconfigurable Computing Paradoxes
Reconfigurable Computing Education Paradox
The low power paradox
The high performance paradox

The Pervasiveness of RC
[Table: # of hits by Google for searches “FPGA and ….”: 647,000; 1,490,000; 171,000; 194,000; 398,000; 1,620,000; 127,000; 113,000; 158,000; 162,000; 915,000; 272,000]

going into every application area
Almost 10 million hits

….
educational deficits
in addition to the hardware / software chasm
We now also have the hardware / configware / software chasm
Curricula still ignore these extremely hot new challenges
The Reconfigurable Computing Education Paradox: its run-away accelerated pervasiveness, despite all these educational deficits

Computing Curricula 2004 (1)
Within about 500 pages the term reconfigurable is not found – nor its synonyms

obsolete
von Neumann’s monopoly inside curricula is obsolete

von Neumann is not the common model
mainframe age: von Neumann instruction-stream-based machine
[Figure labels: RAM, memory, CPU, DPU, program counter, von Neumann bottleneck]
microprocessor age: instruction-stream-based CPU, data-stream-based accelerator co-processors
vN paradigm dominance ? the tail is wagging the dog
hardware / morphware / software

modern FPGA bestsellers:
The new model is reality: FPGA fabrics, together with several µprocessors, several memory banks, and other IP cores, on the same COTS microchip

Bill Gates
Speech by Bill Gates at a summit meeting of US state governors:
"American high schools are obsolete."
"The high schools of today teach kids about today's computers like on a 50-year-old mainframe."
"Without re-design for the needs of the 21st century, we will keep limiting - even ruining the lives of millions of Americans every year."

carved out of stone
The most important cultural revolution since the invention of text characters: it’s not the mainframe
It is the Microchip !

RC education needed
Jürgen Becker, Jörg Henkel, R.
Hartenstein
35 submissions from Australia, Brazil, India, USA, and throughout Europe
http://fpl.org/RCeducation/

Reconfigurable Computing Paradoxes
Reconfigurable Computing Education Paradox
The low power paradox
The high performance paradox

The FPGA Low Power Paradox
The awful technology of FPGAs: „very power-hungry“ [Rick Kornfeld*]
FPGAs run at lower clock frequencies, draw much more power and are more expensive.
*) personal communication
Reducing the electricity bill by an order of magnitude and more by supercomputer-to-FPGA migration

telling this to the low power design people ?
ISLPED, Oct 4 – 6, Tegernsee
you … used the wrong roadmap the past 15 years: use FPGAs !
PATMOS, Sep 13 – 15, Montpellier
1991: Kaiserslautern, Germany
1992: Paris, France
1993: Montpellier, France

Reconfigurable Computing Paradoxes
Reconfigurable Computing Education Paradox
The low power paradox
The high performance paradox

The High Performance Paradox
The awful technology of FPGAs:
Effective integration density much worse than the Gordon Moore curve: by a factor of more than 10,000
FPGAs run at lower clock frequencies, and are more expensive.
85% of all designers hate their tools

fine-grained RC: 1st DeHon’s Law [1996: Ph.
D, MIT]
[Figure: transistors / microchip (10^0 to 10^9) over the years 1980 to 2010. Curves: FPGA physical, FPGA logical, FPGA routed. Labels: density; overhead: wiring overhead, reconfigurability overhead, routing congestion; immense area inefficiency: >> 10 000]

coarse-grained RC: Hartenstein’s Law [1996: ISIS, Austin, TX]
[Figure: transistors / microchip (10^0 to 10^9) over the years 1980 to 2010. Labels: area efficiency very close to Moore’s law; FPGA routed; >> 10 000]

Claassen’s Law
[Figure: MOPS / milliWatt (0.001 to 1000) versus µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07). Label: DSP]

Claassen’s Law: Hartenstein’s Amendment
[Figure: MOPS / milliWatt (0.001 to 1000) versus µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07). Label: DSP]

Selection of published speed-up factors
[Figure: relative performance (10^0 to 10^9) over the years 1980 to 2010.
Application labels: DSP and wireless; Image processing, Pattern matching, Multimedia; Reed-Solomon Decoding; real-time face detection; video-rate stereo vision; MAC; crypto; Grid-based DRC („fair comparison“); pattern recognition; SPIHT wavelet-based image compression; Viterbi Decoding; Smith-Waterman pattern matching; Bioinformatics; molecular dynamics simulation; Grid-based DRC: no FPGA: DPLA on MoM by TU-KL; FFT; protein identification; BLAST; P4; Los Alamos traffic simulation; Lee Routing (DPLA by TU-KL); 2-D FIR filter (no FPGA: DPLA by TU-KL); GRAPE; Astrophysics.
Values shown: 2400; 6000; 730; 457; 1000; 400; 900; 288; 15000; 2000; 100 000; 88; 100; 52; 47; 40; 20; 160; 39,4; 8080]
http://xputers.informatik.uni-kl.de/faq-pages/fqa.html

2nd DeHon’s Law [IEEE COMPUTER, 2000]
[Figure: Computational Density (1 to 1000) versus µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07). Curves: FPGA, RISC]

The three RC Paradoxes

Why supercomputing / HPC failed
because of the interconnect network architecture
the wrong way, how the data are moved around
instruction-stream-based: memory-cycle-hungry
instruction fetch overhead
sequencing overhead
The law of More: address computation overhead and other overhead

moving data around inside the Earth Simulator
Crossbar weight: 220 t, 3000 km of cable
ES 20: TFLOPS
5120 Processors, 5000 pins each

data moved around by software
i.e. by memory-cycle-hungry instruction streams which fully hit the memory wall
P&R: move locality of operation, not data !
stolen from Bob Colwell

An Archetype Common Model needed
from the Configware Industry
Progress stalled by the software/configware chasm
Useful simple archetype not widely accepted
An archetype common model should ....
provide guidance for organizing efficient solutions
make the project manageable
allow sharing lessons between applications and between disciplines
support undergraduate education

The new paradigm: how the data are traveling
no, not by instruction execution
transport-triggered: an old hat
pipeline, or chaining
asynchronous (via handshake)
systolic array
wavefront array

Def.: data streams (flowware)
Flowware defines: ... which data item at which time at which port
[Figure: DPA (pipe network) with input data streams and output data streams, each drawn as data items (x) over the axes time and port #]
source and sink ?
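The flowware definition above (which data item, at which time, at which port) can be pictured as a small schedule table. The following Python sketch is only an illustration under stated assumptions: the names `skewed_schedule` and `render`, the sample data items, and the one-time-step skew per port are hypothetical and not taken from the talk.

```python
from typing import Dict, List, Tuple

# Assumed representation: (time step, port number) -> data item
Schedule = Dict[Tuple[int, int], str]


def skewed_schedule(streams: List[List[str]]) -> Schedule:
    """Stream k enters at port k, delayed by k time steps (assumed skew)."""
    schedule: Schedule = {}
    for port, items in enumerate(streams):
        for i, item in enumerate(items):
            schedule[(i + port, port)] = item
    return schedule


def render(schedule: Schedule) -> List[str]:
    """One text row per time step; '-' marks a port that is idle at that step."""
    steps = 1 + max(t for t, _ in schedule)
    ports = 1 + max(p for _, p in schedule)
    return [
        " ".join(schedule.get((t, p), "-") for p in range(ports))
        for t in range(steps)
    ]


# Three hypothetical input streams, one per port
streams = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]
flowware = skewed_schedule(streams)
rows = render(flowware)
for row in rows:
    print(row)
```

Each printed row is one time step and each column one port, so the schedule alone fixes when every item arrives; no instruction is fetched to move it.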
Data streams source and sink: not my job
Not my Job!

distributed memory
ASM: On-chip Auto-Sequencing Memory
implemented by distributed on-chip memory
[Figure: ASM blocks at the input data streams and at the output data streams of the array. Labels: ASM, RAM, GAG]

How the data are moved
DMA, vN move processor [Jack Lipovski, EUROMiCRO, Nice, 1975]
ASM use GAG generic address generator [TU-KL publ.: Tokyo 1989 + NH journal]
by the way: GAG st…. by TI [TI patent 1995]
Henk Corporaal coins the term “transport-triggered”
MoM: GAG-based storage scheme methodology [Herz*]
Application-specific distributed memory [Catthoor et al.]
*) [see Michael Herz et al.: ICECS 2002 (Dubrovnik)]

The dual paradigm approach
Software Engineering: CPU, von Neumann paradigm
Configware Engineering: ASM, Kress-Kung paradigm

Mathematical Synthesis Methods
algebraic methods, i. e., linear projections
yields only uniform arrays w. linear pipes
only for applications with regular data dependencies

Coarse-grained reconfigurable arrays are a Generalization of the Systolic Array ....
[Rainer Kress]
discard algebraic synthesis methods
use optimization algorithms instead, for example: simulated annealing
the achievement: also non-linear and non-uniform pipes, and even more wild pipe structures possible
now reconfigurability really makes sense

Coarse grain is about computing, not logic
Example: mapping onto rDPA by DPSS: based on simulated annealing
SNN filter on KressArray (mainly a pipe network)
array size: 10 x 16 = 160 rDPUs
rDPU, 32 bit
no CPU
[Ulrich Nageldinger]
Legend: rDPU not used; used for routing only; operator and routing; backbus connect; port location marker; route-through only
tool: KressArray Xplorer: diss. Ulrich Nageldinger (downloadable)

Software / Configware Co-Compilation [Juergen Becker’s CoDe-X, 1996]
[Figure: C language source → Partitioner (with Analyzer / Profiler and Resource Parameters, supporting different platforms) → SW compiler (“vN” machine paradigm) → SW code; CW compiler (anti machine paradigm) → CW Code, FW Code]

Distributed Memory Parallelism Capability
Figure labels: ASM blocks on the sides of the array; Legend: rDPU not used; used for routing only; operator and routing; backbus connect; backbus connect layers …; NN ports
interconnect layer; array size example: 10 x 16; port location marker

Applications for coarse-grained arrays
(on-chip distributed memory for intermediate results)
with steady I/O data streams at constant speed:
Multi-standard world HDTV receiver
Wide variety of multimedia applications
Wide variety of real-time applications
Many other applications

The wrong mind set ....
„but you can’t implement decisions!“
(remark of a high-ranked industrial research head – discussion after a talk by Ulrich Nageldinger – RAW Orlando)

a tiny section of the pipe network
[Figure: adder (+) with output S. Legend: rDPU not used; used for routing only; operator and routing; backbus connect; port location marker]

The wrong mind set ....
section of a very large pipe network:
[Figure: inputs R, B, A, C; select values =1 and =0; adder (+)]
„but you can’t implement decisions!“
not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm

introducing hardware description languages (in the mid-seventies)
“The decision box becomes a (de)multiplexer”
This is so simple: why did it take decades to find out ?
The wrong mind set – the wrong road map!

hypothetical branching example to illustrate software-to-configware migration
S = R + (if C then A else B endif);
section of a major pipe network on rDPU
[Figure: inputs R, B, A, C; select =1; adder (+); output S]
clock 200 MHz (5 nanosec)

simple conservative CPU example (C=1)
instruction             step                        memory cycles   nanoseconds
if C then read A        read instruction            1               100
                        instruction decoding
                        read operand*               1               100
                        operate & reg. transfers
if not C then read B    read instruction            1               100
                        instruction decoding
add & store             read instruction            1               100
                        instruction decoding
                        operate & reg. transfers
                        store result                1               100
total                                               5               500
*) if no intermediate storage in register file

why the RC paradigm shift is so important
by Software / by Configware
Move the stool or the grand piano?

… understand only this parallelism solution:
the instruction-stream-based approach: von Neumann bottlenecks
the data-stream-based approach has no von Neumann bottleneck

What does Reconfigurable Computing mean?
switching the multiplexers?
routing ALU result to a register?
microprogramming?
concurrency of 64 or 256 CPUs on a single chip?
it means using the Kress/Kung machine paradigm !

vN paradigm losing its dominance
http://bwrc.eecs.berkeley.edu/Research/RAMP/people.htm
RAMP project proposes: Run LINUX on FPGAs

vN paradigm losing its dominance
Xilinx inside !
Cray XD1

Recommended Pentium successor
Discard most caches
Have 64* cores with clever interconnect for: concurrent processes, for multithreading, and, Kung-Kress rDPA array
The Desk-top Supercomputer!

What does Reconfigurable Computing mean?
The key issue: which is the underlying paradigm?
Operation not based on instruction-streams at run time
No instruction fetch at run time
machine paradigm is data stream-based: Kress-Kung
Undergraduate education needs a dual paradigm approach: symbiosis of von Neumann / Kress-Kung

thank you

END

Backup for Discussion:

Term to be used for „soft hardware“
accelware, adaptware, adjustware, altware, alterware, arrangeware, changeware, conformware, doughware, fabricsware, fabrixware, fitware, flexware, formware, FPware, gateware, gateroutware, hpcware, LUTware, matchware, modiware, morphware®, morfware, mouldware, muxware, parware, paraware, passware, pathware, patchware, performware, perfware, perware, pipeware, platformware, railware, rangeware, RCware, ressourceware, routware, routeware, routingware, RTware, shapeware, shuntware, shuntingware, speedware, speedupware, suiteware, switchware, switchingware, streamware, structware, transferware, transware, variware, varyware, warpware, xferware, xware
unfortunately “Morphware” is trademarked
send your proposal to: [email protected]

Compilation: Software vs.
Configware
[Figure: Software Engineering: source program (C, FORTRAN, MATLAB) → software compiler → software code. Configware Engineering: source „program“ → configware compiler (mapper with placement & routing, data scheduler) → configware code, flowware code]

Co-Compilation
[Figure: C, FORTRAN, MATLAB → Software / Configware Co-Compiler: automatic SW / CW partitioner → software compiler → software code; configware compiler (mapper, data scheduler) → configware code, flowware code]

Why use Reconfigurable Computing
instead of software? instead of spec. hardware?
Exploit spatial parallelism, and ..
… high bandwidth and low latency memory access
… and fine-grained parallelism when useful
Ride the technology curve avoiding specific silicon
Adapt to change: standards, trends, …..
Adapt to application / deployment requirements
Reduce risk

Computing Curricula 2004 (2)
CE missing

Computing Curricula 2004 (3)
2.2.1.

Computing Curricula 2004 (4)
2.2.1.
… how it should be: morphware and configware added
CONFIGWARE
MORPHWARE
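
Returning to the talk's branching example, S = R + (if C then A else B endif), the statement “the decision box becomes a (de)multiplexer” can be sketched in a few lines of Python. This is a behavioural illustration only, not the rDPU implementation: the function names `cpu_style`, `mux` and `pipe_network` and the sample inputs are hypothetical, the timing constants (5 memory cycles of 100 ns, a 5 ns clock at 200 MHz) are the ones quoted on the slide, and the assumption that the pipe network delivers one result per clock is mine.

```python
def cpu_style(r, a, b, c):
    """Instruction-stream view: a branch decides which operand is read."""
    if c:
        operand = a
    else:
        operand = b
    return r + operand


def mux(select, in1, in0):
    """2-to-1 multiplexer: both inputs are present, select picks one."""
    return in1 if select else in0


def pipe_network(stream):
    """Data-stream view: each (R, A, B, C) tuple flows through mux and adder."""
    for r, a, b, c in stream:
        yield r + mux(c, a, b)


# Hypothetical input tuples (R, A, B, C)
inputs = [(10, 1, 2, 1), (10, 1, 2, 0), (7, 5, 9, 1)]
results = list(pipe_network(inputs))

# Both views compute the same S for every input tuple
assert results == [cpu_style(*x) for x in inputs]

# Timing figures quoted on the slide
MEMORY_CYCLE_NS = 100      # one memory cycle of the CPU example
CPU_MEMORY_CYCLES = 5      # total memory cycles in the CPU example
CLOCK_NS = 5               # 200 MHz clock of the pipe network

cpu_ns = CPU_MEMORY_CYCLES * MEMORY_CYCLE_NS
# Assumption: the pipe network produces one result per clock cycle
speedup = cpu_ns / CLOCK_NS
print(results, cpu_ns, speedup)
```

Under that one-result-per-clock assumption the slide's numbers give 500 ns against 5 ns per result; the sketch makes no claim about real devices beyond this arithmetic.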