PASA, Frankfurt, March 16, 2006
Reiner Hartenstein, TU Kaiserslautern
From Organic Computing to Reconfigurable Computing
© 2005, [email protected]
http://hartenstein.de

Reconfigurable Computing (RC) and FPGA* in the media
• June 2005
• fastest growing segment of the semiconductor market: ~6 billion US-$ [Dataquest]
• design starts until 2010: from 80,000 to 110,000 [Dataquest]
• Google: 10 million hits
*) Field-Programmable Gate Array

The Pervasiveness of RC
[Chart: number of Google hits for searches of the form “FPGA and ….”; hit counts shown: 647,000; 1,490,000; 171,000; 194,000; 398,000; 1,620,000; 127,000; 113,000; 158,000; 162,000; 915,000; 272,000.]

>> Outline <<
• Reconfigurable Computing Paradox
• Von Neumann losing its dominance
• Software vs. Configware
• The dual paradigm approach
• Coarse-grained Reconfigurable Devices
• Conclusions
http://www.uni-kl.de

The RC Paradox
The awful technology of FPGAs:
• Effective integration density much worse than the Gordon Moore curve: by a factor of more than 10,000
• “very power-hungry” [Rick Kornfeld*]
• FPGAs run at lower clock frequencies, draw more power and are more expensive.
• Application development: until recently still logic design on a very strange platform
*) personal communication

fine-grained RC: low effective integration density
[Chart: transistors / microchip (10^0 to 10^9) versus year (1980 to 2010). Curves: FPGA physical, FPGA logical, FPGA routed. Overheads: wiring overhead, reconfigurability overhead, routing congestion. Resulting gap: > 10 000, “immense area inefficiency”.]

published speed-up factors
[Chart: relative performance (10^0 to 10^9) versus year (1980 to 2010). Application areas: DSP and wireless; image processing, pattern matching, multimedia; bioinformatics; astrophysics. Labelled points: Reed-Solomon decoding; real-time face detection; video-rate stereo vision; MAC; crypto; grid-based DRC (“fair comparison”); pattern recognition; SPIHT wavelet-based image compression; Viterbi decoding; Smith-Waterman pattern matching; molecular dynamics simulation; grid-based DRC (no FPGA: DPLA on MoM by TU-KL); FFT; protein identification; BLAST; Los Alamos traffic simulation; Lee routing (DPLA by TU-KL); 2-D FIR filter (no FPGA: DPLA by TU-KL); GRAPE; reference points 8080 and P4. Speed-up values shown: 2400; 6000; 730; 457; 1000; 400; 900; 288; 15000; 2000; 100 000; 88; 100; 52; 47; 40; 20; 160; 39,4. Source: http://xputers.informatik.uni-kl.de/faq-pages/fqa.html]

HeHon‘s Law
[Chart: MOPS / milliWatt (1 to 1000) versus feature size in µ (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07). Curves: FPGA, RISC.]

However ....
• People think that high performance must mean expensive
• Reducing the electricity bill by an order of magnitude
• Application migration [from supercomputer] resulting in a performance increase of up to 4 orders of magnitude
• Hits the memory wall from a different direction

why the RC paradigm shift is so important
• by Software
• by Configware
Move the stool or the grand piano?
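
To put rough numbers on the stool-versus-piano contrast, the following sketch reuses the figures from the branching example in the backup slides of this deck, S = R + (if C then A else B endif): 5 memory cycles of 100 ns each on a simple CPU, against a pipe network clocked at 200 MHz. The function and constant names are hypothetical, and the sketch assumes that a filled pipe network delivers one result per clock tick, which the slides suggest but do not state as a measurement.

```python
# Figures taken from the deck's branching example; names are illustrative only.
MEMORY_CYCLE_NS = 100    # duration of one memory cycle on the simple CPU
CPU_MEMORY_CYCLES = 5    # total memory cycles listed for the CPU version
RDPA_CLOCK_MHZ = 200     # clock of the rDPU pipe network


def cpu_time_ns(memory_cycles=CPU_MEMORY_CYCLES, cycle_ns=MEMORY_CYCLE_NS):
    """Time the instruction-stream version spends in memory cycles alone."""
    return memory_cycles * cycle_ns


def clock_period_ns(clock_mhz=RDPA_CLOCK_MHZ):
    """Clock period in nanoseconds for a clock given in MHz."""
    return 1000.0 / clock_mhz


def select_and_add(r, a, b, c):
    """Functional model of the pipe section: a selector feeding an adder.

    In the configware version both operands are already wired to the
    selector, so the decision costs no instruction fetch.
    """
    return r + (a if c else b)


def speedup_estimate():
    """Ratio of CPU memory-cycle time to one pipe clock tick.

    Assumption of this sketch: one result per tick once the pipe is filled.
    """
    return cpu_time_ns() / clock_period_ns()


if __name__ == "__main__":
    print("CPU:", cpu_time_ns(), "ns; pipe tick:", clock_period_ns(), "ns")
    print("estimated ratio:", speedup_estimate())
    print("S =", select_and_add(r=10, a=3, b=7, c=True))
```

The ratio only counts memory cycles on the CPU side and ignores pipeline fill time and data delivery on the other side, so it is an illustration of where the gap comes from rather than a benchmark.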

>> Outline <<
• Reconfigurable Computing Paradox
• Von Neumann losing its dominance
• Software vs. Configware
• The dual paradigm approach
• Coarse-grained Reconfigurable Devices
• Conclusions

vN paradigm losing its dominance
• Xilinx inside!
• Cray XD1

von Neumann is not the common model
• mainframe age: CPU only, the von Neumann instruction-stream-based machine (RAM memory, von Neumann bottleneck, DPU, program counter)
• microprocessor age: instruction-stream-based CPU (software) plus data-stream-based accelerator co-processors (hardware)
• vN paradigm dominance? The tail is wagging the dog.

Here is the common model
• mainframe age: CPU only, the von Neumann instruction-stream-based machine (RAM memory, von Neumann bottleneck, DPU, program counter)
• microprocessor age: instruction-stream-based CPU (software) plus data-stream-based accelerator co-processors (hardware)
• configware age: instruction-stream-based CPU (software) plus data-stream-based reconfigurable accelerator (morphware) and hardwired accelerator (hardware)

Here is the common model
• mainframe age: CPU only, the von Neumann instruction-stream-based machine (RAM memory, von Neumann bottleneck, DPU, program counter)
• microprocessor age: instruction-stream-based CPU (software) plus data-stream-based accelerator co-processors (hardware)
• configware age: instruction-stream-based CPU (software) plus data-stream-based reconfigurable accelerator (morphware), both served by a software/configware co-compiler

Fundamentally different mind set
• non-von-Neumann
• no instruction fetch at run time
• no program counter
• completely different OS principles
• it’s configware: definitely it is not software

>> Outline <<
• Reconfigurable Computing Paradox
• Von Neumann losing
its dominance
• Software vs. Configware
• The dual paradigm approach
• Coarse-grained Reconfigurable Devices
• Conclusions

Compilation: Software vs. Configware
• Software Engineering: source program (C, FORTRAN, MATLAB) → software compiler → software code
• Configware Engineering: source “program” → mapper (placement & routing) → configware compiler → configware code; data scheduler → flowware code

Nick Tredennick’s Paradigm Shifts explain the differences
• Software Engineering (CPU): resources fixed, algorithm variable; 1 programming source needed (software)
• Configware Engineering: resources variable, algorithm variable; 2 programming sources needed (configware, flowware)

Co-Compilation
• Source: C, FORTRAN, MATLAB
• Software / Configware Co-Compiler: automatic SW / CW partitioner, followed by
  • software compiler → software code
  • mapper and configware compiler → configware code
  • data scheduler → flowware code

Organic Computing ? Bio-inspired use of FPGAs
• evolvable “hardware” community:
  • crossover of chromosomes
  • in love with genetic algorithms: Darwinistic way to fitness through generations of populations
  • inefficient, but unexpected results possible
• simulated annealing (genetic morphing): fitness by synthesis, highly efficient

Software / Configware Co-Compilation: Juergen Becker’s CoDe-X, 1996
• C language source → Partitioner (with Analyzer / Profiler and Resource Parameters)
• SW compiler → SW code (“vN” machine paradigm)
• CW compiler → CW code and FW code (Kress/Kung machine paradigm)
• supporting different platforms

Co-Compiler for Hardwired Kress/Kung Machine [e. g.
Brodersen]
• Source → Software / Flowware Co-Compiler: automatic SW / CW partitioner, followed by
  • software compiler → software code
  • flowware compiler (data scheduler) → flowware code

>> Outline <<
• Reconfigurable Computing Paradox
• Von Neumann losing its dominance
• Software vs. Configware
• The dual paradigm approach
• Coarse-grained Reconfigurable Devices
• Conclusions

The dual paradigm approach
• Software Engineering: CPU, von Neumann paradigm
• Configware Engineering: ASM, Kress-Kung paradigm

H. T. Kung paradigm (systolic array)
[Diagram: a DPA (pipe network) surrounded by ASM blocks; input data streams and output data streams are drawn as port # versus time. Further labels: algebraic synthesis algorithms; RAM; GAG.]
• ASM: AutoSequencing Memory
• Data streams (flowware): implemented by distributed memory
• Flowware defines: which data item at which time at which port

DSP platform FPGA [courtesy Xilinx Corp.]
• 500 MHz PowerPC™ Processors (680 DMIPS) with Auxiliary Processor Unit
• 500 MHz multi-port Distributed 10 Mb SRAM
• 500 MHz DCM Digital Clock Management
• 500 MHz Flexible Soft Logic Architecture, 200K Logic Cells
• 0.6-11.1 Gbps Serial Transceivers
• 1 Gbps Differential I/O
• 500 MHz Programmable DSP Execution Units

Generalization of the systolic array .... [Rainer Kress]
remedy?
• discard algebraic synthesis methods
• use optimization algorithms instead, for example: simulated annealing
• the achievement: also non-linear and non-uniform pipes, and even wilder pipe structures, are possible
• now reconfigurability makes sense

>> Outline <<
• Reconfigurable Computing Paradox
• Von Neumann losing its dominance
• Software vs. Configware
• The dual paradigm approach
• Coarse-grained Reconfigurable Devices
• Conclusions

Coarse grain is about computing, not logic
• Example: mapping onto rDPA by DPSS, based on simulated annealing [Ulrich Nageldinger]
• SNN filter on KressArray (mainly a pipe network)
• array size: 10 x 16 = 160 rDPUs
• no CPU
• rDPU: reconfigurable function block, e. g. 32 bits wide
• Legend: rDPU not used; used for routing only (route-through only); operator and routing; port location marker; backbus connect

coarse-grained RC: high integration density
[Chart: “The Reconfigurable Computing Paradox”, transistors / microchip (10^0 to 10^9) versus year (1980 to 2010), with the FPGA routed curve and the > 10 000 gap.]

Claassen‘s Law + Hartenstein‘s Amendment
[Chart: MOPS / milliWatt (0.001 to 1000) versus feature size in µ (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07). Curves: DSP, (r)DPA.]

commercial rDPA example: PACT XPP - XPU128
• Full 32 or 24 Bit Design
• working silicon
• 2 Configuration Hierarchies
• Evaluation Board available, and
• XDS Development Tool with Simulator
[Diagram: XPP128 rDPA; labels: ALU, Ctrl, CFG, rDPU, PAE core; buses not shown. © PACT AG, http://pactcorp.com]

>> Outline <<
• Reconfigurable Computing Paradox
• Von Neumann losing its dominance
• Software vs.
Configware
• The dual paradigm approach
• Coarse-grained Reconfigurable Devices
• Conclusions

Conclusions
• An FPGA may be configured like a microprocessor to run C/C++ code.
• An FPGA can perform a specific algorithm at very high speed.
• RC is reducing cost without loss of performance and flexibility.
• RC is reducing the electricity bill and the required building floor area.
• Speed-up factors of up to 4 orders of magnitude have been reported.
• Compared to ASICs, prototyping time is on the order of hours rather than months, with a cost less than a tenth of that for an ASIC.
• Using a high-level language, the FPGA can be programmed for a wide variety of algorithms without any deep knowledge of the underlying architecture.
• The personal supercomputer is near.

Conclusions (2)
• We urgently need Reconfigurable Computing Education
• An update of CS curricula is overdue

END

thank you

The first archetype machine model
• Software Industry
• Software Industry’s Secret of Success: simple basic Machine Paradigm
• procedural personalization
• instruction-stream-based mind set: “von Neumann”
• compile or assemble
• main frame CPU
• personalization: RAM-based

An Archetype Common Model needed from the Configware Industry
• Progress stalled by the software/configware chasm
• Useful simple archetype not widely accepted
• Archetype common model should provide ....
  • Guidance for organizing efficient solutions
  • Make the project manageable
  • Allow sharing lessons between applications and between application areas

The 2nd archetype machine model
• Configware Industry
• Configware Industry’s Secret of Success: simple basic Machine Paradigm
• structural personalization
• data-stream-based mind set: “Kress-Kung”
• compile
• reconfigurable accelerator
• personalization: RAM-based

configware solution: computing in space
• for demo: a tiny section of the pipe network
• inter-rDPU-communication: no memory cycles needed
[Diagram: array of rDPUs; one rDPU carries the “+” operator and one delivers S.]

Compare it to software solution on CPU
S = R + (if C then A else B endif);
• on a very simple CPU, C = 1, clock 200
• steps: if C then read A; if not C then read B; add & store
• columns: memory cycles, nano seconds

section of a major pipe network on rDPU
hypothetical branching example to illustrate software-to-configware migration
S = R + (if C then A else B endif);
[Diagram: R, B and A feed a selector controlled by C = 1, followed by “+”, which yields S; clock 200 MHz (5 nanosec).]
simple conservative CPU example, C = 1:

instruction            step                        memory cycles   nano seconds
if C then read A       read instruction            1               100
                       instruction decoding
                       read operand*               1               100
                       operate & reg. transfers
if not C then read B   read instruction            1               100
                       instruction decoding
add & store            read instruction            1               100
                       instruction decoding
                       operate & reg. transfers
                       store result                1               100
total                                              5               500

*) if no intermediate storage in register file

The wrong mind set ....
S = R + (if C then A else B endif);
• section of a very large pipe network: R, B and A selected by C = 1, then “+”
• “but you can‘t implement decisions!”
• not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm

The hardware / software chasm
If I use the term "software", a variety of images might appear in the engineering audience's mind. Still we have "hardware" engineers and "software" engineers that go to different schools, attend different conferences, avoid each other's cocktail parties, and almost never play on the same volleyball teams at the company picnic. System designers begin to plan their creations around the skill sets and development processes of hardware engineers and software engineers. The two become oil and water.

Blurred line between hardware and software
The line between "hardware" and "software" is rapidly blurring and even becoming irrelevant from a system design perspective. As this happens, the traditional roles and skillsets of hardware and software engineers are being challenged, and a new generation of designers is emerging as a result.
• the obfuscation caused by the pervasiveness of softness.

We need Reconfigurable Computing Education
• There is an urgent need to cure severe qualification deficiencies of our graduates.
• We need a unification in dealing with problems which are shared across many different application domains
• We need new curricula in CS and CE for providing an integrating dual paradigm mind set instead of vN-only

Terminology clean-up
Programming sources:
• Configware: for configuring morphware (primarily non-von Neumann)
• Flowware: for scheduling data streams (primarily non-von Neumann)
• Software: for scheduling instruction streams (von Neumann)

Why coarse grain
• much more area-efficient: instead of rLB (~1 bit wide) use rDPU (e. g. 32 bits wide)
• much less reconfigurability overhead
• rDPU: reconfigurable Data Path Unit (e. g. rALU)
• much more MOPS/milliWatt: instead of FPGA use rDPA
• Reconfigurable Computing (RC): mind set close to classical computing background
[Diagram: array of 16 rDPUs.]

“data stream”: an ambiguous definition
• Reconfigurable Computing is not instruction-stream-based: it‘s data-stream-based
• it‘s different from the operation of the (indeterministic) “dataflow machine”
• other definition also from multimedia area
• usable definition from systolic array area

>> Outline <<
• Reconfigurable Devices
• Coarse-grained Reconfigurable Devices
• Data-stream-based Computing
• The contemporary Common Model
• Reconfigurable Supercomputing
• Conclusions

Why the speed-up ...
• ... although the FPGA clock is slower by a factor of 3 or even more
• (most know-how from “high level synthesis” discipline)
• decisions without memory cycles or clock cycles
• most “data fetch” without memory cycle

data moved around by software
i.e.
by memory-cycle-hungry instruction streams which fully hit the memory wall
• P&R: move locality of operation, not data!
[Figure: stolen from Bob Colwell]

Replace Caches by ...
• … by 16 x 16 reconfigurable data path array (rDPA) which fits on the same chip
[Figure: caches; stolen from Bob Colwell]

Similarly skilled
• With hardware description languages, hardware engineers had to adopt the methodologies and techniques of software engineers.
• Increased softness has an impact on even our products themselves.
• The required skills for your respective jobs are converging (against the grain in an age of increased specialization) and you'll soon be working with (and competing against) a new generation of embedded engineers that are similarly skilled in both disciplines.

Using FPGAs
• Field-programmability distinguishes FPGAs from ASICs.
• Reducing cost without loss of performance and flexibility.
• An FPGA may be configured like a general flexible micro-processor executing conventional C/C++ code, and as a highly specific ...
• An FPGA can perform a specific algorithm at very high speed.
• Compared to ASICs, prototyping time is on the order of hours rather than months, with a cost less than a tenth of that for an ASIC.
• Using a high-level language, the FPGA can be programmed for a wide variety of algorithms without any deep knowledge of the underlying architecture.

Co-Compiler Enabling Technology
• is available from academia
• only a small team needed for commercial re-implementation
• on the road map to the Personal Supercomputer

Conclusions (1)
• RC suffers from fragmentation into different cultures of the many application domains.
• We need a unification in dealing with problems which are shared across many different application domains.
• CS is the only domain qualified for such an effort

Conclusions (2)
• IEEE Computer Society should advocate improving application development methodologies and a common educational approach useful for the wide variety of application domains
• inside IEEE Computer Society, a TC on RC should lobby for more

Conclusions (3)
• make CS more fascinating
• reverse the downtrend in CS enrolment
• educate not only students …
• increase membership
• Strategic issue for entire IEEE Computer Society

Conclusions (4)
• The personal supercomputer is near, not only for the desktop, but also for a new road map to large-scale supercomputing at performance levels unthinkable up to now.
• IEEE-CS is needed as a translator to explain the impact to managers and to a wide public.
• IEEE-CS should accept this fascinating challenge by spearheading the paradigm shift.

RC education
• last week at Karlsruhe
• 35 submissions from Australia, Brazil, India, USA, and throughout Europe
• Attendees declared themselves ready to work for a task force
• But education is just one of several facets ……

However ....
“What did you say again that your company does?” My father posed the question. “Gate arrays,” I replied. “They’re chips used to…” “Oh yes, that’s right, Gatorade.” ….. “I used to give that to my marching band members so they wouldn’t get dehydrated on hot days.
Don’t remember it coming in chip form …..”

Explain to your grandmother what it means if you’re one of the world’s leading experts on optical proximity correction (OPC) for nanometer-scale semiconductor lithography? Could you perhaps relate it to some difficulty she has with needlepoint and her cataracts?

Even those with a scientific or technical background often won’t understand precisely what we do. A PhD in molecular biology won’t help to understand VHDL and Verilog synthesis for FPGAs. Trying to relate DNA sequences to LUT truth tables might offer a starting point, but somebody has to be able to bridge the technology and terminology gap, even to initiate that analogy.

Try explaining FPGAs with the consumer electronics approach. “People tend to relate when you tell them what your part goes into. Today, finally, ‘chip’ seems universally understood. I never get people asking about potato chips anymore.”

However ....
Abstract. Google’s jaw-dropping hit rates illustrate the pervasiveness of Reconfigurable Computing (RC), mainstream in embedded systems already for years, and now being adopted by supercomputing (Cray, sgi, etc.). From FPGA usage as accelerators, speed-up factors of up to two orders of magnitude are reported, as well as floor space requirements and electricity bills reduced by one order of magnitude. About 3 orders of magnitude and more are obtained by using coarse-grained reconfigurable datapath arrays (rDPAs) available from a number of start-ups. This is astonishing, since FPGAs and rDPAs have a substantially lower clock speed than microprocessors. Algorithmic cleverness is the secret of success, based on software-to-configware migration mechanisms, moving away from memory-cycle-hungry instruction-stream-based computing paradigms. The main benefit of RC platforms, which have replaced the use of hardwired accelerators, is their flexibility by non-procedural programmability.
This also contributes to those concepts of Organic Computing which rely on processes of evolution, self-organization, adaptation and fault tolerance. The main hurdles on the way to heart-stopping new horizons of cheap highest performance are CS-related educational deficits causing the configware / software chasm and a methodology fragmentation between the different cultures of application domains. Current CS curricula do not sufficiently meet their transdisciplinary responsibility. The talk gives a survey on fundamental issues in RC and on new directions in CS-related curricula, focused on a dual paradigm organic computing approach.

However ....
• “Saves more than $10,000 in electricity bills per year (7¢ / kWh) - .... per 64-processor 19" rack” [Herb Riley, R. Associates]
• Reducing the electricity bill by an order of magnitude
• Application migration [from supercomputer] resulting in a performance increase of up to 4 orders of magnitude
• Hits the memory wall from a different direction

Conclusions
• IEEE Computer Society should advocate introducing a dual paradigm approach, away from the monopoly of the vN mind set
• IEEE Computer Society should advocate a common model useful for the wide variety of application domains

Conclusions
• RC suffers from fragmentation into different cultures of the many application domains.
• Each domain uses its own trick box.
• We should teach the world to think outside the box
• We need a unification in dealing with problems which are shared across many different application domains.
• CS is the only domain qualified for this unification

An Archetype Common Model needed from the Configware Industry
IEEE Computer Society should advocate introducing a dual paradigm transdisciplinary education:
• by using Configware Engineering as the counterpart of Software Engineering
• by new curricula in CS and CE for providing an integrating dual paradigm mind set
• supporting a unification in dealing with problems which are shared across many different application domains
• to cure severe qualification deficiencies of our graduates.