
Speed-ups obtained by Reconfigurable Computing
Reiner Hartenstein
Invited talk, CAPES/DFG Cooperation on Reconfigurable Computing, Sept 19, 2008, Dept of Mechanical Engineering, Universidade de Brasilia (slightly modified version)
© 2008, [email protected] http://hartenstein.de

Outline
• Introduction
• Manycore Crisis & von Neumann Syndrome
• The Impact of Reconfigurable Computing
• Programmer education: new roadmap needed
• Conclusions

5 key issues
• climate change faster than predicted: by carbon emission, primarily from power plants?
• very high and growing computer energy cost, and a growing number of power plants needed
• the manycore programming crisis stalls progress (end of the free ride on the Gordon Moore curve)
• technologically stalled Moore's Law*
• Reconfigurable Computing is a promising alternative
*) Tom Williams (keynote): the 20 nm wall
[Nick Tredennick (Gilder), 2003]; 2008: 65, 45, 32 nm

History of data processing
The first reconfigurable computer
• prototyped: 1884, Herman Hollerith
• datastream-based DPU
• 1st Xilinx FPGA 100 years later

Configware programming
• no instruction streams
• manually (configuration) or by swapping a pre-wired board (reconfiguration)
• 60 years later: RAM available (ferrite cores), motivating the von Neumann paradigm (J. v. N., 1946)

Fine-grained reconfigurable: Field-Programmable Gate Array (FPGA)
CLB: Configurable Logic Box
[Figure: Xilinx old "island architecture": an array of CLBs with switch boxes and connect boxes. Switch boxes form a wire; connect boxes connect it to a CLB (example route between CLB A and CLB B).]

RAM-based
• this switch box has hidden RAM: 150 transistors & 150 flipflops (FF)
• configware code is loaded before run time into the switch box "hidden RAM"
• patches even at the customer's desk
• FPGAs: mainstream for more than a decade
[Figure: switch box whose flipflops (FF) are part of the "hidden RAM"]

Coarse-grained Reconfigurable Array
CLB CFB !
Conditional swap example (parallelization of the bubble sort algorithm): if X > Y then swap;
[Figure: conditional swap unit with inputs Xi, Yi and outputs Xo, Yo, built from a comparator (>) and multiplexers (0/1). Legend: rout thru only; rout thru and function (multiplexer).]
swap turned into a wiring pattern

Another coarse-grained r-Array
SNN filter on supersystolic array: mainly a pipe network
CFB !
(legend: rout thru only)
no CPU
rDPU: reconfigurable Data Path Unit, 32 bits wide
array size: 10 x 16 (99% placement efficiency)
[Figure legend: rDPU not used; used for routing only; operator and routing; port location marker; backbus connect; backbus connect not used]
by KressArray Xplorer [Ulrich Nageldinger]; CoDe-X inside [Jürgen Becker]

Platform FPGA
• configware code input
• 8 – 32 fast serial I/O channels
• 256 – 1704 BGA
• DPUs
• 56 – 424 fast on-chip block RAMs (BRAMs)
[courtesy Lattice Semiconductor]

Reconfigurable Supercomputing
Supercomputing 2007, Reno, Nevada, USA: 9600 registered attendees, 440 exhibitors
• Silicon Graphics: Reconfigurable Application-Specific Computing (RASC™)
• Cray XD1: Xilinx Virtex-II Pro; library by Cray
• Chuck Thacker … (even Microsoft is working on it; lab in Cambridge, UK, etc.)

What does Configware mean?
Software to Configware migration
• time domain: Software Source, Software Compiler, Software Code (instruction-procedural)
• space domain: Configware Source, mapper (Placement & Routing), Configware Code (structural: space domain)
• data scheduler: Flowware Code (data-procedural)

Outline: The Manycore Crisis & the von Neumann Syndrome

Many-core: Break-through or Breakdown?
• Industry is facing a disruptive turning point
• "could reset µP HW & SW roadmaps for next 30 years" [David Patterson]
• Intel's vision: MultiCore
• forcing a historic transition to a parallel programming model yet to be invented [David Callahan]
• HPC users lack understanding in basic precepts*
• it's an education, qualification, and R&D problem
The stakes are high ...
"I would be panicked if I were in industry" [John Hennessy]
*) PRACE consortium (Partnership foR Advanced Computing in Europe) http://www.prace-project.eu/documents/D3.3.1_document_final.pdf

Declining Programmer Productivity
• The Law of More: programmer productivity declines disproportionately with increasing parallelism
• In particular HPC application domains, massive parallelism requires 10 – 30 professionals in multi-disciplinary, multi-institutional teams for 5 – 10 years [Douglass Post, DoD HPCMP, panelist at SC07]
• Software done: machine obsolete

The von Neumann Syndrome

Massive Overhead Phenomena
• overhead piling up to code sizes of astronomic dimensions
• von Neumann machine (single-core CPU): every kind of overhead is executed by an instruction stream:
  instruction fetch
  state address computation
  data address computation
  data meet PU + other overhead
  i/o to / from off-chip RAM
• 2006: C.V. "RAM" Ramamoorthy: "von Neumann Syndrome"
• 1986, E.I.S. project: 94% for address computation; total speed-up: x 15000
• 2008, David Callahan: "a terrifying number of processes running in parallel, create sequential-processing bottlenecks and losses in data locality"
• Dijkstra, 1968: The Goto considered harmful
• Koch et al., 1975: The universal Bus considered harmful
• Backus, 1978: Can programming be liberated from the von Neumann style?
• Arvind et al., 1983: A critique of Multiprocessing the von Neumann Style

Manycore von Neumann: arrays of massive overhead phenomena
• fast on-chip memory cannot store such huge instruction code blocks
• von Neumann machine (manycore, an array of CPUs): every kind of overhead is executed by an instruction stream:
  instruction fetch
  state address computation
  data address computation
  data meet PU + other overhead
  i/o to / from off-chip RAM
  inter-PU communication
  message passing overhead
  transactional memory overhead
  multithreading overhead etc.
(legend: proportionate to the number of processors; disproportionate to the number of processors)

Outline: The Impact of Reconfigurable Computing

Speed-up factors obtained by Software to Configware migration
[Figure: speed-up factors on a log scale (10^0 to 10^6), ranging from 20 to 28500, grouped by application area: image processing, pattern matching, multimedia; DSP and wireless; crypto; bioinformatics; astrophysics. Applications shown: DES breaking, real-time face detection, Reed-Solomon decoding, MAC, video-rate stereo vision, pattern recognition, SPIHT wavelet-based image compression, protein identification, FFT, BLAST, Viterbi decoding, Smith-Waterman pattern matching, molecular dynamics simulation, GRAPE.]

Accelerator card from Bruchsal
• 16 FPGAs
• 1.5 TeraMAC/s
• I/O bandwidth: 50 GByte/s
• Manufacturer: SIEMENS Bruchsal
MAC means Multiply and ACcumulate; Tera means 10^12 or 1 000 000 000 000 (1 trillion)

Energy saving factors obtained by software
to configware migration
• energy saving: almost x10 less than speed-up … could be improved
[Figure: the speed-up chart (log scale, 10^0 to 10^6) shown again for energy saving, with the same application areas and applications: DES breaking, real-time face detection, Reed-Solomon decoding, MAC, crypto, video-rate stereo vision, pattern recognition, SPIHT wavelet-based image compression, protein identification, FFT, BLAST, Viterbi decoding, Smith-Waterman pattern matching, molecular dynamics simulation, GRAPE.]

von Neumann overhead vs. Reconfigurable Computing
[Figure: array of CPUs (von Neumann machine) next to an array of rDPUs; rDPA: reconfigurable datapath array (coarse-grained rec.)]
overhead                       | von Neumann machine | anti machine
instruction fetch              | instruction stream  | none*
state address computation      | instruction stream  | none*
data address computation       | instruction stream  | none*
data meet PU + other overhead  | instruction stream  | none*
i/o to / from off-chip RAM     | instruction stream  | none*
inter-PU communication         | instruction stream  | none*
message passing overhead       | instruction stream  | none*
transactional memory overhead  | instruction stream  | none*
multithreading overhead etc.   | instruction stream  | none*

Data meet the processor (CPU): illustrating the von Neumann syndrome
• inefficient transport over off-chip memory by memory-cycle-hungry instruction streams (by Software)
• this is just one of many von Neumann overhead phenomena

Data meet the CPU: illustrating acceleration
• placement of the execution locality (not moving data) within the pipe network: generated by the Configware compiler*
• data streams: by Flowware
*) before run time (at compile time)

What did we learn?
There are 2 kinds of datastreams:
1) indirectly moved by an instruction stream machine (von Neumann): extremely inefficient
2) directly moved by a datastream machine (from Reconfigurable Computing): very efficient
"Dataflow machine" would be a nice term, but it was introduced by a different scene*
*) meanwhile dead: not really a dataflow machine, but had used compilers accepting a dataflow language

What else did we learn?
There are 2 kinds of parallelism:
1) concurrent processes: instruction stream parallelism (CPU manycores): inefficient
2) data parallelism by parallel datastreams (in Reconfigurable Computing systems): efficient
Conclusion: data parallelism brings the performance (we do data processing!)

What Parallelism? [Hartenstein's watering can model]
• data parallelism (array of rDPUs): no von Neumann bottleneck
• instruction parallelism (array of CPUs): many von Neumann bottlenecks

Put old ideas into practice (POIIP)
• "We need a complete re-definition of CS" [Burton Smith and other celebrities]
• Wrong! I do not agree, finding out that ... [Reiner Hartenstein]
• ... "The biggest payoff will come from putting old ideas into practice and teaching people how to apply them properly." [David Parnas]
• "We need a complete re-definition of curriculum recommendations - missing several key issues." [Reiner Hartenstein]

Outline: Programmer education: new roadmap needed

Fighting against obsolete curricula? The Embedded Systems Approach?
… support their own educational approach:
• Graduate Curriculum on Embedded Software and Systems (EU)
• Advanced Real Time Systems
• Real-Time Systems (Sweden)
• Recommendations for Designing new ICT Curricula
• Chess: Center for Hybrid and Embedded Software Systems (courses in embedded systems)
• WESE: Workshop on Embedded Systems Education
"You can always teach programming to a hardware guy ... but you can never teach hardware to a programmer"
It's not the programmer's fault: it's due to obsolete CS curricula

CS is a Monster
Fully wrong educational mainstream approaches:
1) the basic mind set is exclusively instruction-stream-oriented; data streams are considered exotic
2) mapping parallelism into the time domain: abstracting away the space domain is fatal
We need a dual-rail education

We need to POIIP for Software to Hardware Migration and Software to Configware Migration
2 key rules of thumb, terrifically simple:
1) loop turns into pipeline [1979]
2) decision box turns into demultiplexer [1967]: PvOIIP

Two Dichotomies
Dichotomy = mutual allocation to two opposed domains such that a third domain is excluded.
The dichotomy model serves as an educational orientation guide for dual-rail education, to overcome the software/configware chasm & the software/hardware chasm.
1) Machine Paradigm Dichotomy (von Neumann / Dataflow machine*): the "Twin Paradigm" model
2) Relativity Dichotomy: time domain / space domain; helps parallelization by time to space mapping
*) see definition

Def.: Dataflow Machine
• The old "Dataflow Machine" research scene is dead.
• sequential execution: not really a dataflow machine
• indeterministic: unpredictable order of execution
• had used compilers accepting a dataflow language
• we re-define this term: counterpart of von Neumann: deterministic, with
data counters (no program counter)

1) Paradigm Dichotomy (procedural dichotomy): The Twin Paradigm Approach (TTPA)
• instruction domain: CPU, program counter
• datastream domain: (r)DPA, data counter(s)
• we need data parallelism

Data Machine: from old stuff [1979 - ...]
• data streams [Kung et al. 1979]: systolic array
• new is only its generalization [1989]: super systolic
• ASM: Auto-Sequencing Memory
[Figure: an (r)DPA surrounded by ASM blocks (RAM with data counter, GAG) that send and receive data streams; year labels in the figure: 1990, 1995]

Procedural Languages Twins
imperative Software Languages (program counter) | super systolic Flowware Languages (data counter(s))
read next instruction                            | read next data item
goto (instruction address)                       | goto (data address)
jump to (instruction address)                    | jump to (data address)
instruction loop                                 | data loop
instruction loop nesting                         | data loop nesting
instruction loop escape                          | data loop escape
instruction stream branching                     | data stream branching
no: no internally parallel loops                 | yes: internally parallel loops
But there is the asymmetry: for data parallelism

Relativity Dichotomy (time / space)
• time domain: procedure domain. von Neumann machine, 2 phases: 1) programming instruction streams 2) run time
• space domain: structure domain. Anti machine, 3 phases: 1) reconfiguration of structures 2) programming data streams 3) run time

time-iterative to space-iterative
• a time to space mapping: n time steps, 1 CPU turns into 1 time step, n DPUs (n = length of pipeline)
• often the space dimension is limited (e.g. because of the chip size)
• a time to space/time mapping: k*n time steps, 1 CPU turns into k time steps, n DPUs
• loop transformation methodology: 70ies and later; strip mining [D. Loveman, J-ACM, 1977]

Outline: Conclusions

Conclusions (1)
• We massively need programmable accelerator co-processors
• Established technologies are available, and we can still use standard software and its tools
• We need a massive migration of Software to Configware
• to cope with the implementation wall
• to cope with the programmer population's unsustainable skills mismatches
• Configware skills and basic hardware knowledge are essential qualifications for programmers

Conclusions (2)
CS education is a monster!
• Fully wrong educational mainstream approaches
• Jaw-dropping sclerosis of curriculum taskforces
• We need a complete re-definition of CS education
• We urgently need Dual-Rail Education
• CS should learn a lot from Embedded Systems, like in Mechanical Engineering

thank you for your patience

END

backup for discussion:

time to space mapping
• time domain (procedure domain): time algorithm: program loop: n time steps, 1 CPU
  Bubble Sort: n x k time steps, 1 "conditional swap" unit
• space domain (structure domain): space algorithm: pipeline: 1 time step, n DPUs
  Shuffle Sort: k time steps, n "conditional swap" units (space/time algorithm)
[Figure: one conditional swap unit (inputs x, y) next to a column of conditional swap units]

Architecture instead of synchronisation: example "Shuffle Sort"
• direct time to space mapping: accessing conflicts
• modification with shuffle function: better architecture instead of complex synchronisation: half the number of blocks + up and down of data (shuffle function); no von Neumann syndrome!
[Figure: arrays of conditional swap units for both variants]

Transformations since the 70ies
• loop transformations: rich methodology published [survey: Diss. Karin Schmidt, 1994, Shaker Verlag]
• time domain (procedure domain): program loop: n x k time steps, 1 CPU (time algorithm)
• Strip Mining Transformation
• space domain (structure domain): pipeline: k time steps, n DPUs (space/time algorithm)

Education revolution: the microelectronics design revolution
• traditional division of labour: fragmentation. Between application, RT level, logic level, switching level, circuit level, layout level and in-house technology, work passes by submission and rejection; breadth of specialization
• the new M&C division of labour (Carver Mead, Lynn Conway [1980]; in Germany: the E.I.S. project): clearing out & intuitive models, to resolve the education dilemma; coherence: the tall thin man; Silicon Foundry (external technology); breadth of specialization strongly reduced; emphasis on "Systems"

Education revolution: the Reconfigurable Computing revolution
• The new Mead & Conway? Christophe Bobda
• application level; program level
• Dichotomy: von Neumann paradigm (instruction-stream-based) and anti machine paradigm (datastream-based): the Twin Paradigm
• clearing out; the tall thin man*
*) or "tall thin woman"

Who generates the data streams?
Without a sequencer it's not a machine!
[Figure: data streams (rows of x items) entering and leaving an array]

The Anti Machine
• ASM: Auto-Sequencing Memory (RAM with data counter, GAG)
• several data counters instead of a program counter
• the data counter: placed in memory** (not with the datapath***)
• (r)DPA*: Supersystolic Array (Kress Array)
• data streams [Kung et al. 1979]
• programmed by Flowware
*) especially coarse-grained; for instance: platform FPGA
**) normally on-chip
***) not like with a CPU

Mission of this talk
• software 2 hardware mapping (and software 2 configware mapping) means time to space migration (and von Neumann 2 anti machine migration)
• We need time to space migration
• since infinite space is not available, we often need partial time 2 space migration

Morphware: old stuff
structural programming (non-von-Neumann)
• 1971: PROMs for small logic
• 1975: PLA
• 1978: PAL with PALASM tool
• 1984: first Xilinx FPGA
• meanwhile mainstream …

POIIP: loop turns into pipeline [1979]
• loop: memory, CPU, loop body
• (reconfigurable) DataPath Unit (rDPU): loop body
• pipeline: rDPU, rDPU, rDPU, rDPU

super-systolic array (recall this example!)
• far beyond just uniform linear pipes
• supporting any complex free-form pipe networks
[Figure legend: rout thru only; rDPU not used; used for routing only; operator and routing; port location marker; backbus connect; backbus connect not used]
by KressArray Xplorer [Ulrich Nageldinger]; CoDe-X inside [Jürgen Becker]

decision box turns into demultiplexer: PvOIIP [1967]
[Figure: a decision box (CONDITION, branches B0 and B1) next to a demultiplexer (ENABLE, CONDITION as select 0/1, outputs B0 and B1)]
• W. A. Clark: 1967 SJCC, AFIPS Conf. Proc.
• C. G. Bell et al.: IEEE Trans-C21/5, May 1972
• RTM available as a DEC product: 1973
• [~1971] (introducing HDLs): "That's so simple! Why did it take 30 years to find out?"

von Neumann overhead: an example
• von Neumann machine (single CPU): every kind of overhead is executed by an instruction stream:
  instruction fetch
  state address computation
  data address computation
  data meet PU + other overhead
  i/o to / from off-chip RAM
• PISA DRC accelerator [ICCAD 1984], built from rDPUs
• reconfigurable address generator (GAG): ~20x speed-up
• (entire project: 15000x speed-up)

Data Machine: from old stuff [1979 - ...]
• data streams [Kung et al. 1979]: systolic array
• new is only its generalization [1989]: super systolic
• ASM: Auto-Sequencing Memory
[Figure: an (r)DPA surrounded by ASM blocks (RAM with data counter(s), GAG) that send and receive data streams; year labels in the figure: 1990, 1995]
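
The two rules of thumb (decision box turns into a multiplexer/demultiplexer, loop turns into pipeline) can be illustrated with the deck's conditional swap example. What follows is only a minimal sketch in Python, not code from the talk: the function names `conditional_swap`, `sort_time_domain` and `sort_space_domain` are hypothetical, and the pipe is simulated stage by stage, whereas on an rDPA all stages would work concurrently. It is a plain linear pipe of conditional swap units, not the "Shuffle Sort" architecture of the backup slides.

```python
def conditional_swap(x, y):
    """Decision box 'if x > y then swap' as a comparator plus two multiplexers."""
    select = x > y                 # comparator output drives both multiplexers
    x_out = y if select else x     # multiplexer for the X output
    y_out = x if select else y     # multiplexer for the Y output
    return x_out, y_out


def sort_time_domain(values):
    """Time domain: one conditional swap unit reused inside nested program loops."""
    data = list(values)
    for end in range(len(data) - 1, 0, -1):
        for i in range(end):
            data[i], data[i + 1] = conditional_swap(data[i], data[i + 1])
    return data


def sort_space_domain(stream):
    """Space domain: a linear pipe of conditional swap stages, one register each.

    Each stage keeps the smaller value and forwards the larger one to its
    right-hand neighbour. The stages are simulated one after another here
    (an assumption of this sketch); in hardware they would run concurrently.
    """
    items = list(stream)
    registers = [None] * len(items)        # one register per stage
    for item in items:                     # the data stream enters at stage 0
        moving = item
        for stage in range(len(registers)):
            if registers[stage] is None:   # empty stage: store the value, stop
                registers[stage] = moving
                break
            registers[stage], moving = conditional_swap(registers[stage], moving)
    return registers


if __name__ == "__main__":
    sample = [5, 2, 9, 1, 7]
    print(sort_time_domain(sample))
    print(sort_space_domain(sample))
```

Both functions use the same conditional swap unit; the difference is whether it is reused over time (program loop) or replicated in space (one unit per pipe stage).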

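
The time to space/time mapping (strip mining) from the slides can be sketched by counting time steps. This is an illustrative model under stated assumptions, not the author's tooling: `run_on_one_cpu` and `run_strip_mined` are hypothetical names, the n DPUs are treated as lanes that each finish one item per time step, and pipeline fill latency is ignored.

```python
def run_on_one_cpu(data, loop_body):
    """Time domain: one CPU executes the loop body once per time step."""
    results = [loop_body(x) for x in data]
    return results, len(data)              # k*n items take k*n time steps


def run_strip_mined(data, loop_body, n_dpus):
    """Space/time domain: strips of n_dpus items, one strip per time step."""
    results = []
    time_steps = 0
    for start in range(0, len(data), n_dpus):
        strip = data[start:start + n_dpus]
        # All DPUs handle their item of the strip in the same time step;
        # this sketch simulates them one after another.
        results.extend(loop_body(x) for x in strip)
        time_steps += 1
    return results, time_steps


if __name__ == "__main__":
    k, n = 3, 4
    data = list(range(k * n))
    square = lambda x: x * x
    print(run_on_one_cpu(data, square)[1])       # k*n time steps on 1 CPU
    print(run_strip_mined(data, square, n)[1])   # k time steps on n DPUs
```

With k = 3 and n = 4 the step counts are 12 and 3, matching the slide's "k*n time steps, 1 CPU" versus "k time steps, n DPUs"; the results themselves are identical in both cases.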