Many-cores Make Light Work
Nigel Topham
PPar Conference & Industrial Engagement Event, June 2015

Overview
Parallelism can, and must, become pervasive to solve future silicon scaling problems!
– What drives us towards pervasive parallelism?
– Can embedded applications really benefit from many-core architectures?
– What does a low-power many-core embedded system look like?
– What does the future hold for embedded many-core systems?

Embedded systems in perspective
– More CPUs are shipped in embedded systems than in PCs and servers combined
– The embedded market has the highest growth rate (PCs diminishing)
– Performance expectations of embedded systems continue to grow: smart-phones / tablets, embedded vision, automotive (Google car), medical robotics, Internet of Things, ...

Motivation for many-core embedded systems
Imagine an embedded application requiring 20 billion operations per second to meet real-time performance targets. What type of processor is most appropriate?
– a standard x86 dual-core PC, running at 2.4 GHz?
– a collection of lightweight embedded cores, running at 300 MHz?
The answer depends on the power efficiency required, and on the ability to run the application in parallel. Throughput is simply cores x IPC x GHz; the extended EM6, for example, delivers 32 x 4 x 0.3 = 38.4 GOPS.

                x86 PC    Encore (EM6)   Extended EM6
  Cores / chip  2         32             32
  IPC / core    2         1              4
  MHz / core    2,400     300            300
  GOPS          8.64      9.60           38.40
  Power (W)     65.00     0.25           0.35
  GOPS / W      0.13      38.40          109.71

(Assumes EM6 cores implemented in a low-power 65nm CMOS process.)

PART I – Shaping the many-core landscape
Fundamentals and drivers...

Moore's Law is not finished (yet)
[Chart: technology node / gate length (nm) and gate density (4-input NAND gates / sq.mm) by year, 2013-2028. Data source: ITRS Report 2013]
– Moore's Law describes the rate at which transistor dimensions scale
– Scaling linear dimensions by 0.7 doubles density every 2 years
– ITRS 2013 defines a roadmap for transistor scaling up to 2028
– Gate densities will increase by 30x in this period
– BEOL processing is unlikely to keep up, so interconnect density may not scale with transistor density

Dennard scaling ended some time ago
"The power density on a chip remains constant as transistor sizes reduce from one generation to the next."
Each generation reduces linear dimensions by 30%:
– 2x transistor density (Moore's Law)
– switching time reduced to 70% of its previous value
– 40% increase in MHz
– 30% reduction in supply voltage
– 65% energy reduction per switching event
Hence: 65% energy reduction x 1.4x MHz increase → 50% power reduction; and 50% power reduction x 2x transistor density → constant power density.
Moore's Law + Dennard scaling gave a 1000x improvement in microprocessor speed over the past 20 years. What went wrong?
– Sub-threshold leakage current increases when Vth decreases
– We have reached the scaling limit of gate oxide thickness
– Net result: supply voltage scaling breaks down, and power density rises at each new generation
This does not halt Moore's Law, but it affects how we use the extra transistors!
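The per-generation arithmetic above can be written out explicitly. A short sketch of the classical Dennard argument, using the standard dynamic-power relation P = C V^2 f; the 0.7 scale factor is the ideal per-generation value quoted on the slide:

    % Ideal Dennard scaling: linear dimensions shrink by s = 0.7 per generation
    C' = sC, \qquad V' = sV, \qquad f' = f/s \approx 1.4f
    P' = C'V'^2 f' = (sC)(sV)^2(f/s) = s^2\,CV^2 f = 0.49\,P \approx 0.5\,P
    % Transistor density doubles (1/s^2 = 2), so power density is unchanged:
    P'/A' = 0.5\,P \,/\, 0.5\,A = P/A
    % Once V can no longer scale (leakage, gate-oxide limits), V' = V and:
    P' = (sC)\,V^2\,(f/s) = CV^2 f = P \;\Rightarrow\; \text{power density doubles per generation}

The last line is exactly the breakdown described above: transistors keep shrinking, but each halving of area now arrives with undiminished power.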
Pollack's Rule
"The increase in microprocessor performance due to microarchitecture is roughly proportional to the square root of the increase in complexity." [1][2]
ILP increases are due to:
– increased out-of-order processing
– an increase in the number of in-flight instructions
– an increase in the number of live register values
– the increased branch prediction accuracy required
Micro-architectural structures to manage increasing ILP tend to grow quadratically (or worse) with respect to the feature they support. Quadratic complexity growth → Pollack's Rule.

1. Fred Pollack, "Pollack's Rule of Thumb for Microprocessor Performance and Area", http://en.wikipedia.org/wiki/Pollacks_Rule (cites [2] as source)
2. Shekhar Borkar and Andrew A. Chien, "The Future of Microprocessors", Communications of the ACM 54(5), May 2011 (cites [1] as source)

The convergence of single-core limitations
– Dennard scaling broke down in ~2000
– Transistor density increase continues, and even accelerates!
– ILP for desktop/server CPUs reaches a limit imposed by Dennard scaling and Pollack's Rule
– Note the failure of architects to improve single-core ILP in this millennium

The power-density (many-core) wall
The many-core approach seeks to use many smaller, lower-power cores to avoid Pollack's Rule:
– achieve O(n) performance growth, instead of O(n^1/2)
– lots of parallel programming / compiler headaches (a.k.a. "research opportunities")
But it's not a solution: an ever-increasing proportion of future silicon will be Dark. Significant energy-efficiency gains must be found from somewhere. [Esmaeilzadeh et al., ISCA '11]
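The O(n) versus O(n^1/2) contrast is just Pollack's Rule applied twice. A worked comparison, for a transistor budget equal to n small cores of area A, assuming the workload is fully parallel:

    \text{Perf}_{\text{big}}(nA) \propto \sqrt{nA} = \sqrt{n}\,\sqrt{A}    % one large core
    \text{Perf}_{\text{many}} \propto n\,\sqrt{A}                          % n small cores
    \frac{\text{Perf}_{\text{many}}}{\text{Perf}_{\text{big}}} = \frac{n}{\sqrt{n}} = \sqrt{n},
    \qquad \text{e.g. } n = 32 \Rightarrow \sqrt{32} \approx 5.7\times

The advantage materializes only when the application parallelizes; a serial section sees just one small core.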
Low-power embedded processor design
– Avoid micro-architectural speculation
– Avoid superscalar, out-of-order micro-architectures
– Use aggressive, hierarchical clock-gating (25% of dynamic power is in the clock tree): state-dependent clock-gating + (explicit) module-level clock-gating
– Minimize memory accesses, including to on-chip caches
– Design CPU pipelines to allow the use of low-leakage libraries, especially RAMs (even if they are slower)
H/W complexity of synthesizable cores depends on synthesis constraints!
– Over-pipeline the core to relax the timing constraints per stage
– 3-stage and 10-stage implementations of the same ISA will have overlapping complexity and energy curves; it is better to choose a 10-stage pipeline when F > 500 MHz
– Gate count and energy consumption per instruction depend on the synthesis target – they are not constant!
– A 10-stage pipeline can allow 2 clock periods for L1 cache accesses; the L1 cache memories are then able to use high-density, low-leakage RAMs, even at maximum MHz
– A 3-stage pipeline's critical paths include many gates, with many opportunities for logic synthesis to replicate logic to meet timing constraints; it can be made to run at high speed, but efficiency drops dramatically

Overcoming power density limitations
– "Dark Silicon" and power gating: build heterogeneous systems, with functional specialization → big.LITTLE systems
– Rebalanced quantities of core logic and on-chip RAM: more memory, less logic → no increase in throughput!
– Spend less energy "moving data" and "synchronizing": improve on-chip locality → many cores with local data; avoid coherency costs in shared memory → programming complexity!
– Use lower-leakage CMOS processes: transistors are slower → many cores running at 100-500 MHz
– Over-pipelined cores: simplified logic complexity → many cores running at 100-500 MHz
– Reduce supply voltages towards the threshold voltage: big reductions in leakage power, but transistors are much slower → use thousands of cores at 10-50 MHz

PART II – A Parallel Embedded Application
Making light really work...
With Oscar Almer, Freddie Qu, Bjoern Franke, Mike O'Boyle, and in collaboration with Prof Harald Haas and his team in the School of Engineering and Electronics. [Image: courtesy H. Haas]

OWC coding, using OFDM
– First apply forward error correction (FEC), typically using a convolutional code
– The FEC output stream is partitioned into fixed-size frames
– Frame data is split across N orthogonal sub-channels, one symbol per channel
– Channel symbols are encoded using QAM (e.g. QAM-16 = 4 bits/symbol)
– OFDM is used to multiplex the digital data onto a time-varying signal
– The digital signal is clipped, scaled, converted to analogue, and drives an output LED
[H. Haas, "Li-Fi: High Speed Wireless Communications via Light Bulbs", 19 June 2013]

The OFDM pipeline – an overview
Transmission pipeline:
– the QAM mapper converts transmit data to channel symbols
– the serial symbol sequence is formed into a parallel symbol frame
– an Inverse Fast Fourier Transform (IFFT) converts sub-carriers to the time domain
– a cyclic prefix is inserted; the symbol is scaled, clipped, and sent to the D/A converter
– the signal is DC-biased and drives an output LED to create the optical channel (e.g. 10 MHz)
The receiver performs a similar process in reverse, using a forward FFT. OFDM frame processing can be overlapped, at both Tx and Rx. Stages can be pipelined, and each stage has very different requirements. The FFT / IFFT dominates the computational load.
[Diagram: Tx bits → QAM mapper → S/P → IFFT → cyclic prefix insertion → symbol scaling → clipping → P/S → D/A → biasing → optical channel x_l,k * h(k), with AWGN z_l,k; receiver: A/D → S/P → cyclic prefix removal → FFT → channel equalizer → P/S → QAM demapper → Rx bits]

FFT / IFFT computations
– Based on the recursive Cooley-Tukey algorithm
– An in-place algorithm reduces the memory space
– Each FFT butterfly computes (assuming j = i + (N >> k) was computed previously):

    w = twiddle_factor(i,k)     // load w from constant array
    tmp = x[i]                  // load x[i]
    x[i] = tmp + x[j]           // load x[j]; add; store x[i]
    x[j] = w * (tmp - x[j])     // subtract; multiply; store x[j]

– Values are fractional fixed-point, rather than floating-point, for speed and energy-efficiency
– Key optimization targets: maximize the butterfly computation rate; minimize energy consumption per butterfly; minimize the memory footprint
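A minimal C sketch of this butterfly over 1Q15 fixed-point complex values (the format named in the pipeline diagrams). The stage loop, twiddle indexing, and function names are illustrative assumptions; on the chip this kernel is implemented by the FFT processor extension rather than in C:

    #include <stdint.h>

    typedef struct { int16_t re, im; } cq15;      /* 1Q15 fractional complex */

    /* Q15 multiply with rounding: (a * b) >> 15 */
    static int16_t q15_mul(int16_t a, int16_t b) {
        return (int16_t)(((int32_t)a * b + (1 << 14)) >> 15);
    }

    static cq15 cq15_mul(cq15 a, cq15 b) {
        cq15 r;
        r.re = (int16_t)(q15_mul(a.re, b.re) - q15_mul(a.im, b.im));
        r.im = (int16_t)(q15_mul(a.re, b.im) + q15_mul(a.im, b.re));
        return r;
    }

    /* One pass of an in-place, decimation-in-frequency radix-2 FFT.
     * Stage k = 1 .. log2(N); each butterfly pairs x[i] with x[j], j = i + (N >> k).
     * tw[] holds N/2 twiddle factors W_N^0 .. W_N^{N/2-1}.
     * A real fixed-point FFT also scales each stage to avoid overflow. */
    static void fft_stage(cq15 *x, int N, int k, const cq15 *tw) {
        int span = N >> k;                             /* butterfly distance */
        for (int base = 0; base < N; base += 2 * span) {
            for (int i = base; i < base + span; i++) {
                int j = i + span;                      /* j = i + (N >> k)    */
                cq15 w = tw[(i - base) << (k - 1)];    /* twiddle_factor(i,k) */
                cq15 tmp = x[i];                       /* tmp = x[i]          */
                x[i].re = (int16_t)(tmp.re + x[j].re); /* x[i] = tmp + x[j]   */
                x[i].im = (int16_t)(tmp.im + x[j].im);
                cq15 d;                                /* d = tmp - x[j]      */
                d.re = (int16_t)(tmp.re - x[j].re);
                d.im = (int16_t)(tmp.im - x[j].im);
                x[j] = cq15_mul(w, d);                 /* x[j] = w * d        */
            }
        }
    }

    /* Full N-point FFT: run all log2(N) stages; output is in bit-reversed order. */
    static void fft(cq15 *x, int N, const cq15 *tw) {
        for (int k = 1; (N >> k) >= 1; k++)
            fft_stage(x, N, k, tw);
    }

The in-place update is what makes the multi-banked DCCM attractive: each butterfly touches exactly two array elements, so x[i] and x[j] can sit in different banks and be read and written concurrently.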
PART III – Design of an energy-efficient many-core system
The Pasta2 32-core chip

Li-Fi research chip
Background:
– Developed within the EPSRC project "Pasta2"
Goals of the chip:
– Research platform for adaptive visible-light communication using OFDM
– Provide sufficient computational capacity to allow for significant experimentation
– Deliver high performance through multi-core parallelism, using very small, lightweight cores, at low power consumption
– Proof of principle for a commercial Li-Fi communications product
Development timeline:
– Started system development in 2010-11
– Rev #1 test-chip taped out Nov 2013; chips functional, but required improvements
– Rev #2 taped out Nov 2014, funded by an ERI PoP award; received April 2015

Design Goals
– Heterogeneous clustered multi-core architecture: a mixture of specialized DSP cores and generic RISC cores, all derived from a common instruction-set architecture
– Custom processor extensions for FFT acceleration, with a focus on energy efficiency and cycle efficiency
– DMA engines with inbuilt transformations: replace energy-sapping software functions with lean H/W structures
– Primarily a platform for real-time software development for OFDM
– The architecture of the test-chip may form the basis of a future product: the design flow is based on standard industry practices; the processor core should be licensable and extensible; the key value-add is in the processor extension, the multi-core interconnect, and I/O
– Targets modest clock frequencies, using a low-power process; relies on multi-core parallelism to achieve the required throughput

Parallelism and streaming
[Diagram: Tx bits → convolutional encoder → puncture → QAM mapper → S/P → IFFT → 8Q24-to-1Q15 conversion → cyclic prefix insertion → symbol scaling → clipping → P/S → D/A → biasing → optical channel x_l,k * h(k), with AWGN z_l,k; receiver: A/D → S/P → cyclic prefix removal → FFT → 8Q24-to-1Q15 conversion → equalizer → P/S → QAM demapper → Viterbi decoder → de-puncture → Rx bits]
– Multiple embedded cores running in parallel: one OFDM symbol per core; one core per stage of the computation
– Embedded core enhanced with FFT acceleration: in-place FFT / IFFT, with data stored in multi-banked scratchpad RAM (DCCM); DMA input, (I)FFT, and DMA output all running at full speed from the DCCM
– Streaming DMA with on-the-fly FEC, QAM mapping and data conversion: the DMA can chain transforms within a single DMA operation, e.g. convolutional encoder → puncture → QAM mapper (a software sketch of such a chain appears below)
– Coordinating core + DMA to I/O channels: concurrent transfer to/from the PHY interfaces and optical transducers

OFDM system-on-chip architecture
Based on the Synopsys EM6 embedded processor core:
– 3.2 CoreMark/MHz, 1.9 DMIPS/MHz
– 20-30 kgates, highly configurable and extensible
– EM6 is derived from the Encore processor developed in the School of Informatics, and licensed to Synopsys
Two heterogeneous configurations of EM6 are used:
– P1: unmodified CPU core, with I-cache and D-cache
– P2: extended CPU core for FFT, with I-cache, D-cache and scratchpad RAM (DCCM)
Interconnect, on-chip memory, DMA and links:
– AXI-based interconnect, supporting multiple concurrent transactions
– 4 x 64KB fast SRAM for buffering OFDM frames and servicing the L1 caches
– 7 DMA devices, each supporting chained transformations
– 4 bidirectional 10-bit serial links (SerDes with ECC) connecting the chip to a supporting FPGA
[Diagram: cores 0-31 in clusters on the interconnect; 7 DMA engines with Viterbi accelerators; 4 x 64KB SRAM; clock, reset and JTAG; I/O links #1-#4, each a 10-bit SerDes with ECC]
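To make the chained DMA transform concrete, here is a minimal software model of one Tx-side chain: convolutional encoder → puncture → QAM-16 mapper. The code rate, the K=7 generators (0133/0171 octal), the puncturing pattern, and the Gray mapping are all illustrative assumptions; the slides specify only that the FEC is convolutional and that QAM-16 carries 4 bits/symbol. On the chip these steps run inside the DMA engines, not on a core:

    #include <stdint.h>
    #include <stddef.h>

    static uint8_t parity(uint32_t x) {              /* XOR-reduce a tap set */
        x ^= x >> 16; x ^= x >> 8; x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
        return (uint8_t)(x & 1);
    }

    /* Illustrative rate-1/2, K=7 encoder: the classic XOR-SHIFT kernel. */
    static size_t conv_encode(const uint8_t *in, size_t n, uint8_t *out) {
        uint32_t sr = 0;                             /* shift register */
        size_t m = 0;
        for (size_t i = 0; i < n; i++) {
            sr = (sr << 1) | (in[i] & 1);            /* SHIFT in one data bit  */
            out[m++] = parity(sr & 0133);            /* XOR taps, generator g0 */
            out[m++] = parity(sr & 0171);            /* XOR taps, generator g1 */
        }
        return m;
    }

    /* Illustrative puncturing to rate 2/3: drop every 4th coded bit. */
    static size_t puncture(const uint8_t *in, size_t n, uint8_t *out) {
        size_t m = 0;
        for (size_t i = 0; i < n; i++)
            if ((i & 3) != 3) out[m++] = in[i];
        return m;
    }

    /* QAM-16: 4 bits → one complex symbol, Gray-coded levels {-3,-1,1,3}. */
    typedef struct { int8_t i, q; } qam16;
    static size_t qam16_map(const uint8_t *bits, size_t n, qam16 *sym) {
        static const int8_t gray[4] = { -3, -1, 3, 1 };
        size_t m = 0;
        for (size_t k = 0; k + 3 < n; k += 4) {
            sym[m].i = gray[(bits[k]     << 1) | bits[k + 1]];
            sym[m].q = gray[(bits[k + 2] << 1) | bits[k + 3]];
            m++;
        }
        return m;
    }

Chaining the three steps inside one DMA pass is precisely what saves energy: the intermediate bit-streams are never stored to memory and re-loaded.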
System verification, synthesis and layout
– Pre-silicon testing of a 6-core test system in a Xilinx V6 FPGA
– Industry-standard design flow: Synopsys DC Ultra for topographic synthesis; Synopsys IC Compiler for place-and-route; Calibre DRC and LVS for physical verification; Synopsys StarRCXT for layout parasitic extraction (LPE); Synopsys PrimeTime for signoff timing analysis; VCS for post-layout gate-level simulation
– 7 CPU-days of computation for physical design: 6 hours to synthesize (4 CPUs); 30 hours to place and route (~40 CPU hours); 1 hour for LVS (8 CPUs); 6 hours for DRC (8 CPUs); 8 hours for LPE (8 CPUs); 2 hours for STA (1 CPU)
– 5 months to fabricate the chip
[Images: floor-planned die; placed gates]

Summary of silicon implementation
[Die plot: 32 cores in 8 clusters, with 4 x 64KB SRAM blocks (2-cycle access), on a 4mm x 4mm die]
– 32 cores in 8 clusters; 8 cores specialized for FFT
– 4 x 64KB SRAM; 7 DMA engines, with transforms
– 4 bi-directional serial links; JTAG debug interface
– 4x4 mm die; 50 million transistors (70% SRAM, 30% logic)
– 144-pin CPGA package
– 65nm low-leakage process: UMC 65LL; Faraday RAM macros, HVT cells; UMC cells and I/Os, SVT only
– 1.2V core, 2.5V I/O
– 3.2 mW standby power; 0.96 mW / MHz
– Worst-case Fmax = 235 MHz; typical Fmax = 400 MHz

Power density and IR drop across the die
– Supply current density @ 300 mW: mostly constant, with small islands around some RAMs
– IR drop less than 38 mV @ 300 mW: typical for 300 MHz operation, and acceptable up to ~900 MHz

Computational throughput analysis
Assume the device operates at 300 MHz...
Instruction throughput:
                   IPC     Number   Sub-total
    FFT cores      10.5    8        84
    Regular cores  1       24       24
    Aggregate IPC                   108  →  32.4 GOPS @ 300 MHz
Convolutional encoding throughput:
– based on XOR-SHIFT operations: 7 operations per bit, taking 2 cycles
– equivalent to 3.5 IPC per DMA x 3 concurrent DMA devices → 10.5 IPC → 3.15 GOPS
Viterbi decoding throughput:
– based on the ADD-COMPARE-SELECT kernel: 4 operations per decoded bit, taking 2 cycles
– equivalent to ~2 IPC per DMA x 3 concurrent DMA devices → 6 IPC → 1.8 GOPS
QAM(64) mapping / de-mapping throughput:
– 1 cycle per mapped output value, equivalent to ~6 instructions
– 6 IPC per DMA x 6 concurrent DMA devices → 36 IPC → 10.8 GOPS
System throughput is estimated to be equivalent to 48 GOPS (32.4 + 3.15 + 1.8 + 10.8 ≈ 48), i.e. to 160 unmodified single-issue cores running at 300 MHz. Getting 48 GOPS from high-end cores (2.0 IPC/core, 10 cores @ 2.4 GHz) would cost 40-50 Watts!

Li-Fi system architectures
[Diagram (a): system architecture for the experimental Li-Fi system – Pasta2 modem + FPGA + ADC / DAC + SDRAM]
[Diagram (b): system architecture for a production Li-Fi system – modem + ADC / DAC + DRAM I/F + SDRAM + Flash, with IEEE 1901 backhaul]
– Designed to be used in a multi-chip solution: modem chip; FPGA for I/O and the DRAM interface; ADC / DAC chips for the LED and sensor interfaces
– The FPGA allows for a flexible I/O design and avoids the need for DDR pads in the ASIC
– The FPGA also allows for future flexibility in interfacing the ADC / DAC to the rest of the system
– A future ASIC will integrate most FPGA functions, and possibly some ADC / DAC functions too
– Depending on memory requirements, the external SDRAM may be replaced by a Flash program memory
– The modem should be capable of supporting IEEE 1901 (HomePlug) for backhaul via the lighting power lines

Scaling: 65nm → 28nm → 16nm → 10nm
65nm is not leading-edge today. The same 32-core design scales as follows:
– 65 nm: 16 mm2
– 28 nm: 6.9 mm2
– 16 nm: 3.9 mm2
– 10 nm: 2.5 mm2
At 10nm, a 20 mm2 device could carry 2,048 cores.
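As a back-of-envelope cross-check, the implementation figures above are self-consistent; a sketch using only numbers quoted on the preceding slides:

    % Standby + dynamic power at the 300 MHz operating point:
    P \approx 3.2\,\text{mW} + 0.96\,\tfrac{\text{mW}}{\text{MHz}} \times 300\,\text{MHz} \approx 291\,\text{mW}
    % ... matching the 300 mW used in the current-density and IR-drop analysis.
    % Implied energy efficiency at the estimated system throughput:
    \frac{48\,\text{GOPS}}{\approx 0.3\,\text{W}} \approx 160\,\text{GOPS/W}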
Summary
The use of large-scale many-core architectures will allow individual cores to:
– operate at low frequency and low voltage
– exploit ultra-high-density logic/memory cells with ultra-low leakage
Embedded applications are often parallel, and can exploit heterogeneous architectures:
– baseband, computer vision, ...
Future silicon technologies will demand more energy-efficient computational systems:
– silicon doesn't have to be Dark, but it will be Dim

Thank you
Any Questions?