Exascale Computing
July 1, 2016
Sung Bae Park
ICR (ISAC CPU Research)
Revisions: 2016-06-29, 2016-06-30, 2016-07-01
Outline
• Trend
• Direction
• Future
Trend
IT Waves
• Contents & Service: Rich, High-Quality Ubiquitous Web & Real-Life VR
• Network & Device: Variety of Topology & Network Infra → 5G
[Figure: four technology waves plotted over 1980-2017]
- Multimedia: 1D → 2D → Color → FHD → 3D → Stereoscopic → Multi-View → Wide MV → Super MV → Hologram (VR)
- Software: Standalone → Web 1.0 (PC) → Web 2.0 (Mobile Web) → Web 3.0 (Smart Ubiquitous Web) → Web 4.0 (Smart Real-World Web), with TV, NotePad, IoT, and Big Data along the way
- Network: AMPS (1G, Analog) → GSM (2G) → CDMA/HSPA (3G) → OFDM/MIMO (4G Phone) → Post-OFDM/UPN*/AI (5G)
- Computing & Memory: Single Core → Multi-Core CPU → Multi CPU/DSP → GPGPU* → Massive Core / HPP*, plus 3D Printer and ADV
* UPN: ULP Personal Networking
* GPGPU: General-Purpose GPU
* HPP: Hybrid Parallel Processor
Exascale Computing
• More than Moore — Moore's law gives ~100x in 10 years (2x/18 months); exascale needs 1000x in 10 years (2x/12 months)
※ 1 GFLOPS ('00 CPU) → 1 TFLOPS ('10 GPU) → 1 PFLOPS ('20) → 1 EFLOPS ('30)
• Crisis in Power, Efficiency and SW
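The two doubling rates above differ more than they look; a quick check of the compounding over the slide's 10-year window:

```python
# Compound growth implied by a doubling period: factor = 2 ** (months / period)
def growth(months, doubling_period_months):
    return 2 ** (months / doubling_period_months)

decade = 120  # months
moore = growth(decade, 18)             # Moore's law: 2x every 18 months
more_than_moore = growth(decade, 12)   # exascale pace: 2x every 12 months

print(round(moore))            # 102, i.e. the slide's "100x in 10 years"
print(round(more_than_moore))  # 1024, i.e. "1000x for 10 years"
```

Shaving the doubling period from 18 to 12 months turns a 100x decade into a 1000x decade, which is exactly the gap between the historical GFLOPS→TFLOPS step and the targeted PFLOPS→EFLOPS step.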
[Figure: peak FLOPS vs. year, 1950-2020 — ENIAC → i4004 → NEC SX-3 → DEC 21264 → CELL → GTX580 → IBM Roadrunner → RIKEN K → China Sunway TaihuLight; eras labeled Supercomputer 1.0 (CPU) and Supercomputer 2.0 (GPU)]

Processor        | SW 26010        | Xeon E5-2600 | P100
Architecture     | CPU             | CPU          | GPU
Company          | Shanghai HPICDC | Intel        | Nvidia
Year             | 2016            | 2015         | 2015
Speed            | 1.45 GHz        | 3.7 GHz      | 1.5 GHz
# of Cores/Chip  | 260             | 18           | 3584
FLOPS            | 2.6T            | 1T           | 10T

※ Overloaded Pipeline: CPU IPC peaked in 1998 with the DEC 21264 — Out-of-Order execution, Non-Blocking Queues, and Precise Branch Prediction are no longer effective for IPC (64 FPU at 0.5% Si budget; 256 x 64b, 2KB RF). The Massive-Thread GPU instead devotes 5% of its Si budget to 1024 FPUs with a 640K x 32b, 2MB RF.
Direction
Crisis in Speed & Power
• 180nm 1 GHz Samsung/DEC EV6 CPU in 1999, yet only 14nm 4 GHz in 2016!
• No power-efficiency improvement with Si scaling, due to unscalable VDD → data-center power & cost crisis
Crisis in Computing Efficiency
• Massive Cores on a Chip
※ Si scaling enables integration-unit changes: TR → Gate/Cell Array → Core Array
• More cores on a chip, less efficiency of computing (PPA, memory)
[Figure: processor spectrum, from specific to general applications]
- Specific Applications → Special Purpose Processor:
  - DSP: Digital Signal Processor
  - GPU: Graphics Processor
  - NPU: Network Processor
  - NMP: Neuromorphic Processor
- Dedicated Hardware (Domain-Specific ISA; Cost):
  - FPGA
  - Reconfigurable Systems
- Massive Cores (Reconfigurable Processor / Programmable Hardware / Array Processor):
  - Massive Cores on a Chip: Massive CPU/GPU/DSP/HWs
  - Run-Time-Reconfigurable
- Homogeneous Many Cores (Compiler): Chip-Multiprocessor
- General Applications → General Purpose Processor (Cost):
  - PC/Server CPU
  - Mobile CPU
- General Purpose Controller (API): MCU
Crisis in SW
• Direct API for Minimum Memory Transfer
• EPIC begging for the CPU-like GPU for Differentiation and Productivity
Programmable GPU: Market Begging

Workload (relative time) | 1-Core CPU | 500-Core GPU | 500-Core FPPA
Parallel                 | 200        | 5.4          | < 6
Sequential               | 200        | 750          | < 220

Easier HW beats Faster HW!
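Treating the table's entries as relative execution times (lower is better), a hypothetical composite workload — one parallel phase followed by one sequential phase, a composition assumed here rather than stated on the slide — shows why the punchline holds:

```python
# Relative execution times from the slide (smaller is faster).
# The FPPA entries are upper bounds ("<6", "<220"); the bounds themselves
# are used here, so its totals are also upper bounds.
times = {
    "1-core CPU":    (200, 200),   # (parallel part, sequential part)
    "500-core GPU":  (5.4, 750),
    "500-core FPPA": (6,   220),
}

# Composite workload: one parallel phase then one sequential phase;
# total time is just the sum of the two entries.
for name, (par, seq) in times.items():
    print(f"{name}: {par + seq}")
```

The CPU wins the sequential phase and the GPU wins the parallel phase, yet the FPPA — merely decent at both — finishes the combined job fastest (≤226 vs. 400 and 755.4).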
Future
Key Enabler for Next Wave
• x86 CPU: Highly Programmable & High Performance, but Power & Price
• ARM+HW SoC: Low Power & Price, Medium Performance, but Programmability
• Massive Cores: Highly Programmable & High Performance, Low Power & Price
[Figure: market waves, 1980-2017 — PC CPU ($50B) → HW SoC ($100B) → Smart SoC ($250B), with an inflection point at each transition]

IBM PC on Intel x86 CPU
- Drivers: x86 binary compatibility; mass infra for IHV/ISV; high performance (3-4 GHz, 6-24 cores)
- Obstacles: power ~100W; price ~$100; memory bottleneck

Nokia Phone on ARM CPU
- Drivers: low-power, low-price CPU; dedicated HW IPs for low-power, low-price data; ARM mass infra for IHV/ISV
- Obstacles: HW IPs have no programmability; CPU/DSP cost x10 in power and price vs. HW IPs, plus the memory bottleneck

Creative Consumer on Massive Cores
- Drivers: extreme P4 (Price, Power, Performance & Programmability) enabled by Massive-Core-based RTR* FPPA; on-chip dynamic compiler; on-chip kernel; system SW such as GCD and MapReduce
* Run-Time-Reconfigurable
Smart SoC based on RTR FPPA
• Run-Time-Reconfigurable Heterogeneous Field Programmable Processor Array
• Run-Time-Reconfigurable 2D/3D Vector-Accessible Memories
Fine-Thread CPU
• x86 / ARM
• On-Chip Dynamic Compiler
• On-Chip Kernel
Mid-Thread DSP
• SIMD / Vector
Massive-Thread GPU
• SIMT
X-Y Stack Register File
• Multi-GHz MB Wide IO
Reconfigurable Buses
• Low Swing Wide I/O
Reconfigurable Memories
• Multi-GHz GB Wide I/O
FPGA for Special IP & I/O
• HDMI, SerDes, ...
Design Methodology
• Structured Custom to SoC
• PM: PG/CG w/ DVFS
Tool Chains
• Integrated Compiler
• System Simulator
Seamless Platform
• Open OS to Std. Drivers
• OpenCL, MPI, GCD
• Total Solutions
Device
• 0.4V 3mA 1pA @14nm
Analog IPs
• Low Swing Bus Drv/Rec
• High-Q PLLs
Package
• 3D Integration
100mW 1TFLOPS for Exascale Computing in 2020
• More than Moore: Challenge to HW-ASIC-level Massive Cores
• Exa Flop @230KW — 1/10,000 power revolution in 10 years: 1/100 from scaling, an additional 1/100 from innovation in HW such as Run-Time-Reconfigurable Computing; Si technology for 0.1V ELV devices & circuits
• Exa Flop @70MW — 1/30 power efficiency in 5 years: 1/3 from computing, 1/10 from scaling; Multi-GHz Multi-GB reconfigurable memory with 2D/3D vectored access; Exa-byte/sec 3D integration
• Peta Flop @2.3MW
[Figure: FLOPS (1G-100T) vs. power (0.1-1000 Watts), 2005-2020, with Mobile CPU, PC CPU, GPGPU, and HW ASIC trend lines; Moore's Law: x2 / 18 months (x10 / 5 years)]
Acknowledgements
The author would like to thank Dan Dobberpuhl (Founder of SiByte, PASemi),
David Ditzel (Founder of Transmeta), Jim Keller (DEC EV6 Chief Architect),
Anantha Chandrakasan (MIT), Dimitri Antoniadis (MIT), Li-Shiuan Peh (MIT),
Shekhar Borkar (Intel Fellow), Le Nguyen (Founder of AIT),
Peter Song (Founder of Montalvo Systems), and Derek Lentz (GPU Architect) for their
valuable comments and advice, which enabled this presentation.
Appendix
Movie Quality Virtual Reality
CPU-like (Programmable, Random, Dynamic) — fully SW pipeline:
- Procedural Primitives
- Traced Deep Shadow
- Physically Plausible Shader
- Organized Point Clouds
- Procedural PQ Illumination, PQ Shader
- Conditional RI Evaluation
- Facevarying Class Specifier
- Ambient Occlusion
- True RiSphere Primitives
- Blobby Implicit Surfaces
GPU + HW (Fixed) → GPGPU-like:
- OpenGL/D3D API
- Polygon Rasterization Modeling
Fixed → Programmable, Regular → Random, Static → Dynamic
Live Computer Vision
[Figure: wave image and Kanizsa Triangle illusion]
0.4V 3.5mA/um MOSFET: Diffusion to Ballistic
Intel 14nm FinFET, 2014 IEDM
Peking Univ. 9nm DG FET, 2015 IEEE EDSSC
0.4V 134W 14nm 42GHz A-CPU

[Table: carrier-transport speed model — v = uE, capped at v @ Ecrit = 3.00E+07 cm/s (Ecrit = 3.00E+04 V/cm); transit time t = L/v; "Design Up" is the speedup from circuit techniques beyond the transport model]

                 | i4004   | Pentium II | EV6     | Skylake | A-CPU (target)
L                | 10 um   | 250 nm     | 180 nm  | 14 nm   | 14 nm
VDD              | 15 V    | 2 V        | 1.65 V  | 1 V     | 0.4 V
Expected speed   | 750 KHz | 100 MHz    | 140 MHz | 1.8 GHz | 1.8 GHz
Real speed       | 750 KHz | 200 MHz    | 1 GHz   | 4.2 GHz | 42 GHz
Design up        | 1.0     | 2.0        | 7.21    | 2.35    |

[Table: switching power P = f*C*V^2 and gate delay t = CV/I, with Ion in mA/um; final vs. practical targets]

                 | i4004   | Pentium II | EV6     | Skylake | A-CPU
Power (fCV^2)    | 0.5 W   | 40 W       | 80 W    | 80 W    | 134 W
W / mm^2         | 0.04    | 0.2        | 0.67    | 0.66    | 1.1
C total          | 3 nF    | 50 nF      | 30 nF   | 20 nF   | 20 nF
C / mm^2         | 0.25 nF | 0.25 nF    | 0.25 nF | 0.16 nF | 0.16 nF
VDD              | 15 V    | 2 V        | 1.65 V  | 1 V     | 0.4 V
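The power column follows the slide's switching-power model P = f·C·V²; plugging in each chip's clock, total switched capacitance, and supply voltage reproduces the listed values (a sanity check of the table, not new data — the slide rounds Skylake's 84 W to 80 W):

```python
# Dynamic switching power: P = f * C * V^2 (f in Hz, C in farads, V in volts).
def switching_power(f_hz, c_farads, vdd):
    return f_hz * c_farads * vdd ** 2

chips = {  # name: (clock, total switched C, VDD), as read from the table
    "i4004":      (750e3,  3e-9, 15),    # ~0.5 W
    "Pentium II": (200e6, 50e-9, 2),     # 40 W
    "EV6":        (1e9,   30e-9, 1.65),  # ~82 W
    "Skylake":    (4.2e9, 20e-9, 1),     # 84 W (slide: 80 W)
    "A-CPU":      (42e9,  20e-9, 0.4),   # ~134 W at only 0.4 V
}
for name, (f, c, v) in chips.items():
    print(f"{name}: {switching_power(f, c, v):.1f} W")
```

The V² term is the whole argument of the slide: at 0.4 V the A-CPU can clock 10x faster than Skylake (42 GHz vs. 4.2 GHz) for roughly the same power class, because the 6.25x drop in V² pays for the frequency.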
0.2V 1mA/um MOSFET: Ballistic to Tunneling
Peking Univ. 9nm DG FET, 2015 IEEE EDSSC
Chenming Hu, 40nm, 2008 VLSI-TSA
FQHE: Zero Resistance → Zero Power
• Fractional Quantum Hall Effect @ certain magnetic field
• "Sharp resonance" as impedance matching and/or superconductor
In 1980, Klaus von Klitzing [103] found that at temperatures of only a
few Kelvin and high magnetic field (3-10 Tesla), the Hall resistance did
not vary linearly with the field. Instead, he found that it varied in a
stepwise fashion. It was also found that where the Hall resistance was flat,
the longitudinal resistance disappeared. This dissipation-free transport
looked very similar to superconductivity. The field at which the
plateaus appeared, or where the longitudinal resistance vanished,
quite surprisingly, was independent of the material, temperature, or
other variables of the experiment, but only depended on a
combination of fundamental constants, ħ/e². The quantization of
resistivity seen in these early experiments came as a grand surprise and
would lead to a new international standard of resistivity, the klitzing,
defined as the Hall resistance measured at the fourth step.
By 1982, semiconductor technology had greatly advanced and it became
possible to produce interfaces of much higher quality than were
available only a few years before. That same year, Horst Stormer and
Dan Tsui [105] repeated Klitzing's earlier experiments with much cleaner
samples and higher magnetic fields. What they found was the same
stepwise behavior as seen previously, but to everyone's surprise, steps
also appeared at fractional filling factors ν = 1/3, 1/5, 2/5... Strongly
correlated systems are notoriously difficult to understand, but in 1983,
Robert Laughlin [106] proposed his now-celebrated ansatz for a
variational wavefunction which contained no free parameters.
[Cooper Pairs to Molecules, J. N. Milstein]
PPA Crisis: Learn from Dedicated HW IP
“Years of research in low-power embedded computing have shown
only one design technique to reduce power: reduce waste.”
- Mark Horowitz, Stanford University & Rambus Inc.
PPA = Performance / (Power x Area); performance is measured in concurrent H.264 streams:

              | CPU    | GPU    | mCPU   | DSP    | HW IP I | HW IP II
Power (W)     | 60     | 80     | 0.6    | 0.24   | 0.12    | 0.015
Performance   | 1      | 2      | 0.1    | 1      | 1       | 1
Area (mm²)    | 200    | 400    | 10     | 3      | 2       | 0.5
PPA           | 8.3E-5 | 6.3E-5 | 1.6E-3 | 1.4    | 4.2     | 133.3
PPA (norm.)   | 1      | 0.75   | 19     | 1.7E4  | 5.0E4   | 1.6E6

Reduce Power: Reduce Waste
- Wasted Si Area
- Wasted Computation
- Wasted Bandwidth
- Wasted Voltage
- Wasted Design Resources
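Recomputing the PPA figure of merit from the raw rows reproduces the slide's printed values for every column except mCPU (which recomputes to 1.7E-2 rather than the printed 1.6E-3, likely a transcription slip) and shows the ~10⁶ gap between a CPU and a dedicated HW IP:

```python
# PPA figure of merit: performance per watt per mm^2 (higher is better).
rows = {  # name: (power in W, performance in H.264 streams, area in mm^2)
    "CPU":      (60,    1,   200),
    "GPU":      (80,    2,   400),
    "mCPU":     (0.6,   0.1, 10),   # recomputes to 1.7E-2, not the slide's 1.6E-3
    "DSP":      (0.24,  1,   3),
    "HW IP I":  (0.12,  1,   2),
    "HW IP II": (0.015, 1,   0.5),
}
ppa = {name: perf / (power * area) for name, (power, perf, area) in rows.items()}
for name, value in ppa.items():
    print(f"{name}: PPA = {value:.2g} ({value / ppa['CPU']:.3g}x vs CPU)")
```

Dividing by both power and area is what makes the HW IPs look so extreme: they give up almost nothing in H.264 throughput while shedding three to four orders of magnitude in each denominator.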
Make HW IP Programmable: Reconfigurable Computing
• Reprogrammable FSM with microcode + domain-specific HW FU with ISA
• Extreme RISC in horizontal control, and extreme CISC in vertical data

[Figure: advanced DSP design driven by workload analysis — a VLIW pipeline (instruction fetch F0-F2 from I$, instruction decode D1-D2, execute E1-E7 with tag match and AGU access, function units FU0-FU2 holding ALU1-ALU3 / MUL1-MUL2 / SHFT, LS pipelines, and writeback) around a central register file, extended to a Coarse-Grain Array (CGA) of FUs with distributed RFs; a smart compiler (C/C++) balances control latency against data throughput]

Run-time-reconfigurable function units per domain ISA:
- Radio ISA RC-FU: 4G/5G modem, channel
- Media ISA RC-FU: AV/image, 3D/ray-tracing/VR
- Intelligence ISA RC-FU: recognition, mining, synthesis
Reconfigurable Memory
• HW IPs' outstanding PPA comes from implicit, distributed, stacked queue memory
• Reconfigurable memory for HW-IP-level PPA

H.264 Luma Inter Prediction Algorithm — worst case, 16x16 mode (position i):
1. Vertical filtering (6-tap filter, 21x16): y = x0 - 5*x1 + 20*x2 + 20*x3 - 5*x4 + x5
   — 21x21 pixels (8-bit) → 21x16 pixels (16-bit)
2. Horizontal filtering (6-tap filter + scaling, 17x16): z = ((y0 - 5*y1 + 20*y2 + 20*y3 - 5*y4 + y5) + 512) >> 10
   — 21x16 pixels (16-bit) → 17x16 pixels (8-bit)
3. ¼-pel (16x16): r = (z0 + z1 + 1) >> 1
   — 17x16 pixels (8-bit) → 16x16 pixels (8-bit)
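The three stages can be sketched directly from the slide's equations (a minimal reference model; the 8-bit clamp after scaling is standard H.264 behavior, assumed here rather than shown on the slide):

```python
def tap6(x0, x1, x2, x3, x4, x5):
    # H.264 6-tap half-pel interpolation kernel, weights (1, -5, 20, 20, -5, 1)
    return x0 - 5 * x1 + 20 * x2 + 20 * x3 - 5 * x4 + x5

def clip8(v):
    # Clamp to the 8-bit pixel range (assumed, per H.264)
    return max(0, min(255, v))

def vertical(col):
    # Stage 1: one column of 8-bit pixels -> 16-bit intermediates y
    return [tap6(*col[i:i + 6]) for i in range(len(col) - 5)]

def horizontal(row):
    # Stage 2: 16-bit intermediates -> scaled 8-bit half-pel samples z
    return [clip8((tap6(*row[i:i + 6]) + 512) >> 10) for i in range(len(row) - 5)]

def quarter_pel(z0, z1):
    # Stage 3: quarter-pel sample r, rounded average of two neighbors
    return (z0 + z1 + 1) >> 1

# Flat input reproduces itself through every stage, since the kernel
# weights sum to 32 and the scaling step divides by 1024 = 32 * 32.
ys = vertical([100] * 21)      # each y = 100 * 32 = 3200
zs = horizontal([3200] * 21)   # each z = (3200 * 32 + 512) >> 10 = 100
print(ys[0], zs[0], quarter_pel(zs[0], zs[0]))
```

Note the data-movement shape that motivates the slide: stage 1 reads columns, stage 2 reads rows of the intermediate result — exactly the X-Y bi-directional access pattern the reconfigurable register file provides for free.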
2D Reconfigurable Memory
• No need to calculate addresses: implicit/local/distributed 32x32x64-bit RF
• X-Y bi-directional random access for extreme spatial locality in bit/pixel stream applications

H.264 FHD decoding — Luma Interpolation (loop example: II=2, loop count=16):
- Conventional VLIW (total cycles = 170): data load & address generation (repeated Load64b + add), data shuffling (repeated CAT_WIN), data computation (FirFilter64b, Round Sat, Avg64b), then data store & address generation (Store64b + add)
- X-Y Stack (total cycles ≈ 18): LOAD from stack → data computation (FirFilter64b, Round Sat, Avg64b for vertical filtering, horizontal filtering, and ¼-pel) → STORE to stack, with no explicit address generation or shuffling
3D Reconfigurable Memory
• Computer vision, virtual reality, and AI all need depth processing:
pixel-to-voxel processing
Run-Time-Reconfigurable Computing
• On-chip run-time compiler & kernel
• One of the biggest overheads has been reconfiguration-memory PPA
• 3D XPoint can accelerate the RTR FPPA (Run-Time-Reconfigurable Field
Programmable Processor Array)
New York Univ., 2011 IEEE CVPR
GPU Case
• 5120 cores x 1 GHz x 2 FLOP/core = ~10 TFLOPS
Hide the Memory Latency
• Massive threading in the GPU hides long latency, but in a much more limited way!
• Good for massive-thread applications, but very damaging for
CSP and/or random control flow and/or less massively parallel
thread applications
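The peak number is just cores x clock x FLOPs per cycle; the catch is that reaching it requires enough independent work in flight to cover memory latency (a Little's-law estimate — the 400-cycle latency below is illustrative, not from the slide):

```python
# Peak throughput: cores x clock x FLOPs issued per core per cycle.
cores, clock_hz, flops_per_cycle = 5120, 1e9, 2
peak_flops = cores * clock_hz * flops_per_cycle
print(peak_flops / 1e12, "TFLOPS")  # 10.24, the slide's "10 TFLOPS"

# Little's law: operations in flight = issue rate x latency to be hidden.
# Illustrative assumption: 400-cycle memory latency, one outstanding
# operation per thread.
latency_cycles = 400
threads_needed = cores * latency_cycles
print(threads_needed)  # ~2M independent operations to keep the chip busy
```

That in-flight requirement is exactly why the slide calls the mechanism "much limited": a workload with random control flow or only modest parallelism can never populate millions of independent slots, and the machine idles.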
Crisis in Memory Latency
• CPU-like work: small threads
- Detailed, sophisticated rendering
- Ray tracing
- Procedural graphics
※ Random-dynamic-irregular control flow & data access
※ Severe slowdown on PC/mobile GPUs
• Mobile-GPU-like work: medium threads
- Some repetitive, some detail
- Mid-size regular patterns
※ Half fixed-static-regular, half random-dynamic-irregular control flow & data access
※ Optimized for mobile GPUs
• PC-GPU-like work: massive threads
- Repetitive, simple rendering
- Large chunks of regular patterns
- Rasterized modeling
※ Fixed-static-regular control flow & data access
※ Outstanding speedup on PC GPUs
[Figure: random thread variations]
On-Chip DRAM
• GHz random-access on-chip DRAM, but 5x larger Si area than commercial DRAM
Samsung 40nm 2Gb, 2011 ISSCC: 25Mb/mm²
IBM 45nm SOI eDRAM, 2010 CICC: 5Mb/mm²
Direct API: 10 → 1 File Copies for 10x PPA
• Minimize memory accesses with a freeway to compute:
Algorithm → Applications → UI → Many APIs → OS → Drivers → Cores,
along with an on-chip API MMU to minimize file copy and transfer ('13.9 AMD Mantle GPU)