The Memory Gap:
to Tolerate or to Reduce?
Jean-Luc Gaudiot
Professor
University of California, Irvine
April 2nd, 2002
Outline
The problem: the Memory Gap
 Simultaneous Multithreading
 Decoupled Architectures
 Memory Technology
 Processor-In-Memory

The Memory Latency Problem


Technological Trend: Memory latency is getting longer relative to microprocessor speed (40% per year)
Problem: Memory Latency - Conventional Memory Hierarchy Insufficient:
• Many applications have large data sets that are accessed non-contiguously.
• Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994].
Domain: benchmarks with large data sets: symbolic, signal processing and scientific programs
Some Solutions
Solutions and their limitations:

Larger Caches
— Slow
— Works well only if the working set fits in the cache and there is temporal locality

Hardware Prefetching
— Cannot be tailored for each application
— Behavior based on past and present execution-time behavior

Software Prefetching
— Ensure the overheads of prefetching do not outweigh the benefits → conservative prefetching
— Adaptive software prefetching is required to change the prefetch distance during run-time (a sketch follows below)
— Hard to insert prefetches for irregular access patterns

Multithreading
— Solves the throughput problem, not the memory latency problem
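To make the software-prefetching row concrete, here is a minimal sketch of prefetching with a tunable distance, using the GCC/Clang __builtin_prefetch intrinsic; the function and parameter names are illustrative, not from the talk. The distance must be large enough to cover the memory latency, yet conservative enough that prefetched lines are not evicted before use and the extra instructions do not outweigh the benefit.

#include <stddef.h>

/* y[i] += x[i] * h[i], prefetching `dist` elements ahead of the use. */
void scaled_sum(double *y, const double *x, const double *h,
                size_t n, size_t dist)
{
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n) {
            /* rw = 0 (read), locality = 1 (streamed data, little reuse) */
            __builtin_prefetch(&x[i + dist], 0, 1);
            __builtin_prefetch(&h[i + dist], 0, 1);
        }
        y[i] += x[i] * h[i];
    }
}

An adaptive scheme adjusts dist at run time from observed late versus useful prefetches, which is what the Slip Control Queue described later in the talk does.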
Limitation of Present Solutions

Huge cache:
• Slow and works well only if the working set fits cache
and there is some kind of locality

Prefetching
• Hardware prefetching
– Cannot be tailored for each application
– Behavior based on past and present execution-time behavior
• Software prefetching
– Ensure overheads of prefetching do not outweigh the benefits
– Hard to insert prefetches for irregular access patterns

SMT
• Enhances utilization and throughput at the thread level, but does not shorten the latency of an individual memory access
Outline
The problem: the memory gap
 Simultaneous Multithreading
 Decoupled Architectures
 Memory Technology
 Processor-In-Memory

Simultaneous Multi-Threading (SMT)
Horizontal and vertical sharing
 Hardware support of multiple threads
 Functional resources shared by multiple
threads
 Shared caches
 Highest utilization with multi-program or
parallel workload

SMT Compared to SS
[Figure: issue-slot diagrams over seven cycles. The superscalar completes 9 instructions, with many INT/MEM/FP slots lost to stalls; the SMT fills the same slots with 20 instructions drawn from Threads 1-8.]
Superscalar processors execute multiple instructions per cycle
Superscalar functional units idle due to I-fetch stalls, conditional branches, data
dependencies
SMT dispatches instructions from multiple data streams, allowing efficient execution and
latency tolerance
• Vertical sharing (TLP and block multi-threading)
• Horizontal sharing (ILP and simultaneous multiple thread instruction dispatch)
CMP Compared to SS
[Figure: issue-slot diagrams for a wide superscalar (9 instructions) and a two-processor CMP, CMP-p2, whose two narrower superscalar cores run eight threads (13 instructions).]
CMP uses thread-level parallelism to increase throughput
CMP has layout efficiency
• More functional units
• Faster clock rate
CMP hardware partition limits performance
• Smaller level-1 resources cause increased miss rates
• Execution resources not available from across partition
Wide Issue SS Inefficiencies

Architecture and software limitations
• Limited program ILP => idle functional units
• Increased waste of speculative execution

Technology issues
• Area grows as O(d³) {d = issue or dispatch width} (see the toy example after this list)
• Area grows an additional O(t·log₂(t)) {t = number of SMT threads}
• Increased wire delays (increased area, tighter spacings, thinner oxides, thinner metal)
• Increased memory access delays versus processor clock
• Larger pipeline penalties
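As a toy illustration of that O(d³) growth (a rule-of-thumb calculation, not a real area model; the program and its constants are mine), partitioning a fixed total issue width over more cores shrinks the superlinear term:

#include <stdio.h>

int main(void)
{
    const int total_issue = 8;
    for (int cores = 1; cores <= 8; cores *= 2) {
        int d = total_issue / cores;      /* issue width per core */
        int area = cores * d * d * d;     /* relative area, O(d^3) per core */
        printf("%d core(s) x %d-issue: relative area %d\n", cores, d, area);
    }
    return 0;
}

Under this assumption a single 8-issue core costs 512 units against 128 for two 4-issue cores, which is the layout-efficiency argument the CMP and POSM slides build on.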
Problems solved through:
 CMP - localizes processor resources
 SMT - efficient use of FUs, latency tolerance
 Both CMP and SMT - thread level parallelism
POSM Configurations
[Figure: block diagrams of a wide-issue SMT processor, two- and four-processor POSM configurations, and an eight-processor CMP. The multiprocessor configurations give each processor private iL1/dL1 caches and TLBs and connect them through an L2 crossbar to a shared level-2 cache and the external interfaces.]
All architectures above have eight threads.
Which configuration has the highest performance for an average workload?
Run benchmarks on various configurations and find the optimal performance point.
Superscalar, SMT, CMP,
and POSM Processors
[Figure: issue-slot diagrams for the four organizations running eight threads: superscalar (9 instructions), SMT (20 instructions), CMP-p2 (13 instructions), and POSM-p2 (33 instructions).]
CMP and SMT both have higher throughput than superscalar
Combination of CMP/SMT has highest throughput
Experiment results
[Figure: IPC versus number of threads (1-8) with equivalent functional units, for smt.p1.f2.t8.d16, posm.p2.f2.t4.d8, posm.p4.f1.t2.d4, and cmp.p8.f1.t1.d2.]
• SMT.p1 has the highest performance through vertical and horizontal sharing
• cmp.p8 has a linear increase in performance
Equivalent Silicon Area and System Clock Effects
[Figure: normalized IPC (NIPC) versus number of threads (1-8) at equivalent silicon area and clock, for smt.p1.f2.t8.d9, posm.p2.f2.t4.d6, posm.p4.f1.t2.d4, and cmp.p8.f1.t1.d2.]
• SMT.p1 throughput is limited
• SMT.p1 and POSM.p2 have equivalent single-thread performance
• POSM.p4 and CMP.p8 have the highest throughput
Synthesis
• "Comparable silicon resources" are required for processor evaluation
• POSM.p4 has 56% more throughput than wide-issue SMT.p1
• Future wide-issue processors are difficult to implement, increasing the POSM advantage
  – Smaller technology spacings have higher routing delays due to parasitic resistance and capacitance
  – The larger the processor, the larger the O(d²·t·log₂(t)) and O(d³·t) impact on area and delays
• SMT works well with deep pipelines
• The ISA and micro-architecture affect SMT overhead
  – A 4-thread x86 SMT would have 1/8th the SMT overhead
  – Layout and micro-architecture techniques reduce SMT overhead
Outline
The problem: the memory gap
 Simultaneous Multithreading
 Decoupled Architectures
 Memory Technology
 Processor-In-Memory

The HiDISC Approach
Observation:
• Software prefetching impacts compute performance
• PIMs and RAMBUS offer a high-bandwidth memory system
- useful for speculative prefetching
Approach:
• Add a processor to manage prefetching
-> hide overhead
• Compiler explicitly manages the memory hierarchy
• Prefetch distance adapts to the program runtime behavior
Decoupled Architectures
[Figure: four organizations compared. MIPS (conventional): an 8-issue Computation Processor (CP) with registers, a cache, and the 2nd-level cache and main memory. DEAP (decoupled): a 3-issue CP decoupled from a 5-issue Access Processor (AP) in front of the cache. CAPP: a 5-issue CP with a 3-issue Cache Management Processor (CMP) between the cache and the 2nd-level cache and main memory. HiDISC (new decoupled): a 2-issue CP, a 3-issue AP, and a 3-issue CMP, one processor per level of the memory hierarchy.]
DEAP: [Kurian, Hulina, & Coraor '94]
PIPE: [Goodman '85]
Other decoupled processors: ACRI, ZS-1, WA
What is HiDISC?
• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality (Just-in-Time Fetch)
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero-overhead prefetches for maximal computation throughput
[Figure: the HiDISC organization. A 2-issue Computation Processor (CP) with registers, a 3-issue Access Processor (AP), and a 3-issue Cache Management Processor (CMP) are coupled through the Load Data, Store Address, Store Data, and Slip Control Queues; the L1 cache sits between the AP and the CMP, with the L2 cache and higher levels behind the CMP.]
Slip Control Queue

The Slip Control Queue (SCQ) adapts dynamically:

if (prefetch_buffer_full())
    Don't change size of SCQ;
else if ((2 * late_prefetches) > useful_prefetches)
    Increase size of SCQ;
else
    Decrease size of SCQ;

• Late prefetches = prefetched data arrived after the load had been issued
• Useful prefetches = prefetched data arrived before the load had been issued
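The same policy as runnable C, with invented state, bounds, and counter names (in HiDISC the counters would come from the prefetch hardware):

typedef struct {
    int scq_size;              /* current slip (prefetch) distance         */
    int late_prefetches;       /* data arrived after the load was issued   */
    int useful_prefetches;     /* data arrived before the load was issued  */
    int prefetch_buffer_full;  /* nonzero when the prefetch buffer is full */
} scq_state;

void adjust_scq(scq_state *s, int min_size, int max_size)
{
    if (s->prefetch_buffer_full) {
        /* buffer saturated: leave the slip distance alone */
    } else if (2 * s->late_prefetches > s->useful_prefetches) {
        if (s->scq_size < max_size)
            s->scq_size++;     /* let the prefetch stream run further ahead of the loads */
    } else {
        if (s->scq_size > min_size)
            s->scq_size--;     /* pull the prefetch stream back */
    }
}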
Decoupling Programs for HiDISC
(Discrete Convolution - Inner Loop)

Inner Loop Convolution:
for (j = 0; j < i; ++j)
    y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor Code:
while (not EOD)
    y = y + (x * h);
    send y to SDQ

Access Processor Code:
for (j = 0; j < i; ++j) {
    load (x[j]);
    load (h[i-j-1]);
    GET_SCQ;
}
send (EOD token)
send address of y[i] to SAQ

Cache Management Code:
for (j = 0; j < i; ++j) {
    prefetch (x[j]);
    prefetch (h[i-j-1]);
    PUT_SCQ;
}

SAQ: Store Address Queue, SDQ: Store Data Queue, SCQ: Slip Control Queue, EOD: End of Data
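Read together, the three streams cooperate roughly as follows: the cache management code runs furthest ahead, prefetching x[j] and h[i-j-1] into the cache and depositing one SCQ token per iteration; the access processor consumes those tokens as it issues the actual loads, forwarding the operands to the computation processor through the load data queue; and the computation processor simply multiplies and accumulates whatever arrives, sending each result to the SDQ for the address it queued in the SAQ.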
Benchmarks

Benchmark   Source of Benchmark                Lines of Source Code   Description                                  Data Set Size
LLL1        Livermore Loops [45]               20                     1024-element arrays, 100 iterations          24 KB
LLL2        Livermore Loops                    24                     1024-element arrays, 100 iterations          16 KB
LLL3        Livermore Loops                    18                     1024-element arrays, 100 iterations          16 KB
LLL4        Livermore Loops                    25                     1024-element arrays, 100 iterations          16 KB
LLL5        Livermore Loops                    17                     1024-element arrays, 100 iterations          24 KB
Tomcatv     SPECfp95 [68]                      190                    33x33-element matrices, 5 iterations         <64 KB
MXM         NAS kernels [5]                    113                    Unrolled matrix multiply, 2 iterations       448 KB
CHOLSKY     NAS kernels                        156                    Cholesky matrix decomposition                724 KB
VPENTA      NAS kernels                        199                    Invert three pentadiagonals simultaneously   128 KB
Qsort       Quicksort sorting algorithm [14]   58                     Quicksort                                    128 KB
Simulation Parameters

Parameter                  Value
L1 cache size              4 KB
L2 cache size              16 KB
L1 cache associativity     2
L2 cache associativity     2
L1 cache block size        32 B
L2 cache block size        32 B
Memory latency             Variable (0-200 cycles)
Memory contention time     Variable
Victim cache size          32 entries
Prefetch buffer size       8 entries
Load queue size            128
Store address queue size   128
Store data queue size      128
Total issue width          8
Simulation Results
[Figure: performance of MIPS, DEAP, CAPP, and HiDISC on LLL3, Tomcatv, Vpenta, and Cholsky as the main memory latency varies from 0 to 200 cycles.]
VLSI Layout Overhead (I)
• Goal: cost effectiveness of the HiDISC architecture
• Cache has become a major portion of the chip area
• Methodology: extrapolated a HiDISC VLSI layout based on the MIPS R10000 processor (0.35 μm, 1996)
• The space overhead of HiDISC is extrapolated to be 11.3% more than a comparable MIPS processor
• The benchmarks should be run again using these parameters and new memory architectures
VLSI Layout Overhead (II)

Component                          Original MIPS R10K (0.35 μm)   Extrapolation (0.15 μm)   HiDISC (0.15 μm)
D-Cache (32 KB)                    26 mm²                         6.5 mm²                   6.5 mm²
I-Cache (32 KB)                    28 mm²                         7 mm²                     14 mm²
TLB Part                           10 mm²                         2.5 mm²                   2.5 mm²
External Interface Unit            27 mm²                         6.8 mm²                   6.8 mm²
Instruction Fetch Unit and BTB     18 mm²                         4.5 mm²                   13.5 mm²
Instruction Decode Section         21 mm²                         5.3 mm²                   5.3 mm²
Instruction Queue                  28 mm²                         7 mm²                     0 mm²
Reorder Buffer                     17 mm²                         4.3 mm²                   0 mm²
Integer Functional Unit            20 mm²                         5 mm²                     15 mm²
FP Functional Units                24 mm²                         6 mm²                     6 mm²
Clocking & Overhead                73 mm²                         18.3 mm²                  18.3 mm²
Total Size without L2 Cache        292 mm²                        73.2 mm²                  87.9 mm²
Total Size with on-chip L2 Cache                                  129.2 mm²                 143.9 mm²
The Flexi-DISC

Fundamental characteristics:
• Dynamically reconfigurable central computational kernel (CK)
• Multiple levels of caching and processing around the CK
  – inherently highly dynamic at execution time
  – adjustable prefetching
• Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources
The Flexi-DISC

• Partitioning of the Computation Kernel
  – It can be allocated to different portions of the application or to different applications
• The CK requires separation of the next ring to feed it with data
• The variety of target applications makes the memory accesses unpredictable
• Identical processing units for the outer rings
  – Highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved
Multiple HiDISC: McDISC

• Problem: All extant, large-scale multiprocessors perform poorly when faced with a tightly-coupled parallel program.
• Reason: Extant machines have a long latency when communication is needed between nodes. This long latency kills performance when executing tightly-coupled programs. (Note that multi-threading à la Tera does not help when there are dependencies.)
• The McDISC solution: Provide the network interface processor (NIP) with a programmable processor to execute not only OS code (e.g. Stanford FLASH) but also user code generated by the compiler.
• Advantage: The NIP, executing user code, fetches data before it is needed by the node processors, eliminating the network fetch latency most of the time.
• Result: Fast execution (speedup) of tightly-coupled parallel programs.
The McDISC System: Memory-Centered Distributed Instruction Set Computer
[Figure: one McDISC node. The compiler generates computation, access, cache management, and network management instructions for the Computation Processor (CP), Access Processor (AP), Cache Management Processor (CMP), and Network Interface Processor (NIP). The node also contains registers with register links to neighboring CPs, a cache and main memory, a Disc Processor (DP) with a disc cache and a disc farm (RAID, dynamic database), and Adaptive Signal and Adaptive Graphics PIMs (ASP, AGP) connected to sensor inputs (FLIR, SAR, video, ESS) and to displays and the network. Nodes are joined by a 3-D torus of pipelined rings; the applications span understanding, inference, analysis, situation awareness, targeting, and decision processes.]
Summary
• A processor for each level of the memory hierarchy
• Adaptive memory hierarchy management
• Reduces memory latency for systems with high memory bandwidths (PIMs, RAMBUS)
• 2x speedup for scientific benchmarks
• 3x speedup for matrix decomposition/substitution (Cholesky)
• 7x speedup for matrix multiply (MXM); similar results expected for ATR/SLD
Outline
The problem: the memory gap
 Simultaneous Multithreading
 Decoupled Architectures
 Memory Technology
 Processor-In-Memory

Memory Technology

New DRAM technologies
• DDR DRAM, SLDRAM and DRDRAM
• Most DRAM technologies achieve higher
bandwidth

Integrating memory and processor on a
single chip (PIM and IRAM)
• Bandwidth and memory access latency sharply
improve
New Memory Technologies (Cont.)

Rambus DRAM (RDRAM)
• A memory interleaving system integrated onto a single memory chip
• Four outstanding requests with a pipelined microarchitecture
• Operates at much higher frequencies than SDRAM

Direct Rambus DRAM (DRDRAM)
• Direct control of all row and column resources
concurrently with data transfer operations
• Current DRDRAM can achieve 1.6 Gbytes/sec
bandwidth transferring on both clock edges
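(As a rough check, assuming the usual Direct Rambus channel parameters of a 2-byte-wide data path clocked at 400 MHz: 2 bytes × 400 MHz × 2 edges ≈ 1.6 Gbytes/sec, matching the figure above.)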
Intelligent RAM (IRAM)
• Merges processor and memory technology
• All memory accesses remain within a single chip
  – Bandwidth can be as high as 100 to 200 Gbytes/sec
  – Access latency is less than 20 ns
• Good solution for data-intensive streaming applications
Vector IRAM
• Cost-effective system
  – Incorporates vector processing units and the memory system on a single chip
• Beneficial for multimedia applications with critical DSP features
• Good energy efficiency
• Attractive for future mobile computing processors
Outline
The problem: the memory gap
 Simultaneous Multithreading
 Decoupled Architectures
 Memory Technology
 Processor-In-Memory

Overview of the System

Proposed DCS (Data-intensive Computing
System) Architecture
DCS System (Cont’d)

Programming
• Different from the conventional programming model
• Applications are divided into two separate sections
– Software : Executed by the host processor
– Hardware : Executed by the CMP
• The programmer must use CMP instructions

CMP
• Several CMPs can be connected to the system bus
• Variable CMP size and configuration depending on the amount and complexity of the job it has to handle
• Variable size, function, and location of the logic inside the CMP to better handle the application

Memory, Coprocessors, I/O
CMP Architecture

CMP (Computational Memory Processor)
Architecture
• The Heart of our work
• Responsible for executing the core operations of data-intensive applications
• Attached to the system bus
• CMP instructions are encapsulated in the normal
memory operations.
• Consists of many ACME (Application-specific
Computational Memory Element) cells interconnected
amongst themselves through dedicated communication
links

CMC (Computing Memory Cluster)
• A small number of ACME cells are put together to form
a CMC
CMP Architecture
CMC Architecture
ACME Architecture

ACME (Application-specific Computational
Memory Elements) Architecture
• ACME memory, configuration cache, CE (Computing Element), FSM
• The CE is the reconfigurable computing unit and consists of many CCs (Computing Cells)
• The FSM governs the overall execution of the ACME
Inside the Computing Elements
Synchronization and Interface

Three different kinds of communications
• Host processor with CMP (eventually with each ACME)
– Done by synchronization variables (specific memory locations) located inside the memory of each ACME cell (see the host-side sketch after this list)
– Example: start and end signals for operations, CMP instructions for each ACME
• ACME to ACME
– Two different approaches
• Host mediated
– Simple
– Not practical for frequent communications
• Distributed mediated approach
– Expensive and complex
– Efficient
• CMP to CMP
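A hypothetical host-side sketch of the synchronization-variable scheme mentioned above (the structure layout, field names, and polling protocol are invented for illustration):

#include <stdint.h>

typedef struct {
    volatile uint32_t start;    /* host sets to 1 to launch the ACME job  */
    volatile uint32_t done;     /* ACME sets to 1 when the job completes  */
    volatile uint32_t command;  /* encoded CMP instruction for this ACME  */
} acme_sync;

/* Host side: write the command into the ACME's memory, raise the start
 * flag with an ordinary store, then poll the done flag. */
void host_run_acme(acme_sync *sync, uint32_t cmd)
{
    sync->command = cmd;
    sync->done    = 0;
    sync->start   = 1;
    while (!sync->done)
        ;                       /* spin on the synchronization variable */
}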
Benefits of the Paradigm

• All the benefits of being a PIM
  – Increased bandwidth and reduced latency
  – Faster computation (parallel execution among many ACMEs)
• Effective usage of the full memory bandwidth
• Efficient co-existence of software and hardware
• More parallel execution inside the ACMEs by efficiently configuring the structure with the application in mind
• Scalability
Implementation of the CMP

Projected how our CMP will be implemented…
• According to the 2000 edition of the ITRS (International Technology Roadmap for Semiconductors), in year 2008:
  – A high-end MPU with 1.381 billion transistors will be in production in 0.06 μm technology on a 427 mm² die
  – If half of the die size is allocated to memory, 8.13 Gbits of storage will be available, with 690 million transistors for logic
  – There can be 2048 ACME cells, each with 512 Kbytes of memory and 315K transistors for logic, control, and everything else inside the ACME, with the rest of the resources (36M transistors) for interconnections
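A back-of-the-envelope check of that budget (the figures on the slide are rounded, so this only shows rough consistency; the 50/50 memory/logic split is the slide's assumption):

#include <stdio.h>

int main(void)
{
    double memory_bits = 8.13e9;   /* storage on the memory half of the die */
    double logic_trans = 690e6;    /* logic transistors on the other half   */
    int    acme_cells  = 2048;

    double kb_per_cell = memory_bits / 8.0 / 1000.0 / acme_cells;
    double logic_used  = acme_cells * 315e3;

    printf("memory per ACME cell: %.0f KB (slide: 512 Kbytes)\n", kb_per_cell);
    printf("logic used by ACME cells: %.0f M of %.0f M transistors\n",
           logic_used / 1e6, logic_trans / 1e6);
    return 0;
}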
Motion Estimation of MPEG
• Finding the motion vector for each macro block in the frame
• It absorbs about 70% of the total execution time of MPEG
• Huge number of simple additions, subtractions and comparisons

Example ME execution

One ACME structure to find a motion vector
for a macro block
• Executes in pipelined fashion reusing the data
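For reference, a minimal C sketch of the full-search computation one such motion-vector search performs for an 8×8 macro block over an 8-pixel displacement window (the frame layout, bounds handling, and names are assumptions, not the ACME mapping):

#include <limits.h>
#include <stdlib.h>

/* cur/ref are full frames of the given width; (bx, by) is the top-left
 * corner of the macro block. Border clipping is omitted for brevity. */
void find_motion_vector(const unsigned char *cur, const unsigned char *ref,
                        int width, int bx, int by, int *mvx, int *mvy)
{
    int best = INT_MAX;
    for (int dy = -8; dy <= 8; dy++) {
        for (int dx = -8; dx <= 8; dx++) {
            int sad = 0;   /* sum of absolute differences */
            for (int y = 0; y < 8; y++)
                for (int x = 0; x < 8; x++)
                    sad += abs(cur[(by + y) * width + (bx + x)] -
                               ref[(by + y + dy) * width + (bx + x + dx)]);
            if (sad < best) {
                best = sad;
                *mvx = dx;
                *mvy = dy;
            }
        }
    }
}

The inner work is just the additions, subtractions, and comparisons the previous slide lists.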
Example ME execution

Performance
• For an 8×8 macro block with an 8-pixel displacement
• 276 clock cycles to find the motion vector for one macro block

Performance comparison with other
architectures