HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc Gaudiot
University of Southern California
http://www-pdpc.usc.edu
28 Sep. 2000

Personnel
PIs: Alvin M. Despain and Jean-Luc Gaudiot
Graduate students:
• Manil Makhija
• Wonwoo Ro
• Seongwon Lee
• Steve Jenks

HiDISC: Hierarchical Decoupled Instruction Set Computer
[Overview figure: sensor inputs (FLIR, SAR, video, ATR/SLD, scientific codes) and a dynamic database feed the HiDISC processor through a decoupling compiler; dedicated processors sit in front of the registers, the cache, and main memory to support situational awareness.]
New Ideas
• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality
• Exploit instruction-level parallelism without extensive scheduling hardware
Impact
• 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
• 7.4x speedup for matrix multiply over an in-order issue superscalar processor
• 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
• Zero-overhead prefetches for maximal computation throughput
• Reduced memory latency for systems that have high memory bandwidth (e.g., PIMs, RAMBUS)
• Allows the compiler to solve indexing functions for irregular applications
• Reduced system cost for high-throughput scientific codes
Schedule
• April 98 (start): defined benchmarks; completed simulator; performed instruction-level simulations on hand-compiled benchmarks
• April 99: define the HiDISC architecture; benchmark results; continue simulations of more benchmarks (SAR)
• April 00: develop and test a full decoupling compiler; update the simulator; generate performance statistics and evaluate the design

Outline
• HiDISC Project Description
• Experiments and Accomplishments
• Schedule and Work in Progress
• Future Avenues
• Summary

HiDISC: Hierarchical Decoupled Instruction Set Computer
• Technological trend: memory latency is getting longer relative to microprocessor speed (about 40% per year)
• Problem: some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
• Domain: benchmarks with large data sets, i.e., symbolic, signal-processing, and scientific programs
• Present solutions: multithreading (homogeneous), larger caches, prefetching, software multithreading

Present Solutions
• Larger caches: slow, and they work well only if the working set fits in the cache and there is temporal locality
• Hardware prefetching: cannot be tailored to each application; behavior is based on past and present execution-time behavior
• Software prefetching: the overhead of prefetching must not outweigh the benefit, which forces conservative prefetching; adaptive software prefetching is required to change the prefetch distance at run time; prefetches are hard to insert for irregular access patterns (see the sketch below)
• Multithreading: solves the throughput problem, not the memory latency problem
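The software-prefetching limitations above are easiest to see in code. The sketch below is illustrative only and is not taken from the HiDISC toolchain: the kernel, the function name dot_product, and the fixed PREFETCH_DISTANCE tuning value are all assumptions made for the example. It uses the GCC/Clang builtin __builtin_prefetch to request data a few iterations ahead of use. Too small a distance and the data still arrives late; too large and it pollutes the cache; and for pointer-chasing code the future address cannot be computed at all, which is exactly the gap the dedicated HiDISC access and cache-management processors are meant to close.

    #include <stddef.h>

    /* Hypothetical tuning knob for this sketch; not a HiDISC parameter. */
    #define PREFETCH_DISTANCE 16

    /* Simple streaming kernel with compiler-builtin prefetch hints.
     * The prefetch instructions themselves consume issue slots, which is
     * the compute-performance overhead referred to above. */
    double dot_product(const double *x, const double *h, size_t n)
    {
        double sum = 0.0;
        for (size_t j = 0; j < n; ++j) {
            if (j + PREFETCH_DISTANCE < n) {
                __builtin_prefetch(&x[j + PREFETCH_DISTANCE], 0, 1); /* read, low reuse */
                __builtin_prefetch(&h[j + PREFETCH_DISTANCE], 0, 1);
            }
            sum += x[j] * h[j];
        }
        return sum;
    }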
The HiDISC Approach
Observation:
• Software prefetching impacts compute performance
• PIMs and RAMBUS offer a high-bandwidth memory system, which is useful for speculative prefetching
Approach:
• Add a processor to manage prefetching, hiding the prefetch overhead
• The compiler explicitly manages the memory hierarchy
• The prefetch distance adapts to the program's runtime behavior

What is HiDISC?
[Figure: the compiler splits a program into three instruction streams: computation instructions for the Computation Processor (CP), which owns the registers; access instructions for the Access Processor (AP), which owns the cache; and cache-management instructions for the Cache Management Processor (CMP), which manages the second-level cache and main memory.]
• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality (just-in-time fetch)
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero-overhead prefetches for maximal computation throughput

Decoupled Architectures
[Figure: four machines with the same total issue width of eight. MIPS (conventional): an 8-issue CP. DEAP (decoupled): a 3-issue CP and a 5-issue AP. CAPP: a 5-issue CP and a 3-issue CMP. HiDISC (new decoupled): a 2-issue CP, a 3-issue AP, and a 3-issue CMP, each backed by the cache and the second-level cache and main memory.]
DEAP: [Kurian, Hulina, & Coraor '94]
PIPE: [Goodman '85]
Other decoupled processors: ACRI, ZS-1, WM

Slip Control Queue
The Slip Control Queue (SCQ) adapts dynamically (a C rendering of this policy is sketched below):

    if (prefetch_buffer_full ())
        Don't change size of SCQ;
    else if ((2 * late_prefetches) > useful_prefetches)
        Increase size of SCQ;
    else
        Decrease size of SCQ;

• Late prefetches: prefetched data arrived after the load had been issued
• Useful prefetches: prefetched data arrived before the load had been issued
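As a rough C rendering of the SCQ sizing heuristic above (a sketch only: the real mechanism is a hardware queue, and the structure, the counter names, and the min_size/max_size bounds are invented here for illustration):

    /* Sketch of the SCQ sizing policy shown above; not the hardware design. */
    struct scq_state {
        int size;                  /* current slip distance (SCQ size)        */
        int late_prefetches;       /* prefetched data arrived after the load  */
        int useful_prefetches;     /* prefetched data arrived before the load */
        int prefetch_buffer_full;  /* nonzero when the prefetch buffer is full */
    };

    void adjust_scq(struct scq_state *s, int min_size, int max_size)
    {
        if (s->prefetch_buffer_full) {
            /* Prefetch buffer is full: leave the SCQ size unchanged. */
        } else if (2 * s->late_prefetches > s->useful_prefetches) {
            /* Too many late prefetches: let the AP/CMP slip further ahead. */
            if (s->size < max_size)
                s->size++;
        } else {
            if (s->size > min_size)
                s->size--;
        }
    }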
Decoupling Example (Discrete Convolution, Inner Loop)

Inner loop of the convolution:

    for (j = 0; j < i; ++j)
        y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:

    while (not EOD)
        y = y + (x * h);
    send y to SDQ

Access Processor code:

    for (j = 0; j < i; ++j) {
        load (x[j]);
        load (h[i-j-1]);
        GET_SCQ;
    }
    send (EOD token)
    send address of y[i] to SAQ

Cache Management Processor code:

    for (j = 0; j < i; ++j) {
        prefetch (x[j]);
        prefetch (h[i-j-1]);
        PUT_SCQ;
    }

SAQ: Store Address Queue; SDQ: Store Data Queue; SCQ: Slip Control Queue; EOD: End of Data

Benchmarks
• LLL1 (Livermore Loops [45], 20 lines): 1024-element arrays, 100 iterations; data set 24 KB
• LLL2 (Livermore Loops, 24 lines): 1024-element arrays, 100 iterations; data set 16 KB
• LLL3 (Livermore Loops, 18 lines): 1024-element arrays, 100 iterations; data set 16 KB
• LLL4 (Livermore Loops, 25 lines): 1024-element arrays, 100 iterations; data set 16 KB
• LLL5 (Livermore Loops, 17 lines): 1024-element arrays, 100 iterations; data set 24 KB
• Tomcatv (SPECfp95 [68], 190 lines): 33x33-element matrices, 5 iterations; data set < 64 KB
• MXM (NAS kernels [5], 113 lines): unrolled matrix multiply, 2 iterations; data set 448 KB
• CHOLSKY (NAS kernels, 156 lines): Cholsky matrix decomposition; data set 724 KB
• VPENTA (NAS kernels, 199 lines): invert three pentadiagonals simultaneously; data set 128 KB
• Qsort (Quicksort sorting algorithm [14], 58 lines): quicksort; data set 128 KB

Simulation Parameters
• L1 cache: 4 KB, 2-way set associative, 32 B blocks
• L2 cache: 16 KB, 2-way set associative, 32 B blocks
• Memory latency: variable (0 to 200 cycles); memory contention time: variable
• Victim cache: 32 entries; prefetch buffer: 8 entries
• Load queue: 128 entries; store address queue: 128 entries; store data queue: 128 entries
• Total issue width: 8

Simulation Results
[Figure: simulation results for LLL3, Tomcatv, Vpenta, and Cholsky, comparing MIPS, DEAP, CAPP, and HiDISC as main memory latency varies from 0 to 200 cycles.]

Accomplishments
• 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
• 7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor; similar operations are used in ATR/SLD
• 2.6x speedup for matrix decomposition/substitution (Cholsky) over an in-order issue superscalar processor
• Reduced memory latency for systems that have high memory bandwidth (e.g., PIMs, RAMBUS)
• Allows the compiler to solve indexing functions for irregular applications
• Reduced system cost for high-throughput scientific codes
Schedule
• April 98 (start): defined benchmarks; completed simulator; performed instruction-level simulations on hand-compiled benchmarks
• April 99: define the HiDISC architecture; benchmark results; continue simulations of more benchmarks (ATR/SLD)
• April 00: develop and test a full decoupling compiler; update the simulator; generate performance statistics and evaluate the design

Tasks
• Compiler design: selecting a front end; program flow analysis; control flow analysis; separating instructions into streams; compiler optimizations
• Simulator update
• Stressmarks: analyze, hand-compile, and simulate
• DIS benchmarks: analysis and simulation

Work in Progress
• Compiler design
• Data Intensive Systems (DIS) benchmarks analysis
• Simulator update
• Parameterization of silicon space for VLSI implementation

Compiler Requirements
• Source-level language flexibility
• Optimality of the sequential code
• Sequential assembly code for streaming
• Ease of implementation
• Portability and upgradability

Compiler Front-ends
• Trimaran: a compiler infrastructure for supporting research in compiling for ILP architectures. The system provides explicit support for EPIC architectures and is designed for the HPL-PD architecture.
• SUIF (Stanford University Intermediate Format): a platform for research on compiler techniques for high-performance machines; suitable for high-level optimization.
• GCC: part of the GNU Project, aiming at improving the compiler used in the GNU system. The GCC development effort uses an open development environment and supports many platforms.

Trimaran Architecture
• Machine description facility (MDES) that describes ILP architectures
• Compiler front end (IMPACT) for C
• Compiler back end (ELCOR), parameterized by MDES, which performs machine-dependent optimizations
+ Support for predication
+ Explicit support for EPIC architectures
+ Software pipelining
- Low portability
- Currently only a C front end is available
SUIF Architecture
• Part of the national compiler infrastructure project
• Designed to support collaborative research in optimizing and parallelizing compilers
• Originally designed to support high-level program analysis of C and Fortran programs
+ Highly modular
+ Portable
- More suited to high-level optimizations
- Only has a front end for C

Gcc-2.95 Features
• Localized register spilling and global common subexpression elimination using lazy code motion algorithms
• An enhanced control flow graph analysis framework that simplifies control dependence analysis, which is used by aggressive dead code elimination algorithms
+ Provision to add modules for instruction scheduling and delayed branch execution
+ Front ends for C, C++, and Fortran are available
+ Support for different environments and platforms
+ Cross compilation
- Theoretical power and simplicity are secondary

Compiler Organization (HiDISC Compilation Overview)
[Figure: the source program is compiled by GCC into assembly code; a stream separator splits it into computation, access, and cache-management assembly code; each stream is assembled separately into object code for the corresponding HiDISC processor.]

HiDISC Stream Separator
[Figure: the sequential source is turned into a program flow graph, address registers are classified, and instructions are allocated to the computation and access streams (current work). Future work: fix conditional statements; move queue accesses into instructions; move loop invariants out of the loop; add slip control queue instructions; substitute prefetches for loads, remove global stores, and reverse the SCQ direction; add global data communication and synchronization; and produce the computation, access, and cache-management assembly code.]

Compiler Front End Optimizations
• Jump optimization: simplify jumps to the following instruction, jumps across jumps, and jumps to jumps
• Jump threading: detect a conditional jump that branches to an identical or inverse test
• Delayed branch execution: find instructions that can go into the delay slots of other instructions
• Constant propagation: propagate constants into a conditional loop

Compiler Front End Optimizations (contd.)
• Instruction combination: combine groups of two or three instructions that are related by data flow into a single instruction
• Instruction scheduling: look for instructions whose output will not be available by the time it is used in subsequent instructions
• Loop optimizations: move constant expressions out of loops and perform strength reduction (an illustrative before/after example follows below)
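A small before/after pair showing the flavor of the loop optimizations listed above. This is a hand-written illustration, not output of the actual gcc port; the function names and the constants are assumptions made for the example. A loop-invariant product is hoisted out of the loop, and the multiply inside the subscript is strength-reduced to a pointer increment.

    /* Before: the loop recomputes k * 2 and the subscript i * 4 on
     * every iteration. */
    void scale_before(int *a, int n, int k)
    {
        for (int i = 0; i < n; i++)
            a[i * 4] = k * 2 * a[i * 4];
    }

    /* After (mimicking loop-invariant code motion and strength
     * reduction): the constant k * 2 is computed once, and the i * 4
     * subscript becomes an incremented pointer. */
    void scale_after(int *a, int n, int k)
    {
        int c = k * 2;              /* loop-invariant, hoisted            */
        int *p = a;                 /* strength-reduced address stream    */
        for (int i = 0; i < n; i++, p += 4)
            *p = c * *p;
    }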
SAR ATR/SLD Benchmark
• Second-level detection (SLD) algorithm of the template-based automatic target recognition (ATR) benchmark, using SAR (synthetic aperture radar) images
• Compare the in-order issue superscalar MIPS (8-way) and the Cache and Prefetch Processor (CAPP)
• The ATR/SLD benchmark exhibits good temporal and spatial locality: cache sizes larger than 2 KB reduce the miss rate to less than 0.3%

Benchmarks (Image Understanding)
[Figure: first-level cache miss rate versus first-level cache size (1 KB to 64 KB) for the SLD/ATR and DIS Image Understanding benchmarks on MIPS and CAPP.]
• DIS Image Understanding benchmark with the smallest data input set (iu1.in): the miss rate is approximately 3 percent on an 8-issue MIPS processor
• Caches are not very effective at capturing the memory references
• The HiDISC processor will therefore be a good architecture for improving the performance of the DIS Image Understanding benchmark

DIS Benchmarks
• Atlantic Aerospace DIS benchmark suite: too large for hand-compiling
• Atlantic Aerospace Stressmark suite: small data-intensive benchmarks (< 200 lines each)
  – Preliminary description released in July 2000
  – Final description (version 1.0) released in September 2000
  – Stressmark suite to be available soon

DIS Benchmark Suite
• Application-oriented benchmarks
• Many defense applications employ large data sets with non-contiguous memory access and no temporal locality
• Five benchmarks in three categories:
  – Model-based image generation: Method of Moments, Simulated SAR Ray Tracing
  – Target detection: Image Understanding, Multidimensional Fourier Transform
  – Database management: Data Management

Stressmark Suite*
• Pointer: pointer following; small blocks at unpredictable locations; can be parallelized
• Update: pointer following with memory update; small blocks at unpredictable locations
• Matrix: conjugate gradient simultaneous equation solver; access dependent on the matrix representation, likely to be irregular or mixed, with mixed levels of reuse
• Neighborhood: calculate image texture measures by finding sum and difference histograms; regular access to pairs of words at arbitrary distances
• Field: collect statistics on a large field of words; regular access with little reuse
• Corner-Turn: matrix transposition; block movement between processing nodes with practically nil computation
• Transitive Closure: find the all-pairs shortest-path solution for a directed graph; access dependent on the matrix representation, but requires reads and writes to different matrices concurrently
* DIS Stressmark Suite Version 1.0, Atlantic Aerospace Division

Example of Stressmarks: Pointer
• Basic idea: repeatedly follow pointers to randomized locations in memory
• The memory access pattern is unpredictable
• The randomized access pattern provides insufficient temporal and spatial locality for conventional cache architectures
• The HiDISC architecture provides lower memory access latency (a sketch of the access pattern follows below)
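A minimal sketch of the pointer-following idea described above. It is illustrative only: the Atlantic Aerospace Stressmark specification defines the actual field layout and hop count, and the function name chase_pointers and its parameters are invented here. The key property is that every load depends on the value returned by the previous load, so no future address can be computed early and a conventional prefetcher has nothing regular to latch onto.

    #include <stddef.h>

    /* field[] holds pseudo-random indices into itself; start is the first hop.
     * Each iteration's address is known only after the previous load returns. */
    size_t chase_pointers(const size_t *field, size_t field_len,
                          size_t start, size_t hops)
    {
        size_t idx = start;
        for (size_t i = 0; i < hops; i++)
            idx = field[idx % field_len];   /* data-dependent next address */
        return idx;
    }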
Decoupling of the Pointer Stressmark

Inner loop (original):

    for (i = j+1; i < w; i++) {
        if (field[index+i] > partition) balance++;
    }
    if (balance+high == w/2) break;
    else if (balance+high > w/2) {
        min = partition;
    } else {
        max = partition;
        high++;
    }

Computation Processor code:

    while (not EOD)
        if (field > partition) balance++;
    if (balance+high == w/2) break;
    else if (balance+high > w/2) {
        min = partition;
    } else {
        max = partition;
        high++;
    }

Access Processor code:

    for (i = j+1; i < w; i++) {
        load (field[index+i]);
        GET_SCQ;
    }
    send (EOD token)

Cache Management Processor code (inner loop for the next indexing):

    for (i = j+1; i < w; i++) {
        prefetch (field[index+i]);
        PUT_SCQ;
    }

Stressmarks
• Hand-compile the seven individual benchmarks
  – Use gcc as the front end
  – Manually partition each benchmark into the three instruction streams and insert synchronizing instructions
• Evaluate architectural trade-offs
  – Updated simulator characteristics, such as out-of-order issue
  – Large L2 cache and enhanced main memory systems such as Rambus and DDR

Simulator Update
• Survey current processor architectures, focusing on commercial leading-edge technology for implementation
• Analyze the current simulator and previous benchmark results
• Enhance the memory hierarchy configurations
• Add out-of-order issue

Memory Hierarchy
• Modern processors have increasingly large on-chip L2 caches (e.g., 256 KB of L2 cache on Pentium and Athlon processors), which reduces the L1 cache miss penalty
• New main memory architectures (e.g., RAMBUS) reduce the L2 cache miss penalty

Out-of-Order Multiple Issue
• Most current advanced processors are based on the superscalar, multiple-issue paradigm: MIPS R10000, PowerPC, UltraSPARC, Alpha, and the Pentium family
• Compare the HiDISC architecture with modern superscalar processors
  – Out-of-order instruction issue
  – In-order completion for precise exception handling
• A new access-decoupling paradigm for out-of-order issue

VLSI Layout Overhead (I)
• Goal: increase the layout effectiveness of the HiDISC architecture
• Cache has become a major portion of the chip area
• Methodology: extrapolate the HiDISC VLSI layout from the MIPS R10000 processor (0.35 μm, 1996)
• The space overhead is 11.3% over a comparable MIPS processor (143.9 mm² versus 129.2 mm² when the on-chip L2 cache is included; see the table below)

VLSI Layout Overhead (II)

Component                          Original MIPS R10K (0.35 μm)   Extrapolation (0.15 μm)   HiDISC (0.15 μm)
D-Cache (32 KB)                    26 mm²                         6.5 mm²                   6.5 mm²
I-Cache (32 KB)                    28 mm²                         7 mm²                     14 mm²
TLB part                           10 mm²                         2.5 mm²                   2.5 mm²
External interface unit            27 mm²                         6.8 mm²                   6.8 mm²
Instruction fetch unit and BTB     18 mm²                         4.5 mm²                   13.5 mm²
Instruction decode section         21 mm²                         5.3 mm²                   5.3 mm²
Instruction queue                  28 mm²                         7 mm²                     0 mm²
Reorder buffer                     17 mm²                         4.3 mm²                   0 mm²
Integer functional unit            20 mm²                         5 mm²                     15 mm²
FP functional units                24 mm²                         6 mm²                     6 mm²
Clocking and overhead              73 mm²                         18.3 mm²                  18.3 mm²
Total size without L2 cache        292 mm²                        73.2 mm²                  87.9 mm²
Total size with on-chip L2 cache                                  129.2 mm²                 143.9 mm²

Architecture Schemes for a Flexible HiDISC
• Efficient VLSI layout
• HiDISC with out-of-order issue
  – Trade off issue width and resources across the processor hierarchy
  – Synchronization for each of the streams
• Utilize a high-bandwidth memory system
• Multithreading HiDISC architecture (SMT and CMP)
• Multiple HiDISC

Decoupled Conventional and DRAM Processor Threads
[Figure: the compiler splits a single program into conventional instructions for a conventional processor (CP) and memory-management instructions for a DRAM-side processor (the access processor); a synchronizer coordinates the two streams.]
• Compile a single program into two cooperating instruction streams
  – One stream runs on a conventional processor
  – The other stream runs on a DRAM processor (such as a PIM)

HiDISC with a Modern DRAM Architecture
• RAMBUS and DDR DRAM improve memory bandwidth, but latency does not improve significantly
• A decoupled access processor can fully utilize the enhanced memory bandwidth
  – The access processor generates more outstanding requests
  – The prefetching mechanism hides the memory access latency
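To make the bandwidth/latency distinction above concrete, here is a rough worked example with assumed round numbers, not measurements from this project: with a 32 B cache block (as in the simulation parameters) and a peak bandwidth of about 1.6 GB/s, the data transfer itself takes roughly 32 B / 1.6 GB/s = 20 ns, while the DRAM access latency before the first byte arrives is typically several tens of nanoseconds. Doubling the bandwidth therefore shaves only about 10 ns off each miss; what hides the remaining latency is keeping many independent requests in flight, which is precisely what the decoupled access and cache-management processors are intended to do.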
HiDISC / SMT
• The reduced memory latency of HiDISC can
  – decrease the number of threads needed by an SMT architecture
  – relieve the memory burden of an SMT architecture
  – lessen the complex issue logic of multithreading
• Functional unit utilization can increase with multithreading features on HiDISC: more instruction-level parallelism becomes possible

Multiple HiDISC on a Chip
[Figure: floorplans for a wide-issue SMT, a two-processor POSM, a four-processor POSM, and an eight-processor CMP; each processor has its own L1 instruction and data caches and TLB, sharing a level-2 cache through an L2 crossbar.]
• Flexible adaptation from multiple processors to a single processor

Multiple HiDISC: McDISC
• Problem: all extant large-scale multiprocessors perform poorly when faced with a tightly coupled parallel program
• Reason: extant machines have a long latency when communication is needed between nodes, and this long latency kills performance when executing tightly coupled programs
• The McDISC solution: provide the network interface processor (NIP) with a programmable processor that executes not only OS code (as in, e.g., the Stanford FLASH) but also user code
• Advantage: the NIP, executing user code, fetches data before it is needed
• Result: fast execution of tightly coupled parallel programs
• Execution models: lockstep, multithreaded, Nomadic Threads, etc.

The McDISC System: Memory-Centered Distributed Instruction Set Computer
[Figure: each McDISC node pairs the HiDISC processors (CP, AP, CMP) with a network interface processor (NIP) and a disc processor (DP); register links connect a node to its CP neighbors over a 3-D torus of pipelined rings. Adaptive signal and graphics PIMs (ASP, AGP), a SAR/video RAID disc farm, and a dynamic database connect sensor inputs (FLIR, SAR, video, ESS) and the understanding, inference/analysis, and decision processes to displays and the network for targeting and situation awareness.]

Summary
• Designing a compiler: porting gcc to HiDISC
• Benchmark simulation with new parameters and the updated simulator
• Analysis of architectural trade-offs for equal silicon area
• Hand-compilation of the Stressmark suite and simulation
• DIS benchmark simulation
Benchmarks
Tomcatv (SPEC '95)
• Vectorized mesh-generation program
• Six different inner loops
  – One of the loops references six different arrays
  – Cache conflicts are a problem
MXM (NAS kernels)
• Matrix multiply program
Cholsky (NAS kernels)
• Matrix decomposition/substitution program
• High memory bandwidth requirement
Vpenta (NAS kernels)
• Inverts three pentadiagonals simultaneously
Qsort
• Quicksort sorting algorithm
• Chosen as a symbolic benchmark
  – Reference patterns are not easily predictable (see the sketch below)
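For reference, a plain recursive quicksort kernel of the kind the Qsort benchmark exercises; this is a generic sketch, not the benchmark's actual source. The partition step walks the array from both ends and swaps based on comparisons against the pivot, so which elements are touched next depends entirely on the data, which is why the reference pattern resists prediction.

    /* Generic in-place quicksort on ints; illustrative of the benchmark's
     * data-dependent access pattern, not the benchmark code itself. */
    static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    void quicksort(int *v, int lo, int hi)
    {
        if (lo >= hi)
            return;
        int pivot = v[(lo + hi) / 2];
        int i = lo, j = hi;
        while (i <= j) {                    /* Hoare-style partition */
            while (v[i] < pivot) i++;
            while (v[j] > pivot) j--;
            if (i <= j)
                swap_int(&v[i++], &v[j--]);
        }
        quicksort(v, lo, j);                /* recurse on data-dependent halves */
        quicksort(v, i, hi);
    }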