Advanced Topics in Pipelining: SMT and Single-Chip Multiprocessors
Priya Govindarajan, CMPE 200

Introduction
- Researchers have proposed two alternative microarchitectures that exploit multiple threads of control: simultaneous multithreading (SMT) [1] and chip multiprocessors (CMP) [2].
- CMP vs. SMT: why software and hardware trends will favor the CMP microarchitecture.
- Conclusions drawn from the performance results of comparing simulated superscalar, SMT, and CMP microarchitectures.

SMT Discussion Outline
- Introduction
- Multithreading (MT) and approaches to multithreading
- Motivation for introducing SMT
- Implementation of an SMT CPU
- Performance estimates
- Architectural abstraction

Introduction to SMT
- SMT processors augment wide (issuing many instructions at once) superscalar processors with hardware that allows the processor to execute instructions from multiple threads of control concurrently, dynamically selecting and executing instructions from many active threads simultaneously.
- Higher utilization of the processor's execution resources.
- Provides latency tolerance when a thread stalls due to cache misses or data dependences.
- When multiple threads are not available, however, the SMT simply looks like a conventional wide-issue superscalar.
- SMT exploits the insight that a dynamically scheduled processor already has many of the hardware mechanisms needed to exploit TLP through multithreading: MT can be built on top of an out-of-order processor by adding per-thread register renaming and program counters, and by providing the capability for instructions from multiple threads to commit.

Multithreading: Exploiting Thread-Level Parallelism
- Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion.
- The processor must duplicate the independent state of each thread (register file, a separate PC, page table).
- Memory can be shared through the virtual memory mechanisms, which already support multiprocessing.
- Requires hardware support for switching between threads.

Two Approaches to Multithreading
- Fine-grained multithreading: switches between threads on each instruction, interleaving them in round-robin order and skipping any threads that are stalled.
- Coarse-grained multithreading: switches threads only on costly stalls.

Fine-Grained Multithreading
- Advantage: hides the throughput losses that arise from both short and long stalls.
- Disadvantage: slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads.

Coarse-Grained Multithreading
- Advantage: relieves the need for thread switching to be essentially free, and is much less likely to slow down the execution of an individual thread.
- Disadvantage: throughput losses, especially from shorter stalls. Because coarse-grained MT issues instructions from a single thread, the pipeline must be emptied or frozen when a stall occurs, and the new thread that begins executing after the stall must fill the pipeline before instructions can complete.

Simultaneous Multithreading
- SMT is a variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP at the same time it exploits ILP.
- Why? Modern multiple-issue processors often have more functional-unit parallelism available than a single thread can effectively use.
- With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without any dependences among them (see the sketch below).
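As a toy illustration of this point, the C sketch below (all names, widths, and counts are invented for illustration, not taken from the slides) fills up to eight issue slots per cycle from whichever threads have ready instructions, skipping stalled threads. A single thread rarely has enough independent work to fill all slots; several threads together usually do:

```c
#include <stdbool.h>
#include <stdio.h>

#define ISSUE_WIDTH 8

/* One entry per hardware thread: how many of its instructions are
 * ready this cycle (not blocked by dependences), and whether the
 * thread is stalled (e.g., waiting on a cache miss). */
typedef struct {
    int  ready_insts;
    bool stalled;
} thread_state;

/* Fill up to ISSUE_WIDTH slots from all threads (SMT) or from a
 * single thread (conventional superscalar). Returns slots used. */
static int issue_cycle(thread_state t[], int nthreads) {
    int slots = 0;
    for (int i = 0; i < nthreads && slots < ISSUE_WIDTH; i++) {
        if (t[i].stalled) continue;          /* skip stalled threads */
        while (t[i].ready_insts > 0 && slots < ISSUE_WIDTH) {
            t[i].ready_insts--;              /* issue one instruction */
            slots++;
        }
    }
    return slots;
}

int main(void) {
    /* A single thread rarely has 8 independent instructions ready... */
    thread_state one[]  = { {3, false} };
    /* ...but four threads together often do (one is stalled here). */
    thread_state four[] = { {3, false}, {2, false}, {3, true}, {4, false} };

    printf("superscalar, 1 thread: %d/%d slots\n", issue_cycle(one, 1), ISSUE_WIDTH);
    printf("SMT, 4 threads:        %d/%d slots\n", issue_cycle(four, 4), ISSUE_WIDTH);
    return 0;
}
```

Real SMT issue logic operates on renamed instructions in hardware queues; the sketch only shows the slot-filling arithmetic behind the utilization argument.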
Basic Out-of-Order Pipeline / SMT Pipeline
[Diagram slides: the baseline out-of-order pipeline and the SMT pipeline.]

Challenges for an SMT Processor
- Dealing with the larger register file needed to hold multiple contexts.
- Maintaining low overhead on the clock cycle, particularly in issue and completion.
- Ensuring that cache conflicts caused by simultaneous execution of multiple threads do not cause significant performance degradation.

SMT Promise
- SMT will significantly enhance multistream performance across a wide range of applications, without significant hardware cost and without major architectural changes.

Instruction Issue
[Diagram slides: issue-slot utilization under each scheme.]
- Single-threaded issue: reduced functional-unit utilization due to dependences.
- Superscalar issue: more performance, but lower utilization.
- Simultaneous multithreading: maximum utilization of the functional units by independent operations.
- Fine-grained multithreading: interleaving leaves no empty issue cycles, but intra-thread dependences still limit performance.

Architectural Abstraction
- 1 CPU with 4 thread processing units (TPUs).
- Shared hardware resources.
[Diagram slide: system block diagram.]

Changes for SMT
- Basic pipeline: unchanged.
- Replicated resources: program counters, register maps.
- Shared resources: register file (size increased), instruction queue, first- and second-level caches, translation buffers, branch predictor. (A sketch of this split follows.)
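A minimal sketch of the replicated-versus-shared split just listed, expressed as simulator-style data structures (the structure names and sizes here are invented for illustration): the per-thread state is small, while the expensive back-end structures are shared by all threads:

```c
#define HW_THREADS 4

/* Replicated per hardware thread: architectural front-end state. */
typedef struct {
    unsigned long pc;               /* per-thread program counter */
    int rename_map[32];             /* per-thread register map:
                                       architectural -> physical reg */
} thread_context;

/* Shared by all threads: the enlarged, expensive structures. */
typedef struct {
    long phys_regs[256];            /* one enlarged physical register file */
    int  instruction_queue[64];     /* shared issue queue */
    /* caches, TLBs, and the branch predictor are likewise shared */
    thread_context ctx[HW_THREADS]; /* the only replicated state */
} smt_core;
```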
Multithreaded Application Performance
[Chart slide: performance of multithreaded applications.]

Single-Chip Multiprocessor
- CMPs use relatively simple single-thread processor cores to exploit only moderate amounts of parallelism within any one thread, while executing multiple threads in parallel across multiple processor cores.
- If an application cannot be effectively decomposed into threads, CMPs will be underutilized.

Comparing Alternative Architectures
- Superscalar architecture: issues up to 12 instructions per cycle.
- SMT architecture: 8 separate PCs, executing instructions from 8 different threads concurrently; multi-banked caches.
- Chip multiprocessor architecture: 8 small 2-issue superscalar processors; depends on TLP.

SMT and Memory
- Places large demands on memory: SMT requires more bandwidth from the primary cache, since multithreading allows more loads and stores in flight.
- To provide this bandwidth, the design uses 128-KByte caches and a complex MESI (modified, exclusive, shared, invalid) cache-coherence protocol.

CMP and Memory
- The eight cores are independent and integrated with their individual pairs of caches (another form of clustering), which leads to a high-frequency design for the primary cache system.
- The small cache size and tight connection to these caches allow single-cycle access.
- Needs only a simpler coherence scheme.

Quantitative Comparison: CPU Cores
- To keep its execution units busy, SMT features advanced branch prediction, register renaming, out-of-order issue, and non-blocking data caches, which makes it inherently complex; the number of registers increases, and the number of ports on each register file must increase as well.
- The CMP approach keeps the hardware simple: exploit TLP using more processors instead of larger issue widths within a single processor.

SMT Approach: Cycle Time
- Longer cycle times: long, high-capacitance I/O wires span the large buffers, queues, and register files, and the extensive use of multiplexers and crossbars to interconnect these units adds more capacitance. The delays associated with these structures dominate the delay along the CPU's critical path.
- The cycle-time impact of these structures can be mitigated by careful design: deep pipelining, and breaking the structures into small, fast clusters of closely related components connected by short wires.
- But deep pipelining increases branch-misprediction penalties, and clustering tends to reduce the ability of the processor to find and exploit instruction-level parallelism.

CMP Solution
- A short cycle time can be targeted with relatively little design effort, since the hardware is naturally clustered: each of the small CPUs is already a very small, fast cluster of components.
- Since the OS allocates a single software thread of control to each processor, the partitioning of work among the "clusters" is natural and requires no hardware to dynamically allocate instructions to different clusters.
- Heavy reliance on software to direct instructions to clusters limits the amount of ILP a CMP can exploit, but allows the clusters within the CMP to be small and fast.

SMT and CMP
- From a purely architectural point of view, the SMT processor's flexibility makes it superior.
- However, the need to limit the effects of interconnect delays, which are becoming much slower than transistor gate delays, will also drive billion-transistor chip design: interconnect delays will force the microarchitecture to be partitioned into small, localized processing elements.
- CMP is much more promising in this respect, because it is already partitioned into individual processing cores. Because these cores are relatively simple, they are amenable to speed optimization and can be designed relatively easily.

Compiler Support for SMT and CMP
- Programmers must find TLP in order to maximize CMP performance (see the pthreads sketch at the end of this part).
- SMT also requires programmers to explicitly divide code into threads to get maximum performance, but unlike CMP it can dynamically find more ILP if TLP is limited.
- With multithreaded operating systems, these problems should prove less daunting.
- Having all eight CPUs on a single chip allows designers to exploit TLP even when threads communicate frequently.

Performance Results
- A comparison of the three architectures indicates that a multiprocessor on a chip will be the easiest to implement while still offering excellent performance.

Disadvantages of CMP
- When code cannot be multithreaded, only one processor can be targeted to the task.
- However, a single 2-issue processor on a CMP is only moderately slower than a superscalar or SMT, since applications with little thread-level parallelism also tend to lack ILP.

Conclusion on CMP
- CMP is a promising candidate for a billion-transistor architecture: it offers superior performance using simple hardware.
- For code that can be parallelized into multiple threads, the small CMP cores will perform comparably or better.
- It is easier to design and optimize.
- SMTs use resources more efficiently than CMPs, but more execution units can be included in a CMP of similar area, since less die area need be devoted to wide-issue logic.
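To make the "programmers must find TLP" point concrete, here is a minimal pthreads sketch (a hypothetical example, not from the text) that explicitly decomposes an array sum into one software thread per core of an 8-core CMP:

```c
/* build: cc -pthread sum.c */
#include <pthread.h>
#include <stdio.h>

#define NCORES 8
#define N      (1 << 20)

static double a[N];
static double partial[NCORES];

static void *sum_chunk(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NCORES), hi = lo + N / NCORES;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += a[i];
    partial[id] = s;     /* no sharing between threads until the join */
    return NULL;
}

int main(void) {
    pthread_t tid[NCORES];
    for (long i = 0; i < N; i++) a[i] = 1.0;

    /* One software thread per core: the programmer, not the hardware,
     * has decomposed the work. */
    for (long c = 0; c < NCORES; c++)
        pthread_create(&tid[c], NULL, sum_chunk, (void *)c);

    double total = 0.0;
    for (long c = 0; c < NCORES; c++) {
        pthread_join(tid[c], NULL);
        total += partial[c];
    }
    printf("sum = %f\n", total);
    return 0;
}
```

The decomposition, and therefore the parallelism, comes entirely from the programmer; each small core then only needs to exploit the modest ILP within its own thread.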
References
[1] D. Tullsen, S. Eggers, and H. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proc. 22nd Ann. Int'l Symp. Computer Architecture, ACM Press, New York, 1995, pp. 392-403.
[2] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, "The Case for a Single-Chip Multiprocessor," Proc. 7th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, ACM, New York, Oct. 1996, pp. 2-11.
[3] J. Borkenhagen, R. Eickemeyer, and R. Kalla, "A Multithreaded PowerPC Processor for Commercial Servers," IBM Journal of Research and Development, vol. 44, no. 6, Nov. 2000, pp. 885-898.
[4] J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen, "Converting Thread-Level Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading," ACM Transactions on Computer Systems, vol. 15, no. 2, Aug. 1997.
[5] L. Hammond, B. A. Nayfeh, and K. Olukotun, "A Single-Chip Multiprocessor," IEEE Computer, Sept. 1997.
[6] M. Gulati and N. Bagherzadeh, "Performance Study of a Multithreaded Superscalar Microprocessor," Proc. 2nd Int'l Symp. High-Performance Computer Architecture, Feb. 1996, pp. 291-301.
[7] K. Park, S.-H. Choi, Y. Chung, W.-J. Hahn, and S.-H. Yoon, "On-Chip Multiprocessor with Simultaneous Multithreading," http://etrij.etri.re.kr/etrij/pdfdata/22-04-02.pdf
[8] B. A. Nayfeh, L. Hammond, and K. Olukotun, "Evaluation of Design Alternatives for a Multiprocessor Microprocessor," Proc. 23rd Ann. Int'l Symp. Computer Architecture, May 1996, pp. 67-77.
[9] L. Hammond, B. A. Hubbert, M. Siu, M. K. Prabhu, M. Chen, and K. Olukotun, "The Stanford Hydra CMP," IEEE Micro, vol. 20, no. 2, Mar./Apr. 2000.
[10] S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen, "Simultaneous Multithreading: A Platform for Next-Generation Processors," IEEE Micro, Sept./Oct. 1997, pp. 12-18.
[11] V. Krishnan and J. Torrellas, "Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip Multiprocessor," Proc. ACM Int'l Conf. Supercomputing (ICS '98), June 1998, pp. 85-92.
[12] goethe.ira.uka.de/people/ungerer/proc-arch/EUROPAR-tutorial-slides.ppt
[13] http://www.acm.uiuc.edu/banks/20/6/page4.html
[14] Simultaneous Multithreading home page: http://www.cs.washington.edu/research/smt/

The Stanford Hydra Chip Multiprocessor
Kunle Olukotun and the Hydra Team
Computer Systems Laboratory, Stanford University

Technology and Architecture
- Transistors are cheap, plentiful, and fast; wires are cheap, plentiful, and slow.
- Moore's law: 100 million transistors by 2000.
- Wires get slower relative to transistors, and long cross-chip wires are especially slow.
- Architectural implications: plenty of room for innovation, but single-cycle communication requires localized blocks of logic.

Exploiting Program Parallelism
[Chart slide: levels of parallelism (instruction, loop, thread, process) plotted against grain size, from 1 to 1M instructions.]

Hydra Approach
- A single-chip multiprocessor architecture composed of simple, fast processors.
- Multiple threads of control, with memory renaming and thread-level speculation.
- Exploits parallelism at all levels and makes it easy to develop parallel programs.
- Keeps the design simple by taking advantage of the natural clustering of a single-chip multiprocessor.

Outline
- Base Hydra architecture
- Performance of the base architecture
- Speculative thread support
- Speculative thread performance
- Improving speculative thread performance
- Hydra prototype design
- Conclusions

The Base Hydra Design
[Block diagram: four CPUs, each with its own L1 instruction cache, L1 data cache, and memory controller, connected by a 64-bit write-through bus and a 256-bit read/replace bus to an on-chip shared L2 cache, with centralized bus arbitration mechanisms, a Rambus memory interface to DRAM main memory, and an I/O bus interface to I/O devices.]
- Single-chip multiprocessor with four processors.
- Separate primary caches; write-through data caches maintain coherence (a sketch of this idea follows).
- Shared second-level cache.
- Low-latency interprocessor communication (10 cycles).
- Separate read and write buses.
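A sketch of how write-through data caches can keep the four L1s coherent without MESI-style ownership states. This is a simplified model: the direct-mapped organization, the write-allocate policy, and all names here are assumptions for illustration, not Hydra's actual parameters:

```c
#include <stdbool.h>
#include <stdio.h>

#define NCPUS  4
#define NLINES 512          /* direct-mapped L1 data cache, for simplicity */

typedef struct {
    unsigned long tag[NLINES];
    bool          valid[NLINES];
} l1_cache;

static l1_cache l1[NCPUS];

/* Every store is written through to the shared L2 and broadcast on the
 * write bus; every OTHER CPU invalidates a matching L1 line. No L1 line
 * is ever dirty, so no ownership states are needed. */
static void store(int cpu, unsigned long addr /*, data */) {
    unsigned long line = addr / 32;          /* 32-byte lines */
    unsigned idx = line % NLINES;

    /* l2_write(addr, data);   -- write-through to the on-chip L2 */
    for (int c = 0; c < NCPUS; c++) {
        if (c == cpu) continue;
        if (l1[c].valid[idx] && l1[c].tag[idx] == line)
            l1[c].valid[idx] = false;        /* snoop-invalidate */
    }
    l1[cpu].tag[idx] = line;                 /* writing CPU keeps a copy */
    l1[cpu].valid[idx] = true;
}

int main(void) {
    store(0, 0x1000);                        /* CPU 0 writes a line */
    store(1, 0x1000);                        /* CPU 1 writes the same line */
    printf("CPU 0 copy still valid: %d\n",   /* prints 0: invalidated */
           l1[0].valid[(0x1000 / 32) % NLINES]);
    return 0;
}
```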
Hydra vs. Superscalar
[Chart: speedup of Hydra (4 x 2-way issue) relative to a superscalar (6-way issue) on compress, eqntott, m88ksim, apsi, MPEG2, applu, swim, tomcatv, OLTP, and pmake.]
- ILP only: the superscalar is 30-50% better than a single Hydra processor.
- ILP and fine-grained threads: the superscalar and Hydra are comparable.
- ILP and coarse-grained threads: Hydra is 1.5-2x better.

Problem: Parallel Software
- Parallel software is limited: hand-parallelized applications and auto-parallelized dense-matrix FORTRAN applications.
- Traditional auto-parallelization of C programs is very difficult: threads have data dependencies that require synchronization, and pointer disambiguation is difficult and expensive.

Solution: Data Speculation
- Data speculation enables parallelization without regard for data dependencies: loads and stores follow the original sequential semantics, speculation hardware ensures correctness, and synchronization is added only for performance.
- Loop parallelization is now easily automated.
- Other ways to parallelize code: break code into arbitrary threads (e.g., speculative subroutines).

Data Speculation Requirements I
- Forward data between parallel threads.
- Detect violations when reads occur too early.

Data Speculation Requirements II
[Diagram: writes after a violation in iteration i+1 are trashed; writes after successful iterations become permanent state.]
- Safely discard bad state after a violation.
- Correctly retire speculative state.

Data Speculation Requirements III
- Maintain multiple "views" of memory. (The sketch below illustrates requirements I and II.)
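A small software model of requirements I and II (hypothetical; Hydra itself keeps this state as L1 tag bits and write buffers in hardware): each speculative thread sets a per-line read bit, and a write by a less speculative thread to a line that a more speculative thread has already read flags a violation, forcing that thread to discard its buffered writes and restart:

```c
#include <stdbool.h>
#include <stdio.h>

#define NTHREADS 4      /* thread 0 = least speculative ("head") */
#define NLINES   8

/* Per-thread speculation bits, analogous to the extra L1 tag bits:
 * read_bit[t][l] is set when speculative thread t reads line l. */
static bool read_bit[NTHREADS][NLINES];
static bool violated[NTHREADS];

static void spec_read(int t, int line) {
    read_bit[t][line] = true;     /* remember: thread t consumed this line */
}

/* A write by thread t is forwarded to more speculative threads; any of
 * them that already READ the line got stale data: a RAW violation. */
static void spec_write(int t, int line) {
    for (int later = t + 1; later < NTHREADS; later++) {
        if (read_bit[later][line])
            violated[later] = true;   /* discard its buffered writes,
                                         clear its bits, restart it */
    }
}

int main(void) {
    spec_read(2, 5);    /* thread 2 (iteration i+2) reads line 5 early */
    spec_write(1, 5);   /* thread 1 (iteration i+1) then writes line 5 */
    printf("thread 2 violated: %s\n", violated[2] ? "yes" : "no");
    return 0;
}
```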
Hydra Speculation Support
[Block diagram: the base Hydra design extended with a CP2 interface on each CPU, speculation bits in each L1 data cache, and speculation write buffers (#0-#3) in front of the on-chip L2 that retire speculative state.]
- The write bus and L2 buffers provide forwarding.
- "Read" L1 tag bits detect violations.
- "Dirty" L1 tag bits and the write buffers provide backup state.
- The write buffers reorder and retire speculative state.
- Separate L1 caches with pre-invalidation and smart L2 forwarding.

Speculative Reads
[Diagram: the nonspeculative "head" CPU, earlier speculative CPUs, "me" (CPU #i), and later speculative CPUs, each with a write buffer in front of the L2.]
- On an L1 hit, the read bits are set.
- On an L1 miss, the L2 and the write buffers are checked in parallel; the newest bytes written to a line are pulled in by priority encoders on each byte (priority A-D in the original diagram).

Speculative Writes
- A CPU writes to its L1 cache and write buffer.
- "Earlier" CPUs invalidate our L1 and cause RAW hazard checks.
- "Later" CPUs just pre-invalidate our L1.
- The non-speculative write buffer drains out into the L2.

Speculation Runtime System
- Software handlers control speculative threads through the CP2 interface and track the order of all speculative threads.
- Exception routines recover from data dependency violations.
- This adds more overhead to speculation than a pure hardware approach, but is more flexible and simpler to implement.
- Complete description in "Data Speculation ..."

Creating Speculative Threads
- Speculative loops: typically one speculative thread per iteration, for for- and while-loop iterations.
- Speculative procedures: the code after a procedure call executes speculatively; procedure calls generate a speculative thread.
- Compiler support.

Base Speculative Thread Performance
[Chart: speedup of entire applications (compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, sparse1.3) on four single-issue processors.]
- Entire applications, compiled with GCC 2.7.2 -O2.
- Accurate modeling of all aspects of Hydra.

Improving the Speculative Runtime System
- Procedure support adds overhead to loops: threads are not created sequentially, so dynamic thread scheduling is necessary.
- Overheads: start and end of a loop, 75 cycles; end of an iteration, 80 cycles.
- The best performing speculative applications use loops; procedure speculation often lowers performance.

Improved Speculative Performance
[Chart: speedup with the optimized RTS vs. the base system on the same benchmarks.]
- Improves the performance of all applications, with the most improvement for applications with fine-grained threads.
- Eqntott uses procedure speculation.

Optimizing Parallel Performance
- Cache-coherent shared memory: no explicit data movement, but 100+ cycle communication latency; need to optimize for data locality by looking at cache misses (MemSpy, Flashpoint).
- Speculative threads: no explicit data independence; frequent dependence violations limit performance.

Feedback and Code Transformations
- Feedback tool: collects violation statistics (PCs, frequency, work lost) and correlates read and write PC values with source code.
- Synchronization: synchronize frequently occurring violations; use non-violating loads.
- Code motion: rearrange reads and writes to increase parallelism; delay reads and advance writes; create local copies to allow data forwarding. (A before/after example follows.)
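The code-motion transformations can be shown on a toy loop (a hypothetical example, not from the slides; `f` stands in for the loop's cross-iteration update). In the "before" version, each iteration reads the shared variable `x` at the top and writes it at the bottom, so a speculative next iteration almost always reads `x` too early; advancing the write and working from a local copy shrinks the window in which a violation can occur:

```c
#include <stdio.h>

/* Stand-in for the loop's cross-iteration update. */
static double f(double v, double ai) { return v * 0.5 + ai; }

static double loop_before(const double *a, int n, double x) {
    for (int i = 0; i < n; i++) {
        double v = x;        /* read x early in the iteration... */
        /* ... long independent computation using v ... */
        x = f(v, a[i]);      /* ...write x at the very end: iteration
                                i+1's early read of x almost always
                                violates */
    }
    return x;
}

static double loop_after(const double *a, int n, double x) {
    for (int i = 0; i < n; i++) {
        double v = x;        /* read x */
        x = f(v, a[i]);      /* advanced write: forward the new x to
                                the next iteration immediately */
        double x_local = x;  /* local copy for the rest of the body */
        /* ... long independent computation using v and x_local ... */
        (void)x_local;
    }
    return x;
}

int main(void) {
    double a[4] = {1, 2, 3, 4};
    /* Both versions compute the same result; only the placement of the
     * read and write of x changes. */
    printf("%f %f\n", loop_before(a, 4, 1.0), loop_after(a, 4, 1.0));
    return 0;
}
```

This is exactly the kind of rearrangement the feedback tool's violation statistics point at: the sequential semantics are preserved, but the producer of `x` moves as early as possible and the consumer as late as possible.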
Optimized Speculative Performance
[Chart: speedup after manual code transformations vs. the optimized RTS (no manual intervention) and base performance on the same benchmarks.]
- Violation statistics were used to manually transform the code.

Size of Speculative Write State
- The maximum size determines the write-buffer size needed for maximum performance: a non-head processor stalls when its write buffer fills up.
- Small write buffers (< 64 lines) will achieve good performance.

Maximum number of lines of write state (32-byte cache lines):

  Benchmark  | Max lines
  -----------|----------
  compress   | 24
  eqntott    | 40
  grep       | 11
  m88ksim    | 28
  wc         | 8
  ijpeg      | 32
  mpeg       | 56
  alvin      | 158
  cholesky   | 4
  ear        | 82
  simplex    | 14

Hydra Prototype
- Design based on the Integrated Device Technology (IDT) RC32364.

Conclusions
- Hydra offers a new way to design microprocessors: a single-chip multiprocessor exploits parallelism at all levels.
- Low-overhead support for speculative parallelism.
- Provides high performance on applications with medium- to large-grain parallelism.
- Allows a performance-optimization migration path for difficult-to-parallelize, fine-grain applications.

The Hydra Team
Monica Lam, Lance Hammond, Mike Chen, Ben Hubbert, Manohar Prabhu, Mike Siu, Melvyn Lim, and Maciek Kozyrczak (IDT)
URL: http://www-hydra.stanford.edu