The Alpha 21364 and 21464 Microprocessors: Continuing the Performance Lead Beyond Y2K
Shubu Mukherjee, Ph.D.
Principal Hardware Engineer
VSSAD Labs, Alpha Development Group
Compaq Computer Corporation
Shrewsbury, Massachusetts
Slides: 1998 Microprocessor Forum (Peter Bannon) and 1999 Microprocessor Forum (Joel Emer)
Better answers

Alpha Microprocessor Roadmap
[Figure: performance ("Higher Performance") plotted against first system ship year, 1998 to 2003. Parts shown: 21264 EV6 (0.35 µm), 21264 EV67 (0.28 µm), 21264 EV68 (0.18 µm), 21364 EV7 (0.18 µm), 21364 EV78 (0.125 µm), 21464 EV8 (0.125 µm).]

Alpha 21264 Microprocessor
Architectural Features
  - First "Out-of-Order" Alpha
  - Four-wide superscalar
  - …
Performance
  - World's Fastest Microprocessor (www.spec.org, 11/17/99)
  - 39 SPECINT95, 68 SPECFP95 @ 700 MHz
  - Intel Pentium III @ 733 MHz delivers 36 SPECINT95, 30 SPECFP95

Alpha Microprocessor Roadmap
[Figure: roadmap repeated; see Alpha Microprocessor Roadmap above.]

Alpha 21364 Goals
Leadership single stream performance
  - Higher operating frequency
  - Integrated memory interface
Leadership multiprocessor performance
  - Integrated system / multiprocessor interface

Alpha 21364 Features
System-on-a-Chip
  - Alpha 21264 core with enhancements
  - Integrated L2 Cache
  - Integrated memory controller
  - Integrated network interface
Fault-Tolerance
  - Support for lock-step operation to enable high-availability systems

21364 Chip Block Diagram
[Figure: block diagram. Labels: 21264 Core, 64K Icache, 64K Dcache, 16 L1 Miss Buffers, 16 L1 Victim Buf, L2 Cache, 16 L2 Victim Buf, Memory Controller, RAMBUS, Address In, Address Out, Network Interface, N, S, E, W, I/O.]

21364 Core
[Figure: pipeline diagram with stages numbered 0 to 6 across FETCH, MAP, QUEUE, REG, EXEC, DCACHE. Labels: Branch Predictors, Next-Line Address, L1 Ins. Cache 64KB 2-Set, Int Reg Map, Int Issue Queue (20), Reg File (80) (two copies), Exec, Addr Exec, FP Reg Map, FP Issue Queue (15), Reg File (72), FP ADD Div/Sqrt, FP MUL, L1 Data Cache 64KB 2-Set, L2 cache 1.5MB 6-Set, Victim Buffer, Miss Address.]
  - 4 Instructions / cycle
  - 80 in-flight instructions plus 32 loads and 32 stores

Integrated L2 Cache
  - 1.5 MB
  - 6-way set associative
  - 16 GB/s total read/write bandwidth
  - 16 Victim buffers for L1 -> L2
  - 16 Victim buffers for L2 -> Memory
  - ECC SECDED code
  - 12ns load to use latency

Integrated Memory Controller
  - Direct RAMbus
  - High data capacity per pin
  - 800 MHz operation
  - 30ns CAS latency pin to pin
  - 6 GB/sec read or write bandwidth
  - 100s of open pages
  - Directory based cache coherence
  - ECC SECDED

Integrated Network Interface
  - Direct processor-to-processor interconnect
  - 10 GB/second per processor
  - 15ns processor-to-processor latency
  - Out-of-order network with adaptive routing
  - Asynchronous clocking between processors
  - 3 GB/second I/O interface per processor

21364 System Block Diagram
[Figure: array of 21364 ("364") processor nodes, each labeled with attached memory (M) and I/O (IO).]

Alpha 21364 Technology
  - 0.18 µm CMOS
  - 1000+ MHz
  - 100 Watts @ 1.5 volts
  - 3.5 cm²
  - 6 Layer Metal
  - 100 million transistors
      - 8 million logic
      - 92 million RAM

Alpha 21364 Status
  - 70 SPECint95 (estimated)
  - 120 SPECfp95 (estimated)
  - RTL model running
  - Tapeout: Summer 2000

21364 Summary: System on a Chip
Integrated L2 cache and memory controller
  - outstanding single processor performance
Integrated network interface
  - high performance multi-processor systems
  - scales to large number of processors

Alpha Microprocessor Overview
[Figure: roadmap repeated; see Alpha Microprocessor Roadmap above.]

Alpha 21464 Goals
Leadership single stream performance
  - Higher operating frequency / better technology
  - New microarchitecture
  - Integrated memory interface (like 21364)
Leadership multiprocessor performance
  - Simultaneous Multithreading (with minimal change/cost)
  - Integrated system / multiprocessor interface (like 21364)

Alpha 21464 Technology Overview
Leading edge process technology – 1.2-2.0 GHz
  - 0.125 µm CMOS
  - SOI-compatible
  - Cu interconnect
  - low-k dielectrics
Chip characteristics
  - ~1.2V Vdd
  - ~250 Million transistors

Alpha 21464 Architecture Overview
  - Enhanced out-of-order execution
  - 8-wide superscalar
  - Large on-chip L2 cache
  - Direct RAMBUS interface
  - On-chip router for system interconnect
  - Glueless, directory-based, ccNUMA for up to 512-way multiprocessing
  - 4-way simultaneous multithreading (SMT)

Instruction Issue
[Figure: issue slots over time.]
Reduced function unit utilization due to dependencies

Superscalar Issue
[Figure: issue slots over time.]
Superscalar leads to more performance, but lower utilization

Predicated Issue
[Figure: issue slots over time.]
Adds to function unit utilization, but results are thrown away

Chip Multiprocessor
[Figure: issue slots over time.]
Limited utilization when only running one thread

Fine Grained Multithreading
[Figure: issue slots over time.]
Intra-thread dependencies still limit performance

Simultaneous Multithreading
[Figure: issue slots over time.]
Maximum utilization of function units by independent operations

Basic Out-of-order Pipeline
[Figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire. Structures: PC, Icache, Register Map, Regs, Dcache, Regs. Annotation: Thread-blind.]

SMT Pipeline
[Figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire. Structures: PC, Register Map, Regs, Icache, Dcache, Regs.]

Changes for SMT
Basic pipeline – unchanged
Replicated resources
  - Program counters
  - Register maps
Shared resources
  - Register file (size increased)
  - Instruction queue
  - First and second level caches
  - Translation buffers
  - Branch predictor

Multiprogrammed workload
[Chart: bars for 1T, 2T, 3T, 4T on a 0% to 250% axis. Categories: SpecInt, SpecFP, Mixed Int/FP.]

Decomposed SPEC95 Applications
[Chart: bars for 1T, 2T, 3T, 4T on a 0% to 250% axis. Categories: Turb3d, Swm256, Tomcatv.]

Multithreaded Applications
[Chart: bars for 1T, 2T, 4T on a 0% to 300% axis. Categories: Barnes, Chess, Sort, TP.]

Architectural Abstraction
  - 1 Processor with 4 Thread Processing Units (TPUs)
  - Shared hardware resources
[Figure: TPU0, TPU1, TPU2, TPU3 over shared Icache, TLB, Dcache, Scache.]

21464 System Block Diagram
[Figure: array of EV8 processor nodes, each labeled with attached memory (M) and I/O (IO); one node is annotated 0 1 2 3.]

Alpha 21464 Summary
Leadership single stream performance
  - Higher operating frequency / better technology
  - New microarchitecture
  - Integrated memory interface (like 21364)
Leadership multiprocessor performance
  - Simultaneous Multithreading (with minimal changes/cost)
  - Integrated system / multiprocessor interface (like 21364)

Maintain Performance Lead Beyond Y2K
Alpha 21364
  - Reuses 21264 microprocessor core
  - System on a chip
Alpha 21464
  - New microarchitecture
  - System on a chip
  - Simultaneous Multithreading

My Current Research: Beyond 21464?
  - The Truth Project (w/ Joel Emer)
  - The Multinet Project (w/ Rick Kessler)
  - Tightly-coupled multiprocessor networks
  - The Reliant Project (w/ Steve Reinhardt)
  - Examines different microarchitectural issues
  - Self-Checking Microprocessors using SMT, ISCA submission
  - Asim (w/ VSSAD Labs)
  - Performance Model for Alphas beyond 21464
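
Appendix sketch: issue-slot utilization
The issue-slot slides (Superscalar Issue, Fine Grained Multithreading, Simultaneous Multithreading) compare how well each scheme fills the function units. The toy model below is one way to make that comparison concrete. It is only a sketch under stated assumptions: the function names, the 4-wide issue width, and the per-cycle "ready instruction" counts are all hypothetical and are not taken from the slides or from any Alpha measurement. It also ignores carry-over, so instructions that are not issued in a cycle are simply dropped.

```python
from typing import Sequence

# Hypothetical data: READY[t][c] is how many independent instructions
# thread t could issue in cycle c. These numbers are made up for
# illustration; they are not measurements of any Alpha processor.
WIDTH = 4
THREADS = [
    [2, 0, 1, 3],
    [1, 2, 0, 1],
    [0, 1, 2, 1],
    [1, 0, 1, 2],
]


def superscalar(ready: Sequence[int], width: int) -> float:
    """One thread owns all issue slots in every cycle."""
    issued = sum(min(width, r) for r in ready)
    return issued / (width * len(ready))


def fine_grained(threads: Sequence[Sequence[int]], width: int) -> float:
    """One thread per cycle, chosen round-robin; slots are not shared."""
    cycles = len(threads[0])
    issued = 0
    for c in range(cycles):
        chosen = threads[c % len(threads)]
        issued += min(width, chosen[c])
    return issued / (width * cycles)


def smt(threads: Sequence[Sequence[int]], width: int) -> float:
    """All threads compete for the same issue slots in every cycle."""
    cycles = len(threads[0])
    issued = 0
    for c in range(cycles):
        available = sum(t[c] for t in threads)
        issued += min(width, available)
    return issued / (width * cycles)


if __name__ == "__main__":
    print("superscalar (thread 0 only):", superscalar(THREADS[0], WIDTH))
    print("fine-grained multithreading:", fine_grained(THREADS, WIDTH))
    print("simultaneous multithreading:", smt(THREADS, WIDTH))
```

With this made-up data the utilization is 0.375 for a single thread, 0.5 for fine-grained multithreading, and 0.9375 for SMT. The ordering matches the argument on the slides: within a cycle a single thread rarely has enough independent work to fill every slot, while instructions from different threads are independent of each other and can fill the gaps.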