The Alpha 21364 and 21464 Microprocessors: Continuing the Performance Lead Beyond Y2K

Shubu Mukherjee, Ph.D.
Principal Hardware Engineer
VSSAD Labs, Alpha Development Group
Compaq Computer Corporation
Shrewsbury, Massachusetts

Slides: 1998 Microprocessor Forum (Peter Bannon) and 1999 Microprocessor Forum (Joel Emer)


Alpha Microprocessor Roadmap

  [Figure: roadmap chart. Vertical axis: Higher Performance. Horizontal axis: First System Ship, 1998 to 2003.]
  - 21264 EV6: 0.35 µm
  - 21264 EV67: 0.28 µm
  - 21264 EV68: 0.18 µm
  - 21364 EV7: 0.18 µm
  - 21364 EV78: 0.125 µm
  - 21464 EV8: 0.125 µm


Alpha 21264 Microprocessor

  Architectural Features
  - First "Out-of-Order" Alpha
  - Four-wide superscalar
  - …
  Performance
  - World's Fastest Microprocessor (www.spec.org, 11/17/99)
  - 39 SPECINT95, 68 SPECFP95 @ 700 MHz
  - Intel Pentium III @ 733 MHz delivers 36 SPECINT95, 30 SPECFP95


Alpha Microprocessor Roadmap

  [Figure: roadmap chart, same as the first roadmap slide.]


Alpha 21364 Goals

  Leadership single stream performance
  - Higher operating frequency
  - Integrated memory interface
  Leadership multiprocessor performance
  - Integrated system / multiprocessor interface


Alpha 21364 Features

  System-on-a-Chip
  - Alpha 21264 core with enhancements
  - Integrated L2 Cache
  - Integrated memory controller
  - Integrated network interface
  Fault-Tolerance
  - Support for lock-step operation to enable high-availability systems


21364 Chip Block Diagram

  [Figure: chip block diagram. Labels: 21264 Core; 64K Icache; 64K Dcache; 16 L1 Miss Buffers; 16 L1 Victim Buf; L2 Cache; 16 L2 Victim Buf; Memory Controller (RAMBUS); Network Interface (N, S, E, W, I/O); Address In; Address Out.]


21364 Core

  [Figure: core pipeline diagram. Stages, numbered 0 to 6: FETCH, MAP, QUEUE, REG, EXEC, DCACHE.]
  - Fetch: Branch Predictors; Next-Line Address; L1 Ins. Cache, 64KB 2-Set
  - Integer: Int Reg Map; Int Issue Queue (20); two Reg Files (80); Exec and Addr Exec units
  - Floating point: FP Reg Map; FP Issue Queue (15); Reg File (72); FP ADD, Div/Sqrt; FP MUL
  - Memory: L1 Data Cache, 64KB 2-Set; L2 cache, 1.5MB 6-Set; Victim Buffer; Miss Address
  - 4 Instructions / cycle
  - 80 in-flight instructions plus 32 loads and 32 stores


Integrated L2 Cache

  - 1.5 MB
  - 6-way set associative
  - 16 GB/s total read/write bandwidth
  - 16 Victim buffers for L1 -> L2
  - 16 Victim buffers for L2 -> Memory
  - ECC SECDED code
  - 12ns load to use latency


Integrated Memory Controller

  - Direct RAMbus
      - High data capacity per pin
      - 800 MHz operation
      - 30ns CAS latency pin to pin
      - 6 GB/sec read or write bandwidth
  - 100s of open pages
  - Directory based cache coherence
  - ECC SECDED


Integrated Network Interface

  - Direct processor-to-processor interconnect
  - 10 GB/second per processor
  - 15ns processor-to-processor latency
  - Out-of-order network with adaptive routing
  - Asynchronous clocking between processors
  - 3 GB/second I/O interface per processor


21364 System Block Diagram

  [Figure: system block diagram. A grid of interconnected 21364 processors (364), each with attached memory (M) and I/O (IO).]


Alpha 21364 Technology

  - 0.18 µm CMOS
  - 1000+ MHz
  - 100 Watts @ 1.5 volts
  - 3.5 cm²
  - 6 Layer Metal
  - 100 million transistors
      - 8 million logic
      - 92 million RAM


Alpha 21364 Status

  - 70 SPECint95 (estimated)
  - 120 SPECfp95 (estimated)
  - RTL model running
  - Tapeout: Summer 2000


21364 Summary: System on a Chip

  Integrated L2 cache and memory controller
  - outstanding single processor performance
  Integrated network interface
  - high performance multi-processor systems
  - scales to large number of processors


Alpha Microprocessor Overview

  [Figure: roadmap chart, same as the first roadmap slide.]


Alpha 21464 Goals

  Leadership single stream performance
  - Higher operating frequency / better technology
  - New microarchitecture
  - Integrated memory interface (like 21364)
  Leadership multiprocessor performance
  - Simultaneous Multithreading (with minimal change/cost)
  - Integrated system / multiprocessor interface (like 21364)


Alpha 21464 Technology Overview

  Leading edge process technology: 1.2-2.0 GHz
  - 0.125 µm CMOS
  - SOI-compatible
  - Cu interconnect
  - low-k dielectrics
  Chip characteristics
  - ~1.2V Vdd
  - ~250 Million transistors


Alpha 21464 Architecture Overview

  - Enhanced out-of-order execution
  - 8-wide superscalar
  - Large on-chip L2 cache
  - Direct RAMBUS interface
  - On-chip router for system interconnect
  - Glueless, directory-based, ccNUMA for up to 512-way multiprocessing
  - 4-way simultaneous multithreading (SMT)


Instruction Issue

  [Figure: function-unit issue slots over time.]
  Reduced function unit utilization due to dependencies


Superscalar Issue

  [Figure: function-unit issue slots over time.]
  Superscalar leads to more performance, but lower utilization


Predicated Issue

  [Figure: function-unit issue slots over time.]
  Adds to function unit utilization, but results are thrown away


Chip Multiprocessor

  [Figure: function-unit issue slots over time.]
  Limited utilization when only running one thread


Fine Grained Multithreading

  [Figure: function-unit issue slots over time.]
  Intra-thread dependencies still limit performance


Simultaneous Multithreading

  [Figure: function-unit issue slots over time.]
  Maximum utilization of function units by independent operations


Basic Out-of-order Pipeline

  [Figure: pipeline diagram, labeled "Thread-blind".]
  - Stages: Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire
  - Structures: PC, Icache, Register Map, Regs, Dcache, Regs


SMT Pipeline

  [Figure: pipeline diagram.]
  - Stages: Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire
  - Structures: PC, Icache, Register Map, Regs, Dcache, Regs


Changes for SMT

  - Basic pipeline: unchanged
  - Replicated resources
      - Program counters
      - Register maps
  - Shared resources
      - Register file (size increased)
      - Instruction queue
      - First and second level caches
      - Translation buffers
      - Branch predictor


Multiprogrammed workload

  [Figure: bar chart. Vertical axis: 0% to 250%. Series: 1T, 2T, 3T, 4T. Groups: SpecInt, SpecFP, Mixed Int/FP.]


Decomposed SPEC95 Applications

  [Figure: bar chart. Vertical axis: 0% to 250%. Series: 1T, 2T, 3T, 4T. Groups: Turb3d, Swm256, Tomcatv.]


Multithreaded Applications

  [Figure: bar chart. Vertical axis: 0% to 300%. Series: 1T, 2T, 4T. Groups: Barnes, Chess, Sort, TP.]


Architectural Abstraction

  - 1 Processor with 4 Thread Processing Units (TPUs)
  - Shared hardware resources
  [Figure: TPU0, TPU1, TPU2, and TPU3 sharing Icache, TLB, Dcache, and Scache.]


21464 System Block Diagram

  [Figure: system block diagram. A grid of interconnected EV8 processors, each with attached memory (M) and I/O (IO); one EV8 is labeled 0123.]


Alpha 21464 Summary

  Leadership single stream performance
  - Higher operating frequency / better technology
  - New microarchitecture
  - Integrated memory interface (like 21364)
  Leadership multiprocessor performance
  - Simultaneous Multithreading (with minimal changes/cost)
  - Integrated system / multiprocessor interface (like 21364)


Maintain Performance Lead Beyond Y2K

  Alpha 21364
  - Reuses 21264 microprocessor core
  - System on a chip
  Alpha 21464
  - New microarchitecture
  - Simultaneous Multithreading
  - System on a chip


My Current Research: Beyond 21464?

  - The Truth Project (w/ Joel Emer)
  - The Multinet Project (w/ Rick Kessler)
      - Tightly-coupled multiprocessor networks
  - The Reliant Project (w/ Steve Reinhardt)
      - Examines different microarchitectural issues
      - Self-Checking Microprocessors using SMT, ISCA submission
  - Asim (w/ VSSAD Labs)
      - Performance Model for Alphas beyond 21464
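

Illustration: issue-slot utilization

The issue-slot slides (Superscalar Issue, Fine Grained Multithreading, Simultaneous Multithreading) argue that SMT fills function-unit slots that a single thread leaves empty. The toy model below is only a sketch of that argument, not a model of the 21464. The 8-wide issue width and 4 threads come from the slides; everything else is an assumption made for illustration, in particular the function name simulate, the max_ready parameter, and the choice to draw each thread's count of ready instructions uniformly at random each cycle.

```python
import random


def simulate(width=8, threads=4, cycles=10_000, max_ready=4, seed=0):
    """Toy issue-slot model (illustrative assumptions, not 21464 data).

    Each cycle, every thread has a random number (0..max_ready) of
    independent instructions ready to issue. Returns the fraction of
    issue slots filled under three policies.
    """
    rng = random.Random(seed)
    used = {"superscalar": 0, "fine_grained": 0, "smt": 0}
    for cycle in range(cycles):
        # ready[t] = instructions thread t could issue this cycle
        ready = [rng.randint(0, max_ready) for _ in range(threads)]
        # Single-threaded superscalar: only thread 0 ever issues.
        used["superscalar"] += min(width, ready[0])
        # Fine-grained multithreading: one thread per cycle, round-robin.
        used["fine_grained"] += min(width, ready[cycle % threads])
        # SMT: all threads compete for the same slots in the same cycle.
        used["smt"] += min(width, sum(ready))
    total_slots = width * cycles
    return {name: count / total_slots for name, count in used.items()}


if __name__ == "__main__":
    for name, utilization in simulate().items():
        print(f"{name:12s} {utilization:.1%}")
```

In this model the SMT count can never fall below the other two in any cycle, because the sum of ready instructions across threads is at least the count for any single thread. With threads=1 the three policies coincide.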