Design Example: Register Files
C.K. Ken Yang
UCLA
[email protected]
Courtesy of BA, MAH
EE 215B

Overview
• Reading
  – Papers
• Overview
  – An extreme of “SRAM” design is the register file. Register files are small SRAMs that the datapath uses heavily. They hold very local information that is fast to access, and they often provide multiple ports so that several functional units/ALUs can access them simultaneously.
  – These design parameters lead to very different cell designs and performance targets. This set of notes reviews the basic concepts and shows an example of such a design.

Outline
• Architecture
  – What is a register file
  – 2 basic approaches
• Design Example

What Is a Register File
• Fastest memory block available to the microprocessor.
• Stores intermediate results of microprocessor units such as the ALU and MMU.
• Processor performance depends directly on register file access speed.

Architecture: Multi-ported Design
• At least 1 write port and 2 read ports
  – Accommodates a single ALU with 2-operand instructions.
  – r3 <= r2 + r1
• Superscalar designs
  – Multiple functional units access the register file.

Example: 3-ported Cell
• Separate read/write bitlines
  – Single-port reads
  – Dual-port write
• Enable different design constraints
  – Cell sizing
  – Different pre-charge of the read port

Architecture: Multi-banking
• Multi-porting has a large cost in peripheral circuits.
  – Replicate memory into many banks
• Homogeneous: even division into a number of banks.
  – Faster access to each bank.
  – Smaller register size
  – More MUXing circuitry

Heterogeneous Multi-banking
• Divides the ports and registers unevenly among the banks.
  – Smaller bank for the critical data
  – Bigger bank for the noncritical data
• Critical data is predicted with an algorithm similar to cache prediction.
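To make the port and banking ideas above concrete, here is a minimal behavioral sketch in Python of a register file with 2 read ports and 1 write port, evenly divided into banks. The class name `BankedRegisterFile`, the small sizes, and the mapping of register index to bank (high part selects the bank, low part selects the entry) are assumptions for illustration only; the slides do not specify a mapping.

```python
class BankedRegisterFile:
    """Behavioral model: 2 read ports, 1 write port, homogeneous banks."""

    def __init__(self, num_regs=8, num_banks=2):
        if num_regs % num_banks != 0:
            raise ValueError("homogeneous banking needs an even division")
        self.regs_per_bank = num_regs // num_banks
        self.banks = [[0] * self.regs_per_bank for _ in range(num_banks)]

    def _locate(self, reg):
        # Assumed mapping: quotient picks the bank (the MUX select),
        # remainder picks the register inside that bank.
        return divmod(reg, self.regs_per_bank)

    def read2(self, reg_a, reg_b):
        """Two simultaneous reads, one per read port."""
        bank_a, off_a = self._locate(reg_a)
        bank_b, off_b = self._locate(reg_b)
        return self.banks[bank_a][off_a], self.banks[bank_b][off_b]

    def write(self, reg, value):
        """Single write port."""
        bank, off = self._locate(reg)
        self.banks[bank][off] = value


rf = BankedRegisterFile(num_regs=8, num_banks=2)
rf.write(1, 5)
rf.write(2, 7)
a, b = rf.read2(2, 1)      # both operands read in one access
rf.write(3, a + b)         # r3 <= r2 + r1
r3 = rf.read2(3, 3)[0]
print(r3)                  # 12
```

The model only captures behavior; the cost the slides point to (peripheral circuits per port, MUXing per bank) is a circuit-level effect that a functional model like this does not show.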

Outline
• Architecture
• Design Example
  – Itanium register file

Itanium 2 Integer Register File
• 6 ALUs share 144 x 65 bit, 22-ported general registers
  – 128 GRs + 16 kernel registers aliased to R16-31
  – 64 datapath bits plus parity
• 12 read ports and 10 write ports
  – 8 active, 2 inactive
  – Active and inactive writes can occur simultaneously
• Datapath bypassing on write ports between multi-media (MMU) and integer execution units (IEU)
[Figure: layout with IEU and MMU labeled; dimensions 1.00 mm x 1.37 mm]
FetzerISSCC05

Integer RF Structure
[Figure: floorplan. Labeled blocks: Address Driver, Address Repeater, Decode, Data Array, Bitline Repeater, Global Precharger, Parity State Machine]
FetzerISSCC05

Floating Point Register File
• 128 x 82 bit, 18-ported general registers
• 8 read ports
  – 6 MAC data ports, 2 store data ports
• 10 write ports: 6 active, 4 inactive
  – 2 MAC result ports, 4 load data ports
[Figure: layout with two MAC units labeled; dimensions 1.14 mm x 1.11 mm]
FetzerISSCC05

Floating Point RF Structure
[Figure: floorplan. Labeled blocks: Bitline Repeater/Global Precharger, Data Array, Parity State Machine, Decode, Address Repeater, Address Driver]
FetzerISSCC05

Register File Timing
[Figure: timing diagram over CK Phase 1 and CK Phase 2. WRITE row labels: Write Addr Decode, Write Bitline Pre-discharge, Write Data Bypass, Register Write. READ row labels: Read Addr Decode, Read Local Bitline Evaluate, Read Global Bitline Evaluate, Read Local Precharge, Read Global Precharge]
FetzerISSCC05

Write Following Reads
• Reading a register that is being written occurs very often.
• Itanium solution
  – Each register file access contains a READ followed by a WRITE.
  – No contention; the READ result can be used a half-cycle early.
• Another common solution
  – Write bypass:
    • A WRITE during a READ results in a slow read, since the cell is being flipped.
    • Bypass the READ with the WRITE data at the multiplexer.
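The two approaches to a read that coincides with a write can be sketched behaviorally in Python. This is an illustrative model, not the circuit: the function names are hypothetical, the register file is a plain list, and the read-then-write version simply returns whatever was stored before the write (how the surrounding pipeline forwards the new value is outside this sketch).

```python
def access_read_then_write(regs, read_addr, write_addr, write_data):
    """One access = READ followed by WRITE, so the two never contend.

    The read sees the value stored before this access's write.
    """
    value = regs[read_addr]          # first: read
    regs[write_addr] = write_data    # then: write
    return value


def access_with_bypass(regs, read_addr, write_addr, write_data):
    """Write bypass: when the addresses match, the output multiplexer
    forwards the write data instead of waiting for the cell to flip."""
    stored = regs[read_addr]
    regs[write_addr] = write_data
    return write_data if read_addr == write_addr else stored


regs_a = [0, 10, 20, 30]
old = access_read_then_write(regs_a, read_addr=2, write_addr=2, write_data=99)
print(old, regs_a[2])        # 20 99

regs_b = [0, 10, 20, 30]
same = access_with_bypass(regs_b, read_addr=2, write_addr=2, write_data=99)
other = access_with_bypass(regs_b, read_addr=1, write_addr=2, write_data=55)
print(same, other)           # 99 10
```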

Register File Decode
[Figure: decoder schematic for one read/write port with self-timed pulse width control. Signal labels: address, highb, lowb, sel[i], sel[9:0], matchb, en, PCK2, timer_enable, writeen, wordline, WRITEH, NCK]
• Wordline (en) is pulsed
  – PCK2X pulses each phase
  – Read followed by write
• WRITEH is generated for the accessed register
FetzerISSCC05

Storage Cell
[Figure: storage cell schematic. Labels: WRITEH, writel, thread, ida, idb, nb0, b0, nb1, b1; storage nodes; thread selection]
• One storage node for each thread
• Storage node
  – Tristated by writel to assist NFET-only pass-gate writes.
  – PFETs whose drains are connected to writel provide extra pullup during a thread switch and make writes easier.
FetzerISSCC05

Register File READ/WRITE (1)
[Figure: read/write circuit. Labels: writei, writel, write bitline, write, read, read bitline, activedata, inactivedata, wordline[9:0]]
• Buffered read
  – Isolate the cell from the read BL
• Additional buffering from write
  – Isolate stored data from read access.
  – Improve the write timing

Register File READ/WRITE (2)
[Figure: read/write circuit, same labels as in (1)]
• Port sharing
  – Active thread READ shares wordlines with inactive WRITE
  – Reduce the number of total ports

Register File READ/WRITE (3)
[Figure: read/write circuit, same labels as in (1)]
• Wordline conditioned by writel
  – writel high enables the read.
  – writel low enables the pull-up for the write.

Register File Organization
• 8 banks
  – 16 registers per bank
• 8 cells per bitline
  – 2 bitlines merge at the sense amplifier
  – Small number of cells
    • Logic gates as the sense amplifiers
    • Pre-charged and evaluates low (high-skew)
    • 200 ps access time!

Register File Read Path
[Figure: read path schematic. Labels: PRECK, CK, local0, local1, read0 ... read7, reg0, LG0]
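The organization numbers given above (8 banks, 16 registers per bank, 8 cells per local bitline, 2 local bitlines merging at each sense amplifier) can be checked with a short Python sketch that decomposes a register index into bank, local bitline, and cell. The contiguous index mapping and the function name `locate` are assumptions; the slides give only the counts, and the sketch covers the 128 entries that 8 x 16 implies.

```python
BANKS = 8
REGS_PER_BANK = 16
CELLS_PER_BITLINE = 8
# Two local bitlines merge at each sense amplifier (a logic gate).
BITLINES_PER_BANK = REGS_PER_BANK // CELLS_PER_BITLINE


def locate(reg):
    """Return (bank, local_bitline, cell) for a register index.

    Assumed mapping: registers are numbered contiguously through a bank,
    and through one local bitline before the next.
    """
    if not 0 <= reg < BANKS * REGS_PER_BANK:
        raise ValueError("register index out of range")
    bank, within_bank = divmod(reg, REGS_PER_BANK)
    bitline, cell = divmod(within_bank, CELLS_PER_BITLINE)
    return bank, bitline, cell


print(BITLINES_PER_BANK)   # 2
print(locate(0))           # (0, 0, 0)
print(locate(27))          # (1, 1, 3)
print(locate(127))         # (7, 1, 7)
```

Keeping only 8 cells on each local bitline is what lets a simple logic gate stand in for a sense amplifier: each bitline carries little load, so the precharged node can be pulled low quickly.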
[Figure labels, read path: global, LG8, reg7; global bitline circuit; pulldown in bitcell; PRECK, CK, read]

READ Simulation
• Just over 200 ps from CK to global bitline evaluate
  – PCK2X pulses twice per cycle
  – Matchb is the wordline enable signal.
• Local read/write signals are generated from each wordline
[Figure: simulated waveforms. Traces: Matchb, PCK2X, Read Wordline, Global BL, Local BL]

WRITE Simulation
[Figure: storage cell schematic and simulated write waveforms. Labels: to read port and parity, WRITEH, writel, thread, ida, idb, nb0, b0, write wordline, write bitline. Waveform annotations: Writing a “1”, Writing a “0”. Traces: wordline, WRITEH, b0, write]

Floating Nodes During Write
• The storage node in the inactive thread floats low during writes to the active thread.
• At low frequency data could be lost, so a timer is implemented on WRITEH to end the writes early.
• NCK rises and nr1 slowly drops. If the NCK phase is long enough, enable drops low, ending the write.
[Figure: timer circuit and RF storage node. Labels: WRITEH, writel, nb0, b0, treadchanged, NCK, enable, nr1; slow long-L devices]

Switching Threads
• The READ/WRITE I/O ports look like large caps, and there is a significant amount of charge sharing.
• WRITEH is held at GND when thread/thread_b change values.
[Figure: storage cell schematic. Labels: WRITEH, writel, thread, ida, idb, nb0, b0, nb1, b1]

Switching Threads Simulation
[Figure: storage cell schematic and simulated waveforms. Labels: WRITEH, writel, thread, ida, idb, nb0, b0. Annotation: “Needed or b1 would fail!”]
[Figure labels, switching threads simulation: b0, b1, nb0, nb1, ida, idb, writel, thread]

Parity
[Figure: parity circuit and its functional representation. Labels: bit i-1, bit i, d0i, d0b, d1i, d1b, midp, outpb, parityin, parityout; FETs shared with read buffering]
• Parity ripples through 32 stages in three clock cycles after a write (41 stages in four cycles in the FPU).
• The two-bit parity computation is 6.5 FETs per bit out of 109.5 (<6.0%).

Parity State Machine
[Figure: parity state machine block diagram. Labels: thread, en, Thread Changed, write thread, ParitySeed, parity XOR computation tree over b0 ... b81, StoredParity, ParityComp, ParityError, Register N]
• The parity state machine is below the data array and gets the same inputs (wordlines/write/parity_in) as a bitcell.
• Parity is continuously computed and checked.
  – Register file outputs parity error.
  – Scan can observe a parity error before the register is read.
• ParityError is read with a duplicate of a register read circuit.

Register File Comparison

Design           Montecito Integer   Montecito FP    McKinley Integer (ISSCC 2002)
Technology       0.09 μm             0.09 μm         0.18 μm
Write Ports      10                  10              8
Read Ports       12                  8               12
Registers        144 x 65 bit        128 x 82 bit    128 x 65 bit
Transistors      1.43M               1.30M           832K
Parity SM Area   0.098 mm²           0.083 mm²       NA
Array Area       0.930 mm²           0.935 mm²       1.67 mm²
Decoder Area     0.330 mm²           0.220 mm²       0.39 mm²
Global Overhead  0.012 mm²           0.052 mm²       0.13 mm²
Total Size       1.37 mm²            1.29 mm²        2.2 mm²

Summary
• Register files are critical functional units similar to ALUs.
  – Determine the cycle time of a processor
• Highly constrained memory design
  – Small number of entries
  – Large number of ports
  – Highly partitioned (tradeoff of #ports per cell versus many cells).
• Cell design is unique.
  – Single-ended reads
  – Buffered reads
  – Multi-threading
• Sense amplifiers are often digital logic gates.
• Parity protection is increasingly critical for reliability.
Reference 3
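
As a functional illustration of the parity slides, the Python sketch below ripples a parity bit through a register's data bits and compares it with a stored parity bit to raise an error flag. It is a behavioral model under stated assumptions: even parity, a seed of 0, a 64-bit data word, and the names `ripple_parity` and `parity_error` are all hypothetical, and the bit-serial loop stands in for the two-bit stages of the actual circuit. The last lines recompute the parity overhead quoted above.

```python
def ripple_parity(bits, seed=0):
    """XOR the seed through every bit in turn (a ripple chain)."""
    parity = seed
    for bit in bits:
        parity ^= bit
    return parity


def parity_error(bits, stored_parity, seed=0):
    """True when the recomputed parity disagrees with the stored parity."""
    return ripple_parity(bits, seed) != stored_parity


word = [1, 0, 1, 1] + [0] * 60        # 64 data bits, three of them set
stored = ripple_parity(word)          # parity saved at write time
print(stored)                         # 1

flipped = word.copy()
flipped[5] ^= 1                       # single-bit upset after the write
print(parity_error(word, stored))     # False
print(parity_error(flipped, stored))  # True

# Overhead quoted on the Parity slide: 6.5 of 109.5 FETs per bit.
overhead_percent = round(6.5 / 109.5 * 100, 2)
print(overhead_percent)               # 5.94
```

A single parity bit detects any odd number of flipped bits in a word but not an even number, which is the usual limitation of this kind of protection.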