Design Example: Register Files
C.K. Ken Yang
UCLA
Courtesy of BA, MAH
EE 215B
Overview
• Reading
– Papers
• Overview
– An extreme of “SRAM” design is the register file. Register files are small SRAMs that are used heavily by the datapath. They hold very local data that is fast to access, and they often have multiple ports for simultaneous access by a number of functional units/ALUs.
– These design parameters lead to very different cell designs and performance targets. This set of notes reviews the basic concepts and shows an example of such a design.
Outline
• Architecture
– What is a register file
– 2 basic approaches
• Design Example
What Is a Register File
• Fastest memory block available to the microprocessor.
• Stores intermediate results of the microprocessor units such as ALU & MMU
• Access speed directly affects processor performance (it can determine the cycle time).
Architecture: Multi-ported Design
• At least 1 write port and 2 read ports
– Accommodate a single ALU with 2-operand instructions.
– r3 <= r2 + r1
• Superscalar designs
– Multiple functional units access the register file.
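The 2-read, 1-write port arrangement can be illustrated with a small behavioral model. This is only a sketch: the class name, the 32-entry size and the 64-bit width are assumptions made for the example, not parameters from the slides.

```python
class RegisterFile:
    """Behavioral model of a register file with 2 read ports and 1 write port.

    num_regs and width are illustrative assumptions.
    """

    def __init__(self, num_regs=32, width=64):
        self.mask = (1 << width) - 1
        self.regs = [0] * num_regs

    def cycle(self, read_addrs, write=None):
        """One access: reads return the old contents, then the write lands."""
        if len(read_addrs) > 2:
            raise ValueError("only 2 read ports are modeled")
        values = [self.regs[a] for a in read_addrs]
        if write is not None:
            addr, data = write
            self.regs[addr] = data & self.mask
        return values


rf = RegisterFile()
rf.cycle([], write=(1, 5))       # r1 <= 5
rf.cycle([], write=(2, 7))       # r2 <= 7
a, b = rf.cycle([2, 1])          # both operands read in a single access
rf.cycle([], write=(3, a + b))   # r3 <= r2 + r1
result = rf.cycle([3])[0]
```

A superscalar design would need more entries in `read_addrs` and several writes per access, which is where the port count grows.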
Example: 3-ported Cell
• Separate read/write bitlines
– Single-port reads
– Dual-port write
• Enable different design constraints
– Cell sizing
– Different pre-charge of the read-port
Architecture: Multi-banking
• Multi-porting has a large cost in peripheral circuits.
– Replicate memory into many banks
• Homogeneous – even division into a number of banks.
– Faster access to each bank.
– Smaller register size
– More MUXing circuitry
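A homogeneous split can be pictured as dividing a flat register index into a bank select and a row within the bank. The sketch below assumes contiguous banks and uses 8 banks of 16 registers as example sizes; the function name and the choice of which part of the index selects the bank are assumptions for illustration.

```python
def split_address(reg, num_banks=8, regs_per_bank=16):
    """Map a flat register index to (bank, row), assuming contiguous banks."""
    if not 0 <= reg < num_banks * regs_per_bank:
        raise ValueError("register index out of range")
    # The quotient picks the bank, the remainder picks the row inside it.
    return divmod(reg, regs_per_bank)


bank, row = split_address(37)   # 37 = 2 * 16 + 5
```

The bank select is what drives the extra MUXing circuitry: each bank is small and fast, but a multiplexer must choose which bank's output reaches the port.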
Heterogeneous Multi-banking
• Dividing the ports and registers unevenly among the banks.
– Smaller bank for the critical data
– Bigger bank for the non-critical data
• Prediction of critical data based on an algorithm similar to cache prediction.
Outline
• Architecture
• Design Example
– Itanium register file
Itanium 2 Integer Register File
• 6 ALUs share 144 x 65 bit 22 ported general registers
  • 128 GRs + 16 Kernel Registers aliased to R16-31
  • 64 data path bits plus parity
• 12 read ports and 10 write ports – 8 active, 2 inactive
  • Active and inactive writes can occur simultaneously
• Datapath bypassing on write ports between multi-media (MMU) and integer execution units (IEU)
[Figure: register file layout between the IEU and MMU; dimensions 1.00 mm and 1.37 mm]
FetzerISSCC05
Integer RF Structure
[Figure: integer register file floorplan; blocks: Address Driver, Address Repeater, Decode, Data Array, Bitline Repeater, Global Precharger, Parity State Machine]
FetzerISSCC05
Floating Point Register File
• 128 x 82 bit 18 ported general registers
• 8 read ports
  • 6 MAC data ports, 2 store data ports
• 10 write ports, 6 active 4 inactive
  • 2 MAC result ports, 4 load data ports
[Figure: register file layout between two MAC units; dimensions 1.14 mm and 1.11 mm]
FetzerISSCC05
Floating Point RF Structure
[Figure: floating point register file floorplan; blocks: Bitline Repeater/Global Precharger, Data Array, Parity State Machine, Decode, Address Repeater, Address Driver]
FetzerISSCC05
Register File Timing
[Figure: timing diagram across two clock phases. CK Phase 1: READ address decode, WRITE address decode, write bitline pre-discharge, write data bypass, read local bitline evaluate, read global bitline evaluate. CK Phase 2: register write, read local precharge, read global precharge.]
FetzerISSCC05
Write Following Reads
• Reading a register that is being written occurs very often
• Itanium solution
– Each register file access contains a READ followed by a WRITE.
– No contention, and the READ result can be used a half-cycle early.
• Another common solution
– Write bypass:
  • WRITE while READ results in a slow read since the cell is being flipped.
  • Bypass the READ with the WRITE information at the multiplexer.
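The write-bypass alternative can be sketched behaviorally as a multiplexer that selects the in-flight WRITE data whenever the read and write addresses match. The function and argument names below are illustrative assumptions, not signal names from the design.

```python
def read_with_bypass(regs, read_addr, write_addr=None, write_data=None):
    """Return the read value, forwarding pending write data on an address match."""
    if write_addr is not None and write_addr == read_addr:
        return write_data      # multiplexer selects the WRITE data
    return regs[read_addr]     # normal array read


regs = [0, 11, 22, 33]
hit = read_with_bypass(regs, 2, write_addr=2, write_data=99)
miss = read_with_bypass(regs, 1, write_addr=2, write_data=99)
```

With the bypass in place, the slow read of a cell that is being flipped never reaches the output, at the cost of an address comparator and a multiplexer per read port.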
Register File Decode
[Figure: decoder schematic for one read/write port with self-timed pulse width control; signals: address, highb, lowb, matchb, sel[i], sel[9:0], en, timer_enable, writeen, wordline, PCK2, WRITEH, NCK]
• Wordline (en) is pulsed
– PCK2X pulses each phase
– Read followed by write
• WRITEH is generated for the accessed register
FetzerISSCC05
Storage Cell
[Figure: storage cell schematic showing the storage nodes (b0/nb0, b1/nb1) and thread selection; signals: WRITEH, writel, thread, ida, idb]
• One storage node for each thread
• Storage node
– Tristated by writel to assist NFET only pass gate writes.
– writel drain connected PFETs provide extra pullup during a thread switch and make write easier.
FetzerISSCC05
Register File READ/WRITE (1)
[Figure: read/write circuit; signals: writei, writel, write, read, write bitline, read bitline, activedata, inactivedata, wordline[9:0]]
• Buffered read
– Isolate the cell from the read BL
• Additional buffering from write
– Isolate stored data from read access.
– Improve the write timing
Register File READ/WRITE (2)
[Figure: read/write circuit; signals: writei, writel, write, read, write bitline, read bitline, activedata, inactivedata, wordline[9:0]]
• Port sharing
– Active thread READ shares wordlines with inactive WRITE
– Reduce the number of total ports
Register File READ/WRITE (3)
[Figure: read/write circuit; signals: writei, writel, write, read, write bitline, read bitline, activedata, inactivedata, wordline[9:0]]
• Wordline conditioned by writel
– writel high enables the read
– writel low enables the pull up for the write.
Register File Organization
• 8 banks
– 16 registers per bank
• 8 cells per bitline
– 2 bitlines merge at the sense-amplifier
– Small number of cells
  • Logic gate as the sense amplifiers
  • Pre-charged and evaluates low (high-skew)
• 200ps access time!
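The precharge-then-evaluate-low read of one bank can be sketched as follows. This is a logical model only: using a NAND as the merging gate and having a stored 1 discharge the bitline are assumptions of the sketch, since the slides only state that a logic gate serves as the sense amplifier.

```python
CELLS_PER_BITLINE = 8   # from the organization above: 8 cells per local bitline


def read_bit(cells, selected):
    """Read one bit from a 16-register bank built from two local bitlines.

    cells: list of 16 stored bits; selected: index of the asserted read wordline.
    """
    # Precharge: both local bitlines start high.
    local = [1, 1]
    # Evaluate: the selected cell pulls its local bitline low if it stores a 1
    # (assumed polarity for this sketch).
    if cells[selected] == 1:
        local[selected // CELLS_PER_BITLINE] = 0
    # Assumed NAND merge: the output rises when either local bitline discharges.
    return 0 if (local[0] and local[1]) else 1


cells = [0] * 16
cells[11] = 1
```

Because only 8 cells load each local bitline, the bitline swing can be sensed by an ordinary skewed gate rather than a differential sense amplifier.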
Register File Read Path
[Figure: read path schematic showing the pulldown in the bitcell and the global bitline circuit; labels: read0 ... read7, reg0 ... reg7, local0, local1, LG0 ... LG8, global, read, PRECK, CK]
READ Simulation
• Just over 200ps from CK to global bitline evaluate
– PCK2X pulses twice per cycle
– Matchb is the wordline enable signal.
• Local read/write signals generated from each wordline
[Figure: simulated waveforms of Matchb, PCK2X, Read, Wordline, Global BL, Local BL]
WRITE Simulation
[Figure: storage cell schematic (output to read port and parity) with simulated waveforms for writing a “1” and writing a “0”; signals: WRITEH, writel, write, wordline, write bitline, b0, nb0, thread, ida, idb]
Floating Nodes During Write
[Figure: timer circuit built from slow long L devices (signals: NCK, nr1, enable, threadchanged, WRITEH) next to the RF storage node (signals: writel, b0, nb0)]
• The storage node in the inactive thread floats low during writes to the active thread.
• At low frequency data could be lost, so a timer is implemented on WRITEH to end the writes early.
• NCK rises and nr1 slowly drops. If the NCK phase is long enough, enable drops low, ending the write.
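The effect of the timer can be summarized as a cap on the write pulse width: at high frequency the clock phase ends the write, and at low frequency the timer does. The sketch below uses a hypothetical timer delay of 2000 ps purely for illustration; no such value is given in the slides.

```python
def write_pulse_width(nck_phase_ps, timer_delay_ps=2000):
    """Width of the WRITEH pulse: the NCK phase length, capped by the timer.

    timer_delay_ps is a hypothetical value chosen for this example.
    """
    return min(nck_phase_ps, timer_delay_ps)


fast_clock = write_pulse_width(300)      # phase shorter than the timer
slow_clock = write_pulse_width(50000)    # timer ends the write early
```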
Switching Threads
[Figure: storage cell schematic with both thread storage nodes (b0/nb0, b1/nb1); signals: WRITEH, writel, thread, ida, idb]
• The READ/WRITE I/O ports look like large caps and there is a significant amount of charge sharing
• WRITEH is held at GND when thread/thread_b change values
Switching Threads Simulation
[Figure: storage cell schematic and simulated waveforms of thread, writel, WRITEH, b0, nb0, b1, nb1 during a thread switch, annotated “Needed or b1 would fail!”]
Parity
[Figure: parity functional representation and circuit, with FETs shared with read buffering; signals: bit i-1, bit i, d0i, d0b, d1i, d1b, midp, outpb, parityin, parityout]
• Parity ripples through 32 stages in three clock cycles after a write (41 stages in four cycles in FPU)
• The two bit parity computation is 6.5 FETs per bit out of 109.5 (<6.0%)
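The stage counts are consistent with each stage folding two data bits into the rippling parity: 64 bits take 32 stages and 82 bits take 41. A minimal sketch of that ripple, with function and variable names chosen for illustration:

```python
def ripple_parity(bits, parity_in=0):
    """Ripple parity through 2-bit stages; returns (parity_out, stage_count)."""
    if len(bits) % 2:
        raise ValueError("expects an even number of bits")
    parity = parity_in
    stages = 0
    for i in range(0, len(bits), 2):
        parity ^= bits[i] ^ bits[i + 1]   # one stage folds in two data bits
        stages += 1
    return parity, stages


integer_word = [1, 0, 1, 1] + [0] * 60    # 64 data bits, three of them set
fp_word = [0] * 82                        # 82 bits

# Area overhead quoted on the slide: 6.5 of 109.5 FETs per bit.
overhead = 6.5 / 109.5
```

The ripple structure keeps the per-bit transistor cost low, which is why the parity result is only available a few cycles after the write.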
Parity State Machine
[Figure: XOR computation tree over b0, b1, b2, ... b81 of Register N; signals: thread, en, Thread Changed, write, ParitySeed, parity, StoredParity, ParityComp, ParityError]
• The parity state machine is below the data array and gets the same inputs (wordlines/write/parity_in) as a bitcell
• Parity is continuously computed and checked
– Register file outputs parity error.
– Scan can observe a parity error before the register is read
• ParityError is read with a duplicate of a register read circuit
Register File Comparison
Design            Montecito Integer   Montecito FP   McKinley Integer (ISSCC 2002)
Technology        0.09μm              0.09μm         0.18μm
Write Ports       10                  10             8
Read Ports        12                  8              12
Registers         144 x 65bit         128 x 82bit    128 x 65bit
Transistors       1.43M               1.30M          832K
Parity SM Area    0.098mm²            0.083mm²       NA
Array Area        0.930mm²            0.935mm²       1.67mm²
Decoder Area      0.330mm²            0.220mm²       0.39mm²
Global Overhead   0.012mm²            0.052mm²       0.13mm²
Total Size        1.37mm²             1.29mm²        2.2mm²
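As a quick consistency check on the comparison, the component areas can be summed against the quoted totals, and the transistor counts divided by the number of stored bits. The numbers below are taken from the comparison; the dictionary layout and the derived "FETs per bit" figure are only an illustrative way of looking at them.

```python
designs = {
    # name: (registers, bits, transistors, component areas in mm^2, total mm^2)
    "Montecito Integer": (144, 65, 1.43e6, [0.098, 0.930, 0.330, 0.012], 1.37),
    "Montecito FP":      (128, 82, 1.30e6, [0.083, 0.935, 0.220, 0.052], 1.29),
    "McKinley Integer":  (128, 65, 832e3,  [1.67, 0.39, 0.13],           2.2),
}

summary = {}
for name, (regs, bits, fets, areas, total) in designs.items():
    summary[name] = {
        "bits": regs * bits,                    # stored bits in the array
        "fets_per_bit": fets / (regs * bits),   # whole-block average
        "area_sum": sum(areas),                 # compare against quoted total
        "total": total,
    }
    print(name, summary[name])
```

The per-bit figure here averages over the whole block, so it is not directly comparable with a per-cell transistor count.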
Summary
• Register files are critical functional units similar to ALUs.
– Determine the cycle-time of a processor
• Highly constrained memory design
– Small number of entries
– Large number of ports
– Highly partitioned (tradeoff of #ports per cell versus many cells).
• Cell design is unique.
– Single-ended reads
– Buffered reads
– Multi-threading
• Sense-amplifiers are often digital logic gates
• Parity protection is increasingly critical for reliability.
Reference 3