Download A 100MHz to 1GHz, 0.35V to 1.5V Supply 256 x

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
21st International Conference on VLSI Design
A 100MHz to 1GHz, 0.35V to 1.5V Supply 256 x 64 SRAM Block using
Symmetrized 9T SRAM Cell with Controlled Read
Satish Anand Verkila
VLSI Circuits and Systems
Lab, E.C.E,
IISc - Bangalore-560012
[email protected]
Siva Kumar Bondada
Nano Scale Device
Research Lab,CEDT,
IISc - Bangalore-560012
sivakumar.bondada@gmai
l.com
Bharadwaj S. Amrutur
VLSI Circuits and Systems
Lab, E.C.E,
IISc - Bangalore-560012.
[email protected]
Abstract
In this paper, we present Dynamic Voltage and
Frequency Managed 256 x 64 SRAM block in 65nm
technology, for frequency ranging from 100MHz to
1GHz. The total energy is minimized for any operating
frequency in the above range and leakage energy is
minimized during standby mode. Since noise margin of
SRAM cell deteriorates at low voltages, we propose
Static Noise Margin improvement circuitry, which
symmetrizes the SRAM cell by controlling the body
bias of pull down NMOS transistor. We used a 9T
SRAM cell that isolates Read and Hold Noise Margin
and has less leakage. We have implemented an efficient
technique of pushing address decoder into zigzagsuper-cut-off in stand-by mode without affecting its
performance in active mode of operation. The Read Bit
Line (RBL) voltage drop is controlled and pre-charge
of bit lines is done only when needed for reducing
power wastage.
Fig.1 9 Transistor SRAM cell obtained by
eliminating the PMOS of the 10T cell [1].
The main barrier to achieving a low-voltage SRAM
is the increase in process variation with process
scaling. Lower supply and higher variation in
advanced processes makes the SRAM operation
difficult, and some techniques must be used to expand
its operating margin.
A standard 6 Transistor SRAM cell has significant
problems working at low voltages due to its poor read
and write noise margin. The read noise margin is
typically only half of hold noise margin and it is
difficult to robustly write into the cell at supplies less
than 0.6V. The 6T cell also has large leakages which
prevents having too many cells sharing a single bitline
column. This implies a large partitioning overhead to
accommodate buffers and muxes for each local bitline
column.
The authors in [1, 2, 7] explore the use of cells with
more than 6 transistors to reduce leakage and improve
noise margins. Even though a single cell is larger in
area compared to a 6 transistor cell, due to smaller
1. Introduction
Dynamic voltage and frequency scaling has
recently emerged as a very effective technique to trade
off power and performance of a chip dynamically [8].
The supply voltage and substrate bias are adjusted to
achieve the minimum power at any operating
frequency [9, 10]. Most of the work in this area though
has primarily focused on logic circuits. Since SRAMs
consume a significant fraction of the total chip power,
in this paper we look at the problem of designing
SRAMs with a large frequency and voltage range of
operation.
1063-9667/08 $25.00 © 2008 IEEE
DOI 10.1109/VLSI.2008.89
564
560
partitioning overheads, larger arrays with these cells
end up having smaller overall area.
The authors in [7] propose an 8 Transistor cell
which is able to separate both read and write noise
margins, but it too suffers from the bitline column
height limitation due to large leakages.
The authors in [2] propose a 7 Transistor cell which
has no read noise margin limitations. But during a read
operation, a cell storage node is in dynamic state,
which can lead to problems because of process
variations, especially with long read times.
The authors in [1] use some of the ideas above to
create a 10T SRAM cell shown in Fig. 1, mainly for
sub-threshold operation. They separate the read and
write ports and provide a read buffer consisting of 4
transistors (M7, M8, M9, Mp). This allows separation
of read and hold noise margins. The read buffer has an
extra series NMOS transistor, M9, which reduces
bitline leakage drastically. They also solve the write
noise margin problem by using the floating VDD
technique [4].
We extend the ideas proposed in the above papers
to design an SRAM with a large supply and frequency
range of operation. We propose a 9T cell with
threshold control for all the NMOS transistors to
enable a very low supply operation. The threshold
control enables symmetrizing the cell DC transfer
curve and gives better noise margins at low supply
voltages. The low leakage property of the cell also
enables large bitline columns and hence less overall
array area. Also we get good reduction in cell area
compared to 10T in which we need some room in the
layout to place PMOS Mp. Additionally we apply the
Zigzag Super Cutoff CMOS scheme [11] for the
decoder to achieve low standby power at very minimal
performance loss.
In section 2 we discuss the 9T cell and reasons for
going to 9T. Section 3 discusses how to symmetrize
the cell. In section 4 we describe an efficient scheme of
pushing the decoder into sleep mode without delay
degradation. Section 5 gives results of our DVFM
implementation. We conclude the paper in section 6.
Fig. 2. Layout of a) 8T [2] and b) 9T SRAM cells
QB=0, the 9T cell connects a high impedance node to
the bitline during the read (unlike the 10T cell).
However if the bitline capacitance is large and it is
precharged to Vdd, even under the worst case
conditions, the bitline signal doesn’t degrade by more
than 10mV.
The 9T cell shares the same leakage reduction
properties of the 10T cell due to the series stacking
effect of the NMOS M7-9. Thus in a SRAM block
having 128 cells per column using VDD = 0.35 (which
is the minimum required supply at 100MHz operation)
at 100C temperature, the ratio of Iread_cell to
Itotal_leakage is 1.96 for a 9T design and is 1.06 for
8T design. Hence we can put double the number of 9T
cells per column than in 8T design. This enables
smaller bitline partitioning and reduced area overhead.
A 9T SRAM cell has less area than 10T cell (due to
missing PMOS) and is only a bit more area compared
to the 8T SRAM cell (Fig. 2). Mosis scalable rules
have been used to layout both the cells in the figure.
The 6T SRAM cell area is 26X45 = 1170 and 8T
SRAM cell area is 72X18 = 1296. So our 9T SRAM
cell area is 72X19 = 1368. 8T has 10% more area
compared to 6T and our 9T has 17% more area
compared to 6T. 9T cell is having 16.5% less area than
10T cell reported in [1]. The area overhead is not so
high, because in 8T and 9T the pull-down NMOS
transistor size is not constrained by any condition
unlike in 6T. We used large width for bottom two
transistors in read buffer and small width for upper
transistor, so that RBL capacitance is under control and
circuit speed won’t deteriorate much.
2. 9T SRAM cell
The 9T cell is obtained by removing the PMOS Mp
from the 10T cell (Fig.1). The main effect of this is
that the node between M8 and M9 will be floating
during the hold mode (RWL=0). We don’t expect this
floating node to cause any problems during hold as it
will be disconnected from the large bitline
capacitance. Even a 10T cell will have this node to be
floating when QB=1. However during a read, if
3. Symmetric Cell for Noise Margin
Improvement
In a conventional SRAM cell, the PMOS is usually
made very weak compared to the access transistor in
the cell in order to satisfy the requirements of good
561
565
Inverter trip
point
normalised to
Vdd/2
1.05
diode
connected
1
0.95
off transistor
0.9
0.85
0.5 Vdd (V) 1
0
(a)
(b)
Fig.7 Comparison of adjusted inverter trip
point using two symmetrizing circuits
Fig. 3 Symmetrizing circuit using replica cell
inverters with a) diode connected ON
transistors; (b) OFF transistors
0.45
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
0.4
RNM (V)
9T with diode
connected
transistor
9T
Vdd (V) 1
RNM (V)
RNM (V)
Fig.4 Comparison of read noise margin across
supply range of 0.3V to 1.5V for 6T and 9T
cells.
TT
LL
HH
Vdd (V)
1
1.5
RNM (V)
TT
0.4
LL
0.3
HH
0.2
HT
LT
TL
0.1
TH
0
0
0.5
Vdd (V)
1
0.1
0.15
0.45
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
TT
LT
HT
TL
TH
LH
HL
0.2
0.4
0.6
0.8
Vdd (V)
1
1.2
1.4
technique [4]. Also the read buffer decouples the read
and hold noise margins. This in turn allows the
potential to have a symmetric DC transfer for the cell
inverter to improve the cell noise margin.
Since the PMOS will be weaker than NMOS, their
drive strengths can be equalized by raising the NMOS
threshold voltage. This can be done for a triple well
process with all the NMOS transistors sharing the
same well. We have explored two possibilities for
balancing the NMOS and PMOS drive strengths. In
one, the ON currents are matched in a replica cell
inverter by adjusting the body bias of the NMOS via a
feedback amplifier (Fig 3a). In the other, the OFF
currents are balanced in a similar way (Fig.3b).In
either case, the body bias of the pull down NMOS is
adjusted such that node A stays at Vdd/2. This body
bias is applied to all SRAM cell NMOS’s.
All our simulations are done using the PTM SPICE
models [12] for the 65nm technology node. We
emulate the process corners with fast (L), nominal (T)
and slow (H) devices by creating transistor models
with -0.1V and +0.1V threshold shift for the fast and
slow cases compared to the nominal case.
Fig.5 Change in RNM with process variations
using symmetrizing circuit using diode
connected transistor
0.5
0
-0.05
0
0.05
Vth difference (V)
Fig.8(b) Degradation of read noise margin with
supply for a •Vth=25mV between replica cell
inverter and SRAM cell for different corners
TH
TL
0.5
-0.1
0
LT
HT
0
0.1
Fig.8(a) Degradation of read noise margin with
variation effects for different •Vth between
replica cell inverter and SRAM cell
1.5
0.45
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
0.2
-0.15
9T with off
transistor
0.5
0.3
RNM (V)
6T
0
1.5
1.5
Fig.6 Change in RNM with process variations
when using symmetrizing circuit using off
transistor
write margin. Simultaneously, the access transistor is
made weaker than the driver transistor to create good
read margins. Thus the sizes of all the transistors are
coupled and the DC transfer curve of the cell inverter
is quite skewed as the PMOS becomes much weaker
than the driver NMOS. However in the 9T cell, the
write margins are improved by using the floating Vdd
562
566
the replica cell inverter and the SRAM cell at 1V TT
corner. Fig. 8(b) shows the change in noise margin
across the Vdd for different corners for a 25mV Vth
difference between the replica cell inverter and a
SRAM cell.
4. Low Leakage High Speed Decoder
Decoders have very low activity since only one of
2n output lines will be activated at a time. Additionally
the gates in the decoder are sequentially activated from
input to the output. Both of these properties can be
used in combination with zigzag super cutoff technique
[11] to reduce leakage in stand by mode without any
delay degradation for active mode.
As an example consider the 2-to-4 decoder shown
in Fig. 9. In stand by mode, by pushing output nodes
x1, x2, x3, x4 to zero, all the NANDs will have off-off
stacking and hence low leakage. The output of all
NANDs will be ‘1’ and output of all subsequent NOT
gates will be ‘0’. Hence using PMOS sleep transistors
in series for the inverters shown, even these will be in
low leakage mode. This will in turn cause all the
subsequent NAND gates of the next stage to get off-off
stacked. Thus we can achieve low leakage without too
much speed degradation. In each decode stage, only
one NOT gate will be active at any time. Also the NOT
gates across stages will be active at different times.
Hence the PMOS sleep transistors for the NOT gates
can all be merged into a single PMOS of the same size.
So by using single PMOS sleep transistor and single
NMOS sleep transistor (needed for buffer chain), the
entire decoder can be pushed into a low leakage stand
by mode. Sizing of the sleep transistors is done so as to
not degrade speed by more than 2.5%(wire delay of
decoder is also considered), yet achieve a leakage
reduction of 80%.
Fig. 9 Pushing decoder into low leakage stand
by state
The NMOS is stronger for the process corners TT,
LL and HH which are the most dominant process
corners (the first letter applies to NMOS and the
second to PMOS threshold). But for process corners
such TL, LT, TH, HT we can adjust NMOS body bias
from 0.3V to more negative voltage so as to make cell
symmetric. But for cases like LH, HL, PMOS body
bias has to be adjusted so as to make cell symmetric,
which we don’t do here.
Fig.4 shows read noise margins for the 6T, 9T,
symmetrized 9T SRAM using Diode connected
transistor (Fig. 3a) and 9T SRAM cell symmetrized
using off transistor (Fig. 3b). 9T SRAM cell has 100%
more SNM than 6T SRAM cell. 9T SRAM cell
symmetrized using diode connected transistor has at
least 15% more noise margin at lower Vdds and 40%
more noise margin at higher Vdds compared to 9T
SRAM cell. 9T SRAM cell symmetrized using OFF
transistor has 6% to 20% more Noise Margin than 9T
SRAM cell. This enables lower voltage operation with
the symmetrized cell than is possible with the regular
9T SRAM cell.
Fig.5 reveals that the noise margin for the
symmetrized 9T cell with diode connected transistor is
high enough even across process corners. Fig.6 shows
that noise margin for symmetrized 9T cell with OFF
transistor feedback is less immune to process variations
than that with diode connected transistor. This is
merely a reflection of the fact that matching OFF
currents doesn’t imply a good match of the ON
currents. Fig.7 gives the comparison of adjusted
inverter trip point using diode connected symmetrizing
circuit and off transistor symmetrizing circuit.
In reality, the replica cell inverter circuits in Fig. 3
will have a variation with respect to any cell in the
array due to both systematic and random variations.
We study the impact of this mismatch in figure 8
across supply voltages as well as the mismatch
amount. Fig. 8(a) shows the degradation of the read
noise margin as a function of the mismatch between
5. Controlled Read and DVFM
We use a replica based control technique to monitor
delay of our SRAM block and control the Read Bit
Line voltage drop [6]. For the delay mimic we have
used an extra column of SRAM cells.
Bit line sensing is done using a skewed inverter and
this single ended reading is slower because one single
buffer in SRAM cell has to discharge entire Read Bit
Line. Hence it takes more time to read than to write
into cell. In our SRAM block, reading is slower than
writing (uses floating VDD technique). write operation
is achieved by floating the cell’s VDD prior to write
[4]. In Fig.11 first cycle is write operation. As we can
see that VVDD line is dropped down to 0.64V for
563
567
1.15V operation, enabling easier writing of SRAM
cell. Afterwards the QB signal recovered to 1.15V.
For DVFM implementation, we tap the replica
bitline signal and delay it through buffers to provide
margins. The output of each buffer is captured in flip
flops [5]. The DVFM control loop can use this in its
feedback to adjust the supply voltage so that required
delay is met with some margin. We have used a flip
flop chain to monitor the delay as shown in Fig. 10 and
adjust for minimum VDD for that frequency of
operation. In Fig.11 2nd and 3rd cycles are read
operations. In Fig.11 we see the RBL line discharge is
stopped after sensed_replica_output came (as internal
word line (WL) is cut-off). Thus we can save major
fraction of SRAM block power.
Fig. 12 shows the results of performing DVFM
adjustment by using the replica bitline shown in Fig.
10 as the delay synthesizer. We have given a margin of
7.5% in terms of buffer delay after the replica bitline.
The minimum supply voltages possible over clock
frequency ranging from 100MHz to 1GHz are shown
in the Fig.12. One problem of using symmetrizing
technique for SRAM is we require separate well for
read buffer because we should not apply body bias to
it. So as to overcome this problem, what we do is use
same P-well for all NMOSs as done routinely, but
apply body bias for symmetrizing only at lower
voltages, where noise margins are low. From fig.4 we
can infer that, we need to improve RNM only for
supply voltages less than 0.6V. So we use the
following algorithmStep1: Get the minimum VDD required for the current
frequency of operation.
Step2: If VDD < 0.6V apply body bias to symmetrize
the SRAM cell (of course read buffer may become
slow, as body bias is applied to it also).
Step3: Now take what ever VDD that is given by
DVFM circuit.
In fig.12 we have also shown the increase in VDD
after applying this algorithm. Supply voltage needed is
maximum in HH corner (both PMOS and NMOS are
weak) and minimum in LL corner (both PMOS and
NMOS are strong). The maximum Vdd for 1GHz
operation is 1.5V and minimum is 0.899V. The
maximum Vdd for 100MHz operation is 0.751V and
minimum is 0.35V. Thus having a replica bitline based
delay column, local to each SRAM block, allows good
tracking across process variations, as long as these are
limited within a SRAM block.
While the replica bitline based delay mimic works
well for tracking out correlated variations across the
Fig.10 Model for implementing Read control
and DVFM
Fig.11 One cycle of write followed by two
cycles of read
entire memory bank it will not track out cell to cell
variations which are becoming significant as we scale
the technology. In order to study this, we plot the delay
difference between the replica path and the main
bitline path in Fig. 13 as a function of the threshold
voltage difference between the replica cell and the
memory cell. We observe that for threshold voltage
difference of 55mV and more the replica block is faster
than SRAM block, thus turning off SRAM word line,
leading to read failure. In the opposite direction of the
mismatch with the replica cell having more Vth than
the SRAM cells, power is wasted due to the delay in
turning off the word line. A 200ps of delay margin for
SRAM block is maintained at zero mismatch between
replica and SRAM block. This extends allowable
process variations.
564
568
1.4
HH
1.2
HL
1
Vdd (V)
speed. Replica bitline is used to provide both the delay
mimic for DVFM control, as well as used to reduce the
bitline power. Low voltage writes are achieved by
adopting the floating-Vdd technique.
TT
1.6
0.8
LH
0.6
7. References
LL
0.4
0.2
0
50
250
450
650
Frequency (MHz)
850
1050
TT with
bodybias
LL with
bodybias
[1] B.Calhoun and A.Chandrakasan,”A 256-kb 65-nm Subthreshold SRAM Design for Ultra- Low-Voltage Operation”.
IEEE JSSC , VOL. 42, NO. 3, MAR. 2007
Fig.12 Minimum supply voltage for different
frequencies at different process corners.
[2] K.Takeda et. al., “A Read-Static-Noise-Margin-Free
SRAM Cell for Low- VDD and High-Speed Applications”,
IEEE JSSC , VOL. 41, NO. 1, JAN. 2006
[3] M.Nomura et. al “Delay and Power Monitoring Schemes
for Minimizing Power Consumption by Means of Supply
and Threshold Voltage Control in Active and Standby
Modes” IEEE JSSC, VOL. 41, NO. 4, APRIL 2006
[4] M.Yamaoka, et. al. “90-nm Process-Variation Adaptive
Embedded SRAM Modules With Power-Line-Floating Write
Technique”, IEEE JSSC, VOL. 41, NO. 3, MAR. 2006
[5] M.Nakai et. al., “Dynamic voltage and frequency
management for a low-power embedded microprocessor,”
IEEE JSSC, vol. 40, no. 1, pp. 28–35, JAN. 2005.
Fig.13 Delay and power changes with process
variations
[6] Bharadwaj S.Amrutur, and Mark A. Horowitz, “A
Replica Technique for Wordline and Sense Control in LowPower SRAM’s” IEEE JSSC, VOL. 33, NO. 8, AUG. 1998
In stand by mode the decoder is completely cut off
as described earlier and SRAM block leakage is
reduced by reducing its supply to 0.3V where its hold
noise margin is 0.117V. Overall, the 256X64 block in
active mode for write operation consumes 2.624mW
and the stand-by mode power is 6uW.
[7] L. Chang et al., “Stable SRAM cell design for the 32 nm
node and beyond,” in Symp. VLSI Tech. Dig., JUNE. 2005,
pp. 128–129.
[8] M. Nakai et. al., “Dynamic voltage and frequency
management for a low-power embedded microprocessor,”
IEEE JSSC, vol. 40, no. 1, JAN. 2005, pp. 28–35.
6. Conclusions
[9] K. Nose and T. Sakurai, “Optimization of VDD and VTH
for lowpower and high-speed applications,” Proc. ASP-DAC,
Jan. 2000, pp. 469–474.
We have explored the design of a low-power high
speed Dynamic Voltage and Frequency Managed 256 x
64 SRAM block in this paper. We have presented a 9T
SRAM cell with improved the static noise margin and
low leakage. The reduced leakage enables to have a
larger number of cells in a bitline column, thus
reducing overall memory area. The noise margins at
very low supplies are further improved by
symmetrizing the cell inverter’s DC characteristics, via
adjustment of the NMOS thresholds. A replica cell
inverter is used in the feedback for this adjustment.
This enables operation even into sub threshold region
with a supply of about 0.3V. We reduce the stand-by
power of the decoder by using the zigzag cut-off
technique, with very little degradation of the decoder
[10] Nomura et.al., “Delay and Power Monitoring Schemes
for Minimizing Power Consumption by Supply and
Threshold Voltage Control in Active and Standby Modes”,
IEEE JSSC , April 2006.
[11] Min et. al., “ZigZag Super Cut-off CMOS (ZSSCMOS)
Block Activation with self adaptive voltage level controller:
An alternative to clock gating scheme in leakage dominant
era”, ISSCC 2003.
[12] PTM Spice models
565
569