19-2
A Pico-Joule Class, 1 GHz, 32 KByte x 64b DSP SRAM with Self Reverse Bias
Azeez J. Bhavnagarwala, Stephen V. Kosonocky, Michael Immediato, Dan Knebel & Anne-Marie Haen
IBM T J Watson Research Center, Yorktown Heights NY 10598, [email protected]
Abstract
New SRAM circuit techniques implemented in a standard 0.13 µm bulk Si CMOS process are reported in this work that (i) enable pico-joule energy dissipation per accessed bit at 1 GHz, (ii) lower total leakage power by over 80% from all unaccessed cells, during both active and standby modes, using a rigorous, self reverse biasing scheme that addresses leakage due to quantum tunneling and thermal excitation in all cell transistors, with an area, performance and noise margin penalty of less than 3% each and (iii) enable a programmable leakage reduction option that lowers leakage by over 90% when stored data is no longer desired.

Introduction
Single-cycle SRAM caches for fast, low cost wireless DSP applications such as 3G cell phones impose stringent design constraints on both active and standby power [1-2]. With increases in the spreads of Vt fluctuations seen in cell [3] and sense amplifier [4] transistors, delay variability of the bit path increases dramatically at low operating voltages, requiring larger margins for evaluate periods of the SRAM cell and sense amps, and as a result increases bitpath power, which dominates SRAM active power. Techniques to match the clock path with the bitpath using dummy BLs to minimize skew have been demonstrated [5] but are necessary only in large subarrays with long bit lines where the skew between pulses along the clock and bitpath becomes significant [6]. We propose an alternative scheme that employs very small subarrays with short WLs (16b) and short BLs (32b) where we trade off a rapidly improving bitpath delay and delay variability for a slowly deteriorating decode delay [7] while using proportionately smaller cell-pitch matched peripheral circuitry to reduce the area penalty that accompanies smaller subarrays. This scheme employs an Address Gated Pulse operation to synchronize and restrict switching activity spatially and temporally to only regions of the array that are being accessed for periods just long enough [8] to build sufficient signal for reads or writes. In tandem with mux-less operation along the entire bitpath, this scheme enables SRAM cell delays to track CMOS gate delays more closely at low voltages. Past attempts [9] to reduce standby power by gating the SRAM cell ground terminal with an NFET device to produce a stacking effect severely compromise the ability of the SRAM cell to retain data. With the cell storage node at '0', now 'floating', at high impedance and no longer firmly strapped to ground through the conducting pull-down cell NFET, fluctuations in cell transistor Vt make such a scheme very risky at low Vdd where the headroom for variation of the cell virtual ground due to random Vt variations becomes smaller with scaling. We propose an alternative where the virtual ground of the SRAM subarray is pinned to a predetermined, programmable voltage using a stack of diodes that function as regulators. The area, performance and noise margin penalty is shown to be less than 3%.

Address Gated Pulse Operation
A short (250-300 ps) pulse generated off the rising edge of the clock (Fig. 1) samples the least significant address bit. This pulse propagates only through a decode path defined by other address bits. The decode path serves as a 'clock path' from which a sequence of pulses is generated to precharge and evaluate the bitpath in the selected global column (Fig. 2). The R/W enable pulse widths (Fig. 1) are expanded by tying to Vdd or Gnd the gates of long channel NFETs or PFETs. These expanded R/W pulses, generated in a global column, not only gate the data in/out bus to the selected global column but also demonstrate ample margin to enable only local write buffers or local sense amplifiers (LSAs) when they intersect the short pulse off the clock path selecting a subarray within a global column (Fig. 3).

Cell Pitch Matched Bitpaths
The local & global BL pairs directly drive a single cell pitch matched cross-coupled SA. Conventional SRAMs (Fig. 4) typically decode one of many columns of BL pairs using pass gates so that the LSA may fit within a larger pitch. At low Vdd, however, the voltage drop across the parasitic R of pass gates consumes an increasing fraction of the differential voltage developed by the cell (Fig. 5). Pass gates also add a significant additional parasitic C for the cell to discharge, increasing both BL power and cell delay. A Cell-Pitch-Matched bitpath thus lowers both bitpath power and cell delay, enabling much faster operation at lower Vdd. Parasitic Ird from cells deselected by the mux is also eliminated.

Self Reverse Biasing (SRB)
A novel SRB SRAM technique (Fig. 6) relieves the reverse bias on all SRAM cells within an accessed subarray by a pulse off the decode path that selects the subarray, so that the virtual ground of the accessed subarray is restored very close to ground before the WL selects a row of cells in that subarray. The SRB circuit PFET, with gate grounded during normal data-retaining operation, acts as a simple regulator raising the virtual ground of unaccessed subarrays to enable leakage reduction (i) by reverse biasing the source-body junctions of all cell NFET devices, (ii) by a negative gate-source voltage on the leaking access NFET and (iii) by lowering the tunneling probability across the gate oxides of 4 of 5 affected cell devices (Figs. 7, 8). Both cell static noise margin (Fig. 9) and area (Fig. 11) penalties are less than 3% and performance is unaffected (Fig. 12) with the SRB NFET sized to ensure that the subarray VGND stays within a few mV of GND. When data stored in the array is no longer desired, the SRB PFET gate terminal (SLP in Fig. 6) is driven to Vdd, into 'sleep' mode, raising the virtual ground to over ½ Vdd and lowering total leakage by over 90%.

Performance & Energy
Table I lists the access time and total power of the 32 KByte array. Ebit is calculated as the power-time (P·T) product per data I/O.

References:
[1] T. Yamada et al., ISSCC Dig., Feb. 2002, pp. 370-371.
[2] S. Agarwala et al., ISSCC Dig., Feb. 2002, pp. 56-57.
[3] A. Bhavnagarwala et al., JSSC, Vol. 36, No. 4, Apr. 2001, pp. 658-665.
[4] B. Amrutur et al., IEEE JSSC, Vol. 35, No. 2, Feb. 2000, pp. 175-185.
[5] K. Osada et al., ISSCC Dig., Feb. 2001, pp. 168-169.
[6] K. Zhang et al., Proc. Symp. VLSI Ckts., June 2000, pp. 226-227.
[7] A. Bhavnagarwala et al., ICCD, Sept. 2001, pp. 352-357.
[8] T. Chappell et al., ISSCC Dig., Feb. 1991, pp. 50-51.
[9] M. Powell et al., ISLPED, Aug. 2000, pp. 90-95.
2003 Symposium on VLSI Circuits Digest of Technical Papers
[Figure (caption fragment): ... and spatial restriction of switching activity. Labels: To 1 of 32 Global Columns, Pulse, Predecoder, Global Column, WL, WE pulse, Cell-L, Cell-R, LBL, LBLB.]
Fig. 2: Array architecture: Each of 4 macros yields 16b of data. A macro is divided into 32 global columns and each global column into 4 subarrays.
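
The organization in the Fig. 2 caption, combined with the 16b WLs and 32b BLs quoted in the Introduction, can be cross-checked against the 32 KByte x 64b capacity in the title. The sketch below is only a consistency check. It assumes that each subarray is a 16 x 32 cell block and that one access activates one full row in one subarray per macro. Neither assumption is stated in this form in the text, and the function and constant names are illustrative.

```python
# Geometry from the Introduction and the Fig. 2 caption.
MACROS = 4                      # each macro yields 16b of the data word
GLOBAL_COLS_PER_MACRO = 32
SUBARRAYS_PER_GLOBAL_COL = 4
WL_BITS = 16                    # cells on one word line (short WL)
BL_BITS = 32                    # cells on one bit line (short BL)


def array_summary():
    """Derive capacity and word organization from the subarray geometry."""
    # Assumption: a subarray is a WL_BITS x BL_BITS block of cells.
    cells_per_subarray = WL_BITS * BL_BITS
    subarrays = MACROS * GLOBAL_COLS_PER_MACRO * SUBARRAYS_PER_GLOBAL_COL
    total_bits = subarrays * cells_per_subarray
    # Assumption: mux-less readout delivers the whole 16b row of one
    # subarray per macro, so the word width is MACROS * WL_BITS.
    word_bits = MACROS * WL_BITS
    words = total_bits // word_bits
    return {
        "cells_per_subarray": cells_per_subarray,
        "subarrays": subarrays,
        "total_kbytes": total_bits // 8 // 1024,
        "word_bits": word_bits,
        "words": words,
        # Valid because the word count is a power of two.
        "address_bits": words.bit_length() - 1,
        # Share of subarrays switching on one access under the assumption above.
        "active_fraction": MACROS / subarrays,
    }
```

Under these assumptions the geometry gives 512 subarrays of 512 cells each, which matches 32 KByte organized as 4096 words of 64b. Only 4 of the 512 subarrays would be active on a given access, which illustrates the spatial restriction of switching activity described under Address Gated Pulse Operation.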
Fig. 3 (at right): The decode path functions as a 'clock' path with the pulse
selecting a global column also synchronizing precharge and evaluate of the
selected local and global BLs in the global column. Pulse widths decrease at
[Figure labels: WELD, GSA-EN, Data-bar, LSA-EN, GBL, Reg W SRB.]
[Table values: 0.9, 1.5, 1.91, 0.97, 0.50, 14 mW, <90 mW, 0.42, 1.36, 14.92, 46.55.]
Fig. 12: SRB control during active mode.
Fig. 13: SRB circuit with a stack of PFETs programmable to enable SRB operation at nominal and at low voltages.
... energy/bit from simulations of extracted netlists with 33% & 25% activity on address and data respectively. I, @ 50°C
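
The Performance & Energy section defines Ebit as a power-time product per data I/O. A minimal sketch of that arithmetic follows. The function name and the input values are illustrative assumptions, not figures taken from Table I. Treating the time factor as the 1 ns clock period at 1 GHz is also an assumption.

```python
def energy_per_bit_pj(power_mw, period_ns, data_ios=64):
    """Energy per accessed bit in pJ: (power x time) / number of data I/Os.

    power_mw  -- total array power in mW (hypothetical input)
    period_ns -- time factor of the product in ns (assumed to be the period)
    data_ios  -- data I/O width; 64 matches the 64b word in the title
    """
    if data_ios <= 0:
        raise ValueError("data_ios must be positive")
    # mW x ns = pJ, so no further unit conversion is needed.
    return power_mw * period_ns / data_ios
```

For example, `energy_per_bit_pj(64.0, 1.0)` returns 1.0. A hypothetical 64 mW array cycling at 1 GHz with 64 data I/Os would therefore dissipate 1 pJ per accessed bit, which is the order of magnitude implied by "pico-joule class".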