19-2
A Pico-Joule Class, 1 GHz, 32 KByte x 64b DSP SRAM with Self Reverse Bias

Azeez J. Bhavnagarwala, Stephen V. Kosonocky, Michael Immediato, Dan Knebel & Anne-Marie Haen
IBM T J Watson Research Center, Yorktown Heights NY 10598, [email protected]

Abstract

New SRAM circuit techniques implemented in a standard 0.13 µm bulk Si CMOS process are reported in this work that (i) enable pico-joule energy dissipation per accessed bit at 1 GHz, (ii) lower total leakage power by over 80% from all unaccessed cells, during both active and standby modes, using a rigorous, self reverse biasing scheme that addresses leakage due to quantum tunneling and thermal excitation in all cell transistors, with an area, performance and noise margin penalty of less than 3% each and (iii) enable a programmable leakage reduction option that lowers leakage by over 90% when stored data is no longer desired.

Introduction

Single-cycle SRAM caches for fast, low cost wireless DSP applications such as 3G cell phones impose stringent design constraints on both active and standby power [1-2]. With increases in the spreads of Vt fluctuations seen in cell [3] and sense amplifier [4] transistors, delay variability of the bitpath increases dramatically at low operating voltages. This requires larger margins for the evaluate periods of the SRAM cell and sense amps and, as a result, increases bitpath power, which dominates SRAM active power. Techniques to match the clock path with the bitpath using dummy BLs to minimize skew have been demonstrated [5] but are necessary only in large subarrays with long bit lines, where the skew between the pulses along the clock path and bitpath becomes significant [6].

We propose an alternative scheme that employs very small subarrays with short WLs (16b) and short BLs (32b), where we trade off a rapidly improving bitpath delay and delay variability for a slowly deteriorating decode delay [7] while using proportionately smaller cell-pitch matched peripheral circuitry to reduce the area penalty that accompanies smaller subarrays. This scheme employs an Address Gated Pulse operation to synchronize and restrict switching activity, spatially and temporally, to only those regions of the array that are being accessed, for periods just long enough [8] to build sufficient signal for reads or writes. In tandem with mux-less operation along the entire bitpath, this scheme enables SRAM cell delays to track CMOS gate delays more closely at low voltages.

Past attempts [9] to reduce standby power by gating the SRAM cell ground terminal with an NFET device to produce a stacking effect severely compromise the ability of the SRAM cell to retain data. With the cell storage node at '0' now 'floating' at high impedance and no longer firmly strapped to ground through the conducting pull-down cell NFET, fluctuations in cell transistor Vt make such a scheme very risky at low Vdd, where the headroom for variation of the cell virtual ground due to random Vt variations becomes smaller with scaling. We propose an alternative where the virtual ground of the SRAM subarray is pinned to a predetermined, programmable voltage using a stack of diodes that function as regulators. The area, performance and noise margin penalty is shown to be less than 3%.

Address Gated Pulse Operation

A short (250-300 ps) pulse generated off the rising edge of the clock (Fig. 1) samples the least significant address bit. This pulse propagates only through a decode path defined by the other address bits. The decode path serves as a 'clock path' from which a sequence of pulses is generated to precharge and evaluate the bitpath in the selected global column (Fig. 2). The R/W enable pulse widths (Fig. 1) are expanded by tying to Vdd or Gnd the gates of long channel NFETs or PFETs. These expanded R/W pulses, generated in a global column, not only gate the data in/out bus to the selected global column but also demonstrate ample margin to enable only local write buffers or local sense amplifiers (LSAs) when they intersect the short pulse off the clock path selecting a subarray within a global column (Fig. 3).

Cell Pitch Matched Bitpaths

The local and global BL pairs directly drive a single cell-pitch matched cross-coupled SA. Conventional SRAMs (Fig. 4) typically decode one of many columns of BL pairs using pass gates so that the LSA may fit within a larger pitch. At low Vdd, however, the voltage drop across the parasitic R of the pass gates consumes an increasing fraction of the differential voltage developed by the cell (Fig. 5). Pass gates also add a significant additional parasitic C for the cell to discharge, increasing both BL power and cell delay. A Cell-Pitch-Matched bitpath thus lowers both bitpath power and cell delay, enabling much faster operation at lower Vdd. Parasitic Ird from cells deselected by the mux is also eliminated.

Self Reverse Biasing (SRB)

A novel SRB SRAM technique (Fig. 6) relieves the reverse bias on all SRAM cells within an accessed subarray by a pulse off the decode path that selects the subarray, so that the virtual ground of the accessed subarray is restored very close to ground before the WL selects a row of cells in that subarray. The SRB circuit PFET, with its gate grounded during normal data-retaining operation, acts as a simple regulator raising the virtual ground of unaccessed subarrays to enable leakage reduction (i) by reverse biasing the source-body junctions of all cell NFET devices, (ii) by a negative gate-source voltage on the leaking access NFET and (iii) by lowering the tunneling probability across the gate oxides of 4 of 5 affected cell devices (Figs. 7, 8). Both cell static noise margin (Fig. 9) and area (Fig. 11) penalties are less than 3%, and performance is unaffected (Fig. 12) with the SRB NFET sized to ensure that the subarray VGND stays within a few mV of GND. When data stored in the array is no longer desired, the SRB PFET gate terminal (SLP in Fig. 6) is driven to Vdd, into 'sleep' mode, raising the virtual ground to over [fraction illegible] Vdd and lowering total leakage by over 90%.

Performance & Energy

Table I lists the access time and total power of the 32 KByte array. Ebit is calculated as the Ptotal-Taccess product per data I/O.

References:
[1] T. Yamada et al., ISSCC Dig., Feb. 2002, pp. 370-371.
[2] S. Agarwala et al., ISSCC Dig., Feb. 2002, pp. 56-57.
[3] A. Bhavnagarwala et al., JSSC, Vol. 36, No. 4, pp. 658-665, Apr. 2001.
[4] B. Amrutur et al., IEEE JSSC, Vol. 35, No. 2, Feb. 2000, pp. 175-185.
[5] K. Osada et al., ISSCC Dig., Feb. 2001, pp. 168-169.
[6] K. Zhang et al., Proc. Symp. VLSI Ckts., June 2000, pp. 226-227.
[7] A. Bhavnagarwala et al., ICCD, Sept. 2001, pp. 352-357.
[8] T. Chappell et al., ISSCC Dig., Feb. 1991, pp. 50-51.
[9] M. Powell et al., ISLPED, Aug. 2000, pp. 90-95.

2003 Symposium on VLSI Circuits Digest of Technical Papers

[Figure caption, start not recoverable] ... and spatial restriction of switching activity.

Fig. 2: Array architecture: Each of 4 macros yields 16b of data. A macro is divided into 32 global columns and each global column into 4 subarrays.

Fig. 3: The decode path functions as a 'clock' path, with the pulse selecting a global column also synchronizing precharge and evaluate of the selected local and global BLs in the global column. Pulse widths decrease at [...]

[Figure schematics not recoverable. Legible signal labels: Predecoder, Global Column, WL, WE pulse, Cell-L, Cell-R, LBL, LBLB, GBL, LSA-EN, GSA-EN, Data-bar, SRB.]

Fig. 12: SRB control during active mode.

Fig. 13: SRB ckt. with a stack of PFETs programmable to enable SRB operation at nominal and at low voltages.

Table I: [caption start not recoverable] ... energy/bit from simulations of extracted netlists with 33% & 25% activity on address and data respectively. Row and column labels are not recoverable; the legible entries, in source order, are: 0.9, 1.5, 1.91, 0.97, 0.50, 14 mW, <90 mW, 0.42, 1.36, 14.92, 46.55, a current label ending in "@ 50 °C", and 1.00.
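As a rough consistency check on the organization and energy figures, the Python sketch below recomputes the array capacity from the subarray dimensions and Ebit from the power-time product. It rests on several assumptions that the text does not state outright: that "short BLs (32b)" means 32 cells (rows) per local bit line, that the Table I entries pair up as Vdd = 0.9 V / 1.5 V, access time 1.91 ns / 0.97 ns, total power 14 mW / <90 mW and Ebit 0.42 pJ / 1.36 pJ, and that "per data I/O" means dividing by the 64 data bits. The function names and the address field names are hypothetical and serve only as illustration.

```python
# Organization figures quoted in the Introduction and the Fig. 2 caption.
MACROS = 4                      # each macro yields 16b of data
GLOBAL_COLS_PER_MACRO = 32      # global columns per macro
SUBARRAYS_PER_GLOBAL_COL = 4    # subarrays per global column
BITS_PER_WL = 16                # "short WLs (16b)"
CELLS_PER_BL = 32               # "short BLs (32b)"; ASSUMED to mean 32 rows


def total_bits() -> int:
    """Total number of cells implied by the organization above."""
    return (MACROS * GLOBAL_COLS_PER_MACRO * SUBARRAYS_PER_GLOBAL_COL
            * CELLS_PER_BL * BITS_PER_WL)


def word_width() -> int:
    """Data bits per access: one 16b subarray row from each macro."""
    return MACROS * BITS_PER_WL


def address_field_widths() -> dict:
    """Bits needed to pick a global column, a subarray and a row.

    Field names are hypothetical; the paper does not give the bit mapping.
    """
    def bits_for(n: int) -> int:
        return (n - 1).bit_length()

    return {
        "global_column": bits_for(GLOBAL_COLS_PER_MACRO),
        "subarray": bits_for(SUBARRAYS_PER_GLOBAL_COL),
        "row": bits_for(CELLS_PER_BL),
    }


def energy_per_bit_pj(p_total_mw: float, t_access_ns: float,
                      data_ios: int = 64) -> float:
    """Ebit as the power-time product per data I/O (mW * ns = pJ)."""
    return p_total_mw * t_access_ns / data_ios


# ASSUMED pairing of the legible Table I entries (labels are not recoverable).
# The 1.5 V power is listed as "<90 mW", so 90 mW is used as an upper bound.
TABLE_I = [
    {"vdd_v": 0.9, "t_access_ns": 1.91, "p_total_mw": 14.0, "ebit_pj": 0.42},
    {"vdd_v": 1.5, "t_access_ns": 0.97, "p_total_mw": 90.0, "ebit_pj": 1.36},
]

if __name__ == "__main__":
    print("capacity (KByte):", total_bits() // (8 * 1024))
    print("word width (b):  ", word_width())
    print("address fields:  ", address_field_widths())
    for row in TABLE_I:
        ebit = energy_per_bit_pj(row["p_total_mw"], row["t_access_ns"])
        print(f"Vdd={row['vdd_v']} V: computed Ebit = {ebit:.2f} pJ, "
              f"listed = {row['ebit_pj']} pJ")
```

Under these assumptions the dimensions multiply out to 32 KByte with a 64b word, the three address fields sum to the 12 bits needed for 4096 words, and the computed Ebit values round to the listed 0.42 pJ and 1.36 pJ, which suggests the assumed pairing of the table entries is at least self-consistent.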