CMPEN 411 VLSI Digital Circuits, Spring 2009
Lecture 24: Peripheral Memory Circuits
[Adapted from Rabaey's Digital Integrated Circuits, Second Edition, ©2003 J. Rabaey, A. Chandrakasan, B. Nikolic]
Sp09 CMPEN 411 L24 S.1

Review: Read-Write Memories (RAMs)
  Static (SRAM)
    - data is stored as long as supply is applied
    - large cells (6 FETs/cell), so fewer bits/chip
    - fast, so used where speed is important (e.g., caches)
    - differential outputs (BL and !BL)
    - use sense amps for performance
    - compatible with CMOS technology
  Dynamic (DRAM)
    - periodic refresh required (every 1 to 4 ms) to compensate for the charge loss caused by leakage
    - small cells (1 to 3 FETs/cell), so more bits/chip
    - slower, so used for main memories
    - single-ended output (BL only)
    - need sense amps for correct operation
    - not typically compatible with CMOS technology

Peripheral Memory Circuitry
  - Row and column decoders
  - Read bit line precharge logic
  - Sense amplifiers
  - Timing and control
  Issues: speed, power consumption, area (pitch matching)

Row Decoders
  Collection of 2^M complex logic gates organized in a regular, dense fashion
  (N)AND decoder for 8 address bits
    WL(0) = !A7 & !A6 & !A5 & !A4 & !A3 & !A2 & !A1 & !A0
    …
    WL(255) = A7 & A6 & A5 & A4 & A3 & A2 & A1 & A0
  NOR decoder for 8 address bits
    WL(0) = !(A7 | A6 | A5 | A4 | A3 | A2 | A1 | A0)
    …
    WL(255) = !(!A7 | !A6 | !A5 | !A4 | !A3 | !A2 | !A1 | !A0)
  Goals: pitch matched, fast, low power

Implementing a Wide NOR Function
  Single stage 8x256 bit decoder (as in Lecture 22)
    - One 8-input NOR gate per row x 256 rows = 256 x (8+8) = 4,096 transistors
    - Pitch match and speed/power issues
  Decompose logic into multiple levels
    !WL(0) = !(!(A7 | A6) & !(A5 | A4) & !(A3 | A2) & !(A1 | A0))
    - First level is the predecoder (for each pair of address bits, form Ai|Ai-1, Ai|!Ai-1, !Ai|Ai-1, and !Ai|!Ai-1)
    - Second level is the word line driver
  Predecoders reduce the number of transistors required
    - Four sets of four 2-bit NOR predecoders = 4 x 4 x (2+2) = 64
    - 256 word line drivers, each a four-input NAND = 256 x (4+4) = 2,048
    - 4,096 vs 2,112 = almost a 50% savings
  Number of inputs to the gates driving the WLs is halved, so the propagation delay is reduced by a factor of ~4

Hierarchical Decoders
  Multi-stage implementation improves performance
  [Figure: NAND decoder using 2-input pre-decoders]

Dynamic Decoders
  [Figure: 2-input NOR decoder and 2-input NAND decoder, each with precharge devices]
  Which one is faster? Smaller? Low power?

Pass Transistor Based Column Decoder
  [Figure: 2-input NOR decoder on A1, A0 generating S0 to S3; pass transistors connect BL0/!BL0 through BL3/!BL3 to data_out/!data_out]
  Read: connect BLs to the sense amps (SA)
  Write: drive one of the BLs low to write a 0 into the cell
  Fast, since there is only one transistor in the signal path
  However, there is a large transistor count: (K+1) x 2^K + 2 x 2^K
    For K = 2: 3 x 2^2 (decoder) + 2 x 2^2 (PTs) = 12 + 8 = 20

Tree Based Column Decoder
  [Figure: tree of pass transistors controlled by A0, !A0, A1, !A1 connecting BL0/!BL0 through BL3/!BL3 to data_out/!data_out]
  Number of transistors reduced to 2 x 2 x (2^K - 1)
    For K = 2: 2 x 2 x (2^2 - 1) = 4 x 3 = 12
  Delay increases quadratically with the number of sections (K), so prohibitive for large decoders
    - can fix with buffers, progressive sizing, combination of tree and pass transistor approaches

Decoder Complexity Comparisons
  Consider a memory with 10b address and 8b data

  Conf.  Data/Row            Row decoder    Single stage  Two stage  Column decoder  PT     Tree
  1D     8b                  10b (10x2^10)  20,480 T      10,320 T   none
  2D     32b (32x256 core)   8b (8x2^8)     4,096 T       2,112 T    2b (2x2^2)      76 T   96 T
  2D     64b (64x128 core)   7b (7x2^7)     1,792 T       1,072 T    3b (3x2^3)      160 T  224 T
  2D     128b (128x64 core)  6b (6x2^6)     768 T         432 T      4b (4x2^4)      336 T  480 T

Bit Line Precharge Logic
  First step of a Read cycle is to precharge (PC) the bit lines to VDD
    - every differential signal in the memory must be equalized to the same voltage level before Read
  Turn off PC and enable the WL
    - the grounded PMOS load limits the bit line swing (speeding up the next precharge cycle)
  [Figure: precharge circuit driven by !PC on BL and !BL, with equalization transistor]
  Equalization transistor: speeds up equalization of the two bit lines by allowing the capacitance and pull-up device of the nondischarged bit line to assist in precharging the discharged line

Sense Amplifiers
  Amplification: resolves data with small bit line swings (in some DRAMs required for proper functionality)
  Delay reduction: compensates for the limited drive capability of the memory cell to accelerate BL transition
    t_p = (C * ΔV) / I_av
    C is large and I_av is small, so make ΔV as small as possible
  Power reduction: eliminates a large part of the power dissipation due to charging and discharging bit lines
  Signal restoration: for DRAMs, need to drive the bit lines full swing after sensing (read) to do data refresh

Classes of Sense Amplifiers
  Differential SA: takes small signal differential inputs (BL and !BL) and amplifies them to a large signal single-ended output
    - common-mode rejection: rejects noise that is equally injected to both inputs
    - only suitable for SRAMs (with BL and !BL)
    - types: current mirroring, two-stage, latch based
  Single-ended SA: needed for DRAMs

Differential Sense Amplifier
  [Figure: differential sense amplifier (transistors M1 to M5, bit line inputs, sense enable SE, output Out)]
  Directly applicable to SRAMs

Differential Sensing: SRAM
  [Figure: (a) SRAM sensing scheme; (b) two stage differential amplifier]

Approaches to Memory Timing
  DRAM timing: multiplexed addressing
    - the address bus carries the row address (msb's), then the column address (lsb's)
    - RAS-CAS timing
  SRAM timing: self-timed
    - address transition initiates memory operation

Reliability and Yield
  Memories operate under low signal-to-noise conditions
    - word line to bit line coupling can vary substantially over the memory array
        - folded bit line architecture (routing BL and !BL next to each other ensures a closer match between parasitics and bit line capacitances)
    - interwire bit line to bit line coupling
        - transposed (or twisted) bit line architecture (turn the noise into a common-mode signal for the SA)
    - leakage (in DRAMs) requiring refresh operation
  suffer from low yield due to high density and structural defects
    - increase yield by using error correction (e.g., parity bits) and redundancy
  and are susceptible to soft errors due to alpha particles and cosmic rays

Redundancy in the Memory Structure
  [Figure: memory array with redundant row, redundant columns, and fuse bank; inputs are the row address and column address]

Row Redundancy
  [Figure: row redundancy. Labels: functional address, fused repair addresses, comparators (== ?), redundant wordlines, enable, normal wordline decoders, normal wordlines]

Column Redundancy
  [Figure: column redundancy. Labels: Data 0 to Data 7, fuses, normal data columns, redundant data column]

Error-Correcting Codes
  Example: Hamming codes
    - e.g., if B3 flips, the checks give 1, 1, 0 (= 3)
  2^k >= m + k + 1, where m = number of data bits and k = number of check bits
    - for 64 data bits, needs 7 check bits

Performance and area overhead for ECC
  [Figure]

Redundancy and Error Correction
  [Figure]

Soft Errors
  Nonrecurrent and nonpermanent errors from
    - alpha particles (from the packaging materials)
    - neutrons from cosmic rays
  As feature size decreases, the charge stored at each node decreases (due to a lower node capacitance and lower VDD), and thus Qcritical (the charge necessary to cause a bit flip) decreases, leading to an increase in the soft error rate (SER)
  [Figure: system FITs versus process technology (0.25 down to 0.05), log scale from 1 to 10,000. From Semico Research Corp.]

  MTBF (hours), from Actel
                              .13 µm  .09 µm
    Ground-based              895     448
    Civilian avionics system  324     162
    Military avionics system  18      9

CELL Processor!
  See class website for web links

Embedded SRAM (4.6 GHz)
  - Each SRAM cell is 0.99 µm^2
  - Each block has 32 sub-arrays
  - Each sub-array has 128 WL plus 4 redundant lines
  - Each block has 2 redundant BL

Multiplier in CELL
  [Figure]

Next Lecture and Reminders
  Next lecture: power consumption in datapaths and memories
    - Reading assignment: Rabaey, et al, 11.7; 12.5
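
Worked example (Row Decoders): the AND and NOR word line equations on the Row Decoders slide describe the same one-hot function. The sketch below is only a behavioral check, not a circuit model; the function names `wl_and` and `wl_nor` are made up for this illustration.

```python
def wl_and(row: int, address: int, bits: int = 8) -> int:
    """AND form: WL(row) is 1 when every literal (A_b or !A_b) is 1."""
    result = 1
    for b in range(bits):
        a = (address >> b) & 1
        # Use A_b where the row index has a 1, !A_b where it has a 0.
        literal = a if (row >> b) & 1 else 1 - a
        result &= literal
    return result


def wl_nor(row: int, address: int, bits: int = 8) -> int:
    """NOR form: WL(row) is 1 when none of the complementary literals is 1."""
    any_high = 0
    for b in range(bits):
        a = (address >> b) & 1
        # Use !A_b where the row index has a 1, A_b where it has a 0.
        literal = (1 - a) if (row >> b) & 1 else a
        any_high |= literal
    return 1 - any_high


# Exactly one word line is selected for a given address.
address = 0b10100101
selected = [row for row in range(256) if wl_nor(row, address)]
print(selected)
```

Both forms select only the row whose index equals the address, which is why either gate style can implement the decoder.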
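
Worked example (Implementing a Wide NOR Function, Decoder Complexity Comparisons): the row decoder transistor counts quoted on those slides can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: 2 transistors per gate input, 2-bit predecoders only, and, for an odd number of address bits, the unpaired bit feeding the word line drivers directly with no extra gates counted. That last assumption is not stated in the slides; it is chosen because it matches the 1,072 T figure for the 7-bit case. The function names are hypothetical.

```python
def single_stage_transistors(address_bits: int) -> int:
    """One static CMOS gate per word line; each input costs 2 transistors."""
    word_lines = 2 ** address_bits
    return word_lines * (address_bits + address_bits)


def two_stage_transistors(address_bits: int) -> int:
    """2-bit predecoders followed by one word line driver gate per row."""
    pairs, leftover = divmod(address_bits, 2)
    # Each address pair needs four 2-input gates of (2+2) transistors.
    predecoders = pairs * 4 * (2 + 2)
    # Assumption: an unpaired address bit goes straight to the drivers.
    driver_inputs = pairs + leftover
    drivers = (2 ** address_bits) * (driver_inputs + driver_inputs)
    return predecoders + drivers


for bits in (10, 8, 7, 6):
    print(bits, single_stage_transistors(bits), two_stage_transistors(bits))
```

For 8 address bits this gives 4,096 versus 2,112 transistors, the "almost 50% savings" noted on the slide.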
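
Worked example (Pass Transistor Based and Tree Based Column Decoders): the column decoder formulas can be evaluated the same way. The sketch assumes that one decoder is shared by all data bits while the pass transistors (or the tree) are replicated per data bit; the slides do not say this explicitly, but it reproduces the PT and Tree columns of the comparison table for 8b data. Function and parameter names are hypothetical.

```python
def pass_transistor_count(k: int, data_bits: int = 1) -> int:
    """(K+1) x 2^K for the decoder plus 2 x 2^K pass transistors per data bit."""
    decoder = (k + 1) * 2 ** k
    pass_transistors = data_bits * 2 * 2 ** k
    return decoder + pass_transistors


def tree_count(k: int, data_bits: int = 1) -> int:
    """2 x 2 x (2^K - 1) transistors per data bit, no separate decoder."""
    return data_bits * 2 * 2 * (2 ** k - 1)


# Single data bit, K = 2, as on the two column decoder slides.
print(pass_transistor_count(2), tree_count(2))

# 8b data, as in the comparison table.
for k in (2, 3, 4):
    print(k, pass_transistor_count(k, data_bits=8), tree_count(k, data_bits=8))
```

Note that with 8 data bits the tree version ends up with more transistors than the pass transistor version, because the shared decoder is amortized over all bits.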
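
Worked example (Error-Correcting Codes): the check-bit requirement for a single-error-correcting Hamming code, 2^k >= m + k + 1, can be solved by a simple search for the smallest k. A minimal sketch; the function name `check_bits` is made up for this illustration.

```python
def check_bits(data_bits: int) -> int:
    """Smallest k satisfying 2^k >= m + k + 1 for m data bits."""
    k = 1
    while 2 ** k < data_bits + k + 1:
        k += 1
    return k


for m in (4, 8, 32, 64):
    print(m, check_bits(m))
```

For 64 data bits, k = 6 fails (64 < 71) and k = 7 passes (128 >= 72), which matches the 7 check bits quoted on the slide.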