Critical ALU Path Optimization and Implementation in a BiCMOS Process for Gigahertz Range Processors
Matthew W. Ernest
Electrical, Computer and Systems Engineering Dept.
Rensselaer Polytechnic Institute
mwe/PHD/1

Overview
• Motivation
• Parallel Prefixes and Carry Types
• HBT Digital Circuits
• Pseudo-carry Adder
• Future Directions

Motivation
“Speed has always been important otherwise one wouldn't need the computer.” -Seymour Cray
• Ubiquity
• Simplicity
• Complexity

Parallel Prefixes
Given: $x_0, x_1, x_2, \dots, x_k$
Find: $x_0,\; x_0 \circ x_1,\; x_0 \circ x_1 \circ x_2,\; \dots,\; x_0 \circ x_1 \circ x_2 \circ \dots \circ x_k$, for an associative operator $\circ$
• The class of problems in which each term of a sequence is combined, in order, with the result of the previous operation
• Carry computation is an application of parallel prefix theory

Carry Types: Carry Select
• Compute the possible results of each block in parallel
• Select one when the actual carry-in is available
• Requires an internal carry scheme for each block, e.g. ripple
• Delay: O(f(n/b) + b), minimum O(√n)
• Area: O(f(n/b)·b + b), approx. 2n
• Affected by block sizing

Carry Types: Carry Look-ahead
• A carry-out is either "generated" at the current position, or the carry-in is "propagated"
• Delay: O(1)
• Area: O(n²)
• High fan-in/fan-out

Carry Types: Block Carry Look-ahead
• A block propagates a carry if all bits in the block propagate a carry
• A block generates a carry if some bit generates a carry and all more significant bits in the block propagate it
• Delay: O(log n)
• Area: O(n log n)

Block carry look-ahead trees
[Figure: block carry look-ahead trees]

Carry vs. Pseudo-carry
$$C_{out} = G_n + P_n G_{n-1} + \dots + P_n P_{n-1} \cdots P_0 C_{in}$$
If $G = A \cdot B$ and $P = A + B$, then $G = G \cdot P$, so
$$C_{out} = P_n G_n + P_n G_{n-1} + \dots + P_n P_{n-1} \cdots P_0 C_{in}$$
$$C_{out} = P_n \left(G_n + G_{n-1} + \dots + P_{n-1} \cdots P_0 C_{in}\right)$$
$$C_{out} = P_n H_n, \qquad H_n = G_n + G_{n-1} + \dots + P_{n-1} \cdots P_0 C_{in}$$

Carry vs. Pseudo-carry
• Redundant terms create factorization opportunities
• Factorization moves terms from critical paths to non-critical paths
• Multiple paths can be parallelized
• Products with fewer terms lead to implementations with smaller, faster gates

Deriving Block Pseudo-carry from Block Carry Look-ahead Terms
Block generate:
$$G^{0}_{i \cdot j} = G^{i}_{j} + P^{i}_{j} G^{i}_{j-1} + \dots + P^{i}_{j} P^{i}_{j-1} P^{i}_{j-2} \cdots G^{i}_{0}$$
If $G = A \cdot B$ and $P = A + B$, then $G = G \cdot P$, so
$$G^{0}_{i \cdot j} = P^{i}_{j} G^{i}_{j} + P^{i}_{j} G^{i}_{j-1} + \dots + P^{i}_{j} P^{i}_{j-1} P^{i}_{j-2} \cdots G^{i}_{0}$$
$$G^{0}_{i \cdot j} = P^{i}_{j} \left(G^{i}_{j} + G^{i}_{j-1} + \dots + P^{i}_{j-1} P^{i}_{j-2} \cdots G^{i}_{0}\right)$$
$$H^{0}_{i \cdot j} = G^{i}_{j} + G^{i}_{j-1} + \dots + P^{i}_{j-1} P^{i}_{j-2} \cdots G^{i}_{0}$$
• Pseudo-carries can be generated in blocks, like carries

Generalized Pseudo-carry Equations
$$H^{s}_{2} = G^{s+1}_{1} + G^{s}_{1}$$
$$H^{s}_{i+j} = H^{s+i}_{j} + I^{s+i-1}_{j} H^{s}_{i}$$
$$H^{s}_{i+j+k} = H^{s+i+j}_{k} + I^{s+i+j-1}_{k} H^{s+i}_{j} + I^{s+i+j-1}_{k} I^{s+i-1}_{j} H^{s}_{i}$$
$$I^{t}_{p+q} = I^{t+p}_{q} I^{t}_{p}$$
$$I^{t}_{p+q+r} = I^{t+q+p}_{r} I^{t+p}_{q} I^{t}_{p}$$

Generating Sums Using Pseudo-carry
$$S_n = A_n \oplus B_n \oplus C_{n-1}$$
If $T_n = A_n \oplus B_n$ and $C_m = P_m H_m$, then
$$S_n = T_n \oplus P_{n-1} H_{n-1}$$
• A sum computed with the pseudo-carry is no more complex than a sum computed with the carry
• Other look-ahead features still apply, e.g. Han-Carlson "every other carry"

Adder comparison
[Figure: adder comparison for 32- and 64-bit operands across Ripple, CSel, CLA, and PCLA variants A, B, C. Values in plotted order, 32-bit: 32, 12, 12, 9, 6, 5; 64-bit: 64, 20, 16, 12, 7, 6]

HBT Digital Circuits
• The exponential I/V relationship gives high gain and fast switching
• The vertical device structure allows critical dimensions to be smaller, with tighter tolerances
• Traditionally high DC power consumption; compare with the increasing leakage and switching currents of FETs

Current Steering Logic
• A constant current source supplies the combined emitter currents
• The ratio of the currents through the two transistors is an exponential function of the base-voltage difference
• The difference in collector currents is converted to a difference in voltage across the pull-up resistors

Single-ended vs. Double-ended
Single-ended:
• Limited to simple functions
• Large fan-in
Double-ended:
• Any function of the inputs
• Fan-in limited by supply voltage

Look-ahead gate w/ fully differential logic
[Schematic: fully differential look-ahead gate; inputs H_n, H_{n-1}, H_{n-2}, I_n, I_{n-1}]

Mixed input look-ahead gates
[Schematic: inputs H_n, H_{n-1}, I_n; reference V_r]
$$I_n (H_n + H_{n-1}) + \overline{I_n}\, H_n = H_n + I_n H_{n-1}$$
• Two series-gated levels for three inputs

Mixed input look-ahead gates
[Schematic: inputs H_n, H_{n-1}, H_{n-2}, I_n, I_{n-1}]
$$I_n I_{n-1} (H_n + H_{n-1} + H_{n-2}) + I_n \overline{I_{n-1}} (H_n + H_{n-1}) + \overline{I_n}\, H_n = H_n + I_n H_{n-1} + I_n I_{n-1} H_{n-2}$$
• Three series-gated levels for five inputs

Pseudo-carry Blocks
[Figure: pseudo-carry block tree; block labels H_2^s, H_6^s, H_14^s, H_18^s, H_32^s]

Pseudo-carry Tree Oscillator (H_2^s blocks)
[Schematic: 32-bit pseudo-carry tree in an oscillator loop; inputs A, B, Cin, Select; output Cout]

Carry Tree High-speed Output
[Measured waveform: 2 × 165 ps]

Breakdown of measured delay
Total measured delay = 165 ps
• Devices: 71%
• Wire C: 12%
• Resistor model: 11%
• Temperature: 6%

Loaded vs. unloaded toggling
• At design time, fT peaked at 1.2 mA/µm², but the current-density limit was 2 mA/µm²
• For some devices, the maximum frequency when driving a load can occur above the fT-peak current
• The models supported this, and at the time there was no reason not to believe them
• However, the models are never qualified above the fT-peak current!

Loaded vs. unloaded toggling
[Plot: buffer delay (0 to 8.00E-11) vs. tail current (0 to 2.50E-03)]

Resistor Model Effects
Resistor   Simulated (9805A)   Fabricated (99B)
Pull-up    444                 528
Tail       1000                1091

Model parameter variation
[Plot: parameter value (0 to 500) of RB, RE, RC (ohms) across design kits 9708A, 9802, 9805, 1999B, v2.3; DARPA02 design and DARPA02 fabrication kits marked]

Cadence internal parasitic methods
• Approximates all capacitance as a polynomial function of the distance between conductors
• Cannot extract RC and the capacitance between conductors at the same time: a killer for differential wiring!
• Convenient, but the window of usability is small and shrinking

QuickCap capacitance extraction
• Field solving with the floating random walk method
• Accuracy is almost wholly a function of run time: 4× the run time gives ½ the error
• Random walks are independent, so parallelization is nearly perfect

Comparing parasitic extraction
[Plot: delay (ps, 0 to 50) vs. length (µm, 0 to 1200); series: Qcap RC, RCNET, PCAP, Calc RC]

Cadence/QuickCap Design Flow
• Extract physical data from the layout
• Compute RC with QuickCap
• Extract the netlist from the schematic
• Combine both and simulate with Spectre

Partial manual extraction with QuickCap
• Identify the main wires of the oscillation paths: approximately a dozen pairs
• Use QuickCap to extract each wire's capacitance to ground and the capacitance between the wires of each pair
• Add an RC ladder for each pair to the schematic by hand, then simulate

Simulation with Parasitic Extraction
Feedback path delay, all values in ps:
Feedback path   No parasitics   QuickCap cap.   COEFGEN cap.   Raphael cap.   QuickCap RC
Cin             100             121             128            131            135
A1              103             123             130            129            137
A31             108             127             129            132            141

Pseudo-carry Tree configured as Ring Oscillator
[Schematic: operands 00...00 and 11...11; select inputs Sel0, Sel1; inputs A, B, Cin; output Cout]

SMI00 Test Structure Layout

SMI00 Test Structure

Carry Tree High-speed Outputs
[Measured waveform: 16 × 146 ps]

Comparisons of published adders
Reference   Type    Size
ZIMM96      Carry   32
STEL96      Adder   64(32)
WANG97      Adder   32
CHAN98      Adder   64(32)
SILB98      Fixed   64
AIPP99      Adder   64
SAGE01      Adder   32[16x2]
MATH01      Adder   64
STAS01      Adder   64
LEE02       Adder   64
VANA02      ALU     32
Gate delays listed: 5, 12.5 (12?), 3, 27 (19.5), 8
Times listed: 2.7 ns, 550 ps, 660 ps, <500 ps, 482 ps, 440 ps, 900 ps, <200 ps

Cascode Output Stage
• Eliminates the Miller capacitance between input and output
• Reduces Cjc and Cjs on the outputs
• Shortens rise time, but increases delay

Dotted Emitter/Collector

“Wide/Short” gate with dotted emitter/collector

“Wide/Short” gate with dotted emitter/collector
• Shorter trees allow lower supply voltages
• Wider trees reduce the ratio of emitter followers to terms computed, lowering total current
• More inputs per look-ahead gate means fewer look-ahead levels
• Eliminating single-ended inputs on the critical H signals allows faster switching with reduced swing

Even wider look-ahead gate
Width limited by:
• Accumulated Cjc and Cjs of the dotted-AND node
• Saturation vs. breakdown
• Fan-out loading from inputs and interconnect

Conclusions
• 32-bit addition depth reduced to 5 gates in fabricated circuits; 4- and 3-gate-depth circuits designed
• Gate computing 3-way look-ahead fabricated; up to 8-way look-ahead designed
• Carry delay for 32-bit addition measured at 146 ps
• QuickCap technology file for 5HP brings simulated results within 11% of measured
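The "Parallel Prefixes" slide states the problem abstractly. As a minimal sketch, assuming a Kogge-Stone-style doubling pattern (one of several possible prefix networks, not necessarily the one used in this work), an inclusive scan with any associative operator finishes in ceil(log2 n) levels. The function name `prefix_scan` is hypothetical.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def prefix_scan(xs: List[T], op: Callable[[T, T], T]) -> List[T]:
    """Inclusive prefix scan: returns [x0, x0 op x1, ..., x0 op ... op xk].

    Kogge-Stone-style doubling: after the level with distance d, ys[k] holds
    the combination of x[max(0, k - 2*d + 1)] .. x[k]. `op` must be
    associative but need not be commutative (the left operand is the earlier term).
    """
    ys = list(xs)
    dist = 1
    while dist < len(ys):
        # All updates within one level are independent: one gate level in hardware.
        ys = [ys[k] if k < dist else op(ys[k - dist], ys[k]) for k in range(len(ys))]
        dist *= 2
    return ys


if __name__ == "__main__":
    print(prefix_scan(list("abcde"), lambda a, b: a + b))
    # -> ['a', 'ab', 'abc', 'abcd', 'abcde']
```

A 32-element scan takes five levels (distances 1, 2, 4, 8, 16), which matches the O(log n) depth quoted for block carry look-ahead.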
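The "Carry Types: Carry Select" slide gives the delay as O(f(n/b) + b) with a minimum of O(√n). A minimal sketch, assuming unit delays, equal-size ripple blocks (so f(n/b) = n/b), and one select stage per block, shows where that minimum comes from. The function names are hypothetical.

```python
def carry_select_delay(n_bits: int, n_blocks: int) -> int:
    """Unit-delay model of a carry-select adder built from ripple blocks.

    Assumes equal-size blocks: one delay unit per bit to ripple through a
    block of n_bits / n_blocks bits, plus one select stage per block.
    """
    if n_bits % n_blocks:
        raise ValueError("this sketch assumes equal-size blocks")
    return n_bits // n_blocks + n_blocks


def best_block_count(n_bits: int) -> int:
    """Block count (among exact divisors of n_bits) that minimises the modelled delay."""
    divisors = [b for b in range(1, n_bits + 1) if n_bits % b == 0]
    return min(divisors, key=lambda b: carry_select_delay(n_bits, b))


if __name__ == "__main__":
    for n in (16, 32, 64):
        b = best_block_count(n)
        print(n, b, carry_select_delay(n, b))
```

Under this model the optimum lies near b = √n, where the delay is about 2√n; real block sizing (the last bullet on the slide) departs from equal blocks because later blocks have more time to ripple.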
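The "Carry Look-ahead" and "Block Carry Look-ahead" slides describe generate and propagate terms and how blocks combine. A minimal sketch, assuming the OR-form propagate P = A + B used on the slides and a Kogge-Stone-style tree (the tree shape in the "Block carry look-ahead trees" figure is not reproduced, so the tree here is an assumption), treats the carry as a prefix problem over (G, P) pairs. `gp_combine` and `cla_add` are hypothetical names.

```python
from typing import List, Tuple

GP = Tuple[int, int]  # (generate, propagate) of a bit or block


def gp_combine(hi: GP, lo: GP) -> GP:
    """(G, P) of block `hi` sitting directly above block `lo`.

    The combined block generates if `hi` generates, or `hi` propagates and
    `lo` generates; it propagates only if both parts propagate.
    """
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def cla_add(a: int, b: int, cin: int, width: int) -> Tuple[int, int]:
    """Return (sum, carry_out) using block carry look-ahead as a prefix problem."""
    # Index 0 stands for the carry-in (G = cin, P = 0); index n + 1 is bit n.
    gp: List[GP] = [(cin, 0)]
    for n in range(width):
        an, bn = (a >> n) & 1, (b >> n) & 1
        gp.append((an & bn, an | bn))  # G = A*B, P = A+B as on the slides
    # log2-depth prefix: each position absorbs the block `dist` places below it.
    dist = 1
    while dist < len(gp):
        gp = [gp[k] if k < dist else gp_combine(gp[k], gp[k - dist]) for k in range(len(gp))]
        dist *= 2
    carries = [g for g, _ in gp]  # carries[n] = C_{n-1}, with carries[0] = cin
    total = 0
    for n in range(width):
        t = ((a >> n) ^ (b >> n)) & 1
        total |= (t ^ carries[n]) << n
    return total, carries[width]
```

Putting the carry-in at the bottom of the prefix as a block with G = Cin and P = 0 lets every group generate come out directly as the carry into the next bit.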
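The "Carry vs. Pseudo-carry" and "Generating Sums Using Pseudo-carry" slides can be checked exhaustively for small widths. This sketch assumes a ripple evaluation using the recurrence H_n = G_n + P_{n-1} H_{n-1}, which follows from H_n = G_n + C_{n-1} and C_{n-1} = P_{n-1} H_{n-1}, with the convention that P_{-1} H_{-1} stands for C_in. It compares each H_n with the closed-form sum of products from the slide and each sum with integer addition. The function names are hypothetical.

```python
from typing import List, Tuple


def pseudo_carry_add(a: int, b: int, cin: int, width: int) -> Tuple[int, int, List[int]]:
    """Ripple pseudo-carry adder: returns (sum, carry_out, [H_0, ..., H_{width-1}])."""
    total, hs = 0, []
    ph_prev = cin  # P_{n-1} * H_{n-1} (= C_{n-1}); taken as C_in for n = 0
    for n in range(width):
        an, bn = (a >> n) & 1, (b >> n) & 1
        g, p, t = an & bn, an | bn, an ^ bn  # G_n, P_n, T_n
        h = g | ph_prev                      # H_n = G_n + P_{n-1} H_{n-1}
        total |= (t ^ ph_prev) << n          # S_n = T_n xor P_{n-1} H_{n-1}
        hs.append(h)
        ph_prev = p & h                      # C_n = P_n H_n
    return total, ph_prev, hs


def h_closed_form(a: int, b: int, cin: int, n: int) -> int:
    """H_n = G_n + G_{n-1} + P_{n-1} G_{n-2} + ... + P_{n-1} ... P_0 C_in."""
    def g(i: int) -> int:
        return (a >> i) & (b >> i) & 1

    def p(i: int) -> int:
        return ((a | b) >> i) & 1

    h = g(n)
    prefix = 1  # running product P_{n-1} ... P_{k+1}
    for k in range(n - 1, -1, -1):
        h |= prefix & g(k)
        prefix &= p(k)
    return h | (prefix & cin)


if __name__ == "__main__":
    # H_n can be 1 while the true carry C_n is 0: A_0 = B_0 = 0 with C_in = 1.
    s, cout, hs = pseudo_carry_add(0, 0, 1, 1)
    print(s, cout, hs)  # -> 1 0 [1]
```

The example shows why H_n is a "pseudo" carry: it drops the P_n factor, so it only equals the carry once P_n is applied, which is exactly what the sum formula does one bit later.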
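The "Generalized Pseudo-carry Equations" slide combines block pseudo-carries H with terms I. A minimal sketch, reading I^t_m as the block propagate P_t P_{t+1} ... P_{t+m-1} (an interpretation consistent with the I recurrences on the slide, not a definition stated there), checks H^s_{i+j} = H^{s+i}_j + I^{s+i-1}_j H^s_i and I^t_{p+q} = I^{t+p}_q I^t_p exhaustively for 6-bit operands. Blocks start at s ≥ 1 so that I^{s-1} refers to a real bit; the least significant block with Cin is not covered. All function names are hypothetical.

```python
from itertools import product
from typing import List, Tuple


def block_h(g: List[int], p: List[int], s: int, m: int) -> int:
    """H^s_m for bits s..s+m-1, straight from the sum of products."""
    top = s + m - 1
    h = g[top]
    prefix = 1  # running product P_{top-1} ... P_{k+1}
    for k in range(top - 1, s - 1, -1):
        h |= prefix & g[k]
        prefix &= p[k]
    return h


def block_i(p: List[int], t: int, m: int) -> int:
    """I^t_m, read here as P_t P_{t+1} ... P_{t+m-1}."""
    r = 1
    for k in range(t, t + m):
        r &= p[k]
    return r


def combine(upper: Tuple[int, int], lower: Tuple[int, int]) -> Tuple[int, int]:
    """(H^{s+i}_j, I^{s+i-1}_j) with (H^s_i, I^{s-1}_i) -> (H^s_{i+j}, I^{s-1}_{i+j})."""
    h_hi, i_hi = upper
    h_lo, i_lo = lower
    return h_hi | (i_hi & h_lo), i_hi & i_lo


def check_block_combine(width: int = 6) -> bool:
    for a, b in product(range(1 << width), repeat=2):
        g = [(a >> k) & (b >> k) & 1 for k in range(width)]  # G = A*B
        p = [((a | b) >> k) & 1 for k in range(width)]       # P = A+B
        for s in range(1, width):
            for i in range(1, width - s):
                for j in range(1, width - s - i + 1):
                    upper = (block_h(g, p, s + i, j), block_i(p, s + i - 1, j))
                    lower = (block_h(g, p, s, i), block_i(p, s - 1, i))
                    whole = (block_h(g, p, s, i + j), block_i(p, s - 1, i + j))
                    if combine(upper, lower) != whole:
                        return False
    return True
```

The combine step has the same AND-OR shape as the (G, P) operator of block carry look-ahead, which is why pseudo-carries can be built in the same tree structures; the identity relies on G being contained in P, which holds because G = A·B and P = A + B.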
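The "Current Steering Logic" slide describes a differential pair fed by a constant tail current. A minimal sketch, assuming ideal bipolar transistors (collector current exponential in V_BE; no base current, Early effect, or series resistance), computes the two collector currents and the differential collector voltage. Tail current and load resistance are free parameters, and all names are hypothetical.

```python
import math

K_BOLTZMANN = 1.380649e-23    # J/K
Q_ELECTRON = 1.602176634e-19  # C


def thermal_voltage(temp_k: float = 300.0) -> float:
    """V_T = kT/q (about 25.9 mV at 300 K)."""
    return K_BOLTZMANN * temp_k / Q_ELECTRON


def pair_currents(delta_v: float, i_tail: float, temp_k: float = 300.0) -> tuple:
    """Collector currents (I_C1, I_C2) of an ideal differential pair.

    delta_v = V_B1 - V_B2; the tail current splits so that I_C1 / I_C2 = exp(delta_v / V_T).
    """
    x = delta_v / thermal_voltage(temp_k)
    i_c1 = i_tail / (1.0 + math.exp(-x))
    return i_c1, i_tail - i_c1


def output_difference(delta_v: float, i_tail: float, r_load: float, temp_k: float = 300.0) -> float:
    """V_C1 - V_C2 with equal pull-up resistors: each collector drops I_C * R_L."""
    i_c1, i_c2 = pair_currents(delta_v, i_tail, temp_k)
    return r_load * (i_c2 - i_c1)
```

Because I_C1 - I_C2 = I_tail·tanh(ΔV / 2V_T) in this model, an input difference of a few V_T already steers nearly all of the tail current, which is the high-gain behaviour the "HBT Digital Circuits" slide attributes to the exponential I/V relationship.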
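The two "Mixed input look-ahead gates" slides expand the look-ahead function around the I inputs, one series-gated level per I input. A truth-table check, as a sketch that models each gate only as a Boolean function (no circuit behaviour), confirms that each expansion equals its simplified form. `gate3` and `gate5` are hypothetical names.

```python
from itertools import product


def gate3(i_n: int, h_n: int, h_n1: int) -> int:
    """I_n (H_n + H_{n-1}) + not(I_n) H_n: I_n selects which branch drives the output."""
    return (i_n & (h_n | h_n1)) | ((1 - i_n) & h_n)


def gate5(i_n: int, i_n1: int, h_n: int, h_n1: int, h_n2: int) -> int:
    """I_n I_{n-1} (H_n + H_{n-1} + H_{n-2}) + I_n not(I_{n-1}) (H_n + H_{n-1}) + not(I_n) H_n."""
    return ((i_n & i_n1 & (h_n | h_n1 | h_n2))
            | (i_n & (1 - i_n1) & (h_n | h_n1))
            | ((1 - i_n) & h_n))


def matches_simplified_forms() -> tuple:
    ok3 = all(gate3(i, h0, h1) == (h0 | (i & h1))
              for i, h0, h1 in product((0, 1), repeat=3))
    ok5 = all(gate5(i0, i1, h0, h1, h2) == (h0 | (i0 & h1) | (i0 & i1 & h2))
              for i0, i1, h0, h1, h2 in product((0, 1), repeat=5))
    return ok3, ok5


if __name__ == "__main__":
    print(matches_simplified_forms())  # -> (True, True)
```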
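The "Breakdown of measured delay" chart gives percentages of the 165 ps total. Converting them to picoseconds is simple arithmetic; this sketch assumes the label-to-percentage pairing listed on that slide, and the helper name is hypothetical.

```python
def breakdown_ps(total_ps: float, shares_pct: dict) -> dict:
    """Split a total delay into contributions given integer percentages."""
    if sum(shares_pct.values()) != 100:
        raise ValueError("shares must add up to 100%")
    return {name: total_ps * pct / 100 for name, pct in shares_pct.items()}


if __name__ == "__main__":
    shares = {"Devices": 71, "Wire C": 12, "Resistor model": 11, "Temperature": 6}
    for name, ps in breakdown_ps(165, shares).items():
        print(f"{name}: {ps:.2f} ps")
```

Devices account for about 117 ps; wire capacitance, the resistor model, and temperature together account for the remaining ~48 ps.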
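The "QuickCap capacitance extraction" slide notes that 4× the run time gives ½ the error, the usual 1/√N behaviour of Monte Carlo estimates built from independent samples. The sketch below does not implement a floating random walk; it uses a generic Monte Carlo mean of uniform random numbers as a stand-in and measures how the spread of the estimate shrinks when the sample count grows fourfold. Names and parameters are hypothetical.

```python
import math
import random
import statistics


def estimate_spread(n_samples: int, trials: int, seed: int) -> float:
    """Std. deviation, across independent trials, of a mean of n_samples uniform draws."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.random() for _ in range(n_samples)) for _ in range(trials)]
    return statistics.pstdev(means)


def expected_error_ratio(runtime_factor: float) -> float:
    """Error scales as 1/sqrt(number of samples), i.e. 1/sqrt(run time)."""
    return 1.0 / math.sqrt(runtime_factor)


if __name__ == "__main__":
    ratio = estimate_spread(400, 2000, seed=2) / estimate_spread(100, 2000, seed=1)
    print(f"measured {ratio:.3f}, expected {expected_error_ratio(4):.3f}")
```

Because every sample is independent, the samples can also be split across processors with no coordination, which is the near-perfect parallelization the slide mentions.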
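The "Partial manual extraction with QuickCap" slide adds an RC ladder per wire pair by hand. For a first-order feel for what such a ladder contributes before simulation, a minimal sketch computes its Elmore delay, assuming a uniform ladder, a single wire (the capacitance between the wires of a pair is not modelled), and an ideal driver with no load. The resistance and capacitance values in the example are hypothetical, not extracted data.

```python
def elmore_delay_uniform_ladder(r_total: float, c_total: float, segments: int) -> float:
    """Elmore delay (s) at the far end of a uniform RC ladder.

    Each of the `segments` sections is a series resistor r_total/segments followed
    by a shunt capacitor c_total/segments; node k sees k series resistors back to
    the ideal driver, so its capacitor contributes k * R_seg * C_seg.
    """
    r_seg = r_total / segments
    c_seg = c_total / segments
    return sum(k * r_seg * c_seg for k in range(1, segments + 1))


if __name__ == "__main__":
    # Hypothetical wire: 100 ohm total resistance, 100 fF total capacitance.
    for n in (1, 2, 10, 100):
        print(n, elmore_delay_uniform_ladder(100.0, 100e-15, n))
```

The sum equals R·C·(N + 1)/(2N), so it falls from R·C for a single section toward R·C/2 as the ladder is divided more finely.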