Practical Design and Performance Evaluation of Completion Detection Circuits
Fu-Chiung Cheng
Department of Computer Science
Columbia University
Reading 4

Outline
• Motivation
• Previous Work
• New Completion Detection Circuit
• Performance Evaluation
• Conclusion

Motivation
• Circuits: synchronous or asynchronous.
• Synchronization:
  Sync: a global clock
  Async: start and completion mechanisms

Motivation
• Potential advantages of async. design:
  • No clock skew problem
  • Low power consumption
  • Average-case performance
  • Modularity, composability and reusability
  • Easier technology migration
• The promise of high performance is especially attractive.

Motivation
• High performance async. design requires:
  1. fast self-timed components with good average-case performance
  2. fast completion detection circuits that detect when the computation has completed
[Figure: a self-timed component with dual-rail inputs A_i, B_i and dual-rail outputs S_i; one OR gate per output bit produces Ack_0 ... Ack_{n-1}, and a C-element combines them into DoneReset]

Motivation
• Fast self-timed components:
  1. Delay-insensitive carry-lookahead adders
     Time complexity: order log log n
     Logic complexity: order n
  2. Delay-insensitive comparators
     Time complexity: order 1
     Logic complexity: order n

Motivation
• Fast completion detection circuits:
  1. Completion detection circuits (CDCs) are considered the major overhead.
  2. This paper addresses the design of fast completion detection circuits.

Previous Work
• Self-timed components may use
  1. the bundled data protocol
  2. dual-rail signaling

Previous Work
• CDCs for bundled data components
  1. Delay elements (an inverter chain): delay > worst-case delay.
  2. Speculative completion [Nowick97]: performance depends on
     A. the number of matched delays and
     B. the associated abort detection network
  3. Current-Sensing Completion-Detection [Dean94, Grass96]
     A. consumes substantial power
     B. requires several gate delays

Previous Work
• CDCs for dual-rail self-timed components
  1. General model:
     A. n two-input ORs
     B. one n-input C-element
  2. Operations:
     A. computation cycle: DoneReset = 1
     B. reset cycle: DoneReset = 0
[Figure: self-timed component with dual-rail outputs S_i, per-bit OR gates producing Ack_0 ... Ack_{n-1}, and a C-element producing DoneReset]

Previous Work
• N-input C-element: a tree of 2-input C-elements
  1. long delay
  2. large variance
[Figure: Ack_0 ... Ack_{n-1} combined by a tree of 2-input C-elements]

Previous Work
• N-input C-element:
  1. More efficient implementation: $\text{DoneReset} = \text{done} + \text{reset} \cdot \text{DoneReset}$
     A. done circuit: an n-input AND, $\text{done} = \text{Ack}_0 \cdot \text{Ack}_1 \cdots \text{Ack}_{n-1}$
     B. reset circuit: an n-input OR, $\text{reset} = \text{Ack}_0 + \text{Ack}_1 + \cdots + \text{Ack}_{n-1}$
     C. a 2-input C-element
  2. Delay and variance: better than the tree of 2-input C-elements
[Figure: an n-input AND (done) and an n-input OR (reset) over Ack_0 ... Ack_{n-1}, feeding a 2-input C-element that outputs DoneReset]

Previous Work
• Wuu's CDCs [Wuu93]:
  $\text{DoneReset} = \text{done} + \text{reset} \cdot \text{DoneReset}$
  [Equation: further equivalent forms of this expression, not recoverable from the extracted text]
  A. done circuit: a tree of NAND gates, $\text{done} = \text{Ack}_0 \cdot \text{Ack}_1 \cdots \text{Ack}_{n-1}$
  B. reset circuit: a tree of NOR gates, $\text{reset} = \text{Ack}_0 + \text{Ack}_1 + \cdots + \text{Ack}_{n-1}$
  C. long delay
  D. small variance
  E. uses static gates

Previous Work
• Yun's CDCs [Yun97]:
  A. done circuit: a tree of domino logic
  B. no reset circuit
  C. variant delay
  D. large variance
  E. uses dynamic CMOS
[Figure: 8-bit completion detection domino logic with precharge (prech) and dual-rail inputs S_0 ... S_7; four such blocks form the products of (S_i^0 + S_i^1) over i = 0..7, 8..15, 16..23 and 24..31]

Our Design
• Computation completion detection circuit
  $\text{done} = \text{Ack}_0 \cdot \text{Ack}_1 \cdots \text{Ack}_{n-1} = \overline{\overline{\text{Ack}_0} + \overline{\text{Ack}_1} + \cdots + \overline{\text{Ack}_{n-1}}}$ (dynamic n-input NOR)
  $\overline{\text{Ack}_i} = \overline{S_i^0 + S_i^1}$ (static 2-input NOR)
[Figure: done circuit; a static 2-input NOR per bit drives the pull-down transistors of a dynamic n-input NOR whose output is done]

Our Design
• Reset completion detection circuit
  $\text{reset} = \text{Ack}_0 + \text{Ack}_1 + \cdots + \text{Ack}_{n-1} = (S_0^0 + S_0^1) + \cdots + (S_{n-1}^0 + S_{n-1}^1)$ (dynamic 2n-input OR)
  $\text{Ack}_i = S_i^0 + S_i^1$
[Figure: reset circuit; a dynamic 2n-input OR over the rails S_0^0, S_0^1, ..., S_{n-1}^0, S_{n-1}^1]

Our Design
• Computation cycle: either S_i^0 or S_i^1 will eventually be turned on.
  For the done signal,
  1. the PMOS transistor (Ack_i) will be closed (conducting), and
  2. all NMOS transistors will be open (non-conducting).
  3. Thus, the done signal will be turned on.
[Figure: done circuit]

Our Design
• Computation cycle: either S_i^0 or S_i^1 will eventually be turned on.
  For the reset signal, reset is turned on as soon as any Ack_i signal goes high.
[Figure: reset circuit]

Our Design
• Reset cycle: either S_i^0 or S_i^1 will eventually be turned off.
  For the done signal, done is turned off as soon as any Ack_i signal is turned off.
[Figure: done circuit]

Our Design
• Reset cycle: either S_i^0 or S_i^1 will eventually be turned off.
  For the reset signal, reset is turned off only after all Ack_i signals are turned off.
[Figure: reset circuit]

Our Design
• done + reset circuits = dual-rail multi-input C-element
• done + reset circuits + 2-input C-element = single-rail multi-input C-element
• Implementation of the 2-input C-element:
[Figure: transistor-level 2-input C-element with inputs done and reset, a stage labeled Weak, and output DoneReset]

[Figure: DIRCA with CDC, part 1]

[Figure: DIRCA with CDC, part 2]

Our Design
• The PMOS transistor in the pull-up circuit of the done circuit saves power in non-operation mode.
• In a quiescent state, all Ack_i signals are zero, so all pull-down transistors are closed.
• To save power, the pull-up transistor is open, cutting off the path from Vdd to ground.
[Figure: done circuit]

Our Design
• If the input low arrives too early, power is wasted.
• If the input low arrives too late, it takes longer to turn on the done signal.
• Low power consumption: latest Ack_i signal
• High performance: any not-latest Ack_i signal
[Figure: done circuit]

SPICE Output: done circuit
Delay = 0.55 ns
ChengDone0:
  1. Ack0 is the latest signal.
  2. input pulses: 3 and 4
  3. buffered input: 1004
  4. Ack0: 100
  5. Done: 24680
  6. DoneReset: 200

SPICE Output: done circuit
Delay = 0.22 ns
ChengDone1:
  1. Ack1 is the latest signal.
  2. input pulses: 5 and 6
  3. buffered input: 1006
  4. Ack1: 101
  5. Done: 24680
  6. DoneReset: 200

SPICE Output: done circuit
Delay = 0.64 ns
ChengDone37:
  1. All Ack signals arrive at the same time.
  2. Done: 24680
  3. DoneReset: 200

SPICE Output: reset circuit
Delay = 1.23 ns
ChengReset0:
  1. Ack0 is the latest signal.
  2. input pulse: 3 and 4
  3. buffered input: 1004
  5. Reset: 13579
  6. DoneReset: 200

SPICE Output: reset circuit
Delay = 0.87 ns
ChengReset1:
  1. Ack0 is the latest signal.
  2. input pulse: 3 and 4
  3. buffered input: 1004
  5. Reset: 13579
  6. DoneReset: 200

SPICE Output: reset circuit
Delay = 1.34 ns
ChengReset37:
  1. All Ack signals reset at the same time.
  2. Done: 24680
  3. DoneReset: 200

Our Design
• Constraint: when the pull-up is conducting, $R_{\text{pull-up}} \geq 5\,R_{\text{pull-down}}$ must hold even when only one pull-down transistor is conducting.
• This can be achieved by properly sizing the transistors.
[Figure: done circuit]

Logic Complexity (# of transistors)

           done                       done+reset
circuit    n-bit   32-bit   64-bit    n-bit   32-bit   64-bit
Wuu        10n-4   316      636       14n-8   440      888
Yun        4n-5    123      251       N/A     N/A      N/A
Cheng      5n+1    161      321       7n+5    229      453

Performance Evaluation
• SPICE simulation:
  1. uses MOSIS 2-micron CMOS level 2 parameters
  2. W = 3 µm, L = 2 µm (buffer: 0.4 ns; 2-input NOR: 0.18 ns)
• Computation-completion detection circuits: 38 typical cases (for Wuu, Yun and Cheng).
  The measured delay includes the delay of the OR gate for Ack_i.
• Reset-completion detection circuits: 38 typical cases (Wuu and Cheng)

Performance Evaluation
Computation Completion Detection: 32-bit done (ns)

Case   Wuu    Yun    Cheng   Speed up C vs W   Speed up C vs Y
Min    2.18   1.46   0.22    4.1               2.8
Max    2.65   3.36   0.64    10.4              14.3
Avg    2.27   2.53   0.28    9.2               10.2

Performance Evaluation
Reset Completion Detection: 32-bit reset (ns)

Case   Wuu    Cheng   Speed up C vs W
Min    2.40   0.87
Max    2.89   1.34
Avg    2.85   0.71    4.0

Conclusions
• A new completion detection circuit for dual-rail self-timed components:
  1. very fast computation-completion detection
  2. very fast reset-completion detection
• A low-overhead, very fast completion detection circuit is crucial for high-performance self-timed circuits.

Conclusions
• SPICE simulation results:
  1. our computation-completion detection circuit is 9 times faster than Wuu's and Yun's
  2. our reset-completion detection circuit is 4 times faster than Wuu's
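
The logic relations on the slides (Ack_i as the OR of the two rails, done as the AND of all Ack_i, reset as the OR of all Ack_i, and DoneReset = done + reset · DoneReset) can be checked with a small behavioral model. The following is a minimal sketch at the Boolean level only; it says nothing about transistor sizing or timing, and the function names (ack, done, reset, c_element, simulate) are hypothetical names chosen for illustration.

```python
from typing import List, Tuple

Rails = Tuple[int, int]  # (S_i^0, S_i^1) for one dual-rail output bit


def ack(rails: Rails) -> int:
    """Ack_i = S_i^0 + S_i^1: high once either rail of bit i is high."""
    return rails[0] | rails[1]


def done(acks: List[int]) -> int:
    """done = Ack_0 * Ack_1 * ... * Ack_{n-1} (n-input AND)."""
    return int(all(acks))


def reset(acks: List[int]) -> int:
    """reset = Ack_0 + Ack_1 + ... + Ack_{n-1} (n-input OR)."""
    return int(any(acks))


def c_element(done_sig: int, reset_sig: int, previous: int) -> int:
    """DoneReset = done + reset * DoneReset (2-input C-element)."""
    return done_sig | (reset_sig & previous)


def simulate(steps: List[List[Rails]], done_reset: int = 0) -> List[int]:
    """Return the DoneReset value after each snapshot of the output rails."""
    trace = []
    for rails in steps:
        acks = [ack(bit) for bit in rails]
        done_reset = c_element(done(acks), reset(acks), done_reset)
        trace.append(done_reset)
    return trace


if __name__ == "__main__":
    # Two-bit example: spacer, bit 0 valid, both bits valid,
    # bit 0 returns to spacer, both bits return to spacer.
    steps = [
        [(0, 0), (0, 0)],
        [(1, 0), (0, 0)],
        [(1, 0), (0, 1)],
        [(0, 0), (0, 1)],
        [(0, 0), (0, 0)],
    ]
    print(simulate(steps))  # prints [0, 0, 1, 1, 0]
```

In this trace DoneReset rises only once every bit is valid and falls only once every bit has returned to the spacer, which is the computation-cycle and reset-cycle behavior the slides describe.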
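
The Logic Complexity slide gives closed-form transistor counts for each design. As a quick arithmetic check, the sketch below evaluates those formulas for 32 and 64 bits and prints the counts. The table name FORMULAS and the helper transistor_count are hypothetical; only the formulas themselves come from the slide, and Yun's design has no done+reset entry because it has no reset circuit.

```python
from typing import Callable, Dict, Optional

# Transistor-count formulas as listed on the Logic Complexity slide.
FORMULAS: Dict[str, Dict[str, Callable[[int], int]]] = {
    "Wuu": {"done": lambda n: 10 * n - 4, "done+reset": lambda n: 14 * n - 8},
    "Yun": {"done": lambda n: 4 * n - 5},  # no reset circuit
    "Cheng": {"done": lambda n: 5 * n + 1, "done+reset": lambda n: 7 * n + 5},
}


def transistor_count(design: str, circuit: str, n: int) -> Optional[int]:
    """Evaluate the slide's formula, or return None where none is listed."""
    formula = FORMULAS[design].get(circuit)
    return None if formula is None else formula(n)


if __name__ == "__main__":
    for design in FORMULAS:
        for circuit in ("done", "done+reset"):
            counts = [transistor_count(design, circuit, n) for n in (32, 64)]
            print(design, circuit, counts)
```

Evaluating the formulas this way reproduces the 32-bit and 64-bit columns of the slide's table.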