Practical Design and
Performance Evaluation of
Completion Detection Circuits
Fu-Chiung Cheng
Department of Computer Science
Columbia University
Outline
• Motivation
• Previous Work
• New Completion Detection Circuit
• Performance Evaluation
• Conclusion
Motivation
• Circuits: Synchronous or Asynchronous.
• Synchronization:
Sync: a global clock
Async: start and completion mechanisms
Motivation
• Potential advantages of async. design:
• No clock skew problem,
• Low power consumption,
• Average-case performance,
• Modularity, composability and reusability
• Easier technology migration
• The promise of high performance is
especially attractive.
Motivation
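• High performance async. design requires:
1. fast self-timed components with good average-case performance
2. fast completion detection circuits that detect when the computation completes
[Figure: a self-timed component with dual-rail inputs $A_i^0, A_i^1, B_i^0, B_i^1$ and dual-rail outputs $S_0^0, S_0^1, \ldots, S_{n-1}^0, S_{n-1}^1$. Each output pair is ORed to form $Ack_0, \ldots, Ack_{n-1}$, which feed a C-element that produces DoneReset.]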
Motivation
• Fast self-timed components:
1. Delay-insensitive carry-lookahead adders
Time complexity: $O(\log \log n)$
Logic complexity: $O(n)$
2. Delay-insensitive comparators:
Time complexity: $O(1)$
Logic complexity: $O(n)$
Motivation
• Fast completion detection circuits:
1. Completion detection circuits (CDCs) are considered the major overhead.
2. This paper addresses the design of fast completion detection circuits.
Previous Work:
• Self-timed components may use
1. bundled data protocol
2. dual-rail signaling
Previous Work:
• CDCs for bundled data components
1. Delay elements (an inverter chain):
the delay must exceed the worst-case delay.
2. Speculative completion [Nowick97]:
performance depends on
A. the number of matched delays and
B. the associated abort detection network
3. Current-sensing completion detection [Dean94, Grass96]:
A. consumes substantial power
B. requires several gate delays
Previous Work:
• CDCs for dual-rail self-timed components
1. General model:
A. n two-input ORs
B. 1 n-input C-element
2. Operations:
A. computation cycle: DoneReset=1
B. reset cycle: DoneReset=0
[Figure: the general model. The dual-rail outputs $S_i^0, S_i^1$ of the self-timed component are ORed pairwise into $Ack_0, \ldots, Ack_{n-1}$, and an n-input C-element combines them into DoneReset.]
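The general model above can be sketched at the logic level as follows. This is a behavioral illustration only, not the circuit itself: it ignores gate delays, and the function names `c_element` and `done_reset` are hypothetical.

```python
from typing import List, Tuple


def c_element(inputs: List[int], prev: int) -> int:
    """Muller C-element: output becomes 1 when all inputs are 1,
    becomes 0 when all inputs are 0, and otherwise holds its value."""
    if all(inputs):
        return 1
    if not any(inputs):
        return 0
    return prev


def done_reset(rails: List[Tuple[int, int]], prev: int) -> int:
    """General dual-rail CDC: one 2-input OR per bit, then an n-input C-element."""
    acks = [s0 | s1 for s0, s1 in rails]  # Ack_i = S_i^0 + S_i^1
    return c_element(acks, prev)


state = 0
state = done_reset([(0, 1), (0, 0), (1, 0)], state)  # bit 1 still computing
print(state)  # 0
state = done_reset([(0, 1), (1, 0), (1, 0)], state)  # all bits valid
print(state)  # 1
state = done_reset([(0, 0), (1, 0), (0, 0)], state)  # reset in progress
print(state)  # 1
state = done_reset([(0, 0), (0, 0), (0, 0)], state)  # all bits reset
print(state)  # 0
```

DoneReset rises only after every bit is valid and falls only after every bit has returned to zero, matching the computation and reset cycles listed on the slide.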
Previous Work:
• N-input C-element: a tree of 2-input C-elements
1. long delay
2. large variance
[Figure: $Ack_0, Ack_1, \ldots, Ack_{n-2}, Ack_{n-1}$ are merged pairwise by a tree of 2-input C-elements.]
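The long delay of the tree can be made concrete with a small count. The sketch below assumes a balanced binary tree of 2-input C-elements (the slide does not say how the tree is balanced), and `tree_levels` and `tree_elements` are hypothetical helper names.

```python
def tree_levels(n: int) -> int:
    """Number of 2-input C-elements on the longest path of a balanced
    tree that merges n Ack signals, i.e. ceil(log2(n))."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1).bit_length()


def tree_elements(n: int) -> int:
    """Merging n signals pairwise needs n - 1 two-input elements."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n - 1


for n in (8, 32, 64):
    print(n, tree_levels(n), tree_elements(n))
```

Under this assumption a 32-bit datapath places five C-elements in series between the last Ack and DoneReset.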
Previous Work:
• N-input C-element:
1. More efficient implementation:
$DoneReset = done + reset \cdot DoneReset$
A. done circuit: an n-input AND
$done = Ack_0 \cdot Ack_1 \cdots Ack_{n-1}$
B. reset circuit: an n-input OR
$reset = Ack_0 + Ack_1 + \cdots + Ack_{n-1}$
C. a 2-input C-element
2. delay & variance: better than the tree of 2-input C-elements
[Figure: $Ack_0, \ldots, Ack_{n-1}$ feed an n-input AND (done) and an n-input OR (reset); done and reset feed a 2-input C-element that produces DoneReset.]
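That the AND/OR form behaves like an n-input C-element can be checked exhaustively for a small n. The sketch below is a logic-level check only; the function names and the choice of n = 4 are assumptions made for illustration.

```python
from itertools import product


def c_element(acks, prev):
    """Reference n-input C-element: set on all ones, clear on all zeros, else hold."""
    if all(acks):
        return 1
    if not any(acks):
        return 0
    return prev


def and_or_form(acks, prev):
    """DoneReset = done + reset * DoneReset, with done = AND and reset = OR of the Acks."""
    done = int(all(acks))
    reset = int(any(acks))
    return done | (reset & prev)


n = 4  # small width chosen so the full truth table is cheap to enumerate
mismatches = [
    (acks, prev)
    for acks in product((0, 1), repeat=n)
    for prev in (0, 1)
    if c_element(acks, prev) != and_or_form(acks, prev)
]
print(len(mismatches))  # 0
```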
Previous Work:
• Wuu’s CDCs [Wuu93]:
$DoneReset = done + reset \cdot DoneReset$
$= \overline{\overline{done + reset \cdot DoneReset}}$
$= \overline{\overline{done} \cdot \overline{reset \cdot DoneReset}}$
A. done circuit: a tree of NAND gates
$done = Ack_0 \cdot Ack_1 \cdots Ack_{n-1}$
B. reset circuit: a tree of NOR gates
$reset = Ack_0 + Ack_1 + \cdots + Ack_{n-1}$
C. long delay
D. small variance
E. uses static gates
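The rewriting of DoneReset into a NAND-only form is an application of De Morgan's law, and it can be confirmed over all eight input combinations. This sketch assumes the final form is a NAND of the complemented done with a NAND of reset and DoneReset; `nand`, `sum_of_products`, and `nand_form` are hypothetical names.

```python
from itertools import product


def nand(a: int, b: int) -> int:
    """2-input NAND on 0/1 values."""
    return 1 - (a & b)


def sum_of_products(done: int, reset: int, prev: int) -> int:
    """DoneReset = done + reset * DoneReset."""
    return done | (reset & prev)


def nand_form(done: int, reset: int, prev: int) -> int:
    """DoneReset = NAND(NOT done, NAND(reset, DoneReset))."""
    return nand(1 - done, nand(reset, prev))


all_equal = all(
    sum_of_products(d, r, p) == nand_form(d, r, p)
    for d, r, p in product((0, 1), repeat=3)
)
print(all_equal)  # True
```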
Previous Work:
• Yun’s CDCs [Yun97]:
A. done circuit: a tree of domino logic
B. no reset circuit
C. variable delay
D. large variance
E. uses dynamic CMOS
[Figure: 8-bit completion detection domino logic with precharge input prech and dual-rail inputs $S_0^0, S_0^1, \ldots, S_7^0, S_7^1$. Four such blocks, covering $(S_i^0 + S_i^1)$ for $i = 0..7$, $8..15$, $16..23$, and $24..31$, are combined to produce the 32-bit done signal.]
Our Design
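• Computation-completion detection circuit:
$done = Ack_0 \cdot Ack_1 \cdots Ack_{n-1}$
$= \overline{\overline{Ack_0} + \overline{Ack_1} + \cdots + \overline{Ack_{n-1}}}$ (dynamic n-input NOR)
$\overline{Ack_i} = \overline{S_i^0 + S_i^1}$ (static 2-input NOR)
[Figure: the done circuit. Each pair $S_i^0, S_i^1$ feeds a static 2-input NOR producing $\overline{Ack_i}$; the $\overline{Ack_i}$ signals drive the parallel NMOS pull-down transistors of a dynamic n-input NOR with a PMOS pull-up, whose output is done.]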
Our Design
• Reset-completion detection circuit:
$reset = Ack_0 + Ack_1 + \cdots + Ack_{n-1}$
$= (S_0^0 + S_0^1) + \cdots + (S_{n-1}^0 + S_{n-1}^1)$ (dynamic 2n-input OR)
$Ack_i = S_i^0 + S_i^1$
[Figure: the reset circuit, a dynamic 2n-input OR over the rails $S_0^0, S_0^1, \ldots, S_{n-1}^0, S_{n-1}^1$ with output reset.]
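The done and reset equations can be written out as Boolean functions of the dual-rail outputs. The sketch below models only the logic (static 2-input NORs into an n-input NOR for done, a 2n-input OR for reset); it does not model the dynamic, transistor-level behavior, and the function names are hypothetical.

```python
from typing import List, Tuple

Rails = List[Tuple[int, int]]  # one (S_i^0, S_i^1) pair per bit


def nor(*bits: int) -> int:
    """NOR on 0/1 values."""
    return int(not any(bits))


def done_signal(rails: Rails) -> int:
    """done: n-input NOR of the complemented Acks, each from a 2-input NOR."""
    ack_bar = [nor(s0, s1) for s0, s1 in rails]
    return nor(*ack_bar)


def reset_signal(rails: Rails) -> int:
    """reset: OR over all 2n rails."""
    return int(any(bit for pair in rails for bit in pair))


partial = [(1, 0), (0, 0), (0, 1)]   # bit 1 has not produced a value yet
complete = [(1, 0), (0, 1), (0, 1)]  # every bit holds a valid dual-rail value
idle = [(0, 0), (0, 0), (0, 0)]      # all rails reset

print(done_signal(partial), reset_signal(partial))    # 0 1
print(done_signal(complete), reset_signal(complete))  # 1 1
print(done_signal(idle), reset_signal(idle))          # 0 0
```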
Our Design
• Computation cycle:
Either $S_i^0$ or $S_i^1$ will eventually be turned on, so every $Ack_i$ goes high.
For the done signal,
1. the PMOS transistor (driven by $\overline{Ack_i}$) will be closed (conducting) and
2. all NMOS transistors will be open (non-conducting).
3. Thus, the done signal will be turned on.
[Figure: the done circuit.]
Our Design
• Computation cycle:
Either $S_i^0$ or $S_i^1$ will eventually be turned on.
For the reset signal,
the reset signal is turned on as soon as any $Ack_i$ signal goes high.
[Figure: the reset circuit.]
Our Design
• Reset cycle:
Whichever of $S_i^0$ or $S_i^1$ was on will eventually be turned off.
For the done signal,
the done signal is turned off as soon as any $Ack_i$ signal is turned off.
[Figure: the done circuit.]
Our Design
• Reset cycle:
Whichever of $S_i^0$ or $S_i^1$ was on will eventually be turned off.
For the reset signal,
the reset signal is turned off only after all $Ack_i$ signals are turned off.
[Figure: the reset circuit.]
Our Design
• done + reset circuits = dual-rail multi-input C-element
• done + reset circuits + 2-input C-element = single-rail multi-input C-element
• Implementation of 2-input C-element:
[Figure: transistor-level schematic of the 2-input C-element with inputs done and reset, output DoneReset, and a device labeled "Weak".]
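The way done, reset, and the 2-input C-element cooperate over one full handshake can be traced step by step. The sketch below is a logic-level trace under assumed input sequences; `c2`, `trace`, and the two-bit example are hypothetical and ignore all circuit delays.

```python
from typing import List, Tuple


def c2(a: int, b: int, prev: int) -> int:
    """2-input C-element: follow the inputs when they agree, otherwise hold."""
    return a if a == b else prev


def trace(steps: List[List[Tuple[int, int]]]) -> List[Tuple[int, int, int]]:
    """Return (done, reset, DoneReset) after each step of dual-rail values."""
    out, state = [], 0
    for rails in steps:
        acks = [s0 | s1 for s0, s1 in rails]
        done = int(all(acks))   # on only when every Ack is on
        reset = int(any(acks))  # on as soon as any Ack is on
        state = c2(done, reset, state)
        out.append((done, reset, state))
    return out


steps = [
    [(0, 0), (0, 0)],  # quiescent
    [(1, 0), (0, 0)],  # first bit becomes valid
    [(1, 0), (0, 1)],  # all bits valid: computation complete
    [(0, 0), (0, 1)],  # first bit resets
    [(0, 0), (0, 0)],  # all bits reset: reset complete
]
result = trace(steps)
for row in result:
    print(row)
```

In this trace DoneReset changes only at the third and fifth steps, when done and reset agree.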
DIRCA With CDC: part 1
DIRCA With CDC: part 2
Our Design
• The PMOS in the pull-up circuit of the done circuit saves power in non-operation mode.
• In a quiescent state, all $Ack_i$ signals are zero, so all pull-down transistors are closed (conducting).
• To save power, the pull-up transistor is open (non-conducting), which cuts off the path from Vdd to ground.
[Figure: the done circuit.]
Our Design
• If the low input to the pull-up PMOS arrives too early, power is wasted.
• If the low input arrives too late, it takes longer to turn on the done signal.
• Low power consumption: use the latest $Ack_i$ signal
• High performance: use any $Ack_i$ signal that is not the latest
[Figure: the done circuit.]
SPICE Output: done circuit
Delay = 0.55 ns
ChengDone0:
1. Ack0 is the latest signal.
2. input pulses: 3 and 4
3. buffered input: 1004
4. Ack0: 100
5. Done: 24680
6. DoneReset: 200
SPICE Output: done circuit
Delay = 0.22 ns
ChengDone1:
1. Ack1 is the latest signal.
2. input pulses: 5 and 6
3. buffered input: 1006
4. Ack1: 101
5. Done: 24680
6. DoneReset: 200
SPICE Output: done circuit
Delay = 0.64 ns
ChengDone37:
1. All Ack signals arrive at the same time.
2. Done: 24680
3. DoneReset: 200
SPICE Output: reset circuit
Delay = 1.23 ns
ChengReset0:
1. Ack0 is the latest signal.
2. input pulses: 3 and 4
3. buffered input: 1004
5. Reset: 13579
6. DoneReset: 200
SPICE Output: reset circuit
Delay = 0.87 ns
ChengReset1:
1. Ack0 is the latest signal.
2. input pulses: 3 and 4
3. buffered input: 1004
5. Reset: 13579
6. DoneReset: 200
SPICE Output: reset circuit
Delay = 1.34 ns
ChengReset37:
1. All Ack signals reset at the same time.
2. Done: 24680
3. DoneReset: 200
Our Design
• Constraint: when the pull-up transistor is conducting and only one pull-down transistor is conducting,
$R_{pull\text{-}up} \ge 5 R_{pull\text{-}down}$
• This can be achieved by properly sizing transistors.
[Figure: the done circuit.]
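One way to see why the sizing constraint matters is to treat the conducting pull-up and a single conducting pull-down as a resistive divider. The sketch below assumes the constraint means the pull-up resistance is at least five times the pull-down resistance, uses a purely resistive model, and picks a 5 V supply for illustration; none of these values come from the slides, and `contended_output` is a hypothetical name.

```python
def contended_output(vdd: float, r_pull_up: float, r_pull_down: float) -> float:
    """Output voltage of a resistive divider with the pull-up tied to Vdd
    and the pull-down tied to ground."""
    return vdd * r_pull_down / (r_pull_up + r_pull_down)


VDD = 5.0  # assumed supply voltage, not given in the slides

for ratio in (1, 3, 5, 10):
    v = contended_output(VDD, float(ratio), 1.0)
    print(f"R_pull-up = {ratio} x R_pull-down -> done node = {v:.2f} V")
```

In this simplified model a ratio of 5 holds the done node at Vdd/6 while one pull-down still conducts, so the output stays near logic low until the last Ack arrives.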
Logic Complexity
Number of transistors:
            done                        done+reset
Circuit     n-bit    32-bit   64-bit    n-bit    32-bit   64-bit
Wuu         10n-4    316      636       14n-8    440      888
Yun         4n-5     123      251       N/A      N/A      N/A
Cheng       5n+1     161      321       7n+5     229      453
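The 32-bit and 64-bit entries follow directly from the n-bit formulas, which the short sketch below evaluates. The dictionary layout and the name `transistor_count` are hypothetical; the formulas are the ones listed in the table.

```python
# Transistor-count formulas from the table, as functions of the bit width n.
FORMULAS = {
    "Wuu": {"done": lambda n: 10 * n - 4, "done+reset": lambda n: 14 * n - 8},
    "Yun": {"done": lambda n: 4 * n - 5},  # no reset circuit
    "Cheng": {"done": lambda n: 5 * n + 1, "done+reset": lambda n: 7 * n + 5},
}


def transistor_count(circuit: str, kind: str, n: int) -> int:
    """Evaluate the formula for one circuit and one column of the table."""
    return FORMULAS[circuit][kind](n)


for circuit, kinds in FORMULAS.items():
    for kind in kinds:
        counts = [transistor_count(circuit, kind, n) for n in (32, 64)]
        print(circuit, kind, counts)
```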
Performance Evaluation
• SPICE simulation:
1. uses MOSIS 2 micron CMOS level 2 parameters
2. W = 3 µm, L = 2 µm (buffer delay 0.4 ns, 2-input NOR delay 0.18 ns)
• Computation-completion detection circuits:
38 typical cases (for Wuu, Yun and Cheng)
The measured delay includes the delay of the OR gate for $Ack_i$.
• Reset-completion detection circuits:
38 typical cases (Wuu and Cheng)
Performance Evaluation
Computation Completion Detection
          32-bit done (ns)           Speed up
Case      Wuu     Yun     Cheng      C vs W   C vs Y
Min       2.18    1.46    0.22       4.1      2.8
Max       2.65    3.36    0.64       10.4     14.3
Avg       2.27    2.53    0.28       9.2      10.2
Performance Evaluation
Reset Completion Detection
          32-bit reset (ns)    Speed up
Case      Wuu     Cheng        C vs W
Min       2.40    0.87         2.0
Max       2.89    1.34         3.1
Avg       2.85    0.71         2.7
Conclusions
• A new completion detection circuit for
dual-rail self-timed components.
1. very fast computation-completion detection
2. very fast reset-completion detection
• A low-overhead, very fast completion detection circuit is crucial for high-performance self-timed circuits.
Conclusions
• SPICE simulation results:
1. our computation-completion detection circuit is on average 9.2 times faster than Wuu's and 10.2 times faster than Yun's.
2. our reset-completion detection circuit is on average 2.7 times faster than Wuu's.