Reducing Power Consumption with Relaxed Quasi Delay-Insensitive Circuits
Christopher LaFrieda and Rajit Manohar
Computer Systems Laboratory, Cornell University

Outline
• Motivation / Background
• Contributions
• Relaxed Quasi Delay-Insensitive (RQDI)
• RQDI Voltage Scaling
• RQDI Two Phase Circuits
• Results
• Summary

Motivation: How Does Dynamic Power Scale?
$P_D = \alpha N C_L V_{dd}^2 f$
Per-generation scaling of each term is given in parentheses:
• α: activity factor (1x)
• N: total number of transistors (2x)
• C_L: average load capacitance per transistor (0.7x)
• V_dd: doesn't scale well anymore. It scaled by 17-20% from 130 nm to 65 nm, but only by 10% at 45 nm and 5.5% at 32 nm.
Since α N C_L grows by 1 × 2 × 0.7 = 1.4 per generation:
$\frac{P_{D1}}{V_{dd1}^2 f_1} = 1.4 \, \frac{P_{D0}}{V_{dd0}^2 f_0}$

Motivation: Power Scaling With Fixed Frequency
[Figure: Power Scaling With Fixed Frequency; Power (%) vs. Technology (nm), 130 nm to 22 nm]

Motivation: Process Variations Getting Worse
Process variation in 65 nm, FO4 delays across corners:
SS corner: 22.6 ps
TT corner: 18.2 ps
FF corner: 13.6 ps
• FF is about 70% faster than SS.
• Circuits need to be robust with respect to process variations. QDI is a logical place to start.

Background: QDI – WCHB Buffer
• Simple buffer.
• Neutrality is checked in the pull-up stack of the C-element.
• Timing assumption?

RQDI: Staticizer Timing Assumption
• Data is neutral and enable is high.
• Data becomes valid, which sets _R0 low. If the R0 inverter is slow, R0 will remain low.
• Nothing is fighting the weak feedback, so _R0 can go high.

RQDI: Half Cycle Timing Assumption
The half cycle timing assumption (HCTA): a small amount of combinational logic (1-2 transitions) will always switch within one half cycle of a process.
At 18 transitions per cycle (t.p.c.), there is a 4.5x timing margin: a half cycle spans 9 transitions, versus at most 2 for the assumed logic.
With worst-case corners, the margin is still 2.7x in 65 nm.
Wire delays make the assumption even more conservative.
QDI already has an HCTA in staticizers.
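The two pieces of arithmetic on the motivation and HCTA slides (per-generation power growth at fixed frequency, and the HCTA timing margin) can be checked with a short script. This is a sketch under stated assumptions, not the authors' analysis: the per-generation factors α 1x, N 2x, C_L 0.7x are assumed to hold unchanged at every node, and the worst-case HCTA margin is assumed to pit the assumed logic at the SS corner against the rest of the half cycle at the FF corner, which reproduces the 2.7x figure. The function names `fixed_frequency_power_ratio` and `hcta_margin` are hypothetical.

```python
"""Back-of-the-envelope checks for the dynamic power and HCTA slides.

Assumptions (labeled, not taken from the slides):
  * the per-generation factors alpha 1x, N 2x, C_L 0.7x hold at every node;
  * the worst-case HCTA margin puts the assumed logic at the slow (SS)
    corner and the rest of the half cycle at the fast (FF) corner.
"""


def fixed_frequency_power_ratio(vdd_reduction, alpha=1.0, n=2.0, c_load=0.7):
    """Return P_D1 / P_D0 for one technology generation at fixed frequency.

    P_D = alpha * N * C_L * Vdd^2 * f, so with f1 == f0 the ratio is
    (alpha * N * C_L growth) * (Vdd1 / Vdd0)^2.
    vdd_reduction is the fractional supply drop, e.g. 0.10 for 10%.
    """
    vdd_ratio = 1.0 - vdd_reduction
    return alpha * n * c_load * vdd_ratio ** 2


def hcta_margin(transitions_per_cycle, assumed_transitions,
                fast_delay_ps=1.0, slow_delay_ps=1.0):
    """Return the timing margin of the half cycle timing assumption.

    The rest of the process must complete half a cycle of transitions
    (each at fast_delay_ps) before the assumed logic (assumed_transitions,
    each at slow_delay_ps) could be observed unfinished.
    """
    half_cycle_ps = (transitions_per_cycle / 2) * fast_delay_ps
    assumed_path_ps = assumed_transitions * slow_delay_ps
    return half_cycle_ps / assumed_path_ps


if __name__ == "__main__":
    # No Vdd scaling, then the 10% supply drop the slide quotes for 45 nm.
    for drop in (0.0, 0.10):
        print(f"Vdd drop {drop:.0%}: power x{fixed_frequency_power_ratio(drop):.3f}")

    # 18 transitions per cycle, at most 2 transitions of assumed logic.
    print(f"HCTA margin, nominal:    {hcta_margin(18, 2):.1f}x")
    # 65 nm FO4 delays: FF 13.6 ps, SS 22.6 ps.
    print(f"HCTA margin, worst case: {hcta_margin(18, 2, 13.6, 22.6):.1f}x")
```

Run as a script, it reports HCTA margins of 4.5x and 2.7x, matching the slide, and a 1.134x power increase for a 10% supply drop at fixed frequency.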
RQDI allows HCTAs everywhere, not only in staticizers.

RQDI: HCHB Template
• N tracks neutrality.
• Check N+, but assume N- happens in the first half cycle.
• Two-transition latency.
• 14-transition cycle time.
• Validity must be checked by the pulldown.

RQDI Voltage Scaling: Scaling Scenarios
• Two possible scenarios for voltage scaling.
• Top (mismatched slack): the lower pipeline can run slower.
• Bottom (token-limited loop): latency through the loop should be minimal, but cycle time can scale.
• In some applications these scenarios can't be avoided.

RQDI Voltage Scaling: Slack Mismatch in an FPGA
• Logic blocks (LB) for logic.
• Switch boxes (SB) for routing.
• Limited routing resources.
• Imperfect slack matching.
• Voltage can be scaled on the blue path in the figure.

RQDI Voltage Scaling: DVHB Dual Voltage Template
• Data rails are full swing.
• Acknowledges are low swing.
• Latency remains constant under voltage scaling.
• Cycle time can be adjusted through voltage scaling.

RQDI Two Phase Circuits: Two Phase Buffer (HCFB2P)
• An HCTA exists on the right pair of XORs.
• Two-transition latency.
• Seven-transition cycle time.
• Twice the area of a WCHB; however, it can replace two stages.

RQDI Two Phase Circuits: Two Phase in an FPGA
• Replace routing (SB) with two-phase logic.
• Logic (LB) remains four-phase.
• Phase converters are placed around logic blocks.
• Routing makes up over half the area of an asynchronous FPGA, so power savings can be large.
[Figure: Width N Switch]

RQDI Two Phase Circuits: Converters
• Converters are needed between two-phase (for routing) and four-phase (for logic).
• The 4:2 converter is 3x larger than a WCHB.
• The 2:4 converter is 3.25x larger than a WCHB.

Experimental Setup
• Simulated in HSPICE with a 65 nm bulk technology.
• Circuits are sized to the drive strength of a 20/10 lambda inverter.

Name | Description | Inputs | Outputs | Implies Validity?
and2 | And | 2 | 1 | No
or2 | Or | 2 | 1 | No
xor2 | Exclusive Or | 2 | 1 | Yes
fa | Full Adder | 3 | 2 | Yes
benc | Booth Encoder | 3 | 2 | No

Results: HCHB – Energy Per Cycle
• HCHB consumes 32% less energy than PCHB.
• HCHB consumes 36% less energy than PCEHB.
• Slight frequency improvement.
• Negligible latency penalty.
[Figure: Energy per operation (pJ) for PCHB, PCEHB, and HCHB on and2, or2, xor2, fa, benc, and avg]

Results: HCHB – Total Transistor Area
• Despite the additional transistors needed to check validity, HCHB is smaller.
• HCHB is about 20% smaller than PCHB.
• HCHB is about 15% smaller than PCEHB.
[Figure: Total transistor area (µm²) for PCHB, PCEHB, and HCHB on and2, or2, xor2, fa, benc, and avg]

Results: DVHB – Low Voltage vs. Dual Voltage
[Figure: Power (%) and Dynamic Slack vs. Voltage (V), 1 V to 0.5 V, comparing DVHB and Low Vdd]

Results: HCFB2P Switch – Energy Reduction vs. WCHB
• Wider switches mean larger MUXes and larger phase converters (PCs).
• The associated capacitances switch half as much.
• Over 50% reduction in power, due to replacing two stages.
[Figure: Energy Reduction (%) vs. Switch Width, widths 2 to 16]

RQDI Two Phase Circuits: Results – Area Overhead
• Typically, there are about 8 stages of 4-wide switches between logic blocks.
• Area overhead is 15%.
• With direct connections, there are about 10 stages, with an overhead of 10%.
[Figure: Area Overhead (%) vs. Number of Stages (4 to 20), width 4]

Summary
RQDI allows half cycle timing assumptions outside of staticizers.
With RQDI, we can simplify the PCHB logic template. The resulting template, HCHB, consumes 32% less energy than PCHB.
The dual voltage logic template can be used to adjust the dynamic slack of a stage. This allows us to save energy with a minimal throughput penalty in token-limited loops.
Replacing the routing in an FPGA with two-phase logic can reduce energy consumption by 50%.
The RQDI two-phase buffer and converters achieve this with a 10-15% area overhead.

Questions?
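The two-phase routing results attribute part of the savings to wire capacitances switching half as often. A minimal sketch of why, assuming a dual-rail channel with one acknowledge wire: a four-phase (return-to-zero) handshake costs four wire transitions per token (data up, ack up, data down, ack down), while a two-phase (transition-signaling) handshake costs two (one data-rail toggle, one ack toggle). The function `count_transitions` is hypothetical and counts transitions only; it does not model capacitance, the XOR logic in HCFB2P, or the converters.

```python
"""Count handshake wire transitions on a dual-rail channel.

Assumption (labeled, not from the slides): a dual-rail channel with one
acknowledge wire, using either a four-phase return-to-zero handshake or a
two-phase transition-signaling handshake.
"""


def count_transitions(bits, protocol):
    """Return the number of wire transitions needed to send `bits`.

    protocol="four": data rail up, ack up, data rail down, ack down.
    protocol="two":  toggle one data rail, toggle ack.
    """
    if protocol not in ("four", "two"):
        raise ValueError(f"unknown protocol: {protocol!r}")

    rails = [0, 0]  # rails[0] encodes a 0, rails[1] encodes a 1
    ack = 0
    transitions = 0

    def toggle_rail(i):
        nonlocal transitions
        rails[i] ^= 1
        transitions += 1

    def toggle_ack():
        nonlocal ack, transitions
        ack ^= 1
        transitions += 1

    for bit in bits:
        if protocol == "four":
            toggle_rail(bit)  # data becomes valid
            toggle_ack()      # receiver acknowledges
            toggle_rail(bit)  # data returns to neutral
            toggle_ack()      # acknowledge resets
            # Every four-phase token ends back in the all-zero state.
            assert rails == [0, 0] and ack == 0
        else:
            toggle_rail(bit)  # a single edge on the selected rail is the token
            toggle_ack()      # a single edge on ack acknowledges it
    return transitions


if __name__ == "__main__":
    tokens = [0, 1, 1, 0, 1]
    four = count_transitions(tokens, "four")
    two = count_transitions(tokens, "two")
    print(f"four-phase: {four} transitions, two-phase: {two} transitions")
    print(f"two-phase / four-phase = {two / four:.2f}")
```

For any token stream, the two-phase count is exactly half the four-phase count, which corresponds to the "capacitances switch half as much" effect; the slides also credit part of the savings to one HCFB2P stage replacing two stages.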