Power Modeling and Architecture Evaluation for FPGA with Novel Circuits for Vdd Programmability
Yan Lin, Fei Li and Lei He
EE Department, UCLA
[email protected]
Partially supported by NSF.
Overview

FPGA architecture evaluation
  Area and delay [Rose et al, JSSC’90]
  Power [Poon et al, FPLA’02][Li et al, FPGA’03]
Vdd programmability for power reduction
  Concept in [FPGA’03]
  Application to logic [FPGA’04][DAC’04]
  Application to interconnects [ICCAD’04][Anderson et al, ICCAD’04]
Novel circuits and architecture evaluation for FPGAs with Vdd-programmability
  Reduce power by 50% with 17% area and 3% delay increase

Outline

Power modeling and architecture evaluation methodology
FPGA Circuits for Vdd Programmability
Architecture Evaluation with Vdd programmability
Conclusions and Ongoing Work
Framework fpgaEva-LP
[Figure: fpgaEva-LP flow. Benchmark circuits -> Logic Optimization (SIS) -> Tech-Mapping (RASP) -> Timing-Driven Packing (T-VPack) -> Placement & Routing (VPR). Other blocks: Arch Spec, Parasitic Extraction, Cycle-accurate Power Simulator. Outputs: Area, Delay, Power]
FPGA Structure and Models

Cluster-based Island Style FPGA Structure
Area and delay models similar to [Betz-Rose-Marquardt]
100% buffered interconnects, subset switch block
input fc = 50%, output fc = 25%
But based on layout and SPICE for 100nm and below
Mixed-level power model from [FPGA’03]
Dynamic power
  Capacitive power
    Functional switch
    Glitch
  Short-circuit power (depends on transition time)
Static power
  Sub-threshold leakage
  Reverse biased leakage
  Gate leakage
New Power Model in fpgaEva-LP2

Short-circuit power: switching time * switching power
  fpgaEva-LP used average signal transition time
  fpgaEva-LP2 calculates the transition time for each buffer as tr = α * tbuffer, where tbuffer is the buffer delay
    α is NOT a constant 2 as in literature, due to input slew
    α is pre-characterized by SPICE

buffer delay    α
< 0.012 ns      2
< 0.03 ns       4.4
> 0.03 ns       7
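The per-buffer rule on this slide (transition time as a delay-dependent multiple α of the buffer delay) can be sketched as a small lookup. This is only an illustration of the table, not the fpgaEva-LP2 implementation; the function names are hypothetical, and the handling of a delay of exactly 0.03 ns is an assumption because the slide lists only "< 0.03 ns" and "> 0.03 ns".

```python
def alpha_for_buffer_delay(t_buffer_ns: float) -> float:
    """Return the SPICE pre-characterized alpha for a buffer delay in ns."""
    if t_buffer_ns < 0.012:
        return 2.0
    if t_buffer_ns < 0.03:
        return 4.4
    # Assumption: a delay of exactly 0.03 ns is treated like "> 0.03 ns".
    return 7.0


def transition_time_ns(t_buffer_ns: float) -> float:
    """Transition time tr = alpha * tbuffer, with alpha chosen per buffer."""
    return alpha_for_buffer_delay(t_buffer_ns) * t_buffer_ns


for delay in (0.010, 0.020, 0.050):
    print(delay, alpha_for_buffer_delay(delay), transition_time_ns(delay))
```

A slow buffer therefore gets a disproportionately longer transition time than the constant factor of 2 would predict, which is what raises its estimated short-circuit power.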
Validation Using SPICE

Validate by comparison for each power-component
High fidelity with average absolute error of 8%
[Figure: FPGA power (watt), axis from 0 to 0.0025, for benchmark circuits b1, parity, cm138a, z4ml, decode; series: SPICE simulation, fpgaEva-LP, fpgaEva-LP2]
Impact of Random Seeds in VPR
[Figure: FPGA energy (nJ/cycle) versus critical path delay (ns) for circuit s38584, one point per VPR random seed (10 seeds); annotated spread of +12% in delay and +5% in energy]
12% delay variation and 5% energy variation
Min-delay solution among 10 runs is used
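The selection rule above (run VPR with several seeds, keep the minimum-delay solution) can be sketched as follows. The run data are made up for illustration and are not the s38584 results; measuring the spread relative to the smallest value is also an assumption about how the variation percentages were computed.

```python
from typing import List, Tuple

# (seed, critical_path_delay_ns, energy_nj_per_cycle); hypothetical values
Run = Tuple[int, float, float]


def pick_min_delay(runs: List[Run]) -> Run:
    """Keep the run with the smallest critical path delay."""
    return min(runs, key=lambda run: run[1])


def spread_percent(values: List[float]) -> float:
    """Spread of a metric across runs, relative to its smallest value."""
    lowest = min(values)
    return (max(values) - lowest) / lowest * 100.0


runs: List[Run] = [(1, 10.2, 5.30), (2, 10.9, 5.42), (3, 11.4, 5.50), (4, 11.0, 5.36)]
best = pick_min_delay(runs)
delay_spread = spread_percent([run[1] for run in runs])
energy_spread = spread_percent([run[2] for run in runs])
print(best, round(delay_spread, 1), round(energy_spread, 1))
```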
Evaluation of Single-Vdd FPGAs
[Figure: total FPGA energy (nJ/cycle) versus critical path delay (ns), one labeled point per architecture (N, k)]
Architectures explored
  Cluster size N = {6, 8, 10, 12}
  LUT size k = {3, 4, 5, 6, 7}
Energy-delay (ED) dominant architectures
  Architectures with smaller delay or less energy than any other architecture
  A relaxed ED-dominant set may also be valuable
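One way to read the ED-dominant definition is as a non-dominated (Pareto) set: an architecture stays if no other architecture is at least as good in both delay and energy and strictly better in one. The sketch below follows that reading, which is an interpretation of the slide; the (delay, energy) numbers are invented and are not taken from the plot.

```python
from typing import Dict, List, Tuple

Arch = Tuple[int, int]            # (N, k)
Point = Tuple[float, float]       # (delay_ns, energy_nj_per_cycle)


def ed_dominant(points: Dict[Arch, Point]) -> List[Arch]:
    """Return the architectures not dominated in both delay and energy."""
    keep = []
    for arch, (delay, energy) in points.items():
        dominated = any(
            d2 <= delay and e2 <= energy and (d2 < delay or e2 < energy)
            for other, (d2, e2) in points.items()
            if other != arch
        )
        if not dominated:
            keep.append(arch)
    # Sort by delay so the set reads from fastest to lowest energy.
    return sorted(keep, key=lambda a: points[a][0])


# Hypothetical data for illustration only.
sample = {
    (8, 7): (10.0, 6.0),
    (12, 7): (11.0, 8.5),
    (10, 4): (12.5, 3.5),
    (6, 3): (16.0, 4.5),
}
print(ed_dominant(sample))
```

A relaxed set could be obtained by allowing a small tolerance in the two comparisons, so that near-dominant architectures are also kept.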
Energy versus Delay
For 100nm ITRS technology


Min-Energy arch: (N,k) = (10,4) or (8,4)
Min-Delay arch: (N,k) = (8,7), with 0.8x delay but 1.7x power
[Figure: total FPGA energy (nJ/cycle) versus critical path delay (ns), one labeled point per architecture (N, k), with the current commercial architecture marked]
Vdd-programmable FPGA
[FPGA’04][DAC’04][ICCAD’04]
Vdd-programmable logic block
  Vdd selection
  Power-gating unused blocks
Vdd-programmable switch
  Vdd-level conversion is needed when VddL drives VddH, to avoid excessive leakage
Vdd-programmable Routing Switch

Conventional routing switch
Vdd-programmable routing switch
  Brute-force design [ICCAD’04]
    Two extra SRAM cells for each routing switch
  New design
    One extra SRAM cell
    NAND2 gate: minimum size & high-Vt transistor
Vdd-Programmable Interconnect Connection Block
Brute-force design [ICCAD’04]
  2n extra SRAM cells for n connection switches
New design
  Only TWO extra SRAM cells for n connection switches
  Control logic includes 2n NAND2 and a decoder
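The SRAM savings of the new designs follow directly from the counts on these two slides. The sketch below only tabulates those counts; the function names and the example value n = 32 are hypothetical, and the area of the added NAND2 gates and decoder is not modeled.

```python
def extra_sram_routing_switch(design: str) -> int:
    """Extra SRAM cells per Vdd-programmable routing switch."""
    counts = {"brute_force": 2, "new": 1}
    return counts[design]


def extra_sram_connection_block(n_switches: int, design: str) -> int:
    """Extra SRAM cells for a connection block with n connection switches."""
    if design == "brute_force":
        return 2 * n_switches
    if design == "new":
        return 2  # independent of n; control uses 2n NAND2 gates and a decoder
    raise ValueError(f"unknown design: {design}")


n = 32  # hypothetical number of connection switches
saved = extra_sram_connection_block(n, "brute_force") - extra_sram_connection_block(n, "new")
print(saved)
```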
Power and Delay

Vdd-programmable switch uses
  4X PMOS power transistor for 7X routing switch
  1X PMOS power transistor for 4X connection switch
Compared to conventional switch
  1000X less leakage power
  Connection box is 28% faster and has 18% less dynamic power
    By moving the mux off the critical path of the connection box

(Vdd = 1.3 V)
Type          Switch delay (s)                       Energy per switch (J)
              w/o power trans.   w/ power trans.     w/o power trans.   w/ power trans.
Routing       5.9E-11            6.5E-11 (+11%)      3.3E-14            3.2E-14 (-2%)
Connection    2.9E-10            2.1E-10 (-28%)      3.8E-14            3.1E-14 (-18%)
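The percentage changes in the table can be recomputed from its entries as a sanity check. The sketch below uses the rounded values printed on the slide, so it reproduces the quoted percentages only approximately (the originals were presumably computed from unrounded data); treating the delay entries as seconds is an assumption, and the dictionary layout is hypothetical.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change in percent going from 'before' to 'after'."""
    return (after - before) / before * 100.0


# (without power transistor, with power transistor), as printed on the slide
table = {
    "routing": {"delay_s": (5.9e-11, 6.5e-11), "energy_j": (3.3e-14, 3.2e-14)},
    "connection": {"delay_s": (2.9e-10, 2.1e-10), "energy_j": (3.8e-14, 3.1e-14)},
}

changes = {
    switch: {metric: pct_change(*pair) for metric, pair in metrics.items()}
    for switch, metrics in table.items()
}
for switch, metrics in changes.items():
    print(switch, {metric: round(value, 1) for metric, value in metrics.items()})
```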
Vdd-gateable Routing Switch

Conventional
Vdd-gateable
  Two states: normal Vdd or power-gating
  Enables power-gating capability w/o extra SRAM cells
  Power transistor can be replaced by a tri-state buffer
Vdd-gateable Connection Block


Conventional

Vdd-gateable
  Enables power-gating capability w/ only one extra SRAM cell and a low-leakage decoder
FPGA Architecture Classes
Architecture Class   Logic Block             Interconnect
Class0 (baseline)    single-Vdd              single-Vdd
Class1               programmable dual-Vdd   programmable dual-Vdd, level converters in routing
Class2               programmable dual-Vdd   VddH and Vdd-gateable
Class3               programmable dual-Vdd   Class1, but no level converters in routing

High-Vt is applied to configuration SRAM cells for all the classes
Vdd-level Converters

Class3 removes Vdd-level converters from the interconnects in Class1
  Under the constraint that no VddL drives VddH
We developed a routing algorithm in which each routing tree has a single Vdd level
  But trees with different Vdd levels can share the same wire track
Alternative approaches:
  Combined Vdd-level converter and buffer [Anderson et al, ICCAD’04]
  Our new work [DAC’05] allows dual Vdd in a tree, with chip-level time slack budgeting for extra power reduction
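The Class3 constraint can be expressed as two simple checks: a level converter is needed only when a VddL driver feeds a VddH sink, and a routing tree avoids that case internally if all of its segments use one Vdd level. The sketch below is a minimal illustration with hypothetical names and data structures; it is not the routing algorithm from the talk.

```python
from enum import Enum
from typing import Iterable, List


class Vdd(Enum):
    LOW = "VddL"
    HIGH = "VddH"


def needs_level_converter(driver: Vdd, sink: Vdd) -> bool:
    """Level conversion is required only when VddL drives VddH."""
    return driver is Vdd.LOW and sink is Vdd.HIGH


def tree_has_single_vdd(segment_vdds: Iterable[Vdd]) -> bool:
    """True if every segment of one routing tree uses the same Vdd level."""
    return len(set(segment_vdds)) <= 1


def track_is_legal(trees: List[List[Vdd]]) -> bool:
    """Trees sharing a wire track may differ in Vdd, but each must be uniform."""
    return all(tree_has_single_vdd(tree) for tree in trees)


shared_track = [[Vdd.LOW, Vdd.LOW], [Vdd.HIGH, Vdd.HIGH, Vdd.HIGH]]
print(track_is_legal(shared_track))
```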
Energy versus Delay
[Figure: total FPGA energy/cycle (nJ) versus critical path delay (ns) for Class 0, Class 1, Class 2, and Class 3; LUT size 7 architectures such as (8, 7) and (6, 7) form the high-performance end, and LUT size 4 architectures such as (8, 4), (6, 4), and (10, 4) form the low-energy end]
ED-product reduction
  20% by Class1 (Vdd-programmable interconnects w/ level converters)
  45% by Class2 (Vdd-gateable interconnects)
  50% by Class3 (Class1 minus level converters)
Performance degrades 3% due to Vdd programmability
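For reference, the energy-delay (ED) product reduction between a baseline and a new design point can be computed as below. The formula is the usual definition of an ED product; the numeric inputs are hypothetical and do not correspond to any point on the plot.

```python
def ed_product(energy_nj: float, delay_ns: float) -> float:
    """Energy-delay product of one design point."""
    return energy_nj * delay_ns


def ed_reduction_percent(base_energy: float, base_delay: float,
                         new_energy: float, new_delay: float) -> float:
    """Percentage by which the ED product drops relative to the baseline."""
    ratio = ed_product(new_energy, new_delay) / ed_product(base_energy, base_delay)
    return (1.0 - ratio) * 100.0


# Hypothetical example: energy drops from 4.0 to 2.2 nJ while delay grows by 3%.
reduction = ed_reduction_percent(4.0, 11.0, 2.2, 11.0 * 1.03)
print(round(reduction, 1))
```

Because delay enters the product directly, a small delay penalty offsets part of the energy saving in the ED figure.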
Energy versus Area
[Figure: total FPGA energy/cycle (nJ) versus total FPGA device area (axis from 6.00E+06 to 2.60E+07) for Class0, Class1, Class2, and Class3, one labeled point per architecture (N, k); the min-area and min-energy points are marked]
Average area overhead
  118% for Class1 (Vdd-programmable interconnects w/ level converters)
  17% for Class2 (Vdd-gateable interconnects)
  52% for Class3 (Vdd-programmable interconnects w/o level converters)
Class2 is the best considering both energy and area
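The conclusion that Class2 gives the best balance can be illustrated by putting the ED-product reductions and area overheads quoted in the talk side by side. The ratio used below (ED reduction per percent of area overhead) is an ad hoc figure of merit chosen for this sketch, not a metric from the talk.

```python
# ED-product reduction (%) and average area overhead (%) as quoted on the slides
classes = {
    "Class1": {"ed_reduction": 20.0, "area_overhead": 118.0},
    "Class2": {"ed_reduction": 45.0, "area_overhead": 17.0},
    "Class3": {"ed_reduction": 50.0, "area_overhead": 52.0},
}


def reduction_per_area(entry: dict) -> float:
    """Assumed figure of merit: ED reduction gained per percent of extra area."""
    return entry["ed_reduction"] / entry["area_overhead"]


ranking = sorted(classes, key=lambda name: reduction_per_area(classes[name]), reverse=True)
best = ranking[0]
print(ranking)
```

Class3 achieves the largest ED reduction in absolute terms, so a different weighting of energy against area could change the ordering.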
Energy Breakdown
[Figure: stacked-bar breakdown of total FPGA energy (nJ/cycle) for Class0, Class1, Class2, and Class3 at FPGA architecture (N,k) = (12,4); components: logic leakage, logic dynamic, local interconnect leakage, local interconnect dynamic, global interconnect leakage, and global interconnect dynamic energy]
Class2 and Class3 dramatically reduce global interconnect leakage
But Class1 fails due to leakage in Vdd-level converters
Area Overhead
[Figure: FPGA area overhead (axis 0% to 20%) for Class2: Vdd-gateable interconnects + Vdd-programmable CLBs, architecture (12, 4)]

Component                                  Overhead
Logic Blocks                               3.19%
  Power Transistors & SRAMs (CLBs)         1.39%
  Vdd-level Converters (CLBs)              1.80%
Connection Blocks                          10.38%
  Control (Connection Blocks)              4.82%
  Power Transistors (Connection Blocks)    4.96%
  SRAMs (Connection Blocks)                0.60%
Routing Switches                           3.87%
  Power Transistors (Routing Switches)     3.87%

17% = 9% for power transistors + 5% for control + 2% for SRAM
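The chart's group totals and the overall 17% figure can be cross-checked by summing the components. The sketch below assumes each percentage in the chart belongs to the label printed next to it, an assignment that the group totals support; the dictionary layout is hypothetical.

```python
# Area overhead components for Class2 at (N, k) = (12, 4), in percent
breakdown = {
    "Logic Blocks": {
        "Power Transistors & SRAMs (CLBs)": 1.39,
        "Vdd-level Converters (CLBs)": 1.80,
    },
    "Connection Blocks": {
        "Control": 4.82,
        "Power Transistors": 4.96,
        "SRAMs": 0.60,
    },
    "Routing Switches": {
        "Power Transistors": 3.87,
    },
}

# Sum each group, then sum the groups to get the total overhead.
group_totals = {group: round(sum(parts.values()), 2) for group, parts in breakdown.items()}
total_overhead = round(sum(group_totals.values()), 2)
print(group_totals, total_overhead)
```

The components add up to about 17.4%, consistent with the 17% quoted for Class2.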
Conclusions and New Results





Field programmability is needed for fine-grained dual-Vdd and Vdd-gating in FPGAs
Vdd-gating offers a better area-power tradeoff than Vdd-selection
  45% energy-delay product reduction with 17% area overhead
Architecture with Vdd-programmability
  LUT size 4: low energy and area
  LUT size 7: best performance
New results [DAC’05]
  Time slack allocation for Vdd-programmable interconnects
  Device and architecture co-optimization for 77% energy-delay reduction
References and Download
All references and tools at http://eda.ee.ucla.edu
Results in the slides have been updated compared to the paper in ISFPGA’05