FPGA Logic Cluster Design
Dr. Philip Brisk
Department of Computer Science and Engineering
University of California, Riverside
CS 223
How Much Logic Should Go in an FPGA Logic Block?
Vaughn Betz, Jonathan Rose
IEEE Design & Test of Computers
15(1): 10-15 (1998)
Three Questions
• How many inputs should the FPGA routing provide to a cluster of LUTs? (I)
– Routing flexibility vs. area
• As the number of LUTs in a logic cluster changes, how should the FPGA’s routing architecture change? (Fc)
• How many LUTs should be included in a cluster? (N)
Experimental Methodology
• 20 MCNC benchmarks
– Well established
– A bit old, even by 1998 standards
– Sadly, still in use
• 4-LUT architecture
• Fs = 3
– Vary the other parameters to see what works best
Area Model
• Count the number of minimum-width transistors required to implement a benchmark circuit in an FPGA architecture
• Normalized area = (number of min-width transistors used) / (number of BLEs used)
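As a minimal sketch of the normalized-area metric, the Python function below divides a transistor count by a BLE count. The function name and the sample numbers are hypothetical and do not come from the paper.

```python
def normalized_area(min_width_transistors: float, bles_used: int) -> float:
    """Area per BLE, measured in minimum-width transistor areas."""
    if bles_used <= 0:
        raise ValueError("bles_used must be positive")
    return min_width_transistors / bles_used


# Hypothetical example: 1,200,000 transistor areas spread over 1,000 BLEs.
area_per_ble = normalized_area(1_200_000, 1_000)
print(area_per_ble)
```

Normalizing by the number of BLEs lets circuits of different sizes be compared on the same scale.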
How many cluster inputs do we need?
• Input sharing and output re-use within a logic cluster
• Utilization reaches nearly 100% when I is 50-60% of the total number of BLE inputs
• BLEs that share common inputs can be packed together
• Locally generated outputs can be re-used
• Works because the packing algorithm was effective!
Visual Depiction
[Figure: Configurable Logic Block (CLB). Each CLB has N BLEs (K-LUTs). A C Block (inputs) taps the W routing segments through W×Fcin:1 multiplexers and isolation buffers; intra-cluster routing feeds the K inputs of each BLE; N local feedbacks return the BLE outputs; through the C Block (outputs), each BLE connects to W×Fcout segments in the routing channel (fanout).]
• I = number of CLB inputs
• I ≈ 0.6KN is pretty good
• Use the feedbacks!
The Packer was Effective!
• It packed BLEs together to share common inputs
• It re-used locally generated outputs via the feedbacks
Cluster inputs vs. Cluster size
• Required inputs grow approximately as 2N + 2
• N = 1: a BLE uses 3.5 of its 4 inputs, on average
• N = 16: the BLEs use 19.7 of 64 inputs, on average
Commercial FPGAs
• Altera Flex 8000 FPGA uses a cluster of size N = 8 with I = 24
– Results suggest reducing I to 18 (saves area)
• Xilinx 5200 FPGA uses a cluster of size N = 4 with I = 16
– Results suggest reducing I to 10 (saves area)
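The suggested input counts are what the I = 2N + 2 rule for 4-LUT clusters gives. A small sketch, with a hypothetical function name, reproduces the two figures quoted for these devices:

```python
def suggested_inputs(n: int) -> int:
    """Cluster inputs suggested by the I = 2N + 2 rule (4-LUT clusters)."""
    return 2 * n + 2


# (device, N, I as shipped), taken from the slide.
devices = [("Altera Flex 8000", 8, 24), ("Xilinx 5200", 4, 16)]

for name, n, shipped in devices:
    suggested = suggested_inputs(n)
    print(f"{name}: N={n}, I={shipped} -> suggested I={suggested} "
          f"({shipped - suggested} fewer inputs)")
```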
Routing Flexibility vs. Cluster Size
• Set Fc = W/N
– Each routing track is driven by one LUT output pin in the cluster
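One way to picture Fc = W/N is to split the W tracks evenly among the N BLE outputs. The sketch below assumes W is a multiple of N and uses a round-robin assignment; the assignment pattern and the function name are illustrative assumptions, not details from the paper.

```python
def assign_tracks(w: int, n: int) -> dict[int, list[int]]:
    """Map each of n BLE outputs to w // n routing tracks (round-robin)."""
    if n <= 0 or w % n != 0:
        raise ValueError("this sketch assumes W is a positive multiple of N")
    return {ble: list(range(ble, w, n)) for ble in range(n)}


tracks = assign_tracks(w=12, n=4)  # hypothetical W = 12, N = 4
for ble, driven in tracks.items():
    print(f"BLE {ble} drives tracks {driven}")
```

With this assignment every track has exactly one driver from the cluster, and each output reaches W/N tracks.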
Area Efficiency vs. Cluster Size
• I is set to achieve 98% logic utilization
• N = 2 BLEs introduces intra-cluster routing
• Area efficiency rapidly degrades beyond this point (marked on the plot)
• Reduced routing between logic blocks
Conclusions
• I = 2N + 2 for N < 16
– Slow, linear growth
• Reduce Fc
– Works because LUT inputs are equivalent
• Cluster area efficiency is within 10% for 1 < N < 8
• Large clusters reduce the size of the placement problem and increase FPGA speed
The Effect of LUT and Cluster Size on Deep-Submicron FPGA Performance and Density
Elias Ahmed, Jonathan Rose
IEEE Transactions on VLSI Systems
12(3): 288-298 (2004)
Contributions
• Vary LUT size (K) from 2 to 7
• Vary cluster size (N) from 1 to 10 LUTs
– Experimentally determine the number of cluster inputs (I) as a function of K and N
– Clustering small LUTs (K = 2, 3) produces good area results, but bad performance (~2x worse)
– LUTs of size K = 4, 5, 6 in clusters of size N = 3…10 yield the best area-delay product
CAD Flow
Inputs Required for 98% Area Utilization
• I = ½K(N+1), where I is the number of CLB inputs
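The rule I = ½K(N+1) can be checked against the earlier 4-LUT result: for K = 4 it reduces to 2N + 2. A short sketch follows; the function name is hypothetical, and rounding up to a whole pin when K(N+1) is odd is an assumption made here.

```python
def cluster_inputs(k: int, n: int) -> int:
    """Inputs from I = K(N + 1) / 2, rounded up to a whole pin (assumption)."""
    return (k * (n + 1) + 1) // 2


# For K = 4 the rule matches I = 2N + 2 from the 1998 study.
for n in range(1, 11):
    assert cluster_inputs(4, n) == 2 * n + 2

print(cluster_inputs(6, 8))  # K = 6, N = 8
```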
Total Area
• Intra-cluster routing area is 25-35% of the total area
• LUT sizes of K = 4,5 are the most area efficient for all cluster sizes
• Reduction in total area as cluster size increases from 1-3 for all LUT sizes
• As clusters are made larger (N > 4) there is little impact on total FPGA area
Total Intra-cluster Routing Area
Cluster size increases much faster than the number of clusters decreases, hence the upward trend
#Clusters and Area/Cluster vs. K
[Figure: CLB diagram repeated, annotated "25-35%" and "N = 1 BLE per cluster"]
LUT area vs. Intra-cluster Mux Area
• Intra-cluster routing area is 25-35% of logic cluster area
• LUT area dominates
Intra-cluster Routing Area as a Function of LUT Size
Total intra-cluster routing area decreases near-linearly from K = 3 to 7
Total Intra-cluster Routing Area
Routing area decreases linearly with LUT size
• Increasing the LUT size decreases the number of clusters used faster than it increases the routing area per cluster
• Depends on good CAD tools
The product of these two curves gives the total inter-cluster routing area.
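The product relationship can be sketched numerically. All numbers below are made-up placeholders chosen only to show the shape of the calculation; they are not measurements from the paper.

```python
# Hypothetical per-K data: number of clusters and routing area per cluster
# (arbitrary units). The cluster count falls faster than per-cluster area rises.
num_clusters = {3: 1000, 4: 700, 5: 520, 6: 400, 7: 320}
area_per_cluster = {3: 100, 4: 120, 5: 140, 6: 160, 7: 180}

# Total routing area is the product of the two curves at each K.
total_area = {k: num_clusters[k] * area_per_cluster[k] for k in num_clusters}

for k, area in total_area.items():
    print(f"K={k}: total routing area = {area}")
```

With these placeholder values the total falls as K grows, even though each cluster's routing area rises.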
Critical Path Delay vs. LUT Size
As N and K increase
• LUT delay and the delay through a single cluster increase
• The number of LUTs and clusters in series on the critical path decreases
• Global routing delay is reduced
Increasing both N and K has a positive effect
• Benefits saturate as N and K get large
Intra-cluster Delay vs. LUT Size
Intra-cluster delay decreases as K increases
• Reduction in number of BLE levels on critical path
Intra-cluster delay increases as N increases
• Larger intra-cluster muxes are slower
• The delay through these muxes is still much smaller than the global routing delay
BLE Delay vs. K
BLE delay increases linearly as K increases (intuitive)
Number of BLEs on the critical path decreases quadratically as K increases
• Fewer, but larger, BLEs
Global Routing Delay vs. K
As K increases
• Fewer LUTs on the critical path
• Fewer global routing links
As N increases
• More opportunities to use faster intra-cluster routing
Critical Path Delay (K = 4)
• K remains constant
– No reduction in number of BLEs on critical path
• N increases
– BLE and intra-cluster routing delay increase
– More logic implemented internally within clusters
– Can use faster intra-cluster routing instead of global routing
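The trade-off can be illustrated with a toy delay model: the critical path is a chain of BLEs, and each BLE-to-BLE hop uses either intra-cluster or global routing. The model, its parameter names, and every delay value are assumptions for illustration only.

```python
def critical_path_delay(levels, ble_delay, intra_delay, global_delay, local_fraction):
    """Toy model: `levels` BLEs in series, with levels - 1 hops between them.

    `local_fraction` is the share of hops that stay inside a cluster.
    """
    hops = levels - 1
    routing = hops * (local_fraction * intra_delay
                      + (1 - local_fraction) * global_delay)
    return levels * ble_delay + routing


# Hypothetical delays in ns; K is fixed, so `levels` stays at 10.
small_n = critical_path_delay(10, 0.5, 0.3, 2.0, local_fraction=0.0)
large_n = critical_path_delay(10, 0.6, 0.4, 2.0, local_fraction=0.5)
print(small_n, large_n)
```

In this sketch the larger cluster pays slightly more per BLE and per local hop, but still wins because half of the slow global hops are replaced by local ones.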
Critical Path Delay vs. LUT Size (Recap)
Increasing N beyond 3 has minimal effects
• Limited effectiveness of clustering
• Architectural weakness?
• Semi-effective CAD tools?
Number of Logic Clusters on Critical Path
The number of logic levels decreases with increasing N and K
• For a given K, most of the reduction is from N = 1 to 3
• The majority of the critical path delay was reduced in this range
• Increasing N is less effective when K is large
BLE Fanout vs. LUT Size
Larger LUTs have larger average fanout
• Harder to ensure that increasing N will result in fewer cluster levels on the critical path
Smaller LUTs respond better to increasing N because each LUT has a relatively small fanout
• Adding an extra BLE to the cluster guaranteed some reduction in the number of logic levels
Area-Delay Product
Large delays
• Many BLEs on critical path
• Slightly larger area requirement
Large area cost for K = 7 outweighs the marginal delay improvement
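Selecting the LUT size with the best area-delay product is a simple minimization. The area and delay values below are invented placeholders, not data from the paper; only the selection logic is the point.

```python
# Hypothetical (area, delay) per LUT size K, in arbitrary units.
results = {
    2: (1.10, 2.00),
    4: (1.00, 1.20),
    5: (1.05, 1.10),
    7: (1.60, 1.00),
}

# Area-delay product for each K; the smallest product is the best trade-off.
area_delay = {k: area * delay for k, (area, delay) in results.items()}
best_k = min(area_delay, key=area_delay.get)
print(best_k)
```

With these placeholders, the largest LUT has the lowest delay but loses on the product because of its area, mirroring the K = 7 annotation.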
Caveats
• Quality of CAD tools
• Mix of benchmark circuits
• Limited exploration of routing parameter design space
– Parameters were derived from N = K = 4
Best Overall Results and Summary
• To achieve 98% LUT utilization, set I = ½K(N+1)
• Small LUT sizes are not area efficient and have poor performance characteristics
• Future challenges
– Reduce number of BLEs on critical path without resorting to larger LUTs
– Reduce intra-cluster routing delays