Download FPGA Implementation of a Low Latency and High Throughput

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
THE INSTITUTE OF ELECTRONICS,
INFORMATION AND COMMUNICATION ENGINEERS
IEICE ICDV 2011
FPGA Implementation of a Low Latency and High Throughput
Network-on-Chip Router Architecture
Nam-Khanh Dang, Thanh-Vu Le-Van, Xuan-Tu Tran*
SIS Laboratory, University of Engineering and Technology, VNU Hanoi
144 Xuan Thuy road, Hanoi 10000, Vietnam. (*) Corresponding author: [email protected]
Abstract  The Network-on-Chip (NoC) paradigm has recently been known as a promising solution for designing large
complex Systems-on-Chip (SoCs), especially when the semiconductor technology turns into 3D integration era. This paper
presents the design of a NoC router architecture which provides low latency, high throughput communication. The NoC router
architecture is implemented on a Xilinx technology FPGA chip for prototyping purpose with an obtained latency of 8.1ns and a
maximum throughput of 123Mflits/s on each communication channel. These results can be improved when the design is
implemented with ASIC design flows.
Keyword: Network-on-Chip (NoC), On-chip communication, router architecture.
analyzed the latency of different topologies to find the
1. Introduction
best matched topology for each application in considering
Thanks to the rapidly evolution of the semiconductor
power consumption.
technology, more and more Intellectual Properties (IPs)
In this paper, we present the design of an efficient
can be integrated into a complex System-on-Chip (SoC)
router architecture that can be used to build high
to meet high demand of new applications. This leads to a
throughput, low latency NoC architectures by optimizing
big challenge for on-chip communication design because
the router architecture and using a double-crossbar. The
of many critical limitations of conventional communica-
2-D mesh topology has been selected to develop our NoC
tion models based on shared buses or point-to-point links;
architecture in order to be easily implemented with
especially, when the semiconductor technology turns into
current semiconductor technology and to apply routing
3-D integration.
algorithms. However, other 2-D topologies such as 2-D
Recently, the Network-on-Chip (NoC) paradigm has
torus or 2-D folded-torus can be easily developed from
become a promising solution for on-chip communication
the proposed router architecture. The design of a complete
of large complex systems thanks to its potential advan-
NoC architecture is then implemented and validated on
tages such as: high throughput communication, high
FPGA devices from Xilinx to evaluate the performance of
scalability and versatility, as well as good power effi-
the design. This implementation is targeted to prototyping
ciency [1], [2], [3]. In consequence, many works and/or
proposals on NoC architectures and methodologies have
purposes before the design will be synthesized and implemented on ASIC technology. Preliminary experimental
been presented in literature last several years. However,
results show that the design implemented on Xilinx
the design methodology has not yet been completed and
Virtex-5 FPGA [11] has a very small latency of 8.1ns and
a high throughput of 123Mflits/s per communication
there are still many research issues to bring NoC paradigm to the market such as network latency and commu-
channel (two communication channels with different
nication throughput improvement [4], [5], [6], [7]. In
priority levels can be concurrently implemented). These
which, Nilsson and Oberg [6] proposed a pseudochronous
mechanism to reduce the network latency while Beigné et
results can be improved when the design is implemented
with ASIC design flows in the next step.
al. [4], [9] implement the whole NoC architecture using
quasi-delay insensitive asynchronous logic to minimize
The remaining part of this paper is organized as fol-
the network latency thanks to the asynchronous design
lows: Section 2 describes the design of our NoC architecture, from network topology to communication mecha-
advantages. However, it makes the test of asynchronous
nism. Section 3 introduces our proposed router architecture used to develop out NoC architecture. Verification
NoC architectures more complex [8] due to the lack of
testing tools for asynchronous logics. Kreutz et al. [5]
112
Copyright ©2011 by IEICE
and implementation of the design based on Xilinx FPGA
before being transferred onto the network. Each packet is
technology is presented in Section 4. Finally, conclusions
always composed by a header flit, following by one or
and perspectives will be given in Section 5.
more data flits (including body flits and a tail flit). The
2. Network-on-Chip design
size of each flit is 34 bits, where 32 bits are used for data
and two most significant bits are used for control pur-
2.1. Network topology
poses. The formats of these flits are described in Figure 2.
As computer networks, there are many topologies can
33
be used to design NoC architectures such as ring, fat-tree,
32
31
18 17
BoP EoP
payload
butterfly, 2-D mesh, 2-D torus, 2-D folded torus, etc.
0
path-to-target
Header flit (BoP = ‘1’, EoP = ‘0’)
33
Between these topologies, 2-D mesh and 2-D torus to-
32
31
0
BoP EoP
pologies are usually preferred in developing NoC archi-
payload
Body flit (BoP = ‘0’, EoP = ‘0’)
tectures because these topologies can be easily imple-
33
mented with the current semiconductor technology. The
32
31
0
BoP EoP
payload
2-D torus has a smaller network diameter and higher
Tail flit (BoP = ‘0’, EoP = ‘1’)
bandwidth. However, the structure of 2-D torus network is
Figure 2: Flits’ format.
more complex than 2-D mesh network with long network
links to establish the loops between border routers. In our
Begin-of-Packet (BoP) and End-of-Packet (EoP) are
case, 2-D mesh topology has been selected to develop
two control bits specifying the type of a flit. Header flit
NoC architectures in order to easily implement routing
has BoP = ‘1’ and Tail flit has EoP = ‘1’. If both BoP and
algorithms. In addition, it is easier to expand network
EoP equal ‘1’, the packet has only one flit while if both
scale with 2-D mesh topology.
values equal ‘0’, this is a body flit. Thanks to these
control bits, the network router can easily recognize the
In our NoC architecture, each router has five 32-bit
four
type of an incoming flit (header flit, body flit, or tail flit)
neighboring routers (North, East, South, and West) and a
and make a routing decision. To route a packet on net-
nearest IP core via a Network Interface (NI) as depicted
work, routing information have to be included in the
in Figure 1. The communication mechanism between
header flit (in the “path-to-target” field). This field will
routers is established by using a handshake protocol with
be shifted to the right at each router for next routing
two virtual channels providing system’s quality of service
direction after two least significant bits are used due to
(QoS).
source routing algorithms.
bi-directional
ports
which
are
connected
to
Figure 3 presents the communication interface between
two network routers (i.e., between a network router and
an IP core).
req0, req1
North
OUT
West
Router
Router
router
router
East
req0, req1
IN
data
resp0, resp1
NI
South
IN
data
resp0, resp1
router/
router/
IP
core
IP core
OUT
IP
…
resp[i]
Figure 1: 2-D mesh NoC architecture.
req[i]
data
2.2. Communication mechanism
header
body
body
…
…
tail
BoP
The communication mechanism used in this NoC archi-
EoP
tecture is packet switching with Wormhole commutation
mode and source routing algorithm. The message is
Figure 3: Router-to-router/IP core interface and
therefore split and encapsulated into packets at the source
handshake protocol.
113
The network links are 34-bit bidirectional channels, in
management unit (Credit MU) manages flit flow control.
which 32 bits are used for data and two bits are used to
This will implement the handshake protocol between
encode flit information as described above. In addition,
routers at flit level. The detail of the input/output mod-
“req0/req1” and “resp0/resp1” signals are used to estab-
ules is explained as follows.
lish a handshake protocol for communication between
At the input module, the datapath is composed of a
network elements. This protocol also guarantees that there
Pre-processor, buffers and multiplexer/de-multiplexer as
is only one virtual channel implemented on the physical
described in Figure 5.
channel in one time moment.
control logics
3. Proposed router architecture
ack [i] from
output modules
req0, req1
resp0, resp1
To be used in developing 2-D mesh NoC architectures,
BoP, EoP
we proposed a network router architecture having five
2
VC0
VC0 direction
direction
dir
32-bit bi-directional input/output ports (North, East,
2
data_in
datapath
resp0/resp1
req0/req1
Credit
Credit MU
MU
Routing
Routinglogic
logic
Routing
Routinglogic
logic
VC
VCallocater
allocater
resp0/resp1
req0/req1
HFS
VC1
VC1 buffer
buffer
data_to_crossbar_VC1
Figure 5: Micro-architecture of input module (IM).
Because Wormhole commutation mode is used, the
buffers have only 1-flit depth. The coming data packets
VC
VCallocater
allocater
dir
VC
data_to_crossbar_VC0
output module 0
Credit
Credit MU
MU
BoP/EoP
VC0
VC0 buffer
buffer
Pre-processor
Pre-processor
ble-crossbar as depicted in Figure 4.
Input module 0
VC
HFS
composed of input modules, output modules, and a dou-
send [i] to
output modules
VC1
VC1 direction
direction
South, West, and Local) connected to four neighboring
routers and the nearest IP core. The router architecture is
lock [i] to
output modules
input
input
arbiter
arbiter
will be allocated into their corresponding virtual channel
VC
depending on the value of req0 and req1 signals. The
data
datapath
datapath
datapath
datapath
data
outputs of virtual channel buffers are connected to corresponding crossbars. The use of two crossbars allows the
Input module 4
router to establish two different routing paths on different
output module 4
virtual channels at the same time. This provides high
Figure 4: Router architecture.
throughput communication and low latency for the network architecture. Of course, the routing paths should be
In such architecture, each input/output module has its
on different directions. Routing decision, credit manage-
own datapath and control logic unit. The datapath is
ment, and other control circuits are implemented in the
divided into two communication channels in order to
input arbiter.
implement two virtual channels, VC0 and VC1. In which,
At the output module, the architecture is a little bit
the VC0 has a higher priority and is used for transferring
simpler, as described in Figure 6.
important control information or providing real-time
services. The VC1 provides best-effort service and is
ack [i] to
input modules
usually used for transferring bulk data on the network.
Beside the datapath, the control logic unit is composed
control logics
resp0
resp1
send [i] from
input modules
of VC allocater, routing logic, and credit management
req0
req1
unit. In which, VC allocater allocates the incoming flit
into its corresponding virtual channel; routing logic unit
lock [i] from
input modules
analyzes the routing information and generates necessary
output
output
arbiter
arbiter
control signals to the crossbar to implement routing paths.
data_from_crossbar_VC0
A header flit shifting (HFS) signal is also sent to datapath
data_out
to shift the path-to-target field for getting new routing
data_from_crossbar_VC1
information used at the next router. To synchronize with
the output modules, routing logic unit also generate
datapath
Figure 6: Micro-architecture of output module (OM).
handshake signals to the output modules. The credit
114
In this architecture, routing decision, virtual channel
allocation, and credit management are implemented by an
mented on a Xilinx Virtex-5 FPGA chip [11] for verification purpose. The FPGA-based prototyping does not
output arbiter and several multiplexers and/or de-multi-
demonstrate the feasibility of the NoC architecture but
plexers, called control logics. The datapath is just a 34-bit
also enables accurate evaluation of performance, area,
multiplexer controlled by the control logics.
and various design trade-offs [10]. The hardware overhead of the network architecture without IP cores is
The crossbar of the router is a double crossbar used to
briefly reported in Table 1. Number of occupied slices of
connect 5 input modules to 5 output modules as presented
each router is 717, approximately 1% available resource
in Figure 7. Thanks to this double-crossbar, the router can
of the targeted FPGA device.
establish two concurrent communication channels on
different virtual channels. The quality of service of the
Table 1: Hardware implementation report on Xilinx
whole NoC architecture is then improved.
Virtex-5 FPGA chip (xc5vlx330)
im_to_crossbar_VC0 [0]
im_to_crossbar_VC0 [0]
crossbar_VC0_to_om [0]
crossbar_VC1_to_om [0]
im_to_crossbar_VC0 [1]
im_to_crossbar_VC0 [1]
crossbar_VC0_to_om [1]
crossbar_VC1_to_om [1]
im_to_crossbar_VC0 [2]
im_to_crossbar_VC0 [2]
crossbar_VC0_to_om [2]
crossbar_VC1_to_om [2]
im_to_crossbar_VC0 [3]
im_to_crossbar_VC0 [3]
crossbar_VC0_to_om [3]
crossbar_VC1_to_om [3]
Resources
Used
Slice LUTs
10791
207360
5.2%
Slice LUT-Flip
Flop pairs
13806
207360
6.7%
Slice Registers
3600
207360
1.7%
915
1200
76%
I/O buffers
Available
Percentage
(%)
To verify the router architecture, we have implemented
the following test cases, as depicted in Figure 9, to check
im_to_crossbar_VC0 [4]
im_to_crossbar_VC0 [4]
crossbar_VC0_to_om [4]
crossbar_VC1_to_om [4]
all routing paths of the network router.
Figure 7: Double-crossbar architecture.
R
R
4. Implementation and Verification
R
R
R
R
IP-1
IP-1
R
R
IP-1
IP-1
To validate the proposed router architecture, we have
R
R
modeled a 3×3 NoC architecture as described in Figure 8.
R
R
This network has 9 network routers connected to 9 IP
R
R
R
R
routing path
cores. For testing purpose, the IP cores are very simple;
(a)
(b)
they just generate the data with defined routing informaR1
R1
tion corresponding to testing scenario, transmit the data
R2
R2
R1
R1
R2
R2
onto the network, and receive data flow from the network.
IP-2
IP-2
R1
R1
R2
R2
R4
R4
IP-1
IP-1
IP-2
IP-2
R3
R3
IP-2
IP-2
R5
R5
R6
R6
IP-4
IP-4
R4
R4
R5
R5
R7
R7
R5
R5
R6
R6
R8
R8
R9
R9
IP-4
IP-4
R6
R6
R8
R8
IP-4
IP-4
R4
R4
IP-3
IP-3
IP-5
IP-5
R8
R8
IP-6
IP-6
R9
R9
routing path
(c)
R9
R9
(d)
Figure 9: Verification model for 4-router group.
IP-7
IP-7
IP-8
IP-8
IP-9
IP-9
First, the IP core will generate data and routing information to test the loop routing path between 4 routers as
Figure 8: A 3×3 NoC architecture.
presented in Figure 9(a) and Figure 9(b), for clockwise
and counterclockwise, respectively. Then, to test the
The NoC architecture is then synthesized and imple-
remaining routing paths of the router, we implement the
115
[3] W.J. Dally and B. Towles. Route Packets, Not Wires:
On-Chip Interconnection Networks. In Proceedings
of the Design Automation Conference (DAC), pp.
684–689, Las Vegas, NV, USA, June 2001.
routing paths as presented in Figure 9(c) and Figure 9(d).
In Figure 9(c), test data are generated by the IP core 2
(IP-2), then passed through R2, R1, R4, R5, R6, R9, R8,
R5, R2, and finally received by the IP-2. In Figure 9(d)
[4] Edith Beigne, et al. An Asynchronous NoC Architecture Providing Low Latency Service and its
Multi-level Design Framework. In Proceedings of
the 11th IEEE Int’l Symposium on Asynchronous
Circuits and Systems (ASYNC), pp. 54–63, 2005.
test data are generated by IP core 4 (IP-4), then passed
through R4, R1, R2, R5, R8, R9, R6, R5, R4, and finally
received by IP-4. These routing paths are opposite to each
[5] M. Kreutz, C. Marcon, L. Carro, N. Calazans, and
A.A. Susin. Energy and Latency Evaluation of NoC
Topologies. In Proceedings of the 2005 IEEE International Symposium on Circuits and Systems
(ISCAS), pp. 5866–5869, Vol. 6, 2005.
other. The verification results approved that all routing
paths inside the router architecture have been passed for
the test.
From the experimental results, it shows that we need
[6] Erland Nilsson, Johnny Oberg. Reducing Power and
Latency in 2-D mesh NoCs using Globally Pseudochronous Locally Synchronous Clocking. In Proceedings of the 2004 International Conference on
Hardware/Software Codesign and System Synthesis,
pp. 176, November 2004.
only once cycle to transmit a data flit from an input port
to an output port of the router. The obtained latency on
the targeted FPGA device is 8.1ns. The maximum
throughput achieved on each communication channel is
123Mflits/s. As presented above, two communication
[7] Radu Marculescu, Umit Ogras, Li-Shiuan Peh, Natalie Enright Jerger, and Yatin Hoskote. Outstanding
Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 28(1), January
2009.
channels can concurrently established, the maximum
throughput can be reached is 246Mflits/s.
5. Conclusion and Future Works
In this paper, we have presented the design and implementation of a NoC router architecture on a Xilinx FPGA
[8] Xuan-Tu Tran, Yvain Thonnart, Jean Durupt, Vincent
Beroulle, Chantal Robach. Design-for-Test Approach
of an Asynchronous Network-on-Chip Architecture
and its Associated Test Pattern Generation and Application. IET Journal on Computers and Digital
Techniques, Volume 3, Issue 5, pp. 487-500, September 2009.
Virtex-5 chip. The preliminary experimental results show
that the design can be used to develop low latency and
high throughput NoC architectures based on 2-D mesh/
torus topologies. The obtained latency is 8.8ns and the
maximum throughput achieved on each communication
channel
is 123Mflits/s.
Because
two
[9] Edith Beigne, E.; Clermidy, F.; Lhermet, H.; Miermont, S.; Thonnart, Y.; Tran, X.T.; Valentian, A.;
Varreau, D.; Vivet, P.; Popon, X.; Lebreton, H. An
Asynchronous Power Aware and Adaptive NoC
Based Circuit. IEEE Journal of Solid State Circuits,
Volume 44, Issue 4, pp. 1167-1177, April 2009.
communication
channels can be concurrently established at the same time
on different directions and different virtual channels
thanks to its architecture with double-crossbar, the maxi-
[10] Umit Ogras et al. Challenges and Promising Results
in NoC Prototyping using FPGAs. IEEE Micro,
pp.86-95, September – October 2007.
mum throughput of a router can be reached to 246Mflits/s,
equivalent to 7.872Gbps. This can improve the NoC
performance and guarantee the quality of service of the
[11] Xilinx Virtex-5 FPGA. www.xilinx.com
system. The experimental results can be improved when
the design will be implemented with ASIC design flows in
next step.
Acknowledgement
This work is partly supported by Vietnam National University, Hanoi (VNU) through research project No.
QGDA.10.02 (VENGME).
References
[1] A. Jantsch and H. Tenhunen (Eds). Networks on Chip.
Kluwer Academic Publisher, February 2003.
[2] L. Benini and G. De Micheli. Networks on Chips: a
New SoC Paradigm. IEEE Computer, 35(1):70–78,
January 2002.
116