Download FPGA Implementation of a Low Latency and High Throughput

THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS IEICE ICDV 2011 FPGA Implementation of a Low Latency and High Throughput Network-on-Chip Router Architecture Nam-Khanh Dang, Thanh-Vu Le-Van, Xuan-Tu Tran* SIS Laboratory, University of Engineering and Technology, VNU Hanoi 144 Xuan Thuy road, Hanoi 10000, Vietnam. (*) Corresponding author: [email protected] Abstract  The Network-on-Chip (NoC) paradigm has recently been known as a promising solution for designing large complex Systems-on-Chip (SoCs), especially when the semiconductor technology turns into 3D integration era. This paper presents the design of a NoC router architecture which provides low latency, high throughput communication. The NoC router architecture is implemented on a Xilinx technology FPGA chip for prototyping purpose with an obtained latency of 8.1ns and a maximum throughput of 123Mflits/s on each communication channel. These results can be improved when the design is implemented with ASIC design flows. Keyword: Network-on-Chip (NoC), On-chip communication, router architecture. analyzed the latency of different topologies to find the 1. Introduction best matched topology for each application in considering Thanks to the rapidly evolution of the semiconductor power consumption. technology, more and more Intellectual Properties (IPs) In this paper, we present the design of an efficient can be integrated into a complex System-on-Chip (SoC) router architecture that can be used to build high to meet high demand of new applications. This leads to a throughput, low latency NoC architectures by optimizing big challenge for on-chip communication design because the router architecture and using a double-crossbar. The of many critical limitations of conventional communica- 2-D mesh topology has been selected to develop our NoC tion models based on shared buses or point-to-point links; architecture in order to be easily implemented with especially, when the semiconductor technology turns into current semiconductor technology and to apply routing 3-D integration. algorithms. However, other 2-D topologies such as 2-D Recently, the Network-on-Chip (NoC) paradigm has torus or 2-D folded-torus can be easily developed from become a promising solution for on-chip communication the proposed router architecture. The design of a complete of large complex systems thanks to its potential advan- NoC architecture is then implemented and validated on tages such as: high throughput communication, high FPGA devices from Xilinx to evaluate the performance of scalability and versatility, as well as good power effi- the design. This implementation is targeted to prototyping ciency [1], [2], [3]. In consequence, many works and/or proposals on NoC architectures and methodologies have purposes before the design will be synthesized and implemented on ASIC technology. Preliminary experimental been presented in literature last several years. However, results show that the design implemented on Xilinx the design methodology has not yet been completed and Virtex-5 FPGA [11] has a very small latency of 8.1ns and a high throughput of 123Mflits/s per communication there are still many research issues to bring NoC paradigm to the market such as network latency and commu- channel (two communication channels with different nication throughput improvement [4], [5], [6], [7]. In priority levels can be concurrently implemented). These which, Nilsson and Oberg [6] proposed a pseudochronous mechanism to reduce the network latency while Beigné et results can be improved when the design is implemented with ASIC design flows in the next step. al. [4], [9] implement the whole NoC architecture using quasi-delay insensitive asynchronous logic to minimize The remaining part of this paper is organized as fol- the network latency thanks to the asynchronous design lows: Section 2 describes the design of our NoC architecture, from network topology to communication mecha- advantages. However, it makes the test of asynchronous nism. Section 3 introduces our proposed router architecture used to develop out NoC architecture. Verification NoC architectures more complex [8] due to the lack of testing tools for asynchronous logics. Kreutz et al. [5] 112 Copyright ©2011 by IEICE and implementation of the design based on Xilinx FPGA before being transferred onto the network. Each packet is technology is presented in Section 4. Finally, conclusions always composed by a header flit, following by one or and perspectives will be given in Section 5. more data flits (including body flits and a tail flit). The 2. Network-on-Chip design size of each flit is 34 bits, where 32 bits are used for data and two most significant bits are used for control pur- 2.1. Network topology poses. The formats of these flits are described in Figure 2. As computer networks, there are many topologies can 33 be used to design NoC architectures such as ring, fat-tree, 32 31 18 17 BoP EoP payload butterfly, 2-D mesh, 2-D torus, 2-D folded torus, etc. 0 path-to-target Header flit (BoP = ‘1’, EoP = ‘0’) 33 Between these topologies, 2-D mesh and 2-D torus to- 32 31 0 BoP EoP pologies are usually preferred in developing NoC archi- payload Body flit (BoP = ‘0’, EoP = ‘0’) tectures because these topologies can be easily imple- 33 mented with the current semiconductor technology. The 32 31 0 BoP EoP payload 2-D torus has a smaller network diameter and higher Tail flit (BoP = ‘0’, EoP = ‘1’) bandwidth. However, the structure of 2-D torus network is Figure 2: Flits’ format. more complex than 2-D mesh network with long network links to establish the loops between border routers. In our Begin-of-Packet (BoP) and End-of-Packet (EoP) are case, 2-D mesh topology has been selected to develop two control bits specifying the type of a flit. Header flit NoC architectures in order to easily implement routing has BoP = ‘1’ and Tail flit has EoP = ‘1’. If both BoP and algorithms. In addition, it is easier to expand network EoP equal ‘1’, the packet has only one flit while if both scale with 2-D mesh topology. values equal ‘0’, this is a body flit. Thanks to these control bits, the network router can easily recognize the In our NoC architecture, each router has five 32-bit four type of an incoming flit (header flit, body flit, or tail flit) neighboring routers (North, East, South, and West) and a and make a routing decision. To route a packet on net- nearest IP core via a Network Interface (NI) as depicted work, routing information have to be included in the in Figure 1. The communication mechanism between header flit (in the “path-to-target” field). This field will routers is established by using a handshake protocol with be shifted to the right at each router for next routing two virtual channels providing system’s quality of service direction after two least significant bits are used due to (QoS). source routing algorithms. bi-directional ports which are connected to Figure 3 presents the communication interface between two network routers (i.e., between a network router and an IP core). req0, req1 North OUT West Router Router router router East req0, req1 IN data resp0, resp1 NI South IN data resp0, resp1 router/ router/ IP core IP core OUT IP … resp[i] Figure 1: 2-D mesh NoC architecture. req[i] data 2.2. Communication mechanism header body body … … tail BoP The communication mechanism used in this NoC archi- EoP tecture is packet switching with Wormhole commutation mode and source routing algorithm. The message is Figure 3: Router-to-router/IP core interface and therefore split and encapsulated into packets at the source handshake protocol. 113 The network links are 34-bit bidirectional channels, in management unit (Credit MU) manages flit flow control. which 32 bits are used for data and two bits are used to This will implement the handshake protocol between encode flit information as described above. In addition, routers at flit level. The detail of the input/output mod- “req0/req1” and “resp0/resp1” signals are used to estab- ules is explained as follows. lish a handshake protocol for communication between At the input module, the datapath is composed of a network elements. This protocol also guarantees that there Pre-processor, buffers and multiplexer/de-multiplexer as is only one virtual channel implemented on the physical described in Figure 5. channel in one time moment. control logics 3. Proposed router architecture ack [i] from output modules req0, req1 resp0, resp1 To be used in developing 2-D mesh NoC architectures, BoP, EoP we proposed a network router architecture having five 2 VC0 VC0 direction direction dir 32-bit bi-directional input/output ports (North, East, 2 data_in datapath resp0/resp1 req0/req1 Credit Credit MU MU Routing Routinglogic logic Routing Routinglogic logic VC VCallocater allocater resp0/resp1 req0/req1 HFS VC1 VC1 buffer buffer data_to_crossbar_VC1 Figure 5: Micro-architecture of input module (IM). Because Wormhole commutation mode is used, the buffers have only 1-flit depth. The coming data packets VC VCallocater allocater dir VC data_to_crossbar_VC0 output module 0 Credit Credit MU MU BoP/EoP VC0 VC0 buffer buffer Pre-processor Pre-processor ble-crossbar as depicted in Figure 4. Input module 0 VC HFS composed of input modules, output modules, and a dou- send [i] to output modules VC1 VC1 direction direction South, West, and Local) connected to four neighboring routers and the nearest IP core. The router architecture is lock [i] to output modules input input arbiter arbiter will be allocated into their corresponding virtual channel VC depending on the value of req0 and req1 signals. The data datapath datapath datapath datapath data outputs of virtual channel buffers are connected to corresponding crossbars. The use of two crossbars allows the Input module 4 router to establish two different routing paths on different output module 4 virtual channels at the same time. This provides high Figure 4: Router architecture. throughput communication and low latency for the network architecture. Of course, the routing paths should be In such architecture, each input/output module has its on different directions. Routing decision, credit manage- own datapath and control logic unit. The datapath is ment, and other control circuits are implemented in the divided into two communication channels in order to input arbiter. implement two virtual channels, VC0 and VC1. In which, At the output module, the architecture is a little bit the VC0 has a higher priority and is used for transferring simpler, as described in Figure 6. important control information or providing real-time services. The VC1 provides best-effort service and is ack [i] to input modules usually used for transferring bulk data on the network. Beside the datapath, the control logic unit is composed control logics resp0 resp1 send [i] from input modules of VC allocater, routing logic, and credit management req0 req1 unit. In which, VC allocater allocates the incoming flit into its corresponding virtual channel; routing logic unit lock [i] from input modules analyzes the routing information and generates necessary output output arbiter arbiter control signals to the crossbar to implement routing paths. data_from_crossbar_VC0 A header flit shifting (HFS) signal is also sent to datapath data_out to shift the path-to-target field for getting new routing data_from_crossbar_VC1 information used at the next router. To synchronize with the output modules, routing logic unit also generate datapath Figure 6: Micro-architecture of output module (OM). handshake signals to the output modules. The credit 114 In this architecture, routing decision, virtual channel allocation, and credit management are implemented by an mented on a Xilinx Virtex-5 FPGA chip [11] for verification purpose. The FPGA-based prototyping does not output arbiter and several multiplexers and/or de-multi- demonstrate the feasibility of the NoC architecture but plexers, called control logics. The datapath is just a 34-bit also enables accurate evaluation of performance, area, multiplexer controlled by the control logics. and various design trade-offs [10]. The hardware overhead of the network architecture without IP cores is The crossbar of the router is a double crossbar used to briefly reported in Table 1. Number of occupied slices of connect 5 input modules to 5 output modules as presented each router is 717, approximately 1% available resource in Figure 7. Thanks to this double-crossbar, the router can of the targeted FPGA device. establish two concurrent communication channels on different virtual channels. The quality of service of the Table 1: Hardware implementation report on Xilinx whole NoC architecture is then improved. Virtex-5 FPGA chip (xc5vlx330) im_to_crossbar_VC0 [0] im_to_crossbar_VC0 [0] crossbar_VC0_to_om [0] crossbar_VC1_to_om [0] im_to_crossbar_VC0 [1] im_to_crossbar_VC0 [1] crossbar_VC0_to_om [1] crossbar_VC1_to_om [1] im_to_crossbar_VC0 [2] im_to_crossbar_VC0 [2] crossbar_VC0_to_om [2] crossbar_VC1_to_om [2] im_to_crossbar_VC0 [3] im_to_crossbar_VC0 [3] crossbar_VC0_to_om [3] crossbar_VC1_to_om [3] Resources Used Slice LUTs 10791 207360 5.2% Slice LUT-Flip Flop pairs 13806 207360 6.7% Slice Registers 3600 207360 1.7% 915 1200 76% I/O buffers Available Percentage (%) To verify the router architecture, we have implemented the following test cases, as depicted in Figure 9, to check im_to_crossbar_VC0 [4] im_to_crossbar_VC0 [4] crossbar_VC0_to_om [4] crossbar_VC1_to_om [4] all routing paths of the network router. Figure 7: Double-crossbar architecture. R R 4. Implementation and Verification R R R R IP-1 IP-1 R R IP-1 IP-1 To validate the proposed router architecture, we have R R modeled a 3×3 NoC architecture as described in Figure 8. R R This network has 9 network routers connected to 9 IP R R R R routing path cores. For testing purpose, the IP cores are very simple; (a) (b) they just generate the data with defined routing informaR1 R1 tion corresponding to testing scenario, transmit the data R2 R2 R1 R1 R2 R2 onto the network, and receive data flow from the network. IP-2 IP-2 R1 R1 R2 R2 R4 R4 IP-1 IP-1 IP-2 IP-2 R3 R3 IP-2 IP-2 R5 R5 R6 R6 IP-4 IP-4 R4 R4 R5 R5 R7 R7 R5 R5 R6 R6 R8 R8 R9 R9 IP-4 IP-4 R6 R6 R8 R8 IP-4 IP-4 R4 R4 IP-3 IP-3 IP-5 IP-5 R8 R8 IP-6 IP-6 R9 R9 routing path (c) R9 R9 (d) Figure 9: Verification model for 4-router group. IP-7 IP-7 IP-8 IP-8 IP-9 IP-9 First, the IP core will generate data and routing information to test the loop routing path between 4 routers as Figure 8: A 3×3 NoC architecture. presented in Figure 9(a) and Figure 9(b), for clockwise and counterclockwise, respectively. Then, to test the The NoC architecture is then synthesized and imple- remaining routing paths of the router, we implement the 115 [3] W.J. Dally and B. Towles. Route Packets, Not Wires: On-Chip Interconnection Networks. In Proceedings of the Design Automation Conference (DAC), pp. 684–689, Las Vegas, NV, USA, June 2001. routing paths as presented in Figure 9(c) and Figure 9(d). In Figure 9(c), test data are generated by the IP core 2 (IP-2), then passed through R2, R1, R4, R5, R6, R9, R8, R5, R2, and finally received by the IP-2. In Figure 9(d) [4] Edith Beigne, et al. An Asynchronous NoC Architecture Providing Low Latency Service and its Multi-level Design Framework. In Proceedings of the 11th IEEE Int’l Symposium on Asynchronous Circuits and Systems (ASYNC), pp. 54–63, 2005. test data are generated by IP core 4 (IP-4), then passed through R4, R1, R2, R5, R8, R9, R6, R5, R4, and finally received by IP-4. These routing paths are opposite to each [5] M. Kreutz, C. Marcon, L. Carro, N. Calazans, and A.A. Susin. Energy and Latency Evaluation of NoC Topologies. In Proceedings of the 2005 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 5866–5869, Vol. 6, 2005. other. The verification results approved that all routing paths inside the router architecture have been passed for the test. From the experimental results, it shows that we need [6] Erland Nilsson, Johnny Oberg. Reducing Power and Latency in 2-D mesh NoCs using Globally Pseudochronous Locally Synchronous Clocking. In Proceedings of the 2004 International Conference on Hardware/Software Codesign and System Synthesis, pp. 176, November 2004. only once cycle to transmit a data flit from an input port to an output port of the router. The obtained latency on the targeted FPGA device is 8.1ns. The maximum throughput achieved on each communication channel is 123Mflits/s. As presented above, two communication [7] Radu Marculescu, Umit Ogras, Li-Shiuan Peh, Natalie Enright Jerger, and Yatin Hoskote. Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 28(1), January 2009. channels can concurrently established, the maximum throughput can be reached is 246Mflits/s. 5. Conclusion and Future Works In this paper, we have presented the design and implementation of a NoC router architecture on a Xilinx FPGA [8] Xuan-Tu Tran, Yvain Thonnart, Jean Durupt, Vincent Beroulle, Chantal Robach. Design-for-Test Approach of an Asynchronous Network-on-Chip Architecture and its Associated Test Pattern Generation and Application. IET Journal on Computers and Digital Techniques, Volume 3, Issue 5, pp. 487-500, September 2009. Virtex-5 chip. The preliminary experimental results show that the design can be used to develop low latency and high throughput NoC architectures based on 2-D mesh/ torus topologies. The obtained latency is 8.8ns and the maximum throughput achieved on each communication channel is 123Mflits/s. Because two [9] Edith Beigne, E.; Clermidy, F.; Lhermet, H.; Miermont, S.; Thonnart, Y.; Tran, X.T.; Valentian, A.; Varreau, D.; Vivet, P.; Popon, X.; Lebreton, H. An Asynchronous Power Aware and Adaptive NoC Based Circuit. IEEE Journal of Solid State Circuits, Volume 44, Issue 4, pp. 1167-1177, April 2009. communication channels can be concurrently established at the same time on different directions and different virtual channels thanks to its architecture with double-crossbar, the maxi- [10] Umit Ogras et al. Challenges and Promising Results in NoC Prototyping using FPGAs. IEEE Micro, pp.86-95, September – October 2007. mum throughput of a router can be reached to 246Mflits/s, equivalent to 7.872Gbps. This can improve the NoC performance and guarantee the quality of service of the [11] Xilinx Virtex-5 FPGA. www.xilinx.com system. The experimental results can be improved when the design will be implemented with ASIC design flows in next step. Acknowledgement This work is partly supported by Vietnam National University, Hanoi (VNU) through research project No. QGDA.10.02 (VENGME). References [1] A. Jantsch and H. Tenhunen (Eds). Networks on Chip. Kluwer Academic Publisher, February 2003. [2] L. Benini and G. De Micheli. Networks on Chips: a New SoC Paradigm. IEEE Computer, 35(1):70–78, January 2002. 116

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download FPGA Implementation of a Low Latency and High Throughput