ICSE2010 Proc. 2010, Melaka, Malaysia

Qualitative and Quantitative Evaluation of A Proposed Circuit Switched Network-on-Chip

New Chin-Ee and Norhayati Soin
Department of Electrical Engineering, University of Malaya (UM), 50603 Kuala Lumpur, Malaysia
Email: [email protected] [email protected]

Abstract- The advancement of the semiconductor industry has led to a continuously increasing level of integration. Because of this, and driven by shorter time-to-market and product life cycles, the industry has migrated to the system-on-chip (SoC) paradigm. Network-on-chip (NoC) is viewed as a practical solution for SoC interconnection due to its reusability and scalability. Existing NoC designs are mainly based on packet switching. However, a packet switched NoC (PNoC) requires significant buffering resources, which consume silicon area and power. An alternative is a circuit switched NoC (CNoC). In this paper, a circuit switched network protocol and NoC design are proposed and evaluated both qualitatively and quantitatively. Simulations were performed to measure and compare the performance of both NoCs to determine the viability of CNoC as an on-chip interconnection solution.

I. INTRODUCTION

The semiconductor industry has been driven by accelerated advancement in process and design technologies. Rapid growth in the former has enabled a higher level of integration, with the current transistor feature size shrunk to the 32 nm region. The pursuit of higher integration aims at chips which are smaller, faster and more energy efficient. At present, industry players are moving progressively towards the system-on-chip (SoC) paradigm due to the increase in transistor density as well as shorter time-to-market and design cycles [1]. An SoC is an integrated circuit which consists of a number of heterogeneous blocks, or intellectual property (IP) cores, manufactured on a single monolithic substrate [2, 11].
SoC enables design reusability, where IP cores are assembled as modular components to achieve the overall chip design [3]. An essential element in reusing IP cores is the exchange of information between them [4]. To facilitate inter-core data transfer, standards have been defined to ease integration. Although the shrinking of transistor feature size offers a multitude of benefits such as higher transistor density, lower power consumption, faster clocks and higher yield, it has also worsened deep sub-micron (DSM) effects such as crosstalk, capacitive and inductive loads and electromagnetic interference [5]. Although the power and processing capability of IP cores scale with the integration trend, the inter-core interconnections do not, so they are becoming a potential performance bottleneck and a source of significant energy consumption [6].

978-1-4244-6609-2/10/$26.00 ©2010 IEEE

Currently, network-on-chip (NoC) has become a topic of intensive research as a feasible solution for on-chip interconnection. The majority of existing research revolves around the design and optimization of packet switched NoC, whereas little attention has been given to the alternative, circuit switched NoC. This work aims to evaluate and compare the benefits, weaknesses and practicality of both types of NoC, as well as to present the design of a circuit switched NoC.

II. ON-CHIP INTERCONNECTION

On-chip interconnection has traditionally been based on dedicated wires and shared buses. Dedicated wires are the fastest but are not configurable, and they become impractical as the number of cores increases due to wiring congestion and high manufacturing costs [1]. A shared bus comprises a set of wires interconnecting, and shared by, a number of cores.
Presently, it is a mature technology, as much research has been done to optimize shared bus performance and many sophisticated bus architectures have been proposed, such as segmented and hierarchical bus architectures. Compared to dedicated wires, a shared bus is more flexible and reusable. The main drawback of a shared bus is that only one transaction is allowed at a time [13]. Increasing the number of cores sharing a bus increases the load capacitance of the bus, which in turn degrades the bus operating frequency [5-7]. This limits the scalability of a shared bus to fewer than one or two dozen IP cores [10].

In order to address the limitations of dedicated wires and shared buses, on-chip interconnection is moving to the NoC paradigm [12]. Generally, an NoC consists of switches interconnected by communication channels [1]. It is expected to meet three major communication requirements for SoC: reusability, scalability and parallelism [5-7]. Although several notable NoC designs have been proposed, much improvement is still needed in terms of area and power consumption as well as scalability.

III. COMPARISON OF PACKET AND CIRCUIT SWITCHING

There are generally two types of switching modes in NoC, namely circuit (CNoC) and packet (PNoC) switching [8, 10]. In the former, a transmission path is established prior to the data transmission. During the transmission, the entire path is reserved and cannot be allocated to any other data transmission [1, 8]. The latter switching mode does not require a path to be set up prior to data transmission but instead relies on buffering schemes, routing strategy and flow control to ensure successful data transmission. Currently, many proposed NoC architectures are packet switched and synchronous. In terms of communication services, packet and circuit switching typically provide best effort (BE) and guaranteed transmission (GT) services respectively.
For GT service, data are transmitted with transmission and timing guarantees, whereas for BE service only a transmission guarantee is provided. GT service is generally suitable for real-time streaming data with a constant data arrival rate. BE service is generally suitable for systems that involve heterogeneous and bursty traffic patterns, such as multimedia. Packet switching typically requires large buffer resources at network nodes to ease traffic congestion. This buffer requirement can be reduced via properly designed and selected routing schemes. Circuit switching does not require any buffer and thus consumes less silicon area, but with the tradeoff of extra latency for setting up the path for data transmission. Although path reservation in circuit switching may lead to higher network contention and lower network usage, these effects may be reduced via methods such as contention free routing, static scheduling, virtual channels, virtual circuits and priorities [8].

At present, packet switching is the predominant switching method in NoC designs [14-15]. One of the main reasons behind the dominance of packet switching in NoC designs is the highly successful and scalable Internet [10]. However, it is worthwhile to consider circuit switching due to the following merits:
i. Allows easier implementation of pipelined asynchronous communication, as data and control signals can be separated
ii. Requires a minimal amount of control, e.g. does not require arbitration, thus increasing energy efficiency and maximum throughput
iii. Contention free communication
iv. Does not require any buffering schemes

Circuit switching is highly suitable for systems where the majority of on-chip traffic requires GT instead of BE service and is semi-static, meaning that the data streams last for a relatively long time. One of the most significant drawbacks of circuit switching is the blocking of routers due to the reservation of the physical channels.
A number of strategies have been proposed to address this issue, centering on multiplexing the physical channel among multiple data streams, such as time division multiplexing (TDM) and lane division multiplexing (LDM). In the former, different time slots are allocated to different data streams to utilize the channel. A pipelined TDM can be achieved by reserving consecutive time slots in consecutive routers for one data stream. The latter method involves segmenting the bus into smaller sub-buses which can be used by different data streams simultaneously [8]. Although multiplexing techniques may relieve router blockages, their implementation requires additional logic and silicon area.

In the packet switching domain, packet transfers are usually performed via store-and-forward, wormhole and virtual cut-through techniques [8-10]. Store-and-forward involves storing the entire data packet at a node before transferring it to the next node. A large buffer resource is needed, resulting in a costly NoC solution and higher per-node latency. To reduce the buffers needed, designers usually resort to the latter two methods, wormhole and virtual cut-through. In the wormhole method, packets are forwarded in the smallest units of flow control, called flits, immediately after the header flit has been examined. Routing is based on the information stored in the header flit, and the payload and trailer flits follow the same route [1, 8, 9]. Similarly, virtual cut-through involves transmission of data in flits, but the current node waits for a guarantee from the next node that the entire packet can be accepted prior to transmission [9]. At present, wormhole packet switching is the most prevalent in proposed NoC designs. However, an in-depth examination indicates that a packet transferred via the wormhole method occupies multiple links and nodes simultaneously, so a stalled header causes the entire path to be blocked from other data transmissions.
This is essentially similar to the issue faced by circuit switching and implies that a circuit switching oriented design may be inevitable in order to meet silicon resource constraints. Although several methods have been proposed to address the issue of blocked nodes and links due to the non-guaranteed nature of packet switching, most require additional resources for buffering or arbitration. A significant example is virtual channels, where stalled flits are stored in the output buffer of the NoC node, thus freeing the node for other data transmissions [9].

IV. PROPOSED CIRCUIT SWITCHING NETWORK PROTOCOL

Considering the merits of circuit switching, the proposed NoC design implements circuit switching for data transmission and addresses some of the main issues related to circuit switching. To establish communication between IP cores, a network protocol has been defined, which includes a description of the structure of the data packet as well as the handshaking algorithm before and after data transmission. The handshaking algorithm can be categorized into four phases:
i. Circuit setup
ii. Payload data transmission
iii. Circuit teardown
iv. Circuit unavailable

Data transmission is first initiated by the source core sending out a special packet called the CRT_SETUP packet. A CRT_SETUP packet contains the addresses of the target and source cores for routing purposes, as shown in Figure 1. When an NoC router receives a CRT_SETUP packet, the packet is routed to the next router and a connection is set up at the NoC router to link the incoming and outgoing ports of the router. The packet is then propagated onward until a circuit is established between the source and target cores.
Fig 1: Complete transaction
Fig 2: Circuit unavailable - persistent mode
Fig 3: Circuit unavailable - intermittent mode

Upon receiving the CRT_SETUP packet, the target core asserts a CRT_READY signal back to the source core to indicate that a circuit has been established and is ready for payload data transmission. When the source core detects the CRT_READY signal, it can start the payload data transmission to the target core by asserting the DATA_VALID signal. When the payload data has been completely transferred to the target core, the source core initiates the circuit teardown phase by deasserting the DATA_VALID signal. This teardown condition is propagated to all network nodes in the circuit until the circuit is entirely torn down and all the network nodes are released.

In the event that the NoC links are occupied by other transactions and no alternative routes are available, the network node at which the CRT_SETUP packet is blocked asserts a CRT_BLOCKED signal back to the source core. When the CRT_BLOCKED signal is detected, the source core can choose to operate in either intermittent or persistent mode. In the former mode, the source core initiates a circuit teardown when CRT_BLOCKED is received and retries circuit setup after a short duration. In the latter mode, the source core waits until the blocked node is freed for transmission without tearing down the partially completed circuit. The latter option can be used when the data transaction is regarded as being of higher priority. As can be observed, the source core plays the most active role in all phases of the protocol.

V. NETWORK TOPOLOGY AND ROUTING METHODOLOGY

Routing schemes can be categorized as either deterministic or adaptive, and as either source or distributed. In deterministic routing, the packet route is determined solely by the target and source core addresses, whereas in adaptive routing, network traffic is also taken into account in the routing decision.
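
Returning briefly to the handshaking protocol of Section IV, its four phases can be sketched from the source core's point of view as a small state machine. This is only an illustrative sketch: the names `Phase`, `next_phase`, the `BACKOFF` state and all flag parameters are hypothetical and do not come from the paper, and the per-cycle signal sampling is an assumption about how a simulator might model the signals CRT_READY, CRT_BLOCKED and DATA_VALID.

```python
from enum import Enum, auto


class Phase(Enum):
    """Hypothetical source-core states; names are illustrative only."""
    IDLE = auto()      # no circuit requested
    SETUP = auto()     # CRT_SETUP sent, waiting for CRT_READY or CRT_BLOCKED
    TRANSFER = auto()  # DATA_VALID asserted, payload being sent
    TEARDOWN = auto()  # DATA_VALID deasserted, circuit being released
    BACKOFF = auto()   # intermittent mode only: partial circuit torn down, waiting to retry


def next_phase(phase, *, request=False, crt_ready=False, crt_blocked=False,
               persistent=False, payload_done=False, backoff_done=False):
    """Return the source core's next phase for one clock cycle (assumed model)."""
    if phase is Phase.IDLE:
        return Phase.SETUP if request else Phase.IDLE
    if phase is Phase.SETUP:
        if crt_ready:
            return Phase.TRANSFER          # circuit established end to end
        if crt_blocked and not persistent:
            return Phase.BACKOFF           # intermittent mode: tear down and retry later
        return Phase.SETUP                 # persistent mode, or no reply yet: keep waiting
    if phase is Phase.TRANSFER:
        return Phase.TEARDOWN if payload_done else Phase.TRANSFER
    if phase is Phase.TEARDOWN:
        return Phase.IDLE                  # all nodes on the circuit released
    if phase is Phase.BACKOFF:
        return Phase.SETUP if backoff_done else Phase.BACKOFF
    raise ValueError(f"unknown phase: {phase}")


# Example: an intermittent-mode source that is blocked once, then succeeds.
phase = Phase.IDLE
for signals in ({"request": True}, {"crt_blocked": True}, {"backoff_done": True},
                {"crt_ready": True}, {"payload_done": True}, {}):
    phase = next_phase(phase, **signals)
```

The only difference between the two modes in this sketch is the transition taken out of `SETUP` when CRT_BLOCKED is seen: persistent mode stays in `SETUP` and holds the partial circuit, while intermittent mode releases it and retries.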
In source routing, the route is determined at the source and the information for the entire route is stored in the header of the packet, which is then examined and used by routers to determine the next hop. In distributed routing, on the other hand, route information need not be sent, as routing decisions are made at each network router [1]. Compared to source routing, distributed routing involves less overhead as it does not require transmitting information for the entire route. However, some studies have claimed that distributed routing results in more expensive routers, as routing tables need to be stored at each network router [10].

VI. DESIGN OF NOC ROUTER

Fig 4: Illustration of multi-ring topology with core addresses

The proposed circuit switched NoC uses an adaptive and distributed routing scheme in a multi-ring network topology. In this topology, each IP core is given an address and the cores are connected in a circular manner by a few parallel rings, as shown in Figure 4. Data transmission is initiated by an IP core sending CRT_SETUP to the interface router, which is the router connected directly to the IP core. Shortest path routing is done at the interface router by comparing the source and target addresses via the simple algorithm:

If ((target address > source address) and (target address < intersection address))
    Route to right port
Else
    Route to left port
End if

where the intersection address is defined as the address of the core opposite the source core in the ring. The secondary ring functions as a reserved path for data transmission when the primary path is blocked by an ongoing data transmission. A data transmission is routed onto the secondary ring only when the primary ring is blocked. In this topology, the worst case scenario occurs when a source core transmits data to the destination core located furthest away, thus occupying the largest number of network nodes.
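
The port selection rule above can be sketched in a few lines. The first function is a direct transcription of the rule as written; the second is a variant that measures the rightward distance modulo the ring size. The modular variant, the function names, and the convention that addresses run from 0 to `ring_size - 1` and increase towards the right port are all assumptions made for illustration, not details given in the paper.

```python
def route_literal(source, target, intersection):
    """Direct transcription of the rule stated in the text."""
    if target > source and target < intersection:
        return "right"
    return "left"


def route_modular(source, target, ring_size):
    """Assumed variant: compare the rightward distance with half of the ring.

    Addresses are assumed to be 0 .. ring_size - 1, increasing to the right.
    A target exactly opposite the source is sent left, as in the literal rule.
    """
    if source == target:
        raise ValueError("source and target must differ")
    dist_right = (target - source) % ring_size
    return "right" if 2 * dist_right < ring_size else "left"


def hops(source, target, ring_size):
    """Hop count along the direction chosen by route_modular."""
    dist_right = (target - source) % ring_size
    if route_modular(source, target, ring_size) == "right":
        return dist_right
    return ring_size - dist_right


# Example on a hypothetical ring of 10 cores, source core 1, opposite core 6.
direction_near = route_literal(1, 3, 6)
direction_far = route_literal(1, 8, 6)
longest = max(hops(0, t, 10) for t in range(1, 10))
```

When the intersection address is numerically larger than the source address, both functions pick the same port. If the addresses wrap around so that the intersection address is smaller than the source address, the literal comparison can never select the right port; how the original design treats that case is not stated, which is why the modular variant is offered only as one possible reading. In either reading no route is longer than half of the ring.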
Considering this routing strategy, which essentially limits the maximum route length to half of the ring, the maximum number of simultaneous data transmissions permissible in the worst case scenario can be calculated by the formula:

Max. Route = No. of Rings * 2    (1)

The benefits of the proposed routing strategy are:
i. Simpler routing algorithm, allowing distributed routing without the need to maintain expensive lookup tables at each router
ii. No potential hotspots for congestion, due to the topology and routing algorithm
iii. Systematic adaptive routing utilizing reserved links

The probability of encountering the worst case scenario can be minimized by exploiting spatial locality of reference, where cores which are predicted to communicate frequently are placed near each other.

In CNoC, the network router plays the most significant role, as it is responsible for almost all of the phases in the handshaking protocol. The roles of the router include routing, establishing and maintaining circuit connections, and disconnecting the circuit once transmission is completed. Compared to PNoC routers, the main difference of CNoC routers is the lack of buffers, which generally consume a lot of silicon area.

Figure 5 shows the major components of the router, namely the arbitrator, the ready and block controllers, and the configuration and block registers. The arbitrator block implements round robin arbitration, which provides an equal time slice for each input data port of the router. Arbitration is only required during the circuit setup phase, not once the circuit has been established. The arbitrator block also includes a routing block which performs the simple routing decision described in Section V. If the target output port is available, the routing logic sets the configuration register so that the input data port is connected to the target output port via the output multiplexer, thus establishing a circuit for data transmission within the router. If the target output port is not available, the routing block hands over responsibility to the block controller, which sets the block register so that a CRT_BLOCKED signal can be transmitted back to the source core. In other words, the network router uses two sets of registers, the block and configuration registers, to maintain the states and connections of the router. Both registers are cleared only when CRT_TEARDOWN is received.

Fig 5: Components of network router

VII. METHODOLOGY

In order to evaluate and compare the performance of the proposed CNoC with PNoC, simulators for each NoC were developed, which are capable of providing accurate functional and clock count based timing analysis. The PNoC design used in the simulation is output buffered and based on wormhole routing. Instead of performing simulations based on deduced formulas, the developed simulators emulate the real hardware via concurrent simulation of the entire network. The simulators were developed in C++ and employ object oriented designs to model the network nodes. In the simulations, both NoCs consist of 10 transmitters and 10 receivers interconnected by 20 router nodes in a double ring topology as shown in Figure 4. Since both the CNoC and PNoC designs evaluated in this study are synchronous, the networks are modelled with two basic operations for each clock cycle, namely synchronous and combinational operations, as shown in Figure 6. The NoC designs are optimized so that each flit is forwarded to the next node in a single clock cycle without degrading the NoCs' maximum operating frequencies.

Fig 6: Modelling of synchronous circuits in NoC simulators based on clock cycles

Modelling and simulation of NoC designs enables an accurate and quick evaluation of the performance of large NoCs without requiring expensive hardware resources. It is also possible to simulate heavy network loads to test for the possibility of deadlocks in the NoCs.

VIII. RESULTS AND DISCUSSIONS

Figure 7 shows the comparison of transmission characteristics for random network traffic in CNoC and PNoC. The graph shows a slight increase in transmission time for both networks as the number of simultaneous transmissions increases, i.e. as the network becomes more congested. The average transmission time for CNoC is consistently a little higher than for PNoC. Analysis of the networks showed that this delay is due to the circuit setup time of CNoC prior to payload transmission. However, with proper optimization and network localization, the circuit setup time can be minimized, improving the timing performance of CNoC. The low coupling between transmission time and number of simultaneous transmissions also indicates scalability for both networks, as more network nodes and cores can be added without affecting the transmission time significantly.

Fig 7: Comparison of total transmission time for PNoC and CNoC with increasing network congestion (x-axis: No. of simultaneous transmission; y-axis: Average clock count; series: Packet switched NoC, Circuit switched NoC)

Figure 8 shows the effect of increasing the size of the output buffers of each router on the timing performance of PNoC. The graph shows that increasing buffer size does not improve the timing performance at low network utilization. When network congestion increases, increasing buffer size generally improves the timing performance of the network, as fewer links and nodes are occupied by a single transmission at any one time. The graph also shows that PNoC suffers from a low performance-cost ratio in terms of buffering resources and that increasing buffering resources is not a proper optimization strategy for PNoC, due to the fast saturation in performance as network congestion increases.

Fig 8: Comparison of the effect of increase in buffer size (% of packet length) to total transmission time (x-axis: No. of simultaneous transmission; y-axis: Average clock count; series: 20%, 40%, 60%, 80%, 100%)

As can be seen from Figure 9, the flit transmission time for CNoC shows a dip at the center of the graph, indicating that the highest performance for CNoC is obtained when the packet size is between 700 and 1300 flits. For smaller packet sizes, the ratio of setup time to payload transmission time is higher, resulting in lower transmission efficiency. Packet sizes which are too big result in longer network congestion and lower network utilization.

Fig 9: Per-flit transmission time for different packet length (x-axis: Packet Length in Clock Counts; y-axis: Flit Transmission Time)

Figure 10 shows the timing performance of CNoC under simulated network congestion. Network congestion is simulated by setting all transmitters to transmit to a single receiver simultaneously. The graph shows that the transmission time for CNoC increases as the packet length and number of simultaneous transmissions increase.

Fig 10: Transmission characteristic for simulated congestion in CNoC (axis label: Packet length)

Using the simulators, a heavy network load was simulated by transmitting a total of 1.6 MB of data from 10 simultaneous transmitters to a single destination core. Under this heavy traffic load, it was found that the proposed CNoC does not experience any deadlock and is able to recover completely after the traffic burst.

IX. CONCLUSION

In this work, the network protocol and design of a CNoC have been proposed, and qualitative as well as quantitative evaluations of the proposed network have been presented. From the qualitative analysis of available PNoC designs, it was found that efforts to address one of the most significant issues in PNoC designs, namely limited buffering resources, led to the implementation of wormhole routing in most PNoC designs. However, wormhole routing suffers from a similar drawback to CNoC, namely the reservation of multiple links and nodes during a transmission. Quantitatively, simulations showed that PNoC suffers from a low performance to cost ratio in terms of buffering resources. Simulations also showed that PNoC and CNoC exhibit similar transmission characteristics under low and heavy traffic loads, except that the timing performance of CNoC is generally slightly lower than that of PNoC. The latency is mainly due to circuit setup prior to transmission and can be minimized via network localization. Due to its advantages of not requiring buffering resources, lower area and power consumption, better scalability and high performance, CNoC may be the better alternative solution for on-chip interconnects. Further research and optimization will be required to improve the performance of CNoC and will be presented in future works.

REFERENCES

[1] Moraes F., Mello A., Moller L., Ost L. & Calazans N., “A Low Area Overhead Packet-Switched Network on Chip: Architecture and Prototyping”, pp. 1-6.
[2] Gupta R.K. & Zorian Y., 1997, “Introducing Core-Based System Design”, IEEE Design and Test of Computers, vol. 4, pp. 1-5.
[3] Saastamoinen I., Alho M. & Nurmi J., 2003, “Buffer Implementation for Proteo Network-on-Chip”, IEEE, pp. II-113 – II-116.
[4] Bartic T.A., Mignolet J.Y., Nollet V., Marescaux T., Verkest D., Vernalde S. & Lauwereins R., 2003, “Highly Scalable Network on Chip for Reconfigurable Systems”, IEEE, pp. 1-8.
[5] Tortosa D.S. & Nurmi J., “Proteo: A New Approach to Network-on-Chip”, pp. 1-5.
[6] Adriahantenaina A., Charlery H., Greiner A., Mortiez L. & Zeferino C.A., 2003, “SPIN: A Scalable, Packet Switched, On-Chip Micro Network”, IEEE, Proc. of the Design, Auto. & Test in Europe Conf. & Exh., pp. 14.
[7] Zeferino C.A. & Susin A.A., 2003, “SoCIN: A Parametric and Scalable Network-on-Chip”, IEEE, Proc. of the 16th Symp. on Int. Circuits & Systems Design, pp. 1-6.
[8] Wolkotte P.T., Smit G.J.M., Rauwerda G.K. & Smit L.T., 2005, “An Energy Efficient Reconfigurable Circuit Switched Network-on-Chip”, IEEE, Proc. of the 19th IEEE Int. Parallel and Distributed Processing Symp., pp. 1-8.
[9] Bjerregaard T. & Mahadevan S., “A Survey of Research and Practices of Network-on-Chip”, ACM Computing Surveys, p. 33.
[10] Dielissen J., Radulescu A., Goossens K. & Rijpkema E., “Concepts and Implementation of Philips Network-on-Chip”, pp. 1-6.
[11] Ali M., Welzl M. & Zwicknagl M., 2008, “Networks on Chips: Scalable Interconnects for Future System on Chips”, IEEE, pp. 240-245.
[12] Zeferino C.A., Kreutz M.E., Carro L. & Susin A.A., 2002, “A Study on Communication Issues for Systems-on-Chip”, IEEE, Proc. of the 15th Symp. on IC and Systems Design (SBCCI’02), pp. 1-6.
[13] Henkel J., Wolf W. & Chakradhar S., 2004, “On-chip networks: A Scalable, Communication-centric Embedded System Design Paradigm”, IEEE, Proc. of the 17th Int. Conf. on VLSI Design, pp.
[14] Moraes F., Mello A., Moller L., Ost L. & Calazans N., “A Low Area Overhead Packet-Switched Network on Chip: Architecture and Prototyping”, pp. 1-6.
[15] Zeferino C.A. & Susin A.A., 2003, IEEE, Proc. of the 16th Symp. on IC and Systems Design (SBCCI’03), pp. 1-6.
