* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download CPU Performance
Distributed firewall wikipedia , lookup
Deep packet inspection wikipedia , lookup
Computer network wikipedia , lookup
IEEE 802.1aq wikipedia , lookup
Cracking of wireless networks wikipedia , lookup
Network tap wikipedia , lookup
Recursive InterNetwork Architecture (RINA) wikipedia , lookup
Peer-to-peer wikipedia , lookup
Network-on-Chip Introduction Z. Lu / A. Jantsch / I. Sander Dally and Towles, Chapter 1, 2, 3, 4,5 Overview Interconnect Network Introduction Deadlock, Livelock Topology (Regular, irregular) Router Architecture (pipelined, classic) Routing (algorithms, mechanics) Network Interface (message passing, shared memory) Flow Control (Circuit, packet switching (SAF, wormhole, virtual channel) Concepts May 25, 2017 Implementation Summary SoC Architecture Performance Analysis and QoS Evaluation 2 Network-on-Chip (NoC) Today buses are the dominating technology for systems-on-chips (SoCs) However, buses have severe limitations that become evident, if the number of components in a system is large The bus is a communication bottleneck, bandwidth is limited The bus uses broadcast for each transfer, thus is power inefficient Networks-on-Chip shall overcome the limitation of buses, since they provide a much larger amount of communication resources and are scalable May 25, 2017 SoC Architecture 3 A Network-on-Chip S S T S S T S T S T S T T S T Switch T Channel Terminal Node S S T S T S T May 25, 2017 S S T S T T S T SoC Architecture T 4 Network-on-Chip S S T S S T S T S S S T S May 25, 2017 T T T T T S T A terminal node can be any kind of component like S S T T S S S T T T S T SoC Architecture Processor Memory Hardware component Bus-based system with several components, e.g. Processor and Memory 5 Network-on-Chip S S T S S T S T S S S T S May 25, 2017 T T T S T S T T S S S T T T S Information in the form of packets (not wires) is routed via channels and switches from one terminal node to another T S T T SoC Architecture 6 Network Interface Network Switch Interface Terminal Node (Resource) May 25, 2017 SoC Architecture Different terminals with different interfaces shall be connected to the network The network uses a specific protocol and all traffic on the network has to comply to the format of this protocol 7 Network Interface Network Switch In order to allow for different resources to connect to the network, the network interface can be divided into A resource independent part (Network Interface) A resource dependent part (Resource Network Interface) This is also the solution for the Nostrum NoC (developed at KTH) Network Interface Resource Network Interface May 25, 2017 Terminal Node (Resource) SoC Architecture 8 Network abstractions International Standards Organization (ISO) developed the Open Systems Interconnection (OSI) model to describe networks: 7-layer model Provides a standard way to classify network components and operations Networks-on-Chips use a similar protocol stack corresponding to the 4 lowest layers of the OSI protocol May 25, 2017 SoC Architecture 9 OSI model application presentation session May 25, 2017 end-use interface data format transport application dialog control connections network end-to-end service data link reliable data transport physical mechanical, electrical SoC Architecture 10 OSI layers Physical: connectors, bit formats, electrical properties Data link: error detection and control across a single link (single hop). Network: end-to-end multi-hop data communication Transport: connection-oriented services over multiple links, e.g. ordering of packets, errorfree connection May 25, 2017 SoC Architecture 11 OSI layers, cont’d. Session: services for end-user applications: data grouping, checkpointing, etc Presentation: data formats, transformation services Application: interface between network and end-user programs May 25, 2017 SoC Architecture 12 Internet Protocol (not an on-chip protocol!) application presentation session transport network data link physical network data link physical node A router May 25, 2017 IP SoC Architecture application presentation session transport network data link physical node B 13 Units of Resource Allocation A message is a continuous group of bits that is delivered from source terminal to destination terminal. A message consists of packets. A packet is the basic unit for routing and sequencing. Packets maybe divided into flits. A flit (flow control digits) is the basic unit of bandwidth and storage allocation. Body/tail flits do not have any routing or sequence information and have to follow the head flit and remain in order. A phit (physical transfer digits) is the unit that is transfered across a channel in a single clock cycle. May 25, 2017 SoC Architecture 14 Units of Resource Allocation Message Packet Header Packet RI SN Head Flit Flit Type Body Flit Body Flit Tail Flit VC Phit Messages, Packets, Flits and Phits are handled in different layers of the network protocol May 25, 2017 SoC Architecture 15 Performance Factors S S T S T S T T Factors that influence the performance of a network-onchip are S S S S T T T T S S S S T S T S T May 25, 2017 T S T T S T T SoC Architecture Topology (static arrangement of channels and nodes) Routing Techniques (selection of a path through the network) Switching Techniques (How a route is traversed) Flow Control (how are network resources allocated, if packets traverse the network) Router Architecture (buffers and switches) Traffic Pattern 16 Network-on-Chip Topologies Dally: Ch 3, (4), 5 Network Topology The network topology refers to the static arrangement of channels and nodes in the network A good topology allows to fulfill the requirements of the traffic at reasonable costs Network topology can be compared with a network of roads May 25, 2017 SoC Architecture 18 Topology Examples Destination-tag Routing 0 0 0 1 2 3 00 0 10 20 1 4-node ring 00 01 02 1 1 2 2 03 01 11 21 3 10 11 12 3 How is it routed to destination 6, 110? 13 4 4 02 20 21 22 23 30 31 32 33 12 22 5 5 6 6 03 13 23 7 ”4x4”-Torus May 25, 2017 7 Butterfly with 8 nodes SoC Architecture 19 Rings, Tori and Meshes 0 1 2 3 4-node ring (4-ary 1-cube) 00 01 02 03 00 01 02 03 10 11 12 13 10 11 12 13 20 21 22 23 20 21 22 23 30 31 32 33 30 31 32 33 May 25, 2017 ”4x4”-Torus (4-ary 2-cube) ”4x4”-Mesh (4-ary 2-mesh) SoC Architecture 20 Combined Node consists of Terminal and Switch Node Switch Node is equivalent to Combined Node Terminal Node May 25, 2017 SoC Architecture 21 Nomenclature Network-on-Chip 00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33 ”4x4”-Torus (Channels are bidirectional) May 25, 2017 The topology of an interconnection network is specified by a set of nodes N* connected by a set of channels C Messages originate and terminate in a set of terminal nodes N, where N N* Here: SoC Architecture N = N* = 16 C = 16 ∙ 4 = 64 22 Nomenclature Network-on-Chip 00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33 ”4x4”-Torus (Channels are bidirectional) May 25, 2017 SoC Architecture Each channel c = (x, y) ∈ C connects a source node x to a destination node y, where x, y ∈ N* A channel is characterized by its width wc or wxy , which is the number of parallel signals it contains The source node of a channel is denoted sc and the destination node dc 23 Nomenclature Network-on-Chip 00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33 ”4x4”-Torus (Channels are bidirectional) May 25, 2017 SoC Architecture Its frequency fc or fxy is the rate at which bits are transported on each signal Its latency tc or txy is the time required for a bit to travel from x to y Usually the latency is directly related to the physical length of the channel, lc = vtc , by a propagation velocity v The bandwidth of the channel is bc = wcfc 24 Nomenclature Network-on-Chip 00 01 02 03 Each switch node x has a channel set Cx = CIx ⋃ COx , where 10 11 12 13 20 21 22 23 30 31 32 33 The degree of x is δx = |Cx |, which is the sum of the in degree and out degree ”4x4”-Torus (Channels are bidirectional) May 25, 2017 CIx = {c ∈ C | dc = x} is the input channel set COx = {c ∈ C | sc = x} is the output channel set SoC Architecture 25 Direct and Indirect Networks 00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33 Direct Network Every Node in the network is both a terminal and a switch Direct Network Indirect Network Nodes are either a switch or terminal 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 Indirect Network May 25, 2017 SoC Architecture 26 Bisection of a network A bisection of a network is a cut that partitions the entire network nearly in half The channel bisection of a network is the minimum channel count over all bisections of the network min C ( N , N ) Bc bisections 1 2 The bisection bandwidth of a network is the minimum bandwidth over all bisections of the network min B( N , N ) BB bisections 1 2 May 25, 2017 SoC Architecture 27 Bisection Channel bisection BC = 4 (2 bidirectional channels go through the bisection) 1 2 3 4-node ring Bandwidth bisection 0 BB = 4b (b is the bandwidth of each channel) May 25, 2017 SoC Architecture 28 Paths Non-Minimal Path (|P| = 5) 00 01 02 03 A path is an ordered set of channels P = { c1, c2, ... ,cn }, where d c sc for i = 1... (n - 1) The length or hop count of a path is |P | A minimal path from node x to node y is a path with the smallest hop-count i 10 11 12 13 20 21 22 23 30 31 32 33 ”4x4”-Torus (Channels are bidirectional) May 25, 2017 Minimal Path (|P| = 3) SoC Architecture i 1 29 Paths 00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33 ”4x4”-Torus (Channels are bidirectional) May 25, 2017 The set of all minimal paths between x and y is denoted Rxy Minimal Paths (|P| = 3) SoC Architecture 30 Paths Largest minimal hop count (Diameter Hmax = 4) 00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33 The diameter Hmax is the largest minimal hop count over all pairs of terminal nodes ”4x4”-Torus (Channels are bidirectional) May 25, 2017 SoC Architecture 31 Paths Distance in hops from node 00 00 0 01 1 02 2 03 1 10 1 11 2 12 3 13 2 2 20 3 21 4 22 3 23 30 1 31 2 32 3 May 25, 2017 H min 33 2 ”4x4”-Torus (Channels are bidirectional) The average minimum hop count Hmin is defined as the average hop count over all sources and destinations SoC Architecture 1 2 N H ( x, y ) x , yN Here: Hmin = 2 32 Paths Non-Minimal Path (|P| = 5) 00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33 H avg H min ”4x4”-Torus (Channels are bidirectional) May 25, 2017 A specific implementation may choose to incorporate some non-minimal path Then the actual average hop count Havg is defined over the path used by the network SoC Architecture 33 Paths Non-Minimal Path (|P| = 5) 00 01 02 03 10 11 12 13 D( P) lc cP 20 21 22 23 30 31 32 33 The physical distance of the path is The delay of the path is t ( P) D( P) / v ”4x4”-Torus (Channels are bidirectional) May 25, 2017 SoC Architecture 34 Traffic Patterns Traffic pattern is a very important factor for the performance of a network In uniform random traffic each source is equally likely to send to each destination Uniform random traffic is the most commonly used traffic pattern for network evaluation. However it implies balancing of the load, which often does not cause a problem for the network May 25, 2017 SoC Architecture 35 Throughput The throughput of a network is the data rate in bits per second that the network accepts per input port The topology of a network has a significant impact on the throughput (besides flow control and routing) The ideal throughput is defined as the throughput assuming a perfect routing and flow control Load is balanced over alternate paths No idle cycles on bottleneck channels May 25, 2017 SoC Architecture 36 Throughput Maximum throughput occurs, if some channel of the network becomes saturated Why some channel not all channels? The channel load of a channel is the ratio of the bandwidth demanded from the channel to the bandwidth of the input ports (in other words) the amount of traffic that must cross the channel, if each input unit injects one unit of traffic according to the given traffic pattern The channel that carries the largest fraction of the traffic determines the maximum channel load max May 25, 2017 SoC Architecture 37 Throughput The ideal throughput ideal is the input bandwidth that saturates the bottleneck channel ideal = b / max In general it is difficult to determine the maximum channel load max, but in case for uniform traffic, bounds can be found. Use the ideal throughput of a network on uniform traffic ideal (U) as the capacity of the network. May 25, 2017 SoC Architecture 38 An example: Ideal throughput Channel bandwidth, channel load S1 D1 1 packet/cycle ideal = b / max S1 Assume b = 1 packet/cycle D1 Mux De-mux S2 1. Where is the bottleneck channel? 2. What is the load of the bottleneck channel? 3. What is the throughput on the bottleneck channel? 4. What is the network’s ideal throughput? D2 What if b = 2 packet/cycle ? May 25, 2017 SoC Architecture 39 Ideal Throughput in a Torus 00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33 ”4x4”-Torus (Channels are bidirectional) May 25, 2017 SoC Architecture Assuming uniform traffic, 50% of the packets cross the bisection channels Best throughput, if packets are evenly distributed over the bisection channels Load on these channels is then B = N / 2BC Thus max ≥ B = N / 2BC And the ideal throughput is ideal = b / max ≤ 2bBC /N 40 Ideal Throughput in a Torus 00 01 02 03 Thus max ≥ B = N / 2BC ideal = b / max ≤ 2bBC /N Example: 10 11 12 13 20 21 22 23 30 31 32 33 ”4x4”-Torus (Channels are bidirectional) May 25, 2017 SoC Architecture 4 x 4 torus: B = 16 / 2BC = 8/16 = ½ 4 x 4 mesh: B = 16 / 2BC = 8/8 = 1 n x n torus: B = n2 / 2BC = n2/(2 · 4n) = n/8 n x n mesh: B = n2 / 2BC = n2/4n = n/4 41 Another useful lower bound on channel load A packet needs Hmin hops to be delivered There are C channels in the network We have N nodes sending packets With equal load, we get a lower bound for max c , LB May 25, 2017 H min N C SoC Architecture 42 Lower Bounds on Bottlneck Hop count bound: max c , LB max Bisection bound: H min N C N 2 BC ideal = b / max May 25, 2017 SoC Architecture 43 Bisection Bandwidth Mesh – Uniform Traffic BT … total channel count BC … bisection channel count E … node emission probability (Emitted packets per cycle) Havg … average hop count BT 4k 4k 2 Total Load: ENH avg Balance: BC 2k N k2 1. When k increases, what happens for E? Why? 2. Are both bounds correct? 3. If so, what does this mean? 4. Why is the B-bound tighter? May 25, 2017 2 2 Ek k Ek 3 3 3 2 2 3 Ek 4k 2 4k 3 E Bisection Load: Balance: 3 ( 4k 2 4k ) k 1 6 2 3 2 k k 1 2 Ek 2k 2 E k T-bound B-bound 2 3/2=1.5 2 4 9/8=1.125 1 6 5/6=0.83 2/3=0.67 8 21/32=0.656 ½=0.5 10 27/50=0.54 2/5=0.4 SoC Architecture EN Ek 2 2 2 4 k 44 Latency The latency of the network is the time required for a message to traverse a network, from the time the head arrives at the input port to the time when the tail of the message departs the output port Latency depends not only on topology, but also on routing, flow control and the design of the router Topology gives a lower bound on latency May 25, 2017 SoC Architecture 45 Latency There are two latency components: Head latency Th : Time required for head of the message to traverse the network Serialization latency Ts= L/b : Time required for the tail to catch up (time for a message of length L to cross a channel with bandwidth b) May 25, 2017 SoC Architecture 46 Head Latency Head latency depends on two topology factors Router delay Tr (time spent in the routers) and time of flight Tw (time spent on wires) May 25, 2017 Tr = Hmin tr Tw = Dmin / v (average distance Dmin, propagation velocity v) SoC Architecture 47 Latency Together this gives ideal average latency: T0 = Hmin tr + Dmin / v + L / b (no congestion) Clearly Hmin, Dmin, and b are to a large extent determined by the topology If there is congestion in the network, there is a forth term TC May 25, 2017 SoC Architecture 48 Latency with time-space diagram Head Tail How the time-space diagram looks like if the transmission is not pipelined, say, in a store-and-forward fashion? x Arrival at node x tr Leave x txy y Arrival at node y tr Leave y tyz z Arrival at switch z L/b May 25, 2017 SoC Architecture 49 Examples T0 = Hmin tr + Dmin / v + L / b A torus 8-by-8 network The network has N = 64 nodes Avg. min. hop count Hmin = 4 Channel width wc = 16 Channel frequency fc = 1 GHz Channel latency tc = 5 ns Router delay tr = 8 ns Packet Length L = 64 bytes May 25, 2017 SoC Architecture Tr = Hmin tr = 4 ∙ 8 ns = 32 ns Tw = Hmin tc = 4 ∙ 5 ns = 20 ns Ts = L / b = L / (fc wc ) = 64 ∙ 8 / (1 GHz ∙ 16) = 512 / 16 ns = 32 ns T0 = 32 ns + 20 ns + 32 ns = 84 ns 50 To get a feeling about NoC (Toy example) The network has N = 64 nodes Hmin = 2 ∙ 8 / 3 = 5.33 Channel width wc = 32 Channel frequency fc = 1 GHz Channel latency tc = 1 ns Router delay tr = 1 ns Packet length L = 16 bytes May 25, 2017 SoC Architecture 51 To get a feeling about NoC (Toy example) The network has N = 64 nodes Hmin = 2 ∙ 8 / 3 = 5.33 Channel width wc = 32 Channel frequency fc = 1 GHz Channel latency tc = 1 ns Router delay tr = 1 ns Packet length L = 16 bytes May 25, 2017 SoC Architecture Tr = Hmin tr = 5.33 ∙ 1ns = 5.33 ns Tw = Hmin tc = 5.33 ns Ts = L / b = L / (fc wc ) = 16 ∙ 8 / (1 GHz ∙ 32) = 128 / 32 ns = 4 ns T0 = 5.33 ns + 5.33 ns + 4 ns = 14.66 ns 52 Path Diversity A network with multiple minimal paths between most pairs of nodes is more robust than a network that has only one single route between the nodes May 25, 2017 SoC Architecture 53 Path Diversity 0 Random Traffic Each node is equally likely to send a message to any other node 50% of the packets pass the bisection max = 4/4=1 0 00 10 20 1 1 2 2 01 11 21 3 3 4 4 02 12 22 5 5 6 6 03 13 23 7 7 Butterfly with 8 nodes May 25, 2017 SoC Architecture 54 Traffic Patterns The performance of a network is strongly depending on the traffic pattern The table below shows a number of different traffic patterns that can be used to analyze the performance of the network May 25, 2017 SoC Architecture 55 Path Diversity 0 Bit (left) Rotation Traffic The node with address { b2, b1, b0 } sends to { b1, b0, b2 } Thus for sources { 0, 1, 2, 3, 4, 5, 6, 7 }, we get the shuffle permutation { 0, 2, 4, 6, 1, 3, 5, 7 } Thus packets from nodes {0,1,4, 5} will all have to pass switch node 10 max,BR = 2 (since for instance channel (00,10) is used by two connections) Max capacity: 50% May 25, 2017 0 00 10 20 1 1 2 2 01 11 21 3 3 4 4 02 12 22 5 5 6 6 03 13 23 7 SoC Architecture 7 Butterfly with 8 nodes 56 Torus and Mesh Networks Torus and Mesh networks, k-ary n-cubes, pack N = kn nodes. Advantages Regular structure allows efficient packaging For local communication latency is low Good path diversity Disadvantage Comparably larger hop count May 25, 2017 SoC Architecture 57 Rings, Tori and Meshes 0 1 2 3 4-node ring (4-ary 1-cube) 00 01 02 03 00 01 02 03 10 11 12 13 10 11 12 13 20 21 22 23 20 21 22 23 30 31 32 33 30 31 32 33 May 25, 2017 ”4x4”-Torus (4-ary 2-cube) ”4x4”-Mesh (4-ary 2-mesh) SoC Architecture 58 Properties of Tori and Meshes Torus Channel Bisection BC,T = 4 N / k Channel load under uniform traffic (50% of traffic crosses bisection) T,U = k / 8 Channel load under worst traffic (100% of traffic crosses bisection) T,W = k / 4 Average minimum hop count (k even) Hmin, T = nk / 4 May 25, 2017 Mesh Channel Bisection BC,M = 2 N / k Channel load under uniform traffic (50% of traffic crosses bisection) M,U = k / 4 Channel load under worst traffic (100% of traffic crosses bisection) M,W = k / 2 Average minimum hop count (k even) Hmin, M = nk / 3 SoC Architecture 59 Physical implementation of Mesh and Tori In order to implement a network on a chip, the abstract nodes of the network must be mapped to real positions in physical space A goal is to have the same latency for all channels May 25, 2017 SoC Architecture 60 Folding networks leads to shorter largest channel length Folded 4-ary 2 cube May 25, 2017 SoC Architecture 61 Summary Network is characterized by topology, routing, switching and flow control. Topology is the first concern of the network Mesh and Tori are regular and symmetric, which many on-chip networks of today propose. Performance metrics: throughput and latency Use channel load to determine ideal throughput Latency includes time spent on routers, wires and serialization Performance is strongly sensitive to traffic pattern May 25, 2017 SoC Architecture 62