Download Advances in Ethernet

Routing algorithms Yaakov (J) Stein Sept 2009 Chief Scientist RAD Data Communications Outline Control Plane principles of IP routing longest prefix match RIBs and FIBs Forwarding plane classification and lookup mechanisms switching fabrics Routing Slide 2 Introduction Routing Slide 3 Routers A router is a combination of hardware, software, and memory that is responsible for forwarding packets towards their destinations Routers generally work at ISO layer 3 (network layer) but can also function at layer “2.5” (for MPLS) and may inspect higher layers, but only for optimization (QoS management, load balancing, etc.) Note that Ethernet switches technically filter rather than forward router switch In order to correctly fulfill their function (i.e. to know where to forward) routers usually run routing protocols to exchange information between themselves Ethernet switches do not need such protocols as they learn how to filter So the router performs 2 distinct algorithms :  forwarding algorithm (forwarding component)  routing algorithm (control component) Routing Slide 4 What does a router do ? Control plane (routing algorithm)  run routing protocols  identify interface and next hop L2 addresses  populate RIBs (if Link State, perform SPFs)  scan all RIBs, and produce FIB (entries map FEC to NH) Data plane (forwarding algorithm)  deframing (CRC/checksum/defragmentation/reassembly/demapping…)  parsing (pulling values from appropriate fields – simple IPv4 DA,       complex finding URL or MIB variable) classification (add metadata, based on DA, DA+ToS, MPLS, …) FEC lookup / search packet modification and replication framing traffic management and queuing compression, encryption, etc. Routing Slide 5 IP networks IP networks are made up of  hosts  middleboxes (e.g., firewalls, NATs, NAPTs, Application Layer GWs)  routers (obsolete terminology – gateways) It will be useful to differentiate between  core routers (connect to other routers)  edge routers (connect to hosts) We will see shortly that it is more complex than that To understand how a router is different from other network elements we need to know the basic principles the IP protocol architecture We will mainly deal with IPv4 unicast forwarding Routing Slide 6 The basics (1) The first principle of IP is the end-to-end (E2E) principle All functionality should be implemented only with the knowledge and help of the application at the end points The second principle is the hourglass model  IP (l3) is the common layer  below IP (L3) is not part of IP suite, above is Thus :  most functionality and state is in the hosts  middlebox functionality is severely limited  routers are limited to forwarding packets without extensive packet manipulation (exception - TTL) ftp, http, … TCP, UDP, … IP Ethernet, PPP UTP, optic, wireless The third principle is that forwarding is  connectionless  on a hop-by-hop basis Routing Slide 7 The basics (2) The fourth principle is that unicast IP forwarding is performed based on a Destination Address (DA)  Addresses must usually be unique (end-to-end principle)  Hosts usually have a single IP address, routers have many addresses  It is the responsibility of a service provider (SP) to allocate addresses IP addresses are not arbitrary, like Ethernet MAC addresses The fifth principle is that IP addresses are aggregated into subnetworks  All addresses in a subnetwork share a common prefix  Subnetworks may be further aggregated The sixth principle is that it is the responsibility of the router to forward towards the host’s subnet but it is not its responsibility to deliver the packet on the subnet  the IP suite starts above L2  subnet’s L2 (e.g., Ethernet, PPP) delivers the packet to the host Routing Slide 8 IP Routing types  Distance Vector (Bellman-Ford), e.g. RIP, RIPv2, IGRP, EIGRP – send <addr,cost> to neighbors – routers maintain cost to all destinations – need to solve “count to  problem“  Path Vector, e.g. BGP – – – – –  send <addr,cost,path> to neighbors similar to distance vector, but w/o “count to  problem“ like distance vector has slow convergence* doesn’t require consistent topology can support hierarchical topology => exterior protocol (EGP) Link State, e.g. OSPF, IS-IS – send <neighbor-addr,cost> to all routers – determine entire flat network topology (SPF - Dijkstra’s algorithm) – fast convergence*, guaranteed loopless => interior routing protocol (IGP) *convergence time is the time taken until all routers work consistently before convergence is complete packets may be misforwarded, and there may be loops 1. Y(J)S MPLS Slide 9 Control Plane (Routing) Routing Slide 10 FIBs Based on the 6 principles we can understand what a router does 1. 2. 3. 4. 5. 6. The router looks at the packet’s DA It deduces to which subnet the packet belongs If the router can directly interface that subnet it must use the appropriate L2 to send the packet to the host Otherwise it must retrieve the next hop (router) that sends the packet towards the subnet The next router does the same If routing has converged there will be no loops or black holes but there may be during transients The information needed by the router to properly forward packets is stored in the Forwarding Information Base (FIB) The FIB associates address prefixes with Next Hops (NHs) (and, to save an additional lookup, usually with L2 addresses as well) Do not confuse the FIB with a Routing Information Base (RIB) Routing Slide 11 More on FIBs Simple (and primitive) routers have a routing table modern large routers have several different databases The FIB is designed to be fast to search we will talk about its data structure later RIBs are designed to be fast to update which is quite a different structure There may be many RIBs, one for each routing protocol running and static routes may be entered into any of them There are sometime other databases as well for example – link state routing protocols require a LSDB from which the RIB is built The basic idea is that we build the FIB from the RIBs Routing Slide 12 Router interfaces Routers connect to hosts and to other routers via interfaces (from 1 to many thousands of interfaces per router) 0.0.0.0/0 .2 Interfaces have layer 3 and above properties and also contain layer a and 2 properties (ports) Each interface is assigned a unique IP address Interfaces are grouped into subnetworks All interfaces on a subnetwork share the same prefix .3 .129 192.0.2.128/25 .1 192.0.2.0/25 Routers are responsible for forwarding packets  arriving at an ingress interface  to an egress interface .130 .131 Routing Slide 13 Prefixes and masks Since 1993 (RFC 1519 - CIDR) subnets can have any length prefix There are two ways of specifying the prefix length  slash notation, e.g., 192.168.16.0/20 note – unspecified bits are set to zero  mask notation, e.g., 192.168.16.0 with mask 255.255.240.0 Note that 192.168.16.0/20 means all addresses from 192.168.16.0 through 192.168.31.255 Note that it contains 192.168.16.0/21, 192.168.24.0/22, etc. since they are in the range and have longer prefixes (larger masks) /32 are fully qualified IP addresses 0.0.0.0/0 matches every IP address –  it is the default route route taken when there is no matching entry in FIB  the gateway of last resort Routing Slide 14 Prefix tables slash Mask slash mask A.B.C.D/32 255.255.255.255 A.B.0.0/16 255.255.0.0 A.B.C.D/31 255.255.255.254 A.B. 0.0/15 255.254.0.0 A.B.C.D/30 255.255.255.252 A.B. 0.0/14 255.252.0.0 A.B.C.D/29 255.255.255.248 A.B. 0.0/13 255.248.0.0 A.B.C.D/28 255.255.255.240 A.B. 0.0/12 255.240.0.0 A.B.C.D/27 255.255.255.224 A.B. 0.0/11 255.224.0.0 A.B.C.D/26 255.255.255.192 A.B. 0.0/10 255.192.0.0 A.B.C.D/25 255.255.255.128 A.B.0.0/9 255.128.0.0 A.B.C.0/24 255.255.255.0 A.0.0.0/8 255.0.0.0 A.B.C.0/23 255.255.254.0 A.0.0.0/7 254. 0.0.0 A.B.C.0/22 255.255.252.0 A.0.0.0/6 252. 0.0.0 A.B.C.0/21 255.255.248.0 A.0.0.0/5 248. 0.0.0 A.B.C.0/20 255.255.240.0 A.0.0.0/4 240. 0.0.0 A.B.C.0/19 255.255.224.0 A.0.0.0/3 224. 0.0.0 A.B.C.0/18 255.255.192.0 A.0.0.0/2 192. 0.0.0 A.B.0.0/17 255.255.128.0 A.0.0.0/1 128. 0.0.0 0.0.0.0/0 0.0.0.0 Note : for /25 D=0 or 128 for /26 D= 0, 64, 128, or 192 etc. Routing Slide 15 Special IP addresses Some IP addresses are reserved for special purposes they are not assigned by IANA and may require special treatment by router prefix range purpose 0.0.0.0/8 0.0.0.0 – 0.255.255.255 defaults 10.0.0.0/8 10.0.0.0 – 10.255.255.255 private addresses 127.0.0.0/8 127.0.0.0 – 127.255.255.255 loopback addresses 169.254.0.0/16 169.254.0.0 - 169.254.255.255 zeroconf 172.16.0.0/12 172.16.0.0 - 172.31.255.255 private addresses 192.0.2.0/24 192.0.2.0 - 192.0.2.255 Documentation 192.88.99.0/24 192.88.99.0 - 192.88.99.255 IPv6-IPv4 relay 192.168.0.0/16 192.168.0.0 - 192.168.255.255 private addresses 198.18.0.0/15 198.18.0.0 - 198.19.255.255 device benchmark 224.0.0.0/4 224.0.0.0 – 239.255.255.255 multicast 240.0.0.0/4 240.0.0.0 – 255.255.255.255 reserved Routing Slide 16 Longest prefix match and FECs To find the subnet, we need to look at the packet’s DA  and to find the best match for the DA that is known  find known the FIB entry that matches the longest prefix of the DA (LPM) All packets that are forwarded in the same way are grouped into a Forwarding Equivalence Class (FEC) In the simplest case, a FEC is simply a known IP address prefix i.e., the packet’s subnet In more complex cases it might get more complex for example, ToS field, source address, etc. Every packet has to be looked up in the FIB and classified to a FEC and this forwarding has to be done fast Routing Slide 17 Example Assume the following FIB prefix next hop IP address 0.0.0.0/0 (gateway of last resort) A 192.0.2.0/24 B 192.0.2.128/25 C 192.0.2.192/26 D and let’s look up 192.0.2.131 (last byte 10000011) It matches the first entry with prefix length 0 (everything matches) It matches the second with length 24 (first three bytes) It matches the third entry with prefix length 25 (last byte 1xxxxxxx) It does not match the fourth entry (11xxxxxx ) So the packet is forwarded to next hop C Routing Slide 18 Packet processing time How much time do we have to process a packet ? 100 Mbps 1 Gbps 10 Gbps 64 B 5 msec 500 nanosec 50 nanosec 256 B 20 msec 2 msec 200 nanosec 1500 B 120msec 12 msec 1.2 msec disregarding L2 overhead, IPGs, etc. So the FIB data structure has to be optimized for fast look-up Routing Slide 19 Policy and autonomy The FIB information is based on :  static routes  dynamic routes determined by routing protocols  gateway of last resort (default route) 0.0.0.0/0 In general we need to apply policy, since  there are conflicting sources of information  may not want to use, or even believe, information received from peers  it is all a matter of autonomy  an Autonomous System can request service from another but can not force it to provide service Routing Slide 20 Autonomous systems Routers are grouped into Autonomous Systems (ASs) ASs may be grouped into domains AS look to the outside world as single entity (they usually have an AS ID) Routers in the same AS obey a common policy, and trust each other AS are truly autonomous one AS can request another to forward a packet, but can not force it to Inside ASs we run Internal Gateway Protocols (IGPs)  e.g. OSPF, IS-IS Between ASs we run External Gateway Protocols (EGPs)  e.g., BGP A router that runs both is called an AS Border Router (ASBR) Routing Slide 21 More of the story Actually, it can get a lot more complicated In general, a router will be running multiple routing protocols For example :  one or more IGPs (RIP,OSPF, IS-IS) between routers in the same AS  internal BGP (iBGP) between routers in the same AS (usually a full mesh, but when too complex we can use route reflectors)  external BGP (eBGP) between ASBRs in different ASs How does a router know if a BGP session is iBGP or eBGP ?  by the AS number ! IGP is used to find a path to another router (including ASBR) in the same AS eBGP is used by ASBRs to learn / distribute routes to other ASs iBGP is used for ASBR to inform core routers of external routes Routing Slide 22 Simplest example Stub ASs (my home router)  single homed to outside world  single internal subnet, so don’t need IGP  single homed, so don’t need to run BGP to ISP  don’t need to have an AS number 0.0.0.0/0 192.168.0.0/24 .101 .102 .1 .103 .104 .105 Routing Slide 23 More complex example Connecting to a server connected to another ISP with dual homing a.b.c.d 1 2 3 7 4 5 6 AS 1 Full mesh of iBGP sessions       AS 2 eBGP sessions A.B.C.D Routers 3 and 6 learn from eBGP how to reach A.B.C.D Policy determines that 3 will be used (see later) Router 1 learns from iBGP session that A.B.C.D is reachable via router 3 Router 1 learns from IGP that router 3 is reachable via router 2 Router 2 knows how to directly reach router 3 because of IGP adjacency Packet from a.b.c.d is forwarded via 1-2-3 to AS 2 and to A.B.C.D Routing Slide 24 Even more complex example Three ASs, with one possibly acting as a transit domain AS 2 AS 1 AS 1 would like to hand off the traffic to AS 2 AS 2 has no economic incentive to carry this traffic But AS 2 gets the route from AS3 What can AS 2 do to stop this ? (remember autonomy!) What can AS 1 then do ? AS3 Routing Slide 25 Rules for customer ASs Stub AS Single-homed AS does not need to learn routes from provider It only has to send all traffic via its unique exit point (0.0.0.0/0) Provider gets routes from static or IGP or private-AS eBGP Multihomed Nontransit AS AS advertises only its own routes to both SPs AS filters out traffic for foreign routes that reach it via static/default routing eBGP is not needed, but recommended for route propagation and filtering Multihomed Transit AS Uses eBGP to SPs and iBGP for transit traffic Routing Slide 26 BGP rules There are :  internal routes  external routes  customer routes When eBGP learns a route it is repeated via iBGP to all others in AS thus all routers in AS learn it When iBGP learns a route it is repeated only to externals via eBGP since internals also get it directly When there is another ASBR that can reach the same other AS a second route is repeated by iBGP to the ASBR The ASBR will then make the decision as to which to use ! Routing Slide 27 IGP rules IGPs are used between routers in the same AS  so IGPs do not have sophisticated policy control  routers usually blindly accept all information received For proper operation (no routing loops)  all routers in AS must have the same IGP RIB  for link state protocols (OSPF, IS-IS) there is a Link State Data Base (LSDB) from which IGP RIBs can be constructed (will be explained shortly) Because all routers have the same LSDB Although the forwarding is hop-by-hop the result is the same as if there were coordination IGPs  do not scale to inifinity  require complete knowledge  are not suitable for interaction with non-trusted routers since a single misconfiguration can be fatal Routing Slide 28 LSDBs We said before that LS routing protocols have another database LSDB contains representation of every router and link in the AS implicitly holding the complete topology of the network In addition, the LSDB associate costs (metric) with every link  RIP - the metric is always hop count, no non-trivial metric  OSPF - the metric is more general, for example link length These costs form a matrix From \ to Router A Router A Router B M(B,A) Router C M(C,A) Router B Router C M(A,B) M(A,C) M(B,C) M(C,B) The topology is symmetric, but the costs need not be Routing Slide 29 LSDBs and IGP RIBs Each router can independently calculate the least-cost path to every other router in the AS A Shortest Path First (SPF) algorithm (e.g., Dijkstra’s algorithm) is used to compute a tree of the shortest paths to all destinations Each route in the SPF tree is an entire path but for each router we can extract the next hop and build the RIB for that router (each router has its own RIB) From the RIBs we build the FIB needed for efficient forwarding of packets Routing Slide 30 Graph search algorithms There are many algorithms for search on graphs : Breadth first – Bellman-Ford – Iterative deepening  Depth first (backtracking) – Depth limited  Best first – Greedy algorithms  Dijkstra’s algorithm – Beam search – A* – B* etc. etc.  0 4 1 2 3 5 6 7 8 We will discuss graphs, trees, etc. later on Routing Slide 31 Dijkstra’s algorithm Graph search algorithm first described by Edsger Dijkstra in 1959 It assumes additive, non-negative, costs for each link in graph It is a best-first greedy algorithm Think of a city street map We want to from initial intersection to destination one with the least walking Start at the initial intersection – its distance is zero Measure and label the distances to all adjacent intersections (breadth first) Choose the closest one (this is the greedy step) Consider all the neighbors of the chosen intersection If the distance (sum of the distance to the chosen intersection and the distance from chosen intersection to neighbor) is the shortest known way to get to that neighbor then remember that distance (not a tree!) Once you have considered all neighbors of the intersection mark the chosen intersection as visited (its distance is now known) Choose the unvisited intersection with shortest distance Continue until all intersections have been visited Routing Slide 32 Dijkstra’s algorithm - formal Let's call the node we are starting with an initial node. The COST of node X will be the distance from the initial node, i.e., the sum of distances of all links along the path from the initial node to X Initialization Set initial node’s COST to zero, all others to infinity Set initial node as current, all other as unvisited Main step For all unvisited neighbours of current node : Calculate their distances from the initial node as COST(neighbor) = COST(current) + DIST(current to neighbor) If this is less than what is presently marked, overwrite the marking When all neighbors have been considered, mark current node as visited (once visited, this node’s COST is final) Recurse Select the unvisited node with the smallest COST as current node Go to Main step Routing Slide 33 Implementation issues If our graph has N nodes and L links In a straightforward implementation of Dijkstra’s algorithm  finding the unvisited node with lowest cost takes O(N)  this is done N times  so the total computation for this is O(N2)  the computation of the distances to every node takes O(L) since each link is followed once (to the node it lands on) So the total complexity is O(N2 + L) ≈ O(N2) By using more sophisticated data structures (Fibonacci heap) this can be reduced to O(N log N + L) ≈ O(N log N) Routing Slide 34 RIBs to FIB So we are finally ready to see how the FIB is populated First rejection rules are applied, for example :  do not accept routes from ASs without agreements  do not accept routes that loop (e.g., BGP advertisements with AS number in the AS-PATH) Then install FIB entries according to policy, for example : 1. first install Static routes 2. then routes from IGP RIB 3. choose eBGP before iBGP (hot potato rule- get it out of my network - let someone else handle) 4. if there are different routes from BGP choose the route with highest local preference 5. if routes have equal local preference choose the route with the shortest AS-PATH 6. if routes have equal AS-PATHs choose the route with the lowest origin number 7. if still equal choose highest BGP peer address Routing Slide 35 Advertisement Not everything received is accepted for inclusion in the FIB Not everything accepted for inclusion in the FIB is further advertised Never advertise information not accepted to FIB ! import RIB in policy export RIB local policy RIB out Best path FIB Routing Slide 36 Data Plane (Forwarding) Routing Slide 37 Forwarding Now that we have a FIB installed, let’s forward some packets ! There are main steps to forwarding :  classification  switching  misc. (scheduling, queuing, QoS, compression, encryption, …) The operations  may be stateless vs. statefull  may change over time  may involve peeking into the packet (Deep Packet Inspection)  may be recursive Packet forwarding can be done by SW or HW or some combination We normally differentiate between  the fast path (simple forwarding)  the slow path (control protocol packets) Routing Slide 38 Lookup and data structures Lookup comes in several varieties, such as :  Exact match (e.g., MAC addresses, VLANs, IP multicast)  Longest Prefix Matching (LPM) (needed for searching FIBs)  Range matching (e.g., ports, firewalls) In order to optimize lookup, we use appropriate data structures Wirth’s law Programs = algorithms + data structures We need to perform the following on our data structures :  Insert  Delete  Modify  Search and to check the following metrics :  Time complexity for each of the above  Size (spatial) efficiency  Scalability Routing Slide 39 LookUp Table The simplest (and fastest) data structure is the LookUp Table (LUT) AKA indexed array, Location Addressable Memory (LAM) The incoming address is used as an index to access the (NH) information We can put in more info, e.g., L2 type and address, to save further lookups Example : address interface NH L2 type L2 NH address 0 1 192.0.2.0 Ethernet 00-17-42-F7-14-14 1 2 192.0.2.16 PPP - 2 3 192.0.2.128 SDH VC ID Limitations  only for exact match, not LPM  limitation only for small number of possible addresses, e.g. VLANs We can use LUTs after other data structures that return a key as an index Routing Slide 40 Hash tables Hash tables enable handling a large number of potential addresses A hashing function is a function  from a large variable (large number of bits or large number of bytes)  to a small variable (small number of bits or bytes) which is white (small changes in the input create widely different outputs) Hashing long addresses returns a short index The problem is that the hashing function is not 1:1 so there will always be the probability of hash clashes (collisions) Solutions :  perfect hashing – only when addresses are known ahead of time  index in table returns list of all addresses stored  multiple hashing – linear probing, quadratic probing Hashing is very good for exact match (e.g. MAC addresses) but is not suitable for LPM Routing Slide 41 Hash implementation From an efficiency point of view hash tables are between LUTs and search tables (to be discussed next) To control collisions, we need a relatively large table (birthday paradox !) 23 people – ½ Example : 57 people – 99% 99% probability of collision when 3000 entries are put into a hash table of size 1 million Using multiple hashing average computational load is O(1 + keys/table-size) A primitive hash function is the modulo hash H(key) = addr mod table-size table-size should be prime – must not be a power of 2 or close A better hash function is (for appropriate integer m and fraction f) H(key) = Trunc [ m Frac( f addr) ] The quality of the hashing function depends on f For Fibonacci hashing : f = 1 / γ γ is the golden ratio ½ (√5 – 2) ≈ 1.618 Routing Slide 42 Search table The most spatially efficient data structure is the search table It is similar to a LUT but the addresses being looked up are not indexes rather, we need to sequentially search the table for the address We can reduce the search time by ordering the table Search tables are good for LPM ! For example : Order FIB from most specific (longest prefix) to least (0.0.0.0/0) row prefix interface NH 0 192.168.16.0/24 1 10.10.1.1 1 192.168.196.0/20 2 10.16.54.2 2 192.168.0.0/17 2 10.16.1.16 3 0.0.0.0/0 3 10.1.1.0 Loop through FIB list until find the first match Routing Slide 43 Limitations of search tables Search tables are not limited in size like LUTs but this comes at a price  it is expensive to search for an address  it is expensive to modify the information for an address  it is very expensive to insert or delete addresses (copy!) Search table FIBs can be rebuilt from RIBs each time For exact match it is possible to speed up by binary search  order by address  guess position in middle of table range where key should be  choose new range by comparison Routing Slide 44 Linked lists Search tables are hard to maintain if we need to insert or delete an element we need extensive moving Linked lists are designed to simplify mechanics of such updates still need exhaustive search to find where to insert Linked lists can be singly or doubly linked start start end Skip lists increase efficiency by enabling skipping over ranges start Routing Slide 45 Linked list implementation If we already know where to insert or what to delete or whom to modify then linked lists are very efficient However, search is O(N) where N is the number of entries worst case N average case N / 2 Properly constructed multi-level skip lists take O(log N) on average but are still O(N) in the worst case But if we already need to use double pointers then trees are better Routing Slide 46 Tree structures A graph is a collection of nodes and edges connecting them In a directed graph the edges have direction and so or every edge there is a father node and a child node Nodes without children are called leaves A forest is a directed graph without loops (only one path between 2 nodes) A tree is a forest with a single root -- it thus defines a partial order (it is conventional to draw the tree upsidedown) A binary tree has no more than 2 output edges per node forest tree binary tree Trees can be implemented using arrays, pointers (like linked lists), heaps, etc. Routing Slide 47 Search trees Search trees may store data     in their nodes in leaves only on edges combinations of the above For general search trees searching can be breadth-first or depth-first Breadth-first     start at the root find all children of root and check for desired data if not found, find all children’s children recurse Depth first      start at the root find first child and check for desired data if not found, find first child of first child recurse until data found or leaf if leaf backtrack to father and try the next child Routing Slide 48 Binary trees Binary Search Trees simplify storage and manipulation We call the children of a node – L and R Perfect binary trees have exactly 2 children for each internal node Thus there are exactly 2H leaves, where H is the height of the tree Altogether there are (2H+1 – 1) nodes To store keys in a binary search tree  place keys in all nodes  for every intermediate node N – key(L) < key(N) L – key(R) > key(N) N R To search for a key in a BST  start at root (set current node to root)  check if key is stored in current node  if not : if key < key(N) then set current to L else set current to R In the worst case this takes only H comparisons (for perfect trees O(log N)) Routing Slide 49 Balanced binary trees Since the search complexity is proportional to the BST’s height H we would like to use BSTs with minimal height Balanced BSTs are not perfect BSTs, but as close as possible to perfect It is more complex to build a balanced BST, but faster to search balanced BST unbalanced BST 4 2 2 4 1 6 3 1 3 5 6 7 5 7 Routing Slide 50 Tries From retrieve, but usually pronounced try A trie is an ordered tree with  subvalues on edges  values at leaves Tries are related to prefix search representations Tries were originally developed for exact match Tries can be binary or not Tries are good for all lexicographical searches O(log n) but particularly efficient for LPM Trie variants are the fastest known lookups - O(log log n)  LC-trie used in modern Linux router implementations Routing Slide 51 Example trie Assume the following FIB : 10.0.128.0/16 10.0.0.0/8 192.168.0.0/16 192.168.128.0/24 0.0.0.0/0 10 0 0 0 0 192 128 168 128 0 0 Note: in the plain trie, all IP prefixes have the same length 0 /8 0 /16 0 /0 /16 0 0 /24 Routing Slide 52 Example compact trie Same FIB : 10.0.128.0/16 10.0.0.0/8 192.168.0.0/16 192.168.128.0/24 0.0.0.0/0 0.0/8 10.0 192.168 0.0.0.0/0 128.0/16 0.0/16 128.0/24 Now maximum length is 2 We could also save memory by exploiting common suffixes Routing Slide 53 More trie variants Patricia trie Practical Algorithm to Retrieve Information Coded in Alphanumeric  Binary trie which store in nodes the number of bits to skip before next decision point  For LPM store prefixes in internal nodes Bucket tries  Store multiple keys in leaves Multibit tries (M-tries)  Fewer branching decisions than binary tries  Multibit strides (usually variable strides) Level-compressed tries (LC-tries)  Replace perfect subtrees in binary trie with single degree 2k nodes  LPC trie – level and path compressed And there are many more (Lulea, full-tree, …) Routing Slide 54 Example LC-trie binary trie LC-trie Routing Slide 55 Content Addressable Memory CAM (AKA associative memory) Addressable by content, rather than by location (LAM) Special purpose hardware  Fastest possible lookup (essentially searches entire table in one clock)  but limited in size  usually drives regular memory for additional storage Binary CAM (BCAM) stores 0 or 1 in each bit Ternary CAM (TCAM) allows wildcards  Can be used for LPM  Can prioritize solution by – number of bits matched – order in table CAM technology today  32 to 144 bit keys  128K – 512 K memories  hundreds of millions searches per second Routing Slide 56 Example – 3 bit BCAM encoder M1[1] M1[2] M1[3] M2[1] M2[2] M2[3] M3[1] M3[2] M3[3] match lines search lines encode output to access LAM Q[1] search word Q[2] Q[3] Routing Slide 57 Other uses of lookup Deep Packet Inspection  URL lookup (often partial or with wildcards) – can use Trees and tries  XML information, patterns, etc.  multiple encapsulations – e.g. Ethernet in IP, MPLS over IP, etc. – there are special-purpose languages to describe such cases Firewalls, Access Control  source/destination IP addresses + source/destination ports 4-tuples  TCP/UDP ports in ranges Routing Slide 58 Switching Once the packet has been classified we need to properly forward it The classifier result (the FEC) becomes a metadata field that controls the switch A switch fabric is combination of HW and SW components that enable moving the packet from input interface to output interface The metaphor is borrowed from woven material At low packets per second (pps) switching can be all software At high speeds highly parallel hardware is needed Routing Slide 59 Blocking The major design constraint in switches is blocking Blocking occurs when some resource is being used and the present packet can not be processed We differentiate three types of blocking 1. Output blocking (the egress port is in use) 2. Internal blocking (some internal resource of the switch is in use) 3. Input (head of line) blocking (packet can not enter the switch) A switch which is designed such that there is never internal blocking is called a nonblocking switch To alleviate blocking we can add buffers to store the waiting packets According to the type of blocking we wish to avoid we have  output queues  internal buffers  input queues Routing Slide 60 Switch types In the TDM days there were two switching mechanism types :  time domain switches (time slot interchange)  spatial domain switches For packet networks there are no time slots but time domain switching can still be used  shared memory switches (packets from all inputs placed in single memory)  shared bus switches (all inputs and outputs share a ring/hypercube, etc. ) In spatial switching the packets destined for different output interfaces travel different internal paths The simplest spatial switch is the crossbar invented for analog telephone circuits does not scale well – O(N2) possible connections Multilayer crossbars can scale better 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 Routing Slide 61 Shared memory/bus Shared memory switches are simple and cheap to implement The memory is the heart of the fabric it determines the throughput and delay The memory must run N times faster than the ingress rates Packets may be divided into frame buffers Scalability can be achieved by parallelism (bit or byte slicing) The architecture breaks down at high speeds Shared bus switches are similar in principle to shared memory ones The shared medium is the heart of the switch and determines its characteristics Scalability can be achieved by parallelism (parallel buses) The architecture breaks down at high speeds Routing Slide 62 Multistage crossbars Crossbars are fast, but  are not nonblocking  do not scale These problems are solved by multistage crossbars The only blocking of a single stage crossbar is output blocking more than one packet needs to leave the egress interface Time-space switches solve this by sorting the frames before switching 1234 2314 3412 4123 1234 2341 3412 4123 More complex architectures are time-space-time switches There are many multistage solutions to the scaling problem … Routing Slide 63 Clos network The Clos network is more efficient than a single N*N crossbar We divide N into r groups of n N = n r It has 3 stages of crossbars  the first stage has r groups of n*n crossbars  the second stage has n groups of r*r crossbars  the third stage has r groups of n*n crossbars The shuffle rule : Xth output of Yth crossbar connects to Yth input of Xth crossbar Instead of N2 = n2r2 connections just r2 n + 2 r n2 For example: if r = n = √N then instead of n4 just 3n 3 a reduction by 3/n = 3 / √N shuffle shuffle Routing Slide 64 Benes network The Benes network contains only 2*2 crossbars and is built recursively For N = 2n, the network has 2n-1 stages and 4 N log (N-1/2) connections The shuffle rule is that 1st outputs go to the 1st block, 2nd to the 2nd Benes(n-1) Benes(n-1) Routing Slide 65 Banyan network The Banyan network is a binary tree of 2*2 crossbars with exactly one path from each ingress interface to each egress interface At each stage the next bit of the identifier determines the switching Banyan switches are time and space efficient  O(log N) delay  ½ N log N 2*2 crossbars The main problem is internal blocking at any of the crossbars One way to relieve this is to speed up the crossbars and to add buffers shuffle shuffle Routing Slide 66 Batcher’s sort To prevent internal blocking in a Banyan switch we can first sort the frames according to egress interface When this is done with Batcher’s parallel sorter ½ log N ( log N + 1) stages of simple sorters we get the Batcher-Banyan switch 4th order Batcher sorter to Banyan switch Routing Slide 67

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Advances in Ethernet