Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Multiprotocol Label Switching wikipedia , lookup
Deep packet inspection wikipedia , lookup
IEEE 802.1aq wikipedia , lookup
Cracking of wireless networks wikipedia , lookup
Recursive InterNetwork Architecture (RINA) wikipedia , lookup
Zero-configuration networking wikipedia , lookup
Week 7 Routing Protocols 1 Intra-AS Routing Also known as Interior Gateway Protocols (IGP) Most common Intra-AS routing protocols: RIP: Routing Information Protocol OSPF: Open Shortest Path First IGRP: Interior Gateway Routing Protocol (Cisco proprietary) 2 RIP ( Routing Information Protocol) Distance vector algorithm Included in BSD-UNIX Distribution in 1982 Distance metric: # of hops (max = 15 hops) u v A z C B D w x y destination hops u 1 v 2 w 2 x 3 y 3 z 2 3 RIP advertisements Distance vectors: exchanged among neighbors every 30 sec via Response Message (also called advertisement) Each advertisement: list of up to 25 destination nets within AS 4 RIP: Example z w A x D B y C Destination Network w y z x …. Next Router Num. of hops to dest. …. .... A B B -- 2 2 7 1 Routing table in D 5 RIP: Example Dest w x z …. Next C … w hops 4 ... A Advertisement from A to D z x Destination Network w y z x …. D B C y Next Router Num. of hops to dest. …. .... A B B A -- Routing table in D 2 2 7 5 1 6 RIP: Link Failure and Recovery If no advertisement heard after 180 sec --> neighbor/link declared dead routes via neighbor invalidated new advertisements sent to neighbors neighbors in turn send out new advertisements (if tables changed) link failure info quickly propagates to entire net poison reverse used to prevent ping-pong loops (infinite distance = 16 hops) 7 OSPF (Open Shortest Path First) “open”: publicly available Uses Link State algorithm LS packet dissemination Topology map at each node Route computation using Dijkstra’s algorithm OSPF advertisement carries one entry per neighbor router Advertisements disseminated to entire AS (via flooding) Carried in OSPF messages directly over IP (rather than TCP or UDP 8 OSPF “advanced” features (not in RIP) Security: all OSPF messages authenticated (to prevent malicious intrusion) Multiple same-cost paths allowed (only one path in RIP) For each link, multiple cost metrics for different TOS (e.g., satellite link cost set “low” for best effort; high for real time) Integrated uni- and multicast support: Multicast OSPF (MOSPF) uses same topology data base as OSPF Hierarchical OSPF in large domains. 9 Hierarchical OSPF 10 Hierarchical OSPF Two-level hierarchy: local area, backbone. Link-state advertisements only in area each nodes has detailed area topology; only know direction (shortest path) to nets in other areas. Area border routers: “summarize” distances to nets in own area, advertise to other Area Border routers. Backbone routers: run OSPF routing limited to backbone. Boundary routers: connect to other AS’s. 11 Internet inter-AS routing: BGP BGP (Border Gateway Protocol): the de facto standard BGP provides each AS a means to: 1. 2. 3. Obtain subnet reachability information from neighboring ASs. Propagate the reachability information to all routers internal to the AS. Determine “good” routes to subnets based on reachability information and policy. Allows a subnet to advertise its existence to rest of the Internet: “I am here” 12 BGP basics Pairs of routers (BGP peers) exchange routing info over semi- permanent TCP conctns: BGP sessions Note that BGP sessions do not correspond to physical links. When AS2 advertises a prefix to AS1, AS2 is promising it will forward any datagrams destined to that prefix towards the prefix. AS2 can aggregate prefixes in its advertisement 3c 3a 3b AS3 1a AS1 2a 1c 1d 1b 2c AS2 2b eBGP session iBGP session 13 Distributing reachability info With eBGP session between 3a and 1c, AS3 sends prefix reachability info to AS1. 1c can then use iBGP do distribute this new prefix reach info to all routers in AS1 1b can then re-advertise the new reach info to AS2 over the 1b-to-2a eBGP session When router learns about a new prefix, it creates an entry for the prefix in its forwarding table. 3c 3a 3b AS3 1a AS1 2a 1c 1d 1b 2c AS2 2b eBGP session iBGP session 14 Outline 1. Highlights 2. Addressing and CIDR 3. BGP Messages and Prefix Attributes 4. BGP Decision and Filtering Processes 5. I-BGP 6. Route Reflectors 7. Multihoming 8. Aggregation 9. Routing Instability 10. BGP Table Growth 15 Internet Topology large ISP large ISP medium ISP dialup ISP dedicated access ISP • AS (Autonomous System) - a collection of routers under the same technical and administrative domain. • EGP (External Gateway Protocol) - used between two AS’s to allow them to exchange routing information so that traffic can be forwarded across AS borders. Example: BGP 16 Purpose: to share connectivity information you can reach net A via me AS2 BGP AS1 R3 R2 traffic to A R1 table at R1: dest next hop A R2 A R border router internal router 17 BGP Sessions One router can participate in many BGP sessions. Initially … node advertises ALL routes it wants neighbor to know (could be >50K routes) Ongoing … only inform neighbor of changes AS1 BGP Sessions AS3 AS2 18 Routing Protocols IGP: Interior Gateway Protocol. Examples: IS-IS, OSPF I-BGP R3 R2 A AS1 E-BGP announce B AS2 R1 AS3 R5 R4 R border router internal router B 19 Configuration and Policy A BGP node has a notion of which routes to share with its neighbor. It may only advertise a portion of its routing table to a neighbor. A BGP node does not have to accept every route that it learns from its neighbor. It can selectively accept and reject messages. What to share with neighbors and what to accept from neighbors is determined by the routing policy, that is specified in a router’s configuration file. 20 History Popularity: until a number of years ago BGP fairly unknown. Used by small number of large ISPs. in 1995 (beginning of web popularity) number of organizations using BGP grew tremendously. Two reasons for growth of usage and interest: significant growth in number of ISPs; organizations were born that had missioncritical dependence upon their connectivity CIDR (Classless Inter-Domain Routing) introduced in 1995 21 Assigning IP address and AS numbers (Ideally) A host gets its IP address from the IP address block of its organization An organization gets an IP address block from its ISP’s address block An ISP gets its address block from its own provider OR from one of the 3 routing registries: ARIN: American Registry for Internet Numbers RIPE: Reseaux IP Europeens APNIC: Asia Pacific Network Information Center Each AS is assigned a 16-bit number (65536 total) Currently 10,000 AS’s in use 22 Addressing Schemes Original addressing schemes (class-based): 32 bits divided into 2 parts: Class A 0 8 0 network Class B 0 0 network Class C 0 0 network host 16 host 24 host ~2 million nets 256 hosts CIDR introduced to solve 2 problems: exhaustion of IP address space size and growth rate of routing table 23 Problem #1: Lifetime of Address Space Example: an organization needs 500 addresses. A single class C address not enough (256 hosts). Instead a class B address is allocated. (~64K hosts) That’s overkill -a huge waste. CIDR allows networks to be assigned on arbitrary bit boundaries. permits arbitrary sized masks: 178.24.14.0/23 is valid requires explicit masks to be passed in routing protocols CIDR solution for example above: organization is allocated a single /23 address (equivalent of 2 class C’s). 24 Problem #2: Routing Table Size Without CIDR: 232.71.0.0 232.71.1.0 232.71.2.0 ….. 232.71.255.0 service provider 232.71.0.0 232.71.1.0 232.71.2.0 ….. 232.71.255.0 Global internet With CIDR: 232.71.0.0 232.71.1.0 232.71.2.0 ….. 232.71.255.0 service provider 232.71.0.0/16 Global internet 25 CIDR: Classless Inter-Domain Routing Address format <IP address/prefix P>. The prefix denotes the upper P bits of the IP address. Idea - use aggregation - provide routing for a large number of customers by advertising one common prefix. This is possible because nature of addressing is hierarchical Summarizing routing information reduces the size of routing tables, but allows to maintain connectivity. Aggregation is critical to the scalability and survivability of the Internet 26 Address Arithmetic: Address Blocks The <address/prefix> pair defines an address block: Examples: 128.15.0.0/16 => [ 128.15.0.0 - 128.15.255.255 ] 188.24.0.0/13 => [ 188.24.0.0 - 188.31.255.255 ] consider 2nd octet in binary: 198.00011000.0.0 settable Address block sizes 13th bit a /13 address block has 232-13 addresses (/16 has 232-16) a /13 address block is 8 times as big as a /16 address block because 232-13 = 232-16 * 23 27 CIDR: longest prefix match Because prefixes of arbitrary length allowed, overlapping prefixes can exist. Example: router hears 124.39.0.0/16 from one neighbor and 124.39.11.0/24 from another neighbor Router forwards packet according to most specific forwarding information, called longest prefix match Packet with destination 124.39.11.32 will be forwarded using /24 entry. Packet w/destination 124.39.22.45 will be forwarded using /16 entry 28 Will CIDR work ? For CIDR to be successful need: address registries must assign addresses using CIDR strategy providers and subscribers should configure their networks, and allocate addresses to allow for a maximum amount of aggregation BGP must be configured to do aggregation as much as possible Factors that complicate achieving aggregation multihoming, proxy aggregation, changing providers 29 Four Basic Messages Open: Establishes BGP session (uses TCP port #179) Notification: Report unusual conditions Update: Inform neighbor of new routes that become active Inform neighbor of old routes that become inactive Keepalive: Inform neighbor that connection is still viable 30 UPDATE Message used to either advertise and/or withdraw prefixes path attributes: list of attributes that pertain to ALL the prefixes in the Reachability Info field FORMAT: Withdrawn routes length (2 octets) Withdrawn routes (variable length) Total path attributes length (2 octets) Path Attributes (variable length) Reachability Information (variable length) 33 ATTRIBUTES ORIGIN: Who originated the announcement? Where was a prefix injected into BGP? IGP, EGP or Incomplete (often used for static routes) AS-PATH: a list of AS’s through which the announcement for a prefix has passed each AS prepends its AS # to the AS-PATH attribute when forwarding an announcement useful to detect and prevent loops Prefix 128.73.4.21/21 Next hop 232.14.63.4 AS Path 1239 701 3985 631 35 AS-Path Attribute Sequence of ASes a route has traversed Loop detection Apply policy AS 200 AS 100 170.10.0.0/16 180.10.0.0/16 Network Path 180.10.0.0/16 300 200 100 170.10.0.0/16 300 200 AS 300 AS 400 150.10.0.0/16 AS 500 Network 180.10.0.0/16 170.10.0.0/16 150.10.0.0/16 Path 300 200 100 300 200 300 400 36 Loop Handling 172.16.10.0/24 -- 1 AS2 172.16.10.0/24 – 2 1 AS1 172.16.10.0/24 AS3 172.16.10.0/24 – 4 3 2 1 AS4 172.16.10.0/24 – 3 2 1 37 AS_path Manipulation 192.213.1.0/24 – 200 50 192.213.1.0/24 -- 50 AS50 AS200 AS300 192.213.1.0/24 – 300 200 50 192.213.1.0/24 192.213.1.0/24 – 50 50 50 AS100 192.213.1.0/24 – 100 50 50 50 38 Attribute: NEXT HOP For EBGP session, NEXT HOP = IP address of neighbor that announced the route. For IBGP sessions, if route originated inside AS, NEXT HOP = IP address of neighbor that announced the route For routes originated outside AS, NEXT HOP of EBGP node that learned of route, is carried unaltered into IBGP. 209.15.1.0/24 1.1.1.1 A 3.3.3.3 D C 2.2.2.2 IBGP B 140.20.1.0/24 BGP Table at Router C: Destination Next Hop 140.20.1.0/24 2.2.2.2 209.15.1.0/24 1.1.1.1 IP Routing Table at Router C: Destination Next Hop 140.20.1.0/24 2.2.2.2 2.2.2.0/24 3.3.3.3 3.3.3.0/24 Connected 209.15.1.0/24 1.1.1.1 1.1.1.0/24 3.3.3.3 39 Attribute: Multi-Exit Discriminator (MED) when AS’s interconnected via 2 or more links Hint to external neighbors about the preferred path into an AS that has multiple entry points AS announcing prefix sets MED enables AS2 to indicate its preference AS receiving prefix uses MED to select link a way to specify how close a prefix is to the link it is announced on a lower value is preferred called an external metric AS1 Link B Link A MED=50 MED=10 AS2 AS4 AS3 40 Attribute: Local Preference Used to indicate preference among multiple paths for the same prefix anywhere in the internet. The higher the value the more preferred Exchanged between IBGP peers only. Local to the AS. Often used to select a specific exit point for a particular destination 140.20.1.0/24 AS1 AS3 AS2 AS4 BGP table at AS4: Destination AS Path Local Pref 140.20.1.0/24 AS3 AS1 300 140.20.1.0/24 AS2 AS1 100 41 128.213.0.0/16 WINET Example ZNET XNET 128.213.0.0/16 YNET 128.213.0.0/16 T3 T1 Set local preference to 200 Set local preference to 300 SJ LA ANET Affects the traffic going out of the AS 42 Routing Process Overview Routes received from peers accept, deny, set preferences Input policy engine Choose best route Decision process BGP table forward, not forward set MEDs Routes used by router Output policy engine Routes sent to peers IP routing table Page 145 of Halabi 43 Input Policy Engine Inbound filtering controls outbound traffic filters route updates received from other peers filtering based on IP prefixes, AS_PATH, community denying a prefix means BGP does not want to reach that prefix via the peer that sent the announcement accepting a prefix means traffic towards that prefix may be forwarded to the peer that sent the announcement Attribute Manipulation sets attributes on accepted routes example: specify LOCAL_PREF to set priorities among multiple peers that can reach a given destination 44 BGP Decision Process 1. Choose route with highest LOCAL-PREF 2. If have more than 1 route, select route with shortest AS-PATH 3. If have more than 1 route, select according to lowest ORIGIN type where IGP < EGP < INCOMPLETE 4. If have more than 1 route, select route with lowest MED value 5. Select min cost path to NEXT HOP using IGP metrics 6. If have multiple internal paths, use BGP Router ID to break tie. 45 Output Policy Engine Outbound Filtering controls inbound traffic forwarding a route means others may choose to reach the prefix through you not forwarding a route means others must use another router to reach the prefix may depend upon whether E-BGP or I-BGP peer example: if ORIGIN=EGP and you are a nontransit AS and BGP peer is non-customer, then don’t forward Attribute Manipulation sets attributes such as AS_PATH and MEDs 46 Transit vs. Nontransit AS Transit traffic = traffic whose source and destination are outside the AS Nontransit AS: does not carry transit traffic • Advertise own routes only • Do not propagate routes learned from other AS’s • case 1: r1 Transit AS: does carry transit traffic • Advertises its own routes PLUS routes learned from other AS’s ISP1 ISP1 r3 ISP2 r1 r3 r2 r2 r2 r1 ISP2 r3 r2,r3 r2 r2,r1 AS1 AS1 r2 r3 AS2 r1 • case 2: r3 r1 r1 r3 r2 AS1 r2 47 Clients, Providers and Peers AS has customers, providers and peers Relationships between AS pairs: customer-provider peer-to-peer Type of relationship influences policies Exporting to provider: AS exports its routes & its customer’s routes, but not routes learned from other providers or peers Exporting to peer: (same as above) Exporting to customer: AS exports its routes plus routes learned from its providers and peers 48 Internal BGP (I-BGP) Used to distribute routes learned via EBGP to all the routers within an AS I-BGP and E-BGP are same protocol in that same message types used same attributes used same state machine BUT use different rules for readvertising prefixes Rule #1: prefixes learned from an E-BGP neighbor can be readvertised to an I-BGP neighbor, and vice versa Rule #2: prefixes learned from an I-BGP neighbor cannot be readvertised to another I-BGP neighbor 49 I-BGP: Preventing Loops and Setting Attributes Why rule #2? To prevent announcements from looping. In E-BGP can detect via AS-PATH. AS-PATH not changed in I-BGP Implication of rule: a full mesh of I-BGP sessions between each pair of routers in an AS is required Setting Attributes: The router that injects the route into the I-BGP mesh is responsible for setting the LOCAL-PREF attribute prepending AS # to AS-PATH 50 Route Reflectors Problem: requiring a full mesh of I-BGP sessions between all pairs of routers is hard to manage for large AS’s. Solution: I-BGP Peering RR group routers into clusters. Assign a leader to each cluster, called a route reflector (RR). Members of a cluster are called clients of the RR clients peer only with their RR RR’s must be fully meshed RR RR A B C clusters clients 51 Route Reflectors: Rule on Announcements If received from RR, reflect to clients If received from a client, reflect to RRs and clients If received from E-BGP, reflect to all RRs and clients RR’s reflect only the best route to a given prefix, not all announcements they receive. helps size of routing table sometimes clients don’t need to carry full table 52 Default Routes If you don’t have a routing entry in the table for a destination, send it along the default route Can be statically configured Can be learned via BGP Can have multiple defaults use LOCAL_PREF to distinguish primary and backup default routes AS2 0.0.0.0/0 next_hop=1.1.1.1 Local pref=300 0.0.0.0/0 next_hop=2.2.2.2 Local pref=100 AS1 53 Multihoming 54 Single-homed vs. Multi-homed subscribers A single-homed network has one connection to the Internet (i.e., to networks outside its domain) A multi-homed network has multiple connections to the Internet. Two scenarios: can be multi-homed to a single provider can be multi-homed to multiple providers Why multi-home? Reliability Performance • a site’s bandwidth to internet is sum of bandwidth on all links 55 Single-homed AS Subscriber called a “stub AS” Provider-Subscriber communication for route advertisement: static configuration. most common. • Provider’s router is configured with subscriber’s prefix. • good if customer’s routes can be represented by small set of aggregate routes • bad if customer has many noncontiguous subnets Provider R1 R2 Subscriber can use BGP between provider and stub AS 56 Multihoming to a Single Provider: 4 scenarios ISP ISP R1 R1 R2 R2 Customer R3 Customer ISP R1 ISP R2 R3 Customer R1 R2 R3 R4 Customer 57 Multihoming to Multiple Providers ISP 3 ISP 2 ISP 1 Customer 58 Multihoming Issues Load sharing how distribute the traffic over the multiple links? Reliability if load sharing leads to preferering certain links for certain subnets, is reliability reduced? Address/Aggregation which subnet addresses should the multihomed customer use? how will this affect its provider’s ability to aggregate routes? 59 Load sharing from Customer to ISP using policy Goal: send traffic to ISP’s customers on one link; send traffic to the rest of the Internet on 2nd link Implement using policy to control announcements blue: announcements red: traffic ISP R1 R2 advertise customer routes only advertise default route 0/0 R3 traffic Customer 60 Load sharing from ISP to customer using attributes Goal: provider splits traffic across 2 links according to prefix Implement this strategy using attributes customer sets MEDs provider sets LOCAL_PREF 61 Example C4 C3 C2 C5 SF NY X:200 Y:200 Rest:300 C3:300 C2:300 Rest:200 (X,Y) W:200 Z:200 Rest:250 SF NY C5:300 C4:300 Rest:250 (Z,W) IBGP 62 Example ISP R1 140.35/16 208.22/16 R2 R3 Customer 208.22/16 140.35/16 63 Address/Aggregation Issue Where should the customer get its address block from? 1. From ISP1 2. From ISP2 3. From both ISP1 and ISP2 4. Independently from address registry ISP 3 ISP 1 ISP 2 customer (cases 1 and 2 are equivalent) 64 Case 1 & 2: Get address block from one ISP example: customer gets address from ISP 1 ISP 1’s aggregation is not broken customer’s prefix not aggregatable at ISP 2 longer prefix becomes a traffic magnet How good is load sharing? If all ISP’s generate same amount of traffic for customer, then ISP2customer link twice as loaded as ISP1-customer link ISP 3 140.20/16 200.50/16 140.20.6/24 ISP 1 ISP 2 140.20/16 200.50/16 140.20.6/24 140.20.6/24 customer 140.20.6/24 65 Case 3: Get address block from both ISPs announcement policy: announce prefix only to its “parent” advantage: both ISP’s can aggregate the prefix they receive disadvantage: lose reliability load balancing good? depends upon how much traffic sent to each prefix ISP 3 ISP 1 ISP 2 140.20/16 200.50/16 140.20.1/24 200.50.1/24 customer 140.20.1/24 200.50.1/24 66 Case 4: obtain address block from registry no aggregation possible no traffic magnets created ISP 3 good reliability how achieve load sharing? customer breaks address block into 2 /25 blocks, and announce one per link (but may lose reliability) OR use method of AS-PATH manipulation 100.20/16 200.50/16 150.55.10/24 150.55.10/24 ISP 1 ISP 2 100.20/16 200.50/16 150.55.10/24 150.55.10/24 customer 150.55.10/24 67 AS-PATH manipulation Idea: prepend your AS number in AS-PATH multiple times to discourage use of a link makes AS-PATH seem longer than it is recall BGP decision process uses shortest AS-PATH length as a criteria for selecting best path Example: ISP 3 will choose path through AS 2 because its AS-PATH appears shorter ISP 3 150.55.10/24 - 2 33 150.55.10/24 - 1 33 33 AS 2 AS 1 150.55.10/24 - 33 33 150.55.10/24 - 33 customer 150.55.10/24 AS 33 68 Aggregation 69 Address Arithmetic: When is aggregation possible? Case 1 Possible when one prefix contained in another. Example: Two AS’s having customer-provider relationship. Provider does the aggregation. Provider has address block 18.0.0.0/8 Its customers have address blocks 18.6.0.0/15 and 18.9.0.0/15 Provider announces its address block only Rule: Prefix p1 contains prefix p2 iff length(p2) > length(p1) AND address(p2) / 232-length(p1) = address(p1) / 232length(p1) 70 Address Arithmetic: When is aggregation possible? Case 2 Some pairs of consecutive prefixes Example: routes within the same AS: AS has 2 address blocks: 1.2.2.0/24 = 0000001.00000010.00000010.00000000/24 1.2.3.0/24 = 0000001.00000010.00000011.00000000/24 can announce instead 1.2.2.0/23 Rule: consecutive prefixes p1 and p2 are aggregatable iff length(p1)=length(p2) AND address(p1) / 232-length(p1) +1 = address(p2) / 232-length(p2) AND address(p1) % 233-length(p1) = 0 71 Black holes and cardinal sins “The cardinal sin of BGP NAP routing is advertising routes that you don't wrong !! know how to get to; 100.24.0.0/18 called "black-holing" 100.24/16 someone” 100.24.56.0/21 If you announce part of [100.24.56.0-100.24.63.255] 222.2/16 IP space owned by Net C someone else, using a more-specific prefix, all their traffic flows to ISP 1 ISP 2 you. Makes that address 100.24/16 222.2/16 block disconnected from the Internet. Example: black holes can 100.24.16.0/21 100.24.0.0/20 happen inadvertently by [100.24.0.0-100.24.15.255] non-careful aggregation [100.24.16.0-100.24.23.255] Net B Net A 73 Limitations to Aggregation Lack of hierarchical allocation of address space prior to CIDR (before 1995) A single AS can have noncontiguous address blocks Customer AS and provider AS can have noncontiguous address blocks Reluctance of customers to renumber their address space when they switch providers Multi-homing multi-homed prefixes require global visibility may choose not to: load sharing 74 Rules of Thumb for Aggregation To avoid black holes: an ISP is not allowed to aggregate routes unless it is a supernet of the address block it wants to aggregate In other words, an ISP has to specifically announce each IP address range of its downstream customers that are not part of its own address space. 75 Routing Instability 76 Route Stability Routing instability: rapid fluctuation of network reachability information route flapping: when a route is withdrawn and re-announced repeatedly in a short period of time happens via UPDATE messages because messages propagate to global Internet, route flapping behavior can cascade and deteriorate routing performance in many places Effects: increased packet loss, increased network latency, CPU overhead, loss of connectivity Route Stability Studies by C. Labovitz, R. Malan & F. Jahanian 77 Types of Routing Updates Forwarding instability reflects legitimate topology changes e.g., changes in Prefix, NEXT_HOP and/or ASPATH affects forwarding paths used Policy fluctuation reflects changes in policy e.g., changes in MED, LOCAL_PREF, etc. may not necessarily affect forwarding paths used Pathological redundant messages reflect neither topology nor policy changes 78 Anecdotes of Route Flap Storms April 25, 1997 - small Virginia ISP injected incorrect map into global Internet. Map said Virginia ISP had optimal connectivity to all destinations. Everyone sent their traffic to this ISP. Result: shutdown of Tier-1 ISPs for 2 hours. August 14, 1998 - misconfigured database server forwarded all queries to “.net” to wrong server. Result: loss of connectivity to all .net servers for few hours. Nov. 8, 1998 - router software bug led to malformed routing control message. Caused interoperability problem between Tier-1 ISPs. Result: persistent pathological oscillations and connectivity loss for several hours. 79 Taxonomy (as per Labovitz et. al.) Name Type Character WADiff AADiff WADup AADup WWDup Explicit withdrawal followed by announcement. Replace route with different path Announced twice (implicit withdrawal). Replace route with different path. Explicit withdrawal followed by announcement. Replace route with same path. Announced twice (implicit withdrawal). Replace route with same path. Repeated duplicate withdrawals Legitimate Legitimate Legitimate or pathological Policy change or pathological Pathological 80 General Statistics 1996: 3-5 million updates per day in Internet core 1998: 300K-700K updates per day in Internet core 1996: average number of announcements per day was ~275K. 1998: average number of announcements per day was ~400K Correlation of instability and usage instability highest during business hours instability lowest during nights, on weekends and in summer 81 Per Event Type Statistics 1996 relative impact (approximately): WWDup (96%), AADup (2%), WADup (1%), AADiff(1/2%), WADiff (1/2%) 1998 relative impact (approximately): AADup (30-40%), WWDup(25-30%), AADiff (~20%) other (rest) 82 Who’s Responsible? AS’s No single AS dominates instability statistics No correlation between the size of an AS and its share of updates generated. Prefixes Instability is evenly distributed across routes. Example of measurements: • 75% of AADiff events come from prefixes change less than 10 times a day. • 80-90% of instability comes from prefixes that are announced less than 50 times/day. 83 Sources of Instabilities in General router configuration errors transient physical and data link problems software bugs problems with leased lines (electrical timing issues that cause false alarms of disconnect) router failures network upgrades and maintenance 84 Instability Problem and Cause. Example 1. Problem: 3-5 million duplicate withdrawals Cause: stateless BGP implementation time-space tradeoff: no state maintained on what advertised to peers when receive any change, transmit withdrawal to all peers regardless of whether previously notified or not sent updates for both explicit and implicit withdrawals By 1998, most vendors had BGP implementations with partial state. Result: number of WWDups reduced by an order of magnitude 85 Instability Problem and Cause. Example 2 Problem: duplicate announcements Cause: min-advertisement timer & stateless BGP min-adv timer: wait 30 seconds. Combine all received updates in last 30 seconds into single outbound update message (if possible). within 30 seconds route can be withdrawn and re-announced so that there is no net change to original announcement Solution: Have BGP keep some state about recently sent messages to peers. Avoid sending duplicate messages 86 Instability Problem and Cause. Example 3 AS 1 Net X R2 R8 R1 R3 4 R4 AS3 R6 3 R7 6 5 10 2 R5 AS2 R Example: interaction IGP/BGP policy: set MED using IGP metrics, such as shortest path border router internal router E-BGP IGP 87 Controlling route instability: Route Dampening track number of times a route has flapped over a period of time routes that exhibit a high level of instability in a short period of time should be suppressed (not advertised) penalize ill behaved routes proportionally to their expected future stability if a suppressed route stops flapping for a long enough period of time, unsuppress it (readvertise) 88 Route Dampening penalty suppress limit reuse limit time 89 Route Dampening Algorithm Each time a route flaps, increase the penalty If the route has not flapped in the last ‘half-life’ time units, then cut penalty in half. If the penalty > suppress limit, then suppress the route If the penalty < reuse limit, then free a suppressed route 90 Side Effect of Route Dampening A legitimate update may arrive and it will be ignored because that route has been suppressed and not yet released. The modification needed for the legitimate announcement is delayed 91 Aggregation can help route flapping If a more-specific route is flapping, but provider only announces aggregated prefix, then other networks don’t see route flap. Hence aggregation can mask route flapping. Aggregation helps combat instability because it reduces the number of networks visible in the core Internet. Tier 1 provider not flapping 140.40/16 Tier 2 provider flapping 140.40.4/24 customer 92 Growth of BGP Tables 93 Long Term Growth Trends in Internet Routing Will this routing system be able to scale and meet the growth of the Internet and its ever-expanding level of demands? Are there any inherent limitations? As more devices connect to Internet and consume addresses, the need to maintain reachability to these addresses implies larger routing tables What is the ability of the system to produce a stable view of the overall network topology? 94 BGP Table Growth (1989-2001) 95 Reasons for Exponential Growth Measures Growth Number of AS’s Growing exponentially - 50% yearly Range of addresses spanned by Growing 7% yearly table Average span of route table entry Growing 1 bit per 29 months Holes in the table Currently at 37% Number of announcements Growing 42% yearly Data in last 3 slides from Geoff Huston’s INET publications 97 Increasing fine levels of routing details in BGP table AS space growth at 50% & addresses spanned by the table growing at 7% => each AS advertising smaller address ranges /24 is fastest growth prefix in table (in absolute terms). /24 /31 is area with fastest relative growth 1999: average span of prefix 16,000 addresses (mean prefix length 18.03) 2000: average span of prefix 12,000 addresses (mean prefix length 18.44) 98 Holes When both aggregated prefix and a more- specific prefix exist in the table, the more-specific prefix is called a “hole”. Why are holes created? Punch hole in policy of larger aggregated announcement to create a different policy for finer announcement. Load sharing in multihoming scenario reliability via multihoming 37% of BGP table is holes. 99 Multihoming vs. Resiliency Multiple peering relationships can be cheaper than using a single upstream provider implies: multihoming is seen as a substitute for upstream service resiliency Impact providers can’t command a price for reliability, and thus don’t spend money to engineer it into network. resiliency is becoming responsibility of customer not provider Can BGP scale adequately to continue to undertake this role? 100 Conclusions Things are getting better (stability) router software and configuration management are maturing increased emphasis on aggregation and route dampening are helping Things are getting worse (scalability) multihoming is still growing internet topology growing less hierarchical ‘noise’ in BGP table is growing 101 Longest Prefix Match Algorithms 102 Forwarding Engine Packet payload header Router Destination Address Routing Lookup Data Structure Outgoing Port Forwarding Table Dest-network Port 65.0.0.0/8 3 128.9.0.0/16 1 149.12.0.0/19 7 103 The Search Operation is not a Direct Lookup (Incoming port, label) Memory (Outgoing port, label) IP addresses: 32 bits long 4G entries 104 The Search Operation is also not an Exact Match Search Exact match search: search for a key in a collection of keys of the same length. Relatively well studied data structures: Hashing Balanced binary search trees 105 Example Forwarding Table Destination IP Prefix Outgoing Port 65.0.0.0/8 3 128.9.0.0/16 Prefix length 142.12.0.0/19 1 7 IP prefix: 0-32 bits 65.0.0.0/8 0 65.0.0.0 128.9.0.0/16 128.9.16.14 224 65.255.255.255 142.12.0.0/19 232-1 106 Prefixes can Overlap Longest matching prefix 128.9.176.0/24 128.9.16.0/21 128.9.172.0/21 65.0.0.0/8 0 128.9.0.0/16 128.9.16.14 142.12.0.0/19 232-1 Routing lookup: Find the longest matching prefix (aka the most specific route) among all prefixes that match the destination address. 107 Prefix Length Difficulty of Longest Prefix Match 32 24 128.9.176.0/24 128.9.16.0/21 128.9.172.0/21 142.12.0.0/19 128.9.0.0/16 8 65.0.0.0/8 128.9.16.14 Prefixes 108 Lookup Rate Required Year Line Line-rate (Gbps) 40B packets (Mpps) 1998-99 OC12c 0.622 1.94 1999-00 OC48c 2.5 7.81 2000-01 OC192c 10.0 31.25 2002-03 OC768c 40.0 125 31.25 Mpps 33 ns DRAM: 50-80 ns, SRAM: 5-10 ns 109 Number of Prefixes Size of the Forwarding Table 100000 90000 80000 70000 60000 50000 10,000/year 40000 30000 20000 10000 0 95 96 97 Year 98 99 Source: http://www.telstra.net/ops/bgptable.html 00 110 Longest Prefix Match is Harder than Exact Match The destination address of an arriving packet does not carry with it the information to determine the length of the longest matching prefix Hence, one needs to search among the space of all prefix lengths; as well as the space of all prefixes of a given length 111 LPM in IPv4 Use 32 exact match algorithms for LPM! Exact match against prefixes of length 1 Network Address Exact match against prefixes of length 2 Priority Encode and pick Port Exact match against prefixes of length 32 112 Metrics for Lookup Algorithms Speed (= number of memory accesses) Storage requirements (= amount of memory) Low update time (support >10K updates/s) Scalability With length of prefix: IPv4 unicast (32b), Ethernet (48b), IPv4 multicast (64b), IPv6 unicast (128b) With size of routing table: (sweetspot for today’s designs = 1 million) Flexibility in implementation Low preprocessing time 113 Radix Trie Trie node A 1 P1 111* H1 P2 10* H2 P3 1010* H3 P4 10101 H4 C P2 G P3 Lookup 10111 next-hop-ptr (if prefix) right-ptr left-ptr B 1 0 1 D 1 E 0 1 Add P5=1110* 0 P4 H P5 P1 F I 114 Radix Trie W-bit prefixes: O(W) lookup, O(NW) storage and O(W) update complexity Advantages Disadvantages Simplicity Worst case lookup slow Wastage of storage space in chains Extensible to wider fields 115 Leaf-pushed Binary Trie Trie node A 1 P1 111* H1 P2 10* H2 P3 1010* H3 P4 10101 H4 C P2 G B 1 0 1 0 left-ptr or next-hop right-ptr or next-hop D P1 E P2 P3 P4 116 Practical Algorithm To Retrieve Information Coded In Alphanumeric PATRICIA B D 2 0 3 0 1 A Patricia tree internal node 1 P1 E C bit-position right-ptr left-ptr 5 P2 F P1 111* H1 P2 10* H2 P3 1010* H3 P4 10101 H4 0 P3 1 P4 G Lookup 10111 Bitpos 12345 117 PATRICIA • W-bit prefixes: O(W2) lookup, O(N) storage and O(W) update complexity Advantages Disadvantages Decreased storage Extensible to wider fields Worst case lookup slow Backtracking makes implementation complex 118 Path-compressed Tree Lookup 10111 1, , 2 0 0 10,P2,4 1 P4 111* H1 P2 10* H2 P3 1010* H3 P4 10101 H4 1 P1 C D 1010,P3,5 P1 B A E Path-compressed tree node structure variable-length next-hop (if prefix present) bitstring left-ptr bit-position right-ptr 119 Path-compressed Trie • W-bit prefixes: O(W) lookup, O(N) storage and O(W) update complexity Advantages Disadvantages Decreased storage Worst case lookup slow 120 Binary Search by Prefix Intervals {}, 1, 10, 100, 101, 1110 1: 10000-11111 {}: 00000-11111 10: 10000-10111 101: 10100-10111 1110: 11100-11101 100: 10000-10011 ( ( ( ( ) ( ) ) ( ) ) ) 00000 10000 10000 10000 10011 10100 10111 10111 11100 11101 11111 11111 1100 121 Rules If the destination address we search for fits to the right of a left paranthesis, the prefix represented by the left paranthesis is the longest match Otherwise, increment for each right paranthesis, decrement for each left paranthesis until the count reaches -1 122 Example revisited ( ( ( ( ) ( ) ) ( ) ) ) 00000 10000 10000 10000 10011 10100 10111 10111 11100 11101 11111 11111 1100 -1 0 1 2 1 2 1 1100 is in the interval 10000-11111, longest prefix match is for the prefix 1 123 Binary Search on Intervals •W-bit N prefixes: O(logN) lookup, O(N) storage Advantages Disadvantages Storage is linear Can be ‘balanced’ Lookup time independent of W But, lookup time is dependent on N Incremental updates more complex than tries Each node is big in size: requires higher memory bandwidth 124