* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download fast rerouting - CSE Labs User Home Pages
Deep packet inspection wikipedia , lookup
Asynchronous Transfer Mode wikipedia , lookup
Distributed firewall wikipedia , lookup
Piggybacking (Internet access) wikipedia , lookup
Multiprotocol Label Switching wikipedia , lookup
Wake-on-LAN wikipedia , lookup
Technological convergence wikipedia , lookup
Computer network wikipedia , lookup
Cracking of wireless networks wikipedia , lookup
IEEE 802.1aq wikipedia , lookup
List of wireless community networks by region wikipedia , lookup
Network tap wikipedia , lookup
Network Failures and Their Impacts; Fast IGP Routing Convergence and IP Fast Re-Routing (IP FRR) • Network Failures and Intra-Domain Routing (IGP) – ISP Link Failure Studies – Failure Characteristics, Causes and Impacts • IGP Fast Routing Convergence •Speed up routing convergence after routing changes • IP Fast Re-Routing (IP FRR) – Fast Rerouting Schemes: Failure Insensitive Routing – Other Schemes Readings: Please do the required readings CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 1 Why Network Fails • Many, many possible reasons and causes • Human Errors – misconfigurations – other mistakes: e.g, let’s see what that red button does • Software Bugs – buggy implementation, incompatibility, … • Hardware failures – flaky interfaces, link errors, fiber cuts, router crashes due to CPU overload or running of memory, …. • Malicious attacks • Network Overload – traffic surges causing network congestion, … • Others: e.g., natural disasters, major accidents – E.g., Baltimore tunnel fire, Ohio train accident, … CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 2 Understanding Network Link Failure Characteristics • Failure Characteristics Within an ISP network – – – – How often do links/routers failure? How many? Are they random, correlated? How long do they last? … • What about inter-domain or Internet wide? – What causes BGP to update/withdraw routes? – Destination network down? AS internal failures? BGP session resets? Policy changes? … • How do we measure, detect and analyze network failures? • How do we trouble-shoot network failures and perform rootcase analysis? • How do we design more robust and resilient mechanisms? CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 3 This Lecture: Focusing on Impact of Failures within an ISP Network • With IP networks becoming the dominant and “converged” information delivery substrate, displacing telephone networks, and eventually cable TV? – Need to better “service availability” – Telephone networks: service availability metrics: 5 9’s: i.e., 99.999% – What about IP networks? • Effect of IP network failures: routers lose “reachability”: i.e., no forward entries – or existence of transient/permanent forwarding loops – • What are impacts of network failures? – In particular, on VoIP services CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 4 Failures Affect Link Loads • Many ISP networks are “over-provisioned” so as to handle network failures: • – Many claim: normal load utilization < 50% But still high variability in link utilization: – Can find a link w/ load > 50% every 15 minutes; > 90% every 8 days CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 5 Traffic Potholes or Blackholes Sprint Measurement Study Anecdote: • Average delay over 5 sec intervals • Traffic was blackholed for more than 10 minutes • It took about 40 minutes for the network to reach a stable state • Root Cause: Route Misconfigurations! CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 6 Routing Loops under Failures • Loops due to link failures/new route advertisement • Measurements from 3 backbone links – 25% packets caught in a loop in one failure instance – 1% lost due to expire TTL; those that escape have long delays CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 7 Sprint Link Failure Study • Link “failures” occur fairly frequently, well spread over time – Inter-POP links are more stable than intra-PoP links – Many intra-PoP link “failures” due to planned events, less impact on traffic due to “full-mesh” intra-PoP topology • Most link failures tend to be transient – Excluding “planned” failures – Most are single link failures – Some are correlated link failures • Link failure characteristics vary depend on links – Depending causes of failures, e.g., flaky interfaces, router overloads, fiber cuts, etc., • Impact of link failures – OC48 link down for 6 seconds: 3 million packets may be lost! – significant impact on applications such as VoIP, on-line gamiing CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 8 Methodology: Integrated Monitoring Sprint Measurement Study [I+02,M+04] • Tier-1 ISP backbone (600+ nodes) • Passive route listener software to collect IS-IS & BGP updates • IPMON passive traffic monitoring & active probes • SONET alarm logs; router configurations and BGP policies CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 9 IGP Failure Events • IP link: adjacency between two IS-IS routers • Link Failure: loss of this adjacency • Results shown in the following slides only include – US inter-PoP links (OC48) – Failures less than 24 hrs long CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 10 Sprint Study: Link Failure Frequencies CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 11 Sprint Study: Duration of Failures CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 12 Sprint Study: Failures across Links CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 13 Scatter Plot of US Failure Events • Apr. – Nov. 2002 CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 14 Maintenance (or Planned Failures) • Weekly schedule (Mondays 5am – 2pm UC): 20% of failures CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 15 Examples of Planned Failures • Upgrades – Changing link to higher capacity – Loading new operating system on a router – Swapping out an old interface card • Maintenance – Fixing a flaky optical amplifier – Configuration changes that require a reboot – Responsible for 50% of intradomain failures • Cable intrusions – Construction activities near a fiber CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 16 Failure Classification CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 17 Anomalies Found in Shaikh04 paper • Intermittent hardware problem – Router periodically losing OSPF adjacencies – Risk of network partition if 2nd failure occurred • External link flaps – Congestion on edge link causing lost messages – Lost adjacency leading to flapping routes • Configuration errors – Two routers assigned the same IP address – Inefficient config leading to duplicate LSAs • Vendor implementation bug – More frequent refreshing of LSAs than specified CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 18 Converging After a Failure • Failure detection – Router recognizes an incident link has failed • Failure notification Routing convergence – Router informs other routers about the change • Path re-computation – Routers compute new paths avoiding the link • Forwarding-table update Forwarding convergence – Routers update their forwarding tables – Data traffic starts to flow over the new path • AT&T, Sprint studies show – convergence time 100s milliseconds up to a few seconds CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 19 All Together: Looking Inside Router LSA Processing Route Processor (CPU) OSPF Process LSA Flooding Topology View SPF Calculation SPF Calculation FIB Update FIB LSA LS Ack Forwarding Forwarding Data packet Interface card CSci5221: LSA Switching Fabric Network Failures; IGP Fast Convergence and IP Fast Re-Routing Data packet Interface card 20 Bad Things Happen During Convergence • Transient inconsistencies – Creating “transient forwarding loops” due to – Routers have different views of the network – Forwarding decisions may be inconsistent • Effects on data traffic – – – – Black-hole: packet loss Loops: packets going in circles Delay: packets going on very long paths Out-of-order: new packets arrive before old ones • Want to minimize convergence delay – … and especially the effects on the data traffic CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 21 Example: Transient Forwarding Loop (or Micro-loop) • Set of routers disagree – One router acting on old information – Another router acting on new information Loop! s CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing d 22 Reducing Impact of Link Failures Assuming Traditional Link-State Protocols • Improving convergence time of control/data plane – Reducing timer value for HELO messages – Can achieve sub-second convergence time • 200 msecs common target, threshold for VoIP quality, do-able! – However, • Still react to failure events, can’t prevent packet loops or losses during convergence • may amplify effect of short “transient” failures that last subseconds • Prevent “micro loops” during transient routing convergence periods – One solution: using “ordered FIB updates” – requires coordination among routers, adds complexity, delays convergence time • Dealing with “Planned Failures” ? CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 23 Reducing Impact of Link Failures Using MPLS • Can pre-compute back-up paths • – Often done using the “link protection” scheme – For each link, there is a MPLS protection (back-up) path But – Need to change “forwarding plane” of routers – Many networks don’t have MPLS deployed Question: Can we perform fast rerouting using “traditional” link state routing protocols without resort to MPLS? CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 24 Fast Re-Routing using Link State Protocols [Nelakuditi et al] • Motivations – Most common link failures are transient single-link failures – Hastily react to such failures by LSA flooding may do more harm than good, causing network instability! – Suppress such failures unless it lasts longer than a threshold – But we want to be able to re-route affected packets along a back-up path, not simply dropping them ! • FIFR (failure Insensitive Fast Re-routing): nearly 100% forwarding continuity – prepare for (instead of react to) failures – adapt to changes while ensuring stability – Other Advantages: • no change to forwarding plane • minimal change to routing plane CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 25 What is Interface Specific Forwarding? • Interface-independent forwarding – destination next-hop – Each line card has a copy of the same FIB • Interface-specific forwarding – <incoming interface, destination> next-hop – Different forwarding entries at each line card • Forwarding operation remains the same CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 26 ISF Enables Local Rerouting • Infer failures based on interface and destination – Find the farthest keylink whose failure would cause a packet to arrive at the unusual interface along the reverse shortest path to the destination • Precompute interface-specific forwarding tables – Avoid the keylink in choosing next hop for a destination • Failure Inferencing based Fast Rerouting – IP fast reroute without explicit routing/tunneling CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 27 Illustration: No Failure Scenario F CSci5221: B B C C D D E B F B A A C AE D A E E F E F Network Failures; IGP Fast Convergence and IP Fast Re-Routing 28 Illustration: Local Rerouting without ISF F A A C A D A E A F A new routing table at router B after detecting the failure link B – E fails! CSci5221: B B C C D D E B F B F Network Failures; IGP Fast Convergence and IP Fast Re-Routing 29 Illustration: Local Rerouting with ISF CSci5221: B B C C D D E B F B B - A A C C C A D D D A E C E A F D F A F F Network Failures; IGP Fast Convergence and IP Fast Re-Routing 30 ISF Table Computation • Infer failed links from packet’s arrival at an interface – K dj i keylink whose failure causes packet to d arrive at i from j – A link u -> v is a candidate keylink if • • with u->v, j is a next hop from i to d without u->v, edge j->i is along the shortest path from u to d – K dj i is the farthest one from i among candidate keylinks • Avoid keylink in choosing the destination’s next hop – Fd j i • next hops to d from i when packet arrives at i from j Fjdi Rid (E \ K dji ) Failure inferencing is not done per packet – CSci5221: ISF table entries computed upon link state updates Network Failures; IGP Fast Convergence and IP Fast Re-Routing 31 Illustration: ISF Table Computation B - C C D D E C F D B B C - D D E B F B K BC A {} K BD A {} K BE A {B-E} CSci5221: B B C C D - E B F B K BF A {E-F} When no more than one link failure is suppressed in a network with symmetric weights, FIFR always forwards successfully to a destination if a path to it exists Network Failures; IGP Fast Convergence and IP Fast Re-Routing 32 Operations under FIFR Event Adjacent nodes Packet arrival Interface-specific forwarding Link down Initiate local rerouting Link up before suppression interval Resume forwarding on the recovered link Link down beyond suppression interval Link state update Recompute interfacespecific forwarding tables Link up after suppression interval Link state update Recompute interfacespecific forwarding tables CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing Other nodes 33 Handling both Link and Node Failures • Infer keynodes instead of keylinks – A node u is a candidate keynode if • with u, j is a next hop from i to d • without u, edge j->i is along the shortest path from the upstream node of u (w.r.t. the path from i to u) to d – Keynode is the farthest one from i among candidates • When no route to destination without a node – Node adjacent to the failure assumes link failure – Non-adjacent nodes treat it as adjacent node failure – May cause loops when destination is indeed not reachable • Protects against non-partitioning single failures CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 34 Networks with Asymmetric Link Weights • FIFR can handle asymmetric link weights – By forcing packets to take reverse shortest path – Provided links are bidirectional • Keynode computation based on rSPF – A node u is a candidate keynode if • with u, j is a next hop from i to d • without u, edge i->j is along the shortest path from d to the upstream node of u (w.r.t the path from i to u) – Keynode is the farthest one from i among candidates – Works with both symmetric and asymmetric weights CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 35 Networks with Broadcast Links • FIFR applicable to networks with broadcast links – A broadcast link is modeled with point to point links from/to the designated router • Adjacent failures – Broadcast link failure treated as that of designated router • Non-adjacent failures – Not necessary to know the previous hop of a packet to compute interface-specific keynode per destination – Failure inferencing can be done as before CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 36 Summary of FIFR • Fast reroute under any single failures – Without changing/encapsulating IP datagram • May cause loops under multiple failures – With ISF, guaranteed-protection against single failures or loop-freedom under multiple failures but not both – Blacklist-based Interface Specific Forwarding • Needs interface-specific forwarding – Two forwarding entries per destination – O(|E|log2|V|) to compute forwarding entries CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 37 Other Approaches See the optional reading [GRY07] for more detials. •Loop-free Alternative (LFA): fast re-routing only when direct link to (default) next-hop fails – simpler computation – we know exactly which link to remove when computed new next-hop; but protection limited – using IP-tunnels, etc. •U-turn: allow protection over multiple hops •Using “Not-Via” Addresses •Multi-topology routing – routers and links (with possibly different link weights) belong to multiple topologies • E.g., a default topology, plus “back-up” topologies with various (assumed) links removed (or new link weights) – packets are “marked” with “topology id” for look-up • IETF Fast Rerouting and MT-Routing Working Groups CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing 38