Protection and Restoration
• Definitions
• A major application for MPLS

The problem
• Network resources will fail
  – Nodes and links
• The IGP will re-converge
  – But this may take some time
    • Tens of seconds
  – Fast convergence has a price
    • It may make the IGP more sensitive/unstable
  – Some traffic is sensitive and cannot afford interruptions
    • Voice, consumer TV
• So do something for the interval until the IGP re-converges

Terminology
• Restoration
  – Bring traffic back to normal
• Backup
  – Alternative resources to be used when there is a failure
• Protection
  – Determine and allocate the backup resources before the failure
  – When there is a failure, just activate them
  – Can be very fast
• Repair
  – Determine, allocate, and activate the backup resources after the failure
  – Will be slower

Failure Modes
• Single vs. multiple link failures
  – If the duration of a link failure is short, we can assume there will be only a single link failure at a time
  – Multiple simultaneous link failures are much harder to deal with
• Node vs. link failures
  – We can assume that links fail more frequently than nodes
  – Node failures are harder to handle

Backup resources
• Can be of multiple types
  – Links
  – Paths
  – Trees
  – Cycles
  – Whole topologies
• To avoid overloading the network after a failure, some extra capacity must be set aside for backup resources
• The problem is how to engineer the backups without making the network too expensive
  – Minimize the amount of backup capacity that is reserved

More jargon
• 1:1
  – 1 working, 1 backup
  – Wastes a lot of bandwidth on backups
• 1:N
  – N working, 1 backup
  – Assume that only 1 working resource will fail at a time
  – Then 1 backup is enough, which saves bandwidth
• Revertive
  – When the failure is fixed, revert to the primary
• SRLG: Shared Risk Link Group
  – A set of network links that fail together
  – E.g., fibers that run in the same conduit
    • A bulldozer will cut all of them together

Other issues
• How to detect the failure fast
  – BFD is one general solution
  – There are also medium-specific solutions
    • OAM for ATM
    • Alarms for
      SONET
  – These are preferable if they exist
  – Protocol mechanisms (RSVP hellos, OSPF hellos, etc.)
• How to activate the backup
  – I.e., how to make traffic use an alternate path or tree

Backbone failure analysis
• Sprint backbone, ca. March 2002
  – Link on the class website
• Monitors IS-IS traffic
• Data only for link failures, not node failures
• Failure duration
  – 50% of failures last less than 1 min
  – 40% of failures last between 1 and 20 min
• Maintenance
  – 50% of failures occur during maintenance windows
• Mean time between failures (MTBF)
  – Varies a lot across links
    • There are “good” and “bad” links
  – 3 bad links account for 25% of the failures

More analysis
• Unplanned failure breakdown
  – Shared link failures = 30%
    • Router related = 16.5%
    • Optical related = 11.5%
  – Individual link failures = 70%
• Node failures are less common than single link failures
• About 16.5% of failures affect more than 1 link

Handling failures with IP
• Easy case
  – ECMP: no need to do anything extra during the failure
  – But it may not repair all failures
  – Coverage: the percentage of possible failures that can be repaired
• In general, activating backup resources is hard with IP
  – Packets follow the IP route table/FIB
  – Forwarding is hop-by-hop
  – Even if a node computes a backup link for a failure, it has no control over what happens after the next hop
    • There may be routing loops

IP protection
• Backup next-hop
  – Each node computes a backup next-hop for each destination
    • Chosen so that there will be no routing loops
  – It may not have 100% coverage
• More general solutions need tunneling
  – Must force packets to reach their destination
  – Without crossing the failed resource
    • Tunnel to the node after the failed link
    • Tunnel to an intermediate node
  – IP tunneling is an expensive operation
    • It is packet encapsulation

Not-Via addresses
• Consider router A, with interfaces A1, A2, A3
  – A1 connects to interface B1 of router B
  – A2 connects to interface C2 of router C
  – B1 has a second address, B1-not-via-A
  – All
routers compute paths to B1-not-via-A by removing router A from the topology and running SPF
  – When router A fails, if C wants to reach B it sends packets to address B1-not-via-A
    • It encapsulates the packets
• 100% coverage
• Can handle both node and link failures
• Still needs encapsulation

Multi-topology protection
• A newer approach
• Maintain multiple subsets of the topology
  – IGP protocols already support multi-topology routing
  – Switch to a different topology when there is a failure
    • By modifying the header of the packet
    • Or even by using an MPLS label
• Allows more flexible routing of traffic after a failure

Using MPLS
• MPLS can conveniently direct traffic wherever we want
• Ideal for setting up backup resources
  – Mostly backup paths
• Can be used to repair both IP and MPLS failures (i.e., LSP failures)
• LSP protection can be
  – Path
  – Local

Path protection
• For each (primary) LSP, have a backup LSP
  – It is already established (with RSVP) but is not carrying any traffic
• The primary and backup LSPs should be link- and node-disjoint
• When there is a failure, the source of the LSP starts sending traffic over the backup
• The source needs to be notified of the failure
  – So repairing the traffic may take some time
• Can work in both 1:1 and 1:N modes

Local protection
• When a link or node fails, the node upstream of the failure repairs the traffic
  – Traffic is put into a backup LSP that does not go over the failed resource
  – The backup LSP merges with the primary LSP
• The repairing router does not send a PathErr upstream
  – Instead it notifies upstream nodes that it is repairing the failure
• It is very fast
• Can work in 1:1 and 1:N modes
• Can be
  – Node
    • Bypass a failed node
  – Link
    • Bypass a failed link

Link local protection
• The node upstream of the failed link initiates the protection
  – Point of local repair (PLR)
• The backup LSP merges back into the primary one
  – At the next hop (NHop) of the PLR
• Can work in 1:1 and 1:N modes
  – Usually a single backup LSP protects multiple primary LSPs;
    otherwise scalability is poor

Node local protection
• When a node fails, assume its links have failed too
• The node upstream of the failed node initiates the protection
  – Point of local repair (PLR)
• The backup LSP merges back into the primary one
  – At the next-next-hop (NNHop) of the PLR
• What label does the NNHop use for the primary LSP?
  – We need RSVP’s help to find out
• Multiple backup LSPs are needed for each node
  – At least one for each NNHop
  – More can optionally be configured

Label stacking
• Each time traffic is sent into an LSP, a label is pushed onto the packets
• Packets in the primary LSP already carry a label
  – This creates a label stack
  – The top label is popped by the router just before the merge point
• A catch
  – At the merge point, the packet arrives from an interface different from the expected one
  – So a global (platform) label space is required

Need some RSVP support
• If the LSP is protected, do not send errors upstream/downstream when there is a failure
  – Instead, notify upstream nodes that repair is in progress
• During the failure, the PATH and RESV messages for the primary LSP must continue
  – Send them through the backup LSP
• For node protection we need to know the label the NNHop is using for the primary
  – Use the record-label option for the LSP
  – All the labels used at all the hops are recorded in the RESV message

LSP protecting IP
• The above techniques can also be used to protect IP traffic
• If a link fails, all the traffic that would go through the link is sent over the backup LSP
• Similar for node failures
  – But in this case, do we know the NNHop for IP?
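The “remove the presumed-failed node from the topology and run SPF” computation that underlies both not-via addresses and node-protection backup paths can be sketched with a plain Dijkstra. This is a minimal sketch: the topology, link weights, and router names are invented for illustration.

```python
import heapq

def shortest_path(graph, src, dst, excluded=None):
    """Dijkstra SPF from src to dst, skipping any node in `excluded`.

    Removing the protected node before running SPF is the computation
    used for not-via addresses and node-protection backup LSPs: the
    resulting path is guaranteed to avoid the failed resource.
    """
    excluded = excluded or set()
    dist = {src: 0}
    prev = {}
    pq = [(0, src)]
    done = set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        for v, w in graph.get(u, {}).items():
            if v in excluded or v in done:
                continue
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    if dst not in done:
        return None  # no backup path exists: this failure is not repairable
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return list(reversed(path))

# Hypothetical 5-router topology; the primary LSP follows PLR -> A -> NNHop.
graph = {
    "PLR":   {"A": 1, "C": 2},
    "A":     {"PLR": 1, "NNHop": 1},
    "C":     {"PLR": 2, "D": 2},
    "D":     {"C": 2, "NNHop": 2},
    "NNHop": {"A": 1, "D": 2},
}

# Node protection: the PLR pre-computes a backup LSP path to the NNHop
# that avoids the protected node A entirely.
backup = shortest_path(graph, "PLR", "NNHop", excluded={"A"})
print(backup)  # -> ['PLR', 'C', 'D', 'NNHop']
```

The same routine with a failed *link* removed instead of a node gives the link-protection (NHop) backup path.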
• In general, if MPLS is deployed in the network, all traffic will be inside MPLS tunnels anyway

Observations
• If the node degree is d and there are N nodes, then
  – At least O(Nd) tunnels are needed for link protection
  – And at least O(Nd^2) for node protection
• Of course, failures of the ingress or egress node cannot be protected against
• The assumption is that failures will be short-lived
  – Traffic may be unbalanced during the failure
  – Links can get overloaded

The resource allocation problem
• How to set up the backup tunnels so that
  – No link is overloaded after a failure
  – The amount of extra bandwidth reserved for the backups is minimized
• It is a form of traffic engineering (TE)
  – We will see more on TE later on
• Has been studied a lot
  – In optical and telephone networks
  – And recently in MPLS-type networks
• Solutions can be
  – On-line (as the requests arrive)
  – Off-line

Example
• Kodialam and Lakshman, 2001
  – Local link and node protection
  – Assume the bandwidth demands of all LSPs are known
  – Assume that only one link or node can fail at a time
• Find a set of backup paths that minimizes the amount of bandwidth for both primary and backup LSPs
  – Backup LSPs can share bandwidth on some links
• What do we know about the links?
  – How much bandwidth is used by each LSP
    • Complete information, but expensive to maintain
  – How much bandwidth is available
    • Almost zero information
  – How much bandwidth is used by backup LSPs
    • A little better than zero
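The bandwidth-sharing idea behind the Kodialam–Lakshman example can be sketched as follows. Under the single-failure assumption, two backup LSPs that protect different resources can never be active simultaneously, so on any shared link the reservation is the maximum over failure scenarios rather than the sum of all backup demands. This is a minimal sketch; the topology and demand figures are invented.

```python
from collections import defaultdict

def backup_reservation(backups):
    """Per-link bandwidth that must be reserved for backup LSPs.

    backups: list of (protected_resource, bandwidth, links) tuples,
             where `links` is the set of links the backup LSP traverses.
    Returns {link: reserved_bandwidth}.
    """
    # demand[link][failed] = backup bandwidth activated on `link`
    # when `failed` is the resource that fails
    demand = defaultdict(lambda: defaultdict(float))
    for failed, bw, links in backups:
        for link in links:
            demand[link][failed] += bw
    # Single-failure assumption: only one scenario is active at a time,
    # so reserve the worst-case (max), not the sum.
    return {link: max(per_failure.values())
            for link, per_failure in demand.items()}

# Hypothetical example: two backup LSPs protecting different links both
# traverse link ("C", "D") and can share the reservation there.
backups = [
    ("link A-B", 10.0, [("A", "C"), ("C", "D"), ("D", "B")]),
    ("link E-F",  7.0, [("E", "C"), ("C", "D"), ("D", "F")]),
]
print(backup_reservation(backups)[("C", "D")])  # 10.0, not 17.0
```

With sum-based reservation the shared link would need 17.0 units; sharing across failure scenarios is exactly where the backup-capacity savings come from.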