Download IETF55 presentation on OSPF congestion control 11/21/02

Congestion Avoidance & Control for OSPF Networks
(draft-ash-manral-ospf-congestion-control-00.txt)

Jerry Ash, AT&T
Anurag Maunder, Sanera Systems
Gagan Choudhury, AT&T
Vera Sapozhnikova, AT&T
Vishwas Manral, NetPlane Systems
Mostafa Hashem Sherif, AT&T

Outline (draft-ash-manral-ospf-congestion-control-00.txt)

• problem:
  • concerns over scalability of IGP link-state protocols (e.g., OSPF)
  • much evidence that LS protocols cannot recover from large failures & widespread loss of topology database information
    – failure experience
    – vendor analysis
    – simulation & modeling
• propose protocol mechanisms to address problem
  • throttle LSA updates/retransmissions
    – detect & notify congestion state
    – neighbor nodes throttle LSA updates/retransmissions
  • keep adjacencies up
  • database backup & resynchronization
• proprietary implementations of mechanisms have improved scalability/stability
    – need standard features for uniform implementation & interoperability
• issues discussed on list

Background & Motivation

• Failure experience
  • LS routing protocols cannot recover from large ‘flooding storms’
    – triggered by wide range of causes: network failures, bugs, operational errors, etc.
    – flooding storm overwhelms processors, causes database asynchrony & incorrect shortest path calculation, etc.
  • AT&T has experienced several very large LS protocol failures (4/13/1998, 7/2000, 2/20/2001, described in I-D)
• vendor analysis of LS protocol recovery from total network failure (loss of all database information in the specified scenario, 400 nodes, etc.)
  • recovery time estimates up to 5.5 hours
  • expectation is that vendor equipment recovery not adequate under large failure scenario
• network-wide event simulation model [choudhury]
  • medium to large flooding storms cause network to recover with difficulty and/or not recover at all
  • model validated -- results match actual network experience

Failure Experience
AT&T Frame Relay Network, 4/13/98

• cause & effect
  • administrative error coupled with a software bug
  • result was the loss of all topology database information
  • the link-state protocol then attempted to recover the database with the usual Hello & topology state updates (TSUs)
  • huge overload of control messages kept network down for very long time
• several problems occurred to prevent the network from recovering properly (based on root-cause analysis)
  • very large number of TSUs being sent to every node to process, causing general processor overload
  • route computation based on incomplete topology recovery; routes generated based on transient, asynchronous topology information & then in need of frequent re-computation
  • inadequate work queue management to allow processes to complete before more work is put into the process queue
  • inability to access node processors with network management commands due to lack of necessary priority of these messages
• worked with vendor to make protocol fixes to address problems
  • along the lines suggested in the I-D

Proposed Protocol Mechanisms
Throttle LSA Updates/Retransmissions

• detect node-congestion by
  • length of internal work queues
  • high processor occupancy & long CPU busy times
• notify congestion state to other nodes
  • use TBD packet to convey congestion signal
• when a node detects congestion from a neighbor
  • progressively decrease flooding rate, e.g.
    • double LSA_RETRANSMIT_INTERVAL for low congestion
    • quadruple LSA_RETRANSMIT_INTERVAL for high congestion
• simulation analysis shows proposed mechanisms perform effectively (Choudhury)
  • deals better with non-linear failure modes than statistical detection/notification methods

Issues Discussed on List

• is there a problem? (need to prevent catastrophic network collapse)
  • most seem to agree there is a problem
  • several have observed ‘LSA storms’ & their ill effects
    – storms triggered by hardware failure, software bug, faulty operational practice, etc., many different events
    – sometimes network cannot recover
    – unacceptable to operators
  • vendors invited to analyze failure scenario given in draft
    – no response yet
• how to solve problem
  • better/smart implementation/coding of protocol within current specification
    – e.g., ‘never losing an adjacency solves problem’
    – these are proprietary, single-vendor, implementation extensions
  • standard protocol extensions
    – for uniform implementation
    – for multi-vendor interoperability
    – already demonstrated with proprietary, single-vendor implementations

Issues Discussed on List

• what protocol extensions?
  • not just ‘signaling congestion message on the wire’ but also response
    – need uniform response to congestion signal ‘slow down by this much’ to be effective
    – rather than ‘implementation dependent’ response
    – like helper router response to ‘grace LSA’ from congested router in hitless restart
• how to evaluate effectiveness of proposals
  • expert analysis based on experience
  • simulation
    – a couple of ‘academic’ & ‘shaky simulation’ comments
    – validated simulations used widely
      • for network design of routing features, network management features, congestion control, etc.
      • for many years
      • many large-scale network design examples (e.g., ‘Dynamic Routing in Telecommunications Networks’, McGraw Hill)
  • ‘white-box’ approach
    – implement & test in the lab
  • expert analysis, simulation, white-box all useful

Issues Discussed at IETF-55 Routing Area Meeting & MPLS WG Meeting

• box builders’ view:
  • ‘stop intruding into our box’
  • design choices should be made by box builders
  • nothing wrong with current way of building boxes
• box users’ view:
  • still observe major failures
    – most agree there is a problem (from list discussion)
  • box-builder/vendor analysis shows unacceptable failure response (in draft)
    – box-builders/vendors invited to analyze scenario in draft
  • box-builders’ approach doesn’t work to prevent failures
  • boxes need a few, critical, standard protocol mechanisms to address problem
  • have gotten vendors to make proprietary changes to fix problem
  • require standard protocol extensions
    – for uniform implementation
    – for multi-vendor interoperability
• user requirements need to drive solution to problem

Conclusions

• problem:
  • concerns over scalability of IGP link-state protocols
  • evidence that LS routing protocols (e.g., OSPF) currently cannot recover from large failures & widespread loss of topology database information
  • problem is flooding, database asynchrony, shortest path calculation, etc.
  • evidence based on failure experience, vendor analysis, simulation & modeling
• propose protocol mechanisms to address problem, e.g.
  • throttle LSA updates/retransmissions
    – detect & notify congestion state
    – neighbor nodes throttle LSA updates/retransmissions
• simulation analysis shows effectiveness of proposed changes (Choudhury)
• propose draft as an OSPF WG document
  • refine/evolve proposed protocol extensions

Backup Slides

Proposed Congestion Control Mechanisms

• throttle LSA updates/retransmissions
  • detect & notify congestion state
  • congested node signals other nodes to limit rate of LSA messages sent to it
  • neighbor nodes throttle LSA updates/retransmissions
    – automatically reduce rate under congestion
• keep adjacencies up
• database backup & resynchronization
  • topology database automatically recovered from loss based on local backup mechanisms
  • allows a node to recover gracefully from local faults on the node
• prioritized processing of Hello & LSA Ack messages (Choudhury draft)

Keep Adjacencies Up

• increase adjacency break interval under congestion
  • goal is to avoid breaking adjacencies by increasing wait interval for non-receipt of Hello messages
    – if node detects congestion from a neighbor & if no packet received in NODE_DEAD-INTERVAL
    – wait additional time = ADJACENCY_BREAK_INTERVAL before calling adjacency down
• throttle setups of link adjacencies
  • define MAX_ADJACENCY_BUILD_COUNT = maximum number of adjacencies a node can bring up at one time

Database Backup & Resynchronization

• database backup
  • node should provide a local, primary, nonvolatile memory backup [GR-472-CORE]
  • node should back up all non-self-originated LSAs, routing tables, & states of interfaces
  • database should be backed up at least every 5 minutes
  • restoration of data should be completed within 5 minutes of initiation [GR-472-CORE]
• nodes signal neighbors when ‘safe’ to perform resynchronization procedures
  • based on TBD packet format
• under resynchronization, node
  • should generate all its own LSAs
  • should receive only LSAs that have changed between time it failed & current time
  • should base its routing on current database, derived as above

Database Backup & Resynchronization

• database resynchronization
  • propose changes to receiving/transmitting database summary & LSA request packets
  • when in full state
    – node sends & receives database summary & LSA request packets as if performing database synchronization when peer data structure is in Negotiating, Exchanging, & Loading states
  • node informs neighbor when to use resync procedures
  • node supports resync to neighbor request by receiving/transmitting database summary & LSA request packets

Failure Experience

• other failures which have occurred with similar consequences
  • moderate TSU storm following ATM nodes upgrade, 7/2000
    • network recovered, with difficulty
  • large TSU storm in ATM network, 2/20/2001 [pappalardo1, pappalardo2]
    • manual procedures required to reduce TSU flooding & stabilize network
    • desirable to automate procedures for TSU flooding reduction under overload
  • worked with vendor to make protocol fixes to address problems
    • along the lines suggested in the I-D
  • other relevant LS-network failures have been reported [cholewka, jander]
• conclusions
  • LS vulnerable to loss of database information, control overload to re-sync databases, & other failure/overload scenarios
  • networks more vulnerable in absence of adequate protection mechanisms
  • generic problem of LS protocols
    – across a variety of implementations
    – across FR, ATM, & IP-based technologies

Vendor Analysis

• vendors & service providers asked to analyze LS protocol recovery from total network failure (loss of all database information in the specified scenario)
• network scenario
  • 400 node network
    – 100 backbone nodes
    – 3 edge nodes per backbone node (edge single homed)
  • backbone nodes connected to max of 10 backbone nodes
    – max node adjacency is 13
    – sparse network
  • 101 peer groups
    – 1 backbone peer group with 100 backbone nodes
    – 100 edge peer groups, each with 3 nodes, all homed on the backbone peer group
  • 1,000,000 addresses advertised

Vendor Analysis

• projected recovery times
  • Recovery Time Estimate A – 3.5 hours
  • Recovery Time Estimate B – 5-15 minutes
  • Recovery Time Estimate C – 5.5 hours
• expectation is that vendor equipment recovery not adequate under large failure scenario

Analysis Modeling

• various studies published [atmf00-0249, maunder, choudhury]
• [choudhury] reports network-wide event simulation model
  • study impact of a TSU storm
  • captures
    – node congestion
    – propagation delay between nodes
    – retransmissions if TSU not acknowledged within 5 seconds
    – link declared down if Hello delayed beyond “node-dead interval” (aka “inactivity timer” in PNNI, “router-dead interval” in OSPF)
    – link recovery following database synchronization
  • approximates real network behavior & processing times
  • results show
    – dispersion -- number of control packets generated but not processed in at least one node
    – medium to large TSU storms cause network to recover with difficulty and/or not recover at all
    – results match actual network experience

Impact of TSU Storm on Network Stability

[Figure: dispersion versus time (sec), plotted for storm sizes 300, 600, and 900]
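
Illustrative sketch (not from the draft)

The neighbor-side rules on the "Proposed Protocol Mechanisms" and "Keep Adjacencies Up" slides can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the three-level `Congestion` enum, the function names, and the sample timer values (5, 40, and 60 seconds) are hypothetical and are not defined by the draft, which leaves the congestion-notification packet format TBD. Only the "double for low, quadruple for high" multipliers and the "wait an additional ADJACENCY_BREAK_INTERVAL" rule come from the slides.

```python
from enum import Enum


class Congestion(Enum):
    """Congestion level a neighbor has signaled (levels are an assumption)."""
    NONE = 0
    LOW = 1
    HIGH = 2


# Multipliers from the slide's example: double for low, quadruple for high.
RETRANSMIT_MULTIPLIER = {
    Congestion.NONE: 1,
    Congestion.LOW: 2,
    Congestion.HIGH: 4,
}


def retransmit_interval(base_seconds, neighbor_state):
    """LSA retransmit interval to use toward a neighbor in the given state."""
    return base_seconds * RETRANSMIT_MULTIPLIER[neighbor_state]


def adjacency_timeout(node_dead_interval, adjacency_break_interval, neighbor_state):
    """Seconds of silence tolerated before declaring the adjacency down.

    A congested neighbor gets the extra adjacency break interval on top of
    the normal node-dead interval; an uncongested one does not.
    """
    if neighbor_state is Congestion.NONE:
        return node_dead_interval
    return node_dead_interval + adjacency_break_interval


if __name__ == "__main__":
    # Sample values only: 5 s base retransmit, 40 s dead interval, 60 s extra wait.
    for state in Congestion:
        print(state.name,
              retransmit_interval(5, state),
              adjacency_timeout(40, 60, state))
```

The point of keeping both rules as pure functions of the signaled state is the one made on the "Issues Discussed on List" slides: every neighbor that receives the same congestion signal backs off by the same amount, rather than in an implementation-dependent way.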
