Why are we scared of SPF? IGP Scaling and Stability
Dave Katz

Overview
* History
* Components of IGP Convergence
* Conclusions

History
* 1990: Stability, Scalability, Speed, Correctness - Choose one
* First few years spent just getting implementations to work
* Naïve implementations had enough trouble accomplishing correctness without being complicated by reality
* Prototype-quality software shipped; things tended to fall apart in really ugly ways when pushed hard

Copyright © 2002, Juniper Networks, Inc.

History
* 1994: Stability, Scalability, Speed, Correctness - Choose two
* Convergence speed became marketing bullet, InterOp booth fodder
* Cute trick for demos, but the world wasn’t clamoring for it
* Fast convergence == network back up before someone can call the NOC
* Efforts to speed convergence tended to cause instability

History
* 1995: Stability, Scalability, Speed, Correctness - Choose 2.5
* Networks started getting larger; the era of large ISPs began
* Stability and scalability were really important, lest you end up in the newspaper (“AOL down for 19 hours,” other less famous catastrophes)
* Simplistic software/hardware architectures were inherently unstable
* Big guard rails used to stay away from the instability cliff
* Speed was sacrificed (chunky timers)

The Modern Era
* Pressure is mounting to get fast again
* Real applications exist that could make use of it (VoIP, etc.)
* Not just a parlor trick any more
* Perception of IP as being “too slow” used to promote other technologies
* We know how to do better now

Components of IGP Convergence
* Detection
* LSA/LSP Generation
* Flooding/Propagation
* SPF Calculation
* Route Recursion
* Route Download

Detection
* Hardware detection is vastly preferable
* Can be debounced, held down, etc., in or close to hardware to reduce churn
* GE and 10GE use in POPs makes this difficult (since you need a way to detect a failed path to a neighbor, not just a failed interface)

Detection
* Software detection (Hellos) ultimately needed
* Fast hellos have been destabilizing in the past due to scheduling latencies (relative to adjacency timeouts)
* Fast hellos are now doable, and are even somewhat scalable (subsecond detection and hundreds of neighbors)
* Intelligent scheduling and/or distributed processing
* If Hello load exceeds 100% of capacity (CPU or protocol I/O bandwidth) things will still fail
* Adjacency maintenance must be immune to heavy CPU load

LSA/LSP Generation
* When something changes, you have to tell the world
* Traditionally, generation delayed to collect multiple changes, then hold down to limit network traffic (on order of seconds)
* More intelligent strategy is to rapidly announce interesting changes, allow several successive changes to be announced quickly before holddown
* Newer LSPs will tend to overtake old ones during flooding on systems under load, if done intelligently

LSA/LSP Generation
* ISIS relatively malleable; some time constants specified but none are “truly normative”
* OSPF requires receivers to drop LSAs updated within five seconds (limiting senders is sufficient)
* Suggestion: drop receiver behavior completely, use adaptive strategy on transmit
* Old receivers will drop rapid updates, but retransmission will operate in similar timeframe (or add a knob)

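The "announce quickly, then hold down" strategy from the LSA/LSP Generation slides can be sketched roughly as follows. This is an illustrative sketch only, not any vendor's implementation: the class name `LspGenerationThrottle`, the `burst` allowance of 3, and the 5000 ms `holddown_ms` are all assumptions chosen for the example, and times are passed in explicitly (integer milliseconds) so the behavior is easy to follow.

```python
class LspGenerationThrottle:
    """Hypothetical transmit-side throttle: a few fast announcements, then holddown."""

    def __init__(self, burst=3, holddown_ms=5000):
        self.burst = burst              # assumed: changes announced immediately
        self.holddown_ms = holddown_ms  # assumed: spacing once the burst is spent
        self.sent_in_burst = 0
        self.last_send = None           # time of the latest (possibly future) send

    def on_change(self, now_ms):
        """Return the time at which a change seen at now_ms is announced."""
        # A send already scheduled in the future absorbs this change as well.
        if self.last_send is not None and self.last_send > now_ms:
            return self.last_send
        # A quiet period of at least one holddown restores the burst allowance.
        if self.last_send is not None and now_ms - self.last_send >= self.holddown_ms:
            self.sent_in_burst = 0
        if self.sent_in_burst < self.burst:
            send_at = now_ms                             # still in the fast phase
        else:
            send_at = self.last_send + self.holddown_ms  # held down
        self.sent_in_burst += 1
        self.last_send = send_at
        return send_at


def schedule(change_times_ms, throttle=None):
    """Map each change time to the time its LSP would be generated."""
    throttle = throttle or LspGenerationThrottle()
    return [throttle.on_change(t) for t in change_times_ms]
```

With the assumed defaults, changes at 0, 100, 200, 300, and 400 ms are announced at 0, 100, 200, 5200, and 5200 ms: the first three go out at once, and the next two are collected into a single held-down update. A real implementation would also need to decide which changes count as "interesting"; that policy is left out here.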
Flooding/Propagation
* Propagation of received LSA/LSPs delayed
* Group LSAs into bigger LSUpd packets in OSPF
* Throttling transmission bounds neighbor load (no flow control)
* Propagation delays directly affect convergence
* The next guy can’t even think of calculating routes until the LSA/LSP arrives
* Background noise (refreshes, flaps) adds to the problem

Flooding/Propagation
* Intelligent scheduling gives “interesting” link-state data flooding priority
* Adaptive retransmission schemes can help when things get tough
* Proper scheduling puts noise “in the noise”

SPF Calculation
* Traditionally viewed with abject terror
* Naïve implementations were slow
* Run-to-completion scheduling led to lost hellos
* Inefficient implementations caused even more overhead (reinstalling all routes in FIB)
* Holddowns and scheduling delays added to work around stability problems
* Delays slow convergence, create routing loops (2-3 times delay value)

SPF Calculation
* In a properly engineered system, SPF should not be destabilizing
* Do adjacency maintenance in a preemptive fashion
* Schedule SPF calculations as background (relative to LSA/LSP processing, flooding, etc.)
* SPF should be able to run back-to-back all day long without threatening stability, and with only marginal impact on overall convergence
* Incremental SPF helps even more, though gains are not significant compared to other things given current networks
* Backoff algorithms arguably unnecessary (especially exponential backoff)

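One way to picture "SPF as background work" is to run Dijkstra's algorithm in small steps and service hellos between steps, rather than running to completion. The sketch below is a toy model under stated assumptions: `spf_steps`, `run_background_spf`, and the `hello_arrivals` map are hypothetical names invented for illustration, the graph is a plain dict of link costs, and cooperative yielding stands in for whatever preemption mechanism a real router would use.

```python
import heapq


def spf_steps(graph, root):
    """Dijkstra as a generator: pauses after each node is finalized."""
    dist = {root: 0}
    done = set()
    heap = [(0, root)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry for an already finalized node
        done.add(node)
        for nbr, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
        yield  # let more urgent work (hellos) run before the next node
    return dist


def run_background_spf(graph, root, hello_arrivals):
    """Run SPF, handling any hellos that arrive before each SPF step.

    hello_arrivals (assumed input) maps an SPF step index to the list of
    neighbors whose hellos become due just before that step.
    """
    log = []
    steps = spf_steps(graph, root)
    step = 0
    while True:
        for nbr in hello_arrivals.get(step, []):
            log.append("hello " + nbr)  # adjacency maintenance goes first
        try:
            next(steps)
        except StopIteration as stop:
            return stop.value, log      # stop.value is the finished dist map
        log.append("spf step")
        step += 1


graph = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1},
    "C": {"A": 5, "B": 1},
}
dist, log = run_background_spf(graph, "A", {1: ["B"], 3: ["C"]})
```

In this run `dist` ends up as `{"A": 0, "B": 1, "C": 2}`, and the log shows hello handling interleaved with SPF steps instead of waiting behind the whole calculation. The point is only that hello latency is bounded by one small step, which is what lets SPF run back-to-back without threatening adjacencies.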
Route Recursion
* A change in IGP next hop may cause a next hop change in many thousands of BGP routes
* By far the richest target in improving convergence
* Traditionally done in software in order to produce a “flat” forwarding table
* Indirect lookup in hardware has minimal forwarding time cost (essentially free if forwarding engine has any free cycles) with huge win in convergence time

Route Download
* Output of route calculations typically must be downloaded to hardware
* Download overhead typically rises with the number of forwarding tables
* Can be very expensive unless recursion is done in hardware
* Some level of distribution (multiple engines) necessary for scaling; fixing recursion problem and careful engineering minimizes cost

Conclusions
* Stability and Scalability have been the primary concerns until recently; this effort was quite successful
* Some of the biggest barriers to overall network convergence have been outside of the IGP implementation per se; examine the behavior of the system as a whole (and the network as a whole)
* As these barriers fall it becomes more interesting to take more heroic measures to improve IGP performance

Conclusions
* 2002: Stability, Scalability, Speed, Correctness - Choose 3.5
* Careful engineering should be able to provide speed, scalability, and stability
* The only effect of a heavily loaded system should be a gradual slowing in convergence (not to crash and burn)
* IGPs are not inherently unstable, at least until it is no longer possible to support all of the adjacencies (and even then it should be possible to gnaw off limbs)

Conclusions
* Adding knobs is not the answer
* Nobody really knows how to set them
* Most settings are wrong
* Either make the parameters adaptive, or make them non-critical
* Keep adaptivity simple and bounded; behavior is chaotic enough as it is

http://www.juniper.net

Copyright © 2002, Juniper Networks, Inc. All rights reserved. Juniper Networks is registered in the U.S. Patent and Trademark Office and in other countries as a trademark of Juniper Networks, Inc. G10, Internet Processor, Internet Processor II, JUNOS, JUNOScript, M5, M10, M20, M40, M40e, and M160 are trademarks of Juniper Networks, Inc. All other trademarks, service marks, registered trademarks, or registered service marks are the property of their respective owners. All specifications are subject to change without notice. Juniper Networks assumes no responsibility for any inaccuracies in this presentation. Juniper Networks reserves the right to change, modify, transfer, or otherwise revise this information without notice.