Toward Interactive Debugging for ISP Networks
Chia-Chi Lin†, Matthew Caesar†, Jacobus Van der Merwe§
†University of Illinois at Urbana-Champaign  §AT&T Labs – Research

Debugging in ISP Networks
• Internet: the most complex distributed system ever created
  – Leads to complex failure modes
  – Bugs, vulnerabilities, compromise, misconfigurations
• Major challenges in debugging ISP networks
  – Lack of visibility
  – High rates of change in protocols
  – Complex interdependencies
• These can have devastating effects
  – Long-term outages, slow repair
  – February 2009 BGP outage

Interactive Debugging is Necessary
• Problems exist with fully automated techniques
  – Focus on detection rather than diagnosis
  – Modeling can be inexact
  – Logical and semantic errors seem to require human knowledge to solve
• Our position:
  – Humans must be “in-the-loop”
  – Tools are required to facilitate the process

A Scenario
[Figure: an ISP network serving a customer; when the outage occurs, the operator pauses the network and debugs in a cloned network]

Our Vision
• Isolation of the operational network
  – Prevent the diagnostic procedure from interfering with live network operation
  – Solution: virtualization technologies
• Reproducibility of network execution
  – Enable the operator to replay an execution and narrow in on rare events
  – Solution: instill a pseudorandom ordering over events and messages
• Interactive stepping through execution
  – The operator can slowly step through operation and trace messages
  – Solution: protocols providing tight control over distributed execution

The Architecture
[Figure: applications (Application 1: e.g., BGP; Application 2: e.g., OSPF) run in virtual service nodes on virtual service platforms atop the physical network infrastructure; a coordinator on a debugging node connects the user (human troubleshooter) to a virtual service coordinator]

Key Challenge: Reproducibility
• Reproducibility simplifies interactive debugging
  – Can run multiple times, varying inputs to narrow down the cause
  – When a rare bug occurs, no need to wait for it to reoccur
• One option: generate comprehensive logs of all events
  – e.g., log all packet sends/receives, all data
  – Problem: not scalable to large networked software
• Our approach: eliminate randomness in execution
  – Starting with the same initial state will produce the same execution
  – Make execution “pseudorandom” to explore different execution paths
  – Key challenge: how to eliminate randomness in large-scale software execution?

An Algorithm for Distributed, Reproducible Execution
• Approach:
  – Encapsulate the software in a virtual environment
  – Intercept the software’s inputs/outputs and instill an ordering over them
  – Make sure that the ordering is the same every time the software is run
• How this is done (see the first sketch below):
  – The network is run in lockstep fashion
  – On every cycle, messages from neighbors are buffered
  – Before delivery to the application, a pseudorandom ordering is instilled by a consistent hash of each packet’s contents
  – The human sends “step” commands to move to the next lockstep cycle

Improving Performance for the Production Network
• Problem: running applications in lockstep fashion slows operation
  – Might be okay for some protocols (e.g., BGP)
  – Probably not okay for others (e.g., OSPF)
• Solution: “optimistic” execution of events (see the second sketch below)
  – Choose in advance a pseudorandom ordering that is likely to happen anyway
  – Don’t buffer packets; deliver them immediately
  – If we guess wrong, roll back the application to an earlier state

Example: Running the Lockstep Algorithm in a Cloned Network
[Figure: four-step walkthrough on nodes S, L, and K; in the processing phase each node’s application processes the packets in its receiving buffer (“I finished processing,” “I am ready to transmit”), and in the transmission phase each node flushes its sending buffer into its neighbors’ receiving buffers]
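To make the lockstep cycle concrete, here is a minimal Python sketch of the two-phase cycle above. It is not the authors' implementation: the names `LockstepNode` and `Coordinator`, the `app.handle(payload)` interface, and the broadcast-to-all-neighbors transmission are illustrative assumptions; only the per-cycle buffering, the consistent-hash ordering, and the human-driven step are taken from the slides.

```python
import hashlib

def packet_rank(payload: bytes) -> int:
    # Consistent hash of the packet's contents: the same payload gets
    # the same rank on every run, making delivery order reproducible.
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")

class LockstepNode:
    """One virtual service node (hypothetical wrapper): `app` is the
    encapsulated application, assumed to expose handle(payload) and
    return the list of payloads it wants to send."""
    def __init__(self, app):
        self.app = app
        self.recv_buffer = []   # packets buffered during this cycle
        self.send_buffer = []   # packets produced during this cycle

    def receive(self, payload: bytes):
        # Buffer instead of delivering to the application immediately.
        self.recv_buffer.append(payload)

    def process(self):
        # Deliver buffered packets in pseudorandom (hash) order.
        for payload in sorted(self.recv_buffer, key=packet_rank):
            self.send_buffer.extend(self.app.handle(payload))
        self.recv_buffer.clear()

class Coordinator:
    """Advances the whole cloned network by one lockstep cycle each
    time the human troubleshooter issues a 'step' command."""
    def __init__(self, nodes, links):
        self.nodes = nodes      # node_id -> LockstepNode
        self.links = links      # node_id -> list of neighbor node_ids

    def step(self):
        # Processing phase: every node consumes its receiving buffer.
        for node in self.nodes.values():
            node.process()
        # Transmission phase: flush sending buffers to neighbors
        # (simplification: every packet goes to all neighbors).
        for node_id, node in self.nodes.items():
            outgoing, node.send_buffer = node.send_buffer, []
            for neighbor in self.links[node_id]:
                for payload in outgoing:
                    self.nodes[neighbor].receive(payload)
```

Because `packet_rank` depends only on packet contents, two runs from the same checkpoint deliver packets in the same order, which is exactly the reproducibility property the lockstep algorithm needs.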
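The “optimistic” execution for the production network can be sketched in the same style. This is again an assumption-laden illustration, not the paper's algorithm verbatim: the copyable `state` attribute, the deep-copy checkpoints, and the per-packet log are hypothetical simplifications. Packets are delivered immediately; when a packet arrives whose rank precedes one already delivered, the node rolls back and replays the affected packets in the correct order.

```python
import copy
import hashlib

def packet_rank(payload: bytes) -> int:
    # Same consistent hash as in the lockstep sketch.
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")

class OptimisticNode:
    """Live-algorithm sketch: no lockstep buffering.  Assumes the
    wrapped app exposes handle(payload) and a copyable `state`."""
    def __init__(self, app):
        self.app = app
        # Log of (rank, payload, app state *before* delivery).
        self.log = []

    def deliver(self, payload: bytes):
        rank = packet_rank(payload)
        replay = [payload]
        # Undo already-delivered packets that should come after this one.
        while self.log and self.log[-1][0] > rank:
            _, undone_payload, snapshot = self.log.pop()
            self.app.state = snapshot        # roll back to earlier state
            replay.append(undone_payload)
        # Re-deliver the undone packets plus the new one, in rank order.
        for p in sorted(replay, key=packet_rank):
            self.log.append((packet_rank(p), p, copy.deepcopy(self.app.state)))
            self.app.handle(p)
```

In the common case the guessed ordering is right and the only cost is the checkpoint; the traffic and storage results later in the deck measure the cost of rollbacks and of keeping this rollback state.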
Example: Live Algorithm in a Production Network
• The live algorithm does two things:
  – Determine the ordering of events
  – Roll back events violating the ordering
[Figure: an Abilene-like backbone (Seattle, Los Angeles, Salt Lake City, Kansas City, Chicago, New York, Washington, Atlanta, Houston); packets from Seattle should come before those from Los Angeles, so when they arrive late the pseudorandom ordering is violated and the offending events are rolled back]

Connecting the Two Algorithms
• We can run the production network using the live algorithm
  – Achieves a fixed ordering over messages
  – But how do we actually debug it?
• Solution: replay using the lockstep algorithm
  – First let the production network run, checkpointing the starting state
  – To debug, start the lockstep algorithm from the same starting state
  – The lockstep algorithm will traverse the same execution
• Can replay multiple times, narrow in on the problem, experiment by changing inputs, etc.

Simulation Settings
• Protocol evaluated: OSPF
• Topologies used: BRITE, Internet2 backbone
• Link delay model: 1 ms plus an exponentially distributed random delay in (0, 0.5]
• Events simulated: Abilene IS-IS traces over the month of January 2009 (giving 209 events)
• Measure the performance overheads of our approach

Results – Overhead in Production Networks
• The live algorithm suffers from rollbacks, incurring a 4x inflation in traffic overhead
• Using the delay-estimation optimization reduces overhead to a 0.02x traffic inflation

Results – Response Time in Cloned Networks
• Low response time is beneficial to interactive debugging
• Response time is low for a variety of network sizes

Conclusion
• Humans are required to be “in-the-loop” to diagnose problems
• Our architecture is a first step toward interactive debugging
  – Builds on known techniques, e.g., virtualization technologies and distributed semaphores
  – Develops techniques to reproduce distributed executions
• Simulations on real-world events show that the scheme incurs low overheads

The State of the Art: Automated Techniques
• Logging observations – X-Trace, Friday, etc.
• Model checking – rcc, OD flow, etc.
• Debugging standalone programs – Coverity, AVIO, etc.

Optimized Ordering in the Production Network
• Goal: avoid rollbacks by selecting an ordering likely to happen anyway (see the sketch at the end of this deck)
  – Events separated by a long period fall into different groups, which makes ordering easy
  – Problem: some failure events are correlated
    • e.g., multiple overlay links sharing the same physical link
  – How do we order events in the same group?
• Solution: if we know link delays, we can reliably estimate the expected arrival of events
  – In practice we don’t know exact link delays
  – But we can estimate them
  – Estimation can be improved by giving protocol messages high priority

Results – Storage in the Production Network
• The state required for rolling back packets is small and increases slowly with network size
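Finally, a minimal sketch of the delay-estimation heuristic from the “Optimized Ordering” slide. All names here (`Event`, `expected_order`) and the example delay values are hypothetical: the idea taken from the slide is simply that events in the same group are ordered by their predicted arrival time, computed from estimated link delays, so the ordering chosen in advance is the one most likely to happen anyway.

```python
from collections import namedtuple

Event = namedtuple("Event", ["event_id", "src", "send_time"])

def expected_order(events, est_delay):
    """Order events by predicted arrival time: send time plus the
    estimated one-way delay from the source.  Ties are broken by
    event id so the ordering is deterministic across runs."""
    return sorted(events, key=lambda e: (e.send_time + est_delay[e.src],
                                         e.event_id))

# Example: two correlated failures in the same group.  Although the
# Los Angeles event was sent first, Seattle's shorter estimated delay
# predicts it arrives first, so ordering it first avoids a rollback
# when the real packets arrive.
events = [Event("lsa-LA", "LosAngeles", send_time=0.0),
          Event("lsa-SEA", "Seattle", send_time=0.1)]
est_delay = {"LosAngeles": 1.4, "Seattle": 0.9}   # assumed estimates (ms)
order = expected_order(events, est_delay)
print([e.event_id for e in order])                # ['lsa-SEA', 'lsa-LA']
```

The better the delay estimates, the more often the pre-chosen ordering matches the real arrival order, which is what drives the drop from 4x to 0.02x traffic inflation reported in the results.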