Download dimacs-tomography2008

Challenges in Making Tomography Practical Yiyi Huang, Georgia Tech Nick Feamster, Georgia Tech Renata Teixeira, LIP6 Christophe Diot, Thomson Problem • Network operators need to detect and isolate faults quickly, before customers complain • Plenty of existing alarms – SNMP traps – Active probes – Anomaly detection systems • Unfortunately, this set of alarms does not help operators locate and eliminate problems that induce problems on end-to-end paths 2 Network Tomography to the Rescue Monitor Targets y x • Send end-to-end probes through the network • Monitor paths for differences in reachability • Infer location of reachability problem from these differences 3 Some Problems • Scalability vs. speed: Detection must be fast • Ambiguity: Losses are one-way but don’t always have access to both ends of the path • Lack of synchronization: Different monitors see different conditions • Dynamics: Topology can change, loss can be transient 4 Doppler: Making Tomography Practical • Fast, scalable detection – Solution: Monitor selection algorithm to reduce the number of monitors and targets so that “cycle times” are fast • Transient packet loss – Solution: Triggered confirmation of failed paths • One-way losses – Solution: New algorithm based on IP spoofing • Dynamic routing – Solution: Periodic snapshots of the network topology Controlled evaluation on VINI, plus limited wide-area experiments. 5 Fast, Scalable Detection • Select monitors, targets to satisfy two conditions – All interfaces are “covered” (or diagnosable) – The number of monitors is small enough to ensure a short round time • Two goals – Coverage: When a failure occurs, system detects it • Every interface is covered by at least one path – Diagnosability: When a failure occurs, system locates it • Every interface is covered by a unique set of paths 6 Offline Path Selection: Diagnosability • Step 1: Compute the set of paths that cover all interfaces (greedy set cover heuristic) • Step 2: Compute hitting set for each interface • Step 3: Build equivalence classes for interfaces with common hitting set – For each interface in a set with more than one interface, find path that crosses only that interface 7 Detection, Confirmation, Correlation • Periodic (once per 5 minutes) topology snapshot from all monitors to all destinations keeps track of underlying topology before the failure • Detection: Periodic probes (once per “cycle time”) detect failure • Confirmation: When a probe is lost, the monitor sends three additional probes. If all three are lost, path is determined to have failed. • Correlation: Paths that fail within 10 seconds of one another are grouped. 8 Disambiguating One-Way Losses: Spoofing • Monitor sends request to spoofer to send probe • Probe has IP address of the monitor • If reply reaches the monitor, reverse path is working T M Spoofer: Send spoofed packet with source address of M 9 Identification: NetDiagnoser • Binary network tomography algorithm [Dhamdhere et al.] • Input: hosts, destinations, topology before the failure • Output: Set of possible locations for the fault 10 Evaluation of Detection Algorithms • Controlled experiments on the VINI testbed – Emulated copy of Abilene network on wide-area paths – Probing strategy emulates the paths that would be probed in monitor selection algorithm – Compare reduced set of paths to “aggressive” measurement approach • Varied failure location and duration – Duration varied from 5 to 80 seconds – Test repeated for each failed link • Measure detection and false alarm rates • Preliminary experiments using data from real-world networks 11 Detection: Scale and Speed • Compute reduction in the number of paths required to achieve coverage and diagnosability – Reduction from about 27,000 paths to 151 paths • For real-world networks, compute corresponding reduction in cycle time – Reduction from aout 3.5 minutes to < 5 seconds 12 Single-Link Failures • More selective probing identifies more of the shorter link failures (due to shorter cycle time) • Also results in fewer false alarms 13 Single-Node Failures • Similar results to single-link failures – Selective measurements result in faster detection, fewer false alarms 14 Does Failure Confirmation Reduce the Total Number of Alarms? • Confirmation reduces the number of failures by > 35% • Correlation further reduces the number of alarms (by about a factor of 10) 15 How Quickly can Doppler Identify Failures? • Answer: Roughly 20 seconds using the reduced set of paths • Two main components – Detection/Confirmation: Time from when failure was injected to the time Doppler could detect and confirm the failure – Correlation: Time to group failures and construct reachability matrix 16 Detection and Confirmation Delay Most failures are detected within 3-5 seconds 17 Correlation Delay Reducing the number of paths to probe significantly reduces total correlation time 18 Summary • Making tomography practical is challenging – – – – Asynchronous measurements Scale and speed Changing topologies Ambiguity about forward and reverse paths • Doppler: Set of techniques to address many of these problems • Current analysis is still performed offline – Many additional challenges remain to coordinate online measurements 19

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download dimacs-tomography2008