Internet measurements: fault detection, identification, and topology discovery
Renata Teixeira
Laboratoire LIP6, CNRS and UPMC Paris Universitas

Internet monitoring is essential
• For network operators
  – Monitor service-level agreements
  – Fault diagnosis
  – Diagnose anomalous behavior
• For users or content/application providers
  – Verify network performance
  – Verify network neutrality

Network operators can’t know the user’s experience
[Figure: AS1, AS2, AS3, AS4]
• Network operators only have data of one AS
  – AS4 doesn’t detect any problem
  – AS3 doesn’t know who is affected by the failure

End users can’t know what happens in the network
[Figure: AS1, AS2, AS3, AS4]
• End-hosts can only monitor end-to-end paths

Network tomography to rescue
Inference of unknown network properties from measurable ones
• Network operators
  – Monitor network paths
  – From monitoring hosts
    • In network
    • Third-party monitoring services
  – From home gateways
• End users
  – Cooperative monitoring
  – Among end users
  – From users to popular services
http://cmon.grenouille.com
http://www.nanodatacenters.eu

Fault diagnosis using end-to-end measurements
• Faults are persistent reachability problems
• Detection: continuous path monitoring
• Identification: binary tomography

Outline
• Background
• Fault detection
  – Active vs. passive measurements
  – Reducing overhead of active measurements
  – Disambiguating one-way failures
• Fault identification using binary tomography
  – Correlated path reachability
  – Topology discovery
• Open issues in network tomography

Network tomography to infer link performance
• What are the properties of network links?
  – Loss rate
  – Delay
  – Bandwidth
  – Connectivity
[Figure: nodes A to F across AS 1 and AS 2]
• Given end-to-end measurements
  – No access to routers

The origins
• MINC: Multicast-based Inference of Network-internal Characteristics
• Key idea: multicast probes
  – Exploit correlation in traces to estimate link properties
[Figure: probe sender and probe collectors]
[MINC project, 1999]

Inferring link loss rates
• Assumptions
  – Known, logical-tree topology
  – Losses are independent
  – Multicast probes
• Method
  – Maximum likelihood estimates for αk
[Figure: tree from m to targets t1 and t2; link success probabilities α1, α2, α3 and their estimates α̂1, α̂2, α̂3; per-probe outcomes (1 or 0) recorded at t1 and t2]
[Adams, 2000]

Binary tomography
• Labels links as good or bad
  – Loss-rate estimation requires tight correlation
  – Instead, separate good/bad performance
  – If link is bad, all paths that cross the link are bad
[Figure: tree from m to t1 and t2 with links α1, α2, α3 labeled good or bad]
[Duffield, 2006]

Single-source tree
• “Smallest Consistent Failure Set” algorithm
  – Assumes a single-source tree and known topology
  – Find the smallest set of links that explains bad paths
    • Given bad links are uncommon
    • Bad link is the root of maximal bad subtree
[Figure: tree from m to t1 and t2 with links labeled good or bad]
[Duffield, 2006]

Fault identification with binary tomography
• Fault monitoring needs multiple sources and targets
• Problem becomes NP-hard
  – Minimum hitting set problem
• Iterative greedy heuristic
  – Given the set of links in bad paths
  – Iteratively choose link that explains the max number of bad paths
Hitting set of link = paths that traverse the link
[Figure: monitors m1, m2 and targets t1, t2]
[Kompella, 2007] [Dhamdhere, 2007]

Practical issues
• Topology is often unknown
  – Need to measure accurate topology
• Multicast not available
  – Need to extract correlation from unicast probes
  – Even using probes from different monitors
• Control of targets is not always practical
  – Need one-way performance from round-trip probes
• Links can fail for some paths, but not all
  – Need to extend tomography algorithms
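
The iterative greedy heuristic for multi-source binary tomography can be illustrated with a minimal sketch. It is not the implementation from [Kompella, 2007] or [Dhamdhere, 2007]; the function name `greedy_bad_links`, the link names, and the four-path topology are assumptions made for illustration. It applies the binary assumption stated on the slides (a link on a good path is good) and then repeatedly picks the link whose hitting set covers the most unexplained bad paths.

```python
from typing import Dict, Hashable, List, Set


def greedy_bad_links(paths: Dict[Hashable, Set[str]],
                     bad_paths: Set[Hashable]) -> List[str]:
    """Greedy hitting set: pick links until every bad path is explained."""
    # Binary assumption: a link that appears on a good path is good.
    good_links: Set[str] = set()
    for path, links in paths.items():
        if path not in bad_paths:
            good_links |= links

    # Candidate links are those seen only on bad paths.
    candidates: Set[str] = set()
    for path in bad_paths:
        candidates |= paths[path] - good_links

    unexplained = set(bad_paths)
    chosen: List[str] = []
    while unexplained and candidates:
        # Hitting set of a link = unexplained bad paths that traverse it.
        best = max(sorted(candidates),
                   key=lambda link: sum(link in paths[p] for p in unexplained))
        hit = {p for p in unexplained if best in paths[p]}
        if not hit:
            break  # leftover bad paths contradict the binary assumption
        chosen.append(best)
        unexplained -= hit
        candidates.discard(best)
    return chosen


# Hypothetical topology: two monitors and two targets joined at router r.
paths = {
    ("m1", "t1"): {"m1-r", "r-t1"},
    ("m1", "t2"): {"m1-r", "r-t2"},
    ("m2", "t1"): {"m2-r", "r-t1"},
    ("m2", "t2"): {"m2-r", "r-t2"},
}
bad = {("m1", "t1"), ("m2", "t1")}
print(greedy_bad_links(paths, bad))  # ['r-t1']
```

Sorting the candidates before taking the maximum only makes tie-breaking deterministic; the heuristic itself does not prescribe an order.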
Outline: Fault detection with no control of targets – Active vs. passive measurements

Detection techniques
• Active probing: ping
  – Send probe, collect response
  – From any end host
    • Works for network operators and end users
• Passive analysis of user’s traffic
  – Tap incoming and outgoing traffic
    • At user’s machines or servers: tcpdump, pcap
    • Inside the network: DAG card
  – Monitor status of TCP connections

Detection with ping
[Figure: m sends a probe (ICMP echo request) to t; t sends a reply (ICMP echo reply)]
• If receives reply
  – Then, path is good
• If no reply before timeout
  – Then, path is bad

Persistent failure or measurement noise?
• Many reasons to lose probe or reply
  – Timeout may be too short
  – Rate limiting at routers
  – Some end-hosts don’t respond to ICMP request
  – Transient congestion
  – Routing change
• Need to confirm that failure is persistent
  – Otherwise, may trigger false alarms

Failure confirmation
• Upon detection of a failure, trigger extra probes
• Goal: minimize detection errors
  – Sending more probes
  – Waiting longer between probes
• Tradeoff: detection error and detection time
[Figure: timeline of packets on a path with a loss burst; detection error]
[Cunha, 2009]

Passive detection at end hosts
• tcpdump/pcap captures packets
• Track status of each TCP connection
  – RTTs, timeouts, retransmissions
• Multiple timeouts indicate path is bad
  – If current seq. number > last seq. number seen
    • Path is good
  – If current seq. number = last seq. number seen
    • Timeout has occurred
    • After four timeouts, declare path as bad
[Zhang, 2004]

Passive detection inside the network is hard
• Traffic volume is too high
  – Need special hardware
    • DAG cards can capture packets at high speeds
  – May lose packets
• Tracking TCP connections is hard
  – May not capture both sides of a connection
  – Large processing and memory overhead

Passive vs. active detection
Passive
  + No need to inject traffic
  + Detects all failures that affect user’s traffic
  + Responses from targets that don’t respond to ping
  ‒ Not always possible to tap user’s traffic
  ‒ Only detects failures in paths with traffic
Active
  + No need to tap user’s traffic
  + Detects failures in any desired path
  ‒ Probing overhead
    – Cover a large number of paths
    – Detect failures fast

Outline: Fault detection with no control of targets – Reducing overhead of active measurements

Active monitoring: reducing probing overhead
[Figure: monitors M1, M2; target network with A, B, C, D; target hosts T1, T2, T3]
Goal: detect failures of any of the interfaces in the target network with minimum probing overhead

The coverage solution
[Figure: monitors M1, M2; target network with A, B, C, D; target hosts T1, T2, T3]
• Instead of probing all paths, select the minimum set of paths that covers all interfaces in target network
• Coverage problem is NP-hard
  – Solution: greedy set-cover heuristic
[Nguyen, 2004] [Bejerano, 2003]

Coverage solution doesn’t detect all types of failures
• Detects fail-stop failures
  – Failures that affect all packets that traverse the faulty interface
    • E.g., interface or router crashes, fiber cuts, bugs
• But not path-specific failures
  – Failures that affect only a subset of paths that cross the faulty interface
    • E.g., router misconfigurations
[Nguyen, 2009]

New formulation of failure detection problem
• Select the frequency to probe each path
  – Lower frequency per-path probing can achieve a high frequency probing of each interface
[Figure: monitors M1, M2 probing targets T1, T2, T3 through A, B, C, D; labels “1 every 9 mins” and “1 every 3 mins”]
[Nguyen, 2009]

Outline: Fault detection with no control of targets – Disambiguating one-way failures

Is failure in forward or reverse path?
[Figure: probe from m to t, reply from t to m]
• Paths can be asymmetric
  – Load balancing
  – Hot-potato routing

Disambiguating one-way losses: Spoofing
[Figure: monitor m, target t, and a spoofer]
• Monitor requests to spoofer to send probe
• Spoofer sends spoofed probe with source address of the monitor
• If reply reaches the monitor, reverse path is good
[Katz-Bassett, 2008]

Limits of spoofing
• Network operators often drop spoofed packets
  – Spoofed packets are normally used for attacks
• Placement of spoofer
  – Paths from spoofer to targets need to be independent of paths from monitors

Summary: Fault detection
• End users: passive plus active probing
  – Passive measurements capture user’s experience
  – Active probes
    • When path has no traffic
    • When TCP connections are too short
• Network operators: alarms plus active probing
  – Alarm systems directly report many faults
  – Active monitoring to capture customer’s experience
    • Detect blackholes (i.e., faults that don’t appear in alarms)
    • Detect faults in other networks
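
The passive detection rule at end hosts (compare the current sequence number with the last one seen, and declare the path bad after four timeouts) can be sketched as a small state machine. This is an illustrative sketch, not the method of [Zhang, 2004]: the class name `PathStatus` is hypothetical, every repeated sequence number is treated as a timeout, the counter is assumed to reset when the connection makes progress, and sequence-number wraparound is ignored.

```python
class PathStatus:
    """Per-connection tracker for the sequence-number rule (simplified)."""

    def __init__(self, max_timeouts: int = 4) -> None:
        self.max_timeouts = max_timeouts  # four timeouts, as on the slide
        self.last_seq = None
        self.timeouts = 0

    def observe(self, seq: int) -> str:
        """Feed the sequence number of an outgoing segment; return path status."""
        if self.last_seq is None or seq > self.last_seq:
            # Connection is making progress: path is good.
            self.last_seq = seq
            self.timeouts = 0  # assumption: progress clears earlier timeouts
        elif seq == self.last_seq:
            # Same sequence number sent again: count it as a timeout.
            self.timeouts += 1
        return "bad" if self.timeouts >= self.max_timeouts else "good"


tracker = PathStatus()
history = [tracker.observe(s) for s in (100, 200, 200, 200, 200, 200)]
print(history)  # ['good', 'good', 'good', 'good', 'good', 'bad']
```

A real monitor would feed `observe` from a packet capture and would need to separate timeout retransmissions from fast retransmits, which this sketch does not attempt.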
Outline: Fault identification – Correlated path reachability without multicast

Uncorrelated measurements lead to errors
• Lack of synchronization leads to inconsistencies
  – Probes cross links at different times
  – Path may change between probes
[Figure: monitor m probing t1 and t2; mistakenly inferred failure]

Sources of inconsistencies
• In measurements from a single monitor
  – Probing all targets can take time
• In measurements from multiple monitors
  – Hard to synchronize monitors for all probes to reach a link at the same time
  – Impossible to generalize to all links

Inconsistent measurements with multiple monitors
[Figure: monitors m1 … mK probing targets t1 … tN]
path      reachability
m1, t1    good
…
m1, tN    good
…
mK, t1    good
…
mK, tN    bad
inconsistent measurements

Solution: Reprobe paths after failure
[Figure: monitors m1 … mK probing targets t1 … tN]
path      reachability
m1, t1    good
…
m1, tN    bad
…
mK, t1    good
…
mK, tN    bad
• Consistency has a cost
  – Delays fault identification
  – Cannot identify short failures
[Cunha, 2009]

Summary: Correlated measurements
• Trade-off: consistency vs. identification speed
  – Faster identification leads to false alarms
  – Slower identification misses short failures
• Network operators
  – Too many false alarms are unmanageable
  – Longer failures are the ones that need intervention
• End users
  – Even short failures affect performance
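
One way to see why the first reachability table is inconsistent and the reprobed one is not: under the binary assumption, every bad path must cross at least one link that no good path crosses. The check below is a minimal sketch of that test, not the procedure of [Cunha, 2009]; the function name `inconsistent_bad_paths` and the star topology through a hypothetical router `x` are assumptions, since the topology in the original figure is not recoverable.

```python
from typing import Dict, List, Set, Tuple

Path = Tuple[str, str]


def inconsistent_bad_paths(paths: Dict[Path, Set[str]],
                           status: Dict[Path, str]) -> List[Path]:
    """Return bad paths whose links all lie on good paths."""
    good_links: Set[str] = set()
    for path, links in paths.items():
        if status[path] == "good":
            good_links |= links
    # A bad path with no link outside good_links cannot be explained.
    return [p for p in sorted(paths)
            if status[p] == "bad" and paths[p] <= good_links]


# Hypothetical topology: every path goes through router x.
paths = {
    ("m1", "t1"): {"m1-x", "x-t1"},
    ("m1", "tN"): {"m1-x", "x-tN"},
    ("mK", "t1"): {"mK-x", "x-t1"},
    ("mK", "tN"): {"mK-x", "x-tN"},
}
first = {("m1", "t1"): "good", ("m1", "tN"): "good",
         ("mK", "t1"): "good", ("mK", "tN"): "bad"}
reprobed = dict(first)
reprobed[("m1", "tN")] = "bad"  # status after reprobing, as in the second table

print(inconsistent_bad_paths(paths, first))     # [('mK', 'tN')]
print(inconsistent_bad_paths(paths, reprobed))  # []
```

In the reprobed snapshot, link `x-tN` appears on no good path, so both bad paths have a possible explanation.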
Outline: Fault identification – Topology discovery

Measuring router topology
• With access to routers (or “from inside”)
  – Topology of one network
  – Routing monitors (OSPF or IS-IS)
• No access to routers (or “from outside”)
  – Multi-AS topology or from end-hosts
  – Monitors issue active probes: traceroute

Topology from inside
• Routing protocols flood state of each link
  – Periodically refresh link state
  – Report any changes: link down, up, cost change
• Monitor listens to link-state messages
  – Acts as a regular router
    • AT&T’s OSPFmon or Sprint’s PyRT for IS-IS
• Combining link states gives the topology
  – Easy to maintain, messages report any changes
[Mortier] [Shaikh, 2004]

Inferring a path from outside: traceroute
[Figure: actual path m, A (interfaces A.1, A.2), B (interfaces B.1, B.2), t. The probe with TTL = 1 returns “TTL exceeded” from A.1; the probe with TTL = 2 returns “TTL exceeded” from B.1. Inferred path: m, A.1, B.1, t]

A traceroute path can be incomplete
• Load balancing is widely used
  – Traceroute only probes one path
• Sometimes traceroute has no answer (stars)
  – ICMP rate limiting
  – Anonymous routers
• Tunnelling (e.g., MPLS) may hide routers
  – Routers inside the tunnel may not decrement TTL

Traceroute under load balancing
[Figure: actual path from m through load balancer L, then A and C on one branch and B and D on the other, to E and t; probes with TTL = 2 and TTL = 3 follow different branches. The inferred path has missing nodes and links and a false link]
[Augustin, 2006]

Errors happen even under per-flow load balancing
[Figure: Flow 1 from m through L; the probe with TTL = 2 uses Port 2 and the probe with TTL = 3 uses Port 3, so they follow different branches among A, B, C, D toward E and t]
• Traceroute uses the destination port as identifier
  – Needs to match probe to response
  – Response only has the header of the issued probe
[Augustin, 2006]

Paris traceroute
• Solves the problem with per-flow load balancing
  – Probes to a destination belong to same flow
• Changes the location of the probe identifier
  – Use the UDP checksum
[Figure: probes from m with TTL = 2 (Port 1, Checksum 2) and TTL = 3 (Port 1, Checksum 3) follow the same branch through L toward E and t]
[Augustin, 2006]
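
The difference between the two probing styles can be shown with a toy simulation. It is a sketch under stated assumptions, not the Paris traceroute implementation: the topology (L, then branch A, C or branch B, D, then E), the CRC32 hash standing in for a per-flow load balancer, the addresses, and the function names are all hypothetical. The classic variant changes the destination port on every probe, so the flow key changes; the Paris variant keeps the five-tuple constant and would carry the probe identifier in the UDP checksum, which is not part of the flow key.

```python
import zlib
from typing import List, Tuple

Flow = Tuple[str, str, int, int, str]  # src, dst, sport, dport, protocol

# Hypothetical topology behind load balancer L: two branches that rejoin at E.
BRANCHES = [["A", "C"], ["B", "D"]]


def route_for(flow: Flow) -> List[str]:
    """Per-flow load balancing: the flow key alone selects the branch."""
    key = "|".join(str(field) for field in flow).encode()
    branch = BRANCHES[zlib.crc32(key) % len(BRANCHES)]  # stand-in hash
    return ["L"] + branch + ["E"]


def classic_traceroute(src: str, dst: str, sport: int = 50000,
                       base_dport: int = 33434) -> List[str]:
    """Destination port doubles as probe identifier, so each probe is a new flow."""
    hops = []
    for ttl in range(1, 5):
        flow = (src, dst, sport, base_dport + ttl - 1, "udp")
        hops.append(route_for(flow)[ttl - 1])
    return hops


def paris_traceroute(src: str, dst: str, sport: int = 50000,
                     dport: int = 33434) -> List[str]:
    """Constant five-tuple; the identifier lives in the checksum (not modeled)."""
    flow = (src, dst, sport, dport, "udp")
    return [route_for(flow)[ttl - 1] for ttl in range(1, 5)]


print("classic:", classic_traceroute("192.0.2.1", "198.51.100.7"))
print("paris:  ", paris_traceroute("192.0.2.1", "198.51.100.7"))
```

The Paris output is always one of the two real routes. The classic output may mix hops from both branches, which is how a false link such as A to D can appear in the inferred path.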
Topology from traceroutes
[Figure: actual topology with monitors m1, m2, routers A, B, C, D (numbered interfaces), and targets t1, t2; inferred topology showing interfaces A.1, B.3, C.1, C.2, D.1]
• Inferred nodes = interfaces, not routers
• Coverage depends on monitors and targets
  – Misses links and routers
  – Some links and routers appear multiple times

Alias resolution: Map interfaces to routers
• Direct probing
  – Probe an interface, may receive response from another
  – Responses from the same router will have close IP identifiers and same TTL
• Record-route IP option
  – Records up to nine IP addresses of routers in the path
[Figure: inferred topology with interfaces A.1, B.3, C.1, C.2, D.1; two interfaces marked as the same router]
[Spring, 2002] [Sherwood, 2008]

Large-scale topology measurements
• Probing a large topology takes time
  – E.g., probing 1200 targets from PlanetLab nodes takes 5 minutes on average (using 30 threads)
  – Probing more targets covers more links
  – But, getting a topology snapshot takes longer
• Snapshot may be inaccurate
  – Paths may change during snapshot
• Hard to get up-to-date topology
  – To know that a path changed, need to re-probe

Faster topology snapshots
• Probing redundancy
  – Intra-monitor
  – Inter-monitor
• Doubletree
  – Combines backward and forward probing to eliminate redundancy
[Figure: monitors m1, m2, routers A, B, C, D, targets t1, t2]
[Donnet, 2005]

Summary: Topology discovery
• Network operators
  – Own network: routing messages
  – Neighbor networks: traceroutes
• End users: combining traceroutes
  – Be aware of inaccuracies
    • False or missing links and nodes
    • Hidden hops: stars, tunneling
  – Fault identification with lower precision
    • Determine the network to blame
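
The direct-probing test for alias resolution (responses from the same router have close IP identifiers and the same TTL) can be sketched as a pairwise comparison. This is only an illustration, not the technique as implemented in [Spring, 2002]: the function name `likely_aliases`, the threshold `max_gap`, and the sample values are assumptions, and the responses are assumed to have been collected back to back.

```python
from typing import Dict, List, Tuple


def likely_aliases(responses: Dict[str, Tuple[int, int]],
                   max_gap: int = 10) -> List[Tuple[str, str]]:
    """Pair interfaces whose replies have equal TTL and nearby IP identifiers.

    responses maps an interface name to (ip_id, ttl) from its reply.
    """
    names = sorted(responses)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            id_a, ttl_a = responses[a]
            id_b, ttl_b = responses[b]
            # The IP identifier is a 16-bit field, so compare modulo 2**16.
            gap = min((id_a - id_b) % 65536, (id_b - id_a) % 65536)
            if ttl_a == ttl_b and gap <= max_gap:
                pairs.append((a, b))
    return pairs


# Made-up replies: C.1 and C.2 share a counter; D.1 differs in TTL.
responses = {
    "A.1": (1200, 254),
    "C.1": (40110, 252),
    "C.2": (40112, 252),
    "D.1": (40111, 250),
}
print(likely_aliases(responses))  # [('C.1', 'C.2')]
```

A single close pair of identifiers can be a coincidence, so a practical tool would repeat the probes before merging two interfaces into one router.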
Outline: Open issues in network tomography

Tomography algorithms
• Make robust to measurement noise
• Make robust to topology uncertainties
  – Multiple topologies close to the time of an event
  – Multiple paths between a monitor and a target
• Identify other types of faults
  – Path specific
  – Intermittent

Monitoring techniques
• Track dynamics of large-scale topologies
  – Fast identification requires up-to-date topology
• Passive detection inside a network
  – High speed packet processing
  – Detect faults with incomplete information
• Large-scale deployment
  – Consolidating measurements becomes bottleneck
• Define changes to ease fault diagnosis
  – Router reports or behavior
  – Common monitoring infrastructure

REFERENCES

Network tomography theory
• Survey on network tomography
  – R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, “Network Tomography: Recent Developments”, Statistical Science, Vol. 19, No. 3 (2004), 499-517.
• Traffic matrix estimation
  – Y. Vardi, “Network Tomography: Estimating Source-Destination Traffic Intensities from Link Data”, Journal of the American Statistical Association, Vol. 91, 1996.
• Inference of link performance/connectivity
  – MINC project: http://gaia.cs.umass.edu/minc/
  – A. Adams et al., “The Use of End-to-end Multicast Measurements for Characterizing Internal Network Behavior”, IEEE Communications Magazine, May 2000.

Binary tomography
• Single-source tree algorithm
  – N. Duffield, “Network Tomography of Binary Network Performance Characteristics”, IEEE Transactions on Information Theory, 2006.
• Applying tomography in one network
  – R. R. Kompella, J. Yates, A. Greenberg, A. C. Snoeren, “Detection and Localization of Network Blackholes”, IEEE INFOCOM, 2007.
• Applying tomography in multiple network topology
  – A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot, “NetDiagnoser: Troubleshooting network unreachabilities using end-to-end probes and routing data”, CoNEXT, 2007.
• Obtaining accurate path status for binary tomography
  – I. Cunha, R. Teixeira, N. Feamster, and C. Diot, “Measurement Methods for Fast and Accurate Blackhole Identification with Binary Tomography”, Thomson technical report CR-PRL-2009-05-006, 2009.

Topology from inside
• IS-IS monitoring
  – R. Mortier, “Python Routeing Toolkit (`PyRT')”, https://research.sprintlabs.com/pyrt/
• OSPF monitoring
  – A. Shaikh and A. Greenberg, “OSPF Monitoring: Architecture, Design and Deployment Experience”, NSDI, 2004.
• Commercial products
  – Packet Design: http://www.packetdesign.com/

Topology with traceroute
• Tracing accurate paths under load-balancing
  – B. Augustin et al., “Avoiding traceroute anomalies with Paris traceroute”, IMC, 2006.
• Reducing overhead to trace topology of a network and alias resolution with direct probing
  – N. Spring, R. Mahajan, and D. Wetherall, “Measuring ISP Topologies with Rocketfuel”, SIGCOMM, 2002.
• Use of record route to obtain more accurate topologies
  – R. Sherwood, A. Bender, N. Spring, “DisCarte: A Disjunctive Internet Cartographer”, SIGCOMM, 2008.
• Reducing overhead to trace a multi-network topology
  – B. Donnet, P. Raoult, T. Friedman, and M. Crovella, “Efficient Algorithms for Large-Scale Topology Discovery”, SIGMETRICS, 2005.

Reducing overhead of active fault detection
• Selection of paths to probe
  – H. Nguyen and P. Thiran, “Active measurement for multiple link failures diagnosis in IP networks”, PAM, 2004.
  – Y. Bejerano and R. Rastogi, “Robust monitoring of link delays and faults in IP networks”, INFOCOM, 2003.
• Selection of the frequency to probe paths
  – H. X. Nguyen, R. Teixeira, P. Thiran, and C. Diot, “Minimizing Probing Cost for Detecting Interface Failures: Algorithms and Scalability Analysis”, INFOCOM, 2009.

Internet-wide fault detection systems
• Detection with BGP monitoring plus continuous pings, spoofing to disambiguate one-way failures, traceroute to locate faults
  – E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D. Wetherall, T. Anderson, “Studying Black Holes in the Internet with Hubble”, NSDI, 2008.
• Detection with passive monitoring of traffic of peer-to-peer systems or content distribution networks, traceroutes to locate faults
  – M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, “PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services”, OSDI, 2004.