* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download slides - Inria
Deep packet inspection wikipedia , lookup
Multiprotocol Label Switching wikipedia , lookup
Wake-on-LAN wikipedia , lookup
Zero-configuration networking wikipedia , lookup
Piggybacking (Internet access) wikipedia , lookup
Distributed firewall wikipedia , lookup
Computer network wikipedia , lookup
Cracking of wireless networks wikipedia , lookup
IEEE 802.1aq wikipedia , lookup
Internet measurements: fault detection, identification, and topology discovery Renata Teixeira Laboratoire LIP6 CNRS and UPMC Paris Universitas Internet monitoring is essential For – Monitor service-level agreements – Fault diagnosis – Diagnose anomalous behavior For 1 network operators users or content/application providers – Verify network performance – Verify network neutrality Network operators can’t know the user’s experience AS3 AS2 AS1 Network – – 2 AS4 operators only have data of one AS AS4 doesn’t detect any problem AS3 doesn’t know who is affected by the failure End users can’t know what happens in the network AS3 AS2 AS1 End-hosts 3 AS4 can only monitor end-to-end paths Network tomography to rescue Inference of unknown network properties from measurable ones Network operators End users – Monitor network paths – Cooperative monitoring – From monitoring hosts – Among end users – From users to popular services – • In network • Third-party monitoring services From home gateways http://cmon.grenouille.com http://www.nanodatacenters.eu 4 Fault diagnosis using end-to-end measurements Faults are persistent reachability problems detection continuous path monitoring identification binary tomography 5 Outline Background Fault detection – Active vs. passive measurements – Reducing overhead of active measurements – Disambiguating one-way failures Fault identification using binary tomography – Correlated path reachability – Topology discovery Open 6 in network tomography issues Network tomography to infer link performance What are the properties of network links? – Loss rate – Delay – Bandwidth – Connectivity D A AS 1 C B Given – 7 end-to-end measurements No access to routers F E AS 2 The origins MINC: Multicast-based Inference of Network-internal Characteristics Key – probe sender idea: multicast probes Exploit correlation in traces to estimate link properties probe collectors [MINC project, 1999] 8 Inferring link loss rates m Assumptions – Known, logical-tree topology – Losses are independent α2 α3 – Multicast probes t1 t2 1 0 1 1 1 1 Method – Maximum likelihood estimates for αk [Adams, 2000] 9 success probabilities α1 ^ α2 ^ α3 ^ α1 estimated success probabilities Binary tomography Labels – – – Loss-rate estimation requires tight correlation Instead, separate good/bad performance If link is bad, all paths that cross the link are bad [Duffield, 2006] 10 m links as good or bad α1 α2 α3 t1 t2 1 0 0 1 1 1 bad good Single-source tree “Smallest Consistent Failure Set” algorithm – – Assumes a single-source tree and known topology Find the smallest set of links that explains bad paths • Given bad links are uncommon • Bad link is the root of maximal bad subtree [Duffield, 2006] 11 m bad t1 t2 1 0 0 1 1 1 bad good Fault identification with binary tomography Fault monitoring needs multiple sources and targets Problem – m2 t1 t2 becomes NP-hard Minimum hitting set problem Iterative greedy heuristic – Given the set of links in bad paths – Iteratively choose link that explains the max number of bad paths [Kompella, 2007] [Dhamdhere, 2007] 12 m1 Hitting set of link = paths that traverse the link Practical issues Topology – Need to measure accurate topology Multicast – – 13 of targets is not always practical Need one-way performance from round-trip probes Links – not available Need to extract correlation from unicast probes Even using probes from different monitors Control – is often unknown can fail for some paths, but not all Need to extend tomography algorithms Outline Background Fault detection with no control of targets – Active vs. passive measurements – Reducing overhead of active measurements – Disambiguating one-way failures Fault identification using binary tomography – Correlated path reachability without multicast – Topology discovery Open 14 in network tomography issues Detection techniques Active – – probing: ping Send probe, collect response From any end host • Works for network operators and end users Passive – – 15 analysis of user’s traffic Tap incoming and outgoing traffic • At user’s machines or servers: tcpdump, pcap • Inside the network: DAG card Monitor status of TCP connections Detection with ping probe ICMP echo request t If receives reply – If no reply before timeout – m 16 reply ICMP echo reply Then, path is good Then, path is bad Persistent failure or measurement noise? Many – Timeout may be too short – Rate limiting at routers – Some end-hosts don’t respond to ICMP request – Transient congestion – Routing change Need – 17 reasons to lose probe or reply to confirm that failure is persistent Otherwise, may trigger false alarms Failure confirmation Upon detection of a failure, trigger extra probes Goal: minimize detection errors – – Sending more probes Waiting longer between probes Tradeoff: detection error and detection time loss burst packets on a path time Detection error [Cunha, 2009] 18 Passive detection at end hosts tcpdump/pcap captures packets Track status of each TCP connection – RTTs, timeouts, retransmissions Multiple – If current seq. number > last seq. number seen • – Path is good If current seq. number = last seq. number seen • Timeout has occurred • After four timeouts, declare path as bad [Zhang, 2004] 19 timeouts indicate path is bad Passive detection inside the network is hard Traffic – Need special hardware • – volume is too high DAG cards can capture packets at high speeds May lose packets Tracking 20 TCP connections is hard – May not capture both sides of a connection – Large processing and memory overhead Passive vs. active detection Passive + No need to inject traffic + Detects all failures that affect user’s traffic + Responses from targets that don’t respond to ping ‒ Not always possible to tap user’s traffic ‒ Only detects failures in paths with traffic 21 Active + No need to tap user’s traffic + Detects failures in any desired path ‒ Probing – – overhead Cover a large number of paths Detect failures fast Outline Background Fault detection with no control of targets – Active vs. passive measurements – Reducing overhead of active measurements – Disambiguating one-way failures Fault identification using binary tomography – Correlated path reachability without multicast – Topology discovery Open 22 in network tomography issues Active monitoring: reducing probing overhead M1 T1 target network C A T2 D B monitors M2 23 Goal detect failures of any of the interfaces in the target network with minimum probing overhead target hosts T3 The coverage solution T1 M1 C A T2 D T3 B Instead of probing all paths, select the minimum set of paths that covers all interfaces in target network Coverage problem is NP-hard M2 – Solution: greedy set-cover heuristic [Nguyen, 2004] [Bejerano,2003] 24 Coverage solution doesn’t detect all types of failures Detects – Failures that affect all packets that traverse the faulty interface • But – Eg., interface or router crashes, fiber cuts, bugs not path-specific failures Failures that affect only a subset of paths that cross the faulty interface • Eg., router misconfigurations [Nguyen, 2009] 25 fail-stop failures New formulation of failure detection problem Select – the frequency to probe each path Lower frequency per-path probing can achieve a high frequency probing of each interface T1 1 every 9 mins M1 1 every 3 mins M2 [Nguyen, 2009] 26 C A B D T2 T3 Outline Background Fault detection with no control of targets – Active vs. passive measurements – Reducing overhead of active measurements – Disambiguating one-way failures Fault identification using binary tomography – Correlated path reachability without multicast – Topology discovery Open 27 in network tomography issues Is failure in forward or reverse path? probe t Paths reply m 28 can be asymmetric – Load balancing – Hot-potato routing Disambiguating one-way losses: Spoofing t m [Katz-Bassett, 2008] 29 Spoofer Monitor requests to spoofer to send probe Spoofer sends spoofed probe with source address of the monitor If reply reaches the monitor, reverse path is good Limits of spoofing Network – operators often drop spoofed packets Spoofed packets are normally used for attacks Placement – t of spoofer Paths from spoofer to targets need to be independent than paths from monitors m 30 Summary: Fault detection End – – users: passive plus active probing Passive measurements capture user’s experience Active probes • When path has no traffic • When TCP connections are too short Network – – 31 operators: alarms plus active probing Alarm systems directly report many faults Active monitoring to capture customer’s experience • Detect blackholes (i.e., faults that don’t appear in alarms) • Detect faults in other networks Outline Background Fault detection with no control of targets – Active vs. passive measurements – Reducing overhead of active measurements – Disambiguating one-way failures Fault identification – Correlated path reachability without multicast – Topology discovery Open 32 in network tomography issues Uncorrelated measurements lead to errors Lack of synchronization leads to inconsistencies – – Probes cross links at different times Path may change between probes m t1 mistakenly inferred failure 33 t2 Sources of inconsistencies In – In – – 34 measurements from a single monitor Probing all targets can take time measurements from multiple monitors Hard to synchronize monitors for all probes to reach a link at the same time Impossible to generalize to all links Inconsistent measurements with multiple monitors mK path reachability … m1, tN good mK,t1 good mK, tN bad inconsistent measurements 35 … good … m1,t1 … m1 tN t1 Solution: Reprobe paths after failure mK path reachability … m1, tN bad Consistency mK, tN bad has a cost – Delays fault identification – Cannot identify short failures [Cunha, 2009] 36 mK,t1 good … good … m1,t1 … m1 tN t1 Summary: Correlated measurements Trade-off: – – Faster identification leads to false alarms Slower identification misses short failures Network – – 37 operators Too many false alarms are unmanageable Longer failures are the ones that need intervention End – consistency vs. identification speed users Even short failures affect performance Outline Background Fault detection with no control of targets – Active vs. passive measurements – Reducing overhead of active measurements – Disambiguating one-way failures Fault identification – Correlated path reachability without multicast – Topology discovery Open 38 in network tomography issues Measuring router topology With – Topology of one network – Routing monitors (OSPF or IS-IS) No 39 access to routers (or “from inside”) access to routers (or “from outside”) – Multi-AS topology or from end-hosts – Monitors issue active probes: traceroute Topology from inside Routing protocols flood state of each link – Periodically refresh link state – Report any changes: link down, up, cost change Monitor – listens to link-state messages Acts as a regular router • AT&T’s OSPFmon or Sprint’s PyRT for IS-IS Combining – link states gives the topology Easy to maintain, messages report any changes [Mortier] [Shaikh, 2004] 40 Inferring a path from outside: traceroute Actual path TTL exceeded from B.1 TTL exceeded from A.1 A.1 m A A.2 B.1 B B.2 t TTL = 1 TTL = 2 Inferred path m 41 A.1 B.1 t A traceroute path can be incomplete Load – balancing is widely used Traceroute only probes one path Sometimes – ICMP rate limiting – Anonymous routers Tunnelling – 42 taceroute has no answer (stars) (e.g., MPLS) may hide routers Routers inside the tunnel may not decrement TTL Traceroute under load balancing Actual path A C TTL = 2 m B t E L D TTL = 3 Inferred path A Missing nodes and links C False link m E L B [Augustin, 2006] 43 D t Errors happen even under per-flow load balancing Flow 1 A m L C TTL = 2 Port 2 B E t D TTL = 3 Port 3 Traceroute – – Needs to match probe to response Response only has the header of the issued probe [Augustin, 2006] 44 uses the destination port as identifier Paris traceroute Solves – the problem with per-flow load balancing Probes to a destination belong to same flow Changes – Use the UDP checksum m [Augustin, 2006] 45 the location of the probe identifier L A C TTL = 2 Port 1 TTL = 3 Port 1 Checksum 2 Checksum 3 B D E t Topology from traceroutes Actual topology 1 m1 A 3 2 1 1 B 2 3 3 C 4 2 D 1 2 t1 Inferred topology D.1 m1 A.1 C.1 C.2 t2 m2 m2 B.3 Inferred nodes = interfaces, not routers Coverage depends on monitors and targets 46 t1 – Misses links and routers – Some links and routers appear multiple times t2 Alias resolution: Map interfaces to routers Direct probing – – Responses from the same router will have close IP identifiers and same TTL Record-route IP option – Records up to nine IP addresses of routers in the path [Spring, 2002] [Sherwood, 2008] 47 Inferred topology Probe an interface, may receive response from another D.1 m1 A.1 t1 C.1 C.2 m2 B.3 t2 same router Large-scale topology measurements Probing – – – a large topology takes time E.g., probing 1200 targets from PlanetLab nodes takes 5 minutes on average (using 30 threads) Probing more targets covers more links But, getting a topology snapshot takes longer Snapshot – Paths may change during snapshot Hard – 48 may be inaccurate to get up-to-date topology To know that a path changed, need to re-probe Faster topology snapshots Probing redundancy – Intra-monitor – Inter-monitor Doubletree – Combines backward and forward probing to eliminate redundancy [Donnet, 2005] 49 t1 D m1 A C B m2 t2 Summary: Topology discovery Network – – Own network: routing messages Neighbor networks: traceroutes End – – users: combining traceroutes Be aware of inaccuracies • False or missing links and nodes • Hidden hops: stars, tunneling Fault identification with lower precision • 50 operators Determine the network to blame Outline Background Fault detection with no control of targets – Active vs. passive measurements – Reducing overhead of active measurements – Disambiguating one-way failures Fault identification – Correlated path reachability without multicast – Topology discovery Open 51 in network tomography issues Tomography algorithms Make robust to measurement noise Make robust to topology uncertainties – – Multiple topologies close to the time of an event Multiple paths between a monitor and a target Identify – – 52 other types of faults Path specific Intermittent Monitoring techniques Track – dynamics of large-scale topologies Fast identification requires up-to-date topology Passive – – detection inside a network High speed packet processing Detect faults with incomplete information Large-scale – Consolidating measurements becomes bottleneck Define – – 53 deployment changes to easy fault diagnosis Router reports or behavior Common monitoring infrastructure REFERENCES 54 Network tomography theory Survey on network tomography – Traffic matrix estimation – 55 R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, “Network Tomography: Recent Developments”, Statistical Science, Vol. 19, No. 3 (2004), 499-517. Y. Vardi, “Network Tomography: Estimating Source-Destination Traffic Intensities from Link Data”, Journal of the American Statistical Association, Vol. 91, 1996. Inference of link performance/connectivity – MINC project: http://gaia.cs.umass.edu/minc/ – A. Adams et al., “The Use of End-to-end Multicast Measurements for Characterizing Internal Network Behavior”, IEEE Communications Magazine, May 2000. Binary tomography Single-source tree algorithm – Applying tomography in one network – A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot, “NetDiagnoser:Troubleshooting network unreachabilities using end-toend probes and routing data”, CoNEXT, 2007. Obtaining accurate path status for binary tomography – 56 R. R. Kompella, J. Yates, A. Greenberg, A. C. Snoeren, “Detection and Localization of Network Blackholes”, IEEE INFOCOM, 2007. Applying tomography in multiple network topology – N. Duffield, “Network Tomography of Binary Network Performance Characteristics”, IEEE Transactions on Information Theory, 2006. I. Cunha, R. Teixeira, N. Feamster, and C. Diot, “Measurement Methods for Fast and Accurate Blackhole Identification with Binary Tomography”, Thomson technical report CR-PRL-2009-05-006, 2009. Topology from inside IS-IS monitoring – OSPF monitoring – A. Shaikh and A. Greenberg, “OSPF Monitoring: Architecture, Design and Deployment Experience”, NSDI 2004 Commercial products – 57 R. Mortier, “Python Routeing Toolkit (`PyRT')”, https://research.sprintlabs.com/pyrt/ Packet Design: http://www.packetdesign.com/ Topology with traceroute Tracing accurate paths under load-balancing – Reducing overhead to trace topology of a network and alias resolution with direct probing – R. Sherwood, A. Bender, N. Spring, “DisCarte: A Disjunctive Internet Cartographer”, SIGCOMM, 2008. Reducing overhead to trace a multi-network topology – 58 N. Spring, R. Mahajan, and D. Wetherall, “Measuring ISP Topologies with Rocketfuel”, SIGCOMM 2002. Use of record route to obtain more accurate topologies – B. Augustin et al., “Avoiding traceroute anomalies with Paris traceroute”, IMC, 2006. B. Donnet, P. Raoult, T. Friedman, and M. Crovella, “Efficient Algorithms for Large-Scale Topology Discovery”, SIGMETRICS, 2005. Reducing overhead of active fault detection Selection of paths to probe – H. Nguyen and P. Thiran, “Active measurement for multiple link failures diagnosis in IP networks”, PAM, 2004. – Yigal Bejerano and Rajeev Rastogi, “Robust monitoring of link delays and faults in IP networks”, INFOCOM, 2003. Selection of the frequency to probe paths – 59 H. X. Nguyen , R. Teixeira, P. Thiran, and C. Diot, " Minimizing Probing Cost for Detecting Interface Failures: Algorithms and Scalability Analysis", INFOCOM, 2009. Internet-wide fault detection systems Detection with BGP monitoring plus continuous pings, spoofing to disambiguate one-way failures, traceroute to locate faults – Detection with passive monitoring of traffic of peer-to-peer systems or content distribution networks, traceroutes to locate faults – 60 E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D. Wetherall, T. Anderson, “Studying Black Holes in the Internet with Hubble”, NSDI, 2008. M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, “PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services”, OSDI, 2004.