Download slides - Inria

Document related concepts

Deep packet inspection wikipedia , lookup

Net bias wikipedia , lookup

Multiprotocol Label Switching wikipedia , lookup

IEEE 1355 wikipedia , lookup

Wake-on-LAN wikipedia , lookup

Zero-configuration networking wikipedia , lookup

Piggybacking (Internet access) wikipedia , lookup

Distributed firewall wikipedia , lookup

Computer network wikipedia , lookup

Cracking of wireless networks wikipedia , lookup

IEEE 802.1aq wikipedia , lookup

Network tap wikipedia , lookup

Airborne Networking wikipedia , lookup

Transcript
Internet measurements: fault
detection, identification, and
topology discovery
Renata Teixeira
Laboratoire LIP6
CNRS and UPMC Paris Universitas
Internet monitoring is essential
 For
–
Monitor service-level agreements
–
Fault diagnosis
–
Diagnose anomalous behavior
 For
1
network operators
users or content/application providers
–
Verify network performance
–
Verify network neutrality
Network operators can’t know
the user’s experience
AS3
AS2
AS1
 Network
–
–
2
AS4
operators only have data of one AS
AS4 doesn’t detect any problem
AS3 doesn’t know who is affected by the failure
End users can’t know what
happens in the network
AS3
AS2
AS1
 End-hosts
3
AS4
can only monitor end-to-end paths
Network tomography to rescue
Inference of unknown network properties from
measurable ones

Network operators

End users
–
Monitor network paths
–
Cooperative monitoring
–
From monitoring hosts
–
Among end users
–
From users to popular
services
–
•
In network
•
Third-party monitoring services
From home gateways
http://cmon.grenouille.com
http://www.nanodatacenters.eu
4
Fault diagnosis using end-to-end
measurements
 Faults
are persistent reachability problems
detection
continuous path monitoring
identification
binary tomography
5
Outline
 Background
 Fault
detection
–
Active vs. passive measurements
–
Reducing overhead of active measurements
–
Disambiguating one-way failures
 Fault
identification using binary tomography
–
Correlated path reachability
–
Topology discovery
 Open
6
in network tomography
issues
Network tomography to infer
link performance
 What
are the properties of network links?
–
Loss rate
–
Delay
–
Bandwidth
–
Connectivity
D
A
AS 1 C
B
 Given
–
7
end-to-end measurements
No access to routers
F
E AS 2
The origins
 MINC:
Multicast-based
Inference of Network-internal
Characteristics
 Key
–
probe
sender
idea: multicast probes
Exploit correlation in traces to
estimate link properties
probe
collectors
[MINC project, 1999]
8
Inferring link loss rates
m
 Assumptions
–
Known, logical-tree topology
–
Losses are independent
α2
α3
–
Multicast probes
t1
t2
1
0
1
1
1
1
 Method
–
Maximum likelihood
estimates for αk
[Adams, 2000]
9
success
probabilities
α1
^ α2
^ α3
^
α1
estimated
success
probabilities
Binary tomography
 Labels
–
–
–
Loss-rate estimation requires
tight correlation
Instead, separate good/bad
performance
If link is bad, all paths that
cross the link are bad
[Duffield, 2006]
10
m
links as good or bad
α1
α2
α3
t1
t2
1
0
0
1
1
1
bad
good
Single-source tree
 “Smallest
Consistent Failure
Set” algorithm
–
–
Assumes a single-source tree
and known topology
Find the smallest set of links
that explains bad paths
•
Given bad links are uncommon
•
Bad link is the root of maximal
bad subtree
[Duffield, 2006]
11
m
bad
t1
t2
1
0
0
1
1
1
bad
good
Fault identification with binary
tomography
 Fault
monitoring needs multiple
sources and targets
 Problem
–
m2
t1
t2
becomes NP-hard
Minimum hitting set problem
 Iterative
greedy heuristic
–
Given the set of links in bad paths
–
Iteratively choose link that explains
the max number of bad paths
[Kompella, 2007] [Dhamdhere, 2007]
12
m1
Hitting set of link =
paths that traverse
the link
Practical issues
 Topology
–
Need to measure accurate topology
 Multicast
–
–
13
of targets is not always practical
Need one-way performance from round-trip probes
 Links
–
not available
Need to extract correlation from unicast probes
Even using probes from different monitors
 Control
–
is often unknown
can fail for some paths, but not all
Need to extend tomography algorithms
Outline
 Background
 Fault
detection with no control of targets
–
Active vs. passive measurements
–
Reducing overhead of active measurements
–
Disambiguating one-way failures
 Fault
identification using binary tomography
–
Correlated path reachability without multicast
–
Topology discovery
 Open
14
in network tomography
issues
Detection techniques
 Active
–
–
probing: ping
Send probe, collect response
From any end host
•
Works for network operators and end users
 Passive
–
–
15
analysis of user’s traffic
Tap incoming and outgoing traffic
•
At user’s machines or servers: tcpdump, pcap
•
Inside the network: DAG card
Monitor status of TCP connections
Detection with ping
probe
ICMP
echo request
t
 If
receives reply
–
 If
no reply before timeout
–
m
16
reply
ICMP
echo reply
Then, path is good
Then, path is bad
Persistent failure or
measurement noise?
 Many
–
Timeout may be too short
–
Rate limiting at routers
–
Some end-hosts don’t respond to ICMP request
–
Transient congestion
–
Routing change
 Need
–
17
reasons to lose probe or reply
to confirm that failure is persistent
Otherwise, may trigger false alarms
Failure confirmation
 Upon
detection of a failure, trigger extra probes
 Goal: minimize detection errors
–
–
Sending more probes
Waiting longer between probes
 Tradeoff:
detection error and detection time
loss burst
packets on
a path
time
Detection error
[Cunha, 2009]
18
Passive detection at end hosts
 tcpdump/pcap
captures packets
 Track status of each TCP connection
–
RTTs, timeouts, retransmissions
 Multiple
–
If current seq. number > last seq. number seen
•
–
Path is good
If current seq. number = last seq. number seen
•
Timeout has occurred
•
After four timeouts, declare path as bad
[Zhang, 2004]
19
timeouts indicate path is bad
Passive detection inside the
network is hard
 Traffic
–
Need special hardware
•
–
volume is too high
DAG cards can capture packets at high speeds
May lose packets
 Tracking
20
TCP connections is hard
–
May not capture both sides of a connection
–
Large processing and memory overhead
Passive vs. active detection
Passive
+ No
need to inject traffic
+ Detects all failures that
affect user’s traffic
+ Responses from targets
that don’t respond to ping
‒ Not
always possible to tap
user’s traffic
‒ Only detects failures in
paths with traffic
21
Active
+ No
need to tap user’s traffic
+ Detects failures in any
desired path
‒ Probing
–
–
overhead
Cover a large number of paths
Detect failures fast
Outline
 Background
 Fault
detection with no control of targets
–
Active vs. passive measurements
–
Reducing overhead of active measurements
–
Disambiguating one-way failures
 Fault
identification using binary tomography
–
Correlated path reachability without multicast
–
Topology discovery
 Open
22
in network tomography
issues
Active monitoring: reducing
probing overhead
M1
T1
target
network
C
A
T2
D
B
monitors
M2
23
Goal
detect failures of any of the
interfaces in the target network
with minimum probing overhead
target hosts
T3
The coverage solution
T1
M1
C
A
T2
D
T3
B

Instead of probing all paths, select the
minimum set of paths that covers all
interfaces in target network

Coverage problem is NP-hard
M2
–
Solution: greedy set-cover heuristic
[Nguyen, 2004] [Bejerano,2003]
24
Coverage solution doesn’t detect
all types of failures
 Detects
–
Failures that affect all packets that traverse the
faulty interface
•
 But
–
Eg., interface or router crashes, fiber cuts, bugs
not path-specific failures
Failures that affect only a subset of paths that cross
the faulty interface
•
Eg., router misconfigurations
[Nguyen, 2009]
25
fail-stop failures
New formulation of failure
detection problem
 Select
–
the frequency to probe each path
Lower frequency per-path probing can achieve a
high frequency probing of each interface
T1
1 every 9 mins
M1
1 every 3 mins
M2
[Nguyen, 2009]
26
C
A
B
D
T2
T3
Outline
 Background
 Fault
detection with no control of targets
–
Active vs. passive measurements
–
Reducing overhead of active measurements
–
Disambiguating one-way failures
 Fault
identification using binary tomography
–
Correlated path reachability without multicast
–
Topology discovery
 Open
27
in network tomography
issues
Is failure in forward or
reverse path?
probe
t
 Paths
reply
m
28
can be asymmetric
–
Load balancing
–
Hot-potato routing
Disambiguating one-way losses:
Spoofing
t
m
[Katz-Bassett, 2008]
29
Spoofer

Monitor requests to spoofer to
send probe

Spoofer sends spoofed probe
with source address of the
monitor

If reply reaches the monitor,
reverse path is good
Limits of spoofing
 Network
–
operators often drop spoofed packets
Spoofed packets are normally used for attacks
 Placement
–
t
of spoofer
Paths from spoofer to
targets need to be
independent than paths
from monitors
m
30
Summary: Fault detection
 End
–
–
users: passive plus active probing
Passive measurements capture user’s experience
Active probes
•
When path has no traffic
•
When TCP connections are too short
 Network
–
–
31
operators: alarms plus active probing
Alarm systems directly report many faults
Active monitoring to capture customer’s experience
•
Detect blackholes (i.e., faults that don’t appear in alarms)
•
Detect faults in other networks
Outline
 Background
 Fault
detection with no control of targets
–
Active vs. passive measurements
–
Reducing overhead of active measurements
–
Disambiguating one-way failures
 Fault
identification
–
Correlated path reachability without multicast
–
Topology discovery
 Open
32
in network tomography
issues
Uncorrelated measurements
lead to errors
 Lack
of synchronization
leads to inconsistencies
–
–
Probes cross links at
different times
Path may change between
probes
m
t1
mistakenly
inferred failure
33
t2
Sources of inconsistencies
 In
–
 In
–
–
34
measurements from a single monitor
Probing all targets can take time
measurements from multiple monitors
Hard to synchronize monitors for all probes to reach
a link at the same time
Impossible to generalize to all links
Inconsistent measurements with
multiple monitors
mK
path reachability
…
m1, tN good
mK,t1 good
mK, tN bad
inconsistent
measurements
35
…
good
…
m1,t1
…
m1
tN
t1
Solution:
Reprobe paths after failure
mK
path reachability
…
m1, tN bad
 Consistency
mK, tN bad
has a cost
–
Delays fault identification
–
Cannot identify short failures
[Cunha, 2009]
36
mK,t1 good
…
good
…
m1,t1
…
m1
tN
t1
Summary: Correlated
measurements
 Trade-off:
–
–
Faster identification leads to false alarms
Slower identification misses short failures
 Network
–
–
37
operators
Too many false alarms are unmanageable
Longer failures are the ones that need intervention
 End
–
consistency vs. identification speed
users
Even short failures affect performance
Outline
 Background
 Fault
detection with no control of targets
–
Active vs. passive measurements
–
Reducing overhead of active measurements
–
Disambiguating one-way failures
 Fault
identification
–
Correlated path reachability without multicast
–
Topology discovery
 Open
38
in network tomography
issues
Measuring router topology
 With
–
Topology of one network
–
Routing monitors (OSPF or IS-IS)
 No
39
access to routers (or “from inside”)
access to routers (or “from outside”)
–
Multi-AS topology or from end-hosts
–
Monitors issue active probes: traceroute
Topology from inside
 Routing
protocols flood state of each link
–
Periodically refresh link state
–
Report any changes: link down, up, cost change
 Monitor
–
listens to link-state messages
Acts as a regular router
•
AT&T’s OSPFmon or Sprint’s PyRT for IS-IS
 Combining
–
link states gives the topology
Easy to maintain, messages report any changes
[Mortier] [Shaikh, 2004]
40
Inferring a path from outside:
traceroute
Actual path
TTL exceeded
from B.1
TTL exceeded
from A.1
A.1
m
A
A.2
B.1
B
B.2
t
TTL = 1
TTL = 2
Inferred path
m
41
A.1
B.1
t
A traceroute path can be
incomplete
 Load
–
balancing is widely used
Traceroute only probes one path
 Sometimes
–
ICMP rate limiting
–
Anonymous routers
 Tunnelling
–
42
taceroute has no answer (stars)
(e.g., MPLS) may hide routers
Routers inside the tunnel may not decrement TTL
Traceroute under load balancing
Actual path
A
C
TTL = 2
m
B
t
E
L
D
TTL = 3
Inferred path
A
Missing nodes
and links
C
False link
m
E
L
B
[Augustin, 2006]
43
D
t
Errors happen even under
per-flow load balancing
Flow 1
A
m
L
C
TTL = 2
Port 2
B
E
t
D
TTL = 3
Port 3
 Traceroute
–
–
Needs to match probe to response
Response only has the header of the issued probe
[Augustin, 2006]
44
uses the destination port as identifier
Paris traceroute
 Solves
–
the problem with per-flow load balancing
Probes to a destination belong to same flow
 Changes
–
Use the UDP checksum
m
[Augustin, 2006]
45
the location of the probe identifier
L
A
C
TTL = 2
Port 1
TTL = 3
Port 1
Checksum 2
Checksum 3
B
D
E
t
Topology from traceroutes
Actual topology
1
m1
A
3
2
1
1 B 2
3
3
C 4
2
D
1 2
t1
Inferred topology
D.1
m1 A.1
C.1
C.2
t2
m2
m2
B.3

Inferred nodes = interfaces, not routers

Coverage depends on monitors and targets
46
t1
–
Misses links and routers
–
Some links and routers appear multiple times
t2
Alias resolution:
Map interfaces to routers

Direct probing
–
–

Responses from the same
router will have close IP
identifiers and same TTL
Record-route IP option
–
Records up to nine IP
addresses of routers in the path
[Spring, 2002] [Sherwood, 2008]
47
Inferred topology
Probe an interface, may receive
response from another
D.1
m1 A.1
t1
C.1
C.2
m2
B.3
t2
same router
Large-scale topology
measurements
 Probing
–
–
–
a large topology takes time
E.g., probing 1200 targets from PlanetLab nodes
takes 5 minutes on average (using 30 threads)
Probing more targets covers more links
But, getting a topology snapshot takes longer
 Snapshot
–
Paths may change during snapshot
 Hard
–
48
may be inaccurate
to get up-to-date topology
To know that a path changed, need to re-probe
Faster topology snapshots
 Probing
redundancy
–
Intra-monitor
–
Inter-monitor
 Doubletree
–
Combines backward and
forward probing to
eliminate redundancy
[Donnet, 2005]
49
t1
D
m1 A
C
B
m2
t2
Summary: Topology discovery
 Network
–
–
Own network: routing messages
Neighbor networks: traceroutes
 End
–
–
users: combining traceroutes
Be aware of inaccuracies
•
False or missing links and nodes
•
Hidden hops: stars, tunneling
Fault identification with lower precision
•
50
operators
Determine the network to blame
Outline
 Background
 Fault
detection with no control of targets
–
Active vs. passive measurements
–
Reducing overhead of active measurements
–
Disambiguating one-way failures
 Fault
identification
–
Correlated path reachability without multicast
–
Topology discovery
 Open
51
in network tomography
issues
Tomography algorithms
 Make
robust to measurement noise
 Make
robust to topology uncertainties
–
–
Multiple topologies close to the time of an event
Multiple paths between a monitor and a target
 Identify
–
–
52
other types of faults
Path specific
Intermittent
Monitoring techniques
 Track
–
dynamics of large-scale topologies
Fast identification requires up-to-date topology
 Passive
–
–
detection inside a network
High speed packet processing
Detect faults with incomplete information
 Large-scale
–
Consolidating measurements becomes bottleneck
 Define
–
–
53
deployment
changes to easy fault diagnosis
Router reports or behavior
Common monitoring infrastructure
REFERENCES
54
Network tomography theory

Survey on network tomography
–

Traffic matrix estimation
–

55
R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, “Network
Tomography: Recent Developments”, Statistical Science, Vol. 19, No.
3 (2004), 499-517.
Y. Vardi, “Network Tomography: Estimating Source-Destination Traffic
Intensities from Link Data”, Journal of the American Statistical
Association, Vol. 91, 1996.
Inference of link performance/connectivity
–
MINC project: http://gaia.cs.umass.edu/minc/
–
A. Adams et al., “The Use of End-to-end Multicast Measurements for
Characterizing Internal Network Behavior”, IEEE Communications
Magazine, May 2000.
Binary tomography

Single-source tree algorithm
–

Applying tomography in one network
–

A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot,
“NetDiagnoser:Troubleshooting network unreachabilities using end-toend probes and routing data”, CoNEXT, 2007.
Obtaining accurate path status for binary tomography
–
56
R. R. Kompella, J. Yates, A. Greenberg, A. C. Snoeren, “Detection
and Localization of Network Blackholes”, IEEE INFOCOM, 2007.
Applying tomography in multiple network topology
–

N. Duffield, “Network Tomography of Binary Network Performance
Characteristics”, IEEE Transactions on Information Theory, 2006.
I. Cunha, R. Teixeira, N. Feamster, and C. Diot, “Measurement
Methods for Fast and Accurate Blackhole Identification with Binary
Tomography”, Thomson technical report CR-PRL-2009-05-006, 2009.
Topology from inside

IS-IS monitoring
–

OSPF monitoring
–

A. Shaikh and A. Greenberg, “OSPF Monitoring: Architecture,
Design and Deployment Experience”, NSDI 2004
Commercial products
–
57
R. Mortier, “Python Routeing Toolkit (`PyRT')”,
https://research.sprintlabs.com/pyrt/
Packet Design: http://www.packetdesign.com/
Topology with traceroute

Tracing accurate paths under load-balancing
–

Reducing overhead to trace topology of a network and alias
resolution with direct probing
–

R. Sherwood, A. Bender, N. Spring, “DisCarte: A Disjunctive Internet
Cartographer”, SIGCOMM, 2008.
Reducing overhead to trace a multi-network topology
–
58
N. Spring, R. Mahajan, and D. Wetherall, “Measuring ISP Topologies
with Rocketfuel”, SIGCOMM 2002.
Use of record route to obtain more accurate topologies
–

B. Augustin et al., “Avoiding traceroute anomalies with Paris
traceroute”, IMC, 2006.
B. Donnet, P. Raoult, T. Friedman, and M. Crovella, “Efficient
Algorithms for Large-Scale Topology Discovery”, SIGMETRICS, 2005.
Reducing overhead of active
fault detection


Selection of paths to probe
–
H. Nguyen and P. Thiran, “Active measurement for multiple link
failures diagnosis in IP networks”, PAM, 2004.
–
Yigal Bejerano and Rajeev Rastogi, “Robust monitoring of link
delays and faults in IP networks”, INFOCOM, 2003.
Selection of the frequency to probe paths
–
59
H. X. Nguyen , R. Teixeira, P. Thiran, and C. Diot, " Minimizing
Probing Cost for Detecting Interface Failures: Algorithms and
Scalability Analysis", INFOCOM, 2009.
Internet-wide fault detection
systems

Detection with BGP monitoring plus continuous pings, spoofing to
disambiguate one-way failures, traceroute to locate faults
–

Detection with passive monitoring of traffic of peer-to-peer systems
or content distribution networks, traceroutes to locate faults
–
60
E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D.
Wetherall, T. Anderson, “Studying Black Holes in the Internet with
Hubble”, NSDI, 2008.
M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, “PlanetSeer:
Internet Path Failure Monitoring and Characterization in Wide-Area
Services”, OSDI, 2004.