Tomography-based Overlay Network Monitoring
Hugo Angelmar
Slides courtesy of Yan Chen, David Bindel, and Randy H. Katz

Motivation
• Infrastructure ossification has driven the rise of overlay and P2P applications
• Such applications are flexible in their choice of paths and targets, so they can benefit from end-to-end (E2E) distance monitoring
  – Overlay routing/location
  – VPN management/provisioning
  – Service redirection/placement ...
• Requirements for an E2E monitoring system
  – Scalable & efficient: small amount of probing traffic
  – Accurate: captures congestion/failures
  – Incrementally deployable
  – Easy to use

Existing Work
• General metrics: RON (n² measurements)
• Latency estimation
  – Clustering-based: IDMaps, Internet Isobar, etc.
  – Coordinate-based: GNP, ICS, Virtual Landmarks
• Network tomography
  – Focuses on inferring the characteristics of physical links rather than of E2E paths
  – Limited measurements -> under-constrained system, unidentifiable links

Problem Formulation
Given an overlay of n end hosts and O(n²) paths, how do we select a minimal subset of paths to monitor so that the loss rates/latencies of all other paths can be inferred?
Assumptions:
• Topology is measurable
• We can only measure E2E paths, not individual links

Our Approach
[Figure: an Overlay Network Operation Center collects topology and measurements from the end hosts]
• Select a basis set of k paths that fully describes all O(n²) paths (k « O(n²))
• Monitor the loss rates of the k paths, and infer the loss rates of all other paths
• Applicable to any additive metric, such as latency

Modeling of Path Space
[Figure: end hosts A, B, C, D; path $p_1$ traverses links 1 and 2]
With path loss rate $p_1$ and link loss rates $l_i$:
$1 - p_1 = (1 - l_1)(1 - l_2)$
$\log(1 - p_1) = \log(1 - l_1) + \log(1 - l_2)$
Writing $x_i = \log(1 - l_i)$ and $b_1 = \log(1 - p_1)$:
$\begin{bmatrix} 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = b_1$

Putting All Paths Together
In total there are $r = O(n^2)$ paths and $s$ links, with $s \ll r$:
$Gx = b$, where the path matrix $G \in \{0,1\}^{r \times s}$, $x$ is the $s \times 1$ link loss rate vector, and $b$ is the $r \times 1$ path loss rate vector.

Sample Path Matrix
[Figure: paths $b_1$ (links 1, 2), $b_2$ (link 3), and $b_3$ (links 1, 2, 3) among hosts A, B, C, D]
$G = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \qquad G \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}$
[Figure: in link space, $(1, 1, 0)$ lies in the path/row space (measured) and $(1, -1, 0)$ in the null space (unmeasured)]
• $x_1 - x_2$ is unknown => we cannot compute $x_1$ and $x_2$ individually
• The set of vectors $\alpha \, [1 \; {-1} \; 0]^T$ forms the null space
• To separate the identifiable and unidentifiable components, write $x = x_G + x_N$:
$x_G = \frac{b_1}{2} \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} + b_2 \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} b_1/2 \\ b_1/2 \\ b_2 \end{bmatrix}, \qquad x_N = \frac{x_1 - x_2}{2} \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}$

Intuition through Topology Virtualization
Virtual links:
• Minimal path segments whose loss rates are uniquely identifiable
• Can fully describe all paths
• $x_G$ is composed of virtual links
[Figure: in the sample topology, links 1 and 2 collapse into a single virtual link]
All E2E paths lie in the path space, i.e., $G x_N = 0$, so
$b = Gx = G x_G + G x_N = G x_G$

More Examples
[Figures: real links (solid) and all of the overlay paths (dotted) traversing them, together with the virtual links obtained after virtualization]
$G = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}$, rank(G) = 2
$G = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}$, rank(G) = 3

Algorithms
• Select k = rank(G) linearly independent paths to monitor
  – Use QR decomposition
  – Leverage the sparsity of G: time O(rk²) and memory O(k²)
  – E.g., 10 minutes for n = 350 (r = 61,075) and k = 2,958
• Compute the loss rates of all other paths
  – Time O(k²) and memory O(k²)
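As a concrete illustration of these two steps, here is a minimal sketch in Python, assuming numpy/scipy. The pivoted QR of Gᵀ stands in for the system's sparse QR machinery, and the toy matrix and probe values are hypothetical, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import qr

# Toy path matrix from the sample slide: 3 paths over 3 links;
# row i has a 1 for every link that path i traverses.
G = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]], dtype=float)

# Pivoted QR of G^T orders the rows of G so that the first
# k = rank(G) pivots index linearly independent paths.
_, _, piv = qr(G.T, pivoting=True)
k = np.linalg.matrix_rank(G)
basis = piv[:k]          # the k paths to actively monitor
G_k = G[basis]           # reduced k x s path matrix

# Suppose probing the basis paths yields these loss rates
# (hypothetical numbers; k == 2 for this toy G). b_i = log(1 - p_i).
p_basis = np.array([0.03, 0.01])
b_k = np.log(1.0 - p_basis)

# x_G: minimum-norm solution of G_k x = b_k, i.e., the
# identifiable component of the link vector.
x_G, *_ = np.linalg.lstsq(G_k, b_k, rcond=None)

# Since G x_N = 0, b = G x_G recovers every path's loss rate.
all_loss = 1.0 - np.exp(G @ x_G)
print("monitored paths:", basis, "inferred loss rates:", all_loss)
```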
How Many Measurements Are Saved? Is k « O(n²)?
For a power-law Internet topology:
• When the majority of end hosts are on the overlay: k = O(n) (with proof)
• When only a small portion of end hosts are on the overlay:
  – If the Internet were a pure hierarchy (a tree): k = O(n)
  – If the Internet had no hierarchy at all (worst case, a clique): k = O(n²)
  – The Internet in fact has a moderate hierarchical structure [TGJ+02]: for reasonably large n (e.g., 100), k = O(n log n) (extensive linear regression tests on both synthetic and real topologies)

Practical Issues
• Tolerance of topology measurement errors
  – We care about path loss rates rather than those of any interior link
  – Poor router alias resolution => similar loss rates get assigned to the same links
  – Unidentifiable routers => add virtual links to bypass them
• Measurement load balancing on end hosts
  – Randomly order the paths for the scan and selection of G
• Topology changes
  – Efficient algorithms for incrementally updating G when end hosts are added/removed and when routes change

Topology Changes
• Basic building block: adding/removing one path
  – Incremental update: O(k²) time (vs. O(n²k²) for a full re-scan)
  – Adding a path: check its linear dependency against the old basis set G
  – Deleting a path p is hard when p ∈ G: the essential information described by p is a vector y such that y ∈ span(G) but y ∉ span(G − p); if such a y exists, add to G any path x with x ∉ span(G − p) and x · y ≠ 0
• Adding/removing end hosts and routing changes reduce to these building blocks
• Topology is relatively stable on the order of a day => incremental detection suffices

Evaluation
• Simulation
  – Topology
    • BRITE (Barabási-Albert, Waxman, hierarchical): 1K-20K nodes
    • Real topology from Mercator: 284K nodes
  – Fraction of end hosts on the overlay: 1-10%
  – Loss rate distribution (90% of links are good)
    • Good link: 0-1% loss rate; bad link: 5-10% loss rate
    • Good link: 0-1% loss rate; bad link: 1-100% loss rate
  – Loss model
    • Bernoulli: independent packet drops
    • Gilbert: bursty packet drops
  – Path loss rates simulated via transmission of 10K packets
• Experiments on PlanetLab

Evaluation
• Extensive simulations
• Experiments on PlanetLab
  – 51 hosts, each from a different organization
  – 51 × 50 = 2,550 paths
  – On average k = 872
• Results highlights
  – Average real loss rate: 0.023
  – Absolute error mean: 0.0027; 90% < 0.014
  – Relative error mean: 1.1; 90% < 2.0
  – On average 248 out of the 2,550 paths have no or incomplete routing information
  – No router aliases resolved

Areas and domains of the 51 hosts:
Area               | # of hosts | Domains
US                 | 40         | .edu 33, .org 3, .net 2, .gov 1, .us 1
International      | 11         |
  Europe           | 6          | France 1, Sweden 1, Denmark 1, Germany 1, UK 2
  Asia             | 2          | Taiwan 1, Hong Kong 1
  Others           | 3          | Canada 2, Australia 1

Experiments on PlanetLab
• 51 hosts, each from a different organization; 51 × 50 = 2,550 paths
• Simultaneous loss rate measurement
  – 300 trials, 300 ms each
  – In each trial, send a 40-byte UDP packet to every other host
• Simultaneous topology measurement: traceroute
• Experiments run 6/24-6/27; 100 experiments in peak hours

Sensitivity Test of Sending Frequency
• Big jump in the number of lossy paths once the sending rate exceeds 12.8 Mbps

PlanetLab Experiment Results
• Loss rate distribution:
Loss rate           | % of paths
[0, 0.05)           | 95.9%
[0.05, 1.0] (lossy) | 4.1%, split as: [0.05, 0.1): 15.2%, [0.1, 0.3): 31.0%, [0.3, 0.5): 23.9%, [0.5, 1.0): 4.3%, 1.0: 25.6%
• Metrics
  – Absolute error |p − p'|: average 0.0027 over all paths, 0.0058 over lossy paths
  – Relative error [BDPT02] (transcribed in the sketch after this slide):
    $F(p, p') = \max\left( \frac{p(\epsilon)}{p'(\epsilon)}, \frac{p'(\epsilon)}{p(\epsilon)} \right)$, where $p(\epsilon) = \max(\epsilon, p)$ and $p'(\epsilon) = \max(\epsilon, p')$
  – Lossy path inference: coverage and false positive ratio
• On average, k = 872 out of the 2,550 paths are monitored
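A direct transcription of the relative-error metric above; the clamp threshold ε (0.001 here) and the example values are illustrative assumptions.

```python
def relative_error(p: float, p_prime: float, eps: float = 0.001) -> float:
    """Relative error F(p, p') in the style of [BDPT02]: both loss
    rates are clamped below by eps so that near-zero rates do not
    blow up the ratio. The ideal value is 1.0."""
    p_eps = max(eps, p)
    pp_eps = max(eps, p_prime)
    return max(p_eps / pp_eps, pp_eps / p_eps)

# A real loss rate of 2% inferred as 3% gives F = 1.5; two loss
# rates that are both negligible (< eps) give the ideal value 1.0.
print(relative_error(0.02, 0.03))    # 1.5
print(relative_error(0.0002, 0.0))   # 1.0
```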
Accuracy Results for One Experiment
• 95% of absolute errors < 0.0014
• 95% of relative errors < 2.1

Accuracy Results for All Experiments
• For each experiment, take its 95th-percentile absolute and relative errors
• Most experiments have absolute error < 0.0135 and relative error < 2.0

Lossy Path Inference Accuracy
• 90 out of 100 runs have coverage over 85% and a false positive ratio under 10%
• Many of the errors are caused by boundary effects around the 5% lossy-path threshold

Topology/Dynamics Issues
• Out of 13 sets of pairwise traceroutes ...
• On average 248 out of the 2,550 paths have no or incomplete routing information
• No router aliases resolved
• Conclusion: robust against topology measurement errors
• Simulations of adding/removing end hosts and of routing changes also give good results

Performance Improvement with Overlay
• With a single-node relay
• Loss rate improvement
  – Among 10,980 lossy paths: 5,705 paths (52.0%) have their loss rate reduced by 0.05 or more, and 3,084 paths (28.1%) change from lossy to non-lossy
• Throughput improvement
  – Estimated with $throughput = \frac{\sqrt{1.5}}{rtt \cdot \sqrt{loss\ rate}}$ (a sketch follows the conclusions)
  – 60,320 paths (24%) have a non-zero loss rate, so their throughput is computable
  – Among them, 32,939 paths (54.6%) have improved throughput, and 13,734 paths (22.8%) have their throughput doubled or more
• Implication: use overlay paths to bypass congestion and failures

Adaptive Overlay Streaming Media
[Figure: overlay spanning Stanford, UC Berkeley, UC San Diego, and HP Labs]
1. Client sets up a connection to the server
2. Server registers a trigger with the Overlay Network Operation Center
3. Network congestion/failure occurs
4. The center detects the congestion/failure
5. Alert + new overlay path
6. New path is set up through an overlay relay node
7. Skip-free streaming media recovery
• Implemented with a Winamp client and a SHOUTcast server
• Congestion introduced with a Packet Shaper
• Skip-free playback: server buffering and rewinding
• Total adaptation time < 4 seconds

Adaptive Streaming Media Architecture
[Figure: the media source feeds a SHOUTcast server; streams reach Winamp clients over the Internet, optionally through overlay relay nodes. The Overlay Network Operation Center delivers triggering/alerts plus the new path to the path-management modules. Server, relays, and clients each stack an overlay layer (path management, buffering with byte counters, and a calculated concatenation point) on top of the TCP/IP layer; the client adds a Winamp video/audio filter.]

Conclusions
• A tomography-based overlay network monitoring system
  – Given n end hosts, characterizes O(n²) paths with a basis set of O(n log n) paths
  – Selectively monitors the O(n log n) basis paths to measure their loss rates, then infers the loss rates of all other paths
• Both simulations and real Internet experiments are promising
• Built an adaptive overlay streaming media system on top of the monitoring services
  – Bypasses congestion/failures for smooth playback within seconds
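The sketch referenced from the performance-improvement slide: how the $\sqrt{1.5}/(rtt \cdot \sqrt{loss})$ estimate can compare a direct path against a one-hop relay. The hop-combination rule and all numbers here are assumptions for illustration, not measurements from the study.

```python
import math

def tcp_throughput(rtt_s: float, loss: float) -> float:
    """TCP-friendly throughput estimate, in MSS per second:
    sqrt(1.5) / (rtt * sqrt(loss)). Defined only for loss > 0."""
    return math.sqrt(1.5) / (rtt_s * math.sqrt(loss))

def relay_path(rtt1: float, loss1: float, rtt2: float, loss2: float):
    """Combine two overlay hops (assumed rule): RTTs add, and a
    packet survives only if both hops succeed, so loss rates
    combine multiplicatively."""
    rtt = rtt1 + rtt2
    loss = 1.0 - (1.0 - loss1) * (1.0 - loss2)
    return rtt, loss

# Direct path: 60 ms RTT, 5% loss (hypothetical numbers).
direct = tcp_throughput(0.060, 0.05)
# One-hop relay: two 40 ms hops with 0.5% loss each.
rtt, loss = relay_path(0.040, 0.005, 0.040, 0.005)
relayed = tcp_throughput(rtt, loss)
print(f"direct {direct:.0f} MSS/s, relayed {relayed:.0f} MSS/s, "
      f"gain {relayed / direct:.1f}x")
```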