Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Network Traffic Modeling Mark Crovella Boston University Computer Science Outline of Day • • • • • 9:00 – 10:45 Lecture 1, including break 10:45 – 12:00 Exercise Set 1 12:00 – 13:30 Lunch 13:30 – 15:15 Lecture 2, including break 15:15 – 17:00 Exercise Set 2 The Big Picture • There are two main uses for Traffic Modeling: – Performance Analysis • Concerned with questions such as delay, throughput, packet loss. – Network Engineering and Management: • Concerned with questions such as capacity planning, traffic engineering, anomaly detection. • Some principal differences are that of timescale and stationarity. Relevant Timescales Performance effects happen on short timescales Network Engineering effects happen on long timescales from nanoseconds up to an hour 1 usec 1 sec from an hour to months 1 hour 1 day 1 week Stationarity, informally • “A stationary process has the property that the mean, variance and autocorrelation structure do not change over time.” • Informally: “we mean a flat looking series, without trend, constant variance over time, a constant autocorrelation structure over time and no periodic fluctuations (seasonality).” NIST/SEMATECH e-Handbook of Statistical Methods http://www.itl.nist.gov/div898/handbook/ The 1-Hour / Stationarity Connection • Nonstationarity in traffic is primarily a result of varying human behavior over time • The biggest trend is diurnal • This trend can usually be ignored up to timescales of about an hour, especially in the “busy hour” Outline • Morning: Performance Evaluation – Part 0: Stationary assumption – Part 1: Models of fine-timescale behavior – Part 2: Traffic patterns seen in practice • Afternoon: Network Engineering – Models of long-timescale behavior – Part 1: Single Link – Part 2: Multiple Links Morning Part 1: Traffic Models for Performance Evaluation • Goal: Develop models useful for – Queueing analysis • eg, G/G/1 queues – Other analysis • eg, traffic shaping – Simulation • eg, router or network simulation A Reasonable Approach • Fully characterizing a stochastic process can be impossible – Potentially infinite set of properties to capture – Some properties can be very hard to estimate • A reasonable approach is to concentrate on two particular properties: marginal distribution and autocorrelation Marginals and Autocorrelation Characterizing a process in terms of these two properties gives you – – – – a good approximate understanding of the process, without involving a lot of work, or requiring complicated models, or requiring estimation of too many parameters. … Hopefully! Marginals Given a stochastic process X={Xi}, we are interested in the distribution of any Xi: i.e., f(x) = P(Xi=x) Since we assume X is stationary, it doesn’t matter which Xi we pick. Estimated using a histogram: Histograms and CDFs • A Histogram is often a poor estimate of the pdf f(x) because it involves binning the data • The CDF F(x) = P[Xi <= x] will have a point for each distinct data value; can be much more accurate Modeling the Marginals • We can form a compact summary of a pdf f(x) if we find that it is well described by a standard distribution – eg – – – – – Gaussian (Normal) Exponential Poisson Pareto Etc • Statistical methods exist for – asking whether a dataset is well described by a particular distribution – Estimating the relevant parameters Distributional Tails • A particularly important part of a distribution is the (upper) tail • P[X>x] • Large values dominate statistics and performance • “Shape” of tail critically important Light Tails, Heavy Tails • Light – Exponential or faster decline • Heavy – Slower than any exponential f1(x) = 2 exp(-2(x-1)) f2(x) = x-2 Examining Tails • Best done using log-log complementary CDFs • Plot log(1-F(x)) vs log(x) 1-F2(x) 1-F1(x) Heavy Tails Arrive pre-1985: Scattered measurements note high variability in computer systems workloads 1985 – 1992: Detailed measurements note “long” distributional tails – File sizes – Process lifetimes 1993 – 1998: Attention focuses specifically on (approximately) polynomial tail shape: “heavy tails” post-1998: Heavy tails used in standard models Power Tails, Mathematically We say that a random variable X is power tailed if: P[ X x] ~ x where a ~ b means lim a a( x) x b ( x ) 0 a 2 1. Focusing on polynomial shape allows Parsimonious description Capture of variability in a parameter A Fundamental Shift in Viewpoint • Traditional modeling methods have focused on distributions with “light” tails – Tails that decline exponentially fast (or faster) – Arbitrarily large observations are vanishingly rare • Heavy tailed models behave quite differently – Arbitrarily large observations have nonnegligible probability – Large observations, although rare, can dominate a system’s performance characteristics Heavy Tails are Surprisingly Common • Sizes of data objects in computer systems – Files stored on Web servers – Data objects/flow lengths traveling through the Internet – Files stored in general-purpose Unix filesystems – I/O traces of filesystem, disk, and tape activity • Process/Job lifetimes • Node degree in certain graphs – Inter-domain and router structure of the Internet – Connectivity of WWW pages • Zipf’s Law Evidence: Web File Sizes Barford et al., World Wide Web, 1999 Evidence: Process Lifetimes Harchol-Balter and Downey, ACM TOCS, 1997 The Bad News • Workload metrics following heavy tailed distributions are extremely variable • For example, for power tails: – When a 2, distribution has infinite variance – When a 1, distribution has infinite mean • In practice, empirical moments are slow to converge – or nonconvergent • To characterize system performance, either: – Attention must shift to distribution itself, or – Attention must be paid to timescale of analysis Heavy Tails in Practice Power tails with a=0.8 Large observations dominate statistics (e.g., sample mean) Autocorrelation • Once we have characterized the marginals, we know a lot about the process. • In fact, if the process consisted of i.i.d. samples, we would be done. • However, most traffic has the property that its measurements are not independent. • Lack of independence usually results in autocorrelation • Autocorrelation is the tendency for two measurements to both be greater than, or less than, the mean at the same time. Autocorrelation Measuring Autocorrelation Autocorrelation Function (ACF) (assumes stationarity): R(k) = Cov(Xn,Xn+k) = E[Xn Xn+k] – E2[X0] ACF of i.i.d. samples How Does Autocorrelation Arise? Network traffic is the superposition of flows Client Request Internet (TCP/IP) “click” Response Server Why Flows?: Sources appear to be ON/OFF ON P2: P3: • • • { { P1: OFF Superposition of ON/OFF sources Autocorrelation P1: P2: P3: Morning Part 2: Traffic Patterns Seen in Practice Traffic patterns on a link are strongly affected by two factors: – amount of multiplexing on the link • Essentially – how many flows are sharing the link? – Where flows are bottlenecked • Is each flow’s bottleneck on, or off the link? • Do all bottlenecks have similar rate? Low Multiplexed Traffic • Marginals: highly variable • Autocorrelation: low Highly Multiplexed Traffic High Multiplexed, Bottlenecked Traffic • Marginals: tending to Gaussian • Autocorrela tion: high Highly Mutiplexed, Mixed-Bottlenecks dec-pkt-1 (Internet Traffic Archive) Alpha and Beta Traffic • ON/OFF model revisited: High variability in connection rates (RTTs) Low rate = beta High rate = alpha + + + = = fractional Gaussian noise stable Levy noise Rice U., SPIN Group Long Range Dependence R[k] ~ a-k a>1 R[k] ~ k-a 0 < a < 1 H=1-a/2 Correlation and Scaling • Long range dependence affects how variability scales with timescale • Take a traffic timeseries Xn, sum it over blocks of size m – This is equivalent to observing the original process on a longer timescale • How do the mean and std dev change? – Mean will always grow in proportion to m – For i.i.d. data, the std dev will grow in proportion to sqrt(m) – So, for i.i.d. data, the process is “smoother” at longer timescale Self-similarity: unusual scaling of variability • Exact self-similarity of a zero-mean, stationary process Xn d Xt m H tm X i ( t 1) m 1 i for all m N, t 0 • H: Hurst parameter 1/2 < H < 1 • H = 1/2 for i.i.d. Xn • LRD leads to (at least) asymptotic s.s. Self Similarity in Practice H=0.95 H=0.50 10ms 1s 100s The Great Wave (Hokusai) How Does Self-Similarity Arise? Self-similarity Autocorrelation Flows Autocorrelation declines like a power law Distribution of flow lengths has power law tail Power Tailed ON/OFF sources Self-Similarity P1: P2: P3: OFF { { ON Measuring Scaling Properties • In principle, one can simply aggregate Xn over varying sizes of m, and plot resulting variance as a function of m • Linear behavior on a log-log plot gives an estimate of H (or a). • Slope > -1 indicates LRD WARNING: this method is very sensitive to violation of assumptions! Better: Wavelet-based estimation Veitch and Abry Optional Material: Performance Implications of Self-Similarity Performance implications of S.S. • Asymptotic queue length distribution (G/G/1) – For SRD Traffic: P[Q x] ~ exp( c1 x) – For LRD Traffic: P[Q x] ~ exp( c2 x – Severe - but, realistic? 2 2 H ) Evaluating Self-Similarity • Queueing Models like these are open systems – delay does not feed back to source • TCP dynamics are not being considered – packet losses cause TCP to slow down • Better approach: – Closed network, detailed modeling of TCP dynamics – self-similarity traffic generated “naturally” Simulating Self-Similar Traffic • Simulated network with multiple clients, servers • Clients alternative between requests, idle times • Files drawn from heavy-tailed distribution – Vary a to vary self-similarity of traffic • Each request is simulated at packet level, including detailed TCP flow control • Compare with UDP (no flow control) as an example of an open system Traffic Characteristics • Self-similarity varies smoothly as function of a Performance Effects: Packet Loss Open Loop: UDP Closed Loop: TCP Performance Effects: TCP Throughput Performance Effects: Buffer Queue Lengths Open Loop: UDP Closed Loop: TCP Severity of Packet Delay Performance Implications • Self-similarity is present in Web traffic – The internet’s most popular application • For the Web, the causes of s.s. can be traced to the heavy-tailed distribution of Web file sizes – Caching doesn’t seem to affect things much – Multimedia tends to increase tail weight of Web files – But, even text files alone appear to be heavytailed Performance Implications (continued) • Knowing how s.s. arises allows us to recreate it “naturally” in simulation • Effects of s.s. in simulated TCP networks – Packet loss not as high as open-loop models might suggest – Throughput not a problem – Packet delays are the big problem Morning Lab Exercises • For each dataset, explore its marginals – Histograms – CDFs, CCDFs, – Log-log CCDFs to look at tails • For each dataset, explore its correlations – ACFs – Logscale diagrams – Compare to scrambled dataset • Study performance – Simple queueing – Compare to scrambled dataset Afternoon: Network Engineering • Moving from stationary domain to nonstationary • Goal: Traffic models that are useful for – capacity planning – traffic engineering – anomaly / attack detection • Two main variants – Looking at traffic on a single link at a time – Looking at traffic on multiple or all links in a network simultaneously Part 1: Single Link Analysis • The general modeling framework that is most often used is “signal + noise” • Sometimes interested in the signal, sometimes the noise… • So, signal processing techniques are common – Frequency Domain / Spectral Analysis • Generally based on FFT – Time-Frequency Analysis • Generally based on wavelets A Typical Trace Notable features: periodicity noisiness “spikes” Capacity Planning • Here, mainly interested in the “signal” • Want to predict long-term trends • What do we need to remove? – Noise – Periodicity K. Papagiannaki et al., Infocom 2003 Periodicity: Spectral Analysis Denoising with Wavelets Capacity Planning via Forecasting Anomaly detection • Goal: models that are useful in detecting – – – – – Network equipment failures Network misconfigurations Flash crowds Attacks Network Abuse Misconfiguration detection via Wavelets Traffic High Freq. Med Freq. Low Freq. P. Barford et al., Internet Measurement Workshop 2002 Flash Crowd Detection Long Term Change in Mean Traffic (8 weeks) Detecting DoS attacks Afternoon Part 2: Multiple Links • How to analyze traffic from multiple links? – Clearly, could treat as a collection of single links, and proceed as before – But, want more: to detect trends and patterns across multiple links • Observation: multiple links share common underlying patterns – Diurnal variation should be similar across links – Many anomalies will span multiple links • Problem is one of pattern extraction in high dimension – Dimension is number of links Example Link Traces from a Single Network Some have visible structure, some less so… High Dimensionality: A General Strategy • Look for a low-dimensional representation preserving the most important features of data • Often, a high-dimensional structure may explainable in terms of a small number of independent variables • Commonly used tool: Principal Component Analysis (PCA) Principal Component Analysis For any given dataset, PCA finds a new coordinate system that maps maximum variability in the data to a minimum number of coordinates New axes are called Principal Axes or Components Correlation in Space, not Time time # Links Traditional Frequency-Domain Analysis PCA on Link Traffic # Links # Links time time # Links # Links PC Single Link X: V: OD flow matrix XVS1=U Principal matrix (Orthogonal) Eigenlink U: Eigenlink matrix (Columns are Orthonormal) X=USVT Singular Value Decomposition X=USVT is the singular value decomposition of X PCA on Link Traffic (2) ; Singular values indicate the energy attributable to a principal component Each link is weighted sum of all eigenlinks Low Intrinsic Dimensionality of Link Data A plot of the singular values reveals how much energy is captured by each PC Sharp elbow indicates that most of the energy captured by 5-10 singular values, for all datasets Implications of Low Instrinsic Dimensionality • Apparently, we can reconstruct X with high accuracy, keeping only a few columns of U X’=U’SVT • A form of lossy data compression • Even more, a way of extracting the most significant part of X, automatically – “signal + noise”? Approximating With Top 5 Eigenlinks Approximating With Top 5 Eigenlinks Approximating With Top 5 Eigenlinks A Link, Reconstructed Link traffic Components 1-5 Components 6-10 All the 100+ rest Anomaly Detection Single Link Approach: Use wavelets to detrend each flow in isolation. [Barford:IMW02] Multi Link approach: Detrend all links simultaneously by choosing only certain principal components. X = X’ + X’’ PCA based anomaly detection L2 norm of entire traffic vector X L2 norm of residual vector X’’ Traffic Forecasting Single-Link approach: Treat each flow timeseries independently. Use wavelets to extract trends. Build timeseries forecasting models on trends. [Papagiannaki:INFOCOM03] X’=U’SVT Multi-Link approach: Build forecasting models on most significant eigenlinks as trends. Allows simultaneous examination and forecasting for entire ensemble of links. Conclusions • Whew! • Bibliography on handout • Traffic analysis methods vary considerably depending on: – Question being asked – Timescale – Stationarity Conclusions Performance Evaluation Network Engineering -Marginals -Watch out for heavy tails -Correlation (in time) -Watch out for LRD / Self Similarity •Signal + Noise •Single-link: Frequency domain analysis •Multi-link: Exploit Spatial correlation 1 usec 1 sec 1 hour 1 day 1 week Afternoon Lab Exercises • For each dataset, – Perform PCA – Assess the variance in each component – Reconstruct using small number of components • Time / Interest permitting, – Analyze some of the single link timeseries using wavelets (matlab wavelet toolbox)