Traffic Prediction on the Internet
Anne Denton

Outline
- Paper by Y. Baryshnikov, E. Coffman, D. Rubenstein, and B. Yimwadsana
- Solutions
- Time-series prediction
- Our work for the KDD-cup 03

Time Series Prediction on the Internet
By Y. Baryshnikov, E. Coffman, D. Rubenstein, and B. Yimwadsana
- Adjustment to "hot spots"
- Avoiding degradation, even "denial of service"
- Can "hot spots" be predicted?
- Can predicted "hot spots" be avoided?

What are "hot spots"?
- Exceptionally large numbers of requests
- Spontaneous, with a short lifetime
- "Instant" ramp-up in traffic; only valid on long time scales
- Claim: the time scale of the increase is larger than the time scale needed to react
- Why does the increase take time? Word spreads gradually ("passing on the word")

How good does a predictor have to be?
- The cost of missing a "hot spot" is higher than the aggregate cost of false alarms (similar to hurricane warnings)

Examples
- Olympics (Nagano, 1998)
- Soccer World Cup (1998)
- NASA (1995)

What to do about "hot spots"? <Detour>
"The Columbia Hotspot Rescue Service: A Research Plan"
E. Coffman, P. Jelenkovic, J. Nieh, and D. Rubenstein

Approaches
- Deal with high request rates ad hoc
- Build a better network (expensive): content delivery services, caching, extra bandwidth
- Suggested solution: use available and underutilized resources

Hotspot Rescue Service
- Server-based approach: requires additional resources from the server when necessary; resources are provided by other members of the Hotspot Rescue Service
- Peer-to-peer approach: requires additional resources from clients when necessary; caching

Four Phases
- Prediction (see rest of presentation): server-based: daemons; P2P: plug-ins
- Replication: server-based: replication of objects; P2P: identified cached copies; more advanced: redistribution of traffic load
- Notification: modifications to DNS (Domain Name System); P2P system proactively announces hot objects and indicates alternative locations?
- Termination
<End of Detour>

Tail of Distribution
- Requests per 10-second time slot
- X-axis: number of hits per time slot
- Y-axis: probability that that number of hits is exceeded

Time Scales
- Prediction relies on correlation between values at different times
- Autocorrelation function: C(tau) = Integral of f(t) f(t + tau) dt
- Predictability on time scales of 5-30 min

Prediction Algorithm
- A standard problem in signal processing and econometrics
- Internet traffic is particularly bursty
- Simplest model: linear extrapolation

Structure of Prediction Algorithms
- Traffic observation: number of requests in the time unit (t-1, t], usually 1 s
- Prediction window: duration W_p > 0
- Advance notice: Delta >= 0
- Prediction at time t: a mapping of the observations in [t - W_p, t] to a number p_t >= 0 of requests predicted in the interval [t + Delta, t + Delta + 1], i.e., Delta units in the future

Linear Prediction
- Least-squares linear fit: p_t = f_t(t + Delta) with f_t(s) = a_t s + b_t
- Minimize Sum over i = t - W_p, ..., t of (f_t(i) - r_i)^2
- Performance: O(W + T), where W is the window size and T the uptime duration
- Problem: the prediction window size must match the burstiness parameters governing the request flow
- Results depend on the properties of the autocorrelation function

Conclusions of Paper
- Build a load-based taxonomy of web server traffic
- Traffic depends on technological, sociological, and psychological factors
- Look for quantification of basic patterns reflecting behavior

Do we agree???
- Why cluster when we can classify!

Our Approach
- Normally, time-series prediction uses only data from the series itself
- We use similarity to other instances, e.g., other web sites
- Model-free weighted nearest-neighbor approach
- Problem: how do we integrate time?
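The sliding-window least-squares predictor described above (fit f_t(s) = a_t s + b_t to the observations in [t - W_p, t], then evaluate at t + Delta) can be sketched in a few lines of Python. This is a minimal illustration; the function name and the synthetic ramp series are mine, not from the paper:

```python
def linear_predict(requests, window, delta):
    """Least-squares linear fit f_t(s) = a*s + b over the last
    window + 1 observations r_i, i in [t - window, t]; extrapolate
    to predict the request count delta steps in the future.
    The result is clamped at zero (a count cannot be negative)."""
    t = len(requests) - 1
    pts = [(s, requests[s]) for s in range(t - window, t + 1)]
    n = len(pts)
    sum_s = sum(s for s, _ in pts)
    sum_r = sum(r for _, r in pts)
    sum_sr = sum(s * r for s, r in pts)
    sum_ss = sum(s * s for s, _ in pts)
    # Closed-form least-squares slope and intercept.
    a = (n * sum_sr - sum_s * sum_r) / (n * sum_ss - sum_s ** 2)
    b = (sum_r - a * sum_s) / n
    return max(a * (t + delta) + b, 0.0)

# Example: a linearly ramping request series is extrapolated exactly.
ramp = [2 * i + 5 for i in range(50)]          # r_i = 2i + 5
print(linear_predict(ramp, window=10, delta=3))  # -> 109.0 (= 2*52 + 5)
```

On bursty traffic the fit is only as good as the window choice, which is exactly the paper's caveat: the window size must match the burstiness of the request flow.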
Typical Nearest Neighbor Classification / Regression
- Relation R(A1, ..., An, C)
- Attributes A_i; C is the class label (classification) or a continuous variable (regression)
- Based on a distance function over the A_i
- Either the k nearest neighbors or all neighbors within a range
- Use a kernel function to weight closer neighbors more highly

Weighting of Attributes
- Some attributes are more important than others
- Apply a scaling to the space
- Optimize the weights through hill climbing or a genetic algorithm
- How does this generalize to a time series?

Our Answer
- Identify "relevant" sections in the time series, e.g., times with already high download rates
- We call each relevant section a "prediction"

Predictions
- Each prediction contains information about: the nature of the time series; the time instance in question, i.e., the history of requests; the actual change in requests
- Collecting the predictions in a table leads to a relation, just as in the standard classification / regression setting

Data Set
- Paper citations in the arXiv e-print archive
- Background: KDD-cup 03
- Predict the change in citations over successive 3-month periods
- Only consider periods with at least 6 citations
- Evaluation: L1 distance (Manhattan distance) between predicted and actual difference
- Very close match between citation history and request history
- Predict the change in requests; only consider periods that already show a large number of requests

Attributes of a "Prediction"
- Quantitative attributes: number of citations in the window; gradient of citations in the window; aggregate number of citations up to and through the window (assuming a finite time series)
- Attribute values given by the time series: keyword occurrences, author, number of revisions of the paper, maximum time interval between revisions, country of origin, format

Similarity Function
- Common kernel function (Gaussian): K(x0, x1) = exp(-(x0 - x1)^2 / (2 sigma^2))
- What worked better: K(x0, x1) = 1 / (1 + w |x0 - x1|)

[Plot of the two similarity functions, Gaussian and 1/(1+x), for x from 0 to 20]

Accuracy
- No linear extrapolation baseline available; it could lead to negative citations

Comparison (L1 error; lower is better)
- Default prediction, no change:
  1851
- Very simple model (decrease by 0.3 in 3 months): 1532
- Prediction based on the average of the time series (synchronized at the first non-zero value): 1593
- Prediction based on quantitative attributes: 1465
- Full prediction (preliminary): 1357
- Weight-optimized (very preliminary): error reduced from 1414 to 1391

[Results chart: four series plotted over 11 periods, y-axis 0-3000]

Conclusions
- The method works well for citation prediction
- Yet to be tested for hot-spot prediction
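To close, the model-free weighted nearest-neighbor regression described earlier can be sketched as follows: the predicted change is the similarity-weighted average of the observed changes in the prediction table, using the K(x0, x1) = 1/(1 + w |x0 - x1|) kernel that worked better than the Gaussian. This is a hypothetical minimal sketch; the toy table, attribute choices, and weights are mine, not the actual experiment:

```python
def kernel(x0, x1, w=1.0):
    """Similarity kernel K(x0, x1) = 1 / (1 + w * |x0 - x1|);
    w scales the attribute (more important -> larger w)."""
    return 1.0 / (1.0 + w * abs(x0 - x1))

def predict_change(query, table, weights):
    """Model-free weighted nearest-neighbor regression: predict the
    change for `query` as the similarity-weighted average of the
    observed changes.  Each table row is (attribute_vector, change);
    the overall similarity is the product of per-attribute kernels."""
    num = den = 0.0
    for attrs, change in table:
        sim = 1.0
        for q, a, w in zip(query, attrs, weights):
            sim *= kernel(q, a, w)
        num += sim * change
        den += sim
    return num / den

# Toy "prediction" table: (citations in window, gradient) -> change.
table = [((6, 1.0), 2.0), ((20, 3.0), 8.0), ((7, 0.5), 1.0)]
# The query matches the first row exactly, so its change dominates.
print(predict_change((6, 1.0), table, weights=(1.0, 1.0)))
```

The attribute weights used here are fixed; in the approach above they would be tuned by hill climbing or a genetic algorithm against the L1 error on held-out data.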