Mining Web Traces: Workload Characterization, Performance Diagnosis, and Applications. Lili Qiu, Microsoft Research. Internal Talk, November 2002. 1

Motivation: Why do we care about Web traces? Content providers: How do users come to visit the Web site? Why do users leave the Web site? Is poor performance the cause? Where are the performance bottlenecks? What content are users interested in? How do users' interests vary in time? How do users' interests vary across different geographical regions? 2

Motivation (Cont.): Web hosting companies and ISPs: accounting & billing; server selection; provisioning server farms (where to place servers); how to save bandwidth by deploying proxy caches; traffic engineering & provisioning. Researchers: Where are the performance bottlenecks? How to improve Web performance? Examples: traffic measurements have influenced the design of HTTP (e.g., persistent connections and pipelining) and TCP (e.g., the initial congestion window). 3

Outline: Background; Web workload characterization; Performance diagnosis; Applications of traces; Bibliography. 4

Part I: Background: Web software components; Web semantic components; Web protocols; Types of Web traces. 5

Web Software Components (figure: Web clients reach Web servers across the Internet, optionally through proxies and replicas): Web clients: applications that establish connections to send Web requests, e.g., Mosaic, Netscape Navigator, Microsoft IE. Web servers: applications that accept connections and service requests by sending back responses, e.g., Apache, Microsoft IIS. Web proxies (optional). Web replicas (optional). 6

Web Semantic Components: Uniform Resource Identifier (URI): an identifier for a Web resource, consisting of the name of the protocol (http, https, ftp, ...), the name of the server, and the name of the resource on the server, e.g., http://www.foobar.com/info.html. Hypertext Markup Language (HTML): platform-independent styles (indicated by markup tags) that define the various components of a Web document. Hypertext Transfer Protocol (HTTP): defines the syntax and semantics of messages exchanged between Web software components. 7

Example of a Web Transaction: 1. the browser issues a DNS query to the DNS server; 2. it sets up a TCP connection to the Web server; 3. it sends an HTTP request; 4. the Web server returns an HTTP response. 8

Internet Protocol Stack: Application layer: application programs (HTTP, Telnet, FTP, DNS). Transport layer: error control + flow control (TCP, UDP). Network layer: routing (IP). Datalink layer: handles hardware details (Ethernet, ATM). Physical layer: moving bits (coaxial cable, optical fiber). 9

HTTP Protocol: Hypertext Transfer Protocol (HTTP). HTTP 1.0 [BLFF96]: the most widely used HTTP version; a "stop and wait" protocol. HTTP 1.1 [GMF+99]: adds persistent connections, pipelining, caching, content negotiation, … 10

HTTP 1.0: HTTP request: Request = Simple-Request | Full-Request. Simple-Request = "GET" SP Request-URI CRLF. Full-Request = Request-Line *( General-Header | Request-Header | Entity-Header ) CRLF [ Entity-Body ]. Request-Line = Method SP Request-URI SP HTTP-Version CRLF. Method = "GET" | "HEAD" | "POST" | extension-method. Example: GET /info.html HTTP/1.0 11
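To make the request format concrete, here is a minimal sketch (mine, not from the talk) that sends the example GET /info.html request over a raw socket and prints the status line and headers; the host www.foobar.com is just the placeholder from the URI example above.

```python
# Minimal sketch (not part of the original talk): send an HTTP/1.0 GET for
# /info.html and print the response status line and headers.
# "www.foobar.com" is the placeholder host from the URI example above.
import socket

def http10_get(host, path="/info.html", port=80):
    request = (
        f"GET {path} HTTP/1.0\r\n"   # Request-Line: Method SP Request-URI SP HTTP-Version
        f"Host: {host}\r\n"          # not required by HTTP/1.0, but widely expected
        "\r\n"                       # blank line ends the header section
    )
    with socket.create_connection((host, port)) as sock:
        sock.sendall(request.encode("ascii"))
        response = b""
        while chunk := sock.recv(4096):   # HTTP/1.0: the server closes the connection when done
            response += chunk
    header, _, body = response.partition(b"\r\n\r\n")
    print(header.decode("iso-8859-1"))
    return body

if __name__ == "__main__":
    http10_get("www.foobar.com")
```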
HTTP 1.0 (Cont.): HTTP response: Response = Simple-Response | Full-Response. Simple-Response = [ Entity-Body ]. Full-Response = Status-Line *( General-Header | Response-Header | Entity-Header ) CRLF [ Entity-Body ]. Example: HTTP/1.0 200 OK; Date: Mon, 09 Sep 2002 06:07:53 GMT; Server: Apache/1.3.20 (Unix) (Red-Hat/Linux) PHP/4.0.6; Last-Modified: Mon, 29 Jul 2002 10:58:59 GMT; Content-Length: 21748; Content-Type: text/html; followed by the document content <21748 bytes of the current version of info.html>. 12

HTTP 1.1: Connection management. Persistent connections [Mogul95]: use one TCP connection for multiple HTTP requests. Pros: reduces the overhead of connection setup and teardown; avoids TCP slow start. Cons: head-of-line blocking; increases the servers' state. Pipelining [Pad95]: send multiple requests without waiting for a response between requests. Pros: avoids the round-trip delay of waiting for each response. Cons: connection aborts are harder to deal with. 13

HTTP 1.1 (Cont.): Caching: continues to support the notion of expiration used in HTTP 1.0, and adds a Cache-Control header to handle the issues of cacheability and semantic transparency [KR01], e.g., no-cache, only-if-cached, no-store, max-age, max-stale, min-fresh, … Others: range requests; content negotiation; security; … 14

Types of Web Traces: Application level traces. Server logs: CLF and ECLF formats; the CLF format is <remoteIP, remoteID, usrName, Time, request, responseCode, contentLength>, e.g., 192.1.1.1, -, -, 8/1/2000, 10:00:00, "GET /news/index.asp HTTP/1.1", 200, 3410. Proxy logs: CLF and ECLF formats. Client logs: no standard logging formats. Packet level traces: collection method: monitor a network link; available tools: tcpdump, libpcap, netmon; concerns: packet drops, timestamp accuracy. 15

Tutorial Outline: Background; Web workload characterization; Performance diagnosis; Applications of traces; Bibliography. 16

Part II: Web Workload Characterization: Overview of workload characterization; Content dynamics; Access dynamics; Common pitfalls; Case studies. 17

Overview of Workload Characterization: Process of trace analyses; Common analysis techniques; Common analysis tools; Challenges in workload characterization. 18

Process of Trace Analyses: Collect traces (where to monitor, how to collect, e.g., efficiency, privacy, accuracy); determine the key metrics to characterize; process the traces; draw inferences from the data; apply the traces, or the insights gained from the trace analyses, to design better protocols & systems. 19

Common Analysis Techniques: Statistics. Mean. Median. Geometric mean, which is less sensitive to outliers: GM = (x_1 · x_2 · … · x_n)^(1/n) = e^(E[log x]). Variance and standard deviation: var(x) = (1/N) · Σ_{i=1..N} (x_i − μ)^2, std(x) = sqrt(var(x)). Confidence interval: a range of values that has a specified probability of containing the parameter being estimated; example: a 95% confidence interval such as 10 ≤ x ≤ 20. 20
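As an illustration of these summary statistics, the following sketch (not from the slides) pulls the contentLength field out of CLF-style server log lines and computes the mean, geometric mean, standard deviation, and a normal-approximation 95% confidence interval; the log file name and the assumption that contentLength is the last field are mine.

```python
# Sketch (assumed CLF layout: ..., request, responseCode, contentLength as the
# last field). Computes mean, geometric mean, variance/std, and a 95%
# confidence interval for the mean of the transferred sizes.
import math

def summarize_sizes(log_path="access.log"):
    sizes = []
    with open(log_path) as f:
        for line in f:
            if not line.strip():
                continue
            last = line.rsplit(None, 1)[-1]      # contentLength assumed to be the last field
            if last.isdigit():
                sizes.append(int(last))
    n = len(sizes)
    mean = sum(sizes) / n
    # Geometric mean = exp(E[log x]); less sensitive to outliers than the mean.
    positive = [s for s in sizes if s > 0]
    gm = math.exp(sum(map(math.log, positive)) / len(positive))
    var = sum((s - mean) ** 2 for s in sizes) / n
    std = math.sqrt(var)
    half = 1.96 * std / math.sqrt(n)             # normal-approximation 95% CI for the mean
    return {"n": n, "mean": mean, "geo_mean": gm, "std": std,
            "ci95": (mean - half, mean + half)}

if __name__ == "__main__":
    print(summarize_sizes())
```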
Common Analysis Techniques – Statistics (Cont.): Cumulative distribution function (CDF): the points (x, P(X ≤ x)). Probability density function (PDF): the derivative of the CDF, f(x) = dF(x)/dx. Checking for a heavy-tailed distribution: draw a log-log complementary plot and check its tail. Example: the Pareto distribution, F(x) = 1 − (a/x)^α, with a, α > 0 and x ≥ a; if α ≤ 2, the distribution has infinite variance (a heavy tail); if α ≤ 1, the distribution has infinite mean. 21

Common Analysis Techniques – Data Fitting: Visually compare two distributions. Chi-squared tests [AS86, Jain91]: divide the data points into k bins and compute X² = Σ_{i=1..k} (x_i − E_i)² / E_i; if X² ≤ χ²(α, k − c), then the two distributions are close, where α is the significance level and c is the number of estimated parameters for the distribution + 1; needs enough samples. Kolmogorov-Smirnov tests [AS86, Jain91]: compare two distributions by finding the maximum difference between the two variables' cumulative distribution functions; needs a fully specified distribution. 22

Common Analysis Techniques – Data Fitting (Cont.): Anderson-Darling test [Ste74]: a modification of the Kolmogorov-Smirnov test that gives more weight to the tails; A² = −N − S, where S = Σ_{i=1..N} ((2i − 1)/N) · [ln F(Y_i) + ln(1 − F(Y_{N+1−i}))]; if A² ≤ the critical value, the two distributions are similar, otherwise they are not (F is the CDF, and the Y_i are the ordered data). Quantile-quantile plots [AS86, Jain91]: compare two distributions by plotting the inverse of the cumulative distribution function, F⁻¹(x), for the two variables and finding the best-fitting line; if the slope of the line is close to 1 and the y-intercept is close to 0, the two data sets are almost identically distributed. 23
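A rough sketch of how the chi-squared and Kolmogorov-Smirnov checks above might be run in practice with SciPy; the data here are synthetic transfer sizes tested against a fitted exponential, purely for illustration.

```python
# Sketch: goodness-of-fit checks along the lines described above, using SciPy.
# The data are synthetic; in practice they would come from a trace.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sizes = rng.exponential(scale=10_000, size=5_000)   # stand-in for transfer sizes

# Kolmogorov-Smirnov test against an exponential with the fitted scale.
scale = sizes.mean()
ks_stat, ks_p = stats.kstest(sizes, "expon", args=(0, scale))

# Chi-squared test: bin the data and compare observed vs. expected counts.
edges = np.concatenate(([0.0], np.quantile(sizes, np.linspace(0.1, 0.9, 9)), [np.inf]))
observed, _ = np.histogram(sizes, bins=edges)
expected = len(sizes) * np.diff(stats.expon.cdf(edges, scale=scale))
chi2_stat, chi2_p = stats.chisquare(observed, expected, ddof=1)  # 1 fitted parameter

print(f"KS: D={ks_stat:.4f} p={ks_p:.3f}")
print(f"Chi-squared: X2={chi2_stat:.2f} p={chi2_p:.3f}")
```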
Common Analysis Tools: Scripting languages: VB, Perl, awk, UNIX shell scripts, … Databases: SQL, DB2, … Statistics packages: Matlab, S+, R, SAS, … Write our own low-level programs: C, C++, C#, … 24

Challenges in Workload Characterization (figure: clients, proxies, replicas, and servers spread across the Internet): workload characteristics vary both in space and in time, and each of the Web components provides a limited perspective on the functioning of the Web. 25

Workload Variation: Varies with the measurement points. Varies with the sites being measured: information servers (news sites), e-commerce servers, query servers, streaming servers, upload servers; US vs. Europe, … Varies with the clients being measured: Internet clients vs. wireless clients; university clients vs. home users; US vs. Europe, … Varies in time: day vs. night; weekday vs. weekend; changes with new applications and recent events; evolves over time, … 26

Different Web Components' Views: View from clients: knows details of client activities, such as requests satisfied by browser caches and client aborts; can record detailed information, as this does not impose significant load on a client browser. View from servers: requests satisfied by browser & proxy caches will not appear in the logs; may not log detailed information, to ensure fast processing of client requests. View from proxies: depends on the proxy's location; a proxy close to clients sees requests from a small client group to a large number of servers [KR00], while a proxy close to the servers sees requests from a large client group to a small number of servers [KR00]; requests satisfied by browser caches, or by proxy caches encountered earlier, will not appear in the logs. 27

Part II: Web Workload: Overview; Content dynamics; Access dynamics; Common pitfalls; Case studies. 28

Content Dynamics: File types; File size distribution; File update patterns: how often files are updated, and how much files are updated. 29

File Types: Text files: HTML, plain text, … Images: JPEG, GIF, bitmap, … Applications: Javascript, cgi, asp, pdf, ps, gzip, ppt, … Multimedia files: audio, video, … 30

File Size Distribution: Two definitions: D1: the size of all files on a Web server; D2: the size of all files transferred by a Web server. D1 ≠ D2, because some files can be transferred multiple times or not in their entirety, and other files are not transferred at all. Studies show that the distribution of file sizes under both definitions exhibits heavy tails, i.e., P[F > x] ~ x^(−α) with 0 < α < 2 (a way to estimate α from a trace is sketched below). 31

File Update Interval: Varies in time: hot events and fast-changing events require more frequent updates, e.g., the World Cup. Varies across sites: depends on server update policies & update tools, and on the nature of the content (e.g., university sites have a slower update rate than news sites). Recent studies: a study of the proxy traces collected at DEC and AT&T in 1996 showed that the rate of change depended on content type, top-level domains, etc. [DFK+97]; a study of the 1999 MSNBC logs shows that modification history yields a rough predictor of future modification intervals [PQ00]. 32
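Since the file size distribution is reported to be heavy-tailed, here is a small sketch (my own, assuming the sizes are available as a list of byte counts) that estimates the tail exponent α by regressing the log of the empirical complementary CDF against log size, in the spirit of the log-log complementary plot described earlier.

```python
# Sketch (not from the talk): estimate the tail exponent alpha of a
# heavy-tailed size distribution, P[F > x] ~ x^(-alpha), by regressing
# log(CCDF) on log(x) over the upper tail. `sizes` would normally come
# from a server file listing or transfer log.
import numpy as np

def tail_exponent(sizes, tail_fraction=0.1):
    x = np.sort(np.asarray(sizes, dtype=float))
    n = len(x)
    ccdf = 1.0 - np.arange(1, n + 1) / n          # empirical P[F > x]
    # Keep only the largest `tail_fraction` of the observations (the tail),
    # and drop the final point where the CCDF is exactly zero.
    start = int(n * (1 - tail_fraction))
    xs, ys = x[start:n - 1], ccdf[start:n - 1]
    slope, _ = np.polyfit(np.log(xs), np.log(ys), 1)
    return -slope                                  # alpha estimate

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    demo = (rng.pareto(1.2, 50_000) + 1) * 1_000   # synthetic Pareto sizes, alpha = 1.2
    print(f"estimated alpha ~ {tail_exponent(demo):.2f}")
```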
Extent of Change upon Modifications: Varies in time: different events trigger different amounts of updates. Varies across sites: depends on the servers' update policies and update tools, and on the nature of the content. Recent studies: studies of the 1996 DEC and AT&T proxy traces [MDF+97] and the 1999 MSNBC logs [PQ00] show that most file modifications are small ⇒ delta encoding can be very useful. 33

Part II: Web Workload: Motivation; Limitations of workload measurements; Content dynamics; Access dynamics; Common pitfalls; Case studies. 34

Access Dynamics: File popularity distribution; Temporal stability; Spatial locality; User request arrivals & durations. 35

Document Popularity: Web requests follow a Zipf-like distribution: request frequency ∝ 1/i^α, where i is a document's popularity ranking. The value of α depends on the point of measurement: between 0.6 and 1 for client traces and proxy traces, and close to or larger than 1 for server traces [ABC+96, PQ00]. The value of α also varies over time (e.g., it is larger during hot events). (Figure: measured α for clients/proxies, less popular servers, and MSNBC.) 36

Impact of the α value: a larger α means accesses are more concentrated on popular documents ⇒ caching is more beneficial. (Figure: cumulative percentage of requests vs. percentage of documents sorted by popularity, for 12/17/98 server traces, 10/06/99 proxy traces, and 08/01/99 server traces.) 90% of the accesses are accounted for by the top 36% of files in proxy traces [BCF+99, PQ00], the top 10% of files in the small departmental server logs reported in [AW96], and the top 2-4% of files in the MSNBC traces. 37

Temporal Stability: Metrics: Coarse-grained: the likely duration for which a currently popular file remains popular, e.g., the overlap between the set of popular documents on day 1 and day 2. Fine-grained: how soon a requested file will be requested again, e.g., the LRU stack distance [ABC+96]. (Example: with the stack [File 5, File 4, File 3, File 2, File 1] from top to bottom, a reference to File 2 has stack distance 4, and the stack becomes [File 2, File 5, File 4, File 3, File 1].) 38

Spatial Locality: Refers to whether users in the same geographical location or at the same organization tend to request a similar set of content; e.g., compare the degree to which requests are locally shared. 39

Spatial Locality (Cont.): (Figure: fraction of requests shared vs. domain ID under domain-based clustering and random clustering, on a normal day and during a hot event.) Domain membership is significant except when there is a "hot" event of global interest. 40

User Request Arrivals & Duration: User workload is modeled at three levels (session inter-arrivals follow an exponential distribution [LNJV99, KR01]). Session: a consecutive series of requests from a user to a Web site. Click: a user action to request a page, submit a form, etc.
Request: each click generates one or more HTTP requests Session duration Heavy-tail distribution [KR01] # clicks in a session, most in the range of 4-6 [Mah97] # embedded references in a Web page Think time: time between clicks Active time: time to download a Web page and its embedded images 41 Common Pitfalls Trace analyses are all about writing scripts & plotting nice graphs Challenges Understanding the limitation of the traces Trace collection: where to monitor, how to collect (e.g., efficiency, privacy, accuracy) Identify important metrics, and understand why they are important Sound measurements require disciplines [Pax97] Dealing with errors and outliers Draw implications from data analyses No representative traces: workload changes in time and in space Try to diversify data sets (e.g., collect traces at different places and different sites) before jumping into conclusions Draw inferences more than what data show 42 Part II: Web Workload Motivation Limitations of workload measurements Content dynamics Access dynamics Common pitfalls Case studies Boston University client log study UW proxy log study MSNBC server log study MSN Mobile server log study 43 Case Study I: BU Client Log Study Overview One of the few client log studies Analyze clients’ browsing pattern and their impact on network traffic [CBC95] Approaches Trace collection Modify Mosaic and distribute it to machines in CS Dept. at Boston Univ. to collect client traces in 1995 Log format: <client machine, request time, user id, URI, document size, retrieval time> Data analyses Distribution of document size, document popularity Relationship between retrieval latency and response size Implications on caching strategies 44 Major Findings Power law distributions Distribution of document sizes Distribution of user requests for documents # requests to documents as a function of their popularity Caching strategies should take into account of document size (i.e., give preference to smaller documents) 45 Case Study II: UW Proxy Log Study Overview Proxy traces collected at the University of Washington Approaches [WVS+99a, WVS+99b] Trace collection: deploy a passive network sniffer between the Univ. of Washington and the rest of the Internet in May 1999 Set well-defined objectives Understand the extent of document sharing within an organization and across different organizations Understand the performance benefit of cooperative proxy caching 46 Major Findings Members of an organization are more likely to request the same documents than a random set of clients Most popular documents are globally popular Cooperative caching is most beneficial for small organizations Cooperative caching among large organizations yield minor improvement if any 47 Case Study III: MSNBC Server Log Study Overview of MSNBC server site a large news site server cluster with 40 nodes 25 million accesses a day (HTML content alone) Period studied: Aug. – Oct. 99 & Dec. 17, 98 flash crowd 48 Approaches Trace collection HTTP access logs Content Replication System (CRS) logs HTML content logs Data analyses Content dynamics How often files are modified? How to predict modification interval? How much does a file change upon modification? 
Access dynamics: Document popularity; Temporal stability; Spatial locality; Correlation between document age and popularity. 49

Major Findings: Content dynamics: Modification history is a rough predictor and can guide the setting of TTLs, but an alternative mechanism (e.g., callback-based invalidation) is needed as a backup. Frequent but minimal file modifications ⇒ delta encoding. Access dynamics: The set of popular files remains stable for days ⇒ push/prefetch previously hot data that have undergone modifications. Domain membership has a significant bearing on client accesses, except during a flash crowd of global interest ⇒ it makes sense to have a proxy cache for an organization. Zipf-like distribution of file popularity, but with a much larger α than at proxies ⇒ potential for reverse caching and replication (a quick way to estimate α from a log is sketched below). 50

Case Study IV: Mobile Server Log Study: Overview of a popular commercial Web site for mobile clients. Content: news, weather, stock quotes, email, yellow pages, travel reservations, entertainment, etc. Services: notification and browse. Period studied: 3.25 million notifications in Aug. 20 – 26, 2000; 33 million browse requests in Aug. 15 – 26, 2000. 51

Approaches: Analyze by user categories: cellular users, who browse the Web in real time using cellular technologies; offline users, who download content onto their PDAs for later (offline) browsing, e.g., AvantGo; and desktop users, who sign up for services and specify preferences. Analyze by Web services: browse and notifications. Use a SQL database to manage the data. 52

Major Findings: Notification services: The popularity of notification messages follows a Zipf-like distribution, with the top 1% of notification objects responsible for 54-64% of total messages ⇒ multicast notifications. Notifications exhibit geographical locality ⇒ useful to provide localized notification services. Browse services: 0.1% - 0.5% of URLs account for 90% of requests ⇒ cache the results of popular queries. The set of popular URLs remains stable ⇒ cache a stable set of queries, or optimize queries based on a stable workload. Correlation between the two services: the correlation is limited ⇒ influences the design of pricing plans. 53
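As referenced above, a quick way to estimate the Zipf exponent α from an access log is to fit log(request count) against log(rank); the sketch below is mine, and the one-URL-per-line file requests.log is a hypothetical input format.

```python
# Sketch (mine, not from the talk): estimate the Zipf exponent alpha from a
# request log by fitting log(count) vs. log(rank). Assumes a plain text file
# with one requested URL per line; any per-document request counts would do.
from collections import Counter
import numpy as np

def zipf_alpha(url_iter, top=1000):
    counts = np.array(sorted(Counter(url_iter).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(counts) + 1)
    k = min(top, len(counts))                       # fit over the most popular documents
    slope, _ = np.polyfit(np.log(ranks[:k]), np.log(counts[:k]), 1)
    return -slope                                   # frequency ~ 1 / rank^alpha

if __name__ == "__main__":
    with open("requests.log") as f:                 # hypothetical one-URL-per-line log
        print(f"alpha ~ {zipf_alpha(line.strip() for line in f):.2f}")
```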
Tutorial Outline: Background; Web Workload; Performance Diagnosis; Applications of traces. 54

Part III: Performance Diagnosis: Overview of performance diagnosis; Infer the causes of high end-to-end delay in Web transfers [BC00]; Infer the causes of high end-to-end loss rate in Web transfers [CDH+99, DPP+01, NC01, PQ02, PQW02]. 55

Overview of Performance Diagnosis: Goal: determine the locations of trouble spots ("Why so slow?"). (Figure: a Web server reached by clients across multiple providers, e.g., Sprint, AT&T, MCI, UUNET, Qwest, AOL, Earthlink.) Metrics of interest: delay, loss rate, raw bandwidth, available bandwidth, traffic rate. Why interesting: resolving the trouble spots; server selection; placement of mirror servers. 56

Finding the Sources of Delays: Goal: Why is my Web transfer slow? Is it because of the server, the network, or the client? Sources of delay in a Web transfer: DNS lookup; server delays; client delays; network delays: propagation delays, queuing delays, and delays introduced by packet losses (e.g., signaled by the fast retransmit mechanism or TCP timeouts). 57

TCPEval Tool: Inputs: "tcpdump" packet traces taken at the communicating Web server and client. Generates a variety of statistics for file transactions: file and packet transfer latencies; packet drop characteristics; packet and byte counts per unit time. Generates both timeline and sequence plots for transactions. Generates critical path profiles and statistics for transactions. 58

Critical Path Analysis Tool [BC00]: (Figure: the data flow between client and server, and the corresponding critical path broken into network delay, server delay, client delay, and network delay due to packet loss.) 59

Finding Sources of Packet Losses: Goal: identify lossy links. (Figure: the server reaches clients over a tree of links with loss rates l1, …, l8; the clients observe end-to-end loss rates p1, …, p5.) The constraints are (1−l1)·(1−l2)·(1−l4) = (1−p1), (1−l1)·(1−l2)·(1−l5) = (1−p2), …, (1−l1)·(1−l3)·(1−l8) = (1−p5): an under-constrained system of equations, with measurement errors. 60

Approaches: Active probing: multicast probes; striped unicast probes. (Figure: multicast vs. striped unicast probes from a source S to receivers A and B.) Technique: Expectation Maximization (EM), a numerical algorithm to compute the θ that maximizes P(D|θ), where D are the observations and θ is the ensemble of link loss rates. 61

Approaches (Cont.): Passive monitoring. Random sampling: randomly sample the solution space, and draw conclusions based on the samples. Linear optimization: determine a unique solution by optimizing an objective function. Gibbs sampling: akin to Monte Carlo sampling; determine P(θ|D) by drawing samples, where θ is the ensemble of loss rates of the links in the network and D is the observed packet transmissions and losses at the clients. EM: a numerical algorithm to compute the θ that maximizes P(D|θ). 62

Other Performance Studies using Web Traces: characterize Internet performance (e.g., spatial & temporal locality) [BSS+97]; study the behavior of TCP during Web transfers [BPS+98]; reconstruct different clients' page accesses and measure performance characteristics for the accesses [FCT+02]. 63

Tutorial Outline: Background; Web Workload; Performance Diagnosis; Applications of traces; Bibliography. 64

Part IV: Applications of Traces: Synthetic workload generation. Cache design: pre-fetching algorithms [CB98, FJC+99]; placement of Web proxies/replicas [QPV01]; cache replacement policies [CI97, BCF+99]; cache consistency algorithms [LC97, YBS99, YAD+01]; cooperative caching or not [WVS+99]; cache infrastructure, … Other optimizations: improving TCP for Web transfers [Mah97, PK98, ZQK00]; concurrent downloads, pipelining, compression, … 65

Synthetic Workload Generation: Generate user requests: generate user sessions using a Poisson arrival process; for each user session, determine the number of clicks using a Pareto distribution; assign each click to a request for a Web page, while making sure that the popularity distribution of files follows a Zipf-like distribution [BC98] and that the temporal locality of successive requests for the same resource is captured; generate the next click from the same user with a think time following a Pareto distribution. 66 Synthetic Workload Generation (Cont.)
Generate Web pages Determine the number of Web pages Generate the size of each Web pages using a lognormal distribution Associate a page with some number of embedded pages using an empirical distribution (heavy-tail) Generate file modification events Examples of generators Webbench [Wbe], WebStone[TS95], Surge [BC98], SPecweb99 [SP99], Web Polygraph [WP], … 67 Cache Replacement Policies Problem formulation Given a fixed size cache, how to evict pages to maximize the hit ratio once the cache is full? Hit ratio Fraction of requests satisfied by the cache Fraction of the total size of requested data satisfied by the cache Factors to consider Request frequency Modification frequency Benefit of caching: reduction in latency & BW Cost of caching: storage Caveat: NOT all hits are equal. Hit ratios do NOT map directly to performance improvement. 68 Cache Replacement Policies (Cont.) Approaches Least recently used (LRU) Least frequently used (LFU) GreedyDual-size [CI97] Perfect: maintain counters for all pages seen In-cache: maintain counters only for pages that are in cache Assign a utility value to each object, and replace the one with the lowest utility Use of traces Evaluate the algorithms using trace-driven simulations or synthetic workload Analytically derive the hit ratios for different replacement policies based on a workload model 69 Placement of Web Proxies/Replicas Problem formulation [JJK+01,QPV01] How to place a fixed number of proxies/replicas to minimize users’ request latency Factors to consider Spatial distribution of requests Temporal stability of requests Stability in popularity of objects Stability in spatial distribution of requests 70 Placement of Web Proxies/Replicas (Cont.) Approaches Greedy placement Hot-spot placement Random placement Use of traces Trace-driven simulations High concentration of requests to a small number of objects focus on replicating only popular objects Temporal stability in requests no need to frequently change the locations of proxies/replicas 71 References [AS86] R. B. D’Agostino and M. A. Stephens. Goodness-of-Fit Techniques. Marcel Dekker, New York, NY 1986. [ABC+96] Virgilio Almeida, Azer Bestavros, Mark Crovella and Adriana de Oliveria. Characterizing reference locality in the WWW. In Proceedings of 1996 International Conference on Parallel and Distributed Information Systems (PDIS'96), December 1996. [ABQ01] A. Adya, P. Bahl, and L. Qiu. Analyzing Browse Patterns of Mobile Clients. In Proc. of SIGCOMM Measurement Workshop, Nov. 2001. [ABQ02] A. Adya, P. Bahl, and L. Qiu. Characterizing Alert and Browse Services for Mobile Clients. In Proc. of USENIX, Jun. 2002. [AL01] P. Albitz, and C. Liu. DNS and BIND (4th Edition), O’Reilly & Associates, Apr. 2001. [AW97] M. Arlitt and C. Williamson. Internet Web Servers: Workload Characterization and Performance Implications. IEEE/ACM Transactions on Networking, Vol. 5, No. 5, pp. 631-645, October 1997. [BC98] P. Barford and M. Crovella. Generating representative workloads for network and server performance evaluation. In Proc. of SIGMETRICS, 1998. 72 References (Cont.) [BBC+98] P. Barford, A. Bestavros, M. Crovella, and A. Bradley. Changes in Web Client Access Patterns: Characteristics and Caching Implications, Special Issue on World Wide Web Characterization and Performance Evaluation; World Wide Web Journal, December 1998. [BCF+99] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web Caching and Zipf-like Distributions: Evidence and Implications. In Proc. of INFOCOM, Mar. 1999. [BC00] P. 
Barford and M. Crovella. Critical Path Analysis of TCP Transactions. In Proc. of ACM SIGCOMM, Aug. 2000. [BLFF96] T. Berners-Lee, R. Fielding, and H. Frystyk. Hypertext Transfer Protocol -- HTTP/1.0. RFC 1945, May 1996. [BPS+98] H. Balakrishnan, V. N. Padmanabhan, S. Seshan, M. Stemm and R. H. Katz. TCP Behavior of a Busy Internet Server: Analysis and Improvements. In Proc. IEEE Infocom, San Francisco, CA, USA, March 1998. [BSS+97] H. Balakrishnan, S. Seshan, M. Stemm, and R. H. Katz. Analyzing Stability in Wide-Area Network Performance. In Proc. of SIGMETRICS, Jun. 1997. 73 References (Cont.) [CDH+99] R. Caceres, N. G. Duffield, J. Horowitz, D. Towsley, T. Bu. Multicast-Based Inference of Network Internal Loss Characteristics. In Proc. Infocom, Mar. 1999. [CB98] M. Crovella and P. Barford. The network effects of prefetching. In Proc. of INFOCOM, 1998. [CBC95] C. R. Cunha, A. Bestavros, and M. E. Crovella. Characteristics of WWW client-based traces. Technical Report BU-CS-95-010, CS Dept., Boston University, 1995. [CI97] P. Cao and S. Irani. Cost-Aware WWW proxy caching algorithms. In Proc. of USITS, Dec. 1997. [DFK+97] F. Douglis, A. Feldmann, B. Krishnamurth, and J. Mogul. Rate of change and other metrics: a live study of the World Wide Web. In Proc. of USITS, 1997. [DPP+01] N. G. Duffield, F. Lo Presti, V. Paxson, D. Towsley. In Proc. Infocom, Apr. 2001. [FCD+99] A. Feldmann, R. Caceres, F. Douglis, and M. Rabinovich. Performance of Web Proxy Caching in heterogeneous bandwidth enviornments. In Proc. of INFOCOM, March 1999. 74 References (Cont.) [FJC+99] L. Fan, Q. Jacobson, P. Cao and W. Lin. Web Prefetching Between Low-Bandwidth Clients and Proxies: Potential and Performance. In Proc. of SIGMETRICS, 1999. [FCT+02] Y. Fu, L. Cherkassova, W. Tang, and A. Vahdat. EtE: Passive End-toEnd Internet Service Performance Monitering. In Proc. of USENIX, Jun. 2002. [GMF+99] J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. BernersLee. Hypertext Transfer Protocol – HTTP 1.1. RFC 2616, Jun. 1999. [JK88] V. Jacobson, M. J. Karels. Congestion Avoidance and Control. In Proc. SIGCOMM, Aug. 1988. [JJK+01] S. Jamin, C. Jin, A. R. Kurc, D. Raz, and Y. Shavitt. Constrained Mirror Placement on the Internet. In Proc. of INFOCOM, Apr. 2001. [Jain91] R. Jain. The Art of Computer Systems Performance Analysis. John Wiley and Sons, 1991. [Kel02] T. Kelly. Thin-Client Web Access Patterns: Measurements from a Cache-Busting Proxy. Computer Communications, Vol. 25, No. 4 (March 2002), pages 357-366. [KR01] B. Krishnamurthy and J. Rexford. Web Protocols and Practice, HTTP/1.1, Networking Protocols, Caching, and Traffic Measurement. AddisonWesley, May 2001. 75 References (Cont.) [LC97] C. Liu and P. Cao. Maintaining Strong Cache Consistency in the WorldWide Web. In Proc. of ICDCS'97, pp. 12-21, May 1997. [LNJV99] Z. Liu, N. Niclausse, and C. Jalpa-Villaneuva. Web Traffic Modeling and Performance Comparison Between HTTP 1.0 and HTTP 1.1. In Erol Gelenbe, editor, System Performance Evaluation: Methodologies and Applications. CRC Press, Aug. 1999. [Mah97] Bruce Mah. An empirical model of HTTP network traffic. In Proc. of INFOCOM, April 1997. [Mogul95] Jeffrey C. Mogul. The Case for Persistent-Connection HTTP. In Proc. SIGCOMM '95, pages 299-313. Cambridge, MA, August, 1995. [MDF+97] J. C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy. Potential benefits of delta-encoding and data compression for HTTP, In Proc. of SIGCOMM, September 1997. [NC01] R. Nowak and M. Coates. 
Unicast Network Tomography using the EM algorithm. Submitted to IEEE Transactions on Information Theory, Dec. 2001 [Pad95] V. N. Padmanabhan. Improving World Wide Web Latency. Technical Report UCB/CSD-95-875, University of California, Berkeley, May 1995. 76 References (Cont.) [PQ00] V. N. Padmanabhan and L. Qiu. The Content and Access Dynamics of a Busy Web Server. In Proc. of SIGCOMM, Aug. 2000. [PQ02] V. N. Padmanabhan and L. Qiu. Network Tomography using Passive End-to-End Measurements, DIMACS on Internet and WWW Measurement, Mapping and Modeling, Feb. 2002. [PQW02] V. N. Padmanabhan, L. Qiu, and H. J. Wang. Passive Network Tomography using Bayesian Inference. Internet Measurement Workshop, Nov. 2002. [QPV01] L. Qiu, V. N. Padmanabhan, and G. M. Voelker. On the Placement of Web Server Replicas. In Proc. of INFOCOM, Apr. 2001. [SP99] SPECWeb99 Benchmark. http://www.spec.org/osg/web99/. [Pax98] V. Paxson. An Introduction to Internet Measurement and Modeling. SIGCOMM’98 tutorial, August 1998. [Ste74] M. A. Stephens. EDF Statistics for Goodness of Fit and Some Comparison. Journal of the American Statistical Association, Vol. 69, pp. 730 – 737. [TS95] G. Trent and M. Sake. WebStone: The First Generation in HTTP Server Benchmarking, Feb. 1995. http://www.mindcraft.com/webstone/paper.html. 77 References (Cont.) [Wbe] Webbench. http://www.zdnet.com/etestinglabs/stories/benchmarks/0,8829,2326243,00.html. [WP] Web Polygraph: Proxy performance benchmark. http://polygraph.ircache.net/. [WVS+99a] A. Wolman, G. Voelker, N. Sharma, N. Cardwell, M. Brown, T. Landray,D. Pinnel, A. Karlin, and H. Levy. OrganizationBased Analysis of Web-Object Sharing and Caching. In Proc. of the Second USENIX Symposium on Internet Technologies and Systems, Boulder, CO, October 1999. [WVS+99b] A. Wolman, G. M. Voelker, N. Sharma, N. Cardwell, A. Karlin, and H. M. Levy. On the scale and performance of cooperative Web proxy caching. In Proc. of the 17th ACM Symposium on Operating Systems Principles, Kiawah Island, SC, Dec. 1999. [YAD01] J. Yin, L. Alvisi, M. Dahlin, A. Iyengar. Engineering serverdriven consistency for large scale dynamic services. [YBS99] H. Yu, L. Breslau, and S. Shenker. A Scalable Web Cache Consistency Architecture. In Proc. of SIGCOMM, August 1999. 78 Acknowledgement Thank Alec Wolman for his helpful comments. 79 Thank you! http://www.research.microsoft.com/~liliq/talks/internal-web-perf.ppt 80 Web Protocols HTTP messages HTTP TCP segments TCP IP HTTP IP pkt IP IP pkt Ethernet Ethernet Sonet Ethernet TCP IP IP pkt Sonet Ethernet Sonet link A picture taken from [KR01] IP Ethernet Ethernet 81 #1: Random Sampling server l1 l3 l2 Randomly sample the solution space Repeat this several times Draw conclusions based on overall statistics How to do random sampling? 
(Figure: the same server-to-clients tree with link loss rates l1, …, l8 and client loss rates p1, …, p5.) Determine a loss rate bound for each link using its best downstream client; then iterate over all links: pick a loss rate at random within the bounds, and update the bounds for the other links. Problem: little tolerance for estimation error. 82

#2: Linear Optimization: 1) Convert the constraints into linear constraints using a log transform: Li = log(1/(1−li)), Pj = log(1/(1−pj)); e.g., (1−l1)·(1−l2)·(1−l4) = (1−p1) becomes L1+L2+L4 = P1. 2) Add slack variables to account for errors: L1+L2+L4 = P1 becomes L1+L2+L4+S1 = P1. Then minimize w·ΣLi + Σ|Sj| subject to L1+L2+L4+S1 = P1, L1+L2+L5+S2 = P2, …, L1+L3+L8+S5 = P5. Goals: a parsimonious explanation; robustness to error in the client loss rate estimates. 83

#3: Gibbs Sampling: Let θ be the ensemble of loss rates of the links in the network and D the observed packet transmissions and losses at the clients. Goal: determine the posterior distribution P(θ|D). Approach: use Markov Chain Monte Carlo with Gibbs sampling to obtain samples from P(θ|D), and draw conclusions based on the samples. 84

#3: Gibbs Sampling (Cont.): Applying Gibbs sampling to network tomography: 1) Initialize the link loss rates arbitrarily. 2) For j = 1 : warmup, for each link i, compute P(li | D, {lk}k≠i), where li is the loss rate of link i and {lk}k≠i are the loss rates of all other links. 3) For j = 1 : realSamples, for each link i, compute P(li | D, {lk}k≠i). Use all the samples obtained at step 3 to approximate P(θ|D). 85

Simulation Experiments: Advantage: no uncertainty about the link loss rates! Methodology: Topologies used: randomly generated topologies (20 - 3000 nodes, max degree = 5-50) with randomly generated packet loss events at each link, and a real topology obtained by tracing paths to microsoft.com clients. A fraction f of the links are good, and the rest are "bad": LM1: good links 0 – 1%, bad links 5 – 10%; LM2: good links 0 – 1%, bad links 1 – 100%. Goodness metrics: coverage: the number of correctly inferred lossy links; false positives: the number of incorrectly inferred lossy links. 86

Random Topologies: (Figure: for 1000-node random topologies with d=10 and f=0.95 or f=0.5, the number of true lossy links, correctly identified lossy links, and false positives for random sampling, LP, and Gibbs sampling.) Summary: random sampling: high coverage, high false positives, low computation; LP: modest coverage, low false positives, medium computation; Gibbs sampling: high coverage, low false positives, high computation. 87
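The following sketch (not the authors' implementation) sets up the linear-optimization formulation above for the small example tree and solves it with scipy.optimize.linprog; the client loss rates and the exact shape of the tree for clients 3 and 4 are my assumptions, chosen to be consistent with the equations shown for p1, p2, and p5.

```python
# Sketch: linear-optimization loss inference with slack variables, as described
# above, on the example server-to-clients tree. Tree layout for clients 3 and 4
# is assumed; client loss rates p_obs are made up for illustration.
import numpy as np
from scipy.optimize import linprog

# Path matrix: row j lists the (0-based) links on the path to client j.
paths = [
    [0, 1, 3],   # p1: l1, l2, l4
    [0, 1, 4],   # p2: l1, l2, l5
    [0, 1, 5],   # p3: l1, l2, l6  (assumed)
    [0, 2, 6],   # p4: l1, l3, l7  (assumed)
    [0, 2, 7],   # p5: l1, l3, l8
]
p_obs = np.array([0.12, 0.11, 0.02, 0.015, 0.03])    # made-up client loss rates
n_links, n_paths, w = 8, len(paths), 1.0

# Variables: [L_1..L_8, S+_1..S+_5, S-_1..S-_5], all >= 0, with S_j = S+_j - S-_j.
# Constraints: sum of L_i on path j + S+_j - S-_j = P_j, where P_j = log(1/(1-p_j)).
A_eq = np.zeros((n_paths, n_links + 2 * n_paths))
for j, links in enumerate(paths):
    A_eq[j, links] = 1.0
    A_eq[j, n_links + j] = 1.0
    A_eq[j, n_links + n_paths + j] = -1.0
b_eq = np.log(1.0 / (1.0 - p_obs))
c = np.r_[np.full(n_links, w), np.ones(2 * n_paths)]  # minimize w*sum(L) + sum(|S|)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * len(c))
loss = 1.0 - np.exp(-res.x[:n_links])                 # map L_i back to link loss l_i
for i, l in enumerate(loss, 1):
    print(f"l{i}: {l:.3f}")
```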
Trace-driven Validation: Validation approach: divide the client traces into two sets, tomography and validation; the tomography data set is used for loss inference, and the validation set is used to check whether clients downstream of the inferred lossy links experience high loss. Experimental setup: real topologies and loss traces collected from traceroute and tcpdump at microsoft.com between Dec. 20, 2000 and Jan. 11, 2002. Results: the false positive rate is between 5 – 30%. Likely candidates for lossy links: links crossing an inter-AS boundary; links having a large delay (e.g., transcontinental links); links that terminate at clients. 88

Tutorial Outline: Background; Web Workload Characterization: motivation, data analyses and fittings, understanding the limitations, content dynamics, access dynamics, case studies; Performance Diagnosis: infer causes of high end-to-end delay in Web transfers, infer causes of high end-to-end loss in Web transfers; Applications of traces: synthetic workload generation. 89

Part II: Web Workload Characterization: Overview: process of trace analyses; common analysis techniques & tools; challenges in workload characterization. Content dynamics: file size distribution; file update patterns. Access dynamics: file popularity distribution; temporal stability; spatial locality; browser sessions: length & arrival pattern. Common pitfalls. Case studies: Boston University client log study, UW proxy log study, MSNBC server log study, a mobile log study. 90

Cache Consistency Algorithms: Problem formulation; Factors to consider; Approaches; Use of traces. 91

Major Findings (observations and implications): The top 1% of notification objects account for 54-64% of total messages ⇒ delivering notifications via multicast would be effective. Notification exhibits geographical locality ⇒ useful to provide localized notification services. 92

Major Findings (Cont.) (observations and implications): 0.1% - 0.5% of full URLs (i.e., 121-442) account for 90% of requests ⇒ caching the results of popular queries would be very effective. The set of popular URLs remains stable ⇒ cache a stable set of popular query results, or optimize query performance based on a stable workload. Limited correlation between users' browsing and notification patterns ⇒ service providers cannot rely solely on users' notification profiles to predict how much & what they will browse. 93

Types of Web Traces: Flow level traces: <srcIP, destIP, nextIP, input, output, pkts, bytes, sTime, fTime, srcPort, destPort, pad, flags, prot, tos, srcAs, destAs, srcMask, destMask, seq>. 94

Views from Clients: Capture clients' requests to all servers. Pros: know details of client activities, such as requests satisfied by browser caches and client aborts; the ability to record detailed information, as this does not impose significant load on a client browser. Cons: need to modify browser software; hard to deploy for a large number of clients. 95

Views from Web Servers: Capture most clients' requests (excluding those satisfied by caches) to a single server. Pros: relatively easy to deploy/change the logging software. Cons: requests satisfied by browser & proxy caches will not appear in the logs; may not log detailed information, to ensure fast processing of client requests. 96

Views from Web Proxies: Depends on the proxy's location: a proxy close to clients sees requests from a small client group to a large number of servers [KR00], while a proxy close to the servers sees requests from a large client group to a small number of servers [KR00]. Pros: sees requests from a diverse set of clients to a diverse set of servers, and can determine the popularity ranking of different Web sites; useful for studying caching policies; ease of collection. Cons: requests satisfied by browser caches will not appear in the logs; may not log detailed information, to ensure fast processing of requests. 97
Web Protocols (Cont.): DNS [AL01]: an application layer protocol responsible for translating hostnames to IP addresses and vice versa (e.g., msn.com → 207.68.172.246). TCP [JK88]: a transport layer protocol that does error control and flow control. Hypertext Transfer Protocol (HTTP): HTTP 1.0 [BLFF96]: the most widely used HTTP version; a "stop and wait" protocol. HTTP 1.1 [GMF+99]: adds persistent connections, pipelining, caching, content negotiation, … 98