Effective Prediction of Web-user Accesses: A Data Mining Approach
Alexandros Nanopoulos, Dimitrios Katsaros, Yannis Manolopoulos
Aristotle Univ. of Thessaloniki, Greece
Presentation: Spyros Papadimitriou, Carnegie Mellon Univ.
WebKDD 2001, Aristotle University of Thessaloniki

Introduction (1/2)
• Web prefetching: deducing forthcoming user accesses based on log information
• Focus on:
  – Predictive prefetching (use of history)
  – Server-initiated prefetching (the server makes predictions and piggybacks them to the clients)

Introduction (2/2)
• Within a site, users navigate by following links [5]
• For server-initiated predictive prefetching, the interest is in access patterns that reflect this behavior

Outline
• Motivation & Related work
• Proposed method
• Comparative performance evaluation
• Conclusions

Requirements
• Site structure and contents impose:
  1. The order of dependencies (first or higher) among the documents
  2. The interleaving of documents belonging to patterns with random visits (noise)
• Discovered patterns should respect these factors

Related work
• Dependency graph (DG) [9]
  – A graph maintains pairwise accesses
• Prediction by Partial Match (PPM) [10]
  – A trie maintains sequences of consecutive accesses
• LBOT [6]
  – Special form of association rules of length 2
• Others (variations of the above) [3,11]

Motivation
            Order (1st Req.)    Noise (2nd Req.)
DG          No                  Yes
PPM         Yes                 No
LBOT        No                  No
Proposed    Yes                 Yes

Proposed Method (1)
• Novel Web log mining algorithm (WMo)
  – Apriori-like
  – Effective
    • Immune to noise
    • Considers high-order dependencies
  – Efficient
    • Significant reduction in the number of candidates

Proposed Method (2)
• Session (or transaction): a sequence of requests that occur within a specified time interval of each other [2]
• The containment relationship addresses the 2nd requirement (avoiding noise)
• Example: T = ⟨A, X, B, Y, C⟩, where X and Y are noise, and S = ⟨A, B, C⟩; the pattern S is contained in T
• Comment: if support is counted only over contiguous subsequences, the pattern S will be missed

Proposed Method (3)
• Candidate generation respects the ordering of accesses in transactions
• Example: ⟨A, B⟩ ≠ ⟨B, A⟩
• Dramatic increase in the number of candidates
• Exploits the site structure for pruning [7,8]

Proposed Method (4)
Algorithm genCandidates(Lk, G)
// Lk: the set of large k-paths; G: the site graph
begin
  foreach L = ⟨l1, …, lk⟩, L ∈ Lk {
    N+(lk) = {v | arc lk → v ∈ G}
    foreach v ∈ N+(lk) {
      // apply modified apriori pruning
      if v ∉ L and L' = ⟨l2, …, lk, v⟩ ∈ Lk {
        C = ⟨l1, …, lk, v⟩
        if (∀ S ⊂ C, S ≠ L' ⇒ S ∈ Lk)
          insert C in the candidate-trie
      }
    }
  }
end

Discussion
• Sequential patterns [1]
  – Reduction when “customer-sequence” = “user-session”
  – Suffers from a large number of candidates (by not considering the site structure)
• Path Fragments [4] (the containment relationship is expressed with regular expressions and the “*” label)
  – Focus on semantics (recommendation systems)
• Prefetching: patterns are for system consumption, not human consumption
• WMo focuses on efficiency/effectiveness rather than on expressiveness (semantics)

Methodology
• Synthetic data (sample site with 1000 nodes)
  – Synthetic data generator (see the paper)
    • Models site nodes, site linkage, and document sizes
• Real data sets (see the paper)
• Examine the impact of:
  – noise
  – order
  – client cache (see the paper)
  – efficiency

Accuracy w.r.t. noise
[Figure: accuracy vs. mean noise; series DG, PPM, WM, WMo, LBOT]

Usefulness w.r.t. noise
[Figure: usefulness vs. mean noise; series DG, PPM, WM, WMo, LBOT]

Traffic w.r.t.
noise
[Figure: traffic vs. mean noise; series DG, PPM, WM, WMo, LBOT]

Accuracy w.r.t. order
[Figure: accuracy vs. higher order percentage; series DG, PPM, WM, WMo, LBOT]

Usefulness w.r.t. order
[Figure: usefulness vs. higher order percentage; series DG, PPM, WM, WMo, LBOT]

Traffic w.r.t. order
[Figure: traffic vs. higher order percentage; series DG, PPM, WM, WMo, LBOT]

Efficiency (see also [7,8])
[Figure: x-axis support threshold (percentage); series WM, WMo/wp, WMo]

Conclusions
• Factors that influence Web prefetching
  – Noise
  – Order
• A new algorithm, WMo, based on data mining was presented
• WMo compares favorably with previously proposed algorithms
• WMo is an effective and efficient Web prefetching algorithm

References
1. R. Agrawal, R. Srikant, Mining Sequential Patterns, ICDE 1995.
2. R. Cooley, B. Mobasher, J. Srivastava, Data Preparation for Mining World Wide Web Browsing Patterns, KAIS, 1(1), pp. 5-32, 1999.
3. M. Deshpande, G. Karypis, Selective Markov Models for Predicting Web-page Accesses, SIAM Data Mining, 2001.
4. W. Gaul, L. T. Schmidt-Thieme, Mining Web Navigation Path Fragments, WebKDD 2000.
5. B. A. Huberman, P. Pirolli, J. Pitkow, R. J. Lukose, Strong Regularities in World Wide Web Surfing, Science, 280, pp. 95-97, 1998.
6.
B. Lan, S. Bressan, B. C. Ooi, Y. Tay, Making Web Servers Pushier, WebKDD 1999.
7. A. Nanopoulos, Y. Manolopoulos, Finding Generalized Path Patterns for Web Log Data Mining, ADBIS-DASFAA 2000.
8. A. Nanopoulos, Y. Manolopoulos, Mining Patterns from Graph Traversals, DKE, 37(3), pp. 243-266, 2001.
9. V. Padmanabhan, J. Mogul, Using Predictive Prefetching to Improve World Wide Web Latency, ACM SIGCOMM Computer Communications Review, 26(3), 1996.
10. T. Palpanas, A. Mendelzon, Web Prefetching Using Partial Match Prediction, WCW 1999.
11. J. Pitkow, P. Pirolli, Mining Longest Repeating Subsequences to Predict World Wide Web Surfing, USITS, 1999.
12. L. T. Schmidt-Thieme, W. Gaul, Recommender Systems Based on Navigation Path Features, WebKDD 2001.
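
The containment relationship of Proposed Method (2) and the genCandidates procedure of Proposed Method (4) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `is_contained`, `is_path`, and `gen_candidates` are hypothetical, and a plain set stands in for the candidate-trie. The sketch also assumes that the pruning step only checks length-k subsequences of a candidate that are themselves paths in the site graph, since a sequence that is not a path in G can never appear in Lk.

```python
from typing import Dict, Iterable, Set, Tuple

Path = Tuple[str, ...]
Graph = Dict[str, Set[str]]  # node -> set of nodes it links to (assumed representation)


def is_contained(pattern: Path, session: Iterable[str]) -> bool:
    """True if pattern occurs in session in order, possibly with noise in between."""
    it = iter(session)
    # Each membership test advances the iterator, so order is preserved.
    return all(page in it for page in pattern)


def is_path(nodes: Path, graph: Graph) -> bool:
    """True if every consecutive pair of nodes is joined by an arc in graph."""
    return all(b in graph.get(a, set()) for a, b in zip(nodes, nodes[1:]))


def gen_candidates(large_k: Set[Path], graph: Graph) -> Set[Path]:
    """Extend each large k-path by one outgoing arc, with Apriori-style pruning."""
    candidates: Set[Path] = set()  # a set replaces the candidate-trie (assumption)
    for L in large_k:
        for v in graph.get(L[-1], set()):   # v in N+(lk)
            if v in L:
                continue
            L_prime = L[1:] + (v,)          # L' = <l2, ..., lk, v>
            if L_prime not in large_k:
                continue
            C = L + (v,)                    # C = <l1, ..., lk, v>
            keep = True
            for i in range(len(C)):
                S = C[:i] + C[i + 1:]       # drop one node to get a k-subsequence
                # Assumption: subsequences that are not paths in G are skipped.
                if S == L_prime or not is_path(S, graph):
                    continue
                if S not in large_k:
                    keep = False
                    break
            if keep:
                candidates.add(C)
    return candidates
```

With the slide's example, `is_contained(("A", "B", "C"), ["A", "X", "B", "Y", "C"])` holds, whereas a contiguous match would miss the pattern. A set keeps the sketch short; a trie, as named on the slide, would let candidates that share a prefix share storage during support counting.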