A New Reactive Method for Processing Web Usage Data
Murat Ali Bayır
Middle East Technical University, Department of Computer Engineering
Ankara, Turkey

OUTLINE
– Web Mining
– Previous Session Reconstruction Heuristics
– Smart-SRA
– Agent Simulator
– Experimental Results
– Conclusion
Murat Ali Bayir, June 06

Data & Web Mining
Data mining: discovery of useful and interesting patterns from a large dataset.
Web mining: the application of data mining techniques to discover and retrieve useful information and patterns from World Wide Web documents and services.
Dimensions:
– Web content mining
– Web structure mining
– Web usage mining

Web Mining: Web Usage Mining (WUM)
Application of data mining techniques to web log data in order to discover user access patterns.
Example user web access log:

IP Address      Request Time               Method  URL     Protocol  Return Code  Bytes Transmitted
144.123.121.23  [25/Apr/2005:03:04:41-05]  GET     A.html  HTTP/1.0  200          3290
144.123.121.23  [25/Apr/2005:03:04:43-05]  GET     B.html  HTTP/1.0  200          2050
144.123.121.23  [25/Apr/2005:03:04:48-05]  GET     C.html  HTTP/1.0  200          4130

Such a log captures the information needed for WUM.

Web Mining: Phases of Web Usage Mining
1. Data Processing: includes the reconstruction of user sessions by using heuristic techniques. This is the most important phase, since it directly affects the quality of the frequent patterns extracted in the final step.
2. Pattern Discovery: discovers useful patterns from the reconstructed sessions obtained in the first phase. Related work on the pattern discovery phase: [Bayir 06-1].

Previous Session Reconstruction Heuristics: Session Reconstruction
Session reconstruction means selecting and grouping the requests that belong to the same user by using heuristic techniques.
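As a minimal sketch of this grouping step, the snippet below parses access-log lines shaped like the example log and groups the requests by IP address. The regular expression, the full `-0500` time zone offset (the example log shows only `-05`), and the names `parse_line` and `group_by_user` are assumptions made for illustration. Treating one IP address as one user is itself a simplification.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line layout, following the column order of the example log:
# IP [time] METHOD URL PROTOCOL status bytes
LOG_PATTERN = re.compile(
    r"(?P<ip>\S+) \[(?P<time>[^\]]+)\] (?P<method>\S+) (?P<url>\S+) "
    r"(?P<protocol>\S+) (?P<status>\d{3}) (?P<size>\d+)"
)


def parse_line(line):
    """Return (ip, timestamp, url) for a successful GET request, else None."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    if match["method"] != "GET" or match["status"] != "200":
        return None
    timestamp = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    return match["ip"], timestamp, match["url"]


def group_by_user(lines):
    """Group requests by IP address; each group is sorted by request time."""
    requests = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            ip, timestamp, url = parsed
            requests[ip].append((timestamp, url))
    return {ip: sorted(reqs) for ip, reqs in requests.items()}


# Sample lines modelled on the example log (time zone offset is assumed).
SAMPLE_LOG = [
    "144.123.121.23 [25/Apr/2005:03:04:41 -0500] GET A.html HTTP/1.0 200 3290",
    "144.123.121.23 [25/Apr/2005:03:04:43 -0500] GET B.html HTTP/1.0 200 2050",
    "144.123.121.23 [25/Apr/2005:03:04:48 -0500] GET C.html HTTP/1.0 200 4130",
]

for ip, reqs in group_by_user(SAMPLE_LOG).items():
    print(ip, [url for _, url in reqs])
```

The per-user request sequences produced this way are the input that the session reconstruction heuristics then split into sessions.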
Types:
– Reactive strategies process requests after they have been handled by the web server; they work on the web server logs to obtain sessions. The approach proposed in this thesis is reactive.
– Proactive strategies process requests during the user's interactive browsing of the web site. Session data is gathered while the user interacts with the site. Proactive strategies are applied on dynamic server pages.

Previous Reactive Heuristics: Session Reconstruction
– Proactive strategies require changes to the internal structure of the web site, for example changes in the source code of each dynamic web page.
– Reactive strategies require no change and are used for web analytics purposes: customers hand over the web logs of their site, and the logs are analyzed with these methods.
– Reactive methods are applicable to all web sites that use the same log format.

Previous Reactive Heuristics
Two types of reactive heuristics have been defined before:
– Time-oriented heuristics [Spiliopoulou 98, Cooley 99-1]
– Navigation-oriented heuristic [Cooley 99-1, Cooley 99-2]
Smart-SRA [Bayir 06-2] is the new approach proposed in this thesis. It combines these heuristics with web topology information in order to increase the accuracy of the reconstructed sessions.

Previous Reactive Heuristics: Example web topology graph used for applying the heuristics
The topology of a web site can be represented by a directed web graph. The topology information can be extracted by using the crawling module of search engine APIs.
[Figure: example web topology graph with nodes P1, P13, P20, P23, P34, P49]
Example web page request sequence:
Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    15   29   32   47

Previous Session Reconstruction Heuristics
Two types of time-oriented heuristics are defined.
Time-oriented heuristic 1: the total duration of a discovered session is limited by a threshold (threshold 1).
Example:
Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    15   29   32   47
With time threshold 1 = 30 mins, the resulting sessions are:
1. [P1, P20, P13, P49] (t(P49) - t(P1) = 29 < 30)
2. [P34, P23] (t(P23) - t(P34) = 15 < 30)

Previous Session Reconstruction Heuristics
Time-oriented heuristic 2: the time spent on any page is limited by a threshold (threshold 2), that is, t(Pn+1) - t(Pn) < threshold 2.
Example:
Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    15   29   32   47
With time threshold 2 = 10 mins, the resulting sessions are:
1. [P1, P20, P13]
2. [P49, P34]
3. [P23]

Previous Session Reconstruction Heuristics: Navigation-Oriented Heuristic
While processing the user request sequence, there are two cases for adding a new page WPN+1 to a session [WP1, WP2, …, WPN]:
– If WPN has a hyperlink to WPN+1, the page is appended:
  [WP1, WP2, …, WPN, WPN+1]
– If WPN does not have a hyperlink to WPN+1, let WPKmax be the nearest (most recently visited) page that has a hyperlink to WPN+1. Backward browser moves are added until that page is reached:
  [WP1, WP2, …, WPN, WPN-1, WPN-2, ..., WPKmax, WPN+1]

Previous Session Reconstruction Heuristics: Navigation-Oriented Heuristic
Example on the user request sequence:
Current Session                           Condition                               New Page
[]                                                                                P1
[P1]                                      Link[P1, P20] = 1                       P20
[P1, P20]                                 Link[P20, P13] = 0, Link[P1, P13] = 1   P13
[P1, P20, P1, P13]                        Link[P13, P49] = 1                      P49
[P1, P20, P1, P13, P49]                   Link[P49, P34] = 0, Link[P13, P34] = 1  P34
[P1, P20, P1, P13, P49, P13, P34]         Link[P34, P23] = 1                      P23
[P1, P20, P1, P13, P49, P13, P34, P23]

Smart-SRA
Phase 1: Shorter request sequences (candidate sessions) are constructed by using the overall session duration time and page-stay time criteria.
Phase 2: Candidate sessions are partitioned into maximal sub-sessions such that between each consecutive page pair in a session there is a hyperlink from the previous page to the next page.
For a session [P1, P2, …, Pn]:
Topology rule:
– ∀i: 1 ≤ i < n, there is a hyperlink from Pi to Pi+1
Time rules:
– ∀i: 1 ≤ i < n, Timestamp(Pi) < Timestamp(Pi+1)
– ∀i: 1 ≤ i < n,
  Timestamp(Pi+1) - Timestamp(Pi) ≤ ρ (page stay time)
– Timestamp(Pn) - Timestamp(P1) ≤ δ (session duration time)

Smart-SRA
Phase 2 of Smart-SRA processes a candidate session from left to right by repeating the following steps until the candidate session is empty:
1. Determine the web pages that have no referrer (on their left) in the candidate session, and remove them from the candidate session.
2. For each one of these pages and for each previously constructed session: if there is a hyperlink from the last page of the session to the web page and the page stay time constraint is satisfied, append the web page to the session.
3. Remove non-maximal sessions.

Smart-SRA: Example web topology used for applying Smart-SRA
[Figure: example web topology graph with nodes P1, P13, P20, P23, P34, P49]
Example candidate session:
Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    9    12   14   15

Smart-SRA: Iterations on the example candidate session
(The temp page set holds the pages without referrers in the candidate session.)
Iteration 1
– Candidate session: [P1, P20, P13, P49, P34, P23]
– New session set (before): empty
– Temp page set: {P1}
– Temp session set: [P1]
– New session set (after): [P1]
Iteration 2
– Candidate session: [P20, P13, P49, P34, P23]
– New session set (before): [P1]
– Temp page set: {P20, P13}
– Temp session set: [P1, P20], [P1, P13]
– New session set (after): [P1, P20], [P1, P13]
Iteration 3
– Candidate session: [P49, P34, P23]
– New session set (before): [P1, P20], [P1, P13]
– Temp page set: {P49, P34}
– Temp session set: [P1, P13, P34], [P1, P13, P49]
– New session set (after): [P1, P13, P34], [P1, P13, P49], [P1, P20]
Iteration 4
– Candidate session: [P23]
– New session set (before): [P1, P13, P34], [P1, P13, P49], [P1, P20]
– Temp page set: {P23}
– Temp session set: [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
– New session set (after): [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]

Agent Simulator
– Models the behavior of web users and generates both the web user navigation and the log data kept by the web server.
– Used to compare the performance of alternative session reconstruction heuristics.

Agent Simulator
Provides 4 basic behaviors of a web user:
• A web user can start a session with any one of the possible entry pages of a web site.
• A web user can select the next page having a link from the most recently accessed page.
• A web user can press the back button one or more times and thus select as the next page a page having a link from any one of the previously browsed pages (i.e., pages accessed before the most recently accessed one).
• A web user can terminate his/her session.

Agent Simulator: Behavior I
A web user can start a new session with any one of the possible entry pages of the web site.
[Figure: web topology graph (P1, P13, P20, P23, P34, P49) with requests 1 and 2; legend: start page, new request from server, S1 = Session I, S2 = Session II]

Agent Simulator: Behavior II
A web user can select a new page having a link from the most recently accessed page.
[Figure: the same topology graph and legend, with requests 1 and 2]

Agent Simulator: Behavior III
A web user can select as the next page a page having a link from any one of the previously browsed pages.
[Figure: the same topology graph and legend, with requests 1 to 5]

Agent Simulator: Behavior IV
A web user can terminate the session. In the example, the session is terminated at P23.
[Figure: the same topology graph and legend, with requests 1 to 6]
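The four behaviors can be sketched as a small random walk over the topology graph. This is a minimal illustration and not the simulator used in the experiments: the function name `simulate_agent`, the single random roll that decides between behaviors, the `max_requests` cap, the choice of `P1` as the only entry page, and the way a back move opens a new real session sharing the old prefix are all assumptions. The edges in `TOPOLOGY` are the ones implied by the worked examples in this deck. The three probabilities stand in for the simulator's session termination (STP), link from previous pages (LPP), and new initial page (NIP) parameters, and the sketch assumes they sum to at most 1.

```python
import random

# Edges implied by the worked examples in this deck; anything beyond
# those examples is an assumption.
TOPOLOGY = {
    "P1": ["P13", "P20"],
    "P13": ["P34", "P49"],
    "P20": ["P23"],
    "P34": ["P23"],
    "P49": ["P23"],
    "P23": [],
}
ENTRY_PAGES = ["P1"]  # assumed set of possible entry pages


def simulate_agent(topology, entry_pages, stp, lpp, nip, rng, max_requests=50):
    """Simulate one web user; return (real_sessions, request_log).

    real_sessions is the ground truth, request_log is the page sequence
    the server would record (back moves are not logged in this sketch).
    """
    session = [rng.choice(entry_pages)]          # Behavior I: entry page
    sessions = [session]
    log = [session[0]]
    while len(log) < max_requests:
        roll = rng.random()
        if roll < stp:                           # Behavior IV: terminate
            break
        if roll < stp + nip:                     # Behavior I: new entry page
            session = [rng.choice(entry_pages)]
            sessions.append(session)
            log.append(session[0])
            continue
        # Behavior III: earlier pages of the session that have outgoing links.
        earlier = [i for i in range(len(session) - 1) if topology[session[i]]]
        if earlier and roll < stp + nip + lpp:
            back_to = rng.choice(earlier)
            session = session[: back_to + 1]     # new session shares the prefix
            sessions.append(session)
        links = topology[session[-1]]
        if not links:                            # dead end: treat as termination
            break
        next_page = rng.choice(links)            # Behavior II: follow a link
        session.append(next_page)
        log.append(next_page)
    return sessions, log


sessions, log = simulate_agent(
    TOPOLOGY, ENTRY_PAGES, stp=0.05, lpp=0.30, nip=0.30, rng=random.Random(42)
)
print("server log:   ", log)
print("real sessions:", sessions)
```

Because the real sessions are known for every generated log, a reconstruction heuristic can be run on `log` and its output compared against `sessions`.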
Agent Simulator
3 parameters for simulating the behavior of a web user:
– Session Termination Probability (STP)
– Link from Previous pages Probability (LPP)
– New Initial page Probability (NIP)

Experimental Results
Heuristics tested:
– Time-oriented heuristic (heur1) (total time 30 min)
– Time-oriented heuristic (heur2) (page stay 10 min)
– Navigation-oriented heuristic (heur3)
– Smart-SRA heuristic (heur4)

Experimental Results
Accuracy is determined as follows: a reconstructed session H captures a real session R if R occurs as a subsequence of H. A string-matching relation is needed; as the example shows, the pages of R must appear in H without interruption.
R = [P1, P3, P5]
H = [P9, P1, P3, P5, P8] => H captures R: Yes
H = [P1, P9, P3, P5, P8] => H captures R: No

Experimental Results
Parameters for generating user sessions and web topology:
Number of web pages (nodes) in topology   300
Average outdegree                          15
Average page stay time                     2.2 min
Deviation of page stay time                0.5 min
Number of agents                           10000
STP: fixed / range                         5% / 1%-20%
LPP: fixed / range                         30% / 0%-90%
NIP: fixed / range                         30% / 0%-90%

Experimental Results: Accuracy vs. STP
Increasing STP leads to sessions with fewer pages, which are easier to reconstruct. In short sessions, the probability that the LPP and NIP behaviors occur is also small.

Experimental Results: Accuracy vs. LPP
[Figure: Real Accuracy vs LPP; x-axis: LPP (0 to 90), y-axis: real accuracy % (0 to 50); series: heur1, heur2, heur3, heur4]
As LPP increases, the real accuracy decreases. Increasing LPP leads to more complex sessions. Intelligent path completion is needed to discover more accurate sessions.

Experimental Results: Accuracy vs.
NIP
[Figure: Real Accuracy vs NIP; x-axis: NIP (0 to 90), y-axis: real accuracy % (0 to 35); series: heur1, heur2, heur3, heur4]
Increasing NIP produces more complex sessions, and the accuracy decreases for all heuristics. Path separation is needed to discover more accurate sessions.

Conclusion
New session reconstruction heuristic: Smart-SRA
– Does not allow sequences with unrelated consecutive requests (no hyperlink from the previous page to the next one)
– Inserts no artificial browser (back) requests in order to prevent unrelated consecutive requests
– Discovers only maximal sessions
The agent simulator simulates the behavior of real WWW users, which makes it possible to evaluate the accuracy of the heuristics.
Experimental results show that Smart-SRA outperforms the previous reactive heuristics.

References
[Bayir 06-1] M. A. Bayir, I. H. Toroslu, A. Cosar (2006). A Performance Comparison of Pattern Discovery Methods on Web Log Data. AICCSA-06, the 4th ACS/IEEE International Conference on Computer Systems and Applications.
[Bayir 06-2] M. A. Bayir, I. H. Toroslu, A. Cosar (2006). A New Approach for Reactive Web Usage Data Processing. ICDE Workshops, 44.
[Cooley 99-1] R. Cooley, B. Mobasher, J. Srivastava (1999). Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, Vol. 1, No. 1.
[Cooley 99-2] R. Cooley, P. Tan, J. Srivastava (1999). Discovery of Interesting Usage Patterns from Web Data. Advances in Web Usage Analysis and User Profiling, LNAI 1836, Springer, Berlin, Germany, 163-182.
[Spiliopoulou 98] M. Spiliopoulou, L. C. Faulstich (1998). WUM: A Tool for Web Utilization Analysis. Proceedings EDBT Workshop WebDB'98, LNCS 1590, Springer, Berlin, Germany, 184-203.

Thank you for listening. Any questions?