Design and Implementation of a Web Log Preprocessing System Supporting Path Completion
Batchimeg
AI lab.
2005.04.19

Outline
- Introduction
- Background
- Related work
- Proposed System
- Experiment and Result
- Conclusion and Future work

Introduction
Web Log Mining Process
[Figure: Web log mining process. Web site visitors (viewing news, e-mail, download, shopping, auction) produce logged data, saved as Web log data in the Web server. Logged fields: IP, OS, Agent, Time, URL, Refer page, Date, Cookie, Method, Status, UserID, bytes, … Stages: Web log preprocessing, Preprocessing DB, Pattern Discovery, Pattern Analysis / Data Analysis (visualization tools, knowledge query, intelligent agents). My research area: Web log preprocessing.]

Background (1/4)
Log format:
210.126.19.93 - - [23/Jan/2005:13:37:12 -0800] "GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" 200 2705 "http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" …
- Client IP - 210.126.19.93
- Date - 23/Jan/2005
- Accessed time - 13:37:12
- Method - GET (request a page), POST (send data to the server), HEAD (request headers only)
- Protocol - HTTP/1.1
- Status code - 200 (success); 301 (redirect); 401, 500 (errors)
- Size of file - 2705
- Agent type - Mozilla/4.0
- Operating system - Windows NT
The log contains 285014 records.
Refer page -> requested page:
http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225
-> http://www.olloo.mn/modules.php?name=News&file=friend&op=FriendSend&sid=8225
After viewing the news article, the visitor (210.126.19.93) sent it to a friend.

Background (2/4) - User Identification, Session Identification
User identification identifies each user accessing the Web site.
IP + Browser (or UserID + IP + OS, or a cookie) => identify the users.
Session identification finds each user's access pattern and frequent paths.
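The fields listed in Background (1/4) and the IP + Browser rule in Background (2/4) can be illustrated with a small sketch. This is not the system's actual code: the regular expression, the group names, and the helper names `parse_log_line` and `identify_users` are assumptions made for illustration, and the second log line is a hypothetical variant of the slide's sample with a different agent string.

```python
import re

# Pattern for the log format shown on the slide (group names are illustrative).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^:]+):(?P<time>\S+) (?P<zone>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def identify_users(records):
    """Group requested URLs by (IP, agent), i.e. the IP + Browser rule."""
    users = {}
    for record in records:
        key = (record["ip"], record["agent"])
        users.setdefault(key, []).append(record["url"])
    return users


# Sample record from the slide.
SAMPLE = (
    '210.126.19.93 - - [23/Jan/2005:13:37:12 -0800] '
    '"GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" '
    '200 2705 '
    '"http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)
# Hypothetical second record: same IP, different browser.
OTHER = SAMPLE.replace(
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
    "Mozilla/5.0 (Windows NT)",
)

record = parse_log_line(SAMPLE)
users = identify_users([record, parse_log_line(OTHER)])
```

With these two records, the same IP address is split into two users because the agent strings differ, which is the distinction the IP + Browser rule is meant to capture.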
Preprocessing steps: Cleaning Log -> User Identification -> Session Identification -> Path Completion -> Formatting

After user identification (IP, Browser):
IP | Browser | Visited pages
202.131.3.100 | Mozilla/5.0(Windows NT) | A,B,C,D,F,A,L
202.131.3.100 | Mozilla/4.0 (Win2000) | A,B,G,L
210.126.19.93 | Mozilla/4.0(Windows NT) | N,O

After session identification:
IP | Browser | Visited pages
202.131.3.100 | Mozilla/5.0(Windows NT) | A,B,C,D,F
202.131.3.100 | Mozilla/5.0(Windows NT) | A,L
202.131.3.100 | Mozilla/4.0 (Win2000) | A,B,G,L
210.126.19.93 | Mozilla/5.0(Windows NT) | N,O

Background (3/4) - Server Log and Caching
Missed page views at the server:
If a client had to request every Web page from the server, browsing would be slower. The solution to this problem is caching: clients and proxy servers save local copies of pages ("back" and "forward" …).
[Figure: client, cache, and server exchanging page requests (P3, P4, P3, P6) and responses. Requests answered from the cache are never logged by the server.]

Background (4/4) - Path Completion
Not all requested pages are recorded in the Web log, because of caching.
Preprocessing steps: Cleaning Log -> User Identification -> Session Identification -> Path Completion -> Formatting
[Figure: topological structure of the site, pages A.html through Q.html.]

Before path completion | After path completion
A,B,C,D,F | A,B,C,D,C,B,F
A,L | A,L
A,B,G,I | A,B,A,G,I
N,O | N,O

Related work
Related works | Using Topological Structure | Removing images | Removing robot text | User/Session Identification | Path completion
R. Cooley [12] | O | O | O | Login, IP, Agent | O
1996 Olympics site [8] | X | O | X | Cookie | X
Yan, Jacobsen [5] | X | O | X | IP, Agent | X
Pitkow [7] | O | O | X | Session ID | O
Shahabi [2] | X | O | X | Session ID | O
Chen, Park [3] | X | O | X | Login, IP | X
O = used, X = not used
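The before/after example in Background (4/4) can be reproduced with a short sketch. This is one possible backtracking approach, not necessarily the algorithm used in the system: when the next page of a session is not linked from the current page, the sketch assumes the visitor pressed "back" (served from cache) until reaching a page that links to it. The function name `complete_path` and the link table `LINKS` are assumptions; the links are only inferred from the example rows, since the site structure figure is not recoverable.

```python
def complete_path(session, links):
    """Insert pages revisited through cached "back" moves.

    session: list of pages in logged order.
    links: dict mapping a page to the set of pages it links to.
    """
    if not session:
        return []
    completed = [session[0]]
    stack = [session[0]]  # current forward path, used to replay "back" moves
    for page in session[1:]:
        # Walk back along the forward path until some page links to `page`.
        back = len(stack) - 1
        while back >= 0 and page not in links.get(stack[back], set()):
            back -= 1
        if back >= 0:
            # Pages passed through while going back were not logged; add them.
            completed.extend(reversed(stack[back:len(stack) - 1]))
            del stack[back + 1:]
        else:
            # No earlier page links here: treat it as a fresh entry point.
            stack = []
        completed.append(page)
        stack.append(page)
    return completed


# Hypothetical link structure inferred from the example rows.
LINKS = {
    "A": {"B", "G", "L"},
    "B": {"C", "F"},
    "C": {"D"},
    "G": {"I"},
    "N": {"O"},
}

result = complete_path(["A", "B", "C", "D", "F"], LINKS)
```

Under the assumed links, F is reachable only from B, so the sketch inserts the cached visits to C and B before F.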
Proposed System (1/7) (preprocessing)
- Data cleaning (eliminate irrelevant information)
- Web site's topological structure (find the hyperlink relations between Web pages): construct the site topological structure from the Web log data in the server
- User identification, session identification (identify each user, find each user's access pattern)
- Path completion
- User grouping (after session identification and path completion)
- Result
Why preprocessing?
Preprocessing can take up to 60-80% of the time spent analyzing the data. An incomplete preprocessing task can easily result in invalid patterns and wrong conclusions.

Proposed System (2/7)
Make the site topological structure. It helps solve data preprocessing and analysis tasks:
- user identification
- path completion
Goal of the proposed system: discover similar user groups, relevant page groups, and frequent access paths.

Proposed System (3/7)
Algorithm of Topological Structure
[Flow chart: make the topological structure. Decisions: not end of log file; "http" data found; other records remaining; URL_Queue not empty. Actions: enter the URL into URL_Queue; get the head of the queue and define its depth; add the link to Topo_Str_DB.]

Proposed System (4/7) - Make the topological structure
Topological Structure
- input: URL path and link
- output: complete sitemap (tree): link, path, depth, and referrers
Queue (depth. page):
0. Index.html (A)
1. L.html (referrer)
2. Sport/Team/football.html
2. Sport/News/Mongolia.html
1. Sport.html
2. Sport/Team/
3. Sport/Team/football.html
2. Sport/Advice/
. . .
[Figure: sitemap tree by depth (0 to 3), with Index.html (A) at depth 0 and Sport.html and L.html at depth 1. Deeper pages include Sport/News/Mongolia.html, Sport/Team/, Sport/Advice/, and Sport/Team/football.html. Edges are labeled with referrers such as olloo.mn/L.html and olloo.mn/Sport.html.]
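The queue-based construction outlined in Proposed System (3/7) and (4/7) can be sketched as a breadth-first traversal over (referrer, URL) pairs taken from the log. This is a minimal sketch under assumptions: the function name `build_topology`, the in-memory dictionaries standing in for Topo_Str_DB, and the sample pairs (adapted from the queue listing on the slide) are illustrative, not the system's actual data structures.

```python
from collections import deque


def build_topology(pairs, root):
    """Build a link table and page depths from (referrer, url) pairs.

    Returns (links, depth): links maps a page to the set of pages it
    refers to; depth maps each reachable page to its distance from root.
    """
    links = {}
    for referrer, url in pairs:
        # Skip records without a referrer and self-references.
        if referrer and referrer != url:
            links.setdefault(referrer, set()).add(url)

    depth = {root: 0}
    queue = deque([root])  # plays the role of URL_Queue
    while queue:
        page = queue.popleft()  # get the head
        for child in sorted(links.get(page, ())):
            if child not in depth:  # keep the shallowest depth found
                depth[child] = depth[page] + 1
                queue.append(child)
    return links, depth


# Pairs adapted from the queue listing on the slide.
PAIRS = [
    ("Index.html", "L.html"),
    ("Index.html", "Sport.html"),
    ("L.html", "Sport/Team/football.html"),
    ("L.html", "Sport/News/Mongolia.html"),
    ("Sport.html", "Sport/Team/"),
    ("Sport/Team/", "Sport/Team/football.html"),
    ("Sport.html", "Sport/Advice/"),
]

links, depth = build_topology(PAIRS, "Index.html")
```

Note that Sport/Team/football.html is reachable at depth 2 through L.html and at depth 3 through Sport/Team/. This sketch records only the shallowest depth, which is a design choice of the sketch rather than something the slides specify.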
Proposed System (5/7) - User Identification
[Flow chart: user identification algorithm. Decisions: not end of log DB; IP not in IPSet; current IP's Agent and OS the same; other records remaining. Actions: save the IP, Agent, and OS; assign to the User Set and increase the user counter (.. for similar user group).]

Proposed System (6/7) - Session Identification
[Flow chart: session identification algorithm. Decisions: not end of log DB; IP not in User Set; time taken > 25.5 (unit not given on the slide); refer page empty; other records remaining. Actions: start a new session; append the page to the session; go to path completion.]

Proposed System (7/7) - Path Completion
[Flow chart: path completion algorithm. Decisions: not end of session set; a page in a session contains (links to) the next page in that session. If yes, check the next page. Otherwise, search for that page in the site map and complete the path.]

Experiment (1/4)
[Figure: www.olloo.mn raw log data; URLs in the Web server log.]

Experiment (2/4)
[Figure: topological structure.]

Experiment (3/4)
Cleaning result
[Figure: data cleaning. Log size (K), axis 0 to 60000, before and after cleaning, for log files dated 2005.01.03, 2005.01.10, 2005.01.17, 2005.01.31, 2005.02.19, 2005.02.26, 2005.03.14, 2003.03.31, 2003.04.05.]

Experiment (4/4)

Result
User group, path completion.
This result can be more helpful for discovering similar user groups, relevant page groups, and frequent access paths in WUM.

Interface of Path Completion Preprocessing System (PCPS): start the new project.
PCPS interface: give the project name and folder.
PCPS interface: add the log file to the project.
PCPS interface: choose the log file to add.
PCPS interface: ask whether to remove the image files. Files to analyze …; files to clean …
PCPS interface: cleaned log and information (the pages and files selected for analysis).
PCPS interface: topological structure.
PCPS interface: browser.
PCPS interface: system.

Comparing other preprocessing approaches to the Proposed System
Related works | Creation of Topol. Structure | Using Topological Structure | Removing images | Removing robot text | User/Session Identification | Path completion
R. Cooley [12] | X | O | O | O | Login, IP, Agent | O
1996 Olympics site [8] | X | X | O | X | Cookie | X
Yan, Jacobsen [5] | X | X | O | X | IP, Agent | X
Pitkow [7] | X | O | O | X | Session ID | O
Shahabi [2] | X | X | O | X | Session ID | O
Chen, Park [3] | X | X | O | X | Login, IP | X
Proposed System | O | O | O | O | IP, Agent, Grouping | O
O = used, X = not used

Conclusion
Approach | Identified number of accesses | Identified number of users | Identified number of sessions
Path completion not used | 18019 | 2812 | 10407
Proposed System | 18019 | 3061 | 11019
• My work focuses on the preprocessing stage of Web log mining and enhances pattern discovery. Without path completion, 3061 - 2812 = 249 users are neglected.
• This paper presented a new approach and practicable algorithms.
• This approach can achieve better precision than some existing approaches.

Reference
[1] R. Cooley, B. Mobasher, and J. Srivastava. Web mining: Information and pattern discovery on the World Wide Web. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455, USA, 1998.
[2] C. Shahabi and F.B. Kashani. A framework for efficient and anonymous Web usage mining based on client-side tracking. 2001.
[3] M.S. Chen, J.S. Park, and P.S. Yu. Data mining for path traversal patterns in a Web environment. 1996.
[4] H. Mannila and H. Toivonen. Discovering generalized episodes using minimal occurrence. 1996.
[5] T. Yan, M. Jacobsen, H. Garcia-Molina, and U. Dayal. From user access patterns to dynamic hypertext linking. 1996.
[6] J. Pitkow. In search of reliable usage data on the WWW. 1997.
[7] J. Pitkow, P. Pirolli, and R. Rao. Silk from a sow's ear: Extracting usable structures from the Web. 1996.
[8] S. Elo-Dean and M. Viveros. Data mining the IBM official 1996 Olympics Web site.
[9] Open Market Inc. Open Market Web reporter. http://www.openmarket.com, 1996.
[10] net.Genesis. net.analysis desktop. http://www.netgen.com, 1996.
[11] Doru Tanasa and Brigitte Trousse. Advanced data preprocessing for intersites Web usage mining. 2004.
[12] R. Cooley. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. PhD thesis, Dept. of Computer Science, Univ. of Minnesota, 2000.