COMP 630H
A Survey on Preprocessing Techniques in Web Usage Mining
Ke Yiping
Student ID: 03997175
Email: [email protected]
Computer Science Department
The Hong Kong University of Science and Technology
Dec 2003

Abstract
The World Wide Web (WWW) continues to grow at an overwhelming rate, both in the sheer volume of traffic and in the size and complexity of Web sites. It therefore becomes ever more necessary, yet difficult, to extract useful information from Web data in order to understand and better serve the needs of Web-based applications. As a result, Web usage mining, which combines two prominent research areas, data mining and the World Wide Web, has become a hot research topic. Web usage mining consists of three phases: preprocessing, pattern discovery, and pattern analysis. This survey focuses on the first phase, data preprocessing, which is essential and must be performed before data mining algorithms are applied to the data sources. It presents an overview of the preprocessing techniques that aim at identifying unique users, user sessions and transactions.

Key words: Web Usage Mining, Web Log Mining, Data Preprocessing.

1. Introduction and Background
Web usage mining is the part of Web mining that deals with extracting useful knowledge from the secondary data generated by users' interactions with the Web. Web usage data include Web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data resulting from such interactions. The scope of Web usage mining is local: it spans an individual Web site.
Web usage mining tools [2, 4, 5, 6] discover and predict user behavior in order to help the designer improve the web site, attract visitors, or give regular users a personalized and adaptive service. Table 1 characterizes Web usage mining in terms of view of data, main data, representation, method, mining steps and applications. As in typical data mining applications, data preprocessing plays a fundamental role in the whole mining process. As pointed out in [22], this phase is critical to the success of pattern discovery: errors committed here may render the data useless for further analysis. In Web usage analysis, data preprocessing includes repairing erroneous data and treating missing values [1, 23]. Preprocessing web usage data, which consists mainly of web logs, is usually complex and time-consuming; it can take up to 80% of the time spent analyzing the data. The preprocessing tasks in [1] include data cleaning, user identification, session identification, path completion, session reconstruction, transaction identification and formatting. Further tasks, such as constructing concept hierarchies and classifying URLs [2, 3], may be invoked in certain cases. The aim of preprocessing is to provide a structured, reliable and integrated data source for the next phase, pattern discovery. The typical problem is distinguishing among unique users, user sessions, transactions, etc.
Table 1: Web Usage Mining
  View of Data:    Interactivity
  Main Data:       Server logs; Browser logs
  Representation:  Relational table; Graph
  Method:          Machine learning; Statistical; (Modified) association rules
  Mining Steps:    Data preprocessing; Pattern discovery; Pattern analysis
  Applications:    Site construction, adaptation, and management; Marketing; User modeling

The rest of this survey is organized as follows. Section 2 presents the tasks and techniques of Web usage mining preprocessing in detail. Section 3 describes some problems in preprocessing, and Section 4 concludes.

2. Preprocessing Tasks and Techniques

2.1 Data Cleaning
This step removes all data useless for analysis and mining: requests for graphical page content (e.g., jpg and gif images), requests for any other file that might be embedded in a web page, and even navigation sessions performed by robots and web spiders. The quality of the final results depends strongly on the cleaning process; the discovered associations or reported statistics are only useful if the data in the server log give an accurate picture of user accesses to the Web site. The general cleaning procedure is as follows:
• First, remove entries with a status of "error" or "failure". This strips noisy data from the data set and is easy to accomplish.
• Second, identify and remove access records generated by automatic search engine agents, primarily log entries created by the crawlers or spiders widely used by Web search engines. Such data contribute nothing to the analysis of user navigation behavior. Many crawlers voluntarily declare themselves in the agent field of the access log, so a simple string match during the data cleaning phase can strip off a significant amount of agent traffic.
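A minimal sketch of such cleaning, combining the status filter, the stripping of image requests, and a string match on the agent field (the log-entry structure and the bot substrings are illustrative assumptions, not taken from any specific tool):

```python
# Minimal log-cleaning sketch: drop error entries, image requests,
# and requests whose agent field matches a known crawler substring.
# Field names and the bot list are illustrative assumptions.

IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")
BOT_SUBSTRINGS = ("googlebot", "crawler", "spider", "slurp")

def clean_log(entries):
    """entries: list of dicts with 'status', 'url', 'agent' keys."""
    cleaned = []
    for e in entries:
        if not (200 <= e["status"] < 400):                # drop errors/failures
            continue
        if e["url"].lower().endswith(IMAGE_EXTENSIONS):   # drop image requests
            continue
        agent = e["agent"].lower()
        if any(bot in agent for bot in BOT_SUBSTRINGS):   # drop declared crawlers
            continue
        cleaned.append(e)
    return cleaned
```

Whether the image filter belongs in the pipeline depends, as discussed below, on whether the analysis involves media files.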
In addition, to exclude these accesses, [2] employs several heuristics based on indicators of non-human behavior: (i) repeated requests for the same URL from the same host; (ii) time intervals between requests too short to apprehend the contents of a page; (iii) a series of requests from one host all of whose referrer URLs are empty (the referrer URL of a request is empty if the URL was typed in, requested via a bookmark, or requested by a script).
• The last task of data cleaning, which is also disputable, is whether to remove log entries for image, audio and video files. Some systems, such as WebMiner, remove all such entries, while others, such as WebLogMiner, keep them. Generally, WUM tools concerned with user navigation patterns remove them, because they have little influence on the analysis of users' navigation behavior even when they are generated by user clicks.
Typical web log cleaning methodologies mainly remove image files with extensions such as GIF and JPG (when the analysis does not involve examining images). Beyond media and multimedia files, however, there are masses of other irrelevant entries, such as internal web administrator actions and special-purpose files, that stay untouched throughout the analysis or even affect it negatively. The complexity of many web sites can likewise influence the data cleaning process and the final results. In contrast to the typical techniques, [9] introduces an advanced cleaning technique. In the case described there, the frame pages have no characteristic file extension (unlike pictures, with extensions such as jpg or gif), and after applying the typical data cleaning techniques the results were not meaningful to the client.
Therefore, the paper proposes a technique consisting of two frame-recognition stages: retrieving the HTML code from the web server, and filtering. The improved filtering removes pages that no other pages link to. With this methodology, cleaning covers not just frame pages but all irrelevant pages, leaving only the pages essential for further analysis and experimental studies.
While requests for graphical content and files are easy to eliminate, robots' and web spiders' navigation patterns must be explicitly identified. This is usually done by examining the remote hostname or the user agent, or by checking accesses to the robots.txt file. However, some robots send a false user agent in the HTTP request. For these cases, [20] proposes separating robot sessions from actual users' sessions using standard classification techniques; highly accurate models can be obtained after only three requests, using a small set of access features computed from the Web server logs. A heuristic based on navigational behavior can also be used to achieve this goal (see [24]).

2.2 User Identification
This task is greatly complicated by the existence of local caches, corporate firewalls, and proxy servers. The data recorded by a Web server are not sufficient to distinguish among different users, or among multiple visits by the same person. The standard Common Logfile Format (CLF), as well as the Extended CLF (ECLF), records only the host or proxy IP from which requests originate, so different visitors sharing the same host or proxy server cannot be distinguished. Methods such as cookies, user registration, or the remote agent (a cookie-like mechanism used in [18]) make it possible to identify a visitor. Their shortcoming is that they rely on the user's cooperation, which users often refuse out of privacy concerns or because they find it troublesome.
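When such cooperative mechanisms are unavailable, identification must fall back on log fields alone. A minimal sketch of this fallback, grouping requests by the combination of IP address and agent field so that the same IP with a different browser or OS string counts as a different visitor (the entry structure is an illustrative assumption):

```python
# Group log entries into provisional "users" by (IP, agent) pairs:
# the same IP with a different browser/OS string in the agent field
# is treated as a different user. Field names are assumptions.

from collections import defaultdict

def identify_users(entries):
    """entries: list of dicts with 'ip' and 'agent' keys, in log order."""
    users = defaultdict(list)
    for e in entries:
        users[(e["ip"], e["agent"])].append(e)
    return users
```

This sketch covers only the agent-based heuristic; the link-reachability heuristic discussed next needs the site topology as well.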
In [1], the author provides heuristics for user identification. The first states that two accesses with the same IP address but different browsers (or browser versions) or operating systems, both recorded in the agent field, originate from two different users. The rationale is that a user navigating a web site rarely employs more than one browser, much less more than one OS; the method causes confusion, however, when a visitor actually does so. The second heuristic states that when a requested web page is not reachable by a hyperlink from any previously visited page, another user with the same IP address is present. This method introduces similar confusion when a user types a URL directly or uses a bookmark to reach pages not connected via links. Another main obstacle to user identification is the use of proxy servers: using a machine name to uniquely identify users can result in several users being erroneously grouped together as one. An algorithm presented in [25] checks whether each incoming request is reachable from the pages already visited; if a requested page is not directly linked to the previous pages, multiple users are assumed to exist on the same machine. In [17], user session lengths determined automatically from navigation patterns are used to identify users. Other heuristics combine IP address, machine name, browser agent, and temporal information [26].

2.3 User Session Definition
A user session is the set of consecutive pages visited by a single user within a defined duration. For logs that span long periods of time, it is very likely that users will visit the Web site more than once. The goal of session identification is to divide the page accesses of each user into individual sessions. As with user identification, cookie and session mechanisms can both be used for session identification, but the same problems also exist.
To define user session, two criteria are usually considered: 1. Upper limit of the session duration as a whole; 2. Upper limit on the time spent visiting a page. Generally the second method is more practical and has been used in WUM and WebMiner. It is achieved through a timeout, where if the time between page requests exceeds a certain limit, it is assumed that the user is starting a new session. Many commercial products use 30 minutes as a default timeout, and [27] established a timeout of 25.5 minutes based on empirical data. Once a site log has been analyzed and usage statistics obtained, a timeout that is appropriate for the specific Web site can be fed back into the session identification algorithm. 2.4 Path Completion In order to reliably identify unique user session, it should determine if there are important accesses that are not recorded in the access log. The reason why causes such matter is mainly because of the presence of Cache. Mechanisms such as local caches and proxy servers can severely distort the overall picture of user traversals through a Web site. If user clicks “backward” to visit a page that has had a copy stored in Cache, browser will get the page directly from Cache. Such a page view will never be trailed in access log, thus causing the problem of incomplete path, which need mending. Current methods to try to overcome this problem include the use of cookies, cache busting, and explicit user registration. As detailed in [26], none of these methods are without serious drawbacks. Cookies can be deleted by the user, cache busting defeats the speed advantage that caching was created to provide and can be disabled, and user registration is voluntary and users often provide false information. Methods similar to those used for user identification can be user for path completion. To accomplish this task needs to refer to referrer log and site topology, along with temporal information to infer missing references. 
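Both the timeout criterion of Section 2.3 and a referrer-based check for incomplete paths can be sketched together (timestamps in seconds, the 30-minute default, and the entry structure are illustrative assumptions):

```python
# Sketch of timeout-based session splitting and a referrer check
# that flags incomplete paths. The 30-minute timeout and the
# (timestamp, url, referrer) entry structure are assumptions.

TIMEOUT = 30 * 60  # seconds; [27] suggests ~25.5 minutes empirically

def sessionize(entries, timeout=TIMEOUT):
    """Split one user's time-ordered entries into sessions."""
    sessions, current = [], []
    for e in entries:
        if current and e["timestamp"] - current[-1]["timestamp"] > timeout:
            sessions.append(current)
            current = []
        current.append(e)
    if current:
        sessions.append(current)
    return sessions

def path_is_incomplete(session):
    """True if some request's referrer is not the previously requested
    page, hinting at cached 'back' navigation missing from the log."""
    for prev, cur in zip(session, session[1:]):
        if cur["referrer"] and cur["referrer"] != prev["url"]:
            return True
    return False
```

Repairing a flagged path, rather than merely detecting it, additionally needs the site topology to interpolate the cached page views.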
If the referrer URL of a requested page does not exactly match the last page requested, the path is incomplete. If the referrer URL is in the user's recent request history, we can assume that the user clicked the "back" button to revisit the page; if it is not in the history, a new user session begins, as stated above. The incomplete path can then be repaired using heuristics based on the referrer information and the site topology.

2.5 User Session Reconstruction
This step reconstructs the users' faithful navigation paths within the identified sessions. Errors in session reconstruction and incomplete tracing of users' activities in a site can easily lead to invalid patterns and wrong conclusions. Using additional information about the web site's structure, it is still possible to reconstruct a consistent path by means of heuristics. Heuristic methods for session reconstruction must fulfill two tasks: first, all activities performed by the same physical person should be grouped together; second, all activities belonging to the same visit should be placed in the same group. Knowledge of a user's identity is not necessary for these tasks, but a mechanism for distinguishing among different users is. Session reconstruction can be achieved with several "proactive" and "reactive" strategies.
• Proactive strategies include user authentication, cookies installed on the user's machine that attach a unique identifier to her requests, and the replacement of a site's static pages with dynamically generated pages uniquely associated with the browser that invokes them. These strategies aim at an unambiguous association of each request with an individual before or during the individual's interaction with the Web site.
• Reactive strategies exploit background knowledge of user navigational behavior to assess whether requests registered by the Web server can belong to the same individual, and whether those requests were performed during the same or subsequent visits to the site. These strategies attempt to associate requests with individuals after the interaction with the Web site, based on the existing, incomplete records. [7] evaluates the performance of heuristics employed to reconstruct sessions from server log data; such heuristics partition activities first by user and then by the user's visits to the site, whether or not user identification mechanisms such as cookies are available.

2.6 Transaction Identification
Before any mining is done on Web usage data, sequences of page references must be grouped into logical units representing Web transactions. A transaction differs from a user session in that its size can range from a single page reference to all of the page references in a user session, depending on the criteria used to identify transactions. Unlike traditional data mining domains, such as point-of-sale databases, there is no convenient method of clustering page references into transactions smaller than an entire user session; this problem has been addressed in [28] and [17]. How transactions are defined depends on the kind of knowledge to be mined. WUM [2, 3] treats the duration-based user session described above as the transaction, i.e., a structure-based session; since WUM aims at analyzing users' navigation patterns, taking the session as the transaction is reasonable. WebMiner [1, 17], in contrast, concentrates on association rule and sequential pattern mining, and therefore divides the user session into transactions, each a set of semantically related pages. To do so, WebMiner classifies the pages in a session into auxiliary pages and content pages.
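Given such a classification, a session can be cut into transactions. A minimal sketch of the auxiliary-content split, in which each transaction is the run of auxiliary references up to and including the next content reference (the content-page set is assumed to be supplied by an earlier classification step):

```python
# Split a session into auxiliary-content transactions: each transaction
# is the run of auxiliary page references up to and including the next
# content reference. The content-page set is assumed given.

def auxiliary_content_transactions(session, content_pages):
    """session: list of page URLs; content_pages: set of content URLs."""
    transactions, current = [], []
    for page in session:
        current.append(page)
        if page in content_pages:
            transactions.append(current)
            current = []
    return transactions  # trailing auxiliary-only references are dropped
```

A content-only split would instead keep just the pages in content_pages; both definitions are discussed next.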
Auxiliary pages merely facilitate browsing while the user searches for information; content pages are those the user is interested in and really wants to reach. Using this distinction, there are two ways to define transactions. The first defines a transaction as all the auxiliary references up to and including each content reference for a given user, a so-called auxiliary-content transaction. The second defines a transaction as all of the content references for a given user, a content-only transaction. Based on these two definitions, WebMiner employs two methods to identify transactions, reference length and maximal forward reference, and uses a time window as a benchmark to evaluate them [1, 17, 16].

2.7 Formatting
Once the appropriate preprocessing steps have been applied to the server log, a final preparation module can format the sessions or transactions for the type of data mining to be performed. For example, since temporal information is not needed for mining association rules, a final association-rule preparation module would strip out the time of each reference and perform any other formatting required by the specific data mining algorithm. [29] stores data extracted from web logs in a relational database using a click fact schema, so as to better support log querying aimed at frequent pattern mining. [30] introduces a signature-tree-based method for indexing logs stored in databases to support efficient pattern queries. A tree structure, the WAP-tree, is introduced in [31] to register access sequences to web pages; the authors also optimize this structure for the sequence mining algorithm.

3. Problems in Preprocessing
Clearly, improved data quality improves the quality of any analysis performed on it.
Data quality is a source of major difficulties for Web usage mining, particularly for the sessionizing problem. A dedicated server recording all activities of each user individually has been put forward as a tested and successful solution. In the absence of such a server, cookies and scripts can be used to distinguish among unique users; however, they are not always feasible or popular, due to privacy considerations. A problem inherent in the Web domain is the conflict between the needs of analysts, who want more detailed usage data collected, and the privacy needs of users, who want as little data collected as possible. In addition, caching and proxy servers can make reconstructing reliable sessions difficult.

4. Conclusion
In this survey, I address a typical sequence of Web usage mining preprocessing tasks. The techniques and methods used in each individual task are presented in detail. One important point is that the tasks rely heavily on one another; in practice, some of them are carried out together and are not clearly distinguished from each other. Moreover, for specific mining applications, the preprocessing procedure may vary slightly. For the application of Web personalization [14], for example, the preprocessing steps may include data selection, cleaning and transformation, and the identification of users and user sessions.

Acknowledgments
I would like to express my gratitude to Dr. Wilfred Ng for guiding this survey. I also thank Lu An for the interesting discussions.

References
[1] R. Cooley, B. Mobasher, and J. Srivastava. Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, 1(1), 1999.
[2] B. Berendt and M. Spiliopoulou. Analyzing navigation behavior in Web sites integrating multiple information systems. VLDB Journal, Special Issue on Databases and the Web, 9(1), 2000, pages 56-75.
[3] M. Spiliopoulou.
Web usage mining for Web site evaluation. Communications of the ACM, 43(8), 2000, pages 127-134.
[4] R. Cooley, M. Deshpande, J. Srivastava, and P.N. Tan. Web usage mining: Discovery and applications of usage patterns from web data. ACM SIGKDD Explorations, 1(2), January 2000.
[5] Kdnuggets. Software for web mining. http://www.kdnuggets.com/software/web.html.
[6] M. Spiliopoulou and L.C. Faulstich. WUM: a Web Utilization Miner. In Proceedings of the EDBT Workshop WebDB98, volume 1590 of LNCS, pages 109-115, 1998.
[7] B. Berendt, B. Mobasher, M. Spiliopoulou, and M. Nakagawa. A framework for the evaluation of session reconstruction heuristics in web usage analysis. INFORMS Journal on Computing, 15(2), 2003.
[8] L.C. Faulstich, et al. WUM: A Tool for Web Utilization Analysis. In EDBT Workshop WebDB'98, Spain, 1998.
[9] Z. Pabarskaite. Implementing Advanced Cleaning and End-User Interpretability Technologies in Web Log Mining. In Proceedings of the 24th International Conference ITI 2002.
[10] A. Abraham and Xiaozhe Wang. i-Miner: A web usage mining framework using hierarchical intelligent systems. In The IEEE International Conference on Fuzzy Systems FUZZ-IEEE'03, 2003.
[11] T. Ohmori, Y. Tsutatani, and M. Hoshi. A Novel Datacube Model Supporting Interactive Web-log Mining. In Proceedings of the First International Symposium on Cyber Worlds, pages 419-427, 2002.
[12] R. Kosala and H. Blockeel. Web Mining Research: A Survey. SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining, ACM, 2(1), 2000.
[13] B. Masand and M. Spiliopoulou. WEBKDD-99: Workshop on web usage analysis and user profiling. SIGKDD Explorations, 1(2), 2000.
[14] M. Baglioni, U. Ferrara, A. Romei, S. Ruggieri, and F. Turini. Preprocessing and Mining Web Log Data for Web Personalization. In The 8th Italian Conference on Artificial Intelligence, pages 237-249, volume 2829 of LNCS, September 2003.
[15] R. Cooley, B. Mobasher, and J. Srivastava.
Web Mining: Information and Pattern Discovery on the World Wide Web. In Proceedings of the 9th IEEE International Conference on Tools With Artificial Intelligence (ICTAI '97), Newport Beach, CA, 1997.
[16] R. Cooley, B. Mobasher, and J. Srivastava. Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, 1(1), 1999.
[17] R. Cooley, B. Mobasher, and J. Srivastava. Grouping Web Page References into Transactions for Mining World Wide Web Browsing Patterns. In Proceedings of the 1997 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX-97), November 1997.
[18] C. Shahabi, A.M. Zarkesh, J. Adibi, and V. Shah. Knowledge discovery from users Web-page navigation. In Proceedings of the Seventh International Workshop on Research Issues in Data Engineering, April 1997, pages 20-29.
[19] F.M. Facca and P.L. Lanzi. Recent Developments in Web Usage Mining Research. In Data Warehousing and Knowledge Discovery: 5th International Conference, DaWaK 2003, pages 140-150.
[20] P.-N. Tan and V. Kumar. Modeling of Web Robot Navigational Patterns. In WebKDD 2000 - Web Mining for E-commerce - Challenges and Opportunities, Second International Workshop, August 2000.
[21] R. Kohavi, et al. Lessons and Challenges from Mining Retail E-Commerce Data. Journal of Machine Learning, 2003.
[22] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann Publishers Inc., San Francisco, CA, 1999.
[23] R. Cooley, P. Tan, and J. Srivastava. Discovery of interesting usage patterns from Web data. In B. Masand and M. Spiliopoulou, eds., Advances in Web Usage Analysis and User Profiling, LNAI 1836, Springer, Berlin, Germany, 2000, pages 163-182.
[24] P.-N. Tan and V. Kumar. Discovery of web robot sessions based on their navigational patterns. Data Mining and Knowledge Discovery, 6(1): 9-35, 2002.
[25] P. Pirolli, J. Pitkow, and R. Rao. Silk from a sow's ear: Extracting usable structures from the web.
In Proceedings of the 1996 Conference on Human Factors in Computing Systems (CHI-96), Vancouver, British Columbia, Canada, 1996.
[26] J. Pitkow. In search of reliable usage data on the WWW. In Sixth International World Wide Web Conference, pages 451-463, Santa Clara, CA, 1997.
[27] L. Catledge and J. Pitkow. Characterizing browsing behaviors on the World Wide Web. Computer Networks and ISDN Systems, 27(6), 1995, pages 1065-1073.
[28] M.S. Chen, J.S. Park, and P.S. Yu. Data mining for path traversal patterns in a web environment. In Proceedings of the 16th International Conference on Distributed Computing Systems, pages 385-392, 1996.
[29] J. Andersen, A. Giversen, A.H. Jensen, R.S. Larsen, T.B. Pedersen, and J. Skyt. Analyzing clickstreams using subsessions. In International Workshop on Data Warehousing and OLAP (DOLAP 2000), 2000.
[30] A. Nanopoulos, M. Zakrzewicz, T. Morzy, and Y. Manolopoulos. Indexing web access-logs for pattern queries. In Fourth ACM CIKM International Workshop on Web Information and Data Management (WIDM'02), 2002.
[31] J. Pei, J. Han, B. Mortazavi-asl, and H. Zhu. Mining access patterns efficiently from web logs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 396-407, 2000.