Analysis of Web Usage Mining

Hui Yu, Zhongmin Lu
School of Management, South-Central University for Nationalities, Wuhan, P.R. China, 430074
Business School, Wuhan Institute of Technology, Wuhan, P.R. China, 430074

Abstract Web usage mining is an effective means of analysing web usage data and understanding the needs of Web-based applications. Once the data has been analysed, hidden customers can be found, the hyperlink structure of a web site can be optimised, adaptive web sites can be created, and so on. The paper presents an overview of academic research in web mining and then focuses on web usage mining and the related technologies and tools for studying online consumer behaviour.

Key words: data mining; Web mining; Web usage mining

1 Introduction

The World Wide Web is one of the most heavily used services of the Internet. It is used by a variety of different communities to meet all kinds of information needs (Wagner, 2001). For example, students may use the Web to find out more about their university, certain departments or particular lectures and projects. Another example in the academic context is research teams that use the Web to find out about related work. Some companies use the Web to find out more about their competitors; some individuals use it to keep up to date with the latest news, according to their interests.

However, it has been suggested that most designers do not consider tracking web site traffic, even after considerable time, effort and perhaps money has been spent on designing the site and creating the content. Potentially useful information, such as who comes to the site, where they come from, when they come and how they find the site, is ignored or lost. Individuals themselves often find it difficult to recollect their own personal history of Web navigation. An automated system for capturing this information therefore offers considerable advantages and potential.
To provide some answers and insight into this area, the paper introduces some web usage mining technologies. To put these technologies in context, Web mining is first analysed as follows.

2 What is Web Mining?

Web mining is a natural combination of data mining and the WWW; it can be broadly defined as the discovery and analysis of useful information from the World Wide Web (Cooley, 1997). Etzioni (1996) suggests decomposing Web mining into the following subtasks:

a. Resource Discovery: the task of retrieving the intended information from the Web.
b. Information Extraction: automatically selecting and pre-processing specific information from the retrieved Web resources.
c. Generalisation: automatically discovering general patterns both at individual Web sites and across multiple sites.
d. Analysis: analysing the mined patterns.

Madria, Bhowmick, Ng, and Lim (1999) claim that the Web involves three types of data: data on the Web (content), Web log data (usage) and Web structure data. Based on the kind of information to be obtained, web mining can be divided into three major parts: Web Content Mining, Web Structure Mining, and Web Usage Mining.

• Web Content Mining
Web content mining describes the automatic search of information resources available online. In the Web mining domain, Web content mining is essentially an analogue of data mining techniques for relational databases, since it is possible to find similar types of knowledge from the unstructured data residing in Web documents (Wang, 2000). A Web document usually contains several types of data, such as text, image, audio, video, metadata and hyperlinks. Some of this data is semi-structured, such as HTML documents, or more structured, such as the data in tables or database-generated HTML pages, but most of it is unstructured text. The unstructured character of Web data forces Web content mining towards a more complicated approach.
• Web Structure Mining
Most Web information retrieval tools use only the textual information, ignoring link information that could be very valuable. The goal of Web structure mining is to generate a structural summary of the Web site and Web page. Web structure mining can also take another direction: discovering the structure of the Web document itself (Madria, Bhowmick, Ng, and Lim, 1999). This type of structure mining can be used to reveal the structure (schema) of Web pages, which is useful for navigation and makes it possible to compare and integrate Web page schemas. Another task of Web structure mining is to discover the nature of the hierarchy or network of hyperlinks in the Web sites of a particular domain. This may help to generalise the flow of information in the Web sites that represent that domain, so that query processing becomes easier and more efficient. Web structure mining has a natural relation with Web content mining, since Web documents are very likely to contain links and both use the real or primary data on the Web. The two mining tasks are quite often combined in one application (Wang, 2000).

• Web Usage Mining
Web usage mining tries to discover useful information from the secondary data derived from the interactions of users while they surf the Web. It focuses on techniques that can predict user behaviour while the user interacts with the Web. Spiliopoulou (1999) suggests that the potential strategic aims in each domain translate into the following mining goals: prediction of the user's behaviour within the site; comparison between expected and actual Web site usage; and adjustment of the Web site to the interests of its users. In the data preparation process of Web usage mining, the Web content and Web structure are also used as information sources.

3 Why Web Usage Mining?

In this paper emphasis is placed on Web usage mining.
The reasons are very simple. With the explosion of e-commerce, the way in which companies do business has changed. E-commerce, mainly characterised by electronic transactions through the Internet, has provided a cost-efficient and effective way of doing business. Unfortunately, to most companies the web is seemingly nothing more than a mysterious place where transactions take place. They perhaps do not realise that, as millions of visitors interact daily with Web sites around the world, massive amounts of data are being generated. It is arguable that, with the exception of the major players in electronic commerce, most businesses do not realise the value this information could have for the company in understanding customer behaviour, improving customer services and relationships, launching targeted marketing campaigns and measuring the success of marketing efforts.

4 How to Perform Web Usage Mining

Web usage mining is performed by reporting visitor traffic information derived from web server log files. Web server log files were used initially by web designers and system administrators to find out “how much traffic they are getting, how many requests fail, and what kind of errors is being generated”. However, web server log files can also record and trace visitors’ on-line behaviour (Cooley, 1997). For example, after some basic traffic analysis, the log files can help designers or administrators to answer questions such as “from what search engine are visitors coming? What pages are the most popular? Which browsers and operating systems are most commonly used by visitors?”

After the Web traffic data is obtained, it may be combined with relational databases. Through data mining techniques such as association rules and path analysis, visitors’ behaviour patterns can be found and interpreted. This is a brief explanation of how web usage mining is done.
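
The basic traffic analysis described above can be illustrated with a minimal Python sketch. It assumes that the server writes its log in the widely used Combined Log Format (host, timestamp, request line, status code, size, referrer, user agent); the paper does not prescribe a log format, and the sample lines, the `parse_line` helper and the `traffic_summary` helper are hypothetical names introduced only for illustration.

```python
import re
from collections import Counter

# Assumed layout (Combined Log Format):
# host ident user [time] "METHOD url PROTOCOL" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def traffic_summary(lines):
    """Count popular pages, referrers and browsers, and tally failed requests."""
    pages, referrers, agents = Counter(), Counter(), Counter()
    failures = 0
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines that are not in the assumed format
        if entry["status"].startswith(("4", "5")):
            failures += 1  # client or server error
            continue
        pages[entry["url"]] += 1
        if entry["referrer"] and entry["referrer"] != "-":
            referrers[entry["referrer"]] += 1
        if entry["agent"]:
            agents[entry["agent"]] += 1
    return {"pages": pages, "referrers": referrers,
            "agents": agents, "failures": failures}


# Hypothetical sample data, not taken from any real server.
SAMPLE_LOG = [
    '157.228.102.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" '
    '200 2326 "http://search.example.com/?q=mining" "Mozilla/4.08"',
    '10.0.0.7 - - [10/Oct/2000:13:56:01 -0700] "GET /index.html HTTP/1.0" '
    '200 2326 "-" "Mozilla/4.08"',
    '10.0.0.7 - - [10/Oct/2000:13:56:30 -0700] "GET /missing.html HTTP/1.0" '
    '404 210 "-" "Mozilla/4.08"',
]

summary = traffic_summary(SAMPLE_LOG)
print(summary["pages"].most_common(1))
print(summary["failures"])
```

Counting with `Counter` keeps the sketch short; a production tool would more likely load the parsed entries into a relational database, as the text suggests, before applying mining techniques.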
Most sophisticated systems and techniques divide the work into three distinct processes: pre-processing, pattern discovery, and pattern analysis (Galeas, 2001). Each process can be categorised as shown in Figure 1.

Figure 1 Research Areas in Web Usage Mining (Galeas, 2001)
[Diagram labels: Web Usage Mining; Data Pre-processing; Web transactions; Pattern Discovery Tools: Converting IP Addresses to Domain Names, Converting File Names to Page Titles, Statistical Analysis, Association Rules, …; Pattern Analysis Tools: Visualisation Techniques, OLAP Techniques, Data & Knowledge Querying, Usability Analysis]

4.1 Data Pre-processing

Data preparation involves finding the answers to a number of questions such as: “How do you mine data that is not in the right form? How do you handle data that is not entirely clean?” To indicate where possible solutions may be found, Groth (1999) suggests a number of factors to be considered:

• Data is not always clean. For example, a column containing a list of soft drinks may have the values “Pepsi”, “Pepsi Cola”, and “Cola”. The values refer to the same drink, but are not known to the computer as being one and the same. This is a consistency problem.
• Another cleaning issue is stale data. Mailing lists have to be continually updated because people move and their addresses change. An old address that is no longer correct is often referred to as stale.
• Another data-cleaning issue is typographical errors. Words are frequently misspelled or typed incorrectly.

This information needs to be integrated to form a complete data set for data mining. Before the data is integrated, however, web log files need to be cleaned and filtered: the raw data is filtered to eliminate irrelevant items, and individual page accesses are grouped into logical units. Eliminating irrelevant items matters because the web traffic analysis is derived from the filtered data.
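
The two cleaning steps named above, eliminating irrelevant items and grouping page accesses into logical units, can be sketched in Python as follows. The sketch rests on several assumptions that are not part of the paper: irrelevant items are taken to be embedded images recognised by file suffix, each host address is treated as one user, and a new session starts after 30 minutes of inactivity. The names `is_relevant`, `build_sessions` and `SESSION_TIMEOUT` are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: requests for embedded graphics are the irrelevant items.
IRRELEVANT_SUFFIXES = (".gif", ".jpeg", ".jpg")
# Assumption: 30 minutes of inactivity ends a session (a heuristic, not from the paper).
SESSION_TIMEOUT = timedelta(minutes=30)


def is_relevant(url):
    """Keep a request unless its path ends in an image suffix (case-insensitive)."""
    path = url.split("?", 1)[0].lower()
    return not path.endswith(IRRELEVANT_SUFFIXES)


def build_sessions(requests):
    """Group (host, timestamp, url) tuples into sessions.

    Returns a list of (host, [urls]) pairs in order of session start.
    Assumption: one host address corresponds to one user.
    """
    sessions = []
    last_seen = {}  # host -> (time of last request, url list of the open session)
    for host, stamp, url in sorted(requests, key=lambda r: r[1]):
        if not is_relevant(url):
            continue
        current = last_seen.get(host)
        if current is None or stamp - current[0] > SESSION_TIMEOUT:
            urls = []  # start a new session for this host
            sessions.append((host, urls))
        else:
            urls = current[1]  # continue the open session
        urls.append(url)
        last_seen[host] = (stamp, urls)
    return sessions


# Hypothetical sample data.
t0 = datetime(2000, 10, 10, 13, 0)
requests = [
    ("157.228.102.1", t0, "/index.html"),
    ("157.228.102.1", t0 + timedelta(seconds=1), "/logo.GIF"),
    ("157.228.102.1", t0 + timedelta(minutes=5), "/courses.html"),
    ("157.228.102.1", t0 + timedelta(hours=2), "/index.html"),
    ("10.0.0.7", t0 + timedelta(minutes=1), "/news.html"),
]

sessions = build_sessions(requests)
for host, urls in sessions:
    print(host, urls)
```

Both the timeout and the host-as-user rule are simplifications; proxies and shared machines break the second, so the choice of assumptions should depend on the purpose of the mining.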
Mobasher (1997) suggests that irrelevant items can be eliminated by checking the suffix of the URL name. For example, requests for embedded graphics, whose suffix usually takes the form “gif”, “jpeg”, “jpg”, “GIF”, “JPEG” or “JPG”, can be filtered out of the web log file.

Before any mining is done on Web usage data, sequences of page references must be grouped into logical units representing Web transactions or user sessions (Kizhakke, 2000). A user session consists of all the page references made by a user during a single visit to a site. Identifying user sessions is similar to the problem of identifying individual users. Hence, assumptions are made depending on the final purpose of the mining process.

4.2 Pattern Discovery

This is the key component of Web mining (Wang, 2000). Pattern discovery tools apply techniques from data mining, psychology, and information theory to the Web traffic data collected. Once user sessions have been identified, several types of access pattern mining can be performed, depending on the needs of the analyst. Some of these discovery techniques are:

• Converting IP Addresses to Domain Names
Every visitor to a Web site connects to the Internet through an IP address (for example, 157.228.102.1). An IP address usually has a corresponding domain name (for example, osiris.sunderland.ac.uk), and the two are linked through the Domain Name System (DNS). DNS can convert a domain name that a visitor enters in a Web browser into the corresponding IP address. Converting the IP address back into a domain name can reveal some knowledge about the visitor. For example, where visitors live can be estimated from the extension of each visitor’s domain name, such as .ca (Canada), .au (Australia), .cn (China), etc.

• Converting File Names to Page Titles
A well-designed site will have a title (between <title> and </title>) for every page.
Rather than simply reporting the file names (URLs) requested, a good system is able to look at these files and determine their titles. Page titles are much easier to read than URLs, so a good system should show page titles on reports in addition to URLs.

• Statistical Analysis
Statistical techniques are the most powerful tools for extracting knowledge about visitors to a Web site (Srivastava, 2000). When analysing the session file, analysts may perform different kinds of descriptive statistical analysis on different variables (such as page views, viewing time and length of a navigational path). The statistical information contained in the periodic Web system report (such as the most frequently accessed pages, the average view time of a page or the average length of a path through a site) is potentially useful for improving system performance, enhancing system security, facilitating site modification, and supporting marketing decisions.

• Association Rules
Srivastava (2000) points out that, in the context of Web usage mining, association rules refer to sets of pages that are accessed together with a support value exceeding some specified threshold. Web designers can use the presence or absence of such rules to restructure their Web sites efficiently. When a page is loaded from a remote site, association rules can also be used as a trigger for prefetching documents to reduce user-perceived latency.

• Clustering
Clustering is a technique for grouping together a set of items that have similar characteristics. In the Web usage domain, there are two kinds of interesting clusters to be discovered: usage clusters and page clusters. Clustering of users tends to establish groups of users exhibiting similar browsing patterns. Clustering of pages, on the other hand, discovers groups of pages having related content.
This information is useful for Internet search engines and Web assistance providers (Cooley, 2000).

• Classification
Classification is the technique of mapping a data item into one of several predefined classes. In the Web domain, a Webmaster will have to use this technique to establish a profile of users belonging to a particular class or category. This requires the extraction and selection of features that best describe the properties of a given class or category. The classification can be done with supervised inductive learning algorithms such as decision tree classifiers, k-nearest neighbour classifiers, Support Vector Machines, etc. (Srivastava, 2000).

• Path Analysis
Graph models are most commonly used for path analysis. In these models, a graph represents some relation defined on Web pages (or on the Web), and each tree of the graph represents a web site. Each node in a tree represents a web page (HTML document); edges between trees represent links between web sites, while edges between nodes within the same tree represent links between documents at a web site. When path analysis is applied to the site as a whole, it can offer valuable insights into navigational behaviour.

• Sequential Patterns
This technique finds inter-session patterns in which the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. It helps the Web marketer, up to a point, to predict future trends, which in turn helps in placing advertisements aimed at certain user groups. Sequential patterns also include other types of temporal analysis such as trend analysis, change point detection, or similarity analysis (Cooley, 2000).

4.3 Pattern Analysis

Pattern analysis is the final stage of the whole Web usage mining process.
The goal of this process is to eliminate irrelevant (or unwanted) rules or patterns and to extract the interesting rules or patterns from the output of the pattern discovery process (Wang, 2000). The raw output of the earlier stages is often not in a form suitable for web site administrators. The type of information they seek is “How are people using the site? Which pages are being accessed most frequently?” Answering this type of question requires analysis of the structure of hyperlinks as well as the contents of the pages, which can be done with the help of analysis methodologies and tools. The common techniques used for pattern analysis are visualisation techniques, OLAP techniques, data and knowledge querying, and usability analysis. These are discussed in more detail below:

• Visualisation Techniques
Visualisation has been used very successfully in helping people understand various types of phenomena, both real and abstract. Hence it is a natural choice for understanding the behaviour of web users. Groth (1999) argues that visualisation is simply the graphical presentation of data.

• OLAP Techniques
On-line Analytical Processing (OLAP) is emerging as a very powerful paradigm for strategic analysis of databases in business settings. Some of the key characteristics of strategic analysis include: very large data volume, explicit support for the temporal dimension, support for various kinds of information aggregation, and long-range analysis in which overall trends are more important than details of individual data items. While OLAP can be performed directly on top of relational databases, industry has developed specialised tools to make it more efficient and effective (Information Advantage, 1997).
• Data and Knowledge Querying
One of the reasons for the great success of relational database technology has been the existence of a high-level, declarative query language, which allows an application to express what conditions must be satisfied by the data it needs, rather than having to specify how to get the required data. Given the large number of patterns that may be mined, there appears to be a definite need for a mechanism to specify the focus of the analysis. First, constraints may be placed on the database to restrict the portion of the database to be mined. Second, querying may be performed on the knowledge that has been extracted by the mining process, in which case a language for querying knowledge rather than data is needed (Mobasher, 1997a).

• Usability Analysis
The first step in this method is to develop instrumentation methods that collect data about software usability. This data is then used to build computerised models and simulations that explain the data. Finally, various data presentation and visualisation techniques are used to help an analyst understand the phenomenon. This approach can also be used to model the browsing behaviour of users on the web (Mobasher, 1997b).

However, most of these techniques are disliked by users because of slow speed, inflexibility, difficulty of maintenance and limited functionality. A lot of work remains to be done by both researchers and developers to produce a more efficient, flexible and powerful set of tools for this task.

5 Conclusion

Many researchers are studying Web usage mining, but few have made great progress. This paper has examined and discussed Web usage mining, and a detailed description of the three phases of the Web usage mining process has been provided and commented on.
Due to the massive growth of the e-commerce industry and its associated spin-offs, privacy has arguably become one of the most critical concerns between the Web user and the e-commerce developer. Our future work will focus on this aspect, and will also include using web usage mining data to create adaptive electronic commerce web sites.

References

[1] Accrue Software Inc. (2000) Web Mining White Paper: Driving Business Decisions in Web Time, www.accrue.com.
[2] Cooley, R., Mobasher, B. and Srivastava, J. (1997) Web Mining: Information and Pattern Discovery on the World Wide Web, http://wwwusers.cs.umn.edu/~mobasher/webminer/survey/survey.html.
[3] Cooley, R. (2000) Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data, http://citeseer.nj.nec.com/426030.html.
[4] Etzioni, O. (1996) The World Wide Web: Quagmire or Gold Mine, Communications of the ACM, volume 39, number 11 (November), pp. 65-68.
[5] Galeas, P. (2001) Web Mining, http://www.galeas.de/webmining.html.
[6] Groth, R. (1999) Data Mining: Building Competitive Advantage, Vanessa Moore, USA, p. 47.
[7] Information Advantage (1997) Decision Suite Users Guide: Online Analytical Processing, http://www-users.cs.york.ac.uk/~kimble/research/ak/vendors.htm.
[8] Kizhakke, V. P. (2000) MIR: A Tool for Visual Presentation of Web Access Behaviour, http://citeseer.nj.nec.com/cache/papers/cs/20450/kizhakke00mir.pdf.
[9] Madria, S. K., Bhowmick, S. S., Ng, W. K. and Lim, E. P. (1999) Research Issues in Web Data Mining. In Proceedings of Data Warehousing and Knowledge Discovery, First International Conference, DaWaK ’99, pp. 303-312.
[10] Mobasher, B. (1997a) Data & Knowledge Querying, http://www-users.cs.umn.edu/~mobasher/webminer/survey/node21.html.
[11] Mobasher, B. (1997b) Usability Analysis, http://www-users.cs.umn.edu/~mobasher/webminer/survey/node22.html.
[12] Spiliopoulou, M. (1999) Data Mining for the Web. In Proceedings of Principles of Data Mining and Knowledge Discovery, Third European Conference, PKDD ’99, pp. 588-589.
[13] Srivastava, J. (2000) Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, http://www.acm.org/sigkdd/explorations/issue1-2/srivastava.pdf.
[14] Wagner, H. (2001) Towards an Integrated Approach to Collaborative Web Usage, http://www.pms.informatik.uni-muenchen.de/lehre/projekt-diplom-arbeit/navigation-track/doc/thesis.shtml.
[15] Wang, Y. (2000) Web Mining and Knowledge Discovery of Usage Patterns, http://db.uwaterloo.ca/~tozsu/courses/cs748t/surveys/wang.pdf.