Survey							
                            
		                
Web Mining Research: A Survey
By Raymond Kosala & Hendrik Blockeel, Katholieke Universiteit Leuven, July 2000
Presented 4/22/2004 by Arifa Mannan, University of Vermont

Overview
- Introduction
- Web Mining
- Web Content Mining
- Web Structure Mining
- Web Usage Mining
- Conclusions
- Questions/Answers

Introduction
- The Web is huge, diverse, and dynamic, and thus raises scalability, multimedia-data, and temporal issues, respectively.
- Thus we are drowning in information and facing information overload.
- Information users can encounter problems when interacting with the Web.

More Introduction
PROBLEMS:
- Finding relevant information: many search results are irrelevant, and search engines cannot index all the information available on the web.
- Creating new knowledge out of the information available on the web: presumes that we already have a collection of web data and want to extract potentially useful knowledge from it.

More Introduction
- Personalization of the information: this problem is often associated with the type and presentation of the information.
- Learning about consumers or individual users: this problem is about knowing what the customers do and want.

More Introduction
Web mining techniques could be used directly or indirectly to solve the information overload problems described before.
- Directly: the application of web mining techniques directly addresses the problem.
- Indirectly: web mining techniques are used as part of a bigger application that addresses the problems mentioned before.
Web mining is NOT the only useful tool. Other useful techniques come from:
- DB (databases)
- IR (Information Retrieval)
- NLP (Natural Language Processing)
- the Web document community

Web Mining: Outline
- Overview of Web mining
- Describe some confusion in the use of the term "Web mining"
- Provide a classification
- Relate the classification to the agent paradigm
- Describe some research in the respective categories

Web Mining: Overview
Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services.
We suggest decomposing Web mining into these subtasks:
1. Resource finding: the task of retrieving the intended web documents.
2. Information selection and pre-processing: automatically selecting and pre-processing specific information from the retrieved Web resources.
3. Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites.
4. Analysis: validation and/or interpretation of the mined patterns.
We'll call this pattern 1-2-3-4; as we'll see later, 1-3-4 is sometimes used as well.

Web Mining: Confusion
- Web mining is often associated with Information Retrieval (IR) or Information Extraction (IE), but it is different from both.
- IR is the automatic retrieval of all relevant documents while at the same time retrieving as few non-relevant ones as possible. [views documents as a bag of words]
- IE has the goal of transforming a collection of documents, usually with the help of an IR system, into information that is more readily digested and analyzed. [interested in the structure or representation of a document]
- We argue that Web mining intersects with the application of machine learning on the web.
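
The bag-of-words view that IR takes of a document, noted on the Confusion slide, can be illustrated with a minimal Python sketch. The function name `bag_of_words`, the tokenization rule, and the two sample sentences are assumptions made for illustration; they are not taken from the survey.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a word -> count mapping that ignores word order and case.

    Assumption: a token is any run of ASCII letters or digits.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Two hypothetical documents containing the same words in a different order.
doc_a = "Web mining uses data mining techniques on the Web"
doc_b = "on the Web data mining techniques Web mining uses"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

# Under this representation the two documents are indistinguishable,
# because only the word counts are kept.
same_bag = bag_a == bag_b
```

Because word order and document structure are discarded, the two sample documents map to the same bag, which is the contrast with IE's interest in structure that the slide draws.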

Web Mining: Classification
Web mining is categorized into three areas of interest based on which part of the web to mine:
- Web content mining
- Web structure mining
- Web usage mining

Web Mining: Classification
Web content mining: describes the discovery of useful information from Web contents/data/documents.
- Web content data: the data the Web page was designed to convey to the users; it consists of text, images, audio, and video.

Web Mining: Classification
Web structure mining: tries to discover the model underlying the link structures of the Web.
- Web structure data: data that describes the organization of the content. Intra-page structure information includes the arrangement of various HTML or XML tags within a given page. Inter-page structure information is the hyperlinks connecting one page to another.

Web Mining: Classification
Web usage mining: tries to make sense of the data generated by the Web surfer's sessions or behaviors.
- Web usage data are secondary data.
- Web usage data include data from web server access logs, proxy server logs, browser logs, registration data, cookies, mouse clicks and scrolls, and any other data resulting from interactions.

Web Mining: Classification
In practice, the three Web mining tasks could be used in isolation or combined in an application, especially Web content and structure mining, since Web documents might also contain links.

Web Mining & the Agent Paradigm
Web mining is often viewed from or implemented within an agent paradigm. Thus, web mining has a close relationship with software agents or intelligent agents.
Two relevant types of software agents:
- User interface agents: information retrieval agents, information filtering agents, and personal assistant agents.
- Distributed agents: distributed agents for knowledge discovery or data mining.

Web Mining & the Agent Paradigm
Two relevant approaches used by intelligent agents:
- Content-based approach: the system searches for items that match the user's preferences, based on an analysis of the items' content.
- Collaborative approach: the system tries to find users with similar interests in order to make recommendations.

Web Content Mining: IR view
- The goal of Web content mining is to improve information finding or to filter information for users.
- Information retrieval view for unstructured documents: most of the research uses a "bag of words" to represent unstructured documents.
- See the table that follows.

[Table: Web content mining, IR view. Only a column of publication years, ranging from 1995 to 2000, survived extraction.]

Web Content Mining: DB view
- Database techniques on the web are related to the problem of managing and querying the information on the web.
- Three classes of tasks: modeling and querying the web, information extraction and integration, and web site construction and restructuring.
- The DB view tries to model the data on the web and to integrate them so that queries more sophisticated than keyword-based search can be performed.
- Research in this area mainly deals with semi-structured data.

[Table: Web content mining, DB view. Content not recovered.]

Web Structure Mining
- In Web structure mining we are interested in the structure of the hyperlinks within the Web itself
  (inter-document structure).
- This line of research is inspired by the study of social networks and citation analysis.
- We could discover specific types of pages (hubs, authorities, etc.) based on incoming and outgoing links.

Web Structure Mining
- Web structure mining uses the hyperlink structure and applies social network analysis to model the underlying link structure of the web.
- A few different algorithms have been proposed to do this, such as HITS, PageRank, and an improved HITS that uses content information and outlier filtering.
- These are applied as a method to calculate the quality rank or relevancy of each Web page.

Web Usage Mining
- Web usage mining focuses on techniques that could predict user behavior while the user interacts with the web.
- Two commonly used approaches:
  1) mapping the usage data of the web server into relational tables before an adapted data mining technique is performed;
  2) using the log data directly by applying special preprocessing techniques.

Web Usage Mining
Applications of web usage mining fall into two main categories: learning a user profile, or user modeling in adaptive interfaces [personalized], and learning user navigation patterns [impersonalized].
- Web users would be interested in techniques that could learn their information needs and preferences, which is user modeling.
- Information providers would be interested in techniques that could improve the effectiveness of the information on their websites.

Conclusions
- We surveyed research in Web mining, clarified some confusion in the use of the term Web mining, suggested three Web mining categories, and situated some current research with respect to these categories.
- We also explored the connection between Web mining categories and the agent paradigm.
- The Web presents new challenges to the traditional data mining algorithms that work on flat data.
- We have seen that some of the traditional data mining algorithms have been extended, or new algorithms have been used, to work on Web data.

Questions and Answers
Q1. Outline the main characteristics of web information.
Ans:
- Huge
- Diverse
- Dynamic

Questions and Answers
Q2. How can data mining techniques be used in web information analysis? Give at least two examples.
Ans:
- Classification: classification on server logs, using a decision tree or Naïve Bayes classifier, to discover the profiles of users belonging to a particular class.
- Clustering: clustering can be used to group users exhibiting similar browsing patterns.
- Association analysis: association analysis can be used to relate pages that are most often referenced together in a single server session.

Questions and Answers
Q3. What are the three main areas of interest for web mining?
Ans:
- Web content mining
- Web structure mining
- Web usage mining
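
The association analysis answer to Q2 (relating pages that are often referenced together in a single server session) can be sketched in Python as follows. This is a minimal illustration rather than a method from the survey: the function name `page_pair_support`, the sample sessions, and the 0.5 support threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def page_pair_support(sessions):
    """Return the fraction of sessions in which each pair of pages co-occurs.

    Assumption: each session is a list of page paths, and a page counts
    once per session no matter how many times it was requested.
    """
    n_sessions = len(sessions)
    pair_counts = Counter()
    for pages in sessions:
        # Sort the distinct pages so each pair has one canonical ordering.
        for pair in combinations(sorted(set(pages)), 2):
            pair_counts[pair] += 1
    return {pair: count / n_sessions for pair, count in pair_counts.items()}


# Hypothetical sessions, as might be reconstructed from a server access log.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart", "/home"],
]

support = page_pair_support(sessions)

# Keep only pairs that appear together in at least half of the sessions.
frequent_pairs = {pair: s for pair, s in support.items() if s >= 0.5}
```

In this sample, `/home` and `/products` appear together in three of the four sessions, so that pair has the highest support. A full association-rule miner would also compute confidence and consider larger itemsets; this sketch stops at pairwise support.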