Web Mining
By: Vineeta, 8pgc18
M.Tech (II Semester)

Introduction
- Why do we need it?
- What is it?
- How is it different from classical data mining?
- What are the problems?
- Role of web mining
- Web mining taxonomy
- Applications

Why do we need Web Mining?
- Explosive growth in the amount of content on the internet
- Web search engines return thousands of results, which makes them difficult to browse
- Online repositories are growing rapidly
- Using web mining, web documents can easily be BROWSED, ORGANISED and CATALOGED with minimal human intervention

What is it?
- Web mining: data mining techniques to automatically discover and extract information from web documents/services
- [Figure: WWW -> Knowledge]

How does it differ from “classical” Data Mining?
- The web is not a relation
  - Textual information and linkage structure
- Usage data is huge and growing rapidly
  - Google’s usage logs are bigger than their web crawl
  - Data generated per day is comparable to the largest conventional data warehouses
- Ability to react in real-time to usage patterns
  - No human in the loop

Web Mining: Problems
- The “abundance” problem
- Limited coverage of the Web
- Limited query interface based on keyword-oriented search
- Limited customization to individual users
- Dynamic and semi-structured

Role of web mining
- Finding relevant information
- Creating knowledge from the information available
- Personalization of the information
- Learning about customers / individual users

Web Mining Taxonomy
- Web Content Mining
  - Identify information within given web pages
  - Distinguish personal home pages from other web pages
- Web Structure Mining
  - Uses interconnections between web pages to give weight to the pages
- Web Usage Mining
  - Understand access patterns and trends to improve structure

Web Content Mining
- Web Content Mining is the process of extracting useful information from the contents of Web documents.
- Content data corresponds to the collection of facts a Web page was designed to convey to the users.
  It may consist of text, images, audio, video, or structured records such as lists and tables.
- Research activities in this field also involve using techniques from other disciplines such as Information Retrieval (IR) and natural language processing (NLP).

Web Content Mining
- Agent Based Approach
  - Intelligent Search Agents
  - Information Filtering & Categorization
  - Personalized Web Agents
- Database Approach
  - Multilevel Databases
  - Web Query Systems

Intelligent Search Agents
- Concentrate on searching for relevant information, using the characteristics of a particular domain to interpret and organize the collected information.
- They can be further classified into two types:
  - Interpretation based on pre-specified information
    - Examples: Harvest, FAQFinder, Information Manifold, OCCAM
  - Interpretation based on unfamiliar sources
    - Example: ShopBot

ShopBot
- A ShopBot is an autonomous software agent that combs the internet, providing users with low-price products or product recommendations.
- A ShopBot basically looks for product information from a variety of vendor sites using general information about the product domain.
- The following example displays a ShopBot at www.allbookstores.com.

Information Filtering & Categorization
- Makes use of various information retrieval techniques and the characteristics of hypertext web documents to interpret and categorize data.
- Examples: HyPursuit, BO (Bookmark Organizer)

Bookmark Organizer (BO)
- Makes use of hierarchical clustering techniques and involves user interaction to organize a collection of web documents.
- It operates in two modes:
  - Automatic
  - Manual
- Frozen nodes: in a hierarchical structure, if we freeze a node N, then the subtree rooted at N represents a coherent group of documents.

Personalized Web Agents
- This category of Web agents learns user preferences and discovers Web information sources based on these preferences and on those of other individuals with similar interests.
- Examples:
  - WebWatcher
  - PAINT
  - Syskill & Webert
  - GroupLens
  - Firefly

Multilevel Databases
- Layer 0:
  - Unstructured, massive and global information base.
- Layer 1:
  - Derived from lower layers.
  - Relatively structured.
  - Obtained by data analysis, transformation and generalization.
- Higher layers (Layer n):
  - Further generalization to form smaller, better structured databases for more efficient retrieval.

Web Query System
- These systems attempt to make use of:
  - A standard database query language (SQL)
  - Structural information about web documents
  - Natural language processing for queries made in WWW searches
- Examples:
  - WebLog: restructuring extracted information from Web sources.
  - W3QL: combines structure queries (organization of hypertext) and content queries (information retrieval techniques).

Web Structure Mining
- Web Structure Mining is the process of discovering structure information from the Web.
- This type of mining can be performed either at the (intra-page) document level or at the (inter-page) hyperlink level.
- Research at the hyperlink level is also called HYPERLINK ANALYSIS.

Web Structure Mining
Different algorithms for Web structures:
- Page-Rank Method: Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of WWW, pages 107–117, Brisbane, Australia, 1998.
- CLEVER Method: http://www.almaden.ibm.com/projects/clever.shtml

Page-Rank Method
- Introduced by Brin and Page (1998)
- Used in the Google search engine
- Mines the hyperlink structure of the web to produce a ‘global’ importance ranking of every web page
- Web search results are returned in rank order
- Treats a link like an academic citation
- Assumption: highly linked pages are more ‘important’ than pages with few links
- A page has a high rank if the sum of the ranks of its back-links is high

Backlink
- Link structure of the Web

CLEVER Method
- CLient-side EigenVector-Enhanced Retrieval
- Developed by a team of researchers at the IBM Almaden Research Centre
- Ranks pages primarily by measuring the links between them
- A continued refinement of HITS (Hypertext Induced Topic Selection)
- Basic principles: authorities and hubs
  - Good hubs point to good authorities
  - Good authorities are referenced by good hubs

Web Usage Mining
- Web usage mining is also known as Web log mining
- Applies mining techniques to discover interesting usage patterns from the data derived from the interactions of users while surfing the web
- Mines Web log records to discover user access patterns of Web pages

Web Usage Mining – Three Phases
- Preprocessing consists of converting the usage, content, and structure information contained in the various available data sources into the data abstractions necessary for pattern discovery.
- Pattern discovery draws upon methods and algorithms developed in several fields, such as statistics, data mining, machine learning and pattern recognition.
- Pattern analysis filters out uninteresting rules or patterns from the set found in the pattern discovery phase. The exact analysis methodology is usually governed by the application for which Web mining is done.
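
The three phases above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method taken from the slides or the cited papers: the log lines are invented, the 30-minute session timeout and the 0.5 minimum support are assumed values, and the function names (preprocess, discover_patterns, analyze_patterns) are hypothetical. Pattern discovery here is plain co-occurrence counting of page pairs, a simple stand-in for the richer statistical and machine learning methods mentioned above.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timedelta
from itertools import combinations

# Hypothetical server log lines in Common Log Format (invented for illustration).
LOG_LINES = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:01:00 +0000] "GET /books HTTP/1.1" 200 734',
    '10.0.0.2 - - [01/Jan/2024:10:02:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:03:00 +0000] "GET /books HTTP/1.1" 200 734',
    '10.0.0.2 - - [01/Jan/2024:10:04:00 +0000] "GET /cart HTTP/1.1" 200 120',
    '10.0.0.1 - - [01/Jan/2024:11:30:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:11:31:00 +0000] "GET /logo.png HTTP/1.1" 200 90',
]

LINE_RE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+) [^"]*" (\d{3}) \S+')
SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout between sessions


def preprocess(lines):
    """Phase 1: parse raw log lines and group page views into sessions."""
    hits = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that are not well-formed GET requests
        host, stamp, path, status = match.groups()
        # Keep successful page views only; drop images, stylesheets, scripts.
        if status != "200" or path.endswith((".png", ".css", ".js")):
            continue
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        hits[host].append((when, path))

    sessions = []
    for views in hits.values():
        views.sort()
        current = [views[0][1]]
        for (prev, _), (now, path) in zip(views, views[1:]):
            if now - prev > SESSION_GAP:  # long pause: start a new session
                sessions.append(current)
                current = []
            current.append(path)
        sessions.append(current)
    return sessions


def discover_patterns(sessions):
    """Phase 2: count how many sessions contain each pair of pages."""
    counts = Counter()
    for session in sessions:
        pages = sorted(set(session))
        counts.update(combinations(pages, 2))
    return counts


def analyze_patterns(counts, n_sessions, min_support=0.5):
    """Phase 3: keep only pairs whose support reaches the threshold."""
    return {
        pair: count / n_sessions
        for pair, count in counts.items()
        if count / n_sessions >= min_support
    }


sessions = preprocess(LOG_LINES)
patterns = analyze_patterns(discover_patterns(sessions), len(sessions))
for pair, support in patterns.items():
    print(pair, round(support, 2))
```

With this sample data, the visitor at 10.0.0.1 is split into two sessions by the 90-minute pause, and only the pair (/books, /home) appears in at least half of the three sessions. Raising or lowering min_support changes how aggressively the pattern analysis phase filters.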
Applications
- Personalized experience in B2C e-commerce: Amazon.com
- Web search: Google
- Web-wide user tracking: DoubleClick
- Understanding user communities: AOL
- Understanding auction behavior: eBay
- Personalized web portal: MyYahoo

Conclusion
- Web mining: data mining techniques to automatically discover and extract information from Web documents/services (Etzioni, 1996).
- Web mining research integrates research from several research communities (Kosala and Blockeel, July 2000), such as:
  - Database (DB)
  - Information retrieval (IR)
  - The sub-areas of machine learning (ML)
  - Natural language processing (NLP)

References
- mandolin.cais.ntu.edu.sg/wise2002/web-miningWISE-30
- D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Conference on Hypertext and Hypermedia, ACM, 1998.
- www.iprcom.com/papers/pagerank/
- http://maya.cs.depaul.edu/~mobasher/webminer/survey/node23.html
- http://en.wikipedia.org/wiki/Web_mining
- http://en.wikipedia.org/wiki/Shop_bot
- Y. S. Maarek and I. Z. Ben Shaul. Automatically organizing bookmarks per contents. In Proc. Fifth International World Wide Web Conference, May 6-10, 1996.
- R. Cooley, B. Mobasher, et al. Web Mining: Information and Pattern Discovery on the World Wide Web. In Proc. IEEE Intl. Conf. Tools with AI, Newport Beach, CA, pp. 558-567, 1997.
- R. Kosala and H. Blockeel. Web Mining Research: A Survey. SIGKDD Explorations, 2(1):1-15, 2000.
- R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining world wide web browsing patterns. Journal of Knowledge and Information Systems, 1(1):5-32, 1999.
- S. Chakrabarti. Data mining for hypertext: A tutorial survey. ACM SIGKDD Explorations, 1(2):1-11, 2000.

THANK YOU!!