Web Mining
By: Vineeta
8pgc18
M.Tech (II Semester)
Introduction
 Why do we need it?
 What is it?
 How is it different from classical data mining?
 What are the problems?
 Role of web mining
 Web mining taxonomy
 Applications
Why do we need Web Mining?
 Explosive growth in the amount of content on the
internet
 Web search engines return thousands of
results, which makes them difficult to browse
 Online repositories are growing rapidly
Using web mining, web documents can easily be browsed,
organised and catalogued with minimal human
intervention
What is it?
 Web mining: the use of data mining techniques to automatically
discover and extract information from web
documents/services
[Figure: WWW → Knowledge]
How does it differ from “classical” Data Mining?
 The web is not a relation
 Textual information and linkage structure
 Usage data is huge and growing rapidly
 Google’s usage logs are bigger than their web crawl
 Data generated per day is comparable to the largest
conventional data warehouses
 Ability to react in real-time to usage patterns
 No human in the loop
Web Mining: Problems
 The “abundance” problem
 Limited coverage of the Web
 Limited query interface based on keyword-oriented
search
 Limited customization to individual users
 Web content is dynamic and semi-structured
Role of web mining
 Finding Relevant Information
 Creating knowledge from the available information
 Personalization of the information
 Learning about customers / individual users
Web Mining Taxonomy
Web Mining
 Web Content Mining
 Identify information within given web pages
 Distinguish personal home pages from other web pages
 Web Structure Mining
 Uses interconnections between web pages to give weight to the
pages
 Web Usage Mining
 Understand access patterns and trends to improve structure
Web Content Mining
 Web Content Mining is the process of extracting
useful information from the contents of Web
documents.
 Content data corresponds to the collection of facts a
Web page was designed to convey to the users. It may
consist of text, images, audio, video, or structured
records such as lists and tables.
 Research activities in this field also involve using
techniques from other disciplines such as
Information Retrieval (IR) and natural language
processing (NLP).
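As a rough illustration of extracting content data from a single page, the sketch below pulls the visible text and the hyperlink targets out of an HTML string using only the Python standard library. The class name, the helper function and the sample page are hypothetical, and real content mining systems would layer IR and NLP techniques on top of this step.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects visible text and hyperlink targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_content(html):
    """Return (visible_text, list_of_link_targets) for an HTML string."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


# Hypothetical sample page, used only for illustration.
SAMPLE_PAGE = """<html><head><title>Book list</title>
<style>p {color: red}</style></head>
<body><h1>New titles</h1><p>Data Mining, 2nd edition</p>
<a href="/books/42">details</a></body></html>"""

text, links = extract_content(SAMPLE_PAGE)
print(text)
print(links)
```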
Web Content Mining
 Agent Based Approach
 Intelligent Search Agent
 Information Filtering & Categorization
 Personalized Web Agent
 Database Approach
 Multilevel Databases
 Web Query Systems
Intelligent Search Agents
 Concentrate on searching relevant information using
the characteristics of a particular domain to interpret
and organize the collected information.
 It can be further classified into two types:
 Interpretation Based on Pre-Specified Information:
 Examples: Harvest, FAQFinder, Information Manifold, OCCAM
 Interpretation Based on Unfamiliar Source:
 Example: ShopBot
ShopBot
 A ShopBot is an autonomous software agent that
combs the internet to provide users with low-priced
products or product recommendations.
 A ShopBot looks for product information on
a variety of vendor sites using general information
about the product domain.
 Example: a ShopBot at www.allbookstores.com.
Information Filtering & Categorization
 This makes use of various information
retrieval techniques and characteristics of
hypertext web documents to interpret and
categorize data.
 Examples: HyPursuit, BO (Bookmark Organizer).
Bookmark Organizer (BO)
 Makes use of hierarchical clustering techniques and
involves user interaction to organize a collection of
web documents.
 It operates in two modes:
 Automatic
 Manual
 Frozen Nodes: In a hierarchical structure, if we freeze
a node N, then the subtree rooted at N represents a
coherent group of documents.
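To make the idea of hierarchically clustering a document collection concrete, here is a minimal sketch of agglomerative clustering over bookmark descriptions. It is not the Bookmark Organizer algorithm: the Jaccard word-overlap similarity, the merge-by-union rule and the sample bookmarks are all assumptions chosen for brevity, and frozen nodes are not modelled.

```python
def jaccard(a, b):
    """Word-set overlap between two clusters (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs):
    """Agglomerative clustering; returns a nested-tuple tree of titles.

    docs maps a title to its text. Each step merges the two most similar
    clusters, and the merged cluster is described by the union of their words.
    """
    clusters = [(title, set(text.lower().split())) for title, text in docs.items()]
    while len(clusters) > 1:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = max(pairs, key=lambda ij: jaccard(clusters[ij[0]][1], clusters[ij[1]][1]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical bookmarks, used only for illustration.
BOOKMARKS = {
    "py-tutorial": "python programming tutorial",
    "py-docs": "python programming reference",
    "recipes": "pasta cooking recipes",
    "baking": "bread baking cooking recipes",
}

tree = cluster(BOOKMARKS)
print(tree)
```

In this toy tree, each inner tuple plays the role of a subtree that a user could choose to freeze as one coherent group.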
Personalized Web Agents
 This category of Web agents learns user preferences
and discovers Web information sources based on
these preferences and those of other individuals with
similar interests.
 Examples:
 WebWatcher
 PAINT
 Syskill&Webert
 GroupLens
 Firefly
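The idea of using "other individuals with similar interests" can be sketched as user-based collaborative filtering. The code below is an illustrative assumption, not the method of any system named above: it scores unseen pages by the ratings of similar users, with cosine similarity chosen for simplicity and hypothetical user names and ratings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two {page: rating} dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Suggest pages the user has not rated, weighted by similar users."""
    scores = {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap in interests
        for page, rating in their_ratings.items():
            if page not in ratings[user]:
                scores[page] = scores.get(page, 0.0) + sim * rating
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]


# Hypothetical ratings, used only for illustration.
RATINGS = {
    "ann": {"/ml": 5, "/db": 4},
    "bob": {"/ml": 5, "/db": 5, "/ir": 4},
    "cat": {"/sports": 5, "/news": 4},
}

suggestions = recommend(RATINGS, "ann")
print(suggestions)
```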
Multilevel Databases
 Layer 0:
 Unstructured, massive and global information base.
 Layer 1:
 Derived from lower layers.
 Relatively structured.
 Obtained by data analysis, transformation and
generalization.
 Higher Layers (Layer n):
 Further generalization to form smaller, better
structured databases for more efficient retrieval.
Web Query System
 These systems attempt to make use of:
 Standard database query language – SQL
 Structural information about web documents
 Natural language processing for queries made in www
searches.
 Examples:
 WebLog: Restructuring extracted information from Web
sources.
 W3QL: Combines structure query (organization of
hypertext) and content query (information retrieval
techniques).
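The sketch below illustrates the general idea of combining a structure query with a content query in standard SQL. It does not use WebLog or W3QL syntax: the two-table schema (page, link), the sample rows and the query are assumptions, run against an in-memory SQLite database.

```python
import sqlite3

# Hypothetical schema: one table for page content, one for hyperlinks.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (url TEXT PRIMARY KEY, title TEXT, body TEXT);
CREATE TABLE link (src TEXT, dst TEXT);
""")
conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
    ("a.html", "Home", "welcome to the data mining group"),
    ("b.html", "Papers", "papers on web mining and clustering"),
    ("c.html", "Contact", "email and phone"),
])
conn.executemany("INSERT INTO link VALUES (?, ?)", [
    ("a.html", "b.html"),
    ("a.html", "c.html"),
    ("b.html", "a.html"),
])

# Structure condition: pages linked from a.html.
# Content condition: body text mentions "mining".
rows = conn.execute("""
    SELECT p.url, p.title
    FROM link AS l JOIN page AS p ON p.url = l.dst
    WHERE l.src = ? AND p.body LIKE ?
    ORDER BY p.url
""", ("a.html", "%mining%")).fetchall()
print(rows)
```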
Web Structure Mining
 Web Structure Mining is the process of
discovering structure information from the
Web. This type of mining can be performed
either at the (intra-page) document level or
at the (inter-page) hyperlink level. Research
at the hyperlink level is also called
hyperlink analysis.
Web Structure Mining
Different Algorithms for Web Structures:
 Page-Rank Method
Sergey Brin and Lawrence Page. The anatomy of a
large-scale hypertextual web search engine. In
Proc. of WWW, pages 107–117, Brisbane,
Australia, 1998.
 CLEVER Method
http://www.almaden.ibm.com/projects/clever.shtml
Page-Rank Method
 Introduced by Brin and Page (1998)
 Used in the Google search engine
 Mines the hyperlink structure of the web to produce a ‘global’
importance ranking of every web page
 Web search results are returned in rank order
 Treats a link like an academic citation
 Assumption: highly linked pages are more ‘important’
than pages with few links
 A page has a high rank if the sum of the ranks of its
back-links is high
[Figure: link structure of the Web, showing backlinks]
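The statement that a page ranks highly when the sum of the ranks of its back-links is high can be sketched as a simple power iteration. This is a minimal illustration, not Google's implementation: it assumes the normalized form in which ranks sum to 1, a damping factor of 0.85, an even redistribution of rank from pages without out-links, and a hypothetical four-page web.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {page: [pages it links to]}.

    Each page passes an equal share of its rank to every page it links to.
    Rank held by pages with no out-links is spread evenly over all pages.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical link graph: C has three back-links, D has none.
TOY_WEB = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks = pagerank(TOY_WEB)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, C collects rank from three back-links and comes out on top, while D, which nothing links to, keeps only the baseline share.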
CLEVER Method
 CLient–side EigenVector-Enhanced Retrieval
 Developed by a team of researchers at the IBM
Almaden Research Center
 Ranks pages primarily by measuring links between
them
 A continued refinement of HITS (Hypertext Induced
Topic Selection)
 Basic principles: authorities and hubs
 Good hubs point to good authorities
 Good authorities are referenced by good hubs
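The mutual reinforcement between hubs and authorities can be sketched with the basic HITS iteration. This is a minimal illustration of HITS only, not of the later CLEVER refinements, and the link graph and page names are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Basic HITS over {page: [pages it links to]}; returns (hub, authority)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[dst] for dst in links.get(p, ())) for p in pages}
        # Normalize both vectors to unit length so scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: two link-list pages pointing at three papers.
TOY_GRAPH = {
    "portal": ["paper1", "paper2"],
    "index": ["paper1", "paper2", "paper3"],
    "paper1": [],
    "paper2": [],
    "paper3": [],
}

hub, auth = hits(TOY_GRAPH)
print("best hub:", max(hub, key=hub.get))
print("authority scores:", {p: round(s, 3) for p, s in sorted(auth.items())})
```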
Web Usage Mining
 Web usage mining is also known as Web log
mining
 Applies mining techniques to discover interesting usage
patterns from the data derived from the
interactions of users while surfing the web
 Mines Web log records to discover user access
patterns of Web pages
Web Usage Mining – Three Phases
Web Usage Mining
 Preprocessing consists of converting the usage, content, and
structure information contained in the various available data sources
into the data abstractions necessary for pattern discovery.
 Pattern discovery draws upon methods and algorithms developed
from several fields such as statistics, data mining, machine learning
and pattern recognition.
 The motivation behind pattern analysis is to filter out uninteresting
rules or patterns from the set found in the pattern discovery phase.
The exact analysis methodology is usually governed by the
application for which Web mining is done.
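The three phases can be sketched end to end on a tiny server log. Everything specific here is an assumption made for illustration: the Common Log Format input, the 30-minute session timeout, the filter on image/style/script requests, the choice of page-to-page transitions as the pattern, and the minimum support of 2.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timedelta

LOG_LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+) [^"]*" (\d{3}) \S+')


def preprocess(lines, timeout=timedelta(minutes=30)):
    """Phase 1: parse log lines and split each host's requests into sessions."""
    by_host = defaultdict(list)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or m.group(4) != "200":
            continue  # skip malformed lines and unsuccessful requests
        host, stamp, path = m.group(1), m.group(2), m.group(3)
        if path.endswith((".gif", ".css", ".js")):
            continue  # skip embedded resources, keep page views
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        by_host[host].append((when, path))
    sessions = []
    for requests in by_host.values():
        requests.sort()
        current = [requests[0][1]]
        for (prev_time, _), (time, path) in zip(requests, requests[1:]):
            if time - prev_time > timeout:
                sessions.append(current)
                current = []
            current.append(path)
        sessions.append(current)
    return sessions


def discover(sessions):
    """Phase 2: count how often one page is directly followed by another."""
    counts = Counter()
    for session in sessions:
        counts.update(zip(session, session[1:]))
    return counts


def analyze(counts, min_support=2):
    """Phase 3: filter out transitions seen fewer than min_support times."""
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical log entries, used only for illustration.
SAMPLE_LOG = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:55:40 -0700] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.1 - - [10/Oct/2000:13:56:10 -0700] "GET /books.html HTTP/1.0" 200 1432',
    '10.0.0.2 - - [10/Oct/2000:14:01:00 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:14:02:30 -0700] "GET /books.html HTTP/1.0" 200 1432',
    '10.0.0.2 - - [10/Oct/2000:14:03:00 -0700] "GET /cart.html HTTP/1.0" 200 800',
    '10.0.0.1 - - [10/Oct/2000:16:00:00 -0700] "GET /index.html HTTP/1.0" 200 2326',
]

sessions = preprocess(SAMPLE_LOG)
patterns = analyze(discover(sessions))
print(sessions)
print(patterns)
```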
Applications
 Personalized experience in B2C e-commerce: Amazon.com
 Web search: Google
 Web-wide user tracking: DoubleClick
 Understanding user communities: AOL
 Understanding auction behavior: eBay
 Personalized web portal: MyYahoo
Conclusion
 Web mining: data mining techniques to
automatically discover and extract information
from Web documents/services (Etzioni, 1996).
 Web mining research integrates research from
several research communities (Kosala and
Blockeel, July 2000) such as:
 Database (DB)
 Information retrieval (IR)
 The sub-areas of machine learning (ML)
 Natural language processing (NLP)
References
 mandolin.cais.ntu.edu.sg/wise2002/web-miningWISE-30
 David Gibson, Jon Kleinberg, and Prabhakar
Raghavan. Inferring web communities from link
topology. In Conference on Hypertext and
Hypermedia. ACM, 1998.
 www.iprcom.com/papers/pagerank/
 http://maya.cs.depaul.edu/~mobasher/webminer/survey/node23.html
 http://en.wikipedia.org/wiki/Web_mining
 http://en.wikipedia.org/wiki/Shop_bot
 Y. S. Maarek and I. Z. Ben Shaul. Automatically organizing
bookmarks per contents. Proc. Fifth International World
Wide Web Conference, May 6-10, 1996.
 R. Cooley, B. Mobasher, et al. Web Mining:
Information and Pattern Discovery on the World Wide Web.
Proc. IEEE Intl. Conf. Tools with AI, Newport Beach, CA,
pp. 558-567, 1997.
 R. Kosala and H. Blockeel. Web Mining Research:
A Survey. SIGKDD Explorations, 2(1):1-15, 2000.
 R. Cooley, B. Mobasher, and J. Srivastava. Data
preparation for mining world wide web browsing
patterns. Knowledge and Information
Systems, 1(1):5-32, 1999.
 S. Chakrabarti. Data mining for hypertext: A tutorial
survey. ACM SIGKDD Explorations, 1(2):1-11,
2000.
THANK YOU!!