Web Mining Research: A Survey
Raymond Kosala and Hendrik Blockeel
ACM SIGKDD, July 2000
Presented by Shan Huang, 4/24/2007
Outline
Introduction
Web Mining
Web Content Mining
Web Structure Mining
Web Usage Mining
Conclusion & Exam Questions
Four Problems
1. Finding relevant information
   - Low precision
   - Unindexed information
2. Creating new knowledge out of available information on the web
3. Personalizing the information
   - Catering to personal preference in content and presentation
4. Learning about the consumers
   - What does the customer want to do?
   - Using web data to effectively market products and/or services
Other Approaches
Web mining is not the only approach
Database approach (DB)
Information retrieval (IR)
Natural language processing (NLP)
   - In-depth syntactic and semantic analysis
Web document community
   - Standards, manually appended meta-information, maintained directories, etc.
Direct vs Indirect Web Mining
Web mining techniques can be used to solve the information overload problems:
Directly
   - Attack the problem with web mining techniques
   - E.g. newsgroup agent classifies news as relevant
Indirectly
   - Used as part of a bigger application that addresses problems
   - E.g. used to create index terms for a web search service
The Research
Converging research from databases, information retrieval,
and artificial intelligence (specifically NLP and machine learning)
Paper focuses on research from the machine
learning point of view
Web Mining: Definition
“Web mining refers to the overall process of
discovering potentially useful and previously
unknown information or knowledge from the
Web data.”



Can be viewed as four subtasks
Not the same as Information Retrieval
Not the same as Information Extraction
Web Mining: Subtasks
1. Resource finding
   - Retrieving intended documents
2. Information selection/pre-processing
   - Select and pre-process specific information from selected documents
3. Generalization
   - Discover general patterns within and across web sites
4. Analysis
   - Validation and/or interpretation of mined patterns
Web Mining: Not IR or IE
Information retrieval (IR) is the automatic
retrieval of all relevant documents while at the
same time retrieving as few of the non-relevant
documents as possible
Web document classification, which is a Web
Mining task, could be part of an IR system (e.g.
indexing for a search engine)
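The IR goal stated above (retrieve all relevant documents, and as few non-relevant ones as possible) is usually measured as recall and precision. A minimal sketch, assuming documents are identified by made-up string ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: share of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d7"]
precision, recall = precision_recall(retrieved, relevant)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found.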
Web Mining: Not IR or IE
Information extraction (IE) aims to extract the
relevant facts from given documents while IR
aims to select the relevant documents


IE systems for the general Web are not feasible
Most focus on specific Web sites or content
Web Mining and Machine Learning
As a broad subfield of artificial intelligence, machine
learning is concerned with the development of
algorithms and techniques that allow computers to
"learn".
Web mining is not the same as learning from the Web.
Some applications of machine learning on the Web
are not Web mining
Some methods used for Web mining are not machine
learning methods
However, there is a close relationship between web
mining and machine learning.
Web Mining Categories
Web Content Mining
   - Discovering useful information from web contents/data/documents.
   - IR view for finding
   - DB view for modeling
Web Structure Mining
   - Discovering the model underlying link structures (topology) on the Web
   - E.g. discovering authorities and hubs
Web Usage Mining
   - Make sense of data generated by surfers
   - Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.
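The "authorities and hubs" example under Web Structure Mining can be sketched with the well-known HITS iteration: a page is a good authority if good hubs link to it, and a good hub if it links to good authorities. This is an illustrative sketch only; the link graph and page names are invented, and the slides do not prescribe this particular formulation.

```python
def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph, for illustration only.
links = {"portal": ["a", "b", "c"], "blog": ["a", "b"], "a": [], "b": ["a"], "c": []}
hub, auth = hits(links)
top_hub = max(hub, key=hub.get)
top_authority = max(auth, key=auth.get)
```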
Web Content Data Structure
Unstructured – free text
Semi-structured – HTML
More structured – Table or Database generated
HTML pages
Multimedia data – receives less attention than
text or hypertext
Web Mining: The Agent Paradigm
User Interface Agents
   - Information retrieval agents, information filtering agents, and personal assistant agents
Distributed Agents
   - Distributed agents for knowledge discovery or data mining
   - Problem solving by a group of agents
Mobile Agents
Web Mining: The Agent Paradigm
Content-based approach
   - The system searches for items whose content matches the user's preferences
Collaborative approach
   - The system tries to find users with similar interests
   - Recommendations are given based on what similar users did
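The collaborative approach can be illustrated with a very small nearest-neighbour recommender. This is a sketch under stated assumptions: the ratings, user names, and the choice of cosine similarity are all made up for illustration and are not taken from the survey.

```python
from math import sqrt

# Hypothetical ratings (user -> page -> score), for illustration only.
ratings = {
    "ann": {"page_a": 5, "page_b": 4, "page_c": 1},
    "bob": {"page_a": 4, "page_b": 5, "page_d": 5},
    "cat": {"page_c": 5, "page_e": 4},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def recommend(user, ratings):
    """Suggest items the most similar other user rated but `user` has not seen."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda name: cosine(ratings[user], ratings[name]))
    unseen = {i: s for i, s in ratings[nearest].items() if i not in ratings[user]}
    return nearest, sorted(unseen, key=unseen.get, reverse=True)


nearest, suggestions = recommend("ann", ratings)
```

A content-based system would instead compare item descriptions against a profile of the user's own preferences, without looking at other users.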
Web Content Mining: IR View
Unstructured Documents
   - Bag of words, or phrase-based feature representation
   - Features can be boolean or frequency based
   - Features can be reduced using different feature selection techniques
   - Word stemming, combining morphological variations into one feature
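A minimal sketch of the bag-of-words representation described above, showing frequency features, boolean features, a toy stemming step, and a simple frequency-threshold feature selection. The stemmer, stopword list, threshold, and sample sentence are assumptions made for illustration; a real system would use an established stemming algorithm.

```python
import re
from collections import Counter


def crude_stem(word):
    """Toy suffix stripper (illustrative assumption, not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text, stopwords=frozenset({"the", "a", "of", "and", "is", "in"})):
    """Return term -> frequency, ignoring word order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in stopwords)


# Hypothetical document text, for illustration only.
doc = "Linked pages: the crawler follows links, linking pages in the web"
frequency = bag_of_words(doc)   # frequency-based features
boolean = set(frequency)        # boolean features: term present or not
# Simple feature selection: keep terms seen at least twice (arbitrary threshold).
selected = {term for term, count in frequency.items() if count >= 2}
```

Note how "linked", "links", and "linking" collapse into the single feature "link".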
Web Content Mining: IR View
Semi-Structured Documents
   - Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks)
   - Uses common data mining methods (whereas unstructured might use more text mining methods)
Web Content Mining: DB View
Tries to infer the structure of a Web site or transform a Web site into a database
   - Better information management
   - Better querying on the Web
Can be achieved by:
   - Finding the schema of Web documents
   - Building a Web warehouse
   - Building a Web knowledge base
   - Building a virtual database
Web Content Mining: DB View
Mainly uses the Object Exchange Model (OEM)
   - Represents semi-structured data (some structure, no rigid schema) by a labeled graph
Process typically starts with manual selection of Web sites for content mining
Main application: building a structural summary of semi-structured data (schema extraction or discovery)
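To make the idea of a labeled graph and a structural summary concrete, here is a small sketch. It assumes a simplified, tree-shaped encoding in which every object is a `(label, value)` pair and the value is either atomic or a list of child objects; the sample site data is invented. Collecting the distinct label paths is only one simple notion of a structural summary, and a general OEM graph with cycles would additionally need a visited set.

```python
# Each object is (label, value); value is atomic or a list of child objects.
# Hypothetical data, for illustration only.
site = ("site", [
    ("page", [("title", "Home"), ("link", "about.html")]),
    ("page", [("title", "About"), ("author", "J. Doe"), ("link", "index.html")]),
])


def label_paths(obj, prefix=()):
    """Collect every distinct label path reachable from the root object."""
    label, value = obj
    path = prefix + (label,)
    paths = {path}
    if isinstance(value, list):
        for child in value:
            paths |= label_paths(child, path)
    return paths


# The summary shows which labels occur where, without a rigid schema:
# here "author" appears under some pages but not all.
summary = sorted(".".join(p) for p in label_paths(site))
```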
Web Structure Mining
Interested in the structure between Web documents (not within a document)
Inspired by the study of social networks and citation analysis
Example: PageRank (Google)
Applications:
   - Discovering micro-communities in the Web
   - Measuring the “completeness” of a Web site
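The PageRank example can be sketched as a power iteration over the link graph. This is a textbook-style simplification, not Google's production method: the damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the four-page graph are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                # Page without out-links: spread its rank over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank


# Hypothetical link graph, for illustration only.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
best = max(rank, key=rank.get)
```

Page "c" ends up with the highest score because three of the four pages link to it.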
Web Usage Mining
Tries to predict user behavior from interaction with the Web
Wide range of data (logs)
   - Web client data
   - Proxy server data
   - Web server data
Two common approaches
1. Map usage data into relational tables before using adapted data mining techniques
2. Use log data directly by utilizing special pre-processing techniques
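The first approach, mapping usage data into relational tables, might look like the following sketch. It assumes server logs in the Common Log Format and uses an in-memory SQLite table; the log lines, table name, and column names are invented for illustration.

```python
import re
import sqlite3

# Pattern for Common Log Format lines (assumed input format).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical log lines, for illustration only.
log_lines = [
    '10.0.0.1 - - [24/Apr/2007:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [24/Apr/2007:10:00:05 +0000] "GET /papers.html HTTP/1.0" 200 2310',
    '10.0.0.2 - - [24/Apr/2007:10:01:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (host TEXT, time TEXT, url TEXT, status INTEGER)")
for line in log_lines:
    match = LOG_PATTERN.match(line)
    if match:  # skip lines that do not parse
        conn.execute(
            "INSERT INTO hits VALUES (?, ?, ?, ?)",
            (match.group("host"), match.group("time"),
             match.group("url"), int(match.group("status"))),
        )

# Once the data is relational, ordinary queries (or mining tools) apply.
page_counts = conn.execute(
    "SELECT url, COUNT(*) FROM hits GROUP BY url ORDER BY COUNT(*) DESC, url"
).fetchall()
```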
Web Usage Mining
Typical problems: distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
Usage mining often uses some background or domain knowledge
   - E.g. site topology, Web content, etc.
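One common heuristic for recovering server sessions is to group requests by host and start a new session after a period of inactivity. The sketch below assumes a 30-minute timeout and treats each host as one user; both are assumptions, and as the slide notes, proxies and caching make host-based identification unreliable in practice. The request data is invented.

```python
from datetime import datetime, timedelta


def sessionize(requests, timeout=timedelta(minutes=30)):
    """Group (host, datetime, url) requests into per-host sessions."""
    sessions = []        # list of (host, [urls]) in order of session start
    current = {}         # host -> index of that host's open session
    last_seen = {}       # host -> time of that host's previous request
    for host, when, url in sorted(requests, key=lambda r: r[1]):
        # Start a new session on first sight or after a long gap.
        if host not in last_seen or when - last_seen[host] > timeout:
            sessions.append((host, []))
            current[host] = len(sessions) - 1
        sessions[current[host]][1].append(url)
        last_seen[host] = when
    return sessions


# Hypothetical requests, for illustration only.
t0 = datetime(2007, 4, 24, 10, 0)
requests = [
    ("10.0.0.1", t0, "/index.html"),
    ("10.0.0.1", t0 + timedelta(minutes=5), "/papers.html"),
    ("10.0.0.2", t0 + timedelta(minutes=6), "/index.html"),
    ("10.0.0.1", t0 + timedelta(hours=2), "/index.html"),
]
sessions = sessionize(requests)
```

The last request from 10.0.0.1 arrives two hours later, so it opens a separate session.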
Web Usage Mining
Two main categories:
1. Learning a user profile (personalized)
   - Web users would be interested in techniques that learn their needs and preferences automatically
2. Learning user navigation patterns (impersonalized)
   - Information providers would be interested in techniques that improve the effectiveness of their Web site or bias users towards the goals of the site
Conclusions
Tried to resolve confusion with regard to the term Web Mining
   - Differentiated from IR and IE
Suggested three Web mining categories:
   - Content, Structure, and Usage Mining
Briefly described approaches for the three categories
Explored connection with agent paradigm
Exam Question #1
Question: Outline the main characteristics of
Web information.
Answer: Web information is huge, diverse, and
dynamic.
Exam Question #2
Question: How can data mining techniques be used in Web information analysis? Give at least two examples.
Answer:
   - Classification: classifying server logs with a decision tree or Naïve Bayes classifier to discover the profiles of users belonging to a particular class
   - Clustering: grouping users who exhibit similar browsing patterns
   - Association analysis: relating pages that are most often referenced together in a single server session
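The association analysis example can be sketched by counting how often two pages occur in the same server session and keeping pairs above a support threshold. The sessions and the 50% minimum support are made up for illustration; full association rule mining would also consider larger itemsets and rule confidence.

```python
from collections import Counter
from itertools import combinations

# Hypothetical server sessions (pages requested per session), for illustration only.
sessions = [
    ["/index.html", "/papers.html", "/contact.html"],
    ["/index.html", "/papers.html"],
    ["/index.html", "/contact.html"],
    ["/papers.html", "/index.html"],
]

# Count each unordered pair of pages once per session.
pair_counts = Counter()
for pages in sessions:
    for pair in combinations(sorted(set(pages)), 2):
        pair_counts[pair] += 1

# Support: fraction of sessions containing both pages.
min_support = 0.5  # arbitrary threshold chosen for this example
frequent_pairs = {
    pair: count / len(sessions)
    for pair, count in pair_counts.items()
    if count / len(sessions) >= min_support
}
```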
Exam Question #3
Question: What are the three main areas of
interest for Web mining?
Answer: (1) Web Content
(2) Web Structure
(3) Web Usage
Thank you!