Web Mining Research: A Survey
Raymond Kosala and Hendrik Blockeel
ACM SIGKDD, July 2000
Presented by Shan Huang, 4/24/2007
Revised and presented by Fan Min, 4/22/2009

Outline
• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Conclusion & Exam Questions

Four Problems
• Finding relevant information
  ◦ Low precision and unindexed information
• Creating new knowledge out of available information on the web
• Personalizing the information
  ◦ Catering to personal preference in content and presentation
• Learning about the consumers
  ◦ What does the customer want to do?
  ◦ Using web data to effectively market products and/or services

Other Approaches
Web mining is NOT the only approach
• Database approach (DB)
• Information retrieval (IR)
• Natural language processing (NLP)
  ◦ In-depth syntactic and semantic analysis
• Web document community
  ◦ Standards, manually appended meta-information, maintained directories, etc.

Direct vs. Indirect Web Mining
• Web mining techniques can be used to solve the information overload problems:
  ◦ Directly: attack the problem with web mining techniques. E.g., a newsgroup agent classifies news as relevant
  ◦ Indirectly: used as part of a bigger application that addresses the problems. E.g., used to create index terms for a web search service

The Research
• Converging research from databases, information retrieval, and artificial intelligence (specifically NLP and machine learning)
• Focusing on research from the machine learning point of view

Web Mining: Definition
• “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”
  ◦ Can be viewed as four subtasks
  ◦ Not the same as Information Retrieval
  ◦ Not the same as Information Extraction

Web Mining: Subtasks
• Resource finding
  ◦ Retrieving intended documents
• Information selection/pre-processing
  ◦ Select and pre-process specific information from selected documents
• Generalization
  ◦ Discover general patterns within and across web sites
• Analysis
  ◦ Validation and/or interpretation of mined patterns

Web Mining: Not IR
• Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible
• Web document classification, which is a Web mining task, could be part of an IR system (e.g. indexing for a search engine)

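Note: the IR goal on this slide (retrieve all relevant documents, few non-relevant ones) is commonly quantified as recall and precision. A minimal Python sketch follows; the function name and the document IDs are hypothetical and only illustrate the two ratios.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents returned, 3 documents actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")
```

High precision means few non-relevant documents were returned; high recall means few relevant documents were missed.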
Web Mining: Not IE
• Information extraction (IE) aims to extract the relevant facts from given documents
  ◦ IE systems for the general Web are not feasible
  ◦ Most focus on specific Web sites or content

Web Mining and Machine Learning
• Machine learning is concerned with the development of algorithms and techniques that allow computers to "learn".
• Web mining is NOT the same as learning from the Web.
  ◦ Some applications of machine learning on the web are NOT Web mining
  ◦ Methods used for Web mining are NOT limited to machine learning
• Nevertheless, there is a close relationship between web mining and machine learning

Web Mining: The Agent Paradigm
• User Interface Agents
  ◦ Information retrieval agents, information filtering agents, and personal assistant agents
• Distributed Agents
  ◦ Distributed agents for knowledge discovery or data mining
  ◦ Problem solving by a group of agents
• Mobile Agents

Web Mining: The Agent Paradigm
• Content-based approach
  ◦ The system searches for items that match the user's preferences, based on an analysis of the items' content
• Collaborative approach
  ◦ The system tries to find users with similar interests
  ◦ Recommendations are given based on what similar users did

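Note: the collaborative approach on this slide can be sketched as "find the most similar user, then suggest what that user liked". The Python below is one simple way to do that, assuming cosine similarity over explicit ratings; the users, pages, and scores are made up for illustration.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: score}
ratings = {
    "ann": {"page_a": 5, "page_b": 4, "page_c": 1},
    "bob": {"page_a": 4, "page_b": 5, "page_d": 4},
    "cat": {"page_c": 5, "page_d": 1, "page_e": 4},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, ratings):
    """Suggest items the single most similar user rated and `user` has not seen."""
    others = [(cosine(ratings[user], r), name)
              for name, r in ratings.items() if name != user]
    _, nearest = max(others)
    seen = set(ratings[user])
    candidates = {i: s for i, s in ratings[nearest].items() if i not in seen}
    return sorted(candidates, key=candidates.get, reverse=True)

print(recommend("ann", ratings))
```

A content-based system would instead compare item descriptions against a profile of the user's own preferences, with no reference to other users.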
Web Mining Categories
• Web Content Mining
  ◦ Discovering useful information from web contents/data/documents
• Web Structure Mining
  ◦ Discovering the model underlying link structures (topology) on the Web, e.g. discovering authorities and hubs
• Web Usage Mining
  ◦ Making sense of data generated by Web surfers
  ◦ Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.

Web Content Data Structure
• Unstructured – free text
• Semi-structured – HTML
• More structured – tables or database-generated HTML pages
• Multimedia data – receives less attention than text or hypertext

Web Content Mining: IR View
• Unstructured Documents
  ◦ Bag-of-words or phrase-based feature representation
  ◦ Features can be boolean or frequency based
  ◦ Features can be reduced using different feature selection techniques
  ◦ Word stemming combines morphological variations into one feature

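Note: a rough Python sketch of the representation on this slide, showing frequency-based and boolean bag-of-words features. The `crude_stem` function is a toy stand-in assumed here for illustration; a real system would use a proper stemmer such as Porter's.

```python
import re
from collections import Counter

def crude_stem(word):
    """Toy suffix stripper (an assumption, not a real stemming algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Lowercase, tokenize on letters, stem, and count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens)

docs = ["Mining the web", "Web miners mined web logs"]
bags = [bag_of_words(d) for d in docs]

vocab = sorted(set().union(*bags))                          # shared feature space
freq = [[bag[t] for t in vocab] for bag in bags]            # frequency features
boolean = [[int(bag[t] > 0) for t in vocab] for bag in bags]  # boolean features

print(vocab)
print(freq)
print(boolean)
```

Here "mining" and "mined" collapse into one feature, while "miners" does not, which shows both the point of stemming and the limits of a naive rule.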
Web Content Mining: IR View
• Semi-Structured Documents
  ◦ Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks)
  ◦ Uses common data mining methods (whereas unstructured might use more text mining methods)

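Note: one way to picture "richer representations based on document structure" is to keep words from the title, from link anchor text, and from the remaining body as separate feature groups. The sketch below uses Python's standard `html.parser`; the class name, the three feature groups, and the sample page are illustrative assumptions.

```python
from html.parser import HTMLParser

class StructureFeatures(HTMLParser):
    """Collect words separately for <title>, <a> anchor text, and other text."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags
        self.features = {"title": [], "anchor": [], "body": []}
        self.links = []      # href targets of hyperlinks

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed tags).
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        words = [w.lower() for w in data.split()]
        if not words:
            return
        if "title" in self.stack:
            field = "title"
        elif "a" in self.stack:
            field = "anchor"
        else:
            field = "body"
        self.features[field].extend(words)

page = ('<html><head><title>Web Mining</title></head><body>'
        '<p>See the <a href="/survey.html">survey paper</a> for details.</p>'
        '</body></html>')
parser = StructureFeatures()
parser.feed(page)
parser.close()
print(parser.features)
print(parser.links)
```

A learner could then weight title or anchor words differently from body words, which a flat bag of words cannot do.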
Web Content Mining: DB View
• Tries to infer the structure of a Web site or transform a Web site to become a database
  ◦ Better information management
  ◦ Better querying on the Web
• Can be achieved by:
  ◦ Finding the schema of Web documents
  ◦ Building a Web warehouse
  ◦ Building a Web knowledge base
  ◦ Building a virtual database

Web Content Mining: DB View
• Mainly uses the Object Exchange Model (OEM)
  ◦ Represents semi-structured data (some structure, no rigid schema) by a labeled graph
• Process typically starts with manual selection of Web sites for content mining
• Main application: building a structural summary of semi-structured data (schema extraction or discovery)

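Note: to make "labeled graph" and "structural summary" concrete, the Python sketch below stores a small OEM-style object as nested (label, value) pairs and lists every distinct label path. This is a tree-shaped simplification assumed for illustration (real OEM data may share or cycle between objects), and the sample data is invented; it is not the algorithm of any particular system surveyed.

```python
# Hypothetical OEM-style data: each object is (label, value), where value is
# either an atomic value or a list of child objects. No schema is declared.
site = ("site", [
    ("person", [("name", "Ann"), ("email", "ann@example.org")]),
    ("person", [("name", "Bob"),
                ("homepage", [("url", "http://example.org/bob")])]),
])

def label_paths(obj, prefix=()):
    """Return the set of distinct label paths reachable from obj."""
    label, value = obj
    path = prefix + (label,)
    paths = {path}
    if isinstance(value, list):          # complex object: recurse into children
        for child in value:
            paths |= label_paths(child, path)
    return paths

summary = sorted(".".join(p) for p in label_paths(site))
for line in summary:
    print(line)
```

The two `person` objects have different fields, yet the summary still describes what structure a query could expect to find.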
Web Structure Mining
• Interested in the structure between Web documents (not within a document)
• Inspired by the study of social networks and citation analysis
• Example: PageRank (Google)
• Applications:
  ◦ Discovering micro-communities in the Web
  ◦ Measuring the “completeness” of a Web site

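Note: the PageRank example can be sketched as a power iteration over the link graph. The Python below follows the textbook formulation with a damping factor of 0.85 (an assumed, commonly quoted value) and a made-up four-page graph; it is an illustration, not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict: page -> list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:                        # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical link graph: "c" is linked from three pages, "d" from none.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

The ranking depends only on links between documents, never on their text, which is exactly the distinction this slide draws from content mining.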
Web Usage Mining
• Tries to predict user behavior from interaction with the Web
• Wide range of data (logs)
  ◦ Web client data
  ◦ Proxy server data
  ◦ Web server data
• Two common approaches
  ◦ Map usage data into relational tables before using adapted data mining techniques
  ◦ Use log data directly by utilizing special pre-processing techniques

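Note: the first approach (map usage data into relational tables) might look like the Python sketch below, which parses server log lines and loads them into an in-memory SQLite table. It assumes the logs are in Common Log Format; the table name `hits`, its columns, and the sample lines are illustrative choices.

```python
import re
import sqlite3

# Common Log Format: host ident user [time] "method path protocol" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

log_lines = [
    '10.0.0.1 - - [22/Apr/2009:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [22/Apr/2009:10:00:05 +0000] "GET /papers.html HTTP/1.0" 200 2310',
    '10.0.0.2 - - [22/Apr/2009:10:01:00 +0000] "GET /index.html HTTP/1.0" 404 -',
]

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE hits (host TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
)
for line in log_lines:
    m = CLF.match(line)
    if m:  # malformed lines are silently skipped in this sketch
        db.execute(
            "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
            (m["host"], m["time"], m["method"], m["path"], int(m["status"])),
        )

# Once the data is relational, ordinary SQL can answer usage questions.
top = db.execute(
    "SELECT path, COUNT(*) AS n FROM hits WHERE status = 200 "
    "GROUP BY path ORDER BY n DESC, path"
).fetchall()
print(top)
```

The second approach on the slide would skip the table and run specialized pre-processing directly over the raw log lines.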
Web Usage Mining
• Typical problems: distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
• Usage mining often uses some background or domain knowledge
  ◦ E.g., site topology, Web content, etc.

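Note: a common heuristic for splitting a log into server sessions is an inactivity timeout. The Python sketch below assumes a 30-minute timeout and uses the client address as the user key; both are assumptions, and the second is exactly what caching and proxy servers make unreliable, since many users can share one address.

```python
from datetime import datetime, timedelta

def sessionize(hits, timeout=timedelta(minutes=30)):
    """Group (user_key, timestamp, path) hits into per-user sessions.

    A new session starts when the gap since that user's previous
    request exceeds `timeout`.
    """
    sessions = []   # list of (user_key, [paths])
    last = {}       # user_key -> (time of last hit, index of open session)
    for user, when, path in sorted(hits, key=lambda h: h[1]):
        if user in last and when - last[user][0] <= timeout:
            idx = last[user][1]
        else:
            sessions.append((user, []))
            idx = len(sessions) - 1
        sessions[idx][1].append(path)
        last[user] = (when, idx)
    return sessions

t0 = datetime(2009, 4, 22, 10, 0)
hits = [
    ("10.0.0.1", t0, "/index.html"),
    ("10.0.0.1", t0 + timedelta(minutes=5), "/papers.html"),
    ("10.0.0.2", t0 + timedelta(minutes=6), "/index.html"),
    ("10.0.0.1", t0 + timedelta(hours=2), "/index.html"),
]
sessions = sessionize(hits)
for user, paths in sessions:
    print(user, paths)
```

Background knowledge such as site topology can refine this, for example by inferring cached pages that never reached the server log.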
Web Usage Mining
• Two main categories:
  ◦ Learning a user profile (personalized): Web users would be interested in techniques that learn their needs and preferences automatically
  ◦ Learning user navigation patterns (impersonalized): information providers would be interested in techniques that improve the effectiveness of their Web site or bias the users towards the goals of the site

Conclusions
• The paper tried to resolve confusion with regard to the term Web mining
  ◦ Differentiated it from IR and IE
• Suggested three Web mining categories
  ◦ Content, Structure, and Usage Mining
• Briefly described approaches for the three categories
• Explored the connection with the agent paradigm

Exam Question #1
• Question: Outline the main characteristics of Web information.
• Answer: Web information is huge, diverse, and dynamic.

Exam Question #2
• Question: How can data mining techniques be used in Web information analysis? Give at least two examples.
• Answer:
  ◦ Classification: classifying server log data with a decision tree or a Naïve Bayes classifier to discover the profiles of users belonging to a particular class
  ◦ Clustering: grouping users who exhibit similar browsing patterns
  ◦ Association analysis: relating pages that are most often referenced together in a single server session

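Note: the association analysis example in this answer can be sketched by counting how often two pages appear in the same server session and keeping the pairs above a support threshold. The sessions and the 0.5 threshold below are assumptions for illustration; a full association-rule miner such as Apriori would also handle larger itemsets and confidence.

```python
from collections import Counter
from itertools import combinations

# Hypothetical server sessions, each a set of pages visited together.
sessions = [
    {"/index.html", "/papers.html", "/contact.html"},
    {"/index.html", "/papers.html"},
    {"/index.html", "/contact.html"},
    {"/papers.html", "/download.html"},
]

def frequent_pairs(sessions, min_support=0.5):
    """Return {pair: support} for page pairs seen in >= min_support of sessions."""
    counts = Counter()
    for pages in sessions:
        # Sorting gives each pair one canonical order before counting.
        counts.update(combinations(sorted(pages), 2))
    n = len(sessions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

pairs = frequent_pairs(sessions)
for pair, support in sorted(pairs.items()):
    print(pair, support)
```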
Exam Question #3
• Question: What are the three main areas of interest for Web mining?
• Answer: (1) Web Content, (2) Web Structure, (3) Web Usage

Thank you!
And Raymond Kosala, Hendrik Blockeel
And Shan Huang!