Web Mining Research: A Survey
Authors:
Raymond Kosala and Hendrik Blockeel
ACM SIGKDD, July 2000
Presented by Shan Huang, 4/24/2007
Revised and presented by Fan Min, 4/22/2009
Revised and presented by Nima [Poornima Shetty]
Date: 12/06/2011
Course: Data Mining [CS332]
Computer Science Department
University of Vermont
Outline
 Introduction
 Web Mining
 Web Content Mining
 Web Structure Mining
 Web Usage Mining
 Conclusion & Exam Questions
Introduction
 With the huge amount of information available online, the World Wide Web is a fertile area for data mining research.
 The WWW is a popular and interactive medium for disseminating information today.
 The Web is huge, diverse, and dynamic.
   These properties raise scalability, multimedia data, and temporal issues, respectively.
Four Problems
 Finding relevant information
   Low precision (many irrelevant results) and low recall (unindexed information)
 Creating new knowledge out of available information on the web
   A data-triggered process
 Personalizing the information
   Personal preference in content and presentation of the information
 Learning about the consumers
   What does the customer want to do?
Other Approaches
Web mining is NOT the only approach
 Database approach (DB)
 Information retrieval (IR)
 Natural language processing (NLP)
 Web document community
Direct vs. Indirect Web Mining
 Web mining techniques can be used to solve the information overload problems:
   Directly
       Address the problem with web mining techniques
       E.g. a newsgroup agent classifies whether a news item is relevant
   Indirectly
       Used as part of a bigger application that addresses the problems
       E.g. used to create index terms for a web search service
The Research
 Converging research from databases, information retrieval, and artificial intelligence (specifically NLP and machine learning)
 The survey attempts to organize this research in a structured way from the machine learning point of view
Web Mining: Definition
 “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”
 Can be viewed as consisting of four subtasks
Web Mining: Subtasks
 Resource finding
   Retrieving intended web documents
 Information selection and pre-processing
   Selecting and pre-processing specific information from the retrieved documents
   A transformation of the original data retrieved in the IR process; this transformation can itself be a form of pre-processing
 Generalization
   Discovering general patterns within and across web sites
 Analysis
   Validation and/or interpretation of the mined patterns
Web Mining and Information Retrieval
 Information retrieval (IR) is the automatic retrieval of all
relevant documents while at the same time retrieving as few
of the non-relevant documents as possible
 Goal: Indexing text and searching for useful documents in a
collection.
 Research in IR: modeling, document classification and
categorization, user interfaces, data visualization, filtering
etc.
 Web document classification, which is a Web mining task, could be part of an IR system (e.g. indexing for a search engine)
   Viewed in this respect, Web mining is part of the (Web) IR process.
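
The IR goal stated on this slide (retrieve all relevant documents while retrieving as few non-relevant ones as possible) is usually measured with recall and precision. The following is a minimal sketch of both measures; the document ids are hypothetical and only serve as an illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: fraction of retrieved documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
retrieved = ["d1", "d2", "d3", "d7"]
relevant = ["d1", "d2", "d3", "d4", "d5", "d6"]
p, r = precision_recall(retrieved, relevant)
```

Here precision is 3/4 and recall is 3/6: most of what was returned is useful, but half of the useful documents were missed.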
Web Mining and Information Extraction
 Information Extraction (IE): transforming a collection of documents into information that is more easily understood and analyzed.
 Building IE systems manually for the general Web is not feasible
   Most IE systems focus on specific Web sites or content to extract
Compare IR and IE
 IR aims to select relevant documents
   IE aims to extract the relevant facts from given documents
 IR views the text in a document just as a bag of unordered words
   IE is interested in the structure or representation of a document
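
The contrast between the two views can be sketched on a single sentence. This is an illustrative example only: the `matches` and `extract_facts` helpers and the hand-written pattern are assumptions made for this sketch, not methods from the survey.

```python
import re
from collections import Counter

doc = "Web Mining Research: A Survey was published in ACM SIGKDD, July 2000."

# IR view: the document is an unordered bag of words matched against a query.
bag = Counter(re.findall(r"[a-z0-9]+", doc.lower()))


def matches(query, bag):
    """True if every query term occurs in the bag (word order is ignored)."""
    return all(term in bag for term in query.lower().split())


# IE view: a hand-written, text-specific pattern pulls structured facts
# out of the document, so the position of each word matters.
PATTERN = re.compile(
    r"published in (?P<venue>[A-Z][A-Za-z ]+), (?P<month>[A-Z][a-z]+) (?P<year>\d{4})"
)


def extract_facts(text):
    """Return the extracted fields as a dict, or an empty dict if no match."""
    m = PATTERN.search(text)
    return m.groupdict() if m else {}


is_relevant = matches("survey mining web", bag)
facts = extract_facts(doc)
```

The IR side only answers whether the document is relevant to the query; the IE side returns the venue, month, and year as separate fields. The pattern works only for text phrased like this one, which mirrors the point that IE systems tend to be tied to specific sites or content.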
Web Mining and The Agent Paradigm
 Web mining is often viewed from or implemented within an agent paradigm.
   Web mining has a close relationship with intelligent agents.
 User interface agents
   Information retrieval agents, information filtering agents, and personal assistant agents.
 Distributed agents
   Concerned with problem solving by a group of agents.
   Distributed agents for knowledge discovery or data mining.
 Mobile agents
Web Mining and The Agent Paradigm (contd.)
 Two frequently used approaches for developing intelligent agents:
   Content-based approach
       The system searches for items that match the user's preferences, based on an analysis of the item content.
   Collaborative approach
       The system tries to find users with similar interests and uses them as the basis for recommendations.
       Analyzes the user profiles and sessions or transactions.
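
The collaborative approach can be sketched as a nearest-neighbour lookup over user profiles. This is a minimal illustration, assuming profiles are stored as topic-to-score dictionaries and that cosine similarity is used to compare them; the user names and scores are hypothetical.

```python
import math

# Hypothetical user profiles: topic -> interest score (e.g. visit counts).
profiles = {
    "alice": {"news": 5, "sports": 1, "tech": 4},
    "bob": {"news": 4, "tech": 5, "music": 2},
    "carol": {"sports": 5, "music": 4},
}


def cosine(u, v):
    """Cosine similarity between two sparse profiles."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def recommend(user, profiles):
    """Suggest topics liked by the most similar other user that `user` lacks."""
    others = [
        (cosine(profiles[user], p), name)
        for name, p in profiles.items()
        if name != user
    ]
    _, nearest = max(others)
    return nearest, sorted(set(profiles[nearest]) - set(profiles[user]))


nearest, items = recommend("alice", profiles)
```

A content-based agent would instead compare the user's profile against features of the items themselves, without looking at other users.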
Web Mining Categories
 Web Content Mining
 Discovering useful information from web page
contents/data/documents.
 Web Structure Mining
 Discovering the model underlying link structures (topology)
on the Web. E.g. discovering authorities and hubs
 Web Usage Mining
 Extraction of interesting knowledge from logging information
produced by web servers.
 Usage data from logs, user profiles, user sessions, cookies, user
queries, bookmarks, mouse clicks and scrolls, etc.
Web Content Data Structure
 Web content consists of several types of data
   Text, image, audio, video, hyperlinks.
 Unstructured – free text
 Semi-structured – HTML
 More structured – data in tables or database-generated HTML pages
Note: much of the Web content data is unstructured text data.
Web Content Mining: IR View
 Unstructured Documents
   Bag of words to represent unstructured documents
       Takes a single word as a feature
       Ignores the sequence in which words occur
   Features could be
       Boolean: a word either occurs or does not occur in a document
       Frequency based: the frequency of the word in a document
   Variations of the feature selection include
       Removing case, punctuation, infrequent words, and stop words
   Features can be reduced using different feature selection techniques:
       Information gain, mutual information, cross entropy.
   Stemming, which reduces words to their morphological roots.
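
The bag-of-words representation described on this slide can be sketched in a few lines. The stop-word list below is a tiny illustrative assumption (real lists are much longer), and stemming and the feature selection measures are left out to keep the example short.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption; real lists are longer).
STOP_WORDS = {"the", "is", "a", "and", "of", "to"}


def tokenize(text):
    """Lower-case the text, strip punctuation, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def frequency_features(text):
    """Frequency-based features: word -> number of occurrences."""
    return Counter(tokenize(text))


def boolean_features(text, vocabulary):
    """Boolean features: word -> whether it occurs in the document."""
    present = set(tokenize(text))
    return {word: word in present for word in vocabulary}


doc = "The Web is huge, diverse, and dynamic. The Web is a fertile area."
freq = frequency_features(doc)
flags = boolean_features(doc, ["web", "mining", "dynamic"])
```

Both representations discard word order: shuffling the words of `doc` would produce the same `freq` and `flags`.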
Web Content Mining: IR View
 Semi-Structured Documents
 Uses richer representations for features
 Due to the additional structural information in the hypertext
document (typically HTML and hyperlinks)
 Uses common data mining methods (whereas
unstructured might use more text mining methods)
 Applications:
   Hypertext classification or categorization and clustering
   Learning relations between web documents
   Learning extraction patterns or rules
   Finding patterns in semi-structured data
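
One way to obtain the richer features mentioned above is to read them from the HTML markup rather than from the plain text. The sketch below uses Python's standard `html.parser` module; the `HypertextFeatures` class and the sample page are hypothetical, and the choice of title, anchor text, and link targets as features is an assumption made for illustration.

```python
from html.parser import HTMLParser


class HypertextFeatures(HTMLParser):
    """Collect title text, anchor text, and link targets from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = []
        self.anchor_text = []
        self.links = []
        self._inside = None  # "title", "a", or None

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "a"):
            self._inside = tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == self._inside:
            self._inside = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._inside == "title":
            self.title.append(text)
        elif text and self._inside == "a":
            self.anchor_text.append(text)


# Hypothetical page used only for illustration.
page = """<html><head><title>Web Mining</title></head>
<body><p>See the <a href="survey.html">survey paper</a> and
<a href="slides.html">slides</a>.</p></body></html>"""

parser = HypertextFeatures()
parser.feed(page)
features = {
    "title": parser.title,
    "anchor_text": parser.anchor_text,
    "links": parser.links,
}
```

The resulting fields could then feed a classifier or clustering method alongside the ordinary bag-of-words features.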
Web Content Mining: DB View
 The database techniques on the Web are related to the
problems of managing and querying the information on the
Web.
 DB view tries to infer the structure of a Web site or
transform a Web site to become a database
 Better information management
 Better querying on the Web
 Can be achieved by:
   Finding the schema of Web documents
   Building a Web warehouse
   Building a Web knowledge base
   Building a virtual database
Web Content Mining: DB View
 The DB view mainly uses the Object Exchange Model (OEM)
   Represents semi-structured data by a labeled graph
   The data in the OEM is viewed as a graph, with objects as the vertices and labels on the edges
   Each object is identified by an object identifier (oid) and has a value that is either atomic or complex
 The process typically starts with manual selection of Web sites for doing Web content mining
 Main applications:
   Finding frequent substructures in semi-structured data
   Creating a multi-layered database
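
The labeled-graph idea behind OEM can be sketched as a small data structure. This is an illustration under stated assumptions, not the OEM specification: the `OEMObject` class, the `&n` style oids, and the `follow` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Atomic = Union[str, int, float]


@dataclass
class OEMObject:
    oid: str
    # Either an atomic value, or (for a complex object) a list of
    # (edge label, target oid) pairs pointing at other objects.
    value: Union[Atomic, List[Tuple[str, str]]]

    @property
    def is_atomic(self):
        return not isinstance(self.value, list)


# Hypothetical data describing one paper as a labeled graph.
objects: Dict[str, OEMObject] = {
    o.oid: o
    for o in [
        OEMObject("&1", [("title", "&2"), ("author", "&3"),
                         ("author", "&4"), ("year", "&5")]),
        OEMObject("&2", "Web Mining Research: A Survey"),
        OEMObject("&3", "Raymond Kosala"),
        OEMObject("&4", "Hendrik Blockeel"),
        OEMObject("&5", 2000),
    ]
}


def follow(oid, label):
    """Return the values of all children reached from `oid` via `label` edges."""
    obj = objects[oid]
    if obj.is_atomic:
        return []
    return [objects[target].value for edge, target in obj.value if edge == label]


authors = follow("&1", "author")
```

Because no schema is fixed in advance, an object may carry any number of edges with the same label (two `author` edges here), which is what makes the model suitable for semi-structured data.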
Web Structure Mining
 Interested in the structure of the hyperlinks within the Web
 Inspired by the study of social networks and citation analysis
   Can discover specific types of pages (such as hubs, authorities, etc.) based on the incoming and outgoing links.
 Applications:
   Discovering micro-communities in the Web
   Measuring the “completeness” of a Web site
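
Hubs and authorities are commonly computed with Kleinberg's HITS algorithm, which the slide does not name; the sketch below is a simplified version of that iteration on a hypothetical four-page link graph, with a fixed iteration count chosen as an assumption.

```python
import math


def hits(links, iterations=50):
    """Iteratively compute hub and authority scores for a link graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical link graph used only for illustration.
links = {
    "portal": ["paperA", "paperB"],
    "blog": ["paperA"],
    "paperA": [],
    "paperB": [],
}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this graph `paperA` receives the most incoming links from good hubs and `portal` points to the most authorities, so they come out as the top authority and top hub respectively.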
Web Usage Mining
 Tries to predict user behavior from interaction with the Web
 Wide range of data (logs)
   Web client data
   Proxy server data
   Web server data
 Two common approaches
   Map the usage data of the Web server into relational tables before applying an adapted data mining technique
   Use the log data directly by utilizing special pre-processing techniques
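
The first approach (mapping server usage data into relational tables) can be sketched with the standard `re` and `sqlite3` modules. The example assumes the server writes the Common Log Format; the log lines, the `hits` table, and its columns are hypothetical.

```python
import re
import sqlite3

# Common Log Format: host ident user [time] "method path protocol" status bytes
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical log lines for illustration.
log = """\
10.0.0.1 - - [06/Dec/2011:10:00:00 -0500] "GET /index.html HTTP/1.0" 200 1043
10.0.0.2 - - [06/Dec/2011:10:00:05 -0500] "GET /papers.html HTTP/1.0" 200 2311
10.0.0.1 - - [06/Dec/2011:10:01:10 -0500] "GET /papers.html HTTP/1.0" 200 2311
"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hits (host TEXT, time TEXT, path TEXT, status INTEGER)")
for line in log.splitlines():
    m = LOG_LINE.match(line)
    if m:  # skip malformed lines
        db.execute(
            "INSERT INTO hits VALUES (?, ?, ?, ?)",
            (m["host"], m["time"], m["path"], int(m["status"])),
        )

# Once the data is in a table, ordinary queries apply, e.g. requests per page.
page_counts = db.execute(
    "SELECT path, COUNT(*) FROM hits GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
```

The second approach would skip the table and run specialized pre-processing over the raw log lines instead.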
Web Usage Mining
 Typical problems:
   Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
 Usage mining often uses some background or domain knowledge
   E.g. site topology, Web content, etc.
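
A simple way to approximate server sessions is a time-out heuristic: requests from the same host belong to one session until the host has been idle for too long. The sketch below assumes a 30-minute time-out and uses the host address as the user identifier; both are assumptions, and the second one is exactly what caching and proxy servers make unreliable.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumption: a common heuristic value


def sessionize(requests):
    """Group (host, timestamp, path) requests into per-host sessions.

    A new session starts when a host has been idle longer than SESSION_TIMEOUT.
    """
    sessions = {}
    for host, ts, path in sorted(requests, key=lambda r: (r[0], r[1])):
        host_sessions = sessions.setdefault(host, [])
        # Compare with the timestamp of the last request in the last session.
        if host_sessions and ts - host_sessions[-1][-1][0] <= SESSION_TIMEOUT:
            host_sessions[-1].append((ts, path))
        else:
            host_sessions.append([(ts, path)])
    return sessions


# Hypothetical requests used only for illustration.
t0 = datetime(2011, 12, 6, 10, 0)
requests = [
    ("10.0.0.1", t0, "/index.html"),
    ("10.0.0.1", t0 + timedelta(minutes=5), "/papers.html"),
    ("10.0.0.1", t0 + timedelta(hours=2), "/index.html"),
    ("10.0.0.2", t0 + timedelta(minutes=1), "/index.html"),
]
sessions = sessionize(requests)
```

Several people behind one proxy would be merged into a single "user" here, and pages served from a browser cache would never appear in the log at all, which is why background knowledge such as site topology is often used to repair the sessions.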
Web Usage Mining
 Applications fall into two main categories:
   Learning a user profile (personalized)
       Web users would be interested in techniques that learn their needs and preferences automatically
   Learning user navigation patterns (impersonalized)
       Information providers would be interested in techniques that improve the effectiveness of their Web site
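
As a small illustration of the second category, navigation patterns can be approximated by counting how often one page is requested directly after another. The sessions below are hypothetical, and counting page-to-page transitions is only one simple choice among many pattern-mining techniques.

```python
from collections import Counter

# Hypothetical sessions: each is the ordered list of pages one visitor viewed.
sessions = [
    ["/index.html", "/papers.html", "/survey.pdf"],
    ["/index.html", "/papers.html"],
    ["/index.html", "/people.html"],
]


def transition_counts(sessions):
    """Count how often page B is requested directly after page A."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


counts = transition_counts(sessions)
most_common_step, times = counts.most_common(1)[0]
```

A site owner could use such counts to decide, for example, which pages deserve a direct link from the home page.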
Conclusions
 The paper surveys the research in the area of Web mining.
 Suggests three Web mining categories
   Content, Structure, and Usage Mining
   And then situates some of the research with respect to these categories
 Explores the connection between the Web mining categories and the related agent paradigm
Exam Question #1
 Question: Outline the main characteristics of
Web information.
 Answer: Web information is huge, diverse,
and dynamic.
Exam Question #2
 Question: Define Web Mining
 Answer: Web mining refers to the overall
process of discovering potentially useful and
previously unknown information or
knowledge from the Web data.
Exam Question #3
 Question: What are the three main areas of
interest for Web mining?
 Answer: (1) Web Content
(2) Web Structure
(3) Web Usage
Thank you!