Research paper: Web Mining Research: A survey
SIGKDD Explorations, June 2000. Volume 2, Issue 1
Authors: R. Kosala and H. Blockeel






Introduction
Web Mining
Web Content Mining
Web Structure Mining
Web Usage Mining
Conclusion


Introduction
The World Wide Web is a popular and interactive medium for disseminating information.
Information users may encounter four problems:
1. Finding relevant information
a. low precision b. low recall
2. Creating new knowledge out of the information available on the web
(a data-triggered process)
3. Personalizing the information
People differ in the content and presentation of information they prefer
4. Learning about consumers or individual users
Mass customization or even personalization






Web Mining
Definition: web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from web data.
Four subtasks:
1. Resource finding: retrieving the intended web documents
2. Information selection and pre-processing: selecting and pre-processing specific information
3. Generalization: discovering general patterns
4. Analysis: validation and/or interpretation of the mined patterns

Web Mining and Information Retrieval
Definition: IR is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible.
Goal: indexing and searching for useful documents
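
The IR definition above, together with the "low precision" and "low recall" problems from the introduction, can be made concrete with a small calculation. The following is a minimal sketch, not taken from the paper: the function name `precision_recall` and the document identifiers are hypothetical, and it assumes relevance judgments are available as a plain set.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents returned, three actually relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d5"]
precision, recall = precision_recall(retrieved_docs, relevant_docs)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents are found (recall 2/3). Retrieving "all relevant documents" pushes recall up, while retrieving "as few non-relevant documents as possible" pushes precision up.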

Web Mining and Information Extraction
IE has the goal of transforming a collection of documents into
information that is more readily digested and analyzed.

Compare IR and IE
a. aims
b. fields

Web Mining and the Agent Paradigm
Web mining is often viewed from, or implemented within, an agent paradigm
1. User interface agents
2. Distributed agents
3. Mobile agents
Two approaches used to develop intelligent agents
1. Content-based approach
2. Collaborative approach



Web Content Mining
Definition: discovering useful information from web page contents, data, and documents
Several types of data: text, image, audio, video, hyperlinks
Types of data structure:
1. Unstructured: free text
2. Semi-structured: HTML
3. More structured: data in tables or database-generated HTML pages

IR view:
Unstructured documents
a. Bag of words to represent unstructured documents
b. Features: Boolean or frequency-based
c. Variations of the feature selection
d. Features could be reduced using different feature selection techniques
Semi-structured documents
a. Uses richer representations for features
b. Uses common data mining methods
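
The bag-of-words representation with Boolean and frequency-based features can be sketched as follows. This is an illustrative sketch only: the tokenizer, the function names, and the document-frequency threshold used for feature reduction are assumptions, and the slides do not prescribe any particular feature selection technique.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric runs are good enough as "words".
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, boolean_vectors, frequency_vectors)."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    boolean, frequency = [], []
    for tokens in tokenized:
        counts = Counter(tokens)
        frequency.append([counts[term] for term in vocabulary])
        boolean.append([1 if counts[term] else 0 for term in vocabulary])
    return vocabulary, boolean, frequency


def select_by_document_frequency(vocabulary, boolean, min_df):
    # One simple (assumed) reduction: keep terms found in >= min_df documents.
    return [
        term
        for column, term in enumerate(vocabulary)
        if sum(row[column] for row in boolean) >= min_df
    ]


docs = ["Web mining mines the web", "Usage mining uses web logs"]
vocabulary, boolean, frequency = bag_of_words(docs)
kept = select_by_document_frequency(vocabulary, boolean, min_df=2)
```

Word order is discarded: each document becomes a vector over the shared vocabulary, holding either presence flags (Boolean) or counts (frequency-based).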

DB view:
The DB view tries to infer the structure of a web site or to transform a web site into a database
Methods:
a. Finding the schema of web documents
b. Building a web warehouse
c. Building a web knowledge base
d. Building a virtual database

Web Structure Mining
Interested in the structure of the hyperlinks within the web
Inspired by the study of social networks and citation analysis
Discovers specific types of pages based on their incoming and outgoing links
Applications:
a. discovering micro-communities on the web
b. measuring the completeness of a web site
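
One well-known way to characterize pages by their incoming and outgoing links is hub/authority scoring in the style of HITS. The slides do not name a specific algorithm, so the sketch below is only one possible instance of the idea; the link graph, the page names, and the iteration count are hypothetical.

```python
import math


def hits(links, iterations=20):
    """Iteratively score pages as hubs and authorities.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {page: 1.0 for page in pages}
    authority = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # A page's authority is the sum of hub scores of pages linking to it.
        authority = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                authority[target] += hub[source]
        # A page's hub score is the sum of authority scores it links to.
        hub = {
            page: sum(authority[target] for target in links.get(page, []))
            for page in pages
        }
        # Normalize both score vectors so the values stay bounded.
        for scores in (authority, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, authority


# Hypothetical four-page link graph.
links = {
    "portal": ["paper_a", "paper_b"],
    "blog": ["paper_a"],
    "paper_a": [],
    "paper_b": [],
}
hub, authority = hits(links)
best_authority = max(authority, key=authority.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph the page with the most incoming links from good hubs ("paper_a") gets the highest authority score, and the page pointing at the most authorities ("portal") gets the highest hub score.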





Web Usage Mining
Tries to predict user behavior from interactions with the web
Wide range of data
Two commonly used approaches:
a. Map the usage data of the web server into relational tables before an adapted data mining technique is applied
b. Use the log data directly by applying special pre-processing techniques
Problems:
Distinguishing among unique users, server sessions, and episodes in the presence of caching and proxy servers
Usage mining often uses some background or domain knowledge
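
The session identification problem and the idea of mapping usage data into table-like rows can be sketched as follows. This is a simplified illustration under stated assumptions: the log entries are hypothetical, the user is identified by IP address alone (which is exactly what proxy servers and caching make unreliable), and the 30-minute inactivity timeout is a common heuristic rather than something the slides specify.

```python
from datetime import datetime, timedelta


def sessionize(entries, timeout=timedelta(minutes=30)):
    """Group (user_key, timestamp, url) entries into per-user sessions.

    A new session starts when a user has been inactive longer than timeout.
    Each session is a row-like dict that could be loaded into a table.
    """
    sessions = []
    last_seen = {}  # user_key -> (index into sessions, last timestamp)
    for user_key, timestamp, url in sorted(entries, key=lambda e: e[1]):
        previous = last_seen.get(user_key)
        if previous is None or timestamp - previous[1] > timeout:
            sessions.append({"user": user_key, "start": timestamp, "urls": []})
            index = len(sessions) - 1
        else:
            index = previous[0]
        sessions[index]["urls"].append(url)
        last_seen[user_key] = (index, timestamp)
    return sessions


# Hypothetical, already-parsed server log entries.
log = [
    ("10.0.0.1", datetime(2000, 6, 1, 9, 0), "/index.html"),
    ("10.0.0.1", datetime(2000, 6, 1, 9, 5), "/papers.html"),
    ("10.0.0.2", datetime(2000, 6, 1, 9, 6), "/index.html"),
    ("10.0.0.1", datetime(2000, 6, 1, 10, 30), "/index.html"),
]
sessions = sessionize(log)
```

The first address produces two sessions because its last request arrives more than 30 minutes after the previous one. If several people shared that address through a proxy, they would be merged into the same sessions, which is the problem noted above.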
applications

Conclusion
Survey of research in the area of web mining
Three web mining categories: content mining, structure mining, and usage mining
Connection between the web mining categories and the related agent paradigm