Web Mining Research:
A Survey
By Raymond Kosala & Hendrik Blockeel,
Katholieke Universiteit Leuven, July 2000
Presented 4/22/2004 by Arifa Mannan
Overview
Introduction
Web Mining
Web Content Mining
Web Structure Mining
Web Usage Mining
Conclusions
Questions/Answers
4/22/2004
Arifa Mannan, University of Vermont
Introduction
The Web is huge, diverse, and dynamic, and thus raises
scalability, multimedia-data, and temporal issues, respectively.
As a result, we are drowning in information and facing
information overload. Information users can encounter
problems when interacting with the Web.
More Introduction
PROBLEMS:
Finding relevant information: many search results are
irrelevant, and search engines cannot index all the
information available on the Web.

Creating new knowledge out of the information available on
the Web: this presumes that we already have a collection of
Web data and want to extract potentially useful knowledge
from it.
More Introduction

Personalization of the information: This
problem is often associated with the type and
presentation of the information.

Learning about consumers or individual
users: This problem is about knowing what the
customers do and want.
More Introduction
Web mining techniques could be directly or indirectly used to
solve the information overload problems described before.
Directly: applying web mining techniques addresses the
problem itself.
Indirectly: web mining techniques are used as part of a bigger
application that addresses the problems mentioned before.
Web mining is not the only useful tool. Other useful techniques
come from:
Databases (DB)
Information Retrieval (IR)
Natural Language Processing (NLP)
The Web document community
Web Mining: Outline
Overview of Web Mining
Describe some confusion in use of the term “Web Mining”
Provide a Classification
Relate Classification to the agent paradigm
Describe some research in their respective categories
Web Mining: Overview
Web mining is the use of data mining techniques to automatically
discover and extract information from web documents and services.
We suggest decomposing Web mining into these subtasks:
1. Resource finding: retrieving the intended Web documents.
2. Information selection and pre-processing: automatically selecting
and pre-processing specific information from the retrieved Web
resources.
3. Generalization: automatically discovering general patterns at
individual Web sites as well as across multiple sites.
4. Analysis: validation and/or interpretation of the mined patterns.
We’ll call this pattern 1-2-3-4. As we’ll see later, 1-3-4 is sometimes used as well.
Web Mining: Confusion
Web mining is often associated with Information Retrieval
or Information Extraction, but it is different from both.
IR is the automatic retrieval of all relevant documents
while at the same time retrieving as few non-relevant ones
as possible. [views documents as bag-of-words]
IE has the goal of transforming a collection of documents,
usually with the help of an IR system, into information that
is more readily digested and analyzed. [interested in the
structure or representation of a document]
We argue that Web mining intersects with the application
of machine learning on the web.
Web Mining: Classification
Web mining is categorized into three areas of
interest based on which part of the web to
mine:
 Web content mining
 Web structure mining
 Web usage mining
Web Mining: Classification
Web content mining: describes the discovery
of useful information from Web
contents/data/documents.

Web content data: the data a Web page was designed to
convey to users. It consists of text, images, audio, and
video.
Web Mining: Classification
Web structure mining: tries to discover the model
underlying the link structures of the Web.
Web structure data: describes the organization of the
content. Intra-page structure information includes the
arrangement of various HTML or XML tags within a given
page. Inter-page structure information consists of the
hyperlinks connecting one page to another.
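As a rough illustration of the two kinds of structure data described above, the sketch below uses Python's standard html.parser module to count the tags inside a page (intra-page structure) and to collect its outgoing hyperlinks (inter-page structure). The sample page and the class name StructureExtractor are invented for this example and do not come from the survey.

```python
from collections import Counter
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Collects tag counts (intra-page) and href targets (inter-page)."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Every opening tag contributes to the intra-page structure profile.
        self.tag_counts[tag] += 1
        # Anchor tags with an href contribute to the inter-page structure.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page content, used only for this example.
SAMPLE_PAGE = """
<html><body>
<h1>Web Mining</h1>
<p>See <a href="http://example.org/content">content mining</a> and
<a href="http://example.org/usage">usage mining</a>.</p>
</body></html>
"""

extractor = StructureExtractor()
extractor.feed(SAMPLE_PAGE)
print(dict(extractor.tag_counts))
print(extractor.links)
```

The collected links are the raw material for the link-based algorithms discussed under Web structure mining.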
Web Mining: Classification
Web usage mining: tries to make sense of the data
generated by the Web surfer’s sessions or
behaviors.
 Web usage data are secondary data.
 Web usage data includes data from web server access
logs, proxy server logs, browser logs, registration data,
cookies, mouse clicks and scrolls, and any other data as
the results of interactions.
Web Mining: Classification
In practice, the three Web mining tasks could
be used in isolation or combined in an
application, especially in Web content and
structure mining since the Web documents
might also contain links.
Web Mining & the Agent Paradigm
Web mining is often viewed from or implemented within an
agent paradigm. Thus, web mining has a close relationship
with software agents or intelligent agents.
Two relevant types of software agents:
User interface agents: information retrieval agents,
information filtering agents, and personal assistant agents.
Distributed agents: distributed agents for knowledge
discovery or data mining.
Web Mining & the Agent Paradigm
Two common approaches to building intelligent agents:
Content-based approach: the system searches for items
that match the user's preferences, based on an analysis of
the items' content.

Collaborative approach: the system tries to find users
with similar interests and uses their preferences to make
recommendations.
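The collaborative approach can be sketched in a few lines. The example below is only one possible reading of it: it measures user similarity with cosine similarity over rating vectors and suggests pages that the most similar user has rated but the target user has not seen. The ratings, user names, and function names are all hypothetical.

```python
import math

# Hypothetical user-page ratings (1-5); names and values are illustrative only.
RATINGS = {
    "alice": {"page_a": 5, "page_b": 3, "page_c": 4},
    "bob": {"page_a": 4, "page_b": 3, "page_c": 5, "page_d": 4},
    "carol": {"page_a": 1, "page_b": 5, "page_e": 4},
}


def cosine_similarity(u, v):
    """Cosine similarity of two sparse rating vectors stored as dicts."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(user, ratings):
    """Suggest items rated by the most similar other user but unseen by `user`."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda o: cosine_similarity(ratings[user], ratings[o]))
    unseen = set(ratings[nearest]) - set(ratings[user])
    return nearest, sorted(unseen)


print(recommend("alice", RATINGS))
```

A content-based system would instead compare item descriptions against a profile of the user's own preferences, without consulting other users.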
Web Content Mining: IR view
The goal of Web content mining is to help users
find information or to filter information for
them.
Information retrieval view for unstructured
documents:
most of the research uses “bag of words” to represent
unstructured documents.
See the table that follows
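For readers unfamiliar with the term, a "bag of words" keeps only which words occur in a document and how often, discarding word order. The sketch below is a minimal version that assumes lowercase English text and a simple letters-only tokenizer; real systems typically add stop-word removal, stemming, or term weighting. The function name and sample sentence are made up for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into letter runs, and count each term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical document text, used only for this example.
doc = "Web mining applies data mining techniques to Web documents."
bag = bag_of_words(doc)
print(bag.most_common(2))
```

Two documents with the same words in a different order produce the same bag, which is exactly the information this representation gives up.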
[Table: Web content mining research, IR view. Only a column of years (1995 to 2000) is recoverable.]
Web Content Mining: IR view
Web Content Mining: DB View
Database techniques on the Web address the problem of
managing and querying the information on the Web.
Three classes of tasks: modeling and querying the Web,
information extraction and integration, and Web site
construction and restructuring.
The DB view tries to model the data on the Web and to integrate
it so that queries more sophisticated than keyword-based
search can be performed.
Research in this area mainly deals with semi-structured data.
Web Content Mining: DB view
Web Structure Mining
In Web structure mining we are interested in
the structure of the hyperlinks within the
Web itself. (inter-document structure)
 This line of research is inspired by the study
of social networks and citation analysis.
 We could discover specific types of pages
(hubs, authorities etc.) based on incoming
and outgoing links.
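To make the idea of hubs and authorities concrete, here is a small sketch in the spirit of the HITS algorithm: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to, repeated until the scores settle. It is a simplified illustration, not the published algorithm in full (it skips the query-dependent step that builds the subgraph), and the link graph is invented.

```python
def hits(graph, iterations=50):
    """Iteratively compute hub and authority scores for a directed graph.

    `graph` maps each page to the list of pages it links to.
    """
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auths = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                auths[dst] += hubs[src]
        # Hub score: sum of authority scores of the pages linked to.
        hubs = {p: sum(auths[d] for d in graph.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in auths.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hubs.values()) ** 0.5 or 1.0
        auths = {p: v / a_norm for p, v in auths.items()}
        hubs = {p: v / h_norm for p, v in hubs.items()}
    return hubs, auths


# Hypothetical link graph: two directory-like pages pointing at content pages.
LINKS = {
    "dir1": ["paper", "tutorial"],
    "dir2": ["paper", "tutorial", "blog"],
    "blog": ["paper"],
}
hubs, auths = hits(LINKS)
print(max(hubs, key=hubs.get), max(auths, key=auths.get))
```

In this toy graph the page with the most outgoing links to well-cited pages ends up as the strongest hub, and the most linked-to page as the strongest authority.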

Web Structure Mining
Web structure mining uses the hyperlink structure, applying
social network analysis to model the underlying link
structure of the Web.
Several algorithms have been proposed for this, such as HITS,
PageRank, and improved versions of HITS that use content
information and outlier filtering.
These are applied as a method to calculate the quality rank
or relevancy of each Web page.
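As an illustration of ranking pages by link structure, the sketch below computes PageRank by power iteration. It is a textbook-style simplification: the damping factor of 0.85 and the choice to spread the rank of pages without outgoing links evenly over all pages are common conventions assumed here, and the four-page site is invented.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict mapping page -> list of outlinks."""
    pages = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = graph.get(page, [])
            if targets:
                # Split this page's damped rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for dst in pages:
                    new_rank[dst] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page site used only to exercise the function.
LINKS = {
    "home": ["about", "docs"],
    "about": ["home"],
    "docs": ["home", "faq"],
    "faq": [],
}
ranks = pagerank(LINKS)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Unlike hub and authority scores, this produces a single query-independent score per page.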
Web Usage Mining
Web usage mining focuses on techniques that could predict
user behavior while the user interacts with the Web.
Two commonly used approaches: 1) mapping the usage data
of the Web server into relational tables before an adapted
data mining technique is applied; 2) using the log data
directly, with special preprocessing techniques.
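The first approach can be sketched as follows, assuming the server writes its access log in the widely used Common Log Format: each line is parsed with a regular expression and loaded into a relational table (an in-memory SQLite database here), after which ordinary queries or a mining tool can work on it. The log lines, table name, and column names are invented for this example.

```python
import re
import sqlite3

# Common Log Format pattern; the sample lines below are invented for illustration.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

SAMPLE_LOG = [
    '10.0.0.1 - - [22/Apr/2004:10:00:00 -0400] "GET /index.html HTTP/1.0" 200 1024',
    '10.0.0.1 - - [22/Apr/2004:10:00:05 -0400] "GET /papers.html HTTP/1.0" 200 2048',
    '10.0.0.2 - - [22/Apr/2004:10:01:00 -0400] "GET /index.html HTTP/1.0" 200 1024',
    '10.0.0.2 - - [22/Apr/2004:10:01:09 -0400] "GET /missing.html HTTP/1.0" 404 -',
]


def parse_line(line):
    """Return (host, time, method, path, status), or None if the line is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return (
        match["host"], match["time"], match["method"],
        match["path"], int(match["status"]),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE access (host TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
)
rows = [r for r in map(parse_line, SAMPLE_LOG) if r is not None]
conn.executemany("INSERT INTO access VALUES (?, ?, ?, ?, ?)", rows)

# Once in relational form, ordinary queries (or a mining tool) can be applied.
hits_per_page = conn.execute(
    "SELECT path, COUNT(*) FROM access WHERE status = 200 "
    "GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
print(hits_per_page)
```

The second approach would skip the table and work on the parsed log records directly, for example by grouping them into per-visitor sessions.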
Web Usage Mining
Applications of web usage mining fall into two
main categories: learning a user profile/user
modeling in adaptive interfaces [personalized] and
learning user navigation patterns [impersonalized]
 Web users would be interested in techniques that
could learn their information needs and
preferences, which is user modeling
 Information providers would be interested in
techniques that could improve the effectiveness of
the information on their websites.

Conclusions
We surveyed research in Web mining,
clarified some confusion in the use of the term Web mining,
suggested three Web mining categories and situated some
current research with respect to these categories, and
explored the connection between the Web mining categories
and the agent paradigm.
The Web presents new challenges to traditional data
mining algorithms that work on flat data. We have seen
that some traditional data mining algorithms have
been extended, and new algorithms have been used, to work
on Web data.
Questions and Answers
Q1. Outline the main characteristics of web
information.
Ans: Huge
Diverse
Dynamic
Questions and Answers
Q2. How can data mining techniques be used in web
information analysis? Give at least two examples.
Ans:
Classification: classifying server logs using a decision
tree or a Naïve Bayes classifier to discover the profiles of
users belonging to a particular class.
Clustering: Clustering can be used to group users exhibiting
similar browsing patterns.
Association Analysis: association analysis can be used to
relate pages that are most often referenced together in a
single server session.
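The association analysis answer can be illustrated with a small co-occurrence count. The sketch below assumes the log has already been split into sessions (each a set of requested pages) and reports the page pairs that appear together in at least a chosen fraction of sessions. It covers only the pair-counting step, not a full association-rule miner, and the sessions and the 0.5 support threshold are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: each is the set of pages requested in one server session.
SESSIONS = [
    {"/index.html", "/papers.html", "/contact.html"},
    {"/index.html", "/papers.html"},
    {"/index.html", "/courses.html"},
    {"/papers.html", "/index.html", "/courses.html"},
]


def frequent_pairs(sessions, min_support=0.5):
    """Return page pairs whose share of sessions is at least `min_support`."""
    counts = Counter()
    for session in sessions:
        # Sort so each pair is counted under one canonical ordering.
        for pair in combinations(sorted(session), 2):
            counts[pair] += 1
    total = len(sessions)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}


print(frequent_pairs(SESSIONS))
```

Pairs with high support are candidates for pages that a site might link together or present side by side.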
Questions and Answers
Q3. What are the three main areas of interest
for web mining?
Ans: Web content mining
Web structure mining
Web usage mining