Download Web Content Mining - TKS

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Web Usage Mining
WEB USAGE MINING
1
Contents
2











Web Mining
Web Mining Taxonomy
Web Usage Mining
Web analysis tools
Pattern Discovery Tools & it’s different stages
Pattern Analysis Tools & techniques employed
Web usage Mining Process
Web usage Mining Architecture
Research Directions
Conclusion
References
Web Usage Mining
Web Mining
3
Web mining - data mining techniques to
automatically discover and extract information
from Web documents/services .
 Web mining research – it integrate
information from several research communities
such as:
 Database (DB)
 Information retrieval (IR)
 The sub-areas of machine learning (ML)
 Natural language processing (NLP)

Web Usage Mining
Mining the World-Wide Web
4

WWW is a huge, widely distributed, global
information source for :
 Information
services: news, advertisements, consumer
information, financial management, education,
government, e-commerce, etc.
 Hyper-link information
 Access and usage information
 Web Site contents and Organization
Web Usage Mining
Challenges on WWW Interactions
5




Finding Relevant Information
Creating knowledge from Information available
Personalization of the information
Learning about customers / individual users
Web Mining can play an important Role!
Web Usage Mining
Web Mining Taxonomy
6
Web Mining
Web Content
Mining
Web Structure
Mining
Web Usage Mining
Web Usage
Mining
Web Usage Mining
7

Web usage mining also known as Web log mining
techniques to discover interesting usage
patterns from the secondary data derived from the
interactions of the users while surfing the web
 mining
Web Usage Mining





8
Organizations often generate and collect large volumes of data
in their daily operations while interacting with a web site.
Most of this information is usually generated automatically by
Web servers and collected in server access logs.
Other sources of user information include referrer logs which
contains information about the referring pages for each page
reference, and user registration or survey data gathered via
tools such as CGI scripts.
Analysis of server access logs and user registration data
provide valuable information on how to better structure a Web
site in order to create a more effective presence for the
organization.
Most of the existing Web analysis tools provide mechanisms for
reporting user activity in the servers and various forms of data
filtering.
Web Usage Mining
Web analysis tools:
9
Using these tool it is possible to determine the number of
accesses to the server and the individual files within the
organization's Web space, the times or time intervals of visits,
and domain names and the URLs of users of the Web server.


Pattern Discovery Tools
Pattern Analysis Tools
Web Usage Mining
Pattern Discovery Tools
10



The emerging tools for user pattern discovery that use sophisticated
techniques from AI, data mining, psychology, and information theory,
to mine for knowledge from collected data.
The WEBMINER system introduces a general architecture for Web
usage mining. WEBMINER automatically discovers association rules
and sequential patterns from server access logs.
Pirolli et. al. use information foraging theory to combine path
traversal patterns, Web page typing, and site topology information
to categorize pages for easier access by users.
Web Usage Mining
Pattern Analysis Tools
11




Once access patterns have been discovered,
analysts need the appropriate tools and techniques
to understand, visualize, and interpret these
patterns. Examples of such tools include ,
WebViz system
OLAP techniques such as data cubes for the
purpose of simplifying the analysis of usage
statistics from server access logs .
The WEBMINER system proposes an SQL-like query
mechanism for querying the discovered knowledge
Web Usage Mining
Pattern Discovery from Web Transactions
12

Preprocessing Tasks
 Data
Cleaning
 Transaction Identification

Discovery Techniques on Web Transactions
 Path
Analysis
 Association Rules
 Sequential Patterns
 Clustering and Classification
Web Usage Mining
Preprocessing Tasks
13

Data Cleaning:
Techniques to clean a server log to eliminate
irrelevant items. Elimination of irrelevant items
can be reasonably accomplished by checking the
suffix of the URL name. like, all log entries with
filename suffixes such as, gif, jpeg, GIF, JPEG,
jpg, JPG, and map can be removed.
Web Usage Mining
Transaction Identification
14



Here, sequences of page references are grouped into logical units
representing Web transactions or user sessions.
Two types of transactions are defined.
navigation-content
where each transaction consists of a single content reference and all
of the navigation references in the traversal path leading to the
content reference. These transactions can be used to mine for path
traversal patterns.

content-only
which consists of all of the content references for a given user
session. These transactions can be used to discover associations
between the content pages of a site.
Web Usage Mining
Discovery Techniques on Web Transactions
15

Path Analysis
Here a graph represents the physical layout of a Web
site, with Web pages as nodes and hypertext links
between pages as directed edges.
Other graphs could be formed based on the types of
Web pages with edges representing similarity between
pages, or creating edges that give the number of users
that go from one page to another.
Path analysis could be used to determine most
frequently visited paths in a Web site.
Web Usage Mining




16
Other examples of information that can be discovered
through path analysis are:
70% of clients who accessed
/company/products/file2.html did so by starting at
/company and proceeding through
/company/whatsnew, /company/products, and
/company/products/file1.html;
80% of clients who accessed the site started from
/company/products; or
65% of clients left the site after four or less page
references.
Web Usage Mining
Association Rules
17





This technique is generally applied to databases of transactions where each
transaction consists of a set of items.
the problem is to discover all associations and correlations among data items
Each transaction is comprised of a set of URLs accessed by a client in one visit
to the server. For example, using association rule discovery techniques we can
find correlations such as the following:
40% of clients who accessed the Web page with URL
/company/products/product1.html, also accessed
/company/products/product2.html; or
30% of clients who accessed /company/announcements/special-offer.html,
placed an online order in /company/products/product1.
Web Usage Mining
Contd..
18


Usually such transaction databases contain extremely large
amounts of data, current association rule discovery techniques try
to prune the search space according to support for items under
consideration. Support is a measure based on the number of
occurrences of user transactions within transaction logs.
Discovery of such rules for organizations engaged in electronic
commerce can help in the development of effective marketing
strategies.
Web Usage Mining
Sequential Patterns
19



The problem of discovering sequential patterns is to find
inter-transaction patterns such that the presence of a set of
items is followed by another item in the time-stamp ordered
transaction set.
By analyzing this information, the Web mining system can
determine temporal relationships among data items such as
the following:
30% of clients who visited /company/products/, had done
a search in Yahoo, within the past week on keyword w; or
60% of clients who placed an online order in
/company/products/product1.html, also placed an online
order in /company1/products/product4 within 15 days.
Web Usage Mining
Clustering and Classification
20




Discovering classification rules allows one to develop a profile of items
belonging to a particular group according to their common attributes. This
profile can then be used to classify new data items that are added to the
database such as the following:
clients from state or government agencies who visit the site tend to be
interested in the page /company/products/product1.html; or
50% of clients who placed an online order in /company/products/product2,
were in the 20-25 age group and lived on the West Coast.
Clustering analysis allows one to group together clients or data items that
have similar characteristics. Clustering of client information or data items on
Web transaction logs, can facilitate the development and execution of future
marketing strategies, both online and off-line.
Web Usage Mining
Analysis of Discovered Patterns
21




Web site administrators are extremely interested in
questions like "How are people using the site?", "Which
pages are being accessed most frequently?", etc. These
questions require the analysis of structure of hyperlinks as
well as the contents of the pages. The end products of
such analysis might include 1) the frequency of visits per
document, 2) most recent visit per document, 3) who is
visiting which documents, 4) frequency of use of each
hyperlink, and 5) most recent use of each hyperlink.
Visualization Techniques
OLAP Techniques
Data & Knowledge Querying
Web Usage Mining
Visualization Techniques
22



Visualization has been used very successfully in helping people understand
various kinds of phenomena, both real and abstract.
The WebViz system is used for visualizing WWW access patterns. WebViz
allows the analyst to selectively analyze the portion of the Web that is of
interest by filtering out the irrelevant portions. The Web is visualized as a
directed graph with cycles, where nodes are pages and edges are (interpage) hyperlinks.
The visualization is composed of two windows, the WebViz control window
and the display window . The first provides the analyst with controls to adjust
the bindings, select a specific time to view, control the animation, and
rearrange the layout. The second window's arrangement allows a
document's access frequency to be represented by the width of the node
representing it, while the node's color represents it recency of access.
Web Usage Mining
OLAP Techniques
23



On-Line Analytical Processing (OLAP) is emerging as a very
powerful paradigm for strategic analysis of databases in
business settings.
The key characteristics of strategic analysis include ,very
large data volume, explicit support for the temporal
dimension, support for various kinds of information
aggregation, and long-range analysis. This has led to the
development of the data cube information model , and
techniques for its efficient implementation .
Web usage data have much in common with those of a
data warehouse, and hence OLAP techniques are quite
applicable and the issue needs further investigation.
Web Usage Mining
Data & Knowledge Querying
24



One of the reasons attributed to the great success of
relational database technology has been the existence of a
high-level, declarative, query language, which allows an
application to express what conditions must be satisfied by
the data it needs, rather than having to specify how to get
the required data.
The main focus may be provided in at least two ways.
First, constraints may be placed on the database (perhaps
in a declarative language)
Second, querying may be performed on the knowledge
that has been extracted by the mining process. An SQL-like
querying mechanism has been proposed for the WEBMINER
system.
Web Usage Mining
Web usage Mining Process
25
Web Usage Mining
Web Usage Mining Architecture
26
Web Usage Mining
Research Directions
27





Web Usage Mining, which is just starting as an
area of research, has a number of open issues.
Following are some directions for future research:
Data Pre-Processing for Mining
The Mining Process
Analysis of Mined Knowledge
Web SIFT Example
Web Usage Mining
WebSIFT Example
28
Web Site Information Filter System (WebSIFT)
is a Web usage mining framework, that uses
the content and structure information from a
Web site, and identifies the interesting results
from mining usage data.
 Input of the mining process: server logs (access,
referrer, and agent), HTML files, optional data.
 Prototypical Web usage mining system.

Web Usage Mining
Conclusion
29



Web usage and data mining used for finding
patterns is a growing area with the growth of Webbased applications
Application of web usage data can be used to
better understand web usage, and apply this
specific knowledge to better serve users
Web usage patterns and data mining can be the
basis for a great deal of future research
Web Usage Mining
References:
30






Web Usage: Mining: Discovery and Applications of Usage Patterns from Web
Data - Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-N in
Tan Dept of CSE – University of Minnesota.
Web Mining: Pattern Discovery from World Wide Web Transaction
Web Mining Research: A Survey – Raymond Kosala, Hendrik Blockeel Dept
of CS Katholieke Universiteit LeuvenJ. Srivastava, R. Cooley, M. Deshpande,
Pang-Ning-tan.
Web Usage Mining: Discovery and Applications of Usage Patterns from
Web Data, SIGKDD Explorations, Vol. 1, Issue 2, 2000.
B. Mobasher, R. Cooley and J. Srivastava, Web Mining: Information and
Pattern Discovery on the World Wide Web, Proceedings of the 9th IEEE
International Conference on Tools with Artificial Intelligence (ICTAI'97),
November 1997.
www.wikipedia.org
Web Usage Mining
Web Usage Mining
THANK YOU…
31