Web Usage Mining:
Processes and Applications
Qiaoyuan Jiang
CSE 8331
November 24, 2003
Outline





Brief overview of Web mining
Web usage mining
Application areas of Web usage mining
Future research directions
Conclusions
Web Mining

Web Mining is the application of
data mining techniques to discover
and retrieve useful information and
patterns from the World Wide Web
documents and services [Etzioni,
1996].
Web Mining Categories



Web Content Mining: extracting knowledge from the content of the Web
Web Structure Mining: discovering the model underlying the link structures of the Web
Web Usage Mining: discovering users’ navigation patterns and predicting users’ behavior
Web Usage Mining Processes

Preprocessing: conversion of the raw data into the data abstractions (users, sessions, episodes, clickstreams, and pageviews) needed to apply the data mining algorithms.

Pattern Discovery: the key component of WUM, which draws together algorithms and techniques from data mining, machine learning, statistics, pattern recognition, and related research areas.

Pattern Analysis: validation and interpretation of the mined patterns
Web Usage Mining – Preprocessing

Data Cleaning: remove outliers and/or irrelevant data

User Identification: associate page references with different users

Session Identification: divide all pages accessed by a user into sessions

Path Completion: add important page access records that are missing in the access log due to browser and proxy server caching

Formatting: format the sessions according to the type of data mining to be accomplished.
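As a rough illustration of the session identification step, the sketch below splits each user's page requests into sessions whenever the gap between two consecutive requests exceeds a timeout. The 30-minute timeout, the `(user, timestamp, url)` record layout, and the name `sessionize` are assumptions made for this example; the slides do not prescribe a particular heuristic.

```python
from collections import defaultdict

TIMEOUT = 30 * 60  # assumed session timeout in seconds (not from the slides)

def sessionize(records, timeout=TIMEOUT):
    """Group (user, timestamp, url) records into per-user sessions.

    A new session starts when the gap between two consecutive requests
    by the same user is larger than `timeout` seconds.
    """
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's requests by time
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions

# Hypothetical cleaned log: (user id, seconds since start, requested page)
log = [
    ("u1", 0, "/home"),
    ("u1", 60, "/products"),
    ("u2", 100, "/home"),
    ("u1", 4000, "/home"),     # more than 30 minutes after the previous u1 hit
    ("u1", 4030, "/contact"),
]
result = sessionize(log)
```

Here `u1` ends up with two sessions and `u2` with one. Real logs would first need the data cleaning and user identification steps listed above.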
Web Usage Mining Pattern Discovery Tasks






Statistical Analysis
Clustering
Classification
Association Rules
Sequential Patterns
Dependency Modeling
Web Usage Mining Pattern Discovery Tasks (Cont.)

Statistical Analysis: frequency analysis, mean, median, etc.
- Improve system performance
- Provide support for marketing decisions
- Simplify the site modification task

Clustering:
- Clustering of users helps to discover groups of users with similar navigation patterns => provide personalized Web content
- Clustering of pages helps to discover groups of pages having related content => search engines
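The statistical analysis task above can be illustrated with a few lines of Python. This is only a sketch: the sample sessions are hypothetical, and page-hit frequency plus mean and median session length are just examples of the statistics an analyst might compute.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sessions: each is the ordered list of pages one visitor viewed
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/about"],
    ["/home", "/products", "/products/42", "/cart", "/checkout"],
]

# Frequency analysis: how often each page was viewed
page_hits = Counter(page for s in sessions for page in s)

# Descriptive statistics on session length (pages per session)
lengths = [len(s) for s in sessions]
most_viewed = page_hits.most_common(1)[0]
mean_length = mean(lengths)
median_length = median(lengths)
```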
Web Usage Mining Pattern Discovery Tasks (Cont.)

Classification: a technique that maps a data item into one of several predefined classes
- Develop profiles of users belonging to a particular class or category

Association Rules: discover correlations among pages accessed together by a client
- Help restructure the Web site
- Page prefetching
- Develop e-commerce marketing strategies
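To make the association rule task concrete, the sketch below computes support and confidence for single-page rules X -> Y over a handful of hypothetical transactions. The function names and thresholds are assumptions for illustration, and the brute-force enumeration stands in for a real miner such as Apriori, which the slides do not name.

```python
from itertools import combinations

# Hypothetical transactions: the set of pages each client accessed
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B"},
    {"C"},
    {"A"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every page in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def pair_rules(transactions, min_support, min_confidence):
    """Enumerate single-page rules X -> Y that meet both thresholds."""
    pages = sorted(set().union(*transactions))
    rules = []
    for x, y in combinations(pages, 2):
        for lhs, rhs in ((x, y), (y, x)):
            if (support({lhs, rhs}, transactions) >= min_support
                    and confidence({lhs}, {rhs}, transactions) >= min_confidence):
                rules.append((lhs, rhs))
    return rules

rules = pair_rules(transactions, min_support=0.4, min_confidence=0.9)
```

With these thresholds only the rule B -> A survives: every client who accessed B also accessed A, whereas A -> B holds for only three of the four clients who accessed A.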
Web Usage Mining Pattern Discovery Tasks (Cont.)

Sequential Patterns: extract frequently occurring intersession patterns such that the presence of a set of items is followed by another item in time order
- Predict future user visit patterns => placing ads or recommendations
- Page prefetching

Dependency Modeling: determine if there are any significant dependencies among the variables in the Web domain
- Predict future Web resource consumption
- Develop business strategies to increase sales
- Improve navigational convenience of users
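A much simplified view of sequential pattern discovery is sketched below: it counts ordered page pairs (a visited some time before b) and keeps those that occur in a minimum fraction of the visit sequences. Restricting patterns to pairs, the sample data, and the name `frequent_ordered_pairs` are all assumptions for this example; real sequential pattern miners handle longer patterns.

```python
from collections import Counter

# Hypothetical data: each list is one user's page visits in time order
sequences = [
    ["home", "catalog", "item", "cart"],
    ["home", "item", "cart"],
    ["catalog", "home", "item"],
]

def frequent_ordered_pairs(sequences, min_support):
    """Return pairs (a, b) where a is visited before b in at least
    `min_support` (a fraction) of the sequences."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                if a != b:
                    seen.add((a, b))
        counts.update(seen)  # count each pair at most once per sequence
    n = len(sequences)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

patterns = frequent_ordered_pairs(sequences, min_support=1.0)
```

Only the pair ("home", "item") appears in every sequence here, so it is the only pattern returned at a support of 100%.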
Web Usage Mining Pattern Analysis



Pattern Analysis is the final stage of WUM, which involves the validation and interpretation of the mined patterns
Validation: eliminate the irrelevant rules or patterns and extract the interesting rules or patterns from the output of the pattern discovery process
Interpretation: the output of mining algorithms is mainly in mathematical form and not suitable for direct human interpretation
Web Usage Mining Pattern Analysis Methodologies and Tools

Visualization: helps people understand both real and abstract concepts
- WebViz: the Web is visualized as a directed graph

Query mechanism: allows analysts to extract only relevant and useful patterns by specifying constraints
- WEBMINER

On-Line Analytical Processing (OLAP): enables analysts to perform ad hoc analysis of data in multiple dimensions for decision-making
- WebLogMiner
WEBMINER Query Example

Find all association rules (ARs) with a minimum support of 1% and a minimum confidence of 90%. The analyst is only interested in clients from the ".edu" domain and in data later than Nov. 1, 2003, with page accesses that start with URL A and contain B and C in that order:
SELECT association-rules(A*B*C*)
FROM log.data
WHERE date>=031101 AND domain="edu"
AND support = 1.0 AND confidence = 90.0
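The constraints in the query above can be mimicked in plain Python to show what they select. This is not WEBMINER's implementation or API; the record layout, the string dates in YYMMDD form, and the helper `matches_abc` are assumptions for illustration, and the support and confidence thresholds would be applied later by the rule miner itself.

```python
def matches_abc(pages, a="A", b="B", c="C"):
    """True if `pages` starts with `a` and later contains `b` and then `c`."""
    if not pages or pages[0] != a:
        return False
    try:
        i = pages.index(b, 1)      # first b after the starting page
        pages.index(c, i + 1)      # a c somewhere after that b
    except ValueError:
        return False
    return True

# Hypothetical records: (date as a YYMMDD string, client domain, pages accessed)
records = [
    ("031105", "edu", ["A", "X", "B", "C"]),
    ("031020", "edu", ["A", "B", "C"]),   # before Nov. 1, 2003
    ("031110", "com", ["A", "B", "C"]),   # not an .edu client
    ("031115", "edu", ["A", "C", "B"]),   # B and C in the wrong order
]

# Equal-length YYMMDD strings compare correctly as text
selected = [
    r for r in records
    if r[0] >= "031101" and r[1] == "edu" and matches_abc(r[2])
]
```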
Application Areas for
Web Usage Mining

Personalized: discover the preferences and needs of individual Web users in order to provide a personalized Web site for certain types of users

Impersonalized: examine general user navigation patterns in order to understand how general users use the site
- System Improvement
- Site Modification
- Business Intelligence
- Web Characterization
System Improvement



High performance of a Web application is expected since it directly affects user satisfaction
WUM provides a key to understanding Web traffic behavior
Applications
- Develop policies for Web caching, network transmission, load balancing, or data distribution
- Detect intrusion, fraud, and attempted break-ins to the system
Site Modification



Besides its content, the structure of a Web site is another crucial attribute for attracting users
WUM can provide detailed feedback on users’ navigation behavior, which can be used to redesign the Web site structure for users’ navigational convenience
Adaptive Web site project [Perkowitz & Etzioni, 1998-1999]
Business Intelligence



Information on how customers are using a Web site is critical for marketers of e-commerce businesses
WUM can support business process optimization and marketing decisions
Business intelligence includes personalization for C2B systems
Usage Characterization

Mining general usage patterns (not focused on any specific users or Web sites) helps in the study of how browsers are used and of users’ interaction with a browser interface.

It also makes it possible to look at the dynamics of the Web and how it is growing.
Personalization




Choosing among thousands of options is a challenge for Web users
Goal: provide users with dynamic content tailored to their individual interests
Form: recommend one or more items or pages to a user, based on the user’s profile and usage behavior, or on the patterns of past visitors who have similar profiles
Performance Measurement:
- Effectiveness: accuracy + coverage
- Scalability
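The slides do not define accuracy and coverage. One common reading, assumed here, treats accuracy as the fraction of recommended pages the user actually went on to visit (a precision-style measure) and coverage as the fraction of visited pages that were recommended (a recall-style measure). The function name and sample data are hypothetical.

```python
def precision_and_coverage(recommended, actually_visited):
    """Return (precision, coverage) for one set of recommendations.

    precision = hits / number of recommended pages
    coverage  = hits / number of pages the user actually visited
    """
    recommended = set(recommended)
    actually_visited = set(actually_visited)
    hits = recommended & actually_visited
    precision = len(hits) / len(recommended) if recommended else 0.0
    coverage = len(hits) / len(actually_visited) if actually_visited else 0.0
    return precision, coverage

# Hypothetical example: four pages recommended, three pages later visited
p, c = precision_and_coverage(["a", "b", "c", "d"], ["b", "d", "e"])
```

The two measures usually pull against each other: recommending more pages tends to raise coverage and lower precision.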
Applications of Personalization






Customizing access to information sources
Filtering news or e-mails
Recommendation services for the browsing
process
Tutoring systems
Search
More ...
3 phases of Personalization



Data preparation and transformation: data cleaning, filtering, transaction identification
Pattern discovery: discover usage patterns
Recommendation: generate personalized content for a user by matching the user’s current session against the discovered patterns (online process)
Personalization Techniques –
Collaborative Filtering (CF)

Pattern discovery: an online kNN algorithm is applied to user profiles in a given domain to match people who have the same taste.

Recommendation: pages or items that interest the k nearest neighbors are assumed to interest the active user as well.

Drawbacks:
- Online process => lack of scalability
- Static user profiles => low quality of recommendations
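A minimal sketch of kNN-based collaborative filtering follows. It assumes user profiles are plain sets of visited pages and uses Jaccard similarity to find neighbors; both choices, along with the names `jaccard` and `recommend_knn`, are assumptions for this example rather than details from the slides.

```python
def jaccard(a, b):
    """Set similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_knn(active, profiles, k=2):
    """Recommend items seen by the k most similar users that the
    active user has not seen yet, most frequently shared first."""
    neighbours = sorted(
        profiles, key=lambda u: jaccard(active, profiles[u]), reverse=True
    )[:k]
    scores = {}
    for u in neighbours:
        for item in profiles[u] - active:
            scores[item] = scores.get(item, 0) + 1
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical stored user profiles
profiles = {
    "ann": {"p1", "p2", "p3"},
    "bob": {"p1", "p2", "p4"},
    "cat": {"p7", "p8"},
    "dan": {"p2", "p3", "p4"},
}
active_user = {"p1", "p2"}
recs = recommend_knn(active_user, profiles, k=2)
```

Every recommendation requires comparing the active user against all stored profiles, which is the scalability drawback noted above.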
Personalization Techniques –
Clustering

Technique: clustering user transactions and pageviews.

Advantages:
- User preference is automatically learned from usage data and is therefore up-to-date.
- Better scalability through clustering

Drawbacks:
- Low accuracy
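The idea can be sketched with a very simple leader-style clustering of transactions. The slides do not name a clustering algorithm, so this choice, the Jaccard similarity, the 0.5 threshold, and all function names are assumptions made to keep the example short.

```python
def jaccard(a, b):
    """Set similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def leader_cluster(transactions, threshold=0.5):
    """Assign each transaction to the first cluster whose leader (its first
    member) is at least `threshold` similar; otherwise start a new cluster."""
    clusters = []
    for t in transactions:
        for cluster in clusters:
            if jaccard(t, cluster[0]) >= threshold:
                cluster.append(t)
                break
        else:
            clusters.append([t])
    return clusters

def recommend_from_cluster(active, clusters):
    """Pick the cluster whose leader best matches the active session and
    suggest its pages that the active user has not visited."""
    best = max(clusters, key=lambda c: jaccard(active, c[0]))
    pages = set().union(*best)
    return sorted(pages - active)

# Hypothetical user transactions (sets of pages)
transactions = [
    {"home", "news", "sports"},
    {"home", "news", "weather"},
    {"shop", "cart", "checkout"},
    {"shop", "cart"},
]
clusters = leader_cluster(transactions)
suggestions = recommend_from_cluster({"shop"}, clusters)
```

At recommendation time the active session is compared against a few clusters instead of every user, which is where the scalability gain comes from.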
Personalization Techniques –
Association Rules (ARs)

Technique:
- For each user, create a transaction that contains all the items the user has ever accessed.
- Find all rules that satisfy the given support and confidence.
- For each active user, find all the rules supported by the user. Items predicted by these rules are the candidate recommendations.

Drawbacks:
- All association rules must be discovered prior to generating recommendations. This can be improved by generating ARs in real time from a subset of transactions within the active user’s neighborhood.
- High support => better scalability and accuracy, but low coverage.
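The recommendation step of the technique above might look like the following sketch. It assumes the rules have already been mined and are stored as `(antecedent, consequent, confidence)` tuples; that layout, the sample rules, and the name `recommend` are hypothetical.

```python
# Hypothetical mined rules: (antecedent pages, consequent page, confidence)
rules = [
    (frozenset({"A"}), "B", 0.9),
    (frozenset({"A", "B"}), "C", 0.8),
    (frozenset({"D"}), "E", 0.95),
]

def recommend(active_pages, rules):
    """Return pages predicted by rules whose antecedent the active user
    has fully visited, ordered by the best confidence per page."""
    active = set(active_pages)
    best = {}
    for antecedent, consequent, conf in rules:
        if antecedent <= active and consequent not in active:
            best[consequent] = max(conf, best.get(consequent, 0.0))
    return sorted(best, key=lambda page: -best[page])

candidates = recommend(["A", "B"], rules)
```

A user who has visited A and B supports the first two rules, but B is already visited, so only C is recommended.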
Personalization Techniques –
Sequential Patterns (SPs)


Technique: Markov Model

Advantages:
- Better accuracy: SPs contain more precise information about user navigation behavior.

Drawbacks:
- Low recommendation coverage
- More suitable for predictive tasks, e.g., Web prefetching
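A first-order Markov model is one simple instance of the technique named above: the next page is predicted from the current page alone, using transition frequencies estimated from past sessions. The order of the model, the sample sessions, and the function names are assumptions for this sketch.

```python
from collections import Counter, defaultdict

def fit_markov(sessions):
    """Count page-to-page transitions and turn them into probabilities."""
    counts = defaultdict(Counter)
    for s in sessions:
        for cur, nxt in zip(s, s[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for cur, nexts in counts.items()
    }

def predict_next(model, page):
    """Most probable next page, or None if no transition from `page` was seen."""
    nexts = model.get(page)
    return max(nexts, key=nexts.get) if nexts else None

# Hypothetical training sessions
sessions = [
    ["home", "catalog", "item"],
    ["home", "catalog", "cart"],
    ["home", "search", "item"],
    ["catalog", "item"],
]
model = fit_markov(sessions)
next_page = predict_next(model, "home")
```

Pages with no observed outgoing transition yield no prediction, which illustrates the low coverage listed among the drawbacks.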
Personalization Techniques –
Hybrid Models

Hybrid Models automatically switch among different personalization models based on the localized degree of hyperlink connectivity.
- High connectivity degree => non-SP models
- Low connectivity degree and deeper navigation paths => SP models

Performance: better than any individual model
Future Research Directions

Usage Mining on the Semantic Web
- Helps to build the Semantic Web
- With the Semantic Web, WUM can be improved

Multimedia Web Data Mining
- Representation, problem solving, and learning from multimedia data is indeed a challenge
Future Research Directions
(Cont.)

Soft Computing Technology for Web Mining

Fuzzy logic: deals with imprecise and conceptual data. Used in clustering Web log data and mining ARs.

Neural network:
- Adaptive to new data and information
- Suitable for parallel processing
- Robust to missing, confusing, or ill-defined data
- Capable of modeling non-linear decision boundaries
- Effective for learning user profiles

Genetic algorithm: randomized search and optimization guided by evaluation criteria.
- Efficient, adaptive, robust, parallel processing
- Used in search and query optimization and in predicting user preferences
Future Research Directions
(Cont.)

Analysis of Discovered Patterns


Research on efficient, flexible and powerful
analysis tools
More Applications




Temporal evolutions of usage behavior
Improving Web services
Detect credit card fraud
Privacy issues
Conclusions