September, 13th
gR2002, Vienna
PAOLO GIUDICI
Faculty of Economics, University of Pavia
Research carried out within the laboratory:
Statistical models for data mining (SMDM)
[email protected]
A small sample of web
clickstream data
(from a logfile)
C, “10908”, 10908
V, 1108
V, 1017
C, “10909”, 10909
V, 1113
V, 1009
V, 1034
C, “10910”, 10910
V, 1026
V, 1017
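As a minimal sketch, the sample above can be read into one ordered page list per visitor. It assumes (the slide does not say so explicitly) that a `C` line opens a new visitor and that each following `V` line is a page code visited by that visitor, in order. The function name `parse_clickstream` is illustrative.

```python
SAMPLE = """\
C, "10908", 10908
V, 1108
V, 1017
C, "10909", 10909
V, 1113
V, 1009
V, 1034
C, "10910", 10910
V, 1026
V, 1017
"""

def parse_clickstream(text):
    """Group page codes by visitor id, preserving the order of visits.

    Assumption: 'C' starts a new visitor, 'V' is a page visited by the
    most recent visitor.
    """
    sessions = {}
    current = None
    for line in text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if fields[0] == "C":
            current = fields[-1]          # last field is the unquoted id
            sessions[current] = []
        elif fields[0] == "V" and current is not None:
            sessions[current].append(fields[1])
    return sessions

sessions = parse_clickstream(SAMPLE)
for visitor, pages in sessions.items():
    print(visitor, pages)
```

The result is already in transactional form: one ordered list of pages per visitor.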
Analysis of web clickstream data
1. In data matrix form (Giudici and Castelo, 2001; Blanc
and Giudici, 2001):
- Association measures
- Association models (graphical association models)
2. In transactional data form (in this talk)
- Association and sequence rules
- Statistical models for sequences
Association measures and models
Based on data arranged in contingency table form
FOR INSTANCE:
Odds ratios
Graphical loglinear models
Recursive logistic regression models
For a review, see Giudici, Applied data mining, Wiley, 2003
Association and sequence rules
Implemented in the main data mining software packages
Based on transactional databases
Such databases arise for instance in
- Market basket analysis (order does not matter)
- Web clickstream analysis (order matters)
Aim: search for itemsets (groups of events) that occur
simultaneously with a high frequency
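The search for frequent itemsets can be sketched as a level-wise procedure in the spirit of Apriori. This is a simplified illustration, not the published algorithm's implementation: the baskets, the threshold and the name `frequent_itemsets` are all made up for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    candidates = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    k = 1
    while candidates:
        # Keep the candidates of size k that are frequent enough
        level = {}
        for c in candidates:
            share = sum(c <= b for b in baskets) / n
            if share >= min_support:
                level[c] = share
        frequent.update(level)
        k += 1
        # Join frequent sets into candidates of size k ...
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # ... and prune those having an infrequent (k-1)-subset
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# Hypothetical market baskets (order does not matter)
baskets = [
    {"Milk", "Coffee", "Bread"},
    {"Milk", "Coffee", "Bread", "Biscuits"},
    {"Milk", "Bread"},
    {"Coffee", "Biscuits"},
    {"Milk", "Coffee", "Bread", "Biscuits"},
]
freq = frequent_itemsets(baskets, min_support=0.6)
for itemset, share in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), share)
```

The pruning step relies on the fact that an itemset cannot be frequent unless all of its subsets are frequent.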
Formally:
• $A_1, \ldots, A_p$: $p$ binary random variables. Itemset: a logical
expression such as $A = (A_{j_1} = 1, \ldots, A_{j_k} = 1)$, $k < p$.
Association rule: a logical relationship between two itemsets,
e.g. if $A$, then $B$: $A \rightarrow B$
Example: A = (Milk, Coffee), B = (Bread, Biscuits)
Sequence rule: the relationship is determined by a temporal
order.
Example: A = (Home, Register), B = (P_info)
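Interestingness of a rule
• Support: $\text{support}\{A \rightarrow B\} = \dfrac{N_{A \rightarrow B}}{N}$
• Confidence: $\text{confidence}\{A \rightarrow B\} = \dfrac{N_{A \rightarrow B}}{N_A} = \dfrac{\text{support}\{A \rightarrow B\}}{\text{support}\{A\}}$
• Lift: $\text{lift}\{A \rightarrow B\} = \dfrac{\text{confidence}\{A \rightarrow B\}}{\text{support}\{B\}}$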
Apriori search algorithm (Agrawal et al., 1995):
the search is based on the support.
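The three measures can be computed directly from a list of transactions. The sketch below uses the rule (Milk, Coffee) → (Bread, Biscuits) from the earlier example. The four baskets are invented, and a transaction is assumed to support A → B when it contains every item of both A and B.

```python
def support(itemset, transactions):
    """Share of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(a, b, transactions):
    """support{A -> B} / support{A}."""
    return support(set(a) | set(b), transactions) / support(a, transactions)

def lift(a, b, transactions):
    """confidence{A -> B} / support{B}."""
    return confidence(a, b, transactions) / support(b, transactions)

# Hypothetical baskets
transactions = [
    {"Milk", "Coffee", "Bread", "Biscuits"},
    {"Milk", "Coffee", "Bread"},
    {"Milk", "Coffee", "Bread", "Biscuits"},
    {"Bread", "Biscuits"},
]
A = {"Milk", "Coffee"}
B = {"Bread", "Biscuits"}

sup = support(A | B, transactions)
conf = confidence(A, B, transactions)
lft = lift(A, B, transactions)
print(f"support={sup:.3f} confidence={conf:.3f} lift={lft:.3f}")
```

A lift below 1, as in this toy data, indicates that B is less frequent among the transactions containing A than in the database as a whole.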
Application to real data
Data set from a logfile of an e-commerce site, kindly supplied by SAS.
Contains the user id (C_VALUE), the time of connection (C_TIME) and
the page viewed (C_CALLER).
Number of clicks: 21889; Number of visitors (sessions): 1240.
Exploratory step
(data selected from a cluster of visitors, N. 3)
Cluster means and overall means:
Cluster   N.obs   CLICKS   LENGTH   start   %PURCH
1         8802    8        6 min    h. 18   0.034
2         2859    22       17 min   h. 15   0.241
3         1240    18       59 min   h. 13   0.194
4         9251    8        6 min    h. 10   0.039
Overall           10       10 min   h. 14   0.072
Remark
Data could have been transformed from transactional to data matrix format.
Doing so, information on the order of the visited pages would have been lost.
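The loss of order can be seen in a small sketch. It assumes the data matrix has one row per visitor and one binary column per page (1 if the page was visited at least once). The sessions and the name `to_data_matrix` are hypothetical.

```python
def to_data_matrix(sessions):
    """Turn ordered page lists into binary visitor-by-page rows."""
    pages = sorted({p for visits in sessions.values() for p in visits})
    rows = {visitor: [1 if p in visits else 0 for p in pages]
            for visitor, visits in sessions.items()}
    return pages, rows

# Hypothetical sessions: u1 and u2 visit the same pages in opposite order
sessions = {
    "u1": ["home", "product", "addcart"],
    "u2": ["addcart", "product", "home"],
    "u3": ["home", "program"],
}
pages, rows = to_data_matrix(sessions)
print(pages)
for visitor, row in rows.items():
    print(visitor, row)
```

Visitors u1 and u2 end up with identical rows, so any analysis based on the matrix cannot distinguish their paths.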
Data matrix format for the considered data:
Application of the Apriori algorithm
Most frequent indirect sequences of order 2
Most frequent indirect sequences of any order
Proposal: direct sequences
• Only consecutive (“subsequent”) visits are considered
• We have inserted two fictitious (deterministic) pages:
start_session and end_session
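A minimal sketch of direct sequences of order 2 follows. It pads each session with the two fictitious pages and counts only consecutive page pairs. The support of a pair is taken here to be the share of sessions in which it occurs at least once, which is an assumption of this example, and the sessions are invented.

```python
from collections import Counter

def direct_sequence_support(sessions):
    """Support of each consecutive page pair, counted once per session."""
    pair_counts = Counter()
    for visits in sessions:
        path = ["start_session"] + list(visits) + ["end_session"]
        # set(): a pair repeated within one session is counted once
        pair_counts.update(set(zip(path, path[1:])))
    n = len(sessions)
    return {pair: count / n for pair, count in pair_counts.items()}

# Hypothetical sessions (ordered page visits)
sessions = [
    ["home", "program", "product"],
    ["home", "product"],
    ["program", "home", "program"],
    ["home"],
]
sup = direct_sequence_support(sessions)
for pair, value in sorted(sup.items(), key=lambda kv: -kv[1]):
    print(pair, value)
```

In the first session home precedes product only indirectly (program lies in between), so the direct sequence home → product is supported by the second session alone.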
Most frequent direct sequences of order 2
Towards a global model:
graphical representation of direct
association rules
Link analysis representation
Global models for web mining
Sequence rules are an instance of a local model (or pattern; see Hand et al., 2001) of
data mining.
A local model draws statistical conclusions on parts of the dataset, rather than on the
whole.
Link analysis is an example of a global descriptive model.
We have considered two global inferential models:
- probabilistic expert systems
- Markov chains
Probabilistic expert systems
Graphical models that describe (recursive) dependencies
between (binary) random variables.
They can be described by a directed conditional independence graph, which
specifies the factorisation of the joint probability distribution.
They ARE NOT directly comparable with sequence rules, which are local
indexes for studying dependencies between events (itemsets).
They are built from contingency table data, and thus DO NOT model the order
in which pages are visited.
Probabilistic expert systems:
structural learning
Probabilistic expert systems:
quantitative learning
Markov Chains for web mining
Ideal for modelling dependencies between events. The order of the
chain parallels the order of a sequence rule.
Data have been structured in the following form:
Results from Markov chains
(entrance to the site: start session)
Exit from the site
(end session)
Most likely paths
[Figure: most likely path through the nodes Start_session, Home, Program, Product and P_info, with edge labels 17,80%, 45,81%, 70,18% and 26,73%]
Markov chains ARE DIRECTLY comparable with direct sequence rules.
E.g. for the most likely path:
from start_session, the highest confidence is with home (45,81%), then program
(20,39%), product (78,09%) and addcart (28,79%).
There are small differences, because the Apriori algorithm considers
only rules with support higher than a fixed threshold (e.g. 5%).
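As a rough sketch of the comparison, a first-order Markov chain can be estimated from sessions padded with start_session and end_session, taking relative transition frequencies as the transition probabilities. The estimated probability of moving from page i to page j then plays the role of the confidence of the direct rule i → j. The sessions, the function names and the greedy path-following rule are assumptions of this example, not the procedure used in the talk.

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Relative-frequency estimate of P(next page | current page)."""
    counts = defaultdict(Counter)
    for visits in sessions:
        path = ["start_session"] + list(visits) + ["end_session"]
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {src: {dst: c / sum(row.values()) for dst, c in row.items()}
            for src, row in counts.items()}

def most_likely_path(probs, start="start_session", max_steps=10):
    """Follow the highest-probability transition at each step (greedy)."""
    path = [start]
    # Stops at a state with no outgoing transitions (end_session)
    # or after max_steps moves, which guards against cycles.
    while path[-1] in probs and len(path) <= max_steps:
        row = probs[path[-1]]
        path.append(max(row, key=row.get))
    return path

# Hypothetical sessions
sessions = [
    ["home", "program", "product"],
    ["home", "program"],
    ["home", "product"],
    ["program"],
]
probs = transition_probabilities(sessions)
path = most_likely_path(probs)
print(probs["start_session"])
print(path)
```

Unlike the Apriori search, this estimate uses every observed transition, with no support threshold.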
Essential references
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A.I. (1995) Fast
discovery of association rules, in: Advances in knowledge discovery and data
mining, AAAI/MIT Press, Cambridge.
Giudici, P. (2003) Applied data mining. Wiley, London.
Giudici, P. and Castelo, R. (2001) Association models for web mining. Journal
of Knowledge discovery and data mining, 5, pp. 183-196.
Hastie, T., Tibshirani, R. and Friedman, J. (2001) The elements of
statistical learning: data mining, inference and prediction. Springer-Verlag.
Hand, D.J., Mannila, H. and Smyth, P. (2001) Principles of data mining. MIT
Press, New York.
THANKS FOR THE ATTENTION!
Comments to:
[email protected]
www.baystat.it/giudici/index.htm