A New Reactive Method for Processing Web Usage Data
Murat Ali Bayır
Middle East Technical University
Department of Computer Engineering
Ankara, Turkey
OUTLINE
Web Mining
Previous Session Reconstruction Heuristics
Smart-SRA
Agent Simulator
Experimental Results
Conclusion
Murat Ali Bayir, June 06
Data & Web Mining
Data Mining: Discovery of useful and interesting patterns
from a large dataset.
Web mining: The application of data mining techniques to
discover and retrieve useful information and patterns from
World Wide Web documents and services.
Dimensions:
– Web content mining
– Web structure mining
– Web usage mining
Web Mining
Web Usage Mining (WUM)
Application of data mining techniques to web log data in
order to discover user access patterns.
Example User Web Access Log
IP Address      Request Time                Method  URL     Protocol  Return Code  Bytes Transmitted
144.123.121.23  [25/Apr/2005:03:04:41-05]   GET     A.html  HTTP/1.0  200          3290
144.123.121.23  [25/Apr/2005:03:04:43-05]   GET     B.html  HTTP/1.0  200          2050
144.123.121.23  [25/Apr/2005:03:04:48-05]   GET     C.html  HTTP/1.0  200          4130
Such a log captures the information needed for WUM.
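As a rough illustration of how such log records feed WUM, the sketch below groups the three example requests by IP address and orders them by request time. The tuple layout, the function name requests_by_ip, the use of the IP address as a stand-in for the user, and the choice to keep only requests with return code 200 are all assumptions made for this example, not part of the original slides. The time zone suffix of the request time is dropped for simplicity.

```python
from collections import defaultdict
from datetime import datetime

# Rows copied from the example log; the time zone suffix is omitted (assumption).
LOG_ROWS = [
    ("144.123.121.23", "25/Apr/2005:03:04:41", "GET", "A.html", "HTTP/1.0", 200, 3290),
    ("144.123.121.23", "25/Apr/2005:03:04:43", "GET", "B.html", "HTTP/1.0", 200, 2050),
    ("144.123.121.23", "25/Apr/2005:03:04:48", "GET", "C.html", "HTTP/1.0", 200, 4130),
]


def requests_by_ip(rows):
    """Return {ip: [url, ...]} with URLs ordered by request time."""
    grouped = defaultdict(list)
    for ip, when, _method, url, _protocol, status, _size in rows:
        if status != 200:  # assumption: only successful requests are kept
            continue
        timestamp = datetime.strptime(when, "%d/%b/%Y:%H:%M:%S")
        grouped[ip].append((timestamp, url))
    return {ip: [url for _, url in sorted(reqs)] for ip, reqs in grouped.items()}


print(requests_by_ip(LOG_ROWS))
```

The per-IP request sequence produced here is the kind of input that the session reconstruction heuristics operate on.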
Web Mining
Phases of Web Usage Mining
1. Data Processing
– Reconstructs user sessions by using heuristic techniques. This is
the most important phase, since it directly affects the quality of
the frequent patterns extracted in the final step.
2. Pattern Discovery
– Discovers useful patterns from the reconstructed sessions
obtained in the first phase. Related work on the pattern
discovery phase: [Bayir 06-1].
Previous Session Reconstruction Heuristics
Session Reconstruction
Selects and groups the requests belonging to
the same user by using heuristic techniques.
Types:
– Reactive strategies process requests after they are
handled by the web server; they process web server
logs to obtain sessions. The approach proposed in this
thesis is reactive.
– Proactive strategies process requests during the
interactive browsing of the web site by the user.
Session data is gathered during the user's interaction.
They are applied on dynamic server pages.
Previous Reactive Heuristics
Session Reconstruction
Proactive strategies require changes to the internal structure of
the web site, for example changes to the source code of each
dynamic web page.
Reactive strategies require no changes and are used for web
analytics purposes: customers provide the web logs of their web
site, which are then analyzed by using these methods. Reactive
methods are applicable to all web sites that use the same
log format.
Previous Reactive Heuristics
Two types of reactive heuristics have been defined previously:
Time-oriented heuristics [Spiliopoulou 98, Cooley 99-1]
Navigation-oriented heuristic [Cooley 99-1, Cooley 99-2]
Smart-SRA [Bayir 06-2] is the new approach
proposed in this thesis. It combines these
heuristics with web topology information in order
to increase the accuracy of the reconstructed
sessions.
Previous Reactive Heuristics
Example web topology graph used for applying the heuristics
The topology of a web site can be represented by a directed web graph.
The topology information can be extracted by using the crawling module of
search engine APIs.
[Figure: directed web graph over pages P1, P13, P20, P23, P34, P49]
Example Web Page Request Sequence
Page       P1   P20  P13  P49  P34  P23
Timestamp  0    6    15   29   32   47
Previous Session Reconstruction Heuristics
Two types of time-oriented heuristics have been defined.
Time-oriented heuristic 1
The total duration of a discovered session is limited by a
threshold δ1.
Example:
Page       P1   P20  P13  P49  P34  P23
Timestamp  0    6    15   29   32   47
Time threshold (δ1 = 30 mins):
1. [P1, P20, P13, P49] (t(P49) - t(P1) = 29 < 30)
2. [P34, P23] (t(P23) - t(P34) = 15 < 30)
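The first time-oriented heuristic can be sketched in a few lines. This is a minimal illustration, assuming the request sequence is already sorted by timestamp and given as (page, timestamp) pairs; the function name split_by_total_duration and the strict "<" comparison (taken from the example above) are choices made for this sketch.

```python
def split_by_total_duration(requests, max_duration):
    """Cut a request sequence whenever the session would last max_duration or longer.

    requests: list of (page, timestamp) pairs sorted by timestamp.
    """
    sessions = []
    current = []
    for page, timestamp in requests:
        # Compare against the first request of the current session.
        if current and timestamp - current[0][1] >= max_duration:
            sessions.append(current)
            current = []
        current.append((page, timestamp))
    if current:
        sessions.append(current)
    return [[page for page, _ in session] for session in sessions]


REQUESTS = [("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47)]
print(split_by_total_duration(REQUESTS, 30))
# [['P1', 'P20', 'P13', 'P49'], ['P34', 'P23']]
```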
Previous Session Reconstruction Heuristics
Time-oriented heuristic 2
The time spent on any page is limited by a threshold δ2.
That means t(P_{n+1}) - t(P_n) < δ2
Example:
Page       P1   P20  P13  P49  P34  P23
Timestamp  0    6    15   29   32   47
Time threshold (δ2 = 10 mins):
1. [P1, P20, P13]
2. [P49, P34]
3. [P23]
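A corresponding sketch for the second time-oriented heuristic follows. It assumes the same (page, timestamp) input as before; the function name split_by_page_stay is hypothetical, and the gap between two consecutive requests is used as the page stay time.

```python
def split_by_page_stay(requests, max_stay):
    """Cut a request sequence whenever the gap between two consecutive requests
    reaches max_stay.

    requests: list of (page, timestamp) pairs sorted by timestamp.
    """
    sessions = []
    current = []
    for page, timestamp in requests:
        # Compare against the most recent request of the current session.
        if current and timestamp - current[-1][1] >= max_stay:
            sessions.append(current)
            current = []
        current.append((page, timestamp))
    if current:
        sessions.append(current)
    return [[page for page, _ in session] for session in sessions]


REQUESTS = [("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47)]
print(split_by_page_stay(REQUESTS, 10))
# [['P1', 'P20', 'P13'], ['P49', 'P34'], ['P23']]
```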
Previous Session Reconstruction Heuristics
Navigation-Oriented Heuristic
In the navigation-oriented heuristic, the user request sequence is
processed in order. There are two cases for adding a new page WP_{N+1} to a
session
[WP_1, WP_2, …, WP_N]
If WP_N has a hyperlink to WP_{N+1}:
[WP_1, WP_2, …, WP_N, WP_{N+1}]
If WP_N does not have a hyperlink to WP_{N+1}:
Assume that WP_{Kmax} is the nearest page having a hyperlink
to WP_{N+1}, and add backward browser moves:
[WP_1, WP_2, …, WP_N, WP_{N-1}, WP_{N-2}, ..., WP_{Kmax}, WP_{N+1}]
Previous Session Reconstruction Heuristics
Navigation-Oriented Heuristic
Example (user request sequence: P1, P20, P13, P49, P34, P23):
Current Session                      Condition                                New Page
[]                                                                            P1
[P1]                                 Link[P1, P20] = 1                        P20
[P1, P20]                            Link[P20, P13] = 0, Link[P1, P13] = 1    P13
[P1, P20, P1, P13]                   Link[P13, P49] = 1                       P49
[P1, P20, P1, P13, P49]              Link[P49, P34] = 0, Link[P13, P34] = 1   P34
[P1, P20, P1, P13, P49, P13, P34]    Link[P34, P23] = 1                       P23
Resulting session: [P1, P20, P1, P13, P49, P13, P34, P23]
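The navigation-oriented heuristic can be sketched as follows. The link set contains only the hyperlinks whose Link[...] value is 1 in the example above. The function name navigation_oriented is hypothetical, and starting a new session when no page of the current session links to the new page is an assumption of this sketch, since the slides do not cover that case.

```python
# Hyperlinks with Link[...] = 1 in the worked example.
LINKS = {("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"), ("P34", "P23")}


def navigation_oriented(requests, links):
    """Rebuild sessions from a page sequence, inserting backward browser moves."""
    sessions = []
    current = []
    for page in requests:
        if not current or (current[-1], page) in links:
            current.append(page)
            continue
        # Walk backward to find the nearest earlier page linking to the new page.
        k = next(
            (i for i in range(len(current) - 2, -1, -1) if (current[i], page) in links),
            None,
        )
        if k is None:
            # Assumption: no referrer in the session, so a new session starts.
            sessions.append(current)
            current = [page]
        else:
            back_moves = [current[i] for i in range(len(current) - 2, k - 1, -1)]
            current.extend(back_moves)
            current.append(page)
    if current:
        sessions.append(current)
    return sessions


print(navigation_oriented(["P1", "P20", "P13", "P49", "P34", "P23"], LINKS))
# [['P1', 'P20', 'P1', 'P13', 'P49', 'P13', 'P34', 'P23']]
```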
Smart-SRA
Phase 1: Shorter request sequences (candidate sessions) are constructed by using
the overall session duration time and page-stay time criteria.
Phase 2: Candidate sessions are partitioned into maximal sub-sessions
such that:
– between each consecutive page pair in a session there is a hyperlink from the previous
page to the next page
Topology Rule:
– ∀i: 1 ≤ i < n, there is a hyperlink from P_i to P_{i+1}
Time Rules:
– ∀i: 1 ≤ i < n, Timestamp(P_i) < Timestamp(P_{i+1})
– ∀i: 1 ≤ i < n, Timestamp(P_{i+1}) - Timestamp(P_i) ≤ r (page stay time)
– Timestamp(P_n) - Timestamp(P_1) ≤ δ (session duration time)
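The topology and time rules can be read as a validity check on a single session. The sketch below is only an illustration. The function name satisfies_rules, the (page, timestamp) representation, and the example thresholds of 10 and 30 minutes are assumptions, and the links shown are a hypothetical subset of a site topology.

```python
def satisfies_rules(session, links, max_stay, max_duration):
    """Check the topology rule and the three time rules for one session.

    session: list of (page, timestamp) pairs.
    links: set of (source, target) hyperlinks.
    """
    pairs = list(zip(session, session[1:]))
    topology_ok = all((a, b) in links for (a, _), (b, _) in pairs)
    increasing = all(ta < tb for (_, ta), (_, tb) in pairs)
    stay_ok = all(tb - ta <= max_stay for (_, ta), (_, tb) in pairs)
    duration_ok = not session or session[-1][1] - session[0][1] <= max_duration
    return topology_ok and increasing and stay_ok and duration_ok


# Hypothetical topology fragment and sessions for illustration.
LINKS = {("P1", "P13"), ("P13", "P34"), ("P34", "P23")}
good = [("P1", 0), ("P13", 9), ("P34", 14), ("P23", 15)]
bad = [("P1", 0), ("P20", 6), ("P13", 9)]  # no hyperlink P1 -> P20 in LINKS

print(satisfies_rules(good, LINKS, max_stay=10, max_duration=30))  # True
print(satisfies_rules(bad, LINKS, max_stay=10, max_duration=30))   # False
```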
Smart-SRA
Phase 2 of Smart-SRA processes a candidate session from left
to right by repeating the following steps until the
candidate session is empty:
1. Determine the web pages without any referrer (on their left)
and remove them from the candidate session.
2. For each one of these pages
For each previously constructed session
– If there is a hyperlink from the last page of the session to the web
page and the page stay time constraint is satisfied, then append the web
page to the session.
3. Remove non-maximal sessions.
Smart-SRA
Example web topology used for applying Smart-SRA
[Figure: directed web graph over pages P1, P13, P20, P23, P34, P49]
Example Candidate Session
Page       P1   P20  P13  P49  P34  P23
Timestamp  0    6    9    12   14   15
Smart-SRA
Iteration 1
  Candidate Session:                    [P1, P20, P13, P49, P34, P23]
  New Session Set (before):             (empty)
  Temp Page Set (non-referred pages):   {P1}
  Temp Session Set:                     [P1]
  New Session Set (after):              [P1]
Iteration 2
  Candidate Session:                    [P20, P13, P49, P34, P23]
  New Session Set (before):             [P1]
  Temp Page Set:                        {P20, P13}
  Temp Session Set:                     [P1, P20], [P1, P13]
  New Session Set (after):              [P1, P20], [P1, P13]
Iteration 3
  Candidate Session:                    [P49, P34, P23]
  New Session Set (before):             [P1, P20], [P1, P13]
  Temp Page Set:                        {P49, P34}
  Temp Session Set:                     [P1, P13, P34], [P1, P13, P49]
  New Session Set (after):              [P1, P13, P34], [P1, P13, P49], [P1, P20]
Iteration 4
  Candidate Session:                    [P23]
  New Session Set (before):             [P1, P13, P34], [P1, P13, P49], [P1, P20]
  Temp Page Set:                        {P23}
  Temp Session Set:                     [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
  New Session Set (after):              [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
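The iterations above can be reproduced with a short sketch of Phase 2. It is an illustration under stated assumptions, not the thesis implementation. The hyperlink set is inferred from the sessions produced in the worked example, since the topology figure is not available here. The page stay threshold of 10 minutes is borrowed from the time-oriented heuristic. A page that cannot extend any existing session starts a new one-page session, which is an assumption of this sketch. The function name smart_sra_phase2 is hypothetical.

```python
# Hyperlinks inferred from the worked example (assumption).
LINKS = {
    ("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"),
    ("P49", "P23"), ("P34", "P23"), ("P20", "P23"),
}
CANDIDATE = [("P1", 0), ("P20", 6), ("P13", 9), ("P49", 12), ("P34", 14), ("P23", 15)]


def smart_sra_phase2(candidate, links, max_stay):
    """Partition a candidate session into link-connected maximal sessions.

    candidate: list of (page, timestamp) pairs sorted by timestamp.
    """
    remaining = list(candidate)
    sessions = []
    while remaining:
        # Step 1: pages with no referrer to their left in the candidate session.
        starters = [
            (page, ts)
            for idx, (page, ts) in enumerate(remaining)
            if not any((prev, page) in links for prev, _ in remaining[:idx])
        ]
        remaining = [item for item in remaining if item not in starters]
        # Step 2: try to append each such page to every existing session.
        extended = []
        extended_ids = set()
        for page, ts in starters:
            attached = False
            for sid, session in enumerate(sessions):
                last_page, last_ts = session[-1]
                if (last_page, page) in links and ts - last_ts <= max_stay:
                    extended.append(session + [(page, ts)])
                    extended_ids.add(sid)
                    attached = True
            if not attached:
                extended.append([(page, ts)])  # assumption: start a new session
        # Step 3: drop sessions that were extended (they are no longer maximal).
        kept = [s for sid, s in enumerate(sessions) if sid not in extended_ids]
        sessions = kept + extended
    return [[page for page, _ in session] for session in sessions]


for session in smart_sra_phase2(CANDIDATE, LINKS, max_stay=10):
    print(session)
```

With these inputs the sketch returns the three sessions listed for iteration 4, in a different order.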
Agent Simulator
Models the behavior of web users and generates
web user navigation and the log data kept by the web
server
Used to compare the performances of alternative
session reconstruction heuristics
Agent Simulator
Provides 4 basic behaviors of a web user:
• A web user can start a session with any one of the possible
entry pages of a web site.
• A web user can select the next page having a link from the
most recently accessed page.
• A web user can press the back button one or more times and thus
select as the next page a page having a link from any one of
the previously browsed pages (i.e., pages accessed before the
most recently accessed one).
• A web user can terminate his/her session.
Agent Simulator
Behavior I
Web user can start a new session with any one of the possible
entry pages of the web site
[Figure: web topology graph (P1, P13, P20, P23, P34, P49) illustrating Behavior I; legend: start page, new request from server, S1 = Session I, S2 = Session II]
Agent Simulator
Behavior II
Web user can select a new page having a link from the
most recently accessed page.
[Figure: web topology graph (P1, P13, P20, P23, P34, P49) illustrating Behavior II; legend: start page, new request from server, S1 = Session I, S2 = Session II]
Agent Simulator
Behavior III
Web user can select as the next page a page having a link from
any one of the previously browsed pages.
[Figure: web topology graph (P1, P13, P20, P23, P34, P49) illustrating Behavior III; legend: start page, new request from server, S1 = Session I, S2 = Session II]
Agent Simulator
Behavior IV
Web user can terminate the session.
Example: the session is terminated in P23.
[Figure: web topology graph (P1, P13, P20, P23, P34, P49) illustrating Behavior IV; legend: start page, new request from server, S1 = Session I, S2 = Session II]
Agent Simulator
3 parameters for simulating the behavior of a web user:
Session Termination Probability (STP)
Link from Previous pages Probability (LPP)
New Initial page Probability (NIP)
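One way the four behaviors and the three parameters might fit together is sketched below. The slides do not specify how the simulator draws its random choices or records sessions, so everything in this block is an assumption made for illustration: the function name simulate_agent, the order in which STP, NIP and LPP are tested, and the rule that a back-button move starts a new real session sharing a prefix with the old one. STP should be greater than zero so that the loop ends.

```python
import random

# Example topology (assumption: same links as used in the Smart-SRA example).
LINKS = {
    ("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"),
    ("P49", "P23"), ("P34", "P23"), ("P20", "P23"),
}


def simulate_agent(links, entry_pages, stp, lpp, nip, rng):
    """Simulate one user; return (real_sessions, server_log)."""
    out_links = {}
    for src, dst in sorted(links):
        out_links.setdefault(src, []).append(dst)

    current = [rng.choice(entry_pages)]          # Behavior I: pick an entry page
    sessions = [current]
    log = [current[0]]
    while rng.random() >= stp:                   # Behavior IV: stop with probability STP
        if rng.random() < nip:                   # Behavior I again: new entry page
            current = [rng.choice(entry_pages)]
            sessions.append(current)
            log.append(current[0])
            continue
        base = current
        if len(current) > 1 and rng.random() < lpp:
            # Behavior III: go back to an earlier page (served from browser cache).
            back_to = rng.randrange(len(current) - 1)
            base = current[:back_to + 1]
        targets = out_links.get(base[-1], [])
        if not targets:                          # dead end: nothing left to click
            break
        next_page = rng.choice(targets)
        if base is current:
            current.append(next_page)            # Behavior II: follow a link
        else:
            current = base + [next_page]         # new session branching off the old one
            sessions.append(current)
        log.append(next_page)                    # only real requests reach the server
    return sessions, log


sessions, log = simulate_agent(LINKS, ["P1"], stp=0.05, lpp=0.3, nip=0.3,
                               rng=random.Random(7))
print(sessions)
print(log)
```

The returned real sessions serve as ground truth, while the log is what a reactive heuristic would see.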
Experimental Results
Heuristics Tested
Time-oriented heuristic (heur1)
(total time ≤ 30 min)
Time-oriented heuristic (heur2)
(page stay ≤ 10 min)
Navigation-oriented heuristic (heur3)
Smart-SRA heuristic (heur4)
Experimental Results
Accuracy is determined as follows:
A reconstructed session H captures
a real session R
if R occurs as a subsequence of H (R ⊏ H).
A string-matching relation is needed: the pages of R must appear consecutively in H.
R = [P1, P3, P5]
H = [P9, P1, P3, P5, P8] => R ⊏ H: Yes
H = [P1, P9, P3, P5, P8] => R ⊏ H: No
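The capture relation amounts to a contiguous sub-list test. A minimal sketch follows, assuming sessions are plain Python lists of page names; the function name captures is hypothetical.

```python
def captures(reconstructed, real):
    """Return True if `real` appears as a contiguous run inside `reconstructed`."""
    n = len(real)
    return any(
        reconstructed[i:i + n] == real
        for i in range(len(reconstructed) - n + 1)
    )


R = ["P1", "P3", "P5"]
print(captures(["P9", "P1", "P3", "P5", "P8"], R))  # True
print(captures(["P1", "P9", "P3", "P5", "P8"], R))  # False
```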
Experimental Results
Parameters for generating user sessions and web topology
Parameter                                  Value
Number of web pages (nodes) in topology    300
Average outdegree                          15
Average page stay time                     2.2 min
Deviation for page stay time               0.5 min
Number of agents                           10000
STP (fixed / range)                        5% / 1%-20%
LPP (fixed / range)                        30% / 0%-90%
NIP (fixed / range)                        30% / 0%-90%
Experimental Results
Accuracy vs. STP
Increasing STP leads to sessions with fewer pages, which are
easier to reconstruct. In short sessions, the probability that LPP and NIP
events occur is also small.
Experimental Results
Accuracy vs. LPP
[Figure: Real Accuracy (%) vs LPP (0 to 90) for heur1, heur2, heur3 and heur4]
As LPP increases the real accuracy decreases. Increasing LPP leads to
more complex sessions. Intelligent Path completion is needed for
discovering more accurate sessions.
Experimental Results
Accuracy vs. NIP
[Figure: Real Accuracy (%) vs NIP (0 to 90) for heur1, heur2, heur3 and heur4]
Increasing NIP leads to more complex sessions, so the accuracy decreases
for all heuristics. Path separation is needed for discovering more
accurate sessions.
Conclusion
New session reconstruction heuristic: Smart-SRA
– Does not allow sequences with unrelated consecutive requests
(no hyperlink from the previous one to the next one)
– Does not insert artificial browser (back) requests in order to prevent
unrelated consecutive requests
– Discovers only maximal sessions.
The agent simulator simulates the behavior of real WWW users.
It makes it possible to evaluate the accuracy of heuristics.
Experimental results show that Smart-SRA outperforms
previous reactive heuristics.
References
[Bayir 06-1] M. A. Bayir, I. H. Toroslu, A. Cosar (2006). A Performance Comparison of
Pattern Discovery Methods on Web Log Data. AICCSA-06, the 4th ACS/IEEE
International Conference on Computer Systems and Applications.
[Bayir 06-2] M. A. Bayir, I. H. Toroslu, A. Cosar (2006). A New Approach for Reactive
Web Usage Data Processing. ICDE Workshops, 44.
[Cooley 99-1] R. Cooley, B. Mobasher, J. Srivastava (1999). Data Preparation for
Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, Vol. 1,
No. 1.
[Cooley 99-2] R. Cooley, P. Tan, J. Srivastava (1999). Discovery of interesting usage
patterns from Web data. Advances in Web Usage Analysis and User Profiling, LNAI 1836,
Springer, Berlin, Germany, 163-182.
[Spiliopoulou 98] M. Spiliopoulou, L. C. Faulstich (1998). WUM: A tool for Web
Utilization analysis. Proceedings EDBT workshop WebDB'98, LNCS 1590, Springer,
Berlin, Germany, 184-203.
Thank you for listening.
Any questions?