Data e Web Mining
825368 Paolo Gobbo
Smart Miner: A New Framework for Mining
Large Scale Web Usage Data
Bayir – Toroslu – Cosar - Fidan
Data Mining on Web
Web Mining: discover and retrieve useful and interesting patterns from large web datasets
• web content mining: real data in web pages (text and multimedia documents)
• web structure mining: data describing the organization of the content (hyperlink structure)
• web usage mining: data describing the pattern of usage of web pages (web log records)
PreProcessing
[Figure: preprocessing pipeline]
• INPUT: Site File, Access Log, Referrer Log, Agent Log, Registration
• PREPROCESSING: Data Cleaning, Path Completion, Session Identification, User Identification, Site Crawler, SQL Query, Transaction Identification
• Results: Site Topology, User Session File, Transaction File
Session Identification
Partitioning each user's activities into sequences (sessions) of entries from web request logs
• time oriented heuristics: temporal boundaries
  - session length: $T(P_n) - T(P_1) \le \delta$
  - page-stay: $\forall i : 1 \le i < n,\ T(P_{i+1}) - T(P_i) \le \sigma$
• navigation oriented heuristics: link between web pages
  - $\forall i : 1 < i \le n,\ \exists j < i : \mathrm{Link}(P_j, P_i) = \text{true}$
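The two time oriented heuristics above can be sketched together as follows. This is a minimal illustration, not the code used in the paper: the function name `split_sessions`, the `(page, timestamp)` input format and the sample values for `delta` and `sigma` are assumptions.

```python
def split_sessions(requests, delta, sigma):
    """Split one user's time-ordered (page, timestamp) requests into sessions.

    delta: maximum session length, sigma: maximum page-stay time
    (both assumed to be in the same unit as the timestamps).
    """
    sessions = []
    for page, t in requests:
        current = sessions[-1] if sessions else None
        if (current is None
                or t - current[-1][1] > sigma    # page-stay rule violated
                or t - current[0][1] > delta):   # session-length rule violated
            sessions.append([(page, t)])
        else:
            current.append((page, t))
    return sessions


# Hypothetical request log of a single user.
requests = [("A", 0), ("B", 5), ("C", 30), ("D", 33), ("E", 36)]
print(split_sessions(requests, delta=30, sigma=10))
```

A new session starts as soon as either boundary is exceeded, so the gap of 25 time units between B and C closes the first session.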
Sequential Mining
Association Mining with the order of transactions
Given a set of data sequences, find all sequences with a user-specified minimum support
• items: $I = \{i_1, i_2, \ldots, i_m\}$
• itemset/element: $X = \{x_1, x_2, \ldots, x_k\}$ with $x_i \in I$, $X \neq \emptyset$
• sequence: $s = \langle a_1, a_2, \ldots, a_r \rangle$, where each $a_i$ is an itemset
• sequence size: number of itemsets/elements
• sequence length: number of items
• subsequence: $s_1 = \langle a_1, a_2, \ldots, a_n \rangle$ is a subsequence of $s_2 = \langle b_1, b_2, \ldots, b_m \rangle$ if $\exists\, i_1 < i_2 < \cdots < i_n : a_1 \subseteq b_{i_1},\ a_2 \subseteq b_{i_2},\ \ldots,\ a_n \subseteq b_{i_n}$
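The subsequence definition above can be checked with a greedy scan. The sketch below is illustrative only: the name `is_subsequence` and the representation of a sequence as a list of Python sets are assumptions.

```python
def is_subsequence(s1, s2):
    """Return True if every itemset of s1 is contained, in order,
    in a distinct itemset of s2 (greedy left-to-right matching)."""
    j = 0
    for a in s1:
        # advance in s2 until an itemset that contains a is found
        while j < len(s2) and not a <= s2[j]:
            j += 1
        if j == len(s2):
            return False
        j += 1  # the next itemset of s1 must match strictly later in s2
    return True


print(is_subsequence([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))
print(is_subsequence([{3}, {5}], [{3, 5}]))
```

The second call shows that two items bought in the same transaction do not support a pattern in which they appear in separate elements.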
Sequential Mining algorithms
• GSP
• APrioriAll
• APrioriSome
Phases
• Sort Phase: transforms customer transactions into customer sequences
• LargeItemSet Phase: generates the set of large itemsets
• Transformation Phase: represents customer sequences based on large itemsets
• Sequence Phase: derives large k-sequences based on large (k-1)-sequences
• Maximal Phase: prunes non-maximal sequences
Smart-SRA session
$S = \{S_1, S_2, \ldots, S_m\}$, with each session a path $S_x = [P_1, P_2, \ldots, P_k, P_{k+1}, \ldots, P_n]$
• timestamp ordering (time oriented) rule (session)
  $\forall i : 1 \le i < n,\ T(P_i) < T(P_{i+1})$
  $\forall i : 1 \le i < n,\ T(P_{i+1}) - T(P_i) \le \sigma$
  $T(P_n) - T(P_1) \le \delta$
• topology (navigation oriented) rule (path in the web site)
  $\forall i : 1 \le i < n,\ \mathrm{Link}(P_i, P_{i+1}) = \text{true}$
• maximality rule
  $\forall S_x \in S,\ \nexists S_y : S_y \in S \wedge S_x \subset S_y$
Smart Miner
[Figure: DATA STREAM → SMART-SRA SESSION CONSTRUCTION (Candidate Session → Smart Session) → SEQUENTIAL MINING (Sequential AprioriAll) → FREQUENT ACCESS PATTERN]
Smart Miner: First Phase Smart SRA
Candidate session construction
• time oriented heuristics
  - session length
  - page-stay
  - no backward movement

Page      | P1 | P20 | P13 | P49 | P34 | P23
TimeStamp | 0  | 6   | 9   | 12  | 14  | 15

Page      | P13 | P20 | P23 | P49
TimeStamp | 0   | 5   | 9   | 10

[Figure: Web Site Graph with nodes P1, P13, P20, P23, P34, P49, and the resulting Candidate Session]
Smart Miner: Second Phase Smart SRA
Smart session construction
• time oriented heuristics
  - inherited session length
  - re-check page-stay
  - no backward movement
• maximality
• topology rule

Page      | P1 | P20 | P13 | P49 | P34 | P23
TimeStamp | 0  | 6   | 9   | 12  | 14  | 15

Smart Sessions: [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
[Figure: Web Site Graph with nodes P1, P13, P20, P23, P34, P49]
Smart Miner: Second Phase Smart SRA
SMART SESSION RECONSTRUCTION
foreach CandSession in CandSessionSet
    NewSessionSet = {}
    while CandSession ≠ Ø
        TSessionSet = {}; TPageSet = {}
        // pages with no incoming link from an earlier page
        foreach Pagei in CandSession
            StartPageFlag = TRUE
            foreach Pagej in CandSession with j < i
                if (Link[Pagej,Pagei] and TimeDiff(Pagej,Pagei) ≤ σ)
                    then StartPageFlag = FALSE
            endfor
            if StartPageFlag then TPageSet = TPageSet U {Pagei}
        endfor
        CandSession = CandSession − TPageSet
        // session set construction
        if NewSessionSet = {} then
            foreach Pagei in TPageSet
                TSessionSet = TSessionSet U {[Pagei]}
        else
            // session set extension
            foreach Pagei in TPageSet
                foreach Sessionj in NewSessionSet
                    if (Link[Last(Sessionj),Pagei] and
                        TimeDiff(Last(Sessionj),Pagei) ≤ σ) then
                        TSession = Sessionj
                        TSession.mark = UNEXTENDED
                        TSession = TSession • Pagei
                        TSessionSet = TSessionSet U {TSession}
                        Sessionj.mark = EXTENDED
                    endif
                endfor
            endfor
        endif
        // session set extension with the sessions that were not extended
        foreach Sessionj in NewSessionSet
            if Sessionj.mark ≠ EXTENDED
                then TSessionSet = TSessionSet U {Sessionj}
        endfor
        NewSessionSet = TSessionSet
    endwhile
endfor
Session Construction Example
Iteration | CandidateSession                | TPageSet   | NewSessionSet
1         | [P1, P20, P13, P49, P34, P23]   | {P1}       | [P1]
2         | [P20, P13, P49, P34, P23]       | {P20, P13} | [P1, P20] [P1, P13]
3         | [P49, P34, P23]                 | {P49, P34} | [P1, P13, P34] [P1, P13, P49] [P1, P20]
4         | [P23]                           | {P23}      | [P1, P13, P34, P23] [P1, P13, P49, P23] [P1, P20, P23]
[Figure: Web Site Graph with nodes P1, P13, P20, P23, P34, P49]
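The second phase can be traced with a short Python sketch on the example candidate session. It is a hedged illustration rather than the authors' implementation: the function name `smart_sra_phase2`, the edge set `links` (inferred from the sessions in the example, since the graph itself is only a figure) and the threshold `sigma=10` are all assumptions.

```python
def smart_sra_phase2(candidate, links, sigma):
    """candidate: time-ordered list of (page, timestamp);
    links: set of (source, target) hyperlinks; sigma: page-stay limit."""
    remaining = list(candidate)
    sessions = []  # each session is a list of (page, timestamp)
    while remaining:
        # start pages: no earlier remaining page links to them within sigma
        starts = [
            (p, t) for i, (p, t) in enumerate(remaining)
            if not any((q, p) in links and t - tq <= sigma
                       for q, tq in remaining[:i])
        ]
        remaining = [x for x in remaining if x not in starts]
        if not sessions:
            new_sessions = [[s] for s in starts]
        else:
            new_sessions, extended = [], set()
            for p, t in starts:
                for idx, sess in enumerate(sessions):
                    last_p, last_t = sess[-1]
                    if (last_p, p) in links and t - last_t <= sigma:
                        new_sessions.append(sess + [(p, t)])
                        extended.add(idx)
            # keep sessions that could not be extended in this iteration
            new_sessions += [s for idx, s in enumerate(sessions)
                             if idx not in extended]
        sessions = new_sessions
    return [[p for p, _ in s] for s in sessions]


# Assumed topology, inferred from the smart sessions of the example.
links = {("P1", "P20"), ("P1", "P13"), ("P13", "P34"), ("P13", "P49"),
         ("P20", "P23"), ("P34", "P23"), ("P49", "P23")}
candidate = [("P1", 0), ("P20", 6), ("P13", 9),
             ("P49", 12), ("P34", 14), ("P23", 15)]
for session in smart_sra_phase2(candidate, links, sigma=10):
    print(session)
```

Under these assumptions the loop runs four iterations and ends with the three smart sessions listed in the example table.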
Sequential APrioriAll
Pruning
• performed during candidate sequence generation, before calculating support
  - topological constraint: for every consecutive pair of pages in a sequence, the former must have a hyperlink to the latter
  - string matching constraint: session S supports a pattern P if and only if P is a subsequence of S not violating string matching (the pages of P occur contiguously in S)
    <1,2,3> supports <1,2>
    <1,2,3> does not support <1,3>
Support
• one scan through the transaction database, keeping candidate sessions in a hashmap
I: pattern
S: user reconstructed sessions
$\mathrm{Support}(I, S) = \dfrac{|\{S_i \mid I \text{ is a substring of } S_i\}|}{|S|}$
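A small sketch of the support computation under the string matching constraint follows. The names `is_substring` and `support` and the sample sessions are assumptions made for illustration.

```python
def is_substring(pattern, session):
    """True if the pages of pattern occur contiguously, in order, in session."""
    k = len(pattern)
    return any(tuple(session[i:i + k]) == tuple(pattern)
               for i in range(len(session) - k + 1))


def support(pattern, sessions):
    """Fraction of sessions that contain pattern as a substring."""
    return sum(is_substring(pattern, s) for s in sessions) / len(sessions)


print(is_substring((1, 2), (1, 2, 3)))   # <1,2,3> supports <1,2>
print(is_substring((1, 3), (1, 2, 3)))   # <1,2,3> does not support <1,3>
print(support((1, 2), [[1, 2, 3], [2, 3], [1, 3, 2], [1, 2]]))
```

Contiguity is what distinguishes this measure from the generic subsequence test used by AprioriAll and GSP.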
Sequential Apriori Algorithm
SEQUENTIAL APRIORI
INPUT:
    δ    : minimum support frequency
    S    : reconstructed sessions
    Link : topology information
    P    : set of all web pages
OUTPUT:
    Max  : set of maximal frequent patterns

L1 = {}                                       // length-1 candidate pattern generation
for i = 1 to |P| do
    if Support([Pi],S) > δ then L1 = L1 U {[Pi]}
endfor
for k = 1 to N-1 do
    if Lk = Ø then Halt                       // no further generation
    else
        Lk+1 = {}                             // length-(k+1) candidate pattern generation
        foreach Ii in Lk                      // joining step
            foreach Pj in P
                if Link[Last(Ii),Pj] then     // pruning step: topological rule
                    T = Ii • Pj               // append page
                    if Support(T,S) > δ then  // support rule
                        T.maximal = true      // maximality rule
                        Ii.maximal = false
                        V = [T2,T3,…,T|T|]
                        if V in Lk then
                            V.maximal = false
                        endif
                        Lk+1 = Lk+1 U {T}
                    endif
                endif
            endfor
        endfor
    endif
endfor
Max = {}                                      // union of the sets of maximal patterns
for k = 1 to N-1 do
    Max = Max U {S | S in Lk and S.maximal = true}
endfor
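The level-wise search can be sketched compactly in Python. This is an illustrative reading of the pseudocode, not the authors' code: the names `contains` and `sequential_apriori`, the representation of patterns as tuples, the edge set `links`, and the threshold 0.5 are assumptions, and the loop simply stops when a level is empty instead of using a fixed bound N.

```python
def contains(session, pattern):
    """String-matching test: pattern occurs contiguously in session."""
    k = len(pattern)
    return any(tuple(session[i:i + k]) == pattern
               for i in range(len(session) - k + 1))


def sequential_apriori(sessions, links, pages, min_support):
    def support(pattern):
        return sum(contains(s, pattern) for s in sessions) / len(sessions)

    level = {(p,) for p in pages if support((p,)) > min_support}
    maximal = set(level)
    while level:
        next_level = set()
        for pattern in level:                      # joining step
            for page in pages:
                if (pattern[-1], page) not in links:
                    continue                       # topological rule
                cand = pattern + (page,)
                if support(cand) > min_support:    # support rule
                    next_level.add(cand)
                    maximal.discard(pattern)       # prefix is not maximal
                    maximal.discard(cand[1:])      # suffix is not maximal
        maximal |= next_level
        level = next_level
    return maximal


# Assumed data: the three smart sessions and an inferred link set.
sessions = [["P1", "P13", "P34", "P23"], ["P1", "P13", "P49", "P23"],
            ["P1", "P20", "P23"]]
links = {("P1", "P20"), ("P1", "P13"), ("P13", "P34"), ("P13", "P49"),
         ("P20", "P23"), ("P34", "P23"), ("P49", "P23")}
pages = ["P1", "P13", "P20", "P23", "P34", "P49"]
print(sequential_apriori(sessions, links, pages, min_support=0.5))
```

Because candidates are only built along existing hyperlinks, the number of support computations per level is bounded by the out-degree of the last page of each frequent pattern rather than by |P|.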
Accuracy Metric
$MP_A$: frequent maximal patterns of the agent simulator
$MP_H$: frequent maximal patterns of the heuristic
• recall: $REC_H = \dfrac{|MP_A \cap MP_H|}{|MP_A|}$
• precision: $PRE_H = \dfrac{|MP_A \cap MP_H|}{|MP_H|}$
• accuracy: $A_H = REC_H \cdot PRE_H$
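As a quick numeric illustration of the three measures, the sketch below compares two pattern sets. The function name `accuracy_metrics` and the sample patterns are hypothetical.

```python
def accuracy_metrics(mp_agent, mp_heuristic):
    """Return (recall, precision, accuracy) for two sets of maximal patterns."""
    common = mp_agent & mp_heuristic
    recall = len(common) / len(mp_agent)
    precision = len(common) / len(mp_heuristic)
    return recall, precision, recall * precision


# Hypothetical pattern sets: the heuristic recovers 2 of 4 true patterns
# and reports 1 spurious one.
mp_a = {("P1", "P13"), ("P23",), ("P1", "P20"), ("P13", "P49")}
mp_h = {("P1", "P13"), ("P23",), ("P20", "P23")}
print(accuracy_metrics(mp_a, mp_h))
```

Multiplying recall by precision penalizes a heuristic that does well on only one of the two measures.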
Agent Simulator
Agent Simulator Parameters
• STP (Session Termination Probability): probability of terminating the session
• LPP (Link from Previous page Probability): probability of referring to the next page from one of the previously accessed pages, except the most recently accessed one
• LPC (Link from Current page Probability): probability of referring to the next page from the most recently visited page
• NIP (New Initial page Probability): probability of selecting one of the starting pages of the web site during navigation
Simulated Data
Web topology
• number of web pages from 10 to 1000
• number of users from 1000 to 10000
Agent simulator parameters
• 49 different cases
• NIP/STP: 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0
• LPC/LPP: 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0
Support parameter
• values: 0.001, 0.0025, 0.005, 0.0075, 0.01
Runs of agent simulator
• 10 different random runs
Results on Simulated Data
[Figures omitted. Axis labels: NIP (New Initial Page Probability), STP (Session Termination Probability). Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA]
Real Data
AGMLAB’s company web site
• 4 months user activity
• 3801 users
• 30 minutes session time-out
• 10 web pages
• link graph densely connected
User Activity
• action tracking program
• cookies
• cookie information recorded to a server log file
Results on Real Data
[Figure omitted. Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA]
Scalability
MAP/REDUCE paradigm
• each node processes a block of the session database, computing the local frequency of each candidate pattern
[Figures omitted: Performance with 50 nodes; Performance on 100 GB Data]
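The map/reduce idea can be imitated in a single process to show how local frequencies combine into a global support. This sketch is an assumption-laden illustration: the names `map_block` and `global_support` are hypothetical, the blocks are plain Python lists, and no real cluster framework is involved.

```python
from collections import Counter
from functools import reduce


def contains(session, pattern):
    """String-matching test: pattern occurs contiguously in session."""
    k = len(pattern)
    return any(tuple(session[i:i + k]) == pattern
               for i in range(len(session) - k + 1))


def map_block(block, candidates):
    """Map step: local frequency of each candidate in one block of sessions."""
    return Counter({c: sum(contains(s, c) for s in block) for c in candidates})


def global_support(blocks, candidates):
    """Reduce step: sum the local counters and normalize by the session count."""
    total = reduce(lambda a, b: a + b,
                   (map_block(b, candidates) for b in blocks), Counter())
    n = sum(len(b) for b in blocks)
    return {c: total[c] / n for c in candidates}


# Two hypothetical blocks, as if stored on two different nodes.
blocks = [
    [["P1", "P13", "P34", "P23"], ["P1", "P13", "P49", "P23"]],
    [["P1", "P20", "P23"]],
]
candidates = [("P1", "P13"), ("P20", "P23")]
print(global_support(blocks, candidates))
```

Only the small per-block counters need to travel between nodes, which is why the counting step parallelizes well.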
Web References/Bibliography
M.A. Bayir, I.H. Toroslu, A. Cosar, G. Fidan, Smart Miner: A New Framework for Mining Large Scale Web Usage Data, 2009
R. Cooley, B. Mobasher, J. Srivastava, Data Preparation for Mining World Wide Web Browsing Patterns, 1999
J. Srivastava, R. Cooley, M. Deshpande, P.N. Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, 2000
M.G. Da Costa Jr., Z. Gong, Web Structure Mining: An Introduction, 2005
J.J. Jung, Semantic Preprocessing of Web Request Streams for Web Usage Mining, 2005
R. Agrawal, R. Srikant, Mining Sequential Patterns, 1995
GSP
GSP – GENERALIZED SEQUENTIAL PATTERN
C1 = Init_Pass
L1 = {<{f}> | f in C1, with minimum support}
for (k=2; Lk-1 ≠ Ø; k++) do
begin
    Ck = Candidate-gen-SPM(Lk-1)
    foreach sequence s in the database D do
        foreach candidate c in Ck
            if (c in s) then update candidate c
    Lk = candidates c in Ck with minimum support
end
result = Uk(Lk)

CANDIDATE-GEN-SPM
(join step)
foreach p in Lk-1
    foreach q in Lk-1
        if (∀n ≤ k-2 : pn+1 = qn)
            then Ck = Ck U {p1,…,pk-1,qk-1}
(prune step)
foreach s in Ck
    if exists(r | r ⊂ s ˄ r ∉ Lk-1)
        then Ck = Ck - s
GSP Example
L3-sequences: <{1,2},{4}>, <{1,2},{5}>, <{1},{4,5}>, <{1,4},{6}>, <{2},{4,5}>, <{2},{4},{6}>, <{1},{4},{6}>
Candidate 4-sequences (join step): <{1,2},{4,5}>, <{1,2},{4},{6}>
Candidate 4-sequences (prune step): <{1,2},{4,5}>
APrioriAll
APRIORIALL
L1 = {large 1-sequences}
for (k=2; Lk-1 ≠ Ø; k++) do
begin
    Ck = Apriori-generate(Lk-1)
    foreach sequence c in the database D do
        update candidates in Ck that are contained in c
    Lk = candidates in Ck with minimum support
end
result = maximal sequences in Uk(Lk)

APRIORI-GENERATE
(join step)
foreach p in Lk-1
    foreach q in Lk-1
        if (p.x1=q.x1) ˄ (p.x2=q.x2) ˄ … ˄ (p.xk-2=q.xk-2)
            then Ck = Ck U {<p.x1,…,p.xk-1,q.xk-1>}
(prune step)
foreach s in Ck
    if exists(r | r ⊂ s ˄ r ∉ Lk-1)
        then Ck = Ck - s
APrioriAll Example
L3-sequences: <1,2,3>, <1,2,4>, <1,3,4>, <1,3,5>, <2,3,4>
Candidate 4-sequences (join step): <1,2,3,4>, <1,2,4,3>, <1,3,4,5>, <1,3,5,4>
Candidate 4-sequences (prune step): <1,2,3,4>
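The join and prune steps of the example can be reproduced with a short sketch. The names `join_step` and `prune_step` are hypothetical, and joining a sequence with itself is excluded here as an assumption, because the example lists no candidate such as <1,2,3,3>.

```python
def join_step(large_prev):
    """Join (k-1)-sequences that share their first k-2 elements."""
    return {
        p + (q[-1],)
        for p in large_prev for q in large_prev
        if p != q and p[:-1] == q[:-1]   # p != q is an assumption, see lead-in
    }


def prune_step(candidates, large_prev):
    """Drop candidates that have a (k-1)-subsequence outside large_prev."""
    def subsequences(c):
        return (c[:i] + c[i + 1:] for i in range(len(c)))
    return {c for c in candidates
            if all(s in large_prev for s in subsequences(c))}


l3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
joined = join_step(l3)
print(sorted(joined))
print(sorted(prune_step(joined, l3)))
```

For instance, <1,2,4,3> is pruned because its subsequence <2,4,3> is not among the large 3-sequences.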
APrioriSome
APRIORISOME
// Forward Phase
L1 = {large 1-sequences}
C1 = L1; last = 1
for (k=2; Ck-1 ≠ Ø; k++) do begin
    if (Lk-1 known) then Ck = Apriori-generate(Lk-1)
    else Ck = Apriori-generate(Ck-1)
    if (k = next(last)) then
        foreach sequence c in the database D do
            update candidates in Ck that are contained in c
        Lk = candidates in Ck with minimum support
        last = k
end
// Backward Phase
for (k--; k>=1; k--) do begin
    if (Lk not found) then
        delete all sequences in Ck contained in some Li, i>k
        foreach sequence c in the database D do
            update candidates in Ck that are contained in c
        Lk = candidates in Ck with minimum support
    else
        delete all sequences in Lk contained in some Li, i>k
end
result = maximal sequences in Uk(Lk)
Sequential Mining Algorithm
Customer transactions
Customer ID | Transaction Time | Items
1           | June 25 '93      | 30
1           | June 30 '93      | 90
2           | June 10 '93      | 10, 20
2           | June 15 '93      | 30
2           | June 20 '93      | 40, 60, 70
3           | June 25 '93      | 30, 50, 70
4           | June 25 '93      | 30
4           | June 30 '93      | 40, 70
4           | July 25 '93      | 90
5           | June 12 '93      | 90

Customer sequences
Customer ID | Customer Sequence
1           | <(30) (90)>
2           | <(10 20) (30) (40 60 70)>
3           | <(30 50 70)>
4           | <(30) (40 70) (90)>
5           | <(90)>

Large itemsets
Large itemset | Mapped to
(30)          | 1
(40)          | 2
(70)          | 3
(40 70)       | 4
(90)          | 5

Transformed customer sequences
Customer ID | Customer Sequence
1           | <{1} {5}>
2           | <{1} {2, 3, 4}>
3           | <{1, 3}>
4           | <{1} {2, 3, 4} {5}>
5           | <{5}>
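The transformation from customer sequences to mapped sequences can be sketched as follows. The function name `transform` and the dictionary layout are assumptions, and dropping transactions that contain no large itemset is how this sketch reads the example for customer 2.

```python
def transform(customer_sequences, large_itemsets):
    """Replace each transaction by the ids of the large itemsets it contains."""
    result = {}
    for customer, sequence in customer_sequences.items():
        mapped = []
        for transaction in sequence:
            ids = {i for itemset, i in large_itemsets.items()
                   if itemset <= transaction}
            if ids:  # a transaction with no large itemset is dropped
                mapped.append(ids)
        result[customer] = mapped
    return result


large_itemsets = {
    frozenset({30}): 1, frozenset({40}): 2, frozenset({70}): 3,
    frozenset({40, 70}): 4, frozenset({90}): 5,
}
customer_sequences = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
for customer, sequence in transform(customer_sequences, large_itemsets).items():
    print(customer, sequence)
```

After this step every element of a sequence is a set of small integers, so containment tests during the sequence phase no longer need to compare raw item lists.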