Redo Log Process Mining
Mining directly from databases
Eduardo González López de Murillas
December 11th 2014
Process Mining
• Discovery
• Replay
• Performance
• And many other things you already know
All of them start from a log. But where does the log come from?

Process Mining
New sources of (event) data
• Is there a database coordinating the operations?
Databases store execution data! (Redo Logs)
Redo Log mining overview
Pipeline stages: Redo log · Data model · Split in Traces · Event data · Event logs · Plugin outputs

Event extraction
Redo Logs (in Oracle DB)
• Set of rotated files
• Configurable in size and number
• The DBMS manages them for us
• Can be mapped to a table: V$LOGMNR_CONTENTS
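As a rough illustration, the mapped view can be queried like any other table. This is a minimal sketch, assuming a cursor from some Oracle driver has already been opened and LogMiner has already been started; the selected columns are standard fields of `V$LOGMNR_CONTENTS`.

```python
# Sketch: reading redo records through Oracle LogMiner's mapped view.
# Assumes `cursor` comes from an already-configured Oracle connection.
QUERY = """
SELECT scn, timestamp, table_name, operation, username, sql_redo, sql_undo
FROM   V$LOGMNR_CONTENTS
WHERE  operation IN ('INSERT', 'UPDATE', 'DELETE')
ORDER  BY scn
"""

def fetch_redo_records(cursor):
    # Returns one tuple per redo record, ordered by system change number.
    cursor.execute(QUERY)
    return cursor.fetchall()
```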
Redo Logs
• Records:
  • Changes in rows
  • Changes in the data schema
  • Commits
  • Rollbacks
• Some common fields:
  • TABLE NAME
  • DATABASE
  • OPERATION: INSERT, UPDATE, DELETE, etc.
  • USER
  • TIMESTAMP
  • SEQUENCE NUMBER (order of events)
  • SQL REDO
  • SQL UNDO
  • New / old values for every column
Event extraction
• Redo Log table from newest to oldest: r(n) to r(0)
• Apply the changes to the DB in reverse.

  Tn = now          T(n-1)            T(n-2)
  Row C1 C2         Row C1 C2         Row C1 C2
  1   a  m          1   a  d          1   a  b
  2   c  d          2   c  d          2   c  d
  3   e  f          3   e  f          (row 3 absent)

Undo r(n):
  SQL_UNDO: Update row(1).C2 = d
  SQL_REDO: Update row(1).C2 = m
Undo r(n-1):
  SQL_UNDO: Delete row(3)
  SQL_REDO: Insert (Row,C1,C2) = (3,e,f)
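The reverse replay can be sketched as a small loop: start from the current table content and apply each record's undo, newest first, yielding one snapshot per step. This is a toy model with simplified dicts standing in for rows and undo actions, not real SQL_UNDO statements.

```python
def snapshots(current_rows, records_newest_first):
    """Yield table states T(n), T(n-1), ... as dicts {row_id: row}."""
    state = dict(current_rows)
    yield dict(state)                      # T(n) = now
    for rec in records_newest_first:
        if rec["undo"] == "update":        # undo of an UPDATE
            state[rec["row"]] = dict(state[rec["row"]], **rec["values"])
        elif rec["undo"] == "delete":      # undo of an INSERT
            del state[rec["row"]]
        elif rec["undo"] == "insert":      # undo of a DELETE
            state[rec["row"]] = rec["values"]
        yield dict(state)

# The example above: undo r(n) updates row 1, undo r(n-1) deletes row 3.
rows_now = {1: {"C1": "a", "C2": "m"}, 2: {"C1": "c", "C2": "d"},
            3: {"C1": "e", "C2": "f"}}
recs = [{"row": 1, "undo": "update", "values": {"C2": "d"}},
        {"row": 3, "undo": "delete", "values": {}}]
states = list(snapshots(rows_now, recs))   # [T(n), T(n-1), T(n-2)]
```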
Events
• Every Redo record r(i) is an event with attributes:
  • Redo Log fields: Timestamp, Table_name, Operation, Seq number, etc.
  • And values for the columns of "Table_name": the values after r(i).SQL_REDO is applied.

  Attribute   Value        Attribute   Value
  Seq         2            Seq         1
  Row         1            Row         3
  Operation   UPDATE       Operation   INSERT
  C1          a            C1          e
  C2          m            C2          f

Note: for the sake of performance, SQL_REDO / SQL_UNDO is never executed (in event extraction). The content of modified rows is stored in temp space.
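A minimal sketch of the event construction step, assuming the row content after the change has already been recovered (from temp space, as the note says) rather than by executing SQL_REDO:

```python
def to_event(record, row_after_redo):
    """Build one event from a redo record plus the affected row's values."""
    event = {"Seq": record["seq"], "Operation": record["operation"],
             "Table": record["table"], "Timestamp": record["timestamp"]}
    event.update(row_after_redo)   # C1, C2, ... values after the change
    return event

e = to_event({"seq": 2, "operation": "UPDATE", "table": "T",
              "timestamp": "2014-10-01 12:38"},
             {"Row": 1, "C1": "a", "C2": "m"})
```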
Event collections are not logs
• Result of event extraction: a collection of events
• No traces, but... what will the trace ID be?
We need to find a relation between events!
Finding event relations

Event 1: Table=CUSTOMER, Operation=INSERT, Seq Num=1, Changes vector=111,
         Timestamp=2014-10-01 12:38, ID=1234, NAME=User01, BIRTHDAY=1986-09-17
Event 2: Table=BOOKING, Operation=INSERT, Seq Num=2, Changes vector=11,
         Timestamp=2014-10-01 12:39, ID=2459, CUSTOMER=1234
Event 3: Table=TICKET, Operation=UPDATE, Seq Num=3, Changes vector=001,
         Timestamp=2014-10-01 12:40, ID=9846, PRICE=35, BOOKING=2459

• Grouping by COLUMN: INCORRECT
• Grouping by TABLE+COLUMN: lifecycle of ONE table
Transitive relations
(The same three events as above: TICKET 9846 references BOOKING 2459, which in turn references CUSTOMER 1234, so the three events are transitively related through these key values.)
Mined model
(Again the same three events; each event is mapped to an activity named after its table, operation and changes vector:)
CUSTOMER#INSERT#111
BOOKING#INSERT#11
TICKET#UPDATE#001
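The activity labels above are simply the concatenation of three event attributes. A one-line sketch:

```python
def activity(event):
    """Activity name = TABLE#OPERATION#CHANGES-VECTOR."""
    return "#".join([event["Table"], event["Operation"], event["Changes"]])

label = activity({"Table": "CUSTOMER", "Operation": "INSERT",
                  "Changes": "111"})
```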
Data Model extraction
Data model
Tables
Columns
Keys (PK,FK,UK)
Log splitting
Log creation (Splitting)
• We need:
  • Event collection
  • Data Model
  • Relation selection → Trace ID Pattern
  • Splitting algorithm
Back to the example
(The same three events again, now carrying their activity labels:)
CUSTOMER#INSERT#111
BOOKING#INSERT#11
TICKET#UPDATE#001
Trace ID Pattern
Trace ID Pattern:
• BOOKING_CON (FK)
• BOOKING_PK (PK)
• BOOKING_FK (FK)
• CUSTOMER_PK (PK)
Trace ID Pattern canonicalization
Trace ID = values for:
• BOOKING_CON (FK) = TICKET:BOOKING_ID → BOOKING:BOOKING_ID
• BOOKING_PK (PK) = BOOKING:BOOKING_ID
• BOOKING_FK (FK) = BOOKING:CUSTOMER_ID → CUSTOMER:CUSTOMER_ID
• CUSTOMER_PK (PK) = CUSTOMER:CUSTOMER_ID
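To show how such a pattern yields one trace-ID tuple per event, here is a deliberately simplified sketch: the four keys are flattened into two hypothetical attributes (BOOKING_ID, CUSTOMER_ID), and the per-table lookup is hard-coded instead of being resolved from the data model as the real plugin would do.

```python
PATTERN = ["BOOKING_ID", "CUSTOMER_ID"]   # hypothetical flattened pattern

def trace_id(event):
    """Map an event's attributes onto the pattern; None stands for Ø."""
    lookup = {
        "CUSTOMER": {"CUSTOMER_ID": event.get("ID")},
        "BOOKING":  {"BOOKING_ID": event.get("ID"),
                     "CUSTOMER_ID": event.get("CUSTOMER")},
        "TICKET":   {"BOOKING_ID": event.get("BOOKING")},
    }[event["Table"]]
    return tuple(lookup.get(attr) for attr in PATTERN)
```

With the three example events, the BOOKING event shares BOOKING_ID with the TICKET event and CUSTOMER_ID with the CUSTOMER event, which is exactly the transitive relation the trace ID exploits.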
Splitting algorithm
The flowchart, in words:
1. Get the next event; if there are no more events, clean up subtraces and end.
2. Does the event contain the root attribute? If yes, create a new trace for it.
3. Obtain the subset of traces Related & Compatible with the event (RandC).
4. While RandC is not empty, get a trace from it:
   • If the event modifies the trace's TraceID, clone the trace and add the event to the clone.
   • Otherwise, add the event to the trace itself.
5. Go back to step 1.
Compatible & Related traces

  Trace    Events   TraceID
  trace1   e1,e2    [a,b,c]
  trace2   e2       [a,b,Ø]
  trace3   e2,e3    [a,b,d]
  trace4   e2,e4    [a,b,e]

• trace1 & trace2 are compatible
  • No contradictory values for any attribute in the TraceID
  • The null value Ø is compatible with anything
• trace1 & trace2 are related
  • At least one attribute with the same value in the TraceID
• trace3 & trace4 are related but not compatible
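The two relations reduce to two small predicates over trace-ID tuples. A toy sketch, assuming trace IDs are tuples with None standing for Ø:

```python
def compatible(a, b):
    """No contradictory values; Ø (None) is compatible with anything."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))

def related(a, b):
    """At least one attribute carrying the same non-null value."""
    return any(x is not None and x == y for x, y in zip(a, b))

t1, t2 = ("a", "b", "c"), ("a", "b", None)   # trace1, trace2
t3, t4 = ("a", "b", "d"), ("a", "b", "e")    # trace3, trace4
```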
Splitting example

  Trace    Events   TraceID
  -        -        -

Trace ID Pattern:
• Attr1 (Root)
• Attr2
• Attr3

  Event   TraceID
  e1      [a,b,c]
  e2      [a,b,Ø]
  e3      [Ø,b,d]
  e4      [Ø,b,e]
First iteration
• traceID(e1) contains the root
• Create trace1 = {e1}
• traceID(trace1) = [a,b,c]

  Trace    Events   TraceID
  trace1   e1       [a,b,c]
Second iteration
• traceID(e2) contains the root
• Create trace2 = {e2}
• traceID(trace2) = [a,b,Ø]
• Compatible&Related(e2) = {trace1}
• e2 does not modify trace1's TraceID: trace1.add(e2)

  Trace    Events   TraceID
  trace1   e1,e2    [a,b,c]
  trace2   e2       [a,b,Ø]
Third iteration
• traceID(e3) does not contain the root
• Compatible&Related(e3) = {trace2}
• e3 modifies trace2's TraceID: trace3 = clone(trace2)
• Add e3 to trace3; traceID(trace3) = [a,b,d]

  Trace    Events   TraceID
  trace1   e1,e2    [a,b,c]
  trace2   e2       [a,b,Ø]
  trace3   e2,e3    [a,b,d]
Fourth iteration
• traceID(e4) does not contain the root
• Compatible&Related(e4) = {trace2}
• e4 modifies trace2's TraceID: trace4 = clone(trace2)
• Add e4 to trace4; traceID(trace4) = [a,b,e]

  Trace    Events   TraceID
  trace1   e1,e2    [a,b,c]
  trace2   e2       [a,b,Ø]
  trace3   e2,e3    [a,b,d]
  trace4   e2,e4    [a,b,e]
Cleanup

  Trace    Events   TraceID
  trace1   e1,e2    [a,b,c]
  trace2   e2       [a,b,Ø]
  trace3   e2,e3    [a,b,d]
  trace4   e2,e4    [a,b,e]

• Remove trace2: it is a subtrace of trace3 and trace4

  Trace    Events   TraceID
  trace1   e1,e2    [a,b,c]
  trace3   e2,e3    [a,b,d]
  trace4   e2,e4    [a,b,e]
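The four iterations plus the cleanup can be condensed into one loop. This is a simplified sketch, not the actual ProM plugin code: events arrive as (name, trace-ID) pairs with precomputed trace-ID tuples, and None plays the role of Ø.

```python
def compatible(a, b):
    return all(x is None or y is None or x == y for x, y in zip(a, b))

def related(a, b):
    return any(x is not None and x == y for x, y in zip(a, b))

def merge(a, b):
    """Fill the Ø slots of trace ID `a` with values from event ID `b`."""
    return [y if x is None else x for x, y in zip(a, b)]

ROOT = 0  # index of the root attribute in the trace ID

def split(events):
    """events: (name, trace_id) pairs; returns traces after cleanup."""
    traces = []
    for name, tid in events:
        randc = [t for t in traces
                 if related(t["tid"], tid) and compatible(t["tid"], tid)]
        for t in randc:
            merged = merge(t["tid"], tid)
            if merged != t["tid"]:                 # event modifies TraceID
                traces.append({"events": t["events"] + [name],
                               "tid": merged})     # clone; keep the original
            else:
                t["events"] = t["events"] + [name]
        if tid[ROOT] is not None:                  # event contains the root
            traces.append({"events": [name], "tid": list(tid)})
    # cleanup: drop traces whose event set is contained in another trace's
    return [t for t in traces
            if not any(u is not t and set(t["events"]) <= set(u["events"])
                       for u in traces)]

result = split([("e1", ("a", "b", "c")), ("e2", ("a", "b", None)),
                ("e3", (None, "b", "d")), ("e4", (None, "b", "e"))])
```

Running it on the example events reproduces the walk-through: trace2 is created, fed into the clones for e3 and e4, and finally removed as a subtrace.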
Log mining in ProM
Mined model
Summary
• We can analyze systems we could not analyze before
• Mining the life cycle of objects in databases and the relations between them
• Insights into the real behavior of systems
• Benefits from the data schema (lower dependency on domain knowledge)
Thank you for your attention
Questions?