Redo Log Process Mining
Mining directly from databases
Eduardo González López de Murillas
December 11th 2014

Process Mining
[Figure labels: Log, Discovery, Replay, Performance, "And many other things you already know". A second build of the slide adds a "?".]

Process Mining: New sources of (event) data
• Is there a database coordinating operations?
Databases store execution data! (Redo Logs)

Redo Log mining overview
[Pipeline figure labels: Redo log, Data model, Split in Traces, Event data, Event logs, Plugin outputs]

Event extraction

Redo Logs (in Oracle DB)
• Set of rotated files
• Configurable in size and number
• DBMS manages them for us
• Can be mapped to a table: V$LOGMNR_CONTENTS

Redo Logs
• Records:
  • Changes in rows
  • Changes in the data schema
  • Commits
  • Rollbacks
• Some common fields:
  • TABLE NAME
  • DATABASE
  • OPERATION: INSERT, UPDATE, DELETE, etc.
  • USER
  • TIMESTAMP
  • SEQUENCE NUMBER (order of events)
  • SQL REDO
  • SQL UNDO
  • New / old values for every column

Event extraction
• Redo Log table from newest to oldest: r(n) to r(0)
• Apply changes to DB in reverse.

Tn = now         T(n-1)           T(n-2)
Row  C1  C2      Row  C1  C2      Row  C1  C2
1    a   m       1    a   d       1    a   b
2    c   d       2    c   d       2    c   d
3    e   f       3    e   f

Undo r(n):
  SQL_UNDO: Update row(1).C2 = d
  SQL_REDO: Update row(1).C2 = m
Undo r(n-1):
  SQL_UNDO: Delete row(3)
  SQL_REDO: Insert (Row,C1,C2) = (3,e,f)

Events
• Every Redo record r(i) is an event with attributes:
  • Redo Log fields: Timestamp, Table_name, Operation, Seq number, etc.
  • And values for columns of “Table_name”: values after r(i).SQL_REDO is applied.

Attribute   Value        Attribute   Value
Seq         2            Seq         1
Row         1            Row         3
Operation   UPDATE       Operation   INSERT
C1          a            C1          e
C2          m            C2          f

Note: for the sake of performance, SQL_REDO / SQL_UNDO is never executed (in event extraction). Content of modified rows is stored in temp space.
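The event extraction described above (walk the redo records from newest to oldest, record the row values that hold after each record, then undo it) can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout (`seq`, `row`, `operation`, `old`) and the function name `extract_events` are hypothetical, and an in-memory dictionary stands in for the temp space mentioned in the note. No SQL is executed.

```python
from typing import Dict, List


def extract_events(current_rows: Dict[int, Dict[str, str]],
                   redo_records: List[dict]) -> List[dict]:
    """Turn redo records into events carrying the row values after each change.

    Hypothetical record layout: {"seq", "row", "operation", "old"}, where
    "old" holds the previous values of the changed columns (None for INSERT).
    """
    # Working copy of the table content "now" (stands in for temp space).
    state = {row: dict(values) for row, values in current_rows.items()}
    events = []
    # Newest record first: r(n), r(n-1), ..., r(0).
    for rec in sorted(redo_records, key=lambda r: r["seq"], reverse=True):
        # The current state of the row is its content after r(i) was applied.
        after = dict(state.get(rec["row"], {}))
        events.append({"seq": rec["seq"], "row": rec["row"],
                       "operation": rec["operation"], **after})
        # Undo the record on the working copy to step back in time.
        if rec["operation"] == "INSERT":
            state.pop(rec["row"], None)
        elif rec["operation"] == "UPDATE":
            state[rec["row"]].update(rec["old"])
        elif rec["operation"] == "DELETE":
            state[rec["row"]] = dict(rec["old"])
    events.reverse()  # return events in chronological order
    return events


# The two-record example from the slide.
now = {1: {"C1": "a", "C2": "m"}, 2: {"C1": "c", "C2": "d"}, 3: {"C1": "e", "C2": "f"}}
records = [
    {"seq": 1, "row": 3, "operation": "INSERT", "old": None},
    {"seq": 2, "row": 1, "operation": "UPDATE", "old": {"C2": "d"}},
]
events = extract_events(now, records)
for event in events:
    print(event)
```

The two events produced for this input carry the same attribute values as the two attribute tables on the Events slide.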
Event collections are not logs
• Result of event extraction: Collection of events
• No traces, but... What will the trace ID be?
We need to find a relation between events!

Finding event relations

                 Event 1               Event 2            Event 3
Table            CUSTOMER              BOOKING            TICKET
Operation        INSERT                INSERT             UPDATE
Seq Num          1                     2                  3
Changes vector   111                   11                 001
Timestamp        2014-10-01 12:38      2014-10-01 12:39   2014-10-01 12:40
Columns          ID 1234               ID 2459            ID 9846
                 NAME User01           CUSTOMER 1234      PRICE 35
                 BIRTHDAY 1986-09-17                      BOOKING 2459

• Grouping by COLUMN: INCORRECT
• Grouping by TABLE+COLUMN: Lifecycle of ONE table

Transitive relations
[Figure: the same three events as in the table above]

Mined model
[Figure: the same three events, labeled with the activity names CUSTOMER#INSERT#111, BOOKING#INSERT#11, TICKET#UPDATE#001]

Data Model extraction

Data model
• Tables
• Columns
• Keys (PK, FK, UK)

Log splitting

Log creation (Splitting)
• We need:
  • Event collection
  • Data Model
  • Relation selection
  • Trace ID Pattern
  • Splitting algorithm

Back to the example
[Figure: the same three events and activity names as above]
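The activity names in the example (CUSTOMER#INSERT#111 and so on) look like TABLE#OPERATION#CHANGES_VECTOR. The slides do not define the changes vector, so the sketch below assumes it holds one flag per column of the table, in column order, set to 1 when the record changed that column. The helper names `changes_vector` and `activity_name` are hypothetical.

```python
from typing import Iterable, Sequence


def changes_vector(columns: Sequence[str], changed: Iterable[str]) -> str:
    """One character per column: '1' if the column was changed, else '0'.

    Assumption: this is how the vector on the slides is built.
    """
    changed_set = set(changed)
    return "".join("1" if col in changed_set else "0" for col in columns)


def activity_name(table: str, operation: str,
                  columns: Sequence[str], changed: Iterable[str]) -> str:
    """Build a TABLE#OPERATION#VECTOR label for an event."""
    return f"{table}#{operation}#{changes_vector(columns, changed)}"


# The three events of the example; column lists follow the slide's table.
names = [
    activity_name("CUSTOMER", "INSERT",
                  ["ID", "NAME", "BIRTHDAY"], ["ID", "NAME", "BIRTHDAY"]),
    activity_name("BOOKING", "INSERT", ["ID", "CUSTOMER"], ["ID", "CUSTOMER"]),
    activity_name("TICKET", "UPDATE", ["ID", "PRICE", "BOOKING"], ["BOOKING"]),
]
print(names)
```

Under that assumption, the TICKET update with vector 001 would be a change to the BOOKING column only, which is consistent with the labels shown in the mined model.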
Trace ID Pattern

Trace ID Pattern:
• BOOKING_CON (FK)
• BOOKING_PK (PK)
• BOOKING_FK (FK)
• CUSTOMER_PK (PK)

Trace ID Pattern canonicalization

Trace ID Pattern:
• BOOKING_CON (FK) = TICKET:BOOKING_ID
• BOOKING_PK (PK) = BOOKING:BOOKING_ID
• BOOKING_FK (FK) = BOOKING:CUSTOMER_ID
• CUSTOMER_PK (PK) = CUSTOMER:CUSTOMER_ID

Trace ID = values for:
• BOOKING:BOOKING_ID
• CUSTOMER:CUSTOMER_ID

Splitting algorithm
[Flowchart nodes: Get Next Event; Event Contains Root? (Yes / No); Create new trace; Obtain Subset Traces Related & Compatible (RandC); Is RandC empty? (Yes / No); Get Trace from RandC; Modifies TraceID? (Yes / No); Clone trace; Add event to Trace; Cleanup Subtraces; End]
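The flowchart's main loop (everything before "Cleanup Subtraces") can be sketched as follows. This is one possible reading of the flowchart, not a definitive implementation: the names `Trace`, `split_events`, `compatible`, `related` and `merge_ids` are hypothetical, a trace ID is modeled as a tuple with `None` standing for Ø, and the root is assumed to be the first attribute of the pattern.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

TraceId = Tuple[Optional[str], ...]  # None stands for the null value (Ø)


@dataclass
class Trace:
    events: List[str]
    trace_id: TraceId


def compatible(a: TraceId, b: TraceId) -> bool:
    # No contradictory values; a null value is compatible with anything.
    return all(x is None or y is None or x == y for x, y in zip(a, b))


def related(a: TraceId, b: TraceId) -> bool:
    # At least one attribute with the same (non-null) value.
    return any(x is not None and x == y for x, y in zip(a, b))


def merge_ids(a: TraceId, b: TraceId) -> TraceId:
    # Fill the nulls of a with the values of b.
    return tuple(x if x is not None else y for x, y in zip(a, b))


def split_events(events: List[Tuple[str, TraceId]], root: int = 0) -> List[Trace]:
    traces: List[Trace] = []
    for name, event_id in events:
        # Related & Compatible subset (RandC) of the existing traces.
        randc = [t for t in traces
                 if compatible(t.trace_id, event_id) and related(t.trace_id, event_id)]
        # An event that contains the root starts a new trace.
        if event_id[root] is not None:
            traces.append(Trace([name], event_id))
        for trace in randc:
            merged = merge_ids(trace.trace_id, event_id)
            if merged != trace.trace_id:
                # The event modifies the trace ID: clone, then add the event.
                traces.append(Trace(trace.events + [name], merged))
            else:
                trace.events.append(name)
    return traces


# Four sample events over a three-attribute pattern (first attribute is the root).
sample = [
    ("e1", ("a", "b", "c")),
    ("e2", ("a", "b", None)),
    ("e3", (None, "b", "d")),
    ("e4", (None, "b", "e")),
]
result = split_events(sample)
for t in result:
    print(t.events, t.trace_id)
```

Subtraces are deliberately left in the result here; removing them is the separate cleanup step of the flowchart.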
Compatible & Related traces

Trace    Events   TraceID
trace1   e1,e2    [a,b,c]
trace2   e2       [a,b,Ø]
trace3   e2,e3    [a,b,d]
trace4   e2,e4    [a,b,e]

• Trace1 & trace2 are compatible
  • No contradictory values for any attribute in TraceID
  • Null value is compatible with anything
• Trace1 & trace2 are related
  • At least one attribute with same value in TraceID
• Trace3 & trace4 are related but not compatible

Splitting example

Trace ID Pattern:
• Attr1 (Root)
• Attr2
• Attr3

Event   TraceID
e1      [a,b,c]
e2      [a,b,Ø]
e3      [Ø,b,d]
e4      [Ø,b,e]

Trace    Events   TraceID
-        -        -

First iteration
• traceID(e1) contains root
• Create trace1 = {e1}
• traceID(trace1) = [a,b,c]

Trace    Events   TraceID
trace1   e1       [a,b,c]

Second iteration
• traceID(e2) contains root
• Create trace2 = {e2}
• traceID(trace2) = [a,b,Ø]
• Compatible&Related(e2) = {trace1}
• Does not modify TraceID: Trace1.add(e2)

Trace    Events   TraceID
trace1   e1,e2    [a,b,c]
trace2   e2       [a,b,Ø]

Third iteration
• traceID(e3) does not contain root
• Compatible&Related(e3) = {trace2}
• Modifies traceID: trace3 = clone(Trace2)
• Add e3 to trace3

Trace    Events   TraceID
trace1   e1,e2    [a,b,c]
trace2   e2       [a,b,Ø]
trace3   e2,e3    [a,b,d]

Fourth iteration
• traceID(e4) does not contain root
• Compatible&Related(e4) = {trace2}
• Modifies traceID: trace4 = clone(Trace2)
• Add e4 to trace4

Trace    Events   TraceID
trace1   e1,e2    [a,b,c]
trace2   e2       [a,b,Ø]
trace3   e2,e3    [a,b,d]
trace4   e2,e4    [a,b,e]
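Once every event has been processed, the only step left in the flowchart is "Cleanup Subtraces". The slides do not spell out what counts as a subtrace, so the sketch below assumes a trace is a subtrace when its events form a proper subset of another trace's events. The function name `remove_subtraces` and the dictionary layout are illustrative only.

```python
from typing import Dict, List


def remove_subtraces(traces: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop every trace whose event set is a proper subset of another trace's.

    Assumption: 'subtrace' is judged on events only, not on trace IDs.
    """
    kept = {}
    for name, events in traces.items():
        is_subtrace = any(
            set(events) < set(other_events)  # proper subset, never true for itself
            for other_events in traces.values()
        )
        if not is_subtrace:
            kept[name] = events
    return kept


# The four traces produced by the splitting example.
before_cleanup = {
    "trace1": ["e1", "e2"],
    "trace2": ["e2"],
    "trace3": ["e2", "e3"],
    "trace4": ["e2", "e4"],
}
after_cleanup = remove_subtraces(before_cleanup)
print(sorted(after_cleanup))
```

With this definition, two traces holding exactly the same events would both be kept; a stricter cleanup might also deduplicate them.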
Cleanup

Trace    Events   TraceID
trace1   e1,e2    [a,b,c]
trace2   e2       [a,b,Ø]
trace3   e2,e3    [a,b,d]
trace4   e2,e4    [a,b,e]

• Remove Trace2: is subtrace of Trace3 and Trace4

Trace    Events   TraceID
trace1   e1,e2    [a,b,c]
trace3   e2,e3    [a,b,d]
trace4   e2,e4    [a,b,e]

Log mining in ProM

Mined model

Summary
• We can analyze systems we could not analyze before
• Mining life cycle of objects in Databases and relations between them
• Insights into the real behavior of systems
• Benefits from the data schema (lower dependency on domain knowledge)

Thank you for your attention
Questions?