Download TDMiner User Manual - Department of Electrical Engineering

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
TDMiner User Manual
By
Debprakash Patnaik
Version 0.0d
June 20, 2008
Table of Contents
Introduction to TDMiner ................................................................. 3
Serial episodes ........................................................................................................ 4
Parallel episodes ..................................................................................................... 4
Installation and Setup .................................................................... 5
System Requirements ................................................................................................. 5
Installing and Running the Software ............................................................................ 5
Getting Started .............................................................................. 6
Loading/Creating an Event Sequence ......................................................................... 6
Loading Events from a Comma Separated File ....................................................... 6
Mining the Events ........................................................................................................ 8
Mining for serial episodes with inter-event interval constraint .................................. 8
Anatomy of an episode: ......................................................................................... 10
Saving the episodes .............................................................................................. 10
Mining for parallel episodes with expiry constraint ................................................. 12
Visualizing ................................................................................................................. 14
The Event Sequence ............................................................................................. 14
The Episode Instances .......................................................................................... 14
Graph Visualization ................................................................................................... 17
2
Introduction to TDMiner
Temporal data mining is concerned with the mining of large sequential or ordered
data streams. The data streams dealt with here are only categorical data sequences
or event streams. The data can be viewed as a single long sequence of ordered pairs
(Ei, ti) which are called instantaneous events. In each event (Ei, ti), Ei is referred to as
an event type (which takes its value from a finite alphabet) and ti is the time of
occurrence of the event. There are many applications where data appears in this
form, e.g., spike trains in multi-neuronal recordings, alarm sequences in telecom
networks, web navigation logs etc. The data here can be viewed as a sequence of
ordered pairs (Ei, ti), where Ei denotes the event type and ti is the start time of the
event. The data must have been sorted according to start times.
Frequent episodes are temporal patterns that occur sufficiently often along the data
sequence. The temporal patterns referred to as episodes are ordered collections of
event types. For example A→B→C is a temporal pattern where an event type A is
followed some time later by B and then a C, in that order. The frequent episode
framework aims at discovering all episodes that occur often in the data.
TDMiner is a temporal data mining tool that mines Frequent Episodes from long
sequence of events. The program is written in Java and provides a suit of mining
techniques for different types of frequent episodes and also tools for visualizing the
results obtained from mining.
Frequent episode discovery framework was proposed by Mannila et al (Discovery of
frequent episodes in event sequences, Data Mining and Knowledge Discovery,
1997), in the context analyzing alarm sequences in a communication network.
Laxman et al (Discovering frequent episodes and learning hidden markov models: A
formal connection, IEEE Transactions on Knowledge and Data Engineering, 17(11),
2005) introduced the notion of non-overlapped occurrences as episode frequency
and proposed efficient counting algorithms. This work was then extended by
Debprakash et.al. to include temporal constraints on non-overlapped occurrences of
frequent episodes (Application of Frequent Episode Discovery Framework in
Microelectrode Array Data Analysis, Masters Thesis, Electrical Engineering Dept,
Indian Institute of Science, 2006). For example constraints can be added to time
interval between two consecutive events in a serial episode etc. The TDMiner
package provides the algorithms for mining episodes with all different temporal
constraints.
The data to be analyzed is a sequence of events denoted by (E1, t1), (E2, t2) … where
Ei represents an event type and ti, the time of occurrence of the ith event. Ei's are
drawn from a finite set of event types. The sequence is ordered with respect to time of
occurrences of the events so that, ti< ti+1, for all i=1, 2… The following is an example
event sequence:
Event, Time of occurrence
------------------------A, 0.00396
B, 0.05284
V, 0.05800
R, 0.07332
W, 0.11700
D, 0.21044
Y, 0.24852
A, 0.32056
3
…
The patterns to be discovered in the event sequence are called Episodes. An episode
is a collection of events occurring together. Three different kinds of episodes are
shown in the figure below.
Figure 1: Types of episodes
A) Serial Episode, chained events
B) Parallel Episode, synchronized events
C) Neither Serial nor Parallel Episode, synfire-chain events
Serial episodes
Figure 1A shows a serial episode: it occurs in a data stream only if there are events of
types X, Y and Z that occur in order in the sequence, and denoted by (X→Y→Z). In
the sequence there can be other events interspersed between the occurrences of
event types of an episode (e.g. …(X,1),(A,3),(F,4),(Y,6),(M,9),(Z,10)…).
Parallel episodes
Figure 1B shows a parallel episode: there is no constraint on the relative order of X, Y
and Z. Parallel episodes are denoted e.g. as (XYZ).
Figure 1C shows an example of a non-serial and non-parallel episode: it occurs in a
sequence if events of type X and Y occur in any order followed some time later by an
event of type Z.
In multi-neuron data, a spike event has the label of the neuron (or the electrode
number in case of multi-electrode array recordings) which generated the spike as its
event type and has the associated time of occurrence. The neurons in the ensemble
under observation fire action potentials at different times, that is, generate spike
events. All these spike events are strung together, in time order, to give a single long
data sequence.
4
Installation and Setup
The program is distributed as TDMiner.exe for Windows operating system and
TDMiner.jar for all other platforms.
System Requirements
Software Requirements:
Java Runtime Environment (JRE) 5.0 or above
Download: http://www.java.com
Basic Hardware Requirements:
1GB RAM
Intel Pentium Series 1 GHz onwards
Recommended Hardware Requirements:
200MB of free disk space
2GB real memory
Intel Pentium IV 3 GHz onwards
Installing and Running the Software
.
TDMiner.exe – distribution package for Windows platform.
TDMiner.jar – distribution package for all other platforms.
To run TDMiner.jar
$ java –jar TDMiner.jar
5
Getting Started
In this section we give step by step directions for how to load/create an event
sequence, mine for frequent episodes and visualize the episodes discovered.
Loading/Creating an Event Sequence
Loading Events from a Comma Separated File
The program can load events from an ASCII file stored in comma separated format:
event, time on each line of the input file. The following is a portion of a typical input
file:
A,
B,
V,
R,
W,
J,
D,
Y,
A,
…
0.00396
0.05284
0.05800
0.07332
0.11700
0.14344
0.21044
0.24852
0.32056
Events can be alpha-numeric strings of any size and time must be a numeric value.
The times in the file must be in ascending order.
1. Go to Event Sequence Loader Tab
2. Click on
button in File Name panel.
6
3. Select the data file using the file selection dialog.
4. The following dialog provides options to specify the data format of the file. It is
assumed that each line of the data file contains an event with its time of
occurrence.
5. After the event sequence is loaded, plot of the no. of occurrence of each event
type is displayed along with a table indicating the actual numbers. See below:
7
Mining the Events
Frequent episode discovery is an iterative procedure in which frequent episodes of a
given size are found in each pass over the event stream. These frequent episodes
are used to generate the candidate episodes of the next size. That is, first 1-node
frequent episodes are found through one pass over the data. Using the appropriate
candidate generation scheme, the 1-node frequent episodes are then combined to
obtain a set of 2-node candidate episodes. By counting the frequencies of 2-node
candidates through one pass over the data, frequent 2-node episodes are selected.
This process is continued till frequent episodes of all different sizes are obtained. The
candidate generation strategies depend on the frequency measure used. For details
refer to Chapter 2 of Application of Frequent Episode Discovery Framework in
Microelectrode Array Data Analysis, Masters Thesis, Electrical Engineering Dept,
Indian Institute of Science, 2006.
Mining for serial episodes with inter-event interval constraint
1. Select “Discovery of episodes & inter-event intervals (Serial)” from the dropdown list of Algorithms. This mining algorithm discovers serial episodes with
suitable inter-event intervals chosen from a set of intervals provided by the
user.
2. A threshold type determines whether an episode is frequent/significant based
on its number of occurrence in the data. The “Episode Connection Strength”
based Threshold Type is selected by default. The Estrong panel takes the
parameters for this threshold type.
3. Connection Strength: It is the probability of the subsequent event to follow an
event that is part of an episode.
8
4. Type I Error: This is the allowed type I error of incorrectly classifying an
episode as frequent/significant.
5. Time Granularity: This specifies the granularity at which the data sequence is
discretized for computing significance.
6. For this algorithm we need to specify a set of inter-event intervals. The
algorithm then automatically selects the suitable interval for each pair of
consecutive event types in an episode. The intervals are entered in the
Constraints Panel using
button. The following dialog allows us to
enter intervals as pairs of Tlow and Thigh (indicating the start and end of an
interval).
7. “Episode Size” on constraints panel indicates the maximum size of an episode
up to which the mining algorithm will run.
8. After all the parameters are selected we mine the data using
button.
Note:
The other type of frequency threshold available for this mining is “Explicit Decaying
Threshold”.
This is the cut-off frequency expressed as a fraction of the number events in the data
sequence. All episodes that occur more number of times than the cut-off frequency
times the number of events in the sequence are declared as frequent. For example if
threshold = 0.01 and number of events = 20,000, then all frequent episodes must
occur more than 200 (0.01 x 20,000) time is the data. The threshold can be
expressed separately for one-node and two-node onwards episodes. The Decay
Multiplier is used to determine the threshold for episodes of size > 2 if the mining
process continues. That is for 3-node episodes threshold = 0.018 (2-node threshold)
x 0.9 = 0.0162, and the threshold for 4-node episodes = 0.0162 (3-node threshold) x
0.9 = 0.01458.
9
After the event sequence is mined the results are displayed in the Result panel as
shown below. The left panel can be used to browse through episode sets of different
sizes. Clicking on different headers of the table sort the episode sets in different
orders.
Anatomy of an episode:
A[0.004-0.006]-B[0.004-0.006]-C[0.004-0.006]-D #(0.855) : 255
1. A,B,C,D – Event types in the episode (appearing in that order)
2. [0.004-0.006] – The inter-event time between consecutive event types. This
can be different for different pairs.
3. #(0.855) – The conditional probability of one event being followed by the other
(assuming this is same for each consecutive pair of events)
4. 255 – Number of non-overlapped counts in the data sequence of the given
episode.
Saving the episodes
The
button on Results panel saves the mined episodes into a text file.
Show below is an example of a stored episode set.
Frequent Episode discovery results
---------------------------------Episode set size = 4
Episodes of size = 1
----------------------No.
N :
J :
Z :
R :
L :
F :
K :
V :
T :
M :
I :
Y :
B :
of 1 node frequent episodes = 26
432
413
408
408
407
406
404
402
402
401
400
399
394
10
U : 392
C : 390
G : 390
P : 389
D : 383
A : 381
W : 381
E : 378
Q : 377
H : 374
O : 371
S : 364
X : 357
----------------------Episodes of size = 2
----------------------No. of 2 node frequent episodes = 3
B[0.004-0.006]-C #(0.738) : 343
C[0.004-0.006]-D #(0.738) : 338
A[0.004-0.006]-B #(0.738) : 333
----------------------Episodes of size = 3
----------------------No. of 3 node frequent episodes = 2
B[0.004-0.006]-C[0.004-0.006]-D #(0.824) : 297
A[0.004-0.006]-B[0.004-0.006]-C #(0.824) : 290
----------------------Episodes of size = 4
----------------------No. of 4 node frequent episodes = 1
A[0.004-0.006]-B[0.004-0.006]-C[0.004-0.006]-D #(0.855) : 255
-----------------------
11
Mining for parallel episodes with expiry constraint
1. Select “Non-overlapped count with episode expiry constraint (Parallel)” from
the drop-down list of Algorithms. This mining algorithm discovers parallel
episodes with the expiry time constraint provided by the user.
2. A threshold type determines whether an episode is frequent/significant based
on its number of occurrence in the data. Here we can use “Explicit Decay
Threshold” or “Auto-threshold for parallel episode” in the Threshold Type
panel. Explicit Decay Threshold is same as that described above in Serial
Episode mining section.
Auto-threshold for parallel episode: Here a chance occurrence of an episode
in a window of time less than the expiry constraint is considered as success of
a random binomial trail with p = probability of arrival of each of the constituent
event types in the time interval of the expiry constraint.
Error for Parallel Episode is probability of incorrectly selecting a randomly
occurring set of events as a significant parallel episode. Hence it may be
useful to have extremely small values for this e.g. 1e-5.
3. Enter the expiry constraint in sec in the Constraints panel. All episode
occurrences mined here complete within the specified time duration.
12
After the event sequence is mined the results are displayed in the Result panel as
shown below. The left panel can be used to browse through episode sets of different
sizes. Clicking on different headers of the table sort the episode sets in different
orders. Although parallel episodes are shown as a sequence of even types, there is
no order between the shown event types.
13
Visualizing
The Event Sequence
The event sequence can be visualized on a typical raster plot on Event Sequence
Display panel.
Each spike represents a separate event. The y-axis shows the id/label of the event
type and the x-axis shows the time stamp.
Standard Zoom In , Zoom Out
and Scroll operations are supported on the plot.
The reset
button can be used to fit the entire event sequence again into the
display.
The Episode Instances
1. Episode occurrence can be displayed on the raster plot using
from the operations panel.
2. The following dialog is displayed for selecting episodes.
14
3. We select the episodes using the check box beside them. The Select All button
selects all episodes shown on the dialog.
4. The Add Episode panel on the above dialog can be used to enter episodes not
already mined.
The format for specifying episodes here is as follows
Serial Episodes with inter-event intervals: M [0.009-0.011] N [0.004-0.006] O [0.0140.016] P
Each pair event type is separated by the inter-event interval applicable to it.
Parallel Episodes: M N O P (after parallel episode mining has been done)
15
The above plot shows occurrences of episodes A[0.004-0.006]-B[0.004-0.006]-C[0.0040.006]-D and M[0.009-0.011]-N[0.004-0.006]-O[0.014-0.016]-P. It turns out that there are
no occurrences of the second episode satisfying the specified inter-event time
constraints.
16
Graph Visualization
The frequent episodes mined in the process can be strung together to build the
structure of the under-lying network. The Graph Visualization panel provided a way to
see the graph structure based on the mined episodes. The display button loads the
frequent episodes into the graph display. The Level indicates the size of episodes that
are displayed as edges of the graph. The EStrong slider helps limit episodes of certain
conditional probability and above.
Shown below is an example of graph visualization.
17