Discovering Representative Episodal Association Rules from Event Sequences
Using Frequent Closed Episode Sets and Event Constraints
Sherri K. Harms, Jitender Deogun
Department of CSCE
University of Nebraska
Lincoln, NE 68588-0115
sharms{deogun}@cse.unl.edu

Jamil Saquer
CS Department
SWMS University
Springfield, MO 65804
[email protected]

Tsegaye Tadesse
NDMC
University of Nebraska
Lincoln, NE 68588-0115
[email protected]
Abstract

Discovering association rules from time-series data is an important data mining problem. The number of potential rules grows quickly as the number of items in the antecedent grows, so it is difficult for an expert to analyze the rules and identify the useful ones. An approach for generating representative association rules for transactions that uses only a subset of the set of frequent itemsets, called frequent closed itemsets, was presented in [6]. We employ formal concept analysis to develop the notion of frequent closed episodes. The concept of representative association rules is formalized in the context of event sequences. Applying constraints to target highly significant rules further reduces the number of rules. Our approach results in a significant reduction of the number of rules generated, while maintaining the minimum set of relevant association rules and retaining the ability to generate the entire set of association rules with respect to the given constraints. We show how our method can be used to discover associations in a drought risk management decision support system and use multiple climatology datasets related to automated weather stations.1
1 This research was supported in part by NSF Digital Government Grant No. EIA-0091530 and NSF EPSCOR Grant No. EPS-0091900.
1. Introduction

Discovering association rules is an important data mining problem. The problem was first defined in the context of market basket data to identify customers' buying habits [1]. The problem of analyzing and identifying interesting rules becomes difficult as the number of rules increases, and in most applications the number of rules discovered is large. Two different approaches to handling this problem have been reported: 1) identifying the association rules that are of special importance to the user, and 2) minimizing the number of association rules discovered [2]. Most of these approaches introduce additional measures of interestingness for a rule and, as a post-processing step, prune the rules that do not satisfy the additional measures. A set of representative association rules, on the other hand, is a minimal set of rules from which all association rules can be generated during the actual processing step. Usually, the number of representative association rules is much smaller than the number of all association rules, and no additional measures are needed for determining the representative association rules [4].

Recently, Saquer and Deogun developed a different approach for generating representative association rules [6]. Similarly, Zaki [8] used frequent closed itemsets to generate non-redundant association rules in CHARM.

We use closure as the basis for generating frequent sets in the context of sequential data. We then generate sequential association rules based on representative association rule approaches while integrating constraints into our approach. By combining these techniques, our method is well suited for time-series problems that have groupings of events that occur close together in time but relatively infrequently over the entire dataset. We apply this technique to the drought risk management problem.

2. Frequent Closed Episodes
Our overall goal is to analyze event sequences, discover
recurrent patterns of events, and generate sequential association rules. Our approach is based on the concept of representative association rules combined with event constraints.
A sequential dataset is normalized and then discretized
by forming subsequences using a sliding window [5]. Using
a sliding window of size δ, every normalized time stamp
value x_t is used to compute each of the new sequence values y_{t−δ/2} to y_{t+δ/2}. Thus, the dataset has been divided into
segments, each of size δ. The discretized version of the
time series is obtained by using some clustering algorithm
and a suitable similarity measure. Each cluster identifier is
an event type, and the set of cluster labels is the class of
events E.
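As an illustration (not part of the original method), the following Python sketch shows one way such a discretization might look. The function name discretize is illustrative, and simple quantile binning stands in for the unspecified clustering algorithm and similarity measure.

def discretize(values, delta, num_events=3):
    # Average each length-delta segment of the normalized series
    # (a simple stand-in for forming subsequences with a sliding window).
    averages = []
    for start in range(0, len(values) - delta + 1):
        segment = values[start:start + delta]
        averages.append(sum(segment) / len(segment))
    # Quantile boundaries play the role of cluster boundaries; each bin
    # label is an event type, so the set of labels is the class E.
    ordered = sorted(averages)
    bounds = [ordered[(i * len(ordered)) // num_events] for i in range(1, num_events)]
    def label(v):
        for i, b in enumerate(bounds):
            if v < b:
                return chr(ord('A') + i)
        return chr(ord('A') + num_events - 1)
    return [label(v) for v in averages]   # the event sequence S

if __name__ == "__main__":
    series = [0.1, 0.2, 0.8, 0.9, 0.4, 0.3, 0.7, 0.6, 0.2, 0.1]
    print(discretize(series, delta=3))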
The newly formed version of the time series is referred
to as an event sequence. An event sequence is a triple
(t_B, t_D, S), where t_B is the beginning time, t_D is the ending time, and S is a finite, time-ordered sequence of events [5]. That is, S = (e_{t_B}, e_{t_{B+1p}}, e_{t_{B+2p}}, . . . , e_{t_{B+dp}} = e_{t_D}), where p is the step size between each event, d is the total number of steps in the time interval from time t_B to time t_D, and D = B + dp. Each e_{t_i} is a member of a class of events E, and t_i ≤ t_{i+1} for all i = B, . . . , D − p. A sequence of
events S includes events from a single class of events E.
A window on an event sequence S is an event subsequence W = {e_{t_j}, . . . , e_{t_k}}, where t_B ≤ t_j and t_k ≤ t_D + 1 [5]. The width of the window W is width(W) = t_k − t_j. The set of all windows W on S with width(W) = win is denoted W(S, win). The width of the window is prespecified.
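A small Python sketch of window enumeration follows (illustrative names; the boundary convention of [5] is only approximated): windows slide one time unit at a time and are allowed to extend slightly past the ends of the sequence, so the first window contains only the first event and the last window only the last event.

def windows_on(sequence, win):
    # Return the event-type sets of all windows of width win on an
    # event sequence given as time-ordered (time, event) pairs.
    times = [t for t, _ in sequence]
    t_begin, t_end = times[0], times[-1]
    result = []
    for start in range(t_begin - win + 1, t_end + 1):
        result.append({e for t, e in sequence if start <= t < start + win})
    return result

if __name__ == "__main__":
    S = [(1, 'A'), (2, 'B'), (3, 'A'), (4, 'C'), (5, 'B')]
    for w in windows_on(S, win=3):
        print(sorted(w))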
An episode in an event sequence is a combination of
events with a partially specified order. The type of an
episode is parallel if no order is specified, and serial if the
events of the episode have a fixed order. An episode is injective if no event type occurs more than once in the episode.
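To make the episode types concrete, here is a small Python sketch (illustrative function names) of the two occurrence tests: a parallel episode occurs in a window when all of its event types are present, and a serial episode when they appear in the given order.

def occurs_parallel(episode, window):
    # Parallel episode: every event type appears, in any order.
    return set(episode) <= set(window)

def occurs_serial(episode, window):
    # Serial episode: the event types appear as a subsequence,
    # i.e. in the given order (not necessarily adjacent).
    it = iter(window)
    return all(e in it for e in episode)

if __name__ == "__main__":
    window = ['A', 'C', 'B', 'A']
    print(occurs_parallel(('B', 'A'), window))  # True
    print(occurs_serial(('B', 'A'), window))    # True: B, then a later A
    print(occurs_serial(('B', 'C'), window))    # False: no C after B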
We extend the work of Mannila et al. [5] to consider
closed sets of episodes. We use formal concept analysis as
the basis for developing the notion of closed episode sets
[6]. Informally, a concept is a pair of sets: a set of objects (windows) and a set of features (episodes) common to all objects in the set.
Definition 1 An episodal data mining context is defined as a triple (W(S, win), E, R), where W(S, win) is the set of all windows of width win defined on the event sequence S, E is a set of episodes in the event sequence S, and R ⊆ W × E.
Definition 2 Let (W, E, R) be an episodal data mining context, X ⊆ W, and Y ⊆ E. Define the mappings α and β as follows:
β : 2^W → 2^E, β(X) = {e ∈ E | (w, e) ∈ R ∀ w ∈ X},
α : 2^E → 2^W, α(Y) = {w ∈ W | (w, e) ∈ R ∀ e ∈ Y}.
The mapping β(X) associates with X the set of episodes that are common to all the windows in X. Similarly, the mapping α(Y) associates with Y the set of all windows containing all the episodes in Y. Intuitively, β(X) is the maximum set of episodes shared by all windows in X, and α(Y) is the maximum set of windows possessing all the episodes in Y.
It is easy to see that, in general, for any set Y of episodes, β(α(Y)) ≠ Y. A set of episodes Y that satisfies the condition β(α(Y)) = Y is called a closed set of episodes [6].
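The following Python sketch (with illustrative names beta, alpha, and is_closed, and with the relation R encoded as a dictionary from window identifiers to the sets of episodes occurring in them) illustrates the two mappings and the closed-set test on a toy context.

def beta(X, R):
    # Episodes common to all windows in X; beta(empty set) would be all
    # of E, approximated here by every episode appearing in R.
    if not X:
        return set().union(*R.values()) if R else set()
    return set.intersection(*(set(R[w]) for w in X))

def alpha(Y, R):
    # Windows containing all episodes in Y.
    return {w for w, episodes in R.items() if set(Y) <= episodes}

def is_closed(Y, R):
    return beta(alpha(Y, R), R) == set(Y)

if __name__ == "__main__":
    R = {1: {'A', 'B'}, 2: {'A', 'B', 'C'}, 3: {'B', 'C'}}
    print(alpha({'A'}, R))           # {1, 2}
    print(beta(alpha({'A'}, R), R))  # {'A', 'B'}, so {'A'} is not closed
    print(is_closed({'A', 'B'}, R))  # True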
The frequency of an episode is defined as the fraction
of windows in which the episode occurs. Given an event
sequence S, and a window width win, the frequency of an
episode P of a given type in S is:
fr(P, S, win) = |{w ∈ W(S, win) : P occurs in w}| / |W(S, win)|
Given a frequency threshold min_fr, P is frequent if fr(P, S, win) ≥ min_fr. A frequent closed set of episodes (FCE) is a closed set of episodes that satisfies the minimum frequency threshold. The closure of an episode set X ⊆ E, denoted closure(X), is the smallest closed episode set containing X; it is equal to the intersection of all frequent closed episode sets containing X.
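A minimal Python sketch of the frequency computation, assuming windows have already been enumerated as sets of event types and using parallel occurrence; the names frequency and is_frequent are illustrative.

def frequency(P, windows):
    # fr(P, S, win): fraction of windows in which P occurs; parallel
    # occurrence, i.e. all event types of P appear in the window.
    hits = sum(1 for w in windows if set(P) <= w)
    return hits / len(windows)

def is_frequent(P, windows, min_fr):
    return frequency(P, windows) >= min_fr

if __name__ == "__main__":
    windows = [{'A', 'B'}, {'A'}, {'A', 'B', 'C'}, {'C'}]
    print(frequency(('A', 'B'), windows))          # 0.5
    print(is_frequent(('A', 'B'), windows, 0.25))  # True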
To generate frequent closed target episodes, we develop an algorithm called Gen-FCE, shown in Figure 1. Gen-FCE is a combination of the Close-FCI algorithm [6], the WINEPI frequent episode algorithms [5], and the Direct algorithm [7]. Gen-FCE generates FCE with respect to a given set of Boolean target constraints B, an event sequence S, a window width win, an episode type, a minimum frequency min_fr, and a window step size p. The Gen-FCE algorithm requires one database pass during each iteration.
1) Generate Candidate Frequent Closed Target Episodes of length 1 (CFC_{1,B});
2) k = 1;
3) while (CFC_{k,B} ≠ ∅) do
4)    Read the sequence S, one window at a time; let FCE_{k,B} be the elements in CFC_{k,B} with a new closure and with a frequency ≥ min_fr;
5)    Generate Candidate Frequent Closed Target Episodes CFC_{k+1,B} from FCE_{k,B};
6)    k++;
7) return ∪_{i=1}^{k−1} {FCE_{i,B}.closure and FCE_{i,B}.frequency};
Figure 1. Gen-FCE algorithm.
We incorporate constraints in a manner similar to the Direct algorithm [7], which is known to work well at low minimum supports and on large datasets [7]. Because this approach requires an expensive cross-product operation, the candidate generation algorithm of [5] is used for disjunctive singleton constraints.
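The sketch below is a much-simplified, illustrative rendering of the Gen-FCE loop for injective parallel episodes, not the authors' implementation: windows are plain sets of event types, closures are computed as intersections of supporting windows, the target constraint B is reduced to a set of required event types, and candidate generation is naive.

def gen_fce(windows, all_events, target_events, min_fr):
    # windows: list of sets of event types; target_events plays the
    # role of the Boolean target constraints B.
    def supporting(X):
        return [w for w in windows if X <= w]

    def closure(X):
        sup = supporting(X)
        return frozenset(set.intersection(*map(set, sup))) if sup else frozenset(X)

    results = {}                                     # closed set -> frequency
    candidates = [frozenset([e]) for e in all_events if e in target_events]
    while candidates:
        next_level = set()
        for X in candidates:
            fr = len(supporting(X)) / len(windows)
            c = closure(X)
            if fr >= min_fr and c not in results:    # frequent, new closure
                results[c] = fr
                for e in all_events:                 # extend by one event
                    if e not in X:
                        next_level.add(X | {e})
        candidates = next_level
    return results

if __name__ == "__main__":
    windows = [{'D', 'P'}, {'D', 'P', 'V'}, {'P'}, {'D', 'V'}]
    fce = gen_fce(windows, all_events={'D', 'P', 'V'},
                  target_events={'D'}, min_fr=0.25)
    for closed_set, fr in sorted(fce.items(), key=lambda kv: len(kv[0])):
        print(sorted(closed_set), fr)

On the toy windows above, this sketch reports the closed sets {D}, {D, P}, {D, V}, and {D, P, V} together with their frequencies.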
3 Representative Episodal Association Rules
We use the set of frequent closed episodes FCE produced from the Gen-FCE algorithm to generate the representative episodal association rules that cover the entire set of association rules [4].
The cover of a rule r : X ⇒ Y, denoted by C(r), is the set of association rules that can be generated from r. That is, C(r : X ⇒ Y) = {X ∪ U ⇒ V | U, V ⊆ Y, U ∩ V = ∅, and V ≠ ∅}. An important property of the cover operator, stated in [4], is that if an association rule r has support s and confidence c, then every rule r′ ∈ C(r) has support at least s and confidence at least c.
Using the cover operator, a set of representative association rules with minimum support s and minimum confidence c, RAR(s, c), is defined as follows: RAR(s, c) = {r ∈ AR(s, c) | ∄ r′ ∈ AR(s, c), r ≠ r′ and r ∈ C(r′)}. That is, a set of representative association rules is a least set of association rules that covers all the association rules and from which all association rules can be generated. Clearly, AR(s, c) = ∪{C(r) | r ∈ RAR(s, c)}.
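The cover operator and the coverage test can be enumerated directly from the definition above; the Python sketch below uses illustrative names and represents a rule X ⇒ Y as a pair of frozensets.

from itertools import combinations

def subsets(s):
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def cover(X, Y):
    # C(X => Y) = { X u U => V : U, V subsets of Y, U and V disjoint, V nonempty }.
    rules = set()
    for U in subsets(Y):
        for V in subsets(Y - U):
            if V:
                rules.add((frozenset(X) | U, V))
    return rules

def covered_by(rule, other):
    # True if `rule` can be generated from (is in the cover of) `other`.
    return rule in cover(*other)

if __name__ == "__main__":
    r = (frozenset({'A'}), frozenset({'B', 'C'}))
    for ant, con in sorted(cover(*r), key=lambda rc: (len(rc[0]), sorted(rc[0]), sorted(rc[1]))):
        print(sorted(ant), '=>', sorted(con))
    print(covered_by((frozenset({'A', 'B'}), frozenset({'C'})), r))   # True

For r : A ⇒ BC this prints the five rules A ⇒ B, A ⇒ C, A ⇒ BC, AB ⇒ C, and AC ⇒ B.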
Gen-REAR, shown in Figure 2, is a modification of the Generate-RAR algorithm [6] that generates REAR for a given set of frequent closed episodes FCE with respect to a minimum confidence c.
1) k = the size of the longest frequent closed episode in FCE;
2) while (k > 1) do
3)    Generate REAR_k by adding each rule X ⇒ Z\X such that Z.support/X.support ≥ c and X ⇒ Z\X is not covered by a previously generated rule;
4)    k--;
5) return REAR;
Figure 2. Gen-REAR algorithm.
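The following Python sketch gives a simplified, illustrative reading of Gen-REAR, not the authors' implementation: it walks the frequent closed episode sets from longest to shortest, computes confidence from their supports (using the property that the support of an episode set equals the support of its closure), and keeps only rules not covered by a previously generated rule.

from itertools import combinations

def proper_nonempty_subsets(s):
    s = list(s)
    return [frozenset(c) for r in range(1, len(s)) for c in combinations(s, r)]

def covers(rule, other):
    # `other` lies in the cover of `rule` exactly when rule's antecedent
    # is a subset of other's antecedent, other's items are a subset of
    # rule's items, and other's consequent is non-empty (this follows
    # from the definition of C(r) above).
    (x1, y1), (x2, y2) = rule, other
    return x1 <= x2 and (x2 | y2) <= (x1 | y1) and len(y2) > 0

def support(itemset, fce):
    # Support of an episode set equals the support of its closure, i.e.
    # the largest support among its closed supersets.
    return max(s for closed, s in fce.items() if itemset <= closed)

def gen_rear(fce, min_conf):
    rear = []
    for Z in sorted(fce, key=len, reverse=True):          # longest first
        for X in proper_nonempty_subsets(Z):
            if fce[Z] / support(X, fce) >= min_conf:
                rule = (X, Z - X)
                if not any(covers(r, rule) for r in rear):
                    rear.append(rule)
    return rear

if __name__ == "__main__":
    fce = {frozenset({'D'}): 0.75, frozenset({'D', 'P'}): 0.5,
           frozenset({'D', 'V'}): 0.5, frozenset({'D', 'P', 'V'}): 0.25}
    for ant, con in gen_rear(fce, min_conf=0.5):
        print(sorted(ant), '=>', sorted(con))

On the toy FCE above this emits four representative rules, for example {P} ⇒ {D, V}.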
Using our technique on multiple time series while constraining the episodes to a user-specified target set, we can
find relationships that occur across the sequences.
4 Empirical Results
We are developing an advanced Geospatial Decision
Support System (GDSS) to improve the quality and accessibility of drought-related data for drought risk management
[3]. Our objective is to integrate spatio-temporal knowledge
discovery techniques into the GDSS using a combination
of data mining techniques applied to geospatial time-series
data by: 1) finding relationships between user-specified target episodes and other climatic events and 2) predicting the
target episodes. The REAR approach will be used to meet
the first objective. In this paper we validate the effectiveness of the REAR approach to find relationships between
drought episodes at the automated weather station in Mead,
NE, and other climatic episodes, from 1989-1999. We compare it to the WINEPI algorithm [5]. We use data from nine
sources, including satellite vegetation data and precipitation
and soil moisture data.
We experimented with several different window widths, minimal frequency values, and minimal confidence values, for both parallel and serial episodes. When using constraints, we specified droughts as our target episodes. The experiments were run on an AMD Athlon 1.3 GHz PC with 256 MB of main memory, under the Windows 2000 operating system.
4.1 Gen-FCE vs. WINEPI
Tables 1 and 2 present performance statistics for finding frequent closed episodes in the drought risk management dataset for Mead, NE, with various frequency thresholds, for injective serial drought episodes with a 2-month window, using the Gen-FCE and WINEPI algorithms, respectively.
Table 1. Gen-FCE serial episode performance.

  min-fr   Candidates   Freq. Closed Episodes   Iters   time (s)
  0.05     525          77                      3       2
  0.10     335          24                      2       1
  0.15     153          10                      2       1
  0.20     93           6                       2       0
  0.25     83           5                       2       0
Table 2. WINEPI serial episode performance.

  min-fr   Candidates   Freq. Closed Episodes   Iters   time (s)
  0.05     17284        3950                    6       6932
  0.10     4687         629                     5       205
  0.15     1704         229                     4       10
  0.20     807          102                     4       1
  0.25     567          58                      3       1
Gen-FCE performs extremely well when finding the
drought episodes. The number of frequent closed episodes
decreases rapidly as the frequency threshold increases. For
the sample dataset at a frequency threshold of 0.10 and a
window width of 2 months, Gen-FCE produces 6 frequent
drought serial episodes while WINEPI produces 1600%
more (102) episodes.
Because we are working with a fraction of the possible number of episodes, our algorithms are extremely efficient. When finding all frequent drought episodes for
the sample dataset using a window width of 5 months, the
running time was 1 second for Gen-FCE and 6 hours for
WINEPI. This illustrates the benefits of using closures and
constraints when working with the infrequently occurring
drought events.
As the window size increases, so does the frequent
episode generation time and the number of frequent
episodes. When using drought constraints, the increase occurs at a much slower pace than with WINEPI. For the sample dataset
and a window width of 3 months, Gen-FCE produces 53
frequent drought serial episodes while WINEPI produces
5779% more (3116) episodes.
4.2 Gen-REAR vs. WINEPI Association Rules
We next experimented with finding association rules in
the drought risk management dataset for Mead, NE with
various confidence thresholds and window widths using the
Gen-REAR and WINEPI AR algorithms for injective parallel and serial episodes. The number of rules decreases
rapidly as the confidence threshold increases and increases
rapidly as the window width widens. In all cases, Gen-REAR produces fewer rules than the WINEPI AR algorithm. Using the Gen-REAR approach, all the rules can be
generated if desired, even though the meaning of the additional AR’s is captured by the smaller set of REAR’s.
Gen-REAR performs extremely well when finding
drought episodal rules as shown in Table 3. The number
of REAR’s decreases rapidly as the confidence interval increases. For the sample dataset at a confidence threshold
of 0.20 and a window width of 2 months, Gen-REAR produces 24 drought parallel episodal rules while WINEPI AR
produces 20892% more (5038) rules. With the same parameters, Gen-REAR produces 14 drought serial episodal rules
while WINEPI AR produces 16257% more (2290) rules.
Table 3. Gen-REAR parallel and serial rules.

  Confidence threshold   Parallel distinct rules   Serial distinct rules
  0.20                   24                        14
  0.25                   24                        12
  0.30                   19                        9
  0.35                   13                        7
  0.40                   10                        6
  0.45                   8                         5
As the window width widens, Gen-REAR produces far fewer rules than the WINEPI algorithm. The number of REAR's increases with the window width. For the sample dataset at a window width of 3 months, Gen-REAR produces 30 parallel drought episodal rules while WINEPI AR produces 53763% more (16159) rules. With the same parameters, Gen-REAR produces 8 serial drought episodal rules while WINEPI AR produces 24825% more (1994) rules. The savings are substantial. The Gen-REAR algorithm
finds the drought REAR’s for all reasonable window widths
and confidence levels on the Mead, NE drought risk management dataset in less than 30 seconds. As the window
widens, the WINEPI AR algorithm quickly becomes computationally infeasible to use for the drought risk management problem.
5 Conclusion
This paper presented Gen-REAR, a new approach for generating representative episodal association rules. We also presented Gen-FCE, a new approach for generating the frequent closed episode sets that conform to user-specified constraints. Our approach results in a large reduction in the input size for generating representative episodal association rules for targeted episodes, while retaining the ability to generate the entire set of association rules. We also studied the gain in efficiency of generating targeted representative episodal association rules, as compared to the traditional WINEPI algorithm, on a multiple time-series drought risk management problem.
References
[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD 1993 International Conference on Management of Data (SIGMOD 93), pages 207-216, Washington, D.C., 1993.
[2] R. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-based rule mining in large, dense databases. In Proceedings of ICDE-99, 1999.
[3] S. K. Harms, S. Goddard, S. E. Reichenbach, W. J. Waltman, and T. Tadesse. Data mining in a geospatial decision support system for drought risk management. In Proceedings of the 2001 National Conference on Digital Government Research, pages 9-16, Los Angeles, California, USA, May 2001.
[4] M. Kryszkiewicz. Fast discovery of representative association rules. In Lecture Notes in Artificial Intelligence, volume 1424, pages 214-221. Proceedings of RSCTC 98, Springer-Verlag, 1998.
[5] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Technical Report C-1997-15, Department of Computer Science, University of Helsinki, Finland, 1997.
[6] J. Saquer and J. S. Deogun. Using closed itemsets for discovering representative association rules. In Proceedings of the Twelfth International Symposium on Methodologies for Intelligent Systems (ISMIS 2000), Charlotte, NC, October 11-14, 2000.
[7] R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD 97), pages 67-73, 1997.
[8] M. Zaki. Generating non-redundant association rules. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD 2000), pages 34-43, Boston, MA, USA, August 20-23, 2000.