Discovering Representative Episodal Association Rules from Event Sequences Using Frequent Closed Episode Sets and Event Constraints

Sherri K. Harms, Jitender Deogun
Department of CSCE, University of Nebraska, Lincoln, NE 68588-0115
sharms{deogun}@cse.unl.edu

Jamil Saquer
CS Department, SWMS University, Springfield, MO 65804
[email protected]

Tsegaye Tadesse
NDMC, University of Nebraska, Lincoln, NE 68588-0115
[email protected]

Abstract

Discovering association rules from time-series data is an important data mining problem. The number of potential rules grows quickly as the number of items in the antecedent grows, so it is difficult for an expert to analyze the rules and identify the useful ones. An approach for generating representative association rules for transactions that uses only a subset of the set of frequent itemsets, called frequent closed itemsets, was presented in [6]. We employ formal concept analysis to develop the notion of frequent closed episodes. The concept of representative association rules is formalized in the context of event sequences. Applying constraints to target highly significant rules further reduces the number of rules. Our approach results in a significant reduction of the number of rules generated, while maintaining the minimum set of relevant association rules and retaining the ability to generate the entire set of association rules with respect to the given constraints. We show how our method can be used to discover associations in a drought risk management decision support system, using multiple climatology datasets related to automated weather stations.1

1 This research was supported in part by NSF Digital Government Grant No. EIA-0091530 and NSF EPSCOR Grant No. EPS-0091900.

1. Introduction

Discovering association rules is an important data mining problem. The problem was first defined in the context of market basket data to identify customers' buying habits [1]. Analyzing and identifying interesting rules becomes difficult as the number of rules increases, and in most applications the number of rules discovered is large. Two different approaches to handle this problem have been reported: 1) identifying the association rules that are of special importance to the user, and 2) minimizing the number of association rules discovered [2]. Most of these approaches introduce additional measures for the interestingness of a rule and prune the rules that do not satisfy those measures as a post-processing step.

A set of representative association rules, on the other hand, is a minimal set of rules from which all association rules can be generated, during the actual processing step. Usually, the number of representative association rules is much smaller than the number of all association rules, and no additional measures are needed for determining them [4]. Recently, Saquer and Deogun developed a different approach for generating representative association rules [6]. Similarly, Zaki [8] used frequent closed itemsets to generate non-redundant association rules in CHARM. We use closure as the basis for generating frequent sets in the context of sequential data. We then generate sequential association rules based on representative association rule approaches while integrating constraints into our approach. By combining these techniques, our method is well suited for time series problems that have groupings of events that occur close together in time, but occur relatively infrequently over the entire dataset. We apply this technique to the drought risk management problem.

2. Frequent Closed Episodes

Our overall goal is to analyze event sequences, discover recurrent patterns of events, and generate sequential association rules.
Our approach is based on the concept of representative association rules combined with event constraints. A sequential dataset is normalized and then discretized by forming subsequences using a sliding window [5]. Using a sliding window of size δ, every normalized time stamp value x_t is used to compute each of the new sequence values y_{t−δ/2} to y_{t+δ/2}. Thus, the dataset is divided into segments, each of size δ. The discretized version of the time series is obtained by using some clustering algorithm and a suitable similarity measure. Each cluster identifier is an event type, and the set of cluster labels is the class of events E. The newly formed version of the time series is referred to as an event sequence.

An event sequence is a triple (t_B, t_D, S), where t_B is the beginning time, t_D is the ending time, and S is a finite, time-ordered sequence of events [5]. That is, S = (e_{t_B}, e_{t_B+p}, e_{t_B+2p}, ..., e_{t_B+dp} = e_{t_D}), where p is the step size between events, d is the total number of steps in the time interval from t_B to t_D, and D = B + dp. Each e_{t_i} is a member of a class of events E, and t_i ≤ t_{i+1} for all i. A sequence of events S includes events from a single class of events E.

A window on an event sequence S is an event subsequence W = (e_{t_j}, ..., e_{t_k}), where t_B ≤ t_j and t_k ≤ t_D + 1 [5]. The width of the window W is width(W) = t_k − t_j. The set of all windows W on S with width(W) = win is denoted W(S, win). The width of the window is prespecified.

An episode in an event sequence is a combination of events with a partially specified order. The type of an episode is parallel if no order is specified, and serial if the events of the episode have a fixed order. An episode is injective if no event type occurs more than once in the episode. We extend the work of Mannila et al. [5] to consider closed sets of episodes, using formal concept analysis as the basis for developing the notion of closed episode sets [6].
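The discretization step above can be illustrated with a small sketch. This is not the paper's implementation: the use of segment means as the similarity feature and quantile binning as a stand-in for "some clustering algorithm" are illustrative assumptions, as are all names below.

```python
import statistics

def to_event_sequence(x, delta, n_bins=3):
    """Discretize a numeric series into an event sequence (illustrative sketch).

    Each sliding segment of width delta is summarized by its mean, and the
    means are quantile-binned; each bin id ('A', 'B', ...) is an event type.
    """
    # Normalize the raw series (z-score).
    mu, sd = statistics.mean(x), statistics.pstdev(x)
    z = [(v - mu) / sd for v in x]
    # Divide the series into overlapping segments of size delta.
    segments = [z[i:i + delta] for i in range(len(z) - delta + 1)]
    means = [sum(s) / delta for s in segments]
    # Quantile binning stands in for a clustering algorithm here.
    srt = sorted(means)
    edges = [srt[len(srt) * k // n_bins] for k in range(1, n_bins)]
    return [chr(ord('A') + sum(m >= e for e in edges)) for m in means]

events = to_event_sequence([1, 2, 3, 10, 11, 12, 1, 2, 3], delta=3)
print(events)  # ['A', 'B', 'C', 'C', 'C', 'B', 'A']
```

The resulting list of cluster labels is the event sequence that the episode-mining steps below operate on.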
Informally, a concept is a pair of sets: a set of objects (windows or episodes) and a set of features (events) common to all objects in the set.

Definition 1. An episodal data mining context is a triple (W(S, win), E, R), where W(S, win) is the set of all windows of width win defined on the event sequence S, E is a set of episodes in the event sequence S, and R ⊆ W × E.

Definition 2. Let (W, E, R) be an episodal data mining context, X ⊆ W, and Y ⊆ E. Define the mappings α and β as follows:

β : 2^W → 2^E, β(X) = {e ∈ E | (w, e) ∈ R for all w ∈ X},
α : 2^E → 2^W, α(Y) = {w ∈ W | (w, e) ∈ R for all e ∈ Y}.

The mapping β(X) associates with X the set of episodes that are common to all the windows in X. Similarly, the mapping α(Y) associates with Y the set of all windows containing all the episodes in Y. Intuitively, β(X) is the maximum set of episodes shared by all windows in X, and α(Y) is the maximum set of windows possessing all the episodes in Y. It is easy to see that in general, for any set Y of episodes, β(α(Y)) ≠ Y. A set of episodes Y that satisfies the condition β(α(Y)) = Y is called a closed set of episodes [6].

The frequency of an episode is the fraction of windows in which the episode occurs. Given an event sequence S and a window width win, the frequency of an episode P of a given type in S is

fr(P, S, win) = |{w ∈ W(S, win) : P occurs in w}| / |W(S, win)|.

Given a frequency threshold min_fr, P is frequent if fr(P, S, win) ≥ min_fr. A frequent closed set of episodes (FCE) is a closed set of episodes that satisfies the minimum frequency threshold. The closure of an episode set X ⊆ E, denoted closure(X), is the smallest closed episode set containing X and is equal to the intersection of all closed episode sets containing X. To generate frequent closed target episodes, we develop an algorithm called Gen-FCE, shown in Figure 1.
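The frequency definition can be made concrete with a short sketch for injective parallel episodes. The function name and toy data are illustrative; note that the window definition above also admits windows extending slightly past the sequence ends (t_k ≤ t_D + 1), which this sketch omits for simplicity.

```python
def frequency(S, episode, win):
    """fr(P, S, win): fraction of width-win windows on S containing all of P.

    S is a list of event types; episode is a set of event types, read as a
    parallel, injective episode. Only fully interior windows are counted.
    """
    windows = [S[i:i + win] for i in range(len(S) - win + 1)]
    hits = sum(1 for w in windows if set(episode) <= set(w))
    return hits / len(windows)

S = list("ABCABDAB")
fr = frequency(S, {"A", "C"}, 3)  # 3 of the 6 windows contain both A and C
print(fr)  # 0.5
```

An episode is then frequent when this fraction meets the chosen min_fr threshold.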
Gen-FCE is a combination of the Close-FCI algorithm [6], the WINEPI frequent episode algorithms [5], and the Direct algorithm [7]. Gen-FCE generates FCE with respect to a given set of Boolean target constraints B, an event sequence S, a window width win, an episode type, a minimum frequency min_fr, and a window step size p. The Gen-FCE algorithm requires one database pass during each iteration.

1) Generate candidate frequent closed target episodes of length 1 (CFC_{1,B});
2) k = 1;
3) while (CFC_{k,B} ≠ ∅) do
4)   Read the sequence S one window at a time; let FCE_{k,B} be the elements in CFC_{k,B} with a new closure and with frequency ≥ min_fr;
5)   Generate candidate frequent closed target episodes CFC_{k+1,B} from FCE_{k,B};
6)   k++;
7) return ∪_{i=1}^{k−1} {FCE_{i,B}.closure and FCE_{i,B}.frequency};

Figure 1. Gen-FCE algorithm.

We incorporate constraints similar to the Direct algorithm [7]. This approach is known to work well at low minimum supports and on large datasets [7]. It requires an expensive cross-product operation, so for disjunctive singleton constraints the candidate generation algorithm is used [5].

3. Representative Episodal Association Rules

We use the set of frequent closed episodes FCE produced by the Gen-FCE algorithm to generate the representative episodal association rules that cover the entire set of association rules [4]. The cover of a rule r : X ⇒ Y, denoted C(r), is the set of association rules that can be generated from r. That is,

C(r : X ⇒ Y) = {X ∪ U ⇒ V | U, V ⊆ Y, U ∩ V = ∅, and V ≠ ∅}.

An important property of the cover operator, stated in [4], is that if an association rule r has support s and confidence c, then every rule r′ ∈ C(r) has support at least s and confidence at least c. Using the cover operator, the set of representative association rules with minimum support s and minimum confidence c, RAR(s, c), is defined as

RAR(s, c) = {r ∈ AR(s, c) | ¬∃ r′ ∈ AR(s, c), r ≠ r′ and r ∈ C(r′)}.
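The cover operator is easy to enumerate directly. The sketch below (illustrative names, not from the paper) generates C(X ⇒ Y) as a set of (antecedent, consequent) pairs:

```python
from itertools import chain, combinations

def subsets(s):
    """All subsets of s as frozensets, including the empty set."""
    s = list(s)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def cover(X, Y):
    """C(X => Y) = { X∪U => V : U, V ⊆ Y, U ∩ V = ∅, V ≠ ∅ }."""
    X, Y = frozenset(X), frozenset(Y)
    return {(X | U, V)
            for U in subsets(Y)
            for V in subsets(Y - U) if V}  # V disjoint from U and nonempty

rules = cover({"a"}, {"b", "c"})
print(len(rules))  # 5 rules: a=>b, a=>c, a=>bc, ab=>c, ac=>b
```

Since every rule in C(r) has support and confidence at least those of r, it is safe to report only rules not covered by another rule, which is exactly the representativeness condition above.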
That is, a set of representative association rules is a least set of association rules that covers all the association rules and from which all association rules can be generated. Clearly, AR(s, c) = ∪ {C(r) | r ∈ RAR(s, c)}. Gen-REAR, shown in Figure 2, is a modification of Generate-RAR [6] that generates REAR for a given set of frequent closed episodes FCE with respect to a minimum confidence c.

1) k = the size of the longest frequent closed episode in FCE;
2) while (k > 1) do
3)   Generate REAR_k by adding each rule X ⇒ Z\X such that Z.support/X.support ≥ c and X ⇒ Z\X is not covered by a previously generated rule;
4)   k−−;
5) return REAR;

Figure 2. Gen-REAR algorithm.

Using our technique on multiple time series while constraining the episodes to a user-specified target set, we can find relationships that occur across the sequences.

4. Empirical Results

We are developing an advanced Geospatial Decision Support System (GDSS) to improve the quality and accessibility of drought-related data for drought risk management [3]. Our objective is to integrate spatio-temporal knowledge discovery techniques into the GDSS using a combination of data mining techniques applied to geospatial time-series data by: 1) finding relationships between user-specified target episodes and other climatic events, and 2) predicting the target episodes. The REAR approach is used to meet the first objective. In this paper we validate the effectiveness of the REAR approach in finding relationships between drought episodes at the automated weather station in Mead, NE, and other climatic episodes, from 1989-1999. We compare it to the WINEPI algorithm [5]. We use data from nine sources, including satellite vegetation data and precipitation and soil moisture data. We experimented with several different window widths, minimal frequency values, and minimal confidence values, for both parallel and serial episodes. When using constraints, we specified droughts as our target episodes.
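Returning to Figure 2, the loop can be sketched as follows. This is a simplified, hedged reading of Gen-REAR, not the paper's implementation: `supports` is assumed to map episode sets (frozensets of event types) to their frequencies, and the cover test uses the containment condition from [4] rather than enumerating covers.

```python
from itertools import chain, combinations

def proper_subsets(z):
    """All nonempty proper subsets of z as frozensets."""
    z = list(z)
    return [frozenset(s) for s in
            chain.from_iterable(combinations(z, r) for r in range(1, len(z)))]

def is_covered(rule, emitted):
    # X => V is covered by X' => V' iff X' ⊆ X and X ∪ V ⊆ X' ∪ V' [4].
    X, V = rule
    return any(Xp <= X and (X | V) <= (Xp | Vp) for Xp, Vp in emitted)

def gen_rear(supports, c):
    """Emit representative rules, longest frequent closed episodes first."""
    rear = []
    for Z in sorted(supports, key=len, reverse=True):
        for X in proper_subsets(Z):
            if X in supports and supports[Z] / supports[X] >= c:
                rule = (X, Z - X)
                if not is_covered(rule, rear):
                    rear.append(rule)
    return rear

supports = {frozenset("A"): 0.6, frozenset("AB"): 0.5, frozenset("ABC"): 0.4}
rules = gen_rear(supports, 0.6)  # the single rule {A} => {B, C} covers the rest
```

Working from the longest episodes down means the widest-covering rules are emitted first, so shorter rules they cover are pruned immediately.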
The experiments were run on an AMD Athlon 1.3GHz PC with 256 MB main memory, under the Windows 2000 operating system.

4.1 Gen-FCE vs. WINEPI

Tables 1 and 2 present performance statistics for finding frequent closed episodes in the drought risk management dataset for Mead, NE, with various frequency thresholds, for injective serial drought episodes with a 2-month window, using the Gen-FCE and WINEPI algorithms, respectively.

Table 1. Gen-FCE serial episode performance.

  min-fr   Candidates   Freq. Closed Episodes   Iters   time (s)
  0.05     525          77                      3       2
  0.10     335          24                      2       1
  0.15     153          10                      2       1
  0.20     93           6                       2       0
  0.25     83           5                       2       0

Table 2. WINEPI serial episode performance.

  min-fr   Candidates   Freq. Episodes   Iters   time (s)
  0.05     17284        3950             6       6932
  0.10     4687         629              5       205
  0.15     1704         229              4       10
  0.20     807          102              4       1
  0.25     567          58               3       1

Gen-FCE performs extremely well when finding the drought episodes. The number of frequent closed episodes decreases rapidly as the frequency threshold increases. For the sample dataset at a frequency threshold of 0.20 and a window width of 2 months, Gen-FCE produces 6 frequent drought serial episodes while WINEPI produces 1600% more (102 episodes). Because we are working with a fraction of the possible number of episodes, our algorithms are extremely efficient. When finding all frequent drought episodes for the sample dataset using a window width of 5 months, the running time was 1 second for Gen-FCE and 6 hours for WINEPI. This illustrates the benefits of using closures and constraints when working with the infrequently occurring drought events. As the window size increases, so do the frequent episode generation time and the number of frequent episodes; when using drought constraints, the increase is at a much slower pace than with WINEPI. For the sample dataset and a window width of 3 months, Gen-FCE produces 53 frequent drought serial episodes while WINEPI produces 5779% more (3116 episodes).

4.2 Gen-REAR vs. WINEPI Association Rules

We next experimented with finding association rules in the drought risk management dataset for Mead, NE, with various confidence thresholds and window widths, using the Gen-REAR and WINEPI AR algorithms for injective parallel and serial episodes. The number of rules decreases rapidly as the confidence threshold increases and grows rapidly as the window width widens. In all cases, Gen-REAR produces fewer rules than the WINEPI AR algorithm. Using the Gen-REAR approach, all the rules can be generated if desired, even though the meaning of the additional ARs is captured by the smaller set of REARs.

Gen-REAR performs extremely well when finding drought episodal rules, as shown in Table 3. The number of REARs decreases rapidly as the confidence threshold increases. For the sample dataset at a confidence threshold of 0.20 and a window width of 2 months, Gen-REAR produces 24 drought parallel episodal rules while WINEPI AR produces 20892% more (5038 rules). With the same parameters, Gen-REAR produces 14 drought serial episodal rules while WINEPI AR produces 16257% more (2290 rules).

Table 3. Gen-REAR parallel and serial rules.

  Confidence threshold   Parallel distinct rules   Serial distinct rules
  0.20                   24                        14
  0.25                   24                        12
  0.30                   19                        9
  0.35                   13                        7
  0.40                   10                        6
  0.45                   8                         5

As the window width widens, Gen-REAR overwhelmingly produces fewer rules than the WINEPI algorithm; the number of REARs increases with the window width. For the sample dataset at a window width of 3 months, Gen-REAR produces 30 parallel drought episodal rules while WINEPI AR produces 53763% more (16159 rules). With the same parameters, Gen-REAR produces 8 serial drought episodal rules while WINEPI AR produces 24825% more (1994). The savings are obvious. The Gen-REAR algorithm finds the drought REARs for all reasonable window widths and confidence levels on the Mead, NE drought risk management dataset in less than 30 seconds.
As the window widens, the WINEPI AR algorithm quickly becomes computationally infeasible for the drought risk management problem.

5. Conclusion

This paper presents Gen-REAR, a new approach for generating representative episodal association rules. We also presented Gen-FCE, a new approach for generating the frequent closed episode sets that conform to user-specified constraints. Our approach results in a large reduction in the input size for generating representative episodal association rules for targeted episodes, while retaining the ability to generate the entire set of association rules. We also studied the gain in efficiency of generating targeted representative episodal association rules, as compared to the traditional WINEPI algorithm, on a multiple time series drought risk management problem.

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD 1993 International Conference on Management of Data [SIGMOD 93], pages 207-216, Washington, D.C., 1993.
[2] R. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-based rule mining in large, dense databases. In Proceedings of ICDE-99, 1999.
[3] S. K. Harms, S. Goddard, S. E. Reichenbach, W. J. Waltman, and T. Tadesse. Data mining in a geospatial decision support system for drought risk management. In Proceedings of the 2001 National Conference on Digital Government Research, pages 9-16, Los Angeles, California, USA, May 2001.
[4] M. Kryszkiewicz. Fast discovery of representative association rules. In Lecture Notes in Artificial Intelligence, volume 1424, pages 214-221. Proceedings of RSCTC 98, Springer-Verlag, 1998.
[5] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Technical Report C-1997-15, Department of Computer Science, University of Helsinki, Finland, 1997.
[6] J. Saquer and J. S. Deogun. Using closed itemsets for discovering representative association rules. In Proceedings of the Twelfth International Symposium on Methodologies for Intelligent Systems [ISMIS 2000], Charlotte, NC, October 11-14, 2000.
[7] R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining [KDD 97], pages 67-73, 1997.
[8] M. Zaki. Generating non-redundant association rules. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining [KDD 2000], pages 34-43, Boston, MA, USA, August 20-23, 2000.