Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Proceedings of National Conference on New Horizons in IT - NCNHIT 2013 173 Traditional Set Based Approach in Mining Sequential Patterns Alpa Reshamwala and Dr. Sunita Mahajan Abstract--- Sequential pattern mining is an important data mining problem with broad applications. Sequential Pattern Mining is to discover the frequent sequential pattern in the sequential event dataset. In this paper, we have implemented sequential pattern mining algorithm based on traditional set theory. The proposed algorithm is implemented on a network intrusion dataset from KDD Cup 1999, 10 percent of training dataset, which is the annual Data Mining and Knowledge Discovery competition organized by ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miners. Keywords--- Data Mining, Sets, Sequence Data, Time Series, Intrusion Detection System, DoS Attacks I. INTRODUCTION W ITH massive amounts of data continuously being collected and stored, many industries are becoming interested in mining sequential patterns from their database. Sequential pattern mining is one of the most well-known methods and has broad applications including web-log analysis, customer purchase behavior analysis and medical record analysis. Many approaches have been proposed to extract information, and mining sequential patterns is one of the most important ones [1][2][3]. Sequential Pattern Mining finds interesting sequential patterns among the large database. It finds out frequent subsequences as patterns from a sequence database. It has been a great challenge to improve the efficiency of Apriori algorithm. Since all the frequent sequential patterns are included in the maximum frequent sequential patterns, the task of mining frequent sequential patterns can be converted as mining maximum frequent sequential patterns. PrefixSpan [8] is not a refinement of the apriori-like, candidate generationand-test approach, rather a divide-and-conquer approach, called pattern-growth approach, which is an extension of FPgrowth [9], an efficient pattern-growth algorithm for mining frequent patterns without candidate generation. SPAM [10] algorithm is based on the lexicographic sequence tree and can make either depth or width first traversal. In this paper, all these algorithms are implemented to mine frequent sequential patterns on KDD Cup 1999 dataset to predict DoS attack sequences on network traffic data. Alpa Reshamwala, Assistant Professor, Computer Engineering Department, MPSTME, SVKM’s NMIMS University, Mumbai, India. E-mail: [email protected] Dr. Sunita Mahajan, Principal, Institute of Computer Science, M.E.T, Bandra, Mumbai, India. E-mail: [email protected] II. RELATED WORK After mid 1990’s, following Agrawal and Srikant [1], many scholars provided more efficient algorithms [8][11][12][13]. Besides these, works have been done to extend the mining of sequential patterns to other time-related patterns. Existing approaches to find appropriate sequential patterns in time related data are mainly classified into two approaches. In the first approach developed by Agarwal and Srikant [14], the algorithm extends the well-known Apriori algorithm. This type of algorithms is based on the characteristic of Apriori—that any sub-pattern of a frequent pattern is also frequent [1]. The latter, uses a pattern growth approach [8], employs the same idea used by the Prefix-Span algorithm. This algorithm divides the original database into smaller sub-databases and solves them recursively. Previous research addresses time intervals in two typical ways, first by the timewindow approach, and second by completely ignoring the time interval. First, the time window approach requires the length of the time window to be specified in advance. A sequential pattern mined from the database is thus a sequence of windows, each of which includes a set of patterns. Patterns in the same time window are bought in the same time period. In the algorithm [12], Shrikant and Agrawal, specified the maximum interval (max-interval), the minimum interval (mininterval) and the sliding time window size (windowsize). Moreover, they cannot find a pattern whose interval between any two sequences is not in the range of the window-size. Agrawal and Srikant [1], introduced traditional sequential mining, by ignoring the time interval and including only the temporal order of the patterns. To address the intervals between successive patterns in sequence database, Chen et al. have proposed a generalization of sequential patterns, called time-interval sequential patterns, which reveals not only the order of patterns, but also the time intervals between successive patterns [4]. Chen et al. developed algorithms to find sequential patterns using both the approaches [4]. Their work, by assuming the partition of time interval as fixed, developed two efficient algorithms -I-Apriori and I- PrefixSpan. The first algorithm is based on the conventional Apriori algorithm, while the second one is based on the PrefixSpan algorithm. An extension of the algorithm developed by Chen et al [4], to solve the problem of sharp boundaries to provide a smooth transition between members and non-members of a set, is addressed in Chen et al [5]. The sharp boundary problems can be solved by the concept of fuzzy sets. The concept included fuzzy time interval (FTI) pattern. Two efficient algorithms, the ISBN 978-93-82338-79-6 Proceedings of National Conference on New Horizons in IT - NCNHIT 2013 FTI-Apriori algorithm and the FTI-PrefixSpan algorithm, were developed for mining FTI sequential patterns. There are several other reasons that support the use of FTI in place of crisp time interval. First, the human knowledge can be easily represented by fuzzy logic. Second, it is widely recognized that many real world situations are intrinsically fuzzy, and the partition of time interval is one of them. Third, FTI is simple and easy for users. Fuzzy logic addresses the formal principles of approximate reasoning. It provides a sound foundation to handle imprecision and vagueness as well as mature inference mechanisms by varying degrees of truth. As boundaries are not always clearly defined, fuzzy logic can be used to identify complex pattern or behavior variations. And it can be accomplished by building an intrusion detection system that combines fuzzy logic rules with an expert system in charge of evaluating rule truthfulness. In [6], we contributed to the ongoing research on FTI sequential pattern mining by proposing an algorithm to detect and classify audit sequential patterns in network traffic data. The paper also defines the confidence of the FTI audit sequences, which is not yet defined in the previous researches. In [7], we have proposed an algorithm which uses a fuzzy genetic approach to discover optimized sequences in the network traffic data to classify and detect intrusion. Anrong et al [15], addresses application of sequential pattern in intrusion detection by refining the pattern rules and reducing redundant rules. Their work implements PrefixSpan algorithm in the data mining module of network intrusion detection system (NIDS). Shang Gao et al [16], describes a setbased approach for mining association rules and finding frequent sequential patterns in customer transactional databases. Their approach relaxes the constraints described in Apriori (All/Some), and improves the performance while being more user-oriented and self-adaptive than the probabilistic knowledge representation. In this paper, all these algorithms are implemented to mine frequent sequential patterns on KDD Cup 1999 dataset to predict DoS attack sequences on network traffic data III. 174 the same support. An itemset with minimum support is called as the large itemset or litemset. Table I: Sequence Database Sid Audit Sequence 10 20 <(a,1),(b,4),(e,29)> <(d,1),(a,2),(d,24)> 30 40 50 <(b,1),(a,11),(e,28)> <(f,1), (b,5),(c,19)> <(a,4),(b,5),(d,10),(e,28)> 60 <(a,0),(b,5),(e,30)> 70 <(j,2),(a,17),(h,17)> 80 <(c,3),(i,10),(f,18)> 90 <(h,4),(a,10),(b,21)> 100 <(g,0),(a,0),(b,3),(e,30)> IV. ALGORITHMS Figure 1 depicts the working of the algorithm to find frequent sequences using set theory. Consider the sequence dataset D,as in Table I. To avoid multiple scan of the dataset D, the dataset is stored in the Hash Map data structure in Java. Now, by applying the Apriori algorithm of candidate generation and considering minimum support of 0.3. In the first pass, find L0 by scanning the dataset D to generate large 1sequences. By Apriori principle C1, candidates are generated. Find L1 satisfying the min_supp =0.3, we get 1-sequence <a>, <b> and <e>. Also form a set of Sequence_id of each of these L1 candidates as shown in Figure 1. For example Sid for 1sequence itemset <a>: {10, 20, 30, 50, 60, 70, 90, 100}, <b>: {10, 30, 40, 50, 60, 90, 100} and <e>: {10, 30, 50, 60, 100}. Interestingness of the 1- sequence is found by applying the set intersection of the set of all the Sid of the candidates in L1. Sid <a> ∩ Sid <b> ∩ Sid <e> SET THEORY Set theory is the branch of mathematical logic that studies sets, which are collections of objects. Although any type of object can be collected into a set. A set theory features binary operations on sets: Union of the sets A and B, denoted A ∪ B, is the set of all objects that are a member of A, or B, or both. The union of {1, 2, 3} and {2, 3, 4} is the set {1, 2, 3, 4}. Intersection of the sets A and B, denoted A ∩ B, is the set of all objects that are members of both A and B. The intersection of {1, 2, 3} and {2, 3, 4} is the set {2, 3}. Consider the sequence database as shown in Table I. The length of a sequence is the number of itemsets in the sequence. A sequence of length k is called a k-sequence. The sequence formed by the concatenation of two sequences x and y is denoted as x.y. the support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction. Thus the itemset i and the 1-sequence <i> have Next pass or when k>=2, we will be considering only those set of Sequence_id which resulted from the previous pass intersection if Sid’s of the l-sequence, where l is the length of sequence. When l=1, we get, a set of Sid {10, 30, 50, 60, 100}. Thus C2 will be generated from this reduced dataset D’ stored as a hash map. Find L2 satisfying the min_supp =0.3, we get 2sequence <a b>, <a e> and <b e>. Also form a set of Sequence_id of each of these L2 candidates as shown in Figure 1. For example, Sid for 2-sequence itemset are. <a>: {10, 30, 50, 60, 100}. <b>: {10, 30, 50, 60, 100}. <e>: {10, 30, 50, 60, 100}. Similarly, Interestingness of the k- sequence is found by intersection of the set of all the Sid of the candidates in Lk ISBN 978-93-82338-79-6 Proceedings of National Conference on New Horizons in IT - NCNHIT 2013 For the example in figure 1 we get interestingness of the 2sequence is by applying the set intersection of the set of all the Sid of the candidates in L2. 175 L1 = Candidates in C1 with minimum support. end. Interestingness of the 1- sequence is found by intersection of the set of all the Sid of the candidates in L1 Sid <a b> ∩ Sid <a e> ∩ Sid <b e> We get, a set of Sid {10, 30, 50, 60, 100}. Repeating the earlier pass till Lk. For the example in figure 1 we get, frequent longest sequence pattern as <a b e> with minimum support >= 0.3. Sid <i1> ∩ Sid <i2> ….∩ Sid<in>; i1,i2,…in - itemsets for (k=2; Lk-1 ≠ φ; k++) do begin Lk = Candidates with minimum support The algorithm is as follows. Interestingness of the k- sequence is found by intersection of the set of all the Sid of the candidates in Lk Algorithm: L0 = Scan the database to generate large 1- sequences; C1 = new candidates generated from L0. end. for each sequence c in the database do Increment the count of all candidates in C1 that are contained in c. Maximal Sequences in UkLk Figure 1: Large Frequent litemset These algorithms were implemented in Sun Java language V. RESULTS AND DISCUSSION and tested on an Intel Core Duo Processor, 2.10 GHz with In this section we perform a simulation study to compare 2GB main memory under Windows XP operating system. the performances of the algorithms: Set Based AprioriALL, AprioriAll based on traditional set theory shrinks the the Apriori [1], PrefixSpan [8] and SPAM [10], which find database size. It also scans the database at most once. Also, as sequential patterns without time intervals. the interestingness of the itemset is increased with the database shrinking leads to longest sequences. As the database is reduced the time taken to mine sequences also ISBN 978-93-82338-79-6 Proceedings of National Conference on New Horizons in IT - NCNHIT 2013 reduces and is faster than traditional algorithms. The Complexity of the Algorithm can also be reduced. As we can observe in the Figure 2 and 3, AprioriAll based on Set theory generates maximum number of sequences as per the Apriori principal. Also, it takes only 1 fixed database scan for k- itemset as compared to k database scans for k-itemset in traditional 176 VI. CONCLUSIONS AND FUTURE WORK In this paper we have implemented Apriori a candidate generation algorithm and PrefixSpan a pattern growth algorithm on KDD Cup 1999 dataset to predict DoS attack sequences on network traffic data using Set theory concepts. Experimental results have shown that Set theory based AprioriALL has generated maximum, longest and efficient sequential patterns. These results are then compared with Traditional AprioriALL, SPAM (Sequential Pattern Mining), and PrefixSpan algorithm. The results in Figure 2, also shows that Set theory based AprioriALL performs well for large datasets like KDD Cup 1999 dataset. The second comparison is done on the number of frequent sequence patterns found executing these algorithms with the varying minimum support threshold and from the results, it is shown that the Set theory based AprioriALL gives the longest efficient sequential patterns. On applying set intersection operation, the interestingness of the itemset is increased with the database shrinking leads to longest sequences. Figure 2: Performance of AprioriAll Set based Algorithm In future work, as in these experiments we have found sequence patterns, by ignoring the time interval and including only the temporal order of the patterns. The approach can be extended to more set-based mathematical models for further data analysis in order to discover hidden sequential patterns. To address the intervals between successive patterns in sequence database, Chen et al. have proposed a generalization of sequential patterns, called time-interval sequential patterns, which reveals not only the order of patterns, but also the time intervals between successive patterns [4]. An extension of the algorithm developed by Chen et al [4], to solve the problem of sharp boundaries to provide a smooth transition between members and non-members of a set, is addressed in Chen et al [5] can also be implemented. Also as proposed in [7], the use of fuzzy genetic approach to discover optimized sequences in the network traffic data to classify and detect intrusion can also be implemented. Figure 3: No. of Patterns of AprioriAll Set based Algorithm REFERENCES [1] AprioriAll. It also generates longest sequences. As the itemsets which satisfies the minimum support constraints will together generate the longest sequences. The interestingness of the itemset increases by taking the intersection of the sequence-id’s in which the itemsets are present. The first comparison is based on the KDD cup 99 dataset where the minimum support threshold is varied 20 % to 90%. Figure 2 summarizes those results. All the results show that Set based AprioriALL algorithm is at par with Prefixspan and SPAM algorithm as copared to AprioriALL algorithm. The second comparison is done on the number of frequent sequence patterns found executing these algorithms with the varying minimum support threshold. From the results in Figure 3, it is shown that Set based AprioriALL generates maximum as well as efficient number of sequential patterns [2] [3] [4] [5] [6] [7] [8] R. Agrawal and R. Srikant, “Mining sequential patterns”, In Proc. Int. Conf. Data Engineering, 1995, pp. 3–14. Y. L. Chen, S. S. Chen, and P. Y. Hsu, “Mining hybrid sequential patterns and sequential rules”, Inf. Syst., vol. 27, no. 5, pp. 345–362, 2002. J. Han and M. Kamber, Data Mining: Concepts and Techniques, New York: Academic, 2001. Y. L. Chen, M. C. Chiang, and M. T. Ko, “Discovering time-interval sequential patterns in sequence databases”, Expert Systems with Applications, Volume 25, Issue 3, October 2003, pp 343–354. Yen-Liang, Tony Cheng-Kui Huang, “Discovering Fuzzy TimeInterval Sequential Patterns in Sequence Databases”, IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, 2005, vol.35, pp.959-972. Sunita Mahajan and Alpa Reshamwala, “Amalgamation of IDS Classification with Fuzzy techniques for Sequential pattern mining “,IJCA Proceedings on International Conference on Technology Systems and Management - ICTSM 2011, Number 3 - Article 7, pp 9– 14. Sunita Mahajan and Alpa Reshamwala, “An Approach to Optimize Fuzzy Time-Interval Sequential Patterns Using Multi-objective Genetic Algorithm”, ICTSM 2011, CCIS 145, pp. 115–120, 2011, SpringerVerlag Berlin Heidelberg 2011. Pei, J., Han, J., Pinto, H., Chen, Q., Dayal, U., & Hsu, M.-C., “PrefixSpan: Mining sequential patterns efficiently by prefix-projected ISBN 978-93-82338-79-6 Proceedings of National Conference on New Horizons in IT - NCNHIT 2013 [9] [10] [11] [12] [13] [14] [15] [16] pattern growth”, Proceedings of 2001 International Conference on Data Engineering, pp. 215–224. J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate Generation,” Proc. 2000 ACM-SIGMOD Int’l Conf. Management of Data (SIGMOD ’00), pp. 1-12, May 2000. J. Ayres, J. Gehrke, T. Yiu, and J. Flannick, “Sequential PAttern Mining using A Bitmap Representation”, In Proceedings of ACM SIGKDD on Knowledge discovery and data mining, pp. 429-435, 2002. Han, J., Pei, J., Mortazavi-Asl, B., Chen, Q., Dayal, U., & Hsu, M.-C., “FreeSpan: Frequent pattern-projected sequential pattern mining”, Proceedings of 2000 International Conference on Knowledge Discovery and Data Mining, pp. 355–359. Srikant, R., & Agrawal, R., “Mining sequential patterns: Generalizations and performance improvements”, Proceedings of the 5th International Conference on Extending Database Technology,1996, pp. 3–17. Zaki, M. J., “SPADE: An efficient algorithm for mining frequent sequences”, volume 42 Issue 1-2, January-February 2001, pp 31–60. R. Agrawal and R. Srikant, “Fast algorithms for mining association rules”, Proceedings of 20th VLDB Conference Santiago, Chile, 1994, pp. 487–499. XUE Anrong, HONG Shijie, JU Shiguang, CHEN Weihe, “Application of Sequential Patterns Based on User’s Interest in Intrusion Detection”, Proceedings of 2008 IEEE International Symposium on IT in Medicine and Education, pp 1089- 1093, 2008. Shang Gao, Reda Alhaji, Jon Rokne, Jiwen Guan, “Set Based Approach in Mining Sequential Patterns”, 24th International Symposium on Computer and Information Sciences, 2009. ISCIS 2009, pp 218 – 223. ISBN 978-93-82338-79-6 177