Bonfring International Journal of Data Mining, Vol. 4, No. 1, March 2014

Improving Efficiency of Apriori Algorithms for Sequential Pattern Mining

Alpa Reshamwala and Dr. Sunita Mahajan

Abstract--- Computer systems are exposed to an increasing number and variety of security threats as the internet has expanded in recent years. Detecting network intrusions effectively has therefore become an important security technique. Many intrusions are not composed of a single event but of a series of attack steps taken in chronological order: an intrusion is a multi-step process in which a number of events must occur sequentially in order to launch a successful attack. Analyzing the order in which events occur can improve attack detection accuracy and reduce false alarms. Intrusion detection using sequential pattern mining is a research topic in the field of information security. Sequential pattern mining is used to discover frequent sequential patterns in an event dataset. Sequential pattern mining algorithms can be broadly classified into Apriori based, pattern growth based, and a combination of both. The first class is based on the Apriori characteristic and the second uses a pattern growth approach. The major drawback of Apriori based algorithms is the multiple scans of the database needed to generate maximal patterns. In this paper, a simulation study of two algorithms, the original AprioriALL algorithm and a modified AprioriALL algorithm that optimizes the processing by including set theory techniques, is carried out on a network intrusion dataset from KDD Cup 1999. Experimental results show that the modified algorithm shrinks the dataset size and scans the database at most twice. Also, as the interestingness of the itemsets increases with the dataset shrinking, it leads to efficient sequences with high associativity. As the database is reduced, the time taken to mine sequences also reduces, and the modified algorithm is faster than the original Apriori based algorithm.
Keywords--- Data mining, Sets, Sequence data, Time series, Intrusion detection system, DoS attacks

I. INTRODUCTION

With massive amounts of data continuously being collected and stored, many industries are becoming interested in identifying sequential patterns in their databases. Sequential pattern mining is one of the most well-known methods and has broad applications, including web-log analysis, customer purchase behavior analysis, medical record analysis, market analysis, decision support, music recommendation, fraud detection, intrusion detection and business management. Many approaches have been proposed to extract information, and mining sequential patterns is one of the most important ones [1][2][3]. It was first proposed by Agrawal et al. for shopping basket data analysis [1]. Sequential pattern mining (SPM) finds interesting sequential patterns in a large database by discovering frequent subsequences in a sequence database. In addition, constraint-based sequential pattern mining algorithms, based on the pattern growth approach, and projection-based database methods have been proposed. Moreover, research on SPM has been extended in several directions, such as closed sequential pattern mining, parallel mining, distributed mining, multi-dimensional sequential pattern mining and approximate sequential pattern mining.

Existing approaches to finding appropriate sequential patterns in time-related data fall mainly into two classes. The first approach, developed by Agrawal and Srikant [14], extends the well-known Apriori algorithm. Algorithms of this type rely on the Apriori characteristic that any sub-pattern of a frequent pattern is also frequent [1]. The second uses a pattern growth approach [8] and employs the same idea as the PrefixSpan algorithm.

Improving the efficiency of the Apriori algorithm has been a great challenge. Since all frequent sequential patterns are included in the maximum frequent sequential patterns, the task of mining frequent sequential patterns can be converted into the task of mining maximum frequent sequential patterns. AprioriALL [1] is based on the Apriori algorithm. In each pass, the large sequences from the previous pass are used to generate the candidate sequences, whose support is then measured by making a pass over the database. In this paper, both the Apriori based algorithm AprioriALL [1] and the modified algorithm AprioriAll_Set are implemented to mine frequent sequential patterns.

II. RELATED WORK

After the mid-1990s, following Agrawal and Srikant [1], many scholars provided more efficient algorithms [8][9][10][11][12][13]. Besides these, work has been done to extend the mining of sequential patterns to other time-related patterns. Existing efforts to find appropriate sequential patterns in time-related data fall mainly into two classes. The first approach, developed by Agrawal and Srikant [14], extends the well-known Apriori algorithm. Algorithms of this type rely on the Apriori characteristic that any sub-pattern of a frequent pattern is also frequent [1]. The second, using a pattern growth approach [8], employs the same idea as the PrefixSpan algorithm. This algorithm divides the original database into smaller sub-databases and solves them recursively.

ISSN 2277 - 5048 | © 2014 Bonfring

Previous research addresses time intervals in two typical ways: first by the time-window approach, and second by completely ignoring the time interval. First, the time-window approach requires the length of the time window to be specified in advance. A sequential pattern mined from the database is thus a sequence of windows, each of which includes a set of patterns.
Patterns in the same time window are bought in the same time period. Srikant and Agrawal specified the maximum interval (max-interval), the minimum interval (min-interval) and the sliding time window size (window-size) in their algorithm [12]. However, this approach cannot find a pattern in which the interval between any two sequences falls outside the range of the window-size. Second, Agrawal and Srikant [1] introduced traditional sequential mining, which ignores the time interval and includes only the temporal order of the patterns.

To address the intervals between successive patterns in a sequence database, Chen et al. proposed a generalization of sequential patterns, called time-interval sequential patterns, which reveals not only the order of patterns but also the time intervals between successive patterns [4]. Chen et al. developed algorithms to find such patterns using both approaches [4]. Assuming a fixed partition of the time interval, they developed two efficient algorithms, I-Apriori and I-PrefixSpan. The first is based on the conventional Apriori algorithm, while the second is based on the PrefixSpan algorithm. Chen et al. [5] extended the algorithms of [4] to solve the problem of sharp boundaries by providing a smooth transition between members and non-members of a set. The sharp boundary problem can be solved by the concept of fuzzy sets, which leads to the fuzzy time-interval (FTI) pattern. Two efficient algorithms, FTI-Apriori and FTI-PrefixSpan, were developed for mining FTI sequential patterns. There are several other reasons that support the use of FTI in place of crisp time intervals. First, human knowledge can be easily represented by fuzzy logic. Second, it is widely recognized that many real-world situations are intrinsically fuzzy, and the partition of a time interval is one of them. Third, FTI is simple and easy for users.
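The sharp-boundary problem mentioned above can be illustrated with a minimal sketch. The linguistic terms "short" and "long", the crisp boundary of 10 time units and the ramp between 5 and 15 are hypothetical values chosen only for illustration; they are not taken from [4] or [5], and the linear ramp is just one common membership shape, not necessarily the one used by the FTI algorithms.

```python
def crisp_label(gap, boundary=10):
    """Crisp partition: a hard cut at `boundary` (hypothetical value)."""
    return "short" if gap <= boundary else "long"


def fuzzy_membership(gap, low=5, high=15):
    """Degrees of membership in 'short' and 'long'.

    Below `low` the gap is fully 'short', above `high` fully 'long',
    and in between the degree changes linearly (assumed ramp).
    """
    if gap <= low:
        long_degree = 0.0
    elif gap >= high:
        long_degree = 1.0
    else:
        long_degree = (gap - low) / (high - low)
    return {"short": 1.0 - long_degree, "long": long_degree}


# Two nearly identical gaps get different crisp labels ...
labels = (crisp_label(10), crisp_label(11))
# ... but very similar fuzzy membership degrees.
degrees = (fuzzy_membership(10), fuzzy_membership(11))
```

Under these assumed parameters, gaps of 10 and 11 fall on opposite sides of the crisp boundary, while their fuzzy degrees differ only slightly, which is the smooth transition that the FTI approach aims for.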
Fuzzy logic addresses the formal principles of approximate reasoning. It provides a sound foundation for handling imprecision and vagueness, as well as mature inference mechanisms based on varying degrees of truth. As boundaries are not always clearly defined, fuzzy logic can be used to identify complex patterns or behavior variations. This can be accomplished by building an intrusion detection system that combines fuzzy logic rules with an expert system in charge of evaluating rule truthfulness. In [6], the authors contributed to the ongoing research on FTI sequential pattern mining by proposing an algorithm to detect and classify audit sequential patterns in network traffic data. That paper also defines the confidence of FTI audit sequences, which had not been defined in previous research. In [7], S. Mahajan and A. Reshamwala proposed an algorithm that uses a fuzzy genetic approach to discover optimized sequences in network traffic data in order to classify and detect intrusions.

Anrong et al. [15] address the application of sequential patterns in intrusion detection by refining the pattern rules and reducing redundant rules. Their work implements the PrefixSpan algorithm in the data mining module of a network intrusion detection system (NIDS). Shang Gao et al. [16] describe a set-based approach for mining association rules and finding frequent sequential patterns in customer transactional databases. Their approach relaxes the constraints described in Apriori (All/Some) and improves the performance while being more user-oriented and self-adaptive than the probabilistic knowledge representation. In [17], A. Reshamwala and S. Mahajan applied sequential pattern mining to the KDD Cup 1999 dataset to predict DoS attack sequences; they conclude that their second approach, which divides the sequence by a timestamp window of 1 day (86400 seconds), gives more efficient results.

III. SET THEORY

Set theory is the branch of mathematical logic that studies sets, which are collections of objects. Any type of object can be collected into a set. Set theory features binary operations on sets:

Union of the sets A and B, denoted A ∪ B, is the set of all objects that are a member of A, or B, or both. The union of {1, 2, 3} and {2, 3, 4} is the set {1, 2, 3, 4}.

Intersection of the sets A and B, denoted A ∩ B, is the set of all objects that are members of both A and B. The intersection of {1, 2, 3} and {2, 3, 4} is the set {2, 3}.

Consider the sequence database shown in Table 1. The length of a sequence is the number of itemsets in the sequence. A sequence of length k is called a k-sequence. The sequence formed by the concatenation of two sequences x and y is denoted as x, y. The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction. Thus the itemset i and the 1-sequence <i> have the same support. An itemset with minimum support is called a large itemset or litemset.

IV. APRIORIALL SET BASED ALGORITHM

Figure 1 depicts the working of the algorithm to find frequent sequences using set theory. Consider the sequence dataset D, as in Table 1. To avoid multiple scans of the dataset D, the dataset is stored in a Hash Map data structure in Java. For the example in Figure 1, the longest frequent sequence pattern obtained is <a b e>, with minimum support >= 0.3.
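As a small illustration of the definitions in Section III, the sketch below applies union and intersection to the sets used in the text, and then to sets of sequence identifiers (Sid). The paper's implementation is in Java; Python sets and a dict are used here only as a compact stand-in. The Sid sets are the ones given for <a>, <b> and <e> in the worked example of Section IV, while the names `sid_sets`, `TOTAL_SEQUENCES` and `support` are hypothetical.

```python
# Union and intersection of the example sets from Section III.
A = {1, 2, 3}
B = {2, 3, 4}
union = A | B          # objects in A, or B, or both
intersection = A & B   # objects in both A and B

# Sid sets of three 1-sequences from the worked example (10 sequences in total).
sid_sets = {
    "a": {10, 20, 30, 50, 60, 70, 90, 100},
    "b": {10, 30, 40, 50, 60, 90, 100},
    "e": {10, 30, 50, 60, 100},
}
TOTAL_SEQUENCES = 10  # number of rows in the sequence database


def support(item):
    """Fraction of sequences whose Sid set contains the item."""
    return len(sid_sets[item]) / TOTAL_SEQUENCES


# Sequences that contain all three items: the intersection of their Sid sets.
common_sids = sid_sets["a"] & sid_sets["b"] & sid_sets["e"]
```

With these inputs, `support("a")` is 0.8 and `common_sids` is {10, 30, 50, 60, 100}, matching the values listed for the first pass of the example.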
Sequence database D:

    Sid   Sequence
    10    <(a,1),(b,4),(e,29)>
    20    <(d,1),(a,2),(d,24)>
    30    <(b,1),(a,11),(e,28)>
    40    <(f,1),(b,5),(c,19)>
    50    <(a,4),(b,5),(d,10),(e,28)>
    60    <(a,0),(b,5),(e,30)>
    70    <(j,2),(a,17),(h,17)>
    80    <(c,3),(i,10),(f,18)>
    90    <(h,4),(a,10),(b,21)>
    100   <(g,0),(a,0),(b,3),(e,30)>

1st Scan (SUPmin = 0.3):

    Sequence   <a>   <b>   <c>   <d>   <e>   <f>   <g>   <h>   <i>   <j>
    Support    0.8   0.7   0.2   0.2   0.5   0.2   0.1   0.2   0.1   0.1

0-Length Sequences:

    Sid                             Sequence   Support
    [10,20,30,50,60,70,90,100]      <a>        0.8
    [10,30,40,50,60,90,100]         <b>        0.7
    [10,30,50,60,100]               <e>        0.5

Reduced dataset (unchanged after the 2nd scan):

    Sid   Sequence
    10    <(a,1),(b,4),(e,29)>
    30    <(b,1),(a,11),(e,28)>
    50    <(a,4),(b,5),(d,10),(e,28)>
    60    <(a,0),(b,5),(e,30)>
    100   <(g,0),(a,0),(b,3),(e,30)>

2nd Scan:

    Sid                   Sequence   Support
    [10,30,50,60,100]     <a b>      0.4
    [10,30,50,60,100]     <a e>      0.5
    [10,30,50,60,100]     <b e>      0.5

2-Length Sequence:

    Sequence   Support
    <a b e>    0.4

Figure 1: AprioriAll_Set Algorithm

Table 1: Sequence Database

    Sid   Audit Sequence
    10    <(a,1),(b,4),(e,29)>
    20    <(d,1),(a,2),(d,24)>
    30    <(b,1),(a,11),(e,28)>
    40    <(f,1),(b,5),(c,19)>
    50    <(a,4),(b,5),(d,10),(e,28)>
    60    <(a,0),(b,5),(e,30)>
    70    <(j,2),(a,17),(h,17)>
    80    <(c,3),(i,10),(f,18)>
    90    <(h,4),(a,10),(b,21)>
    100   <(g,0),(a,0),(b,3),(e,30)>

The algorithm is as follows.

Algorithm:
    L0 = Scan the database to generate large 1-sequences;
    C1 = new candidates generated from L0.
    for each sequence c in the database do
        Increment the count of all candidates in C1 that are contained in c.
        L1 = Candidates in C1 with minimum support.
    end.
    Interestingness of the 1-sequence is found by intersection of the set of all the Sid of the candidates in L1:
        Sid<i1> ∩ Sid<i2> ∩ ... ∩ Sid<in>; i1, i2, ..., in - itemsets
    for (k = 2; Lk-1 ≠ φ; k++) do
    begin
        Lk = Candidates with minimum support
        Interestingness of the k-sequence is found by intersection of the set of all the Sid of the candidates in Lk
    end.
    Maximal Sequences in ∪k Lk.

We now apply the candidate generation of the AprioriAll_Set algorithm with a minimum support of 0.3. In the first pass, L0 is found by scanning the dataset D to generate large 1-sequences. By the Apriori principle, the candidates C1 are generated. Finding L1 with min_supp = 0.3 gives the 1-sequences <a>, <b> and <e>. A set of Sequence_id (Sid) values is also formed for each of these L1 candidates, as shown in Figure 1. For example, the Sid sets for the 1-sequences are <a>: {10, 20, 30, 50, 60, 70, 90, 100}, <b>: {10, 30, 40, 50, 60, 90, 100} and <e>: {10, 30, 50, 60, 100}. The interestingness of the 1-sequences is found by applying the set intersection to the Sid sets of all the candidates in L1:

    Sid<a> ∩ Sid<b> ∩ Sid<e>

In the next pass, that is when k >= 2, only those Sequence_id values are considered that resulted from the previous pass's intersection of the Sid sets of the l-sequences, where l is the length of the sequence. When l = 1, we get the Sid set {10, 30, 50, 60, 100}. Thus C2 is generated from this reduced dataset D', which is stored as a hash map. Finding L2 with min_supp = 0.3 gives the 2-sequences <a b>, <a e> and <b e>. A set of Sequence_id values is again formed for each of these L2 candidates, as shown in Figure 1. For example, the Sid sets for the 2-sequences are <a b>: {10, 30, 50, 60, 100}, <a e>: {10, 30, 50, 60, 100} and <b e>: {10, 30, 50, 60, 100}.

Similarly, the interestingness of the k-sequences is found by intersecting the Sid sets of all the candidates in Lk. In the example of Figure 1, the interestingness of the 2-sequences can be improved by applying the set intersection to the Sid sets of all the candidates in L2:

    Sid<a b> ∩ Sid<a e> ∩ Sid<b e>

This results in the Sid set {10, 30, 50, 60, 100}. The pass is repeated until Lk is empty. The frequent sequences are the union of the Lk.

V. RESULTS AND DISCUSSION

In this section, both algorithms, AprioriALL [1] and AprioriAll_Set, are implemented to mine sequential patterns without time intervals. The algorithms were implemented in the Sun Java language and tested on an Intel Core Duo processor, 2.10 GHz, with 2 GB main memory under the Windows XP operating system. The dataset used for simulation is the KDD Cup 1999 dataset, used to detect DoS attack sequences in network traffic data. The sequence dataset is formed using the second approach in [17], in which the sequence is divided by a timestamp window of 1 day (86400 seconds).

AprioriAll_Set, based on traditional set theory, shrinks the database size. It also scans the database at most twice. As the interestingness of the itemsets increases with the database shrinking, it leads to the longest sequences. As the database is reduced, the time taken to mine sequences also reduces, and the algorithm is faster than traditional algorithms. The complexity of the algorithm can also be reduced. As can be observed in Figure 3, AprioriAll_Set generates efficient sequential patterns as per the Apriori principle. It also takes only 2 fixed database scans for a k-itemset, as compared to k database scans for a k-itemset in the AprioriALL algorithm. It also generates the longest sequences: the itemsets that satisfy the minimum support constraint together generate the longest sequences. The interestingness of the itemsets increases by taking the intersection of the sequence-ids in which the itemsets are present.
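The two-scan procedure of Section IV can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Java implementation: a Python dict replaces the Hash Map, timestamps from Table 1 are dropped so that each sequence is an ordered list of events, candidates are enumerated with `itertools.permutations` rather than an Apriori join, and support is measured against the original number of sequences. The names `DATABASE`, `contains` and `apriori_all_set` are hypothetical.

```python
from itertools import permutations

# Table 1 with timestamps dropped: Sid -> ordered list of events.
DATABASE = {
    10: ["a", "b", "e"],
    20: ["d", "a", "d"],
    30: ["b", "a", "e"],
    40: ["f", "b", "c"],
    50: ["a", "b", "d", "e"],
    60: ["a", "b", "e"],
    70: ["j", "a", "h"],
    80: ["c", "i", "f"],
    90: ["h", "a", "b"],
    100: ["g", "a", "b", "e"],
}


def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` as an ordered subsequence."""
    events = iter(sequence)
    return all(item in events for item in pattern)


def apriori_all_set(database, min_support):
    total = len(database)

    # First scan: collect the Sid set of every single item.
    sids = {}
    for sid, seq in database.items():
        for item in set(seq):
            sids.setdefault((item,), set()).add(sid)
    large = {p: s for p, s in sids.items() if len(s) / total >= min_support}
    frequent = {p: len(s) / total for p, s in large.items()}
    if not large:
        return frequent, {}

    # Shrink the dataset to the intersection of the large items' Sid sets.
    keep = set.intersection(*large.values())
    reduced = {sid: database[sid] for sid in keep}

    # Second scan: count longer candidates on the reduced, in-memory data.
    items = sorted(p[0] for p in large)
    for k in range(2, len(items) + 1):
        found = False
        for cand in permutations(items, k):
            count = sum(contains(seq, cand) for seq in reduced.values())
            if count / total >= min_support:
                frequent[cand] = count / total
                found = True
        if not found:  # no k-sequence is frequent, so stop
            break
    return frequent, reduced


frequent, reduced = apriori_all_set(DATABASE, 0.3)
longest = max(frequent, key=len)
print(sorted(reduced), longest, frequent[longest])
```

On the Table 1 data with a minimum support of 0.3, this sketch keeps the sequences with Sid 10, 30, 50, 60 and 100 and reports <a b e> with support 0.4 as the longest frequent sequence, in line with Figure 1. Enumerating permutations is only practical for a handful of large items; a real implementation would generate candidates from the previous pass.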
The first comparison is based on the performance of the two algorithms, where the minimum support threshold is varied from 20% to 90%. Figure 2 summarizes those results. The results show that the AprioriAll_Set algorithm is approximately 1.5 times faster than the AprioriALL algorithm at a minimum support of 20%.

Figure 2: Performance of AprioriAll_Set Algorithm (run time in ms versus support, for AprioriAll_Set and AprioriALL)

The second comparison is based on the number of frequent sequence patterns found by executing these algorithms with a varying minimum support threshold. The results in Figure 3 show that AprioriAll_Set generates an efficient number of sequential patterns.

Figure 3: No. of Patterns of AprioriAll_Set Algorithm (number of patterns versus support, for AprioriAll_Set and AprioriALL)

From Figure 4, it is seen that the AprioriALL algorithm requires 34% more memory than AprioriAll_Set when the minimum support is taken as 20%. Figure 5 shows that the AprioriALL algorithm generates longer patterns than the AprioriAll_Set algorithm.

Figure 4: Memory Usage of AprioriAll_Set Algorithm (memory in MB versus support, for AprioriAll_Set and AprioriALL)

Figure 5: Pattern Length Discovery of AprioriAll_Set Algorithm (average length of patterns versus support, for AprioriAll_Set and AprioriALL)

AprioriAll_Set, based on traditional set theory, shrinks the database size, as shown in Figure 6. The comparison is based on the dataset size, where the minimum support threshold is varied from 20% to 90%. The average dataset size per iteration in both algorithms is shown in Figure 7.

Figure 6: Dataset Size of AprioriAll_Set Algorithm (percentage versus iterations, for different support values)

Figure 7: Comparison of Dataset Size (percentage versus support)

VI. CONCLUSION AND FUTURE ENHANCEMENT

On applying AprioriALL and AprioriAll_Set to the KDD Cup 1999 dataset, the results obtained indicate that AprioriAll_Set is faster and generates fewer sequential patterns than AprioriALL. Also, the AprioriALL algorithm requires more memory and generates longer patterns than the AprioriAll_Set algorithm. By applying the set intersection operation, the interestingness of the itemsets is increased in AprioriAll_Set. Dataset shrinking in AprioriAll_Set leads to efficient sequences with high associativity. Lastly, in AprioriAll_Set, as the dataset is stored in a Hash Map data structure, the number of scans of the dataset is reduced.

As a future enhancement, note that in these experiments sequence patterns were discovered by ignoring the time interval and including only the temporal order of the patterns. The approach can be extended to more set-based mathematical models for further data analysis in order to discover hidden sequential patterns. To address the intervals between successive patterns in a sequence database, Chen et al. proposed a generalization of sequential patterns, called time-interval sequential patterns, which reveals not only the order of patterns but also the time intervals between successive patterns [4]. An extension of the algorithm developed by Chen et al. [4] can also be implemented to solve the problem of sharp boundaries by providing a smooth transition between members and non-members of a set, as addressed in Chen et al. [5].
Also, as proposed in [7], a fuzzy genetic approach to discover optimized sequences in network traffic data in order to classify and detect intrusions can be implemented.

REFERENCES

[1] R. Agrawal and R. Srikant, "Mining sequential patterns", In Proc. Int. Conf. Data Engineering, pp. 3–14, 1995.
[2] Y. L. Chen, S. S. Chen and P. Y. Hsu, "Mining hybrid sequential patterns and sequential rules", Inf. Syst., vol. 27, no. 5, pp. 345–362, 2002.
[3] J. Han and M. Kamber, Data Mining: Concepts and Techniques, New York: Academic, 2001.
[4] Y. L. Chen, M. C. Chiang and M. T. Ko, "Discovering time-interval sequential patterns in sequence databases", Expert Systems with Applications, vol. 25, no. 3, pp. 343–354, 2003.
[5] Yen-Liang Chen and Tony Cheng-Kui Huang, "Discovering Fuzzy Time-Interval Sequential Patterns in Sequence Databases", IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, vol. 35, pp. 959–972, 2005.
[6] Sunita Mahajan and Alpa Reshamwala, "Amalgamation of IDS Classification with Fuzzy techniques for Sequential pattern mining", IJCA Proceedings on International Conference on Technology Systems and Management - ICTSM 2011, Number 3 - Article 7, pp. 9–14, 2011.
[7] Sunita Mahajan and Alpa Reshamwala, "An Approach to Optimize Fuzzy Time-Interval Sequential Patterns Using Multi-objective Genetic Algorithm", ICTSM 2011, CCIS 145, Springer-Verlag Berlin Heidelberg, pp. 115–120, 2011.
[8] J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal and M.-C. Hsu, "PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth", Proceedings of 2001 International Conference on Data Engineering, pp. 215–224, 2001.
[9] J. Han, J. Pei and Y. Yin, "Mining Frequent Patterns without Candidate Generation", Proc. of ACM-SIGMOD Int'l Conf. Management of Data (SIGMOD '00), pp. 1–12, 2000.
[10] J. Ayres, J. Gehrke, T. Yiu and J. Flannick, "Sequential PAttern Mining using A Bitmap Representation", In Proceedings of ACM SIGKDD on Knowledge discovery and data mining, pp. 429–435, 2002.
[11] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal and M.-C. Hsu, "FreeSpan: Frequent pattern-projected sequential pattern mining", Proceedings of 2000 International Conference on Knowledge Discovery and Data Mining, pp. 355–359, 2000.
[12] R. Srikant and R. Agrawal, "Mining sequential patterns: Generalizations and performance improvements", Proceedings of the 5th International Conference on Extending Database Technology, pp. 3–17, 1996.
[13] M. J. Zaki, "SPADE: An efficient algorithm for mining frequent sequences", Machine Learning, vol. 42, no. 1-2, pp. 31–60, 2001.
[14] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules", Proceedings of 20th VLDB Conference, Santiago, Chile, pp. 487–499, 1994.
[15] Xue Anrong, Hong Shijie, Ju Shiguan and Chen Weihe, "Application of Sequential Patterns Based on User's Interest in Intrusion Detection", Proceedings of 2008 IEEE International Symposium on IT in Medicine and Education, pp. 1089–1093, 2008.
[16] Shang Gao, Reda Alhaji, Jon Rokne and Jiwen Guan, "Set Based Approach in Mining Sequential Patterns", 24th International Symposium on Computer and Information Sciences, ISCIS 2009, pp. 218–223, 2009.
[17] Alpa Reshamwala and Dr. Sunita Mahajan, "Prediction of DoS attack Sequences", Proceedings of International Conference on Communication, Information & Computing Technology (ICCICT), pp. 1–5, 2012.

Ms. Alpa Reshamwala is currently working as an Assistant Professor in the Department of Computer Engineering at MPSTME, NMIMS University. She received her B.E. degree in Computer Engineering from Fr. CRCE, Bandra, Mumbai University in 2000 and her M.E. degree in Computer Engineering from TSEC, Mumbai University in 2008. Her areas of interest include Artificial Intelligence, Data Mining, and Soft Computing: Fuzzy Logic, Neural Networks and Genetic Algorithms.
She has 24 papers in national and international conferences and journals to her credit.

Dr. Sunita M. Mahajan is currently working as the Principal of Mumbai Educational Trust's Institute of Computer Science. She received her doctorate from S.N.D.T. Women's University in 1997. She worked as a senior scientist at Bhabha Atomic Research Centre for 31 years and entered the educational field after her retirement. She has done extensive work in parallel processing. She has more than 45 papers in national and international conferences and journals to her credit. She has guided many PhD students in distributed computing, data mining, natural language processing, etc. Her current fields of interest are parallel processing, distributed computing, cloud computing and data mining. She has also written a textbook, "Distributed Computing" (New Delhi, Oxford University Press, 2010).