Download Paper Title (use style: paper title)

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

K-means clustering wikipedia , lookup

Expectation–maximization algorithm wikipedia , lookup

Transcript
Proceedings of National Conference on New Horizons in IT - NCNHIT 2013
173
Traditional Set Based Approach in Mining
Sequential Patterns
Alpa Reshamwala and Dr. Sunita Mahajan
Abstract--- Sequential pattern mining is an important data
mining problem with broad applications. Sequential Pattern
Mining is to discover the frequent sequential pattern in the
sequential event dataset. In this paper, we have implemented
sequential pattern mining algorithm based on traditional set
theory. The proposed algorithm is implemented on a network
intrusion dataset from KDD Cup 1999, 10 percent of training
dataset, which is the annual Data Mining and Knowledge
Discovery competition organized by ACM Special Interest
Group on Knowledge Discovery and Data Mining, the leading
professional organization of data miners.
Keywords--- Data Mining, Sets, Sequence Data, Time
Series, Intrusion Detection System, DoS Attacks
I.
INTRODUCTION
W
ITH massive amounts of data continuously being
collected and stored, many industries are becoming
interested in mining sequential patterns from their database.
Sequential pattern mining is one of the most well-known
methods and has broad applications including web-log
analysis, customer purchase behavior analysis and medical
record analysis. Many approaches have been proposed to
extract information, and mining sequential patterns is one of
the most important ones [1][2][3]. Sequential Pattern Mining
finds interesting sequential patterns among the large database.
It finds out frequent subsequences as patterns from a sequence
database.
It has been a great challenge to improve the efficiency of
Apriori algorithm. Since all the frequent sequential patterns
are included in the maximum frequent sequential patterns, the
task of mining frequent sequential patterns can be converted as
mining maximum frequent sequential patterns. PrefixSpan [8]
is not a refinement of the apriori-like, candidate generationand-test approach, rather a divide-and-conquer approach,
called pattern-growth approach, which is an extension of FPgrowth [9], an efficient pattern-growth algorithm for mining
frequent patterns without candidate generation. SPAM [10]
algorithm is based on the lexicographic sequence tree and can
make either depth or width first traversal.
In this paper, all these algorithms are implemented to mine
frequent sequential patterns on KDD Cup 1999 dataset to
predict DoS attack sequences on network traffic data.
Alpa Reshamwala, Assistant Professor, Computer Engineering
Department, MPSTME, SVKM’s NMIMS University, Mumbai, India. E-mail:
[email protected]
Dr. Sunita Mahajan, Principal, Institute of Computer Science, M.E.T,
Bandra, Mumbai, India. E-mail: [email protected]
II.
RELATED WORK
After mid 1990’s, following Agrawal and Srikant [1],
many scholars provided more efficient algorithms
[8][11][12][13]. Besides these, works have been done to
extend the mining of sequential patterns to other time-related
patterns. Existing approaches to find appropriate sequential
patterns in time related data are mainly classified into two
approaches. In the first approach developed by Agarwal and
Srikant [14], the algorithm extends the well-known Apriori
algorithm. This type of algorithms is based on the
characteristic of Apriori—that any sub-pattern of a frequent
pattern is also frequent [1]. The latter, uses a pattern growth
approach [8], employs the same idea used by the Prefix-Span
algorithm.
This algorithm divides the original database into smaller
sub-databases and solves them recursively. Previous research
addresses time intervals in two typical ways, first by the timewindow approach, and second by completely ignoring the time
interval. First, the time window approach requires the length
of the time window to be specified in advance. A sequential
pattern mined from the database is thus a sequence of
windows, each of which includes a set of patterns. Patterns in
the same time window are bought in the same time period. In
the algorithm [12], Shrikant and Agrawal, specified the
maximum interval (max-interval), the minimum interval (mininterval) and the sliding time window size (windowsize).
Moreover, they cannot find a pattern whose interval between
any two sequences is not in the range of the window-size.
Agrawal and Srikant [1], introduced traditional sequential
mining, by ignoring the time interval and including only the
temporal order of the patterns.
To address the intervals between successive patterns in
sequence database, Chen et al. have proposed a generalization
of sequential patterns, called time-interval sequential patterns,
which reveals not only the order of patterns, but also the time
intervals between successive patterns [4]. Chen et al.
developed algorithms to find sequential patterns using both the
approaches [4]. Their work, by assuming the partition of time
interval as fixed, developed two efficient algorithms -I-Apriori
and I- PrefixSpan. The first algorithm is based on the
conventional Apriori algorithm, while the second one is based
on the PrefixSpan algorithm.
An extension of the algorithm developed by Chen et al [4],
to solve the problem of sharp boundaries to provide a smooth
transition between members and non-members of a set, is
addressed in Chen et al [5]. The sharp boundary problems can
be solved by the concept of fuzzy sets. The concept included
fuzzy time interval (FTI) pattern. Two efficient algorithms, the
ISBN 978-93-82338-79-6
Proceedings of National Conference on New Horizons in IT - NCNHIT 2013
FTI-Apriori algorithm and the FTI-PrefixSpan algorithm, were
developed for mining FTI sequential patterns. There are
several other reasons that support the use of FTI in place of
crisp time interval. First, the human knowledge can be easily
represented by fuzzy logic. Second, it is widely recognized
that many real world situations are intrinsically fuzzy, and the
partition of time interval is one of them. Third, FTI is simple
and easy for users. Fuzzy logic addresses the formal principles
of approximate reasoning. It provides a sound foundation to
handle imprecision and vagueness as well as mature inference
mechanisms by varying degrees of truth. As boundaries are
not always clearly defined, fuzzy logic can be used to identify
complex pattern or behavior variations. And it can be
accomplished by building an intrusion detection system that
combines fuzzy logic rules with an expert system in charge of
evaluating rule truthfulness. In [6], we contributed to the
ongoing research on FTI sequential pattern mining by
proposing an algorithm to detect and classify audit sequential
patterns in network traffic data. The paper also defines the
confidence of the FTI audit sequences, which is not yet
defined in the previous researches. In [7], we have proposed
an algorithm which uses a fuzzy genetic approach to discover
optimized sequences in the network traffic data to classify and
detect intrusion.
Anrong et al [15], addresses application of sequential
pattern in intrusion detection by refining the pattern rules and
reducing redundant rules. Their work implements PrefixSpan
algorithm in the data mining module of network intrusion
detection system (NIDS). Shang Gao et al [16], describes a setbased approach for mining association rules and finding
frequent sequential patterns in customer transactional
databases. Their approach relaxes the constraints described in
Apriori (All/Some), and improves the performance while being
more user-oriented and self-adaptive than the probabilistic
knowledge representation.
In this paper, all these algorithms are implemented to mine
frequent sequential patterns on KDD Cup 1999 dataset to
predict DoS attack sequences on network traffic data
III.
174
the same support. An itemset with minimum support is called
as the large itemset or litemset.
Table I: Sequence Database
Sid
Audit Sequence
10
20
<(a,1),(b,4),(e,29)>
<(d,1),(a,2),(d,24)>
30
40
50
<(b,1),(a,11),(e,28)>
<(f,1), (b,5),(c,19)>
<(a,4),(b,5),(d,10),(e,28)>
60
<(a,0),(b,5),(e,30)>
70
<(j,2),(a,17),(h,17)>
80
<(c,3),(i,10),(f,18)>
90
<(h,4),(a,10),(b,21)>
100
<(g,0),(a,0),(b,3),(e,30)>
IV.
ALGORITHMS
Figure 1 depicts the working of the algorithm to find
frequent sequences using set theory. Consider the sequence
dataset D,as in Table I. To avoid multiple scan of the dataset
D, the dataset is stored in the Hash Map data structure in Java.
Now, by applying the Apriori algorithm of candidate
generation and considering minimum support of 0.3. In the first
pass, find L0 by scanning the dataset D to generate large 1sequences. By Apriori principle C1, candidates are generated.
Find L1 satisfying the min_supp =0.3, we get 1-sequence <a>,
<b> and <e>. Also form a set of Sequence_id of each of these
L1 candidates as shown in Figure 1. For example Sid for 1sequence itemset
<a>: {10, 20, 30, 50, 60, 70, 90, 100},
<b>: {10, 30, 40, 50, 60, 90, 100} and
<e>: {10, 30, 50, 60, 100}.
Interestingness of the 1- sequence is found by applying the
set intersection of the set of all the Sid of the candidates in L1.
Sid <a> ∩ Sid <b> ∩ Sid <e>
SET THEORY
Set theory is the branch of mathematical logic that
studies sets, which are collections of objects. Although any
type of object can be collected into a set. A set theory features
binary operations on sets:
Union of the sets A and B, denoted A ∪ B, is the set of all
objects that are a member of A, or B, or both. The union of {1,
2, 3} and {2, 3, 4} is the set {1, 2, 3, 4}.
Intersection of the sets A and B, denoted A ∩ B, is the set
of all objects that are members of both A and B. The
intersection of {1, 2, 3} and {2, 3, 4} is the set {2, 3}.
Consider the sequence database as shown in Table I. The
length of a sequence is the number of itemsets in the sequence.
A sequence of length k is called a k-sequence. The sequence
formed by the concatenation of two sequences x and y is
denoted as x.y. the support for an itemset i is defined as the
fraction of customers who bought the items in i in a single
transaction. Thus the itemset i and the 1-sequence <i> have
Next pass or when k>=2, we will be considering only those
set of Sequence_id which resulted from the previous pass
intersection if Sid’s of the l-sequence, where l is the length of
sequence. When l=1, we get, a set of Sid {10, 30, 50, 60, 100}.
Thus C2 will be generated from this reduced dataset D’ stored
as a hash map. Find L2 satisfying the min_supp =0.3, we get 2sequence <a b>, <a e> and <b e>. Also form a set of
Sequence_id of each of these L2 candidates as shown in Figure
1. For example, Sid for 2-sequence itemset are.
<a>: {10, 30, 50, 60, 100}.
<b>: {10, 30, 50, 60, 100}.
<e>: {10, 30, 50, 60, 100}.
Similarly, Interestingness of the k- sequence is found by
intersection of the set of all the Sid of the candidates in Lk
ISBN 978-93-82338-79-6
Proceedings of National Conference on New Horizons in IT - NCNHIT 2013
For the example in figure 1 we get interestingness of the 2sequence is by applying the set intersection of the set of all the
Sid of the candidates in L2.
175
L1 = Candidates in C1 with minimum support.
end.
Interestingness of the 1- sequence is found by intersection
of the set of all the Sid of the candidates in L1
Sid <a b> ∩ Sid <a e> ∩ Sid <b e>
We get, a set of Sid {10, 30, 50, 60, 100}. Repeating the
earlier pass till Lk.
For the example in figure 1 we get, frequent longest
sequence pattern as <a b e> with minimum support >= 0.3.
Sid <i1> ∩ Sid <i2> ….∩ Sid<in>; i1,i2,…in - itemsets
for (k=2; Lk-1 ≠ φ; k++) do
begin
Lk = Candidates with minimum support
The algorithm is as follows.
Interestingness of the k- sequence is found by
intersection of the set of all the Sid of the
candidates in Lk
Algorithm:
L0 = Scan the database to generate large 1- sequences;
C1 = new candidates generated from L0.
end.
for each sequence c in the database do
Increment the count of all candidates in C1 that are
contained in c.
Maximal Sequences in UkLk
Figure 1: Large Frequent litemset
These algorithms were implemented in Sun Java language
V. RESULTS AND DISCUSSION
and tested on an Intel Core Duo Processor, 2.10 GHz with
In this section we perform a simulation study to compare
2GB main memory under Windows XP operating system.
the performances of the algorithms: Set Based AprioriALL,
AprioriAll based on traditional set theory shrinks the
the Apriori [1], PrefixSpan [8] and SPAM [10], which find
database
size. It also scans the database at most once. Also, as
sequential patterns without time intervals.
the interestingness of the itemset is increased with the
database shrinking leads to longest sequences. As the
database is reduced the time taken to mine sequences also
ISBN 978-93-82338-79-6
Proceedings of National Conference on New Horizons in IT - NCNHIT 2013
reduces and is faster than traditional algorithms. The
Complexity of the Algorithm can also be reduced. As we can
observe in the Figure 2 and 3, AprioriAll based on Set theory
generates maximum number of sequences as per the Apriori
principal.
Also, it takes only 1 fixed database scan for k- itemset as
compared to k database scans for k-itemset in traditional
176
VI.
CONCLUSIONS AND FUTURE WORK
In this paper we have implemented Apriori a candidate
generation algorithm and PrefixSpan a pattern growth
algorithm on KDD Cup 1999 dataset to predict DoS attack
sequences on network traffic data using Set theory concepts.
Experimental results have shown that Set theory based
AprioriALL has generated maximum, longest and efficient
sequential patterns. These results are then compared with
Traditional AprioriALL, SPAM (Sequential Pattern Mining),
and PrefixSpan algorithm. The results in Figure 2, also shows
that Set theory based AprioriALL performs well for large
datasets like KDD Cup 1999 dataset.
The second comparison is done on the number of frequent
sequence patterns found executing these algorithms with the
varying minimum support threshold and from the results, it is
shown that the Set theory based AprioriALL gives the longest
efficient sequential patterns. On applying set intersection
operation, the interestingness of the itemset is increased with
the database shrinking leads to longest sequences.
Figure 2: Performance of AprioriAll Set based Algorithm
In future work, as in these experiments we have found
sequence patterns, by ignoring the time interval and including
only the temporal order of the patterns. The approach can be
extended to more set-based mathematical models for further
data analysis in order to discover hidden sequential patterns.
To address the intervals between successive patterns in
sequence database, Chen et al. have proposed a generalization
of sequential patterns, called time-interval sequential patterns,
which reveals not only the order of patterns, but also the time
intervals between successive patterns [4]. An extension of the
algorithm developed by Chen et al [4], to solve the problem
of sharp boundaries to provide a smooth transition between
members and non-members of a set, is addressed in Chen et al
[5] can also be implemented. Also as proposed in [7], the use
of fuzzy genetic approach to discover optimized sequences in
the network traffic data to classify and detect intrusion can
also be implemented.
Figure 3: No. of Patterns of AprioriAll Set based Algorithm
REFERENCES
[1]
AprioriAll. It also generates longest sequences. As the
itemsets which satisfies the minimum support constraints will
together generate the longest sequences. The interestingness
of the itemset increases by taking the intersection of the
sequence-id’s in which the itemsets are present.
The first comparison is based on the KDD cup 99 dataset
where the minimum support threshold is varied 20 % to 90%.
Figure 2 summarizes those results. All the results show that
Set based AprioriALL algorithm is at par with Prefixspan and
SPAM algorithm as copared to AprioriALL algorithm.
The second comparison is done on the number of frequent
sequence patterns found executing these algorithms with the
varying minimum support threshold. From the results in
Figure 3, it is shown that Set based AprioriALL generates
maximum as well as efficient number of sequential patterns
[2]
[3]
[4]
[5]
[6]
[7]
[8]
R. Agrawal and R. Srikant, “Mining sequential patterns”, In Proc. Int.
Conf. Data Engineering, 1995, pp. 3–14.
Y. L. Chen, S. S. Chen, and P. Y. Hsu, “Mining hybrid sequential
patterns and sequential rules”, Inf. Syst., vol. 27, no. 5, pp. 345–362,
2002.
J. Han and M. Kamber, Data Mining: Concepts and Techniques, New
York: Academic, 2001.
Y. L. Chen, M. C. Chiang, and M. T. Ko, “Discovering time-interval
sequential patterns in sequence databases”, Expert Systems with
Applications, Volume 25, Issue 3, October 2003, pp 343–354.
Yen-Liang, Tony Cheng-Kui Huang, “Discovering Fuzzy TimeInterval Sequential Patterns in Sequence Databases”, IEEE Transactions
on Systems, Man, and Cybernetics-Part B: Cybernetics, 2005, vol.35,
pp.959-972.
Sunita Mahajan and Alpa Reshamwala, “Amalgamation of IDS
Classification with Fuzzy techniques for Sequential pattern mining
“,IJCA Proceedings on International Conference on Technology
Systems and Management - ICTSM 2011, Number 3 - Article 7, pp 9–
14.
Sunita Mahajan and Alpa Reshamwala, “An Approach to Optimize
Fuzzy Time-Interval Sequential Patterns Using Multi-objective Genetic
Algorithm”, ICTSM 2011, CCIS 145, pp. 115–120, 2011, SpringerVerlag Berlin Heidelberg 2011.
Pei, J., Han, J., Pinto, H., Chen, Q., Dayal, U., & Hsu, M.-C.,
“PrefixSpan: Mining sequential patterns efficiently by prefix-projected
ISBN 978-93-82338-79-6
Proceedings of National Conference on New Horizons in IT - NCNHIT 2013
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
pattern growth”, Proceedings of 2001 International Conference on Data
Engineering, pp. 215–224.
J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate
Generation,” Proc. 2000 ACM-SIGMOD Int’l Conf. Management of
Data (SIGMOD ’00), pp. 1-12, May 2000.
J. Ayres, J. Gehrke, T. Yiu, and J. Flannick, “Sequential PAttern Mining
using A Bitmap Representation”, In Proceedings of ACM SIGKDD on
Knowledge discovery and data mining, pp. 429-435, 2002.
Han, J., Pei, J., Mortazavi-Asl, B., Chen, Q., Dayal, U., & Hsu, M.-C.,
“FreeSpan: Frequent pattern-projected sequential pattern mining”,
Proceedings of 2000 International Conference on Knowledge Discovery
and Data Mining, pp. 355–359.
Srikant, R., & Agrawal, R., “Mining sequential patterns:
Generalizations and performance improvements”, Proceedings of the 5th
International Conference on Extending Database Technology,1996, pp.
3–17.
Zaki, M. J., “SPADE: An efficient algorithm for mining frequent
sequences”, volume 42 Issue 1-2, January-February 2001, pp 31–60.
R. Agrawal and R. Srikant, “Fast algorithms for mining association
rules”, Proceedings of 20th VLDB Conference Santiago, Chile, 1994,
pp. 487–499.
XUE Anrong, HONG Shijie, JU Shiguang, CHEN Weihe, “Application
of Sequential Patterns Based on User’s Interest in Intrusion Detection”,
Proceedings of 2008 IEEE International Symposium on IT in Medicine
and Education, pp 1089- 1093, 2008.
Shang Gao, Reda Alhaji, Jon Rokne, Jiwen Guan, “Set Based Approach
in Mining Sequential Patterns”, 24th International Symposium on
Computer and Information Sciences, 2009. ISCIS 2009, pp 218 – 223.
ISBN 978-93-82338-79-6
177