Download Exploiting Frequent Episodes in Weighted Suffix Tree to Improve

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Exploiting Frequent Episodes in Weighted Suffix Tree to Improve Intrusion
Detection System
Min-Feng Wang, Yen-Ching Wu, and Meng-Feng Tsai
Department of Computer Science and Information Engineering
National Central University, Jhongli, Taiwan 32001
[email protected]
Abstract
In this paper we proposed a weighted suffix tree
and find out it can improve the Intrusion Detection
System (IDS). We firstly focus on the analysis of
computer kernel system call, and discover some
meaningful information from the unorganized system
call sequences. We design a weighted suffix tree
algorithm which derives from the concept of suffix tree
algorithm for string matching, which then allows to
mine the frequent episodes in order to get ordered
frequent patterns. We therefore apply these rules to
detect malicious attacks, and it shows our IDS still has
a good ability to detect intrusion when we use fewer
rules.
1. Introduction
Intrusion Detection System (IDS) is one of key
technologies to ensure dynamic system security.
Intrusion detection is defined as “identifying
unauthorized use, misuse, and abuse of computer
systems by both inside and outside intruders” [5].
There are two basic approaches of intrusion detection:
misuse intrusion detection [7, 10], and anomaly
intrusion detection [18]. Generally speaking, an IDS
does not include preventing intrusions from occurring.
All IDSs must have three important functions: data
collection, data classification, and data reporting.
In general, there are several levels of data to be
collected for IDS. Network data has a lot of features
which can be provided to analyze intrusion detections.
But the transmission of network data is always being
encrypted for security; In addition, when source code
of the application does not publish, the method of
embedding monitoring application code become even
more difficult to implement. Unlike network data,
system calls of operating system do not have the
problem of encryption or decryption, they operate
more directly than network packages and are much
easier to analyze. We can collect the system calls
simply by monitoring process. In addition, since the
system calls will not be affected by the language of
process being written, we can successfully avoid the
publication problem of the source code. These
advantages interest us to analyze system calls of
operating system.
Data mining generally refers to the process of
extracting models from large stores of data [4].
However, researches of IDS which apply data mining
techniques [13, 25] do not focus on time order between
system calls. And the extracted rule set only contains
frequent patterns of disorder system calls. For
example, [25] used Apriori algorithm [1] to extract the
rules. If it has a rule {01, 02, 03}, then the system
calls 01, 02, and 03 will be bundled together all the
time without time order identification. In this case, if
system calls 01 followed by 02, followed by 03 is the
normal order of system call of the process, but system
calls 03 followed by 02, followed by 01 is an intrusive
behavior, then, the method will not be able to detect
the intrusion which applies the same system calls with
different order.
In order to improve the accuracy of detection, we
will keep the order between system calls by using our
proposed weighted suffix tree algorithm, and exploit
frequent episode mining algorithm to extract
significant rules from ordered sequences.
2. Related work
Lee et al. introduced data mining techniques into
IDS [11, 14, 15, 16, 17, 21]. They hope to exploit the
ability of handling huge amount of data to improve the
performance of IDS. [13] used the mining algorithm
based on FP-tree to extract rule set of normal behavior.
[8] proposed a “bag of system calls” to represent IDS.
[12] introduced several data mining techniques to
enhance IDS. [26] analyzed several anomaly IDSs
which used system calls. [2] involved with several
main aspects and technologies of IDS.
2.1. Frequent episodes mining
The frequent episodes algorithm described in [20] is
an extension of association rules. A frequent episode
is a set of items that occur frequently within a time
window of a specified length. They emphasize that the
items in a serial episode must occur in partial order in
time. An example is home, research→theory [sup =
5%, conf = 20%, time = 30s], which means that 5% of
the total time windows (30s) of all transactions under
analysis, the visiting order starts with home page, and
followed by the research guide and the theory page;
and 20% of the users who visit the home page and the
research guide in time order will visit the theory page
subsequently.
2.2. Suffix tree
The concept of suffix tree was first introduced as a
position tree by Weiner [23]. The construction was
greatly simplified by McCreight [19], and also by
Ukkonen [22]. Ukkonen provided the first linear-time
online-construction of suffix trees. A suffix tree is a
tree-like data structure representing all suffixes of a
string. The characteristics of suffix tree can support to
keep the order between system calls that are recorded
in the sequence, and to find all substrings of the
sequences.
3. Problem definition
A set of system calls which are arranged in time
order is called system call sequence. Each trace is the
list of system calls issued by a single process from the
beginning to the end of execution.
A daemon process is used to denote a background
system process on UNIX which runs independently of
any user who is logged-in. Trace lengths vary widely
because of differences in program complexity because
some traces are collected daemon processes and others
are not.
The University of New Mexico (UNM) provides a
number of system call data sets. In UNM system call
traces, each trace is an output of one program. Most
traces involve only one process and usually one
sequence is created for each trace. However,
sometimes one trace has multiple processes. In such
cases, we have extracted one sequence per process in
the original trace. Thus, each system call trace can
yield multiple sequences of system calls if the trace has
multiple processes.
Data sets have been collected from those programs
that run with privilege. Monitoring privileged
programs has several advantages over monitoring user
profiles [9] because the misuse of those programs has
the greatest potential for harm to the system. Root
processes are more dangerous than user processes
because they have access to more parts of the computer
system. They have a limited range of behavior, and
their behavior is relatively stable over time [3]. Some
of the normal traces are “synthetic” and some are
“live”. Synthetic traces are collected in production
environments by running a prepared script and the
program options are chosen solely for the purpose of
exercising the program without any real user’s
requests, whereas live traces are collected from the
program during normal usage of a production
computer system.
Without loss of generality, we assume that system
call sequence is mapped to a set of contiguous integers.
Each system call number is an instruction and is
recorded into a mapping file. Therefore, we can
understand the original instruction by finding the
integer from the mapping file. Here we should
emphasize that the same operating system with
different version will have different mapping file of
system calls. It should be noted that the experiments
must work under the same version of operating system.
4. Methodology
4.1. Architecture
Before introducing our methodology, we’d like to
bring out the architecture of our IDS. Because some
traces might have multiple processes. The traces must
be sent to preprocess so as to gain the system call
sequences which contain one process per sequence.
For the trace preprocessor in Figure 1, inputs of the
program are source data sets and output system call
sequences. After preprocessing the traces, the
sequences with different length of normal usage are
sent to frequent pattern generator. Frequent pattern
generator constructs weighted suffix tree for further
computation. Rule extractor will extract all rules that
satisfy the conditions set by users from frequent pattern
generator and collect the rules into rule set pool for
further using.
Table 1. The algorithm for weighted suffix tree
Fig. 1. Architecture of proposed system
When we collect enough information into rule set
pool, our decision engine will exploit rules that are
provided from rule set pool to analyze system calls of
the monitoring program. The decision engine will
report to user interface if an intrusion is detected.
Hence these reports will help human experts to solve
the problems easier than before.
4.2. Weighted suffix tree
A suffix tree can keep the order between system
calls that are recorded in the sequence. We can also
extract all substrings of the sequence from its suffix
tree.
Weighting the suffix tree will help us discover the
occurring frequency of the patterns. These statistics
can then be used to calculate frequent episodes later.
There is only one sequence presented in the original
suffix tree. Now we change the presentation of suffix
tree and our weighted suffix tree is consisted of all
sequences. Table 1 is the algorithm pseudo-code for
weighted suffix tree. We assume that there are two
sequences S1 = {03, 01, 03, 01, 15} and S2 = {01, 01,
03, 01} in the data set D = {S1, S2}.
The weighted suffix tree after processing sequence
S1 and S2 is shown in Figure 2. We record the traversed
number of each node when we built the tree. For
example, the rightmost node of weighted suffix tree in
Figure 2 is noted as 15(1). 15 means the system
number of that node, whereas 1 means the traversed
frequency of that node.
We extend the representation of suffix tree to deal
with multiple input sequences. We note that two or
more input sequences may have common suffixes, so
the nodes of common suffixes will increase the count
number of weight. Further, we can exploit the
information of nodes to help us improving intrusion
detection technique.
Fig. 2. The weighted suffix tree
4.3. Frequent episodes mining
Since the processes we monitored are sometimes
used intermittently, we change the setting of fixed time
window in [20] into fixed length of system call
sequence to suit the feature of our experiments.
As we mentioned before, the weighted suffix tree
algorithm has processed over data set D which contains
two sequences S1, S2. Without loss of generality, we
assume that we want to collect the rule set of a fixed
sequence of length 3. Hence we can discover a rule set
of normal usage which has the fixed sequence length of
3 (see Table 2). There are four different sequences of
length 3 collected from the weighted suffix tree. The
first column in Table 2 shows all of the sequences
which are obtained from the weighted suffix tree,
whereas the second column shows the number of the
sequences which stands for the number of appearances
of the sequence in the whole data set. We can easily
obtain the number of each sequence by reading the
Table 2. The sequence of length 3
Table 4. Two parts of IDS
Table 3. Frequent episodes mining algorithm
Fig. 3. Number of rules with different length
weighted counts of the last node of the sequence. Then
we can calculate the frequent episodes of each rule
further. Here we have a frequent episode rule example
is 03, 01 → 03 [sup = 20%, conf = 50%, length = 3],
which means 20% of total sequences (of length 3)
under analysis system call 03, 01, 03 are used in time
order, and 50% of the cases of which system calls 03,
01 have been used in order, system call 03 will be used
subsequently. Below we show the definition of
support and confidence used in our frequent episodes
mining:
sup =
the weighted count of the last node of the sequence
the sum of the weighted count of the last node of all sequences
conf =
the weighted count of the last node of the sequence
the sum of the weighted count of the last node and its all siblings
The value of confidence is a conditional probability.
It means the probability that the consequent will occur,
given that the antecedents have occurred. We found
that we can summarize the weighted counts of all the
rules with the same antecedent to be denominator and
the weighted count of the sequence be the numerator.
We discovered that the rules with the same antecedent
have the same prefix in the weighted suffix tree (see
Figure 2). Hence, the last nodes of these rules are
siblings to each other. We can easily calculate the sum
of the weighted count of the last node and its all
siblings to gain the value of denominator of
confidence.
We can exploit the mining algorithm described in
Table 3 to repeatedly extract the rule set of different
length when the weighted suffix tree has been
constructed. In the past, those who focused on analysis
of intrusion detection using system call sequences had
Fig. 4. Number of rules with different support
threshold
a lot trouble with how to choose the most appropriate
sequence length.
They chose the length from
experiment results empirically. When they need to
collect another rule set with another sequence length,
they must read whole traces again [6, 14]. Once the
weighted suffix tree has been constructed, our mining
algorithm can overcome the drawback. For example,
suppose it is the first time we collect the rule set of
length 7. After the operating, nodes of level 7 in the
whole tree are labeled as traversed. If we want to
collect the rule set of length 11 later, we can use the
same tree mining again.
4.4. Intrusion detection model
We decide to use t-stide method that is proposed by
C. Warrender et al. [24] verifies whether intrusive
events occur or not. The “t” in “t-stide” represents the
addition of this threshold on sequence frequencies.
They set a threshold to prune rare sequences which are
regarded as abnormal.
We choose support and
Table 5. The rules of length 5
confidence values of frequent episodes to be the
threshold condition in t-stide method, instead of the
occurring frequencies of the sequences that are
previously used in t-stide method. We collect
sequences of fixed length k that exceed the threshold
and store those sequences together for further using.
The traces are compared with the rule set by the
decision engine. The decision engine will report to the
user interface an alarm when mismatch rate exceeds
the threshold.
5. Experiments
In this chapter, we will discuss the performance of
execution time between the original t-stide method and
our method first. Then we will use data sets of
synthetic sendmail [3] in our experiments to show the
results of the experiments.
The experimental results obtained from [14] let us
know that a set of specific rules for normal sequences
can be used to detect any known and unknown
intrusions. Whereas having only the rules for abnormal
sequences only provides the capability to detect known
intrusions we used 80% of normal traces to construct
weighted suffix tree and generated rule sets of different
lengths from the weighted suffix tree. The rest 20% of
normal traces were used for testing. Both normal and
abnormal traces are used for testing. There are three
traces of the sscp (sunsendmail) attacks, two traces of
the decode attacks, and five traces of the error
condition- forwarding loops attacks used for testing
too.
5.1. Comparison of execution during rules
checking
Our IDS can be divided into two parts from data
collection, construction of a weighted suffix tree,
generation of rule sets, and operation of the decision
engine. As we can observe from Table 4, the data
collection, construction of the weighted suffix tree, and
the generation of rule sets are completed off-line.
Table 6. The rules of length 7
Table 7. The rules of length 9
Whereas the decision engine must response as soon as
possible if it detects any deviation of the programs so
as to improve the whole security of computer systems.
Generally speaking, an IDS wishes to have all
possible normal usages of a program to improve the
ability of identifying normal and abnormal behaviors.
Unfortunately, the increment of rules will affect the
performance of the decision engine. We assume that
the length of the trace is L, the size of sliding window
is w, and there are n rules, L>>w. The window sliding
needs to check rule set again in the t-stide method each
time. The time complexity of t-stide in worst is
O((L-w+1)*n) = O(L*n). Figure 3 shows the original
number of rules in the rule set with the length 5 to 11.
Figure 4 shows the number of rules under different
support threshold. We compare the rule sets of length
5, 7, and 9. We discover that the support threshold
becomes higher and the number of rules becomes
fewer. Hence the execution time of IDS will be
improved by exploiting a smaller rule set. Our IDS
can response much more quickly than before when an
intrusion has been detected.
5.2. Experimental results
Now, we are interested in the accuracy of intrusion
detection. Table 5, 6, 7 show that the experimental
results using the rules of length 5, 7, and 9. The first
row is the experimental result running with the method
proposed in [24]. The column of support means the
support threshold of frequent episodes. The column of
confidence means the confidence threshold of frequent
episodes. The column of normal abn% means the
mismatch rate of normal traces. We also record the
mismatch rate of the sscp (sunsendmail) attack traces
in the column of sscp abn%, the mismatch rate of the
decode attack traces in the column of decode abn%,
and the mismatch rate of the error conditionforwarding loops attack traces in the column of fwd
abn%. We compare the results of the different support
and confidence thresholds. The column of gap records
the minimal abn% of attack traces minus abn% of
normal traces.
We can see from Table 5. There are 660 rules when
we use the method of [24]. The mismatch rate of
normal traces is 3.3%. The mismatch rate of sscp
attack is 16.5%. The mismatch rate of decode attack is
5.9%. The mismatch rate of error conditionforwarding loops attack is 13.9%. Hence the mismatch
rate gap between normal traces and abnormal traces is
5.9% - 3.3% = 2.6%.
After we use different support and confidence
thresholds, we can observe that the number of rules is
between 546 ~ 256. Hence the operation of decision
engine will be faster than before. Because we use
fewer rules, some sequences in the trace will be treated
as mismatch. Hence the mismatch rate of normal
traces will be higher than the method of [24]. On the
other hand, the mismatch rate of abnormal traces will
also be higher than the method of [24]. The gaps
between normal traces and abnormal traces are 10.5%,
12.7%, 17.6%, 17.2%, 18.7%, and 16.2%, respectively.
We can discover that the gap is larger than original
method, which help us to prove that the decision
engine can explicitly recognize which trace is normal
or abnormal.
We can see from Table 6 which uses the rules of
length 7, and Table 7 using the rules of length 9. The
gap between normal traces and abnormal traces are still
large enough after pruning some rules.
6. Conclusions
An IDS needs to process a lot of data sets to carry
on analysis. This characteristic allows the application
of data mining to become an important component in
an IDS. The algorithms which were proposed before
can simply find out disorder frequent patterns. It means
that we cannot know the order of the system calls.
This characteristic is very fallible. For example, an
intrusion can use the same frequent patterns but
different order to escape the check of IDS. Weighted
suffix tree proposed by us can solve this problem
easily. Because the system calls between the rules
extracted from weighted suffix tree conserve the order
of them, therefore we can fix the fallible drawback of
the existing algorithms and improve the performance
of the IDS.
For different programs, the fittest length of the rules
will also be different, which forces an IDS to
repeatedly execute experiments with different lengths
of rule sets to find out which rule set is the most
suitable one. According to the method proposed by [3,
6, 24], constructing a new rule set of another length
needs to load the whole traces again. However, with
weighted suffix tree, our method needs to load whole
trace only one time for constructing, and this tree can
quickly extract lots of rule sets. These advantages of
weighted suffix tree help to enhance the timeefficiency of the IDS.
Frequent episodes mining can identify the rules
whose order between system calls is conserved. After
setting different support and confidence thresholds, we
can prune those rare rules. And we can use fewer rules
to speed up the operation of the decision engine.
The accuracy is the most important issue of an IDS.
The experimental results also show that the gaps
between the mismatch rate of normal traces and the
mismatch rate of abnormal traces are large after
pruning rare rules by Frequent Episodes Mining. It
means that IDS can report or alarm to the users or
experts accurately.
7. Acknowledgment
This work was sponsored by National Central
University Project of Promoting Academic Excellence
and Developing World Class Research Centers –
Applied Informatics and Creative Contents: ServiceOriented Information Platform.
8. References
[1] R. Agrawal, and A. Swami, “Fast Algorithms for Mining
Association Rules,” Proc. of the 20th Int’l Conf. on Very
Large Data Bases, pp. 487-499, Santiago, Chile, 1994.
[2] Y. Bai, and H. Kobayashi, “Intrusion Detection Systems:
Technology and Development,” Proc. of the 17th IEEE Int’l
Conf. on Advanced Information Networking and
Applications, pp. 710-715, 2003.
[3] S. Forrest, S. A. Hofmeyr, A. Somayaji, and T. A.
Longstaff, “A Sense of Self for Unix Processes,” Proc. of the
1996 IEEE Symp. on Security and Privacy, pp. 120-128, Los
Alamitos, 1996.
[4] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “The
KDD Process of Eextracting Uuseful Knowledge from
Volumes of Data,” Communication of the ACM, vol. 39, no.
11, pp. 27-34, 1996.
[5] S. Hossain, S. M. Bridges, and R. B. Vaughn, “Adaptive
Intrusion Detection with Data Mining,” Proc. of IEEE Int’l
Conf. on Systems, Man and Cybernetics, pp. 3097-3103,
2003.
[6] S. A. Hofmeyr, S. Forrest, and A. Somayaji, “Intrusion
Detection Using Sequences of System Calls,” J. Computer
Security, vol. 6, pp. 151-180, 1998.
[7] K. Ilgun, R. A. Kemmerer, and P. A. Porras, “State
Transition Analysis: A Rule-Based Intrusion Detection
Approach,” IEEE Trans. on Software Engineering, vol. 21,
no. 3, pp. 181-199, 1995.
[8] D-K Kang, D. Fuller, and V. Honavar, ”Learning
Classifiers for Misuse Detection Using a Bag of System Calls
Representation,” Proc. of IEEE Int’l Conf. on Intelligence
and Security Informatics, pp. 511-516, Atlanta, GA, USA,
2005.
[9] C. Ko, G. Fink, and K. Levitt, “Automated Detection of
Vulnerabilities in Privileged Programs by Execution
Monitoring,” Proc. of the 10th Annual Computer Security
Applications Conference, pp. 134-144, Dec. 1994.
[10] S. Kumar, and E. H. Spafford, “A Software Architecture
to Support Misuse Intrusion Detection,” Proc. of the 18th
National Information Security Conference, pp. 194-204,
1995.
[11] W. Lee, “A Data Mining Framework for Constructing
Features and Model for Intrusion Detection System,” Ph.D
Dissertations, Columbia University, New York, USA 1999.
[12] C.-T. Lu, A. P. Boedihardjo, and P. Manalwar,
“Exploiting Efficient Data Mining Techniques to Enhance
Intrusion Detection Systems,” Proc. of the IEEE Int’l Conf.
on Information Reuse and Integration, pp. 512-517, Las
Vegas, Nevada, 2005.
[13] T.-R. Li, and W.-M. Pan, “Intrusion Detection System
Based on New Association Rule Mining Model,” Proc. of
IEEE Int’l Conf. on Granular Computing, pp. 512-515,
Beijing, China, 2005.
[14] W. Lee, and S. J. Stolfo, “Data Mining Approaches for
Intrusion Detection,” Proc. of the 7th USENIX Security
Symposium, San Antonio, Texas, 1998.
[15] W. Lee, and S. J. Stolfo, “A Framework for
Constructing Features and Models for Intrusion Detection
Systems,” ACM Trans. on Information and System Security,
vol. 3, no. 4, pp. 227-261, 2000.
[16] W. Lee, S. J. Stolfo, and P. K. Chan, “Learning Patterns
from Unix Process Execution Traces for Intrusion
Detection,” Proc. of the AAAI97 Workshop on AI
Approaches to Fraud Detection and Risk Management, pp.
50-56, 1997.
[17] W. Lee, S. J. Stolfo, and K. W. Mok, “A Data Mining
Framework for Building Intrusion Detection Model,” Proc.
of the IEEE Symp. on Security and Privacy, pp. 120-132,
Oakland, CA, May 1999.
[18] T. Lunt, A. Tamaru, F. Gilham, R. Jagannathan, P.
Neumann, H. Javitz, A. Valdes, and T. Garvey, ”A RealTime Intrusion Detection Expert System (IDES) –Final
Technical Report,” Technical Report, Computer Science
Laboratory, SRI International, Menlo Park, California, 1992.
[19] E. M. McCreight, “A Space-Economical Suffix Tree
Construction Algorithm,” J. ACM, vol. 23, no. 2, pp. 262272, 1976.
[20] H. Mannila, H. Toivonen, and A. I. Verkamo,
“Discovering Frequent Episodes in Sequences,” Proc. of the
1st Int’l Conf. on Knowledge Discovery in Databases and
Data Mining, pp. 210-215, Montreal, Canada, 1995.
[21] S. J. Stolfo, W. Lee, P. K. Chan, W. Fan, and E. Eskin,
“Special Section on Data Mining for Intrusion Detection and
Threat Analysis: Data Mining-Based Intrusion Detectors: An
Overview of the Columbia IDS Project,” ACM SIGMOD
Record, vol. 30, no. 4, pp. 5-14, Dec. 2001.
[22] E. Ukkonen, “On-Line Construction of Suffix Trees,”
Algorithmica, vol. 14, no. 3, pp. 249-260, 1995.
[23] P. Weiner, “Linear Pattern Matching Algorithm,” Proc.
of the 14th IEEE Symp. on Switching and Automata Theory,
pp 1-11, 1973.
[24] C. Warrender, S. Forrest, and B. Pearlmutter, “Detecting
Intrusions Using System Calls: Alternative Data Models,”
Proc. of the IEEE Symp. on Security and Privacy, pp. 133145, 1999.
[25] M.-J. Xu, and G. Gu, “Method and Usage of Mining
Association Rules in System Call Serial,” Computer Systems
& Applications, no. 4, pp. 26-28, 2004.
[26] M. M. Yasin, and A. A. Awan, ”A Study of Host-based
IDS Using System Calls,” Proc. of the IEEE Int’l Conf. on
Networking and Communication, pp. 36-41, 2004.