Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, 18-21 August 2005
DISTRIBUTED INTRUSION DETECTION BASED ON CLUSTERING
YU-FANG ZHANG, ZHONG-YANG XIONG, XIU-QIONG WANG
Department of Computer Science, Chongqing University, Chongqing, China, 400044
E-MAIL: [email protected], [email protected], [email protected]
Abstract:
Research on distributed intrusion detection systems (DIDS) is a rapidly growing area of interest because existing centralized intrusion detection system (IDS) techniques are increasingly unable to protect the global distributed information infrastructure. The distributed analysis employed by Agent-based DIDS is a widely accepted approach, and clustering-based intrusion detection overcomes the drawback of relying on labeled training data, on which most current anomaly-based intrusion detection depends. A clustering-based DIDS technique that combines the advantages of both is presented. To choose attacks effectively, clustering is applied twice: the first clustering selects candidate anomalies at each Agent IDS, and the second clustering selects the true attacks at the central IDS. Finally, experiments on the KDD CUP 1999 network connection records verify that the proposed method performs well.
Keywords:
Intrusion detection; Distributed intrusion detection system; Anomaly detection; Clustering; Data mining
1. Introduction
The intrusion detection system (IDS) is a critical topic in network security research. Misuse detection and anomaly detection, on which much research has been done, are the two basic approaches to intrusion detection. A misuse-based IDS performs signature analysis by comparing ongoing activities with patterns representing known attacks or weak spots of the system, in an attempt to recognize similar attacks. An anomaly-based IDS detects intrusions by identifying activities that differ from a user's or a system's normal behavior. Anomaly detection tries to establish normal user-group and host profiles, and subsequent activity is checked against these normal profiles; all deviations from the normal profiles can be flagged as suspicious activities or intrusions. By themselves, the two approaches are better suited to local incident monitoring and analysis, and they are effectively applied in traditional IDSs built on a centralized model with one-way data collection that processes incidents at only a single site.
However, with the growing complexity of the network environment and the appearance of coordinated attacks involving multiple attackers, traditional centralized IDS techniques constitute a passive information-processing paradigm that is increasingly unable to protect the global information infrastructure. Distributed intrusion detection system (DIDS) techniques have therefore started to evolve and have become a very important topic in security research in recent years.
Although different organizations describe DIDS differently, the definitions share the common goal of protecting the global information infrastructure. A DIDS can be defined as a system that "consists of multiple Intrusion Detection Systems (IDS) over a large network, all of which communicate with each other, or with a central server that facilitates advanced network monitoring, incident analysis, and instant attack data" [1]. A DIDS, however, faces a new problem: it must analyze the massive aggregated data generated by the individual IDSs placed across large-scale networks. Analyzing such a huge amount of data with high efficiency is a considerably complicated and comprehensive process.
Data mining is recognized as a useful tool for discovering interesting patterns and knowledge from large amounts of data, and it has therefore been investigated for use in intrusion detection. Clustering groups a set of data items that exhibit similar characteristics into meaningful subclasses according to some pre-defined metric, so that members of the same cluster are quite similar to one another while members of different clusters are quite dissimilar. Over the past years, combining anomaly-based centralized IDS with clustering techniques has been an important line of research for finding new and unknown intrusion methods, and it has achieved success.
Combining DIDS with anomaly-based clustering techniques was developed to analyze more effectively the massive data collected from a large network. To improve detection efficiency and accuracy, this paper
presents a second clustering step in DIDS to complete the global extraction of information about intruder actions.
The remaining sections of this paper are organized as follows. After discussing related work in Section 2, Section 3 focuses on how to apply clustering techniques to an Agent-based DIDS. Section 4 presents experimental results indicating that the proposed method appears very promising when detecting intrusions in the DARPA'98 data, and Section 5 concludes the paper.
2. Related work
2.1. DIDS techniques based on data analysis
Over the past years many DIDSs have been developed. These schemes resulted from research on methods of aggregating the data generated by individual intrusion detection systems placed across large-scale networks. Generally speaking, there are two ways to analyze the data: central analysis and distributed analysis.
A typical example of a DIDS based on central analysis was developed by the Division of Computer Science, University of California, Davis in the early 1990s [2]. It was the first intrusion detection system to aggregate audit reports from a collection of hosts on a single network. The architecture consists of a host manager, a monitoring process or collection of processes running in the background. The system has two data collectors: one collects host data and the other collects network data. All data are transferred to a main site, or analyzer, and analyzed there. Obviously, this architecture runs into the "system bottleneck" problem. Firstly, transferring all data to the main site without preprocessing occupies a large amount of network bandwidth. Secondly, a single main analyzer has a limited ability to deal with massive data. Thirdly, the main analyzer is a single point of failure, and the security of the communication between sites is not guaranteed without appropriate security measures.
To overcome the drawbacks of central analysis, distributed analysis started to evolve. Several Agent-based DIDSs were developed in recent years, and all adopted the distributed analysis technique. AAFID (Autonomous Agents For Intrusion Detection) [3] is a distributed anomaly detection system that employs autonomous agents at the lowest level for data collection and analysis. At the higher levels of the hierarchy, transceivers and monitors are used to obtain a global view of activities. MAIDS (Mobile Agent Intrusion Detection System) [4] developed agents that transfer the different types of collected data to other agents capable of analyzing each data type. Ingram et al. [5] presented a technique in which detection agents run in a distributed fashion, each operating independently of the others while cooperating and communicating to provide a truly distributed detection mechanism without a central processing location. From these examples, agent techniques and distributed analysis have become the favored approach for DIDS.
2.2. Clustering-based intrusion detection with unlabeled data
Establishing a normal profile in anomaly detection requires the training dataset to be free of attacks. If some attacks are buried within the training dataset, the algorithm may not identify those attacks during detection because they are regarded as normal instances. However, obtaining a purely normal dataset is considerably expensive, since all attacks, including new attacks, must be removed. Not only does a huge amount of audit or network data have to be labeled, but classifying the training data is also extremely difficult, mainly due to the high cost of obtaining proper attack labels. Simulating attacks can build a labeled dataset, but only known attacks can be simulated, so such a training dataset cannot respond to new types of attacks that may appear in the future. Eskin et al. [6] developed an unsupervised anomaly detection technique for unlabeled data. The algorithm, based on clustering techniques and a simple distance-based measure, groups instances into several clusters and labels the points in the small clusters as attacks. Its primary merits are that abnormal activities are detected automatically without much human intervention and that the computational complexity is low. Thanks to the modularized design, experimenting with clustering algorithms on unlabeled data and with detection mechanisms is easy.
In this paper, a clustering-based DIDS technique is given that combines the advantages of Agent-based distributed analysis and clustering-based intrusion detection with unlabeled data. Two points are discussed in detail: first, how to cluster the input dataset and choose the candidate anomalies at each Agent IDS; second, how to cluster with a single-linkage algorithm to identify the true attacks at the central IDS node.
3. Agent-based DIDS using clustering
The simple architecture of a DIDS based on distributed analysis is shown in Figure 1. The central IDS node and an Agent IDS may physically reside on the same computer, but they are logically independent and communicate bi-directionally with each other. Several central IDSs can reside on several hosts, but only one is running at a time. If the running central IDS loses communication with the Agent IDSs, another central IDS takes over the task, so there is no single point of failure.
Figure 1. The simple architecture of DIDS: a central IDS node communicating with several Agent IDSs, each collecting from its own data sources.
3.1. Agent IDS
In this paper, all agents are primarily responsible for collecting network data from the data sources, normalizing it, analyzing it, and choosing candidate anomalies from the analysis. The details are described as follows.
3.1.1. Data collection and normalization
While running, an agent collects a huge number of raw network packets from different sources with tools such as tcpdump. The binary raw network data are translated into instances with a uniform vector format, and these instances are saved into a database. The instances include many features, such as src_host (the source IP), dst_host (the destination IP), src_bytes (the number of data bytes from source to destination), dst_bytes (the number of data bytes from destination to source), and so on.
Before clustering, the data instances are normalized to a standard form to address the problem that different features are on different scales. Normalization takes as input the translated vector X and transforms it into a normalized vector Z, such that summation over different attributes becomes meaningful, because the attributes may be in different units [7]. Given the data instances for a variable f, f is converted into a standard measurement with the following formulas:
z_{if} = \frac{x_{if} - m_f}{s_f}    (1)

where m_f = \frac{1}{n}\sum_{i=1}^{n} x_{if} and s_f = \frac{1}{n}\sum_{i=1}^{n} |x_{if} - m_f|;
where x_{1f}, x_{2f}, …, x_{nf} are the n measurements of f.
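As a concrete illustration, the following Python sketch (our own helper, not part of the paper) applies formula (1) to a small feature matrix:

import numpy as np

def normalize(X):
    # Formula (1): z_if = (x_if - m_f) / s_f, where m_f is the mean of
    # feature f and s_f is its mean absolute deviation.
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)
    s = np.abs(X - m).mean(axis=0)
    s[s == 0] = 1.0               # guard against constant features
    return (X - m) / s

# Illustrative use on two toy connection records (src_bytes, dst_bytes)
print(normalize([[200.0, 4000.0], [800.0, 1000.0]]))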
3.1.2. Metric for distance measure
The dissimilarity between instances is measured by the distance between them. Finding or constructing an appropriate metric is critical to the performance of the method. Sometimes individual features are weighted according to their perceived importance, but the resulting gains in performance are usually not significant and the weighting may weaken the system's generality. In this paper, a metric with equally weighted features is used, and the most popular distance measure, the Euclidean distance, measures the dissimilarity between instances:

distance(z_i, z_j) = \sqrt{\sum_{n=1}^{p} (z_{in} - z_{jn})^2}    (2)
3.1.3. Cluster-based algorithm for Agent IDS
A relatively simple clustering algorithm was chosen because a simple approach has low time complexity, which lets attention be focused on investigating effectiveness. The clustering part of the algorithm is essentially the same as the one used in [6], but it differs in how the distance between a cluster and an instance is computed and in how anomalies are chosen.
Input: dataset D = {d1, d2, …, dn}; cluster width W; percentage N.
Output: the set of clusters and the candidate anomaly cluster C.
1. C1 ← {d1}; o1 ← d1 (feature vector); num_cluster = 1; D ← D – {d1};
// Phase 1: creating clusters
2. for each di ∈ D {
3.   min_distance = ∞; min_num = 0;
4.   for j = 1 to num_cluster
5.     if min_distance > distance(di, oj) then {
         min_distance ← distance(di, oj); min_num = j; }
6.   if min_distance ≤ W then {
       C_min_num ← C_min_num ∪ {di};
       D ← D – {di};
       o_min_num ← average vector of cluster C_min_num; }
7.   else {
       num_cluster = num_cluster + 1;
       C_num_cluster ← {di};
       o_num_cluster ← di (feature vector); }
   }
// Phase 2: choosing the candidate anomalies
8. sort the clusters Ci according to N(Ci) from small to large;
9. choose the small clusters and merge them into the candidate anomaly cluster C.
Figure 2. The algorithm for choosing candidate anomalies
For each cluster Ci, we denote the number of its points by N(Ci). Phase 2 first sorts the clusters according to N(Ci). Second, the algorithm chooses the small clusters as candidate anomaly sets and merges these candidate sets into a single cluster C. The total number of points in these small clusters is approximately equal to the percentage N multiplied by the number of points in the input dataset D.
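As a minimal Python sketch of the agent-side procedure in Figure 2 (the names and details are ours; it assumes the instances are already normalized with formula (1) and uses the Euclidean distance of formula (2)):

import numpy as np

def euclidean(a, b):
    # Formula (2): Euclidean distance between two normalized vectors
    return float(np.sqrt(np.sum((a - b) ** 2)))

def agent_clustering(Z, W, N):
    # Z: normalized instances, W: cluster width, N: fraction flagged as candidates
    clusters = [[0]]                 # lists of instance indices
    centers = [Z[0].copy()]          # o_j: average vector of each cluster
    # Phase 1: assign every instance to the nearest cluster or open a new one
    for i in range(1, len(Z)):
        dists = [euclidean(Z[i], c) for c in centers]
        j = int(np.argmin(dists))
        if dists[j] <= W:
            clusters[j].append(i)
            centers[j] = Z[clusters[j]].mean(axis=0)
        else:
            clusters.append([i])
            centers.append(Z[i].copy())
    # Phase 2: merge the smallest clusters until they hold about N of the data
    order = sorted(range(len(clusters)), key=lambda k: len(clusters[k]))
    candidates, budget = [], int(N * len(Z))
    for k in order:
        if len(candidates) >= budget:
            break
        candidates.extend(clusters[k])
    return clusters, candidates

# Illustrative use with random data; W and N follow Section 4.2 (W = 40, N = 15%)
Z = np.random.randn(500, 10)
clusters, candidate_anomalies = agent_clustering(Z, W=40.0, N=0.15)
print(len(clusters), len(candidate_anomalies))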
3.2. Central IDS
For most DIDSs, the central IDS node is the heart and soul that controls and coordinates all agents. In this paper we only discuss the second round of data mining, i.e., analyzing the candidate anomalies transferred from all agents.
Input: candidate anomaly cluster C containing n objects.
Output: attack cluster CA.
// Phase 1: creating clusters
1. set the n objects as n clusters: C1, C2, …, Cn;
2. num_cluster = n;
3. repeat
4.   arbitrarily choose a cluster Ci (i ≤ num_cluster);
5.   compute the distances between Ci and the other clusters;
6.   merge the two closest clusters;
7.   num_cluster = num_cluster - 1;
8. until num_cluster = K;
// Phase 2: choosing the true attacks
9.  calculate the distances between all pairs of clusters;
10. choose the shortest distance d_min_i for every cluster Ci;
11. sort the clusters Ci according to d_min_i from large to small;
12. merge the top k clusters into cluster CA.
Figure 3. The algorithm for choosing attacks
Because the set of candidate anomalies is not large, time complexity is less of a concern, so a slight variant of the single-linkage algorithm is used to cluster accurately. The number of clusters K and the number k are fixed for choosing the true attack sets. Assume the input candidate anomaly cluster C contains n objects. The distance between the average vectors of two clusters determines the distance between the clusters. The algorithm is again divided into two phases: Phase 1 creates clusters using the single-linkage algorithm and Phase 2 detects the true attacks. Figure 3 illustrates the steps of the two phases.
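The following Python sketch is one possible reading of Figure 3; the names are ours, and the merge and tie-breaking details are assumed where the pseudocode is terse. Cluster-to-cluster distances are taken between average vectors, as stated above.

import numpy as np

def central_clustering(C, K, k):
    # C: candidate anomaly instances, K: target cluster count,
    # k: number of most isolated clusters reported as attacks (Figure 3)
    clusters = [[i] for i in range(len(C))]          # start with n singleton clusters
    centers = [C[i].copy() for i in range(len(C))]   # cluster average vectors

    def dist(a, b):
        return float(np.sqrt(np.sum((centers[a] - centers[b]) ** 2)))

    # Phase 1: repeatedly merge the closest pair of clusters until K remain
    while len(clusters) > K:
        pairs = [(dist(a, b), a, b)
                 for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a].extend(clusters[b])
        centers[a] = C[clusters[a]].mean(axis=0)
        del clusters[b]
        del centers[b]

    # Phase 2: find each cluster's nearest-neighbour distance d_min_i,
    # sort from large to small, and merge the k most isolated clusters as attacks
    d_min = [min(dist(a, b) for b in range(len(clusters)) if b != a)
             for a in range(len(clusters))]
    order = sorted(range(len(clusters)), key=lambda a: d_min[a], reverse=True)
    return [i for a in order[:k] for i in clusters[a]]

# Illustrative use with Section 4.2's setting K = 15 and a small k
C = np.random.randn(200, 10)
attacks = central_clustering(C, K=15, k=4)
print(len(attacks))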
4. Experimental evaluation
4.1. Evaluation data
To evaluate the performance of the above system thoroughly, the authoritative network intrusion dataset KDD [8], which was obtained by simulating a large number of different types of intrusions in a military network environment, is used. Each record in the data set contains 41 features describing a network connection. To make the data more realistic, we sampled data sets consisting of 1 to 1.5% attack instances and 98.5 to 99% normal instances. Each data set is split into two parts: a training dataset (40%) and a test dataset (60%). To simulate the distributed environment, the test data is divided equally into four data sets that simulate the instances collected by different Agent IDSs.
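A small sketch of this data preparation under our assumptions (synthetic stand-in data; the sampling ratio, the 40/60 split, and the four-way distribution follow the text):

import numpy as np

def prepare(records, labels, attack_ratio=0.015, train_frac=0.40, n_agents=4, seed=0):
    # Sample ~attack_ratio attacks, split 40/60 into train/test,
    # and distribute the test set equally across n_agents (Section 4.1).
    rng = np.random.default_rng(seed)
    normal = np.flatnonzero(labels == 0)
    attack = np.flatnonzero(labels == 1)
    n_attack = int(attack_ratio * len(normal) / (1 - attack_ratio))
    chosen = np.concatenate([normal, rng.choice(attack, size=n_attack, replace=False)])
    rng.shuffle(chosen)
    split = int(train_frac * len(chosen))
    train, test = chosen[:split], chosen[split:]
    agent_sets = np.array_split(test, n_agents)      # one subset per Agent IDS
    return records[train], [records[a] for a in agent_sets]

# Illustrative use with synthetic data standing in for the KDD records
records = np.random.randn(10000, 41)                 # 41 features per connection
labels = (np.random.rand(10000) < 0.2).astype(int)
train, agent_data = prepare(records, labels)
print(train.shape, [a.shape for a in agent_data])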
4.2. Evaluation parameters
For the algorithm that chooses candidate anomalies, the parameter-setting method of [6] is adopted. From the experimental results over the training data sets, the cluster width is fixed at W = 40 in the feature space and the percentage at N = 15%.
For the algorithm that chooses attacks, K is fixed at 15 and k is varied to evaluate the performance in the experiments.
4.3. Experimental performance
Two major indicators of performance are used: the detection rate and the false positive rate. The detection rate equals the number of correctly detected intrusions divided by the total number of intrusions in the data set, while the false positive rate equals the number of normal instances that were incorrectly regarded as attacks divided by the total number of normal instances in the data set. The detection rate is expected to be as large as possible, while the false positive rate is expected to be as small as possible. The average results over several datasets for the same parameter k are shown in Table 1.
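A short helper (ours, not from the paper) that computes the two rates from true and predicted labels:

import numpy as np

def rates(y_true, y_pred):
    # y_true, y_pred: arrays of 0 (normal) / 1 (attack)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    detection_rate = np.mean(y_pred[y_true == 1] == 1)       # detected / all intrusions
    false_positive_rate = np.mean(y_pred[y_true == 0] == 1)  # flagged normals / all normals
    return detection_rate, false_positive_rate

# e.g. rates([1, 1, 0, 0, 0], [1, 0, 0, 1, 0]) -> (0.5, 0.333...)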
An ROC (Receiver Operating Characteristic) [9] curve is computed from the above experimental results and illustrated in Figure 4; it shows how the detection rate and false alarm rate vary when different thresholds are used.
Table 1. The experimental results

k     Detection Rate (%)     False Positive Rate (%)
4     15                     0.2
6     28                     0.8
8     42                     1.4
10    58                     2.3
12    67                     5.1
14    75                     8.2
15    90                     16
Figure 4. ROC curve showing how the detection rate and false alarm rate vary with the detection threshold.
5. Conclusions
Research on distributed intrusion detection is still a young domain, and no mature architecture exists yet. The contribution presented in this paper is a trial application of distributed analysis based on clustering with unlabeled data. The experimental results indicate that the method is feasible for detecting attacks in a DIDS.
Our future work involves combining the clustering technique with a real homogeneous distributed system and, furthermore, with a heterogeneous distributed system.
References
[1] Nathan Einwechter. An Introduction to Distributed Intrusion Detection Systems. Security Focus, January 2001.
[2] J. Brentano, S. Snapp, G. Dias, et al. DIDS (Distributed Intrusion Detection System): Motivation, Architecture, and an Early Prototype. In Fourteenth National Computer Security Conference, Washington, DC, October 1991.
[3] Eugene H. Spafford and Diego Zamboni. Intrusion detection using autonomous agents. Computer Networks, 34(4):547-570, October 2000.
[4] Mark Slagell. The design and implementation of MAIDS (mobile agent intrusion detection system). Technical Report TR01-07, Iowa State University Department of Computer Science, Ames, IA, USA, 2001.
[5] Major Dennis J. Ingram, H Steven Kremer, and Neil C.
Rowe. Distributed Intrusion Detection for Computer
Systems Using Communicating Agents.
[6] Eleazar Eskin, Andrew Arnold, Michael Prerau,
Leonid Portnoy and Salvatore Stolfo. A Geometric
Framework for Unsupervised Anomaly Detection:
Detecting Intrusions in Unlabeled Data. Applications
of Data Mining in Computer Security. Kluwer 2002.
[7] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2001.
[8] The Third International Knowledge Discovery and Data Mining Tools Competition dataset (KDD99 Cup). http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 1999.
[9] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, July 1998.