Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, 18-21 August 2005

DISTRIBUTED INTRUSION DETECTION BASED ON CLUSTERING

YU-FANG ZHANG, ZHONG-YANG XIONG, XIU-QIONG WANG
Department of Computer Science, Chongqing University, Chongqing, China, 400044
E-MAIL: [email protected], [email protected], [email protected]

Abstract:
Research on distributed intrusion detection systems (DIDS) is a rapidly growing area of interest, because centralized intrusion detection system (IDS) techniques are increasingly unable to protect the global distributed information infrastructure. The distributed analysis employed by agent-based DIDS is a widely accepted approach, and clustering-based intrusion detection overcomes the reliance on labeled training data on which most current anomaly-based intrusion detection depends. This paper presents a clustering-based DIDS technique that combines the advantages of both. To choose attacks effectively, clustering is applied twice: a first clustering selects candidate anomalies at each agent IDS, and a second clustering selects the true attacks at the central IDS. Experiments on the KDD CUP 1999 records of network connections verify that the proposed method performs well.

Keywords:
Intrusion detection; Distributed intrusion detection system; Anomaly detection; Clustering; Data mining

1. Introduction

Intrusion detection systems (IDS) are a critical issue in network security research. Misuse detection and anomaly detection, both of which have been studied extensively, are the two basic approaches to intrusion detection. A misuse-based IDS performs signature analysis, comparing ongoing activities with patterns representing known attacks or weak spots of the system in an attempt to recognize similar attacks. An anomaly-based IDS detects intrusions by identifying activities that differ from a user's or a system's normal behavior.
Anomaly detection first establishes normal user and host profiles; subsequent activity is checked against these profiles, and any deviation from them can be flagged as suspicious activity or an intrusion. By themselves, the two approaches are better suited to local incident monitoring and analysis, and they are effective in traditional centralized IDSs, which collect data one-way and process incidents at a single site. However, with the growing complexity of network environments and the appearance of coordinated attacks involving multiple attackers, traditional centralized IDS techniques form a passive information processing paradigm that is increasingly unable to protect the global information infrastructure. Distributed intrusion detection system (DIDS) techniques have therefore started to evolve and have become an important topic of security research in recent years. Although different organizations describe DIDS differently, the definitions share the common goal of protecting the global information infrastructure. A DIDS can be defined as a system that "consists of multiple Intrusion Detection Systems (IDS) over a large network, all of which communicate with each other, or with a central server that facilitates advanced network monitoring, incident analysis, and instant attack data" [1]. A DIDS, however, faces a new problem: it must analyze the massive aggregated data generated by the individual IDSs placed across large-scale networks. Analyzing such a huge amount of data efficiently is a considerably complicated and comprehensive process. Data mining is recognized as a useful tool for discovering interesting patterns and knowledge from large amounts of data, and has therefore been investigated for use in intrusion detection.
Clustering groups a set of data exhibiting similar characteristics into meaningful subclasses according to some pre-defined metric, so that members of the same cluster are quite similar to one another while members of different clusters are quite dissimilar. Over the past years, combining anomaly-based centralized IDS with clustering techniques has been an important line of research for finding new and unknown intrusion methods, and it has achieved success. Combining DIDS with anomaly-based clustering techniques was then developed to analyze the massive data collected from large networks more effectively. To improve detection efficiency and accuracy, this paper presents a second-clustering method in DIDS for completing the global extraction of intruder actions.

The remainder of this paper is organized as follows. After discussing related work in Section 2, Section 3 focuses on how to apply clustering techniques to agent-based DIDS. Section 4 presents experimental results indicating that the proposed method is very promising when detecting intrusions in the DARPA'98 data, and Section 5 concludes.

2. Related work

2.1. DIDS techniques based on data analysis

Over the past years many DIDSs have been developed. These schemes resulted from research on methods of aggregating the data generated by individual intrusion detection systems placed across large-scale networks. Generally speaking, there are two ways to analyze the data: central analysis and distributed analysis.

A typical DIDS based on central analysis was developed by the Division of Computer Science, University of California, Davis in the 1990s [2]. It was the first intrusion detection system to aggregate audit reports from a collection of hosts on a single network. The architecture consists of a host manager, a monitoring process or collection of processes running in the background. The system has two data collectors: one collects host data and the other collects network data. All data are transferred to a main site, or analyzer, and analyzed there. Obviously, this architecture runs into the "system bottleneck" problem. First, transferring all data to the main site without preprocessing occupies a large amount of network bandwidth. Second, a single main analyzer has only limited ability to deal with massive data. Third, the main analyzer is a single point of failure, and the security of the communication between sites is not guaranteed without appropriate security measures.

To overcome the drawbacks of central analysis, distributed analysis started to evolve. Several agent-based DIDSs have been developed in recent years, all adopting distributed analysis. AAFID (Autonomous Agents For Intrusion Detection) [3] is a distributed anomaly detection system that employs autonomous agents at the lowest level for data collection and analysis; at the higher levels of the hierarchy, transceivers and monitors are used to obtain a global view of activities. MAIDS (Mobile Agent Intrusion Detection System) [4] developed agents that transfer the different types of collected data to other agents capable of analyzing those data types. Ingram et al. [5] presented a technique in which detection agents run in a distributed fashion, independently of one another, while cooperating and communicating to provide a truly distributed detection mechanism without a central processing location. As these examples show, agent techniques and distributed analysis have become the prevailing approach for DIDS.

2.2. Clustering-based intrusion detection with unlabeled data

Establishing a normal profile in anomaly detection requires the training dataset to be free of attacks. If some attacks are buried within the training dataset, the algorithm may fail to identify them during detection because they are regarded as normal instances. However, obtaining a purely normal dataset, with all attacks removed (including new attacks), is considerably expensive: not only must a huge amount of audit or network data be labeled, but classifying training data is extremely difficult, mainly because of the high cost of properly labeling attacks. Simulating attacks can build a labeled dataset, but only known attacks can be simulated, so such a training dataset cannot respond to new types of attacks that may appear in the future. Eskin et al. [6] developed an unsupervised anomaly detection technique for unlabeled data. Their algorithm, based on clustering and a simple distance measure, groups instances into several clusters and labels the points in the small clusters as attacks. Its primary merits are that abnormal activities are detected automatically without much human intervention and that the computational complexity is low. Thanks to its modularized design, it is easy to experiment with clustering algorithms on unlabeled data and with detection mechanisms.

In this paper, a clustering-based DIDS technique is given that combines the advantages of agent-based distributed analysis and of clustering-based intrusion detection with unlabeled data. Two points are discussed in detail: first, how to cluster the input dataset and choose the candidate anomalies at each agent IDS; second, how to cluster with a single-linkage algorithm to confirm the true attacks at the central IDS node.

3. Agent-based DIDS using clustering

The simple architecture of a DIDS based on distributed analysis is shown in Figure 1.
The central IDS node and the agent IDSs may physically reside on the same computer, but they are logically independent and communicate bi-directionally with each other. Several central IDSs can reside on several hosts, but only one runs at a time. If the running central IDS loses communication with the agent IDSs, another central IDS takes over its task, so there is no single-point-of-failure problem.

Figure 1. The simple architecture of DIDS: a central IDS node communicating bi-directionally with several agent IDSs, each collecting from its own data sources

3.1. Agent IDS

In this paper, all agents are primarily responsible for collecting network data from the data sources, normalizing the data, analyzing them, and choosing candidate anomalies from the analysis. The details are described below.

3.1.1. Data collection and normalization

While running, an agent collects a huge amount of raw network packets from different sources with tools such as tcpdump. The binary raw network data are translated into instances with a uniform vector format, and these instances are saved into a database. The instances include many features, such as src_host (the source IP), dst_host (the destination IP), src_bytes (the number of data bytes from source to destination), and dst_bytes (the number of data bytes from destination to source). Before clustering, the data instances are normalized to a standard form to deal with the problem that different features are on different scales. Normalization takes the translated vector X as input and transforms it into a normalized vector Z, so that summation over different attributes becomes meaningful even though they may be in different units [7]. Given the data instances for a variable f, f is converted into a standard measurement as follows:

$z_{if} = \frac{x_{if} - m_f}{s_f}$   (1)

where $m_f = \frac{1}{n}\sum_{i=1}^{n} x_{if}$, $s_f = \frac{1}{n}\sum_{i=1}^{n} |x_{if} - m_f|$, and $x_{1f}, x_{2f}, \ldots, x_{nf}$ are the $n$ measurements of $f$.

3.1.2. Metric for distance measure

The dissimilarity between instances is measured by the distance between them, and finding or constructing an appropriate metric is critical to the performance of the method. Sometimes features are given weights according to their perceived importance; however, the performance gains from weighted metrics are not significant, and weighting may weaken the system's generality. In this paper a metric with equally weighted features is used, and the most popular distance measure, the Euclidean distance, measures the dissimilarity between instances:

$\mathrm{distance}(z_i, z_j) = \sqrt{\sum_{n=1}^{p} (z_{in} - z_{jn})^2}$   (2)

3.1.3. Cluster-based algorithm for the agent IDS

A relatively simple clustering algorithm was chosen, because a simple approach has low time complexity and leaves more attention for investigating effectiveness. The clustering part of the algorithm is essentially the same as the one used in [6], but differs in how the distance between a cluster and an instance is computed and in how the anomalies are chosen.

Input: dataset D = {d1, d2, …, dn}
Output: set of clusters C
// Phase 1: creating clusters
1. C1 ← {d1}; o1 ← d1; num_cluster = 1; D ← D − {d1};
2. for each di ∈ D {
3.   min_distance = +∞; min_num = 0;
4.   for j = 1 to num_cluster
5.     if min_distance > distance(di, oj) then { min_distance ← distance(di, oj); min_num = j; }
6.   if min_distance ≤ W then { Cmin_num ← Cmin_num ∪ {di}; D ← D − {di}; omin_num ← average vector of cluster Cmin_num; }
7.   else { num_cluster = num_cluster + 1; Cnum_cluster ← {di}; onum_cluster ← di; }
   }
// Phase 2: choosing the candidate anomalies
8. sort the Ci by N(Ci) in ascending order;
9. choose and merge the small clusters as candidate anomalies.

Figure 2. The algorithm for choosing candidate anomalies

For each cluster Ci, define N(Ci) as the number of its points. Phase 2 first sorts the clusters by N(Ci), then chooses the small clusters as the candidate anomaly sets and merges them into one cluster C. The total number of points in these small clusters approximately equals the percentage N multiplied by the number of points in the input dataset D.

3.2. Central IDS

For most DIDSs, the central IDS node is the heart and soul that controls and coordinates all agents. In this paper we discuss only the second round of mining, i.e., the analysis of the candidate anomalies transferred from all the agents.

Input: candidate anomaly cluster C containing n objects
Output: attack cluster CA
// Phase 1: creating clusters
1. set the n objects as n clusters C1, C2, …, Cn;
2. num_cluster = n;
3. repeat
4.   arbitrarily choose a cluster Ci (i ≤ num_cluster);
5.   compute the distances between Ci and the other clusters;
6.   merge the closest pair of clusters;
7.   num_cluster = num_cluster − 1;
8. until num_cluster = K;
// Phase 2: choosing the true attacks
9. calculate the distances between all pairs of clusters;
10. choose the shortest distance d_min_i for every cluster Ci;
11. sort the Ci by d_min_i in descending order;
12. merge the top k clusters as cluster CA.

Figure 3. The algorithm for choosing attacks

Because the candidate anomaly set is not large, time complexity is much less of a concern here, and a slight variant of the single-linkage algorithm is used to cluster it accurately. The number of clusters K and the number k are fixed for choosing the true attack sets. Assume the input candidate anomaly cluster C contains n objects; the distance between two clusters is defined as the distance between their average vectors.
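Taken together, the algorithms of Figures 2 and 3 form a two-stage pipeline: each agent selects its smallest clusters as candidate anomalies, and the central node re-clusters the pooled candidates and keeps the most isolated groups as attacks. The Python sketch below is illustrative only — all function names are hypothetical, the running-average centroid update of Figure 2 is simplified to full recomputation, and Eqs. (1) and (2) are implemented directly:

```python
import math

def normalize(data):
    """Per-feature standardization with mean absolute deviation, as in Eq. (1)."""
    n, p = len(data), len(data[0])
    means = [sum(row[f] for row in data) / n for f in range(p)]
    mads = [sum(abs(row[f] - means[f]) for row in data) / n for f in range(p)]
    return [[(row[f] - means[f]) / mads[f] if mads[f] else 0.0
             for f in range(p)] for row in data]

def distance(a, b):
    """Euclidean distance between two normalized instances, Eq. (2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    """Average vector of a cluster (the paper's cluster representative o)."""
    p = len(cluster[0])
    return [sum(row[f] for row in cluster) / len(cluster) for f in range(p)]

def agent_candidates(points, width, pct):
    """Agent side (Figure 2): fixed-width clustering, then merge the
    smallest clusters (roughly pct of all points) as candidate anomalies."""
    clusters = []
    for x in points:
        best = min(clusters, key=lambda c: distance(x, centroid(c)), default=None)
        if best is not None and distance(x, centroid(best)) <= width:
            best.append(x)          # join the nearest cluster within width W
        else:
            clusters.append([x])    # start a new cluster
    clusters.sort(key=len)          # Phase 2: smallest clusters first
    budget, candidates = pct * len(points), []
    for c in clusters:
        if len(candidates) + len(c) > budget:
            break
        candidates.extend(c)
    return candidates

def central_attacks(candidates, K, k):
    """Central side (Figure 3): merge clusters by closest average vectors
    down to K clusters, then keep the k most isolated clusters as attacks."""
    clusters = [[x] for x in candidates]
    while len(clusters) > K:        # Phase 1: agglomerative merging
        pairs = [(distance(centroid(a), centroid(b)), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        _, i, j = min(pairs)
        clusters[i].extend(clusters[j])
        del clusters[j]
    # Phase 2: rank clusters by their distance to the nearest other cluster
    isolation = [min(distance(centroid(c), centroid(o))
                     for o in clusters if o is not c) for c in clusters]
    ranked = sorted(zip(isolation, range(len(clusters))), reverse=True)
    return [x for _, idx in ranked[:k] for x in clusters[idx]]
```

With synthetic data this behaves as expected: a dense blob of points forms one large cluster, far-away points form small clusters that survive both selection stages.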
The central algorithm, like the agent algorithm, is divided into two phases: Phase 1 creates the clusters using the single-linkage algorithm, and Phase 2 detects the true attacks; Figure 3 illustrates the steps of both phases.

4. Experimental evaluation

4.1. Evaluation data

To evaluate the performance of the system thoroughly, we use the authoritative network intrusion dataset KDD [8], obtained by simulating a large number of different types of intrusions in a military network environment. Each record in the dataset contains 41 features describing a network connection. To make the data more realistic, we sampled datasets consisting of 1 to 1.5% attacks and 98.5 to 99% normal instances. Each dataset was split into a training set (40%) and a test set (60%). To simulate the distributed environment, the test data were divided equally into four datasets representing the instances collected by different agent IDSs.

4.2. Evaluation parameters

For the algorithm of choosing candidate anomalies, the parameter-setting method of [6] is adopted. Based on experimental results over the training datasets, the cluster width is fixed at W = 40 in the feature space and the percentage at N = 15%. For the algorithm of choosing attacks, K is fixed at 15, and k is varied to evaluate performance.

4.3. Experimental performance

Two major performance indicators are used: the detection rate and the false positive rate. The detection rate is the number of correctly detected intrusions divided by the total number of intrusions in the dataset; the false positive rate is the number of normal instances incorrectly regarded as attacks divided by the total number of normal instances in the dataset. The detection rate should be as large as possible, while the false positive rate should be as small as possible.
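As a concrete reading of these two definitions, a small sketch (the function names are hypothetical, not from the paper; `truth` and `flagged` are parallel boolean lists, True meaning attack and alarm respectively):

```python
def detection_rate(truth, flagged):
    """Correctly detected intrusions / total intrusions."""
    intrusions = sum(truth)
    detected = sum(t and f for t, f in zip(truth, flagged))
    return detected / intrusions

def false_positive_rate(truth, flagged):
    """Normal instances flagged as attacks / total normal instances."""
    normals = len(truth) - sum(truth)
    false_alarms = sum((not t) and f for t, f in zip(truth, flagged))
    return false_alarms / normals
```

For example, with two intrusions of which one is detected, and three normal instances of which one is flagged, the detection rate is 0.5 and the false positive rate is 1/3.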
The average results over several datasets for each value of k are shown in Table 1. The ROC (Receiver Operating Characteristic) [9] curve computed from these results is illustrated in Figure 4; it shows how the detection rate and false alarm rate vary as different thresholds are used.

Table 1. The experimental results

k    Detection Rate (%)   False Positive Rate (%)
4    15                   0.2
6    28                   0.8
8    42                   1.4
10   58                   2.3
12   67                   5.1
14   75                   8.2
15   90                   16

Figure 4. ROC curve for the results in Table 1

5. Conclusions

Research on distributed intrusion detection is still a young domain, without a mature architecture to date. The contribution presented in this paper is a trial application of distributed analysis based on clustering with unlabeled data. The experimental results indicate that the method is feasible for detecting attacks in a DIDS. Our future work involves combining the clustering technique with a real homogeneous distributed system, and then with a heterogeneous distributed system.

References

[1] Nathan Einwechter. An Introduction to Distributed Intrusion Detection Systems. SecurityFocus, January 2001.
[2] J. Brentano, S. Snapp, G. Dias, et al. DIDS (Distributed Intrusion Detection System): motivation, architecture, and an early prototype. In Proceedings of the Fourteenth National Computer Security Conference, Washington, DC, October 1991.
[3] Eugene H. Spafford and Diego Zamboni. Intrusion detection using autonomous agents. Computer Networks, 34(4):547-570, October 2000.
[4] Mark Slagell. The design and implementation of MAIDS (Mobile Agent Intrusion Detection System). Technical Report TR01-07, Iowa State University Department of Computer Science, Ames, IA, USA, 2001.
[5] Dennis J. Ingram, H. Steven Kremer, and Neil C. Rowe. Distributed intrusion detection for computer systems using communicating agents.
[6] Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, and Salvatore Stolfo. A geometric framework for unsupervised anomaly detection: detecting intrusions in unlabeled data. In Applications of Data Mining in Computer Security. Kluwer, 2002.
[7] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2001.
[8] KDD Cup 1999 Data: the third international knowledge discovery and data mining tools competition dataset. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 1999.
[9] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, July 1998.