ISSN: 2277 – 9043
International Journal of Advanced Research in Computer Science and Electronics Engineering
Volume 1, Issue 2, April 2012
Attack Detection By Clustering And
Classification Approach
Ms. Priyanka J. Pathak, Asst. Prof. Snehlata S. Dongre

Abstract—Intrusion detection is a software application that monitors network and/or system activities for malicious activities or policy violations and produces reports to a management station. Security is becoming a big issue for all networks. Hackers and intruders have made many successful attempts to bring down high-profile company networks and web services. An Intrusion Detection System (IDS) is an important detection mechanism that is used as a countermeasure to preserve data integrity and system availability under attack. The work is implemented in two phases: in the first phase, clustering is done by K-means, and in the next step, classification is done with k-nearest neighbours and decision trees. The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. This paper proposes an approach which makes clusters of similar attacks and then, in the classification step with K-nearest neighbours and decision trees, detects the attack types. This method is advantageous over a single classifier as it detects classes better than a single-classifier system.
Index Terms—Data mining, Decision Trees, K-Means, K-Nearest Neighbours.
I. INTRODUCTION
Nowadays everyone gets connected to the system, but there are many issues in the network environment. So there is a need to secure information, because many security threats are present in the network environment. Intrusion detection [9] is a software application that monitors network and/or system activities for malicious activities or policy violations and produces reports to a management station. Intrusion detection systems are usually classified as host based or network based in terms of the target domain. A number of techniques are available for intrusion detection [8],[11]. Data mining is one of the efficient techniques available for intrusion detection. Data mining refers to extracting or mining knowledge from large amounts of data. Data mining techniques may be supervised or unsupervised. Clustering [10] and classification are the two important techniques used in data mining for extracting valuable information. This paper proposes a combination of clustering and classification. Classification is the process of finding a set of models which describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The paper is organized as follows. Section 2 gives the details of clustering algorithms. Sections 3 and 4 detail the strategies for classification for novel class detection in data streams for intrusion detection. Clustering is done by the K-means algorithm, which generates clusters of similar attack types that fall into the same category. The centroids generated by this algorithm are used in the next step of classification. In the classification step, the K-nearest neighbours and decision tree algorithms are used, which detect the different types of attacks.
II. CLUSTERING
A cluster [Han & Kamber] is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.
Clustering [8],[10] is an unsupervised learning technique which divides a dataset into subparts that share common properties. For the clustered data points, there should be high intra-cluster similarity and low inter-cluster similarity. A clustering method which produces such clusters is considered a good clustering algorithm.
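To make the intra-cluster versus inter-cluster comparison concrete, the short sketch below (an illustration added here, not part of the original system) measures the average distance of points to their own cluster centre against the distance between two cluster centres; the 2-D points are invented for the example.

```python
import numpy as np

# Two made-up clusters of 2-D points (illustrative data only).
cluster_a = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8]])
cluster_b = np.array([[5.0, 5.2], [4.8, 5.1], [5.3, 4.9]])

centre_a = cluster_a.mean(axis=0)
centre_b = cluster_b.mean(axis=0)

# Intra-cluster distance: how far points lie from their own centre (should be small).
intra_a = np.linalg.norm(cluster_a - centre_a, axis=1).mean()
intra_b = np.linalg.norm(cluster_b - centre_b, axis=1).mean()

# Inter-cluster distance: how far the two centres lie apart (should be large).
inter = np.linalg.norm(centre_a - centre_b)

print(f"intra A = {intra_a:.3f}, intra B = {intra_b:.3f}, inter = {inter:.3f}")
```

A good clustering keeps the first two numbers small relative to the third.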
A. K-Means Algorithm
K-means [Han & Kamber] is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because different locations cause different results. So the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as the
barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop, we may notice that the k centroids change their location step by step until no more changes occur.
The algorithm is composed of the following steps:
Algorithm: k-means
The k-means algorithm for partitioning, where each
cluster’s center is represented by the mean value of the
objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output:
A set of k clusters.
Method:
(1) Arbitrarily choose k objects from D as the initial
cluster centers;
(2) Repeat
(3) Reassign each object to the cluster to which the object
is the most similar, based on the mean value of the objects in
the cluster;
(4) Update the cluster means, i.e., calculate the mean value
of the objects for each cluster;
(5) Until no change;
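A minimal NumPy sketch of the steps above is given below; the example data, the value of k, and the convergence test are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def k_means(D, k, max_iter=100, seed=0):
    """Partition the rows of D into k clusters (sketch of the steps above)."""
    rng = np.random.default_rng(seed)
    # (1) Arbitrarily choose k objects from D as the initial cluster centres.
    centres = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):
        # (3) Reassign each object to the cluster whose centre is nearest.
        distances = np.linalg.norm(D[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # (4) Update the cluster means.
        new_centres = np.array([D[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        # (5) Stop when the centres no longer change.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Example usage on made-up 2-D data drawn from two well-separated blobs.
data = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
                  np.random.default_rng(2).normal(6, 1, (50, 2))])
labels, centres = k_means(data, k=2)
print(centres)
```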
Fig. 3 shows the overall system architecture. Clustering and classification are the two parts of the system. The input to the system is the KDD data set. The first step is to preprocess the input, and this preprocessed input is applied to the K-Means clustering algorithm. The preprocessing step (sketched in code below) includes:
• Data cleaning
• Feature selection
• Data transformation, etc.
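A rough sketch of such a preprocessing pipeline is shown below, assuming the records are held in a pandas DataFrame; the column names, the one-hot encoding of symbolic features, and the min-max scaling are illustrative assumptions, since the paper does not prescribe particular transformations.

```python
import pandas as pd

# A tiny made-up sample in the spirit of KDD-style connection records
# (column names and values here are assumptions for illustration).
records = pd.DataFrame({
    "duration":      [0, 12, 0, 3],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "src_bytes":     [181, 239, 0, 520],
    "label":         ["normal", "normal", "neptune", "smurf"],
})

# Data cleaning: drop duplicate rows and rows with missing values.
records = records.drop_duplicates().dropna()

# Feature selection: keep a subset of columns (chosen arbitrarily here).
features = records[["duration", "protocol_type", "src_bytes"]]

# Data transformation: one-hot encode symbolic features and
# min-max scale numeric features into [0, 1].
features = pd.get_dummies(features, columns=["protocol_type"])
numeric = ["duration", "src_bytes"]
features[numeric] = (features[numeric] - features[numeric].min()) / (
    features[numeric].max() - features[numeric].min() + 1e-9)

print(features)
```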
Fig. 3: System architecture
Fig. 1: Clustering of a set of objects based on the k-means method
This produces a separation of the objects into groups from
which the metric to be minimized can be calculated.
K-Means creates the clusters of similar attacks based on a similarity measure. Each record of the data set contains 42 different features. The values of these features are already pre-processed in the training data. The calculated centroid values are used in the next step, K-Nearest Neighbours classification.
Fig. 4: Preprocessed data
Fig. 2: Output of the K-means algorithm
III. CLASSIFICATION
Classification [4],[13] is a process of finding a model that describes and distinguishes data classes or concepts for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data. This paper introduces a method combining clustering and classification. The clustering is done by the K-Means algorithm, which is explained in the earlier section. The classification method is proposed through which a novel class can be detected. The following subsections describe the classification algorithms used.
A. K-Nearest Neighbour Classification
In this method, one simply finds in the N-dimensional
feature space the closest object from the training set to an
object being classified. Since the neighbour is nearby, it is
likely to be similar to the object being classified and so is
likely to be the same class as that object. Nearest neighbour
methods have the advantage that they are easy to implement.
They can also give quite good results if the features are
chosen carefully. A new sample is classified by calculating the distance to the nearest training case; the class of that nearest point then determines the classification of the sample. "Closeness"
is defined in terms of Euclidean distance, where the Euclidean distance between two points $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$ is

$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$.
Fig. 5: Flow chart of the K-Nearest Neighbour algorithm
The unknown sample is assigned the most common class among its k nearest neighbours. Fig. 6 shows the attack types and the number of nodes of each type detected by the K-Nearest Neighbour algorithm.
Fig. 6: Output of the K-Nearest Neighbour algorithm
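A compact sketch of this classification step is given below; the toy training vectors, the labels, and the choice of k are assumptions made for illustration (in the described system the training data would be the preprocessed KDD records, possibly summarized by the centroids from the clustering phase).

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points,
    with closeness measured by Euclidean distance."""
    distances = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D feature vectors labelled with attack types (illustration only).
train_X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.5, 0.9]])
train_y = ["normal", "normal", "dos", "dos", "probe"]

print(knn_predict(train_X, train_y, np.array([0.85, 0.85]), k=3))  # -> "dos"
```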
B. Decision Trees
Decision tree learning [5],[6] is a method commonly used
in data mining. The goal is to create a model that predicts the
value of a target variable based on several input variables.
Each interior node corresponds to one of the input variables;
there are edges to children for each of the possible values of
that input variable. Each leaf represents a value of the target
variable given the values of the input variables represented
by the path from the root to the leaf. A tree can be "learned"
by splitting the source set into subsets based on an attribute
value test. This process is repeated on each derived subset in
a recursive manner called recursive partitioning. The recursion is completed when all the instances in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the predictions.
C. ID3 Algorithm
In decision tree learning, ID3 (Iterative Dichotomiser 3)[6]
is an algorithm used to generate a decision tree. The ID3
algorithm can be summarized as follows:
1. Take all unused attributes and compute their entropy with respect to the test samples.
2. Choose the attribute for which entropy is minimum (or, equivalently, information gain is maximum).
3. Make a node containing that attribute.
The algorithm is as follows:
Input: A data set, S
Output: A decision tree
• If all the instances have the same value for the target attribute, then return a decision tree that is simply this value.
• Else
◦ Compute Gain values for all attributes, select the attribute with the highest value, and create a node for that attribute.
◦ Make a branch from this node for every value of the attribute.
◦ Assign all possible values of the attribute to the branches.
◦ Follow each branch by partitioning the dataset to only those instances where the value of the branch is present, and then go back to the first step.
This algorithm usually produces small trees, but it does not
always produce the smallest possible tree.
The optimization step makes use of information entropy:
Entropy:

$E(S) = -\sum_{j=1}^{n} f_S(j) \log_2 f_S(j)$

where:
• E(S) is the information entropy of the set S;
• n is the number of different values of the attribute in S (entropy is computed for one chosen attribute);
• f_S(j) is the frequency (proportion) of the value j in the set S;
• log2 is the binary logarithm.
An entropy of 0 identifies a perfectly classified set.
Entropy is used to determine which node to split next in the
algorithm. The higher the entropy, the higher the potential to
improve the classification.
Gain:
Gain is computed to estimate the gain produced by a split over an attribute A:

$G(S, A) = E(S) - \sum_{i=1}^{m} f_S(A_i)\, E(S_{A_i})$

where:
• G(S, A) is the gain of the set S after a split over the attribute A;
• E(S) is the information entropy of the set S;
• m is the number of different values of the attribute A in S;
• f_S(A_i) is the frequency (proportion) of the items possessing A_i as the value for A in S;
• A_i is the i-th possible value of A;
• S_{A_i} is the subset of S containing all items where the value of A is A_i.
Gain quantifies the entropy improvement by splitting over an
attribute.
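The two formulas above translate directly into code. The sketch below computes E(S) and G(S, A) for a toy attribute/label table; the records and column names are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = -sum_j f_S(j) * log2 f_S(j), over the distinct label values j."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attribute, target):
    """G(S, A) = E(S) - sum_i f_S(A_i) * E(S_{A_i})."""
    labels = [r[target] for r in rows]
    total = entropy(labels)
    n = len(rows)
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        total -= (len(subset) / n) * entropy(subset)
    return total

# Toy records: one symbolic attribute and a class label (illustration only).
rows = [
    {"protocol": "tcp",  "class": "normal"},
    {"protocol": "tcp",  "class": "normal"},
    {"protocol": "udp",  "class": "attack"},
    {"protocol": "icmp", "class": "attack"},
]
print(gain(rows, "protocol", "class"))
```

ID3 would evaluate this gain for every unused attribute and split on the one with the highest value.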
Fig. 7: Output of the ID3 algorithm
IV. CONCLUSION
In this paper, work is presented through which the novel class is detected. The system uses the K-Means clustering algorithm, which produces different clusters of similar types of attacks and the total number of nodes per cluster in the input dataset. It also shows the updated centroids for each parameter in the input dataset. The classification stage gives details about the detection of the different types of attacks and the number of nodes of each type in the dataset.
REFERENCES
[1] Kapil K. Wankhade, Snehlata S. Dongre, Prakash S. Prasad, Mrudula M. Gudadhe, Kalpana A. Mankar, "Intrusion Detection System Using New Ensemble Boosting Approach," 2011 3rd International Conference on Computer Modeling and Simulation (ICCMS 2011).
[2] Kapil K. Wankhade, Snehlata S. Dongre, Kalpana A. Mankar, Prashant K. Adakane, "A New Adaptive Ensemble Boosting Classifier for Concept Drifting Stream Data," 2011 3rd International Conference on Computer Modeling and Simulation (ICCMS 2011).
[3] Hongbo Zhu, Yaqiang Wang, Zhonghua Yu, "Clustering of Evolving Data Stream with Multiple Adaptive Sliding Window," International Conference on Data Storage and Data Engineering, 2010.
[4] Peng Zhang, Xingquan Zhu, Jianlong Tan, Li Guo, "Classifier and Cluster Ensembles for Mining Concept Drifting Data Streams," IEEE International Conference on Data Mining, 2010.
[5] T. Jyothirmayi, Suresh Reddy, "An Algorithm for Better Decision Tree," (IJCSE) International Journal on Computer Science and Engineering, 2010.
[6] Yongjin Liu, Na Li, Leina Shi, Fangping Li, "An Intrusion Detection Method Based on Decision Tree," International Conference on E-Health Networking, Digital Ecosystems and Technologies, 2010.
[7] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby and R. Gavalda, "New ensemble methods for evolving data streams," in KDD'09, ACM, Paris, 2009, pp. 139-148.
[8] LIU Li-xiong, KANG Jing, GUO Yun-fei, HUANG Hai, "A Three-Step Clustering Algorithm over an Evolving Data Stream," sponsored by the National High Technology Research and Development Program ("863" Program) of China, Fund 2008 AAOII002.
[9] Mrutyunjaya Panda, Manas Ranjan Patra, "A Comparative Study of Data Mining Algorithms for Network Intrusion Detection," First International Conference on Emerging Trends in Engineering and Technology, 2008.
[10] Shuang Wu, Chunyu Yang and Jie Zhou, "Clustering-training for Data Stream Mining," Sixth IEEE International Conference on Data Mining Workshops (ICDMW'06).
[11] Sang-Hyun Oh, Jin-Suk Kang, Yung-Cheol Byun, Gyung-Leen Park and Sang-Yong Byun, "Intrusion Detection based on Clustering a Data Stream," Third ACIS Int'l Conference on Software Engineering Research, Management and Applications (SERA'05).
[12] Srilatha Chebrolu, Ajith Abraham, Johnson P. Thomas, "Feature deduction and ensemble design of intrusion detection systems," Elsevier Ltd., 2004.
[13] Sandhya Peddabachigari, Ajith Abraham, Crina Grosan, Johnson Thomas, "Modeling intrusion detection system using hybrid intelligent systems," Elsevier Ltd., 2005.

First Author: Priyanka J. Pathak, IV Sem M.Tech (CSE), G.H. Raisoni College of Engineering, Nagpur, R.T.M.N.U., Nagpur.
Second Author: Asst. Prof. Snehlata Dongre, CSE Department, G.H. Raisoni College of Engineering, Nagpur, R.T.M.N.U., Nagpur.