ISSN: 2277 – 9043
International Journal of Advanced Research in Computer Science and Electronics Engineering
Volume 1, Issue 2, April 2012
Attack Detection By Clustering And
Classification Approach
Ms. Priyanka J. Pathak, Asst. Prof. Snehlata S. Dongre

Abstract—Intrusion detection is a software application that monitors network and/or system activities for malicious activities or policy violations and produces reports to a management station. Security is becoming a big issue for all networks. Hackers and intruders have made many successful attempts to bring down high-profile company networks and web services. An Intrusion Detection System (IDS) is an important detection mechanism that is used as a countermeasure to preserve data integrity and system availability under attack. The work is implemented in two phases: in the first phase, clustering is done by K-means, and in the next step, classification is done with k-nearest neighbours and decision trees. The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. This paper proposes an approach which makes clusters of similar attacks and then, in the classification step with K-nearest neighbours and decision trees, detects the attack types. This method is advantageous over a single classifier as it detects classes better than a single-classifier system.
Index Terms—Data mining, Decision Trees, K-Means, K-Nearest Neighbours.
I. INTRODUCTION
Nowadays everyone gets connected to the system, but there are many issues in the network environment. So there is a need to secure information, because many security threats are present in the network environment. Intrusion detection [9] is a software application that monitors network and/or system activities for malicious activities or policy violations and produces reports to a management station. Intrusion detection systems are usually classified as host based or network based in terms of the target domain. A number of techniques are available for intrusion detection [8],[11]. Data mining is one of the efficient techniques available for intrusion detection. Data mining refers to extracting or mining knowledge from large amounts of data. Data mining techniques may be supervised or unsupervised. Clustering [10] and classification are the two important techniques used in data mining for extracting valuable information. This paper proposes a combination of clustering and classification. Classification is the process of finding a set of models which describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The paper is organized as follows. Section 2 gives the details of clustering algorithms. Sections 3 and 4 detail the strategies for classification for novel class detection in data streams for intrusion detection. Clustering is done by the K-means algorithm, which generates clusters of similar attack types that fall into the same category. The centroids generated by this algorithm are used in the next step of classification. In the classification step, the K-nearest neighbours and decision tree algorithms are used, which detect the different types of attacks.
II. CLUSTERING
A cluster [Han & Kamber] is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.
Clustering [8],[10] is an unsupervised learning technique which divides a dataset into subparts that share common properties. For the clustered data points, there should be high intra-cluster similarity and low inter-cluster similarity. A clustering method which produces such clusters is considered a good clustering algorithm.
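To make the intra-cluster versus inter-cluster comparison concrete, the short sketch below (an illustration added here, not part of the original system) measures the average distance of points to their own cluster centre against the distance between two cluster centres; the 2-D points are invented for the example.

```python
import numpy as np

# Two made-up clusters of 2-D points (illustrative data only).
cluster_a = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8]])
cluster_b = np.array([[5.0, 5.2], [4.8, 5.1], [5.3, 4.9]])

centre_a = cluster_a.mean(axis=0)
centre_b = cluster_b.mean(axis=0)

# Intra-cluster distance: how far points lie from their own centre (should be small).
intra_a = np.linalg.norm(cluster_a - centre_a, axis=1).mean()
intra_b = np.linalg.norm(cluster_b - centre_b, axis=1).mean()

# Inter-cluster distance: how far the two centres lie apart (should be large).
inter = np.linalg.norm(centre_a - centre_b)

print(f"intra A = {intra_a:.3f}, intra B = {intra_b:.3f}, inter = {inter:.3f}")
```

A good clustering keeps the first two numbers small relative to the third.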
A. K-Means Algorithm
K-means [Han & Kamber] is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because different locations cause different results. So the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as the
barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop, we may notice that the k centroids change their location step by step until no more changes occur.
The algorithm is composed of the following steps:
Algorithm: k-means
The k-means algorithm for partitioning, where each
cluster’s center is represented by the mean value of the
objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output:
A set of k clusters.
Method:
(1) Arbitrarily choose k objects from D as the initial
cluster centers;
(2) Repeat
(3) Reassign each object to the cluster to which the object
is the most similar, based on the mean value of the objects in
the cluster;
(4) Update the cluster means, i.e., calculate the mean value
of the objects for each cluster;
(5) Until no change;
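A minimal NumPy sketch of the steps above is given below; the example data, the value of k, and the convergence test are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def k_means(D, k, max_iter=100, seed=0):
    """Partition the rows of D into k clusters (sketch of the steps above)."""
    rng = np.random.default_rng(seed)
    # (1) Arbitrarily choose k objects from D as the initial cluster centres.
    centres = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):
        # (3) Reassign each object to the cluster whose centre is nearest.
        distances = np.linalg.norm(D[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # (4) Update the cluster means.
        new_centres = np.array([D[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        # (5) Stop when the centres no longer change.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Example usage on made-up 2-D data drawn from two well-separated blobs.
data = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
                  np.random.default_rng(2).normal(6, 1, (50, 2))])
labels, centres = k_means(data, k=2)
print(centres)
```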
Fig. 3 shows the overall system architecture. Clustering and classification are the two parts of the system. The input to the system is the KDD data set. The first step is to preprocess the input, and this preprocessed input is applied to the K-Means clustering algorithm. The preprocessing step (sketched in code below) includes:
• Data cleaning
• Feature selection
• Data transformation, etc.
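A rough sketch of such a preprocessing pipeline is shown below, assuming the records are held in a pandas DataFrame; the column names, the one-hot encoding of symbolic features, and the min-max scaling are illustrative assumptions, since the paper does not prescribe particular transformations.

```python
import pandas as pd

# A tiny made-up sample in the spirit of KDD-style connection records
# (column names and values here are assumptions for illustration).
records = pd.DataFrame({
    "duration":      [0, 12, 0, 3],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "src_bytes":     [181, 239, 0, 520],
    "label":         ["normal", "normal", "neptune", "smurf"],
})

# Data cleaning: drop duplicate rows and rows with missing values.
records = records.drop_duplicates().dropna()

# Feature selection: keep a subset of columns (chosen arbitrarily here).
features = records[["duration", "protocol_type", "src_bytes"]]

# Data transformation: one-hot encode symbolic features and
# min-max scale numeric features into [0, 1].
features = pd.get_dummies(features, columns=["protocol_type"])
numeric = ["duration", "src_bytes"]
features[numeric] = (features[numeric] - features[numeric].min()) / (
    features[numeric].max() - features[numeric].min() + 1e-9)

print(features)
```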
Fig. 3: System architecture
Fig. 1: Clustering of a set of objects based on the k-means method
This produces a separation of the objects into groups from
which the metric to be minimized can be calculated.
K-Means creates the clusters of similar attacks based on a similarity measure. Each record of the data set contains 42 different features. The values of these features are already pre-processed in the training data. The calculated centroid values are used in the next step, K-Nearest Neighbours classification.
Fig. 4: Preprocessed data
Fig. 2: Output of the K-means algorithm
III. CLASSIFICATION
Classification [4],[13] is a process of finding a model that describes and distinguishes data classes or concepts for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data. This paper introduces a method combining clustering and classification. The clustering is done by the K-Means algorithm, which is explained in the earlier section. The classification method is proposed through which a novel class can be detected. The following subsections describe the classification algorithms used.
A. K-Nearest Neighbour Classification
In this method, one simply finds in the N-dimensional
feature space the closest object from the training set to an
object being classified. Since the neighbour is nearby, it is
likely to be similar to the object being classified and so is
likely to be the same class as that object. Nearest neighbour
methods have the advantage that they are easy to implement.
They can also give quite good results if the features are
chosen carefully. A new sample is classified by calculating the distance to the nearest training case; the class of that nearest point then determines the classification of the sample. "Closeness"
is defined in terms of Euclidean distance, where the Euclidean distance between two points $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$ is

$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$.
Fig. 5: Flow chart of the K-Nearest Neighbour algorithm
The unknown sample is assigned the most common class among its k nearest neighbours. Fig. 6 shows the attack types and the number of nodes of each type detected by the K-Nearest Neighbour algorithm.
Fig. 6: Output of the K-Nearest Neighbour algorithm
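A compact sketch of this classification step is given below; the toy training vectors, the labels, and the choice of k are assumptions made for illustration (in the described system the training data would be the preprocessed KDD records, possibly summarized by the centroids from the clustering phase).

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points,
    with closeness measured by Euclidean distance."""
    distances = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D feature vectors labelled with attack types (illustration only).
train_X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.5, 0.9]])
train_y = ["normal", "normal", "dos", "dos", "probe"]

print(knn_predict(train_X, train_y, np.array([0.85, 0.85]), k=3))  # -> "dos"
```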
B. Decision Trees
Decision tree learning [5],[6] is a method commonly used
in data mining. The goal is to create a model that predicts the
value of a target variable based on several input variables.
Each interior node corresponds to one of the input variables;
there are edges to children for each of the possible values of
that input variable. Each leaf represents a value of the target
variable given the values of the input variables represented
by the path from the root to the leaf. A tree can be "learned"
by splitting the source set into subsets based on an attribute
value test. This process is repeated on each derived subset in
a recursive manner called recursive partitioning. The recursion is completed when all the instances in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the predictions.
C. ID3 Algorithm
In decision tree learning, ID3 (Iterative Dichotomiser 3)[6]
is an algorithm used to generate a decision tree. The ID3
algorithm can be summarized as follows:
1. Take all unused attributes and compute their entropy with respect to the test samples.
2. Choose the attribute for which entropy is minimum (or, equivalently, information gain is maximum).
3. Make a node containing that attribute.
The algorithm is as follows:
Input: A data set, S
Output: A decision tree
• If all the instances have the same value for the target attribute, then return a decision tree that is simply this value.
• Else
◦ Compute Gain values for all attributes, select the attribute with the highest value, and create a node for that attribute.
◦ Make a branch from this node for every value of the attribute.
◦ Assign all possible values of the attribute to the branches.
◦ Follow each branch by partitioning the dataset to only those instances where the value of the branch is present, and then go back to the first step.
This algorithm usually produces small trees, but it does not
always produce the smallest possible tree.
The optimization step makes use of information entropy:
Entropy:

$E(S) = -\sum_{j=1}^{n} f_S(j) \log_2 f_S(j)$

where:
• E(S) is the information entropy of the set S;
• n is the number of different values of the attribute in S (entropy is computed for one chosen attribute);
• f_S(j) is the frequency (proportion) of the value j in the set S;
• log2 is the binary logarithm.
An entropy of 0 identifies a perfectly classified set.
Entropy is used to determine which node to split next in the
algorithm. The higher the entropy, the higher the potential to
improve the classification.
Gain:
Gain is computed to estimate the gain produced by a split over an attribute A:

$G(S, A) = E(S) - \sum_{i=1}^{m} f_S(A_i)\, E(S_{A_i})$

where:
• G(S, A) is the gain of the set S after a split over the attribute A;
• E(S) is the information entropy of the set S;
• m is the number of different values of the attribute A in S;
• f_S(A_i) is the frequency (proportion) of the items possessing A_i as the value for A in S;
• A_i is the i-th possible value of A;
• S_{A_i} is the subset of S containing all items where the value of A is A_i.
Gain quantifies the entropy improvement by splitting over an
attribute.
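The two formulas above translate directly into code. The sketch below computes E(S) and G(S, A) for a toy attribute/label table; the records and column names are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = -sum_j f_S(j) * log2 f_S(j), over the distinct label values j."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attribute, target):
    """G(S, A) = E(S) - sum_i f_S(A_i) * E(S_{A_i})."""
    labels = [r[target] for r in rows]
    total = entropy(labels)
    n = len(rows)
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        total -= (len(subset) / n) * entropy(subset)
    return total

# Toy records: one symbolic attribute and a class label (illustration only).
rows = [
    {"protocol": "tcp",  "class": "normal"},
    {"protocol": "tcp",  "class": "normal"},
    {"protocol": "udp",  "class": "attack"},
    {"protocol": "icmp", "class": "attack"},
]
print(gain(rows, "protocol", "class"))
```

ID3 would evaluate this gain for every unused attribute and split on the one with the highest value.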
Fig. 7: Output of the ID3 algorithm
IV. CONCLUSION
In this paper, work is presented through which the novel class is detected. The system uses the K-Means clustering algorithm, which produces different clusters of similar types of attacks and the total number of nodes per cluster in the input dataset. It also shows the updated centroids for each parameter in the input dataset. The classification stage gives details about the detection of the different types of attacks and the number of nodes of each type in the dataset.
REFERENCES
[1] Kapil K. Wankhade, Snehlata S. Dongre, Prakash S. Prasad, Mrudula M. Gudadhe, Kalpana A. Mankar, "Intrusion Detection System Using New Ensemble Boosting Approach," 2011 3rd International Conference on Computer Modeling and Simulation (ICCMS 2011).
[2] Kapil K. Wankhade, Snehlata S. Dongre, Kalpana A. Mankar, Prashant K. Adakane, "A New Adaptive Ensemble Boosting Classifier for Concept Drifting Stream Data," 2011 3rd International Conference on Computer Modeling and Simulation (ICCMS 2011).
[3] Hongbo Zhu, Yaqiang Wang, Zhonghua Yu, "Clustering of Evolving Data Stream with Multiple Adaptive Sliding Window," International Conference on Data Storage and Data Engineering, 2010.
[4] Peng Zhang, Xingquan Zhu, Jianlong Tan, Li Guo, "Classifier and Cluster Ensembles for Mining Concept Drifting Data Streams," IEEE International Conference on Data Mining, 2010.
[5] T. Jyothirmayi, Suresh Reddy, "An Algorithm for Better Decision Tree," (IJCSE) International Journal on Computer Science and Engineering, 2010.
[6] Yongjin Liu, Na Li, Leina Shi, Fangping Li, "An Intrusion Detection Method Based on Decision Tree," International Conference on E-Health Networking, Digital Ecosystems and Technologies, 2010.
[7] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby and R. Gavalda, "New ensemble methods for evolving data streams," in KDD'09, ACM, Paris, 2009, pp. 139-148.
[8] LIU Li-xiong, KANG Jing, GUO Yun-fei, HUANG Hai, "A Three-Step Clustering Algorithm over an Evolving Data Stream," sponsored by the National High Technology Research and Development Program ("863" Program) of China, Fund 2008 AAOII002.
[9] Mrutyunjaya Panda, Manas Ranjan Patra, "A Comparative Study of Data Mining Algorithms for Network Intrusion Detection," First International Conference on Emerging Trends in Engineering and Technology, 2008.
[10] Shuang Wu, Chunyu Yang and Jie Zhou, "Clustering-training for Data Stream Mining," Sixth IEEE International Conference on Data Mining Workshops (ICDMW'06).
[11] Sang-Hyun Oh, Jin-Suk Kang, Yung-Cheol Byun, Gyung-Leen Park and Sang-Yong Byun, "Intrusion Detection based on Clustering a Data Stream," Third ACIS Int'l Conference on Software Engineering Research, Management and Applications (SERA'05).
[12] Srilatha Chebrolu, Ajith Abraham, Johnson P. Thomas, "Feature deduction and ensemble design of intrusion detection systems," Elsevier Ltd., 2004.
[13] Sandhya Peddabachigari, Ajith Abraham, Crina Grosan, Johnson Thomas, "Modeling intrusion detection system using hybrid intelligent systems," Elsevier Ltd., 2005.

First Author: Priyanka J. Pathak, IV Sem M.Tech (CSE), G.H. Raisoni College of Engineering, Nagpur, R.T.M.N.U., Nagpur.
Second Author: Asst. Prof. Snehlata Dongre, CSE Department, G.H. Raisoni College of Engineering, Nagpur, R.T.M.N.U., Nagpur.