Detecting Intrusions in the Cloud Using Bayesian
Methods of Anomaly Detection
Andre Torres
Edinboro University of Pennsylvania
Edinboro, PA, USA
Email: [email protected]
Abstract—Cloud computing has recently gained traction in the IT
industry, but it brings new security concerns that must be guarded
against. By taking preventative measures and detecting anomalies
indicative of intrusions before or as they are happening, otherwise
irreversible damage can be stopped in its tracks. Intrusion detection
systems (IDSs) use a multitude of methods, but many of these are not
appropriate for cloud computing because of the immense resources they
require or their inability to detect new types of intrusions. Using a
Tree-Augmented Naïve (TAN) Bayesian network, which has low
computational requirements, we can train a supervised machine learning
algorithm that additionally considers dependencies among attributes to
analyze network packets and detect both known and new types of
intrusions.
Index Terms—cloud computing, anomaly detection, Bayesian
INTRODUCTION
In recent years, the concept of moving “to the cloud” has taken hold
in the IT industry, changing the playing field by giving anyone in the
world access to whatever technological resources they may need. An
individual developer is given the ability to innovate, a small
business a doorway to the world, and a worldwide corporation a way to
connect its branches to a global center. The possibilities of what
we’ve come to know as cloud computing, defined by the National
Institute of Standards and Technology (NIST) as “a model for enabling
ubiquitous, convenient, on-demand network access to a shared pool of
configurable computing resources that can be rapidly provisioned and
released with minimal management effort or service provider
interaction” [1], seem infinite.
With this new capability, unfortunately, come new concerns, as
network-based attacks increase in both frequency and severity.
Anomalies, deviations from the constructed profiles of normal behavior
for systems and their users, are often indicative of a potential
attack. Employing anomaly detection lets us detect new attacks as they
happen, although this is harder than it seems: a system that raises
false alarms on a recurring basis wastes the time and labor needed to
investigate them. Still, by detecting intrusions as they happen rather
than in an offline system, or worse, after the intrusion has occurred,
damage can be prevented before it is done.
BACKGROUND
Today’s intrusion detection systems (IDSs) typically use supervised
machine learning techniques such as data mining, fuzzy logic, genetic
algorithms, neural networks, and support vector machines to identify
intrusions [2].
Common IDS types include network IDSs (NIDSs) which
investigate incoming and outgoing network traffic, host-based
IDSs (HBIDSs) which audit internal interfaces related to the
machine, protocol-based IDSs (PIDSs) which monitor and
analyze the HTTP protocol stream, application protocol-based
IDSs (APIDSs) which look for the correct use of specific
application or process protocols, and hybrid IDSs (HIDSs)
which combine two or more intrusion detection approaches [2].
One anomaly detection strategy incorporated by IDSs uses the Iterative
Dichotomiser 3 (ID3) technique to generate a decision tree from a
dataset, selecting the attributes that give the highest information
gain [2]. The idea is that the amount of information associated with
an attribute value relates to the probability that some occurrence may
happen, and the objective is to iteratively separate the dataset into
subsets until all elements in each final subset belong to the same
class [2].
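As an illustrative sketch of the information-gain computation at the heart of ID3 (the helper names below are our own, not from [2]):

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Entropy of a binary class distribution with positive-class fraction p:
// H(p) = -p*log2(p) - (1-p)*log2(1-p).
double entropy(double p) {
    if (p <= 0.0 || p >= 1.0) return 0.0;  // pure subsets carry no entropy
    return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
}

// Information gain of splitting a parent set into subsets, where each
// subset is given as {size, fraction of positive-class elements}:
// Gain = H(parent) - sum_i (n_i / n) * H(subset_i).
double information_gain(double parent_pos_frac,
                        const std::vector<std::pair<double, double>>& subsets) {
    double n = 0.0, weighted = 0.0;
    for (const auto& s : subsets) n += s.first;
    for (const auto& s : subsets) weighted += (s.first / n) * entropy(s.second);
    return entropy(parent_pos_frac) - weighted;
}
```

ID3 would evaluate this gain for every candidate attribute and split on the one with the highest value; a split that produces pure subsets from a 50/50 parent attains the maximum gain of one bit.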
Another common method many IDSs employ is data mining based on cluster
analysis. Clusters are sets of objects organized such that objects in
the same group are more similar to each other than to those in other
groups. Cluster analysis methods identify sparse regions in
point-cloud data as starting points in the search for anomalies [3].
Variations include distance-based methods, where a data point far from
its neighbors is an anomaly, and density-based methods, which follow
the same idea but search for anomalies in less dense regions [3]. The
issue with these cluster methods is the immense amount of data that is
ignored. For instance, suppose a set of data points is graphed with
income on the X-axis and expenditure on the Y-axis, forming four
clusters: O1 (small; low income, high expenditure), O2 (large; low to
medium income, low to medium expenditure), O3 (moderate size; large
income, small expenditure), and O4 (small; large income, large
expenditure). A distance-based method would flag O4 as anomalies,
while the real, readily understood anomalies lie in O1. In addition,
the immense number of calculations required to determine the distance
of a new data point to existing clusters or to hundreds, perhaps
thousands, of other data points is far from ideal for active anomaly
detection in cloud computing, given the resources constantly consumed.
For these reasons, we turn to a reliable, low-resource solution: an
algorithm that uses a Bayesian network to determine anomalies.
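To make the cost concern concrete, here is a minimal sketch of a distance-based check (the names and thresholds are our own illustrations, not taken from [3]): a point is flagged when at most min_neighbors points lie within radius r of it, and every query must scan the entire dataset.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Point { double x, y; };

// Distance-based outlier test: p is an outlier if at most min_neighbors
// points of `data` fall within radius r of it. Note the O(n) scan per
// query -- the cost that makes this unattractive for live cloud monitoring.
bool is_outlier(const Point& p, const std::vector<Point>& data,
                double r, int min_neighbors) {
    int neighbors = 0;
    for (const auto& q : data) {
        double dx = p.x - q.x, dy = p.y - q.y;
        if (std::sqrt(dx * dx + dy * dy) <= r && ++neighbors > min_neighbors)
            return false;  // enough nearby points: not an outlier
    }
    return true;
}
```

Even this toy version touches every stored point for every new observation; real deployments need spatial indexes or sampling, which adds further resource pressure.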
Bayes Rule, shown in Equation 1, is fundamental to the
formation of a Bayesian network and provides a way to
calculate the probability of some hypothesis based on its prior
probability, where the most probable hypothesis is the best
hypothesis [4]. Observed data D is taken into consideration in
addition to any initial knowledge of the prior probabilities of
the various hypotheses h.


Bayes Rule:    P(h|D) = P(D|h) P(h) / P(D)    (1)

P(h|D), the probability of the hypothesis after the relevant evidence
is taken into account, is determined from P(h), the prior probability
of hypothesis h; P(D), the probability of observing the data D; and
P(D|h), the probability of the data given the hypothesis [4]. What
concerns us in our research is, given a set of candidate hypotheses H,
the most probable hypothesis h in H given the data D, known as the
maximum a posteriori (MAP) hypothesis [4].
Equation 2 gives the maximum likelihood (ML) hypothesis: any
hypothesis that maximizes the likelihood P(D|h) of the data D.

    h_ML ≡ argmax_{h ∈ H} P(D|h)    (2)
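A small numeric sketch (the priors and likelihoods below are invented, illustrative values, not measurements) shows how the posterior is computed and how the MAP and ML hypotheses can disagree:

```cpp
#include <cassert>
#include <cmath>

// Bayes Rule: P(h|D) = P(D|h) * P(h) / P(D).
double posterior(double likelihood, double prior, double evidence) {
    return likelihood * prior / evidence;
}

// Illustrative two-hypothesis setup: "intrusion" vs. "normal".
// The intrusion prior is rare (1%) but its likelihood for the observed
// data is high, so the ML hypothesis (argmax of P(D|h)) is "intrusion"
// while the MAP hypothesis (argmax of P(D|h)P(h)) remains "normal".
double example_posterior_intrusion() {
    double prior_intr = 0.01, prior_norm = 0.99;
    double lik_intr = 0.9, lik_norm = 0.05;
    // Total probability of the data: P(D) = sum_h P(D|h) P(h).
    double evidence = lik_intr * prior_intr + lik_norm * prior_norm;
    return posterior(lik_intr, prior_intr, evidence);
}
```

With these numbers the intrusion hypothesis maximizes the likelihood (0.9 vs. 0.05), yet its posterior is only about 0.15, so the MAP hypothesis is still “normal”: the prior matters.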

Pseudo-Bayes estimators, which follow Bayes Rule, are a technique in
discrete multivariate analysis used to provide estimated cell values
for contingency tables that may contain a large number of sampling
zeros [4]. Sampling zeros can be misleading: reporting 0/5 and 0/500
both as zero obscures the fact that they point to very different
rates. Pseudo-Bayes estimators can be built in a multitude of ways
[4], but their goal here is to build a naïve Bayesian (NB) classifier.
NB classifiers, although they rest on over-simplified independence
assumptions, are often well suited to complex real-world tasks because
of how efficiently they can be trained in a supervised learning
environment [2]. One limitation of NB classifiers is that they cannot
provide accurate metric attribution information [5]. Adding directed
edges between attributes in our Bayesian network, and thereby taking
dependencies among attributes into account, allows us to classify
system states as normal or abnormal and to produce a ranked list of
the metrics most indicative of the anomaly [5]. Using the supervised
NB classifier alone, we could only detect recurring anomalies; new
anomalies can be discovered by including data on dependencies among
attributes, creating a Tree-Augmented Naïve (TAN) Bayesian network
[5].
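A minimal NB classifier over categorical attributes can be sketched as follows (our own toy code, not the project implementation; add-one smoothing stands in for the pseudo-Bayes handling of sampling zeros, and the TAN dependency edges are omitted):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy naive Bayes over categorical attributes. Laplace (add-one)
// smoothing avoids the sampling-zero problem described above; the
// "+ 2" in the denominator assumes two possible values per attribute.
struct NaiveBayes {
    std::map<std::string, int> class_count;   // per-class record totals
    // counts[class][attribute index][attribute value]
    std::map<std::string, std::map<int, std::map<std::string, int>>> counts;
    int total = 0;

    void train(const std::vector<std::string>& attrs, const std::string& label) {
        ++class_count[label];
        ++total;
        for (int i = 0; i < (int)attrs.size(); ++i)
            ++counts[label][i][attrs[i]];
    }

    // Pick the label maximizing P(h) * prod_i P(attr_i | h): the MAP
    // class under the naive independence assumption.
    std::string classify(const std::vector<std::string>& attrs) {
        std::string best;
        double best_score = -1.0;
        for (const auto& [label, n] : class_count) {
            double score = (double)n / total;              // prior P(h)
            for (int i = 0; i < (int)attrs.size(); ++i) {
                double c = counts[label][i][attrs[i]];
                score *= (c + 1.0) / (n + 2.0);            // smoothed P(attr|h)
            }
            if (score > best_score) { best_score = score; best = label; }
        }
        return best;
    }
};
```

A TAN network would additionally condition each attribute on the class and at most one other attribute, which is what allows dependencies among attributes to surface anomalies the plain NB classifier misses [5].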
DATASET
The KDD99 dataset, used to evaluate intrusion detection techniques as
recently as 2010 [6], serves both to train a real-time Bayesian
algorithm and to measure its effectiveness at detecting intrusion
anomalies. The features defined for each connection record in KDD99
are as follows:
duration (length, in seconds, of the connection), protocol_type
(tcp, udp, etc.), service (network service on the destination,
e.g., http, telnet, etc.), src_bytes (number of data bytes from
source to destination), dst_bytes (number of data bytes from
destination to source), flag (normal or error status of the
connection), land (1 if connection is from/to the same
host/port; 0 otherwise), wrong_fragment (number of “wrong”
fragments), urgent (number of urgent packets) [6]. KDD99 was
originally created for a competition task of building a network
intrusion detector and provides millions of sample records with which
to train our algorithm [6], so it suffices for our purposes.
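As a sketch of how these connection records might be read in (our own illustrative code, not the project’s; note that in the raw KDD99 files the flag field appears fourth, before the byte counts, and each record carries 41 features plus a label, of which only the nine listed above are parsed here):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// The first nine fields of a comma-separated KDD99 connection record.
struct Kdd99Record {
    int duration;                                  // seconds
    std::string protocol_type, service, flag;
    long src_bytes, dst_bytes;
    int land, wrong_fragment, urgent;
};

Kdd99Record parse_record(const std::string& line) {
    std::vector<std::string> f;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) f.push_back(field);

    Kdd99Record r;
    r.duration       = std::stoi(f[0]);
    r.protocol_type  = f[1];
    r.service        = f[2];
    r.flag           = f[3];   // flag precedes the byte counts in the raw file
    r.src_bytes      = std::stol(f[4]);
    r.dst_bytes      = std::stol(f[5]);
    r.land           = std::stoi(f[6]);
    r.wrong_fragment = std::stoi(f[7]);
    r.urgent         = std::stoi(f[8]);
    return r;
}
```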
CURRENT PROGRESS
Current progress on the program has been inconclusive.
Using the ISABA algorithm presented in [2], a Bayesian
network was created using dlib’s implementation of Bayesian
networks [7]. As of now, using a subset of 1,000 samples in
KDD99, the output of the program created has led to, arguably,
nonsense. It appears that the program created is currently lost
in following each step of the algorithm correctly. Over the last
couple weeks, due to midterms and being caught up in the
work of other classes, there has not been a lot of time to work
on research.
Other algorithms implementing Bayesian networks with similar, though
not identical, uses are currently being examined to see whether they
can be tested on the KDD99 dataset. Some academic papers have promoted
an independence-based rather than a scoring-based approach to deciding
whether something is an outlier; this will be taken into consideration
in future testing. Further progress is planned by March 21, 2014.
FUTURE PLAN OF ACTION
Future plans of research, subject to change, are to host a PC gaming
server and website from my home computer to generate a flow of normal
traffic in and out of the server once a proper Bayesian network has
been established using the KDD99 dataset. Within this flow of normal
traffic, attacks such as denial-of-service attacks, buffer overflows,
and probes will be simulated [6]. A tool such as tcpdump or Wireshark
will be used to dump packet information, which will be read in at
intervals not exceeding two minutes and analyzed by our Tree-Augmented
Naïve Bayesian network through a program written in C++. A log of the
alarms raised by the program will be kept to check that they match the
planned simulated attacks on the server, as well as, of course, any
real attacks that may happen. Our algorithm’s current goal is a 90%
detection rate for intrusion anomalies it was taught from the KDD99
dataset, and a 75% detection rate for intrusion anomalies it was not
taught from the KDD99 dataset.
REFERENCES
[1] Mell, Peter and Grance, Timothy, “The NIST Definition of Cloud
Computing,” NIST, Gaithersburg, MD, Rep. 800-145, 2011.
[2] Farid, Dewan Md and Rahman, Mohammad Zahidur, “Anomaly
network intrusion detection based on improved self adaptive
Bayesian algorithm,” in Journal of computers, vol. 5, 2010, pp.
23-31.
[3] Babbar, Sakshi and Chawla, Sanjay, “On Bayesian Network and
Outlier Detection,” in COMAD, 2010, pp. 125-136.
[4] Barbara, Daniel et al., “Detecting novel network intrusions using
Bayes estimators,” in First SIAM Conference on Data Mining, 2001.
[5] Tan, Yongmin et al., "PREPARE: Predictive performance
anomaly prevention for virtualized cloud systems," in 2012
IEEE 32nd International Conference on Distributed Computing
Systems (ICDCS), 2012, pp.285-294.
[6] The KDD Archive, “KDD Cup 1999 Data,” 1999.
https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[7] King, Davis E., “Dlib-ml: A machine learning toolkit,” in Journal
of Machine Learning Research, vol. 10, 2009, pp. 1755-1758.