Detecting Intrusions in the Cloud Using Bayesian Methods of Anomaly Detection

Andre Torres
Edinboro University of Pennsylvania, Edinboro, PA, USA
Email: [email protected]

Abstract—Cloud computing has rapidly gained adoption in the IT industry, but with it come new security concerns that we must take precautions against. By taking preventative measures and detecting anomalies indicative of intrusions before or as they happen, otherwise irreversible damage can be stopped in its tracks. Intrusion detection systems (IDSs) use a multitude of methods, though many of these are inappropriate for cloud computing due to the immense resources they require or their inability to detect new types of intrusions. Using a Tree-Augmented Naïve (TAN) Bayesian network, which has low computational power requirements, we can train a supervised machine learning algorithm that additionally considers dependencies among attributes to analyze network packets and detect both old and new types of intrusions.

Index Terms—cloud computing, anomaly detection, Bayesian

INTRODUCTION

In recent years, the concept of "to the cloud" has taken hold in the IT industry, changing the playing field by allowing anyone around the world to access whatever technological resources they may need. A common developer is given the ability to innovate, a small business is given a doorway to the world, and a worldwide corporation is given a way to connect its branches to a global center. The possibilities of what we have come to know as cloud computing, defined by the National Institute of Standards and Technology (NIST) as "a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction" [1], seem to have infinite potential.

With this new capability, unfortunately, new concerns arise as network-based attacks increase in both frequency and severity. Anomalies, deviations from the constructed profiles of a system's and its users' normal behavior patterns, are often indicative of a potential attack. Employing anomaly detection allows us to detect new attacks as they happen, though this is not as easy as it seems: a system that raises false alarms on a recurring basis wastes the time and labor spent investigating them. By detecting intrusions as they happen, rather than in an offline system or, worse, after the intrusion has occurred, damage can be prevented before it is allowed to happen.

BACKGROUND

Intrusion detection systems (IDSs) today typically use supervised machine learning techniques such as data mining, fuzzy logic, genetic algorithms, neural networks, and support vector machines to identify intrusions [2]. Common IDS types include network IDSs (NIDSs), which investigate incoming and outgoing network traffic; host-based IDSs (HBIDSs), which audit internal interfaces of the machine; protocol-based IDSs (PIDSs), which monitor and analyze the HTTP protocol stream; application protocol-based IDSs (APIDSs), which look for the correct use of specific application or process protocols; and hybrid IDSs (HIDSs), which combine two or more intrusion detection approaches [2].

One method incorporated by IDSs uses the Iterative Dichotomiser 3 (ID3) technique to generate a decision tree from a dataset, an anomaly detection strategy that splits on the attributes of the dataset which give the highest information gain [2]. The idea is that the level of information associated with an attribute value relates to the probability that some occurrence may happen, and the objective is to iteratively separate the dataset into subsets until all elements in each final subset belong to the same class [2].
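To make the attribute-selection step concrete, the following is a minimal C++ sketch (not taken from [2]; the binary normal/anomalous labeling is assumed here for illustration) of the entropy and information-gain computations ID3 uses to pick a split attribute:

#include <cmath>
#include <map>
#include <string>
#include <vector>

// Entropy of a set of class labels (e.g., "normal" / "anomalous").
double entropy(const std::vector<std::string>& labels) {
    std::map<std::string, int> counts;
    for (const auto& l : labels) ++counts[l];
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / labels.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Information gain of splitting `labels` into the given subsets,
// one subset per value of a candidate attribute. ID3 greedily
// picks the attribute whose split maximizes this quantity.
double information_gain(const std::vector<std::string>& labels,
                        const std::vector<std::vector<std::string>>& subsets) {
    double remainder = 0.0;
    for (const auto& s : subsets)
        remainder += (static_cast<double>(s.size()) / labels.size()) * entropy(s);
    return entropy(labels) - remainder;
}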
Other common methods employed by many IDSs are data mining methods that make use of cluster analysis. Clusters are sets of objects organized such that objects in the same group are more similar to each other than to those in other groups. Methods in cluster analysis identify sparse regions in point cloud data to start a search for anomalies [3]. Variations of cluster analysis include distance-based methods, where a data point far away from its neighbors is an anomaly, and density-based methods, where the idea is the same but anomalies are sought in less dense areas [3].

The issue with these cluster methods is the immense amount of data that is ignored. For instance, assume a set of data points is graphed with the X-axis representing income and the Y-axis representing expenditure. We could have four clusters: O1 (small; low income, high expenditure), O2 (large; low to medium income, low to medium expenditure), O3 (moderate; large income, small expenditure), and O4 (small; large income, large expenditure). A distance-based method would flag O4 as consisting of anomalies, while the real, meaningful anomalies lie in O1. In addition, the immense number of calculations required to determine the distance of a new data point from existing clusters, or from hundreds or even thousands of other data points, is not ideal for active anomaly detection in cloud computing given the resources constantly being consumed. For these reasons, we turn to a reliable, low-resource solution: an algorithm which uses a Bayesian network to determine anomalies.

Bayes Rule, shown in Equation 1, is fundamental to the formation of a Bayesian network; it provides a way to calculate the probability of some hypothesis based on its prior probability, where the most probable hypothesis is the best hypothesis [4]. Observed data D is taken into consideration in addition to any initial knowledge of the prior probabilities of the various hypotheses h:

    P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}    (1)

P(h|D), the probability of hypothesis h after the relevant evidence is taken into account, is determined from P(h), the prior probability associated with hypothesis h; P(D), the probability of the occurrence of data D; and P(D|h), the probability of observing D given that h holds [4]. What we are concerned with in our research is a set of candidate hypotheses H and the most probable hypothesis h ∈ H given the data D, also known as the maximum a posteriori (MAP) hypothesis [4]. Equation 2 defines the maximum likelihood (ML) hypothesis, the hypothesis that maximizes the likelihood P(D|h) of the data; when every hypothesis in H is considered equally probable a priori, the ML hypothesis coincides with the MAP hypothesis:

    h_{ML} \equiv \arg\max_{h \in H} P(D \mid h)    (2)

Using pseudo-Bayes estimators, which follow Bayes Rule, is a technique in discrete multivariate analysis used to provide estimated cell values for contingency tables that may contain a large quantity of sampling zeros [4]. Issues with sampling zeros arise from misleading data: for example, reporting 0/5 and 0/500 as both equal to zero is misleading, since the two point to very different rates. Pseudo-Bayes estimators can be built in a multitude of ways [4], but their goal is to build a naïve Bayesian (NB) classifier.
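As a small illustration of how smoothing distinguishes these two cases (a minimal sketch only; pseudo-Bayes estimators can be built in several ways [4], and this simple additive pseudo-count form, with assumed prior rate and weight parameters, is just one variant):

#include <cstdio>

// Additive (pseudo-count) smoothed rate estimate: instead of the raw
// rate k/n, blend in a prior rate `prior` with weight `alpha` pseudo
// observations, so that zero counts no longer collapse to zero.
double smoothed_rate(int k, int n, double prior = 0.5, double alpha = 1.0) {
    return (k + alpha * prior) / (n + alpha);
}

int main() {
    // Raw rates are both 0, but the smoothed estimates differ,
    // reflecting how much more evidence 0/500 carries than 0/5.
    std::printf("0/5   -> %.4f\n", smoothed_rate(0, 5));    // ~0.0833
    std::printf("0/500 -> %.4f\n", smoothed_rate(0, 500));  // ~0.0010
}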
NB classifiers, although they rely on over-simplified independence assumptions, are often well suited to complex real-world tasks because of how efficiently they can be trained in a supervised learning environment [2]. Their limitations include an inability to provide accurate metric attribution information [5]. Adding directed edges between attributes in our Bayesian network, thereby taking dependencies among attributes into consideration, allows us to classify system states as normal or abnormal and to produce a ranked list of the metrics most indicative of the anomaly [5]. Using only the supervised NB classifier method, we would be able to detect only recurring anomalies; new anomalies can be discovered by including data on dependencies among attributes and creating a Tree-Augmented Naïve (TAN) Bayesian network [5].

DATASET

The KDD99 dataset, used for the evaluation of intrusion detection techniques as recently as 2010 [6], is used to train and then determine the effectiveness of a real-time Bayesian algorithm for detecting intrusion anomalies. The features defined for each connection record in KDD99 are: duration (length, in seconds, of the connection); protocol_type (tcp, udp, etc.); service (network service on the destination, e.g., http, telnet); src_bytes (number of data bytes from source to destination); dst_bytes (number of data bytes from destination to source); flag (normal or error status of the connection); land (1 if the connection is from/to the same host/port, 0 otherwise); wrong_fragment (number of "wrong" fragments); and urgent (number of urgent packets) [6]. KDD99 was originally created for a competition whose task was building a network intrusion detector, and it contains millions of sample records with which to train our algorithm [6], so it suffices for our purposes.

CURRENT PROGRESS

Current progress on the program has been inconclusive. Using the ISABA algorithm presented in [2], a Bayesian network was created using dlib's implementation of Bayesian networks [7]. As of now, using a subset of 1,000 samples from KDD99, the program's output has been, arguably, nonsense; it appears the program does not yet follow each step of the algorithm correctly. Over the last couple of weeks, due to midterms and coursework for other classes, there has not been much time to work on this research. Other algorithms implementing Bayesian networks with similar, though not identical, purposes are being examined to see whether they can be tested on the KDD99 dataset. Some academic papers have promoted an independence-based approach, rather than a scoring-based approach, to deciding whether something is an outlier; this will be taken into consideration in future testing. Progress is planned to be made by March 21, 2014.

FUTURE PLAN OF ACTION

Future plans for this research, subject to change, are as follows. Once a proper Bayesian network is established using the KDD99 dataset, a server hosted on my home computer will run a PC game server and website to generate a flow of normal traffic in and out of the server. Within this flow of normal traffic, attacks such as denial-of-service attacks, buffer overflows, and probes will be simulated [6]. A program such as tcpdump or Wireshark will be used to dump packet information, which will be read in at an interval not to exceed two minutes and analyzed by our Tree-Augmented Naïve Bayesian network through a program written in C++. A log of the alarms raised by the program will be kept to check that they match the planned simulated attacks on the server, as well as any real attacks that may occur.
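To make the per-record classification step concrete, below is a minimal naïve-Bayes-style scoring sketch in C++ (all probability values are hypothetical stand-ins for quantities that would be learned from KDD99, and the 0.5 alarm threshold is assumed; a TAN network would additionally condition each attribute on one parent attribute, i.e., P(value | parent_value, class)):

#include <iostream>
#include <map>
#include <string>

int main() {
    // Class priors P(class), hypothetical values.
    std::map<std::string, double> prior = {{"normal", 0.8}, {"anomalous", 0.2}};

    // P(protocol_type = value | class), hypothetical values.
    std::map<std::string, std::map<std::string, double>> p_protocol = {
        {"normal",    {{"tcp", 0.7}, {"udp", 0.25}, {"icmp", 0.05}}},
        {"anomalous", {{"tcp", 0.3}, {"udp", 0.1},  {"icmp", 0.6}}}};

    // P(service = value | class), hypothetical values.
    std::map<std::string, std::map<std::string, double>> p_service = {
        {"normal",    {{"http", 0.6}, {"telnet", 0.05}, {"other", 0.35}}},
        {"anomalous", {{"http", 0.2}, {"telnet", 0.3},  {"other", 0.5}}}};

    // One observed (discretized) connection record.
    std::string protocol = "icmp", service = "other";

    // Unnormalized joint probabilities, then normalize via Bayes Rule.
    double joint_normal = prior["normal"] * p_protocol["normal"][protocol] *
                          p_service["normal"][service];
    double joint_anom = prior["anomalous"] * p_protocol["anomalous"][protocol] *
                        p_service["anomalous"][service];
    double posterior_anom = joint_anom / (joint_normal + joint_anom);

    std::cout << "P(anomalous | record) = " << posterior_anom << "\n";
    if (posterior_anom > 0.5)  // assumed alarm threshold
        std::cout << "ALARM: possible intrusion\n";
}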
The current goal for our algorithm is a 90% detection rate for intrusion anomaly types that are represented in the KDD99 training data, and a 75% detection rate for intrusion anomaly types that are not.

REFERENCES

[1] P. Mell and T. Grance, "The NIST Definition of Cloud Computing," NIST, Gaithersburg, MD, Special Publication 800-145, 2011.
[2] D. M. Farid and M. Z. Rahman, "Anomaly network intrusion detection based on improved self adaptive Bayesian algorithm," Journal of Computers, vol. 5, 2010, pp. 23-31.
[3] S. Babbar and S. Chawla, "On Bayesian network and outlier detection," in Proc. COMAD, 2010, pp. 125-136.
[4] D. Barbará et al., "Detecting novel network intrusions using Bayes estimators," in Proc. First SIAM Conference on Data Mining, 2001.
[5] Y. Tan et al., "PREPARE: Predictive performance anomaly prevention for virtualized cloud systems," in Proc. 2012 IEEE 32nd International Conference on Distributed Computing Systems (ICDCS), 2012, pp. 285-294.
[6] The UCI KDD Archive, "KDD Cup 1999 Data," 1999. https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[7] D. E. King, "Dlib-ml: A machine learning toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755-1758, 2009.