APPLYING NEURO-FUZZY TECHNIQUES TO INTRUSION DETECTION

By LORI LEILANI DELOOZE
B.A., University of Colorado, 1985
M.B.A., George Washington University, 1989
M.S., Naval Postgraduate School, 1991

A dissertation proposal submitted to the Graduate Faculty of the University of Colorado at Colorado Springs in partial fulfillment of the degree of Doctor of Philosophy, Department of Computer Science, 2004.

This dissertation proposal for the Doctor of Philosophy degree by Lori Leilani DeLooze has been approved for the Department of Computer Science by:

Jugal K. Kalita, Chair
Marijke F. Augusteijn
C. Edward Chow
Xiaobo Zhou
Robert Carlson
William Ayen

CONTENTS

CHAPTER 1  INTRODUCTION
    Neuro-Fuzzy
CHAPTER 2  COMPUTER SECURITY DOMAIN
    Common Vulnerabilities and Exposures List
    DARPA Data Sets
    Anomaly Detection Schemes
    KDD CUP '99
    Knowledge Discovery and Data Mining (KDD) Processes
CHAPTER 3  SYSTEM DESIGN AND ARCHITECTURE
CHAPTER 4  EXPERIMENT SET-UP
CHAPTER 5  METHOD OF VALIDATION
CHAPTER 6  IMPORTANCE OF THE RESEARCH
REFERENCES

CHAPTER 1
INTRODUCTION

As computer technology evolves and the threat of computer crimes increases, the apprehension and preemption of such violations become more and more difficult and challenging. Most system security mechanisms are designed to prevent unauthorized access to system resources and data. To date, it appears that completely preventing breaches of security is unrealistic. Therefore, we must try to detect these intrusions as they occur so that actions may be taken to repair the damage and prevent further harm. Over the years, intrusion detection has become a major area of research in computer science, and various innovative methods have been applied to these systems.

In the last five years, the information revolution has finally come of age. More than ever before, we see that the Internet has changed our lives. The possibilities and opportunities are limitless; unfortunately, so too are the risks and the likelihood of malicious intrusions. Intruders can be classified into two categories: outsiders and insiders. Outsiders are intruders who approach your system from outside of your network and who may attack your external presence (e.g., deface web servers, forward spam through e-mail servers). They may also attempt to go around the firewall and attack machines on the internal network.
Insiders, in contrast, are legitimate users of your internal network who misuse privileges, impersonate higher-privileged users, or use proprietary information to gain access from external sources.

Intrusion Detection Systems (IDS) are designed to monitor network traffic to determine if an intrusion has occurred. The two basic methods of detection are signature-based and anomaly-based (Denning, 1987). The signature-based method, also known as misuse detection, looks for a specific signature whose match signals an intrusion. Network traffic is scanned as it passes by for specific features that might indicate an attack or an intrusion. This means that these systems are not unlike virus detection systems -- they can detect many or all known attack patterns, but they are of little use for as-yet-unknown attack methods. Most popular intrusion detection systems fall into this category. A misuse detection IDS uses a database of traffic or activity patterns related to known attacks to identify and categorize malicious activity on the network. SNORT (Caswell et al., 2003), a popular, open-source IDS, has a wide range of downloadable signature files that can be selected to conform to specific system requirements.

Another approach to intrusion detection is called anomaly detection. Anomaly-based systems attempt to model events to the point where they "learn" what is normal and can then detect an anomaly that might indicate an intrusion. Anomaly detection techniques assume that all intrusive activities are necessarily anomalous. This means that if we could establish a normal activity profile for a system, we could, in theory, flag all system states that vary from the established profile by statistically significant amounts as intrusion attempts.
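The profile-deviation idea can be sketched in a few lines. The monitored metric ("logins per hour") and the three-standard-deviation threshold below are illustrative assumptions for the sketch, not values taken from any particular system:

```python
import statistics

def flags_anomaly(history, observation, threshold=3.0):
    """Flag an observation that deviates from the profile mean
    by more than `threshold` standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return observation != mean
    return abs(observation - mean) / stdev > threshold

# Hypothetical profile: logins per hour observed for one user.
profile = [4, 5, 3, 6, 5, 4, 5, 6]
print(flags_anomaly(profile, 5))   # → False (within the profile)
print(flags_anomaly(profile, 40))  # → True (statistically significant deviation)
```

The choice of `threshold` is exactly the tuning problem discussed next: lowering it catches more intrusions at the price of more false alarms.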
The main issue in anomaly detection systems thus becomes the selection of threshold levels so that the system neither flags anomalous activities that are non-intrusive nor fails to flag intrusive activities that are anomalous. Anomaly detection systems are computationally expensive because of the overhead of keeping track of, and possibly updating, several system profile metrics.

Neuro-Fuzzy

The best approach for an Intrusion Detection System may be to combine the advantages of both the anomaly detection and misuse detection components into a single compound scheme that can also accommodate the imprecision inherent in the domain of network security. An effective IDS must use more than standard mathematical techniques -- conventional analysis methods can be combined with soft computing techniques to synergistically create a more robust system. Soft computing differs from conventional (hard) computing in that, unlike hard computing, it is tolerant of imprecision, uncertainty, partial truth, and approximation (Zadeh, 1994).

Figure 1: Soft Computing Components for Intrusion Detection Systems -- Neural Networks (functional approximation) and Fuzzy Logic (approximate reasoning) combine to form a Neuro-Fuzzy IDS.

The principal constituents of Soft Computing are Fuzzy Logic, Neural Computing, and Evolutionary Computation. While the latter may have some contribution to the field of Intrusion Detection Systems, this research will only explore the applicability of the former two (see Figure 1). It is important to note that soft computing is not a mélange. Rather, it is a partnership in which each of the components contributes a distinct methodology for addressing problems in its domain. In this perspective, the principal constituent methodologies in soft computing are complementary rather than competitive.
The complementarity of Fuzzy Logic and Neural Computing has an important consequence: in many cases a problem can be solved most effectively by using them in combination rather than using each technique individually. This particularly effective combination is what has come to be known as "Neuro-Fuzzy systems." Neuro-Fuzzy systems synergistically combine the functional approximation intrinsic to neural networks with the approximate reasoning capability of Fuzzy Logic to learn to adapt to unknown or changing environments and to deal with imprecision and uncertainty. A neuro-fuzzy approach may be able to mitigate the deficiencies of the anomaly detection elements of current intrusion detection systems.

Neural Networks. Work on artificial neural networks has been motivated from its inception by the recognition that the human brain computes in an entirely different way from the conventional digital computer (Haykin, 1999, p. 1). The brain has the capability to organize its structural constituents, known as neurons, so as to perform certain computations (e.g. pattern recognition, perception, and motor control) many times faster than the fastest digital computer. To achieve good performance, neural networks employ massive interconnections of neurons. The resulting Neural Network acquires knowledge of the environment through a process of learning, which systematically changes the interconnection strengths, or synaptic weights, of the network in an orderly fashion to attain a desired design objective. McCulloch and Pitts proposed the first model of a formal neuron in their landmark 1943 paper (cited in Haykin, 1999) that described neurons as "threshold-logic switches", i.e. elements that form weighted sums of signal values and elicit an active response when the sum exceeds a specific threshold value.
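A McCulloch-Pitts style unit can be sketched directly from that description. The weights and the threshold value below are arbitrary example values chosen so the unit behaves as a logical AND gate:

```python
def tlu(inputs, weights, threshold):
    """McCulloch-Pitts threshold logic unit: a weighted sum of the
    inputs followed by a step function on the threshold."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0

# Two inputs, unit weights, threshold 1.5: the unit "fires" only
# when both inputs are active, i.e. it computes AND.
print(tlu([1, 1], [1.0, 1.0], 1.5))  # → 1
print(tlu([1, 0], [1.0, 1.0], 1.5))  # → 0
```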
Figure 2 gives a graphical representation of a primitive neuron, called a Threshold Logic Unit (TLU), with n real-valued inputs, x_i, each of which is associated with a parameter w_i. The parameter w_i is also known as a "synaptic weight" and represents the functional connectivity between two cells. The TLU performs a weighted sum operation followed by a non-linear thresholding operation, or step function, such that if the value of the sum is greater than some threshold value, the output y of the unit is 1; otherwise it is 0. In other words, the neuron will "fire", or emit an instantaneous "1" signal, if the threshold is exceeded; otherwise, it will do nothing.

Figure 2: Primitive neuron described by McCulloch and Pitts

Artificial neurons in isolation are not very impressive. Large assemblages, however, can give rise to computationally complex capabilities. An Artificial Neural Network (ANN) is defined as a structure composed of a number of interconnected artificial neurons with characteristic input, output and computational features. The manner in which the neurons of a neural network are structured is intimately linked to the learning algorithm used to train the network. An unsupervised learning system evolves to extract features or regularities in presented patterns, without being told what outputs or classes associated with the input patterns are desired. In other words, the learning system detects or categorizes persistent features without any feedback from the environment. Unsupervised learning is frequently employed on competitive neural networks to aid in data clustering, feature extraction, and similarity detection. The self-organizing map (Dittenbach et al., 2002) is one of the most prominent artificial neural network models adhering to the unsupervised learning paradigm. The model consists of a number of neural processing elements, i.e. units. Each of the units i is assigned an n-dimensional weight vector m_i.
It is important to note that the weight vectors have the same dimensionality as the input patterns. The training process of self-organizing maps may be described in terms of input pattern presentation and weight vector adaptation. Each training iteration starts with the random selection of one input pattern. This input pattern is presented to the self-organizing map and each unit determines its activation. Usually, the Euclidean distance between the weight vector and the input pattern is used to calculate a unit's activation. The unit with the lowest activation is referred to as the winner of the training iteration. Finally, the weight vectors of the winner, as well as the weight vectors of selected units in the vicinity of the winner, are adapted. This adaptation is implemented as a gradual reduction of the component-wise difference between input pattern and weight vector. Geometrically speaking, the weight vectors of the adapted units are moved a bit towards the input pattern. The amount of weight vector movement is guided by a learning rate that decreases in time. The number of units that are affected by adaptation is determined by a so-called neighborhood function; this number of units also decreases in time. This movement has a consequence: the Euclidean distance between those vectors decreases, and thus the weight vectors become more similar to the input pattern. The respective unit is then more likely to win at future presentations of this input pattern. The consequence of adapting not only the winner alone but also a number of units in the neighborhood of the winner leads to a spatial clustering of similar input patterns in neighboring parts of the self-organizing map. Thus, similarities between input patterns that are present in the n-dimensional input space are mirrored within the two-dimensional output space of the self-organizing map.
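The training loop just described can be sketched as follows. The grid size, the linear decay schedules, and the initial learning rate of 0.5 are illustrative choices for the sketch, not parameters taken from this proposal:

```python
import math
import random

def train_som(patterns, rows, cols, iterations=1000):
    """Sketch of SOM training: pick a random input pattern, find the
    lowest-distance (winning) unit, and pull the winner and its grid
    neighbors toward the input.  Both the learning rate and the
    neighborhood radius decay linearly over time."""
    dim = len(patterns[0])
    weights = {(r, c): [random.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(iterations):
        x = random.choice(patterns)
        # Winner: unit whose weight vector is closest to the input.
        winner = min(weights, key=lambda u: math.dist(weights[u], x))
        alpha = 0.5 * (1 - t / iterations)               # decaying learning rate
        radius = max(rows, cols) / 2 * (1 - t / iterations)
        for unit, w in weights.items():
            if math.dist(unit, winner) <= radius:        # neighborhood function
                for i in range(dim):
                    w[i] += alpha * (x[i] - w[i])        # move toward input
    return weights
```

After training, inputs that are similar in the n-dimensional input space end up winning at neighboring grid positions, which is the topology-preserving property exploited below.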
The training process of the self-organizing map describes a topology-preserving mapping from a high-dimensional input space onto a two-dimensional output space, where patterns that are similar in terms of the input space are mapped to geographically close locations in the output space.

Applying Self-Organizing Maps to Intrusion Detection is not a new concept. Brandon Rhodes and his fellow researchers conducted an experiment that used 30 Domain Name Service request packets to form a Self-Organizing Map and ten additional packets to test the capabilities of the Map to classify anomalous behavior related to two different buffer overflow intrusions (Rhodes et al., 2000). Bihn Viet Nguyen (Nguyen, 2002) conducted additional research on the application of self-organizing maps to intrusion detection as part of an Ohio University Artificial Intelligence class final project. Although both were able to prove the effectiveness of the concept, they highlighted the need to address a shifting window of normal behavior. As time passes, the patterns of usage will change; therefore, old usage patterns will no longer accurately reflect normal behaviors. New Self-Organizing Maps need to be created over time to reflect new usage patterns. These successive SOMs can also be used as a basis for further processing as part of a composite neuro-fuzzy system.

Fuzzy Sets. Fuzzy set theory provides a mathematical framework for representing and treating uncertainty, imprecision, and approximate reasoning. Unlike classical set theory, where membership is all or nothing, fuzzy set theory allows for partial membership. That is, an element x has a membership function, μA(x), that represents the degree to which x belongs to the set A. Other features further define the membership function. For a given fuzzy set A, the core is the (conventional) set of all elements x ∈ U such that μA(x) = 1. The support of the set is all x ∈ U such that μA(x) > 0.
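A small example makes core and support concrete. The fuzzy set "high connection rate" below, its discrete universe, and its membership degrees are all invented for illustration:

```python
# Membership degrees mu_A(x) for a hypothetical fuzzy set A =
# "high connection rate", over a discrete universe U of
# connections-per-second values.
mu_A = {0: 0.0, 10: 0.2, 20: 0.5, 30: 1.0, 40: 1.0}

core = sorted(x for x, m in mu_A.items() if m == 1.0)    # full membership
support = sorted(x for x, m in mu_A.items() if m > 0.0)  # any membership

print(core)     # → [30, 40]
print(support)  # → [10, 20, 30, 40]
```

An element such as x = 20 belongs to A only to degree 0.5 -- the partial membership that classical sets cannot express.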
Fuzzy set operations such as union, intersection, and complement are similar to those of ordinary set operations. The union of two fuzzy sets A and B is a fuzzy set C, where C = A ∪ B, whose membership function is related to those of A and B by the following equation:

    μC(x) = max(μA(x), μB(x)) = μA(x) ∨ μB(x)

The intersection of two fuzzy sets A and B is a fuzzy set C, where C = A ∩ B, whose membership function is given by:

    μC(x) = min(μA(x), μB(x)) = μA(x) ∧ μB(x)

The complement of a fuzzy set A is a fuzzy set whose membership function is given by:

    μ¬A(x) = 1 - μA(x)

Coincidentally, Susan Bridges introduced the use of fuzzy sets and fuzzy data mining techniques for IDSs at the same National Information Systems Security Conference in Baltimore, MD, that introduced the use of Self-Organizing Maps for IDSs (Bridges & Vaughn, 2000). Although crisp rule-based expert systems have served as the basis for several intrusion detection systems, fuzzy rules are better suited to the security domain. She proposed a system that, when presented with a set of audit data, is able to mine a set of fuzzy association rules from the data. These rules are considered a high-level description of patterns of behavior found in the data. This new research simply uses a significantly different means to mine the fuzzy association rules.

CHAPTER 2
COMPUTER SECURITY DOMAIN

Rapid adoption of the Internet has led to significant computer security challenges. In its original form, the Internet was a survivable network conceived by the Department of Defense for sharing necessary information among government and academic research centers. Because there was a certain level of trust among the nodes of the network, there was no embedded security.
As commercial entities and consumers began to flood the network, these research centers began to search for additional protection mechanisms, such as Intrusion Detection Systems, to detect malicious activity and protect their systems against potential damage. Network Intrusion Detection Systems monitor behavior by examining the content and format of data packets. Many IDSs take a data mining approach by "mining" through the network-based data to detect possible attacks from either internal or external intruders. Many network intrusion detection methods have been proposed in the research community over the past several years. Most of them approach intrusion detection as a specialized classification problem. Typical classification problems are built using training examples from all classes so that the system can classify each pattern into one of the training classes with as low a generalization error as possible.

Common Vulnerabilities and Exposures List

One attempt to aid in the classification and sharing of attack information is MITRE's collection of Common Vulnerabilities and Exposures (CVE). The CVE (MITRE, 2002) is a list or dictionary that provides a common name for publicly known security weaknesses. Using a common name makes it easier to correlate data across multiple databases and integrate disparate security tools. If a security tool incorporates a CVE name into the classification of a security event, then it is much easier to identify the related remedy necessary to fix the problem. Three tenets underlie the CVE initiative (Martin, 2001). First, each vulnerability or exposure should have only one name and standardized description. Second, the CVE list should exist as a dictionary rather than as a database. Third, the CVE can only exist with industry endorsement and the integration of compatible products and services. This doctrine can only be effective if it is accepted throughout the computer security community.
The list of entries is the result of an open and collaborative effort of the CVE editorial board, which is made up of numerous security-related organizations including security tool vendors, academic and research institutions, and government agencies. As of mid-November 2003, the CVE web site (www.cve.mitre.org) contained 6,347 unique information security CVE entries. Of these, 2,572 were approved CVE entries on the official CVE list and 3,775 were candidate entries pending board approval. Candidate CVE entries may reflect breaking and newly-discovered vulnerability information that may be of special or immediate concern to the public. Both types of entries are useful in helping organizations with the job of managing information systems security. Each entry in the dictionary includes a unique CVE identification number, a text description of the vulnerability, and any pertinent references. References may include the initial announcement and its source, related vendor documents, and other technical observations or public notices. Various security-related products, services and repositories use CVE names to let users cross-reference information with other repositories, tools and services. Integrating vulnerability services, databases, web sites and tools that incorporate CVE names will provide more complete security coverage. By using a CVE-compatible intrusion detection system, an attack report that includes a reference to the related CVE entry can be used to contact the vendor's web site to identify the location of a CVE-compatible fix or procedure. CVE's adoption and support within the commercial and academic communities should also facilitate a more systematic and predictable handling of security incidents. Even the Federal Bureau of Investigation's annual Top 20 Internet Security Threats list includes related CVE names. The CVE dictionary, however, is not a taxonomy. The CVE list is organized in simple numerical order by date of acceptance.
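A dictionary entry as described above -- a unique CVE name, a text description, and pertinent references -- might be represented as a simple record. The sample values below are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class CveEntry:
    """Sketch of one CVE dictionary entry: identifier,
    description, and references."""
    cve_id: str
    description: str
    references: list = field(default_factory=list)
    candidate: bool = False   # True while pending editorial-board approval

# Illustrative entry; the description text is paraphrased, not official.
entry = CveEntry(
    cve_id="CVE-1999-0067",
    description="phf CGI program allows remote command execution.",
    references=["initial announcement", "vendor advisory"],
)
print(entry.cve_id)  # → CVE-1999-0067
```

Keying IDS alerts on `cve_id` is what makes the cross-referencing between tools and repositories described above possible.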
The computer security community needs a taxonomy to aid in the classification of, and correlation between, exploitations. The Internet Attack Taxonomy (Mostow et al., 2002), developed by Atlantic Consulting Services under contract to the US Army Communications and Electronics Command, was created to support an Information Warfare Simulation tool. This library of attacks can be used as a classification system for all publicly known exploits and exposures contained in the CVE listing. According to the proposed taxonomy, all known exploits can fit into 25 "buckets" in four general categories: Reconnaissance, Denial of Service, Unauthorized Access, and Deception.

The Internet Attack Taxonomy can provide a valuable stepping-stone for developing a classification scheme for all CVE entries and therefore provide a means for classifying all information system vulnerabilities and exposures. This taxonomy can be easily integrated with an intrusion detection scheme if each pattern used in the training set for the classification of attacks can be correlated with a specific CVE entry. Therefore, a similar attack pattern should be related to the same CVE entry even if it is a novel attack. This should enable security managers to begin the recovery process immediately without specific knowledge of the details of the attack technique. In addition, this relationship between the CVE entries and training attack scenarios will provide the means to develop a CVE-compatible IDS device.

DARPA Data Sets

In 1998, the Defense Advanced Research Projects Agency (DARPA) intrusion detection evaluation created the first standard corpus for evaluating intrusion detection systems. The 1998 off-line intrusion detection evaluation was the first in a planned series of annual evaluations conducted by the Massachusetts Institute of Technology (MIT) Lincoln Laboratories under DARPA sponsorship.
The corpus was designed to evaluate both false alarm rates and detection rates of intrusion detection systems using many types of both known and new attacks embedded in a large amount of normal background traffic (Kendall, 1999, p. 2). Over 300 attacks were included in the 9 weeks of data collected for the evaluation. These 300 attacks were drawn from 32 different attack types and 7 different attack scenarios.

The attacks were developed to provide a reasonable amount of variance in attack methods. Some attacks occurred in a single session with all actions occurring in the clear, while others were spread out over several different sessions and clearly employed methods to evade detection. The attack scenarios also included diversity in the intent of the exploitation. Some attacks were just for fun, while others were for the express purpose of collecting confidential information or causing damage.

The corpus was collected from a simulation network (Kendall, 1999, p. 22) that was used to automatically generate realistic traffic, including the attacks cited above. The simulation network consisted of two Ethernet network segments connected by a router. The "outside" network consisted of a traffic generator (used for both background traffic and automated attacks), a web server, a sniffer, and two workstations for ad-hoc attack generation. The "inside" network, which simulated the fictitious "eyrie.af.mil" domain, consisted of a background traffic generator, a sniffer, and four UNIX victim workstations. Modifications to the operating systems of the background traffic generators and web servers enabled them to simulate the actions of several hundred "virtual" machines. Training data was labeled with attacks and provided to participants to train and tune their intrusion detection systems (DARPA, 2001). Unlabeled test data was later provided for blind evaluation. List files were used to label attacks in the training data.
These files contain entries for every important TCP network connection and relevant ICMP and UDP packet. Each line begins with a unique identification number, the start date and time for the first byte in the connection or packet, the duration until the final byte was transmitted, the service name, and the source and destination ports and IP addresses. The service name contains either the common port name for TCP and UDP connections or the packet type for ICMP packets. Further attack information is included in each connection/packet involved in an attack to aid in classification processes. The attack score is 0 and the name is "-" for connections that are not part of an attack. In the training data, the attack score is set to 1 and the name is a text string to label all connections associated with attacks (Table 1). In the test data, attacks are not labeled. Instead, all attack scores are 0 and all attack names are "-". Participants are expected to annotate list file entries corresponding to attacks detected by the IDS. Attack names are expected to contain either the name of an old attack or a more generic attack category name for new attacks.
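A list-file entry with these fields could be parsed as follows. The exact whitespace-delimited layout is an assumption based on the field description and the sample in Table 1:

```python
def parse_list_entry(line):
    """Split one list-file entry into the fields described above:
    id, start date/time, duration, service, ports, IPs, attack
    score, and attack name (field order follows Table 1)."""
    (line_no, date, time, duration, service,
     src_port, dst_port, src_ip, dst_ip, score, name) = line.split()
    return {
        "id": int(line_no),
        "start": f"{date} {time}",
        "duration": duration,
        "service": service,
        "src": (src_ip, int(src_port)),
        "dst": (dst_ip, int(dst_port)),
        "attack": int(score) == 1,   # score 1 marks a labeled attack
        "name": name,                # "-" for non-attack connections
    }

entry = parse_list_entry(
    "90 01/23/1998 16:59:22 00:00:22 telnet 1867 23 "
    "192.168.1.30 192.168.0.20 1 guess")
print(entry["attack"], entry["name"])  # → True guess
```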
Line #  Start Date  Start Time  Duration  Service  Src Port  Dest Port  Src IP        Dest IP       Attack Score  Name
69      01/23/1998  16:58:55    00:00:04  finger   1847      79         192.168.1.30  192.168.0.20  0             -
73      01/23/1998  16:58:58    00:00:18  ftp      1850      21         192.168.1.30  192.168.0.20  0             -
77      01/23/1998  16:59:05    00:00:01  finger   1855      79         192.168.1.30  192.168.0.20  0             -
90      01/23/1998  16:59:22    00:00:22  telnet   1867      23         192.168.1.30  192.168.0.20  1             guess
99      01/23/1998  16:58:58    00:00:03  smtp     43533     25         192.168.0.40  192.168.0.20  0             -
101     01/23/1998  16:59:37    00:00:44  telnet   1876      23         192.168.1.30  192.168.0.20  0             -
110     01/23/1998  17:00:00    00:00:23  telnet   1884      23         192.168.1.30  192.168.0.20  1             guess
125     01/23/1998  17:00:38    00:00:02  rsh      1023      1021       192.168.1.30  192.168.0.20  1             rcp

Table 1: Sample Training List

Initial observations of the evaluation results for the 1998 competition concluded that most IDSs can easily identify older, known attacks with a low false-alarm rate, but do not perform as well when identifying novel or new attacks (Lippman et al., 2000, p. 17). Several additional intrusion detection contests, such as DARPA 1999 and KDD Cup 1999, used similar data sets to evaluate results in intrusion detection research. The DARPA 1999 evaluation used a similar structure for the contest, but included Windows NT workstations in the simulation network. These evaluations of developing technologies are essential to focus effort, document existing capabilities, and guide research. The DARPA evaluation focused on the development of evaluation corpora that could be used by many researchers for system design and refinement. The evaluation used the Receiver Operating Characteristic (ROC) technique to assess intrusion detection systems. The ROC approach analyzed the tradeoff between false alarm rates and detection rates for detection systems.
ROC curves for intrusion detection indicate how the detection rate changes as internal thresholds are varied to generate more or fewer false alarms, trading off detection accuracy against analyst workload. Lincoln Labs used thirty-two different attack types during the evaluation. Several attacks were used in both the training and testing phases, while other attacks were new and novel attacks that were used only during the test phase. The attacks were categorized as Denial of Service (DOS), Remote to Local (R2L), User to Root (U2R), and Surveillance attacks.

Denial of Service attacks are designed to disrupt a host or network service. Some DOS attacks excessively load a legitimate network service, others create malformed packets that are incorrectly handled by the victim machine, and others take advantage of software bugs in network daemon programs. Remote to Local attacks involve a user who does not have an account on a victim machine sending packets to that machine to gain local access. Some R2L attacks exploit buffer overflows in network server software, others exploit weak or misconfigured security policies, and one even used a Trojan password capture program. Local users employ User to Root attacks to obtain privileges normally reserved for the UNIX root or super user. Some U2R attacks exploit poorly written system programs that run at root level and are susceptible to buffer overflows; others exploit weaknesses in path name verification, bugs in some versions of perl, and other software flaws. Reconnaissance attacks are probes or scans that can automatically examine a network of computers to gather information or find known vulnerabilities. Such probes are often precursors to more dangerous attacks because they provide a map of machines and services and pinpoint weak points in a network.

Anomaly Detection Schemes

Six groups participated in the evaluation and submitted systems using a variety of approaches to intrusion detection.
Two systems used a statistical approach, three used a rule-based approach, and one used a data mining approach to intrusion detection. Statistical-based intrusion detection (SBID) systems seek to identify abusive behavior by noting and analyzing audit data that deviates from a predicted norm. SBID is based on the premise that intrusions can be detected by inspecting a system's audit trail data for unusual activity, and that an intruder's behavior will be noticeably different from that of a legitimate user. Before unusual activity can be detected, SBID systems require a characterization of user or system activity that is considered "normal." These characterizations, called profiles, are typically represented by sequences of events that may be found in the system's audit data. Any sequence of system events deviating from the expected profile by a statistically significant amount is flagged as an intrusion attempt (Sundaram, 1996). The main advantage of SBID systems is that intrusions can be detected without a priori information about the security flaws of a system. Because user profiles are updated periodically, it is possible for an insider to slowly modify his behavior over time until a new behavior pattern has been established within which an attack can be safely mounted (Lunt, 1996). Determining an appropriate threshold for "statistically significant deviations" can be difficult. If the threshold is set too low, anomalous activities that are not intrusive are flagged as intrusive (false positives). If the threshold is set too high, anomalous activities that are intrusive are not flagged as intrusive (false negatives). Defining user profiles may also be difficult, especially for those users with erratic work schedules or habits.

Rule-based intrusion detection (RBID) systems, in contrast to SBID systems, are predicated on the assumption that intrusion attempts can be characterized by sequences of user activities that lead to compromised system states.
RBID systems are characterized by their expert system properties that fire rules when audit records or system status information begin to indicate illegal activity (Ilgun, 1993). These predefined rules typically look for high-level state change patterns observed in the audit data and compare them to predefined penetration state change scenarios. If an RBID expert system infers that a penetration is in process or has occurred, it will alert the computer system security officers and provide them with both a justification for the alert and the user identification of the suspected intruder. There are two major types of rule-based intrusion detection systems: state-based and model-based. In a state-based RBID, the rule base is codified using the terminology found in the audit trails. Intrusion attempts are defined as sequences of system states, as defined by audit trail information, leading from an initial, limited-access state to a final compromised state. In a model-based RBID system, known intrusion attempts are modeled as sequences of user behaviors; these behaviors may then be modeled, for example, as events in an audit trail. Note, however, that the intrusion detection system itself is responsible for determining how an identified user behavior may manifest itself in an audit trail. Due to the voluminous, detailed nature of system audit data (some of which may have little if any meaning to a human reviewer) and the difficulty of discriminating between normal and intrusive behavior, analysts may use expert systems technology to automatically analyze audit trail data for intrusion attempts.

Data Mining Intrusion Detection (DMID) systems take a data-centric point of view and consider intrusion detection as a data analysis process. Data mining generally refers to the process of (automatically) extracting models from large stores of data.
The recent rapid development in data mining has made available a wide variety of algorithms, drawn from the fields of statistics, pattern recognition, machine learning, and databases. DMID systems use data mining techniques to correlate knowledge derived from separate, heterogeneous data sets into a rule set capable of providing a general description of an environment comprising these sets. This work led to the further use of data mining techniques to build better models for intrusion detection by analyzing audit data using associations and frequent episodes, and utilizing the resulting rules when constructing classifiers. The majority of the DARPA evaluation systems tested used either tcpdump data alone or tcpdump data along with Basic Security Module (BSM) audit data to detect attacks. (Lippman, 2000) Three of the systems were designed to detect all four categories of attacks (DOS, R2L, U2R, and Reconnaissance). The best system detected about 75% of the attacks in the test data with fewer than two false alarms per day. This RBID system used both tcpdump and BSM data with hand-created attack signatures generated using the training data. It, however, missed many of the new attacks. The next-best system, based on data mining, was able to detect 64% of the attacks with 20 false alarms per day. It used only tcpdump data and created rules learned using pattern classification and data mining with hand-selected features. The detection rate of this system is similar to that of the first rule-based system when it used tcpdump data alone. Under these circumstances, the rule-based system detects about 45% of the attacks with a false alarm rate of 46 false alarms per day. These results suggest that either the rule-based system or the data mining system can provide good performance on previously seen attacks, but neither approach is capable of detecting new attacks with high accuracy.
Many systems provided good detection accuracy for old attacks that were included in the training data, but poor detection accuracy for new attacks that were only in the test data. The two best performing systems detected old attacks with detection accuracies ranging from 63% to 93% at 10 false alarms per day. Performance was much worse for new attacks. Detection accuracy for new attacks is below 25% for the R2L and DOS categories and not significantly different from performance with old attacks for the Reconnaissance and U2R categories. This poor performance for new attacks in the R2L and DOS categories indicates that rules learned from the training data on old attacks do not generalize to the new attacks. Performance with new attacks is not degraded in the Reconnaissance and U2R categories because new attacks in these categories were not very different from old attacks and did not employ as many different mechanisms as were used in the new R2L and DOS attacks. These results demonstrated that the evaluated systems could reliably detect many existing attacks with low false alarm rates so long as examples of these attacks were available for training. All research systems could effectively use training data to improve detection performance and minimize false alarm rates for known attacks. Research systems, however, missed many dangerous new attacks, especially when the attack mechanism or TCP/IP services used differed from the old attack. These results, and the general success of the evaluation procedures, suggested further research in approaches that could detect new attacks with low false alarm rates.

KDD CUP '99

Due to the relative success of the data mining approach during the DARPA Intrusion Detection System Evaluation and the significant challenge of identifying new attacks, the organizational committee for the 1999 Knowledge Discovery and Data Mining (KDD) Conference suggested that an intrusion detection problem augment the existing 1999 KDD learning challenge.
The KDD competition aims at showcasing the best methods for discovering higher-level knowledge from data and closing the gap between research and industry, thereby stimulating further KDD research and development. The KDD '99 Cup competition used a subset of the preprocessed DARPA training and test data supplied by Professors Sal Stolfo and Wenke Lee (Elkan, 1999), the principal researchers for the data mining entry to the DARPA evaluation. The raw training data was about four gigabytes of compressed binary tcpdump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records. Scoring focused on a system's ability to detect novel attacks in the test data that were variants of known attacks labeled in the training data. The KDD '99 training datasets contained a total of 24 training attack types, with an additional 14 attack types in the test data only. Participants were given a list of high-level features that could be used to distinguish normal connections from attacks. A connection is a sequence of TCP packets starting and ending at some well-defined times, between which data flows from a source IP address to a target IP address under some well-defined protocol. Each connection is labeled either as normal or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes. Three sets of features were made available for analysis. First, the "same host" features examine only the connections in the past two seconds that have the same destination host as the current connection, and calculate statistics related to protocol behavior, service, etc. The similar "same service" features examine only the connections in the past two seconds that have the same service as the current connection.
"Same host" and "same service" features are together called time-based traffic features of the connection records. Some probing attacks scan the hosts (or ports) using a much larger time interval than two seconds, for example once per minute. Therefore, additional features can be constructed using a window of 100 connections to the same host instead of a time window. This yields a set of so-called host-based traffic features. Finally, domain knowledge can be used to create features that look for suspicious behavior in the data portions, such as the number of failed login attempts. These features are called "content" features.

Knowledge Discovery and Data Mining (KDD) Processes

Many people treat data mining as a synonym for knowledge discovery, while others view data mining as simply an essential step in the knowledge discovery process. Knowledge discovery is the overall process of finding and interpreting patterns from data that involves the repeated application of several steps. (Han & Kamber, 2001) The first step, selection, involves developing an understanding of the application domain and applying any relevant prior knowledge. For computer security applications, the data analyst must understand the various types of computer attacks and how to dissect a TCP or IP packet. Next, the analyst must select a data set or data sample on which the discovery is to be performed. The DARPA dataset consists of multiple data samples that can be used for the analysis. Data cleaning and preprocessing is performed on the dataset to remove any noise or to determine a strategy to deal with missing data fields. If our intrusion detection problem focused only on an internal threat, the comprehensive dataset could be winnowed down to just internal source IP addresses for analysis. The next step involves data reduction and projection. At this point, the analyst tries to find useful features to represent the data depending on the goal of the task.
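As a concrete illustration of such feature construction, the host-based traffic features described above (computed over a window of the last 100 connections rather than a two-second interval) might be sketched as follows. The class name, feature names, and sample values here are illustrative assumptions, loosely modeled on the KDD '99 feature vocabulary:

```python
from collections import Counter, deque

class HostTrafficFeatures:
    """Sketch of host-based traffic features: statistics computed over a
    window of the last `width` connections rather than a time window, so
    that slow scans (e.g. one probe per minute) remain visible."""
    def __init__(self, width=100):
        self.window = deque(maxlen=width)   # old connections fall off

    def update(self, dst_host, service):
        self.window.append((dst_host, service))

    def features_for(self, dst_host):
        services = [s for h, s in self.window if h == dst_host]
        count = len(services)
        # fraction of those connections using the most common service
        same_srv_rate = max(Counter(services).values()) / count if count else 0.0
        return {"dst_host_count": count,
                "dst_host_same_srv_rate": same_srv_rate}

tracker = HostTrafficFeatures(width=100)
for _ in range(30):
    tracker.update("10.0.0.5", "http")      # hypothetical slow-scan target
tracker.update("10.0.0.9", "telnet")
print(tracker.features_for("10.0.0.5"))     # 30 connections, all http
```

Because the window is counted in connections rather than seconds, a probe arriving once per minute accumulates in the window just as a rapid burst would.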
The analyst can use dimensionality reduction or transformation methods to find invariant representations of the data. The KDD '99 dataset was already preprocessed to aggregate data packets into feature sets based on connection factors. Each feature set can support separate and distinct analyses based on the relevant value of the set to the defined goal. For example, if time connection features have a greater likelihood of describing an attack scenario, the results of the analysis using this feature set should be weighted more heavily than those of the other two sets when determining a final assessment. Decisions made during these data selection, preprocessing, and data transformation steps will all have an impact on further data mining tasks. The two primary high-level goals of data mining, in practice, are prediction and description. Prediction involves using some variables or fields in the dataset to predict unknown or future values of other variables of interest, while description focuses on finding human-interpretable patterns describing the data. In general, for the application of intrusion detection, description is more important than prediction. However, if we were trying to predict the next action of an intruder based on an observed sequence of actions, prediction is obviously more important. Prediction and description are accomplished by the following primary data mining tasks: classification or clustering, summarization, and change detection. Classification is learning a function that maps a data item into one of several predefined classes, while clustering seeks to identify a finite set of categories to describe the data. If the goal of our intrusion detection system is to differentiate an attack from an innocuous event, classification is appropriate. If the system seeks to identify the type of attack or malicious action, clustering is appropriate.
The next task, summarization, involves methods for finding a compact description for a subset of data, and deviation detection focuses on discovering the most significant changes in the data from previously measured or normative values. Detecting deviations from normal behavior is, of course, the entire purpose of an anomaly-based approach to intrusion detection. It may be almost impossible to determine normal activity when we introduce a new system to our network or add a new user to the system. Traditional anomaly detection techniques focus on detecting anomalies in new data after training on normal, or clean, data. Ideally, we would like an effective technique to detect anomalies immediately upon implementation without the need to train using normal data. Recent research in intrusion detection has approached this problem in a variety of ways.

Approaches to Unsupervised Anomaly Detection

George Mason University developed ADAM (Audit Analysis and Mining) as a testbed to study how useful data mining techniques can be in intrusion detection. (Barbara, 2001) The earliest versions of ADAM used a combination of association rules mining and classification to discover attacks in a tcpdump audit trail. This approach required a training phase based on the availability of labeled data, where labels indicate whether the points correspond to normal events or attacks. This type of data, of course, was available for the DARPA dataset, but is not readily available in practice. ADAM has incorporated a new method that starts by segmenting a sample of unlabeled data (using time as the basis of segmentation), finding frequent itemsets for each of the segments, intersecting these itemsets and mapping the resulting set back to the connections. (Barbara, 2003) This base set, which is assumed to be attack-free, is processed using an entropy-based algorithm to detect outliers. Lower entropy values represent a higher likelihood of being an outlier.
Every new point gets evaluated against an abbreviated description of the existing clusters. Although the resulting probability density functions for data sets can show which clusters are most likely to be attack or attack-free clusters, there is currently no way to further analyze the clusters to determine the type of attack represented in the connection data. Stanford Research Institute has recently incorporated Bayes inference techniques into the statistical anomaly detector of EMERALD (Event Monitoring Enabling Responses to Anomalous Live Disturbances). EMERALD's Bayes (eBayes) system encodes a knowledge base in terms of conditional probability relationships rather than rules or signatures. (Valdes & Skinner, 2000) It applies Bayesian inference to TCP sessions based on observed and derived variables at periodic intervals. Historical records are used as its normal training data. eBayes then compares distributions of new data to form a belief network of hypotheses. Given a naïve Bayes model and the set of hypotheses, a conditional probability table is generated for the current set of hypotheses and variables. By adding a dummy state of hypotheses and a new row to the conditional probability table initialized as a uniform distribution, eBayes has the ability to generate new hypotheses dynamically, which helps it detect new attacks. eBayes enables EMERALD to detect distributed attacks in which none of the attack sessions are individually suspicious enough to generate an alert. This correlation capability, however, is extremely computationally expensive. Three additional algorithms have been proposed that together create a geometric framework for unsupervised anomaly detection. (Eskin et al., 2002) The framework maps data into a feature space and determines what points are outliers. Points that are in sparse regions of the feature space are labeled as anomalies.
Although distance-based outliers have been used in other domains, the nature of the outliers here is different. Often in network data, the same intrusion occurs many times, which means that there are many similar instances of the data. However, the number of instances of this intrusion is still significantly smaller than the typical cluster of normal instances. Because intrusion detection is a very complex problem, the framework outlines three different algorithms to detect outliers in the feature space: a cluster-based approach, a k-nearest neighbor based approach and a Support Vector Machine-based approach. The goal of the cluster-based algorithm is to compute how many points are "near" each point in the feature space. A fixed-width clustering algorithm was developed to reduce the computational requirements of a pairwise comparison for all points in the feature space. The first point is the center of the first cluster. For every subsequent point, if it is within distance w (a width defined by the user) of a cluster center, it is added to that cluster. Otherwise it becomes the center of a new cluster. Points may be added to multiple clusters. This algorithm requires only one pass through the dataset. Outliers will belong to clusters with the fewest members. In contrast to the cluster algorithm above, the k-nearest neighbor based algorithm determines whether a point lies in a sparse region of the feature space by computing the sum of the distances to the k nearest neighbors of the point. This algorithm uses a variation of the clustering algorithm where each point is placed in just one cluster. Intuitively, the points in a dense region will have many points near them and will have a smaller k-NN score than points in a sparse region.
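The single-pass, fixed-width clustering step described above can be sketched in a few lines. The point values and the width `w` below are hypothetical; for simplicity this sketch only tracks cluster sizes, since those are what identify outlier candidates:

```python
def fixed_width_clusters(points, w):
    """Single-pass fixed-width clustering: the first point seeds the first
    cluster; each later point joins every cluster whose center lies within
    distance w, otherwise it seeds a new cluster. Clusters with the fewest
    members are outlier candidates."""
    centers, counts = [], []
    for p in points:
        matched = False
        for i, c in enumerate(centers):
            dist = sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5
            if dist <= w:
                counts[i] += 1
                matched = True   # note: a point may join several clusters
        if not matched:
            centers.append(p)
            counts.append(1)
    return centers, counts

pts = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (5.0, 5.0)]   # (5, 5) is isolated
centers, counts = fixed_width_clusters(pts, w=1.0)
print(counts)   # dense cluster of 3 vs. a singleton outlier -> [3, 1]
```

Because each point is compared only against existing cluster centers, not against every other point, the pass avoids the quadratic cost of a full pairwise comparison.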
Even with refinements to the algorithm to cut down on the number of points and clusters required for comparison, computing the k-NN score for each point is computationally expensive and makes it impractical for real-time intrusion detection. The third algorithm uses a variation on Support Vector Machines (SVM) to estimate the region of the feature space where most of the data occurs. The standard SVM algorithm is a supervised learning algorithm that requires labeled training data to create a classification rule and tries to maximally separate two classes of data in the feature space by a hyperplane. The unsupervised SVM algorithm, in contrast, does not require a labeled training set and attempts to separate the entire set of testing data from the origin. This algorithm tries to find a small region where most of the data lies and label points in that region as class +1 (normal). Points in other regions are labeled class –1 (anomalous). The main idea is that the algorithm attempts to find a hyperplane that separates the data points from the origin with maximal margin. The SVM approach was by far the most efficient and the most effective. The SVM approach detected 98% of the KDD Cup attacks with a 10% false detection rate, while the cluster and k-NN approaches detected only around 93% with the same false detection rate. A more recent comparative study among five anomaly detection schemes confirmed that the neural network approach is effective in determining perturbations of normal behavior. (Lazarevic et al., 2003) Most anomaly detection algorithms require a set of purely normal data to train the model, and implicitly assume that anomalies can be treated as patterns not observed before. Since an outlier may be defined as a data point which is very different from the rest of the data, based on some measure, the study employed several outlier detection schemes.
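The one-class SVM idea described above can be sketched with scikit-learn (an anachronistic convenience here; the cited work did not use this library, and the synthetic Gaussian "traffic", `nu`, and kernel choices below are illustrative assumptions):

```python
# Sketch of unsupervised anomaly detection with a one-class SVM:
# the model learns a small region containing most of the data; points
# outside that region are labeled -1 (anomalous), points inside +1.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
normal_traffic = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # bulk of data

model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
model.fit(normal_traffic)            # unsupervised: no labels supplied

print(model.predict([[0.0, 0.0]]))   # deep inside the dense region
print(model.predict([[8.0, 8.0]]))   # far from the data -> [-1]
```

The `nu` parameter bounds the fraction of training points allowed to fall outside the learned region, playing roughly the role of the detection threshold in the statistical schemes discussed earlier.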
The problem with most of these methods, however, is that an action must be clearly marked as either included in or excluded from the "normal" set. A SOM is a neural network that can approximate the distribution of target patterns with a small number of weight vectors by competitive learning. We can use a SOM to classify normal behavior into recognized patterns and thereby characterize anomalous activities relative to these groupings. Shinichi Horikawa outlined an innovative process to derive fuzzy classification rules using the final weight vectors of the SOM after learning and their corresponding membership and support values. These rules are divided further, as necessary, to separate each sample class as much as possible in the pattern space. (Horikawa, 1997) While he applied it effectively to the classic iris classification problem, it is also relevant to the challenges of the computer security domain.

CHAPTER 3
SYSTEM DESIGN AND ARCHITECTURE

Soft computing is an innovative approach to constructing computationally intelligent systems. Complex, real-world problems, such as computer security, require intelligent systems that combine the knowledge, techniques, and methodologies from various sources. Combining several computing techniques, synergistically rather than exclusively, is the best approach for the multifaceted domain of intrusion detection. Neural networks model the brain in a dynamic connectionist structure to mimic brain mechanisms and simulate intelligent behavior. These systems are ideal for pattern recognition and classification problems. Fuzzy set theory can be used to differentiate between entities that are classified as members of two sets simultaneously by using numerical computations and linguistic labels stipulated by the membership functions. Thus, we can synergistically incorporate neural network learning concepts and fuzzy inference systems to create the quintessential neuro-fuzzy intrusion detection system.
There are several schemes for categorizing neuro-fuzzy systems based upon functionality and the degree of interconnectivity. The most comprehensive classification method characterizes a hybrid system into one of three major groups depending on the configuration of the internal modules and the conceptual understanding of the processing required. (McGarry et al., 1999, p 66) The first group, unified hybrid systems, consists of those systems that have all preprocessing activities implemented by the neural network elements. These systems have had a limited impact upon real-world applications due to the complexity of implementation and limited knowledge representation capabilities. The second category, transformational hybrid systems, transforms symbolic representations into neural representations or vice versa. The main processing is performed by neural representations, but there are automatic procedures for transferring neural representations to symbolic representations or vice versa. The third category, modular hybrid systems, covers those hybrid systems that are modular in nature, i.e., they are comprised of several neural network and rule-based modules which can have different degrees of coupling and integration. The vast majority of hybrid systems fall into this category. They are powerful processors of information and are relatively easy to implement. Both transformational and modular systems must be able to convert from an initial neural architecture to a symbolic domain or vice versa. A direct way of converting neural to symbolic knowledge is through rule extraction. This process discovers the fuzzy subspaces and relative positions of the input units to the output units of a neural network and then formulates fuzzy IF...THEN rules based on these positions. The discovery of the subspaces is accomplished by a number of techniques that analyze the weights and biases of the neural network.
Self-organizing Kohonen maps are able to organize a set of multiple input patterns into class subspaces distributed on a two-dimensional neuron configuration (map). Each output subspace corresponds to a fuzzy output variable that can, in turn, be used in the formulation of a fuzzy rule set. Depending on the processing requirements, a module may be either sequential or parallel. Sequential flow implies that one process must be completed before the data may be passed to the next module. One module acts as the preprocessor of data, extracting the required features into a form suitable for the next module. A neural network can act as a preprocessor for a rule-based system by converting raw input features into a form more suitable for symbolic-level decision-making. A parallel architecture, in contrast, has both the neural network and the rule-based modules operating on some common data. Another possibility for parallel operation is one where the neural network and rule-based elements operate on different data but combine their results for an overall classification. For an application in the domain of intrusion detection, we need to determine the characteristics that will represent the input nodes of the Kohonen map. The datasets collected by DARPA have several relevant features that can be used as characteristics for the initial classification problem. Each packet or session element has a start time, stop time and duration. Therefore, we can classify them initially by which four-hour block of time they were collected in and the duration of the session. Service type, requested port, and IP address are additional features that can be used for a complete set of input variables for the neural network. The combination of these features for a particular packet or session will aid in the classification process. Each input neuron is assigned a representative vector element and each input pattern reflects a combination of input signals and weights.
During the learning process, vectors are adapted in accordance with the input signals, i.e., their positions are shifted in the input space in the direction of the input vector. The result is an organized network where similar input patterns are located with a degree of proximity. The distribution density of the input patterns determines the resolution. For example, a large number of neurons are configured on the map for areas from which a large number of input patterns have been presented, and these areas are also represented on the map at a higher resolution than areas with a smaller number of patterns. Resulting clusters that create crisp partitions are unacceptable for an intrusion detection system. Invasive activities can rarely be confined to one or two distinct categories. Most events can only be classified in conjunction with other related activities. It is imperative, therefore, to create a clustering technique without using purely discriminating features. We will combine the classification capabilities of a classic Kohonen neural network with the addition of a fuzzy c-means step. This step differs essentially from the training algorithm of a classic Kohonen network in that a learning step always considers all of the training examples together. In addition to providing the positions of the cluster centers, which is the goal with the classic Kohonen network, the fuzzy c-means step also provides the membership values of the individual objects to the different clusters. This permits the classification of new objects and their degrees of membership. An expert will know how to react to certain sets of vulnerabilities in the resulting map without necessarily having detailed knowledge relating to the input variables or the functions that describe their interactions. It is sufficient that he defines the attributes or terms of the variables, i.e., which value ranges the expert regards as "normal" and which he does not.
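The two-stage idea described above, competitive weight adaptation followed by fuzzy membership computation, can be sketched as follows. This is a deliberately simplified illustration (a 1-D map with no neighborhood function, toy data, and arbitrary learning parameters), not the DataEngine implementation:

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(data, n_nodes=3, epochs=50, lr=0.3):
    """Classic competitive learning: shift the winning weight vector
    toward each input pattern (neighborhood function omitted for brevity)."""
    random.seed(1)
    weights = [list(random.choice(data)) for _ in range(n_nodes)]
    for _ in range(epochs):
        for x in data:
            w = min(weights, key=lambda v: dist2(x, v))   # winning node
            for j in range(len(w)):
                w[j] += lr * (x[j] - w[j])                # shift toward input
    return weights

def fuzzy_memberships(x, centers, m=2.0):
    """Fuzzy c-means style memberships: the pattern belongs to every
    cluster with a degree in [0, 1], and the degrees sum to 1."""
    d = [max(dist2(x, c), 1e-12) ** 0.5 for c in centers]
    return [1.0 / sum((d[i] / d[k]) ** (2 / (m - 1)) for k in range(len(d)))
            for i in range(len(d))]

data = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.1)]
centers = train_som(data)
u = fuzzy_memberships((0.5, 0.5), centers)
print(round(sum(u), 6))   # memberships always sum to 1 -> 1.0
```

The membership vector `u` is what distinguishes this scheme from a crisp Kohonen classification: a borderline connection receives graded membership in several clusters rather than a single hard assignment.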
Different combinations of input variables will be used for the construction of various sets of fuzzy rules. In contrast to conventional expert systems, where value ranges are disjoint, a fuzzy expert system can use value ranges that do, and indeed should, overlap. This means that a concrete value can belong to several terms with varying degrees of membership if appropriate. The parallel modular hybrid intrusion detection system is designed to detect anomalous behavior and report it to various devices so that they can take appropriate action. A bit of preprocessing must be carried out since the raw tcpdump data has a high dimensionality and only a small fraction is needed to create each derivative Kohonen Self-Organizing Map. The preprocessing module reduces the dimensionality of the raw input spectra by selecting a different subset of parameters for each SOM. The transformed data is then passed into various neural network modules, which are designed to classify session data into various types of events. A neural network is required for this task since several events exhibit the same indicators. The output of each neural network module is used to create a rule-based diagnostic module, which provides particulars of the observed events and is also able to provide trend analysis. Finally, a synthesis of these disparate fuzzy expert systems converges to produce a set of required actions for the responsive devices on the network.

Figure 3 Intrusion Detection System Architecture

While the ultimate goal of this research project is to develop a more effective IDS that combines the best features of misuse detection and anomaly detection techniques, the focus will be on the anomaly detection aspect. SNORT is a well-respected misuse detection system that has a database of over 1800 signatures reflecting all known vulnerabilities cited in MITRE's collection of Common Vulnerabilities and Exposures.
The CVE is the de facto standard within the computer security community for classifying and naming all publicly known vulnerabilities and security exposures. The goal of CVE is to make it easier to share data across separate vulnerability databases and security tools. In addition to its powerful misuse detection capabilities, SNORT has a robust packet logging feature. SNORT's preprocessors are able to log each connection from a client to a server upon its completion, noting such features as source and destination IP addresses and ports, time of connection, size of packets, etc. SNORT also has an XML plugin that enables it to log in the Simple Network Markup Language (SNML) format. This plugin can be used with one or more SNORT sensors to log to a central database and create highly configurable intrusion detection infrastructures within a network. Jed Pickel and Roman Danyliw, from the Computer Emergency Response Team Coordination Center, developed this plugin as part of the AIRCERT project. (Roesch & Green, 2002, p 41) Because SNML is still in its early phases of development, it is highly likely to be modified as it undergoes public scrutiny. The Intrusion Detection Working Group has solicited modifications, and this research may identify significant revisions to the existing document definition to ensure a suitable vector configuration for initial processing by the IDS. Components of the IDS will be modeled in DataEngine. DataEngine's technical computing environment provides the features and support required to complete our research project. It provides core mathematics and advanced graphical tools for data analysis, visualization, and application development. DataEngine's SOM module will process the preliminary SOM vectors formed from SNORT's output logs. The SOM's final 2-dimensional representative sample patterns are determined based on the number of classes, the weight vectors and the Euclidean distance from the closest weight vector.
These SOM parameters will determine a series of fuzzy rules which will be used by DataEngine's Fuzzy Rule Base Module. For each weight, wi, the structure of a fuzzy membership function, Aij, is determined according to its distribution range. The grade of certainty is calculated next, based on the generated membership functions. In each subspace of the pattern space defined by the membership functions, the grade of certainty is assigned according to the truth value and class of the sample patterns belonging to it.

Figure 4 Generation of Membership Functions

For each weight vector associated with a node in the competitive layer, there is a set of patterns. Each sample pattern is assigned to the set for which the Euclidean distance is smallest. Any empty sets are deleted. All other sets determine a fuzzy rule as follows:

Ri: If x1 is Ai1 and x2 is Ai2, then x belongs to class 1 with grade μi1 and x belongs to class 2 with grade μi2

Aij is a fuzzy variable for the jth dimensional element, xj, and μic is the grade of certainty when x belongs to class c. Aij is defined by a triangular membership function whose center is determined by the value of the corresponding weight vector and whose width is determined by the distribution of all members of the set. Each rule is normalized so that the total membership function becomes 1. For example, in Figure 4, above, the set R2 overlaps with R1; therefore the grades μ12 and μ22 may be broken down into 0.5 and 0.5, and μ11 and μ21 may be broken down into 0.9 and 0.1. If there is no overlap, as in R3, the set is isolated with a grade of 1.0. An unknown pattern is classified by using the fuzzy rules obtained above. A truth value, Ai(x), is calculated for each fuzzy rule. The product of Ai(x) and the grade of certainty, μic, for each class, c, is calculated for each rule with a truth value greater than 0.
The test sample pattern is assigned to the class for which the above product is the greatest. The test sample in Figure 4 would clearly be classified as class 2. If the test sample does not clearly match any of the specified classes, it will be flagged as an anomaly with a profile associated with the classes closest to it, as indicated by the shortest Euclidean distance. This process, therefore, will not produce hard false positives, but rather alerts with various levels of confidence.

CHAPTER 4
EXPERIMENT SET-UP

The goal of this research is to develop an automated process to identify anomalous behavior in network activity while it is occurring so that immediate action may be taken. The initial experiment, however, will just determine whether the use of multiple self-organizing maps and the resulting fuzzy rules is adequate for the clustering and classification of these anomalies. We will use tcpdump data collected during the DARPA 1998 Evaluation of Intrusion Detection Systems as a baseline. Features will be extracted from the raw data and will be used to create input vectors for three different self-organizing maps. We will follow the pre-processing technique proposed by MINDS (Ertoz, et al., 2003) whereby features are extracted based on content, time and connection. Content features include the number of total packets, acknowledgement packets, data bytes, retransmitted packets, pushed packets, and SYN and FIN packets flowing to or from the source and destination. We will also track the status (i.e., completed, not completed or reset) of each connection. Time-based features include the number of connections and types of services to or from the source and destination within the last 5 seconds. Because this time-based approach will not be able to detect slow and stealthy attacks, we will also extract connection-based features, such as the number of connections made to or from the same source or destination within the last 100 connections.
Each respective feature set will create a representative vector for one of the three SOMs. In other words, one SOM will try to find anomalous behavior based on content, another based on time, and the third based on connection factors. Although self-organizing maps use an unsupervised approach to machine learning, we must find some way to label the neurons on the resulting 2-dimensional map. We can use a label set that has several records of normal behavior and several records for each of the four major types of anomalous behavior required for identification in the DARPA evaluation: Denial of Service (DOS), Remote to Local (R2L), User to Root (U2R), and Reconnaissance attacks. If no records match the anomalies in the label set, no neurons will be labeled. Sparse clusters should have only a few neurons labeled. We assume that these neurons are the most likely to represent anomalous behavior and should be highlighted. Recall that we will use three different SOMs to cluster tcpdump data. Each of these SOMs will represent very different aspects of a TCP connection. The combination of the three results should allow us to better identify true anomalies and significantly reduce false positives. The resulting SOM differs with each set of input. Therefore, we will need to create a process that takes the current SOM and uses the relative positions of labeled neurons to see whether other connections have similar features. These would represent attacks that are similar enough to be placed in close proximity to a known attack but different enough that they are not associated with the same neuron. These neighboring neurons will be included when we set up rules for evaluation and interpretation if we use fuzzy sets, but would be missed if we used crisp sets. Crisp rules attempt to classify members distinctly into one set or another, never into both at the same time.
Fuzzy rules, in contrast, allow a member to have a degree of membership in two different clusters simultaneously. Therefore, we can see a connection that has qualities of both a normal connection and an attack sequence at the same time. If we used only crisp sets for our classification rule base, we would miss threatening connections that are predominantly normal (and therefore placed with the normal connections) or misclassify a new, but perfectly benign, connection that has more in common with a known attack than with our normal processes.

Figure 5 Labeled SOM

The red node in Figure 5 represents a sparse node; therefore we deduce that it is an anomaly. This vector represents a connection that should be classified and brought to the analyst's attention. The SOM will be completely different if it is constructed with a different input set. Therefore, we will simply create arbitrary fuzzy rules using the X1 and Y1 axes for SOM 1. Each subsequent SOM will have similar rules for the nodes represented on the Xn and Yn axes, where n is the SOM identifier. We have decided to use three SOMs for this domain, but other problems may require different feature sets. Additional rules will be created from the other two SOMs. The rules created to find the four classes of DARPA attacks for this SOM would be as follows:

R11: If X1 is High and Y1 is Low then Type 1
R12: If X1 is High and Y1 is High then Type 2
R13: If X1 is Somewhat Low and Y1 is Somewhat Low then Type 3
R14: If X1 is Medium and Y1 is High then Type 4

Inference using the 12 fuzzy rules representing the three different SOMs should allow us to classify anomalies among the four types tested in the DARPA evaluation. We will develop ROC curves for this process and compare them with the published results of other related research.
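A sketch of how rules R11 through R14 might be evaluated, assuming the X1 and Y1 map coordinates are normalized to [0, 1]. The linguistic membership shapes below are illustrative choices ("Somewhat Low" is modeled as the standard dilation hedge on Low); the actual rule base will be generated by the fuzzy tooling, not hand-coded.

```python
def low(v):    return max(0.0, min(1.0, (0.5 - v) / 0.5))
def high(v):   return max(0.0, min(1.0, (v - 0.5) / 0.5))
def medium(v): return max(0.0, 1.0 - abs(v - 0.5) / 0.5)
def somewhat_low(v):
    # "Somewhat" as the standard dilation hedge: square root of Low.
    return low(v) ** 0.5

# Rules mirroring R11-R14: (X1 term, Y1 term, attack type).
rules = [
    (high, low, "Type 1"),
    (high, high, "Type 2"),
    (somewhat_low, somewhat_low, "Type 3"),
    (medium, high, "Type 4"),
]

def classify_position(x1, y1):
    """Fire each rule with min() as the AND operator and return the attack
    type with the highest firing strength (None if nothing fires)."""
    best, strength = None, 0.0
    for fx, fy, attack_type in rules:
        s = min(fx(x1), fy(y1))
        if s > strength:
            best, strength = attack_type, s
    return best, strength
```

Because the terms overlap, a node near the boundary between two regions fires more than one rule with nonzero strength, which is precisely the behavior the crisp rules above would lose.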
Depending on the initial results of this research, we will also explore the use of a sliding window to automate the processing and classification of intrusion detection data for real-time analysis and notification. For example, it may be possible to collect system data, create new SOMs, classify anomalous connections and create alerts every 15 minutes. Additional research is also possible in classifying the four DARPA evaluation attack types at a more granular level. For example, Reconnaissance attacks can be further classified as probes (active attempts to initiate a response from network entities) or passive information gathering (obtaining information without interaction with the information source or destination). We have developed a taxonomy of 24 classes of attacks that fall into the 4 general areas of Denial of Service (DOS), Remote to Local (R2L), User to Root (U2R), and Reconnaissance attacks. Seeding the SOMs with 24 label records and increasing the dimension of each map should make this a fairly easy extension.

CHAPTER 5 METHOD OF VALIDATION

As interest in intrusion detection has grown, the evaluation of intrusion detection systems has also received great attention. Since it is difficult and costly to perform reliable, systematic evaluations of intrusion detection systems, few such evaluations have been performed. One such effort was a combined research effort by Lincoln Laboratory, the Defense Advanced Research Projects Agency (DARPA) and the U.S. Air Force. The aim of the evaluation was to assess the current state of IDSs within the Department of Defense and the U.S. government. Evaluations were performed in both 1998 and 1999. These evaluations attempted to quantify specific performance measures of IDSs and test them against a background of realistic network traffic.
The performance measures used by these evaluations included the ratio of attack detections to false positives, the capability to detect new and stealthy attacks, and the ability to accurately identify attacks. The research also attempted to establish the reason each IDS failed to detect an attack or generated a false positive. The testing process used a sample of generated network traffic, audit logs, system logs and file system information. An identical data set was used for all systems evaluated. Three weeks of training data, composed of two weeks of background traffic with no attacks and one week of data with a few attacks, were provided to participants to support tuning and training. Locations of attacks in the training data were clearly labeled. Participants then used two weeks of unlabeled test data, which included 200 instances of 58 different attacks, and were asked to provide a detailed list of hits or alerts using the output of their intrusion detection systems. At a minimum, the results were required to include the date, time, victim IP address and a score for each putative attack detected. An alert could also optionally indicate the attack category. Putative detections were counted as correct if the time of the alert occurred within 60 seconds of the actual time of the event and referred to the correct victim IP address. The score produced by a system was required to be a number that increased as the certainty of an attack at the specified time increased. Although all participants returned numbers ranging between zero and one, many produced binary (0 or 1) scores only. An initial analysis was performed to determine how well all systems taken together detected attacks, regardless of false alarm rates. Thirty-seven of the fifty-eight attack types were detected well, but many stealthy and new attacks were always or frequently missed.
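The matching rule described above, where an alert counts as correct only when it falls within 60 seconds of a real attack against the same victim IP, can be sketched as follows. The timestamps and addresses are invented for the example, and real scoring additionally handles confidence scores and attack categories.

```python
def score_alerts(alerts, ground_truth, tolerance=60.0):
    """Count a putative detection as a hit when its time falls within
    `tolerance` seconds of a true attack on the same victim IP; everything
    else is a false alarm. Each record is a (timestamp_seconds, victim_ip)
    pair. Returns (hits, false_alarms)."""
    hits, false_alarms = 0, 0
    for a_time, a_ip in alerts:
        if any(abs(a_time - t) <= tolerance and a_ip == ip
               for t, ip in ground_truth):
            hits += 1
        else:
            false_alarms += 1
    return hits, false_alarms

truth = [(1000.0, "172.16.0.5"), (5000.0, "172.16.0.9")]
alerts = [(1030.0, "172.16.0.5"),   # within 60 s, right victim: hit
          (5030.0, "172.16.0.1"),   # right time, wrong victim: false alarm
          (7000.0, "172.16.0.9")]   # right victim, far too late: false alarm
```

Sweeping a detection threshold over per-alert scores and re-running this tally at each threshold is what produces the ROC curves mentioned earlier.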
Attacks were detected best when they produced a consistent "signature," or sequence of events in the data, that differed from the sequences produced by normal traffic (Lippman et al., 1999). Systems that relied on rules or signatures missed new attacks because signatures did not exist for these attacks, or because existing signatures did not generalize to variants of old attacks or to new or stealthy attacks. This research will use the same test data and published procedures as the DARPA 1998 and 1999 IDS Evaluations. Scores for the neuro-fuzzy IDS with SNORT augmentation will be evaluated and compared against the baseline results of the DARPA study. Detecting unknown attacks is the most important and most challenging aspect of any IDS. An additional capability to classify these unknown attacks into one of several categories and assign a relative importance to each attack will allow an analyst to prioritize and investigate alerts according to their corresponding operational impact or potential effect.

CHAPTER 6 IMPORTANCE OF THE RESEARCH

During a panel discussion at the 1998 Recent Advances in Intrusion Detection Conference, Marc Wilkens identified three main issues applicable to deploying intrusion detection systems that can be addressed by the proposed neuro-fuzzy approach (Frincke et al., 1998). First, he stated that an IDS must be able to adapt to the environment and detect constantly evolving attack patterns. Using XML tags to describe the structure of an anomalous connection can simplify the identification of an attack, or a variation of a known attack, with the aid of an existing self-organizing map. Each collection of XML tags correlates to a vector of characteristics pertaining to the attack profile. When executing a self-organizing map, an attack profile will immediately gravitate to those areas that are most similar.
In addition to adaptability, an IDS must be able to integrate and interoperate with multiple intrusion detection techniques and architectures (such as anomaly detection, misuse detection, host-based systems and network-based systems) in order to provide real business solutions. Numerous intrusion detection systems are available in the market, and different sites will no doubt select different vendors. Since incidents are often distributed over multiple sites, it is highly likely that different aspects of a single incident will be visible to different systems. Thus it would be advantageous for diverse intrusion detection systems to be able to share data on attacks in progress. In addition, an IDS should also be able to integrate with other tools that are already in use in the network. To meet this goal, IDSs can utilize the SNML data formats and exchange procedures outlined by the Intrusion Detection Working Group (IDWG, 2002) for sharing information of interest to intrusion detection and response systems, and to management systems that may need to interact with them. Finally, this need to share information between and among components highlights the importance of having standards for the characterization, storage and exchange of data about attacks, intrusions, vulnerabilities and evidence. The standard XML template can be used by any computer security mechanism that can create a formatted alert or system log entry. The CERT created this language to exchange information about suspicious events. The current DTD and SNML Exchange Requirements (IDWG, 2004) describe constraints and limitations that apply to the construction and transportation of Intrusion Detection Message Exchange Formats. The final document will also address semantics and context relating to these messages.

CHAPTER 7 SCHEDULE AND RECOMMENDATIONS

While many of the tools needed for this project are readily available, they are complicated and require significant familiarization with MATLAB.
To make this project less daunting, it will be broken down into successive phases that will facilitate the construction of the final IDS while examining the capabilities of each tool separately. The first step will be to create and parse an XML-tagged CVE dataset and use the resulting vectors to create a SOM. New CVE entries will be categorized according to the class to which the sample vector is most similar. The next phase will be a similar process using an XML-tagged SNORT output log to create a SOM representing normal network traffic. The biggest challenge during this phase is to create a suitable vector to meet the needs of anomaly detection. The final phase will incorporate the fuzzy logic capabilities and thereby create the final product. The 1998 and 1999 DARPA Evaluation test sets will be used for both the second and third phases.

REFERENCES

Barbara, D. (2001). ADAM: Detecting Intrusions by Data Mining, Proceedings of the 2001 IEEE Workshop on Information Assurance and Security (pp. 11-16). West Point, NY, Jun 6-7.

Barbara, D. (2003). Bootstrapping a Data Mining Intrusion Detection System, Proceedings of the 2003 ACM Symposium on Applied Computing (SAC). Melbourne, FL, Mar 9-12.

Bridges, S. & Vaughn, R. (2000). Fuzzy Data Mining and Genetic Algorithms Applied to Intrusion Detection, Proceedings of the 23rd National Information Systems Security Conference (NISSC). Baltimore, Maryland, Oct 16-19.

Caswell, B., Beale, J., Foster, J., & Faircloth, J. (2003). Snort 2.0. Rockland, MA: Syngress.

DARPA Intrusion Detection Evaluation (2001). Data Sets Overview. Downloaded from http://www.ll.mit.edu/IST/ideval/data/data_index.html.

Denning, D. (1987, Feb). An Intrusion-Detection Model. IEEE Transactions on Software Engineering, 13 (2), 222-232.

Dittenbach, M., Rauber, A., & Merkl, D. (2002, Oct). Uncovering the Hierarchical Structure in Data Using the Growing Hierarchical Self-Organizing Map. Neurocomputing, 48 (1-4), 199-216.

Elkan, C. (1999, Sep).
Results of the KDD-99 Classifier Learning Contest. Downloaded from http://www.cs.ucsd.edu/users/elkan/clresults.html.

Ertoz, L., Eilertson, E., Lazarevic, A., Tan, P., Srivastava, J., Kumar, V., & Dokas, P. (2004). The MINDS – Minnesota Intrusion Detection System, accepted for the book Next Generation Data Mining. Downloaded from http://www.cs.umn.edu/research/minds/MINDS_papers.htm.

Eskin, E., Arnold, A., Prerau, M., Portnoy, L. & Stolfo, S. (2002). A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data. In D. Barbara & S. Jajodia (Eds.), Data Mining for Security Applications. Boston: Kluwer.

Frincke, D., Levitt, K., Miqueu, M., Quisquater, J., Wilkens, M., & Ziese, K. (1998). Recent Advances in Intrusion Detection, Panel: Intrusion Detection in the Large. Louvain-la-Neuve, Belgium, Sep 16.

Han, J. & Kamber, M. (2001). Data Mining. San Francisco: Morgan Kaufmann.

Haykin, S. (1999). Neural Networks. New Jersey: Prentice Hall.

Horikawa, S. (1997). Fuzzy Classification System Using Self-Organizing Feature Map. OKI Technical Review, No 159, Vol 63.

Ilgun, K. (1993). USTAT: A Real-Time Intrusion Detection System for UNIX, Proceedings of the 1993 Computer Security Symposium on Research in Security and Privacy, May 24-26, 1993, Los Alamitos, CA: IEEE Computer Society Press.

Intrusion Detection Working Group (2004). Intrusion Detection Exchange Format Charter. Downloaded from http://www.ietf.org/html.charters/idwg-charter.html.

Intrusion Detection Working Group (2002). Intrusion Detection Exchange Format Requirements. Downloaded from http://www.ietf.org/internet-drafts/draft-ietf-idwg-requirements-10.txt.

Kendall, K. (1999). A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems. Master's thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, Cambridge, MA, June 1999.

Lazarevic, A., Ertoz, L., Kumar, V., Ozgur, A., & Srivastava, J. (2003).
A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection. Proceedings of the SIAM International Conference on Data Mining, May 1-3, 2003. San Francisco, CA.

Lippman, R. et al. (2000). Evaluating Intrusion Detection Systems: The DARPA Off-Line Intrusion Detection Evaluation, Proceedings of the DARPA Information Survivability Conference and Exposition, Jan 25-27, 2000. Los Alamitos, CA: IEEE Computer Society Press.

Lunt, T. (1993, Jun). A Survey of Intrusion Detection Techniques. Computers and Security, 12 (4), 405-418.

Martin, R. (2001, Nov). Managing Vulnerabilities in Networked Systems, IEEE Computer Society Magazine, 34 (11), 32-38.

McGarry, S., Wermter, S., & MacIntyre, J. (1999). Hybrid Neural Systems: From Simple Coupling to Fully Integrated Neural Networks, Neural Computing Surveys, 2.

Mostow, J., Roberts, J., & Bott, J. Integration of an Internet Attack Simulator in an HLA Environment, Proceedings of the 2000 IEEE Workshop on Information Assurance and Security (pp. 162-168). West Point, NY, Jun 6-7.

MITRE. (2002). Common Vulnerabilities and Exposures: The Key to Information Sharing. Downloaded from http://cve.mitre.org/docs/.

Nguyen, B. (2002). Self-Organizing Map and Genetic Algorithm for Intrusion Detection System, Final Project for CS680: Advanced Topics in Artificial Intelligence, Spring 2002. Downloaded from http://132.235.28.162/bnguyen/papers/IDS_SOM.pdf.

Rhodes, B., Mahaffey, J., & Cannady, J. (2000). Multiple Self-Organizing Maps for Intrusion Detection, Proceedings of the 23rd National Information Systems Security Conference (NISSC) (pp. ?? - ??). Baltimore, Maryland, Oct 16-19.

Roesch, M. & Green, C. (2002). SNORT Users Manual. Downloaded from http://www.snort.org/docs/.

Sundaram, A. (1996). An Introduction to Intrusion Detection. Downloaded from http://www.acm.org/crossroads/xrds2-4/xrds2-4.html.

Valdes, A. & Skinner, A. (2000, Oct). Adaptive, Model-based Monitoring for Cyber Attack Detection. In H. Debar, L. Me & F.
Wu (Eds.), Recent Advances in Intrusion Detection (RAID). Toulouse, France: Springer-Verlag, 80-92.

Zadeh, L. A. (1994, March). Fuzzy Logic, Neural Networks, and Soft Computing, Communications of the ACM, 37 (3), 77.