APPLYING NEURO-FUZZY TECHNIQUES
TO INTRUSION DETECTION
By
LORI LEILANI DELOOZE
B.A., University of Colorado, 1985
M.B.A., George Washington University, 1989
M.S., Naval Postgraduate School, 1991
A dissertation proposal submitted to the Graduate Faculty of the
University of Colorado at Colorado Springs in partial fulfillment of
the requirements for the degree of Doctor of Philosophy
Department of Computer Science
2004
This dissertation proposal for the Doctor of Philosophy degree by
Lori Leilani DeLooze
has been approved for the
Department of Computer Science
by
__________________________________________________
Jugal K. Kalita, Chair
__________________________________________________
Marijke F. Augusteijn
__________________________________________________
C. Edward Chow
__________________________________________________
Xiaobo Zhou
__________________________________________________
Robert Carlson
__________________________________________________
William Ayen
______________________
Date
CONTENTS
CHAPTER 1  INTRODUCTION
  Neuro-Fuzzy
CHAPTER 2  COMPUTER SECURITY DOMAIN
  Common Vulnerabilities and Exposures List
  DARPA Data Sets
  Anomaly Detection Schemes
  KDD CUP '99
  Knowledge Discovery and Data Mining (KDD) Processes
CHAPTER 3  SYSTEM DESIGN AND ARCHITECTURE
CHAPTER 4  EXPERIMENT SET-UP
CHAPTER 5  METHOD OF VALIDATION
CHAPTER 6  IMPORTANCE OF THE RESEARCH
REFERENCES
CHAPTER 1
INTRODUCTION
As computer technology evolves and the threat of computer crimes increases, the
apprehension and preemption of such violations become more and more difficult and
challenging. Most system security mechanisms are designed to prevent unauthorized
access to system resources and data. To date, it appears that completely preventing
breaches of security is unrealistic. Therefore, we must try to detect these intrusions as
they occur so that actions may be taken to repair the damage and prevent further harm.
Over the years, intrusion detection has become a major area of research in computer
science and various innovative methods have been applied to these systems.
In the last five years, the information revolution has finally come of age. More than
ever before, we see that the Internet has changed our lives.
The possibilities and
opportunities are limitless; unfortunately, so too are the risks and likelihood of malicious
intrusions.
Intruders can be classified into two categories: outsiders and insiders.
Outsiders are intruders who approach your system from outside of your network and who
may attack your external presence (e.g., deface web servers, forward spam through e-mail
servers, etc.). They may also attempt to go around the firewall and attack machines on the
internal network. Insiders, in contrast, are legitimate users of your internal network who
misuse privileges, impersonate higher privileged users, or use proprietary information to
gain access from external sources.
Intrusion Detection Systems (IDS) are designed to monitor network traffic to
determine if an intrusion has occurred. The two basic methods of detection are signature
based and anomaly based (Denning, 1987). The signature based method, also known as
misuse detection, looks for a specific signature to match, signaling an intrusion. Network
traffic is scanned as it passes by for specific features that might indicate an attack or an
intrusion. This means that these systems are not unlike virus detection systems -- they
can detect many or all known attack patterns, but they are of little use for as yet unknown
attack methods. Most popular intrusion detection systems fall into this category. A
misuse detection IDS uses a database of traffic or activity patterns related to known
attacks to identify and categorize malicious activity on the network. SNORT (Caswell et
al., 2003), a popular, open-source IDS, has a wide range of downloadable signature files
that can be selected to conform to specific system requirements.
Another approach to intrusion detection is called anomaly detection. Anomaly based
systems basically attempt to map events to the point where they “learn” what is normal
and then detect an anomaly that might indicate an intrusion.
Anomaly detection
techniques assume that all intrusive activities are necessarily anomalous. This means that
if we could establish a normal activity profile for a system, we could, in theory, flag all
system states that vary from the established profile by statistically significant amounts as
intrusion attempts. The main issue in anomaly detection systems thus becomes the
selection of threshold levels so that the system neither flags anomalous activities that are
non-intrusive nor fails to flag intrusive activities that are not anomalous.
Anomaly
detection systems are computationally expensive because of the overhead of keeping
track of, and possibly updating, several system profile metrics.
Neuro-Fuzzy
The best approach for an Intrusion Detection System may be to combine the
advantages of both the anomaly detection and misuse detection components into a single
compound scheme that can also accommodate the imprecision inherent in the domain of
network security.
An effective IDS must use more than standard mathematical
techniques -- conventional analysis methods can be combined with soft computing
techniques to synergistically create a more robust system. Soft computing differs from
conventional (hard) computing in that, unlike hard computing, it is tolerant of
imprecision, uncertainty, partial truth, and approximation. (Zadeh, 1994)
Figure 1 Computing Components for Intrusion Detection Systems (figure: Neural Networks provide functional approximation and Fuzzy Logic provides approximate reasoning; together they form the Neuro-Fuzzy IDS)
The principal constituents of Soft Computing are Fuzzy Logic, Neural Computing,
and Evolutionary Computation. While the latter may have some contribution to the field
of Intrusion Detection Systems, this research will only explore the applicability of the
former two (see Figure 1). It is important to note that soft computing is not a mélange.
Rather, it is a partnership in which each of the components contributes a distinct
methodology for addressing problems in its domain. In this perspective, the principal
constituent methodologies in soft computing are complementary rather than competitive.
The complementarity of Fuzzy Logic and Neural Computing has an important
consequence: in many cases a problem can be solved most effectively by using them in
combination rather than using each technique individually. This particularly effective
combination is what has come to be known as "Neuro-Fuzzy systems." Neuro-Fuzzy
systems synergistically combine the functional approximation intrinsic to neural
networks and the approximate reasoning capability of Fuzzy Logic to learn to
adapt to unknown or changing environments and deal with imprecision and uncertainty.
A neuro-fuzzy approach may be able to mitigate the deficiencies of the anomaly
detection elements of current intrusion detection systems.
Neural Networks. Work on artificial neural networks has been motivated from its
inception by the recognition that the human brain computes in an entirely different way
from the conventional digital computer. (Haykin, 1999, p. 1) The brain has the capability
to organize its structural constituents, known as neurons, so as to perform certain
computations (e.g. pattern recognition, perception, and motor control) many times faster
than the fastest digital computer. To achieve good performance, neural networks employ
massive interconnections of neurons. The resulting Neural Network acquires knowledge
of the environment through a process of learning, which systematically changes the
interconnection strengths, or synaptic weights, of the network in an orderly fashion to
attain a desired design objective.
McCulloch and Pitts proposed the first model of a formal neuron in their landmark
1943 paper (cited in Haykin, 1999) that described neurons as “threshold-logic switches”,
i.e. elements that form weighted sums of signal values and elicit an active response when
the sum exceeds a specific threshold value, θ. Figure 2 gives a graphical representation
of a primitive neuron, called a Threshold Logic Unit (TLU), with n real-valued inputs, xi,
which are associated with parameter wi. The parameter, wi, is also known as a "synaptic
weight" and represents the functional connectivity between two cells. The TLU performs
a weighted sum operation followed by a non-linear thresholding operation, or step
function, such that if the value of the sum is greater than some threshold value, the output
y of the unit is 1, otherwise it is 0. In other words, the neuron will “fire”, or emit an
instantaneous “1” signal if the threshold is exceeded; otherwise, it will do nothing.
Figure 2 Primitive Neuron described by McCulloch and Pitts
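For illustration, a minimal Python sketch of the TLU behavior described above follows; the function name, weights, and threshold are illustrative only.

    def tlu(inputs, weights, theta):
        """Threshold Logic Unit: output 1 if the weighted sum of the inputs
        exceeds the threshold theta, otherwise output 0."""
        weighted_sum = sum(w * x for w, x in zip(weights, inputs))
        return 1 if weighted_sum > theta else 0

    # Example: with these weights and threshold the unit behaves like an AND gate.
    print(tlu([1, 1], [0.6, 0.6], theta=1.0))  # 1 -- the sum 1.2 exceeds the threshold
    print(tlu([1, 0], [0.6, 0.6], theta=1.0))  # 0 -- the sum 0.6 does not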
Artificial neurons in isolation are not very impressive. Large assemblages, however,
can give rise to computationally complex capabilities. An Artificial Neural Network
(ANN) is defined as a structure composed of a number of interconnected artificial
neurons with characteristic input, output and computational features. The manner in
which the neurons of a neural network are structured is intimately linked to the learning
algorithm used to train the network. An unsupervised learning system evolves to extract
features or regularities in presented patterns, without being told what outputs or classes
associated with the input patterns are desired. In other words, the learning system detects
or categorizes persistent features without any feedback from the environment.
Unsupervised learning is frequently employed on competitive neural networks to aid in
data clustering, feature extraction, and similarity detection.
The self-organizing map (Dittenbach et al., 2002) is one of the most prominent
artificial neural network models adhering to the unsupervised learning paradigm. The
model consists of a number of neural processing elements, i.e. units. Each of the units i is
assigned an n-dimensional weight vector mi. It is important to note that the weight vectors
have the same dimensionality as the input patterns. The training process of
self-organizing maps may be described in terms of input pattern presentation and weight
vector adaptation. Each training iteration starts with the random selection of one input
pattern. This input pattern is presented to the self-organizing map and each unit
determines its activation. Usually, the Euclidean distance between weight vector and
input pattern is used to calculate a unit's activation. The unit with the lowest activation is
referred to as the winner of the training iteration. Finally, the weight vectors of the
winner as well as the weight vectors of selected units in the vicinity of the winner are
adapted. This adaptation is implemented as a gradual reduction of the component-wise
difference between input pattern and weight vector.
Geometrically speaking, the weight vectors of the adapted units are moved a bit
towards the input pattern. The amount of weight vector movement is guided by a learning
rate decreasing in time. The number of units that are affected by adaptation is determined
by a so-called neighborhood function. This number of units also decreases in time. This
movement has as a consequence that the Euclidean distance between those vectors
decreases and thus the weight vectors become more similar to the input pattern. The
respective unit is more likely to win at future presentations of this input pattern. The
consequence of adapting not only the winner alone but also a number of units in the
neighborhood of the winner leads to a spatial clustering of similar input patterns in
neighboring parts of the self-organizing map. Thus, similarities between input patterns
that are present in the n-dimensional input space are mirrored within the two-dimensional
output space of the self-organizing map. The training process of the self-organizing map
describes a topology preserving mapping from a high-dimensional input space onto a
two-dimensional output space where patterns that are similar in terms of the input space
are mapped to geographically close locations in the output space.
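A minimal Python sketch of one such training iteration is given below, assuming a square map and a Gaussian neighborhood function; the array layout, parameter values, and names are illustrative rather than those of any particular SOM implementation.

    import numpy as np

    def som_train_step(weights, coords, x, learning_rate, radius):
        """One training iteration of a self-organizing map.
        weights : (units, n) array of weight vectors m_i
        coords  : (units, 2) array of unit positions on the two-dimensional map
        x       : (n,) input pattern selected at random from the training set
        """
        # Activation: Euclidean distance between each weight vector and the input.
        activations = np.linalg.norm(weights - x, axis=1)
        winner = np.argmin(activations)           # the unit with the lowest activation wins
        # Neighborhood function: units close to the winner on the map are adapted most.
        map_dist = np.linalg.norm(coords - coords[winner], axis=1)
        influence = np.exp(-(map_dist ** 2) / (2.0 * radius ** 2))
        # Move the weight vectors of the winner and its neighbors toward the input pattern.
        weights += learning_rate * influence[:, None] * (x - weights)
        return weights

    # Usage sketch: both the learning rate and the neighborhood radius decrease over time.
    rng = np.random.default_rng(0)
    weights = rng.random((10 * 10, 4))            # a 10x10 map trained on 4-dimensional inputs
    coords = np.array([[i, j] for i in range(10) for j in range(10)], dtype=float)
    for t, x in enumerate(rng.random((500, 4))):
        weights = som_train_step(weights, coords, x, 0.5 * 0.99 ** t, 3.0 * 0.99 ** t)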
Applying Self-Organizing Maps to Intrusion Detection is not a new concept.
Brandon Rhodes and his fellow researchers conducted an experiment that used 30 Domain
Name Service request packets to form a Self-Organizing Map and ten additional packets
to test the capabilities of the Map to classify anomalous behavior related to two different
buffer overflow intrusions (Rhodes et al, 2000). Binh Viet Nguyen (Nguyen, 2002)
conducted additional research on the application of self-organizing maps to intrusion
detection as part of an Ohio University Artificial Intelligence class final project.
Although both were able to prove the effectiveness of the concept, they highlighted the
need to address a shifting window of normal behavior. As time passes, the patterns of
usage will change. Therefore, old usage patterns will no longer accurately reflect the
normal behaviors. New Self-Organizing Maps need to be created over time to reflect
new usage patterns. These successive SOMs can also be used as a basis for further
processing as part of a composite neuro-fuzzy system.
Fuzzy Sets. Fuzzy set theory provides a mathematical framework for representing
and treating uncertainty, imprecision, and approximate reasoning. Unlike classical set
theory where membership is all or nothing, fuzzy set theory allows for partial
membership. That is, an element x has a membership function, μA(x), that represents the
degree to which x belongs to the set A. Other features further define the membership
function. For a given fuzzy set A, the core is the (conventional) set
of all elements x ∈ U such that μA(x) = 1. The support of the set is all x ∈ U such that
μA(x) > 0.
Fuzzy set operations such as union, intersection, and complement are similar to those
of ordinary set operations.
The union of two fuzzy sets A and B is a fuzzy set C, where C = A ∪ B and whose
membership functions are related by the following equation:
μC(x) = max(μA(x), μB(x)) = μA(x) ∨ μB(x)
The intersection of two fuzzy sets A and B is a fuzzy set C, where C = A ∩ B and whose
membership functions are related by the following equation:
μC(x) = min(μA(x), μB(x)) = μA(x) ∧ μB(x)
The complement of a fuzzy set A is a fuzzy set whose membership function
is related to that of A by the following equation:
μnotA(x) = 1 − μA(x)
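The following short Python sketch illustrates these operations numerically; the membership values are arbitrary examples.

    # Membership values of a few elements in fuzzy sets A and B (illustrative values).
    mu_A = {"x1": 0.2, "x2": 0.7, "x3": 1.0}
    mu_B = {"x1": 0.5, "x2": 0.4, "x3": 0.0}

    union        = {x: max(mu_A[x], mu_B[x]) for x in mu_A}   # union: max of memberships
    intersection = {x: min(mu_A[x], mu_B[x]) for x in mu_A}   # intersection: min of memberships
    complement_A = {x: 1.0 - mu_A[x] for x in mu_A}           # complement: 1 - membership

    core_A    = [x for x, m in mu_A.items() if m == 1.0]      # core: membership equal to 1
    support_A = [x for x, m in mu_A.items() if m > 0.0]       # support: membership greater than 0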
Ironically, Susan Bridges introduced the use of fuzzy sets and fuzzy data mining
techniques for IDSs at the same National Information Systems Security Conference in
Baltimore, MD that introduced the use of Self-Organizing Maps for IDSs (Bridges &
Vaughn, 2000). Although crisp rule-based expert systems have served as the basis for
several intrusion detection systems, fuzzy rules are better suited to the security domain.
She proposed a system that, when presented with a set of audit data, is able to mine a set
of fuzzy association rules from the data.
These rules are considered a high level
description of patterns of behavior found in the data. This new research simply uses a
significantly different means to mine the fuzzy association rules.
CHAPTER 2
COMPUTER SECURITY DOMAIN
Rapid adoption of the Internet has led to significant computer security challenges.
In its original form, the Internet was a survivable network conceived by the Department
of Defense designed for sharing necessary information among government and academic
research centers. Because there was a certain level of trust among the nodes of the
network, there was no embedded security. As commercial entities and consumers began
to flood the network, these research centers began to search for additional protection
mechanisms, such as Intrusion Detection Systems, to detect malicious activity and protect
their systems against potential damage. Network Intrusion Detection Systems monitor
behavior by examining the content and format of data packets. Many IDSs take a data
mining approach by “mining” through the network-based data to detect possible attacks
for either internal or external intruders. Many network intrusion detection methods have
been proposed in the research community over the past several years. Most of them
approach intrusion detection as a specialized classification problem. Typical
classification problems are built using training examples from all classes so that the
system can classify each pattern into one of the training classes with as low
generalization error as possible.
Common Vulnerabilities and Exposures List
One attempt to aid in the classification and sharing of attack information is
MITRE’s collection of Common Vulnerabilities and Exposures (CVE). The CVE
(MITRE, 2002) is a list or dictionary that provides a common name for publicly known
security weaknesses. Using a common name makes it easier to correlate data across
multiple databases and integrate disparate security tools. If a security tool incorporates a
CVE name into the classification of a security event, then it is much easier to identify the
related remedy necessary to fix the problem.
Three tenets underlie the CVE initiative (Martin, 2001). First, each vulnerability
or exposure should have only one name and standardized description. Second, the CVE
list should exist as a dictionary rather than as a database. Third, the CVE can only exist
with industry endorsement and the integration of compatible products and services. This
doctrine can only be effective if it is accepted throughout the computer security
community. The list of entries is the result of an open and collaborative effort of the
CVE editorial board, which is made up of numerous security-related organizations
including security tool vendors, academic and research institutions and government
agencies.
As of mid-November 2003, the CVE web site (www.cve.mitre.org) contained
6,347 unique information security CVE entries. Of these, 2,572 were approved CVE
entries on the official CVE list and 3,775 were candidate entries pending board approval.
Candidate CVE entries may reflect breaking and newly-discovered vulnerability
information that may be of special or immediate concern to the public. Both types of
entries are useful in helping organizations with the job of managing information systems
security. Each entry in the dictionary includes a unique CVE identification number, a
text description of the vulnerability and any pertinent references. References may
include the initial announcement and the source, related vendor documents and other
technical observations or public notices.
Various security-related products, services and repositories use CVE names to let
users cross reference information with other repositories, tools and services. Integrating
vulnerability services, databases, web-sites and tools that incorporate CVE names will
provide more complete security coverage. By using a CVE-compatible intrusion
detection system, an attack report that includes a reference to the related CVE entry can
be used to contact the vendor(s) web site to identify the location of a CVE-compatible fix
or procedure. CVE’s adoption and support within the commercial and academic
communities should also facilitate a more systematic and predictable handling of security
incidents. Even the Federal Bureau of Investigation's annual list of the top 20 Internet
Security Threats includes related CVE names.
The CVE dictionary, however, is not a taxonomy. The CVE list is organized in
simple numerical order by date of acceptance. The computer security community needs a
taxonomy to aid in the classification of and correlation between exploitations. The Internet
Attack Taxonomy (Mostow et al., 2002) developed by Atlantic Consulting Services,
under contract to US Army Communications and Electronics Command, was developed
to support an Information Warfare Simulation tool. This library of attacks can be used as
a classification system for all publicly known exploits and exposures contained in the
CVE listing. According to the proposed taxonomy, all known exploits can fit into 25
“buckets” in four general categories: Reconnaissance, Denial of Service, Unauthorized
Access, and Deception.
The Internet Attack Taxonomy can provide a valuable stepping-stone for
developing a classification scheme for all CVE entries and therefore provide a means for
classifying all information system vulnerabilities and exposures. This taxonomy can be
easily integrated with an intrusion detection scheme if each pattern used in the training
set for the classification of attacks can be correlated with a specific CVE entry.
Therefore a similar attack pattern should be related to the same CVE entry even if it is a
novel attack. This should enable security managers to begin the recovery process
immediately without specific knowledge of the details of the attack technique. In
addition, this relationship between the CVE entries and training attack scenarios will
provide the means to develop a CVE-compatible IDS device.
DARPA Data Sets
In 1998, the Defense Advanced Research Projects Agency (DARPA) intrusion
detection evaluation created the first standard corpus for evaluating intrusion detection
systems. The 1998 off-line intrusion detection evaluation was the first in a planned series
of annual evaluations conducted by the Massachusetts Institute of Technology (MIT)
Lincoln Laboratories under DARPA sponsorship. The corpus was designed to evaluate
both false alarm rates and detection rates of intrusion detection systems using many types
of both known and new attacks embedded in a large amount of normal background traffic
(Kendall, 1999, p 2). Over 300 attacks were included in the 9 weeks of data collected for
the evaluation. These 300 attacks were drawn from 32 different attack types and 7
different attack scenarios.
The attacks used in the evaluation were developed to provide a reasonable
amount of variance in attack methods. Some attacks occurred in a single session with all
actions occurring in the clear, while others were spread out over several different sessions
and clearly employed methods to evade detection. The attack scenarios also included
diversity in the intent of the exploitation. Some attacks were just for fun while others
were for the express purpose of collecting confidential information or causing damage.
The corpus was collected from a simulation network (Kendall, 1999, p 22) that
was used to automatically generate realistic traffic – including the attacks cited above.
The simulation network consisted of two Ethernet network segments connected by a
router. The “outside” network consisted of a traffic generator (used for both background
traffic and automated attacks), a web server, a sniffer, and two workstations for ad-hoc
attack generation. The “inside” network, which simulated the fictitious “eyrie.af.mil”
domain, consisted of a background traffic generator, a sniffer, and four UNIX victim
workstations. Modifications to the operating systems of the background traffic
generators and web servers enabled them to simulate the actions of several hundred
“virtual” machines.
Training data was labeled with attacks and provided to participants to train and
tune their intrusion detection systems (DARPA, 2001). Unlabeled test data was later
provided for blind evaluation. List files were used to label attacks in the training data.
These files contain entries for every important TCP network connection and relevant
ICMP and UDP packet. Each line begins with a unique identification number, the start
date and time for the first byte in the connection or packet, the duration until the final
byte was transmitted, the service name, and source and destination ports and IP
addresses. The service name contains either the common port name for TCP and UDP
connections or the packet type for ICMP packets.
Further attack information is included in each connection/packet involved in an
attack to aid in classification processes. The attack score is 0 and the name is "-" for
connections that are not part of an attack. In the training data, the attack score is set to 1
and the name is a text string that labels all connections associated with attacks (Table 1).
In the test data, attacks are not labeled. Instead, all attack scores are 0 and all attack names
are "-". Participants are expected to note the list file entries corresponding to attacks detected
by the IDS. Attack names are expected to contain either the name of an old attack or
a more generic attack category name for new attacks.
Line # | Start Date | Start Time | Duration | Service | Src Port | Dest Port | Src IP       | Dest IP      | Attack Score | Name
69     | 01/23/1998 | 16:58:55   | 00:00:04 | finger  | 1847     | 79        | 192.168.1.30 | 192.168.0.20 | 0            | -
73     | 01/23/1998 | 16:58:58   | 00:00:18 | ftp     | 1850     | 21        | 192.168.1.30 | 192.168.0.20 | 0            | -
77     | 01/23/1998 | 16:59:05   | 00:00:01 | finger  | 1855     | 79        | 192.168.1.30 | 192.168.0.20 | 0            | -
90     | 01/23/1998 | 16:59:22   | 00:00:22 | telnet  | 1867     | 23        | 192.168.1.30 | 192.168.0.20 | 1            | guess
99     | 01/23/1998 | 16:58:58   | 00:00:03 | smtp    | 43533    | 25        | 192.168.0.40 | 192.168.0.20 | 0            | -
101    | 01/23/1998 | 16:59:37   | 00:00:44 | telnet  | 1876     | 23        | 192.168.1.30 | 192.168.0.20 | 0            | -
110    | 01/23/1998 | 17:00:00   | 00:00:23 | telnet  | 1884     | 23        | 192.168.1.30 | 192.168.0.20 | 1            | guess
125    | 01/23/1998 | 17:00:38   | 00:00:02 | rsh     | 1023     | 1021      | 192.168.1.30 | 192.168.0.20 | 1            | rcp

Table 1 Sample Training List
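As an illustration of how such records might be consumed programmatically, the Python sketch below parses one whitespace-separated list-file entry into named fields; the field order follows the description above, while the exact file layout is assumed.

    FIELDS = ["line", "start_date", "start_time", "duration", "service",
              "src_port", "dest_port", "src_ip", "dest_ip", "attack_score", "name"]

    def parse_list_entry(entry):
        """Split one whitespace-separated list-file entry into named fields."""
        return dict(zip(FIELDS, entry.split()))

    record = parse_list_entry(
        "90 01/23/1998 16:59:22 00:00:22 telnet 1867 23 "
        "192.168.1.30 192.168.0.20 1 guess")
    is_attack = record["attack_score"] == "1"   # attack connections are scored 1 in training data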
Initial observations of the evaluation results for the 1998 competition concluded
that most IDSs can easily identify older, known attacks with a low false-alarm rate,
but do not perform as well when identifying novel or new attacks. (Lippman et al,
2000, p 17) Several additional intrusion detection contests, such as DARPA 1999
and KDD Cup 1999, used similar data sets to evaluate results in intrusion detection
research. The DARPA 1999 evaluation used a similar structure for the contest, but
included Windows NT workstations in the simulation network. These evaluations of
developing technologies are essential to focus effort, document existing capabilities,
and guide research.
The DARPA evaluation focused on the development of evaluation corpora that
could be used by many researchers for system designs and refinement. The evaluation
used the Receiver Operating Characteristic (ROC) technique to assess intrusion detection
systems.
The ROC approach analyzed the tradeoff between false alarm rates and
detection rates for detection systems. ROC curves for intrusion detection indicate how
the detection rate changes as internal thresholds are varied to generate more or fewer
false alarms, trading off detection accuracy against analyst workload.
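A small Python sketch of this trade-off follows, using scikit-learn to compute the curve from scored alerts; the library choice and the scores themselves are illustrative and are not the tooling used in the evaluation.

    from sklearn.metrics import roc_curve

    # y_true: 1 for connections that are part of an attack, 0 otherwise.
    # scores: the detector's internal suspicion score for each connection (illustrative values).
    y_true = [0, 0, 1, 0, 1, 1, 0, 1]
    scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6]

    fpr, tpr, thresholds = roc_curve(y_true, scores)
    for fa_rate, det_rate, threshold in zip(fpr, tpr, thresholds):
        # Each internal threshold yields one (false-alarm rate, detection rate) point on the curve.
        print(f"threshold={threshold:.2f}  false-alarm rate={fa_rate:.2f}  detection rate={det_rate:.2f}")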
Lincoln Labs used thirty-two different attack types during the evaluation. Several
attacks were used in both the training and testing phase, while other attacks were new and
novel attacks that were used only during the test phase. The attacks were categorized as
Denial of Service (DOS), Remote to Local (R2L), User to Root (U2R), and Surveillance
attacks. Denial of service attacks are designed to disrupt a host or network service.
Some DOS attacks excessively load a legitimate network service, others create
malformed packets which are incorrectly handled by the victim machine, and others take
advantage of software bugs in network daemon programs. In a Remote to Local attack, a
user who does not have an account on a victim machine sends packets to that
machine and gains local access. Some R2L attacks exploit buffer overflows in network
server software, others exploit weak or misconfigured security policies, and one even
used a Trojan password capture program. Local users employ User to Root attacks to
obtain privileges normally reserved for the UNIX root or super user. Some U2R attacks
exploit poorly written system programs that run at root level which are susceptible to
buffer overflows, others exploit weaknesses in path name verification, bugs in some
versions of perl and other software flaws. Reconnaissance attacks are probes or scans
that can automatically examine the network of computers to gather information or find
known vulnerabilities. Such probes are often precursors to more dangerous attacks
because they provide a map of machines and services and pinpoint weak points in a
network.
Anomaly Detection Schemes
Six groups participated in the evaluation and submitted systems using a variety of
approaches to intrusion detection. Two systems used a statistical approach, three used a
rule-based approach and one used a data mining approach to intrusion detection.
Statistical-based intrusion detection (SBID) systems seek to identify abusive behavior by
noting and analyzing audit data that deviates from a predicted norm. SBID is
based on the premise that intrusions can be detected by inspecting a system's audit trail
data for unusual activity, and that an intruder's behavior will be noticeably different than
that of a legitimate user. Before unusual activity can be detected, SBID systems require a
characterization of user or system activity that is considered "normal."
These
characterizations, called profiles, are typically represented by sequences of events that
may be found in the system's audit data. Any sequence of system events deviating from
the expected profile by a statistically significant amount is flagged as an intrusion
attempt. (Sundaram, 1996) The main advantage of SBID systems is that intrusions can
be detected without a priori information about the security flaws of a system. However, because
user profiles are updated periodically, it is possible for an insider to slowly modify his
behavior over time until a new behavior pattern has been established within which an
attack can be safely mounted. (Lunt, 1996) Determining an appropriate threshold for
"statistically significant deviations" can be difficult. If the threshold is set too low,
anomalous activities that are not intrusive are flagged as intrusive (false positive). If the
threshold is set too high, anomalous activities that are intrusive are not flagged as
intrusive (false negative). Defining user profiles may be difficult, especially for those
users with erratic work schedules/habits.
Rule-based intrusion detection (RBID) systems, in contrast to SBID systems, are
predicated on the assumption that intrusion attempts can be characterized by sequences of
user activities that lead to compromised system states. RBID systems are characterized
by their expert system properties that fire rules when audit records or system status
information begin to indicate illegal activity. (Ilgun, 1993)
These predefined rules
typically look for high-level state change patterns observed in the audit data compared to
predefined penetration state change scenarios. If an RBID expert system infers that a
penetration is in process or has occurred, it will alert the computer system security
officers and provide them with both a justification for the alert and the user identification
of the suspected intruder. There are two major types of rule-based intrusion detection
systems: state-based and model-based. In the state-based RBID, the rule base is codified
using the terminology found in the audit trails. Intrusion attempts are defined as
sequences of system states, as defined by audit trail information, leading from an initial,
limited access state to a final compromised state.
In the model-based RBID system,
known intrusion attempts are modeled as sequences of users’ behavior; these behaviors
may then be modeled, for example, as events in an audit trail. Note, however, that the
intrusion detection system itself is responsible for determining how an identified user
behavior may manifest itself in an audit trail. Due to the voluminous, detailed nature of
system audit data (some of which may have little if any meaning to a human reviewer)
and the difficulty of discriminating between normal and intrusive behavior, analysts may
use expert systems technology to automatically analyze audit trail data for intrusion
attempts.
Data Mining Intrusion Detection (DMID) systems take a data-centric point of
view and consider intrusion detection as a data analysis process. Data mining generally
refers to the process of (automatically) extracting models from large stores of data. The
recent rapid development in data mining has made available a wide variety of algorithms,
drawn from the fields of statistics, pattern recognition, machine learning, and databases.
DMID systems use data mining techniques to correlate knowledge derived from separate,
heterogeneous data sets into a rule-set capable of providing a general description of an
environment comprising these sets. This work led to the further use of data mining
techniques to build better models for intrusion detection by analyzing audit data using
associations and frequent episodes, and utilizing the resulting rules when constructing
classifiers.
The majority of the DARPA evaluation systems tested used either tcpdump data
alone or tcpdump data along with Basic Security Module (BSM) audit data to detect
attacks. (Lippman, 2000) Three of the systems were designed to detect all four categories
of attacks (DOS, R2L, U2R, and Reconnaissance). The best system detected about 75%
of the attacks in the test data with fewer than two false alarms per day. This RBID
system used both tcpdump and BSM data with hand-created attack signatures generated
using the training data. It, however, missed many of the new attacks. The next-best
system, based on data mining, was able to detect 64% of the attacks with 20 false alarms
per day. It used only tcpdump data and created rules learned using pattern classification
and data mining with hand-selected features. The detection rate of this system is similar
to that of the first rule-based system when it used tcpdump data alone. Under these
circumstances, the rule-based system detects about 45% of the attacks with a false alarm
rate of 46 false alarms per day. These results suggest that either the rule-based system or
data mining system can provide good performance on previously seen attacks, but neither
approach is capable of detecting new attacks with high accuracy.
Many systems provided good detection accuracy for old attacks that were
included in the training data, but poor detection accuracy for new attacks that were only
in the test data. The two best performing systems detected old attacks with reasonable
accuracies ranging from 63% to 93% detections at 10 false alarms per day. Performance
was much worse for new attacks. Detection accuracy for new attacks is below 25% for
the R2L and DOS categories and not significantly different from performance with old
attacks for the Reconnaissance and U2R categories. This poor performance for new
attacks in the R2L and DOS categories indicates that rules learned from the training data on old
attacks do not generalize to the new attacks. Performance with new attacks is not
degraded in the Reconnaissance and U2R categories because new attacks in these
categories were not very different from old attacks and did not employ as many different
mechanisms as used in the new R2L and DOS attacks.
These results demonstrated that the evaluated systems could reliably detect many
existing attacks with low false alarm rates so long as examples of these attacks were available
for training. All research systems could effectively use training data to improve detection
performance and minimize false alarm rates for known attacks.
Research systems,
however, missed many dangerous new attacks, especially when the attack mechanism or
TCP/IP services used differed from the old attack. These results, and the general success
of the evaluation procedures, suggested further research in approaches that could detect
new attacks with low false alarm rates.
KDD CUP ‘99
Due to the relative success of the Data Mining approach during the DARPA
Intrusion Detection System Evaluation and the significant challenge of identifying new
attacks, the organizational committee for the 1999 Knowledge Discovery and Data
Mining (KDD) Conference suggested that an intrusion detection problem augment the
existing 1999 KDD learning challenge. The KDD competition aims at showcasing the
best methods for discovering higher level knowledge from data and closing the gap
between research and industry, thereby stimulating further KDD research and
development. The KDD’99 Cup competition used a subset of the preprocessed DARPA
training and test data supplied by Professors Sal Stolfo and Wenke Lee (Elkin, 1999), the
principal researchers for the Data Mining entry to the DARPA evaluation. The raw
training data was about four gigabytes of compressed binary tcpdump data from seven
weeks of network traffic. This was processed into about five million connection records.
Similarly, the two weeks of test data yielded around two million connection records.
Scoring focused on the system's ability to detect novel attacks in the test data that were
variants of known attacks labeled in the training data. The KDD '99 training datasets
contained a total of 24 training attack types, with an additional 14 attack types in the test
data only.
Participants were given a list of high-level features that could be used to
distinguish normal connections from attacks. A connection is a sequence of TCP packets
starting and ending at some well defined times, between which data flows to and from a
source IP address to a target IP address under some well defined protocol. Each
connection is labeled as either normal, or as an attack, with exactly one specific attack
type. Each connection record consists of about 100 bytes. Three sets of features were
made available for analysis. First, the "same host" features examine only the
connections in the past two seconds that have the same destination host as the current
connection, and calculate statistics related to protocol behavior, service, etc. The similar
"same service" features examine only the connections in the past two seconds that have
the same service as the current connection. "Same host" and "same service" features are
together called time-based traffic features of the connection records. Some probing
attacks scan the hosts (or ports) using a much larger time interval than two seconds, for
example once per minute. Therefore, additional features can be constructed using a
window of 100 connections to the same host instead of a time window. This yields a set
of so-called host-based traffic features. Finally, domain knowledge can be used to create
features that look for suspicious behavior in the data portions, such as the number of
failed login attempts. These features are called “content” features.
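For illustration, the Python sketch below computes one such time-based feature, the count of recent connections to the same destination host; the two-second window follows the description above, while the record format and field names are assumptions.

    from collections import deque

    def same_host_count(recent, current):
        """Count connections in the past two seconds that have the same
        destination host as the current connection (a 'same host' feature)."""
        return sum(1 for c in recent
                   if c["dest_ip"] == current["dest_ip"]
                   and 0 <= current["time"] - c["time"] <= 2.0)

    # Usage sketch: maintain a sliding window of recent connection records.
    recent = deque(maxlen=10000)
    for conn in [{"time": 0.5, "dest_ip": "192.168.0.20"},
                 {"time": 1.2, "dest_ip": "192.168.0.20"},
                 {"time": 2.0, "dest_ip": "192.168.0.20"}]:
        conn["same_host_count"] = same_host_count(recent, conn)
        recent.append(conn)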
Knowledge Discovery and Data Mining (KDD) Processes
Many people treat data mining as a synonym for knowledge discovery, while
others view data mining as simply an essential step in the knowledge discovery process.
Knowledge discovery is the overall process of finding and interpreting patterns from data
that involves the repeated application of several steps. (Han & Kamber, 2001) The first
step, selection, involves developing an understanding of the application domain and
applying any relevant prior knowledge. For computer security applications, the data
analyst must understand the various types of computer attacks and how to dissect a TCP
or IP packet. Next, the analyst must select a data set or data sample on which the
discovery is to be performed. The DARPA dataset consists of multiple data samples that
can be used for the analysis. Data cleaning and preprocessing is performed on the dataset
to remove any noise or to determine a strategy to deal with missing data fields. If our
intrusion detection problem focused only on an internal threat, the comprehensive dataset
could be winnowed down to just internal source IP addresses for analysis. The next step
involves data reduction and projection. At this point, the analyst tries to find useful
features to represent the data depending on the goal of the task. The analyst can use
dimensionality reduction or transformation methods to find invariant representations of
the data. The KDD ’99 dataset was already preprocessed to aggregate data packets into
feature sets based on connection factors. Each feature set can support separate and
distinct analyses based on the relevant value of the set to the defined goal. For example,
if time connection features have a greater likelihood of describing an attack scenario, the
results of the analysis using this feature set should be weighted heavier than that of the
other two sets when determining a final assessment. Decisions made during these data
selection, preprocessing, and data transformation steps will all have an impact on further
data mining tasks.
The two primary high-level goals of data mining, in practice, are prediction and
description. Prediction involves using some variables or fields in the dataset to predict
unknown or future values of other variables of interest, while description focuses on
finding human-interpretable patterns describing the data. In general, for the application
of intrusion detection, description is more important than prediction. However, if we
were trying to predict the next action of an intruder based on an observed sequence of
actions, prediction is obviously more important. Prediction and description are
accomplished by the following primary data mining tasks: classification or clustering,
summarization, and change detection. Classification is learning a function that maps a
data item into one of several predefined classes while clustering seeks to identify a finite
set of categories to describe the data. If the goal of our intrusion detection system is to
differentiate an attack from an innocuous event, classification is appropriate. If the
system seeks to identify the type of attack or malicious action, clustering is appropriate.
The next task, summarization, involves methods for finding a compact description for a
subset of data, while deviation detection focuses on discovering the most significant
changes in the data from previously measured or normative values. Detecting
deviations from normal behavior is, of course, the entire purpose of an anomaly-based
approach to intrusion detection.
It may be almost impossible to determine normal activity when we introduce a
new system to our network or add a new user to the system. Traditional anomaly
detection techniques focus on detecting anomalies in new data after training on normal,
or clean, data. Ideally, we would like an effective technique to detect anomalies
immediately upon implementation without the need to train using normal data. Recent
research in intrusion detection has approached this problem in a variety of ways.
Approaches to Unsupervised Anomaly Detection
George Mason University developed ADAM (Audit Analysis and Mining) as a
testbed to study how useful data mining techniques can be in intrusion detection.
(Barbara, 2001) The earliest versions of ADAM used a combination of association rules
mining and classification to discover attacks in a tcpdump audit trail. This approach
required a training phase, which is based on the availability of labeled data, where labels
indicate whether the points correspond to normal events or attacks. This type of data, of
course, was available for the DARPA dataset, but is not readily available in practice.
ADAM has incorporated a new method that starts by segmenting a sample of unlabeled
data (using time for the basis of segmentation), finding frequent itemsets for each of the
segments, intersecting these itemsets and mapping the resulting set back to the
connections. (Barbara, 2003) This base set, which is assumed to be attack-free, is
processed using an entropy-based algorithm to detect outliers. Lower entropy values
represent a higher likelihood of being an outlier. Every new point gets evaluated against
an abbreviated description of the existing clusters. Although the resulting probability
density functions for data sets can show which clusters are most likely to be attack or
attack-free clusters, there is currently no way to further analyze the clusters to determine
the type of attack represented in the connection data.
Stanford Research Institute has recently incorporated Bayes inference techniques
into the statistical anomaly detector of EMERALD (Event Monitoring Enabling
Responses to Anomalous Live Disturbances). EMERALD’s Bayes (eBayes) system
encodes a knowledge base in terms of conditional probability relationships rather than
rules or signatures. (Valdes & Skinner, 2000) It applies Bayesian inference to TCP
sessions based on observed and derived variables at periodic intervals. Historical records
are used as its normal training data. eBayes then compares distributions of new data to
form a belief network of hypotheses. Given a naïve Bayes model and the set of
hypotheses, a conditional probability table is generated for the current set of hypotheses
and variables. By adding a dummy state of hypotheses and a new row to the conditional
probability table initialized as a uniform distribution, eBayes has the ability to generate
new hypotheses dynamically, which helps it detect new attacks. eBayes enables
EMERALD to detect distributed attacks in which none of the attack sessions are
individually suspicious enough to generate an alert. This correlation capability,
however, is extremely computationally expensive.
Three additional algorithms have been proposed and together create a geometric
framework for unsupervised anomaly detection. (Eskin et al., 2002) The framework
maps data into a feature space and determines what points are outliers. Points that are in
sparse regions of the feature space are labeled as anomalies. Although distance-based
outliers have been used in other domains, the nature of the outliers is different. Often in
network data, the same intrusion occurs many times, which means that there are many
similar instances of the data. However, the number of instances of this intrusion is still
significantly smaller than the typical cluster of normal instances. Because intrusion
detection is a very complex problem, the framework outlines three different algorithms to
detect outliers in the feature space: a cluster-based approach, a k-nearest neighbor based
approach and a Support Vector Machine-based approach.
The goal of the cluster-based algorithm is to compute how many points are “near”
each point in the feature space. A fixed-width clustering algorithm was developed to
reduce the computational requirements of a pairwise comparison for all points in the
feature space. The first point is the center of the first cluster. For every subsequent point,
if it is within distance w (where w is dynamically defined by the user) of a cluster center,
it is added to that cluster. Otherwise it becomes the center of a new cluster. Points may
be added to multiple clusters. This algorithm only requires one pass through the
dataset. Outliers will belong to clusters with the fewest members.
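A simplified Python sketch of this single-pass algorithm follows; in this version a point is compared against every existing cluster center, and the width w and data are illustrative.

    import numpy as np

    def fixed_width_clustering(points, w):
        """Single-pass fixed-width clustering: a point joins every existing cluster
        whose center lies within distance w; otherwise it seeds a new cluster.
        Returns the cluster centers and the number of members in each cluster."""
        centers, counts = [], []
        for p in points:
            matched = False
            for i, c in enumerate(centers):
                if np.linalg.norm(p - c) <= w:
                    counts[i] += 1           # points may be added to multiple clusters
                    matched = True
            if not matched:
                centers.append(p)            # the point becomes the center of a new cluster
                counts.append(1)
        return centers, counts

    # Outliers are the points belonging to the clusters with the fewest members.
    points = np.random.default_rng(1).normal(size=(200, 3))
    centers, counts = fixed_width_clustering(points, w=1.5)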
In contrast to the cluster algorithm above, the K-nearest neighbor based algorithm
determines whether a point lies in a sparse region of the feature space by computing the
sum of the distances to the k-nearest neighbors of the point. This algorithm uses a
variation of the clustering algorithm where each point is placed in just one cluster.
Intuitively, the points in a dense region will have many points near them and will have a
smaller k-NN score than points in a sparse region. Even with refinements to the
algorithm to cut down on the number of points and clusters required for comparison,
computing the k-NN score for each point is computationally expensive and makes it
impractical for real-time intrusion detection.
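The k-NN score itself can be sketched in a few lines of Python; this naive version computes all pairwise distances and so also illustrates why the approach is computationally expensive. The data here are illustrative.

    import numpy as np

    def knn_scores(points, k):
        """Anomaly score of each point: the sum of the distances to its k nearest
        neighbors. Points in sparse regions of the feature space receive large scores."""
        dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
        np.fill_diagonal(dists, np.inf)           # exclude each point's distance to itself
        nearest = np.sort(dists, axis=1)[:, :k]   # the k smallest distances for each point
        return nearest.sum(axis=1)

    scores = knn_scores(np.random.default_rng(4).normal(size=(100, 3)), k=5)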
The third algorithm uses a variation on Support Vector Machines (SVM) to
estimate the region of the feature space where most of the data occurs. The standard
SVM algorithm is a supervised learning algorithm that requires labeled training data to
create a classification rule and tries to maximally separate two classes of data in the
feature space by a hyperplane. The unsupervised SVM algorithm, in contrast, does not
require a labeled training set and attempts to separate the entire set of testing data from the
origin. This algorithm tries to find a small region where most of the data lies and label
points in that region as class +1 (normal). Points in other regions are labeled class –1
(anomalous). The main idea is that the algorithm attempts to find a hyperplane that
separates the data points from the origin with maximal margin. The SVM approach was
by far the most efficient and the most effective. The SVM approach detected 98% of the
KDD Cup attacks with a 10% false detection rate, while the Cluster and K-NN
approaches detected only around 93% with the same false detection rate.
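The behavior described can be sketched with scikit-learn's one-class SVM, used here as a stand-in for the algorithm in the framework rather than the authors' own implementation; the data and parameter values are illustrative.

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(2)
    train = rng.normal(size=(1000, 5))                     # unlabeled connection feature vectors
    test = np.vstack([rng.normal(size=(50, 5)),            # mostly normal points ...
                      rng.normal(loc=6.0, size=(5, 5))])   # ... plus a few far-away outliers

    # nu bounds the fraction of points allowed to fall outside the "normal" region.
    model = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(train)
    labels = model.predict(test)                           # +1 = normal region, -1 = anomalous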
A more recent comparative study among five anomaly detection schemes
confirmed that the neural network approach is effective in determining perturbations of
normal behavior. (Lazarevic et al., 2003) Most anomaly detection algorithms require a
set of purely normal data to train the model, and implicitly assume that anomalies can be
treated as patterns not observed before. Since an outlier may be defined as a data point
which is very different from the rest of the data, based on some measure, the study
employed several outlier detection schemes. The problem with most of these methods,
however, is that an action must be clearly marked as either included or excluded from the
“normal” set.
A SOM is a neural network that can approximate the distribution of target patterns
with a small number of weight vectors by competitive learning. We can use a SOM to
classify normal behavior into recognized patterns and thereby characterize anomalous
activities relative to these groupings. Shinichi Horikawa outlined an innovative process
to derive fuzzy classification rules using the final weight vectors of the SOM after
learning and their corresponding membership and support values.
These rules are
divided further, as necessary, to separate each sample class as much as possible in the
pattern space. (Horikawa, 1997)
While he applied it effectively to the classic iris
classification problem, it is also relevant to the challenges of the computer security
domain.
CHAPTER 3
SYSTEM DESIGN AND ARCHITECTURE
Soft computing is an innovative approach to constructing computationally
intelligent systems. Complex, real-world problems, such as computer security, require
intelligent systems that combine the knowledge, techniques, and methodologies from
various sources. Combining several computing techniques, synergistically rather than
exclusively, is the best approach for the multifaceted domain of intrusion detection.
Neural networks model the brain in a dynamic connectionist structure to mimic brain
mechanisms and simulate intelligent behavior.
These systems are ideal for pattern
recognition and classification problems. Fuzzy set theory can be used to differentiate
between entities that are classified as members of two sets simultaneously by using
numerical computations and linguistic labels stipulated by the membership functions.
Thus, we can synergistically incorporate neural network learning concepts and fuzzy
inference systems to create the quintessential neuro-fuzzy intrusion detection system.
There are several schemes for categorizing neuro-fuzzy systems based upon
functionality and the degree of interconnectivity. The most comprehensive classification
method characterizes a hybrid system into one of three major groups depending on the
configuration of the internal modules and the conceptual understanding of the processing
required. (McGarry et al., 1999, p 66) The first group, unified hybrid systems, consists of
those systems that have all preprocessing activities implemented by the neural network
elements. These systems have had a limited impact upon real world applications due to
the complexity of implementation and limited knowledge representation capabilities.
The second category, transformational hybrid systems, transforms symbolic
representations into neural representations or vice versa. The main processing is
performed by neural representations but there are automatic procedures for transferring
neural representations to symbolic representations or vice versa. The third category,
modular hybrid systems, covers those hybrid systems that are modular in nature, i.e. they
are comprised of several neural network and rule-based modules which can have different
degrees of coupling and integration. The vast majority of hybrid systems fall into this
category.
They are powerful processors of information and are relatively easy to
implement.
Both transformational and modular systems must be able to convert from an initial
neural architecture to a symbolic domain or vice versa.
A direct way of converting
neural to symbolic knowledge is through rule extraction.
This process discovers the
fuzzy subspaces and relative positions of the input units to the output units of a neural
network and then formulates fuzzy IF . . . THEN rules based on these positions. The
discovery of the subspaces is found by a number of techniques that analyze the weights
and biases of the neural network. Self-organizing Kohonen maps are able to organize a
set of multiple input patterns into class subspaces distributed on a two-dimensional
neuron configuration (map).
Each output subspace corresponds to a fuzzy output
variable that can, in turn, be used in the formulation of a fuzzy ruleset.
Depending on the processing requirements, a module may be either sequential or
parallel. Sequential flow implies that one process must be completed before the data may
be passed to the next module. One module acts as the preprocessor of data, extracting the
required features into a form suitable for the next module. A neural network can act as a
preprocessor for a rule-based system by converting raw input features into a form more
suitable for symbolic level decision-making. A parallel architecture, in contrast, has both
the neural network and the rule-based modules operating on some common data.
Another possibility for parallel operation is where the neural network and rule-based
elements operate on different data but combine their results for an overall classification.
For an application in the domain of intrusion detection, we need to determine the
characteristics that will represent the input nodes of the Kohonen map. The datasets
collected by DARPA have several relevant features that can be used as characteristics for
the initial classification problem. Each packet or session element has a start time, stop
time and duration. Therefore, we can classify them initially by the 4-hour block of
time in which they were collected and the duration of the session. Service type, requested port,
and IP address are additional features that can be used for a complete set of input
variables for the Neural Network. The combination of these features for a particular
packet or session will aid in the classification process.
Each input neuron is assigned a representative vector element and each input
pattern reflects a combination of input signals and weights. During the learning process,
vectors are adapted in accordance with the input signals, i.e. their positions are shifted in
the input space in the direction of the input vector. The result is an organized network
where similar input patterns are located with a degree of proximity. The distribution
density of the input patterns determines the resolution. For example, a large number of
neurons are configured on the map for areas from which a large number of input patterns
have been presented and these areas are also represented on the map to a higher
resolution than areas with a smaller number of patterns.
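To make the mechanics of this training process concrete, the following sketch illustrates one plausible implementation of the Kohonen learning rule described above (in Python, with illustrative parameter values; the actual experiments will use DataEngine's SOM module rather than this code):

    import numpy as np

    def train_som(data, grid_shape=(10, 10), epochs=20, lr0=0.5, sigma0=3.0):
        # Minimal Kohonen SOM: shift weight vectors toward each input,
        # with a learning rate and neighborhood that shrink over time.
        rows, cols = grid_shape
        n_features = data.shape[1]
        rng = np.random.default_rng(0)
        weights = rng.random((rows, cols, n_features))
        # Map coordinates, used to compute neighborhood distances on the 2-D grid.
        coords = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"))
        n_steps = epochs * len(data)
        step = 0
        for _ in range(epochs):
            for x in rng.permutation(data):
                # Best matching unit: node whose weight vector is closest to x.
                dists = np.linalg.norm(weights - x, axis=2)
                bmu = np.unravel_index(np.argmin(dists), dists.shape)
                frac = step / n_steps
                lr = lr0 * (1.0 - frac)
                sigma = sigma0 * (1.0 - frac) + 1e-3
                # Gaussian neighborhood around the BMU on the map.
                grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=2)
                h = np.exp(-grid_d2 / (2.0 * sigma ** 2))[..., np.newaxis]
                # Shift every weight vector toward the input, scaled by lr and h.
                weights += lr * h * (x - weights)
                step += 1
        return weights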
Resulting clusters that create crisp partitions are unacceptable for an intrusion
detection system. Invasive activities can rarely be defined into one or two distinct
categories. Most events can only be classified in conjunction with other related activities.
It is imperative, therefore, to create a clustering technique without using purely
discriminating features. We will combine the classification capabilities of a classic
Kohonen Neural Network with the addition of a fuzzy c-means step. The fuzzy c-means step differs essentially from the training algorithm of a classic Kohonen network in that each learning step considers all of the training examples together. In addition to providing the
position of the cluster centers, which is the goal with the classic Kohonen Network, the
fuzzy c-means step also provides the membership values of the individual objects to the
different clusters. This permits the classification of new objects and their degrees of
membership.
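As an illustration, the following sketch shows a standard fuzzy c-means update in which every iteration considers all training examples and returns both cluster centers and membership degrees; the parameter values and the fuzzifier m = 2 are assumptions made only for illustration:

    import numpy as np

    def fuzzy_cmeans(data, n_clusters=4, m=2.0, max_iter=100, tol=1e-5):
        # Minimal fuzzy c-means: every learning step uses all samples and returns
        # both the cluster centers and each sample's degree of membership.
        rng = np.random.default_rng(0)
        u = rng.random((len(data), n_clusters))
        u /= u.sum(axis=1, keepdims=True)          # memberships of each sample sum to 1
        for _ in range(max_iter):
            um = u ** m
            # Centers are membership-weighted means of all samples.
            centers = um.T @ data / um.sum(axis=0)[:, None]
            # Distance from every sample to every center (small epsilon avoids /0).
            d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            # Standard membership update for fuzzifier m.
            new_u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1)), axis=2)
            if np.max(np.abs(new_u - u)) < tol:
                u = new_u
                break
            u = new_u
        return centers, u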
An expert will know how to react to certain sets of vulnerabilities in the resulting
map without necessarily having detailed knowledge relating to the input variables or the
functions that describe their interactions. It is sufficient that the expert defines the attributes or terms of the variables, indicating which value ranges are regarded as "normal" and which are not. Different combinations of input variables will be used for the construction of various sets of fuzzy rules. In contrast to conventional expert systems, where value ranges are disjoint, a fuzzy expert system can use value ranges that do, and indeed
should, overlap. This means that a concrete value can belong to several terms with
varying degrees of membership if appropriate.
The parallel modular hybrid Intrusion Detection System is designed to detect
anomalous behavior and report it to various devices so that they can take appropriate
action. A bit of preprocessing must be carried out since the raw tcpdump data has a high
dimensionality and only a small fraction is needed to create each derivative Kohonen
Self-Organizing Map. The preprocessing module reduces the dimensionality of the raw
input data by selecting a different subset of parameters for each SOM. The transformed data is then passed into various neural network modules, which are designed
to classify session data into various types of events. A neural network is required for this
task since several events exhibit the same indicators. The output of each neural network
module is used to create a rule-based diagnostic module, which provides particulars of
the observed events and is also able to provide trend analysis. Finally, a synthesis of
these disparate fuzzy expert systems converges to produce a set of required actions for
the responsive devices on the network.
Figure 3 Intrusion Detection System Architecture
While the ultimate goal of this research project is to develop a more effective IDS
that combines the best features of misuse detection and anomaly detection techniques, the
focus will be on the anomaly detection aspect. SNORT is a well-respected misuse detection system that has a database of over 1800 signatures reflecting all known vulnerabilities cited in MITRE’s collection of Common Vulnerabilities and Exposures.
The CVE is the de facto standard within the computer security community for classifying
and naming all publicly known vulnerabilities and security exposures. The goal of CVE
is to make it easier to share data across separate vulnerability databases and security
tools.
In addition to its powerful misuse detection capabilities, SNORT has a robust packet
logging feature. SNORT’s preprocessors are able to log each connection from a client to a server upon its completion, noting such features as source and destination IP addresses
and port, time of connection, size of packets, etc. SNORT also has an XML plugin that
enables it to log in the Simple Network Markup Language (SNML) format. This plug-in
can be used with one or more SNORT sensors to log to a central database and create
highly configurable intrusion detection infrastructures within a network.
Jed Pickel and Roman Danyliw, from the Computer Emergency Response Team
Coordination Center, developed this plugin as part of the AIRCERT project. (Roesch &
Green, 2002, p 41) Because the SNML is still in its early phases of development, it is
highly likely to be modified as it undergoes public scrutiny. The Intrusion Detection
Working Group has solicited modifications and this research may identify significant
revisions to the existing document definition to ensure a suitable vector configuration for initial processing by the IDS.
Components of the IDS will be modeled in DataEngine. DataEngine’s technical
computing environment provides the features and support required to complete our
research project. It provides core mathematics and advanced graphical tools for data
analysis, visualization, and application development. DataEngine’s SOM module will
process the preliminary SOM vectors formed from SNORT’s output logs. The SOM’s
final two-dimensional representative sample patterns are determined based on the number of classes, the weight vectors, and the Euclidean distance from the closest weight vector.
These SOM parameters will determine a series of Fuzzy Rules which will be used by
DataEngine’s Fuzzy Rule Base Module. For each weight, wi, the structure of a fuzzy
membership function, Aij, is determined according to its distribution range. The grade of
certainty is calculated next, based on the generated membership functions. In each subspace of the pattern space defined by the membership functions, the grade of certainty is assigned according to the truth value and the class to which the sample patterns belong.
Figure 4 Generation of Membership Functions
For each weight vector associated with a node in the competitive layer, there is a set
of patterns. Each sample pattern is assigned to the set for which the Euclidean distance is the smallest. Any empty sets are deleted. All other sets determine a fuzzy rule as follows:
Ri: If x1 is Ai1 and x2 is Ai2, then x belongs to class 1 with μi1 and x belongs to class 2 with μi2

Aij is a fuzzy variable for the jth dimensional element, xj, and μic is the grade of certainty when x belongs to class c. Aij is defined by the triangular membership function whose center is determined by the value of the corresponding weight vector and whose width is determined by the distribution of all members of the set. Each rule is normalized so that the total membership function becomes 1. For example, in Figure 4, above, the set R2 overlaps with R1; therefore the membership grades μ12 and μ22 may be broken down into 0.5 and 0.5, and μ11 and μ21 may be broken down into 0.9 and 0.1. If there is no overlap, as with R3, the set is isolated with a membership grade of 1.0.
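The following sketch illustrates how such rules could be generated from SOM weight vectors; computing each class's grade of certainty as the fraction of assigned labeled patterns belonging to that class is a simplifying assumption made for illustration, not the exact normalization procedure described above:

    import numpy as np

    def build_fuzzy_rules(weights, patterns, labels, n_classes):
        # One candidate rule per SOM node: patterns are assigned to the node with
        # the smallest Euclidean distance, empty sets are deleted, and each rule
        # keeps a triangular membership function per dimension plus class grades.
        nodes = weights.reshape(-1, weights.shape[-1])
        assign = np.argmin(
            np.linalg.norm(patterns[:, None, :] - nodes[None, :, :], axis=2), axis=1)
        rules = []
        for i, w in enumerate(nodes):
            members = patterns[assign == i]
            member_labels = labels[assign == i]
            if len(members) == 0:
                continue                                    # empty sets are deleted
            # Triangular function: center at the weight value, width from the spread.
            widths = np.maximum(members.std(axis=0), 1e-6)
            # Assumed simplification: grade of certainty per class is the fraction
            # of the node's patterns carrying that class label.
            grades = np.bincount(member_labels, minlength=n_classes) / len(member_labels)
            rules.append({"center": w, "width": widths, "grades": grades})
        return rules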
An unknown pattern is classified by using the fuzzy rules obtained above. A truth
value, Ai(x), is calculated for each fuzzy rule. The product of Ai(x) and the grade of certainty, μic, for each class, c, is calculated for each rule with a truth value greater than 0. The test sample pattern is assigned to the class for which this product is the greatest. The test sample in Figure 4 would clearly be classified as class 2. If the test sample does not clearly match any of the specified classes, it will be flagged as an anomaly with a profile associated with the classes closest to it, as indicated by the shortest Euclidean distance. This process, therefore, will not produce hard false positives, but rather alerts with various levels of confidence.
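A minimal sketch of this classification step follows; taking the truth value as the product of the per-dimension triangular memberships and using a fixed anomaly threshold are assumptions for illustration:

    import numpy as np

    def triangular(x, center, width):
        # Triangular membership value in [0, 1] for each dimension.
        return np.maximum(0.0, 1.0 - np.abs(x - center) / width)

    def classify(sample, rules, threshold=0.1):
        # Truth value of a rule = product of its per-dimension memberships (an
        # assumed choice); the winning class maximizes truth value * grade of
        # certainty. Weak matches are flagged as anomalies rather than forced
        # into a class, yielding an alert with a confidence level.
        best_class, best_score = None, 0.0
        for rule in rules:
            truth = float(np.prod(triangular(sample, rule["center"], rule["width"])))
            if truth <= 0.0:
                continue
            scores = truth * rule["grades"]
            c = int(np.argmax(scores))
            if scores[c] > best_score:
                best_class, best_score = c, float(scores[c])
        if best_score < threshold:
            return "anomaly", best_score
        return best_class, best_score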
CHAPTER 4
EXPERIMENT SET-UP
The goal of this research is to develop an automated process to identify
anomalous behavior in network activity while it is occurring so that immediate action
may be taken. The initial experiment, however, will just determine if the use of multiple
self-organizing maps and resulting fuzzy rules are adequate for the clustering and
classification of these anomalies. We will use tcpdump data collected during the DARPA
1998 Evaluation of Intrusion Detection Systems as a baseline. Features will be extracted
from the raw data and will be used to create input vectors for three different self-organizing maps.
We will follow the pre-processing technique proposed by MINDS (Ertoz, et al.,
2003) whereby features are extracted based on content, time and connection. Content
features include number of total packets, acknowledgement packets, data bytes,
retransmitted packets, pushed packets, SYN and FIN packets flowing to or from the
source and destination. We will also track the status (i.e., completed, not completed, or
reset) for each connection. Time-based features include the number of connections and
type of services to or from the source and destination within the last 5 seconds. Because
this time-based approach will not be able to detect slow and stealthy attacks, we will also
extract connection-based features, such as the number of connections made to or from the
same source or destination within the last 100 connections. Each respective feature set
will create a representative vector for one of the three SOMs. In other words, one SOM
will try to find anomalous behavior based on content, another based on time, and the third
based on connection factors.
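As an example of the connection-based portion of this preprocessing, the following sketch counts, for each connection, how many of the previous 100 connections shared its source or destination address; the record fields shown are illustrative:

    from collections import deque

    def connection_window_features(connections, window=100):
        # For each connection, count how many of the previous `window` connections
        # shared its source or destination IP (connection-based features).
        recent = deque(maxlen=window)
        features = []
        for conn in connections:                 # e.g. {"src": "10.0.0.1", "dst": "10.0.0.9"}
            same_src = sum(1 for c in recent if c["src"] == conn["src"])
            same_dst = sum(1 for c in recent if c["dst"] == conn["dst"])
            features.append({"same_src_last_100": same_src,
                             "same_dst_last_100": same_dst})
            recent.append(conn)
        return features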
Although Self-Organizing Maps use an unsupervised approach to machine
learning, we must find some way to label the neurons on the resulting 2-dimensional
map. We can use a label set that has several records of normal behavior and several
records for each of the four major types of anomalous behavior required for identification
in the DARPA evaluation (Denial of Service (DOS), Remote to Local (R2L), User to
Root (U2R), and Reconnaissance attacks). If no records match the anomalies in the label
set, no neurons will be labeled. Sparse clusters should only have a few neurons labeled.
We assume that these neurons are the most likely to represent anomalous behavior and
should be highlighted.
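A simple way to perform this labeling is to map each labeled record to its best matching unit, as in the following sketch (the label names are illustrative):

    import numpy as np

    def label_neurons(weights, label_vectors, label_names):
        # Map each labeled record to its best matching unit; neurons that never
        # win remain unlabeled, so sparse regions end up with few labeled nodes.
        nodes = weights.reshape(-1, weights.shape[-1])
        neuron_labels = {}
        for vec, name in zip(label_vectors, label_names):
            bmu = int(np.argmin(np.linalg.norm(nodes - vec, axis=1)))
            neuron_labels.setdefault(bmu, set()).add(name)   # e.g. "normal", "DOS", "R2L"
        return neuron_labels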
Recall that we will use three different SOMs to cluster tcpdump data. Each of
these SOMs will represent very different aspects of a tcp connection. The combination of
the three results should allow us to better identify true anomalies and significantly reduce
false positives. The resulting SOM differs with each set of input. Therefore, we will
need to create a process that takes the current SOM and uses the relative position of
labeled neurons to see if other connections may have similar features. These would
represent attacks that are similar enough to be placed in close proximity to a known attack but different enough that they are not associated with the same neuron. These
neighboring neurons will be included when we set up rules for evaluation and
interpretation if we use fuzzy sets but would be missed if we used crisp sets.
Crisp rules attempt to classify members distinctly into one set or another, never in
both at the same time. Fuzzy rules, in contrast, allow a member to have a degree of
membership in two different clusters simultaneously. Therefore, we can see a connection that has qualities of both a normal connection and an attack sequence at the same time.
If we used only crisp sets for our classification rulebase, we would miss threatening
connections that are predominantly normal (and therefore placed with the normal
connections) or misclassify a new but perfectly benign connection that has more in
common with a known attack than with our normal processes.
Figure 5 Labeled SOM
The red node in Figure 5 represents a sparse node; therefore we deduce that it is
an anomaly. This vector represents a connection that should be classified and brought
to the analyst’s attention. The SOM will be completely different if it is constructed
with a different input set. Therefore, we will simply create arbitrary fuzzy rules using
the X1 and Y1 axes for SOM 1. Each subsequent SOM will have similar rules for the nodes represented on the Xn and Yn axes, where n is the SOM identifier. We have decided to use three SOMs for this domain, but other problems may need different feature sets. Additional rules will be created from the other two
SOMs. The rules created to find the four classes of DARPA attacks for this SOM
would be as follows:
R11: If X1 is High and Y1 is Low then Type 1
R12: If X1 is High and Y1 is High then Type 2
R13: If X1 is Somewhat Low and Y1 is Somewhat Low then Type 3
R14: If X1 is Medium and Y1 is High then Type 4
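A sketch of how rules such as these could be evaluated is given below; the shapes chosen for the linguistic terms (Low, Somewhat Low, Medium, High) and the use of the minimum as the rule strength are assumptions for illustration only:

    # Assumed shapes for the linguistic terms over normalized coordinates in [0, 1].
    def low(v):          return max(0.0, 1.0 - 2.0 * v)
    def somewhat_low(v): return max(0.0, 1.0 - abs(v - 0.3) / 0.3)
    def medium(v):       return max(0.0, 1.0 - abs(v - 0.5) / 0.25)
    def high(v):         return max(0.0, (v - 0.5) * 2.0)

    def classify_som1(x1, y1):
        # Evaluate R11-R14 for SOM 1; rule strength is the minimum of its two
        # term memberships, and the strongest rule determines the attack type.
        strengths = {
            "Type 1": min(high(x1), low(y1)),                    # R11
            "Type 2": min(high(x1), high(y1)),                   # R12
            "Type 3": min(somewhat_low(x1), somewhat_low(y1)),   # R13
            "Type 4": min(medium(x1), high(y1)),                 # R14
        }
        return max(strengths, key=strengths.get), strengths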
Inference using the 12 fuzzy rules representing the three different SOMs should
allow us to classify anomalies among the four types tested in the DARPA evaluation.
We will develop ROC curves for this process and compare them with the published
results from other related research. Depending on the initial results of this research, we
will also explore the use of a sliding window to automate the processing and
classification of intrusion detection data for real-time analysis and notification. For
example, it may be possible to collect system data, create new SOMs, classify anomalous
connections and create alerts every 15 minutes.
Additional research is also possible in further classifying the four DARPA
evaluation attack types to a more granular level. For example, Reconnaissance attacks
can be further classified as a probe, an active attempt to initiate a response from network
entities, or passive information gathering, obtaining information without interaction with
the information source or destination. We have developed a taxonomy of 24 classes of
attacks that fall into the 4 general areas of Denial of Service (DOS), Remote to Local
(R2L), User to Root (U2R), and Reconnaissance attacks. Seeding the SOMs with 24
label records and increasing the dimension of each map should make this a fairly easy
extension.
CHAPTER 5
METHOD OF VALIDATION
As interest in intrusion detection has grown, the topic of evaluation of intrusion
detection systems has also received great attention. Since it is difficult and costly to
perform reliable, systematic evaluations of intrusion detection systems, few such
evaluations have been performed. One such effort was a combined research effort by
Lincoln Laboratory, the Defense Advanced Research Projects Agency (DARPA) and the
U.S. Air Force. The aim of the evaluation was to assess the current state of IDSs within
the Department of Defense and the U.S. government. Evaluations were performed in
both 1998 and 1999.
These evaluations attempted to quantify specific performance measures of IDSs
and test these against a background of realistic network traffic.
The performance
measures used by these evaluations included a ratio of attack detection to false positives,
the capability to detect new and stealthy attacks, and the ability to accurately identify
attacks. The research also attempted to establish the reason each IDS failed to detect an
attack or generated a false positive. The testing process used a sample of generated
network traffic, audit logs, system logs and file system information. An identical data set
was used for all systems evaluated.
Three weeks of training data, composed of two weeks of background traffic with
no attacks and one week of data with a few attacks, were provided to participants to
support tuning and training. Locations of attacks in the training data were clearly labeled.
Participants then used two weeks of unlabeled test data which included 200 instances of
58 different attacks and were asked to provide a detailed list of hits or alerts using the
output of their intrusion detection systems.
At a minimum, the results were required to include the date, time, victim IP
address and a score for each putative attack detected. An alert could also optionally
indicate the attack category. Putative detections were counted as correct if the time of the
alert occurred within 60 seconds of the actual time of the event and referred to the correct
victim IP address. The score produced by a system was required to be a number that increased as the certainty of an attack at the specified time increased. Although all participants returned numbers ranging between zero and one, many participants produced binary (0 or 1) scores only.
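The matching criterion can be expressed compactly; the following sketch, with an illustrative record format, counts a putative detection as correct only if it names the right victim IP address and falls within 60 seconds of the attack:

    def score_detections(alerts, truth, window=60):
        # Count an alert as a correct detection if it names the right victim IP
        # and its timestamp falls within `window` seconds of the attack time.
        detected = set()
        false_alarms = 0
        for alert in alerts:        # e.g. {"time": 1234567.0, "victim": "172.16.112.50"}
            match = next((i for i, atk in enumerate(truth)
                          if atk["victim"] == alert["victim"]
                          and abs(atk["time"] - alert["time"]) <= window), None)
            if match is None:
                false_alarms += 1
            else:
                detected.add(match)
        return len(detected), false_alarms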
An initial analysis was performed to determine how well all systems taken
together detected attacks regardless of false alarm rates. Thirty-seven of the fifty-eight
attack types were detected well, but many stealthy and new attacks were always or
frequently missed. (Lippman et al., 1999) Attacks were detected best when they produced
a consistent “signature” or sequence of events in the data that was different from the
sequences produced for normal traffic. Systems that relied on rules or signatures missed
new attacks because signatures did not exist for these attacks, or because existing
signatures did not generalize to variants of old attacks, or to new or stealthy attacks.
This research will use the same test data and published procedures as the DARPA
1998 and 1999 IDS Evaluations. Scores for the neuro-fuzzy IDS with SNORT
augmentation will be evaluated and compared against the baseline results of the DARPA
study. Detecting unknown attacks is the most important and most challenging aspect of
any IDS. An additional capability to classify these unknown attacks into one of several categories and to assign a relative importance to each attack will allow an analyst to prioritize and investigate alerts according to their corresponding operational impact or
potential effect.
CHAPTER 6
IMPORTANCE OF THE RESEARCH
During a panel discussion at the 1998 Recent Advances in Intrusion Detection
Conference, Marc Wilkens identified three main issues applicable to deploying intrusion
detection systems that can be addressed by the proposed neuro-fuzzy approach (Frincke et al., 1998).
First, he stated that an IDS must be able to adapt to the environment and
detect constantly evolving attack patterns. Using XML tags to describe the structure of
an anomalous connection can simplify the identification of an attack or a variation of an
known attack with the aid of an existing self-organizing map. Each collection of XML
tags correlates to a vector of characteristics pertaining to the attack profile. When
executing a self-organizing map, an attack profile will immediately gravitate to those
areas that are most similar.
In addition to adaptability, an IDS must be able to integrate and interoperate with
multiple Intrusion Detection techniques and architectures (such as anomaly detection,
misuse detection, host-based systems and network-based systems) in order to provide real
business solutions. Numerous intrusion detection systems are available in the market and
different sites will no doubt select different vendors. Since incidents are often distributed
over multiple sites, it is highly likely that different aspects of a single incident will be
visible to different systems. Thus it would be advantageous for diverse intrusion
detection systems to be able to share data on attacks in progress. In addition, it should
also be able to integrate with other tools that are already in use in the network. To meet
this goal, IDSs can utilize the SNML data formats and exchange procedures outlined by
the Intrusion Detection Working Group (IDWG, 2002) for sharing information of interest
to intrusion detection and response systems, and to management systems that may need to
interact with them.
Finally, this need to share information between and among components highlights
the importance of having standards for characterization, storage and exchange of data
about attack intrusions, vulnerabilities and evidence. The standard XML template can be
used by any computer security mechanism that can create a formatted alert or system log
entry. The CERT created this language to exchange information about suspicious events.
The current DTD and SNML Exchange Requirements (IDWG, 2004) will describe
constraints and limitations that apply to the construction and transportation of Intrusion
Detection Message Exchange Formats. The final document will also address semantics
and context relating to these messages.
CHAPTER 7
SCHEDULE AND RECOMMENDATIONS
While many of the tools needed for this project are readily available, they are
complicated and require significant familiarization with MATLAB. To make this project
a little less daunting, it will be broken down into successive phases that will facilitate the
construction of the final IDS while examining the capabilities of each tool separately.
The first step will be to create and parse an XML tagged CVE dataset and use the
resulting vectors to create a SOM. New CVE entries will be categorized according to the
class to which the sample vector is most similar. The next phase will be a similar process
using an XML tagged SNORT output log to create a SOM representing normal network
traffic. The biggest challenge during this phase is to create a suitable vector to meet the
needs of anomaly detection. The final phase will incorporate the fuzzy logic capabilities
and thereby create a final product. The 1998 and 1999 DARPA Evaluation test sets will
be used for both the second and third phases.
REFERENCES
Barbara, D. (2001). ADAM: Detecting Intrusions by Data Mining, Proceedings of the
2001 IEEE Workshop on Information Assurance and Security (p. 11-16). West Point,
NY, Jun 6-7.
Barbara, D. (2003). Bootstrapping a Data Mining Intrusion Detection System,
Proceedings of the 2003 ACM Symposium on Applied Computing (SAC). Melbourne,
FL, Mar 9-12.
Bridges, S. & Vaughn, R. (2000). Fuzzy Data Mining and Genetic Algorithms Applied
to Intrusion Detection, Proceedings of the 23rd National Information Systems Security
Conference (NISSC). Baltimore, Maryland, Oct 16-19.
Caswell, B., Beale, J., Foster, J., & Faircloth, J. (2003). Snort 2.0. Rockland, MA:
Syngress.
DARPA Intrusion Detection Evaluation (2001), Data Sets Overview. Downloaded from
http://www.ll.mit.edu/IST/ideval/data/data_index.html.
Denning, D. (1987, Feb). An Intrusion-Detection Model. IEEE Transactions on
Software Engineering, 13 (2), 222-232.
Dittenbach, M., Rauber, A., & Merkl, D. (2002, Oct). Uncovering the Hierarchical
Structure in Data Using the Growing Hierarchical Self-Organizing Map.
Neurocomputing, 48(1-4), 199-216.
Elkan, C. (1999, Sep). Results of the KDD-99 Classifier Learning Contest. Downloaded
from http://www.cs.ucsd.edu/users/elkan/clresults.html.
Ertoz, L., Eilertson, E., Lazarevic, A., Tan, P., Srivastava, J., Kumar, V., & Dokas, P. (2004)
The MINDS – Minnesota Intrusion Detection System, accepted for the book Next Generation
Data Mining. Downloaded from http://www.cs.umn.edu/research/minds/MINDS_papers.htm
Eskin, E., Arnold, A., Prerau, M., Portnoy, L. & Stolfo, S. (2002). A Geometric
Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled
Data. In D. Barbara & S. Jajodia (Eds.), Data Mining for Security Applications.
Boston: Kluwer.
Frincke, D., Levitt, K., Miqueu, M., Quisquater, J., Milkens, M., & Ziese, K., (1998).
Recent Advances in Intrusion Detection, Panel: Intrusion Detection in the Large.
Louvain-la-Neuve, Belgium, Sep 16.
Han, J. & Kamber, M. (2001). Data Mining. San Francisco: Morgan Kaufmann.
Haykin, S. (1999). Neural Networks. New Jersey: Prentice Hall.
Horikawa, S. (1997). Fuzzy Classification System Using Self-Organizing Feature Map.
OKI Technical Review, No 159, Vol 63.
Ilgun, K. (1993). USTAT: A Real-Time Intrusion Detection System for UNIX,
Proceedings of the 1993 Computer Security Symposium on Research in Security and
Privacy, May 24-26, 1993, Los Alamitos, CA. IEEE Computer Society Press.
Intrusion Detection Working Group. Intrusion Detection Exchange Format Charter
(2004). Downloaded from http://www.ietf.org/html.charters/idwg-charter.html.
Intrusion Detection Working Group (2002). Intrusion Detection Exchange Format
Requirements. Downloaded from http://www.ietf.org/internet-drafts/draft-ietf-idwg-requirements-10.txt.
Kendall, K. (1999). A Database of Computer Attacks for the Evaluation of Intrusion
Detection Systems. Master’s thesis, Massachusetts Institute of Technology,
Department of Electrical Engineering and Computer Science, Cambridge, MA, June
1999.
Lazarevic, A., Ertoz, L., Kumar, V., Ozgur, A., & Srivastava, J. (2003). A Comparative
Study of Anomaly Detection Schemes in Network Intrusion Detection. Proceedings
of the SIAM International Conference on Data Mining, May 1-3, 2003. San
Francisco, CA.
Lippman, R. et al. (2000). Evaluating Intrusion Detection Systems: The DARPA Off-Line Intrusion Detection Evaluation, Proceedings of the DARPA Information
Survivability Conference and Exposition, Jan 25-27, 2000. Los Alamitos, CA, IEEE
Computer Society Press.
Lunt, T. (1993, Jun). A Survey of Intrusion Detection Techniques. Computers and
Security, 12 (4), 405-418.
Martin R. (2001, Nov). Managing Vulnerabilities in Networked Systems, IEEE
Computer Society Magazine, 34 (11), 32-38.
McGarry, S., Wermter, S., & MacIntyre, J. (1999) Hybrid Neural Systems: From Simple
Coupling to Fully Integrated Neural Network, Neural Computing Surveys, 2.
Mostow, J., Roberts, J., & Bott, J. Integration of an Internet Attack Simulator in an
HLA Environment, Proceedings of the 2000 IEEE Workshop on Information
Assurance and Security (p. 162–168). West Point, NY, Jun 6-7.
MITRE. (2002). Common Vulnerabilities and Exposures: The Key to Information
Sharing. Downloaded from http://cve.mitre.org/docs/
Nguyen, B. (2002). Self-Organizing Map and Genetic Algorithm for Intrusion Detection
System, Final Project for CS680: Advanced Topics in Artificial Intelligence, Spring
2002. Downloaded from http://132.235.28.162/bnguyen/papers/IDS_SOM.pdf.
Rhodes, B., Mahaffey, J., & Cannady, J. (2000). Multiple Self-Organizing Maps for
Intrusion Detection, Proceedings of the 23rd National Information Systems Security
Conference (NISSC) (pp. ?? - ?? ). Baltimore, Maryland, Oct 16-19.
Roesch, M. & Green, C. (2002), SNORT Users Manual. Downloaded from
http://www.snort.org/docs/
Sundaram, A. (1996). An Introduction to Intrusion Detection. Downloaded from
http://www.acm.org/crossroads/xrds2-4/xrds2-4.html
Valdes, A. & Skinner, A. (2000, Oct). Adaptive, Model-based Monitoring for Cyber
Attack Detection. In H. Debar, L. Me, & F. Wu (Eds.), Recent Advances in Intrusion
Detection (RAID). Toulouse, France: Springer-Verlag, 80-92.
Zadeh, L.A. (1994, March). Fuzzy Logic, Neural Networks, and Soft Computing,
Communications of the ACM, 37 (3), 77.