Anomaly Detection Systems
Contents
• Statistical methods
• Systems with learning
• Clustering in anomaly detection systems
2/87
Anomaly detection
• Anomaly detection involves a process of
establishing profiles of normal behaviour,
comparing actual user/network behaviour
to those profiles, and flagging deviations
from the normal.
• The basis of anomaly detection is the
assertion that abnormal behaviour
patterns indicate misuse of systems.
3/87
Anomaly detection
• Profiles are defined as sets of metrics.
Metrics are measures of particular aspects
of user behaviour.
• Each metric is associated with a threshold
or range of values.
4/87
Anomaly detection
• Anomaly detection depends on an
assumption that users exhibit predictable,
consistent patterns of system usage.
• The approach also accommodates
adaptations to changes in user behaviour
over time.
5/87
Anomaly detection
• The completeness of anomaly detection
has yet to be verified (no one knows
whether any given set of metrics is rich
enough to express all anomalous
behaviour).
6/87
Statistical methods
• Parametric methods
– Analytical approaches in which assumptions
are made about the underlying distribution of
the data being analyzed.
– The usual assumption is that the distributions
of usage patterns are Gaussian:
f x  
1
 2

x  x0 2

e
2
2
x0 – mean
 - standard deviation
7/87
Statistical methods
• Non-parametric methods
– Involve nonparametric data classification
techniques - cluster analysis.
8/87
Statistical methods
• Denning's model (the IDES model for
intrusion detection).
– Four statistical models may be included in the
system:
• Operational model
• Mean and standard deviation model
• Multivariate model
• Markov process model.
– Each model is suitable for a particular type of
system metric.
9/87
Statistical methods
• Operational model
– This model applies to metrics such as event
counters for the number of password failures
in a particular time interval.
– The model compares the metric to a set
threshold, triggering an anomaly when the
metric exceeds the threshold value.
10/87
Statistical methods
• Mean and standard deviation model
– A classical mean and standard deviation
characterization of data.
– The assumption is that all the analyzer knows
about the system behaviour metrics is their
means and standard deviations.
11/87
Statistical methods
• Mean and standard deviation model
(cont.)
– A new behaviour observation is defined to be
abnormal if it falls outside a confidence
interval.
– This confidence interval is defined as d
standard deviations from the mean for some
parameter d (usually d=3).
12/87
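The d-standard-deviations rule can be sketched as follows (a minimal illustration in Python, not code from any of the systems discussed; the history values are hypothetical):

```python
# Mean and standard deviation model: an observation is anomalous if it falls
# more than d standard deviations from the mean of the historical values.
def is_anomalous(history, observation, d=3):
    n = len(history)
    mean = sum(history) / n
    var = sum((x - mean) ** 2 for x in history) / n
    std = var ** 0.5
    return abs(observation - mean) > d * std

# Example: CPU seconds per session; 100 lies far outside the historical range.
history = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15]
print(is_anomalous(history, 100))  # flagged as abnormal
print(is_anomalous(history, 14))   # within the confidence interval
```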
Statistical methods
• Mean and standard deviation model
(cont.)
– This characterization is applicable to event
counters, interval timers, and resource
measures (memory, CPU, etc.)
– It is possible to assign weights to these
computations, such that more recent data are
assigned greater weights.
13/87
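One way to weight recent data more heavily, sketched here with an exponentially weighted moving average (a hypothetical illustration; the actual weighting scheme is system-specific):

```python
# Exponentially weighted moving average: weight alpha goes to the newest
# observation, and older observations decay geometrically.
def ewma(values, alpha=0.3):
    avg = values[0]
    for x in values[1:]:
        avg = alpha * x + (1 - alpha) * avg
    return avg

values = [10, 10, 10, 50]   # a recent jump in activity
# The EWMA is pulled toward the recent value more than the plain mean is.
print(ewma(values))
print(sum(values) / len(values))
```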
Statistical methods
• Multivariate model
– This is an extension to the mean and
standard deviation model.
– It is based on performing correlations among
two or more metrics.
– Instead of basing the detection of an anomaly
strictly on one measure, one might base it on
the correlation of that measure with another
measure.
14/87
Statistical methods
• Multivariate model (cont.)
– Example:
• Instead of detecting an anomaly based solely on
the observed length of a session, one might base it
on the correlation of the length of the session with
the number of CPU cycles utilized.
15/87
Statistical methods
• Markov process model
– Under this model, the detector considers each
different type of audit event as a state variable
and uses a state transition matrix to
characterize the transition frequencies
between states (not the frequencies of the
individual states/audit records).
16/87
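An illustrative sketch of this idea: estimate the state transition matrix from a training sequence of audit-event types, then flag a transition whose estimated probability is below a threshold (event names and the threshold are hypothetical):

```python
# Markov process model sketch: transition frequencies between audit-event
# types, not frequencies of the individual events.
from collections import defaultdict

def transition_probs(events):
    counts = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(events, events[1:]):
        counts[prev][cur] += 1
    probs = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        probs[prev] = {cur: c / total for cur, c in nexts.items()}
    return probs

def is_anomalous_transition(prev, cur, probs, threshold=0.1):
    # A transition is suspicious if its estimated probability is too low.
    return probs.get(prev, {}).get(cur, 0.0) < threshold

# Hypothetical audit-event training sequence:
train = ["login", "read", "write", "read", "write", "logout",
         "login", "read", "write", "logout"]
P = transition_probs(train)
print(is_anomalous_transition("login", "read", P))    # frequent transition
print(is_anomalous_transition("login", "logout", P))  # unseen in training
```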
Statistical methods
• Markov process model (cont.)
– A new observation is defined as anomalous if
its probability, as determined by the previous
state and value in the state transition matrix,
is too low/high.
– This allows the detector to spot unusual
command or event sequences, not just single
events.
– This introduces the notion of performing
stateful analysis of event streams (frequent
episodes, etc.)
17/87
Statistical methods
• Markov process model (cont.)
– Example - NIDES (Next-generation Intrusion
Detection Expert System)
• Developed by SRI International (formerly the
Stanford Research Institute) in the 1990s.
• Measures various activity levels.
• Combines these into a single “normality” measure.
• Checks this against a threshold.
• If the measure is above the threshold, the activity
is considered abnormal.
18/87
Statistical methods
• Markov process model (cont.)
– Example - NIDES (cont.)
• NIDES measures
– Intensity measures
» An example would be the number of audit records
(log entries) generated within a set time interval.
» Several different time intervals are used in order to
track short-, medium-, and long-term behaviour.
19/87
Statistical methods
• Markov process model (cont.)
– Example - NIDES (cont.)
• NIDES measures (cont.)
– Distribution measures
» The overall distribution of the various audit records
(log file entries) is tracked via histograms.
» A difference measure is defined to determine how
close a given short-term histogram is to “normal”
behaviour.
20/87
Statistical methods
• Markov process model (cont.)
– Example - NIDES (cont.)
• NIDES measures (cont.)
– Categorical data
» The names of files accessed or the names of remote
computers accessed are examples of categorical
data used.
21/87
Statistical methods
• Markov process model (cont.)
– Example - NIDES (cont.)
• NIDES measures (cont.)
– Counting measures
» These are numerical values that measure
parameters such as the number of seconds of CPU
time used.
» They are generally taken over a fixed amount of time
or over a specific event, such as a single login.
» Thus, they are similar in character to intensity
measures, although they measure a different kind of
activity.
22/87
Statistical methods
• Markov process model (cont.)
– Example - NIDES (cont.)
• The different measurements each define a
statistic Sj .
• These measurements are assumed (constructed
to be) appropriate (this includes normalization),
and are combined to produce a χ²-like statistic:

T^2 = \frac{1}{n} \sum_{j=1}^{n} S_j^2
23/87
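The combined statistic is straightforward to compute once the per-measure statistics are normalized (the S values below are hypothetical):

```python
# Chi-square-like combination of n normalized measure statistics S_j:
#     T^2 = (1/n) * sum_j S_j^2
def t_squared(S):
    return sum(s * s for s in S) / len(S)

# Hypothetical normalized statistics for four NIDES-style measures:
S = [0.5, -1.2, 0.3, 2.0]
print(t_squared(S))  # (0.25 + 1.44 + 0.09 + 4.0) / 4
```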
Statistical methods
• Markov process model (cont.)
– Example - NIDES (cont.)
• A more complicated measure would include the
correlation between the events (as was done with
IDES):
IS  S1 ,, Sn C1 S1 ,, Sn 
T
• Here, C is the correlation matrix between Si and Sj
for all i and j. IS is called the IDES score.
24/87
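For two measures the IDES score can be sketched with a hand-inverted 2x2 correlation matrix (the statistic values and correlation are illustrative, not from IDES):

```python
# IDES score IS = S * C^{-1} * S^T for the two-measure case.
def ides_score_2d(S, C):
    (a, b), (c, d) = C
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # 2x2 matrix inverse
    # Row vector S times C^{-1}:
    t0 = S[0] * inv[0][0] + S[1] * inv[1][0]
    t1 = S[0] * inv[0][1] + S[1] * inv[1][1]
    # ... times the column vector S^T:
    return t0 * S[0] + t1 * S[1]

S = [1.0, 2.0]
C = [[1.0, 0.5], [0.5, 1.0]]   # correlation between S1 and S2
print(ides_score_2d(S, C))     # ≈ 4.0
```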
Statistical methods
• Markov process model (cont.)
– Example - NIDES (cont.)
• NIDES compares recent activity with past activity,
using a methodology that amounts to a sliding
window on the past.
• Thus it is designed to detect changes in activity
and to adapt to new activity levels.
25/87
Statistical methods
• Markov process model (cont.)
– Example - NIDES (cont.)
• NIDES intensity measures are counts of audit
records per time unit etc.
• This provides an overall activity level for the
system.
• These are updated continuously rather than
recomputed at each time interval.
26/87
Statistical methods
• Markov process model (cont.)
– Example - NIDES (cont.)
• Possible elements that can be monitored with this
basic idea:
– Average system load.
– Number of active processes.
– Number of E-mails received.
– Different types of audit records (can be tracked
separately).
27/87
Statistical methods
• Markov process model (cont.)
– Example - NIDES (cont.)
• The obvious extension of the intensity measures
idea is to track the different types of audit records.
• This leads to a distribution (histogram) for the audit
records.
• Similarly, one could track the sizes of E-mail
messages received, or the types of files accessed.
• These can be updated continuously.
• Distributions are then compared by means of a
squared error metric.
28/87
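The squared-error comparison of a short-term histogram against the long-term one can be sketched as follows (the bin layout and frequencies are hypothetical):

```python
# Squared-error distance between two histograms over the same bins,
# here given as relative frequencies of four audit-record types.
def squared_error(hist_a, hist_b):
    return sum((a - b) ** 2 for a, b in zip(hist_a, hist_b))

long_term  = [0.50, 0.30, 0.15, 0.05]
short_term = [0.10, 0.20, 0.30, 0.40]   # markedly different mix of records
print(squared_error(long_term, short_term))  # large error -> abnormal mix
```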
Statistical methods
• Markov process model (cont.)
– Example - NIDES (cont.)
• Categorical measures can be for example the
names of files accessed.
• They are treated just like distributional measures.
• Now each bin corresponds to a category, while
with distributional measures a bin can
correspond to a range of values.
• The updates are still done continuously.
29/87
Statistical methods
• Markov process model (cont.)
– Example - NIDES (cont.)
• All the measures are combined into the T2 statistic.
• The value is compared with a threshold to
determine if the activity is “abnormal”.
• The threshold is usually set empirically, based on
the observed network behaviour in some period of
time.
30/87
Statistical methods
• Markov process model (cont.)
– Example - NIDES (cont.)
• NIDES produces a single, overall measure of
“normality”, which could allow further investigation
into the components that make up the statistic
upon an alert.
• The problem with this is that an unusually low
value for one statistic can mask a high one for
another – multifaceted measures are more useful.
31/87
Statistical methods
• Advantages of parametric approach
– Statistical anomaly detection could reveal
interesting, sometimes suspicious, activities
that could lead to discoveries of security
breaches.
32/87
Statistical methods
• Advantages of parametric approach (cont.)
– Parametric statistical systems do not require
the constant updates and maintenance that
misuse detection systems do.
– However, metrics must be well chosen,
adequate for good discrimination, and
well-adapted to changes in behaviour (that is,
changes in user behaviour must produce a
consistent, noticeable change in the
corresponding metrics).
33/87
Statistical methods
• Disadvantages of parametric approach
– Batch mode processing of audit records,
which eliminates the capability to perform
automated responses to block damage.
– Although more recent systems attempt to
perform real-time analysis of audit data, the
memory and processing loads involved in
using and maintaining the user profile
knowledge base usually cause the system to
lag behind audit record generation.
34/87
Statistical methods
• Disadvantages of parametric approach
(cont.)
– The nature of statistical analysis reduces the
capability of taking into account the sequential
relationships between events.
– The exact order of the occurrence of events is
not provided as an attribute in most of these
systems.
35/87
Statistical methods
• Disadvantages of parametric approach
(cont.)
– Because many anomalies indicating attack
depend on such sequential event
relationships, this situation represents a
serious limitation to the approach.
– In cases when quantitative methods
(Denning's operational model) are utilized, it is
also difficult to select appropriate values for
thresholds and ranges.
36/87
Statistical methods
• Disadvantages of parametric approach
(cont.)
– The false positive rates associated with
statistical analysis systems are high, which
sometimes leads to users ignoring or
disabling the systems.
– The false negative rates are also difficult to
reduce in these systems.
37/87
Statistical methods
• Nonparametric measures
– One of the problems of parametric methods is
that error rates are high when the
assumptions about the distribution are
incorrect.
38/87
Statistical methods
• Nonparametric measures (cont.)
– When researchers began collecting
information about system usage patterns that
included attributes such as system resource
usage, the distributions were discovered not
to be normal.
– Then, including normal distribution
assumption into the measures led to high
error rates.
39/87
Statistical methods
• Nonparametric measures (cont.)
– A way of overcoming these problems is to
utilize nonparametric techniques for
performing anomaly detection.
– This approach provides the capability of
accommodating users with less predictable
usage patterns and allows the analyzer to
take into account system measures that are
not easily accommodated by parametric
schemes.
40/87
Statistical methods
• Nonparametric measures (cont.)
– The nonparametric approach involves
nonparametric data classification techniques,
specifically cluster analysis.
– In cluster analysis, large quantities of
historical data are collected (a sample set)
and organized into clusters according to some
evaluation criteria.
41/87
Statistical methods
• Nonparametric measures (cont.)
– Preprocessing is performed in which features
associated with a particular event stream
(often mapped to a specific user) are
converted into a vector representation (for
example, Xi = [f1, f2, ..., fn] in an n-dimensional
state).
42/87
Statistical methods
• Nonparametric measures (cont.)
– A clustering algorithm is used to group vectors
into classes by behaviours, attempting to
group them so that members of each class
are as close as possible to each other while
different classes are as far apart as they can
be.
43/87
Statistical methods
• Nonparametric measures (cont.)
– In nonparametric statistical anomaly
detection, the premise is that a user's activity
data, as expressed in terms of the features,
falls into two distinct clusters: one indicating
anomalous activity and the other indicating
normal activity.
44/87
Statistical methods
• Nonparametric measures (cont.)
– Various clustering algorithms are available.
These range from algorithms that use simple
distance measures to determine whether an
object falls into a cluster, to more complex
concept-based measures (in which an object
is "scored" according to a set of conditions
and that score is used to determine
membership in a particular cluster).
– Different clustering algorithms usually best
serve different data sets and analysis goals.
45/87
Statistical methods
• Nonparametric measures (cont.)
– The advantages of nonparametric approaches
include the capability of performing reliable
reduction of event data (in the transformation
of raw event data to vectors).
– This reduction may reach two orders of
magnitude compared to classical approaches
that operate on the raw event data.
46/87
Statistical methods
• Nonparametric measures (cont.)
– Other benefits are improvement in the speed
of detection and improvement in accuracy
over parametric statistical analysis.
– Disadvantages involve concerns that
expanding features beyond resource usage
would reduce the efficiency and the accuracy
of the analysis.
47/87
Systems with learning
• Two phases of system operation:
– The learning phase, in which the system is
taught what normal behaviour is.
– The recognition phase, in which the system
classifies the input vectors according to the
knowledge acquired in the learning process.
– These systems also include a conversion of
raw data into feature vectors.
48/87
Systems with learning
• Example: Neural networks
– Neural networks use adaptive learning
techniques to characterize anomalous
behaviour.
– This analysis technique operates on historical
sets of training data, which are presumably
cleansed of any data indicating intrusions or
other undesirable user behaviour.
49/87
Systems with learning
• Example: Neural networks (cont.)
– Neural networks consist of numerous simple
processing elements called neurons that
interact by using weighted connections.
– The knowledge of a neural network is
encoded in the structure of the net in terms of
connections between units and their weights.
– The actual learning process takes place by
changing weights and adding or removing
connections.
50/87
Systems with learning
• Example: Neural networks (cont.)
– Neural network processing involves two
stages.
• In the first stage, the network is populated by a
training set of historical or other sample data that is
representative of user behaviour.
• In the second stage, the network accepts event
data and compares it to historical behaviour
references, determining similarities and
differences.
52/87
Systems with learning
• Example: Neural networks (cont.)
– The network indicates that an event is
abnormal by changing the state of the units,
changing the weights of connections, adding
connections, or removing them.
– The network also modifies its definition of
what constitutes a normal event by performing
stepwise corrections.
53/87
Systems with learning
• Example: Neural networks (cont.)
– Neural networks make no prior assumptions
about the expected statistical distribution of
the metrics, so this method retains some of
the advantages that statistical nonparametric
techniques have over classical statistical
analysis.
54/87
Systems with learning
• Example: Neural networks (cont.)
– Among the problems associated with utilizing
neural networks for intrusion detection is a
tendency to form mysterious unstable
configurations in which the network fails to
learn certain things for no apparent reason.
55/87
Systems with learning
• Example: Neural networks (cont.)
– The major drawback to utilizing neural
networks for intrusion detection is that neural
networks don't provide any explanation of the
anomalies they find.
56/87
Systems with learning
• Example: Neural networks (cont.)
– This impedes the ability of users to
establish accountability or otherwise address
the root security problems that allowed
the detected intrusion.
– This makes neural networks poorly suited to
the needs of security managers.
57/87
Systems with learning
• General problems related to all systems
with learning
– The problem with all learning-based
approaches is in the fact that the
effectiveness of the approach depends on the
quality of the training data.
– In learning-based systems, the training data
must reflect normal activity for the users of the
system.
58/87
Systems with learning
• General problems related to all systems
with learning (cont.)
– This approach may not be comprehensive
enough to reflect all possible normal user
behaviour patterns.
– This weakness produces a large false positive
error rate.
– The error rate is high because an event that
does not completely match the learnt
profile often (though not always) generates
a false alarm.
59/87
Clustering in anomaly detection
• Clustering definition:
– “Cluster analysis is the art of finding groups in
data”
– The aim: group the given objects in such a
way that the objects within a group are
mutually similar and at the same time
dissimilar from other groups.
60/87
Clustering in anomaly detection
• Formal definition:
– Let P be a set of vectors, whose cardinality
is m, and whose elements are p1,…,pm , of
dimensions n1,…,nm , respectively.
– The task: partition, optimizing a partition
criterion, the set P into k subsets P1,…,Pk ,
such that the following holds:
P1  P2    Pk  P
Pi  Pj  , i, j  1,2,, k , i  j
61/87
Clustering in anomaly detection
[Diagram: incoming traffic/logs → data pre-processor → activity data →
detection algorithm (clustering!), using detection model(s) → alerts →
alert filter, using decision criteria → action/report]
62/87
Clustering in anomaly detection
• Why should we do clustering instead of
supervised learning?
– Labelling a large set of samples is often costly.
– Very large data sets – train the system with a
large amount of unlabelled data and then label
with supervision.
– Track slow changes of patterns over time without
supervision – improves performance.
– Smart feature extraction.
– Initial exploratory data analysis.
63/87
Clustering in anomaly detection
• Appropriate cluster analysis algorithms:
– Two main classes of clustering algorithms
• Hierarchical
• Non-hierarchical (partitioning)
– Hierarchical
• Less efficient
• More biased results in general
– Non-hierarchical
• Results often depend on the initial partition.
64/87
Clustering in anomaly detection
• Appropriate cluster analysis algorithms
(cont.)
– A trade-off between correctness and efficiency
of the CA algorithm must be found in order to
achieve the real-time operation of an IDS.
– K-means algorithm – could be a good
candidate for implementation in IDS.
65/87
Clustering in anomaly detection
• Appropriate cluster analysis algorithms
(cont.)
– An outline of the K-means algorithm
1. Initialization: Randomly choose K vectors from the
data set and make them initial cluster centres.
2. Assignment: Assign each vector to its closest
centre.
3. Updating: Replace each centre with the mean of its
members.
4. Iteration: Repeat steps 2 and 3 until there is no
more updating.
66/87
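The four steps of the outline above can be sketched as follows (a minimal 1-D illustration, not an IDS implementation; the data points are hypothetical):

```python
# Minimal K-means following the outline: initialize, assign, update, iterate.
import random

def kmeans(points, k, seed=0):
    rnd = random.Random(seed)
    centres = rnd.sample(points, k)                      # 1. initialization
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                                  # 2. assignment
            i = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centres[i]       # 3. updating
               for i, c in enumerate(clusters)]
        if new == centres:                                # 4. iterate until stable
            return centres, clusters
        centres = new

points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centres, clusters = kmeans(points, 2)
print(sorted(centres))   # roughly [1.0, 10.0] for this well-separated data
```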
Clustering in anomaly detection
• K-means algorithm
– A local optimization algorithm – hill climbing.
– Clustering depends on initial centres, but
this can be overcome in several ways.
– Time complexity linear in the number of
input vectors.
67/87
Clustering in anomaly detection
• Problems to solve
– Determine the number of clusters
– Determine the appropriate distance measure
68/87
Clustering in anomaly detection
• Determine the number of clusters
– 2 clusters if we want only to tell “abnormal”
from “normal” behaviour.
– More complex clustering evaluation
algorithms should be used to detect the
number of clusters at which the most
compact and separated clusters are
obtained.
– Use hierarchical clustering + clustering
evaluation algorithms (inefficient).
69/87
Clustering in anomaly detection
• Determine the appropriate distance
measure
– It must be a metric:
• ∀a,b: d(a,b) ≥ 0
• ∀a,b: d(a,b) = 0 ⇔ a = b
• ∀a,b: d(a,b) = d(b,a)
• ∀a,b,c: d(a,c) ≤ d(a,b) + d(b,c), i.e. the triangle
inequality must hold.
70/87
Clustering in anomaly detection
• Determine the appropriate distance
measure (cont.)
– Typical metrics:
• For equal-length input vectors – the Minkowski
metric.
• For unequal-length input vectors – the edit distance
(which is also a metric).
71/87
Clustering in anomaly detection
• Minkowski metric

d(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q}

• q = 1: Manhattan (city block) distance
• q = 2: Euclidean distance
72/87
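A direct rendering of the formula (the sample vectors are illustrative):

```python
# Minkowski distance for equal-length vectors:
#     d(X, Y) = (sum_i |x_i - y_i|^q)^(1/q)
def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))  # Manhattan distance: 7.0
print(minkowski(x, y, 2))  # Euclidean distance: 5.0
```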
Clustering in anomaly detection
• Edit distance
– Elementary edit operations
• Deletions
• Insertions
• Substitutions
– Minimum number of elementary edit
operations needed to transform one vector
into another.
– Computed recursively, by filling the matrix of
partial edit distances – edit distance matrix.
– The definition can include constraints.
73/87
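The recursive computation over the matrix of partial edit distances can be sketched as follows (unit costs, no constraints; the example sequences are hypothetical):

```python
# Edit distance via the matrix of partial edit distances (dynamic programming),
# with unit cost for deletions, insertions, and substitutions.
def edit_distance(a, b):
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                               # delete all of a[:i]
    for j in range(n + 1):
        D[0][j] = j                               # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution
    return D[m][n]

# Works on strings or on event sequences of unequal length:
print(edit_distance("kitten", "sitting"))                          # 3
print(edit_distance(["open", "read"], ["open", "write", "read"]))  # 1
```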
Clustering in anomaly detection
• Labelling clusters
– A way to determine which cluster contains
normal instances and which contains attacks.
– 1st assumption
• Associate the label “normal” with the cluster of the
greatest cardinality.
• Fails with massive attacks, such as a
SYN flood attack.
• Fails with KDD cup data without filtering out the
attacks.
74/87
Clustering in anomaly detection
• To label properly, we need to explore the
structure of the clusters.
• The clustering quality criteria are used,
combined with some characteristics of
clusters:
– Silhouette index
– Davies-Bouldin index
– Dunn’s index
– Clusters’ diameters
75/87
Clustering in anomaly detection
• Intra-cluster distance
– The measure of compactness of a cluster
(complete diameter, average diameter,
centroid diameter)
• Inter-cluster distance
– The measure of separation between clusters
(single linkage, complete linkage, average
linkage, centroid linkage).
76/87
Example – Davies-Bouldin
• Data set for clustering: X = \{X_1, \ldots, X_N\}
• Clustering into L clusters: C = \{C_1, \ldots, C_L\}
• Distance between the vectors X_k and X_l: d(X_k, X_l)
77/87
Example – Davies-Bouldin
• Davies-Bouldin index:

DB(C) = \frac{1}{L} \sum_{i=1}^{L} \max_{j \neq i} \frac{\Delta(C_i) + \Delta(C_j)}{\delta(C_i, C_j)}

• Inter-cluster distance: \delta(C_i, C_j)
• Intra-cluster distance: \Delta(C_i)
78/87
Example – Davies-Bouldin
• Intra-cluster distance – Centroid diameter

\Delta(C_i) = \frac{2}{|C_i|} \sum_{X_k \in C_i} d(X_k, s_{C_i}), \qquad
s_{C_i} = \frac{1}{|C_i|} \sum_{X_k \in C_i} X_k
79/87
Example – Davies-Bouldin
• Inter-cluster distance – Centroid linkage

\delta(C_i, C_j) = d(s_{C_i}, s_{C_j}), \qquad
s_{C_i} = \frac{1}{|C_i|} \sum_{X_k \in C_i} X_k, \quad
s_{C_j} = \frac{1}{|C_j|} \sum_{X_k \in C_j} X_k
80/87
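The definitions above can be combined into a short sketch (1-D clusters for brevity; the sample data are hypothetical):

```python
# Davies-Bouldin index using the centroid diameter as intra-cluster distance
# and centroid linkage as inter-cluster distance, as defined above.
def centroid(c):
    return sum(c) / len(c)

def centroid_diameter(c):
    s = centroid(c)
    return 2.0 * sum(abs(x - s) for x in c) / len(c)

def davies_bouldin(clusters):
    L = len(clusters)
    total = 0.0
    for i in range(L):
        total += max(
            (centroid_diameter(clusters[i]) + centroid_diameter(clusters[j]))
            / abs(centroid(clusters[i]) - centroid(clusters[j]))
            for j in range(L) if j != i)
    return total / L

tight, loose = [1.0, 1.1, 0.9], [10.0, 12.0, 8.0]
print(davies_bouldin([tight, loose]))  # small: compact, well-separated clusters
```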
Clustering in anomaly detection
• A cluster labelling algorithm:
– Uses a combination of the Davies-Bouldin
index of the clustering and the centroid
diameters of the clusters.
– Two clusters: “normal” and “abnormal”.
– Main idea:
• Attack vectors are often mutually very similar, if not
identical.
• Consequently, the attack cluster in the case of a
massive attack is very compact.
81/87
Clustering in anomaly detection
– Main idea (continued):
• The Davies-Bouldin index of such a clustering is
either zero (non-attack cluster is empty) or very
close to zero.
• The expected value of the centroid diameter of the
attack cluster is smaller than that of the non-attack
cluster.
82/87
Clustering in anomaly detection
– Main idea (continued):
• Small value of the Davies-Bouldin index indicates
the existence of a massive attack
• Small value of the centroid diameter indicates the
attack cluster.
83/87
Clustering in anomaly detection
– Main idea (continued):
• A higher value of the Davies-Bouldin index
indicates that no massive attack is taking place.
• Then the attack cluster is expected to be less
compact than the non-attack cluster, i.e. its
centroid diameter is greater than that of the nonattack cluster (because non-massive attack
vectors are very different in general).
• In this case, even the cluster cardinality can be
used for proper labelling.
84/87
Clustering in anomaly detection
• The cluster labelling algorithm:
– Input:
• A clustering C of N vectors into 2 clusters, C1 and
C2; C1 is the “non-attack” cluster, labelled with “1”.
• The Davies-Bouldin index threshold, ΔDB.
• The centroid diameter difference thresholds, ΔCD1
and ΔCD2.
– Output:
• The input clustering, relabelled if any of the
relabelling conditions is met.
85/87
Clustering in anomaly detection
• The cluster labelling algorithm (cont.):
db  = DaviesBouldinIndex(C);
cd1 = CentroidDiameter(C1);
cd2 = CentroidDiameter(C2);
if (db == 0) && (cd2 == 0)
    Relabel(C);
else if (db > DeltaDB) && (cd1 > cd2 + DeltaCD1)
    Relabel(C);
else if (db < DeltaDB) && (cd1 + DeltaCD2 < cd2)
    Relabel(C);
86/87
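A runnable Python rendering of the pseudocode above; the threshold values used in the example calls are hypothetical illustrations, not tuned values from the slides:

```python
# Cluster labelling decision. C1 is the cluster initially labelled
# "non-attack"; cd1/cd2 are the centroid diameters of C1/C2 and db is the
# Davies-Bouldin index of the clustering.
def should_relabel(db, cd1, cd2, delta_db, delta_cd1, delta_cd2):
    if db == 0 and cd2 == 0:
        # Degenerate clustering: C2 collapses to identical vectors
        # (a massive attack of identical records).
        return True
    if db > delta_db and cd1 > cd2 + delta_cd1:
        # No massive attack, yet C1 is the less compact cluster,
        # which suggests C1 actually holds the attack vectors.
        return True
    if db < delta_db and cd1 + delta_cd2 < cd2:
        # Massive attack: the very compact cluster is the attack
        # cluster, and here that is C1.
        return True
    return False

# Hypothetical thresholds, applied to values like those in the example table:
print(should_relabel(0.14, 69.6, 7096.34, 0.5, 0.0, 0.0))     # relabel
print(should_relabel(0.96, 4344.63, 14158.54, 0.5, 0.0, 0.0)) # keep labels
```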
Example – KDD Cup database
• Sample size: N=1000
• Number of clusters: 2
Record No.  | DB   | CD1      | CD2      | Intrusion | Good labelling (K-means) | Relabel
0-1000      | 1.13 | 32759.24 | 7108.57  | N         | N                        | 2
5000-6000   | 0.96 | 4344.63  | 14158.54 | N         | Y                        | 0
7000-8000   | 0.14 | 69.6     | 7096.34  | Y-376     | N                        | 3
8000-9000   | 0    | 25.19    | 0        | Y-1000    | N                        | 1

(The Relabel column appears to indicate which branch of the labelling
algorithm fired: 0 – none.)
87/87