CS570 Data Mining
Anomaly Detection
Cengiz Günay
Dept. Math & CS, Emory University
Fall 2013
Some slides courtesy of Li Xiong; Han, Kamber, Pei (2012); Tan, Steinbach, Kumar (2006)
Today
Midterm next week (Tuesday): in-class or take-home?
Guest speaker (Thursday): Olgert Denas, "Feature extraction from deep models"
Today’s menu:
Anomaly Detection: In the context of clustering and otherwise
Anomaly Detection
An anomaly is a pattern in the data that does not conform to expected behavior
Also called: outliers, exceptions, peculiarities, surprises
Types of Anomalies
Point Anomalies
Contextual Anomalies
Collective Anomalies
Point Anomalies
An individual data instance is anomalous w.r.t. the rest of the data
[Figure: 2-D scatter plot with normal regions N1 and N2, isolated point anomalies o1 and o2, and a small anomalous group O3]
Contextual Anomalies
An individual data instance is anomalous within a context
Requires a notion of context
Also referred to as conditional anomalies*
[Figure: time series in which the same value is labeled normal in one context and anomalous in another]
* Xiuyao Song, Mingxi Wu, Christopher Jermaine, Sanjay Ranka, "Conditional Anomaly Detection," IEEE Transactions on Knowledge and Data Engineering, 2006
Collective Anomalies
A collection of related data instances is anomalous
Requires a relationship among data instances
Sequential Data
Spatial Data
Graph Data
The individual instances within a collective anomaly are
not anomalous by themselves
[Figure: time series with an anomalous subsequence highlighted]
Anomaly Detection
Anomalies often have significant impact:
Cyber intrusions
Credit card fraud
Health condition anomalies
Crust deformation anomalies
Black Swan Theory
The impact of rare events is huge and highly underrated
A black swan event:
Is a surprise
Has a major impact
After its first recording, is rationalized by hindsight
Almost all major scientific discoveries and historical events are black swan events
Anomaly Detection
Related problems
Rare Class Mining
Chance discovery
Novelty Detection
Exception Mining
Noise Removal
Black Swan
“Mining needle in a haystack.
So much hay and so little time”
Intrusion Detection
Intrusion detection: the process of monitoring the events occurring in a computer system or network for intrusions
Intrusions are attempts to bypass the security mechanisms of a computer or network
Approaches
Traditional signature-based intrusion detection systems
are based on signatures of known attacks
Anomaly detection
Fraud Detection
Fraud detection: the detection of criminal activities occurring in commercial organizations
Types of fraud
Credit card fraud
Insurance claim fraud
Mobile / cell phone fraud
Insider trading
Challenges
Fast and accurate real-time detection
Misclassification cost is very high
Image Processing
Detecting outliers in an image monitored over time
Detecting anomalous regions
within an image
Used in
mammography image analysis
video surveillance
satellite image analysis
Key Challenges
Detecting collective anomalies
Data sets are very large
Anomaly Detection
Supervised Anomaly Detection
Labels available for both normal data and anomalies
Classification
Semi-supervised Anomaly Detection
Labels available only for normal data
Classification
Unsupervised Anomaly Detection
No labels assumed
Based on the assumption that anomalies are very rare
compared to normal data
Output of Anomaly Detection
Label
Each test instance is given a normal or anomaly
label
Score
Each test instance is assigned an anomaly score
Allows the output to be ranked
Requires an additional threshold parameter
Classification Based Techniques
Main idea: build a classification model for normal (and
anomalous (rare)) events based on labeled training data, and
use it to classify each new unseen event
Categories:
Supervised classification techniques
Require knowledge of both normal and anomaly class
Build classifier to distinguish between normal and known anomalies
Semi-supervised classification techniques
Require knowledge of normal class only!
Use modified classification model to learn the normal behavior and then
detect any deviations from normal behavior as anomalous
Advantages and disadvantages?
Classification Based Techniques
Advantages:
Supervised classification techniques:
Models that can be easily understood
High accuracy in detecting many kinds of known anomalies
Semi-supervised classification techniques:
Models that can be easily understood
Normal behavior can be accurately learned
Drawbacks:
Supervised classification techniques
Require labels from both the normal and anomaly classes
Cannot detect unknown and emerging anomalies
Semi-supervised classification techniques
Require labels from the normal class
Possibly high false alarm rate: previously unseen (yet legitimate) data records may be flagged as anomalies
Supervised Anomaly Detection
Challenge
Classification models must be able to handle skewed
(imbalanced) class distributions
Misclassification cost for the rare class tends to be high
Supervised Classification Techniques
Blackbox approaches
Manipulating data records (oversampling /
undersampling / generating artificial examples)
Whitebox approaches
Adapt classification models
Design new classification models
Cost-sensitive classification techniques
Ensemble-based algorithms (SMOTEBoost, RareBoost)
Manipulating Data Records
Over-sampling the rare class [Ling98]:
Duplicate the rare events until the data set contains as many rare examples as the majority class => balance the classes
Down-sizing (undersampling) the majority class [Kubat97]:
Sample the data records from the majority class (randomly; near-miss examples; examples far from minority class examples, i.e., far from decision boundaries)
Generating artificial anomalies:
SMOTE (Synthetic Minority Over-sampling TEchnique) [Chawla02]: new rare class examples are generated inside the regions of existing rare class examples (see the sketch below)
Artificial anomalies are generated around the edges of the sparsely populated data regions [Fan01]
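To make the oversampling idea concrete, here is a minimal, self-contained sketch of SMOTE's core interpolation step. The function name, parameters, and plain-NumPy implementation are our illustrative choices, not part of [Chawla02]:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Generate synthetic rare-class examples by interpolating between
    a minority point and one of its k nearest minority neighbors
    (the core idea of SMOTE)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-distance
    k = min(k, n - 1)
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest minority neighbors
    synth = []
    for _ in range(n_new):
        i = rng.integers(n)                     # pick a random minority point
        j = nn[i, rng.integers(k)]              # and one of its neighbors
        gap = rng.random()                      # interpolation factor in [0, 1)
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synth)
```

Because new points lie on segments between existing minority points, they stay inside the regions of existing rare class examples, as the slide states.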
Adapting Existing Rule Based Classifiers
Case specific feature weighting [Cardey97]
Increases the weight for rare class examples in decision
tree learning
Weights are dynamically generated based on the path taken by the example
Case specific rule weighting [Grzymala00]
LERS (Learning from Examples based on Rough Sets)
increases the rule strength for all rules describing the rare
class
Rare Class Detection Evaluation
True positive rate, true negative rate, false
positive rate, false negative rate
Precision/recall
Implications due to imbalanced class distribution
Base rate fallacy
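A short sketch of how these rates are computed from the confusion matrix, with the rare class as the positive class (the naming and 0/1 encoding are illustrative):

```python
import numpy as np

def rare_class_rates(y_true, y_pred):
    """Confusion-matrix rates with the rare class as positive
    (1 = anomaly, 0 = normal)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return {"tpr (recall)": tp / (tp + fn),
            "fpr": fp / (fp + tn),
            "precision": tp / (tp + fp)}
```

Under a skewed class distribution these numbers diverge sharply: a classifier that always predicts "normal" on data with 1% anomalies is 99% accurate yet has zero recall, and even a small false positive rate can swamp precision.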
Base Rate Fallacy (Axelsson, 1999)
Base Rate Fallacy
Even though the test is 99% accurate, your chance of having the disease given a positive result is only about 1/100, because the population of healthy people is much larger than that of sick people
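A worked instance of the fallacy, with an assumed prevalence of 1 in 10,000 (the prevalence and the symmetric 99% accuracy are illustrative assumptions, not from the slides). By Bayes' theorem,

$$P(\text{disease} \mid +) = \frac{P(+ \mid \text{disease})\,P(\text{disease})}{P(+)} = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} \approx 0.0098 \approx \frac{1}{100}$$

so a positive result from a 99% accurate test still leaves only about a 1% chance of actually having the disease.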
Semi-supervised Classification Techniques
Use modified classification model to learn the
normal behavior and then detect any deviations
from normal behavior as anomalous
Recent approaches:
Neural network based approaches
Support Vector machines (SVM) based approaches
Markov model based approaches
Rule-based approaches
Using Support Vector Machines
One-class classification problem: computes a spherically shaped decision boundary with minimal volume around a training set of objects
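As a hedged sketch of the one-class idea, scikit-learn's OneClassSVM (an RBF one-class SVM, a close relative of the minimal-volume sphere formulation above) can be trained on normal data only; the data and parameter values here are illustrative:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(500, 2))              # labeled normal data only
X_test = np.vstack([rng.normal(0, 1, size=(10, 2)),    # more normal points
                    rng.uniform(4, 6, size=(5, 2))])   # far-away anomalies

# nu upper-bounds the fraction of training points treated as outliers
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)
print(clf.predict(X_test))                             # +1 = normal, -1 = anomaly
```

Deviations from the learned normal region come out as -1, matching the semi-supervised setup of learning normal behavior and flagging departures from it.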
Anomaly Detection
Supervised Anomaly Detection
Semi-supervised Anomaly Detection
Unsupervised Anomaly Detection
Graphical based
Statistical based
Nearest neighbor based techniques
Graphical Approaches
Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D)
Limitations
Time consuming
Subjective
Statistical Approaches
Assume a parametric model describing the distribution of
the data (e.g., normal distribution)
Apply a statistical test that depends on
Data distribution
Parameters of the distribution (e.g., mean, variance)
Number of expected outliers (confidence limit)
Grubbs’ Test
Detect outliers in univariate data
Assume data comes from normal distribution
Detects one outlier at a time: remove the outlier and repeat
H0: There is no outlier in data
HA: There is at least one outlier
Grubbs' test statistic:
$G = \dfrac{\max_i |X_i - \bar{X}|}{s}$
Reject H0 if:
$G > \dfrac{N-1}{\sqrt{N}} \sqrt{\dfrac{t^2_{\alpha/N,\,N-2}}{N - 2 + t^2_{\alpha/N,\,N-2}}}$
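A minimal Python sketch of one round of the test, translating the statistic and critical value above directly (the function name and interface are ours):

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """One round of Grubbs' test: return the index of the most extreme
    point if it is an outlier at level alpha, else None.
    Assumes x comes from a normal distribution."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = np.abs(x - x.mean()) / x.std(ddof=1)    # studentized deviations
    g, idx = z.max(), int(z.argmax())
    # critical value built from the t distribution, as in the formula above
    t2 = stats.t.ppf(1 - alpha / n, n - 2) ** 2
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t2 / (n - 2 + t2))
    return idx if g > g_crit else None
```

Repeated application (remove the flagged point and rerun until the function returns None) matches the one-outlier-at-a-time procedure described above.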
Statistical-based – Likelihood Approach
Assume the data set D contains samples from a mixture
of two probability distributions:
M (majority distribution)
A (anomalous distribution)
General Approach:
Initially, assume all the data points belong to M
Let $L_t(D)$ be the log likelihood of D at time t
For each point $x_t$ that belongs to M, move it to A
Let $L_{t+1}(D)$ be the new log likelihood
Compute the difference $\Delta = L_t(D) - L_{t+1}(D)$
If $\Delta > c$ (some threshold), then $x_t$ is declared an anomaly and moved permanently from M to A
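A rough sketch of this procedure under assumed model choices the slide leaves open (Gaussian majority M fit once to the data, uniform anomaly distribution A, a fixed anomaly prior p_a); as one reasonable reading of the $\Delta > c$ test, a move is kept when it changes the total log likelihood by more than the threshold c in A's favor:

```python
import numpy as np
from scipy import stats

def likelihood_anomalies(x, c=5.0, p_a=0.05):
    """Mixture-likelihood sketch: majority M as a Gaussian fit once to
    the data, anomalies A as a uniform over the data range (assumed
    model choices). A point moves from M to A when the move changes
    the total log likelihood by more than the threshold c."""
    x = np.asarray(x, dtype=float)
    m = stats.norm(x.mean(), x.std(ddof=1))         # majority distribution M
    a = stats.uniform(x.min(), x.max() - x.min())   # anomaly distribution A
    in_m = np.ones(len(x), dtype=bool)

    def loglik(mask):
        lm = np.log(1 - p_a) * mask.sum() + m.logpdf(x[mask]).sum()
        la = np.log(p_a) * (~mask).sum() + a.logpdf(x[~mask]).sum()
        return lm + la

    for t in np.argsort(-np.abs(x - x.mean())):     # try extreme points first
        trial = in_m.copy()
        trial[t] = False                            # move x_t from M to A
        if loglik(trial) - loglik(in_m) > c:        # ∆ clears the threshold
            in_m = trial                            # keep the move permanently
    return np.where(~in_m)[0]                       # indices declared anomalous
```

For a point deep in the Gaussian tail, reassigning it to the uniform component raises the likelihood substantially, so it clears the threshold; central points do not.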
Limitations of Statistical Approaches
Most of the tests are for a single attribute
In many cases, data distribution may not be
known
For high dimensional data, it may be difficult to
estimate the true distribution
Methods for High Dimensional Data
Mahalanobis distance (see the sketch below):
Incorporates interaction among dimensions
Uses the covariance matrix S:
$D(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$
Other methods:
Angle method
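A minimal NumPy sketch of the Mahalanobis score above, measuring each point's distance from the sample mean (the function name is ours):

```python
import numpy as np

def mahalanobis_scores(X):
    """Mahalanobis distance of each row of X from the sample mean,
    using the sample covariance matrix S."""
    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    # per-row sqrt of the quadratic form (x - mu)^T S^{-1} (x - mu)
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, S_inv, diff))
```

Because the covariance matrix rescales and decorrelates the dimensions, a point can score as anomalous even when no single attribute is extreme on its own.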
Distance-based Approaches
Data is represented as a vector of features
Three major approaches
Nearest-neighbor based
Density based
Clustering based
Nearest Neighbor Based Techniques
Key assumption: normal points have close neighbors
while anomalies are located far from other points
General two-step approach:
1. Compute the neighborhood for each data record
2. Analyze the neighborhood to determine whether the data record is an anomaly or not
Categories:
Distance based methods
Anomalies are data points most distant from other points
Density based methods
Anomalies are data points in low density regions
Nearest Neighbor Based Techniques
Distance based approaches
A point O in a data set is a DB(p, d) outlier if at least a fraction p of the points in the data set lie at distance greater than d from O* (see the sketch after this list)
Density based approaches
Compute local densities of particular regions and declare
instances in low density regions as potential anomalies
Approaches
Local Outlier Factor (LOF)
Connectivity Outlier Factor (COF)
Multi-Granularity Deviation Factor (MDEF)
* Knorr, Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets," VLDB 1998
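A direct NumPy sketch of the DB(p, d) definition above (the function name and default values are ours):

```python
import numpy as np

def db_outliers(X, p=0.95, d=1.0):
    """DB(p, d) outliers in the sense of Knorr & Ng: points for which
    at least a fraction p of the other points lie at distance > d."""
    # pairwise Euclidean distances
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = len(X)
    far_frac = (dist > d).sum(axis=1) / (n - 1)   # self-distance 0 is excluded by > d
    return np.where(far_frac >= p)[0]
```

The quadratic pairwise-distance matrix is fine for small data; Knorr and Ng's paper is precisely about doing this efficiently on large data sets.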
Local Outlier Factor (LOF)*
For each data point q compute the distance to the k-th nearest neighbor
(k-distance)
Compute reachability distance (reach-dist) for each data example q with
respect to data example p as:
reach-dist(q, p) = max{k-distance(p), d(q,p)}
Compute the local reachability density (lrd) of data example q as the inverse of the average reachability distance based on the MinPts nearest neighbors of q:
$\mathrm{lrd}(q) = \dfrac{\mathrm{MinPts}}{\sum_{p} \text{reach-dist}_{\mathrm{MinPts}}(q, p)}$
Compute LOF(q) as the ratio of the average local reachability density of q's k nearest neighbors to the local reachability density of q:
$\mathrm{LOF}(q) = \dfrac{1}{\mathrm{MinPts}} \sum_{p} \dfrac{\mathrm{lrd}(p)}{\mathrm{lrd}(q)}$
* Breunig et al., "LOF: Identifying Density-Based Local Outliers," KDD 2000
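For practical use, scikit-learn implements LOF; a short sketch with illustrative data and an illustrative MinPts choice:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, size=(100, 2)),   # loose cluster
               rng.normal(5, 0.2, size=(30, 2)),    # tight cluster
               [[10.0, 10.0]]])                     # clear outlier

lof = LocalOutlierFactor(n_neighbors=20)            # n_neighbors plays the role of MinPts
labels = lof.fit_predict(X)                         # -1 = outlier, +1 = inlier
scores = -lof.negative_outlier_factor_              # larger => more anomalous
```

LOF values near 1 mean the point's density matches its neighbors'; values well above 1 mean the point is in a much sparser region than its neighbors, i.e., a local outlier.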
Advantages of Density based Techniques
Local Outlier Factor (LOF) approach
Example:
[Figure: 2-D points p1, p2, p3, with the distance from each of p2 and p3 to its nearest neighbor marked]
In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
The NN approach may consider p3 an outlier, but the LOF approach does not.
Density-Based Approach Using Support Vector Machines
Main idea [Steinwart05] :
Normal data records belong to high density data regions
Anomalies belong to low density data regions
Use unsupervised approach to learn high density and low
density data regions
Use SVM to classify data density level
Clustering Based Techniques
Key assumption:
Normal data records belong to large and dense clusters, while anomalies do not belong to any of the clusters or form very small clusters
Anomalies detected using clustering based methods can be:
Data records that do not fit into any cluster (residuals from clustering)
Small clusters
Low density clusters or local anomalies (far from other points within the
same cluster)
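A sketch of this idea using DBSCAN, which naturally leaves unclustered points as residuals (label -1); the data, eps/min_samples values, and the small-cluster cutoff are illustrative choices:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(200, 2)),   # large dense cluster
               rng.normal(6, 0.5, size=(150, 2)),   # another dense cluster
               [[3.0, 10.0], [9.0, -4.0]]])         # isolated points

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
residuals = np.where(labels == -1)[0]               # records that fit no cluster

# very small clusters can also be flagged as anomalous
ids, sizes = np.unique(labels[labels >= 0], return_counts=True)
small_clusters = ids[sizes < 10]
```

This covers the first two cases in the list above; detecting local anomalies within a cluster would additionally require a per-point distance to its own cluster's center or neighbors.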