Download Anomaly detection algorithms in Scikit-Learn

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Anomaly detection algorithms in Scikit-Learn
Nicolas Goix
Supervision: Alexandre Gramfort
Institut Mines-Télécom, Télécom ParisTech, CNRS-LTCI
OSI day, Oct 2015
1 What is Anomaly Detection?
2 Isolation Forest algorithm
3 Examples
Anomaly Detection (AD)
What is Anomaly Detection ?
”Finding patterns in the data that do not conform to expected
behavior”
Huge number of applications: Network intrusions, credit card
fraud detection, insurance, finance, military surveillance,...
Machine Learning context
Different kind of Anomaly Detection
• Supervised AD
- Labels available for both normal data and anomalies
- Similar to rare class mining
• Semi-supervised AD (Novelty Detection)
- Only normal data available to train
- The algorithm learns on normal data only
• Unsupervised AD (Outlier Detection)
- no labels, training set = normal + abnormal data
- Assumption: anomalies are very rare
Important litterature in Anomaly Detection:
• statistical AD techniques
fit a statistical model for normal behavior
ex: EllipticEnvelope
• density-based
- ex: Local Outlier Factor (LOF) and variantes (COF ODIN
LOCI)
• Support estimation - OneClassSVM - MV-set estimate
• high-dimensional techniques: - Spectral Techniques -
Random Forest - Isolation Forest
1 What is Anomaly Detection?
2 Isolation Forest algorithm
3 Examples
Idea:
IsolationForest.fit(X)
IsolationForest
Inputs: X, n estimators, max samples
Output: Forest with:
• # trees = n estimators
• sub-sampling size = max samples
• maximal depth max depth = int(log2 max samples)
Complexity: O(n estimators max samples log(max samples))
default: n estimators=100, max samples=256
IsolationForest.predict(X)
Finding the depth in each tree
depth ( Tree , X ) :
# − Finds t h e depth l e v e l o f t h e l e a f node
# f o r each sample x i n X .
# − Add a v e r a g e p a t h l e n g t h ( n s a m p l e s i n l e a f )
# i f x not i s o l a t e d
−
score(x, n) = 2
E(depth(x))
c(n)
Complexity: O( n samples n estimators log(max samples))
1 What is Anomaly Detection?
2 Isolation Forest algorithm
3 Examples
Examples
• code example:
>>
>>
>>
>>
from s k l e a r n . ensemble import I s o l a t i o n F o r e s t
IF = I s o l a t i o n F o r e s t ( )
IF . f i t ( X t r a i n ) # b u i l d the trees
I F . p r e d i c t ( X t e s t ) # f i n d t h e average depth
• plotting decision function:
n samples normal = 150
n s a m p l e s o u t i e r s = 50
n samples normal = 150
n s a m p l e s o u t i e r s = 50
Thanks !