Data Stream Classification and Novel
Class Detection
Mehedy Masud, Latifur Khan, Qing Chen
and Bhavani Thuraisingham
Department of Computer Science, University of Texas at Dallas
Jing Gao, Jiawei Han
Department of Computer Science, University of Illinois at Urbana-Champaign
Charu Aggarwal
IBM T. J. Watson
This work was funded in part by
Outline of The Presentation
 Background
 Data Stream Classification
 Novel Class Detection
Introduction
 Characteristics of data streams are:
◦ Continuous flow of data
◦ Examples:
Network traffic
Sensor data
Call center records
Data Stream Classification
 Uses past labeled data to build classification model
 Predicts the labels of future instances using the model
 Helps decision making
[Figure: network traffic passes through a firewall and the classification model; attack traffic is blocked and quarantined, benign traffic reaches the server, and expert analysis and labeling provide the labeled data for the model.]
Data Stream Classification (cont.)
 What are the applications?
◦ Security monitoring
◦ Network monitoring and traffic engineering
◦ Business: credit card transaction flows
◦ Telecommunication calling records
◦ Web logs and web page click streams
Challenges
 Infinite length
 Concept-drift
 Concept-evolution
 Feature evolution
Infinite Length
 Impractical to store and use all historical data
◦ Requires infinite storage
◦ And running time
Concept-Drift
[Figure: a data chunk of positive and negative instances; the separating hyperplane has moved from its previous to its current position, and the instances between the two hyperplanes are victims of concept-drift.]
Concept-Evolution
[Figure: a feature space partitioned by thresholds x1, y1, y2 into regions A, B, C, D containing + and − instances; in the second panel a novel class (denoted by x) appears in the stream.]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Existing classification models misclassify novel class instances
Dynamic Features
 Why do new features evolve?
◦ Infinite data stream
   Normally, the global feature set is unknown
   New features may appear
◦ Concept drift
   As concepts drift, new features may appear
◦ Concept evolution
   A new type of class normally brings a new set of features
Different chunks may have different feature sets
Dynamic Features
[Figure: the ith chunk (features: runway, climb, clear, ramp) and the (i+1)st chunk (features: runway, ground, ramp) go through feature extraction & selection; because the chunks and the current model have different feature sets, a feature space conversion step is applied before classification & novel class detection and before training the new model.]
Existing classification models require a complete, fixed feature set that applies to all chunks. The global feature set is difficult to predict. One solution is to build the vector over all English words, but the dimension of such a vector would be too high.
Outline of The Presentation
 Introduction
 Data Stream Classification
 Novel Class Detection
Data Stream Classification (cont.)
 Single Model Incremental Classification
 Ensemble model-based classification
◦ Supervised
◦ Semi-supervised
◦ Active learning
Overview
 Single Model Incremental Classification
 Ensemble model-based classification
◦ Data Selection
◦ Semi-supervised
◦ Skewed Data
Ensemble of Classifiers
[Figure: an unlabeled input (x, ?) is classified by each individual classifier, e.g. C1: +, C2: +, C3: −; the individual outputs are combined by voting to produce the ensemble output, +.]
Ensemble Classification of Data
Streams
 Divide the data stream into equal sized chunks
◦ Train a classifier from each data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3
Note: Di may contain data points from different classes
[Figure: labeled chunks (e.g. D1, D2, …) each train a classifier (C1, C2, …); the best L classifiers form the ensemble, which predicts the labels of the newest, unlabeled chunk. This addresses infinite length and concept-drift. A sketch follows below.]
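A minimal sketch of this chunk-based ensemble, assuming scikit-learn decision trees as the base learner and accuracy on the newest labeled chunk as the selection criterion (the slides do not fix either choice; names such as ChunkEnsemble are illustrative):

from collections import Counter
from sklearn.tree import DecisionTreeClassifier

class ChunkEnsemble:
    def __init__(self, L=3):
        self.L = L          # number of models kept in the ensemble
        self.models = []    # current ensemble

    def update(self, X_chunk, y_chunk):
        # Train a new model on the latest labeled chunk.
        m = DecisionTreeClassifier().fit(X_chunk, y_chunk)
        # Score every candidate (old models + new one) on this chunk and
        # keep the best L; dropping stale models lets the ensemble track drift.
        candidates = self.models + [m]
        candidates.sort(key=lambda c: c.score(X_chunk, y_chunk), reverse=True)
        self.models = candidates[:self.L]

    def predict(self, x):
        # Majority vote of the individual model outputs.
        votes = [m.predict([x])[0] for m in self.models]
        return Counter(votes).most_common(1)[0][0]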
ECSMiner
Concept-Evolution Problem
 A completely new class of data arrives in the stream
[Figure: (a) a decision tree testing x < x1, y < y1, y < y2; (b) the corresponding feature space partitioning into regions A, B, C, D; (c) a novel class (denoted by x) arrives in the stream.]
ECSMiner
ECSMiner: Overview
[Figure: overview of the ECSMiner algorithm. The data stream consists of older (labeled) instances and newer (unlabeled) just-arrived instances. The last labeled chunk is used to train a new model, which updates the ensemble of L models (M1 M2 … ML). Each incoming instance xnow goes through outlier detection: if it is an outlier it is buffered for novel class detection, otherwise it is classified by the ensemble.]
Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. “Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams”. In Proceedings of the 2009 European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD’09), Bled, Slovenia, 7-11 Sept, 2009, pp. 79-94 (extended version appeared in IEEE Transactions on Knowledge and Data Engineering (TKDE)).
ECSMiner
Algorithm
Training
Novel class detection
and classification
ECSMiner
Novel Class Detection
 Non-parametric
◦ does not assume any underlying model of existing classes
 Steps:
1. Creating and saving decision boundary during training
2. Detecting and filtering outliers
3. Measuring cohesion and separation among test and training instances
ECSMiner
Training: Creating Decision Boundary
Raw training data → clusters are created → pseudopoints.
[Figure: the raw + and − training instances in regions A, B, C, D; the clusters built from them; and the resulting pseudopoints, which form the saved decision boundary.]
Addresses the infinite length problem (a sketch follows below).
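A sketch of how the pseudopoints could be built, assuming plain K-means and a (centroid, radius, count) summary per cluster as the figure suggests; K = 50 matches the parameter settings given later, but the helper name and representation are illustrative, not the authors' exact implementation:

import numpy as np
from sklearn.cluster import KMeans

def build_pseudopoints(X_train, K=50):
    """Cluster one chunk's raw training data and keep only
    (centroid, radius, count) summaries -- the pseudopoints whose
    hyperspheres act as the saved decision boundary of the model."""
    km = KMeans(n_clusters=K, n_init=10).fit(X_train)
    pseudopoints = []
    for j in range(K):
        members = X_train[km.labels_ == j]
        centroid = km.cluster_centers_[j]
        # radius = distance from the centroid to the farthest member
        radius = np.max(np.linalg.norm(members - centroid, axis=1))
        pseudopoints.append((centroid, radius, len(members)))
    return pseudopoints  # the raw points can now be discarded (infinite length)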
ECSMiner
Outlier Detection and Filtering
A test instance that falls inside the decision boundary of a model is not an outlier; a test instance that falls outside the decision boundary is a raw outlier, or Routlier.
If a test instance x is a Routlier with respect to all L models of the ensemble (M1, M2, …, ML), then x is a filtered outlier (Foutlier), a potential novel class instance; otherwise x is treated as an existing class instance.
[Figure: a test instance x outside regions A, B, C, D is a Routlier; the AND over all ensemble models decides whether it becomes an Foutlier.]
Routliers may appear as a result of novel class, concept-drift, or noise.
Therefore, they are filtered to reduce noise as much as possible.
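A small sketch of this two-level filtering, assuming each model is represented by the list of (centroid, radius, count) pseudopoints from the previous slide (function names are illustrative):

import numpy as np

def is_routlier(x, pseudopoints):
    """x is a raw outlier (Routlier) for one model if it falls outside the
    radius of every pseudopoint, i.e. outside the decision boundary."""
    return all(np.linalg.norm(x - c) > r for c, r, _ in pseudopoints)

def is_foutlier(x, ensemble):
    """x is a filtered outlier (Foutlier) only if every model in the
    ensemble regards it as a Routlier; this filters most noise/drift cases."""
    return all(is_routlier(x, model) for model in ensemble)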
ECSMiner
Novel Class Detection
[Figure: novel class detection steps. (Step 1) A test instance x is checked against the ensemble of L models (M1, M2, …, ML). (Step 2) If x is a Routlier with respect to all models, it becomes an Foutlier (a potential novel class instance); otherwise it is an existing class instance. (Step 3) q-NSC is computed with all models over the Foutliers. (Step 4) If q-NSC > 0 for enough Foutliers (q' > q) with all models, a novel class is found; otherwise they are treated as existing class instances.]
ECSMiner
Computing Cohesion & Separation
[Figure: an Foutlier x with its q-neighborhood of other Foutliers, λo,q(x), and the q-neighborhoods of the existing classes, e.g. λ+,q(x) and λ−,q(x).]
 a(x) = mean distance from an Foutlier x to the instances in λo,q(x)
 bc(x) = mean distance from x to the instances in λc,q(x), its q nearest existing class instances of class c (e.g. b+(x) in the figure)
 bmin(x) = minimum among all bc(x)
 q-Neighborhood Silhouette Coefficient (q-NSC):
   q-NSC(x) = (bmin(x) − a(x)) / max(bmin(x), a(x))
If q-NSC(x) is positive, it means x is closer to Foutliers than any other class.
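A direct sketch of the q-NSC computation from these definitions (the brute-force version, before the speedup on the next slide; q = 50 matches the later parameter settings but is arbitrary here):

import numpy as np

def mean_qnn_dist(x, points, q):
    """Mean distance from x to its q nearest neighbours among `points`
    (a 2-D array, one instance per row)."""
    d = np.sort(np.linalg.norm(points - x, axis=1))
    return d[:q].mean()

def q_nsc(x, other_foutliers, class_points, q=50):
    """q-NSC of a single Foutlier x.
    other_foutliers: the remaining Foutlier instances;
    class_points: dict mapping each existing class to its instances."""
    a = mean_qnn_dist(x, other_foutliers, q)                   # cohesion with Foutliers
    b_min = min(mean_qnn_dist(x, pts, q) for pts in class_points.values())
    return (b_min - a) / max(b_min, a)                         # > 0: closer to Foutliers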
Speeding Up
 Computing q-NSC for every Foutlier instance x takes quadratic time in the number of Foutliers.
 In order to make the computation faster, we create Ko pseudopoints (Fpseudopoints) from the Foutliers using K-means clustering, where Ko = (No/S) * K. Here S is the chunk size and No is the number of Foutliers. The computations are then performed on the Fpseudopoints.
 Thus, the time complexity
◦ to compute the q-NSC of all of the Fpseudopoints is O(Ko(Ko+K)),
◦ which is constant, since both Ko and K are independent of the input size.
◦ However, by gaining speed we lose some precision, although the loss is negligible (to be analyzed shortly).
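A sketch of the Fpseudopoint construction, assuming K-means as on the training slide; Ko = (No/S)·K follows the formula above:

from sklearn.cluster import KMeans

def build_fpseudopoints(foutliers, chunk_size, K=50):
    """Summarise the Foutliers into Ko Fpseudopoints so that q-NSC can be
    computed between cluster summaries instead of every Foutlier pair."""
    No = len(foutliers)
    Ko = max(1, min(No, int((No / chunk_size) * K)))   # Ko = (No / S) * K
    km = KMeans(n_clusters=Ko, n_init=10).fit(foutliers)
    return km.cluster_centers_, km.labels_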
ECSMiner
Algorithm To Detect Novel Class
“Speedup” Penalty
 As discussed earlier,
◦ by speeding up the computation in step 3, we lose some precision since the result deviates from the exact result.
◦ This analysis shows that the deviation is negligible.
Figure 6. Illustrating the computation of deviation. i is an Fpseudopoint, i.e., a cluster of Foutliers, and j is an existing class pseudopoint, i.e., a cluster of existing class instances. In this particular example, all instances in i belong to a novel class.
“Speedup” Penalty
Approximate:
Exact:
Deviation:
Experiments - Datasets
• We evaluated our approach on two synthetic and two
real datasets:
• SynC – Synthetic data with only concept-drift. Generated
using hyperplane equation. 2 classes, 10 attributes, 250K
instances
• SynCN – Synthetic data with concept-drift and novel class.
Generated using Gaussian distribution. 20 classes, 40 attributes,
400K instances
• KDD cup 1999 intrusion detection (10% version) – real
dataset. 23 classes, 34 attributes, 490K instances
• Forest cover – real dataset. 7 classes, 54 attributes, 581K
instances
Experiments - Setup
 Development:
◦ Language: Java
 H/W:
◦ Intel P-IV with 2GB memory and 3GHz dual processor CPU

Parameter settings:
◦ K (number of pseudopoints per chunk) = 50
◦ N (minimum number of instances required to declare novel class) = 50
◦ M (ensemble size) = 6
◦ S (chunk size) = 2,000
Experiments - Baseline

Competing approaches:
◦ i) MineClass (MC): our approach
◦ ii) WCE-OLINDDA_Parallel (W-OP)
◦ iii) WCE-OLINDDA_Single (W-OS): Where WCE-OLINDDA is a combination of the
Weighted Classifier Ensemble (WCE) and novel class detector OLINDDA, with default
parameter settings for WCE and OLINDDA
 We use this combination since, to the best of our knowledge, there is no approach that can classify and detect novel classes simultaneously.
 OLINDDA assumes there is only one normal class, and all other classes are novel
◦ Therefore, we apply two variations –
 W-OP keeps parallel OLINDDA models, one for each class
 W-OS keeps a single model that absorbs a novel class when encountered
Experiments - Results

Evaluation metrics
◦ Mnew = % of novel class instances Misclassified as existing class
= Fn∗100/Nc
◦ Fnew = % of existing class instances Falsely identified as novel class
= Fp∗100/ (N−Nc)
◦ ERR = Total misclassification error (%)(including Mnew and Fnew)
= (Fp+Fn+Fe)∗100/N
◦ where Fn = total novel class instances misclassified as existing class,
◦ Fp = total existing class instances misclassified as novel class,
◦ Fe = total existing class instances misclassified (other than Fp),
◦ Nc = total novel class instances in the stream,
◦ N = total instances in the stream.
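The three metrics translate directly into code; a small helper for reference (names are illustrative):

def stream_metrics(Fn, Fp, Fe, Nc, N):
    """ERR, Mnew and Fnew exactly as defined above."""
    Mnew = Fn * 100.0 / Nc             # novel instances misclassified as existing
    Fnew = Fp * 100.0 / (N - Nc)       # existing instances flagged as novel
    ERR  = (Fp + Fn + Fe) * 100.0 / N  # total misclassification error
    return ERR, Mnew, Fnew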
Experiments - Results
[Figures: results on Forest Cover, KDD cup, and SynCN.]
Experiments - Results
Experiments – Parameter Sensitivity
Experiments – Runtime
Dynamic Features
 Solution:
◦ Global Features
◦ Local Features
◦ Union
Mohammad Masud, Qing Chen, Latifur Khan, Jing Gao, Jiawei Han,
and Bhavani Thuraisingham, “Classification and Novel Class
Detection of Data Streams in A Dynamic Feature Space,” in Proc.
of Machine Learning and Knowledge Discovery in Databases,
European Conference, ECML PKDD 2010, Barcelona, Spain, Sept
2010, Springer, Page 337-352
Feature Mapping Across Models and
Test Data Points
 Feature sets vary across chunks. In particular, when a new class appears, new features should be selected and added to the feature set.
 Strategy 1 – Lossy fixed (Lossy-F) conversion / Global
◦ Use the same fixed feature set for the entire stream.
   We call this a lossy conversion because future models and instances may lose important features due to this mapping.
 Strategy 2 – Lossy local (Lossy-L) conversion / Local
◦ We call this a lossy conversion because it may lose feature values during mapping.
 Strategy 3 – Dimension preserving (D-Preserving) mapping / Union
Feature Space Conversion – Lossy-L
Mapping (Local)
 Assume that each data chunk has a different feature vector.
 When a classification model is trained, we save the feature vector with the model.
 When an instance is tested, its feature vector is mapped (i.e., projected) onto the model's feature vector.
Feature Space Conversion – Lossy-L
Mapping
 For example,
◦ Suppose the model has two features (x, y)
◦ The instance has two features (y, z)
◦ When testing, the instance is treated as having the two features (x, y),
◦ where x = 0 and the y value is kept as it is (a sketch follows below)
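A sketch of the Lossy-L conversion, assuming feature vectors are kept as name -> value maps so that projection is just a lookup (the representation and helper name are assumptions, not the authors' implementation):

def lossy_l_project(instance, model_features):
    """Lossy-L (local) conversion: project an instance (dict of
    feature -> value) onto the model's own feature list.  Features the
    model never saw are dropped; model features missing from the
    instance become 0."""
    return [instance.get(f, 0.0) for f in model_features]

# e.g. model trained on (x, y); instance has (y, z):
# lossy_l_project({"y": 2.0, "z": 5.0}, ["x", "y"]) -> [0.0, 2.0]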
Conversion Strategy II – Lossy-L
Mapping

Graphically:
Conversion Strategy III – D-Preserving
Mapping
 When an instance is tested, both the model's feature vector and the instance's feature vector are mapped (i.e., projected) onto the union of their feature vectors.
◦ The feature dimension increases.
◦ In the mapping, the features of both the testing instance and the model are preserved. The extra features are filled with 0s.
Conversion Strategy III – D-Preserving
Mapping
 For example,
◦ suppose the model has three features (a, b, c)
◦ The instance has four features (b, c, d, e)
◦ When testing, we project both the model's feature vector and the instance's feature vector onto (a, b, c, d, e)
◦ Therefore, in the model, d and e will be considered 0, and in the instance, a will be considered 0 (a sketch follows below)
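A matching sketch of the D-Preserving (union) conversion under the same name -> value representation used in the Lossy-L sketch above:

def d_preserving_project(instance, model_features):
    """D-Preserving (union) conversion: map the instance (and, in the same
    way, the model's vectors) onto the union of the two feature sets,
    filling the unseen dimensions with 0."""
    union = list(model_features) + [f for f in instance if f not in model_features]
    inst_vec = [instance.get(f, 0.0) for f in union]
    # a model vector/centroid would be extended the same way:
    # model_vec = [model_values.get(f, 0.0) for f in union]
    return union, inst_vec

# e.g. model features (a, b, c), instance features (b, c, d, e):
# union -> [a, b, c, d, e]; the model sees 0 for d, e and the instance 0 for a.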
Conversion Strategy III – D-Preserving
Mapping

Previous Example
Discussion
 Local does not favor the novel class; it favors the existing classes.
◦ Local features are enough to model the existing classes.
 Union favors the novel class.
◦ New features may be discriminating for the novel class, hence Union works.
Comparison
Which strategy is better?
 Assumption: the lossless conversion (union) preserves the properties of a novel class.
 In other words, if an instance belongs to a novel class, it remains outside the decision boundary of every model Mi of the ensemble M in the converted feature space.
Lemma:
 If a test point x belongs to a novel class, it will be misclassified by the ensemble M as an existing class instance under certain conditions when the Lossy-L conversion is used.
Comparison
 Proof:
 Let X1,…,XL,XL+1,…,XM be the dimensions of the model, and
 let X1,…,XL,XM+1,…,XN be the dimensions of the test point.
 Suppose the radius of the closest cluster (in the higher dimension) is R.
 Also, let the test point be a novel class instance.
 Combined feature space = X1,…,XL,XL+1,…,XM,XM+1,…,XN
Comparison
 Proof (continued):
 Combined feature space = X1,…,XL,XL+1,…,XM,XM+1,…,XN
 Centroid of the cluster (original space): X1=x1,…,XL=xL,XL+1=xL+1,…,XM=xM, i.e., (x1,…,xL, xL+1,…,xM)
 Centroid of the cluster (combined space): (x1,…,xL, xL+1,…,xM, 0,…,0)
 Test point (original space): X1=x'1,…,XL=x'L,XM+1=x'M+1,…,XN=x'N, i.e., (x'1,…,x'L, x'M+1,…,x'N)
 Test point (combined space): (x'1,…,x'L, 0,…,0, x'M+1,…,x'N)
Comparison
 Proof (continued):
 Centroid (combined space): (x1,…,xL, xL+1,…,xM, 0,…,0)
 Test point (combined space): (x'1,…,x'L, 0,…,0, x'M+1,…,x'N)
 Since the test point is a novel class instance, it lies outside the cluster in the combined space:
   R² < ((x1 − x'1)² + … + (xL − x'L)² + x²L+1 + … + x²M) + (x'²M+1 + … + x'²N)
 Write this as R² < a² + b², where a² is the first parenthesized term (the squared distance between the test point and the centroid in the model's feature space, i.e., after the Lossy-L conversion) and b² = x'²M+1 + … + x'²N.
 Equivalently, R² = a² + b² − e² for some e² > 0.
 Therefore a² = R² + (e² − b²), so a² < R² (provided that e² < b²).
 Therefore, under the Lossy-L conversion the test point falls inside the cluster radius and will not be an outlier.
Baseline Approaches
 WCE is the Weighted Classifier Ensemble¹, a multi-class ensemble classifier.
 OLINDDA is a novel class detector² that works only for binary classes.
 FAE is an ensemble classifier that addresses feature evolution³ and concept drift.
 ECSMiner is a multi-class ensemble classifier that addresses concept drift and concept evolution⁴.
Approaches Comparison
[Table: the proposed technique (DXMiner) is compared with OLINDDA, WCE, FAE, and ECSMiner against four challenges: infinite length, concept-drift, concept-evolution, and dynamic features.]
Experiments: Datasets
 We evaluated our approach on different datasets (the table also indicates, for each dataset, concept drift, concept evolution, and dynamic features):

Data Set     | # of Instances | # of Classes
KDD          | 492K           | 7
Forest Cover | 387K           | 7
NASA         | 140K           | 21
Twitter      | 335K           | 21
Experiments: Results

Evaluation metrics: let
◦ Fn = total novel class instances misclassified as existing class,
◦ Fp = total existing class instances misclassified as novel class,
◦ Fe = total existing class instances misclassified (other than Fp),
◦ Nc = total novel class instances in the stream,
◦ N = total instances in the stream
Experiments: Results
 We use the following performance metrics to evaluate our technique:
◦ Mnew = % of novel class instances Misclassified as existing class, i.e., Mnew = Fn∗100/Nc
◦ Fnew = % of existing class instances Falsely identified as novel class, i.e., Fnew = Fp∗100/(N−Nc)
◦ ERR = Total misclassification error (%) (including Mnew and Fnew), i.e., ERR = (Fp+Fn+Fe)∗100/N
Experiments: Setup
 Development:
◦ Language: Java
 H/W:
◦ Intel P-IV with 3GB memory and 3GHz dual processor CPU

Parameter settings:
◦ K (number of pseudo points per chunk) = 50
◦ q (minimum number of instances required to declare novel class) = 50
◦ L (ensemble size) = 6
◦ S (chunk size) = 1,000
Experiments: Baseline

Competing approaches:
◦ i) DXMiner (DXM): our approach- 4 variations:
 Lossy-F conversion
 Lossy-L conversion
 D-Preserving conversion
◦ ii) FAE-WCE-OLINDDA_Parallel (W-OP)
◦ OLINDDA assumes there is only one normal class, and all other classes are novel. W-OP keeps parallel OLINDDA models, one for each class.
 We use this combination since to the best of our knowledge there is no approach
that can classify and detect novel classes simultaneously with feature evolution.
◦ iii) FAE-ECSMiner
Twitter Results
Twitter Results
Method | D-preserving | Lossy Local | Lossy Global | O-F
AUC    | 0.88         | 0.83        | 0.76         | 0.56
NASA Dataset
Method | Deviation | Info Gain | O-F
AUC    | 0.996     | 0.967     | 0.876
Forest Cover Results
Forest Cover Results
Method | D-preserving | O-F
AUC    | 0.97         | 0.74
KDD Results
KDD Results
Method | D-preserving | FAE-Olindda
AUC    | 0.98         | 0.96
Summary Results
Proposed Methods
Improved Outlier Detection and Multiple Novel Class
Detection
 Challenges
◦ High false positive (FP) rate (existing classes detected as novel) and high false negative (FN) rate (missed novel classes)
◦ Two or more novel classes arrive at a time
 Solutions¹
◦ Dynamic decision boundary – based on previous mistakes
   Inflate the decision boundary if FP is high, deflate it if FN is high
◦ Build a statistical model to filter out noise and concept drift from the outliers
◦ Multiple novel classes are detected by
   Constructing a graph in which each outlier cluster is a vertex
   Merging the vertices based on the silhouette coefficient
   Counting the number of connected components in the resultant (i.e., merged) graph
¹ Mohammad M. Masud, Qing Chen, Jing Gao, Latifur Khan, Charu Aggarwal, Jiawei Han, and Bhavani Thuraisingham, “Addressing Concept-Evolution in Concept-Drifting Data Streams”, In Proc. ICDM ’10, Sydney, Australia, Dec 14-17, 2010.
Proposed Methods
Outlier Threshold (OUTTH)
 To declare a test instance an outlier, using the cluster radius r alone is not enough, because of noise in the data
◦ So, beyond the radius r, a threshold (OUTTH) is set up, so that most noisy data around a model cluster is classified immediately
[Figure: an instance x just outside a model cluster, with its q-neighborhoods λo,q(x) and λ+,q(x) and the distances a(x) and b+(x).]
Proposed Methods
Outlier Threshold (OUTTH)
 Every instance outside the cluster range has a weight
   wt(x) = exp(−(bx − r))
◦ If wt(x) >= OUTTH, the instance is considered an existing class instance.
◦ If wt(x) < OUTTH, the instance is an outlier.
 Pros:
◦ Noisy data will be classified immediately
 Cons:
◦ OUTTH is hard to determine
   Noisy data and novel class instances may occur simultaneously
   Different datasets may need different OUTTH values
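A sketch of this weighting and threshold test, assuming bx denotes the distance from the instance to the cluster centroid and using an illustrative OUTTH value (the slides later determine OUTTH dynamically):

import math

def outlier_weight(dist_to_centroid, radius):
    """Weight of an instance outside a cluster of radius r; it decays
    exponentially with the slack beyond the cluster boundary."""
    return math.exp(-(dist_to_centroid - radius))   # wt(x) = exp(-(b_x - r))

def filter_with_outth(dist_to_centroid, radius, OUTTH=0.7):
    """Classify immediately (existing class) if wt(x) >= OUTTH,
    otherwise treat the instance as an outlier.  OUTTH is illustrative."""
    return "existing" if outlier_weight(dist_to_centroid, radius) >= OUTTH else "outlier"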
Proposed Methods
Outlier Threshold (OUTTH)
[Figure: an instance x near the cluster boundary with a(x), b+(x) and the neighborhoods λo,q(x), λ+,q(x); where should OUTTH be set?]
 If the threshold is too high, noisy data may become outliers
◦ the FP rate will go up
 If the threshold is too low, novel class instances will be labeled as existing class
◦ the FN rate will go up
We need to balance these two.
 Introduction
 Data Stream Classification
 Clustering
 Novel Class Detection
• Finer Grain Novel Class Detection
• Dynamic Novel Class Detection
• Multiple Novel Class Detection
Proposed Methods
Dynamic threshold setting
[Figure: instances near a cluster boundary, labeled "Marginal FN" and "Marginal FP".]
 What is a marginal FP or a marginal FN?
◦ Defer approach
   After a testing chunk has been labeled, update OUTTH based on the marginal FP and FN rates of this testing chunk, and then apply the new OUTTH to the next testing chunk.
◦ Eager approach
   Once a marginal FP or marginal FN instance is detected, update OUTTH with a step function, and apply the updated OUTTH to the next testing instance (a sketch follows below).
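A minimal sketch of the Eager update; the slide only says OUTTH is adjusted with a step function whenever a marginal FP or FN is observed, so the step size and the direction of each adjustment below are assumptions, chosen to be consistent with "inflate the boundary if FP is high, deflate if FN is high":

def eager_update_outth(outth, event, step=0.01, lo=0.0, hi=1.0):
    """Eager approach: nudge OUTTH by a fixed step whenever a marginal
    FP/FN is observed, then use the new value for the very next test
    instance.  Step size and direction are illustrative assumptions."""
    if event == "marginal_FP":     # existing instance almost flagged as an outlier
        outth -= step              # loosen: fewer instances become outliers
    elif event == "marginal_FN":   # novel instance almost accepted as existing
        outth += step              # tighten: more instances become outliers
    return min(hi, max(lo, outth))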
Proposed Methods
Dynamic threshold setting
Proposed Methods
Defer approach and Eager approach
comparison
 In the Defer approach, OUTTH is updated only after a data chunk has been labeled
◦ Too late – within the testing chunk, many marginal FPs or FNs may occur due to an improper OUTTH threshold
◦ Overreact – if there are many marginal FP or FN instances in the labeled testing chunk, the OUTTH update may overreact for the next testing chunk
 In the Eager approach, OUTTH is updated aggressively whenever a marginal FP or FN happens.
◦ The model is more tolerant of noisy data and concept drift.
◦ The model is more sensitive to novel class instances.
Proposed Methods
Outliers Statistics
 For each outlier instance, we calculate the novelty probability Pnov:
   Pnov(x) = [(1 − wt(x)) / (1 − min(wt(x)))] · Psc
◦ If Pnov is large (close to 1), it indicates that the outlier has a high probability of being a novel instance.
 Pnov contains two parts
◦ The first part measures how far the outlier is from the model cluster
◦ The second part, Psc, is the Silhouette Coefficient, which measures the cohesion and separation of the q-neighbors of the outlier with respect to the model cluster
Proposed Methods
Outliers Statistics
• Noise data
• Concept drift
• Novel class
These three scenarios may occur simultaneously.
Proposed Methods
Outlier Statistics Gini Analysis
 The Gini coefficient is a measure of statistical inequality. The discrete Gini coefficient is:
   G(s) = (1/n) · [ n + 1 − 2 · ( Σ_{i=1..n} (n + 1 − i) · y_i ) / ( Σ_{i=1..n} y_i ) ]
 If we divide [0, 1] into n equal-size bins and put every outlier's Pnov into the corresponding bin, we obtain the cdf values y_i
◦ If all Pnov are very low, at the extreme the cdf is y_i = 1 for every i, and
   G(s) = (1/n) · [ n + 1 − 2 · ( Σ_{i=1..n} (n + 1 − i) ) / n ] = (1/n) · [ n + 1 − (n + 1) ] = 0
◦ If all Pnov are very high, at the extreme the cdf is y_i = 0 except y_n = 1, and
   G(s) = (1/n) · [ n + 1 − 2 ] = (n − 1)/n
Proposed Methods
Outlier Statistics Gini Analysis
◦ If the outliers' Pnov are distributed evenly, y_i = i/n, and
   G(s) = (1/n) · [ n + 1 − 2 · ( Σ_{i=1..n} (n + 1 − i) · i ) / ( Σ_{i=1..n} i ) ] = (1/n) · [ n + 1 − 2(n + 2)/3 ] = (n − 1)/(3n)
 After obtaining the outlier Pnov distribution, calculate G(s):
 If G(s) > (n − 1)/(3n), declare a novel class.
 If G(s) <= (n − 1)/(3n), classify the outliers as existing class instances.
 When n → ∞, (n − 1)/(3n) → 0.33.
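A sketch of this Gini test on the outliers' Pnov distribution, following the discrete formula and the (n − 1)/(3n) decision rule above (the bin count is illustrative):

import numpy as np

def gini_from_pnov(pnov_values, n_bins=10):
    """Discrete Gini coefficient of the outliers' Pnov distribution.
    y_i is the cumulative fraction of outliers whose Pnov falls in the
    first i of n equal-width bins on [0, 1]."""
    hist, _ = np.histogram(pnov_values, bins=n_bins, range=(0.0, 1.0))
    y = np.cumsum(hist) / max(1, hist.sum())          # cdf y_1 .. y_n
    n = n_bins
    i = np.arange(1, n + 1)
    return (1.0 / n) * (n + 1 - 2 * np.sum((n + 1 - i) * y) / np.sum(y))

def declare_novel(pnov_values, n_bins=10):
    """Decision rule from the slide: novel class iff G(s) > (n - 1)/(3n)."""
    return gini_from_pnov(pnov_values, n_bins) > (n_bins - 1) / (3.0 * n_bins)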
Proposed Methods
Outlier Statistics Gini Analysis Limitation
[Figure: cdf curves of Pnov for concept drift, severe concept drift, and a novel class.]
◦ At the extreme, it is impossible to differentiate concept drift from concept evolution using the Gini coefficient, when the concept drift just "looks like" concept evolution.
 Introduction
 Data Stream Classification
 Clustering
 Novel Class Detection
• Finer Grain Novel Class Detection
• Dynamic Novel Class Detection
• Multiple Novel Class Detection
Proposed Methods
Multi Novel Class Detection
[Figure: a data stream containing positive instances, negative instances, and novel instances; in the feature space the novel instances form two separate groups, novel class A and novel class B.]
If we always assume that the novel instances belong to a single novel class, one of the two types of novel instances, either A or B, will be misclassified.
Proposed Methods
Multi Novel Class Detection
 The main idea in detecting multiple novel classes is to construct a graph and identify the connected components in the graph.
 The number of connected components determines the number of novel classes.
Proposed Methods
Multi Novel Class Detection
 Two phases (a sketch of the graph-building idea follows after this slide):
◦ Building the connected graph
   Build a directed nearest-neighbor graph: from each vertex (outlier cluster), add an edge from this vertex to its nearest neighbor.
   If the silhouette coefficient from the vertex to its nearest neighbor is larger than some threshold, the edge is removed.
   Problem: linkage circles
◦ Component merging phase
   Gaussian-distribution-centric decision
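A sketch of the graph-building and component-counting idea; here a simple distance threshold stands in for the silhouette-coefficient test on each edge, which is an assumption rather than the authors' exact rule:

import numpy as np

def count_novel_classes(centroids, sep_threshold):
    """Build a directed nearest-neighbour graph over the outlier-cluster
    centroids, keep only edges whose length is within `sep_threshold`
    (a stand-in for the silhouette test), and count connected components
    of the resulting undirected graph."""
    centroids = np.asarray(centroids)
    n = len(centroids)
    adj = [[] for _ in range(n)]
    for i in range(n):
        d = [np.linalg.norm(centroids[i] - centroids[j]) if j != i else np.inf
             for j in range(n)]
        j = int(np.argmin(d))             # nearest neighbour of vertex i
        if d[j] <= sep_threshold:         # drop well-separated edges
            adj[i].append(j); adj[j].append(i)
    # count connected components with a simple DFS
    seen, components = set(), 0
    for s in range(n):
        if s in seen:
            continue
        components += 1
        stack = [s]
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            stack.extend(adj[v])
    return components   # = number of distinct novel classes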
Proposed Methods
Multi Novel Class Detection
◦ Component merging phase
   In probability theory, “the normal (or Gaussian) distribution is a continuous probability distribution that is often used as a first approximation to describe real-valued random variables that tend to cluster around a single mean value”.¹
   If two Gaussian distribution variables (g1, g2) can be separated, the following condition holds:
     d = centroid_dist(g1, g2) >= c · (σ1 + σ2)
   Since μ is proportional to σ,
     μ = (1/√(2πσ²)) · ∫_0^∞ x · e^(−x²/(2σ²)) dx = σ/√(2π),
   if the condition holds, the two variables (components) remain separated; otherwise, the two components are merged.
1. Amari Shunichi, Nagaoka Hiroshi. Methods of information geometry. Oxford University Press. ISBN 0-8218-0531-2,
2000.
Experiment Results
Experiments: Datasets
 We evaluated our approach on different datasets (the table also indicates, for each dataset, concept drift, concept evolution, and dynamic features):

Data Set     | # of Instances | # of Classes
KDD          | 492K           | 7
Forest Cover | 387K           | 7
NASA         | 140K           | 21
Twitter      | 335K           | 21
SynED        | 400K           | 20
Experiment Results
Experiments: Setup
 Development:
◦ Language: Java
 H/W:
◦ Intel P-IV with 3GB memory and 3GHz dual processor CPU

Parameter settings:
◦ K (number of pseudo points per chunk) = 50
◦ q (minimum number of instances required to declare novel class) = 50
◦ L (ensemble size) = 6
◦ S (chunk size) = 1,000
Experiments: Baseline
 Competing approaches:
◦ i) DEMminer, our approach – 5 variations:
   Lossy-F conversion
   Lossy-L conversion
   Lossless conversion – DEMminer
   Dynamic OUTTH + Lossless conversion – DEMminer-Ex (without Gini)
   Dynamic OUTTH + Gini + Lossless conversion – DEMminer-Ex
◦ ii) WCE-OLINDDA (O-W)
◦ iii) FAE-WCE-OLINDDA_Parallel (O-F)
 We use this combination since, to the best of our knowledge, there is no approach that can classify and detect novel classes simultaneously with feature evolution.
Experiment Results
Experiments: Results

Evaluation metrics:
◦ Fn = total novel class instances misclassified as existing class,
◦ Fp = total existing class instances misclassified as novel class,
◦ Fe = total existing class instances misclassified (other than Fp),
◦ Nc = total novel class instances in the stream,
◦ N = total instances in the stream
Experiment Results
Twitter Results
[Figure: total error (%) and number of novel class instances versus stream progress (in thousands of data points) on Twitter, comparing DEMminer, Lossy-F, Lossy-L, and O-F.]
Experiment Results
Twitter Results
[Figure: ROC curves (true novel class detection rate vs. false novel class detection rate) on Twitter for DEMminer, Lossy-F, Lossy-L, and O-F.]

Method | DEMminer | Lossy-L | Lossy-F | O-F
AUC    | 0.88     | 0.83    | 0.76    | 0.56
Experiment Results
Twitter Results
[Figure: total error (%) and number of novel class instances versus stream progress on Twitter, comparing DEMminer, DEMminer-Ex, and OW.]
Experiment Results
Twitter Results
[Figure: ROC curves on Twitter for DEMminer-Ex, DEMminer, and OW.]

Method | DEMminer-Ex | DEMminer | OW
AUC    | 0.94        | 0.88     | 0.56
Experiment Results
Forest Cover Results
[Figure: number of novel class instances and total error (%) versus stream progress on Forest Cover, comparing DEMminer, DEMminer-Ex (without Gini), DEMminer-Ex, and OW.]
Experiment Results
Forest Cover Results
[Figure: ROC curves on Forest Cover for DEMminer, DEMminer-Ex (without Gini), DEMminer-Ex, and OW.]

Method | DEMminer | DEMminer-Ex (without Gini) | DEMminer-Ex | OW
AUC    | 0.97     | 0.99                       | 0.97        | 0.74
Experiment Results
NASA Dataset
Experiment Results
NASA Dataset
Method | Deviation | Info Gain | FAE
AUC    | 0.996     | 0.967     | 0.876
Experiment Results
KDD Results
[Figure: KDD novel error — number of novel-error instances versus chunk number, for DEMminer and O-F.]
Experiment Results
KDD Results
[Figure: KDD ROC curve (TP rate vs. FP rate) for DEMminer and O-F.]

Method | DEMminer | O-F
AUC    | 0.98     | 0.96
Experiment Results
Result Summary
[Table: ERR, Mnew, Fnew, AUC (and FP/FN rates where reported) for each method on Twitter, ASRS, Forest Cover, and KDD; the AUC values:]

Dataset      | Method               | AUC
Twitter      | DEMminer             | 0.877
Twitter      | Lossy-F              | 0.764
Twitter      | Lossy-L              | 0.834
Twitter      | O-F                  | 0.557
ASRS         | DEMminer             | 0.996
ASRS         | DEMminer (info-gain) | 0.967
ASRS         | O-F                  | 0.876
Forest Cover | DEMminer             | 0.973
Forest Cover | O-F                  | 0.743
KDD          | DEMminer             | 0.986
KDD          | O-F                  | 0.967
Experiment Results
Result Summary
Dataset      | Method      | ERR | Mnew | Fnew | AUC
Twitter      | DEMminer    | 4.2 | 30.5 | 0.8  | 0.877
Twitter      | DEMminer-Ex | 1.8 | 0.7  | 0.6  | 0.944
Twitter      | OW          | 3.4 | 96.7 | 1.6  | 0.557
Forest Cover | DEMminer    | 3.6 | 8.4  | 1.3  | 0.974
Forest Cover | DEMminer-Ex | 3.1 | 4.0  | 0.68 | 0.990
Forest Cover | OW          | 5.9 | 20.6 | 1.1  | 0.743
Experiment Results
Running Time Comparison
Dataset      | Time (sec)/1K: DEMminer | Lossy-F | O-F  | Points/sec: DEMminer | Lossy-F | O-F | Speed gain (DEMminer over O-F)
Twitter      | 23                      | 3.5     | 66.7 | 43                   | 289     | 15  | 2.9
ASRS         | 21                      | 4.3     | 38.5 | 47                   | 233     | 26  | 1.8
Forest Cover | 1.0                     | 1.0     | 4.7  | 967                  | 1003    | 212 | 4.7
KDD          | 1.2                     | 1.2     | 3.3  | 858                  | 812     | 334 | 2.5
Experiment Results
Multi Novel Detection Results
Experiment Results
Multi Novel Detection Results
Experiment Results
Conclusion
•Our data stream classification technique addresses
•Infinite length
•Concept-drift
•Concept-evolution
•Feature-evolution
•Existing approaches only address first two issues
•Applicable to many domains such as
•Intrusion/malware detection
•Text categorization
•Fault detection etc.
References

J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh. : BOAT-Optimistic Decision Tree
Construction. In Proc. SIGMOD, 1999.

P. Domingos and G. Hulten, “Mining high-speed data streams”. In Proc. SIGKDD, pages 71-80, 2000.

Wenerstrom, B., Giraud-Carrier, C., “Temporal data mining in dynamic feature spaces”. In:
Perner, P. (ed.) ICDM 2006. LNCS (LNAI), vol. 4065, pp. 1141-1145. Springer, Heidelberg
(2006)

E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. “Cluster-based novel concept
detection in data streams applied to intrusion detection in computer networks”. In Proc. 2008
ACM symposium on Applied computing, pages 976–980, (2008).

M. Scholz and R. Klinkenberg. “An ensemble classifier for drifting concepts.” In Proc.
ICML/PKDD Workshop in Knowledge Discovery in Data Streams., 2005.
References (contd.)

Brutlag, J.(2000). “Aberrant behavior detection in time series for network monitoring.” In:
Proc. Usenix Fourteenth System Admin. Conf. LISA XIV, New Orleans, LA. (Dec 2000)

Eskin, E., Arnold, A., Prerau, M., Portnoy, L., Stolfo, S.: “A geometric framework for
unsupervised anomaly detection: Detection intrusions in unlabeled data.” Applications of
Data Mining in Computer Security, Kluwer (2002).

Fan, W. “Systematic data selection to mine concept-drifting data streams.” In Proc. KDD 04

Gao, J, Wei Fan, and Jiawei Han. (2007a). "On Appropriate Assumptions to Mine Data
Streams”

Gao, J., Wei Fan, Jiawei Han, Philip S. Yu. (2007b). “A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions.” SDM 2007

Goebel, J. and T. Holz. Rishi: “Identify bot contaminated hosts by irc nickname evaluation. In
Usenix/Hotbots” ’07 Workshop, 2007.

Grizzard, J. B., V. Sharma, C. Nunnery, B. B. Kang, and D. Dagon (2007). “Peer-to-peer
botnets: Overview and case study.” In Usenix/Hotbots ’07 Workshop.
References (contd.)

Keogh, E.J., Pazzani, M.J. (2000): “Scaling up dynamic time warping for data mining applications.” In: ACM SIGKDD (2000)

Lemos, R. (2006): Bot software looks to improve peerage. SecurityFocus.
http://www.securityfocus.com/news/11390 (2006).

Livadas, C., B.Walsh, D. Lapsley, and T. Strayer (2006) “Using machine learning techniques
to identify botnet traffic.” In 2nd IEEE LCN Workshop on Network Security (WoNS’2006),
November 2006.

LURHQ Threat Intelligence Group (2004). Sinit p2p trojan analysis.
http://www.lurhq.com/sinit.html (2004)

Rajab, M. A. J. Zarfoss, F. Monrose, and A. Terzis (2006) “A multifaceted approach to
understanding the botnet phenomenon.” In Proceedings of the 6th ACM SIGCOMM on Internet
Measurement Conference (IMC), 2006.

Kagan Tumer and Joydeep Ghosh (1996). “Error correlation and error reduction in ensemble classifiers.” Connection Science, 8(3-4):385-403
References (contd.)

Mohammad Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham, “A
Multi-Partition Multi-Chunk Ensemble Technique to Classify Concept-Drifting Data Streams.” In
Proc, of 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-
09), Page: 363-375, Bangkok, Thailand, April 2009.

Mohammad Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham, “A
Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled
Data.” In Proc. of 2008 IEEE International Conference on Data Mining (ICDM 2008), Pisa,
Italy, Page 929-934, December, 2008.

Clay Woolam, Mohammed Masud, and Latifur Khan , “Lacking Labels In The Stream:
Classifying Evolving Stream Data With Few Labels”. In Proc. of 18th International Symposium
on Methodologies for Intelligent Systems (ISMIS), Page 552-562, September 2009 Prague,
Czech Republic
References (contd.)

Mohammad Masud, Qing Chen, Latifur Khan, Charu Aggarwal, Jing Gao, Jiawei Han, and
Bhavani Thuraisingham, “Addressing Concept-Evolution in Concept-Drifting Data Streams”. In Proc
of 2010 10th IEEE International Conference on Data Mining (ICDM 2010), Sydney, Australia,
Dec 2010.

Mohammad M. Masud, Qing Chen, Jing Gao, Latifur Khan, Jiawei Han, Bhavani Thuraisingham
“Classification and Novel Class Detection of Data Streams in a Dynamic Feature Space”. In Proc. of
European Conference on Machine Learning and Knowledge Discovery in Databases, ECML
PKDD 2010, Barcelona, Spain, September 20-24, 2010, Springer 2010, ISBN 978-3-642-15882-7, Page: 337-352.

Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham,
“Classification and Novel Class Detection in Data Streams with Active Mining”. In Proc of 14th
Pacific-Asia Conference on Knowledge Discovery and Data Mining, 21-24 June, 2010, Page
311-324, - Hyderabad, India.
References (contd.)

Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham,
“Classification and Novel Class Detection in Concept-Drifting Data Streams under Time
Constraints" , IEEE Transactions on Knowledge & Data Engineering (TKDE), 2011, IEEE
Computer Society, June 2011, Vol. 23, No. 6, Page 859-874.

Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu, “A Framework for Clustering
Evolving Data streams” Published in Proceedings VLDB ’03 proceedings of the 29th
international conference on Very Large Data Bases-Volume 29

H. Wang, W. Fan, P. S. Yu, and J. Han. “Mining concept-drifting data streams using ensemble
classifiers”. In Proc. ninth ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 226–235, Washington, DC, USA, Aug, 2003. ACM.

Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham.
“Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams”. In
Proceedings of 2009 European Conf. on Machine Learning and Principles and Practice of
Knowledge Discovery in Databases (ECML/PKDD’09), Bled, Slovenia, 7-11 Sept, 2009.

Questions