Instance-based Learning:
Locally linear reconstruction and its applications
Data Mining Lab.
Seoul National University
Pilsung Kang
2010. 01. 05.
Table of Contents
Introduction: Instance-based Learning
Locally Linear Reconstruction for Classification & Regression
Learning
Algorithms
Distance & Local Topology-based Hybrid Score for Novelty Detection
Local Topology-based Seed Initialization for Clustering
Application I: Response Modeling
Real-world
Applications
Application II: Virtual Metrology
Application III: Keystroke Dynamics Analysis
Conclusion
Pilsung Kang, DataMining Laboratory, SNU
1
Introduction: What is learning?
“A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by
P, improves with experience E.” – Tom Mitchell.
Pilsung Kang, Data Mining Laboratory, SNU
2
Introduction: Data for machine learning
Data
Digitized, structured or unstructured observations of real-world events that are
provided to the machine for learning.
Not only required to build learning models, but also determines the type of
learning task.
Pilsung Kang, Data Mining Laboratory, SNU
3
Introduction: Data for machine learning
Unsupervised learning
Explores intrinsic characteristics.
Estimates the underlying distribution.
Density estimation, clustering, novelty detection, etc.
Pilsung Kang, Data Mining Laboratory, SNU
4
Introduction: Data for machine learning
Supervised learning
y  f ( x)
 Finds relations between X and Y.
 Estimate the underlying function y  f ( x) .
 Classification, regression.
Pilsung Kang, Data Mining Laboratory, SNU
5
Introduction: Instance-based learning
Instance-based learning (IBL)
Also called memory-based reasoning (MBR) or lazy learning.
A non-parametric approach where training or learning does not take place until
a new query is made.
k-nearest neighbor (k-NN) is the most popular.
k-NN covers most learning tasks, such as density estimation, novelty detection,
classification, and regression.
Pilsung Kang, Data Mining Laboratory, SNU
6
Introduction: k-NN density estimation
k-NN density estimation
With a small region R containing x: P = ∫_R p(x') dx'.
According to the binomial law, the probability that k of N instances fall within R is P(k) = C(N, k) P^k (1 − P)^(N−k).
For large N, the following holds: k ≈ NP.
If the region R is sufficiently small, P ≈ p(x) V, where V is the volume of R.
The estimated density becomes p(x) ≈ k / (NV).
Pilsung Kang, Data Mining Laboratory, SNU
7
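A minimal sketch of this estimator in Python (the function name and interface are illustrative assumptions, not code from the slides):

```python
# k-NN density estimation: p_hat(x) = k / (N * V), where V is the volume of the
# smallest ball around x that contains its k nearest neighbors.
import numpy as np
from math import gamma, pi

def knn_density(x, X, k=5):
    """Estimate p(x) from reference data X (N x d) using the k-NN rule."""
    N, d = X.shape
    dists = np.sort(np.linalg.norm(X - x, axis=1))
    r = dists[k - 1]                                    # radius enclosing the k nearest neighbors
    V = (pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d     # volume of a d-dimensional ball
    return k / (N * V)

X = np.random.randn(500, 2)
print(knn_density(np.zeros(2), X, k=10))
```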
Introduction: k-NN novelty detection
k-NN novelty detection
Use distance information as a novelty score.
Maximum distance, average distance, and distance to the mean vector, etc.
Distance to the nearest neighbor: d^1_NN = ||x_{n+1} − x^1_{n+1}||.
Distance to the k-th nearest neighbor: d^k_max = ||x_{n+1} − x^k_{n+1}||.
Average distance to k nearest neighbors: d^k_avg = (1/k) Σ_{i=1}^{k} ||x_{n+1} − x^i_{n+1}||.
Pilsung Kang, Data Mining Laboratory, SNU
8
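A minimal sketch of these three distance-based scores (the helper below and its interface are illustrative assumptions):

```python
# Nearest-neighbor, k-th-neighbor, and average-distance novelty scores.
import numpy as np

def knn_novelty_scores(x, X, k=5):
    """Return (d_NN, d_max^k, d_avg^k) for a query x against reference data X."""
    dists = np.sort(np.linalg.norm(X - x, axis=1))[:k]
    return dists[0], dists[-1], dists.mean()

X = np.random.randn(200, 2)                  # reference (normal) instances
print(knn_novelty_scores(np.array([3.0, 3.0]), X, k=5))
```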
Introduction: k-NN classification and regression
k-NN classification
k-NN regression
The weight assigned to the neighbor xi
Pilsung Kang, Data Mining Laboratory, SNU
9
Introduction: k-NN classification and regression
k-NN classification examples
Pilsung Kang, Data Mining Laboratory, SNU
10
Introduction: Strengths of k-NN learning
Strengths of k-NN learning
Sound theoretical background.
The error rate of 1-NN is bounded by twice the Bayes error rate.
The error rate of k-NN converges to the Bayes error rate as k increases,
provided a sufficient number of reference instances.
Fast training procedure, unlike complex learning algorithms such as neural
networks or support vector machines.
Fits naturally with on-line (incremental) learning.
Pilsung Kang, Data Mining Laboratory, SNU
11
Introduction: Application areas of k-NN learning
Unsupervised learning
Density estimation
Data restoration.
Clustering
Image processing, text categorization.
Novelty detection
Intrusion detection, identity verification, user authentication, medical diagnosis.
Supervised learning
Classification
Collaborative filtering, image processing, text mining, bioinformatics.
Regression
Manufacturing, time-series analysis.
Pilsung Kang, Data Mining Laboratory, SNU
12
Introduction: Limitations of k-NN learning
Supervised learning (classification and regression)
Parameter dependency.
How many nearest neighbors should be considered?
How to give those neighbors appropriate weights?
Novelty detection
Counterexamples exist where conventional nearest-neighbor-based novelty detectors conflict with intuition.
Clustering
Most seed initialization techniques are purely heuristic.
Pilsung Kang, Data Mining Laboratory, SNU
13
Introduction: Contributions
Learning algorithms
A systematic weight allocation method, locally linear reconstruction (LLR),
is proposed for classification and regression.
LLR is able to identify important neighbors for the prediction.
LLR assigns the appropriate weights for the important neighbors.
A distance & local topology-based hybrid score is proposed for novelty
detection.
The hybrid score combines two distances: one associated with absolute similarity, the other with relative similarity captured by local topology.
The hybrid novelty score is able to overcome the limitations of conventional nearest-neighbor-based novelty detectors.
Pilsung Kang, Data Mining Laboratory, SNU
14
Introduction: Contributions
Learning algorithms
A new seed initialization algorithm based on centrality, sparsity, and
isotropy (CSI) is proposed for clustering.
Three properties associated with inter- or intra-cluster variance are identified.
Relative similarity and local topology are used for measuring these properties.
CSI is able to lead the K-Means clustering algorithm to an optimal clustering
structure rapidly.
Pilsung Kang, Data Mining Laboratory, SNU
15
Introduction: Contributions
Real-world applications
LLR classification and CSI are employed for response modeling.
Response modeling: to predict whether each customer will respond to a given
marketing campaign.
Class imbalance is a common and significant problem of response modeling.
Class imbalance is alleviated and response rate predictive accuracy is improved.
LLR regression is employed for virtual metrology.
Virtual metrology: to predict the metrological values using sensor data and
other relevant information in semiconductor manufacturing.
A good prediction model should be robust to the parameters while keeping
prediction accuracy as high as possible.
Both goals are achieved by LLR regression.
Pilsung Kang, Data Mining Laboratory, SNU
16
Introduction: Contributions
Real-world applications
A distance & local topology-based hybrid score is employed for keystroke
dynamics-based user authentication.
KDA: to authenticate users based on their keyboard typing behaviors.
KDA should be formulated as a novelty detection problem.
An authenticator should work well for incremental environments.
A distance & local topology-based hybrid score results in outstanding
authentication performance.
Pilsung Kang, Data Mining Laboratory, SNU
17
Table of Contents
Introduction: Instance-based Learning
Locally Linear Reconstruction for Classification & Regression
Learning
Algorithms
Distance & Local Topology-based Hybrid Score for Novelty Detection
Local Topology-based Seed Initialization for Clustering
Application I: Response Modeling
Real-world
Applications
Application II: Virtual Metrology
Application III: Keystroke Dynamics Analysis
Conclusion
Pilsung Kang, DataMining Laboratory, SNU
18
Locally Linear Reconstruction: Classification and regression
k-NN classification
k-NN regression
The weight assigned to the neighbor xi
Pilsung Kang, Data Mining Laboratory, SNU
19
Locally Linear Reconstruction: Issues and current solutions
Issues and current solutions
How many nearest neighbors should be considered?
Empirically determined by cross-validation.
Determined by domain experts for certain real-world cases.
How to give those neighbors appropriate weights?
“A farther neighbor gets a smaller weight.”
Kernel functions, which decrease in proportion to the dissimilarity, are
commonly used.
Kernel function examples: (formulas shown on the slide).
Pilsung Kang, Data Mining Laboratory, SNU
20
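An illustrative sketch of kernel-based weight allocation for k-NN regression; a Gaussian kernel is used here only as one example, since the specific kernels on the slide are not reproduced in the transcript:

```python
# Kernel-weighted k-NN regression: farther neighbors get smaller weights.
import numpy as np

def kernel_knn_regress(x, X, y, k=5, bandwidth=1.0):
    """Predict y(x) as a kernel-weighted average over the k nearest neighbors."""
    d = np.linalg.norm(X - x, axis=1)
    nn = np.argsort(d)[:k]
    w = np.exp(-(d[nn] ** 2) / (2 * bandwidth ** 2))   # Gaussian kernel weights
    w = w / w.sum()
    return np.dot(w, y[nn])

X = np.random.rand(100, 2)
y = X.sum(axis=1)
print(kernel_knn_regress(np.array([0.3, 0.7]), X, y, k=7))
```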
Locally Linear Reconstruction: Algorithm
An illustration of LLR algorithm procedure
Pilsung Kang, Data Mining Laboratory, SNU
21
Locally Linear Reconstruction: Algorithm for classification
LLR classification algorithm
Step 1: Compute the distance and find k nearest neighbors.
Step 2: Minimize the reconstruction error to find the critical neighbors and
their corresponding weights.
Step 3: Make a prediction based on the assigned weights.
Pilsung Kang, Data Mining Laboratory, SNU
22
Locally Linear Reconstruction: Algorithm for classification
Proposition 1: LLR for classification
The optimal weight w is determined by minimizing the reconstruction error,
  Min E(w) = (1/2) ||x_{n+1} − Σ_{j=1}^{k} w_j x^j_{n+1}||^2 = (1/2) (x_{n+1} − X^{NN}_{n+1} w)^T (x_{n+1} − X^{NN}_{n+1} w),
with two constraints,
  w_j ≥ 0 for all j,   Σ_j w_j = 1.
This minimization problem can be solved by any algorithm developed for solving quadratic programs (QP).
Pilsung Kang, Data Mining Laboratory, SNU
23
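A sketch of how the Step 1–3 procedure could be realized; the constrained reconstruction problem is solved here with SciPy's general-purpose SLSQP optimizer rather than a dedicated QP or SMO solver, and the interface is an assumption:

```python
# LLR classification sketch: find k nearest neighbors, solve the constrained
# reconstruction problem for the weights, and predict by weighted voting.
import numpy as np
from scipy.optimize import minimize

def llr_classify(x, X, y, k=10):
    nn = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    Xnn, ynn = X[nn], y[nn]

    def recon_error(w):                        # E(w) = 0.5 * ||x - Xnn^T w||^2
        r = x - Xnn.T @ w
        return 0.5 * r @ r

    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
    bounds = [(0.0, None)] * k                 # w_j >= 0
    w0 = np.full(k, 1.0 / k)
    w = minimize(recon_error, w0, bounds=bounds, constraints=cons,
                 method='SLSQP').x

    classes = np.unique(ynn)
    votes = [w[ynn == c].sum() for c in classes]
    return classes[int(np.argmax(votes))], w

X = np.random.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(llr_classify(np.array([0.2, 0.1]), X, y, k=15))
```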
Locally Linear Reconstruction: Algorithm for classification
LLR classification algorithm
Pilsung Kang, Data Mining Laboratory, SNU
24
Locally Linear Reconstruction: Algorithm for classification
Proposition 2: Computational complexity of LLR classification
Let n be the number of reference instances and k the number of nearest neighbors.
Then the computational complexity of
conventional k-NN is O(n log n),
LLR with a standard QP solver is O(n log n + k^3),
LLR with SMO is between O(n log n + k) and O(n log n + k^2).
Pilsung Kang, Data Mining Laboratory, SNU
25
Locally Linear Reconstruction: Algorithm for classification
Proof
Conventional k-NN:
Distance calculation: O(n); sorting: O(n log n); weight allocation: O(k).
Total computational complexity: O(n log n).
LLR with standard QP:
Distance calculation: O(n); sorting: O(n log n); weight allocation: O(k^3).
Total computational complexity: O(n log n + k^3).
LLR with SMO:
Distance calculation: O(n); sorting: O(n log n); weight allocation: O(k) ~ O(k^2).
Total computational complexity: O(n log n + k) ~ O(n log n + k^2).
Pilsung Kang, Data Mining Laboratory, SNU
26
Locally Linear Reconstruction: Algorithm for regression
Proposition 3: Explicit solution for LLR regression
The optimal weight w is determined by minimizing the reconstruction error,
  Min E(w) = (1/2) ||x_{n+1} − Σ_{j=1}^{k} w_j x^j_{n+1}||^2 = (1/2) (x_{n+1} − X^{NN}_{n+1} w)^T (x_{n+1} − X^{NN}_{n+1} w),
without any constraint.
Proof
  ∂E(w)/∂w = −(X^{NN}_{n+1})^T (x_{n+1} − X^{NN}_{n+1} w) = 0
  ⇒ (X^{NN}_{n+1})^T X^{NN}_{n+1} w = (X^{NN}_{n+1})^T x_{n+1}
  ⇒ w = ((X^{NN}_{n+1})^T X^{NN}_{n+1})^{-1} (X^{NN}_{n+1})^T x_{n+1},
followed by re-scaling each weight: w_j ← w_j / Σ_i w_i.
Pilsung Kang, Data Mining Laboratory, SNU
27
Locally Linear Reconstruction: Algorithm for regression
Proposition 4: Linear equations for LLR regression
The optimal weight w is determined by minimizing the reconstruction error,
  Min E(w) = (1/2) ||x_{n+1} − Σ_{j=1}^{k} w_j x^j_{n+1}||^2 = (1/2) (x_{n+1} − X^{NN}_{n+1} w)^T (x_{n+1} − X^{NN}_{n+1} w),
with one constraint,
  Σ_j w_j = 1.
Proof
  E(w) = (1/2) (x_{n+1} − X^{NN}_{n+1} w)^T (x_{n+1} − X^{NN}_{n+1} w),   s.t. w^T 1 = 1.
This can be rewritten as follows,
  Min E(w) = (1/2) w^T (x_{n+1} 1^T − X^{NN}_{n+1})^T (x_{n+1} 1^T − X^{NN}_{n+1}) w = (1/2) w^T C_L w,   s.t. w^T 1 = 1.
Pilsung Kang, Data Mining Laboratory, SNU
28
Locally Linear Reconstruction: Algorithm for regression
Proof (cont’)
The primal Lagrangian of the problem becomes,
  L = (1/2) w^T C_L w − λ (w^T 1 − 1).
The Karush-Kuhn-Tucker condition for the optimal solution becomes,
  ∂L/∂w = C_L w − λ1 = 0  ⇒  C_L w = λ1.
The solution of this problem can be obtained by solving the system of linear equations
  C_L w = 1,
and re-scaling w such that w^T 1 = 1.
Pilsung Kang, Data Mining Laboratory, SNU
29
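A sketch of LLR regression following the linear-equation solution above (solve C_L w = 1, then re-scale the weights to sum to one); the small regularizer added to C_L is an extra assumption for numerical stability:

```python
# LLR regression sketch: reconstruction weights from the local Gram matrix,
# then a weighted average of the neighbors' targets.
import numpy as np

def llr_regress(x, X, y, k=10, reg=1e-6):
    nn = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    Xnn, ynn = X[nn], y[nn]
    D = x - Xnn                                  # rows: x - x_j
    C = D @ D.T                                  # local Gram matrix C_L (k x k)
    C += reg * np.trace(C) * np.eye(k)           # stability regularizer (assumption)
    w = np.linalg.solve(C, np.ones(k))           # solve C_L w = 1
    w /= w.sum()                                 # re-scale so that w^T 1 = 1
    return float(np.dot(w, ynn)), w

X = np.random.rand(300, 3)
y = X @ np.array([1.0, -2.0, 0.5])
print(llr_regress(np.array([0.4, 0.6, 0.2]), X, y, k=12)[0])
```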
Locally Linear Reconstruction: Algorithm for regression
Pilsung Kang, Data Mining Laboratory, SNU
30
Locally Linear Reconstruction: Algorithm for regression
Proposition 5: Computational complexity of LLR regression
Let n be the number of reference instances and k the number of nearest neighbors.
Then the computational complexity of LLR regression is O(n log n + k^2.376).
Proof
Distance calculation: O(n); sorting: O(n log n); matrix factorization: O(k^2.376).
Total computational complexity: O(n log n + k^2.376).
Pilsung Kang, Data Mining Laboratory, SNU
31
Locally Linear Reconstruction: Performance evaluation
Data sets
Classification
Regression
Pilsung Kang, Data Mining Laboratory, SNU
32
Locally Linear Reconstruction: Performance evaluation
Benchmark kernel functions
The values of k used in classification and regression
Performance measures
Class accuracy (classification), RMSE and MAPE (regression)
Pilsung Kang, Data Mining Laboratory, SNU
33
Locally Linear Reconstruction: Classification performance
Classification performance (class accuracy)
Pilsung Kang, Data Mining Laboratory, SNU
34
Locally Linear Reconstruction: Classification performance
Classification performance w.r.t. various k
Pilsung Kang, Data Mining Laboratory, SNU
35
Locally Linear Reconstruction: Classification performance
Classification performance w.r.t. various k (cont’)
Pilsung Kang, Data Mining Laboratory, SNU
36
Locally Linear Reconstruction: Classification performance
The average number of important neighbors (non-zero weighted neighbors)
The number of important neighbors (neighbors with nonzero weights) increases
on a log scale and then remains stable beyond a certain level.
Pilsung Kang, Data Mining Laboratory, SNU
37
Locally Linear Reconstruction: Classification performance
The execution time of LLR classification
As the number of reference instances increases, the computation time for the
QP becomes negligible compared to the sorting time.
Pilsung Kang, Data Mining Laboratory, SNU
38
Locally Linear Reconstruction: Classification performance
Execution time of LLR classification
Pilsung Kang, Data Mining Laboratory, SNU
39
Locally Linear Reconstruction: Regression performance
Regression performance (RMSE)
Pilsung Kang, Data Mining Laboratory, SNU
40
Locally Linear Reconstruction: Regression performance
Regression performance (MAPE)
Pilsung Kang, Data Mining Laboratory, SNU
41
Locally Linear Reconstruction: Regression performance
Regression performance (RMSE) w.r.t. various k
Pilsung Kang, Data Mining Laboratory, SNU
42
Locally Linear Reconstruction: Regression performance
Regression performance (MAPE) w.r.t. various k
Pilsung Kang, Data Mining Laboratory, SNU
43
Locally Linear Reconstruction: Regression performance
Execution time of LLR regression
Pilsung Kang, Data Mining Laboratory, SNU
44
Locally Linear Reconstruction: Summary
Locally linear reconstruction algorithm
A local topology-based optimization problem is formulated.
Able to find important neighbors for prediction.
Able to assign appropriate weights to those neighbors.
Performance evaluation
Outperformed conventional weight allocation methods both for classification
and regression.
Found to be robust to the number of nearest neighbors (k).
The additional computational burden was not significant.
Pilsung Kang, Data Mining Laboratory, SNU
45
Table of Contents
Introduction: Instance-based Learning
Locally Linear Reconstruction for Classification & Regression
Learning
Algorithms
Distance & Local Topology-based Hybrid Score for Novelty Detection
Local Topology-based Seed Initialization for Clustering
Application I: Response Modeling
Real-world
Applications
Application II: Virtual Metrology
Application III: Keystroke Dynamics Analysis
Conclusion
Pilsung Kang, DataMining Laboratory, SNU
46
Novelty Detection: Definition
Novelty detection
What is a novel instance?
“Observations that deviate so much from other observations as to arouse suspicions
that they were generated by a different mechanism” (Hawkins, 1980).
“Instances whose true probability density is very low” (Harmeling et al., 2006).
Binary classification vs. novelty detection (Lee, 2007).
Pilsung Kang, Data Mining Laboratory, SNU
47
Novelty Detection: Approaches
Properties for the success of novelty detection
Flexibility
Ability of generating an arbitrary shape of description boundary.
Simplicity
A small number of model parameters.
Updatability
Ability to update the model with new instances.
Stability
Low sensitivity to the initial conditions of model learning.
Pilsung Kang, Data Mining Laboratory, SNU
48
Novelty Detection: Approaches
Properties of various novelty detection algorithms
Nearest-neighbor-based novelty detectors have many positive properties.
Pilsung Kang, Data Mining Laboratory, SNU
49
Novelty Detection: Nearest-neighbor-based approaches
Maximum distance (Ramaswamy et al., 2000)
Distance to the k-th nearest neighbor: d^k_max = ||x_{n+1} − x^k_{n+1}||.
Average distance (Angiulli and Pizzuti, 2005)
Average distance to k nearest neighbors: d^k_avg = (1/k) Σ_{i=1}^{k} ||x_{n+1} − x^i_{n+1}||.
Distance to the mean (Harmeling et al., 2006)
Distance to the mean vector of k nearest neighbors: d^k_mean = ||x_{n+1} − (1/k) Σ_{i=1}^{k} x^i_{n+1}||.
Pilsung Kang, Data Mining Laboratory, SNU
50
Novelty Detection: Nearest-neighbor-based approaches
The effect of nearest-neighbor-based novelty detectors
Pilsung Kang, Data Mining Laboratory, SNU
51
Novelty Detection: Counter examples
Which one should be identified as novel?
Case      | Query    | d^k_max | d^k_avg | d^k_mean
A (k=4)   | Circle   | 1.58    | 1.14    | 0.50
A (k=4)   | Triangle | 1.64    | 1.07    | 0.94
B (k=5)   | Circle   | 1.56    | 1.08    | 0.80
B (k=5)   | Triangle | 1.86    | 1.09    | 0.88
Pilsung Kang, Data Mining Laboratory, SNU
52
Distance & Local Topology-based Hybrid Novelty Score
The hybrid novelty score algorithm
Step 1: Compute the distance and find k nearest neighbors.
Step 2 (absolute measure): Compute the average distance to k nearest
neighbors.
Pilsung Kang, Data Mining Laboratory, SNU
53
Distance & Local Topology-based Hybrid Novelty Score
The hybrid novelty score algorithm (cont’)
Step 3 (relative measure): Compute the distance to the convex hull that is
constituted by the neighbors.
Pilsung Kang, Data Mining Laboratory, SNU
54
Distance & Local Topology-based Hybrid Novelty Score
The hybrid novelty score algorithm (cont’)
Step 4 (combine two measures): Compute the hybrid score by combining
the absolute measure and relative measure.
Pilsung Kang, Data Mining Laboratory, SNU
55
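A sketch of the hybrid score computation: the absolute measure is the average distance to the k nearest neighbors, and the relative measure is the distance to their convex hull, obtained from a constrained least-squares problem. The final combination rule used below is only an illustrative assumption; the exact formula is not given in the transcript:

```python
# Hybrid novelty score sketch: absolute measure (average distance) combined
# with a relative, local-topology-based measure (distance to the convex hull).
import numpy as np
from scipy.optimize import minimize

def hybrid_novelty_score(x, X, k=5):
    nn = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    Xnn = X[nn]
    d_avg = np.linalg.norm(Xnn - x, axis=1).mean()         # Step 2: absolute measure

    def hull_obj(w):                                       # squared distance to convex hull
        r = x - Xnn.T @ w
        return 0.5 * r @ r

    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
    res = minimize(hull_obj, np.full(k, 1.0 / k), bounds=[(0.0, None)] * k,
                   constraints=cons, method='SLSQP')
    d_hull = np.sqrt(2.0 * res.fun)                        # Step 3: relative measure

    return d_avg * (1.0 + d_hull / (d_avg + 1e-12))        # Step 4: combination (assumption)

X = np.random.randn(100, 2)
print(hybrid_novelty_score(np.array([2.5, 2.5]), X, k=5))
```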
Distance & Local Topology-based Hybrid Novelty Score
Pilsung Kang, Data Mining Laboratory, SNU
56
Distance & Local Topology-based Hybrid Novelty Score
Counter examples revisited
Case      | Query    | d^k_max | d^k_avg | d^k_mean | d^k_hybrid
A (k=4)   | Circle   | 1.58    | 1.14    | 0.50     | 1.42
A (k=4)   | Triangle | 1.64    | 1.07    | 0.94     | 1.18
B (k=5)   | Circle   | 1.56    | 1.08    | 0.80     | 1.18
B (k=5)   | Triangle | 1.86    | 1.09    | 0.88     | 1.09
Pilsung Kang, Data Mining Laboratory, SNU
57
The Hybrid Novelty Score: An illustrative example
Pilsung Kang, Data Mining Laboratory, SNU
58
The Hybrid Novelty Score: Performance evaluation
Data sets
Pilsung Kang, Data Mining Laboratory, SNU
59
The Hybrid Novelty Score: Performance evaluation
Grouping data sets in terms of
The number of normal instances (TrNn)
Small: Reference instances < 200
Large: Reference instances > 200
The number of attributes (Dim.)
Low: Attributes < Reference instances
High: Attributes > Reference instances
Pilsung Kang, Data Mining Laboratory, SNU
60
The Hybrid Novelty Score: Performance evaluation
Performance measures
Integrated error
Robust to an arbitrary threshold setting.
Pilsung Kang, Data Mining Laboratory, SNU
61
The Hybrid Novelty Score: Performance evaluation
Benchmark novelty detectors
Three density-based
Gaussian density estimation (Gauss).
Mixture of Gaussians (MoG).
Parzen window density estimator (Parzen).
One support vector-based
One class support vector machine (1-SVM).
Three clustering-based
K-Means clustering (KMC).
K-Center clustering (KCC).
Average linkage-based hierarchical clustering (HC).
Pilsung Kang, Data Mining Laboratory, SNU
62
The Hybrid Novelty Score: Performance evaluation
Benchmark novelty detectors (cont’)
One dimensionality reduction-based
Principal component analysis (PCA).
Five distance-based
Max distance (dmax).
Average distance (davg).
Distance to the mean vector (dmean).
1-nearest neighbor (1-NN).
Minimum spanning tree (MST).
A total of 13 benchmark novelty detectors
Pilsung Kang, Data Mining Laboratory, SNU
63
The Hybrid Novelty Score: Novelty detection performance
Pilsung Kang, Data Mining Laboratory, SNU
64
The Hybrid Novelty Score: Novelty detection performance
In low dimensions (Group A and Group B data sets)
The proposed hybrid score (dkhybrid) was outstanding.
Best for eight data sets out of ten.
Followed by MST-CD and HC.
In high dimensions (Group B and Group C data sets)
dkhybrid and MST-CD were superior to other novelty detectors.
When dimensionality is high, local topology becomes more important.
Best for four and three data sets out of 11.
In common
Gauss and PCA were generally inferior to other novelty detectors.
They are not able to produce an arbitrary shape of class boundary.
Pilsung Kang, Data Mining Laboratory, SNU
65
The Hybrid Novelty Score: Novelty detection performance
Execution time of all novelty detectors
Pilsung Kang, Data Mining Laboratory, SNU
66
The Hybrid Novelty Score: Summary
Distance & Local topology-based novelty detector
A hybrid novelty score combining distance and local topology is proposed.
Absolute measure: average distance to k nearest neighbors.
Relative measure: distance to the convex hull made by k nearest neighbors.
Able to overcome limitations of conventional nearest-neighbor-based novelty
detectors.
Performance evaluation
Improved conventional nearest-neighbor-based novelty detectors.
Outperformed other state-of-the-art novelty detectors for most cases.
Kept computational complexity low.
Pilsung Kang, Data Mining Laboratory, SNU
67
Table of Contents
Introduction: Instance-based Learning
Locally Linear Reconstruction for Classification & Regression
Learning
Algorithms
Distance & Local Topology-based Hybrid Score for Novelty Detection
Local Topology-based Seed Initialization for Clustering
Application I: Response Modeling
Real-world
Applications
Application II: Virtual Metrology
Application III: Keystroke Dynamics Analysis
Conclusion
Pilsung Kang, DataMining Laboratory, SNU
68
Clustering: Overview
Clustering
A data analysis tool that partitions the entire data set into some number of
meaningful subsets or groups, called clusters.
A good clustering algorithm results in a clustering structure where
the set of clusters is heterogeneous (clusters differ from one another), and
each cluster is homogeneous (instances within a cluster are similar).
Pilsung Kang, Data Mining Laboratory, SNU
69
Clustering: K-Means clustering
K-Means clustering
By far the most widely used clustering algorithm.
Finds K clusters by minimizing the within-cluster sum of squared errors.
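The objective itself is not reproduced in the transcript; the standard within-cluster sum of squared errors minimized by K-Means is:

```latex
J = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \lVert \mathbf{x}_i - \boldsymbol{\mu}_k \rVert_2^2,
\qquad \boldsymbol{\mu}_k = \frac{1}{|C_k|} \sum_{\mathbf{x}_i \in C_k} \mathbf{x}_i
```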
Benefits of K-Means clustering
Works well with any Lp norm.
Allows straightforward parallelization.
Does not depend on data ordering.
Pilsung Kang, Data Mining Laboratory, SNU
70
Clustering: K-Means clustering
Limitation of K-Means clustering
Clustering structure relies on the choice of initial seed.
Pilsung Kang, Data Mining Laboratory, SNU
71
Clustering: K-Means clustering seed initialization
Seed initialization approaches
Method  | Source                       | Description                                        | Limitation
R-Mean  | He et al. (2002)             | Simply adds Gaussian noise to the mean vector      | Randomness
SCS     | Tou and Gonzalez (1974)      | Considers absolute distance among seeds            | Randomness
KKZ     | Katsavounidis et al. (1994)  | Considers relative distance among seeds            | Does not consider sparsity
KR      | Kaufman and Rousseeuw (1990) | Considers distances from seeds to other instances  | Heavy computational cost
CCIA    | Khan and Ahmad (2004)        | Uses attribute information                         | Order dependency
Kd-tree | Redmond and Heneghan (2007)  | Uses density information                           | Parameter sensitivity
Pilsung Kang, Data Mining Laboratory, SNU
72
Clustering: K-Means clustering seed initialization
Three properties required for good seed
Centrality
An initial seed should be located in the middle of a cloud of instances.
Leads to quick convergence.
Sparsity
Any pair of seeds should be separated by a sparse region.
Helps find an optimal clustering structure.
Isotropy
Seeds should be located far from each other.
Helps find an optimal clustering structure.
Pilsung Kang, Data Mining Laboratory, SNU
73
Clustering: K-Means clustering seed initialization
Examples that lack some properties
Pilsung Kang, Data Mining Laboratory, SNU
74
CSI: Algorithm
The CSI algorithm
Step 1 (Centrality): Set the instances with zero d_c-hull as the seed candidates.
(Figure: an instance with d^4_c-hull = 0 vs. an instance with d^4_c-hull > 0.)
Pilsung Kang, Data Mining Laboratory, SNU
75
CSI: Algorithm
The CSI algorithm (cont’)
Step 2 (Isotropy): The geometric mean of the distances between a
candidate and the existing seeds is computed.
(Figure: a candidate with high isotropy vs. one with low isotropy.)
Pilsung Kang, Data Mining Laboratory, SNU
76
CSI: Algorithm
The CSI algorithm (cont’)
Step 3 (Sparsity): The minimum radius of the empty ball is computed.
(Figure: empty-ball radii d_r and d_R around two candidates.)
Pilsung Kang, Data Mining Laboratory, SNU
77
CSI: Algorithm
The CSI algorithm (cont’)
Step 4 (Seed score): Compute the score based on sparsity and isotropy.
Pilsung Kang, Data Mining Laboratory, SNU
78
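An illustrative sketch of a CSI-style initialization built from the steps above: Step 1 keeps candidates with zero convex-hull distance to their own neighbors, and the remaining seeds are then chosen greedily by the isotropy measure. The sparsity measure and the exact seed-score formula are not reproduced here, and the choice of the first seed is an assumption:

```python
# CSI-like seed initialization sketch (centrality + isotropy only).
import numpy as np
from scipy.optimize import minimize

def chull_dist(x, Xnn):
    """Distance from x to the convex hull of the neighbor set Xnn (k x d)."""
    k = len(Xnn)
    obj = lambda w: 0.5 * np.sum((x - Xnn.T @ w) ** 2)
    res = minimize(obj, np.full(k, 1.0 / k), bounds=[(0.0, None)] * k,
                   constraints=({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},),
                   method='SLSQP')
    return np.sqrt(2.0 * res.fun)

def csi_like_seeds(X, K, k=5):
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    # Step 1 (centrality): keep instances lying inside the convex hull of their neighbors.
    cand = [i for i in range(len(X))
            if chull_dist(X[i], X[np.argsort(D[i])[1:k + 1]]) < 1e-3]
    seeds = [cand[0]]                         # first seed: arbitrary candidate (assumption)
    while len(seeds) < K:
        # Step 2 (isotropy): geometric mean of distances to already chosen seeds.
        iso = [np.exp(np.mean(np.log(D[i, seeds] + 1e-12))) for i in cand]
        seeds.append(cand[int(np.argmax(iso))])
    return X[seeds]

X = np.vstack([np.random.randn(40, 2) + c for c in ([0, 0], [6, 6], [0, 6])])
print(csi_like_seeds(X, K=3))
```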
CSI: Algorithm
Pilsung Kang, Data Mining Laboratory, SNU
79
CSI: Algorithm
Pilsung Kang, Data Mining Laboratory, SNU
80
CSI: Performance evaluation
Benchmark methods
Method  | Source
R-MEAN  | He et al. (2002)
SCS     | Tou and Gonzalez (1974)
KKZ     | Katsavounidis et al. (1994)
KR      | Kaufman and Rousseeuw (1990)
CCIA    | Khan and Ahmad (2004)
Kd-tree | Redmond and Heneghan (2007)
Data sets
Data         | N Ins. | N Att. | Class
Synthetic    | 300    | 2      | 4
Iris         | 150    | 4      | 3
Abalone      | 1,854  | 8      | 5
Handdigits   | 2,000  | 76     | 10
Segmentation | 2,100  | 19     | 7
Letter       | 3,878  | 16     | 5
Satellite    | 4,435  | 36     | 5
Pilsung Kang, Data Mining Laboratory, SNU
81
CSI: Performance evaluation
Performance measures
Sum of squared error
SD validity index
Class accuracy
Pilsung Kang, Data Mining Laboratory, SNU
82
CSI: Clustering results
Sum of squared error (SSE)
The lower, the better.
CSI performed best for six out of seven data sets.
KR, CCIA, kd-tree performed best for two data sets.
Data set     | K  | R-SEL   | RMEAN   | SCS     | KKZ     | KR      | CCIA    | Kd-tree | CSI
Synthetic    | 4  | 1,250   | 1,239   | 1,204   | 1,249   | 1,240   | 1,243   | 1,217   | 1,154
Iris         | 3  | 172     | 161     | 161     | 140     | 140     | 140     | 140     | 140
Abalone      | 5  | 3,580   | 3,517   | 3,660   | 3,660   | 3,671   | 3,640   | 3,614   | 3,409
Handdigits   | 10 | 108,145 | 108,218 | 108,272 | 108,541 | 107,056 | 107,086 | 107,048 | 107,135
Segmentation | 7  | 14,131  | 13,177  | 15,013  | 19,031  | 12,561  | 11,841  | 18,602  | 11,841
Letter       | 5  | 35,922  | 36,426  | 35,776  | 36,312  | 34,764  | 34,841  | 36,012  | 34,744
Satellite    | 6  | 37,352  | 34,136  | 40,927  | 39,899  | 34,136  | 38,477  | 37,568  | 34,136
Pilsung Kang, Data Mining Laboratory, SNU
83
CSI: Clustering results
Sum of squared error (SSE)
(Figure: comparison of SSE across the seed initialization methods for the seven data sets.)
Pilsung Kang, Data Mining Laboratory, SNU
84
CSI: Clustering results
SD validity index
The lower, the better.
CSI performed best for six out of seven data sets.
CCIA and kd-tree performed best for two data sets.
Data set     | K  | R-SEL | RMEAN | SCS    | KKZ    | KR     | CCIA  | Kd-tree | CSI
Synthetic    | 4  | 0.181 | 0.175 | 0.159  | 0.203  | 0.197  | 0.187 | 0.145   | 0.116
Iris         | 3  | 0.561 | 0.557 | 0.557  | 0.550  | 0.550  | 0.550 | 0.550   | 0.550
Abalone      | 5  | 0.545 | 0.536 | 14.290 | 14.290 | 14.238 | 0.544 | 0.539   | 0.518
Handdigits   | 10 | 0.839 | 0.837 | 0.841  | 0.847  | 0.733  | 0.821 | 0.742   | 0.732
Segmentation | 7  | 3.152 | 2.968 | 3.831  | 15.167 | 2.702  | 2.700 | 2.754   | 2.741
Letter       | 5  | 0.714 | 0.719 | 0.703  | 0.724  | 0.668  | 0.701 | 0.684   | 0.659
Satellite    | 6  | 0.319 | 0.282 | 0.369  | 0.350  | 0.284  | 0.291 | 0.282   | 0.282
Pilsung Kang, Data Mining Laboratory, SNU
85
CSI: Clustering results
SD validity index
(Figure: comparison of the SD validity index across the seed initialization methods for the seven data sets.)
Pilsung Kang, Data Mining Laboratory, SNU
86
CSI: Clustering results
Class accuracy
The higher, the better.
CSI performed best for six out of seven data sets.
KKZ, KR, CCIA, and kd-tree performed best for two data sets.
Data set     | K  | R-SEL | RMEAN | SCS   | KKZ   | KR    | CCIA  | Kd-tree | CSI
Synthetic    | 4  | 0.691 | 0.690 | 0.690 | 0.691 | 0.690 | 0.690 | 0.690   | 0.690
Iris         | 3  | 0.730 | 0.764 | 0.764 | 0.833 | 0.833 | 0.833 | 0.833   | 0.833
Abalone      | 5  | 0.468 | 0.485 | 0.502 | 0.502 | 0.447 | 0.500 | 0.503   | 0.513
Handdigits   | 10 | 0.508 | 0.505 | 0.515 | 0.474 | 0.533 | 0.541 | 0.528   | 0.563
Segmentation | 7  | 0.584 | 0.562 | 0.415 | 0.302 | 0.592 | 0.589 | 0.596   | 0.608
Letter       | 5  | 0.546 | 0.518 | 0.554 | 0.510 | 0.615 | 0.618 | 0.604   | 0.624
Satellite    | 6  | 0.705 | 0.746 | 0.626 | 0.647 | 0.746 | 0.746 | 0.746   | 0.746
Pilsung Kang, Data Mining Laboratory, SNU
87
CSI: Clustering results
Class accuracy
(Figure: comparison of class accuracy across the seed initialization methods for the seven data sets.)
Pilsung Kang, Data Mining Laboratory, SNU
88
CSI: Clustering results
Computational time
RMEAN was the fastest, while KR was the slowest.
CSI was comparable with SCS, KKZ, CCIA, and kd-tree.
(Figure: computation time of each seed initialization method on the seven data sets, log scale.)
Pilsung Kang, Data Mining Laboratory, SNU
89
CSI: Clustering results
Clustering iterations
KR and CSI converged faster than others in general.
The convergence speeds of R-Sel, SCS, and KKZ were slower.
(Figure: number of clustering iterations to convergence for each seed initialization method on the seven data sets.)
Pilsung Kang, Data Mining Laboratory, SNU
90
CSI: Summary
A new seed initialization algorithm (CSI) for K-Means clustering
A new seed initialization algorithm (CSI) for K-Means clustering is proposed.
Three properties (centrality, sparsity, isotropy) are identified and
accommodated.
Similarity and local topology are taken into account.
Performance evaluation
Able to find an optimal clustering structure.
Led to quick convergence.
Pilsung Kang, Data Mining Laboratory, SNU
91
Table of Contents
Introduction: Instance-based Learning
Locally Linear Reconstruction for Classification & Regression
Learning
Algorithms
Distance & Local Topology-based Hybrid Score for Novelty Detection
Local Topology-based Seed Initialization for Clustering
Application I: Response Modeling
Real-world
Applications
Application II: Virtual Metrology
Application III: Keystroke Dynamics Analysis
Conclusion
Pilsung Kang, DataMining Laboratory, SNU
92
Application I: Response modeling
Response modeling
Identifies customers who are likely to purchase a product, based on their
purchase history and other information.
Firms attempt to induce high-potential buyers to purchase the campaigned
product through their communication channels, e.g., phone, catalog, or e-mail.
(Figure: demographic information — age, sex, job, … — and behavioral information — recency, frequency, monetary, … — are used to target potential customers for a product.)
Pilsung Kang, Data Mining Laboratory, SNU
93
Application I: Response modeling
A well-developed response model can
Increase total revenue.
Lower total marketing cost.
(Figure: a well-developed response model leads to higher revenues and lowered costs.)
Pilsung Kang, Data Mining Laboratory, SNU
94
Application I: Response modeling
Increasing the response rate is not an easy task, but its impact is considerable.
Even a small increase in the response rate can
Change the total result of a direct mailing campaign from failure to success
(Baesens et al., 2002).
Boost the total revenue and raise the revenue per respondent significantly (Knott
et al., 2002).
Not only increase profit, but also strengthen customer loyalty (Sun et al., 2006).
Pilsung Kang, Data Mining Laboratory, SNU
95
Response modeling: Approaches
Statistics
Logistic regression (Aaker et al., 2001; Hosmer and Lemeshow, 1989).
Stochastic RFM (Colombo and Jiang, 1999).
Hazard function (Gonul et al., 2000).
Pattern recognition and data mining
Artificial neural networks (Baesens et al., 2002; Kaefer et al., 2005).
Bagging artificial neural networks (Ha et al., 2005).
Bayesian neural networks (Baesens et al., 2002).
Support vector machines (Shin and Cho, 2006).
Decision trees (Coenen et al., 2000).
Pilsung Kang, Data Mining Laboratory, SNU
96
Response modeling: Class imbalance
Class imbalance in response modeling
Non-respondents overwhelmingly outnumber respondents.
9.4% are respondents in DMEF4 data set (Shin and Cho, 2006).
Only 6% are respondents in CoIL Challenge 2000 data set (Putten et al., 2000).
Response rates in general direct marketing situations are often much lower.
(Figure: respondents vs. non-respondents.)
Pilsung Kang, Data Mining Laboratory, SNU
97
Response modeling: Class imbalance
Approaches to deal with class imbalance
Algorithm modification
Cost differentiation.
Boundary alignment.
Data balancing
Under-sampling.
Over-sampling.
Data balancing methods are more universal in that they can be combined with
any prediction models.
Pilsung Kang, Data Mining Laboratory, SNU
98
Response modeling: Data balancing methods
Under-sampling based methods
Reduces the number of majority class instances while keeping all the minority
class instances.
Effective in reducing training time.
Often distorts the class distribution (sampling bias).
Random under-sampling, SHRINK (Kubat et al., 1997), one-sided selection
(OSS) (Kubat and Matwin, 1997).
(Figure: under-sampling the non-respondents while keeping all respondents.)
Pilsung Kang, Data Mining Laboratory, SNU
99
Response modeling: Data balancing methods
Over-sampling based methods
Increases the number of minority class instances while keeping all the majority
class instances.
Preserves the original data distribution.
Increases training time.
Random over-sampling, SMOTE (Chawla et al., 2002), SMOTEBoost (Chawla et al., 2003).
(Figure: over-sampling the respondents while keeping all non-respondents.)
Pilsung Kang, Data Mining Laboratory, SNU
100
Response modeling: Proposal
A new data balancing method based on clustering, under-sampling, and
ensemble (CUE).
Clustering-based under-sampling
Eliminate the sampling bias.
Reduce the performance variation.
K-Means clustering with CSI seed initialization is employed.
Ensemble-based prediction model
Boost the response rate predictive accuracy.
LLR classification is employed.
Pilsung Kang, Data Mining Laboratory, SNU
101
Response modeling: CUE procedure
Step 1: Separate the customer data into respondents and non-respondents.
Step 2: Divide the non-respondents using clustering (clusters 1, 2, …, K).
Step 3: Construct multiple training sets by combining the respondents with the sampled non-respondents from each cluster.
Step 4: Train a prediction model on each corresponding training set (prediction models 1, 2, …, N).
Step 5: Make a prediction ("Is he/she going to respond?") by aggregating the prediction results.
(Figure: the customer data are split into respondents and non-respondents; the non-respondents are partitioned into clusters; each training set pairs all respondents with the non-respondents sampled from one cluster; the N trained models are aggregated for the final prediction.)
Pilsung Kang, Data Mining Laboratory, SNU
104
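A minimal sketch of the CUE procedure using scikit-learn's KMeans and logistic regression as stand-ins; the original work uses CSI-initialized K-Means in Step 2 and LLR classification in Step 4:

```python
# CUE sketch: clustering-based under-sampling plus an ensemble of predictors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def cue_fit(X, y, n_clusters=5, random_state=0):
    """Steps 1-4: split classes, cluster non-respondents, build one balanced
    training set per cluster, and train one model per training set."""
    X_pos, X_neg = X[y == 1], X[y == 0]            # respondents / non-respondents
    labels = KMeans(n_clusters=n_clusters, random_state=random_state,
                    n_init=10).fit_predict(X_neg)
    models, rng = [], np.random.default_rng(random_state)
    for c in range(n_clusters):
        X_c = X_neg[labels == c]
        idx = rng.choice(len(X_c), size=min(len(X_c), len(X_pos)), replace=False)
        X_train = np.vstack([X_pos, X_c[idx]])
        y_train = np.hstack([np.ones(len(X_pos)), np.zeros(len(idx))])
        models.append(LogisticRegression(max_iter=1000).fit(X_train, y_train))
    return models

def cue_predict(models, X):
    """Step 5: aggregate member predictions (average of predicted probabilities)."""
    p = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (p >= 0.5).astype(int), p

X = np.random.randn(1000, 5)
y = (np.random.rand(1000) < 0.1).astype(int)       # ~10% respondents (imbalanced)
models = cue_fit(X, y, n_clusters=4)
print(cue_predict(models, X[:5]))
```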
Response modeling: Implementation
Data sets
CoIL Challenge 2000
Provided by the Dutch data mining company “Sentient Machine Research” for
a data mining competition.
To predict which customers are potentially interested in a caravan insurance
policy.
85 explanatory variables: 42 product usage variables, 43 socio-demographic
variables.
Training set: 348 (5.98%) respondents out of 5,822 customers.
Test set: 238 (5.95%) respondents out of 4,000 customers.
Pilsung Kang, Data Mining Laboratory, SNU
105
Response modeling: Implementation
Data sets
DMEF4
Provided by the “Direct Marketing Education Foundation” for research purposes.
To discover customers who purchased the product in the test period based on
demographic and historic purchase information in the reference period.
15 variables are selected from 91 explanatory variables.
101,532 customers with 9.4% of respondents.
Pilsung Kang, Data Mining Laboratory, SNU
106
Response modeling: Implementation
Benchmark data balancing methods
No-sampling (NS)
Under-sampling
Random under sampling (RUS).
One-sided selection (OSS).
Over-sampling
Random over-sampling (ROS).
Synthetic minority over-sampling technique (SMOTE).
Pilsung Kang, Data Mining Laboratory, SNU
107
Response modeling: Implementation
Classification models
Logistic regression (LR)
Multi-layer perceptron (MLP)
k-nearest neighbor classification with locally linear reconstruction (k-NN)
Support vector machine (SVM)
Performance measures
Balanced correction rate (BCR)
Geometric mean of majority class accuracy and minority class accuracy.
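Written out (the slide states it in words only):

```latex
\mathrm{BCR} = \sqrt{\mathrm{Acc}_{\mathrm{majority}} \times \mathrm{Acc}_{\mathrm{minority}}}
```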
Pilsung Kang, Data Mining Laboratory, SNU
108
Response modeling: CoIL Challenge 2000 results
Prediction accuracy
Pilsung Kang, Data Mining Laboratory, SNU
109
Response modeling: CoIL Challenge 2000 results
BCR improvement and variation reduction of CUE over RUS, ROS, and
SMOTE (%)
BCR improved the most with MLP.
Performance variation reduction was significant.
Pilsung Kang, Data Mining Laboratory, SNU
110
Response modeling: CoIL Challenge 2000 results
True response rate (TRR) vs. true non-response rate (TNR)
Pilsung Kang, Data Mining Laboratory, SNU
111
Response modeling: CoIL Challenge 2000 results
Lift charts
Pilsung Kang, Data Mining Laboratory, SNU
112
Response modeling: DMEF4 results
Prediction accuracy
Pilsung Kang, Data Mining Laboratory, SNU
113
Response modeling: DMEF4 results
BCR improvement and variation reduction of CUE over RUS, ROS, and
SMOTE (%)
BCR improved the most with MLP.
Performance variation reduction was significant for MLP and k-NN.
Pilsung Kang, Data Mining Laboratory, SNU
114
Response modeling: DMEF4 results
True response rate (TRR) vs. true non-response rate (TNR)
Pilsung Kang, Data Mining Laboratory, SNU
115
Response modeling: DMEF4 results
Lift charts
Pilsung Kang, Data Mining Laboratory, SNU
116
Response modeling: DMEF4 results
Total profit with various marketing costs
Pilsung Kang, Data Mining Laboratory, SNU
117
Response modeling: DMEF4 results
Total revenues and costs with various marketing costs (LR and MLP)
Pilsung Kang, Data Mining Laboratory, SNU
118
Response modeling: DMEF4 results
Total revenues and costs with various marketing costs (k-NN and SVM)
Pilsung Kang, Data Mining Laboratory, SNU
119
Response modeling: Summary
CUE for response modeling
To deal with class imbalance.
A new data balancing method based on clustering, under-sampling, and ensemble
was proposed to boost response rate predictive accuracy and lower performance
variation.
CSI algorithm was implemented in clustering step, while LLR classification was
implemented in prediction step.
Performance evaluation
CUE improved prediction accuracy while keeping the variance low.
LLR classification was the most accurate and profitable prediction model.
Pilsung Kang, Data Mining Laboratory, SNU
120
Table of Contents
Introduction: Instance-based Learning
Locally Linear Reconstruction for Classification & Regression
Learning
Algorithms
Distance & Local Topology-based Hybrid Score for Novelty Detection
Local Topology-based Seed Initialization for Clustering
Application I: Response Modeling
Real-world
Applications
Application II: Virtual Metrology
Application III: Keystroke Dynamics Analysis
Conclusion
Pilsung Kang, DataMining Laboratory, SNU
121
Application II: Virtual metrology
Virtual metrology
Predict, rather than actually measure, the metrological values using sensor data from
production equipment and actual metrological values of sampled wafers.
Pilsung Kang, Data Mining Laboratory, SNU
122
Application II: Virtual metrology
A well-developed virtual metrology system
Enhance the final yield by managing scrapped wafers appropriately.
Enable predictive maintenance based on real-time forecasts of metrological
data.
Detect process drifts in a timely manner.
Enable run-to-run (R2R) process control.
Reduce the cost and time required for actual metrology.
Pilsung Kang, Data Mining Laboratory, SNU
123
Application II: Virtual metrology
Practical issues on virtual metrology
A large number of input variables
A very large number of sensor parameters are activated.
Curse of dimensionality.
A limited number of wafers available
Only a few wafers are actually measured.
Not enough training instances.
Non-stationary process
Process environments may change.
Frequent model update is mandatory.
Pilsung Kang, Data Mining Laboratory, SNU
124
Application II: Proposal
A new virtual metrology system
A large number of input variables
Reduced by dimensionality reduction techniques.
A limited number of wafers available
Employ an instance-based learning algorithm.
k-NN regression with LLR.
Non-stationary process
Adopt prediction algorithms that are suitable for incremental learning.
k-NN regression with LLR.
Pilsung Kang, Data Mining Laboratory, SNU
125
Virtual Metrology: Process
Overlay in photolithography
The lateral positioning between layers comprising integrated circuits.
Overlay misalignment.
(Figure: the ideal situation vs. overlay misalignment in practice, illustrated with four stacked layers.)
Pilsung Kang, Data Mining Laboratory, SNU
126
Virtual Metrology: Data
Data description
Collected from two chucks for eight months.
1,612 wafers from chuck 1, 1,563 wafers from chuck 2.
Sensor parameters
37 sensor parameters.
4 summary statistics (mean, standard deviation, max, min) are recorded.
A total of 148 variables exist.
Target metrology variables
Eight variables regarding overlay misalignment.
4 variables (Y1, Y2, Y3, Y4) have more impact on productivity than others.
Pilsung Kang, Data Mining Laboratory, SNU
127
Virtual Metrology: Model update
Moving window scheme
(Figure: a model is built on a window of earlier months and tested on the following period; the window then slides forward from Jan. through Aug.)
Pilsung Kang, Data Mining Laboratory, SNU
128
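A small sketch of the moving-window update; the window length is an assumption, since the figure only indicates that training months slide forward and the following period is tested:

```python
# Moving-window model update: each window of months trains a model,
# which is then tested on the next month.
from typing import List, Tuple

def moving_windows(periods: List[str], train_size: int) -> List[Tuple[List[str], str]]:
    """Return (training periods, test period) pairs for a sliding window."""
    return [(periods[i:i + train_size], periods[i + train_size])
            for i in range(len(periods) - train_size)]

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug"]
for train, test in moving_windows(months, train_size=3):
    print(train, "->", test)
```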
Virtual Metrology: Model update
The number of wafers used for each period
(Table: the number of wafers used in each period for chuck 1 and chuck 2.)
Pilsung Kang, Data Mining Laboratory, SNU
129
Virtual Metrology: Dimensionality reduction
Variable selection
Select a subset of important input variables for the prediction model.
Stepwise selection with linear regression (Stepwise LR).
Genetic algorithm with linear regression (GA-LR).
Genetic algorithm with support vector regression (GA-SVR).
Variable extraction
Construct a reduced set of input variables by transforming the original variables.
Principal component analysis (PCA).
Kernel principal component analysis (KPCA).
Pilsung Kang, Data Mining Laboratory, SNU
130
Virtual Metrology: Prediction model
A total of four regression models are employed
Linear regression
k-NN regression with LLR
Pilsung Kang, Data Mining Laboratory, SNU
Multi-layer perceptron (MLP)
Support vector regression (SVR)
131
Virtual Metrology: Performance measures
Mean squared error (MSE)
How well a prediction model fits the relation between input variables and
targets.
Mean absolute specification error (MASE)
How closely the model predicts the target with regard to its tolerance.
Pilsung Kang, Data Mining Laboratory, SNU
132
Virtual Metrology: Dimensionality reduction results
The number of selected input variables
Reduced the total number of variables to between 21.5% (stepwise LR) and 42.7%
(GA-SVR) of the original.
Pilsung Kang, Data Mining Laboratory, SNU
133
Virtual Metrology: Dimensionality reduction results
The number of extracted input variables
Only 14 variables explain 50% of the variance of the original input data.
Two-thirds of the variables explain 99% of the variance, implying that many
variables are highly correlated with each other.
Pilsung Kang, Data Mining Laboratory, SNU
134
Virtual Metrology: Prediction results
Best VM model for each metrology measurement
k-NN with LLR resulted in the best model for seven cases, followed by MLP with six
cases.
Pilsung Kang, Data Mining Laboratory, SNU
135
Virtual Metrology: Prediction results
Prediction results (MSE, Y1~Y4)
Pilsung Kang, Data Mining Laboratory, SNU
136
Virtual Metrology: Prediction results
Prediction results (MSE, Y5~Y8)
Pilsung Kang, Data Mining Laboratory, SNU
137
Virtual Metrology: Prediction results
Prediction results (MASE, Y1~Y4)
Pilsung Kang, Data Mining Laboratory, SNU
138
Virtual Metrology: Prediction results
Prediction results (MASE, Y5~Y8)
Pilsung Kang, Data Mining Laboratory, SNU
139
Virtual Metrology: Prediction results
Parameter sensitivity of MLP and k-NN with LLR
MLP is very sensitive to its number of hidden nodes.
k-NN with LLR is robust to the number of nearest neighbors.
Pilsung Kang, Data Mining Laboratory, SNU
140
Virtual Metrology: Prediction results
Prediction example
Y1, k-NN with LLR, trained based on five months (Mar. to Jul.)
Pilsung Kang, Data Mining Laboratory, SNU
141
Virtual Metrology: Summary
k-NN regression with LLR for virtual metrology
Virtual metrology predicts metrological values based on available production
information.
The small number of training wafers and parameter sensitivity are both handled.
Performance evaluation
k-NN with LLR resulted in the best prediction model.
Its parameter sensitivity is much lower than other learning algorithms.
Pilsung Kang, Data Mining Laboratory, SNU
142
Table of Contents
Introduction: Instance-based Learning
Locally Linear Reconstruction for Classification & Regression
Learning
Algorithms
Distance & Local Topology-based Hybrid Score for Novelty Detection
Local Topology-based Seed Initialization for Clustering
Application I: Response Modeling
Real-world
Applications
Application II: Virtual Metrology
Application III: Keystroke Dynamics Analysis
Conclusion
Pilsung Kang, DataMining Laboratory, SNU
143
Application III: Keystroke dynamics analysis
Keystroke dynamics
The way that a person types a string of characters.
(Figure: key-press durations and inter-key intervals while typing "ABCD", yielding the timing vector [40, 50, 30, 35, 40, -15, 40] in ms.)
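A small sketch of how duration and interval features could be extracted from key press/release events; the event timestamps and the interleaved feature ordering below are assumptions:

```python
# Build a keystroke timing vector from (key, press, release) events in ms.
# A negative interval means the next key was pressed before the previous one was released.
def keystroke_features(events):
    durations = [rel - prs for _, prs, rel in events]
    intervals = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    feats = []
    for i, d in enumerate(durations):          # interleave: D1, I1, D2, I2, ..., Dn
        feats.append(d)
        if i < len(intervals):
            feats.append(intervals[i])
    return feats

events = [("A", 0, 40), ("B", 90, 140), ("C", 170, 205), ("D", 245, 285)]
print(keystroke_features(events))
```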
Password-based user authentication
Most commonly used authentication system.
Easy to develop, operate, maintain, and cost-efficient.
Vulnerable when the password is leaked.
Pilsung Kang, Data Mining Laboratory, SNU
144
Application III: Keystroke dynamics analysis
Keystroke dynamics-based user authentication
Use one’s keystroke typing behavior as well as password.
Strengthen the security of one’s account.
Pilsung Kang, Data Mining Laboratory, SNU
145
Application III: Keystroke dynamics analysis
Practical issues on keystroke dynamics-based authentication
Data availability
Only a valid user’s keystroke typing patterns are available.
Classification models are not appropriate.
Concept drift
One’s typing behavior can change.
Frequent model update is mandatory.
Pilsung Kang, Data Mining Laboratory, SNU
146
Application III: Proposal
Keystroke dynamics-based user authentication (KDA)
Data availability
Classification models are not appropriate.
Novelty detection problem is formulated.
Concept drift
Frequent model update is mandatory.
The distance & local topology based hybrid novelty score (dhybrid) is employed.
Pilsung Kang, Data Mining Laboratory, SNU
147
KDA: Data
Group A data sets
Collected in 1996~1998 from a SUN workstation.
21 users were involved, whose numbers of training typing patterns vary from 76 to 388.
15 impostors were recruited.
75 valid & 75 impostor test patterns.
Pilsung Kang, Data Mining Laboratory, SNU
148
KDA: Data
Group B data sets
Collected in 2005 from the subjects’ own PCs.
25 users were involved, each with 30 training typing patterns.
Users changed their roles.
24 valid & 24 impostor test patterns.
Pilsung Kang, Data Mining Laboratory, SNU
149
KDA: Authenticators
Benchmark novelty detectors
Three density-based
Gaussian density estimation (Gauss).
Mixture of Gaussians (MoG).
Parzen window density estimator (Parzen).
One support vector-based
One class support vector machine (1-SVM)
Three clustering-based
K-Means clustering (KMC)
K-Center clustering (KCC)
Average linkage-based hierarchical clustering (HC)
Pilsung Kang, Data Mining Laboratory, SNU
150
KDA: Authenticators
Benchmark novelty detectors (cont’)
One dimensionality reduction-based
Principal component analysis (PCA)
Five distance-based
Max distance (dmax)
Average distance (davg)
Distance to the mean vector (dmean)
1-nearest neighbor (1-NN)
Minimum spanning tree (MST)
A total of 13 benchmark novelty detectors
Pilsung Kang, Data Mining Laboratory, SNU
151
KDA: Incremental learning and performance measures
Incremental learning
Test patterns are evaluated individually and independently (one at a time).
The model is updated according to the test pattern’s prediction results, not its
actual label.
All valid test patterns and 10 randomly selected impostor patterns were used.
Performance measure: Integrated error
Pilsung Kang, Data Mining Laboratory, SNU
152
KDA: Authentication results
Authentication performance for Group A users
Pilsung Kang, Data Mining Laboratory, SNU
153
KDA: Authentication results
Improvement over davg and 1-SVM for Group A users
Pilsung Kang, Data Mining Laboratory, SNU
154
KDA: Authentication results
Authentication performance for Group B users
Pilsung Kang, Data Mining Laboratory, SNU
155
KDA: Authentication results
Improvement over davg and 1-SVM for Group B users
Pilsung Kang, Data Mining Laboratory, SNU
156
KDA: Authentication results
Total computation time for each group
When the number of training patterns was large, 1-SVM, PCA, and dhybrid were
efficient; PCA and dhybrid remained efficient as the number of typing patterns
increased.
dhybrid was able to achieve both high detection performance and efficiency.
Pilsung Kang, Data Mining Laboratory, SNU
157
KDA: Summary
The distance & local topology based hybrid novelty score (dhybrid) for
keystroke dynamics-based user authentication
Keystroke dynamics-based user authentication utilizes one’s keyboard typing
behavior for authentication.
Only valid users’ typing patterns are available and frequent model update is
required.
The distance & local topology based hybrid novelty score (dhybrid) is employed.
Performance evaluation
dhybrid resulted in the best novelty detection performance for both groups.
It can be efficiently adapted to incremental learning environments.
Pilsung Kang, Data Mining Laboratory, SNU
158
Table of Contents
Introduction: Instance-based Learning
Locally Linear Reconstruction for Classification & Regression
Learning
Algorithms
Distance & Local Topology-based Hybrid Score for Novelty Detection
Local Topology-based Seed Initialization for Clustering
Application I: Response Modeling
Real-world
Applications
Application II: Virtual Metrology
Application III: Keystroke Dynamics Analysis
Conclusion
Pilsung Kang, DataMining Laboratory, SNU
159
What Has Been Done
Locally linear reconstruction (LLR) for classification and regression
A systematic weight allocation method based on local topology.
Able to identify important neighbors for the prediction.
Assigns the appropriate weights for the important neighbors.
Distance & local topology based hybrid score (dhybrid) for novelty detection
Takes both absolute and relative similarity into account.
Absolute similarity is associated with the average distance to one's neighbors.
Relative similarity is associated with the local topology among one's neighbors.
Able to overcome the limitations of conventional nearest-neighbor-based novelty detectors.
Outperformed other popular novelty detectors.
Pilsung Kang, Data Mining Laboratory, SNU
160
What Has Been Done
A new seed initialization algorithm based on centrality, sparsity, and
isotropy (CSI), for clustering
Three properties associated with inter- or intra-cluster variance are identified.
Relative similarity and local topology are used for measuring these properties.
Able to lead the K-Means clustering algorithm to an optimal clustering structure
rapidly.
Pilsung Kang, Data Mining Laboratory, SNU
161
What Has Been Done
Application to response modeling
LLR classification and CSI were employed for handling class imbalance.
Improved response rate and reduced performance variation.
Application to virtual metrology
LLR regression was employed for handling parameter sensitivity and model
update.
Both goals are successfully achieved.
Application to keystroke dynamics analysis
The hybrid novelty score (dhybrid) was employed for handling data availability
and model update.
Efficient and accurate authenticator could be built.
Pilsung Kang, Data Mining Laboratory, SNU
162
What Should Be Done
In learning theory
Similarity measures
Non-numeric attributes
Optimizing the number of clusters
Scalability
Concept drift environment
Response modeling
Combining algorithm modification and data balancing method, uplift modeling
Virtual metrology
Integration with R2R control systems
Keystroke dynamics analysis
Long-free text based authentication, account sharing
Pilsung Kang, Data Mining Laboratory, SNU
163
Q&A
Pilsung Kang, Data Mining Laboratory, SNU
164